Understanding Syntax and Conceptual Text Modelling – A Journey

I am not certain that it will be possible to reliably create BPMN from text automatically, but it will be fun trying.  This task will require prior object knowledge; knowledge of object properties, actions and responses; understanding of the meaning of words (a dictionary, or as it is known in this trade, a lexicon); access to ontologies (for representing how things are related and for logically deriving further knowledge); a model of how words relate to one another (linguistic theories); a system (or systems) for word sense disambiguation; a mechanism for classifying words and sentence types into parts of speech; and a mechanism for classifying, or better still viewing, objects in a real-world context in relation to other objects and the environment.  Thinking about the basics of word understanding, you need a visual spatial context and a sense of number to appreciate “this”, “that” and “those” before understanding the spatially abstract “the”.  As far as real-world object representation is concerned, I think you could integrate and dynamically build or load a view of an object in a virtual 3D space using Web3D (see http://www.web3d.org/standards).

I have started building a UML model of the software components necessary to achieve this task.  After studying some of the work of Senseval I can see that there is no one-size-fits-all solution to word sense disambiguation.  I therefore think it makes sense to implement multiple solutions and associate the most applicable with particular words (this association could be made automatically against a marked-up corpus).  A natural implementation will be to use a service locator to find the most relevant word sense disambiguation provider, with each solution implemented via a provider interface.
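
As a rough illustration of what I have in mind, the provider interface and service locator might look something like the sketch below.  The class and method names are hypothetical placeholders of my own, not from any existing library.

    import java.util.ArrayList;
    import java.util.List;

    /** Provider interface: each implementation wraps one word sense disambiguation strategy. */
    interface WsdProvider {
        /** Returns true if this provider is considered applicable to the given word. */
        boolean supports(String word);
        /** Returns an identifier for the chosen sense of the word in its sentence context. */
        String disambiguate(String word, String sentence);
    }

    /** Simple service locator that returns the first registered provider that claims the word. */
    class WsdServiceLocator {
        private final List<WsdProvider> providers = new ArrayList<>();

        void register(WsdProvider provider) {
            providers.add(provider);
        }

        WsdProvider locate(String word) {
            for (WsdProvider p : providers) {
                if (p.supports(word)) {
                    return p;
                }
            }
            throw new IllegalStateException("No WSD provider registered for: " + word);
        }
    }

The supports() check is the simplest possible selection rule; the word-to-provider association learned from a marked-up corpus could later replace it.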

As previously described, the dynamic creation of BPMN in part involves the classification of textual information into the following categories (a code sketch of these categories follows the list):

  1. Activity
    Activities will be associated with verbs and represent processes.  Processes can be associated with additional information such as set-up time, minimum or maximum batch size, a processing rate, and pre- and post-process queues of a defined capacity.
  2. Entity
    Entities are the things or information that get transformed by processes and travel through a process model.  When an entity is transformed by a process it could be renamed, e.g. fleece to yarn in wool processing.
  3. Resource
    Resources are additional things that are needed to support the processing of entities.
  4. Event
    Events are things that happen and are created by a trigger.  They may pass information and cause an action.  Events can elicit a response and be either synchronous or asynchronous.
  5. Actor
    Actors are the sources of system inputs and destinations for outputs or the source or destination of external events. Actors can be the source or destination of “entities”.
  6. Goal
    Goals are difficult to define but are likely to be identified by the fact that they involve systems that create added value.
  7. System
    A system is a group of things that has a definable boundary and probably has a goal.
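
As mentioned above, here is a minimal sketch of how these categories might be carried through the classification code.  The type names ProcessCategory and ClassifiedSpan are illustrative only, not part of any existing standard or library.

    /** Hypothetical representation of the seven categories as a classification target. */
    enum ProcessCategory {
        ACTIVITY,   // verbs / processes, possibly with set-up time, batch size, rate, queues
        ENTITY,     // the things transformed by processes, e.g. fleece -> yarn
        RESOURCE,   // additional things needed to support processing
        EVENT,      // things that happen, created by a trigger, synchronous or asynchronous
        ACTOR,      // sources of inputs and destinations of outputs or external events
        GOAL,       // identified where a system creates added value
        SYSTEM      // a bounded group of things, probably with a goal
    }

    /** A span of text together with the category a classifier assigned to it. */
    class ClassifiedSpan {
        final String text;
        final ProcessCategory category;
        final double confidence;

        ClassifiedSpan(String text, ProcessCategory category, double confidence) {
            this.text = text;
            this.category = category;
            this.confidence = confidence;
        }
    }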

Looking at these categories, I can see they are a subset of the data you can find described in schema.org.  I have been thinking that the schema.org XML schema might be a better initial target mapping than BPMN.

An obvious implementation for this problem would be a deep learning classification engine.  Before this can be considered, I need a better understanding of word and sentence meaning (semantics, pragmatics and conceptual meaning).

There are multiple theories of grammar.  I started with generative grammar and am now reading about dependency grammar.  I have again hit the frustration of not being able to read references because I am not a member of a university library.

I am often getting the basic story of a topic from Wikipedia and then trying to find peer-reviewed journal references.

I finally found some good references about deep learning.  Some people have been telling me I should give up my study of linguistics, forget these procedural approaches to solving the problem of language understanding, and focus on understanding deep learning.  From what I have read so far in academic papers (i.e. explanation rather than hype), deep learning is about classifying and understanding things through a hierarchical chain.  Each neural layer in deep learning currently tends to need training before it can be used to feed into the next layer.  Deep learning is not a means of stirring a pot of neuron soup and letting it settle out into a brain.  From what I have read, deep learning represents an advanced pattern-matching tool.  I have seen articles about how to build a brain which I have not yet read.  It may be that my understanding of what you can do with deep learning is out of date, but I have also read 2015 articles.  I have found http://deeplearning.net, which does appear to be an excellent source for finding out the state of the art.
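
To capture the point about each layer needing training before it can feed the next, here is a purely conceptual sketch.  It uses no real deep learning library; the Layer interface and trainer class are placeholders of my own, only meant to show the greedy layer-wise idea.

    import java.util.List;

    /** Placeholder for a single trainable layer in a hierarchical model. */
    interface Layer {
        /** Fit this layer's parameters to the given inputs. */
        void train(double[][] inputs);
        /** Transform inputs into this layer's learned representation. */
        double[][] transform(double[][] inputs);
    }

    class GreedyLayerwiseTrainer {
        /** Trains each layer in turn, feeding its transformed output upward to the next layer. */
        static void pretrain(List<Layer> layers, double[][] data) {
            double[][] current = data;
            for (Layer layer : layers) {
                layer.train(current);               // train this layer first...
                current = layer.transform(current); // ...then its output feeds the next layer
            }
        }
    }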

More Reading and Prototyping

I have DKPro Core as well as Stanford CoreNLP running in Eclipse, producing text, XML and XMI output from my selected text.  I can see various tasks I need to address in performing the kind of data extraction and knowledge base construction that I am interested in undertaking.  Before leaping in any further I plan to do some more reading on the work presented at COLING 2014, and also on what I can learn by finding references using http://saffron.insight-centre.org/acl/topics/ and other topics such as LEMON.  In between my reading I will start developing a task list and a prototype.
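
For reference, a minimal Stanford CoreNLP pipeline of the kind I have running looks roughly like this.  The annotator list and example sentence are illustrative rather than my actual configuration.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    import java.util.Properties;

    public class CoreNlpSketch {
        public static void main(String[] args) {
            // Configure a small pipeline: tokenisation, sentence splitting, POS tagging, lemmas.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("The fleece is spun into yarn.");
            pipeline.annotate(document);

            // Print each token with its part-of-speech tag.
            for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    System.out.println(word + "/" + pos);
                }
            }
        }
    }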

Looking at WebAnno (UIMA + Brat)

Well, I took a look at U-Compare and found the display gave me an error when I loaded a large file.  I started looking at the code to solve the problem, only to discover the source is only partially open.  I am not keen on reporting the error because, if I did, I would need to say what the file was, and I do not want to encourage people onto my work area too soon.  So I understand the reasons for the source not being fully open, but I am frustrated.

In looking around for a UIMA / Brat integration I have found the fully open source WebAnno project.  They have YouTube tutorial videos on this product.  The server version installs on Debian, a Unix variant.  I am tempted to move to Unix (I worked on Unix years ago), but it is a lot of extra learning I do not need.  Then the question is whether to go for a dual boot on my laptop or to get another system.  I will try running it on Windows first.

Still Running The UIMA Tutorial. Thinking About Adding Brat Integration

I have been running the UIMA tutorial, which I still need to complete.  I am interested to discover how I can search for PEAR files so that I can reuse existing annotators (ways of analysing documents) out of the box.  I want annotators such as an HTML-to-text converter that gives me a handle on heading titles and their associated sub-paragraphs.  For feature extraction and automated learning I will want to associate headings with their associated paragraphs.
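
As a sketch of the sort of annotator I mean, here is a minimal UIMA analysis component that marks candidate headings using a crude heuristic of my own (a short line without a sentence-ending full stop).  A real version would define proper Heading and Paragraph types in a type system descriptor; this sketch reuses the built-in uima.tcas.Annotation type so it stays self-contained.

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    public class HeadingSketchAnnotator extends JCasAnnotator_ImplBase {

        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            String text = jcas.getDocumentText();
            int offset = 0;
            for (String line : text.split("\n", -1)) {
                String trimmed = line.trim();
                // Crude heading heuristic: short, non-empty line with no sentence-ending full stop.
                if (!trimmed.isEmpty() && trimmed.length() < 60 && !trimmed.endsWith(".")) {
                    Annotation heading = new Annotation(jcas, offset, offset + line.length());
                    heading.addToIndexes();
                }
                offset += line.length() + 1; // +1 for the newline character
            }
        }
    }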

Once I have got this far I will want to see whether integration of BRAT is a good idea for user presentation and collaboration purposes (see http://brat.nlplab.org/index.html).  I will then be looking at POS (part of speech) tagging.  It does appear, unsurprisingly perhaps, that others have already trodden this path in thinking of combining BRAT and a UIMA application.  But I am thinking of integrating and building so much more than that.  More news when I get there…