What is the Conatix System?
Conatix is building a semi-automated business intelligence system based on recent advances in machine learning to enable companies to discover, source, structure and share previously unstructured data and information from outside their organizations. Conatix disrupts the business research value chain: rethinking online search and discovery, organizing the research process, creating a system that learns, and iterating the process to create knowledge that scales. Companies need to do more than just one-off keyword searches.
Once documents (HTML pages, PDFs, text documents or other formats available on the web and on company intranets) have been discovered by the Conatix business intelligence system, Stanbol can aid in identifying important terms, which can in turn serve as an input for discovery and relevance classification of further documents, in addition to those inputs and feedback provided manually by users.
What is the SEESAW Widget?
The Stanbol Enumerated Entity Set Acquisition Widget (SEESAW) is a contribution of Conatix UK Ltd. to Apache Stanbol under the Early Adopters program of IKS. SEESAW is built to fetch pages from Wikipedia, extract their titles and URLs, and add them to the Apache Stanbol Entityhub. Once added to the Entityhub, the entities can be used as Named Entities. SEESAW can be used, for example, to add entities for a new language. Another application is to add a category or a set of categories, as well as portals, from Wikipedia in different languages and to enhance text based on those new Named Entities.
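As a rough illustration of turning titles and URLs into entities, the sketch below serializes (URL, title) pairs as N-Triples with a language-tagged `rdfs:label`. This is an assumption about a reasonable representation, not SEESAW's actual storage format; the function names are hypothetical, and uploading the result to an Entityhub is deliberately left out.

```python
# Hypothetical helpers: serialize (url, title) pairs as N-Triples label statements.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text: str) -> str:
    """Escape characters that are special inside an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def entity_to_ntriples(url: str, title: str, lang: str) -> str:
    """Return one triple: <url> rdfs:label "title"@lang .

    The URL is used as the entity identifier without validation (assumption).
    """
    return f'<{url}> <{RDFS_LABEL}> "{escape_literal(title)}"@{lang} .'


def entities_to_ntriples(entities, lang: str) -> str:
    """Serialize an iterable of (url, title) pairs, one triple per line."""
    return "".join(entity_to_ntriples(url, title, lang) + "\n" for url, title in entities)
```

Using the page URL as the entity identifier keeps each entity traceable to its Wikipedia source, and the language tag lets labels in different languages coexist for the same entity.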
Wikipedia pages follow repetitive patterns that separate the lists of categories, subcategories, and pages. SEESAW automatically parses the pages and semi-recursively adds those entities to the Entityhub. Named Entities in different languages and/or customized inputs by users add value to the text enhancement service of Apache Stanbol. The text of the documents can be in any language, or may be about categories or sciences which are not yet supported by the Apache Stanbol Engine. The Entityhub accepts user-defined, customized Named Entities, which led to the idea of building an automated widget to add a large number of entities for the text enhancement service.
Apache Stanbol and VIE integration into the beta prototype of the Conatix semi-automated business intelligence system focuses on named entity recognition and important term mining which is independent of the language used by the users. Therefore, Conatix needed a way to parse entities across multiple languages, and we developed that functionality in the form of the SEESAW widget to address our own business needs. Then we decided to make it available to the broader Stanbol community.
All documentation is provided in the project as well as in standard Javadocs, and can be accessed on GitHub. The source code and a runnable JAR library are also available for download.
The enhancement properties returned by the Stanbol Engine can be used to enrich the terms of the text input by users. These properties and links can then be used within the Conatix system to find further related important entities, so the intelligent core of the Conatix system has a broader range of important and related terms. Having more terms related to the text of the documents and to user inputs results in the discovery of new documents which do not contain the same important terms as the original documents added by the user, but which do contain similar terms and patterns given by the customized Stanbol Named Entities.
In addition, VIE uses a new chain that relies solely on the customized Named Entities added to the Entityhub. These entities are in the specific language in which the research is being conducted and/or relate to specific topics or categories.
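To make the idea of a dedicated chain concrete, here is a minimal sketch of the HTTP request a client could send to a named Stanbol enhancement chain. VIE itself is a JavaScript library, so this Python version only illustrates the shape of the call. The base URL and the chain name `seesaw-fa` are hypothetical, and the request is built but not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_enhancer_request(base_url: str, chain: str, text: str) -> Request:
    """Build (but do not send) a POST request to a named enhancement chain.

    Assumes the Stanbol enhancer exposes chains under /enhancer/chain/<name>.
    """
    url = f"{base_url.rstrip('/')}/enhancer/chain/{quote(chain, safe='')}"
    return Request(
        url,
        data=text.encode("utf-8"),  # plain text to be enhanced
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "text/turtle",  # ask for the enhancements as RDF (Turtle)
        },
        method="POST",
    )


# Hypothetical server and chain name; pass the result to urllib.request.urlopen to send it.
request = build_enhancer_request("http://localhost:8080", "seesaw-fa", "تهران")
```

Routing text through a chain that only consults the custom entities keeps the results restricted to the language or topic under research.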
Pages in Wikipedia belong to categories. Given a category, SEESAW parses the content of the category page, discovers its pages and subcategories, and adds them to the Entityhub, which is used for text enhancement. Furthermore, subcategories are parsed separately to add the pages they link to.
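The category traversal described above can be sketched as follows. This is not SEESAW's implementation: it assumes category pages mark their sections with `div` elements whose ids are `mw-subcategories` and `mw-pages`, it reads "semi-recursive" as descending exactly one level into subcategories, and `fetch` is a hypothetical callable that returns the HTML for a URL.

```python
from html.parser import HTMLParser


class CategoryPageParser(HTMLParser):
    """Collect (href, title) links from the subcategory and page sections."""

    # Assumed section ids on a category page, mapped to result keys.
    SECTIONS = {"mw-subcategories": "subcategories", "mw-pages": "pages"}

    def __init__(self):
        super().__init__()
        self.links = {"subcategories": [], "pages": []}
        self._section = None  # section currently being read, if any
        self._depth = 0       # nesting depth of <div> inside that section

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._section is None and attrs.get("id") in self.SECTIONS:
                self._section = self.SECTIONS[attrs["id"]]
                self._depth = 1
            elif self._section is not None:
                self._depth += 1
        elif tag == "a" and self._section is not None and "href" in attrs:
            self.links[self._section].append((attrs["href"], attrs.get("title", "")))

    def handle_endtag(self, tag):
        if tag == "div" and self._section is not None:
            self._depth -= 1
            if self._depth == 0:
                self._section = None


def parse_category(html):
    """Return {'subcategories': [...], 'pages': [...]} for one category page."""
    parser = CategoryPageParser()
    parser.feed(html)
    return parser.links


def collect_entities(category_url, fetch):
    """Gather the category's pages and subcategories, then each subcategory's pages."""
    top = parse_category(fetch(category_url))
    found = top["pages"] + top["subcategories"]
    for sub_url, _title in top["subcategories"]:
        found.extend(parse_category(fetch(sub_url))["pages"])  # one level only
    return list(dict.fromkeys(found))  # drop duplicates, keep first-seen order
```

Passing `fetch` in as a parameter keeps the parsing logic testable against saved HTML and leaves network access, rate limiting, and URL resolution to the caller.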
To test the idea, we used the Persian Wikipedia. As the example below indicates, entities are recognized in a given Persian text. Adding a new language is important because research can be performed with the Conatix system in any language and our system is language-independent. Persian was chosen for several reasons: (I) it uses a non-Latin alphabet, (II) it is a right-to-left language, so some properties of the text are different, and (III) the main contributor to SEESAW is a Persian speaker. The system was also tested using seed URLs in English, Azerbaijani, Turkish, and Finnish, and the results were valuable.
Here is an example of the input text in Persian and the output results from Apache Stanbol: