Automatic Language Detection Module (LanguageDetector)
A high-precision service that operates on a stream of messages.
LanguageDetector is a self-sufficient service that automatically identifies the language (or the mix of several languages) of an incoming message. At present, the module recognizes 67 languages that are among the most widely used online.
- List of the recognized languages:
- Germanic: English, Dutch, German, Norwegian, Danish, Swedish.
- Romance: French, Spanish, Italian, Portuguese, Romanian.
- Cyrillic: Russian, Ukrainian, Belarusian, Tatar, Serbian, Bulgarian, Kazakh.
- Slavic (Latin alphabet): Polish, Czech, Croatian, Bosnian, Slovenian, Slovak.
- Finno-Ugric: Finnish, Hungarian, Estonian.
- Baltic: Latvian, Lithuanian.
- Turkic: Azerbaijani, Turkish, Kyrgyz, Mongolian, Uzbek.
- Others: Chinese, Japanese, Korean, Vietnamese, Thai, Laotian, Khmer, Tibetan, Burmese, Filipino (Buhid, Tagbanwa, Hanunoo, Baybayin, Cebuano, Waray), Swahili, Arabic, Farsi, Hindi, Georgian, Armenian, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhalese, Saurashtra, Hebrew, Syriac (Aramaic).
Average language recognition accuracy is 97-98%. Note that some closely related languages are identified less precisely: for example, the accuracy for Portuguese is relatively low, about 90%, because of its similarity to Spanish, and the same holds for Bosnian and Slovenian. New languages can be added on request.
The defining feature of this module is its high accuracy on short sentences as well as on longer texts, and its ability to tell apart languages from one group (such as Russian, Bulgarian and Belarusian).
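As a rough illustration of why languages within one group are the hard case, the sketch below looks only at which writing system the letters of a message belong to. It is not how LanguageDetector works (its algorithm is not described here); `script_shares` is a hypothetical helper that takes the first word of each Unicode character name as a script label. It can split a Cyrillic/Latin mix, but it gives identical answers for Russian and Bulgarian, so script alone cannot separate them.

```python
import unicodedata
from collections import Counter


def script_shares(message):
    """Return the share of letters per script label (hypothetical helper)."""
    counts = Counter()
    for ch in message:
        if ch.isalpha():
            # The first word of the Unicode name is a rough script label,
            # e.g. "CYRILLIC SMALL LETTER YA" -> "CYRILLIC".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


# A mixed message: part Cyrillic, part Latin.
print(script_shares("Привет hello"))

# Russian and Bulgarian greetings look the same at the script level.
print(script_shares("Добрый день") == script_shares("Добър ден"))  # True
```

A real detector needs language-specific evidence (for example, character or word statistics) on top of this, especially for short messages.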
Automatic Sentiment Detection Module (SentiFinder)
It is a unique high-precision, high-performance service for automatic sentiment detection in both Russian and English texts.
SentiFinder is built to commercial standards, so it can be used as a standalone service or integrated into an existing infrastructure.
The module includes two types of sentiment analysis:
- object-based, where the object is given by the user;
- sentence-level, where the system determines the object automatically from the information in the sentence.
The second type of sentiment analysis generally allows a more precise classification of a document. It is especially useful for opinion mining and reviews, where it is difficult for a user to specify an object.
The module identifies three types of sentence sentiment (positive, negative and neutral) both for a given object within the sentence and as an average within the whole document. The average precision for all three types of sentiment for Russian is about 86%, while for English it is about 80%.
The main feature of this module is that it can detect the strength of the emotion. The user therefore gets not only a qualitative estimate of the document sentiment but also a quantitative ratio of negative to positive attitude in the document.
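To make the idea of a qualitative label plus a quantitative ratio concrete, here is a minimal sketch. The `SentenceSentiment` record, the `[0, 1]` strength scale and the `summarize` function are assumptions made for illustration; the actual output format of SentiFinder is not specified here.

```python
from dataclasses import dataclass


@dataclass
class SentenceSentiment:
    label: str       # "positive", "negative" or "neutral"
    strength: float  # hypothetical emotion strength in [0, 1]


def summarize(sentences):
    """Aggregate sentence-level results into a document-level summary."""
    positive = sum(s.strength for s in sentences if s.label == "positive")
    negative = sum(s.strength for s in sentences if s.label == "negative")
    if positive > negative:
        overall = "positive"
    elif negative > positive:
        overall = "negative"
    else:
        overall = "neutral"
    # The ratio is undefined when there is no positive mass at all.
    ratio = negative / positive if positive else None
    return {
        "overall": overall,
        "positive": positive,
        "negative": negative,
        "negative_to_positive": ratio,
    }


document = [
    SentenceSentiment("positive", 0.5),
    SentenceSentiment("negative", 0.25),
    SentenceSentiment("neutral", 0.0),
]
print(summarize(document))
```

Summing strengths is only one possible aggregation; averaging or weighting by sentence length would be equally reasonable choices under these assumptions.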
Automatic Text Classification Module (TextClassifier)
A service built on machine learning.
TextClassifier is also a standalone service, available as a commercial version. The module automatically assigns one of a set of predefined classes to a Russian document of any size and on any topic. It is designed so that it can be retrained quickly for other topics and other languages. At present the service distinguishes the following categories:
- Autos
- Economy and Business
- Showbiz and Entertainment
- Family
- Fashion
- Computer Games
- Health and Medicine
- Politics
- Real Estate
- Science and Technology
- Sport
- Tourism and Travel
- Cooking
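The sketch below shows, in miniature, what assigning a predefined class and retraining for a new topic can look like. It uses a toy Naive Bayes model as a stand-in; the algorithm behind TextClassifier is not stated here, and `TinyNaiveBayes`, its methods and the English sample texts are all invented for this example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes over whitespace tokens (illustration only)."""

    def __init__(self):
        self.class_docs = Counter()               # documents seen per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocab = set()

    def train(self, labeled_docs):
        """Add (text, label) pairs; calling it again extends the model."""
        for text, label in labeled_docs:
            self.class_docs[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        """Return the class with the highest log-probability."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            # Add-one smoothing keeps unseen words from zeroing the score.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = TinyNaiveBayes()
model.train([("engine wheels sedan", "Autos"), ("goal match team", "Sport")])
print(model.classify("team won the match"))

# "Retraining" for a new topic: add labeled examples for a new category.
model.train([("recipe oven soup", "Cooking")])
print(model.classify("soup recipe"))
```

A production classifier would need far more training data per category and proper tokenization for Russian; the point here is only that the category set is defined by the training examples.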
Automatic Named Entity Recognition Module (NER)
This service works with texts in Russian and English.
The module is implemented as a standalone service that groups named entities from a Russian text into 5 classes (proper names, organizations, geography, names of products and brands, and named events) and from an English text into 3 classes (proper names, organizations and geography). The table below shows the precision, recall and F1 for each type of named entity in Russian.
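| Type | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| Proper Names | 94.04 | 94.28 | 94.16 |
| Geography | 92.19 | 91.76 | 91.97 |
| Organisations | 85.27 | 86.52 | 85.89 |
| Products | 79.20 | 80.03 | 79.62 |
| Events | 80.15 | 76.27 | 78.16 |
| Average | 86.18 | 85.78 | 85.97 |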
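The F1 column in the table above is consistent with the standard definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the published precision and recall; small differences in the last digit are expected, presumably because the published values were rounded. The `ROWS` dictionary and `f1` helper are names introduced for this example.

```python
# Precision, recall and reported F1 for Russian, copied from the table.
ROWS = {
    "Proper Names": (94.04, 94.28, 94.16),
    "Geography": (92.19, 91.76, 91.97),
    "Organisations": (85.27, 86.52, 85.89),
    "Products": (79.20, 80.03, 79.62),
    "Events": (80.15, 76.27, 78.16),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (p, r, reported) in ROWS.items():
    print(f"{name}: computed {f1(p, r):.2f}, reported {reported:.2f}")
```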
The main feature of our NER is that it does not use any dictionaries or thesauri. As a result, it can determine with high precision the type of a previously unseen entity, or of an entity whose type depends on context. Another advantage is that it needs no linguistic preprocessing of the text, which keeps the module fast.
Normalization Module for the Russian Language
This module performs word normalization, i.e. it assigns each word its dictionary form: nominative singular for nouns and the infinitive for verbs.
Normalization is essential for syntactic and semantic analysis of a text, especially for languages with rich morphology (like Russian).
The module achieves high normalization accuracy for Russian texts by means of morphological analysis, extended dictionaries and ambiguity resolution.
Lexical ambiguity arises when different words with different meanings share the same word form. For example, the word «saw» can be a noun meaning a cutting tool, or the simple past tense of the verb «see», depending on the context. Russian is very rich in such ambiguity, so resolving it is an important part of processing and analyzing texts.
To resolve lexical ambiguity, our module analyses the context, applies linguistic rules and chooses one form from all the variants suggested by the dictionaries. We combine two approaches to normalization: statistical and rule-based. As a result, the module shows high precision and a speed of about 200 kB/s for text in a two-byte encoding, which is roughly 100,000 characters per second.
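The «saw» example can be sketched as a single context rule. This is a toy in English, not the module's Russian rule set or its statistical component; `CANDIDATES`, `DETERMINERS`, `PRONOUNS` and `normalize` are hypothetical names, and the rule only looks at the previous word.

```python
# Dictionary candidates for an ambiguous word form: (lemma, part of speech).
CANDIDATES = {"saw": [("saw", "NOUN"), ("see", "VERB")]}

DETERMINERS = {"a", "the", "this", "that", "my"}
PRONOUNS = {"i", "you", "he", "she", "we", "they"}


def normalize(tokens):
    """Return one lemma per token, using the previous word to disambiguate."""
    lemmas = []
    for i, token in enumerate(tokens):
        word = token.lower()
        options = CANDIDATES.get(word, [(word, "UNKNOWN")])
        previous = tokens[i - 1].lower() if i > 0 else ""
        wanted = None
        if previous in DETERMINERS:
            wanted = "NOUN"   # "the saw" -> the cutting tool
        elif previous in PRONOUNS:
            wanted = "VERB"   # "I saw" -> past tense of "see"
        # Fall back to the first dictionary candidate when no rule applies.
        chosen = next((o for o in options if o[1] == wanted), options[0])
        lemmas.append(chosen[0])
    return lemmas


print(normalize("I saw the saw".split()))  # ['i', 'see', 'the', 'saw']
```

A single-word context window is far too weak for real text; it is used here only to show how a rule picks one of several dictionary candidates.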
Morphological Analysis Module for the Russian Language
This module performs morphological analysis of the words in a sentence, detecting their parts of speech and morphological characteristics. It is typically used by other linguistic services such as sentiment detection, named entity recognition and normalization.
Morphological analysis is based on morphological dictionaries and combinatorial algorithms. In addition to the morphological characteristics of a word, the output can include all of its possible word forms and stems (stemming). The analyzer resolves lexical ambiguity and matches word forms with their morphological attributes.
The combination of quantitative linguistics and an extended set of rules yields exceptionally high precision in detecting the morphological attributes of words.
The purpose of Eureka Engine is to automate the complicated process of understanding «human speech», to extract new data from huge arrays of various texts and, when integrated, to expand the capabilities of existing services and applications.