@article{26583204_325243423_2019,
author = {Vladimir Barakhnin and Olga Kozhemyakina and Ravil Mukhamediev and Yulia Borzilova and Kirill Yakunin},
keywords = {natural language processing, streaming word processing, text analysis information system, development of a text corpus processing system},
title = {The design of the structure of the software system for processing text document corpus},
year = {2019},
number = {4 Vol.13},
pages = {60-72},
url = {https://bijournal.hse.ru/en/2019--4 Vol.13/325243423.html},
}
A media-monitoring system that addresses the following tasks:
- Parsing of news websites using a custom configurable spider (Scrapy)
- Storage (Redis, PostgreSQL, Elasticsearch)
- NLP data preprocessing (PyMorphy2, NLTK, Gensim)
- Topic modelling (LDA, BigARTM, ETM), including dynamic models (Custom DTM, DETM)
- Classification of documents according to arbitrary criteria (M4A, traditional ML approach)
- Visualization (Django, HTML+CSS+JS, Plotly, MapBox)
- Automatic report generation (LaTeX + Jinja2)
All components of the system are implemented as Docker containers. This lets components and subsystems run independently, makes them interchangeable, and allows the system to scale easily.
Airflow serves as the ETL subsystem. It runs the scraping spiders (Scrapy), whose output is stored in PostgreSQL as persistent structured SQL storage. Data obtained through preprocessing, modification and modelling is stored in Elasticsearch, which is the main storage for the pre-calculated results needed to display dashboards and reports.
Figures: Topic Document Dynamics and Criteria Dynamics.
The system also provides visualization tools, such as the dynamics of topic publications in the media according to various criteria, histograms of criteria value distributions, distributions among sources, etc.
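As an illustration of what a topic-dynamics series might look like before it is plotted, here is a minimal sketch. The input layout (pairs of publication date and a `{topic: weight}` mapping) and the function name `topic_dynamics` are assumptions made for this example, not the system's actual data model.

```python
from collections import defaultdict
from datetime import date


def topic_dynamics(documents, topic_id):
    """Sum the weight of one topic per publication date.

    `documents` is an iterable of (publication_date, {topic_id: weight})
    pairs; this layout is an assumption made for the example.
    """
    totals = defaultdict(float)
    for published, topic_weights in documents:
        # Documents that do not mention the topic contribute zero.
        totals[published] += topic_weights.get(topic_id, 0.0)
    # Return the series in chronological order, ready for plotting.
    return sorted(totals.items())


docs = [
    (date(2019, 4, 1), {"economy": 0.75, "sport": 0.25}),
    (date(2019, 4, 1), {"economy": 0.5}),
    (date(2019, 4, 2), {"sport": 1.0}),
]
series = topic_dynamics(docs, "economy")
```

The resulting list of `(date, total_weight)` pairs could be passed to a plotting library such as Plotly as the x and y values of a line chart.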
Mapping DTM is a custom algorithm for analyzing topic dynamics based on context semantic mapping (Context Fuzzy Jaccard). It makes it possible to visualize the topic lifecycle, analyze changes in vocabulary, and classify topics by their dynamic characteristics in order to distinguish events, informational attacks, long-term trends, etc.
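The details of Context Fuzzy Jaccard are not given here, so the sketch below substitutes the generic weighted (fuzzy) Jaccard similarity, the sum of per-word minima divided by the sum of per-word maxima, to show how topics from consecutive time slices could be linked. The function names, the `{word: weight}` topic representation and the `threshold` value are all assumptions for illustration; this is not the authors' algorithm.

```python
def fuzzy_jaccard(topic_a, topic_b):
    """Weighted (fuzzy) Jaccard similarity of two {word: weight} mappings."""
    words = set(topic_a) | set(topic_b)
    intersection = sum(min(topic_a.get(w, 0.0), topic_b.get(w, 0.0)) for w in words)
    union = sum(max(topic_a.get(w, 0.0), topic_b.get(w, 0.0)) for w in words)
    return intersection / union if union else 0.0


def map_topics(previous, current, threshold=0.5):
    """Link each topic of the previous slice to its best match in the current one.

    Returns {previous_id: current_id or None}; None means that no topic in
    the current slice reached the (hypothetical) similarity threshold.
    """
    links = {}
    for prev_id, prev_words in previous.items():
        best_id, best_score = None, 0.0
        for cur_id, cur_words in current.items():
            score = fuzzy_jaccard(prev_words, cur_words)
            if score > best_score:
                best_id, best_score = cur_id, score
        links[prev_id] = best_id if best_score >= threshold else None
    return links


previous = {"t1": {"election": 0.5, "vote": 0.5}, "t2": {"football": 1.0}}
current = {"a": {"election": 0.5, "vote": 0.25, "poll": 0.25}, "b": {"weather": 1.0}}
links = map_topics(previous, current)
```

Under these assumptions, a chain of linked topics across slices approximates a topic's lifecycle, and a topic with no successor could be read as having ended.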
Dashboards are sets of configurable widgets that render the visualizations described above.
A dashboard can be configured to a client's needs without additional development.
Monitoring objects are implemented through a special NER request language that filters information based on any given entities.
An example of such a request is `1(Machine Learning) AND 1(Deep | Convolutional)`, which requires the phrase "Machine learning" to be present in a text, along with either "Deep" or "Convolutional". This language makes it possible to filter the corpus flexibly in order to analyze different entities such as persons, organisations, locations and topics.
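To make the request semantics concrete, here is a minimal sketch of an evaluator for requests of the form shown in the example. The grammar is inferred from that single example only. In particular, reading the leading number as a minimum occurrence count and matching case-insensitively are assumptions, and the names `parse_request` and `matches` are hypothetical.

```python
import re

# One clause: an integer followed by parenthesised alternatives, e.g. 1(Deep | Convolutional)
TERM = re.compile(r"(\d+)\(([^()]*)\)")


def parse_request(request):
    """Parse 'N(a | b) AND M(c)' into a list of (number, [alternatives])."""
    clauses = []
    for part in request.split(" AND "):
        match = TERM.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        alternatives = [alt.strip() for alt in match.group(2).split("|")]
        clauses.append((int(match.group(1)), alternatives))
    return clauses


def matches(request, text):
    """Return True if every clause of the request is satisfied by the text."""
    lowered = text.lower()
    for min_count, alternatives in parse_request(request):
        # ASSUMPTION: the leading number is a minimum number of occurrences,
        # counted over all alternatives of the clause, ignoring case.
        hits = sum(
            len(re.findall(r"\b" + re.escape(alt.lower()) + r"\b", lowered))
            for alt in alternatives
        )
        if hits < min_count:
            return False
    return True


request = "1(Machine Learning) AND 1(Deep | Convolutional)"
accepted = matches(request, "Machine learning with deep networks")
rejected = matches(request, "Machine learning for tabular data")
```

This sketch splits clauses on the literal ` AND ` and does not support nesting; a real implementation would need a proper parser.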
Media Analytics can be applied to industrial tasks such as:
- A service competitive with ALEM MEDIA MONITORING for:
  - Monitoring of the media space (news websites, social networks, TV, etc.)
  - Reputation management, public opinion analysis, assessment and optimization of PR policies
  - Decision-making support
  - Configurable reporting and dashboards
  - NER request filtering
  - Estimation of marketing campaign KPIs, comparative analysis of competitors, etc.
- A service for finding the most relevant bloggers/influencers for advertising in social networks (YouTube, Instagram, Facebook, etc.). An example of such a service is GetBlogger. It involves:
  - Social network parsing, filtering by blogger/author popularity
  - Topic modelling of the corpora, obtaining topic embeddings for individual publications and aggregating them into bloggers'/authors' topic embeddings
  - Creating a model that accepts textual information about a business or product as input and outputs the most relevant bloggers/authors
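
The blogger-matching idea can be sketched as follows. This is a simplified illustration, not the described model: it assumes that publication embeddings are aggregated by a plain mean, that the business or product description has already been mapped into the same topic space, and that relevance is measured by cosine similarity. The names `mean_vector`, `cosine` and `rank_authors` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_authors(publications, query_embedding):
    """Rank authors by similarity of their mean topic embedding to the query.

    `publications` maps an author to the topic embeddings of their
    publications; the result lists authors, most relevant first.
    """
    profiles = {author: mean_vector(vecs) for author, vecs in publications.items()}
    return sorted(
        profiles,
        key=lambda author: cosine(profiles[author], query_embedding),
        reverse=True,
    )


publications = {
    "travel_blogger": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "tech_reviewer": [[0.0, 0.2, 0.8], [0.1, 0.1, 0.8]],
}
# Hypothetical topic embedding of a product description.
ranking = rank_authors(publications, [0.0, 0.1, 0.9])
```

A popularity filter, as mentioned in the list above, could be applied to `publications` before ranking.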