The idea of this project is to collect a large number of articles from the website g1.globo.com and apply NLP techniques to them. The corpus is in Portuguese.
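As a minimal sketch of the collection step (the actual pipeline is driven by the shell script below), a single article could be fetched and parsed with the requests and bs4 libraries listed under the dependencies; the tag and class selectors here are hypothetical and would have to match g1's real markup:

```python
import requests
from bs4 import BeautifulSoup

def fetch_article(url):
    """Download one article page and extract its title and body text.

    The "content-text" class is a placeholder; the real selectors
    must be taken from g1's actual HTML.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1")
    paragraphs = soup.find_all("p", class_="content-text")
    body = "\n".join(p.get_text(strip=True) for p in paragraphs)
    return (title.get_text(strip=True) if title else "", body)
```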
After running

```sh
./database/collect_data.sh
```

an SQLite3 database called g1database.db will be created. The database has a single relation, given by the schema:
```sql
CREATE TABLE articles (
    id TEXT,
    created_at TEXT,
    url TEXT NOT NULL,
    section TEXT,
    summary TEXT,
    title TEXT NOT NULL,
    text TEXT NOT NULL,
    PRIMARY KEY(id, created_at)
);
```
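Once created, the database can be read back with Python's built-in sqlite3 module (already among the dependencies below); a minimal sketch:

```python
import sqlite3

# Open the database produced by collect_data.sh.
conn = sqlite3.connect("g1database.db")

# Load the title and full text of every collected article.
rows = conn.execute("SELECT title, text FROM articles").fetchall()
print(f"{len(rows)} articles loaded")

conn.close()
```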
To run all scripts at once, issue the command

```sh
./run_scripts.sh
```
| Database statistics | |
| --- | --- |
| Number of texts collected | 1721 |
- Plot showing all topics and their frequencies
- Plot showing word frequencies, with stopwords removed (a sketch of this computation follows the list)
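As a sketch of how the word-frequency plot could be computed, assuming NLTK for the Portuguese stopword list (NLTK is not among the dependencies below, so this is an assumption; any other Portuguese stopword list would work the same way):

```python
import re
import sqlite3
from collections import Counter

import nltk

nltk.download("stopwords")
stopwords = set(nltk.corpus.stopwords.words("portuguese"))

conn = sqlite3.connect("g1database.db")
texts = [row[0] for row in conn.execute("SELECT text FROM articles")]
conn.close()

# Tokenize on word characters and drop Portuguese stopwords.
counts = Counter(
    word
    for text in texts
    for word in re.findall(r"\w+", text.lower())
    if word not in stopwords
)
print(counts.most_common(20))
```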
Let $d_1, \dots, d_N$ be the collected documents, each represented as a vector whose coordinates are indexed by the words of the vocabulary. Each coordinate of these vectors is then transformed so that it holds the term frequency adjusted by the idf term:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{|\{\, d_i : t \in d_i \,\}|}$$

The idf terms take the whole corpus into consideration, down-weighting words that occur in many documents. Finally, after the transformation above, the vectors are projected into a low-dimensional space. Even though low-dimensional vector spaces lose a lot of structure, we can still see some topics at a high distance from the others. Embedding these vectors into $\mathbb{R}^2$ yields the topic plot listed above.
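A compact sketch of this tf-idf and projection pipeline, using scikit-learn (an assumption: it is not in the dependency list below), with TruncatedSVD standing in for whichever projection the scripts actually use:

```python
import sqlite3

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

conn = sqlite3.connect("g1database.db")
texts = [row[0] for row in conn.execute("SELECT text FROM articles")]
conn.close()

# Term-frequency vectors reweighted by the idf term over the whole corpus.
tfidf = TfidfVectorizer().fit_transform(texts)

# Project the high-dimensional tf-idf vectors into R^2 for plotting.
points = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(points.shape)  # (number of articles, 2)
```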
- SQLite3
- Python3 libraries:
  - bs4
  - requests (to download URLs)
  - json
  - re
  - sqlite3
- Perl
- wget
The programs were tested on a GNU/Linux machine.