
Web scraping texts from g1.globo.com for NLP


The idea here is to collect many articles from the website g1.globo.com and apply NLP techniques to them.

The texts are in Portuguese.

After running


A SQLite3 database called g1database.db will be created. This database has only one relation, given by the schema:

create table articles (
    id          TEXT,
    created_at  TEXT,
    url         TEXT NOT NULL,
    section     TEXT,
    summary     TEXT,
    title       TEXT NOT NULL,
    text        TEXT NOT NULL,

    PRIMARY KEY(id, created_at)
);
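
As a minimal sketch of how this relation could be used from Python's built-in sqlite3 module, the snippet below creates the table, stores an article, and counts articles per section. It defaults to an in-memory database rather than g1database.db, and the helper names (open_database, insert_article, count_by_section) are hypothetical; they are not taken from the project's scripts.

```python
import sqlite3

# Same columns and composite primary key as the relation described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id          TEXT,
    created_at  TEXT,
    url         TEXT NOT NULL,
    section     TEXT,
    summary     TEXT,
    title       TEXT NOT NULL,
    text        TEXT NOT NULL,
    PRIMARY KEY (id, created_at)
);
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_article(conn, article):
    """Insert one article dict; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles "
        "(id, created_at, url, section, summary, title, text) "
        "VALUES (:id, :created_at, :url, :section, :summary, :title, :text)",
        article,
    )
    conn.commit()
    return cur.rowcount == 1


def count_by_section(conn):
    """Return a {section: number of articles} mapping."""
    rows = conn.execute(
        "SELECT section, COUNT(*) FROM articles GROUP BY section ORDER BY section"
    )
    return dict(rows.fetchall())
```

Using INSERT OR IGNORE means that re-running a scraper over the same article (same id and created_at) would leave the existing row untouched instead of raising an error; whether the original scripts behave this way is an assumption.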

To run all scripts at once issue the command



Database statistics
Number of texts collected: 1721


  • Plot showing all topics and their frequencies
topic trends image
  • Plot showing word frequencies, excluding stopwords
wordcloud image
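
A word-frequency count like the one behind the second plot could be sketched as follows. This is an illustration only: the tokenizer, the function names, and the very small STOPWORDS sample are assumptions, and a real run would need a complete Portuguese stopword list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny sample of Portuguese stopwords.
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "e", "em", "que", "para", "um", "uma"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (accented letters kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count how often each non-stopword token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if tok not in stopwords)
    return counts
```

For example, word_frequencies(["O governo anunciou a medida.", "A medida do governo"]) counts "governo" and "medida" twice each and drops "o", "a", and "do".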

Let $\mathcal{W}$ be the set of bag-of-words vectors $$ \mathcal{W} = \{w = (w_i)_{i \in \mathcal{V}} \in \mathbb{R}^{\mathcal{V}} \},$$ where $\mathcal{V}$ is the set of most frequent words and $w_i$ is the count of word $i$ in a document.

Then I will transform each of these vectors so that each coordinate is the term frequency of the corresponding word weighted by its inverse document frequency (idf). The idf terms are computed over the corpus $$\mathcal{C} = \{d : d \text{ is a document whose topic appears more than 3 times} \}.$$
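
The transformation described above can be sketched in plain Python. This is a sketch under assumptions, not the project's implementation: it uses the plain variant idf(w) = log(N / df(w)), libraries such as scikit-learn apply smoothed variants, and all function names here are hypothetical.

```python
import math
from collections import Counter


def filter_corpus(docs, min_topic_count=3):
    """Keep the (topic, tokens) pairs whose topic appears more than min_topic_count times."""
    topic_counts = Counter(topic for topic, _ in docs)
    return [(topic, toks) for topic, toks in docs if topic_counts[topic] > min_topic_count]


def build_vocabulary(token_lists, size):
    """Return the `size` most frequent words across all documents."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [word for word, _ in counts.most_common(size)]


def tfidf_vectors(token_lists, vocabulary):
    """Return one tf-idf vector per document, with coordinates ordered as in vocabulary."""
    n_docs = len(token_lists)
    vocab_set = set(vocabulary)

    # Document frequency: in how many documents each vocabulary word occurs.
    df = Counter()
    for toks in token_lists:
        df.update(set(toks) & vocab_set)

    # Assumed idf variant: log(N / df); words absent from every document get weight 0.
    idf = {w: math.log(n_docs / df[w]) for w in vocabulary if df[w] > 0}

    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append([tf[w] * idf.get(w, 0.0) for w in vocabulary])
    return vectors
```

With this variant, a word that occurs in every document of the corpus gets idf 0 and therefore contributes nothing to any vector.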

Finally, after the transformation above, I will project the vectors into $\mathbb{R}^2$ and $\mathbb{R}^3$ using the truncated SVD transformation.
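
For reference, one common way to write this projection is the following; the symbols $X$, $U_k$, $\Sigma_k$, $V_k$ and $Z$ are introduced here for illustration and are assumptions about notation, not part of the original text. Stack the $n$ transformed vectors as the rows of a matrix $X \in \mathbb{R}^{n \times |\mathcal{V}|}$. The rank-$k$ truncated SVD keeps only the $k$ largest singular values, $$X \approx U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{n \times k}, \quad \Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad V_k \in \mathbb{R}^{|\mathcal{V}| \times k},$$ with $\sigma_1 \geq \dots \geq \sigma_k \geq 0$. The projected documents are then the rows of $$Z = X V_k = U_k \Sigma_k \in \mathbb{R}^{n \times k},$$ taking $k = 2$ or $k = 3$ for the plots.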

R2 image R3 image

Although low-dimensional vector spaces lose a lot of structure, we can still see some topics lying far away from the others.

Embedding these vectors into $\mathbb{R}^{300}$ retains more of the variance in the data. Below is a correlation plot showing how the resulting features relate to one another.

correlation map image


  • SQLite3
  • Python3 libraries:
    • bs4
    • requests (download urls)
    • json
    • re
    • sqlite3
  • Perl
  • wget

The programs were tested on a GNU/Linux machine.