nos-grams: A Python repository from rubenros1795

NOS-grams is an n-gram viewer for the NOS.nl web archive. The viewer is accessible as a Streamlit app.

Data 📁

The data consists of n-grams generated from online news articles published by the Dutch Broadcasting Foundation (NOS). The NOS articles go back to 2010. The organization officially started several years earlier, but unfortunately that data is lost (web archivists of all countries - unite!).

I contacted the NOS about this project but received no reply, so I cannot share the raw articles. Only the n-grams are saved, in HDF5 format, so the original articles cannot be reconstructed. Of course, you're free to scrape the archive yourself. Instructions follow below.

Preprocessing 🔧

The preprocessing pipeline looks as follows:

Scrape the web archive content in .html format: preprocess/scrape-html.py.
Parse the metadata (date, category) from the .html files and write it to resources/metadata.json: preprocess/parse-html.py.
Extract titles and text content from the .html files: preprocess/parse-text.py.
Preprocess text (lowercase, remove all-digit tokens, tokenize): preprocess/preprocess.py.
Parse 1-4 grams from text data: preprocess/parse-ngrams.py.*
Convert the n-gram .csv files to .h5 format to increase query speed.
Calculate total token sizes per month: preprocess/calculate-totals.ipynb.

* N.B.: n-grams are written to /ngrams which is not in this repo due to the size of the files.
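
To give an idea of what the text preprocessing and n-gram steps do, here is a minimal sketch. It is not the code from preprocess/preprocess.py or preprocess/parse-ngrams.py: the function names, the regex tokenizer and the example sentence are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: tokens are runs of word characters; the repo's tokenizer may differ.
TOKEN_PATTERN = re.compile(r"\w+")


def preprocess(text):
    """Lowercase the text, tokenize it, and drop all-digit tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if not tok.isdigit()]


def extract_ngrams(tokens, max_n=4):
    """Count every 1- to max_n-gram in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical example sentence, not taken from the archive.
tokens = preprocess("Het nieuws van 2010: het NOS Journaal.")
ngram_counts = extract_ngrams(tokens)
print(tokens)
print(ngram_counts.most_common(3))
```

Note that in this sketch the all-digit tokens are dropped before the n-grams are built, so words on either side of a removed number end up in the same n-gram.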

The Streamlit App 💻

The n-grams can be queried through the Streamlit app. Streamlit is a fantastic framework for creating dashboards in Python. This app supports the following features:

Unigram querying (so far no bigrams and higher).
Multiple queries.
Wildcards: search for substrings, for example "*nieuws" or "buitenland*".
Relative frequency.
Two different visualisation types (bar charts and line charts).
Rolling averages.
Date range selection.
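
Roughly, wildcard matching, relative frequency and rolling averages could work as in the sketch below. This is not the code from app.py: the counts, totals, function names and the use of fnmatch for wildcards are all assumptions, and the numbers are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical monthly unigram counts: {term: {month: count}}.
counts = {
    "nieuws": {"2010-01": 10, "2010-02": 20, "2010-03": 30},
    "sportnieuws": {"2010-01": 5, "2010-02": 0, "2010-03": 10},
    "buitenland": {"2010-01": 4, "2010-02": 8, "2010-03": 2},
}
# Hypothetical total number of tokens per month.
totals = {"2010-01": 1000, "2010-02": 2000, "2010-03": 1000}


def match_terms(pattern, vocabulary):
    """Return all terms that match a wildcard pattern such as '*nieuws'."""
    return sorted(term for term in vocabulary if fnmatchcase(term, pattern))


def relative_frequency(term, months):
    """Divide the monthly count of a term by the total tokens in that month."""
    return [counts[term].get(month, 0) / totals[month] for month in months]


def rolling_average(values, window):
    """Trailing moving average; early points average the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


months = sorted(totals)
matched = match_terms("*nieuws", counts)
rel = relative_frequency("nieuws", months)
smoothed = rolling_average(rel, window=2)
print(matched, rel, smoothed)
```

Relative frequency is what makes months comparable here: "nieuws" doubles in raw count from January to February, but because the month total doubles too, its relative frequency stays the same.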

If you want to run the app locally (check out the Streamlit documentation), first install the dependencies listed in requirements.txt (pip install -r requirements.txt), then run streamlit run app.py.

Features to be added ⌛

Bigram/trigram support
Time series correlation support for finding related terms based on diachronic frequency patterns.
Collocation support for finding related terms based on word context.
Faster querying using a remote SQL database.
Fancy colors.
