Kloop Kyrgyz crawler.

Crawled content is stored in the project's sqlite3 DB.
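
To poke at the DB directly, here is a minimal sketch that lists its tables (db.sqlite3 is Django's default file name and an assumption here; the actual schema depends on the project's models):

    # List the tables in the crawled-content DB.
    # db.sqlite3 is Django's default DB file name (an assumption here).
    import sqlite3

    con = sqlite3.connect("db.sqlite3")
    for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'"):
        print(name)
    con.close()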

Credits:

  • Thanks to Bektour Iskender for permission to crawl and use Kloop articles.
  • Thanks to Henriette Brand for her awesome tutorial. My previous attempts to couple Django and Scrapy were ridiculous; her article showed me how to organize the project so that Scrapy can be called from within Django.

Notes:

  • This is currently a work in progress (written in roughly 8 hours).
  • 30,934 articles were crawled, covering 2011 to 2024 (as of September 8, 2024). Due to network failures, some articles were not crawled.
  • The crawl took more than 12 hours because of the deliberately gentle crawler settings (I didn't want to stress Kloop's servers); with a 1-second delay, ~31,000 requests alone take about 8.6 hours:
    CONCURRENT_REQUESTS = 3
    DOWNLOAD_DELAY = 1
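
For reference, Scrapy exposes a few more throttling knobs in the same settings module; whether this repo uses them is an assumption, but they serve the same goal:

    # Optional Scrapy politeness settings (assumptions; the project's
    # actual settings module may differ).
    ROBOTSTXT_OBEY = True        # respect robots.txt
    AUTOTHROTTLE_ENABLED = True  # adapt the delay to server response times
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for ~1 request at a time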

Technology

  • Django 4 (for the robust admin panel and awesome ORM; see the model sketch after this list)
  • Scrapy (for, surprise, scraping)
  • Other useful libs (see the project's requirements file).
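
For orientation, here is a sketch of what the article model might look like; the field names are illustrative assumptions, not the repo's actual schema:

    # A hypothetical Article model (field names are assumptions).
    from django.db import models

    class Article(models.Model):
        url = models.URLField(unique=True)        # canonical article URL
        title = models.CharField(max_length=500)
        text = models.TextField()                 # full article body
        published_at = models.DateField(null=True, blank=True)

        def __str__(self):
            return self.title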

Explore the corpus

Unpack all_texts.txt.zip and have fun.
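
For example, a minimal sketch for unpacking the archive and taking a first look (it assumes the archive contains a UTF-8 text file named all_texts.txt):

    # Unpack the corpus archive and peek at the text.
    import zipfile

    with zipfile.ZipFile("all_texts.txt.zip") as zf:
        zf.extractall()

    with open("all_texts.txt", encoding="utf-8") as f:
        text = f.read()

    print(f"{len(text):,} characters")
    print(text[:500])  # first 500 characters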

Dev setup prerequisites

We assume the following packages are installed on your system:

  • git
  • Python 3.6 or above
  • venv

Install dependencies and run

  • Clone the project: git clone https://github.com/kyrgyz-nlp/kloop-corpus.git

  • Go to the project folder: cd kloop-corpus

  • Create and activate a virtual env: python -m venv env && source env/bin/activate (on Windows, the activation script is env\Scripts\activate)

  • Install project dependencies: pip install -r requirements.txt

  • Run the server: ./manage.py runserver

  • Run the crawler: python manage.py crawl (see the sketch below)
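
For the curious, a Django management command can start Scrapy in-process via Scrapy's documented CrawlerProcess API. Below is a minimal sketch; the module path crawler/management/commands/crawl.py and the spider name "kloop" are assumptions, not necessarily the repo's actual layout:

    # Hypothetical crawler/management/commands/crawl.py: drive Scrapy
    # from a Django management command.
    from django.core.management.base import BaseCommand
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class Command(BaseCommand):
        help = "Crawl Kloop articles"

        def handle(self, *args, **options):
            # Picks up the Scrapy settings module configured for the project.
            process = CrawlerProcess(get_project_settings())
            process.crawl("kloop")  # spider name is an assumption
            process.start()         # blocks until the crawl finishes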

Changelog

  • Sep 08, 2024: re-crawled everything from the beginning to update all_texts.txt.zip, because the articles didn't contain valuable metadata.

TODO

  • Introduce crawler arguments such as --start-year=2020 or --start-from='2020-04'
  • Extract metadata using NER models or an LLM
  • Normalize whitespace and remove empty articles
  • Push the current version of the corpus to Hugging Face
  • Introduce upsert logic: if an article is not in the DB, crawl and save it
  • Add webpages with basic corpus statistics: a frequency dictionary, most frequent n-grams, etc. (see the sketch below)
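
As a starting point, such statistics can be computed straight from the unpacked all_texts.txt; here is a minimal sketch with naive whitespace tokenization (proper Kyrgyz tokenization would need more care):

    # Frequency dictionary and most frequent bigrams over the corpus.
    # Naive whitespace tokenization; punctuation handling is omitted.
    from collections import Counter

    with open("all_texts.txt", encoding="utf-8") as f:
        tokens = f.read().lower().split()

    freq = Counter(tokens)                      # frequency dictionary
    bigrams = Counter(zip(tokens, tokens[1:]))  # 2-grams

    print(freq.most_common(10))
    print(bigrams.most_common(10))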