Toy project to scrape articles from some Malawian news sites to play with that data. This project uses an SQLite database to store the articles and runs a webserver to allow you to interact with the downloaded articles.
- Java 21+
- Javalin, used to be SparkJava but moved to Javalin which is being maintained..
- SQLite
- Go 1.21+
- Redis
- RedPanda for Kafka
- JSoup
What interesting things can we do:
Estimate reading times for the articlesUse sqlite for caching articlesWord frequency countNamed Entity Extraction (places, people, businesses, etc..)via NERIAUse Full Text Search functionalities of SQLiteUse new built-inHttpClient
for HTTP (async) requestsUse new Java language features:var
keyword,record
and Switch ExpressionsPublish Articles to a Kafka brokerUse Greypot to generate PDF from an article- Use Tribuo for basic article classification
- Try to use Formula Engine for search queries??
- Implement Markov Title Generator with Java
- Use Greypot to generate a basic newsletter like PDF of the articles (e.g. articles for month of September)
- Use jib to build the container for the project
- Text to speech (Amazon Polly??)
- Download images from the article and Base64 encode them for storage
- Play with String compression algorithms
- Use
Virtual Threads
for running the webscrapers (waiting for Project Loom to reach GA :)) - Integrate Cloudy for rendering semantic word clouds from articles
- NLP-esque tasks
- Find out how much money has been mentioned on NyasaTimes since 2016 categorized by keywords (donates, funds, wins, receives, aid etc..)
- Group articles by Keywords
- Group articles by named entities
- Extract quotes from articles "..." He said, "..." said Ndani ndani
- Find/detect (Malawian) locations mentioned in the articles
- Given an article, find related articles by named entity, location or amount range
- Use Lucene Tokenizer to get words which can be used with CRAFTML for labeling
- Add "pub/sub" for keywords - enable a person to subscribe to a keyword and send a webhook when we have an article that matches that keyword.
- Nyasa Times
- Malawi24
- ADD SUPPORT FOR mbc.mw ref https://mbc.mw/gess-to-be-awarded-in-london/
- ADD SUPPORT FOR times.mw ref https://times.mw/malawi-egypt-in-fertiliser-talks/
- ADD SUPPORT FOR https://mwnation.com/ .. ref: https://mwnation.com/tough-going-for-poor-households/
You will need to have Java JDK 17 and the latest Apache Maven, I recommend using IntelliJ's latest IDEA IDE.
$ git clone https://github.com/zikani03/articulated.git
$ cd articulated
$ mvn clean compile package
$ java --enable-preview -Dspark.port="4567" -Dneria.url="https://neria-fly.fly.dev" -jar target\articulated.jar
This runs a webserver on port localhost:4567
you can use the following curl request to download sports articles...
$ curl -X POST http://localhost:4567/articles/download/nyasatimes/sports?from=1&to=10
DISCLAIMER: The content scraped from the sites is property of those publishers and this is intended for personal use. If you want to use this for some other purposes e.g. commercial purposes, speak to a lawyer. I am not a lawyer. YMMV
Copyright (c) Zikani Nyirenda Mwase