title | separator | verticalSeparator | theme | revealOptions |
---|---|---|---|---|
Scalable, polyglot, observable news analysis | `<!-- section -->` | `^---$` | white | |
Timm Heuss
April 2023
This low-key session is about a custom news crawling and analysis solution that I have been building and improving for years. The talk covers the challenges I faced and how the architecture evolved over time into a scalable, polyglot, observable system. We will also reflect on the beauty of open source on GitHub and of message queues. In the hands-on session, we will scale the application interactively and monitor its performance in real time.
Tech-Keywords: Docker, NATS.io, Prometheus, Loki, Grafana, Golang, Python
Staying up to date is key.
But there's too much information out there.
Also, don't trust centralized approaches.
So...
Why not let my machine
find relevant information for me?
Rich Site Summary
Have lists of RSS feeds.
Enrich them with fivefilters.
Match regexes against the articles' full text.
- NATS queue
- Article URL Feeder
- Keyword Matcher
- Pocket Integration
- Fivefilters
article_urls: URLs of articles from the internet
match_urls: URLs that match my interests
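The flow between the two queues can be pictured as a tiny pipeline. This is a minimal in-memory sketch, not the project's code: `queue.Queue` stands in for NATS, and `fetch_full_text`, `FAKE_WEB` and `PATTERNS` are hypothetical names made up for illustration.

```python
import re
from queue import Queue

# In-memory stand-ins for the two NATS queues (assumption: the real system uses NATS).
article_urls: Queue = Queue()
match_urls: Queue = Queue()

# Hypothetical interest list; the real patterns live in the project's keyword files.
PATTERNS = [re.compile(r"Strange Loop")]

# Hypothetical canned texts replacing the real full-text retrieval.
FAKE_WEB = {
    "https://example.org/a": "Recap of Strange Loop 2022",
    "https://example.org/b": "Ten recipes for sourdough",
}


def fetch_full_text(url: str) -> str:
    """Hypothetical helper: return the article text for a URL."""
    return FAKE_WEB.get(url, "")


def run_matcher() -> None:
    """Drain article_urls and forward every matching URL to match_urls."""
    while not article_urls.empty():
        url = article_urls.get()
        text = fetch_full_text(url)
        if any(p.search(text) for p in PATTERNS):
            match_urls.put(url)


for url in FAKE_WEB:
    article_urls.put(url)
run_matcher()
matches = list(match_urls.queue)
print(matches)
```

Only URLs whose text matches an interest pattern end up on `match_urls`; everything else is dropped.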
~50 feeds from manually curated sources
~630 feeds from kilimchoi/engineering-blogs
Retrieves articles, matches them against predefined regexes, and puts successful matches on the match_urls queue
# Simple name-dropping
Strange Loop
# Positive lookaheads
(?i)^(?=.*(docker))(?=.*(alternative|anti pattern|best practice|goodbye|ranger|podman|cli|benchmark)).*
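To see what the two lookaheads do, here is a small sketch using Python's standard `re` module. The patterns are copied from above; `normalize` and `matches` are hypothetical helpers, and collapsing whitespace before matching is an assumption on my side (without it, `.` would stop at the first newline).

```python
import re

# Patterns copied from the examples above.
NAME_DROPPING = re.compile(r"Strange Loop")
DOCKER_TOPICS = re.compile(
    r"(?i)^(?=.*(docker))"
    r"(?=.*(alternative|anti pattern|best practice|goodbye|ranger|podman|cli|benchmark)).*"
)


def normalize(text: str) -> str:
    """Collapse all whitespace into single spaces (assumed preprocessing)."""
    return " ".join(text.split())


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches anywhere in the normalized text."""
    return pattern.search(normalize(text)) is not None


print(matches(DOCKER_TOPICS, "Podman as a\nDocker alternative"))
print(matches(DOCKER_TOPICS, "Docker turns ten"))
print(matches(NAME_DROPPING, "Videos from Strange Loop are online"))
```

Both lookaheads start at the beginning of the text, so the order of the two keywords does not matter. One thing to keep in mind: `cli` has no word boundary, so it also matches inside words such as "client".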
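queue.WithArticleUrls(func(m *nats.Msg) {
    // Each message carries one article URL.
    var url = string(m.Data)
    // Fetch the full text and normalize it before matching.
    var fulltext = fulltextrss.RetrieveFullText(url)
    var text = prepareAndCleanString(fulltext)
    var match, regexId = keywords.Match(text)
    if match {
        // Forward matching articles to the Pocket integration.
        queue.PushToPocket(model.Match{
            Url:     url,
            RegexId: regexId,
        })
    }
})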
https://github.com/heussd/nats-news-analysis/tree/main/keyword-matcher-go
Project | Client languages |
---|---|
NATS | 29 |
Kafka | 18 |
Pulsar | 7 |
RabbitMQ | 10 |
https://docs.nats.io/nats-concepts/overview/compare-nats
Metric | Python | Golang | Comparison |
---|---|---|---|
Docker image size | 424 MB | 6.09 MB | Go image is ~70x smaller |
Memory consumption | 23.8 MiB | 8.33 MiB | Go needs ~3x less memory |
LoC | 447 | 485 | Python has ~8% fewer lines |
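The comparison column is plain arithmetic on the measured values. A quick sketch to reproduce it, using only the figures from the table (the variable names are mine):

```python
# Figures taken from the table above.
image_mb = {"python": 424, "go": 6.09}
memory_mib = {"python": 23.8, "go": 8.33}
loc = {"python": 447, "go": 485}

# How many times smaller the Go image is.
image_ratio = image_mb["python"] / image_mb["go"]
# How many times less memory the Go variant needs.
memory_ratio = memory_mib["python"] / memory_mib["go"]
# Share of lines the Python variant saves relative to Go.
loc_saving = (loc["go"] - loc["python"]) / loc["go"]

print(f"{image_ratio:.0f}x, {memory_ratio:.1f}x, {loc_saving:.0%}")
```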
Python's regex engine matches differently from the (third-party) engine used on the Go side.
Python tends to overmatch, which inflates its match counts and makes it look better in the statistics than it is.
keyword-matcher-go:
  scale: 4
  image: ghcr.io/heussd/nats-news-analysis/keyword-matcher-go:latest
rss-article-url-feeder-go-1st:
  image: ghcr.io/heussd/nats-news-analysis/rss-article-url-feeder-go:latest
  [...]
  volumes:
    - type: bind
      source: ./urls-primary.txt
      target: /urls.txt
      consistency: cached
      read_only: true
rss-article-url-feeder-go-2nd:
  [...]
  image: ghcr.io/heussd/nats-news-analysis/rss-article-url-feeder-go:latest
  volumes:
    - type: bind
      source: ./urls-secondary.txt
      target: /urls.txt
      consistency: cached
      read_only: true
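Why does `scale: 4` work without any coordination code? A rough sketch of the idea, assuming the matcher replicas all consume from one shared queue so that each URL is handled by exactly one of them. Threads and `queue.Queue` are stand-ins for containers and NATS here; `worker` and the example URLs are made up.

```python
import threading
from collections import Counter
from queue import Empty, Queue

WORKERS = 4  # mirrors `scale: 4` in the compose file


def worker(jobs: Queue, handled: Counter, lock: threading.Lock) -> None:
    """Take URLs off the shared queue until it is empty."""
    while True:
        try:
            url = jobs.get_nowait()
        except Empty:
            return
        with lock:
            handled[url] += 1  # stand-in for fetching and matching the article


jobs: Queue = Queue()
for i in range(100):
    jobs.put(f"https://example.org/article/{i}")

handled: Counter = Counter()
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(jobs, handled, lock))
    for _ in range(WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(handled), set(handled.values()))
```

The queue hands every item to exactly one consumer, so adding replicas increases throughput without duplicating work.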
events { worker_connections 1024; }
http {
    upstream fullfeedrss {
        server nats-news-analysis_fullfeedrss_1:80;
        server nats-news-analysis_fullfeedrss_2:80;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://fullfeedrss;
        }
    }
}
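With no balancing method specified, an nginx `upstream` block distributes requests round-robin. A minimal sketch of what that means for the two servers above (`round_robin` is a hypothetical helper, and this ignores weights and failed servers):

```python
from itertools import cycle

# Upstream servers copied from the nginx configuration above.
UPSTREAM = [
    "nats-news-analysis_fullfeedrss_1:80",
    "nats-news-analysis_fullfeedrss_2:80",
]


def round_robin(servers):
    """Return an endless iterator that rotates through the servers."""
    return cycle(servers)


picker = round_robin(UPSTREAM)
# Four incoming requests alternate between the two containers.
assignments = [next(picker) for _ in range(4)]
print(assignments)
```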
- Push principle
- One or more containers push their logs to Loki
- Docker driver
services:
  service:
    [...]
    logging:
      driver: loki
      options:
        loki-url: "http://host.docker.internal:3100/loki/api/v1/push"
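The Docker driver does the pushing for you, but it helps to know what "push" means on the wire. The sketch below only builds a JSON body in the shape Loki's `/loki/api/v1/push` endpoint accepts and does not send anything. `loki_payload`, the `service` label and the log line are made up for illustration.

```python
import json
import time


def loki_payload(labels: dict, lines: list, now_ns=None) -> str:
    """Build a JSON body in the shape Loki's push endpoint accepts."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    body = {
        "streams": [
            {
                # Label set identifying the log stream.
                "stream": labels,
                # Each entry: [timestamp in nanoseconds as a string, log line].
                "values": [[ts, line] for line in lines],
            }
        ]
    }
    return json.dumps(body)


payload = loki_payload({"service": "keyword-matcher-go"}, ["matched 3 articles"], now_ns=1)
print(payload)
```

The body would be sent as an HTTP POST with `Content-Type: application/json`; the containers initiate the transfer, which is the push principle.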
- Pull principle
- One container exposes a metrics endpoint
- Additional tooling exports metrics to the Prometheus instance
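On the pull side, the container just offers its current numbers as plain text and Prometheus comes by to scrape them over HTTP, typically at `/metrics`. A minimal sketch of that text format; `render_metrics` and the metric name are hypothetical, the real metrics come from the Prometheus NATS Exporter.

```python
def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
body = render_metrics({"articles_matched_total": ("Articles that matched a regex.", 42)})
print(body, end="")
```

The application never needs to know where Prometheus runs; it only has to answer when asked.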
Cloud-native Principles
==
Supercharge your Possibilities
- rss-article-url-feeder-go
- keyword-matcher-go
- pocket-integration
- fivefilters-full-text-rss
- nats
- NGINX
- Prometheus NATS Exporter
- Prometheus
- Grafana Loki
- Grafana
Measure | Effort | Win |
---|---|---|
NATS | medium | Use 29 languages, de-duplication, persistence |
+ Docker | low | Scale |
+ nginx | low | Scale even better |
+ Loki Driver | low | Mighty observability stack |
+ Grafana | low | Dashboard |
CNCF, GitHub, Docker, NATS FTW
https://github.com/heussd/nats-news-analysis

https://github.com/heussd/talk-polyglot-scalable-observable-news-analysis