Scalable, polyglot, observable news analysis

Polyglot | scalable | observable news analysis

Timm Heuss

April 2023



This low-key session is about a custom news crawling and analysis solution that I have been building and improving for years. The talk covers the challenges I faced and how the architecture evolved over time into a scalable, polyglot, observable system. We will also reflect on the beauty of open source on GitHub and of message queues. In the hands-on part, we will scale the application interactively and monitor its performance in real time.

Tech-Keywords: Docker, NATS.io, Prometheus, Loki, Grafana, Golang, Python

Motivation & Idea


Staying up to date is key.

But there's too much information out there.

Also, don't trust centralized approaches.

So...


Why not let my machine

find relevant information for me?


RSS

Rich Site Summary


How many RSS feeds are delivered today


What they look like with fivefilters


News Analysis in a nutshell

Have lists of RSS feeds.

Enrich them with fivefilters.

Match regular expressions against the articles' full text.

Basic components

  • NATS queue
  • Article URL Feeder
  • Keyword Matcher
  • Pocket Integration
  • Fivefilters

It's all about URLs!

article_urls: URLs of articles from the internet

match_urls: URLs that match my interests
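
The flow between the two queues can be pictured with a minimal Python sketch. It uses the standard library's `queue.Queue` as an in-process stand-in for the NATS queues `article_urls` and `match_urls`. The `is_relevant` callback and the example URLs are assumptions for illustration only.

```python
import queue

# In-process stand-ins for the two NATS queues (assumption: the real
# system talks to NATS, not to queue.Queue).
article_urls = queue.Queue()
match_urls = queue.Queue()


def run_matcher(is_relevant):
    """Drain article_urls and forward relevant URLs to match_urls."""
    while not article_urls.empty():
        url = article_urls.get()
        if is_relevant(url):
            match_urls.put(url)


# Hypothetical example URLs.
for url in ["https://example.com/docker-alternatives",
            "https://example.com/cooking"]:
    article_urls.put(url)

run_matcher(lambda url: "docker" in url)
matches = list(match_urls.queue)
print(matches)
```

Every component only ever reads URLs from one queue or writes URLs to another, which is what keeps the components independent of each other.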


Article URL Feeder

~50 feeds from manually curated sources

~630 feeds from kilimchoi/engineering-blogs
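
What a feeder does per feed could look roughly like the following Python sketch: parse the feed and collect the article links. The sample feed is made up, only plain RSS 2.0 `<item>`/`<link>` elements are handled, and `extract_article_urls` is a hypothetical name, not a function from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First</title><link>https://example.com/first</link></item>
    <item><title>Second</title><link>https://example.com/second</link></item>
  </channel>
</rss>"""


def extract_article_urls(feed_xml):
    """Return the <link> of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            urls.append(link.strip())
    return urls


article_urls = extract_article_urls(SAMPLE_FEED)
print(article_urls)
```

Each extracted URL would then be published to the article_urls queue.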


Keyword Matcher

Retrieves articles, matches them against predefined regexes, and puts successful matches on the match_urls queue.


keywords.txt

# Simple name-dropping
Strange Loop

# Positive lookaheads
(?i)^(?=.*(docker))(?=.*(alternative|anti pattern|best practice|goodbye|ranger|podman|cli|benchmark)).*
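
How such a file could be loaded and applied is sketched below in Python, whose built-in `re` module supports lookaheads. The sketch assumes that blank lines and lines starting with `#` are skipped; `load_patterns` and `match` are hypothetical helpers, and the test sentences are invented.

```python
import re

# The two patterns from keywords.txt above, including the comment lines.
KEYWORDS_TXT = r"""
# Simple name-dropping
Strange Loop

# Positive lookaheads
(?i)^(?=.*(docker))(?=.*(alternative|anti pattern|best practice|goodbye|ranger|podman|cli|benchmark)).*
"""


def load_patterns(text):
    """Compile every non-empty line that is not a '#' comment."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def match(patterns, text):
    """Return (True, index) for the first matching pattern, else (False, -1)."""
    for index, pattern in enumerate(patterns):
        if pattern.search(text):
            return True, index
    return False, -1


patterns = load_patterns(KEYWORDS_TXT)
print(match(patterns, "Podman as a Docker alternative"))
print(match(patterns, "Docker 24 released"))
print(match(patterns, "Notes from Strange Loop"))
```

The two lookaheads act as a logical AND: the text has to mention docker and at least one of the listed terms, in any order.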


Pocket Integration


Fivefilters


Matching centerpiece

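queue.WithArticleUrls(func(m *nats.Msg) {
    // Each message carries the URL of one article.
    url := string(m.Data)

    // Fetch the full article text and normalize it for matching.
    fulltext := fulltextrss.RetrieveFullText(url)
    text := prepareAndCleanString(fulltext)

    // Check the text against the keyword regexes.
    match, regexId := keywords.Match(text)

    if match {
        queue.PushToPocket(model.Match{
            Url:     url,
            RegexId: regexId,
        })
    }
})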

https://github.com/heussd/nats-news-analysis/tree/main/keyword-matcher-go
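
The slide does not show what `prepareAndCleanString` does. The Python sketch below is a guess at a plausible cleaning step, not the project's implementation: it strips tags, decodes HTML entities, and collapses all whitespace into single spaces. Putting the text on one line matters for patterns built on `.*`, because `.` does not match newlines by default.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def prepare_and_clean_string(fulltext):
    """Strip tags, decode entities and collapse whitespace into single spaces."""
    text = TAG.sub(" ", fulltext)
    text = html.unescape(text)
    return WHITESPACE.sub(" ", text).strip()


cleaned = prepare_and_clean_string("<p>Docker &amp;\n  Podman</p>")
print(cleaned)
```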

Polyglotness


Project   Client languages
NATS      29
Kafka     18
Pulsar    7
Rabbit    10

https://docs.nats.io/nats-concepts/overview/compare-nats


First implementation with Python


Re-implementations with Go


Why not both?


Python vs. Go

Metric               Python     Golang     Comparison
Docker image size    424 MB     6.09 MB    Go is ~70x smaller
Memory consumption   23.8 MiB   8.33 MiB   Go needs ~3x less memory
LoC                  447        485        Python has ~8% fewer lines
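
The comparison column follows directly from the measured values. A short Python check, using only the figures from the table:

```python
# Figures from the table above.
image_size_mb = {"python": 424, "go": 6.09}
memory_mib = {"python": 23.8, "go": 8.33}
loc = {"python": 447, "go": 485}

image_ratio = image_size_mb["python"] / image_size_mb["go"]
memory_ratio = memory_mib["python"] / memory_mib["go"]
loc_saving = 1 - loc["python"] / loc["go"]

print(f"image: {image_ratio:.1f}x, memory: {memory_ratio:.1f}x, LoC: {loc_saving:.1%}")
```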

Python vs. Go matching performance?

Python's regex engine matches differently compared to Go's (third-party) engine ☹️

Python tends to overmatch, which inflates its matching performance in the statistics and makes the comparison misleading.

Scalability


Bottleneck: Keyword Matcher


So let's scale it with Docker Compose

keyword-matcher-go:
  scale: 4
  image: ghcr.io/heussd/nats-news-analysis/keyword-matcher-go:latest
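
Why adding replicas helps can be shown with a small Python sketch of competing consumers. It assumes that each queued URL is delivered to exactly one matcher replica. Threads and `queue.Queue` stand in for the containers and the NATS queue, and the URLs are invented.

```python
import queue
import threading

NUM_WORKERS = 4  # mirrors "scale: 4" in the compose file

article_urls = queue.Queue()
processed = []                 # (worker_id, url) pairs
processed_lock = threading.Lock()


def worker(worker_id):
    """Take URLs off the shared queue until a None sentinel arrives."""
    while True:
        url = article_urls.get()
        if url is None:
            break
        with processed_lock:
            processed.append((worker_id, url))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

# Hypothetical example URLs.
urls = [f"https://example.com/article-{n}" for n in range(20)]
for url in urls:
    article_urls.put(url)
for _ in threads:
    article_urls.put(None)     # one stop signal per worker
for t in threads:
    t.join()

print(len(processed), "URLs handled by", len({w for w, _ in processed}), "workers")
```

Every URL is handled once, no matter how many workers run, so the number of replicas can change without touching the feeders.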



Parallel feeders

  rss-article-url-feeder-go-1st:
    image: ghcr.io/heussd/nats-news-analysis/rss-article-url-feeder-go:latest
    [...]
    volumes:
      - type: bind
        source: ./urls-primary.txt
        target: /urls.txt
        consistency: cached
        read_only: true
  rss-article-url-feeder-go-2nd:
    [...]
    image: ghcr.io/heussd/nats-news-analysis/rss-article-url-feeder-go:latest
    volumes:
      - type: bind
        source: ./urls-secondary.txt
        target: /urls.txt
        consistency: cached
        read_only: true



Bottleneck: Fivefilters


Simple nginx load balancer


events { worker_connections 1024; }

http {
    upstream fullfeedrss {
        server nats-news-analysis_fullfeedrss_1:80;
        server nats-news-analysis_fullfeedrss_2:80;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://fullfeedrss;
        }
    }
}
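
Since the `upstream` block names no balancing method, nginx falls back to its default, round robin. The effect on incoming requests can be pictured with a few lines of Python; the request names are invented.

```python
import itertools

# Upstream servers from the nginx configuration above.
UPSTREAMS = [
    "nats-news-analysis_fullfeedrss_1:80",
    "nats-news-analysis_fullfeedrss_2:80",
]

next_upstream = itertools.cycle(UPSTREAMS)

# Hypothetical incoming requests.
assignments = [(f"request-{n}", next(next_upstream)) for n in range(4)]
for request, upstream in assignments:
    print(request, "->", upstream)
```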



Observability


Loki

  • Push principle
  • One or more containers push their logs to Loki
  • Docker driver

Set up Loki in Docker Compose

services:
  service:
    [...]
    logging:
      driver: loki
      options:
        loki-url: "http://host.docker.internal:3100/loki/api/v1/push"
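
The Docker driver does the pushing for you. For illustration, here is a minimal Python sketch of what a push to that endpoint looks like when done by hand, using Loki's JSON push format. The label, the log line, and the helper names are assumptions. Only the payload is built here; `push` needs a reachable Loki instance and is not called.

```python
import json
import time
import urllib.request

LOKI_URL = "http://host.docker.internal:3100/loki/api/v1/push"


def build_push_payload(line, labels, timestamp_ns=None):
    """Build the JSON body for a single log line in one stream."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": labels,
                # Loki expects [<unix epoch in nanoseconds as string>, <log line>].
                "values": [[str(timestamp_ns), line]],
            }
        ]
    }


def push(line, labels, url=LOKI_URL):
    """POST one log line to Loki (needs a reachable Loki instance)."""
    body = json.dumps(build_push_payload(line, labels)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical label and log line; building the payload needs no network.
payload = build_push_payload("matched 3 articles", {"service": "keyword-matcher"}, 1)
print(json.dumps(payload))
```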



Prometheus

  • Pull principle
  • One container exposes a metrics endpoint
  • Additional tooling exports metrics to the Prometheus instance
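
The pull principle means a service only has to offer its current numbers over HTTP. A minimal Python sketch of such an endpoint, using the Prometheus text exposition format and only the standard library, could look like this. The counter name, the port, and the helper names are assumptions; the server is defined but not started.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter; a real service would update it while working.
COUNTERS = {"articles_matched_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (not started automatically here)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


COUNTERS["articles_matched_total"] += 3
print(render_metrics(COUNTERS), end="")
```

Prometheus would then scrape `/metrics` at its own interval, so the service never needs to know where Prometheus runs.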



🙌 Hands on 🙌


Starting Point


Add feeder


Add 2nd feeder


Add Keyword Matcher


PUMP IT


Reflections


Beauty of Cloud-native


Cloud-native Principles

==

Supercharge your Possibilities


~170 LoC Docker Compose


Tiny design decision - huge impact

Measure         Effort   Win
NATS            medium   Use 29 languages, de-duplication, persistence
+ Docker        low      Scale
+ nginx         low      Scale even better
+ Loki Driver   low      Mighty observability stack
+ Grafana       low      Dashboard

"Throw-away mode" in Grafana


Beauty of Open Source (at GitHub)


GitHub employees have your back ❤️


GitHub bots have your back ❤️


The community is talking code ❤️


Thank you

CNCF, GitHub, Docker, NATS FTW

https://github.com/heussd/nats-news-analysis
https://github.com/heussd/talk-polyglot-scalable-observable-news-analysis