pyflink-nlp

Self-contained demo using PyFlink with Gensim+spaCy to find topics in the Flink User Mailing List. All you need is Docker! 🐳


Building an Analytics Pipeline with PyFlink

(WIP)

See the slides for context.


Getting the setup up and running

docker-compose build

docker-compose up -d

Is everything really up and running?

docker-compose ps

You should be able to access the Flink Web UI (http://localhost:8081), as well as Superset (http://localhost:8088).
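If you'd rather check from a script than from the browser, here is a minimal sketch that probes both URLs using only the Python standard library. The helper names (`is_reachable`, `report`) are made up for illustration and are not part of this repo; the URLs are the ones listed above.

```python
import http.client
import urllib.error
import urllib.request

# URLs from this README; adjust them if you remapped the ports.
SERVICES = {
    "Flink Web UI": "http://localhost:8081",
    "Superset": "http://localhost:8088",
}


def is_reachable(url, timeout=3.0):
    """Return True if something answers HTTP at `url`, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with a non-2xx status, so it is up.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed reply, ...
        return False


def report(services=SERVICES):
    """Print one line per service with its reachability."""
    for name, url in services.items():
        status = "up" if is_reachable(url) else "not reachable"
        print(f"{name} ({url}): {status}")
```

Calling `report()` right after `docker-compose up -d` may show a service as not reachable for a little while, since the containers need some time to start.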

Submitting the PyFlink job

docker-compose exec jobmanager ./bin/flink run -py /opt/pyflink-nlp/pipeline.py \
  --pyArchives /opt/pyflink-nlp/lda_model.zip#model \
  --pyFiles /opt/pyflink-nlp/tokenizer.py -d
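A note on the `#model` suffix: with `--pyArchives`, the part after `#` is the name of the directory the archive gets extracted to inside the Python worker's working directory, so the job can refer to the model files through the relative path `model/`. The sketch below only mimics that behaviour locally with the standard library; it is not Flink code, and `extract_archive`, `demo` and the file name `lda.bin` are made up for illustration.

```python
import os
import tempfile
import zipfile


def extract_archive(spec, workdir):
    """Mimic `--pyArchives path.zip#target`: unpack the zip into workdir/target."""
    path, _, target = spec.partition("#")
    if not target:
        # Without '#', the archive's file name is used as the directory name.
        target = os.path.basename(path)
    dest = os.path.join(workdir, target)
    with zipfile.ZipFile(path) as zf:
        zf.extractall(dest)
    return dest


def demo():
    """Build a throwaway archive and show where its contents end up."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "lda_model.zip")
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("lda.bin", b"placeholder")  # hypothetical file name
        workdir = os.path.join(tmp, "worker")  # stands in for the worker's cwd
        os.makedirs(workdir)
        extract_archive(archive + "#model", workdir)
        return sorted(os.listdir(os.path.join(workdir, "model")))
```

The other flags are simpler: `-py` points at the job's entry script, `--pyFiles` ships extra Python files (here `tokenizer.py`) so they can be imported by the job, and `-d` submits in detached mode.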

Once you see the "Job has been submitted with JobID <JobId>" message, you can monitor the job's execution in the Flink Web UI (http://localhost:8081).
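The same information is also available without the browser: Flink serves a REST API on the same port as the Web UI, and its `/jobs/overview` endpoint lists each job with its ID, name and state. A minimal sketch, assuming the default address from this setup; the function names are made up for illustration.

```python
import json
import urllib.request

FLINK_REST = "http://localhost:8081"  # same address as the Web UI in this setup


def summarize_jobs(overview):
    """Map each job ID to (name, state) from a /jobs/overview payload."""
    return {
        job["jid"]: (job["name"], job["state"])
        for job in overview.get("jobs", [])
    }


def fetch_overview(base_url=FLINK_REST, timeout=5.0):
    """Fetch and decode the JSON document served at /jobs/overview."""
    with urllib.request.urlopen(f"{base_url}/jobs/overview", timeout=timeout) as resp:
        return json.load(resp)


def print_job_states(base_url=FLINK_REST):
    """Print one line per job: ID, state, name."""
    for jid, (name, state) in summarize_jobs(fetch_overview(base_url)).items():
        print(f"{jid}  {state:<10}  {name}")
```

Run `print_job_states()` once the job has been submitted; the job should show up with state `RUNNING` once it has started.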

Superset

To visualize the results, navigate to http://localhost:8088 and log in to Superset using:

username: admin

password: superset

There should be a default dashboard named "Flink User Mailing List" listed under Dashboards.


And that's it!

If you have any questions or feedback, feel free to DM me on Twitter @morsapaes.