Classifies comments as toxic, insulting, etc. over a RESTful API.
Based on the Kaggle "Toxic Comment Classification Challenge". Kernel for training and exporting the models: https://www.kaggle.com/deniskovalenko/logistic-regression-with-words-and-char-n-grams (forked from https://www.kaggle.com/thousandvoices/logistic-regression-with-words-and-char-n-grams)
Capabilities: serve predictions in real time and provide metrics
api_token must be set as an environment variable for the Scala app.
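A minimal sketch of how the app might pick the token up, assuming it is read straight from the process environment (the actual app may load it through its configuration instead):

```scala
// Sketch only: read the API token from the environment at startup.
// The real app may load it via its config files instead; this is an assumption.
object ApiToken {
  val value: String = sys.env.getOrElse(
    "api_token",
    throw new IllegalStateException("api_token environment variable is not set")
  )
}
```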
Example prediction response:
```json
{
  "scores": {
    "identity_hate": 0.00196645105731319,
    "insult": 0.006155951964776735,
    "obscene": 0.00613491609641551,
    "severe_toxic": 0.0026068558324382532,
    "threat": 0.0006324124967297172,
    "toxic": 0.017618535761723918
  },
  "success": true
}
```
Example metrics response:
```json
{
  "metrics": {
    "requestsPerMinute": 4,
    "mostCommonLabels": ["toxic", "obscene"],
    "toxicMean": 0.276513919586555,
    "severe_toxicMean": 0.24385146656244983,
    "obsceneMean": 0.25744020103790544,
    "threatMean": 0.004259511579034224,
    "insultMean": 0.25579802629887466,
    "identity_hateMean": 0.017571348458626793
  }
}
```
Returns average scores per label, requests per minute, and the 2 most common labels.
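One plausible way the two most common labels could be derived (an assumption about the logic, not the project's actual implementation): per request, count the labels whose score crosses a threshold, then take the two most frequent.

```scala
// Illustrative only: one way "mostCommonLabels" could be computed.
// The 0.5 threshold and the counting scheme are assumptions.
def mostCommonLabels(allScores: Seq[Map[String, Double]], top: Int = 2): Seq[String] =
  allScores
    .flatMap(_.collect { case (label, score) if score > 0.5 => label })
    .groupBy(identity)
    .toSeq
    .sortBy { case (_, hits) => -hits.size }
    .take(top)
    .map(_._1)
```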
The project needs 3 containers:
- a Kafka container (spotify/kafka)
- denkovalenko/prediction-api:1.0
- denkovalenko/toxic-comments-classifier:1.0

These images are available from Docker Hub, or you can build them yourself:
```
cd predictor
sbt docker:publishLocal
cd ../model-service
docker build -t denkovalenko/toxic-comments-classifier:1.0 .
cd ..
docker-compose up
```
(the main app starts on port 9005)
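For orientation, an illustrative docker-compose.yml wiring the three containers together might look like the following. The service names, environment wiring, and Kafka settings are assumptions; the repository's actual file is authoritative.

```yaml
# Illustrative sketch only - the repository's docker-compose.yml is authoritative.
version: "2"
services:
  kafka:
    image: spotify/kafka
    environment:
      ADVERTISED_HOST: kafka      # how the spotify/kafka image advertises itself
      ADVERTISED_PORT: "9092"
  prediction-api:
    image: denkovalenko/prediction-api:1.0
    ports:
      - "9005:9005"               # the main app listens on 9005
    environment:
      api_token: changeme         # required by the Scala app (see above)
    depends_on:
      - kafka
  model-service:
    image: denkovalenko/toxic-comments-classifier:1.0
    depends_on:
      - kafka
```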
- Scalability: by separating the web service from the model service, we can scale model throughput just by deploying more containers, as long as they belong to the same Kafka consumer group.
- Extendibility: the prediction pipeline is model-agnostic - it just passes the request to the matching Kafka topic and displays whatever was sent back by the model. So it's fairly easy to add a new model (see the sketch after this list):
  - Deploy a new app that consumes requests from the "requests-YOUR_NEW_PROJECT" Kafka topic and publishes the complete JSON to "response-YOUR_NEW_PROJECT"
  - Allow the new project in application.conf by adding its name to the colon-separated list "api.allowed_topics"
- By passing scores through Kafka, we can use Kafka Streams (or other frameworks) to compute statistics about them.
- Metrics are pre-computed, so additional requests from clients won't increase the load on Kafka.
- Raw scores from each model are displayed (instead of normalizing them so they add up to 1): for "neutral" comments we expect all scores to be low, but with normalization one of the scores would likely dominate. E.g., normalizing the example scores above would give "toxic" about 0.018/0.035 ≈ 0.5, even though its raw score is under 2%.
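To make the extendibility point concrete, below is a minimal sketch of a model service speaking this request/response protocol. It assumes the plain kafka-clients Java API and Scala 2.13; the bootstrap address, the score function, and the use of the record key as a request id are assumptions, not the project's actual implementation. Note group.id: replicas sharing it split the topic's partitions between them, which is exactly what makes the scalability point above work.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object NewModelService extends App {
  val consumerProps = new Properties()
  consumerProps.put("bootstrap.servers", "kafka:9092")
  // Same group.id across replicas => they share the topic's partitions,
  // which lets us scale throughput just by adding containers.
  consumerProps.put("group.id", "YOUR_NEW_PROJECT-workers")
  consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val producerProps = new Properties()
  producerProps.put("bootstrap.servers", "kafka:9092")
  producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val consumer = new KafkaConsumer[String, String](consumerProps)
  val producer = new KafkaProducer[String, String](producerProps)
  consumer.subscribe(List("requests-YOUR_NEW_PROJECT").asJava)

  // Stand-in for whatever your model actually does.
  def score(text: String): String = """{"scores":{},"success":true}"""

  while (true) {
    for (record <- consumer.poll(Duration.ofMillis(500)).asScala) {
      // Assumption: the record key carries a request id so the API can match the response.
      producer.send(new ProducerRecord("response-YOUR_NEW_PROJECT", record.key(), score(record.value())))
    }
  }
}
```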
- Feature extraction takes more than 1 second, which makes it quite an expensive operation. A possible improvement would be to batch queries together and run the transform on a whole batch.
- No caching yet
- We might need to put a load balancer in front of the Scala app to scale it up.
- For that, my hand-rolled stream processing would have to be replaced with Kafka Streams, since for now the state about recent requests is stored in-memory.
- Currently deployment is just docker-compose; some container orchestration tool might be useful.
- The Kafka 'cluster' currently persists its data inside the container, which isn't great; an external volume needs to be attached.
- I haven't managed to calculate the metrics with Kafka Streams, so the current metrics calculations are quite naive (a possible starting point is sketched after this list).
- Although it's pretty easy to add a new model, a lot of things in the metrics calculations are hard-coded.
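As a possible starting point for the Kafka Streams migration mentioned above, the sketch below counts responses in one-minute tumbling windows, which corresponds to requestsPerMinute. The topic name, serdes, and single-key grouping are assumptions; per-label means would need additional aggregation over the parsed JSON.

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows

object MetricsStreams extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "comment-metrics")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  // "response-comments" is a hypothetical topic name. Collapse all records onto a
  // single key, then count them in one-minute tumbling windows: requestsPerMinute.
  // The resulting windowed KTable could be exposed via an interactive query
  // or written out to a metrics topic.
  builder
    .stream[String, String]("response-comments")
    .groupBy((_: String, _: String) => "all")
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .count()
  // Per-label means could be added by parsing each JSON payload and
  // aggregating per-label sums and counts the same way.

  new KafkaStreams(builder.build(), props).start()
}
```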