Current Manifestation: u/AutoShadow0133
A Reddit bot powered by a text classifier for determining if text is about the videogame Rust or the programming language Rust
At a high level, the bot listens to the stream of new text posts on the r/rust subreddit. When it sees a new post, it runs a text classifier that has been trained to distinguish text about the videogame from text about the programming language. If it believes the title and body are about the videogame (based on a configurable prediction threshold; see the analysis for more information), it leaves a comment stating this conclusion along with several popular Rust game subreddits that might better fit the post.
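The decision step described above can be sketched as follows. This is only an illustration, not the bot's actual code: the `Post` type, the `game_probability` callable, the comment wording, the `>=` comparison, and the subreddit names in `GAME_SUBREDDITS` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

# Placeholder suggestions; the list the real bot posts is not given here.
GAME_SUBREDDITS = ["r/playrust", "r/playrustlfg"]


@dataclass
class Post:
    """Hypothetical stand-in for a new text post on r/rust."""
    title: str
    body: str


def build_comment(probability: float) -> str:
    """Format the reply text (wording invented for this sketch)."""
    suggestions = "\n".join(f"- {name}" for name in GAME_SUBREDDITS)
    return (
        "This post looks like it is about the videogame Rust "
        f"(confidence {probability:.0%}). These subreddits may be a better fit:\n"
        f"{suggestions}"
    )


def comments_to_leave(
    posts: Iterable[Post],
    game_probability: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[Tuple[Post, str]]:
    """Yield (post, comment) for each post classified as game-related."""
    for post in posts:
        # The classifier sees the title and body together.
        text = f"{post.title}\n{post.body}"
        probability = game_probability(text)
        # Assumption: a post is flagged when it meets or exceeds the threshold.
        if probability >= threshold:
            yield post, build_comment(probability)
```

Keeping the classifier behind a plain callable means the decision logic can be exercised without a Reddit connection or a trained model.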
Note: This has only been tested on Linux and the following assumes a Linux environment
This repo is essentially assumed to be installed under /opt/rust_text_classifier, with only one required change:
- A config file called config.json (sample_config.json acts as a template)
If desired, a different posts_corpus can be used. Several files will be automatically generated:
- posts.db, which simply keeps track of classifications on posts
- text_classifier.pkl, which is a pickled form of the classifier to avoid having to retrain each time the program is launched
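The caching idea behind text_classifier.pkl can be sketched roughly like this. The `load_or_train` helper and its `train` argument are hypothetical names for illustration; the real bot may structure this differently.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_train(cache_path: Path, train: Callable[[], Any]) -> Any:
    """Return the cached classifier if present, otherwise train and cache it."""
    if cache_path.exists():
        # Reuse the pickled classifier from a previous launch.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # No cache yet: train once, then persist for later launches.
    model = train()
    with cache_path.open("wb") as f:
        pickle.dump(model, f)
    return model
```

Since unpickling can execute arbitrary code, a cache file like this should only be loaded if it was generated locally. Deleting the file forces a retrain on the next launch.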
This project uses poetry for handling dependencies and virtual environments. With poetry installed, getting all the dependencies set up and then running the bot is as simple as running the following from the project dir:
poetry install --no-dev # Only need to do this once
poetry run ./bot # Uses the virtual environment created above
Alternatively, you can use your system's package manager, or you can manually use pip to install the dependencies (I don't think it supports reading from pyproject.toml yet, but I could be wrong).
Note: The classifier always uses an equal number of posts from each category, so even though there are more posts about the game available, it will only select enough to match the number of posts about the lang.
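A minimal sketch of that balancing rule, assuming the posts are already loaded into two lists. The `balance` function is a hypothetical name, and simply keeping the first entries of each list is an assumption made for the example.

```python
from typing import List, Tuple


def balance(lang_posts: List[str], game_posts: List[str]) -> Tuple[List[str], List[str]]:
    """Trim both categories to the size of the smaller one."""
    n = min(len(lang_posts), len(game_posts))
    # Keep the first n posts from each category so neither dominates training.
    return lang_posts[:n], game_posts[:n]
```

Training on equal counts keeps the classifier from leaning toward whichever category happens to have more examples.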
There is an analysis script for some basic (read: hacky) analysis. This test simply trains a classifier on 80% of the posts found in posts_corpus and then tests the accuracy using the remaining 20% of the posts. The test is run 100 times, with the values for each category averaged together and reported. This is repeated using 50%, 60%, and 70% as the threshold.
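How the correct, incorrect, and ignored percentages could be tallied for one run is sketched below. It assumes each test prediction is available as a (true label, predicted label, confidence) tuple and that predictions below the threshold count as ignored; both the `tally` function and that data shape are hypothetical, not taken from the analysis script.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (true_label, predicted_label, confidence in the predicted label)
Prediction = Tuple[str, str, float]


def tally(predictions: Iterable[Prediction], threshold: float) -> Dict[str, Dict[str, float]]:
    """Return per-category percentages of correct, incorrect, and ignored posts."""
    counts: Dict[str, Counter] = {}
    for true_label, predicted, confidence in predictions:
        bucket = counts.setdefault(true_label, Counter())
        if confidence < threshold:
            # Assumption: low-confidence predictions are ignored.
            bucket["ignored"] += 1
        elif predicted == true_label:
            bucket["correct"] += 1
        else:
            bucket["incorrect"] += 1
    return {
        label: {
            key: 100.0 * bucket[key] / sum(bucket.values())
            for key in ("correct", "incorrect", "ignored")
        }
        for label, bucket in counts.items()
    }
```

Raising the threshold moves borderline predictions from the correct and incorrect buckets into the ignored bucket, which trades coverage for fewer mistaken comments.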
The current corpus I'm using is a set of 400 removed r/rust posts along with 400 r/rust posts about the lang. (Big thanks to the moderators for helping me get access to relevant removed posts.)
Category | Threshold | Correct | Incorrect | Ignored |
---|---|---|---|---|
Lang | 50% | 97.20% | 2.80% | 0.00% |
Game | 50% | 95.07% | 4.93% | 0.00% |
Lang | 60% | 91.24% | 0.89% | 7.87% |
Game | 60% | 85.89% | 1.91% | 12.20% |
Lang | 70% | 75.38% | 0.26% | 24.37% |
Game | 70% | 64.96% | 1.01% | 34.02% |
The posts corpus is generated by running a simple script that fetches all new text posts from r/rust, along with a number of Rust Game subreddits, every 5 minutes. From there, the posts from r/rust are manually classified into r_rust_correct and r_rust_incorrect. This is done to best match the information that the bot will attempt to classify, although using older data would likely also work well.
The classifier also prefers posts from r_rust_incorrect over all other Rust Game posts, since they best match the data it's attempting to classify, so it will use posts from there before any others. (The loaded posts are always shuffled, though, to guarantee better randomness for testing accuracy.)
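The preference order described above might look roughly like the following. The `select_game_posts` function and its parameters are hypothetical; only the "r_rust_incorrect first, then shuffle" behavior comes from the description.

```python
import random
from typing import List


def select_game_posts(
    preferred: List[str],
    fallback: List[str],
    needed: int,
    rng: random.Random,
) -> List[str]:
    """Pick `needed` game posts, drawing from `preferred` before `fallback`."""
    # Posts from r_rust_incorrect (preferred) are taken first.
    chosen = list(preferred[:needed])
    # Any remaining slots are filled from the other Rust Game posts.
    chosen += fallback[: needed - len(chosen)]
    # Shuffle so the train/test split does not depend on load order.
    rng.shuffle(chosen)
    return chosen
```

Passing in a `random.Random` instance rather than using the module-level functions makes a run reproducible when a fixed seed is supplied.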
All content in this repo, excluding the content within the posts_corpus directory, is licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.