Relevant for the submission is the "two-step transformer" of approach 3. Instructions for running it in Docker can be found below.
Approach 1: majority vote
steps:
- Detect multipart spoilers with a majority vote over selected text features of the attributes postText and targetParagraphs.
- Run the transformer baseline on the instances that were not classified as multipart.
goals:
- Reach a similar accuracy for multipart spoilers as the transformer baseline.
- Improve efficiency, since less data needs to be predicted by the transformer baseline.
code:
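The detection step above can be sketched as follows. This is a minimal illustration: the three features shown here are assumptions for the example, not the project's actual feature set.

```python
# Minimal sketch of multipart detection by majority vote over simple
# text features of postText and targetParagraphs (features are assumed
# for illustration, not taken from the project).
import re

def feature_votes(post_text: str, target_paragraphs: list[str]) -> list[bool]:
    """Each entry votes True if the feature suggests a multipart spoiler."""
    return [
        bool(re.search(r"\b\d+\b", post_text)),   # post mentions a number ("7 reasons ...")
        len(target_paragraphs) > 5,               # article has many paragraphs
        sum(p.strip()[:1].isdigit() for p in target_paragraphs) >= 3,  # enumerated paragraphs
    ]

def is_multipart(post_text: str, target_paragraphs: list[str]) -> bool:
    votes = feature_votes(post_text, target_paragraphs)
    return sum(votes) > len(votes) / 2            # simple majority decides

print(is_multipart("7 reasons you should nap more",
                   [f"{i}. reason" for i in range(1, 8)]))  # -> True
```

Only the instances voted non-multipart would then be forwarded to the transformer baseline.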
Approach 2: text simplification
steps:
- Simplify the input text before predicting the spoiler type,
- either by only replacing difficult words (MILES),
- or by replacing words and reformulating sentences (MUSS).
- Predict the spoiler type using the baseline transformer.
goals:
- Better accuracy due to a smaller vocabulary and reduced text complexity.
code:
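The word-replacement variant can be sketched as below. This is an illustrative stand-in only: it does not use the MILES/MUSS libraries, and the synonym dictionary is an assumed example, not project data.

```python
# Illustrative sketch of lexical simplification (not MILES/MUSS): replace
# "difficult" words with simpler synonyms before passing the text to the
# transformer baseline. The dictionary entries are assumed examples.
import re

SIMPLE_SYNONYMS = {
    "utilize": "use",
    "commence": "start",
    "approximately": "about",
}

def simplify(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        simple = SIMPLE_SYNONYMS.get(word.lower(), word)
        # keep the casing of the first letter
        return simple.capitalize() if word[0].isupper() else simple
    # replace whole words only, case-insensitively
    pattern = r"\b(" + "|".join(SIMPLE_SYNONYMS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(simplify("Commence reading: it takes approximately five minutes."))
# -> "Start reading: it takes about five minutes."
```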
results:
- Replacing only difficult words did not improve the accuracy.
- Due to problems with the libraries, the approach was discarded.
Approach 3: two-step transformer
steps:
- Detect multipart spoilers with a Gradient Boosting classification model trained on text features of the attributes postText and targetParagraphs.
- Run the transformer baseline on the instances that were not classified as multipart.
goals:
- Reach a similar accuracy for multipart spoilers as the transformer baseline.
- Improve efficiency, since training and the forward pass of the Gradient Boosting model take less time than the transformer baseline.
code:
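The detection step can be sketched with scikit-learn's GradientBoostingClassifier. The feature extraction and the toy training data below are illustrative assumptions, not the project's actual features or dataset.

```python
# Minimal sketch: Gradient Boosting over hand-crafted numeric features
# (features and training data are assumed for illustration only).
from sklearn.ensemble import GradientBoostingClassifier

def extract_features(post_text, target_paragraphs):
    """Hypothetical numeric features over postText and targetParagraphs."""
    return [
        len(post_text),                          # length of the post
        sum(c.isdigit() for c in post_text),     # digits in the post
        len(target_paragraphs),                  # number of paragraphs
        sum(len(p) for p in target_paragraphs),  # total article length
    ]

# toy training data: 1 = multipart spoiler, 0 = not multipart
X = [extract_features("7 reasons to nap", ["1. ...", "2. ...", "3. ..."]),
     extract_features("You won't believe this", ["A single paragraph."])]
y = [1, 0]

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.predict([extract_features("5 tricks", ["1. a", "2. b", "3. c"])]))
```

Instances predicted as non-multipart would then be passed on to the transformer baseline, so the transformer only processes the remaining data.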
results:
- Faster than the baseline: multipart spoilers are classified within seconds.
- Likely similar or better accuracy, depending on the dataset.
Running in Docker:
(Optional) Build the image:
cd statistical-model-multi-classification
docker build -t ghcr.io/tudbs-clickbait/team-1-task-1:two-step .
(Optional) Log in to the GitHub container registry, then push the image:
docker push ghcr.io/tudbs-clickbait/team-1-task-1:two-step
Run:
docker run -v ${PWD}/data:/data ghcr.io/tudbs-clickbait/team-1-task-1:two-step --input=/data/validation_short.jsonl --output=/data/out.jsonl