Stihi.ru dataset for Sberbank КлассикAI competition
Parsing was done with 'parsing.ipynb'. The proposed method is not the best one, but given the enormous number of poems on Stihi.ru it is good enough. Parsing ran for 40 hours in 8 similar Jupyter notebooks (i.e. 8 pools). About 700k poems were parsed in total; after data cleaning, about 160k poems were kept.
Data cleaning:
- Delete artefacts of parsing;
- Delete duplicates;
- Delete all punctuation and empty lines;
- Delete short poems (fewer than 4 lines in a poem, fewer than 4 words in a line).
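
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the dataset. Several details are assumptions: "artefacts of parsing" are taken to be leftover HTML tags and entities, duplicates are detected after normalisation rather than before, and a poem is dropped if any of its lines has fewer than 4 words. The names `clean_poem`, `is_long_enough` and `clean_corpus` are hypothetical.

```python
import html
import re

MIN_LINES = 4   # minimum number of lines per poem (from the list above)
MIN_WORDS = 4   # minimum number of words per line (from the list above)

TAG_RE = re.compile(r"<[^>]+>")      # assumed artefact: leftover HTML tags
PUNCT_RE = re.compile(r"[^\w\s]|_")  # anything that is not a letter, digit or space


def clean_poem(text):
    """Return the poem as a list of cleaned, non-empty lines."""
    # Assumption: parsing artefacts are HTML tags and entities.
    text = html.unescape(TAG_RE.sub(" ", text))
    lines = []
    for line in text.splitlines():
        line = PUNCT_RE.sub(" ", line)   # delete punctuation
        line = " ".join(line.split())    # collapse runs of whitespace
        if line:                         # delete empty lines
            lines.append(line)
    return lines


def is_long_enough(lines):
    """Assumed reading: at least 4 lines, and every line has at least 4 words."""
    return len(lines) >= MIN_LINES and all(
        len(line.split()) >= MIN_WORDS for line in lines
    )


def clean_corpus(poems):
    """Clean every poem, then drop duplicates and short poems."""
    seen = set()
    kept = []
    for raw in poems:
        lines = clean_poem(raw)
        key = "\n".join(lines).lower()   # case-insensitive duplicate key
        if key in seen or not is_long_enough(lines):
            continue
        seen.add(key)
        kept.append("\n".join(lines))
    return kept
```

Calling `clean_corpus(list_of_raw_poems)` returns a list of cleaned poems, one string per poem with lines separated by `\n`. Punctuation is replaced with a space rather than removed outright, so hyphenated words are split into two words; this is a design choice of the sketch and may differ from the original cleaning.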