English reviews and advertisements in reviews
Hi there,
I used your dataset for a topic modeling assignment in a class I teach. Thanks for making it available! My students discovered the two issues named in the title. If you release another version, you could take them into account.
The advertisements contain some fixed phrases, but most of their text differs each time. There are a lot of them in the dataset, so they add a lot of noise. The English reviews are less frequent and should be easy to filter out with a language identification model, such as the one from fastText.
Example of advertisement:
jannie52-over-de-man-die-de-draak-doodde
Waarschijnlijk heb ik een teer zieltje. Ik heb me geërgerd aan het taalgebruik. Ik begrijp niet wat de toegevoegde waarde is van de seksistische en discriminerende taal. Het plot vond ik niet sterk. Eigenlijk een vervelend boek om te lezen. 'Het is autobiografisch, helemaal waargebeurd maar toch zie je elementen van fictie in de stijl en vooral de opbouw, dat maakt het des te sterker.' - Win boeken voor je hele leesclub! We gaan Wil van Jeroen Olyslaegers luisteren via de gratis Hebban Luisterboeken-app. Doe je mee? 'Wat beweegt de jonge zwarte deelpachter Tucker Caliban om huis, vee en akkers te vernietigen en met vrouw en kind naar het Noorden te vertrekken?'- Win Uit de maat voor je hele leesgroep!
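For anyone still working with a version that contains these blocks, the fixed phrases could be used to flag affected reviews. This is only a rough sketch: the two patterns are assumptions taken from the single example above, and `contains_ad` is a hypothetical helper name, not something from the dataset's code. It flags reviews instead of stripping text, because the start of an advertisement block is not marked.

```python
import re

# Assumed marker patterns, derived only from the one example above.
# A real list would have to be built by inspecting more of the data.
AD_PATTERNS = [
    # e.g. "Win boeken voor je hele leesclub!" / "Win Uit de maat voor je hele leesgroep!"
    re.compile(r"\bWin\b.{0,80}?\bvoor je hele lees(?:club|groep)\b", re.IGNORECASE),
    # e.g. "... via de gratis Hebban Luisterboeken-app."
    re.compile(r"Hebban Luisterboeken-app", re.IGNORECASE),
]


def contains_ad(text: str) -> bool:
    """Return True if the review text matches any assumed advertisement marker."""
    return any(pattern.search(text) for pattern in AD_PATTERNS)


if __name__ == "__main__":
    sample = "Het plot vond ik niet sterk. Win boeken voor je hele leesclub!"
    print(contains_ad(sample))
```

Flagged reviews can then be dropped or inspected by hand, depending on how much data loss is acceptable.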
Example of English review:
anca-over-house-of-leaves
WARNING: REVIEW CONTAINS SPOILERS ABOUT THE ENDING! So, technically I’m not completely finished with this book. I still have the exhibits, annexes and appendixes to go, but I’m finished with the main story of this book and to be honest, just want to be done, get this book of my currently reading shelve [...]
I've removed the advertisements in v2.0 (and increased the dataset size to 118,516). This was possible by fixing a bug in the scraper, which wrongly included a few of the advertisement blocks.
Removing the English reviews as well would take a bit more work, because a review's language isn't properly described on Hebban. Some guesswork with fastText could indeed work, but I've decided not to fix it for now, though I'm open to pull requests.
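As a starting point for such a pull request, the fastText guesswork could look roughly like the sketch below. It assumes the `fasttext` Python package and a locally downloaded copy of the pretrained language identification model (commonly distributed as `lid.176.ftz`). The function names and the 0.8 threshold are hypothetical choices, not part of the dataset's code. Only reviews that are confidently predicted as English are dropped, so short or ambiguous Dutch reviews are kept.

```python
from typing import Callable, Iterable, List, Tuple

# A predictor maps a text to (language code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def make_fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText language identification model (assumed: lid.176.ftz)."""
    import fasttext  # imported lazily so drop_language() works without it

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText's predict() rejects newlines, so flatten the text first.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        if len(labels) == 0:  # e.g. empty input
            return "", 0.0
        # Labels look like "__label__nl"; keep only the language code.
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def drop_language(
    reviews: Iterable[str],
    predict: Predictor,
    lang: str = "en",
    min_confidence: float = 0.8,  # hypothetical threshold, tune on real data
) -> List[str]:
    """Keep every review except those confidently predicted to be `lang`."""
    kept = []
    for text in reviews:
        predicted, confidence = predict(text)
        if predicted == lang and confidence >= min_confidence:
            continue  # confidently the unwanted language: drop it
        kept.append(text)
    return kept
```

Usage would be along the lines of `drop_language(reviews, make_fasttext_predictor("lid.176.ftz"))`, where the model path is wherever the file was saved. Passing the predictor in as a function keeps the filter testable without the model file.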