Article count confusion when using TopMine
Closed this issue · 1 comment
taalbrecht commented
When the articles are written to file for TopMine, the article count changes. This could break document alignment or multiword token identification, because tokens may not be identified properly if one source article is accidentally split into several documents.
Actions to solve the problem:
- Fix single articles being written to multiple lines by writeLines
- Ensure that only multiword tokens are taken from TopMine, not document contents (should already be in place)
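
For the first action, one possible approach is to collapse embedded line breaks before calling `writeLines`, since `writeLines` emits any newline inside an element as-is and so one article can end up on several lines. This is a minimal sketch, not the code from the repository: `write_one_doc_per_line` and the sample `docs` vector are hypothetical names used only for illustration.

```r
# Hypothetical helper: collapse embedded line breaks so that each
# article occupies exactly one line in the file handed to TopMine.
write_one_doc_per_line <- function(docs, path) {
  # Replace every run of carriage returns / newlines with one space.
  flat <- gsub("[\r\n]+", " ", docs)
  writeLines(flat, path)
  invisible(flat)
}

# Illustrative input: the first article contains a line break.
docs <- c("first article\nwith a line break", "second article")
path <- tempfile(fileext = ".txt")
write_one_doc_per_line(docs, path)

# The line count of the file should now match the article count.
stopifnot(length(readLines(path)) == length(docs))
```

Comparing `length(readLines(path))` with `length(docs)` after writing is a cheap check that the article count has not drifted.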
taalbrecht commented
Fixed in commit 6e6073a by pulling only the multiword vocab results from TopMine.
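
As a rough illustration of keeping only multiword vocabulary entries, the sketch below filters a character vector down to the phrases that contain two or more words. It is not the code from the commit: the `vocab` vector is made-up sample data, and the assumption that the TopMine vocabulary arrives as one space-separated phrase per element may not match the actual output format.

```r
# Hypothetical sample vocabulary; the real TopMine output may look different.
vocab <- c("topic", "topic model", "latent dirichlet allocation", "corpus")

# Keep entries that have whitespace between two non-space characters,
# i.e. phrases made of two or more words.
multiword <- vocab[grepl("\\S\\s+\\S", vocab)]
```

Because only the phrase list is taken from TopMine and the document contents are kept from the original source, a change in line count in the intermediate file no longer affects document alignment.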