IlyaSemenov/wikipedia-word-frequency

Recommendation to edit for calculating context diversity?


Hi there, I'm interested in modifying the script to calculate the number of different documents in which each word appears (e.g., how many Wikipedia articles the word "DOG" appears in). Have you considered this, or do you have a recommended approach to modifying the script for this purpose? I wanted to check in before attempting the changes myself. Appreciate your consideration.

I have not considered this, and honestly I don't see much value in these numbers (other than that it's curious to see). However, I realize it could be valuable to some, and it's a good fit for this mini project.

I would recommend changing the file format to tab-separated values, now with 3 columns: word, number of uses, number of different articles. (I am not sure why I didn't do that originally.)
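
A minimal sketch of how the two counts could be collected and written out as three tab-separated columns. This is not the script's actual code: the tokenizer `WORD_RE`, the helper names `count_words` and `format_tsv`, and the in-memory list of article texts are all assumptions made for illustration. The key step is deduplicating the words of each article with `set()` before updating the per-article counter.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the script's own rule):
# runs of Unicode letters, with digits and underscores excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(articles):
    """Return (uses, docs): total uses per word and number of articles containing it."""
    uses = Counter()
    docs = Counter()
    for text in articles:
        words = WORD_RE.findall(text.lower())
        uses.update(words)        # every occurrence counts
        docs.update(set(words))   # each article counts at most once per word
    return uses, docs


def format_tsv(uses, docs):
    """Format as 'word<TAB>uses<TAB>articles', most used words first."""
    rows = sorted(uses.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{word}\t{n}\t{docs[word]}" for word, n in rows)


# Toy input standing in for article texts extracted from a dump.
articles = ["The dog chased the other dog.", "A dog barked."]
uses, docs = count_words(articles)
print(format_tsv(uses, docs))
```

With the toy input, "dog" is used three times but appears in only two articles, so the second and third columns differ. For a full Wikipedia dump the same idea should work article by article while streaming, since only the two counters need to stay in memory.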