Basic Token Frequency
Closed this issue · 4 comments
Could we add some basic token frequency reporting after tokens are generated? Most popular words, etc. If it could be broken down by date, that would perhaps also be interesting (most popular words in year1 vs year2). That could flow into the word cloud nicely, I think.
The tokenization we do now is per row, and each row includes a crawl_date column. So, are you thinking of adding another column with the most popular words, say 10-20 per row with stop words removed? Or a graph of some sort that combines word distribution over time? If so, @lintool, what would work well for that graph-wise?
So, are you thinking adding another column with most popular words, say 10-20 per row with stop words removed
This is what I was thinking, but
a graph of some sort that combines word distribution over time
might be more effective? Curious if @lintool has any suggestions...
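For what it's worth, a minimal pandas sketch of the "most popular words per year" idea might look like the following. The column names (`crawl_date`, `tokens`), the date format, and the stop word list are all assumptions, not what the actual tokenization output necessarily uses:

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the tokenized output: one row per record, with a
# crawl_date string and a list of tokens (column names are assumptions).
df = pd.DataFrame({
    "crawl_date": ["20190101", "20190615", "20200301"],
    "tokens": [
        ["the", "web", "archive"],
        ["the", "archive", "tools"],
        ["web", "tools", "analysis"],
    ],
})

# A tiny illustrative stop word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "and"}

# Assumes crawl_date starts with a four-digit year.
df["year"] = df["crawl_date"].str[:4]

def top_words(token_lists, n=10):
    """Count tokens across rows, dropping stop words."""
    counts = Counter(
        t for tokens in token_lists for t in tokens if t not in STOP_WORDS
    )
    return counts.most_common(n)

# Most popular words per year (the year1 vs year2 comparison).
by_year = df.groupby("year")["tokens"].apply(top_words)
print(by_year)
```

The per-year `(word, count)` lists could then feed either a word cloud per year or a simple frequency-over-time plot.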
How's this now? Does it cover the spirit of the issue?
https://github.com/archivesunleashed/notebooks/blob/master/parquet_text_analyis.ipynb
Looks good!