archivesunleashed/notebooks

Basic Token Frequency

Closed this issue · 4 comments

Could we put some basic token frequency after tokens are generated? Most popular words, etc. If it could be broken down by date, that might also be interesting (most popular words in year 1 vs. year 2). That could flow nicely into the word cloud, I think.

The tokenization we do now is per row, and each row includes a crawl_date column. So, are you thinking of adding another column with the most popular words, say 10-20 per row with stop words removed? Or a graph of some sort that shows word distribution over time? If so, @lintool, what would work well for that graph-wise?
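The per-row option might look something like this; a minimal sketch, assuming a pandas DataFrame with crawl_date and tokens columns (the column names, stop word list, and top_tokens helper are all placeholders, not the notebook's actual API):

```python
from collections import Counter

import pandas as pd

# Placeholder stop word list; a real notebook would use a fuller one (e.g. NLTK's).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def top_tokens(tokens, n=10):
    """Return the n most frequent tokens in one row, ignoring stop words."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

# Toy data standing in for the tokenized output.
df = pd.DataFrame({
    "crawl_date": ["20190101", "20200101"],
    "tokens": [["The", "web", "archive", "web"], ["archive", "data", "the", "data"]],
})

# Add a column with the top 10 tokens per row.
df["top_tokens"] = df["tokens"].apply(top_tokens)
```

Each row then carries its own ranked word list alongside crawl_date, which could feed a per-date word cloud directly.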

So, are you thinking adding another column with most popular words, say 10-20 per row with stop words removed?

This is what I was thinking, but

a graph of some sort that combines word distribution over time

might be more effective? Curious if @lintool has any suggestions...
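For the over-time option, one possible shape of the aggregation; a sketch assuming the same hypothetical crawl_date (YYYYMMDD strings) and tokens columns as above, with pandas doing the grouping:

```python
from collections import Counter

import pandas as pd

# Placeholder stop word list, as before.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

# Toy data standing in for the tokenized output.
df = pd.DataFrame({
    "crawl_date": ["20190101", "20190215", "20200101"],
    "tokens": [["web", "archive"], ["web", "data"], ["archive", "archive"]],
})

# Derive the year from the crawl date.
df["year"] = df["crawl_date"].str[:4]

# One row per (year, token) with its count, stop words dropped.
yearly = (
    df.explode("tokens")
      .loc[lambda d: ~d["tokens"].str.lower().isin(STOP_WORDS)]
      .groupby(["year", "tokens"])
      .size()
      .rename("count")
      .reset_index()
)

# Top 10 words per year -- the table a bar chart or per-year
# word cloud would be drawn from.
top_per_year = (
    yearly.sort_values("count", ascending=False)
          .groupby("year")
          .head(10)
)
```

From top_per_year, a grouped bar chart or small multiples (one panel per year) would show the year 1 vs. year 2 comparison directly.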

Looks good!