Basic Token Frequency
Closed this issue · 4 comments
Could we add some basic token frequency reporting after tokens are generated? Most popular words, etc. If it could be broken down by date, that would perhaps also be interesting (most popular words in year1 vs year2). That could flow into the word cloud nicely, I think.
The tokenization we do now is per row, and each row includes a crawl_date column. So, are you thinking of adding another column with the most popular words, say 10-20 per row with stop words removed? Or a graph of some sort that combines word distribution over time? If so, @lintool, what would work well for that graph-wise?
So, are you thinking adding another column with most popular words, say 10-20 per row with stop words removed
This is what I was thinking, but
a graph of some sort that combines word distribution over time
might be more effective? Curious if @lintool has any suggestions...
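For what it's worth, a minimal pandas sketch of the "most popular words per year" idea might look like the following. The column names (`crawl_date`, `tokens`), the date format, and the stop word list are all assumptions, not what the actual tokenization output necessarily uses:

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the tokenized output: one row per record, with a
# crawl_date string and a list of tokens (column names are assumptions).
df = pd.DataFrame({
    "crawl_date": ["20190101", "20190615", "20200301"],
    "tokens": [
        ["the", "web", "archive"],
        ["the", "archive", "tools"],
        ["web", "tools", "analysis"],
    ],
})

# A tiny illustrative stop word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "and"}

# Assumes crawl_date starts with a four-digit year.
df["year"] = df["crawl_date"].str[:4]

def top_words(token_lists, n=10):
    """Count tokens across rows, dropping stop words."""
    counts = Counter(
        t for tokens in token_lists for t in tokens if t not in STOP_WORDS
    )
    return counts.most_common(n)

# Most popular words per year (the year1 vs year2 comparison).
by_year = df.groupby("year")["tokens"].apply(top_words)
print(by_year)
```

The per-year `(word, count)` lists could then feed either a word cloud per year or a simple frequency-over-time plot.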
How's this now? Does it cover the spirit of the issue?
https://github.com/archivesunleashed/notebooks/blob/master/parquet_text_analyis.ipynb
Looks good!