authors with N+ reviews

Question

xxxvincxxx opened this issue 7 years ago · 1 comments

Hello!
I am enjoying your code and trying to translate it into R.
I don't really understand the logic you use when you plot authors with N+ reviews vs. number of reviews.
Specifically:

author_groups = all_reviews.groupby('author')
author_counts = author_groups.size().reset_index()

x = range(120)
y = [sum(author_counts[0] > i) for i in x]

Why do you choose the range 120?
For each element in the range, why do you use sum(author_counts[0] > i)?

Sorry for the stupid question, but I am curious about your analysis and I want to understand it in-depth.
Cheers!

Answer 1 · 2018-04-04T13:47:09.000Z

Thanks for your interest!

The section you identified computes the number of authors who have written more than N reviews. The value 120 is arbitrary: after an initial look at the data, I found that almost every author had written fewer than 120 reviews, but a few had written many more. I didn't care to visualize the length of that tail, so I used 120 as the cutoff.

The answer to the next question involves a deeper dive into the internals of pandas. Again, I am trying to compute the number of authors with more than N reviews. author_counts is a dataframe with one column containing the author names (called "author") and another containing the total number of reviews for that author. I built it with a groupby followed by size() and reset_index() and did not take care to name the count column, so pandas just named it 0.
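To make the column naming concrete, here is a minimal sketch on a toy table. The contents of all_reviews below are invented for illustration and are not the real dataset, and the column name "n_reviews" is just an assumed label.

```python
import pandas as pd

# Toy stand-in for the real review data (invented values, for illustration only).
all_reviews = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cat", "ann", "bob"],
    "score": [7.1, 8.0, 6.5, 9.2, 5.0, 7.7],
})

# size() returns an unnamed Series of row counts, indexed by author.
sizes = all_reviews.groupby("author").size()

# reset_index() turns it into a DataFrame; the unnamed values column gets the label 0.
author_counts = sizes.reset_index()
print(list(author_counts.columns))  # ['author', 0]

# Passing a name avoids the opaque 0 label ("n_reviews" is a hypothetical name).
named_counts = sizes.reset_index(name="n_reviews")
print(list(named_counts.columns))  # ['author', 'n_reviews']
```

So author_counts[0] is not a positional lookup of the first row; it selects the column whose label happens to be the integer 0.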

author_counts[0] > i returns a boolean vector with a True value for every author with more than i reviews. Summing that (sum(author_counts[0] > i)) counts each True as 1, so the result is the number of authors with more than i reviews, which completes the count. Does that make sense?
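A small sketch of the comparison-then-sum step may help. The counts below are made-up numbers standing in for author_counts[0], and the threshold range is shortened from 120 to 4 so the result is easy to check by hand.

```python
import pandas as pd

# Invented review counts per author, standing in for author_counts[0].
counts = pd.Series([1, 1, 2, 3, 5, 8])

i = 2
mask = counts > i          # elementwise comparison gives a boolean Series
print(mask.tolist())       # [False, False, False, True, True, True]
print(int(mask.sum()))     # 3 authors have more than 2 reviews

# The same curve as in the question, over a small range of thresholds.
x = range(4)
y = [int((counts > n).sum()) for n in x]
print(y)                   # [6, 4, 3, 2]
```

The idea should carry over to R directly, since summing a logical vector there also counts the TRUE values, as in sum(counts > i).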