authors with N+ reviews
xxxvincxxx opened this issue · 1 comments
hello!
I am enjoying your code and trying to translate it into R.
I don't really understand the logic you use when you plot authors with N+ reviews vs. number of reviews.
Specifically:
author_groups = all_reviews.groupby('author')
author_counts = author_groups.size().reset_index()
x = range(120)
y = [sum(author_counts[0] > i) for i in x]
- why do you choose the range 120?
- for each element in the range, why do you use sum(author_counts[0])?
Sorry for the stupid question, but I am curious about your analysis and I want to understand it in-depth.
Cheers!
Thanks for your interest!
The section you identified computes the number of authors who have written more than N reviews. The value 120 is arbitrary: after an initial look at the data, I found that almost every author had written fewer than 120 reviews, but a few had written many, many more. I didn't care to visualize the length of that tail, so 120 was the cutoff.
The answer to the next question involves a deeper dive into the internals of pandas. Again, I am trying to compute the number of authors with more than N reviews. author_counts is computed as a dataframe with one column containing the author names (called "author") and another containing the total number of reviews for that author. I did it in a groupby-aggregate and did not take care to name that second column, so pandas just named it 0.
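To make the column naming concrete, here is a minimal sketch on made-up data. The toy all_reviews frame, its "score" column, and the name "n_reviews" are assumptions for illustration only, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for the real all_reviews dataframe.
all_reviews = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cat", "ann", "bob"],
    "score": [7.0, 8.1, 6.5, 9.0, 5.5, 7.2],
})

author_groups = all_reviews.groupby("author")

# size() returns an unnamed Series indexed by author; reset_index() turns
# the index into an "author" column and labels the unnamed values column 0.
author_counts = author_groups.size().reset_index()
print(list(author_counts.columns))  # ['author', 0]

# Optional: give the count column an explicit (hypothetical) name instead.
named_counts = author_groups.size().reset_index(name="n_reviews")
print(list(named_counts.columns))  # ['author', 'n_reviews']
```

With a named column, the later expression would read named_counts["n_reviews"] > i, which may be easier to carry over to R.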
author_counts[0] > i returns a boolean vector with a true value for every author with more than i reviews. Summing that (sum(author_counts[0] > i)) returns the total number of authors with true values, thus completing the count. Does that make sense?
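The comparison-then-sum step can be sketched on a handful of made-up counts. The counts values and the short range(6) are assumptions chosen so the result can be checked by hand; the real analysis uses range(120).

```python
import pandas as pd

# Hypothetical number of reviews written by each of six authors.
counts = pd.Series([1, 1, 2, 3, 5, 8])

# Comparing a Series to a scalar gives one boolean per author.
mask = counts > 2
print(mask.tolist())  # [False, False, False, True, True, True]

# True counts as 1 and False as 0, so summing the mask counts the authors.
n_authors = sum(mask)
print(n_authors)  # 3

# Repeating this for every threshold i gives the y values of the plot.
x = range(6)
y = [int((counts > i).sum()) for i in x]
print(y)  # [6, 4, 3, 2, 2, 1]
```

Here (counts > i).sum() is the pandas method form of the same sum; both approaches count the true values.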