ArthurHeitmann/arctic_shift

Thanks so much and quick question about retrieval vs post time

danrthompson opened this issue · 3 comments

Thanks so much for doing this. I've been considering an academic project based on the latest data you've been posting, but after downloading and parsing it I noticed that almost no comments have any upvotes. That confused me until I compared the retrieved-on and created dates: it appears that basically every row is retrieved the same day it is submitted.

Far be it from me to criticize your work, and even just the text data is incredibly useful! But if there's any way that you could do the retrieval X amount of time after the comment is posted, the data would have a lot more information.

I only analyzed the latest month you posted, so:

  1. Is this the case for all the months you posted, or just the latest one?
  2. Is there any way you could do retrievals later so that the upvote data exists, or is that not going to be feasible (if it's not, I understand completely)?

Either way, thanks so much for doing this - it's such an important data set that so many academics find super useful. I was even open to licensing the data I want, but it looks like Reddit's API does not actually allow you to analyze the data you pull. I thought their whole goal was to monetize the data being used by AI companies, but it seems they have yet to create an API product or pricing for that use case. Not that I'm even using AI, but I do want to analyze the data, and their API doesn't seem remotely designed for batch post downloads (and the ToS explicitly prohibit the type of use they're ostensibly trying to charge for). It's so baffling.

Indeed, there's only about a 15s delay between when a thing was created and when it was archived. I started archiving in real time around July 10th, so everything before that has more or less final upvote numbers. The reason for archiving in real time is to preserve the post/comment before it is edited, removed by a mod, deleted, etc.
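For anyone who wants to check this delay on a dump they have downloaded, here is a minimal Python sketch. It assumes each line of the dump is one JSON object and that the two timestamps are stored under the field names `created_utc` and `retrieved_on` as Unix seconds; those names and the sample records are assumptions for illustration, so adjust them to whatever your file actually contains.

```python
import json
from statistics import median


def retrieval_delays(lines):
    """Yield (retrieved_on - created_utc) in seconds for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field names; both assumed to be Unix timestamps in seconds.
        created = record.get("created_utc")
        retrieved = record.get("retrieved_on")
        if created is None or retrieved is None:
            continue
        yield int(float(retrieved)) - int(float(created))


# Made-up sample records: two archived in real time, one archived a month later.
sample = [
    '{"id": "a1", "created_utc": 1690000000, "retrieved_on": 1690000015, "score": 1}',
    '{"id": "a2", "created_utc": 1690000100, "retrieved_on": 1690000112, "score": 1}',
    '{"id": "a3", "created_utc": 1680000000, "retrieved_on": 1682592000, "score": 57}',
]

delays = list(retrieval_delays(sample))
print("median delay (s):", median(delays))
```

A median of a few seconds suggests the rows were archived in real time, so their scores are not final; a delay of days or more suggests the upvote numbers had time to settle.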

A couple of other people have asked about this too, so I'm slowly thinking about how to do a second pass with a 36 hour delay. But that is not going to come soon. Between dev time, fetching from the Reddit API, and post-processing, it will take maybe 2 months.

Small status update: I've now implemented the functionality for archiving posts & comments a second time 36 hours later. That is running now and the upcoming November dumps should have updated upvotes (+ some other fields maybe). But the previous 4 months will take quite a bit of time.
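To illustrate the idea of a second pass (this is only a sketch under assumptions, not the actual arctic_shift implementation), the Python below merges a later snapshot into the real-time one. It keeps the originally archived record, so text captured before an edit or deletion is preserved, and overwrites only the fields expected to change over time. The function name `merge_second_pass`, the `id` key, and the default field list are hypothetical.

```python
def merge_second_pass(first_pass, second_pass, fields=("score", "num_comments")):
    """Return first-pass records with selected fields refreshed from the second pass.

    Records are matched on an assumed unique "id" key. Only the names in
    `fields` are copied over; everything else keeps its first-pass value.
    """
    updates = {rec["id"]: rec for rec in second_pass}
    merged = []
    for rec in first_pass:
        out = dict(rec)  # copy so the input is not mutated
        later = updates.get(rec["id"])
        if later is not None:
            for name in fields:
                if name in later:
                    out[name] = later[name]
        merged.append(out)
    return merged


# Made-up example: the comment was deleted between the two passes.
first = [{"id": "c1", "body": "original text", "score": 1}]
second = [{"id": "c1", "body": "[deleted]", "score": 42}]

merged = merge_second_pass(first, second)
print(merged)
```

In this example the merged record keeps the original body from the first pass while taking the later score.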

Final update (after a long time): API responses now also have updated "score", "num_comments", etc. values. So API responses should now match the data that is released through the dumps.