Watchful1/PushshiftDumps

No data is extracted?

Opened this issue · 9 comments

Hello, I am trying to extract data from a _comments.zst and a _submissions.zst file, but following your script (single_file.py) all I get are the lines below in the cmd logs, and nothing else is output:

D:\historicArchives\PushshiftDumps\scripts>py single_file.py ..._comments.zst
2022-10-05 07:11:33 : 100,000 : 0 : 22,807,050:53%
2022-11-15 22:50:16 : 200,000 : 0 : 34,865,950:81%
2022-12-21 22:43:45 : 300,000 : 0 : 42,810,910:100%
Complete : 330,112 : 0

D:\historicArchives\PushshiftDumps\scripts>py single_file.py ..._submissions.zst
Complete : 41,052 : 0

How can I get the data, and what do I need to do? I am confused. Thanks.

Also, I would like to inform you that the two links you show in the description of the page are no longer available:
....The files can be downloaded from [here](https://files.pushshift.io/reddit/) or torrented from [here](https://academictorrents.com/details/f37bb9c0abe350f0f1cbd4577d0fe413ed07724e).

single_file.py is an example of how to implement your own script; it doesn't do anything other than count the lines. You can try the filter_file.py or to_csv.py scripts depending on what you are trying to do.
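For reference, the core of that example is roughly the loop below. This is just a hedged sketch using the zstandard package, not the actual single_file.py, and the filename is illustrative:

```python
# Rough sketch of the idea behind single_file.py: stream-decompress a .zst dump
# and count its newline-delimited JSON lines without loading the whole file.
import zstandard


def read_lines(path):
    # Yield decoded lines from a zstandard-compressed file, chunk by chunk.
    with open(path, "rb") as handle:
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(handle)
        buffer = b""
        while True:
            chunk = reader.read(2**27)  # ~128 MB of decompressed data per read
            if not chunk:
                break
            lines = (buffer + chunk).split(b"\n")
            buffer = lines[-1]  # partial last line carries over to the next chunk
            for line in lines[:-1]:
                yield line.decode("utf-8")


count = sum(1 for _ in read_lines("subreddit_comments.zst"))
print(f"Complete : {count:,}")
```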

Glad you are here after 2 years!
I would like to:

  1. download the posts and comments of a subreddit (onto my computer), in whichever format is best.
  2. Eventually, make it browsable locally using "github.com/yakabuf/redarc"

What would you suggest I do, step by step?
By the way, I found your page thanks to a post in the DataHoarder subreddit.

The redarc GitHub repo has a setup process listed for loading the data dumps. If you're having problems, I'd recommend asking over there and not here; I've never done that myself.

When you say "loading the data dumps", does that mean the .zst files themselves? If not, then I would need help transforming the zst files into actual data. Can you help with that?
I tried filter_file.py and I am getting this error:

.....- INFO: Output format set to csv
Traceback (most recent call last):
  File "D:\historicArchives\PushshiftDumps\scripts\filter_file.py", line 187, in <module>
    handle = open(output_path, 'w', encoding='UTF-8', newline='')
FileNotFoundError: [Errno 2] No such file or directory: '\\\\MYCLOUD...\\Public\\output.csv'

It could be related to the fact that I am using a cloud drive. Is it possible to tell the script to write the output to a specified path, or even locally where the script is? Thanks.
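For example, would something like this just before the open call make sense (just a guess on my side, using the same output_path variable from the traceback)?

```python
import os

# Create the output directory if it doesn't exist yet, or simply point
# output_path at a local file next to the script, e.g. "output.csv".
directory = os.path.dirname(output_path)
if directory:
    os.makedirs(directory, exist_ok=True)
handle = open(output_path, 'w', encoding='UTF-8', newline='')
```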

I only briefly looked at redarc, but you should just follow their setup guide instead of trying my scripts.

Your repo is mentioned here: https://old.reddit.com/r/DataHoarder/comments/1479c7b/historic_reddit_archives_ongoing_archival_effort/
And they say that I need to use your scripts to EXTRACT the data.
So I need to use your scripts. Am I wrong?

The dump files are compressed using zstandard compression. If you just want to extract them, there are any number of tools that can extract zst files. But there's not much point: unless you are working with a specific small file, they will be too large to open with just about any tool. That's why I wrote scripts like the filter_file one, which decompresses the file line by line, narrows it down to only the lines that match your filters, and then writes them out again. If you want, say, all posts by a certain user in a certain subreddit, you can take the subreddit file and filter it by that user.
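Roughly, reusing a read_lines helper like the one sketched earlier in the thread, that filter-and-rewrite step boils down to something like this (a sketch, not the actual filter_file.py; the field and file names are just examples):

```python
import json


def filter_dump(input_path, output_path, field, value):
    # Keep only the JSON lines whose `field` matches `value` and write them
    # back out as uncompressed newline-delimited JSON.
    with open(output_path, "w", encoding="utf-8") as out_handle:
        for line in read_lines(input_path):  # read_lines from the sketch above
            obj = json.loads(line)
            if str(obj.get(field, "")).lower() == str(value).lower():
                out_handle.write(line + "\n")


# e.g. all comments by one user in one subreddit's comment dump (names illustrative):
# filter_dump("somesubreddit_comments.zst", "filtered.ndjson", "author", "some_user")
```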

You just store the files compressed; for the most part there's no need to decompress them.

Redarc is a completely separate thing; did you try reading the setup guide over there?

Yes, Redarc seems to be able to use the extracted data and make it readable in a web browser, right? (Like some kind of local or hosted website that displays the data you have extracted?)
But in order to use that, I will need to extract the data, right?
The problem is that, for now, all I have is 2 .zst files.
You are saying your script allows extracting specific parts of the data, and that if I want to extract the whole data, I might be better off using other tools?
My goal is not one specific user or one specific topic, but rather a whole subreddit.
Redarc needs the extracted data as input first, before being able to make use of it, right?
Well, in that case I will need the extracted data.

Could you direct me to a method to extract the data before moving on to redarc, then? All I have is .zst files for now.

Read the redarc page. It literally tells you what to do with the zst files.