Watchful1/PushshiftDumps

Unable to understand error message when running combine_folder_multiprocess.py script

Closed this issue · 1 comments

Hi, this is the error message I get when I run the script combine_folder_multiprocess.py. Both the script and the data files (e.g. RC_2020_01.zst) are in the same folder. Is there something I have not configured correctly? It doesn't seem to be loading the compressed files. Thanks.

python combine_folder_multiprocess.py reddit/comments --value wallstreetbets
2023-03-06 13:03:40,653 - INFO: Loading files from: reddit/comments
2023-03-06 13:03:40,653 - INFO: Writing output to working folder
2023-03-06 13:03:40,653 - INFO: Checking field subreddit for value wallstreetbets
2023-03-06 13:03:40,805 - INFO: Existing input file was read, if this is not correct you should delete the pushshift_working folder and run this script again
2023-03-06 13:03:40,805 - INFO: Processed 0 of 0 files with 0.00 of 0.00 gigabytes
Traceback (most recent call last):
  File "D:\py\combine_folder_multiprocess.py", line 390, in <module>
    log.info(f"{total_lines_processed:,}, {total_lines_errored} errored : {(total_bytes_processed / (2**30)):.2f} gb, {(total_bytes_processed / total_bytes) * 100:.0f}% : {files_processed}/{len(input_files)}")
ZeroDivisionError: division by zero

Existing input file was read, if this is not correct you should delete the pushshift_working folder and run this script again

This is the important line. The script saves its progress between runs, so if it crashes you can resume from the middle. That line is telling you it loaded the saved progress file, which may be from an old run whose input files are no longer there. Delete the pushshift_working folder and run the script again, as the message suggests.
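
To illustrate how a saved progress file can go stale, here is a minimal sketch of that kind of resume logic. It is not the script's actual code: the status file name, its JSON layout, and the helper names `save_status` and `load_status` are all hypothetical.

```python
import json
import os
import tempfile


def save_status(status_path, input_files):
    """Write the list of input file paths so a later run can resume (hypothetical format)."""
    with open(status_path, "w", encoding="utf-8") as f:
        json.dump({"files": input_files}, f)


def load_status(status_path):
    """Read the saved file list and split it into paths that still exist and paths that don't."""
    with open(status_path, encoding="utf-8") as f:
        saved = json.load(f)["files"]
    existing = [p for p in saved if os.path.exists(p)]
    missing = [p for p in saved if not os.path.exists(p)]
    return existing, missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        kept = os.path.join(folder, "kept.zst")
        gone = os.path.join(folder, "gone.zst")
        for path in (kept, gone):
            open(path, "wb").close()  # create empty placeholder files

        status = os.path.join(folder, "status.json")
        save_status(status, [kept, gone])

        os.remove(gone)  # simulate an input file disappearing between runs

        existing, missing = load_status(status)
        print(f"{len(existing)} still present, {len(missing)} missing")
```

If a resumed run finds that every saved path is missing, it has nothing to process, which matches the "Processed 0 of 0 files" line in the log above.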

The actual error just means no input files were found: the total size was zero, so the progress percentage calculation (total_bytes_processed / total_bytes) divided by zero.
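
As a sketch of how a progress calculation can avoid this failure, the hypothetical helper below returns 0 when there is nothing to process. The function name `percent_done` is an assumption for illustration and is not taken from the script.

```python
def percent_done(bytes_processed, total_bytes):
    """Return progress as a percentage, or 0.0 when the total is zero."""
    if total_bytes == 0:
        # No input files were found, so there is no meaningful progress to report.
        return 0.0
    return bytes_processed / total_bytes * 100


if __name__ == "__main__":
    print(f"{percent_done(0, 0):.0f}%")       # empty input no longer raises
    print(f"{percent_done(512, 2048):.0f}%")  # a quarter of the bytes processed
```

A guard like this only hides the symptom. With zero input files the run still produces no output, so the stale working folder has to be cleared either way.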

However, if you only want the wallstreetbets subreddit, you can download it from here and save yourself the trouble: https://www.reddit.com/r/pushshift/comments/11ef9if/separate_dump_files_for_the_top_20k_subreddits/