Watchful1/PushshiftDumps

using the --field argument properly

Closed this issue · 2 comments

huycke commented

I'm confused about how to use the '--field' argument on the command line when I run combine_folder_multiprocess.py. I'm wondering whether we have control over the fields that are saved to CSV. For example, could I have it write the author, ID, date, subreddit, votes, awards, etc. to the output .csv file?

combine_folder_multiprocess does not output CSV; it outputs a compressed .zst file. The --field argument is for filtering: it picks which field records are matched on, not which fields are written out.
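
To illustrate what "filtering" on a field means here, below is a minimal sketch. It is not the script's actual code. It assumes each record is one JSON object per line (as in the dumps), and the function name `filter_records` and the sample data are made up for illustration. Note that matching lines are kept whole; no fields are dropped.

```python
import json


def filter_records(lines, field, values):
    """Yield the original lines whose `field` matches one of `values`.

    Hypothetical helper: the comparison is case-insensitive, which is an
    assumption of this sketch, not a statement about the real script.
    """
    wanted = {v.lower() for v in values}
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if str(obj.get(field, "")).lower() in wanted:
            yield line  # the full record is kept, not just the matched field


# Made-up sample records, one JSON object per line
sample = [
    '{"author": "alice", "subreddit": "python", "score": 10}',
    '{"author": "bob", "subreddit": "golang", "score": 3}',
]

kept = list(filter_records(sample, "subreddit", ["Python"]))
for line in kept:
    print(line)
```

Choosing which columns end up in a CSV is a separate step that happens after this kind of filtering.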

huycke commented

Welp that makes sense.

As a short follow-up: when using to_csv.py to turn the .zst files you make with combine_folder_multiprocess.py into CSV, the fields we can pull from are:
dict_keys(['all_awardings', 'allow_live_comments', 'archived', 'author', 'author_created_utc', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'can_gild', 'category', 'content_categories', 'contest_mode', 'created_utc', 'discussion_type', 'distinguished', 'domain', 'edited', 'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'name', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'quarantine', 'removed_by_category', 'retrieved_utc', 'score', 'secure_media', 'secure_media_embed', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'suggested_sort', 'thumbnail', 'thumbnail_height', 'thumbnail_width', 'title', 'top_awarded_type', 'total_awards_received', 'treatment_tags', 'upvote_ratio', 'url', 'whitelist_status', 'wls'])

These are specified at the end of the command that runs to_csv.py.
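
For anyone wondering what picking fields for the CSV amounts to, here is a rough sketch using only the standard library. It is not the to_csv.py implementation. The function name `write_selected_fields` and the sample record are hypothetical, and it works on already-decompressed JSON lines (reading a .zst file directly would need a third-party package such as `zstandard`, which is left out here).

```python
import csv
import io
import json


def write_selected_fields(lines, fields, out):
    """Write one CSV row per JSON line, keeping only the requested fields.

    Hypothetical helper: a field missing from a record becomes an empty cell.
    """
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    for line in lines:
        obj = json.loads(line)
        writer.writerow([obj.get(name, "") for name in fields])


# Made-up sample record using key names from the dict_keys list above
sample = [
    '{"author": "alice", "id": "abc123", "created_utc": 1650000000, '
    '"subreddit": "python", "score": 10, "title": "hello"}',
]

fields = ["author", "id", "created_utc", "subreddit", "score", "total_awards_received"]
buffer = io.StringIO()
write_selected_fields(sample, fields, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The field names passed in have to match keys from the list above; `title` is present in the sample record but is left out of the output because it was not requested.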