The Python toolkit for converting Reddit threads into organized text data. Extract and process Reddit content with ease!

Reddit2Text

reddit2text is the Python library designed to effortlessly transform any Reddit thread into clean, readable text data.

Perfect for prompting an LLM, performing NLP or data analysis, or simply archiving threads for offline use, reddit2text offers a straightforward interface for accessing and converting Reddit content.

Installation

Easy install using pip

pip3 install reddit2text

Quickstart

First, you need to create a Reddit app to get your client_id and client_secret, in order to access the Reddit API.

Here's a visual step-by-step guide I created to do this! Alternatively, you can look at Reddit's API documentation.

Then, replace the client_id, client_secret, and user_agent with your credentials.

The user agent can be anything you like, but we recommend following this convention according to Reddit's guidelines: '<app type>:<app name>:<version> (by <your username>)'
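If you would rather not hard-code credentials, the sketch below builds a user agent in the convention above and reads the client ID and secret from environment variables. The helper names and the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumptions made for illustration; they are not part of reddit2text.

```python
import os
from typing import Mapping, Optional


def build_user_agent(app_type: str, app_name: str, version: str, username: str) -> str:
    """Format a user agent as '<app type>:<app name>:<version> (by <your username>)'."""
    return f"{app_type}:{app_name}:{version} (by {username})"


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect credentials from environment variables (hypothetical variable names)."""
    env = os.environ if env is None else env
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": build_user_agent("script", "my_app", "v1.0", "u/reddit2text"),
    }


print(build_user_agent("script", "my_app", "v1.0", "u/reddit2text"))
# script:my_app:v1.0 (by u/reddit2text)
```

The resulting dictionary uses the same three keyword names as the constructor, so it could be passed as Reddit2Text(**load_credentials()).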

This is enough to get started:

from reddit2text import Reddit2Text

r2t = Reddit2Text(
    # replace with your actual creds
    client_id='123abc',
    client_secret='123abc',
    user_agent='script:my_app:v1.0 (by u/reddit2text)'
)

URL = 'https://www.reddit.com/r/AskReddit/comments/1by3p2o/whats_the_stupidest_animal_and_how_has_it/'

output = r2t.textualize_post(URL)
print(output)

Here is a truncated example of the output from the above code: https://pastebin.com/niQTGbys
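Since textualize_post returns text that you can print, archiving a thread for offline use only takes the standard library. The sketch below is one way to do it; save_thread_text is a hypothetical helper (not part of reddit2text), and a placeholder string stands in for the real output.

```python
import tempfile
from pathlib import Path


def save_thread_text(text: str, path: Path) -> Path:
    """Write thread text to a UTF-8 file, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Placeholder standing in for: output = r2t.textualize_post(URL)
output = "Example thread text"

# The demo writes to the system temp directory; use any location you like.
target = Path(tempfile.gettempdir()) / "reddit2text_demo" / "thread.txt"
saved = save_thread_text(output, target)
print(f"Saved to {saved}")
```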

Extra Configuration

  • max_comment_depth, Optional[int]:
    • Maximum depth of comments to output, counting the top-most comment as the first level. Defaults to None; None or -1 includes all comments.
  • comment_delim, Optional[str]:
    • String/character used to indent comments according to their nesting level. Defaults to | to mimic Reddit.

r2t = Reddit2Text(
    # credentials ...
    max_comment_depth=3,  # keep at most 3 levels per comment chain, counting the top-most comment
    comment_delim='#'  # each comment is prefixed with this string, repeated once per nesting level
)
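To make the two options more concrete, here is a minimal sketch of how a depth limit and a repeated delimiter can interact. It is not reddit2text's actual implementation: render_comments and the dictionary shape with "body" and "replies" keys are assumptions made for illustration, and the library's real output format may differ.

```python
from typing import List, Optional


def render_comments(
    comments: List[dict],
    delim: str = "|",
    max_depth: Optional[int] = None,
    depth: int = 1,
) -> List[str]:
    """Render nested comments as lines prefixed by `delim` repeated per level."""
    unlimited = max_depth is None or max_depth == -1
    if not unlimited and depth > max_depth:
        return []  # this level is deeper than the limit, so drop it
    lines = []
    for comment in comments:
        lines.append(f"{delim * depth} {comment['body']}")
        # Recurse into replies one level deeper.
        lines.extend(render_comments(comment.get("replies", []), delim, max_depth, depth + 1))
    return lines


# Hypothetical thread: one top-level comment with a reply and a nested reply.
thread = [
    {"body": "top-level comment", "replies": [
        {"body": "first reply", "replies": [
            {"body": "nested reply", "replies": []},
        ]},
    ]},
]

print("\n".join(render_comments(thread, delim="#", max_depth=2)))
# # top-level comment
# ## first reply
```

With max_depth=2 the nested reply is dropped, because the top-most comment already counts as the first level.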

Current Features

  • Convert any Reddit thread (the post + all its comments) into structured text.
  • Include all comments, with the ability to specify the maximum comment depth.
  • Configure a custom comment delimiter, for visual separation of nested comments.

Have a Feature Idea?

Simply open an issue on GitHub and tell me what should be added to the next release!

Planned Features

  • Comprehensive Formatting/Saving
    • Save the output to a file as .txt, .csv, or .json, or copy it to your clipboard!
  • Filtering/Sorting
    • Filter/sort comments based on upvotes, author name, body content, number of replies, etc. Also add in the ability to get the Top N comments.
  • Extra data fields
    • Access extra information for each post/comment, like whether it's NSFW or not and when it was created
  • Image/video support
    • Enable mining of not just text threads, but also image and video posts
  • CLI output
    • Add a progress bar to the terminal for threads with a large number of comments
  • Anonymize usernames
    • Give the ability to obfuscate usernames, while still preserving their uniqueness across all comments
  • Iterate across many posts at once
    • Given a subreddit as the input and the sorting method (hot, top, new, etc.), loop over multiple posts at once and textualize them
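As an illustration of the "Anonymize usernames" idea in the list above, one possible approach is to map each distinct username to a stable placeholder, so the same author keeps the same alias across all comments. This is only a sketch of the concept; anonymize_usernames and the user_N alias format are hypothetical and do not describe how the feature will be implemented.

```python
from typing import Dict, List


def anonymize_usernames(usernames: List[str]) -> List[str]:
    """Replace each distinct username with a stable alias such as 'user_1'."""
    aliases: Dict[str, str] = {}
    for name in usernames:
        if name not in aliases:
            # Aliases are numbered in order of first appearance.
            aliases[name] = f"user_{len(aliases) + 1}"
    return [aliases[name] for name in usernames]


print(anonymize_usernames(["alice", "bob", "alice"]))
# ['user_1', 'user_2', 'user_1']
```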

Contributions

Contributions to reddit2text are always welcome! I'm just a person who made something I think is useful, so any help is appreciated. You can always submit a pull request or open an issue in the GitHub repository.

License

reddit2text is released under the Apache License 2.0. See the LICENSE file for more details.