This project was built for scraping Reddit datasets. It was developed against the Pushshift comment dumps available at: https://files.pushshift.io/reddit/comments/
To unpack the Zstandard (`.zst`) files you can use:
- Windows: https://sourceforge.net/projects/zstandard.mirror/
- Mac: https://formulae.brew.sh/formula/zstd
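If you'd rather decompress from Python, a minimal sketch using the third-party `zstandard` package (`pip install zstandard`) is shown below; the file name `RC_2018-07.zst` is just a stand-in for whichever dump you downloaded:

```python
import zstandard

# Stream-decompress without loading the whole archive into memory.
# Some Pushshift dumps are compressed with a long window, so a large
# max_window_size is needed to decode them.
dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
with open('RC_2018-07.zst', 'rb') as ifh, open('RC_2018-07', 'wb') as ofh:
    dctx.copy_stream(ifh, ofh)
```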
To use this tool with a desired dataset, follow these steps:
- Download one of the datasets
- Unpack the dataset file using one of the Zstandard tools mentioned above
- Move the unpacked file to `./Data/Datasets`
- Load the data with one of the Comment models from `./Models/Comments`, e.g. (see the fuller sketch below):

```python
DatasetLoader.load(DatasetVersionEnum.Dataset_2018_07, limit=10000)
```
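A fuller usage sketch, assuming the module paths implied by the directory layout described here, and that `DatasetLoader.load` returns the populated Comment model (if it does not, the corresponding singleton, e.g. `Comment201807`, can be inspected directly after loading):

```python
from Data.DatasetLoader import DatasetLoader
from Data.DatasetVersionEnum import DatasetVersionEnum  # assumed module path

def main() -> None:
    # Load the first 10,000 comments of the July 2018 dump.
    comments = DatasetLoader.load(DatasetVersionEnum.Dataset_2018_07, limit=10000)
    # The generated models store each field as a list, one entry per comment.
    print(comments.author[:5])

if __name__ == '__main__':
    main()
```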
If the specific Comment model doesn't exist yet, you can add it by doing the following:
- In `./Models/Comments`, add a new Python file called `CommentYYYYMM.py` (replace `YYYY` and `MM` with the year and month of the dataset). To get a class definition based on your specific dataset, there is a helper function, `DatasetLoader.print_class_definition`, that prints out the code you can copy and paste into the file.
Example:

```python
from Data.DatasetLoader import DatasetLoader

def main() -> None:
    DatasetLoader.print_class_definition('RC_2018-07')
```
Output:

```python
from typing import List

from Models.BaseComment import BaseComment
from Models.Singleton import Singleton

class Comment201807(BaseComment, metaclass=Singleton):
    archived: List[bool] = []
    author: List[str] = []
    ...

    def __init__(self, comment: dict) -> None:
        super().__init__(comment)
        self.comment_json = self.comment_json
        self.archived = self.comment_json['archived']
        self.author = self.comment_json['author']
        ...

    def add_batch(self, batch: List[dict]) -> None:
        for comment in batch:
            self.add_comment(comment)
```
- In the class `DatasetVersionEnum`, located in `./Data`, add a new enum case. Example:

```python
from Data.Dataset import Dataset
from Models.Comments.Comment201807 import Comment201807

class DatasetVersionEnum:
    Dataset_2018_07 = Dataset('RC_2018-07', Comment201807)
```
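Each enum case simply pairs a dump's file name with the model class that parses it. For orientation, here is a hypothetical sketch of what the `Dataset` wrapper in `./Data/Dataset.py` might look like (an assumption for illustration only; check the actual file for the real definition):

```python
from typing import Type

class Dataset:
    # Hypothetical: ties a Pushshift dump file name to its Comment model.
    def __init__(self, file_name: str, comment_model: Type) -> None:
        self.file_name = file_name          # e.g. 'RC_2018-07' in ./Data/Datasets
        self.comment_model = comment_model  # e.g. Comment201807
```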
The project is not optimized for big datasets. The recommended load limit is `limit=1000000`; loading 1 million rows takes ~30 seconds (depending on the machine) and ~2.5 GB of RAM.
Also, as of right now, this project is only compatible with datasets from: https://files.pushshift.io/reddit/comments/
Planned improvement: dynamic loading and unloading (depending on the use case, it may not be necessary to keep the old data in memory).