A Scrapy pipeline that stores files using folder trees.

Primary language: Python. License: MIT.

scrapy-folder-tree

This is a Scrapy pipeline that provides an easy way to store downloaded files and images in one of several folder structures.

Supported folder structures:

Given the scraped file 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg, you can choose one of the following folder structures:

Using the file name

class: scrapy_folder_tree.ImagesHashTreePipeline

full
├── 0
.   ├── 5
.   .   ├── b
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
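
The layout above can be reproduced with a few lines of Python. This is only a sketch of the idea, not the package's implementation: the function name `hash_tree_path` and its `depth` and `root` parameters are hypothetical and chosen to match the tree shown, where each of the first three characters of the file name becomes one folder level.

```python
from pathlib import PurePosixPath


def hash_tree_path(file_name: str, depth: int = 3, root: str = "full") -> str:
    """Build a nested path with one folder per leading character of the name.

    Hypothetical helper for illustration; `depth` and `root` are assumptions.
    """
    stem = PurePosixPath(file_name).stem   # file name without its extension
    folders = list(stem[:depth])           # "05b" -> ["0", "5", "b"]
    return str(PurePosixPath(root, *folders, file_name))


print(hash_tree_path("05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"))
# full/0/5/b/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
```

Because the file name is a hash, its leading characters are spread fairly evenly, so files end up distributed across the folders instead of piling up in one directory.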

Using the crawling time

class: scrapy_folder_tree.ImagesTimeTreePipeline

full
├── 0
.   ├── 11
.   .   ├── 48
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg

Using the crawling date

class: scrapy_folder_tree.ImagesDateTreePipeline

full
├── 2022
.   ├── 1
.   .   ├── 24
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
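
The time and date layouts can be sketched the same way. The helpers below are hypothetical and are not taken from the package. They assume that the time tree levels are hour, minute and second, that the date tree levels are year, month and day, and that the numbers are not zero-padded, which is what the trees above suggest.

```python
from datetime import datetime
from pathlib import PurePosixPath


def time_tree_path(file_name: str, when: datetime, root: str = "full") -> str:
    """Nest the file under hour/minute/second (assumed meaning of the levels)."""
    parts = (str(when.hour), str(when.minute), str(when.second))
    return str(PurePosixPath(root, *parts, file_name))


def date_tree_path(file_name: str, when: datetime, root: str = "full") -> str:
    """Nest the file under year/month/day, without zero padding."""
    parts = (str(when.year), str(when.month), str(when.day))
    return str(PurePosixPath(root, *parts, file_name))


# Hypothetical crawl timestamp chosen to match both trees shown above.
crawled_at = datetime(2022, 1, 24, 0, 11, 48)
name = "05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"

print(time_tree_path(name, crawled_at))
# full/0/11/48/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
print(date_tree_path(name, crawled_at))
# full/2022/1/24/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
```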

Installation

pip install scrapy-folder-tree

Usage

Use the following settings in your project:

ITEM_PIPELINES = {
    'scrapy_folder_tree.FilesHashTreePipeline': 300
}
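
A slightly fuller settings sketch for one of the image pipelines is shown below. It assumes the pipelines behave like Scrapy's stock ImagesPipeline, which reads its storage root from IMAGES_STORE (the stock FilesPipeline uses FILES_STORE instead). That assumption should be checked against the package before relying on it, and the storage path is a placeholder.

```python
# settings.py (sketch; the storage path below is a placeholder)

ITEM_PIPELINES = {
    # Class name taken from the list of supported folder structures.
    "scrapy_folder_tree.ImagesHashTreePipeline": 300,
}

# Assumption: as with Scrapy's own ImagesPipeline, downloaded images are
# written under this directory. For a Files* pipeline, set FILES_STORE instead.
IMAGES_STORE = "downloads"
```

With Scrapy's standard media pipelines, items list the URLs to download in an `image_urls` field (or `file_urls` for file pipelines).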