This tool optimizes the storage of a large fashion dataset, reducing the disk space it requires with little loss of data quality.
- Image Optimization: Resizes images and converts them to .webp format, reducing the dataset size from 15 GB to 4.12 GB for over 44,000 products.
- JSON Compression: Keeps only the essential fields in each JSON file and applies gzip compression, reducing the total size from 870 MB to 180 MB.
- Organized File Structure: Each product's image and JSON are stored in a dedicated folder named after the productId.
- Dataset Segmentation: Organizes products into subfolders, each containing up to 1000 products, for manageable dataset chunks (e.g., folders named 01, 02, 03, etc.).
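
The JSON compression step can be sketched roughly as follows, using only Node's built-in `zlib` module. This is an illustration and not the tool's actual code: the field names in `ESSENTIAL_FIELDS` and the helper names `pickFields`, `compressProduct`, and `decompressProduct` are assumptions, since this README does not list which fields are kept.

```javascript
const zlib = require('node:zlib');

// Hypothetical whitelist: the fields the tool really keeps are not listed here.
const ESSENTIAL_FIELDS = ['id', 'productDisplayName', 'price', 'baseColour'];

// Return a new object containing only the whitelisted fields that exist.
function pickFields(product, fields) {
  const trimmed = {};
  for (const field of fields) {
    if (Object.prototype.hasOwnProperty.call(product, field)) {
      trimmed[field] = product[field];
    }
  }
  return trimmed;
}

// Trim a product object and gzip its JSON representation into a Buffer.
function compressProduct(product, fields = ESSENTIAL_FIELDS) {
  const json = JSON.stringify(pickFields(product, fields));
  return zlib.gzipSync(Buffer.from(json, 'utf8'));
}

// Reverse step: gunzip a Buffer and parse it back into an object.
function decompressProduct(gzBuffer) {
  return JSON.parse(zlib.gunzipSync(gzBuffer).toString('utf8'));
}
```

The Buffer returned by `compressProduct` could be written to `product-data.json.gz` with `fs.writeFileSync`, and `decompressProduct` shows how a consumer might read such a file back.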
The original dataset is available at Fashion Product Images Dataset.
- Download the dataset from the above link.
- Transfer the required JSON files from `fashion-dataset/styles` to `data/product-data/`.
- Move the desired images from `fashion-dataset/images` to `data/product-images/`.
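
If you only want a subset of the dataset, the copy steps above can be scripted. The sketch below is one possible approach and is not part of the tool: `copySubset` is a hypothetical helper, and it assumes the source folders contain one flat file per product, named by product ID.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Copy up to `limit` regular files from srcDir to destDir, in numeric
// filename order, and return the names of the files that were copied.
function copySubset(srcDir, destDir, limit = Infinity) {
  fs.mkdirSync(destDir, { recursive: true });
  const files = fs
    .readdirSync(srcDir, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => entry.name)
    .sort((a, b) => a.localeCompare(b, undefined, { numeric: true }))
    .slice(0, limit);
  for (const file of files) {
    fs.copyFileSync(path.join(srcDir, file), path.join(destDir, file));
  }
  return files;
}

// Example usage (paths taken from the steps above):
// copySubset('fashion-dataset/styles', 'data/product-data', 10000);
// copySubset('fashion-dataset/images', 'data/product-images', 10000);
```

Taking the first N files from each folder only yields matching JSON and image pairs if both folders cover the same product IDs, so check for missing files before relying on this.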
Run the following command to start the application:
npm start
The final output is stored in the `products` folder.
products/
- 01/
  - products-list.txt
  - productId1/
    - product-data.json.gz
    - product-img.webp
  - ...
  - productId1000/
    - product-data.json.gz
    - product-img.webp
- 02/
  - products-list.txt
- 03/
  - products-list.txt
- ...
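
To make the layout concrete, here is a minimal sketch of how product IDs could be mapped onto numbered folders of up to 1000 products each. It is not taken from the tool's source: `chunkFolderName` and `planLayout` are hypothetical names, and the one-ID-per-line format for `products-list.txt` is an assumption.

```javascript
const path = require('node:path');

const CHUNK_SIZE = 1000; // maximum number of products per numbered folder

// Map a zero-based product index to its folder name: 0..999 -> "01", 1000..1999 -> "02".
function chunkFolderName(index, chunkSize = CHUNK_SIZE) {
  return String(Math.floor(index / chunkSize) + 1).padStart(2, '0');
}

// Group product IDs into numbered folders and describe what each folder holds.
function planLayout(productIds, outputRoot = 'products') {
  const chunks = new Map();
  productIds.forEach((id, index) => {
    const folder = chunkFolderName(index);
    if (!chunks.has(folder)) chunks.set(folder, []);
    chunks.get(folder).push(String(id));
  });
  return [...chunks].map(([folder, ids]) => ({
    folder: path.posix.join(outputRoot, folder),
    listFile: path.posix.join(outputRoot, folder, 'products-list.txt'),
    listContents: ids.join('\n'), // assumed format: one product ID per line
    productDirs: ids.map((id) => path.posix.join(outputRoot, folder, id)),
  }));
}
```

For 2,500 product IDs, `planLayout` would describe three folders (`products/01`, `products/02`, `products/03`), the last one holding 500 products.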
Note:
- `products-list.txt` in each folder contains a list of all products in that folder, with a maximum of 1000 products per folder.
- Currently, the project includes only 10,000 out of the 44,000+ products. Follow the provided instructions to process the entire dataset.