Structured image description generator.
Dito uses a pre-trained vision-language model to automatically generate concise, human-readable captions for images. It processes all images in a given folder and its subfolders, and outputs structured JSON grouped by subfolder, which can be used for dataset annotation, content indexing, or training data preparation.
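This README does not name the model behind the captions, so the following is only a sketch of what the captioning step could look like, assuming a BLIP checkpoint loaded through `transformers`. The checkpoint name and the helpers `load_captioner` and `caption_all` are hypothetical and not taken from `dito.py`. Using the file name without its extension as the label is also an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# ASSUMPTION: illustrative checkpoint; the model actually used by Dito is not stated.
MODEL_NAME = "Salesforce/blip-image-captioning-base"


def load_captioner(model_name: str = MODEL_NAME) -> Callable[[Path], str]:
    """Return a function that maps an image path to a caption (hypothetical helper)."""
    # Imported lazily so the heavy dependencies are only needed when a model is loaded.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    def caption(image_path: Path) -> str:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(output_ids[0], skip_special_tokens=True).strip()

    return caption


def caption_all(image_paths: Iterable, captioner: Callable[[Path], str]) -> Dict[str, str]:
    """Map each image label (ASSUMPTION: file name without extension) to its caption."""
    return {Path(path).stem: captioner(Path(path)) for path in image_paths}
```

Passing the captioner in as a plain callable keeps the model choice separate from the rest of the logic, so `caption_all(paths, load_captioner())` could be swapped for any other captioning function.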
    pip install torch torchvision transformers pillow

    python dito.py {image folder path} {image description data file path}

Example:

    python dito.py IMAGE_FOLDER/ image_description_data.json

The output is a JSON object where:
- Keys are relative subfolder paths (or "" for the root folder).
- Values are objects mapping image file labels to their generated captions.
{
    "": {
        "forest": "a person walking down a snowy path in the woods",
        "plane": "a blue and white plane",
        "ship": "two large ships in the water",
        "train": "snow on the ground"
    },
    "animal/": {
        "cow": "a brown cow standing on top of a grass covered hill",
        "robin": "a small bird sitting on a branch in the snow",
        "weaver": "a red bird sitting on a branch"
    },
    "landscape/": {
        "mountains": "a body of water",
        "mountains_2": "a clear blue sky"
    }
}

- Only image files with the following extensions are processed: .avif, .jpg, .jpeg, .png, .webp.
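The output layout and the extension filter described above can be sketched in a few lines of Python. This is not the code of `dito.py`: the helper names `subfolder_key` and `describe_folder` are hypothetical, the captioner is any function supplied by the caller, and both the case-insensitive extension match and the use of the file name without extension as the label are assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Extensions listed in this README.
IMAGE_EXTENSIONS = {".avif", ".jpg", ".jpeg", ".png", ".webp"}


def subfolder_key(root: Path, image_path: Path) -> str:
    """Return "" for the root folder, or a relative path such as "animal/"."""
    relative_folder = image_path.parent.relative_to(root).as_posix()
    return "" if relative_folder == "." else relative_folder + "/"


def describe_folder(root, captioner: Callable[[Path], str]) -> Dict[str, Dict[str, str]]:
    """Group captions by subfolder; other file types are skipped (hypothetical helper)."""
    root = Path(root)
    result: Dict[str, Dict[str, str]] = {}
    for image_path in sorted(root.rglob("*")):
        # ASSUMPTION: extensions are matched case-insensitively.
        if image_path.is_file() and image_path.suffix.lower() in IMAGE_EXTENSIONS:
            key = subfolder_key(root, image_path)
            # ASSUMPTION: the label is the file name without its extension.
            result.setdefault(key, {})[image_path.stem] = captioner(image_path)
    return result


def save_descriptions(descriptions: Dict[str, Dict[str, str]], output_path) -> None:
    """Write the grouped captions as indented JSON."""
    Path(output_path).write_text(json.dumps(descriptions, indent=4), encoding="utf-8")
```

With a placeholder captioner such as `lambda path: "caption"`, `describe_folder("IMAGE_FOLDER/", ...)` would return a dictionary shaped like the example output, which `save_descriptions` then writes to disk.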
Version: 0.1
Author: Eric Pelzer (ecstatic.coder@gmail.com).
This project is licensed under the GNU General Public License version 3.
See the LICENSE.md file for details.
