- [1/4/24] Data v6 release; Word-level recognition annotations for 6/13 languages.
- [18/4/24] Data v5 release; Word-level recognition annotations for 5/13 languages.
- [12/4/24] Data v4 release; Word-level recognition annotations for 2/13 languages. (Part-2)
- [19/3/24] Data v3 release; Word-level recognition annotations for 2/13 languages. (Part-1)
- [3/3/24] Data v2 release; Word-level recognition annotations for 1/13 languages.
- [2/2/24] Data v1 released for 13 languages along with the detection annotations.
- Recognition annotation release.
- Detection annotations for 13 languages.
- Data v1 for 13 languages.
Language | #Images | #Total words | #Total words with recognition annotations |
---|---|---|---|
Assamese | 295 | 7991 | 0 |
Bengali | 305 | 9766 | 0 |
Gujarati | 525 | 4767 | 4062 |
Hindi | 1218 | 17935 | 17088 |
Kannada | 627 | 8847 | 6606 |
Malayalam | 474 | 6850 | 4249 |
Meitei | 82 | 1632 | 0 |
Odia | 533 | 10657 | 0 |
Punjabi | 517 | 20017 | 19261 |
Tamil | 521 | 5413 | 4505 |
Telugu | 607 | 6375 | 0 |
Urdu | 551 | 11771 | 0 |
Marathi | - | 25875 | 0 |
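To make the recognition coverage easier to read, the short sketch below computes, for each language, the share of words that carry a recognition annotation. The numbers are copied from the table above; `STATS` and `coverage` are illustrative names and not part of the dataset tooling.

```python
# (total words, words with recognition annotations), copied from the table above.
STATS = {
    "Assamese": (7991, 0),
    "Bengali": (9766, 0),
    "Gujarati": (4767, 4062),
    "Hindi": (17935, 17088),
    "Kannada": (8847, 6606),
    "Malayalam": (6850, 4249),
    "Meitei": (1632, 0),
    "Odia": (10657, 0),
    "Punjabi": (20017, 19261),
    "Tamil": (5413, 4505),
    "Telugu": (6375, 0),
    "Urdu": (11771, 0),
    "Marathi": (25875, 0),
}


def coverage(total, recognised):
    """Return the percentage of words that have a recognition annotation."""
    return 100.0 * recognised / total if total else 0.0


for language, (total, recognised) in STATS.items():
    print(f"{language:<10} {coverage(total, recognised):5.1f}%")

# Languages with at least one recognition annotation.
annotated_languages = [name for name, (_, rec) in STATS.items() if rec > 0]
print(len(annotated_languages), "of", len(STATS), "languages have recognition annotations")
```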
Step 1: Request access to the data by filling in this form. We will review your request and provide access to the data.
Step 2: Download the data from the link provided in the email.
Step 3: Extract the downloaded zip file into the "data" folder:
unzip BSTD.zip -d data
Step 4: Download the images:
python3 downloadImages.py
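`downloadImages.py` is the supported way to fetch the images. As a rough illustration of what this step involves, the sketch below assumes that every entry in `data/BSTD.json` carries the `url`, `image_name`, and `language` fields described in the annotation format, and that images are stored as `data/<language>/<image_name>` (inferred from the example path `data/hindi/image_141.jpg`). The function names `plan_downloads` and `download_all` are hypothetical, and the actual script may behave differently.

```python
import json
import os
import urllib.request


def plan_downloads(annotations, root="data"):
    """Return (url, target_path) pairs for every entry that has a URL."""
    plan = []
    for entry in annotations.values():
        url = entry.get("url")
        if not url:
            continue
        # Assumed layout: <root>/<language in lower case>/<image_name>.
        target = os.path.join(root, entry["language"].lower(), entry["image_name"])
        plan.append((url, target))
    return plan


def download_all(plan, overwrite=False):
    """Fetch each URL in the plan, skipping files that already exist."""
    for url, target in plan:
        if os.path.exists(target) and not overwrite:
            continue
        os.makedirs(os.path.dirname(target), exist_ok=True)
        urllib.request.urlretrieve(url, target)


# Example usage (requires network access and the extracted annotation file):
# with open("data/BSTD.json", encoding="utf-8") as handle:
#     download_all(plan_downloads(json.load(handle)))
```

Separating the planning step from the network step makes it possible to inspect the target paths before anything is downloaded.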
Each word in an image is annotated as a polygon. The annotation file is a JSON file with the following format:
"language_image_id": {
    "annotations":
    {
        "polygon_0":
        {
            "coordinates":
            [
                [x1, y1],
                [x2, y2],
                ...,
                [xn, yn]
            ],
            "text": "text in the current polygon"
        },
        ...,
        "polygon_n":
        {
            "coordinates":
            [
                [x1, y1],
                [x2, y2],
                ...,
                [xn, yn]
            ],
            "text": "text in the current polygon"
        }
    },
    "url": "url of the image",
    "image_name": "name of the image",
    "language": "main language"
}
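The format above can be read with the standard `json` module. The sketch below is a minimal example that walks every polygon and counts, per language, how many words exist and how many have non-empty text. It assumes the file is a single JSON object keyed by image id, and that a word without a recognition annotation has a missing, null, or empty `text` field; both points are assumptions, and the helper names are illustrative.

```python
import json
from collections import Counter


def load_annotations(path="data/BSTD.json"):
    """Load the annotation file into a dict keyed by image id."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_words(annotations):
    """Yield (language, coordinates, text) for every polygon in the file."""
    for entry in annotations.values():
        for polygon in entry["annotations"].values():
            # "text" is treated as optional here; this is an assumption.
            yield entry["language"], polygon["coordinates"], polygon.get("text") or ""


def count_words(annotations):
    """Count all words and words with non-empty text, per language."""
    total, recognised = Counter(), Counter()
    for language, _, text in iter_words(annotations):
        total[language] += 1
        if text.strip():
            recognised[language] += 1
    return total, recognised


# Example usage, once the data has been extracted:
# total, recognised = count_words(load_annotations())
# print(total["hindi"], recognised["hindi"])
```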
To visualise detection annotations, run the following command:
python3 visualise.py <image_path> <path_to_BSTD.json>
For example:
python3 visualise.py data/hindi/image_141.jpg data/BSTD.json
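For readers who want to inspect the polygons without extra dependencies, the sketch below turns one annotation entry into an SVG overlay using only the standard library. This is not how `visualise.py` works; it is an independent illustration. It assumes that the coordinates are pixel positions in the original image and that the caller supplies the image width and height, since the annotation format does not include them. `polygons_to_svg` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def polygons_to_svg(entry, width, height):
    """Build an SVG overlay with one <polygon> per annotated word."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">']
    for name, polygon in entry["annotations"].items():
        # SVG expects "x1,y1 x2,y2 ..." for the points attribute.
        points = " ".join(f"{x},{y}" for x, y in polygon["coordinates"])
        # Fall back to the polygon key when no text is available.
        label = escape(polygon.get("text") or name)
        parts.append(
            f'<polygon points="{points}" fill="none" stroke="red">'
            f"<title>{label}</title></polygon>"
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

The resulting string can be saved to a `.svg` file and layered over the image in any viewer that supports it; the `<title>` element shows the word text as a tooltip.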
Some examples are below:
- The data was collected from the internet, so some images are not correctly oriented. We have tried to remove such images, but some may remain.
- All the images were collected from Wikimedia Commons (under the Creative Commons licence CC BY-SA 4.0).
- Further, the detection and recognition annotations were created by human annotators.
For any queries, please contact us at: