Universal Language Contribution APIs (ULCA) is an open-source, scalable data platform that supports various types of datasets for Indic languages and provides a user interface for interacting with them.
- Be the premier data and models repository for Indic language resources
- Collect datasets for MT (Machine Translation), ASR (Automatic Speech Recognition), TTS (Text To Speech), OCR (Optical Character Recognition) and various NLP tasks in standardized but extensible formats. Please refer to the Datasets section.
- Collect extensive metadata about each dataset to support various analyses
- Proper attribution for every contributor at the record level
- Deduplication capability built-in
- Simple interface to search and download datasets based on various filters
- Perform various quality checks on the submitted datasets
ULCA allows users to contribute various types of datasets, including but not limited to the following:
Dataset Type | Description |
---|---|
Parallel Dataset | Consists of bi-lingual sentence pairs which are meaningfully the same. |
ASR/TTS Dataset | Consists of audio to text mapping |
ASR Unlabeled Dataset | These are raw ASR datasets without transcript value. |
OCR Dataset | Consists of image to text mapping |
Monolingual Dataset | Consists of sentences in a single language |
- Submit a new dataset from the above mentioned types
- Delete any of the submitted datasets.
- Upload a newer version of a submitted dataset with more information (e.g. v2 of the PIB dataset)
- Enhance the quality of datasets submitted by others (e.g. by adding alignment scores)
Users can contribute various types of models. Note: ULCA doesn't host the models; rather, it refers to the inference endpoints specified by the contributors.
Model Type | Description |
---|---|
Translation Model | Model to translate a given sentence in one language into the sentence in another language. |
ASR Model | Model to convert audio into respective transcript. |
TTS Model | Model to convert a text into respective audio. |
OCR Model | Model to convert a given image to text. |
- Submit any new model from the above mentioned types
- Inference support for the model
- Run benchmarking for the submitted models
- Publish a model for anyone to infer
As part of ULCA, qualified subject matter experts can submit benchmarking datasets, which can be used to evaluate various models. Benchmarking will be available for any submitted model.
The ULCA code base is published as an open-source project (MIT license) in the following repository: https://github.com/ULCA-IN/ulca
- ULCA data/model contracts : https://github.com/ULCA-IN/ulca/tree/master/specs
- Sample usages : https://github.com/ULCA-IN/ulca/tree/master/specs/examples
- Test datasets : https://github.com/ULCA-IN/ulca/tree/master/ulca-test-datasets
Service | Build Status |
---|---|
Ingest | |
Publish | |
User Management | |
Validate | |
Test | |
Contributing a dataset to the ULCA ecosystem is fairly easy. The submitter just has to upload a zip archive containing two text files, plus optional reference files such as audio or images. The text files can be in JSON or CSV format and should be named as follows:
- `params.json` or `params.csv`
- `data.json` or `data.csv`
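As a minimal sketch of the packaging step, the snippet below writes the two text files into a zip archive using only the Python standard library. The helper name `build_submission_zip` is hypothetical, the sample field values are placeholders, and it assumes that `data.json` holds a JSON array of records; check the ULCA data contracts for the exact layout.

```python
import io
import json
import zipfile


def build_submission_zip(params: dict, records: list) -> bytes:
    """Pack params.json and data.json into an in-memory zip archive.

    Hypothetical helper: assumes data.json is a JSON array of records.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # ensure_ascii=False keeps Indic-script text readable in the files.
        archive.writestr("params.json", json.dumps(params, ensure_ascii=False, indent=2))
        archive.writestr("data.json", json.dumps(records, ensure_ascii=False, indent=2))
    return buffer.getvalue()


# Placeholder content, for illustration only.
payload = build_submission_zip(
    {"datasetType": "parallel-corpus"},
    [{"sourceText": "Hello", "targetText": "নমস্কার"}],
)

# Re-open the archive to confirm which files it contains.
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    print(sorted(archive.namelist()))
```

Optional reference files such as audio or images could be added to the same archive with further `archive.write(...)` calls.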
The ULCA system currently supports the following types of datasets:
- Parallel dataset
- Monolingual dataset
- ASR / TTS dataset
- OCR dataset
- Document Layout dataset
ULCA relies on the submitter to describe their dataset so that it can benefit the wider community. Following the suggestions below will help. The `params` file should contain the attributes discussed here.
A dataset should have the following mandatory attributes; we will cover each of them individually. Please note that the mandatory attributes and the values assigned to them are strictly enforced.
- datasetType
- languages
- collectionSource
- domain
- license
- submitter
The following attributes are optional:
- version
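Because the mandatory attributes are strictly enforced, it can help to check a `params` object locally before uploading. The following is a minimal sketch, not ULCA's own validation logic; the function name `missing_mandatory_attributes` is hypothetical and the check only covers presence of the keys listed above, not their values.

```python
# Mandatory attribute names, copied from the list above.
MANDATORY_ATTRIBUTES = (
    "datasetType",
    "languages",
    "collectionSource",
    "domain",
    "license",
    "submitter",
)


def missing_mandatory_attributes(params: dict) -> list:
    """Return the mandatory attribute names that are absent from params."""
    return [name for name in MANDATORY_ATTRIBUTES if name not in params]


incomplete = {"datasetType": "parallel-corpus", "domain": "legal"}
print(missing_mandatory_attributes(incomplete))
```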
This defines the type of dataset (parallel, monolingual, ASR, etc.). The allowed values are listed in DatasetType.
Sample usage:
"datasetType": "parallel-corpus"
It is important to convey which language the dataset targets, and the structure of the `languages` attribute should be followed. The same attribute can describe a single language or a language pair. In the following example, `languages` describes a parallel dataset, which typically has a language pair: `sourceLanguage` is English and `targetLanguage` is Bengali. The language codes follow ISO 639-1 and ISO 639-2 and are listed in LanguagePair.
{
"sourceLanguage": "en",
"targetLanguage": "bn"
}
Monolingual, ASR/TTS and OCR datasets typically use a single language; the following example can be used to define the `languages` attribute.
"sourceLanguage": "en"
This attribute defines the relevant business area or `domain` for which the dataset is curated. ULCA ONLY accepts values that are defined in the Domain schema.
Sample usage:
- dataset specifically for the `legal` domain
"domain": "legal"
- dataset meant for the `news` domain
"domain": "news"
This attribute is fairly straightforward: the dataset submitter should choose one of the available License values.
Sample usage:
"license": "cc-by-4.0"
This attribute is mostly free text. We recommend making it descriptive so that community users can see the sources from which the dataset was curated. A URL along with a short description usually suffices.
Sample usage :
"collectionSource": [
"https://main.sci.gov.in",
"42040.pdf",
"SCI judgment pdfs"
]
This attribute describes the user who submitted the dataset as well as the team members who are part of the project. We suggest acknowledging all team members, however small their contribution. Typically it should also describe the goal of the project or team.
Sample usage :
{
"submitter": {
"name": "Project Anuvaad",
"aboutMe": "Open source project run by ekStep foundation, part of Sunbird project"
},
"team": [
{
"name": "Ajitesh Sharma",
"aboutMe": "NLP team lead at Project Anuvaad"
},
{
"name": "Vishal Mauli",
"aboutMe": "Backend team lead at Project Anuvaad"
},
{
"name": "Aravinth Bheemraj",
"aboutMe": "Data engineering team lead at Project Anuvaad"
},
{
"name": "Rimpa Mondal",
"aboutMe": "Freelancer Bengali translator at Project Anuvaad"
}
]
}
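To show how the attribute fragments above might fit together, the sketch below assembles them into one `params` object and serializes it. It assumes that each fragment sits at the top level of a single JSON object and that the language pair is stored under a `languages` key; neither is spelled out in this section, so treat the exact nesting as an assumption and compare against the published sample `params.json`.

```python
import json

# Fragments taken from the sample usages above; the nesting is an assumption.
params = {
    "datasetType": "parallel-corpus",
    "languages": {"sourceLanguage": "en", "targetLanguage": "bn"},
    "collectionSource": ["https://main.sci.gov.in", "SCI judgment pdfs"],
    "domain": "legal",
    "license": "cc-by-4.0",
    "submitter": {
        "name": "Project Anuvaad",
        "aboutMe": "Open source project run by ekStep foundation, part of Sunbird project",
    },
}

# ensure_ascii=False preserves any non-Latin text as written.
params_json = json.dumps(params, ensure_ascii=False, indent=2)
print(params_json)
```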
This section explains the `params` specific to each supported dataset type. We will go through each dataset type individually and in detail.
Parallel dataset `params` have a few specific attributes, defined below.
- collectionMethod
This attribute is an optional field in `params` for the parallel dataset. It is a combination of `collectionDescription` and `collectionDetails`. `collectionDescription` is mandatory if `collectionMethod` is included; it defines the methods the user used to create the dataset.
Sample usage :
"collectionMethod": {
"collectionDescription": [
"machine-translated-post-edited"
],
"collectionDetails": {
"translationModel": "Google",
"translationModelVersion": "v2",
"editingTool": "Anuvaad",
"editingToolVersion": "v1.4",
"contributor": {
"name": "Aravinth Bheemaraj",
"aboutMe": "NLP Data team lead at Project Anuvaad"
}
}
}
The allowed values for `collectionDescription` are defined in the ULCA specs.
Depending on the collection method defined, `collectionDetails` can be one of the 4 available schemas.
See detailed sample usage at data.json and params.json
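The rule that `collectionDescription` becomes mandatory once `collectionMethod` is present can be sketched as a small local check. This is an illustration under stated assumptions, not the ULCA validator: the function name `check_collection_method` is hypothetical, and it does not verify the description against the allowed values or `collectionDetails` against its schemas.

```python
def check_collection_method(params: dict) -> list:
    """Return a list of problems found in the optional collectionMethod block."""
    method = params.get("collectionMethod")
    if method is None:
        # collectionMethod is optional, so its absence is not a problem.
        return []
    problems = []
    description = method.get("collectionDescription")
    if not description:
        problems.append("collectionDescription is required when collectionMethod is present")
    elif not isinstance(description, list):
        # The sample usage above shows a JSON array of strings.
        problems.append("collectionDescription should be a list of strings")
    return problems


valid = {"collectionMethod": {"collectionDescription": ["machine-translated-post-edited"]}}
broken = {"collectionMethod": {"collectionDetails": {"translationModel": "Google"}}}
print(check_collection_method(valid))
print(check_collection_method(broken))
```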
To mine bitext at large scale, submitters often use strategies such as LaBSE or LASER to align sentences and generate a parallel corpus. This approach has benefited the community at large. Use this property in `params` to indicate your bitext mining strategy, and report the `alignmentScore` property in `data` for every record. A sample record is shown below:
{
"sourceText": "In the last 24 hours, 4,987 new confirmed cases have been added.",
"targetText": "उन्होंने बताया कि पिछले 24 घंटे में 4987 नए मामलों की पुष्टि हुई है।",
"collectionMethod": {
"collectionDetails": {
"alignmentScore": 0.79782
}
}
}
ULCA will reject records that do not satisfy the mentioned criterion. This scenario is illustrated in the example data.json and params.json.
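Since every record is expected to carry an `alignmentScore` when a bitext mining strategy is declared, a submitter may want to find offending records before uploading. The sketch below is a hypothetical pre-submission check; the function name and the treatment of non-numeric scores as missing are assumptions, and ULCA's actual acceptance criterion may differ.

```python
def records_missing_alignment_score(records: list) -> list:
    """Return the indexes of records without a numeric alignmentScore."""
    missing = []
    for index, record in enumerate(records):
        details = record.get("collectionMethod", {}).get("collectionDetails", {})
        score = details.get("alignmentScore")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            missing.append(index)
    return missing


records = [
    {
        "sourceText": "In the last 24 hours, 4,987 new confirmed cases have been added.",
        "targetText": "उन्होंने बताया कि पिछले 24 घंटे में 4987 नए मामलों की पुष्टि हुई है।",
        "collectionMethod": {"collectionDetails": {"alignmentScore": 0.79782}},
    },
    # Placeholder record with no score, for illustration only.
    {"sourceText": "Hello", "targetText": "नमस्ते"},
]
print(records_missing_alignment_score(records))
```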
The properties listed below are specific to OCR datasets.
- format
- dpi
- imageTextType
Describes the image file format used in the submitted dataset. Choose one of the following image types, and refer to the example provided.
- jpeg
- bmp
- png
- tiff
"format": "tiff"
Describes the standard image metadata about pixel density.
- 300_dpi
- 72_dpi
"dpi": "72_dpi"
This property describes the category of image on which the text appears. For example, a text region can be present in a scene or in a document. The defined options are:
- scene-text
- typewriter-typed-text
- computer-typed-text
- handwritten-text
Choose among these options according to the text annotation done on the image, for example:
"imageTextType": "computer-typed-text"