
ULCA

Universal Language Contribution APIs (ULCA) is an open-source, scalable data platform supporting various types of datasets for Indic languages, along with a user interface for interacting with the datasets.

License: MIT

Why ULCA?

  • Be the premier data and models repository for Indic language resources
  • Collect datasets for MT (Machine Translation), ASR (Automatic Speech Recognition), TTS (Text To Speech), OCR (Optical Character Recognition) and various NLP tasks in standardized but extensible formats. Please refer to the Datasets section.
  • Collect extensive metadata related to datasets for various analyses
  • Proper attribution for every contributor at the record level
  • Deduplication capability built-in
  • Simple interface to search and download datasets based on various filters
  • Perform various quality checks on the submitted datasets

Supported entities in ULCA

Datasets

ULCA allows users to contribute various types of datasets, including but not limited to the following :

  • Parallel Dataset : Consists of bilingual sentence pairs that are meaningfully equivalent.
  • ASR/TTS Dataset : Consists of audio-to-text mappings.
  • ASR Unlabeled Dataset : Raw ASR audio without transcript values.
  • OCR Dataset : Consists of image-to-text mappings.
  • Monolingual Dataset : Consists of sentences in a single language.

Supported functionalities :

  • Submit a new dataset of any of the above-mentioned types
  • Delete any of the submitted datasets
  • Upload a newer version of a submitted dataset with more information (Ex : v2 of the PIB dataset)
  • Enhance the quality of datasets submitted by others (Ex : add alignment scores)

Models

Users can contribute various types of models. (Note : ULCA doesn't host the models; rather, it refers to the inference endpoints specified by the contributors. See the sketch after the functionality list below.)

  • Translation Model : Translates a given sentence in one language into a sentence in another language.
  • ASR Model : Converts audio into the corresponding transcript.
  • TTS Model : Converts text into the corresponding audio.
  • OCR Model : Converts a given image into text.

Supported functionalities :

  • Submit a new model of any of the above-mentioned types
  • Inference support for the model
  • Run benchmarking for the submitted models
  • Publish a model for anyone to infer
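
Since ULCA only stores a reference to a hosted inference endpoint, a model submission is essentially a descriptor. The sketch below illustrates the idea; the field names and URL are hypothetical and not the actual ULCA model schema (refer to the repository for the real schema) :

    {
        "name": "English-Bengali translation model",
        "task": "translation",
        "languages": {
            "sourceLanguage": "en",
            "targetLanguage": "bn"
        },
        "inferenceEndpoint": "https://example.org/api/v1/translate"
    }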

Benchmarking suite

As part of ULCA, qualified subject-matter experts can submit benchmarking datasets, which are used to evaluate the various models. Benchmarking can be run against any submitted model.

Supported functionalities :

  • Submit a new benchmarking dataset for any of the above-mentioned model types

Codebase & Deployment

The ULCA codebase is published as an open-source project (MIT license) in the following repository : https://github.com/ULCA-IN/ulca

Important links

Build status badges for the following services are available in the repository : Ingest, Publish, User Management, Validate and Test.

Contribution

It's fairly easy to contribute a dataset to the ULCA ecosystem. The submitter just has to upload a zip file containing two textual files and optional reference files such as audio or images. The textual file content can be in JSON or CSV format. The naming convention for the textual files should be as follows (an example layout is shown after the list) :

  • params.json or params.csv
  • data.json or data.csv
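
For instance, a parallel dataset contribution could be packaged as follows (the archive name is illustrative) :

    my-parallel-dataset.zip
    ├── params.json    (dataset-level metadata, explained below)
    └── data.json      (the actual records)

Here data.json would hold the records themselves, for example (a minimal sketch assuming a JSON array of records, using the record fields shown later in this document) :

    [
        {
            "sourceText": "In the last 24 hours, 4,987 new confirmed cases have been added.",
            "targetText": "उन्होंने बताया कि पिछले 24 घंटे में 4987 नए मामलों की पुष्टि हुई है।"
        }
    ]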

Supported Dataset Types

The ULCA system currently supports the following types of datasets :

  • Parallel dataset
  • Monolingual dataset
  • ASR / TTS dataset
  • OCR dataset
  • Document Layout dataset

  • Data and Params schema for parallel dataset
  • Data and Params schema for monolingual dataset
  • Data and Params schema for ASR / TTS dataset
  • Data and Params schema for OCR dataset

Representing a dataset params

ULCA relies upon the submitter to explain their dataset so that it can be beneficial to the larger community; following the suggestions below will surely benefit the community at large.

The params file should contain the attributes discussed here.

A dataset should have the following mandatory attributes; we will cover each of them individually. Please note that the mandatory attributes, and the values assigned to them, are strictly enforced.

  • datasetType
  • languages
  • collectionSource
  • domain
  • license
  • submitter

The following attributes are optional :

  • version
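
Sample usage for version (the value below is illustrative, in the spirit of "v2 of the PIB dataset" mentioned above) :

 "version": "v2"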

datasetType

This defines the type of the dataset (parallel, monolingual, ASR, etc.). The accepted values can be referred to in DatasetType.

Sample usage :

 "dataset-type": "parallel-corpus"

languages

It is important to convey which language(s) the dataset is directed towards. The structure of the languages attribute should be followed. The same attribute can be used to define a single language or a language pair. Let's look at the following example, where languages defines a parallel dataset with a language pair whose sourceLanguage is English and targetLanguage is Bengali. The language codes are per ISO 639-1 & 639-2 and can be referred to in LanguagePair.

{
   "sourceLanguage": "en",
   "targetLanguage": "bn"
}

Monolingual, ASR/TTS and OCR datasets typically use a single language, and the following example can be used to define the languages attribute :

 "sourceLanguage": "en" 

domain

This attribute defines the relevant business area or domain under which the dataset is curated. ULCA ONLY accepts values that are defined under the Domain schema.

Sample usage :

  • dataset specifically for the legal domain :
 "domain": "legal"
  • dataset meant for the news domain :
 "domain": "news"

license

This attribute is fairly straightforward : the dataset submitter should choose one from the available License values.

Sample usage:

  "license": "cc-by-4.0"

collectionSource

This attribute is mostly free text and optional; however, we recommend keeping it descriptive so that community users can trace the sources from which the dataset has been curated. A URL along with a short description should usually suffice.

Sample usage :

  "collectionSource": [
     "https://main.sci.gov.in",
     "42040.pdf",
     "SCI judgment pdfs"
  ]

submitter

This attribute holds the description of the user who submitted the dataset, as well as the team members who were part of the project. We suggest acknowledging all team members, however small their contribution may be. Typically it should also describe the project or team's goal.

Sample usage :

 {
        "submitter": {
            "name": "Project Anuvaad",
            "aboutMe": "Open source project run by ekStep foundation, part of Sunbird project"
        },
        "team": [
            {
                "name": "Ajitesh Sharma",
                "aboutMe": "NLP team lead at Project Anuvaad"
            },
            {
                "name": "Vishal Mauli",
                "aboutMe": "Backend team lead at Project Anuvaad"
            },
            {
                "name": "Aravinth Bheemraj",
                "aboutMe": "Data engineering team lead at Project Anuvaad"
            },
            {
                "name": "Rimpa Mondal",
                "aboutMe": "Freelancer Bengali translator at Project Anuvaad"
            }
        ]
    }
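
Putting the mandatory attributes together, a complete params.json for a parallel dataset could look like the following (a minimal sketch reusing the sample values above; team details are omitted for brevity) :

    {
        "datasetType": "parallel-corpus",
        "languages": {
            "sourceLanguage": "en",
            "targetLanguage": "bn"
        },
        "collectionSource": [
            "https://main.sci.gov.in",
            "SCI judgment pdfs"
        ],
        "domain": "legal",
        "license": "cc-by-4.0",
        "submitter": {
            "name": "Project Anuvaad",
            "aboutMe": "Open source project run by ekStep foundation, part of Sunbird project"
        }
    }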

Representing a specific type dataset params

This section explains the params specific to each supported dataset type. We will go through each dataset type individually and in detail.

Parallel Dataset specific params

Parallel dataset params have a few specific attributes, defined below :

  • collectionMethod

collectionMethod

This attribute is an optional field in params for a parallel dataset. It is a combination of collectionDescription and collectionDetails. collectionDescription is a mandatory property if collectionMethod is included; it defines the method the user has used for creating the dataset.

Sample usage :

    "collectionMethod": {
        "collectionDescription": [
            "machine-translated-post-edited"
        ],
        "collectionDetails": {
            "translationModel": "Google",
            "translationModelVersion": "v2",
            "editingTool": "Anuvaad",
            "editingToolVersion": "v1.4",
            "contributor": {
                "name": "Aravinth Bheemaraj",
                "aboutMe": "NLP Data team lead at Project Anuvaad"
            }
        }
    }

The values for collectionDescription can be found here. Based on the collection method defined, collectionDetails can be one of the 4 available schemas. See detailed sample usage in data.json and params.json.

In order to do bitext mining at large scale, submitters often leverage strategies like LaBSE, LASER, etc. to align sentences and generate parallel corpora. Bitext mining at this scale has helped the community at large. Use this property in params to indicate your bitext mining strategy, and also report the alignmentScore property in data for every record. A sample record is defined below :

    {
        "sourceText": "In the last 24 hours, 4,987 new confirmed cases have been added.",
        "targetText": "उन्होंने बताया कि पिछले 24 घंटे में 4987 नए मामलों की पुष्टि हुई है।",
        "collectionMethod": {
            "collectionDetails": {
                "alignmentScore": 0.79782
            }
        }
    }

ULCA will reject records that do not satisfy the mentioned criteria. We have explained this scenario in the example data.json and params.json.

OCR Dataset specific params

The listed properties are specific to OCR datasets :

  • format
  • dpi
  • imageTextType

format

Describes the image file format present in the submitted dataset; choose from the following image types :

  • jpeg
  • bmp
  • png
  • tiff

Sample usage :

  "format": "tiff"

dpi

Describes the standard image metadata for pixel density; choose from the following values :

  • 300_dpi
  • 72_dpi

Sample usage :

  "dpi": "72_dpi"

imageTextType

This property defines the presence of text in various categories of images. For example, a text region can be present in a natural scene, or on a document. The following are the defined possibilities :

  • scene-text
  • typewriter-typed-text
  • computer-typed-text
  • handwritten-text

Users can choose among these options, based upon the text annotation done on the image type, as follows :

  "imageTextType": "computer-typed-text"