A Hindi-English Code-Mixed Dataset for Text Normalization
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
We are releasing our dataset for the normalization of Hindi-English code-mixed text in JSON format.
The fields of each object in the released dataset are shown in the following table:
Field | Description | Example |
---|---|---|
id | Unique identifier for each datapoint | 30 |
inputText | Filtered & cleaned input text | whtas ur name |
tags | Tags describing the transformations that, applied to inputText, yield normalizedText | ['Short Form', 'Short Form', 'Looks Good'] |
normalizedText | Manually annotated normalized form of inputText | what is your name |
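
The fields in the table above can be read with a few lines of Python. The following is a minimal sketch, not part of the released dataset. It assumes that the data is a JSON array of objects, that `tags` is stored as a JSON array of strings, and that there is one tag per whitespace-separated token of `inputText` (as in the example row). The helper names `parse_records` and `token_tag_pairs` are hypothetical, and the sample record is simply the example row from the table.

```python
import json

# Field names taken from the table above.
REQUIRED_FIELDS = ("id", "inputText", "tags", "normalizedText")


def parse_records(json_text):
    """Parse a JSON array of datapoints and check each has the documented fields."""
    records = json.loads(json_text)
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {record.get('id')} is missing fields: {missing}")
    return records


def token_tag_pairs(record):
    """Pair each input token with its tag (assumes one tag per token)."""
    tokens = record["inputText"].split()
    if len(tokens) != len(record["tags"]):
        raise ValueError("number of tokens and number of tags differ")
    return list(zip(tokens, record["tags"]))


# The example row from the table, written as a one-element JSON array.
sample = """[
  {
    "id": 30,
    "inputText": "whtas ur name",
    "tags": ["Short Form", "Short Form", "Looks Good"],
    "normalizedText": "what is your name"
  }
]"""

records = parse_records(sample)
pairs = token_tag_pairs(records[0])
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

To read the released file instead of the inline sample, pass the file's contents to `parse_records`; if the file layout differs from the assumptions above (for example, one JSON object per line), the parsing step would need to be adapted.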