[Archive] tldr-pages translations pairs dataset is now officially available on Kaggle

Question

kbdharun opened this issue 3 months ago · 0 comments

We have been using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool to generate translation datasets in TMX (Translation Memory eXchange) format for use in OPUS (a public dataset of translated resources on the web). OPUS's corpora are widely used by tools like LibreTranslate (powered by argos-translate).

While TMX is the format used with the OPUS dataset, tldr-translation-pairs-gen also supports other formats such as XML, CSV, and JSON. CSV is widely used for data analysis, and Kaggle (a platform owned by Google that is very popular among students and data scientists) works best with CSV files. So I created a CSV dataset of our translation pairs, initially under my personal Kaggle account, and requested the creation of an Organization (https://www.kaggle.com/organizations/tldr-pages) to move it to. The request was approved last week, and I have moved our CSV dataset there (https://www.kaggle.com/datasets/tldr-pages/tldr-pages-translation-pairs-dataset).

I was in contact with SethFalco to discuss ways to automate updating the dataset, but none seem feasible in the long run. So I will manually get the CSV assets from the latest release and update the dataset once a month (if there aren't many changes, I might switch to quarterly updates).
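For anyone who wants to script the "get the CSV assets from the latest release" step locally, here is a minimal sketch in Python. It assumes the assets are published as GitHub release assets on the tldr-translation-pairs-gen repository (the repository name in `REPO` is my assumption, as are the helper names `csv_asset_urls` and `fetch_latest_release`). It only lists download URLs and does not touch Kaggle.

```python
import json
import urllib.request

# Assumption: the repository whose releases carry the CSV assets.
REPO = "tldr-pages/tldr-translation-pairs-gen"


def csv_asset_urls(release: dict) -> dict[str, str]:
    """Map each CSV asset name in a release payload to its download URL."""
    return {
        asset["name"]: asset["browser_download_url"]
        for asset in release.get("assets", [])
        if asset["name"].lower().endswith(".csv")
    }


def fetch_latest_release(repo: str = REPO) -> dict:
    """Fetch the latest release metadata from the GitHub REST API."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `csv_asset_urls(fetch_latest_release())` should return the CSV file names and their download links, which can then be downloaded and uploaded to the Kaggle dataset by hand.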

If any of the maintainers are interested in collaborating on this dataset or in creating new datasets under the Organization, feel free to contact me.

I have already documented this in our Access repository and will add a new section called "Datasets" to the Wiki (highlighting datasets created from tldr-pages).