Collected CC0 sentences written in Catalan, from public-domain and/or CC0-licensed sources.
TeMU-BSC is the Text Mining Unit of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación in Barcelona (Spain) [https://www.bsc.es/discover-bsc/organisation/scientific-structure/text-mining]
93691 sentences selected from the Catalan Government Crawling. Numbers have been transcribed.
The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan. It was obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government, during September and October 2020.
Both the packaging and its content are under a CC0 Universal Licence. Please refer to web.gencat.cat/en/menu-ajuda/ajuda/avis_legal/index.html
155990 sentences extracted from the following Projecte Aina datasets:
The Language Technologies Unit agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
2166 sentences generated from edited_selected_chatbot.txt by semi-automatic masking with the bsc/roberta-base-ca-cased transformer model, keeping only the well-formed ones.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
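The masking step mentioned above is not detailed in this file. As a rough sketch only: one way to generate candidates is to mask each word in turn, ask a masked language model for replacements, and then review the results by hand. Everything named here is an assumption made for illustration: `masked_variants`, the `fill_mask` callable (standing in for a model such as bsc/roberta-base-ca-cased), the `<mask>` token, and the toy predictor. The manual check for well-formedness is not shown.

```python
def masked_variants(sentence, fill_mask, mask_token="<mask>"):
    """Mask each word in turn and collect the replacements proposed by fill_mask.

    fill_mask is a hypothetical callable: it takes a sentence containing
    mask_token and returns a list of candidate words for that position.
    """
    words = sentence.split()
    variants = []
    for i, original in enumerate(words):
        masked = " ".join(words[:i] + [mask_token] + words[i + 1:])
        for candidate in fill_mask(masked):
            # Skip candidates that just reproduce the source sentence.
            if candidate != original:
                variants.append(" ".join(words[:i] + [candidate] + words[i + 1:]))
    return variants


def toy_fill_mask(masked_sentence):
    # Stand-in for a real masked language model; returns fixed candidates
    # only when the last word is masked.
    if masked_sentence.endswith("<mask>"):
        return ["demà", "avui"]
    return []


candidates = masked_variants("Quin temps farà avui", toy_fill_mask)
for candidate in candidates:
    print(candidate)
```

The generated candidates would still need human review before being kept, which is the "semi-automatic" part of the process.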
873 sentences selected from our chatbot corpus, not previously published.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
160665 sentences generated with substitution templates for this project, using the municipalities of all Catalan-speaking areas, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
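The templates used to generate the municipality sentences are not included in this description. As a minimal sketch of how substitution templates can work in general, the example below expands each template with every value of its slots. The function name `fill_templates`, the two templates, and the three municipality names are hypothetical and chosen only for illustration.

```python
from itertools import product


def fill_templates(templates, slots):
    """Expand each template with every combination of slot values."""
    sentences = []
    for template in templates:
        # Only use the slots that actually appear in this template.
        names = [name for name in slots if "{" + name + "}" in template]
        for values in product(*(slots[name] for name in names)):
            sentences.append(template.format(**dict(zip(names, values))))
    return sentences


# Hypothetical templates and municipality names, for illustration only.
templates = [
    "Com puc anar a {municipi}?",
    "Quin temps farà demà a {municipi}?",
]
slots = {"municipi": ["Girona", "Alacant", "Maó"]}

sentences = fill_templates(templates, slots)
for sentence in sentences:
    print(sentence)
```

With a few templates and a long list of place names, this kind of expansion quickly yields a large number of sentences, which is consistent with the sentence counts reported for the template-based files.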
20k sentences from "Diccionaris de l'Enciclopèdia", published here under a CC0 licence through the included "CC0 waiver", for use on the Common Voice platform.
1711 sentences created by Secretaria de política lingüística (Linguistic Policy Office, from the Catalan government) for this project, published here for the first time, and covered as CC0 by the included "CC0 waiver" (see SPL_CC0_waiver.pdf)
15451 sentences generated with substitution templates for this project, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
4469 new sentences generated from frases_spl by semi-automatic masking with the bsc/roberta-base-ca-cased transformer model, keeping only the well-formed ones.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
366 literary sentences, published here under a CC0 licence, and edited for well-formedness and idiomaticity.
35244 sentences from Marius Serra's work, extracted with the author's permission.
These sentences are extracted from the following works:
Books:
- La Napeu
- Fora de joc a Montserrat
- Jugar-s'hi la vida
- La novel·la de Sant Jordi
- D'on trec el temps
- Hawaii Lima
- Plans de futur
- L'arca de Babel
- De com s'escriu una novel·la
- Enviar i rebre
- Farsa
- La vida normal
- L'home del sac
- La llegenda de Sant Jordi
- Mon oncle
- Quiet
- Res no és perfecte a Hawaii
- Tres és massa
- Verbàlia
Mots encreuats (crosswords)
Author's blog.
17795 new sentences from Marius Serra's crosswords, provided by the author.
74772 new intent-like sentences, generated with substitution templates, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
58310 sentences from a Catalan newswire. The owner agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode.
2664 intent-like sentences, generated with substitution templates for this project, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
764 sentences from Joan Pujolar's work, extracted from here and here.
21237 sentences from our own corpora and datasets:
- questions from VilaQuAD https://zenodo.org/record/4761430#.YW1KaJuxXOs and ViquiQuAD https://zenodo.org/record/4761412#.YW1KaZuxXOs, that we commissioned.
- hypotheses we commissioned for the TECA dataset: https://zenodo.org/record/4761458#.YW1KWZuxXOs
4154 new sentences from the XitXat corpus, written by our team, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
3818 other sentences from the XitXat corpus, written by our team, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
They have been added to the Common Voice corpus through the Sentence Collector.
18550 sentences generated with substitution templates for this project using Wikidata data, published here for the first time.
The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
49990 sentences randomly selected and translated from wiki.es.txt into Catalan. Not post-edited.
Files that aggregate the files described above:
Contains all the 107k sentences from the files:
- edited_generated_selected_chatbot.txt
- edited_selected_chatbot.txt
- frases_spl.txt
- generades_spl_seleccionades.txt
- more_intents.txt
- plantilles_intents.txt
- selected_club.txt
37428 sentences from:
- frases_toponims_illes.txt
- frases_toponims_valencians.txt
- sentences_from_xitxat_corpus.txt
- wikidata_sentences.txt
Some sentences have been edited or removed while supervising the contents.
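How the aggregate files were built is not documented here. As an illustration only, the sketch below merges several one-sentence-per-line files into a single file. The function name `aggregate`, the dropping of blank lines and exact duplicates, and the sample file names and sentences are assumptions, not a description of the actual procedure.

```python
import tempfile
from pathlib import Path


def aggregate(paths, output_path):
    """Concatenate sentence files (one sentence per line) into one file.

    Blank lines and exact duplicates are dropped; this is an assumption
    of the sketch, not something stated about the real aggregate files.
    """
    seen = set()
    merged = []
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            sentence = line.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                merged.append(sentence)
    Path(output_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged


# Small demonstration with temporary, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_text("Bon dia.\nAdéu.\n", encoding="utf-8")
    second.write_text("Adéu.\n\nFins demà.\n", encoding="utf-8")
    merged = aggregate([first, second], Path(tmp) / "all.txt")

print(merged)
```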
Contribution agreements for the previously published sentences.
- edited_generated_selected_chatbot.txt, edited_selected_chatbot.txt, frases_spl.txt, generades_spl_seleccionades.txt, more_intents.txt, plantilles_intents.txt, selected_club.txt are owned by TeMU-BSC and published here under CC0 licence.
- catalan_government_crawling_frases_seleccionades_filtrades and literatura sources are public under a CC0 licence.