Trouble Extracting Monolingual Datasets from SeamlessAlign
nassergharbi opened this issue · 1 comment
nassergharbi commented
Problem Description
The SeamlessAlign dataset (linked under Dataset Details) makes it difficult to extract Maltese data. Specifically, the metadata for Textual <-> Audio alignment contains two subsets: one seemingly sourced from Common Crawl, with data URLs specified, and another from other corpora, with no URLs specified.
Questions
- Data Retrieval Without URLs:
  - How can one retrieve datasets for which there are no specified URLs?
- Linking Audio to Transcription for Non-Common Crawl Corpora:
  - For datasets from sources other than Common Crawl, how can the link between audio and transcription be established when no URL is provided?
Objective
I am particularly interested in extracting Maltese audio datasets with corresponding transcriptions to establish a gold standard.
Dataset Details
- Dataset Link: https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md
Thank you very much in advance for your support!
nassergharbi commented
Sorry, I've posted this issue in the wrong place. The correct repo is here: facebookresearch/seamless_communication#338