Trouble Extracting Monolingual Datasets from SeamlessAlign

Question

nassergharbi opened this issue 8 months ago · 1 comment

Problem Description

The SeamlessAlign dataset (see Dataset Details below) makes it difficult to extract the Maltese data. Specifically, the metadata for the text <-> audio alignment includes one subset, apparently sourced from Common Crawl, that specifies data URLs, and another subset, from other corpora, that specifies none.

Questions

Data Retrieval Without URLs:
- How can one retrieve datasets for which there are no specified URLs?
Linking Audio to Transcription for Non-Common Crawl Corpora:
- For datasets from sources other than Common Crawl, how can the link between audio and transcription be established when no URL is provided?

Objective

I am particularly interested in extracting Maltese audio datasets with corresponding transcriptions to establish a gold standard.

Dataset Details

Dataset Link: https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md

Thank you very much in advance for your support!

Answer 1 · 2024-01-22T15:58:57.000Z

Sorry, I've posted this issue in the wrong place. The correct repo is here: facebookresearch/seamless_communication#338