XetHub hosted fork of Meta's Seamless Communication models in a monorepo design:
- models/ folder contains all of the model files themselves
- code/ folder is a Git submodule to Meta's repo containing code and documentation
Large files like the model files are hosted by XetHub while source code is still hosted by GitHub using our GitHub app. This ML monorepo design bakes in reproducibility with no workflow changes and simplifies versioning since source code and large files can live in the same logical folder.
This repo contains 41 GB of model files, so double-check the available storage on your target machine before cloning.
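One way to run that storage check from the shell is sketched below. The helper name has_free_space is hypothetical (it is not part of git-xet), it relies only on POSIX df and awk, and it treats 41 GB as 41 GiB with no extra headroom for Git metadata.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-xet): succeed if the filesystem
# holding DIR has at least REQUIRED_GB gibibytes available.
has_free_space() {
    dir=$1
    required_gb=$2
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    required_kb=$((required_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$required_kb" ]
}

# 41 is the model size quoted above; adjust it if you want a safety margin.
if has_free_space . 41; then
    echo "Enough space for a full clone."
else
    echo "Less than 41 GB free; consider a lazy clone or a mount instead."
fi
```

Run it from the directory where you plan to clone, since df reports on the filesystem that holds the given path.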
- Install our tiny git-xet extension for your operating system. This extension lets you pull and push large files in Xet-managed GitHub repos.
- Then clone the repo locally:
git clone git@github.com:xetdata/seamless_monorepo.git
- The download may take a while and you will see output from the git-xet client resembling the following:
git-xet 0.12.5 filter started
Updating files: 100% (39/39), done.
Xet: Retrieving data blocks: 15.34 GiB / 110 MiB/s
Filtering content: 45% (11/24), 4.72 GiB / 70 MiB/s
- The code/ folder is a Git submodule that links to Meta's original repo. Download it by running the following from the root directory of this monorepo:
git submodule update --init --recursive
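To confirm that the submodule was actually fetched, you can inspect the output of git submodule status. The sketch below classifies each status line by its first character, following the prefixes described in the Git documentation. The helper name submodule_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify one line of `git submodule status` output
# by its leading character.
submodule_state() {
    case "$1" in
        -*) echo "not initialized" ;;
        +*) echo "checked out at a different commit than recorded" ;;
        U*) echo "merge conflicts" ;;
        *)  echo "up to date" ;;
    esac
}

# Report the state of every submodule in the current repo.
# IFS= and -r keep the leading space that marks an up-to-date submodule.
git submodule status 2>/dev/null | while IFS= read -r line; do
    echo "$(submodule_state "$line"): $line"
done
```

If the code/ entry is reported as not initialized, rerun the submodule update command from the root of the monorepo.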
Bonus tip: save your SSH passphrase in your keychain so you don't have to enter it 4 times every time you git clone or git push.
If you have limited storage space or don't want to wait for the full download of all the model files, you can use the lazy clone feature baked into our git-xet extension:
git xet clone --lazy git@github.com:xetdata/seamless_monorepo.git
This command downloads all files managed by GitHub directly (like source code and markdown files) and only downloads pointers to larger binary files managed by XetHub.
Use the following command to materialize specific files:
git xet materialize models/seamless-streaming/seamless_streaming_unity.pt
View a full list of currently materialized files using:
git xet lazy show
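If you need more than one file, the materialize command can be wrapped in a small loop. This is a minimal sketch that uses only the git xet materialize invocation shown above, once per path. The function name materialize_all and the DRY_RUN switch are hypothetical additions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: materialize each path given as an argument,
# one `git xet materialize` call per path.
# Set DRY_RUN=1 to print the commands instead of running them.
materialize_all() {
    for path in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "git xet materialize $path"
        else
            # Stop at the first failure so a missing path is not silently skipped.
            git xet materialize "$path" || return 1
        fi
    done
}
```

For example, from inside a lazy clone, materialize_all models/seamless-streaming/seamless_streaming_unity.pt fetches that one file; pass additional paths as further arguments. Setting DRY_RUN=1 first lets you preview the commands without downloading anything.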
You can also mount the entire model repo in just a few seconds. The files you need are fetched behind the scenes as you need them.
git xet mount git@github.com:xetdata/seamless_monorepo.git
Join our Slack community here.
Seamless Expressive models
Meta requires that you register your email with them to use the Seamless Expressive models. You can fill out the form here.
Licenses
Meta's Seamless models have multiple licenses that you need to comply with.
The following non-generative components are MIT licensed as found in MIT_LICENSE:
- Code
- Text-only part of the mExpresso dataset found in the SeamlessExpressive README.
- UnitY2 forced alignment extractor found in the UnitY2 Aligner README.
- Speech toxicity tool with the etox dataset found in the Toxicity README.
The following models are CC-BY-NC 4.0 licensed as found in the LICENSE:
- SeamlessM4T models (v1 and v2).
- SeamlessStreaming models.
The following models are Seamless licensed as found in SEAMLESS_LICENSE:
- Seamless models.
- SeamlessExpressive models.