Introduce Sailor Models and SailCraft Tools
longxudou opened this issue · 1 comment
Thanks @ChenghaoMou for building the excellent text-dedup and presenting the wonderful blog!
We have used your work in the following project.
Sailor Language Models
Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.
See Sailor homepage for more details.
SailCraft Data Toolkit
Leveraging text-dedup, we've built a data processing pipeline tool called SailCraft.
It consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning.
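To make the four-stage flow concrete, here is a toy sketch in plain Python. It is not SailCraft's code and does not use the text-dedup API: every function name, threshold, and the choice of line-level exact matching is an assumption made for illustration. A brute-force Jaccard comparison over word 3-grams stands in for whatever near-deduplication method the real tool uses.

```python
import hashlib
import re


def clean(docs, min_chars=20):
    """Cleaning pass (hypothetical rules): normalize whitespace per line,
    drop empty lines, and drop documents shorter than min_chars."""
    out = []
    for doc in docs:
        lines = [re.sub(r"\s+", " ", line).strip() for line in doc.splitlines()]
        text = "\n".join(line for line in lines if line)
        if len(text) >= min_chars:
            out.append(text)
    return out


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedup(docs, threshold=0.7):
    """Keep a document only if it is not too similar to any kept document.
    Brute force, O(n^2): fine for a demo, not for a real corpus."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def exact_dedup_lines(docs):
    """Assumed line-level exact dedup: drop any line already seen earlier
    in the corpus, identified by its SHA-256 digest."""
    seen, out = set(), []
    for doc in docs:
        kept_lines = []
        for line in doc.splitlines():
            digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept_lines.append(line)
        out.append("\n".join(kept_lines))
    return out


def run_pipeline(docs):
    """Initial cleaning, near dedup, exact dedup, second cleaning."""
    docs = clean(docs)
    docs = near_dedup(docs)
    docs = exact_dedup_lines(docs)
    return clean(docs)  # second pass drops documents that became too short
```

In this sketch the second cleaning pass matters because exact deduplication can strip repeated lines (for example a shared footer) and leave a document too short to be worth keeping.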
Many thanks for your contribution to open research!
Thank you so much for letting me know and congratulations on your model release! I will update this repo shortly.