ChenghaoMou/awesome-data-deduplication

Introduce Sailor Models and SailCraft Tools

longxudou opened this issue · 1 comment

Thanks @ChenghaoMou for building the excellent text-dedup and for writing the wonderful blog post!

We used your work in the following project.

Sailor Language Models

Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.
See the Sailor homepage for more details.

SailCraft Data Toolkit

Building on text-dedup, we've created a data processing pipeline called SailCraft.
It consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning.
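
To make the stage order concrete, here is a minimal Python sketch of a four-stage pipeline of this shape. It is not SailCraft's or text-dedup's actual code or API: the function names (`clean`, `near_dedup`, `exact_dedup`, `run_pipeline`) and the 0.7 threshold are hypothetical, the cleaning step only normalizes whitespace, and near deduplication uses a plain pairwise Jaccard comparison over word 3-grams as a stand-in for whatever method the real toolkit uses.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip the ends (stand-in for real cleaning rules)."""
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams for a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def near_dedup(docs, threshold=0.7):
    """Keep a document only if its Jaccard similarity to every kept one is below threshold.

    Pairwise comparison is quadratic; it is used here only to keep the sketch short.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(len(s & t) / len(s | t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def exact_dedup(docs):
    """Drop documents whose SHA-256 digest has already been seen."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def run_pipeline(docs, threshold=0.7):
    """Run the four stages in the order described above."""
    docs = [c for c in map(clean, docs) if c]   # stage 1: initial cleaning
    docs = near_dedup(docs, threshold)          # stage 2: near deduplication
    docs = exact_dedup(docs)                    # stage 3: exact deduplication
    return [c for c in map(clean, docs) if c]   # stage 4: second cleaning pass


if __name__ == "__main__":
    sample = [
        "Sailor is a suite of open language models for South-East Asia.",
        "Sailor  is a suite of open language models for South-East Asia.  ",
        "Sailor is a suite of open language models for South-East Asia!",
        "Thai and Vietnamese are covered too.",
        "",
    ]
    for doc in run_pipeline(sample):
        print(doc)
```

In this toy version the exact-deduplication stage removes little that the near-deduplication stage has not already caught; it is kept only to mirror the stage order listed above.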

Many thanks for your contribution to open research!

Thank you so much for letting me know and congratulations on your model release! I will update this repo shortly.