csebuetnlp/banglabert

The websites the dataset was scraped from?

imr555 opened this issue · 1 comment

Since the Alexa web rankings (https://www.alexa.com/topsites/countries/BD) shut down in May 2022, it is no longer possible to retrieve the names of the Bangladeshi websites used.

It would be really useful if the names of the fifty Bangladeshi websites the dataset was scraped from could be released. This would help in understanding the nature of the dataset used to train the model, and would also help with model interpretability experiments.

Pretraining data sources have been enumerated in the appendix of our paper.