The websites the dataset was scraped from?
imr555 opened this issue · 1 comment
imr555 commented
Since the Alexa web rankings were shut down in May 2022 (https://www.alexa.com/topsites/countries/BD), it is no longer possible to retrieve the names of the Bangladeshi websites used.
It would be really useful if the names of the fifty Bangladeshi websites used to scrape the dataset could be released. This would help in understanding the nature of the dataset used to train the model and would also help with model interpretability experiments.
abhik1505040 commented
Pretraining data sources have been enumerated in the appendix of our paper.