This script processes a Wikipedia dump into a dataframe that can be used as model input. The Tulu Wikipedia (tcywiki) dump is used here:
https://dumps.wikimedia.org/tcywiki/20230320/
If you prefer to work on Kaggle, see https://www.kaggle.com/code/moreducks/wikipedia-topic-classfication-dataset-prep
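
As a rough illustration of the dump-to-dataframe step, here is a minimal sketch that is not necessarily how this script does it. It assumes the dump is a standard MediaWiki XML export (`<page>` elements containing `<title>` and `<text>`), possibly bz2-compressed. The names `iter_pages`, `read_dump`, `local_name` and the `SAMPLE` data are hypothetical and exist only for this example.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one {'title', 'text'} dict per <page> in a MediaWiki XML export.

    Assumption: each page carries a single revision, as in a
    'pages-articles' style dump.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield {"title": title, "text": text}
            title, text = None, ""
            elem.clear()  # release the finished page to keep memory use low


def read_dump(path):
    """Read a bz2-compressed dump file into a list of row dicts."""
    with bz2.open(path, "rb") as stream:
        return list(iter_pages(stream))


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example A</title>
    <revision><text>First body</text></revision>
  </page>
  <page>
    <title>Example B</title>
    <revision><text>Second body</text></revision>
  </page>
</mediawiki>"""

rows = list(iter_pages(io.BytesIO(SAMPLE)))
print(rows)
```

With pandas installed, `pandas.DataFrame(rows)` turns the resulting list of dicts into a dataframe with `title` and `text` columns. For a downloaded dump, `read_dump(path)` would take the path to the `.xml.bz2` file.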