Samskritam

This repository contains resources and tools for building language models and tokenizers for Sanskrit, an ancient Indian language with a rich literary tradition. The project aims to support natural language processing and language understanding for Sanskrit texts.

Features

Sanskrit Language Models: Pre-trained language models for Sanskrit based on transformer architectures like BERT, GPT, and others.

Sanskrit Tokenizers: Efficient tokenizers for Sanskrit, handling the unique challenges of the Sanskrit writing system and its complex orthography.

Data Preprocessing: Scripts and utilities for cleaning, normalizing, and preparing Sanskrit text data for training language models.

Evaluation Benchmarks: Datasets and evaluation scripts for testing the performance of Sanskrit language models on various NLP tasks.
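
To illustrate the kind of cleaning and normalization step that data preprocessing for Sanskrit typically involves, here is a minimal Python sketch. It is not this repository's code: the function names `normalize_sanskrit` and `split_on_dandas` are hypothetical, and the sketch assumes Devanagari input in which the danda (।, U+0964) and double danda (॥, U+0965) mark sentence or verse boundaries. It uses only the Python standard library.

```python
import re
import unicodedata
from typing import List

# Hypothetical helpers for illustration only; not part of this repository.
DANDAS = re.compile("[\u0964\u0965]+")  # danda (।) and double danda (॥)
WHITESPACE = re.compile(r"\s+")


def normalize_sanskrit(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return WHITESPACE.sub(" ", text).strip()


def split_on_dandas(text: str) -> List[str]:
    """Split normalized text into segments at dandas, dropping empty segments."""
    parts = DANDAS.split(normalize_sanskrit(text))
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    for segment in split_on_dandas("रामः  वनं गच्छति। सीता अपि गच्छति॥"):
        print(segment)
```

A real pipeline would likely add further steps, such as transliteration between scripts or handling of sandhi, which this sketch does not attempt.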