Methods Included:
- Code Mixing Index
- Multilingual Index
- Language Entropy
- I-index : Prob of Switching
- Burstiness
- Span Entropy
- Memory
- SPAvg : returns switching points. calculate avg over corpus
>> from cs_metrics import *
>> sample = 'EN EN HI HI UNIV UNIV HI HI EN EN EN HI HI'
>> cmi(sample)
>> 45.45454545454546
>> mindex(sample)
>> 0.9836065573770497
>> lang_entropy(sample)
>> 0.9940302114769565
>> burstiness(sample)
>> -0.4835086004775133
Sample code:
import pandas as pd
from run import calc
df = pd.read_csv('data_combined_get_splits_v1.csv')
df = df.dropna(subset=['langtags'])
df['langtags'] = df['langtags'].apply(eval)
# we also need to filter the df to remove monolingual sents and sents with only 1 word
df['langtags'] = df['langtags'].apply(lambda x: np.nan if not set(['hi', 'en']).issubset(x) else x)
df = df.dropna(subset=['langtags'])
print(calc(df.langtags.values[0], 'switch_surprisal'))
# see list of all supported functions in file run.py
df['langtags'] = df['langtags'].apply(lambda x: calc(x, 'switch_surprisal'))
To Dos:
[ ] Take list or str as input. Implemented for I-index
[ ] case insensitive lang_tags, other_tags. Implemented for I-index
[ ] take num of languages as an input argument
[ ] take num of other tags as an input argument