Language identification is the task of determining the language by looking at a sample of text.
Goals:
- build multiple models for identifying the programming language of a given file
- compare their accuracy and performance
Non-goals: ignoring vendored & generated code, overriding results with user settings
Input: file name, content (or just a sample?) of the file. Output: class (prose, documentation, code) and the programming language name.
Go bindings: inference should be possible from Go code.
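A minimal sketch of the intended interface (names are hypothetical, not a settled API):

```python
from typing import NamedTuple, Optional

class Identification(NamedTuple):
    file_class: str          # "prose" | "documentation" | "code"
    language: Optional[str]  # e.g. "Go"; None unless file_class == "code"

def identify(filename: str, content: bytes) -> Identification:
    """Single entry point, so Go bindings only need to wrap one call.

    A trained model would be loaded behind this function; content may
    be the whole file or just a sample.
    """
    raise NotImplementedError
```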
- collect data
- Vowpal Wabbit
- fastText
- move to scikit-learn:
  - binary classification
  - decision trees
  - visualize precision/recall/AUC/performance
- move to TensorFlow:
  - feed-forward NN
  - use predictions from Go
  - RNN
  - Go bindings (for predictions)
- move to PyTorch
brew install rbenv ruby-build cmake icu4c
rbenv install 2.4.2
rbenv global 2.4.2
rbenv version
ruby --version
gem install bundler
LDFLAGS="-L/usr/local/opt/icu4c/lib" CPPFLAGS="-I/usr/local/opt/icu4c/include" gem install github-linguist
brew install jq q vowpal-wabbit
git clone https://github.com/facebookresearch/fastText.git fasttext
pushd fasttext && make -j4 && popd
- clone the list of projects
- for each: run linguist and annotate every file
- separate train/test datasets
Experiments
linguist --json | jq 'keys[]'
linguist --json | jq '."Objective-C"'
linguist --json | jq -r 'keys[] as $k | "\($k); \(.[$k][])"' | less
linguist --json | jq --arg pwd "$PWD" -r 'keys[] as $k | "\($k);\(.[$k][])"' | awk -F';' -v pwd="$PWD" '{print $1 ";" pwd "/" $2}' > files.csv
read and vectorize input (see the sketch after this list):
- filename
- file extension
- shebang
- 1-gram
- 2-gram
- 3-gram
- words/tokens
- bytes/integers
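A rough sketch of such a vectorizer (the feature naming and the sample cap are illustrative assumptions):

```python
import os
from collections import Counter

def vectorize(filename: str, content: str) -> dict:
    """One file -> sparse feature dict: file name, extension, shebang,
    plus character n-gram counts (n = 1..3) over a content sample."""
    feats = Counter()
    base = os.path.basename(filename)
    feats["name:" + base] = 1
    ext = os.path.splitext(base)[1]
    if ext:
        feats["ext:" + ext] = 1
    first_line = content.split("\n", 1)[0]
    if first_line.startswith("#!"):
        feats["shebang:" + first_line[2:].strip()] = 1
    sample = content[:4096]  # a prefix sample; full content is optional
    for n in (1, 2, 3):
        for i in range(len(sample) - n + 1):
            feats["%d-gram:%s" % (n, sample[i:i + n])] += 1
    return dict(feats)
```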
- semi-supervised, annotated by linguist/enry: 132 MB, 4,200 files in 18 languages
# list of URLs in ./dataset-1/repos.txt
./dataset-1/clone_and_annotate_each_file.sh
- RosettaCodeData: 200 MB, 64,745 files in 651 languages
cd ./dataset-2/
git clone https://github.com/bzz/RosettaCodeData; cd RosettaCodeData
# Install RosettaCode without tests: https://gist.github.com/vifo/2718520
./cpannti.sh "RosettaCode"
rosettacode sync .
- fenced code blocks from README.md files: 904 MB, 3.5M snippets in 287 "langs" (freq > 50). Extracted on OSD-5, improved on OSD-11.
cd ./dataset-3/
gsutil -m cp "gs://srcd-production-dataproc/fenced_code_blocks_json_parsed_top_langs.gz" .
# collect data, get languages
./clone_and_annotate_each_file.sh
# stats: lang, number of lines, number of files
q -d";" "SELECT c1, SUM(c3) as s, COUNT(1) as cnt FROM ./annotated_files.csv GROUP BY c1 ORDER BY cnt DESC"
q -d";" "SELECT c1, SUM(c3) as s, COUNT(1) as cnt FROM ./annotated_files.csv GROUP BY c1 ORDER BY s DESC"
OAA (one-against-all) multiclass classification with logistic regression
https://github.com/JohnLangford/vowpal_wabbit/wiki/One-Against-All-(oaa)-multi-class-example
Features:
- file ext
- file name
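VW's input format is `label |namespace feature[:value] ...`, one example per line; presumably `extract_features_vw.py` emits something like the following (the exact feature names are an assumption; labels are 1-based class ids, as `--oaa` expects):

```
3 |f ext_go name_main.go
7 |f ext_py name_setup.py
```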
# extract features, convert to https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format#input-format
./extract_features_vw.py ./annotated_files.csv
# shuffle
./extract_features_vw.py ./annotated_files.csv | perl -MList::Util=shuffle -e 'print shuffle(<>);' > train_features.vw
# split train/validate
python split.py train_features.vw train_split.vw test_split.vw -p 0.8 -r popa2
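`split.py` is a local helper; a minimal sketch of what it presumably does, with flags matching the calls in this document (`-p` train proportion, `-r` random seed):

```python
#!/usr/bin/env python
import argparse, random

parser = argparse.ArgumentParser(description="shuffle and split a line-based dataset")
parser.add_argument("input")
parser.add_argument("train")
parser.add_argument("test")
parser.add_argument("-p", type=float, default=0.8, help="train proportion")
parser.add_argument("-r", default="", help="random seed")
args = parser.parse_args()

random.seed(args.r)  # seeding with a string is fine in Python
lines = open(args.input).readlines()
random.shuffle(lines)
cut = int(len(lines) * args.p)
open(args.train, "w").writelines(lines[:cut])
open(args.test, "w").writelines(lines[cut:])
```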
# train, for 19 languages
vw -d train_split.vw --oaa 19 --loss_function logistic -f trained_vw.model
# individual prediction
vw -t -i trained_vw.model
# test
vw -t -i trained_vw.model test_split.vw -p test_split.predict
# P@1, R@1, AUC using perf from http://osmot.cs.cornell.edu/kddcup/software.html (test_split.gold holds the true labels, one per line)
vw -d test_split.vw -t -i trained_vw.model -r /dev/stdout | perf -roc -files test_split.gold /dev/stdin
# get the labels into *.actual (correct) file
cut -d' ' -f1 playtennis.txt > playtennis.actual
# paste the actual vs predicted side-by-side (+ clean up trailing zeroes)
paste playtennis.actual playtennis.predict | sed 's/\.0*$//' > playtennis.ap
# convert 1's to 0's and 2's to 1's:
perl -pe 's/1/0/g; s/2/1/g;' playtennis.ap > playtennis.ap01
# run perf to determine precision, recall and F-measure:
perf -PRE -REC -PRF -file playtennis.ap01
PRE 1.00000 pred_thresh 0.500000
REC 0.80000 pred_thresh 0.500000
PRF 0.88889 pred_thresh 0.500000
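Sanity check: F = 2·P·R / (P + R) = 2 · 1.0 · 0.8 / 1.8 ≈ 0.88889, matching PRF above.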
# AUC using https://github.com/zygmuntz/kaggle-amazon/blob/master/auc.py
pip install ml_metrics
wget https://raw.githubusercontent.com/zygmuntz/kaggle-amazon/master/auc.py
python auc.py test_split.vw test_split.predict
> AUC: 0.0614430665163
From https://arxiv.org/abs/1607.01759: a linear model with a rank constraint and a fast loss approximation, trained with stochastic gradient descent and a linearly decaying learning rate; CBOW-like, using n-grams via the 'hashing trick'.
- https://github.com/facebookresearch/fastText#text-classification
- https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md#getting-and-preparing-the-data
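The 'hashing trick' keeps the model small: n-grams are hashed into a fixed number of buckets instead of a full vocabulary (this is what fastText's `-bucket` option below controls). A minimal illustration, using CRC32 in place of fastText's internal hash:

```python
import zlib

def ngram_bucket(gram: str, buckets: int = 200000) -> int:
    # Map an n-gram to one of `buckets` embedding rows; unseen n-grams
    # still land on a (shared) trained row instead of being dropped.
    return zlib.crc32(gram.encode("utf-8")) % buckets
```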
TODO:
Features:
- full text
Based on github/linguist
# format input from `annotated_files.csv` to `__label__N <token1> <token2> ...`
./extract_features_fastText.py annotated_files.csv | perl -MList::Util=shuffle -e 'print shuffle(<>);' > repos-files.txt
# pre-process
cat repos-files.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > repos-files.preprocessed.txt
# split
python split.py repos-files.txt repos-files.train repos-files.valid -p 0.7 -r dupa
# or
wc -l repos-files.txt
head -n 3000 repos-files.txt > repos-files.train
tail -n 1221 repos-files.txt > repos-files.valid
# train
fasttext supervised -input repos-files.train -output trained_fastText.model
## original
Number of words: 323748
Number of labels: 16
Progress: 100.0% words/sec/thread: 2752590 lr: 0.000000 loss: 0.757425 eta: 0h0m
## pre-processed
Read 2M words
Number of words: 315588
Number of labels: 16
Progress: 100.0% words/sec/thread: 2767002 lr: 0.000000 loss: 0.848436 eta: 0h0m
# individual predictions, top5
fasttext predict trained.model.bin.bin - 5
# test + P@1, R@1
fasttext test trained_fastText.model.bin repos-files.valid
N 1217
P@1 0.892
R@1 0.892
# 25 epochs
N 1226
P@1 0.971
R@1 0.971
# 1.0 lr
N 1226
P@1 0.983
R@1 0.983
# 25 epochs + 1.0 lr
~/floss/fastText/fasttext supervised -input repos-files.train -epoch 25 -lr 1.0 -output trained.model.bin
N 1226
P@1 0.991
R@1 0.991
# same + sub-word ngrams
fasttext supervised -input repos-files.train -epoch 25 -lr 1.0 -maxn 6 -minn 3 -output trained.model.ngram
fasttext test trained.model.ngram.bin repos-files.valid 10
N 1221
P@10 0.0999
R@10 0.999
Number of examples: 1221
fasttext print-word-vectors trained.model.ngram.bin
# not all-zero vectors for unknown words (sub-word n-grams cover them)
~/floss/fastText/fasttext supervised -bucket 200000 -minn 3 -maxn 4 -input repos-files.train -output result/repos-files-3-4-200000 -lr 1.0 -epoch 25
~/floss/fastText/fasttext test result/repos-files-3-4-200000.bin repos-files.valid 5
~/floss/fastText/fasttext predict-prob result/repos-files-3-4-200000.bin test_langid.txt
fasttext skipgram -input repos-files-and.train -output embeddings-files-and
- visualize embeddings for files in the Embedding Projector; from the `_fastText` directory:
export filename="4k-files"
./extract_features_fastText.py ../dataset-1/annotated_files.csv | perl -MList::Util=shuffle -e 'print shuffle(<>);' > ${filename}.txt
# or using 10-line chunks
./extract_features_fastText.py --chunks 10 ../dataset-1/annotated_files.csv | perl -MList::Util=shuffle -e 'print shuffle(<>);' > ${filename}.txt
python ../split.py ${filename}.txt ${filename}-path.train ${filename}-path.valid -p 0.7 -r dupa
# strip path before training
cat ${filename}-path.train | cut -d "|" -f 2- > ${filename}.train
cat ${filename}-path.valid | cut -d "|" -f 2- > ${filename}.valid
~/floss/fastText/fasttext supervised -minn 3 -maxn 4 -bucket 200000 -dim 50 -input ${filename}.train -output result/${filename}-3-4-200000-50 -lr 1.0 -epoch 25
~/floss/fastText/fasttext test result/${filename}-3-4-200000-50.bin ${filename}.valid 1
./prepare_visualization.py ${filename}-path.valid
~/floss/fastText/fasttext print-sentence-vectors result/${filename}-3-4-200000-50.bin < ${filename}-no-chunks-path-nolabel.txt | tr " " "\t" > ${filename}-no-chunks-docs.tsv
~/floss/fastText/fasttext quantize -input result/${filename}-3-4-200000-50.bin -output result/${filename}-3-4-200000-50 -qnorm -retrain -cutoff 100000
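`prepare_visualization.py` above presumably writes the metadata TSV that the Embedding Projector needs alongside the sentence-vector TSV; a minimal sketch under that assumption (input lines are `path|__label__N token ...`, as produced before the path is stripped):

```python
#!/usr/bin/env python
import sys

# one metadata row per sentence vector, in the same order as the vectors file
with open("metadata.tsv", "w") as out:
    out.write("label\tpath\n")
    for line in open(sys.argv[1]):
        path, rest = line.split("|", 1)
        label = rest.split()[0].replace("__label__", "")
        out.write("%s\t%s\n" % (label, path))
```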
- plot precision / train on GitHub .md data
Shallow feed-forward model, replicating fastText/CLD3 results (see the Keras sketch after the links below)
- https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
- https://github.com/saffsd/langid.py
- http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
- https://github.com/google/cld3 / Natural Language Processing with Small Feed-Forward Networks
- https://github.com/poliglot/fasttext
- https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py
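A minimal Keras sketch of such a shallow model, in the spirit of the imdb_fasttext example above (averaged token embeddings feeding a softmax; all sizes are placeholders):

```python
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

max_features = 20000  # vocabulary / hash-bucket size
maxlen = 400          # tokens kept per file sample
num_classes = 19      # target languages

model = Sequential([
    Embedding(max_features, 50, input_length=maxlen),  # token -> 50-d vector
    GlobalAveragePooling1D(),                          # CBOW-style averaging
    Dense(num_classes, activation="softmax"),          # language probabilities
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```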
- make `clone_and_annotate_each_file.sh` pull if a repo in `./repos` already exists
- parallelize data collection with GNU parallel or equivalent
- plot AUC http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
- corner cases: Objective-C vs C vs C++, or files labeled via `.gitattributes`
- add text, markdown, etc.