polyglot

Detect Program Language from Source Code Using Naive Bayes Classifier

Programming language savant. It detects programming languages much like github/linguist, but is based on a naive Bayes classifier.

Getting Started

Install using pip

pip install git+https://github.com/polyrabbit/polyglot

First, we need to train polyglot on a multilingual training corpus. Each folder in the corpus should contain files of a single language, and the folder name identifies that language. e.g.

polyglot train --corpus=./corpus --ngram=3 --verbose --output=./model.json

A pre-included model.json is generated using the above command. Run polyglot train --help for usage specifics.

After training, we can use the Naive Bayes classifier to classify a given file. e.g.

echo import os | polyglot classify --ngram=3 --top=3 --verbose --model=./model.json -

This outputs the top 3 most likely languages in descending order of score:

[(u'Python', 6.719828065958895), (u'Frege', -11.021531184412824), (u'Objective-C++', -13.244791737113022)]

Run polyglot classify --help for usage specifics.

Classification Algorithm

  1. Lex the input string into tokens, and generate n-grams from those tokens (trigrams by default)

     "#include<stdio.h>".lex().ngram(max_n=3)
     
     =>[['#'], ['include'], ['<'], ['stdio'], ['.'], ['h'], ['>'], ['#', 'include'], ['include', '<'], ['<', 'stdio'], ['stdio', '.'], ['.', 'h'], ['h', '>'], ['#', 'include', '<'], ['include', '<', 'stdio'], ['<', 'stdio', '.'], ['stdio', '.', 'h'], ['.', 'h', '>']]
    
  2. Compute the probability of a language given a token

     P(lang | token) 
     
     = P(token | lang) * P(lang) / P(token)
    
        n_token_on_lang(token, lang)     n_lang_tokens(lang)     n_token(token)
     = ------------------------------ * -------------------- /  ---------------
        n_lang_tokens(lang)              n_tokens()              n_tokens()
    
        n_token_on_lang(token, lang) 
     = ------------------------------
        n_token(token)
    
  3. Combine the individual probabilities

     P(lang | tok_1, tok_2, tok_3...tok_n)
     
       P(tok_1, tok_2, tok_3...tok_n | lang) * P(lang)
     = --------------------------------------------
               P(tok_1, tok_2, tok_3...tok_n)
    
       (naively assume that tokens are independent of each other)
    
       P(tok_1|lang) * P(tok_2|lang) * P(tok_3|lang) ... P(tok_n|lang) * P(lang)
     = ---------------------------------------------------------------------
          P(tok_1) * P(tok_2) * P(tok_3) ... P(tok_n)
     
             P(tok|lang)   P(lang|tok)
           ( ----------- = ----------- )
               P(tok)        P(lang)
    
       P(lang|tok_1) * P(lang|tok_2) * P(lang|tok_3) ... P(lang|tok_n) * P(lang)
     = ---------------------------------------------------------------------
                            P(lang)^N
    
  4. Deal with rare words by assigning unseen tokens a small fallback probability

     P(unseen_token) = 1.0/(n_all_tokens()+1)
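The four steps above can be sketched in Python. This is a minimal illustration rather than polyglot's actual implementation: the tokenizer regex, the class and method names, and the add-one smoothing inside `classify` (used to avoid log-of-zero for tokens seen only in other languages) are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def lex(text):
    # Split source text into word tokens and single punctuation characters
    return re.findall(r'\w+|[^\w\s]', text)

def ngrams(tokens, max_n=3):
    # Generate all n-grams of length 1..max_n as tuples
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams

class NaiveBayesClassifier:
    def __init__(self, max_n=3):
        self.max_n = max_n
        self.token_lang_count = defaultdict(Counter)  # n_token_on_lang(token, lang)
        self.lang_tokens = Counter()                  # n_lang_tokens(lang)
        self.token_count = Counter()                  # n_token(token)
        self.total = 0                                # n_tokens()

    def train(self, lang, text):
        for gram in ngrams(lex(text), self.max_n):
            self.token_lang_count[gram][lang] += 1
            self.lang_tokens[lang] += 1
            self.token_count[gram] += 1
            self.total += 1

    def classify(self, text, top=3):
        grams = ngrams(lex(text), self.max_n)
        scores = {}
        for lang in self.lang_tokens:
            p_lang = self.lang_tokens[lang] / self.total
            # Work in log space to avoid float underflow:
            # log P(lang) + sum_i [log P(lang|tok_i) - log P(lang)]
            score = math.log(p_lang)
            for gram in grams:
                if self.token_count[gram]:
                    # P(lang|tok) = n_token_on_lang / n_token, with add-one
                    # smoothing (an assumption) so it is never exactly zero
                    p = (self.token_lang_count[gram][lang] + 1) / (self.token_count[gram] + 1)
                else:
                    # Rare-word fallback: P(unseen_token) = 1/(n_tokens()+1)
                    p = 1.0 / (self.total + 1)
                score += math.log(p) - math.log(p_lang)
            scores[lang] = score
        # Top languages in descending order of log-score
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

clf = NaiveBayesClassifier()
clf.train('Python', 'import os\nimport sys\nprint(os.path)')
clf.train('C', '#include <stdio.h>\nint main(void) { return 0; }')
print(clf.classify('import json'))
```

Scores are sums of log-probabilities, which is why the real output above mixes positive and negative numbers; only their relative order matters.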
    

References

  1. A Plan For Spam
  2. Naive Bayes spam filtering
  3. Implementation of naive bayesian spam filter algorithm
  4. How To Build a Naive Bayes Classifier

Discussion on V2EX