nlp-reading-group
These are some experiments with NLP methods applied to software (source code, comments, etc.).
Preprocessing
===
Tokenization
- Simple Python 2.7 tokenizer based on Pygments: https://github.com/bvasiles/nlp-reading-group/blob/master/simplePyLex.py. It currently handles only Python, but it can be extended very easily.
Takes as input: (1) the path to the folder containing all the source code to be tokenized; (2) the filename extension of the files to be tokenized (always \*.py in this prototype); (3) the path to the output file.
When run on the Django code base (download or clone the repository first), it produces this output.
Example: python simplePyLex.py ./data/django \*.py ./data/django.code.lexed.txt
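
The core lexing step can be sketched roughly as follows. This is a minimal illustration, not the code in simplePyLex.py: it is written for Python 3 (the script targets Python 2.7), the helper name `lex_source` is hypothetical, and dropping comments and whitespace is an assumption about what a tokenized corpus should contain.

```python
from pygments.lexers import PythonLexer
from pygments.token import Token


def lex_source(source):
    """Return the token strings of a Python source string.

    Hypothetical helper: comments and whitespace-only tokens are dropped,
    which is an assumption and may differ from what simplePyLex.py does.
    """
    tokens = []
    for token_type, value in PythonLexer().get_tokens(source):
        # "in" also matches subtypes such as Token.Comment.Single.
        if token_type in Token.Comment:
            continue
        # Skip spaces, indentation, and newlines.
        if not value.strip():
            continue
        tokens.append(value)
    return tokens


sample = "def add(a, b):\n    return a + b  # sum\n"
tokens = lex_source(sample)
print(" ".join(tokens))
```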
n-grams
===
- Operations on bigrams in Python using NLTK: https://github.com/bvasiles/nlp-reading-group/blob/master/bigrams.py
Takes as input a tokenized corpus. Illustrates possible operations on n-grams using NLTK.
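
As a rough idea of the kind of operations involved, here is a small sketch using NLTK on a toy token sequence. It is not taken from bigrams.py: the sample tokens are made up, and the choice of operations (bigram counts and a conditional frequency distribution) is an assumption about what is worth illustrating.

```python
import nltk

# Toy tokenized corpus (made up for illustration); in practice this would be
# read from a lexed file such as the tokenizer's output.
tokens = (
    "def add ( a , b ) : return a + b "
    "def sub ( a , b ) : return a - b"
).split()

# All adjacent token pairs.
bigrams = list(nltk.bigrams(tokens))

# How often each bigram occurs.
freq = nltk.FreqDist(bigrams)
print(freq.most_common(3))

# For each token, a distribution over the tokens that follow it.
cfd = nltk.ConditionalFreqDist(bigrams)
print(cfd["return"].most_common())
```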