Emacs Lisp library that automatically detects the programming language in a buffer or string. Implemented as a random forest classifier, trained in scikit-learn and deployed to Emacs Lisp.
M-x language-detection-buffer
- Interactive function, outputs the buffer's language to *Messages*
- When called non-interactively, returns the language
(language-detection-string STRING)
- Non-interactive function, returns the language of STRING
Language classification accuracy of this model compared to a few existing systems on three different datasets:
Stack Overflow | Rosetta Code | Github Linguist | |
---|---|---|---|
language-detection.el | 73% | 84% | 87% |
Highlight.js | 36% | 54% | 56% |
Pygments | 10% | 25% | 37% |
SourceClassifier | 51% | 48% | 65% |
The actual random forest lookup is fast, the bottleneck is regular expression-based tokenization.
Function Name Call Count Elapsed time Average Time
language-detection-buffer 441 8.0538804580 0.0182627674
language-detection-string 441 8.0525858239 0.0182598318
language-detection-tokenize-string 441 6.2478335339 0.0141674229
language-detection-token-frequencies 441 1.6420298549 0.0037234237
language-detection-forest-lookup 441 0.0973103440 0.0002206583
language-detection-tree-lookup 4851 0.0885046049 1.824...e-05
- ada
- awk
- c
- clojure
- cpp
- csharp
- css
- dart
- delphi
- emacslisp
- erlang
- fortran
- fsharp
- go
- groovy
- haskell
- html
- java
- javascript
- json
- latex
- lisp
- lua
- matlab
- objc
- perl
- php
- prolog
- python
- r
- ruby
- rust
- scala
- shell
- smalltalk
- sql
- swift
- visualbasic
- xml
Write this in your .emacs:
(require 'cl-lib)
(defun eww-tag-pre (dom)
(let ((shr-folding-mode 'none)
(shr-current-font 'default))
(shr-ensure-newline)
(insert (eww-fontify-pre dom))
(shr-ensure-newline)))
(defun eww-fontify-pre (dom)
(with-temp-buffer
(shr-generic dom)
(let ((mode (eww-buffer-auto-detect-mode)))
(when mode
(eww-fontify-buffer mode)))
(buffer-string)))
(defun eww-fontify-buffer (mode)
(delay-mode-hooks (funcall mode))
(font-lock-default-function mode)
(font-lock-default-fontify-region (point-min)
(point-max)
nil))
(defun eww-buffer-auto-detect-mode ()
(let* ((map '((ada ada-mode)
(awk awk-mode)
(c c-mode)
(cpp c++-mode)
(clojure clojure-mode lisp-mode)
(csharp csharp-mode java-mode)
(css css-mode)
(dart dart-mode)
(delphi delphi-mode)
(emacslisp emacs-lisp-mode)
(erlang erlang-mode)
(fortran fortran-mode)
(fsharp fsharp-mode)
(go go-mode)
(groovy groovy-mode)
(haskell haskell-mode)
(html html-mode)
(java java-mode)
(javascript javascript-mode)
(json json-mode javascript-mode)
(latex latex-mode)
(lisp lisp-mode)
(lua lua-mode)
(matlab matlab-mode octave-mode)
(objc objc-mode c-mode)
(perl perl-mode)
(php php-mode)
(prolog prolog-mode)
(python python-mode)
(r r-mode)
(ruby ruby-mode)
(rust rust-mode)
(scala scala-mode)
(shell shell-script-mode)
(smalltalk smalltalk-mode)
(sql sql-mode)
(swift swift-mode)
(visualbasic visual-basic-mode)
(xml sgml-mode)))
(language (language-detection-string
(buffer-substring-no-properties (point-min) (point-max))))
(modes (cdr (assoc language map)))
(mode (cl-loop for mode in modes
when (fboundp mode)
return mode)))
(message (format "%s" language))
(when (fboundp mode)
mode)))
(setq shr-external-rendering-functions
'((pre . eww-tag-pre)))
The model consists of a Random Forest ensemble of 11 Decision Tree classifiers.
It was trained on code snippets from the Stack Overflow data dump, extracted from <pre>
tags. The number of snippets per language was capped at 10,000. The actual training snippets can be unpickled from snippets.cpkl
.
The snippets were tokenized simple regex: ([a-zA-Z0-9_]+|[^ a-zA-Z0-9_\n\t]+)
. The set of single tokens and pairs of consecutive tokens are pruned to the top 500 token/token pairs per language.
Finally, the tokens are counted for frequency and placed in a 7772-dimensional bag-of-words vector. (7772 is the total number of unique tokens/pairs.) For example, this is a bag corresponding to a short Haskell snippet:
" : 0.057971
( : 0.0144928
) : 0.0144928
-> : 0.0289855
-> IO : 0.0144928
. : 0.0289855
. h : 0.0289855
:: : 0.0289855
IO : 0.0144928
h : 0.0289855
import : 0.0289855
For classification I use a Random Forest with 11 decision trees. The hyperparameters were found using random grid search. Here's a confusion matrix:
The decision trees are written out as Emacs Lisp arrays and traversed recursively until leaves are found in each tree.