hyperpolyglot

A fast programming language detector.

Hyperpolyglot is a fast programming language detector written in Rust, based on GitHub's Linguist Ruby library. It can detect the programming language of a single file or the programming language makeup of a directory. For more details on how the language detection is done, see the Linguist README.

CLI

Installing

cargo install hyperpolyglot

Usage

hyply [PATH]

Output

85.00% Rust
15.00% RenderScript

Library

Adding as a dependency

[dependencies]
hyperpolyglot = "0.1.0"

Detect

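use hyperpolyglot::Detection;
use std::path::Path;

// detect returns a Result; unwrap it before comparing.
// The Detection variant names the strategy that matched (here, the heuristics).
let detection = hyperpolyglot::detect(Path::new("src/bin/main.rs")).unwrap();
assert_eq!(Some(Detection::Heuristics("Rust")), detection);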

Breakdown

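use hyperpolyglot::{get_language_breakdown, Detection};
use std::collections::HashMap;
use std::path::PathBuf;

// Maps each language name to the files detected as that language.
let breakdown: HashMap<&'static str, Vec<(Detection, PathBuf)>> = get_language_breakdown("src/");
println!("{:?}", breakdown.get("Rust"));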

Divergences from Linguist

  • The probability of the language occurring is not taken into account when classifying. All languages are assumed to have equal probability.

  • An additional heuristic was added for .h files.

  • Vim and Emacs modelines are not considered in the detection process.

  • Generated and binary files are not excluded from the breakdown function.

  • When calculating the language makeup of a directory, file count is used instead of byte count.
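The last divergence changes the reported percentages whenever file sizes are uneven. The sketch below is not hyperpolyglot's code; it is a minimal, standard-library-only illustration in which the `FileInfo` struct, both function names, and the sample sizes are all hypothetical. It computes the same directory's makeup both ways so the difference is visible.

```rust
use std::collections::HashMap;

/// One detected file: its language and size in bytes (hypothetical input shape).
struct FileInfo {
    language: &'static str,
    bytes: u64,
}

/// Turns per-language weights into percentages of the total weight.
fn to_percentages(weights: HashMap<&'static str, f64>) -> HashMap<&'static str, f64> {
    let total: f64 = weights.values().sum();
    if total == 0.0 {
        // No files (or only empty files): there is nothing to report.
        return HashMap::new();
    }
    weights
        .into_iter()
        .map(|(lang, w)| (lang, w / total * 100.0))
        .collect()
}

/// Makeup where every file counts once, regardless of size.
fn makeup_by_file_count(files: &[FileInfo]) -> HashMap<&'static str, f64> {
    let mut counts: HashMap<&'static str, f64> = HashMap::new();
    for f in files {
        *counts.entry(f.language).or_insert(0.0) += 1.0;
    }
    to_percentages(counts)
}

/// Makeup where every file is weighted by its size in bytes.
fn makeup_by_byte_count(files: &[FileInfo]) -> HashMap<&'static str, f64> {
    let mut sizes: HashMap<&'static str, f64> = HashMap::new();
    for f in files {
        *sizes.entry(f.language).or_insert(0.0) += f.bytes as f64;
    }
    to_percentages(sizes)
}

fn main() {
    // Made-up sizes: three Rust files and one large RenderScript file.
    let files = [
        FileInfo { language: "Rust", bytes: 9_000 },
        FileInfo { language: "Rust", bytes: 500 },
        FileInfo { language: "Rust", bytes: 500 },
        FileInfo { language: "RenderScript", bytes: 10_000 },
    ];
    let by_count = makeup_by_file_count(&files);
    let by_bytes = makeup_by_byte_count(&files);
    println!("{:.2}% Rust by file count", by_count["Rust"]);
    println!("{:.2}% Rust by byte count", by_bytes["Rust"]);
}
```

With these made-up inputs, Rust accounts for three of four files but only half of the bytes, so the two weightings can disagree noticeably on repositories with a few very large files.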

Benchmarks

  • Benchmarks were run using the command-line tool hyperfine
  • Benchmarks were run on a MacBook Pro with 8 GB of RAM and a 3.1 GHz dual-core Intel Core i5
  • enry is a port of the Linguist library to Go
  • Both enry and Linguist are single-threaded

samples dir

Tool                             mean (ms)  median (ms)  min (ms)  max (ms)
hyperpolyglot (multi-threaded)       1,188        1,186     1,166     1,226
hyperpolyglot (single-threaded)      2,424        2,424     2,414     2,442
enry                                21,619       21,566    21,514    21,855
Linguist                            42,407       42,386    42,070    42,856
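To read the samples dir numbers as relative speedups, each tool's mean can be divided by the multi-threaded hyperpolyglot mean. The snippet below is only a convenience sketch: the `speedup` helper is a hypothetical name, and the inputs are the mean values copied from the table above.

```rust
/// Ratio of a baseline mean runtime to a candidate mean runtime (same unit).
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // Mean runtimes in milliseconds from the samples dir table.
    let hyperpolyglot_multi = 1_188.0;
    let rows = [
        ("hyperpolyglot (single-threaded)", 2_424.0),
        ("enry", 21_619.0),
        ("Linguist", 42_407.0),
    ];
    for (tool, mean_ms) in rows {
        println!(
            "{}: {:.1}x the runtime of multi-threaded hyperpolyglot",
            tool,
            speedup(mean_ms, hyperpolyglot_multi)
        );
    }
}
```

On these figures the ratios come out to roughly 2x for single-threaded hyperpolyglot, 18x for enry, and 36x for Linguist.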

Rust Repo

Tool                             mean (ms)  median (ms)  min (ms)  max (ms)
hyperpolyglot (multi-threaded)       3,808        3,751     3,708     4,253
hyperpolyglot (single-threaded)      8,341        8,334     8,276     8,437
enry                                82,300       82,215    82,021    82,817
Linguist                           196,780      197,300   194,033   202,930

Linux Kernel

  • hyperpolyglot is so much faster here because of the heuristic added for .h files, which significantly speeds up detection for .h files that can't be classified with the Objective-C or C++ heuristics.

Tool                             mean (s)  median (s)   min (s)   max (s)
hyperpolyglot (multi-threaded)     3.7574      3.7357    3.7227    3.9021
hyperpolyglot (single-threaded)    7.5833      7.5683    7.5445    7.6489
enry                             137.6046    137.4229  137.1955  138.8694

Accuracy

All of the programming language detectors are far from perfect, and hyperpolyglot is no exception. Its language detections mirror those of Linguist and enry for most files, with the biggest divergences coming from files that need to fall back on the classifier. Files that can be detected through a common known filename, an extension, or by following the set of heuristics should approach 100% accuracy.
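The staged fallback described above (filename, then extension, then heuristics, then classifier) can be pictured with a toy pipeline. This is a sketch under stated assumptions, not hyperpolyglot's implementation: the `Detected` enum, every function name, the tiny lookup tables, and the content rules are hypothetical, and the "classifier" is a placeholder that simply picks the first candidate.

```rust
use std::path::Path;

/// Which strategy produced the answer (hypothetical enum for this sketch).
#[derive(Debug, PartialEq)]
enum Detected {
    Filename(&'static str),
    Extension(&'static str),
    Heuristics(&'static str),
    Classifier(&'static str),
}

/// Well-known filenames that identify a language on their own (toy table).
fn by_filename(name: &str) -> Option<&'static str> {
    match name {
        "Makefile" => Some("Makefile"),
        "Dockerfile" => Some("Dockerfile"),
        _ => None,
    }
}

/// Candidate languages for an extension; more than one means ambiguity (toy table).
fn by_extension(ext: &str) -> &'static [&'static str] {
    match ext {
        "py" => &["Python"],
        "rs" => &["Rust", "RenderScript"],
        _ => &[],
    }
}

/// Toy content rules for disambiguating `.rs` candidates.
fn by_heuristics(ext: &str, content: &str) -> Option<&'static str> {
    if ext != "rs" {
        return None;
    }
    if content.contains("#pragma version") {
        Some("RenderScript")
    } else if content.contains("fn ") || content.contains("use ") {
        Some("Rust")
    } else {
        None
    }
}

/// Placeholder for a statistical classifier: picks the first candidate.
fn classify(candidates: &[&'static str], _content: &str) -> Option<&'static str> {
    candidates.first().copied()
}

/// Tries each strategy in order and stops at the first one that gives an answer.
fn detect(path: &Path, content: &str) -> Option<Detected> {
    let name = path.file_name()?.to_str()?;
    if let Some(lang) = by_filename(name) {
        return Some(Detected::Filename(lang));
    }
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    let candidates = by_extension(ext);
    match candidates {
        [] => None,
        [only] => Some(Detected::Extension(*only)),
        _ => by_heuristics(ext, content)
            .map(Detected::Heuristics)
            .or_else(|| classify(candidates, content).map(Detected::Classifier)),
    }
}

fn main() {
    println!("{:?}", detect(Path::new("Makefile"), ""));
    println!("{:?}", detect(Path::new("src/main.rs"), "fn main() {}"));
    println!("{:?}", detect(Path::new("blob.rs"), "no recognizable markers"));
}
```

In this sketch, only the last stage guesses; the earlier stages either match a fixed rule or decline, which is why files resolved before the classifier are the ones expected to be detected reliably.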

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.