roskakori/pygount

Add option to merge embedded languages

mathben opened this issue · 2 comments

Story

As a user, I want to see a single count for a base language even if there are source files with various embedded languages, so that I can get a general idea of how much the base language is used independent of the embedded languages.

Example languages this is useful for: HTML, XML, JavaScript.

Goals

  • When Pygments reports a language name that contains a plus sign between two non-plus characters, the part to the left is the base language and the part to the right is the embedded (sub) language. Examples:
    • JavaScript → base: JavaScript, no sub
    • JavaScript+Lasso → base: JavaScript, sub: Lasso
    • C++ → base: C++, no sub; reason: A + must be followed by a non-plus to be the start of a sub-language
  • When the command line option --merge-embedded is specified, all source files from an embedded language count only towards the base language.
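The splitting rule from the first goal could be sketched roughly as follows. The function name `split_language` is hypothetical and not part of pygount or Pygments; the sketch only assumes the naming convention described above.

```python
import re
from typing import Optional, Tuple

# A "+" separates base and embedded language only if it sits between two
# non-plus characters, so "C++" stays a single language.
_SPLIT_REGEX = re.compile(r"^(?P<base>[^+]+)\+(?P<sub>[^+].*)$")


def split_language(language: str) -> Tuple[str, Optional[str]]:
    """Return (base, sub); sub is None if there is no embedded language.

    Hypothetical helper for illustration only.
    """
    match = _SPLIT_REGEX.match(language)
    if match is None:
        return language, None
    return match.group("base"), match.group("sub")
```

With `--merge-embedded`, only the first element of the returned pair would be used as the language of a source file.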

Original request: Option to merge sub language

Example of output
┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Language                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Python                  │
│ XML                     │
│ XML+Django/Jinja        │
│ JavaScript+Lasso        │
│ JavaScript              │
│ Genshi                  │
│ SCSS                    │
│ JavaScript+Genshi Text  │
│ HTML                    │
│ JavaScript+Django/Jinja │
│ CSS+Lasso               │
└─────────────────────────┘

Can we have an option to merge the results of:
  • XML with XML+Django/Jinja + Genshi
  • JavaScript with JavaScript+Lasso + JavaScript+Genshi Text + JavaScript+Django/Jinja

Or maybe remove the sub-language classification from the analysis?

The classification is done by pygments, so this would need to be an extra step performed by pygount.

Seemingly pygments uses the convention base_language + "+" + other_languages, so pygount could split before the + and just use the base language as actual language.

Not high on my list of priorities, but I'll leave it in the backlog.

Note to self: We still need to detect C++ to be a full language, not the "+" sub-language of C.

Here's a first code snippet to derive the base language, if any.

import re

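# The base language is everything before the first "+", provided that "+" is
# directly followed by a non-plus character. This keeps "C++" intact.
_BASE_LANGUAGE_REGEX = re.compile(r"^(?P<base_language>[^+]+)\+[^+].*$")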


def base_language(language: str) -> str:
    base_language_match = _BASE_LANGUAGE_REGEX.match(language)
    return language if base_language_match is None else base_language_match.group("base_language")


assert base_language("JavaScript") == "JavaScript"
assert base_language("JavaScript+Lasso") == "JavaScript"
assert base_language("JavaScript+") == "JavaScript+"  # no actual language 
assert base_language("C++") == "C++"
assert base_language("++C") == "++C"  # no actual language
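
Building on that, merging the counts could look roughly like the sketch below. It repeats the helper so that it runs on its own. `merge_embedded` and the example counts are made up for illustration; pygount's actual data structures may differ.

```python
import re
from collections import Counter
from typing import Dict

_BASE_LANGUAGE_REGEX = re.compile(r"^(?P<base_language>[^+]+)\+[^+].*$")


def base_language(language: str) -> str:
    match = _BASE_LANGUAGE_REGEX.match(language)
    return language if match is None else match.group("base_language")


def merge_embedded(counts: Dict[str, int]) -> Dict[str, int]:
    """Sum the counts of all languages that share a base language.

    Hypothetical helper, not part of pygount.
    """
    merged: Counter = Counter()
    for language, count in counts.items():
        merged[base_language(language)] += count
    return dict(merged)


# Made-up file counts, for illustration only.
example_counts = {
    "XML": 3,
    "XML+Django/Jinja": 2,
    "JavaScript": 5,
    "JavaScript+Lasso": 1,
    "C++": 4,
}
merged_counts = merge_embedded(example_counts)
```

Note that a plain name such as Genshi contains no "+" and would therefore stay a separate language under this rule.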