Add option to merge embedded languages

Question

mathben opened this issue 2 years ago · 2 comments

Story

As a user, I want to see a single count for a base language even if there are source files with various embedded languages, so that I can get a general idea of how much the base language is used, independent of the embedded languages.

Example languages this is useful for: HTML, XML, JavaScript.

Goals

When the name of a language detected by Pygments contains a plus between two non-plus characters, the part left of the plus is the base language and the part right of it is the embedded (sub) language. Examples:
- JavaScript → base: JavaScript, no sub
- JavaScript+Lasso → base: JavaScript, sub: Lasso
- C++ → base: C++, no sub; reason: A + must be followed by a non-plus to be the start of a sub-language
When the command line option --merge-embedded is specified, all source files with an embedded language count only towards their base language.
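
The goals above can be sketched as follows. This is only an illustration under stated assumptions, not pygount's implementation: the names split_language and merged_counts are hypothetical, and the file counts are made up.

```python
import re
from collections import Counter
from typing import Dict, Optional, Tuple

# A plus is a separator only if it sits between two non-plus characters.
_SEPARATOR_REGEX = re.compile(r"(?<=[^+])\+(?=[^+])")


def split_language(language: str) -> Tuple[str, Optional[str]]:
    """Return (base, sub); sub is None if there is no embedded language."""
    separator_match = _SEPARATOR_REGEX.search(language)
    if separator_match is None:
        return language, None
    base = language[: separator_match.start()]
    sub = language[separator_match.end():]
    return base, sub


def merged_counts(file_counts: Dict[str, int]) -> Counter:
    """Add the count of every embedded language to its base language."""
    result = Counter()
    for language, count in file_counts.items():
        base, _ = split_language(language)
        result[base] += count
    return result


# Hypothetical file counts, for illustration only.
example_counts = {"JavaScript": 3, "JavaScript+Lasso": 2, "C++": 4, "XML+Django/Jinja": 1}
print(dict(merged_counts(example_counts)))
# {'JavaScript': 5, 'C++': 4, 'XML': 1}
```

With merging switched off, the counts would be reported per detected language, exactly as they come in.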

Original request: Option to merge sub language

Example of output
┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Language                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Python                  │
│ XML                     │
│ XML+Django/Jinja        │
│ JavaScript+Lasso        │
│ JavaScript              │
│ Genshi                  │
│ SCSS                    │
│ JavaScript+Genshi Text  │
│ HTML                    │
│ JavaScript+Django/Jinja │
│ CSS+Lasso               │
│ empty                   │
└─────────────────────────┘

Can we have an option to merge the results of:
- XML with XML+Django/Jinja and Genshi
- JavaScript with JavaScript+Lasso, JavaScript+Genshi Text and JavaScript+Django/Jinja

Or maybe remove the sub-language classification from the analysis?

Answer 1 · 2023-02-12T23:06:36.000Z

The classification is done by pygments, so this would need to be an extra step performed by pygount.

Seemingly pygments uses the convention base_language + "+" + other_languages, so pygount could split before the + and just use the base language as actual language.
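
A naive sketch of that split could look like this. The function name naive_base_language is hypothetical and not part of pygount or pygments; it takes everything before the first plus, which works for names such as XML+Django/Jinja but also truncates C++.

```python
def naive_base_language(language: str) -> str:
    # Everything before the first plus; the whole name if there is no plus.
    return language.split("+", 1)[0]


print(naive_base_language("XML+Django/Jinja"))  # XML
print(naive_base_language("JavaScript"))  # JavaScript
print(naive_base_language("C++"))  # C, although C++ is a language of its own
```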

Not high on my list of priorities, but I'll leave it in the backlog.

Answer 2 · 2023-02-13T09:52:25.000Z

Note to self: We still need to detect C++ to be a full language, not the "+" sub-language of C.

Here's a first code snippet to derive the base language, if any.

import re

_BASE_LANGUAGE_REGEX = re.compile(r"^(?P<base_language>[^+]+)\+[^+].*$")


def base_language(language: str) -> str:
    base_language_match = _BASE_LANGUAGE_REGEX.match(language)
    return language if base_language_match is None else base_language_match.group("base_language")


assert base_language("JavaScript") == "JavaScript"
assert base_language("JavaScript+Lasso") == "JavaScript"
assert base_language("JavaScript+") == "JavaScript+"  # no actual language 
assert base_language("C++") == "C++"
assert base_language("++C") == "++C"  # no actual language