This repository aims to provide a suite of static-analysis tools to analyze code complexity. It is currently mostly a collection of heuristics, but will evolve with time. Unlike other code-metrics repositories at the time of writing, this one computes software complexity by file and language in addition to an aggregate score across all files and languages. There is support for over 20 languages, including low-level assembly languages like MIPS. Please note this is release 0.0.1 and is not suitable for production. Example directories are provided for the reader's viewing pleasure, but will probably go away in a later version.
Usage is simple: pass a directory as a command-line argument to `software_metrics/metrics/run_metrics/runner.py --dir <directory>`. This creates a directory `logs_<hash>`, and under each of those there is a directory for each metric, further broken down by file and language. The exception is the maintainability index, which sits directly under `logs`.
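For example, from the repository root (the `./examples` path here is illustrative):

```
python software_metrics/metrics/run_metrics/runner.py --dir ./examples
```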
Under `software_metrics/metrics/metrics_cfgs` there are several key files (a loose illustration of the JSON shapes follows this list):

- `lang_regexes.json` - contains operators, keywords, etc. which allow for pattern matching. This makes extending the complexity-metric calculations across languages trivial (even experimental ones): simply add another key to the JSON file containing the patterns for all syntactic tokens of interest.
- `program_file_exts.txt` - a file containing the extensions that the runner should consider when computing complexity. This can easily be modified to ignore specific file extensions. NOTE - THE BEHAVIOR OF THIS IS NOT YET TESTED!
- `program_file_exts_map.json` - a JSON file which maps language -> List[allowable extensions]. This is also a key component in making the software-complexity calculations extendable to other languages, and it applies to novel languages as seamlessly as well-established ones.
- `generate_hll_tokens.py` - responsible for auto-generating a JSON file containing the assignment, branch, conditional, and loop keywords, plus comment tokens, which are used to calculate metrics. This allows the computations to be language agnostic and simply operate over syntactic tokens of interest. This file covers high-level languages like Python, Go, C++, etc. The generated file is called `hll_tokens.json`.
- `generate_lll_tokens.py` - similarly, responsible for auto-generating a JSON file containing the assignment, branch, conditional, and loop keywords along with comment tokens for assembly-level languages like x86 and ARM.
- `generate_ir_tokens.py` - like the above, but for middle-end Intermediate Representation (IR) languages like LLVM and GIMPLE (the only two currently supported).
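The exact schemas live in the files themselves, but as a loose illustration (every key and pattern below is hypothetical), entries in `lang_regexes.json` and `program_file_exts_map.json` might be shaped like this:

```python
import json

# Hypothetical shapes only -- the real keys and patterns live in the
# shipped config files and may differ.
lang_regexes_example = {
    "python": {
        "assignment": [r"=", r"\+=", r"-="],
        "branch": [r"\bif\b", r"\belif\b", r"\belse\b"],
        "loop": [r"\bfor\b", r"\bwhile\b"],
        "comment": [r"#"],
    }
}

program_file_exts_map_example = {
    "python": [".py"],
    "cpp": [".cpp", ".cc", ".hpp"],
}

# Extending to a new (even experimental) language is just another key:
lang_regexes_example["mylang"] = {
    "assignment": [r":="],
    "branch": [r"\bwhen\b"],
    "loop": [r"\brepeat\b"],
    "comment": [r"//"],
}

print(json.dumps(lang_regexes_example, indent=2))
```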
SUPPORTED METRICS:
- ABC Metrics
- Halstead Complexity
- Cyclomatic Complexity - this also includes a crude ASCII Control-Flow Graph (CFG) of a given program
- LOC (broken down into source, comment, and blank lines of code)
- Maintainability index
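For context, the classic maintainability-index formulation (this repository may use a variant) combines the Halstead volume, cyclomatic complexity, and source lines of code. A minimal sketch:

```python
import math

def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: float,
                          sloc: int) -> float:
    # Classic Oman/Hagemeister formulation; the repository's exact
    # formula and scaling may differ.
    return (171
            - 5.2 * math.log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * math.log(sloc))

print(round(maintainability_index(1000.0, 10.0, 200), 1))  # roughly 47
```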
The metrics used here are themselves not novel, and fall under the umbrella of static-analysis code tools. That said, higher-abstraction constructs like relationships between objects, functions, and polymorphic types deserve further exploration. The current version has a file at `/software_metrics/experimental/func_programming_utils.py` which takes a Python file as input and does the following (a rough sketch of such an analysis follows this list):
- Computes the number of higher-order functions
- Computes the degree of each higher-order function
- Produces ASCII diagrams showing relationships between the functions in the file
- Computes in-degree and out-degree using the NetworkX graph library.
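The repository's implementation is not shown here, but a minimal sketch of how such an analysis could work with Python's `ast` module and NetworkX (all function names below are hypothetical):

```python
import ast
import networkx as nx

def higher_order_degrees(source: str) -> dict:
    # Heuristic: a function is higher-order if it calls one of its own
    # parameters; its 'degree' is how many parameters it invokes.
    degrees = {}
    tree = ast.parse(source)
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        params = {a.arg for a in func.args.args}
        called = {n.func.id for n in ast.walk(func)
                  if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
        if params & called:
            degrees[func.name] = len(params & called)
    return degrees

def call_graph_degrees(source: str) -> dict:
    # Directed caller -> callee graph over top-level functions; NetworkX
    # then reports each node's in-degree and out-degree directly.
    tree = ast.parse(source)
    funcs = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    g = nx.DiGraph()
    g.add_nodes_from(funcs)
    for n in tree.body:
        if isinstance(n, ast.FunctionDef):
            for c in ast.walk(n):
                if (isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
                        and c.func.id in funcs):
                    g.add_edge(n.name, c.func.id)
    return {f: (g.in_degree(f), g.out_degree(f)) for f in g}

sample = "def apply(f, x):\n    return f(x)\n\ndef main():\n    return apply(len, [1])\n"
print(higher_order_degrees(sample))  # {'apply': 1}
print(call_graph_degrees(sample))    # e.g. {'apply': (1, 0), 'main': (0, 1)}
```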
More sophisticated features will be added in future releases, such as object dependency injection, as well as code which will automatically perform dependency inversion and Inversion-of-Control (IoC) for classes.
KNOWN ISSUES:
- Meta-referencing tokens - in a given language (e.g. Python), if comments contain syntactic keywords for loops, assignments, branches, etc., the counter(s) will include those in the calculation. The same holds if the exact keyword appears as a string (but not as a substring). For example, `my_list = ['something', 'if']` will count the `if` as an additional branch; fortunately, this does not occur with `my_list = ['something', 'if_in_a_string']`. In general, this can easily be fixed with better regex/pattern matching to filter out comments and then only pattern match over source (a sketch of this fix appears after this list). Code already exists to parse source from comments, but the logic for filtering out the syntactic keywords is not yet in the repository.
- Improper handling of all file types - in particular, binary-file exceptions are not fully handled, but this will be fixed shortly.
- Proper unit tests for all metrics across all languages
- Additional documentation
- Dockerization
- Development of more informative metrics which intelligently aggregate and build on the existing ones
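As mentioned in the first known issue, one way to avoid counting keywords inside strings and comments is to tokenize before matching. A minimal sketch of that fix, not the repository's planned implementation:

```python
import io
import tokenize

def count_branch_keywords(source: str) -> int:
    # Keywords arrive as NAME tokens, while string literals and comments
    # arrive as STRING/COMMENT tokens, so matches inside them are impossible.
    return sum(
        1
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.NAME and tok.string in {"if", "elif"}
    )

print(count_branch_keywords("my_list = ['something', 'if']\nif True:\n    pass\n"))
# -> 1: only the real branch counts; the 'if' in the list is a STRING token
```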