A simple Python3 tool (also available as a GitHub Action) to detect similarities between files within a repository.
A command line tool that receives a directory or a list of files and determines the degree of similarity between them.
The tool intends guide the refactoring efforts of a developer who wishes to reduce code duplication within a component and improve its software architecture.
Its development was initiated within the context of the DAT265 - Software Evolution Project.
The tool uses the gensim Python library to determine the similarity between source code files, supplied by the user. The default supported languages are C, C++, JAVA, Python and C#.
The following Python packages have to be installed:
- nltk
pip3 install --user nltk
- gensim
pip3 install --user gensim
- astor
pip3 install --user astor
- punkt
python3 -m nltk.downloader punkt
Suppress the warnings (generated by the used libraries)
as python3 -W ignore duplicate_code_detection.py
and then supply the necessary
arguments. More details can be found by running the tool with the --help
option.
Notice: Due to the way the models are created, the more source files you provide the tool the more accurate the similarity calculations are. In other words, the bigger the project, the more useful the tool is.
If duplicate-code-detection-tool
is the name where the tool resides in and
smartcar_shield/src
contains the repository you want to check for source code
similarities between the files, then you can run the following to get the
similarity report:
python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/
The result should look something like this:
The tool is also available as a GitHub Action for easy integration with projects hosted on GitHub. An example output of the tool can be seen here.
The Action is meant to be triggered during pull requests to give the developers an impression over the degree of similarity between the files in the source code. Below you will find a sample workflow files that illustrate the usage.
Depending on the size of your project, you may want to have the tool running multiple times (i.e in diffferent steps) that test specific parts of your repository for duplicate code. This way you will not compare each file in your codebase with everything else and get back more meaningful reports.
In the following example the tool will examine source code (the languages supported by default)
in the src/
and test/ut
directories relative to the root directory of your repository.
The results will be posted as a comment in the pull request that was opened.
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "src/, test/ut"
If you want to avoid the "spam" you should configure the tool to not always run. Specifically, if you wish to trigger the Action manually, you can do so by leaving a comment in the pull request.
The following action will trigger the tool to be run when a comment containig run_duplicate_code_detection_tool
is posted in a pull request. The tool will run using the code in the pull request.
name: Duplicate code
on: issue_comment
jobs:
duplicate-code-check:
name: Check for duplicate code
# Trigger the tool only when a comment containing the keyword is published in a pull request
if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection_tool')
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "."
Important: Please note that due to the way GitHub Actions work, you will first have to merge this into your main branch so it starts taking effect.
It may not make sense to compare all files or get a files with very low similarity reported. In the following workflow, the different optional arguments are demonstrated.
For the various default values, please consult action.yml.
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "src"
# Ignore the specified directories
ignore_directories: "src/external_libraries"
# Only examine .h and .cpp files
file_extensions: "h, cpp"
# Only report similarities above 5%
ignore_below: 5
# If a file is more than 70% similar to another, then the job fails
fail_above: 70
# If a file is more than 15% similar to another, show a warning symbol in the report
warn_above: 15
# Remove `src/` from the file paths when reporting similarities
project_root_dir: "src"
# Remove docstrings from code before analysis
# For python source code only. This is checked on a per-file basis
only_code: true
# Leave only one comment with the report and update it for consecutive runs
one_comment: true
# The message to be displayed at the start of the report
header_message_start: "The following files have a similarity above the threshold:"
To use Duplicate Code Detection Tool as a pre-commit hook with pre-commit add the following to your .pre-commit-config.yaml
file:
- repo: https://github.com/platisd/duplicate-code-detection-tool.git
rev: '' # Use the sha / tag you want to point at
hooks:
- id: duplicate-code-detection
NOTE: that this repository sets args:
-f
, if you are configuring duplicate-code-detection-tool using args you'll want to include either-f
(--files
) or-d
(--directories
).
only_code
option only works with python files for now