- Project Overview
- Features
- Technologies Used
- Project Document
- Installation
- Usage
- Project Poster
- System Architecture
- Contact Us
The increasing complexity of software systems demands effective methods for understanding and maintaining source code. In this context, traceability links between software artifacts, such as requirements, code, and documentation, play a crucial role in facilitating software comprehension, maintenance, and evolution. This project focuses on developing a comprehensive framework for automated trace link generation between use case documents and Java code files, leveraging machine learning techniques and natural language processing (NLP) tools.
The project comprises several modules, each serving a specific purpose in the traceability link generation process. The Data Collection Module retrieves and preprocesses relevant data, including use case documents, Java code files, and associated metadata. The Feature Extraction Module extracts features from both textual and code-based sources, capturing essential information for link prediction. The Model Prediction Module then applies machine learning algorithms, such as Random Forest, to predict trace links from the extracted features, and the Deep Learning Prediction Module benchmarks those machine learning outputs. Additionally, the Bug Localization Module aims to locate buggy versions of code files by analyzing bug reports and version control data.
Throughout the development process, various software tools and technologies are employed, including Visual Studio Code, Python, Jupyter Notebook, TensorFlow, and Tree-sitter. These tools facilitate tasks such as code editing, data analysis, machine learning model training, and parsing source code. Furthermore, libraries such as NumPy, Pandas, and Scikit-learn provide essential functionalities for data manipulation, numerical computing, and machine learning tasks.
The proposed framework contributes to the automation of traceability link generation, reducing manual effort, and improving software maintenance efficiency. By establishing trace links between use case documents and code files, developers can better understand system requirements, track changes, and ensure consistency between software artifacts. Overall, this project aims to enhance software comprehension and maintenance processes, ultimately leading to the development of more reliable and maintainable software systems.
- Automated trace link generation between use case documents and Java code files.
- Feature extraction from textual and code-based sources.
- Machine learning and deep learning models for link prediction.
- Bug localization by analyzing bug reports and version control data.
- Programming Languages: Python, JavaScript, HTML, CSS
- Libraries and Frameworks: TensorFlow, NumPy, Pandas, Scikit-learn, React, NLTK
- Development Tools: Visual Studio Code, Jupyter Notebook, Tree-sitter
- Python 3.8 or higher
- TensorFlow 2.4.1
- Required Python packages (listed in `requirements.txt`)
- Clone the Repository: `git clone https://github.com/sohad-hossam/GP-codebase.git`
- Install Required Packages: `pip install -r requirements.txt`
- Run The Website: `npm start` and `python src/app.py`
- Run Feature Extraction Module:
- Run Model Prediction Module:
- Run Bug Localization Module:
The architecture of IntelliTest is covered in this section. We will review each module to illustrate how the modules work together.
The Preprocessor module processes both source code files and use case (UC) documents to prepare them for further analysis by cleaning, tokenizing, and filtering content.
- Removing punctuation marks, numeric characters, and stopwords.
- Lowercasing all words.
- Stemming words using the Porter Stemming Algorithm.
- Filtering out Java keywords to focus on relevant content.
- Removing numeric characters and stopwords.
- Lowercasing all words.
- Stemming words using the Porter Stemming Algorithm.
- Filtering out specific domain-related keywords.
- Reading files from provided paths.
- Storing tokenized documents.
- Generating sets of unique tokens for both source code and UC documents.
The Preprocessor module consists of the following functions:
- CodePreProcessor: Processes source code files by cleaning, tokenizing, and filtering out irrelevant information based on Java keywords.
- UCPreProcessor: Processes UC documents by cleaning, tokenizing, and filtering out irrelevant information based on specific domain-related keywords.
- setup: Sets up documents and tokens by iterating through files in provided directories, applying preprocessing functions, and collecting tokenized documents and unique tokens.
- Language Constraint: The preprocessing steps are tailored to Java source code, including its syntax and common conventions. Because keyword lists differ between languages, the preprocessor would need a different keyword list to handle code written in another language.
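The source-code cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the project's `CodePreProcessor`: the function names, the two small word lists, and the pluggable `stem` argument are assumptions made for the example. Stemming defaults to a no-op so the sketch runs with the standard library alone; a Porter stemmer such as `nltk.stem.PorterStemmer().stem` could be passed in its place.

```python
import re

# Illustrative subsets only; the project's real keyword and stopword lists
# are not shown in this README.
JAVA_KEYWORDS = {"public", "private", "class", "static", "void",
                 "int", "return", "new", "if", "else"}
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def preprocess_code(text, stem=lambda word: word):
    """Clean and tokenize one source file (hypothetical helper)."""
    # Replace punctuation and digits with spaces, keeping only letters.
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    # Lowercase, then split on whitespace.
    tokens = letters_only.lower().split()
    # Drop stopwords and Java keywords.
    kept = [t for t in tokens if t not in STOPWORDS and t not in JAVA_KEYWORDS]
    # Stem whatever remains (identity function by default).
    return [stem(t) for t in kept]


def unique_tokens(documents):
    """Collect the set of distinct tokens across tokenized documents."""
    return set().union(*documents)


java_source = "public int getUserCount() { return users.size(); } // 2 users"
tokens = preprocess_code(java_source)
print(tokens)
print(unique_tokens([tokens]))
```

A UC-document variant would follow the same shape, swapping the Java keyword set for the domain-related keyword list.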
The Maintainability Score Module calculates a maintainability score for source code files based on various metrics.
- Parsing Source Code: Utilizes a Tree-sitter parser for Java to parse the source code files.
- Identifying Parameters: Operands count, operators count, unique operands count, unique operators count.
- Computing Metrics: Program vocabulary, program length, calculated program length, volume, difficulty, effort, time required to program, number of delivered bugs.
- Maintainability Score Equation:
Maintainability Score = max(0, (100*(171 - 5.2*np.log(V) - 0.23*G - 16.2*np.log(SLOC))) / 171)
- Dependent on Tree-sitter for parsing. Compared with the alternative Java parser javalang, Tree-sitter output matches the needed parameters better.
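The metric computation can be sketched from the four counts listed above. This is a hedged illustration under a few assumptions: the formulas are the standard Halstead definitions, `V`, `G`, and `SLOC` in the score equation are read as Halstead volume, cyclomatic complexity, and source lines of code (the README does not define them), and delivered bugs uses the common `V / 3000` estimate, which has other variants. The function names are hypothetical.

```python
import math


def halstead_metrics(n1, n2, N1, N2):
    """Halstead metrics from operator/operand counts.

    n1, n2: unique operators and unique operands (both must be > 0).
    N1, N2: total operators and total operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume
    return {
        "vocabulary": vocabulary,
        "length": length,
        "calculated_length": n1 * math.log2(n1) + n2 * math.log2(n2),
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "time_seconds": effort / 18,      # Halstead's time estimate
        "delivered_bugs": volume / 3000,  # one common variant (assumption)
    }


def maintainability_score(volume, cyclomatic_complexity, sloc):
    """Normalized score in [0, 100], following the equation in this section."""
    raw = (171
           - 5.2 * math.log(volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(sloc))
    return max(0.0, 100 * raw / 171)


metrics = halstead_metrics(n1=4, n2=4, N1=8, N2=8)
print(metrics["volume"])
print(maintainability_score(metrics["volume"], cyclomatic_complexity=1, sloc=10))
```

The `max(0, ...)` clamp keeps very large or complex files from producing a negative score.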
Integrates a total of 131 features mentioned in the “Predicting Query Quality for Applications of Text Retrieval to Software Engineering Tasks” and “Automatic Traceability Maintenance via Machine Learning Classification” papers to ensure the best results.
- TF-IDF Vectorizer: Processes source code files and UC documents.
- Feature Computation: Generates 131 features including IR-Based Features, Pre-retrieval Features, Post-retrieval Features, and Document Statistics Features.
- Further Processing: Normalizing features, feature selection, feature mapping, and handling data imbalance.
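As one example of an IR-based feature, the sketch below computes the TF-IDF cosine similarity between tokenized documents. It is an illustration only: the project lists Scikit-learn among its libraries, but this version uses the standard library to stay self-contained, and the smoothed IDF formula and function names are assumptions rather than the project's exact configuration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


use_case = ["login", "user", "password"]
related_code = ["user", "login", "valid"]
unrelated_code = ["render", "chart"]
uc_vec, rel_vec, unrel_vec = tfidf_vectors([use_case, related_code, unrelated_code])
print(cosine_similarity(uc_vec, rel_vec))
print(cosine_similarity(uc_vec, unrel_vec))
```

A use case and a code file that share vocabulary score higher than a pair with no terms in common, which is the signal such a feature passes to the classifier.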
Feeds the data into a machine learning model (Random Forest) for predicting trace links, reaching an accuracy of 70% and F1-score of 0.7.
- Feature Selection: Identifying and retaining relevant features to improve model performance.
- Feature Mapping: Associating each data point with its corresponding features.
- Data Imbalance: Using BorderlineSMOTE for creating synthetic samples to balance class distribution.
- Model Training and Prediction: Training and predicting using RandomForestRegressor.
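The README reports accuracy and F1 while naming a regressor, and it does not state how continuous predictions become link / no-link labels. The sketch below shows one plausible way, assuming a cutoff of 0.5; the threshold, the function names, and the sample scores are all assumptions for illustration.

```python
def threshold_scores(scores, threshold=0.5):
    """Turn continuous scores into binary labels (threshold is an assumption)."""
    return [1 if score >= threshold else 0 for score in scores]


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 meaning 'trace link'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Made-up labels and scores, used only to exercise the functions.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.3, 0.7, 0.6]
y_pred = threshold_scores(scores)
print(accuracy_and_f1(y_true, y_pred))
```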
Trains and evaluates a neural network model for traceability link prediction and benchmarking.
- Data Retrieval and Preparation
- Data Preprocessing
- Model Definition and Training
- Model Evaluation
- Embedding and Vocabulary Management
- Data dependency on a specific dataset in SQLite format.
- Computational resource requirements.
- Handling of unknown tokens.
- Embedding quality.
- Loss function and class imbalance handling.
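Vocabulary management and unknown-token handling can be illustrated with a small sketch. The reserved `<PAD>` and `<UNK>` entries, the `min_count` cutoff, and the function names are assumptions for this example; the README does not describe how the project encodes tokens before the embedding layer.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # reserved entries (assumed convention)


def build_vocab(tokenized_docs, min_count=1):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    vocab = {PAD: 0, UNK: 1}
    # Most frequent tokens get the lowest ids after the reserved ones.
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, mapping unseen tokens to UNK and padding to max_len."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab([["login", "user"], ["user", "logout"]])
print(vocab)
print(encode(["user", "signup"], vocab, max_len=4))
```

Tokens that never appeared during vocabulary construction share a single id, so the model still receives a fixed-length input when it meets new identifiers.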
Identifies and locates specific code changes associated with resolved bug reports by processing Jira issues, extracting commit information, and analyzing code changes.
- Bug Report Retrieval
- Commit Hash Retrieval
- Commit Data Extraction
- Code Parsing
- Tokenization and Preprocessing
- Word Embedding Training
- Data Indexing
- Data availability and consistency.
- API rate limits.
- Computational resource requirements.
- Tokenization accuracy.
- Security of access to GitHub repository and database.
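The commit hash retrieval step can be sketched as below, assuming that commit messages mention Jira issue keys such as `PROJ-123`. The README does not say how the project links issues to commits, so the regular expression, the function name, and the sample commits are hypothetical.

```python
import re

# Jira-style key: uppercase project prefix, hyphen, number.
# Strings such as "UTF-8" also match, so real data needs extra filtering.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def index_commits_by_issue(commits):
    """Group commit hashes by the issue keys found in their messages.

    commits: iterable of (commit_hash, message) pairs.
    Returns a dict {issue_key: [commit_hash, ...]}.
    """
    index = {}
    for commit_hash, message in commits:
        # Deduplicate keys repeated within a single message.
        for key in sorted(set(ISSUE_KEY.findall(message))):
            index.setdefault(key, []).append(commit_hash)
    return index


# Made-up commits, used only to exercise the function.
commits = [
    ("a1b2c3", "PROJ-12: fix null pointer in parser"),
    ("d4e5f6", "Refactor tokenizer"),
    ("0a9b8c", "PROJ-12 PROJ-30 follow-up cleanup"),
]
print(index_commits_by_issue(commits))
```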
For any questions or suggestions, please feel free to reach out to any of our team members:
Member 1
- GitHub: github.com/yasmin-hashem24
- LinkedIn: linkedin.com/in/yasmin-hashem2024
- Email: yasmin.hashem201@gmail.com
Member 2
- GitHub: github.com/sohad-hossam
- LinkedIn: linkedin.com/in/member2
- Email: sohad95husam@gmail.com
Member 3
- GitHub: github.com/yasmiinezaki
- LinkedIn: linkedin.com/in/yasmiinezaki
- Email: yasmeen.zaki01@gmail.com
Member 4
- GitHub: github.com/bassabt-hisham
- LinkedIn: linkedin.com/in/bassant-hisham
- Email: bassentkhafagi@gmail.com