Khushiyant/quasarpy

Integrate `sourcecollector` for Efficient Source Code Consolidation

Opened this issue · 3 comments

Is your feature request related to a problem? Please describe.
The current data preparation step, which gathers and consolidates source code files from multiple repositories, is time-consuming. Files are often collected manually, which is inefficient and error-prone, and this delays the training of models for smell detection.

Describe the solution you'd like
I propose integrating sourcecollector into the Quasar project. sourcecollector is a tool designed to consolidate multiple files into a single .txt file, streamlining the process of collecting source code from various repositories. This integration would enhance the efficiency of data preparation, allowing Quasar to easily gather and process large amounts of source code for training its smell detection models.

Describe alternatives you've considered
An alternative would be to gather and consolidate the source code files manually, or to develop a custom script for the task. However, both approaches are less efficient and more error-prone than using a dedicated tool like sourcecollector.
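
For comparison, the custom-script alternative could look roughly like the minimal sketch below. It is only an illustration of the consolidation step, not sourcecollector's implementation or interface. The function name `consolidate_sources`, the header format, and the default `.py` suffix filter are all assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable


def consolidate_sources(
    root: Path, output: Path, suffixes: Iterable[str] = (".py",)
) -> int:
    """Concatenate matching files under `root` into `output`.

    Hypothetical helper for illustration only; returns the number of
    files written to the consolidated output.
    """
    wanted = set(suffixes)
    output_resolved = output.resolve()
    count = 0
    with output.open("w", encoding="utf-8") as out:
        # Sort for a deterministic order across runs.
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in wanted:
                continue
            # Never read the output file back into itself.
            if path.resolve() == output_resolved:
                continue
            # Assumed header format: marks where each source file starts.
            out.write(f"# ==== {path.relative_to(root).as_posix()} ====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n\n")
            count += 1
    return count
```

Calling `consolidate_sources(Path("repos"), Path("combined.txt"))` would write every `.py` file under a `repos` directory into one text file. A script like this still leaves out ignore rules, binary-file handling, and cloning the repositories, which is part of why a dedicated tool is proposed here.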

Additional context
For more details on sourcecollector, you can visit the GitHub repository: https://github.com/hitesh22rana/sourcecollector. Integrating this tool could significantly reduce the time and effort required to prepare source code data for Quasar.

Where do you think sourcecollector would fit into the workflow?

sourcecollector could run as a cron job that consolidates the source code from the various repositories. The consolidated data can then be extracted and used to train the smell detection models. This automation would streamline data preparation and help keep the model trained on the most up-to-date data.

Actually, the training data path is still manual, and it also involves manual labelling of the data. It currently requires the following changes:

  • Automating model training
  • Switching from the current supervised model to an unsupervised one

How do you suggest separating this process from the package, so that we can simply replace the current model with later iterations?