Project Description
Project Goal
Initial Thoughts
Plan
Data Dictionary
Steps to Reproduce
Conclusion
- Takeaways and Key Findings
- Recommendations and Next Steps

Project: NLP-Based GitHub Language Predictor

The core aim of this project is to automate the process of identifying the primary programming language used in a repository. We achieve this by implementing a machine learning model that analyzes the README text.

Project Description

This project is designed to automatically identify the primary programming language used in a GitHub repository by analyzing the text in its README file. By harnessing the power of Natural Language Processing (NLP) and machine learning, we aim to make it easier for users to understand the technology stack of a project.

Project Goal

This project uses natural language processing techniques: it scrapes GitHub repositories and returns their text data so that programming language use can be analyzed.
The goal is to develop a machine learning model that can predict the main programming language of a repository, given the text of its README file.

Initial Thoughts

My initial hypothesis is that words associated with programming, tooling, and possibly IDEs are good text strings for identifying the programming language of a repo.

The Plan

Acquire repository names, languages, and README text data from the GitHub API.
Prepare data using Regex and the BeautifulSoup Library.
Explore the data to find which words, bigrams, and trigrams are useful.
- Answer the following initial questions
  - What are the top 10 words for any language?
  - Is there a significant difference in the frequency of the top 10 words used in repository READMEs among different languages?
  - Is there a significant association between the programming language and the likelihood that a README contains the word "build"?
  - What are the top 10 bigrams for Python?
  - What are the top 10 trigrams for C++?
Develop a model to predict a repository's main programming language
- Use text data identified in explore to help build predictive models of different types
- Evaluate models on train and validate data
- Select the best model based on accuracy score
- Evaluate the best model on test data
Draw conclusions
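
The n-gram questions in the plan can be illustrated with a short sketch. It assumes README text is cleaned with regular expressions only (the project also names BeautifulSoup; this version stays in the standard library), and the function names `clean`, `ngrams`, and `top_ngrams` are hypothetical, not taken from the project's scripts.

```python
import re
from collections import Counter


def clean(text):
    """Strip HTML tags and URLs, lowercase, and split into tokens.

    Keeps '+' and '#' so that names like 'c++' and 'c#' survive.
    """
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9+#\s]", " ", text.lower())
    return text.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(readmes, n, k=10):
    """Count n-grams over a list of README strings and return the k most common."""
    counts = Counter()
    for text in readmes:
        counts.update(ngrams(clean(text), n))
    return counts.most_common(k)


# Made-up README snippets, for illustration only.
sample_readmes = [
    "<h1>Robot Arm</h1> Run pip install to build the robot arm.",
    "Build the robot arm with pip install.",
]
top_bigrams = top_ngrams(sample_readmes, n=2, k=3)
```

Calling `top_ngrams` with `n=1`, `n=2`, or `n=3` on the READMEs of a single language would give the top words, bigrams, or trigrams for that language.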

Data Dictionary

Feature	Data Type	Definition
`language`	string	The primary programming language of the repo
`text`	string	The text of the repo's README file

Steps to Reproduce

Clone this project repository to your local machine.
Install project dependencies by running pip install -r requirements.txt in your project directory.
Obtain an API key (a personal access token) from the GitHub website.
Create a config.py file in your project directory with your API key using the following format:

GITUB_API = "GITHUB_API_TOKEN"

Ensure that config.py is added to your .gitignore file to protect your API key.
Run the acquire.py script to fetch repository data from the GitHub API:

python acquire.py

Execute the prepare.py script for data preprocessing and splitting:

python prepare.py

Explore the dataset and answer initial questions using the explore.py script:

python explore.py

Develop machine learning models by running the model.py script:

python model.py

Evaluate the models, select the best-performing one, and draw conclusions based on the results of the model.py script.
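
For orientation, the acquire step could look roughly like the sketch below. It is not the contents of acquire.py: the helper names are hypothetical, the token is a placeholder standing in for the value stored in config.py, and the requests are only built, not sent. The two REST endpoints used are GitHub's repository endpoint, whose response includes a `language` field, and its README endpoint, which returns the file base64-encoded in a `content` field.

```python
import base64
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(path, token):
    """Return an authenticated GET request for a GitHub REST API path (not sent)."""
    return urllib.request.Request(
        f"{API_ROOT}{path}",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def repo_requests(full_name, token):
    """Build the two requests needed for one repo: its metadata and its README."""
    return {
        "metadata": build_request(f"/repos/{full_name}", token),
        "readme": build_request(f"/repos/{full_name}/readme", token),
    }


def decode_readme(payload):
    """Decode the base64 'content' field of a README API response to text."""
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


# Placeholder values; a real run would read the token from config.py.
requests_for_repo = repo_requests("owner/repo", "PLACEHOLDER_TOKEN")
```

Each request can then be sent with `urllib.request.urlopen` and the response parsed with `json.load` to obtain the `language` and `text` values described in the data dictionary.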

Conclusion

Takeaways and Key Findings

The README texts contain a lot of words that can be considered noise because they are distributed about equally across the language types.
The bigrams and trigrams displayed more noise, which points to an opportunity for further cleaning that could possibly improve the model once those engineered features are included.
Across popular repos on the topic of robotics, Python and C++ were the most popular languages in the data retrieved.
The language feature contained a lot of technologies that are not actual programming languages, which can skew the data displayed.
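
One way to check whether a word such as "build" is noise is a chi-square test of independence between language and whether a README contains the word. The sketch below computes the Pearson statistic by hand; the function name `chi_square` is hypothetical, the counts are invented for illustration and are not project results, and every row and column is assumed to have a nonzero total.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Each row of `table` holds the observed counts for one language,
    e.g. [READMEs containing the word, READMEs without it].
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if word use were independent of language.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Invented counts: rows are two languages, columns are [contains word, does not].
uneven_stat, uneven_dof = chi_square([[30, 20], [10, 40]])
even_stat, even_dof = chi_square([[20, 30], [20, 30]])
```

A word spread evenly across languages gives a statistic near zero, which matches the noise described above. With one degree of freedom, a statistic above 3.841 is significant at the 0.05 level.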

Model Improvement

The model could include the unused feature-engineered columns so that changes in model performance can be evaluated.
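
As a rough illustration of adding engineered features, the sketch below appends bigram features to the unigram features before training a classifier. The README does not name the model types or the engineered columns, so the multinomial Naive Bayes classifier here is a stand-in chosen for brevity, the names `features` and `NaiveBayes` are hypothetical, and the training data is made up.

```python
import math
from collections import Counter, defaultdict


def features(tokens, use_bigrams=True):
    """Unigram features, optionally followed by bigrams joined with '_'."""
    feats = list(tokens)
    if use_bigrams:
        feats += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(doc)
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in doc:
                if feat in self.vocab:  # skip features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, for illustration only.
train = [
    ("pip install numpy", "Python"),
    ("import numpy as np", "Python"),
    ("cmake build make install", "C++"),
    ("include header cmake build", "C++"),
]
model = NaiveBayes().fit(
    [features(text.split()) for text, _ in train],
    [label for _, label in train],
)
prediction = model.predict(features("pip install".split()))
```

Training once with `use_bigrams=False` and once with `use_bigrams=True`, then comparing accuracy on the validate split, would show whether the extra features help.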

Recommendations and Next Steps

We recommend revisiting the labeling of the programming languages for each repository, as the labels do not store accurate information. We understand that there are debates about what is and is not considered a "real programming language", but there is a distinction in which technologies are actually being used, and that should be easy to understand when looking at the tech stack of repositories on GitHub.
Given more time, the following actions could be considered:
- Gather more data to improve model performance.
- We could retrieve more data and approach our web scraping with other technologies.
  - The URL used the website's filter option; we could possibly remove that and use stars as a feature to improve results.
- Fine-tune model parameters for better performance.
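
One possible reading of the stars idea is to request repositories ordered by star count through GitHub's repository search API instead of the website's filter URL. The sketch below only builds such a URL. The function name `search_url` is hypothetical, and the `robotics` topic is an assumption based on the findings above.

```python
from urllib.parse import urlencode


def search_url(topic, page=1, per_page=100):
    """URL for GitHub's repository search API, most-starred repositories first."""
    params = {
        "q": f"topic:{topic}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # the API allows at most 100 results per page
        "page": page,
    }
    return "https://api.github.com/search/repositories?" + urlencode(params)


first_page = search_url("robotics")
```

Each item in the search response carries a `stargazers_count` field alongside `language`, so the star count could also be stored as an extra column for modeling.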

Marc-Aradillas/github_nlp_project