Predicting Coding Language in 'Games' Readmes

By: Arsene Boundaone, Cayt Schlichting, Paige Rackley

The Google Slides summary for our project was created in Canva here.

The target audience for our project is the Codeup Data Science cohort and instructors.


Project Summary


Project Deliverables

  • Google Slides that:
- summarize results from exploration and modeling
- include visuals that help describe our findings
  • a GitHub repository containing:
- a final notebook with our steps, analysis, and findings
- a README.md with a project description and steps to reproduce
- a hyperlink in the README.md to the Google Slides
  • a 5-6 minute presentation of findings

Initial questions on the data

  • What are the top 5 programming languages when searching for 'games' repos on GitHub?
  • What are the most common words for some of these languages?
  • How do the most common words for a language compare to the highest TF-IDF scores for that language? (A sketch of this comparison follows this list.)
  • What are common bigrams for some of these languages? Are they indicative of the programming language?
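
The sketch below shows one way to make these comparisons with scikit-learn. The toy `docs` list stands in for the cleaned readmes of repos labeled with a single language; our notebook works from the full corpus, so treat this as illustrative rather than our exact code.

```python
# Minimal sketch: raw word counts vs. TF-IDF scores, plus bigram counts,
# for one language's readmes. `docs` is toy stand-in data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "pygame sprite loop python game",
    "python game engine with pygame",
    "simple snake game in python",
]

# Most common words: sum raw counts across all documents.
cv = CountVectorizer()
counts = cv.fit_transform(docs)
common = pd.Series(
    counts.sum(axis=0).A1, index=cv.get_feature_names_out()
).sort_values(ascending=False)
print(common.head(5))

# TF-IDF: mean score per term, which down-weights words common everywhere.
tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(docs)
ranked = pd.Series(
    scores.mean(axis=0).A1, index=tfidf.get_feature_names_out()
).sort_values(ascending=False)
print(ranked.head(5))

# Bigrams: the same counting idea with ngram_range=(2, 2).
bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigrams = bigram_cv.fit_transform(docs)
print(
    pd.Series(bigrams.sum(axis=0).A1, index=bigram_cv.get_feature_names_out())
    .sort_values(ascending=False)
    .head(5)
)
```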

Project Plan

  • Acquire: web scrape GitHub search results. We used the first 500 repos returned for the keyword 'games' on the search date of July 22, 2022.
  • Clean and prepare the data for the explore phase.
  • Create wrangle.py to store the functions we created to automate the cleaning and preparation process.
  • Split the data into train, validate, and test subsets, scaling data as needed. (A sketch of this kind of cleaning and splitting pipeline follows this list.)
  • Explore the data through visualization using natural language processing methods.
    • Clearly define our hypotheses and questions.
    • Document findings and takeaways.
  • Perform modeling:
    • Identify model evaluation criteria (what is our baseline?)
    • Create at least three different classification models.
    • Evaluate models on appropriate data subsets.
  • Create the Final Report notebook with a condensed version of the above steps.
  • Create and review README. Ensure it contains:
    • Data dictionary
    • Project summary and goals
    • Initial Hypothesis
    • Executive Summary
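
Below is a minimal sketch of the kind of cleaning and splitting pipeline a wrangle.py might hold. The function names (`basic_clean`, `lemmatize`, `split_data`), the 60/20/20 split, and the random seed are illustrative assumptions, not necessarily what our wrangle.py does.

```python
import re
import unicodedata

import nltk
import pandas as pd
from sklearn.model_selection import train_test_split

# Requires one-time downloads:
# nltk.download('wordnet'); nltk.download('stopwords')

def basic_clean(text: str) -> str:
    """Lowercase, normalize to ASCII, and keep only letters, digits, spaces."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9\s]", " ", text)

def lemmatize(text: str) -> str:
    """Lemmatize each token and drop English stopwords."""
    lemmatizer = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words("english"))
    return " ".join(
        lemmatizer.lemmatize(t) for t in text.split() if t not in stopwords
    )

def split_data(df: pd.DataFrame):
    """Split roughly 60/20/20 into train, validate, test, stratified on language."""
    train_val, test = train_test_split(
        df, test_size=0.2, random_state=123, stratify=df["language"]
    )
    train, validate = train_test_split(
        train_val, test_size=0.25, random_state=123, stratify=train_val["language"]
    )
    return train, validate, test
```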


Executive Summary


Through classification models, we were able to beat our baseline of 25.38% accuracy with a K-Nearest Neighbors model that reached 38% accuracy. This is a small win, but we believe there is room for improvement toward an even more accurate model.

Project Goal: The goal of this project was to use natural language processing and classification models to identify terms that predict a readme's primary language on GitHub.

Key Findings:

  • A coding language's name and the libraries associated with it appear to be indicators of the language. This is most apparent in the TF-IDF scores.
  • Without a deeper understanding of each language's common terms, it is more difficult for us to identify the language from its top words.
  • KNN performed best on our validate subset, with an accuracy of 47% and a precision of 75%. However, these scores dropped notably on our test subset, to an accuracy of 38% and a precision of 43%. The KNN accuracy still outperformed the baseline by 12.65%. (A sketch of this baseline-versus-KNN evaluation follows this list.)
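
For context, here is a minimal sketch of the baseline-versus-KNN evaluation described above, on toy stand-in data. The vectorizer settings and `n_neighbors=3` are assumptions chosen to fit the toy data; the accuracy and precision figures above come from our actual notebook, not from this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for the real splits produced during preparation.
train_text = [
    "pygame sprite loop python",
    "python snake game tutorial",
    "npm install javascript canvas game",
    "unity csharp game engine",
]
train_lang = ["Python", "Python", "JavaScript", "C#"]
val_text = ["simple python game with pygame"]
val_lang = ["Python"]

# Baseline: always predict the most frequent language in train.
baseline = max(set(train_lang), key=train_lang.count)
baseline_acc = accuracy_score(val_lang, [baseline] * len(val_lang))

# KNN on TF-IDF features: fit on train, score on validate.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_text)
X_val = tfidf.transform(val_text)

knn = KNeighborsClassifier(n_neighbors=3)  # assumed hyperparameter
knn.fit(X_train, train_lang)
print(f"baseline: {baseline_acc:.2%}, knn: {knn.score(X_val, val_lang):.2%}")
```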

Recommendations & Next Steps:
  • With more time, we would further tune our models. We would like to try additional random forest and decision tree classifiers with greater depth, given the number of features.
  • We would also like to perform this modeling on a larger dataset. Some of the less common languages appeared fewer than 20 times, leaving relatively little data to train the model on. This likely also hurt our KNN model's performance on the smaller subsets (validate and test).

Data Dictionary


Target:
  • Language: the primary coding language identified in the GitHub repository

Features:
  • content: contents of the repository's README.md (string)
  • repo: partial repo URL
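
Assuming the scraped data lands in data2.json with fields matching this dictionary (an assumption; see the reproduction steps below), a quick look at it might be:

```python
import pandas as pd

# Field names are assumed to match the data dictionary above.
df = pd.read_json("data2.json")
print(df[["repo", "language", "content"]].head())
print(df["language"].value_counts())
```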

Reproducing this project


You can reproduce this project with the following steps:

  • Read this README
  • Clone the repository. Alternatively, you can download the .py and .json files and the Final_Report notebook from the main folder.
  • Run the Final_Report notebook or explore the other notebooks for greater insight into the project.

If you want to recreate this with your own list of repos:

  • Clone the repository.
  • Remove the repos.csv and data2.json files.
  • Read through the acquire.py functions and update the search URL with any keyword changes, or increase the number of results (a hedged sketch of the general shape of this step follows this list).
  • Run "python acquire.py" from your terminal to generate the csv and json files.
  • Step through the Final Report and other notebooks. You may want to modify which languages are viewed.
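
For orientation, here is a hedged sketch of the general shape of the acquisition step. It assumes the GitHub REST search API with a personal access token and fetches only one page of results; our actual acquire.py may scrape the HTML search pages instead, and it paginates through roughly 500 repos.

```python
# Hedged sketch of the acquisition flow; not our actual acquire.py.
import json

import requests

TOKEN = "your-github-token"  # hypothetical placeholder
HEADERS = {"Authorization": f"token {TOKEN}"}
SEARCH_URL = "https://api.github.com/search/repositories?q=games&per_page=100"

def get_repo_names(url: str = SEARCH_URL) -> list[str]:
    """Return 'owner/name' strings for one page of search results."""
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return [item["full_name"] for item in response.json()["items"]]

def get_readme_and_language(repo: str) -> dict:
    """Fetch a repo's primary language and raw README contents."""
    meta = requests.get(
        f"https://api.github.com/repos/{repo}", headers=HEADERS
    ).json()
    readme = requests.get(
        f"https://api.github.com/repos/{repo}/readme",
        headers={**HEADERS, "Accept": "application/vnd.github.raw"},
    )
    return {"repo": repo, "language": meta.get("language"), "content": readme.text}

if __name__ == "__main__":
    data = [get_readme_and_language(name) for name in get_repo_names()]
    with open("data2.json", "w") as f:
        json.dump(data, f)
```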