competitive-coding-analysis: A Python repository from chittaranjan19

This repository contains an exploratory analysis of competitive coding data.

Prerequisites for using this project:

python2
- sklearn
- scipy
- lizard
python3
- bs4
- urllib

The directory structure:

Data
- Section-1
  - username
    - submissionID.txt : The code the user wrote for that submission.
    - username.csv : Uncleaned CSV file with all the data scraped about the user's submissions.
    - username_m.csv : Cleaned CSV file containing only the submissions written in C++.
    - username_cleaned.csv : CSV file containing submissions in all languages, cleaned by fixing some minor errors.
- Section-2
  - username
    - submissionID.txt : The code the user wrote for that submission.
    - username.csv : Uncleaned CSV file with all the data scraped about the user's submissions.
    - username_m.csv : Cleaned CSV file containing only the submissions written in C++.
    - username_cleaned.csv : CSV file containing submissions in all languages, cleaned by fixing some minor errors.
- Section-3
  - username
    - submissionID.txt : The code the user wrote for that submission.
    - username.csv : Uncleaned CSV file with all the data scraped about the user's submissions.
    - username_m.csv : Cleaned CSV file containing only the submissions written in C++.
    - username_cleaned.csv : CSV file containing submissions in all languages, cleaned by fixing some minor errors.
- Section-4
  - username
    - submissionID.txt : The code the user wrote for that submission.
    - username.csv : Uncleaned CSV file with all the data scraped about the user's submissions.
    - username_m.csv : Cleaned CSV file containing only the submissions written in C++.
    - username_cleaned.csv : CSV file containing submissions in all languages, cleaned by fixing some minor errors.
- CSV files
  - features_time.csv : CSV file containing the time-dependent attributes for all users.
  - features_non_time.csv : CSV file containing the skill-based (non-time) attributes for all users.
  - questions-complexity.csv : CSV file containing all the questions with their tags, number of accepted submissions, and complexity.
  - questions.csv : CSV file containing all the scraped questions with their tags and number of accepted submissions.
Scripts
- Computation
  - questions.py : Builds a hash table mapping each question ID to its complexity and its class, and writes the result to questions-complexity.csv.
- Features
  - features.py : Generates the time-dependent and the skill-based (non-time) attributes from the original set of attributes.
- K-Means
  - kmeans.py : Clusters users based on the time-dependent attributes and generates a cluster plot of the users.
  - kmeans_pca.py : Uses PCA to reduce the many skill-based attributes to three components, then runs k-means on those components.
- Scraping
  - questionScrape.py : Scrapes all the questions, their tags, and their number of accepted submissions from the Codeforces website.
  - section-1.py : Scrapes 50 random users from ranks 1-6000.
  - section-2.py : Scrapes 50 random users from ranks 6000-12000.
  - section-3.py : Scrapes 50 random users from ranks 12000-18000.
  - section-4.py : Scrapes 50 random users from ranks 18000-24000.
- Visualisations
  - pieChart.py : Generates a pie chart for each section showing how many easy, medium, and hard problems its users solve.
  - timerAll.py : Plots users against the times at which they submit, to show when coders are most commonly active.
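
Example: the PCA-then-k-means step described for kmeans_pca.py can be sketched roughly as below. This is a minimal illustration and not the repository's actual code. It assumes Python 3 with numpy and a recent sklearn. The function name cluster_users, the choice of four clusters, the standardisation step, and the synthetic input (standing in for features_non_time.csv, whose columns are not documented here) are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_users(features, n_clusters=4, random_state=0):
    """Reduce a (n_users, n_attributes) matrix to 3 components, then cluster.

    Hypothetical helper: n_clusters=4 and the scaling step are assumptions,
    not settings taken from kmeans_pca.py.
    """
    # Standardise so that attributes on large scales do not dominate PCA.
    scaled = StandardScaler().fit_transform(features)
    # Project the attributes onto their first three principal components.
    components = PCA(n_components=3).fit_transform(scaled)
    # Run k-means on the reduced representation.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(components)
    return components, labels


# Synthetic stand-in data: 200 users with 8 made-up attributes each.
rng = np.random.RandomState(0)
features = rng.rand(200, 8)

components, labels = cluster_users(features)
print(components.shape)       # one row per user, three components
print(np.bincount(labels))    # number of users assigned to each cluster
```

Clustering on three components rather than on every attribute keeps the result easy to plot in 3D. To try it on real data, replace the synthetic matrix with the numeric columns loaded from features_non_time.csv.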
