/shell-command-ranking

a rough dataset of the occurrence shell commands used on popular GitHub projects

Primary LanguageShellCreative Commons Attribution Share Alike 4.0 InternationalCC-BY-SA-4.0

shell-command-ranking

a dataset of the occurrence shell commands used on popular github projects includes:

  • projects labeled as "shell" programming language
  • projects tagged with the bash, zsh, fish, tcsh, ksh and additional relevant topics
  • 6,272 projects
  • 203,403 unique commands

ranking.sh - script to collect data

  • gets 1000 github repos labeled as containing a shell script language ranked by github stars, 10 batches of 100 at a time (max is 1000), with a random sleep to avoid the github rate limit
  • exports a list of those repos to repo_list.txt
  • shallow clones each of them to /repos
  • calls clean_repos.sh on each one to remove extraneous files to save space
  • searches for commands in *.sh, *.bash, *.zsh, *.fish, *.ksh, *.tcsh, and Dockerfiles in /repos
  • ranks those commands and outputs to command_ranking.txt

repo_list.txt - list of repos queried

  • github repos included in data sampling done on February 12, 2024:
    • 990 projects labeled as "shell" programming language
    • 913 additional projects tagged as "bash" topic
    • 791 additionalprojects tagged as "zsh" topic
    • 885 additionalprojects tagged as "fish" topic
    • 33 additional projects tagged as "tcsh" topic
    • 93 additional projects tagged as "ksh" topic
    • 862 additional projects tagged as "linux" topic
    • 806 additional projects tagged as "wsl" topic
    • 896 additional projects tagged as "devops" topic
  • list of shell commands detected by ranking.sh from repos listed in repo_list.txt, ranked in order of usage from most to least

clean_repos.sh - cleanup utility used during data collection to save space

  • deletes extraneous files from the /repos/ dataset to reduce it's size, unless a specific folder is specified as an argument, then it just deleted extraneous files in that folder
  • using clean_repos.sh reduces the footprint of cloning ~1000 github repos to ~1 GB

License

  • Scripts ending *.sh, README.md, and *git files are licensed under Apache 2.0
  • Data, including command_ranking.txt and repo_list.txt, are licensed under CC-SA-4 International