list of CI configuration files we will scrape:

Jenkinsfile
travis yml
gitlab yml
circle ci yml
go cd yml

Most of them can be found with a search query and saved into a CSV file. However, Jenkinsfile will be really hard to search for, as it doesn't have a file extension to search on :(
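Since Jenkinsfile has no extension, one option is to classify search results by file name instead of by extension. A minimal sketch, assuming the conventional file names for each tool; the classify_ci_file helper and the mapping are illustrative and not taken from scraper.py, and the GoCD suffix in particular is an assumption based on the YAML config plugin's usual naming:

```python
import posixpath

# Illustrative mapping from conventional file names to CI systems.
CI_FILENAMES = {
    ".travis.yml": "travis",
    ".gitlab-ci.yml": "gitlab",
    "circle.yml": "circleci",  # older CircleCI layout
    "Jenkinsfile": "jenkins",  # no extension, so match on the whole name
}


def classify_ci_file(path):
    """Return the CI system a repository path belongs to, or None."""
    name = posixpath.basename(path)
    if name in CI_FILENAMES:
        return CI_FILENAMES[name]
    # Newer CircleCI keeps its config inside a .circleci directory.
    if path == ".circleci/config.yml" or path.endswith("/.circleci/config.yml"):
        return "circleci"
    # Assumed GoCD convention: files ending in .gocd.yml or .gocd.yaml.
    if name.endswith((".gocd.yml", ".gocd.yaml")):
        return "gocd"
    return None
```

Matching on the base name means a Jenkinsfile in a subdirectory is still picked up, which an extension filter would miss.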

to run

pip install -r requirements.txt

create a .env file containing your GitHub token in a variable named GITHUB_TOKEN
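For reference, the .env file is just KEY=VALUE lines. A minimal standard-library sketch of reading the token from it; the parse_env and get_github_token helpers are hypothetical, and the project may well use a dedicated loader instead:

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_github_token(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    token = os.environ.get("GITHUB_TOKEN")
    if not token:
        with open(path, encoding="utf-8") as handle:
            token = parse_env(handle.read()).get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set")
    return token
```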

run scraper.py to create the data

then run main.py with the .env set up, first to CHECK and then to RENDER, to create all the things

orca will be required in order to render the plotly sankey graph

npm install -g electron@1.8.4 orca

notes from latest meeting (12/11/2019)

how much do things differ from the standard templates for given languages? for travis and other languages

detect snippets as described in docs
before and after scripts
comment frequency and single line vs multi line
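Comment frequency and the single-line vs multi-line split could be measured along these lines. This is a sketch under the assumption that only full-line # comments count (trailing inline comments are ignored); comment_stats is a hypothetical helper, not part of main.py:

```python
def comment_stats(text):
    """Count comment lines and classify runs of consecutive comment lines."""
    stats = {"lines": 0, "comment_lines": 0, "single": 0, "blocks": 0}
    run = 0  # length of the current run of consecutive comment lines
    # The trailing empty sentinel flushes a comment run that ends the file.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line.startswith("#"):
            stats["comment_lines"] += 1
            run += 1
        else:
            if run == 1:
                stats["single"] += 1
            elif run > 1:
                stats["blocks"] += 1
            run = 0
        if line:
            stats["lines"] += 1  # non-blank lines, comments included
    return stats
```

Dividing comment_lines by lines then gives a per-file comment frequency that can be compared across CI types.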

results need to be publishable

do they use build matrices etc.?

what other environment variables do they use?

how do they use stages / tasks? and how are they executed?

distribution of configuration lines of code per group

distribution per language and the length of the ci script
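One way to get the per-language length distribution is to count non-blank lines per config and summarise them by language. A minimal sketch, assuming the data is available as (language, config text) pairs; loc_by_language and the "unknown" bucket are illustrative choices:

```python
from collections import defaultdict
from statistics import median


def loc_by_language(rows):
    """Summarise config length (non-blank lines) per repository language."""
    lengths = defaultdict(list)
    for language, text in rows:
        loc = sum(1 for line in text.splitlines() if line.strip())
        # Repositories without a detected language go into one bucket.
        lengths[language or "unknown"].append(loc)
    return {
        lang: {
            "count": len(values),
            "min": min(values),
            "median": median(values),
            "max": max(values),
        }
        for lang, values in lengths.items()
    }
```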

plan:

finish scraping the data [ ]
- need to make this more consistent so we can get a long run for this!!! [X]
group the data sets into one large one [X]
process the data removing duplicates [X]
get the graphs for language and thing type, creating separate graphs for each thing
- get that to pick up on when there is no configuration in the %, otherwise it is skewed
yaml parser project!
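The duplicate removal and the "no configuration" percentage could be combined as below. This is a sketch that assumes rows of (repository name, CI type or None); config_share is a hypothetical helper, and percentages are taken over distinct repositories so that repositories without any configuration are counted instead of silently dropped:

```python
from collections import Counter


def config_share(rows):
    """Percentage of distinct repositories per CI type, including 'none'."""
    seen_pairs = set()
    repos = set()
    counts = Counter()
    for name, ci_type in rows:
        key = (name, ci_type or "none")
        if key in seen_pairs:
            continue  # exact duplicate row from overlapping scrapes
        seen_pairs.add(key)
        repos.add(name)
        counts[key[1]] += 1
    total = len(repos)
    if total == 0:
        return {}
    return {ci_type: 100.0 * n / total for ci_type, n in counts.items()}
```

A repository with two CI systems is counted once under each, so the shares can add up to more than 100.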

Bugs:

none so far....

Fixed:

watchers aren't being picked up properly

consistent high speed data gathering is hard atm

read time out
502 error

saving >>> drizzleDumper
saving >>> DZNSegmentedControl
saving >>> DLCImagePickerController
Traceback (most recent call last):
  File "main.py", line 254, in <module>
    main()
  File "main.py", line 226, in main
    getReposStuff("raptor_webb", i, i + 100)
  File "main.py", line 185, in getReposStuff
    saveData = saveRepos(page, ["" for i in range(len(page))], name)
  File "main.py", line 42, in saveRepos
    dictionary = dict([(k, fixEncoding(getattr(repo, k))) for k in keys])
  File "main.py", line 42, in <listcomp>
    dictionary = dict([(k, fixEncoding(getattr(repo, k))) for k in keys])
  File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/Repository.py", line 732, in topics
    self._completeIfNotSet(self._topics)
  File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/GithubObject.py", line 262, in _completeIfNotSet
    self._completeIfNeeded()
  File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/GithubObject.py", line 266, in _completeIfNeeded
    self.__complete()
  File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/GithubObject.py", line 273, in __complete
    self._url.value
  File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/Requester.py", line 268, in requestJsonAndCheck
    return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
  File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/Requester.py", line 279, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 502 {"message": "Server Error"}

so we could probably put an exception handler around this to ignore it maybe???
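Instead of ignoring the error outright, a small retry wrapper would let a long run survive transient 502s and read timeouts. A minimal sketch; call_with_retry, its parameters and the backoff values are assumptions, not existing code in main.py:

```python
import time


def call_with_retry(func, retryable, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(); retry with exponential backoff while retryable(exc) is true."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on errors we should not retry.
            if attempt == attempts or not retryable(exc):
                raise
            sleep(delay * 2 ** (attempt - 1))
```

With PyGithub, the retryable predicate could be something like checking that the exception is a GithubException whose status is 500 or above, so that real failures such as a bad token still stop the run.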

jenkins configuration wasn't being picked up as we were only getting .yml or .yaml file types
- this should be good validation but do we want data for weird files that don't match???

csv -> line -> base64 -> text -> yaml
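The pipeline above can be sketched as follows. The column name "config" is a placeholder for whichever CSV column holds the base64 content, and the last step assumes PyYAML is installed; if it is not, the sketch still yields the decoded text:

```python
import base64
import csv
import io

try:
    import yaml  # PyYAML (third-party); assumed parser for the last step
except ImportError:
    yaml = None


def decode_configs(csv_text, column="config"):
    """Yield (row, text, parsed) for every CSV row holding a base64 config."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # base64 -> bytes -> text; undecodable bytes are replaced, not fatal.
        text = base64.b64decode(row[column]).decode("utf-8", errors="replace")
        parsed = None
        if yaml is not None:
            try:
                parsed = yaml.safe_load(text)
            except yaml.YAMLError:
                parsed = None  # keep the raw text even if the YAML is invalid
        yield row, text, parsed
```

Keeping the raw text alongside the parsed object means line-based analyses (comments, lines of code) still work on files that fail to parse.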

so it seems like it could be possible to create a python set of classes for parsing the yaml or we could just have them as yaml objects

yaml:

present:
- build matrices
- environment variables and secrets
analyses
- how do they use stages and tasks? what are common names etc. how does this differ per language
- comment lines, and single lines vs blocks of comments
adv analysis:
- how much do things differ from the standard templates for given languages? for travis and other languages
  - detect snippets as described in docs
  - before and after scripts
loc [x]

what do people use CI for???? testing or deployment or notifications. for deploys: the tag, and maybe string search in notifications, but not 100% reliable as scripts

distribution of configuration lines of code per group

distribution per language and the length of the ci script

JosephLing/scraperGithub
