jgehrcke/github-repo-stats

Provide tooling to aggregate files in snapshots directory

jgehrcke opened this issue · 1 comments

Over time, the number of individual files in the ../ghrs-data/snapshots/ directory grows to be O(1000) per year. This is not a problem for git. However, it creates inconveniences. For example, the snapshots directory can no longer be browsed meaningfully via GitHub:

[Screenshot, 2022-06-15: ghrs-test/jgehrcke/covid-19-germany-gae/ghrs-data/snapshots at github-repo-stats]

Note that only the oldest files are shown here; the listing is truncated before the newer files.

Another inconvenience is timing: upon checkout and parsing, there may be a noticeable difference between writing and reading one file and writing (upon checkout) and reading (upon parsing) 1000 files.

I think in the long run the Action should automatically aggregate data into fewer individual files (each with more content, obviously), so that there are maybe O(10) files per year overall.

One question is whether the files should be nicely readable CSV files or whether it makes sense to use a different serialization format.

An intermediate, pragmatic step for me is to build tooling that allows doing this aggregation out-of-band, i.e. not as part of an Action run. The changes can then be committed manually to the data branch.
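
To make the out-of-band idea concrete, here is a minimal sketch of what such tooling could look like. It is not part of the Action. It assumes (hypothetically) that snapshot files are CSV files named like `<YYYY-MM-DD>_<HHMMSS>_<kind>.csv` and that all files of the same kind share the same header; the function name `aggregate_snapshots` and the added `source_file` column are made up for illustration. It merges the files into one CSV file per month and kind, and leaves the original files untouched.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path

# Assumption: snapshot file names look like "2022-06-15_164440_<kind>.csv".
# The real naming scheme may differ; adjust the pattern accordingly.
NAME_PATTERN = re.compile(r"^(\d{4}-\d{2})-\d{2}_\d{6}_(.+)\.csv$")


def aggregate_snapshots(snapshot_dir: Path, out_dir: Path) -> list[Path]:
    """Merge snapshot CSV files into one CSV file per (month, kind)."""
    groups: dict[tuple[str, str], list[Path]] = defaultdict(list)
    for path in sorted(snapshot_dir.glob("*.csv")):
        match = NAME_PATTERN.match(path.name)
        if match is None:
            continue  # Skip files that do not follow the assumed naming.
        month, kind = match.groups()
        groups[(month, kind)].append(path)

    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for (month, kind), paths in sorted(groups.items()):
        out_path = out_dir / f"{month}_{kind}.csv"
        with out_path.open("w", newline="") as out_file:
            writer = csv.writer(out_file)
            header_written = False
            for path in paths:
                with path.open(newline="") as in_file:
                    rows = list(csv.reader(in_file))
                if not rows:
                    continue  # Empty snapshot file, nothing to merge.
                if not header_written:
                    # Keep track of which snapshot each row came from.
                    writer.writerow(["source_file", *rows[0]])
                    header_written = True
                for row in rows[1:]:
                    writer.writerow([path.name, *row])
        written.append(out_path)
    return written
```

Calling `aggregate_snapshots(Path("ghrs-data/snapshots"), Path("ghrs-data/aggregated"))` would write the merged files to a separate directory (both paths are placeholders here); the result could then be reviewed and committed manually to the data branch. Grouping by month gives roughly 12 files per kind per year, which is in the O(10) range mentioned above.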

@jgehrcke thanks for creating this project and keeping it going. I was hoping to get an aggregated view for paths and referrers, like the one there is for views and clones. You can leave the individual files in the snapshots directory and aggregate the data and store it separately. Is that something you have on your radar? Is there anything I can do to help?