newaetech/chipwhisperer

Repo Cleanup

Opened this issue · 6 comments

ChipWhisperer is a fairly old project at this point and, as such, the repo has accumulated a lot of files and a large history. This has a lot of negative effects:

  • Downloading the repo takes a while
  • The repo's size also inflates the installer and the VM image
    • The VM image is now so large that it exceeds GitHub's size limit for release files
    • These large file sizes increase build time, download time, install time, etc.
  • Some of the paths are a lot longer than they need to be. hardware/victims/firmware/* could be condensed into just target_firmware, for example
  • There are a lot of files here that most users probably don't care about - few people, for example, want to print ChipWhisperer-Lite PCBs or rebuild the FPGA/microcontroller firmware

It would therefore be beneficial if we could archive most of that history/those files and start fresh. The archive should be fairly simple: just make a new repo (maybe chipwhisperer-historical) and push a local copy there.

For the new chipwhisperer repo, one option would be to start completely fresh; move all the desired files into a new repo and point that here. However, it would be nice if we could keep the history of all the non-NewAE contributions and just squash everything else to reduce space.
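A minimal sketch of the "start completely fresh" option using plain git (demoed here in a scratch repo; note that, unlike the filter-repo route, this drops all history, including outside contributions):

```shell
# Sketch in a scratch repo: fabricate some history, then squash it by
# committing the current tree onto an orphan branch (no parent commits).
demo=$(mktemp -d) && cd "$demo"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "old history 1"
echo hello > readme.txt
git add -A
git -c user.name=demo -c user.email=demo@example.com commit -q -m "old history 2"

# The fresh-start recipe itself:
git checkout -q --orphan fresh-start   # new branch with no parents
git add -A                             # re-stage the current tree
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Fresh start: squash prior history"
git rev-list --count HEAD              # history is now a single commit
```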

There may be some less drastic options too. I tried running git-filter-repo's analyze mode, which showed close to 400MB of deleted directories (in packed size). That, combined with killing old branches, might go pretty far. It also looks like a lot of FPGA implementation files (e.g. for the CW305) are included, which most people don't care about either, so we could save some space by removing (or relocating) them.

We could also consider moving the ChipWhisperer Python code to a separate repo... a long-discussed option, but it would need more consideration, as it may be an even bigger breaking change.

blob-shas-and-paths.txt
directories-all-sizes.txt
path-deleted-sizes.txt
path-all-sizes.txt
extensions-deleted-sizes.txt
extensions-all-sizes.txt
directories-deleted-sizes.txt

Moving all the FPGA target stuff to its own repo makes sense and would give a substantial reduction.

Seems like the archive import will be even easier than I thought. There's a GitHub option when creating a new repo to import an existing one. Will have to see how this works, but hopefully it grabs all the branches and everything currently there.

EDIT: Yeah, looks like importing preserves everything, including commits/branches/etc.

git-filter-repo also seems to work very well. By deleting the CW305 files and removing the history of every deleted file, I was able to get the repo size down to ~400MB. This is down from roughly 1.7GB. It may be worth trying to squash all the hardware/cw305.py history down to a single commit as well. I'd guess that would save something like 100MB

I assume that we can make similar gains on chipwhisperer-jupyter as well.

Useful link for this: https://stackoverflow.com/questions/63496368/git-how-to-remove-all-files-from-the-git-history-that-are-not-currently-prese
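For reference, a sketch of the git-filter-repo steps described above, demoed in a scratch repo (the file names here are made up; the real run would happen on a fresh clone of chipwhisperer):

```shell
# Requires git-filter-repo (a single-file Python tool):
command -v git-filter-repo >/dev/null || python3 -m pip install -q git-filter-repo

# Demo repo with a deleted directory lingering in its history:
demo=$(mktemp -d) && cd "$demo" && git init -q
mkdir -p fpga docs
echo bitstream > fpga/cw305.bit
echo notes > docs/readme.txt
git add -A && git -c user.name=demo -c user.email=demo@example.com commit -q -m "import"
git rm -q -r fpga
git -c user.name=demo -c user.email=demo@example.com commit -q -m "delete fpga files"

# Size/report analysis (this is what produced the *.txt reports above,
# written to .git/filter-repo/analysis/):
git filter-repo --analyze

# The Stack Overflow approach: keep only paths present in the current
# tree, so every already-deleted file vanishes from history too.
# (--force because this demo repo isn't a fresh clone.)
git ls-files > ../keep-paths.txt
git filter-repo --force --paths-from-file ../keep-paths.txt
```

After the rewrite, the deleted fpga/ directory is gone from history entirely, and commits that become empty are pruned by default.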

For chipwhisperer-jupyter, it looks like there's quite a bit to save, but the traces for the simulated versions of the labs end up taking a lot of space. Without the trace files, we can get the repo down to under 200MB.

That sounds pretty good! For the traces and similar, we could move them to some external location (either another repo or even off GitHub). We'd like something stable, so GitHub might still make sense, but they could be downloaded "on demand" if you actually need them (and not by default).

I'd thought about this a little before; we could have a small Python module that deals with it, e.g.:

import chipwhisperer_traces as ct

traces = ct.sca101.etc

We'd have to see if there's an easy module to do this for us, but the basic idea would be that a trace set is actually downloaded, and cached locally, the first time you access it. Or you could force a download of everything (if, for example, you're running a training and want it all cached locally) with something like ct.download()
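A minimal sketch of how the lazy-download part could work (the module name, URL, cache location, and function names here are all placeholders, not a real package):

```python
import os
import urllib.request

# Hypothetical hosting location and cache dir -- both still undecided.
BASE_URL = "https://example.com/chipwhisperer-traces"
CACHE_DIR = os.path.expanduser("~/.chipwhisperer/traces")

def fetch(name):
    """Return the local path of a trace file, downloading it on first access."""
    local = os.path.join(CACHE_DIR, name)
    if not os.path.exists(local):
        # Only hit the network on a cache miss; afterwards it's local.
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", local)
    return local

def download(names):
    """Force-fetch a list of trace sets up front (e.g. before a training)."""
    return [fetch(n) for n in names]
```

Attribute-style access like ct.sca101 could then be layered on top, e.g. with a module-level __getattr__ that maps names to fetch() calls.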

The main advantage of a more involved download system is that it can be repointed at almost anything in the future: another URL, or even another backend entirely (e.g., eventually a real database or similar).