I am not actively developing this tool anymore. There wasn't enough interest, including for my own use cases, to warrant the time.
With that said, here is my roadmap for development if I ever do return:
- Change the flag API to not use the deprecated and undocumented argparse feature. The current approach is nice, but I don't like relying on undocumented features. Instead, there will be a flag where you specify rclone options. (minor)
- Move to the rclone API to allow sessions, greatly reducing traffic, API calls, costs, etc. (major)
  - I think LFS provides a way to start and end a server, but I'd have to experiment. This is needed to make this a viable tool since you otherwise waste a ton of API calls.
This software is still beta. A non-exhaustive and only roughly ordered list of priorities is:
- Test (and fix?) on Windows
- Improve tests to handle/verify other edge cases
  - subdirectories (work but not tested)
  - conflicting rclone flags
  - test for line coverage
  - committed rclone config (similar chicken-and-egg as noted before)
  - tests for additional error capture from rclone
- Document migrations (if possible)
- PyPI (or do we just distribute via pip install git+https://... ?)
- Better distribution and setup.py format
- type annotations
- Better parallel working so more than one thing can upload at the same time and make --transfers more correct.
- real-world production testing
- Remote Filename Format: Rather than
53/a0/53a079ad5d55c455b3d617a60117fd7f87ac8c0097454c63c3e5c91fdca9d1af
we can cut this down. One idea is to keep the first two hex bytes (53/a0) and then base32-encode the rest (base32 for case insensitivity), leaving those two bytes out of the encoded part for even more space savings.
import binascii, base64
hexhash = "53a079ad5d55c455b3d617a60117fd7f87ac8c0097454c63c3e5c91fdca9d1af"
# two directory levels from the first two bytes, then base32 of the remaining 30 bytes
filename = f"{hexhash[:2]}/{hexhash[2:4]}/{base64.b32encode(binascii.unhexlify(hexhash[4:])).decode()}"
This ends up being 54 characters with no loss of information.
Do I make this an option, or the default? Is it worth it?
- Content-based chunking: Much further down the line (and maybe a totally new tool), but content-based chunking could be applied to split files before upload. It adds a whole level of complexity but is also pretty useful for small changes to large binaries.
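To make the "no loss of information" claim for the shortened remote filename concrete, here is a minimal sketch of the encoding together with its inverse. The function names `compact_name` and `original_hash` are hypothetical and not part of lfsrclone.

```python
import base64
import binascii


def compact_name(hexhash: str) -> str:
    """Keep the first two hex bytes as directories and base32-encode the rest."""
    rest = binascii.unhexlify(hexhash[4:])  # 30 raw bytes for a SHA256 digest
    return f"{hexhash[:2]}/{hexhash[2:4]}/{base64.b32encode(rest).decode()}"


def original_hash(name: str) -> str:
    """Invert compact_name to recover the full lowercase hex digest."""
    first, second, encoded = name.split("/")
    return first + second + base64.b32decode(encoded).hex()


hexhash = "53a079ad5d55c455b3d617a60117fd7f87ac8c0097454c63c3e5c91fdca9d1af"
name = compact_name(hexhash)
print(len(name))                       # 54
print(original_hash(name) == hexhash)  # True
```

Because the remaining 30 bytes are a multiple of 5, the base32 output needs no `=` padding, which keeps the name free of characters that some remotes dislike.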
Implements a pretty simple custom-transfer agent for git-lfs.
This is BETA. See Known Issues and Roadmap for more details.
This project is heavily inspired by lfs-folderstore and git-lfs-swift-transfer-agent (The idea mostly came from the former but the latter, being that it is in Python, was useful). git-lfs-rsync-agent also proved to be valuable in development.
PyPI To come later
$ python -m pip install git+https://github.com/Jwink3101/lfsrclone
TODO: INSTALL
The following optional flags can be specified:
usage: lfsrclone [-h] [--log-file LOG_FILE]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL,NONE}]
[--rclone-exe RCLONE_EXE] [--temp-dir TEMP_DIR]
remote
positional arguments:
remote Specify rclone remote
optional arguments:
-h, --help show this help message and exit
--log-file LOG_FILE [.git/lfsrclone.log] Specify alternative log file
destination
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL,NONE}
Logging levels. Set to None to (effectivly) disable
logging (i.e. set to 9999)
--rclone-exe RCLONE_EXE
['rclone'] Specify rclone executable.
--temp-dir TEMP_DIR [.git/lfsrclone-tmp] Specify a temporary download
directory
All additional arguments are passed to rclone
This section is based on a similar one from lfs-folderstore.
Download and install git-lfs. Make sure you have followed the getting-started directions, such as git lfs install (globally), git lfs track *.ext (locally), and git add .gitattributes.
This assumes you have already set up LFS as noted above.
Set the following:
$ git config --add lfs.customtransfer.lfsrclone.path lfsrclone
Or, if you did not install lfsrclone, you can specify the full path to the Python file.
Then,
$ git config --add lfs.standalonetransferagent lfsrclone
And finally,
$ git config --add lfs.customtransfer.lfsrclone.args "remote: <lfsrclone-options> <additional rclone flags>"
Note that in the above, all arguments must be escaped properly so that they are passed to git-config as a single argument. Alternatively, do something like:
$ git config --add lfs.customtransfer.lfsrclone.args TMP
then open .git/config
and you will see lines like:
[lfs "customtransfer.lfsrclone"]
path = lfsrclone
args = TMP
You can then set TMP to your full rclone command, including \ for line continuation, etc. The <lfsrclone-options> are those noted above, including the log file.
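If escaping the args value by hand is error-prone, one option is to let Python quote it. The following is only a sketch: `build_args` is a hypothetical helper, the option values are made up for illustration, and it assumes git-lfs splits the args string using shell-like rules.

```python
import shlex


def build_args(remote: str, *options: str) -> str:
    """Join the remote and all options into one shell-quoted string."""
    return shlex.join([remote, *options])


# Example values only; the temp dir contains spaces to show the quoting
args = build_args("remote:", "--log-level", "DEBUG", "--temp-dir", "/tmp/my lfs tmp")

# Passing the command as a list means `args` reaches git-config as one argument
command = ["git", "config", "--add", "lfs.customtransfer.lfsrclone.args", args]
print(shlex.join(command))
```

Running the list with `subprocess.run(command, check=True)` inside the repo would apply the setting without any further manual escaping.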
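Cloning an existing lfsrclone repo presents a "chicken and egg" problem: the transfer agent can only be configured inside the repo, but a normal clone already tries to download the LFS files.
To work around this, tell git-lfs not to download files during the clone.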
$ export GIT_LFS_SKIP_SMUDGE=1
$ git clone <repo>
$ unset GIT_LFS_SKIP_SMUDGE
##### OR #####
$ GIT_LFS_SKIP_SMUDGE=1 git clone <repo>
(this assumes Bash but it is similar for other shells)
Then move into the repo and set up as per above:
$ git config --add lfs.customtransfer.lfsrclone.path lfsrclone
$ git config --add lfs.standalonetransferagent lfsrclone
$ git config --add lfs.customtransfer.lfsrclone.args "remote: <flags>" # or TMP and replace
Finally:
$ git lfs pull
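The same clone-then-configure sequence could be scripted. This is a rough sketch, not something shipped with lfsrclone: `clone_with_lfsrclone` and its parameters are hypothetical, and the `run` argument exists only so that the commands can be inspected without actually executing git.

```python
import os
import subprocess


def clone_with_lfsrclone(repo, dest, rclone_args, run=subprocess.run):
    """Clone without LFS downloads, configure the agent, then pull LFS files."""
    # Skip the LFS smudge filter for the clone only, not for later commands
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    run(["git", "clone", repo, dest], check=True, env=env)

    settings = [
        ("lfs.customtransfer.lfsrclone.path", "lfsrclone"),
        ("lfs.standalonetransferagent", "lfsrclone"),
        ("lfs.customtransfer.lfsrclone.args", rclone_args),
    ]
    for key, value in settings:
        run(["git", "config", "--add", key, value], check=True, cwd=dest)

    # With the agent configured, the LFS content can now be fetched
    run(["git", "lfs", "pull"], check=True, cwd=dest)
```

A call would look like `clone_with_lfsrclone("<repo>", "myrepo", "remote: <flags>")`, with the placeholders replaced by real values.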
Since lfsrclone does not support file-locking, you may have to set
$ git config lfs.locksverify false
inside the repo.
git-lfs has its own way to set transfers and concurrency, as does rclone. lfsrclone will not try to deduce either in any way; it will run however many transfers it is called for. See lfs.concurrenttransfers to set the number or lfs.customtransfer.<name>.concurrent to disable concurrency. Alternatively, or in addition, you can set the --transfers N flag for rclone.
You can pass any flags to rclone by adding them to the args setting (all additional arguments are passed to rclone), but note that some are set automatically and may not be compatible with what you set. lfsrclone also uses copy instead of copyto since we do not want to upload if the dest is already there (and the right size). Notable flags lfsrclone sets:
- --size-only: We do not need ModTime, so there is no reason to get it, and some remotes are very slow to return it. We do not set --ignore-existing because we want to overwrite incomplete uploads. And all remotes set size.
- Progress reporting
  - --log-level INFO: used to make it print output (note: same as -v)
  - --use-json-log: makes the output easier to parse
- --no-traverse: Single transfers only, so we can save listing
- --ask-password=false: No password prompts. Better to error
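As an illustration of how the flags listed above fit together, here is a sketch of assembling a single-file upload command. It is not lfsrclone's actual code: `rclone_copy_command`, the argument order, and the example paths are assumptions.

```python
def rclone_copy_command(src_file, dest_dir, extra_flags=(), rclone_exe="rclone"):
    """Build an rclone invocation for one file using the flags described above."""
    return [
        rclone_exe, "copy", src_file, dest_dir,  # copy takes a destination directory
        "--size-only",             # compare by size only, no ModTime lookups
        "--log-level", "INFO",     # make rclone print stats output
        "--use-json-log",          # one JSON object per log line
        "--no-traverse",           # single transfer, so skip listing the destination
        "--ask-password=false",    # fail instead of prompting
        *extra_flags,              # user-supplied rclone flags go last
    ]


cmd = rclone_copy_command(
    ".git/lfsrclone-tmp/53a079ad",  # hypothetical local file
    "remote:53/a0",
    extra_flags=["--stats", "100ms"],
)
print(" ".join(cmd))
```

Since `copy` keeps the source file's name, this layout assumes the local file is already named after its hash.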
You can force rclone, and therefore LFS, to provide updates more often with --stats. For example, --stats 100ms will update 10 times a second.
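To show why more frequent stats lead to more frequent LFS updates, the sketch below converts rclone JSON log lines into git-lfs progress events. The sample lines are simplified and made up, and the assumption that each stats line carries a `stats` object with a `bytes` field should be checked against real rclone output; `progress_events` is a hypothetical name.

```python
import json


def progress_events(log_lines, oid):
    """Yield git-lfs style progress events from rclone JSON log lines."""
    last = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        stats = entry.get("stats") if isinstance(entry, dict) else None
        if not isinstance(stats, dict) or "bytes" not in stats:
            continue  # a log message without transfer statistics
        so_far = int(stats["bytes"])
        yield {
            "event": "progress",
            "oid": oid,
            "bytesSoFar": so_far,
            "bytesSinceLast": so_far - last,
        }
        last = so_far


# Simplified, invented log lines for illustration only
sample = [
    '{"level":"info","msg":"stats","stats":{"bytes":100,"totalBytes":400}}',
    '{"level":"info","msg":"stats","stats":{"bytes":250,"totalBytes":400}}',
]
for event in progress_events(sample, "53a079ad"):
    print(json.dumps(event))
```

Each stats line becomes one progress event, so a shorter --stats interval directly means more events for git-lfs to display.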
You can set up any rclone remote. However, if you use crypt, consider whether or not you need directory name encryption. Files are stored in a content-addressable manner of <first hex byte>/<second hex byte>/<hex name>. Encrypting the filenames makes sense as they are the SHA256 and, while unlikely for most content, could leak the contents. However, unless the first two bytes of the SHA256 are critical, encrypting the leading directory names adds a lot of length to the file name.
For example, the following encrypts 70 characters
53/a0/53a079ad5d55c455b3d617a60117fd7f87ac8c0097454c63c3e5c91fdca9d1af
to 182 characters
fi7dlav8hvdcpf6kjpth26q40o/kjj1o6gs49ee5pmmij57rb0a2k/d696m9fgemb461gv0gsqroqrkojb9pg15q1ito7idtigpc1itaskqbmqdmmes3g78dvgki4jb80tth36dmj6opum9kbbrblkig34hk8qkquodng9kc59pauesmbjfj2c
while disabling directory name encryption reduces it to 134 characters
53/a0/d696m9fgemb461gv0gsqroqrkojb9pg15q1ito7idtigpc1itaskqbmqdmmes3g78dvgki4jb80tth36dmj6opum9kbbrblkig34hk8qkquodng9kc59pauesmbjfj2c
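For reference, the unencrypted layout described above can be reproduced in a few lines. This is a sketch with a hypothetical `remote_path` helper, not code taken from lfsrclone.

```python
def remote_path(oid: str) -> str:
    """Return <first hex byte>/<second hex byte>/<hex name> for a SHA256 oid."""
    return f"{oid[:2]}/{oid[2:4]}/{oid}"


oid = "53a079ad5d55c455b3d617a60117fd7f87ac8c0097454c63c3e5c91fdca9d1af"
path = remote_path(oid)
print(path)
print(len(path))  # 70
```

Only the final path component carries the full hash, which is why leaving the two short directory names unencrypted reveals just the first two bytes.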
This is not exhaustive, but do not set the following:
- --progress since we parse the output
- -v or --log-level since we use it already
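A check for these flags could look like the following sketch. It covers only the flags named above; `RESERVED_FLAGS` and `conflicting_flags` are hypothetical names, and lfsrclone itself may not perform such a check.

```python
RESERVED_FLAGS = ("--progress", "-v", "--log-level")


def conflicting_flags(rclone_flags):
    """Return the user-supplied flags that clash with ones set automatically."""
    found = []
    for flag in rclone_flags:
        name = flag.split("=", 1)[0]  # also catches the --log-level=DEBUG form
        if name in RESERVED_FLAGS:
            found.append(flag)
    return found


user_flags = ["--transfers", "4", "--log-level=DEBUG", "-v"]
print(conflicting_flags(user_flags))  # ['--log-level=DEBUG', '-v']
```

Reporting the offending flags up front is likely friendlier than letting rclone fail or produce output that cannot be parsed.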
- Cannot perform locking. See #4314
All code should run through black
I have always been disappointed that git-lfs required a special server. I don't care that it breaks the decentralized nature (though it does!) but it means that hosting requires more than just a simple ssh server. And portability becomes harder.
I've considered git-fat but it is outdated. I've looked at git-annex but it is (a) super (!!!) complicated, (b) not widely used, and (c) I don't like the symlink approach (even though the smudge-filter approach also has its issues).
I've considered updating git-fat but decided it wasn't worth it. I ended up writing my own tool fully but found the edge-cases and testing to be more than I was willing to do (though it was a good learning experience!). So I left it alone.
But when I found lfs-folderstore, I learned about custom transfer agents. Suddenly, it became possible to let git-lfs handle the edge cases and the user interface and just let me handle the data! Win-win!