Copy and track files in git
, and a library to traverse the history
I use this to track my todo.txt
files, changes to configuration files, any shell histories which don't support timestamps (see all of my config files here)
This copies the files to a different directory, so it doesn't interfere with the application/configuration
By copying those files to a separate directory, I can always roll back to previous file, or see what the file was like a couple days/months ago.
For shell histories/files which are unique lines of text (e.g., my todo.txt
file) this also lets me estimate timestamps for when new lines were added to the history/text files, using the iter_commit_snapshots
and parse_snapshot_diffs
below, which emits added/removed events for individual lines with estimated times
This was mostly created for HPI, so I don't have to rewrite the code to extract lines for git history over and over
This is a general purpose solution for tracking file history in git
-- so its not extremely opinionated. In some cases it can be seen as a stop-gap solution, to have some file versioning in case you ever want to roll back. It may work particularly well for basic files with a couple dozen lines (e.g. I use it for RSS feeds, todo.txt
, bookmarks, and a couple history files)
Requires python3.8+
To install with pip, run:
pip install git_doc_history
The main script to backup data is the bash script bin/git_doc_history
, which gets installed into your ~/.local/bin/
directory.
If uses a config file (parsed with python-dotenv
-- so you can use bash-like syntax to grab environment variables) like:
SOURCE_DIR=~/.todo # copy from
BACKUP_DIR=~/data/todo_git_history # copy to
# multiple lines means multiple files
COPY_FILES="todo.txt
done.txt"
You can either provide the full path to that config file, or place the file in ~/.config/git_doc_history
For example, after placing it at ~/.config/git_doc_history/todo
-- to copy/commit any changes, run:
$ git_doc_history todo
Generated configuration:
SOURCE_DIR: /home/username/data/todo
BACKUP_DIR: /home/username/data/todo_git_history
COPY_FILES: todo.txt
done.txt
'/home/username/data/todo/todo.txt' -> '/home/username/data/todo_git_history/todo.txt'
'/home/username/data/todo/done.txt' -> '/home/username/data/todo_git_history/done.txt'
'/home/username/data/todo/.gitignore' -> '/home/username/data/todo_git_history/.gitignore'
[master f927490] update
1 file changed, 1 insertion(+)
create mode 100644 .gitignore
That uses python3 -m git_doc_history shell todo
to parse the configuration file, like:
eval "$(python3 -m git_doc_history shell todo)"
The python library comes with a small CLI interface to extract a file from some time ago:
$ python3 -m git_doc_history extract-file-at --at 2020-09-20 -c todo todo.txt -
setup command of completion
The BACKUP_DIR
is of course just a regular git directory -- you can reset --hard
to some point in the past to get rid of recent commits, rebase
/squash
to merge commits or do whatever you please
Most things will be done with git_doc_history.DocHistory
This doesn't assume the filetype is readable text (you may be storing images/binary doc files in the git repository), so the default is to return the data as bytes
-- you can .decode("utf-8")
to convert that to readable text
To traverse the entire history:
from git_doc_history import DocHistory
from git_doc_history.config import parse_config, resolve_config
# parse the config from the env file
doc = DocHistory.from_dict(parse_config(resolve_config("todo")))
# iterate through the history for the todo.txt file
for snapshot in doc.iter_commit_snapshots("todo.txt"):
print(str(snapshot.commit_sha))
print(str(snapshot.dt))
print(snapshot.data.decode("utf-8"))
Iterates through the git history in chronological order, keeping track
of when data was added or removed. By default, this parses the file
given by splitting it into lines. If lines are added/removed, this returns an
event which specifies when in the history, and what was added/removed
Alternatively, can pass a parse_func
, which is a function which
accepts the DocHistorySnapshot
, and returns a list of hashable items
to store as state
For an example of parsing diffs, see examples/todotxt_diff.py
:
Example output looks something like:
added 2022-03-08 12:14:45 (C) 2022-03-08 create shebang script +programming
removed 2022-03-08 13:14:58 (C) 2022-03-08 create shebang script +programming
added 2022-03-08 22:23:39 save formhistory.sqlite in browserexport
removed 2022-03-08 23:23:45 save formhistory.sqlite in browserexport
added 2022-03-09 02:49:58 (C) create a python fzf wrapper because apparently I can't find a good one
added 2022-03-10 16:24:24 (B) 2022-03-10 create plaintext playlist parser module +music
removed 2022-03-11 01:30:49 (B) 2022-03-10 create plaintext playlist parser module +music
added 2022-03-11 10:37:06 (C) 2022-03-11 sync tmux from home directory +programming
added 2022-03-12 03:44:24 install undotree +vim +programming
removed 2022-03-12 04:44:51 (C) 2022-03-11 sync tmux from home directory +programming
removed 2022-03-12 10:51:20 install undotree +vim +programming
In this case, 'removed' would mean I either changed the text on the line, or (more likely) I completed it
git clone 'https://github.com/purarue/git_doc_history'
cd ./git_doc_history
pip install '.[testing]'
pytest
flake8 ./git_doc_history
mypy ./git_doc_history