New metadata file format (textual)
przemoc opened this issue · 17 comments
It's desirable to introduce a new metadata file format that is human-friendly and merge-friendly (when used in a VCS like git), so making it textual is an obvious choice. Such a format should be compact (no XML!), but not too compact. Below you can see the current version of my draft amendment.
Data types
----------
SSTRING - `;`-terminated string with special characters (`\n`, etc.)
and semicolon escaped
" v001t\n\n" file format
------------------------
HEADER
N * PENTRY
PENTRY format
-------------
SSTRING - Path
BSTRING(1) - Parameter:
"m" - mode
"o" - owner
"t" - mtime
"x" - xattr
BSTRING(1) - "="
SSTRING - Parameter value
BSTRING(1) - "\n"
Parameter value formats
-----------------------
mode - octal mode
owner - "USER:GROUP"
mtime - UTC date+time in basic ISO8601 format (`%Y%m%dT%H%M%S.%NZ`)
xattr - "KEY1=VALUE1[,KEY2=VALUE2...]"
(keys and values have comma and equals sign characters escaped)
Example:
MeTaSt00r3 v001t
metastore.c;m=644;
metastore.c;o=przemoc:users;
metastore.c;t=20140302T162230.123456789Z;
metastore.c;x=;
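A minimal sketch of how one line of the draft format above could be read, assuming each entry is exactly two `;`-terminated SSTRINGs (path, then `P=VALUE`) and that a backslash escapes the next character. The function names `split_sstrings` and `parse_entry` are hypothetical and not part of metastore; escape sequences are left undecoded here.

```python
def split_sstrings(line):
    """Split a line into `;`-terminated fields, honouring backslash escapes."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Keep the escape sequence intact; decoding is a separate step.
            current.append(line[i:i + 2])
            i += 2
        elif ch == ";":
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    if current:
        raise ValueError("unterminated SSTRING: %r" % line)
    return fields


def parse_entry(line):
    """Parse one 'PATH;P=VALUE;' line into (path, parameter, value)."""
    fields = split_sstrings(line.rstrip("\n"))
    if len(fields) != 2:
        raise ValueError("expected two SSTRINGs: %r" % line)
    path, rest = fields
    # The draft defines four one-letter parameters: m, o, t, x.
    if len(rest) < 2 or rest[1] != "=" or rest[0] not in "motx":
        raise ValueError("bad parameter: %r" % line)
    return path, rest[0], rest[2:]


for sample in ["metastore.c;m=644;", "metastore.c;o=przemoc:users;"]:
    print(parse_entry(sample))
```

Because every line carries its own path, each line can be parsed without looking at its neighbours.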
Why not put all parameters in one line? Well, it would be more space-efficient, sure, but also more error-prone and less merge-friendly. So I say no for all file parameters in one line.
Why not put file name only once followed by parameters, each one in its own line? Because we lose contextlessness of each line then, and meaningful line without context is a really nice asset that I would like to have in such new format, for all your merge, grep, etc. intents and purposes.
OTOH support for gzipping can be still considered I think. Git has `textconv`, so diff case can be handled well. For (hopefully rare) merge case one can gunzip file, fix it and re-gzip. Or do g(un)zipifying conversion by metastore (it depends on what would be gzipped, whole metastore file or only data after header?). Space savings coming from gzipping could be substantial for repositories with lot of files. Maybe disk space usage would be then even similar to the old format? Still, these merges, grr... If only git supported bidirectional `textconv`... :-)
Backward compatibility dictates that such a new metadata format won't be the default one. There is an arising need for a metastore configuration file and I'll add a new issue for that.
- What about the filename encoding and escaping?
- If xattr contains binary value, it should be displayed in hexadecimal representation
- Orders (filename order, "m o t x" order, xattr order) should be defined, so that metastore generates the same result as long as the metadata is not changed.
- Every xattr should be put on its own line, not together in one line, for convenience when comparing or merging.
- What if ";" is contained in filename?
- There would be much redundant data for long path, such as "lib/plugins/os/microsoft-windows-vista/fvm_plugin.py"
File format proposal (for reference only, your project, you make decision):
# header comment
[metastore.c]
mode=644
owner=przemoc:users
mtime=20140302T162230.123456789Z
xattr="name1":txt"value1"
xattr="name2":hex"00 01 02 03 04 05"
[lib/plugins/os/microsoft-windows-vista/fvm_plugin.py]
mode=644
owner=przemoc:users
mtime=20140302T162230.123456789Z
xattr="name1":txt"value1\\\""
xattr="name2\\\"":hex"00 01 02 03 04 05"
- I don't think HEADER is useful. AFAIK all configuration files and data files use a header comment.
- xattr names and values need escaping.
- This is the common INI file format. There should be libraries around that can do read/write operations.
- This file is UTF-8 encoded.
@fpemud, thanks for your comments. My draft is rough at this stage and surely can be improved. I'll respond to some of your points and update the draft / issue description some time later (not today, though).
Your first comment:
- Non-system-dependent encoding is possibly desirable, but it's not a simple change. If we're going to introduce one encoding, then UTF8 seems reasonable.
- Originally you suggested base64, then switched to hex representation. I already explained, but definitely not deeply enough, how I see treating special characters in `SSTRING` - they should be escaped. How? Special characters other than `;` could be written in hex form (`\xHH`). I wasn't exposed much to xattrs, so I have to check what the real world puts there - is it always binary data, or is textual (or rather text-like) data more common (I silently assumed the latter initially, but I may be wrong here).
- Yes, strict ordering is rather obvious for file creation, but I should mention it explicitly in the description. It's necessary only to avoid superfluous changes in the file coming from lack of such ordering, and for better diff / merge workflows, of course. Mind that my intention is that metastore should be able to read and apply metadata from file even if the order is different, in accordance with Postel's law (robustness principle).
- I see where you are coming from and I thought a bit about it before. We need to know all extended attributes to be able to remove no longer existing ones, but as I don't want to ditch Postel's law, we need all xattrs at once, i.e. in one line. I call it a necessary compromise.
- `SSTRING` has semicolon escaped (so `;` turns into `\;`) - no problem here.
- I call it a necessary compromise for great robustness.
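The escaping rules described above (`;` becomes `\;`, other special characters are written as `\xHH`) could be sketched roughly as follows. Writing a literal backslash as `\\` and limiting "special characters" to the ASCII control range are assumptions made for this sketch, since the thread does not pin them down; the function names are hypothetical.

```python
def escape_sstring(text):
    """Escape a string for use as an SSTRING (terminating `;` not added)."""
    out = []
    for ch in text:
        if ch == ";":
            out.append("\\;")
        elif ch == "\\":
            # Assumption: a literal backslash is doubled.
            out.append("\\\\")
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            # Assumption: "special" means ASCII control characters.
            out.append("\\x%02X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def unescape_sstring(text):
    """Reverse escape_sstring()."""
    out, i = [], 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
        elif text[i + 1:i + 2] == "x":
            # Two hex digits follow the 'x'.
            out.append(chr(int(text[i + 2:i + 4], 16)))
            i += 4
        else:
            # Any other escaped character stands for itself.
            out.append(text[i + 1])
            i += 2
    return "".join(out)


name = "odd;name\nwith\ttab"
print(escape_sstring(name))
```

With these rules an escaped path never contains a bare `;` or a raw newline, so splitting lines and fields stays unambiguous.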
Your second comment:
I thought about INI before, but I don't think it really suits metastore needs. File names can have brackets, so you have to escape both, and quite likely fix handling of that case in such INI library. These libraries also usually "overwrite" repeated key in section (your decomposed xattr), so it's another bother to deal with (maybe there are event-based INI parsers, that would help a bit I guess).
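As an illustration of the repeated-key concern, here is a small sketch using Python's standard `configparser` on a cut-down version of the INI proposal. This is only one INI library; others may behave differently.

```python
import configparser

SAMPLE = """\
[metastore.c]
mode=644
xattr="name1":txt"value1"
xattr="name2":hex"00 01 02 03 04 05"
"""

# strict=False tolerates repeated keys; the default strict=True would
# raise DuplicateOptionError on the second xattr line instead.
parser = configparser.ConfigParser(strict=False, interpolation=None)
parser.read_string(SAMPLE)

xattr_value = parser["metastore.c"]["xattr"]
print(xattr_value)
```

Only the last `xattr` line survives in the parsed section, so per-line xattrs would need either unique key names or a custom parser.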
I have to add that `MeTaSt00r3` at the beginning of file is to preserve metastore file detection by existing tools that depend on this magic value. I don't see any great value in breaking it. I don't support comments in my format proposal, because I don't think such metadata file really needs them, they would be easily lost after saving metadata anyway, and they would require escaping another character in file names.
(for reference only, your project, you make decision)
Strictly speaking, metastore per se is David Härdeman's project. I only maintain an unofficial continuation (fork, if you prefer). I tried contacting David regarding his view of my continuation (whether it could become an officially blessed one), but I didn't get any reply yet.
Extending the .gitmeta file format that is maintained by the setgitperms.perl script that comes standard with git (in contrib) is an obvious starting point. This format has the advantage that it would be a seamless upgrade for current setgitperms users. This format looks like this:
CMake/Utilities.cmake mode=0660 uid=1001 gid=1001
CMakeLists.txt mode=0660 uid=1001 gid=1001
COPYING mode=0660 uid=1001 gid=1001
CTestConfig.cmake mode=0660 uid=1001 gid=1001
Metastore is also useful outside of git, so I'm not sure that taking the setgitperms.perl script's .gitmeta file format is the proper way to go. It also doesn't look space-in-filename-friendly (it's much more common to have a space (` `) in a filename than a semicolon (`;`)). If you're concerned about numerical UID/GID, then having `--numeric-owner` like tar seems fine.
What I missed in my original suggestion is storing numerical ids next to textual ones that could be used as fallback when given user/group doesn't exist, I'll amend the issue description later. I think about putting ids in parentheses.
I'm currently working on git-store-meta and here's the schema I came up with:
# generated by {TAB} git-store-meta {TAB} 1.1.2
<file> {TAB} <type> {TAB} <mtime> {TAB} <atime> {TAB} <mode> {TAB} <user> {TAB} <group> {TAB} <uid> {TAB} <gid>
back\\slash {TAB} f {TAB} 2015-04-20T17:00:57Z {TAB} 2015-04-20T17:03:55Z {TAB} 0664 {TAB} danny {TAB} danny {TAB} 1001 {TAB} 1001
data.txt {TAB} f {TAB} 2015-04-20T17:00:57Z {TAB} 2015-04-20T17:00:57Z {TAB} 0664 {TAB} danny {TAB} danny {TAB} 1001 {TAB} 1001
del\x7Fname {TAB} f {TAB} 2015-04-20T17:00:57Z {TAB} 2015-04-20T17:00:57Z {TAB} 0664 {TAB} danny {TAB} danny {TAB} 1001 {TAB} 1001
subdir {TAB} d {TAB} 2015-04-20T17:00:57Z {TAB} 2015-04-20T17:00:58Z {TAB} 0775 {TAB} danny {TAB} danny {TAB} 1001 {TAB} 1001
subdir/file.txt {TAB} f {TAB} 2015-04-20T17:00:57Z {TAB} 2015-04-20T17:00:57Z {TAB} 0664 {TAB} danny {TAB} danny {TAB} 1001 {TAB} 1001
Columns are variable. The first two columns, `<file>` and `<type>`, always exist, while the existence and order of the other columns depend on command arguments.
File names have backslashes ("\") and control chars (0x00-0x1F, 0x7F) escaped using "\x##" notation, if there are any.
If `<user>` and `<uid>` are both provided, git-store-meta attempts to apply the user name first, and falls back to applying the uid if that fails. `<group>`/`<gid>` works the same way.
Timestamps always store the UTC time, without the fractional part of seconds.
Rows except the first two are stored sorted by UTF-8 encoding. This is primarily for the --update mechanism to work properly. Though it still works without a proper sort if the user hacks in the data.
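A rough sketch of reading such a file, assuming real tab characters as separators, the "generated by" comment on the first line and the column names on the second. Only `\x##` sequences are decoded, following the description above (the sample row with a doubled backslash is not handled). This is not git-store-meta's own code, and the function names are hypothetical.

```python
import re


def unescape_name(name):
    """Decode backslash-x-hex sequences in a stored file name."""
    return re.sub(r"\\x([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), name)


def parse_store(text):
    """Return a list of dicts, one per data row, keyed by column name."""
    lines = text.split("\n")
    # Line 1 is the "# generated by" comment; line 2 names the columns.
    columns = [c.strip("<>") for c in lines[1].split("\t")]
    rows = []
    for line in lines[2:]:
        if not line:
            continue
        row = dict(zip(columns, line.split("\t")))
        row["file"] = unescape_name(row["file"])
        rows.append(row)
    return rows


SAMPLE = (
    "# generated by\tgit-store-meta\t1.1.2\n"
    "<file>\t<type>\t<mtime>\t<mode>\n"
    "del\\x7Fname\tf\t2015-04-20T17:00:57Z\t0664\n"
    "subdir\td\t2015-04-20T17:00:57Z\t0775\n"
)
for row in parse_store(SAMPLE):
    print(row["type"], row["mode"], repr(row["file"]))
```

Since the column line travels with the data, a reader does not need to know which fields were selected when the file was written.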
I think this should be readable, flexible, and hackable enough. I could be wrong, though, and any feedback is welcome.
I currently don't really use metastore since I cannot get it to work on MsysGit and it lacks several features I need. However it's always nice to see metastore, or maybe a "C version git-store-meta"(?) flourish. :)
Great. Thanks for the info.
If I happen to have spare time I will try it.
I started a project too (in java), but I just made it far too complex...I tried to fulfil just any possible use case.
@danny0838 Your schema doesn't seem to be good enough, because it requires some predefined (via command-line, configuration or something else) order of attributes, which makes it a clunky deal. Tab is a really bad separator space-wise. Applying metadata should be possible without any additional options, that's why attributes should be stored as `parameter=value`. I aim for conciseness, that's why I suggested one-letter parameters. At the same time, as I already explained, I think that having one parameter per line is the best for diff/merge cases. (If someone is truly worried about inefficiency here, then a compact mode could be introduced and be switchable in configuration - it would put all parameters in one line next to the filename, but I think such an addition is the least important thing now.)
I think my original textual format proposal is still the best one so far. Nevertheless, configuration (#7) will be needed to land first, and to avoid stupid stuff in configuration, some other stuff has to go in even earlier, like file/dir excluding (#8, #9), as I won't ever allow to have this outrageous `git` option in configuration file.
(BTW Sorry to all of you hoping for a quicker metastore revival. I haven't abandoned metastore, I just wasn't able to squeeze in time to work on it lately. I do hope to finally push things a bit forward in May. I planned v1.1 to be released in April, but it seems it will have to wait till May.)
@przemoc If a stored data file already exists, git-store-meta will parse it and use the same fields definition if it's not given on the command line, and thus the fields definition parameters only have to be provided on the command line once (i.e. at the first --store) in usual usage, which shouldn't be too annoying.
Personally I could want to store mtime only (for mtime-sensitive binary files versioning), or to store mode only or mode and mtime (for some web projects), or maybe other possible cases I haven't met. Therefore the flexibility to select which fields are to be stored is a must-have feature, at least for me.
I'm also considering adding shortcuts for some usual column packs. For example ":all" means "user,group,mode,mtime,atime", ":all2" means "uid,gid,mode,mtime,atime", and ":mm" means "mode,mtime", etc. Though this is still pending.
Just to clarify this point. I have no comment about your other concerns. It's your project, after all. :)
@danny0838 I totally agree about flexibility regarding parameters that should be stored or applied, that's why I put `work-on-parameters` as one of the options in the proposed configuration file (#7), which would be `mox` by default (mode, owner, xattr - these are stored already in the binary format), but could be changed as the user wishes (configuration can be put at system, global and local level). The idea is that applying metadata would apply whatever parameters are provided within the metastore file, but only within the set defined in the above-mentioned option. The metastore file per se is not required to have all parameters defined for all files when applying metadata. So the mtime-only case will definitely be supported.
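The filtering idea can be sketched in a few lines, assuming entries have already been parsed into `(path, parameter, value)` tuples with one-letter parameters. The function name and the tuple layout are hypothetical; only the default value `mox` comes from the comment above.

```python
def select_parameters(entries, work_on="mox"):
    """Keep only entries whose one-letter parameter is enabled.

    entries: iterable of (path, parameter, value) tuples.
    work_on: letters of the enabled parameters (default taken from the
             proposed work-on-parameters option).
    """
    allowed = set(work_on)
    return [entry for entry in entries if entry[1] in allowed]


entries = [
    ("metastore.c", "m", "644"),
    ("metastore.c", "t", "20140302T162230.123456789Z"),
    ("metastore.c", "x", ""),
]
print(select_parameters(entries))        # default set: mode, owner, xattr
print(select_parameters(entries, "t"))   # mtime-only case
```

Entries that are missing from the file are simply never applied, which is what makes a file with only mtime lines workable.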
I'm wondering only whether it would be desirable to have owner, i.e. user:group as defined in my first comment, split into two parameters. As I already mentioned in one of the comments, my original suggestion lacks numerical id fallback and I think it could be provided after a slash (`/`), i.e.
file;o=przemoc/1000:users/100;
OTOH using numerical ids only (like `tar --numeric-owner`) should also be possible, so flexibility may require some additional options, which should be fine as long as default behavior will be decent.
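A small sketch of splitting the suggested `USER/UID:GROUP/GID` owner value, with the numeric parts optional. Escaping of `/` or `:` inside names is ignored here, and `parse_owner` is a hypothetical name.

```python
def parse_owner(value):
    """Split 'USER[/UID]:GROUP[/GID]' into (user, uid, group, gid)."""
    user_part, sep, group_part = value.partition(":")
    if not sep:
        raise ValueError("missing ':' in owner value: %r" % value)

    def split_id(part):
        # The numeric fallback, if present, follows a slash.
        name, _, number = part.partition("/")
        return name, (int(number) if number else None)

    user, uid = split_id(user_part)
    group, gid = split_id(group_part)
    return user, uid, group, gid


print(parse_owner("przemoc/1000:users/100"))
print(parse_owner("przemoc:users"))
```

A caller could try the name first and fall back to the numeric id only when the name does not resolve on the target system.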
I don't like the idea of successfully changing user but failing to change group for instance. Are there any real scenarios where such ok-fail case would be still ok after all?
I don't find any compelling reason to even optionally support atime. Maybe you could provide me some?
@przemoc I'd just let it go if the user change succeeds and the group change fails, since the user is warned about any failure.
As for atime, I personally haven't come up with a real use case, and I'm just providing it since it's easy and git-cache-meta provides it. Though it seems that several programs would look for the last access time to determine whether a file can be safely removed, as this thread tells.
Instead of your own file format, perhaps consider using YAML to directly serialize metastore's data structures that represent the entries. YAML has many desirable properties mentioned earlier in the ticket and advantage of tools, syntax highlighting, etc.
I just did a straight-forward textual implementation: xkrug-bubeck/metastore@e6b514b
Not really much has changed except all is text now.
And there are line endings between the files/folders and semicolons between the values.
someusr@debian:/opt/testdir$ /opt/github/metastore/bin/metastore -s
someusr@debian:/opt/testdir$ cat .metadata
MeTaSt00r3TEXT0001
./dir_with_a_file/a_file:someusr:someusr:1478614284:198484640:33188:0:
./.metadata:someusr:someusr:1478705827:807964:33188:0:
./dir_with_a_file:someusr:someusr:1478612652:256400519:16877:0:
./empty_dir:someusr:someusr:1478612636:976399731:16877:0:
.:someusr:someusr:1478622855:166926443:16877:0:
./belongs_root_with_caps:root:root:1478612674:704401676:33188:1:security.capability:20:1:0:0:2:0:48:128:0:0:48:128:0:0:0:0:0:0:0:0:0:
./belongs_root:root:root:1478612666:152401235:33188:0:
./mnt:someusr:someusr:1478622855:166926443:16877:0:
./belongs_someusr:someusr:someusr:1478612662:552401050:33188:0:
The only downside of this at the moment: It will fail at a file that includes the separator char ":".
I personally can live with that at the moment.
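To make the sample above easier to read, here is a sketch that decodes one line. The field layout (path, owner, group, mtime seconds, mtime nanoseconds, decimal `st_mode`, xattr count) is inferred from the sample output and is an assumption, not taken from the patch itself.

```python
import stat


def parse_line(line):
    """Decode the fixed fields of one colon-separated line."""
    fields = line.rstrip("\n").split(":")
    path, owner, group, sec, nsec, mode, nxattr = fields[:7]
    return {
        "path": path,
        "owner": owner,
        "group": group,
        "mtime": (int(sec), int(nsec)),
        # The mode is stored as a decimal st_mode value.
        "mode": stat.filemode(int(mode)),
        "xattrs": int(nxattr),
    }


print(parse_line("./empty_dir:someusr:someusr:1478612636:976399731:16877:0:"))

# A path containing the separator shifts every later field.
try:
    parse_line("./a:b:someusr:someusr:1478612636:976399731:33188:0:")
except ValueError as exc:
    print("cannot parse:", exc)
```

The second call shows the limitation mentioned above: without escaping, a `:` in the path cannot be told apart from a field separator.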
Edit:
- Merged my dev branch with my master.
-- Should merge without conflicts.
-- Replaced ';' with ':' as separator in regard to first posted patch.
Edit2:
Colon is a terrible separator as it is used by Debian apt. Reverted to using semicolons ';'.
bubeck@f7803c79d0421dd15685a37b1bfb7516ef499a91
Hi, Jürgen! Thanks for the contribution, but your straight-forward textual format is not what I wish for and it's not what I would like to see in metastore, therefore I cannot accept it.
But others may find it useful, so they can use the code from your repository if they find it good enough for their needs. It's (almost) always a good thing to have alternatives.
@xkrug-bubeck You can use my git-metafile instead. ;)
@przemoc Might I suggest the recutils format? It's fairly simple, and by using it we wouldn't need to create yet another textual data format (which is a bonus). Even without the recutils package installed, it can easily be manipulated in an editor (plus emacs and vim have plugins), or with sed/cut and such.
It's flexible enough that existing unix tools can be made to output it. Consider the following:
find testdir -printf 'name: %P\ntype: %y\nsize: %s\ndepth: %d\nmode: %m\ninode: %i\natime: %As\nctime: %Cs\nmtime: %Ts\n\n' > files.rec
This looks ugly, but you can run advanced queries like this:
recsel files.rec -e "name ~ '.*/foo/bar/baz-version-[12].{0,3}$' \
&& mode != 777 \
&& size >= 4096 \
&& mtime > $(date -d 2020-05-20 +%s)"
and get output like this:
name: projects/foo/bar/baz-version-2.1
type: d
size: 4096
depth: 2
mode: 755
inode: 12468250
atime: 1584162005
ctime: 1584162002
mtime: 1584162009
There are a number of other advantages too:
- Recutils also comes with `rec2csv`, allowing for additional flexibility.
- There is a type system. The type of (for example) `mode` could be a regex string. A mode like "999" would be detected and raise an integrity error when checked with the included tool `recfix`.
- There is a constraint system available, so `recfix` could detect integrity violations when (for example) two files have the same basename and depth.
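To illustrate that the format is easy to handle even without the recutils package, here is a minimal reader for records like the ones above. It is a sketch only: it handles plain `field: value` lines, blank-line record separators and `#` comments, and ignores record descriptors and multi-line values, which the real format also supports.

```python
def parse_rec(text):
    """Parse blank-line-separated 'field: value' records into dicts."""
    records, current = [], {}
    for line in text.split("\n"):
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue
        name, _, value = line.partition(":")
        current[name] = value.strip()
    if current:
        records.append(current)
    return records


SAMPLE = """\
name: projects/foo/bar/baz-version-2.1
type: d
size: 4096
mode: 755

name: projects/foo/notes.txt
type: f
size: 120
mode: 644
"""

records = parse_rec(SAMPLE)
big = [r["name"] for r in records if int(r["size"]) >= 4096]
print(big)
```

The list comprehension at the end plays the role of a very simple `recsel` selection expression.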
I like it. But if we go for simplicity and consistency, maybe it would be better to use the same format that gitconfig uses. And maybe there are tools for it available already. Although I understand it's limited and I haven't considered this task thoroughly.