gingko/client

consider design for file format

Opened this issue · 11 comments

Would you consider a different design for the file format?

A LevelDB file cluster is enclosed in a 7zip archive. Perhaps this design is compelling for implementation reasons. Yet any implementation must cache an extracted copy of the database in the file system. Also, LevelDB addresses concerns, such as journaling and concurrency, that an application document does not need. Finally, 7zip archives depend on operations that are inefficient and complicated compared to formats built on simpler design principles.

Perhaps a more transparent design would facilitate wider interoperability and longer-term accessibility. Perhaps it would also port better if you chose to expand the product.

I think back to my years of frustration fighting word processors, and to how much better life has become through Markdown and other conventions that are similarly open, simple, and clear. A return to opaque and obtuse file formats makes me nervous, even as I am excited by user interfaces that promise superior experiences. Just as much, our growing dependence on digital information demands formats that can travel easily and survive long.

You could consider a collection of Markdown files, one per card, indexed by a JSON file, all wrapped in a ZIP archive along with other JSON metadata. Third-party tools, as well as mobile applications you might later develop, could easily adopt support for such a format now and in the future, since Markdown, JSON, and ZIP are supported by small and fast libraries in all environments, languages, and platforms, and will be for a very long time. Random-access reading and in-memory manipulation would also make implementations easy and efficient.
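To make this concrete, here is a minimal sketch in Python of writing such an archive. The file names (index.json, meta.json, cards/<id>.md) and the index shape are hypothetical, chosen purely for illustration, not a proposed specification:

```python
# A sketch only: file names and index shape are hypothetical.
import json
import zipfile

cards = {
    "1": "# Root card\n\nSome **Markdown** content.",
    "2": "A child card.",
}
index = {"root": {"card": "cards/1.md",
                  "children": [{"card": "cards/2.md", "children": []}]}}

with zipfile.ZipFile("document.gko", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("index.json", json.dumps(index, indent=2))
    zf.writestr("meta.json", json.dumps({"formatVersion": 1}))
    for card_id, body in cards.items():
        zf.writestr(f"cards/{card_id}.md", body)
```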

For me, the ability to open a file and see its pieces using stock utilities gives me the assurance I crave while moving beyond plain Markdown documents.

The current design is due to implementation decisions. Gingko's documents are actually stored in a git-like format, with commits and tree objects and content objects, all linked together. It's done this way to enable full version history and syncing/merging changes.

That said, I completely agree with the spirit of making file formats simple and future-proof. I've recently spent hours trying to transfer a user's data from a different outliner program (Tree 2 for macOS), because the program is now defunct/deprecated and used a proprietary format that's hard to parse, and the program itself crashes when trying to export to a different format.

Possibly a simple temporary compromise would be to include a plain-text version of the current state of the tree, in JSON or OPML, into the same archive. That way it's saved alongside the more feature-rich database version, but in a format that's much more accessible.

> Possibly a simple temporary compromise would be to include a plain-text version of the current state of the tree, in JSON or OPML, into the same archive. That way it's saved alongside the more feature-rich database version, but in a format that's much more accessible.

Edit: The below may not accurately reflect the intent of your comments. See further post below.

It could be an effective compromise, but I wonder about your implied suggestion that the data that can feasibly be represented in a plain and open format is limited to the current snapshot of the text, excluding the pieces that constitute prior commits. Not knowing precisely the logical format you have chosen, I am forced to generalize, but any data made into a LevelDB record can presumably, in principle, be represented by some combination of entries in a JSON table and external files, which could be bundled into an archive.

Depending on the structure and number of records, certain limitations may need to be considered directly, but I believe they can be resolved with appropriate thought. I would be happy to discuss in more detail if it helps.
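As a rough illustration of that generalization, here is a sketch using the third-party plyvel LevelDB binding. The hex-encoded keys and the 1 KiB inline/external threshold are arbitrary choices for illustration, not a claim about Gingko's actual record layout:

```python
# Sketch: walk a LevelDB instance and re-express it as a JSON table
# plus external files. Key/value handling here is hypothetical.
import json
import pathlib
import plyvel  # third-party LevelDB binding

db = plyvel.DB("document.leveldb")
table = {}
objects = pathlib.Path("objects")
objects.mkdir(exist_ok=True)

for key, value in db:
    k = key.hex()
    if len(value) < 1024:                     # small records stay inline
        table[k] = value.decode("utf-8", errors="replace")
    else:                                     # large records become files
        (objects / k).write_bytes(value)
        table[k] = {"external": f"objects/{k}"}

pathlib.Path("records.json").write_text(json.dumps(table, indent=2))
db.close()
```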

> Possibly a simple temporary compromise would be to include a plain-text version of the current state of the tree, in JSON or OPML, into the same archive. That way it's saved alongside the more feature-rich database version, but in a format that's much more accessible.

In the above remark, does the representation of the current state refer to a human-readable document intended strictly to benefit a user who wishes to read through the current text?

The idea is to save the current state in a human-readable format, alongside the more involved database version, just in case the user needs to recover their work and can't open/access Gingko to do so. That way even if Gingko disappears tomorrow, all it takes to get their work back is a 7zip* unzip, and they'll have a text file with their latest version.

* I know 7zip is not as universal as regular zip, but it's been around since 1999, and anyone who's able to figure out that .gko is a 7z file should also be able to figure out how to open it... Anyway, one thing at a time 🙂

The functionality you describe seems redundant with an export feature.

But I suppose your idea would guarantee that any saved version of a work has some future-proof, disaster-proof representation encoded inside it. Is this your reasoning?

Yes, it's the same as export to JSON (should have said so, much easier way of describing it). And yes, it's a way of including a human-readable, future-proof version inside the file.

An importable, structured file and a flat, readable file are both useful for different recovery scenarios. Both might be included unless file bloat would be an issue. I would suggest that a flat file with the Markdown preserved is more helpful than one in unformatted text.

I am happy that you would be interested in evolving toward a clear and open format built from simple building blocks. The earlier mention of Zip format was not prompted by a concern that 7zip tools would be unavailable on some desktop system used for recovery, but rather by an interest in the portability of the format. Zip is extensively supported by stable and mature libraries with small code size, minimal memory footprint, and high throughput. Compression of text files should be adequate if not optimal. Many file formats, from both large open projects and smaller proprietary applications, use Zip archives as the outer wrapper; prominent examples include Open Document Format and EPUB.

Also, I noticed earlier that you said the LevelDB database was part of the implementation. Regardless of the file format, implementation under LevelDB requires extracting files to, and repacking them from, the file system during open and save operations. These steps would seem to add considerable I/O overhead compared to loading a file into memory and saving from memory.

Would you mind sharing the current thinking underlying the design?

"Would you mind sharing the current thinking underlying the design?"

I can give you the short version now: syncing documents across devices, and being able to access them from a web interface, is the most requested feature of the desktop app. I knew that going into the design, so I chose to use PouchDB to save the data. PouchDB has a client-side JS version for saving to IndexedDB, and a save-to-disk version based on LevelDB for desktop. Both use an identical interface, so I don't need to code two different ways of saving files (one for web, another for desktop). And both use the CouchDB syncing protocol, which makes sync/collaboration very easy to implement.

It's possible that we could create two different saving protocols. One for offline-only documents, chosen for speed, efficiency, and long-term accessibility. The other for sync-able documents, chosen for ease of syncing and for being cross-platform.

As I mentioned in this tag feature comment though, this kind of perfectionism can be a dangerous trap at this stage of the project. Reworking the way files are saved in order to make Gingko files future-proof long-term takes time away from features and bug fixes that people need today. It's not ideal, but it's a necessary tradeoff 😒.

My comments may have been badly worded. I was more expressing curiosity about the particular short-term obstacles than critiquing the overall choices.

Since Couch/PouchDB are built on a JSON-typed document model, I feel the JSON-based design is wise and keeps options open for transport and persistence if the need arises to add further protocols complementing the original ones. Having this flexibility available is probably more important than employing it immediately.

Probably an oversimplification given various particular issues I don't know about, but naively a set of JSON objects can be trivially stored in a file by wrapping them in a list and dumping them. So if a saved file is a subset of some JSON documents in a database, maybe the first attempt at a second persistence protocol is not too hard... or maybe it is.
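For what it's worth, the naive version really is only a few lines. A minimal sketch, with the document shape invented for illustration:

```python
# Naive persistence: the document is just the list of JSON objects
# that would otherwise live as records in the database.
import json

docs = [
    {"_id": "card-1", "content": "Root card"},
    {"_id": "card-2", "content": "A child card", "parent": "card-1"},
]

with open("document.json", "w") as f:
    json.dump(docs, f, indent=2)

with open("document.json") as f:   # reading back is equally trivial
    restored = json.load(f)
```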

But I agree it's not necessary.

@brainchild0 I've been thinking about this lately, and I'm starting to feel that you're right.

The entire file format was designed for future db syncing (once I manage to piece that together). But really, the result is we're left with the worst of both worlds:

  1. A complicated file format, whose only justification is potential future capability.
  2. Due to this complexity, even local saves are not as reliable, inspectable, or portable as they could be.

I mentioned that I could create two formats, but that it would be "perfectionist" to do so. I am starting to see that a lot of the bugs that are actually holding me back from focusing on sync and other big features are due to having one format try to do everything well (and that is what's actually perfectionist).

So, what I'm thinking of doing instead is this:

  1. A local-only file format whose focus is reliability, speed, portability, and security. There would be no autosaving and no version history. There would be a simple "swap" file created alongside the document, to recover data in case of power failure or crash (see the sketch after this list).
  2. A cloud/syncing database whose focus is collaboration, sync, full branching/merging version history (a necessity when dealing with online/offline work and sync), and authentication by username/password. When offline, the data is saved to a local database for later syncing (securely, but in an internal folder). There should be no difference between "using the app online" and "using the app offline, but still logged in".
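For point 1, one common pattern for that kind of swap-file recovery is to write the new state to a temporary file and atomically swap it into place. A minimal sketch, with the .swp suffix being a hypothetical choice:

```python
# Sketch of a crash-safe save: a power failure mid-write can only ever
# lose the swap file, never corrupt the existing document.
import json
import os

def save(path, tree):
    swap = path + ".swp"             # hypothetical swap-file suffix
    with open(swap, "w") as f:
        json.dump(tree, f)
        f.flush()
        os.fsync(f.fileno())         # force bytes to disk before the swap
    os.replace(swap, path)           # atomic rename on POSIX systems

save("document.json", {"root": {"content": "Hello", "children": []}})
```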

At first the only format available would be the local file one. But when both are available, users will be able to choose whether to create a local file document or a cloud one saved to their account.

I've already coded up the "save to plain JSON files manually" version in a new branch, and it's feeling quite solid. The straightforwardness and speed of "dump the tree state to a file as JSON" makes me much more confident in the reliability of the save process.

The downside is that I'd be removing full version history for local files (though, of course, there's still regular undo/redo within each session).

I've published a pre-release with these changes, for beta-testing purposes only. It cannot open files from the current format (yet). I'd appreciate any feedback or bug reports on this:
https://github.com/gingko/client/releases/tag/v2.5.0

Thanks!

@AdrianoFerrari: I am glad that you find the comments helpful in some way.

The flat JSON file has many advantages, not least that it is trivial to convert to a plain document with a few lines of script. This structure is not, however, representative of the idea that prompted my earlier suggestion. Although you might not ultimately wish to adopt the idea, I would like to give you the chance to understand it.

I understood the objective as storing an abstracted representation, analogous to the database representation but in a local file, including fine-grained revision details. As such, my thought was to conceive of an archive layout that captures the same structure of data, but is based on serialization methods more agreeable to document files than to a database instance, and that might include only a subset of the total data in the instance.

Such an objective may be accomplished through an archive with a carefully specified layout. The layout might include a handful of separate sections. One section might contain objects, named according to their cryptographic hash, much like the objects that git stores and transfers. Another section might contain a variety of JSON files that assemble the constituent objects into the various revisions of the document.

Since a gist is worth a thousand words, I prepared one with an exported representation of the Example Tree from the web application. The content originally from the JSON file example-tree.json is represented in the zip archive example-tree.zip, which presents a simplified illustration of the concept. In this case, the file indices/current.json captures a snapshot of the tree. For purposes of illustration, this file mimics the structure of the original file, but card data is separated into the objects directory, where each file is named by the hexadecimal encoding of its SHA-256 hash.

I included a Python script, though it would have little practical use to you, to illustrate that constructing this file is not especially difficult, if perhaps more involved than what you have considered so far.
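Not the actual script from the gist, but a sketch of the same construction, with the card contents invented for illustration:

```python
# Sketch: card bodies live under objects/, named by the hex SHA-256 of
# their content; indices/current.json holds the tree of hash references.
import hashlib
import json
import zipfile

def put(zf, body):
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    zf.writestr(f"objects/{digest}", body)
    return digest

with zipfile.ZipFile("example-tree.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    child = put(zf, "A child card.")
    root = put(zf, "The root card.")
    current = {"id": root, "children": [{"id": child, "children": []}]}
    zf.writestr("indices/current.json", json.dumps(current, indent=2))
```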

Additional data in the indices section of the archive might represent revision history, with each object referenced multiple times without duplication of content. Thus, the effect is a clear file format, free from the database, but capturing the same structured data for managing revisions.

The use of an outer archive file has particular advantages, most prominently the ability to manipulate parts of the content without serializing or deserializing the entirety. Such flexibility is particularly useful when dealing with large resources, such as images, or on mobile platforms.
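That random-access property is easy to demonstrate: a single card can be pulled out of the archive sketched above without extracting or parsing anything else.

```python
# Sketch: read one object from the archive without touching the rest.
import json
import zipfile

with zipfile.ZipFile("example-tree.zip") as zf:
    current = json.loads(zf.read("indices/current.json"))
    root_card = zf.read(f"objects/{current['id']}").decode("utf-8")
```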

Please note that the example is not a literal suggestion for a format. JSON files included in the archive might be direct copies of documents in a database. However, for the sake of usability, it is helpful if a file format is not beholden to the particular features of a database.