Mork Converter
Version 2.2
October 17, 2010
Kevin Goodsell

Preliminary Info
----------------

This project began with a misunderstanding. I wanted to convert my
Thunderbird mailboxes to a different format, and I mistakenly thought
that this would require reading the Mork-formatted .msf files. It turns
out that a .msf (mail summary file) file is only an index. I'm not sure
how I missed this, but that's beside the point.

There are a few existing tools to read (some) Mork files, but they fail
to read .msf files. I started writing my own tool, partly to experiment
with Python parsing using PLY (Python Lex-Yacc). In spite of my
realization that I had little or no real use for this tool, I decided
to continue working on it until it could be released, in the hope that
it might be useful to others.

License
-------

This is free software released under the GNU General Public License,
version 2. Read the COPYING file for details.

The file doc/mork-format.txt is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. See that
file for details.

These licenses do not apply to files in the other/ directory, which are
not my work. You'll have to read those files (or contact the authors)
for license information.

Contact
-------

If you find this useful or discover a problem with it, please let me
know. Email: kevin-opensource@omegacrash.net

Prerequisites
-------------

* Python 2, version 2.4 or higher. Python 3 is not supported.
  - Version 2.2 was developed and tested on Python 2.6, but I think
    2.4 should still work.
* PLY (Python Lex-Yacc). Version 3.3 was used to develop and test this
  version.

Basic Usage
-----------

To read the Mork file history.dat and output an XML file to stdout,
use:

    mork history.dat

To do the same, but output to a file called history.xml, use:

    mork --outname=history.xml history.dat

To output to a set of CSV (Comma-Separated Values) files, one for each
table and meta-table in the Mork database, use:

    mork --csv --outname=csvout history.dat

This will write the output files into a directory called csvout. You
can also write a single file with a header before each table by doing
this:

    mork --csv --outname=csvout --single-file history.dat

This will still use the name csvout, but it will be a file rather than
a directory. Without the --outname option, it will write to stdout.
When writing CSV output to stdout, --single-file is implied.

For additional help, use:

    mork --help

Character Encoding Notes
------------------------

A few options let you specify character encodings. The accepted
encoding names are determined by your Python installation, and you can
find a list of the standard encoding names here:

    http://docs.python.org/library/codecs.html#standard-encodings

The options that deal with character encodings require some
clarification. The most basic one is the -e/--out-encoding option,
which is straightforward: it determines the encoding used for the
output file (UTF-8 by default). Any of the Unicode encodings
(e.g. utf-8, utf-16, utf-32) should typically work. Encodings that only
support a subset of Unicode (such as the ISO-8859 family of encodings)
may or may not work, depending on what characters appear in the input
database.

The -b/--byte-order option lets you specify the byte order (or
'endianness') of UTF-16 fields in the input. It takes an argument which
may be 'big' or 'little' (corresponding to most-significant-byte first
and least-significant-byte first, respectively). You can probably
ignore this option unless you get a warning that suggests using it and
see nonsense in the output file.
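For example, a plausible invocation that writes the output as UTF-16
and tells the converter that UTF-16 fields in the input are
little-endian (assuming the same --option=value syntax as the earlier
examples) would be:

    mork --out-encoding=utf-16 --byte-order=little history.dat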
When interpreting fields in the input, a few different attempts are
made to guess the encoding. First, a list of fields known to be UTF-16
is consulted. If the field is not in that list, an attempt is made to
decode the field as UTF-8. This usually works, because UTF-8 is
compatible with ASCII. If UTF-8 decoding fails, the next step is to try
an ISO 8859 character set. This always succeeds (since any arbitrary
byte sequence can be considered valid ISO 8859), but isn't always
correct. If the result contains certain characters that are only used
as control codes, this is taken to mean the ISO 8859 decoding is wrong.
In that case, a final attempt is made to decode the field using a
user-specifiable character set that defaults to windows-1252.

Both the ISO 8859 character set and the final character set can be
provided on the command line, using the -i/--iso-8859 option and the
-f/--fallback-charset option, respectively. --iso-8859 takes a single
integer argument to select the specific ISO 8859 character set (the
default is '1', the Latin 1 character set). It can also accept an empty
string to indicate that ISO 8859 should not be used.
--fallback-charset can take any character set available in your Python
installation, but some obviously make more sense than others. utf-8
would be a bad choice, since the fallback encoding is only used after
UTF-8 fails. An ISO 8859 character set may make sense, as this would
allow fields with control codes to be considered valid (in such a case
it would also make sense to use --iso-8859 "" in order to skip the
first attempt at using ISO 8859). In general, --fallback-charset should
be something that decodes all possible byte sequences (meaning all
bytes represent valid characters).

When all else fails, there is also a --force-encoding option. This
option can be given multiple times, and requires three arguments. The
first two arguments specify a row namespace and a column name, which
makes the option apply to the fields in the specified column in the
specified rows. The third argument is the character encoding to use
for the given fields. This bypasses all the other field decoding
attempts.
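For instance, to skip the ISO 8859 attempt entirely and decode
otherwise-undecodable fields as ISO 8859-5 (one plausible combination
of the options described above):

    mork --iso-8859 "" --fallback-charset iso-8859-5 history.dat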
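A --force-encoding invocation might look like the following; the row
namespace and column name here are only hypothetical illustrations,
not names taken from a real database:

    mork --force-encoding some:row:namespace subject koi8-r inbox.msf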
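To summarize, the decoding cascade described above behaves roughly like
this Python sketch. This is an illustration only, not the converter's
actual code; the KNOWN_UTF16 list and the specific control-code test
(here, the C1 range 0x80-0x9F) are assumptions:

    # Rough sketch of the field-decoding cascade; names are hypothetical.
    KNOWN_UTF16 = set()  # (row namespace, column) pairs known to be UTF-16

    def decode_field(namespace, column, raw, byte_order='big',
                     iso_8859='iso-8859-1', fallback='windows-1252'):
        # 1. Fields known to hold UTF-16 are decoded as such, honoring
        #    the -b/--byte-order setting.
        if (namespace, column) in KNOWN_UTF16:
            return raw.decode('utf-16-be' if byte_order == 'big'
                              else 'utf-16-le')
        # 2. Try UTF-8, which also covers plain ASCII.
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            pass
        # 3. Try the -i/--iso-8859 charset (skipped if --iso-8859 ""
        #    was given). Decoding always succeeds, so reject the result
        #    if it contains characters normally used only as control
        #    codes.
        if iso_8859:
            text = raw.decode(iso_8859)
            if not any(u'\x80' <= ch <= u'\x9f' for ch in text):
                return text
        # 4. Last resort: the -f/--fallback-charset encoding.
        return raw.decode(fallback)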
About Mork Files
----------------

Mork is a general-purpose text-based database format created for use in
the Mozilla project. I know of five specific ways it has been used in
Mozilla and related projects (Firefox, Thunderbird):

* Browsing history (history.dat)
* Form auto-complete history (formhistory.dat)
* Address book (abook.mab, history.mab)
* Mailbox index (*.msf)
* Mail folder cache (panacea.dat)

Unfortunately this format is rather convoluted and very difficult to
read, making the data contained in the database practically
inaccessible to non-Mozilla applications. Recently, SQLite databases
have replaced Mork databases for some uses, largely alleviating this
problem.

Mork represents data in tables. Tables are uniquely identified by a
namespace and ID. Tables include a set of rows and metadata in the form
of a 'meta-table'. Rows are likewise identified by a namespace and ID,
and may also include metadata in the form of a 'meta-row'; however, I
have not encountered any Mork files that make use of meta-rows. If this
tool encounters a meta-row, it will report a warning and otherwise
ignore it. (Update: this description of meta-rows may be wrong.
Meta-rows may actually occur in the form of rows within meta-tables,
which are common in Mork files. More investigation of this, and
code-fixing if necessary, is pending.)

Row data is a set of key-value pairs where the keys are referred to as
columns.

The output from this tool depends on the options selected, but will
typically contain a set of tables corresponding to the tables in the
Mork file, each identified by the table namespace and ID. Each table's
meta-table is also included, and may contain information useful for
interpreting the table contents. Meta-tables seem to always contain a
'k' field that identifies the 'kind' of the table, and an 's' field for
the table 'status', which is often simply '9'. A digit in the status
gives the priority of the table. A 'u' or a 'v' indicates that the
table is unique or verbose, respectively.

Rows in a table are not uniform: columns that appear in one row may not
appear in others. This makes the output a little odd, especially in
formats like CSV that require uniform rows. The result is that all
columns appear in the table header, and each row has empty fields for
the columns that are not actually present in the row. Each row is also
identified by a namespace and ID, which typically show up in the output
(as row attributes in XML and as extra columns in CSV). The namespace
may tell you something about the type of data in the row, but other
than that it can largely be ignored.

Filters
-------

This converter includes a basic framework for "filters", which are used
for writing output and for tweaking the database prior to output.
Available filters are shown in the --filters output. Information on how
to write filters can be found in the doc/ subdirectory.

Unlike in version 1.x, filters no longer have to be given explicitly on
the command line. Instead, certain command line options activate
certain filters, and some are enabled by default.

Known Issues
------------

* The Mork documentation I've been able to find is not quite adequate,
  so there's a significant amount of guess-work involved in the
  translation.

* Many errors are currently reported with Python tracebacks. This is
  not very user-friendly.

* There's no reliable way to determine character encodings for all
  fields. Because of this, some fields will not get decoded correctly,
  resulting in strange characters in the result. This problem primarily
  affects email header fields in .msf files. The encoding options are
  intended to help resolve this.

* The DecodeCharacters filter in encodings.py requires special
  knowledge of which fields in the database use UTF-16 encoding. There
  may be missing fields that need to be added to the filter in order
  for it to correctly decode fields of that type.

* A similar warning applies to various other filters: some of them
  (FieldConverter, DecodeMimeHeaders) require a list of the fields that
  the filter conversion will be applied to. This list is found in the
  filter source and may need to be updated with missing fields.