Mork Converter
Version 2.2
October 17, 2010
Kevin Goodsell

Preliminary Info
----------------

This project began with a misunderstanding. I wanted to convert my
Thunderbird mailboxes to a different format, and I mistakenly thought
that this would require reading the Mork-formatted .msf files. It turns
out that a .msf (mail summary file) file is only an index. I'm not sure
how I missed this, but that's beside the point.

There are a few existing tools to read (some) Mork files, but they fail
to read .msf files. I started writing my own tool, partly to experiment
with Python parsing using PLY (Python Lex-Yacc). In spite of my
realization that I had little or no real use for this tool, I decided
to continue working on it until it could be released, in the hope that
it might be useful to others.

License
-------

This is free software released under the GNU General Public License,
version 2. Read the COPYING file for details.

The file doc/mork-format.txt is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. See that
file for details.

These licenses do not apply to files in the other/ directory, which are
not my work. You'll have to read those files (or contact the authors)
for license information.

Contact
-------

If you find this useful or discover a problem with it, please let me
know. Email: kevin-opensource@omegacrash.net

Prerequisites
-------------

* Python 2, version 2.4 or higher. Python 3 is not supported.
  - Version 2.2 was developed and tested on Python 2.6, but I think
    2.4 should still work.
* PLY (Python Lex-Yacc). Version 3.3 was used to develop and test this
  version.

Basic Usage
-----------

To read the Mork file history.dat and output an XML file to stdout,
use:

    mork history.dat

To do the same, but output to a file called history.xml, use:

    mork --outname=history.xml history.dat

To output to a set of CSV (Comma-Separated Values) files, one for each
table and meta-table in the Mork database, use:

    mork --csv --outname=csvout history.dat

This will write the output files into a directory called csvout. You
can also write a single file with a header before each table by doing
this:

    mork --csv --outname=csvout --single-file history.dat

This will still use the name csvout, but it will be a file rather than
a directory. Without the --outname option, it will write to stdout.
When writing CSV output to stdout, --single-file is implied.

For additional help, use:

    mork --help

Character Encoding Notes
------------------------

A few options let you specify character encodings. The accepted
encoding names are determined by your Python installation, and you can
find a list of the standard encoding names here:

    http://docs.python.org/library/codecs.html#standard-encodings

The options that deal with character encodings require some
clarification. The most basic one is the -e/--out-encoding option,
which is straightforward: it determines the encoding used for the
output file (UTF-8 by default). Any of the Unicode encodings
(e.g. utf-8, utf-16, utf-32) should typically work. Encodings that only
support a subset of Unicode (such as the ISO-8859 family of encodings)
may or may not work, depending on what characters appear in the input
database.

The -b/--byte-order option lets you specify the byte order (or
'endianness') of UTF-16 fields in the input. It takes an argument which
may be 'big' or 'little' (corresponding to most-significant-byte first
and least-significant-byte first, respectively). You can probably
ignore this option unless you get a warning that suggests using it and
see nonsense in the output file.
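For example, a plausible invocation that writes the output as UTF-16
and tells the converter that UTF-16 fields in the input are
little-endian (assuming the same --option=value syntax as the earlier
examples) would be:

    mork --out-encoding=utf-16 --byte-order=little history.dat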
When interpreting fields in the input, a few different attempts are
made to guess the encoding. First, a list of fields known to be UTF-16
is consulted. If the field is not in that list, an attempt is made to
decode the field as UTF-8. This usually works, because UTF-8 is
compatible with ASCII. If UTF-8 decoding fails, the next step is to try
an ISO 8859 character set. This always succeeds (since any arbitrary
byte sequence can be considered valid ISO 8859), but isn't always
correct. If the result contains certain characters that are only used
as control codes, this is taken to mean the ISO 8859 decoding is wrong.
In that case, a final attempt is made to decode the field using a
user-specifiable character set that defaults to windows-1252.

Both the ISO 8859 character set and the final character set can be
provided on the command line, using the -i/--iso-8859 option and the
-f/--fallback-charset option, respectively. --iso-8859 takes a single
integer argument to select the specific ISO 8859 character set (the
default is '1', the Latin 1 character set). It can also accept an empty
string to indicate that ISO 8859 should not be used.
--fallback-charset can take any character set available in your Python
installation, but some obviously make more sense than others. utf-8
would be a bad choice, since the fallback encoding is only used after
UTF-8 fails. An ISO 8859 character set may make sense, as this would
allow fields with control codes to be considered valid (in such a case
it would also make sense to use --iso-8859 "" in order to skip the
first attempt at using ISO 8859). In general, --fallback-charset should
be something that decodes all possible byte sequences (meaning all
bytes represent valid characters).

When all else fails, there is also a --force-encoding option. This
option can be given multiple times, and requires three arguments. The
first two arguments specify a row namespace and a column name, which
makes the option apply to the fields in the specified column in the
specified rows. The third argument is the character encoding to use
for the given fields. This bypasses all the other field decoding
attempts.
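For instance, to skip the ISO 8859 attempt entirely and decode
otherwise-undecodable fields as ISO 8859-5 (one plausible combination
of the options described above):

    mork --iso-8859 "" --fallback-charset iso-8859-5 history.dat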
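A --force-encoding invocation might look like the following; the row
namespace and column name here are only hypothetical illustrations,
not names taken from a real database:

    mork --force-encoding some:row:namespace subject koi8-r inbox.msf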
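To summarize, the decoding cascade described above behaves roughly like
this Python sketch. This is an illustration only, not the converter's
actual code; the KNOWN_UTF16 list and the specific control-code test
(here, the C1 range 0x80-0x9F) are assumptions:

    # Rough sketch of the field-decoding cascade; names are hypothetical.
    KNOWN_UTF16 = set()  # (row namespace, column) pairs known to be UTF-16

    def decode_field(namespace, column, raw, byte_order='big',
                     iso_8859='iso-8859-1', fallback='windows-1252'):
        # 1. Fields known to hold UTF-16 are decoded as such, honoring
        #    the -b/--byte-order setting.
        if (namespace, column) in KNOWN_UTF16:
            return raw.decode('utf-16-be' if byte_order == 'big'
                              else 'utf-16-le')
        # 2. Try UTF-8, which also covers plain ASCII.
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            pass
        # 3. Try the -i/--iso-8859 charset (skipped if --iso-8859 ""
        #    was given). Decoding always succeeds, so reject the result
        #    if it contains characters normally used only as control
        #    codes.
        if iso_8859:
            text = raw.decode(iso_8859)
            if not any(u'\x80' <= ch <= u'\x9f' for ch in text):
                return text
        # 4. Last resort: the -f/--fallback-charset encoding.
        return raw.decode(fallback)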
About Mork Files
----------------

Mork is a general-purpose text-based database format created for use in
the Mozilla project. I know of five specific ways it has been used in
Mozilla and related projects (Firefox, Thunderbird):

* Browsing history (history.dat)
* Form auto-complete history (formhistory.dat)
* Address book (abook.mab, history.mab)
* Mailbox index (*.msf)
* Mail folder cache (panacea.dat)

Unfortunately this format is rather convoluted and very difficult to
read, making the data contained in the database practically
inaccessible to non-Mozilla applications. Recently, SQLite databases
have replaced Mork databases for some uses, largely alleviating this
problem.

Mork represents data in tables. Tables are uniquely identified by a
namespace and ID. Tables include a set of rows and metadata in the form
of a 'meta-table'. Rows are likewise identified by a namespace and ID,
and may also include metadata in the form of a 'meta-row'; however, I
have not encountered any Mork files that make use of meta-rows. If this
tool encounters a meta-row, it will report a warning and otherwise
ignore it. (Update: this description of meta-rows may be wrong.
Meta-rows may actually occur in the form of rows within meta-tables,
which are common in Mork files. More investigation of this, and
code-fixing if necessary, is pending.)

Row data is a set of key-value pairs where the keys are referred to as
columns.

The output from this tool depends on the options selected, but will
typically contain a set of tables corresponding to the tables in the
Mork file, each identified by the table namespace and ID. Each table's
meta-table is also included, and may contain information useful for
interpreting the table contents. Meta-tables seem to always contain a
'k' field that identifies the 'kind' of the table, and an 's' field for
the table 'status', which is often simply '9'. A digit in the status
gives the priority of the table. A 'u' or a 'v' indicates that the
table is unique or verbose, respectively.

Rows in a table are not uniform: columns that appear in one row may not
appear in others. This makes the output a little odd, especially in
formats like CSV that require uniform rows. The result is that all
columns appear in the table header, and each row has empty fields for
the columns that are not actually present in the row. Each row is also
identified by a namespace and ID, which typically show up in the output
(as row attributes in XML and as extra columns in CSV). The namespace
may tell you something about the type of data in the row, but other
than that it can largely be ignored.

Filters
-------

This converter includes a basic framework for "filters", which are used
for writing output and for tweaking the database prior to output.
Available filters are shown in the --filters output. Information on how
to write filters can be found in the doc/ subdirectory.

Unlike in version 1.x, filters no longer have to be given explicitly on
the command line. Instead, certain command line options activate
certain filters, and some are enabled by default.

Known Issues
------------

* The Mork documentation I've been able to find is not quite adequate,
  so there's a significant amount of guess-work involved in the
  translation.

* Many errors are currently reported with Python tracebacks. This is
  not very user-friendly.

* There's no reliable way to determine character encodings for all
  fields. Because of this, some fields will not get decoded correctly,
  resulting in strange characters in the result. This problem primarily
  affects email header fields in .msf files. The encoding options are
  intended to help resolve this.

* The DecodeCharacters filter in encodings.py requires special
  knowledge of which fields in the database use UTF-16 encoding. There
  may be missing fields that need to be added to the filter in order
  for it to correctly decode fields of that type.

* A similar warning applies to various other filters: some of them
  (FieldConverter, DecodeMimeHeaders) require a list of the fields that
  the filter conversion will be applied to. This list is found in the
  filter source and may need to be updated with missing fields.