LMFDB’s inventory is a very useful tool for identifying what exists in the database and what it means, but its power is limited by the fact that it relies entirely on user input. This could be helped by using software to automatically generate at least some of the inventory. This repository is the current output of a beta version of such an automatic system. To display the records it uses a markup language called Asciidoc (http://www.methods.co.nz/asciidoc/), which is supported by GitHub and is more powerful than GitHub's default Markdown.
The key advantage is that the data has been separated from its representation, so the data can be reused in other ways.
The automatic code is in an early beta state. It is currently split into three parts:
- LMFDB structure parsing
- Structure conversion to Asciidoc. This puts in placeholder tags for information that can be patched later.
- Patching the Asciidoc files to replace the placeholder tags
The first part of the tool uses MongoDB map-reduce to identify every type of record that exists within the database. The result is then written to a JSON file. This file should not be edited by hand. It is formatted as follows, where *italic* keys are placeholders taken from the database and **bold** keys are literal strings (an illustrative example follows the list):
- *Database* - Name of the database
  - *Collection* - Name of the collection
    - **fields**
      - One key for each field that exists in at least one record
        - **example** - String containing an example record. This is obtained by searching for a single non-empty record containing the field.
        - **type** - String containing the inferred type of the record. Uses the example record for type inference.
    - **records**
      - *number* - A number running from 0 that uniquely identifies each record type. Used because record types have no intrinsic names.
        - **schema** - A list containing the fields that are present in the record type
        - **count** - The number of records of this type in the collection
    - **indices**
      - *number* - A number running from 0 that uniquely identifies each index. Used by analogy with records.
        - **name** - The name of the index
        - **keys** - A list of pairs describing the key fields used by this index. See the documentation for PyMongo index_information.
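For illustration, a structure file for a single collection might look something like the following. The exact nesting is inferred from the list above, and the database name, collection name, field names, and values here are invented rather than taken from LMFDB:

```json
{
  "elliptic_curves": {
    "curves": {
      "fields": {
        "label": { "example": "11a1", "type": "string" },
        "conductor": { "example": "11", "type": "integer" }
      },
      "records": {
        "0": {
          "schema": ["_id", "label", "conductor"],
          "count": 12345
        }
      },
      "indices": {
        "0": {
          "name": "label_1",
          "keys": [["label", 1]]
        }
      }
    }
  }
}
```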
This file is then used to generate the database reports. This phase takes 1-2 hours to parse the entire database.
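A minimal sketch of the record-type identification in this first phase is shown below. It assumes PyMongo 3.x (Collection.map_reduce was removed in PyMongo 4), a locally running MongoDB instance, and illustrative database, collection, and output names:

```python
from bson.code import Code
from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance on localhost
coll = client["elliptic_curves"]["curves"]  # illustrative names

# Map each document to the sorted, comma-joined list of its field names,
# so that documents with the same set of fields emit the same key.
mapper = Code("function () { emit(Object.keys(this).sort().join(','), 1); }")

# Count how many documents share each field set.
reducer = Code("function (key, values) { return Array.sum(values); }")

out = coll.map_reduce(mapper, reducer, "record_types")

# Each distinct key is one record type; number them from 0, since
# record types have no intrinsic names.
for number, doc in enumerate(out.find()):
    schema = doc["_id"].split(",")
    count = int(doc["value"])
    print(number, schema, count)
```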
The next part of the code uses the database description JSON to build a mostly empty Asciidoc document containing special tags that are filled in later.
The generated files all have much the same structure (see https://github.com/csbrady-warwick/DocTesting/blob/master/MaassWaveForms.FS.adoc for a good example). Fields that need later completion are filled with special tags, which all have the same structure:
`@@Database\Collection\ID\Key@@`
The @@ symbols are an escape sequence that is not used by Asciidoc. The remaining components are then used to uniquely identify a part of a set of JSON documents (currently one per collection). If a tag isn't found in the patching system then it is left in place, to remind developers that the value still needs to be specified. This part takes a few seconds to run.
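A sketch of how the tags might be located and decomposed, assuming the four backslash-separated components shown above (the regular expression and helper name are illustrative, not taken from the repository):

```python
import re

# Tags look like @@Database\Collection\ID\Key@@
TAG_RE = re.compile(r"@@([^@]+)@@")

def find_tags(adoc_text):
    """Yield the (database, collection, record_id, key) parts of each tag."""
    for match in TAG_RE.finditer(adoc_text):
        yield tuple(match.group(1).split("\\"))
```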
The third stage replaces the tags with their final values. Tags are first patched with user-specified values from external JSON files. These external values allow you to specify:
- Descriptions for database fields
- Types for database fields
- Example values for database fields
- Creator, code, and status information for each collection
- Names for each type of record in the collection (unless there is only one type)
- Descriptions for each type of record in the collection (unless there is only one type)
- A notes section at the bottom of each page
This data is supplied as a JSON file.
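The exact layout of this file is not fixed by the description above, but a user patch file could plausibly look something like this (all key names and values here are illustrative assumptions):

```json
{
  "fields": {
    "conductor": {
      "description": "The conductor of the curve",
      "type": "integer",
      "example": "11"
    }
  },
  "INFO": {
    "creator": "A. N. Other",
    "code": "https://github.com/LMFDB/lmfdb",
    "status": "beta"
  },
  "records": {
    "0": {
      "name": "curve data",
      "description": "One record per curve"
    }
  },
  "notes": "Anything here appears at the bottom of the page."
}
```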
Any values left unpatched by the user-supplied files are patched with the automatic types and example records from the database structure parsing phase. This part is almost instant.
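Putting the pieces together, the patching order (user values first, then automatic values, otherwise leave the tag in place) might look roughly like the following sketch. It reuses the TAG_RE pattern from earlier, and the nested-dict lookup structure is an assumption rather than the repository's actual data layout:

```python
def patch_document(adoc_text, user_values, auto_values):
    """Replace each @@...@@ tag, preferring user values over automatic ones.

    user_values and auto_values are assumed to be nested dicts keyed by
    the same four components that appear in the tags.
    """
    def lookup(table, parts):
        for part in parts:
            if not isinstance(table, dict) or part not in table:
                return None
            table = table[part]
        return table

    def replace(match):
        parts = match.group(1).split("\\")
        value = lookup(user_values, parts)
        if value is None:
            value = lookup(auto_values, parts)
        # Leave the tag in place as a reminder if no value was found.
        return str(value) if value is not None else match.group(0)

    return TAG_RE.sub(replace, adoc_text)
```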
This scheme works well as it stands, but having the data in a machine-readable format opens new possibilities. All of this data could be transferred to a collection in the database itself and then used online, for purposes such as designing search queries and selecting the data to be returned from a search. The entire inventory could be made available through the LMFDB website. What is the general community feeling on how to proceed?