Mongo project
A small standalone project meant to try our hand at MongoDB.
Setting up the environment
$> git clone git@github.com:Mathias-Boulay/mongo-project.git
$> cd mongo-project
$> docker compose up --build
The whole environment will get set up with all the default values and some default data.
Once the whole environment is ready, head over to the link to localhost, which opens an already filled form answer.
Database format
Collection forms:
export type FieldType = 'TEXT_SHORT' | 'TEXT_LONG' | 'CHOICE_SINGLE' | 'CHOICE_MANY' | 'INTEGER';
@Schema()
export class Field {
@Prop({ required: true })
fieldID: string;
@Prop({ required: true })
question: string;
@Prop({ required: true })
type: FieldType;
@Prop({ max: 10 })
choices?: Array<string>; // All choices displayed to the user
}
@Schema()
export class Form {
@Prop([Field])
fields: Array<Field>;
}
Collection filledforms:
@Schema()
export class FilledFieldSchema {
@Prop({ required: true })
fieldID: string;
// Actual validation is handled beforehand
@Prop({ required: true, type: mongoose.Schema.Types.Mixed })
data: string | number | Array<string>;
}
/** The actual form */
@Schema()
export class FilledForm {
@Prop({ required: true, type: mongoose.Schema.Types.ObjectId, index: true })
formID: mongoose.Schema.Types.ObjectId;
@Prop([FilledFieldSchema])
fields: Array<FilledFieldSchema>;
}
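Since data is stored as Mixed, the check mentioned in the comment ("Actual validation is handled beforehand") has to compare each answer with its Field definition before saving. The following is only a minimal sketch of what such a check could look like: the names FieldDefinition and isValidAnswer, as well as the exact rules, are assumptions and not the project's actual code.

```typescript
// Mirrors the FieldType union and the Field class of the forms collection.
type FieldType = 'TEXT_SHORT' | 'TEXT_LONG' | 'CHOICE_SINGLE' | 'CHOICE_MANY' | 'INTEGER';

interface FieldDefinition {
  fieldID: string;
  question: string;
  type: FieldType;
  choices?: Array<string>;
}

// Hypothetical helper: true when `data` has the shape expected for `field`.
function isValidAnswer(field: FieldDefinition, data: unknown): boolean {
  const choices = field.choices ?? [];
  switch (field.type) {
    case 'TEXT_SHORT':
    case 'TEXT_LONG':
      return typeof data === 'string';
    case 'INTEGER':
      return typeof data === 'number' && Number.isInteger(data);
    case 'CHOICE_SINGLE':
      // Assumed rule: the answer must be one of the displayed choices.
      return typeof data === 'string' && choices.includes(data);
    case 'CHOICE_MANY':
      // Assumed rule: every selected item must be one of the displayed choices.
      return (
        Array.isArray(data) &&
        data.every((item) => typeof item === 'string' && choices.includes(item))
      );
    default:
      return false;
  }
}
```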
Why did I choose this schema?
One major constraint of this project was the ability to see a form pre-filled with the content of a specific answer.
This effectively forced my hand into storing each form answer as an individual document, instead of creating a "pre-aggregated", metrics-style view of the data (e.g. storing the count of each choice instead of each choice). On the upside, it allows more flexibility in how the data can be queried.
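Storing individual answers does not prevent metrics from being computed later, since counts can be derived on demand with the aggregation pipeline. The sketch below builds such a pipeline as plain objects, assuming the FilledForm layout shown above; the function name buildChoiceCountPipeline is hypothetical.

```typescript
// A pipeline stage is a plain object, so no driver import is needed to build one.
type PipelineStage = Record<string, unknown>;

// Hypothetical helper: counts how often each choice was picked for one field of one form.
// `formID` must be the value stored in FilledForm.formID (an ObjectId in this project).
function buildChoiceCountPipeline(formID: unknown, fieldID: string): PipelineStage[] {
  return [
    { $match: { formID } },                    // can use the index on formID
    { $unwind: '$fields' },                    // one document per answered field
    { $match: { 'fields.fieldID': fieldID } }, // keep only the requested field
    { $unwind: '$fields.data' },               // CHOICE_MANY arrays: one document per choice
    { $group: { _id: '$fields.data', count: { $sum: 1 } } },
  ];
}
```

The result could be passed to a Mongoose model's aggregate() method. $unwind treats a non-array value, such as a CHOICE_SINGLE string, as a single-element array, so both choice types go through the same stages. Mongoose does not cast pipeline stages, so formID has to be an actual ObjectId rather than its string form.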
As for the forms collection, it is designed to hold as many fields as necessary under a single top-level field; it is bare-bones.
These schemas have some drawbacks. First, the versioning of forms and their answers is not handled. My best shot would be to nest the current format under a "v1", "v2"... key at the top level of the document.
Second, some fields do not use the proper type. They are plain strings acting as ids, whereas an ObjectId would be better for performance.
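The versioning idea mentioned above (nesting the format under "v1", "v2"...) could look roughly like the sketch below. Everything here is hypothetical: FormV2 and its title property are an imagined later revision used purely for illustration, and latestVersion is not part of the project.

```typescript
// `v1` is the current format; `v2` is an imagined later revision (illustration only).
interface FormV1 {
  fields: Array<{ fieldID: string; question: string; type: string; choices?: string[] }>;
}
interface FormV2 extends FormV1 {
  title: string; // hypothetical addition
}

// Each version of the format lives under its own top-level key.
interface VersionedForm {
  v1?: FormV1;
  v2?: FormV2;
}

// Hypothetical reader: picks the most recent version present in the document.
function latestVersion(
  doc: VersionedForm,
): { version: 'v1' | 'v2'; form: FormV1 | FormV2 } | null {
  if (doc.v2) return { version: 'v2', form: doc.v2 };
  if (doc.v1) return { version: 'v1', form: doc.v1 };
  return null;
}
```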
Indexing
Aside from the index on _id, an index was added on formID to quickly gather all answers to a given form. This makes writes slightly more expensive, but is worth it since the collection can get quite large.
Another index on fieldID was considered; however, the scope of said index would only be a few sub-documents at best, making the gain not worth the cost.
Sharding
While absent from the current state of the project, sharding got some attention.
The only proper way to shard both the forms and filledforms collections is hash-based sharding, to distribute the load.
Time-based sharding would put a really uneven load on the database. A lot of people answer a form at first, then activity dies down over time. This would incur a huge cost when gathering metrics.
Manual sharding isn't an option either, as the exact content of each form is up to the user and far from being constrained (e.g. time ranges).
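Hash-based sharding is requested through the shardCollection admin command with a "hashed" key. The sketch below only builds the command documents. The database name mongo_project and the choice of _id as shard key are assumptions made for illustration, as this README does not name either.

```typescript
// Shape of the `shardCollection` admin command with a hashed shard key.
interface ShardCollectionCommand {
  shardCollection: string; // "<database>.<collection>"
  key: Record<string, 'hashed'>;
}

// Hypothetical helper: builds the command document for one collection.
function hashedShardCommand(
  database: string,
  collection: string,
  keyField: string,
): ShardCollectionCommand {
  return { shardCollection: `${database}.${collection}`, key: { [keyField]: 'hashed' } };
}

// Assumed database name and shard keys, for illustration only.
const shardCommands: ShardCollectionCommand[] = [
  hashedShardCommand('mongo_project', 'forms', '_id'),
  hashedShardCommand('mongo_project', 'filledforms', '_id'),
];
```

Each command would be run against the admin database through a mongos router, for example with db.adminCommand(...) in mongosh. Hashing formID instead of _id would keep all answers to a form on one shard, at the price of a less even spread.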
Post-Mortem
Oh my god my code is unfinished and unoptimized
I guess my skills with MongoDB have slightly improved. It took me a while to (re-)learn the aggregation pipeline with some new operators. It is quite a powerful tool for transforming data.
While I understood how replica sets work, I was confused by how sharding combines with the replica system. More precisely, where the physical config servers are supposed to be located relative to the rest of the cluster.