Watts-Lab/surveyor

DB change from test db to mongo db

Closed this issue · 16 comments

Tutorial for local mongodb install
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
After a lot of trouble: if you are using Windows Subsystem for Linux (WSL), use these instructions instead: https://docs.microsoft.com/en-us/windows/wsl/tutorials/wsl-database

Changed the driver from mongoose to the official MongoDB driver, which doesn't require schemas and makes for a much cleaner interface.

Cool, thanks.

For responses, survey is probably just the URL of the survey they did. WorkerID information is optional within the URL call, so we should not have that as a hard-coded response variable. It's designed so we can be flexible about the added columns that can be passed when a URL is provided, so I think those should either be part of the survey_response, or another list.

A key thing for us is that we want to be able to provide exports of this data in standard formats that are relatively flat and neat, e.g., CSV where one row is one response and one column is one variable, or JSON where one object in a top-level list is one response and each variable has a key-value pair. This is of course totally doable with the structure you suggested; just saying that we would probably unwrap the list in survey_responses when exporting, so that the data is more directly useful for researchers.
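To illustrate, the unwrapping could look something like this sketch. The `StoredResponse` shape, `flattenForExport`, and all field names here are assumptions for illustration, not the actual schema:

```typescript
// Hypothetical stored shape: one document wraps a survey URL, optional
// URL parameters (e.g. WorkerID), and the response variables themselves.
type StoredResponse = {
  survey: string;                         // URL of the survey taken
  survey_response: Record<string, unknown>;
  url_params?: Record<string, string>;    // optional extras like WorkerID
};

// Unwrap nested responses into flat rows: one row per response,
// one key per variable, ready for CSV or flat-JSON export.
function flattenForExport(docs: StoredResponse[]): Record<string, unknown>[] {
  return docs.map((doc) => ({
    survey: doc.survey,
    ...doc.url_params,
    ...doc.survey_response,
  }));
}
```

Each flat row can then be serialized directly as a CSV line or a top-level JSON object.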

I think we're not using survey at the moment, and probably we could survive without ever using it, so I'm not really sure if it's worth having. One thing we could do is store the surveys that we download when a survey call is made, but that may constrain things (e.g., if people use something that dynamically randomizes the survey CSV). So perhaps we can drop that too for now? (Though I'm totally open to ideas about how it could be useful for us.)

I think we might not need user for now. I think that is a remnant from another project and might not be relevant here because we don't really need to track individuals who submit responses.

In researcher we probably want something like an encrypted key (or we can compute a key on the fly from email). I don't know if we need their name.

Ok. Cool. Just a note: in Mongo, all fields are optional unless specified as required.

I'll remove the user and survey schemas and roll them into the responses schema. Given that, I can really simplify the responses schema and unravel it to be more JSON-friendly.

OK, cool. Is there any efficiency in different structures in Mongo?

One more thing, which may not be useful now but could be good to keep in mind. We currently do some validation at the survey level, e.g., number fields force number responses, and this is derived from a column in the CSV. Presumably we will have more input fields in the future, and more sophisticated and custom validation approaches. It may at some point make sense to move this from an input-time check to a validation step at a later stage, but it is probably most important for warning people when they input things in the wrong format.
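As a rough illustration of what such a later validation step could look like, assuming the CSV's type column maps each field name to a type (the `FieldType` values and field names here are hypothetical, not the real survey format):

```typescript
// Hypothetical field types derived from the CSV's type column.
type FieldType = "number" | "text";

// Check a response against the per-field types; returns human-readable
// errors suitable for warning the respondent about malformed input.
function validateResponse(
  types: Record<string, FieldType>,
  response: Record<string, unknown>
): string[] {
  const errors: string[] = [];
  for (const [field, type] of Object.entries(types)) {
    const value = response[field];
    if (type === "number" && typeof value !== "number") {
      errors.push(`${field}: expected a number`);
    }
    if (type === "text" && typeof value !== "string") {
      errors.push(`${field}: expected text`);
    }
  }
  return errors;
}
```

The same check could run either at input time (for warnings) or as a server-side validation stage before storage.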

On the efficiency part, I'm still looking into it, but I'll have more thoughts once I have running code. Just a glimpse into the design decisions for efficiency that I'll make can be found here. https://docs.mongodb.com/manual/core/data-model-design/

I'll keep the validation in mind. For now, I'll program Mongo to validate the types of everything except survey_responses, and we can leave open the decision of where to validate the responses.
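For reference, that kind of type check can be expressed in Mongo itself with a `$jsonSchema` validator. A sketch of what it might look like in mongosh, leaving survey_responses unconstrained (the collection and field names are assumptions, not the actual schema):

```javascript
// Hypothetical validator: type-check top-level fields but leave
// survey_responses free-form. Run in mongosh against the target database.
db.createCollection("responses", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["survey"],
      properties: {
        survey: { bsonType: "string" },   // URL of the survey taken
        createdAt: { bsonType: "date" },
        survey_responses: {},             // empty schema: intentionally unconstrained
      },
    },
  },
});
```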

There is slightly more testing I have to do before submitting a pull request, but this is what the server API looks like:
https://github.com/Watts-Lab/surveyor/tree/mongo. Most of the logic is in prod_db.ts.

I think for the most part it is clean. I got rid of the schemas by using the official MongoDB driver for Node.js (super lightweight, only has CRUD operations). I previously used mongoose (a popular driver that adds a lot of functionality), but realized this was not the ideal use case, as a lot of the data is unstructured. There might be a small tradeoff in performance for a large boost in simplicity.

Neat, looks good.

A few notes:

  1. We are currently only using some simple functions, but things like update (to modify an existing entry) might also be useful; in general, the main CRUD actions are probably what we need.
  2. Is there a good way to use Mongo in a test environment, so we could remove nedb completely? Could the difference between prod and test just be handled in the db wrapper? Ultimately I think it would be nice if we could launch this with just 2 npm commands (e.g., test and start, as we do now). I've generally found that setting up Mongo requires a bit more configuration, but perhaps there's a good approach for dev that makes this a non-issue?
  3. Is there a good way to make the API the same for prod and dev? I.e., a single wrapper that holds both db approaches and chooses where to write based on the env, as opposed to having a few lines of code for every db call in the main code of other functions, e.g., server.ts?
  1. Yup, I can add update to complete CRUD.
  2. I think we can replace nedb. The main barrier is installation. After installation, starting Mongo is a single command, so it can easily be added to package.json. The configuration for the test environment, which I am running, is fairly minimal as of the most recent Mongo release (although I know it has been bad in the past).

Therefore, I think if we can overcome the installation barrier, it is super easy to use. There are two ways we can go about this.

  • Approach 1: Write an installation guide (just the appropriate guide links, plus some troubleshooting tips)
  • Approach 2: Dockerize our test environment to standardize builds across platforms. I need to talk to @yli12313 about it.

For this week, I favor approach 1; next week I want to explore approach 2 as a long-term solution.
3. Yes, I can build a similar wrapper for nedb and create an interface that both wrappers implement, so the db API is chosen by a single if statement at startup.
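To sketch what that interface-plus-wrappers shape could look like (all names here are illustrative, not the real prod_db.ts API; the real wrappers would return Promises since the mongo driver is async, but this in-memory stand-in is kept synchronous for brevity):

```typescript
// A single interface both wrappers implement; server.ts only sees this.
interface Database {
  create(collection: string, doc: object): string;          // returns new id
  read(collection: string, id: string): object | undefined;
  update(collection: string, id: string, patch: object): void;
  remove(collection: string, id: string): void;
}

// In-memory stand-in playing the role of the nedb/test backend.
class InMemoryDb implements Database {
  private store = new Map<string, Map<string, object>>();
  private nextId = 0;

  private coll(name: string): Map<string, object> {
    let c = this.store.get(name);
    if (!c) {
      c = new Map();
      this.store.set(name, c);
    }
    return c;
  }

  create(collection: string, doc: object): string {
    const id = String(this.nextId++);
    this.coll(collection).set(id, { ...doc });
    return id;
  }
  read(collection: string, id: string): object | undefined {
    return this.coll(collection).get(id);
  }
  update(collection: string, id: string, patch: object): void {
    const current = this.coll(collection).get(id);
    if (current) this.coll(collection).set(id, { ...current, ...patch });
  }
  remove(collection: string, id: string): void {
    this.coll(collection).delete(id);
  }
}

// One if statement at startup picks the backend, e.g. (MongoDb being the
// hypothetical production wrapper, not shown here):
// const db: Database =
//   process.env.NODE_ENV === "test" ? new InMemoryDb() : new MongoDb();
```

The rest of the server code then only depends on `Database`, so swapping backends never touches server.ts.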

Unrelated note
For the production environment for Mongo, I would recommend using Mongo's cloud serverless offering hosted on AWS. This makes it so we don't need to do any management and just provide a URI to our code. It is fairly cheap: roughly 30 cents per day per million reads and $1.25 per million writes, which comes to approximately $10–$20/month.
https://www.mongodb.com/pricing

@sumants-dev nice, I like what you have. I can certainly add value in any way that you would like. Thank you, Y

Great, thanks. Approach 1 does sound simpler overall, and I think with reasonable instructions it will be fine. But do let us know if you realize anything interesting about approach 2.

We have integrated billing at AWS, so just wondering: with that serverless option, are we really paying Mongo to set up and run the system? (This might add an extra layer to our billing.)

Alternatively, is there an integrated service from AWS that we could use, e.g., dynamo, that might otherwise be similar?

Cool. That seems like a good alternative for us. I think the costs will be reasonably low (most of these things are pay per use at AWS), and being under one billing umbrella is very useful.

It might be nice to estimate our costs for the individual-mapping workload, e.g., ~30,000 participants, ~10 surveys each, for writes, and probably around 1 read per day?
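As a back-of-the-envelope check using the prices quoted above ($1.25 per million writes, ~$0.30 per million reads), and treating the workload numbers as rough assumptions:

```typescript
// Prices from the serverless tier quoted earlier in the thread (USD).
const WRITE_PRICE_PER_MILLION = 1.25;
const READ_PRICE_PER_MILLION = 0.3;

function estimateCostUSD(writes: number, reads: number): number {
  return (
    (writes / 1e6) * WRITE_PRICE_PER_MILLION +
    (reads / 1e6) * READ_PRICE_PER_MILLION
  );
}

// ~30,000 participants x ~10 surveys = ~300,000 writes (one per response),
// plus ~1 bulk read per day over a month.
const total = estimateCostUSD(30_000 * 10, 30);
console.log(total.toFixed(2)); // roughly $0.38 for the whole workload
```

Under these assumptions the write cost dominates and the per-operation cost of the whole workload is well under a dollar, so any ongoing monthly cost would come from charges other than raw reads and writes.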

@yli12313 can you set @sumants-dev up with the appropriate permissions to configure this for us?

We can test it out and see how things go?

Ok. @sumants-dev: I'll follow up with you separately about this.

Need to add a GUI connection to DocumentDB.

Perhaps make a new issue for that? Why do we need it? So we can manipulate the data through a GUI editor?