powerpak/pathospot-compare

Do we have to use the example db schema names for our own data?

Hi, thank you for this great resource. If we want to populate your software with our own data, do we have to use the exact same table schema and column names as you do in the example dataset? Or does the fact that we also have to specify the IN_QUERY mean that we can use our own schema?

It is certainly easier to use the same table schema and column names, as this requires no new code to be written for the pipeline to work.

However, if you are willing to write a small amount of code, you can write an adapter to use a slightly different schema. An example of this is in lib/pathogendb_adapter_ceirs.rb. It replaces a few of the core database-access methods so that they read from the different tables we used to store data on influenza samples.

To activate an adapter, supply the environment variable PATHOGENDB_ADAPTER to rake, e.g. PATHOGENDB_ADAPTER=CEIRS; the corresponding Ruby module is then included into PathogenDBClient before the pipeline runs. We have not yet documented this because we consider it an advanced feature, but we welcome your input on whether it is useful to end users!
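For readers curious what this pattern looks like in general, here is a minimal, self-contained sketch of an adapter module selected through an environment variable. It is not the actual pathospot-compare code: the class body, the `assembly_table` and `describe` methods, the table names, and the `build_client` helper are all hypothetical stand-ins, and the sketch uses `extend` on a single instance (so the override takes precedence) rather than whatever mechanism the real pipeline uses.

```ruby
# Hypothetical stand-in for the real PathogenDBClient; method and table
# names here are illustrative only and do not come from the repository.
class PathogenDBClient
  # Default table the client reads from (hypothetical name).
  def assembly_table
    :assemblies
  end

  def describe
    "reading from #{assembly_table}"
  end
end

module PathogenDBAdapter
  # Hypothetical adapter: overrides one data-access method so the client
  # reads from a differently named table.
  module CEIRS
    def assembly_table
      :ceirs_assemblies
    end
  end
end

# Builds a client and, if PATHOGENDB_ADAPTER is set, mixes the named
# adapter module into that instance. `env` defaults to the process
# environment but accepts any Hash-like object, which eases testing.
def build_client(env = ENV)
  client = PathogenDBClient.new
  name = env['PATHOGENDB_ADAPTER']
  client.extend(PathogenDBAdapter.const_get(name)) if name && !name.empty?
  client
end

puts build_client({}).describe
puts build_client({ 'PATHOGENDB_ADAPTER' => 'CEIRS' }).describe
```

The point of the design is that the default methods stay untouched and an adapter only needs to redefine the handful of methods whose tables or columns differ.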

Thank you Ted, this all makes sense.

Update: documentation on the database schema is now included in README-database.md.

Thank you, this documentation is great! I was able to figure it out and have written code to adapt our schema to yours. The only thing I'm not sure about is that I have used the same ID for stock_ID, extract_ID, and isolate_ID, as we only ever get one sample_ID per sample (sometimes a patient has two samples taken, and these are given different IDs). We only sequence each sample once, so having a single ID works for us. Hopefully it will work in your system. I'll find out tomorrow!

That should work, actually! I made a small note about this in the Schema introduction. Might elevate it to the FAQ.
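A rough sketch of the ID reuse discussed above, for anyone in the same situation. The column names isolate_ID, stock_ID, and extract_ID come from this thread; the input records, the `sample_id` key, and the `expand_ids` helper are hypothetical, and the sketch only builds rows in memory rather than writing to a database.

```ruby
# Hypothetical input: one record per sequenced sample. A patient with two
# samples simply has two records with different sample IDs.
samples = [
  { sample_id: 'S001', patient: 'P1' },
  { sample_id: 'S002', patient: 'P1' }
]

# Reuses the single sample ID for all three ID columns, which is only
# reasonable when each sample is cultured, extracted, and sequenced once.
def expand_ids(sample)
  id = sample[:sample_id]
  { isolate_ID: id, stock_ID: id, extract_ID: id }
end

rows = samples.map { |s| expand_ids(s) }
rows.each { |r| puts r.inspect }
```

If a sample were ever re-extracted or re-sequenced, the one-to-one assumption would break and the extract would need its own distinct ID.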