
Data augmentation


datamart

Set up the environment:

cd datamart
conda env create -f environment.yml
source activate datamart_env
git update-index --assume-unchanged datamart/resources/index_info.json

Run the tests:

python -W ignore -m unittest discover

Validate your schema

Dataset providers should validate their dataset schema against our JSON schema as follows:

python scripts/validate_schema.py --validate_json {path_to_json}

e.g.

$ python scripts/validate_schema.py --validate_json test/tmp/tmp.json
Valid json
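
For reference, a hypothetical minimal schema could be written like this. Only the materialization.python_path and materialization.arguments fields are described in this README, so the remaining field names are assumptions; see tmp.json for a real example.

import json

# A hypothetical minimal dataset schema. materialization.python_path and
# materialization.arguments are described in the steps below; "title" and
# "description" are assumptions -- check tmp.json for the real structure.
schema = {
    "title": "My dataset",
    "description": "Toy dataset used to illustrate the schema layout",
    "materialization": {
        "python_path": "my_materializer",
        "arguments": {"url": "http://example.com/data.csv"}
    }
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)

Validating it works the same way:

$ python scripts/validate_schema.py --validate_json my_schema.json
Valid json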

How to provide an index for a data source

  1. Prepare your dataset schema and validate it as described in the previous step.

  2. Create your materialization method by subclassing the base class in materializer_base.py and put it in datamart/materializers (a minimal sketch appears after this list).

    Implement the get(self, metadata: dict = None, variables: typing.List[int] = None, constrains: dict = None) -> pd.DataFrame method:

    metadata is the metadata produced from your dataset schema after processing (profiling and so on). The materialization field is left unchanged by processing, so put every argument you need for materializing your dataset in materialization.arguments.

    variables defaults to None for now.

    constrains: fake some query constraints for your dataset (note the parameter is spelled constrains in the signature above), e.g.

    constrains={
        "locations": ["los angeles", "new york"],
        "date_range": {
            "start": "2018-09-23T00:00:00",
            "end": "2018-10-01T00:00:00"
        }
    }
    

    The method returns a pandas DataFrame.

    Take a look at noaa_materializer.py for an example.

  3. Point materialization.python_path in your dataset schema JSON to your materialization method. Take a look at tmp.json for an example.

  4. Create metadata and index it on Elasticsearch, following the indexing demo (see the note below).

  5. Try some queries to test your materialization method, following the query demo.
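
As referenced in step 2, here is a minimal sketch of a materializer. The file name, class name, and toy data are hypothetical, and the base-class name MaterializerBase is an assumption; only the module location and the get signature come from the steps above.

# my_materializer.py -- a hypothetical minimal materializer,
# to be placed in datamart/materializers.
import typing

import pandas as pd

# Assumes the base class in materializer_base.py is named MaterializerBase.
from datamart.materializers.materializer_base import MaterializerBase


class MyMaterializer(MaterializerBase):

    def get(self,
            metadata: dict = None,
            variables: typing.List[int] = None,  # unused for now, per step 2
            constrains: dict = None) -> pd.DataFrame:
        # Everything needed to fetch the data lives in
        # metadata["materialization"]["arguments"], e.g. a source URL.
        args = (metadata or {}).get("materialization", {}).get("arguments", {})

        # A real materializer would download or query the source here,
        # using args; this toy frame stands in for the fetched data.
        df = pd.DataFrame({
            "city": ["los angeles", "new york"],
            "date": ["2018-09-23", "2018-09-30"],
            "value": [1.0, 2.0],
        })

        # Honor the fake constrains from step 2 when they are present.
        if constrains and "locations" in constrains:
            df = df[df["city"].isin(constrains["locations"])]
        return df

Querying it for testing (step 5) then looks like:

MyMaterializer().get(
    metadata={"materialization": {"arguments": {}}},
    constrains={"locations": ["new york"]}
)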

Note: launch the demo notebook with:

jupyter notebook test/indexing.ipynb
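
For orientation, the indexing step (step 4) boils down to something like the sketch below; the host, index name, doc_type, and metadata file are assumptions, and the indexing demo notebook is the authoritative reference.

# A rough sketch of indexing processed metadata into Elasticsearch.
import json

from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch instance and an index named "datamart".
es = Elasticsearch(["localhost:9200"])

# "my_metadata.json" is a placeholder for your processed metadata.
with open("my_metadata.json") as f:
    metadata = json.load(f)

# Store the metadata document so queries can find this dataset.
es.index(index="datamart", doc_type="document", body=metadata)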