This demo builds on the Apache Airflow quick start (https://airflow.apache.org/start.html).
Set AIRFLOW_HOME to point to the airflow/ directory in this repository.
Setup:
docker-compose up -d
airflow initdb # Only necessary on first run, of course
airflow webserver -p 8080
airflow scheduler
We will evaluate NPI data from cms.gov. For the original raw data, see: .
- Install Great Expectations (GE) in our project and profile the datasource.
- Review data-docs built for the NPI data.
- Identify some columns to investigate further:
  - "Provider Other Organization Name Type Code"
  - "Provider Enumeration Date"
- Run the create expectations notebook
data_asset_name = 'demo__dir/default/npidata_pfile'
data_asset_name = 'demo__dir/default/npidata_pfile_transformed'
ge.dataset.util.build_categorical_partition_object(batch, column='Provider Other Organization Name Type Code')
batch.expect_column_distinct_values_to_be_in_set(column='Provider Other Organization Name Type Code', value_set=[3, 4, 5])
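To make the last two calls easier to follow, here is a minimal pure-Python sketch of the ideas behind them. It summarizes a categorical column as distinct values with relative frequencies, then checks that every distinct value falls inside an allowed set. This is not the Great Expectations implementation. The helper names `categorical_partition` and `distinct_values_in_set` and the sample values are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical sample standing in for the
# "Provider Other Organization Name Type Code" column; not real NPI data.
sample = [3, 3, 4, 5, 3, 4]


def categorical_partition(values):
    """Summarize a column as its distinct values and their relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    ordered = sorted(counts)
    return {
        "values": ordered,
        "weights": [counts[v] / total for v in ordered],
    }


def distinct_values_in_set(values, value_set):
    """Return (success, unexpected) where unexpected lists values outside value_set."""
    unexpected = sorted(set(values) - set(value_set))
    return len(unexpected) == 0, unexpected


partition = categorical_partition(sample)

# Every distinct value is allowed, so the check passes.
ok, unexpected = distinct_values_in_set(sample, [3, 4, 5])

# Adding a value outside the allowed set makes the check fail.
bad_ok, bad_unexpected = distinct_values_in_set(sample + [9], [3, 4, 5])
```

Looking at the partition first is a reasonable way to decide which codes belong in `value_set` before committing the expectation.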