meta-facts

Automatic generation of Meta-Data for a dataset

Table of Contents

Motivation
How to run the application
Project Structure
Methodology
1. Where this Library fits in the overall architecture
2. Approach to determine Meta-Data
  1. Column Names
  2. File Path
  3. Units
  4. Temporal Coverage
  5. Granularity
  6. Spatial Coverage
  7. File Formats Available
  8. Is Public Dataset

Motivation

How to run the application

Running on localhost

poetry run uvicorn app.main:app --reload --port 8005

Deploy app

docker compose up --build

Access Swagger Documentation

http://localhost:8005/api/docs

Project structure

Application files live in the app and tests directories. The app directory is organised as follows:

app
├── api              - web related stuff.
│   └── routes       - web routes.
├── core             - application configuration, startup events, logging.
├── models           - pydantic models for this application.
├── services         - logic that is not just CRUD related.
└── main.py          - FastAPI application creation and configuration.

Methodology

Approach to determine Meta-Data

Column Names

How are columns categorised?

The library sorts columns into the following categories:

| Column Entity | Columns |
| --- | --- |
| Date-Time | non_calendar_year, calender_year, other_year, quarter, month, date |
| Geography | country, state, district |
| Unit | unit |
| Note | note |
| Unmapped | Any unmapped columns |
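The categorisation above can be sketched as a simple lookup. This is only an illustration: `COLUMN_CATEGORIES` and `categorise_columns` are hypothetical names, and matching on the exact lower-cased column name is an assumption, since the library may match keywords differently.

```python
# Hypothetical mapping built from the table above; the library's real
# mapping lives in its own configuration and may differ.
COLUMN_CATEGORIES = {
    "Date-Time": {"non_calendar_year", "calender_year", "other_year",
                  "quarter", "month", "date"},
    "Geography": {"country", "state", "district"},
    "Unit": {"unit"},
    "Note": {"note"},
}


def categorise_columns(columns):
    """Group column names by category; unrecognised names go to 'Unmapped'."""
    result = {category: [] for category in COLUMN_CATEGORIES}
    result["Unmapped"] = []
    for column in columns:
        key = column.strip().lower()  # assumption: exact, case-insensitive match
        for category, names in COLUMN_CATEGORIES.items():
            if key in names:
                result[category].append(column)
                break
        else:
            # No category claimed this column.
            result["Unmapped"].append(column)
    return result
```

For example, `categorise_columns(["state", "calender_year", "value"])` places `value` under `Unmapped`.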

Units :

General Workflow

graph LR;
  A[Dataset]-->B{Unit column exists?};

  B -- NO --> C(RETURN Null String);
  B -- YES --> D[Get all unique units from the unit column];

  D --> E[Prepare list of all separate units];
  E --> F(RETURN all units as a comma-separated string)
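The workflow above could look roughly like the following. It is a sketch under stated assumptions, not the library's code: `get_units` is a hypothetical name, rows are plain dictionaries rather than whatever table type the library uses, an empty string stands in for the "Null String", and a single cell is assumed to be able to hold several comma-separated units.

```python
def get_units(rows, unit_column="unit"):
    """Return the distinct units found in `unit_column`, joined with commas."""
    if not rows or unit_column not in rows[0]:
        # No unit column: empty string, standing in for the "Null String".
        return ""
    seen = []  # a list keeps first-seen order
    for row in rows:
        value = row.get(unit_column)
        if value is None:
            continue
        # Assumption: one cell may list several units, e.g. "kg, tonnes".
        for unit in str(value).split(","):
            unit = unit.strip()
            if unit and unit not in seen:
                seen.append(unit)
    return ", ".join(seen)
```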

Temporal Coverage :

General Workflow

flowchart LR

A(Dataset) --> B{Year column exists?}
B -- NO --> C(RETURN Null String)
B -- YES --> D[Calendar / Non-Calendar Year Columns]
D --> E{Years are in sequence?}
E -- YES --> F(RETURN string representation of range \n example : 2012 to 2020 or \n 2012-13 to 2020-21)
E -- NO --> G(RETURN comma-separated values for all years, \n example : 2012, 2015, 2018 or \n 2012-13, 2015-16, 2018-19)

Notes:

Temporal coverage is determined from the year column, so a dataset without one returns a null string.
If both calendar-year and non-calendar-year columns are present in the dataset, the calendar year takes priority.
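The rules above can be sketched as follows. The function names are hypothetical, the column names are taken from the category table, and treating the first four characters of a label as its starting year is an assumption that covers both `2012` and `2012-13` style values.

```python
def pick_year_column(columns):
    """Prefer the calendar-year column over the non-calendar-year column."""
    for name in ("calender_year", "non_calendar_year"):
        if name in columns:
            return name
    return None


def temporal_coverage(years):
    """Summarise year labels such as 2012 or "2012-13" as a range or a list."""
    # Assumption: the first four characters hold the starting year.
    unique = sorted({str(y).strip() for y in years},
                    key=lambda label: int(label[:4]))
    if not unique:
        return ""
    starts = [int(label[:4]) for label in unique]
    in_sequence = all(b - a == 1 for a, b in zip(starts, starts[1:]))
    if in_sequence and len(unique) > 1:
        return f"{unique[0]} to {unique[-1]}"
    # Gaps in the sequence (or a single year): list every value.
    return ", ".join(unique)
```

With consecutive years the result is a range such as `2012 to 2014`; with gaps it is the full list.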

Granularity :

General Workflow

  flowchart LR
  A(Dataset) --> B{Any Date-Time or \nGeography columns exist?}
  B -- NO --> C(RETURN Null String)
  B -- YES --> D[Map all column levels in \nsorted order for their respective domains]
  D --> E[Map the column groups to the \nGranularity naming convention]
  E --> F(RETURN comma-separated values of all granularities \n example : Quarterly, District)

Notes:

Granularity is calculated for two domains:
- Geography
- Date-Time
config.py defines the granularity ranks for each domain.
config.py also holds the keywords used to detect granularity columns in a dataset.
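A minimal sketch of the rank-based lookup described above. The rank values, the labels (`Annual`, `Monthly`, and so on) and the choice to report only the finest level per domain are assumptions made for illustration; the real ranks and keywords are the ones in config.py.

```python
# Hypothetical ranks: a higher number means a finer level.
GRANULARITY_RANKS = {
    "Date-Time": {"calender_year": 1, "non_calendar_year": 1,
                  "quarter": 2, "month": 3, "date": 4},
    "Geography": {"country": 1, "state": 2, "district": 3},
}

# Hypothetical naming convention for the returned labels.
GRANULARITY_LABELS = {
    "calender_year": "Annual", "non_calendar_year": "Annual",
    "quarter": "Quarterly", "month": "Monthly", "date": "Daily",
    "country": "Country", "state": "State", "district": "District",
}


def granularity(columns):
    """Return the finest granularity label found in each domain."""
    present = {column.strip().lower() for column in columns}
    labels = []
    for ranks in GRANULARITY_RANKS.values():
        found = [name for name in ranks if name in present]
        if found:
            finest = max(found, key=ranks.get)
            labels.append(GRANULARITY_LABELS[finest])
    return ", ".join(labels)
```

A dataset with `calender_year`, `quarter`, `state` and `district` columns would give `Quarterly, District` under these assumed ranks.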

Spatial Coverage :

The cases for Spatial Coverage are listed below:

| Spatial Location | Dataset with categories as | Methodology | Spatial Coverage |
| --- | --- | --- | --- |
| Countries | India, Pakistan, China, etc. | | Country |
| Specific Country | India | Represent it with the specific country name | India |
| States of a Country | Andhra Pradesh, Assam, etc. | | States of India |
| Regions of a Country | South India, NE states, etc. | | Regions of India |
| Specific State of a Country | Andhra Pradesh | Represent it with the specific state name | Andhra Pradesh |
| Districts of a State / States | Adilabad, Hyderabad, etc. | | Districts of Telangana or Districts of India |
| Specific District of a State | Hyderabad | Represent it with the specific district name | Hyderabad |

General Workflow
```
  flowchart LR
  A(Dataset) --> B{Geographical columns exist?}
  B -- NO --> C(RETURN default value INDIA)
  B -- YES --> D[Sort the different \ngeographical levels in order]
  D --> E(RETURN value of the highest-order geographical column \nwith proper naming convention)
```
Notes:
- The library currently supports only the Country, State and District levels of Spatial Coverage.
- Geographic columns are mapped to levels by their column names, not their values, so renaming a column changes the mapping.
- If there is no geographic column, the result defaults to INDIA.
- The Spatial Coverage order, keyword mapping and naming convention are defined in config.py.
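The cases in the table can be sketched as below. This is an illustration under assumptions, not the library's implementation: rows are plain dictionaries, the "highest-order" column is read as the finest level present, the names `GEO_ORDER`, `PLURAL` and `spatial_coverage` are hypothetical, and the region case from the table is left out because the notes limit support to Country, State and District.

```python
GEO_ORDER = ["country", "state", "district"]  # coarse to fine (assumed order)
PLURAL = {"country": "Country", "state": "States", "district": "Districts"}
DEFAULT_COUNTRY = "India"


def spatial_coverage(rows):
    """Describe the area covered, using the finest geographic column present."""
    columns = rows[0].keys() if rows else []
    present = [level for level in GEO_ORDER if level in columns]
    if not present:
        return DEFAULT_COUNTRY  # no geographic column: default value
    finest = present[-1]
    values = sorted({row[finest] for row in rows})
    if len(values) == 1:
        return values[0]  # a single country, state or district: use its name
    if finest == "country":
        return PLURAL["country"]
    # Name the parent when the next-coarser column holds exactly one value.
    parent_level = GEO_ORDER[GEO_ORDER.index(finest) - 1]
    parents = {row[parent_level] for row in rows} if parent_level in columns else set()
    parent = parents.pop() if len(parents) == 1 else DEFAULT_COUNTRY
    return f"{PLURAL[finest]} of {parent}"
```

Under these assumptions, a dataset listing several districts of one state gives `Districts of Telangana`, while districts spread over several states give `Districts of India`.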

File Formats Available :

Notes:

The file format is read from the file name.
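One way to read a format from a file name is to take its extension. The helper below is a hypothetical sketch using the standard library's `pathlib`; using only the last extension and lower-casing it are assumptions.

```python
from pathlib import PurePath


def file_format(file_name):
    """Return the lower-cased extension of `file_name`, without the dot."""
    # PurePath.suffix is the last extension only, e.g. ".gz" for "a.tar.gz".
    return PurePath(file_name).suffix.lstrip(".").lower()
```

A name with no extension yields an empty string.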
