FastAPI service to generate partial meta-data for datasets generated at Factly

meta-facts

Automatic generation of Meta-Data for a dataset


Table of Contents
  1. Motivation
  2. How to run the application
  3. Project Structure
  4. Methodology
    1. Where this Library fits in the overall architecture
    2. Approach to determine Meta-Data
      1. Column Names
      2. File Path
      3. Units
      4. Temporal Coverage
      5. Granularity
      6. Spatial Coverage
      7. File Formats Available
      8. Is Public Dataset

Motivation


How to run the application

Running on localhost

poetry run uvicorn app.main:app --reload --port 8005

Deploy app

docker compose up --build

Access Swagger Documentation

http://localhost:8005/api/docs


Project Structure

Files related to the application are in the app and tests directories. The application parts are:

app
├── api              - web related stuff.
│   └── routes       - web routes.
├── core             - application configuration, startup events, logging.
├── models           - pydantic models for this application.
├── services         - logic that is not just CRUD related.
└── main.py          - FastAPI application creation and configuration.

Methodology


Approach to determine Meta-Data


Column Names

  • How are columns categorised?
    • The library categorises columns into the following categories:

      Column Entity    Columns
      Date-Time        non_calendar_year, calender_year, other_year, quarter, month, date
      Geography        country, state, district
      Unit             unit
      Note             note
      Unmapped         Any unmapped columns
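
The categorisation above can be sketched as a simple lookup. This is a minimal illustration, not the library's code: the function name categorise_columns and the exact-name matching (after lower-casing) are assumptions, and the real keyword mapping may be more permissive.

```python
# Category table copied from the list above; column names are kept as written there.
COLUMN_CATEGORIES = {
    "Date-Time": ["non_calendar_year", "calender_year", "other_year",
                  "quarter", "month", "date"],
    "Geography": ["country", "state", "district"],
    "Unit": ["unit"],
    "Note": ["note"],
}


def categorise_columns(columns):
    """Group column names by category (hypothetical helper).

    Assumption: a column is matched by its exact name, ignoring case and
    surrounding whitespace. Anything else is reported as Unmapped.
    """
    lookup = {
        name: category
        for category, names in COLUMN_CATEGORIES.items()
        for name in names
    }
    result = {category: [] for category in COLUMN_CATEGORIES}
    result["Unmapped"] = []
    for column in columns:
        key = column.strip().lower()
        result[lookup.get(key, "Unmapped")].append(column)
    return result


if __name__ == "__main__":
    print(categorise_columns(["state", "Quarter", "value"]))
```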

Units

  • General Workflow

    graph LR;
      A[Dataset]-->B{Unit Column Exists ?};
      
      B -- NO --> C(RETURN Null String);
      B -- YES --> D[Get all unique units from UNIT Column];
    
      D --> E[Prepare List of all separate units];
      E --> F(RETURN all units as STRING SEPARATED WITH COMMAS)
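
The workflow above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function name get_units is hypothetical, the dataset is modelled as a list of row dictionaries, the "Null String" is taken to be an empty string, and a single cell is assumed to hold several units separated by commas.

```python
def get_units(rows, unit_column="unit"):
    """Return all distinct units as one comma-separated string.

    Assumptions (not confirmed by the project): rows is a list of dicts,
    a missing unit column yields an empty string, and a cell such as
    "kg, litre" contains two separate units.
    """
    # No rows, or no unit column at all: return the empty ("null") string.
    if not rows or not any(unit_column in row for row in rows):
        return ""

    units = []
    for row in rows:
        value = row.get(unit_column)
        if value is None:
            continue
        # Split multi-unit cells and keep first-seen order without duplicates.
        for part in str(value).split(","):
            unit = part.strip()
            if unit and unit not in units:
                units.append(unit)
    return ", ".join(units)


if __name__ == "__main__":
    sample = [{"unit": "kg"}, {"unit": "kg, litre"}, {"unit": "litre"}]
    print(get_units(sample))
```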
    

Temporal Coverage

  • General Workflow

    flowchart LR
    
    A(Dataset) --> B{Year column exists ?}
    B -- NO --> C(RETURN Null String)
    B -- YES --> D[Calendar / Non-Calendar Year Columns]
    D --> E{Years are in Sequence ?}
    E -- YES --> F(RETURN string representation of range \n example : 2012 to 2020 or \n 2012-13 to 2020-21)
    E -- NO --> G(RETURN comma separated values for all years, \n example : 2012,2015,2018 or \n 2012-13, 2015-16, 2018-19)
    

    Notes:

    • Temporal coverage is determined from the year column, so it requires a year column to be present.
    • If both calendar-year and non-calendar-year columns are present in the dataset, the calendar year takes priority.
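
A minimal sketch of this workflow, assuming the dataset is a dictionary that maps column names to lists of values. The function names, the use of the first four characters as the start year, the ", " separator, and the handling of a single year (returned on its own) are all assumptions made for illustration.

```python
def start_year(label):
    """First calendar year of a label such as '2012' or '2012-13'."""
    return int(str(label).strip()[:4])


def format_years(values):
    """Return 'A to B' for consecutive years, else a comma-separated list."""
    labels = sorted({str(v).strip() for v in values}, key=start_year)
    if not labels:
        return ""
    starts = [start_year(label) for label in labels]
    consecutive = all(b - a == 1 for a, b in zip(starts, starts[1:]))
    if consecutive and len(labels) > 1:
        return f"{labels[0]} to {labels[-1]}"
    return ", ".join(labels)


def temporal_coverage(table):
    """Prefer the calendar-year column over the non-calendar-year column."""
    # Column names follow the category list in this README.
    for column in ("calender_year", "non_calendar_year"):
        if column in table:
            return format_years(table[column])
    return ""  # no year column: empty ("null") string


if __name__ == "__main__":
    print(temporal_coverage({"calender_year": [2014, 2012, 2013]}))
    print(temporal_coverage({"non_calendar_year": ["2012-13", "2015-16", "2018-19"]}))
```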

Granularity

  • General Workflow

      flowchart LR
      A(Dataset) --> B{Any Date-Time or \nGeography columns exist ?}
      B -- NO --> C(RETURN Null String)
      B -- YES --> D[Map all Column levels in \nSorted Order for respective Domains]
      D --> E[Map the column groups according to \nproper Granularity naming convention]
      E --> F(RETURN Comma Separated Values of all Granularities \n example : Quarterly, District)
    

    Notes:

    • Granularity is calculated for 2 domains.
      • Geography
      • Date-Time
    • config.py defines the granularity ranks for each domain.
    • config.py also lists the keywords used to recognise granularity columns in datasets.
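
One plausible reading of this workflow is sketched below. The rank values, the display names other than "Quarterly" and "District", and the choice to report only the finest level per domain are assumptions for illustration; the actual ranks and keywords are those in config.py.

```python
# Hypothetical ranks: a higher number means a finer level within the domain.
GRANULARITY_RANKS = {
    "Date-Time": {"calender_year": 1, "non_calendar_year": 1, "other_year": 1,
                  "quarter": 2, "month": 3, "date": 4},
    "Geography": {"country": 1, "state": 2, "district": 3},
}

# Hypothetical naming convention for the returned labels.
GRANULARITY_NAMES = {
    "calender_year": "Annual", "non_calendar_year": "Annual",
    "other_year": "Annual", "quarter": "Quarterly", "month": "Monthly",
    "date": "Daily", "country": "Country", "state": "State",
    "district": "District",
}


def granularity(columns):
    """Return the finest level found in each domain, comma separated."""
    present = {column.strip().lower() for column in columns}
    labels = []
    for ranks in GRANULARITY_RANKS.values():
        matched = [name for name in ranks if name in present]
        if matched:
            finest = max(matched, key=ranks.get)
            labels.append(GRANULARITY_NAMES[finest])
    return ", ".join(labels)  # empty string when neither domain is present


if __name__ == "__main__":
    print(granularity(["calender_year", "quarter", "state", "district", "value"]))
```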

Spatial Coverage

The cases for Spatial Coverage are listed below:

Spatial Location                Dataset with categories as      Methodology                                       Spatial Coverage
Countries                       India, Pakistan, China, etc.                                                      Country
Specific Country                India                           Represent it with the specific Country Name       India
States of a Country             Andhra Pradesh, Assam, etc.                                                       States of India
Regions of a country            South India, NE states, etc.                                                      Regions of India
Specific State of a country     Andhra Pradesh                  Represent it with the specific State Name         Andhra Pradesh
Districts of a State / States   Adilabad, Hyderabad, etc.                                                         Districts of Telangana or Districts of India
Specific District of a state    Hyderabad                       Represent it with the specific District Name      Hyderabad

  • General Workflow

      flowchart LR
      A(Dataset) --> B{Geographical Columns exist ?}
      B -- NO --> C(RETURN Default Value as INDIA)
      B -- YES --> D[Sort the order of different \nGeographical Levels]
      D --> E(RETURN Value of biggest order of Geographical Column \nwith proper naming convention)
    

    Notes:

    • This library currently supports only the Country, State and District levels of Spatial Coverage.
    • The level of a geographic column is decided by its column name and not by its values, so changing column names will affect the mapping.
    • If there is no geographic column, the result defaults to INDIA.
    • The spatial coverage order, keyword mapping and naming convention are defined in config.py.
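
The cases in the table can be approximated with the sketch below. It is an interpretation, not the library's implementation: it reads "biggest order" as the finest geographic level present, models the dataset as a dictionary of column name to values, and writes the default as "India". The function names and the parent lookup are hypothetical.

```python
LEVELS = ["country", "state", "district"]  # coarse to fine, matched by column name
PLURAL_NAMES = {"state": "States", "district": "Districts"}
DEFAULT_COVERAGE = "India"


def unique_values(table, column):
    """Sorted distinct values of a column, or an empty list if it is absent."""
    return sorted({str(v).strip() for v in table.get(column, [])})


def spatial_coverage(table):
    """Name the spatial coverage from the finest geographic column present."""
    present = [level for level in LEVELS if level in table]
    if not present:
        return DEFAULT_COVERAGE  # no geographic column at all

    finest = present[-1]
    values = unique_values(table, finest)
    if len(values) == 1:
        return values[0]  # a specific country, state or district
    if finest == "country":
        return "Country"  # several countries

    # Several states or districts: name them after their parent when it is
    # unambiguous, otherwise fall back to the default country.
    parent_level = LEVELS[LEVELS.index(finest) - 1]
    parents = unique_values(table, parent_level)
    parent = parents[0] if len(parents) == 1 else DEFAULT_COVERAGE
    return f"{PLURAL_NAMES[finest]} of {parent}"


if __name__ == "__main__":
    print(spatial_coverage({"state": ["Telangana", "Telangana"],
                            "district": ["Adilabad", "Hyderabad"]}))
```

Regions of a country are not covered by this sketch, in line with the note that only the Country, State and District levels are supported.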

File Formats Available

Notes:

  • The file format is read from the file name.
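
Reading the format from the file name can be done with the standard library, as in this small sketch. The function name file_format and the lower-casing of the result are assumptions; the project may normalise the value differently.

```python
from pathlib import Path


def file_format(file_name):
    """Return the extension of a file name without the dot, lower-cased.

    Hypothetical helper: a name with no extension yields an empty string.
    """
    return Path(file_name).suffix.lstrip(".").lower()


if __name__ == "__main__":
    print(file_format("data/rainfall.csv"))
```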