The project combines several scientific occurrence datasets so that they can be analyzed by species distribution and divesite proximity.
For setup instructions, please refer to the documentation.
The project mimics a medallion architecture with the following layers:
- Substrate - contains raw external data from the sources.
- Skeleton - contains joined and cleaned data for individual dimensions and facts.
- Coral - final aggregated data for the analytics.
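
As a rough illustration of how the three layers relate, the sketch below passes a few made-up records through each stage in plain Python. The function names, sample records and cleaning rules (for example, defaulting a missing count to 1) are assumptions made for this example only; the project's real transformations may be structured differently.

```python
from collections import Counter

# Substrate: hypothetical raw records, as they might arrive from two sources.
SUBSTRATE = [
    {"species": " Pterois volitans ", "individualcount": "3", "source": "source_a"},
    {"species": "Pterois volitans", "individualcount": None, "source": "source_b"},
    {"species": "", "individualcount": "1", "source": "source_a"},
]


def to_skeleton(raw_rows):
    """Clean raw rows: trim names, drop rows without a species, parse counts."""
    cleaned = []
    for row in raw_rows:
        species = (row["species"] or "").strip()
        if not species:
            continue  # a row without a species name cannot be analyzed
        # Assumption: a missing count means a single individual was seen.
        count = int(row["individualcount"]) if row["individualcount"] else 1
        cleaned.append(
            {"species": species, "individualcount": count, "source": row["source"]}
        )
    return cleaned


def to_coral(skeleton_rows):
    """Aggregate cleaned rows into total individuals per species."""
    totals = Counter()
    for row in skeleton_rows:
        totals[row["species"]] += row["individualcount"]
    return dict(totals)


skeleton = to_skeleton(SUBSTRATE)  # Skeleton: cleaned rows
coral = to_coral(skeleton)         # Coral: aggregated result for analytics
```

Each layer only reads from the one before it, which is the main property the medallion layout is meant to preserve.
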
Column Name | Data Type | Description |
---|---|---|
species | STRING | String representation of the species name. |
individualcount | INTEGER | Number of individuals per sighting (e.g. "I saw 5 ducks") |
eventdate | TIMESTAMP | Timestamp of the occurrence |
geography | GEOGRAPHY | BigQuery type of geography marker: POINT() |
source | STRING | Source dataset from which the occurrence originated |
is_invasive | BOOLEAN | Species considered invasive |
is_endangered | BOOLEAN | Species considered endangered |
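
For readers working with this schema outside BigQuery, one row could be modelled roughly as follows. This is a minimal sketch: the `Occurrence` class, the `make_point` helper and all sample values are hypothetical, and the geography is held as WKT text (longitude first, then latitude) rather than as a native GEOGRAPHY value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Occurrence:
    """One row of the occurrence table (hypothetical Python mirror of the schema)."""
    species: str
    individualcount: int
    eventdate: datetime
    geography: str  # WKT text such as "POINT(lon lat)"
    source: str
    is_invasive: bool
    is_endangered: bool


def make_point(longitude: float, latitude: float) -> str:
    """Format coordinates as a WKT point; longitude comes before latitude."""
    return f"POINT({longitude} {latitude})"


# Sample values are invented for illustration.
example = Occurrence(
    species="Pterois volitans",
    individualcount=5,
    eventdate=datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc),
    geography=make_point(-80.1, 25.7),
    source="example_source",
    is_invasive=True,
    is_endangered=False,
)
```
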
- Divesites and observations distribution between used sources
- Invasive species near divesites
- Endangered species near divesites
- Top 20 invasive species near divesites
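
The "near divesites" analyses above depend on a proximity test between an occurrence and a divesite. The sketch below shows one way such a ranking could be computed in plain Python using the haversine great-circle distance. The 5 km radius, the `lat`/`lon` field names and the sample data are assumptions for illustration; the project does not state its threshold here, and inside BigQuery a geography function such as `ST_DWITHIN` would be the more likely choice.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def top_invasive_near_divesites(occurrences, divesites, radius_km=5.0, limit=20):
    """Rank invasive species by individuals sighted within radius_km of any divesite."""
    counts = Counter()
    for occ in occurrences:
        if not occ["is_invasive"]:
            continue
        near = any(
            haversine_km(occ["lat"], occ["lon"], site["lat"], site["lon"]) <= radius_km
            for site in divesites
        )
        if near:
            counts[occ["species"]] += occ["individualcount"]
    return counts.most_common(limit)


# Invented sample data: one divesite and a handful of sightings.
divesites = [{"name": "site_1", "lat": 25.0, "lon": -80.0}]
occurrences = [
    {"species": "A", "individualcount": 3, "lat": 25.01, "lon": -80.0, "is_invasive": True},
    {"species": "A", "individualcount": 2, "lat": 25.0, "lon": -80.02, "is_invasive": True},
    {"species": "B", "individualcount": 10, "lat": 26.0, "lon": -80.0, "is_invasive": True},
    {"species": "C", "individualcount": 7, "lat": 25.0, "lon": -80.0, "is_invasive": False},
    {"species": "D", "individualcount": 1, "lat": 25.0, "lon": -80.0, "is_invasive": True},
]

ranking = top_invasive_near_divesites(occurrences, divesites)
```

Species B is excluded because it lies roughly 111 km from the divesite, and species C because it is not flagged as invasive. Swapping the `is_invasive` check for `is_endangered` would give the corresponding endangered-species view.
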
If you're interested in contributing to this project, or need to report issues or submit pull requests, please get in touch via
Acknowledgement to #DataTalksClub for mentoring us through the Data Engineering Zoom Camp over the last 10 weeks. It has been a privilege to take part in the Spring '24 Cohort. Go and check them out!