/divesite-species-analytics

Using public data to see what animal species can be spotted near local divesites


Divesite Species Analytics Data Pipeline

The project combines several scientific occurrence datasets so that they can be analyzed by species distribution and by proximity to divesites.

Setup

For setup instructions, please see the documentation.

Architecture Diagram

diagram

Data modelling

The project mimics the medallion architecture, with the following layers:

  • Substrate - contains raw external data from the sources.
  • Skeleton - contains joined and cleaned data for individual dimensions and facts.
  • Coral - final aggregated data for the analytics.
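
The flow between the three layers can be illustrated with a minimal, plain-Python sketch. It is not the project's actual pipeline code: the raw field names (`scientificName`, `count`, `src`), the sample rows, and the rule that a missing count defaults to 1 are all assumptions made for illustration.

```python
from collections import Counter

# Substrate: raw records exactly as a (hypothetical) source delivers them.
substrate = [
    {"scientificName": " Pterois volitans ", "count": "2", "src": "source_a"},
    {"scientificName": "Chelonia mydas", "count": "", "src": "source_b"},
    {"scientificName": "Pterois volitans", "count": "1", "src": "source_b"},
]


def to_skeleton(raw_rows):
    """Clean raw rows: trim names, cast counts, default missing counts to 1."""
    cleaned = []
    for row in raw_rows:
        cleaned.append({
            "species": row["scientificName"].strip(),
            # Assumption: an empty count means a single individual was seen.
            "individualcount": int(row["count"]) if row["count"] else 1,
            "source": row["src"],
        })
    return cleaned


def to_coral(skeleton_rows):
    """Aggregate cleaned rows into total individuals per species."""
    totals = Counter()
    for row in skeleton_rows:
        totals[row["species"]] += row["individualcount"]
    return dict(totals)


skeleton = to_skeleton(substrate)
coral = to_coral(skeleton)
print(coral)  # {'Pterois volitans': 3, 'Chelonia mydas': 1}
```

Each layer reads only from the layer before it, so a cleaning rule can change in Skeleton without touching the raw Substrate data.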

Main Occurrences model:

| Column Name | Data Type | Description |
|---|---|---|
| species | STRING | String representation of the species name. |
| individualcount | INTEGER | Number of individuals per sighting (e.g. "I saw 5 ducks"). |
| eventdate | TIMESTAMP | Timestamp of the occurrence. |
| geography | GEOGRAPHY | BigQuery geography marker for the location: POINT(). |
| source | STRING | Source dataset from which the occurrence originated. |
| is_invasive | BOOLEAN | Whether the species is considered invasive. |
| is_endangered | BOOLEAN | Whether the species is considered endangered. |

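
For readers working with these rows in a notebook, the columns above could be mirrored by a small Python record type. This is a hedged sketch rather than code from the project: the `Occurrence` class, the split of `geography` into `longitude` and `latitude` fields, and the sample values are assumptions. The only BigQuery-specific detail relied on is that a WKT point lists longitude before latitude.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Occurrence:
    """One row of the Occurrences model (hypothetical Python mirror)."""
    species: str
    individualcount: int
    eventdate: datetime
    longitude: float  # assumption: geography kept as two plain floats
    latitude: float
    source: str
    is_invasive: bool
    is_endangered: bool

    def geography_wkt(self) -> str:
        """Render the location as a WKT point, longitude first."""
        return f"POINT({self.longitude} {self.latitude})"


# Illustrative values only, not taken from any real dataset.
sighting = Occurrence(
    species="Pterois volitans",
    individualcount=5,
    eventdate=datetime(2024, 5, 4, 14, 50, tzinfo=timezone.utc),
    longitude=-80.1,
    latitude=25.0,
    source="example_source",
    is_invasive=True,
    is_endangered=False,
)
print(sighting.geography_wkt())  # POINT(-80.1 25.0)
```
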
  • Divesites and observations distribution between used sources
  • Invasive species near divesites
  • Endangered species near divesites
  • Top 20 invasive species near divesites
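
A "near divesites" ranking like the ones listed above can be approximated in plain Python with a great-circle distance filter. This is a sketch under stated assumptions, not the project's implementation: the data is stored as BigQuery GEOGRAPHY, so the real pipeline may well do this in SQL. The 1 km radius, the dictionary field names, and the sample points are all hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def top_invasive_near(divesites, occurrences, radius_km=1.0, limit=20):
    """Rank invasive species by individuals seen within radius_km of any site."""
    totals = Counter()
    for occ in occurrences:
        if not occ["is_invasive"]:
            continue
        near = any(
            haversine_km(site["lat"], site["lon"], occ["lat"], occ["lon"]) <= radius_km
            for site in divesites
        )
        if near:
            totals[occ["species"]] += occ["individualcount"]
    return totals.most_common(limit)


# Hypothetical sample data for illustration.
divesites = [{"name": "Example Reef", "lat": 25.0, "lon": -80.1}]
occurrences = [
    {"species": "Pterois volitans", "lat": 25.001, "lon": -80.1,
     "individualcount": 3, "is_invasive": True},
    {"species": "Pterois volitans", "lat": 25.002, "lon": -80.101,
     "individualcount": 2, "is_invasive": True},
    {"species": "Chelonia mydas", "lat": 25.0, "lon": -80.1,
     "individualcount": 1, "is_invasive": False},
    {"species": "Carcinus maenas", "lat": 26.0, "lon": -80.1,
     "individualcount": 4, "is_invasive": True},
]

ranking = top_invasive_near(divesites, occurrences)
print(ranking)  # [('Pterois volitans', 5)]
```

The scan compares every occurrence with every divesite, which is fine for a small sample but would need a spatial index or a database-side geography join at dataset scale.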

Acknowledgements & Credits & Support

If you're interested in contributing to this project, need to report issues or submit pull requests, please get in touch via

Acknowledgements

Thanks to #DataTalksClub for mentoring us through the Data Engineering Zoomcamp over the last 10 weeks. It has been a privilege to take part in the Spring '24 cohort. Go and check them out!
