CSC-462-Project

Group project for CSC 462, a distributed computing course at UVic

Architecture

The Arbutus Mosaic Processing System

  • Initial tiling of input files is sequential

  • NDVI/NDSI processing of tiles is done in parallel

  • Since I did not set up a distributed file system, and shared managed drives that allow more than 2 simultaneous connections are large and expensive, images are base64-encoded and decoded so they can be sent inside gRPC messages (see the sketch after this list)

  • Can run on free Azure Ubuntu VMs, but takes over 35 minutes to calculate NDVI/NDSI with 1 master, 2 workers, and 2 input Sentinel-2 files

  • Currently, processing runs on Arbutus inside Docker containers with 1 master and 2 workers, but it can scale up

  • Queries the EarthDaily mosaic with coordinates for two corners, downloads the 32-bit image, computes NDVI, and returns a processed 8-bit PNG
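
The sketch below is a minimal illustration, not the project's exact code, of the two points above: computing NDVI from Sentinel-2 red and near-infrared bands with NumPy, and base64-encoding the resulting 8-bit PNG so it can travel inside a gRPC message rather than through a shared file system. The message and field names (TileResult, image_b64) are assumptions.

    import base64
    import io

    import numpy as np
    from PIL import Image

    def ndvi_png_b64(red, nir):
        """Return an 8-bit NDVI image as a base64-encoded PNG string."""
        red = red.astype("float32")
        nir = nir.astype("float32")
        ndvi = (nir - red) / (nir + red + 1e-9)           # NDVI in [-1, 1]
        scaled = ((ndvi + 1.0) * 127.5).astype("uint8")   # rescale to 0..255
        buf = io.BytesIO()
        Image.fromarray(scaled).save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("ascii")

    # e.g. reply = dist_processing_pb2.TileResult(image_b64=ndvi_png_b64(red, nir))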

The Client Local Web App

  • Flask web application that can take in and generate local data
  • Can display data as an overlay on a Leaflet map (see the sketch below)
  • Can upload data to and retrieve data from the centralized Azure PostgreSQL database
  • Sends requests for mosaic processing to the central Azure database
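
A minimal sketch of the overlay-serving side of the client, assuming a hypothetical route name and output path; Leaflet's L.imageOverlay(url, bounds) can then place the returned PNG on the map.

    from flask import Flask, send_file

    app = Flask(__name__)

    @app.route("/overlay/<name>.png")
    def overlay(name):
        # Serve a processed NDVI/NDSI PNG so the Leaflet map can draw it as an
        # image overlay between the tile's corner coordinates.
        return send_file(f"Output/output_merged/{name}.png", mimetype="image/png")

    if __name__ == "__main__":
        app.run(debug=True)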

The Azure PostgreSQL Database

  • Shared data store for all clients
  • Accepts requests for, and responses from, the creation of new mosaics (see the sketch below)
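
A hedged sketch of how a client could queue a mosaic-processing request in the shared database; the table and column names (mosaic_requests, min_lon, min_lat, max_lon, max_lat, status) are placeholders, not the actual schema from azure_psql.sql.

    import psycopg2

    def queue_request(conn_str, corners):
        # corners = (min_lon, min_lat, max_lon, max_lat) for the two query corners
        with psycopg2.connect(conn_str) as conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO mosaic_requests "
                "(min_lon, min_lat, max_lon, max_lat, status) "
                "VALUES (%s, %s, %s, %s, 'pending')",
                corners,
            )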

Running the System

The Arbutus Mosaic Processing System
  • To run with Docker:

    • Copy the arbutus_processing/ folder onto Arbutus

    • In the master/connection.py file, add the following credentials (a configuration sketch appears at the end of this section):

      • Azure Data Lake API key
      • Data Lake folder path
      • Connection string for Azure PostgreSQL Database
      • API key for accessing the EarthDaily mosaics
    • Make sure you have Docker installed and that the service is running

    • From the main directory containing docker-compose.yml, run:

      docker-compose up -d

    • View running Docker containers (add the -a flag to also list stopped containers)

      docker ps

    • When running in Docker, output files are written to a shared volume in the Output/output_merged sub-directory of the main directory

  • To run without Docker:

    • Install Python 3 and the modules listed in master/requirements.txt and worker/requirements.txt on whichever machines run the master and worker systems
    • Run the master from master/master.py, then start the desired number of workers from worker/worker.py
    • When running locally, files are saved to the master/Output/output_merged directory
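
The connection.py file referenced above is expected to look roughly like the following; the variable names are illustrative placeholders, not necessarily the ones the code uses.

    # master/connection.py (illustrative placeholders only)
    DATA_LAKE_API_KEY = "<Azure Data Lake API key>"
    DATA_LAKE_FOLDER_PATH = "<Data Lake folder path>"
    AZURE_PSQL_CONN_STRING = "host=<server> dbname=<db> user=<user> password=<password> sslmode=require"
    EARTH_DAILY_API_KEY = "<EarthDaily mosaics API key>"
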
The Local Flask Web App
  • Set up a PostgreSQL database locally and run the localdb.sql file to create the database table (an example appears after this list)
  • Install Python 3 and the modules from requirements.txt; the main requirements are flask, psycopg2, and azure-storage-file-datalake
  • In the connection.py file add:
    • Azure Data Lake API Key
    • Data Lake folder path
    • Connection string for the local PostgreSQL database
    • Connection string for Azure PostgreSQL database
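
For example, the local table can be created with the standard PostgreSQL command-line tools (the database name localdb is just a placeholder):

    createdb localdb
    psql -d localdb -f localdb.sql
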
The Azure PostgreSQL Server
  • Through the Azure portal, pgAdmin, or your platform of choice, run the utils/azure_psql.sql file to set up the database (for example, with psql as shown below)
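
For example, with the psql client and the server's connection string:

    psql "<Azure PostgreSQL connection string>" -f utils/azure_psql.sql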

Notes:

  • The IP address can be changed from localhost to run on servers; I used internal 10.X.X.X IP addresses, and all of my machines were on the same Azure VNet

  • worker.py is designed to run continuously, requesting tasks; master.py waits until it is ready to assign and send tasks to the workers (a sketch appears after the notes)

  • Recovery from failures and from stalled or stuck workers needs to be improved

  • Protocol Buffers are compiled with:

    python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. dist_processing.proto
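
The sketch below illustrates the worker's polling pattern described in the notes; the stub, RPC, and message names are hypothetical, since the real ones are defined in dist_processing.proto and implemented in worker/worker.py.

    import time

    import grpc

    import dist_processing_pb2
    import dist_processing_pb2_grpc

    def run_worker(master_address="10.0.0.4:50051"):
        channel = grpc.insecure_channel(master_address)
        stub = dist_processing_pb2_grpc.MasterStub(channel)  # hypothetical service name
        while True:
            # Ask the master for work; it only hands out tasks once it is ready.
            task = stub.RequestTask(dist_processing_pb2.TaskRequest())  # hypothetical RPC
            if not task.tile_b64:  # hypothetical field: empty means no task is ready yet
                time.sleep(5)
                continue
            # Decode the base64 tile, compute NDVI/NDSI, then report the result.
            stub.SubmitResult(dist_processing_pb2.TaskResult())  # hypothetical RPC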