/Atlas-BQ__ODS-EDW

Fork of ODS-EDW PoV

Primary LanguageJava

GCP-02

CLOUD ODS + EDW

Operationalized Data Lake/Warehouse - Ability to provide a highly available and highly scalable Operational Data Store (ODS) while keeping an Enterprise Data Warehouse (EDW) up to date in near real time, in order to obtain broader insight by cross-referencing against other sources of data in the enterprise.

SA Maintainer: Sig Narváez,

SA Contributor: Paresh Saraf,

SA Contributor: Munish Kapoor

SA Contributor: Chris Grabosky

Time to setup: 2+ hrs

Time to execute: 20 mins

Estimated GCP cost: ~$50

Estimated Atlas cost: <$5


Description

This proof shows how MongoDB Atlas on Google Cloud can act as an Operational Data Layer and effectively co-exist with Google BigQuery acting as Enterprise Data Warehouse.

The proof uses a sample movie dataset sample_mflix loaded into MongoDB Atlas. Google Cloud Dataflow is then used to transform the MongoDB collection into relational tables using the MongoDBIO library from Apache Beam. Finally, BigQueryIO library is used to write the transformed data into Google BigQuery. We will then join this data with a public dataset provided by Google BigQuery to show the power of consolidation across disparate data sources.

*** NOTE ***: The readme for this proof (this file) needs an overhaul given pull request ODS-EDW complete refactor #47. For now, please follow demo_steps.sh and demo_steps_ml.sh.


Prerequisites


Setup

1. Configure Atlas Environment

  • Log-on to your Atlas account
  • If you do not have an account, click on Sign Up. Sign Up with Google or enter details.
  • Click on the Database Access link on the left column. In the new page, choose to Add New User using the green button on the left. In the modal dialog, give the user the name appUser and password appUser123. We will use some built in roles so click Add Default Privileges and in the Default Privileges section add the roles readWriteAnyDatabase and clusterMonitor then press the green Add User button to create the user.

  • Return to the clusters page by clicking Clusters on the left navigation column.
  • Press the giant Build a Cluster button and then choose to create a single region cluster.

  • Deploy a 3 node replica-set in the Google Cloud region of your choice (here us-west2) with default settings, no backup, MongoDB version 4.0, and give it a name of your choosing (here we will use MDB-Dataflow). The size must be M10 or larger. Here we used an M30. Click the Create Cluster button and wait for the cluster to complete deployment.

If this is a free account, you will be required to add your credit card to create a cluster other than M0. To add your credit card details - select Billing on the left Navigation Menu and Add a Payment Method.

  • Whitelist the IPs. For the demo we will allow access from 0.0.0.0/0. This is not recommended for a production setup, where the recommendation will be to use VPC Peering and private IPs.

  • Copy the Connection String and replace the password with the password of the appUser created earlier. This will be used in the Execution later.

2. Load Data Into the MongoDB Atlas Cluster

MongoDB Atlas provides a sample dataset that you can use to quickly get started and begin experimenting with various tools like CRUD operations in Atlas and visualizations in Charts. To load the dataset:

  1. Navigate to your Clusters view.
  • In the left navigation pane in Atlas, click Clusters.
  1. Open the Load Sample Dataset dialog.
  • Locate the cluster where you want to load sample data.
  • Click the Ellipses (…) button for your cluster.
  • Click Load Sample Dataset.
  1. In the dialog, click Load Sample Dataset
  • The dialog closes and Atlas begins loading your sample dataset into your cluster.
  1. View your sample data.
  • To view your sample data by click your cluster’s Collections button. You should see the following databases in your cluster:
S. No. Datbase Name
1. sample_airbnb
2. sample_geospatial
3. sample_mflix
4. sample_supplies
5. sample_training
6. sample_weatherdata

We will be using the sample_mflix data for this exercise.

3. Configure GCP environment

  • Log-on to your personal GCP account.

  • In the Cloud Console, on the project selector page, create a new Cloud project named MongoDBAtlasODS-EDW.

  • Make sure that billing is enabled for your Google Cloud project

  • Enable the following Google Cloud services via the API by opening this url in your browser. You will need to select the MongoDBAtlasODS-EDW project.

    https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,logging,storage_component,storage_api,bigquery,pubsub,datastore.googleapis.com,cloudresourcemanager.googleapis.com

  • Create a VM Instance using GCP Compute Engine

  • Use the following values for creating the VM instance. Ensure that you deploy in the same Google Cloud region as your MongoDB Atlas cluster (us-west2 in this example).
    • Name: client-vm
    • Region: us-west2
    • Zone: a
    • Machine configuration: leave as General purpose and choose n1-standard-1
    • Boot disk: Change to Ubuntu 18.04 LTS (you can choose something else but all instructions will be based on this)
    • Identity and API access: Allow full access to all Cloud APIs
    • Leave the other values as default.
    • Then click Create

  • Connect to the newly created VM using ssh

Purpose Command
Confirm version is Ubuntu 18.04 LTS lsb_release -a
Update apt cache sudo apt update
Install git sudo apt install -y git
Install Java sudo apt install -y default-jre
Check Java version java --version
Make sure only one Java is installed sudo update-alternatives --config java
Set JAVA_HOME environment variable export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Make sure correct java is in path export PATH=$PATH:$JAVA_HOME/bin
Check Java version to make sure all is good java --version
Install Apache Maven sudo apt install -y maven
Make sure Maven is in path mvn -v

IF THE OUTPUT OF THE LSB_RELEASE COMMAND IS NOT THAT OF UBUNTU 18.04, YOU HAVE DEPLOYED THE WRONG VM FOR THESE INSTRUCTIONS. PLEASE DESTROY THE VM AND START OVER WITH THE CORRECT OPERATING SYSTEM

ON THE export JAVA_HOME COMMAND, THE PATH LISTED SHOULD BE THE PATH RETURNED BY THE sudo update-alternatives --config java COMMAND, OMMITTING THE /bin/java. AN EXAMPLE HAS BEEN PROVIDED

  • Clone the git repo using the git clone command. NOTE: Cloning private GitHub repos from the command line require personal access tokens. eg -

    git clone https://YOUR_USER:YOUR_PERSONAL_ACCESS_TOKEN@github.com/10gen/atlas-google-pov.git

  • Create a Cloud Storage bucket:

  • In the Cloud Console, go to the Cloud Storage Browser page then Create Bucket

  • Complete the form as follows:

    • Name: anything you choose but do not include sensitive information in the bucket name, because the bucket namespace is global and publicly visible
    • Location type: change to Region and in the drop down choose the same regions you deployed your compute and Atlas in (here us-west2)
    • Storage class: Standard
    • Access control: fine-grained
  • Then click the blue Create button

  • Configure BigQuery environment

  • Search for bigquery in search bar at the top and select first result.

  • Create a data set by name mflix. Accept remaining defaults.

  • Create an empty table movies (inside the mflix dataset) as shown below. Accept remaining defaults.


Execution

Make sure you are inside proof folder. Execute below script :

cd ~/atlas-google-pov/proofs/02-ODS-AND-EDW/

mvn compile exec:java \
    -Dexec.mainClass=dataflowdemonew.MongoDataFlowBigQuery \
    -Dexec.args="--project=<your GCP project id> \
    --mongouri=<mongodb atlas cluster connection string> \
    --gcpTempLocation=<temp location for dataflow (within the cloud storage bucket created before)> \
    --tempLocation=<temp location for bigquery (within the cloud storage bucket created before)> \
    --runner=DataflowRunner \
    --jobName=<job name> \
     --region=<region name>" \
    -Pdataflow-runner

Example:

mvn compile exec:java \
    -Dexec.mainClass=dataflowdemonew.MongoDataFlowBigQuery \
    -Dexec.args="--project=mongodbatlasods-edw-268604 \
         --mongouri=mongodb+srv://appUser:appUser123@mdb-dataflow-htq0y.gcp.mongodb.net/sample_mflix \
         --gcpTempLocation=gs://mkbucket001/tmp/ \
         --tempLocation=gs://mkbucket001/tmp/ \
         --runner=DataflowRunner \
         --jobName=dataflow-intro \
         --region=us-west1" \
         -Pdataflow-runner

HINT: If you obtain a runtime error, check which DataFlow Regional Endpoint is closest and available to your VM and MongoDB Atlas Cluster.

After a few minutes, navigate to GCP Dataflow (Select it from left tab in GCP console). You can see list of jobs in progress and completed:

Click on the relevant job. You should be able to see job details as below:

Navigate to BigQuery from GCP console. You should be able to see the rows populated in the empty tables created before.


Measurement

An Operational Data Lake/Date Warehouse will bring in data from various different System of Records (SORs). Users execute queries across these data sources. Let's join two distinct datasets, the mflix.movies data, which has the details on movie listings brought over from MongoDB Atlas and Google BigQuery's public dataset on film locations in San Francisco, where movies have been shot.

Lets query this data to see list of movies ordered by the number of shooting locations in San Francisco.

Execute the following query. You may see that the movie Blue Jasmine has been shot in over 50 locations in San Francisco, followed by Time After Time and San Andreas.

WITH   movieLocations 
AS     (select b.title title, 
               count(*) numLoc
        from `bigquery-public-data.san_francisco_film_locations.film_locations` b
        group by b.title),
movies 
AS     (SELECT movie_id,
               title,
               plot
        FROM   mflix.movies)
SELECT movie_id,
       m.title,
       numLoc,
       plot       
FROM   movies m
INNER JOIN
movieLocations 
ON m.title = movieLocations.title
order by 3 desc;

You should see an output like this:


Cleanup

  1. Terminate the MongoDB Atlas Cluster

  1. Delete BigQuery Dataset

  1. Delete Compute VM

  1. Delete Storage

  1. Shutdown the project, if it isn't used for other purposes.