kjy/DataEngineering_GCP_course3_ServerlessDataAnalysis_GoogleBigQuery-CloudDataflow
BigQuery. BigQuery is a petabyte-scale data warehouse on Google Cloud that runs SQL queries. Create a query; modify it to add clauses, subqueries, built-in functions, and joins. Load a CSV file into a BigQuery table using the web UI. Load a JSON file into a BigQuery table using the CLI. Export a table using the web UI. Use nested fields, regular expressions, the WITH statement, GROUP BY, and HAVING (see the query and load sketches below).

Dataflow. Dataflow is a runner (an execution framework) for Apache Beam pipelines. Each step in a pipeline is called a transform, and a pipeline runs from a source (e.g., BigQuery) to a sink (e.g., Cloud Storage). Set up a Python Dataflow project using Apache Beam, which defines and executes data processing workflows. Create a Dataflow pipeline that uses filtering, and execute it locally and on the cloud.

MapReduce. To process a large dataset, break the dataset into pieces so that each compute node processes data that is local to it. The map operations run in parallel over chunks of the original input data; their results are sent to reduce nodes, where aggregates are calculated. Each reduce node processes one key or one set of keys. Identify the map and reduce operations, execute the pipeline, and use command-line parameters.

Side Inputs. A side input is an additional input that your DoFn can access each time it processes an element of the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform's DoFn while processing each element.

Putting it together: load data into BigQuery and run complex queries; execute a Dataflow pipeline that carries out map and reduce operations, uses side inputs, and streams into BigQuery; use the output of one pipeline as a side input to another. Illustrative sketches of these steps follow.
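A minimal sketch of running such a query from Python with the google-cloud-bigquery client, using WITH, GROUP BY, and HAVING against the public `bigquery-public-data.samples.shakespeare` table (the word-count threshold is an arbitrary example value; the client picks up project and credentials from the environment):

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the project/credentials configured in the environment

sql = """
WITH words AS (
  SELECT corpus, word_count
  FROM `bigquery-public-data.samples.shakespeare`
)
SELECT corpus, SUM(word_count) AS total_words
FROM words
GROUP BY corpus
HAVING SUM(word_count) > 20000
ORDER BY total_words DESC
"""

# Run the query and iterate over the result rows.
for row in client.query(sql).result():
    print(f"{row.corpus}: {row.total_words}")
```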
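Loading newline-delimited JSON works with the bq CLI or the Python client; the sketch below takes the client route, with the corresponding bq invocation in a comment. The bucket, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Roughly equivalent bq CLI invocation (names are hypothetical):
#   bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect \
#       mydataset.mytable gs://mybucket/data.json

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema; pass an explicit schema in production
)
load_job = client.load_table_from_uri(
    "gs://mybucket/data.json",
    "myproject.mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```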
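A minimal Beam pipeline with a filter transform, runnable locally on the DirectRunner by default or on the cloud by passing Dataflow runner options; the bucket paths are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs locally by default; to execute on Dataflow, pass e.g.
#   --runner=DataflowRunner --project=<PROJECT> --region=<REGION> \
#   --temp_location=gs://<BUCKET>/tmp
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://mybucket/javahelp/*.java")
        | "FilterImports" >> beam.Filter(lambda line: line.startswith("import"))
        | "Write" >> beam.io.WriteToText("gs://mybucket/javahelp/output")
    )
```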
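A sketch of a map/reduce-style Beam pipeline with command-line parameters: beam.Map emitting key/value pairs is the map phase, beam.CombinePerKey the reduce phase (one aggregate per key), and parse_known_args splits the program's own flags from the runner's pipeline options. The input format (Java import lines counted per package), the helper, and the paths are assumptions for illustration:

```python
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def extract_package(line):
    # Simplified: "import com.example.util.Foo;" -> "com.example.util"
    return line.split()[1].rsplit(".", 1)[0]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="gs://mybucket/javahelp/*.java")
    parser.add_argument("--output", default="gs://mybucket/javahelp/packages")
    known_args, pipeline_args = parser.parse_known_args()

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(known_args.input)
            | "KeepImports" >> beam.Filter(lambda line: line.startswith("import "))
            | "Map" >> beam.Map(lambda line: (extract_package(line), 1))  # map phase
            | "Reduce" >> beam.CombinePerKey(sum)                         # reduce phase
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText(known_args.output)
        )
```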
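A sketch of a side input feeding a write to BigQuery: AsSingleton turns one PCollection into a view that can be read while processing each element of another, and the result is written to a BigQuery table. The toy data, dataset, and table names are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    # Main input: (term, count) pairs; in practice this would come from a source.
    scores = p | "ReadScores" >> beam.Create([("dataflow", 5), ("beam", 3), ("bigquery", 8)])

    # Side input: a single grand total computed from another PCollection.
    total = (
        p
        | "ReadTotals" >> beam.Create([5, 3, 8])
        | "SumTotal" >> beam.CombineGlobally(sum)
    )

    (
        scores
        # The side input is available to every element via AsSingleton.
        | "Normalize" >> beam.Map(
            lambda kv, total: {"term": kv[0], "share": kv[1] / total},
            total=beam.pvalue.AsSingleton(total),
        )
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "myproject:mydataset.term_shares",
            schema="term:STRING,share:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

The same pattern chains pipelines: write one pipeline's output to a table or files, then read it back in a second pipeline and wrap it with AsDict or AsSingleton as that pipeline's side input.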