Serverless Analytics Workshop

A workshop covering Amazon QuickSight, Amazon Athena, and Amazon Redshift Spectrum. The workshop focuses on ingesting data into S3, querying it with Athena and Redshift Spectrum, combining it with other data sources, and visualizing it in QuickSight.

This hands-on workshop is broken up into the sections below to get you familiar with the QuickSight, Athena, and Redshift Spectrum services:

Sign Up for AWS

Create your AWS Account

Navigate to the AWS Free Tier page and create your account. A variety of services offer a free tier to start building your solutions. Choose the Basic support plan.

Architecture and Permissions

The purpose of serverless components is to reduce the overhead of provisioning, maintaining, and managing servers to serve applications. AWS provides three compelling serverless services for storing large amounts of data, manipulating and querying it at scale and speed, and easily visualizing it - namely AWS Glue, Amazon Athena, and Amazon QuickSight.
alt text
To get these services working together, we need to allow them to talk to one another. In the following steps we will set up the permissions to accomplish this through AWS IAM.

Build Permissions and S3 Bucket

AWS provides a service to build resources out of predefined templates using CloudFormation. We will use a CloudFormation script to automate the creation of permissions, roles, and other elements we may require.

To create these resources we need to run a CloudFormation template:

  1. Make sure you are in the N. Virginia Region
  2. Under services, click CloudFormation under Management Tools.
    alt text

  3. Click Create Stack
  4. Under "Choose a template", select the "Specify an Amazon S3 template URL" radio button option and enter this template url:
https://s3-us-west-2.amazonaws.com/slalom-seattle-ima/scripts/cloudformation/cf_QuickSightAthena_Workshop.template
  5. Click Next
  6. Enter a name for your stack, like QuicksightAthena-Workshop
  7. Provide a unique name for the bucket that will store your data - it must be globally unique and contain only lowercase letters, numbers, periods (.), and dashes (-). No spaces!
  8. Hit Next
  9. Hit Next
  10. Review and check the acknowledgement checkbox, then hit Create
  11. Wait a couple of minutes until the stack status says CREATE_COMPLETE
    alt text


Populate the S3 Bucket with Data to Query

To get started with the analytics workshop, we need to provide data to query. This data may arrive in S3 from a variety of sources, but for this example we will upload files into S3 manually.

  1. Open the S3 Console from the Services drop down menu
  2. Click the bucket that was newly created, either by you or by the CloudFormation script.
  3. Hit Create folder and name it "B2B"
  4. Download the sample dataset (B2B Datasets) and unzip the dataset files into a local folder.
  5. Create a folder within B2B called "orders"
  6. Click into the new orders folder and upload orders.csv.
  7. Create folders within B2B called "sales" and "events"
  8. In the sales folder, click Upload and select the sales_ts.000 file.
  9. In the events folder, click Upload and select the allevents_pipe.txt file.
  10. Make note of the folders you saved these files under.

Query a file on S3 With Athena

  1. Open the Athena console from the Services dropdown.
  2. Create the database and table manually via DDL in the query window. Run the CREATE DATABASE statement first, then the CREATE EXTERNAL TABLE statement, since Athena executes one statement at a time.
  3. Replace the LOCATION value with the folder location of your dataset: s3://<your bucket name>/B2B/orders/
CREATE DATABASE labs;
CREATE EXTERNAL TABLE IF NOT EXISTS labs.orders (
  `row_id` int,
  `order_id` string,
  `order_date` date,
  `ship_date` date,
  `ship_mode_id` int,
  `customer_id` string,
  `segment` int,
  `product_id` string,
  `sales` double,
  `company_id` int,
  `quantity` int,
  `discount_pct` double,
  `profit_amt` double 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://<your bucket here>/B2B/orders/'
TBLPROPERTIES ('has_encrypted_data'='false')
  4. Hit Run Query

  5. Run the following SQL statement and make sure that your table is reading correctly:

SELECT * 
FROM labs.orders LIMIT 100
  6. The SHOW CREATE TABLE statement helps you better understand what is going on behind the scenes when a table is created.
SHOW CREATE TABLE labs.orders
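Before moving on, you can optionally run a quick aggregate to confirm that the numeric columns were parsed correctly. This is only a sanity-check sketch against the table defined above:

SELECT segment,
       SUM(sales)    AS total_sales,    -- sales is declared as double
       SUM(quantity) AS total_quantity  -- quantity is declared as int
FROM labs.orders
GROUP BY segment
ORDER BY total_sales DESC;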

Alternate definitions, schema on read:

DROP TABLE labs.orders;
CREATE EXTERNAL TABLE IF NOT EXISTS labs.orders (
  `row_id` string,
  `order_id` string,
  `order_date` string,
  `ship_date` string,
  `ship_mode_id` string,
  `customer_id` string,
  `segment_id` string,
  `product_id` string,
  `sale` string,
  `company_id` string,
  `quantity` string,
  `discount_pct` string,
  `profit_amt` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\"",
   "escapeChar"    = "\\"
)  
LOCATION "s3://marioj-bucket-02/B2B/orders/"
TBLPROPERTIES ("skip.header.line.count"="1")


Congratulations, you queried your first S3 file through Amazon Athena!



Query a file on S3 With Redshift Spectrum

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against exabytes of data in Amazon S3. With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond data stored on local disks in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”.

To use the Spectrum service, you need to instantiate a Redshift cluster. Before you begin setting up an Amazon Redshift cluster, please make sure that you complete the following prerequisite steps:

  1. Set Up Prerequisites
  2. Create an IAM role for Spectrum
    • Open the IAM Console.
    • In the navigation pane, choose Roles.
    • Choose Create New Role.
    • Choose AWS Service Role, and then scroll to Amazon Redshift. Choose Select.
    • The Attach Policy page appears. Choose AmazonS3FullAccess and AmazonAthenaFullAccess. Choose Next Step.
    • For Role Name, type a name for your role, for example mySpectrumRole.
    • Review the information, and then choose Create Role.
    • Copy the Role ARN to your clipboard—this value is the Amazon Resource Name (ARN) for the role that you just created. You use that value when you create external tables to reference your data files on Amazon S3.

Now that you have the prerequisites completed, you can launch your Amazon Redshift cluster.

  1. Click on the Amazon Redshift link from the Services dropdown.
  2. On the Amazon Redshift Dashboard, choose Launch Cluster.
  3. On the Cluster Details page, enter the following values and then choose Continue:
    • Cluster Identifier: type examplecluster.
    • Database Name: leave this box blank. Amazon Redshift will create a default database named dev.
    • Database Port: type the port number on which the database will accept connections. You should have determined the port number in the prerequisite step of this tutorial. You cannot change the port after launching the cluster, so make sure that you have an open port number in your firewall so that you can connect from SQL client tools to the database in the cluster.
    • Master User Name: type masteruser. You will use this username and password to connect to your database after the cluster is available.
    • Master User Password and Confirm Password: type a password for the master user account.
  4. On the Node Configuration page, select the following values and then choose Continue:
    • Node Type: dc2.large
    • Cluster Type: Single Node
  5. On the Additional Configuration page, you will see different options depending on your AWS account, which determines the type of platform the cluster uses. To keep things simple, select EC2-VPC to launch your cluster. Use the following values to populate the fields on the screen:
    • Cluster Parameter Group: select the default parameter group.
    • Encrypt Database: None.
    • Choose a VPC: Default VPC (vpc-xxxxxxxx)
    • Cluster Subnet Group: default
    • Publicly Accessible: Yes
    • Choose a Public IP Address: No
    • Enhanced VPC Routing: No
    • Availability Zone: No Preference
    • VPC Security Groups: default (sg-xxxxxxxx)
    • Create CloudWatch Alarm: No
  6. Associate an IAM role with the cluster. For Available Roles, choose mySpectrumRole (defined in step 2) and then choose Continue.
  7. On the Review page, review the selections that you’ve made and then choose Launch Cluster
  8. A confirmation page appears and the cluster will take a few minutes to finish. Choose Close to return to the list of clusters.
  9. On the Clusters page, choose the cluster that you just launched and review the Cluster Status information. Make sure that the Cluster Status is available and the Database Health is healthy before you try to connect to the database.
  10. Before you can connect to the cluster, you need to configure a security group to authorize access.
  11. To configure the VPC security group (EC2-VPC platform):
  • In the Amazon Redshift console, in the navigation pane, choose Clusters.
  • Choose examplecluster to open it, and make sure you are on the Configuration tab.
  • Under Cluster Properties, for VPC Security Groups, choose your security group.
  • After your security group opens in the Amazon EC2 console, choose the Inbound tab.
  • Choose Edit, and enter the following, then choose Save:
    • Type: Custom TCP Rule.
    • Protocol: TCP.
    • Port Range: type the same port number that you used when you launched the cluster. The default port for Amazon Redshift is 5439, but your port might be different.
    • Source: select Custom IP, then type 0.0.0.0/0. Note: using 0.0.0.0/0 is not recommended for anything other than demonstration purposes because it allows access from any computer on the internet. In a real environment, you would create inbound rules based on your own network settings.
  12. Connect to the cluster using the SQL Workbench/J client (installed in the prerequisites section in step 1). To connect from SQL Workbench/J, perform the following steps:
    • Open SQL Workbench/J.
    • Choose File, and then choose Connect window.
    • Choose Create a new connection profile.
    • In the New profile text box, type a name for the profile.
    • Choose Manage Drivers. The Manage Drivers dialog opens.
    • Choose the Create a new entry button. In the Name text box, type a name for the driver. alt text
    • Choose the folder icon next to the Library box, navigate to the location of the driver, select it, and then choose Open.
      alt text If the Please select one driver dialog box displays, select com.amazon.redshift.jdbc4.Driver or com.amazon.redshift.jdbc41.Driver and choose OK. SQL Workbench/J automatically completes the Classname box. Leave the Sample URL box blank, and then choose OK.
    • In the Driver box, choose the driver you just added.
    • In URL, paste the JDBC URL copied from the Amazon Redshift console (as described below).
      • In the Amazon Redshift console, in the navigation pane, choose Clusters.
      • Choose examplecluster to open it, and make sure you are on the Configuration tab.
      • On the Configuration tab, under Cluster Database Properties, copy the JDBC URL of the cluster. alt text
    • In Username, type masteruser.
    • In Password, type the password associated with the master user account.
    • Choose the Autocommit box.
    • Save the profile using the 'Save profile list' icon.

At this point you have a database called dev in the Redshift cluster and are connected to it using the SQL Workbench/J client. To get started with Spectrum, start with the data in the sales folder you created under the B2B folder.

To start querying the sales data in S3 using Spectrum, we need to create an external schema (labs) and an external table (sales) for Spectrum to access. This can be achieved by executing the following SQL statements in the SQL Workbench/J client:

  1. Create the external schema and table:
	create external schema labs 
	from data catalog 
	database 'labs' 
	iam_role 'yourIAMrole/mySpectrumRole'
	create external database if not exists;
	create external table labs.sales(
	salesid integer,
	listid integer,
	sellerid integer,
	buyerid integer,
	eventid integer,
	dateid smallint,
	qtysold smallint,
	pricepaid decimal(8,2),
	commission decimal(8,2),
	saletime timestamp)
	row format delimited
	fields terminated by '\t'
	stored as textfile
	location 's3://<yourbucket>/B2B/sales/'
	table properties ('numRows'='172000');
  2. Run the following SQL statement and make sure that your table is reading correctly:
	SELECT * 
	FROM labs.sales LIMIT 100
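Optionally, you can also sanity-check the row count. The numRows table property declared above is only a planning hint for the optimizer, so the actual count may differ; this is just a quick verification sketch:

	SELECT COUNT(*) AS sales_rows
	FROM labs.sales;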

Congratulations, you queried your first S3 file through Amazon Redshift Spectrum!

You can view the amount of data scanned on S3 (scanned bytes/rows) and other details around query execution by querying the SVL_S3QUERY system view.

	select query, segment, slice, elapsed, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes, files 
	from svl_s3query 
	where query = pg_last_query_id() 
	order by query,segment,slice; 

One of the common use cases for Redshift Spectrum is to keep the larger fact tables in Amazon S3 and your smaller dimension tables in Amazon Redshift, and then perform a join across the datasets in S3 and in the Redshift cluster. The steps listed below depict this use case:

  1. Create an EVENT table (dimension table) in Redshift. Note that this table is internal to Redshift.
	create table event(
	eventid integer not null distkey,
	venueid smallint not null,
	catid smallint not null,
	dateid smallint not null sortkey,
	eventname varchar(200),
	starttime timestamp); 
  2. Load the EVENT table in Redshift using the COPY command, with the data being pulled from the events folder in B2B. Replace the IAM role ARN in the following COPY command with the role ARN you created in step 2.
	copy event from 's3://yourbucket/B2B/events/allevents_pipe.txt' 
	iam_role 'yourarn/mySpectrumRole'
	delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';
  3. The following query joins the external table LABS.SALES with the local table EVENT to find the total sales for the top ten events:
	select top 10 labs.sales.eventid, sum(labs.sales.pricepaid) from labs.sales, event
	where labs.sales.eventid = event.eventid
	and labs.sales.pricepaid > 30
	group by labs.sales.eventid
	order by 2 desc;

Amazon Redshift supports late binding views and using this functionality you have the option to create a view between tables residing in the redshift cluster and external tables in S3. There is no dependency between the view and the objects it references. Because there is no dependency, you can drop or alter a referenced object without affecting the view. Amazon Redshift doesn't check for dependencies until the view is queried. The following queries illustrates the functionality of late binding views -

  1. To create a late-binding view, include the WITH NO SCHEMA BINDING clause. The following query creates a view, with no schema binding, over the event table in the local Redshift cluster and the sales table in S3.
create view sales_vw as
select * from labs.sales, public.event
with no schema binding;
  2. Select the first 10 rows from the view:
select * from sales_vw limit 10;

  3. Rename the eventname column to title in the event table. Observe that you can re-run the query without recreating the view.

alter table event rename column eventname to title;
select * from sales_vw limit 10;
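If you want to confirm which external tables Spectrum can see in the labs schema at this point, the SVV_EXTERNAL_TABLES system view lists them; a quick sketch:

select schemaname, tablename, location
from svv_external_tables
where schemaname = 'labs';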


Integrating Glue with Athena and Spectrum

One of the many benefits of Glue is its ability to discover and profile data from S3 objects. This comes in handy for quickly creating a catalog of new and incoming data. To get started:

  1. In Athena, from the Database pane on the left hand side, click Create Table drop down and select Automatically
    alt text
  2. If this is your first time using Glue, you may be asked to upgrade your catalog and be redirected. Make sure you are in the Crawlers section of Glue: there is a Crawlers link on the left-hand side. Hit Add Crawler.
  3. Name your crawler "Taxi Crawler" and select the IAM role "GlueServiceRole". Click Next.
  4. Select the Specify path in another account radio button and enter s3://serverless-analytics/canonical/NY-Pub/ for the S3 path. Click Next.
  5. Do not add another data source and click Next.
  6. For frequency leave as Run on Demand and click Next.
  7. Select our labs database as a target
  8. In order to avoid table name collisions, Glue generates a unique table name, so we'll need to provide a prefix, say taxi_ (include the underscore)
  9. Click Next
  10. Review that the information is correct, specifically the "Include path" field. Hit Finish when complete.
  11. Check the box next to your newly created crawler and click Run Crawler. It should take about a minute to run and create our table.

Exploring Glue Data Catalog

  1. On the left hand side, click Databases
  2. Find the labs database and click on it
  3. Click Tables in labs to view our newly created table alt text
  4. Click the table name and explore

Querying Taxi Data using Athena

  1. Switch back to the Athena console
  • You may need to replace the database and/or table names with ones shown in the Data Catalog.
  2. Enter SHOW PARTITIONS labs.taxi_ny_pub to verify that all partitions were automatically added
  3. Try the SQL statement below to explore the data.
SELECT *
FROM labs.taxi_ny_pub
WHERE year BETWEEN '2013' AND '2016' AND type='yellow'
ORDER BY pickup_datetime desc
LIMIT 10;


alt text
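The next query goes a step further, using a window function to compare each year's average fare with the previous year's average, broken out by taxi type: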

SELECT 
  type,
  year, 
  count(*) fare_count, 
  avg(fare_amount) avg_fare, 
  lag(avg(fare_amount)) over (partition by type order by year asc) last_year_avg_fare
FROM labs.taxi_ny_pub
WHERE year is not null
GROUP BY year, type
ORDER BY year DESC, type DESC
  • Remember, you have the ability to Save a query for future re-use and reference.

Querying Taxi Data using Spectrum

  1. Switch to the SQL Editor using SQL Workbench Client
  2. As the Glue crawler (Taxi Crawler) has already created the taxi_ny_pub table in the labs database, we can start querying it using the Spectrum service.
  3. Run the following SQL queries using the SQL editor. Each query may take around 10 minutes to complete on a single-node Redshift configuration.
SELECT *
FROM labs.taxi_ny_pub
WHERE year BETWEEN '2013' AND '2016' AND type='yellow'
ORDER BY pickup_datetime desc
LIMIT 10;

You get the following results from the Spectrum query alt text

SELECT 
  	type,
  	year, 
  	count(*) fare_count, 
  	avg(fare_amount) avg_fare, 
  	lag(avg(fare_amount)) over (partition by type order by year asc) last_year_avg_fare
FROM labs.taxi_ny_pub
WHERE year is not null
GROUP BY year, type
ORDER BY year DESC, type DESC

You get the following results from the Spectrum query
alt text

You can view the amount of data scanned on S3 (scanned bytes/rows) and other details around query execution by querying the SVL_S3QUERY system view.

select query, segment, slice, elapsed, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes, files 
from svl_s3query 
where query = pg_last_query_id() 
order by query,segment,slice; 


Breakout Exercises

Breakout 1 - Load B2B Dataset

Now that we have learned about crawlers, let's put them to use to load the rest of the B2B Datasets.

  • Unzip the data and upload it to your S3 bucket. Remember, one folder represents one table.
  • Run a crawler over your bucket to discover the dataset.
  • Add new tables to the labs database with prefix "b2b_"

Make sure to check the fields and how Glue is parsing your data, and correct any mistakes. Once complete, you should be able to run this query:

SELECT
  year(date_parse(Order_Date,'%c/%e/%Y')) Order_Year,
  Company_Name,
  SUM(quantity) Quantity,
  SUM(sales) Total_Sales,
  SUM(sales)/revenue_billion Sales_to_Revenue_Ratio
FROM labs.b2b_orders o
  JOIN labs.b2b_company co on  co.company_id = o.company_id
  JOIN labs.b2b_customer cu on cu.customer_id = o.customer_id
  JOIN labs.b2b_product p on p.product_id = o.product_id
  JOIN labs.b2b_segment s on s.segment_id = o.segment_id
  JOIN labs.b2b_ship_mode sm on sm.ship_mode_id = o.ship_mode_id
  JOIN labs.b2b_company_financials cp on cp.company_id = co.company_id
  JOIN labs.b2b_industry i on i.industry_id = co.industry_id
GROUP BY
  year(date_parse(Order_Date,'%c/%e/%Y')),
  Company_Name,
  revenue_billion
ORDER BY
  SUM(sales)/revenue_billion DESC
LIMIT 100


alt text

Breakout 2 - Discover Instacart Data

In this section, we will break out and follow the same instructions, but this time loading data from another public source, Instacart. Instacart is a company that operates a same-day grocery delivery service: customers select groceries from various retailers through a web application, and the groceries are delivered by a personal shopper. Instacart published this public dataset in May 2017 to provide insight into the shopping patterns of over 200,000 of its customers. You can find the data dictionary for the dataset here.

  • Source s3 bucket: s3://royon-customer-public/instacart/
  • Database: labs
  • Prefix: instacart_
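Once your crawler has finished, you can list the tables it created before exploring them. The exact table names depend on the folder structure the crawler discovers, so this is only a quick sketch:

SHOW TABLES IN labs 'instacart_*';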

Expected output

alt text

Notes on best practices

  • Partition your data
  • Compress your data!
  • With large datasets, split your data into files of roughly 100 MB
  • With large datasets, convert the data to a columnar format such as Parquet or ORC

For more great tips, view this post on the AWS Big Data blog.
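One way to apply the last two tips is Athena's CREATE TABLE AS SELECT (CTAS), which can rewrite an existing table into compressed, partitioned Parquet. This is only a minimal sketch, assuming the labs.orders table as first defined in the Athena section (column names may differ if you kept the alternate definition) and a hypothetical output prefix in your bucket; note that partition columns must come last in the SELECT list:

CREATE TABLE labs.orders_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://<your bucket here>/B2B/orders_parquet/',
  partitioned_by = ARRAY['segment']
) AS
SELECT row_id, order_id, order_date, ship_date, ship_mode_id,
       customer_id, product_id, sales, company_id, quantity,
       discount_pct, profit_amt, segment
FROM labs.orders;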



Visualizing and Dashboarding with QuickSight

Exercise 1

1. Setting up your QuickSight Account

Go to your AWS console and search for QuickSight. You will first be presented with a screen to sign up:
alt text

You can choose either Standard or Enterprise Edition (the main difference today is that Enterprise Edition can connect to Active Directory, and more Enterprise-only functionality will be added in the future). For purposes of our lab today, Standard Edition is fine. With both editions you get one free user, forever.
alt text

Next you will create a name for your account (you can name the account whatever you'd like) and a notification email address (set it to your own email address). You will also see some prompts about enabling autodiscovery of your data in AWS sources, as well as access to Athena, S3 buckets, and S3 Storage analytics. Check all the boxes.
alt text

Note - Make sure you launch QuickSight in the same region you have chosen for Athena.
alt text

Once you are finished your account will load with some sample datasets and dashboards.

alt text

Alright, now we are ready to roll!

Here is some documentation on getting familiar with the UI: Navigating the UI

2. Connecting to the Data

Documentation: Data Preparation, Table Joins

Open QuickSight and choose 'Manage Data' in the upper right hand corner:
alt text

Choose 'New Dataset' and then select Athena.
alt text

alt text

Give it a name and choose 'Create Data Source'. Find the database you created earlier, which contains the B2B tables, and select the b2b_orders table. Make sure you are choosing the orders table that was created automatically by Glue rather than the table we created using the SQL statement (if you happen to pick the wrong one, no problem; you just won't need the step where we create a calculated field to change order_date to a date field). Choose 'Edit/Preview Data'. (If you clicked 'Select' instead, it's OK; just choose 'Edit/Preview Data' on the next screen and leave it on 'Import to SPICE for quicker analytics'.)
alt text

Now we will join all the tables we created in Athena using the Glue data crawler. Some tables join directly to the Orders table and some join to the Company table. To join a table to something other than the first one we selected (Orders), drag and drop it on top of the table you want to join it to. You will then need to define the join clauses - to do this, click on the little Venn diagrams between the tables you see on the screen. The joins will all be based on the key which is named after the dimension table you are trying to join. Set them all to 'Inner' joins and click 'Apply' after you finish each table. alt text alt text

When you are finished it should look something like this (we will skip the Segment and Product tables as the crawler didn't pick up the headers correctly - we can correct this using a Glue ETL job, but for purposes of this lab we can just leave these two tables out of our new dataset):

alt text

Before we start visualizing, let's also add a couple of calculated fields to convert order_date and ship_date to date fields rather than strings (normally we could just change the data type in QuickSight in the data preview window, but Athena does not support this today; it will be supported soon, and you can do this for any other type of data source, but for Athena we need to create calculated fields). On the left side choose 'New Field' and then use the parseDate() function to convert the string field to a date field. Use these formulas for each calculated field:

parseDate({order_date},'MM/dd/yyyy')
parseDate({ship_date},'MM/dd/yyyy')


alt text

Once you are finished preparing the dataset, choose Save & Visualize on the top of your screen.
alt text

3. Creating Our Dashboard

Documentation: Creating Your First Analysis, Modifying Visuals

Great, now we are ready to begin visualizing our data. By default AutoGraph is chosen as the visual type, which will pick an appropriate visual type depending on the types of fields you choose to visualize. We can leave it like that for now, and later we will specify certain visual types.

First click on 'sales' and we will get a KPI visual type. Then click on the Field Wells on the top and use the pull down menu to choose 'Show As->Currency':
alt text

Now click on the 'Order Date' field. Notice how our visual type automatically changes to a line chart. It will default to the Year level, but use the pull down menu on the Order Date field to choose 'Aggregate->Month', or you can do the same thing by clicking on the Order Date label on the x-axis:
alt text

Next click the pull down menu on the segment field in the list of measures and choose 'Convert to dimension'. Then find it in the list of dimensions and select it. Now we will have 3 lines in our line chart, one per segment. Expand the axis range on the bottom of the visual to see the whole trend:
alt text


alt text

Great, we have our first visual! Now let's add another visual using the '+' button in the upper left and selecting 'Add visual':
alt text

For our next visual, let's start by clicking 'industry_name' and 'sales'. We will get a bar chart sorted in descending order by sales:
alt text

Let's add a drill down capability for our end users by dragging the 'company_name' field just below the 'industry_name' field on the Y axis field well. You should see a notification that says 'Add drill-down layer':
alt text

Cool, now our end users will be able to drill down from Industry to the actual Companies in that industry. You can see how this works by clicking on one of the bars and selecting 'Drill down to company_name':
alt text

If you want to drill back up, you can either click the bars again or you can use the icons in the upper right to either drill one level back up or all the way back to the top (if you have more than one drill down built in):
alt text

Next let's change the visual type to a Treemap using the Visual Types selector in the bottom left:
alt text

Now add another visual to the dashboard. This one will be a very granular table of all the order details. First select the 'Pivot Table' visual type. Then click on the company_name dimension. Expand the Field Wells on top and drag the order_id to the Rows underneath the company_name:
alt text

Also click on the product_id, ship_mode, sales, profit, and quantity fields to add more detail to our visual. It should look something like this:
alt text

Great, our dashboard is starting to shape up. We can now add some KPI visuals across the top to provide some high-level summary information for our users. Add another visual and select the 'sales' field. Expand the Field Wells and drag the Order Date to the 'Trend group' field well. Let's also resize the visual by dragging the bottom right corner of the visual to make it smaller. Drag it to the top of the dashboard by grabbing the dotted area on the top of the visual. Once you have it on the top it should look like this:
alt text

Let's repeat this last step to add two more KPI's to the top of the dashboard. After you add another visual, select the KPI visual type in the lower left of the screen:
alt text

The second one will be a KPI for the number of unique orders YoY. To do this, select the KPI visual type and drag 'order_id' to the 'Value' field well and 'Order Date' to the 'Trend group' field well. Change the aggregation on 'order_id' from Count to Count Distinct:
alt text

For the third KPI, let's show a YoY trend of the average order size. Click 'sales' and then use the pull down menu on the field to change the aggregation to Average. Add the 'Order Date' to the 'Trend group' field well like we did for the first KPI:
alt text

You can optionally play around with the KPI formatting options. You can change the primary number that is displayed and the comparison type. You can also choose if you would like to show the trend arrows as well as the progress bar (which is displayed as a bullet chart on the bottom of the KPI).
alt text

Lastly let's edit the titles of the KPI's to be more user friendly. I chose 'Sales YoY', 'Avg Order Size YoY', and '# of Orders YoY' for my titles:
alt text

Awesome! Our dashboard is looking really good. We are almost ready to share it with the rest of our end users. Just before we do that, let's add a filter (or many) for our users to leverage. On the left, choose 'Filter' and then either click 'Create one' or the little filter icon on the top and choose 'Order Date'. I like to use the 'Relative dates' type of UI for my date filters. Set it to the 'Last 5 years' and hit 'Apply'. Lastly click on the top where it says 'Only this visual' and change it to 'All visuals' so that it applies to the entire dashboard:
alt text


alt text

4. Sharing

Documentation: Creating and Sharing Your First Dashboard

We are ready to share our dashboard with the rest of our users now! Click the 'Share' button in the upper right of the screen and select 'Create Dashboard'. Give it a name like 'Sales Dashboard' and choose 'Create Dashboard'.

alt text

On the next screen you will be able to share it with other users in your QuickSight account.

alt text

Once you add them you can click 'Share' and it will send them an email saying a dashboard has been shared with them. Also the next time they log into QuickSight they will see it in the list of dashboards they have access to.

Great job! You have just created your first dashboard to be shared with the rest of your team!


alt text

Exercise 2 - Visualizing NY Taxi Data

One of the most compelling reasons for using Athena to query data on S3 is that you can query some really really BIG datasets. In our next exercise we will use QuickSight and Athena to visualize 2.7 Billion records. That's right, billion.

1. Connect to the Dataset

Open QuickSight and choose 'Manage Data' in the upper right hand corner.

Choose 'New Dataset' and then select Athena.

Give it a name and choose 'Create Data Source'. Find the database you created earlier which contains the NY taxi data and select the appropriate table. Choose 'Edit/Preview Data'.

Before we start visualizing, let's add a calculated field to convert the date field. The date field in this dataset is in Epoch date format. Therefore we will use a function to convert it to a more usable format. On the left side choose 'New Field' and then use the epochDate() function to convert pickup_datetime field to a date field. It is measured down to the millisecond, so we will also divide the integer by 1000 to get it into seconds before converting. Use this formula:

epochDate({pickup_datetime}/1000)

alt text

Make sure we keep it set to 'Query' rather than SPICE, which is different from what we did in the first exercise (actually when doing table joins QuickSight forces you to use SPICE, but when connecting to individual tables we get this choice). Since we are going to be working with nearly 3 billion records, we will want to query the data directly in S3 using Athena.
alt text

2. Creating Our Dashboard

Great, now we are ready to begin visualizing our data. By default AutoGraph is chosen as the visual type, which will pick an appropriate visual type depending on the types of fields you choose to visualize. We can leave it like that for now, and later we will specify certain visual types.

Select 'passenger_count' and then use the pull down menu to change the aggregation to Count. Then use the pull down menu again and choose 'Format->1234.57' to round to two decimal places. The KPI will show that we have 2.67 billion records in the dataset. Pretty impressive performance on a dataset of that size!
alt text

alt text

Let's add another visual. This time select 'Pickup Date' (the calculated field you created). You should get a line chart. Use the pull down menu and change the aggregation to Week. Then expand the axis range on the bottom of the visual.
alt text

Select the 'type' field and you should get three lines, one for each type of taxi:
alt text

Let's add another visual. This one will also be a time trend but we will look at the data YoY. First change the visual type to a Line Chart. Then drag the 'month' field to the X axis field well and the 'year' field to the Color field well.

Notice the months on the bottom are out of order. Since the field is a string data type the months are sorted in alphabetical order. To fix this we must edit the dataset and change the data types for these columns. Use the dropdown menu for the name of your dataset and choose 'Edit analysis data sets' and then click 'Edit' on the next screen:
alt text


alt text

Click on the 'a' icon underneath both of these fields in the data preview window and change them both to 'Int':
alt text

Choose 'Save & visualize'. Now the months on our line chart should be sorted in the correct order:
alt text

One of the first things you will notice is that there is a huge drop in Feb on the 2010 line. A quick Google search for 'nyc feb 2010' will reveal that there was a huge blizzard in Feb 2010! That explains why there were fewer rides that month.

Feel free to continue exploring this data. There aren't a ton more dimensions to play with - the dataset was meant to highlight the scale of how many records Athena + S3 can handle rather than analytical depth - but go wild with it!

3. (Optional) Create A Story

How to create a Story

In addition to creating and sharing dashboards, you can also create and share 'stories'. They are great if you have found something interesting in the data and you would like to lead your users to that particular finding. For instance, using our last finding you could capture a 'scene' of the YoY trend visual, then filter to 2010 and capture another 'scene' to highlight the drop in Feb 2010 due to the blizzard.

Conclusion

Congratulations on creating your first Glue Crawlers, Athena Databases & Tables, and QuickSight Analyses and Dashboards! You are now well versed in Serverless Analytics!

For more tips and information about what's new in QuickSight, check out the blog as well as the other resources on the website!

The End

"# QuicksightAthenaSpectrum"