🏗 Build It Days: 🔢 Data & Analytics

A list of common areas people want to focus on when improving their current data, analytics, event pipelines, and related solutions.

👯 Wanting to implement some user analytics (either web or mobile)?

🎯 Wanting to centralize your logs?

⛵ Wanting to build your first data lake?

☁️ Getting Data into S3

Getting data into S3 in a structured format is, in my experience, the bulk of the effort in getting an MVP data lake implemented. The options are:

🚒 Kinesis Firehose

  • If you can get your data into Kinesis Firehose, this is the easiest option, as it takes care of batching and saving to S3 for you. To get something working quickly, I typically use the Kinesis Data Generator (https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html) to stream in something "production-like". That lets me focus on building out the query and visualisation stages, then come back to connect/stream actual production data as the final step.
  • I've had success in the past with providing a custom prefix for the delivery stream, something like year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/, so that data lands in a layout that's ready for partitioning later, without an extra step. (Use lowercase yyyy: the timestamp patterns follow Java's DateTimeFormatter, where uppercase YYYY is the week-based year and can give the wrong year around New Year.) Whether this is feasible or desirable is highly dependent on your specific data and query requirements, happy to talk about that. More details on prefix options are at https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
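To sanity-check a prefix layout before pointing a delivery stream at it, it can help to render the same layout locally. The sketch below is plain Python, not Firehose itself: the function name `partition_prefix` and the sample object name are hypothetical, and it assumes partitions are evaluated in UTC.

```python
from datetime import datetime, timezone


def partition_prefix(ts: datetime) -> str:
    """Render a Hive-style year/month/day/hour prefix for a timestamp.

    Mirrors the layout of the custom prefix described above. This is a
    local approximation for testing, not a call into Firehose.
    """
    ts = ts.astimezone(timezone.utc)  # assumption: partitions are in UTC
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H/")


if __name__ == "__main__":
    arrival = datetime(2024, 3, 7, 9, 30, tzinfo=timezone.utc)
    # Hypothetical object name, just to show where a record batch would land.
    print(partition_prefix(arrival) + "sample-object.json")
```

Keys in this key=value shape are what let the query layer treat year, month, day and hour as partition columns later on.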

💾 Just exporting and uploading

  • Low-tech and good for the sake of proving value quickly. Just export as much as you can from the existing systems and upload CSV/JSON/whatever. You'll want to automate that at some point though.
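When you do get around to automating the export, a small script is usually enough to start with. Here's a minimal sketch, assuming a CSV export and boto3 for the upload; the function names, bucket and key are placeholders rather than anything from a real system. It converts the CSV to JSON Lines (one JSON object per line), which tends to be easy to query later.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text (with a header row) to JSON Lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One JSON object per line; all values stay as strings at this stage.
    return "".join(json.dumps(row) + "\n" for row in reader)


def upload_export(local_path: str, bucket: str, key: str) -> None:
    """Upload a local export file to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the conversion step runs without AWS access

    boto3.client("s3").upload_file(local_path, bucket, key)


if __name__ == "__main__":
    sample = "id,name\n1,Ada\n2,Grace\n"
    print(csv_to_json_lines(sample), end="")
    # upload_export("export.jsonl", "example-bucket", "staging/export.jsonl")
```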

🎁 "Pre-packaged" solution

  • Here's an example that was the foundation used by the Proserv team in Australia for a data lake for a major project. It's quick to get set up and provides good practice around having a staging area for new data, plus an example of how to process that data and move it into the primary data lake: https://github.com/aws-samples/accelerated-data-lake

📚 Glue

AWS Glue is both a catalog of the data you and your team have available and an orchestration layer for pulling in and updating data in your data lake.
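One common way to populate the catalog is to point a Glue crawler at an S3 prefix. The sketch below assumes that approach and uses boto3's `create_crawler` and `start_crawler` calls; the helper names, crawler name, role ARN, database and S3 path are all placeholders you'd swap for your own.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler runs as
        "DatabaseName": database,    # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: dict) -> None:
    """Create the crawler and kick off a first run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built and inspected offline

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    request = crawler_request(
        name="example-lake-crawler",                          # placeholder
        role_arn="arn:aws:iam::123456789012:role/example",    # placeholder
        database="example_lake",                              # placeholder
        s3_path="s3://example-bucket/staging/",               # placeholder
    )
    print(request)
    # create_and_start_crawler(request)
```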

🤓 Wanting to analyze and run queries on your data lake?

  • Jump straight into Athena https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
  • JSON is often the easiest format to get started with, but it can also be the most challenging to query, given its flexible structure. Here's some guidance on defining a schema for it: https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/
  • Lots of customers think that this approach only works for internal and BI use cases, and not for production apps that you want your customers to interface with. First, make sure you follow the recommendations below to actually improve performance. Second, I've worked with a customer that's got a fantastic solution where the UX improves perceived performance. They fire the request immediately, transition the query screen out, then transition the result screen in. It all feels natural and fluid, and behind the scenes the results are usually back before the user thinks the request has actually been initiated. It's a delightful experience that gives them flexible querying of multi-terabyte datasets without needing to run a multi-terabyte dedicated data warehouse cluster.
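The "fire immediately, collect later" pattern from that last point can be sketched against Athena's API roughly like this. It's a minimal sketch, not that customer's implementation: `start_query` and `wait_for_query` are hypothetical helper names, `athena` is assumed to be a boto3 Athena client (or anything with the same two methods), and the database and output location are placeholders.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def start_query(athena, sql: str, database: str, output_location: str) -> str:
    """Fire the query and return straight away with its execution id."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def wait_for_query(athena, query_id: str, poll_seconds: float = 0.5,
                   max_polls: int = 120) -> str:
    """Poll until the query reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")


# Usage, assuming boto3 and AWS credentials are available:
#   import boto3
#   athena = boto3.client("athena")
#   query_id = start_query(athena, "SELECT 1", "example_lake",
#                          "s3://example-bucket/athena-results/")
#   ... play the screen transition here ...
#   state = wait_for_query(athena, query_id)
```

Splitting start and wait is the point: the UI can call `start_query` the moment the user submits, run its transition, and only then block on `wait_for_query`.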

📈 Wanting to visualize your data?

🏎 Wanting to optimise cost efficiency and/or performance?

🤖 Time to start your machine learning journey?