Data Engineering Project at Insight

#Artmosphere

Note: The original website was taken down at the end of the Insight program. However, a video demo of the website is available here. Slides are available here.

The Flask code for the web framework can be found here. Code for the front-end web application can be found in this folder.

##Introduction

This is a data engineering project from the Insight Data Engineering Fellows Program. The project provides a platform where users can search for artworks, see similar art pieces, and see the real-time popularity of a given art piece. Users can also see where artworks have been uploaded across the world. The main goal of the project is to learn the different tools used in a data pipeline for processing large datasets in a distributed manner.

Tools used: Kafka, HDFS, Spark, Spark Streaming, Elasticsearch, Cassandra, Flask, Bootstrap, Highcharts

##Settings

Dataset: The dataset is a collection of 26,000 artworks and 45,000 artists collected from Artsy.net in JSON format. To simulate real-time user activity, the project also uses self-generated data in two formats:

  • Collecting log: timestamp, user_id, collected, artwork_id
  • Uploading log: timestamp, user_id, uploaded, artwork_id, location_code
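As an illustration of the two log formats above, here is a minimal Python sketch that generates and parses simulated records. It is not the project's actual generator: the comma-separated layout, the Unix-epoch timestamps, the `user_N` / `artwork_N` id shapes, and the `LOCATION_CODES` values are all assumptions made for this example.

```python
import random
import time

# Hypothetical location codes; the real values are not listed in this README.
LOCATION_CODES = ["US", "FR", "JP", "BR"]


def make_collect_record(ts, user_id, artwork_id):
    """Collecting log: timestamp, user_id, collected, artwork_id."""
    return f"{ts}, {user_id}, collected, {artwork_id}"


def make_upload_record(ts, user_id, artwork_id, location_code):
    """Uploading log: timestamp, user_id, uploaded, artwork_id, location_code."""
    return f"{ts}, {user_id}, uploaded, {artwork_id}, {location_code}"


def parse_record(line):
    """Split one log line into a dict keyed by the field names above."""
    fields = [f.strip() for f in line.split(",")]
    record = {
        "timestamp": int(fields[0]),  # assumed to be Unix epoch seconds
        "user_id": fields[1],
        "action": fields[2],
        "artwork_id": fields[3],
    }
    # Only uploading records carry a fifth field.
    if record["action"] == "uploaded":
        record["location_code"] = fields[4]
    return record


def simulate(n, seed=0):
    """Return n simulated log lines, one per second, mixing both formats."""
    rng = random.Random(seed)
    now = int(time.time())
    lines = []
    for i in range(n):
        ts = now + i
        user = f"user_{rng.randint(1, 100)}"
        art = f"artwork_{rng.randint(1, 26000)}"
        if rng.random() < 0.5:
            lines.append(make_collect_record(ts, user, art))
        else:
            code = rng.choice(LOCATION_CODES)
            lines.append(make_upload_record(ts, user, art, code))
    return lines
```

In a pipeline like the one described here, lines of this kind would be the messages sent through Kafka; the sketch stops at producing and parsing the strings.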

AWS Clusters: A distributed AWS cluster of 4 EC2 machines is used for this project. All the components (ingestion, batch and real-time processing) are configured and run in distributed mode, with 1 master node and 3 worker nodes. The master node has 8GB of memory and 50GB of storage. The worker nodes each have 8GB of memory and 1TB of storage.

##Data Processing

  • Data Ingestion (Kafka): The datasets for batch and real-time processing are ingested using Kafka. For batch processing, the datasets are stored into HDFS. For real-time processing, the data is streamed into Spark Streaming.

  • Batch Processing (HDFS, Spark): To perform batch processing jobs, Spark loads the data from HDFS and processes it in a distributed way. The two major batch processing steps for the project are aggregating artists' upload locations and computing artwork-artwork similarities.

    The following graph shows the performance analysis of Spark for one of the batch processing jobs (aggregating artists' upload locations) on up to 500GB of data:

    [Graph: Spark performance for the upload-location aggregation job, up to 500GB]
  • Serving Layer (Elasticsearch, Cassandra): The platform provides a search function that looks up a given keyword in artwork titles. To support this, the metadata of all artworks is stored in Elasticsearch. All artworks and artists are stored in Cassandra tables and can be retrieved by id. The aggregated artist locations are also stored in a Cassandra table, which can be queried by location_code and timestamp.

  • Stream Processing (Spark Streaming): Spark Streaming processes the data in micro-batches. The job counts how many people collected a given piece of art every 5 seconds and saves the result into a Cassandra table. The information can be queried by artwork_id and timestamp.

    • Stream processing code: spark_streaming
      • To execute: run bash log_streaming_run.sh
  • Front-end (Flask, Bootstrap, Highcharts): The front end uses Flask as the framework, and the website uses JavaScript and the Twitter Bootstrap libraries. All the plots are rendered with Highcharts.
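The 5-second aggregation described under Stream Processing above can be illustrated in plain Python. This is only a sketch of the counting logic, not the project's Spark Streaming job: the function names, the `(timestamp, artwork_id)` event shape, and the use of Unix-epoch seconds are assumptions made for the example.

```python
from collections import Counter

WINDOW_SECONDS = 5  # window length stated in the text


def window_start(ts, width=WINDOW_SECONDS):
    """Round a timestamp (epoch seconds) down to the start of its window."""
    return ts - (ts % width)


def count_collections(events, width=WINDOW_SECONDS):
    """Count 'collected' events per (artwork_id, window start).

    events: iterable of (timestamp, artwork_id) tuples.
    The returned keys mirror how the text says results are queried:
    by artwork_id and timestamp.
    """
    counts = Counter()
    for ts, artwork_id in events:
        counts[(artwork_id, window_start(ts, width))] += 1
    return counts
```

For example, with hypothetical events:

```python
events = [(100, "a1"), (101, "a1"), (104, "a2"), (105, "a1"), (109, "a1")]
counts = count_collections(events)
# counts[("a1", 100)] is 2, counts[("a2", 100)] is 1, counts[("a1", 105)] is 2
```

Keying each count by artwork and window start makes every micro-batch result a single row, which fits a table that is looked up by artwork_id and timestamp.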

##Website

Note: The website was taken down at the end of the Insight program. However, a video demo of the website is available here.

The website shows:

  • The artwork information
  • Similar artworks
  • Plots showing in real time how many people have collected a given piece of art within a 5-second frame
  • A map showing where all the artworks have been uploaded across the world

##Presentation Deck

The presentation slides are available here.

The video demo of the website is available here.

##Packages Used for the Pipeline

pyspark, pyspark-cassandra, elasticsearch-hadoop-2.1.0.Beta2.jar