Some years ago, I read an interesting article about an AI model that could write new poems, which was possible because the model had been trained on a large dataset of poems. A strange idea then popped into my head: "What kind of useless or funny text could you create with such a model?" The answer was fanfiction, and I realized it would be a fun project to try.
I therefore started to think through some important questions, for example:
- What cool thing should I do with the data: generate fanfiction or build a classification model (separating dirty fanfiction from normal)?
- Where do I find the data?
- How should I retrieve the data?
- How should I store the data?
- How should I analyze the data?
The five questions above can be answered in many ways, so I decided to try two approaches in order to learn more. In the first approach, I use Airflow to get the data from a website, store it in a local database, and analyze the results in a dashboard built with Google Data Studio.
The first approach, which focuses on Airflow, can be viewed below:
The second approach is to use AWS and do everything in the cloud! ☁️ This was a fun project because I learned a lot about Lambda, Glue and Redshift. Below is the second approach, with its AWS focus:
I had earlier experience with Airflow, so it was quick and easy to set up. There are two main things I use Airflow for:
1) Web scraping the website https://archiveofourown.org/ and loading the data into a local Postgres database.
2) Calculating several KPIs in parallel and writing the results to a different table in the same Postgres database. The KPIs can then easily be analyzed using any BI tool; in this project I use Google Data Studio.
The code for the Airflow DAG and the functions to web scrape the fanfiction website is in this GitHub repo.
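To give a feel for the scrape-and-load step, here is a minimal sketch (not the repo code). It assumes requests and BeautifulSoup for the scraping and the Airflow Postgres provider for the load; the URL details, CSS selectors, table, column and connection names are all placeholders.

```python
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def scrape_and_load():
    # Fetch one listing page of works (illustrative URL, the real scraper does more)
    response = requests.get("https://archiveofourown.org/works", timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for work in soup.select("li.work"):  # hypothetical CSS selector
        title_tag = work.select_one("h4.heading a")
        if title_tag:
            rows.append((title_tag.text.strip(), datetime.utcnow()))

    # Load the scraped rows into the raw table in the local Postgres database
    hook = PostgresHook(postgres_conn_id="fanfic_postgres")  # placeholder connection id
    hook.insert_rows(table="fanfic_raw", rows=rows, target_fields=["title", "scraped_at"])


with DAG(
    dag_id="fanfic_scraper",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="scrape_and_load", python_callable=scrape_and_load)
```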
The raw data (with a little bit of cleanup at download time) is stored in a PostgreSQL database and looks like this:
Since the data can grow very large over time and is not summarized in any way, basing the dashboard solely on it would put a lot of pressure on the database. I therefore use Airflow again to execute SQL files that calculate the KPIs and store them in another table.
The whole approach in Airflow looks like this:
It is very simple to add more KPIs to the table and to Airflow: just add SQL files that follow the same structure and columns!
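As a rough sketch of how that could be wired up (folder name and connection id are made up), the DAG can loop over the SQL files and create one task per KPI, so they run in parallel and a new KPI is just a new file:

```python
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

SQL_DIR = Path(__file__).parent / "sql" / "kpis"  # hypothetical folder, one file per KPI

with DAG(
    dag_id="fanfic_kpis",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per SQL file; the tasks have no dependencies on each other,
    # so Airflow runs them in parallel.
    for sql_file in sorted(SQL_DIR.glob("*.sql")):
        PostgresOperator(
            task_id=f"kpi_{sql_file.stem}",
            postgres_conn_id="fanfic_postgres",  # placeholder connection id
            sql=sql_file.read_text(),
        )
```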
This is an example of the KPI table:
AWS plays a large role when dealing with big data, so I wanted to get hands-on experience with some of the most-used tools in the AWS toolbox, and this project was perfect for that.
The first thing I had to change was how to web scrape the data automatically at a given time. This was easily done with AWS Lambda and boto3 in Python. I could reuse most of the functions I had written for Airflow, which made the switch to Lambda easy. The function creates a csv file with that day's downloaded data and puts it in an S3 bucket. It was quite fun to see the bucket automatically grow larger day after day.
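A minimal sketch of that scraping Lambda, assuming it runs on a schedule (for example via an EventBridge rule) and reuses the scraping helpers from the Airflow project; the bucket name, helper name and columns are placeholders:

```python
import csv
import io
from datetime import date

import boto3

s3 = boto3.client("s3")
BUCKET = "fanfiction-raw-data"  # placeholder bucket name


def scrape_daily_works():
    """Placeholder for the scraping logic reused from the Airflow project."""
    return []


def handler(event, context):
    rows = scrape_daily_works()

    # Write the day's rows to an in-memory csv and upload it to the bucket
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "scraped_at"])  # illustrative columns
    writer.writerows(rows)

    key = f"raw/{date.today().isoformat()}.csv"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue().encode("utf-8"))
    return {"uploaded": key, "rows": len(rows)}
```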
Then I used another Lambda function that is triggered when a new csv file lands in the fanfiction bucket; this function takes the data from the csv and uploads it to Redshift.
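Something along these lines, assuming the Redshift Data API is used to run a COPY command; the cluster, database, table and IAM role names are placeholders:

```python
import boto3

redshift = boto3.client("redshift-data")


def handler(event, context):
    # The S3 trigger passes the bucket and key of the new csv file in the event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # COPY the csv straight from S3 into the raw table in Redshift
    copy_sql = (
        f"COPY fanfic_raw FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "CSV IGNOREHEADER 1;"
    )
    redshift.execute_statement(
        ClusterIdentifier="fanfic-cluster",  # placeholder cluster name
        Database="fanfic",
        DbUser="loader",
        Sql=copy_sql,
    )
```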
The csv only contains the raw data, so I use Glue with PySpark to calculate the KPIs and store them in another table in the Redshift database. An example of the PySpark code can be found here: Link to pyspark code
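The linked code is the real job; purely as an illustration of the pattern, a Glue PySpark job for one KPI could read the raw table from Redshift, aggregate it, and write the result back to a KPI table (the connection details, table names and KPI are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["TempDir"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw fanfiction table from Redshift (placeholder connection details)
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://fanfic-cluster.example:5439/fanfic",
        "dbtable": "fanfic_raw",
        "user": "glue_user",
        "password": "********",
        "redshiftTmpDir": args["TempDir"],
    },
).toDF()

# Example KPI: number of works scraped per day
kpi = raw.groupBy(F.to_date("scraped_at").alias("day")).agg(F.count("*").alias("works"))

# Write the KPI back to a separate table in Redshift
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(kpi, glue_context, "kpi"),
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://fanfic-cluster.example:5439/fanfic",
        "dbtable": "fanfic_kpis",
        "user": "glue_user",
        "password": "********",
        "redshiftTmpDir": args["TempDir"],
    },
)
```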
And that is all! This project was very fun and I learned a lot.