Duolingo Data ETL Using Alteryx Designer

This is a repository for my portion of the final class project for the course CIS9440 Data Warehousing and Analytics (Spring 2022 semester) at the Zicklin School of Business. As part of Group 5, I was solely responsible for the ETL portion of the class project, which involved using Alteryx Designer to clean and transform the data so that it could be loaded into a Microsoft Azure SQL Database.

Introduction

The dataset processed by the Alteryx workflow was originally gathered by Duolingo to assist with an experiment on Half-Life Regression. The full dataset consists of 12,854,226 learning traces taken over two weeks in 2013 from a group of anonymized users (the copy in this repository contains only the first 200,000 learning traces). The research team subsequently released the dataset to the public to facilitate future research into language learning. The learning traces were released on Harvard Dataverse, while the metadata, lexeme tag references and experiment code were documented on GitHub.

Methodology

For the project, students were free to choose whatever tools or databases they liked. I used Alteryx Designer to transform the dataset so that the data could be readily loaded into a Microsoft Azure SQL Database as six dimension tables and two fact tables. I chose Alteryx Designer for the ETL process because it handles large datasets such as this one with little effort, and Microsoft Azure SQL Database for the warehouse because it connects easily to both Alteryx Designer and Tableau.

Below are the initial steps of the workflow:

Picture1

After bringing the learning_traces.13m.csv file into the workflow, I focused on the lexeme_string attribute. This attribute holds a string of tags describing the word's surface form (i.e., lexeme), lemma, part of speech and morphological components. In the workflow, the lexeme, lemma and part-of-speech tags were separated into their own attributes, and the number of morphological components was counted and stored in a new attribute.

Below is a sample of the result of this portion of the workflow:

Picture2
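
If it helps to see the logic outside Alteryx, below is a minimal pandas sketch of this parsing step. It assumes the surface-form/lemma<pos><morph-tag>… layout of lexeme_string; the new attribute names (lexeme, lemma, part_of_speech, morph_component_count) are illustrative, not necessarily the ones used in the workflow.

```python
import pandas as pd

# Minimal sketch of the parsing step, assuming the tag layout
# "surface/lemma<pos><morph1><morph2>...". Attribute names are illustrative.
df = pd.read_csv("learning_traces.13m.csv")

parts = df["lexeme_string"].str.extract(
    r"^(?P<lexeme>[^/]+)/(?P<lemma>[^<]+)<(?P<part_of_speech>[^>]+)>(?P<morph>.*)$"
)
df = df.join(parts[["lexeme", "lemma", "part_of_speech"]])

# Each remaining "<...>" group counts as one morphological component.
df["morph_component_count"] = parts["morph"].str.count("<")
```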

Continuing in the workflow, the lexeme_reference.txt file was brought into a Find Replace Tool:

Picture3

The Find Replace Tool replaced the abbreviated values in the part-of-speech attribute with the more readable descriptions from the reference file, and a new attribute (called lexeme_length) was created to hold the number of characters in the surface form. Below is a sample of the result of this portion of the workflow:

Picture4
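
A hypothetical pandas equivalent of this step, continuing from the earlier sketch (the tab-separated layout assumed for lexeme_reference.txt is a guess):

```python
# Continue from the earlier sketch: df already has part_of_speech and lexeme.
# The tab-separated "tag<TAB>description" layout of lexeme_reference.txt is an
# assumption made for this example.
pos_lookup = {}
with open("lexeme_reference.txt", encoding="utf-8") as f:
    for line in f:
        if "\t" in line:
            tag, description = line.rstrip("\n").split("\t", 1)
            pos_lookup[tag] = description

# Swap abbreviated part-of-speech tags for readable descriptions and record
# the length of the surface form.
df["part_of_speech"] = df["part_of_speech"].replace(pos_lookup)
df["lexeme_length"] = df["lexeme"].str.len()
```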

All language abbreviations in the two language attributes were also replaced with the full names of the languages. Below is a sample of the result of this portion of the workflow:

Picture5
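
The same replacement sketched in pandas, assuming the two language attributes are named learning_language and ui_language and hold two-letter codes (the mapping below is only a partial example):

```python
# Sketch of the language-abbreviation replacement; the column names and the
# partial code-to-name mapping are assumptions for illustration.
language_names = {
    "en": "English", "es": "Spanish", "fr": "French",
    "de": "German", "it": "Italian", "pt": "Portuguese",
}
for column in ["learning_language", "ui_language"]:
    df[column] = df[column].replace(language_names)
```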

At this point in the workflow, most of the transformations of the lexeme_string and language attributes were complete.

I noticed that the user IDs in the dataset were case sensitive (e.g., “u:FO” and “U:fo” would be considered different users), which could cause problems later in the class project if the IDs were ever compared case-insensitively. To avoid those issues, a numeric user_id attribute (numbering users from 1 onwards) was created to give each user a unique ID.

Picture6

Below is a sample of the result of this portion of the workflow:

Picture7
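
In pandas terms (continuing from the earlier sketches), the idea is simply to number each distinct case-sensitive string identifier; the raw column name and the user_string name below are assumptions:

```python
import pandas as pd

# Assign a numeric surrogate key to each distinct (case-sensitive) user string.
# pd.factorize compares exact strings, so "u:FO" and "U:fo" stay separate;
# numbering starts at 1 in order of first appearance.
codes, uniques = pd.factorize(df["user_id"])        # raw string IDs such as "u:FO"
df = df.rename(columns={"user_id": "user_string"})  # keep the original string
df["user_id"] = codes + 1                           # new numeric user_id
```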

Finally, the workflow was ready to create the dimension and fact tables that would be loaded into the Microsoft Azure SQL Database. Block Until Done Tools were used to control the order in which the tables were created. Below is the part of the workflow that created the pos_dim, lexeme_dim and lemma_dim tables:

Picture8

As seen above, for any dimension that did not already have a unique ID attribute in the dataset, an ID attribute was created to serve as the Primary Key of its dimension table. The “Post Create SQL Statement” option in each Output Data Tool was used to designate the Primary Key for each dimension, and each Output Data Tool then loaded its table into the database. A sample of the SQL code is below:

Picture9
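
For illustration only (not the project's actual SQL), the kind of statement pasted into the "Post Create SQL Statement" option for one dimension table might look like this; the pos_id column name is assumed:

```python
# Illustrative "Post Create SQL Statement" for pos_dim, held here as a Python
# string. The pos_id column name is an assumption; the statement used in the
# project may differ.
post_create_sql = """
ALTER TABLE pos_dim ALTER COLUMN pos_id INT NOT NULL;
ALTER TABLE pos_dim ADD CONSTRAINT pk_pos_dim PRIMARY KEY (pos_id);
"""
```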

Continuing in the workflow, below is the part that created the language_dim, user_dim and timestamp_dim tables:

Picture10

As seen above, the language and timestamp dimensions each required an extra step before the same steps used for the other dimension tables were applied (both extra steps are sketched after the list below):

  • Language Dimension: The Union Tool was used so that languages from both language attributes were included in the same dimension.
  • Timestamp Dimension: The Formula Tool was used to convert the timestamps from a UNIX timestamp to a DateTime data type.
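
A brief pandas sketch of those two extra steps, continuing from the earlier examples (the timestamp column name and the assumption that it holds UNIX seconds are guesses based on the raw file):

```python
import pandas as pd

# (1) Union the two language attributes into one de-duplicated list for
#     language_dim.
language_dim = (
    pd.concat([df["learning_language"], df["ui_language"]])
      .drop_duplicates()
      .reset_index(drop=True)
      .to_frame(name="language_name")
)

# (2) Convert the UNIX timestamp (seconds) into a DateTime for timestamp_dim.
df["timestamp_datetime"] = pd.to_datetime(df["timestamp"], unit="s")
```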

Lastly, below is the part of the workflow that created the two fact tables for the class project:

Picture11

The “Post Create SQL Statement” option in each Output Data Tool was used to designate the Foreign Keys for each fact table. A sample of the SQL code is below:

Picture12
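
Again for illustration only (not the project's actual SQL), foreign keys on a fact table could be designated with statements like these; the fact-table name and key columns are made up for the example:

```python
# Illustrative foreign-key statements for a fact table, held as a Python string
# the way they would be pasted into the "Post Create SQL Statement" option.
# Table and column names here are hypothetical.
post_create_sql = """
ALTER TABLE learning_fact ADD CONSTRAINT fk_learning_fact_user
    FOREIGN KEY (user_id) REFERENCES user_dim (user_id);
ALTER TABLE learning_fact ADD CONSTRAINT fk_learning_fact_language
    FOREIGN KEY (learning_language_id) REFERENCES language_dim (language_id);
"""
```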

All the steps outlined above form a repeatable Alteryx workflow that can be executed end to end in a single run. On a personal laptop, the extraction and transformation steps take about 3 minutes on the full dataset of around 12.8 million rows. The time needed to load the tables into the database depends on the internet connection and the Microsoft Azure SQL Database service tier.

Files Included in Repository

  • Data Warehousing Group Project - Tables.yxmd: The Alteryx workflow created for the class project.
  • learning_traces.13m.csv: An input file containing a sample of the Duolingo data. The class project originally used a dataset with approximately 12.8 million rows; due to that file's size, the copy in this repository contains only the first 200,000 rows. The Alteryx workflow works with both the sample file in the repository and the original file from Harvard Dataverse.
  • lexeme_reference.txt: An input file that has the definitions for the abbreviations used in the learning_traces.13m.csv file.
  • CLEAN SAMPLE learning_traces.csv: An output file from the Alteryx workflow that shows the transformed sample data in a single csv file (excluding transformations done to create the dimension and fact tables for the database).
  • Presentation - Selected Slides.pptx: Selected slides from the group presentation that introduced the data and outlined the ETL process. All slides were created by me.