A data pipeline to extract Vietnam real estate data daily from Chợ tốt nhà.
- Extract data from Chợ tốt nhà (Extract)
- Load into AWS S3 (Data lake)
- Transform and load into AWS Redshift (Data warehouse)
- Transform using dbt (Transform)
- Create Google Data Studio Dashboard (Apply)
- Orchestrate pipeline with Airflow in Docker
- Create AWS resources with Terraform (Infrastructure)
```
├── airflow
│   ├── Dockerfile
│   ├── dags
│   │   └── daily_elt_pipeline.py
│   ├── docker-compose.yaml
│   ├── helpers
│   │   ├── extract_chotot_data.py
│   │   ├── upload_to_redshift.py
│   │   └── upload_to_s3.py
│   └── requirements.txt
├── dbt
│   ├── dbt-project.yml
│   └── models
│       ├── schema.yml
│       └── transform.sql
├── terraform
│   ├── main.tf
│   ├── output.tf
│   └── variable.tf
├── .gitignore
└── README.md
```
Chotot doesn't have a public API, so I inspected their website and found some hidden API endpoints. The extracted data includes regions, areas, categories, and all the postings. All the extraction code lives in the `extract_chotot_data.py` file. For example, to get all regions:
```python
import requests
import pandas as pd

def get_region(path, save=True):
    """Fetch all Chotot regions and save them as a parquet file."""
    ids, names, url_names, lats, longs = [], [], [], [], []
    url = 'https://gateway.chotot.com/v1/public/web-proxy-api/loadRegions'
    response = requests.get(url)
    r_json = response.json()
    regions = r_json['regionFollowId']['entities']['regions']
    for region in regions.values():
        ids.append(region['id'])
        names.append(region['name'])
        url_names.append(region['name_url'])
        # 'geo' is a "lat,long" string
        lat, long = region['geo'].split(',')
        lats.append(lat)
        longs.append(long)
    result = pd.DataFrame({
        'id': ids,
        'name': names,
        'url_name': url_names,
        'lat': lats,
        'long': longs,
    })
    if save:
        result.to_parquet(path, index=False)
    else:
        return result
```
S3 serves as the data lake in this pipeline: it stores all the raw data extracted from Chotot. The extracted data is uploaded daily to an S3 bucket, stored in parquet format.
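The upload step (in `upload_to_s3.py`) can be sketched with boto3. The key layout below is an assumption for illustration, not necessarily the repo's exact one:

```python
from datetime import date

def make_s3_key(table: str, run_date: date) -> str:
    # Partition raw files in the lake by table name and extraction date
    # (hypothetical layout).
    return f"raw/{table}/{run_date:%Y-%m-%d}/{table}.parquet"

def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    # boto3 is imported here so the key helper stays usable without AWS deps.
    import boto3
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
```

For example: `upload_to_s3('/tmp/region.parquet', 'chotot-lake', make_s3_key('region', date.today()))`.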
Since the extracted data is raw, I transform it a little and load it into the data warehouse, Redshift.
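Loading parquet files from S3 into Redshift is typically done with a `COPY` command. A minimal sketch of how `upload_to_redshift.py` might build it (the table, bucket, and IAM role names are placeholders):

```python
def build_copy_statement(table: str, bucket: str, key: str, iam_role_arn: str) -> str:
    # Redshift reads the parquet file directly from S3 using the given IAM role;
    # FORMAT AS PARQUET matches the file type stored in the data lake.
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{key}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )
```

The resulting SQL string would then be executed against Redshift with a client such as `psycopg2`.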
For demonstration purposes I run the pipeline daily, although the schedule could be changed later. Each day, the pipeline above runs end to end so that all new posts are updated in my data warehouse.
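The daily schedule is wired up in `dags/daily_elt_pipeline.py`. A sketch of what that DAG could look like; the helper function names, signatures, and paths here are assumptions based on the repo layout:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical helper imports matching airflow/helpers/ in the tree above.
from helpers.extract_chotot_data import get_region
from helpers.upload_to_s3 import upload_to_s3
from helpers.upload_to_redshift import upload_to_redshift

with DAG(
    dag_id="daily_elt_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # one run per day, as described above
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id="extract_chotot_data",
        python_callable=get_region,
        op_kwargs={"path": "/tmp/region.parquet"},
    )
    load_s3 = PythonOperator(
        task_id="upload_to_s3",
        python_callable=upload_to_s3,
    )
    load_redshift = PythonOperator(
        task_id="upload_to_redshift",
        python_callable=upload_to_redshift,
    )

    # Extract first, then land in the lake, then load the warehouse.
    extract >> load_s3 >> load_redshift
```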
We need to build the image first:

```
docker-compose build
docker-compose up airflow-init
```

After building the image, we can use `docker-compose up` to start Airflow and run all the tasks for this pipeline.
I use dbt to transform my data in Redshift. Since the data is quite simple, this step is not strictly necessary. The transformed data is loaded back into Redshift.
Again, this step is overkill for a small project like this, but Terraform is a good way to create AWS resources. I use Terraform to provision the S3, Redshift, and IAM resources.
Finally, I use Google Data Studio to visualize the data. You can take a look here.