Vietnam Real Estate Data Pipeline

A data pipeline to extract Vietnam real estate data daily from Chợ tốt nhà.

Architecture

(architecture diagram)

Project structure

├── airflow
│   ├── Dockerfile
│   ├── dags
│   │   └── daily_elt_pipeline.py
│   ├── docker-compose.yaml
│   ├── helpers
│   │   ├── extract_chotot_data.py
│   │   ├── upload_to_redshift.py
│   │   └── upload_to_s3.py
│   └── requirements.txt
├── dbt
│   ├── dbt-project.yml
│   └── models
│       ├── schema.yml
│       └── transform.sql
├── terraform
│   ├── main.tf
│   ├── output.tf
│   └── variable.tf
├── .gitignore
└── README.md

Extract data

Chotot doesn't have a public API, so I inspected their website and found some hidden API endpoints. The extracted data includes regions, areas, categories, and all the postings. All the extraction code is in the extract_chotot_data.py file. For example, to get all regions:

import requests
import pandas as pd

def get_region(path, save=True):
    """Fetch all regions from Chotot's hidden API and save or return them as a table."""
    ids, names, url_names, lats, longs = [], [], [], [], []
    url = 'https://gateway.chotot.com/v1/public/web-proxy-api/loadRegions'
    response = requests.get(url)
    r_json = response.json()
    regions = r_json['regionFollowId']['entities']['regions']
    for region_id in regions:
        region = regions[region_id]
        ids.append(region['id'])
        names.append(region['name'])
        url_names.append(region['name_url'])
        # 'geo' is a "lat,long" string
        lats.append(region['geo'].split(',')[0])
        longs.append(region['geo'].split(',')[1])

    result = pd.DataFrame({
        'id': ids,
        'name': names,
        'url_name': url_names,
        'lat': lats,
        'long': longs,
    })
    if save:
        result.to_parquet(path, index=False)
    else:
        return result
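
Calling get_region with save=False returns the DataFrame instead of writing a Parquet file; the path here is only illustrative:

regions = get_region('regions.parquet', save=False)
print(regions.head())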

Load into S3

S3 serves as the data lake in this pipeline; it stores all the raw data extracted from Chotot. The extracted data is uploaded to an S3 bucket daily and stored as Parquet files.

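The upload step lives in upload_to_s3.py; below is a minimal sketch of what it could look like using boto3. The bucket name, key prefix, and file path are placeholders, not the actual project values.

import boto3
from datetime import date

def upload_to_s3(local_path, bucket, prefix):
    """Upload a local Parquet file to S3, keyed by extraction date."""
    s3 = boto3.client('s3')
    key = f"{prefix}/{date.today().isoformat()}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"

# Example with placeholder names
# upload_to_s3('regions.parquet', 'chotot-data-lake', 'raw/regions')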

Load into Redshift

Since the data we extract is raw, I transform it a little and load it into the data warehouse, Redshift.

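Loading from S3 into Redshift is typically done with a COPY command. Here is a minimal sketch of what upload_to_redshift.py might do, assuming psycopg2 and an IAM role attached to the cluster; the connection details and names below are placeholders.

import psycopg2

def copy_to_redshift(table, s3_path, iam_role, conn_params):
    """COPY a Parquet file from S3 into a Redshift table."""
    copy_sql = f"""
        COPY {table}
        FROM '{s3_path}'
        IAM_ROLE '{iam_role}'
        FORMAT AS PARQUET;
    """
    # The connection context manager commits the transaction on success.
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.execute(copy_sql)

# Example with placeholder values
# copy_to_redshift(
#     table='raw.regions',
#     s3_path='s3://chotot-data-lake/raw/regions/2024-01-01/regions.parquet',
#     iam_role='arn:aws:iam::123456789012:role/redshift-s3-read',
#     conn_params={'host': '...', 'port': 5439, 'dbname': 'dev',
#                  'user': 'awsuser', 'password': '...'},
# )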

Docker & Airflow

I run the pipeline daily for demonstration purposes, although this schedule could be changed later. Each day, the pipeline above runs so that all new posts are updated in my data warehouse.
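
A minimal sketch of what daily_elt_pipeline.py could look like; the task names and ordering are my assumption based on the helper scripts, not the exact DAG in the repo.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real logic lives in the helpers/ scripts.
def extract_chotot_data(**kwargs):
    ...

def upload_to_s3(**kwargs):
    ...

def upload_to_redshift(**kwargs):
    ...

with DAG(
    dag_id='daily_elt_pipeline',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',  # run once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id='extract_chotot_data', python_callable=extract_chotot_data)
    to_s3 = PythonOperator(task_id='upload_to_s3', python_callable=upload_to_s3)
    to_redshift = PythonOperator(task_id='upload_to_redshift', python_callable=upload_to_redshift)

    # Extract -> land raw files in S3 -> load into Redshift
    extract >> to_s3 >> to_redshift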

We need to build the image and initialize Airflow first:

docker-compose build
docker-compose up airflow-init

After building the image, we can run docker-compose up to start Airflow and execute all the tasks in this pipeline.


Transform with dbt

I use dbt to transform my data in Redshift. Since the data is quite simple, this step isn't strictly necessary.


The transformed data is loaded back into Redshift.


Infrastructure with Terraform

Again, this step is overkill for a small project like this, but Terraform is a good way to provision AWS resources. I use Terraform to create the S3 bucket, the Redshift cluster, and the IAM roles.

Visualize with Google Data Studio

Finally, I use Google Data Studio to visualize the data. You can take a look here.