A data pipeline to extract Vietnam real estate data daily from Chợ tốt nhà.
- Extract data from Chợ tốt nhà (Extract)
- Load into AWS S3 (Data lake)
- Transform and load into AWS Redshift (Data warehouse)
- Transform using dbt (Transform)
- Create Google Data Studio Dashboard (Apply)
- Orchestrate pipeline with Airflow in Docker
- Create AWS resources with Terraform (Infrastructure)
```
├── airflow
│   ├── Dockerfile
│   ├── dags
│   │   └── daily_elt_pipeline.py
│   ├── docker-compose.yaml
│   ├── helpers
│   │   ├── extract_chotot_data.py
│   │   ├── upload_to_redshift.py
│   │   └── upload_to_s3.py
│   └── requirements.txt
├── dbt
│   ├── dbt-project.yml
│   └── models
│       ├── schema.yml
│       └── transform.sql
├── terraform
│   ├── main.tf
│   ├── output.tf
│   └── variable.tf
├── .gitignore
└── README.md
```
Chotot doesn't have a public API, so I inspected their website and found some hidden API endpoints. The extracted data includes regions, areas, categories, and all the postings. All the extraction code lives in the `extract_chotot_data.py` file. For example, to get all regions:
```python
import requests
import pandas as pd

def get_region(path, save=True):
    """Fetch all Chotot regions and save them as a parquet file."""
    ids, names, url_names, lats, longs = [], [], [], [], []
    url = 'https://gateway.chotot.com/v1/public/web-proxy-api/loadRegions'
    response = requests.get(url)
    r_json = response.json()
    regions = r_json['regionFollowId']['entities']['regions']
    for region in regions.values():
        ids.append(region['id'])
        names.append(region['name'])
        url_names.append(region['name_url'])
        # 'geo' is a "lat,long" string
        lat, long = region['geo'].split(',')
        lats.append(lat)
        longs.append(long)
    result = pd.DataFrame({
        'id': ids,
        'name': names,
        'url_name': url_names,
        'lat': lats,
        'long': longs,
    })
    if save:
        result.to_parquet(path, index=False)
    else:
        return result
```
S3 serves as the data lake in this pipeline: it stores all the raw data extracted from Chotot. The extracted data is uploaded daily to an S3 bucket, stored in parquet format.
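The upload step (in `upload_to_s3.py`) can be sketched with boto3. The key layout below is an assumption for illustration, not necessarily the repo's exact one:

```python
from datetime import date

def make_s3_key(table: str, run_date: date) -> str:
    # Partition raw files in the lake by table name and extraction date
    # (hypothetical layout).
    return f"raw/{table}/{run_date:%Y-%m-%d}/{table}.parquet"

def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    # boto3 is imported here so the key helper stays usable without AWS deps.
    import boto3
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
```

For example: `upload_to_s3('/tmp/region.parquet', 'chotot-lake', make_s3_key('region', date.today()))`.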
Since the extracted data is raw, I transform it a little and load it into the data warehouse, Redshift.
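Loading parquet files from S3 into Redshift is typically done with a `COPY` command. A minimal sketch of how `upload_to_redshift.py` might build it (the table, bucket, and IAM role names are placeholders):

```python
def build_copy_statement(table: str, bucket: str, key: str, iam_role_arn: str) -> str:
    # Redshift reads the parquet file directly from S3 using the given IAM role;
    # FORMAT AS PARQUET matches the file type stored in the data lake.
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{key}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )
```

The resulting SQL string would then be executed against Redshift with a client such as `psycopg2`.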
For demonstration purposes I run the pipeline daily, although the schedule could be changed later. Each day, the pipeline above runs end to end so that all new posts are updated in my data warehouse.
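The daily schedule is wired up in `dags/daily_elt_pipeline.py`. A sketch of what that DAG could look like; the helper function names, signatures, and paths here are assumptions based on the repo layout:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical helper imports matching airflow/helpers/ in the tree above.
from helpers.extract_chotot_data import get_region
from helpers.upload_to_s3 import upload_to_s3
from helpers.upload_to_redshift import upload_to_redshift

with DAG(
    dag_id="daily_elt_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # one run per day, as described above
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id="extract_chotot_data",
        python_callable=get_region,
        op_kwargs={"path": "/tmp/region.parquet"},
    )
    load_s3 = PythonOperator(
        task_id="upload_to_s3",
        python_callable=upload_to_s3,
    )
    load_redshift = PythonOperator(
        task_id="upload_to_redshift",
        python_callable=upload_to_redshift,
    )

    # Extract first, then land in the lake, then load the warehouse.
    extract >> load_s3 >> load_redshift
```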
We need to build the image first:

```
docker-compose build
docker-compose up airflow-init
```

After building the image, we can use `docker-compose up` to start Airflow and run all the tasks for this pipeline.
I use dbt to transform my data in Redshift. Since the data is quite simple, this step is not strictly necessary. The transformed data is loaded back into Redshift.
Again, this step is overkill for a small project like this, but Terraform is a good way to create AWS resources. I use Terraform to provision the S3, Redshift, and IAM resources.
Finally, I use Google Data Studio to visualize the data. You can take a look here.