This README provides instructions for setting up an ETL pipeline.
Before you begin, make sure you have the following:
- Kaggle account: If you don't have one, you can create it at Kaggle.com.
- Docker: Install Docker on your system by following the instructions provided on the Docker website.
- Create a directory named .kaggle in your user directory (usually /home/yourusername/ on Linux, C:\Users\yourusername\ on Windows).
  On Linux/Mac:
      mkdir ~/.kaggle
  On Windows:
      mkdir C:\Users\yourusername\.kaggle
- Log in to your Kaggle account.
- Go to your Kaggle account settings by clicking on your profile picture and selecting "Account."
- Scroll down to the "API" section and click on "Create New API Token." This will download a file called kaggle.json.
- Move the kaggle.json file you downloaded in the previous step to the .kaggle directory you created in the first step.
  On Linux/Mac:
      mv path/to/downloaded/kaggle.json ~/.kaggle/kaggle.json
  On Windows:
      move path\to\downloaded\kaggle.json C:\Users\yourusername\.kaggle\kaggle.json
  Make sure to replace path/to/downloaded/kaggle.json with the actual path to the downloaded file.
- Protect your kaggle.json file. On Linux/Mac, you can restrict access to your API key using the following command:
      chmod 600 ~/.kaggle/kaggle.json
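The credential steps above can be sanity-checked with a small script. This is a minimal sketch, not part of the project: the function name `check_kaggle_credentials` is hypothetical, and it assumes the token file holds the usual "username" and "key" entries. On POSIX systems it also mirrors the chmod 600 step by flagging any group or other permissions.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]

    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the token file carries a "username" and a "key" entry.
    for field in ("username", "key"):
        if not data.get(field):
            problems.append(f'missing "{field}" entry')

    # Mirror the chmod 600 step: on POSIX, flag any group/other permission bits.
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions are {oct(mode)}, expected 0o600")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")
    print("OK" if not issues else "\n".join(issues))
```

An empty list means the file is where the steps above put it, parses as JSON, and is readable only by its owner.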
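- Determine your current public Internet IP address, for example with the ifconfig command. Note that ifconfig lists the addresses of your local network interfaces, so if your machine is behind a router or NAT, the address it shows may not be your public one.
- Open the file named ip_address.txt and place your IP address in it.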
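Because a local interface address is easy to mistake for a public one, it can help to check what ended up in ip_address.txt. The following is a rough sketch using Python's standard `ipaddress` module; the function name `classify_ip` is hypothetical, and the script assumes ip_address.txt sits in the current directory and contains a single address.

```python
import ipaddress
from pathlib import Path


def classify_ip(text):
    """Classify an address string as 'public', 'non-public', or 'invalid'."""
    try:
        addr = ipaddress.ip_address(text.strip())
    except ValueError:
        return "invalid"
    # is_global is False for private (e.g. 192.168.x.x), loopback,
    # and link-local ranges, which are not reachable from the Internet.
    return "public" if addr.is_global else "non-public"


if __name__ == "__main__":
    ip_file = Path("ip_address.txt")  # assumed location: current directory
    if ip_file.exists():
        print(classify_ip(ip_file.read_text()))
    else:
        print("ip_address.txt not found in the current directory")
```

A result of "non-public" suggests the address came from a local interface rather than from your Internet-facing connection.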
- Create a Python virtual environment to manage your project dependencies. This step helps you isolate your project's dependencies from your system-wide Python installation.
      # Create a virtual environment
      python -m venv venv
      # Activate the virtual environment on Linux/Mac
      source venv/bin/activate
      # Activate the virtual environment on Windows
      .\venv\Scripts\Activate
- Make sure you are in your virtual environment (you should see the environment name in your terminal prompt).
- Navigate to the directory where you have your project files, including the requirements.txt file.
- Install the project dependencies using pip.
      pip install -r requirements.txt
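If the terminal prompt does not make it obvious whether the virtual environment is active, the interpreter itself can tell you. This is a small sketch, assuming an environment created with `python -m venv` as above; the helper name `in_virtualenv` is hypothetical. Inside such an environment, `sys.prefix` points at the environment directory while `sys.base_prefix` still points at the system installation.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base prefix."""
    # Default to the running interpreter; the parameters exist so the
    # comparison can be exercised with explicit values.
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; packages would install system-wide.")
```

Running this before `pip install -r requirements.txt` helps avoid installing the dependencies into the system-wide Python by mistake.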
- Make the script executable:
      chmod +x run_all.sh
- Run the script:
      ./run_all.sh # Linux/Mac (if you have made the script executable)
  or
      run_all.sh # Windows (if you have Cygwin or a similar Unix-like environment)

The run_all.sh script will use your Kaggle API key to download a dataset, then create a PostgreSQL container for the database to which the raw data will be transferred. It will also spin up Metabase so you can interact with your data visually!
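Since the script creates containers, it will fail early if Docker is not available. The sketch below is a hedged pre-flight check and not part of run_all.sh: the helper name `missing_tools` is hypothetical, and it assumes the script calls the `docker` command-line client from your PATH.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumption: run_all.sh invokes the "docker" executable directly.
    missing = missing_tools(["docker"])
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

Any other command-line tool the script turns out to rely on can be added to the list in the same way.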