
PySpark-on-GoogleColab

A Beginner’s Hands-on Guide to PySpark with Google Colab: A Tutorial Notebook from Scratch

Note: All the details are saved in a notebook; you can run it easily on your own Google Colab!

1- Use !wget to download the dataset to the server

Colab is essentially a Linux virtual machine, optionally with a GPU. You can use the Linux wget command directly to download a dataset onto the VM; by default, files are saved under the /content path.
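For example, a minimal sketch (the URL below is a placeholder; substitute the raw link to your own dataset):

```python
# Download a file with wget; in Colab, files land under /content by default.
# The URL is a placeholder -- replace it with your dataset's raw link.
!wget https://raw.githubusercontent.com/user/repo/main/dataset.csv -P /content/

# Verify the download.
!ls /content/
```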

2- Loading data into PySpark from GitHub

Spark provides readers for a variety of data formats. It can also infer the data type of each column automatically, though this requires an extra pass over the data.
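A minimal sketch of this step (the file path is an assumption; point it at whatever you downloaded in step 1):

```python
# Install PySpark in the Colab session, then read a CSV into a DataFrame.
!pip install pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColabTutorial").getOrCreate()

# header=True treats the first row as column names; inferSchema=True makes
# Spark scan the data once to determine each column's type.
df = spark.read.csv("/content/dataset.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```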

3- Use Google Drive to load datasets

The command to mount Google Drive in Colab is shown below. After running it, you will be asked to authorize access with your Google account.
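```python
# Mount Google Drive; Colab prompts you to authorize your Google account.
from google.colab import drive
drive.mount('/content/drive')

# Once mounted, your Drive files appear under /content/drive/MyDrive/.
```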

4- Load a dataset from Kaggle

If you are competing on Kaggle, the dataset you need is already hosted there, and you can download it directly with the kaggle CLI. First create an API token under your Kaggle account settings, which gives you a kaggle.json file containing your username and key.
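A sketch of the workflow (the competition name "titanic" is only an example; replace it with your own):

```python
# Install the Kaggle CLI and put the API token where it expects it.
!pip install kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/        # kaggle.json comes from your account settings
!chmod 600 ~/.kaggle/kaggle.json  # the CLI requires restricted permissions

# Download and unzip a competition dataset ("titanic" is just an example).
!kaggle competitions download -c titanic -p /content/
!unzip -o /content/titanic.zip -d /content/
```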

5- Upload files using the upload button

Google provides roughly 67 GB of local disk space. Use the upload button in the Colab Files panel (shown as a screenshot in the notebook) to upload files directly. This method is suitable for small datasets or your own datasets.
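Alternatively, a sketch of triggering the same upload from code with Colab's files helper:

```python
# Opens a browser file-picker; uploaded files land in the working directory.
from google.colab import files

uploaded = files.upload()
print(list(uploaded.keys()))  # names of the files you just uploaded
```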

6- Use the opendatasets library from Jovian

First, install it into Colab with pip. The URL can be any supported link, whether a Google Drive or Kaggle link.
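A minimal sketch (the Kaggle URL is an example; Kaggle links will prompt for your Kaggle username and API key):

```python
# Install and use Jovian's opendatasets library to fetch a dataset by URL.
!pip install opendatasets

import opendatasets as od

# Works with Kaggle and Google Drive links; Kaggle URLs prompt for
# your Kaggle username and API key.
od.download("https://www.kaggle.com/c/titanic")
```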

If you have any questions, feel free to ask, and stay tuned for future work!

Happy Learning! Stick To The Plan!

Author: Parissan Ahmadi

LinkedIn: https://www.linkedin.com/in/parisan-ahmadi-1410a0a9/

GitHub: https://github.com/parisa-ahmadi

Telegram Channel: https://t.me/AIwithParissan