
Repository for source code of "Big Data (Hadoop, Map reduce, Hive)" lecture at MS SIO, CentraleSupélec

🌐 BigData

Set up the VPN to connect to the Adaltas infrastructure

(On Ubuntu) Install the NetworkManager OpenVPN packages:

sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome ubuntu-desktop

Then import the VPN configuration:

  • Download the .ovpn file from the email received from Adaltas (check the william.afonso@student-cs.fr mailbox)
  • Open the network settings
  • Under VPN, click "+"
  • Choose "Import from file"
  • Select the .ovpn file downloaded previously
  • Check the connection to the Adaltas infrastructure: ping edge-1.au.adaltas.cloud
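
If ICMP ping is filtered or you want to script the check, a TCP probe is a possible alternative. The sketch below is not part of the repository: the function name `can_reach` is hypothetical, and port 22 is an assumption based on the SSH connection used later in this README.

```python
import socket


def can_reach(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens the socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts
        return False


# Example call (requires the VPN to be active):
# can_reach("edge-1.au.adaltas.cloud")
```

A `False` result only means that the TCP connection failed; it does not say whether the VPN, the DNS resolution or the remote host is at fault.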

Export scientific article metadata to HDFS

Prepare the environment to use the applications that will follow

(On Ubuntu) Install the required libraries:

sudo apt-get install gcc python3-dev libkrb5-dev

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install the Python requirements:

pip install --upgrade pip
pip install -r requirements.txt

Generate a CSV aggregating article metadata

Clone the current project onto your machine:

git clone https://github.com/will-afs/BigData.git

Go into the BigData folder:

cd BigData

Generate the CSV database of PDF metadata:

python3 generate_csv.py
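
The internals of generate_csv.py are not described here. As a rough illustration of what aggregating metadata records into a CSV can look like, here is a minimal sketch using only the standard library. The column names `title` and `published` are assumptions inferred from the visualizations mentioned at the end of this README, and `write_metadata_csv` is a hypothetical helper, not a function of the project.

```python
import csv

# Hypothetical column names; the real script may use different ones
FIELDS = ["title", "published"]


def write_metadata_csv(records, csv_path):
    """Write an iterable of metadata dicts to csv_path and return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for record in records:
            # Keep only the known columns; missing values become empty strings
            writer.writerow({name: record.get(name, "") for name in FIELDS})
            count += 1
    return count
```

Opening the file with `newline=""` lets the csv module control line endings, which avoids blank lines between rows on some platforms.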

Push the CSV to HDFS

Activate the VPN

Get a Kerberos ticket to connect to the applications that will follow (password : AdaltasWill2000):

kinit w.afonso-cs

The ticket can be checked with the following command:

klist

Run the script that pushes the CSV to HDFS:

python3 push_csv_to_hdfs.py
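
How push_csv_to_hdfs.py performs the upload is not documented here. One possible approach, sketched below under assumptions, is WebHDFS with Kerberos authentication through the `hdfs` Python package (installed with its `kerberos` extra). The helper names and the `webhdfs_url` parameter are hypothetical; the actual WebHDFS endpoint of the cluster is not given in this README and must be supplied by the caller.

```python
import os
import posixpath


def hdfs_destination(base_dir, local_path):
    """Build the HDFS target path: base_dir plus the local file name."""
    return posixpath.join(base_dir, os.path.basename(local_path))


def push_csv(local_path, base_dir, webhdfs_url):
    """Upload local_path into base_dir on HDFS (needs a valid Kerberos ticket)."""
    # Imported here so the module loads even when the package is absent.
    # Assumption: the 'hdfs' package with its kerberos extra is installed.
    from hdfs.ext.kerberos import KerberosClient

    client = KerberosClient(webhdfs_url)
    # overwrite=True replaces a file already present at the destination
    return client.upload(hdfs_destination(base_dir, local_path),
                         local_path, overwrite=True)
```

HDFS paths always use forward slashes, which is why the destination is built with `posixpath` rather than `os.path`.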

Connect to the Adaltas Cloud edge node over SSH and check that the CSV has been stored in HDFS

Connect to Adaltas server with password 'AdaltasWill2000':

ssh w.afonso-cs@edge-1.au.adaltas.cloud

Check HDFS content:

hdfs dfs -ls /education/cs_2022_spring_1/w.afonso-cs/fil-rouge/

The listing should include the CSV file pushed in the previous step.

Process data with HQL

Connect to Zeppelin through a web browser (login: w.afonso-cs, password: AdaltasWill2000):


Open the following Notebook:

Run the Zeppelin Notebook:

It will create a Hive table from the CSV file previously stored in HDFS:

It also exposes some data visualizations. First, the title length distribution:

Then, the cumulative number of published PDFs by date:
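
The notebook computes these two aggregations in HQL. For readers who want to check the figures locally, the same logic can be sketched in plain Python over the CSV rows. The column names `title` and `published` are assumptions, dates are assumed to be ISO-formatted strings (so that sorting them as text is chronological), and both function names are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def title_length_distribution(rows):
    """Map each title length (in characters) to the number of articles."""
    return Counter(len(row["title"]) for row in rows)


def cumulative_by_date(rows):
    """Return (date, running total of articles published up to that date)."""
    per_day = Counter(row["published"] for row in rows)
    days = sorted(per_day)  # ISO dates sort chronologically as strings
    totals = accumulate(per_day[day] for day in days)
    return list(zip(days, totals))


# Illustrative rows, not real data from the project
sample = [
    {"title": "Deep nets", "published": "2022-01-02"},
    {"title": "Graphs", "published": "2022-01-01"},
    {"title": "Hive", "published": "2022-01-02"},
]
print(cumulative_by_date(sample))
# [('2022-01-01', 1), ('2022-01-02', 3)]
```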