The project consists of two parts: scraping a website and clustering its information in different ways, and using hash functions to find duplicates in a large set of passwords.
The scraping part of the project was based on this link on the Immobiliare.it site.
This data serves as the corpus for our analysis.
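As a rough illustration of the scraping step, the sketch below parses listing titles and prices out of HTML with Python's standard-library `HTMLParser`. The markup (class names `listing`, `title`, `price`) is invented for the example; the real Immobiliare.it pages use different structure, and the actual code in `scraping.py` may rely on different tools.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real Immobiliare.it pages differ; this only
# sketches the idea of pulling listing fields out of HTML.
SAMPLE_PAGE = """
<ul>
  <li class="listing"><span class="title">Trilocale via Roma</span>
      <span class="price">250.000</span></li>
  <li class="listing"><span class="title">Bilocale centro</span>
      <span class="price">180.000</span></li>
</ul>
"""

class ListingParser(HTMLParser):
    """Collect {title, price} records from <span class="title|price"> tags."""
    def __init__(self):
        super().__init__()
        self.records = []   # one dict per listing
        self._field = None  # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") in ("title", "price"):
            self._field = attrs["class"]
            if self._field == "title":
                self.records.append({})  # a title opens a new record

    def handle_data(self, data):
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()
            self._field = None

parser = ListingParser()
parser.feed(SAMPLE_PAGE)
print(parser.records)
```

In practice each parsed record would become one row of `data.csv`.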
- Homework_4.ipynb: This notebook provides the code of our analysis. Some plots may not display clearly, so the notebook is also shown here for correct visualization.
- data.csv: This file contains all the information about the announcements taken from Immobiliare.it.
- Files containing all the functions of the different parts of the homework:
  - first_part_functions.py: This file contains all the functions used in part 1 of the homework.
  - scraping.py: This file contains all the functions used for the scraping part.
  - Hash_functions.py: This file contains all the functions used in part 2 of the homework.
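The duplicate-detection idea behind the second part can be sketched as follows: hash each password and group by hash value, so a repeated hash flags a candidate duplicate. The polynomial hash and the helper names (`poly_hash`, `find_duplicates`) are illustrative assumptions, not the functions actually defined in `Hash_functions.py`.

```python
def poly_hash(s: str, p: int = 1_000_000_007, base: int = 257) -> int:
    """Polynomial rolling hash reduced mod a large prime.
    A sketch only -- the hash used in Hash_functions.py may differ."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % p
    return h

def find_duplicates(passwords):
    """Group passwords by hash; a repeated hash flags a candidate duplicate."""
    seen = {}          # hash value -> first password seen with that hash
    duplicates = set()
    for pw in passwords:
        h = poly_hash(pw)
        if h in seen:
            duplicates.add(pw)  # candidate duplicate (or a rare collision)
        else:
            seen[h] = pw
    return duplicates

print(find_duplicates(["hunter2", "qwerty", "hunter2", "letmein", "qwerty"]))
```

Because distinct strings can collide, a full implementation would confirm each candidate by comparing the strings themselves.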
- KMeans.py and KMeans_map_reduce.py: These files contain the classes implementing the KMeans algorithm with and without map-reduce.
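To show the shape of the algorithm those two files implement, here is a minimal Lloyd-style KMeans on 2-D points, with the assignment pass commented as the "map" step and the centroid recomputation as the "reduce" step. It is a sketch under assumed interfaces, not the `KMeans` classes from the repository, and it uses a simple deterministic initialization (the first `k` points) rather than whatever initialization those classes use.

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm on 2-D points; a sketch only --
    the classes in KMeans.py / KMeans_map_reduce.py may differ."""
    centroids = list(points[:k])  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # "map" step: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # "reduce" step: recompute each centroid as its cluster's mean
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if the cluster went empty
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated groups of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

In the map-reduce variant the assignment step is distributed over the data, and the reduce step aggregates per-centroid sums and counts before dividing.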