
File Structure

The main data mining code is in two files. "pyspark.ipynb" uses PySpark for the data mining, and "pandas-sklearn.ipynb" combines pandas with sklearn to carry out the whole data mining process.

Both approaches are discussed in my report to provide a more detailed analysis.


My dataset, "SBAnational.csv", is 179.4 MB, which is too large to push to GitHub directly. I asked Mina about it and got permission to upload it separately. You can get the dataset by clicking this link and downloading it from Google Drive. If you have any problems, you can contact me by email.
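
After downloading, it may help to confirm that the file arrived intact before opening either notebook. The following is only a minimal sketch using the Python standard library: the helper name `describe_dataset` is hypothetical, and it assumes the CSV has a header row and was saved next to the notebooks, which the notebooks themselves may or may not require.

```python
import csv
from pathlib import Path

# Assumption: the downloaded file was placed in the same folder as the notebooks.
DATASET_PATH = Path("SBAnational.csv")


def describe_dataset(path=DATASET_PATH):
    """Return (size in MB, list of column names) for the downloaded CSV."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; download it from the Google Drive link first."
        )
    # File size in megabytes (1 MB = 1,000,000 bytes here).
    size_mb = path.stat().st_size / 1_000_000
    # Read only the first row, assumed to hold the column names.
    # errors="replace" avoids a crash if the file is not valid UTF-8.
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        header = next(csv.reader(handle), [])
    return size_mb, header
```

Calling `describe_dataset()` should report a size close to the 179.4 MB mentioned above. A much smaller value usually means the download was interrupted.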

