
A group project that analyzes data on restaurant inspections and violations as well as community health in Los Angeles and uses a machine learning model to identify and assess factors that could affect grades and scores.


Rats in the Restaurants

Project Overview

Health inspections protect the public by holding restaurants to current health codes. Violations such as rats found in a restaurant are one type of violation code that lowers a restaurant's grade. This analysis identifies features that can lead restaurants in Los Angeles County to receive health violations.

Resources

  1. Los Angeles County Open Source
  2. Google Slides: Link
  3. FINAL WEB APP: Link
  4. Tableau Dashboard
  • Esther's Initial Analysis Dashboard: Link
  • Daniela's Interactive Map: Link
  • Daniela's Machine Learning Dashboard displaying the importance level per feature: Link
  • Julie's Health Violation Dashboard: Link
  • Maria's Health Violation Dashboard: Link
  5. Software
  • Software/Toolkit: Visual Studio Code 1.39.0, Jupyter Notebook 6.0.3, SQLAlchemy 1.39, PostgreSQL
  • Languages: Python 3.7
  • Machine Learning Libraries: scikit-learn

ETL

  1. EXTRACT
  • Extracted data from Los Angeles County’s open data portal
  • Filtered the data to compare urban and suburban cities within Los Angeles County
  • Evaluated possible correlations between health violations, venue size, and community health demographics
  • Explored whether marijuana use, alcohol consumption, and crime rate affect health violations
  2. TRANSFORM
  • Extracted roughly 1,000,000 rows and nearly 1,000 columns of feature-rich, machine-learning-ready data from publicly available Los Angeles city and county sources, using six programming libraries to munge, normalize, split, and encode it
  3. LOAD
  • Loaded two data frames into tables using PostgreSQL
  • Normalized the data in SQL
  • Joined the tables and exported the result to .CSV for machine learning
  • Reliability and easy accessibility are among SQL’s main strengths, ensuring a clean join for the machine learning phase
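The LOAD step above can be sketched with pandas and SQLAlchemy. The toy frames, table names, and file name below are illustrative stand-ins; SQLite is used here so the sketch runs anywhere, where the project used a PostgreSQL connection string instead:

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy stand-ins for the two frames extracted from LA County's open data portal.
inspections = pd.DataFrame({
    "serial_number": ["DAJ00E07I", "DAQXL0L3D"],
    "facility_city": ["LOS ANGELES", "SANTA MONICA"],
    "score": [91, 95],
})
violations = pd.DataFrame({
    "serial_number": ["DAJ00E07I", "DAQXL0L3D"],
    "violation_code": ["F044", "F033"],
})

# In-memory SQLite keeps the sketch self-contained; the project would use
# something like create_engine("postgresql://user:password@localhost:5432/rats_db").
engine = create_engine("sqlite://")

# Load the two data frames into tables.
inspections.to_sql("inspection2", engine, index=False, if_exists="replace")
violations.to_sql("violations", engine, index=False, if_exists="replace")

# Join the tables in SQL and export to .CSV for the machine learning phase.
joined = pd.read_sql(
    "SELECT i.*, v.violation_code "
    "FROM inspection2 i JOIN violations v ON i.serial_number = v.serial_number",
    engine,
)
joined.to_csv("rats_ml.csv", index=False)
```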

SQL

  • In SQL, we merged inspection.csv (public.inspection2) and violation.csv to create the final dataset. Within the dataset, Facility_City names were replaced (e.g., "Malibu" with "Santa Monica") to map each facility to the closest city already present in inspection.csv.


ERD

Between public.inspection2 and violations, the primary key is "serial_number," which was used to merge the two datasets.
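The merge keyed on "serial_number," together with the Facility_City remapping described above, can be sketched in pandas (toy frames and a one-entry city map, for illustration only):

```python
import pandas as pd

# Toy stand-ins for the inspection and violation tables.
inspections = pd.DataFrame({
    "serial_number": ["A1", "A2"],
    "FACILITY_CITY": ["MALIBU", "LOS ANGELES"],
    "SCORE": [90, 95],
})
violations = pd.DataFrame({
    "serial_number": ["A1", "A2"],
    "VIOLATION_CODE": ["F044", "F033"],
})

# Remap outlying city names to the closest city already in inspection.csv,
# e.g. Malibu -> Santa Monica.
city_map = {"MALIBU": "SANTA MONICA"}
inspections["FACILITY_CITY"] = inspections["FACILITY_CITY"].replace(city_map)

# serial_number is the shared primary key between the two tables.
final = inspections.merge(violations, on="serial_number", how="inner")
```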


Machine Learning

  • Tried multiple machine learning models (e.g., PCA, SVM, SVC, linear regression) and finalized with Random Forest, as it produced the best results (67% accuracy compared to 8% with SVC)

    Random Forest Model

    SVC Model

Description of Preliminary Data Preprocessing

We took several data transformation and preprocessing steps during continuous data exploration and analysis, broken down into the following parts.

Part one of the data preprocessing included:
    * extracting only restaurants, coordinates, and seat counts
    * dropping unwanted columns, renaming kept columns
    * changing data types
    * replacing null values with appropriate values
Part two of the data preprocessing included:
    * replacing missed null values
    * dropping unwanted columns
Part three of the data preprocessing included:
    * modifying "Activity Date" to include "MONTH-YEAR" and "MONTH" columns
    * creating a new column from 'SEATS' into 'new_seats' that represents "seat bins"
    * dropping unwanted columns
Part four of the data preprocessing included:
    * setting facility city = "Los Angeles, City of"
    * dropping columns consisting of categorical data
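The date and seat-bin steps above can be sketched in pandas. The toy frame, bin edges, and fill strategy below are illustrative assumptions, not the project's exact choices:

```python
import pandas as pd

# Toy frame standing in for the merged inspection data.
df = pd.DataFrame({
    "ACTIVITY DATE": ["2017-07-01", "2017-08-15"],
    "SEATS": [25, 180],
    "SCORE": [91.0, None],
})

# Derive "MONTH-YEAR" and "MONTH" columns from "Activity Date".
df["ACTIVITY DATE"] = pd.to_datetime(df["ACTIVITY DATE"])
df["MONTH-YEAR"] = df["ACTIVITY DATE"].dt.to_period("M").astype(str)
df["MONTH"] = df["ACTIVITY DATE"].dt.month

# Replace null values with an appropriate value (median, as one choice).
df["SCORE"] = df["SCORE"].fillna(df["SCORE"].median())

# Create 'new_seats' from 'SEATS' as "seat bins" (hypothetical bin edges).
df["new_seats"] = pd.cut(df["SEATS"], bins=[0, 60, 150, 10_000],
                         labels=["0-60", "61-150", "151+"])

# Drop columns no longer needed.
df = df.drop(columns=["ACTIVITY DATE"])
```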

Description of preliminary feature engineering and preliminary feature selection, including their decision-making process

We conducted linear regression analysis to assess a restaurant's "score" against each potential feature (n = 43). The analysis included determining p-values and r-values to better understand significance and correlations in our data. This process helped inform our feature selection.
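Per-feature regression of this kind can be sketched with SciPy's `linregress`. The synthetic data and feature names below are stand-ins for the 43 real candidates:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(78)
score = rng.normal(92, 4, 200)  # stand-in for the restaurant SCORE column

# Two hypothetical candidate features: one correlated with score, one pure noise.
features = {
    "SEATS": score * -0.5 + rng.normal(0, 5, 200),
    "NOISE": rng.normal(0, 1, 200),
}

# Regress SCORE on each candidate feature and record r- and p-values.
stats = {}
for name, x in features.items():
    res = linregress(x, score)
    stats[name] = (res.rvalue, res.pvalue)
```

Features with strong r-values and small p-values are then the natural keeps.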

Description of how data was split into training and testing sets

Our data was defined as X (the 43 features) and y ("SCORE"):
    * X = rats2_ml_df.copy()
    * X = X.drop("SCORE", axis=1)
    * y = rats2_ml_df["SCORE"].ravel()

We separated our data into training and testing sets using: * X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
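Put together, the X/y definition and split can be sketched end to end (the small random frame and its column names stand in for the real rats2_ml_df):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for rats2_ml_df: SCORE plus a few numeric features.
rats2_ml_df = pd.DataFrame(
    np.random.default_rng(78).normal(size=(100, 4)),
    columns=["LAT", "LNG", "FACILITY_ZIP", "SCORE"],
)

X = rats2_ml_df.copy()
X = X.drop("SCORE", axis=1)           # the feature matrix (43 columns in the real data)
y = rats2_ml_df["SCORE"].ravel()      # the target

# train_test_split defaults to a 75/25 split; seeded as in the project.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
```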

Explanation of model choice, including limitations and benefits

The linear SVC model presented several limitations: it took a couple of hours to run and resulted in a very low accuracy score (about 6%). Upon discussing and re-evaluating the model, we decided to reduce our feature selection to about a handful of features, based on the strongest r-values from our linear regression analysis, and to limit restaurants to the city of Los Angeles only. We tested this approach on a random forest model, which still produced a low accuracy score. We then reverted to keeping all of our features and all cities in Los Angeles County. Our random forest classifier used two parameters: n_estimators = 200 and random_state = 78. This model produced an accuracy score of about 70%. Feature importance was calculated and sorted, revealing that "LAT", "LNG", and "FACILITY_ZIP" were among the top features. Although the importances are not strong, they still demonstrate an impact.
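The final model configuration can be sketched as follows. The synthetic dataset stands in for the joined inspection data (which had 43 features); the parameters match those stated above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined inspection dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=78)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# The project's parameters: n_estimators=200, random_state=78.
model = RandomForestClassifier(n_estimators=200, random_state=78)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Calculate and sort feature importances, as was done to surface
# LAT, LNG, and FACILITY_ZIP.
ranked = sorted(enumerate(model.feature_importances_),
                key=lambda t: t[1], reverse=True)
```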