
A group project that analyzes data on restaurant inspections and violations as well as community health in Los Angeles and uses a machine learning model to identify and assess factors that could affect grades and scores.


Rats in the Restaurants

Project Overview

Health inspections protect the public by holding restaurants to current health codes. Violations such as rats found in a restaurant are one type of violation code that lowers a restaurant's grade. This analysis identifies features that can lead restaurants in Los Angeles County to receive health violations.

Resources

  1. Los Angeles County Open Source
  2. Google Slides: Link
  3. FINAL WEB APP: Link
  4. Tableau Dashboard
  • Esther's Initial Analysis Dashboard: Link
  • Daniela's Interactive Map: Link
  • Daniela's Machine Learning Dashboard displaying the importance level per feature: Link
  • Julie's Health Violation Dashboard: Link
  • Maria's Health Violation Dashboard: Link
  5. Software
  • Software/Toolkit: Visual Studio Code 1.39.0, Jupyter Notebook 6.0.3, SQLAlchemy 1.39, PostgreSQL
  • Languages: Python 3.7
  • Machine Learning Libraries: scikit-learn

ETL

  1. EXTRACT
  • Extracted data from Los Angeles County’s open data portal
  • Filtered the data to compare urban and suburban cities within Los Angeles County
  • Evaluated possible correlations between health violations, venue size, and community health demographics
  • Explored whether marijuana use, alcohol consumption, and crime rate affect health violations
  2. TRANSFORM
  • Extracted roughly 1,000,000 rows and nearly 1,000 columns of feature-rich, machine-learning-ready data from publicly available Los Angeles city and county sources, using six programming libraries to munge, normalize, split, and encode it
  3. LOAD
  • Loaded two data frames into tables using PostgreSQL
  • Normalized the data in SQL
  • Joined the tables and exported the result to .CSV for machine learning
  • Reliability and easy accessibility are among SQL’s main strengths, ensuring a clean join for the machine learning phase
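The LOAD step above can be sketched with pandas and SQLAlchemy. The toy frames, table names, and file name below are illustrative stand-ins; SQLite is used here so the sketch runs anywhere, where the project used a PostgreSQL connection string instead:

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy stand-ins for the two frames extracted from LA County's open data portal.
inspections = pd.DataFrame({
    "serial_number": ["DAJ00E07I", "DAQXL0L3D"],
    "facility_city": ["LOS ANGELES", "SANTA MONICA"],
    "score": [91, 95],
})
violations = pd.DataFrame({
    "serial_number": ["DAJ00E07I", "DAQXL0L3D"],
    "violation_code": ["F044", "F033"],
})

# In-memory SQLite keeps the sketch self-contained; the project would use
# something like create_engine("postgresql://user:password@localhost:5432/rats_db").
engine = create_engine("sqlite://")

# Load the two data frames into tables.
inspections.to_sql("inspection2", engine, index=False, if_exists="replace")
violations.to_sql("violations", engine, index=False, if_exists="replace")

# Join the tables in SQL and export to .CSV for the machine learning phase.
joined = pd.read_sql(
    "SELECT i.*, v.violation_code "
    "FROM inspection2 i JOIN violations v ON i.serial_number = v.serial_number",
    engine,
)
joined.to_csv("rats_ml.csv", index=False)
```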

SQL

  • In SQL, we merged inspection.csv (public.inspection2) and violation.csv to create the final dataset. Within the dataset, Facility_City names were replaced (e.g., "Malibu" with "Santa Monica") to map each facility to the closest city already present in inspection.csv.


ERD

Between public.inspection2 and violations, the primary key is "serial_number," which was used to merge the two datasets.
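The merge keyed on "serial_number," together with the Facility_City remapping described above, can be sketched in pandas (toy frames and a one-entry city map, for illustration only):

```python
import pandas as pd

# Toy stand-ins for the inspection and violation tables.
inspections = pd.DataFrame({
    "serial_number": ["A1", "A2"],
    "FACILITY_CITY": ["MALIBU", "LOS ANGELES"],
    "SCORE": [90, 95],
})
violations = pd.DataFrame({
    "serial_number": ["A1", "A2"],
    "VIOLATION_CODE": ["F044", "F033"],
})

# Remap outlying city names to the closest city already in inspection.csv,
# e.g. Malibu -> Santa Monica.
city_map = {"MALIBU": "SANTA MONICA"}
inspections["FACILITY_CITY"] = inspections["FACILITY_CITY"].replace(city_map)

# serial_number is the shared primary key between the two tables.
final = inspections.merge(violations, on="serial_number", how="inner")
```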


Machine Learning

  • Tried multiple machine learning models (e.g., PCA, SVM, SVC, linear regression) and finalized with Random Forest, as it produced the best results (67% accuracy compared to 8% with SVC)

    Random Forest Model

    SVC Model

Description of Preliminary Data Preprocessing

We took several data transformation and preprocessing steps during continuous data exploration and analysis, broken down into the following parts.

Part one of the data preprocessing included:
    * extracting only restaurants, coordinates, and seat counts
    * dropping unwanted columns, renaming kept columns
    * changing data types
    * replacing null values with appropriate values
Part two of the data preprocessing included:
    * replacing missed null values
    * dropping unwanted columns
Part three of the data preprocessing included:
    * modifying "Activity Date" to include "MONTH-YEAR" and "MONTH" columns
    * creating a new column from 'SEATS' into 'new_seats' that represents "seat bins"
    * dropping unwanted columns
Part four of the data preprocessing included:
    * setting facility city = "Los Angeles, City of"
    * dropping columns consisting of categorical data
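The date and seat-bin steps above can be sketched in pandas. The toy frame, bin edges, and fill strategy below are illustrative assumptions, not the project's exact choices:

```python
import pandas as pd

# Toy frame standing in for the merged inspection data.
df = pd.DataFrame({
    "ACTIVITY DATE": ["2017-07-01", "2017-08-15"],
    "SEATS": [25, 180],
    "SCORE": [91.0, None],
})

# Derive "MONTH-YEAR" and "MONTH" columns from "Activity Date".
df["ACTIVITY DATE"] = pd.to_datetime(df["ACTIVITY DATE"])
df["MONTH-YEAR"] = df["ACTIVITY DATE"].dt.to_period("M").astype(str)
df["MONTH"] = df["ACTIVITY DATE"].dt.month

# Replace null values with an appropriate value (median, as one choice).
df["SCORE"] = df["SCORE"].fillna(df["SCORE"].median())

# Create 'new_seats' from 'SEATS' as "seat bins" (hypothetical bin edges).
df["new_seats"] = pd.cut(df["SEATS"], bins=[0, 60, 150, 10_000],
                         labels=["0-60", "61-150", "151+"])

# Drop columns no longer needed.
df = df.drop(columns=["ACTIVITY DATE"])
```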

Description of preliminary feature engineering and preliminary feature selection, including their decision-making process

We conducted linear regression analysis to assess a restaurant's "score" against each potential feature (n = 43). The analysis included determining p-values and r-values to better understand significance and correlations in our data. This process helped inform our feature selection.
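Per-feature regression of this kind can be sketched with SciPy's `linregress`. The synthetic data and feature names below are stand-ins for the 43 real candidates:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(78)
score = rng.normal(92, 4, 200)  # stand-in for the restaurant SCORE column

# Two hypothetical candidate features: one correlated with score, one pure noise.
features = {
    "SEATS": score * -0.5 + rng.normal(0, 5, 200),
    "NOISE": rng.normal(0, 1, 200),
}

# Regress SCORE on each candidate feature and record r- and p-values.
stats = {}
for name, x in features.items():
    res = linregress(x, score)
    stats[name] = (res.rvalue, res.pvalue)
```

Features with strong r-values and small p-values are then the natural keeps.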

Description of how data was split into training and testing sets

Our data was defined as X (the 43 features) and y ("SCORE"):
    * X = rats2_ml_df.copy()
    * X = X.drop("SCORE", axis=1)
    * y = rats2_ml_df["SCORE"].ravel()

We separated our data into training and testing sets using: * X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
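Put together, the X/y definition and split can be sketched end to end (the small random frame and its column names stand in for the real rats2_ml_df):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for rats2_ml_df: SCORE plus a few numeric features.
rats2_ml_df = pd.DataFrame(
    np.random.default_rng(78).normal(size=(100, 4)),
    columns=["LAT", "LNG", "FACILITY_ZIP", "SCORE"],
)

X = rats2_ml_df.copy()
X = X.drop("SCORE", axis=1)           # the feature matrix (43 columns in the real data)
y = rats2_ml_df["SCORE"].ravel()      # the target

# train_test_split defaults to a 75/25 split; seeded as in the project.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
```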

Explanation of model choice, including limitations and benefits

The linear SVC model presented several limitations: it took a couple of hours to run and resulted in a very low accuracy score (about 6%). Upon discussing and re-evaluating the model, we decided to reduce our feature selection to about a handful of features, based on the strongest r-values from our linear regression analysis, and to limit restaurants to the city of Los Angeles only. We tested this approach on a random forest model, which still produced a low accuracy score. We then reverted to keeping all of our features and all cities in Los Angeles County. Our random forest classifier used two parameters: n_estimators = 200 and random_state = 78. This model produced an accuracy score of about 70%. Feature importance was calculated and sorted, revealing that "LAT", "LNG", and "FACILITY_ZIP" were among the top features. Although the importances are not strong, they still demonstrate an impact.
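The final model configuration can be sketched as follows. The synthetic dataset stands in for the joined inspection data (which had 43 features); the parameters match those stated above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined inspection dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=78)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# The project's parameters: n_estimators=200, random_state=78.
model = RandomForestClassifier(n_estimators=200, random_state=78)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Calculate and sort feature importances, as was done to surface
# LAT, LNG, and FACILITY_ZIP.
ranked = sorted(enumerate(model.feature_importances_),
                key=lambda t: t[1], reverse=True)
```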