/data_whiz

Example of how to use SQLSorcery for Data Whiz Summit 2019

Primary LanguagePythonMIT LicenseMIT

README

Prerequisites

  1. Install pipenv
  2. Install odbc drivers OR install Docker

Tutorial

  1. Install SQLSorcery

  2. Set up .env file

    MS_SERVER=dchess.database.windows.net
    MS_DB=data_whiz                                                                         
    MS_SCHEMA=dbo                                                                      
    MS_USER=sql_admin                                                                  
    MS_PWD=DataWhiz2019
  3. Download 2019 SBAC Research Files

    This can be done manually or using some simple python:

    import urllib.request
    
    url = "http://caaspp-elpac.cde.ca.gov/caaspp/researchfiles/sb_ca2019_1_csv_v2.zip"
    urllib.request.urlretrieve(url, "sb_ca2019_1_csv_v2.zip")
  4. Unzip the files

    Again, this can be done manually in the file system or with a little python:

    from zipfile import ZipFile
    
    with ZipFile("sb_ca2019_1_csv_v2.zip", "r") as z:
        z.extractall()
  5. Read the CSV files into Pandas dataframes

    import pandas as pd
    
    entities = pd.read_csv("sb_ca2019entities_csv.txt", encoding="ISO-8859-1")
    scores = pd.read_csv("sb_ca2019_1_csv_v2.txt", encoding="ISO-8859-1")

    The additional encoding is necessary because the SBAC files do not use the standard UTF-8.

  6. Load them as tables in the database

    from sqlsorcery import MSSQL
    
    sql = MSSQL()
    sql.insert_into("sbac_entities_2019", entities, chunksize=1000)
    sql.insert_into("sbac_scores_2019", scores, chunksize=1000)

    The additional chunksize param allows for batch insertions, greatly improving performance.

  7. Now you can query as you like. Some samples are provided.

    from sqlsorcery import MSSQL
    
    sql = MSSQL()
    df = sql.query("SELECT * FROM sbac_entities_2019")
  8. Run the source code

    If you are using the provided Dockerfile:

    $ docker build -t data_whiz .
    $ docker run --rm -it data_whiz

    If you have installed the Microsoft ODBC drivers locally:

    $ pipenv run python main.py