/hpd-violations

A database of NYC Housing Violations

Primary LanguageShellGNU General Public License v3.0GPL-3.0

HPD Violations

Creates a postgres database of housing violations in New York City. The Department of Housing, Preservation, and Development (HPD) releases every month a dataset of all housing violations they have issued. This code turns those CSV into a useful database.

To build a database of hpd violations since 2015 create a file 'pg_setup.sh' contains two bash functions that excues postgres commands. Example:

export PGPASSWORD=YOURPGPASSWORD

execute_sql () {
 psql -h 127.0.0.1 -d nycdb -U postgres -f $1
}

execute_sql_cmd () {
 psql -h 127.0.0.1 -d nycdb -U postgres --command "$1"
}

Then run the following scripts:

# Downloads all the data
./download_violations.sh
# Unzips the data & deletes the .xml files
./unzip.sh
# Inserts the data into postgres
./to_postgres.sh

This will produce 4 tables:

  • violations: Contains all violations since 2015. There are multiple entries for many violations. For instance, there might be an entry in May 2015 when the violation was opened and a second entry for the same violation in July 2015 when it was closed.
  • uniq_violations: Contains one row for each unique violation in the violations table with the information of the most recent entry.
  • open_violations: Contains all currently open violations including those opened before 2015. HPD keeps a separate dataset of 'All Open Violations'.
  • all_violations: This is a combination of uniq_violations and open_violations. It contains all open violations (since the dawn of...hpd?) and all closed violations since 2015 and is careful to avoid duplicates between the two datasets.

There are two companion projects to this database:

hpd-violations-server - A nodejs server serving JSON from the database

hpd-violations-web - A website to look up buildings

Violation XML to CSV converter

See folder: parse_violations_xml

This this is an XML converter for HPD violation data. HPD provides data only in the XML format for all violation data prior to Aug. 2014. However, there are a complex set of schema changes prior to 2015 (see HPD's metadata), so I have temporarily abandoned the project of bringing in all historic violation data and instead I'm just dealing with the data since 2015. One day, I'd like to revisit the goal of parsing all the violation data.

How to use the XML Violations Parser

python3 parse_violations.py input [output]

The input can be a single xml file or a directory (and subdirectories) of files. If no output is provided, it will produce an output csv with the same name as the input. For example, python3 parse_violations.py Violations20150201.xml creates Violations20150201.csv.

The violation XML files can download from HPD's website here: HPD Open Data - Violations

IMPORTANT NOTE: For the xml files to work with this program they must be pre-processed to have each element on its own line. This can be done with a simple perl script:

perl -pe 's/></>\n</g' inputfile.xml > outputfile.xml

MSC. Useful Commands

DB DUMP: pg_dump --no-owner -Fc -t all_violations hpd_violations > all_violations.dump

DB restore: pg_restore --clean -d hpd_violations all_violations.dump

Grant privileges:

psql -d hpd_violations -c "GRANT USAGE ON SCHEMA public TO hpdserver"
psql -d hpd_violations -c "GRANT SELECT ON ALL TABLES IN SCHEMA public TO hpdserver"

clear ngnix cache : find /data/nginx/cache/ -type f -delete