This script cleans the Nashville Housing data set by performing the following tasks:
- Standardizes the date format of the
SaleDate
column. - Populates the
PropertyAddress
column with data from theOwnerAddress
column wherePropertyAddress
is null. - Breaks out the
PropertyAddress
column into three separate columns:PropertySplitAddress
,PropertySplitCity
, andPropertySplitState
. - Breaks out the
OwnerAddress
column into three separate columns:OwnerSplitAddress
,OwnerSplitCity
, andOwnerSplitState
. - Changes the values in the
SoldAsVacant
column fromY
andN
toYes
andNo
. - Removes duplicate rows from the data set.
- Deletes the
OwnerAddress
,TaxDistrict
, andPropertyAddress
columns.
The script is written in T-SQL and can be run in SQL Server Management Studio.
- Open SQL Server Management Studio.
- Connect to the database that contains the
NashvilleHousing
table. - Right-click on the
NashvilleHousing
table and select "Open Table". - Copy and paste the script into the query window.
- Click the "Execute" button.
The script will run and the data set will be cleaned.
- The
SaleDate
column was originally in a variety of formats. The script standardizes the format toYYYY-MM-DD
. - The
PropertyAddress
column was missing data for some rows. The script populates the missing data by joining theNashvilleHousing
table to itself on theParcelID
column. - The
PropertyAddress
andOwnerAddress
columns were both in the formatAddress, City, State
. The script breaks these columns out into three separate columns for each address. - The
SoldAsVacant
column originally contained the valuesY
andN
. The script changes these values toYes
andNo
for consistency. - The script removes duplicate rows from the data set based on the
ParcelID
,PropertyAddress
,SalePrice
,SaleDate
, andLegalReference
columns. - The script deletes the
OwnerAddress
,TaxDistrict
, andPropertyAddress
columns because they are no longer needed.