e-n-f/housing-inventory

Data consistency issues

Opened this issue · 1 comment

Just scanning the 2016 file, I found the following entries:

May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map 
May 1 jr. 1 BD. Washer & Dryer in unit! $99 deposit $3250 / 1br - 550ft2 - (nob hill) pic map 
May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map 
Apr 29 Exceptional Pacific Heights TIC $799000 / 2br - (Pacific Heights) pic
Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic

The first two are the same San Jose listing, appearing twice. The next three get listed as $99 by the "extract-craigslist" and "calc-medians" scripts, presumably because the $99 deposit in the title comes before the actual rent. The last one is not in San Francisco.
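
A minimal sketch of why the $99 figure could win, assuming the scripts take the first dollar amount on each line (an assumption, not confirmed against their source). The regular expression is the one from the script further down in this issue; `dollarAmounts` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// amountRx is the same pattern as parseRx in the script posted in this issue.
var amountRx = regexp.MustCompile(`\$[0-9]{2,10}`)

// dollarAmounts (hypothetical helper) returns every dollar amount on a line,
// in order of appearance.
func dollarAmounts(line string) []string {
	return amountRx.FindAllString(line, -1)
}

func main() {
	lines := []string{
		"May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map",
		"May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map",
	}
	for _, l := range lines {
		// The deposit is the first amount on both lines; the rent is second.
		fmt.Println(dollarAmounts(l))
	}
}
```

On both sample lines the deposit is matched first and the rent second, so anything that keeps only the first match would record $99.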

Do you deduplicate or strip these out anywhere before doing analysis on them? I understand that taking the median mitigates some of this, but I am especially worried about overreporting at the low end.

Here's a script I used to work around some of these problems. It still needs deduplication.

package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
	"regexp"
	"sort"
	"strconv"

	"github.com/kevinburke/housing-inventory-analysis/stats"
)

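// parseRx matches dollar amounts of 2 to 10 digits, e.g. "$99" or "$3250".
var parseRx = regexp.MustCompile(`\$[0-9]{2,10}`)

// getPrice picks the most plausible price from the dollar amounts found on one
// line. It returns -1 when the line should be skipped.
func getPrice(linePrices []string) int {
	if len(linePrices) == 0 {
		return -1
	}
	prices := make([]int, len(linePrices))
	for i := range linePrices {
		if len(linePrices[i]) < 2 {
			panic("too short: " + linePrices[i])
		}
		// Drop the leading "$" before converting.
		price, err := strconv.Atoi(linePrices[i][1:])
		if err != nil {
			panic(err)
		}
		prices[i] = price
	}
	if len(linePrices) == 1 {
		return prices[0]
	}
	// Only the first two amounts are compared. If both are under $200,
	// neither looks like a rent, so skip the line.
	if prices[0] < 200 && prices[1] < 200 {
		return -1
	}
	// Otherwise take the larger one, so "$99 deposit $3250" yields 3250.
	if prices[1] > prices[0] {
		return prices[1]
	}
	return prices[0]
}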

func main() {
	flag.Parse()
	f, err := os.Open(flag.Arg(0))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	bs := bufio.NewScanner(f)
	prices := make([]float64, 0)
	for bs.Scan() {
		linePrices := parseRx.FindAllString(bs.Text(), -1)
		if len(linePrices) > 0 {
			price := getPrice(linePrices)
			if price < 0 || price > 100000 {
				// sf is expensive, but not *that* expensive
				continue
			}
			prices = append(prices, float64(price))
		}
	}
	if err := bs.Err(); err != nil {
		log.Fatal(err)
	}
	sort.Float64s(prices)
	vals := stats.Sample{Xs: prices}
	fmt.Printf("Total rows: %d\n", len(prices))
	for i := float64(1); i <= 9; i++ {
		fmt.Printf("%dth %%ile: %v\n", int(i)*10, vals.Percentile(0.1*i))
	}
}
e-n-f commented

Thanks for the script! I was not doing any filtering on the files, so you have probably found some errors.