Data consistency issues #3

kevinburke · 2018-02-18T06:01:41Z

Just scanning the 2016 file, I found the following entries:

May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map 
May 1 jr. 1 BD. Washer & Dryer in unit! $99 deposit $3250 / 1br - 550ft2 - (nob hill) pic map 
May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map 
Apr 29 Exceptional Pacific Heights TIC $799000 / 2br - (Pacific Heights) pic
Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic

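The first two are in San Jose, and the same listing appears twice. The next three (the Nob Hill ones) get listed as $99 by the "extract-craigslist" and "calc-medians" scripts, apparently because the "$99 deposit" in the title comes before the actual rent. The last one is not in San Francisco.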

Do you deduplicate or strip these out anywhere before doing analysis on them? I understand you can work around these issues a little bit by taking the median, but I do worry especially about overreporting at the low end.

Here's a script I used to work around these problems a little bit. I need to add deduplication to it.

package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
	"regexp"
	"sort"
	"strconv"

	"github.com/kevinburke/housing-inventory-analysis/stats"
)

var parseRx = regexp.MustCompile(`\$[0-9]{2,10}`)

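// getPrice picks a rent from the dollar amounts found on one line. With a
// single amount it returns that amount; otherwise it returns the larger of the
// first two. It returns -1 when there are no amounts or when the first two are
// both under $200.
func getPrice(linePrices []string) int {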
	if len(linePrices) == 0 {
		return -1
	}
	prices := make([]int, len(linePrices))
	for i := range linePrices {
		if len(linePrices[i]) < 2 {
			panic("too short: " + linePrices[i])
		}
		price, err := strconv.Atoi(linePrices[i][1:])
		if err != nil {
			panic(err)
		}
		prices[i] = price
	}
	if len(linePrices) == 1 {
		return prices[0]
	}
	if prices[0] < 200 && prices[1] < 200 {
		return -1
	}
	if prices[1] > prices[0] {
		return prices[1]
	}
	return prices[0]
}

func main() {
	flag.Parse()
	f, err := os.Open(flag.Arg(0))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	bs := bufio.NewScanner(f)
	prices := make([]float64, 0)
	for bs.Scan() {
		linePrices := parseRx.FindAllString(bs.Text(), -1)
		if len(linePrices) > 0 {
			price := getPrice(linePrices)
			if price < 0 || price > 100000 {
				// sf is expensive, but not *that* expensive
				continue
			}
			prices = append(prices, float64(price))
		}
	}
	if err := bs.Err(); err != nil {
		log.Fatal(err)
	}
	sort.Float64s(prices)
	vals := stats.Sample{Xs: prices}
	fmt.Printf("Total rows: %d\n", len(prices))
	for i := float64(1); i <= 9; i++ {
		fmt.Printf("%dth %%ile: %v\n", int(i)*10, vals.Percentile(0.1*i))
	}
}

e-n-f · 2018-02-20T03:15:56Z

Thanks for the script! I was not doing any filtering on the files, so you have probably found some errors.
