Data consistency issues #3

Open
kevinburke opened this issue Feb 18, 2018 · 1 comment
@kevinburke
Contributor

kevinburke commented Feb 18, 2018

Just scanning the 2016 file, I found the following entries:

May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map 
May 1 jr. 1 BD. Washer & Dryer in unit! $99 deposit $3250 / 1br - 550ft2 - (nob hill) pic map 
May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map 
Apr 29 Exceptional Pacific Heights TIC $799000 / 2br - (Pacific Heights) pic
Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic

The first two are in San Jose, and the same listing appears twice. The three Nob Hill listings get recorded as $99 by the "extract-craigslist" and "calc-medians" scripts. The last one is not in San Francisco.
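
For the out-of-area rows, a rough sketch of a location check might look like the following. It assumes the location is the last parenthesized group on the line, which holds for the samples above but may not hold for the whole file. The names `locationRx`, `sfNeighborhoods`, `location` and `inSF` are hypothetical, and the allowlist is deliberately incomplete.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// locationRx matches a parenthesized group with no nested parentheses.
var locationRx = regexp.MustCompile(`\(([^()]+)\)`)

// sfNeighborhoods is a hypothetical, incomplete allowlist used only for
// illustration; a real one would need every neighborhood name in the data.
var sfNeighborhoods = map[string]bool{
	"nob hill":        true,
	"pacific heights": true,
}

// location returns the lowercased contents of the last parenthesized group
// on the line, or "" if the line has none.
func location(line string) string {
	matches := locationRx.FindAllStringSubmatch(line, -1)
	if len(matches) == 0 {
		return ""
	}
	last := matches[len(matches)-1]
	return strings.ToLower(strings.TrimSpace(last[1]))
}

// inSF reports whether the line's location is in the allowlist.
func inSF(line string) bool {
	return sfNeighborhoods[location(line)]
}

func main() {
	lines := []string{
		"May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map",
		"May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map",
		"Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic",
	}
	for _, l := range lines {
		fmt.Printf("%-40q inSF=%v\n", location(l), inSF(l))
	}
}
```

An allowlist errs on the side of dropping rows, so any neighborhood spelled differently from the list would be excluded until it is added.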

Do you deduplicate or strip these out anywhere before doing analysis on them? I understand you can work around these issues a little bit by taking the median, but I do worry especially about overreporting at the low end.

Here's a script I used to work around these problems a little bit. I need to add deduplication to it.

package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
	"regexp"
	"sort"
	"strconv"

	"github.com/kevinburke/housing-inventory-analysis/stats"
)

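// parseRx matches dollar amounts of 2 to 10 digits, e.g. "$99" or "$2885".
var parseRx = regexp.MustCompile(`\$[0-9]{2,10}`)

// getPrice picks the rent from the dollar amounts found on one line.
// It returns -1 when there is no usable price. When a line has two or more
// amounts, only the first two are compared and the larger one wins, so
// "$99 deposit! $2885" yields 2885 rather than 99.
func getPrice(linePrices []string) int {
	if len(linePrices) == 0 {
		return -1
	}
	prices := make([]int, len(linePrices))
	for i := range linePrices {
		if len(linePrices[i]) < 2 {
			panic("too short: " + linePrices[i])
		}
		// Strip the leading "$" before converting.
		price, err := strconv.Atoi(linePrices[i][1:])
		if err != nil {
			panic(err)
		}
		prices[i] = price
	}
	if len(linePrices) == 1 {
		return prices[0]
	}
	// Both amounts are too small to be a rent; skip the line.
	if prices[0] < 200 && prices[1] < 200 {
		return -1
	}
	if prices[1] > prices[0] {
		return prices[1]
	}
	return prices[0]
}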

func main() {
	flag.Parse()
	f, err := os.Open(flag.Arg(0))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	bs := bufio.NewScanner(f)
	prices := make([]float64, 0)
	for bs.Scan() {
		linePrices := parseRx.FindAllString(bs.Text(), -1)
		if len(linePrices) > 0 {
			price := getPrice(linePrices)
			if price < 0 || price > 100000 {
				// sf is expensive, but not *that* expensive
				continue
			}
			prices = append(prices, float64(price))
		}
	}
	if err := bs.Err(); err != nil {
		log.Fatal(err)
	}
	sort.Float64s(prices)
	vals := stats.Sample{Xs: prices}
	fmt.Printf("Total rows: %d\n", len(prices))
	for i := float64(1); i <= 9; i++ {
		fmt.Printf("%dth %%ile: %v\n", int(i)*10, vals.Percentile(0.1*i))
	}
}
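
The deduplication step that is still missing could be sketched roughly as below. This is a minimal sketch, not part of the script above: it assumes a duplicate is an identical line (same date, title, price and location) once surrounding whitespace is trimmed, so a listing reposted under a new date would not be caught. The name `dedupeLines` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// dedupeLines returns lines with exact repeats removed, keeping the first
// occurrence of each. Lines are compared after trimming surrounding
// whitespace, and blank lines are dropped.
// Assumption: duplicates are byte-identical apart from that whitespace.
func dedupeLines(lines []string) []string {
	seen := make(map[string]struct{}, len(lines))
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		key := strings.TrimSpace(line)
		if key == "" {
			continue
		}
		if _, ok := seen[key]; ok {
			continue // already emitted this listing
		}
		seen[key] = struct{}{}
		out = append(out, line)
	}
	return out
}

func main() {
	lines := []string{
		"May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map ",
		"May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map",
		"May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map",
	}
	for _, l := range dedupeLines(lines) {
		fmt.Println(l)
	}
}
```

In the script above this would mean collecting the scanned lines first, or keeping the `seen` set inside the `bs.Scan()` loop and skipping a line before its price is parsed.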
@e-n-f
Owner

e-n-f commented Feb 20, 2018

Thanks for the script! I was not doing any filtering on the files, so you have probably found some errors.
