<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tanya Schlusser</title><link>https://tanyaschlusser.github.io/</link><description>Data-related analyses, tools, and news.</description><atom:link href="https://tanyaschlusser.github.io/rss.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2019 &lt;a href="mailto:tanya@tickel.net"&gt;Tanya Schlusser&lt;/a&gt; </copyright><lastBuildDate>Tue, 08 Jan 2019 06:13:15 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>What's so great about Knime?</title><link>https://tanyaschlusser.github.io/posts/whats-so-great-about-knime/</link><dc:creator>Tanya Schlusser</dc:creator><description>&lt;div&gt;&lt;p&gt;Last March, (for the fifth time, according to
&lt;a href="https://www.forestgt.com.au/latest-news/2018/3/2/knime-2018-gartner-magic-quadrant-market-leader"&gt;Forest Grove Technology&lt;/a&gt;),
the &lt;a href="https://www.knime.com/"&gt;Knime Analytics platform&lt;/a&gt; was named a Gartner Magic Quadrant
leader. This year's other leaders are &lt;a href="https://www.alteryx.com/"&gt;Alteryx&lt;/a&gt;,
&lt;a href="https://www.sas.com/"&gt;SAS&lt;/a&gt;, &lt;a href="https://rapidminer.com/"&gt;RapidMiner&lt;/a&gt;,
and &lt;a href="https://www.h2o.ai/"&gt;H2O.ai&lt;/a&gt;.
The best thing I learned from the announcement? Knime is open source,
and free for individual users—I can afford to look at it!&lt;/p&gt;
&lt;p&gt;Knime (silent "k"; rhymes with "dime") provides a graphical user interface
to chain together blocks that represent steps in a data science workflow.
(So it's like Pentaho or Informatica, but for machine learning.
 Or &lt;a href="http://www.ni.com/en-us/shop/labview.html"&gt;LabVIEW&lt;/a&gt; if you have an
engineering background.)&lt;/p&gt;
&lt;p&gt;It has dozens of built-in data access and transformation functions,
statistical inference and machine learning algorithms,
&lt;a href="https://sourceforge.net/projects/pmml/"&gt;PMML&lt;/a&gt;,
custom &lt;a href="https://www.knime.com/blog/blending-knime-and-python"&gt;Python&lt;/a&gt;,
&lt;a href="https://www.knime.com/nodeguide/scripting/java/example-of-java-snippet"&gt;Java&lt;/a&gt;,
&lt;a href="https://www.knime.com/nodeguide/scripting/r/example-of-r-snippet"&gt;R&lt;/a&gt;, and
&lt;a href="https://informationentropy.wordpress.com/2016/03/24/programming-nodes-in-knime-with-scala-part-i/"&gt;Scala&lt;/a&gt; nodes,
a &lt;a href="https://www.knime.com/nodeguide"&gt;zillion other nodes&lt;/a&gt;,
and community plugins (since it's open source, anyone can
&lt;a href="https://www.knime.com/developer/example/extension-wizard"&gt;make a plugin&lt;/a&gt;).
Even better, Knime imposes structure and modularity
on a data science workflow by requiring that code fit into specified building blocks.&lt;/p&gt;
&lt;p&gt;This post implements the Bayesian NFL model from
&lt;a href="https://tanyaschlusser.github.io/posts/bayesian-updating-and-the-nfl/"&gt;last month&lt;/a&gt;
in Knime.
It adds the upstream and downstream workflows to pull new data each week and
write the model output to a spreadsheet: enough for a first look at this tool.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tanyaschlusser.github.io/posts/whats-so-great-about-knime/"&gt;Read more&lt;/a&gt; (14 min remaining to read)&lt;/p&gt;&lt;/div&gt;</description><category>data science</category><category>knime</category><category>mathjax</category><category>python</category><category>workflow</category><guid>https://tanyaschlusser.github.io/posts/whats-so-great-about-knime/</guid><pubDate>Tue, 08 Jan 2019 05:00:42 GMT</pubDate></item><item><title>Bayesian updating and the NFL</title><link>https://tanyaschlusser.github.io/posts/bayesian-updating-and-the-nfl/</link><dc:creator>Tanya Schlusser</dc:creator><description>&lt;figure&gt;&lt;img src="https://tanyaschlusser.github.io/posts/bayesian-updating-and-the-nfl/home_spreads.png"&gt;&lt;/figure&gt; &lt;div tabindex="-1" id="notebook" class="border-box-sizing"&gt;
    &lt;div class="container" id="notebook-container"&gt;

&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;It's football season again, hooray! Every year for my friends' football pool I try out a different algorithm. Invariably, my picks are around 60% accurate. Not terrible, but according to NFL Pickwatch (&lt;a href="https://web.archive.org/web/20180811195907/http://nflpickwatch.com/"&gt;archive&lt;/a&gt;, &lt;a href="https://nflpickwatch.com/"&gt;current season&lt;/a&gt;), the best pickers get to 68 or 69%. So, an amazing performance—my upper bound—is just under 70%, and the lower bound for a competitive model—the FiveThirtyEight baseline—is 60%.&lt;/p&gt;
&lt;p&gt;I've been modeling NFL outcomes for a couple of years, and running linear (predicting point spread) and logistic (predicting win probability) regressions given various team and player data. My best year so far incorporated the Vegas spread into the model, and my biggest disaster so far was an aggressive lasso model on every player in every offensive line, with team defenses lumped as a group. Attempting to track &lt;a href="https://www.pro-football-reference.com/players/injuries.htm"&gt;injuries&lt;/a&gt;, suspensions, and other changes to the starting lineup was not sustainable for the amount of time I wanted to spend.&lt;/p&gt;
&lt;p&gt;Enter Nate Silver's awesome &lt;span class="vocabulary" title="Arpad Elo was a Hungarian-American physics professor who invented the system to rank chess players. Silver adapted it for football, baseball, and most of the other sports on FiveThirtyEight."&gt;&lt;a href="https://fivethirtyeight.com/features/how-our-2017-nfl-predictions-work/"&gt;NFL Elo rankings&lt;/a&gt;&lt;/span&gt;, the aspirational target for this year. What's impressive is that he gets something like 60% accuracy out of literally no information but home field advantage and past scores. I particularly love that it updates weekly to incorporate the new information: this immediately says "Bayesian" and in fact is a lot like how people using their intuition make their picks anyway. A system like his, but with a more straightforward Bayesian model, is the goal of this post.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tanyaschlusser.github.io/posts/bayesian-updating-and-the-nfl/"&gt;Read more&lt;/a&gt; (25 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>a/b testing</category><category>bayesian</category><category>feature engineering</category><category>iterative</category><category>nfl</category><guid>https://tanyaschlusser.github.io/posts/bayesian-updating-and-the-nfl/</guid><pubDate>Sun, 09 Sep 2018 05:00:42 GMT</pubDate></item><item><title>Modeling property tax assessment in Cook County, IL</title><link>https://tanyaschlusser.github.io/posts/property-tax-cook-county/</link><dc:creator>Tanya Schlusser</dc:creator><description>&lt;figure&gt;&lt;img src="https://tanyaschlusser.github.io/posts/property-tax-cook-county/log_fit.png"&gt;&lt;/figure&gt; &lt;div tabindex="-1" id="notebook" class="border-box-sizing"&gt;
    &lt;div class="container" id="notebook-container"&gt;

&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;The year my Mom moved in down the street from us, my husband tried to get some local property tax appeal company to reduce her assessment. They refused, saying they thought there wasn't a case.&lt;/p&gt;
&lt;p&gt;The next year, she got a postcard from that same company: they would appeal her case and split the savings with her 50/50. Who wants to give up 50% of their tax savings? Plus, I was miffed from the prior year. I decided to try appealing it myself. Success!&lt;/p&gt;
&lt;p&gt;&lt;span class="vocabulary" title="A tool for automated testing of web applications, written in Java. I use the Python bindings."&gt;&lt;a href="https://selenium-python.readthedocs.io/"&gt;Selenium&lt;/a&gt;&lt;/span&gt; via Python bindings was used to pull the data from the web, and &lt;a href="https://www.statsmodels.org"&gt;statsmodels&lt;/a&gt;, with an interface that resembles R, was used to make the model.
&lt;/p&gt;&lt;p&gt;&lt;a href="https://tanyaschlusser.github.io/posts/property-tax-cook-county/"&gt;Read more&lt;/a&gt; (29 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>civic data</category><category>selenium</category><category>taxes</category><guid>https://tanyaschlusser.github.io/posts/property-tax-cook-county/</guid><pubDate>Sun, 19 Aug 2018 05:00:42 GMT</pubDate></item><item><title>MCMC and the Ising Model</title><link>https://tanyaschlusser.github.io/posts/mcmc-and-the-ising-model/</link><dc:creator>Tanya Schlusser</dc:creator><description>&lt;figure&gt;&lt;img src="https://tanyaschlusser.github.io/posts/mcmc-and-the-ising-model/spins.png"&gt;&lt;/figure&gt; &lt;div tabindex="-1" id="notebook" class="border-box-sizing"&gt;
    &lt;div class="container" id="notebook-container"&gt;

&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;span class="vocabulary" title="A sequence of random outcomes in which each step depends only on the immediately preceding step, not on the rest of the history."&gt;Markov chain&lt;/span&gt;
&lt;span class="vocabulary" title="A computer simulation technique using pseudo-random numbers to simulate random events."&gt;Monte Carlo&lt;/span&gt; (MCMC) methods are a category of numerical techniques used in Bayesian statistics. They numerically estimate the distribution of a variable (the &lt;span class="vocabulary" title="The prior times the likelihood, normalized, is the posterior distribution: the probability distribution of the target variable after incorporating the observed data."&gt;posterior&lt;/span&gt;) given two other distributions: the &lt;span class="vocabulary" title="A distribution that represents existing knowledge of a system. Often people choose a uniform (flat) distribution; or else something that is the known conjugate prior of a desired posterior distribution."&gt;prior&lt;/span&gt; and the &lt;span class="vocabulary" title="A special name for the probability mass (or density) function when you fix the observed data (e.g. `x`) and treat it as a function of the parameters (e.g. `mu` and `theta`). It's renamed 'likelihood' just to make that swap explicit when talking about it. The integral over the parameters may not equal one so you have to normalize."&gt;likelihood function&lt;/span&gt;, and are useful when direct integration of the prior times the likelihood is not tractable.&lt;/p&gt;
&lt;p&gt;I am new to Bayesian statistics, but became interested in the approach partly from exposure to the &lt;a href="https://tanyaschlusser.github.io/posts/mcmc-and-the-ising-model/"&gt;PyMC3 library&lt;/a&gt;, and partly from FiveThirtyEight's promoting it in a &lt;a href="https://fivethirtyeight.com/features/statisticians-found-one-thing-they-can-agree-on-its-time-to-stop-misusing-p-values/"&gt;commentary&lt;/a&gt; soon after the p-hacking scandals a few years back (&lt;a href="https://www.ncbi.nlm.nih.gov/pubmed/22006061"&gt;Simmons et al.&lt;/a&gt; coined 'p-hacking' in 2011, and &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359000"&gt;Head et al.&lt;/a&gt; quantified the scale of the issue in 2014).&lt;/p&gt;
&lt;p&gt;Until the 1980s, it was not realistic to use Bayesian techniques except when analytic solutions were possible. (Here's Wikipedia's &lt;a href="https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions"&gt;list of analytic options&lt;/a&gt;. They're still useful.) MCMC opens up more options.&lt;/p&gt;
&lt;p&gt;The Python library &lt;a href="https://docs.pymc.io/"&gt;pymc3&lt;/a&gt; provides a suite of modern Bayesian tools: both MCMC algorithms and variational inference. One of its core contributors, Thomas Wiecki, wrote a blog post entitled &lt;a href="https://twiecki.github.io/blog/2015/11/10/mcmc-sampling/"&gt;MCMC sampling for dummies&lt;/a&gt;, which was the inspiration for this post. It was enthusiastically received, and cited by people I follow as the best available explanation of MCMC. To my dismay, I didn't understand it; probably because he comes from a stats background and I come from engineering. This post is for people like me.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tanyaschlusser.github.io/posts/mcmc-and-the-ising-model/"&gt;Read more&lt;/a&gt; (24 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>bayesian</category><category>mcmc</category><category>pymc3</category><guid>https://tanyaschlusser.github.io/posts/mcmc-and-the-ising-model/</guid><pubDate>Sun, 29 Jul 2018 05:00:42 GMT</pubDate></item></channel></rss>