My recent purchase of the book Baseball Hacks has made me dust off some of my (I must admit impossibly rudimentary) knowledge of statistics and probability and think about baseball in that context.
I think it was several years ago while reading Lewis' Moneyball that I first became aware of Bill James' Pythagorean Theorem of Baseball. It states that a team's expected winning percentage is the ratio of the square of the runs the team scores to the sum of the squares of the runs scored and the runs scored against; multiply that by the number of games played and you get the expected number of wins. At the time I heard of this, I really didn't have much insight as to why that should be (I'm told that James takes five pages to derive it in his 1981 Baseball Abstract, but I don't have a copy of that) and that kind of bothered me.
So, I set out to try to derive my own formula that would give the expected number of wins. I downloaded the game logs for 2004 and extracted the 162 games that the Oakland Athletics played. They scored 793 runs and had 742 runs scored against them. It turns out that they actually won 91 games. How does that compare to what Bill James predicted? A bit of math and you'll find that by James' Pythagorean Formula, they would be expected to win 86.37 games. Not too bad, but it seems that the A's might have overperformed a bit.
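For anyone who wants to check the "bit of math", here is a minimal Python sketch of the calculation. The function name and the exponent parameter are my own choices for illustration, not anything taken from James.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins under the Pythagorean formula.

    The exponent of 2 is the classic form described above; it is exposed
    as a parameter only so it is easy to experiment with.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


# 2004 Oakland Athletics: 793 runs scored, 742 runs allowed, 162 games.
print(pythagorean_wins(793, 742))
```

The result comes out a little under 86.4 wins, in line with the figure quoted above.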
I tried to derive a similar number by a different tactic. I assumed that baseball scoring is a Poisson process (this is one of many simplifying assumptions that isn't true, but it simplifies the math). I then wrote a simple little simulator that played random seasons of baseball and totalled the runs scored by a team whose expected total is 793 runs. (Basically, the time to the next score can be generated by getting a uniform random variable u in the range zero to one, and then computing the time to the next score as -log(u)/r, where r is the average rate, in this case 793 / 162 runs per game.) You'll find that you don't get exactly 793 runs very often, and the distribution of potential results forms a nice looking bell curve.
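A simulator along those lines might look something like the sketch below. This is not my original code, just a reconstruction under the same assumptions: each game is treated as one unit of time, runs arrive with exponentially distributed gaps, and the function names, seed and number of simulated seasons are arbitrary.

```python
import math
import random


def simulate_game_runs(rate, rng):
    """Count Poisson-process arrivals (runs) within one game (unit interval)."""
    elapsed = 0.0
    runs = 0
    while True:
        u = 1.0 - rng.random()           # uniform in (0, 1], so log(u) is defined
        elapsed += -math.log(u) / rate   # exponential gap until the next run
        if elapsed > 1.0:
            return runs
        runs += 1


def simulate_season_runs(rate, games=162, rng=None):
    """Total runs over a simulated season of independent games."""
    if rng is None:
        rng = random.Random()
    return sum(simulate_game_runs(rate, rng) for _ in range(games))


rng = random.Random(2004)                # fixed seed so runs are repeatable
rate = 793 / 162                         # average runs per game
totals = [simulate_season_runs(rate, rng=rng) for _ in range(1000)]
mean_runs = sum(totals) / len(totals)
print("mean:", mean_runs, "min:", min(totals), "max:", max(totals))
```

Plotting a histogram of totals is what shows the bell curve; the mean should sit near 793 while individual seasons wander a fair way on either side.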
Fun, but not what we were originally trying to do.
It turns out that you can pretty easily determine, for any number k, the probability that a given Poisson random variable has precisely k occurrences in the unit interval. You can look it up for yourself on Wikipedia, and it's just a few lines of code to implement. Then, for each possible pair of scores, say home and visitor (which I truncate at 20 runs per team), you determine the probability of that particular combination, and simply sum up the probabilities of the combinations in which your team outscores the other.
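The "few lines of code" for the Poisson probability are roughly as follows. This is a sketch using the standard formula exp(-r) * r^k / k!; the function name is just something I picked for illustration.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events in a unit interval at the given average rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


rate = 793 / 162  # average runs per game for the 2004 A's
for k in range(11):
    print(k, poisson_pmf(k, rate))
```

Under the independence assumption, the probability of a particular final score is then just the product of the two teams' individual probabilities.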
Well, with one complication: it doesn't tell us how to score ties. I summed up the probability that tie games would occur, and found that according to the theory, ties should (for the 2004 Athletics) have happened about 21.18 times (they actually happened 19 times in 2004, not bad!). I decided to do the simplest possible thing, and just assume that each team will win 50% of the games which are tied after regulation. So, when I add half the tie probability to the previously computed win probability, and multiply by 162 games...
I get a prediction that the A's should have won 87.48 games. And I understand most of the assumptions and math that lead to this conclusion. Neat!
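Putting the pieces together, the whole calculation can be sketched as below. Again, this is a reconstruction rather than my original program: it assumes the two teams' scores are independent Poisson variables, truncates at 20 runs per team, and splits ties evenly. The function names are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events in a unit interval at the given average rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def expected_wins(runs_scored, runs_allowed, games=162, max_runs=20):
    """Return (expected wins, expected ties) with ties split 50/50."""
    p_for = [poisson_pmf(k, runs_scored / games) for k in range(max_runs + 1)]
    p_against = [poisson_pmf(k, runs_allowed / games) for k in range(max_runs + 1)]

    p_win = 0.0
    p_tie = 0.0
    for ours in range(max_runs + 1):
        for theirs in range(max_runs + 1):
            p = p_for[ours] * p_against[theirs]  # independence assumption
            if ours > theirs:
                p_win += p
            elif ours == theirs:
                p_tie += p

    return games * (p_win + 0.5 * p_tie), games * p_tie


wins, ties = expected_wins(793, 742)
print("expected wins:", wins)
print("expected ties after regulation:", ties)
```

If I've reconstructed it faithfully, the output should land close to the figures quoted above. The 50/50 split is the crudest possible way to handle ties; a better team presumably wins somewhat more than half of its extra-inning games.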
Oh, on the less theoretical front, Chavez, Thomas and Bradley hit three homers yesterday on three consecutive pitches and the A's won. I tuned in at the top of the 9th in today's game with the A's leading 3-1, just in time to see Huston Street leave ball after ball up in the strike zone and get hammered for 4 runs. The A's would load the bases in the bottom of the ninth, but Swisher flied out to end the game.
It's best not to lose sight of the game for the mathematics.