Tuesday, October 9, 2007

The run mystery, solved?

With some help from experienced sabermetricians over at the Baseball Fever forums, I may have solved the Case of the Too Many Runs.

Sure enough, the main missing factor seems to be errors, which were far more frequent in the IBL than in the majors. MLB run estimators can ignore errors and still get pretty good results, but that won't do for the IBL. I expected to do plenty of multiplier tweaking to get the numbers to match up, but I was pleasantly surprised.

Turns out, Tango Tiger has already worked out multipliers for the BaseRuns estimator which take errors into account, along with just about every other imaginable game event. The complete set of weights can be found on his website.

I don't have stats for every category on that list, so I just took the ones I have available.

Remember the general formula for BaseRuns:

BaseRuns = A * B / (B + C) + HR

where

A = number of runners on base
B = expected number of bases advanced
C = outs
HR = home runs, of course

Using Tango Tiger's weights for the data I currently have, that yields:

A = (H - HR) + BB + IBB + HBP + E + 0.08*Sac

B = .726*1B + 1.948*2B + 3.134*3B + 1.694*HR + .052*BB - .483*IBB + .163*HBP + .799*E + .727*Sac - .057*K - .004*(Other Outs) + .813*SB - 1.188*CS

C = AB - H + .92*Sac
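
For anyone who wants to plug in their own numbers, here is a rough Python sketch of the three factors above. It is only an illustration under a few assumptions of mine: the function name and the dictionary keys are made up, singles are derived as H - 2B - 3B - HR, "Other Outs" is taken to mean AB - H - K, and BB and IBB are used exactly as they appear in the formulas, so whether walks should already include intentional walks depends on how your stats are recorded.

```python
def base_runs(stats):
    """Estimate runs from a dict of counting stats.

    Keys (hypothetical names): AB, H, 2B, 3B, HR, BB, IBB, HBP, E,
    Sac, K, SB, CS. Any missing category is treated as zero.
    """
    def g(key):
        return stats.get(key, 0)

    singles = g("H") - g("2B") - g("3B") - g("HR")
    # Assumption: "Other Outs" means batting outs that are not strikeouts.
    other_outs = g("AB") - g("H") - g("K")

    # A: runners on base
    a = (g("H") - g("HR")) + g("BB") + g("IBB") + g("HBP") + g("E") \
        + 0.08 * g("Sac")

    # B: advancement factor, using the weights quoted above
    b = (0.726 * singles + 1.948 * g("2B") + 3.134 * g("3B")
         + 1.694 * g("HR") + 0.052 * g("BB") - 0.483 * g("IBB")
         + 0.163 * g("HBP") + 0.799 * g("E") + 0.727 * g("Sac")
         - 0.057 * g("K") - 0.004 * other_outs
         + 0.813 * g("SB") - 1.188 * g("CS"))

    # C: outs
    c = g("AB") - g("H") + 0.92 * g("Sac")

    if b + c == 0:
        # No plate appearances to work with: only home runs can score.
        return float(g("HR"))
    return a * b / (b + c) + g("HR")
```

Calling `base_runs` once with a team's line and once with the `E` entry removed gives a quick feel for how much of the estimate comes from the error terms.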

The results:

Thankfully, the IBL does indeed obey the universal natural laws of baseball. The estimates are all accurate to within 6.2%, with the league estimate less than 1% off the actual number of runs scored. And this was achieved without any coefficient tweaking on my part - Tango's formula was left as is, except for omitting those terms for which I have no data.

It seems fair to conclude that the main difference between IBL and MLB run production was the high rate of errors, followed by the higher stolen base rate.

To check the robustness of these results, I applied them to the IBL stats broken down by field. Sure enough, BaseRuns accurately predicted the runs scored in three different run scoring environments to within 2% overall (though, with the small sample sizes, estimates were off by up to 15% for some individual teams at specific venues).
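
Part of the gap between the overall error and the individual team errors is plain arithmetic: when estimates are summed, overestimates and underestimates partly cancel. A minimal sketch of that effect follows. The team names and run totals are invented purely for illustration and are not IBL data, and `pct_error` is a helper name of my own.

```python
def pct_error(estimated, actual):
    """Signed percentage error of an estimate relative to the actual value."""
    return 100.0 * (estimated - actual) / actual

# Made-up (estimated, actual) run totals, NOT real IBL numbers.
teams = {
    "Team A": (110.0, 100.0),  # overestimated
    "Team B": (90.0, 100.0),   # underestimated
}

# Error for each team taken on its own.
per_team = {name: pct_error(est, act) for name, (est, act) in teams.items()}

# Error of the summed estimate against the summed actual runs.
total_est = sum(est for est, _ in teams.values())
total_act = sum(act for _, act in teams.values())
overall = pct_error(total_est, total_act)

print(per_team)
print(overall)
```

In this toy case each team is off by 10% in opposite directions, yet the combined estimate matches the combined total exactly.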

I could try the same exercise with a different run estimator adjusted for errors and steals, but I'm not sure there's much point. The conclusion is clear: You can't analyze IBL run scoring without accounting for errors.

1 comment:

Justin said...

That's awesome. One of the nicest demonstrations of the flexibility of base runs that I've seen. Congrats on solving this!
-jinaz