Sunday, October 7, 2007

On the trail of the missing runs

Last week, I noted that Bill James's basic Runs Created formula substantially underestimates the number of runs actually scored by IBL teams, by an average of 25%. I suggested two possible explanations for this:

1. IBL teams scored runs in ways not accounted for by the Runs Created formula, such as stolen bases, sacrifices or fielding errors.

2. The Runs Created formula only provides accurate estimates within a range of values typical of major league baseball, but it is not correctly calibrated for the IBL's level of play.

I'd like to eliminate the second hypothesis from consideration.

First, note that the range of batting stats in the IBL isn't that far out of line with the MLB. Team stats range from .234 to .294 for batting average, from .368 to .419 for on-base percentage, and from .327 to .515 for slugging average. For the MLB, the equivalent values for 2007 are .248-.288 (AVG), .317-.363 (OBP) and .385-.461 (SLG). The IBL's averages are generally higher, especially for walks and extra-base hits, but not exceptionally so. A run estimation formula which can't handle them wouldn't seem to be of much use, and it's hard to believe that the usual formulas would be out of their calibration range.

To check this, we can apply some alternative formulas for run estimation based on different statistical approaches and see whether they yield similar IBL estimates to James's formula. If so, that would indicate that we're not seeing a calibration range problem. If the IBL's figures are out of calibration range for James's Runs Created, there's no reason to believe they'd also be out of range for every other formula. And if they were, there's no reason to believe they'd yield similar erroneous results with other formulas; different formulas should respond differently to out-of-range values.
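As a rough sketch of what such a comparison involves, here are two of the estimators in Python. The Runs Created function is the standard basic formula. The Base Runs coefficients are one commonly published simple version and are an assumption here, not necessarily the version used for the figures in this post. The function names and the sample team line are made up for illustration.

```python
def runs_created_basic(ab, h, bb, tb):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (h + bb) * tb / (ab + bb)


def base_runs_simple(ab, h, bb, tb, hr):
    """Base Runs as A * B / (B + C) + D.

    The B-term weights below are an assumed, commonly cited simple
    version; other published versions use different weights.
    """
    a = h + bb - hr                                        # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02    # advancement
    c = ab - h                                             # outs
    d = hr                                                 # automatic runs
    return a * b / (b + c) + d


def shortfall_pct(estimated, actual):
    """Percentage by which an estimate falls short of actual runs."""
    return 100.0 * (actual - estimated) / actual


if __name__ == "__main__":
    # Hypothetical team totals, NOT real IBL or MLB data.
    team = dict(ab=1200, h=330, bb=200, tb=480, hr=20)
    actual_runs = 250
    rc = runs_created_basic(team["ab"], team["h"], team["bb"], team["tb"])
    bsr = base_runs_simple(**team)
    print(f"RC-Basic shortfall: {shortfall_pct(rc, actual_runs):.1f}%")
    print(f"BsR shortfall:      {shortfall_pct(bsr, actual_runs):.1f}%")
```

If two formulas built on different structures fall short by a similar percentage, that points to missing inputs rather than a calibration problem in any one formula.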

So let's take a look. I've chosen the following formulas, along with James's RC-Basic: Base Runs (BsR), ERP and XRR. The results:

As you can see, all the formulas substantially underestimate IBL run creation, and all by more or less the same amount, by some 30 runs per team on average. Clearly, hypothesis 2 is refuted. IBL teams were scoring runs in ways not captured by the conventional run estimation formulas.

What might those be? Let's look at the frequency of some game events in the 2007 MLB and IBL.

I've cherry-picked the interesting numbers from the season averages and calculated them in terms of events per 100 plate appearances, to normalize for the different game lengths. The results:

IBL teams scored 33% more runs per PA than major league teams, even though, perhaps surprisingly, there isn't much difference in average rates of hits, home runs or strikeouts. So where do all those extra runs come from?

Presumably from the greater numbers of walks (52% higher), stolen bases (nearly four times as many as in the MLB), hit batters and errors (over three times as many), wild pitches and passed balls.
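The normalization behind these comparisons is simple enough to sketch in a few lines. The function names are illustrative and the totals in the example are placeholders, not the actual league figures.

```python
def per_100_pa(events, pa):
    """Convert raw event counts into events per 100 plate appearances."""
    return {name: 100.0 * count / pa for name, count in events.items()}


def pct_difference(rate_a, rate_b):
    """How much higher rate_a is than rate_b, as a percentage of rate_b."""
    return 100.0 * (rate_a - rate_b) / rate_b


if __name__ == "__main__":
    # Placeholder totals for two leagues, NOT real data.
    league_a = per_100_pa({"BB": 150, "SB": 40}, pa=1000)
    league_b = per_100_pa({"BB": 500, "SB": 50}, pa=5000)
    for event in league_a:
        diff = pct_difference(league_a[event], league_b[event])
        print(f"{event}: {league_a[event]:.1f} vs {league_b[event]:.1f} "
              f"per 100 PA ({diff:+.0f}%)")
```

Working per plate appearance rather than per game keeps the comparison fair when the two leagues play games of different lengths.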

The basic Runs Created formula doesn't consider stolen bases, nor does the version of Base Runs used above. ERP and XRR do, which presumably accounts for some of their improved accuracy over the other two formulas for the IBL. But none of the formulas incorporates errors, wild pitches or passed balls. It will be interesting to see if we can find a way to adjust them appropriately.


Sky said...

Love the fact that you're tackling this.

Yes, I think the major reason that your Runs and RC numbers aren't syncing up is because of the errors (and HBP, to a certain extent). Times on base is more than just H+BB. You might also look to include catcher's interference and dropped third strikes if those things happen a lot in the IBL (I have no idea).

With BaseRuns, it's very likely that the B term deserves different weightings. And once you add in the errors (etc) that you're missing, you could just apply a multiplier to the B term to make BsR exactly equal to actual runs. That's sort of cheating, but would be the best route if you're looking to get custom linear weights.
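A minimal sketch of that B-term multiplier, assuming the usual BsR structure A * B / (B + C) + D. Setting A * mB / (mB + C) + D equal to actual runs R and solving for m gives m = (R - D) * C / (B * (A - (R - D))). The function names and the sample inputs are illustrative only.

```python
def base_runs(a, b, c, d):
    """Base Runs from its four components: A * B / (B + C) + D."""
    return a * b / (b + c) + d


def b_multiplier(a, b, c, d, actual_runs):
    """Multiplier m on B such that base_runs(a, m * b, c, d) == actual_runs."""
    target = actual_runs - d  # runs that must come from A * B / (B + C)
    if b <= 0 or c <= 0 or not 0 < target < a:
        raise ValueError("no positive multiplier exists for these inputs")
    return target * c / (b * (a - target))


if __name__ == "__main__":
    # Hypothetical component values, NOT real league totals.
    a, b, c, d = 500.0, 400.0, 900.0, 20.0
    m = b_multiplier(a, b, c, d, actual_runs=250.0)
    print(f"multiplier on B: {m:.3f}")
    print(f"calibrated BsR:  {base_runs(a, m * b, c, d):.1f}")
```

The multiplier only exists when the non-home-run runs are fewer than the baserunners in A, which is why the function checks that first.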

iblemetrician said...


Thanks for the encouragement.

HBP shouldn't be an issue - I'm counting them as walks everywhere. I don't know about interference and dropped third strikes - they certainly occurred, but I doubt many batters reached base due to them.

Tangotiger said...

Base runs does have a term for errors:

iblemetrician said...


Thanks for pointing me to that page. I had seen it before but I hadn't bothered to work out what you were doing there. Now I get it, and I've applied it too. The results are coming.