Monday, October 1, 2007

The Case of the Too Many Runs

I've been thinking about run creation.

One of the oldest questions in sabermetrics, dating back to the field's early days in the 1970s, is how to estimate the value of a team or player in terms of runs created. Pioneer baseball analyst Bill James devised a number of run estimation formulas, the simplest of which is impressive in both its elegance and its accuracy. It goes:

Runs Created = (Runners on base * Total bases advanced) / Opportunities

or, more precisely:

RC = ((H+BB) * TB) / (AB + BB)
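For anyone who wants to play with the numbers, here is a minimal sketch of the basic formula in Python. The function name and the sample team line are my own illustrative assumptions, not real statistics from either league.

```python
def runs_created(h, bb, tb, ab):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    opportunities = ab + bb
    if opportunities == 0:
        # No plate appearances counted, so nothing was created.
        return 0.0
    return (h + bb) * tb / opportunities


# Hypothetical team season line (made-up numbers for illustration only).
rc = runs_created(h=1500, bb=500, tb=2400, ab=5500)
print(round(rc, 1))  # 800.0
```

With this made-up line, 2,000 runners on base times 2,400 total bases, divided by 6,000 opportunities, comes to 800 runs.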

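Another way to write this formula is:

RC = OBP * SLG * AB

Here on-base percentage is simplified to (H+BB) / (AB+BB), and slugging average is TB / AB, so the two versions are algebraically identical.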

Using only statistics which are widely publicly available, this formula generally predicts the number of runs scored by a team to within 5% of its actual value. For the complete 2007 MLB season, for example, the Runs Created estimates per team range between 93.2% and 105.1% of the teams' actual runs scored, with only 4 out of 30 teams falling outside the 5% margin. The correlation coefficient between Runs and Runs Created is a striking 0.959 (perfect correlation is 1.000).
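The two accuracy checks described above (the ratio of estimated to actual runs, and the correlation coefficient) can be sketched roughly as follows. The team totals here are hypothetical placeholders, not the actual 2007 figures, and the helper names are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def outside_margin(estimated, actual, margin=0.05):
    """True if the estimate misses the actual value by more than the margin."""
    return abs(estimated / actual - 1.0) > margin


# Hypothetical team totals (illustrative only, not real league data).
actual_runs = [700, 750, 800, 850]
estimated_rc = [690, 770, 790, 900]

ratios = [e / a for e, a in zip(estimated_rc, actual_runs)]
misses = sum(outside_margin(e, a) for e, a in zip(estimated_rc, actual_runs))
r = pearson(estimated_rc, actual_runs)

print([round(x, 3) for x in ratios])
print("teams outside 5%:", misses)
print("correlation:", round(r, 3))
```

Swapping in each team's real RC estimate and runs scored would reproduce the kind of range, miss count, and correlation quoted above.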


Since Runs Created accurately estimates runs scored by major league teams, and the game of baseball is the same whether played in Seattle or Gezer, presumably it should work as well for the IBL as for MLB.

Right and wrong.

On the one hand, RC still correlates highly with runs scored per team. For the IBL's six teams, the correlation coefficient between the two figures is 0.966.

On the other hand, Runs Created consistently underestimates the actual number of runs scored in the IBL, by an average of 25.2%, ranging from 15.7% for Modiin to a full 36.3% for Tel Aviv.

There are two possible explanations for this:

1. IBL teams were scoring runs in ways not accounted for by the Runs Created formula, such as stolen bases, sacrifices or fielding errors.

2. Bill James's Runs Created formula does not actually capture anything essential about the way runs are created in baseball. It just coincidentally happens to work within the range of values typical of major league baseball. In leagues outside that range, RC is not correctly calibrated to estimate actual runs.

I don't have enough data yet to adequately evaluate explanation 1, but a glance at the pitching statistics indicates that there might be something to it. MLB teams gave up an average of 777 runs this season, 717 of them scored as earned runs. That means 7.7% of runs scored were unearned, the direct or indirect result of fielding errors. (Yes, I know the error stats are unreliable, but it's the best I have.)

In the IBL, by contrast, the average team gave up 213 runs, but just 170 of them were earned runs. Fully 20% of runs were scored as unearned. That means they would only partially be reflected in the batting statistics and the Runs Created formula, since fielding errors aren't included in on-base percentage or slugging average.
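The unearned-run arithmetic is easy to check from the per-team averages quoted above. This is a small sketch with a helper name of my own choosing. It divides unearned runs by total runs allowed in both leagues, which gives about 7.7% for MLB and 20.2% for the IBL. (Dividing the MLB unearned runs by earned runs instead of by total runs gives about 8.4%.)

```python
def unearned_share(runs_allowed, earned_runs):
    """Fraction of all runs allowed that were scored as unearned."""
    return (runs_allowed - earned_runs) / runs_allowed


# Average runs and earned runs allowed per team, as quoted in the post.
mlb = unearned_share(777, 717)
ibl = unearned_share(213, 170)

print(f"MLB unearned share: {mlb:.1%}")
print(f"IBL unearned share: {ibl:.1%}")
print(f"Difference: {(ibl - mlb) * 100:.1f} percentage points")
```

The gap between the two leagues comes to roughly 12.5 percentage points.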

This is still far from accounting for the 25% extra runs scored over the Runs Created estimates, but it may explain close to half of it. I'd need to study the data further to know for sure.

Regarding explanation 2, it is not at all farfetched to suggest that Bill James's Runs Created formula is in large part a lucky guess. For a more detailed exposition of runs created estimates and the problems with them, see this essay by sabermetrician Tangotiger (and the sequels here and here), in which he notes:

However, the reason that Runs Created "works" is not because of its construction. It's purely an accident that it works. It just so happens that the points at which Runs Created and common sense intersect is exactly at the same points at which MLB teams play at!

To determine whether this explains the IBL results, we'd have to examine whether the IBL's teams play within the range of values for which Runs Created is accurate. I'll try to get to that some other time.

Chag Sameach, and enjoy the MLB playoffs!
