Monday, November 19, 2007

Who were the IBL's best hitters? (Part II)

Last time, I ranked the IBL's top hitters using non-customized run estimators (Base Runs and Linear Weights). I promised that this time I'd apply customized linear weights, derived to suit the IBL's specific run-scoring environment.

I modified my approach somewhat after consulting with more experienced analysts; you can find the discussion here. Briefly, I ran a linear regression analysis on IBL data broken down by half-inning. Since run scoring in baseball occurs on a per-inning basis, aggregating the data into games (let alone seasons) reinforces all sorts of potential biases in the data. Park effects, for example, or team-specific skills, would appear to be associated with each other in aggregated data. That's much less likely in per-inning data, since there are so few game events in each half-inning.

More important for the IBL, I simply didn't have enough data points to get statistically significant results at a coarser granularity. There were only 122 games and six teams, and I'm trying to estimate the run values of some 15 different types of game events. Going down to the inning level gave me over 1600 independent data points, more than enough to estimate 15 coefficients (except for the rarest of game events).
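For readers who want to see the mechanics, here is a minimal sketch of this kind of regression. It is not my actual analysis: the data are simulated, the five event columns and their "true" run values are made up for illustration, and fitting without an intercept is an assumption. It uses NumPy's least-squares solver and reports a standard error for each coefficient.

```python
import numpy as np

# Hypothetical event columns; the real analysis used about 15 event types.
EVENTS = ["single", "double", "home_run", "walk", "out"]

# Made-up "true" run values, used only to generate toy data for this sketch.
TRUE_WEIGHTS = np.array([0.55, 0.85, 1.40, 0.45, -0.15])


def simulate_half_innings(n, rng):
    """Generate toy per-half-inning event counts and noisy run totals."""
    # Each row counts how often each event occurred in one half-inning.
    counts = rng.poisson(lam=[1.0, 0.3, 0.15, 0.5, 3.0], size=(n, len(EVENTS)))
    noise = rng.normal(scale=0.5, size=n)
    runs = counts @ TRUE_WEIGHTS + noise
    return counts.astype(float), runs


def fit_linear_weights(counts, runs):
    """Ordinary least squares of runs on event counts (no intercept)."""
    weights, _, _, _ = np.linalg.lstsq(counts, runs, rcond=None)
    # Standard errors from the usual OLS formula: sqrt(diag(s^2 (X'X)^-1)).
    residuals = runs - counts @ weights
    dof = counts.shape[0] - counts.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(counts.T @ counts)))
    return weights, std_err


rng = np.random.default_rng(0)
counts, runs = simulate_half_innings(1600, rng)
weights, std_err = fit_linear_weights(counts, runs)
for name, w, se in zip(EVENTS, weights, std_err):
    print(f"{name:>9}: {w:+.3f} ({se:.3f})")
```

With about 1600 rows, the common events come back with tight standard errors and the rare ones with wider ones, which is the same pattern you would expect from real half-inning data.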

Based on preliminary results and feedback, I made a couple of changes to my initial approach. The main one is that I lumped together times reaching base on error with singles, since from the batter's perspective they should be the same. It didn't make sense that I was seeing a substantially lower weight for a reach-on-error than for a single.

So here are the IBL-specific linear weights, along with the margin of error for each in parentheses. They represent the average number of runs created by each game event.

Single or reached base on error: 0.586 (0.015)
Double: 0.844 (0.035)
Triple: 1.219 (0.127)
Home run: 1.438 (0.042)
Walk: 0.484 (0.018)
Hit by pitch: 0.519 (0.038)
Error (without batter reaching base): 0.302 (0.050)
Stolen base: 0.091 (0.026)
Caught stealing: -0.281 (0.058)
Sacrifice fly: 0.047 (0.069)
Sacrifice hit: 0.109 (0.082)
Intentional walk (add this to the weight for a walk): -0.167 (0.114)
Out (apply this to every out): -0.156 (0.010)
Strikeout (add this to the weight for an out): -0.017 (0.019)
Ground into double play (add this to the weight for an out): -0.286 (0.056)
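To make the "add this to" notes concrete, here is a rough sketch of how the weights above could be applied to a batting line. The dictionary keys and the `estimate_runs` helper are names I made up for illustration. The sketch assumes the counts are inclusive: the walk count includes intentional walks, and the out count includes strikeouts and double-play outs, so the add-on weights simply stack on top. How many outs to charge for a double play is left to whoever supplies the counts.

```python
# Weights copied from the list above; the key names are hypothetical.
IBL_WEIGHTS = {
    "single_or_roe": 0.586,
    "double": 0.844,
    "triple": 1.219,
    "home_run": 1.438,
    "walk": 0.484,
    "hit_by_pitch": 0.519,
    "error_no_reach": 0.302,
    "stolen_base": 0.091,
    "caught_stealing": -0.281,
    "sacrifice_fly": 0.047,
    "sacrifice_hit": 0.109,
    "intentional_walk": -0.167,  # added on top of the walk weight
    "out": -0.156,               # applied to every out
    "strikeout": -0.017,         # added on top of the out weight
    "gidp": -0.286,              # added on top of the out weight
}


def estimate_runs(counts):
    """Sum weight * count over the events in a batting line.

    `counts` maps event names to how often they occurred. Events that
    are missing count as zero; unknown event names raise KeyError.
    """
    return sum(IBL_WEIGHTS[event] * n for event, n in counts.items())


# An intentional walk is worth the walk weight plus the adjustment.
ibb_value = estimate_runs({"walk": 1, "intentional_walk": 1})
# A strikeout is worth the out weight plus the adjustment.
strikeout_value = estimate_runs({"out": 1, "strikeout": 1})

print(f"intentional walk: {ibb_value:+.3f}")
print(f"strikeout:        {strikeout_value:+.3f}")
```

Because every adjustment is additive, the whole estimate reduces to a single weighted sum, which is what makes linear weights so easy to apply.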

A few of the margins of error are large relative to the weights themselves (for strikeouts and sacrifice flies, the margin exceeds the estimate), but overall the level of significance is good.

Applying these weights to the league-level data, I got an estimated 1297 runs, about 1.6% higher than the actual figure of 1276. So to make everything match up, I shaved 1.6% off all my run estimates.
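The adjustment is just a rescaling by the ratio of actual to estimated league runs. A small sketch, using the two league totals quoted above (the variable and function names are my own):

```python
ESTIMATED_RUNS = 1297  # league total from the unscaled weights
ACTUAL_RUNS = 1276     # actual league total

# How far the raw estimate overshoots, as a fraction of actual runs.
overestimate = ESTIMATED_RUNS / ACTUAL_RUNS - 1

# Multiplying every raw estimate by this factor makes the league total match.
scale = ACTUAL_RUNS / ESTIMATED_RUNS


def calibrate(raw_estimate):
    """Rescale a raw linear-weights estimate to the actual run total."""
    return raw_estimate * scale


print(f"overestimate: {overestimate:.1%}")
print(f"scale factor: {scale:.4f}")
```

Applying the same factor to every player keeps the rankings unchanged; it only shifts the scale so that the individual estimates sum to the runs actually scored.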

And here they are, the top 25 hitters in the 2007 IBL, using league-customized weights for the average run values of their offensive production:

The first eight places are the same as the rankings using weights suitable for the major leagues. Of the 25 on the list, 24 are the same as before. The only difference is that Seth Binder replaces Ramon Rodriguez at position 25. Mike Lyons falls to 24th place. He presumably ranked higher under the MLB-based weights because stolen bases are worth less in a higher-scoring league like the IBL: when it's easier to get on base and hit for power, taking an extra base is less valuable.

It's worth noting that the scale of the numbers is generally similar to that yielded by the MLB-based methods. The run estimates for positions 2 through 25 range from 21.88 to 44.35 here, from 21.68 to 43.71 for Base Runs, and from 22.47 to 43.63 for MLB-based linear weights.

The big discrepancy is for #1 Gregg Raymundo. Base Runs, which I suggested exaggerates performance for extreme sluggers, gave him 59.88 runs, compared to 49.55 for MLB linear weights. The IBL-based linear weights surprised me by coming out at 56.28 runs, closer to the Base Runs estimate than to the MLB linear weights estimate. I expected a linear approach to be closer to another linear approach than to a multiplicative model such as Base Runs.
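The arithmetic behind that comparison is simple enough to spell out. This sketch uses only the three Raymundo estimates quoted above; the variable names are mine.

```python
# Gregg Raymundo's run estimates under the three methods.
base_runs = 59.88
mlb_lw = 49.55
ibl_lw = 56.28

# Absolute gaps between the custom weights and each MLB-based method.
gap_to_base_runs = base_runs - ibl_lw
gap_to_mlb_lw = ibl_lw - mlb_lw

# Base Runs relative to the custom linear weights.
base_runs_premium = base_runs / ibl_lw - 1

print(f"gap to Base Runs:      {gap_to_base_runs:.2f} runs")
print(f"gap to MLB weights:    {gap_to_mlb_lw:.2f} runs")
print(f"Base Runs vs. IBL LW:  {base_runs_premium:+.1%}")
```

The custom estimate sits 3.60 runs below Base Runs but 6.73 runs above the MLB linear weights, so it is nearly twice as far from the other linear method.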

To me, this supports two points:

1. Base Runs yields good run estimates even on the player level, not just for entire teams or pitchers. Gregg Raymundo was truly an exceptional hitter: AVG/OBP/SLG of .446/.600/.911 (OPS=1.511), rising to .505/.641/.970 (OPS=1.611) when adding bases reached on error. Yet Base Runs increased his run production estimate by just 6.4% over custom linear weights. Meanwhile, for Eladio Rodriguez, who hit .461/.517/1.000 (OPS=1.517), or .471/.525/1.010 (OPS=1.535) with errors, Base Runs actually gave him fewer runs (39.01) than custom linear weights (41.15). Since no one approaching major league levels of play hits anywhere near those numbers, it seems safe to use Base Runs for estimating individual major league batters.

2. Gregg Raymundo absolutely dominated the hitting in this league, to an extent I didn't fully appreciate during the season. Perhaps that was to be expected, as I believe he was the only IBL player with experience in the AAA minors. Still, it's impressive.

Next time I'll give you the runs per plate appearance estimates, which neutralize the impact of injuries and other differences in playing time over the season.
