bIBLemetrics: batting leaders

Monday, November 19, 2007

Who were the IBL's best hitters? (Part II)

Last time, I ranked the IBL's top hitters using non-customized run estimators (Base Runs and Linear Weights). I promised that this time I'd apply customized linear weights, derived to suit the IBL's specific run-scoring environment.

I modified my approach somewhat after consulting with more experienced analysts; you can find the discussion here. Briefly, I ran a linear regression analysis on IBL data broken down by half-inning. Since run scoring in baseball occurs on a per-inning basis, aggregating the data into games (let alone seasons) reinforces all sorts of potential biases in the data. Park effects, for example, or team-specific skills, would appear to be associated with each other in aggregated data. That's much less likely in per-inning data, since there are so few game events in each half-inning.

More important for the IBL, I simply didn't have enough data points to get statistically significant results at a coarser granularity. There were only 122 games and six teams, and I'm trying to estimate the run values of some 15 different types of game events. Going down to the half-inning level gave me over 1600 independent data points, more than enough to estimate 15 coefficients (except for the rarest of game events).
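As a rough illustration of the half-inning approach, here is a minimal sketch of an ordinary least squares fit with one row per half-inning and one column per event type. It assumes a plain no-intercept regression, which may not match the exact setup used for the post. The data is synthetic, generated from made-up weights, and is not IBL data. The function name `fit_linear_weights` is hypothetical, and the standard errors it returns are the textbook OLS ones, which may or may not be what the "margins of error" below refer to.

```python
import numpy as np

def fit_linear_weights(event_counts, runs):
    """Least-squares run values: one row per half-inning, one column per event type."""
    X = np.asarray(event_counts, dtype=float)
    y = np.asarray(runs, dtype=float)
    # No intercept term (an assumption): every run is attributed to some event.
    weights, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    # Textbook OLS standard errors: sqrt(diag(s^2 * (X'X)^-1)).
    residuals = y - X @ weights
    dof = X.shape[0] - X.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return weights, std_err

# Synthetic demo data (NOT IBL data): 1600 half-innings, three event types.
rng = np.random.default_rng(0)
true_weights = np.array([0.5, 1.4, -0.15])  # made-up values: single, home run, out
counts = rng.poisson(lam=[1.0, 0.2, 3.0], size=(1600, 3))
runs = counts @ true_weights + rng.normal(0.0, 0.5, size=1600)

weights, std_err = fit_linear_weights(counts, runs)
print(weights, std_err)
```

With around 1600 rows the fitted weights land close to the made-up ones, which is the reason for going down to half-innings: more rows per coefficient means tighter estimates.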

Based on preliminary results and feedback, I made a couple of changes to my initial approach. The main one is that I lumped together times reaching base on error with singles, since from the batter's perspective they should be the same. It didn't make sense that I was seeing a substantially lower weight for a reach-on-error than for a single.

So here are the IBL-specific linear weights, along with the margins of error for each in parentheses. They represent the average number of runs created by each game event.

Single or reached base on error: 0.586 (0.015)
Double:	0.844 (0.035)
Triple:	1.219 (0.127)
Home run:	1.438 (0.042)
Walk:	0.484 (0.018)
Hit by pitch:	0.519 (0.038)
Error (without batter reaching base):	0.302 (0.050)
Stolen base:	0.091 (0.026)
Caught stealing:	-0.281 (0.058)
Sacrifice fly:	0.047 (0.069)
Sacrifice hit:	0.109 (0.082)
Intentional walk (add this to the weight for a walk):	-0.167 (0.114)
Out (apply this to every out):	-0.156 (0.010)
Strikeout (add this to the weight for an out):	-0.017 (0.019)
Ground into double play (add this to the weight for an out):	-0.286 (0.056)

A couple of the margins of error are a bit high (see strikeouts, for example), but overall the level of significance is good.
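To make the bookkeeping concrete, here is a minimal sketch of applying the weights listed above to a stat line. The dictionary keys, the function name `estimated_runs` and the example stat line are all hypothetical. Only the weight values come from the list. Note how the "add this to" weights stack: a strikeout is counted once under `out` and once under `strikeout`.

```python
# Weight values are from the list above; the key names are illustrative.
IBL_WEIGHTS = {
    "single_or_roe": 0.586,
    "double": 0.844,
    "triple": 1.219,
    "home_run": 1.438,
    "walk": 0.484,               # applies to every walk, intentional or not
    "hit_by_pitch": 0.519,
    "error_no_reach": 0.302,
    "stolen_base": 0.091,
    "caught_stealing": -0.281,
    "sac_fly": 0.047,
    "sac_hit": 0.109,
    "intentional_walk": -0.167,  # added on top of the walk weight
    "out": -0.156,               # applies to every out
    "strikeout": -0.017,         # added on top of the out weight
    "gidp": -0.286,              # added on top of the out weight
}

def estimated_runs(stat_line, weights=IBL_WEIGHTS):
    """Sum of (event count * run value) over every event in the stat line."""
    return sum(weights[event] * count for event, count in stat_line.items())

# Hypothetical stat line, not a real IBL player.
# The 8 strikeouts are also included in the 30 outs.
example = {"single_or_roe": 10, "home_run": 2, "walk": 5, "out": 30, "strikeout": 8}
runs = estimated_runs(example)
print(round(runs, 2))  # 6.34
```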

Applying these weights to the league-level data, I got an estimated 1297 runs, about 1.6% higher than the actual figure of 1276. So to make everything match up, I shaved 1.6% off all my run estimates.
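The rescaling step is simple arithmetic. This sketch assumes it amounts to multiplying each estimate by actual runs over estimated runs. The variable and function names are hypothetical.

```python
ACTUAL_RUNS = 1276      # league total, from the post
ESTIMATED_RUNS = 1297   # league total from the unscaled weights, from the post

# How far the raw estimate overshoots the actual total, in percent.
overshoot_pct = (ESTIMATED_RUNS / ACTUAL_RUNS - 1) * 100

# Multiplying by this factor makes the league total match exactly.
scale = ACTUAL_RUNS / ESTIMATED_RUNS

def rescale(raw_estimate):
    """Shrink a raw run estimate so that league totals line up."""
    return raw_estimate * scale

print(round(overshoot_pct, 1))  # 1.6
```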

And here they are, the top 25 hitters in the 2007 IBL, using league-customized weights for the average run values of their offensive production:

The first eight places are the same as the rankings using weights suitable for the major leagues. Of the 25 on the list, 24 are the same as before. The only difference is that Seth Binder replaces Ramon Rodriguez at position 25. Mike Lyons falls to 24th place. He presumably ranked higher under the MLB-based weights because stolen bases are worth less in a higher-scoring league like the IBL: when it's easier to get on base and hit for power, taking an extra base is less valuable.

It's worth noting that the scale of the numbers is generally similar to those yielded by the MLB-based methods. The run estimates for positions 2 through 25 range from 21.88 to 44.35 here; from 21.68 to 43.71 for Base Runs, and from 22.47 to 43.63 for MLB-based linear weights.

The big discrepancy is for #1 Gregg Raymundo. Base Runs - which I suggested exaggerates performance for the extreme sluggers - gave him 59.88 runs, compared to 49.55 for MLB linear weights. The IBL-based linear weights surprised me by coming out at 56.28 runs, closer to the Base Runs estimate than the MLB linear weights estimate. I expected a linear approach to be closer to another linear approach than to a multiplicative model such as Base Runs.

To me, this proves two points:

1. Base Runs yields good run estimates even on the player level, not just for entire teams or pitchers. Gregg Raymundo was truly an exceptional hitter: AVG/OBP/SLG of .446/.600/.911 (OPS=1.511), rising to .505/.641/.970 (OPS=1.611) when adding bases reached on error. Yet Base Runs increased his run production estimate by just 6.4% over custom linear weights. Meanwhile, for Eladio Rodriguez, who hit at .461/.517/1.000 (OPS=1.517), or .471/.525/1.010 (OPS=1.535) with errors, Base Runs actually gave him fewer runs (39.01) than custom linear weights (41.15). Since no one approaching major league levels of play hits anywhere near those numbers, it seems safe to use Base Runs for estimating individual major league batters.

2. Gregg Raymundo absolutely dominated the hitting in this league, to an extent I didn't fully appreciate during the season. Perhaps that was to be expected, as I believe he was the only IBL player with experience in the AAA minors. Still, it's impressive.
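The percentages behind the first point can be checked directly from the run estimates quoted above. The helper name `pct_diff` is hypothetical.

```python
def pct_diff(estimate, baseline):
    """Percent by which `estimate` exceeds `baseline` (negative if it falls short)."""
    return (estimate / baseline - 1) * 100

# Base Runs vs. IBL-specific linear weights, figures from the post.
raymundo = pct_diff(59.88, 56.28)
rodriguez = pct_diff(39.01, 41.15)

# OPS is simply on-base percentage plus slugging percentage.
raymundo_ops = 0.600 + 0.911

print(round(raymundo, 1), round(rodriguez, 1))  # 6.4 -5.2
```

So Base Runs comes out about 6.4% above the custom weights for Raymundo and about 5.2% below them for Rodriguez.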

Next time I'll give you the runs per plate appearance estimates, which neutralize the impact of injuries and other differences in playing time over the season.

Monday, November 12, 2007

Who were the IBL's best hitters? (Part I)

All this effort in tabulating reaches on error has been directed towards the goal of assessing player offensive performance. Having long ago determined that you can't analyze the IBL without errors, I needed to attribute the errors to batters - data which is missing from the IBL summary stats.

Now that I've done that, I can apply run estimators on a player-by-player basis to rank their offensive performance.

I won't rehash here the discussion of different run estimation methods. A good summary can be found here, by Justin Inaz.

I'll be looking at two run estimators, Base Runs and Linear Weights, and discussing how I chose the IBL-appropriate coefficients for them.

You may remember that I used Base Runs once before, in estimating the IBL's per-team performance. Arguably, Base Runs is not a suitable approach for assessing individual offensive players, since its formula applies the player's own on-base ability to his own base-advancement skills, as if he were playing on an entire team of players with his stats. This would yield overestimates for exceptionally good players, and underestimates for exceptionally bad ones.

Nevertheless, I've applied Base Runs for individual players to see what came out.

In addition, I applied Linear Weights. This family of techniques assigns a fixed multiplier to each type of offensive event in the game. To calculate a player's value, you just add up the values of all his stats. The multipliers are meant to be estimates of the average number of runs each type of event is worth in the league.

Thus, a single gets a certain run value, as does a home run, or an out, or a stolen base - and the same fixed value is applied to all of the player's offensive production, even if we know (for example) that a certain home run was a grand slam, while a certain two-out single left him stranded at the end of the inning. We don't care; we just tote up the average run values and call that his estimated run production. The advantage is that it absolves the player of any responsibility for the performance of his teammates, which may actually be what we want when comparing hitters across a league.

1. Base Runs

I took Tom Tango's weights for the Base Runs equations, with a few modifications. In the A component (runners on base), I added reaches on error (not all errors, just reaches) and catcher interference. In the B component (base advancement), instead of Tango's coefficient for errors (0.799) I scaled it up to attribute to each batter the league average ratio of other fielding errors (without the batter reaching base). That is, instead of 0.799*E I used 1.220*ROE, since I have ROE per batter but I have no data on runner advancement on errors. Finally, in the C component (outs), I subtracted ROE, since a batter reaching base on error is not out.
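For readers who haven't seen it, the general shape of Base Runs is A*B/(B+C) + D, where D is home runs. The sketch below shows that skeleton with the modifications described above. It is not the exact set of equations used here: the unmodified A (H + BB - HR) and C (AB - H) are assumptions based on the common basic form, and the rest of the B component is left as an input because its coefficients aren't listed in this post. All function and parameter names are hypothetical.

```python
def base_runs(a, b, c, d):
    """Base Runs skeleton: baserunners (A) score at rate B / (B + C); D always scores."""
    return a * b / (b + c) + d

def a_component(h, bb, hr, roe, ci):
    # Assumed basic form H + BB - HR, plus the post's additions:
    # reaches on error (ROE) and catcher interference (CI).
    return h + bb - hr + roe + ci

def b_error_term(roe):
    # From the post: 1.220 * ROE stands in for Tango's 0.799 * E.
    return 1.220 * roe

def c_component(ab, h, roe):
    # Assumed basic form AB - H, minus ROE, since a batter who reaches on error is not out.
    return ab - h - roe

# Hypothetical stat line; b_other stands for the rest of the B component,
# whose coefficients are not given in this post.
a = a_component(h=40, bb=20, hr=5, roe=3, ci=0)
b = 30.0 + b_error_term(roe=3)  # 30.0 is a made-up b_other value
c = c_component(ab=100, h=40, roe=3)
print(base_runs(a, b, c, d=5))
```

Because the player's own A is multiplied by his own B, an extreme hitter's on-base skill and power compound each other, which is the "team of clones" effect discussed earlier.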

Applying this formula to the league totals, I get an estimated 1230.4 runs produced, about 3.6% lower than the actual value of 1276 - pretty good, since I did nothing to customize the coefficients for the IBL. (I'm still working out how to do that, now that you mention it.)

So here they are, the top 25 hitters in the 2007 Israel Baseball League, according to the Base Runs estimator. (Why 25? Feeling generous, I guess. It also coincides with all the players with at least 20 estimated runs produced.)


There's Gregg Raymundo, way ahead of the pack, presumably due largely to his absurdly high on-base percentage. Jason Rees, who I recently dissed in comparison to Eladio Rodriguez, places second, followed closely by teammate Johnny Lopez. (Yes, the first three are all from Bet Shemesh.)

Eladio comes in fifth, but keep in mind that this is a cumulative statistic, so playing time matters. Had Eladio not been out with injury, he would presumably have surpassed Lopez and Rees (compare Eladio's 39.0 Base Runs in 118 plate appearances with Rees's 43.7 in 154).
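A quick back-of-the-envelope check of that comparison, using the figures quoted above. The projection assumes Eladio would have kept up the same rate over extra plate appearances, which is only an assumption. The variable names are hypothetical.

```python
# Base Runs and plate appearances, from the post.
eladio_runs, eladio_pa = 39.0, 118
rees_runs, rees_pa = 43.7, 154

eladio_rate = eladio_runs / eladio_pa  # estimated runs per plate appearance
rees_rate = rees_runs / rees_pa

# Naive projection: Eladio's rate held constant over Rees's playing time.
eladio_projected = eladio_rate * rees_pa

print(round(eladio_rate, 3), round(rees_rate, 3))  # 0.331 0.284
```

At the same playing time, the naive projection puts Eladio at roughly 51 Base Runs, ahead of Rees's 43.7.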

Bet Shemesh grabs seven of the top 17 positions, dominating the leaders table as much as they dominated the diamond.

Bear in mind, though, that I'm using unadjusted stats here - Gezer's park factors presumably give the Blue Sox a bit of a boost. Though it doesn't seem to have done much for their home field partners, Modi'in, with just four slots in the top 25.

Enough about Base Runs. Let's have some Linear Weights.

2. Linear Weights

But which weights to use?

To start with, I took Tom Tango's weights (see the lwts_RC column here). They're based on the MLB from 1974-1990, so there's no reason to assume they'd be suitable for the IBL.

But they're not bad, either. Applying them to the league totals, they estimate 1237.2 runs, lower than the actual 1276, but a bit better than Base Runs did.
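Putting the two league-level checks side by side, using the totals quoted in this post (the dictionary and variable names are hypothetical):

```python
ACTUAL_RUNS = 1276

# League-total estimates from the two uncustomized methods, from the post.
estimates = {
    "Base Runs": 1230.4,
    "MLB Linear Weights": 1237.2,
}

# Signed percent error of each estimate against the actual league total.
errors_pct = {
    name: (estimate / ACTUAL_RUNS - 1) * 100
    for name, estimate in estimates.items()
}

for name, err in errors_pct.items():
    print(f"{name}: {err:.1f}%")
```

That works out to about -3.6% for Base Runs and about -3.0% for the MLB-based linear weights.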

Applying them to the players, we get:

Not that different, actually. Again, the top 25 players are those with over 20 estimated runs produced. The exact same players are in both lists, with 12 of them at the exact same rankings as with Base Runs. A few of them are mixed around a bit - Eladio edges out Josh Doane, for example - but the only one with a significant change in position is David Kramer, who drops from 14th to 21st.

Arguably the most notable change between the two tables is in leader Gregg Raymundo, whose estimated run production drops from 59.88 using Base Runs to 49.55 using Linear Weights. This presumably demonstrates the problem with using Base Runs for individual player estimates of outstanding hitters - it's as if he played on a whole team of Raymundos, whereas Linear Weights assumes he played with average players.

But I still can't take these numbers seriously, knowing they were generated using weights from the seventies and eighties of Major League Baseball. I have no choice but to generate my own weights.

Tune in next time for IBL-specific Linear Weights estimates.
