bIBLemetrics: sabermetrics


Tuesday, November 27, 2007

The Gezer conundrum, again

My anonymous commentator is trying to understand the park effects at Gezer. Actually, so am I.

The problem in a nutshell is how to distinguish between the skill levels of the home teams at Gezer and the effects of the park itself. Gezer was home to Bet Shemesh and Modiin, the league's two biggest slugging teams. If you look at the home run totals at Gezer versus the other fields, you'll find a tremendous gap:

Teams at Gezer hit over 2.8 times as many home runs per game as teams playing at Yarkon, and about 2.7 times as many home runs per fly ball. Compared to Sportek, the ratios are 2.4 and 2.2. Overall, 117 of the IBL's 187 home runs, or 63%, were hit at Gezer, where just 39% of the games were played.
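As a rough check on those shares, here is a small sketch that converts them into a per-game ratio between Gezer and the other two fields combined. It uses only the rounded figures quoted above (117 of 187 home runs, 39% of games), so the result is approximate, and the function name is made up for illustration.

```python
def relative_rate(events_here, events_total, share_of_games_here):
    """Per-game event rate at one park divided by the rate at all other parks.

    The total number of games cancels out, so only the share is needed.
    """
    events_elsewhere = events_total - events_here
    rate_here = events_here / share_of_games_here
    rate_elsewhere = events_elsewhere / (1.0 - share_of_games_here)
    return rate_here / rate_elsewhere


# Figures quoted in the post; 0.39 is a rounded share of games.
gezer_hr_ratio = relative_rate(117, 187, 0.39)
print(f"HR share at Gezer: {117 / 187:.1%}")
print(f"HR per game, Gezer vs. the other fields combined: {gezer_hr_ratio:.2f}x")
```

The combined ratio comes out at roughly 2.6, which sits between the separate Sportek (2.4) and Yarkon (2.8) figures, as it should.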

But the performance gap narrows substantially when we look at broader measures of offense, not just home runs:

Batters at Gezer actually reached base less often than those at Sportek, and not a whole lot more than those at Yarkon. The slugging gap is substantial, but not nearly as wide as the home run gap. This may reflect the quality of the Bet Shemesh and Modiin pitching staffs, which were among the league's best.

If we count times reached base on errors as hits - which for all intents and purposes they are - the gap narrows further:

Remember that error rates were highest at Yarkon and Sportek. Counting errors, it turns out that on-base rates were pretty similar across the fields, with Sportek leading. In slugging, which is less important to run scoring than getting on base, Gezer led Sportek by just 60 points (or 13%) and Yarkon by 110 (28%).

Translated into run scoring, in runs per game, plate appearance and 27 outs:

That's right. At Gezer, the average game scored just 12% more runs than at Yarkon and 10% more than at Sportek. Per plate appearance, that's 14% more than Yarkon and 8% more than Sportek; per out, 15% more than Yarkon and 7% more than Sportek.

If you followed my recent post about how runs are scored, you'll understand why. Getting on base is much more important than slugging. And there are plenty of ways to score other than home runs.

What about the park factor?

But that 12-15% run boost at Gezer is not Gezer's park factor for runs. How much of the run increase was due to the field at Gezer, and how much due to the high-slugging teams that played there?

To find that out, you have to compare how the same set of teams played at Gezer versus away from Gezer. That's what I ultimately did in this post, where I took all the teams that played each other at least twice both at Gezer and elsewhere (and likewise for the other parks). This gives us a close approximation of how the different parks affect the same player matchups.

And that's where I discovered that though Gezer produced a home run boost of 76% over Sportek and 176% over Yarkon, overall run production for the same team matchups was just 4.4% higher than at Yarkon, and was actually 2.5% lower than at Sportek.
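The matched-matchup comparison can be sketched roughly as follows. This is only an illustration: the post does not say how the qualifying games were weighted, so the sketch simply pools runs across every matchup that was played at least twice both at the park and away from it. The function name, the tuple layout and the sample scores are all made up.

```python
from collections import defaultdict


def matched_park_ratio(games, park, min_games=2):
    """Runs per game at `park` divided by runs per game elsewhere, using only
    matchups played at least `min_games` times in both settings.

    `games` is an iterable of (park_name, team_a, team_b, total_runs) tuples.
    """
    here = defaultdict(list)
    away = defaultdict(list)
    for park_name, team_a, team_b, total_runs in games:
        matchup = frozenset((team_a, team_b))  # ignore home/away order
        (here if park_name == park else away)[matchup].append(total_runs)

    runs_here = games_here = runs_away = games_away = 0
    for matchup, runs in here.items():
        other = away.get(matchup, [])
        if len(runs) >= min_games and len(other) >= min_games:
            runs_here += sum(runs)
            games_here += len(runs)
            runs_away += sum(other)
            games_away += len(other)

    if games_here == 0 or games_away == 0:
        return None  # no matchup qualifies
    return (runs_here / games_here) / (runs_away / games_away)


# Made-up scores, purely to show the call; these are not IBL results.
sample = [
    ("Gezer", "A", "B", 12), ("Gezer", "A", "B", 10),
    ("Yarkon", "A", "B", 9), ("Yarkon", "B", "A", 11),
    ("Gezer", "A", "C", 20),  # only one game at Gezer, so A-C is excluded
    ("Yarkon", "A", "C", 5), ("Yarkon", "A", "C", 7),
]
ratio = matched_park_ratio(sample, "Gezer")
```

Pooling gives more weight to matchups with more games. Averaging per-matchup ratios would be another reasonable choice and could give a somewhat different answer on a sample this small.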

Now, these figures may be substantially inaccurate. The sample size is very small, with just 122 games distributed among six teams and three fields. The "pros" estimate major-league park effects over at least three full seasons of 162-game play. All sorts of noise could be skewing these results: a few unrepresentative games, or an untimely injury, or the distribution of pitchers in the games being compared.

But it seems clear that most of the 12-15% difference in run production among the three venues (as opposed to home run hitting) can be attributed to the offensive power of the teams that played in them.

This is consistent with the per-team run production averages:

Look at Bet Shemesh and Modiin, which shared Gezer; Netanya and Tel Aviv, which shared Sportek; and Petach Tikva and Ra'anana, which shared Yarkon. Most of the apparent park factors for run production are in fact due to differences in team offensive ability.

The upshot

What does this mean for comparing player performance? Park factors have their main impact on individual components of performance, such as home runs or strikeout rates. When comparing those components among players, we have to pay close attention to park effects. But when comparing overall run production, we can be sloppier, since the park differences are not great.

For precise comparisons, we should adjust each player's run production estimates for the parks he played in, and I hope to post park-adjusted tables for batting leaders soon. But whatever corrections are necessary will not change the overall offensive domination by the Bet Shemesh sluggers.

One last comment. The two run production estimators I'm currently using, Base Runs and custom IBL linear weights, when calibrated to match overall IBL run scoring, are also quite accurate at estimating overall run production at Gezer. But they show biases of similar size and opposite direction at the other two fields, overestimating production at Sportek by about 3.7% and underestimating production at Yarkon by some 3%. This could be pure chance, if teams overall scored about 12 more runs than should be expected at Yarkon and about 12 fewer at Sportek. But it might indicate that the formulas aren't quite capturing all the aspects of run production at the two fields.

Perhaps run estimates based on these formulas should be scaled up or down 3% to calibrate them to the actual results at Sportek and Yarkon.

Monday, November 26, 2007

Blog roadmap

An anonymous commenter has asked when I'll address IBL pitching. Please read the exchange between us, which touches a bit on IBL pitchers and how to assess them.

In response, I thought I should let you know what I'm planning to cover in the future, time permitting. Let me know if I'm missing anything of interest, or if you have any other comments about the agenda. Or if you'd like me to focus on one topic before another - these are in no particular order.

Batting

Finish the batting production leaders charts: runs created per plate appearance, park-adjusted figures.

Calculation of score rates per runner type and estimation of runs created based on them.

Baserunning

Leaders in net runs created and lost due to base stealing.

Looking at frequency of taking the extra base on hits.

Pitching

Charts of leaders by various raw pitching stats.

Thoughts about how to evaluate pitchers with so few starts and such unbalanced schedules.

Actual assessments of pitcher value, including DIPS (defense-independent pitching stats).

Fielding

What do we really know about it in the IBL?

General

The splits: Breaking down team stats by field, opposing team, day of week, week of season, inning, etc.

Compilation of IBL run expectancy charts by outs and baserunner situation.

A look at reported attendance figures.

Can't promise how long it will take me to get to any of this... I do have other things to do with my life, believe it or not!

Friday, November 23, 2007

A novel approach to run scoring estimation?

This post is an essay on sabermetric analysis of run scoring in baseball. If you're looking for insights into the Israel Baseball League in particular, please feel free to skip this entry.

People often discuss the relative importance to run scoring of on-base percentage versus slugging average. See, for example, here, here and here.

I'd like to try and shed new light on the question, using what I believe is a new analytic approach. Since I've only been analyzing baseball for a few months now and I'm not familiar with most of the vast sabermetric literature, it's possible, even likely, that someone's done this before. But I haven't come across it yet. Let me know if I'm repeating someone else's work. There are many open questions left to be addressed, and I'm writing up this very incomplete work in part to find out whether I'm barking up the wrong tree, or perhaps, as the British say, whether I'm just barking.

Update: Indeed, I'm not the first to come up with this. I seem to have essentially replicated the work of Prof. Carl Morris, described in detail in this impossibly-formatted text file. A layman's summary can be found here.

The basic model

Consider a simplified model of baseball run scoring, in which baserunning and advancing on outs are ignored. That is, no steals or pickoffs, no sacrifices or double plays or fielder's choice. This is obviously only an approximation of how the game works, but it's sufficient to demonstrate the principles involved. Besides, OBP and SLG don't incorporate those factors anyway.

In this model, every time a batter reaches base he either walks or gets a single, double, triple or home run. Runners already on base advance accordingly.

With a bit of simple mathematics based on probability theory, we can calculate in what fraction of innings different numbers of runners will reach base. For example, the probability that no runners at all will reach base in an inning is (1-OBP)^3. If OBP is 0.300, that means that in 34.3% of innings no runners will reach base.

Similarly, the chance that exactly one runner will reach base is 3 x OBP x (1-OBP)^3. In general (without going into the derivation), the chance that exactly r runners will reach base in an inning is (r+1)(r+2)/2 x OBP^r x (1-OBP)^3.

This chart shows the expected distribution of innings with each number of runners on base for different values of OBP.

For example, with an OBP of .200, more than half of all innings have no baserunners. When OBP is .550, under 10% of innings have no baserunners, while over 15% of innings have one runner, a bit more have two runners, and a bit less again have three runners.
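For readers who want to reproduce these figures, here is a minimal sketch of the distribution formula above. The function name is made up; the formula is the one given in the text.

```python
def prob_runners(obp, r):
    """Probability that exactly r batters reach base before the third out."""
    # Number of ways to interleave r runners with the first two outs.
    ways = (r + 1) * (r + 2) // 2
    return ways * obp**r * (1 - obp) ** 3


for obp in (0.200, 0.300, 0.550):
    shares = [prob_runners(obp, r) for r in range(4)]
    print(f"OBP {obp:.3f}:", "  ".join(f"{s:.1%}" for s in shares))
```

Each row lists the share of innings with zero, one, two and three baserunners. For an OBP of .300 the first entry is the 34.3% quoted earlier.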

Even simpler is to calculate the average number of runners who will reach base per inning. Since OBP = ROB / (ROB + Outs), a bit of manipulation reveals that ROB / Out = OBP / (1-OBP), so ROB / Inning = 3 x OBP / (1-OBP).

Here's a chart of the average number of runners per inning as it varies by OBP.
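As a sanity check, the closed form can be compared with the average computed directly from the inning-by-inning distribution. This is a small sketch with made-up function names; the sum is cut off at 500 runners, which is far more than enough for realistic OBP values.

```python
def runners_per_inning(obp):
    """Closed form from the text: 3 x OBP / (1 - OBP)."""
    return 3 * obp / (1 - obp)


def runners_per_inning_by_sum(obp, max_runners=500):
    """Same quantity as a truncated sum of r x P(exactly r runners)."""
    total = 0.0
    for r in range(max_runners + 1):
        p_r = (r + 1) * (r + 2) / 2 * obp**r * (1 - obp) ** 3
        total += r * p_r
    return total


for obp in (0.200, 0.300, 0.336, 0.400):
    print(f"OBP {obp:.3f}: {runners_per_inning(obp):.3f} "
          f"vs {runners_per_inning_by_sum(obp):.3f}")
```

The two columns should agree to rounding, which confirms that the shortcut through ROB / Out is consistent with the distribution formula.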

Now let's think a bit about how runs are scored.

If you think about this simplified model of baseball, you'll realize sooner or later that outs don't matter. We know there must be three of them in each inning, and they are the basis for our calculation of how many runners will reach base in an inning, but we don't care at all who gets out or when or in what order. Since no runners advance on an out or are picked off, we can analyze run scoring based solely on the number of runners who reach base and how they get there.

Four types of runners

So let's start with the last runner in each inning. He can only score in one way: if he hits a home run. In real baseball, there are some other possibilities, including sacrifices, steals and errors. But in our model, since he can't knock himself in, there's no way for him to reach home unless he hits a home run. This is the case no matter how many runners preceded him in the inning, and no matter how many outs remain. So his chance of scoring is equal to the home run rate, defined here as the number of home runs divided by the number of runners reaching base: HRR = HR / (H + BB).

What about the runner before the last? He can, of course, also score with a home run. But he can also score if the runner who follows him knocks him in. So if the last runner hits a home run or a triple, the runner before him will score. If the second-to-last runner hits a double, and the last runner also hits a double, the second-to-last runner will score. In general, we can list all the combinations of two on-base events which will bring the first of the two runners home. If we wanted to, we could calculate their combined probability.

Similarly, the third to last runner in each inning can score in all the ways the second to last runner can score - he can get a home run or be knocked in by the following runner. But he can also score in more ways, since he can be knocked in by the second runner following him. Again, we can list all the combinations of three on-base events which bring the first of the three runners home, though the list starts to get a bit long.

What about the fourth to last runner in an inning? Simple: he scores! There are only three bases, so if three more runners get on base, he has nowhere to go but home.

This means we can classify runners into four categories: the last runner on base in an inning, the second-to-last runner, the third-to-last runner, and all the rest. Each category has its own average score rate: for the last runner, his home run rate; for the second-to-last and third-to-last, the chances of them either homering or being knocked in by subsequent runners; and for the rest of the runners, the score rate is 100%.

Now here's the kicker: It's not hard to calculate what fraction of runners should be expected to fall into each of the four categories. The only variable is the on-base percentage.

How many runners are the last in the inning? Simple: each inning with at least one baserunner has one runner who is last. Count the number of innings with one or more runners and divide it by the total number of baserunners, and you have the fraction of baserunners who are last in the inning. The formula is:

Fraction of runners who are last in an inning =
Fraction of innings with one or more runners / average runners per inning =
(1 - (1-OBP)^3) / (3 x OBP / (1-OBP)) =
1 - 2 x OBP + 4/3 x OBP^2 - 1/3 x OBP^3

Similar manipulations give us the fraction of runners who are second to last in an inning:
2 x OBP - 14/3 x OBP^2 + 11/3 x OBP^3 - OBP^4

And third to last in an inning:
10/3 x OBP^2 - 25/3 x OBP^3 + 7 x OBP^4 - 2 x OBP^5

And all the rest - the runners who are guaranteed to score since they are followed by at least three other runners:
5 x OBP^3 - 6 x OBP^4 + 2 x OBP^5
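The four expanded polynomials are easy to mistype, so here is a short sketch that computes the shares two ways: directly from the probability of an inning having at least one, two or three runners, and from the polynomials as written above. Function names are made up for illustration, and OBP is assumed to be strictly between 0 and 1.

```python
def runner_type_shares(obp):
    """Shares of baserunners who are last, second to last, third to last,
    or followed by at least three more runners in their inning."""
    out = 1 - obp
    runners_per_inning = 3 * obp / out

    # P(exactly r runners) for r = 0, 1, 2, then P(at least k) for k = 1, 2, 3
    p_exact = [(r + 1) * (r + 2) / 2 * obp**r * out**3 for r in range(3)]
    p_at_least = [1 - sum(p_exact[:k]) for k in range(1, 4)]

    last, second, third = (p / runners_per_inning for p in p_at_least)
    rest = 1 - last - second - third  # every runner is in exactly one category
    return last, second, third, rest


def runner_type_shares_poly(obp):
    """The same four shares from the expanded polynomials in the text."""
    p = obp
    last = 1 - 2 * p + 4 / 3 * p**2 - 1 / 3 * p**3
    second = 2 * p - 14 / 3 * p**2 + 11 / 3 * p**3 - p**4
    third = 10 / 3 * p**2 - 25 / 3 * p**3 + 7 * p**4 - 2 * p**5
    rest = 5 * p**3 - 6 * p**4 + 2 * p**5
    return last, second, third, rest


for obp in (0.300, 0.336, 0.400):
    print(obp, [f"{s:.3f}" for s in runner_type_shares(obp)])
```

The two functions should agree to floating-point precision, and the four shares always add up to 1.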

We can graph these curves to see how the distribution of runners into the four categories varies with on-base percentage:

Two of these categories are not really affected by the slugging percentage. The fourth category of runners, of course, always score regardless of SLG. The first category, meanwhile, score only if they themselves homer, or they are advanced by some combination of steals, errors and sacrifices. Only the middle two categories of runners can be brought in to score by their team's collective slugging ability. They amount to a total of no more than about 42% of all of a team's baserunners, when OBP is around .400.

Another look at the same data:

Estimating scoring rates

If we want to use these runner categories to estimate run scoring, we need estimates of the scoring rate for each of the first three types of runner. There are two ways to estimate scoring rates: analytically or empirically. Analytically, we can list all the possible sequences of plays which would allow each runner to score and add up their probabilities. Empirically, we can process game event logs and count how many of each type of runner in fact scored for a given league and/or team.

I haven't done either of these properly, except for a brief, imprecise empirical check using the IBL play-by-play files.
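The analytic route can be sketched in a few lines for the simplified model only: hits advance every runner by the same number of bases, walks advance only forced runners, and nothing else happens. The mix of on-base events below is invented for illustration (only the 8% home run share echoes a figure used in this post), and the sketch treats successive events as independent.

```python
from itertools import product

# Bases taken by the batter on each way of reaching base.
BASES = {"BB": 1, "1B": 1, "2B": 2, "3B": 3, "HR": 4}


def first_runner_scores(events):
    """True if the batter of events[0] has scored once all events are played.

    Runners already on base ahead of him are ignored: with no outs on the
    bases and no passing, they cannot affect the runner we follow.
    """
    positions = []  # bases of tracked runners, oldest first; 4 or more = home
    for event in events:
        if event == "BB":
            forced_base = 1
            # Walk back from the newest runner: only an unbroken chain is forced.
            for i in range(len(positions) - 1, -1, -1):
                if positions[i] != forced_base:
                    break
                positions[i] += 1
                forced_base += 1
        else:
            positions = [base + BASES[event] for base in positions]
        positions.append(BASES[event])
    return positions[0] >= 4


def score_rate(event_probs, followers):
    """Chance that a runner followed by exactly `followers` more runners scores."""
    total = 0.0
    for events in product(event_probs, repeat=followers + 1):
        if first_runner_scores(events):
            prob = 1.0
            for event in events:
                prob *= event_probs[event]
            total += prob
    return total


# Hypothetical shares of each on-base event (they sum to 1); not real data.
mix = {"BB": 0.25, "1B": 0.50, "2B": 0.15, "3B": 0.02, "HR": 0.08}
for followers in range(4):
    print(followers, round(score_rate(mix, followers), 3))
```

With no followers the rate is just the home run share, and with three followers it is always 1, matching the rules for the last and fourth-to-last runners. Real-world rates would be higher than this model suggests for the middle categories, since it leaves out extra bases taken on hits, steals, errors and sacrifices.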

Meanwhile, for a back-of-the-envelope estimate, we can assign the first type of runner - the last in the inning - a scoring rate equal to the home run per on-base rate (about 8% in the majors) plus something extra to account for steals, errors and sacrifices. Call it 8-12%.

The second type of runner can score on his own home run, or that of the runner following him, along with various combinations of doubles and triples - or even a single or walk, followed by taking the extra base on a double. His score rate is presumably at least twice the home run rate, plus. Call it 30-45%.

The third type of runner can score in all those ways, or he can be knocked in by a third baserunner. Call it 50-75%.

I've prepared two graphs of the impact of the score rate on run scoring. The first shows the average runs scored per runner for different values of OBP, for a selection of widely varying score rates - the high scenario has score rates three times the low scenario. The second chart uses the same scenarios to compute expected runs scored per inning.

Overall, the impact of changing the score rate is higher when OBP is lower. This makes sense, since the higher the OBP, the higher the proportion of runners who are guaranteed to score. At on-base percentages around .300-.350, tripling the score rate per runner type leads to approximately a doubling of overall run scoring. But raising OBP from .300 to just .400 is worth more in runs scored than tripling the score rate at an OBP of .300.

If we zoom in on the typical range of OBPs, we can see that MLB's 2007 OBP of .336 and run scoring rate of 4.8 per 27 outs match the middle scenario for runner scoring rates very closely. The actual estimate yielded is 4.89. I did nothing deliberate to make this match up; I discovered the correspondence only after plotting the graphs. It would seem to confirm the overall intuitions in the scoring rate estimates.
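The whole estimate can be put together in a short sketch. The exact rates behind the middle scenario are not listed in the post, so the code assumes the midpoints of the back-of-the-envelope ranges above (10%, 37.5% and 62.5%). That choice, like the function names, is an assumption made for illustration.

```python
def runner_type_shares(obp):
    """Shares of runners who are last, second to last, third to last, or
    followed by three or more runners (polynomials from the text)."""
    p = obp
    last = 1 - 2 * p + 4 / 3 * p**2 - 1 / 3 * p**3
    second = 2 * p - 14 / 3 * p**2 + 11 / 3 * p**3 - p**4
    third = 10 / 3 * p**2 - 25 / 3 * p**3 + 7 * p**4 - 2 * p**5
    rest = 5 * p**3 - 6 * p**4 + 2 * p**5
    return last, second, third, rest


def runs_per_27_outs(obp, score_rates):
    """Expected runs per 27 outs given score rates for the first three
    runner types; the fourth type always scores."""
    shares = runner_type_shares(obp)
    rates = tuple(score_rates) + (1.0,)
    runs_per_runner = sum(share * rate for share, rate in zip(shares, rates))
    runners_per_27_outs = 27 * obp / (1 - obp)
    return runs_per_runner * runners_per_27_outs


# Assumed midpoints of the 8-12%, 30-45% and 50-75% ranges.
MIDPOINT_RATES = (0.10, 0.375, 0.625)
estimate = runs_per_27_outs(0.336, MIDPOINT_RATES)
print(f"Estimated runs per 27 outs at OBP .336: {estimate:.2f}")
```

Under that assumption the estimate comes out at about 4.89, in line with the figure quoted above. Swapping in other rates shows how sensitive the result is to each runner type.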

So, where does this leave us? With lots of open questions:

- Empirical: What are the actual scoring rates per runner category in real baseball leagues? Do they yield correct estimates of run scoring when plugged into this approach?

- Empirical: How do the scoring rates correlate against game events, or against OBP and SLG? What coefficients can we apply to estimate scoring rates for different teams or leagues?

- Analytical: Can the formulas - specifically for the distribution of runner types by OBP - be simplified to yield a good-enough estimate with less calculation? (Though with modern computers and spreadsheets, it's not clear how important this is.)

- Practical: Does this approach offer anything not available from a more sophisticated Markov analysis?

- Applications: Can this be modified to estimate the run contribution of a single batter?

If you're still reading, I'd love to know what you think.

Monday, August 27, 2007

Welcome to bIBLemetrics!

With the inaugural season of the Israel Baseball League behind us, I've just launched bIBLemetrics, a blog for Israel Baseball League statistical analysis.

I was a big baseball fan as a kid, but lost interest when I finished high school nearly 20 years ago. This year, partly thanks to the IBL, I'm back with a vengeance. And I'm returning to one of my childhood dreams: to be the next Bill James. Instead of going up against the experienced statheads of Baseball Prospectus, though, I can start with the field to myself, right here with the IBL.

If a statistical analyst of the MLB is a sabermetrician, it seems appropriate that the IBL should have iblemetricians. So that's what you can call me.

I've already downloaded the IBL's box scores and game logs (see the Scoreboards section of their website), and I'm working on the software to extract the data from them. Some of the questions I hope to address once I'm set up:

- Park effects: Does Gezer Field inflate offense? If so, by how much? Do park effects change our assessment of who were the league's top players?
- In general, did the most valuable player awards go to the right players?
- What is the advantage to batting second (if there is one)? With two teams sharing each home field, we can compare games with the same venue but different "home" and "away" sides.
- By how much does Beit Shemesh (or any other team) increase a game's attendance?
- Any other questions on your mind that can be approached with baseball statistics?
