Monday, September 24, 2007

Looks like I went to the wrong university

Where I went, we couldn't get credit for a course in baseball statistics!

Sunday, September 16, 2007

IBL alumni storm the Atlantic

The Atlantic League, that is. In the likely event that you've never heard of it (I hadn't), the Atlantic League is a ten-year-old independent baseball league in the northeastern U.S. "Independent" means it's not affiliated with any of the MLB's farm systems. The level of play is estimated to be equivalent to AA or AAA minor league ball (where AAA is the highest level of the minor leagues).

Two IBL alumni, both pitchers from the Bet Shemesh Blue Sox, have recently been signed by Atlantic League teams. First, Rafael Bergstrom was signed by the Bridgeport Bluefish, and now Jason Benson has joined the Lancaster Barnstormers. Bergstrom and Benson both played well in the IBL, and were among the league's best pitchers - hopefully I'll have more to say about pitching stats in the future - but I find it curious that they were signed so quickly, long before other players with better numbers.

In any case, I thought I'd take a look at their performance in the Atlantic League, but then I found that Ari Alexenberg has already done it.

If more IBL players move to other established leagues, it will be interesting to see what we can learn from the comparison about the IBL's overall level of play.

Note the comments on this post about Benson's Barnstormers debut. Someone noted that "his ERA is a bit high" - if only he knew about the home run factor at Gezer!

Monday, September 10, 2007

Park factors revisited. And recalculated.

The story so far:

Now, I'd like to recalculate the IBL park factors, and do it right this time.

Recall my earlier methodology: average the performance at each field and divide it by the average performance across the IBL. But since this approach overweights the home teams for each field, I averaged performances on a per-team basis, and weighted the six teams equally at each field. As I noted, the flaw here is that the home teams are still overweighted on the fielding side. The effect of Bet Shemesh's and Modiin's sluggers on the Gezer averages is reduced, but the effect of their pitchers is enhanced.

To check the magnitude of this problem, I applied the same methodology to pitching statistics: the same runs and hits, but credited to the fielding team and averaged by team, weighted by the number of games each team played at the field. This time, I would expect the home team hitting to be overweighted. Sure enough, Gezer's home run ratio shot into the sky. Instead of the 1.32 park factor relative to average calculated in terms of batting statistics, pitching statistics yielded a factor of 1.78 - compared with just 0.53 at Yarkon (down from 0.63).

Weighting by games played per team meant that games played by visiting teams facing Bet Shemesh and Modiin were overweighted, and the influence of the home runs they gave up was exaggerated.

So I discarded this method. It doesn't make much sense anyway.

Instead, I set out to compute "conventional" park factors.

But with the IBL, nothing is conventional. I can't just compare home games with away games, since some of the "away" games were played at the home field against each team's partner sharing the field. I could just compare all games played at Gezer with all of Gezer's home teams' away games, but then the schedule would be unbalanced, since Modiin and Bet Shemesh never played each other outside Gezer.

Also, other games were played at Gezer (and Sportek and Yarkon) without the participation of the fields' home teams at all.

So here's the idea. For each field, find all "matchups" of two teams which played at least two games at that field, as well as at least two games at other fields. That is, we're building a list of teams for which we can compare their play at Gezer with their play elsewhere. Then compare the average performance of all the Gezer games on the list with the average performance of all the other games on the list (and add a correction factor to give the result in terms of a multiple of the average field).
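The selection step can be sketched in a few lines of Python. This is only an illustration: the game counts are copied from the "IBL games by venue" table in the September 2 post, while the names `GAMES`, `qualifying_matchups` and `comparison_ratio` are made up for this sketch, and the lists it produces are not necessarily identical to the ones behind the figures below.

```python
FIELDS = ("Gezer", "Sportek", "Yarkon")

# Games per pair of teams at (Gezer, Sportek, Yarkon), copied from the
# "IBL games by venue" table in the September 2 post.
GAMES = {
    ("Bet", "Mod"): (8, 0, 0), ("Bet", "Net"): (5, 2, 1),
    ("Bet", "Pet"): (4, 1, 3), ("Bet", "Raa"): (3, 0, 5),
    ("Bet", "Tel"): (4, 4, 1), ("Mod", "Net"): (3, 2, 2),
    ("Mod", "Pet"): (4, 0, 4), ("Mod", "Raa"): (6, 0, 2),
    ("Mod", "Tel"): (6, 4, 0), ("Net", "Pet"): (0, 2, 5),
    ("Net", "Raa"): (1, 2, 7), ("Net", "Tel"): (2, 5, 1),
    ("Pet", "Raa"): (1, 3, 5), ("Pet", "Tel"): (0, 3, 5),
    ("Raa", "Tel"): (0, 3, 3),
}


def qualifying_matchups(field, min_games=2):
    """Return {pair: (games_at_field, games_elsewhere)} for every pair that
    played at least min_games at the field and at least min_games elsewhere."""
    idx = FIELDS.index(field)
    selected = {}
    for pair, counts in GAMES.items():
        here = counts[idx]
        elsewhere = sum(counts) - here
        if here >= min_games and elsewhere >= min_games:
            selected[pair] = (here, elsewhere)
    return selected


def comparison_ratio(total_here, games_here, total_elsewhere, games_elsewhere):
    """Per-game rate at the field divided by the per-game rate elsewhere.
    The totals (runs, hits, home runs...) are not included in this sketch."""
    return (total_here / games_here) / (total_elsewhere / games_elsewhere)


for field in FIELDS:
    picked = qualifying_matchups(field)
    here = sum(h for h, _ in picked.values())
    elsewhere = sum(e for _, e in picked.values())
    print(f"{field}: {len(picked)} matchups, {here} games there, {elsewhere} elsewhere")
```

With these counts, the Gezer list comes to nine matchups, with 37 games at Gezer and 37 elsewhere. Bet Shemesh against Modiin drops out, since those teams never met anywhere else.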

I didn't know how balanced the lists of games would be. I was pleasantly surprised. I'll give you the details when I have more time. For now, just the bottom line: The park factors, calculated "conventionally" by comparing games by the same pairs of teams at one field versus games played by the same pairs of teams elsewhere.

Note that the home run factors are bigger than before: 1.41 at Gezer and 0.51 at Yarkon, compared with 1.32 and 0.63 before. This I attribute to the fact that the earlier approach actually underweighted Gezer's home teams. But still, the range of park factors for runs and hits is not very large, even surprisingly narrow.

The IBL fields may be very different, but I don't think you can fairly call any of them a hitter's or pitcher's park.

Sunday, September 9, 2007

About those home run derbies

Much was made in the press over the IBL's decision to break tie games with a home run derby, rather than baseball's traditional extra innings. (For the rules of the IBL's home run derby, see section 4.10 (b) of the IBL's official rules.) A commenter has asked how many home run derbies were held this season.

The answer, after a quick search of the files, is eight, or about 6.5% of games:

June 26: Modiin (L) @ Raanana (W)
July 2: Petach Tikva (L) @ Bet Shemesh (W)
July 12: Bet Shemesh (W) @ Raanana (L)
July 13: Netanya (L) @ Petach Tikva (W)
July 17: Modiin (W) @ Petach Tikva (L)
July 31: Raanana (W) @ Tel Aviv (L)
August 6: Raanana (L) @ Modiin (W)
August 8: Bet Shemesh (L) @ Modiin (W)

I never attended a game with a derby, so I can't comment on it from a fan's perspective. But from a baseball perspective, it seems daft. I'd actually prefer if they flipped a coin, or just declared the game a tie.

Baseball is an adversarial sport. The central drama focuses on the face-off between the pitcher and the batter, with one trying to outwit the other. No play in baseball is the act of a single player, or a single team. There are no free throws, for example.

In the home run derby, each team plays separately, appointing its own "pitchers" to pitch to its own "batters", whose only objective is to try to hit the ball out of the park. Properly speaking, these aren't even home runs. The pitches aren't really pitches, since they're not thrown in an effort to get the batter out, and the batters aren't really batting, since the only outcome that matters is the long ball. For that matter, no one ever needs to run the bases, let alone run home.

It's a meaningless exercise in raw batting power, and while it may be entertaining, it bears little relationship to playing baseball. You might as well have a baserunning race, or a game of catch. Those are also valuable baseball skills, but no one would suggest they should stand on their own. The same goes for smacking the ball over the fence.

Actually, I wonder how entertaining it is either. The drama of a home run is that you never know when it's coming. You can anticipate it or imagine it, but it's a rare event that only happens when it happens, as a result of the right (or wrong!) pitch to the right batter at the right moment. Where's the drama in the derby? Especially compared with the drama of extra innings, everyone nervously wondering how long it will go on and who will finally break the tie?

Then consider that baseball teams have different skill sets. Some have more power, others more on-base skills, others more pitching or fielding. Why decide the game based on a display of only one skill, which will naturally be concentrated more in some teams than others?

I don't imagine anyone in the IBL is reading this blog, but if you are, please reconsider the home run derby. It's a corruption of the spirit of baseball. No less.

A final comment about statistics: I can't find statistics for home run derbies anywhere on the IBL's website. Neither the box scores nor the game logs even mention which players participated in them, let alone who hit how many home runs in each round. This despite the provision of the rules (4.10 (b)(7)(a)) that "Statistics from the Home Run Derby shall not be included in a player’s regular statistics but shall be included only in the statistics for home runs during a Home Run Derby."

I presume the explanation is that the IBL used off-the-shelf scorekeeping software, which had no provisions for scoring home run derbies. Another reason to drop the practice.

Thursday, September 6, 2007

Home run leaders, broken down by field

As requested in the comments, here are the IBL's leading home run hitters, with their home runs and at bats broken down by ballpark. I've included everyone with at least five home runs for the season.

I don't have much to say about this. The dominance of Gezer, and of Gezer-based hitters, in the standings is plain to see. Just four of the fourteen home run leaders (Ramirez, Crotin, Doane, Field) were from teams other than Bet Shemesh and Modiin.

As a rough estimate of the Gezer effect, imagine cutting all the Gezer home run bars in half on the first chart. The same players still lead the list, but they're not nearly as dominant.

How extreme are the IBL's park effects?

Having quantified the differences between the IBL's parks, at least as a first estimate, I now wonder how the park factors compare with the major leagues. Do MLB parks vary as widely as the IBL's? My intuition would be to say no, that in the major leagues the ballparks are held to a higher standard of uniformity.

My intuition would appear to be somewhat wrong.

To start with the biggest variances in the IBL, home run factors range from 0.63 of average at Yarkon to 1.32 at Gezer. In the MLB (for 2006, the last complete season), home run factors range from 0.681 (AT&T Park, San Francisco) to 1.343 (Chase Field, Phoenix), about the same range as the IBL.

Triples - the most unreliable stat since it's such a rare event - range from 0.0 at Gezer to 1.94 at Yarkon. In the majors, the range is 0.4 (Great American, Cincinnati) to 2.0 (Rogers Centre, Toronto).

Overall hits, meanwhile, range from 0.94 at Yarkon to 1.06 at Sportek. In the MLB, it's 0.895 (Safeco Field, Seattle) to 1.14 (Coors Field, Denver) - a wider range. Same goes for runs: a narrow range of 0.99 to 1.03 in the IBL, compared with 0.86 (Petco Park, San Diego) to 1.15 (Great American, Cincinnati) in the MLB.

Of course, the variation in the majors is among thirty parks, not three, so the typical difference between MLB parks is much smaller than in the IBL. You could say that there is as much range of variation among the three IBL parks as there is among all thirty MLB parks. Though actually, judging by runs and hits, the IBL parks vary less than the MLB's.
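For a side-by-side view, here is a small sketch that computes the spread (highest factor minus lowest) for each statistic, using only the minimum and maximum figures quoted above. The names `RANGES` and `spread` are made up for this sketch, and a simple difference is just one rough way to compare ranges.

```python
# Lowest and highest park factors quoted in this post, as (min, max)
RANGES = {
    "HR": {"IBL": (0.63, 1.32), "MLB": (0.681, 1.343)},
    "3B": {"IBL": (0.0, 1.94), "MLB": (0.4, 2.0)},
    "H": {"IBL": (0.94, 1.06), "MLB": (0.895, 1.14)},
    "R": {"IBL": (0.99, 1.03), "MLB": (0.86, 1.15)},
}


def spread(low, high):
    """Width of the range between the lowest and highest park factor."""
    return high - low


for stat, leagues in RANGES.items():
    ibl = spread(*leagues["IBL"])
    mlb = spread(*leagues["MLB"])
    wider = "IBL" if ibl > mlb else "MLB"
    print(f"{stat}: IBL {ibl:.2f}, MLB {mlb:.2f} -> wider in {wider}")
```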

I'm not going to calculate variances and standard deviations among sets of three figures, since I don't think they'd be meaningful, so I can't quantify the differences in ranges of park factors between the two leagues. But the IBL's situation is certainly not nearly as extreme as I had assumed.

Wednesday, September 5, 2007

Some comments

I'd appreciate some comments about - well, about just about anything, to be honest! But in particular, I'd like to know whether people can read the various charts I've put up. I'm not sure how well they're coming through, or which ones have been clearest. Please let me know.

On a related note, you may not have seen this brief exchange of comments about pitcher control.

Finally, if you like the IBL and you're into statistics, please tell a friend about the blog. It would be nice to be able to discuss this with someone other than my keyboard.

Surprise! Park factors are not what you expected!

I'd like to take a first stab at estimating IBL park factors.

For this first attempt, I'm going to average the performance of all six teams at each of the three fields.

I know I already explained why that's somewhat problematic: it overweights the home teams at each park, so that (for example) Gezer, home to the IBL's strongest batters (Bet Shemesh and Modiin) would have its park effects exaggerated in favor of batting, while Yarkon, home to the league's weakest teams (Petach Tikva and Raanana) would look too much like a pitcher's park.

To correct for this problem, I've decided to weight each team's record equally at each of the fields. That is, rather than tote up all the games played at each field and average the number of runs, hits, etc., I computed each team's individual performance at each field, and then averaged the team results weighting each team equally.

So even though Bet Shemesh played 24 games at Gezer and Modiin played 27, while Petach Tikva played only 9 there, I've treated them as equal for the purpose of averaging Gezer's performance levels. The same, of course, goes for Sportek and Yarkon.

This approach doesn't eliminate all forms of bias in the computation. For example, the home teams are still overrepresented in the average level of fielding faced at each park. But nothing we do will realistically eliminate all forms of bias, and I consider this approach to be a reasonable start.
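To make the difference between the two kinds of averaging concrete, here is a minimal sketch for home runs per game. The game and home run counts come from the per-venue batting tables in the September 3 post. The names `HR_BY_FIELD`, `pooled_rate` and `equal_weight_rate` are made up for this sketch, and it covers only the averaging step, not the division by the league average.

```python
# (games, home runs) for each team at each field, from the per-venue
# batting tables in the September 3 post.
HR_BY_FIELD = {
    "Gezer": {"Bet": (24, 51), "Mod": (27, 41), "Net": (11, 5),
              "Pet": (9, 8), "Raa": (11, 5), "Tel": (12, 7)},
    "Sportek": {"Bet": (7, 4), "Mod": (6, 4), "Net": (13, 5),
                "Pet": (9, 3), "Raa": (8, 7), "Tel": (19, 9)},
    "Yarkon": {"Bet": (10, 5), "Mod": (8, 7), "Net": (16, 7),
               "Pet": (22, 6), "Raa": (22, 9), "Tel": (10, 4)},
}


def pooled_rate(teams):
    """Tote up everything at the field: total home runs / total team-games.
    Teams with more games at the field count for more."""
    games = sum(g for g, _ in teams.values())
    homers = sum(hr for _, hr in teams.values())
    return homers / games


def equal_weight_rate(teams):
    """Average each team's own home runs per game, one vote per team."""
    rates = [hr / g for g, hr in teams.values()]
    return sum(rates) / len(rates)


for field, teams in HR_BY_FIELD.items():
    print(f"{field}: pooled {pooled_rate(teams):.3f}, "
          f"equal-weight {equal_weight_rate(teams):.3f}")

ratio = equal_weight_rate(HR_BY_FIELD["Gezer"]) / equal_weight_rate(HR_BY_FIELD["Yarkon"])
print(f"Gezer vs. Yarkon, equal-weight: {ratio:.2f}")
```

On these numbers the equal-weight ratio of Gezer to Yarkon is about 2.08, while the pooled version gives a much larger gap, which is the home-team overweighting the equal weighting is meant to damp down.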

The envelope, please:

Now, for the analysis. We've already seen the home run effects. Gezer produced 32% more home runs per game than the average field, while Yarkon produced 37% fewer. Overall, home runs were hit more than twice as often at Gezer as at Yarkon (relative factor: 2.08), and the difference from Sportek was nearly as large (1.82). Presumably, this is mainly due to the distances to the fences. Most of those Gezer home runs became fly outs at Yarkon.

So Gezer was a big hitters' park, right?

Not so fast.

What about triples?

Triples aren't that common to begin with. Only 19 were hit in the IBL season. Of those, 13 were at Yarkon, 6 at Sportek and 0 at Gezer. Shorter fences, not enough room to hit a triple.

How about base hits? Not much to say about them. Gezer was smack in the middle here, with a range of just 13% in hit rates between the three fields - probably too narrow to be meaningfully distinguished, given the sample size. Gezer was above average for doubles, with Yarkon below average, but again the range was not that great - 19%. Total bases from all hits - including those frequent Gezer homers - was just 7% above average at Gezer, with Yarkon 11% below average, a range of 21%.

Gezer also yielded more strikeouts and fewer walks than average. Overall, after home runs, the most telling column in the chart, and the most relevant to winning or losing games, is the one labelled R. Average runs per game, when weighting all six teams equally, varied within a range of 5% among the three parks. And Gezer was by no means the leader - it was just below average!

What can we learn from this? First, as usual, the data doesn't always support our intuitions about baseball. Second, a lot more goes into a baseball game than home run hitting. Third, if we were to assess IBL batters by adjusting their performances based on park factors, we'd see changes in home runs and triples, but batting and slugging averages wouldn't be affected much.

To confirm that suspicion, I calculated batting and slugging averages on the same basis as the park factors - averaging team results per field on an equal basis. (I don't have enough data tabulated yet to analyze on-base percentages.)

IBL 2007 batting and slugging averages by field, weighting teams equally at each field

So how much does Gezer field inflate offense? Not as much as I expected, at least.

Monday, September 3, 2007

The Gezer and the stick

Notice anything striking about the team batting statistics for the season?

Team     G    AB     H     R    BB    SO    HR
Bet     41  1106   325   286   186   198    60
Mod     41  1092   297   218   157   192    52
Net     40  1063   295   185   113   182    17
Pet     40  1003   235   161   186   216    17
Raa     41  1059   277   196   159   240    21
Tel     41  1047   293   230   202   229    20

Two numbers virtually jump off the page: Bet Shemesh and Modiin's home run totals. They're about three times the average of the other four teams. By coincidence, both teams share Gezer Field as their home field.

Or is it a coincidence? Did they rack up those homers because of something about Gezer, or were they just loaded with powerful sluggers who could hit well anywhere?

For a first cut at an answer, let's break down the season by venue.


Team batting at Gezer
Team    G   AB    H    R   BB   SO   HR
Bet    24  648  213  190  105   99   51
Mod    27  711  198  147   96  122   41
Net    11  308   91   60   35   74    5
Pet     9  228   55   51   51   68    8
Raa    11  274   70   42   44   69    5
Tel    12  312   70   34   36   92    7


Team batting at Sportek
Team    G   AB    H    R   BB   SO   HR
Bet     7  184   47   29   31   37    4
Mod     6  173   53   39   32   26    4
Net    13  329   88   54   31   52    5
Pet     9  216   51   27   31   41    3
Raa     8  234   79   58   34   37    7
Tel    19  454  136  106   90   82    9


Team batting at Yarkon
Team    G   AB    H    R   BB   SO   HR
Bet    10  274   65   67   50   62    5
Mod     8  208   46   32   29   44    7
Net    16  426  116   71   47   56    7
Pet    22  559  129   83  104  107    6
Raa    22  551  128   96   81  134    9
Tel    10  281   87   90   76   55    4

Are you rubbing your eyes yet?

Bet Shemesh and Modiin each played somewhat more than half their games at Gezer. But they hit about 80% of their home runs there!

What's more amazing: only some of the teams seem to have had a power boost at Gezer. Let's look at the stats per game:

Oddly, only Bet Shemesh, Modiin and Petach Tikva had a home run boost at Gezer. Perhaps Bet Shemesh and Modiin got used to the field and learned how to hit there?

You can also see from this chart that both Bet Shemesh and Modiin hit more home runs at the other fields than any other team (except Raanana at Sportek), and that Modiin actually outslugged Bet Shemesh in away games. (But keep in mind the small sample sizes we're working with, especially at Sportek.)
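For anyone who wants to check the per-game figures, here is a short sketch that recomputes home runs per game at each field, plus each team's share of games and home runs at Gezer, from the venue tables above. The names `TEAMS`, `hr_per_game` and `gezer_share` are made up for this sketch.

```python
FIELDS = ("Gezer", "Sportek", "Yarkon")

# team: (games at Gezer/Sportek/Yarkon, home runs at Gezer/Sportek/Yarkon),
# copied from the venue tables above.
TEAMS = {
    "Bet": ((24, 7, 10), (51, 4, 5)),
    "Mod": ((27, 6, 8), (41, 4, 7)),
    "Net": ((11, 13, 16), (5, 5, 7)),
    "Pet": ((9, 9, 22), (8, 3, 6)),
    "Raa": ((11, 8, 22), (5, 7, 9)),
    "Tel": ((12, 19, 10), (7, 9, 4)),
}


def hr_per_game(team):
    """Home runs per game at Gezer, Sportek and Yarkon, in that order."""
    games, homers = TEAMS[team]
    return tuple(hr / g for g, hr in zip(games, homers))


def gezer_share(team):
    """Fraction of the team's games, and of its home runs, that came at Gezer."""
    games, homers = TEAMS[team]
    return games[0] / sum(games), homers[0] / sum(homers)


for team in TEAMS:
    rates = ", ".join(f"{f} {r:.2f}" for f, r in zip(FIELDS, hr_per_game(team)))
    game_share, hr_share = gezer_share(team)
    print(f"{team}: {rates} | Gezer: {game_share:.0%} of games, {hr_share:.0%} of HR")
```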

New questions on my mind: Was the power boost restricted to specific batters? Did it increase as the season progressed (indicating a learning curve for the home teams)?


Sunday, September 2, 2007

Does Gezer Field inflate offense?

Let's talk about park effects.

It's well known that different ballparks affect the play of the game in different ways. A given park may make it easier or harder to hit home runs, or doubles, or even to walk or strike out. These are called "park effects" or "park factors".

Take Gezer Field. Ari Alexenberg has identified some of the relevant factors. The 395ft altitude, in the Shfela foothills, may have some effect on the flight of the ball. More significant are the short distances down the right and left field lines (280 and 316 feet respectively), which turn fly balls into home runs, and the sharp upward slopes near the outfield fences, which turn routine pop-ups into wacky plays. Not to mention the relatively narrow foul territory and the lighting post in the middle of right field.

The question is how to quantify these effects.

Measuring park effects
On first thought, it might sound simple: just calculate the stats for all games played at a given field, and compare them with the overall league averages. If the games at that field yield more runs (for example) than the average, it's a hitter's park.

The problem with this approach is obvious. Take, for example, Yankee Stadium. Let's say that in the average game at Yankee Stadium 6 runs are scored, compared with a league average of 4. Does that mean the park is responsible for the 50% increase in run production? Not necessarily. It could just be that the Yankees, who played in every game at Yankee Stadium, have a powerful offense and regularly outscore the league average, whether at home or on the road.

To get a meaningful estimate of park effects, you need to play the same set of games - between the same pairs of teams - both at the stadium you're measuring, and again at the average of all the league's stadiums. That precise experiment is not quite possible, but we can do something similar: Compare all the Yankees' home games with all their road games. Over the course of a season, the Yankees play the same teams at home as on the road, but they play their road games at a mix of all the league's stadiums. Comparing their home games with their road games is the closest we can come to playing the same list of games both at Yankee Stadium and at the average stadium.

For the major leagues, this is the standard way to measure park effects. (Recent figures for the MLB are available from ESPN.)
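In its simplest form, before any of the corrections discussed below, the home/road comparison is just a ratio of two per-game rates. Here is a minimal sketch; the function name `park_factor` and the sample numbers are invented for illustration and are not real Yankees data.

```python
def park_factor(home_total, home_games, road_total, road_games):
    """Per-game rate in a team's home games divided by the per-game rate in
    its road games. Totals count both teams' output, e.g. all runs scored."""
    return (home_total / home_games) / (road_total / road_games)


# Hypothetical season: 486 total runs in 81 home games (6.0 per game),
# 405 total runs in 81 road games (5.0 per game).
factor = park_factor(home_total=486, home_games=81, road_total=405, road_games=81)
print(f"{factor:.2f}")  # 1.20
```

A factor above 1 suggests the park boosts that statistic relative to the parks the team visited, and a factor below 1 suggests it suppresses it.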

Clearly, there are some flaws to this approach. For one, luck plays a role. Even over an entire major league season, sheer luck would produce variations in runs scored in different stadiums, even when comparing the same teams. Second, a team's road games do not entirely reflect the league average stadium, since they don't include their home park. Finally, if a team's schedule does not consist of an even mixture of all the league's other teams, its road games will again not reflect the play level of the average league stadium, but rather a weighted average of the park effects of the venues it played in.

Some baseball researchers have adjusted the conventional park effect calculation to account for these problems. To mitigate the effect of luck, several seasons' worth of data can be averaged. Corrections can be applied to adjust for the fact that road games don't include a team's home field. With substantial effort (better brush up your math), it's even possible to adjust for unbalanced schedules.

And you thought that was difficult?
So how can we estimate the park effects of Gezer Field (or Sportek or Yarkon, for that matter)?

Unfortunately, all the problems with estimating major league park effects go double or triple for the IBL. A 41-game season offers a much smaller sample size than a 162-game MLB schedule, increasing the error level of any statistic. With only three fields, it is even more difficult to compare performance with the "average" field, since each field itself contributes about a third of that average. Comparing home and road games essentially compares each field with the average of the other two, not with the league average.
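To see why the three-field problem matters, here is a rough sketch of one way to convert a "this field versus the other fields" ratio into a "this field versus the league average" ratio. It assumes every field carries equal weight in the league average, which was not true of the IBL schedule, and the function name `vs_league_average` is made up for this sketch.

```python
def vs_league_average(ratio_vs_others, n_fields):
    """Convert a ratio measured against the other fields into a ratio
    against the average of all fields, the measured field included.

    If the other fields average x, this field yields ratio * x, and the
    league average is (ratio * x + (n_fields - 1) * x) / n_fields.
    Assumes equal weight for every field.
    """
    return n_fields * ratio_vs_others / (ratio_vs_others + n_fields - 1)


# A field that doubles some statistic relative to the other fields:
for n in (3, 30):
    print(n, round(vs_league_average(2.0, n), 3))
```

With thirty fields, a field that doubles the rate of the others still comes out at about 1.94 times the league average. With three fields, the same field comes out at only 1.5, because it makes up a third of the average it is being compared to.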

And the IBL schedule was extraordinarily unbalanced, both in terms of the number of games between pairs of teams and, even more so, in terms of the distribution of those games by venue.

Here, to be precise, is the breakdown of IBL games by venue. The triplets of numbers represent games played at Gezer / Sportek / Yarkon, respectively. The pairs of numbers represent games at each team's home field and road games.

IBL games by venue (Gezer / Sportek / Yarkon)

           Bet       Mod       Net       Pet       Raa       Tel
Bet        --        8/0/0     5/2/1     4/1/3     3/0/5     4/4/1
Mod        8/0/0     --        3/2/2     4/0/4     6/0/2     6/4/0
Net        5/2/1     3/2/2     --        0/2/5     1/2/7     2/5/1
Pet        4/1/3     4/0/4     0/2/5     --        1/3/5     0/3/5
Raa        3/0/5     6/0/2     1/2/7     1/3/5     --        0/3/3
Tel        4/4/1     6/4/0     2/5/1     0/3/5     0/3/3     --

Total      24/7/10   27/6/8    11/13/16  9/9/22    11/8/22   12/19/10

Home/Away  24/17     27/14     13/27     22/18     22/19     19/22

League total: 47/31/44

Note that no IBL team played more than 27 games in its supposed home field, and Netanya played just 13 at "home" in Sportek, fewer than it played at Yarkon.

Double whammy
Finally, what about the fact that the IBL has six teams playing in just three parks? On the one hand, this can actually make it easier to estimate park effects. Approximately twice as many games were played at each field as would have been played in a season of the same length with six fields, improving our sample size. Also, since two teams share each home field, each individual team has less of a biasing effect on the field's statistics.

On the other hand, some of Bet Shemesh's "road" games were actually played at its home field against Modiin, which shared it. Bet Shemesh never played Modiin elsewhere, so it's not possible to compare those games against the same games played in another location.

Ways forward
Given these complications, I doubt there's a single best way to assess IBL park effects. I can think of a few approaches worth trying. I hope to check them out and compare the results.

At the moment, I'm hoping to try:
  • Comparing all games played at each venue with the league averages

  • Comparing all games played at a given venue for which the same two teams also played at a different venue

  • As above, but weighted by the number of games played at each venue to try to achieve a more balanced schedule

Let me know if you have any better suggestions.