Wednesday, December 13, 2017

Batting Streaks: Statistical Illusions?

The Book asserts, after some statistical flourish, that batting streaks possess no predictive information. But this presupposes that batters go through streaks at all.

We can test whether a batter is going through a streak using the Wald-Wolfowitz runs test, a generic statistical test for detecting an anomalous streak in repeated coin flips (or any other repeated Bernoulli process).
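As a minimal, stdlib-only sketch (my own, not the code linked below), the runs-test statistic with the usual normal approximation looks like this; `outcomes` is a list of 0s and 1s, and we assume both outcomes occur at least once:

```python
from math import erfc, sqrt

def runs_test(outcomes):
    """Two-sided Wald-Wolfowitz runs test on a 0/1 sequence.

    Returns (z, p): the normal-approximation z-score and p-value.
    Assumes both 0 and 1 appear in the sequence.
    """
    n1 = sum(outcomes)             # number of successes
    n2 = len(outcomes) - n1        # number of failures
    n = n1 + n2
    # a "run" starts at position 0 and wherever adjacent outcomes differ
    runs = 1 + sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    mu = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mu) / sqrt(var)
    p = erfc(abs(z) / sqrt(2.0))   # two-sided p-value
    return z, p
```

A perfectly alternating sequence has far too many runs (large positive z, tiny p), while a typical random-looking sequence gives a large p-value.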

The code used for the hypothesis testing with the data is available on GitHub.

Hitting Streaks in General

Using the Retrosheet data for 2014–2016 (and 2006–2016), we can determine whether a batter hit the ball and safely reached base (or hit a home run), or made an out. In other words, whether a plate appearance was a "success" or a "failure".

The notion of a streak remains ambiguous to me, but it seems to be some amount of nonrandomness in the results of trials. That is to say, while "in a streak" a player's next plate appearance will closely resemble his previous plate appearance. This is completely different from the definition given in the rules of Major League Baseball! But it captures what people mean in the vernacular when saying, "Wow, so-and-so is on a hot streak."

Claim 1. Given a batter, his plate appearances over his career appear to be independent of each other.

Corollary. Streaks, hot or cold, do not exist.

Assumptions, Restrictions. If we ignore errors, pickoffs, caught stealing, wild pitches, balks, other advances, ..., basically everything except singles, doubles, triples, home runs, and strikeouts (and other forms of outs), then we can apply the Wald-Wolfowitz test.

To avoid false positives from small samples, we restrict our focus to batters with at least 50 plate appearances in the 2016 season.

Wald-Wolfowitz Testing

We get the data for batters from 2014–2016, removing games at Coors Field and batters on the Colorado Rockies, and filtering out batters with fewer than 50 plate appearances. We then consider a plate appearance a "failure" if the batter ends up out, and a "success" otherwise.

There are two ways to handle the data now: we can treat each player's career as one long string of "success"/"failure" (1 and 0, respectively), then test whether the plate appearances are independent of each other using the Wald-Wolfowitz test. Or we can treat each game as a string of "success"/"failure" and test whether, within a given game, each plate appearance of a given batter is independent of the others. Spoiler alert: either way, we get the same result (the plate appearances are independent of each other).

We need to account for the fact that we're doing multiple-hypothesis testing, so we're going to use either the Holm–Bonferroni method or the Šidák correction. Again, spoiler alert: the results are unchanged by either method.
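For concreteness, here is a minimal sketch of Holm's step-down procedure, assuming we have a flat list of p-values (one per batter):

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm's step-down procedure: for each p-value, decide whether to
    reject the corresponding null hypothesis at family-wise level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # compare the rank-th smallest p-value against alpha/(m - rank)
        if pvalues[i] > alpha / (m - rank):
            break  # once one test fails, all larger p-values fail too
        reject[i] = True
    return reject
```

The Šidák correction would instead compare every p-value against the single threshold 1 − (1 − alpha)**(1/m).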

Player-by-Player testing of independence

We will consider the career of a given batter, and see if the plate appearances are independent of each other. Our hypotheses are:

  • H0: the plate appearances are independent of each other
  • Ha: the plate appearances are not independent of each other

Proposition 1. With α = 0.05, we fail to reject the null hypothesis for batters with at least 50 plate appearances in the games between 2014–2016 (inclusively) in 837 batters.

Proposition 2. With α = 0.05, we fail to reject the null hypothesis for batters with at least 50 plate appearances in the games between 2006–2016 (inclusively) in 1601 batters.

Game-by-Game testing of independence

We will consider the plate appearances of a given batter in a given game, and see if they are independent of each other. Since we are working with small samples (each game has around 4 to 6 plate appearances per batter), we can use exact p-values. Our hypotheses are:

  • H0: the plate appearances are independent of each other
  • Ha: the plate appearances are not independent of each other
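Since the per-game samples are tiny, the exact null distribution of the number of runs is available in closed form. A sketch (my own, not necessarily the implementation on GitHub):

```python
from math import comb

def runs_pmf(n1, n2):
    """Exact pmf of the number of runs in a uniformly random arrangement
    of n1 successes and n2 failures (the Wald-Wolfowitz null distribution)."""
    total = comb(n1 + n2, n1)
    pmf = {}
    for k in range(1, min(n1, n2) + 1):
        # an even number of runs, 2k: k runs of each symbol
        pmf[2 * k] = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1) / total
        # an odd number of runs, 2k+1: k+1 runs of one symbol, k of the other
        pmf[2 * k + 1] = (comb(n1 - 1, k - 1) * comb(n2 - 1, k)
                          + comb(n1 - 1, k) * comb(n2 - 1, k - 1)) / total
    return pmf
```

An exact p-value is then just a sum of pmf entries at least as extreme as the observed run count.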

Proposition 3. With α = 0.05, we fail to reject the null hypothesis for batters with at least 50 plate appearances in the games between 2014–2016 (inclusively) in 131 275 samples.

Proposition 4. With α = 0.05, we fail to reject the null hypothesis for batters with at least 50 plate appearances in the games between 2006–2016 (inclusively) in 484 746 samples.

Caution: In roughly a quarter of batter-game combinations, the batter strikes out in every plate appearance. (More precisely, this occurs 25.56693% of the time for 2014–2016, and 24.8470% of the time for 2006–2016.) We should expect this to occur for some batters anyway, accounting for something around 11% of the data; the remaining 13% may very well be due to random fluctuation, which seems plausible by order-of-magnitude estimates for batters with a 0.300 BA.

Batter Streaks in The Book

The Book considers batters over the course of 5 games, like a sliding 5-game window (an n-gram, if you like) over the batter's games for the season, excluding the first 5 and last 5 games. For each of these windows, we compute the wOBA for the batter. A hot streak is the top 5% of these 5-game wOBA scores, and a cold streak is the bottom 5%.

We reproduce table 14 from The Book, which examines data from 2000–2003 (excluding games at Coors field, and excluding Rockies players):

Number of distinct players with one or more 5-game hot streaks    543
Total number of hot streaks                                       6408
Total PA during the streaks                                       141259
Average wOBA during the streaks                                   0.587

Observe, during a hot streak there's an average of 141259/6408 ≈ 22.044164 PA per streak (or about 4.4 PA per game). On average, the weighted linear combination of plays for a batter sums to 0.587 × (141259/6408) ≈ 12.939924. Using the formula for wOBA given in The Book, this translates to somewhere between 6.6358585 home runs and 14.377693 singles. Using the league averages for the American League, we find over this time period:

Year     AL average BA   1B/H         2B/H          3B/H           HR/H
2000     .276            0.6593407    0.19715579    0.019392373    0.12411118
2001     .267            0.6574882    0.2014775     0.020819342    0.12021491
2002     .264            0.65347886   0.2053206     0.021145975    0.12005457
2003     .267            0.6595318    0.1993311     0.021404682    0.11973244
Average  0.2685          0.65749544   0.20076706    0.020677006    0.12106053

The probability of getting at least 6 hits in 22 PA with a BA of 0.2685 is, according to the binomial distribution, P(H ≥ 6) ≈ 61.130404%. Given the averages over these years, the expected wOBA when there are at least 6 hits is 0.2214 with a standard deviation of 0.1002. If everyone were close to the average, we would naively expect a streak to be a mildly rare thing to see.
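Binomial tail probabilities like these can be computed directly from the pmf, with no special libraries; a sketch:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

For instance, `binom_tail(6, 22, 0.2685)` gives the probability of at least 6 hits in 22 trials at the average BA.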

But looking at the league-average wOBA scores, which fluctuate around 0.33, we see a hot streak is then just a 2-sigma event, something which should happen every 3 weeks or so. From this perspective of rough probabilistic arguments, it is unsurprising that The Book concludes there is little predictive information in knowing whether a player is in the middle of a hot streak (or cold streak, for that matter).

Saturday, December 2, 2017

wOBA is more elegant than you think

Review: How do we compute wOBA

Any given moment in baseball may be described by (1) how many outs there are, and (2) who's on base. There are possibly 0, 1, or 2 outs; and 8 possible configurations of runners on base. Hence we may describe the game at any moment in a given inning by 24 possible states. We may represent each possible state by a number from 0 to 23, using the formula: 8×outs + (base-configuration) = (game state). Here the base-configuration = 1×(first occupied) + 2×(second occupied) + 4×(third occupied), where the parenthetic (base occupied) is 1 if the given base is occupied and 0 otherwise...think of it as a 3-bit number.
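The encoding in the paragraph above, written out as a function (the argument names are mine):

```python
def to_state(outs, on_first, on_second, on_third):
    """Encode a base-out situation as a number from 0 to 23:
    8*outs + the 3-bit base configuration."""
    base_config = 1 * int(on_first) + 2 * int(on_second) + 4 * int(on_third)
    return 8 * outs + base_config
```

So, for example, nobody on with nobody out is state 0, and bases loaded with two outs is state 23.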

Step 1: Compute the Run-Expectancy Matrix. We may set up a table whose rows are the base configurations, and whose columns are the number of outs. Tom Tango calls this table the "Run Expectancy matrix" (or RE-matrix), but it's really a random variable. We find, for a given state, the number of runs scored from that state until the end of the inning, summed over the course of a season (or set of seasons); then we divide by the number of times that state occurred over the season(s).

In pseudocode (pidgin Python):

runs = [0]*24
counts = [0]*24
for plate_appearance in season:
    state = to_state(plate_appearance)
    runs[state] += runs_at_end_of_inning(plate_appearance)
    counts[state] += 1

re = [0]*24
for i in range(24):
    re[i] = runs[i]/counts[i]

Step 2: Compute the raw coefficients. The "run expectancy" for a given play is the number of runs resulting from the play, plus the difference in the value from the RE-matrix component for the final state from the RE-matrix component for the initial state.

Now, for a given play (BB, HBP, 1B, 2B, 3B, HR, Out), we compute a coefficient k(BB) by summing the run expectancy over every walk in the season(s), then dividing by the total number of walks occurring in the season(s). Schematically in pseudocode (pidgin Python):

number_of_walks = 0
re_of_walks = 0
for walk_event in season_walk_events:
    # runs scored on the play, plus the change in run expectancy
    re_of_play = (runs_scored(walk_event)
                  + re[end_state(walk_event)] - re[start_state(walk_event)])
    re_of_walks += re_of_play
    number_of_walks += 1
k_BB = re_of_walks/number_of_walks

Given the structure of the RE-matrix, k(Outs) < 0 always.

Step 3: Scale the raw coefficients. For each of the offensive plays (BB, HBP, 1B, 2B, 3B, HR) we have the coefficient c(play) = k(play) − k(Outs).

Step 4: Compute the wOBA. We now compute the wOBA for a player by the formula:

wOBA = c(BB)×(BB/PA) + c(HBP)×(HBP/PA) + c(1B)×(1B/PA) + c(2B)×(2B/PA) + c(3B)×(3B/PA) + c(HR)×(HR/PA)

The normalizations vary; sometimes instead of PA it is (AB + BB − IBB + SF + HBP). The intuition remains the same: we multiply the coefficients by the probability that our given batter will perform the given play.
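As a sanity check, the formula is essentially a dot product. The coefficient values and batting line below are made-up placeholders, not the ones derived in steps 1–3:

```python
def woba(counts, coeffs, pa):
    """wOBA = sum over offensive plays of c(play) * (count of play / PA)."""
    return sum(coeffs[play] * counts.get(play, 0) for play in coeffs) / pa

# Illustrative (made-up) coefficients and a hypothetical 600-PA batting line:
coeffs = {"BB": 0.69, "HBP": 0.72, "1B": 0.89, "2B": 1.27, "3B": 1.62, "HR": 2.10}
line = {"BB": 50, "HBP": 5, "1B": 100, "2B": 25, "3B": 3, "HR": 20}
# woba(line, coeffs, 600) comes out to roughly 0.343 with these made-up numbers
```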

But that means wOBA is the expected value for some random variable.

Exercise 1. Assume for simplicity that PA = BB + HBP + 1B + 2B + 3B + HR + Outs. Prove the following formula holds:

wOBA = k(BB)×(BB/PA) + k(HBP)×(HBP/PA) + k(1B)×(1B/PA) + k(2B)×(2B/PA) + k(3B)×(3B/PA) + k(HR)×(HR/PA) + k(Outs)×(Outs/PA) − k(Outs)

[Hint: plug in the definition of the re-scaled coefficients in terms of the raw coefficients.]

Due to the intricacies of a degenerate sigma algebra, the wOBA for a batter who has never even had a plate appearance will be zero.

Mathematical Cleverness Hidden in the Coefficients

The raw coefficient k(play) is actually the conditional expectation E[RE | B = play], where "RE" is the random variable describing the entries of the RE-matrix, and "B" is the random variable for the play at hand (BB, HBP, 1B, etc.). Recall the conditional expectation is itself a random variable when "B" is left unspecified.

The expression E[RE | B = play] is precisely step 2 in computing wOBA, and if we do not fix the play it gives us a "random variable": the function which, given a play, produces the corresponding coefficient for that play.

For a player's wOBA, this is just the expectation of the conditional expectation, minus the coefficient for outs: wOBA = E_batter[E[RE|B]] − k(Outs).

A more elegant statistic would be just E_batter[E[RE|B]], so bad players are penalized for their outs. The only plausible reason I can think of for subtracting out k(Outs) is to make wOBA look superficially "similar" to SLG, but it does have a nifty feature: a batter who only strikes out will have a vanishing wOBA score rather than a negative one (which has its own drawbacks).

Remark 1. The astute reader may recall from basic probability that E[E[X|Y]] = E[X], which is true when both expectations are taken with respect to the probability distribution over the same probability space. But we are not doing that with wOBA: we take the inner expectation with respect to the season(s)' averages, and the outer expectation with respect to the batter's history. The geometry of the probability space is more subtle than one would think; it's more analogous to a fiber bundle, where the fiber is the probability space over the 25 states of the inning (the 24 base-out states plus the inning-ending third out), and the base space is the batter's possible plays.

Possible Improvements

Steps 1 and 2 in the algorithm for computing wOBA coefficients didn't specify any conditions on which runs we look at. That is to say, we didn't restrict focus to a particular park, or for particular weather, and so on.

A possible improvement would be to project the statistics onto a particular subset: compute wOBA coefficients relative to a particular park, or for particular weather. This would factor into the statistic the park's idiosyncrasies.

More "controversial" improvements include counting "caught stealing" as a play, which doesn't really measure a batter's hitting, but does measure a player's judgement and, more crucially, his ability to steal.

Variance

As far as I can tell, The Book is where wOBA was first introduced; it was rather quick in giving the variance, in an appendix. Recall for a random variable on a finite probability space, the variance is:

Var(X) = (∑_j X(j)² Pr(j)) − E[X]²

where I have written out E[X²] explicitly to emphasize the structure of the formula. The Book asserts the variance is the same as the Bernoulli distribution's variance.

Exercise 2. Assume for simplicity that PA = BB + HBP + 1B + 2B + 3B + HR + Outs. Show the following formula holds:

var(wOBA) = c(BB)²×(BB/PA) + c(HBP)²×(HBP/PA) + c(1B)²×(1B/PA) + c(2B)²×(2B/PA) + c(3B)²×(3B/PA) + c(HR)²×(HR/PA) − (wOBA)²

where the superscript indicates squaring, x² = x×x. Then prove, or find a counter-example, that

var(wOBA) = wOBA(1 − wOBA)/PA.

Wednesday, November 29, 2017

Yu Darvish

So, Yu Darvish was traded from the Texas Rangers to the Dodgers during the 2017 season. He was also the starter for game 7 of the 2017 World Series. Did Darvish cost the Dodgers game 7? We will answer the more modest question: did Darvish perform as well for the Dodgers as he did for the Texas Rangers? We use hits per batter faced as a measure of pitching skill.

Claim 1. Darvish performed worse on the Dodgers than on the Rangers. (End of Claim)

Claim 2. Darvish performed worse in game 7 of the World Series than he did on the previous 8 games with the Dodgers. (End of Claim)

More precisely, the hit rate (as measured by the ratio of hits to batters faced) for Darvish on the Dodgers follows a different (worse) distribution than while Darvish was on the Rangers. Similarly, the hit rate for Darvish in game 7 of the 2017 World Series follows a (worse) distribution than Darvish's pre-World Series performance for the Dodgers.

Addendum. The Los Angeles Times reports the Astros apparently knew Yu Darvish had a "tell" revealing his pitch choice, which would explain the catastrophic difference between expected and observed performance.

Raw Data

We can list the relevant career statistics from baseball-reference.com:

Team            BF      H
Texas           3242    630
Dodgers         202     44
Game 7 of WS    22      9

Tenure on the Dodgers

We will now perform tests to justify claim 1, that Darvish's performance on the Dodgers is worse than his performance on the Texas Rangers, both with a frequentist test and through Bayesian means.

Frequentist Test

Proposition 1. Darvish's "hit rate" (ratio of hits to batters faced) for his time on the Rangers differs significantly from his time on the Dodgers with α = 0.002699796. (End of Proposition)

It would be tempting but wrong to use the Z-test for proportions (with the "true proportion" being Darvish's hit rate while on the Rangers, and the "sample proportion" being the hit rate while on the Dodgers). The reason it would be wrong is that this is not a simple random sample (remember, baseball cycles through the 9 batters in order, whereas a simple random sample assumes every batter is equally likely to bat next; for really, really large BF we can pretend it's a simple random sample, but not for 202 batters faced over several months).

What we do instead is construct a Wilson interval for Darvish's time on the Texas Rangers. This gives us an interval in which Darvish's "true hit rate" lies, with a given level of confidence. Basically, if the hit rate while on the Dodgers does not lie in the confidence interval constructed from the Rangers' hit rate data, then the difference is statistically significant.
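The Wilson score interval is simple enough to sketch directly (a stdlib-only version; the function name is mine):

```python
from math import sqrt

def wilson_interval(successes, n, z=3.0):
    """Wilson score interval for a binomial proportion at z standard deviations."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return center - half, center + half
```

With `wilson_interval(630, 3242, z=3.0)` this reproduces the 3-sigma interval quoted in the proof below.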

Proof. Taking
  • p̂ = 630/3242 ≈ 0.1943245 as the observed "hit rate",
  • n = 3242 batters faced ("trials performed"), and
  • z = 3 (so we're 3-sigma confident),
the confidence interval is [0.17433468, 0.21600676].

But the observed hit rate while on the Dodgers is D = 44/202 ≈ 0.21782178 which lies outside the confidence interval.

Thus with 99.7% confidence, we can conclude the "hit rate" for Darvish while on the Dodgers does not lie within a reasonable neighborhood of Darvish's "hit rate" while on the Rangers.

(End of Proof)

Remark 1.1. It is worth mentioning that the Chi-squared test for goodness of fit also assumes a simple random sample, which Darvish's Dodger statistics lack. A confidence interval avoids this problem.

Remark 1.2 (p-value). Note, since we are assuming this "hit rate" is a Bernoulli proportion, we can compute a p-value for the Dodgers' "hit rate" as the binomial upper-tail probability with probability of success 630/3242 and at least 44 successes out of 202 trials (since we want to consider situations at least as bad as what the Dodgers experienced), which gives a p-value of 0.222758. Caution: this critically rests on the unjustified assumption that the hit rate with the Texas Rangers is the "true" hit rate, and not 3242 trials involving an unobserved hit-rate parameter which must be estimated.

Really, a better estimate for the p-value is to extend the confidence interval until it just touches the value in question. This gives z = 3847/sqrt(1408649) ≈ 3.24131, which yields a p-value of 0.0002974525723327681, which is respectably small.

Bayesian Tests

We can follow Kruschke's equivalence-testing procedure. This is really estimating the probability for a Bernoulli distribution, so we use a Beta distribution. Darvish's time on the Rangers gives us the prior parameters α = 630, β = 3242 − 630 = 2612.

We plot the estimates made from the Beta distribution; the 95% high density interval (HDI) is shaded in red (it is the interval 0.1808867 < x < 0.2081195), and the hit rate for Yu Darvish on the Dodgers is the vertical blue line:
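The HDI can be approximated without any special libraries by Monte Carlo (a sketch of my own; the actual analysis presumably used exact Beta quantiles):

```python
import random

def beta_hdi(a, b, mass=0.95, draws=100_000, seed=42):
    """Approximate the narrowest interval containing `mass` of a Beta(a, b)
    distribution, via sorted Monte Carlo draws (assumes unimodality)."""
    rng = random.Random(seed)
    xs = sorted(rng.betavariate(a, b) for _ in range(draws))
    k = int(mass * draws)
    # among all windows of k consecutive order statistics, take the narrowest
    i = min(range(draws - k), key=lambda j: xs[j + k] - xs[j])
    return xs[i], xs[i + k]
```

With the Rangers prior, `beta_hdi(630, 2612)` lands very close to the interval (0.1808867, 0.2081195) quoted above.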

Since the observed value for the hit rate of Yu Darvish on the Dodgers lies outside the HDI (red region), more than half of the region of practical equivalence will lie outside the HDI, which suggests the null hypothesis should be rejected.

We should note the p-value for Yu Darvish's hit rate with the Dodgers is 0.0004930547, which supports rejecting the null hypothesis. This means that, given Darvish's history with the Rangers, his hit rate with the Dodgers is not just worse, but statistically significantly worse.

Remark 2.1. Compared to the previous computation of the p-value for the frequentist hypothesis test, this new and alarmingly small p-value may be surprising. But remember, the naive frequentist estimate was based on the assumption that Darvish's statistics with the Rangers gave the "true hit rate", as opposed to being themselves observations of performance. Here we suppose we were merely observing "trials" of Darvish's pitching, and ask, given that "prior" history with the Rangers, how likely Darvish's performance with the Dodgers is. In short, there is no inconsistency with the p-value from Remark 1.2.

Performance during World Series 2017

Frequentist Tests

Proposition 3. Darvish's "hit rate" for the World Series differs significantly from his pre-World Series "hit rate" for the Dodgers. (End of Proposition)

Proof. We need to again construct the confidence interval.

We find, for N = 180 batters faced and an observed hit rate of ≈ 0.2178218, that the interval is [0.1401892, 0.3223284].

The hit rate for Darvish during the World Series is WS ≈ 0.4090909, and WS > 0.3223284 lies well outside the upper bound of the Wilson interval. Hence we conclude it is statistically significantly different (worse). (End of Proof)

Remark 3.1. For Darvish's World Series hit rate to even touch the confidence interval, we would need z = 5.21928; this has a p-value of 4.4905×10⁻⁸.

Bayesian Analysis

We consider Darvish's hit rate with the prior data given by his Dodgers data excluding the World Series: a Beta prior with parameters α = 44 − 9 = 35, β = 180 − 35 = 145. Then we can plot the 95% high density interval for the posterior estimate of the hit rate (based on Darvish's performance with the Dodgers, pre-World Series), shaded in red, with the vertical line being Darvish's World Series performance:

Even with a generous region of practical equivalence, there is no way we can accept the null hypothesis (that Darvish's performance at the World Series is consistent with his pre-World Series performance with the Dodgers).

Remark 4.1. In fact, this is so strange that the numerical calculation for the p-value rounds to zero. With 20 digits of precision (long double calculations), the p-value still rounds to zero.

Remark 4.2. The situation gets only worse if we add in (or consider instead) Darvish's performance with the Texas Rangers.

Conclusion and Future Work

Darvish simply performed worse, for whatever reason, with the Dodgers. What's more, he performed even worse during game 7 of the World Series.

It would be interesting to see if traded pitchers in general have a worse "hit rate" than before being traded, which would explain what we witnessed with Darvish.

We also used a crude metric for determining pitcher effectiveness, simply treating the ratio of hits to batters faced as a proportion of interest, like the bias of a coin. We could extend our analysis with more sophisticated metrics, though WHIP is approximately as informative as our "hit rate".

Realistically, we should construct something more sophisticated than OPS to adequately gauge how well Darvish performed, rewarding him for striking out "better batters" and penalizing him for letting batters reach base.