Yu Darvish was traded to the Dodgers from the Texas Rangers. He was also the starter for game 7 of the 2017 World Series. Did Darvish cost the Dodgers game 7? We will answer the more modest question: did Darvish perform as well for the Dodgers as he did for the Texas Rangers? We use hits per batter faced as a measure of pitching skill.
Claim 1. Darvish performed worse on the Dodgers than on the Rangers. (End of Claim)
Claim 2. Darvish performed worse in game 7 of the World Series than he did in his previous 8 games with the Dodgers. (End of Claim)
More precisely, the hit rate (as measured by the ratio of hits to batters faced) for Darvish on the Dodgers follows a different (worse) distribution than it did while Darvish was on the Rangers. Similarly, the hit rate for Darvish in game 7 of the 2017 World Series follows a different (worse) distribution than Darvish's pre-World Series performance for the Dodgers.
Addendum. The Los Angeles Times reports that the Astros apparently knew Yu Darvish had a "tell" revealing his pitch choice, which would explain the catastrophic difference between expected and observed performance.
Raw Data
We can list the relevant career statistics from baseball-reference.com:
Team | BF (batters faced) | H (hits) |
---|---|---|
Texas | 3242 | 630 |
Dodgers | 202 | 44 |
Game 7 of WS | 22 | 9 |
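As a quick check on the arithmetic, the hit rates used throughout can be recomputed from the table. This is only a sketch: the dictionary layout and the `hit_rate` helper are illustrative names, and the pre-World Series row is an assumption obtained by subtracting the World Series row from the Dodgers row.

```python
# Batters faced (BF) and hits (H), copied from the table above.
raw = {
    "Texas": (3242, 630),
    "Dodgers": (202, 44),
    "Game 7 of WS": (22, 9),
}


def hit_rate(bf, h):
    """Hits per batter faced."""
    return h / bf


rates = {team: hit_rate(bf, h) for team, (bf, h) in raw.items()}

# Assumption: the Dodgers row includes the World Series row, so the
# pre-World Series figures are the difference of the two rows.
pre_ws_bf = 202 - 22
pre_ws_h = 44 - 9
rates["Dodgers (pre-WS)"] = hit_rate(pre_ws_bf, pre_ws_h)

for team, rate in rates.items():
    print(f"{team}: {rate:.7f}")
```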
Tenure on the Dodgers
We now test Claim 1, that Darvish's performance on the Dodgers is worse than his performance on the Texas Rangers, first with a frequentist test and then by Bayesian means.
Frequentist Test
Proposition 1. Darvish's "hit rate" (ratio of hits to batters faced) for his time on the Rangers differs significantly from his time on the Dodgers at significance level α = 0.002699796 (the two-sided normal tail probability beyond z = 3). (End of Proposition)
Proof. It would be tempting but wrong to use the Z-test for proportions (with the "true proportion" being Darvish's hit rate while on the Rangers, and the "sample proportion" being his hit rate while on the Dodgers). It would be wrong because this is not a simple random sample: baseball cycles through the 9 batters in order, whereas a simple random sample assumes that every batter in baseball is equally likely to bat next. For really, really large BF we can pretend it is a simple random sample, but not for 202 batters faced over several months.
What we do instead is construct a Wilson interval for Darvish's time on the Texas Rangers. This gives us an interval in which Darvish's "true hit rate" lies, together with a confidence level. Basically, if the hit rate while on the Dodgers does not lie in the confidence interval constructed from the Rangers' hit rate data, then the difference is statistically significant.
Let
- p̂ = 630/3242 ≈ 0.1943245 be the observed "hit rate",
- n = 3242 batters faced ("trials performed"), and
- z = 3 (so we're 3-sigma confident),
then the Wilson interval is (p̂ + z²/(2n) ± z·sqrt(p̂(1 − p̂)/n + z²/(4n²))) / (1 + z²/n), which evaluates to approximately [0.1743, 0.2160].
But the observed hit rate while on the Dodgers is p̂D = 44/202 ≈ 0.21782178, which lies outside the confidence interval.
Thus with 99.7% confidence, we can conclude the "hit rate" for Darvish while on the Dodgers does not lie within a reasonable neighborhood of Darvish's "hit rate" while on the Rangers.
(End of Proof)
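The Wilson interval used in the proof can be reproduced with a few lines of standard-library Python. This is a minimal sketch using the textbook Wilson score formula; the function name `wilson_interval` is illustrative and not something the original analysis names.

```python
import math


def wilson_interval(hits, n, z):
    """Wilson score interval for a proportion hits/n at z standard deviations."""
    p_hat = hits / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Rangers data: 630 hits in 3242 batters faced, z = 3.
low, high = wilson_interval(630, 3242, 3)
dodgers_rate = 44 / 202

print(f"Wilson interval: [{low:.7f}, {high:.7f}]")
print(f"Dodgers rate {dodgers_rate:.7f} outside interval: {not low <= dodgers_rate <= high}")
```

The Dodgers rate sits only slightly above the upper bound, so the conclusion is sensitive to the choice z = 3.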
Remark 1.1. It is worth mentioning that the Chi-squared test for goodness of fit also assumes a simple random sample, which Darvish's Dodger statistics lack. A confidence interval avoids this problem.
Remark 1.2 (p-value). Note, since we are treating each batter faced as a Bernoulli trial, the number of hits follows a binomial distribution, so we can compute the p-value of the Dodgers' "hit rate" from the binomial CDF with probability of success equal to 630/3242 and at least 44 successes out of 202 trials (since we want to consider situations at least as bad as what the Dodgers experienced). This gives a p-value of 0.222758. Caution: this critically rests on the unjustified assumption that the hit rate for the Texas Rangers is the "true" hit rate, rather than the outcome of 3242 trials governed by an unobserved hit rate parameter which must be estimated.
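The naive binomial p-value in this remark can be checked by summing the binomial probabilities directly. The sketch below assumes, as the remark does, that 630/3242 is the true hit rate; `binomial_upper_tail` is an illustrative helper name.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


# Assumption from the remark: the Rangers rate is treated as the true rate.
p_rangers = 630 / 3242
p_value = binomial_upper_tail(44, 202, p_rangers)

print(f"P(at least 44 hits in 202 batters faced) = {p_value:.6f}")
```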
Really, a better estimate for the p-value is to extend the confidence interval until it reaches the value in question. Solving z² = n(p̂D − p̂)²/(p̂D(1 − p̂D)) gives z = 3847/sqrt(1408649) ≈ 3.24131, whose one-sided normal tail probability is about 0.0006 (about 0.0012 two-sided), which is respectably small.
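The value of z at which the Wilson interval just reaches the Dodgers rate follows from the fact that a Wilson endpoint p satisfies (p − p̂)² = z²·p(1 − p)/n. A minimal sketch, with `z_to_reach` as an illustrative name; since the tail probability attached to z depends on the one-sided or two-sided convention, both are computed.

```python
import math
from statistics import NormalDist


def z_to_reach(p_hat, n, p_target):
    """z for which p_target is an endpoint of the Wilson interval around p_hat."""
    return math.sqrt(n) * abs(p_target - p_hat) / math.sqrt(p_target * (1 - p_target))


z = z_to_reach(630 / 3242, 3242, 44 / 202)

# Standard normal tail probabilities for that z.
one_sided = 1 - NormalDist().cdf(z)
two_sided = 2 * one_sided

print(f"z = {z:.5f}")
print(f"one-sided tail = {one_sided:.3e}, two-sided tail = {two_sided:.3e}")
```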
Bayesian Tests
We can follow Kruschke's approach to equivalence testing. We are really estimating the success probability of a Bernoulli trial, so we use a Beta distribution. Darvish's time on the Rangers gives us the prior parameters α = 630, β = 3242 - 630 = 2612.
We plot the estimates made from the Beta distribution. The 95% highest density interval (HDI) is shaded in red (it is the interval 0.1808867 < x < 0.2081195), and the hit rate for Yu Darvish on the Dodgers is the vertical blue line:
Since the observed value for the hit rate of Yu Darvish on the Dodgers lies outside the HDI (red region), more than half of the region of practical equivalence will lie outside the HDI, which suggests the null hypothesis should be rejected.
We should note the p-value for Yu Darvish's hit rate with the Dodgers is 0.0004930547, which supports rejecting the null hypothesis. This means that given Darvish's history with the Rangers, his hit rate with the Dodgers is not just worse, but statistically significantly worse.
Remark 2.1. Compared to the previous computation of the p-value for the frequentist hypothesis test, this new and alarmingly small p-value may be surprising. But remember, the naive frequentist estimate was based on the assumption that Darvish's statistics with the Rangers were the "true hit rate" as opposed to observations of performance. Here we suppose we were merely observing "trials" of Darvish's pitching and ask, given that "prior" history with the Rangers, how likely Darvish's performance with the Dodgers is. In short, there is no inconsistency with the p-value from Remark 1.2.
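The Beta(630, 2612) computations can be approximated without any statistics library by evaluating the density on a fine grid. This is a sketch, not the method used to produce the figures above: the grid size, the helper names `beta_grid` and `hdi`, and the narrowest-window definition of the HDI are all assumptions, so the last digits may differ from the quoted interval.

```python
import math


def beta_grid(a, b, steps=20000):
    """Grid midpoints on (0, 1) and normalised probability masses for Beta(a, b)."""
    xs = [(i + 0.5) / steps for i in range(steps)]
    # Unnormalised log-density; the normalising constant cancels below.
    logs = [(a - 1) * math.log(x) + (b - 1) * math.log(1 - x) for x in xs]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return xs, [w / total for w in weights]


def hdi(xs, masses, level=0.95):
    """Narrowest contiguous run of grid cells holding at least `level` mass."""
    best = (0.0, 1.0)
    lo, acc = 0, 0.0
    for hi in range(len(xs)):
        acc += masses[hi]
        # Drop cells from the left while the window still holds enough mass.
        while acc - masses[lo] >= level:
            acc -= masses[lo]
            lo += 1
        if acc >= level and xs[hi] - xs[lo] < best[1] - best[0]:
            best = (xs[lo], xs[hi])
    return best


xs, masses = beta_grid(630, 2612)
hdi_low, hdi_high = hdi(xs, masses)

dodgers_rate = 44 / 202
# Probability mass at or above the Dodgers hit rate.
tail = sum(m for x, m in zip(xs, masses) if x >= dodgers_rate)

print(f"95% HDI: ({hdi_low:.5f}, {hdi_high:.5f})")
print(f"mass at or above the Dodgers rate: {tail:.6f}")
```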
Performance during World Series 2017
Frequentist Tests
Proposition 3. Darvish's "hit rate" for the World Series differs significantly from his pre-World Series "hit rate" for the Dodgers. (End of Proposition)
Proof. We need to again construct the confidence interval.
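We find for N = 180 (the 202 batters Darvish faced as a Dodger minus the 22 he faced in the World Series), z = 3, and an approximate hit rate of p̂ ≈ 0.2178218 (the overall Dodgers rate 44/202; the exact pre-World Series rate is 35/180 ≈ 0.1944444), that the interval is [0.1401892, 0.3223284].
The hit rate for Darvish during the World Series is p̂WS = 9/22 ≈ 0.4090909, which lies well above the upper bound 0.3223284 of the Wilson interval. Hence we conclude it is statistically significantly different (worse). (End of Proof)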
Remark 3.1. For the confidence interval to stretch far enough to touch Darvish's World Series hit rate, we need z = 5.21928; the corresponding one-sided normal tail probability is about 8.98×10⁻⁸.
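The interval and the z value for the World Series comparison can be recomputed as follows. The sketch evaluates two cases: the rate 44/202 ≈ 0.2178218 with N = 180, which matches the numbers in the proof, and the exact pre-World Series rate 35/180, which is added here for comparison and is not part of the original computation. Function names are illustrative.

```python
import math


def wilson_interval(p_hat, n, z):
    """Wilson score interval around the observed proportion p_hat."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def z_to_reach(p_hat, n, p_target):
    """z for which p_target is an endpoint of the Wilson interval around p_hat."""
    return math.sqrt(n) * abs(p_target - p_hat) / math.sqrt(p_target * (1 - p_target))


ws_rate = 9 / 22
n_pre_ws = 180

# Case 1: the approximate rate used in the proof (overall Dodgers rate).
text_low, text_high = wilson_interval(44 / 202, n_pre_ws, 3)
text_z = z_to_reach(44 / 202, n_pre_ws, ws_rate)

# Case 2 (assumption for comparison): the exact pre-World Series rate.
exact_low, exact_high = wilson_interval(35 / 180, n_pre_ws, 3)
exact_z = z_to_reach(35 / 180, n_pre_ws, ws_rate)

print(f"p_hat = 44/202: [{text_low:.7f}, {text_high:.7f}], z to reach = {text_z:.5f}")
print(f"p_hat = 35/180: [{exact_low:.7f}, {exact_high:.7f}], z to reach = {exact_z:.5f}")
```

In both cases the World Series rate lies above the upper bound, so the conclusion does not hinge on which pre-World Series rate is used.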
Bayesian Analysis
We take as prior data the Dodgers data excluding the World Series, which gives a Beta prior with parameters α = 44 - 9 = 35, β = 145. Then we can plot the 95% highest density interval for the posterior estimate of the hit rate (based on Darvish's performance with the Dodgers, pre-World Series), shaded in red, with the vertical line being Darvish's World Series performance:
Even with a generous region of practical equivalence, there is no way we can accept the null hypothesis (that Darvish's performance at the World Series is consistent with his pre-World Series performance with the Dodgers).
Remark 4.1. In fact, this is so strange, the numerical calculation for the p-value rounds to zero. With 20 digits of precision (long double calculations), the p-value still rounds to zero.
Remark 4.2. The situation gets only worse if we add in (or consider instead) Darvish's performance with the Texas Rangers.
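The highest density interval for the Beta(35, 145) distribution can also be estimated from random draws, in the spirit of sample-based HDI estimates. This is a Monte Carlo sketch under assumptions: the seed, the number of draws, and the helper name `hdi_from_samples` are arbitrary choices, and the result is approximate.

```python
import math
import random


def hdi_from_samples(samples, level=0.95):
    """Narrowest interval containing a `level` share of the sorted samples."""
    ordered = sorted(samples)
    window = math.ceil(level * len(ordered))
    starts = range(len(ordered) - window + 1)
    best = min(starts, key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best], ordered[best + window - 1]


rng = random.Random(2017)  # fixed seed so the sketch is repeatable
draws = [rng.betavariate(35, 145) for _ in range(100_000)]

hdi_low, hdi_high = hdi_from_samples(draws)
ws_rate = 9 / 22

# Share of draws at least as extreme as the World Series hit rate.
share_above = sum(d >= ws_rate for d in draws) / len(draws)

print(f"approximate 95% HDI: ({hdi_low:.4f}, {hdi_high:.4f})")
print(f"World Series rate: {ws_rate:.4f}, share of draws at or above it: {share_above}")
```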
Conclusion and Future Work
Darvish just performed worse, for whatever reason, when with the Dodgers. What's more, he performed even worse during game 7 of the World Series.
It would be interesting to see if traded pitchers in general have a worse "hit rate" than prior to trading, which would explain what we have witnessed with Darvish.
We also used a crude metric for determining pitcher effectiveness, simply treating the ratio of hits to batters faced as a proportion of interest, like the bias of a coin. We could extend our analysis with more sophisticated metrics, though WHIP is approximately as informative as our "hit rate".
Realistically, we should construct something more sophisticated than OOPS to adequately gauge how well Darvish performed, rewarding him for striking out "better batters" and penalizing him for getting batters on base.