Saturday, December 2, 2017

wOBA is more elegant than you think

Review: How do we compute wOBA

Any given moment in baseball may be described by (1) how many outs there are, and (2) who's on base. There are possibly 0, 1, or 2 outs; and 8 possible configurations of runners on base. Hence we may describe the game at any moment in a given inning by 24 possible states. We may represent each possible state by a number from 0 to 23, using the formula: 8×outs + (base-configuration) = (game state). Here the base-configuration = 1×(first occupied) + 2×(second occupied) + 4×(third occupied), where the parenthetic (base occupied) is 1 if the given base is occupied and 0 otherwise...think of it as a 3-bit number.
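To make the encoding concrete, here is a minimal sketch in Python. The function names encode_state and decode_state are my own invention for illustration, not anything standard.

```python
def encode_state(outs, first, second, third):
    """Map (outs, runners on base) to an integer 0..23 via 8*outs + base configuration."""
    if outs not in (0, 1, 2):
        raise ValueError("outs must be 0, 1, or 2")
    # The base configuration is a 3-bit number: first = bit 0, second = bit 1, third = bit 2.
    base_configuration = 1 * int(first) + 2 * int(second) + 4 * int(third)
    return 8 * outs + base_configuration


def decode_state(state):
    """Invert encode_state: return (outs, first, second, third)."""
    outs, base_configuration = divmod(state, 8)
    first = bool(base_configuration & 1)
    second = bool(base_configuration & 2)
    third = bool(base_configuration & 4)
    return outs, first, second, third
```

For example, one out with runners on first and third gives 8×1 + (1 + 4) = 13, and decode_state(13) recovers (1, True, False, True).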

Step 1: Compute the Run-Expectancy Matrix. We may set up a table whose rows are the base configurations and whose columns are the number of outs. Tom Tango calls this table the "Run Expectancy matrix" (or RE-matrix), but it's really a random variable. For a given state, we add up, over the course of a season (or set of seasons), the number of runs scored from that state until the end of the inning; then we divide by the number of times that state occurred over the season(s).

In pseudocode (pidgin Python):

runs = [0] * 24
counts = [0] * 24
for plate_appearance in season:
    state = to_state(plate_appearance)
    # runs scored from this plate appearance through the end of the inning
    runs[state] += runs_at_end_of_inning(plate_appearance)
    counts[state] += 1

re = [runs[i] / counts[i] for i in range(24)]
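
The pseudocode leaves runs_at_end_of_inning abstract. Here is one way it might look as a runnable sketch, assuming each half-inning is recorded as a list of (state before the play, runs scored on the play) pairs. That data layout, the function names, and the toy data are all assumptions of mine.

```python
def runs_to_end_of_inning(runs_on_each_play):
    """For each play in one half-inning, total runs scored from that play onward."""
    totals = []
    running = 0
    for r in reversed(runs_on_each_play):
        running += r
        totals.append(running)
    return totals[::-1]


def run_expectancy(half_innings, n_states=24):
    """half_innings: list of half-innings, each a list of (state_before_play, runs_on_play)."""
    runs = [0.0] * n_states
    counts = [0] * n_states
    for plays in half_innings:
        remaining = runs_to_end_of_inning([r for _, r in plays])
        for (state, _), rest in zip(plays, remaining):
            runs[state] += rest
            counts[state] += 1
    # States that never occurred get None instead of a division by zero.
    return [runs[i] / counts[i] if counts[i] else None for i in range(n_states)]


# Toy data: a single, a two-run homer, then three outs; and a 1-2-3 inning.
toy_half_innings = [
    [(0, 0), (1, 2), (0, 0), (8, 0), (16, 0)],
    [(0, 0), (8, 0), (16, 0)],
]
toy_re = run_expectancy(toy_half_innings)
```

In this toy data, state 0 (bases empty, no outs) occurs three times with 2, 0, and 0 runs still to come, so toy_re[0] is 2/3.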

Step 2: Compute the raw coefficients. The "run expectancy" for a given play is the number of runs scored on the play, plus the RE-matrix entry for the final state, minus the RE-matrix entry for the initial state.

Now, for each kind of play (BB, HBP, 1B, 2B, 3B, HR, Outs), we compute a coefficient k(play). For example, we compute k(BB) by summing the run expectancy for every walk in the season(s), then dividing by the total number of walks occurring in the season(s). Schematically in pseudo-code (pidgin Python):

number_of_walks = 0
re_of_walks = 0
for walk_event in season_walk_events:
    # runs scored on the play, plus the change in run expectancy
    re_of_play = (runs_scored(walk_event)
                  + re[end_state(walk_event)] - re[start_state(walk_event)])
    re_of_walks = re_of_walks + re_of_play
    number_of_walks = number_of_walks + 1
k_BB = re_of_walks/number_of_walks

Given the structure of the RE-matrix, k(Outs) < 0 always.
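
The same loop works for every kind of play at once. Below is a minimal sketch, assuming each event is recorded as a (play, start state, end state, runs on play) tuple and that the three-out end of the inning is stored as an extra index 24 worth 0 runs. The event layout, the END_OF_INNING index, and the run-expectancy numbers are all made up for illustration.

```python
from collections import defaultdict

END_OF_INNING = 24  # assumed extra index for the three-out state, worth 0 runs


def raw_coefficients(events, re):
    """events: iterable of (play, start_state, end_state, runs_on_play).

    re: run expectancy indexed by state, with re[END_OF_INNING] == 0.
    Returns k(play), the average run value of each kind of play.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for play, start, end, runs_on_play in events:
        totals[play] += runs_on_play + re[end] - re[start]
        counts[play] += 1
    return {play: totals[play] / counts[play] for play in totals}


# Made-up run expectancies for the few states used below (not real data).
toy_re = [0.0] * 25
toy_re[0] = 0.5     # bases empty, 0 outs
toy_re[1] = 0.875   # runner on first, 0 outs
toy_re[8] = 0.25    # bases empty, 1 out
toy_re[16] = 0.125  # bases empty, 2 outs

toy_events = [
    ("BB", 0, 1, 0),
    ("HR", 1, 0, 2),
    ("Outs", 0, 8, 0),
    ("Outs", 16, END_OF_INNING, 0),
]
k = raw_coefficients(toy_events, toy_re)
```

With these numbers, k["BB"] is 0.375, k["HR"] is 1.625, and k["Outs"] is −0.1875, negative as expected.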

Step 3: Scale the raw coefficients. For each of the offensive plays (BB, HBP, 1B, 2B, 3B, HR) we have the coefficient c(play) = k(play) − k(Outs).

Step 4: Compute the wOBA. We now compute the wOBA for a player by the formula:

wOBA = c(BB)×(BB/PA) + c(HBP)×(HBP/PA) + c(1B)×(1B/PA) + c(2B)×(2B/PA) + c(3B)×(3B/PA) + c(HR)×(HR/PA)

The normalizations vary: sometimes the denominator is (AB + BB − IBB + SF + HBP) instead of PA. The intuition remains the same: we multiply each coefficient by the probability that our given batter will perform the given play.
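
Steps 3 and 4 fit in a few lines of Python. This is only a sketch: the raw coefficients and the batting line below are invented for illustration and are not real league values, and the function names are mine.

```python
def scale(raw):
    """Step 3: c(play) = k(play) - k(Outs) for every offensive play."""
    k_out = raw["Outs"]
    return {play: k - k_out for play, k in raw.items() if play != "Outs"}


def woba(coefficients, counts, plate_appearances):
    """Step 4: sum of c(play) * (play / PA) over the offensive plays."""
    return sum(coefficients[play] * counts[play] / plate_appearances
               for play in coefficients)


# Hypothetical raw coefficients (illustrative only).
raw = {"BB": 0.30, "HBP": 0.33, "1B": 0.45, "2B": 0.75,
       "3B": 1.05, "HR": 1.40, "Outs": -0.27}
# Hypothetical batting line over 600 plate appearances.
batter = {"BB": 50, "HBP": 5, "1B": 100, "2B": 30, "3B": 3, "HR": 20}

c = scale(raw)
batter_woba = woba(c, batter, 600)
```

Swapping in a different denominator only changes the last argument to woba.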

But that means wOBA is the expected value for some random variable.

Exercise 1. Assume for simplicity that PA = BB + HBP + 1B + 2B + 3B + HR + Outs. Prove the following formula holds:

wOBA = k(BB)×(BB/PA) + k(HBP)×(HBP/PA) + k(1B)×(1B/PA) + k(2B)×(2B/PA) + k(3B)×(3B/PA) + k(HR)×(HR/PA) + k(Outs)×(Outs/PA) − k(Outs)

[Hint: plug in the definition of the re-scaled coefficients in terms of the raw coefficients.]
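
A numerical check is no substitute for the proof, but it is a cheap way to catch a slip in the algebra. The sketch below evaluates both sides of the formula on made-up numbers (none of them real league values).

```python
# Hypothetical raw coefficients and batting line (illustrative only).
raw = {"BB": 0.30, "HBP": 0.33, "1B": 0.45, "2B": 0.75,
       "3B": 1.05, "HR": 1.40, "Outs": -0.27}
counts = {"BB": 50, "HBP": 5, "1B": 100, "2B": 30,
          "3B": 3, "HR": 20, "Outs": 392}

pa = sum(counts.values())  # the simplifying assumption of Exercise 1
k_out = raw["Outs"]

# Left-hand side: wOBA computed from the re-scaled coefficients c = k - k(Outs).
lhs = sum((raw[p] - k_out) * counts[p] / pa for p in counts if p != "Outs")

# Right-hand side: all raw coefficients, outs included, minus k(Outs).
rhs = sum(raw[p] * counts[p] / pa for p in counts) - k_out
```

Both lhs and rhs agree up to floating-point rounding.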

Due to the intricacies of a degenerate sigma algebra, the wOBA for a batter who has never even been at the plate once will be zero.

Mathematical Cleverness Hidden in the Coefficients

The raw coefficient k(play) is actually the conditional expectation value E[RE|B=play], where "RE" is the random variable describing the entries of the RE-matrix, and "B" is the random variable for the play at hand (the BB, HBP, 1B, etc.). Recall the conditional expectation is itself a random variable when the "B" is left unspecified.

The expression E[RE|B=play] is precisely step 2 in computing wOBA, and if we do not fix the play it gives us a "random variable": the function which, given a play, produces the corresponding coefficient for that play.

For a player's wOBA, this is just the expectation of the conditional expectation minus the coefficient for outs: wOBA = E_batter[E[RE|B]] − k(Outs).

A more elegant solution would be just to use E_batter[E[RE|B]], so bad players are penalized for their outs. The only plausible reason for subtracting out k(Outs) that I could think of is to make wOBA look superficially "similar" to SLG, but it does have a nifty feature that a batter who only strikes out will have a vanishing wOBA score as opposed to a negative score (which has its own drawbacks).

Remark 1. The astute reader may recall from basic probability that E[E[X|Y]] = E[X], which is true when we take both expectation values using the same probability distribution over the same probability space. But we are not doing that with wOBA: we are taking the inner expectation with respect to the season(s)'s average, and the outer expectation with respect to the batter's history. The geometry of the probability space is more subtle than one would think. It's more analogous to a fiber bundle, where the fiber is the probability space over the 25 states of the inning (the 24 states above plus the end of the inning), and the base space is the batter's possible plays.
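
A toy computation makes the point of Remark 1 visible. In the sketch below, with made-up run values and frequencies, averaging the conditional means against the league's own play frequencies recovers the overall league mean, while averaging them against a batter's frequencies gives something different.

```python
from collections import defaultdict

# Toy league data: (play, run value of that play). The numbers are made up.
league_events = [
    ("BB", 0.25), ("BB", 0.5), ("1B", 0.5), ("HR", 1.5),
    ("Outs", -0.25), ("Outs", -0.25), ("Outs", -0.5), ("Outs", -0.25),
]


def conditional_means(events):
    """k(play) = E[RE | B = play], estimated from the events."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for play, value in events:
        totals[play] += value
        counts[play] += 1
    return {play: totals[play] / counts[play] for play in totals}


def frequencies(events):
    """Empirical probability of each play."""
    counts = defaultdict(int)
    for play, _ in events:
        counts[play] += 1
    return {play: counts[play] / len(events) for play in counts}


k = conditional_means(league_events)
league_freq = frequencies(league_events)

# Same distribution inside and out: E[E[RE|B]] = E[RE].
overall_mean = sum(v for _, v in league_events) / len(league_events)
tower = sum(k[p] * league_freq[p] for p in k)

# Outer expectation over a hypothetical batter's play frequencies instead.
batter_freq = {"BB": 0.1, "1B": 0.2, "HR": 0.1, "Outs": 0.6}
batter_mean = sum(k[p] * batter_freq[p] for p in k)
```

Here tower and overall_mean are both 0.1875, while batter_mean comes out to 0.1.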

Possible Improvements

Steps 1 and 2 in the algorithm for computing wOBA coefficients didn't specify any conditions on which runs we look at. That is to say, we didn't restrict focus to a particular park, or for particular weather, and so on.

A possible improvement would be to project the statistics onto a particular subset: compute wOBA coefficients relative to a particular park, or for particular weather. This would factor into the statistic the park's idiosyncrasies.
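
As a rough sketch of what "projecting onto a subset" could look like in code, the function below groups events by park before averaging. It assumes each event already carries its run value (ideally computed against that park's own RE-matrix). The park names and numbers are placeholders, not real data.

```python
from collections import defaultdict


def raw_coefficients_by_park(events):
    """events: iterable of (park, play, run_value) triples.

    Returns {park: {play: average run value of that play in that park}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for park, play, run_value in events:
        totals[park][play] += run_value
        counts[park][play] += 1
    return {
        park: {play: totals[park][play] / counts[park][play] for play in plays}
        for park, plays in totals.items()
    }


# Placeholder events (illustrative only).
toy_events = [
    ("Park A", "HR", 1.5),
    ("Park A", "HR", 1.25),
    ("Park B", "HR", 1.0),
    ("Park B", "BB", 0.25),
]
by_park = raw_coefficients_by_park(toy_events)
```

The same grouping trick would work for weather or any other condition: replace the park with whatever key identifies the subset.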

More "controversial" improvements include counting "Caught Stealing" as a play, which doesn't really measure a batter's performance, but does measure a player's judgement and, more crucially, ability to steal.

Variance

As far as I can tell, The Book first introduced wOBA. It was rather quick in giving its variance in an appendix. Recall for a random variable on a finite probability space, the variance is:

Var(X) = (∑_j X(j)² Pr(j)) − E[X]²

where I have written out explicitly the E[X²] for emphasis on the structure of the formula. The Book asserts it is the same as the Bernoulli distribution's variance.

Exercise 2. Assume for simplicity that PA = BB + HBP + 1B + 2B + 3B + HR + Outs. Show the following formula holds:

var(wOBA) = c(BB)²×(BB/PA) + c(HBP)²×(HBP/PA) + c(1B)²×(1B/PA) + c(2B)²×(2B/PA) + c(3B)²×(3B/PA) + c(HR)²×(HR/PA) − (wOBA)²

where the superscript 2 indicates squaring, x² = x×x. Then prove or find a counter-example that

var(wOBA) = wOBA(1 − wOBA)/PA.
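
For experimenting with Exercise 2, the sketch below evaluates the first variance formula and compares it against the variance computed straight from the definition, treating an out as a play worth 0. The coefficients and counts are made up, and the function names are mine. Evaluating the second formula on the same numbers is left to the reader.

```python
def variance_by_formula(c, counts, pa):
    """Sum of c(play)^2 * (play/PA) over offensive plays, minus wOBA^2."""
    mean = sum(c[p] * counts[p] / pa for p in c)
    second_moment = sum(c[p] ** 2 * counts[p] / pa for p in c)
    return second_moment - mean ** 2


def variance_by_definition(c, counts, pa):
    """E[(X - E[X])^2], where X = c(play) on offensive plays and 0 on outs."""
    mean = sum(c[p] * counts[p] / pa for p in c)
    p_out = 1 - sum(counts[p] for p in c) / pa
    spread = sum((c[p] - mean) ** 2 * counts[p] / pa for p in c)
    return spread + (0 - mean) ** 2 * p_out


# Made-up scaled coefficients and batting line (illustrative only).
c = {"BB": 0.5, "1B": 0.75, "HR": 1.5}
counts = {"BB": 2, "1B": 3, "HR": 1}
pa = 16  # the remaining 10 plate appearances are outs

v_formula = variance_by_formula(c, counts, pa)
v_definition = variance_by_definition(c, counts, pa)
```

The two values agree up to rounding, which is what the first half of the exercise asks you to show in general.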
