Our goal is to use the House election results[1] to fit a very simple model of the electorate. We consider the electorate as having some number of “identity” groups. For example, we could divide by sex (the census only records this as an F/M binary), by age (“old,” 45 or older, vs. “young,” under 45), by education (college graduates vs. non-college graduates), or by racial identity (white vs. non-white). We recognize that these categories are limiting and much too simple. But we believe they’re a reasonable starting point, a balance between inclusiveness and having way too many variables.
For each congressional district where both major parties ran candidates, we have census estimates of the number of people in each of our demographic categories[2]. And from the census we have national-level turnout estimates for each of these groups as well[3].
All we can observe is the sum of all the votes in the district, not the ones cast by each group separately. But each district has a different demographic makeup and so each is a distinct piece of data about how each group is likely to vote.
The turnout numbers from the census are national averages and aren’t correct in any particular district. Since we don’t have more detailed turnout data, there’s not much we can do. But we do know the total number of votes observed in each district, and we should at least adjust the turnout numbers so that the total number of votes predicted by the turnout numbers and populations is close to the observed number of votes. For more on this adjustment, see below.
How likely is a voter in each group to vote for the democratic candidate in a contested race?
For each district, \(d\), we have the expected number of voters in each group: the number of people in group \(i\) in that district, \(N^{(d)}_i\), multiplied by the turnout for that group, \(t_i\), giving \(V^{(d)}_i = N^{(d)}_i t_i\). We also have the number of democratic votes, \(D^{(d)}\), republican votes, \(R^{(d)}\), and total votes, \(T^{(d)}\), which may exceed \(D^{(d)} + R^{(d)}\) since there may be third-party candidates. For the sake of simplicity, we assume that all groups are equally likely to vote for a third-party candidate. We want to estimate \(p_i\), the probability that a voter (in any district) in the \(i\)th group, given that they voted for a republican or democrat, will vote for the democratic candidate.
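To make the notation concrete, here is a minimal sketch, in Python, of these quantities for a single district. The group labels (a sex × age × education split) and every number below are invented purely for illustration.

```python
import numpy as np

# Hypothetical groups and made-up numbers for one district d, for illustration only.
groups = ["F/young/college", "F/young/non-college", "F/old/college", "F/old/non-college",
          "M/young/college", "M/young/non-college", "M/old/college", "M/old/non-college"]

N_d = np.array([40_000, 90_000, 50_000, 110_000, 35_000, 95_000, 45_000, 100_000])  # people in each group
t   = np.array([0.48, 0.32, 0.62, 0.55, 0.44, 0.30, 0.60, 0.57])                    # national turnout per group

V_d = N_d * t                                # expected voters per group, V^(d)_i = N^(d)_i * t_i
D_d, R_d, T_d = 120_000, 105_000, 232_000    # observed democratic, republican, and total votes (made up)
```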
The national turnout numbers from the census, multiplied by the populations of each group in the district, will not add up to the number of votes observed, since actual turnout varies from district to district. We adjust these turnout numbers via a technique[4] from Ghitza and Gelman, 2013.
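A minimal sketch of that adjustment, assuming the corrected logit-shift form discussed in the footnote and continuing the illustrative numbers above: shift every group's turnout by a single district-level amount on the logit scale until the predicted total vote matches the observed total.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import logit, expit   # expit is the inverse logit

def adjust_turnout(N_d, t, T_d):
    """Find the district-level shift delta so that predicted total votes match T_d,
    then return the adjusted per-group turnouts."""
    def vote_gap(delta):
        return np.sum(N_d * expit(logit(t) + delta)) - T_d
    # The gap is monotone in delta, so a bracketing root-finder suffices; the
    # bracket [-10, 10] is an assumption wide enough for any realistic shift.
    delta = brentq(vote_gap, -10.0, 10.0)
    return expit(logit(t) + delta)

# Continuing the illustrative numbers from the previous sketch:
t_hat_d = adjust_turnout(N_d, t, T_d)    # adjusted turnout for district d
V_d = N_d * t_hat_d                      # adjusted expected voters per group
```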
Bayes’ theorem relates the probability of a model (our demographic voting probabilities \(\{p_i\}\)) given the observed data (the number of democratic votes recorded in each district, \(\{D^{(d)}\}\)) to three other quantities: the likelihood of observing that data given the model, the unconditional probability of the model itself, \(P(\{p_i\})\), and \(P(\{D^{(d)}\})\), the unconditional probability of observing the “evidence”: \(\begin{equation} P(\{p_i\}|\{D^{(d)}\})P(\{D^{(d)}\}) = P(\{D^{(d)}\}|\{p_i\})P(\{p_i\}) \end{equation}\) In this situation, the thing we wish to compute, \(P(\{p_i\}|\{D^{(d)}\})\), is referred to as the “posterior” distribution.
\(P(\{p_i\})\) is called a “prior” and amounts to an assertion about what we think we know about the parameters before we have seen any of the data. In practice, this can often be set to something very boring; in our case, we will assume that our prior is simply that any \(p_i \in [0,1]\) is equally likely.
\(P(\{D^{(d)}\})\) is the unconditional probability of observing the specific outcome \(\{D^{(d)}\}\), that is, the specific set of election results we observe. This is difficult to compute! In principle we can compute it by marginalizing over all possible models: \(\begin{equation} P(\{D^{(d)}\}) = \int P(\{D^{(d)}\}|\{p_i\})\,P(\{p_i\})\,d\{p_i\} \end{equation}\) But in general, we’d like to compute the posterior in some way that avoids needing the probability of the evidence.
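The standard way to sidestep it: since \(P(\{D^{(d)}\})\) does not depend on \(\{p_i\}\), the posterior is proportional to the likelihood multiplied by the prior, \(\begin{equation} P(\{p_i\}|\{D^{(d)}\}) \propto P(\{D^{(d)}\}|\{p_i\})\,P(\{p_i\}) \end{equation}\) and with our flat prior, finding the most probable \(\{p_i\}\) is the same as maximizing the likelihood itself, which is what we do below.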
\(P(\{D^{(d)}\}|\{p_i\})\), the probability that we observed our evidence (the election results) given a specific set \(\{p_i\}\) of voter preferences, is a thing we can calculate. Our \(p_i\) are the probability that one voter of type \(i\), who votes for a democrat or republican, chooses the democrat. We assume, for the sake of simplicity, that for each demographic group \(i\), each voter’s vote is like a coin flip where the coin comes up “Democrat” with probability \(p_i\) and “Republican” with probability \(1-p_i\). This distribution of single-voter outcomes is known as the Bernoulli distribution. Given \(V^{(d)}_i\) voters of that type, the distribution of democratic votes from that type of voter is binomial with \(V^{(d)}_i\) trials and \(p_i\) probability of success. But \(V^{(d)}_i\) is quite large! So we can approximate this with a normal distribution with mean \(V^{(d)}_i p_i\) and variance \(V^{(d)}_i p_i (1 - p_i)\) (see Wikipedia). However, we can’t observe the number of votes from just one type of voter. We can only observe the sum over all types. Luckily, the sum of independent normally distributed random variables follows a normal distribution as well. So the distribution of democratic votes across all types of voters is also approximately normal, with mean \(\sum_i V^{(d)}_i p_i\) and variance \(\sum_i V^{(d)}_i p_i (1 - p_i)\) (again, see Wikipedia). Thus we have \(P(D^{(d)}|\{p_i\})\), or, more precisely, its probability density. But that means we also know the probability density of all the evidence given \(\{p_i\}\), \(\rho(\{D^{(d)}\}|\{p_i\})\), since that is just the product of the densities for each district: \(\begin{equation} \mu^{(d)}(\{p_i\}) = \sum_i V^{(d)}_i p_i\\ v^{(d)}(\{p_i\}) = \sum_i V^{(d)}_i p_i (1 - p_i)\\ \rho(D^{(d)}|\{p_i\}) = \frac{1}{\sqrt{2\pi v^{(d)}(\{p_i\})}}e^{-\frac{(D^{(d)} - \mu^{(d)}(\{p_i\}))^2}{2v^{(d)}(\{p_i\})}}\\ \rho(\{D^{(d)}\}|\{p_i\}) = \prod_d \rho(D^{(d)}|\{p_i\}) \end{equation}\)
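As a minimal sketch of how one might evaluate this density numerically, assume the adjusted expected-voter counts are arranged in a districts-by-groups array \(V\) and the democratic vote totals in a vector \(D\); working with the logarithm of the density avoids floating-point underflow when multiplying many small numbers.

```python
import numpy as np

def log_likelihood(p, V, D):
    """log rho({D}|{p}) summed over districts, using the normal approximation above.

    p : array of group preferences p_i, one entry per group
    V : districts-by-groups array of adjusted expected voters V^(d)_i
    D : array of democratic vote totals D^(d), one entry per district
    """
    mu = V @ p                       # mu^(d)(p) = sum_i V^(d)_i p_i
    var = V @ (p * (1.0 - p))        # v^(d)(p)  = sum_i V^(d)_i p_i (1 - p_i)
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - (D - mu) ** 2 / (2.0 * var))
```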
Now that we have this probability density, we want to look for the set of voter preferences which maximizes it. There are many methods to do this, but in this case, because the density has a simple shape and we can compute its gradient, a good numerical optimizer is all we need. That gives us maximum-likelihood estimates of the \(p_i\) and the covariances among those estimates.
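A minimal sketch of that optimization, using the `log_likelihood` function from the previous sketch and scipy's general-purpose minimizer (with numerical gradients for brevity, though the analytic gradient could be supplied):

```python
import numpy as np
from scipy.optimize import minimize

def fit_preferences(V, D):
    n_groups = V.shape[1]
    x0 = np.full(n_groups, 0.5)                   # start each group at 50/50
    result = minimize(
        lambda p: -log_likelihood(p, V, D),       # maximize by minimizing the negative
        x0,
        method="L-BFGS-B",
        bounds=[(1e-6, 1.0 - 1e-6)] * n_groups,   # keep each p_i strictly inside (0, 1)
    )
    p_hat = result.x                              # maximum-likelihood estimates of p_i
    cov = result.hess_inv.todense()               # rough covariance from the optimizer's inverse-Hessian approximation
    return p_hat, cov
```

In practice one might compute the Hessian of the log-likelihood directly at the optimum for a better covariance estimate; the L-BFGS-B approximation shown here is only a rough stand-in.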
Want to read more from Blue Ripple? Visit our website, sign up for email updates, and follow us on Twitter and Facebook. Folks interested in our data and modeling efforts should also check out our GitHub page.
[1] MIT Election Data and Science Lab, 2017, “U.S. House 1976–2018”, https://doi.org/10.7910/DVN/IG0UN2, Harvard Dataverse, V3, UNF:6:KlGyqtI+H+vGh2pDCVp7cA== [fileUNF]
[2] Source: US Census, American Community Survey, https://www.census.gov/programs-surveys/acs.html
[3] Source: US Census, Voting and Registration Tables, https://www.census.gov/topics/public-sector/voting/data/tables.2014.html. NB: We are using 2017 demographic population data for our 2018 analysis, since that is the latest available from the census. We will update this once the census publishes updated 2018 American Community Survey data.
[4] We note that there is an error in the 2013 Ghitza and Gelman paper, one which is corrected in a more recent working paper (http://www.stat.columbia.edu/~gelman/research/published/mrp_voterfile_20181030.pdf) by the same authors. In the 2013 paper, a correction is derived for turnout in each district by finding the \(\delta^{(d)}\) which minimizes \(|T^{(d)} - \sum_i N^{(d)}_i \mathrm{logit}^{-1}(\mathrm{logit}(t_i) + \delta^{(d)})|\). The authors then state that the adjusted turnout in district \(d\) is \(\hat{t}^{(d)}_i = t_i + \delta^{(d)}\), which doesn’t make sense since \(\delta^{(d)}\) is a shift on the logit scale, not a probability. This is corrected in the working paper to \(\hat{t}^{(d)}_i = \mathrm{logit}^{-1}(\mathrm{logit}(t_i) + \delta^{(d)})\).