Is there evidence of pollsters herding?

Yesterday, the Telegraph published an article by Dan Hodges accusing UK pollsters of herding. For those unfamiliar with the term, herding is the unnatural convergence of results between polling companies, sometimes seen as an election approaches. Various cases have been documented in the US – the more egregious ones involve pollsters deliberately manipulating their numbers.

But herding can also take more subtle forms. Pollsters have a number of methodological decisions to make in order to achieve a sample representative of the population – phone or online fieldwork, which variables to weight by, which parties to prompt for, how to model turnout, how to handle don’t knows and refusers, and so on. In many cases there is no “right” or “wrong” answer, but these decisions have consequences for the final results. If pollsters systematically err on the side of the herd when picking a methodological path, polls will end up with unnaturally similar results.

So is there evidence that herding happened in Britain? I’ve replicated, as closely as possible, the Nate Silver analysis that Hodges referenced, based on the Labour-Conservative spread, taking the absolute difference between each poll and its 21-day rolling average. I’ve done it in two versions – one for all regular pollsters and a second that excludes YouGov, because the very high frequency of their polls gives them a correspondingly outsize weighting in the version that includes them:

[Chart: The LOESS trend is almost flat. Absolute deviations of Conservative-Labour vote share spreads from their 21-day moving average; the LOESS trend uses a 0.8 smoothing alpha.]

You can see from the chart that polls didn’t suddenly fall into line as election day approached. The shape of the LOESS trendline looks nothing like the “herding” one in Nate Silver’s piece – it looks much more like the one in the “gold standard” chart (the second one) in this piece by his colleague Harry Enten. (Incidentally, this chart does highlight a genuine problem – the reaction to outliers. If I weighted the polls by column inches and airtime they received, it would look rather different…)
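For anyone who wants to reproduce the rolling-deviation calculation, something along these lines would do it. This is a rough Python sketch, not my exact workings (the charts above were done elsewhere): the column names and the polls file are illustrative, and it skips Silver’s step of excluding same-firm polls from the rolling average.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical input: one row per poll with fieldwork end date and Con/Lab shares.
polls = pd.read_csv("gb_polls.csv", parse_dates=["end_date"]).sort_values("end_date")
polls["spread"] = polls["con"] - polls["lab"]

# 21-day rolling average of the spread, computed on a daily series so the
# window is measured in days rather than in number of polls.
daily = polls.set_index("end_date")["spread"].resample("D").mean()
rolling = daily.rolling("21D", min_periods=1).mean()

# Each poll's absolute deviation from the rolling average on its fieldwork end date.
polls["abs_dev"] = np.abs(
    polls["spread"].to_numpy()
    - rolling.reindex(polls["end_date"].dt.normalize()).to_numpy()
)

# LOESS trend through the absolute deviations, alpha = 0.8 as in the chart.
x = polls["end_date"].map(pd.Timestamp.toordinal).to_numpy()
trend = lowess(polls["abs_dev"].to_numpy(), x, frac=0.8)
```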

There are legitimate reasons why polls might converge. To give just a couple of examples, as don’t knows make their minds up, the impact of some pollsters excluding them while others reallocate them in various ways becomes smaller. And at the very end of the campaign, pollsters often increase their sample sizes, reducing (sometimes considerably) the variability inherent in sampling. I wrote about how house effects were relatively small in October and Anthony Wells did so in January. There have been periods of temporarily increased divergence, particularly between phone and online polls, but not in a way that looks suspicious.

But why are the trendlines lower (implying smaller errors) than in the US races that FiveThirtyEight analysed? There are two main reasons. Firstly, sample sizes are typically larger in Great Britain-wide polls than in US Senate polls of a single state (headline sample sizes of 2,000 are now standard online and sub-1,000 in any mode is now very rare).
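To put rough numbers on the sample size point, here is the standard margin-of-error calculation for a single share at the two sample sizes. This is a back-of-envelope illustration under pure random sampling, not a figure taken from any actual poll:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error on a share p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (800, 2000):
    print(f"n = {n}: +/-{100 * margin_of_error(0.35, n):.1f} points on a 35% share")
# n = 800:  +/-3.3 points
# n = 2000: +/-2.1 points
```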

The second reason is more technical. The “theoretical minimum” average error that Silver calculates is based on the standard error (from which the margin of error is also derived) of the spread between the Democrat and the Republican. When there is no-one else in the race, the correlation between the two is basically assumed to be perfectly negative (US polls tend to include don’t knows in their headline numbers, but for this purpose their impact is small). This means that the statistical error on the lead is basically double the error on one of the parties.

But in Britain’s multiparty environment, things are different. Labour and Conservative vote shares are not perfectly negatively correlated. Case in point – from the start of 2015, both gained at the expense of smaller parties. Therefore the standard error on the lead is quite a bit less than double the error on the share (exactly how much less is tricky to calculate, because the correlation between the two shares is itself hard to estimate).
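A quick sketch of why the correlation matters, using illustrative shares and a sample size of 2,000 rather than figures from any actual poll. The standard error of the lead is √(s₁² + s₂² − 2ρs₁s₂), where s₁ and s₂ are the standard errors of the two shares and ρ is the correlation between them:

```python
import math

def se_share(p: float, n: int) -> float:
    """Standard error of a single share under pure random sampling."""
    return math.sqrt(p * (1 - p) / n)

def se_lead(p_con: float, p_lab: float, n: int, rho: float) -> float:
    """Standard error of (Con - Lab), given a correlation rho between the two estimates."""
    s_con, s_lab = se_share(p_con, n), se_share(p_lab, n)
    return math.sqrt(s_con**2 + s_lab**2 - 2 * rho * s_con * s_lab)

n, con, lab = 2000, 0.34, 0.34
for rho in (-1.0, 0.0, 0.5):
    print(f"rho = {rho:+.1f}: SE of lead = {100 * se_lead(con, lab, n, rho):.2f} points")
# rho = -1.0 gives exactly twice the single-share SE (the pure two-horse-race case);
# anything less negative gives a smaller standard error on the lead.
```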

But what about the pattern of polling errors in final polls? I’ve devised an exercise to examine just that, for each party. Firstly I’ve taken the polling error of each pollster and subtracted the average error across all 11 pollsters in this table. Then I’ve calculated the standard error for each observation, taking into account the vote share of the party in question and the effective sample size in each poll. The actual error is then expressed as a multiple of the standard error, giving the normalised error.
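In code terms, the normalisation looks roughly like this. It is a sketch: the column names are illustrative, and using √(p(1−p)/n_eff) as the standard error is the natural reading of the description above rather than a verbatim transcription of my workings.

```python
import numpy as np
import pandas as pd

def normalised_errors(final_polls: pd.DataFrame, result: float) -> pd.Series:
    """final_polls: one row per pollster, with 'share' (final-poll share for one
    party, on a 0-1 scale) and 'n_eff' (effective sample size).
    result: the party's actual vote share, also on a 0-1 scale."""
    error = final_polls["share"] - result            # raw polling error
    centred = error - error.mean()                   # subtract the industry-wide average error
    se = np.sqrt(final_polls["share"] * (1 - final_polls["share"]) / final_polls["n_eff"])
    return centred / se                              # error as a multiple of its standard error
```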

In the following bar charts, each “block” represents one final call poll whose normalised error rounds to that number of sigmas above or below the mean (zero). The dotted line is the distribution we’d theoretically expect given true random samples (and apologies to pedants for the fact that the profile is mildly distorted between sigma points, due to an apparent limitation in Excel, but at the round numbers its level is correct). And since you can’t have half a pollster, the bar chart pattern is inevitably a bit “lumpy”.
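The theoretical distribution itself is easy to compute: under a standard normal, the expected number of pollsters (out of 11) whose normalised error rounds to k sigmas is 11 times the probability mass between k − 0.5 and k + 0.5. A minimal sketch follows; note it uses the binned probability mass, which will differ very slightly from a normal curve scaled to the round numbers.

```python
from scipy.stats import norm

def expected_counts(n_pollsters, sigmas=range(-3, 4)):
    """Expected number of pollsters whose normalised error rounds to each sigma,
    assuming a standard normal: n * P(k - 0.5 < Z <= k + 0.5)."""
    return {k: n_pollsters * (norm.cdf(k + 0.5) - norm.cdf(k - 0.5)) for k in sigmas}

print(expected_counts(11))   # roughly 4.2 at zero, 2.7 at +/-1, 0.7 at +/-2
```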

If pollsters were herding, there would be too many of them in the middle of the chart and too few at the wings. But for the Conservative vote share, we see the opposite:

[Chart: Distribution of normalised errors (Conservative vote share). Counts of pollsters (total 11) at each sigma ±0.5, based on final call polls normalised to party means and to poll/party standard error.]

For Labour, it’s less clear cut at first glance. We’d expect four or five pollsters to be bang in the middle and there are five. There are too many at +1 and too few at -1. But across these three points we find 10 pollsters, the closest whole number to the 9.7 that you’d statistically expect:

[Chart: Distribution of normalised errors (Labour vote share). Counts of pollsters (total 11) at each sigma ±0.5, based on final call polls normalised to party means and to poll/party standard error.]

Now let’s combine all five GB-wide parties (55 observations in total) into one chart. For those interested, the individual charts for the Lib Dems, Greens and UKIP are available separately. If there’s one chart that tells the story, this is it. The distribution is flatter than implied by theory:

[Chart: This does not look like herding. Counts of pollster/party combinations (total 55) at each sigma ±0.5, based on final call polls normalised to party means and to poll/party standard error.]

This is what you’d expect – the theoretical estimate assumes pure random samples, which (as recent events may have demonstrated) these are not, so the observed errors will vary more. These results have a standard deviation of about 1.8, and a fair amount of excess kurtosis (fat tails). In short, no sign of anything dodgy.
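For the statistically minded, those two numbers are just the sample standard deviation and excess kurtosis of the 55 normalised errors; something like the following would compute them (the underlying values aren’t reproduced here):

```python
import numpy as np
from scipy.stats import kurtosis

def summarise(norm_errors: np.ndarray) -> tuple[float, float]:
    """Sample standard deviation and excess kurtosis of the normalised errors."""
    return float(np.std(norm_errors, ddof=1)), float(kurtosis(norm_errors))

# sd, excess_kurt = summarise(norm_errors)   # ~1.8 and > 0 (fat tails), per the text
```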

None of this guarantees that there weren’t isolated cases of individual pollsters herding. But the polling failure was an industry-wide problem, and the evidence doesn’t support the idea that herding was a cause.






  • Martin Rosenbaum

    Hi Matt,
    Thanks for that analysis, very interesting indeed.
    Some queries:
    1. In terms of your analysis of the errors on the final polls, surely there could be herding on the lead (spread), even if the means by which that is achieved is not through herding on the party vote shares? When you look at the final polls, it is the similarity on the lead which is striking, even with the variation on the shares.
    2. As for your loess trendline, would it really reflect herding just on the final polls in the last two to three days in the campaign? What would it look like with a lower smoothing parameter? Maybe Nate Silver’s point is that even with a high smoothing parameter, herding is apparent in the US. Perhaps we don’t have herding here to the same extent, but nevertheless some such effect could be apparent with lower smoothing?
    3. In calculating the loess trendline, the Nate Silver method excludes polls from the same firm from the 21-day rolling average. I’m not convinced by the logic in that. If herding is happening, by whatever means, whether conscious or subconscious or a by-product, why should new polls not be influenced by old polls from the same organisation? Did you follow his method on that? Does it make much difference whether you do or don’t?
    4. With regard to your concluding sentence, I don’t think the real issue here (whatever certain people say) is whether herding is the fundamental cause of the problem. Surely the real question is whether, if something else is already going wrong, any herding that exists would exacerbate that and make a bad situation worse?
    Martin

  • Matt Singh (https://www.ncpolitics.uk)

    Hi Martin, thanks very much for your comments. To take each in turn:
    1. This definitely needs some further work – the main problem is that the standard error of the lead isn’t observable – you need some estimate of the correlation which, annoyingly, has been very unstable and in the final results it was about +0.8! I’m undecided on what it should be. But given the limited direct switching of votes between the two main parties (especially when combined with past vote weighting) and the squeeze of the smaller parties, you could quite plausibly have a positive rho, in which case the similarity of leads wouldn’t be a problem. But also you could get a spurious correlation due to house effects.
    2. I’ll retry it with a lower alpha as soon as I’m back at my PC. But you have to be careful because sample sizes increased quite a lot in the final week, so errors aren’t an apples-to-apples comparison…
    3. I didn’t do that – is the reference in the same article? I didn’t see it. I suspect that’s something he does anyway to stop the averages gravitating towards the more regular pollsters. Some of the GB averages do that too, or (as with UKPR) weight older ones down.
    4. Yes, agreed. But it’s hard to answer that question until we know what went wrong. For example, if it’s a sampling issue, it might be expected to have very different effects on different pollsters, given that the sources of their samples are so different. Then the scope for herding would be far greater. Much less so if it’s a behavioural issue.
    Hope this helps
    Matt