Findings hold warning for tracking efforts as governments, health officials navigate pandemic
When Delphi-Facebook and the U.S. Census Bureau provided estimates of COVID-19 vaccine uptake last spring, their weekly reports drew on responses from as many as 250,000 people.
The data sets boasted statistically tiny margins of error, raising confidence that the numbers were correct. But when the Centers for Disease Control and Prevention reported actual vaccination rates, the two polls were off — by a lot. By the end of May, the Delphi-Facebook study had overestimated vaccine uptake by 17 percentage points — 70 percent versus 53 percent, according to the CDC — and the Census Bureau’s Household Pulse Survey had done the same by 14 percentage points.
“A biased big data survey can be worse than no survey at all,” said Xiao-Li Meng, Editor-in-Chief of the Harvard Data Science Review and the Whipple V.N. Jones Professor of Statistics.
A comparative analysis by statisticians and political scientists from Harvard, Oxford, and Stanford universities concludes that the surveys fell victim to the “big data paradox,” a mathematical tendency of big data sets to minimize one type of error — due to small sample size — but magnify another that tends to get less attention: flaws linked to systematic biases that make the sample a poor representation of the larger population.
The big data paradox was identified and named by one of the study’s authors, Harvard’s Meng, in his 2018 analysis of polling during the 2016 presidential election. Famous for predicting a Hillary Clinton victory, the polls were skewed by “nonresponse bias” — in this case, the tendency of Trump voters to either not respond or describe themselves as “undecided.”
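The paradox can be seen in a short simulation. The sketch below (an illustration, not the authors’ methodology; the response-rate numbers are invented for demonstration) models a population with a true vaccination rate of 53 percent, the CDC figure cited above, in which vaccinated people are more likely to respond to a survey. A 250,000-person biased sample produces a tiny margin of error yet lands far from the truth, while a much smaller unbiased sample has a wider margin of error but is centered on the correct value.

```python
import random

random.seed(42)

TRUE_RATE = 0.53  # actual vaccination rate (CDC figure cited in the article)

def biased_sample(n, respond_if_vaccinated=0.9, respond_if_not=0.5):
    """Draw n respondents, where vaccinated people are more likely to
    respond -- a stand-in for the nonresponse bias described above.
    (The two response probabilities are hypothetical.)"""
    sample = []
    while len(sample) < n:
        vaccinated = random.random() < TRUE_RATE
        p_respond = respond_if_vaccinated if vaccinated else respond_if_not
        if random.random() < p_respond:
            sample.append(vaccinated)
    return sample

def estimate(sample):
    """Return the sample proportion and its 95% margin of error."""
    n = len(sample)
    p = sum(sample) / n
    margin = 1.96 * (p * (1 - p) / n) ** 0.5
    return p, margin

# Huge biased sample: margin of error shrinks toward zero,
# but the systematic error (bias) does not shrink at all.
p_big, moe_big = estimate(biased_sample(250_000))

# Small unbiased sample: wider margin of error, centered on the truth.
p_small, moe_small = estimate(
    [random.random() < TRUE_RATE for _ in range(1_000)]
)
```

With these assumed response rates, the large biased sample overestimates uptake by roughly the same order as the gap the article reports, while its margin of error falls below half a percentage point — precision that misleads rather than reassures.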