Modern machine learning methods have enabled major advances in big data analysis, but current state-of-the-art technology is not suited to the intricacies of surveys that use complex sampling methods. With the support of a three-year grant of $337,000 from the National Science Foundation, assistant professor of statistics Paul Parker will develop statistical and machine learning methods best suited to the analysis of complex surveys produced by federal statistical offices.
“We’re currently in this revolution in data science and machine learning where there are all these new methods that can analyze these huge data sets and do it really well, but they can’t necessarily be used off the shelf for these survey datasets,” Parker said. “That’s because they usually rely on a simple random sample of the population, which isn’t the case with these types of surveys.”
This project will focus on a group of surveys produced by the National Center for Science and Engineering Statistics (NCSES), such as the National Survey of College Graduates and the Survey of Earned Doctorates, which help inform key official population estimates. Rather than sampling a population with equal probability, these and other federal surveys typically over- or under-sample certain groups.
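To see why unequal sampling matters, consider a minimal sketch in Python. It is purely illustrative and is not drawn from Parker's project or from any NCSES survey: the population, the group values, and the inclusion probabilities are all hypothetical, and the weighting shown is the standard inverse-probability (Hájek-style) estimator rather than the new methods described here.

```python
# Hypothetical population (assumption): 900 people in group A, each with value 50,
# and 100 people in group B, each with value 100.
true_mean = (900 * 50.0 + 100 * 100.0) / 1000

# Assumed design: group B is oversampled. Each record is
# (observed value, probability that this person was included in the sample).
sample = [(50.0, 0.1)] * 90 + [(100.0, 0.5)] * 50


def unweighted_mean(records):
    """Treat the sample as if it were a simple random sample."""
    return sum(y for y, _ in records) / len(records)


def weighted_mean(records):
    """Weight each respondent by 1 / inclusion probability (Hajek-style estimator)."""
    weighted_total = sum(y / p for y, p in records)
    weight_sum = sum(1.0 / p for _, p in records)
    return weighted_total / weight_sum


print("true mean:      ", true_mean)
print("unweighted mean:", round(unweighted_mean(sample), 1))
print("weighted mean:  ", round(weighted_mean(sample), 1))
```

In this toy setup the unweighted average lands near 67.9 because the oversampled group dominates the sample, while the weighted average recovers the population mean of 55. An off-the-shelf model that ignores the design is making the same mistake as the unweighted average.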
Parker will develop statistical methods for machine learning models specifically designed to take into account survey design, the unique way data is collected. He wants to take advantage of the ability of machine learning technology to create flexible data models that can often improve the accuracy of population estimates.
However, many machine learning models are not equipped to provide the uncertainty estimates that must accompany population estimates, a shortcoming that Parker will address through the frameworks he develops.
“[The project addresses] two things: consider the study design, but also incorporate it into a statistical framework to generate those uncertainty estimates,” Parker said. “I think these are the two areas where our expertise will help improve these models.”
These new methods will benefit agencies tasked with producing population estimates from NCSES surveys, which increasingly face a combination of limited resources and higher expectations for their work, Parker said. The improved estimates will also be useful to people who interpret the resulting data and make policy or funding decisions based on it.
Ultimately, Parker hopes these methods will be more broadly applicable to other federal statistical offices, as well as areas such as economics and sociology that deal with dependent research data sets.
This project is funded by the NSF’s National Center for Science and Engineering Statistics and will be a collaboration with lead researcher Scott Holan of the University of Missouri.