The predictive power of artificial intelligence (AI) machine learning accelerates discoveries in the life sciences.
A new study shows how AI and genomics can predict future mutations of SARS-CoV-2, the virus that causes COVID-19.
“The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic is characterized by transmission waves initiated by new variants replacing the older ones,” wrote the research team from the Broad Institute of MIT and Harvard, with co-authors from the University of Massachusetts Medical School and other affiliated institutions. “Given this pattern of emergence, there is a clear need for the early detection of new variants to prevent excess deaths.”
The research team developed a hierarchical Bayesian regression AI model called PyR0 that can provide scalable analysis of the full set of publicly available SARS-CoV-2 genomes. The Bayesian model predicts emerging viral lineages.
The algorithm is fully Bayesian. Unlike frequentist linear regression, Bayesian linear regression works with probability distributions instead of point estimates, and it treats the output as drawn from a normal (Gaussian) distribution. The goal of Bayesian linear regression is to find the posterior distribution over the model parameters rather than a single optimal value for each parameter.
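To make the contrast concrete, here is a minimal sketch of Bayesian linear regression on synthetic data. It is not the PyR0 model: it assumes a zero-mean Gaussian prior on the weights and a known noise level, which gives a closed-form posterior. The function name `bayes_linreg_posterior`, the parameters `noise_std` and `prior_std`, and all data values are hypothetical and chosen only for illustration.

```python
import numpy as np

def bayes_linreg_posterior(X, y, noise_std=0.5, prior_std=10.0):
    """Posterior N(mean, cov) over weights for y = X @ w + Gaussian noise.

    Assumes a zero-mean Gaussian prior on w and a known noise level.
    """
    d = X.shape[1]
    # Posterior precision = prior precision + data precision.
    precision = np.eye(d) / prior_std**2 + X.T @ X / noise_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_std**2
    return mean, cov

# Synthetic data: intercept 1.0, slope 2.0 (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)

# Frequentist answer: one point estimate per parameter.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Bayesian answer: a full distribution per parameter.
w_mean, w_cov = bayes_linreg_posterior(X, y)
w_std = np.sqrt(np.diag(w_cov))
# Approximate 95% credible interval for each weight.
lower, upper = w_mean - 1.96 * w_std, w_mean + 1.96 * w_std
```

With a weak prior, the posterior mean lands very close to the least-squares estimate. What the Bayesian version adds is the covariance, which quantifies how uncertain each parameter remains after seeing the data.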
Through systematic backtesting, the researchers found that the model would have provided an early warning and aided in the identification of variants of concern (VoCs) had it been applied routinely to SARS-CoV-2 samples. They say this confirms its utility for public health and underscores the value of rapid sharing of genomic data.
The AI model accommodated 6,466,300 SARS-CoV-2 genomes from GISAID (Global Initiative on Sharing All Influenza Data). The team used stochastic variational inference to fit the large model. Even with this approach, the task required solving an optimization problem with more than 75 million dimensions.
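Stochastic variational inference turns posterior estimation into an optimization problem that can be solved on small random batches of data, which is what makes datasets of this size tractable. The sketch below shows the idea on a toy regression problem, not on genomes and not with the PyR0 code. It assumes a Gaussian approximation with independent weights, a known noise level, and a hand-written Adam-style update; `fit_svi` and every parameter name and value are hypothetical.

```python
import numpy as np

def fit_svi(X, y, sigma=0.5, tau=10.0, batch_size=100, steps=3000,
            lr=0.01, seed=0):
    """Fit q(w) = N(m, diag(s^2)) by stochastic gradient ascent on the ELBO."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = np.zeros(d)          # variational means
    rho = np.full(d, -2.0)   # log of variational standard deviations
    mom1 = np.zeros(2 * d)   # Adam first-moment estimate for [m, rho]
    mom2 = np.zeros(2 * d)   # Adam second-moment estimate for [m, rho]
    b1, b2 = 0.9, 0.99
    for t in range(1, steps + 1):
        # Use a random minibatch instead of the full dataset.
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Reparameterized sample from q: w = m + s * eps.
        eps = rng.standard_normal(d)
        s = np.exp(rho)
        w = m + s * eps
        # Gradient of the log joint w.r.t. w; the likelihood term is
        # rescaled by n / batch_size so it estimates the full-data gradient.
        g = (n / batch_size) * Xb.T @ (yb - Xb @ w) / sigma**2 - w / tau**2
        # ELBO gradients: d/dm = g, d/drho = g * s * eps + 1 (entropy term).
        grad = np.concatenate([g, g * s * eps + 1.0])
        mom1 = b1 * mom1 + (1 - b1) * grad
        mom2 = b2 * mom2 + (1 - b2) * grad**2
        step = lr * (mom1 / (1 - b1**t)) / (np.sqrt(mom2 / (1 - b2**t)) + 1e-8)
        m += step[:d]    # ascent: move in the direction that raises the ELBO
        rho += step[d:]
    return m, np.exp(rho)

# Toy data with three known weights (illustrative values only).
rng = np.random.default_rng(1)
w_true = np.array([1.5, -2.0, 0.5])
X = rng.standard_normal((2000, 3))
y = X @ w_true + rng.normal(0.0, 0.5, size=2000)

m, s = fit_svi(X, y)
```

Each update touches only 100 of the 2,000 rows, so the cost per step does not grow with the size of the dataset. The same property, at a much larger scale, is what lets a model with millions of parameters be fit to millions of samples.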
The scientists divided the genetic samples into clusters and then analyzed the fitness of each cluster. Specifically, the team created 3,000 clusters from 1,544 PANGO lineages and modeled the fitness of individual lineages across 1,560 geographic regions.
The study authors reported that the model correctly infers that Omicron (PANGO lineage BA.2), a variant classified by the World Health Organization, has the highest fitness to date: 8.9 times [95 percent confidence interval (CI) 8.6 to 9.2] higher than the original A lineage, accurately foreshadowing its rise in the regions where it circulates.
According to the researchers, their algorithm can be applied to different viral phenotypes and to any viral genomic dataset.
“Using this model, emerging lineages can be seen along with the mutations that contribute to transmissibility not only in Spike but also in other viral proteins,” the authors reported. “The model can prioritize lineages when they come forward for public health concerns.”
Copyright © 2022 Cami Rosso. All rights reserved.