Despite being around for less than three years, the COVID-causing virus SARS-CoV-2 is arguably the most studied and genetically sequenced pathogen in history. Disease surveillance teams around the world have uploaded millions of viral sequences to public databases that allow researchers to track how the virus is spreading.
A new computational model has mined this unprecedented amount of data (more than 6.4 million SARS-CoV-2 sequences) to find patterns among the mutations that help a new viral strain spread around the world. The model, called PyR0, analyzed how different viral lineages originated and spread between December 2019 and January 2022. From this data, it learned to identify which combinations of mutations help variants such as Delta or Omicron predominate and how long that takes. The model, which a team of researchers described in Science in May, could give public health programs advance notice of which lineages are potentially dangerous and allow officials to plan ahead.
PyR0 used data leading up to mid-December 2021 to correctly predict that Omicron’s BA.2 subvariant, which was rare in much of the world at the time, would spread rapidly. By March 2022, BA.2 had become the dominant variant worldwide. If the model had been run in November 2020, it would also have correctly predicted that the Alpha variant would soon become dominant; the World Health Organization identified Alpha as a variant of concern only in December of that year.
Most COVID vaccines target the virus’s spike protein, which it uses to enter cells. Mutations in this protein appear to allow certain variants to escape the body’s immune response to the virus, whether that response comes from vaccination or previous infection. The PyR0 model found that simply having numerous spike protein mutations did not necessarily make a variant more evolutionarily fit. But a few specific spike mutations helped the Omicron subvariants BA.1 and BA.2 evade the immune system in late 2021.
PyR0 also found that a series of nonspike mutations in BA.2’s genome that affect how the virus replicates may contribute to its rapid spread. The model’s ability to quickly analyze entire genomes, the researchers say, could help scientists identify which regions of the virus’s genome should be studied to develop future therapies.
Scientific American spoke with co-author Jacob Lemieux, an infectious disease researcher at the Broad Institute of the Massachusetts Institute of Technology and Harvard University and a physician at Massachusetts General Hospital in Boston, about how algorithms that “learn” from large data sets could shape the future of the pandemic.
[An edited transcript of the interview follows.]
What can PyR0 tell us about the next predominant variants?
We can’t necessarily say what’s going to happen in terms of mutations. We can say what will happen next in terms of which lineages are most likely to increase in frequency.
In other words, if one car is traveling at 70 miles per hour and another car is traveling at 35 miles per hour, we can predict that within a certain time the car traveling at 70 miles per hour will overtake the other car. But those predictions are only good for the foreseeable future, because the way the pandemic works is that all of a sudden a 210 mph car comes out of nowhere and completely changes the dynamics.
The amazing thing is that it has happened again and again. First it was the D614G variant, then it was Alpha, then it was Delta, then it was Omicron; now it is Omicron BA.2 and its close cousins BA.4 and BA.5. So this kind of dynamic seems to be a common feature of the pandemic.
But the things that allow the cars to go fast, the qualities that confer this fitness advantage, seem to have changed over time. Omicron, in particular, appears to be very fit, mainly by escaping the human antibody response. That trait has become increasingly important to the virus, and it makes sense because so many people have had COVID or been vaccinated, or both.
It seems that this increasing immune evasion has been brewing continuously during the pandemic and has now really reached its full expression. This isn’t the first study to show that, but it does show it systematically. And it seems likely that such immune escape will continue to be part of what makes a lineage grow. What we cannot predict, within the context of this study, is which mutations will occur in the future and confer additional immune escape.
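The car analogy earlier in this answer can be made concrete with a small calculation. The sketch below is not PyR0; it assumes a simplified two-lineage picture in which the log-odds of a new lineage's share of cases rise by a constant amount each day. The function names and the numbers (a 1 percent starting share and a daily advantage of 0.1) are hypothetical and chosen only for illustration.

```python
import math


def share_after(initial_share: float, daily_advantage: float, days: float) -> float:
    """Share of cases held by the faster lineage after `days`.

    Assumes a two-lineage model in which the log-odds of the faster
    lineage's share increase linearly by `daily_advantage` per day.
    """
    log_odds = math.log(initial_share / (1.0 - initial_share)) + daily_advantage * days
    return 1.0 / (1.0 + math.exp(-log_odds))


def days_until_dominant(initial_share: float, daily_advantage: float) -> float:
    """Days until the faster lineage reaches half of all cases."""
    if not 0.0 < initial_share < 1.0:
        raise ValueError("initial_share must be strictly between 0 and 1")
    if daily_advantage <= 0.0:
        return math.inf  # a lineage with no advantage never overtakes
    initial_log_odds = math.log(initial_share / (1.0 - initial_share))
    return max(0.0, -initial_log_odds / daily_advantage)


if __name__ == "__main__":
    # Hypothetical numbers: the new lineage starts at 1% of cases.
    days = days_until_dominant(initial_share=0.01, daily_advantage=0.1)
    print(f"Overtakes after about {days:.0f} days")
```

As in the interview, a forecast like this holds only until a still faster lineage appears, which resets the calculation.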
How does your model help predict and track new variants?
What we model is how different combinations of mutations in different lineages affect the growth rate of individual viral variants in the population. [Editor’s note: A lineage is a group of variants with a common ancestor.] Since each new lineage has a constellation of mutations—some of which we’ve seen before in other lineages—we can begin to ask the question, “What mutations are driving this?”
We model this question in many different regions of the world and then essentially merge the information into a single model. The reason we can do this is because people from all over the world are sequencing the virus, and they label each sequence with the date and region of collection. So we know, in different regions, which lineages are increasing in frequency relative to the others. This information is incredibly valuable; we couldn’t have built our model without it.
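The idea of attributing a lineage's growth to the mutations it carries can be illustrated with a toy calculation. This is a minimal sketch, not the PyR0 implementation: it assumes each mutation adds a fixed amount to a lineage's growth rate and that lineage shares follow from exponential growth. Every mutation name, lineage name and number below is hypothetical.

```python
import math

# Hypothetical per-mutation effects on growth rate (log units per day).
MUTATION_EFFECTS = {"mutA": 0.03, "mutB": 0.05, "mutC": -0.01}

# Hypothetical lineages, each defined by the set of mutations it carries.
LINEAGES = {
    "lineage_1": {"mutA"},
    "lineage_2": {"mutA", "mutB"},
    "lineage_3": {"mutA", "mutB", "mutC"},
}


def growth_rate(mutations):
    """Growth rate of a lineage as the sum of its mutations' effects."""
    return sum(MUTATION_EFFECTS.get(m, 0.0) for m in mutations)


def predicted_shares(initial_shares, days):
    """Predicted share of each lineage after `days` of exponential growth."""
    weights = {
        name: initial_shares[name] * math.exp(growth_rate(muts) * days)
        for name, muts in LINEAGES.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    start = {"lineage_1": 0.90, "lineage_2": 0.05, "lineage_3": 0.05}
    for name, share in predicted_shares(start, days=90).items():
        print(f"{name}: {share:.2f}")
```

Because the effects attach to mutations rather than to lineages, a newly observed lineage that carries already-seen mutations gets a growth estimate immediately, before much data on that lineage itself has accumulated.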
It is a real computational challenge to actually implement that model and fit it to the data. Lead study author Fritz Obermeyer had come to the Broad Institute from Uber AI, where researchers had developed a programming language and software framework that uses machine learning to fit probabilistic models to large data sets. It was really great to be able to apply these methods to an amount of data we’ve never had before.
We are trying to improve the model, and we have a new version of it. We actually think that successful lineages are driven by a small number of mutations, and the others are just along for the ride. A related challenge is studying the genetic or statistical interaction between mutations. Perhaps Mutation 1 makes the virus fitter; maybe Mutation 2 makes it fitter, too. But maybe the combination of 1 and 2 together makes it less fit. Those kinds of interactions are really hard to handle because the number of possible combinations grows so fast.
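A quick count shows why interactions between mutations become hard to handle. The sketch below simply counts how many distinct pairs and triples exist among a given number of mutations; the mutation counts used are hypothetical and are not taken from the study.

```python
import math


def interaction_terms(num_mutations: int, order: int) -> int:
    """Number of distinct groups of `order` mutations among `num_mutations`."""
    return math.comb(num_mutations, order)


if __name__ == "__main__":
    # Hypothetical mutation counts, chosen only to show the growth.
    for n in (10, 100, 1000):
        pairs = interaction_terms(n, 2)
        triples = interaction_terms(n, 3)
        print(f"{n} mutations: {pairs} pairs, {triples} triples")
```

A model with one term per mutation needs only as many parameters as there are mutations, whereas a model with a separate term for every pair or triple quickly has far more parameters than the data can pin down.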
How can this model help us plan our response to the pandemic?
One of the things we’re learning is that genome sequencing of emerging viruses is part of the response to an outbreak. We’re seeing a lot of genome sequencing, for example, with the monkeypox outbreak that’s going on right now.
There is so much data that we can’t have a person look through everything. We need systematic, statistical machine-learning programs that help humans detect new variants. As a support tool for disease surveillance, this type of approach can be very helpful. We’re trying to automate this model so we can run it regularly and see if we can spot things to be concerned about.
We found that by modeling mutations instead of just lineages, the model was smarter and learned faster. And the faster you learn about the properties of a lineage, the sooner you know how concerned you should be.
I don’t think this model is a substitute for well-structured programs, such as those of governments and international organizations, for conducting disease surveillance. It is a support tool for such programs to enable them to systematically screen and rank emerging lineages. I think this kind of approach will be possible for other pathogens in the future as the data on flu and other viruses keep piling up.