Viral dark matter

UD researchers' website tracks genetic sequence data from unknown viruses

University of Delaware researchers Shawn Polson and Eric Wommack have received a three-year, $867,661 National Science Foundation (NSF) Advances in Biological Informatics (ABI) grant to continue and expand the work being conducted on their Viral Informatics Resource for Metagenome Exploration (VIROME) website.

The web-application site is designed to help researchers explore sequence data collected from environmental viruses or, as they describe it, to “shine a little bit of light into the unknown.”

The project began in 2008 with funding from the Gordon and Betty Moore Foundation and came about when Delaware Biotechnology Institute-based researchers Polson and Wommack discovered the lack of computational tools for analyzing DNA sequences from environmental viruses.

“One big problem we ran into when looking at microbes in the environment is if you want to grow them, only about 1 percent of them are going to be culturable. For the vast majority, the only way we really have to look at them is to try to isolate a sample from the environment and sequence the DNA that’s there and attempt to identify the little pieces,” said Polson, a research assistant professor in UD’s departments of Computer and Information Sciences and Biological Sciences and coordinator of the Center for Bioinformatics and Computational Biology’s Core.

Wommack, a professor who holds joint appointments in the departments of Plant and Soil Sciences and Biological Sciences and the College of Earth, Ocean, and Environment, added that the field of microbiome research, in which researchers randomly sample microbes without any sort of laboratory cultivation, is relatively new.

“Ninety-nine percent of the microbes can’t be grown in the lab so when you look at only the microbes you can grow, you’re vastly restricted in your understanding of what is really there and what they are really doing,” said Wommack. “It wasn’t until sequencing became cheap enough that we could even contemplate doing things this way. Now that DNA sequencing is affordable, we can start to apply it to these questions, and UD is pretty well positioned with regards to sequencing and computation.”

Metagenomics

Polson explained that he and Wommack use metagenomics, a technique established in the mid-2000s to study bacteria in the environment, to investigate viruses that infect microbes such as bacteria and microalgae. These viruses account for most of the viruses in natural systems.

In essence, metagenomics requires lots of sequencing of genomic DNA taken directly from microbes in environmental samples such as water, air or soil. Metagenomics is also being applied to samples from humans, plants and animals to uncover the diversity of microbes associated with larger organisms.

Once the metagenomic sequence data is obtained, a complex analysis is necessary to uncover what the genes do and which organism or virus they came from. Typically, researchers compare the new sequences to previously observed sequences stored in a database. The problem with studying environmental viruses is that so few known viral genomes have been sequenced.
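As a rough illustration of that comparison step, the sketch below counts how many query sequences resemble anything in a reference database. It is a toy under stated assumptions, not the VIROME pipeline: similarity is approximated by shared short substrings (k-mers) rather than by the alignment tools real analyses rely on, and the function names, the `k` value and the `threshold` are all hypothetical.

```python
def shared_kmer_fraction(query, reference, k=4):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    if not query_kmers:
        return 0.0  # query shorter than k: nothing to compare
    reference_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return len(query_kmers & reference_kmers) / len(query_kmers)


def fraction_with_hits(queries, database, k=4, threshold=0.8):
    """Share of query sequences that resemble at least one database entry.

    `queries` and `database` map sequence names to DNA strings.
    `threshold` is an illustrative cutoff, not a published value.
    """
    if not queries:
        return 0.0
    hits = 0
    for seq in queries.values():
        if any(shared_kmer_fraction(seq, ref, k) >= threshold
               for ref in database.values()):
            hits += 1
    return hits / len(queries)


# Toy data: one read is a fragment of the known gene, the other is unrelated.
database = {"known_gene": "ATGGCGTACGTTAGC"}
queries = {"read1": "GCGTACGTTA", "read2": "TTTTTTTTTT"}
hit_rate = fraction_with_hits(queries, database)
```

A low hit rate from a calculation like this is what signals that most of a sample has no counterpart in the database.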

“With bacteria, that strategy worked pretty well. You might get 70 percent of your sequences matching a gene that had been seen before. With viruses, at first we were seeing 5 percent of our sequences hitting something that had been seen before,” said Polson. “We really had to come up with a way to analyze that data and, at first, it was just for our own use but then we built a tool to grab every little piece of information we could. From there, we realized that other people might be interested, so that’s when we started building the website and talking to the Moore Foundation.”

Now people submit their sequence data to the VIROME website and it is processed through an analysis pipeline. The process involves a great deal of computational work, which Polson and Wommack have been providing through a computational grid supported by the National Science Foundation.

The Moore Foundation funding got the site started. With the new NSF ABI grant, Polson and Wommack are seeking ways to sustain the website and grow the program so that it keeps pace with changing technologies.

Evolving technologies

When VIROME first started, sequencing was expensive, and researchers were generating long sequences but relatively few of them. Now things have swung in the other direction, with datasets of tens to hundreds of millions of short sequences. Because the cost of sequencing has gone down, far more data can be produced, but the individual sequences are much shorter, so the researchers need to evolve their pipeline to assemble the little pieces into longer stretches and then provide the analysis.
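The idea of assembling little pieces into longer stretches can be sketched with two reads that share an overlapping end. This is a minimal illustration only: the function names and the `min_overlap` value are hypothetical, it requires exact matches, and production assemblers handle sequencing errors, repeats and millions of reads in ways this toy does not.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least min_overlap bases


def merge_reads(left, right, min_overlap=3):
    """Join two reads into one longer stretch, or return None if they do not overlap."""
    size = longest_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Two short reads sharing the bases "GCGT" merge into one longer sequence.
merged = merge_reads("ATGGCGT", "GCGTACG")
```

Repeating a merge like this across many reads is, in spirit, how short fragments are stitched into stretches long enough to analyze.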

“One of the challenges in big data science is to make the data easily available,” said Wommack. “It’s a lot to ask an individual principal investigator to maintain a site for sharing this data. Part of our model going forward is that the users will support the computation but we’ll maintain the shared data resource so that hopefully the data and, more importantly, the analysis has a long term life beyond their immediate scientific needs.”

While the two researchers and Daniel Nasko, a doctoral student in the bioinformatics and systems biology program, are currently processing all of the data, Anup Mahurkar, their collaborator at the University of Maryland’s Institute for Genome Sciences, is working to package that work in a way that allows users to handle some of the computational requirements themselves.

“It’s called a virtual machine so essentially you create software that makes a computer think that it’s something else,” said Wommack. “It’s sort of like a computer running an image of another computer.”

Viral dark matter

The other aspect to the research is the creation of a database of viral dark matter to help identify unknown proteins in datasets.

“At first we could only identify about 5 percent of the genes that we were finding in these datasets and we wanted to create a database specifically for this viral dark matter — these genes are very abundant and we don’t know what they do. The goal of the database is to give more meaning to the dark matter, to see if we can figure out what these genes might be doing,” said Polson.

Wommack said that when the researchers collect a water or soil sample, sequence the viruses in it and compare the results to known virus sequences, most of the time they do not find a match.

“There’s a vast amount of genetic novelty among viruses and so we have to start somewhere and the ‘starting somewhere’ is to leverage the VIROME pipeline and our big data resource to begin to at least classify where and when these unknown viral genes have been observed,” said Wommack. “It’s somewhat reassuring when you do see the gene again within another, different viral sample, at least you know it’s a gene that belongs to viruses. You don’t know what it does, but it occurred more than once.”

The goal is to track these occurrences to see if they are happening in certain types of environments more frequently than others, which can give the researchers clues as to where to start looking for the unknown viruses.
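Tracking those occurrences amounts to a tally of where each unknown gene has been seen. The sketch below shows one way that bookkeeping could look. The gene identifiers, environment labels and function names are hypothetical examples, not records or code from the VIROME database.

```python
from collections import Counter, defaultdict


def tally_by_environment(observations):
    """Count how often each gene was observed in each environment.

    `observations` is an iterable of (gene_id, environment) pairs,
    one pair per sample in which the gene was seen.
    """
    counts = defaultdict(Counter)
    for gene_id, environment in observations:
        counts[gene_id][environment] += 1
    return counts


def recurring_genes(counts, min_samples=2):
    """Genes seen in at least `min_samples` samples, sorted by name."""
    return sorted(gene for gene, envs in counts.items()
                  if sum(envs.values()) >= min_samples)


# Hypothetical observations of unidentified genes across samples.
observations = [
    ("unknown_001", "estuary"),
    ("unknown_001", "estuary"),
    ("unknown_002", "soil"),
    ("unknown_001", "deep ocean"),
]
counts = tally_by_environment(observations)
repeats = recurring_genes(counts)
```

A gene that keeps appearing in one kind of environment and never in others becomes a natural candidate for closer study.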

“You start with these guilt by association sorts of findings,” said Wommack. “I might not know what it does, but I know we always find it in ecosystems like the Chesapeake Bay and we don’t see it in the deep ocean, we don’t see it in soils. The hope is by building this database that it will start to point the way to genes that we have no idea of their function but they happen to be really important to what viruses do. We hope it gives us a process of being able to shine a little bit of light into the unknown.”

Article by Adam Thomas

Photo by Lindsay Yeager