New DNA Database Allows Far Faster Searches for Pathogen Genomes

For the first time, it’s possible to easily answer a question as simple as: “Have we seen this thing before?”

In 2015, scientists discovered a pig in China that would set off a frantic, worldwide search. The pig carried bacteria resistant to colistin, a drug used to cure infections when almost all other drugs have failed. Colistin is an old antibiotic with sometimes severe side effects in humans. Chinese doctors didn’t even prescribe it for human patients; instead, farmers were relying on literal tons of it, used in low doses, as a growth promoter in pigs.

Bacteria are constantly crossing continents in people, animals, and food, though. In England, where colistin is reserved for patients in rare and dire circumstances, public-health officials worried. Could colistin-resistant bacteria also be lurking in that country?

The answer was hidden somewhere in Public Health England’s archives. The agency routinely collects and sequences bacteria on food and humans, and it just needed to search those sequences for the DNA segment that confers bacterial resistance to colistin. In theory, this shouldn’t have been much harder than a Google search. To a computer, a DNA sequence looks like a very long word, which just happens to be made up of only four letters: A, T, C, and G.

Yet, the search took 256 computers working together for an entire weekend, says Zamin Iqbal, a computational genomicist at the European Bioinformatics Institute who collaborates with Public Health England. The researchers there did find colistin resistance among their 24,000 samples, and eventually, countries all over the world found it, too.

Why did this process take so long? The computers at Public Health England had to open up and search the sequencing files of 24,000 genomes one by one. If Google had to search every page on the internet for the word “pie” every time you search for “pie,” that search would also take forever. Instead, Google is constantly indexing pages. If a blog post is written about pie, Google files that post under the “pie” entry in its index. So when someone comes along looking for pie recipes, it just has to serve up the pages under the “pie” entry. That’s part of the reason a Google search takes less than a second.
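The difference between scanning and indexing can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the three sample names and sequences are made up, the function names are hypothetical, and the index here files genomes under short DNA "words" of length k (often called k-mers), much as a search engine files pages under ordinary words. It is not the code used by Public Health England or by Google.

```python
from collections import defaultdict


def linear_search(genomes, query):
    """Open every genome in turn and look for the query (the slow way)."""
    return sorted(name for name, seq in genomes.items() if query in seq)


def build_kmer_index(genomes, k):
    """File each genome under every k-letter word (k-mer) it contains."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def indexed_search(index, query, k):
    """Return genomes filed under every k-mer of the query."""
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    if not kmers:
        return []  # query shorter than k: nothing to look up
    hits = set(index.get(kmers[0], set()))
    for kmer in kmers[1:]:
        hits &= index.get(kmer, set())
    return sorted(hits)


# Made-up sequences, for illustration only.
genomes = {
    "sample_A": "ATGCGTACGTTAGC",
    "sample_B": "TTGACCGTACGGAT",
    "sample_C": "GGGATGCGTACCCA",
}
index = build_kmer_index(genomes, k=4)

print(linear_search(genomes, "GCGTAC"))
print(indexed_search(index, "GCGTAC", k=4))
```

The index is built once, up front; each later search only looks up a handful of entries instead of rereading every genome. One caveat of this simplified version: a genome that contains all of the query's k-mers, but not next to each other, would still be reported, so hits from an index like this are candidates to confirm rather than guaranteed matches.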

So Iqbal decided to build a Google of sorts for bacterial and viral genomes. He and his colleagues downloaded all available genomes—nearly 500,000 at the time—from a public database called the European Nucleotide Archive. The 170,000-gigabyte data set took six whole weeks to download. Then, the team indexed the data. The resulting tool is called BIGSI, for BItsliced Genomic Signature Index.
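The name hints at how such an index might be laid out. The sketch below is one plausible reading, not the BIGSI code itself: it assumes each genome gets a fixed-size bit signature (a Bloom filter of its k-mers) and that the signatures are stored "sliced" by bit position, so one row holds the same bit for every genome at once. The sizes `M`, `H`, and `K`, the hashing scheme, and all function names are hypothetical choices made for illustration.

```python
import hashlib

M = 1024  # bits per genome signature (hypothetical; real data needs far more)
H = 3     # hash functions per k-mer (hypothetical)
K = 4     # k-mer length (hypothetical)


def kmers(seq, k=K):
    """All k-letter substrings of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bit_positions(kmer):
    """Map a k-mer to H row numbers using seeded SHA-256 hashes."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{kmer}".encode()).digest()[:8], "big") % M
        for seed in range(H)
    ]


def build_bitsliced_index(genomes):
    """rows[r] is an integer whose bit j is set if genome j has signature bit r set."""
    names = list(genomes)
    rows = [0] * M
    for j, name in enumerate(names):
        for kmer in kmers(genomes[name]):
            for r in bit_positions(kmer):
                rows[r] |= 1 << j
    return names, rows


def query(names, rows, seq):
    """AND together the rows for every k-mer of the query; surviving bits are candidate genomes."""
    hits = (1 << len(names)) - 1  # start with every genome as a candidate
    for kmer in kmers(seq):
        for r in bit_positions(kmer):
            hits &= rows[r]
    return [name for j, name in enumerate(names) if (hits >> j) & 1]


# Made-up sequences, for illustration only.
genomes = {
    "sample_A": "ATGCGTACGTTAGC",
    "sample_B": "TTGACCGTACGGAT",
    "sample_C": "GGGATGCGTACCCA",
}
names, rows = build_bitsliced_index(genomes)
print(query(names, rows, "GCGTAC"))
```

The appeal of this layout is that a query touches only a few rows, and each row answers the question for every genome at once with a single bitwise AND. Signatures of this kind never miss a genome that really contains the query's k-mers, but they can occasionally report one that does not, so results are best treated as candidates to verify.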

Searching for colistin resistance through nearly 500,000 sequences now takes just a few seconds.

Suppose a patient has an unusual brain infection, says Jennifer Gardy, a genomic epidemiologist who until recently was at the University of British Columbia and who was not involved with the project. Suppose it’s a pathogen that the doctor doesn’t recognize. Before BIGSI, the pathogen’s particular sequence might have been hiding in one of those 500,000 genomes. But a mountain of data is only as good as your ability to search it. “We can now go back and look through all of the DNA, through all of the other experiments that had done sequencing. Loads and loads of DNA,” Gardy says. For the first time, it’s possible to easily answer a question as simple as: “Have we seen this thing before?”

Since Iqbal and his colleagues started sharing their project (making a demo version of BIGSI available online, posting a non-peer-reviewed paper on the website bioRxiv, giving talks), they’ve been hearing from researchers who’ve started to use it. After Andrew Page, a bioinformatics researcher now at the Quadram Institute, learned about the tool, he walked back to his office and fired it up. Page was interested in a particular plasmid, a round loop of DNA, that helps make typhoid-fever bacteria drug-resistant. These plasmids seemed to have popped up out of the blue in Pakistan.

“Within two seconds, I got a list of 20 other samples where they were seen,” Page says. The plasmid wasn’t just in other typhoid bacteria. It was in soil bacteria, animal bacteria, E. coli—painting a much more complex picture of how resistance plasmids emerge and get swapped between different bacterial species.

Iqbal’s paper is just getting published today in Nature Biotechnology, after making its way through the sometimes slow process of peer review. But published papers have already cited the bioRxiv preprint, and another scientist wrote a program to more easily search mutations of a gene in BIGSI. Tara Smith, an epidemiologist at Kent State University, says that BIGSI is a fantastic idea, although the tool is only as good as the data that go in. “The genomes we choose to sequence are very biased,” she says—often toward serious clinical infections, from patients in research-intensive hospitals, in big urban centers.

The team is updating BIGSI with new data that have been made public since Iqbal made the first version, and the total number of sequences available at one quick click will be up to 1.2 million. As sequencing becomes more common, the number of publicly available bacterial and viral genomes has doubled. At the rate this work is going, within a few years multiple millions of searchable pathogen genomes will be available—a library of DNA and disease, spread the world over.