The Defense Advanced Research Projects Agency’s foray into fighting hackers and other malcontents on the Web can be summed up in a single probing question.
“How can I make the unseen seen?” Dan Kaufman, the director of DARPA’s information innovation office, said last week in a feature on “60 Minutes.”
The answer, Kaufman said, is Memex. Developed by DARPA, this search engine on steroids dives deep into the realm of the “Dark Web” and spits out a data-driven map detailing all of the patterns it’s unearthed.
After only one year in use, Memex has already played an important role in about 20 different investigations, according to officials.
Inspiration for the technology, and for its name, came in part from a 1945 Atlantic article written by Vannevar Bush, director of the Office of Scientific Research and Development, which was stood up in 1941 to coordinate military science research during World War II.
Bush described a memex as “a device in which an individual stores all his books, records and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.” In other words, a lot like the Internet we know today.
But of course, the real importance of Memex is not how it came to be, but the advances it has made with big data. And one of the ways it is making those advances is through the use of social science.
Jacob Shapiro and his team at Giant Oak, a data firm that advertises itself as “seeing the people behind the data,” are responsible for the social science aspect of Memex.
The team has held the role since late August. Its main job, according to Shapiro, is to apply social science to the reams of data collected by Memex to make sure none of it is misinterpreted, a common pitfall.
Without a correct understanding of where the data came from, the time and energy spent on accumulating information will likely prove unhelpful, Shapiro explained.
“One thing that happens a lot with big data is [that] it’s very easy to lose track of what the social process is that generated the data in the first place,” Shapiro told Nextgov. “The data doesn’t always mean what you think [it means], because there can be complicated, unobserved processes which are generating certain patterns.”
Google’s calculation, or miscalculation, of the 2012 flu season is an example of that type of misstep, according to Shapiro. Its algorithm overestimated flu prevalence by more than 100 percent.
The mistake likely happened because the search team changed its flu algorithm without communicating the change to the flu trends team, Shapiro explained. While the team members were getting data, it proved unhelpful because they had an incorrect understanding of where the numbers came from.
Although DARPA envisions Memex as a program to generate search results for a variety of missions, it currently focuses on using its tools to fight human trafficking and its hotbed of Web ads and exchanges.
Shapiro said there is little knowledge about how human trafficking markets work, their size and even their geography. “Part of our charter is to build some more of that basic knowledge,” he said.
Even these initial steps have proven surprisingly difficult.
“We’ve spent relatively more time than we thought we would trying to find the intersection between, ‘We know the bad things happened,’ and ‘We actually have data on what the entities involved in those bad things were doing,’” Shapiro said.