Ameca, a robot made by the UK's Engineered Arts, interacts with visitors on July 6, 2023, in Geneva, Switzerland. Johannes Simon / Getty Images

The Pentagon just launched a generative AI task force

A conversation with the leader of Task Force Lima reveals its objectives—and the main questions it's trying to answer.

Generative AI programs like ChatGPT and Google’s Bard have captivated public attention and lawmaker scrutiny, though so far the Pentagon has been reluctant to adopt them. But Thursday, the Defense Department announced a task force to understand how it might use such tools safely, reveal situations where it’s unsafe for the department to use them, and explore how countries like China might use generative AI to harm the United States. 

The task force, dubbed Task Force Lima, “will assess, synchronize, and employ generative artificial intelligence (AI) across the Department,” according to a draft press statement viewed by Defense One.

Generative AI refers to “a category of AI algorithms that generate new outputs based on the data they have been trained on,” according to the World Economic Forum. That’s very different from far simpler machine-learning algorithms that just take structured data like numerical values and output the next likely statistical outcome. Public-facing generative AI tools include so-called large language models like ChatGPT that can write new text that is largely indistinguishable from human writing. These tools have already been used to write essays, business plans, and even scientific papers. But because large language models are trained on data corpora as large as the searchable Web, they sometimes lie, or, in the parlance of AI researchers, “hallucinate.” For this reason, Pentagon officials have expressed reluctance to embrace generative AI.

The new effort will be led by Craig Martell, the Defense Department’s chief digital officer. Martell said much is still up in the air, but a main objective will be to find a “set of use cases, internally to the department, where we believe that generative AI can help us do our job and where the dangers, the difficulty of generative AI can be mitigated. For example…if I need to do first draft generation of some document, that's fine, because I'm going to take full responsibility to edit that document, make sure it's actually correct before I pass it up the line, because my career is on the line.” 

However, there are a number of use cases where the risks of hallucination will be too high to employ a large language model, such as “anything kinetic,” or having to deal with lethal weapons, he said. 

“Let's just get clear on what the acceptability conditions are,” Martell said. For instance, if someone in the department very quickly needs to summarize a lot of text into a legal document, an LLM might be a good tool for that purpose, but only if the user can satisfy certain concerns. “How do we then mitigate against the potential hallucinations that are in that text? If there are procedures by which that text can easily be checked, then we're probably going to be on board with that.”

Martell came to the Defense Department from Lyft, and he’s publicly warned of the dangers large language models can pose. But, he says, such models “aren’t a Pandora's Box, either.” The Defense Department has to understand where they can be used safely—and where adversaries might deploy them, he said. “There's going to be a set of use cases that are going to be new attack vectors, not just to the DOD, but to corporations as well. And we're going to have to… to figure out diligently how to solve those.”

The task force will also help the Pentagon better understand what it needs to buy to achieve new AI ambitions. That could mean more cloud services, data or synthetic data, models, or none of those. The effort is too young to know exactly what it will mean for industry, Martell said. 

“If we decide that we're gonna build our own foundational model, we're gonna be scaling up compute, there's no question about it. In order to build your own foundation model, you need lots of compute power. If we're going to buy it from someone else, or if we're going to take an open source one and fine tune it,” they may need different contracts, he said. 

Martell said that a broader question is whether the Defense Department even has enough use cases where generative AI could be helpful, given a better understanding of the risks. While the department has a lot of carefully curated internal data, terabytes of jet engine diagnostics or years of drone surveillance footage over the Middle East do not a large language model make. Because such models have to pull from large text corpora, a certain amount of unreliability is inherent in the way they work. The question the academic field has yet to answer is whether that inherent unreliability can be accurately quantified to reveal the risks of use.

“It’s an open question whether we have enough data that covers broad enough coverage that the value of the model could maintain without that pre-trained data. On the other hand…my hypothesis is, the more pre-trained data, the more likely that hallucinations. So that's a trade off we're really going to have to explore. I don't, I actually don't think the scientific community knows the answer to that yet,” he said. 

One of the potential benefits of the task force is to illuminate the best path forward for industry, at least if the goal is to produce products or services that meet Defense Department standards. That will likely mean giving companies the courage and incentive not only to reduce the inherent risks in their models, but also to make them easier for more people to use. For instance, Martell said, one reason why programs like ChatGPT aren’t suitable for the Defense Department now is the amount of question engineering required to produce suitable results. Lengthy prompt trees are fine for hobbyists, but an operator who has to do several other complex tasks needs an interface that is intuitive and functions better from the beginning.

“There's lots of research that has to be done, not just for the DOD but in the community as a whole, about those two pieces: what [does] automated, prompt engineering and context mean, and what…automatic hallucination mitigation look like? These things are unknown,” he said.