How the cloud eased the VA's pains during the pandemic

During the height of the pandemic, the Department of Veterans Affairs moved its increasingly popular video telehealth service to the cloud. It is one of about 100 applications VA plans to migrate, about half of them by the end of 2024.

The COVID-19 pandemic put the Department of Veterans Affairs’ modernization efforts on fast forward, forcing it to move major systems to the cloud in a fraction of the time it would typically take and then to monitor those systems’ performance as demand surged.

For instance, VA’s Enterprise Cloud Solutions Office (ECSO) had to quickly accommodate a 400% increase in remote users and a surge of telehealth visits that went from 25,000 per month to more than 40,000 per day. By April, monthly telehealth visits increased by nearly 2,000% compared to pre-COVID levels.

“Those are just astronomical system challenges,” said Dave Link, CEO of ScienceLogic, which helped VA navigate them. “When you think about all the resources you have to spin up -- web servers, the video equipment that runs in the cloud that had to spin up automatically during very compressed periods of time and then the load that that creates and the challenges for IT for managing that kind of rapid load onboarding -- it’s really unprecedented.”

Those big challenges required a bold move. Whereas most agencies start their cloud journey with a small project, VA began with one of its largest, most mission-critical systems: VA Video Connect. The department had to scale up the on-premises system while setting up a matching environment in the cloud.

Next, VA moved 20% of a catalog of about 100 applications to the cloud, and about half are expected to be migrated by the end of 2024. The Veterans Benefits Management System (VBMS) was among the first to move.

“VBMS is a web-based application for paperless claims processing. It’s the primary application of our benefits department,” ECSO Director Dave Catanoso wrote in a July blog post. “It is almost ten years old and has a tremendous amount of custom code. It supports roughly 4,000 simultaneous users, and it manages millions and millions of documents, so you can imagine moving something like that is not easy. We had to migrate 800 million documents from the external hosting provider up to the Amazon cloud. Today, it’s tracking approximately 2.2 billion documents, and it has to be available 24×7 for users across the U.S., and from Puerto Rico to the Philippines.”

To monitor the performance of all these systems regardless of where they reside, VA contracted with ScienceLogic for its SL1 platform. The product looks at the services that cloud hyperscalers, such as Amazon Web Services, Microsoft Azure or Google Cloud Platform, are running and provides a real-time view down to the network and application levels, Link said.

“The service … is generally made up of dozens or hundreds of different technologies underneath it, and so we aggregate and contextualize all that data into a very, very smart analytics engine,” he said. Applying machine learning to the performance data provides “a spotlight to the operations teams before there is a service disruption,” he added.

Just as a doctor can’t diagnose a problem from a temperature reading alone, the platform doesn’t rely on a single metric: it ingests billions of data points a day to pinpoint dangers, Link said. It also matches the analysis to the source, applying one type of algorithm to performance data from some technologies and different algorithms to data from others.
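
The article doesn’t say which algorithms SL1 uses, so the following is only a minimal sketch of the general idea of choosing an analysis per data type. It assumes two hypothetical detectors: a z-score test for noisy, latency-style metrics and a fixed ceiling for bounded metrics such as disk usage. The metric names, the registry and the cutoffs are all illustrative, not ScienceLogic’s.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List

Detector = Callable[[List[float]], bool]


def zscore_detector(samples: List[float], k: float = 3.0) -> bool:
    """Flag the newest sample if it sits more than k standard deviations
    above the mean of the earlier samples (suits noisy, latency-style data)."""
    history, latest = samples[:-1], samples[-1]
    if len(history) < 2:
        return False  # too little history to estimate the spread
    sigma = stdev(history)
    if sigma == 0:
        return latest > history[0]  # flat history: any rise is unusual
    return (latest - mean(history)) / sigma > k


def threshold_detector(samples: List[float], limit: float = 90.0) -> bool:
    """Flag the newest sample if it crosses a fixed ceiling
    (suits bounded data such as a disk-usage percentage)."""
    return samples[-1] > limit


# Hypothetical registry: each metric type is paired with the algorithm that fits it.
DETECTORS: Dict[str, Detector] = {
    "web_latency_ms": zscore_detector,
    "disk_used_pct": threshold_detector,
}


def is_suspicious(metric_type: str, samples: List[float]) -> bool:
    """Dispatch the samples to the detector registered for their metric type."""
    try:
        detector = DETECTORS[metric_type]
    except KeyError:
        raise KeyError(f"no detector registered for {metric_type!r}") from None
    return detector(samples)


print(is_suspicious("web_latency_ms", [100, 102, 98, 101, 99, 250]))  # True
print(is_suspicious("disk_used_pct", [70, 72, 75]))  # False
```

Keeping the detectors in a registry means a new kind of data source only needs a new entry, not changes to the dispatch logic.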

When something looks suspicious, the platform can automatically pull more data from the end device in question. For example, rather than getting data every minute, it can collect it every 15 seconds for the next five minutes.
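
The faster-polling behavior described above can be illustrated with a small scheduler. This is a hypothetical sketch, not SL1’s implementation: the `AdaptivePoller` class and `simulate` helper are invented for illustration. It uses the article’s numbers (every minute normally, every 15 seconds for five minutes after something looks suspicious) and simulated time in seconds, so it runs instantly.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class AdaptivePoller:
    normal_interval: int = 60   # seconds between polls in steady state
    burst_interval: int = 15    # seconds between polls while investigating
    burst_duration: int = 300   # how long the faster polling lasts (5 minutes)
    burst_until: Optional[int] = None  # end of the current burst, if any

    def flag_suspicious(self, now: int) -> None:
        """Start (or restart) a window of high-frequency collection."""
        self.burst_until = now + self.burst_duration

    def next_poll(self, now: int) -> int:
        """Return the time of the next collection after `now`."""
        if self.burst_until is not None and now < self.burst_until:
            return now + self.burst_interval
        self.burst_until = None  # burst over: fall back to the normal rate
        return now + self.normal_interval


def simulate(poller: AdaptivePoller, end: int, suspicious_at: Set[int]) -> List[int]:
    """Return the poll times in [0, end), flagging the device at the given times."""
    t, times = 0, []
    while t < end:
        times.append(t)
        if t in suspicious_at:
            poller.flag_suspicious(t)
        t = poller.next_poll(t)
    return times


times = simulate(AdaptivePoller(), end=600, suspicious_at={60})
print(times[:4], "...", times[-4:])  # [0, 60, 75, 90] ... [360, 420, 480, 540]
```

Flagging the device again during a burst simply restarts the five-minute window, which is one reasonable choice; a real system might instead cap how long elevated polling can last.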

SL1 can send alerts to a cellphone, a pager or a communications platform such as Microsoft Teams, and it can pass them to IT service management platforms such as ServiceNow through integrations. When it detects an issue, the platform creates an incident that moves through a series of workflows to the team responsible for that area or application.
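
Routing an incident through workflows can be pictured as a chain of steps that each enrich the incident. The sketch below is hypothetical throughout: the ownership table, team names, severity levels and channel names are invented for illustration, and it does not call any real ServiceNow or Teams API. It only decides who owns an incident and where notifications should go.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Incident:
    service: str    # application or area where the issue was detected
    severity: str   # "critical", "major" or "minor"
    summary: str
    team: str = ""                                     # filled in by the workflow
    channels: List[str] = field(default_factory=list)  # filled in by the workflow
    history: List[str] = field(default_factory=list)   # audit trail of steps run


# Hypothetical ownership and notification tables.
OWNERS: Dict[str, str] = {
    "VA Video Connect": "telehealth-ops",
    "VBMS": "benefits-apps",
}
DEFAULT_TEAM = "cloud-noc"
CHANNELS: Dict[str, List[str]] = {
    "critical": ["pager", "phone", "teams", "itsm_ticket"],
    "major": ["teams", "itsm_ticket"],
    "minor": ["itsm_ticket"],
}

Step = Callable[[Incident], Incident]


def assign_owner(inc: Incident) -> Incident:
    """Route the incident to the team responsible for that service."""
    inc.team = OWNERS.get(inc.service, DEFAULT_TEAM)
    inc.history.append(f"assigned to {inc.team}")
    return inc


def choose_channels(inc: Incident) -> Incident:
    """Pick notification channels by severity; unknown severities get a ticket."""
    inc.channels = list(CHANNELS.get(inc.severity, ["itsm_ticket"]))
    inc.history.append(f"notify via {', '.join(inc.channels)}")
    return inc


WORKFLOW: Tuple[Step, ...] = (assign_owner, choose_channels)


def run_workflow(inc: Incident, steps: Sequence[Step] = WORKFLOW) -> Incident:
    """Pass the incident through each workflow step in order."""
    for step in steps:
        inc = step(inc)
    return inc


inc = run_workflow(Incident("VBMS", "critical", "claims pages timing out"))
print(inc.team, inc.channels)  # benefits-apps ['pager', 'phone', 'teams', 'itsm_ticket']
```

Because each step takes and returns an incident, further stages such as deduplication or enrichment could be added to the sequence without touching the existing steps.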

ScienceLogic also offers a real-time dashboard that gives users an at-a-glance view and can be set to send snapshots at regular intervals. The views depend on the user’s job: someone troubleshooting gets a detailed engineering dashboard, while executives see simple green, yellow and red indicators that they can drill into for details.
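
One way to derive both views from the same data is sketched below. It assumes each component reports a 0-100 health score and that a service’s executive indicator is the worst color among its components. The score, the cutoffs (90 and 70) and the component names are hypothetical; the article doesn’t describe how ScienceLogic computes its indicators.

```python
from typing import Dict

# Rank statuses so max() can pick the most severe one.
SEVERITY_RANK = {"green": 0, "yellow": 1, "red": 2}


def status_for(health: float) -> str:
    """Map a 0-100 health score to a traffic-light color (hypothetical cutoffs)."""
    if health >= 90:
        return "green"
    if health >= 70:
        return "yellow"
    return "red"


def engineering_view(components: Dict[str, float]) -> Dict[str, str]:
    """Per-component detail for troubleshooting: what an executive drills into."""
    return {name: status_for(h) for name, h in components.items()}


def executive_view(components: Dict[str, float]) -> str:
    """One color per service: the worst status among its components.
    A service with no reported components is shown as green here (an assumption)."""
    return max(engineering_view(components).values(),
               key=SEVERITY_RANK.__getitem__, default="green")


video_service = {"web-tier": 97.0, "video-bridge": 82.5, "database": 95.0}
print(executive_view(video_service))    # yellow
print(engineering_view(video_service))  # {'web-tier': 'green', 'video-bridge': 'yellow', 'database': 'green'}
```

The worst-status rollup keeps the executive view conservative: a single degraded component turns the whole service yellow or red, and drilling into the engineering view shows which one.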

All of that work happened under a contract the company won in August 2020. In July, VA awarded ScienceLogic and Swish Data a $43 million contract to provide artificial intelligence operations tools to help with monitoring and managing the Veterans Affairs Enterprise Cloud, which encompasses internal hybrid clouds and the external public cloud.

“You want to manage the public and private systems in one environment,” Link said.

This article first appeared on GCN, a Defense Systems partner site.