The screen of an idle, virus-affected cash machine at a state-run OshchadBank reads "Sorry for inconvenience/Under repair" in Kiev, Ukraine, Wednesday, June 28, 2017. AP/Efrem Lukatsky

Which Bugs Will Hackers Exploit First? Machine Learning Promises a Better Guess

Most vulnerabilities are known; defenders need a better way to know which ones pose an imminent threat.

The vast majority of the bugs that hackers exploit aren’t fancy zero-days that no one has ever seen or reported. Most are known vulnerabilities that have gotten out into the wild and spread via chat rooms and hacker forums on the dark web. Predicting which bugs will cause the most damage, and therefore which ones to patch first, is still mostly guesswork. But researchers from Arizona State University have developed a machine-learning model to predict which vulnerabilities are most likely to cause the next headline-grabbing incident.

Today’s most common methods for anticipating the likelihood that a previously disclosed software vulnerability will cause major damage are imperfect at best. Take two bugs: one exploited by the WannaCry ransomware, which shut down hospitals and other institutions across the United States and Europe; and Heartbleed, a bug believed to have been discovered and exploited by the NSA. The National Vulnerability Database’s Common Vulnerability Scoring System (CVSS), which blends the likelihood of exploit with the potential damage done, gave the latter a severity score of 5 out of 10 and the former 8.1 out of 10. But other bugs that scored even higher had far less impact.

In a new paper, researchers at Arizona State University say their method is 266 percent better than the CVSS methodology at predicting whether a bug will be exploited. It combines web-crawling algorithms, which search for discussions of new exploits on the dark web (a portion of the Internet that is accessible only through special software such as the Tor browser, which protects users’ anonymity), with random forest machine learning. It extracts clues from how the hackers in the forum are discussing the bug, turning text into numerical values and storing them in a database.
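To make the idea of turning forum text into numerical values concrete, here is a minimal sketch. It is not the researchers' code: the function names, the keyword patterns, the three language flags, and the table layout are all assumptions made for illustration, and the sample post uses a placeholder CVE identifier.

```python
import re
import sqlite3

# All names and feature choices below are hypothetical illustrations;
# the article does not list the features the ASU model actually uses.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
POC_PATTERN = re.compile(r"\b(?:proof[- ]of[- ]concept|poc)\b", re.IGNORECASE)
LANGUAGES = ("zh", "en", "ru")  # Chinese, English, Russian


def extract_features(post_text, language, on_dark_web):
    """Turn one forum post into numeric features, one row per CVE mentioned."""
    cve_ids = sorted({match.upper() for match in CVE_PATTERN.findall(post_text)})
    has_poc = int(bool(POC_PATTERN.search(post_text)))
    language_flags = [int(language == code) for code in LANGUAGES]
    return [
        (cve_id, has_poc, int(on_dark_web), *language_flags)
        for cve_id in cve_ids
    ]


def store_features(conn, rows):
    """Create the feature table if needed and insert the given rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "cve_id TEXT, has_poc INTEGER, on_dark_web INTEGER, "
        "lang_zh INTEGER, lang_en INTEGER, lang_ru INTEGER)"
    )
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up post text with a placeholder CVE identifier.
    rows = extract_features("Working PoC for CVE-0000-0001", "ru", True)
    store_features(conn, rows)
    print(conn.execute("SELECT * FROM features").fetchall())
```

Rows like these, paired with a label saying whether each bug was later exploited, are the kind of input a random forest classifier could be trained on. The training step is left out here because the article gives no detail on it.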

Once a bug hits the darker corners of the Internet, what determines the likelihood of a hacker exploiting it? The researchers found that only 2.4 percent of the bugs in the NVD are ever exploited. Among those that are, a few features stick out. If the bug had a proof-of-concept attached to it — meaning that someone actually tried it and it worked — the likelihood rose by 9 percent. If it was discussed on the dark web, the likelihood rose by 14 percent.

The biggest factor was language. If the hackers discussed the bug in Chinese, the likelihood of a hacker using it rose to 9 percent. If it was English, the likelihood rose to 13 percent. If it was Russian, 40 percent.
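For a rough sense of scale, the language figures can be set against the 2.4 percent baseline cited earlier. The sketch below assumes the two sets of percentages are directly comparable exploitation rates, which the article does not confirm; the names are illustrative.

```python
# Percent of NVD bugs ever exploited, as reported in the article.
BASELINE_RATE = 2.4

# Reported likelihood of exploitation by discussion language, in percent.
RATE_BY_LANGUAGE = {"Chinese": 9.0, "English": 13.0, "Russian": 40.0}


def lift(rate_percent, baseline_percent=BASELINE_RATE):
    """Return how many times larger a rate is than the baseline rate."""
    return rate_percent / baseline_percent


for language, rate in RATE_BY_LANGUAGE.items():
    print(f"{language}: {rate:.0f} percent is about {lift(rate):.1f}x the baseline")
```

Under that assumption, a bug discussed in Russian would be exploited at well over ten times the overall rate.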

The method is also faster than current CVSS reporting. “In over 97% of the cases, our model makes correct predictions the day a vulnerability is disclosed. This is, on average, 13 days before any exploit is detected,” Mohammed Almukaynizi, one of the paper’s authors, told Defense One in an email.