Evolving Walls of Ones and Zeroes: Deep Learning and its Uses in Cybersecurity

Only eleven years after the electromechanical computer was created, the world's first design for a computer virus was born. John von Neumann, a mathematician and engineer, gave a series of lectures at the University of Illinois on the theory of self-replicating computer programs. However, it was not until the early 1970s that the first functional computer virus was created. This virus, named “Creeper,” was the first computer program to spread and self-replicate, and it was deleted by “Reaper,” the world's first antivirus software[1]. This cycle, in which a virus is created and an antivirus program is written to combat it, has continued for over fifty years.

However, with recent developments in machine learning, the classification and removal of malware may end up being entirely automated. A majority of new malware is built upon existing malware, and classifying a sample's type is often the first step towards eradicating it. Deep learning programs have shown exceptional performance in identification tasks, and, at the 9th New Technologies, Mobility and Security conference, an artificial neural network used to analyze imagery was shown to have “better than...state of the art performance”[2] when it came to identifying malware.

How do today's antivirus programs identify malware, and what are their flaws?

Antivirus programs usually detect malware using a few methods, such as sandboxing, heuristic detection, and real-time detection. Sandboxing runs a program in a virtual environment and records what the program does. If the program is deemed non-malicious, the antivirus software then runs it on the actual computer.[3] While this technique is effective, it is slow and resource-heavy, and therefore is not used in many user-side antivirus programs. Heuristic detection, or “genetic detection,” is the process of identifying viruses by checking for similarities with already existing viruses. This method is effective but relies entirely on the limited databases used by the antivirus software. Real-time detection is the process of scanning a file as it is being downloaded or opened, and it is the method that most anti-malware programs use. Its biggest weakness is that a virus that is not already known will not be flagged.
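The dependence on a database of known threats can be illustrated with a minimal sketch. It assumes the simplest possible form of matching, an exact lookup of a file's SHA-256 digest, which is cruder than the similarity checks heuristic detection performs. The names `SIGNATURE_DB`, `KNOWN_BAD_PAYLOAD`, and `is_flagged` are hypothetical and do not come from any real antivirus product.

```python
import hashlib

# Hypothetical stand-in for a known-malicious file (an assumption for this sketch).
KNOWN_BAD_PAYLOAD = b"pretend this is a known virus"

# Hypothetical signature database: SHA-256 digests of known-malicious files.
SIGNATURE_DB = frozenset({hashlib.sha256(KNOWN_BAD_PAYLOAD).hexdigest()})


def is_flagged(file_bytes, signatures=SIGNATURE_DB):
    """Return True if the file's digest matches a known signature."""
    return hashlib.sha256(file_bytes).hexdigest() in signatures


known_sample = KNOWN_BAD_PAYLOAD
new_variant = KNOWN_BAD_PAYLOAD + b"!"  # the same payload with one byte appended

print(is_flagged(known_sample))  # True: the digest is in the database
print(is_flagged(new_variant))   # False: an unseen variant is not flagged
```

Even a one-byte change produces a different digest, so the variant goes undetected until its signature is added to the database.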

What are some deep learning antivirus programs, and how do they work?

Shallow machine learning programs, which learn a relatively simple mapping from input features to an output, are used in many cybersecurity programs. However, new advances in deep learning technologies have led to neural networks outperforming even the best shallow learning algorithms. At the 9th New Technologies, Mobility and Security conference, an artificial neural network used to analyze imagery was shown to have outdone the winners of the Microsoft Malware Classification Challenge[2]. This system reads a file's raw bytes and renders them as a grayscale image, which the neural network then classifies according to its similarity to known malware. The system achieved 99.97% accuracy on a dataset of over 10,000 malware samples. The model used in this method is a convolutional neural network, a design inspired by the visual cortex of animals.
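The byte-to-image step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline from [2]: the function name `bytes_to_grayscale`, the fixed row width of 16, and the zero padding of the last row are all choices made here for the example.

```python
def bytes_to_grayscale(data, width=16):
    """Arrange raw bytes into rows of `width` pixels (0 = black, 255 = white)."""
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width  # zero-pad so the last row is full
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


sample = bytes(range(40))  # stand-in for the contents of a file
image = bytes_to_grayscale(sample, width=16)
print(len(image), len(image[0]))  # 3 16
```

Each byte already lies between 0 and 255, so it maps directly onto one grayscale pixel, and the resulting grid of numbers is the kind of input a convolutional neural network expects.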

A model of a neural network, showing the input, hidden, and output layers
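As a rough illustration of the three kinds of layers, the sketch below passes an input through one hidden layer and one output layer. It assumes fully connected layers with sigmoid activations and uses hand-picked weights (`HIDDEN_W`, `OUTPUT_W`, and so on are hypothetical values, not trained parameters); the network in [2] is convolutional and far larger.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to neuron j."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Illustrative, hand-picked parameters: 3 inputs -> 2 hidden neurons -> 1 output.
HIDDEN_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.0]


def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense(inputs, HIDDEN_W, HIDDEN_B)
    return dense(hidden, OUTPUT_W, OUTPUT_B)


score = forward([0.0, 0.0, 0.0])[0]
print(round(score, 3))
```

In a trained classifier the weights would be learned from labeled samples, and the output could be read as a score for a class such as "malicious".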

Another neural network program, the FO-SAIR (Fractional-Order Susceptible-Antidote-Infected-Removed) framework, has shown great success in removing viruses. This program is modeled on the treatment of biological diseases and is a modified version of the SIR (Susceptible-Infected-Removed) framework. FO-SAIR is examined through stochastic optimization, a method that uses randomly generated values to approximate the behavior of an actual system. This program is one of the most cost-efficient antivirus methods, as it creates “antidotes” at a rate dependent on how far the virus has spread, and deletes these antidotes once they are no longer needed.[4] As powerful as these programs are, they are limited by hefty storage requirements and by the time it takes to train them, which makes them an impractical option for an everyday consumer.
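For intuition about the SIR framework that FO-SAIR modifies, here is a minimal sketch of the classical integer-order SIR model, integrated with the forward Euler method. It is not the FO-SAIR system from [4]: it has no antidote compartment and no fractional-order derivative, and the rates `beta` (infection) and `gamma` (removal) are illustrative values chosen for this example.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt=0.1, steps=1000):
    """Forward-Euler integration of the classical SIR equations:
    dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I,
    where S, I, and R are fractions of all machines."""
    s, i, r = s0, i0, r0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt
        new_removals = gamma * i * dt
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


# Illustrative rates and starting fractions, not values taken from [4].
history = simulate_sir(beta=0.5, gamma=0.1, s0=0.99, i0=0.01, r0=0.0)
peak_infected = max(i for _, i, _ in history)
final_s, final_i, final_r = history[-1]
print(f"peak infected fraction: {peak_infected:.2f}")
print(f"removed by the end:     {final_r:.2f}")
```

Machines move from susceptible to infected to removed, and the three fractions always sum to one. An antidote compartment, as described above, would add a further group whose size responds to how far the infection has spread.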

Won't hackers have access to deep learning too?

As technology becomes more accessible, both the quantity and quality of malware have skyrocketed. Graham Ingram, former general manager of Australia's Computer Emergency Response Team, stated that “We are getting code of a quality that is probably worthy of software engineers[,]”[5] in reference to the skill of newer malware authors. However, there are several key factors that make deep learning better suited to antivirus software than to malware. The need for both substantial computing power and a very large dataset makes deep learning inaccessible to many people. This is especially true for malware authors, as large datasets of antivirus software are much harder to find than large datasets of viruses, meaning it would be harder to train a virus on antiviruses than it would be to train an antivirus on viruses. In addition, according to MathWorks, deep learning is used to “perform classification tasks directly from images, text, or sound[,]” which is useful for identifying, but not hiding, malware. While both malware authors and antivirus authors will have access to deep learning, by its nature it is better suited to antivirus software than to malware.

Conclusion

Machine learning has advanced rapidly over the years and has impacted many fields, both those directly connected to technology and those that are not. As this technology continues to improve, it will be used more and more frequently. This will especially be the case in the field of cybersecurity, as deep learning programs are showing very high levels of success, especially compared to most modern antivirus programs. As machine learning grows more advanced, it is likely to become the leading approach to cybersecurity due to its speed and accuracy in malware classification.

Discussion questions

  1. Do you think deep learning programs will fully replace traditional antivirus software in the future? Why or why not? What challenges still exist?
  2. How might advances in deep learning impact the "arms race" between malware creators and cybersecurity specialists? Will one side gain an upper hand?
  3. Should machine learning algorithms used for cybersecurity purposes be open source? What are the ethical implications around transparency and scrutiny in this case?
  4. What other applications of deep learning show promise for improving cybersecurity beyond analyzing malware? For example, could it detect intrusions or compromised accounts?
  5. Do you believe enough is being done legally and politically to prepare for emerging technologies like AI-powered cyberattacks or cyberwarfare? What policies or regulations could help address these threats?

References

  1. Chen, Thomas, and Robert, Jean-Marc (2004). "The Evolution of Viruses and Worms". web.archive.org. Retrieved 31 July 2019.
  2. Kalash, Mahmoud et al. “Malware Classification with Deep Convolutional Neural Networks.” IEEE.org 2018 9th IFIP International Conference on New Technologies, Mobility and Security. 2018
  3. “Sandboxing Protects Endpoints | Stay Ahead of Zero Day Threats” comodo.com (June 20, 2014)
  4. Noinang, Sakda et al. “Numerical Assessments Employing Neural Networks for a Novel Drafted Anti-Virus Subcategory in a Nonlinear Fractional-Order SIR Differential System” IEEE Access vol. 10 (2022)
  5. Kotadia, Munir. "Why popular antivirus apps 'do not work'". zdnet.com (July 2006)