Joseph M. Burton
During the last several years the use of Data Analytics (sometimes called Business Analytics) and its technological sibling, Machine Learning, has changed the manner in which many companies do business, brought significant and tangible changes to business efficiency and bolstered profitability. Some of the early and successful uses of data analytics have occurred in banking and financial services; oil and gas exploration and extraction; healthcare; insurance; and retail. However, it is only recently that these technologies have begun to be generally employed within the cybersecurity industry. This article describes some of the most significant current and future uses of data analytics and machine learning in cybersecurity. It also describes some of the legal and practical challenges, as well as unique consequences, attendant to the use of these extraordinary technologies for cybersecurity.
Key Technologies
Before discussing the application of data analytics based techniques to cybersecurity, it is important to understand four key concepts that underlie its use and phenomenal effectiveness.
Big Data
One of the most significant side effects of the digital age has been the tremendous increase in the speed and volume of data creation. Today, it is estimated that approximately 2.5 exabytes of information is created each day (the equivalent of 250,000 Libraries of Congress, or 90 years of HD video). Historically, tremendous volumes of information were often considered a burden because of the need to store and retain the information for later use and/or analysis. Fortunately, technological advances have greatly minimized these burdens and opened up opportunities for businesses. First, the cost and effort to store and subsequently recover tremendous volumes of information have become cheaper and easily accessible because of memory chip advances and the movement toward cloud based computing and data storage platforms. Most importantly, advances in the ability to analyze these large volumes of information through the application of data analytics have provided businesses with the means to discover and leverage the latent value contained within their own information.
Data Analytics
Data analytics involves the application of various statistical techniques, such as regression and scatter analysis, to extremely large quantities of data, so-called “Big Data.” Application of these statistical methods yields information about the nature of the data and the relationships among the various data elements.
There are three forms of data analytics. The first is descriptive analytics. Descriptive analytics reveals, in all of its various dimensions, what the data describes as having occurred, and the relationships among individual elements of the data. Descriptive analytics was the earliest form of “data mining,” which is the extraction of valuable insight and information otherwise locked away inside a mountain of data.
The second form of data analytics is called predictive analytics. In addition to describing what the data reveals about what has occurred, predictive analytics, using other statistical methods, can predict, within stated margins of error, the likelihood of outcomes and relationships in a different data set containing the same or similar data. Predictive analytics allow a machine to “learn” from the information it has previously analyzed and therefore to make educated guesses about what other similar datasets should or are likely to reveal.
The third and most powerful form of data analytics is prescriptive analytics. It allows a machine to describe, based upon its analysis of a specified “seed” dataset, what outcomes or interactions are likely to happen, when they are likely to happen, and why they are likely to happen, all within a different dataset containing information similar to the original “seed” dataset it had previously analyzed.
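The first two forms can be illustrated with a minimal Python sketch. It is not drawn from any particular product: the “seed” figures are invented, and the function names are hypothetical. The descriptive step summarizes what the seed data shows, and the predictive step fits an ordinary least-squares line (one simple regression technique) and estimates a value for a new input, with the residual spread serving as a rough margin of error.

```python
from math import sqrt
from statistics import mean

# Hypothetical "seed" data: (x, y) observations, invented for illustration.
seed = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.0)]

def describe(pairs):
    """Descriptive step: summarize what the data shows has occurred."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return {"count": len(pairs), "mean_x": mean(xs), "mean_y": mean(ys)}

def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx

def predict(model, x):
    """Predictive step: estimate y for an input not in the seed data."""
    slope, intercept = model
    return slope * x + intercept

def residual_spread(model, pairs):
    """Rough margin of error: standard deviation of the fit's residuals."""
    sq = sum((y - predict(model, x)) ** 2 for x, y in pairs)
    return sqrt(sq / (len(pairs) - 2))

summary = describe(seed)
model = fit_line(seed)
estimate = predict(model, 6)
spread = residual_spread(model, seed)
print(summary, estimate, spread)
```

Prescriptive analytics, as described above, would go a step further and is not attempted in this sketch.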
Machine Learning
While the three forms of data analytics are the basis for the successful and growing use of data analytics, they have been enhanced by technologies from the related field of artificial intelligence (AI) that allow the use of data analytics in more varied and sophisticated ways.
The most important, machine learning, is closely associated with predictive and prescriptive analytics. Prescriptive analytics is classified as an “unsupervised” form of analytics, unlike descriptive analytics, which is a form of supervised analytics. “Supervised” means that the analytics performed (and thus the results generated) are subject to and the product of known, tightly specified and pre-programed parameters, characteristics and controls. By contrast, in using “unsupervised” analytic techniques, a computer is initially “taught” the nature, characteristics and relationships among important data elements in an initial data set, and is then left to identify, on its own, the same or similar characteristics and relationships in other data sets.
The difference between supervised and unsupervised analytics is often described as the difference between programing a computer and training a computer. The power in this difference is in many ways reminiscent of the old proverb: “Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime.”
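The contrast between programming and training can be sketched in a few lines of Python. This is only an illustration of the distinction as this article draws it, not of any specific machine learning library; the failed-login counts, the fixed limit of 5, and the function names are all assumptions made for the example. In the first function a person fixes the cutoff in advance; in the second, the cutoff is derived from previously observed data.

```python
from statistics import mean, stdev

def programmed_rule(failed_logins):
    """'Programmed': the cutoff (5) is fixed in advance by a person."""
    return failed_logins > 5

def train_threshold(history, k=3.0):
    """'Trained': the cutoff is derived from observed data, here as
    the mean plus k standard deviations of past failed-login counts."""
    return mean(history) + k * stdev(history)

def trained_rule(failed_logins, threshold):
    """Flag a count that exceeds the learned threshold."""
    return failed_logins > threshold

# Hypothetical history of daily failed-login counts for one account.
history = [0, 1, 0, 2, 1, 0, 1, 3, 0, 2]
threshold = train_threshold(history)

print(programmed_rule(5), trained_rule(5, threshold))
```

With this invented history, a day with five failed logins passes the fixed rule but is flagged by the learned one, because the learned cutoff reflects how quiet the account normally is.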
In addition, recent advances in machine learning brought on by the use of “neural networks,” “layered neural networks,” “deep learning,” and other AI techniques have greatly increased the power and effectiveness of predictive and prescriptive analytical algorithms.
Unstructured Data
Lastly, two greatly improved artificial intelligence derived capabilities, natural language processing and image processing, are helping to accelerate the growth of data analytics usage within more and more industries by giving computers a growing ability to effectively utilize “unstructured data.”
A huge amount of both existing and newly created data, perhaps even the majority of digital data, has been unavailable for analysis because it is not in, nor can it be put into, a usable table format. This data, termed “unstructured data,” includes text, audio, video, and still-image based information, and together accounts for perhaps 80 percent of all created and existing data. However, because of recent and continuing advances in natural language processing (i.e., speech to text, text to speech) and image processing (both video and still images), unstructured data can be ingested, processed and used in data analytics and machine learning based applications.
Cybersecurity Applications
The creation of machine learning-based applications is one of the hottest areas of investment and development in the financial technology (FINTECH) sector. This is not surprising, as descriptive and predictive analytical techniques have long been used by banks and other financial institutions to sift through transaction histories in order to identify fraudulent activity in both offline and online transactions.
With the increased use of mobile banking, proper user identification and authentication have taken on increased importance. To meet this challenge, many financial institutions are employing advanced behavioral techniques that help prevent fraud by ensuring that only an authorized user has access to banking services. These techniques, which rely upon real-time descriptive and predictive analytics, are passive and therefore do not require any obvious user input or action. Thus, if needed, they act as an additional authentication layer but provide customers with a low friction, hassle-free user experience.
However, outside of these specialized financial services cases, machine learning and data analytics technologies have not been widely employed in cybersecurity products. The most common exceptions have involved antivirus products that rely upon descriptive analytics techniques for the recognition of virus signatures. Another example is found in newer malware detection and network intrusion detection products that are beginning to utilize predictive analytic techniques to enable real time analysis of network activity.
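Signature recognition of the kind mentioned above can be sketched, in greatly simplified form, as a lookup of a file's cryptographic hash in a set of known-bad hashes. Real antivirus products use far richer signatures; the hash-only approach, the sample payload, and the names below are assumptions made for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical signature database: hashes of payloads already known
# to be malicious. The payload here is a harmless placeholder.
KNOWN_BAD = {sha256_hex(b"example-malicious-payload")}

def matches_signature(data: bytes, signatures=KNOWN_BAD) -> bool:
    """Descriptive check: has exactly this content been seen before?"""
    return sha256_hex(data) in signatures

print(matches_signature(b"example-malicious-payload"))
print(matches_signature(b"example-malicious-payload!"))
```

The second call shows the limitation: changing a single byte defeats an exact-match signature, which is one reason newer products look to predictive techniques.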
In addition to monitoring and analyzing network activity, there has been some application of user and entity behavior analytics (UEBA) based products. Similar to the banking software mentioned earlier, these products use predictive analytics to analyze the real-time behavior of network users and network devices in order to detect types of anomalous behavior indicative of potentially unauthorized activity.
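A minimal sketch of the idea behind behavior analytics follows, assuming a single numeric measure per user (for example, megabytes transferred per session). It builds a per-user baseline from past activity and flags new activity that deviates sharply from that user's own norm. The user names, figures, cutoff of three standard deviations, and function names are hypothetical; commercial UEBA products model many more signals.

```python
from statistics import mean, stdev

def build_baselines(events):
    """events: iterable of (user, value). Returns {user: (mean, stdev)}
    for every user with at least two observations."""
    per_user = {}
    for user, value in events:
        per_user.setdefault(user, []).append(value)
    return {u: (mean(v), stdev(v)) for u, v in per_user.items() if len(v) >= 2}

def anomaly_score(baselines, user, value):
    """Distance of value from the user's mean, in standard deviations."""
    mu, sigma = baselines[user]
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma

def is_anomalous(baselines, user, value, cutoff=3.0):
    """Flag activity more than `cutoff` deviations from the user's norm."""
    return anomaly_score(baselines, user, value) > cutoff

# Hypothetical history: megabytes transferred per session.
history = [
    ("alice", 100), ("alice", 120), ("alice", 110), ("alice", 90), ("alice", 105),
    ("bob", 5000), ("bob", 5200), ("bob", 4900),
]
baselines = build_baselines(history)

print(is_anomalous(baselines, "alice", 5000))
print(is_anomalous(baselines, "bob", 5000))
```

The same 5,000 MB transfer is flagged for one user and not the other, because each is judged against that user's own history rather than a single global rule.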
However, it still is fair to say that with very few exceptions, existing cybersecurity products do not employ the more advanced predictive and prescriptive analytics techniques, nor do they use machine learning technologies. There are two potential reasons for this: the absence of a need for strong and deep data analysis and the absence of a robust capability to do it.
Early cybersecurity strategy was predicated on establishing a security perimeter “wall,” and safely placing computing devices “behind” that wall. Security relied upon constructing strong defenses against known and anticipated vulnerabilities in the wall, followed by vigilant monitoring of the wall for evidence of a breach. This is akin to the World War I strategy of building static defenses and waiting for the enemy to attack. Given this strategy there was little perceived need for basic log analysis, let alone a need for more comprehensive data analytics. Moreover, because reviewing log data is a tedious, time-consuming task, it has been routinely relegated to the bottom of the priorities list.
A recent change in cybersecurity strategy is being driven by the realization that in an age of mobile, ubiquitous, and collaborative information sharing and computing, there is no “perimeter wall,” or more specifically, that (almost) everyone is, or can get, behind that wall. Moreover, with the development and proliferation of technologies like the internet of things (IoT), the attack vectors available for compromise or exploitation have exponentially increased. There just are not enough fingers available to plug the security wall.
Current security thinking and strategy have evolved and now assume that the “perimeter” has been or likely will be breached. Thus, current strategy places a premium and emphasis upon (1) protecting the actual data at risk through encryption and other technologies, and (2) affirmatively seeking out and reacting to “indicators of compromise” and other evidence of an actual or likely breach. This strategy more closely resembles modern warfare strategy, which relies on mobility, speed, and the capability to effectively seek out and destroy an aggressor.
While this new strategy of proactive defense and rapid response is a valid adjustment to the changed security threat landscape, it requires and demands that the voluminous, and often real time, information available from modern computing systems be effectively reviewed and utilized. This is not likely to occur if left to routinely busy and often time-constrained IT and security staff. In addition, because of problems of system interoperability and compatibility between different products (which often lead to an incomplete or inaccurate picture of a system’s or network’s state), vast collections of potentially invaluable security information are effectively being turned into a vast data wasteland. Machine learning and data analytics techniques can resolve both of these problems and serve as the bedrock foundation for realizing the new cybersecurity strategy.
First, review and analysis of information is automatically performed by the computer, not by IT or security personnel. (In the case of descriptive or predictive analytics, some initial and perhaps subsequent training may be involved. In the case of the more sophisticated deep learning systems, only a period of initial training is necessary.) Human input and involvement can be inserted at those points in the process where involvement is most critical. When properly implemented, data analytics and machine learning should be used to inform or augment human decision-making, not replace it.
Secondly, the volume of potentially relevant and available information to be analyzed is a benefit, not a detriment. Data analytics requires (and indeed provides the best results with) more data, not less. Moreover, advanced data-driven technologies can utilize both structured and unstructured information. This makes available a large and rich range of information, such as all written publications, security blog postings, security alerts, up-to-the-minute and archived security intelligence, lecture transcripts and notes, and live audio and video recordings. Items like this provide greater context to the analysis, thereby greatly aiding the accuracy and reliability of its conclusions and proffered recommendations.
The data-driven cybersecurity products discussed to this point provide an excellent means to better implement many of the technical safeguards required to achieve cybersecurity. However, technical safeguards represent just one aspect of the cybersecurity puzzle. The overriding goals of any cybersecurity program are: (1) actual protection of the confidentiality, integrity and availability of the information at risk; and (2) legal defensibility in case of a compromise of the information. Standing alone, the technical safeguards achievable through application of even the most advanced machine learning technologies are not sufficient to accomplish these cybersecurity goals. An effective and legally defensible cybersecurity program is only achieved through the development, implementation, and maintenance of a series of not just technical, but also, administrative and physical safeguards, all in compliance with a measurable, repeatable and individually tailored security process.
For example, consider just one of several aspects that are critical to the success of the security process: identification of foreseeable risks. Risk assessment is not primarily a technical issue; it is a considered business judgement, which takes into account, in combination, factors such as the location, nature and use made of the information owned or controlled by the business; its sensitivity; its value to aggressors; the economic and non-economic impact to the business likely to arise from its compromise; and the costs and effort required to protect it.
Other aspects of the cybersecurity process are also excellent candidates for application of data-driven technologies. These candidates include: evaluating the effectiveness of existing security safeguards (i.e., administrative, technical and physical); monitoring the compliance status of regulated security and privacy requirements; evaluating and monitoring the strength, effectiveness and compliance status of third party security measures; and evaluating and recommending annual adjustments to the cybersecurity program. The information describing and supporting consideration of these factors is likely to be voluminous and is perfect for data-driven analytics.
Until now, managing all of these aspects of the full cybersecurity process has frequently been seen as a complex, daunting and needlessly confusing task. Data analytics and machine learning technologies are highly suited for ingesting, analyzing, describing and evaluating the entire range of information needed to enable company management to make fact based, data-driven cybersecurity decisions and thus more effectively discharge their cybersecurity responsibilities.
Challenges and Consequences
While the promise of data-driven cybersecurity applications is great, like any new technology, there are also potential dangers and challenges that must be addressed.
Accountability and Transparency
Who is, or should be, responsible when decisions or actions made by a machine subsequently prove to be “wrong” and result in adverse consequences? Until recently, the answer could be derived by applying accepted principles of product liability law. Under these laws, when a machine or other product is defective (either as a matter of design or in its inherent operation) and results in physical injury or economic damage, the designer, manufacturer or distributor of the product could be held legally liable.
However, in cases involving machine learning, we are confronted with an entirely new dimension to the problem: decision-making. Assume that the machine was not defective. That is, the machine and its algorithms operated precisely as designed and specified, but its judgments or recommendations were wrong for any number of reasons, including that the data was inaccurate, corrupted or biased; that other available data was ignored or inappropriately weighted; or that the algorithms used (not the data) were biased or incomplete. Would this “error” constitute a new form of (machine) negligent, or even culpably intentional, behavior? Such issues are even more vexing when this problem is considered in a context in which unsupervised analytics or advanced deep learning technologies are employed. In these situations, the decisions made by the machine are one step removed from the original algorithms used to train and control the system. In this context the machine is, in effect, thinking for itself and drawing its own conclusions. Can the machine, its creators, or its users be liable?
To date, there is no existing legislation or set of regulations directly addressing these issues; similarly, there is a dearth of case law. The most noteworthy case is a recent Wisconsin Supreme Court decision that authorized probation departments to use, at sentencing, data analytics that predicted the likelihood that a particular defendant would become a recidivist. Based on numerous problems with the reliability of the data used by the machine, the court allowed its use only if trial judges were informed, in writing, of its analytical shortcomings and did not rely exclusively on the recommendation in evaluating the likelihood of recidivism.
Closely related to these accountability issues is the issue of transparency. It makes sense that the individuals employing and affected by these technologies should have access to the algorithms supporting the machine’s conclusions or recommendations. However, in many situations, the actual workings of the technology are a “black box,” because the algorithms are deemed proprietary intellectual property. Unless we find a way to give individuals who place critical reliance on the findings and recommendations produced by these technologies access to this information, there is serious potential that their use of these technologies may be limited or abandoned.
Small and Midsize Business (SMB) Access
Small and midsize businesses seeking application of these technologies to their cybersecurity needs are likely to face two issues.
First, highly sophisticated technologies, as discussed, are usually expensive to acquire and implement. Moreover, these technologies usually are not targeted at SMBs. Second, application of these technologies often requires in-house data analytics expertise, or at least experts sufficiently familiar with both the business and the underlying data technologies to be able to “translate” between the two so that the business’s data is properly identified and effectively used by the program algorithms. This is not a problem faced just by cybersecurity businesses or SMBs (a recent McKinsey & Company report found that one of the most significant barriers holding up wider adoption of data analytic technology is a shortage of qualified data science experts), but it is one likely to affect SMBs disproportionately. SMBs are frequently targeted and exploited by cyber aggressors. It is vital that these technologies become readily available to SMBs because efforts to target and attack them are about to become more serious and dangerous, and the consequences are likely to adversely affect all of us.
Machine Warfare
Now that the data analytics and machine learning technologies are becoming available for use by the “good guys” (cybersecurity defenders), it is inevitable that the same technologies will also be employed by the “bad guys” (cybersecurity aggressors). We are fast approaching a world similar to the television series “Person of Interest,” wherein a “good” artificial intelligence entity known as “The Machine” faces off against a “bad” artificial intelligence entity known as “Samaritan.” Cybersecurity hackers and other aggressors have already demonstrated enviable knowledge, skill, speed and tenacity in developing malware and other technical measures for penetrating or circumventing cybersecurity defenses. As we begin to employ these new machine learning technologies it is only a matter of time before cyber aggressors do the same. It is even possible that they will beat us to the punch.
Last Word
The difficulties in implementing effective cybersecurity are rapidly increasing. These difficulties are a product of the complexity and scope of a rapidly developing digital infrastructure; the speed at which new vulnerabilities and attacks on that infrastructure and its users are developed; and the sophistication, tenacity and inherent tactical advantage naturally possessed by cyber aggressors.
Data analytics and machine learning technologies are capable of overcoming these problems and providing effective cybersecurity. Indeed, it is likely that in the future these technologies may be the only effective means of achieving cybersecurity. While there are currently few such products on the market, this circumstance is likely to change quickly, as one can see from the recent acquisitions of pure machine learning companies by cybersecurity vendors, and from an increasing number of new startup companies focused on machine learning based cybersecurity.
Data-driven cybersecurity is an exciting, powerful and potentially game changing new tool available to the entire cybersecurity community. It is for this reason that we all would be wise to keep an eye on its development.
Joseph M. Burton is a partner in Duane Morris’ San Francisco office. He is nationally recognized in the field of information security law, advising and representing individuals and corporations regarding their rights and responsibilities in maintaining the security of digital information. Mr. Burton is a former Assistant United States Attorney and Chief of the Silicon Valley Office for the Northern District of California, where he handled several pioneering high technology investigations and prosecutions.
Reprinted with permission.