PREDICT: Large Datasets for Cyber Security
The DHS Science and Technology Directorate is sponsoring an initiative to facilitate the accessibility of computer and network operational data for use in cybersecurity defensive research and development. The PREDICT (Protected Repository for the Defense of Infrastructure against Cyber Threats) initiative represents an important three-way partnership among government, critical information infrastructure providers, and security development communities (both academic and commercial), all of whom seek technical solutions to protect the public and private information infrastructure. The primary goal of PREDICT is to bridge the gap between the producers of security-relevant network operations data and the technology developers and evaluators who can leverage this data to accelerate the design, production, and evaluation of next-generation cybersecurity solutions.
Specifically, PREDICT provides developers and evaluators with regularly updated network operations data sources relevant to cybersecurity defense technology development, including sources that are minimally anonymized, if not entirely uncensored. The data sets are intended to provide developers with timely and detailed insight into cyberattack phenomena occurring across the Internet, and in some cases will reveal the effects of these attacks on networks that are owned or managed by the data producers. A key motivation of PREDICT is to make these data sources more widely available to technology developers and evaluators, who today often judge the efficacy of their technical solutions from anecdotal evidence or small-scale test experiments rather than from more comprehensive real-world data.
The PREDICT website (http://www.predict.org/) provides an overview and general background information, along with the data repository. Basic dataset categories include IP packet headers and Internet topology data. Descriptions of the specific categories are provided on the website, along with descriptors for the fields of the individual datasets. As specified on the website, access to the PREDICT data repository is available to eligible research groups upon approval of their applications. In addition, new sources of data are continually being sought.
Considerable effort has been devoted within the PREDICT community to ensuring the privacy of individuals and organizations with respect to the contents of the data repository. The DHS PREDICT Privacy Impact Assessment document is available and represents a significant proactive analysis of the privacy concerns and the measures needed to address them.
One of the PREDICT performers, the University of California, San Diego, released an analysis of the Syrian Internet outage in a blog entry titled “Syria disappears from the Internet.”
One of the PREDICT performers released an analysis, published as a tech report, of the Internet outages related to Hurricane Sandy. A press release about the study is available at: http://news.usc.edu/#!/article/45114/internet-outages-in-the-u-s-doubled-during-hurricane-sandy-usc-study-finds/. The original tech report can be found at: http://ant.isi.edu/blog/archives/272