NEW MODELS FOR ANOMALY DETECTION IN DATA STREAMS
Ph.D. Student: Aurora Esteban
Advisors: Amelia Zafra, Sebastián Ventura
Started on: December 2019
Keywords: Anomaly detection, Data stream
The rapid growth in the number and forms of connection today, whether through sensors, smart devices, health applications or autonomous vehicles, is generating data streams that can provide real-time information about the status of the monitored systems. Extracting new knowledge from these data flows is a line of research that has attracted great attention in recent years, and the application of the resulting techniques is transforming areas such as the Internet of Things, Industry 4.0 and computer security.
Detection of outliers or anomalies refers to the problem of identifying patterns in the data that do not follow an expected behavior. This task is especially complicated in a data stream scenario, where there is no static data set; instead, the data arrive at the system dynamically, either in batches or one by one. Multiple real-world problems have been approached from this perspective: the detection of failures in industrial systems through the monitoring of their parameters; applications related to financial fraud, such as the detection of fraudulent operations in credit card or insurance transactions; and the detection of anomalous behavior in web logs, such as unauthorized access, for security applications and intrusion detection.
In this scenario, the detection of anomalies in data streams is currently an active research area in which many questions remain open. This Ph.D. thesis aims to address some of them:
- In data streams, the input data is not permanently available but arrives as one or more continuous streams of temporal data, so it can evolve dynamically. Therefore, the learning methods applied must comply with certain restrictions, such as processing each instance in a single pass, real-time response, limited memory, and the detection of changes in the data distribution (concept drift). This thesis aims to advance this area by proposing models that detect anomalies in these environments, following the line of recent studies. In particular, it intends to propose models based on adaptive decision trees (Hoeffding Adaptive Tree, HAT), as well as ensemble-based classifiers.
- From a data mining perspective, anomaly detection has mostly been approached using unsupervised learning. These approaches have the advantage of working with unlabeled data, which are easier to obtain and represent the more common scenario in anomaly detection. However, there are studies that show the relevance of starting from labeled data to obtain more efficient results. This thesis intends to make a novel contribution through the use of semi-supervised models. Specifically, it proposes to apply active learning techniques, which maximize the benefits of semi-supervised learning by labeling the most representative data.
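The streaming restrictions listed above (single pass, bounded memory, adaptation to a changing distribution) can be illustrated with a minimal sketch. It is not one of the models proposed in this thesis: it swaps in a plain sliding-window z-score instead of a Hoeffding Adaptive Tree or an ensemble, and the class name `SlidingWindowDetector` and the parameters `window_size` and `threshold` are hypothetical.

```python
import math
from collections import deque


class SlidingWindowDetector:
    """Hypothetical single-pass anomaly detector with bounded memory.

    Illustration only: a sliding-window z-score, not a Hoeffding Adaptive
    Tree or an ensemble model.
    """

    def __init__(self, window_size=50, threshold=3.0):
        # deque(maxlen=...) discards the oldest value automatically, so
        # memory stays bounded no matter how long the stream runs.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def score(self, x):
        """Absolute z-score of x against the values currently in the window."""
        n = len(self.window)
        if n < 2:
            return 0.0  # not enough history to estimate a spread
        mean = sum(self.window) / n
        var = sum((v - mean) ** 2 for v in self.window) / (n - 1)
        std = math.sqrt(var)
        if std == 0.0:
            return 0.0 if x == mean else math.inf
        return abs(x - mean) / std

    def process(self, x):
        """Score x first, then learn from it; each value is read only once."""
        is_anomaly = self.score(x) > self.threshold
        # Always appending lets the window follow a drifting distribution,
        # at the cost of briefly absorbing anomalous values.
        self.window.append(x)
        return is_anomaly


detector = SlidingWindowDetector(window_size=20, threshold=3.0)
stream = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [50.0]
flags = [detector.process(x) for x in stream]
```

In this toy stream only the final value is flagged. Because old values leave the window, the detector follows gradual changes in the data, which is a much cruder form of adaptation than an explicit concept drift detector.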
The partial objectives of this Ph.D. thesis are the following:
- A complete study of the current proposals for anomaly detection, to delimit the types of existing anomalies and to identify the most appropriate proposals for each of the problems raised.
- Design and implementation of new models for detecting anomalies in data streams, with special interest in time series data. Different approaches based on ensembles and Hoeffding Adaptive Trees will be evaluated.
- Design and implementation of active learning strategies that allow working in a semi-supervised learning environment for the detection of anomalies.
- Validation of the developed models, both on benchmarks and in a real predictive maintenance domain: the detection of breakdowns in land vehicles.
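As an illustration of the active learning objective above, the following sketch shows one common way to decide which stream instances are sent to a human expert for labeling: uncertainty-based querying under a labeling budget. It is an assumption about how such a strategy could look, not the strategy designed in this thesis. The function `select_queries` and its parameters `threshold`, `margin` and `budget` are hypothetical, and the scores are assumed to come from some anomaly detector that outputs values in [0, 1].

```python
def select_queries(scores, threshold=0.5, margin=0.1, budget=0.2):
    """Return the indices of stream instances whose labels would be requested.

    An instance is queried when its anomaly score lies within `margin` of
    the decision `threshold` (the detector is uncertain) and the fraction
    of labels spent so far is still below `budget`.
    """
    queried = []
    for seen, score in enumerate(scores, start=1):
        uncertain = abs(score - threshold) <= margin
        within_budget = len(queried) < budget * seen
        if uncertain and within_budget:
            queried.append(seen - 1)  # store the 0-based index
    return queried


# Hypothetical anomaly scores produced one by one by a streaming detector.
scores = [0.05, 0.48, 0.95, 0.52, 0.55, 0.10, 0.45, 0.50, 0.91, 0.58]
queried = select_queries(scores, threshold=0.5, margin=0.1, budget=0.2)
```

The budget check runs on every instance, so the labeling effort stays bounded even on an unbounded stream. A strategy aimed at the most representative instances, rather than the most uncertain ones, would replace the `uncertain` test with a different selection criterion.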
The development of this thesis is being supported by:
- Spanish Ministry of Education, Culture and Sports under the FPU program (FPU19/03924).