CI/CD for Unsupervised Anomaly Detection

The agile practices of continuous integration and continuous deployment (CI/CD) have become the standard in industrial software development. Applying them can shorten an application's time-to-market while also ensuring higher software quality, making the entire software lifecycle more efficient. However, little attention has been paid to applying CI/CD to machine learning (ML) applications, especially in unsupervised ML scenarios where no labeled data is available. Anomaly detection, one of the most common use cases of unsupervised machine learning, aims to identify abnormal behaviour in datasets or time-series data, and is applied in areas such as fraud detection, network intrusion detection, and medical analysis. This thesis addresses the challenge of applying CI/CD practices to unsupervised anomaly detection. For this purpose, the traditional principles of CI/CD are extracted from the literature and adapted to the requirements of the machine learning lifecycle of anomaly detection applications. Based on the adapted principles, a conceptual pipeline is proposed. An implementation using technologies such as Kubeflow, Kale, and RoK demonstrates the feasibility of this pipeline. Furthermore, the Scoring Distance, a metric for evaluating the performance of unsupervised anomaly detection models, is proposed and evaluated. The results show that it is not as accurate as the ground-truth-based F1 score, but both metrics scale similarly across various models, making the Scoring Distance a promising solution for internal model evaluation.