Knowledge discovery in biological big data : Tailor-made data analysis algorithms integrating expert knowledge

Hausen, Jonas

Aachen (2020)


Over the course of recent decades, rapid technological advances have led to the advent of big data analysis within biology and the environmental sciences. This development has been enabled by new technologies for sharing and storing data, alongside novel high-throughput methods that generate large datasets at comparably low cost. Biological big data share common characteristics, including heterogeneity, a large number of variables, and high noise. Traditional methods for data analysis and visualization often cannot handle these characteristics and therefore fail to extract biologically meaningful results. To separate relevant knowledge from random patterns, expert knowledge is needed. A promising way to solve this problem is to integrate this expert knowledge into data mining techniques, which are especially suited for the analysis of big data. The aim of this study is the integration of expert knowledge into the analysis of big biological data. To achieve this, a data analysis workflow utilizing the characteristics of biological data was developed. This workflow was applied to three different big biological datasets from environmental research: a) gene expression data from zebrafish (Danio rerio) following exposure to different environmental contaminants; b) taxonomic data and environmental parameters from a global soil-zoology database; and c) fungal DNA sequence data from soil samples taken in differently managed forests. To handle the volume and complexity of the data, all three datasets were analysed via a data mining workflow consisting of preprocessing, application of a data mining algorithm, and visualisation. At different steps of the analysis workflow, domain-specific expert knowledge was integrated. In this manner, irrelevant or insignificant results were excluded, and only biologically meaningful results were derived.
The integration of expert knowledge into the analysis of the zebrafish data strongly reduced data noise and revealed genes and patterns that react specifically to one of the contaminants. An adapted version of the framework filtered out unimportant variables from the soil-zoology database and helped determine biologically relevant classes of the remaining parameters. Expert knowledge was then used to identify essential patterns in fungal communities and to determine habitat-specific ecological guild compositions in the different forests. At specific steps, the collaboration of a domain expert and a data scientist turned out to be crucial for the success of the analysis. The workflow helped to identify these steps by subdividing the complex data analysis into smaller and more straightforward work tasks. Powerful visualizations were essential to the cooperation, as they provided a platform for discussion and validation of the results. The ability to show multiple aspects of the data across a wide range of applications was one of the keys to the collaboration, and all three applications relied heavily on these visualizations. The results of the present thesis demonstrate how domain-specific expert knowledge can be used to improve the results of data mining approaches in the analysis of big, heterogeneous biological data. The cooperation of data scientists and domain experts made it possible to account for the characteristics of the individual subject-specific datasets whilst maintaining the power of the data mining approaches.