Extracting useful and novel knowledge from raw data is a complex process involving several phases, such as studying the problem domain, preprocessing the dataset, building and deploying models, and interpreting the results [Fay96]. Although the experience and know-how of the data scientist are irreplaceable, certain phases are repetitive and time-consuming. In this vein, several authors have proposed methods that, given a dataset, automate the selection of the best algorithms and their hyper-parameters [Hut19], although their application is usually limited to a single phase. However, applying these methods to each phase separately might generate invalid sequences of algorithms or degrade their performance, since synergies and constraints between phases could be missed. To mitigate these shortcomings, workflow composition approaches have started to jointly optimize two or more phases.
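To illustrate the kind of single-phase automation discussed above, the sketch below jointly samples an algorithm and its hyper-parameters (the combined algorithm selection and hyper-parameter optimization problem). It is a minimal, self-contained toy: the search space, the algorithm names, and the scoring function are hypothetical stand-ins (a real system would score each configuration by cross-validation), and plain random search is used here instead of the more sophisticated optimizers surveyed in [Hut19].

```python
import random

random.seed(0)

# Hypothetical search space: each "algorithm" carries its own
# hyper-parameter ranges. Names are illustrative, not a real library.
SEARCH_SPACE = {
    "knn":  {"n_neighbors": range(1, 20)},
    "tree": {"max_depth": range(1, 15)},
}

def toy_score(algorithm, params):
    """Stand-in for the cross-validated accuracy of a configuration."""
    if algorithm == "knn":
        return 1.0 - abs(params["n_neighbors"] - 7) / 20.0
    return 1.0 - abs(params["max_depth"] - 5) / 15.0

def random_search(n_trials=50):
    """Jointly sample an algorithm and its hyper-parameters, keep the best."""
    best_algo, best_params, best_score = None, None, float("-inf")
    for _ in range(n_trials):
        algo = random.choice(list(SEARCH_SPACE))
        params = {name: random.choice(list(values))
                  for name, values in SEARCH_SPACE[algo].items()}
        score = toy_score(algo, params)
        if score > best_score:
            best_algo, best_params, best_score = algo, params, score
    return best_algo, best_params, best_score

algo, params, score = random_search()
print(algo, params, round(score, 2))
```

Note that the optimizer explores algorithm choice and hyper-parameter values in a single search space, which is precisely what makes the problem harder than tuning one fixed algorithm.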
A pioneering work was [Tho13], which leveraged Bayesian optimization to optimize a two-step workflow; this technique has since been adopted by other authors [Feu15, Zha16]. However, it requires setting the workflow structure beforehand, which limits its applicability to scenarios where a suitable structure is known in advance. Evolutionary algorithms, especially genetic programming (GP), allow composing more complex workflow structures [Ols16, Sa17, Lar19]. However, they still need further study, since evaluating hundreds or thousands of candidate workflows might be prohibitive for large datasets. Furthermore, it would be worthwhile to incorporate problem-specific knowledge into the optimization process.
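The evolutionary approach mentioned above can be sketched as follows. A workflow is encoded as an ordered list of preprocessing steps ending in a model, a validity check plays the role of the grammar or constraints that keep generated workflows well-formed, and mutation plus truncation selection evolves the population. All step names and the fitness function are hypothetical toy stand-ins; a real system such as those in [Ols16, Sa17, Lar19] would evaluate each workflow on the actual dataset.

```python
import random

random.seed(1)

# Toy building blocks (hypothetical names, not a real library).
PREPROCESSORS = ["impute", "scale", "select_features"]
MODELS = ["knn", "tree", "svm"]

def random_workflow():
    """A workflow = an ordered subset of preprocessors followed by one model."""
    steps = random.sample(PREPROCESSORS, k=random.randint(0, len(PREPROCESSORS)))
    return steps + [random.choice(MODELS)]

def is_valid(workflow):
    """Grammar-style constraint: exactly one model, and it must come last."""
    return workflow[-1] in MODELS and all(s in PREPROCESSORS for s in workflow[:-1])

def toy_fitness(workflow):
    """Stand-in for validation accuracy; rewards a 'scale'+'knn' synergy."""
    score = 0.5 + 0.05 * len(workflow)
    if "scale" in workflow and workflow[-1] == "knn":
        score += 0.3
    return score

def mutate(workflow):
    """Toggle one preprocessing step and/or re-draw the final model."""
    child = workflow[:-1]
    if random.random() < 0.5:
        step = random.choice(PREPROCESSORS)
        child = [s for s in child if s != step] if step in child else child + [step]
    return child + [random.choice(MODELS)]

def evolve(pop_size=20, generations=10):
    population = [random_workflow() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + [c for c in children if is_valid(c)]
    return max(population, key=toy_fitness)

best = evolve()
print(best)
```

Even in this toy setting, the evaluation budget (population size times generations) dominates the cost, which is the scalability concern raised above for large datasets.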