What do real data-science pipelines actually look like? Our ICSE 2022 paper maps them.

May 23, 2022

A data-science system is more than its model. Data is acquired, cleaned, transformed, modeled, and served. This sequence of stages, the data-science pipeline, is where much of the real engineering happens. Yet until now the pipeline as a whole had not been studied carefully.

In this paper (ICSE 2022), Sumon Biswas, Mohammad Wardat, and Hridesh Rajan present a comprehensive study of data-science pipelines, examining them in theory, in small hand-written examples, and in large real-world projects. The result is a clearer account of how these pipelines are built and where their difficulties lie, which gives tool builders and researchers a foundation to work from.

This work is part of Modular and Dependable AI; see our related work on mining repositories to assist AutoML. The full paper is available here.