The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

By: Sumon Biswas, Mohammad Wardat, and Hridesh Rajan

Abstract

An increasingly large number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition to cleaning/curation to modeling and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines differ between their theoretical representations and practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.

ACM Reference

Biswas, S. et al. 2022. The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large. ICSE’22: The 44th International Conference on Software Engineering (May 2022).

BibTeX Reference

@inproceedings{biswas22art,
  author = {Sumon Biswas and Mohammad Wardat and Hridesh Rajan},
  title = {The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large},
  booktitle = {ICSE'22: The 44th International Conference on Software Engineering},
  location = {Pittsburgh, PA, USA},
  month = {May 21-May 29},
  year = {2022},
  entrysubtype = {conference},
  abstract = {
    An increasingly large number of software systems today include data
    science components for descriptive, predictive, and prescriptive analytics.
    The collection of data science stages, from acquisition to cleaning/curation
    to modeling and so on, is referred to as a data science pipeline. To
    facilitate research and practice on data science pipelines, it is essential
    to understand their nature. What are the typical stages of a data science
    pipeline? How are they connected? Do pipelines differ between their
    theoretical representations and practice? Today we do not fully understand
    these architectural characteristics of data science pipelines. In this work,
    we present a three-pronged comprehensive study to answer these questions for
    the state of the art, data science in-the-small, and data science
    in-the-large. Our study analyzes three datasets: a collection of 71 proposals
    for data science pipelines and related concepts in theory, a collection of
    over 105 implementations of curated data science pipelines from Kaggle
    competitions to understand data science in-the-small, and a collection of 21
    mature data science projects from GitHub to understand data science
    in-the-large. Our study has led to three representations of data science
    pipelines that capture the essence of our subjects in theory, in-the-small,
    and in-the-large.
  }
}