{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T00:09:55Z","timestamp":1758845395730,"version":"3.44.0"},"reference-count":41,"publisher":"Wiley","issue":"9","license":[{"start":{"date-parts":[[2025,9,8]],"date-time":"2025-09-08T00:00:00Z","timestamp":1757289600000},"content-version":"vor","delay-in-days":7,"URL":"https:\/\/2.zoppoz.workers.dev:443\/http\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["J Software Evolu Process"],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>ABSTRACT<\/jats:title><jats:p>Ensuring high\u2010quality data is crucial for the successful deployment of machine learning models, thereby sustaining the operational pipelines around such models. However, a significant number of practitioners do not currently use data quality checks or measurements as gateways for their model construction and operationalization, indicating a need for greater awareness and adoption of these tools. In this study, we propose an automated approach for automating the process of architecting machine learning pipelines by means of (semi\u2010)automated data quality checks. We focus on tabular data as a representative of the most widely used structured data formats in said pipelines. Our work is based on a subset of metrics that are particularly relevant in MLOps pipelines, stemming from our engagement with expert practitioners in machine learning operations (MLOps). We selected Deepchecks, a well\u2010known tool for conducting data quality checks, from a cohort of similar tools to evaluate the quality of datasets collected from Kaggle, a widely used platform for machine learning competitions and data science projects. We also analyze the main features used by Kaggle to rank their datasets and used these features to validate the relevance of our approach. Our approach shows the potential for automated data quality checks to improve the efficiency and effectiveness of MLOps pipelines and their operation, by decreasing the risk of introducing errors and biases into machine learning models in production.<\/jats:p>","DOI":"10.1002\/smr.70044","type":"journal-article","created":{"date-parts":[[2025,9,9]],"date-time":"2025-09-09T02:06:10Z","timestamp":1757383570000},"update-policy":"https:\/\/2.zoppoz.workers.dev:443\/https\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Engineering MLOps Pipelines With Data Quality: A Case Study on Tabular Datasets in Kaggle"],"prefix":"10.1002","volume":"37","author":[{"given":"Matteo","family":"Pancini","sequence":"first","affiliation":[{"name":"Department of Electronics, Information and Bioengineering (DEIB) Politecnico di Milano  Milan Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matteo","family":"Camilli","sequence":"additional","affiliation":[{"name":"Department of Electronics, Information and Bioengineering (DEIB) Politecnico di Milano  Milan Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/2.zoppoz.workers.dev:443\/https\/orcid.org\/0000-0002-0405-9814","authenticated-orcid":false,"given":"Giovanni","family":"Quattrocchi","sequence":"additional","affiliation":[{"name":"Department of Electronics, Information and Bioengineering (DEIB) Politecnico di Milano  Milan Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Damian","family":"Andrew\u00a0Tamburri","sequence":"additional","affiliation":[{"name":"Department of Engineering University of Sannio  Benevento Italy"},{"name":"NXP Semiconductors N.V. (NXP)  Eindhoven the Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"311","published-online":{"date-parts":[[2025,9,8]]},"reference":[{"volume-title":"Machine Learning","year":"2010","author":"Mitchell T. M.","key":"e_1_2_9_2_1"},{"key":"e_1_2_9_3_1","first-page":"558","volume-title":"Software Engineering: Principles and Practice","author":"Youll D. P.","year":"1993"},{"key":"e_1_2_9_4_1","first-page":"17","volume-title":"Sustainable MLOps: Trends and Challenges","author":"Tamburri D. A.","year":"2020"},{"key":"e_1_2_9_5_1","first-page":"333","volume-title":"Architecting MLOps in the Cloud: From Theory to Practice","author":"Kumara I.","year":"2023"},{"key":"e_1_2_9_6_1","unstructured":"A.Jain H.Patel L.Nagalapatti et\u00a0al. \u201cOverview and Importance of Data Quality for Machine Learning Tasks\u201d. In: KDD\u201920. Association for Computing Machinery; New York NY USA: 35613562 (2020)"},{"volume-title":"DevOps: A Software Architect's Perspective","year":"2015","author":"Bass L.","key":"e_1_2_9_7_1"},{"issue":"88","key":"e_1_2_9_8_1","first-page":"16","article-title":"Process Mining Software Repositories: Do Developers Work as Expected?","volume":"2012","author":"Serebrenik A.","year":"2012","journal-title":"ERCIM News"},{"key":"e_1_2_9_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2019.2962027"},{"key":"e_1_2_9_10_1","unstructured":"N.Hynes D.Sculley andM.Terry \u201cThe Data Linter: Lightweight Automated Sanity Checking for ML Data Sets\u201d. (2017)."},{"volume-title":"Proceedings of the Second Conference on Machine Learning and Systems, SysML 2019, Stanford, CA, USA","year":"2019","author":"Breck E.","key":"e_1_2_9_11_1"},{"key":"e_1_2_9_12_1","unstructured":"D.Baylor E.Breck H. T.Cheng et\u00a0al. \u201cTFX: A TensorFlow\u2010Based Production\u2010Scale Machine Learning Platform\u201d. In: KDD\u201917. Association for Computing Machinery; New York NY USA: 13871395 (2017)"},{"key":"e_1_2_9_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2641190.2641198"},{"key":"e_1_2_9_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-81-322-2494-5"},{"key":"e_1_2_9_15_1","doi-asserted-by":"publisher","DOI":"10.1089\/big.2021.0112"},{"key":"e_1_2_9_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851613.2851786"},{"key":"e_1_2_9_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASWEC.2013.21"},{"key":"e_1_2_9_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2021.3063727"},{"key":"e_1_2_9_19_1","unstructured":"Deepchecks. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/deepchecks\/deepchecks;."},{"key":"e_1_2_9_20_1","unstructured":"DataCleaner. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/datacleaner\/DataCleaner;."},{"key":"e_1_2_9_21_1","unstructured":"Griffin. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/apache\/griffin;."},{"key":"e_1_2_9_22_1","unstructured":"FRosner. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/FRosner\/drunken\u2010data\u2010quality;."},{"key":"e_1_2_9_23_1","unstructured":"dev a.\u2010l. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/agile\u2010lab\u2010dev\/DataQuality;."},{"key":"e_1_2_9_24_1","unstructured":"Talend. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/Talend\/data\u2010quality;."},{"key":"e_1_2_9_25_1","unstructured":"SauceCat. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/SauceCat\/pydqc;."},{"key":"e_1_2_9_26_1","unstructured":"mobydq. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/mobydq\/mobydq;."},{"key":"e_1_2_9_27_1","unstructured":"yandexdataschool. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/yandexdataschool\/cms\u2010dqm;."},{"key":"e_1_2_9_28_1","unstructured":"awslabs. Code Available at:.https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/awslabs\/deequ;."},{"key":"e_1_2_9_29_1","unstructured":"AnmolNarang. How Do Duplicates Affect ML Model?.https:\/\/2.zoppoz.workers.dev:443\/https\/www.kaggle.com\/questions\u2010and\u2010answers\/200598; 2021."},{"key":"e_1_2_9_30_1","unstructured":"BadrW.Why Feature Correlation Matters. A Lot!https:\/\/2.zoppoz.workers.dev:443\/https\/towardsdatascience.com\/why\u2010feature\u2010correlation\u2010matters\u2010a\u2010lot\u2010847e8ba439c4;2019."},{"key":"e_1_2_9_31_1","unstructured":"T.Nawale \u201cHow to Deal With Missing Values in Machine Learning\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/medium.com\/geekculture\/how\u2010to\u2010deal\u2010with\u2010missing\u2010values\u2010in\u2010machine\u2010learning\u201098e47f025b9c; (2022)."},{"key":"e_1_2_9_32_1","unstructured":"S.Eade \u201cData Cleaning With Pandas Avoid This Mistake!\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/towardsdatascience.com\/data\u2010cleaning\u2010with\u2010pandas\u2010avoid\u2010this\u2010mistake\u20107af559657c2c;2020."},{"key":"e_1_2_9_33_1","unstructured":"S.Kumar \u201c7 Ways to Handle Missing Values in Machine Learning\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/towardsdatascience.com\/7\u2010ways\u2010to\u2010handle\u2010missing\u2010values\u2010in\u2010machine\u2010learning\u20101a6326adf79e; (2020)."},{"key":"e_1_2_9_34_1","unstructured":"M.Inuwa \u201cOutliers and Overfitting When Machine Learning Models Cant Reason\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/www.analyticsvidhya.com\/blog\/2022\/07\/outliers\u2010and\u2010overfitting\u2010when\u2010machine\u2010learning\u2010models\u2010cant\u2010reason; (2022)."},{"key":"e_1_2_9_35_1","unstructured":"T.Jegan \u201cWhy to Remove Columns With 1 Unique Value While Data Cleaning Process?\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/www.kaggle.com\/questions\u2010and\u2010answers\/269943; (2021)."},{"key":"e_1_2_9_36_1","unstructured":"Shilpijs. \u201cText Preprocessing in Python \u2010 Getting Started With NLP\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/www.analyticsvidhya.com\/blog\/2021\/08\/text\u2010preprocessing\u2010in\u2010python\u2010getting\u2010started\u2010with\u2010nlp\/; (2021)."},{"key":"e_1_2_9_37_1","first-page":"171","volume-title":"Class Imbalance Problem","author":"Ling C. X.","year":"2010"},{"key":"e_1_2_9_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3604560"},{"key":"e_1_2_9_39_1","unstructured":"F. K.Khattak V.Subasri A.Krishnan et\u00a0al. \u201cMLHOps: Machine Learning for Healthcare Operations\u201d (2023)."},{"key":"e_1_2_9_40_1","doi-asserted-by":"crossref","unstructured":"H. P.Kriegel P.Kr\u00f6ger E.Schubert A.Zimek \u201cLoOP: Local Outlier Probabilities\u201d. In: CIKM\u201909. Association for Computing Machinery;2009; New York NY USA: 16491652","DOI":"10.1145\/1645953.1646195"},{"key":"e_1_2_9_41_1","unstructured":"M.O\u2019Neill \u201cHotness Calculation Formula\u201d.https:\/\/2.zoppoz.workers.dev:443\/https\/www.kaggle.com\/general\/39290;."},{"volume-title":"Software Metrics: A Rigorous and Practical Approach","year":"1998","author":"Fenton N. E.","key":"e_1_2_9_42_1"}],"container-title":["Journal of Software: Evolution and Process"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/2.zoppoz.workers.dev:443\/https\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/smr.70044","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T08:27:37Z","timestamp":1758788857000},"score":1,"resource":{"primary":{"URL":"https:\/\/2.zoppoz.workers.dev:443\/https\/onlinelibrary.wiley.com\/doi\/10.1002\/smr.70044"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9]]},"references-count":41,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["10.1002\/smr.70044"],"URL":"https:\/\/2.zoppoz.workers.dev:443\/https\/doi.org\/10.1002\/smr.70044","archive":["Portico"],"relation":{},"ISSN":["2047-7473","2047-7481"],"issn-type":[{"type":"print","value":"2047-7473"},{"type":"electronic","value":"2047-7481"}],"subject":[],"published":{"date-parts":[[2025,9]]},"assertion":[{"value":"2024-04-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70044"}}