DS Methodology: Data Requirements and Collection
Data Requirements
If the problem is the 'recipe' and data is the 'ingredient,' the data scientist must identify
which ingredients (data) are required, how to source or collect them, how to understand
them, and how to prepare them for the desired outcome. Once the problem and the analytical
approach are understood, the data scientist defines the data requirements, before any data
is collected or prepared. For decision tree classification, defining the requirements means
identifying the necessary data content, formats, and sources. In the healthcare case study,
this began with criteria for selecting a cohort of congestive heart failure patients.
To be included, a patient had to be admitted within the provider’s service area, have a
primary diagnosis of congestive heart failure, and have been continuously enrolled for at
least six months before the primary admission. Patients with other significant medical
conditions were excluded from the cohort so that those conditions would not skew the results.
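The cohort criteria above can be expressed as a simple filter. The following is a minimal
sketch, not the case study's actual implementation: the Patient fields, the "CHF" diagnosis
label, the months_between helper, and the sample records are all hypothetical, and
enrollment is assumed to be continuous from enrolled_since onward.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Patient:
    # All field names are illustrative assumptions, not the case study's schema.
    patient_id: str
    in_service_area: bool
    primary_diagnosis: str
    enrolled_since: date
    primary_admission: date
    has_disqualifying_condition: bool


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (0 if end precedes start)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return max(months, 0)


def in_cohort(p: Patient, min_months: int = 6) -> bool:
    """Apply the inclusion criteria, then the exclusion criterion."""
    return (
        p.in_service_area
        and p.primary_diagnosis == "CHF"
        and months_between(p.enrolled_since, p.primary_admission) >= min_months
        and not p.has_disqualifying_condition
    )


patients = [
    Patient("P1", True, "CHF", date(2020, 1, 1), date(2020, 9, 15), False),
    # Enrolled for fewer than six months before admission.
    Patient("P2", True, "CHF", date(2020, 6, 1), date(2020, 9, 15), False),
    # Admitted outside the service area.
    Patient("P3", False, "CHF", date(2019, 1, 1), date(2020, 9, 15), False),
    # Has a disqualifying condition.
    Patient("P4", True, "CHF", date(2019, 1, 1), date(2020, 9, 15), True),
]

cohort_ids = [p.patient_id for p in patients if in_cohort(p)]
print(cohort_ids)
```

Writing the criteria as one predicate keeps the inclusion and exclusion rules in a single
place, which makes them easier to review with domain experts.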
The content needed for decision tree modeling was each patient's complete clinical history,
including admissions, diagnoses, procedures, prescriptions, and services. Because a decision
tree is trained on one record per patient, the data scientists rolled up the transactional
records to the patient level and created new variables to summarize them, which required
anticipating the data preparation stage.
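Rolling transactional records up to one record per patient can be sketched as a grouped
count. This is only an illustration under assumed names: the transaction layout, the
COUNT_FIELDS mapping, and the derived variables (n_admissions and so on) are hypothetical
and do not come from the case study.

```python
# Maps a transaction type to the derived per-patient variable it feeds.
COUNT_FIELDS = {
    "admission": "n_admissions",
    "procedure": "n_procedures",
    "prescription": "n_prescriptions",
}

transactions = [
    {"patient_id": "P1", "type": "admission", "code": "CHF"},
    {"patient_id": "P1", "type": "prescription", "code": "RX-A"},
    {"patient_id": "P1", "type": "admission", "code": "CHF"},
    {"patient_id": "P2", "type": "procedure", "code": "PROC-X"},
]


def roll_up(rows):
    """Aggregate transactional rows into one summary record per patient."""
    records = {}
    for row in rows:
        pid = row["patient_id"]
        # Create the patient's record with zeroed counters on first sight.
        rec = records.setdefault(
            pid, {"patient_id": pid, **{name: 0 for name in COUNT_FIELDS.values()}}
        )
        field = COUNT_FIELDS.get(row["type"])
        if field is not None:  # transaction types without a counter are skipped
            rec[field] += 1
    return list(records.values())


patient_records = roll_up(transactions)
for rec in patient_records:
    print(rec)
```

In practice the derived variables would be chosen with the modeling goal in mind; simple
counts are used here only to show the shape of the roll-up.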
Data Collection
After the initial data collection, data scientists assess whether the collected data meets their
needs. Sometimes data is more difficult to obtain or costs more than expected, requiring
adjustments to the data requirements.
In the data collection stage, descriptive statistics and visualization techniques help assess
the content and quality of the data and provide initial insights. Gaps in the data are
identified, and decisions are made on how to fill or substitute the missing information.
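A first pass at assessing content and quality might compute a few descriptive statistics and
count missing values per column. The sketch below assumes a hypothetical list-of-dicts
layout in which None marks a missing value; the column names and the profile_column helper
are illustrative only.

```python
import statistics


def profile_column(rows, column):
    """Return the row count, missing count, and basic statistics for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    # Numeric statistics only make sense when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    return summary


rows = [
    {"patient_id": "P1", "age": 71, "length_of_stay": 4},
    {"patient_id": "P2", "age": 64, "length_of_stay": None},
    {"patient_id": "P3", "age": None, "length_of_stay": 6},
    {"patient_id": "P4", "age": 80, "length_of_stay": 2},
]

age_profile = profile_column(rows, "age")
stay_profile = profile_column(rows, "length_of_stay")
print(age_profile)
print(stay_profile)
```

The missing counts point to the gaps that need a decision: fill the values, substitute
another field, or go back and collect more data.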
In the case study, the team collected demographic, clinical, and coverage information, as
well as claims and pharmaceutical data, from various sources. Some data, such as drug
information, was not available initially, but the team was able to build a good model
without it and could revisit the missing data later if needed.
Data scientists, DBAs, and programmers often collaborate to extract, merge, and clean data
from different sources, preparing it for the next stage (data understanding). Automating
these extraction, merging, and cleaning steps can improve efficiency.
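Merging data from several sources can be sketched as an outer join on a shared patient
identifier. The example below is a hypothetical illustration: the source names, the fields,
and the merge_sources helper are assumptions, and a real pipeline would more likely rely on
a database or a dataframe library.

```python
def merge_sources(**sources):
    """Outer-join several {patient_id: fields} mappings on patient_id.

    If two sources share a field name, the later source overwrites the earlier one.
    """
    merged = {}
    for table in sources.values():
        for pid, fields in table.items():
            merged.setdefault(pid, {"patient_id": pid}).update(fields)
    # Record which sources had no row for each patient so gaps stay visible.
    for pid, rec in merged.items():
        rec["missing_sources"] = sorted(
            name for name, table in sources.items() if pid not in table
        )
    return merged


demographics = {"P1": {"age": 71}, "P2": {"age": 64}}
claims = {"P1": {"n_claims": 5}}

merged = merge_sources(demographics=demographics, claims=claims)
for rec in merged.values():
    print(rec)
```

Keeping a missing_sources field on each merged record makes the gaps explicit instead of
silently dropping patients, which fits the decision about whether to revisit missing data.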
Summary: Data Requirements and Collection
Tasks in the Data Requirements stage include identifying the correct and necessary data
content, data formats, and data sources for the specific analytical approach.
During the Data Collection stage, data scientists revise the data requirements where needed
and make decisions about the quantity and quality of the data.
Data scientists apply descriptive statistics and visualization techniques to assess the
content and quality of the collected data, gain initial insights, identify gaps, and
determine whether to collect new data or substitute existing data.