MDS Final Task
This assignment focuses on analyzing and visualizing data related to soft drink sales,
specifically Coke and Pepsi products. Please use Quarto to report your findings.
The dataset you're working with is an example of panel data. Panel data, also known as
longitudinal data or time-series cross-sectional data, is a type of data that follows the same
subjects (such as individuals, households, firms, countries, or any other entities) over a
period of time. In your dataset, the key characteristics of panel data are evident:
1. Multiple Observations per Subject: Each entity (in this case, households or
individuals represented by PANID) is observed at multiple points in time. This is evident
from variables like year, WEEK, and MINUTE, which track the same entity over different
time periods.
2. Temporal Dimension: The data includes a time component (years and weeks),
allowing for the analysis of trends, patterns, and changes over time within each subject.
3. Cross-Sectional Dimension: The dataset also has a cross-sectional aspect, meaning it
includes multiple subjects (different PANIDs) observed at each point in time.
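The three characteristics above can be checked directly on the data. The following is a minimal sketch in base R, assuming a data frame with the columns PANID, year and WEEK; the data frame name `panel` and all values are made up for illustration.

```r
# Toy panel with made-up values; only the column names PANID, year and WEEK
# come from the assignment text.
panel <- data.frame(
  PANID = c(101, 101, 101, 102, 102, 103),
  year  = c(2011, 2011, 2012, 2011, 2012, 2012),
  WEEK  = c(1, 2, 1, 1, 1, 2)
)

# Multiple observations per subject: number of rows for each PANID.
obs_per_panid <- table(panel$PANID)
print(obs_per_panid)

# Temporal dimension: number of distinct (year, WEEK) periods in the data.
n_periods <- nrow(unique(panel[, c("year", "WEEK")]))
print(n_periods)

# Cross-sectional dimension: distinct PANIDs observed in each year.
panids_per_year <- tapply(panel$PANID, panel$year,
                          function(x) length(unique(x)))
print(panids_per_year)
```

If most PANIDs appear only once, the data behaves more like a pooled cross-section than a panel, which is worth noting in your report.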
Variable description:
You are given the following information on this dataset:
1. IRI_KEY: This is the masked store number, serving as a unique identifier for each store.
2. WEEK: IRI Week number, which corresponds to specific weeks in the calendar year.
3. UNITS (represented in the dataset as units): Total unit sales of the product.
4. DOLLARS (represented in the dataset as dollars): Total dollar sales of the product.
5. F (Feature): This variable indicates the type of advertising or promotional feature
associated with a product. It can take several values:
6. D (Display): This variable represents the level of in-store display for a product. It can be:
0: No display.
1: Minor display.
2: Major display (includes codes 1 & 2).
This corresponds to the d1 and d2 variables in the dataset.
7. PR (Price Reduction flag): Indicates whether there was a significant price reduction (1 if
the temporary price reduction is 5% or greater, 0 otherwise). This information may be
reflected in the price or price_storedata variables in the dataset, although the glimpse of
the dataset does not show a dedicated price-reduction flag.
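Because the dataset may not contain a ready-made PR flag, one could derive it from prices using the 5% rule stated above. The sketch below assumes a hypothetical regular_price column as the baseline; the assignment text does not say how the non-promotional price is recorded, so that column and all values are illustrative only.

```r
# Toy rows with made-up prices. `regular_price` is a hypothetical column:
# the assignment text does not say how the baseline price is recorded.
sales <- data.frame(
  price         = c(1.80, 1.95, 2.00, 1.50),
  regular_price = c(2.00, 2.00, 2.00, 2.00)
)

# Relative discount versus the assumed baseline price.
sales$discount <- (sales$regular_price - sales$price) / sales$regular_price

# PR = 1 when the reduction is 5% or greater, 0 otherwise.
sales$pr <- as.integer(sales$discount >= 0.05)
print(sales)
```

If no baseline price is available, an alternative assumption would be to compare each price with the typical price of the same product in the same store, but that choice should be stated explicitly in the report.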
Tasks
1. Explanation of the Data
Task: Understand the dataset, which contains information on soft drink products,
including Coke and Pepsi, with a 'decision' variable indicating the buyer's choice.
3. Rename Variables
Task: Standardize variable names according to a style guide.
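One way to approach this, sketched in base R, is to convert all names to lower-case snake_case in one pass and then rename individual columns. The style guide is not specified here, so snake_case is an assumption; the helper to_snake_case and the new names panelist_id and store_id are hypothetical choices, not names required by the assignment.

```r
# Toy data frame; the column names are those mentioned in the variable
# description, the values are made up.
raw <- data.frame(PANID = 101, WEEK = 1, IRI_KEY = 5001,
                  year = 2011, price_storedata = 1.99)

# Hypothetical helper: lower-case every name and replace any run of
# characters other than letters, digits or underscores with one underscore.
to_snake_case <- function(x) {
  x <- tolower(x)
  gsub("[^a-z0-9_]+", "_", x)
}

names(raw) <- to_snake_case(names(raw))

# Rename individual columns to more descriptive names (illustrative choices).
names(raw)[names(raw) == "panid"]   <- "panelist_id"
names(raw)[names(raw) == "iri_key"] <- "store_id"
print(names(raw))
```

Applying one rule to every column keeps the naming consistent and makes the renaming step easy to describe in the Quarto report.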
1. Choose a Focus Area: Select a specific aspect of the data you find interesting. This
could be customer behavior, sales trends, geographical differences, or anything else
that catches your attention.
2. Formulate a Hypothesis or Question: Start with a clear hypothesis or a research
question that you want to explore. For example, "Do marketing campaigns significantly
impact the sales of a particular brand?" or "Is there a regional preference for Coke over
Pepsi?"
3. Data Analysis and Visualization: Analyze the data to address your hypothesis or
question. This can include creating new variables, segmenting the data, and performing
statistical tests. Use visualization tools, such as ggplot2, to help illustrate your findings.
4. Innovative Approach: Try to think outside the box. You can combine different variables,
use advanced statistical techniques, or even merge this data with external data sources
to enrich your analysis.
5. Document Your Findings: Prepare a report or presentation that outlines your
methodology, findings, and conclusions. Include visualizations and any interesting
patterns or anomalies you discovered.
6. Reflect on the Implications: Discuss the implications of your findings. How do they
add value to understanding consumer behavior or market trends? What
recommendations would you make to a company based on your analysis?
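As an illustration of step 3, the sketch below creates a new variable, segments the data, runs a statistical test, and builds a plot for the question of whether displays are associated with higher unit sales. It uses made-up toy values; the 0/1 column display is a hypothetical simplification of the display variables, and the Welch t-test is only one possible choice of test.

```r
# Toy rows with made-up values. `units` and `dollars` are named in the
# variable description; `display` is a hypothetical 0/1 indicator.
sales <- data.frame(
  units   = c(10, 12, 9, 11, 20, 24, 18, 22),
  dollars = c(20, 24, 18, 22, 36, 43.2, 32.4, 39.6),
  display = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# New variable: average price paid per unit.
sales$price_per_unit <- sales$dollars / sales$units

# Segmentation: mean unit sales with and without a display.
mean_units <- tapply(sales$units, sales$display, mean)
print(mean_units)

# Statistical test: Welch two-sample t-test of units by display status.
test_result <- t.test(units ~ display, data = sales)
print(test_result$p.value)

# Visualization with ggplot2, built only if the package is installed.
# Call print(p) to render the plot.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(sales, ggplot2::aes(x = factor(display), y = units)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x = "Display (0 = no, 1 = yes)", y = "Units sold")
}
```

With real panel data, observations from the same store or household are not independent, so a simple t-test like this is only a first look; say so in the report if you use one.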