MDS Final Task

The assignment involves analyzing and visualizing soft drink sales data for Coke and Pepsi using Quarto, focusing on panel data characteristics. Key tasks include data loading, variable renaming, creating a DateTime variable, and various visualizations related to sales trends, price changes, and buyer behavior. Additionally, students are encouraged to explore the dataset creatively to uncover new insights and document their findings.

Uploaded by

ingkarat.watt
This assignment focuses on analyzing and visualizing data related to soft drink sales,
specifically Coke and Pepsi products. Please use Quarto to report your findings.

The dataset you're working with is an example of panel data. Panel data, also known as
longitudinal data or time-series cross-sectional data, is a type of data that follows the same
subjects (such as individuals, households, firms, countries, or any other entities) over a
period of time. In your dataset, the key characteristics of panel data are evident:

1. Multiple Observations per Subject: Each entity (in this case, households or
individuals represented by PANID) is observed at multiple points in time. This is evident
from variables like year, WEEK, and MINUTE, which track the same entity over different
time periods.
2. Temporal Dimension: The data includes a time component (years and weeks),
allowing for the analysis of trends, patterns, and changes over time within each subject.
3. Cross-Sectional Dimension: The dataset also has a cross-sectional aspect, meaning it
includes multiple subjects (different PANIDs) observed at each point in time.
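
One quick way to check these three characteristics is to count rows per panelist and panelists per week. The sketch below uses a small made-up data frame that only borrows the column names PANID, year, and WEEK; the values are illustrative assumptions, not taken from the dataset.

```r
# Toy panel: three hypothetical panelists observed in several weeks.
panel <- data.frame(
  PANID = c(101, 101, 101, 102, 102, 103),
  year  = c(2005, 2005, 2006, 2005, 2006, 2005),
  WEEK  = c(1322, 1330, 1380, 1322, 1375, 1325)
)

# Multiple observations per subject: number of rows per panelist.
obs_per_panelist <- table(panel$PANID)

# Cross-sectional dimension: distinct panelists seen in each week.
panelists_per_week <- tapply(
  panel$PANID, panel$WEEK,
  function(ids) length(unique(ids))
)

print(obs_per_panelist)
print(panelists_per_week)
```

If most panelists appear more than once and most weeks contain more than one panelist, the data has both the temporal and the cross-sectional dimension described above.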

Variable description:
You are given the following information on this dataset:

1. IRI_KEY: This is the masked store number, serving as a unique identifier for each store.
2. WEEK: IRI Week number, which corresponds to specific weeks in the calendar year.
3. UNITS (represented in the dataset as units): Total unit sales of the product.
4. DOLLARS (represented in the dataset as dollars): Total dollar sales of the product.
5. F (Feature): This variable indicates the type of advertising or promotional feature
associated with a product. It can take several values:

- NONE: No feature or promotion.
- FS-C: Frequent Shopper Program C, available to members only.
- C: Small ad, usually a single line of text.
- FS-B: Frequent Shopper Program B.
- B: Medium size ad.
- FS-A: Frequent Shopper Program A.
- A: Large size ad.
- FSA+: Frequent Shopper Program A+.
- A+: Also known as “Q” or “R” – indicating a retailer coupon or rebate.
In the dataset, this could correspond to the variables fa, faplus, fb, and fc, where
each variable might represent a different type of feature.

6. D (Display): This variable represents the level of in-store display for a product. It can be:
- 0: No display.
- 1: Minor display.
- 2: Major display (includes codes 1 & 2).
This corresponds to the d1 and d2 variables in the dataset.
7. PR (Price Reduction flag): Indicates whether there was a significant price reduction (1 if
the temporary price reduction is 5% or greater, 0 otherwise). This information may be
reflected in the price or price_storedata variables in the dataset, although a specific
flag for price reduction is not directly mentioned in the glimpse of the dataset.

Here is a bit more information:

9. PANID (<dbl>): Panelist identifier, representing individual households or buyers.
10. IRI_KEY (<dbl>): Masked store number, uniquely identifying each store.
11. Market_Name (<chr>): The market or geographic area name where the store is
located, represented as characters.
12. store_type (<chr>): Category or type of the store, represented as characters.
13. year (<dbl>): Year of purchase.
14. WEEK (<dbl>): IRI week number, translated to calendar weeks.
15. MINUTE (<dbl>): Time of purchase in minutes from the beginning of the week.
16. units (<dbl>): Total unit sales. Contains missing values (NA).
17. dollars (<dbl>): Total dollar sales.
18. price (<dbl>): Product price (note: needs rescaling).
19. pid (<dbl>): Product ID.
20. brand (<dbl>): Brand identifier.
21. decision (<dbl>): Indicates the buyer's choice, with 1 indicating selection.
22. L4 (<chr>): Likely represents a category or company name, such as "COCA COLA CO"
or "PEPSICO INC", represented as characters.
23. L5 (<chr>): Likely represents a sub-category or specific product name, such as "COKE
CLASSI", "DIET COKE", etc., represented as characters.
24. price_storedata (<dbl>): Store-specific price data.
25. d1 and d2 (<dbl>): Variables that could be related to display or promotional strategies.
26. fa, faplus, fb, fc (<dbl>): Feature-related variables, likely indicating different types of
advertisements or promotions.
27. no_choice (<dbl>): Indicates instances with no purchase decision.
28. marketshare (<dbl>): Market share of the product.
29. brandchoice (<dbl>): Indicates brand choice.
30. b1, b2, b3, b4 (<dbl>): Additional brand-related variables.
31. choice (<dbl>): Another variable related to the purchase decision.
32. u (<dbl>): A calculation of the utility of the buyer for this option (ignore this).

Tasks
1. Explanation of the Data
Task: Understand the dataset, which contains information on soft drink products,
including Coke and Pepsi, with a 'decision' variable indicating the buyer's choice.

2. Load the Data
Task: Import SAS data into R.
Hint: Use packages like haven to read SAS files in R.
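
A minimal sketch of the import step, assuming the file is named dataset2_4brands.sas7bdat and sits in the working directory. The file name and extension are assumptions and should be adjusted to the file actually provided. read_sas() comes from haven and returns a tibble.

```r
library(haven)

# Hypothetical path: replace with the location of the course data file.
sas_path <- "dataset2_4brands.sas7bdat"

if (file.exists(sas_path)) {
  soft_drinks <- read_sas(sas_path)  # returns a tibble
  print(dim(soft_drinks))            # rows and columns as a first sanity check
} else {
  message("File not found: ", sas_path)
}
```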

3. Rename Variables
Task: Standardize variable names according to a style guide.
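
One possible convention is snake_case, as recommended by the tidyverse style guide. The sketch below lower-cases every name and then gives L4 and L5 more descriptive names. The toy data frame and the new names company and product are assumptions chosen for illustration.

```r
library(dplyr)

# Toy data frame that borrows a few of the original column names.
toy <- data.frame(
  PANID       = 1,
  IRI_KEY     = 10,
  Market_Name = "PITTSFIELD",
  WEEK        = 1322,
  L4          = "PEPSICO INC",
  L5          = "DIET COKE"
)

renamed <- toy %>%
  rename_with(tolower) %>%              # PANID -> panid, Market_Name -> market_name
  rename(company = l4, product = l5)    # hypothetical descriptive names

print(names(renamed))
```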

4. Create DateTime Variable
Task: Use year, WEEK, and MINUTE to create a DateTime variable using lubridate.
Details:
- Recode WEEK to reflect the week number for each year, starting with Week 1322 as
the first week of 2005. The dataset starts on January 3rd, just before 9 AM.
- MINUTE is the number of minutes since the beginning of the week.
Hint: Utilize the years(), weeks(), and minutes() functions from lubridate.
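
The details above can be sketched as follows. This is one possible approach, not the only one: it assumes that IRI week 1322 begins at 00:00 on Monday, 3 January 2005, and it counts weeks from that origin rather than adding years() separately. The toy rows are made up.

```r
library(lubridate)

# Toy rows with the dataset's column names; values are illustrative.
purchases <- data.frame(
  year   = c(2005, 2005, 2006),
  WEEK   = c(1322, 1323, 1374),
  MINUTE = c(535, 0, 1440)
)

# Assumption: week 1322 starts at midnight on Monday, 3 January 2005.
week_origin <- ymd_hms("2005-01-03 00:00:00", tz = "UTC")

# Whole weeks since the origin plus minutes into the week.
purchases$date_time <- week_origin +
  weeks(purchases$WEEK - 1322) +
  minutes(purchases$MINUTE)

# Week number within the calendar year, derived from the new variable.
purchases$week_of_year <- isoweek(purchases$date_time)

print(purchases)
```

With this origin, a first purchase at MINUTE 535 falls at 08:55 on January 3rd, which is consistent with the dataset starting just before 9 AM.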

5. Plot Total Soft Drink Sales Over Time
Task: Create a time series plot of total expenditure on all soft drinks (use price).

6. Sales of Classic Coke in Specific Stores
Task: Plot monthly sales of classic Coke in PITTSFIELD and EAU CLAIR.
Hint: Aggregate data monthly and include proper labels.
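
A sketch of the monthly aggregation and labelling, using made-up purchases. It assumes a DateTime column named date_time and lower-cased column names; in the real data, a filter on the classic Coke product and on the two markets would come first.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Made-up purchases; column names and values are illustrative assumptions.
set.seed(42)
n <- 300
toy <- data.frame(
  date_time   = ymd("2005-01-03", tz = "UTC") +
                days(sample(0:179, n, replace = TRUE)),
  market_name = sample(c("PITTSFIELD", "EAU CLAIR"), n, replace = TRUE),
  dollars     = round(runif(n, 1, 6), 2)
)

# Collapse each purchase to the first day of its month, then sum per market.
monthly <- toy %>%
  mutate(month = floor_date(date_time, unit = "month")) %>%
  group_by(market_name, month) %>%
  summarise(total_dollars = sum(dollars), .groups = "drop")

p <- ggplot(monthly, aes(x = month, y = total_dollars, colour = market_name)) +
  geom_line() +
  labs(
    title  = "Monthly sales by market (toy data)",
    x      = "Month",
    y      = "Sales (dollars)",
    colour = "Market"
  )
print(p)
```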

7. Sales of All Soft Drink Types
Task: Create a plot showing sales of all types of soft drinks over time (each type should
get its own color) and interpret it.

8. Price Trends for Diet Pepsi
Task: Visualize how the price of Diet Pepsi has changed over time.

9. Price and Sales Relationship
Task: Show the relationship between price and sales for all drinks, interpret the data,
and discuss price elasticity.
Hint: Consider scatter plots and correlation analysis.
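
One common way to put a number on price elasticity is a log-log regression, where the slope on log(price) approximates the percentage change in units sold for a one percent change in price. The sketch below uses simulated data with a built-in elasticity of -1.5, an assumption made only so the estimate can be checked. It is not a statement about the real dataset.

```r
# Simulated data: true elasticity of -1.5 is an assumption for illustration.
set.seed(123)
price <- runif(500, 0.8, 2.0)
units <- exp(3 - 1.5 * log(price) + rnorm(500, sd = 0.2))

# Slope of the log-log regression estimates the price elasticity of demand.
fit <- lm(log(units) ~ log(price))
elasticity <- coef(fit)[["log(price)"]]

# Correlation on the log scale, as suggested in the hint.
log_correlation <- cor(log(price), log(units))

print(elasticity)
print(log_correlation)
```

An estimate below -1 would indicate elastic demand (sales fall more than proportionally when price rises), while an estimate between -1 and 0 would indicate inelastic demand.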

10. Confidence Intervals for Prices
Task: Calculate confidence intervals for prices of different products using the infer
package.
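
A sketch of a bootstrap percentile interval for the mean price of each product with infer. The product labels, prices, and the helper ci_for() are hypothetical. Only specify(), generate(), calculate(), and get_confidence_interval() are infer functions.

```r
library(infer)
library(dplyr)

# Made-up prices for two hypothetical product labels.
set.seed(2024)
toy <- data.frame(
  product = rep(c("COKE", "PEPSI"), each = 100),
  price   = c(rnorm(100, 1.20, 0.10), rnorm(100, 1.10, 0.10))
)

# Hypothetical helper: 95% bootstrap percentile CI for the mean price.
ci_for <- function(data, reps = 1000) {
  data %>%
    specify(response = price) %>%
    generate(reps = reps, type = "bootstrap") %>%
    calculate(stat = "mean") %>%
    get_confidence_interval(level = 0.95, type = "percentile")
}

ci_coke  <- ci_for(filter(toy, product == "COKE"))
ci_pepsi <- ci_for(filter(toy, product == "PEPSI"))

print(ci_coke)
print(ci_pepsi)
```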

11. Compare Mean Prices
Task: Compare the mean prices in the two stores (PITTSFIELD and EAU CLAIR) using the
infer package.
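
A sketch of a permutation test for a difference in mean prices with infer. The data are made up, and the column names market_name and price are assumptions about the renamed dataset.

```r
library(infer)
library(dplyr)

# Made-up prices for the two markets.
set.seed(7)
toy <- data.frame(
  market_name = factor(rep(c("PITTSFIELD", "EAU CLAIR"), each = 80)),
  price       = c(rnorm(80, 1.25, 0.12), rnorm(80, 1.15, 0.12))
)

group_order <- c("PITTSFIELD", "EAU CLAIR")

# Observed difference in mean price (PITTSFIELD minus EAU CLAIR).
obs_diff <- toy %>%
  specify(price ~ market_name) %>%
  calculate(stat = "diff in means", order = group_order)

# Null distribution: shuffle market labels, recompute the difference.
null_dist <- toy %>%
  specify(price ~ market_name) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = group_order)

p_val <- get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
print(obs_diff)
print(p_val)
```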

12. Proportion of Pepsi Buyers
Task: Determine if the proportion of buyers choosing Pepsi products (L4) differs
between the two stores using infer.
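
The same permutation idea applies to proportions. The sketch below assumes that each row is an actual purchase (in the real data this would likely mean keeping only rows where decision is 1) and that L4 has been renamed to l4. The data and the indicator buys_pepsi are made up.

```r
library(infer)
library(dplyr)

# Made-up purchases with a company label per row.
set.seed(99)
companies <- c("PEPSICO INC", "COCA COLA CO")
toy <- data.frame(
  market_name = factor(rep(c("PITTSFIELD", "EAU CLAIR"), each = 150)),
  l4 = c(
    sample(companies, 150, replace = TRUE, prob = c(0.55, 0.45)),
    sample(companies, 150, replace = TRUE, prob = c(0.40, 0.60))
  )
) %>%
  mutate(buys_pepsi = factor(if_else(l4 == "PEPSICO INC", "yes", "no")))

group_order <- c("PITTSFIELD", "EAU CLAIR")

# Observed difference in the proportion of Pepsi purchases.
obs_prop_diff <- toy %>%
  specify(buys_pepsi ~ market_name, success = "yes") %>%
  calculate(stat = "diff in props", order = group_order)

# Null distribution under independence of brand choice and market.
null_props <- toy %>%
  specify(buys_pepsi ~ market_name, success = "yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = group_order)

p_val_props <- get_p_value(null_props, obs_stat = obs_prop_diff,
                           direction = "two-sided")
print(obs_prop_diff)
print(p_val_props)
```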

13. Additional Exploration
Objective: For this task, you are encouraged to dive deeper into the dataset2_4brands data
and conduct your own analysis. This is an opportunity to apply your creativity and analytical
skills to uncover new insights, patterns, or trends in the data that have not been previously
explored.
Instructions:

1. Choose a Focus Area: Select a specific aspect of the data you find interesting. This
could be customer behavior, sales trends, geographical differences, or anything else
that catches your attention.
2. Formulate a Hypothesis or Question: Start with a clear hypothesis or a research
question that you want to explore. For example, "Do marketing campaigns significantly
impact the sales of a particular brand?" or "Is there a regional preference for Coke over
Pepsi?"
3. Data Analysis and Visualization: This can include creating new variables, segmenting
the data, and performing statistical tests. Use visualization tools, such as ggplot2, to
help illustrate your findings.
4. Innovative Approach: Try to think outside the box. You can combine different variables,
use advanced statistical techniques, or even merge this data with external data sources
to enrich your analysis.
5. Document Your Findings: Prepare a report or presentation that outlines your
methodology, findings, and conclusions. Include visualizations and any interesting
patterns or anomalies you discovered.
6. Reflect on the Implications: Discuss the implications of your findings. How do they
add value to understanding consumer behavior or market trends? What
recommendations would you make to a company based on your analysis?
