Time Series Analysis and Visualization in R

Q: How can geom_rect() be used to enhance the visualization of presidential terms in relation to housing market data?

The geom_rect() function can enhance visualization by shading the periods corresponding to different presidential terms over the housing market data. This visual layering can reveal patterns or anomalies in housing data that coincide with specific presidencies, aiding in the analysis of political impact on the housing market. This approach requires precise start and end date mappings for each term.

Q: How can the combination of different visualization strategies improve the interpretation of longitudinal data?

Combining different visualization strategies, such as using lines for time series, color coding for categorical differentiation, and overlays for trend analysis, can significantly enhance the interpretation of longitudinal data by making complex relationships more comprehensible. These strategies collectively address different facets of the data, enabling multi-dimensional insights that are not readily apparent with single-method visualizations.

Q: Why might it be important to differentiate individuals based on health status when analyzing gene expression data using heatmaps?

Differentiating individuals based on health status in gene expression analysis using heatmaps is crucial because it allows for the identification of gene patterns that correlate with specific health conditions. This can lead to understanding the genetic basis of diseases, developing diagnostic markers, and tailoring treatments based on genetic predispositions.

Q: What role does aggregating data by year and city play in analyzing the Texas housing dataset?

Aggregating data by year and city in the Texas housing dataset allows for summarizing trends and patterns over time and space, making it easier to compare cities or analyze general trends. This simplification aids in clarifying the broader economic context and dynamics within different regions over time.

Q: In what scenarios is adding a linear regression line to a plot of grouped data advantageous?

Adding a linear regression line to a plot of grouped data is advantageous when we aim to understand and interpret the overall trend across the groups and evaluate the linear relationship between the variables. This can guide predictions and estimations of future values and highlight deviations from expected trends in specific groups, which may have implications for decision-makers.

Q: What challenges arise when using heatmaps for large datasets, and how might these be addressed?

Challenges with heatmaps for large datasets include difficulty in distinguishing patterns when too much information is presented, leading to visual clutter and misinterpretation. These can be addressed by selectively focusing on key variables, reducing the dimension of the data, managing color scales for clarity, and incorporating tools that enhance user interaction with the heatmap.

Q: How can time series visualization be improved in the txhousing dataset using ggplot2?

To improve time series visualization in the txhousing dataset using ggplot2, we can consider adding color or facets to clarify different dimensions of data, such as time, categories, or other variables. Additionally, enhancing the plot by aggregating data, such as sales, by group or city, and overlaying trend lines can improve interpretability.

Q: What is the significance of using the geom_path() function in visualizing the connection between variables in time series data?

The geom_path() function is significant in visualizing time series data as it allows us to see the directional connection between variables as they change over time. This can highlight trends and relationships not easily visible with static point comparisons. It helps in identifying time-based trends and cyclical patterns between variables like expenditure and unemployment.

Q: What insights can be derived from comparing housing price developments in different cities using time series data?

Comparing housing price developments in different cities using time series data can reveal disparities in market dynamics, such as growth rates, market volatility, and economic health. These insights can inform decisions about investments and policy-making by highlighting where and when significant market changes occur across different regions.

Q: Why is it useful to color time series data by date when plotting relationships like unemployment and savings?

Coloring time series data by date is useful because it visually distinguishes different time periods, which can reveal trends and patterns otherwise hidden in aggregated data. This differentiation helps in identifying seasonal variations or changes due to external temporal factors, making it easier to draw conclusions about trends, such as shifts in unemployment and savings over time.

1. The document discusses visualizing time series data and grouped/longitudinal data using R. It introduces time series datasets like economics and txhousing and explores plotting trends over time.
2. Tasks involve better visualizing housing sales data, plotting relationships between economic variables over time, and comparing housing prices and presidential terms.
3. Grouped data from the txhousing dataset is aggregated by city and year to plot trends for each group.
4. A heatmap is introduced to visualize large datasets and look for patterns across many variables or groups. Tasks involve creating heatmaps to explore economic data and compare gene expression in healthy and diseased individuals.

Uploaded by

Yassine Azzimani

Lab session 2: Time series, heatmaps, and other

Data analysis and Visualization – Ilias Thomas

Red: code to copy paste in RStudio

Italics: variable and dataset names

Blue: functions and arguments

In the previous lab we worked with bivariate and univariate plots. What these plots had in common is
that the data had no temporal order. In this lab we will work mainly with time series
data and other types of grouped and longitudinal data.

Time series
What is a time series? It is a sequence of observations collected over time, with a time interval between
observation points that may or may not be constant. Univariate or multivariate data can be collected at
each point.

Let us consider some time series datasets in the ggplot2 package:

• economics: US economic time series

• presidential: Terms of 12 presidents from Eisenhower to Trump
• txhousing: Housing sales in Texas

Try to initially understand the datasets.

When working with time series data it makes sense to plot lines instead of points as we are interested in
following a trend in the data over the temporal variable.

Let’s try to explore the txhousing data. Plot the following:

library(ggplot2)
ggplot(txhousing, aes(date, sales)) + geom_line()

What does the plot look like? Is it something you can interpret?

Task 1
1. Explore how to better visualize the sales for this dataset. Use the tools we learned in the previous lab,
but also explore the dataset to find what can be changed to get a better visualization.

In the economics data, plot unemployment vs. date. It is quite simple, right? What if we want to plot
the relationship between two variables over time?
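
Returning to Task 1 and the economics plot, the following is a minimal sketch of one possible starting point, not the intended solution. It assumes that the first plot is hard to read because all cities are drawn as a single line, and that the unemployment variable meant here is unemploy; the object names are only illustrative.

```r
library(ggplot2)

# One line per city instead of a single line zig-zagging across all cities
p_city <- ggplot(txhousing, aes(date, sales, group = city)) +
  geom_line(alpha = 0.3)

# Sales differ a lot between small and large cities, so a log scale is one option
p_city_log <- p_city + scale_y_log10()

# Simple univariate time series from the economics data
# (assumption: "unemployment" refers to the unemploy column)
p_unemploy <- ggplot(economics, aes(date, unemploy)) +
  geom_line()
```

Printing any of these objects (for example p_city_log) draws the plot. Faceting by city or coloring a few selected cities are other options worth trying.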

Task 2
1. Use the geom_path() function to find the connection between expenditure and unemployment. What
do you see?

2. How about if you plot the relationship between unemployment and savings? Do you notice a trend? If
not, then how can you solve this issue?
Hint: color by date

Do you see a trend now? What would be the conclusion?

3. Visualize how house prices in Houston developed against the offer (the houses listed for sale), and color by year. What do
you notice?
Hint: use subset()
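
A hedged sketch of how the three parts of Task 2 could be set up. The column choices are assumptions, since the handout does not name them: pce is taken as expenditure, unemploy as unemployment, psavert as savings, median as the house price, and listings as the offer.

```r
library(ggplot2)

# Task 2.1: expenditure against unemployment, points joined in time order
p_pce <- ggplot(economics, aes(unemploy, pce)) +
  geom_path()

# Task 2.2: savings rate against unemployment, colored by date
p_sav <- ggplot(economics, aes(unemploy, psavert, colour = date)) +
  geom_path()

# Task 2.3: Houston only; median price against listings, colored by year
houston <- subset(txhousing, city == "Houston")
p_houston <- ggplot(houston, aes(listings, median, colour = year)) +
  geom_path()
```

geom_path() connects observations in the order they appear in the data, which is chronological for both datasets, so the color gradient shows the direction of travel along the path.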

Task 3
1. We can now try to plot the order of the US presidents. In this task you should plot the order in which
they took office and color by party.

2. Now that we have some understanding of the housing data and the presidential data, let's try to plot the
two datasets together. If you focus on Dallas, does the president's party affiliation influence the listing of houses?

To complete the task above you would need to convert the date into actual dates with:

library(lubridate)  # provides date_decimal()
txhousing$date <- as.Date(format(date_decimal(txhousing$date), "%Y-%m-%d"))

You will have to use the geom_rect() function to create boxes spanning the start and end dates of each
presidential term. As this might be quite advanced you should add

geom_rect(
  aes(xmin = start, xmax = end, fill = party),
  ymin = -Inf, ymax = Inf, alpha = 0.2,
  data = presidential
)
to your plot. Doing that will create the shading. Just remember that the original ggplot is on the housing
data!
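
The pieces of Task 3 could be combined roughly as follows. This is a sketch under stated assumptions rather than the intended solution: the order column, the terms subset, and the object names are introduced here for illustration, and the housing aesthetics are placed in geom_line() instead of ggplot() so that the rectangle layer does not try to look up housing columns in presidential.

```r
library(ggplot2)
library(lubridate)  # provides date_decimal()

# Task 3.1: presidents in the order they took office, colored by party
pres <- presidential
pres$order <- seq_len(nrow(pres))  # illustrative helper column
p_pres <- ggplot(pres, aes(start, order, colour = party)) +
  geom_point() +
  geom_text(aes(label = name), hjust = -0.1, show.legend = FALSE)

# Task 3.2: Dallas listings with shaded presidential terms
dallas <- subset(txhousing, city == "Dallas")
dallas$date <- as.Date(format(date_decimal(dallas$date), "%Y-%m-%d"))

# Keep only the terms that overlap the period covered by the housing data
terms <- subset(presidential,
                end >= min(dallas$date) & start <= max(dallas$date))

p_dallas <- ggplot(dallas) +
  geom_rect(
    aes(xmin = start, xmax = end, fill = party),
    ymin = -Inf, ymax = Inf, alpha = 0.2,
    data = terms
  ) +
  geom_line(aes(date, listings)) +
  coord_cartesian(xlim = range(dallas$date))  # zoom to the housing period
```

Restricting the terms and zooming with coord_cartesian() keeps the x-axis on the years for which housing data exist; without it the shading would stretch back to the earliest term in presidential.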

Grouped data
What about when we have groups in the data? In the Texas housing dataset, we have many cities. In the
Midwest data we had states. It is not unusual to have multiple observations at different time points for
separate groups (patients, states, or brands, for example) to see the trend of each over time. These data are
usually called longitudinal data.

Let’s try to plot the Texas housing prices by year for the different cities. To achieve that you
will need to aggregate the prices by year and city. You can use this code:

data_aggregate <- aggregate(txhousing[4:8], by = list(year = txhousing$year, city = txhousing$city), FUN = mean)

Now that you have the new dataset you can try to plot by group (this works exactly like adding color, but
uses the group aesthetic instead of color).

Task 4
1. Having done the grouping, add a linear regression line to the plot using geom_smooth() with method = "lm". What do you
conclude? Can you think of situations where we would choose group over color?
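
A minimal sketch of the grouped plot and the regression line. It makes two assumptions that are not in the handout: the grouping columns are named year and city, and na.rm = TRUE is passed to mean() so that months with missing values do not turn a whole city-year into NA. The median column is used as the price.

```r
library(ggplot2)

# Mean of the numeric columns per year and city
data_aggregate <- aggregate(
  txhousing[4:8],
  by = list(year = txhousing$year, city = txhousing$city),
  FUN = mean, na.rm = TRUE
)

# One line per city via the group aesthetic
p_group <- ggplot(data_aggregate, aes(year, median, group = city)) +
  geom_line(alpha = 0.4)

# Overall linear trend across all cities: override the grouping for the smoother
p_trend <- p_group +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE)
```

Leaving out aes(group = 1) would instead fit one regression line per city, which is a useful comparison when thinking about the difference between group and color.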

Heatmap
Sometimes it is not easy to go through large amounts of data. When we want to look for patterns
across a large number of variables, we can plot the data as a heatmap to see how the values of the variables
change between groups or over time. To work with heatmaps we
will have to work outside ggplot, as it is not the optimal library for this type of visualization.
Please install the package pheatmap.

Returning to the economics dataset, create a heatmap of the dataset (excluding the first column) using the
pheatmap function. What do you notice? Why do you think this is and how can you solve it?
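
One possible way to set this up is sketched below. It assumes that the issue to notice is the very different scales of the variables, and it uses the scale argument of pheatmap() as one way to address that; row clustering is switched off here only to keep the rows in time order.

```r
library(ggplot2)   # only for the economics dataset
library(pheatmap)

# Drop the date column and convert to a numeric matrix
econ <- as.matrix(economics[, -1])
rownames(econ) <- as.character(economics$date)

# Raw values: the variables with the largest numbers dominate the color scale
pheatmap(econ, cluster_rows = FALSE, show_rownames = FALSE)

# Standardize each column so every variable is shown on a comparable scale
pheatmap(econ, scale = "column", cluster_rows = FALSE, show_rownames = FALSE)
```

Comparing the two heatmaps side by side shows how much of the first picture is driven by units rather than by patterns in the data.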

Task 5
In the lab room, you will find the file Genes. This file contains the expression levels of 1000 genes for 40
individuals. The first 20 are healthy and the last 20 suffer from a disease. Import this file into your
workspace and create a heatmap of the genes. Can you differentiate between the healthy and the
diseased population?
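
Since the format of the Genes file is not described here, the sketch below uses simulated data as a stand-in: a hypothetical matrix with 1000 genes in rows and 40 individuals in columns, where an arbitrary set of 50 genes is shifted in the diseased group. Only the pheatmap() call and the annotation idea are meant to carry over to the real file, whose orientation may differ.

```r
library(pheatmap)

set.seed(1)
# Simulated stand-in for the Genes file: 1000 genes (rows) x 40 individuals (columns)
genes <- matrix(rnorm(1000 * 40), nrow = 1000,
                dimnames = list(paste0("gene", 1:1000), paste0("id", 1:40)))

# Hypothetical effect: the first 50 genes are shifted upwards in the diseased group
genes[1:50, 21:40] <- genes[1:50, 21:40] + 2

# Column annotation: first 20 individuals healthy, last 20 diseased
status <- data.frame(status = rep(c("healthy", "diseased"), each = 20),
                     row.names = colnames(genes))

# Scale each gene across individuals and show the health status above the columns
pheatmap(genes, scale = "row", annotation_col = status, show_rownames = FALSE)
```

With the annotation bar in place it is easy to check whether the column clustering separates the two groups.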
