Manipulating Time Series Data in Python
Last Updated: 18 Mar, 2022
A collection of observations of a single subject (entity) recorded at successive points in time is known as time-series data. Metrics are typically recorded at equally spaced intervals, while events arrive at irregular intervals. With pandas, we can attach a date and time to each record, retrieve records from a DataFrame, and select the data that falls within a specific date and time range.
Generate a date range:
After importing pandas, the pd.date_range() method creates a range of dates between a start date and an end date. Here freq='M' requests a month-end frequency, so every entry falls on the last day of a month.
Python3
import pandas as pd
# ISO dates avoid day/month ambiguity; 'M' is month-end frequency
# (pandas 2.2+ prefers the alias 'ME')
Date_range = pd.date_range(start='2020-01-12', end='2021-05-20', freq='M')
print(Date_range)
print(type(Date_range))
print(type(Date_range[0]))
Output:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
'2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
'2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'],
dtype='datetime64[ns]', freq='M')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
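As a further illustration, a date range can also be defined by a start date and a number of periods instead of an end date, and other frequency aliases can be used. The following is a minimal sketch; the dates and variable names are chosen only for demonstration.

```python
import pandas as pd

# Seven consecutive calendar days, defined by a start date and a count.
daily = pd.date_range(start="2020-01-12", periods=7, freq="D")

# Business days (Monday to Friday) between two dates, both ends included.
business = pd.date_range(start="2020-01-13", end="2020-01-24", freq="B")

print(daily)
print(len(business))
```

Using periods is convenient when the number of observations is known but the end date is not.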
Operations on timestamp data:
The date range is converted into a dataframe with the pd.DataFrame() method, and the column is converted to DateTime with the to_datetime() method. The info() method summarizes the dataframe, including the number of non-null values and the datatype of each column.
Python3
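import pandas as pd
Date_range = pd.date_range(start='2020-01-12', end='2021-05-20', freq='M')
Data = pd.DataFrame(Date_range, columns=['Date'])
# The column is already datetime64; to_datetime() leaves it unchanged
Data['Date'] = pd.to_datetime(Data['Date'])
# info() prints its report itself, so no print() wrapper is needed
Data.info()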
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 16 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 256.0 bytes
Convert data from a string to a timestamp:
If we have a list of strings that represent dates, we can first convert it to a dataframe using the pd.DataFrame() method and then convert the column to DateTime using the pd.to_datetime() method.
Python3
import pandas as pd
string_data = [ '2020-01-31' , '2020-02-29' , '2020-03-31' , '2020-04-30' ,
'2020-05-31' , '2020-06-30' , '2020-07-31' , '2020-08-31' ,
'2020-09-30' , '2020-10-31' , '2020-11-30' , '2020-12-31' ,
'2021-01-31' , '2021-02-28' , '2021-03-31' , '2021-04-30' ]
Data = pd.DataFrame(string_data, columns = [ 'Date' ])
Data[ 'Date' ] = pd.to_datetime(Data[ 'Date' ])
print (Data.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 16 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 256.0 bytes
None
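When the strings follow a custom format, we can parse them by stating that format explicitly. The datetime.strptime() function can be used in this scenario; the format string '%B-%d-%Y' below matches a month name, a day, and a four-digit year separated by hyphens.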
Python3
import pandas as pd
from datetime import datetime
string_data = [ 'May-20-2021' , 'May-21-2021' , 'May-22-2021' ]
timestamp_data = [datetime.strptime(x, '%B-%d-%Y' ) for x in string_data]
print (timestamp_data)
Data = pd.DataFrame(timestamp_data, columns = [ 'Date' ])
Data.info()
Output:
[datetime.datetime(2021, 5, 20, 0, 0), datetime.datetime(2021, 5, 21, 0, 0), datetime.datetime(2021, 5, 22, 0, 0)]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 152.0 bytes
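As an alternative to looping with strptime(), pd.to_datetime() accepts a format argument and parses the whole list in one call. The sketch below reuses the strings from the example above; the second list, with its deliberately invalid entry, is a made-up illustration of errors='coerce'.

```python
import pandas as pd

string_data = ["May-20-2021", "May-21-2021", "May-22-2021"]

# %b matches abbreviated month names ("May" is both the short and full form).
parsed = pd.to_datetime(string_data, format="%b-%d-%Y")

# errors="coerce" turns unparseable entries into NaT instead of raising.
mixed = pd.to_datetime(["May-20-2021", "not a date"],
                       format="%b-%d-%Y", errors="coerce")

print(parsed)
print(mixed)
```

Coercing to NaT keeps the conversion from failing on dirty input, at the cost of introducing missing values that need to be handled afterwards.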
Slicing and indexing time series data:
In this example, a CSV file is imported and a column of date strings is converted to DateTime using the pd.to_datetime() method. That column is then set as the index, which lets us slice and index the data by date. data.loc['2020-01-22'][:10] selects the rows for the day '2020-01-22', and the result is sliced further to return the first 10 observations for that day.
The examples read the data from a local file named covid_data.csv.
Python3
import pandas as pd
data = pd.read_csv( 'covid_data.csv' )
data[ 'ObservationDate' ] = pd.to_datetime(data[ 'ObservationDate' ])
data = data.set_index( 'ObservationDate' )
print (data.head())
print (data.loc[ '2020-01-22' ][: 10 ])
Output:
Unnamed: 0 Province/State ... Deaths Recovered
ObservationDate ...
2020-01-22 0 Anhui ... 0.0 0.0
2020-01-22 1 Beijing ... 0.0 0.0
2020-01-22 2 Chongqing ... 0.0 0.0
2020-01-22 3 Fujian ... 0.0 0.0
2020-01-22 4 Gansu ... 0.0 0.0
[5 rows x 7 columns]
Unnamed: 0 Province/State ... Deaths Recovered
ObservationDate ...
2020-01-22 0 Anhui ... 0.0 0.0
2020-01-22 1 Beijing ... 0.0 0.0
2020-01-22 2 Chongqing ... 0.0 0.0
2020-01-22 3 Fujian ... 0.0 0.0
2020-01-22 4 Gansu ... 0.0 0.0
2020-01-22 5 Guangdong ... 0.0 0.0
2020-01-22 6 Guangxi ... 0.0 0.0
2020-01-22 7 Guizhou ... 0.0 0.0
2020-01-22 8 Hainan ... 0.0 0.0
2020-01-22 9 Hebei ... 0.0 0.0
[10 rows x 7 columns]
In this example, we slice the data from '2020-01-22' to '2020-02-22'. Both endpoints are included in the result.
Python3
import pandas as pd
from datetime import datetime
data = pd.read_csv( 'covid_data.csv' )
data[ 'ObservationDate' ] = pd.to_datetime(data[ 'ObservationDate' ])
data = data.set_index( 'ObservationDate' )
print (data.loc[ '2020-01-22' : '2020-02-22' ])
Output:
Unnamed: 0 Province/State ... Deaths Recovered
ObservationDate ...
2020-01-22 0 Anhui ... 0.0 0.0
2020-01-22 1 Beijing ... 0.0 0.0
2020-01-22 2 Chongqing ... 0.0 0.0
2020-01-22 3 Fujian ... 0.0 0.0
2020-01-22 4 Gansu ... 0.0 0.0
... ... ... ... ... ...
2020-02-22 2169 San Antonio, TX ... 0.0 0.0
2020-02-22 2170 Seattle, WA ... 0.0 1.0
2020-02-22 2171 Tempe, AZ ... 0.0 0.0
2020-02-22 2172 Unknown ... 0.0 0.0
2020-02-22 2173 NaN ... 0.0 0.0
[2174 rows x 7 columns]
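The same label-based slicing can be tried without the CSV file. The sketch below uses a small, made-up dataframe (the demo name and the Confirmed values are hypothetical stand-ins) to show partial-string selection of a whole month and an inclusive date slice.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one row per day with a made-up counter.
idx = pd.date_range("2020-01-22", periods=40, freq="D")
demo = pd.DataFrame({"Confirmed": range(40)}, index=idx)

# A partial date string selects every row in that month.
january = demo.loc["2020-01"]

# A slice between two date strings includes both endpoints.
window = demo.loc["2020-01-30":"2020-02-02"]

print(len(january))
print(window)
```

Slicing by date strings like this relies on the index being a sorted DatetimeIndex, which is why the examples set the date column as the index first.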
Resampling time series data for various aggregates/summary statistics for different time periods:
To resample time-series data, use the pandas resample() method, a convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex); otherwise the caller must pass the label of a datetime-like column or index level to the on or level keyword argument. The example below computes the yearly mean of each numeric column.
Python3
import pandas as pd
from datetime import datetime
data = pd.read_csv( 'covid_data.csv' )
data[ 'ObservationDate' ] = pd.to_datetime(data[ 'ObservationDate' ])
data = data.set_index( 'ObservationDate' )
# Yearly mean; numeric_only=True skips text columns such as 'Province/State'
data = data.resample('Y').mean(numeric_only=True)
print (data)
Output:
Unnamed: 0 Confirmed Deaths Recovered
ObservationDate
2020-12-31 96232.5 39696.116550 1160.959453 24659.893368
2021-12-31 249447.0 163315.277678 3514.893386 93925.632661
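To see what resampling does on a scale small enough to check by hand, the following sketch aggregates 14 made-up daily values into 7-day bins. The data and names are assumptions for illustration only and are not taken from the CSV file.

```python
import pandas as pd

# Fourteen made-up daily values: 1, 2, ..., 14.
idx = pd.date_range("2021-01-01", periods=14, freq="D")
demo = pd.DataFrame({"value": range(1, 15)}, index=idx)

# Group the days into consecutive 7-day bins and compute two aggregates.
weekly = demo["value"].resample("7D").agg(["sum", "mean"])

print(weekly)
```

The first bin covers the values 1 to 7 and the second covers 8 to 14, so each output row is labeled with the first day of its bin.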
Calculate a rolling statistic like a rolling average:
The rolling() method computes statistics over a rolling (moving) window. Rolling windows are most commonly used in signal processing and time-series analysis. In other words, we take a window of k consecutive values at a time and apply a mathematical operation, such as a sum or a mean, to it. In the simplest case, all k values are equally weighted. In the example below the window size is 5, so the first four rows, which have fewer than five observations available, yield NaN.
Python3
import pandas as pd
from datetime import datetime
data = pd.read_csv( 'covid_data.csv' )
data[ 'ObservationDate' ] = pd.to_datetime(data[ 'ObservationDate' ])
data[ 'Last Update' ] = pd.to_datetime(data[ 'Last Update' ])
data = data.set_index( 'ObservationDate' )
data = data[[ 'Last Update' , 'Confirmed' ]]
# Roll over the numeric column only; each window of 5 rows is summed
data['rolling_sum'] = data['Confirmed'].rolling(5).sum()
print (data.head())
Output:
Last Update Confirmed rolling_sum
ObservationDate
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 14.0 NaN
2020-01-22 2020-01-22 17:00:00 6.0 NaN
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 0.0 22.0
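The effect of the window size can be checked on a handful of values. The sketch below is a minimal illustration using made-up numbers; it also shows the min_periods argument, which allows a result to be produced before the window is full.

```python
import pandas as pd

confirmed = pd.Series([1.0, 14.0, 6.0, 1.0, 0.0, 3.0])

# A full window of 3 values is required, so the first two results are NaN.
strict = confirmed.rolling(window=3).sum()

# min_periods=1 emits a result as soon as one value is available.
lenient = confirmed.rolling(window=3, min_periods=1).sum()

print(pd.DataFrame({"confirmed": confirmed, "strict": strict, "lenient": lenient}))
```

Replacing sum() with mean() gives a rolling average over the same windows.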
Dealing with missing data:
In the previous example, the rolling_sum column has NaN values, so we can use that data to demonstrate how to deal with missing data.
When a CSV file contains null values, they appear as NaN in the DataFrame. fillna() lets the user replace NaN values with values of their own, whereas dropna() removes the rows (or columns) that contain null values. Passing backfill (or bfill) as the method argument of fillna() fills each missing value with the next valid observation, working backward; passing ffill fills it with the previous valid observation, working forward. Recent pandas releases deprecate the method argument in favor of the equivalent bfill() and ffill() methods.
Python3
import pandas as pd
from datetime import datetime
data = pd.read_csv( 'covid_data.csv' )
data[ 'ObservationDate' ] = pd.to_datetime(data[ 'ObservationDate' ])
data[ 'Last Update' ] = pd.to_datetime(data[ 'Last Update' ])
data = data.set_index( 'ObservationDate' )
data = data[[ 'Last Update' , 'Confirmed' ]]
data['rolling_sum'] = data['Confirmed'].rolling(5).sum()
print(data.head())
# bfill() is the current spelling of fillna(method='backfill')
data['rolling_backfilled'] = data['rolling_sum'].bfill()
print (data.head( 5 ))
Output:
Last Update Confirmed rolling_sum
ObservationDate
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 14.0 NaN
2020-01-22 2020-01-22 17:00:00 6.0 NaN
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 0.0 22.0
Last Update Confirmed rolling_sum rolling_backfilled
ObservationDate
2020-01-22 2020-01-22 17:00:00 1.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 14.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 6.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 1.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 0.0 22.0 22.0
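The filling strategies described in this section can be compared side by side on a small series. This is a minimal sketch with made-up values; the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

forward = s.ffill()     # carry the previous valid value forward
backward = s.bfill()    # pull the next valid value backward
constant = s.fillna(0)  # replace every NaN with a fixed value
dropped = s.dropna()    # remove the missing entries entirely

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "fillna(0)": constant}))
print(dropped)
```

Note that a forward fill leaves a leading NaN in place because there is no earlier value to carry forward, just as a backward fill leaves a trailing NaN.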
Fundamentals of Unix/epoch time:
One may come across time values in Unix time while working with time-series data. Unix time, also known as epoch time, is the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC) on Thursday, January 1, 1970. Because it is a single number that does not depend on location, Unix time avoids the ambiguity that time zones, daylight saving time, and similar factors introduce.
In the example below, we convert epoch time to a timestamp using the pd.to_datetime() method with unit='s'. The result is timezone-naive, so to express it in a particular time zone we first mark it as UTC with tz_localize() and then convert it with tz_convert(). Here we convert it to the 'Europe/Berlin' time zone.
Python3
import pandas as pd
from datetime import datetime
epoch = 1598776989
timestamp = pd.to_datetime(epoch, unit = 's' )
print (timestamp)
print (timestamp.tz_localize( 'UTC' ).tz_convert( 'Europe/Berlin' ))
Output:
2020-08-30 08:43:09
2020-08-30 10:43:09+02:00
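The conversion also works on a whole list of epoch values, and it can be reversed. The sketch below is one possible approach: it passes utc=True so that the result is timezone-aware from the start, and the second epoch value is a made-up one added for illustration.

```python
import pandas as pd

# The epoch from the example above, plus a made-up value one hour later.
epochs = [1598776989, 1598776989 + 3600]

# utc=True returns timezone-aware timestamps, so tz_localize() is not needed.
stamps = pd.to_datetime(epochs, unit="s", utc=True)
berlin = stamps.tz_convert("Europe/Berlin")

# Timestamp.timestamp() returns the POSIX time as a float number of seconds.
round_trip = [int(ts.timestamp()) for ts in berlin]

print(berlin)
print(round_trip)
```

Converting to another time zone changes only how the instant is displayed, which is why the round trip returns the original epoch values.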