Advanced Feature Extraction and Selection from Time Series Data Using tsfresh in Python
Last Updated: 02 Jul, 2024
Time series data is ubiquitous in fields such as finance, healthcare, and engineering. Extracting meaningful features from time series data is crucial for building predictive models. The tsfresh Python package simplifies this process by automatically calculating a wide range of features. This article provides a comprehensive guide on how to use tsfresh to extract features from time series data.
Introduction to tsfresh
tsfresh (Time Series Feature extraction based on scalable hypothesis tests) is a Python package designed to automate the extraction of a large number of features from time series data. It is particularly useful for tasks such as classification, regression, and clustering of time series data. The package integrates with pandas and scikit-learn, making it easy to incorporate into existing workflows.
Key Features of tsfresh
- Automated Feature Extraction: Extracts hundreds of features from time series data automatically.
- Feature Selection: Identifies relevant features using statistical tests.
- Scalability: Supports parallel processing and integration with dask for handling large datasets.
- Compatibility: Works well with pandas DataFrames and scikit-learn pipelines.
- Simple Features: These calculate a single number from the time series, such as the absolute energy, highest absolute value, and sum of absolute changes.
- Combiner Features: These calculate multiple features for a list of parameters at once, returning a list of (key, value) pairs for each input parameter.
To begin using tsfresh, install it first. The package is available through both pip and conda; with pip, run:
pip install tsfresh
For large datasets, you might want to install the dask extension:
pip install tsfresh[dask]
Step 1: Preparing the Data
tsfresh requires the data to be in a specific format. Each time series should have a unique identifier (id), a time column (time), and a value column (value). Load your time series data into a pandas DataFrame. Make sure the data can be ordered along the time dimension, which can be represented by a column named "time" or any other suitable name; this column is later passed to tsfresh as column_sort. Here's an example:
Python
import pandas as pd
# Example time series data
data = {
'id': [1, 1, 1, 2, 2, 2],
'time': [1, 2, 3, 1, 2, 3],
'value': [10, 20, 30, 15, 25, 35]
}
df = pd.DataFrame(data)
print(df)
Output:
   id  time  value
0   1     1     10
1   1     2     20
2   1     3     30
3   2     1     15
4   2     2     25
5   2     3     35
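Real data often arrives with one column per measured variable rather than a single value column. The sketch below shows one way to reshape such a table into the id / time / value layout with pandas. The sensor names temperature and pressure and the column name kind are made up for the illustration.

```python
import pandas as pd

# Hypothetical wide layout: one row per (id, time), one column per sensor
wide = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "temperature": [10, 20, 30, 15, 25, 35],
    "pressure": [1.0, 1.1, 1.2, 0.9, 1.0, 1.1],
})

# Stack the sensor columns into (kind, value) pairs, one row per observation
long_df = wide.melt(id_vars=["id", "time"], var_name="kind", value_name="value")
long_df = long_df.sort_values(["id", "kind", "time"]).reset_index(drop=True)
print(long_df)
```

A frame in this shape can be passed to extract_features with column_kind="kind" and column_value="value". The wide frame can also be passed as it is, in which case every column other than the id and sort columns is treated as a separate series.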
Step 2: Extracting Features
To extract features, use the extract_features function:
Python
from tsfresh import extract_features
features = extract_features(df, column_id='id', column_sort='time')
print(features)
Output:
value__variance_larger_than_standard_deviation value__has_duplicate_max \
1 1.0 0.0
2 1.0 0.0
value__has_duplicate_min value__has_duplicate value__sum_values \
1 0.0 0.0 60.0
2 0.0 0.0 75.0
value__abs_energy value__mean_abs_change value__mean_change \
1 1400.0 10.0 10.0
2 2075.0 10.0 10.0
value__mean_second_derivative_central value__median ... \
1 0.0 20.0 ...
2 0.0 25.0 ...
value__fourier_entropy__bins_5 value__fourier_entropy__bins_10 \
1 0.693147 0.693147
2 0.693147 0.693147
value__fourier_entropy__bins_100 \
1 0.693147
2 0.693147
value__permutation_entropy__dimension_3__tau_1 \
1 -0.0
2 -0.0
value__permutation_entropy__dimension_4__tau_1 \
1 NaN
2 NaN
value__permutation_entropy__dimension_5__tau_1 \
1 NaN
2 NaN
value__permutation_entropy__dimension_6__tau_1 \
1 NaN
2 NaN
value__permutation_entropy__dimension_7__tau_1 \
1 NaN
2 NaN
value__query_similarity_count__query_None__threshold_0.0 \
1 NaN
2 NaN
value__mean_n_absolute_max__number_of_maxima_7
1 NaN
2 NaN
[2 rows x 783 columns]
The output shows the first and last few of the 783 extracted feature columns, with one row per time series id. Let's break down some of these columns:
- value__variance_larger_than_standard_deviation: Checks whether the variance of the time series is larger than its standard deviation. 1.0 indicates that it is; 0.0 indicates otherwise.
- value__has_duplicate_max: Indicates whether the maximum value in the time series occurs more than once. 0.0 means no duplicates; 1.0 would mean there are duplicates.
- value__has_duplicate_min: Similar to has_duplicate_max, but for the minimum value.
- value__has_duplicate: Indicates whether any value in the time series is duplicated.
- value__sum_values: The sum of all values in the time series.
- value__abs_energy: The absolute energy of the time series, which is the sum of the squared values.
- value__mean_abs_change: The mean of the absolute differences between consecutive values in the time series.
- value__mean_change: The mean of the differences between consecutive values in the time series.
- value__mean_second_derivative_central: The mean of a central approximation of the second derivative of the time series.
- value__median: The median value of the time series.
- value__fourier_entropy__bins_5: The entropy of the Fourier transform coefficients when divided into 5 bins.
- value__fourier_entropy__bins_10: Similar to fourier_entropy__bins_5, but with 10 bins.
- value__fourier_entropy__bins_100: Similar to fourier_entropy__bins_5, but with 100 bins.
- value__permutation_entropy__dimension_3__tau_1: The permutation entropy of the time series with embedding dimension 3 and time delay 1.
- value__query_similarity_count__query_None__threshold_0.0: The count of subsequences in the time series that are similar to a given query within a certain threshold. The default settings supply no query (query_None), so this column is NaN.
- value__mean_n_absolute_max__number_of_maxima_7: The mean of the 7 largest absolute values in the time series. The example series have only 3 values, so the result is NaN.
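Several of the definitions in this list are simple enough to check by hand. The following sketch recomputes five of them with plain pandas for the two example series; the helper name hand_computed is made up, and the code only mirrors the definitions given above rather than tsfresh's own implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "value": [10, 20, 30, 15, 25, 35],
})

def hand_computed(series):
    """Recompute a few features directly from their definitions."""
    diffs = series.diff().dropna()  # differences between consecutive values
    return pd.Series({
        "sum_values": series.sum(),
        "abs_energy": (series ** 2).sum(),
        "mean_abs_change": diffs.abs().mean(),
        "mean_change": diffs.mean(),
        "median": series.median(),
    })

ordered = df.sort_values(["id", "time"])
rows = {sid: hand_computed(group["value"]) for sid, group in ordered.groupby("id")}
check = pd.DataFrame(rows).T  # one row per id, one column per feature
print(check)
```

The values agree with the extracted columns shown earlier, for example an abs_energy of 1400 for id 1 and 2075 for id 2.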
Some features have NaN values. This can happen if the feature calculation is not applicable or meaningful for the given time series segment. For example, if a time series is too short to calculate a meaningful permutation entropy with higher dimensions, the result will be NaN.
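To make the permutation entropy case concrete, here is a simplified textbook-style sketch written with NumPy. It is not tsfresh's implementation, and returning NaN when no window fits is an assumption chosen to mirror the output above. A series of length 3 yields exactly one window of dimension 3, so the entropy is zero, and no window at all of dimension 4.

```python
import math
import numpy as np

def permutation_entropy(x, dimension, tau=1):
    """Simplified permutation entropy; NaN when the series is too short."""
    x = np.asarray(x, dtype=float)
    n_windows = len(x) - (dimension - 1) * tau
    if n_windows <= 0:
        return float("nan")  # assumption: no window fits, so nothing to measure
    span = (dimension - 1) * tau + 1
    windows = np.array([x[i:i + span:tau] for i in range(n_windows)])
    patterns = np.argsort(windows, axis=1)  # ordinal pattern of each window
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log(probs)))

short_series = [10, 20, 30]
print(permutation_entropy(short_series, dimension=3))  # single pattern, entropy of zero
print(permutation_entropy(short_series, dimension=4))  # too short, NaN
```

With a longer series that mixes rising and falling segments, several ordinal patterns appear and the entropy becomes positive.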
Step 3: Feature Selection
Not all extracted features may be relevant for your task. tsfresh provides methods to select relevant features based on their significance:
Python
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
y = pd.Series([0, 1], index=[1, 2])
# Impute missing values
impute(features)
# Select relevant features
selected_features = select_features(features, y)
print(selected_features)
Output:
value__autocorrelation__lag_4' 'value__autocorrelation__lag_5'
'value__autocorrelation__lag_6' 'value__autocorrelation__lag_7'
'value__autocorrelation__lag_8' 'value__autocorrelation__lag_9'
'value__partial_autocorrelation__lag_0'
'value__partial_autocorrelation__lag_1'
'value__partial_autocorrelation__lag_2'
'value__partial_autocorrelation__lag_3'
'value__partial_autocorrelation__lag_4'
'value__partial_autocorrelation__lag_5'
'value__partial_autocorrelation__lag_6'
'value__partial_autocorrelation__lag_7'
'value__partial_autocorrelation__lag_8'
'value__partial_autocorrelation__lag_9'
'value__agg_linear_trend__attr_"stderr"__chunk_len_5__f_agg_"mean"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_5__f_agg_"var"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"max"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"min"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"mean"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"var"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"max"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"min"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
'value__augmented_dickey_fuller__attr_"teststat"__autolag_"AIC"'
'value__augmented_dickey_fuller__attr_"pvalue"__autolag_"AIC"'
'value__augmented_dickey_fuller__attr_"usedlag"__autolag_"AIC"'
'value__permutation_entropy__dimension_4__tau_1'
'value__permutation_entropy__dimension_5__tau_1'
'value__permutation_entropy__dimension_6__tau_1'
'value__permutation_entropy__dimension_7__tau_1'
'value__query_similarity_count__query_None__threshold_0.0'
'value__mean_n_absolute_max__number_of_maxima_7'] did not have any finite values. Filling with zeros.
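The idea behind select_features can be sketched in two steps: run one univariate hypothesis test per feature against the target, then apply a multiple-testing correction to the resulting p-values. The example below is a simplified illustration using a Mann-Whitney U test from SciPy and a hand-written Benjamini-Yekutieli procedure. The data, the column names signal and noise, and the helper benjamini_yekutieli are all made up, and tsfresh's actual implementation differs in detail.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Made-up feature matrix: 20 samples with a binary target
y = pd.Series([0] * 10 + [1] * 10)
X = pd.DataFrame({
    "signal": list(range(10)) + list(range(100, 110)),        # separates the classes
    "noise": list(range(10)) + [v + 0.5 for v in range(10)],  # similar in both classes
})

# Step 1: one univariate test per feature
p_values = pd.Series({
    col: mannwhitneyu(X.loc[y == 0, col], X.loc[y == 1, col],
                      alternative="two-sided").pvalue
    for col in X.columns
})

# Step 2: Benjamini-Yekutieli step-up procedure on the p-values
def benjamini_yekutieli(p, fdr_level=0.05):
    p_sorted = p.sort_values()
    m = len(p_sorted)
    c_m = np.sum(1.0 / np.arange(1, m + 1))  # harmonic correction term
    thresholds = np.arange(1, m + 1) * fdr_level / (m * c_m)
    passed = np.nonzero(p_sorted.values <= thresholds)[0]
    k = passed.max() + 1 if passed.size else 0
    return list(p_sorted.index[:k])

relevant = benjamini_yekutieli(p_values)
print(p_values)
print(relevant)
```

Only the signal column survives the correction here. With just two labelled series, as in the toy example above, no test can reach significance, which is why meaningful selection needs many labelled time series.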
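You can customize the feature extraction process by specifying which features to calculate. This is done by passing a settings object such as ComprehensiveFCParameters or EfficientFCParameters as default_fc_parameters. ComprehensiveFCParameters is what extract_features uses when no settings are given, so the example below produces the same 783 columns as before: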
Python
from tsfresh.feature_extraction import extract_features, ComprehensiveFCParameters
settings = ComprehensiveFCParameters()
features = extract_features(df, column_id='id', column_sort='time', default_fc_parameters=settings)
print(features)
Output:
value__variance_larger_than_standard_deviation value__has_duplicate_max \
1 1.0 0.0
2 1.0 0.0
value__has_duplicate_min value__has_duplicate value__sum_values \
1 0.0 0.0 60.0
2 0.0 0.0 75.0
value__abs_energy value__mean_abs_change value__mean_change \
1 1400.0 10.0 10.0
2 2075.0 10.0 10.0
value__mean_second_derivative_central value__median ... \
1 0.0 20.0 ...
2 0.0 25.0 ...
value__fourier_entropy__bins_5 value__fourier_entropy__bins_10 \
1 0.693147 0.693147
2 0.693147 0.693147
value__fourier_entropy__bins_100 \
1 0.693147
2 0.693147
value__permutation_entropy__dimension_3__tau_1 \
1 -0.0
2 -0.0
value__permutation_entropy__dimension_4__tau_1 \
1 NaN
2 NaN
value__permutation_entropy__dimension_5__tau_1 \
1 NaN
2 NaN
value__permutation_entropy__dimension_6__tau_1 \
1 NaN
2 NaN
value__permutation_entropy__dimension_7__tau_1 \
1 NaN
2 NaN
value__query_similarity_count__query_None__threshold_0.0 \
1 NaN
2 NaN
value__mean_n_absolute_max__number_of_maxima_7
1 NaN
2 NaN
[2 rows x 783 columns]
Step 1: Load Financial Data
Suppose you have financial time series data, such as stock prices, with columns for date, stock ID, and price.
Python
import pandas as pd
# Example financial time series data
data = {
'id': ['AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
'price': [150, 152, 148, 2800, 2825, 2790]
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
Step 2: Define Custom Feature Extraction Settings
You can define custom settings to extract only the features that are relevant to your analysis. The dictionary below requests summary statistics of the price series, such as its mean, spread, extremes, and total absolute change:
Python
from tsfresh.feature_extraction import extract_features
# Define custom settings
custom_settings = {
'mean': None,
'standard_deviation': None,
'variance': None,
'maximum': None,
'minimum': None,
'absolute_sum_of_changes': None,
'longest_strike_above_mean': None,
'longest_strike_below_mean': None
}
# Extract features using custom settings
features = extract_features(df, column_id='id', column_sort='date', column_value='price', default_fc_parameters=custom_settings)
print(features)
Output:
price__mean price__standard_deviation price__variance price__maximum \
AAPL 150.0 1.632993 2.666667 152.0
GOOG 2805.0 14.719601 216.666667 2825.0
price__minimum price__absolute_sum_of_changes \
AAPL 148.0 6.0
GOOG 2790.0 60.0
price__longest_strike_above_mean price__longest_strike_below_mean
AAPL 1.0 1.0
GOOG 1.0 1.0
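Quantities such as returns are not among the calculators requested above, but they can be derived with pandas before feature extraction. The sketch below computes simple day-over-day returns per ticker and their standard deviation as a basic volatility measure. The column name return is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["AAPL", "AAPL", "AAPL", "GOOG", "GOOG", "GOOG"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "price": [150, 152, 148, 2800, 2825, 2790],
})

# Day-over-day percentage change, computed separately for each ticker
df = df.sort_values(["id", "date"])
df["return"] = df.groupby("id")["price"].pct_change()

# The first day of each ticker has no previous price, so drop it
returns = df.dropna(subset=["return"])

# Standard deviation of returns as a simple volatility measure
volatility = returns.groupby("id")["return"].std()
print(returns)
print(volatility)
```

The returns frame can then be passed to extract_features with column_value='return' in place of the raw prices.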
Step 3: Feature Selection
Python
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
# Dummy target labels for demonstration purposes only;
# in a real scenario, this should be your actual target variable
y = pd.Series([1, 0], index=['AAPL', 'GOOG'])
# Impute missing values
impute(features)
# Select relevant features
selected_features = select_features(features, y)
print("Selected Features:")
print(selected_features)
Output:
Selected Features:
Empty DataFrame
Columns: []
Index: [AAPL, GOOG]
The result is empty because two labelled series are far too few for any feature to pass the significance tests. Feature selection only becomes meaningful with many labelled time series.
Let's walk through a complete example using the robot execution failures dataset provided by tsfresh.
Step 1: Load the Data
Python
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures
download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()
Step 2: Extract Features
Python
features = extract_features(timeseries, column_id='id', column_sort='time')
print(features.head())
Output:
F_x__variance_larger_than_standard_deviation F_x__has_duplicate_max \
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 1.0
5 0.0 0.0
F_x__has_duplicate_min F_x__has_duplicate F_x__sum_values \
1 1.0 1.0 -14.0
2 1.0 1.0 -13.0
3 1.0 1.0 -10.0
4 1.0 1.0 -6.0
5 0.0 1.0 -9.0
F_x__abs_energy F_x__mean_abs_change F_x__mean_change \
1 14.0 0.142857 0.000000
2 25.0 1.000000 0.000000
3 12.0 0.714286 0.000000
4 16.0 1.214286 -0.071429
5 17.0 0.928571 -0.071429
F_x__mean_second_derivative_central F_x__median ... \
1 -0.038462 -1.0 ...
2 -0.038462 -1.0 ...
3 -0.038462 -1.0 ...
4 -0.038462 0.0 ...
5 0.038462 -1.0 ...
T_z__fourier_entropy__bins_5 T_z__fourier_entropy__bins_10 \
1 NaN NaN
2 1.073543 1.494175
3 1.386294 1.732868
4 1.073543 1.494175
5 0.900256 1.320888
T_z__fourier_entropy__bins_100 \
1 NaN
2 2.079442
3 2.079442
4 2.079442
5 2.079442
T_z__permutation_entropy__dimension_3__tau_1 \
1 -0.000000
2 0.937156
3 1.265857
4 1.156988
5 1.156988
T_z__permutation_entropy__dimension_4__tau_1 \
1 -0.000000
2 1.234268
3 1.704551
4 1.907284
5 1.863680
T_z__permutation_entropy__dimension_5__tau_1 \
1 -0.000000
2 1.540306
3 2.019815
4 2.397895
5 2.271869
T_z__permutation_entropy__dimension_6__tau_1 \
1 -0.000000
2 1.748067
3 2.163956
4 2.302585
5 2.302585
T_z__permutation_entropy__dimension_7__tau_1 \
1 -0.000000
2 1.831020
3 2.197225
4 2.197225
5 2.197225
T_z__query_similarity_count__query_None__threshold_0.0 \
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
T_z__mean_n_absolute_max__number_of_maxima_7
1 0.000000
2 0.571429
3 0.571429
4 1.000000
5 0.857143
[5 rows x 4698 columns]
Step 3: Select Relevant Features
Python
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
# Impute missing values
impute(features)
# Select relevant features
selected_features = select_features(features, y)
print(selected_features.head())
Output:
F_x__value_count__value_-1 F_x__abs_energy F_x__root_mean_square \
1 14.0 14.0 0.966092
2 7.0 25.0 1.290994
3 11.0 12.0 0.894427
4 5.0 16.0 1.032796
5 9.0 17.0 1.064581
T_y__absolute_maximum F_x__mean_n_absolute_max__number_of_maxima_7 \
1 1.0 1.000000
2 5.0 1.571429
3 5.0 1.000000
4 6.0 1.285714
5 5.0 1.285714
F_x__range_count__max_1__min_-1 F_y__abs_energy F_y__root_mean_square \
1 15.0 13.0 0.930949
2 13.0 76.0 2.250926
3 14.0 40.0 1.632993
4 10.0 60.0 2.000000
5 13.0 46.0 1.751190
F_y__mean_n_absolute_max__number_of_maxima_7 T_y__variance ... \
1 1.000000 0.222222 ...
2 3.000000 4.222222 ...
3 2.142857 3.128889 ...
4 2.428571 7.128889 ...
5 2.285714 4.160000 ...
F_y__cwt_coefficients__coeff_14__w_5__widths_(2, 5, 10, 20) \
1 -0.751682
2 0.057818
3 0.912474
4 -0.609735
5 0.072771
F_y__cwt_coefficients__coeff_13__w_2__widths_(2, 5, 10, 20) \
1 -0.310265
2 -0.202951
3 0.539121
4 -2.641390
5 0.591927
T_y__lempel_ziv_complexity__bins_3 T_y__quantile__q_0.1 \
1 0.400000 -1.0
2 0.533333 -3.6
3 0.533333 -4.0
4 0.533333 -4.6
5 0.466667 -5.0
F_z__time_reversal_asymmetry_statistic__lag_1 F_x__quantile__q_0.2 \
1 -596.000000 -1.0
2 -680.384615 -1.0
3 -617.000000 -1.0
4 3426.307692 -1.0
5 -2609.000000 -1.0
F_y__quantile__q_0.7 \
1 -1.0
2 -1.0
3 0.0
4 1.0
5 0.8
T_x__change_quantiles__f_agg_"var"__isabs_False__qh_0.2__ql_0.0 \
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
T_z__large_standard_deviation__r_0.35000000000000003 T_z__quantile__q_0.9
1 0.0 0.0
2 1.0 0.0
3 1.0 0.0
4 0.0 0.0
5 0.0 0.6
[5 rows x 682 columns]
Step 4: Train a Machine Learning Model
You can now use the selected features to train a machine learning model:
Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split the data
X_train, X_test, y_train, y_test = train_test_split(selected_features, y, test_size=0.3, random_state=42)
# Train the model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Output:
Accuracy: 1.0
Conclusion
tsfresh is a powerful tool for automatic feature extraction from time series data. Its ability to extract hundreds of features, select the relevant ones, and integrate with popular Python libraries makes it a valuable package for data scientists and researchers working with time series data. By following the steps outlined in this article, you can efficiently extract features from your time series data and use them in predictive models.