Advanced Feature Extraction and Selection from Time Series Data Using tsfresh in Python

Last Updated : 02 Jul, 2024

Time series data is ubiquitous in various fields such as finance, healthcare, and engineering. Extracting meaningful features from time series data is crucial for building predictive models. The tsfresh Python package simplifies this process by automatically calculating a wide range of features. This article provides a comprehensive guide on how to use tsfresh to extract features from time series data.

Table of Content

Introduction to tsfresh
How to Use TSFresh for Feature Extraction : Installation

Step 1: Preparing the Data
Step 2: Extracting Features
Step 3: Feature Selection

Customizing Feature Extraction With TSFresh
Practical Implementation with tsfresh Feature Extraction

Example 1: Feature Extraction for Financial Time Series
Example 2 : Feature Extraction for Robot Execution Failures

Introduction to `tsfresh`

tsfresh (Time Series Feature extraction based on scalable hypothesis tests) is a Python package designed to automate the extraction of a large number of features from time series data. It is particularly useful for tasks such as classification, regression, and clustering of time series data. The package integrates seamlessly with pandas and scikit-learn, making it easy to incorporate into existing workflows.

Key Features of `tsfresh`

Automated Feature Extraction: Extracts hundreds of features from time series data automatically.
Feature Selection: Identifies relevant features using statistical tests.
Scalability: Supports parallel processing and integration with dask for handling large datasets.
Compatibility: Works well with pandas DataFrames and scikit-learn pipelines.
Simple Features: Feature calculators that return a single number per time series, such as the absolute energy, the highest absolute value, and the sum of absolute changes.
Combiner Features: Feature calculators that evaluate several parameter sets in one call and return a (key, value) pair for each parameter set.

How to Use TSFresh for Feature Extraction : Installation

To begin, install tsfresh. You can do this using pip or conda; the pip command is:

pip install tsfresh

For large datasets, you might want to install the dask extension:

pip install tsfresh[dask]

Step 1: Preparing the Data

tsfresh expects the time series in a pandas DataFrame with a specific layout. In the layout used here, each row holds one observation with a series identifier (id), a time stamp (time), and a measured value (value). The column names are arbitrary: you pass the identifier column as column_id and the time column as column_sort, and tsfresh orders each series by that column. Here's an example:

Python

import pandas as pd

# Example time series data
data = {
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 2, 3, 1, 2, 3],
    'value': [10, 20, 30, 15, 25, 35]
}
df = pd.DataFrame(data)

Output:

	id	time	value
0	1	1	10
1	1	2	20
2	1	3	30
3	2	1	15
4	2	2	25
5	2	3	35
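Data often arrives in a wide table instead, with one column per series. As a rough sketch (the wide table and its column names are hypothetical), pandas can reshape it into the id/time/value layout shown above:

```python
import pandas as pd

# Hypothetical wide table: one row per time step, one column per series.
wide = pd.DataFrame({"time": [1, 2, 3], "1": [10, 20, 30], "2": [15, 25, 35]})

# Stack the series columns into rows: one observation per row.
long_df = wide.melt(id_vars="time", var_name="id", value_name="value")
long_df["id"] = long_df["id"].astype(int)

# Order by series and time, and put the columns in the order used above.
long_df = long_df.sort_values(["id", "time"]).reset_index(drop=True)
long_df = long_df[["id", "time", "value"]]

print(long_df)
```

The resulting frame has the same content as df in the example above.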

Step 2: Extracting Features

To extract features, use the extract_features function:

Python

from tsfresh import extract_features

features = extract_features(df, column_id='id', column_sort='time')
print(features)

Output:

value__variance_larger_than_standard_deviation  value__has_duplicate_max  \
1                                             1.0                       0.0   
2                                             1.0                       0.0   

   value__has_duplicate_min  value__has_duplicate  value__sum_values  \
1                       0.0                   0.0               60.0   
2                       0.0                   0.0               75.0   

   value__abs_energy  value__mean_abs_change  value__mean_change  \
1             1400.0                    10.0                10.0   
2             2075.0                    10.0                10.0   

   value__mean_second_derivative_central  value__median  ...  \
1                                    0.0           20.0  ...   
2                                    0.0           25.0  ...   

   value__fourier_entropy__bins_5  value__fourier_entropy__bins_10  \
1                        0.693147                         0.693147   
2                        0.693147                         0.693147   

   value__fourier_entropy__bins_100  \
1                          0.693147   
2                          0.693147   

   value__permutation_entropy__dimension_3__tau_1  \
1                                            -0.0   
2                                            -0.0   

   value__permutation_entropy__dimension_4__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_5__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_6__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_7__tau_1  \
1                                             NaN   
2                                             NaN   

   value__query_similarity_count__query_None__threshold_0.0  \
1                                                NaN          
2                                                NaN          

   value__mean_n_absolute_max__number_of_maxima_7  
1                                             NaN  
2                                             NaN  

[2 rows x 783 columns]

The output shows a truncated view of the 783 feature columns calculated for each of the two series (one row per id). Let's break down some of these columns:

value__variance_larger_than_standard_deviation: This feature checks if the variance of the time series is larger than its standard deviation.
- 1.0 indicates that the variance is larger than the standard deviation.
- 0.0 indicates otherwise.
value__has_duplicate_max: Indicates whether the maximum value in the time series is duplicated.
- 0.0 means no duplicates.
- 1.0 would mean there are duplicates.
value__has_duplicate_min: Similar to has_duplicate_max, but for the minimum value.
value__has_duplicate: Indicates whether any value in the time series is duplicated.
value__sum_values: The sum of all values in the time series.
value__abs_energy: The absolute energy of the time series, which is the sum of the squared values.
value__mean_abs_change: The mean of the absolute differences between consecutive values in the time series.
value__mean_change: The mean of the differences between consecutive values in the time series.
value__mean_second_derivative_central: The mean of the second-order differences of the time series.
value__median: The median value of the time series.
value__fourier_entropy__bins_5: The binned entropy of the power spectral density of the time series, using 5 bins.
value__fourier_entropy__bins_10: Similar to fourier_entropy__bins_5, but with 10 bins.
value__fourier_entropy__bins_100: Similar to fourier_entropy__bins_5, but with 100 bins.
value__permutation_entropy__dimension_3__tau_1: The permutation entropy of the time series with embedding dimension 3 and time delay 1. The higher dimensions are NaN here because each series has only three points.
value__query_similarity_count__query_None__threshold_0.0: The count of subsequences in the time series that are similar to a given query within a certain threshold. It is NaN here because no query is supplied.
value__mean_n_absolute_max__number_of_maxima_7: The mean of the 7 largest absolute values in the time series. It is NaN here because each series has fewer than 7 points.
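To make a few of these definitions concrete, the values reported for series 1 (10, 20, 30) can be re-derived by hand. This is a plain NumPy sketch of the definitions given above, not tsfresh's own implementation:

```python
import numpy as np

x = np.array([10, 20, 30], dtype=float)  # series with id 1
diffs = np.diff(x)                       # consecutive differences: [10, 10]

checks = {
    "sum_values": x.sum(),
    "abs_energy": np.dot(x, x),                # sum of squared values
    "mean_abs_change": np.abs(diffs).mean(),
    "mean_change": diffs.mean(),
    "median": np.median(x),
    # Population variance (about 66.7) versus standard deviation (about 8.2).
    "variance_larger_than_standard_deviation": float(x.var() > x.std()),
}

for name, value in checks.items():
    print(f"value__{name}: {value}")
```

The printed numbers agree with the first row of the extract_features output above (60.0, 1400.0, 10.0, 10.0, 20.0, and 1.0).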

Some features have NaN values. This can happen if the feature calculation is not applicable or meaningful for the given time series segment. For example, if a time series is too short to calculate a meaningful permutation entropy with higher dimensions, the result will be NaN.
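Before modelling, it can help to see which columns are entirely NaN. The sketch below uses a small hand-built frame (a hypothetical four-column slice, with values copied from the output above) rather than the full feature matrix, and relies only on pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the extracted feature matrix shown above.
features_demo = pd.DataFrame(
    {
        "value__sum_values": [60.0, 75.0],
        "value__permutation_entropy__dimension_3__tau_1": [-0.0, -0.0],
        "value__permutation_entropy__dimension_4__tau_1": [np.nan, np.nan],
        "value__mean_n_absolute_max__number_of_maxima_7": [np.nan, np.nan],
    },
    index=[1, 2],
)

# Columns in which every series produced NaN.
all_nan = features_demo.columns[features_demo.isna().all()]
print(list(all_nan))

# Drop those columns instead of filling them.
cleaned = features_demo.dropna(axis=1, how="all")
print(cleaned.shape)
```

Dropping is one option; the impute helper used in the next step fills such values instead, so which approach fits depends on your model.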

Step 3: Feature Selection

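Not all extracted features are relevant for your task. select_features tests each feature individually against the target y and keeps only those that pass a significance threshold. The tests need a reasonable number of labelled series; the two-series toy data below only illustrates the call: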

Python

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

y = pd.Series([0, 1], index=[1, 2])

# Impute missing values
impute(features)

# Select relevant features
selected_features = select_features(features, y)
print(selected_features)

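Output (the tail of the warning that impute logs for columns without any finite values; the beginning of the message is cut off):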

value__autocorrelation__lag_4' 'value__autocorrelation__lag_5'
 'value__autocorrelation__lag_6' 'value__autocorrelation__lag_7'
 'value__autocorrelation__lag_8' 'value__autocorrelation__lag_9'
 'value__partial_autocorrelation__lag_0'
 'value__partial_autocorrelation__lag_1'
 'value__partial_autocorrelation__lag_2'
 'value__partial_autocorrelation__lag_3'
 'value__partial_autocorrelation__lag_4'
 'value__partial_autocorrelation__lag_5'
 'value__partial_autocorrelation__lag_6'
 'value__partial_autocorrelation__lag_7'
 'value__partial_autocorrelation__lag_8'
 'value__partial_autocorrelation__lag_9'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_5__f_agg_"mean"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_5__f_agg_"var"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"max"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"min"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"mean"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"var"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"max"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"min"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
 'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
 'value__augmented_dickey_fuller__attr_"teststat"__autolag_"AIC"'
 'value__augmented_dickey_fuller__attr_"pvalue"__autolag_"AIC"'
 'value__augmented_dickey_fuller__attr_"usedlag"__autolag_"AIC"'
 'value__permutation_entropy__dimension_4__tau_1'
 'value__permutation_entropy__dimension_5__tau_1'
 'value__permutation_entropy__dimension_6__tau_1'
 'value__permutation_entropy__dimension_7__tau_1'
 'value__query_similarity_count__query_None__threshold_0.0'
 'value__mean_n_absolute_max__number_of_maxima_7'] did not have any finite values. Filling with zeros.

Customizing Feature Extraction With TSFresh

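You can control which features are calculated through the default_fc_parameters argument. tsfresh ships predefined settings classes such as ComprehensiveFCParameters (all calculators) and EfficientFCParameters (the same set without the most computationally expensive calculators). ComprehensiveFCParameters is also what extract_features uses when no settings are given, so the call below produces the same 783 columns as Step 2: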

Python

from tsfresh.feature_extraction import extract_features, ComprehensiveFCParameters

settings = ComprehensiveFCParameters()
features = extract_features(df, column_id='id', column_sort='time', default_fc_parameters=settings)
print(features)

Output:

 value__variance_larger_than_standard_deviation  value__has_duplicate_max  \
1                                             1.0                       0.0   
2                                             1.0                       0.0   

   value__has_duplicate_min  value__has_duplicate  value__sum_values  \
1                       0.0                   0.0               60.0   
2                       0.0                   0.0               75.0   

   value__abs_energy  value__mean_abs_change  value__mean_change  \
1             1400.0                    10.0                10.0   
2             2075.0                    10.0                10.0   

   value__mean_second_derivative_central  value__median  ...  \
1                                    0.0           20.0  ...   
2                                    0.0           25.0  ...   

   value__fourier_entropy__bins_5  value__fourier_entropy__bins_10  \
1                        0.693147                         0.693147   
2                        0.693147                         0.693147   

   value__fourier_entropy__bins_100  \
1                          0.693147   
2                          0.693147   

   value__permutation_entropy__dimension_3__tau_1  \
1                                            -0.0   
2                                            -0.0   

   value__permutation_entropy__dimension_4__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_5__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_6__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_7__tau_1  \
1                                             NaN   
2                                             NaN   

   value__query_similarity_count__query_None__threshold_0.0  \
1                                                NaN          
2                                                NaN          

   value__mean_n_absolute_max__number_of_maxima_7  
1                                             NaN  
2                                             NaN  

[2 rows x 783 columns]

Practical Implementation with `tsfresh` Feature Extraction

Example 1: Feature Extraction for Financial Time Series

Step 1: Load Financial Data

Suppose you have financial time series data, such as stock prices, with columns for date, stock ID, and price.

Python

import pandas as pd

# Example financial time series data
data = {
    'id': ['AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'price': [150, 152, 148, 2800, 2825, 2790]
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

Step 2: Define Custom Feature Extraction Settings

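You can define custom settings to extract only the features you care about. For price data, simple summary statistics such as the mean, the standard deviation (a basic measure of volatility), the extremes, and the total absolute change are a reasonable starting point.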

Python

from tsfresh.feature_extraction import extract_features

# Define custom settings
custom_settings = {
    'mean': None,
    'standard_deviation': None,
    'variance': None,
    'maximum': None,
    'minimum': None,
    'absolute_sum_of_changes': None,
    'longest_strike_above_mean': None,
    'longest_strike_below_mean': None
}

# Extract features using custom settings
features = extract_features(df, column_id='id', column_sort='date', column_value='price', default_fc_parameters=custom_settings)
print(features)

Output:

 price__mean  price__standard_deviation  price__variance  price__maximum  \
AAPL        150.0                   1.632993         2.666667           152.0   
GOOG       2805.0                  14.719601       216.666667          2825.0   

      price__minimum  price__absolute_sum_of_changes  \
AAPL           148.0                             6.0   
GOOG          2790.0                            60.0   

      price__longest_strike_above_mean  price__longest_strike_below_mean  
AAPL                               1.0                               1.0  
GOOG                               1.0                               1.0
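The calculators above summarise raw prices. If you want return-based features instead, one option is to derive a returns column with pandas before extraction. This is a sketch under that assumption; the column name ret is made up for illustration.

```python
import pandas as pd

# Same example prices as above.
prices = pd.DataFrame(
    {
        "id": ["AAPL", "AAPL", "AAPL", "GOOG", "GOOG", "GOOG"],
        "date": ["2023-01-01", "2023-01-02", "2023-01-03"] * 2,
        "price": [150, 152, 148, 2800, 2825, 2790],
    }
)
prices["date"] = pd.to_datetime(prices["date"])
prices = prices.sort_values(["id", "date"])

# Simple return per ticker: p_t / p_(t-1) - 1.
# The first row of each ticker has no previous price and becomes NaN.
previous = prices.groupby("id")["price"].shift(1)
prices["ret"] = prices["price"] / previous - 1

# Drop the undefined first return of each ticker.
returns = prices.dropna(subset=["ret"]).reset_index(drop=True)
print(returns)
```

The returns frame can then be passed to extract_features with column_value='ret' in place of 'price'.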

Step 3: Feature Selection

Python

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# Dummy target for demonstration only; in a real project use your actual labels
y = pd.Series([1, 0], index=['AAPL', 'GOOG'])

# Impute missing values
impute(features)

# Select relevant features
selected_features = select_features(features, y)
print("Selected Features:")
print(selected_features)

Output:

Selected Features:
Empty DataFrame
Columns: []
Index: [AAPL, GOOG]

The result is empty, which is expected: with only two labelled series, no feature can reach statistical significance, so select_features filters out every column. Meaningful selection needs many more series per class.

Example 2 : Feature Extraction for Robot Execution Failures

Let's walk through a complete example using the robot execution failures dataset provided by tsfresh.

Step 1: Load the Data

Python

from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures

download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

Step 2: Extract Features

Python

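from tsfresh import extract_features

# All six sensor columns are treated as separate kinds of time series
features = extract_features(timeseries, column_id='id', column_sort='time')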
print(features.head())

Output:

   F_x__variance_larger_than_standard_deviation  F_x__has_duplicate_max  \
1                                           0.0                     0.0   
2                                           0.0                     1.0   
3                                           0.0                     0.0   
4                                           0.0                     1.0   
5                                           0.0                     0.0   

   F_x__has_duplicate_min  F_x__has_duplicate  F_x__sum_values  \
1                     1.0                 1.0            -14.0   
2                     1.0                 1.0            -13.0   
3                     1.0                 1.0            -10.0   
4                     1.0                 1.0             -6.0   
5                     0.0                 1.0             -9.0   

   F_x__abs_energy  F_x__mean_abs_change  F_x__mean_change  \
1             14.0              0.142857          0.000000   
2             25.0              1.000000          0.000000   
3             12.0              0.714286          0.000000   
4             16.0              1.214286         -0.071429   
5             17.0              0.928571         -0.071429   

   F_x__mean_second_derivative_central  F_x__median  ...  \
1                            -0.038462         -1.0  ...   
2                            -0.038462         -1.0  ...   
3                            -0.038462         -1.0  ...   
4                            -0.038462          0.0  ...   
5                             0.038462         -1.0  ...   

   T_z__fourier_entropy__bins_5  T_z__fourier_entropy__bins_10  \
1                           NaN                            NaN   
2                      1.073543                       1.494175   
3                      1.386294                       1.732868   
4                      1.073543                       1.494175   
5                      0.900256                       1.320888   

   T_z__fourier_entropy__bins_100  \
1                             NaN   
2                        2.079442   
3                        2.079442   
4                        2.079442   
5                        2.079442   

   T_z__permutation_entropy__dimension_3__tau_1  \
1                                     -0.000000   
2                                      0.937156   
3                                      1.265857   
4                                      1.156988   
5                                      1.156988   

   T_z__permutation_entropy__dimension_4__tau_1  \
1                                     -0.000000   
2                                      1.234268   
3                                      1.704551   
4                                      1.907284   
5                                      1.863680   

   T_z__permutation_entropy__dimension_5__tau_1  \
1                                     -0.000000   
2                                      1.540306   
3                                      2.019815   
4                                      2.397895   
5                                      2.271869   

   T_z__permutation_entropy__dimension_6__tau_1  \
1                                     -0.000000   
2                                      1.748067   
3                                      2.163956   
4                                      2.302585   
5                                      2.302585   

   T_z__permutation_entropy__dimension_7__tau_1  \
1                                     -0.000000   
2                                      1.831020   
3                                      2.197225   
4                                      2.197225   
5                                      2.197225   

   T_z__query_similarity_count__query_None__threshold_0.0  \
1                                                NaN        
2                                                NaN        
3                                                NaN        
4                                                NaN        
5                                                NaN        

   T_z__mean_n_absolute_max__number_of_maxima_7  
1                                      0.000000  
2                                      0.571429  
3                                      0.571429  
4                                      1.000000  
5                                      0.857143  

[5 rows x 4698 columns]

The 4698 columns are the 783 features from before, calculated for each of the six sensor channels (F_x, F_y, F_z, T_x, T_y, T_z).

Step 3: Select Relevant Features

Python

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# Impute missing values
impute(features)

# Select relevant features
selected_features = select_features(features, y)
print(selected_features.head())

Output:

   F_x__value_count__value_-1  F_x__abs_energy  F_x__root_mean_square  \
1                        14.0             14.0               0.966092   
2                         7.0             25.0               1.290994   
3                        11.0             12.0               0.894427   
4                         5.0             16.0               1.032796   
5                         9.0             17.0               1.064581   

   T_y__absolute_maximum  F_x__mean_n_absolute_max__number_of_maxima_7  \
1                    1.0                                      1.000000   
2                    5.0                                      1.571429   
3                    5.0                                      1.000000   
4                    6.0                                      1.285714   
5                    5.0                                      1.285714   

   F_x__range_count__max_1__min_-1  F_y__abs_energy  F_y__root_mean_square  \
1                             15.0             13.0               0.930949   
2                             13.0             76.0               2.250926   
3                             14.0             40.0               1.632993   
4                             10.0             60.0               2.000000   
5                             13.0             46.0               1.751190   

   F_y__mean_n_absolute_max__number_of_maxima_7  T_y__variance  ...  \
1                                      1.000000       0.222222  ...   
2                                      3.000000       4.222222  ...   
3                                      2.142857       3.128889  ...   
4                                      2.428571       7.128889  ...   
5                                      2.285714       4.160000  ...   

   F_y__cwt_coefficients__coeff_14__w_5__widths_(2, 5, 10, 20)  \
1                                          -0.751682             
2                                           0.057818             
3                                           0.912474             
4                                          -0.609735             
5                                           0.072771             

   F_y__cwt_coefficients__coeff_13__w_2__widths_(2, 5, 10, 20)  \
1                                          -0.310265             
2                                          -0.202951             
3                                           0.539121             
4                                          -2.641390             
5                                           0.591927             

   T_y__lempel_ziv_complexity__bins_3  T_y__quantile__q_0.1  \
1                            0.400000                  -1.0   
2                            0.533333                  -3.6   
3                            0.533333                  -4.0   
4                            0.533333                  -4.6   
5                            0.466667                  -5.0   

   F_z__time_reversal_asymmetry_statistic__lag_1  F_x__quantile__q_0.2  \
1                                    -596.000000                  -1.0   
2                                    -680.384615                  -1.0   
3                                    -617.000000                  -1.0   
4                                    3426.307692                  -1.0   
5                                   -2609.000000                  -1.0   

   F_y__quantile__q_0.7  \
1                  -1.0   
2                  -1.0   
3                   0.0   
4                   1.0   
5                   0.8   

   T_x__change_quantiles__f_agg_"var"__isabs_False__qh_0.2__ql_0.0  \
1                                                0.0                 
2                                                0.0                 
3                                                0.0                 
4                                                0.0                 
5                                                0.0                 

   T_z__large_standard_deviation__r_0.35000000000000003  T_z__quantile__q_0.9  
1                                                0.0                      0.0  
2                                                1.0                      0.0  
3                                                1.0                      0.0  
4                                                0.0                      0.0  
5                                                0.0                      0.6  

[5 rows x 682 columns]

Step 4: Train a Machine Learning Model

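You can now use the selected features to train a machine learning model. Note that select_features above saw the labels of every sample, including those that end up in the test set, so the accuracy reported here is likely to be optimistic: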

Python

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(selected_features, y, test_size=0.3, random_state=42)

# Train the model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Output:

Accuracy: 1.0

Conclusion

tsfresh automates feature extraction from time series data. It calculates hundreds of candidate features per series, filters them with statistical tests, and works directly with pandas and scikit-learn, which makes it a practical choice for data scientists and researchers working with time series. By following the steps outlined in this article, you can extract features from your own data, keep the ones that relate to your target, and use them to train a model.

frisbevhwy
