Advanced Feature Extraction and Selection from Time Series Data Using tsfresh in Python

Last Updated : 02 Jul, 2024

Time series data appears in fields such as finance, healthcare, and engineering, and extracting meaningful features from it is a key step in building predictive models. The tsfresh Python package automates this step by calculating a wide range of features. This article walks through using tsfresh to extract features from time series data and to select the relevant ones.

Introduction to tsfresh

tsfresh (Time Series Feature extraction based on scalable hypothesis tests) is a Python package designed to automate the extraction of a large number of features from time series data. It is particularly useful for tasks such as classification, regression, and clustering of time series data. The package integrates seamlessly with pandas and scikit-learn, making it easy to incorporate into existing workflows.

Key Features of tsfresh

  • Automated Feature Extraction: Extracts hundreds of features from time series data automatically.
  • Feature Selection: Identifies relevant features using statistical tests.
  • Scalability: Supports parallel processing and integration with dask for handling large datasets.
  • Compatibility: Works well with pandas DataFrames and scikit-learn pipelines.
  • Simple Features: These calculate a single number from the time series, such as the absolute energy, highest absolute value, and sum of absolute changes.
  • Combiner Features: These calculate multiple features for a list of parameters at once, returning a list of (key, value) pairs for each input parameter.
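
To make the difference between simple and combiner features concrete, here is a minimal pure-Python sketch. Both functions are illustrative stand-ins written for this article, not tsfresh's own implementations; count_above_combiner and its "t" parameter in particular are hypothetical.

```python
def abs_energy(x):
    """Simple feature: one number per series (sum of squared values)."""
    return sum(v * v for v in x)


def count_above_combiner(x, params):
    """Combiner-style feature (hypothetical): returns one (key, value)
    pair per parameter set, all computed in a single call."""
    return [
        (f"t_{p['t']}", sum(v > p["t"] for v in x) / len(x))
        for p in params
    ]


series = [10, 20, 30]
energy = abs_energy(series)
pairs = count_above_combiner(series, [{"t": 15}, {"t": 25}])
print(energy)
print(pairs)
```

The simple feature yields a single value, while the combiner yields one labeled value per parameter set, which is what lets one call fill several output columns.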

How to Use tsfresh for Feature Extraction: Installation

To begin using tsfresh, install it. The command below uses pip; the package is also available from conda-forge:

pip install tsfresh

For large datasets, you might want to install the dask extension:

pip install tsfresh[dask]

Step 1: Preparing the Data

tsfresh expects the data in a specific format. Each row needs an identifier column (id) that says which time series it belongs to, a time column (time) that fixes the order within that series, and one or more value columns (value). Load your time series data into a pandas DataFrame. The column names are free to choose because they are passed to tsfresh explicitly; if no sort column is given, the rows are assumed to be in time order already. Here's an example:

Python
import pandas as pd

# Example time series data
data = {
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 2, 3, 1, 2, 3],
    'value': [10, 20, 30, 15, 25, 35]
}
df = pd.DataFrame(data)
print(df)

Output:

   id  time  value
0   1     1     10
1   1     2     20
2   1     3     30
3   2     1     15
4   2     2     25
5   2     3     35
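
Conceptually, the id column says which series a row belongs to and the time column fixes the order inside that series. The following pure-Python sketch mimics that grouping on shuffled records; split_into_series is a hypothetical helper for illustration and not part of tsfresh.

```python
from collections import defaultdict

# The same records as in the DataFrame above, deliberately shuffled
rows = [
    {"id": 2, "time": 3, "value": 35},
    {"id": 1, "time": 2, "value": 20},
    {"id": 2, "time": 1, "value": 15},
    {"id": 1, "time": 3, "value": 30},
    {"id": 1, "time": 1, "value": 10},
    {"id": 2, "time": 2, "value": 25},
]


def split_into_series(rows, id_key="id", sort_key="time", value_key="value"):
    """Group rows by id and order the values of each group by time."""
    groups = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[sort_key]):
        groups[row[id_key]].append(row[value_key])
    return dict(groups)


series_by_id = split_into_series(rows)
print(series_by_id)
```

Each resulting list is one time series, and features are later calculated once per list.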

Step 2: Extracting Features

To extract features, use the extract_features function:

Python
from tsfresh import extract_features

features = extract_features(df, column_id='id', column_sort='time')
print(features)

Output:

value__variance_larger_than_standard_deviation  value__has_duplicate_max  \
1 1.0 0.0
2 1.0 0.0

value__has_duplicate_min value__has_duplicate value__sum_values \
1 0.0 0.0 60.0
2 0.0 0.0 75.0

value__abs_energy value__mean_abs_change value__mean_change \
1 1400.0 10.0 10.0
2 2075.0 10.0 10.0

value__mean_second_derivative_central value__median ... \
1 0.0 20.0 ...
2 0.0 25.0 ...

value__fourier_entropy__bins_5 value__fourier_entropy__bins_10 \
1 0.693147 0.693147
2 0.693147 0.693147

value__fourier_entropy__bins_100 \
1 0.693147
2 0.693147

value__permutation_entropy__dimension_3__tau_1 \
1 -0.0
2 -0.0

value__permutation_entropy__dimension_4__tau_1 \
1 NaN
2 NaN

value__permutation_entropy__dimension_5__tau_1 \
1 NaN
2 NaN

value__permutation_entropy__dimension_6__tau_1 \
1 NaN
2 NaN

value__permutation_entropy__dimension_7__tau_1 \
1 NaN
2 NaN

value__query_similarity_count__query_None__threshold_0.0 \
1 NaN
2 NaN

value__mean_n_absolute_max__number_of_maxima_7
1 NaN
2 NaN

[2 rows x 783 columns]

The output shows both rows of the feature matrix (one per id) and a truncated view of its 783 columns. Let's break down some of these columns:

  1. value__variance_larger_than_standard_deviation: This feature checks if the variance of the time series is larger than its standard deviation.
    • 1.0 indicates that the variance is larger than the standard deviation.
    • 0.0 indicates otherwise.
  2. value__has_duplicate_max: Indicates whether the maximum value in the time series is duplicated.
    • 0.0 means no duplicates.
    • 1.0 would mean there are duplicates.
  3. value__has_duplicate_min: Similar to has_duplicate_max, but for the minimum value.
  4. value__has_duplicate: Indicates whether any value in the time series is duplicated.
  5. value__sum_values: The sum of all values in the time series.
  6. value__abs_energy: The absolute energy of the time series, which is the sum of the squared values.
  7. value__mean_abs_change: The mean of the absolute differences between consecutive values in the time series.
  8. value__mean_change: The mean of the differences between consecutive values in the time series.
  9. value__mean_second_derivative_central: The mean of a central-difference approximation of the second derivative of the time series.
  10. value__median: The median value of the time series.
  11. value__fourier_entropy__bins_5: The binned entropy of the power spectral density of the time series, computed with 5 bins.
  12. value__fourier_entropy__bins_10: Similar to fourier_entropy__bins_5, but with 10 bins.
  13. value__fourier_entropy__bins_100: Similar to fourier_entropy__bins_5, but with 100 bins.
  14. value__permutation_entropy__dimension_3__tau_1: The permutation entropy of the time series with embedding dimension 3 and time delay 1.
  15. value__query_similarity_count__query_None__threshold_0.0: The count of subsequences in the time series that are similar to a given query with a certain threshold.
  16. value__mean_n_absolute_max__number_of_maxima_7: The mean of the 7 largest absolute values in the time series.
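
Several of these values are easy to verify by hand. The sketch below recomputes a few of them for the series with id 1 using only the standard library. It is a cross-check written for this article, not tsfresh's code, and the use of the population variance and standard deviation is an assumption about the convention.

```python
import statistics

series = [10, 20, 30]  # the values of id 1 in the example DataFrame
diffs = [b - a for a, b in zip(series, series[1:])]

checks = {
    "sum_values": sum(series),
    "abs_energy": sum(v * v for v in series),
    "mean_abs_change": statistics.mean([abs(d) for d in diffs]),
    "mean_change": statistics.mean(diffs),
    "median": statistics.median(series),
    # Population variance and standard deviation (assumed convention)
    "variance_larger_than_standard_deviation": (
        statistics.pvariance(series) > statistics.pstdev(series)
    ),
}
for name, value in checks.items():
    print(f"value__{name}: {value}")
```

The numbers agree with the first row of the output above: 60 for the sum, 1400 for the absolute energy, 10 for both change features, and 20 for the median.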

Some features have NaN values. This can happen if the feature calculation is not applicable or meaningful for the given time series segment. For example, if a time series is too short to calculate a meaningful permutation entropy with higher dimensions, the result will be NaN.
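
The permutation entropy columns illustrate this well. The sketch below is a simplified re-implementation for illustration only (tsfresh's internals may differ): it counts the ordinal patterns of every window of length dimension and returns NaN when the series is too short to form a single window.

```python
import math
from collections import Counter


def permutation_entropy(x, dimension, tau=1):
    """Shannon entropy of the ordinal patterns of length `dimension`.

    Returns NaN when the series is too short to form one window.
    """
    span = (dimension - 1) * tau
    windows = [x[i:i + span + 1:tau] for i in range(len(x) - span)]
    if not windows:
        return math.nan
    # The ordinal pattern of a window is the argsort of its values
    patterns = Counter(
        tuple(sorted(range(dimension), key=w.__getitem__)) for w in windows
    )
    total = sum(patterns.values())
    return -sum((c / total) * math.log(c / total) for c in patterns.values())


series = [10, 20, 30]
print(permutation_entropy(series, dimension=3))  # one window, one pattern
print(permutation_entropy(series, dimension=4))  # no window fits
```

With three observations, dimension 3 yields exactly one window and therefore zero entropy, while dimensions 4 to 7 have no window at all, which is consistent with the NaN entries shown above.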

Step 3: Feature Selection

Not all extracted features may be relevant for your task. tsfresh provides methods to select relevant features based on their significance:

Python
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

y = pd.Series([0, 1], index=[1, 2])

# Impute missing values
impute(features)

# Select relevant features
selected_features = select_features(features, y)
print(selected_features)

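Output (the tail of the warning that impute emits for columns containing no finite values; the beginning of the column list is cut off):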

value__autocorrelation__lag_4' 'value__autocorrelation__lag_5'
'value__autocorrelation__lag_6' 'value__autocorrelation__lag_7'
'value__autocorrelation__lag_8' 'value__autocorrelation__lag_9'
'value__partial_autocorrelation__lag_0'
'value__partial_autocorrelation__lag_1'
'value__partial_autocorrelation__lag_2'
'value__partial_autocorrelation__lag_3'
'value__partial_autocorrelation__lag_4'
'value__partial_autocorrelation__lag_5'
'value__partial_autocorrelation__lag_6'
'value__partial_autocorrelation__lag_7'
'value__partial_autocorrelation__lag_8'
'value__partial_autocorrelation__lag_9'
'value__agg_linear_trend__attr_"stderr"__chunk_len_5__f_agg_"mean"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_5__f_agg_"var"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"max"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"min"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"mean"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_10__f_agg_"var"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"max"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"min"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
'value__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
'value__augmented_dickey_fuller__attr_"teststat"__autolag_"AIC"'
'value__augmented_dickey_fuller__attr_"pvalue"__autolag_"AIC"'
'value__augmented_dickey_fuller__attr_"usedlag"__autolag_"AIC"'
'value__permutation_entropy__dimension_4__tau_1'
'value__permutation_entropy__dimension_5__tau_1'
'value__permutation_entropy__dimension_6__tau_1'
'value__permutation_entropy__dimension_7__tau_1'
'value__query_similarity_count__query_None__threshold_0.0'
'value__mean_n_absolute_max__number_of_maxima_7'] did not have any finite values. Filling with zeros.
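
Feature selection of this kind tests each feature against the target individually and then corrects for the number of tests so that the false discovery rate stays controlled. The sketch below shows the Benjamini-Hochberg step-up rule as an illustration of that idea. It is not tsfresh's code, its procedure may differ in detail, and the p-values are invented for the example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the feature names kept by the Benjamini-Hochberg step-up rule.

    p_values: dict mapping feature name -> p-value of its univariate test.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff = 0  # largest rank k with p_(k) <= k / m * alpha
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= k / m * alpha:
            cutoff = k
    return [name for name, _ in ranked[:cutoff]]


# Hypothetical p-values, made up for illustration
p_values = {
    "value__mean": 0.001,
    "value__maximum": 0.008,
    "value__median": 0.039,
    "value__minimum": 0.041,
    "value__variance": 0.205,
}
selected = benjamini_hochberg(p_values)
print(selected)
```

Note that two of the rejected features have p-values below 0.05: the correction is stricter than testing each feature at the nominal level on its own.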

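Customizing Feature Extraction with tsfresh

You can control which features are calculated through the default_fc_parameters argument. tsfresh ships predefined settings classes such as ComprehensiveFCParameters (the default, which calculates every available feature) and EfficientFCParameters (which skips the features marked as computationally expensive). Passing ComprehensiveFCParameters explicitly therefore reproduces the output of Step 2: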

Python
from tsfresh.feature_extraction import extract_features, ComprehensiveFCParameters

settings = ComprehensiveFCParameters()
features = extract_features(df, column_id='id', column_sort='time', default_fc_parameters=settings)
print(features)

Output:

 value__variance_larger_than_standard_deviation  value__has_duplicate_max  \
1 1.0 0.0
2 1.0 0.0

value__has_duplicate_min value__has_duplicate value__sum_values \
1 0.0 0.0 60.0
2 0.0 0.0 75.0

value__abs_energy value__mean_abs_change value__mean_change \
1 1400.0 10.0 10.0
2 2075.0 10.0 10.0

value__mean_second_derivative_central value__median ... \
1 0.0 20.0 ...
2 0.0 25.0 ...

value__fourier_entropy__bins_5 value__fourier_entropy__bins_10 \
1 0.693147 0.693147
2 0.693147 0.693147

value__fourier_entropy__bins_100 \
1 0.693147
2 0.693147

value__permutation_entropy__dimension_3__tau_1 \
1 -0.0
2 -0.0

value__permutation_entropy__dimension_4__tau_1 \
1 NaN
2 NaN

value__permutation_entropy__dimension_5__tau_1 \
1 NaN
2 NaN

value__permutation_entropy__dimension_6__tau_1 \
1 NaN
2 NaN

value__permutation_entropy__dimension_7__tau_1 \
1 NaN
2 NaN

value__query_similarity_count__query_None__threshold_0.0 \
1 NaN
2 NaN

value__mean_n_absolute_max__number_of_maxima_7
1 NaN
2 NaN

[2 rows x 783 columns]
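
Settings can also be written as a plain dictionary that maps a feature calculator name to None (no parameters) or to a list of parameter dictionaries. The sketch below builds such a dictionary and predicts the resulting column names from the kind__feature__param_value pattern visible in the outputs of this article. expected_columns is a hypothetical helper, handles numeric parameters only, and does not call tsfresh.

```python
# A settings dictionary: calculator name -> None or list of parameter dicts
settings = {
    "mean": None,
    "autocorrelation": [{"lag": 1}, {"lag": 2}],
    "quantile": [{"q": 0.1}, {"q": 0.9}],
}


def expected_columns(kind, settings):
    """Hypothetical helper: predict names as kind__feature__param_value."""
    names = []
    for feature, param_sets in settings.items():
        if param_sets is None:
            names.append(f"{kind}__{feature}")
            continue
        for params in param_sets:
            suffix = "__".join(f"{k}_{v}" for k, v in sorted(params.items()))
            names.append(f"{kind}__{feature}__{suffix}")
    return names


columns = expected_columns("value", settings)
print(columns)
```

A dictionary like this can be passed as default_fc_parameters in place of a settings object, so only the listed calculators run.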

Practical Implementation with tsfresh Feature Extraction

Example 1: Feature Extraction for Financial Time Series

Step 1: Load Financial Data

Suppose you have financial time series data, such as stock prices, with columns for date, stock ID, and price.

Python
import pandas as pd

# Example financial time series data
data = {
    'id': ['AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'price': [150, 152, 148, 2800, 2825, 2790]
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

Step 2: Define Custom Feature Extraction Settings

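You can define custom settings so that only specific features are extracted. The dictionary below requests basic level and dispersion statistics (mean, standard deviation, variance, maximum, minimum) together with the absolute sum of changes and the longest runs above and below the mean. Each key is the name of a tsfresh feature calculator, and None means the calculator takes no parameters.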

Python
from tsfresh.feature_extraction import extract_features

# Define custom settings
custom_settings = {
    'mean': None,
    'standard_deviation': None,
    'variance': None,
    'maximum': None,
    'minimum': None,
    'absolute_sum_of_changes': None,
    'longest_strike_above_mean': None,
    'longest_strike_below_mean': None
}

# Extract features using custom settings
features = extract_features(
    df,
    column_id='id',
    column_sort='date',
    column_value='price',
    default_fc_parameters=custom_settings,
)
print(features)

Output:

 price__mean  price__standard_deviation  price__variance  price__maximum  \
AAPL 150.0 1.632993 2.666667 152.0
GOOG 2805.0 14.719601 216.666667 2825.0

price__minimum price__absolute_sum_of_changes \
AAPL 148.0 6.0
GOOG 2790.0 60.0

price__longest_strike_above_mean price__longest_strike_below_mean
AAPL 1.0 1.0
GOOG 1.0 1.0
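
The features above summarize raw prices. If returns are of more interest, one option is to derive a return series per ticker first and run the extraction on that series instead. A minimal pure-Python sketch, in which simple_returns is a hypothetical helper and not a tsfresh function:

```python
prices = {
    "AAPL": [150, 152, 148],
    "GOOG": [2800, 2825, 2790],
}


def simple_returns(series):
    """Relative change between consecutive prices: p[t] / p[t-1] - 1."""
    return [curr / prev - 1 for prev, curr in zip(series, series[1:])]


returns = {ticker: simple_returns(p) for ticker, p in prices.items()}
for ticker, r in returns.items():
    print(ticker, [round(v, 6) for v in r])
```

Each return series is one observation shorter than its price series, which matters for series as short as the ones in this example.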

Step 3: Feature Selection 

Python
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# Dummy target for demonstration purposes; in a real scenario, use your
# actual target variable, indexed by the same ids as the feature matrix
y = pd.Series([1, 0], index=['AAPL', 'GOOG'])

# Impute missing values in place
impute(features)

# Select relevant features
selected_features = select_features(features, y)
print("Selected Features:")
print(selected_features)

Output:

Selected Features:
Empty DataFrame
Columns: []
Index: [AAPL, GOOG]

The selection comes back empty because two samples are far too few for any hypothesis test to reach statistical significance, so no feature passes the relevance filter. Meaningful feature selection needs many more labeled time series.

Example 2: Feature Extraction for Robot Execution Failures

Let's walk through a complete example using the robot execution failures dataset provided by tsfresh.

Step 1: Load the Data

Python
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures

download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

Step 2: Extract Features

Python
features = extract_features(timeseries, column_id='id', column_sort='time')
print(features.head())

Output:

   F_x__variance_larger_than_standard_deviation  F_x__has_duplicate_max  \
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 1.0
5 0.0 0.0

F_x__has_duplicate_min F_x__has_duplicate F_x__sum_values \
1 1.0 1.0 -14.0
2 1.0 1.0 -13.0
3 1.0 1.0 -10.0
4 1.0 1.0 -6.0
5 0.0 1.0 -9.0

F_x__abs_energy F_x__mean_abs_change F_x__mean_change \
1 14.0 0.142857 0.000000
2 25.0 1.000000 0.000000
3 12.0 0.714286 0.000000
4 16.0 1.214286 -0.071429
5 17.0 0.928571 -0.071429

F_x__mean_second_derivative_central F_x__median ... \
1 -0.038462 -1.0 ...
2 -0.038462 -1.0 ...
3 -0.038462 -1.0 ...
4 -0.038462 0.0 ...
5 0.038462 -1.0 ...

T_z__fourier_entropy__bins_5 T_z__fourier_entropy__bins_10 \
1 NaN NaN
2 1.073543 1.494175
3 1.386294 1.732868
4 1.073543 1.494175
5 0.900256 1.320888

T_z__fourier_entropy__bins_100 \
1 NaN
2 2.079442
3 2.079442
4 2.079442
5 2.079442

T_z__permutation_entropy__dimension_3__tau_1 \
1 -0.000000
2 0.937156
3 1.265857
4 1.156988
5 1.156988

T_z__permutation_entropy__dimension_4__tau_1 \
1 -0.000000
2 1.234268
3 1.704551
4 1.907284
5 1.863680

T_z__permutation_entropy__dimension_5__tau_1 \
1 -0.000000
2 1.540306
3 2.019815
4 2.397895
5 2.271869

T_z__permutation_entropy__dimension_6__tau_1 \
1 -0.000000
2 1.748067
3 2.163956
4 2.302585
5 2.302585

T_z__permutation_entropy__dimension_7__tau_1 \
1 -0.000000
2 1.831020
3 2.197225
4 2.197225
5 2.197225

T_z__query_similarity_count__query_None__threshold_0.0 \
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN

T_z__mean_n_absolute_max__number_of_maxima_7
1 0.000000
2 0.571429
3 0.571429
4 1.000000
5 0.857143

[5 rows x 4698 columns]

Step 3: Select Relevant Features

Python
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# Impute missing values
impute(features)

# Select relevant features
selected_features = select_features(features, y)
print(selected_features.head())

Output:

   F_x__value_count__value_-1  F_x__abs_energy  F_x__root_mean_square  \
1 14.0 14.0 0.966092
2 7.0 25.0 1.290994
3 11.0 12.0 0.894427
4 5.0 16.0 1.032796
5 9.0 17.0 1.064581

T_y__absolute_maximum F_x__mean_n_absolute_max__number_of_maxima_7 \
1 1.0 1.000000
2 5.0 1.571429
3 5.0 1.000000
4 6.0 1.285714
5 5.0 1.285714

F_x__range_count__max_1__min_-1 F_y__abs_energy F_y__root_mean_square \
1 15.0 13.0 0.930949
2 13.0 76.0 2.250926
3 14.0 40.0 1.632993
4 10.0 60.0 2.000000
5 13.0 46.0 1.751190

F_y__mean_n_absolute_max__number_of_maxima_7 T_y__variance ... \
1 1.000000 0.222222 ...
2 3.000000 4.222222 ...
3 2.142857 3.128889 ...
4 2.428571 7.128889 ...
5 2.285714 4.160000 ...

F_y__cwt_coefficients__coeff_14__w_5__widths_(2, 5, 10, 20) \
1 -0.751682
2 0.057818
3 0.912474
4 -0.609735
5 0.072771

F_y__cwt_coefficients__coeff_13__w_2__widths_(2, 5, 10, 20) \
1 -0.310265
2 -0.202951
3 0.539121
4 -2.641390
5 0.591927

T_y__lempel_ziv_complexity__bins_3 T_y__quantile__q_0.1 \
1 0.400000 -1.0
2 0.533333 -3.6
3 0.533333 -4.0
4 0.533333 -4.6
5 0.466667 -5.0

F_z__time_reversal_asymmetry_statistic__lag_1 F_x__quantile__q_0.2 \
1 -596.000000 -1.0
2 -680.384615 -1.0
3 -617.000000 -1.0
4 3426.307692 -1.0
5 -2609.000000 -1.0

F_y__quantile__q_0.7 \
1 -1.0
2 -1.0
3 0.0
4 1.0
5 0.8

T_x__change_quantiles__f_agg_"var"__isabs_False__qh_0.2__ql_0.0 \
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0

T_z__large_standard_deviation__r_0.35000000000000003 T_z__quantile__q_0.9
1 0.0 0.0
2 1.0 0.0
3 1.0 0.0
4 0.0 0.0
5 0.0 0.6

[5 rows x 682 columns]
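
With several hundred selected columns, it helps to see which sensors and which calculators dominate. Because column names follow the kind__feature__parameters pattern, they can be split on the double underscore. The sketch below does this for a handful of names copied from the output above; split_column is a hypothetical helper.

```python
from collections import Counter

# A few column names copied from the output above
selected_columns = [
    "F_x__value_count__value_-1",
    "F_x__abs_energy",
    "F_x__root_mean_square",
    "T_y__absolute_maximum",
    "F_y__cwt_coefficients__coeff_14__w_5__widths_(2, 5, 10, 20)",
    "T_y__lempel_ziv_complexity__bins_3",
    "T_z__quantile__q_0.9",
]


def split_column(name):
    """Split 'kind__feature__params...' into (kind, feature, params)."""
    kind, feature, *params = name.split("__")
    return kind, feature, params


per_sensor = Counter(split_column(c)[0] for c in selected_columns)
per_feature = Counter(split_column(c)[1] for c in selected_columns)
print(per_sensor)
print(per_feature)
```

On the real result, the same counting can be applied to list(selected_features.columns) instead of the hand-copied list.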

Step 4: Train a Machine Learning Model

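You can now use the selected features to train a machine learning model. Note that select_features above saw the labels of every sample, including those that end up in the test set, so the accuracy reported below is optimistic. For an unbiased estimate, split the data first and select features on the training portion only: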

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(selected_features, y, test_size=0.3, random_state=42)

# Train the model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Output:

Accuracy: 1.0

Conclusion

tsfresh is a powerful tool for automatic feature extraction from time series data. Its ability to calculate hundreds of features, filter them for relevance, and integrate with pandas and scikit-learn makes it a useful package for data scientists and researchers working with time series. By following the steps outlined in this article, you can extract features from your own time series data, select the relevant ones, and feed them into a machine learning model.

