Time Series Analytics
with Spark
Simon Ouellette
Faimdata
What is spark-timeseries?
• Open source time series library for Apache Spark 2.0
• Sandy Ryza
– Advanced Analytics with Spark: Patterns for Learning from Data
at Scale
– Senior Data Scientist at Clover Health
• Started in February 2015
https://github.com/sryza/spark-timeseries
Who am I?
• Chief Data Science Officer at Faimdata
• Contributor to spark-timeseries since September 2015
• Participated in early design discussions (March 2015)
• Been an active user for ~2 years
http://faimdata.com
Survey: Who uses time series?
Design Question #1: How do we
structure multivariate time series?
Columnar or Row-based?
Row-based representation: RDD[(ZonedDateTime, Vector)]
Vectors     Date/Time   Series 1   Series 2
Vector 1    2:30:01     4.56       78.93
Vector 2    2:30:02     4.57       79.92
Vector 3    2:30:03     4.87       79.91
Vector 4    2:30:04     4.48       78.99
Columnar representation: TimeSeriesRDD(DateTimeIndex, RDD[Vector])
DateTimeIndex   Vector for Series 1   Vector for Series 2
2:30:01         4.56                  78.93
2:30:02         4.57                  79.92
2:30:03         4.87                  79.91
2:30:04         4.48                  78.99
Columnar vs Row-based
More efficient in columnar representation:
• Lagging
• Differencing
• Rolling operations
• Feature generation
• Feature selection
• Feature transformation
More efficient in row-based representation:
• Regression
• Clustering
• Classification
• Etc.
Example: lagging operation
Row-based representation:
• Time complexity: O(N) (assumes pre-sorted RDD)
• For each row, we need to get values from the previous k rows
Columnar representation:
• Time complexity: O(K)
• For each column to lag, we truncate the most recent k values, and truncate the DateTimeIndex’s oldest k values.
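The contrast between the two lagging strategies can be sketched in plain Python. This is an illustration only, not spark-timeseries code: `lag_rows` and `lag_columns` are hypothetical names, the sample values are borrowed from the representation slide, and Python slices copy data, so the sketch shows the shape of each operation rather than its cost.

```python
def lag_rows(rows, k):
    """Row-based lag: every row looks back k rows (rows must be time-sorted)."""
    out = []
    for i in range(k, len(rows)):            # one pass over all N rows
        t, vec = rows[i]
        _, prev = rows[i - k]
        out.append((t, vec, prev))            # current values plus values k rows back
    return out

def lag_columns(index, columns, k):
    """Columnar lag: drop the most recent k values of each column,
    and drop the oldest k timestamps of the shared index."""
    lagged = {name: values[:len(values) - k] for name, values in columns.items()}
    return index[k:], lagged

index = ["2:30:01", "2:30:02", "2:30:03", "2:30:04"]
columns = {
    "Series 1": [4.56, 4.57, 4.87, 4.48],
    "Series 2": [78.93, 79.92, 79.91, 78.99],
}
rows = [(t, [columns["Series 1"][i], columns["Series 2"][i]])
        for i, t in enumerate(index)]

lagged_rows = lag_rows(rows, 1)
lagged_index, lagged_columns = lag_columns(index, columns, 1)
```

In the columnar case the lagged column and the trimmed index line up again position by position, which is why no per-row lookup is needed.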
Example: regression
• We’re estimating:
• The lagged values are typically part of each row, because they are
pre-generated as new features.
• Stochastic Gradient Descent: we iterate on examples and
estimate error gradient to adjust weights, which means that we care
about rows, not columns.
• To avoid shuffling, the partitioning must be done such that all
elements of a row are together in the same partition (so the gradient
can be computed locally).
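Why stochastic gradient descent cares about rows can be seen in a small plain-Python sketch. The model equation from the slide did not survive in this transcript, so the sketch assumes a generic linear model y ≈ w·x + b with squared error; `sgd_linear_regression` and the toy data are hypothetical and are not part of spark-timeseries.

```python
def sgd_linear_regression(rows, targets, lr=0.1, epochs=2000):
    """Fit y ~ w . x + b by visiting one row (example) at a time."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y                     # derivative of 0.5 * err**2 w.r.t. pred
            # The update needs every feature of this one row, and nothing else.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Noise-free toy data generated from y = 2*x1 - 1*x2 + 0.5 (assumed for illustration).
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
targets = [2.0 * x1 - 1.0 * x2 + 0.5 for x1, x2 in rows]
w, b = sgd_linear_regression(rows, targets)
```

Each update touches a single complete row, so a row whose elements were spread over several partitions would force a shuffle before its gradient could be computed.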
Current solution
• Core representation is columnar.
• Utility functions to go to/from row-based.
• Reasoning: spark-timeseries operations are
mostly time-related, i.e. columnar. Row-based
operations are about relationships between the
variables (ML/statistical), thus external to
spark-timeseries.
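Going to and from the row-based form is essentially a transpose. A minimal sketch in plain Python, assuming an in-memory index and dictionary of columns; `to_rows` and `to_columns` are hypothetical names and do not correspond to the library's utility functions.

```python
# Columnar layout: one shared time index plus one vector per series.
index = ["2:30:01", "2:30:02", "2:30:03", "2:30:04"]
columns = {
    "Series 1": [4.56, 4.57, 4.87, 4.48],
    "Series 2": [78.93, 79.92, 79.91, 78.99],
}

def to_rows(index, columns):
    """Columnar -> row-based: one (timestamp, vector) pair per instant."""
    names = list(columns)
    return [(t, [columns[n][i] for n in names]) for i, t in enumerate(index)]

def to_columns(rows, names):
    """Row-based -> columnar: rebuild the index and one vector per series."""
    index = [t for t, _ in rows]
    columns = {n: [vec[j] for _, vec in rows] for j, n in enumerate(names)}
    return index, columns

rows = to_rows(index, columns)
round_trip = to_columns(rows, list(columns))
```

The round trip returns the original index and columns, so time-related work can stay columnar and only the hand-off to an ML or statistics routine needs rows.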
Typical time series
analytics workflow:
Survey: Who uses univariate time
series that don’t fit inside a single
executor’s memory?
(or multivariate of which a single
variable’s time series doesn’t fit)
Design Question #2: How do we
partition the multivariate time series
for distributed processing?
Across features, or across time?
Current design
Assumption: a single time series must
fit inside an executor's memory!
TimeSeriesRDD(
  DateTimeIndex,
  RDD[(K, Vector)]
)
IrregularDateTimeIndex(
  Array[Long],  // Other limitation: Scala (JVM) arrays hold at most 2^31 - 1 elements
  java.time.ZoneId
)
Future improvements
• Creation of a new TimeSeriesRDD-like class that
will be longitudinally (i.e. across time) partitioned
rather than horizontally (i.e. across features).
• Keep both types of partitioning, on a
case-by-case basis.
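The two partitioning schemes can be contrasted with a small plain-Python sketch. It only illustrates how the data would be split; `partition_across_features` and `partition_across_time` are hypothetical names, and the round-robin and equal-chunk rules are assumptions, not the library's partitioner.

```python
def partition_across_features(series_by_key, n_partitions):
    """Each partition holds complete series for a subset of the keys."""
    parts = [{} for _ in range(n_partitions)]
    for i, (key, values) in enumerate(sorted(series_by_key.items())):
        parts[i % n_partitions][key] = values   # whole series stays together
    return parts

def partition_across_time(index, series_by_key, n_partitions):
    """Each partition holds every key, but only a contiguous slice of time."""
    if not index:
        return []
    size = -(-len(index) // n_partitions)       # ceiling division
    parts = []
    for start in range(0, len(index), size):
        stop = start + size
        chunk = {k: v[start:stop] for k, v in series_by_key.items()}
        parts.append((index[start:stop], chunk))
    return parts

index = [0, 1, 2, 3, 4, 5]
series = {
    "a": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    "b": [2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
    "c": [3.0, 3.1, 3.2, 3.3, 3.4, 3.5],
}
by_feature = partition_across_features(series, 2)
by_time = partition_across_time(index, series, 2)
```

With the first scheme a single series can never be larger than one partition; with the second, a series of any length is spread over partitions, at the cost of operations such as lagging having to look across chunk boundaries.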
Design Question #3: How do we
lag, difference, etc.?
Re-sampling, or index-preserving?
Option #1: re-sampling
Before:
Irregular Time   y value at t   x value at t
1:30:05          51.42          4.87
1:30:07.86       52.37          4.99
1:30:07.98       53.22          4.95
1:30:08.04       55.87          4.97
1:30:12          54.84          5.12
1:30:14          49.88          5.10
After (1 second lag):
Uniform Time     y value at t   x value at (t – 1)
1:30:06          51.42          4.87
1:30:07          51.42          4.87
1:30:08          53.22          4.87
1:30:09          55.87          4.95
1:30:10          55.87          4.97
1:30:11          55.87          4.97
1:30:12          54.84          4.97
1:30:13          54.84          5.12
1:30:14          49.88          5.12
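The re-sampling option can be reproduced with a short plain-Python sketch. It assumes a "previous value" fill (each grid point takes the last observation at or before it), which is consistent with the numbers in the table above; `value_at` and `resample_with_lag` are hypothetical names, not the library's `resample` function, and times are written as seconds after 1:30:00.

```python
from bisect import bisect_right

def value_at(times, values, t):
    """Last observation at or before t (previous-value fill), or None."""
    i = bisect_right(times, t)
    return values[i - 1] if i > 0 else None

def resample_with_lag(times, y, x, grid, lag):
    """Put y on a uniform grid and pair it with x as of `lag` seconds earlier."""
    return [(t, value_at(times, y, t), value_at(times, x, t - lag)) for t in grid]

times = [5.0, 7.86, 7.98, 8.04, 12.0, 14.0]    # seconds after 1:30:00
y = [51.42, 52.37, 53.22, 55.87, 54.84, 49.88]
x = [4.87, 4.99, 4.95, 4.97, 5.12, 5.10]
grid = [float(s) for s in range(6, 15)]         # 1:30:06 .. 1:30:14

resampled = resample_with_lag(times, y, x, grid, lag=1.0)
```

Six observations become nine rows, several of them repeated, while the values 52.37 and 4.99 never appear in the output: the duplication and information loss that come with forcing irregular data onto a uniform grid.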
Option #2: index preserving
Before:
Irregular Time   y value at t   x value at t
1:30:05          51.42          4.87
1:30:07.86       52.37          4.99
1:30:07.98       53.22          4.95
1:30:08.04       55.87          4.97
1:30:12          54.84          5.12
1:30:14          49.88          5.10
After (1 second lag):
Irregular Time   y value at t   x value at (t – 1)
1:30:05          51.42          N/A
1:30:07.86       52.37          4.87
1:30:07.98       53.22          4.87
1:30:08.04       55.87          4.87
1:30:12          54.84          4.97
1:30:14          49.88          5.12
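The index-preserving option keeps the original timestamps and looks up, for each of them, the value that was current one lag interval earlier. A minimal plain-Python sketch, again assuming a "previous value" lookup; `lag_preserving_index` is a hypothetical name, not the library's function, and times are seconds after 1:30:00.

```python
from bisect import bisect_right

def lag_preserving_index(times, values, lag):
    """For each original timestamp t, take the last observation at or before t - lag."""
    out = []
    for t in times:
        i = bisect_right(times, t - lag)
        out.append(values[i - 1] if i > 0 else None)   # None plays the role of N/A
    return out

times = [5.0, 7.86, 7.98, 8.04, 12.0, 14.0]    # seconds after 1:30:00
x = [4.87, 4.99, 4.95, 4.97, 5.12, 5.10]

lagged_x = lag_preserving_index(times, x, lag=1.0)
```

The output has exactly one entry per original timestamp, so y is untouched and no rows are added.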
Current functionality
• Option #1: resample() function for lagging/differencing by
upsampling/downsampling.
– Custom interpolation function (used when
downsampling)
• Conceptual problems:
– Information loss and duplication (downsampling)
– Bloating (upsampling)
Current functionality
• Option #2: functions to lag/difference irregular
time series based on arbitrary time intervals
(preserves the index).
• As with Option #1, a custom interpolation function can be
passed for when downsampling occurs.
Overview of current API
High-level objects
• TimeSeriesRDD
• TimeSeries
• TimeSeriesStatisticalTests
• TimeSeriesModel
• DateTimeIndex
• UnivariateTimeSeries
TimeSeriesRDD
• collectAsTimeSeries
• filterStartingBefore, filterStartingAfter, slice
• filterByInstant
• quotients, differences, lags
• fill: fills NaNs by specified interpolation method (linear, nearest, next, previous,
spline, zero)
• mapSeries
• seriesStats: min, max, average, std. deviation
• toInstants, toInstantsDataFrame
• resample
• rollSum, rollMean
• saveAsCsv, saveAsParquetDataFrame
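To make two of the fill methods named above concrete, here is a plain-Python sketch of "previous" and "linear" filling on a list with NaN gaps. The function names and the choice to leave leading or trailing NaNs untouched are assumptions for illustration; the library's own `fill` may treat edges differently.

```python
import math

def fill_previous(values):
    """Replace each NaN with the most recent non-NaN value (leading NaNs stay)."""
    out, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out

def fill_linear(values):
    """Linearly interpolate interior runs of NaNs (leading/trailing NaNs stay)."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out

series = [1.0, math.nan, math.nan, 4.0, math.nan]
filled_previous = fill_previous(series)
filled_linear = fill_linear(series)
```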
TimeSeriesStatisticalTests
• Stationarity tests:
– Augmented Dickey-Fuller (adftest)
– KPSS (kpsstest)
• Serial auto-correlation tests:
– Durbin-Watson (dwtest)
– Breusch-Godfrey (bgtest)
– Ljung-Box (lbtest)
• Breusch-Pagan heteroskedasticity test (bptest)
• Newey-West variance estimator (neweyWestVarianceEstimator)
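As an example of what one of these tests computes, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below is a plain-Python illustration of that textbook formula; `durbin_watson` is a hypothetical helper, not the library's `dwtest`.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])   # strong negative autocorrelation
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])        # strong positive autocorrelation
```

The statistic lies between 0 and 4; values near 2 suggest no first-order serial correlation, values towards 0 positive correlation, and values towards 4 negative correlation.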
TimeSeriesModel
• AR, ARIMA
• ARX, ARIMAX (i.e. with exogenous variables)
• Exponentially weighted moving average
• Holt-Winters method (triple exp. smoothing)
• GARCH(1,1), ARGARCH(1,1,1)
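The simplest model in this list, the exponentially weighted moving average, fits in a few lines. A plain-Python sketch of the standard recursion s_t = alpha * x_t + (1 - alpha) * s_{t-1}, seeded with the first observation (the seeding is an assumption); `ewma` is a hypothetical helper, not the library's model class.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

smoothed = ewma([1.0, 2.0, 3.0], alpha=0.5)
```

A larger `alpha` tracks recent observations more closely; a smaller one smooths more heavily.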
Others
• Java bindings
• Python bindings
• Yahoo financial data parser
Code example #1
Time       Y      X
12:45:01   3.45   25.0
12:46:02   4.45   30.0
12:46:58   3.45   40.0
12:47:45   3.00   35.0
12:48:05   4.00   45.0
Y is stationary
X is integrated of order 1
Code example #1
Time       y      d(x)   Lag1(y)   Lag2(y)   Lag1(d(x))   Lag2(d(x))
12:45:01   3.45   N/A    N/A       N/A       N/A          N/A
12:46:02   4.45   5.0    3.45      N/A       N/A          N/A
12:46:58   3.45   10.0   4.45      3.45      5.0          N/A
12:47:45   3.00   -5.0   3.45      4.45      10.0         5.0
12:48:05   4.00   10.0   3.00      3.45      -5.0         10.0
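The code shown on the original slide is not part of this transcript. As a stand-in, the feature table can be rebuilt with a plain-Python sketch that differences x once (it is integrated of order 1) and then lags y and d(x) by one and two rows. `difference` and `lag` are hypothetical helpers that shift by row position; they are not the spark-timeseries calls used in the talk.

```python
def difference(values):
    """First difference; the first entry has no predecessor, so it is None."""
    return [None] + [b - a for a, b in zip(values, values[1:])]

def lag(values, k):
    """Shift values k rows later, padding the start with None."""
    return [None] * k + values[:len(values) - k]

y = [3.45, 4.45, 3.45, 3.00, 4.00]
x = [25.0, 30.0, 40.0, 35.0, 45.0]

dx = difference(x)
features = {
    "y": y,
    "d(x)": dx,
    "Lag1(y)": lag(y, 1),
    "Lag2(y)": lag(y, 2),
    "Lag1(d(x))": lag(dx, 1),
    "Lag2(d(x))": lag(dx, 2),
}
```

Only the last two rows have every feature populated, which is the usual price of lagging: the first few observations cannot be used for estimation.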
Code example #2
• We will use Holt-Winters to forecast some seasonal data.
• Holt-Winters: exponential moving average applied to the level, trend and
seasonal components of the time series, then combined into a global
forecast.
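The idea of smoothing level, trend and seasonal components separately and then combining them can be sketched in plain Python for the additive variant. This is a textbook-style illustration, not the library's implementation: the function name, the simple initialisation from the first two seasons, and the toy data are all assumptions, and the input must cover at least two full periods.

```python
def holt_winters_additive(values, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters: smooth level, trend and seasonal terms, then forecast."""
    # Assumed initialisation: level = mean of season 1, trend from seasons 1 and 2.
    level = sum(values[:period]) / period
    second = sum(values[period:2 * period]) / period
    trend = (second - level) / period
    seasonal = [v - level for v in values[:period]]

    for t in range(period, len(values)):
        s_prev = seasonal[t - period]          # seasonal term one period ago
        prev_level = level
        level = alpha * (values[t] - s_prev) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal.append(gamma * (values[t] - level) + (1 - gamma) * s_prev)

    n = len(values)
    # Combine: last level, extrapolated trend, and the matching seasonal term.
    return [level + h * trend + seasonal[n - period + (h - 1) % period]
            for h in range(1, horizon + 1)]

history = [10.0, 20.0, 30.0, 20.0] * 3         # purely seasonal toy data, period 4
forecast = holt_winters_additive(history, period=4, alpha=0.5, beta=0.5,
                                 gamma=0.5, horizon=4)
```

On this trend-free, perfectly periodic input the forecast simply repeats the seasonal pattern. The multiplicative variant replaces the additive seasonal term with a ratio.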
[Figure: line chart of the Passengers series; horizontal axis ticks 1 to 141, vertical axis 0 to 700]
[Figure: Holt-Winters forecast validation (additive & multiplicative); series: Actual, Additive forecast, Multiplicative forecast; horizontal axis 1 to 13, vertical axis 350 to 650]
Thank You.
e-mail: souellette@faimdata.com


Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
