Leverage Financial News to Predict Stock Price Movements
Using Word Embeddings and Deep Neural Networks
Yangtuo Peng and Hui Jiang
Department of Electrical Engineering and Computer Science
York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
emails: [email protected], [email protected]
Abstract

Financial news contains useful information on public companies and the market. In this paper we apply the popular word embedding methods and deep neural networks to leverage financial news to predict stock price movements in the market. Experimental results have shown that our proposed methods are simple but very effective, and can significantly improve the stock prediction accuracy on a standard financial database over the baseline system that uses only the historical price information.

1 Introduction

In the past few years, deep neural networks (DNNs) have achieved huge successes in many data modeling and prediction tasks, ranging from speech recognition and computer vision to natural language processing. In this paper, we are interested in applying the powerful deep learning methods to financial data modeling to predict stock price movements.

Traditionally, neural networks have been used to model stock prices as time series for forecasting purposes, such as in (Kaastra and Boyd, 1991; Adya and Collopy, 1991; Chan et al., 2000; Skabar and Cloete, 2002; Zhu et al., 2008). In these earlier works, due to the limited training data and computing power available at the time, normally shallow neural networks were used to model various types of features extracted from stock price data sets, such as historical prices, trading volumes, etc., in order to predict future stock yields and market returns. More recently, in the natural language processing community, many methods have been proposed to explore additional information (mainly online text data) for stock forecasting, such as financial news (Xie et al., 2013; Ding et al., 2014), Twitter sentiments (Si et al., 2013; Si et al., 2014), and microblogs (Bar-Haim et al., 2011). For example, (Xie et al., 2013) propose to use semantic frame parsers to generalize from sentences to scenarios to detect the (positive or negative) roles of specific companies, where support vector machines with tree kernels are used as predictive models. On the other hand, (Ding et al., 2014) propose to use various lexical and syntactic constraints to extract event features for stock forecasting, where they have investigated both linear classifiers and deep neural networks as predictive models.

In this paper, we propose to use the recent word embedding methods (Mikolov et al., 2013b) to select features from on-line financial news corpora, and employ deep neural networks (DNNs) to predict the future stock movements based on the extracted features. Experimental results have shown that the features derived from financial news are very useful and can significantly improve the prediction accuracy over the baseline system that relies only on the historical price information.

2 Our Approach

In this paper, we use deep neural networks (DNNs) as our predictive model, which takes as input the features extracted from both historical price information and on-line financial news to predict the stock movements in the future (either up or down).

2.1 Deep Neural Networks

The structure of the DNNs used in this paper is a conventional multi-layer perceptron with many hidden layers. An L-layer DNN consists of L − 1 hidden nonlinear layers and one output layer. The output layer is used to model the posterior probability of each output target. In this paper, we use the rectified linear activation function, i.e., f(x) = max(0, x), to compute the outputs from the activations in each hidden layer, which are in turn fed to the next layer as inputs. For the output layer, we use the softmax function to compute posterior probabilities over two nodes, standing for stock-up and stock-down.
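To make the model concrete, the following is a minimal sketch of such a network in PyTorch, used here as a modern stand-in since the paper does not name a toolkit. The layer sizes follow the best configuration reported in Section 4.1 (4 hidden layers of 1,024 rectified linear units); the input dimensionality, optimizer, and training step are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def build_dnn(input_dim: int, hidden_dim: int = 1024, n_hidden: int = 4) -> nn.Sequential:
    """Multi-layer perceptron with ReLU hidden layers and a 2-way softmax
    output (stock-up vs. stock-down). Layer sizes follow Section 4.1;
    everything else here is generic."""
    layers, prev = [], input_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(prev, hidden_dim), nn.ReLU()]
        prev = hidden_dim
    layers.append(nn.Linear(prev, 2))  # logits; softmax is folded into the loss
    return nn.Sequential(*layers)

# e.g., 12 price features + two 1000-dim news vectors + 10 category tags
model = build_dnn(input_dim=2022)
criterion = nn.CrossEntropyLoss()  # log-softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One illustrative training step on a random mini-batch.
x = torch.randn(32, 2022)
y = torch.randint(0, 2, (32,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```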
2.2 Features from historical price data

In this paper, for each target stock on a target date, we choose the previous five days' closing prices and concatenate them to form an input feature vector for DNNs: P = (p_{t−5}, p_{t−4}, p_{t−3}, p_{t−2}, p_{t−1}), where t denotes the target date and p_m denotes the closing price on date m. We then normalize all prices by the mean and variance calculated from all closing prices of this stock in the training set. In addition, we also compute the first and second order differences among the five days' closing prices, which are appended as extra feature vectors. For example, we compute the first order difference as follows: ∆P = (p_{t−4}, p_{t−3}, p_{t−2}, p_{t−1}) − (p_{t−5}, p_{t−4}, p_{t−3}, p_{t−2}). In the same way, the second order difference ∆∆P is calculated by taking the difference between two adjacent values in ∆P. Finally, for each target stock on a particular date, the feature vector representing the historical price information consists of P, ∆P and ∆∆P.
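This feature construction is straightforward to express in code. Below is a small sketch in Python/NumPy, assuming prices holds the stock's closing-price series and mean/std are the training-set statistics used for normalization; np.diff computes exactly the adjacent differences described above.

```python
import numpy as np

def price_features(prices: np.ndarray, t: int,
                   mean: float, std: float) -> np.ndarray:
    """Build the 12-dim price feature vector for target date index t:
    the previous five normalized closing prices P, plus first (dP)
    and second (ddP) order differences."""
    p = (prices[t - 5:t] - mean) / std   # P: five previous closing prices
    dp = np.diff(p)                      # dP: 4 first-order differences
    ddp = np.diff(dp)                    # ddP: 3 second-order differences
    return np.concatenate([p, dp, ddp]) # 5 + 4 + 3 = 12 dimensions

# Toy example; mean/std would come from the training set in practice.
prices = np.array([10.0, 10.2, 10.1, 10.4, 10.3, 10.5, 10.6])
feat = price_features(prices, t=6, mean=prices.mean(), std=prices.std())
```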
2.3 Financial news features

In order to extract fixed-size features suitable for DNNs from financial news corpora, we need to pre-process the text data. For all financial articles, we first split them into sentences. We only keep those sentences that mention at least one stock name or one public company. Each sentence is labelled by the publication date of the original article and the mentioned stock name. It is possible that multiple stocks are mentioned in one sentence. In this case, the sentence is labelled several times, once for each mentioned stock. We then group these sentences by their publication dates and the underlying stock names to form the samples. Each sample contains a list of sentences that were published on the same date and mention the same stock or company. Moreover, each sample is labelled as positive ("price-up") or negative ("price-down") based on the stock's next day's closing price, consulted from the CRSP financial database (Booth, 2012). In the following, we introduce our method to extract three types of features from each sample.
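Before moving on, here is a rough sketch of the grouping and labelling step just described, assuming a pandas DataFrame sentences with one row per (sentence, mentioned stock) pair; the column names and the next_day_up lookup are hypothetical stand-ins for the CRSP-based labelling.

```python
import pandas as pd

# Hypothetical per-sentence table: one row per (sentence, mentioned stock).
sentences = pd.DataFrame({
    "date":  ["2013-01-02", "2013-01-02", "2013-01-02"],
    "stock": ["AAPL", "AAPL", "MSFT"],
    "text":  ["Apple slipped behind Samsung ...",
              "Apple unveiled a new product ...",
              "Microsoft gained after the report ..."],
})

def next_day_up(stock: str, date: str) -> int:
    """Placeholder: would look up CRSP closing prices and return 1 if
    the next day's close is higher ('price-up'), else 0 ('price-down')."""
    return 1

# One sample per (date, stock): all sentences published on that date
# about that stock, labelled by the next day's price movement.
samples = (sentences.groupby(["date", "stock"])["text"]
           .apply(list)
           .reset_index())
samples["label"] = [next_day_up(s, d)
                    for d, s in zip(samples["date"], samples["stock"])]
```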
(1) Bag of keywords (BoK): We first select the keywords based on the recent word embedding methods in (Mikolov et al., 2013a; Mikolov et al., 2013b). Using the popular word2vec method from Google (https://code.google.com/p/word2vec/), we first compute the vector representations for all words occurring in the training set. Secondly, we manually select a small set of seed words, i.e., nine words of {surge, rise, shrink, jump, drop, fall, plunge, gain, slump} in this work, which are believed to be strong indicators of stock price movements. Next, these seed words are used to search for other useful keywords based on the cosine distances calculated between the word vector of each seed word and those of the other words occurring in the training set. For example, based on the pre-calculated word vectors, we have found other words, such as rebound, decline, tumble, slowdown, climb, which are very close to at least one of the seed words. In this way, we have searched all words occurring in the training set and kept the top 1,000 words (including the nine seed words) as the keywords for our prediction task. Finally, a 1000-dimension feature vector, called bag-of-keywords or BoK, is generated for each sample. Each dimension of the BoK vector is the TFIDF score computed for each selected keyword from the whole training corpus.
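A sketch of the keyword-expansion step is given below, using the gensim library as a stand-in for Google's word2vec tool; the toy corpus, the hyperparameters, and the use of the maximum seed similarity as the ranking criterion are illustrative assumptions.

```python
from gensim.models import Word2Vec

SEEDS = ["surge", "rise", "shrink", "jump", "drop",
         "fall", "plunge", "gain", "slump"]

# Placeholder corpus: the real input is all tokenized training sentences.
corpus_sentences = [
    ["apple", "shares", "surge", "after", "earnings"],
    ["microsoft", "shares", "fall", "on", "weak", "outlook"],
]
model = Word2Vec(corpus_sentences, vector_size=100, window=5,
                 min_count=1, seed=1)

def seed_score(word: str) -> float:
    """Closeness of a word to the seed set, taken as the maximum
    cosine similarity to any seed word present in the vocabulary."""
    return max(model.wv.similarity(word, s) for s in SEEDS if s in model.wv)

vocab = [w for w in model.wv.index_to_key if w not in SEEDS]
expanded = sorted(vocab, key=seed_score, reverse=True)[:1000 - len(SEEDS)]
keywords = SEEDS + expanded  # the 1,000 keywords used for the BoK features
```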
(2) Polarity score (PS): We further compute so-called polarity scores (Turney and Littman, 2003; Turney and Pantel, 2010) to measure how each keyword is related to stock movements and how each keyword applies to a target stock in each sentence. To do this, we first compute the point-wise mutual information for each keyword w: PMI(w, pos) = log [freq(w, pos) × N / (freq(w) × freq(pos))], where freq(w, pos) denotes the frequency of the keyword w occurring in all positive samples, N denotes the total number of samples in the training set, freq(w) denotes the total number of times the keyword w occurs in the whole training set, and freq(pos) denotes the total number of positive samples in the training set. Furthermore, we calculate the polarity score for each keyword w as: PS(w) = PMI(w, pos) − PMI(w, neg). The polarity score PS(w) thus measures how (either positively or negatively) each keyword is related to stock movements, and by how much.
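A minimal sketch of this computation follows, assuming samples is a list of (tokens, label) pairs with label 1 for positive ("price-up") samples and keywords is the set of 1,000 selected keywords; the handling of keywords with zero counts in either class is our assumption, since the paper does not specify it.

```python
import math
from collections import Counter

def polarity_scores(samples, keywords):
    """PS(w) = PMI(w, pos) - PMI(w, neg), with
    PMI(w, c) = log(freq(w, c) * N / (freq(w) * freq(c)))."""
    N = len(samples)
    n_pos = sum(1 for _, y in samples if y == 1)
    n_neg = N - n_pos
    freq_w = Counter()      # keyword counts over the whole training set
    freq_pos = Counter()    # keyword counts within positive samples
    freq_neg = Counter()    # keyword counts within negative samples
    for tokens, y in samples:
        for w in tokens:
            if w in keywords:
                freq_w[w] += 1
                (freq_pos if y == 1 else freq_neg)[w] += 1
    ps = {}
    for w in keywords:
        if freq_w[w] == 0 or freq_pos[w] == 0 or freq_neg[w] == 0:
            ps[w] = 0.0  # assumption: undefined PMI terms treated as neutral
            continue
        pmi_pos = math.log(freq_pos[w] * N / (freq_w[w] * n_pos))
        pmi_neg = math.log(freq_neg[w] * N / (freq_w[w] * n_neg))
        ps[w] = pmi_pos - pmi_neg
    return ps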
Next, for each sentence in all samples, we need to detect how each keyword is related to the mentioned stock. To do this, we use the Stanford parser (Marneffe et al., 2006) to detect whether the target stock is a subject of the keyword or not. If the target stock is not the subject of the keyword in the sentence, we assume the keyword is oppositely related to the underlying stock, and we therefore flip the sign of the polarity score. Otherwise, if the target stock is the subject of the keyword, we keep the keyword's polarity score as it is. For example, in a sentence like "Apple slipped behind Samsung and Microsoft in a 2013 customer experience survey from Forrester Research", we first identify the keyword slipped; based on the parsing result, we know Apple is the subject while Samsung and Microsoft are not. Therefore, if this sentence is used as a sample for Apple, the above polarity score of "slipped" is used directly. However, if this sentence is used as a sample for Samsung or Microsoft, the polarity score of "slipped" is flipped by multiplying it by −1.

Finally, the resultant polarity scores are multiplied by the TFIDF scores to generate another 1000-dimension feature vector for each sample.
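The subject check can be illustrated as follows. The paper uses the Stanford parser; this sketch substitutes spaCy's dependency labels (nsubj/nsubjpass) to the same effect, and the function name and the simple token-level matching of stock names are simplifying assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # substitute for the Stanford parser

def signed_score(sentence: str, stock: str, keyword: str, ps: float) -> float:
    """Flip the polarity score when the target stock is not the
    grammatical subject of the keyword in this sentence."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.text.lower() == keyword:
            subjects = [c.text.lower() for c in tok.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            return ps if stock.lower() in subjects else -ps
    return 0.0  # keyword absent from this sentence

score = signed_score("Apple slipped behind Samsung and Microsoft in a 2013 "
                     "customer experience survey from Forrester Research",
                     stock="Samsung", keyword="slipped", ps=-1.2)
# "Apple" is the subject of "slipped", so for Samsung the score flips to +1.2
```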
(3) Category tag (CT): We further define a list of categories that may indicate a specific event or activity of a public company, which we call category tags. In this paper, the defined category tags include: new-product, acquisition, price-rise, price-drop, law-suit, fiscal-report, investment, bankrupt, government, analyst-highlights. Each category is first manually assigned a few words that are closely related to the category. For example, we have chosen released, publish, presented, unveil as the list of seed words for the category new-product, which indicates a company's announcement of new products. Similarly, we use the above word embedding model to automatically expand each word list by searching for more words that are close, in cosine distance, to the selected seed words. In this paper, we choose the top 100 words to assign to each category.

After we have collected all keywords for each category, for each sample, we count the total number of occurrences of the words under each category, and then take the logarithm to obtain a feature vector V = (log N_1, log N_2, log N_3, ..., log N_C), where N_c denotes the total number of times the words in category c appear in the sample.
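A small sketch of the CT feature computation is shown below, with hypothetical, truncated category lexicons; the add-one smoothing inside the logarithm is our assumption, since the paper does not say how zero counts are handled.

```python
import math

# Hypothetical expanded category lexicons (the paper assigns the top
# 100 words per category; truncated here for illustration).
CATEGORIES = {
    "new-product": {"released", "publish", "presented", "unveil"},
    "acquisition": {"acquire", "merger", "takeover"},
    "price-rise":  {"surge", "jump", "climb"},
}

def category_features(tokens):
    """V = (log N_1, ..., log N_C): log-counts of category-word
    occurrences in one sample, with add-one smoothing so the
    logarithm is always defined."""
    feats = []
    for words in CATEGORIES.values():
        n = sum(1 for t in tokens if t in words)
        feats.append(math.log(n + 1))
    return feats

v = category_features(["apple", "unveil", "new", "iphone", "surge"])
```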
2.4 Predicting Unseen Stocks via Correlation Graph

There are a large number of stocks trading in the market. However, we normally can find only a fraction of them mentioned in daily financial news. Hence, for each date, the above method can only predict those stocks mentioned in the news. In this section, we propose a new method to extend the predictions to more stocks that may not be directly mentioned in the financial news. Here we propose to use a stock correlation graph, shown in Figure 1, to predict those unseen stocks. The stock correlation graph is an undirected graph, where each node represents a stock and the arc between two nodes represents the correlation between these two stocks. For example, if some stocks in the graph are mentioned in the news on a particular day, we first use the above method to predict these mentioned stocks. Afterwards, the predictions are propagated along the arcs in the graph to generate predictions for those unseen stocks.

[Figure 1: Illustration of a part of the correlation graph.]

(1) Build the graph: We choose the top 5,000 stocks from the CRSP database (Booth, 2012) to construct the correlation graph. Each pair of stocks in the collection is selected in turn, and their closing prices are aligned over the related dates (between 2006/01/01 and 2012/12/31). Then we calculate the correlation coefficient between the closing prices of these two stocks. The computed correlation coefficient (between −1 and 1) is attached to the arc connecting these two stocks in the graph, indicating their price correlation. The correlation coefficients are calculated for every pair of stocks from the collection of 5,000 stocks. In this paper we only keep the arcs with an absolute correlation value greater than 0.8; all other edges are considered unreliable and pruned from the graph. A tiny fraction of the resulting graph is shown in Figure 1.

(2) Predict unseen stocks: In order to predict price movements of unseen stocks, we first take the prediction results of the mentioned stocks from the DNN outputs, from which we construct a 5000-dimension vector x. Each dimension of x corresponds to one stock, and we set zeros for all unseen stocks. The graph propagation process can then be mathematically represented as a matrix multiplication: x′ = Ax, where A is a symmetric matrix denoting all correlation weights in the graph. The graph propagation, i.e., the matrix multiplication, may be repeated several times until the prediction x′ converges.
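Both steps have a compact NumPy rendering, sketched below. Here prices is a hypothetical array of aligned closing-price series, and how the DNN outputs are encoded in x (signed scores with zeros for unseen stocks) is an assumption.

```python
import numpy as np

def build_correlation_graph(prices: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """prices: (n_stocks, n_days) aligned closing prices.
    Returns the symmetric weight matrix A, keeping only arcs with
    |correlation| > threshold (0.8 in the paper)."""
    A = np.corrcoef(prices)          # pairwise correlation coefficients
    A[np.abs(A) <= threshold] = 0.0  # prune unreliable edges
    np.fill_diagonal(A, 0.0)         # no self-loops
    return A

def propagate(A: np.ndarray, x: np.ndarray, n_iter: int = 1) -> np.ndarray:
    """Propagate predictions along the graph: x' = A x, possibly repeated.
    x holds signed DNN predictions for mentioned stocks, zeros otherwise."""
    for _ in range(n_iter):
        x = A @ x
    return x

# Toy example: 4 stocks, 1 mentioned (predicted "up" with score +1).
rng = np.random.default_rng(0)
prices = rng.standard_normal((4, 250)).cumsum(axis=1)
A = build_correlation_graph(prices)
x = np.array([1.0, 0.0, 0.0, 0.0])
x_new = propagate(A, x)
```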
3 Dataset

The financial news data we used in this paper are provided by (Ding et al., 2014), which contains 106,521 articles from Reuters and 447,145 from Bloomberg. The news articles were published in the time period from October 2006 to December 2013. The historical stock security data are obtained from the Centre for Research in Security Prices (CRSP) database (Booth, 2012). We only use the security data from 2006 to 2013 to match the time period of the financial news. Based on the samples' publication dates, we split the dataset into three sets: a training set (all samples between 2006-10-01 and 2012-12-31), a validation set (2013-01-01 to 2013-06-15) and a test set (2013-06-16 to 2013-12-31). The training set contains 65,646 samples, the validation set 10,941 samples, and the test set 9,911 samples.
4 Experiments

4.1 Stock Prediction using DNNs

In the first set of experiments, we use DNNs to predict each stock's price movement based on a variety of features, namely producing a polar prediction of the price movement on the next day (either price-up or price-down). Here we have trained a set of DNNs using different combinations of feature vectors and found that a DNN structure with 4 hidden layers (of 1024 hidden nodes each) yields the best performance on the validation set. We use the historical price feature alone to create the baseline, and the various features derived from the financial news are added on top of it. We measure the final performance by calculating the error rate on the test set. As shown in Table 1, the features derived from financial news can significantly improve the prediction accuracy, and we have obtained the best performance (an error rate of 43.13%) by using all the features discussed in Sections 2.2 and 2.3.

feature combination      error rate
price                    48.12%
price + BoK              46.02%
price + BoK + PS         43.96%
price + BoK + CT         45.86%
price + PS               45.00%
price + CT               46.10%
price + PS + CT          46.03%
price + BoK + PS + CT    43.13%

Table 1: Stock prediction error rates on the test set.

4.2 Predict Unseen Stocks via Correlation

Here we group all outputs from the DNNs based on the dates of all samples in the test set. For each date, we create a vector x based on the DNN prediction results for all observed stocks and zeros for all unseen stocks, as described in Section 2.4. Then, the vector is propagated through the correlation graph to generate another set of stock movement predictions. We may apply a threshold on the propagated vector to prune all low-confidence predictions. The remaining ones may be used to predict some stocks unseen in the test set. The prediction for all unseen stocks is compared with the actual stock movement on the next day. Experimental results are shown in Figure 2, where the left y-axis denotes the prediction accuracy and the right y-axis denotes the percentage of stocks predicted out of all 5,000 per day under each pruning threshold. For example, using a large threshold (0.9), we may predict with an accuracy of 52.44% on 354 extra unseen stocks per day, in addition to predicting only 110 stocks per day on the test set.

[Figure 2: Predict unseen stocks via correlation]
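The pruning-and-evaluation loop of this section can be sketched as follows, reusing the propagation output from Section 2.4; the ±1 encoding of actual movements and the use of |score| as the confidence measure are assumptions.

```python
import numpy as np

def evaluate_unseen(x_new: np.ndarray, actual: np.ndarray,
                    mentioned: np.ndarray, threshold: float = 0.9):
    """Keep only confident propagated predictions (|score| > threshold)
    for stocks not mentioned in the news, then compare the predicted
    sign with the actual next-day movement (+1 up / -1 down)."""
    confident = (np.abs(x_new) > threshold) & ~mentioned
    correct = np.sign(x_new[confident]) == actual[confident]
    n = confident.sum()
    return (correct.mean() if n else 0.0), n

accuracy, n_extra = evaluate_unseen(
    x_new=np.array([0.0, 1.3, -0.2, -1.1]),
    actual=np.array([1, 1, -1, 1]),
    mentioned=np.array([True, False, False, False]),
)
# accuracy == 0.5 over the 2 confident unseen stocks; n_extra == 2
```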
5 Conclusion

In this paper, we have proposed a simple method to leverage financial news to predict stock movements based on the popular word embedding and deep learning techniques. Our experiments have shown that financial news is very useful in stock prediction and that the proposed methods can significantly improve the prediction accuracy on a standard financial data set.
Acknowledgments

This work was supported in part by an NSERC Engage grant from the Canadian federal government.
References

[Adya and Collopy1991] Monica Adya and Fred Collopy. 1991. How effective are neural networks at forecasting and prediction? A review and evaluation. Journal of Forecasting, 17:481–495.

[Bar-Haim et al.2011] Roy Bar-Haim, Elad Dinur, Ronen Feldman, Moshe Fresko, and Guy Goldstein. 2011. Identifying and following expert investors in stock microblogs. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1310–1319, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

[Booth2012] Chicago Booth. 2012. CRSP Data Description Guide for the CRSP US Stock Database and CRSP US Indices Database. Center for Research in Security Prices, The University of Chicago Graduate School of Business (https://wrds-web.wharton.upenn.edu/wrds/index.cfm).

[Chan et al.2000] Man-Chung Chan, Chi-Cheong Wong, and Chi-Chung Lam. 2000. Financial time series forecasting by neural network using conjugate gradient learning algorithm and multiple linear regression weight initialization. Computing in Economics and Finance, 61.

[Ding et al.2014] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2014. Using structured events to predict stock price movement: An empirical investigation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1415–1425, Doha, Qatar, October. Association for Computational Linguistics.

[Kaastra and Boyd1991] Iebeling Kaastra and Milton Boyd. 1991. Designing a neural network for forecasting financial and economic time series. Neurocomputing, 10:215–236.

[Marneffe et al.2006] Marie-Catherine Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC.

[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

[Si et al.2013] Jianfeng Si, Arjun Mukherjee, Bing Liu, Qing Li, Huayi Li, and Xiaotie Deng. 2013. Exploiting topic based Twitter sentiment for stock prediction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 24–29, Sofia, Bulgaria, August. Association for Computational Linguistics.

[Si et al.2014] Jianfeng Si, Arjun Mukherjee, Bing Liu, Sinno Jialin Pan, Qing Li, and Huayi Li. 2014. Exploiting social relations and sentiment for stock prediction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1139–1145, Doha, Qatar, October. Association for Computational Linguistics.

[Skabar and Cloete2002] Andrew Skabar and Ian Cloete. 2002. Neural networks, financial trading and the efficient markets hypothesis. In Proceedings of the Twenty-Fifth Australasian Computer Science Conference (ACSC2002), Melbourne, Australia.

[Turney and Littman2003] Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4):315–346, October.

[Turney and Pantel2010] Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, January.

[Xie et al.2013] Boyi Xie, Rebecca J. Passonneau, Leon Wu, and Germán G. Creamer. 2013. Semantic frames to predict stock price movement. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883, Sofia, Bulgaria, August. Association for Computational Linguistics.

[Zhu et al.2008] Xiaotian Zhu, Hong Wang, Li Xu, and Huaizu Li. 2008. Predicting stock index increments by neural networks: The role of trading volume under different horizons. Expert Systems with Applications, 34:3043–3054.