Big Data, Machine Learning, and Econometrics
Marc Francke
[email protected]
1 Big data
2 Statistical modeling
3 Algorithmic modeling
4 Examples
AVM
Street View
Text data
Clicks on internet pages
5 Concluding remarks
6 References
Big data
- Higher frequency
- More detailed information
- Narrower segmentation
  - compared with census data, household surveys, ...
- Big data complement, do not replace, small data
- Big data (text, images, audio, video, sensor data ...) require
new techniques for analysis
Statistical modeling
- Input → Statistical model → Output
- Input: transaction price, house size, number of rooms, ...
- Model specification / data generating process assumed to be
known, depending on unknown coefficients
  - Functional form
  - Stochastic assumptions
$$y \mid X, \beta, \sigma \sim N(X\beta, \sigma^2 I)$$
- Model estimation (of β, σ): OLS, WLS, maximum likelihood,
minimize loss function, ...
- Output: Predicted values, coefficients, confidence bounds,
(in sample) model fit, ...
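The input-model-output chain above can be sketched in a few lines of NumPy. This is only an illustration under the normal linear model; the data are synthetic, and the variable names (`log_size`, `rooms`) and coefficient values are assumptions, not taken from the slides.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate beta and sigma in y = X beta + eps, eps ~ N(0, sigma^2 I)."""
    n, k = X.shape
    # Least-squares solution; under normality also the ML estimate of beta
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    # Unbiased estimate of sigma uses n - k degrees of freedom
    sigma_hat = np.sqrt(residuals @ residuals / (n - k))
    # Standard errors from the diagonal of sigma^2 (X'X)^{-1}
    cov_beta = sigma_hat**2 * np.linalg.inv(X.T @ X)
    std_err = np.sqrt(np.diag(cov_beta))
    return beta_hat, sigma_hat, std_err

# Synthetic input (hypothetical numbers): log price on constant, log size, rooms
rng = np.random.default_rng(0)
n = 500
log_size = rng.normal(4.7, 0.3, n)
rooms = rng.integers(2, 8, n).astype(float)
X = np.column_stack([np.ones(n), log_size, rooms])
true_beta = np.array([8.0, 0.8, 0.05])
y = X @ true_beta + rng.normal(0.0, 0.2, n)

# Output: coefficients, residual scale, and approximate 95% confidence bounds
beta_hat, sigma_hat, std_err = fit_ols(X, y)
lower = beta_hat - 1.96 * std_err
upper = beta_hat + 1.96 * std_err
```

The confidence bounds are valid only if the assumed data generating process holds, which is the point of the slide: the model is taken as known up to β and σ.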
Machine learning
Classification
Random Forest
Neural networks
- Collection of computational units: artificial neurons
- Each neuron receives a weighted sum of the outputs of the preceding
layer and applies a non-linear activation function
- ‘Learning’ is optimizing connection weights
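As a rough illustration of these three points, the sketch below builds a network with one hidden layer in NumPy and "learns" by gradient descent on the mean squared error. The layer size, learning rate, ReLU activation, and synthetic target are all assumptions chosen for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict(X, W1, b1, W2, b2):
    hidden = relu(X @ W1 + b1)          # weighted sum, then non-linearity
    return (hidden @ W2 + b2).ravel()   # linear output neuron

def mse(X, y, params):
    return float(np.mean((predict(X, *params) - y) ** 2))

def train(X, y, params, lr=0.02, steps=1000):
    W1, b1, W2, b2 = params
    n = len(y)
    for _ in range(steps):
        z1 = X @ W1 + b1
        h = relu(z1)
        err = (h @ W2 + b2).ravel() - y
        # Backpropagation: gradients of the mean squared error
        g_out = (2.0 / n) * err[:, None]
        g_W2, g_b2 = h.T @ g_out, g_out.sum(axis=0)
        g_z1 = (g_out @ W2.T) * (z1 > 0)
        g_W1, g_b1 = X.T @ g_z1, g_z1.sum(axis=0)
        # 'Learning': move the connection weights against the gradient
        W1, b1 = W1 - lr * g_W1, b1 - lr * g_b1
        W2, b2 = W2 - lr * g_W2, b2 - lr * g_b2
    return W1, b1, W2, b2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2   # non-linear synthetic target

init = (rng.normal(scale=0.5, size=(3, 8)), np.zeros(8),
        rng.normal(scale=0.5, size=(8, 1)), np.zeros(1))
initial_loss = mse(X, y, init)
fitted = train(X, y, init)
final_loss = mse(X, y, fitted)
```

In practice one would use an off-the-shelf package rather than hand-written gradients; the manual version only makes the mechanics visible.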
[Figure: supervised learning workflow. Features and target enter the training step; hyperparameters are fine-tuned (e.g. grid search, gradient-based), and the best hyperparameters are found in the training step.]
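A grid search over a single hyperparameter can be sketched as follows. The example uses ridge regression with penalty `alpha` purely as a stand-in model; the data, the grid, and the simple train/validation split are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate: (X'X + alpha I)^{-1} X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

def grid_search(X_train, y_train, X_val, y_val, alphas):
    """Fit on the training part, score each alpha on the validation part."""
    scores = {}
    for alpha in alphas:
        beta = ridge_fit(X_train, y_train, alpha)
        scores[alpha] = float(np.mean((y_val - X_val @ beta) ** 2))
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores

# Synthetic data: many regressors, only the first five matter
rng = np.random.default_rng(2)
n, k = 120, 40
X = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[:5] = 1.0
y = X @ beta_true + rng.normal(scale=2.0, size=n)

X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha, scores = grid_search(X_train, y_train, X_val, y_val, alphas)
```

A final, untouched test sample would still be needed to report performance, because the validation error of the selected `alpha` is optimistically biased.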
Added value ML
- Machine learning algorithms are now technically easy to use: ‘off
the shelf’ packages available in R and Python
- Systematic data cleaning procedures
- Scaling / Transformation / Normalization of variables requires
expert opinion
- Feature engineering
  - Additional explanatory variables from text, pictures, audio, ...
  - These variables can be used within an (econometric) model
- Flexible functional forms / many variables / interactions
  - “Simply including all pairwise interactions would be infeasible as it
    produces more regressors than data points. ... Machine learning
    searches for these interactions automatically” (Mullainathan
    and Spiess, 2017, JEP)
- Systematic model comparison
  - Common practice in econometrics: only the best-performing model is shown
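The quoted point about pairwise interactions is simple arithmetic: p variables give p(p-1)/2 pairwise products. A small sketch, with n = 500 observations and p = 40 variables chosen arbitrarily for illustration:

```python
from itertools import combinations

import numpy as np

def pairwise_interactions(X):
    """Return the products of all pairs of distinct columns of X."""
    pairs = combinations(range(X.shape[1]), 2)
    return np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

rng = np.random.default_rng(3)
n, p = 500, 40
X = rng.normal(size=(n, p))
Z = pairwise_interactions(X)

# 40 main effects + 40 * 39 / 2 = 780 interactions = 820 regressors
n_regressors = p + Z.shape[1]
# With more regressors than observations, X'X is singular and OLS is not unique
ols_is_feasible = n_regressors <= n
```

This is why some form of selection or regularization is needed once interactions are included.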
Marc Francke (UvA) Big data, machine learning, and econometrics 20 / 48
Algorithmic modeling
Weak points ML
Causality
- What is the effect of a treatment?
  - observed difference in outcome =
    average treatment effect on the treated + selection bias
  - challenge: to get rid of selection bias
- Econometrics/statistics developed methods to deal with causality
  - instrumental variables, regression discontinuity, diff-in-diff, and
    various forms of natural and designed experiments
  - Causal effect of treatment is not identified without making
    (non-testable) assumptions: justification is important in research
- Instrumental variable approach
  - Linear equation $y_i = x_i'\beta + \varepsilon_i$
  - However, $E[x_i \varepsilon_i] \neq 0$ and $E[z_i \varepsilon_i] = 0$
  - Solution 2SLS:
    1. regress x on z, resulting in prediction x̂
    2. regress y on x̂ to get an estimate of β
  - First step is essentially a prediction problem: supervised ML
    - weak instruments
    - overfitting
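The two 2SLS steps can be illustrated on simulated data, where the true β is known. Everything in the sketch (the confounder `u`, the coefficient values, the sample size) is an assumption made for the simulation, with a single regressor and a single instrument.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage_least_squares(y, x, z):
    n = len(y)
    Z = np.column_stack([np.ones(n), z])
    # Stage 1: regress x on z and keep the prediction x_hat
    x_hat = Z @ ols(Z, x)
    # Stage 2: regress y on x_hat
    X_hat = np.column_stack([np.ones(n), x_hat])
    return ols(X_hat, y)

rng = np.random.default_rng(42)
n = 20_000
z = rng.normal(size=n)                   # instrument
u = rng.normal(size=n)                   # unobserved confounder
x = 1.0 * z + u + rng.normal(size=n)     # regressor, driven by z and u
eps = 2.0 * u + rng.normal(size=n)       # E[x eps] != 0, E[z eps] = 0
y = 1.0 + 0.5 * x + eps                  # true beta = 0.5

beta_ols = ols(np.column_stack([np.ones(n), x]), y)[1]
beta_iv = two_stage_least_squares(y, x, z)[1]
```

In this design OLS converges to roughly 0.5 + 2/3, while 2SLS recovers a value close to 0.5. Standard errors taken naively from the second-stage regression are not valid and need the usual 2SLS correction. Replacing stage 1 with a flexible ML predictor is where the overfitting concern enters.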
Comparison: (supervised) ML versus econometrics
Econometrics is alchemy
Keynes: ‘No one could be more frank, more painstaking, more free
from subjective bias or parti pris than Professor Tinbergen. There is no
one, therefore, so far as human qualities go, whom it would be safer to
trust with black magic. That there is anyone I would trust with it at the
present stage, or that this brand of statistical alchemy is ripe to
become a branch of science, I am not yet persuaded. But Newton,
Boyle and Locke all played with Alchemy. So let him continue.’
- Precision of valuations:
example for commercial real estate (CRE) (Kok, Koponen, and Martínez-Barbosa, 2017)
- Hybrid valuation model for residential real estate (RRE) (Ortec Finance, 2017)
- Development of hybrid automated valuation model: (time series
and spatial) econometrics and ML
- Focus on out-of-sample model performance (precision) and
interpretability
- Transaction prices of 740,000 houses in the Netherlands in the
period 2009–2016
- Model trained on transactions up to 2015
- Out-of-sample prediction of transactions in 2016
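A time-based split of this kind can be sketched as below. The data are synthetic and the model is a plain linear regression, so this is not the hybrid model itself; it only illustrates training on transactions up to and including 2015 (an assumed reading of "up to 2015") and measuring precision on 2016.

```python
import numpy as np

# Synthetic transactions (hypothetical): sale year, log house size, log price
rng = np.random.default_rng(7)
n = 5000
year = rng.integers(2009, 2017, n)       # 2009, ..., 2016
log_size = rng.normal(4.7, 0.3, n)
log_price = (8.0 + 0.8 * log_size + 0.02 * (year - 2009)
             + rng.normal(0.0, 0.15, n))

def design(log_size, year):
    """Constant, log size, and a linear time trend."""
    return np.column_stack([np.ones(len(year)), log_size, year - 2009])

train = year <= 2015                     # estimation sample
test = year == 2016                      # held-out sample

beta = np.linalg.lstsq(design(log_size[train], year[train]),
                       log_price[train], rcond=None)[0]
pred = design(log_size[test], year[test]) @ beta

# Out-of-sample precision: median absolute percentage error in price levels
ape = np.abs(np.exp(pred) / np.exp(log_price[test]) - 1.0)
mdape = float(np.median(ape))
```

Splitting by time rather than at random mimics the actual use of a valuation model, which always predicts prices for a period it has not yet seen.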
Data
Street View
Testing theories
Text data
Feature extraction
Funda
Prices
Clicks
Liquidity
Concluding remarks
- Relabeling
- OLS versus ML: applied research oversimplifies
- ML provides additional toolkit for econometricians
  - Functional form flexibility (at the cost of interpretation)
  - Creation of additional covariates (unsupervised ML)
    Satellite & language → new variables (sentiment in media)
  - Focus on out-of-sample performance and model averaging
    (although partly unsolved issues)
- A priori structure is often needed when the number of observations is limited
- Do invest in knowledge on Bayesian statistics & econometrics
  - Variable selection: spike-and-slab regression
  - Large literature on model averaging (Steel, 2011)
  - Model comparison: training & test samples (O’Hagan, 2004, Ch. 7)
  - Complex model structures for large data sets: Variational methods
    (Spantini, Bigoni, and Marzouk, 2017) are feasible
Books
- Books
  - Bishop (2006):
    Pattern recognition and machine learning
  - Müller and Guido (2017):
    Introduction to machine learning with Python: A guide for data
    scientists
- Books available online
  - Hastie, Tibshirani, and Friedman (2009):
    The elements of statistical learning: Data mining, inference, and
    prediction
  - Barber (2017):
    Bayesian reasoning and machine learning
  - Murphy (2012):
    Machine learning: A probabilistic perspective
References
Athey, S. (2018). “The impact of machine learning on economics”. In: Economics of Artificial
Intelligence. University of Chicago Press.
Barber, D. (2017). Bayesian reasoning and machine learning. Cambridge University Press.
Bishop, C. (2006). Pattern recognition and machine learning. Springer-Verlag New York.
Breiman, L. (2001). “Statistical modeling: The two cultures (with comments and a rejoinder by the
author)”. In: Statistical Science 16.3, pp. 199–231.
Greene, W. H. (2008). Econometric Analysis. 6th ed. Prentice Hall.
Hastie, T., R. Tibshirani, and J. H. Friedman (2009). The elements of statistical learning: Data
mining, inference, and prediction. 2nd ed. Springer series in statistics. Springer-Verlag New
York.
Hendry, D. F. (1980). “Econometrics – Alchemy or science?” In: Economica 47.188, pp. 387–406.
Kok, N., E. L. Koponen, and C. A. Martínez-Barbosa (2017). “Big Data in real estate? From
manual appraisal to automated valuation”. In: The Journal of Portfolio Management 43.6,
pp. 202–211.
Mullainathan, S. and J. Spiess (2017). “Machine learning: An applied econometric approach”. In:
Journal of Economic Perspectives 31.2, pp. 87–106.
Müller, A. C. and S. Guido (2017). Introduction to machine learning with Python: A guide for data
scientists. O’Reilly Media.
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
Naik, N., S. D. Kominers, R. Raskar, E. L. Glaeser, and C. A. Hidalgo (2017). “Computer vision
uncovers predictors of physical urban change”. In: Proceedings of the National Academy of
Sciences 114.29, pp. 7571–7576.
O’Hagan, A. (2004). Kendall’s Advanced Theory of Statistics. Vol. 2B: Bayesian Inference.
2nd ed. London: Arnold.
Spantini, A., D. Bigoni, and Y. Marzouk (2017). “Inference via low-dimensional couplings”. In:
arXiv preprint arXiv:1703.06131.
Steel, M. F. J. (2011). “Bayesian model averaging and forecasting”. In: Bulletin of EU and US
Inflation and Macroeconomic Analysis 200, pp. 30–41.
Stock, J. H. and M. H. Watson (2012). Introduction to Econometrics. 3rd ed. Pearson.
Theil, H. (1971). Principles of Econometrics. John Wiley & Sons.
Varian, H. R. (2014). “Big data: New tricks for econometrics”. In: Journal of Economic
Perspectives 28.2, pp. 3–28.
Verbeek, M. (2008). A Guide to Modern Econometrics. 3rd ed. John Wiley & Sons, Ltd.