
Zero to Deep Learning

Francesco Mosconi

June 12, 2019


Copyright © 2018-2019 Francesco Mosconi. All rights reserved. Printed in the United States of America

Published by Fullstack.io

Editors: Nate Murray and Ari Lerner

September 2018: v0.9.3
February 2019: v1.0
June 2019: v1.1

Zero to Deep Learning is a registered trademark of Catalit LLC.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and Catalit LLC was aware of a trademark claim,
the designations have been printed.

All rights reserved. No portion of the book manuscript may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means beyond the number of purchased copies, except for a single
backup or archival copy. The code may be used freely in your projects, commercial or otherwise.

The authors and publisher have taken care in preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.

Dedicated to our Future Selves. May we take care of one another and the world making good use of
Artificial Intelligence. And to my nieces, who will inherit the future we build today.
Table of Contents

0 Preface 1

0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

0.2 Why you should care . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

0.3 Who this book is for . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

0.4 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

0.5 About the author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

0.6 How to use this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


0.6.1 How to approach exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.6.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

0.7 Prerequisites - Is this book right for me? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


0.7.1 Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7.2 Partial derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7.3 Dot product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
0.7.4 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

0.8 Our development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


0.8.1 Miniconda Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.8.2 Conda Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.8.3 GPU enabled environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.8.4 Tensorflow 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.8.5 Jupyter notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.8.6 Environment check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
0.8.7 Python 3.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.8.8 Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.8.9 Other packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
0.8.10 Troubleshooting installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
0.8.11 Updating Conda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1 Getting Started 21


1.1 Deep Learning in the real world . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.2 First Deep Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


1.2.1 Numpy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.2 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.2.3 Scikit-Learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.2.4 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

1.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.3.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.3.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.3.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.3.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2 Data Manipulation 55

2.1 Many types of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


2.1.1 Tabular Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

2.2 Data Exploration with Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


2.2.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2.2 Selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.2.3 Unique Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2.4 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2.5 Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2.6 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.2.7 Pivot Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.2.8 Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.3 Visual data exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


2.3.1 Line Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.3.2 Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.3.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.3.4 Cumulative Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.3.5 Box plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.3.6 Subplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.3.7 Pie charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.3.8 Hexbin plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

2.4 Unstructured data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82


2.4.1 Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.4.2 Sound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.4.3 Text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

2.5 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86



2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.6.5 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3 Machine Learning 89

3.1 The purpose of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.2 Different types of learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3.3 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3.4 Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

3.5 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


3.5.1 Let’s draw some examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.5.2 Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.5.3 Finding the best model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.5.4 Linear Regression with Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.5.5 Evaluating Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.5.6 Train / Test split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

3.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114


3.6.1 Linear regression fail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.6.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.6.3 Train/Test split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

3.7 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


3.7.1 How to avoid overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

3.8 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

3.9 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133


3.9.1 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.9.2 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.9.3 F1 Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

3.10 Feature Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


3.10.1 Categorical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
3.10.2 Feature Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


3.11.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.11.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

4 Deep Learning 145

4.1 Beyond linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

4.2 Neural Network Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147


4.2.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.2.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.2.3 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.2.4 Deeper Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

4.3 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154


4.3.1 Tanh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.3.2 ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.3.3 Softplus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.3.4 SeLU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

4.4 Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160


4.4.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.4.2 Deep model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

4.5 Multiclass classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170


4.5.1 Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.5.2 Mutually exclusive classes and Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.5.3 The Iris dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178


4.7.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.7.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.7.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
4.7.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

5 Deep Learning Internals 181

5.1 This is a special chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

5.2 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182


5.2.1 Finite differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.2.2 Partial derivatives and the gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

5.3 Backpropagation intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

5.4 Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

5.5 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

5.6 Gradient calculation in Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192



5.7 The math of backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196


5.7.1 Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.7.2 Weight updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

5.8 Fully Connected Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203


5.8.1 Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.8.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

5.9 Matrix Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206


5.9.1 Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.9.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

5.10 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207


5.10.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
5.10.2 Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
5.10.3 Learning Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
5.10.4 Batch Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

5.11 Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223


5.11.1 Stochastic Gradient Descent (or Simply Go Down) and its variations . . . . . . . . . . 224

5.12 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

5.13 Inner layer representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

5.14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236


5.14.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
5.14.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
5.14.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
5.14.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

6 Convolutional Neural Networks 239

6.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

6.2 Machine Learning on images with pixels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


6.2.1 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.2.2 Pixels as features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.2.3 Multiclass output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.2.4 Fully connected on images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

6.3 Beyond pixels as features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255


6.3.1 Using local information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
6.3.2 Images as tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.3.3 Colored images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

6.4 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262



6.4.1 Convolutional Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264


6.4.2 Pooling layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
6.4.3 Final architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.4.4 Convolutional network on images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

6.5 Beyond images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280


6.7.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
6.7.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

7 Time Series and Recurrent Neural Networks 283

7.1 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

7.2 Time series classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286


7.2.1 Fully connected networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.2.2 Fully connected networks with feature engineering . . . . . . . . . . . . . . . . . . . . 291
7.2.3 Fully connected networks with 1D Convolution . . . . . . . . . . . . . . . . . . . . . . 293

7.3 Sequence Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296


7.3.1 1-to-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.3.2 1-to-many . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.3.3 many-to-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
7.3.4 asynchronous many-to-many . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
7.3.5 synchronous many-to-many . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
7.3.6 RNN allow graphs with cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

7.4 Time series forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299


7.4.1 Fully connected network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
7.4.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.4.3 Recurrent Neural Network Maths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
7.4.4 Long Short-Term Memory Networks (LSTM) . . . . . . . . . . . . . . . . . . . . . . . 325
7.4.5 LSTM forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330

7.5 Improving forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334


7.5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340

7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340


7.6.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
7.6.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
7.6.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
7.6.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

8 Natural Language Processing and Text Data 345

8.1 Use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

8.2 Text Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347


8.2.1 Loading text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
8.2.2 Feature extraction from text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
8.2.3 Bag of Words features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
8.2.4 Sentiment classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
8.2.5 Text as a sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365

8.3 Sequence generation and language modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378


8.3.1 Character sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
8.3.2 Recurrent Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
8.3.3 Sampling from the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
8.3.4 Sequence to sequence models and language translation . . . . . . . . . . . . . . . . . . 391

8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391


8.4.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
8.4.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

9 Training with GPUs 395

9.1 Graphical Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

9.2 Cloud GPU providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397


9.2.1 Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
9.2.2 Pipeline AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
9.2.3 Floydhub . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
9.2.4 Paperspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
9.2.5 AWS EC2 Deep Learning AMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
9.2.6 AWS Sagemaker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
9.2.7 Google Cloud and Microsoft Azure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.2.8 The DIY solution (on Ubuntu) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414

9.3 GPU VS CPU training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415


9.3.1 Tensorflow 2.0 compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
9.3.2 Convolutional model comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415

9.4 Multiple GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420


9.4.1 Distribution strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
9.4.2 Data Parallelization using Tensorflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
9.4.3 Data Parallelization using Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
9.4.4 Data Parallelization using Horovod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
9.4.5 Supercomputing with Tensorflow Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

9.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425


9.6.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
9.6.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426

10 Performance Improvement 427

10.1 Learning curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428

10.2 Reducing Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440


10.2.1 Model Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
10.2.2 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
10.2.3 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454

10.3 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458

10.4 Tensorflow Data API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

10.5 Hyperparameter optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469


10.5.1 Hyper-parameter tuning in Tensorboard . . . . . . . . . . . . . . . . . . . . . . . . . . 469
10.5.2 Weights and Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
10.5.3 Hyperopt and Hyperas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
10.5.4 Cloud based tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477

10.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477


10.6.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477

11 Pretrained Models for Images 479

11.1 Recognizing sports from images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480

11.2 Keras applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485

11.3 Predict class with pre-trained Xception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488

11.4 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499

11.5 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502

11.6 Bottleneck features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505

11.7 Train a fully connected on bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511


11.7.1 Image search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514

11.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525


11.8.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
11.8.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
11.8.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526

12 Pretrained Embeddings for Text 527

12.1 “Unsupervised”-“supervised learning” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528

12.2 GloVe embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529

12.3 Loading pre-trained embeddings in Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535

12.4 Gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538


12.4.1 Word Analogies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539

12.5 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541

12.6 Other pre-trained embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544


12.6.1 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
12.6.2 FastText . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546

12.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546


12.7.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
12.7.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547

13 Serving Deep Learning Models 549

13.1 The model development cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549


13.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
13.1.2 Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
13.1.3 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
13.1.4 Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
13.1.5 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
13.1.6 Model Exporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
13.1.7 Model Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
13.1.8 Model Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556

13.2 Deploy a model to predict indoor location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556


13.2.1 Data exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
13.2.2 Model definition and training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
13.2.3 Export the model with Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562

13.3 A simple deployment with Flask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568


13.3.1 Full script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
13.3.2 Run the script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
13.3.3 Get Predictions from the API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

13.4 Deployment with Tensorflow Serving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575


13.4.1 Saving a model for Tensorflow Serving . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
13.4.2 Inference with Tensorflow Serving using Docker and the Rest API . . . . . . . . . . . 578
13.4.3 The gRPC API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580

13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585


13.5.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
13.5.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586

14 Conclusions and Next Steps 589

14.1 Where to go next . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589

14.2 Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590

14.3 Bootcamp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590

15 Appendix 591

15.1 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591

15.2 Chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594


15.2.1 Univariate functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
15.2.2 Multivariate functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.2.3 Exponentially Weighted Moving Average (EWMA) . . . . . . . . . . . . . . . . . . . . 597

15.3 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602


15.3.1 Tensor Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605

15.4 Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607


15.4.1 1D Convolution & Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
15.4.2 2D Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
15.4.3 Image filters with convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616

15.5 Backpropagation for Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621

16 Getting Started Exercises Solutions 623

16.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623

16.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625

16.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630

16.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631

17 Data Manipulation Exercises Solutions 635

17.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635

17.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637

17.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642

17.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644

17.5 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646



18 Machine Learning Exercises Solutions 649

18.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649

18.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652

19 Deep Learning Exercises Solutions 661

19.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661

19.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666

19.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668

20 Deep Learning Internals Exercises Solutions 671

20.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671

20.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674

20.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676

20.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678

21 Convolutional Neural Networks Exercises Solutions 681

21.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681

21.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684

22 Time Series and Recurrent Neural Networks Exercises Solutions 689

22.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689

22.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692

22.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693

22.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694

23 Natural Language Processing and Text Data Exercises Solutions 703

23.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703

23.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706

24 Training with GPUs Exercises Solutions 713

24.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713

24.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717

25 Performance Improvement Exercises Solutions 719

25.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719



26 Pretrained Models for Images Exercises Solutions 727

26.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727

26.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729

26.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729

27 Pretrained Embeddings for Text Exercises Solutions 735

27.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735

27.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738

28 Serving Deep Learning Models Exercises Solutions 745

28.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745

28.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748


0 Preface

Introduction
Artificial Intelligence is the most powerful technology of the 21st century. It is the fourth, and possibly the
last, technological revolution in the history of humanity and it is impacting all aspects of our society and
civilization. Like the agricultural revolution, the industrial revolution and the Internet revolution of the late
20th century, the AI revolution is powered by a set of core technologies that are enabling new applications,
breakthroughs, and new insights.

In the case of AI, these enablers are cheap sensors, cheap data storage, and cheap computing power. Cheap
sensors and storage ushered in the first epoch of the AI revolution: the era of Big Data. In the first decade of
the 21st century, Big Data became a “must have” technology for modern business. The ability to record
consumer activity and interactions, as well as supply chain and production data, allowed companies to
gather real-time intelligence on an unprecedented scale. It also enabled data-driven applications, a new set of
products with data at the core. Social media platforms, like LinkedIn and Facebook, are excellent examples
of these.

The ubiquity of data prepared the way for the second epoch in the AI revolution: the era of Machine
Learning and Deep Learning. Both of these technologies have been around since the second half of the 20th
century but had rarely seen a broad application. Historically, this was mainly because their performance on
real-world tasks was not good enough, due to the absence of large training datasets.

Everything changed in 2012. The ImageNet challenge is a computer-science competition, founded in 2009,
where research scientists try to design algorithms that achieve the highest accuracy in visual object
recognition, on a dataset of millions of images. Until 2011, recognition errors were above 25%. However, in
2012 an algorithm based on Deep Learning brought the error down to 16% for the first time. This incredible
achievement is widely recognized as the start of the Deep Learning revolution, and since then, we have seen
Deep Learning conquer an increasing number of domains.

[Figure: ImageNet scores]

What’s interesting about this is not so much that a new algorithm was applied (in fact, artificial neural
networks had been around for years, if not decades, already). It’s the fact that new hardware breakthroughs
and the availability of large datasets made it possible to exploit such an algorithm. After conquering image
recognition, Deep Learning made breakthroughs in many other fields of machine intelligence, including:

• machine translation (think: Google Translate)
• speech recognition (think: Amazon Alexa, Apple Siri, Google Now, etc.)
• recommendation systems (think: Netflix, Amazon, Spotify, etc.)
• search engines and information retrieval (think: Google, Baidu, Bing, etc.)
• medical diagnostics (think: applications in cancer screening and imaging)
• fraud detection (think: Visa, Mastercard, Stripe, PayPal, etc.)
• forecasting (think: hedge funds, utility companies, etc.)
• robotics & automation (think: Tesla autopilot)

and many more.

So, what is Deep Learning? Deep Learning is a technology capable of learning very complex relations
between arbitrary inputs and outputs. It is based on a mathematical concept called an Artificial Neural
Network (ANN), which is nothing more than a fancy name for a very complicated mathematical function. It
had been relegated to a few research labs scattered across the world until the early 2010s. However, since
then, every major technology company has released an open source framework that implements the core
concepts of Deep Learning, the most famous being Tensorflow by Google (see here for the top 8
frameworks, and here for a comparison chart).

This book is a practical introduction to this set of technologies.

Why you should care


According to research published by Element AI, there are roughly 22,000 PhD-level researchers in the field
of Artificial Intelligence. Tencent, a Chinese technology giant, estimated that there were 300,000 engineers
equipped with AI knowledge at the end of 2017, while the demand for talent from companies is in the
millions. To give an idea of the magnitude of the Deep Learning revolution, China envisions building a $1
trillion AI industry by 2030 and is investing billions of dollars to make this happen. Private companies
compete for AI talent so fiercely that salaries can reach millions of dollars, and private equity and
acquisition deals in the hundreds of millions are common.

Companies are seeing tremendous improvements in productivity and competitiveness by introducing Deep
Learning technology. Examples include Google, who use Deep Learning to save millions in the electricity
that powers their data centers, and Airbnb, who are improving the search experience for their users using
Neural Networks.

The industrialization phase of the Deep Learning revolution is well underway.

Who this book is for


This book is for the developer, eager to learn how to implement these models in practice and become an
attractive candidate in this booming job market.

This book is for the technical manager who needs to interface with teams of data science experts
and needs to know enough to share a common language and be an effective leader.

This book is for teams that need to quickly ramp up their Deep Learning capabilities to add intelligent
components to their software.

We will start by introducing data and Machine Learning problems before diving deeper into how Neural
Networks are built, trained, and used. We will deal with tabular data, images, text data, and time series data,
and for each of these we will build and train the appropriate Neural Network.

By the end of the book, we will be able to recognize problems that can be solved using Deep Learning,
collect and organize data so that Neural Networks can consume it, build and train models to address a
variety of tasks, take advantage of the cloud to speed up training, and deploy models for use as an API.

Acknowledgements
This book would not exist without the help of many friends and colleagues who contributed in several ways.
Special thanks go to Ari and Nate from Fullstack for the continuous support and useful suggestions
throughout the project. Thanks to Nicolò for reading through the early version of the book and
contributing many corrections to make it more accessible to beginners. Thanks to Carlo for helping me
transform this book into a successful Bootcamp. A huge thank you to François Chollet for inventing Keras
and making Deep Learning accessible to the world and the Tensorflow developers for the fantastic work
they are doing. Finally to Chiara, to my friends and my family for all the emotional support through this
journey: thank you, I would not have finished this without all of you!

About the author


Francesco Mosconi is an experienced data scientist and entrepreneur. He is the founder/CEO and Chief
Data Scientist of Catalit, a data science consulting and training company based in San Francisco, CA.

With 15 years of experience working with data, Francesco has been an instructor at General Assembly, The
Data Incubator, Udemy and many conferences including ODSC, TDWI, PyBay, and AINext.

Formerly he was co-founder and Chief Data Officer at Spire, a YC-backed company that invented the first
consumer wearable device capable of continuously tracking respiration and physical activity.

Francesco started his career with a Ph.D. in Biophysics. He published a paper on DNA mechanics that
currently has over 100 citations. He then applied his data skills to help small and large companies grow
through data analytics, machine learning, and Deep Learning solutions.

He also started a series of workshops on Machine Learning and Deep Learning called Dataweekends,
training hundreds of students on these topics.

This book extends and improves the training program of Dataweekends, and it provides a practical
foundation in Machine Learning and Deep Learning.

How to use this book


This book provides a self-contained, practical introduction to Deep Learning. It assumes familiarity with
Python and with a little bit of math.

The first chapters review core concepts in Machine Learning and explain how Deep Learning expands the
field. We will build an intuition to recognize the kind of problems where Deep Learning shines and those
where other techniques could provide better results.

Chapters 4-8 present the foundation of Deep Learning. By the end of Chapter 8, we’ll be able to use Fully
Connected, Convolutional, and Recurrent Neural Networks on your laptop and deal with a variety of input
sources including tabular data, images, time series, and text.

Chapters 9-13 build on core chapters to extend the reach of our skills both in depth and in width. We’ll learn
to improve the performance of our models as well as to use pre-trained models, piggybacking on the
shoulders of giants. We will also talk about how to use GPUs to speed up training as well as how to serve
predictions from your model.

This book is a practical one. Everything we introduce is accompanied by code to experiment with and
explore.

The code and notebooks accompanying the book are available through your purchase on your Gumroad
library. To follow along with the book, please make sure to download and unpack the code on your local
computer.

How to approach exercises

Before we go on, let’s spend a couple of words on exercises. They are a crucial part of this book, and we
suggest working as follows:

1. Execute the code provided with the chapter, to get a sense of what it is doing.
2. Once we have run the provided code, we suggest starting to work through the exercises. We begin
with easy exercises and build gradually towards more difficult ones.

If you find yourself stuck, here are some resources where you can look for help:

• Look at the error message: understand which parts of it are essential. The most critical line in the
Python error message is the last one. It tells us the error type. The other lines give us information
about what caused the error to happen so that we can go ahead and fix it (see the short example after this list).
• The internet: try pasting part of the error message in a search engine and see what you find. It is very
likely that someone has already encountered the same problem and a solution is available.
• Stack Overflow: this is a vast knowledge base where people answer code-related questions. Very often
you can search for the specific error message you got and find the answer here.
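
As a concrete illustration of reading an error message, here is a deliberately broken snippet (the list index is out of range on purpose); the file name and line numbers in your own traceback will differ:

numbers = [1, 2, 3]
print(numbers[5])    # this line fails on purpose

Running it produces a traceback whose last line reads "IndexError: list index out of range": the error type (IndexError) plus a short description. The lines above it point to the file and line where the error occurred, which is where we go to fix it.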

Notation

You will find different fonts for different parts of the text:

• bold for new terms
• italic to emphasize important concepts
• fixed width for code variables
• math for math

We may use tip blocks, like this one:


6 CHAPTER 0. PREFACE

TIP: this is a tip

to indicate practical suggestions and concepts that add to the material but are not strictly core.

And we may also use math blocks, like this one:

x_0 + x_1 + x_2    (1)

for math formulas.

Prerequisites - Is this book right for me?


This book is for software engineers and developers who want to approach Machine Learning and Deep
Learning and understand them. When reviewing current books on these topics, we found that they tend to
either be very abstract and theoretical with lots of maths and formulas or too practical, with just code and
no explanation of the theory.

In this book, we’ll try to find a balance between the two. It is an application focused book, with working
examples, coding exercises, solutions, and real-world datasets. At the same time, we won’t shy away from
the math when necessary. We’ll try to keep equations to a minimum, so here are a few symbols you may
encounter in the course of the book.

Sum

Sometimes we will need to indicate the sum of many quantities at once. Instead of writing the sum explicitly
like this:

x_0 + x_1 + x_2 + x_3 + x_4 + x_5 + ...    (2)

we may use the summation symbol ∑ and write it like this:

∑_i x_i    (3)
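
If the summation notation is new to you, it maps directly onto code. Here is a minimal Python sketch (with made-up values) showing that the explicit sum and the compact ∑ notation describe the same operation:

# The sum x_0 + x_1 + ... + x_3, written out explicitly and with Python's built-in sum.
xs = [2.0, 3.0, 5.0, 7.0]                # illustrative values for x_0 ... x_3

explicit = xs[0] + xs[1] + xs[2] + xs[3]
compact = sum(xs)                        # plays the role of the summation symbol

print(explicit, compact)                 # 17.0 17.0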

Partial derivatives

Sometimes we will need to indicate the speed of change in a function with respect to one of its arguments,
which is obtained through a partial derivative, indicated with the symbol ∂:
∂f(x_1, x_2, ...) / ∂x_1    (4)

It means we are looking at how much f changes for a unit change in x_1 when all the other variables are
kept fixed (you can find out more on Wikipedia).
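
To make this less abstract, here is a small sketch (the function f is made up purely for illustration) that approximates a partial derivative numerically by nudging x_1 while keeping x_2 fixed:

# Numerical approximation of ∂f/∂x_1 using a finite difference with a small step h.
def f(x1, x2):
    return x1 ** 2 + 3 * x1 * x2         # an arbitrary example function

def partial_x1(f, x1, x2, h=1e-6):
    # Change x1 by a tiny amount, keep x2 fixed, and measure how much f changes.
    return (f(x1 + h, x2) - f(x1, x2)) / h

# Analytically, ∂f/∂x_1 = 2*x1 + 3*x2, which equals 8 at the point (1, 2).
print(partial_x1(f, 1.0, 2.0))           # close to 8.0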

Dot product

A lot of Deep Learning relies on a few common linear algebra concepts like vectors and matrices. In
particular, an operation we will frequently use is the dot product: A · B. If you've never seen this before, it may
be a great time to look up on Wikipedia how it works.
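
If you prefer to see it in code first, here is a minimal NumPy sketch (with made-up vectors) of the dot product we will use throughout the book:

import numpy as np

# Dot product of two vectors: multiply element-wise, then sum the results.
A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 5.0, 6.0])

manual = (A * B).sum()                   # 1*4 + 2*5 + 3*6 = 32
with_numpy = np.dot(A, B)                # NumPy's built-in dot product

print(manual, with_numpy)                # 32.0 32.0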

In any case do not worry too much about the math. We will introduce it gradually and only when needed,
so you’ll have time to get used to it.

Python

This book is not about Python. It’s a book about Deep Learning and how to build Deep Learning models
using Python. Therefore, some familiarity with Python is required to follow along.

One of the questions we often receive is “I don’t have any experience with Python, would I be able to follow the
course if I take a basic Python course before?”. The answer is YES, but let us help you with that.

This book focuses on data techniques, and it assumes some familiarity with Python and programming
languages. It is designed to speed up your learning curve in Deep Learning, giving you enough knowledge
to continue learning on your own.

So what are the things you can do to ramp up in Python?

Here are two resources to get you started:

• Learn Python the Hard Way: a great way to start learning without putting too much effort into it.
• Hacker Rank 30 days of code: a more problem-solving oriented resource.

To follow this course easily, you should be familiar with the following Python constructs:

• special keywords: in, return, None
• variables and data types: int, float, strings
• data structures: lists, dictionaries, sets and tuples
• flow control: for loops, while loops, conditional statements (if, elif, else)
• functions
• classes and a bit of object oriented programming
• packages and how to import them
• pythonic constructs like list comprehension, iterators and generators

Once you are comfortable with these, you’ll be ready to take this course.
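
If you would like a quick self-check, here is a tiny, made-up snippet that exercises several of the constructs above (a function, a dictionary comprehension, a for loop, and a conditional); if every line reads naturally, you are in good shape:

def count_long_words(words, min_length=5):
    """Return a dict mapping each sufficiently long word to its length."""
    return {w: len(w) for w in words if len(w) >= min_length}   # comprehension + conditional

sentence = ["deep", "learning", "with", "keras", "and", "tensorflow"]
for word, length in count_long_words(sentence).items():        # loop over a dictionary
    if length > 8:
        print(word, "is a long word")
    else:
        print(word, length)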

Our development environment


Since this is a practical course, we’ll need to get some tools installed on your computer. Whether you have
Linux, Mac or Windows, the software required for this course runs on all these systems, so you won’t have
to worry about it.

We will need to install Python and a few libraries that allow us to perform Machine Learning and Deep
Learning experiments.

Miniconda Python

Anaconda Python is an excellent open source distribution of Python packages geared towards data science. It comes with a lot of useful tools, and we encourage you to have a look at it. For this book, we will not need the full Anaconda distribution; we will install only the required packages, so that we can keep space requirements to a minimum.

TIP: if you already have Anaconda Python installed, make sure conda is up to date by
running conda update conda in a terminal window.

We can do this by installing Miniconda, which includes Python and the Anaconda package installer conda.
Here are the steps to complete to install it:

1. Download Miniconda Python 3.7 for your system (Windows/Mac OS X/Linux).


2. Run the installer and make sure that it completes successfully.
3. Done!

If you’ve completed these steps successfully, you can open a command prompt (how to do this will differ
depending on which OS you’re using) and type python to launch the Python interpreter. It should display
something like the following (Linux):

Python 3.7.2 (default, Dec 29 2018, 06:19:36)


[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

or (MacOS):

Python 3.7.2 (default, Dec 29 2018, 00:00:04)


[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

Congratulations, you have just installed Miniconda Python successfully.



Conda Environment

Environment creation

Now that we have installed Python, we need to download and install the packages required to run the code
of this book. We will do this in a few simple steps.

1. Open a terminal window.
2. Change directory to the folder you have downloaded from our book repository, using the command cd.
3. Run the command conda env create to create the environment with all the required packages.

TIP: if the environment already exists, when you run the create command you will get the error:

CondaValueError: prefix already exists:

In that case, run the command conda env update instead.

You will see a lot of lines on your terminal. What is going on? The conda package installer is creating an environment, i.e. a special folder that contains the specific versions of each required package for the book. The environment is specified in the file environment.yml, which looks like this:

name: ztdlbook
channels:
- defaults
dependencies:
- python=3.7.*
- bz2file==0.98
- cython==0.29.*
- numpy==1.16.*
- flask==1.0.*
- gensim==3.4.*
- h5py==2.9.*
- jupyter==1.0.*
- matplotlib==3.0.*
- nomkl==3.0.*
- pandas==0.24.*
- pillow==5.4.*
- pip==19.0.*
- pytest==4.1.*
- pyhamcrest==1.9.*
- scikit-learn==0.20.*
- scipy==1.2.*
- seaborn==0.9.*
- setuptools==40.8.*
- twisted==18.9.*
- pip:
- jupyter_contrib_nbextensions==0.5.*
- tensorflow==2.0.0-alpha0
- tensorflow-serving-api==1.13.*

The package installer reads the environment file and downloads the correct versions of each package
including its dependencies.

Once you have created the environment, you should see a message like the following (Mac/Linux):

# To activate this environment, use:


# > conda activate ztdlbook
#
# To deactivate an active environment, use:
# > conda deactivate

or the following (Windows):

# To activate this environment, use:


# > activate ztdlbook
#
# To deactivate an active environment, use:
# > deactivate ztdlbook

This tells you how to activate and deactivate the environment.

You can read more about conda environments here.

So let’s go ahead and activate the environment by typing:

conda activate ztdlbook

(or the Windows equivalent). If you do that, you’ll notice that your command prompt changes and now
displays the environment name at the beginning, within parentheses, like this:

(ztdlbook) <other stuff in the prompt> $

If you see this, you can go ahead to the next step.

GPU enabled environment

If your machine has an NVIDIA GPU, you'll want to install the GPU-enabled version of Tensorflow. In Chapter 9 we cover cloud GPUs in detail and explain how to install all the required software. Assuming you have already installed the NVIDIA drivers, CUDA and cuDNN, you can create the environment for this book in a few simple steps:

1. run:

conda env create -f environment-gpu.yml

It is the same environment as the standard one, minus the tensorflow-serving-api package. The reason is that this package has standard Tensorflow as a dependency, and installing it together with its dependencies would interfere with our tensorflow-gpu installation. So go ahead and create the environment using the above config file.

2. After creating the environment, activate it with:

conda activate ztdlbook

3. Finally, install the tensorflow-serving-api package without dependencies:

pip install tensorflow-serving-api==1.13.* --no-deps

Tensorflow 2.0

Tensorflow 2.0 was introduced in March 2019, and it is a notable update to the library. The most relevant change for the readers of this book is that Keras became the default interface for building models. The merger of Keras into Tensorflow has been going on for the past year, but with Tensorflow 2.0 redundant functionality from other APIs has been removed, leaving Keras as the preferred way to define models.

When we began writing this book, Keras and Tensorflow were not so tightly integrated, and it made sense to install them separately and consider them as independent libraries. As of 2019, Keras continues to be an open-source API specification that supports different backends including Tensorflow, CNTK, and Theano. At the same time, the Tensorflow project integrates an independent implementation of the Keras API, and it's this implementation which is becoming the standard high-level API for model definition in Tensorflow.

The support from the Tensorflow developer community makes the integration of Keras with Tensorflow much tighter than with any other backend, and therefore we decided to port our code to Tensorflow 2.0 and use its API throughout the book. If you've been reading the beta versions of the book, you will see that changes are minimal. On the other hand, we are very proud to be one of the first Deep Learning books with code written for Tensorflow 2.0!
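As a small taste of what this looks like in practice (it mirrors the import style you will see in the code later in the book), model definition in Tensorflow 2.0 goes through the tensorflow.keras namespace:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Keras, bundled inside Tensorflow 2.0, is the high-level API for defining models
model = Sequential()
model.add(Dense(1, input_shape=(2,), activation='sigmoid'))
model.summary()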

Jupyter notebook

Now that we have installed Python and the required packages let’s explore the code for this book. We
provide the code as Jupyter notebooks. These are documents that can contain live code, equations,
visualizations, and explanatory text. They are very common in the data science community because they
allow for easy prototyping of ideas and fast iteration. Notebook files are opened and edited through the
Jupyter Notebook web application. Let’s launch it!

Starting Jupyter Notebook

In the terminal, change directory to the course folder (if you haven’t already) and then type:

jupyter notebook

This command will start the notebook server and open a window in your default browser; you should reach a window like the one shown here:

Jupyter Notebook

This is the notebook dashboard and serves as a home page for the notebook. There are three tabs:

• Files : this tab displays the notebooks and files in the current directory. By clicking on the breadcrumbs or on sub-directories at the top of the notebook list, you can navigate your file system. You
can create and upload new files. To create a new notebook, click on the “New” button at the top of the
list and select a kernel from the dropdown. To upload a new file, click on the “Upload” button and
browse the file from your computer. By selecting a notebook file, you can perform several tasks, such
as “Duplicate”, “Shutdown”, “View”, “Edit” or “Delete” it.

• Running : this tab displays the currently running Jupyter processes, either Terminals or Notebooks. This tab allows you to shut down running notebooks.

TIP: Notebooks remain running until you explicitly shut them down, and closing the
notebook’s page is not sufficient.

• Cluster : this tab displays parallel processes provided by IPython parallel; it requires further activation, which is not necessary for the scope of this book.

Starting your first Jupyter Notebook

If you are new to Jupyter Notebook, it may feel a little disorienting at first, especially if you usually work with an IDE. However, you'll see that it's quite easy to navigate your way around it. Let's start with how to open the first notebook.

Click on the course folder:

The course folder

and the notebooks forming the course will appear. Go ahead and click on the 00_Introduction.ipynb
notebook:

This will open a new tab where you should see the content of this chapter in the notebook. Now scroll down
to this point and feel free to continue reading from the screen if you prefer.

Course notebook

Jupyter Notebook cheatsheet

Let us summarize here a few handy commands to get you started with Jupyter Notebook.

TIP: For a complete introduction to the Jupyter Notebook we encourage you to have a look
at the official documentation.

• Ctrl-ENTER executes the currently active cell and keeps the cursor on the same cell
• Shift-ENTER executes the currently active cell and moves the cursor to the next cell
• ESC enables the Command Mode. Try it. You'll see the border of the notebook change to blue. In Command Mode you can press a single key and access many commands. For example, use:

– A to insert cell above the cursor


– B to insert cell below the cursor
– DD to delete the current cell
– F to open the find/replace dialogue
– Z to undo the last command

Finally, you can use H to access the help dialog with all the keyboard shortcuts for both command and edit
mode:

Environment check

If you have followed the instructions this far, you should be running the first notebook.

Jupyter shortcuts

The next command cell makes sure that you are using the Python executable from within the course
environment and should evaluate without an error.

TIP: If you get an error, try the following:

1. Close this notebook.


2. Go to the terminal and stop Jupyter Notebook using:

CTRL+C

3. Make sure that you have activated the environment; you should see a prompt like:

(ztdlbook) $

4. (Optional) If you don't see that prompt, activate the environment:


• mac/linux:
conda activate ztdlbook
• Windows:
activate ztdlbook
5. Restart Jupyter Notebook.
6. Re-open the first notebook in the course folder
7. Re-run the next cell.

In [1]: import os
        import sys

        env_name = 'ztdlbook'

        p = sys.executable
        try:
            assert(p.find(env_name) != -1)
            print("Congrats! Your environment is correct!")
        except Exception as ex:
            print("It seems your environment is not correct.\n",
                  "Currently running Python from this path:\n",
                  p,
                  "\n",
                  "Please follow the instructions and retry.")
            raise ex

Congrats! Your environment is correct!



Python 3.7

The next line checks that you’re using Python 3.7.x and it should execute without any error.

If you get an error, go back to the previous step and make sure you created and activated the environment
correctly.

In [2]: python_version = "3.7"

        v = sys.version
        try:
            assert(v.find(python_version) != -1)
            print("Congrats! Your Python is correct!")
        except Exception as ex:
            print("It seems your Python is not correct.\n",
                  "Version should be:", python_version, "\n"
                  "Python sys.version:\n",
                  v,
                  "\n",
                  "Please follow the instructions above\n",
                  "and make sure you activated the environment.")
            raise ex

Congrats! Your Python is correct!

Jupyter

Let’s check that Jupyter is running from within the environment.

In [3]: import jupyter

        j = jupyter.__file__

        try:
            assert(j.find('jupyter') != -1)
            assert(j.find(env_name) != -1)
            print("Congrats! You are using Jupyter from\n",
                  "within the environment.")
        except Exception as ex:
            print("It seems you are not using the correct\n",
                  "version of Jupyter.\n",
                  "Currently running Python from this path:\n",
                  j,
                  "\n",
                  "Please follow the instructions above\n",
                  "and make sure you activated the environment.")
            raise ex

Congrats! You are using Jupyter from


within the environment.

Other packages

Here we will check that all the packages are installed and have the correct versions. If everything is ok you
should see:

Houston, we are 'go'!

If there’s an issue here, please make sure you have checked the previous steps.

In [4]: import pip

        import bz2file
        import cython
        import flask
        import gensim
        import h5py
        import jupyter
        import matplotlib
        import numpy
        import pandas
        import PIL
        import pytest
        import sklearn
        import scipy
        import seaborn
        import setuptools
        import twisted
        import hamcrest
        import tensorflow
        # import tensorflow_serving

        def check_version(pkg, version):
            actual = pkg.__version__.split('.')
            if len(actual) == 3:
                actual_major = '.'.join(actual[:2])
            elif len(actual) == 2:
                actual_major = '.'.join(actual)
            else:
                raise NotImplementedError(pkg.__name__ +
                                          " actual version: " +
                                          pkg.__version__)
            try:
                assert(actual_major == version)
            except Exception as ex:
                print("{} {}\t=> {}".format(pkg.__name__,
                                            version,
                                            pkg.__version__))
                raise ex

        check_version(cython, '0.29')
        check_version(flask, '1.0')
        check_version(gensim, '3.4')
        check_version(h5py, '2.9')
        check_version(matplotlib, '3.0')
        check_version(numpy, '1.16')
        check_version(pandas, '0.24')
        check_version(PIL, '5.4')
        check_version(pip, '19.0')
        check_version(pytest, '4.1')
        check_version(sklearn, '0.20')
        check_version(scipy, '1.2')
        check_version(seaborn, '0.9')
        check_version(setuptools, '40.8')
        check_version(twisted, '18.9')
        check_version(hamcrest, '1.9')
        check_version(tensorflow, '2.0')

        print("Houston, we are 'go'!")

Houston, we are 'go'!

Congratulations, you have just verified that you have correctly set up your computer to run the code in this
book.

Troubleshooting installation

If for some reason you encounter errors while running the first notebook, the simplest solution is to delete
the environment and start from scratch again.

To remove the environment:

• close the browser and go back to your terminal


• stop Jupyter Notebook (CTRL-C)

• deactivate the environment (Mac/Linux):

conda deactivate ztdlbook

• deactivate the environment (Windows 10):

deactivate ztdlbook

• delete the environment:

conda remove -y -n ztdlbook --all

• restart from environment creation and make sure that each step completes until the end.

Updating Conda

One thing you can also try is to update your conda executable. It may help if you already had Anaconda
installed on your system.

conda update conda

We tested these instructions on:

• macOS 10.14 (Mojave)


• Ubuntu 16.04 and 18.04
• Windows 10
1 Getting Started

Deep Learning in the real world


This book is a hands-on course where we learn to train Deep Learning models. Such models are
omnipresent in the real world, and you may have already encountered them without knowing! Both large
and small companies use them to solve challenging problems. Here we will mention some of them, but we
encourage you to keep yourself informed since new applications come out every day.

Image recognition

It is a widespread application, which consists in determining whether or not an image contains some specific objects, features, or activities. For example, the following image shows an object detection algorithm taken from the Google Blog.
from the Google Blog.

The trained model can identify the objects in the image. Similar algorithms can be applied to recognize
faces, or determine diseases from X-ray scans, or in self-driving cars, to name a few examples.

Predictive modeling

Deep Learning can be applied to time series data, to forecast future events, for example, the energy
consumption of a region, the temperature over an area, the price of a stock, and so on. Also, researchers use
Neural Networks to predict demographic changes, election results, and natural disasters.


Object Detection

Language translation

Many companies use Deep Learning for language translation. This approach uses a large artificial Neural
Network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single
integrated model. The following image represents an example of an instant visual translation, taken from
the Google Blog. This algorithm combines image recognition tasks with language translation ones.

Machine Translation

Recommender system

Recommender systems help the user find the right choice among the available possibilities. They are everywhere, and we use them every day: when we buy a book that Amazon recommends based on our previous history, or when we listen to a song tailored to our taste on Spotify, or when we watch with the family a movie recommended by Netflix, to name some examples.

Automatic Image Caption Generation

Automatic image captioning is the task where, given a picture, the system can generate a caption that
describes the contents of the image. Once you can detect objects in photographs and create labels for those
objects, you can turn those labels into a coherent sentence description. Here is a sample of automatic image
caption generation taken from Andrej Karpathy and Li Fei-Fei at Stanford University.

Image Caption Generation

Anomaly detection

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected
behavior. It has many applications in business, from intrusion detection (identifying strange sequences in
network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI
scan), and from fraud detection in credit card transactions to fault detection in operating environments.

Automatic Game Playing

Here Deep Learning models learn how to play a computer game, training to maximize some goal. One glorious example is DeepMind's AlphaGo algorithm, developed by Google, which beat the world champion at the game of Go; another, more recent one is AlphaStar, which in January 2019 specialized in playing the real-time strategy game StarCraft II.

First Deep Learning Model


Now that we are set up and ready to go, let’s get our feet wet with our first simple Deep Learning model. Let’s
begin by separating two sets of points of different color in a two-dimensional space.

First, we are going to create these two sets of points, and then we will build a model that can separate them.
Although it’s a toy example, this is representative of many industry-relevant problems where the model
predicts a binary outcome.

TIP: Binary prediction


When we talk about binary prediction, we mean identifying one type or another. In this example, we're separating between blue and red points. True or False, 0 or 1, Yes or No, are all other examples of binary outcomes.

Question: Can you think of any industry examples where we may want to predict a binary outcome?

Answer: detecting if an email is spam or not, identifying if a credit card transaction is legitimate or not,
predicting if a user is going to buy an item or not.

The primary goal of this exercise is to see that, with just a few lines of code, we can define and train a Deep Learning model. Do not worry if some of it is beyond your understanding yet. We'll walk through it here and see similar code in depth in the rest of the book.

In the next chapters, we will be building more complex models, and we will work with more exciting
datasets.

First, we are going to import a few libraries.

Numpy

At their core, Neural Networks are mathematical functions. The workhorse library used industry-wide is
numpy. numpy is a Python library that contains many mathematical functions, particularly around working
with arrays of numbers.

For instance, numpy contains functions for:

• vector math
• matrix math
• operations optimized for number arrays

While we’ll use higher-level libraries such as Keras a lot in this book, being familiar with and proficient in
using at numpy is a core skill we’ll need to build (and evaluate) our networks. Also, while numpy is a
comprehensive library, there are only a few key functions that we will use over and over again. We’ll cover
each new function as it comes up, so let’s dive in and try out a few basic operations.

Basic Operations

The first thing we need to do to use numpy is import it into our workspace:

In [1]: import numpy as np

TIP: If you get an error message similar to the following one, don’t worry.

---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-4ee716103900> in <module>()
----> 1 import numpy as np

ImportError: No module named 'numpy'

Python error messages may not seem very easy to navigate, but they are actually quite simple to read.
To understand the error, we suggest reading from the bottom up. Usually the last line is the
most informative. Here it says: ImportError, so it looks like it didn’t find the module
called numpy. This could be due to many reasons, but it probably indicates that either you
didn’t install numpy or you didn’t activate the conda environment. Please refer back to the
installation section and make sure you have activated the environment before starting
jupyter notebook.

Using numpy, let’s create a simple 1-D array:

In [2]: a = np.array([1, 3, 2, 4])

Here we’ve created a 1-dimensional array containing four numbers. We can evaluate a to see the current
values:

In [3]: a

Out[3]: array([1, 3, 2, 4])

Note that the type of a is numpy.ndarray. The documentation for this type is available here.

In [4]: type(a)

Out[4]: numpy.ndarray

TIP: Jupyter Notebook is a great interactive environment, and it allows us to access the documentation for each object loaded in memory. Just append a question mark to any variable in the notebook and execute the cell. For example, in the next cell, type:

a?

and then run it. It opens a pane in the bottom with the documentation for the object a.
This trick works with any object in the notebook. Pretty awesome! Press escape to dismiss
the panel at the bottom.

In [ ]:

Let’s create two more arrays, a 2-D and a 3-D array:

In [5]: b = np.array([[8, 5, 6, 1],
                      [4, 3, 0, 7],
                      [1, 3, 2, 9]])

        c = np.array([[[1, 2, 3],
                       [4, 3, 6]],
                      [[8, 5, 1],
                       [5, 2, 7]],
                      [[0, 4, 5],
                       [8, 9, 1]],
                      [[1, 2, 6],
                       [3, 7, 4]]])

Again we can evaluate them to check that they are indeed what we expect them to be:

In [6]: b

Out[6]: array([[8, 5, 6, 1],


[4, 3, 0, 7],
[1, 3, 2, 9]])

In [7]: c

Out[7]: array([[[1, 2, 3],


[4, 3, 6]],

[[8, 5, 1],
[5, 2, 7]],

[[0, 4, 5],
[8, 9, 1]],

[[1, 2, 6],
[3, 7, 4]]])

In mathematical terms, we can think of the 1-D array as a vector, the 2-D array as a matrix and the 3-D array
as a tensor of order 3.

TIP: What is a Tensor?


We will encounter tensors later in the book and we will give a more precise definition of
them then. For the time being, we can think of a Tensor as a more general version of a
matrix, which can have more than two indices for its elements. Another useful way to
think of tensors is to imagine them as a list of arrays of equal shape.

Numpy arrays are objects, which means they have attributes and methods. A useful property is shape,
which tells us the number of elements in each dimension of the array:

In [8]: a.shape

Out[8]: (4,)

Python here tells us the object has four items along the first axis. The trailing comma is needed in Python to indicate that the object is a tuple with only one element.

TIP: In Python, if we write (4), this is interpreted as the number 4 surrounded by


parentheses, and parentheses are ignored. Typing (4,) is interpreted as a tuple with a
single element: the number 4. If the tuple contains more than one element, like (3, 4) we
can omit the trailing comma.
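A quick check in a notebook cell makes the difference obvious:

print(type((4)))     # <class 'int'>: the parentheses are simply ignored
print(type((4,)))    # <class 'tuple'>: the trailing comma makes it a tuple
print(type((3, 4)))  # <class 'tuple'>: no trailing comma needed with two elements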

Tab tricks

Trick 1: Tab completion In Jupyter Notebook we can type faster by using the tab key to complete the
name of any variable we previously created. So, for example, if somewhere along our code we have created a
variable called verylongclumsynamevariable and we need to use it again, we can simply start typing ver
and hit tab to see the possible completions, including our long and clumsy variable.

TIP: try to use meaningful and short variable names, to make your code more readable.

Trick 2: Methods & Attributes In Jupyter Notebook, we can hit tab after the dot . to know which
methods and attributes are accessible for a specific object. For example, try typing

a.

and then hit tab. You will notice that a small window pops up with all the methods and attributes available
for the object a. This looks like:

This is very handy if we are looking for inspiration about what a can do.

Trick 3: Documentation pop-up Let’s go ahead and select a.argmax from the tab menu by hitting
enter. It is a method, and we can quickly find how it works. Let’s open a round parenthesis a.argmax(
and let’s hit SHIFT+TAB+TAB (that is TAB two times in a row while holding down SHIFT). This will open a
pop-up window with the documentation of the argmax function.

Here you can read how the function works, which inputs it requires and what outputs it returns. Pretty nice!

Let’s look at the shape of b:

In [9]: b.shape
Methods pop-up pressing DOT+TAB

Access documentation with SHIFT+TAB+TAB

Out[9]: (3, 4)

Since b is a 2-dimensional array, the attribute .shape has two elements, one for each of the two axes of the matrix. In this particular case, we have a matrix with three rows and four columns, or a 3x4 matrix.

Let’s look at the shape of c:

In [10]: c.shape

Out[10]: (4, 2, 3)

c has three dimensions. Notice how the last element indicates the length of the innermost axis. In fact
shape is a tuple, whose elements indicate the lengths of the axes, starting from the outermost one.

TIP: Knowing how to navigate the shape of an ndarray is essential for the work we will
encounter later: from being able to perform a dot product between weights and inputs in a
model, to correctly reshaping images when feeding them to a Convolutional Neural
Network.
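For instance, as a throwaway sketch, reshaping rearranges the same values into a different shape:

flat = np.arange(12)             # 12 values in a 1-D array
grid = flat.reshape(3, 4)        # the same values arranged as a 3x4 matrix
print(flat.shape, grid.shape)    # (12,) (3, 4)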

Selection

Now that we know how to create arrays, we also need to know how to extract data out of them. You can
access elements of an array using the square brackets. For example, we can select the first element in a by
doing:

In [11]: a[0]

Out[11]: 1

Remember that numpy indices start from 0, so the n-th element is found at index n-1. For instance, you access the first element by referencing the cell at a[0] and the second element at a[1].

Uncomment the next line and select the second element of b.

In [12]: # arr = np.array([4, 3, 0, 7])


# assert (second element of arr)

Unlike accessing arrays in, say, JavaScript, numpy arrays have a powerful selection notation that you can use
to read data in a variety of ways.

For instance, we can use commas to select along multiple axes. For example, here’s how we can get the first
sub-element of the third element in c:

In [13]: c[2, 0]

Out[13]: array([0, 4, 5])

What about selecting all the first items along the second axis in b? Use the : operator:

In [14]: b[:, 0]

Out[14]: array([8, 4, 1])

Since b is a 2-D array, this is equivalent to selecting the first column.

: is the delimiter of the slice syntax to select a sub-part of a sequence, like: [begin:end].

For example, we can select the first two elements in a by typing:

In [15]: a[0:2]

Out[15]: array([1, 3])

and we can select the upper left corner of b as a 1x1 sub-matrix (note that slicing preserves the number of dimensions):

In [16]: b[:1, :1]

Out[16]: array([[8]])

Try selecting the elements requested in the next few lines.

Select the second and third elements of a:

In [17]: # uncomment and complete the next line


# assert ( your code here == np.array([3, 2]))

Select the elements from 1 to the end in a:

In [18]: # uncomment and complete the next line


# assert ( your code here == np.array([3, 2, 4]))

Select all the elements from the beginning excluding the last one:

In [19]: # uncomment and complete the next line


# assert ( your code here == np.array([1, 3, 2]))

Stride

We can also select regularly spaced elements by specifying a step size after a second :. For example, to select
the first and third element in a we can type:

In [20]: a[0:-1:2]

Out[20]: array([1, 2])

or, simply:

In [21]: a[::2]

Out[21]: array([1, 2])

where it is implicit that we want start and end to be the first and last element in the array.

Math

We’ll try to keep the math at a minimum here, but we do need to understand how the various operations
work in an array.

Math operators work element-wise, meaning that the mathematical operation is performed between corresponding elements, location by location.

For instance, let’s say we have two variables of shape (2,).



one = np.array([1, 2])


two = np.array([3, 4])

Addition works here by adding one[0] and two[0] together, then adding one[1] and two[1] together:

one + two # array([4, 6])

In [22]: 3 * a

Out[22]: array([ 3, 9, 6, 12])

In [23]: a + a

Out[23]: array([2, 6, 4, 8])

In [24]: a * a

Out[24]: array([ 1, 9, 4, 16])

In [25]: a / a

Out[25]: array([1., 1., 1., 1.])

In [26]: a - a

Out[26]: array([0, 0, 0, 0])

In [27]: a + b

Out[27]: array([[ 9, 8, 8, 5],


[ 5, 6, 2, 11],
[ 2, 6, 4, 13]])

In [28]: a * b

Out[28]: array([[ 8, 15, 12, 4],


[ 4, 9, 0, 28],
[ 1, 9, 4, 36]])

Go ahead and play a little to make sure you understand how these work.

TIP: If you’re not familiar with the difference between element-wise multiplication and dot
product, check out these two links: - Hadamard product - Matrix multiplication

As mentioned in the beginning, numpy is a very mature library that allows us to perform many operations
on arrays including:

• vectorized mathematical functions


• masks and conditional selections
• matrix operations
• aggregations
• filters, grouping, selections
• random numbers
• zeros and ones

and much more.

We will introduce these different operations as needed. The curious reader should check this documentation for more information.
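To give a flavor of a few of these, here is a quick sketch using arbitrary numbers:

zeros = np.zeros((2, 3))            # a 2x3 array of zeros
rand = np.random.random((2, 3))     # random numbers between 0 and 1
mask = rand > 0.5                   # boolean mask (conditional selection)
print(rand[mask])                   # only the values greater than 0.5
print(rand.mean(), rand.std())      # aggregations: mean and standard deviation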

Matplotlib

Another library we will use extensively is Matplotlib. The Matplotlib library is used to plot graphs so that we
can visualize our data. Visualization can help a lot in Machine Learning. Throughout this book, we will use
different kinds of plots in many situations, including

• Visualizing the shape/distributions of our data.


• Inspecting the performance improvement of our networks as training progresses.
• Visualizing pairs of features to look for correlations.

Let’s have a look at how to generate the most common plots available in matplotlib.

In [29]: import matplotlib.pyplot as plt



Above we’ve imported matplotlib.pyplot, which gives us access to the plotting functions.

Let’s set a few standard parameters that define how plots will look like. Their names are self-explanatory. In
future chapters, we’ll bundle these into a configuration file.

In [30]: from matplotlib.pyplot import rcParams


from IPython.display import set_matplotlib_formats

rcParams['font.size'] = 14
rcParams['lines.linewidth'] = 2
rcParams['figure.figsize'] = (7.5, 5)
rcParams['axes.titlepad'] = 14
rcParams['savefig.pad_inches'] = 0.12
set_matplotlib_formats('png', 'pdf')

Plot

To plot some data with a line plot, we can call the plot function on that data. We can try plotting our a
vector from above like so:

In [31]: plt.plot(a);


We can also render a scatter plot, by specifying a symbol to use to plot each point.

In [32]: plt.plot(a, 'o');


In the plot below, we interpret 2-D Arrays as tabular data, i.e., data arranged in a table with rows and
columns. Each row corresponds to one data point and each column to a coordinate for that data point.

If we plot b we will obtain four curves, one for each coordinate, with three points each.

In [33]: plt.plot(b, 'o-');



Notice that we plotted the four lines in the figure as a two-dimensional graph where the first line has a point at (0, 8), one at (1, 4), and another at (2, 1): these values map to the first column of b.

Let’s take another look at b:

In [34]: b

Out[34]: array([[8, 5, 6, 1],


[4, 3, 0, 7],
[1, 3, 2, 9]])

b has the shape (3, 4). If we want to plot three lines with four points each, we need to swap the rows with
the columns.

We do that by using the transpose function:

In [35]: b.transpose()

Out[35]: array([[8, 4, 1],
                [5, 3, 3],
                [6, 0, 2],
                [1, 7, 9]])

Now we can pass b.transpose() to plot and plot the three lines:

In [36]: plt.plot(b.transpose(), 'd-');


Notice how we used a different marker and also added the line between points. matplotlib contains a
variety of functions that let us create detailed, sophisticated plots.

We don’t need to understand all of the operations up-front, but for an example of the power Matplotlib
provides, let’s look at a more complex plot example (don’t worry about understanding every one of these
functions, we’ll cover the ones we need later on):

In [37]: plt.figure(figsize=(9, 6))

         plt.plot(b[0], color='green', linestyle='dashed',
                  marker='o', markerfacecolor='blue',
                  markersize=12)
         plt.plot(b[1], 'D-.', markersize=12)

         plt.xlabel('Remember to label the X axis', fontsize=12)
         plt.ylabel('And the Y axis too', fontsize=12)

         t = r'Big Title/greek/math: $\alpha \sum \dot x \^y$'
         plt.title(t, fontsize=16)

         plt.axvline(1.5, color='orange', linewidth=4)

         plt.annotate(xy=(1.5, 5.5), xytext=(1.6, 7),
                      s="Very important point",
                      arrowprops={"arrowstyle": '-|>'},
                      fontsize=12)
         plt.text(0, 0.5, "Some Unimportant Text", fontsize=12)

         plt.legend(['Series 1', 'Series 2'], loc=2);

[Figure: the resulting plot, with title, labeled axes, legend, a vertical line and annotations]

TIP This gallery gives more examples of what’s possible with Matplotlib.

Scikit-Learn

Scikit-learn is a beautiful library that implements many Machine Learning algorithms in Python. We will use it here to generate some data.

In [38]: from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000,
noise=0.1,
factor=0.2,
random_state=0)

The make_circles function will generate two “rings” of data points, each with two coordinates. It will also create an array of labels, either 0 or 1.

TIP: label is a common term in Machine Learning that will be explained in more detail when we discuss classification problems. It is a number indicating the class to which the data belongs.

We assigned these to the variables X and y. It is a very common notation in Machine Learning.

• X indicates the input variable, and it is usually an array of dimension >= 2 with the outer index running over the various data points in the set.
• y indicates the output variable, and it can be an array of dimension >= 1. In this case, our data will belong to either one circle or the other, and therefore our output variable will be binary: either 0 or 1. In particular, the data points belonging to the inner circle will have a label of 1.

Let’s take a look at the raw data we generated in X:

In [39]: X

Out[39]: array([[ 0.24265541, 0.0383196 ],


[ 0.04433036, -0.05667334],
[-0.78677748, -0.75718576],
...,
[ 0.0161236 , -0.00548034],
[ 0.20624715, 0.09769677],
[-0.19186631, 0.08916672]])

We can see the raw generated labels in y:

In [40]: y[:10]

Out[40]: array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1])

We can also check the shape of X and y respectively:

In [41]: X.shape

Out[41]: (1000, 2)

In [42]: y.shape

Out[42]: (1000,)

While we can investigate the individual points, it becomes a lot clearer if we plot the points visually using
matplotlib.

Here’s how we do that:

In [43]: plt.figure(figsize=(5, 5))


plt.plot(X[y==0, 0], X[y==0, 1], 'ob', alpha=0.5)
plt.plot(X[y==1, 0], X[y==1, 1], 'xr', alpha=0.5)
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5, 1.5)
plt.legend(['0', '1'])
plt.title("Blue circles and Red crosses");
[Figure: Blue circles and Red crosses]

Notice that we used some transparency, controlled by the parameter alpha.

TIP: what does the X[y==0, 0] syntax do? Let’s break it down:

• X[ , ] is the multiple-axis selection operator so that we will be selecting along rows


and columns in the 2D X array.
• X[:, 0] would select all the elements in the first column of X. If we interpret the two
columns of X as being the coordinates along the two axes of the plot, we are selecting
the coordinates along the horizontal axis.
• y==0 returns a boolean array of the same length as y, with True at the locations
where y is equal to 0 and False in the remaining locations. By passing this boolean
array in the row selector, numpy will smartly choose only those rows in X for which
the boolean condition is True.

Thus X[y==0, 0] means: select all the data points corresponding to the label 0 and for
each of these select the first coordinate, then return all these in an array.
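A tiny standalone sketch of the same boolean-mask idea, with made-up numbers:

demo = np.array([[10, 11],
                 [20, 21],
                 [30, 31]])
labels = np.array([0, 1, 0])

print(labels == 0)           # [ True False  True]
print(demo[labels == 0, 0])  # [10 30]: first coordinate of the rows labeled 0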

Notice also how we are using the keywords color and marker to modify the aspect of our plot.

When looking at this plot, notice that the points scatter on the plane in two concentric circles: the blue dots form a larger ring on the outside, and the red crosses a smaller circle on the inside. Although the data in this plot is synthetic, it's representative of many situations where we want to separate two classes that are not separable with a straight line.

TIP: We call data synthetic when it was generated using a random number generator or a script.

For example, in the next chapters we will try to distinguish between fake and true banknotes or between
different classes of wine, and in all these cases the boundary between a class and the other will not be a
straight line.

In this toy example, we want to train a simple Neural Network model to learn to separate the blue circles
from the red crosses.

Time to import our Deep Learning library Keras!

Keras

Keras is the Deep Learning model definition API we will use throughout the book. It's modular, well designed, and it has been integrated by both Google and Microsoft to serve as the high-level API for their Deep Learning libraries (if you are not familiar with APIs, you may want to have a look at Wikipedia).

As explained earlier, Tensorflow adopted Keras as the default model specification API starting from the
recent 2.0 release. For this reason, we decided to use its syntax throughout the book.

TIP: Do not worry about understanding every line of code of what follows. The rest of the
book walks you through how to use Keras and Tensorflow (the most popular open-source
Deep Learning library, developed by Google), and so we’re not going to explain every detail
here. Here we’re going to demonstrate an overview of how to use Keras, and we’ll describe
more features as the book progresses.

To train a model to tell the difference between red crosses and blue dots above, we have to perform the
following steps:

1. Define our Neural Network structure; this is going to be our model.



2. Train the model on our data.


3. Check that the model has correctly learned to separate the red crosses from the blue dots.

TIP: If this is the first time you train a Machine Learning model, do not worry, we will
repeat these steps many times throughout the book, and we’ll have plenty of opportunities
to familiarize ourselves with them.

Let’s start by importing a few libraries:

In [44]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

Let’s start with step one: defining a Neural Network model. The next four lines are all that’s necessary to do
just that.

Keras will interpret those lines and behind the scenes create a model in Tensorflow. (If you have used the standalone Keras package before, you may have seen it inform you that it is “Using Tensorflow backend”.) Keras is just a high-level API specification that can work with several back-ends. For this course, we will use it with the Tensorflow library as back-end.

The Neural Network below will take two inputs (the horizontal and vertical position of a data point in the plot above) and return a single value: the probability that the point belongs to the “Red Crosses” in the inner circle.

Let’s build it!

We start by creating an empty shell for our model. We do this using the Sequential class, which tells Keras
that we are planning to build our model sequentially, adding one component at a time. So we will start by
declaring the model to be a sequential model and then we will proceed to add layers to the model.

TIP: Keras also offers a functional API to build models. It is a bit more complicated and we
will introduce it later in the book. Most of the models in this book will be created using the
Sequential API.

In [45]: model = Sequential()



The next step is to add components to our model. We won't explain the meaning of each of these lines now, except to point your attention to two facts:

1. We are specifying the input shape of our model input_shape=(2,) in the first line below so that our
model will expect two input values for each data point.
2. We have one output value only which will give us the predicted probability for a point to be a blue dot
or a red cross. This is specified by the number 1 in the second line below.

In [46]: model.add(Dense(4, input_shape=(2,), activation='tanh'))


model.add(Dense(1, activation='sigmoid'))

Finally, we need to compile the model, which will communicate to our backend (Tensorflow) the model
structure and how it will learn from examples. Again, don’t worry about knowing what optimizer and loss
function mean, we’ll have plenty of time to understand those.

In [47]: model.compile(optimizer=SGD(lr=0.5),
loss='binary_crossentropy',
metrics=['accuracy'])

Defining the model is like creating an empty box ready to receive data of a specific shape. We can think of it
like wiring up an electric circuit or setting up a pipe. To obtain predictions from it, we’ll need to feed some
example data (i.e., flow electricity through the circuit or water through the pipeline). When this happens,
the model will learn general rules to formulate accurate predictions. In the present case, this means the
model will learn to separate the red crosses from the blue dots.

To train a model we use the fit method. We’ll discuss this in great detail in the chapter on Machine
Learning.

In [48]: model.fit(X, y, epochs=20);

Epoch 1/20
1000/1000 [==============================] - 0s 398us/sample - loss: 0.7120
- accuracy: 0.5220
Epoch 2/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.6848 -
accuracy: 0.6250
Epoch 3/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.6519 -
accuracy: 0.6930
Epoch 4/20
1000/1000 [==============================] - 0s 71us/sample - loss: 0.5719 -
accuracy: 0.7990
Epoch 5/20
1000/1000 [==============================] - 0s 70us/sample - loss: 0.4763 -
accuracy: 0.8580

Epoch 6/20
1000/1000 [==============================] - 0s 68us/sample - loss: 0.4150 -
accuracy: 0.8610
Epoch 7/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.3758 -
accuracy: 0.8650
Epoch 8/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.3430 -
accuracy: 0.8730
Epoch 9/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.2923 -
accuracy: 0.8960
Epoch 10/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.2167 -
accuracy: 0.9560
Epoch 11/20
1000/1000 [==============================] - 0s 70us/sample - loss: 0.1621 -
accuracy: 0.9960
Epoch 12/20
1000/1000 [==============================] - 0s 68us/sample - loss: 0.1264 -
accuracy: 1.0000
Epoch 13/20
1000/1000 [==============================] - 0s 68us/sample - loss: 0.1040 -
accuracy: 1.0000
Epoch 14/20
1000/1000 [==============================] - 0s 70us/sample - loss: 0.0883 -
accuracy: 1.0000
Epoch 15/20
1000/1000 [==============================] - 0s 68us/sample - loss: 0.0768 -
accuracy: 1.0000
Epoch 16/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.0679 -
accuracy: 1.0000
Epoch 17/20
1000/1000 [==============================] - 0s 68us/sample - loss: 0.0608 -
accuracy: 1.0000
Epoch 18/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.0550 -
accuracy: 1.0000
Epoch 19/20
1000/1000 [==============================] - 0s 68us/sample - loss: 0.0504 -
accuracy: 1.0000
Epoch 20/20
1000/1000 [==============================] - 0s 69us/sample - loss: 0.0463 -
accuracy: 1.0000

The fit function just ran 20 rounds or passes over our data. Each round is called an epoch. At each epoch, we
flow our data through the Neural Network and compare the known labels with the predictions from the
network and measure how accurate our net was.

After 20 iterations the accuracy of our model is 1 or close to 1, meaning it predicted 100% (or close to 100%) of the cases correctly.

Decision Boundary

Now that our model is well trained, we can feed it with any pair of numbers and it will generate a prediction
for the probability that a point situated on the 2D plane at those coordinates belongs to the group of red
crosses.

In other words, now that we have a trained model, we can ask for the probability to be in the group of “red
crosses” for any point in the 2D plane. This is great because we can see if it has correctly learned to draw a
boundary between red crosses and blue dots. One way to calculate this is to define a grid on the 2D plane
and calculate the probability predicted by the model for any point on this grid. Let’s do it.

TIP: Don’t worry if you don’t yet understand everything in the following code. It is ok if
you get a general idea.

Our data varies roughly between -1.5 and 1.5 along both axes, so let’s build a grid of equally spaced
horizontal lines and vertical lines between these two extremes.

We will start by building two arrays of equally spaced points between -1.5 and 1.5. The np.linspace function does just that.

In [49]: hticks = np.linspace(-1.5, 1.5, 101)


vticks = np.linspace(-1.5, 1.5, 101)

In [50]: hticks[:10]

Out[50]: array([-1.5 , -1.47, -1.44, -1.41, -1.38, -1.35, -1.32, -1.29, -1.26,
-1.23])

Now let’s build a grid with all the possible pairs of points from hticks and vticks. The function
np.meshgrid does that.

In [51]: aa, bb = np.meshgrid(hticks, vticks)

In [52]: aa.shape

Out[52]: (101, 101)



In [53]: aa

Out[53]: array([[-1.5 , -1.47, -1.44, ..., 1.44, 1.47, 1.5 ],


[-1.5 , -1.47, -1.44, ..., 1.44, 1.47, 1.5 ],
[-1.5 , -1.47, -1.44, ..., 1.44, 1.47, 1.5 ],
...,
[-1.5 , -1.47, -1.44, ..., 1.44, 1.47, 1.5 ],
[-1.5 , -1.47, -1.44, ..., 1.44, 1.47, 1.5 ],
[-1.5 , -1.47, -1.44, ..., 1.44, 1.47, 1.5 ]])

In [54]: bb

Out[54]: array([[-1.5 , -1.5 , -1.5 , ..., -1.5 , -1.5 , -1.5 ],


[-1.47, -1.47, -1.47, ..., -1.47, -1.47, -1.47],
[-1.44, -1.44, -1.44, ..., -1.44, -1.44, -1.44],
...,
[ 1.44, 1.44, 1.44, ..., 1.44, 1.44, 1.44],
[ 1.47, 1.47, 1.47, ..., 1.47, 1.47, 1.47],
[ 1.5 , 1.5 , 1.5 , ..., 1.5 , 1.5 , 1.5 ]])

aa and bb contain the points of the grid; we can visualize them:

In [55]: plt.figure(figsize=(5, 5))


plt.scatter(aa, bb, s=0.3, color='blue')
# highlight one horizontal series of grid points
plt.scatter(aa[50], bb[50], s=5, color='green')
# highlight one vertical series of grid points
plt.scatter(aa[:, 50], bb[:, 50], s=5, color='red');

The model expects a pair of values for each data point, so we have to re-arrange aa and bb into a single array
with two columns.

The ravel function flattens an N-dimensional array to a 1D array, and the np.c_ class will help us combine
aa and bb into a single 2D array.

In [56]: ab = np.c_[aa.ravel(), bb.ravel()]

We can check that the shape of the array is correct:

In [57]: ab.shape

Out[57]: (10201, 2)

We have created an array with 10201 rows and two columns; these are all the points on the grid we drew above. Now we can pass it to the model and obtain a probability prediction for each position in the grid.

In [58]: c = model.predict(ab)

In [59]: c

Out[59]: array([[0.00017887],
[0.00023386],
[0.0003089 ],
...,
[0.06605449],
[0.06628567],
[0.06648385]], dtype=float32)

Great! We have predictions from our model for all points on the grid, and they are all values between 0 and
1.

Let’s check to make sure that they are, in fact between 0 and 1 by checking the minimum and maximum
values:

In [60]: c.min()

Out[60]: 3.0189753e-05

In [61]: c.max()

Out[61]: 0.987646

Let’s reshape c so that it has the same shape as aa and bb. We need to do this so that we will be able to use it
to control the size of each dot in the next plot.

In [62]: c.shape

Out[62]: (10201, 1)

In [63]: cc = c.reshape(aa.shape)
cc.shape

Out[63]: (101, 101)

Let’s see what they look like! We will redraw the grid, making the size of each dot proportional to the
probability predicted by the model that that point belongs to the group of red crosses.

In [64]: plt.figure(figsize=(5, 5))


plt.scatter(aa, bb, s=20*cc);


Nice! We see that a dense cloud of points with high probability is in the central region of the plot, exactly
where our red crosses are. We can draw the same data in a more appealing way using the plt.contourf
function with appropriate colors and transparency:

In [65]: plt.figure(figsize=(5, 5))


plt.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)
plt.plot(X[y==0, 0], X[y==0, 1], 'ob', alpha=0.5)
plt.plot(X[y==1, 0], X[y==1, 1], 'xr', alpha=0.5)
plt.title("Blue circles and Red crosses");
[Figure: Blue circles and Red crosses, with the model's predicted regions shaded]

The last plot clearly shows the decision boundary of our model, i.e., the curve that delimits the area predicted to be red crosses vs. the region predicted to be blue dots.

Our model learned to distinguish the two classes correctly, which is promising, even if the current example
was elementary.

Below are some exercises for you to practice with the commands and concepts we just introduced.

Exercises

Exercise 1

Let’s practice a little bit with numpy:

• generate an array of zeros with shape=(10, 10), call it a


• set every other element of a to 1, both along columns and rows, so that you obtain a nice
checkerboard pattern of zeros and ones
• generate a second array to be the sequence from 5 included to 15 excluded, call it b

• multiply a times b in such a way that the first row of a is an alternation of zeros and fives, the second
row is an alternation of zeros and sixes and so on. Call this new array c. To complete this part, you
will have to reshape b as a column array
• calculate the mean and the standard deviation of c along rows and columns
• create a new array of shape=(10, 5) and fill it with the non-zero values of c, call it d
• add random Gaussian noise to d, centered in zero and with a standard deviation of 0.1, call this new
array e

In [ ]:

Exercise 2

Practice plotting with matplotlib:

• use plt.imshow() to display the array a as an image, does it look like a checkerboard?
• display c, d and e using the same function, change the colormap to grayscale
• plot e using a line plot, assigning each row to a different data series. This should produce a plot with
noisy horizontal lines. You will need to transpose the array to obtain this.
• add a title, axes labels, legend and a couple of annotations

In [ ]:

Exercise 3

Reuse your code:

• encapsulate the code that calculates the decision boundary in a nice function called
plot_decision_boundary with the signature:

def plot_decision_boundary(model, X, y):


....

In [ ]:

Exercise 4

Practice retraining the model on different data:

• use the functions make_blobs and make_moons from Scikit-Learn to generate new datasets with two
classes

• plot the data to make sure you understand it


• re-train your model on each of these datasets
• display the decision boundary for each of these models

In [ ]:
2 Data Manipulation
This chapter is about data.

To do Deep Learning effectively, we'll need to be able to work with data of all shapes and sizes. At the end of
this section, we will be able to explore data visually and do simple descriptive statistics using Python and
Pandas.

Many types of data


Data comes in many forms, formats, and sizes. For example, as a data scientist at a web company, you will probably access data as records in a database. We can think of these as huge spreadsheets, with rows and
columns containing numbers.

On the other hand, if we are developing a method to detect cancer from brain scans, we will deal with images and video data; very often these files will be large in size (or number) and possibly in complicated formats.

If we are trying to detect a signal for trading stocks based on information in news articles, our data will
often be millions of text documents.

If we are translating the spoken language to text, our input data will be sound files, and so on.

Traditionally, Machine Learning has been relatively good at dealing with “tabular” data, while “unstructured” data such as text, sound, and images was addressed with very sophisticated, domain-specific techniques.

Deep Learning is particularly good at efficiently learning ways to represent such “unstructured” data,


and this is one of the reasons for its enormous success. Neural net models can be used to solve a translation
problem or an image classification problem, without worrying too much about the type of underlying data.

The first reason why Deep Learning is so popular is this: it can deal with many different types of data.

Before we can train models on our data, we need to gather the data and provide it to our networks in a
consistent format. Let’s take a look at a few different types of data and learn about the tools we’ll be using to
process and explore them.

Tabular Data

The most straightforward data to feed to a Machine Learning model is so-called tabular data. It's called tabular because it can be represented in a table with rows and columns, very much like a spreadsheet. Let's use an example to define some common vocabulary that we will use throughout the book.

Common terms for tabular data

A row in a table corresponds to a datapoint, and it’s often referred to as a record. A record is a list of
attributes (extracted from a data point), which are often numbers, categories, or free-form text. These
attributes go by the name of features.

According to Bishop 1, in Machine Learning a feature is an individual measurable property of a


phenomenon being observed. In other words, features are the properties we are using to characterize our
data.

Features can be directly measurable or inferred from other features. Think, for example, of the number of
times a user visited a website or the browser they used - both of which can be directly counted. We could
also create a new feature from existing data such as the average time between two user visits. The process of
calculating new features is called feature engineering.
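As a small illustration of feature engineering (the visits DataFrame below is entirely made up), we could derive the average time between two visits of each user from a raw log of visit timestamps:

import pandas as pd

# hypothetical raw data: one row per user visit
visits = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b', 'b'],
    'timestamp': pd.to_datetime(['2019-01-01', '2019-01-03', '2019-01-07',
                                 '2019-02-01', '2019-02-11'])})

visits = visits.sort_values(['user', 'timestamp'])

# engineered feature: days since the previous visit by the same user
visits['gap_days'] = visits.groupby('user')['timestamp'].diff().dt.days

# average gap per user, a new feature we could feed to a model
print(visits.groupby('user')['gap_days'].mean())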

That said, not all features are equally informative. Some may be entirely irrelevant for what we are trying
to do. For example, if we are trying to predict how likely a user is to buy our product, it is plausible that
his/her first name will have no predictive power. On the other hand, previous purchases may carry a lot
of information concerning propensity to buy.

Traditionally feature engineering (extracting or inventing “good” features) and feature selection (keeping
only the “good” features) have received much emphasis. Deep Learning bypasses these steps by
automatically figuring out the relevant features and building higher order features deeper in the network.

Here is another reason why Deep Learning is so popular: it automates the complicated process of feature
engineering.

1: Bishop, Christopher (2006). Pattern Recognition and Machine Learning. Berlin: Springer. ISBN 0-387-31073-8.

Data Exploration with Pandas


When building a predictive model, it’s often helpful to get some quick facts about our dataset. We may spot
some very evident problems with the data that we may want to address. This first phase is called data
exploration and consists of a series of questions that we want to ask:

• How big is our dataset?


• How many features do we have?
• Is any record corrupted or missing information for one or more features?
• Are the features numbers or categories?
• How is each feature distributed? Are they correlated?

We want to ask these questions early to decide how to proceed further without wasting time. For example, if
we have too few data points, we may not have enough samples to train a Machine Learning model. Our first
step, in that case, will be to go out and gather more data. If we have missing data, we need to decide what to
do about it. Do we delete the records missing data or do we impute (create) the missing data? Moreover, if
we impute the data, how do we decide the value? If we have many features, but only a few of them are not
constant, we’d better eliminate the constant features first, because they will undoubtedly have no predictive
power, and so on.

Python comes with a library that allows addressing all these questions very easily, called Pandas.

Let’s load it in our notebook.

In [1]: import pandas as pd

Pandas is an open source library that provides high-performance, easy-to-use data structures, and data
analysis tools. It can load data from a multitude of sources including CSV, JSON, Excel, HTML, HDF5 and
many others (here you may find all the types of file that it can load, together with a short description). Let’s
start by loading a CSV file.

TIP: A comma-separated values file (CSV) stores tabular data (numbers and text) in plain
text. Each line of the file is a data record, and each record consists of one or more fields,
separated by commas.

Before we do anything else, let’s also set a couple of standard options that will help us to contain the size of
the tables displayed. We configure pandas to show at most 13 rows of data in a data frame and 7 columns.
Bigger data frames will be truncated with ellipses.

In [2]: pd.set_option("display.max_rows", 13)


pd.set_option("display.max_columns", 7)
pd.set_option("display.latex.repr", True)
pd.set_option('max_colwidth', 30)

Notice here that the display.latex.repr is only set to True for the PDF version of the book, while it is
False for the other versions. Starting from the next chapter, we’ll group all the configurations in a single
script. Let’s now load the data from the titanic-train.csv file:

In [3]: df = pd.read_csv('../data/titanic-train.csv')

This is a popular dataset containing information about passengers of the Titanic, such as their name, age,
and if they survived.

pd.read_csv will read the CSV file and create a Pandas DataFrame object from it. A DataFrame is a
labeled, 2D data-structure, much like a spreadsheet.

Now that we have imported the Titanic data into a Pandas DataFrame object, we can inspect it. Let’s start by
peeking into the first few records to get a feel for how DataFrames work.

df.head() displays the first five lines of the DataFrame. We can see it as a table, with column names
inferred from the CSV file and an index, indicating the row it came from:

In [4]: df.head()

Out[4]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Hea... female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

df.info() summarizes the content of the DataFrame, letting us know the index range, the number, and
names of columns with their data type.

We also learn about missing entries. For example, notice that the Age column has a few null entries.

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

df.describe() summarizes the numerical columns with some basic stats: count, min, max, mean,
standard deviation and so on.

In [6]: df.describe()

Out[6]:

PassengerId Survived Pclass Age SibSp Parch Fare


count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

This is very useful to compare the scale of different features and decide if we need to rescale some of them.

Indexing

We can access individual elements of a DataFrame. Let’s see a few ways.



We can get the fourth row of the DataFrame (numerical index 3) using df.iloc[3]

In [7]: df.iloc[3]

Out[7]:

3
PassengerId 4
Survived 1
Pclass 1
Name Futrelle, Mrs. Jacques Hea...
Sex female
Age 35
SibSp 1
Parch 0
Ticket 113803
Fare 53.1
Cabin C123
Embarked S

We can fetch elements corresponding to indices 0-4 and column ‘Ticket’:

In [8]: df.loc[0:4,'Ticket']

Out[8]:

Ticket
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450

We can obtain the same result by selecting the first five elements of the column ‘Ticket’ with the .head()
command:

In [9]: df['Ticket'].head()

Out[9]:

Ticket
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450

To select multiple columns, we pass the list of columns:

In [10]: df[['Embarked', 'Ticket']].head()

Out[10]:

Embarked Ticket
0 S A/5 21171
1 C PC 17599
2 S STON/O2. 3101282
3 S 113803
4 S 373450

Selections

Pandas is smart about indices and allows us to write expressions. For example, we can get the list of
passengers with Age over 70:

In [11]: df[df['Age'] > 70]

Out[11]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
630 631 1 1 Barkworth, Mr. Algernon He... male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S

To understand what this does, let’s break it down. df['Age'] > 70 returns a boolean Series of values that
are True when the Age is greater than 70 (and False otherwise). The length of this series is the same as that
of the whole DataFrame, as you can check by running:

In [12]: len(df['Age'] > 70)

Out[12]: 891

Passing this series to the [] operator selects only the rows for which the boolean series is True. In other
words, Pandas matches the index of the DataFrame with the index of the Series and selects only the rows for
which the condition is True.
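
To make that index matching explicit, here is a small sketch that stores the boolean mask in a variable and
uses it with .loc, which is equivalent to the selection above:

mask = df['Age'] > 70    # boolean Series aligned with df's index
df.loc[mask]             # same rows as df[df['Age'] > 70]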

We can obtain the same result using the query operator.

In [13]: df.query("Age > 70")

Out[13]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
630 631 1 1 Barkworth, Mr. Algernon He... male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S

We can use the & and | Python operators (which perform bitwise and and bitwise or, respectively) to combine
conditions. For example, the next statement returns the records of passengers who are 11 years old and have
five siblings/spouses.

In [14]: df[(df['Age'] == 11) & (df['SibSp'] == 5)]

Out[14]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
59 60 0 3 Goodwin, Master. William F... male 11.0 5 2 CA 2144 46.9 NaN S

If we use an or operator, we’ll have passengers that are 11 years old or passengers with five siblings/spouses.

In [15]: df[(df.Age == 11) | (df.SibSp == 5)]

Out[15]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
59 60 0 3 Goodwin, Master. William F... male 11.0 5 2 CA 2144 46.9000 NaN S
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
386 387 0 3 Goodwin, Master. Sidney Le... male 1.0 5 2 CA 2144 46.9000 NaN S
480 481 0 3 Goodwin, Master. Harold Vi... male 9.0 5 2 CA 2144 46.9000 NaN S
542 543 0 3 Andersson, Miss. Sigrid El... female 11.0 4 2 347082 31.2750 NaN S
683 684 0 3 Goodwin, Mr. Charles Edward male 14.0 5 2 CA 2144 46.9000 NaN S
731 732 0 3 Hassan, Mr. Houssein G N male 11.0 0 0 2699 18.7875 NaN C
802 803 1 1 Carter, Master. William Th... male 11.0 1 2 113760 120.0000 B96 B98 S

Again, we can use the query method to achieve the same result.

In [16]: df.query('(Age == 11) | (SibSp == 5)')



Out[16]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
59 60 0 3 Goodwin, Master. William F... male 11.0 5 2 CA 2144 46.9000 NaN S
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
386 387 0 3 Goodwin, Master. Sidney Le... male 1.0 5 2 CA 2144 46.9000 NaN S
480 481 0 3 Goodwin, Master. Harold Vi... male 9.0 5 2 CA 2144 46.9000 NaN S
542 543 0 3 Andersson, Miss. Sigrid El... female 11.0 4 2 347082 31.2750 NaN S
683 684 0 3 Goodwin, Mr. Charles Edward male 14.0 5 2 CA 2144 46.9000 NaN S
731 732 0 3 Hassan, Mr. Houssein G N male 11.0 0 0 2699 18.7875 NaN C
802 803 1 1 Carter, Master. William Th... male 11.0 1 2 113760 120.0000 B96 B98 S

Unique Values

The unique method returns the unique entries. For example, we can use it to find the possible ports of
embarkation and select only the distinct values.

In [17]: df['Embarked'].unique()

Out[17]: array(['S', 'C', 'Q', nan], dtype=object)

Sorting

We can sort a DataFrame by any group of columns. For example, let’s sort people by Age, starting from the
oldest, using the ascending flag. By default, ascending is set to True, which sorts the youngest first. To
reverse the sort order, we set this value to False.

In [18]: df.sort_values('Age', ascending = False).head()

Out[18]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
630 631 1 1 Barkworth, Mr. Algernon He... male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q

Aggregations

Pandas also allows performing aggregations and group-by operations like we can do in SQL and can
reshuffle data into pivot-tables like a spreadsheet application. It is perfect for data exploration, and we
strongly recommend a thorough look at its documentation if you are new to Pandas. Here we will review
only a few useful commands.

value_counts() counts how many instances of each value are in a series, sorting them in descending
order. We can use it to know how many people survived and how many died:

In [19]: df['Survived'].value_counts()

Out[19]:

Survived
0 549
1 342

and how many people were traveling in each class:

In [20]: df['Pclass'].value_counts()

Out[20]:

Pclass
3 491
1 216
2 184

Like in a database, we can group data by column name and then aggregate them with some function. For
example, let’s count dead and alive passengers by class:

In [21]: df.groupby(['Pclass','Survived'])['PassengerId'].count()

Out[21]:

PassengerId
Pclass Survived
1 0 80
1 136
2 0 97
1 87
3 0 372
1 119

This is a potent tool. We can immediately see that almost 2/3 of the passengers in first class survived, compared
to only about a quarter of the passengers in 3rd class!
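
To double-check these fractions, we can compute the survival rate per class directly; since Survived is
encoded as 0/1, its mean is the survival rate (a quick sketch on the same DataFrame):

# fraction of survivors per class: roughly 0.63, 0.47 and 0.24 for classes 1, 2 and 3
df.groupby('Pclass')['Survived'].mean()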

We can look at individual columns min, max, mean and median, to get some more information about our
numerical features. For example, the next line shows that the youngest passenger was less than six months
old:

In [22]: df['Age'].min()

Out[22]: 0.42

while the oldest was eighty years old:

In [23]: df['Age'].max()

Out[23]: 80.0

The average age of the passengers was almost 30 years old:

In [24]: df['Age'].mean()

Out[24]: 29.69911764705882

While the median age was a bit younger, 28 years old:

In [25]: df['Age'].median()

Out[25]: 28.0

We can see if the mean age of survivors was different from the mean age of victims.

In [26]: mean_age_by_surv = df.groupby('Survived')['Age'].mean()


mean_age_by_surv

Out[26]:

Age
Survived
0 30.626179
1 28.343690

Although the mean age of survivors seems a bit lower, the difference between the two groups is small compared
to the spread of the ages, so it does not look statistically significant, as we can see by looking at the standard
deviations.

In [27]: std_age_by_survived = df.groupby('Survived')['Age'].std()


std_age_by_survived

Out[27]:

Age
Survived
0 14.172110
1 14.950952
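
If we wanted a more formal check than eyeballing the standard deviations, we could run a two-sample
t-test with SciPy. This is only a sketch that goes beyond what we need here; note that we drop the missing
ages first:

from scipy import stats

age_survived = df[df['Survived'] == 1]['Age'].dropna()
age_victims = df[df['Survived'] == 0]['Age'].dropna()

# Welch's t-test (does not assume equal variances) on the two age samples
t_stat, p_value = stats.ttest_ind(age_survived, age_victims, equal_var=False)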

Merge

Pandas can perform join operations as we can do in SQL using the merge operation. For example, let’s
combine the two previous tables:

In [28]: df1 = mean_age_by_surv.round(0).reset_index()


df1

Out[28]:

Survived Age
0 0 31.0
1 1 28.0

In [29]: df2 = std_age_by_survived.round(0).reset_index()


df2

Out[29]:

Survived Age
0 0 14.0
1 1 15.0

In [30]: df3 = pd.merge(df1, df2, on='Survived')


df3

Out[30]:

Survived Age_x Age_y


0 0 31.0 14.0
1 1 28.0 15.0

In [31]: df3.columns = ['Survived',


'Average Age',
'Age Standard Deviation']
df3

Out[31]:

Survived Average Age Age Standard Deviation


0 0 31.0 14.0
1 1 28.0 15.0

merge is incredibly powerful. We recommend reading more about its functionality in the Pandas documentation.

Pivot Tables

Pandas can aggregate data into a pivot table, just like Microsoft Excel.

TIP: A pivot table is a table that summarizes data in another table by applying a double
group-by operation followed by an aggregation such as average or sum.

For example, we can create a table which holds the count of the number of people who survived (or not) per
class:

In [32]: df.pivot_table(index='Pclass',
columns='Survived',
values='PassengerId',
aggfunc='count')

Out[32]:

Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119

Correlations

Finally, Pandas can also calculate correlations between features, making it easier to spot redundant
information or uninformative columns.

For example, let’s check the correlation of a few columns with the Survived column. If it’s true that
women and children went first, we expect to see some correlation with Age and Sex, while we expect no
correlation with PassengerId.

Since the Sex column is a string, we first need to create an auxiliary (extra) IsFemale boolean column that
is set to True if the Sex is ‘female’.

In [33]: df['IsFemale'] = df['Sex'] == 'female'

In [34]: corr_w_surv = df.corr()['Survived'].sort_values()


corr_w_surv

Out[34]:

Survived
Pclass -0.338481
Age -0.077221
SibSp -0.035322
PassengerId -0.005007
Parch 0.081629
Fare 0.257307
IsFemale 0.543351
Survived 1.000000

Before looking at what these values mean, let’s peek ahead a little and use Pandas plotting functionality
to display the last result visually. Let’s import Matplotlib:

In [35]: import matplotlib.pyplot as plt

Also, let’s set the configuration of plots first:

In [36]: from matplotlib.pyplot import rcParams

rcParams['font.size'] = 14
rcParams['lines.linewidth'] = 2
rcParams['figure.figsize'] = (9, 6)
rcParams['axes.titlepad'] = 14
rcParams['savefig.pad_inches'] = 0.2

Now let’s use pandas to plot the corr_w_surv data frame. Notice that we will exclude the last row, which is
Survived itself:

In [37]: title = 'Titanic Passengers: correlation with survival'


corr_w_surv.iloc[:-1].plot(kind='bar', title=title);

Let’s interpret the graph above. The largest correlation with survival is being a woman. We also see that
people who paid a higher fare (probably corresponding to a higher class) had a higher chance of surviving.

The attribute Pclass is negatively correlated, meaning the higher the class number, the lower the chance of
survival, which makes sense (first class passenger more likely to survive than third class).

Age is also negatively correlated, though mildly, meaning the younger you are the more likely you are to
survive. Finally, as expected PassengerId does not correlate with survival.

We’ve barely scratched the surface of what Pandas can do for data manipulation and data exploration. Do
refer to the mentioned documentation for a better understanding of its capabilities.

Visual data exploration


After an initial look at the properties of our tabular dataset, it is often beneficial to dig a little deeper using
visualizations. Looking at a graph, we may spot a trend, a particular repeating pattern, or a correlation. Our
visual cortex is an extremely good pattern recognizer, so it only makes sense to take advantage of it when
possible.

We can represent data visually in several ways, depending on the type of data and on what we are interested
in seeing.

Let’s create some artificial data and visualize it in different ways.

In [38]: import numpy as np

We will create 4 data series:

• A noisy stationary sequence centered around zero (data1).
• A sequence with larger noise, following a linearly increasing trend (data2).
• A sequence where the noise increases over time (data3).
• A sequence with intermediate noise, following a sinusoidal oscillatory pattern (data4).

In [39]: N = 1000
data1 = np.random.normal(0, 0.1, N)
data2 = (np.random.normal(1, 0.4, N) +
np.linspace(0, 1, N))
data3 = 2 + (np.random.random(N) *
np.linspace(1, 5, N))
data4 = (np.random.normal(3, 0.2, N) +
0.3 * np.sin(np.linspace(0, 20, N)))

Now, let’s create a DataFrame object composing all of our newly created data sequences. First, we aggregate
the data using np.vstack and we transpose it:

In [40]: data = np.vstack([data1, data2, data3, data4])


data = data.transpose()

Then we create a data frame with the appropriate column names:

In [41]: cols = ['data1', 'data2', 'data3', 'data4']

df = pd.DataFrame(data, columns=cols)
df.head()

Out[41]:

data1 data2 data3 data4


0 -0.055743 0.953853 2.858182 3.177591
1 0.035861 1.093153 2.365206 3.061100
2 0.071000 1.102536 2.621110 3.211021
3 -0.171425 1.020165 2.740725 2.853363
4 0.021963 0.787630 2.666747 2.938457

Even with a description of these four data series in hand, it’s tough to understand what’s going on by
simply looking at the table of numbers. Instead, let’s look at this data visually.

Line Plot

The Pandas plot function defaults to a line plot. This is a good choice if our data comes from an ordered series
of consecutive events (for example, the outside temperature in a city over the course of a year).

A line plot represents the values of data in sequential order and makes it easy to spot trends like growth over
time or seasonal patterns.

In [42]: df.plot(title='Line plot');



Above, we’re using the plot method on the DataFrame. We can obtain the same plot using
matplotlib.pyplot (and passing in the DataFrame df as an argument) like this:

In [43]: plt.plot(df)
plt.title('Line plot')
plt.legend(['data1', 'data2', 'data3', 'data4']);

Scatter plot

If the data is not ordered, and we are looking for correlations between variables, a scatter plot is a better
choice. As a first step, we can change the style of the line plot to dots, which draws each data point individually:

In [44]: df.plot(style='.', title='Scatter Plot');



Alternatively, we can use the scatter plot kind if we want to visualize one column against another:

In [45]: df.plot(kind='scatter', x='data1', y='data2',


xlim=(-1.5, 1.5), ylim=(0, 3),
title='Data1 VS Data2');

In the above plot, we see that there is no correlation between data1 and data2 (which may be obvious
because data1 is random noise).

Histograms

Sometimes we are interested in knowing the frequency of occurrence of data, and not their order. In this
case, we divide the range of data into buckets and ask how many points fall into each bucket. This is called a
histogram, and it represents the statistical distribution of our data.

This could look like a bell curve, or an exponential decay, or have a weird shape. By plotting the histogram
of a feature, we might spot the presence of distinct sub-populations in our data and decide to deal with each
one separately.

In [46]: df.plot(kind='hist',
bins=50,
title='Histogram',
alpha=0.6);

Note that we lose all the temporal information contained in our data. For example, the oscillations in data4
are no longer visible; all we see is a fairly wide bell-like distribution, because the sinusoidal oscillations
have been summed up in the histogram.

Cumulative Distribution

A close relative of a histogram is the cumulative distribution, which serves to calculate what fraction of our
sample falls below a certain value:

In [47]: df.plot(kind='hist',
bins=100,
title='Cumulative distributions',
density=True,
cumulative=True,
alpha=0.4);

Try to answer these questions:

1. how much of data1 falls below 2?


2. how much of data2 falls below 1.5?

Answers: 1. 100. If you draw a vertical line that passes through 2, you will see that it
crosses the cumulative distribution for data1 at the high value of 1, which corresponds to
100. 2. approximately 50. This can be seen by tracing a vertical line at 1.5 and checking
at what height it crosses the data2 distribution.
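
We can also verify these readings numerically: averaging a boolean comparison gives the exact fraction of
values below a threshold (a quick sketch on the same DataFrame):

# fraction of data1 below 2 (essentially 1.0, i.e. 100%)
(df['data1'] < 2).mean()

# fraction of data2 below 1.5 (roughly 0.5, i.e. about 50%)
(df['data2'] < 1.5).mean()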

Box plot

A box plot is a useful tool to compare several distributions. It is often used in biology and medicine to
compare the results of an experiment with a control group. For example, in the simplest form of a clinical
trial for a new drug, there will be two boxes, one for the population that took the drug and the other for the
population that took the placebo.

In [48]: df.plot(kind='box',
title='Boxplot');

What does this plot mean? In loose terms, it’s as if we were looking at the histogram plot from above. Each
box represents the critical facts about the distribution of that particular data series. Let’s first get an intuition
about the information it shows. Later we will give a more formal definition.

Let’s start with the green horizontal line that cuts each box. It represents the position of the peak of the
histogram. We can check that for data1 the line is at 0, exactly like the very sharp peak of data1
in the histogram figure, and for data4 the green line is roughly at 3, precisely like the peak of the red
histogram in the previous picture.

The box represents the bulk of the data, i.e., it gives us an idea of how fat and centered our distribution is
around the peak. We can see that the box in data3 is not centered around the green line, reflecting the fact
that the green histogram is skewed. The whiskers give us an idea of the extension of the tails of the
distribution. Again, notice how the upper whisker of data3 extends to high values.

TIP: For the more statistically inclined readers, here are the formal definitions of the above
concepts: - The green line is the median of our data, i.e., the value lying at the midpoint of
the distribution. - The box around it extends from the first quartile (Q1) to the third
quartile (Q3), i.e., it covers the interquartile range (IQR) containing the central 50% of the
data. Notice how these boxes reproduce more closely the actual size of the noise fluctuations
for data2 and data4. - The whiskers above and below denote the range of data not considered
outliers. By default they are set to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Notice that these give us a
clear indication that data3 is not symmetric around its median. - The dots represent data
points that are considered outliers.
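
To connect these definitions back to the numbers, here is a minimal sketch that computes the quartiles, the
IQR, and the default whisker bounds for data3:

q1 = df['data3'].quantile(0.25)
q3 = df['data3'].quantile(0.75)
iqr = q3 - q1

# default whisker bounds; points outside them are drawn as outlier dots
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df['data3'] < lower_bound) | (df['data3'] > upper_bound)]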

TIP: In the previous TIP, we just introduced the concept of outliers. Outliers are data that
are distant from other observations. Outliers may be due for example to variability in the
measurement, or they may indicate experimental errors. It is a fundamental concept in
Machine Learning, and we’ll have the chance to discuss it later.

Subplots

We can also combine these plots in a single figure using the subplots command:

In [49]: fig, ax = plt.subplots(2, 2, figsize=(16,12))

df.plot(ax=ax[0][0],
title='Line plot')

df.plot(ax=ax[0][1],
style='o',
title='Scatter plot')

df.plot(ax=ax[1][0],
kind='hist',
bins=50,
title='Histogram')

df.plot(ax=ax[1][1],
kind='box',
title='Boxplot');

Pie charts

Pie charts are useful to visualize fractions of a total. For example, we could ask what fraction of data1 is greater
than 0.1:

In [50]: gt01 = df['data1'] > 0.1


piecounts = gt01.value_counts()
piecounts

Out[50]:

data1
False 842
True 158

In [51]: piecounts.plot(kind='pie',
                        figsize=(7, 7),
                        explode=[0, 0.15],
                        labels=['<= 0.1', '> 0.1'],
                        autopct='%1.1f%%',
                        shadow=True,
                        startangle=90,
                        fontsize=16);

Hexbin plot

Hexbin plots are useful to look at 2-D distributions. Let’s generate some new data for this plot.

In [52]: dat1 = np.random.normal((0, 0), 2, size=(1000, 2))
         dat2 = np.random.normal((9, 9), 3, size=(2000, 2))

         data = np.vstack([dat1, dat2])

         df = pd.DataFrame(data, columns=['x', 'y'])

In [53]: df.head()

Out[53]:

x y
0 0.077641 3.539570
1 3.284320 -3.680475
2 0.467137 -0.479667
3 -1.734382 -0.980701
4 -0.907213 -0.826562

In [54]: df.plot();

This new data is a stack of two 2-D random sequences, the first one centered at (0, 0) and the second one
centered at (9, 9). Let’s see how the hexbin plot visualizes them.

In [55]: df.plot(kind='hexbin', x='x', y='y', bins=100,


cmap='rainbow', title='Hexbin Plot');

The Hexbin plot is the 2-D extension of a histogram. It defines regular tiles to cover the 2-D plane and then
counts how many points end up in each tile. The color is proportional to the count. Since we created this
dataset with points sampled from 2 Gaussian distributions, we expect to see tiles containing more points
near the centers of these two Gaussians, which is what we observe above.

We encourage you to have a look at this gallery to get some inspiration for visualizing your data.
Remember that the choice of visualization is tied to the kind of data and the kind of question we are asking.

Unstructured data
More often than not, data doesn’t come as a nice, well-formatted table. As we mentioned earlier, we could be
dealing with images, sound, text, movies, protein molecular structures, video games and many other types
of data.

The beauty of Deep Learning is that it can handle most of this data and learn optimal ways to represent it for
the task at hand.

Images

Let’s take images for example. We’ll use the PIL imaging library (installed as Pillow in Python 3).

In [56]: from PIL import Image

In [57]: img = Image.open('../data/iss.jpg')


img

Out[57]:

We can convert the image to a 3-D array using numpy. After all, an image is a table of pixels, each containing
the values for red, green, and blue. So, our image is, in fact, a three-dimensional table, where rows and
columns correspond to the pixel index and the depth corresponds to the color channel.

In [58]: imgarray = np.asarray(img)

In [59]: imgarray.shape

Out[59]: (435, 640, 3)

The shape of the above array is (height, width, channels). While it’s quite easy to think of features
when dealing with tabular data, it’s trickier when we deal with images. We could imagine unrolling this
image into a long list of numbers, walking along each of the three dimensions. If we did so, our dataset of
images would again be a tabular dataset, with each row corresponding to a particular image and each
column corresponding to a specific pixel and color channel.

In [60]: imgarray.ravel().shape

Out[60]: (835200,)

However, not only has this procedure created 835200 features for our image, but by doing so we have also lost most of
the useful information in the image. In other words, a single pixel in an image carries very little
information, while most of the information is in changes and correlations between nearby pixels. Neural
Networks can learn features from that through a technique called convolution, which we will learn about
later in this course.

Sound

Now take sound. Digitally-recorded sound is a long series of ordered numbers representing the sound wave.
Let’s load an example file.

In [61]: from scipy.io import wavfile

In [62]: rate, snd = wavfile.read(filename='../data/sms.wav')

We can play the audio file in the notebook:

In [63]: from IPython.display import Audio

In [64]: Audio(data=snd, rate=rate)

Out[64]: <IPython.lib.display.Audio object>

This file is sampled at 44.1 kHz, which means 44100 times per second. So, our 2.5-second file contains over
100k samples:

In [65]: len(snd)

Out[65]: 110250

In [66]: snd

Out[66]: array([70, 14, 27, ..., 58, 68, 59], dtype=int16)

We can use matplotlib to plot the sound like this:

In [67]: plt.plot(snd)
plt.title('sms.wav as a Line Plot');

If each point in our dataset is a recorded sound, it is likely that each will have a different length. We could
still represent our data in tabular form by taking each consecutive sample as a feature and padding with
zeros the records that are shorter, but these extra zeros would carry no information (unless we had taken
great care to synchronize each file so that the sound started at the same sample number).
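
As an illustration of that zero-padding idea (a sketch with fake clips, not how we will eventually feed
sound to a network), NumPy can pad shorter recordings so they all share the same length:

import numpy as np

# fake sound clips of unequal length
clips = [np.random.randint(-100, 100, size=n) for n in (90, 110, 75)]
max_len = max(len(clip) for clip in clips)

# pad each clip with zeros at the end and stack them into a tabular array
padded = np.stack([np.pad(clip, (0, max_len - len(clip))) for clip in clips])
padded.shape   # (3, 110): one row per clip, one column per sample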

Besides, sound information is carried in modulations of frequency, suggesting that the raw form may not be
the best to use. As we shall see, there are better ways to represent sound and to feed it to a Neural Network
for tasks like music recognition or speech-to-text.

In [68]: _ = plt.specgram(snd, NFFT=1024, Fs=44100)


plt.ylabel('Frequency (Hz)')
plt.xlabel('Time (s)')
plt.title('sms.wav as a Spectrogram');

Text data

Text documents pose similar challenges. If each data point is a document, we need to find a good
representation for it if we want to build a model that identifies it. We could use a dictionary of words and
count the relative frequencies of words, but with Neural Networks we can do better than this.
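
For instance, a minimal word-frequency (bag-of-words) sketch could use just the standard library; it is
illustrative only, and we will see much better text representations later:

from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat"]

for doc in docs:
    words = doc.lower().split()
    counts = Counter(words)
    # relative frequency of each word in the document
    freqs = {word: count / len(words) for word, count in counts.items()}
    print(freqs)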

In general, this is called the problem of representation, and Deep Learning is a great technique to tackle it!

Feature Engineering
As we have seen, unstructured data does not look like tabular data. The traditional solution to connect the
two is feature engineering.

In feature engineering, an expert uses her domain knowledge to create features that correctly encapsulate
the relevant information from the unstructured data. Feature engineering is fundamental to the application
of Machine Learning, and it is both challenging and expensive.

For example, if we are training a Machine Learning model on a face recognition task from images, we could
use well-tested existing methods to detect a face and measure the distance between points like eyes, mouth,
and nose. These distances would be the engineered features we would pass to the model.

Similarly, in the domain of speech recognition, features based on wavelets and Short Time Fourier
Transforms were the standard until not long ago.

Deep Learning disrupts feature engineering by learning the best features directly from the raw
unstructured data. This approach is not only powerful but also much, much faster. It is a paradigm shift:
a more versatile technique takes over the role of the domain expert.

Exercises
Now it’s time to test what you’ve learned with a few exercises.

Exercise 1

• load the dataset: ../data/international-airline-passengers.csv


• inspect it using the .info() and .head() commands
• use the function pd.to_datetime() to change the column type of ‘Month’ to a DateTime type (you
can find the doc here)
• set the index of df to be a DateTime index using the column ‘Month’ and the df.set_index()
method
• choose the appropriate plot and display the data
• choose appropriate scale
• label the axes

In [ ]:

Exercise 2

• load the dataset: ../data/weight-height.csv


• inspect it
• plot it using a scatter plot with Weight as a function of Height
• plot the male and female populations with two different colors on a new scatter plot
• remember to label the axes

In [ ]:

Exercise 3

• plot the histogram of the heights for males and females on the same plot
• use alpha to control transparency in the plot command
• plot a vertical line at the mean of each population using plt.axvline()

• bonus: plot the cumulative distributions

In [ ]:

Exercise 4

• plot the weights of the males and females using a box plot
• which one is easier to read?
• (remember to put in titles, axes, and legends)

In [ ]:

Exercise 5

• load the dataset: ../data/titanic-train.csv


• learn about scattermatrix here
• display the data using a scattermatrix

In [ ]:
Machine Learning
This chapter will introduce some common Machine Learning terms and techniques. When we talk about
Deep Learning, we mean a set of tools and techniques in Machine Learning that involve artificial Neural
Networks.

Deep Learning is a branch of Artificial Intelligence

Since for the rest of the book we will use terms like train_test_split or cross_validation, it makes
sense to introduce these first and then explain Deep Learning.


The purpose of Machine Learning


Machine Learning is a branch of Artificial Intelligence that develops computer programs that can learn
patterns and rules from data. Although its origins can be traced back to the early days of modern computer
science, only in the last decade has Machine Learning become a fundamental tool for companies in all
industries.

Product recommendation, advertisement optimization, machine translation, image recognition,


self-driving cars, spam and fraud detection, automated medical diagnoses: these are just a few examples of
how Machine Learning is omnipresent in business and life.

This revolution has mostly been possible thanks to the combination of 3 factors:

• cheap and powerful memory storage


• cheap and powerful computing power
• explosion of data collected by mobile phones, web apps, and sensors

3 Enablers of the Machine Learning revolution

These same 3 factors are enabling the current Deep Learning and AI revolution. Deep Neural Networks
have been around for quite a while, but it wasn’t until relatively recently that we have had powerful enough
computers (and large enough datasets) to make good use of them. Things changed in the last few years, and
many companies that used other Machine Learning techniques are now switching to Deep Learning.

Before we start studying Neural Networks, we need to make sure to have a shared understanding of
Machine Learning, so this chapter is a quick summary of its central concepts.

If you are already familiar with terms like Regression, Classification, Cross-Validation and Confusion
Matrix, you may want to skim through this section quickly. However, make sure you understand cost
functions and parameter optimization as they are fundamental for everything that will follow!

Different types of learning


There are several types of Machine Learning, including:

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

Different types of learning

While this course will primarily focus on supervised learning, it is essential to understand the difference in
each of the types.

In Supervised Learning an algorithm learns from labeled data. For example, let’s say we are training an
image recognition algorithm to distinguish cats from dogs: each training datapoint will be the pair of an
image (training data) and a label, which specifies if the image is a cat or a dog. Similarly, if we are training a
translation engine, we will provide both input and output sentences, asking the algorithm to learn the
function that connects them.

Conversely, in Unsupervised Learning, data comes without labels, and the task is to find similar data
points in the dataset, to identify any underlying higher-order structure. For example, in a dataset containing
the purchase preferences of e-commerce users, these users will likely form clusters with similar purchase
behavior regarding amount spent or objects bought. We can think of these as different “tribes” with different
preferences. Once we identify these tribes, we can describe each data point (that is, each user) referring to
the tribe it belongs to, gaining a deeper understanding of the data.

Finally, Reinforcement Learning is similar to Supervised Learning, but in this case the algorithm trains
an agent to act in an environment. The actions of the agent lead to outcomes that are attached to a score,
and the algorithm tries to maximize that score. Typical examples here are algorithms that learn to play
games, like Chess or Go. The main difference from Supervised Learning is that the algorithm does not
receive a label (score) for each action it takes. Instead, it needs to perform a sequence of steps before it
knows whether they lead to a higher score.

In 2016 a program trained with Reinforcement Learning beat the world Go champion, marking a new
milestone in the race towards artificial intelligence.

Supervised Learning
Let’s dive into Supervised Learning by first reviewing some of its successful applications. Have you ever
noticed that email spam is practically non-existent these days? This is thanks to Supervised Learning.

In the early 2000s, e-mailboxes received tons of emails advertising pills, money-making schemes and other
junk. The first step to get rid of these was to allow users to move spam emails into a spam
folder, which provided the training labels. With millions of users manually cataloging spam, large email
providers like Google and Yahoo could quickly gather enough examples of what a spam mail looked like to
train a model that would predict the probability for a message to be spam.

This technique is called a binary classifier, and it is a Machine Learning algorithm that learns to
distinguish between 2 classes, like true or false, spam or not spam, positive or negative sentiment, dead or
alive.

Binary classifiers trained with Supervised Learning are ubiquitous. Telecom companies use them to predict
if a user is about to churn and go to a competitor, so they know when and to whom to make an offer to
retain them.

Social media analytics companies use binary classifiers to judge the prevailing sentiment on their clients’
pages. If you are a celebrity, you receive millions of comments each time you post something on Facebook
or Twitter. How can you know if your followers were prevalently happy or angry at what you tweeted? A
sentiment analysis classifier can distinguish that for every single comment, and therefore give us the overall
reaction by aggregating over all observations.

Sentiment of a text sentence

Supervised learning is also used to predict continuous quantities, for example, to forecast retail sales of
next month or to predict how many cars there will be at a particular intersection to offer a better route for
car navigation. In this case, the labels are not discrete like “true/false” or “black/blue/green” but they have
continuous values, like 68, 73, 71 if we’re trying to predict temperature.

What other examples of Supervised Learning can you think of?

Configuration File
As promised in the previous chapter, from this chapter onward we’ll bundle common packages and
configurations in a single config file that we load at the beginning of the chapter. Let’s go ahead and load it:

In [1]: with open('common.py') as fin:


msg = fin.read()

Let’s take a look at its content:

In [2]: print(msg)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import seaborn here because otherwise it will mess up warnings


import seaborn as sns
import warnings
import logging

# https://github.com/tensorflow/tensorflow/issues/8340#issuecomment-332212742
# disabling tensorflow warnings; comment out the next 2 lines
# if you prefer to see the warnings
logging.getLogger("tensorflow").disabled = True
warnings.simplefilter("ignore")

pd.set_option("display.max_rows", 13)
pd.set_option('display.max_columns', 8)
pd.set_option("display.latex.repr", True)
pd.set_option('max_colwidth', 30)

To execute this file we use the exec function:

In [3]: exec(msg)

We have now loaded pyplot, pandas and numpy and we have set a few configuration parameters. Let’s also
load the configuration for matplotlib. For some reason this must be executed in a separate cell or it won’t
work:

In [4]: with open('matplotlibconf.py') as fin:


exec(fin.read())

We are ready to roll!

Linear Regression
Let’s take a second look at the plot we drew in Exercise 2 of Section 2. As we know by now, it represents a
population of individuals. Each dot is a person, and the position of the dot on the chart is defined by two
coordinates: weight and height. Let’s plot it again:

In [5]: df = pd.read_csv('../data/weight-height.csv')

In [6]: df.head()

Out[6]:

Gender Height Weight


0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801

In [7]: def plot_humans():


df.plot(kind='scatter',
x='Height',
y='Weight',
title='Weight and Height in adults')

plot_humans()

Can we tell if there is a pattern in how the dots are laid out on the graph or do they seem completely
randomly spread? Our visual cortex, a great pattern recognizer, tells us that there is a pattern: dots are
roughly spread around a straight diagonal line. This line seems to indicate the obvious: taller people are also
heavier on average.

Let’s sketch a line to represent this relationship. We can plot this line “by hand”, without any learning, by
choosing the values of the two extremities. For example, let’s draw a segment that starts at the point (55, 75)
and ends at the point (78, 250).

In [8]: plot_humans()

# Here we plot the red line 'by hand' with fixed values
# We'll try to learn this line with an algorithm below
plt.plot([55, 78], [75, 250], color='red', linewidth=3);

Can we specify this relationship more precisely? Instead of guessing the position of the line, can we ask an
algorithm to find the best possible line to describe our data? The answer is yes! Let’s see how.

We are saying that weight (our target or label) is a linear function of height (our only feature).

Let’s assign variable names to our quantities. As we saw in the introduction, it is common to use the letter y
for the labels (people’s weight in this case) and the letter X for the input features (only height in this case).

You may remember from high school math that an equation can describe a line in a 2D-space. Indicating
the input as X (horizontal axis) and the outputs as y (vertical axis) we need only two parameters. One
parameter controls the point where the line crosses the vertical axis, the other controls the slope of the line.
We can write the equation of a line in a 2D plane as:

ŷ = b + Xw (3.1)

where ŷ is pronounced y-hat. Let’s first convince ourselves that this indicates any possible line in the 2D
plane (except for a perfectly vertical line).

If we choose b = 0 and w = 0, we obtain the equation ŷ = 0 for any value of X, which is the set of points that
form the horizontal line passing through zero.

If we start changing b, we will obtain ŷ = b, which is still a horizontal line, passing through the constant
point b. Finally, if we also change w, the line will start to be inclined in some way.

So yes, any line in the 2D-plane, except for a vertical line, will have its unique values for w and b.

To find a linear relation between X and y means to describe our labels y as a linear function of X plus some
small correction є:

y = b + Xw + є = ŷ + є (3.2)

It’s good to get used to distinguishing between the values of the output (y, our labels) and the values of the
predictions ( ŷ).

Let’s draw some examples.

In this chapter, we are going to explain how an algorithm can find the perfect line to fit a dataset. Before
writing an algorithm, it’s helpful to understand the dynamics of this line formula. So what we’re going to do
is draw a few plots where we change the values of b and w and see how they affect the position of the line in
the 2D plane. This will give us better insight when we try to automate this process.

Let’s start by defining a simple line function:

In [9]: def line(x, w=0, b=0):


return x * w + b

Then let’s create an array of equally spaced x values between 55 and 80 (these are going to be the values of
height):

In [10]: x = np.linspace(55, 80, 101)


x

Out[10]: array([55. , 55.25, 55.5 , 55.75, 56. , 56.25, 56.5 , 56.75, 57. ,
57.25, 57.5 , 57.75, 58. , 58.25, 58.5 , 58.75, 59. , 59.25,
59.5 , 59.75, 60. , 60.25, 60.5 , 60.75, 61. , 61.25, 61.5 ,
61.75, 62. , 62.25, 62.5 , 62.75, 63. , 63.25, 63.5 , 63.75,
64. , 64.25, 64.5 , 64.75, 65. , 65.25, 65.5 , 65.75, 66. ,
66.25, 66.5 , 66.75, 67. , 67.25, 67.5 , 67.75, 68. , 68.25,
68.5 , 68.75, 69. , 69.25, 69.5 , 69.75, 70. , 70.25, 70.5 ,
70.75, 71. , 71.25, 71.5 , 71.75, 72. , 72.25, 72.5 , 72.75,
73. , 73.25, 73.5 , 73.75, 74. , 74.25, 74.5 , 74.75, 75. ,
75.25, 75.5 , 75.75, 76. , 76.25, 76.5 , 76.75, 77. , 77.25,
77.5 , 77.75, 78. , 78.25, 78.5 , 78.75, 79. , 79.25, 79.5 ,
79.75, 80. ])

And let’s pass these values to the line function and calculate ŷ. Since both w and b are zero, we expect ŷ to
also be zero:

In [11]: yhat = line(x, w=0, b=0)


yhat

Out[11]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [12]: plot_humans()

plt.plot(x, yhat, color='red', linewidth=3);


So we’ve drawn a horizontal line as our model. This is not really a good model for our data! It would be a
good model if everyone in our population was floating in space and therefore measured 0 weight regardless
of their height. Fun, but not accurate for our chart! See how far the line is from our data.

If we let b vary, the horizontal line starts to move up or down, indicating constant weight b, regardless of the
value of x (the height).

In [13]: plot_humans()

# three settings for b "offset" the line vertically


plt.plot(x, line(x, b=50), color='orange', linewidth=3)
plt.plot(x, line(x, b=150), color='red', linewidth=3)
plt.plot(x, line(x, b=250), color='black', linewidth=3);


This would be a good model only if we had a broken scale that always returned a fixed value, regardless of
who steps on it. Also not accurate.

Finally, if we vary w, the line starts to tilt, with w indicating the increment in weight corresponding to the
increment in 1 unit of height. For example, if w=1, that would imply that 1 pound is gained for each inch of
height.

In [14]: plot_humans()

         plt.plot(x, line(x, w=5), color='black', linewidth=3)
         plt.plot(x, line(x, w=3), color='red', linewidth=3)
         plt.plot(x, line(x, w=-1), color='orange', linewidth=3);


So, to recap, we started from the intuitive observation that taller people are heavier and we decided to look
for a line function to predict the weight of a person as a function of the height.

Then we observed that any line in the 2D plane needs the definition of two parameters: b and w. We plotted
a few such lines and compared them with our data. Now we need to find the values of such parameters that
correspond to the best line for our data.

Cost Function

To find the best possible linear model to describe our data, we need to define a criterion to evaluate the
“goodness” of a particular model.

In Supervised Learning we know the values of the labels. So we can compare the value predicted by the
hypothesis with the actual value of the label and calculate the error for each data point:

є i = y i − ŷ i (3.3)

Remember that y i is the actual value of the output while ŷ i is our prediction. Also, notice that we used a
subscript index i to indicate the i-th data point. The difference for each data point is called a residual, and
collectively they are referred to as the residuals.

Note that in this definition a residual carries a sign: it will be positive if our hypothesis underestimates the
actual weight and negative if it overestimates it.

Residuals

Since we don’t really care about the direction in which our hypothesis is wrong (we only care about the total
amount of being wrong), we can define the total error as the sum of the absolute values of the residuals:

Total Error = ∑_i ∣є_i∣ = ∑_i ∣y_i − ŷ_i∣    (3.4)

The total error is one possible example of what’s called a cost function. We have associated a well-defined
cost that we can calculate from features and labels through the use of our hypothesis ŷ = h(x).
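
As a sketch, the total error of equation (3.4) could be written as a small helper function, analogous to the
mean squared error helper we will define shortly:

import numpy as np

def total_absolute_error(y_true, y_pred):
    # sum of the absolute residuals |y_i - y_hat_i|
    return np.abs(y_true - y_pred).sum()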

For reasons that will be apparent later, it is often preferable to use another cost function called Mean
Squared Error. This is defined as:

MSE = (1/N) ∑_i (y_i − ŷ_i)²    (3.5)

where N is the number of data points used to train our model.

Notice that since the square is a positive function, the MSE will be big when the total error is big and small
when the total error is small, so the two measures broadly agree. However, the Mean Squared Error (or MSE,
for short) is preferred because it is smooth and, for a linear model, guaranteed to have a single global
minimum, which is what we are going to look for.

Finding the best model

Now that we have both a hypothesis (linear model) and a cost function (mean squared error), we need to
find the combination of parameters b and w that minimizes such cost.

Remember that cost is another way to say the ‘error amount’ of our prediction - we’re assigning a number to
how wrong our prediction is. We want to minimize this error (cost): if it were zero, it would mean our
predictions match the labels exactly.

Let’s first define a helper function to calculate the MSE and then evaluate the cost for a few lines:

In [15]: def mean_squared_error(y_true, y_pred):


s = (y_true - y_pred)**2
return s.mean()

Let’s also define inputs and outputs for our data. Our input is the height column. We will assign it to the
variable X:

In [16]: X = df[['Height']].values
X

Out[16]: array([[73.84701702],
[68.78190405],
[74.11010539],
...,
[63.86799221],
[69.03424313],
[61.94424588]])

Notice that X is a matrix with 10000 rows and a single column:

In [17]: X.shape

Out[17]: (10000, 1)

This format will allow us to extend the linear regression to cases where we want to use more than one
column as input.

Then let’s define the outputs:



In [18]: y_true = df['Weight'].values


y_true

Out[18]: array([241.89356318, 162.31047252, 212.74085556, ..., 128.47531878,


163.85246135, 113.64910268])

The outputs are a single array of values. What is the cost going to be for the horizontal line passing through
zero? We can calculate it as follows.

First we generate predictions for each value of X:

In [19]: y_pred = line(X)


y_pred

Out[19]: array([[0.],
[0.],
[0.],
...,
[0.],
[0.],
[0.]])

And then we calculate the cost, i.e. the mean squared error between these predictions and our true values:

In [20]: mean_squared_error(y_true, y_pred.ravel())

Out[20]: 27093.83757456157

Notice that we flattened the predictions so that they have the same shape as the output vector.

The cost is above 27,000. What does it mean? Is it bad? Is it good? It’s hard to say because we don’t have
anything to compare it to. Different datasets will have very different numbers here depending on the units
of measure of the quantity we are predicting. So the value of the cost has very little meaning by itself. What
we need to do is compare this cost with that of another choice of b and w. Let’s increase w a little bit:

In [21]: y_pred = line(X, w=2)


mean_squared_error(y_true, y_pred.ravel())

Out[21]: 1457.1224504786412

The total MSE decreased from over 27000 to below 2000, which is good. It means our new hypothesis with
w=2 is less wrong than the one with w=0.

Let’s see what happens if we also change b:

In [22]: y_pred = line(X, w=2, b=20)


mean_squared_error(y_true, y_pred.ravel())

Out[22]: 708.9129575511095

Even better! As you can see, we can keep changing b and w by small amounts, and the value of the cost will
keep changing.

Of course, it’s going to take forever for us to find the best combination if we sit here and tweak numbers
until we see the best ones. A better way would be to write a program that tests many possible values for us
and then reports back only the best result.
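
For example, a brute-force sketch of such a program (we won’t run it here, since we will shortly use a
smarter approach) could scan a grid of candidate values for w and b with the line and mean_squared_error
functions defined above, and keep the combination with the lowest cost; the grid ranges below are arbitrary
choices:

best = None
for w in np.linspace(0, 10, 51):
    for b in np.linspace(-400, 100, 51):
        cost = mean_squared_error(y_true, line(X, w=w, b=b).ravel())
        if best is None or cost < best[0]:
            best = (cost, w, b)

best   # (lowest cost found, corresponding w, corresponding b)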

Before we do that, let’s check a couple of other combinations of w and b. Let’s try to keep w fixed and vary
only b.

In [23]: plt.figure(figsize=(10, 5))

# we are going to draw 2 plots in the same figure


# first plot, data and a few lines
ax1 = plt.subplot(121)
df.plot(kind='scatter',
x='Height',
y='Weight',
title='Weight and Height in adults', ax=ax1)

# let's explore the cost function for a


# few values of b between -100 and +150
bbs = np.array([-100, -50, 0, 50, 100, 150])

mses = [] # append the values of the cost here


for b in bbs:
y_pred = line(X, w=2, b=b)
mse = mean_squared_error(y_true, y_pred)
mses.append(mse)
plt.plot(X, y_pred)

# second plot: Cost function


ax2 = plt.subplot(122)
plt.plot(bbs, mses, 'o-')
plt.title('Cost as a function of b')
plt.xlabel('b')
plt.tight_layout();


When w = 2, the cost as a function of b has a minimum value somewhere near 50.

The same would be true if we let w vary, there will be a value of w for which the cost is minimum. Since we
choose a cost function that is quadratic in b and w, there is a global minimum, corresponding to the
combination of parameters b and w that minimize the mean squared error cost.

TIP: A quadratic function is a polynomial function in one or more variables in which the
highest-degree term is of the second degree. It is a very nice feature that guarantees us that
there is only one minimum, and therefore it is the global one.

Once our parameters w and b reach the values that minimize the cost, we can say that the training is
complete.

Notice what just happened: - We started with a hypothesis: height and weight are connected by a linear
model that depends on parameters. - We defined a cost function: the mean squared error, which we
calculated for each combination of b and w using the training set features and labels. - Finally, we
minimized the cost: the model is trained when we have found the values of b and w that minimize the cost
over the training set.

Another way to say this is that we have turned the problem of training a Machine Learning model into a
minimization problem, where our cost defines a “landscape” made of valleys and peaks, and we are looking
for the global minimum.

This is great news because there are plenty of techniques to look for the minimum value of a function.

TIP: We solved a Linear Regression problem using Gradient Descent. This was not
necessary since Linear Regression has an exact solution. We used this simple case to
introduce the Gradient Descent technique that we will use throughout the book to train
our Neural Networks.

Linear Regression with Keras

Let’s see if we can use Keras to perform linear regression. We will start by importing a few elements to build
a model, as we did in chapter 1.

In [24]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD

The model we need to build is super simple: it has one input and one output, connected by one parameter w
and one parameter b. Let’s do that! First, we initialize the model as a Sequential model. It is the simplest
way to define models in Keras, because we add layers one by one, starting from the input and working our
way towards the output.

In [25]: model = Sequential()

Then we add a single element to our model: a linear operation with one input and one output, connected by
the two parameters w and b. In Keras, we do this with the Dense class. In fact, from the documentation of
Dense we read:

Just your regular densely-connected NN layer.


`Dense` implements the operation:
`output=activation(dot(input, kernel) + bias)`

We can recover our notation with the following substitutions:

output -> y
activation -> None
input -> X
kernel -> w
bias -> b

and noticing that the dot product with a single input is just the multiplication. So Dense(1,
input_shape=(1,)) implements a linear function with 1 input and 1 output. Let’s add it to the model:

In [26]: model.add(Dense(1, input_shape=(1,)))

The .summary() method will tell us the number of parameters in our model:

In [27]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 1) 2
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________

As expected our model has 2 parameters: the bias or b and the kernel or weight or w. Let’s go ahead and
compile the model.

Compilation tells Keras what the cost function is (Mean Squared Error in this case) and what method we
choose to find the minimum value (the Adam optimizer in this case).

TIP: If you have never seen an optimizer, don’t worry about it; we will explain it in detail
later in the book.

In [28]: model.compile(Adam(lr=0.8), 'mean_squared_error')

Now that we have compiled the model let’s go ahead and train it on our data. To train a model, we use the
method model.fit(X, y). This method requires the input data and the input labels, and it trains a model
by minimizing the value of the cost function over the training data. As an additional parameter to the fit
method, we will pass the number of epochs. This is the number of times we want the training to loop over the
whole training dataset.

TIP: an Epoch in Deep Learning indicates one cycle over the training dataset. At the end of
an epoch, the model has seen each pair of (input, output) once.

In this example we train the model for 40 epochs, which means we will cycle through the whole (X,
y_true) dataset 40 times.

In [29]: model.fit(X, y_true, epochs=40, verbose=0);

Let’s see how well the model fits our data. We will store our predictions in a variable called y_pred and plot
them over the data:

In [30]: y_pred = model.predict(X)

In [31]: df.plot(kind='scatter',
x='Height',
y='Weight',
title='Weight and Height in adults')
plt.plot(X, y_pred, color='red');
Weight and Height in adults (data with the fitted regression line)

The line is not perfectly where we would have liked it to be, but it seems to have captured the relationship
between Height and Weight quite well. We can inspect the parameters of the model to see what values our
training decided were optimal for b and w.

In [32]: W, B = model.get_weights()

In [33]: W

Out[33]: array([[7.5945587]], dtype=float32)

In [34]: B

Out[34]: array([-348.71103], dtype=float32)

Notice here that W is returned as a matrix, because in the general case of a Neural Network we could have
many more parameters. In this simple case of a linear regression, our matrix has 1 row and 1 column and
contains a single number: the slope of our line. Let’s extract it.

In [35]: w = W[0, 0]

B is also a vector with just one entry, so we can extract that too:

In [36]: b = B[0]

The slope parameter w has a value near 7.6. This means that, for each additional inch of height, people are on
average about 7.6 pounds heavier. The b parameter is roughly -350. This is called the offset or bias and it
corresponds to the weight of an adult of zero height.

Since negative weight does not make sense, we have to be careful about how we interpret this value. Let’s find
the minimum height that makes sense in this model. This will be the height that produces a weight of zero,
since negative weights are nonsense. Zero weight means y = 0, so now that we have a model we can look for
the value of X that corresponds to y = 0.

Setting y = 0 in the line equation gives:

0 = Xw + b (3.6)

and we can shuffle things around to obtain:

X = −b / w (3.7)

Let’s calculate it:

In [37]: -b/w

Out[37]: 45.915905

So this model only makes sense for people who are at least about 46 inches tall. For anyone shorter than that,
the model predicts a negative weight, which is meaningless.
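
As a quick sanity check on these parameters, we can plug a plausible height into the line equation by hand. The choice of 70 inches is just an illustration, and your exact numbers may differ slightly because of the random initialization of the model:

height = 70  # inches
predicted_weight = w * height + b
print("Predicted weight at {} inches: {:0.1f} lbs".format(height, predicted_weight))
# with w of about 7.59 and b of about -348.7 this comes out to roughly 183 lbs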

Evaluating Model Performance

Great! We have trained our first Supervised Learning model and have found the best combination of
parameters b and w. However, is this a good model? Can we trust it to predict the height of new people that
were not part of the training data? In other words, will our model “generalize well” when offered new,
unknown data? Let’s see how we can answer that question.

R² coefficient of determination

First of all, we need to define a standard score: a number that allows us to compare the goodness of a model
regardless of how many data points we used. We could compare losses, but the value of the loss is ultimately
arbitrary and depends on the scale of the quantity we are predicting, so we don’t want to use that. Instead, let’s
use the coefficient of determination R².

This coefficient can be calculated for any model predicting continuous values (like regression), and it gives
some information about the goodness of fit. In the case of regression, the R² coefficient is a measure of how
well the regression model approximates the real data points. An R² of 1 indicates a regression line that
perfectly fits the data. As the fit gets worse, the value of R² decreases. A value of 0 or lower indicates that the
model is not a good one.
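
To make the definition concrete, here is a minimal sketch of how R² can be computed by hand as 1 minus the ratio between the residual sum of squares and the total sum of squares, reusing the y_true and y_pred arrays from above. This is just an illustration; below we will use Scikit-Learn's implementation instead:

ss_res = np.sum((y_true - y_pred.ravel()) ** 2)   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print("Manual R2: {:0.3f}".format(r2_manual))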

We recall here Scikit-Learn, a Python package introduced in chapter 1 that contains many Machine Learning
algorithms and supporting functions, including the R² score. Let’s calculate it for the current model.

First of all, let’s import it:

In [38]: from sklearn.metrics import r2_score

and then let’s calculate it on the current predictions:

In [39]: r = r2_score(y_true, y_pred)


print("The R2 score is {:0.3f}".format(r))

The R2 score is 0.819

TIP: In the last command we used Python string formatting to make numbers more
readable. In particular, we specified the format {:0.3f}. The brackets and the characters
within them are called format fields, and they are replaced with the objects passed to the
str.format() method. The integer after the : sets the minimum width of the field (0 in
this case), .3 indicates the number of decimal digits and f stands for a fixed-point
(floating point) format.

It’s not too far from 1, which means our regression is not too bad. It doesn’t answer the question about
generalization though: how can we know if our model is going to generalize well?

Train / Test split

Let’s go back to our dataset. What if, instead of using all of it to train our model, we held out a small fraction
of it, say a randomly sampled 20%? We could train the model on the remaining 80% and use the held-out 20%
to test how good the model is. This would be a good way to test whether our model is overfitting.

Overfitting means that our model is just memorizing the answers instead of learning general rules about the
training examples. By withholding a test set, we can test our model on data never seen before. If it performs
just as well, we can assume it will perform well on new data when deployed.

On the other hand, if our model has a good score on the training set but has a bad score on the test set, this
would mean it is not able to generalize to unseen examples, and therefore it’s not ready for deployment.

This is called a train/test split, it’s standard practice for Supervised Learning, and there’s a convenient
Scikit-Learn function for it.

In [40]: from sklearn.model_selection import train_test_split

In [41]: X_train, X_test, y_train, y_test = \


train_test_split(X, y_true, test_size=0.2)

In [42]: len(X_train)

Out[42]: 8000

In [43]: len(X_test)

Out[43]: 2000

Using train_test_split, we split the data into two sets: the training set and the test set. Now we can use
each according to its name: we let the parameters of our model vary to minimize the cost over the training
set, and then check the cost and the R² score over the test set. If things went well, these two should be
comparable, i.e., the model should perform well on new data. Let’s do that!

First, let’s train our model on the training data (notice the test data is not involved here):

In [44]: model.fit(X_train, y_train, epochs=50, verbose=0);

Then let’s calculate predictions for both the train and test sets.

TIP: Note that unlike training, making predictions is a “read-only” operation and does not
change our model. We’re just making predictions.

In [45]: y_train_pred = model.predict(X_train).ravel()


y_test_pred = model.predict(X_test).ravel()

Let’s calculate the mean squared error and the R² score for both. We will also import the
mean_squared_error function from Scikit-Learn, which does the same calculation as the function we
defined above, but is implemented more robustly.

In [46]: from sklearn.metrics import mean_squared_error as mse

In [47]: err = mse(y_train, y_train_pred)


print("Mean Squared Error (Train set):\t",
"{:0.1f}".format(err))

err = mse(y_test, y_test_pred)


print("Mean Squared Error (Test set):\t",
"{:0.1f}".format(err))

Mean Squared Error (Train set): 156.5


Mean Squared Error (Test set): 155.8

In [48]: r2 = r2_score(y_train, y_train_pred)


print("R2 score (Train set):\t{:0.3f}".format(r2))

r2 = r2_score(y_test, y_test_pred)
print("R2 score (Test set):\t{:0.3f}".format(r2))

R2 score (Train set): 0.849


R2 score (Test set): 0.844

It appears that both the loss and the R² score are comparable for the train and test sets, which is great! If we
had obtained values that were significantly different, we would have had a problem. Generally speaking, the
test set could perform a little worse because the test data has not been seen before. If the performance on the
test set is significantly lower than on the training set, we are overfitting.

TIP: The test fraction does not need to be 20%. We could use 5%, 10%, 30%, 50% or
anything we like. Keep in mind that if we do not use enough data for testing, we may not
have a credible test of how well the model generalizes, while if we use too much data for
testing, we make it harder for the model to learn because it sees too few examples.

Note that this is another reason to prefer an average (i.e., divided by the total number of sample points)
rather than a total loss. In this way, the loss will not depend on the size of the set used to calculate it, and we
will be therefore able to compare losses obtained over datasets of different sizes.

Congratulations! We have just encountered the three basic ingredients of a Neural Network: a hypothesis
with parameters, the cost function and the optimization algorithm.

Classification
So far we have just learned about linear regression and how we can use it to predict a continuous target
variable. We have learned about formulating a hypothesis that depends on parameters and about optimizing
a cost to find the optimal values for such parameters.

We can apply the same framework to cases where the target variable is discrete and not continuous. All we
need to do is to adapt the hypothesis and the cost function.

Let’s see how. Imagine we are predicting whether a visitor on our website is going to buy a product, based on
how much time he/she spent on the product page. In this case, the outcome variable is binary: the user either
buys the product or doesn’t. How can we build a model with a binary outcome? Let’s load some data and find
out:

In [49]: df = pd.read_csv('../data/user_visit_duration.csv')

In [50]: df.head()

Out[50]:

Time (min) Buy


0 2.000000 0
1 0.683333 0
2 3.216667 1
3 0.900000 0
4 1.533333 1

The dataset we loaded has two columns, Time (min) and Buy, and we can plot it like this:

In [51]: df.plot(kind='scatter', x='Time (min)', y='Buy',
                 title='Purchase VS time spent on page');

Purchase VS time spent on page

Since the outcome variable can only assume a finite set of distinct values (only 0 and 1 in this case), this is a
classification problem, i.e., we are looking for a model that is capable of predicting to which class a data
point belongs.

TIP: There are many algorithms to solve a classification problem, including K-nearest
neighbors, decision trees, support vector machines, and Naive Bayes classifiers.

Linear regression fail

What happens if we use the same model as before to fit this data? Will the model refuse to work? Will it
converge? Will it give helpful predictions?

Let’s try it and see what happens. First we need to define our features and target variables.

In [52]: X = df[['Time (min)']].values


y = df['Buy'].values

Then we can use the exact same model we used before. We will simply re-initialize it by resetting the
parameter w to 1 and b to 0:

In [53]: model.set_weights([[[ 1.0]], [0.]])

Then we fit the model on X and y for 200 epochs, suppressing the output with verbose=0:

In [54]: model.fit(X, y, epochs=200, verbose=0);

Let’s see what the predictions look like:

In [55]: y_pred = model.predict(X)

df.plot(kind='scatter', x='Time (min)', y='Buy',


title='Linear Regression Fail')
plt.plot(X, y_pred, color='red');

Linear Regression Fail

As you can see, it doesn’t make much sense to use a straight line to predict an outcome that can only be 0 or
1. That said, the modification we need to apply to our model in order to make it work is actually quite simple.

Logistic Regression

We will approach this problem with a method called Logistic Regression. Despite the name being
“regression”, this technique is actually useful to solve classification problems, i.e. problems where the
outcome is discrete.

The linear regression technique we have just learned predicts values on the real axis for each input data
point. Can we modify the form of the hypothesis so that we can predict the probability of an outcome? If
we can do that, for each value in the input, our model would give us a value between 0 and 1. At that point,
we could use p = 0.5 as our dividing criterion and assign every point predicted with probability less than 0.5
to class 0, and every point predicted with probability more than 0.5 to class 1.

In other words, if we modify the regression hypothesis to allow for a nonlinear function between the
domain of our data and the interval [0, 1], we can use the same machinery to solve a classification problem.

There’s one additional point we will need to address, which is how to adapt the cost function. Since our
labels are only the values 0 and 1, the Mean Squared Error is not the correct cost function to use. We will
see below how to define a cost that works in this case.

Let us first start by defining a nonlinear hypothesis. We need a nonlinear function that will map all of the
real axis into the interval [0, 1]. There are many such functions and we will see a few in the next chapters. A
simple, smooth and well-behaved function is the Sigmoid function:

σ(z) = 1 / (1 + e^{−z}) (3.8)

which looks like this:

In [56]: def sigmoid(z):
             return 1.0/(1.0 + np.exp(-z))

         z = np.arange(-10, 10, 0.1)

         plt.plot(z, sigmoid(z))
         plt.title("The Sigmoid Function");
The Sigmoid Function

The sigmoid takes values very close to 0 for large negative values of z. It then gradually increases and, around
z = 0, it smoothly transitions towards values close to 1. Mathematically speaking, the sigmoid function is like
a smooth step function.

Hypothesis

Using the sigmoid we can formulate the hypothesis for our classification problem as:

Buy = 1 / (1 + e^{−(Time·w + b)}) (3.9)

or

ŷ = σ(Xw + b) (3.10)

We will encounter this function many times in this book. It is used at the output of a Neural Network when
performing a binary classification and is generally not used between layers because there are better
activation functions.

Notice that we have introduced two parameters, w and b, in our definition. The weight w controls the
steepness of the transition between 0 and 1, while the bias b controls the position of the transition. Let’s plot a
few examples:

In [57]: x = np.linspace(-10, 10, 100)

         plt.figure(figsize=(15, 5))

         plt.subplot(121)

         ws = [0.1, 0.3, 1, 3]
         for w in ws:
             plt.plot(x, sigmoid(line(x, w=w)))

         plt.legend(ws)
         plt.title('Changing w')

         plt.subplot(122)

         bs = [-5, 0, 5]
         for b in bs:
             plt.plot(x, sigmoid(line(x, w=1, b=b)))

         plt.legend(bs)
         plt.title('Changing b');

Changing w (left); Changing b (right)

Cost function

Now that we have defined the hypothesis, we need to adjust the definition of the cost function so that it
makes sense for a binary classification problem. There are various options for this, similarly to the regression
case, including square loss, hinge loss and logistic loss.

As we shall see in chapter 5, Deep Learning models learn by performing gradient descent minimization of
the cost function, which requires the cost function to be “minimizable” in the first place. In mathematics, we
say that the cost function needs to be convex and differentiable.

One of the most commonly used cost functions in Deep Learning is the cross-entropy loss.

Let’s explore how it is calculated. We can define the cost for a single point as:

c_i = −y_i log(ŷ_i) − (1 − y_i) log(1 − ŷ_i) (3.11)

Notice that due to the binary nature of the outcome variable y, only one of the two terms is present at any
time. If the label y_i is 0, then c_i = −log(1 − ŷ_i); if the label y_i is 1, then c_i = −log(ŷ_i).

Another way of thinking about this, in more programmatic terms, is:

c_i = −log(ŷ_i)       if y_i = 1
c_i = −log(1 − ŷ_i)   if y_i = 0       (3.12)

Let’s look at the first term, which only contributes to the cost when y_i = 1. Remember that ŷ contains
the sigmoid function, so its negative logarithm is:

−log(σ(z)) = log(1 + e^{−z}) − log(1) = log(1 + e^{−z}) (3.13)

What this means is that if z is large, this quantity goes to zero, while if z is large and negative, it grows towards infinity:

In [58]: plt.plot(z, -np.log(sigmoid(z)));



In other words, when the label is 1 (y = 1), our predictions should also approach 1. Since our predictions are
obtained with the sigmoid, we want ŷ = σ(z) to approach 1 as well. This happens for large values of z.
Therefore, the cost should be minimal when z is large. On the other hand, if z is small, the sigmoid goes to
zero, and our prediction is wrong. That’s why the cost becomes increasingly large for negative values of z.

The same logic applies to the second term for when y = 0: it should push z to have negative values so that
the sigmoid goes to zero and our prediction is correct in this case.

In [59]: plt.plot(z, -np.log(1 - sigmoid(z)));



Now that we have defined the cost for a single point, we can define the average cost as:

c = (1/N) ∑_i c_i (3.14)

Where the index i runs over a set of learning examples, also called a batch. This is the average cross-entropy
or log loss.
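
To see how equations (3.11) and (3.14) translate into code, here is a minimal NumPy sketch of the average binary cross-entropy. The small clipping constant is an implementation detail added to avoid taking the logarithm of zero; it is not part of the formulas above:

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # clip predicted probabilities away from 0 and 1 to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

# confident, correct predictions give a small cost...
print(binary_crossentropy(np.array([1, 0]), np.array([0.9, 0.1])))
# ...while confident, wrong predictions give a large cost
print(binary_crossentropy(np.array([1, 0]), np.array([0.1, 0.9])))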

Now that we have defined hypothesis and cost for the logistic regression case, we can go ahead and look for
the best parameters that minimize the cost, very much in the same way as we did for the linear regression
case.

Generalization to many classes

In Chapter 4 we discuss the multi-class classification in further detail. Here we observe that there are 2 ways
to generalize the binary cross-entropy to multiple classes.

1. If the classes are not mutually exclusive, then we treat them as independent predictions and each of
them is scored with a binary cross-entropy, as in the binary case.
2. If the classes are mutually exclusive, we need to consider the predictions jointly. Since our model
predicts the probability of belonging to one class among many, the predicted probabilities need to add up to 1.

In this case we will use a Softmax function at the output, and a Categorical Cross-entropy, which, for a
single data point, has the formula:

c_i = −∑_{j∈labels} y_{ij} log(ŷ_{ij}) (3.15)

where the index i indicates the particular data point we are using, while the index j runs over the classes.
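
As a hedged sketch of how these two pieces fit together, here is a NumPy version of the softmax and of the categorical cross-entropy for a single data point, using made-up scores for three classes. The subtraction of the maximum inside the softmax is a standard numerical-stability trick, not part of the formula above:

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def categorical_crossentropy(y_true_onehot, y_prob):
    # cost for a single data point, as in equation (3.15)
    return -np.sum(y_true_onehot * np.log(y_prob))

scores = np.array([2.0, 1.0, 0.1])        # hypothetical raw outputs for 3 classes
probs = softmax(scores)                    # probabilities that sum to 1
print(probs, probs.sum())
print(categorical_crossentropy(np.array([1, 0, 0]), probs))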

Logistic regression in Keras

First, let’s define a model in Keras. As we have seen above, Dense(1, input_shape=(1,)) implements a
linear function with one input and one output. The only change we need to perform is to add a sigmoid
function that takes the output variable and maps it to the interval [0, 1]. In a way, it’s as if we were
“wrapping” the Dense layer with the sigmoid function.

Let’s first create a model like we did for the linear regression:

In [60]: model = Sequential()


model.add(Dense(1, input_dim=1))

We can add the activation as a layer:

In [61]: from tensorflow.keras.layers import Activation

In [62]: model.add(Activation('sigmoid'))

In [63]: model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 1) 2
_________________________________________________________________
activation (Activation) (None, 1) 0
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________

As you can see the model has two parameters, a weight and a bias, and it has a sigmoid activation function
as a second layer. We can convince ourselves that it’s a sigmoid by using the model to predict values for a
few z values:

In [64]: plt.plot(z, model.predict(z));


Also notice that the weights in the model are initialized randomly, so your sigmoid may look different from
the one in the figure above.

TIP: Keras allows a more compact model specification by including the activation function
in the Dense layer definition. We can define the same model above by:

model.add(Dense(1, input_dim=1, activation='sigmoid'))

The next step is to compile the model like we did before to specify the cost function and the optimizer.
Keras offers several cost functions for classification. The cross-entropy for the binary classification case is
called binary_crossentropy so we will use this one now:

In [65]: model.compile(optimizer=SGD(lr=0.5),
loss='binary_crossentropy',
metrics=['accuracy'])

Accuracy

Notice that this time we also included an additional metric at compile time: accuracy. Accuracy is one of the
possible scores we can use to judge the quality of a classification model. It tells us what fraction of samples
are predicted in the correct class, so for example an accuracy of 80% or 0.8 means that 80 samples out of
100 are predicted correctly.
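
In code, accuracy is just the mean of a boolean comparison between true and predicted classes. Here is a tiny sketch with made-up labels, purely to illustrate the definition:

y_example = np.array([1, 0, 1, 1, 0])        # hypothetical true classes
y_pred_example = np.array([1, 0, 0, 1, 0])   # hypothetical predicted classes
print("Accuracy:", np.mean(y_example == y_pred_example))   # 4 out of 5 correct -> 0.8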

Let’s train this new model on our data.

In [66]: model.fit(X, y, epochs=25);

Epoch 1/25
100/100 [==============================] - 0s 1ms/sample - loss: 1.5229 -
accuracy: 0.4100
Epoch 2/25
100/100 [==============================] - 0s 87us/sample - loss: 0.6940 -
accuracy: 0.5000
Epoch 3/25
100/100 [==============================] - 0s 85us/sample - loss: 0.6666 -
accuracy: 0.5200
Epoch 4/25
100/100 [==============================] - 0s 86us/sample - loss: 0.6110 -
accuracy: 0.5800
Epoch 5/25
100/100 [==============================] - 0s 87us/sample - loss: 0.5787 -
accuracy: 0.6600
Epoch 6/25
100/100 [==============================] - 0s 84us/sample - loss: 0.5673 -
accuracy: 0.6500
Epoch 7/25
100/100 [==============================] - 0s 84us/sample - loss: 0.5359 -
accuracy: 0.7200
Epoch 8/25
100/100 [==============================] - 0s 87us/sample - loss: 0.5236 -
accuracy: 0.8500
Epoch 9/25
100/100 [==============================] - 0s 84us/sample - loss: 0.5023 -
accuracy: 0.7400
Epoch 10/25
100/100 [==============================] - 0s 87us/sample - loss: 0.4870 -
accuracy: 0.8400
Epoch 11/25
100/100 [==============================] - 0s 90us/sample - loss: 0.4745 -
accuracy: 0.7900
Epoch 12/25
100/100 [==============================] - 0s 86us/sample - loss: 0.4728 -
accuracy: 0.8100
Epoch 13/25
100/100 [==============================] - 0s 84us/sample - loss: 0.4679 -
accuracy: 0.8300
Epoch 14/25
100/100 [==============================] - 0s 87us/sample - loss: 0.4517 -
accuracy: 0.8300
Epoch 15/25
100/100 [==============================] - 0s 88us/sample - loss: 0.4419 -
accuracy: 0.7900
Epoch 16/25
100/100 [==============================] - 0s 87us/sample - loss: 0.4485 -
accuracy: 0.8300
Epoch 17/25
100/100 [==============================] - 0s 87us/sample - loss: 0.4501 -
accuracy: 0.7800
Epoch 18/25
100/100 [==============================] - 0s 86us/sample - loss: 0.4401 -
accuracy: 0.8000
Epoch 19/25
100/100 [==============================] - 0s 86us/sample - loss: 0.4312 -
accuracy: 0.8000
Epoch 20/25
100/100 [==============================] - 0s 91us/sample - loss: 0.4198 -
accuracy: 0.8300
Epoch 21/25
100/100 [==============================] - 0s 90us/sample - loss: 0.4222 -
accuracy: 0.7800
Epoch 22/25
100/100 [==============================] - 0s 89us/sample - loss: 0.4143 -
accuracy: 0.8200
Epoch 23/25
100/100 [==============================] - 0s 88us/sample - loss: 0.4212 -
accuracy: 0.7900
Epoch 24/25
100/100 [==============================] - 0s 87us/sample - loss: 0.4068 -
accuracy: 0.8300
Epoch 25/25
100/100 [==============================] - 0s 90us/sample - loss: 0.4082 -
accuracy: 0.8200

The model seems to have converged because the loss does not seem to improve in the last epochs. Let’s see
what the predictions look like:

In [67]: ax = df.plot(kind='scatter', x='Time (min)', y='Buy',
                      title='Purchase VS time spent on site')

         temp = np.linspace(0, 4)
         ax.plot(temp, model.predict(temp), color='orange')
         plt.legend(['model', 'data']);
Purchase VS time spent on site (data and fitted model)

Great! The two parameters in our logistic regression have been tuned to best reproduce our data.

Notice that the logistic regression model predicts a probability. If we want to convert this to a binary
prediction we need to set a threshold. For example, we could assign all points predicted with probability
p > 0.5 to class 1 and the others to class 0.

In [68]: y_pred = model.predict(X)

In [69]: y_class_pred = y_pred > 0.5

With this definition we can calculate the accuracy of our model as the number of correct predictions over
the total number of points. Scikit-Learn offers a ready-to-use function for this called accuracy_score:

In [70]: from sklearn.metrics import accuracy_score

In [71]: acc = accuracy_score(y, y_class_pred)


print("Accuracy score: {:0.3f}".format(acc))

Accuracy score: 0.840

Train/Test split

We can repeat the above steps using train/test split. Remember, we’re aiming for similar accuracies in the
train and test sets:

In [72]: X_train, X_test, y_train, y_test = \


train_test_split(X, y, test_size=0.2)

We need to reset the model, or it will retain the previous training. How do we do that? Our model only has 2
parameters, w and b, so we can just reset these two parameters to zero.

In [73]: params = model.get_weights()

In [74]: params

Out[74]: [array([[1.3602622]], dtype=float32), array([-2.6896808], dtype=float32)]

In [75]: params = [np.zeros(w.shape) for w in params]

In [76]: params

Out[76]: [array([[0.]]), array([0.])]

In [77]: model.set_weights(params)

Let’s check that the model is now predicting garbage:

In [78]: acc = accuracy_score(y, model.predict(X) > 0.5)


print("The accuracy score is {:0.3f}".format(acc))

The accuracy score is 0.500

And in fact the model is now a straight line at 0.5:



In [79]: plt.plot(z, model.predict(z));


Let’s re-train it on the training data only

In [80]: model.fit(X_train, y_train, epochs=25, verbose=0);

And let’s check the accuracy score on training and test sets:

In [81]: y_pred_train_class = model.predict(X_train) > 0.5


acc = accuracy_score(y_train, y_pred_train_class)
print("Train accuracy score {:0.3f}".format(acc))

y_pred_test_class = model.predict(X_test) > 0.5


acc = accuracy_score(y_test, y_pred_test_class)
print("Test accuracy score {:0.3f}".format(acc))

Train accuracy score 0.825


Test accuracy score 0.850

So, in this case the model is performing as well on the test set as on the training set. Good!

Overfitting
We are advancing quickly! This table recaps what we have learned so far:

| Target Variable | Method | Hypothesis | Cost Function |
|:-|:-:|:-:|:-:|
| Continuous | Linear Regression | ŷ = X.w + b | Mean Squared Error |
| Discrete | Logistic Regression | ŷ = sigmoid(X.w + b) | Cross Entropy Error |

Notice we have extended the models to datasets with multiple features using the vector notation:

X.w = x_{j0} w_0 + x_{j1} w_1 + x_{j2} w_2 + ... = ∑_i x_{ji} w_i   for each data point j (3.16)

In this case, w is a weight vector of size M, where M is the number of features, while X is a matrix of size
NxM, where N is the number of records in our dataset.

We have also learned to split our data into two parts: a training set and a test set.

Now let’s talk about overfitting. It is a common pitfall in Machine Learning, and you need to know how to
detect it and how to address it.

Overfitting happens when our model learns the probability distribution of the training set too well and is
not able to generalize to the test set with the same performance. Think of this as learning things by heart
without really understanding them: in a new situation you will be lost and will probably underperform.

Overfitting in a classification problem

A straightforward way to check for overfitting is to compare the cost and the performance metrics of the
training and test set. For example, let’s say we are solving a classification problem and we measure the
number of correct predictions, aka the accuracy, to be 99% for the training set and only 85% for the test set.
This means our model is not performing as well on the test set and we are therefore overfitting.

It is going to be very hard to overfit with a simple model with only one parameter, but as the number of
parameters increases, the likelihood of overfitting increases as well. We’ll need to watch out for our model
overfitting the dataset.

How to avoid overfitting

There are several actions we can take to minimize the risk of overfitting.

The first simple check is to make sure that we performed our train/test split correctly and both the train and
test sets are representative of the whole population of features and labels. Common errors include:

• Not preserving the ratio of labels (see the sketch after this list).
• Not randomly sampling the dataset.
• Using a test set that is too small.
• Using a training set that is too small.
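
For the first pitfall in the list, scikit-learn's train_test_split accepts a stratify argument that preserves the label proportions in both splits. Here is a minimal sketch reusing the X and y defined above; the random_state value is an arbitrary choice added only for reproducibility:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# both splits now contain roughly the same fraction of positive labels
print(y_train.mean(), y_test.mean())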

If the train/test split seems correct, it could be that our model has too much “freedom” and therefore learns
the training set by heart. This is usually the case when the number of parameters in the model is comparable
to or greater than the number of data points in the training set. To mitigate this, we can either reduce the
complexity of the model or use regularization, as we shall see later on in the book.

Cross-Validation
Is a train/test split the most efficient way to use our dataset? Even if we took great care in randomly splitting
our data, that’s only one of many possible ways to perform a split. What if we performed several different
train/test splits, checked the test score in each of them and finally averaged the scores? Not only would we
have a more precise estimate of the real accuracy, but we could also calculate the standard deviation of the
scores and therefore know the error on the accuracy itself.

This procedure is called cross-validation. There are many ways to perform cross-validation. The most
common is called K-fold cross-validation.

In K-fold cross validation the whole dataset is split into K equally sized random subsets. Then, each of the K
subsets gets to play the role of the test set, while the others are aggregated back to form a training set. In this
way, we obtain K estimations of the model score, each calculated from a test set that does not overlap with
any of the other test sets.

Not only do we get a better estimate of the validation score, including its standard deviation, but we also
used each data point more efficiently, since each data point gets to play the role of both train and test.

These advantages do not come for free. We had to train the model K times, which takes longer and
consumes more resources than training it just one time. On the other hand, we can parallelize the training
over each fold, either by distributing them across processes or different machines.

Scikit-Learn offers cross-validation out of the box, but we’ll have to wrap our model in a way that can be
understood by Scikit-Learn. This is easy to do using a wrapper class called KerasClassifier.

Cross Validation

In [82]: from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

In [83]: def build_logistic_regr():
             model = Sequential()
             model.add(Dense(1, input_dim=1,
                             activation='sigmoid'))
             model.compile(optimizer=SGD(lr=0.5),
                           loss='binary_crossentropy',
                           metrics=['accuracy'])
             return model

In [84]: model = KerasClassifier(build_fn=build_logistic_regr,
                                 epochs=25, verbose=0)

We’ve just redefined the same model, but in a format that is compatible with Scikit-Learn. Let’s calculate the
cross-validation score with a 3-fold cross-validation (i.e. K = 3):

In [85]: from sklearn.model_selection import cross_val_score


from sklearn.model_selection import KFold

In [86]: cv = KFold(3, shuffle=True)
         scores = cross_val_score(model, X, y, cv=cv)

In [87]: scores

Out[87]: array([0.7647059 , 0.90909094, 0.78787881])



The cross-validation produced 3 scores, one for each fold. We can average them and take their standard
deviation to get a better estimate of our accuracy:

In [88]: m = scores.mean()
s = scores.std()
print("Cross Validation accuracy:",
"{:0.4f} ± {:0.4f}".format(m, s))

Cross Validation accuracy: 0.8206 ± 0.0633

There are also other ways to perform a cross validation. Here we mention a few.

Stratified K-fold is similar to K-fold, but it makes sure that the proportions of labels are preserved in the
folds. For example, if we are performing a binary classification and 40% of the data is labeled True and 60%
is labeled False, each of the folds will also contain 40% True labels and 60% False labels.

We can also perform cross-validation by randomly selecting a test set of fixed size multiple times. In this
case, the test sets are not guaranteed to be disjoint, so they may overlap in some points.

Finally, it is worth mentioning Leave-One-Group-Out cross-validation, or LOGO. LOGO is useful when
our data is stratified in subgroups. For example, imagine we are building a model to recognize gestures from
phone accelerometer data. Our training dataset probably contains multiple recordings of different gestures
from different users. The labels we are trying to predict are the gestures.

With a standard cross-validation, both our training and test sets would contain recordings from all users. If
we train the model in this way, we could very well end up with a good test score, but we would have no idea
how the model would perform if a new user executed the same gestures. In other words, the model could be
overfitting to each user, and we would have no way of knowing it.

In this case, it is better to split the data relative to the users, using all of the data from some of them as
training, while testing on all of the data from the remaining users. If the test score is good in this case, we
can be fairly sure that the model will perform well with a new user.
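
Here is a hedged sketch of how such a group-aware evaluation can be scored with scikit-learn's LeaveOneGroupOut. The tiny synthetic arrays and user ids below are made up purely for illustration, and a plain scikit-learn classifier stands in for the KerasClassifier wrapper, which would work the same way through the groups argument:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.randn(12, 4)                 # 12 fake recordings, 4 features each
y_demo = np.tile([0, 1], 6)               # alternating fake labels
user_ids = np.repeat([1, 2, 3], 4)        # 3 hypothetical users, 4 recordings each

logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         groups=user_ids, cv=logo)
print(scores)   # one score per held-out user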

Confusion Matrix
Is accuracy the best way to check the performance of our model? It surely tells us how well we are doing
overall, but it doesn’t give us any insight into the kinds of errors the model is making. Let’s see how we can do
better.

In the problem we just introduced, we are estimating the purchase probability from the time spent on a
page. This is a binary classification, and we can be either right or wrong in the four ways represented here:

Confusion Matrix

This table is called the confusion matrix, and it gives a better view of correct and wrong predictions.

Let’s look at the four cases one at a time. We could be right in predicting the purchase or right in predicting
the absence of a purchase. These are the True Positives and True Negatives. Summed together they amount
to the number of correct predictions we formulated. If we divide this number by the total number of data
points, we obtain the Accuracy of the model. In other words, accuracy is the overall ratio of correct
predictions:

Acc = (TP + TN) / All (3.17)

On the other hand, our model could be wrong in two ways.

1. It could predict that a person buys when they are not buying: this is a False Positive.
2. It could predict that a person does not buy when they are buying: this is a False Negative.

Let’s use Scikit-Learn to calculate the confusion matrix of our data:

In [89]: from sklearn.metrics import confusion_matrix

We define a short helper function to add column and row labels for nice display:

In [90]: def pretty_cm(y_true, y_pred, labels=["False", "True"]):
             cm = confusion_matrix(y_true, y_pred)
             pred_labels = ['Predicted ' + l for l in labels]

             df = pd.DataFrame(cm,
                               index=labels,
                               columns=pred_labels)
             return df

In [91]: pretty_cm(y, y_class_pred, ['Not Buy', 'Buy'])

Out[91]:

Predicted Not Buy Predicted Buy


Not Buy 43 7
Buy 9 41

Let’s stop here for a second. Say that, whenever the model predicts True, the user is offered an additional
product at a discount. On which side would you rather the model be wrong? Would you like the model to
offer a discount to users with no intention of buying (a False Positive), or would you rather it not offer the
discounted item to users who do intend to buy (a False Negative)?

What if, instead of predicting the purchase behavior from time spent on a page we were determining the
likelihood to have cancer, based on the value of a blood screening exam? Would you want a False Positive or
a False Negative in that case?

Most people would prefer a False Positive, and do an additional screening to make sure of the result, rather
than go home feeling safe and healthy while they are not. Would that be your choice too?

What if you were an (evil) health insurance company instead? Would you still choose to optimize the model
in the same way? A False Positive would be an additional cost to you because the patient would go on to see
a specialist. Would you prefer to minimize False Positives in this case?

As you can see, there is no one correct answer. Different stakeholders will make different choices. This is to
say that the data scientist is not a neutral observer of a Machine Learning process. The decisions he/she
makes fundamentally determine the outcome of the training!

False Positives and False Negatives are usually expressed in terms of two sister quantities: Precision and
Recall. Here they are:

Precision

We define precision as the ratio of True Positives to the total number of positive tests:

Precision = TP / (TP + FP) (3.18)

Precision P will tend towards 1 when the number of False Positives goes to zero, i.e. when we do not raise any
false alerts and are thus “precise”: every positive prediction we make is correct.

Recall

On the other hand, recall is defined as the ratio of True Positives to the total number of actually positive
cases:

Recall = TP / (TP + FN) (3.19)

Recall R will tend towards 1 when the number of False Negatives goes to zero, i.e. when we do not miss any
of the positive cases, or in other words when we “recall” all of them.

F1 Score

Finally, we can combine the two in what’s called F1-score:

F1 = 2 P R / (P + R) (3.20)

F1 will be close to 1 if both precision and recall are close to 1, while it will go to zero if either of them is low.
In this sense, the F1 score is an excellent way to make sure that both precision and recall are high.

The F1 score is the harmonic mean of precision and recall. The harmonic mean is an average suited to ratios.
There are also other F-scores, called F-beta scores, that give more weight to either precision or recall. You
can read about them on Wikipedia and in the Scikit-Learn documentation.
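
To connect these formulas back to the confusion matrix computed earlier, here is a minimal sketch that pulls TP, FP and FN out of Scikit-Learn's confusion_matrix and applies equations (3.18) to (3.20) directly, assuming the y and y_class_pred arrays from above:

cm = confusion_matrix(y, y_class_pred)
tn, fp, fn, tp = cm.ravel()   # sklearn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
print(precision_manual, recall_manual, f1_manual)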

Let’s evaluate these scores for our data:

In [92]: from sklearn.metrics import precision_score, recall_score, f1_score

In [93]: precision = precision_score(y, y_class_pred)


print("Precision:\t{:0.3f}".format(precision))

recall = recall_score(y, y_class_pred)


print("Recall: \t{:0.3f}".format(recall))

f1 = f1_score(y, y_class_pred)
print("F1 Score:\t{:0.3f}".format(f1))

Precision: 0.854
Recall: 0.820
F1 Score: 0.837

Scikit-Learn offers a handy classification_report function that combines all these:

In [94]: from sklearn.metrics import classification_report

In [95]: print(classification_report(y, y_class_pred))



precision recall f1-score support

0 0.83 0.86 0.84 50


1 0.85 0.82 0.84 50

micro avg 0.84 0.84 0.84 100


macro avg 0.84 0.84 0.84 100
weighted avg 0.84 0.84 0.84 100

support here means how many points were present in each class.

While these definitions hold true only for the binary classification case, we can still extend the confusion
matrix to the case where there are more than 2 classes.

Multi-class Confusion Matrix

In this case, the element (i, j) of the matrix tells us how many data points in class i have been predicted to be
in class j. This is a powerful way to see whether any of the classes are being confused with one another. If so,
we can isolate the misclassified data and try to understand why.

Feature Preprocessing

Categorical Features

Sometimes input data will be categorical, i.e., the feature values will be discrete classes instead of continuous
numbers. For example, in the weight/height dataset above, there’s a 3rd column called Gender which can
either be Male or Female. How can we convert this categorical data to numbers that can be consumed by
our model?

There are several ways to do it, the most common being One-Hot or Dummy encoding. In Dummy
encoding, we substitute the categorical column with a set of boolean columns, one for each category present
in the column. In the Male/Female example above, we would replace the Gender column with two columns
called Gender_Male and Gender_Female that would have binary values. Pandas offers a quick way to do
that:

In [96]: df = pd.read_csv('../data/weight-height.csv')
df.head()

Out[96]:

Gender Height Weight


0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801

Here’s how to create the dummy columns:

In [97]: pd.get_dummies(df['Gender'], prefix='Gender').head()

Out[97]:

Gender_Female Gender_Male
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

In this particular case, we only need one of the two columns, since we only have two classes, but if we had 3
or more categories, then we would need to pass all the dummy columns to our model.

There are other ways to encode categorical features, including index encoding, hashing trick and
embeddings. We will learn more about these later in the book.

Feature Transformations

As we will see in the exercises, Neural Network models are quite sensitive to the absolute size of the input
features. Passing features with very large or very small values will not help them converge to a solution. An
easy way to overcome this problem is to rescale the features so that their values are of order 1.

Here are a few methods we can use to transform our features.

1) Rescale with fixed factor

We could change the unit of measurement. For example, in the Humans example we could rescale the
height by 12 (go from inches to feet) and the weight by 100 (go from pounds to 100 pounds):

In [98]: df.head()

Out[98]:

Gender Height Weight


0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801

In [99]: df['Height (feet)'] = df['Height']/12.0


df['Weight (100 lbs)'] = df['Weight']/100.0

In [100]: df.describe().round(2)

Out[100]:

       Height    Weight    Height (feet)  Weight (100 lbs)

count  10000.00  10000.00  10000.00       10000.00
mean   66.37     161.44    5.53           1.61
std    3.85      32.11     0.32           0.32
min    54.26     64.70     4.52           0.65
25%    63.51     135.82    5.29           1.36
50%    66.32     161.21    5.53           1.61
75%    69.17     187.17    5.76           1.87
max    79.00     269.99    6.58           2.70

As you can see our new features have values that are close to 1 in order of magnitude, which is good enough.

2) MinMax normalization

A second way to normalize features is to take the minimum value and the maximum value and rescale all
values to the interval [0, 1]. This can be done using the MinMaxScaler provided by sklearn like so:

In [101]: from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
df['Weight_mms'] = mms.fit_transform(df[['Weight']])
df['Height_mms'] = mms.fit_transform(df[['Height']])
df.describe().round(2)

Out[101]:

       Height    Weight    Height (feet)  Weight (100 lbs)  Weight_mms  Height_mms

count  10000.00  10000.00  10000.00       10000.00          10000.00    10000.00
mean   66.37     161.44    5.53           1.61              0.47        0.49
std    3.85      32.11     0.32           0.32              0.16        0.16
min    54.26     64.70     4.52           0.65              0.00        0.00
25%    63.51     135.82    5.29           1.36              0.35        0.37
50%    66.32     161.21    5.53           1.61              0.47        0.49
75%    69.17     187.17    5.76           1.87              0.60        0.60
max    79.00     269.99    6.58           2.70              1.00        1.00

Our new features have a maximum value of 1 and a minimum value of 0, exactly as we wanted them.

3) Standard normalization

A third way to normalize large or small features is to subtract the mean and divide by the standard deviation.

In [102]: from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
df['Weight_ss'] = ss.fit_transform(df[['Weight']])
df['Height_ss'] = ss.fit_transform(df[['Height']])
df.describe().round(2)

Out[102]:

       Height    Weight    Height (feet)  Weight (100 lbs)  Weight_mms  Height_mms  Weight_ss  Height_ss
count  10000.00  10000.00  10000.00       10000.00          10000.00    10000.00    10000.00   10000.00
mean   66.37     161.44    5.53           1.61              0.47        0.49        0.00       0.00
std    3.85      32.11     0.32           0.32              0.16        0.16        1.00       1.00
min    54.26     64.70     4.52           0.65              0.00        0.00        -3.01      -3.15
25%    63.51     135.82    5.29           1.36              0.35        0.37        -0.80      -0.74
50%    66.32     161.21    5.53           1.61              0.47        0.49        -0.01      -0.01
75%    69.17     187.17    5.76           1.87              0.60        0.60        0.80       0.73
max    79.00     269.99    6.58           2.70              1.00        1.00        3.38       3.28

After standard normalization, our new features have approximately zero mean and a standard deviation of 1.
This is good for a linear model, because each feature is multiplied by a weight that the model has to find.
Since the weights are initialized with values of order 1, if a feature had a very large or very small scale, the
model would have to adjust the value of the weight enormously just to account for the different scale. It is
therefore good practice to normalize our features before feeding them to a Neural Network.
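
One practical detail worth adding, as a recommendation that goes slightly beyond the example above: when working with a train/test split, the scaler should be fit on the training set only and then applied to the test set, so that no information about the test set leaks into training. A minimal sketch with StandardScaler:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_tr, X_te = train_test_split(df[['Height', 'Weight']].values, test_size=0.2)

ss = StandardScaler()
X_tr_scaled = ss.fit_transform(X_tr)   # learn mean and std on the training data only
X_te_scaled = ss.transform(X_te)       # apply the same statistics to the test data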

Note that we have just rescaled the units of our features, but their distribution is the same:

In [103]: plt.figure(figsize=(15, 5))

          for i, feature in enumerate(['Height',
                                       'Height (feet)',
                                       'Height_mms',
                                       'Height_ss']):
              plt.subplot(1, 4, i+1)
              df[feature].plot(kind='hist', title=feature)
              plt.xlabel(feature)

          plt.tight_layout();

Histograms of Height, Height (feet), Height_mms and Height_ss

Now the time has come to apply what you’ve learned with some exercises.

Exercises

Exercise 1

You just started working at a real estate investment firm, and they would like you to build a model for
pricing houses. You receive a dataset that contains data for house prices and a few features like “number of
bedrooms”, “size in square feet” and “age of the house”. Let’s see if you can build a model that can predict the
price. In this exercise, we extend what we have learned about linear regression to a dataset with more than
one feature. Here are the steps to complete it:

1. load the dataset ../data/housing-data.csv

• plot the histograms for each feature
• create two variables called X and y: X shall be a matrix with three columns (sqft, bdrms, age) and y
shall be a vector with one column (price)
• create a linear regression model in Keras with the appropriate number of inputs and output
• split the data into train and test with a 20% test size
• train the model on the training set and check its accuracy on training and test set

• how’s your model doing? Is the loss growing smaller?
• try to improve your model with these experiments:
– normalize the input features with one of the rescaling techniques mentioned above
– use a different value for the learning rate of your model
– use a different optimizer
• once you’re satisfied with the training, check the R² on the test set

In [ ]:

Exercise 2

Your boss was delighted with your work on the housing price prediction model and decided to entrust you
with a more challenging task. They’ve seen many people leave the company recently and they would like to
understand why that’s happening. They have collected historical data on employees, and they would like you
to build a model that can predict which employee will leave next. They would like a model that is better than
random guessing. They also prefer false negatives to false positives in this first phase. Fields in the dataset
include:

• Employee satisfaction level


• Last evaluation
• Number of projects
• Average monthly hours
• Time spent at the company
• Whether they have had a work accident
• Whether they have had a promotion in the last five years
• Department
• Salary
• Whether the employee has left

Your goal is to predict the binary outcome variable left using the rest of the data. Since the outcome is
binary, this is a classification problem. Here are some things you may want to try out:

1. load the dataset at ../data/HR_comma_sep.csv, inspect it with .head(), .info() and .describe().

• Establish a benchmark: what would be your accuracy score if you predicted everyone stays?
• Check if any feature needs rescaling. You may plot a histogram of the feature to decide which
rescaling method is more appropriate
• convert the categorical features into binary dummy columns. You will then have to combine them
with the numerical features using pd.concat
• do the usual train/test split with a 20% test size
• play around with learning rate and optimizer
• check the confusion matrix, precision, and recall

• check if you still get the same results if you use 5-Fold cross-validation on all the data
• Is the model good enough for your boss?

As you will see in this exercise, this logistic regression model is not good enough to help your boss. In the
next chapter, we will learn how to go beyond linear models.

This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under the CC
BY-SA 4.0 License.

In [ ]:
4 Deep Learning
This chapter is about Deep Learning and it will walk you through a few simple examples that generalize how
we approach regression and classification problems.

Beyond linear models


In the previous chapter we encountered two techniques to solve Supervised Learning problems: linear
regression and logistic regression. These two techniques share many characteristics. For instance, both
formulate a hypothesis about the link between features and target; both require a cost function; both depend
on parameters; and both learn by finding the combination of parameters that minimizes a given cost over the
training set.

While these techniques work to solve several problems, they also have some limitations.

For example, linear regression doesn’t work well when the relationship between features and output is
nonlinear, i.e., when we cannot use a straight line or a flat plane to represent it. For example, think of the
number of active users of a web product or a social media platform. If the product is successful, the number
of new users added each month would grow, resulting in a nonlinear relationship between the number of
users and time.

Similarly, Logistic Regression is incapable of separating classes that cannot be pulled apart by a flat
boundary (a line in 2D, a plane in 3D, a hyperplane if we have more than three features). This happens all the
time, and you may hear the term “not linearly separable” to describe two classes that cannot be separated by
a straight boundary. We saw an example of this in the first chapter when we tried to separate the blue dots
from the red crosses.

Active Users Versus Time

Classification with curved boundary

In general, the boundary between the two classes is rarely linear, especially when dealing with more
complex classification problems with thousands of features and many output classes. To extend regression
and classification beyond the linear cases, we need to use more complex models. Historically, computer
scientists have invented many techniques that go beyond linear models, such as Decision Trees, Support
Vector Machines, and Naive Bayes.

Deep Neural Networks provide a unified framework to tackle all these cases: we can do linear and nonlinear
regression and classification, use them to generate new data, and much more!

In this chapter, we will introduce a notation for discussing Neural Networks and rewrite linear and logistic
regression using this notation. Finally, we will work through stacking multiple nodes to create a deep network.

Neural Network Diagrams


Let’s look at a few high-level diagrams and a more mathematical definition of what we’re doing. If the
math looks like Latin to you (it is), don’t worry. These are just more formal definitions of what we’re doing.
After this part of the chapter, we’ll dive right back into code.

For the visual learners out there, this section will help to “chalkboard” the algorithms we’re building.

Linear regression

Let’s look at linear regression. We have introduced linear regression in Chapter 3. As you may remember, it
refers to problems where we try to predict a number from a set of input features. Examples are: predicting
the price of a house, predicting the number of clicks a page will get or predicting the revenue a business will
generate in the future.

As usual, we will refer to the inputs in the problem using the variable x and to the outputs using the variable
y. So, for example, if we are trying to predict the price of a house from its size, x will be the size of the house
and y will be the price. The equation of linear regression is:

y = x.w + b (4.1)

and we can represent its operation as an Artificial Neural Network like this:

Linear regression as a neural net



This network has only one node, the output node, represented by the circle in the diagram. This node is
connected to the input feature x by a weight w. A second edge enters the node carrying the value of the
parameter b, which we will call bias.

Fantastic! We have a simple way to represent linear operations in a graph. Let’s extend the network to
multiple input features. We encountered an example of a multivariate regression problem in Exercise 1 of
Chapter 3, where we built a model to predict the price of a house as a function of 3 inputs: the size in square
feet (x1), the number of bedrooms (x2) and the age (x3) of the house. In that case we had 3 input features
and the model had 3 weights (w1, w2 and w3) and 1 bias (b). We can extend our graph notation very simply
to accommodate this case:

Multivariate linear regression

The output node here connects to the N inputs through N weights, and it also connects to a bias parameter.
The equation is the same as before:

y = X.w + b (4.2)

but now X and w are arrays that contain more than one entry, multiplied using a dot product. So, what the
above equation really means is:

y = x1 w1 + ... + xN wN + b = X.w + b (4.3)

We can now visually represent linear regression with as many inputs as we like.
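
For instance, here is a minimal sketch (with made-up weights and house data, purely for illustration) of how the dot product in equation (4.3) is computed with NumPy:

import numpy as np

# hypothetical weights: price per square foot, per bedroom, per year of age
w = np.array([200.0, 5000.0, -300.0])
b = 10000.0                        # hypothetical base price

# one row per house: size (sq ft), bedrooms, age (years)
X = np.array([[1500, 3, 10],
              [2000, 4, 2]])

y = X.dot(w) + b                   # equation (4.3) applied to every house at once
print(y)                           # -> [322000. 429400.]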

Logistic regression

Linear regression gives us a linear relationship between the inputs and outputs, but what if we want a
discrete answer instead of a continuous one? For instance, what if we want a binary yes/no answer? For
example, given a list of passengers on the Titanic, can we predict if a specific person would survive or not?

Can you think of a way to change our equation so that we can allow for binary output?

The answer here is to use Logistic Regression. Just before we output the value, we’ll use the sigmoid
function to produce a bounded value instead of an unbounded one. As you may remember from Chapter 3, the
Sigmoid function maps all real values to the interval [0, 1]. We can use the sigmoid to map the output of the
node (so far linear) to the range [0, 1]. We will interpret the result as the probability of a binary outcome.

Neural Network for Logistic Regression

TIP: if you need a refresher about the sigmoid you can check Chapter 3 as well as this nice
article on Wikipedia.

Perceptron

Adding a sigmoid function is just a special case of what is called an activation function. Activation
Function is just a fancy name we give to the function that sits at the output of a node in a Neural Network.
There are many different types of activation functions, and we will encounter them later in this chapter. For
now, know that they are important. For example, we can describe the first Neural Network ever invented
with a diagram similar to that of Logistic Regression, just with a different activation function. This network
is called the Perceptron.

The Perceptron is also a binary classifier, but instead of using a smooth sigmoid activation function, it uses
the step function:


y = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}    (4.4)
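
As a quick illustration (a minimal sketch with made-up numbers, showing only the prediction step and not the training procedure), a single Perceptron is just a dot product, a bias and the step function:

import numpy as np

def perceptron_predict(x, w, b):
    # equation (4.4): output 1 if w.x + b > 0, otherwise 0
    return int(np.dot(w, x) + b > 0)

w = np.array([0.5, -1.0])   # hypothetical weights for a 2-feature input
b = 0.1                     # hypothetical bias

print(perceptron_predict(np.array([2.0, 0.5]), w, b))  # 0.5*2 - 1*0.5 + 0.1 = 0.6 > 0 -> 1
print(perceptron_predict(np.array([0.0, 1.0]), w, b))  # -1.0 + 0.1 = -0.9 <= 0        -> 0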

We could even simplify our diagram notation without losing information by including the bias and the
activation symbols in the node itself, like this:

Perceptron

A more compact notation



Before we move on, let’s review each element in the diagram with an example. Let’s say our goal is to build a
model that predicts if a banknote is fake or real based on some of its properties (we’ll do this later in the
book).

First, let’s define our inputs and outputs.

The inputs are the properties of the banknote we plan to use. These could be length, height, thickness,
transparency, and more elaborate properties extracted from their images. These input properties are their
features.

The output is the prediction, True or False, 0 or 1, that we hope our model will give us to tell us whether the
note is real or not.

The graph connecting inputs to output is the architecture of our network. In the simple network above the
graph contains a single node performing a weighted sum of the input features.

Weights and biases are the parameters of the model. These parameters are the things we have control over
(in the beginning). These are what the machine learns in our Machine Learning algorithm. They are the
knobs that can be turned to change the model predictions.

During training, the network will attempt to find the best values for weights and biases, but the inputs
x1 , ..., x n , the outputs, and the network architecture, are given and cannot be changed by the model (or us,
for that matter).

Now that we have established a symbolic notation that allows us to describe both linear regression and
logistic regression in a very compact and visual way, let’s see how we can expand the networks.

Deeper Networks

The simple networks above take multiple inputs and calculate the output as a weighted sum of those inputs,
plus a couple of extra ingredients. The first is a fixed bias term, which shifts the output so that it does not
have to be zero when all the inputs are zero. The second is an optional nonlinear activation function, which
we use when defining classification models.

The weighted sum of the input plus the bias is sometimes also called a linear combination of the input
features because it only involves sums and multiplications by parameters (no additional functions like
exponentials, cosines or similar).

Let’s see what happens when we combine several Perceptrons in the same graph.

We start by taking many of them, each connected by different weights to the same input nodes. We then
calculate the output for each of the nodes, obtaining different predictions, one for each of the Perceptrons.
This is called a fully connected layer, sometimes called a dense layer.

A dense layer contains many identical nodes connected to the same inputs through independent weights, all
operating in parallel.

Nothing prevents us from using the output values of the dense layer as features (or inputs) for even more
Perceptrons. In other words, we can create a deeper fully connected Neural Network by stacking fully
connected layers on top of each other.

Multilayer Perceptron (MLP)

These fully connected layers are the root of Deep Learning and are used all the time.

To recap, we organize Perceptrons with the same inputs in layers, i.e., groups of Perceptrons that receive the
same inputs. As we will see later, creating a fully connected network in Keras is very easy: it’s just a matter of
adding more layers.

Maths of the Forward Pass

We can think of a Neural Network as a function (F), that takes an input value from the feature space and
outputs a value in the target space. This calculation, called Forward Pass is a composition of linear and
nonlinear steps.

For the math-inclined reader, let’s look at how we can write the operations performed by a node in the first
layer. Each node in the first layer performs a linear transformation of the input features. Mathematically
speaking, it calculates a weighted sum of the inputs and then adds a bias.

If we use the index k to enumerate the nodes in the first layer, we can write the weighted sum z_k^{(1)} calculated
by that node as:

z_k^{(1)} = x_1 w_{1k}^{(1)} + x_2 w_{2k}^{(1)} + \ldots + b_k^{(1)}   for every node k in the first layer.   (4.5)

where we have used the superscript (1) to indicate that the weights belong to the first layer, and the
subscript jk to identify the weight multiplying the input feature at position j for the node at position k.

In the previous example of the price prediction for a house, the index j runs over the features, so j = 1
locates the first feature (the size of the house in square feet (x1 )), j = 2 the second feature, and so on.

If we consider all the input features as a vector X = [x_1, x_2, x_3, ...] and all the output sums of the first layer
as a vector Z^{(1)} = [z_1^{(1)}, z_2^{(1)}, z_3^{(1)}, ...], the above weighted sum can be written as a matrix
multiplication of the weight matrix W^{(1)} with the input features:

TIP: if you are not familiar with vectors, matrices, and linear algebra you can keep going
and ignore this mathematical part. There is a more in-depth discussion of these concepts in
the next chapter. That said, linear algebra is a fundamental component of how Machine
Learning and Deep Learning work. So if you are completely foreign to these notions, you
may find it valuable to take a class or two on Youtube about vectors, matrices, and their
operations.

Z^{(1)} = X \cdot W^{(1)} + B^{(1)} = \sum_j x_j w_{jk}^{(1)} + b_k^{(1)}    (4.6)

where we arrange the weights in a matrix W (1) whose rows run along the input features and whose columns
run along the nodes in the layer.

The nonlinear activation function will be applied to the weighted sum to yield the activation at the output.
For example, in the case of the Perceptron, we will apply the step function like this:

A^{(1)} = H(Z^{(1)})    (4.7)

The activation vector A^{(1)} is a vector of length k, and it becomes the input vector to the second layer in the
network. The second layer will take the output of the first and perform the exact same calculation:

A^{(2)} = H(A^{(1)} \cdot W^{(2)} + B^{(2)})    (4.8)

yielding a new activation vector A(2) with as many elements as the number of nodes in the second layer.

This is true for any of the layers: a layer takes the output of the previous layer and performs a linear
combination, followed by a nonlinear function. The nonlinear activation function is the most important
part of the transformation. If that were not present, a deep network would produce the same result as a
shallow network, and it wouldn’t be powerful at all.
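
To make the forward pass concrete, here is a minimal NumPy sketch of equations (4.6)–(4.8) for a tiny network with 3 input features, 4 nodes in the first layer and 2 in the second, using randomly initialized weights (Keras does all of this for us internally; the shapes and values here are made up for illustration):

import numpy as np

def step(z):
    return (z > 0).astype(float)            # the Perceptron activation H

rng = np.random.RandomState(0)

X = rng.rand(5, 3)                          # 5 samples, 3 input features

W1, B1 = rng.rand(3, 4), rng.rand(4)        # first layer: 3 inputs -> 4 nodes
W2, B2 = rng.rand(4, 2), rng.rand(2)        # second layer: 4 -> 2 nodes

Z1 = X.dot(W1) + B1                         # equation (4.6)
A1 = step(Z1)                               # equation (4.7)
A2 = step(A1.dot(W2) + B2)                  # equation (4.8)

print(A2.shape)                             # (5, 2): one activation per node, per sample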

Activation functions
We’ve looked at two nonlinear activation functions already:

• the step function


• sigmoid

These functions are applied to the output weighted sum calculated by a layer before we pass the values onto
the next layer or to output. They are the key element of Neural Networks. Activation functions are what
make Neural Networks so versatile and powerful! Besides sigmoid and step functions there are other
powerful options. Let’s look at a few more. First let’s load our common files:

In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Sigmoid and Step functions are easy to define using numpy (using their mathematical formulas):

In [3]: def sigmoid(x):


return 1.0 / (1.0 + np.exp(-x))

def step(x):
return x > 0

They both map the real axis onto the interval between 0 and 1 ([0, 1]), i.e. they are bounded:

In [4]: x = np.linspace(-10, 10, 100)


plt.plot(x, sigmoid(x))
plt.plot(x, step(x))
plt.legend(['sigmoid', 'step'])
plt.title('Activation Functions');
The sigmoid and step activation functions

They are designed to squash a large positive weighted sum towards 1 and a very negative weighted sum towards 0.

It’s as if each node was performing an independent classification of the input features and feeding the output
binary outcome onto the next layer.

Besides the sigmoid and step, other nonlinear activation functions are possible and will be used in this
book. Let’s look at a few of them:

Tanh

The hyperbolic tangent has a very similar shape to the sigmoid, but it is bounded and smoothly varying
between [−1, +1] instead of [0, 1], and is defined as:

y = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}    (4.9)

The advantage of this is that negative values of the weighted sum are not squashed towards zero but are
mapped to negative outputs. In practice tanh often makes the network learn faster than sigmoid or
step.

We can write the tanh function simply in Python as well, but we don’t have to. An efficient version of the
tanh function is available through numpy:

In [5]: x = np.linspace(-10, 10, 100)


plt.plot(x, sigmoid(x))
plt.plot(x, step(x))
plt.plot(x, np.tanh(x))
plt.legend(['sigmoid', 'step', 'tanh'])
plt.title('Activation Functions');

The sigmoid, step and tanh activation functions

ReLU

The rectified linear unit function or simply rectifier is defined as:


y = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}    (4.10)

or simply:

y = max(0, x) (4.11)

Initially motivated by biology, it has been shown to be very effective, and it is probably the most popular
activation function for Deep Neural Networks. It offers two advantages.

1. If it’s implemented as an if statement (the former of the two formulations above), its calculation is
very fast, much faster than smooth functions like sigmoid and tanh.
2. Not being bounded on the positive axis, it can distinguish between two large values of input sum,
which helps back-propagation converge faster.

In [6]: def relu(x):


cond = x > 0
return cond * x

In [7]: x = np.linspace(-10, 10, 100)


plt.plot(x, relu(x))
plt.title('relu');

The ReLU activation function

Softplus

The Softplus function is a smooth approximation of the ReLU:


y = \log(1 + e^x)    (4.12)

We mention it for completeness, though it’s rarely used in practice.

In [8]: def softplus(x):


return np.log1p(np.exp(x))

In [9]: x = np.linspace(-10, 10, 100)


plt.plot(x, relu(x))
plt.plot(x, softplus(x))
plt.legend(['relu', 'softplus'])
plt.title('ReLU and Softplus');

The ReLU and Softplus activation functions

SeLU

Finally, the SeLU activation function is a very recent development (see paper published in June 2017). The
name stands for scaled exponential linear unit and it’s implemented as:
y = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{otherwise} \end{cases}    (4.13)

On the positive axis it behaves like the rectified linear unit (ReLU), scaled by a factor λ. On the negative axis
it smoothly goes down to a negative value. This activation function, combined with a new regularization
technique called Alpha Dropout, offers better convergence properties than ReLU!

In [10]: def selu(x):


alpha = 1.6732632423543772848170429916717
scale = 1.0507009873554804934193349852946
res = scale * np.where(x>0.0,
x,
alpha * (np.exp(x) - 1))
return res

In [11]: x = np.linspace(-10, 10, 100)


plt.plot(x, relu(x))
plt.plot(x, selu(x))
plt.legend(['relu', 'selu'])
plt.title('ReLU and SeLU');

The ReLU and SeLU activation functions

When creating a deep network, we will use one of these activation functions between one layer and the next,
in order to make the Neural Network nonlinear. These functions are the secret power of Neural Networks:
with nonlinearities at each layer they are able to approximate very complex functions.

Binary classification
Let’s work through classifying a binary dataset using a Neural Network. We’ll need a dataset to work with to
train our Neural Network. Let’s create an example dataset with two classes that are not separable with a
straight boundary, and let’s separate them with a fully connected Neural Network. First we import the
make_moons function from Scikit Learn:

In [12]: from sklearn.datasets import make_moons

And then we use it to generate a synthetic dataset with 1000 points and 2 classes:

In [13]: X, y = make_moons(n_samples=1000,
noise=0.1,
random_state=0)

Let’s plot this dataset and see what it looks like:

In [14]: plt.plot(X[y==0, 0], X[y==0, 1], 'ob', alpha=0.5)


plt.plot(X[y==1, 0], X[y==1, 1], 'xr', alpha=0.5)
plt.legend(['0', '1'])

plt.title('Non linearly separable data');


Non linearly separable data: scatter plot of the two classes

In [15]: X.shape

Out[15]: (1000, 2)

We split the data into training and test sets:

In [16]: from sklearn.model_selection import train_test_split

In [17]: X_train, X_test, y_train, y_test = \


train_test_split(X, y, test_size=0.3, random_state=42)

To build our Neural Network, let’s import a few libraries from the Keras package:

In [18]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, Adam

Logistic Regression

Let’s first verify that a shallow model cannot separate the two classes. This is more for educational purposes
than anything else. We are going to build a model that we know is wrong, since it can only draw straight
boundaries. This model will not be able to separate our data correctly but we will then be able to extend it
and see the power of Neural Networks.

Let’s start by building a Logistic Regression model like we did in the previous chapter. We will create it using
the Sequential API, which is the simpler way to build models in Keras. We add a single Dense layer with 2
inputs, a single node and a sigmoid activation function:

In [19]: model = Sequential()

Now we’ll add a single Dense layer with 2 inputs and we’ll use the sigmoid activation function here.

In [20]: model.add(Dense(1, input_dim=2, activation='sigmoid'))

The arguments of the Dense layer definition map really well to our graph notation above:

Dense layer in Keras

Then we compile the model assigning the optimizer, the loss and any additional metric we would like to
include (like the accuracy in this case):

In [21]: model.compile(Adam(lr=0.05),
'binary_crossentropy',
metrics=['accuracy'])

Let’s look at the three arguments to make sure we understand them.



• Adam(lr=0.05) is the optimizer, this is the algorithm that performs the actual learning. There are
many different optimizers, and we will explore them in detail in the next chapter. For now, know that
Adam is an excellent one.
• binary_crossentropy is the loss or cost function. We have described it in detail in Chapter 3. For
binary classification problems where we have a single output with a sigmoid activation, we need to
use binary_crossentropy function. For Multiclass classifications where we have multiple classes
with a softmax activation, we need to use categorical_crossentropy, as we’ll see below.
• metrics is just a list of additional metrics we’d like to calculate, in this case, we add the accuracy of
our classification, i.e., the fraction of correct predictions as seen in Chapter 3.

As we have seen in the previous chapter, we can now train the compiled model using our training data. The
model.fit(X, y) method does just that: it uses the training inputs X_train to generate predictions. It
then compares the predictions with the actual labels y_train through the use of the cost function, and it
finally adapts the parameters to minimize such cost.

We will train our model for 200 epochs, which means our model will get to see our training data 200 times.
We also set verbose=0 to suppress printing during the training. Feel free to change it to verbose=1 or
verbose=2 if you want to monitor training as it progresses.

In [22]: model.fit(X_train, y_train, epochs=200, verbose=0);

Now that we have trained our model, we can evaluate its performance on the test data using the function
.evaluate. This takes the input features of the test data X_test and the input labels of the test data y_test
and calculates the average loss and any other metric added during model.compile. In the present case
.evaluate will return two numbers, the loss (cost) and the accuracy:

In [23]: results = model.evaluate(X_test, y_test)

300/300 [==============================] - 0s 236us/sample - loss: 0.3172 - accuracy: 0.8400

We can print out the accuracy by retrieving the second element in the results tuple:

In [24]: print("The Accuracy score on the Test set is:\t",


"{:0.3f}".format(results[1]))

The Accuracy score on the Test set is: 0.840

The accuracy is better than random guessing, but it’s not 100%. Let’s see the boundary identified by the
logistic regression by plotting it as a line:

In [25]: def plot_decision_boundary(model, X, y):


amin, bmin = X.min(axis=0) - 0.1
amax, bmax = X.max(axis=0) + 0.1
hticks = np.linspace(amin, amax, 101)
vticks = np.linspace(bmin, bmax, 101)

aa, bb = np.meshgrid(hticks, vticks)


ab = np.c_[aa.ravel(), bb.ravel()]

c = model.predict(ab)
cc = c.reshape(aa.shape)

plt.figure(figsize=(12, 8))
plt.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)
plt.plot(X[y==0, 0], X[y==0, 1], 'ob', alpha=0.5)
plt.plot(X[y==1, 0], X[y==1, 1], 'xr', alpha=0.5)
plt.legend(['0', '1'])

plot_decision_boundary(model, X, y)

plt.title("Decision Boundary for Logistic Regression");

Decision Boundary for Logistic Regression


1.25 0
1
1.00

0.75

0.50

0.25

0.00

0.25

0.50

0.75
1.0 0.5 0.0 0.5 1.0 1.5 2.0

As you can see in the figure, since a shallow model like logistic regression is not able to draw curved
boundaries, the best it can do is align the boundary so that most of the blue dots fall in the blue region and
most of the red crosses fall in the red region.

Deep model

The word deep in Deep Learning has changed meaning over time. Initially, it was used to refer to networks
that had more than a single layer. As the field progressed, and researchers designed models with many inner
layers, the word shifted to meaning networks with hundreds of layers and billions of parameters. In this
book, we will use the original meaning and call “deep” any model with more than one layer, so let’s add a few
layers and create our first “deep” model.

Let’s build a model with the following structure:

Graph of our network

This model has three layers. The first layer has four nodes, with two inputs and a relu activation function.
Each of the two nodes in the second layer receives the four values at the output of the first layer, performs a
weighted sum (plus a bias) and then pipes the output through a relu activation function. Finally, the two outputs of
this layer go into the third layer, which is also our output layer. This only has one node and a sigmoid
activation function, so that the output values are constrained between 0 and 1.

We can build this network in Keras very easily. All we have to do is add more layers to the Sequential
model, specifying the number of nodes and the activation for each of them using the .add() function. Let’s
start with the first layer:

In [26]: model = Sequential()


model.add(Dense(4, input_dim=2, activation='relu'))

This is very similar to what we did above, except that now this Dense layer has four nodes instead of one.
How many parameters are there in this layer? There are twelve parameters, two weights for each of the
nodes (2*4) plus one bias for each of the nodes (4).
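
If you want to double-check this count, you can call model.summary(), which prints a table with one row per layer and the number of parameters in each (an optional sanity check, not required for training):

model.summary()   # the Dense layer defined above should report 12 trainable parameters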

Let’s now add a second layer after the first one, with two nodes:

In [27]: model.add(Dense(2, activation='relu'))

Notice that we didn’t have to specify the input_dim parameter because Keras is smart and automatically
matches it with the output size of the previous layer.

Finally, let’s add the output layer:

In [28]: model.add(Dense(1, activation='sigmoid'))

and let’s compile the model:

In [29]: model.compile(Adam(lr=0.05),
'binary_crossentropy',
metrics=['accuracy'])

The input_dim parameter is the number of dimensions in our input data points. In this case, each point is
described by two numbers, so the input dimension is equal to 2 (for the first Dense() layer). Dense(1) is
the output layer. Here we are classifying 2 classes, blue dots and red crosses, and therefore it’s a binary
classification and we are predicting a single number: the probability of being in the class of the red crosses.

Let’s train it and see how it performs, using the .fit() method again:

In [30]: model.fit(X_train, y_train, epochs=100, verbose=0);

We’ll use a couple of handy functions from the sklearn.metrics package: accuracy_score() and
confusion_matrix(). First of all, let’s see what classes our model predicts using the
.predict_classes() method:

In [31]: y_train_pred = model.predict_classes(X_train)


y_test_pred = model.predict_classes(X_test)

This is different from the .predict() method because it returns the actual predicted class instead of the
predicted probability of each class.

In [32]: y_train_prob = model.predict(X_train)


y_test_prob = model.predict(X_test)

Let’s look at the first few values for comparison:

In [33]: y_train_pred[:3]

Out[33]: array([[1],
[1],
[0]], dtype=int32)

In [34]: y_train_prob[:3]

Out[34]: array([[0.9994087],
[0.9994087],
[0. ]], dtype=float32)

Let’s compare the predicted classes with the actual classes on both the training and the test set. First, let’s
import the accuracy_score and the confusion_matrix methods from sklearn:

In [35]: from sklearn.metrics import accuracy_score


from sklearn.metrics import confusion_matrix

Let’s check out the score accuracy here for both the training set and the test set:

In [36]: acc = accuracy_score(y_train, y_train_pred)


print("Accuracy (Train set):\t{:0.3f}".format(acc))

acc = accuracy_score(y_test, y_test_pred)


print("Accuracy (Test set):\t{:0.3f}".format(acc))

Accuracy (Train set): 0.999


Accuracy (Test set): 0.993

Let’s plot the decision boundary for the model:

In [37]: plot_decision_boundary(model, X, y)
plt.title("Decision Boundary for Fully Connected");
Decision Boundary for Fully Connected

As you can see, our network learned to separate the two classes with a zig-zag boundary, which is typical of
the ReLU activation.

TIP: if your model has not learned to separate the data well, re-initialize the model and
re-train it. As you’ll see later in this book, the random initialization of the model may have
a significant effect on its ability to learn effectively.

Let’s try building our model again, but use a different activation this time. If we used the tanh function
instead, we’d have obtained a smoother boundary:

In [38]: model = Sequential()


model.add(Dense(4, input_dim=2, activation='tanh'))
model.add(Dense(2, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(Adam(lr=0.05),
loss='binary_crossentropy',
metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100, verbose=0)

plot_decision_boundary(model, X, y)


In [39]: y_train_pred = model.predict_classes(X_train)


y_test_pred = model.predict_classes(X_test)

In [40]: acc = accuracy_score(y_train, y_train_pred)


print("Accuracy (Train set):\t{:0.3f}".format(acc))

acc = accuracy_score(y_test, y_test_pred)


print("Accuracy (Test set):\t{:0.3f}".format(acc))

Accuracy (Train set): 0.999


Accuracy (Test set): 1.000

Adding depth to our model allows us to separate two classes with a boundary of any shape needed. The
complexity of the boundary is determined by the number of nodes and layers. The more we add, the more
parameters our network will learn. Since we can always add more layers, we can always increase the
complexity of a network, and therefore, of the boundary it can learn.

Deep Learning models can have as few as a few hundred parameters or as many as a few billion. As the
number of parameters increases, so does the need for data. To train a model with millions of parameters, we
will likely need tens of millions of data points, which also implies considerable computational resources,
as we shall see later on.

Multiclass classification
Neural Networks easily extend to cases where the output is not a single value.

In the case of regression, this means that the output is a vector, while in the case of classification, it means
we have more than one class we’d like to separate.

For example, if we are doing image recognition, we may have several classes for all the objects we’d like to
distinguish (e.g., cat, dog, mouse, bird). Instead of having a single output Yes/No, we allow the network to
predict multiple values.

Similarly, for a self-driving car, we may want our network to predict the direction of the trajectory the
vehicle should take, which means both the speed and the steering angle. This would be a regression with
multiple outputs at the same time. The extension is trivial in the case of regression: we add as many output
nodes as needed and minimize the mean squared error on the whole vector output.
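
For instance (a minimal hedged sketch with a made-up number of input features, not a model we will reuse later), a regression with two outputs such as speed and steering angle only needs an output layer with two nodes and a mean squared error loss:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# hypothetical network: 10 input features, 2 continuous outputs (e.g. speed and angle)
model = Sequential()
model.add(Dense(8, input_dim=10, activation='relu'))
model.add(Dense(2))                        # no activation: unbounded regression outputs
model.compile(optimizer='adam', loss='mse')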

The case of classification requires a little more discussion because we need to choose the activation function
carefully. When we are predicting discrete output we could be in one of two cases:

1. the classes could be mutually exclusive


2. each class could be independent

Let’s consider the example of email classification. We want to use our Machine Learning model to organize a
large pool of emails sitting in our inbox. We could choose between two ways to organize them.

Tags

One way to arrange our emails would be to add tags to each email to specify the content. We could have a
tag for Work, a tag for Personal, but also a tag for Has_Picture or Has_Attachment. These tags are not
mutually exclusive. Each one is independent of the others, and a single email could carry multiple tags.

The extension of the Neural Network to this case is also pretty straightforward because we will perform an
independent logistic regression on each tag. Like in the case of the regression, all we have to do is add
multiple sigmoid output nodes, and we are done.
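
As a sketch (with hypothetical sizes for the number of input features and tags, using the same Keras API we have already seen), the output layer of such a tagging model uses one sigmoid per tag together with the binary_crossentropy loss, so that each tag becomes an independent yes/no decision:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features, n_tags = 100, 4                  # hypothetical sizes

model = Sequential()
model.add(Dense(32, input_dim=n_features, activation='relu'))
model.add(Dense(n_tags, activation='sigmoid'))   # one independent probability per tag
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])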

Mutually exclusive classes and Softmax

A different case is if we decided to arrange our emails in folders, for example: Work, Personal, Spam, and
move each email to the corresponding folder. In this case, each email can only be in one folder. If it’s in
folder Work, it is automatically not in folder Personal. In this case, we cannot use independent sigmoids;
we need to use an activation function that will normalize the output so that if a node predicts a high
probability, all the others will predict a low probability and the sum of all the probabilities will add up to one.

Mathematically, the softmax function is a generalization of the logistic function that does just that:

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}   for j = 1, \ldots, K.    (4.14)

When we deal with mutually exclusive classes, we always have to apply the softmax function to the last
layer.
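
A minimal NumPy sketch of equation (4.14) makes the normalization evident (Keras applies this for us when we set activation='softmax'; the scores below are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])    # hypothetical raw scores from the last layer
print(softmax(z))                # roughly [0.659 0.242 0.099], and it sums to 1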

The Iris dataset

The Iris dataset is a classic dataset used in Machine Learning. It describes three species of flowers, with four
features each, so it’s a great example of Multiclass classification. Let’s see how Multiclass classification is
done using Keras and the Iris dataset. First of all, let’s load the data.

In [41]: df = pd.read_csv('../data/iris.csv')

In [42]: df.head()

Out[42]:

sepal_length sepal_width petal_length petal_width species


0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

We need to do a bit of massaging of the data to separate the input features from the target column containing
the species.

First of all let’s create a feature matrix X where we store the first 4 columns:

In [43]: X = df.drop('species', axis=1)


X.head()

Out[43]:

sepal_length sepal_width petal_length petal_width


0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Let’s also create a target column, where we encode the labels in alphabetical order. We need to do this
because Machine Learning models do not understand string values like setosa or versicolor. We will
first look at the unique values contained in the species column:

In [44]: targets = df['species'].unique()


targets

Out[44]: array(['setosa', 'versicolor', 'virginica'], dtype=object)

And then build a dictionary where we assign an index to each target name in alphabetical order:

In [45]: target_dict = {n:i for i, n in enumerate(targets)}


target_dict

Out[45]: {'setosa': 0, 'versicolor': 1, 'virginica': 2}

Now we can use the .map method to create a new Series from the species column, where each of the
entries is replaced using target_dict:

In [46]: y= df['species'].map(target_dict)
y.head()

Out[46]:

species
0 0
1 0
2 0
3 0
4 0

Now y is a number indicating the class (0, 1, 2). In order to use this with Neural Networks, we need to
perform one last step: we will expand it to 3 binary dummy columns. We could use the
pandas.get_dummies function to do this, but Keras also offers an equivalent function, so let’s use that
instead:

In [47]: from tensorflow.keras.utils import to_categorical

In [48]: y_cat = to_categorical(y)

Let’s check out what the data looks like by looking at the first 5 values:

In [49]: y_cat[:5]

Out[49]: array([[1., 0., 0.],


[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]], dtype=float32)

Now we create a train and test split, with a 20% test size. We’ll pass the values of the X dataframe because
Keras doesn’t like pandas dataframes. Also notice that we introduce 2 more parameters:

• stratify=y to make sure that we preserve the ratio of labels in each set, i.e. we want each set
to be composed of one third of each flower type.
• random_state=0 sets the seed of the random number generator in a way that we all get the same
results.

In [50]: X_train, X_test, y_train, y_test = \


train_test_split(X.values, y_cat, test_size=0.2,
random_state=0, stratify=y)

and then create a model with:

• 4 features in input (the sepal_length, sepal_width, petal_length, petal_width)


• 3 in output (each one being the probability of the flower being one of setosa, versicolor,
virginica)
• A softmax activation

This is a shallow model, equivalent of a Logistic Regression with 3 classes instead of two.

In [51]: model = Sequential()


model.add(Dense(3, input_dim=4, activation='softmax'))
model.compile(Adam(lr=0.1),
loss='categorical_crossentropy',
metrics=['accuracy'])

In [52]: model.fit(X_train, y_train,


validation_split=0.1,
epochs=30, verbose=0);

The output of the model is a matrix with 3 columns, corresponding to the predicted probabilities for each
class where each of the 3 output predictions are listed in the columns, ordered by their order in the y_train
array:

In [53]: y_pred = model.predict(X_test)


y_pred

Out[53]: array([[9.84953165e-01, 1.50431255e-02, 3.62222863e-06],


[1.35534555e-02, 6.10131264e-01, 3.76315266e-01],
[9.59943891e-01, 3.99981663e-02, 5.78845938e-05],
[8.50223121e-04, 2.50519723e-01, 7.48630047e-01],
[9.74876344e-01, 2.51093581e-02, 1.42654144e-05],
[1.45609556e-02, 6.23836279e-01, 3.61602753e-01],
[5.36217005e-04, 2.19699532e-01, 7.79764235e-01],
[9.66539264e-01, 3.34275104e-02, 3.31806150e-05],
[9.56479251e-01, 4.34678495e-02, 5.29610115e-05],
[6.18544631e-02, 8.30921590e-01, 1.07223883e-01],
[4.25403632e-05, 4.99334484e-02, 9.50024068e-01],
[3.50145549e-02, 7.21053362e-01, 2.43932113e-01],
[1.01207215e-02, 6.87664211e-01, 3.02215070e-01],
[4.05738043e-04, 2.16491580e-01, 7.83102691e-01],
[5.98751344e-02, 8.21283102e-01, 1.18841775e-01],
[2.56499567e-04, 1.33231789e-01, 8.66511703e-01],
[1.26376064e-04, 1.01343736e-01, 8.98529947e-01],
[3.31360660e-02, 8.26828897e-01, 1.40035018e-01],
[2.79063992e-02, 8.24876964e-01, 1.47216558e-01],
[9.83506680e-01, 1.64842363e-02, 9.06099103e-06],
[9.81369734e-01, 1.86190158e-02, 1.12053631e-05],
[2.49856414e-04, 1.18530639e-01, 8.81219506e-01],
[3.04675923e-04, 1.54964134e-01, 8.44731152e-01],
[1.55627797e-03, 4.12574559e-01, 5.85869193e-01],
[9.72532451e-01, 2.74453480e-02, 2.21977480e-05],
[1.52209895e-02, 7.30566740e-01, 2.54212290e-01],
[2.76254658e-02, 7.31834173e-01, 2.40540460e-01],
[6.88130467e-06, 2.04160493e-02, 9.79577065e-01],
[9.80767131e-01, 1.92240067e-02, 8.89783041e-06],
[9.68774557e-01, 3.12050171e-02, 2.04509543e-05]], dtype=float32)

Which class does our network think each flower is? We can obtain the predicted class with the np.argmax,
which finds the index of the maximum value in an array:

In [54]: y_test_class = np.argmax(y_test, axis=1)


y_pred_class = np.argmax(y_pred, axis=1)

Let’s check the classification report and confusion matrix that we described in Chapter 3.

As you may remember, classification_report() is found in the sklearn.metrics package:

In [55]: from sklearn.metrics import classification_report

To create a classification report, we’ll run the classification_report() method, passing it the test classes
(the list of correct labels for each datum that we created before) and y_pred_class (the list we just
obtained of the predicted classes).

In [56]: print(classification_report(y_test_class, y_pred_class))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

We get the confusion matrix by running the confusion_matrix() method passing it the same arguments
as the classification report:

In [57]: cm = confusion_matrix(y_test_class, y_pred_class)

pd.DataFrame(cm, index = targets,


columns = ['pred_'+c for c in targets])

Out[57]:

pred_setosa pred_versicolor pred_virginica


setosa 10 0 0
versicolor 0 10 0
virginica 0 0 10

In [58]: plt.imshow(cm, cmap='Blues');



Recall that the confusion matrix tells us how many examples from one class are predicted in each class. In
this run the matrix is perfect, although depending on the random initialization you may see the occasional
point in class virginica predicted as versicolor. Let’s inspect the data visually to see why these two classes
can be confused. Our data has 4 features, so we need to decide how to plot it. We could choose 2 features and plot just those:

In [59]: plt.scatter(X.loc[y==0,'sepal_length'],
X.loc[y==0,'petal_length'])

plt.scatter(X.loc[y==1,'sepal_length'],
X.loc[y==1,'petal_length'])

plt.scatter(X.loc[y==2,'sepal_length'],
X.loc[y==2,'petal_length'])

plt.xlabel('sepal_length')
plt.ylabel('petal_length')
plt.legend(targets)
plt.title("The Iris Dataset");
The Iris Dataset: scatter plot of sepal_length versus petal_length for the three species

Classes virginica and versicolor are slightly overlapping, which could explain why our model couldn’t
separate them too well. Is it true for every feature? We’ll check that with a very cool visualization library
called Seaborn. Seaborn improves Matplotlib with additional plots, for example the pairplot, which plots
all possible pairs of features in a scatter plot:

In [60]: import seaborn as sns

In [61]: g = sns.pairplot(df, hue="species")


g.fig.suptitle("The Iris Dataset");
The Iris Dataset: Seaborn pairplot of all feature pairs, colored by species

As you can see virginica and versicolor overlap in all the features, which can explain why our model
confuses them. Keep in mind that we used a shallow model to separate them instead of a deeper one.

Conclusion
In this chapter we have introduced fully connected deep Neural Networks and seen how they can be used to
solve linear and nonlinear regression and classification problems. In the exercises we will apply them to
predict the onset of diabetes in a population.

Exercises

Exercise 1

The Pima Indians dataset is a very famous dataset distributed by UCI and originally collected from the
National Institute of Diabetes and Digestive and Kidney Diseases. It contains data from clinical exams for
women aged 21 and above of Pima Indian origin. The objective is to predict, based on diagnostic
measurements, whether a patient has diabetes.

It has the following features:

• Pregnancies: Number of times pregnant


• Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• BloodPressure: Diastolic blood pressure (mm Hg)
• SkinThickness: Triceps skin fold thickness (mm)
• Insulin: 2-Hour serum insulin (mu U/ml)
• BMI: Body mass index (weight in kg / (height in m)^2)
• DiabetesPedigreeFunction: Diabetes pedigree function
• Age: Age (years)

The last column is the outcome, and it is a binary variable.

In this first exercise we will explore it through the following steps:

1. Load the ../data/diabetes.csv dataset, use pandas to explore the range of each feature

• For each feature draw a histogram. Bonus points if you draw all the histograms in the same figure.
• Explore correlations of features with the outcome column. You can do this in several ways, for
example using the sns.pairplot we used above or drawing a heatmap of the correlations.
• Do features need standardization? If so what standardization technique will you use? MinMax?
Standard?
• Prepare your final X and y variables to be used by an ML model. Make sure you define your target
variable well. Will you need dummy columns?

In [ ]:

Exercise 2

Build a fully connected NN model that predicts diabetes. Follow these steps:

1. split your data in a train/test with a test size of 20% and a random_state = 22

• define a sequential model with at least one inner layer. You will have to make choices for the following
things:
– what is the size of the input?
– how many nodes will you use in each layer?
– what is the size of the output?
– what activation functions will you use in the inner layers?
– what activation function will you use at the output?
– what loss function will you use?
– what optimizer will you use?


• fit your model on the training set, using a validation_split of 0.1
• test your trained model on the test data from the train/test split
• check the accuracy score, the confusion matrix and the classification report

In [ ]:

Exercise 3

Compare your work with the results presented in this notebook. Are your Neural Network results better or
worse than the results obtained by traditional Machine Learning techniques?

• Try training a Support Vector Machine or a Random Forest model on the same train/test split. Is the
performance better or worse?
• Try restricting your features to only four features like in the suggested notebook. How does model
performance change?

In [ ]:

Exercise 4

Tensorflow playground is a web-based Neural Network demo. It is beneficial for developing an intuition about
what happens when you change the architecture, activation function or other parameters. Try playing with it for
a few minutes. You don’t need to understand the meaning of every knob and button on the page; just get a sense
of what happens if you change something. In the next chapter, we’ll explore these things in more detail.

In [ ]:
5 Deep Learning Internals
This is a special chapter
In the last chapter, we introduced the Perceptron with weights, biases and activation functions and fully
connected Neural Networks. This chapter is a bit different from all the other chapters, and it is for the reader
who is interested in understanding the inner workings of a Neural Network.

In this chapter, we learn about gradient descent and backpropagation, which is more technical and abstract
than the rest of the book. We will use mathematical formulas and weird symbols, talk about derivatives and
gradients. We will try to make these concepts as intuitive and comfortable as possible, but these are complex
topics, and it is not possible to introduce them fully without going into some level of detail.

Let us first tell you: you don’t NEED to read this chapter. This book is for the developer and practitioner
who is interested in applying Neural Networks to solve real-world problems. As such, all the previous and
following chapters are focused on the implementation of Neural Networks and their practical application.
This chapter is different from all the others: you will not learn new techniques here, you will not learn new
commands or tricks, nor will we introduce any new Neural Network architecture.

All this chapter does is explain what happens when you run the function model.fit, i.e., break down
how a Neural Network learns. As we have already seen in chapters 3 and 4, after we define the model
architecture we usually do two more steps:

1. we .compile the model specifying the optimizer and the cost function
2. we .fit the model for a certain number of epochs using the training data


Keras executes these two operations for us, and we don’t have to worry about them too much. However, I’m
sure you’ve been wondering why we choose a particular optimizer at compilation or what is happening
during training.

This chapter explains precisely that.

In our opinion, it is essential to learn this for a few reasons. First of all, understanding these concepts allows us
to demystify what’s happening under the hood in our network. Neural Networks are not magic, and
knowing these concepts can give us a better ability to judge where we can use them to solve problems and
where we cannot. Secondly, knowing the internal mechanisms increases our ability to understand which
knobs to tweak and which optimization algorithms to choose.

So, let us re-iterate this once again: feel free to skip this chapter if your primary goal is to learn how to use
Keras and to apply Neural Networks. You won’t find new code here, mostly a lot of maths and formulas.

On the other hand, if your goal is to understand how things work, then go ahead and read it. Chances are
you will find the answers to some of your questions in this chapter.

Finally, if you are already familiar with derivatives and college calculus, you can probably skim through
large portions of this chapter quite quickly.

All that said, let’s start by introducing derivatives and gradients. First let’s import our usual libraries. By now
you should be very familiar with all of them, but if in doubt on what they do, check back Chapter 2 where
we introduced them:

In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Derivatives
As the name suggests, a derivative is a function that derives from another function.

Let’s start with an example. Imagine you are driving on the highway. As time goes by you mark your
position along the road, filling a table of values as a function of time. If your speed is 60 miles an hour, every
minute your position will increase by 1 mile.

Let’s indicate your position as a function of time with the variable x(t). Let’s create an array of 10 minutes,
called t and an array of your positions called x:

In [3]: t = np.arange(10)
x = np.arange(10)

Now, let’s plot the distance traveled over time.

In [4]: plt.plot(t, x, 'o-')


plt.title("Distance traveled over time")
plt.ylabel("Distance (miles)")
plt.xlabel("Time (minutes)");

Distance traveled over time (miles versus minutes)

The derivative x ′ (t) of this function is the rate of change in position with respect to time. In this example it
is the speed of your car indicated by the odometer. In the example just mentioned, the derivative is a
constant value of 60 miles per hour, or 1 mile per minute. Let’s create an array containing the speed at each
moment in time:

In [5]: v = np.ones(10) # 1 mile per minute or 60 miles per hour

and let’s plot it too:

In [6]: plt.plot(t, v, 'o-')


plt.ylim(0, 2)

plt.title("Speed over time")


plt.ylabel("Speed (miles per minute)")
plt.xlabel("Time (minutes)");

Speed over time (miles per minute versus minutes)

In general, the derivative x ′ (t) is itself a function of t that tells us the rate of change of the original function
x(t) at each point in time. This is why it is called a derivative. It can also be written explicitly as:

x'(t) := \frac{dx}{dt}(t)    (5.1)

where the fraction dx/dt indicates the ratio between a small change in x and the small change in t that caused it. Let’s look
at a case where the derivative is not constant. Consider an arbitrary curve f (t). Let’s first create a slightly
bigger time array:

In [7]: t = np.linspace(0, 2*np.pi,360)

Then let’s take an arbitrary function and let’s apply it to the array t. We will use the sine function, but that’s
just an example, any function would do:

In [8]: f = np.sin(t)
plt.plot(t, f)
plt.title("Sine Function");

The sine function plotted between 0 and 2π

At each point along the curve f (t), the derivative f ′ (t) is equal to the rate of change in the function.

Finite differences

How do we calculate the value of the derivative at a particular point in t? We can calculate its approximate
value with the method of finite differences:

\frac{df}{dt}(t_i) \approx \frac{\Delta f}{\Delta t}(t_i) = \frac{f(t_i) - f(t_{i-1})}{t_i - t_{i-1}}    (5.2)

where we indicated with ∆ f the difference between two consecutive values of f .

We can calculate the value of the approximate derivative of the above function by using the function
np.diff that calculates the difference between consecutive elements in an array:

In [9]: dfdt = np.diff(f)/np.diff(t)



Let’s plot it together with the original function:

In [10]: plt.plot(t, f)
plt.plot(t[1:], dfdt)
plt.legend(['f', 'dfdt'])
plt.axhline(0, color='black')
plt.title("Sine Function and its first derivative");

The sine function and its first derivative

If we read the figure from left to right, we notice that the value of the derivative is negative when the original
curve is going downhill, and it is positive when the original curve is going uphill. Finally, if we’re at a
minimum or a maximum, the derivative is 0 because the original curve is flat.

Let’s define a simple helper function to plot the tangent line to our curve, i.e., the line that “just touches” the
curve at that point:

In [11]: def plot_tangent(i, color='r'):

plt.plot(t, f)
plt.plot(t[:-1], dfdt)
plt.legend(['f', '$\\frac{df}{dt}$'])
plt.axhline(0)

ti = t[i]
fi = f[i]
dfdti = dfdt[i]

plt.plot(ti, fi, 'o', color=color)


plt.plot(ti, dfdti, 'o', color=color)

x = np.linspace(-0.75, 0.75, 20)


n = 1 + dfdti**2

plt.plot(ti + x/n, fi + dfdti*x/n, color,


linewidth=3)

We can use this helper function to display the relationship between the inclination (slope) of our tangent
line and the value of the derivative function. As you can see, a positive derivative corresponds to an uphill
tangent while negative derivative corresponds to a downhill tangent line.

In [12]: plt.figure(figsize=(14,5))

plt.subplot(131)
plot_tangent(15)
plt.title("Positive Derivative / Upwards")
plt.subplot(132)
plot_tangent(89)
plt.title("Zero Derivative / Flat")
plt.subplot(133)
plot_tangent(175)
plt.title("Negative Derivative / Downwards")
plt.tight_layout();

Three tangent lines: Positive Derivative / Upwards, Zero Derivative / Flat, Negative Derivative / Downwards

Although the finite differences method is useful to calculate the numerical value of a derivative, we don’t
need to use it. Derivatives of simple functions are well known, and we don’t need to calculate them.
Calculus is the branch of mathematics that deals with all this. For our purposes we will summarize here a
few common functions and their derivatives:

Common functions and their derivatives

Partial derivatives and the gradient

When our function has more than one input variable, we need to specify which variable we are using for
derivation. For example, let’s say we measure our elevation on a mountain as a function of our position. Our
GPS position is defined by two variables: longitude and latitude, and therefore the elevation depends on two
variables: y = f (x1 , x2 ).

We can calculate the rate of change in elevation with respect to x1 , and the rate of change with respect x2
independently. These are called partial derivatives, because we only consider the change with respect to
one variable. We will indicate them with a “curly d” symbol:

\frac{\partial f}{\partial x_1}, \quad \frac{\partial f}{\partial x_2}

If we are on top of a hill, the fastest route downhill will not necessarily be along any of the north-south or
east-west directions. It will be in whatever direction it is more steeply descending.

In the two dimensional plane of x1 and x2 , the direction of the most abrupt change will be a 2-dimensional
vector whose components are the partial derivatives with respect to each variable. We call this vector the
Gradient, and we indicate it with an inverted triangle called del or nabla: ∇.

The gradient is an operation that takes a function of multiple variables and returns a vector. The
components of this vector are all the partial derivatives of the function. Since the partial derivatives are
functions of all variables, the gradient is also a function of all variables. To be precise, it is a vector function.

For each point, (x1 , x2 ), the gradient returns the vector in the direction of maximum steepness in the
graph of the original function. If we want to go downhill, all we have to do is walk in the direction opposite
to the gradient. This will be our strategy for minimizing cost functions.

So we have an operation, the gradient, which takes a function of multiple variables and returns a vector in
the direction of maximum steepness. Pretty cool!
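
To make this concrete, here is a minimal sketch that approximates the gradient of a made-up “elevation” function of two variables, using the finite differences idea from earlier in the chapter:

import numpy as np

def f(x1, x2):
    return x1**2 + 3 * x2**2               # a made-up elevation function

def gradient(f, x1, x2, eps=1e-6):
    dfdx1 = (f(x1 + eps, x2) - f(x1, x2)) / eps   # partial derivative with respect to x1
    dfdx2 = (f(x1, x2 + eps) - f(x1, x2)) / eps   # partial derivative with respect to x2
    return np.array([dfdx1, dfdx2])

g = gradient(f, 1.0, 2.0)
print(g)      # approximately [2., 12.]: the direction of steepest ascent at (1, 2)
print(-g)     # walking in the opposite direction takes us downhill fastest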

Visualization of gradient in a landscape

Why is this neat? Why is it important? Well, it turns out that we can use this idea to train our networks.

Backpropagation intuition
Now that we have defined the gradient, let’s talk about backpropagation.

Backpropagation is a core concept in Machine Learning. The next several sections work through the math
of backpropagation. As said at the beginning of this chapter, it is not necessary to understand the maths to
be able to build and apply a Deep Learning model. However, the math is not very hard, and with a little bit
of exercise, you’ll be able to see that there is no mystery behind how Neural Networks function.

At a high-level, the backpropagation algorithm is a Supervised Learning method for training our networks.
It uses the error between the model prediction and the ground truth labels to modify the model weights to
reduce the error in the next iteration.

The starting point for backpropagation is the Cost Function we have introduced in Chapter 3.

Let’s consider a generic cost function of a network with just one weight, let’s call this function J(w). For
every value of the weight w, the function calculates a value of the cost J(w).

Linear regression and its loss

The figure shows this situation for the case of a Linear Regression. As seen in Chapter 3, different lines
correspond to different values of w. In the figure, we represented them with different colors. Each line
produces a different cost, here represented with a dot of a different color, and our goal is to find the value of w
that corresponds to the minimum of J(w).

A bit of algebra solves linear regression easily, but how do we deal with the general case of a network with
millions of weights and biases? The cost function J now depends on millions of parameters, and it is not
obvious how to search for a minimum value.

What is clear is that the shape of such a cost function is not going to be a smooth parabola like the one in the
figure, and we will need a way to navigate a very complex landscape in search of a minimum value.

Let's say we are sitting at a particular point w0, corresponding to cost J(w0). How do we move towards lower
costs?

We want to move in the direction of decreasing J(w) until we reach the minimum, but we can only use local
information. How do we decide where to go?

As we saw when we talked about descending from a hill, the derivative indicates the slope of the function at each point. So,
to move towards lower values, we need to calculate the derivative at w0 and then change our position by
subtracting the value of the derivative from our starting position w0.

Practically speaking, we can take one step in the direction of steepest descent, i.e., the direction that
decreases the cost function the most.

Weight update

Mathematically speaking, we can take one step following the rule:

$$w_0 \rightarrow w_0 - \frac{dJ}{dw}(w_0) \qquad (5.3)$$

Let’s check that this does move us towards lower values on the vertical axis.

If we are at $w_0$ like in the figure, the slope of the curve is negative and thus the quantity $-\frac{dJ}{dw}(w_0)$ is positive.
So, the value of $w_0$ will increase, moving us towards the right on the horizontal axis.

The corresponding value on the vertical axis will decrease, and we have successfully moved towards a lower value
of the function J(w).

Vice versa, if we were to start at a point $w_0$ where the slope is positive, we would subtract the
positive quantity $\frac{dJ}{dw}(w_0)$. This would move $w_0$ to the left, and the corresponding values
on the vertical axis would still decrease.

Learning Rate
The update rule we have just introduced needs one more modification. As it is, it suffers from two problems. If the
cost function is very flat, the derivative will be very small, and with the current update rule we will
move very slowly towards the minimum. Vice versa, if the cost function is very steep, the derivative will
be very large, and we might end up jumping beyond the minimum.

A simple solution to both problems is to introduce a tunable knob that allows us to decide how big of a step
to take in the direction of the gradient. This is the learning rate, and we will indicate it with the Greek letter
η:

$$w_0 \rightarrow w_0 - \eta\,\frac{dJ}{dw}(w_0)$$

If we choose a small learning rate, we will move by tiny steps. A larger learning rate will move us by more
significant steps.

However, we must be careful. If the learning rate is too high, we will run away from the solution. At each
new step we move towards the direction of the minimum, but since the step is too large, we overshoot and
go beyond the minimum, at which point we reverse course and repeat, going further and further away.

Gradient descent
This way of looking for the minimum of a function is called Gradient Descent and it is the idea behind
backpropagation. Given a function, we can move towards its minimum by following the path indicated by
its derivative, or in the case of multiple variables, indicated by the gradient.

For a Neural Network, we define a cost function that depends on the values of the parameters, and we find
the values of the parameters by minimizing such cost through gradient descent.

The cost function is the method for how we can optimize our networks. It’s the backbone for a lot of
different Machine Learning and Deep Learning techniques.

All we are doing is taking the cost function, calculating its partial derivatives with respect to each parameter,
and then using the update rule to decrease the cost. We do this by subtracting the gradient, scaled by the
learning rate, from the parameters themselves. This is a parameter update.

Gradient calculation in Neural Networks


Let’s recap what we’ve learned so far.

We know that the gradient is a function that indicates the direction of maximum steepness. We also know
that we can move towards the minimum of a function by taking consecutive steps in the direction opposite to the
gradient at each point we visit.

Let’s see this with a programming example. We’ll use an invented cost function. Let’s start by defining an
array x with 100 points in the interval [-4, 4]:

In [13]: x = np.linspace(-4, 4, 100)

Then let’s define an invented cost function J(w) that depends on w in some weird way.

$$J(w) = 70.0 - 15.0\,w^2 + 0.5\,w^3 + w^4 \qquad (5.4)$$

In [14]: def J(w):
             return 70.0 - 15.0*w**2 + 0.5*w**3 + w**4

Using the table of derivatives presented earlier we can also quickly calculate its derivative.

$$\frac{dJ}{dw}(w) = -30.0\,w + 1.5\,w^2 + 4\,w^3 \qquad (5.5)$$

In [15]: def dJdw(w):
             return -30.0*w + 1.5*w**2 + 4*w**3

Let’s plot both functions:

In [16]: plt.subplot(211)
plt.plot(x, J(x))
plt.title("J(w)")

plt.subplot(212)
plt.plot(x, dJdw(x))
plt.axhline(0, color='black')
plt.title("dJdw(w)")
plt.xlabel("w")

plt.tight_layout();

Plot of J(w) (top) and its derivative dJdw(w) (bottom) over the interval [-4, 4]

Now let’s find the minimum value of J(w) by gradient descent. The function we have chosen has two
minima, one is a local minimum, the other is the global minimum. If we apply plain gradient descent we
will stop at the minimum that is nearest to where we started. Let’s keep this in mind for later.

Let’s start from a random initial value of w0 = −4:

In [17]: w0 = -4

and let’s apply the update rule:

$$w_0 \rightarrow w_0 - \eta\,\frac{dJ}{dw}(w_0) \qquad (5.6)$$

We will choose a small learning rate of η = 0.001 initially:

In [18]: lr = 0.001

The update step is:

In [19]: step = lr * dJdw(w0)


step

Out[19]: -0.112

and the new value of w0 is:

In [20]: w0 - step

Out[20]: -3.888

i.e. we moved to the right, towards the minimum!

Let's do 30 iterations and see where we get:

In [21]: iterations = 30

w = w0
ws = [w]

for i in range(iterations):
step = lr * dJdw(w)
w -= step
ws.append(w)

ws = np.array(ws)

Let's visualize our descent, zooming in on the interesting region of the curve:

In [22]: plt.plot(x, J(x))


plt.plot(ws, J(ws), 'o')
plt.plot(w0, J(w0), 'or')
plt.legend(["J(w)", "steps", "starting point"])
plt.xlim(-4.2, -1);

Plot of J(w) with the gradient descent steps (dots) and the starting point highlighted

As you can see, we proceed with small steps towards the minimum, and there we stop. Try to modify the
starting point and re-run the code above to fully understand how this works.

Why is this relevant to Neural Networks?



Remember that a Neural Network is just a function that connects our inputs X to our outputs y. We’ll refer
to this function as ŷ = f (X). This function depends on a set of weights w that modulate the output of a layer
when transferring it to the next layer, and on a set of biases b.

Also, remember that we defined a cost J( ŷ, y) = J( f (X, w, b), y) that is calculated using the training set. So,
for fixed training data, the cost J is a function of the parameters w and b.

The best model is the one that minimizes the cost. We can use gradient descent on the cost function to
update the values of the parameters w and b. The gradient will tell us in which direction to update our
parameters, and it is crucial to learning the optimal values of our network parameters.

First, we calculate the gradient for each weight (and bias), $\frac{\partial J}{\partial w}$, and then we update each weight using the
learning rate we have just introduced: $w_0 \rightarrow w_0 - \eta\,\frac{\partial J}{\partial w}$.

All we need to do at this point is learn how to calculate the gradient $\frac{\partial J}{\partial w}$.

The math of backpropagation

In this section, we will work through the calculation of the gradient for a very simple Neural Network. We
are going to use equations and maths. As said previously, feel free to skim through this part if you're focused
on applications; you can always come back later to go deeper into the subject. We will start with a network
with only one input, one inner node and one output. This will make our calculations easier to follow.

In order to make the math easier to follow we will break down this graph and highlight the operations
involved:

Starting from the left, the input is multiplied with the first weight w (1) , then the bias b(1) is added and the
sigmoid activation function is applied. This completes the first layer. Then we multiply the output of the first
layer by the second weight w (2) , we add the second bias b(2) and we apply another sigmoid activation
function. This gives us the output ŷ. Finally we use the output ŷ and the labels y to calculate the cost J.

Forward Pass

Let’s formalize the operations described above with math. The forward pass equations are written as follows:

$$
\begin{aligned}
z^{(1)} &= x\,w^{(1)} + b^{(1)} && (5.7)\\
a^{(1)} &= \sigma(z^{(1)}) && (5.8)\\
z^{(2)} &= a^{(1)}\,w^{(2)} + b^{(2)} && (5.9)\\
\hat{y} = a^{(2)} &= \sigma(z^{(2)}) && (5.10)\\
J &= J(\hat{y}, y) && (5.11)
\end{aligned}
$$

The input-sum z (1) is obtained through a linear transformation of the input x with weight w (1) and bias b(1) .
In this case, we only have one input, so there really is no weighted “sum”, but we still call it input-sum to
remind ourselves of the general case where multiple inputs and multiple weights are present.

The activation a^(1) is obtained by applying the sigmoid function, indicated by the letter σ (pronounced sigma), to
the input-sum z^(1). A similar set of equations holds for the second layer, with input-sum z^(2)
and activation a^(2), which is equivalent to our predicted output in this case.

The cost function J is a function of the correct labels y and the predicted values ŷ, which contain all the
parameters of the network.

The equations described above allow us to calculate the prediction of the network for a given input and the
cost associated with such a prediction. Now we want to calculate the gradients to update the weights and
biases and reduce the cost.
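To make these equations concrete, here is a minimal sketch of the forward pass for this one-input, one-hidden-node, one-output network in plain numpy. The input, label and parameter values are invented for the sake of the example, and we use the Mean Squared Error as an example cost:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented values for the input, label and parameters of the tiny network
x, y = 0.5, 1.0
w1, b1 = 0.8, 0.1    # first layer weight and bias
w2, b2 = -0.4, 0.2   # second layer weight and bias

# Forward pass, following equations (5.7)-(5.11)
z1 = x * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2
y_hat = sigmoid(z2)

# Example cost: Mean Squared Error for this single data point
cost = 0.5 * (y_hat - y) ** 2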

Weight updates

Our goal is to calculate the derivative of the cost function with respect to the parameters of the model, i.e.,
weights and biases. Let’s start by calculating the derivative of the cost function with respect to w (2) , the last
weight used by the network.

$$\frac{\partial J}{\partial w^{(2)}} \qquad (5.13)$$

w (2) appears inside z (2) , which is itself inside the sigmoid function, so we need a way to calculate the
derivative of a nested function.

The technique is pretty easy, and it’s called chain rule. If you need a refresher of how it works, we have an
example of this in the Appendix.

We can look at the graph above to determine which terms will appear in the chain rule and see that J
depends on w (2) through ŷ and z (2) .

If we apply the chain rule, we see that this derivative is the product of three terms.

$$\frac{\partial J}{\partial w^{(2)}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} \qquad (5.14)$$

All this is starting to look pretty complicated. Let’s introduce a simpler notation, following the course by
Roger Grosse at the University of Toronto.

In particular we will use a long line over a variable to indicate the derivative of the cost function with respect
to that variable. E.g.:

$$\overline{w^{(2)}} := \frac{\partial J}{\partial w^{(2)}} \qquad (5.15)$$

Besides being more comfortable to read, this notation emphasizes the fact that those derivatives are
evaluated at a certain point, i.e., they are numbers, not functions.

Using this notation, we can rewrite the above equation as:



$$\overline{w^{(2)}} = \overline{z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} = \overline{\hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} \qquad (5.16)$$

And we can start to see why it is called backpropagation: in order to calculate $\overline{w^{(2)}}$ we first need to
calculate the derivatives of the terms that follow w^(2) in the graph, and then propagate their
contributions back to calculate $\overline{w^{(2)}}$.

Step 1: $\overline{\hat{y}} = \frac{\partial J}{\partial \hat{y}}$

The first term is just the derivative of the cost function with respect to ŷ. This term will depend on the exact
form of the cost function, but it is well defined, and it can be calculated for a given training set. For example,
in the case of the Mean Squared Error $\frac{1}{2}(\hat{y} - y)^2$ this term is simply $(\hat{y} - y)$.

Looking at the graph above, we can highlight in red the terms involved in the calculation of $\overline{\hat{y}}$, which are only
the labels and the predictions:

Step 2: $\overline{z^{(2)}} = \frac{\partial J}{\partial z^{(2)}}$

As noted before, the chain rule tells us that $\overline{z^{(2)}}$ is the product of the derivative of the sigmoid with the term
we just calculated, $\overline{\hat{y}}$:

$$\overline{z^{(2)}} = \overline{\hat{y}}\,\frac{\partial \hat{y}}{\partial z^{(2)}} = \overline{\hat{y}}\,\sigma'(z^{(2)}) \qquad (5.17)$$

Notice how information is propagating backward in the graph:

Since we have already calculated $\overline{\hat{y}}$ we don't need to calculate it again; the only term we need is the derivative
of the sigmoid. This is easy to calculate, and we'll just indicate it with σ′.
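If you want to see it in code, here is a small sketch using the identity σ′(z) = σ(z)(1 − σ(z)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Derivative of the sigmoid, written in terms of the sigmoid itself
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)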

Step 3: $\overline{w^{(2)}} = \frac{\partial J}{\partial w^{(2)}}$

Now we can calculate $\overline{w^{(2)}}$.

Looking at the formulas above, we know that:

$$\overline{w^{(2)}} = \overline{z^{(2)}}\,\frac{\partial z^{(2)}}{\partial w^{(2)}} \qquad (5.18)$$

Since we have already calculated $\overline{z^{(2)}}$, we only need to calculate $\frac{\partial z^{(2)}}{\partial w^{(2)}}$, which is equal to $a^{(1)}$.

So we have:

$$\overline{w^{(2)}} = \overline{z^{(2)}}\,a^{(1)} \qquad (5.19)$$

This last formula is interesting because it tells us that the update to the weights w (2) is proportional to the
input a(1) received by those weights.

This equation sometimes is also written as:

$$\overline{w^{(2)}} = \delta^{(2)}\,a^{(1)}$$

where δ (2) is calculated using parts of the network that are downstream with respect to w (2) and it
corresponds to the derivative of the cost with respect to the input sum z (2) .

The critical aspect here is that $\delta^{(2)}$, i.e., $\overline{z^{(2)}}$, is a constant, representing the downstream contribution of the
network to the error.

Using the same procedure, we can calculate the correction to the bias b^(2) as well:

Step 4: $\overline{b^{(2)}} = \frac{\partial J}{\partial b^{(2)}}$

We can apply the chain rule again and obtain:

$$\overline{b^{(2)}} = \overline{z^{(2)}}\,\frac{\partial z^{(2)}}{\partial b^{(2)}} = \overline{z^{(2)}} \qquad (5.20)$$

This is because $\frac{\partial z^{(2)}}{\partial b^{(2)}} = 1$.

Following a similar procedure we can keep propagating the error back and calculate the corrections to w^(1)
and b^(1). Proceeding backwards, the next term we need to calculate is $\overline{a^{(1)}}$.

Step 5: $\overline{a^{(1)}} = \frac{\partial J}{\partial a^{(1)}}$

Looking at the formulas for the forward pass we notice that a (1) appears inside z (2) , so we apply the chain
rule and obtain:

$$\overline{a^{(1)}} = \overline{z^{(2)}}\,\frac{\partial z^{(2)}}{\partial a^{(1)}} = \overline{z^{(2)}}\,w^{(2)} \qquad (5.21)$$

At this point the calculation of the other terms is mechanical, and we will just summarize them all here:

$$
\begin{aligned}
\overline{\hat{y}} &= \frac{\partial J}{\partial \hat{y}} && (5.22)\\
\overline{z^{(2)}} &= \overline{\hat{y}}\,\sigma'(z^{(2)}) && (5.23)\\
\overline{b^{(2)}} &= \overline{z^{(2)}} && (5.24)\\
\overline{w^{(2)}} &= \overline{z^{(2)}}\,a^{(1)} && (5.25)\\
\overline{a^{(1)}} &= \overline{z^{(2)}}\,w^{(2)} && (5.26)\\
\overline{z^{(1)}} &= \overline{a^{(1)}}\,\sigma'(z^{(1)}) && (5.27)\\
\overline{b^{(1)}} &= \overline{z^{(1)}} && (5.28)\\
\overline{w^{(1)}} &= \overline{z^{(1)}}\,x && (5.29)
\end{aligned}
$$

As you can see each term relies on previously calculated terms, which means we don’t have to calculate them
twice. This is why it’s called backpropagation: because the error terms are propagated back starting from
the cost function and walking along the network graph in reverse order.
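To tie the equations together, here is a minimal sketch of the backward pass for the same tiny network, reusing the variables x, y, z1, a1, z2, y_hat and the parameters from the forward-pass sketch above, together with the sigmoid_prime function from the earlier sketch, and assuming the Mean Squared Error cost as before:

# Backward pass, following equations (5.22)-(5.29), for the MSE cost
y_hat_bar = y_hat - y                     # dJ/dy_hat for J = 0.5*(y_hat - y)**2
z2_bar = y_hat_bar * sigmoid_prime(z2)    # eq. (5.23)
b2_bar = z2_bar                           # eq. (5.24)
w2_bar = z2_bar * a1                      # eq. (5.25)
a1_bar = z2_bar * w2                      # eq. (5.26)
z1_bar = a1_bar * sigmoid_prime(z1)       # eq. (5.27)
b1_bar = z1_bar                           # eq. (5.28)
w1_bar = z1_bar * x                       # eq. (5.29)

# Gradient descent update on each parameter, with a hypothetical learning rate
eta = 0.1
w1, b1 = w1 - eta * w1_bar, b1 - eta * b1_bar
w2, b2 = w2 - eta * w2_bar, b2 - eta * b2_bar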

Congratulations. You have just completed the hardest part. We hope this was insightful and useful. In the
next section, we will extend these calculations to fully connected networks where there are many nodes in
each layer. As you will see, it's the same thing, only now we will deal with matrices instead of just numbers.

Fully Connected Backpropagation


Let’s see how we can expand the calculation to a fully connected Neural Network.

In a fully connected network, each layer contains several nodes, and each node connects to all of the nodes
in the previous and the next layers. The weights in layer l are organized in a matrix W (l) whose elements
are identified by two indices j and k. The index k indicates the receiving node and the index j indicates the
emitting node. So, for example, the weight connecting node 5 in layer 2 to node 4 in layer 3 is going to be
noted as $w^{(3)}_{54}$, and so on.

The input sum at layer l and node k, $z^{(l)}_k$, is the weighted sum of the activations of layer l − 1 plus the bias
term of layer l:

$$z^{(l)}_k = \sum_j a^{(l-1)}_j w^{(l)}_{jk} + b^{(l)}_k$$

Forward Pass

The forward pass equations can be written as follows:



$$
\begin{aligned}
&\;\dots && (5.31)\\
z^{(l)}_k &= \sum_j a^{(l-1)}_j w^{(l)}_{jk} + b^{(l)}_k && (5.32)\\
a^{(l)}_k &= \sigma(z^{(l)}_k) && (5.33)\\
&\;\dots && (5.34)
\end{aligned}
$$

The activations $a^{(l)}_k$ are obtained by applying the sigmoid function to the input-sums $z^{(l)}_k$ coming out of
node k at layer l.

Let’s indicate the last layer with the capital letter L. The equations for the output are:

$$
\begin{aligned}
z^{(L)}_s &= \sum_r a^{(L-1)}_r w^{(L)}_{rs} + b^{(L)}_s && (5.36)\\
\hat{y}_s &= \sigma(z^{(L)}_s) && (5.37)\\
J &= \sum_s J(\hat{y}_s, y_s) && (5.38)
\end{aligned}
$$

The cost function J is a function of the true labels y and the predicted values ŷ, which contain all the
parameters of the network. We indicated it with a sum to include the case where more than one output node
is present.

If the above formulas are hard to read in maths, here's a code version of them. We allocate an array W with
values for the weights $w^{(l)}_{jk}$. In this particular example, imagine a set of weights connecting a layer
with 4 units to a layer with 2 units:

In [23]: W = np.array([[-0.1, 0.3],


[-0.3, -0.2],
[0.2, 0.1],
[0.2, 0.8]])

We also need an array for the biases, with as many elements as there are units in the receiving layer, i.e. 2:

In [24]: b = np.array([0., 0.])

The output of the layer with 4 units is represented by the array a, whose elements are $a^{(l-1)}_j$:

In [25]: a = np.array([0.5, -0.2, 0.3, 0.])

Then, the layer l performs the operation:

In [26]: z = np.dot(a, W) + b

returning the array z with elements $z^{(l)}_k = \sum_j a^{(l-1)}_j w^{(l)}_{jk} + b^{(l)}_k$:

In [27]: z

Out[27]: array([0.07, 0.22])

z is indexed by the letter k. There are 2 entries, one for each of the units in the receiving layer. Similarly you
can write code examples for the other equations.

Backpropagation

Although they may seem a bit more complicated, the only thing that changed is that now each node takes
multiple inputs, each with its own weight and so the input sums z are actually summing up the
contributions of the nodes in the previous layer.

The backpropagation formulas are calculated as before. Here is a summary of all of the terms:

$$
\begin{aligned}
\overline{\hat{y}_s} &= \frac{\partial J}{\partial \hat{y}_s} && (5.40)\\
\overline{z^{(L)}_s} &= \overline{\hat{y}_s}\,\sigma'(z^{(L)}_s) && (5.41)\\
\overline{b^{(L)}_s} &= \overline{z^{(L)}_s} && (5.42)\\
\overline{w^{(L)}_{rs}} &= \overline{z^{(L)}_s}\,a^{(L-1)}_r && (5.43)\\
&\;\dots && (5.44)\\
&\;\dots && (5.45)\\
\overline{a^{(l)}_k} &= \sum_m w^{(l+1)}_{km}\,\overline{z^{(l+1)}_m} && (5.46)\\
\overline{z^{(l)}_k} &= \overline{a^{(l)}_k}\,\sigma'(z^{(l)}_k) && (5.47)\\
\overline{b^{(l)}_k} &= \overline{z^{(l)}_k} && (5.48)\\
\overline{w^{(l)}_{jk}} &= \overline{z^{(l)}_k}\,a^{(l-1)}_j && (5.49)
\end{aligned}
$$

These equations are equivalent to the ones for the unidimensional case, with only one major difference.

The term $\overline{a^{(l)}_k}$, indicating the change in cost due to the activation at node k in layer l, needs to take into
account all the errors in the nodes downstream at layer l + 1. Since the activation $a^{(l)}_k$ is part of the input of
each node in the next layer l + 1, we have to apply the chain rule to each of them and sum all their
contributions together.

Everything else is pretty much the same as the unidimensional case, with just a bunch of indices to keep
track of.

Matrix Notation
We can simplify the above notation a bit by using vectors and matrices to indicate all the ingredients in the
network.

Forward Pass

The equations for the forward pass read:

$$
\begin{aligned}
&\;\dots && (5.51)\\
\mathbf{z}^{(l)} &= \mathbf{a}^{(l-1)}\,\mathbf{W}^{(l)} + \mathbf{b}^{(l)} && (5.52)\\
\mathbf{a}^{(l)} &= \sigma(\mathbf{z}^{(l)}) && (5.53)\\
&\;\dots && (5.54)
\end{aligned}
$$

Backpropagation

The equations for the backpropagation read:

$$
\begin{aligned}
&\;\dots && (5.56)\\
\overline{\mathbf{a}^{(l)}} &= \mathbf{W}^{(l+1)\,T}\,\overline{\mathbf{z}^{(l+1)}} && (5.57)\\
\overline{\mathbf{z}^{(l)}} &= \overline{\mathbf{a}^{(l)}} \odot \sigma'(\mathbf{z}^{(l)}) && (5.58)\\
\overline{\mathbf{b}^{(l)}} &= \overline{\mathbf{z}^{(l)}} && (5.59)\\
\overline{\mathbf{W}^{(l)}} &= \mathbf{a}^{(l-1)}\,\overline{\mathbf{z}^{(l)}}^{\,T} && (5.60)\\
&\;\dots && (5.61)
\end{aligned}
$$

The circle dot ⊙ indicates the element-wise product, also called the Hadamard product, whereas when we write two
matrices next to each other, we mean that matrix multiplication is taking place.

So we can summarize the backpropagation algorithm as follows:

1. Forward pass: we calculate the input-sum and activation of each neuron, proceeding from input to
output.
2. We obtain the error signal of the final layer by estimating the gradient of the cost function with
respect to the outputs of the network. This expression will depend on the training data and training
labels, as well as the chosen cost function, but it is well-defined for given training data and cost.
3. We propagate the error backward at each operation by taking into account the error signals at the
outputs affected by that operation, as well as the kind of operation performed by that specific node.
4. We proceed back until we get to the weights multiplying the input.

A couple of observations:

• The gradient of the cost function with respect to the weights is a matrix with the same shape as the weight matrix.
• The gradient of the cost function with respect to the biases is a vector with the same shape as the biases.
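As a sketch of how these matrix formulas look in numpy, here is the backward pass through the single 4-unit to 2-unit layer used in the code above, assuming a sigmoid activation and an invented value for the error signal arriving at the layer's output (sigmoid and sigmoid_prime are the same helpers as in the earlier sketches):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Same hypothetical layer as in In [23]-[26]
W = np.array([[-0.1, 0.3],
              [-0.3, -0.2],
              [0.2, 0.1],
              [0.2, 0.8]])
b = np.array([0., 0.])
a_prev = np.array([0.5, -0.2, 0.3, 0.])

z = np.dot(a_prev, W) + b            # forward pass for this layer
a = sigmoid(z)

# Invented error signal coming from the layers downstream
a_bar = np.array([0.1, -0.05])

z_bar = a_bar * sigmoid_prime(z)     # element-wise (Hadamard) product
b_bar = z_bar                        # same shape as b
W_bar = np.outer(a_prev, z_bar)      # same shape as W: (4, 2)
a_prev_bar = np.dot(W, z_bar)        # error propagated back to the previous layer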

Congratulations! You’ve now gone through the backpropagation algorithm and hopefully see that it’s just
many matrix multiplications. The bigger the network, the bigger your matrices will be and so the larger the
matrix multiplication products. We will go back to this in a few sections. For now, give yourself a pat on the
back: Neural Networks have no more mysteries for you!

Gradient descent
How do backpropagation and gradient descent work in practice in Deep Learning? Let’s use a real world
dataset to explore how this is done in detail.

Let’s say the government has just hired you for a crucial task. A group of counterfeiters is using fake
banknotes, and this is creating all sorts of problems. Luckily your colleague Agent Jones managed to get
hold of a stack of counterfeit banknotes and bring them to the lab for inspection. You’ve scanned true and
fake notes and extracted four spectral features. Let’s build a classifier that can distinguish them.

Banknotes

First of all, let’s load and inspect the dataset:

In [28]: df = pd.read_csv('../data/banknotes.csv')
df.head()

Out[28]:

variance skewness kurtosis entropy class


0 3.62160 8.6661 -2.8073 -0.44699 0
1 4.54590 8.1674 -2.4586 -1.46210 0
2 3.86600 -2.6383 1.9242 0.10645 0
3 3.45660 9.5228 -4.0112 -3.59440 0
4 0.32924 -4.4552 4.5718 -0.98880 0

The four features come from the images (see UCI database for details), and they are like a fingerprint of each
image. Another way to look at it is to say that feature engineering has already been done and we have now
four numbers representing the relevant properties of each image. The class column indicates if a banknote
is true or fake, with 0 indicating true and 1 indicating fake.

Let’s see how many banknotes we have in each class:

In [29]: df['class'].value_counts()

Out[29]:

class
0 762
1 610

We can also calculate the fraction of the larger class by dividing the first row by the total number of rows:

In [30]: df['class'].value_counts()[0]/len(df)

Out[30]: 0.5553935860058309

The larger class amounts to 55% of the total, so if we build a model it needs to have an accuracy higher
than 55% to be useful.

Let’s use seaborn.pairplot for a quick visual inspection of the data. First, we load the library:

In [31]: import seaborn as sns



Then we plot the whole dataset using a pairplot as we did for the Iris flower dataset in the previous
chapters. This plot allows us to look at how pairs of features are correlated, as well as how each feature
correlates with the labels. Also, it displays the histogram of each feature along the diagonal, and we can use
the hue parameter to color the data using the labels. Pretty nice!

In [32]: sns.pairplot(df, hue="class");

Pairplot of the banknote features (variance, skewness, kurtosis, entropy, class), colored by class

We can see from the plot that the two sets of banknotes seem quite well separable. In other words, the orange
and the blue scatters are not completely overlapping. This leads us to think that we will manage to build a
good classifier and bust the counterfeiters.

Let’s start by building a reference model using Scikit-Learn. As we have seen in Chapter 3,
Scikit-Learn is a great Machine Learning library for Python. It implements many classical algorithms like
Decision Trees, Support Vector Machines, Random Forest and more. It also has many preprocessing and
model evaluation routines, so we strongly encourage you to learn to use it well.

For this Chapter, we would like a model that trains fast, that does not require too much pre-processing and
feature engineering, and that is known to give good results.

Luckily for us, such a model exists, and it’s called Random Forest.

Random Forest

Random Forest is an ensemble learning method for classification and regression that operates by
constructing a multitude of decision trees at training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the individual trees. You can think of it as a
Decision Tree on steroids!

Scikit Learn provides RandomForestClassifier ready to use in the sklearn.ensemble module.

For this Chapter it is not fundamental that you understand the internals of how the Random Forest classifier
works. The point here is that it is a model that works quite well and so we will use it for comparison.

Let’s start by loading it:

In [33]: from sklearn.ensemble import RandomForestClassifier

and let’s create an instance of the model with default parameters:

In [34]: model = RandomForestClassifier()

Now let’s separate the features from labels as usual:

In [35]: X = df.drop('class', axis=1).values


y = df['class'].values

and we are ready to train the model. In order to be quick and effective in judging the performance of our
model we will use a 3-fold cross validation as done many times in Chapter 3. First we load the
cross_val_score function:

In [36]: from sklearn.model_selection import cross_val_score

And then we run it with the model, features and labels as arguments. This function will return 3 values for
the test accuracy, one for each of the 3 folds.

In [37]: cross_val_score(model, X, y)
Out[37]: array([0.99781659, 0.99343545, 0.99343545])

The Random Forest model seems to work really well on this dataset. We obtain an accuracy score higher
than 99% with a 3-fold cross-validation. This is really good, and it also shows how in some cases
traditional ML methods are very fast and effective solutions.

We can also get the score on a fixed train/test split in order to compare it later with a Neural Network
based model.

In [38]: from sklearn.model_selection import train_test_split

Let’s split up our data using the train_test_split function:

In [39]: X_train, X_test, y_train, y_test = \


train_test_split(X, y, test_size=0.3,
random_state=42)

Let’s train our model and check the accuracy score now:

In [40]: model.fit(X_train, y_train)


model.score(X_test, y_test)

Out[40]: 0.9951456310679612

The accuracy on the test set is still very high.

Logistic Regression Model

Let’s build a Logistic Regression model in Keras and train it. As we have seen, the parameters of the model
are updated using the gradient calculated from the cost function evaluated on the training data.

$$\frac{d\,J(y, \hat{y}(w, X))}{dw}$$

X and y here indicate a pair of training features and labels.

In principle, we could feed the training data one point at a time. For each pair of features and label, we calculate
the cost and the gradient and update the weights accordingly. This procedure is called Stochastic Gradient
Descent (also SGD). Once our model has seen each training data point once, we say that an Epoch has
completed, and we start again from the first training pair with the following epoch. Let's manually run one
epoch on this simple model.

Then let’s create a model as we have done in the previous chapters.

Since this is a Logistic Regression, we will only have one Dense layer, with an output of 1 and a sigmoid
activation function. By now you should be very familiar with all this, but in case you have doubts you may
go back to Chapter 4 where we explained Dense layers in more detail.

Let’s start with a few imports:

In [41]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense, Activation

and then let's define the model. We will initialize the weights to one this time, using the
kernel_initializer parameter. It's not a good initialization, but it will guarantee that we all get the same
results, without any artifacts due to random initialization:

In [42]: model = Sequential()


model.add(Dense(1, kernel_initializer='ones',
input_shape=(4,), activation='sigmoid'))

Then we compile the model as usual. Notice that, since we only have one output node with a sigmoid
activation, we will have to use the binary_crossentropy loss, also introduced in Chapter 4.

TIP: As a reminder, binary crossentropy has the formula:

J( ŷ, y) = − (y log( ŷ) + (1 − y) log(1 − ŷ)) (5.63)

and it can be implemented in code as:

def binary_crossentropy(y, y_hat):


if y == 1:
return - np.log(y_hat)
else:
return - np.log(1 - y_hat)

We compile the model using the sgd optimizer, which stands for Stochastic Gradient Descent. We will
discuss this optimizer along with other more powerful ones later in this chapter, so stay tuned.

Finally, we will compile the model requesting to calculate the accuracy metric at each iteration.

In [43]: model.compile(optimizer='sgd',
loss='binary_crossentropy',
metrics=['accuracy'])

Finally, we save the initial weights so that we can always reset the model to this starting point.

In [44]: weights = model.get_weights()

The method .train_on_batch performs a single gradient update over one batch of samples, so we can use
it to train the model on a single data point at a time and then visualize how the loss changes at each point.

We usually train models one batch at a time, passing several points at once and calculating the average
gradient correction. The next plot will make it very clear why.

Let’s train the model one point at a time first, for one epoch, i.e., passing all of the training data once:

In [45]: losses = []
idx = range(len(X_train) - 1)
for i in idx:
loss, _ = model.train_on_batch(X_train[i:i+1],
y_train[i:i+1])
losses.append(loss)

Let’s plot the losses we have just calculated. As you will see the value of the loss changes greatly from one
update to the next:

In [46]: plt.plot(losses)
plt.title('Binary Crossentropy Loss, One Epoch')
plt.xlabel('Data point index')
plt.ylabel('Loss');

Plot of the Binary Crossentropy loss over one epoch, as a function of the data point index

As you can see in the plot, passing one data point at a time results in a very noisy estimation of the gradient.
We can improve the estimation of the gradient by averaging the gradients over a few points contained in a
mini-batch.

Common choices for the mini-batch size are 16, 32, 64, 128, 256, and so on (generally powers of 2). With
mini-batch gradient descent, we do N/B weight updates per epoch, with N equal to the number of points
in the training set and B equal to the number of points in a mini-batch.
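For example, sticking with the training set defined above and a mini-batch of 16 points, a quick back-of-the-envelope check of the number of updates per epoch could look like this:

B = 16                          # mini-batch size
N = len(X_train)                # number of training points
updates_per_epoch = N // B      # roughly N/B weight updates per epoch
print(N, B, updates_per_epoch)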

Let's reset the model weights to their initial values:

In [47]: model.set_weights(weights)

Now let’s train the model with batches of 16 points each:

In [48]: B = 16

batch_idx = np.arange(0, len(X_train) - B, B)


batch_losses = []
for i in batch_idx:
loss, _ = model.train_on_batch(X_train[i:i+B],
y_train[i:i+B])
batch_losses.append(loss)

Now let’s plot the losses calculated with mini-batch gradient descent over the losses calculated at each point.
As you will see, the loss decreases in a much smoother fashion:

In [49]: plt.plot(idx, losses)


plt.plot(batch_idx + B, batch_losses)
plt.title('Binary Crossentropy Loss, One Epoch')
plt.xlabel('Gradient update')
plt.ylabel('Loss')
plt.legend(['Single Point Updates',
'Batch of 16 points Updates']);

Plot of the Binary Crossentropy loss over one epoch: single point updates vs. batches of 16 points

The mini-batch method is what tensorflow.keras automatically does for us when we invoke the .fit
method. When we run model.fit we can specify the number of epochs and the batch_size, as we
have done many times:

In [50]: model.set_weights(weights)

history = model.fit(X_train, y_train, batch_size=16,


epochs=20, verbose=0)

Now that we've trained the model, we can evaluate its performance on the test set using the
model.evaluate method. This is somewhat equivalent to the model.score method in Scikit-Learn. It
returns a list with the loss, followed by all the other metrics we requested when we executed model.compile.

In [51]: result = model.evaluate(X_test, y_test)


"Test accuracy: {:0.2f} %".format(result[1]*100)

412/412 [==============================] - 0s 184us/sample - loss: 0.6370 -


accuracy: 0.7160

Out[51]: 'Test accuracy: 71.60 %'

With 20 epochs of training the logistic regression model does not perform as well as the Random Forest
model yet. Let’s see how we can improve it. One direction that we can explore to improve a model is to tune
the hyperparameters. We will start from the most obvious one, which is the Learning Rate.

Learning Rates

Let’s explore what happens to the performance of our model if we change the learning rate. We can do this
with a simple loop where we perform the following steps:

1. We recompile the model with a different learning rate.


2. We reset the weights to the initial value.
3. We retrain the model and append the results to a list.

In [52]: from tensorflow.keras.optimizers import SGD

In [53]: dflist = []

learning_rates = [0.01, 0.05, 0.1, 0.5]

for lr in learning_rates:

model.compile(loss='binary_crossentropy',
optimizer=SGD(lr=lr),
metrics=['accuracy'])

model.set_weights(weights)
h = model.fit(X_train, y_train, batch_size=16,


epochs=10, verbose=0)

dflist.append(pd.DataFrame(h.history,
index=h.epoch))
print("Done: {}".format(lr))

Done: 0.01
Done: 0.05
Done: 0.1
Done: 0.5

We can concatenate all our results into a single DataFrame for easy visualization using the pd.concat function along
the columns axis.

In [54]: historydf = pd.concat(dflist, axis=1)

In [55]: historydf

Out[55]:

loss accuracy loss accuracy loss accuracy loss accuracy


0 2.639031 0.251042 0.981409 0.680208 0.593887 0.808333 0.379747 0.921875
1 0.979313 0.540625 0.174453 0.939583 0.115598 0.967708 0.052042 0.984375
2 0.480628 0.802083 0.124538 0.970833 0.086764 0.976042 0.044732 0.982292
3 0.327329 0.884375 0.103279 0.973958 0.073354 0.979167 0.041670 0.984375
4 0.257137 0.912500 0.090302 0.976042 0.064261 0.984375 0.038946 0.985417
5 0.216571 0.929167 0.081890 0.976042 0.058625 0.983333 0.034574 0.989583
6 0.189788 0.937500 0.074120 0.977083 0.054121 0.984375 0.030930 0.988542
7 0.170942 0.944792 0.070225 0.981250 0.050634 0.986458 0.032576 0.986458
8 0.156777 0.952083 0.066026 0.983333 0.047301 0.984375 0.032527 0.986458
9 0.145978 0.955208 0.061782 0.983333 0.045516 0.985417 0.027556 0.990625

And we can add information about the learning rate in a secondary column index using the
pd.MultiIndex class.

In [56]: metrics_reported = dflist[0].columns


idx = pd.MultiIndex.from_product([learning_rates,
metrics_reported],
names=['learning_rate',
'metric'])

historydf.columns = idx

In [57]: historydf

Out[57]:

learning_rate 0.01 0.05 0.10 0.50


metric loss accuracy loss accuracy loss accuracy loss accuracy
0 2.639031 0.251042 0.981409 0.680208 0.593887 0.808333 0.379747 0.921875
1 0.979313 0.540625 0.174453 0.939583 0.115598 0.967708 0.052042 0.984375
2 0.480628 0.802083 0.124538 0.970833 0.086764 0.976042 0.044732 0.982292
3 0.327329 0.884375 0.103279 0.973958 0.073354 0.979167 0.041670 0.984375
4 0.257137 0.912500 0.090302 0.976042 0.064261 0.984375 0.038946 0.985417
5 0.216571 0.929167 0.081890 0.976042 0.058625 0.983333 0.034574 0.989583
6 0.189788 0.937500 0.074120 0.977083 0.054121 0.984375 0.030930 0.988542
7 0.170942 0.944792 0.070225 0.981250 0.050634 0.986458 0.032576 0.986458
8 0.156777 0.952083 0.066026 0.983333 0.047301 0.984375 0.032527 0.986458
9 0.145978 0.955208 0.061782 0.983333 0.045516 0.985417 0.027556 0.990625

Now we can display the behavior of loss and accuracy as a function of the learning rate.

In [58]: ax = plt.subplot(211)
hxs = historydf.xs('loss', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Loss")

ax = plt.subplot(212)
hxs = historydf.xs('accuracy', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Accuracy")
plt.xlabel("Epochs")

plt.tight_layout();

Plots of the loss (top) and accuracy (bottom) as a function of epochs for learning rates 0.01, 0.05, 0.1 and 0.5

As expected a small learning rate gives a much slower decrease in the loss. Another hyperparameter we can
try to tune is the Batch Size. Let’s see how changing batch size affects the convergence of the model.

Batch Sizes

Let’s loop over increasing batch sizes from a single point up to 128.

In [59]: dflist = []

batch_sizes = [1, 8, 32, 128]

model.compile(loss='binary_crossentropy',
optimizer='sgd',
metrics=['accuracy'])

for batch_size in batch_sizes:


model.set_weights(weights)

h = model.fit(X_train, y_train,
batch_size=batch_size,
verbose=0, epochs=20)

dflist.append(pd.DataFrame(h.history,
index=h.epoch))
print("Done: {}".format(batch_size))

Done: 1
Done: 8
Done: 32
Done: 128

Like we did above we can arrange the results in a Pandas Dataframe for easy display. Notice how we are
using the pd.MultiIndex.from_product function to create a multi-index for the columns so that the
data is organized by batch size and by metric.

In [60]: historydf = pd.concat(dflist, axis=1)


metrics_reported = dflist[0].columns
idx = pd.MultiIndex.from_product([batch_sizes,
metrics_reported],
names=['batch_size',
'metric'])
historydf.columns = idx

In [61]: historydf

Out[61]:

batch_size 1 8 32 128
metric loss accuracy loss accuracy loss accuracy loss accuracy
0 2.069690 0.348958 3.911661 0.123958 4.329616 0.084375 4.450428 0.076042
1 0.525396 0.780208 3.035601 0.201042 4.026808 0.118750 4.362427 0.085417
2 0.289708 0.897917 2.453718 0.279167 3.757314 0.140625 4.277809 0.089583
3 0.214022 0.929167 2.018515 0.315625 3.518339 0.154167 4.196561 0.098958
4 0.176570 0.942708 1.660948 0.370833 3.303941 0.173958 4.115206 0.107292
5 0.153161 0.952083 1.361185 0.412500 3.111429 0.188542 4.036785 0.114583
6 0.138158 0.963542 1.115791 0.489583 2.937969 0.210417 3.961509 0.123958
7 0.126656 0.967708 0.923197 0.558333 2.782095 0.232292 3.888864 0.131250
8 0.117760 0.970833 0.777323 0.628125 2.640397 0.257292 3.820208 0.136458
9 0.111091 0.975000 0.667592 0.697917 2.510392 0.276042 3.751939 0.143750
10 0.105214 0.976042 0.584428 0.734375 2.389356 0.282292 3.685020 0.151042
11 0.100218 0.980208 0.520239 0.775000 2.275690 0.296875 3.620519 0.153125
12 0.095988 0.977083 0.469610 0.810417 2.168297 0.300000 3.558348 0.152083
13 0.092290 0.973958 0.429039 0.826042 2.066296 0.309375 3.497187 0.155208
14 0.088986 0.975000 0.395580 0.839583 1.968758 0.321875 3.438424 0.157292
15 0.086015 0.975000 0.367914 0.863542 1.875437 0.329167 3.380392 0.162500
16 0.083265 0.976042 0.344518 0.881250 1.786184 0.343750 3.325220 0.167708
17 0.080949 0.977083 0.324450 0.889583 1.700660 0.355208 3.272031 0.171875
18 0.078603 0.977083 0.307105 0.894792 1.618558 0.364583 3.219164 0.177083
19 0.076666 0.977083 0.291932 0.898958 1.539793 0.379167 3.167959 0.185417

In [62]: ax = plt.subplot(211)
hxs = historydf.xs('loss', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Loss")

ax = plt.subplot(212)
hxs = historydf.xs('accuracy', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Accuracy")
plt.xlabel("Epochs")

plt.tight_layout();

Plots of the loss (top) and accuracy (bottom) as a function of epochs for batch sizes 1, 8, 32 and 128

Smaller batches allow for more updates in a single epoch; on the other hand, they take much longer to run a
single epoch. So there's a trade-off between speed of training (measured as the number of gradient updates)
and speed of convergence (measured as the number of epochs). In practice, a batch size of 16 or 32 data points is
often used.

A recent research article suggests starting with a small batch size and then increasing it gradually. We
encourage you to experiment with that strategy as well.

Optimizers
The optimizer is the algorithm used internally by Keras to update the weights and move the model towards
lower values of the cost function. Keras implements several optimizers that go by fancy names like SGD,
Adam, RMSProp and many more. Despite these clever sounding names, the optimizers are all variations of
the same concept, which is the Stochastic Gradient Descent or SGD.

SGD is so fundamental that we have invented an acronym to help you remember it. If you find it hard to
remember Stochastic Gradient Descent, think Simply Go Down, which is what SGD does!

TIP: In the next pages you will find some mathematical symbols when we explain the
algorithms. We highlighted the algorithms' pseudo-code parts with a blue box like this:
Here’s the algorithm
Feel free to skim through these if maths is not your favorite thing; you'll find a practical
comparison of optimizers just after this section.

Stochastic Gradient Descent (or Simply Go Down) and its variations

Let’s begin our discovery of optimizers with a review of the SGD algorithm. SGD only needs one
hyper-parameter: the learning rate. Once we know the learning rate, we proceed in a loop by:

1. Sampling a minibatch from the training set.


2. Computing the gradients.
3. Updating the weights by subtracting the gradient times the learning rate.

Using a bit more formal language, we can write SGD as:

SGD

• Choose an initial vector of parameters w and a learning rate η
• Repeat until stop rule:
  – Extract a random batch from the training set, with corresponding training labels
  – Evaluate the average cost function J(y, ŷ) using the points in the batch
  – Evaluate the gradient g = ∇w J(w) using the points in the batch and the current value of the parameters w
  – Apply the update rule: w → w − ηg

The stopping rule could be a fixed number of updates or epochs, as well as a condition on the amount of change
in the cost function. For example, we could decide to stop the training loop if the value of the cost is not
changing much anymore.
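As a sketch of what this loop looks like in code, here is a plain numpy version. The gradient function is a hypothetical placeholder standing in for whatever model we are training; it is not a Keras function:

import numpy as np

def sgd(w, X, y, gradient, lr=0.01, batch_size=16, epochs=10):
    N = len(X)
    for epoch in range(epochs):
        idx = np.random.permutation(N)             # shuffle the training set
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]  # extract a random mini-batch
            g = gradient(w, X[batch], y[batch])    # gradient on the mini-batch
            w = w - lr * g                         # update rule: w -> w - eta * g
    return w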

Momentum

In recent years, several improvements have been proposed to this formula.

A first improvement of the SGD is to add momentum. Momentum means that we accumulate the gradient
corrections in a variable v called velocity, that serves as a smoothed version of the gradient.

• Like SGD, choose an initial vector of parameters w, a learning rate η and a momentum parameter µ
• Repeat until stop rule:
  – Same 3 steps as SGD (get batch, evaluate cost, evaluate gradient)
  – Accumulate gradients into velocity: v = µv − ηg
  – Apply the update rule: w → w + v

Applying momentum is like saying: if you are going down in a direction, then you should keep going more
or less in that direction minus a small correction given by the new gradients. It’s as if instead of walking
downhill, we would roll down like a ball. The name comes from physics, in case you’re curious.
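A sketch of a single momentum update, under the same assumptions as the SGD sketch above (g is the mini-batch gradient as a numpy array):

def sgd_momentum_step(w, v, g, lr=0.01, mu=0.9):
    v = mu * v - lr * g   # accumulate gradients into the velocity
    w = w + v             # apply the update
    return w, v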

AdaGrad

SGD and SGD + momentum keep the learning rate constant for each parameter. This method can be
problematic if the parameters are sparse (i.e., most of them are zero except a few ones).

An adaptive algorithm, like AdaGrad, overcomes this problem by accumulating the square of the gradient
into a normalization variable for each of the parameters. The result of this is that each parameter will have a
personalized learning rate. Parameters whose gradient is large will have a learning rate that decreases fast,
while parameters that have small gradients will have a large learning rate.

This modification makes the loss converge faster than pure SGD.

• Like SGD, choose an initial vector of parameters w, a learning rate η, and a small constant δ = 10⁻⁷ to avoid division by 0
• Repeat until stop rule:
  – Same 3 steps as SGD (get batch, evaluate cost, evaluate gradient)
  – Accumulate the square of the gradient: r → r + g ⊙ g
  – Compute the update: Δw = η/(δ + √r) ⊙ g
  – Apply the update rule: w → w − Δw

Let's break down the above equation for the update so that we understand it fully. Both the accumulation
step and the update step are computed element by element, so that we can focus on a single parameter.

• For a single parameter wᵢ, g ⊙ g is equivalent to gᵢ², so we are accumulating the square of the gradient in a variable rᵢ for each parameter.
• The term η/(δ + √r) ⊙ g may look a bit daunting at first, so let's break it down. η is the learning rate, no surprises here. For a single parameter wᵢ we are dividing the value of the gradient gᵢ by the square root of the accumulated square gradients rᵢ. If the gradients are large, we will be dividing by a large quantity. On the other hand, if the gradients are small, we will be dividing by a small quantity. This yields a practically constant update step size, multiplied by the learning rate. The δ in the denominator is a numerical regularization constant so that we do not risk dividing by zero if r becomes too small.
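A sketch of a single AdaGrad step, with the same assumptions as the previous sketches (g is the mini-batch gradient and r is the per-parameter accumulator, both numpy arrays):

import numpy as np

def adagrad_step(w, r, g, lr=0.01, delta=1e-7):
    r = r + g * g                         # accumulate the square of the gradient
    dw = lr * g / (delta + np.sqrt(r))    # per-parameter scaled update
    w = w - dw
    return w, r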

RMSProp: Root Mean Square Propagation (or Adagrad with EWMA)

RMSProp is also adaptive, but it allows us to choose the fraction of squared gradients to accumulate, using an
Exponentially Weighted Moving Average (or EWMA) decay in the accumulation formula. If you're not
familiar with how EWMA works, we strongly encourage you to review the Appendix. EWMA is the most
important algorithm of your life!

• Like SGD, choose an initial vector of parameters w, a learning rate η, a small constant δ = 10⁻⁷ to avoid division by zero, and an EWMA mixing factor ρ between 0 and 1 (this is also called the decay rate)
• Repeat until stop rule:
  – Same 3 steps as SGD (get batch, evaluate cost, evaluate gradient)
  – Accumulate EWMA of the square of the gradient: r → ρr + (1 − ρ)g ⊙ g
  – Same update rules as AdaGrad
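A sketch of a single RMSProp step, same assumptions as above:

import numpy as np

def rmsprop_step(w, r, g, lr=0.01, rho=0.9, delta=1e-7):
    r = rho * r + (1.0 - rho) * g * g     # EWMA of the squared gradient
    dw = lr * g / (delta + np.sqrt(r))    # same update rule as AdaGrad
    w = w - dw
    return w, r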

Adam: Adaptive Moment Estimation (or EWMA everywhere)

Finally, let’s introduce Adam. This algorithm improves upon RMSProp by applying EWMA to the gradient
update as well as the square of the gradient.

• Like SGD, choose an initial vector of parameters w, a learning rate η, a small constant δ = 10⁻⁷ to avoid division by zero, and two EWMA mixing factors ρ1 and ρ2 between 0 and 1 (usually chosen as 0.9 and 0.999 respectively)
• Repeat until stop rule:
  – Same 3 steps as SGD (get batch, evaluate cost, evaluate gradient)
  – Accumulate EWMA of the gradient: v → ρ1 v + (1 − ρ1)g
  – Accumulate EWMA of the square of the gradient: r → ρ2 r + (1 − ρ2)g ⊙ g
  – Correct bias 1: v̂ = v / (1 − ρ1^t)
  – Correct bias 2: r̂ = r / (1 − ρ2^t)
  – Compute the update: Δw = η v̂ / (δ + √r̂)
  – Apply the update rule: w → w − Δw

This formula may also appear to be a bit complicated, so let’s walk through it step by step.

• We apply EWMA to both the gradient and its square. We take inspiration from both the momentum and the RMSProp formulas.
• The only other novelty is the bias correction. We take the current value of the accumulated quantity and divide it by (1 − ρ^t). Since both decay rates are almost 1, the normalization is very small initially, and it increases as time goes by. This seems to work well in practice.
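A sketch of a single Adam step, same assumptions as above; t is the time step (starting from 1), needed for the bias corrections:

import numpy as np

def adam_step(w, v, r, g, t, lr=0.01, rho1=0.9, rho2=0.999, delta=1e-7):
    v = rho1 * v + (1.0 - rho1) * g        # EWMA of the gradient
    r = rho2 * r + (1.0 - rho2) * g * g    # EWMA of the squared gradient
    v_hat = v / (1.0 - rho1 ** t)          # bias correction 1
    r_hat = r / (1.0 - rho2 ** t)          # bias correction 2
    dw = lr * v_hat / (delta + np.sqrt(r_hat))
    w = w - dw
    return w, v, r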

In summary, we have seen a few of the most popular optimization algorithms. You are probably wondering
how to choose the best one. Unfortunately, there is no best one, and each of them performs better in some
conditions. What is true, though, is that a good choice of the hyperparameters is key for an algorithm to
perform well, and we encourage you to familiarize yourself with one algorithm and understand the effects of
changing its hyperparameters.

Let's compare the performance of a few optimizers in Keras. Optimizers are available in the
keras.optimizers module, so let's start by importing them:

In [63]: from tensorflow.keras.optimizers import SGD, Adam, Adagrad, RMSprop

We then set the learning rate to be the same for each of them and run the training for five epochs each:

In [64]: dflist = []

opts = ['SGD(lr=0.01)',
'SGD(lr=0.01, momentum=0.3)',
'SGD(lr=0.01, momentum=0.3, nesterov=True)',
'Adam(lr=0.01)',
'Adagrad(lr=0.01)',
'RMSprop(lr=0.01)']

for opt_name in opts:


model.compile(loss='binary_crossentropy',
optimizer=eval(opt_name),
metrics=['accuracy'])

model.set_weights(weights)

h = model.fit(X_train, y_train, batch_size=16,


epochs=5, verbose=0)

dflist.append(pd.DataFrame(h.history,
index=h.epoch))
print("Done: ", opt_name)

Done: SGD(lr=0.01)
Done: SGD(lr=0.01, momentum=0.3)
Done: SGD(lr=0.01, momentum=0.3, nesterov=True)
Done: Adam(lr=0.01)
Done: Adagrad(lr=0.01)
Done: RMSprop(lr=0.01)

We can aggregate the results as we did previously:

In [65]: historydf = pd.concat(dflist, axis=1)


metrics_ = dflist[0].columns
idx = pd.MultiIndex.from_product([opts, metrics_],


names=['optimizers',
'metric'])
historydf.columns = idx

and plot them for comparison:

In [66]: plt.figure(figsize=(15, 6))

ax = plt.subplot(121)
hxs = historydf.xs('loss', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Loss")

ax = plt.subplot(122)
hxs = historydf.xs('accuracy', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Accuracy")
plt.xlabel("Epochs")

plt.tight_layout();

Plots of the loss (left) and accuracy (right) as a function of epochs for SGD (with and without momentum and Nesterov), Adam, Adagrad and RMSprop, all with lr=0.01

As you can see, in this particular case, some optimizers converge a lot faster than others. This could be due
to the particular combination of hyper-parameters chosen as well as to their better performance on this
particular problem. We encourage you to try out different optimizers on your problems, as well as trying
different hyper-parameter combinations.

Initialization
So far we have explored the effect of learning rate, batch size, and optimizers on the speed of convergence of
a model. We have compared their effect starting from the same set of initial weights. What if
we initialized the weights in a different way and kept everything else fixed? This may seem unimportant, but
it turns out that the initialization is critical. A model could fail to converge at all for one initialization and
converge quickly for another. While we don't understand this fully, we have a few heuristic
strategies available that we can test, looking for the best one for our specific problem.

keras offers the possibility to initialize the weights in several ways including:

• Zeros, ones, constant: all weights initialized to zero, to one or a fixed value. Generally, these are not
good choices, because they leave the model uncertain on which parameters to optimize first.

Initialization strategies try to “break the symmetry” by assigning random values to the parameters. The
range and type of random distribution can vary, and several initialization schemes are available:

• Random uniform: each weight receives a random value between 0 and 1, chosen with uniform probability.
• Lecun_uniform: like the above, but the values are drawn in the interval [-limit, limit], where limit is √(3 / #inputs) and #inputs indicates the number of inputs in the weight tensor for a specific layer.
• Normal: each weight receives a random value drawn from a normal distribution with mean 0 and standard deviation of 1.
• He_normal: like the previous one, but with standard deviation σ = √(2 / #in).
• Glorot_normal: like the previous one, but with standard deviation σ = √(2 / (#in + #out)).

You can read more about them here. To see the effect of initialization, we’ll use a deeper network with more
than just five weights.

In [67]: import tensorflow.keras.backend as K

In [68]: dflist = []

inits = ['zeros', 'ones', 'uniform', 'lecun_uniform',


'normal', 'he_normal', 'glorot_normal']

for init in inits:

K.clear_session()

model = Sequential()
model.add(Dense(10, input_shape=(4,),
kernel_initializer=init,
activation='tanh'))
model.add(Dense(10, kernel_initializer=init,
activation='tanh'))
model.add(Dense(10, kernel_initializer=init,
activation='tanh'))
model.add(Dense(1, kernel_initializer=init,
activation='sigmoid'))

model.compile(loss='binary_crossentropy',
optimizer='sgd',
metrics=['accuracy'])

h = model.fit(X_train, y_train, batch_size=16,


epochs=10, verbose=0)

dflist.append(pd.DataFrame(h.history,
index=h.epoch))
print("Done: ", init)

Done: zeros
Done: ones
Done: uniform
Done: lecun_uniform
Done: normal
Done: he_normal
Done: glorot_normal

Let’s aggregate and plot the results

In [69]: historydf = pd.concat(dflist, axis=1)


metrics_ = dflist[0].columns
idx = pd.MultiIndex.from_product([inits, metrics_],
names=['initializers',
'metric'])

historydf.columns = idx

In [70]: styles = ['-+', '-*', '-x', '-d', '-^', '-o', '-s']

plt.figure(figsize=(15, 5))

ax = plt.subplot(121)
xs = historydf.xs('loss', axis=1, level='metric')
xs.plot(ylim=(0,1), ax=ax, style=styles)
plt.title("Loss")

ax = plt.subplot(122)
xs = historydf.xs('accuracy', axis=1, level='metric')
xs.plot(ylim=(0,1), ax=ax, style=styles)


plt.title("Accuracy")
plt.xlabel("Epochs")

plt.tight_layout();

Plots of the loss (left) and accuracy (right) as a function of epochs for the different initializers (zeros, ones, uniform, lecun_uniform, normal, he_normal, glorot_normal)

As you can see some initializations don’t even converge, while some do converge rather quickly.
Initialization of the weights plays a significant role in large models, so it is important to try a couple of
different initialization schemes to get the best results.

Inner layer representation


We conclude this dense chapter on how to train a Neural Network with a little treat. As mentioned
previously, a Neural Network can be viewed as a general function between any input and any output. This is
also true for any of the intermediate layers. Each layer learns a nonlinear transformation between its inputs
and its outputs, and we can pull out the values at the output of any layer. This gives us a way to see how
our network is learning to separate our data. Let's see how it happens. First of all, we will re-train a network
with two layers, the first with two nodes and the second with just one output node.

Let’s clear the backend session first:

In [71]: K.clear_session()

Then we define and compile the model:

In [72]: model = Sequential()


model.add(Dense(2, input_shape=(4,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
optimizer=RMSprop(lr=0.01),
metrics=['accuracy'])

We then set the model weights to some random values. In order to get reproducible results, the random values
are given explicitly for this particular run:

In [73]: weights = [np.array([[-0.26285839, 0.82659411],


[ 0.65099144, -0.7858932 ],
[ 0.40144777, -0.92449236],
[ 0.87284446, -0.59128475]]),
np.array([ 0., 0.]),
np.array([[-0.7150408 ], [ 0.54277754]]),
np.array([ 0.])]

model.set_weights(weights)

And then we train the model

In [74]: h = model.fit(X_train, y_train,


batch_size=16, epochs=20,
verbose=0, validation_split=0.3)

Let’s look at the layers using the model.summary function:

In [75]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 2) 10
_________________________________________________________________
dense_1 (Dense) (None, 1) 3
=================================================================
Total params: 13
Trainable params: 13
Non-trainable params: 0
_________________________________________________________________

The model only has 2 dense layers: one connecting the input to the 2 inner nodes and one connecting these
2 inner nodes to the output. The list of layers is accessible as an attribute of the model:

In [76]: model.layers

Out[76]: [<tensorflow.python.keras.layers.core.Dense at 0x7f4a983b8c50>,


<tensorflow.python.keras.layers.core.Dense at 0x7f4a983afa58>]

and the inputs and outputs of each layer are also accessible as attributes. Let's take the input and the output
of the first layer, the one with 2 nodes:

In [77]: inp = model.layers[0].input


out = model.layers[0].output

These variables refer to objects from the Keras backend. This is TensorFlow by default, but it can be switched to
other backends if needed.

In [78]: inp

Out[78]: <tf.Tensor 'dense_input_8:0' shape=(None, 4) dtype=float32>

In [79]: out

Out[79]: <tf.Tensor 'dense_10/Relu:0' shape=(None, 2) dtype=float32>

Both the input and the output are Tensorflow tensors. In the next chapter we will learn more about Tensors,
so don’t worry about them for now.

keras allows us to define a function between any tensors in a model as follows:

In [80]: features_function = K.function([inp], [out])

Notice that features_function is a function itself, so K.function is a function that returns a function.

In [81]: features_function

Out[81]: <tensorflow.python.keras.backend.EagerExecutionFunction at 0x7f4ad8286b38>

We can apply this function to the test data. Notice that the function expects a list of inputs and returns a list
of outputs. Since our input list only has one element, so will the output list, and we can extract the output
by taking the first element:

In [82]: features = features_function([X_test])[0]

The output tensor contains as many points as X_test each represented by 2 numbers, the output values of
the 2 nodes in the first layer:

In [83]: features.shape

Out[83]: (412, 2)

We can plot the data as a scatter plot, and we can see how the network has learned to represent the data in 2
dimensions in such a way that the next layer can separate the 2 classes more easily:

In [84]: plt.scatter(features[:, 0], features[:, 1], c=y_test, cmap='coolwarm');

[Figure: scatter plot of the two features extracted by the first layer for the test set, colored by class.]

Let’s plot the output of the second-to-last layer at each epoch in a training loop. First we re-initialize the
model:

In [85]: model.set_weights(weights)

Then we create a K.function between the input and the output of layer 0:

In [86]: inp = model.layers[0].input


out = model.layers[0].output
features_function = K.function([inp], [out])

Then we train the model one epoch at a time, plotting the 2D representation of the data as it comes out from
layer 0:

In [87]: plt.figure(figsize=(15,10))

for i in range(1, 26):
    plt.subplot(5, 5, i)
    h = model.fit(X_train, y_train, batch_size=16,
                  epochs=1, verbose=0)
    test_acc = model.evaluate(X_test, y_test,
                              verbose=0)[1]
    features = features_function([X_test])[0]
    plt.scatter(features[:, 0], features[:, 1],
                c=y_test, cmap='coolwarm', marker='.')
    plt.xlim(-0.5, 15)
    plt.ylim(-0.5, 15)

    acc_ = test_acc * 100.0
    t = 'Epoch: {}, Test Acc: {:3.1f} %'.format(i, acc_)
    plt.title(t, fontsize=11)

plt.tight_layout();

[Figure: a 5x5 grid of scatter plots showing the 2-feature representation of the test set after each of the 25 training epochs; the test accuracy grows from 87.1% at epoch 1 to around 99-100% in later epochs.]

As you can see, at the beginning the network has no notion of the difference between the two classes. As the
training progresses, the network learns to represent the data in a 2 dimensional space where the 2 classes are
linearly separable, so that the final layer (which is basically a logistic regression) can easily separate them
with a straight line.

This chapter was surely more intense and theoretical than the previous ones, but we hope it gave you a
thorough understanding of how a Neural Network works internally and what you can do to improve its
performance.

Exercises

Exercise 1

You've just started to work at a wine company, and they would like you to help them build a model that
predicts the quality of their wine based on several measurements. They give you a dataset with wine
measurements:

• load the ../data/wines.csv into Pandas
• use the column called “Class” as the target
• check how many classes there are in the target, and if necessary use dummy columns for a Multiclass
classification
• use all the other columns as features, check their range and distribution (using seaborn pairplot)
• rescale all the features using either MinMaxScaler or StandardScaler
• build a deep model with at least one hidden layer to classify the data
• choose the cost function. What will you use? Mean Squared Error? Binary Cross-Entropy?
Categorical Cross-Entropy?
• choose an optimizer
• choose a value for the learning rate. You may want to try with several values
• choose a batch size
• train your model on all the data using a validation_split=0.2. Can you converge to 100%
validation accuracy?
• what's the minimum number of epochs to converge?
• repeat the training several times to verify how stable your results are

In [ ]:

Exercise 2

Since this dataset has 13 features, we can only visualize pairs of features as we did in the pairplot. We could,
however, exploit the fact that a Neural Network is a function to extract two high-level features to represent
our data.

• build a deep fully connected network with the following structure:

– Layer 1: 8 nodes
– Layer 2: 5 nodes
– Layer 3: 2 nodes
– Output: 3 nodes

• choose activation functions, initializations, optimizer, and learning rate so that it converges to 100%
accuracy within 20 epochs (not easy)
• remember to train the model on the scaled data
• define a Feature Function as we did above between the input of the 1st layer and the output of the 3rd
layer
• calculate the features and plot them on a 2-dimensional scatter plot
• can we distinguish the three classes well?

In [ ]:

Exercise 3

Keras functional API. So far we’ve always used the Sequential model API in Keras. However, Keras also
offers a Functional API, which is much more powerful. You can find its documentation here. Let’s see how
we can leverage it.

• define an input layer called inputs


• define two hidden layers as before, one with eight nodes, one with five nodes
• define a second_to_last layer with 2 nodes
• define an output layer with three nodes
• create a model that connects input and output
• train it and make sure that it converges
• define a function between inputs and second_to_last layer
• recalculate the features and plot them

In [ ]:

Exercise 4

Keras offers the possibility to call a function at each epoch. These are Callbacks, and their documentation is
here. Callbacks allow us to add some neat functionality. In this exercise, we’ll explore a few of them.

• Split the data into train and test sets with a test_size = 0.3 and random_state=42
• Reset and recompile your model
• train the model on the train data using validation_data=(X_test, y_test)
• Use the EarlyStopping callback to stop your training if the val_loss doesn’t improve
• Use the ModelCheckpoint callback to save the trained model to disk once training is over
• Use the TensorBoard callback to output your training information to a /tmp/ subdirectory

You can use tensorboard in the notebook by running the following two commands:

%load_ext tensorboard.notebook

%tensorboard --logdir /tmp/ztdlbook/tensorboard/

You can also run tensorboard in a separate terminal with the command:

tensorboard --logdir /tmp/ztdlbook/tensorboard/

and then open another browser window at the address: http://localhost:6006.

In [ ]:
6 Convolutional Neural Networks

Intro
In the previous chapter we dove into Deep Learning, we built our first real model, and hopefully demystified
a lot of the complicated stuff. Now it’s time to start applying Deep Learning to a kind of data where it shines:
images!

At the root of it, what is an image? The relations between nearby pixels encode the information in an image.
A slightly darker or lighter image still contains the same information. Similarly, to recognize an object, its
exact position doesn’t matter. Convolutional Neural Networks (CNN), as we will discover in this chapter,
can encode much information about relations between nearby pixels. This makes them great tools to work
with images, as well as with sequences like sounds and movies.

In this section, we will learn what convolutions are, how they can be used to filter images, and how
Convolutional Neural Nets work. By the end of this section, we will train our first CNN to recognize
handwritten digits. We will also introduce the core concept of Tensor. Are you ready? Let’s go!

Machine Learning on images with pixels


Image classification or image recognition is the process of identifying the object contained in an image. The
classifier receives an image as input, and it returns a label indicating the object represented in the image.

Consider this image:

Humans quickly recognize a cat, whereas the computer sees a bunch of pixels and has no prior notion of
what a cat is, nor that a cat is in this image. It may seem magic that Neural Networks can solve the image
classification problem well, but we hope that, by the end of this chapter, how they do it will be quite clear!


Picture of a cat

To understand why it is so difficult for a computer to classify objects in images, let's start from how
computers represent images, and in particular, let's start with a black and white image.

A black and white image is a grid of points, each point with a binary value. We call these points pixels, and
in a black and white image, they only carry two possible values: 0 and 1.

Let’s create a random Black and White image with Python. As always, we start by importing the common
libraries. By now they should be familiar, but if you need a reminder, feel free to have a look at Chapter 1:

In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Let's use the np.random.binomial function to generate a 10x10 square matrix of random zeros and ones.
With a probability of success of 0.5, it will give us an approximately equal amount of zeros and ones.

TIP: according to the documentation, np.random.binomial draws samples from a binomial
distribution:
binomial(n, p, size=None) Draw samples from a binomial distribution.
where n (>= 0) is the number of trials and p (in the interval [0,1]) is the probability of success.

We will use the argument size=(10, 10) to specify that we want an array with 2 axes, each with 10
positions:

In [3]: bw = np.random.binomial(1, 0.5, size=(10, 10))

Let’s print out the array bw:

In [4]: bw

Out[4]: array([[1, 1, 0, 1, 0, 0, 1, 1, 0, 1],


[0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
[1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
[1, 0, 1, 1, 1, 0, 1, 1, 0, 1],
[1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
[1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
[1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
[0, 1, 1, 1, 0, 1, 1, 1, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 1, 1, 1]])

As promised, it’s a random set of zeros and ones. We can also use the function
matplotlib.pyplot.imshow to visualize it as an image. Let’s do it:

In [5]: plt.imshow(bw, cmap='gray')


plt.title("Black and White pixels");
[Figure: the 10x10 random array displayed as an image, titled "Black and White pixels".]

Awesome! We have just learned how to create a Black and White image with Python. Let’s now generate a
grayscale image.

To generate a grayscale image we allow the pixels to carry values that are intermediate between black and
white. Since we do not care about infinitely many shades of gray, we usually use unsigned integers with 8 bits, i.e.,
the numbers from 0 to 255.

A 10x10 grayscale image with 8-bit resolution is a grid of numbers, each of which is an integer between 0
and 255.

Let’s draw one such image. In this case, we will use the np.random.randint function, which generates
random integers uniformly distributed between a low and a high extreme. Here’s a snippet from the
documentation:

TIP: from the documentation of np.random.randint:


randint(low, high=None, size=None, dtype=‘l’)
Return random integers from low (inclusive) to high (exclusive). Low and high are the
lowest (signed) and largest integer to be drawn from the distribution.

In [6]: gs = np.random.randint(0, 256, size=(10, 10))

Let’s print out the array gs:

In [7]: gs

Out[7]: array([[219, 46, 218, 6, 223, 199, 210, 60, 152, 38],
[ 6, 179, 196, 47, 120, 204, 65, 213, 236, 89],
[ 50, 50, 172, 162, 197, 31, 121, 104, 127, 210],
[158, 154, 101, 38, 86, 168, 226, 129, 159, 204],
[166, 162, 98, 207, 124, 20, 128, 3, 82, 187],
[ 43, 13, 113, 165, 94, 247, 56, 124, 31, 126],
[ 34, 15, 37, 250, 64, 31, 7, 9, 121, 152],
[ 66, 8, 13, 173, 142, 154, 197, 185, 52, 73],
[190, 159, 220, 159, 123, 28, 61, 58, 134, 49],
[175, 203, 119, 127, 62, 6, 231, 213, 167, 150]])

As expected it’s a 10x10 grid of random integers between 0 and 255. Let’s visualize it as an image:

In [8]: plt.imshow(gs, cmap='gray')


plt.title("Grey pixels");
[Figure: the 10x10 random array displayed as an image, titled "Grey pixels".]

Wonderful! In image classification problems we have to think of images as the input to the algorithm;
therefore, this 2D array of 100 numbers corresponds to one data point in a classification task. How could
we train a Machine Learning algorithm on such data? Let's say we have many such gray-scale images
representing handwritten digits. How do we feed them to a Machine Learning model?

MNIST

The MNIST database is a very famous dataset of handwritten digits, and it has become a benchmark for
image recognition algorithms. It consists of 70000 images of 28 by 28 pixels, each representing a
handwritten digit.

TIP: Think of how many real-world applications involve recognition of handwritten digits:
zip codes, tax declarations, student tests, and more.

The target variables are the ten digits from 0 to 9.



Keras has its own built-in MNIST dataset, so we will load it from there using the load_data function:

In [9]: from tensorflow.keras.datasets import mnist

In [10]: (X_train, y_train), (X_test, y_test) = mnist.load_data()

Let’s check the shape of the arrays of the data we received for the training and test sets:

In [11]: X_train.shape

Out[11]: (60000, 28, 28)

In [12]: X_test.shape

Out[12]: (10000, 28, 28)

The loaded data is a numpy array of order 3. It’s like a 3-dimensional matrix, whose elements are identified
by 3 indices. We’ll discuss these more in detail later in this chapter.

For now, it is sufficient to know that the first index (running from 0 to 59999 for X_train) locates a specific
image in the dataset, while the other two indices determine a particular pixel in the picture, i.e., they run
from 0 to the height and width of the image.

For instance, we can select the first image in the training set and take a look at its shape by using the first
index:

In [13]: first_img = X_train[0]

This image is a 2D array of numbers between 0 and 255.

Let’s use plt.imshow once again to display the image:

In [14]: plt.imshow(first_img, cmap='gray');


[Figure: the first image in the MNIST training set, a handwritten 5.]

Notice that with the gray colormap, zeros are displayed as black pixels while higher numbers are displayed
as lighter pixels.

Pixels as features

How can we use this whole image as an input to a classification algorithm?

So far our input datasets have always been 2D tabular sets, where table columns refer to different features
and each data point occupies a row. In this case, each data point is itself a 2D table (an image), and so we
need to decide how to map it to features.

The simplest way to feed images to a Machine Learning algorithm is to use each pixel in the picture as an
individual feature. If we do this, we will have 28 × 28 = 784 independent features, each one being an integer
between 0 and 255, and our dataset will become tabular once again. Each row in the tabular dataset will
represent a different image, and each of the 784 columns will designate a specific pixel.

The reshape method of a numpy array allows us to reshape any array to a new shape. For example, let’s
reshape the training dataset to be a tabular dataset with 60000 rows and 784 columns:

In [15]: X_train_flat = X_train.reshape((60000, 784))

We can check that the operation worked by printing the shape of X_train_flat:

In [16]: X_train_flat.shape

Out[16]: (60000, 784)

Wonderful! Another valid syntax for reshape is to just specify the size of the dimensions we care about and
let the method figure out the other dimension, like this:

In [17]: X_test_flat = X_test.reshape(-1, 28*28)

Again, let’s print the shape to be sure:

In [18]: X_test_flat.shape

Out[18]: (10000, 784)

Great! Now we have 2 tabular datasets like the ones we are familiar with. The features contain values
between 0 and 255:

In [19]: X_train_flat.min()

Out[19]: 0

In [20]: X_train_flat.max()

Out[20]: 255

As already seen in Chapter 3, Neural Network models are quite sensitive to the absolute size of the input
features, and hence they like features that are normalized to be somewhat near 1.

We should rescale the values of our features to be between 0 and 1. Let's do it by dividing them by 255.
Notice that we need to convert the data type to float32 because, under the hood, numpy arrays are
implemented in C and therefore are strongly typed.

In [21]: X_train_sc = X_train_flat.astype('float32') / 255.0


X_test_sc = X_test_flat.astype('float32') / 255.0

Great! We now have 2D data that we can use to train a fully connected Neural Network!

Multiclass output

Since our goal is to recognize a digit contained in an image, our final output is a class label between 0 and 9.
Let’s inspect y_train to look at the target values we want to train our network to learn:

In [22]: y_train

Out[22]: array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

We can use the np.unique method to check what the unique values of the labels are; these should be the
digits from 0 to 9:

In [23]: np.unique(y_train)

Out[23]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

y_train is an array of target digits, and it contains values between 0 and 9.

Since there are ten possible output classes, this is a Multiclass classification problem where the outputs are
mutually exclusive. As we have learned in Chapter 4, we need to convert the labels to a matrix of binary
columns. In doing so, we communicate to the network that the labels are distinct and it should learn to
predict the probability that an image corresponds to a specific label.

In other words, our goal is to build a network with 784 inputs and 10 outputs, like the one represented in this
figure:

Fully connected network to solve MNIST

This way, for a given input image, the network learns to indicate which label it corresponds to. Therefore we
need to make sure that the shape of the label array matches the output of our network.

We can convert our labels to binary arrays using the to_categorical utility function from
tensorflow.keras. Let’s import it

In [24]: from tensorflow.keras.utils import to_categorical

and let’s convert both y_train and y_test:

In [25]: y_train_cat = to_categorical(y_train)


y_test_cat = to_categorical(y_test)

Let’s double check what’s going on. As we have seen before, the first element of X_train is a handwritten
number 5. So the corresponding label should be a 5.

In [26]: y_train[0]

Out[26]: 5

The corresponding binary version of the label is the following array:



In [27]: y_train_cat[0]

Out[27]: array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32)

As you can see, this is an array of 10 numbers, zero everywhere except at position 5 (remember we start
counting from 0) indicating which of the 10 classes our image should be classified as.

As seen in Chapter 3 for features and in Chapter 4 for labels, this type of encoding is called one-hot
encoding, meaning we encode classes as an array with as many elements as the number of distinct classes,
zero everywhere except for a 1 at the corresponding class.
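
To make the encoding concrete, here is a minimal sketch of what to_categorical does under the hood, written with plain numpy (the one_hot helper below is our own illustration, not part of Keras):

def one_hot(labels, num_classes):
    # One row per label, one column per class, with a single 1 per row.
    encoded = np.zeros((len(labels), num_classes), dtype='float32')
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

one_hot(np.array([5, 0, 4]), 10)   # the first three MNIST training labels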

Great! Finally let’s check the shape of y_train_cat. This should have as many rows as we have training
examples and 10 columns for the 10 binary outputs:

In [28]: y_train_cat.shape

Out[28]: (60000, 10)

Let’s check our test dataset to make sure it matches as well.

In [29]: y_test_cat.shape

Out[29]: (10000, 10)

Fantastic! We can now train a fully connected Neural Network using everything we've learned in the previous
chapters.

Fully connected on images

To build our network, let’s import the usual Keras classes as seen in Chapter 1. Once again we build a
Sequential model, i.e. we add the layers one by one, using fully connected layers, i.e. Dense:

In [30]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense

Now let's build the model. As we have done in Chapter 4, we will build this network layer by layer, making
sure that the sizes of the inputs and outputs match.

The network configuration will be the following:



• Input: 784 features


• Layer 1: 512 nodes with Relu activation
• Layer 2: 256 nodes with Relu activation
• Layer 3: 128 nodes with Relu activation
• Layer 4: 32 nodes with Relu activation
• Output Layer: 10 nodes with Softmax activation

Notice a couple of things:

1. We specify the size of the input in the definition of the first layer through the parameter
input_dim=784.

2. The choice of the number of layers and the number of nodes per layer is arbitrary. Feel free to
experiment with different architectures and observe:

– if the network performs better or worse
– if the training takes longer or shorter (number of epochs to reach a certain accuracy)

3. The last layer added to the stack is also the output layer. This may sometimes be confusing, so make
sure that the number of nodes in the last layer in the stack corresponds to the number of categories in
your dataset.

4. The last layer has a Softmax activation function at its output. As seen in Chapter 4, this is needed
when the classes are mutually exclusive. In this case, an image of a digit cannot be of 2 different digits
at the same time, and we need to let the model know about it.

5. Finally, the model is compiled using the categorical_crossentropy loss, which is the correct one
for classifications with many mutually exclusive classes.

In [31]: model = Sequential()

model.add(Dense(512, input_dim=784, activation='relu'))


model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax')) # output

model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])

Let’s print out the model summary:

In [32]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 512) 401920
_________________________________________________________________
dense_1 (Dense) (None, 256) 131328
_________________________________________________________________
dense_2 (Dense) (None, 128) 32896
_________________________________________________________________
dense_3 (Dense) (None, 32) 4128
_________________________________________________________________
dense_4 (Dense) (None, 10) 330
=================================================================
Total params: 570,602
Trainable params: 570,602
Non-trainable params: 0
_________________________________________________________________

As you can see, the model has about half a million parameters, namely 570,602.

Let’s train it on our data for ten epochs with 128 images per batch. We will need to pass the scaled and
reshaped inputs and outputs.

Also, let's use a validation_split of 0.1, meaning we will train the model on 90% of the training data,
and evaluate its performance on the remaining 10%. This is like an internal train/test split done on the
training data. It's useful when we plan to change the network and tune its architecture to maximize its ability
to generalize. We will keep the actual test set for a final check once we have committed to the best
architecture.

In [33]: h = model.fit(X_train_sc, y_train_cat, batch_size=128,


epochs=10, verbose=1,
validation_split=0.1)

Train on 54000 samples, validate on 6000 samples


Epoch 1/10
54000/54000 [==============================] - 2s 38us/sample - loss: 0.2890
- accuracy: 0.9117 - val_loss: 0.1509 - val_accuracy: 0.9537
Epoch 2/10
54000/54000 [==============================] - 1s 27us/sample - loss: 0.0991
- accuracy: 0.9690 - val_loss: 0.0861 - val_accuracy: 0.9767
Epoch 3/10
54000/54000 [==============================] - 1s 28us/sample - loss: 0.0682
- accuracy: 0.9795 - val_loss: 0.0879 - val_accuracy: 0.9755
Epoch 4/10
54000/54000 [==============================] - 1s 27us/sample - loss: 0.0487
- accuracy: 0.9851 - val_loss: 0.0791 - val_accuracy: 0.9787
Epoch 5/10
54000/54000 [==============================] - 1s 28us/sample - loss: 0.0364
- accuracy: 0.9891 - val_loss: 0.0798 - val_accuracy: 0.9798
Epoch 6/10
54000/54000 [==============================] - 1s 28us/sample - loss: 0.0296


- accuracy: 0.9909 - val_loss: 0.0975 - val_accuracy: 0.9768
Epoch 7/10
54000/54000 [==============================] - 1s 28us/sample - loss: 0.0252
- accuracy: 0.9929 - val_loss: 0.0815 - val_accuracy: 0.9833
Epoch 8/10
54000/54000 [==============================] - 1s 27us/sample - loss: 0.0202
- accuracy: 0.9942 - val_loss: 0.0903 - val_accuracy: 0.9812
Epoch 9/10
54000/54000 [==============================] - 1s 28us/sample - loss: 0.0183
- accuracy: 0.9945 - val_loss: 0.1127 - val_accuracy: 0.9807
Epoch 10/10
54000/54000 [==============================] - 1s 28us/sample - loss: 0.0181
- accuracy: 0.9950 - val_loss: 0.1164 - val_accuracy: 0.9820

The model seems to be doing very well on the training data (as we can see from the accuracy values printed above).

Let’s check if it is overfitting, i.e., if it is just memorizing the answers instead of learning general rules about
the training examples

TIP: if you need to refresh your knowledge of overfitting have a look at Chapter 3 as well as
this Wikipedia article.

Let’s plot the history of the accuracy and compare the training accuracy with the validation accuracy.

In [34]: plt.plot(h.history['accuracy'])
plt.plot(h.history['val_accuracy'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');
[Figure: training and validation accuracy versus epochs for the fully connected model.]

We already notice that while the training accuracy increases, the validation accuracy does not seem to
increase as well. Let’s check the performance on the test set:

In [35]: test_acc = model.evaluate(X_test_sc, y_test_cat)[1]


test_acc

10000/10000 [==============================] - 1s 51us/sample - loss: 0.1228


- accuracy: 0.9790

Out[35]: 0.979

and let’s compare it with the performance on the training set:

In [36]: train_acc = model.evaluate(X_train_sc, y_train_cat)[1]


train_acc

60000/60000 [==============================] - 3s 50us/sample - loss: 0.0209


- accuracy: 0.9954

Out[36]: 0.99535

The performance on the test set is lower than the performance on the training set.

TIP: one question you may have is “When is a difference between the test and train scores
significant?”. We can answer this question by running cross-validation to see what the
standard deviation of each score is. Then we can compare the difference between the two
scores with the standard deviations and see if it is much higher than the
statistical fluctuations of each score.
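
As a rough sketch of that idea (our own illustration, assuming scikit-learn is available; the smaller network and the 3 splits are chosen only to keep it quick), we could retrain the same architecture on a few random splits and look at the spread of the validation scores:

from sklearn.model_selection import ShuffleSplit

val_scores = []
for train_idx, val_idx in ShuffleSplit(n_splits=3, test_size=0.1,
                                       random_state=0).split(X_train_sc):
    m = Sequential()
    m.add(Dense(512, input_dim=784, activation='relu'))
    m.add(Dense(10, activation='softmax'))
    m.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
    m.fit(X_train_sc[train_idx], y_train_cat[train_idx],
          batch_size=128, epochs=2, verbose=0)
    val_scores.append(m.evaluate(X_train_sc[val_idx], y_train_cat[val_idx],
                                 verbose=0)[1])

print("Mean: {:.4f}  Std: {:.4f}".format(np.mean(val_scores),
                                         np.std(val_scores)))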

This difference between the train and test scores may indicate we are overfitting.

This indication makes sense because the model is trained using the individual pixels as features. This implies
that two images which are similar but slightly rotated or shifted have entirely different features.
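
To make this concrete, here is a quick check (our own illustration): shifting the first training digit sideways by just two pixels changes many of its 784 pixel features, even though to our eyes it is clearly the same digit.

# Shift the first training image two pixels to the right (np.roll wraps
# around at the border, which is fine for this illustration).
shifted = np.roll(X_train[0], 2, axis=1)

# Fraction of the 784 pixel features whose value changed after the shift.
changed = (shifted != X_train[0]).mean()
print("Fraction of pixel features changed: {:.0%}".format(changed))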

To go beyond “pixels as features” we need to extract better features from the images.

Beyond pixels as features


In the previous example, we trained a model to recognize handwritten digits using the raw values of the
pixels as input features. The model performed pretty well on the training data but had some trouble
generalizing to the test set. Intuitively it’s quite clear where the problem is: the absolute value of each pixel is
not a great feature to describe the content of an image. To understand this, realize that you would still
recognize the digits if black turned to gray and white turned to a lighter gray. This is because an image
carries information in the arrangements of nearby pixels, not just in the value of each pixel.

It is legitimate to wonder if there is a better way to extract information from images, and there is.

The process of going from an image to a vector of pixels is just the simplest case of feature extraction from
an image. There are many other methods to extract features from images, including Fourier transforms,
Wavelet transforms, Histograms of oriented gradients (HOG) and many others. These are all methods
that take an image in input and return a vector of numbers we can use as features.
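
For instance, here is a minimal sketch of what such a classical pipeline looks like, assuming the scikit-image library is installed (it is not otherwise used in this book); the parameter values are just an illustration:

# Histogram of Oriented Gradients: turn the first MNIST digit into a
# fixed-length vector describing local edge orientations, instead of
# using the raw pixel values as features.
from skimage.feature import hog

hog_features = hog(X_train[0],
                   orientations=9,
                   pixels_per_cell=(7, 7),
                   cells_per_block=(2, 2))
hog_features.shape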

The banknotes dataset we used in the previous chapter is an example of features extracted from images with
these methods.

Although powerful, these methods require profound domain knowledge, and each was developed over time
to solve a specific problem in image recognition. It would be great if we could avoid using these ad-hoc
methods and learn the best features from the image problem itself.

This is a general issue with feature engineering: identifying features that correctly represent the type of
information we are trying to capture from a rich data point (like an image) is a time-consuming and
complex effort, often involving several Ph.D. students doing their theses on it.

Feature extraction

Let’s see if we can use a different approach.

Using local information

Let's consider an image in more detail. What makes an image different from a vector of numbers is that the
values of pixels are correlated both horizontally and vertically. It’s the 2D pattern that carries the
information contained in the image and these 2D patterns, like for example horizontal and vertical contrast
lines, are specific to an image or a set of images. It would be great to have a technique that can capture them
automatically.

Additionally, if all we care about is recognizing an object, we should strive to be insensitive to the position of
the object in the image, and our features should rely more on local patterns of pixels arranged in the form of
the object, than on the position of such pixels on the grid.

Local patterns in images

The mathematical operation that allows us to look for local patterns is called convolution. However, before
we learn about it, we have to take a moment and learn about Tensors.

Images as tensors

In this section, we talk about tensors. Tensors are common in Physics and Mathematics. However, the
tensors we use in Machine Learning are not the same as the ones used in physics. Tensors in Machine
Learning are just a synonym for Multi-dimensional arrays. This is somewhat misleading and has generated
a bit of a debate (see here), but we will follow the mainstream convention and use the word tensors to
refer to multi-dimensional arrays.

In this sense, the order or rank of a tensor refers to the number of axes in the array.

TIP: people tend to use the word dimension to indicate the rank of a tensor (number of
axes) as well as the length of a specific axis. We will call the former rank or order, saving
the word dimension for the latter. More on this later, however.

You may wonder why you should learn about tensors. The answer is, they allow you to apply Machine
Learning to multi-dimensional data like images, movies, text and so on. Tensors are a great way to extend
our skills beyond the tabular datasets we’ve used so far!

Let’s start with scalars. Scalars are just numbers, everyday numbers we are used to. They have no dimension.

In [37]: 5

Out[37]: 5

Vectors can be thought of as lists of numbers. The number of elements in a vector is also called vector
length and sometimes number of dimensions. As already seen many times, in python we can create
vectors using the np.array constructor:

In [38]: v = np.array([1, 3, 2, 1])


v.shape

Out[38]: (4,)

We’ve just created a vector with 4 elements.

TIP: In our terminology, this is a vector of dimension 4, which is still a tensor of order 1,
since it only has one axis.

The numbers in the list are the coordinates of a point in a space with as many dimensions as the number of
entries in the list.

Going up one level, we encounter tensors of order 2, which are called matrices. Matrices are tables of
numbers with rows and columns, i.e., they have two axes.

In [39]: M = np.array([[1, 3, 2, 2],


[0, 1, 3, 2]])
M.shape

Out[39]: (2, 4)

The first axis of M has length 2, which is the number of rows in the matrix, the second axis has length 4, and
it corresponds to the columns in the matrix.

A grayscale image, as we saw, is a 2D matrix where each entry corresponds to a pixel.

TIP: notice that plt.imshow takes care of normalizing the values of the matrix so that we
can display them in gray-scale.

In [40]: plt.imshow(M, cmap='gray');

[Figure: the 2x4 matrix M displayed as a grayscale image.]

A matrix is a list of vectors of the same length, each representing a row of pixels, in the same way as a vector
is a list of scalars. So if we extract the first element of the matrix, this is a vector:

In [41]: M[0].shape

Out[41]: (4,)

This recursive construction allows us to organize them in the larger family of tensors. Tensors in Machine
Learning can be understood as nested lists of objects of the previous order, all with the same shape.

Tensors

So for example, a tensor of order three can be thought of as an array of matrices, which are tensors of order
two. Since all of these matrices have the same number of rows and columns, the tensor is actually like a
cuboid of numbers.

Each number is located by the row, the column and the depth where it is stored.
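
As a quick sketch (our own example), we can build such a cuboid with numpy and locate any entry with three indices:

# A stack of two 3x4 matrices, i.e. a tensor of order 3 with shape (2, 3, 4).
T = np.arange(24).reshape(2, 3, 4)

T.shape      # (2, 3, 4)
T[1, 0, 3]   # the entry in the second matrix, first row, last column: 15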

The shape of a tensor tells us how many objects there are when counting along a particular axis. So for
example, a vector has only one axis, and a matrix has two axes, indicating the number of rows and the
number of columns. Since most of the data (images, sounds, texts) we will use are stored as tensors, it is
necessary to know the dimensions of these objects for proper use.

Colored images

A colored image is a set of gray-scale images, each corresponding to a primary color channel. So, in the case
of RGB, we have three channels (Red, Green, and Blue), each containing the pixels of the image in that
particular channel.

Multi-dimensional array

This image is an order three tensor, and there are two major ordering conventions. If we think of the image
as a list of three single-color images, then the axis order will be channel first, then height, and then width.

On the other hand, we can also think of the tensor as a 2D grid of pixels, where each pixel
contains three numbers, one for each of the colors. We call this ordering channel-last, and it's the convention
used in the rest of the book.
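
Switching between the two conventions is just a matter of moving the channel axis; here is a small sketch (our own example, on a hypothetical 4x4 image):

# A channel-first image: 3 color channels, each of 4x4 pixels.
chw = np.random.randint(0, 256, size=(3, 4, 4), dtype='uint8')

# Move the channel axis to the end to obtain the channel-last layout.
hwc = np.moveaxis(chw, 0, -1)
hwc.shape   # (4, 4, 3)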

Channel order when an image is represented as a tensor

Let’s create and display a random color image by creating a list of random pixels between 0 and 255:

In [42]: img = np.random.randint(255, size=(4, 4, 3),


dtype='uint8')
img

Out[42]: array([[[ 94, 146, 123],


[228, 183, 80],
[247, 249, 199],
[101, 19, 63]],

[[226, 81, 76],


[221, 203, 132],
[251, 12, 166],
[204, 184, 214]],

[[ 91, 58, 244],


[191, 85, 14],
[243, 29, 171],
[106, 112, 188]],

[[238, 232, 127],


[109, 38, 91],
[ 78, 109, 249],
[196, 102, 6]]], dtype=uint8)

Now let's display it as a figure, showing the combined image alongside each individual color channel.

In [43]: plt.figure(figsize=(5, 5))


plt.subplot(221)
plt.imshow(img)
plt.title("All Channels combined")

plt.subplot(222)
plt.imshow(img[:, : , 0], cmap='Reds')
plt.title("Red channel")

plt.subplot(223)
plt.imshow(img[:, : , 1], cmap='Greens')
plt.title("Green channel")

plt.subplot(224)
plt.imshow(img[:, : , 2], cmap='Blues')
plt.title("Blue channel")
plt.tight_layout()
[Figure: the random 4x4 color image shown with all channels combined and as separate Red, Green, and Blue channels.]

Pause here for a second and observe how the colors of the pixels in the colored image reflect the
combination of the colors in the three channels.

Now that we know how to represent images using tensors, we are ready to introduce convolutional Neural
Networks.

TIP: If you’d like to know a bit more about Tensors and how they work, we displayed a few
operations in the Appendix.

Convolutional Neural Networks


Simply stated, convolutional Neural Networks are Neural Networks that replace the matrix multiplication
operation (X ⋅ w) with the convolution operation (X ∗ w) in at least one of their layers.

TIP: if you need a refresher about convolutions and how they work, have a look at the
Appendix.

For this chapter, all we need to know is that we can convolve an image with a filter or kernel, which is a
smaller image. The convolution generates a new image, also called a feature map, whose pixels represent the
“degree of matching” of the corresponding receptive field with the kernel.
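
To make the operation concrete, here is a small sketch written with plain numpy (the helper below is our own illustration, not part of Keras); strictly speaking it computes a cross-correlation, which is what convolutional layers actually do:

def convolve2d_valid(image, kernel):
    # Slide the kernel over the image; at each position multiply the
    # receptive field by the kernel element-wise and sum the result.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel applied to a random 5x5 image gives a 3x3 feature map.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
convolve2d_valid(np.random.rand(5, 5), kernel).shape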

Feature map and receptive field

So, if we take many filters and arrange them in a convolutional layer the output of the convolution of an
image will be as many feature maps (convolved images) as there are filters. Since all of these images have the
same size, we can arrange them in a tensor, where the number of channels corresponds to the number of
filters used. Let’s use tensors to describe everything: inputs, layers, and outputs.

We can arrange the input data as a tensor of order four. A single image is an order-3 tensor as we know, but
since we have many input images in a batch, and they all have the same size, we might as well stack them in
an order-4 tensor where the first axis indicates the number of samples.

So the four axes are respectively: the number of images in the batch, the height of the image, the width of
the image and the number of color channels in the picture.

For example, in the MNIST training dataset, we have 60000 images, each with 28x28 pixels and only one
color channel, because they are grayscale. This gives us an order-4 tensor with the shape (60000, 28, 28, 1).

Image represented as a tensor of order-4

Similarly, we can stack the filters in the convolutional layer as an order-4 tensor. We will use the first two axes
for the height and the width of the filter. The third axis will correspond to the number of color channels in
the input, while the last axis is for the number of nodes in the layer, i.e., the number of different filters we are
going to learn. This is also sometimes called the number of output channels; you'll soon see why.

Let's do an example where we build a convolutional layer with four 3x3 filters. The order-4 tensor has a
shape of (3, 3, 1, 4), i.e., four filters of 3x3 pixels, each with a single input color channel.

When we convolve each input image with the convolutional layer, we still obtain an order-4 tensor.

The first axis is still the number of images in the batch or the dataset. The other three axes are for the image
height, width and number of color channels in the output. Notice that this is also the number of filters in the
layer, four in the case of this example.

Notice that since the output is an order-4 tensor, we could feed it to a new convolutional layer, provided we
make sure to match the number of channels correctly.

Convolutional Layers

Convolutional layers are available in keras.layers.Conv2D. Let’s apply a convolutional layer to an image
and see what happens.

First, let’s import the Conv2D layer from keras:

In [44]: from tensorflow.keras.layers import Conv2D


Convolutional kernel represented as a tensor of order-4

Feature map represented as a tensor of order-4

Now let’s load an example image from scipy.misc:

In [45]: from scipy import misc

In [46]: img = misc.ascent()

and let’s display it:

In [47]: plt.figure(figsize=(5, 5))


plt.imshow(img, cmap='gray');

[Figure: the 512x512 grayscale example image from scipy.misc.ascent.]

Let’s check the shape of img:

In [48]: img.shape

Out[48]: (512, 512)



A convolutional layer wants an order-4 tensor as input, so first of all we need to reshape our image so that it
has 4 axes and not 2.

We can add one axis of length 1 for the color channel (which is a grayscale pixel value between 0 and 255)
and one axis of length 1 for the dataset index.

In [49]: img_tensor = img.reshape((1, 512, 512, 1))

Let's start by applying a large flat filter of size 11x11 pixels. This operation should result in a blurring of the
image because each output pixel combines all the pixels in its receptive field.

The syntax of Conv2D is:

Conv2D(filters, kernel_size, ...)

so we will specify 1 for filters and (11, 11) for kernel_size. We will also initialize all the weights to
one by using kernel_initializer='ones'. Finally we will need to pass the input shape, since this is the
first layer in the network. This is the shape of a single image, which in this case is (512, 512, 1).

In [50]: model = Sequential()


model.add(Conv2D(1, (11, 11), kernel_initializer='ones',
input_shape=(512, 512, 1)))
model.compile('adam', 'mse')
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 502, 502, 1) 122
=================================================================
Total params: 122
Trainable params: 122
Non-trainable params: 0
_________________________________________________________________

We have a model with one convolutional layer, so the number of parameters is equal to 11 x 11 + 1 = 122, where
the +1 comes from the bias term. We can apply the convolution to the image by running a forward pass:

In [51]: img_pred_tensor = model.predict(img_tensor)

To visualize the image we extract it from the tensor.



In [52]: img_pred = img_pred_tensor[0, :, :, 0]

and we can use plt.imshow as before:

In [53]: plt.imshow(img_pred, cmap='gray');

[Figure: the blurred output image produced by the 11x11 all-ones convolution.]

As you can see the image is blurred, as we expected.

TIP: try to change the initialization of the convolutional layer to something else. Then
re-run the convolution and notice how the output image changes.

Great! We have just demonstrated that the convolution with a kernel will produce a new image, whose pixels
will be a combination of the original pixels in a receptive field and the values of the weights in the kernel.

The user does not decide these weights: the network learns them through backpropagation! This allows a
Neural Network to adapt and learn any pattern that is relevant to solving the task.

There are two additional arguments to consider when building a convolutional layer with Keras: padding
and stride.

Padding

If you’ve been paying attention, you may have noticed that the convolved image is slightly smaller than the
original image:

In [54]: img_pred_tensor.shape

Out[54]: (1, 502, 502, 1)

This is due to the default setting of padding='valid' in the Conv2D layer, and it has to do with how we
treat the data at the boundaries. Each pixel in the convolved image is the result of the contraction of the
receptive field with the kernel. Since, in this case, the kernel has a size of 11x11, if we start at the top left
corner and slide to the right, there are only 502 possible positions for the receptive field. In other words, we
lose 5 pixels on the right and 5 pixels on the left.

If we would like to preserve the image size, we need to offset the first receptive field so that its center falls on
the top left corner of the input image. We can fill the empty parts with zeros. This is called padding.

In tensorflow.keras we have two padding modes:

• valid, which means no padding
• same, which means pad to keep the same image size

Let’s check that padding same works as expected:

In [55]: model = Sequential()


model.add(Conv2D(1, (11, 11), padding='same',
kernel_initializer='ones',
input_shape=(512, 512, 1)))
model.compile('adam', 'mse')

model.predict(img_tensor).shape

Out[55]: (1, 512, 512, 1)

Awesome! We know how padding works. Why use padding? We can use padding if we think that the pixels
at the border contain useful information to solve the classification task.
Padding

Stride

The stride is the number of pixels that separate one receptive field from the next. It's like the step
size in the convolution. A stride of (1, 1) means we slide the receptive field by one pixel horizontally and
one vertically. Looking at the figure:

Stride

The input image has size 6x6, the filter (not shown) is 3x3 and so is the receptive field. If we perform
convolution with no padding and stride of 1, the output image will lose one pixel on each side, resulting in a
4x4 image. Increasing the stride means skipping a few pixels between one receptive field and the next, so,
for example, a stride of (3, 3), will produce an output image of 2x2.
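
A small helper (our own, just for illustration) makes these numbers easy to check: with no padding, the output size along each axis is (input - kernel) // stride + 1.

def conv_output_size(input_size, kernel_size, stride=1):
    # Output size along one axis for a convolution with no padding ('valid').
    return (input_size - kernel_size) // stride + 1

conv_output_size(6, 3, stride=1)    # 4, as in the figure above
conv_output_size(6, 3, stride=3)    # 2
conv_output_size(512, 11)           # 502, matching the earlier example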

We can also use strides of different lengths in the two directions, which will produce a rectangular convolved
output image.

Finally, if we don’t want to lose the borders during the convolution, we can pad the image with zeros and
obtain an image with the same size as the input.

The default value for the stride is 1 to the right and 1 down, but we can jump by larger amounts,
for example, if the image resolution is too high.

This will produce output images that are smaller. For example, let's jump by 5 pixels in both directions:

In [56]: model = Sequential()


model.add(Conv2D(1, (11, 11), strides=(5, 5),
padding='same',
kernel_initializer='ones',
input_shape=(512, 512, 1)))
model.compile('adam', 'mse')

small_img_tensor = model.predict(img_tensor)
small_img_tensor.shape

Out[56]: (1, 103, 103, 1)

In [57]: plt.imshow(small_img_tensor[0, :, :, 0], cmap='gray');

[Figure: the 103x103 output obtained with a stride of 5; the image is still recognizable at a much lower resolution.]

The image is still present, but its resolution is now much lower. We can also choose asymmetric strides, if we
believe the image has more resolution in one direction than another:

In [58]: model = Sequential()


model.add(Conv2D(1, (11, 11), strides=(11, 5),
padding='same',
kernel_initializer='ones',
input_shape=(512, 512, 1)))
model.compile('adam', 'mse')

asym_img_tensor = model.predict(img_tensor)
asym_img_tensor.shape
Out[58]: (1, 47, 103, 1)

Pooling layers

Another layer we need to learn about is the pooling layer.

Pooling reduces the size of the image by discarding some information. For example, max-pooling only
preserves the maximum value in a patch and stores it in the new image, while dropping the values in the
other pixels.

Also, pooling patches usually do not overlap, which reduces the size of the image.

If we apply pooling to the feature maps, we end up with smaller feature maps that still retain the highest
matches of our convolutional filters with the input.

Average pooling is similar, only using average instead of max.
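
To see exactly what max-pooling computes, here is a tiny sketch with plain numpy (our own example) that pools a 4x4 matrix with non-overlapping 2x2 patches:

patch = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 2],
                  [2, 2, 3, 4]])

# Group the pixels into 2x2 blocks and keep only the maximum of each block.
pooled = patch.reshape(2, 2, 2, 2).max(axis=(1, 3))
pooled   # array([[4, 2],
         #        [2, 5]])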

These layers are available in tensorflow.keras as MaxPooling2D and AveragePooling2D.

In [59]: from tensorflow.keras.layers import MaxPooling2D


from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import GlobalMaxPooling2D

Let’s add a MaxPooling2D layer in a simple network (containing this single layer):

In [60]: model = Sequential()


model.add(MaxPooling2D(pool_size=(5, 5),
input_shape=(512, 512, 1)))
model.compile('adam', 'mse')

and let’s apply it to our example image:

In [61]: img_pred = model.predict(img_tensor)[0, :, :, 0]


img_pred.shape

Out[61]: (102, 102)

In [62]: plt.imshow(img_pred, cmap='gray');


[Figure: the 102x102 output of the 5x5 max-pooling layer applied to the example image.]

Max-pooling layers are useful in object recognition tasks: since pixels in feature maps represent the
“degree of matching” of a filter with a receptive field, keeping the max keeps the strongest matching feature.

On the other hand, if we are also interested in the location of a particular match, then we shouldn’t be using
max-pooling, because we lose the location information in the pooling operation.

Thus, for example, if we are using a Convolutional Neural Network to read the state of a video game from a
frame, we need to know the exact positions of players and thus using max-pooling is not recommended.

Finally GlobalMaxPooling2D calculates the global max in the image, so it returns a single value for the
image:

In [63]: model = Sequential()


model.add(GlobalMaxPooling2D(input_shape=(512, 512, 1)))
model.compile('adam', 'mse')

In [64]: img_pred_tensor = model.predict(img_tensor)


img_pred_tensor.shape

Out[64]: (1, 1)

Final architecture

Convolutional, pooling and activation layers can be stacked together, feeding the output of one layer into
the input of the next. This stacking results in a feature-extraction pipeline that will gradually transform an
image into a tensor with more channels and fewer pixels:

Convolutional stack

The value of each "pixel" in the last feature map is influenced by a large region of the original image, and it
will have learned to recognize complex patterns.

That’s the beauty of stacking convolutional layers. The first layers will learn patterns of pixels in the original
image, while deeper layers will learn more complex patterns that are combinations of the simpler patterns.
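
As a minimal sketch of such a stack (the architecture below is chosen purely for illustration), we can watch the Output Shape column of the summary: the spatial size shrinks while the number of channels grows.

stack = Sequential()
stack.add(Conv2D(8, (3, 3), activation='relu',
                 input_shape=(28, 28, 1)))
stack.add(MaxPooling2D(pool_size=(2, 2)))
stack.add(Conv2D(16, (3, 3), activation='relu'))
stack.add(MaxPooling2D(pool_size=(2, 2)))
stack.summary()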

In practice, early layers will specialize to recognize contrast lines in different orientations, while deeper
layers will combine those contrast lines to identify parts of objects. The typical example of this is the face
recognition task where middle layers recognize facial features like eyes, noses, and mouths while deeper
nodes specialize on individual faces.

The convolutional stack behaves like an optimized feature extraction pipeline that is trained to solve the task
at hand optimally.

To complete the pipeline and solve the classification task we can pipe the output of the feature extraction
pipeline into a fully connected final stack of layers.

We will need to unroll the output tensor into a long vector (as we did initially for the MNIST data) and
connect this vector to the labels using a fully connected network.

We can also stack multiple fully connected layers if we want. Our final network is like a pancake of many
layers, the convolutional part dealing with feature extraction and the fully connected part handling the
classification.

The deeper we go in the network, the richer and more unique the matched patterns become, and the more
robust the classification will be.

Convolutional network on images

Let's build our first convolutional Neural Network to classify the MNIST data. First of all we need to reshape
the data as order-4 tensors. We will store the reshaped data into new variables called X_train_t and
X_test_t.

In [65]: X_train_t = X_train_sc.reshape(-1, 28, 28, 1)


X_test_t = X_test_sc.reshape(-1, 28, 28, 1)

In [66]: X_train_t.shape

Out[66]: (60000, 28, 28, 1)

Flatten layer

Then we import the Flatten and Activation layers:

In [67]: from tensorflow.keras.layers import Flatten, Activation

Let’s now build a simple model with the following architecture:

• A Conv2D layer with 32 filters of size 3x3.


• A MaxPooling2D layer of size 2x2.
• An activation layer with a ReLU activation function.
• A couple of fully connected layers leading to the output of 10 classes corresponding to the digits.

Notice that between the convolutional layers and the fully connected layers we will need Flatten to
reshape the feature maps into feature vectors.

To speed up the convergence, we initialize the convolutional weights drawing from a random normal
distribution. Later in the book, we will discuss initializations more in detail.

Also notice that we need to pass input_shape=(28, 28, 1) to let the model know our input images are
grayscale 28x28 images:

In [68]: model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=(28, 28, 1),


kernel_initializer='normal'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_4 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32) 0
_________________________________________________________________
activation (Activation) (None, 13, 13, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 5408) 0
_________________________________________________________________
dense_5 (Dense) (None, 64) 346176
_________________________________________________________________
dense_6 (Dense) (None, 10) 650
=================================================================
Total params: 347,146
Trainable params: 347,146
Non-trainable params: 0
_________________________________________________________________

This model has about 347k parameters, significantly fewer than the 570k of the fully connected model we designed at the
beginning of this chapter. Let's train it for five epochs. Notice that we pass the tensor data we created above:

In [69]: h = model.fit(X_train_t, y_train_cat, batch_size=128,
                       epochs=5, verbose=1, validation_split=0.3)

Train on 42000 samples, validate on 18000 samples

Epoch 1/5
42000/42000 [==============================] - 2s 47us/sample - loss: 0.3334
- accuracy: 0.9022 - val_loss: 0.1502 - val_accuracy: 0.9586
Epoch 2/5
42000/42000 [==============================] - 2s 41us/sample - loss: 0.1139
- accuracy: 0.9661 - val_loss: 0.1214 - val_accuracy: 0.9613
Epoch 3/5
42000/42000 [==============================] - 2s 40us/sample - loss: 0.0711
- accuracy: 0.9792 - val_loss: 0.0965 - val_accuracy: 0.9703
Epoch 4/5
42000/42000 [==============================] - 2s 40us/sample - loss: 0.0510
- accuracy: 0.9858 - val_loss: 0.0674 - val_accuracy: 0.9793
Epoch 5/5
42000/42000 [==============================] - 2s 40us/sample - loss: 0.0383
- accuracy: 0.9891 - val_loss: 0.0648 - val_accuracy: 0.9812

Like before, we can display the training history:

In [70]: plt.plot(h.history['accuracy'])
plt.plot(h.history['val_accuracy'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');

Accuracy: training and validation accuracy per epoch (plot)

and compare the accuracy on train and test sets:

In [71]: train_acc = model.evaluate(X_train_t, y_train_cat,
                                    verbose=0)[1]
test_acc = model.evaluate(X_test_t, y_test_cat,
                          verbose=0)[1]

print("Train accuracy: {:0.4f}".format(train_acc))
print("Test accuracy: {:0.4f}".format(test_acc))

Train accuracy: 0.9889
Test accuracy: 0.9844

The convolutional model achieved better performance on the MNIST data in fewer epochs. Overfitting also decreases, because the model learns to combine spatial patterns instead of memorizing the exact values of the pixels.

Beyond images
Convolutional networks are great on all data types where the order matters. For example, they can be used
on sound files using spectrograms. Spectrograms represent sound as an image where the vertical axis
corresponds to the frequency bands, while the horizontal axis indicates the time. We can feed spectrograms
to a convolutional layer and treat them like images. Some of the most famous speech recognition engines use
this technique.

Similarly, we can map a sentence of text onto an image where the vertical axis indicates the word index in a
dictionary, and the horizontal axis is for the position of the word in the sentence.
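
To make the spectrogram idea concrete, here is a minimal sketch (our own illustration, assuming scipy is available; none of this code appears elsewhere in the chapter): we build a synthetic tone, compute its spectrogram, and add a channel axis so the result has the same shape conventions as a grayscale image.

import numpy as np
from scipy.signal import spectrogram

fs = 8000                               # sampling rate in Hz
t = np.arange(0, 2.0, 1.0 / fs)         # two seconds of audio
signal = np.sin(2 * np.pi * 440 * t)    # a 440 Hz tone

freqs, times, Sxx = spectrogram(signal, fs=fs)
print(Sxx.shape)                        # (frequency bands, time steps)

image_like = Sxx[..., None]             # add a channel axis, as we did for MNIST
print(image_like.shape)

A stack of such "images", reshaped to (samples, bands, steps, 1), could then be fed to the same kind of Conv2D pipeline we used above.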

Although they appear in many models and domains, CNNs are not useful at all in some cases. Since they are good at capturing spatial patterns, they are of no use when such local patterns do not exist. This is the case when the data is a 2D table coming from a database collecting user data. Each row corresponds to a user and each column to a feature, but there is no particular order in either the columns or the rows.

In other words, we can swap the order of the rows or the columns without altering the information
contained in the table. In a case like this, a CNN is completely useless, and we should not use it.

Conclusion
In this chapter, we’ve finally introduced convolutional Neural Networks as a tool to efficiently extract
features from images and more generally from spatially correlated data.

Convolutional networks are ubiquitous in object recognition tasks, widely used in robotics, self-driving
cars, advertising, and many more fields.

Exercises

Exercise 1

You've been hired by a shipping company to overhaul the way they route mail, parcels, and packages. They want to build an image recognition system capable of recognizing the digits in the zip code on a package in order to automatically route it to the correct location. You are tasked with building the digit recognition system. Luckily, you can rely on the MNIST dataset for the initial training of your model!

Build a deep convolutional Neural Network with at least two convolutional and two pooling layers before
the fully connected layer:

• start from the network we have just built
• insert one more Conv2D, MaxPooling2D and Activation pancake. You will have to choose the number of filters in this convolutional layer
• retrain the model
• does performance improve?
• how many parameters does this new model have? More or less than the previous model? Why?
• how long did this second model take to train? Longer or shorter than the previous model? Why?
• did it perform better or worse than the previous model?

In [ ]:
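
If you want a starting point, here is one possible skeleton for this exercise (a sketch only; the 64 filters in the second convolutional block are an arbitrary choice, and you should experiment with your own values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation
from tensorflow.keras.layers import Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation('relu'))

# the extra Conv2D / MaxPooling2D / Activation pancake
model.add(Conv2D(64, (3, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation('relu'))

model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])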

Exercise 2

Pleased with your performance with the digits recognition task, your boss decides to challenge you with a harder task. Their online branch allows people to upload images to a website that generates and prints a postcard and ships it to its destination. Your boss would like to know what images people are uploading to the site in order to provide targeted advertising on the same page, so he asks you to build an image recognition system capable of recognizing a few objects. Luckily for you, there's a ready-made dataset with a collection of labeled images. This is the Cifar 10 Dataset, a very famous dataset that contains images for ten different categories:

• airplane
• automobile
• bird
• cat
• deer
• dog
• frog
• horse
• ship
• truck

In this exercise, we will reach the limit of what you can achieve on your laptop. In later chapters, we will
learn how to leverage GPUs to speed up training.

Here's what you have to do:

• load the cifar10 dataset using keras.datasets.cifar10.load_data()
• display a few images, see how hard/easy it is for you to recognize an object with such low resolution
• check the shape of X_train, does it need reshaping?
• check the scale of X_train, does it need rescaling?
• check the shape of y_train, does it need reshaping?
• build a model with the following architecture, and choose the parameters and activation functions for each of the layers: conv2d - conv2d - maxpool - conv2d - conv2d - maxpool - flatten - dense - output
• compile the model and check the number of parameters
• attempt to train the model with the optimizer of your choice. How fast does training proceed?
• If training is too slow, feel free to stop it and read ahead. In the next chapters, you'll learn how to use GPUs to speed up training.

In [72]: from tensorflow.keras.datasets import cifar10

In [ ]:
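
As a first step, you may want to check what load_data() returns (a sketch; the shapes below are those of the standard Cifar 10 split):

(X_train, y_train), (X_test, y_test) = cifar10.load_data()

print(X_train.shape)   # (50000, 32, 32, 3): already an order-4 tensor
print(y_train.shape)   # (50000, 1): the labels still need one-hot encoding
print(X_train.max())   # 255: the pixel values still need rescaling
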
7 Time Series and Recurrent Neural Networks
In this chapter, we will learn mainly about Recurrent Neural Networks (or RNN, for short). RNNs expand
the architectures we have encountered so far by allowing for feedback loops in time. This property makes
RNNs particularly suited to work with ordered data, for example, time series, sound, and text. These
networks can generate arbitrary sequences of outputs, opening the door on many new types of Supervised
Learning problems.

Machine Learning on Time Series requires a bit more caution than usual since we need to avoid leaking
future information into the training. We will start this chapter talking about Time Series and Sequence
problems in general. Then we will introduce RNNs and in particular two famous architectures: LSTMs and
GRUs (for the latter, have a look at Exercise 2).

This chapter contains both practical and theoretical parts with some math. As we did in Chapter 5, let us first tell you: you don't NEED to read the math in this chapter. This book is for the developer and practitioner who is interested in applying Neural Networks to solve great problems. We provide the math for the curious, and we will make sure to highlight which sections you can skip at a first read.

Time Series
Time series are everywhere. Examples of time series are the values of a stock, music, text, events on your
app, video games, which are sequences of actions, and in general, any quantity monitored over time that
generates a sequence of values.

A time series is an ordered sequence of data points, and it can be univariate or multivariate.

A univariate time series is nothing but a sequence of scalars. Examples of this are temperature values
through the day or the number of app downloads per minute.


A time Series

A time series could also take values in a vector space, in which case it is a multivariate time series.
Examples of vector time series are the speed of a car as a function of time or an audio file recorded in stereo,
which has two channels.

Machine Learning can be applied to time series to solve several problems including forecasting, anomaly
detection, and pattern recognition.

Time Series Problems

Forecasting refers to predicting future samples in a sequence. In a way, this is a regression problem, because we are predicting a continuous quantity using features derived from the time series; most likely it is a nonlinear regression.

Anomaly detection refers to identifying deviations from a regular pattern. We can approach this problem
in two ways: if we know the anomalies we are looking for, we treat it as a binary classification problem
between the anomalous and the regular class. If we do not know them, we train a model to forecast future
values (regression) and then compare the predicted value and the original signal. In this case, anomalies are
where the prediction is very different from the actual value in the time series.

Pattern recognition is classification on time series, identifying recurring patterns.

In all these cases we must use particular care because the data is ordered in time and we need to avoid
leaking future information in the features used by the model. This is particularly true for model validation.
If we split the time series into training and test sets, we cannot just pick a random split from the time series.
We need to split the data in time: all the test data should come after the training data.
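
As a minimal sketch of what this means in practice (our own illustration, not tied to any dataset in this chapter), a time-based split simply cuts the ordered series at a fixed point instead of sampling rows at random:

import numpy as np

series = np.arange(1000)            # stands in for an ordered time series
cutoff = int(len(series) * 0.8)     # the first 80% of the timeline

train, test = series[:cutoff], series[cutoff:]   # test data comes strictly after train data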

Train/Test approach for a time series problem

Also, sometimes a trend or a seasonal pattern is distinguishable.

Trend and seasonality

This fact is recognizable in any data related to human activity, where **daily, weekly, monthly and yearly periodicities** are found.

Think for example of retail sales. A dataset with hourly sales from a shop will have regular patterns during the day, with periods of higher and lower customer flow, as well as during the week. Depending on the type of goods, we may find higher or lower sales during the weekend. Special dates, like Black Friday or sales days, will appear as anomalies in these regular patterns and should be easy to catch.
In these cases, it is a good idea to either remove these periodicities beforehand or to add the relevant time
interval as an input feature.
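
For example, with pandas the relevant time interval can be exposed as explicit input features in a couple of lines (a sketch on hypothetical hourly data, not part of this chapter's datasets):

import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=24 * 14, freq="H")     # two weeks of hourly timestamps
sales = pd.DataFrame({"sales": np.random.rand(len(idx))}, index=idx)

sales["hour"] = sales.index.hour              # captures the daily periodicity
sales["dayofweek"] = sales.index.dayofweek    # captures the weekly periodicity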

Time series classification


As a warm-up exercise let’s perform a classification on time series data. Let’s load the usual common files:

In [1]: with open('common.py') as fin:
            exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:
            exec(fin.read())

The file sequence_classification.csv.bz2 contains a set of 4000 curves. Let’s load it and look at a few
rows and columns:

In [3]: fname = '../data/sequence_classification.csv.bz2'
df = pd.read_csv(fname, compression='bz2')
df.iloc[0:5, 0:5]

Out[3]:

anomaly t_0 t_1 t_2 t_3


0 False 1.000000 0.974399 0.939818 0.906015
1 True 0.626815 0.665145 0.669603 0.693649
2 False 0.983751 0.944806 0.999909 0.975756
3 True 0.977806 1.000000 0.975431 0.966523
4 False 0.691444 0.710671 0.660787 0.690993

TIP: this is the first time that we load a zipped file, i.e., a compressed file, which is convenient to save storage space. Pandas can load compressed files directly in several formats, for example bz2. Have a look at the documentation for further details and discover all the supported formats.

In [4]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Columns: 201 entries, anomaly to t_199
dtypes: bool(1), float64(200)
memory usage: 6.1 MB

Each row in the dataset is a curve; the labels for anomalies are given in the first column (in this case, we have two classes, True and False).

In [5]: df['anomaly'].value_counts()

Out[5]:

anomaly
True 2000
False 2000

As we can see, 2000 curves present anomalies, while the other 2000 do not. Let’s create the X and y arrays
and plot the first four curves.

In [6]: X = df.drop("anomaly", axis="columns").values
y = df["anomaly"].values

Now let’s plot these curves separated by the anomaly values.

In [7]: plt.plot(X[:4].transpose())
plt.legend(y[:4])
plt.title("Curves with Anomalies");

Curves with Anomalies (plot of the first four curves)

How do we treat this problem with Machine Learning?

We can approach it in various ways.

1. We could use the values of the curves as features (that is 200 points) and feed them to a fully
connected network.
2. We could engineer features from the curves, like statistical quantities, differences, and Fourier
coefficients and feed those to a Neural Network.
3. We could use a 1D convolutional network to extract patterns from the curves automatically.

Let’s quickly try all three.

TIP: if you had to guess, which of the three approaches seems more promising?

First of all, we will perform a train/test split. In this case we do not need to worry about the order in time
because the sequences are given to us without any information about their absolute time. For all we know
they could be independent measurements of the same phenomenon.

Let’s load the train_test_split function from sklearn first:

In [8]: from sklearn.model_selection import train_test_split

Now let’s split the data into the training and test sets:

In [9]: X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=0.25,
                             random_state=0)

Fully connected networks

Let’s load the usual Sequential model and Dense layer from tensorflow.keras so we can build our
fully connected network:

In [10]: from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

The model will take the 200 values that form a curve as independent input features and it will have a single
output neuron to classify the curve as containing an anomaly or not.

This process should be pretty familiar by now:

In [11]: model = Sequential([
    Dense(100, input_dim=200, activation='relu'),
    Dense(50, activation='relu'),
    Dense(20, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])

Next, let’s train the model to fit against our training set.

In [12]: h = model.fit(X_train, y_train, epochs=30,
                       verbose=0, validation_split=0.1)

Now, let's plot the accuracy curves of our newly trained model.



In [13]: plt.plot(h.history['accuracy'])
plt.plot(h.history['val_accuracy'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');

Accuracy: training and validation accuracy per epoch (plot)

In [14]: acc_ = model.evaluate(X_test, y_test, verbose=0)[1]
print("Test Accuracy: {:0.3}".format(acc_))

Test Accuracy: 0.535

This model does not seem to perform well (it operates at around 50% accuracy). We can understand the reason for this poor performance if we notice that the anomaly can be located anywhere along the curve. Since every point in the curve is treated as an independent input feature, and each of them can contain an anomaly or not, it is difficult for the network to learn a consistent pattern about the presence of an anomaly simply by looking at amplitude values.

Fully connected networks with feature engineering

Let’s try to extract some features from the curves. We will limit ourselves to:

• std: standard deviation of the curve values
• std_diff: standard deviation of the first order differences

TIP: feel free to add more features like higher order statistical moments or Fourier
coefficients.

First, let’s build the new DataFrame eng_f containing in the two columns the feature std and std_diff.

In [15]: eng_f = pd.DataFrame(X.std(axis=1), columns=['std'])
eng_f['std_diff'] = np.diff(X, axis=1).std(axis=1)

eng_f.head()

Out[15]:

std std_diff
0 0.260902 0.023511
1 0.249588 0.030286
2 0.304086 0.023464
3 0.302908 0.030531
4 0.286405 0.066638

We split the data again:

In [16]: (eng_f_train, eng_f_test,
          y_train, y_test) = train_test_split(eng_f.values, y,
                                              test_size=0.25,
                                              random_state=0)

Let’s clear out the backend for any memory we’ve already used:

In [17]: import tensorflow.keras.backend as K

In [18]: K.clear_session()

Next, let's train a fully connected model: as already seen many times, the first layer depends on the number of input features. In this case we only have 2 inputs: std and std_diff. The last layer will still be a single binary classification output (0/1) in this context (notice that the last layer is the same as in the previous model, since we didn't change our output). The inner layers, only one in this model, depend on the researcher's preference and are treated as hyperparameters.

In [19]: model = Sequential([
    Dense(30, input_dim=2, activation='relu'),
    Dense(10, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])

Now let’s train our model on our 2 engineered features:

In [20]: h = model.fit(eng_f_train, y_train, epochs=50,
                       verbose=0, validation_split=0.1)

Let's plot the training history again:

In [21]: plt.plot(h.history['accuracy'])
plt.plot(h.history['val_accuracy'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');

Accuracy: training and validation accuracy per epoch (plot)

In [22]: acc_ = model.evaluate(eng_f_test, y_test, verbose=0)[1]
print("Test Accuracy: {:0.3}".format(acc_))

Test Accuracy: 0.812

This model is already much better than the previous one, but can we do better? Let's try the third approach, i.e. a 1D convolutional network to automatically extract patterns from the curves.

Fully connected networks with 1D Convolution

As we know by now, convolutional layers are good for recognizing spatial patterns. In this case we know the
anomaly spans across a dozen points along the curve, so we should be able to capture it if we cascade a few
Conv1D layers with filter size of 3.

TIP: the filter size, 3 in this case, is an arbitrary choice. In the Appendix we explain how a convolution with a filter size equal to 3 helps identify patterns in the 1D sequence. Cascading multiple layers with small filters allows us to learn longer patterns.
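
To see why cascading helps, here is a back-of-the-envelope sketch (our own helper, not part of Keras) that computes the receptive field of a stack of stride-1 convolutions and non-overlapping pooling layers:

def receptive_field(layers):
    """layers: sequence of ('conv', kernel_size) or ('pool', pool_size) tuples."""
    r, jump = 1, 1                  # receptive field and step between output samples
    for kind, size in layers:
        r += (size - 1) * jump      # every layer widens the receptive field
        if kind == 'pool':
            jump *= size            # pooling increases the step between outputs
    return r

# The stack we build below: conv3, conv3, pool2, conv3, pool2
print(receptive_field([('conv', 3), ('conv', 3), ('pool', 2),
                       ('conv', 3), ('pool', 2)]))   # 12

A receptive field of 12 points roughly matches the dozen points our anomaly spans.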

Furthermore, since the anomaly can appear anywhere along the curve, MaxPooling1D introduced in
Chapter 6 may help to reduce the sensitivity to the exact location.

Finally we will need to include a few nonlinear activations, a Flatten layer (seen in Chapter 6), and one or
more fully connected layers. Let’s do it!

First, let’s import the relevant layers from the tensorflow.keras package:

In [23]: from tensorflow.keras.layers import Conv1D, MaxPool1D
from tensorflow.keras.layers import Flatten, Activation

Next, let’s clear out the backend memory, just in case:

In [24]: K.clear_session()

Next, let’s build the model with our layers, considering again the 200 points as input:

In [25]: model = Sequential([
    Conv1D(16, 3, input_shape=(200, 1)),
    Conv1D(16, 3),
    MaxPool1D(),
    Activation('relu'),

    Conv1D(16, 3),
    MaxPool1D(),
    Activation('relu'),

    Flatten(),

    Dense(10, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])

Conv1D requires the input data to have shape (N_samples, N_timesteps, N_features) so we need to add
a dummy dimension to our data.

In [26]: X_train_t = X_train[:, :, None]
X_test_t = X_test[:, :, None]

Let’s train our model over 30 epochs:

In [27]: h = model.fit(X_train_t, y_train, epochs=30, verbose=0,
                       validation_split=0.1)

Now let’s plot the accuracy of our model using our 1D convolutional Neural Network:

In [28]: plt.plot(h.history['accuracy'])
plt.plot(h.history['val_accuracy'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');

Accuracy: training and validation accuracy per epoch (plot)

In [29]: acc_ = model.evaluate(X_test_t, y_test, verbose=0)[1]
print("Test Accuracy: {:0.3}".format(acc_))

Test Accuracy: 0.985

This model is the best so far, and it required no feature engineering. We just reasoned about the type of
patterns we were looking for and chose the most appropriate Neural Network architecture to detect them.
This is very powerful!

Sequence Problems
Time series problems can be extended to consider general problems involving sequences. In other words,
we can consider a time series as a particular type of sequence, where every element of the sequence is
associated with a time. But, in general, we may have sequences of elements not associated with a specific
time: for example, a word can be thought of as a sequence of characters, or a sentence as a sequence of words.
Similarly, the interactions of a user in an app form a sequence of events, and it is a very common use case to
try to classify such a sequence or to predict what the next event is going to be.

More generally, we are going to introduce here a few general scenarios involving sequences, that will stretch
our application of Machine Learning to new problems.

Let’s start with 1-to-1 problems.

1-to-1

The simplest Machine Learning problem involving a sequence is the 1-to-1 problem. All the Machine
Learning problems we have encountered so far are of this type: linear regression, classification, and
convolutional Neural Networks for image or sequence classification. For each input we have one label, for
each image in MNIST we have a digit label, for each user, we have a purchase, for each banknote, we have a
label of real or fake.

In all of these cases, the Machine Learning model learns a stateless function to connect a given input to a
given output.

In the case of sequences, we can expand our framework to allow for the model to make use of past values of
the input and the output. Let’s see how.

1-to-many

The 1-to-many problem starts like the 1-to-1 problem. We have an input, and the model generates an output.
After the first output, the network continues to generate a sequence of outputs using its internal state or the
previous output as input. We can continue like this indefinitely and therefore generate an arbitrary sequence
of outputs.

Problems involving sequences



A typical example of this situation is image captioning: a single image in input generates as output a text
description of arbitrary length.

TIP: a text description can be thought of as a sequence either of words or of characters. Every single word or character is indeed an element of the sequence.

many-to-1

The many-to-1 problem reverses the situation. We feed multiple inputs to the network, and at each step, we
also feed the network output back into the network, until we reach the end of the input sequence. At this
point, we look at the network output.

Text sentiment analysis falls in this category. We associate a single output sentiment label (positive or
negative) to a string of text of arbitrary length in the input.

asynchronous many-to-many

In the asynchronous many-to-many case, we have a sequence in input and a sequence in output. The
model first learns to encode an input sequence of arbitrary length into the internal state. Then, when the
sequence ends, the model starts to generate a new sequence.

The typical application for this setup is language translation, where an input sentence in a language, for
example, English, is translated to an output sentence in a different language, for example, Italian. To
complete the task correctly, the model has to “listen” to the whole input sentence first. Once the sentence is
over, the model goes ahead and translates that into the new sentence.

synchronous many-to-many

Finally, there’s the synchronous many-to-many case, where the network outputs a value at each input,
considering both the input and its previous state. Video frame classification falls in this category because
for each frame we produce a label using the information from the frame but also the information from the
state of the network.

RNNs allow graphs with cycles

Recurrent Neural Networks can deal with all these sequence problems because their connections form a
directed cycle. In other words, they can retain state from one iteration to the next by using their output as
input for the next step. This is similar to infinite impulse response filters in signal processing.

In programming terms, this is like running a fixed program with defined inputs and some internal variables.
Viewed this way, RNNs are networks that learn generic programs.

RNNs are Turing-Complete, which means they can simulate arbitrary programs! We can think of

feed-forward Neural Networks as approximating arbitrary functions and Recurrent Neural Networks as
approximating arbitrary programs. This makes them extremely powerful.

Time series forecasting


We have seen how to solve some classification problems involving time series with Convolutional Neural
Networks.

The previous dataset, however, was quite unusual for many reasons. First of all, each sample sequence in the
dataset had precisely the same duration, and each curve included exactly 200 timesteps. Secondly, we had
no information about the order of the samples, and so we considered them as independent measurements
and performed train/test split in the usual way.

Both these conditions are not generally present when dealing with forecasting problems on time series or
text data. A time series can have arbitrary length, and it usually comes with a timestamp, indicating the
absolute time of each sample.

Let’s load a new dataset and let’s see how recurrent networks can help in this case.

First of all we are going to load the dataset:

In [30]: fname = '../data/ZonalDemands_2003-2016.csv.bz2'
df = pd.read_csv(fname, compression='bz2',
                 engine='python')

In [31]: df.head(3)

Out[31]:

Date Hour Total Ontario Northwest Northeast Ottawa East Toronto Essa Bruce Southwest Niagara West Tot Zones diff
0 01-May-03 1 13702 809 1284 965 765 4422 622 41 2729 617 1611 13865 163
1 01-May-03 2 13578 825 1283 923 752 4340 602 43 2731 615 1564 13678 100
2 01-May-03 3 13411 834 1277 910 751 4281 591 45 2696 596 1553 13534 123

In [32]: df.tail(3)

Out[32]:

Date Hour Total Ontario Northwest Northeast Ottawa East Toronto Essa Bruce Southwest Niagara West Tot Zones diff
119853 2016/12/31 22 15195 495 1476 1051 1203 5665 1045 72 2986 465 1334 15790 595
119854 2016/12/31 23 14758 495 1476 1051 1203 5665 1045 72 2986 465 1334 15790 1,032
119855 2016/12/31 24 14153 495 1476 1051 1203 5665 1045 72 2986 465 1334 15790 1,637

The dataset contains hourly electricity demands for different parts of Canada and it runs from May 2003 to
December 2016. Let’s create a pd.DatetimeIndex using the Date and Hour columns.

In [33]: def combine_date_hour(row):
             date = pd.to_datetime(row['Date'])
             hour = pd.Timedelta("%d hours" % row['Hour'])
             return date + hour

Let's run this function over our data to generate the DatetimeIndex entry for each row:

In [34]: idx = df.apply(combine_date_hour, axis=1)

In [35]: idx.head()

Out[35]:

0
0 2003-05-01 01:00:00
1 2003-05-01 02:00:00
2 2003-05-01 03:00:00
3 2003-05-01 04:00:00
4 2003-05-01 05:00:00

In [36]: df = df.set_index(idx)

TIP: the function set_index() returns a new DataFrame whose index (row labels) has been set to the values of one or more existing columns. Unless you use the inplace=True argument this does not alter the DataFrame; it simply returns a different version. That's why we overwrite the original df variable.

Now that we have set the index, let’s select and plot the Total Ontario column:

In [37]: df['Total Ontario'].plot()

Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd43178b128>


Total Ontario hourly electricity demand, 2003-2016 (plot)

The time series seems quite regular, which looks promising for forecasting. Let’s split the data in time on
January 1st, 2014. We will use data before that date as training data and data after that date as test data.

In [38]: split_date = pd.Timestamp('01-01-2014')

Now we copy the data to a pair of new Pandas data frames that only contain the Total Ontario data up to
the split date (train) and after the split date (test).

In [39]: train = df.loc[:split_date, ['Total Ontario']].copy()
test = df.loc[split_date:, ['Total Ontario']].copy()

TIP: We use the .copy() command here because the .loc indexing command may return
a view on the data instead of a copy. This could be a problem later on when we do other
selections or manipulations of the data.

Let’s plot the data. We will use the matplotlib plotting function that is automatically aware of the index with
dates and times and assign a label to each plot so that we can display them with a legend:

In [40]: plt.figure(figsize=(15,5))
plt.plot(train, label='train')
plt.plot(test, label='test')
plt.legend()
plt.title("Energy Consumption in Ontario 2003 - 2017");

Energy Consumption in Ontario 2003 - 2017 (plot of the train and test periods)

We’ve already seen in Chapter 3 that Neural Network models are quite sensitive to the absolute size of the
input features. Passing features with very large or very small values will not help our model converge to a
solution. Hence, we should rescale the data before anything else.

Notice that there’s a considerable drop somewhere in 2003. We shouldn’t use that as the minimum for our
analysis since it is an outlier.

We will rescale the data in such a way that most of it is close to 1. We can achieve this by subtracting 10000, which shifts everything down, and then dividing by 5000.

TIP: feel free to adjust these values as you prefer, or to try out other scaling methods like
the MinMaxScaler or the StandardScaler. The important thing is to get our data close
to 1 in size, not exactly between 0 and 1.
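
For reference, this is what the MinMaxScaler alternative mentioned in the TIP could look like (a sketch using the train and test DataFrames defined above; note that the scaler is fit on the training period only, so that no information from the test period leaks into training). In this chapter we stick with the simpler offset-and-scale approach below.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_mm = scaler.fit_transform(train)   # fit on the training period only
test_mm = scaler.transform(test)         # reuse the same scaling for the test period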

In [41]: offset = 10000
scale = 5000

train_sc = (train - offset) / scale
test_sc = (test - offset) / scale

Let's look at the first four dates and demand values to make sure our data is in the expected range.

In [42]: train_sc[:4]

Out[42]:

Total Ontario
2003-05-01 01:00:00 0.7404
2003-05-01 02:00:00 0.7156
2003-05-01 03:00:00 0.6822
2003-05-01 04:00:00 0.7002

Let’s plot our entire dataset to confirm it matches our expectation.

In [43]: plt.figure(figsize=(15,5))
plt.plot(train_sc, label='train')
plt.plot(test_sc, label='test')
plt.legend()
plt.title("Energy Consumption Scaled Data");

Energy Consumption Scaled Data (plot)

We are finally ready to build a predictive model. Our target is going to be the value of the energy demand at
a certain point in time and our first model will try to predict such value from the value in the preceding
hour. Our model will thus only have one feature, and the labels will come from the same data, shifted in
time by one hour.

There is a neat trick to generate both labels and features from the same train_sc timeseries. This is
achieved by taking every single point in the timeseries except the last one as features:

In [44]: X_train = train_sc[:-1].values

and every single point starting from the second one until the last one (included) as labels:

In [45]: y_train = train_sc[1:].values

This creates two timeseries shifted by one hour, which will become the features and labels to our model. We
repeat the same process with our test dataset:

In [46]: X_test = test_sc[:-1].values
y_test = test_sc[1:].values

We can check a few input and output values by printing the first 5 points in X_train and y_train:

In [47]: X_train[:5]

Out[47]: array([[0.7404],
[0.7156],
[0.6822],
[0.7002],
[0.802 ]])

In [48]: y_train[:5]

Out[48]: array([[0.7156],
[0.6822],
[0.7002],
[0.802 ],
[1.0226]])

As you can see, the first value in y_train corresponds to the second value in X_train, meaning that we
will be using the first value of X_train to predict the next value, and so on. Now we have our training data as
well as testing data mapped out. Let’s move on to model building.

Fully connected network

Let's train a fully connected network and see that it is not really able to predict the next value from the previous one.

The network will have a single input (the previous hour value) and a single output.

We can see this as a simple regression problem since we want to establish a connection between two
continuous variables.

TIP: if you need a refresher on what a regression is and why it makes sense to use it here,
have a look at Chapter 3 where we used a Linear regression to predict the weight of
individuals given their height.

Since we want to predict a continuous variable, the output of the network does not need an activation
function, and we will use the mean_squared_error as loss function, which is a standard error metric in
regression models.

Let’s clear the backend of any held memory first, as we have done many times when building a new model:

In [49]: K.clear_session()

Next, let’s build our model:

In [50]: model = Sequential([
    Dense(24, input_dim=1, activation='relu'),
    Dense(12, activation='relu'),
    Dense(6, activation='relu'),
    Dense(1)
])

model.compile(optimizer='adam',
loss='mean_squared_error')
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 24) 48
_________________________________________________________________
dense_1 (Dense) (None, 12) 300
_________________________________________________________________
dense_2 (Dense) (None, 6) 78
_________________________________________________________________
dense_3 (Dense) (None, 1) 7
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________

In this case, before fitting the Neural Network we just built, we load the EarlyStopping callback to halt the training if it is not improving.

TIP: a callback is a set of functions to be applied at each epoch during the training. We
have already encountered them in Exercise 4 of Chapter 5. You can pass a list of callbacks
to the .fit() method, and in this specific case we use the EarlyStopping callback to
stop the training if there is no progress. According to the documentation, monitor defines
the quantity to be monitored (the mean_squared_error in this case) and patience
defines the number of epochs with no improvement after which it will stop the training.

In particular, we will set the EarlyStopping callback to monitor the value of the loss and stop the training
loop with a patience=3 if that does not improve. Without this callback, the training will plateau on a fixed
loss without improving, and the training will not stop by itself (go ahead and try to confirm that!).

In [51]: from tensorflow.keras.callbacks import EarlyStopping

In [52]: early_stop = EarlyStopping(monitor='loss',
                                    patience=3,
                                    verbose=1)

Now we can launch the training, using this callback to monitor the progress of the data.

Our dataset has over 100k points so we can choose large batches.

In [53]: model.fit(X_train, y_train, epochs=200,
                   batch_size=512, verbose=0,
                   callbacks=[early_stop])

Epoch 00012: early stopping

Out[53]: <tensorflow.python.keras.callbacks.History at 0x7fd620451c50>

The model stopped improving quite quickly. Feel free to experiment with other architectures and other
activation functions. Let’s see how our model is doing. We can generate the predictions on the test set by
running model.predict.

In [54]: y_pred = model.predict(X_test)



Let’s visually compare test values and predictions:

In [55]: plt.figure(figsize=(15,5))
plt.plot(y_test, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.title("True VS Pred Test set, Fully Connected");

True VS Pred Test set, Fully Connected (plot)

They seem to overlap pretty well. Is it so? Let’s zoom in and watch more closely. We will do this by using the
plt.xlim function that sets the boundaries of the horizontal axis in a plot. Feel free to choose other values
to inspect other regions of the plot. Also, notice that we lost the date labels when we created the data, but
this is not a problem: we can always bring them back from the original series if we need them.

In [56]: plt.figure(figsize=(15,5))
plt.plot(y_test, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.xlim(1200,1300)
plt.title("Zoom True VS Pred Test set, Fully Connected");

Zoom True VS Pred Test set, Fully Connected (plot, x range 1200-1300)

Fully connected network evaluation

Is this a good model? At first glance, we may be tempted to say it is.

Let's measure the total mean squared error (a.k.a. our total loss) and the R² score on the test set. As seen in Chapter 3, if the R² score is far from 1.0, that is a sign of a bad regression.

TIP: If you need a refresher about the Mean Squared Error and the R² score, how they are defined and used, take a look at Chapter 3.

In [57]: from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score

In [58]: mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))

MSE: 0.0149
R2: 0.933

In this case, however, the R² score is quite high, which would lead us to think the model is quite good.

However, the model is not good at all!

Why do we say that the model is not good at all? If you scrutinize the graph, you will realize that the
network has just learned to repeat the same value it receives in input!

This is not forecasting at all. In other words, the model has no real predictive power. It behaves like a parrot that repeats yesterday's value for today. In this particular case, since the curve varies smoothly, the differences between one day and the next are small, and the R² score is still pretty close to 1. However, the model is not anticipating any future value, and so it would be quite useless for forecasting. One easy way to see this is to measure the correlation between the predicted values and the correct labels, and then repeat the measure of correlation with labels shifted in time.

If the model was good at forecasting we expect the highest correlation to happen for a zero shift, with
decreasing correlation when we shift the labels either forward or backward in time. Let’s plug our test data
into Pandas Series:

In [59]: y_test_s = pd.Series(y_test.ravel())
y_pred_s = pd.Series(y_pred.ravel())

and now let’s measure the correlation for a few values of the shift:

In [60]: for shift in range(-5, 5):
             y_pred_shifted = y_pred_s.shift(shift)
             corr = y_test_s.corr(y_pred_shifted)
             print("Shift: {:2}, Corr: {:0.2}".format(shift, corr))

Shift: -5, Corr: 0.63
Shift: -4, Corr: 0.76
Shift: -3, Corr: 0.88
Shift: -2, Corr: 0.97
Shift: -1, Corr: 1.0
Shift: 0, Corr: 0.97
Shift: 1, Corr: 0.88
Shift: 2, Corr: 0.76
Shift: 3, Corr: 0.63
Shift: 4, Corr: 0.5

As you can see, the highest correlation for this model is found for a shift of -1, which validates our previous
interpretation. The model is simply copying the input value to the output.

This behavior is not surprising. After all, the only feature our model knew was the value of the time series in
the previous period, so it makes sense that the best it could do was to learn to repeat it as a prediction for
what would come next.

Let’s see if a recurrent network improves the situation.



Recurrent Neural Networks

Vanilla RNN

As we introduced, Recurrent Neural Networks are able to maintain an internal state using feedback loops.
Let’s see how we could build a simple RNN.

The Vanilla Recurrent Neural Network can be built as a fully connected Neural Network if we unroll the
time axis.

Unrolling time for a vanilla RNN

Ignoring the output of the network for the time being, let's focus on the recurrent aspect. The network is recurrent because its internal state h at time t is obtained by mixing the current input x_t with the previous value of the internal state h_{t-1}:

h_t = tanh(w h_{t-1} + u x_t)    (7.1)

At each instant of time, the simple RNN behaves as a fully connected network with two inputs: the current input x_t and the previous output h_{t-1}.

TIP: Notice that for now, we are using a network with a single input and a single output, so
both x and h are numbers. Later we will extend the notation to networks with multiple
inputs and multiple recurrent units in a layer. As you will see the extension is quite simple.
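
To make the recurrence concrete, here is a minimal sketch of equation (7.1) for the single-input, single-unit case (the values of w, u and x are made up for illustration; in Keras the weights are learned during training):

import numpy as np

w, u = 0.5, 1.0
x = np.array([0.74, 0.72, 0.68, 0.70, 0.80])   # a few input values

h = 0.0          # initial internal state
states = []
for x_t in x:
    h = np.tanh(w * h + u * x_t)   # mix the previous state with the current input
    states.append(h)

print(states)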

Notice that only two weights are involved: the weight w multiplying the previous value of the output and the weight u multiplying the current input. By the way, doesn't this formula remind you of the Exponentially Weighted Moving Average (or EWMA)?

TIP: we have already mentioned EWMA in Chapter 5, and we explain it in the appendix. Just as a reminder, it's a simple smoothing algorithm that follows the formula:

y_t = (1 - α) y_{t-1} + α x_t    (7.2)

EWMA smooths a signal given by a sequence of data, reducing its fluctuations.
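
As an aside, pandas implements this kind of smoothing directly; a one-line sketch on a few made-up values (alpha plays the same role as in the formula above):

import pandas as pd

s = pd.Series([0.74, 0.72, 0.68, 0.70, 0.80])
smoothed = s.ewm(alpha=0.1).mean()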

It is not the same, because there is a tanh and the two weights are independent but it does look similar: it’s a
linear mixing of the past output with the present input, followed by a nonlinear activation function.

Also, notice that the weights do not depend on time. The network is learning the best values of its two
weights which don’t change with time.

Deep Vanilla RNN

We can build deep recurrent Neural Networks by stacking recurrent layers onto one another. We feed the
input to a first layer and then feed the output of that layer into a second layer and so on. Also, we can add
multiple recurrent units in each layer. Each unit is receiving inputs from all the units in the previous layer
(or the input) as well as all the units in the same layer at the previous time:

If we have multiple layers, we will need to make sure that an earlier layer returns the whole sequence of
outputs to the next layer. This is achieved in Keras using the return_sequences=True optional argument
when defining a layer. We will see an example of this in Exercise 1.
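
Schematically, a stacked recurrent model looks like this in Keras (a sketch only, not the model we build below; the layer sizes and input shape are arbitrary):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

stacked = Sequential([
    SimpleRNN(8, return_sequences=True, input_shape=(128, 1)),   # passes the whole sequence on
    SimpleRNN(8),   # the last recurrent layer returns only its final output
    Dense(1)
])
stacked.summary()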

Keras implements Vanilla Recurrent Layers with the layers.SimpleRNN class. Let’s try it out on our
forecasting problem. First of all we import it.

In [61]: from tensorflow.keras.layers import SimpleRNN

The documentation for recurrent layers reads:

Input shape

3D tensor with shape (batch_size, timesteps, input_dim).



Multiple RNN layers

However, so far we have used only tensors of order two for our data. Let's think about how to reshape our data, because there's more than one way. Our input data right now has a shape of:

In [62]: X_train.shape

Out[62]: (93551, 1)

So it's like a matrix with a single column. We want to add a dimension so that the data becomes a tensor of order three. There are many ways of doing this; the simplest is to reshape the tensor using the .reshape method. It's as if we were breaking our timeseries into a dataset of adjacent and disjoint windows and then piling them one on top of the other.

Let’s define the window length to be 128 hours (a little over five days):

In [63]: win_len = 128

Now we define a helper function that reshapes the data into a tensor of order three. Notice that this function
will pre-pad our sequence with zeros since we need to make sure that the total length is divisible by
win_len:

In [64]: def reshape_tensor(x, l):
             orig_len = x.shape[0]
             n, r = divmod(orig_len, l)
             max_len = l * (n + 1)
             offset = max_len - orig_len
             new_array = np.zeros(max_len)
             new_array[offset:] = x.ravel()
             return new_array.reshape(n + 1, l, 1)

And let’s test it on a short array with 10 elements, so that we understand its behavior:

In [65]: example_array = np.arange(100,110)
example_array

Out[65]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])

In [66]: example_t = reshape_tensor(example_array, 3)

The shape of example_t is:

In [67]: example_t.shape

Out[67]: (4, 3, 1)

and if we check its content:

In [68]: example_t

Out[68]: array([[[ 0.],


[ 0.],
[100.]],

[[101.],
[102.],
[103.]],

[[104.],
[105.],
[106.]],

[[107.],
[108.],
[109.]]])

we see that the elements of the array are now arranged into 4 adjacent sequences of 3 elements each. Wonderful! Let's now convert our Train and Test sets:

In [69]: X_train_t = reshape_tensor(X_train, win_len)
X_test_t = reshape_tensor(X_test, win_len)

y_train_t = reshape_tensor(y_train, win_len)
y_test_t = reshape_tensor(y_test, win_len)

Let’s check the shape of our new variable X_train_t:

In [70]: X_train_t.shape

Out[70]: (731, 128, 1)

Good! We reshaped the data to have one additional axis as requested. Notice that we reshaped the labels too,
which may appear odd at first. In a few lines we will explain why.

First, let’s think about the batches. If we want to pass the whole training set in order, we need to pass one
sequence per batch and we need to pass the batches in order from the start of time.

In other words, we don’t want to randomly sample batches from the sequence; we want to feed the windows
one by one sequentially while maintaining the state of the network between one window and the next.

We can do this by setting the stateful=True argument in the layer, but it requires that the size of our data
is exactly a multiple of the batch size.

Since we want to feed the points one by one, we will choose a batch_size=1. Let’s do it!

Now let’s create a SimpleRNN with one layer with six nodes, i.e., with six recurrent units in the layer. The
principle is the same as above, only each of these units will receive a six-dimensional vector as recurrent
input from the past, together with the single value of the actual input.

TIP: The number of nodes here is arbitrary. We could choose to put many more nodes, but
that would result in a bigger model which is slower to train. We have noticed that with six
nodes results are acceptable, and hence we choose that value.

Notice that since we are using the stateful=True flag, we will need to pass the batch_input_shape to
the first layer.

We will use the Adam optimizer (one of the most efficient and robust optimizers, as seen in Chapter 5), adopting a small value for the learning rate, since the SimpleRNN can sometimes be unstable.

In [71]: from tensorflow.keras.optimizers import Adam, RMSprop

Let’s clear the backend memory again:

In [72]: K.clear_session()

Now let's build the model. We will pass batch_input_shape=(1, win_len, 1) because we feed the data one sequence at a time. Also, we will set the input weights to one. We do this in order to reduce the variability in the results obtained; as you'll see, this model is quite unstable.

TIP: if you get a result that is very different from the one of the book, go ahead and
re-initialize the model. It may be just a case of bad luck with the starting point in the
minimization.

We need one last ingredient to create our model. As mentioned above, we have reshaped the labels as well as
the inputs. This means that we will feed an input sequence and the model should return an output sequence
to be compared with the sequence of values in the labels. See the figure:

This method of training RNNs is called [Teacher Forcing](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). To achieve this, we will use the TimeDistributed layer wrapper around our output Dense layer, and we will instruct the SimpleRNN to return the whole sequence using return_sequences=True instead of returning just the last output.

In [73]: from tensorflow.keras.layers import TimeDistributed

In [74]: model = Sequential()
model.add(SimpleRNN(6,
                    batch_input_shape=(1, win_len, 1),
                    kernel_initializer='ones',
                    return_sequences=True,
                    stateful=True))
model.add(TimeDistributed(Dense(1)))

model.compile(optimizer=Adam(lr=0.0005),
              loss='mean_squared_error')

TimeDistributed output layer (figure)

In [75]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn (SimpleRNN) (1, 128, 6) 48
_________________________________________________________________
time_distributed (TimeDistri (1, 128, 1) 7
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________

Now we can fit the data. Since we are maintaining state between batches, we shall pass the data in order using the shuffle=False flag and batch_size=1. Also, we run the training for just a few epochs. In our experiments this should be sufficient to get decent results:

In [76]: model.fit(X_train_t, y_train_t,
                   epochs=3,
                   batch_size=1,
                   verbose=1,
                   shuffle=False);

Epoch 1/3
731/731 [==============================] - 53s 72ms/sample - loss: 0.2998
Epoch 2/3
731/731 [==============================] - 51s 70ms/sample - loss: 0.0499
Epoch 3/3
731/731 [==============================] - 52s 71ms/sample - loss: 0.0213

Let’s plot a small part of our predictive model to compare train and test. We will use the Numpy .ravel()
method to flatten the tensors back into a unidimensional sequence:

In [77]: y_pred = model.predict(X_test_t, batch_size=1)

plt.figure(figsize=(15,5))
plt.plot(y_test_t.ravel(), label='y_test')
plt.plot(y_pred.ravel(), label='y_pred')
plt.legend()
plt.xlim(1200,1300)
plt.title("Zoom True VS Pred Test set, SimpleRNN");

Zoom True VS Pred Test set, SimpleRNN (plot, x range 1200-1300)

While writing the book we noticed that the model may converge to a different solution at each training run:

1. Most often we get a graph that looks a little noisy but is close to the actual data in the sharp decay phases, meaning some forecasting power is achieved.
2. Sometimes we get a graph that looks very similar to the Fully Connected result, with no predictivity at all.
3. Sometimes the network gets stuck and gives nonsense results.

Feel free to change the number of layers, nodes, optimizer and learning rate to see if you can get better
results. You will notice that this model is very prone to diverging away from a small value of the loss, which
is not ideal at all.

Let's also check the MSE and the R² score:

In [78]: mse = mean_squared_error(y_test_t.ravel(), y_pred.ravel())
r2 = r2_score(y_test_t.ravel(), y_pred.ravel())

print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))

MSE: 0.00814
R2: 0.964

And let’s also repeat the correlation with time-shift measure:



In [79]: y_test_s = pd.Series(y_test_t.ravel())
y_pred_s = pd.Series(y_pred.ravel())

for shift in range(-5, 5):
    y_pred_shifted = y_pred_s.shift(shift)
    corr = y_test_s.corr(y_pred_shifted)
    print("Shift: {:2}, Corr: {:0.2}".format(shift, corr))

Shift: -5, Corr: 0.52
Shift: -4, Corr: 0.65
Shift: -3, Corr: 0.79
Shift: -2, Corr: 0.91
Shift: -1, Corr: 0.98
Shift: 0, Corr: 0.98
Shift: 1, Corr: 0.92
Shift: 2, Corr: 0.82
Shift: 3, Corr: 0.69
Shift: 4, Corr: 0.56

If you obtained a graph like the one in Case 1 above, you should get a lower MSE and the highest correlation for a shift of 0. On the other hand, if you obtained a graph like Case 2 above, you should obtain MSE and correlation values similar to those of the Fully Connected case.

All in all, this model seems a little unstable. The problem probably lies in the fact that the SimpleRNN has a short memory and cannot learn long-term patterns.

Let’s see why this happens and how we can fix it.

Recurrent Neural Network Maths

To fully understand how recurrent networks work and why our simple implementation fails, we will need a little bit of maths. As we suggested in Chapter 5, you can feel free to skip this section entirely if you want to get to the working model. You can always come back to it later on if you are curious about how a recurrent network works.

Vanishing Gradients

Let's start from the equations of backpropagation through time; we will ignore the output of the network for now and focus on the recurrent part. This is also called an encoder network, since we discard the output.

Encoder Network

This network is encountered in many cases, for example when solving many-to-1 problems like sentiment
analysis or asynchronous many-to-many problems like machine translation.

Let’s rewrite the recurrent relations, that in this case are:

z_t = w h_{t-1} + u x_t    (7.3)
h_t = ϕ(z_t)    (7.4)

where we replaced the tanh activation function with a generic activation ϕ.

We can now use the overline notation introduced in Chapter 5 to study the backpropagation through time.

If we assume we have already backpropagated from the output all the way back to the error signal h̄_T, we can write the backpropagation relations as:

h̄_t = z̄_{t+1} w    (7.6)
z̄_t = h̄_t ϕ′(z_t)    (7.7)

Let's focus our attention on h̄_t, and let's propagate back all the way to h̄_0:

h̄_0 = w z̄_1    (7.9)
    = w h̄_1 ϕ′(z_1)    (7.10)
    = w^2 h̄_2 ϕ′(z_1) ϕ′(z_2)    (7.11)
    ... = w^T h̄_T ϕ′(z_1) ϕ′(z_2) ... ϕ′(z_T)    (7.12)

Now, remembering the definition h̄ = ∂J/∂h, we can write:

h̄_0 = h̄_T ∂h_T/∂h_0    (7.14)

which implies:

∂h_T/∂h_0 = w^T ϕ′(z_1) ϕ′(z_2) ... ϕ′(z_T)    (7.15)

Now let's stop for a second and focus on ϕ′(z). For most activation functions (sigmoid, tanh, relu) this quantity is bounded. This is easily seen by looking at the graphs of the derivatives of these functions:

First, let’s define the sigmoid and relu functions:

In [80]: x = np.linspace(-10, 10, 1000)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    cond = x > 0
    return cond * x

Let's plot the sigmoid, the tanh, and the relu activation functions along with their derivatives.

In [81]: plt.figure(figsize=(12,8))
plt.subplot(321)
plt.plot(x, sigmoid(x))
plt.title('Sigmoid')

plt.subplot(322)
plt.plot(x[1:], np.diff(sigmoid(x))/np.diff(x))
plt.title('Derivative of Sigmoid')

plt.subplot(323)
plt.plot(x, np.tanh(x))
plt.title('Tanh')

plt.subplot(324)
plt.plot(x[1:], np.diff(np.tanh(x))/np.diff(x))
plt.title('Derivative of Tanh')

plt.subplot(325)
plt.plot(x, relu(x))
plt.title('Relu')

plt.subplot(326)
plt.plot(x[1:], np.diff(relu(x))/np.diff(x))
plt.title('Derivative of Relu')

plt.tight_layout()

Sigmoid, Tanh and Relu activation functions and their derivatives (plots)

All the derivatives take values between 0 and 1, i.e. they are bounded. We can use this fact to rewrite the last
equation as:

∂h_T/∂h_0 = w^T ϕ′(z_1) ϕ′(z_2) ⋯ ϕ′(z_T) ≤ w^T    (7.16)

which means that the derivative of the last output with respect to the first output is less than or equal to w^T.

At this point, the vanishing gradient problem should be evident. If w < 1 the propagation through time is
suppressed at each additional time step. The influence of an input that is three steps back in time will
contribute to the gradient with a term smaller than w^3. If, for example, w = 0.1, the previous point will
contribute with less than 10%, the one before with less than 1%, the one before that with less than 0.1%, and
so on. You can see that their contributions quickly disappear.

Let’s take a peek at what this looks like visually. First, let’s create a decay function that we’ll use to create our
plots.

In [82]: def decay(w, T):


t = np.arange(T)
b = w**t
return t, b

Now let’s plot the quantity w^T as a function of T for several values of w:

In [83]: ws = [0.1, 0.4, 0.7, 0.9]


for w in ws:
t, b = decay(w, 10)
plt.plot(t, b, 'o-')

plt.title("$w^T$")
plt.xlabel("Time steps T")
plt.legend(['w = {}'.format(w) for w in ws]);

w^T as a function of the number of time steps T, for w = 0.1, 0.4, 0.7 and 0.9

The error signal quickly goes to zero if the recurrent weight is smaller than 1. This suggests that the recurrent
model is only able to capture short time dependencies, while longer dependencies are lost.

Similarly, we can show that when w is greater than a certain threshold, the gradient will explode exponentially
over time (notice that the gradient contains a factor w^T), rendering the backpropagation unstable.
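As a quick illustration (this snippet is not part of the original notebooks), we can reuse the decay helper
defined above with a recurrent weight larger than 1 and watch the factor w**T blow up instead of shrinking:

for w in [1.1, 1.3, 1.5]:
    t, b = decay(w, 10)
    # b[-1] is w**9, the factor multiplying the contribution 9 steps back
    print("w = {}: after 9 steps the factor is {:0.1f}".format(w, b[-1]))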

It would appear as if we are stuck with a model that either does not converge at all or quickly forgets about
the past. How can we solve this?

Long Short-Term Memory Networks (LSTM)

LSTMs were designed to overcome the problems of simple Recurrent Networks by allowing the network to
store data in a sort of memory that it can access at later times. LSTM units are a bit more complicated than
the nodes we have seen so far, so let’s take our time to understand what this means and how they work.

Again, feel free to skip this section at first and come back to it later on.

We will start from an intuitive description of how LSTM works, and we will gradually approach the
mathematical formulas. We will do this because the formulas for the LSTM can be daunting at first, so it is
good to break them down and learn them gradually.

At the core of the LSTM is the internal state ct . This state is like an internal conveyor belt that carries
information from one time step to the next. In the general case, this is a vector, with as many entries as the
number of units in the LSTM layer. The LSTM unit will store information in this vector, and this
information will be available for retrieval later on.

At time t the LSTM block receives 2 inputs:

• the new data at time t, which we indicate as xt


• the previous output at time t − 1 which we indicate as ht−1 .

We concatenate these two inputs to create a single set of input features.

For example, if the input vector has length 3 (i.e., there are three input features) and the output vector has
length 2 (i.e., there are two output features), the concatenated vector has now five features, three coming
from the input vector and two coming from the output vector.

The next step is to apply four different simple Neural Network layers to these concatenated features along
four parallel branches. Each branch takes a copy of the features and multiplies them by an independent set
of weights and a different activation function.

TIP: You may be wondering why 4 and not 3 or 5. The reason is simple: one branch is the
one that will process the data similarly to the Vanilla RNN, i.e., it will take the past and the
present, weight them and send them through a tanh activation function. The other three
branches will control operations that we call gates. As you will see, these gates control how
the internal state stores past and present information. Other kinds of recurrent units, like
GRU, use a different number of gates, so four is specific to the LSTM architecture.

Notice that the weights here are matrices. The number of rows in the weight matrix corresponds to the
number of features, while the number of columns corresponds to the number of output features, i.e., the
number of nodes in the LSTM layer.

After the matrix multiplication with the weights, the results go through four independent nonlinear
activation functions.

Three of these are sigmoids, yielding the output vectors with values between 0 and 1. These three outputs
take the name of gates because they control the flow of information. The last one is not a sigmoid; it is a
tanh.

Let’s now look at the role of each of these nonlinear outputs. We start from the bottom one: the forget gate.

The role of the forget gate is to mediate how much of the internal state vector to keep and pass through to
future times. Since the value of this gate comes from a dense layer followed by a sigmoid, the LSTM node is
learning which fraction of the past data to retain and which fraction to forget.

Notice that the ⊙ operator means we are multiplying f_t and c_{t-1} elementwise. This also implies that they
are vectors of the same length.

Let’s look at the gate mediated by W_i. This is the input gate, and it mediates how much of the input to
keep. However, what gets gated is not the plain concatenated input vector, but the vector that went through
the tanh layer. We call it g_t.

This gating operation is also performed elementwise on gt .

We add the resulting vector it ⊙ gt to the fraction of internal state retained through the forget gate.

The new internal state ct is the result of these two operations: forgetting a bit of the past state and adding
some new elements coming from the input and the past output.

Now that we have the update rules for the internal state, let’s see how to calculate the output state from the
internal state, with one last tanh operation.

Now let’s look at the output gate ot . This gate mediates the output of the tanh only allowing part of it to
escape to the output.

The complete LSTM network graph

There we go! It looks complex, but that’s because it’s one of the most complicated units in Neural Networks.

We’ve just dissected the LSTM block that has revolutionized our ability to tackle problems with long term
dependencies. For example, LSTM blocks have been successfully used to learn the structure of language, to
produce code from text descriptions, to translate between language pairs and so on.

For the sake of completeness, we will write here the equations of the LSTM, though it’s not so important that
you learn them: Keras has them implemented in a conveniently available LSTM layer!

ℵ_t = [x_t, h_{t-1}]    (7.17)
i_t = σ(ℵ_t · W_i)    (7.18)
f_t = σ(ℵ_t · W_f)    (7.19)
o_t = σ(ℵ_t · W_o)    (7.20)
g_t = tanh(ℵ_t · W_g)    (7.21)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (7.22)
h_t = o_t ⊙ tanh(c_t)    (7.23)
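To make these formulas more concrete, here is a small NumPy sketch of a single LSTM step. This is an
illustration only, not part of the original notebooks: biases are omitted for brevity (Keras includes them), and
the toy dimensions at the bottom are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wo, Wg):
    a = np.concatenate([x_t, h_prev])   # eq. (7.17): concatenate input and previous output
    i_t = sigmoid(a.dot(Wi))            # eq. (7.18): input gate
    f_t = sigmoid(a.dot(Wf))            # eq. (7.19): forget gate
    o_t = sigmoid(a.dot(Wo))            # eq. (7.20): output gate
    g_t = np.tanh(a.dot(Wg))            # eq. (7.21): candidate values
    c_t = f_t * c_prev + i_t * g_t      # eq. (7.22): new internal state
    h_t = o_t * np.tanh(c_t)            # eq. (7.23): new output
    return h_t, c_t

# toy example: 3 input features, 2 LSTM units, 5 time steps
rng = np.random.RandomState(0)
Wi, Wf, Wo, Wg = [rng.randn(3 + 2, 2) for _ in range(4)]
h, c = np.zeros(2), np.zeros(2)
for x in rng.randn(5, 3):
    h, c = lstm_step(x, h, c, Wi, Wf, Wo, Wg)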

LSTM forecasting

Enough with math and theory! Let’s try to use an LSTM and see if we get a better result on our forecasting
problem.

Let’s import the LSTM layer from tensorflow.keras:



In [84]: from tensorflow.keras.layers import LSTM

Let’s clear the backend memory again to build our model:

In [85]: K.clear_session()

Now let’s build our model using the LSTM layer.

TIP: according to the documentation, the LSTM layer takes many arguments. In this
case, we create a layer with six recurrent nodes, like the six units we had in our fully connected
layer, and we will set batch_input_shape=(1, 1, 1) and stateful=True, i.e., the last
state for each data point will be used as the initial state for the next data point.

Like we did above, we will use the Adam optimizer with a small learning rate and initialize the input weights
to one:

In [86]: model = Sequential()


model.add(LSTM(6,
batch_input_shape=(1, win_len, 1),
kernel_initializer='ones',
return_sequences=True,
stateful=True))
model.add(TimeDistributed(Dense(1)))

model.compile(loss='mean_squared_error',
optimizer=Adam(lr=0.0005) )

Now let’s train our model. In doing so, we will use the X_train_t and y_train_t set for the training, the
already specified batch_size=1, because we feed the data one point at a time, and shuffle=False to pass
the data in order. We train the model for five epochs.

In [87]: model.fit(X_train_t, y_train_t,


epochs=5,
batch_size=1,
verbose=1,
shuffle=False);

Epoch 1/5
731/731 [==============================] - 4s 6ms/sample - loss: 1.7145
Epoch 2/5
731/731 [==============================] - 4s 5ms/sample - loss: 0.1155
Epoch 3/5
731/731 [==============================] - 4s 5ms/sample - loss: 0.0597
Epoch 4/5
731/731 [==============================] - 4s 5ms/sample - loss: 0.0392
Epoch 5/5
731/731 [==============================] - 4s 5ms/sample - loss: 0.0268

The LSTM takes much longer to train than SimpleRNN because it has many more weights to adjust. In a
future chapter, we will learn how to use GPUs to speed up the training.

To examine the effectiveness of our model, and as we did before, we can plot a small part of the time series
and compare our predictions with the true values:

In [88]: y_pred = model.predict(X_test_t, batch_size=1, )


plt.figure(figsize=(15,5))
plt.plot(y_test_t.ravel(), label='y_test')
plt.plot(y_pred.ravel(), label='y_pred')
plt.legend()
plt.xlim(1200,1300)
plt.title("Zoom True VS Pred Test set, LSTM");

Zoom True VS Pred Test set, LSTM

This should look better than what we have obtained previously, but even in this case, we see that the ability
of the network to forecast is limited. As done for the other models, let’s also check the Mean Squared Error
and the R 2 , for an objective evaluation of the error:

In [89]: mse = mean_squared_error(y_test_t.ravel(), y_pred.ravel())


r2 = r2_score(y_test_t.ravel(), y_pred.ravel())

print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))

MSE: 0.0157
R2: 0.93

and the correlation with time-shift measure:

In [90]: y_test_s = pd.Series(y_test_t.ravel())


y_pred_s = pd.Series(y_pred.ravel())

for shift in range(-5, 5):


y_pred_shifted = y_pred_s.shift(shift)
corr = y_test_s.corr(y_pred_shifted)
print("Shift: {:2}, Corr: {:0.2}".format(shift, corr))

Shift: -5, Corr: 0.59
Shift: -4, Corr: 0.72
Shift: -3, Corr: 0.85
Shift: -2, Corr: 0.95
Shift: -1, Corr: 0.99
Shift: 0, Corr: 0.97
Shift: 1, Corr: 0.89
Shift: 2, Corr: 0.77
Shift: 3, Corr: 0.64
Shift: 4, Corr: 0.51

As you can see, the correlation peaks very close to a shift of 0, which is a good indication that the
model has some forecasting power.

Improving forecasting
In all the models used so far, we fed the data sequentially to our recurrent unit, one point at a time. We can
train a recurrent layer in other ways, for example, using the rolling windows approach.

Rolling windows

Instead of taking a single previous point as input, we can use a set of points, going back in time for a
window. This will allow us to feed data to the network in larger batches, speeding up training and hopefully
improving convergence.

We will reformat our input tensor X to have the following shape: (N_windows, window_len, 1). By
doing this, we treat the time series as if it were composed of many independent windows of fixed length, and
we can treat each window as an individual data point. This has the advantage of allowing us to randomize the
windows in our train and test data.
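Conceptually, this is just a matter of stacking overlapping slices of the series. As an illustration only (below
we will instead build the windows with a pandas-based helper that also keeps track of the datetime index), a
hypothetical make_windows function could look like this:

import numpy as np

def make_windows(values, window_len):
    # Each row of X holds window_len consecutive values;
    # the corresponding target is the value that immediately follows.
    values = np.asarray(values).ravel()
    X = np.stack([values[i:i + window_len]
                  for i in range(len(values) - window_len)])
    y = values[window_len:]
    return X.reshape(-1, window_len, 1), y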

Let’s start by defining the window size. We’ll take a window of 24 periods, i.e., the data from the previous
day. You can always adjust this later on if you wish:

In [91]: window_len = 24

Next, we’ll use the .shift method of a pandas DataFrame to create lagged copies of our original time
series. Note that we will start from the train_sc and test_sc vectors we’ve defined earlier. Let’s double
check that they still contain what we need:

In [92]: train_sc.head()

Out[92]:

Total Ontario
2003-05-01 01:00:00 0.7404
2003-05-01 02:00:00 0.7156
2003-05-01 03:00:00 0.6822
2003-05-01 04:00:00 0.7002
2003-05-01 05:00:00 0.8020

To create the lagged data we define a helper function create_lagged_Xy_win that creates an input matrix
X with lags going from start_lag + window_len - 1 to start_lag and an output vector y with the
unaltered values.

So for example if we call: create_lagged_Xy_win(train_sc, start_lag=24, window_len=168) this


will return a dataset X where periods run from 8 days before to 24 hours before the corresponding value in y.

Let’s do it:

In [93]: def create_lagged_Xy_win(data, start_lag=1,


window_len=1):
X = data.shift(start_lag + window_len - 1).copy()
X.columns = ['T_{}'.format(start_lag + window_len - 1)]

if window_len > 1:
for s in range(window_len, 0, -1):
col_ = 'T_{}'.format(start_lag + s - 1)
X[col_] = data.shift(start_lag + s - 1)

X = X.dropna()
idx = X.index
y = data.loc[idx]
return X, y

Now we use the function on the train and test data. We will use start_lag=1 and window_len=24 so that
we can compare the results with our previous results:

In [94]: start_lag=1
window_len=24

X_train, y_train = create_lagged_Xy_win(train_sc,


start_lag,
window_len)

X_test, y_test = create_lagged_Xy_win(test_sc,


start_lag,
window_len)

Let’s take a look at our data:

In [95]: X_train.head()

Out[95]:
T_24 T_23 T_22 T_21 T_20 T_19 T_18 T_17 T_16 T_15 T_14 T_13 T_12 T_11 T_10 T_9 T_8 T_7 T_6 T_5 T_4 T_3 T_2 T_1
2003-05-02 01:00:00 0.7404 0.7156 0.6822 0.7002 0.8020 1.0226 1.3524 1.5536 1.6074 1.6382 1.6342 1.6242 1.6236 1.5976 1.6036 1.6290 1.6096 1.5486 1.5228 1.6008 1.5408 1.3096 1.0414 0.8694
2003-05-02 02:00:00 0.7156 0.6822 0.7002 0.8020 1.0226 1.3524 1.5536 1.6074 1.6382 1.6342 1.6242 1.6236 1.5976 1.6036 1.6290 1.6096 1.5486 1.5228 1.6008 1.5408 1.3096 1.0414 0.8694 0.7742
2003-05-02 03:00:00 0.6822 0.7002 0.8020 1.0226 1.3524 1.5536 1.6074 1.6382 1.6342 1.6242 1.6236 1.5976 1.6036 1.6290 1.6096 1.5486 1.5228 1.6008 1.5408 1.3096 1.0414 0.8694 0.7742 0.7218
2003-05-02 04:00:00 0.7002 0.8020 1.0226 1.3524 1.5536 1.6074 1.6382 1.6342 1.6242 1.6236 1.5976 1.6036 1.6290 1.6096 1.5486 1.5228 1.6008 1.5408 1.3096 1.0414 0.8694 0.7742 0.7218 0.6914
2003-05-02 05:00:00 0.8020 1.0226 1.3524 1.5536 1.6074 1.6382 1.6342 1.6242 1.6236 1.5976 1.6036 1.6290 1.6096 1.5486 1.5228 1.6008 1.5408 1.3096 1.0414 0.8694 0.7742 0.7218 0.6914 0.7018

In [96]: y_train.head()

Out[96]:

Total Ontario
2003-05-02 01:00:00 0.7742
2003-05-02 02:00:00 0.7218
2003-05-02 03:00:00 0.6914
2003-05-02 04:00:00 0.7018
2003-05-02 05:00:00 0.7904

As you can see, to predict the value 0.7742 that appears in y at 2003-05-02 01:00:00, in X we have the
previous values, going back in time from 0.8694 (one hour before, column T_1) to 1.0414 (two hours before, column T_2) and so on.

To feed this data to a recurrent model, we need to reshape it as a tensor of order 3 with the shape
(batch_size, timesteps, input_dim). We are still dealing with a univariate time series, so
input_dim=1, while timesteps is going to be 24, the number of timesteps in the window. This is easy to do
using the .reshape method from numpy.

We will get numpy arrays using the .values attribute. We have already checked that the data is shifted
correctly, so it’s not a problem to throw away the index and the column names:

In [97]: X_train_t = X_train.values.reshape(-1, window_len, 1)


X_test_t = X_test.values.reshape(-1, window_len, 1)

y_train_t = y_train.values
y_test_t = y_test.values

Let’s check the shape of our tensor is correct:

In [98]: X_train_t.shape

Out[98]: (93528, 24, 1)

Yes! We have correctly reshaped the tensor. Note here that if we had multiple time series, we could have
bundled them together in an input vector along the last axis.

Let’s build a new recurrent model. This time we will not need to use the stateful=True setting because
some history is already present in the input data and the windows are overlapping. For the same reason, we
will use input_shape instead of batch_input_shape.

Also, since we will use batches of more than one point, and each point contains much history, the model
convergence will be a lot more stable. Therefore we can increase the learning rate a lot without risking that
the model becomes unstable.

Finally, notice that we are not going to be using the TimeDistributed wrapper here, since we didn’t create
output sequences (i.e. no Teacher Forcing):

In [99]: K.clear_session()
model = Sequential()
model.add(LSTM(6, input_shape=(window_len, 1),
kernel_initializer='ones'))
model.add(Dense(1))

model.compile(loss='mean_squared_error',
optimizer=Adam(lr=0.05) )

Let’s go ahead and train our model using a batch of size 256 for five epochs. This may take some time. Later
in the book, we will learn how to speed it up using GPUs. For now, take advantage of this time with a little
break. You deserve it!

In [100]: model.fit(X_train_t, y_train_t,


epochs=5,
batch_size=256,
verbose=1);

Epoch 1/5
93528/93528 [==============================] - 1s 16us/sample - loss: 0.1394
Epoch 2/5
93528/93528 [==============================] - 1s 14us/sample - loss: 0.0079
Epoch 3/5
93528/93528 [==============================] - 1s 14us/sample - loss: 0.0059
Epoch 4/5
93528/93528 [==============================] - 1s 14us/sample - loss: 0.0055
Epoch 5/5
93528/93528 [==============================] - 1s 14us/sample - loss: 0.0054

Let’s generate the predictions and compare them with the actual values:

In [101]: y_pred = model.predict(X_test_t, batch_size=256)


plt.figure(figsize=(15,5))
plt.plot(y_test_t, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.xlim(1200,1300)
plt.title("Zoom True VS Pred Test set, LSTM with Windows");

Zoom True VS Pred Test set, LSTM with Windows

Let’s check the loss and R 2 score:

In [102]: mse = mean_squared_error(y_test_t, y_pred)


r2 = r2_score(y_test_t, y_pred)

print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))

MSE: 0.00487
R2: 0.978

And the correlation:



In [103]: y_test_s = pd.Series(y_test_t.ravel())


y_pred_s = pd.Series(y_pred.ravel())

for shift in range(-5, 5):


y_pred_shifted = y_pred_s.shift(shift)
corr = y_test_s.corr(y_pred_shifted)
print("Shift: {:2}, Corr: {:0.2}".format(shift, corr))

Shift: -5, Corr: 0.5
Shift: -4, Corr: 0.63
Shift: -3, Corr: 0.76
Shift: -2, Corr: 0.88
Shift: -1, Corr: 0.97
Shift: 0, Corr: 0.99
Shift: 1, Corr: 0.94
Shift: 2, Corr: 0.84
Shift: 3, Corr: 0.71
Shift: 4, Corr: 0.58

This model trained considerably faster than the previous ones, and its predictions should look much better.
First of all, the model seems to have learned the temporal pattern much better than
the other models: it’s not simply repeating the input like a parrot, it’s genuinely trying to predict the future.
Also, the two curves look quite close to one another, which is a great sign!

TIP: Try to re-initialize and re-train the model if its loss does not drop below 0.05
or your figure does not look like the one above.

One problem with recurrent models is that they tend to get stuck in local minima and be
sensitive to initialization. Also, keep in mind that we chose only six units in this network,
which is probably small for this problem.

Conclusion

Well done! You have completed the chapter on Time Series and Recurrent Neural Networks. Let’s recap
what we have learned.

• We learned how to classify time series of a fixed length using both fully connected and convolutional
Neural Networks
• We learned about recurrent Neural Networks and about how they allow us to approach new problems
with sequences, including generating a sequence of arbitrary length and learning from sequences of
arbitrary length
• We trained a fully connected network to forecast future values in a sequence
• We performed a deep dive into Recurrent Neural Networks, in particular the Long Short-Term
Memory network, to see what advantages they bring
• Finally, we trained an LSTM model to forecast values using both a single point as well as a window of
past data

Wow, this is a lot for a single chapter!

In the exercises, we will explore a couple of extensions of what we have done, and we will try to predict the
price of Bitcoin from its historical value!

Exercises

Exercise 1

Your manager at the power company is quite satisfied with the work you’ve done predicting the electric load
of the next hour and would like to push it further. He is curious to know if your model can predict the load
on the next day or even the next week instead of the next hour.

• Go ahead and use the helper function create_lagged_Xy_win we created above to generate new X
and y pairs where the start_lag is 36 hours or even further. You may want to extend the window
size to a little longer than a day.
• Train your best model on this data. You may have to use more than one layer, in which case
remember to use the return_sequences=True argument in all layers except for the last one so that
they pass sequences to one another (a minimal starting sketch follows this list).
• Check the goodness of your model by comparing it with the test data as well as looking at the R² score.
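A minimal starting sketch for the stacked model (this is only a suggestion, assuming you have already built
the windowed tensors X_train_t and y_train_t with the longer lag and that window_len is defined; the
architecture and hyperparameters are yours to tune):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
# return_sequences=True makes the first layer pass its full output
# sequence to the next recurrent layer
model.add(LSTM(12, input_shape=(window_len, 1), return_sequences=True))
model.add(LSTM(6))          # the last recurrent layer returns only the final state
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=Adam(lr=0.01))
model.fit(X_train_t, y_train_t, epochs=5, batch_size=256)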

In [ ]:

Exercise 2

Gated Recurrent Units (GRU) are a more modern and simpler implementation of a cell that retains longer-term
memory.

Their flow diagram is as follows:

GRU network graph

Keras makes them available in keras.layers.GRU. Try swapping the LSTM layer with a GRU layer and
re-train the model. Does its performance improve on the 36-hour lag task?
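A minimal sketch of this swap (assuming the windowed tensors X_train_t and y_train_t and window_len
from the previous section are still in scope; the hyperparameters are only placeholders):

from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam

K.clear_session()
model = Sequential()
model.add(GRU(6, input_shape=(window_len, 1)))   # GRU cell instead of LSTM
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=Adam(lr=0.05))
model.fit(X_train_t, y_train_t, epochs=5, batch_size=256, verbose=1)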

In [ ]:

Exercise 3

Does a fully connected model work well using windows? Let’s find out! Try to train a fully connected model
on the lagged data with windows, which will probably train much faster:

• reshape the input data back to an Order-2 tensor, i.e., eliminate the 3rd axis
• build a fully connected model with one or more layers
• train the fully connected model on the windowed data. Does it work well? Is it faster to train?

In [ ]:

Exercise 4

Disclaimer: past performance is no guarantee of future results. This is not investment


advice.

Predicting the price of Bitcoin from historical data.

You may have heard people talk about Bitcoin and how fast it is growing, so you decide to put your newly
acquired Deep Learning skills to the test and try to beat the market. The idea is simple: if we could predict
what Bitcoin is going to do in the future, we could trade and profit using that knowledge.

The simplest formulation of this forecasting problem is to try to predict if the price of Bitcoin is going to go
up or down in the future, i.e., we can frame the issue as a binary classification that answers the question: is
Bitcoin going up?

Here are the steps to complete this exercise:

1. Load the data from ../data/poloniex_usdt_btc.json.gz into a Pandas DataFrame. We


obtained this data through the public API of the Poloniex cryptocurrency exchange.

• Check out the data using df.head(). Notice that the dataset contains the close, high, low, open for 30
minutes intervals, which means: the first, highest, lowest and last amounts of US Dollars people were
willing to exchange Bitcoin for during those 30 minutes. The dataset also contains Volume values, that
we shall ignore, and a weighted average value, which is what we will use to build the labels.
• Convert the date column to a datetime object using pd.to_datetime and set it as the index of the
DataFrame.
• Plot the value of df['close'] to inspect the data. You will notice that it’s not periodic at all and it has an
overall enormous upward trend, so we will need to transform the data into a stationary time series.
We will use percentage changes, i.e., we will look at relative movements in the price instead of
absolute values.
• Create a new dataset df_percent with percent changes using the formula:

  v_t = 100 × (x_t − x_{t-1}) / x_{t-1}    (7.25)

  this is what we will use next.
• Inspect df_percent and notice that it contains both infinity and nan values. Drop the null values
and replace the infinity values with zero.
• Split the data on January 1st, 2017, using the data before then as training and the data after that as the
test.
• Use the window method to create an input training tensor X_train_t with the shape (n_windows,
window_len, n_features). This is the main part of the exercise since you’ll have to make a few choices
and be careful not to leak information from the future. In particular, you will have to:

– decide the window_len you want to use


– decide which features you’d like to use as input (don’t use weightedAverage, since we’ll need it
for the output).
– decide what lag you want to introduce between the last timestep in your input window and the
timestep of the output.
– You can start from the create_lagged_Xy_win function we defined earlier in this chapter, but you will
have to modify it to work with numpy arrays, because the pandas-based version is only good with
one feature.

• Create a binary outcome variable that is 1 when train[weightedAverage] >= 0 and 0 otherwise.
This variable is going to be our label.
• Repeat the same operations on the test data
• Create a model to work with this data. Make sure the input layer has the right input_shape and the
output layer has one node with a Sigmoid activation function. Also, make sure to use the
binary_crossentropy loss and to track the accuracy of the model.
• Train the model on the training data
• Test the model on the test data. Is the accuracy better than a baseline guess? Are you going to be rich?

Again disclaimer: past performance is no guarantee of future results. This is not investment
advice.

In [ ]:
8. Natural Language Processing and Text Data
In this chapter, we will learn a few techniques to approach problems involving text. This is a fundamental
topic, since textual data is widespread.

We will start by introducing text data, and some use cases of Machine Learning and Deep Learning applied
to text prediction. Then we will explore the traditional approach to text problems: the Bag of Words (BOW)
approach.

This topic will lead us to explore how to extract features from text. We will introduce new techniques to do
this, as well as a couple of new Python packages specifically designed to deal with text data.

We will explore the limitations of the BOW approach and see how Neural Networks can help to overcome
them. In particular, we will look at embeddings to encode text and at how to use them in Keras. Let’s get
started!

Use cases
As noted in the introduction text data is encountered in many applications. Let’s take a look at a few of
them. We are all familiar with Spam Detection. It is a text classification problem where we try to
distinguish legitimate documents from spammy ones.

Spam detection involves email spam, SMS spam, instant messaging spam and in general any corpus of
messages. The problem is framed as a binary classification where one has two sets of documents: the spam messages
and the “ham” messages, i.e., the legitimate messages that we would like to keep.

A similar binary classification problem involving text is that of Sentiment Analysis.


Imagine you are a rock star tweeting about your latest album. Millions of people will reply to your tweet,
and it will be impossible for you to read all of the messages from your fans. You would like to capture the
overall sentiment of your fan base and see if they are happy about what you tweeted.

Sentiment analysis does that by classifying a piece of text as having positive or negative overall sentiment. If
you know the sentiment for each tweet, it’s easy to extract results like: “74% of your fans responded
positively to your tweet”.

Many fields use Sentiment Analysis including stock trading, e-commerce reviews, customer service and in
general any website or application where users are allowed to submit free-form text comments.

Text problems

Extending beyond classification problems, we can consider regression problems involving text, for example
extracting a score, a price, or any other metric starting from a text document. An example of this would be
estimating the number of followers your tweet will generate based on its text content or predicting the
number of downloads your application will do based on the content of a blog article.

All the above problems are traditional Machine Learning problems where text is the input to the problem.
Text can also be the output of a Machine Learning problem. For example, Machine Translation involves
converting text from a language to another. It is a supervised, many-to-many, sequence learning problem,
where we feed pairs of sentences in two languages to a model that learns to generate the output sequence (for
example a sentence in English), given a particular input sequence (the corresponding sentence in Italian).

Machine translation is an example of a whole category of Machine Learning problems involving text:
problems involving automatic text generation. Another famous example in this category is that of
Language Modeling.

In Language Modeling, a corpus of documents (see the next section for a proper definition) is fed sequentially to
a model. The model learns the probability distribution of the next word given the words that precede it. The
model can then be sampled randomly and is capable of producing sentences that resemble the properties of the
corpus. Using this approach, people have had models produce new sonnets in the style of Shakespeare, new chapters of
Harry Potter, new episodes of popular novels and so on.

Since Language Modeling works on sequences, we can also build character-level models that learn the
syntax of our input corpus. In this way, we can produce syntactically correct markup languages like HTML,
Wiki, Latex and even C! See the wonderful article by Andrej Karpathy for a few examples of this.

It is clear that text is involved in several useful applications. So let’s see how to prepare text documents for
Machine Learning.

Text Data

Loading text data

Text data is usually a collection of articles or documents. Linguists call this collection a corpus to indicate
that it’s coherent and organized. For example, we could be dealing with the corpus of patents from our
company or with a corpus of articles from a news platform.

The first thing we are going to learn is how to load text data using Scikit-Learn. We will build a simple spam
detector to separate SMS messages containing spam from legitimate ones. The data comes from the UCI
SMS Spam collection, re-organized and compressed.

The file data/sms.zip is a compressed archive of a folder with the structure:

sms
|-- ham
| |-- msg_000.txt
| |-- msg_001.txt
| |-- msg_003.txt
| +-- ...
|
+-- spam
|-- msg_002.txt
|-- msg_005.txt
|-- msg_008.txt
+-- ...

Let’s extract all the data into the data folder:

First, let’s import the zipfile package from Python so that we can extract the data into folders:

In [1]: import zipfile



The zipfile module allows us to operate directly on zipped folders in our workspace. Have a look at the documentation for
further details. Here we use it to extract the data so that we can load it later:

In [2]: with zipfile.ZipFile('../data/sms.zip', 'r') as fin:


fin.extractall('../data/')

This last operation created a folder called sms inside the data folder. Let’s look at its content. The os
module contains many functions to interact with the host system. Let’s import it:

In [3]: import os

And let’s use the command os.listdir to look at the content of the folder:

In [4]: os.listdir('../data/sms')

Out[4]: ['ham', 'spam']

As expected there are two subfolders: ham and spam. We can count how many files they contain with the
help of the following little function that lists the content of path and uses a filter to only count files.

In [5]: from os.path import isfile, join

In [6]: def count_files(path):


files_list = [name for name in os.listdir(path)
if isfile(join(path, name))]
return len(files_list)

Let’s use this function to count the number of files in the folders:

In [7]: ham_count = count_files('../data/sms/ham/')


ham_count

Out[7]: 4825

In [8]: spam_count = count_files('../data/sms/spam/')


spam_count

Out[8]: 747

We have 4825 ham files and 747 spam files. We can use these numbers to establish a baseline for our
classification efforts:

In [9]: baseline_acc = ham_count / (ham_count + spam_count)


print("Baseline accuracy: {:0.3f}".format(baseline_acc))

Baseline accuracy: 0.866

If we always predicted the larger class, i.e. we never predicted spam, we would be correct 86.6% of the time.
Our model needs to score higher than that to be of any help.

Let’s also look at a couple of examples of our messages for each class:

In [10]: def read_file(path):


with open(path) as fin:
msg = fin.read()
return msg

In [11]: read_file('../data/sms/ham/msg_000.txt')

Out[11]: 'Go until jurong point, crazy.. Available only in bugis n great world la e
buffet... Cine there got amore wat...'

In [12]: read_file('../data/sms/ham/msg_001.txt')

Out[12]: 'Ok lar... Joking wif u oni...'

In [13]: read_file('../data/sms/spam/msg_002.txt')

Out[13]: "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA
to 87121 to receive entry question(std txt rate)T&C's apply
08452810075over18's"

In [14]: read_file('../data/sms/spam/msg_005.txt')

Out[14]: "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like
some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"

As expected, spam messages look quite different from ham messages. In order to start building a spam
detection model, let’s first load all the data into a dataset.

Scikit Learn offers a function to load text data from folders for classification purposes. Let’s use the
load_files function from sklearn.datasets package:

In [15]: from sklearn.datasets import load_files

In [16]: data = load_files('../data/sms/', encoding='utf-8')

data is an object of type:

In [17]: type(data)

Out[17]: sklearn.utils.Bunch

The documentation for a Bunch reads:

Dictionary-like object, the interesting attributes are: either


data, the raw text data to learn, or 'filenames', the files
holding it, 'target', the classification labels (integer index),
'target_names', the meaning of the labels, and 'DESCR', the full
description of the dataset.

so let’s look at the available keys:

In [18]: data.keys()

Out[18]: dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Following the documentation, let’s assign the data.data and data.target to two variables docs and y.

In [19]: docs = data.data



The first five text examples in our docs variable are:

In [20]: docs[:5]

Out[20]: ['Hi Princess! Thank you for the pics. You are very pretty. How are you?',
"Hello my little party animal! I just thought I'd buzz you as you were with
your friends ...*grins*... Reminding you were loved and send a naughty
adoring kiss",
'And miss vday the parachute and double coins??? U must not know me very
well...',
'Maybe you should find something else to do instead???',
'What year. And how many miles.']

In [21]: y = data.target

The first five entries in our y variable:

In [22]: y[:5]

Out[22]: array([0, 0, 0, 0, 0])

Before we do anything else, let’s save the data we have loaded as a DataFrame, just in case we need to reload
it later. As usual we import our common files:

In [23]: with open('common.py') as fin:


exec(fin.read())

In [24]: with open('matplotlibconf.py') as fin:


exec(fin.read())

and then create a Dataframe with all the documents

In [25]: df = pd.DataFrame(docs, columns=['message'])


df['spam'] = y
df.head()

Out[25]:

message spam
0 Hi Princess! Thank you for... 0
1 Hello my little party anim... 0
2 And miss vday the parachut... 0
3 Maybe you should find some... 0
4 What year. And how many mi... 0

Pandas allows us to save a DataFrame to a variety of different formats, including Excel, CSV and SAS. We will
export to CSV:

In [26]: df.to_csv('../data/sms_spam.csv',
index=False,
encoding='utf8')

Feature extraction from text

A Machine Learning algorithm is not able to deal with text as it is. Instead, we need to extract features from
the text!


Let’s begin with a naive solution and gradually build up to a more complex one. The simplest way to build
features from a text is to use the counts of certain words that we assume to carry information about the
problem.

For example, spam messages often offer something for free or give a link to some service. Since these are
SMS messages, this link will likely be a number. With these two ideas in mind, let’s build a very simple
classifier that uses only two features:

• The count of the occurrence of the word “free”.


• The number of numerical characters.

Notice that our text contains uppercase and lowercase words, so as a preprocessing step let’s convert
everything to lowercase, so that the same word with different capitalization does not produce duplicate features.

In [27]: docs_lower = [d.lower() for d in docs]

The first five entries in our docs_lower variable are:

In [28]: docs_lower[:5]

Out[28]: ['hi princess! thank you for the pics. you are very pretty. how are you?',
"hello my little party animal! i just thought i'd buzz you as you were with
your friends ...*grins*... reminding you were loved and send a naughty
adoring kiss",
'and miss vday the parachute and double coins??? u must not know me very
well...',
'maybe you should find something else to do instead???',
'what year. and how many miles.']

We can define a simple helper function that counts the occurrences of a particular word in a sentence:

In [29]: def count_word(word, sentence):


tokens = sentence.split()
return len([w for w in tokens if w == word])

and apply it to each document:

In [30]: free_counts = [count_word('free', d) for d in docs_lower]


df = pd.DataFrame(free_counts, columns=['free'])

In [31]: df.head()

Out[31]:

free
0 0
1 0
2 0
3 0
4 0

Similarly, let’s build a helper function that counts the numerical characters in a sentence using the re package:

In [32]: import re

In [33]: def count_numbers(sentence):


return len(re.findall('[0-9]', sentence))

In [34]: df['num_char'] = [count_numbers(d) for d in docs_lower]

In [35]: df.head()

Out[35]:

free num_char
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0

Spam classification

Notice that most messages don’t contain our special features, so we don’t expect any model to work super
well in this case, but let’s try to build one anyway. First, let’s import the train_test_split function from
sklearn as well as the usual Sequential model and the Dense layer:

In [36]: from sklearn.model_selection import train_test_split


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Now let’s define the helper function that follows the usual process that we repeated several times in the
previous chapters:

• Train/test split
• Model definition
• Model training
• Model evaluation on test set

We will use a simple Logistic Regression model to start, to make things simple and quick:

In [37]: def split_fit_eval(X, y, model=None,


epochs=10,
random_state=0):

X_train, X_test, y_train, y_test = \


train_test_split(X, y, random_state=random_state)

if not model:
model = Sequential()
model.add(Dense(1, input_dim=X.shape[1],
activation='sigmoid'))

model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])

h = model.fit(X_train, y_train,
epochs=epochs,
verbose=0)

loss, acc = model.evaluate(X_test, y_test)

return loss, acc, model, h

Executing the function with our values, we’ll capture the result in a variable we’ll call res:

In [38]: res = split_fit_eval(df.values, y)

1393/1393 [==============================] - 0s 81us/sample - loss: 0.3323 - accuracy: 0.9648

Let’s check the accuracy of our model:

In [39]: print("Simple model accuracy: {:0.3f}".format(res[1]))

Simple model accuracy: 0.965

Despite our initial skepticism, this dataset is easy to separate! It is so easy that two simple features (the
count of the word free and the count of numerical characters) already achieve a much better accuracy
score than the baseline, which was 0.866.

Bag of Words features

We can extend the simple approach of the previous model in a few ways:

• We could build a vocabulary with more than just one word, and build a feature for each of them
which counts how many times that word appears.
• We could filter out common English words.

Scikit Learn has a transformer that does exactly these two tasks, it’s called CountVectorizer. Let’s import
it from sklearn.feature_extraction.text:

In [40]: from sklearn.feature_extraction.text \


import CountVectorizer

Let’s plan on using the top 3000 most common words in the corpus; this is going to be our vocabulary size:

In [41]: vocab_size = 3000

Then we can initialize the vectorizer. Here we have to use the additional argument
stop_words='english' that tells the vectorizer to ignore common English stop words. We do this
because we are ranking features starting from the most common word. If we didn’t ignore common words,
we would end up with word features like “if ”, “and”, “of ” and so on, at the top of our list, since these words
are just ubiquitous in the English language. However, these words do not carry much meaning about spam
and by ignoring them we get word features that are more specific to our corpus.

We are also going to ignore decoding errors using the decode_error='ignore' argument:

In [42]: vect = CountVectorizer(decode_error='ignore',


stop_words='english',
lowercase=True,
max_features=vocab_size)

Notice that it also allows for automatic lowercase conversion. You can check what the stop words are using
the .get_stop_words() method. Let’s look at a few of them:

In [43]: stop_words = list(vect.get_stop_words())

In [44]: stop_words[:10]

Out[44]: ['fifty',
'again',
'move',
'its',

'as',
'forty',
'on',
'third',
'hasnt',
'alone']

Now that we have created the vectorizer, let’s apply it to our corpus:

In [45]: X = vect.fit_transform(docs)
X

Out[45]: <5572x3000 sparse matrix of type '<class 'numpy.int64'>'
         with 37142 stored elements in Compressed Sparse Row format>

X is a sparse matrix, i.e. a matrix in which most of the elements are 0. This makes sense since most messages
are short and they will only contain a few of the 3000 words in our feature list. The X matrix has 5572 rows
(i.e. the total number of sms) and 3000 columns (i.e. the total number of selected words) but only 37142
non-zero entries (less than 1%).

To use it for Machine Learning we will convert it to a dense matrix, which we can do by calling todense()
on the object:

In [46]: Xd = X.todense()

TIP: be careful with converting sparse matrices to dense. If you are dealing with large
datasets, you will quickly run out of memory with all those zeros. In those cases, we do
on-the-fly conversion to dense of each batch during Stochastic Gradient Descent.
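A minimal sketch of that idea (a hypothetical generator, not used elsewhere in this chapter): it densifies one
slice of the sparse matrix at a time, and you would pass it to model.fit (or fit_generator on older Keras
versions) together with steps_per_epoch.

def sparse_batches(X_sparse, y, batch_size=32):
    # Yield (dense_batch, labels) pairs, densifying one slice at a time
    # so that the full dense matrix never needs to fit in memory.
    n = X_sparse.shape[0]
    while True:
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            yield X_sparse[start:stop].toarray(), y[start:stop]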

Let’s also have a look at the features found by the vectorizer:

In [47]: vocab = vect.get_feature_names()

The features are listed in alphabetical order:

In [48]: vocab[:10]

Out[48]: ['00',
'000',
'02',
'0207',
'02073162414',
'03',
'04',
'05',
'06',
'07123456789']

In [49]: vocab[-10:]

Out[49]: ['yogasana', 'yor', 'yr', 'yrs', 'yummy', 'yun', 'yunny', 'yuo', 'yup',
'zed']

Let’s use the helper function we’ve defined above to train a model on the new features:

In [50]: res = split_fit_eval(Xd, y)

1393/1393 [==============================] - 0s 89us/sample - loss: 0.1828 - accuracy: 0.9742

In [51]: print("Test set accuracy:\t{:0.3f}".format(res[1]))

Test set accuracy: 0.974

The accuracy on the test set is not much higher than our simple model; however, we can use this model to
look at feature importances, i.e. to identify words whose weight is high when predicting spam or not. Let’s
recover the trained model from the res object returned by our custom function:

In [52]: model = res[2]

Then let’s put the weights in a Pandas Series, indexed by the vocabulary:

In [53]: w_ = model.get_weights()[0].ravel()
vocab_weights = pd.Series(w_, index=vocab)

Let’s look at the top 20 words with positive weights:

In [54]: vocab_weights.sort_values(ascending=False).head(20)

Out[54]:

0
txt 0.632662
claim 0.561857
www 0.543640
150p 0.519195
mobile 0.487513
free 0.484431
prize 0.480391
service 0.475981
50 0.464343
18 0.462488
uk 0.457113
reply 0.439950
won 0.435277
16 0.426705
1000 0.421579
stop 0.415596
500 0.399611
ringtone 0.381416
rate 0.376613
tone 0.364868

Not surprisingly, we find here words like www, claim, prize, free, etc. Similarly we can look at the bottom
20 words:

In [55]: vocab_weights.sort_values(ascending=False).tail(20)

Out[55]:

0
like -0.458966
sorry -0.465436
good -0.467163
oh -0.473112
lt -0.473154
way -0.474795
gt -0.483761
got -0.487025
wat -0.488967
did -0.489641
think -0.510979
need -0.516397
going -0.526966
lor -0.527774
home -0.531137
come -0.533240
later -0.540522
da -0.540787
ll -0.561920
ok -0.597133

and see that they are pretty common legitimate words like sorry, ok, later, etc. If we were spammers, we could
take advantage of this information and craft messages that attempt to fool these simple features by using a
lot of words like sorry or ok. This is a typical Adversarial Machine Learning scenario, where the adversary is
constantly trying to beat the model.

In any case, it’s pretty clear that this dataset is an easy one. So let’s load a new dataset and learn a few more
tricks!

Word frequencies

In the previous spam classification problem, we used a CountVectorizer transformer from Scikit Learn to
produce a sparse matrix with term counts of the top 3000 words. Using the absolute counts was ok with
SMS messages because they have a maximum length of 160 characters. In the general case, using absolute
counts may be a problem if we deal with documents of uneven length. Think of a brief email versus a long
article, both about the topic of AI. The word AI will appear in both documents, but it will likely be repeated
more times in the long article. Using the counts would lead us to think that the article is more about AI than
the short email, while it’s simply a longer text. We can account for that using the term frequency instead of
the count, i.e., by dividing the counts by the length of the document. Using term frequencies is already an
improvement, but we can do even better.

There will be some words that are common in every document. These could be common English stop
words, but they could also be words that are common across the specific corpus. For example, if we are
trying to sort a corpus of patents by topics, it is clear that words like patent, application, grant and similar

legal terms will be shared across the whole corpus and not indicative of the particular topic of each of the
documents in the corpus.

We want to normalize our term frequencies with a term inversely proportional to the fraction of documents
containing that term, i.e., we want to use an inverse document frequency.

These features go by the name of TF-IDF, i.e., term-frequency–inverse-document-frequency, which is
also available as a vectorizer in Scikit Learn.

In other words, TF-IDF features look like:

TFIDF(word, document) = counts_in_document(word, document) /
                        counts_of_documents_with_word_in_corpus(word)

or using maths:

tf-idf(w, d) = tf(w, d) / df(w, d)    (8.1)

where w is a word, d is a document, tf stands for “term frequency” and df for “document frequency”.

As you can read in the Wikipedia article, there are several ways to refine the TF-IDF formula above,
using different regularization schemes. Scikit Learn implements it as follows:

tf-idf(w, d) = tf(w, d) × ( log( (1 + n_d) / (1 + df(w, d)) ) + 1 )    (8.2)

where n_d is the total number of documents and the regularized logarithm takes care of words that are
extremely rare or extremely common.
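As a sanity check of this formula (a toy example, not part of the original notebooks), we can compare a
manual computation with Scikit Learn's TfidfVectorizer. Here we take tf to be the raw count in the
document, which is what Scikit Learn uses internally, and we pass norm=None to switch off the additional
L2 row normalization that the vectorizer applies by default:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat", "the cat sat on the mat", "the dog barked"]

vect = TfidfVectorizer(norm=None)
X_toy = vect.fit_transform(toy_corpus).toarray()

# manual computation of the formula above for the word "cat" in document 0
tf = toy_corpus[0].split().count("cat")
df = sum("cat" in doc.split() for doc in toy_corpus)
n_d = len(toy_corpus)
manual = tf * (np.log((1 + n_d) / (1 + df)) + 1)

col = vect.vocabulary_["cat"]
print(manual, X_toy[0, col])   # the two values should match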

Sentiment classification

Let’s load a new dataset, containing reviews from the popular website Rotten Tomatoes:

In [56]: df = pd.read_csv('../data/movie_reviews.csv')
df.head()

Out[56]:

title review vote


0 Toy story So ingenious in concept, d... fresh
1 Toy story The year’s most inventive ... fresh
2 Toy story A winning animated feature... fresh
3 Toy story The film sports a provocat... fresh
4 Toy story An entertaining computer-g... fresh

Let’s take a peek into the data we loaded, see how many reviews we have and a few other pieces of
information.

In [57]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14072 entries, 0 to 14071
Data columns (total 3 columns):
title 14072 non-null object
review 14072 non-null object
vote 14072 non-null object
dtypes: object(3)
memory usage: 329.9+ KB

Let’s look at the division between the fresh votes, the rotten, and the none votes.

In [58]: df['vote'].value_counts() / len(df)

Out[58]:

vote
fresh 0.612067
rotten 0.386299
none 0.001634

As you can see, the dataset contains reviews about famous movies and a judgment of rotten VS fresh,
which is the class we will try to predict.

First of all, we notice that a small number of reviews do not have a class, so let’s eliminate those few rows
from the dataset. We’ll do this by selecting all the votes that are not none:

In [59]: df = df[df.vote != 'none'].copy()

In [60]: df['vote'].value_counts() / len(df)



Out[60]:

vote
fresh 0.613069
rotten 0.386931

Our reference accuracy is 61.3%, the fraction of the larger class.

Label encoding

Notice that the labels are strings, and we need to convert them to 0 and 1 in order to use them for
classification. We could do this in many ways, one way is to use the LabelEncoder from Scikit Learn. It is a
transformer that will look at the unique values present in our label column and encode them to numbers
from 0 to N − 1, where N is the number of classes, in our case 2.

Let’s first import it from sklearn.preprocessing:

In [61]: from sklearn.preprocessing import LabelEncoder

Let’s instantiate the LabelEncoder:

In [62]: le = LabelEncoder()

Finally, let’s create a vector of 0s and 1s that represents the labels:

In [63]: y = le.fit_transform(df['vote'])

y is now a vector of 0s and 1s. Let’s look at the first 10 entries:

In [64]: y[:10]

Out[64]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Bag of words: TF-IDF features

Let’s import the TfidfVectorizer vector from Scikit Learn:



In [65]: from sklearn.feature_extraction.text import TfidfVectorizer

Let’s initialize it to look at the top 10000 words in the corpus, excluding English stop words:

In [66]: vocab_size = 10000

vect = TfidfVectorizer(decode_error='ignore',
stop_words='english',
max_features=vocab_size)

We can use the vectorizer to transform our reviews:

In [67]: X = vect.fit_transform(df['review'])

In [68]: X

Out[68]: <14049x10000 sparse matrix of type '<class 'numpy.float64'>'
         with 130025 stored elements in Compressed Sparse Row format>

This generates a sparse matrix with 14049 rows and 10000 columns. This is still small enough to be
converted to dense and passed to our model evaluation function. Let’s call todense() on the object to
convert it to a dense matrix:

In [69]: Xd = X.todense()

Let’s train our model. We will use a higher number of epochs in this case, to ensure convergence with the
larger dataset:

We’ll use our function again:

In [70]: res = split_fit_eval(Xd, y, epochs=30)

3513/3513 [==============================] - 0s 85us/sample - loss: 0.5135 - accuracy: 0.7515

In [71]: print("Test set accuracy:\t{:0.3f}".format(res[1]))



Test set accuracy: 0.751

The accuracy on the test set is much lower than the accuracy obtained on the training set during training,
therefore the model is overfitting. This is not unexpected given the large
number of features. Despite the overfitting, the test score is still higher than the 61.3% accuracy obtained by
always predicting the larger class.

Text as a sequence

The bag of words approach is very crude. It does not take context into account, i.e. each word is treated as
an independent feature, regardless of its position in the sentence. This is particularly bad for tasks like
sentiment analysis, where negations could be present (“This movie was not good”) and the overall sentiment
may not be carried by any particular word.

To go beyond the bag of words approach, we need to treat the text as a sequence instead of just looking at
frequencies. To do this, we will proceed to:

1. create a vocabulary, indexed starting from the most frequent word and then continuing in decreasing
order.
2. convert the sentences into sequences of integer indices using the dictionary
3. feed the sequences to a Neural Network to perform the sentiment classification

Keras has a preprocessing Tokenizer that allows us to create a vocabulary and convert the sentences using
it. Let’s load it:

In [72]: from tensorflow.keras.preprocessing.text import Tokenizer

Let’s initialize the Tokenizer. We will use the same vocabulary size of 10000 used in the previous task:

In [73]: vocab_size

Out[73]: 10000

In [74]: tokenizer = Tokenizer(num_words=vocab_size)

We can fit the tokenizer on our reviews using the function .fit_on_texts. We will pass the column of the
dataframe df that contains the reviews:

In [75]: tokenizer.fit_on_texts(df['review'])

Great! The tokenizer has finished its job, so let’s give a look at some of its attributes.

The .document_count gives us the number of documents used to build the vocabulary:

In [76]: tokenizer.document_count

Out[76]: 14049

These are the 14049 reviews left in the dataset after we removed the ones without a vote. The .num_words
attribute gives us the number of features in the vocabulary. These should be 10000:

In [77]: tokenizer.num_words

Out[77]: 10000

Finally, we can retrieve the word index by calling .word_index, which returns a dictionary mapping words
to indices. Let’s look at the first 10 items in it:

In [78]: list(tokenizer.word_index)[:10]

Out[78]: ['the', 'a', 'and', 'of', 'to', 'is', 'in', 'it', 'that', 'as']

As you can see, this is not sorted alphabetically, but in decreasing order of frequency, starting from the most
common word. Let’s use the tokenizer to convert our reviews to sequences, turning each word in a sentence
into its integer index in the vocabulary:

Conversion of words to indices

In [79]: sequences = tokenizer.texts_to_sequences(df['review'])

sequences is a list of lists. Each of the inner lists is one of the reviews:

In [80]: sequences[:3]

Out[80]: [[36,
1764,
7,
1058,
800,
3,
1765,
9,
27,
151,
268,
8,
21,
2,
9088,
3879,
5881,
115,
3,
101,
20,
22,
17,
360],
[1, 610, 38, 801, 49],
[2, 1012, 347, 225, 9, 24, 107, 14, 564, 21, 1, 354, 7122]]

Let’s just double check that the conversion is correct by converting the first list back to text. We will need to
use the reverse index -> word map:

In [81]: tok_items = tokenizer.word_index.items()


idx_to_word = {i:w for w, i in tok_items}

The first review is:

In [82]: df.loc[0, 'review']

Out[82]: 'So ingenious in concept, design and execution that you could watch it on a
postage stamp-sized screen and still be engulfed by its charm.'

The first sequence is:



In [83]: ' '.join([idx_to_word[i] for i in sequences[0]])

Out[83]: 'so ingenious in concept design and execution that you could watch it on a
postage stamp sized screen and still be by its charm'

The two sentences are almost identical; however, notice a few things:

1. punctuation has been stripped away during tokenization.
2. all words have been lowercased.
3. some really rare words (e.g. "engulfed") fall outside our top 10000 words and are therefore ignored.

The first two behaviors are Tokenizer defaults and can be changed, as shown in the sketch below.
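Here is a minimal sketch of how those defaults can be overridden when constructing the Tokenizer (the filters string below is only an illustration, not the library default; check the Tokenizer documentation for the exact defaults of your Keras version):

from tensorflow.keras.preprocessing.text import Tokenizer

# Keep the original casing and strip only a custom set of characters;
# num_words still limits the vocabulary to the most frequent words.
custom_tokenizer = Tokenizer(num_words=vocab_size,
                             lower=False,
                             filters='!"#$%&()*+,./')
custom_tokenizer.fit_on_texts(df['review'])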

Now that we have sequences of numbers, we can organize them into a matrix with one review per row and
one word per column. Since not all sequences have the same length, we will need to pad the short ones with
zeros.

Let’s calculate the longest sequence length:

In [84]: maxlen = max([len(seq) for seq in sequences])


maxlen

Out[84]: 49

The longest review contains 49 words. Let's pad all the other reviews to a length of 49 using the pad_sequences
function from Keras:

In [85]: from tensorflow.keras.preprocessing.sequence import pad_sequences

As you can read in the documentation:

Signature: pad_sequences(sequences, maxlen=None, dtype='int32',
                         padding='pre', truncating='pre', value=0.0)
Docstring:
Pads each sequence to the same length (length of the longest sequence).

If maxlen is provided, any sequence longer than maxlen is truncated to maxlen.

pad_sequences operates on the sequences by padding and truncating them. Let’s set the maxlen
parameter to the value we already found:

In [86]: X = pad_sequences(sequences, maxlen=maxlen)

In [87]: X.shape

Out[87]: (14049, 49)

X has 14049 rows (i.e., the number of samples, reviews in this case) and 49 columns (i.e., the length of the
longest review). Let's print out the first few reviews:

In [88]: X[:4]

Out[88]: array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 36, 1764, 7, 1058, 800, 3, 1765, 9,
27, 151, 268, 8, 21, 2, 9088, 3879, 5881, 115, 3,
101, 20, 22, 17, 360],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 610, 38, 801, 49],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 2, 1012, 347, 225, 9, 24, 107, 14,
564, 21, 1, 354, 7122],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
16, 1190, 2, 822, 3, 485, 42, 121, 144, 285, 1,
1678, 4, 13, 742, 724]], dtype=int32)

Can we feed this matrix to a Machine Learning model? Let’s think about it for a second. If we treat this
matrix as tabular data, it would mean each column represents a feature. What feature? Columns in the
matrix correspond to the position of the word in a sentence, and so there’s absolutely no reason why two
words appearing at the same position would carry consistent information about sentiment.

Also, the numbers here are the indices of our words in a vocabulary, so their actual value is not a quantity,
it’s their rank in order of frequency in the dictionary. In other words, word number 347 is not 347 times as
large as the word at index 1; it’s just the word that appears at index 347 in the vocabulary.

The correct way to think of this data is to recognize that each number in X really represents an index in a
vector with length vocab_size, i.e., a vector with 10000 entries. These are the actual features, i.e., all the
words in our vocabulary.

So, this matrix is a shorthand for an order-3 sparse tensor whose three axes are (sentence, position along the
sentence, word feature index). The first axis locates the sentence in the dataset, and it corresponds to the
row axis of our X matrix. The second axis locates the word position in the sentence, and it corresponds to the
column axis of our X matrix. The third axis locates the word in the vocabulary, as a sparse 1-hot vector.
Instead of storing that sparse vector explicitly, we store the index of the word in the dictionary, which is the
value of that specific entry in the X matrix.

1-hot encoding of words

It looks like one way to feed this data to a Neural Network would be to expand the X matrix into a 1-hot
encoded order-3 tensor of 0s and 1s, and then feed this tensor to our network, for example to a Recurrent
layer with input_shape=(49, 10000). This would be equivalent to feeding 10000 parallel binary time series
per sentence, almost all zeros, with a single one at each time step marking the word that occurs at that
position in the sentence.
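To make the shape of this tensor concrete, here is a minimal sketch that expands only the first two rows of X (expanding all 14049 reviews at once would require tens of gigabytes of memory):

from tensorflow.keras.utils import to_categorical

# 1-hot encode two reviews: each integer index becomes a 10000-dimensional vector
X_onehot = to_categorical(X[:2], num_classes=vocab_size)
X_onehot.shape   # (2, 49, 10000): (sentence, position, word feature index)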

While this encoding works, it is not really memory efficient. Besides that, representing each word along a
different orthogonal axis in a 10000-dimensional vector space doesn’t capture any information on how that
word is used in its context. How can we improve this situation?

One idea would be to insert a fully connected layer to compress our input space from the vast sparse vector
space of 10000 words in our vocabulary to a much smaller, dense, vector space, for example with just 32
axes. In this new space, each word is represented by a dense vector, whose entries are floating point
numbers instead of all 0s and a single 1.

This is cool because now we can feed our sequences of much smaller dense vectors to a recurrent network to
complete the sentiment classification task, i.e., we are treating the sentiment classification problem as a
Sequence Classification problem like the ones encountered in Chapter 7.

Dense representation of words with vectors

Furthermore, since the dense vector comes from a fully connected layer, we can jointly train the fully
connected layer and the recurrent layer allowing the fully connected layer to find the best representation for
the words to help the recurrent layer achieve its task.

Embeddings

In practice, we never actually go through the burden of converting the word indices to 1-hot vectors and
then back to dense vectors. We use an Embedding layer. In this layer, we specify the output dimension, i.e.,
the length of the dense vector, and it keeps an independent set of that many weights for each word in the
vocabulary. So, for example, if the vocabulary has 10000 words and we specify an output dimension of 100, the
embedding layer will carry 1,000,000 weights, 100 for each word in the vocabulary.

The numbers in the input indicate the indices that select the set of 100 weights for each word, i.e., they will
be interpreted as indices in a phantom sparse space, saving us from converting the data to 1-hot and then
converting it back to dense.

Let’s see how to include this in our network. Let’s load the Embedding layers from Keras:

In [89]: from tensorflow.keras.layers import Embedding

Let’s see how it works by creating a network with a single such layer that maps a feature space of 100 words
to an output dense space of only two dimensions:

In [90]: model = Sequential()


model.add(Embedding(input_dim=100, output_dim=2))

model.compile(optimizer='sgd',
              loss='categorical_crossentropy')

Embedding vectors

The network above assumes the inputs are numbers between 0 and 99. We interpret these as the indices of
the single non-zero entry in a 100-dimensional 1-hot vector. Sequences of such indices will be interpreted as
sequences of such vectors and will be transformed into sequences of 2-dimensional dense vectors, since 2 is
the dimension of the output space.

Let’s feed a single sequence of a few indices and perform a forward pass:

In [91]: model.predict(np.array([[ 0, 81, 1, 0, 79]]))

Out[91]: array([[[ 0.00141018, 0.00803044],


[ 0.02288619, -0.0470843 ],
[ 0.00644252, -0.03970464],
[ 0.00141018, 0.00803044],
[ 0.00093368, 0.02331785]]], dtype=float32)

The embedding layer turned the sequence of five numbers into a sequence of five 2-dimensional vectors.
Since we have not trained our Embedding layer yet, these are just the randomly initialized weight vectors
corresponding to each word: for example, word 0 corresponds to the weights [0.00141018, 0.00803044].
Notice how these appear both in the first row and in the fourth row, exactly as one would expect, since word 0
appears at the first and fourth positions in our five-word sentence.
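We can double-check this by looking at the weight matrix stored inside the Embedding layer. The sketch below assumes the toy model defined above; the exact numbers will differ from run to run because the weights are randomly initialized:

# The Embedding layer stores one weight vector per vocabulary entry
emb_matrix = model.layers[0].get_weights()[0]
emb_matrix.shape   # (100, 2): one 2-dimensional vector for each of the 100 words
emb_matrix[0]      # the same vector returned for word 0 in the prediction above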

Similarly, if we feed a batch of a few sequences of indices, we will obtain a batch of a few sequences of vectors,
i.e., a tensor of order three, with axes (sentence, position in the sentence, embedding):

In [92]: model.predict(np.array([[ 0, 81, 1, 96, 79],


[ 4, 17, 47, 69, 50],
[15, 49, 3, 12, 88]]))

Out[92]: array([[[ 0.00141018, 0.00803044],


[ 0.02288619, -0.0470843 ],
[ 0.00644252, -0.03970464],
[-0.03070084, 0.04406111],
[ 0.00093368, 0.02331785]],

[[ 0.04599335, -0.03839328],
[ 0.04549471, -0.01657807],
[ 0.04869037, -0.03889428],
[ 0.04784049, 0.03709331],
[-0.01679673, -0.04597836]],

[[ 0.04818919, 0.03411008],
[ 0.01479371, -0.01692833],
[-0.01328155, -0.04962599],
[ 0.04628075, 0.02558115],
[-0.0442862 , -0.04733264]]], dtype=float32)

Great! Now we know what to do to build our sentiment classifier. We will:

• Split our X matrix of indices into train and test sets, as usual
• Build a network with:
– an Embedding layer
– a Recurrent layer
– a Dense layer
• Classify the sentiment of our reviews

Let’s start from the train/test split. As we’ve done several times in the book, we set random_state=0 so that
we all get the same train/test split.

TIP: Setting the random state is useful when you want to have repeatable random splits.

In [93]: X_train, X_test, y_train, y_test = \


train_test_split(X, y, random_state=0)

In [94]: X.shape

Out[94]: (14049, 49)

Recurrent model

Let’s build our model as we did in the previous chapter. First, let’s import the LSTM layer from Keras:

In [95]: from tensorflow.keras.layers import LSTM

Next, let’s build up our model. We’ll create our Embedding layer followed by our LSTM layer and the regular
Dense and Activation layers after that:

In [96]: model = Sequential()


model.add(Embedding(input_dim=vocab_size,
output_dim=16,
input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])

Let's train our model using the fit() function. We will train the model on batches of 128 reviews for 8
epochs with a 20% validation split:

In [97]: h = model.fit(X_train, y_train, batch_size=128,


epochs=8, validation_split=0.2)

Train on 8428 samples, validate on 2108 samples


Epoch 1/8
8428/8428 [==============================] - 2s 216us/sample - loss: 0.6651
- accuracy: 0.6163 - val_loss: 0.6537 - val_accuracy: 0.6067
Epoch 2/8
8428/8428 [==============================] - 0s 56us/sample - loss: 0.5673 -
accuracy: 0.6938 - val_loss: 0.5548 - val_accuracy: 0.7111
Epoch 3/8
8428/8428 [==============================] - 0s 51us/sample - loss: 0.3887 -
accuracy: 0.8433 - val_loss: 0.5494 - val_accuracy: 0.7358
Epoch 4/8
8428/8428 [==============================] - 0s 50us/sample - loss: 0.2600 -
accuracy: 0.9003 - val_loss: 0.6109 - val_accuracy: 0.7438
Epoch 5/8
8428/8428 [==============================] - 0s 50us/sample - loss: 0.1800 -
accuracy: 0.9375 - val_loss: 0.7881 - val_accuracy: 0.7481
Epoch 6/8
8428/8428 [==============================] - 0s 50us/sample - loss: 0.1284 -
accuracy: 0.9556 - val_loss: 0.7697 - val_accuracy: 0.7329
Epoch 7/8
8428/8428 [==============================] - 0s 50us/sample - loss: 0.0894 -
accuracy: 0.9725 - val_loss: 0.9024 - val_accuracy: 0.7315
Epoch 8/8
8428/8428 [==============================] - 0s 50us/sample - loss: 0.0654 -
accuracy: 0.9820 - val_loss: 0.9921 - val_accuracy: 0.7296

The model seems to be doing much better on the training set than any of the previous models based on Bag
of Words, since it achieves an accuracy greater than 95% in only 8 epochs. On the other hand, the
validation accuracy seems to be consistently lower, which indicates probable overfitting. Let's evaluate the
model on the test set in order to verify the ability of our model to generalize:

In [98]: loss, acc = model.evaluate(X_test, y_test, batch_size=32)


acc

3513/3513 [==============================] - 0s 71us/sample - loss: 0.9485 -


accuracy: 0.7287

Out[98]: 0.7287219

Ouch! The test score is not much better than the score obtained by our BOW model. This means the model
is overfitting. Let’s plot the training history:

In [99]: dfhistory = pd.DataFrame(h.history)


dfhistory[['accuracy', 'val_accuracy']].plot(ylim=(-0.05, 1.05));
Training and validation accuracy per epoch

As you can see, after a few epochs the validation accuracy stops improving while the training accuracy
keeps improving.

We can also look at the loss and notice that the validation loss does not decrease after a certain point, while
the training loss does.

In [100]: dfhistory[['loss', 'val_loss']].plot(ylim=(-0.05, 1.05));


Training and validation loss per epoch

How many weights are there in our model? Is it too big?

In [101]: model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 49, 16) 160000
_________________________________________________________________
unified_lstm (UnifiedLSTM) (None, 32) 6272
_________________________________________________________________
dense_3 (Dense) (None, 1) 33
=================================================================
Total params: 166,305
Trainable params: 166,305
Non-trainable params: 0
_________________________________________________________________

The model is quite big compared to the size of the dataset. We have over 160 thousand parameters to classify
less than 15 thousand short reviews. This is not a good situation, and we expect to overfit. In the exercises,
we will repeat the sentiment prediction on a larger corpus of reviews and see if we can get better results.

We will also learn another way to reduce overfitting later in the book when we discuss pre-trained models.
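In the meantime, a simple countermeasure we could try right away is to stop training as soon as the validation loss stops improving, using the Keras EarlyStopping callback. Here is a minimal sketch (the patience value is arbitrary, and the fit call is shown commented out so that it does not alter the results above):

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)

# h = model.fit(X_train, y_train, batch_size=128, epochs=20,
#               validation_split=0.2, callbacks=[early_stop])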

Sequence generation and language modeling


As mentioned at the beginning of this chapter and in the previous one, Neural Networks are not only suited
to dealing with textual input data but can also generate text output. We can interpret a text as a sequence of
words or as a sequence of characters, and we can use a model to predict the next character or word in a
sequence. This is called Language Modeling, and it has been successfully used to generate
"Shakespeare-sounding" poems, new pages of Wikipedia and so on (see this wonderful article by A.
Karpathy for a few examples).

The basic idea is to apply to text the same approach we used to improve forecasting in the time series
prediction of the last chapter.

We will start from a corpus of text, split it into short, fixed-size windows, i.e., sub-sequences of a few
characters, and then train a model to predict the next character after each window.

What are these "windows" of text? The figure below illustrates the idea.

Windows of text

Let’s give an example by designing an RNN to generate names of babies. We will use this corpus as training
data, which contains thousands of names.

We start by loading all the names from ../data/names.txt. We also add a \n character to allow the
model to learn to predict the end of a name and convert the names to lowercase.

In [102]: with open('../data/names.txt') as f:


names = f.readlines()
names = [n.lower().strip() + '\n' for n in names]

print('Loaded %d names' % len(names))

Loaded 7939 names

Let’s have a look at the first three of them:

In [103]: names[:3]

Out[103]: ['aamir\n', 'aaron\n', 'abbey\n']

We need to collect all of the characters appearing in the names and build a vocabulary that translates between
each character and its assigned index (and vice versa). We could do this using the Tokenizer from Keras,
but it is so simple that we can do it by hand using a Python set:

In [104]: chars = set()

for name in names:


chars.update(name)

vocab_size = len(chars)

Let’s look at the number of chars we’ve saved:

In [105]: vocab_size

Out[105]: 28

In [106]: chars

Out[106]: {'\n',
'-',
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z'}

Now let's create two dictionaries, one to go from characters to indices and the other to go back from indices to
characters. We'll use these two dictionaries a bit later.

In [107]: char_to_idx = dict((c, i) for i, c in enumerate(chars))


inds_to_char = dict((i, c) for i, c in enumerate(chars))

Character sequences

We can use the vocabulary created above to translate each name in names to its number format in
int_names. We will achieve this using a nested list comprehension where we iterate on names and for each
name we iterate on characters:

In [108]: int_names = [[char_to_idx[c] for c in n] for n in names]

Now each name has been converted to a sequence of integers, for example, the first name:

In [109]: names[0]

Out[109]: 'aamir\n'

Was converted to:

In [110]: int_names[0]

Out[110]: [1, 1, 27, 3, 10, 5]

Great! Now we want to create short sequences of a few characters and try to predict the next one. We will do this
by cutting up the names into input sequences of length maxlen and using the following character as the training
label. Let's start with maxlen = 3:

In [111]: maxlen = 3

name_parts = []
next_chars = []

for name in int_names:


for i in range(0, len(name) - maxlen):
name_parts.append(name[i: i + maxlen])
next_chars.append(name[i + maxlen])

name_parts is a list of short fragments of names (three characters each). Let's take a look at the first elements:

In [112]: name_parts[:4]

Out[112]: [[1, 1, 27], [1, 27, 3], [27, 3, 10], [1, 1, 10]]

next_chars is a list with single entries, each representing the next character:

In [113]: next_chars[:4]

Out[113]: [3, 10, 5, 21]

As a last step we convert the nested list name_parts to an array. We can do this using the same
pad_sequences function used earlier in this chapter, which takes the nested list and converts it to an array,
trimming the longer sequences and padding the shorter ones:

In [114]: X = pad_sequences(name_parts, maxlen=maxlen)



The final shape of our input is:

In [115]: X.shape

Out[115]: (32016, 3)

i.e. we have 32016 name parts, each with 3 consecutive characters.

Now let's deal with the labels. We can use the to_categorical function to 1-hot encode the targets. Let's
import it from tensorflow.keras.utils:

In [116]: from tensorflow.keras.utils import to_categorical

Now let’s create our categories from the next_chars using this function. Notice that we let Keras know
how many characters are in the vocabulary by setting num_classes=vocab_size in the second argument
of the function:

In [117]: y = to_categorical(next_chars, vocab_size)

The shape of our labels is:

In [118]: y.shape

Out[118]: (32016, 28)

i.e. we have 32016 characters, each represented by a 1-hot encoded vector of vocab_size length.

Recurrent Model

At this point we are ready to design and train our model.

We will need to set up an embedding layer for the input, one or more recurrent layers, and a final dense layer
with softmax activation to predict the next character. We can design the model using the Sequential API as
usual, or we can start to practice with the Functional API, which we will use more often later on. This API is
much more powerful than the Sequential API we have used so far because it allows us to build models that can
have more than one processing branch. It is good to start approaching it with a simple case, so that we will be
more familiar with it when we use it on larger and more complex models.

Let’s import the Model class from tensorflow.keras and the Input layer:

In [119]: from tensorflow.keras.models import Model


from tensorflow.keras.layers import Input

In the Functional API, each layer is a function which receives the output of the previous layer and returns
an output for the next one. When we specify a model in this way, we need to start from an Input layer with the
correct shape.

Since we have padded our name subsequences to a length of 3, we’ll create an Input layer with shape (3,):

TIP: remember that the trailing comma is needed in Python to distinguish a tuple with one
element from a simple number within parentheses.
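For example, in plain Python:

type((3,))   # tuple
type((3))    # int: the parentheses alone do not create a tuple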

In [120]: inputs = Input(shape=(3, ))

Let’s look at the inputs variable we have just defined:

In [121]: inputs

Out[121]: <tf.Tensor 'input_1:0' shape=(None, 3) dtype=float32>

It's a Tensorflow tensor with shape=(None, 3), i.e. it will accept batches of data with 3 features, exactly as we
want. Next we create the Embedding layer, with input dimension equal to the vocabulary size (i.e. 28) and
output dimension equal to 5.

In [122]: emb = Embedding(input_dim=vocab_size, output_dim=5)

Next we will use this layer as a function, i.e. we will pass the inputs tensor to it and save the output tensor
to a temporary variable called h (for hidden).

In [123]: h = emb(inputs)

Note that we could have achieved the previous two operations in a single line by writing:

h = Embedding(input_dim=vocab_size, output_dim=5)(inputs)

Following this style, we define the next layer to be an LSTM layer with eight units, and we reuse the h variable for
its output:

In [124]: h = LSTM(8)(h)

Finally, we create the output layer, a Dense layer with as many nodes as vocab_size and with a Softmax
activation function:

In [125]: outputs = Dense(vocab_size, activation='softmax')(h)

Now that we have created all the layers we need and connected their inputs and outputs let’s create a model.
This is done using the Model class that needs to know what the inputs and outputs of the model are:

In [126]: model = Model(inputs=inputs, outputs=outputs)

From here onwards we proceed in an identical way to what we’ve been doing with the Sequential API. We
compile the model for a classification problem:

In [127]: model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])

and now we are ready to train it. We will let the training run for at least ten epochs. While the model trains,
let us reflect on a couple of questions:

• Will this model reach 99 accuracy?


• Will any model ever reach 99 accuracy on this task?
• Would this change if we had access to a corpus of millions of names?
• What accuracy would you expect from randomly guessing the next character?

Let’s train our model by using the fit() function. We will run the training for 20 epochs:

In [128]: model.fit(X, y, epochs=20, verbose=0);

Great! The model has finished training. Getting above 30% accuracy is a good result in this case. The reason
is that we are trying to predict the next character after a sequence of three characters, but there is no unique
solution to this prediction problem.

Think for example of the three characters and. How many names are there in the dataset that start with and?

• anders -> next char is e


• andie -> next char is i
• andonis -> next char is o
• andre -> next char is r
• andrea -> next char is r
• andreas -> next char is r
• andrej -> next char is r
• andres -> next char is r
• andrew -> next char is r
• andrey -> next char is r
• andri -> next char is r
• andros -> next char is r
• andrus -> next char is r
• andrzej -> next char is r
• andy -> next char is y

From this example, we see that while r is the most frequent answer, it’s not the only one. Other letters could
come after the letters and in our training set.

By training the model on the truncated sequences, we are effectively teaching our model a probability
distribution over our vocabulary. Using the example above, given the series of characters ['a', 'n',
'd'], the model learns that the character r follows 11/15 times, i.e., it has a probability of about 0.733, while
the characters e, i, o and y each follow 1/15 times, i.e., each has a probability of about 0.067.
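We can check this empirical distribution directly on the training data. The sketch below counts which character follows the prefix and in the names that start with it; the counts should match the list above:

from collections import Counter

# Distribution of the character following 'and' in names starting with 'and'
next_after_and = Counter(name[3] for name in names if name.startswith('and'))
next_after_and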

TIP: For the math inclined reader, the model is learning to predict the probability
p(c_t | c_{t-3} c_{t-2} c_{t-1}), where the index t indicates the position of a character in the name.
p(A | B) is the conditional probability of A given B, i.e., the probability that A will
happen when B has already happened.

Since the vocabulary size is 28, if the next character were predicted using a random uniform
distribution over the vocabulary, on average we would predict correctly only one time every 28 trials, which
would give an accuracy of about 3.6%. We get to an accuracy of about 30%, which is roughly ten times better than random.

Sampling from the model

Now that we trained the model, we can use it to produce new names, that should at least sound like English
names. We can sample the model by feeding in a few letters and using the model’s prediction for the next
letter. Then we feed the model’s prediction back in to get the next letter, and so on.

First of all, let's define a helper function called sample. This function takes an array of probabilities
p = [p_i], one entry per character in the vocabulary, and returns the index of a character drawn with the
probabilities given by p. More probable characters will therefore be returned more often than characters with a lower
probability.

The multinomial distribution is a generalization of the binomial distribution that can help us in this case.
It is implemented in Numpy and its documentation reads:

The multinomial distribution is a multivariate generalization of the


binomial distribution. Take an experiment with one of ``p``
possible outcomes. An example of such an experiment is throwing a dice,
where the outcome can be 1 through 6. Each sample drawn from the
distribution represents `n` such experiments. Its values,
``X_i = [X_0, X_1, ..., X_p]``, represent the number of times the
outcome was ``i``.

This description says that if our experiment has three possible outcomes with probabilities [0.25, 0.7,
0.05], a single multinomial draw will return an array of length three, where all the entries will be zero
except one, which will be 1, corresponding to the randomly chosen outcome for that experiment. If we were
to repeat the draw multiple times, the frequencies of each outcome would tend towards the assigned
probabilities.

Therefore we can implement the sample function as:

sample(p) := argmax(multinomial(1, p, 1))    (8.3)

We are going to generalize this a bit more, introducing a parameter called diversity that rescales the
probabilities. For high values of the diversity, the rescaled distribution flattens out, approaching a random
uniform distribution over the vocabulary. When the diversity is low, the most likely characters are
selected even more often, approaching a deterministic character generator.

Let's create the sample function, which accepts an input list of probabilities and a diversity argument that
allows us to rescale them:

In [129]: def sample(p, diversity=1.0):


p1 = np.asarray(p).astype('float64')
p1 = np.log(p1) / diversity
e_p1 = np.exp(p1)
s = np.sum(e_p1)
p1 = e_p1 / s
return np.argmax(np.random.multinomial(1, p1, 1))

Let's make sure we understand how this function works with an example. Let's define the probabilities of 3
outcomes (you may think of these as win-lose-draw) where the first one happens 25% of the time, the
second one 65% of the time and the last one only 10% of the time.

In [130]: probs = [0.25, 0.65, 0.1]



Drawing samples from this probability distribution, we would expect to pull out the number 1
(corresponding to the second outcome) about 65% of the time, and so on. Let's sample 100 times:

In [131]: draws = [sample(probs) for i in range(100)]

and let’s use a Counter to count how many of each we drew:

In [132]: from collections import Counter

In [133]: Counter(draws)

Out[133]: Counter({1: 62, 0: 31, 2: 7})

As you can see our results reflect the actual probabilities, with some statistical fluctuations.

Great! Now that we can sample from the vocabulary, let's generate a few names. We will start from an input
seed of three letters and then iterate over the following steps in a loop:

• Use the seed to predict the probability distribution for next characters.
• Sample the distribution using the sample function.
• Append the next character to the seed.
• Shift the input window by one to include the last character appended.
• Repeat.

The loop ends either when we reach a termination character or a pre-defined length.

Let’s go ahead and build this function step by step. Let’s set up the seed of our name to be something like
ali.

In [134]: seed = 'ali'


out = seed

In order to build the name, let's create an input array (we'll call it x) of length maxlen, which will hold the
indices of the last maxlen characters we feed to the model at each step:

In [135]: x = np.zeros((1, maxlen), dtype=int)

Let's use a variable called stop, initialized to False, to stop the loop when our network predicts the '\n'
character as the next character:

In [136]: stop = False

Finally let’s loop until we have to stop:

In [137]: while not stop:


for i, c in enumerate(out[-maxlen:]):
x[0, i] = char_to_idx[c]

preds = model.predict(x, verbose=0)[0]

c = inds_to_char[sample(preds)]
out += c

if c == '\n':
stop = True
out

Out[137]: 'alia\n'

The network produced a few characters and then stopped. Now let's wrap these steps into a single function
that encapsulates the entire process. Let's call our function complete_name. It will take an input seed of
three letters and run through the previous steps, repeatedly predicting the next character.

In [138]: def complete_name(seed, maxlen=3, max_name_len=None,


diversity=1.0):
'''
Completes a name until a termination character is
predicted or max_name_len is reached.

Parameters
----------
seed : string
The start of the name to sample
maxlen : int, default 3
The size of the model's input
max_name_len : int, default None
The maximum name length; if None then samples
are generated until the model generates a '\n'
diversity : float, default 1.0
Parameter to increase or decrease the randomness
of the samples; higher = more random,
lower = more deterministic

Returns
-------
out : string
'''

out = seed

x = np.zeros((1, maxlen), dtype=int)

stop = False

while not stop:


for i, c in enumerate(out[-maxlen:]):
x[0, i] = char_to_idx[c]

preds = model.predict(x, verbose=0)[0]

c = inds_to_char[sample(preds, diversity)]
out += c

if c == '\n':
stop = True
else:
if max_name_len is not None:
if len(out) > max_name_len - 1:
out = out + '\n'
stop = True
return out

Nice! Now that we have a function to complete names, let's predict a few names that start with jen:

In [139]: for i in range(10):


print(complete_name('jen'))

jenidsy

jen

jen

jen

jene

jenatiss

jencia

jeni

jen

jene

Not bad! Let’s play with the diversity parameter to understand what it does. If we set the diversity to be high,
we get random sequences of characters:

In [140]: for i in range(10):


print(complete_name('jen', diversity=10,
max_name_len=20))

jenpt-dwdof

jennofzzl

jenkyybfigiygobxopil

jenhchgkhrtqitoc-tuj

jengefaczvjcesdcttoa

jenzvrzgyexta--xlzso

jenmsxhnqrlg-dgavcq-

jenylqtynmmevft-uftq

jenskwrvyeiellxsoeql

jenxxte

If we set it to a small value, the function becomes deterministic.

TIP: since the sample function involves logarithms and exponentials, it accumulates
numerical errors very quickly. It would be better to build a model that predicts logits
instead of probabilities, but Keras does not allow us to do that.
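A slightly more robust variant of the same sampling idea is sketched below: it adds a small constant before taking the logarithm (so that zero probabilities do not produce minus infinity) and subtracts the maximum before exponentiating, a standard trick to avoid overflow. The eps value is arbitrary:

def sample_stable(p, diversity=1.0, eps=1e-12):
    p = np.asarray(p).astype('float64')
    logp = np.log(p + eps) / diversity    # temperature-scaled log-probabilities
    p = np.exp(logp - logp.max())         # subtract the max for numerical stability
    p = p / p.sum()                       # renormalize so the probabilities sum to 1
    return np.argmax(np.random.multinomial(1, p, 1))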

In [141]: for i in range(10):


print(complete_name('jen', diversity=0.01,
max_name_len=20))

jen

jen

jen

jen

jen

jen

jen

jen

jen

jen

Awesome! We now know how to build a language model! Go ahead and unleash your powers on your
author of choice and start producing new poems or stories. The model we built has a memory of only 3
characters, so it won't exactly sound like Shakespeare when it tries to produce sentences. To have a model
producing correct sentences in English, we would need to train it on a much larger corpus and with longer
windows of text. For example, a memory of 20-25 characters is long enough to generate English-looking
text.

In the next section, we will extend our skills to build a language translation model.

Sequence to sequence models and language translation

Sequence-to-sequence (Seq2Seq) models take a sentence as input and return a new sequence as output.
They are very common in language translation, where the input sequence is a sentence in the first language,
and the output sequence is its translation in the second language.

Encoder - Decoder network

There is a great article by Francois Chollet on the Keras Blog on how to build them in Keras. We strongly
encourage you to read it!

In this chapter, we have approached text problems from a variety of angles and hopefully inspired you to dig
deeper into this domain.

Exercises

Exercise 1

For our Spam detection model, we used a CountVectorizer with a vocabulary size of 3000. Was this the
best size? Let’s find out:

• reload the spam dataset


• do a train test split with random_state=0 on the SMS data frame
• write a function train_for_vocab_size that takes vocab_size as input and does the following:
– initialize a CountVectorizer with max_features=vocab_size
– fit the vectorizer on the training messages
– transform both the training and the test messages to count matrices
– train the model on the training set
– return the model accuracy on the training and test set
• plot the behavior of the train and test set accuracies as a function of vocab_size for a range of
different vocab sizes

In [ ]:

Exercise 2

Keras provides a large dataset of movie reviews extracted from the Internet Movie Database for sentiment
analysis purposes. This dataset is much larger than the one we have used, and it's already encoded as
sequences of integers. Let's put what we have learned to good use and build a sentiment classifier for movie
reviews:

• decide what size of vocabulary you are going to use and set the vocab_size variable
• import the imdb module from keras.datasets
• load the train and test sets using num_words=vocab_size
• check the data you have just loaded; they should be sequences of integers
• pad the sequences to a fixed length of your choice. You will need to:
– decide what a reasonable length to express a movie review is
– decide if you are going to truncate the beginning or the end of reviews that are longer than such
length
– decide if you are going to pad with zeros at the beginning or the end for reviews that are shorter
than such length
• build a model to do sentiment analysis on the truncated sequences
• train the model on the training set
• evaluate the performance of the model on the test set

Bonus points: can you convert the sentences back to their original text form? You should look at
imdb.get_word_index() to download the word index:

In [ ]:
9 Training with GPUs
In this chapter, we will learn how to leverage Graphical Processing Units (GPUs) to speed up the training of
our models. The faster a model trains, the more experiments we can run, and therefore the better solutions
we can find. Also, leveraging cloud GPUs has become so easy by now that it would be a pity not to take
advantage of this opportunity. Only a few years ago, training a deep Neural Network using a GPU was a skill
that demanded very sophisticated knowledge and much money. Nowadays, we train a model on many
GPUs at a relatively affordable cost.

We will start this chapter by introducing what a GPU is, where it can be found, what kinds of GPUs are
available and why they are so useful to do Deep Learning. Then we will review several cloud providers of
GPUs and guide you through how to use them. Once we have a working cloud instance with one or more
GPUs, we will compare training a model with and without a GPU, and appreciate the speedup, especially
with Convolutional Neural Networks. We will then extend training to multiple GPUs and introduce a few
ways to use multiple GPUs in Keras.

This chapter is a bit different from the other chapters, as there will be less Python code and more links to
external documentation and services. Also, while we will do our best to provide the most up-to-date guide to
currently existing providers, it is important that you understand how fast the landscape is evolving. Over
the course of the past six months, each of the providers presented here introduced newer and easier ways to
access cloud GPUs, making the previous documentation obsolete. Thus, it is important that you understand
the principles of why accelerated hardware helps and when. If you do this, it will be easy to adapt to new
ways of doing things when they come out. All that said, let's get started!

Graphical Processing Units


Graphical Processing Units are computer chips that specialize in the parallel manipulation of huge,
multi-dimensional arrays. Originally developed to accelerate the display of video game graphics, they are
today widely used for other purposes, like Machine Learning acceleration.

The term GPU became famous in 1999 when Nvidia, the dominant player in the field today, marketed the
GeForce 256 as "the world's first GPU". In 2002, ATI Technologies, a competitor of Nvidia, coined the term
"visual processing unit" or VPU with the release of the Radeon 9700. The following picture shows the
original GeForce 256 (left) and the GeForce GTX 1080 (right), one of the most recently released and most
powerful graphics cards on the market.

NVIDIA Graphics cards

In 2006, Nvidia came out with a high-level language called CUDA (Compute Unified Device Architecture),
which helps software developers and engineers write programs for graphics processors in a high-level
language, an approach termed GPGPU (General-Purpose computing on Graphics Processing Units).
CUDA gives direct access to the GPU's virtual instruction set and parallel computational
elements, for the execution of compute kernels. This was probably one of the most significant changes in the
way researchers and developers interacted with GPUs.

Why are GPUs, initially developed for video games graphics, so useful for Deep Learning?

As you already know, training a Neural Network requires several operations, many of which involve large
matrix multiplications. Networks perform matrix multiplications in the forward pass, when inputs (or
activations) and weights are multiplied (see Chapter 5 if you need a refresher on the math).
Back-propagation also involves matrix multiplications, when the error is propagated back through the
network to adjust the values of the weights. In practice, training a Neural Network mostly consists of matrix
multiplications. Consider, for example, VGG16 (a frequently used convolutional Neural Network for image
classification, proposed by K. Simonyan and A. Zisserman): it has approximately 140 million parameters.
Using a CPU, it would take weeks to train this model and perform all the matrix multiplications.

GPUs dramatically decrease the time needed for matrix multiplication, offering 10 to 100 times
more computational power than traditional CPUs. There are several reasons why they make this
computational speed-up possible, well discussed in this article.

Summarizing the article: GPUs, comprised of thousands of cores unlike CPUs, not only allow for parallel
operations, but they are also ideal when it comes to fetching enormous amounts of memory. The best GPUs can
fetch up to 750GB/s, which is huge compared with the best CPUs, which handle only up to 50GB/s of
memory bandwidth. Of course, dedicated GPUs, designed explicitly for High-Performance Computing and
Deep Learning, are more performant (and expensive) than gaming GPUs, but the latter, usually available in
everyday laptops, are still a good starting option!

The following picture shows a comparison between CPU and GPU performance (source: Nvidia). The left
image shows that the Fermi GPU can process more than ten times the number of images per second
processed by an Intel 4-core CPU. The right image shows that 16 GPU-accelerated servers can handle a
Neural Network more than six times bigger than 1000 CPU servers can.

CPU versus GPU

Cloud GPU providers


As of early 2018, all major cloud providers give access to cloud instances with GPUs. The two leaders in the
space are Amazon Web Services (AWS) and Google Cloud Platform (GCP). These two companies have been
pioneers in providing cloud GPUs at affordable rates, and they keep adding new options to their offering.
Besides, they both offer additional services specifically built to optimize and serve Deep Learning models at
scale.

Other companies offering cloud GPUs are Microsoft Azure Cloud and IBM. Also, a few startups have started
to offer Deep Learning optimized cloud instances, which are often cheaper and easier to access. In this chapter
we will review Floydhub, Pipeline.ai and Paperspace.

Regardless of the cloud provider, if you have a Linux box with an NVIDIA GPU, it is not hard to equip it to
run tensorflow-gpu and a Jupyter Notebook.

Google Colab

The easiest way to give GPU acceleration a try is Google Colab, also known as Colaboratory. Besides
being easy to use, Colab is also free (you only need a Google account), which makes it perfect for experimenting
with GPU acceleration.

Colaboratory is a research tool for Machine Learning education and research. It’s a Jupyter Notebook
environment that requires no setup to use: you can create and share Jupyter notebooks with others without
having to download, install, or run anything on your computer other than a browser. It works with most
major browsers, and it is most thoroughly tested with desktop versions of Chrome and Firefox.

This welcome notebook provides the information to start working with Colab. In addition to all the
standard operations in Jupyter you can change the notebook settings to enable GPU support:

Once you’ve done that, you can run this code to verify that GPU is available:

import tensorflow as tf
tf.test.gpu_device_name()
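Depending on your Tensorflow version, you can also list the visible GPU devices explicitly. A minimal sketch (in Tensorflow 2.0 this function lives under tf.config.experimental; in later versions it is available as tf.config.list_physical_devices):

import tensorflow as tf

# Returns an empty list when no GPU is visible to Tensorflow
tf.config.experimental.list_physical_devices('GPU')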

Pipeline AI

The next best option to try a GPU for free in the cloud is the service offered by PipelineAI. The PipelineAI
service enables data scientists to train, test, optimize, deploy, and scale models in production rapidly and
directly from a Jupyter Notebook or command-line interface. It provides a platform that simplifies the
workflow and lets the user focus only on the essential Machine Learning aspects.

The login process to use PipelineAI is quite simple and straightforward:

1. Sign up at PipelineAI.
2. Once you are successfully logged in, you should see the following dashboard. You can either launch a new
notebook or directly type commands in a terminal.

3. Alternatively, you can use some of the already available resources, accessible from the left menu. For
example, you can have a look at the 01a_Explore_GPU.ipynb notebook, under notebooks >
00_GPU_Workshop

PipelineAI is not only a platform providing GPU-powered Jupyter Notebooks, but it also allows you to do
much more, such as monitoring the training of the algorithms, evaluating the results of your model,
comparing the performances of different models, browsing among stored models, and so on. The following
picture shows some of the available tools, but have a look at all the options available in the community
edition.

To better understand the potential of PipelineAI, we encourage you to take this tour. Pipeline is under
active development; you can follow its Github repository.

Floydhub

Floydhub is another easy and cheap option to access GPUs in the cloud. Floydhub is a platform for training
and deploying Deep Learning and AI applications. FloydHub comes with fully configured CPU and GPU
environments ready to use for Deep Learning. It includes CUDA, cuDNN and popular frameworks like
Tensorflow, PyTorch, and Keras. Please take a look at the documentation for a more extended explanation
of its features.

This tutorial explains how to start a Jupyter Notebook on Floydhub:

1. Create an account on Floydhub.
2. Install floyd-cli on your computer.

pip install -U floyd-cli

3. Create a project, named for example my_jupyter_project:



4. From your terminal, use floyd-cli to initialize the project (be sure to use the name you gave the
project in step 3).

floyd init my_jupyter_project

TIP: if this is the first time you run floyd it will ask you to log in. Just type floyd login
and follow the instructions provided.

5. Use floyd-cli again to kick off your first Jupyter Notebook.

floyd run --gpu --mode Jupyter

This will confirm the job:

and open your FloydHub web page. Here you'll see a View button that will direct you to a Jupyter Notebook.
The notebook is running on FloydHub's GPU servers.

Once finished you can stop the Jupyter Notebook with the cancel button. Make sure to save your results by
downloading the notebook before you terminate it:

Paperspace

Paperspace is a platform to access a virtual desktop in the cloud. In particular, the Gradient service allows you
to explore, collaborate, and share code and data using Jupyter Notebooks, and to submit tasks to the Paperspace
GPU cloud.

It is a suite of tools specifically designed to accelerate cloud AI and Machine Learning. Gradient includes a
powerful job runner (that can even run on the new Google TPUs!), first-class support for containers and
Jupyter notebooks, and a new set of language integrations. The job runner allows you to work on your local
machine and submit "jobs" to the cloud for processing. Discover more about this service by reading this blog post.

The procedure to run a Jupyter Notebook within Paperspace is similar to what we have seen so far for the
other GPU services:

1. Create an account on Paperspace.


2. Access the console.

3. Create a Jupyter Notebook to build your models. (Credit card information on the billing page is
required to enable all functionality.)

Paperspace is much more general than simply a hosted Jupyter Notebook service with GPU enabled. Since
Paperspace gives you a full virtual desktop (both Linux and Windows, as shown in the following picture),
you can install any other applications you need, from 3D rendering software to video editing and more.

AWS EC2 Deep Learning AMI

AWS provides a Deep Learning AMI ready to use with all the NVIDIA drivers pre-installed as well as most
Deep Learning frameworks and Python packages. It’s not free, but it’s sufficiently simple and versatile to use.
We can quickly launch Amazon EC2 instances pre-installed with popular Deep Learning frameworks such
as Apache MXNet and Gluon, TensorFlow, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, Torch,
PyTorch, Chainer, and Keras to train sophisticated, custom AI models, experiment with new algorithms, or
to learn new skills and techniques.

To use any AWS service, we need to open an account. Several resources are available for a free trial period,
as described on the official web page. After the trial period, keep in mind that the service will
charge you. Also, keep in mind that GPU instances are not included in the free tier, so you will incur
charges if you complete the next steps.

Follow this procedure to spin up a GPU enabled machine on AWS with the Deep Learning AMI:

1. Access the AWS console and select EC2 from the Compute menu.

2. Click on the Launch Instance button.

3. Scroll the page and select an Amazon Machine Image (AMI). The Deep Learning AMI is a good
option to start. It comes in 2 flavors: Ubuntu and Amazon Linux. Both are good, and we recommend
you use the flavor you are more comfortable with. Also, note that there are both a Deep Learning AMI
and a Deep Learning AMI Basic. The Basic AMI has only GPU drivers installed but no Deep Learning
software. The full AMI comes pre-packaged with a ton of useful packages including Tensorflow,
Keras, Pytorch, MXNet, CNTK and more. We recommend you use this one to start.

4. Choose an instance type from the menu. Roughly speaking, instance types are listed in ascending
order of computational power and storage space.

Here’s a summary table of AWS GPU instances. Read the documentation for a detailed description of
every instance type.

Once you have chosen the instance go through the other steps:

• Step 3: Configure Instance Details


• Step 4: Add Storage
• Step 5: Add Tags
• Step 6: Configure Security Group: make sure to leave port 22 open for SSH
• Step 7: Review Instance Launch

and finally, launch your instance with a key pair you own. Let’s assume it’s called your-key.pem.

You should now be able to see the newly created instance in the dashboard, and you are now ready to
connect with it.

Finally, take a look at the Tutorials and Examples section to understand better how to use Deep Learning
AMI service offered by AWS.

Connect to AMI (Linux)

Once your Instance state is running you are ready to connect to it. We are going to do that from a terminal.
We will use the ssh key we have generated, and we will also route remote port 8888 to the local port 8888 so
that we get to access Jupyter Notebook. Go ahead and type:

ssh -i your-key.pem -L 8888:localhost:8888 ubuntu@<your-ip>

TIP: if you get a message that says your key is not protected, you need to change the
permissions of your key to read-only. You can do that by executing the command: chmod
600 your-key.pem.

Once you’re connected you should see a screen like the following, where all the environments are listed:

We will go ahead and activate the tensorflow_p36 environment with the command:

source activate tensorflow_p36

and launch Jupyter Notebook with:

nohup jupyter notebook --no-browser &

This command launches Jupyter in a way that will not stop if you disconnect from the instance. The
final step is to retrieve the Jupyter address: http://localhost:8888/?token=<your-token>. You
will find it in the nohup.out file:

tail nohup.out

Copy it and paste it into your browser. If you’ve done everything correctly you should see a screen like
this one:

TIP: AWS also has a tutorial here:

https://docs.aws.amazon.com/dlami/latest/devguide/tutorials.html

Connect to AMI (Windows)

To connect to the AWS EC2 Deep Learning AMI from Windows, similar steps must be followed, but in this
case it is convenient to use PuTTY, an SSH client specifically developed for the Windows platform. After
installing PuTTY on your machine, the procedure to connect to the cloud instance is as follows:

1. In the Session palette:

• Host Name (or IP address): ubuntu@<your-ip>


• Port: 22
• Connection type: SSH

2. In the Connection > SSH > Auth palette:

• Private key file for authentication: browse to the key generated by PuTTYgen

3. In the Connection > SSH > Tunnels palette:

• Source port: 8888


• Destination: localhost:8888

Once you are connected, follow the same steps as for the Linux case.

Turning off the instance

Once you are done with your experiments, remember to turn off the instance to avoid unnecessary costs. Just go to your
AWS console and either Stop or Terminate the instance, by choosing an action from the Actions menu:

AWS Command Line Interface

AWS also provides a command line interface, the AWS CLI, which allows you to perform the same operations from
the terminal. If you'd like to try it, you can install it using the command:

pip install awscli



from your terminal. Once you have installed it, you need to add configuration credentials. First, you'll have
to set up an IAM user in the AWS IAM console, then run the following configuration command:

aws configure

That will prompt you to insert some information:

AWS Access Key ID [None]: <your access key>


AWS Secret Access Key [None]: <your secret>
Default region name [None]: us-east-1
Default output format [None]: ENTER

AWS regions and availability zones are:

Make sure to choose a region that provides a copy of the Deep Learning AMI.

As explained in the AWS CLI guide, the output format can be json:

or text:

Once configured you can start your Deep Learning instance with the following command:

aws ec2 run-instances \


--image-id <DL-AMI-ID-for-your-region> \
--count 1 \
--instance-type <instance-type> \
--key-name <your-ssh-key-name> \
--subnet-id <subnet-id> \
--security-group-ids <security-group-id> \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=<a-name-tag>}]'

Where you will need to insert the following parameters:

• <DL-AMI-ID-for-your-region>: the AMI ID for the Deep Learning AMI in the AWS region
you’ve chosen
• <instance-type>: the type of instance, like g2.2xlarge, p3.16xlarge etc.
• <your-ssh-key-name>: the name of your ssh key. You must have this key on your disk.
• <subnet-id>: the subnet id, you can find this when you launch an instance from the web interface.
• <security-group-id>: the security group id, you can find this when you launch an instance from
the web interface as well.
• <a-name-tag>: a name for your instance, so that you can easily retrieve it by name

You can query the status of your launch with the command:

aws ec2 describe-instances

and remember to stop or terminate the instance when you are done, for example using this command:

aws ec2 terminate-instances --instance-ids <your-instance-id>

which would return something like this:

AWS Sagemaker

AWS Sagemaker is an AWS managed solution that allows you to perform all the steps involved in a Deep
Learning pipeline. In fact, on Sagemaker you can define, train and deploy a Machine Learning model in just
a few steps.

Sagemaker provides an integrated Jupyter Notebook instance that can be used to access data stored in other
AWS services, explore it, clean it and analyze it as well as to define a Machine Learning model. It also
provides common Machine Learning algorithms that are optimized to run efficiently against extensive data
in a distributed environment.

Detailed information about this service is in the official documentation.

The procedure to spin up a notebook is similar to what we have previously seen:

1. Create an AWS account and access the console.


2. From the AWS console, select the Amazon SageMaker service, under the Machine Learning group.

3. Click the button Create notebook instance.

4. Assign a Notebook instance name, for example "my_first_notebook", and click the button Create
notebook instance. Sagemaker offers several types of instances, including a cheap option for
developing your notebook, a CPU-heavy instance if your code requires a lot of CPUs, and a
GPU-enabled instance if you need it. Notice that the instance types available for the notebook
instance are different from the ones available for model training and deployment.

5. Start working on the newly created notebook

Once you are done developing your model, Sagemaker allows you to export, train and deploy the model in a few
straightforward steps. Please refer to the User guide for more information on these steps.

Google Cloud and Microsoft Azure

Although we reviewed in detail the solutions offered by Amazon AWS, both Google Cloud and Microsoft
Azure offer similarly priced GPU-enabled cloud instances. We invite you to check their offerings here:

• Google Cloud
• Microsoft Azure

The DIY solution (on Ubuntu)

If you'd like to start from scratch on a bare-bones Linux machine with a GPU, here are the steps you will need to follow:

1. Install the NVIDIA CUDA drivers. CUDA is a parallel computing platform and programming model that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

2. Download and install CuDNN. CuDNN is an NVIDIA library built on CUDA that implements a lot of common Neural Network algorithms.
3. Install Miniconda. Miniconda is a minimal installation of Python and the conda package manager that we will use to install other packages.
4. Install a few standard packages: conda install pip numpy pandas scikit-learn scipy matplotlib seaborn h5py. This command will install the packages in the base environment.
5. Install Tensorflow compiled with GPU support: pip install tensorflow-gpu.
6. (Optional) Install Keras: pip install keras.
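
Once everything is installed, it is worth verifying that Tensorflow can actually see the GPU before launching a long training run. Here is a minimal sketch of such a check, assuming a recent Tensorflow version (on older 1.x versions you can use tf.test.is_gpu_available() instead):

import tensorflow as tf

# List the GPUs visible to Tensorflow; an empty list means the setup
# (drivers, CUDA/CuDNN, or the tensorflow-gpu build) is not working
print(tf.config.experimental.list_physical_devices('GPU'))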

GPU VS CPU training


Regardless of how you decided to get access to a GPU-enabled cloud instance, in the following code, we will
assume that you have access to such an instance and review some functionality that is available in
Tensorflow when running on a GPU instance.

Let’s start by comparing training speed on a CPU vs. a GPU for a Convolutional Neural Network. We will
train this on the CIFAR10 data that we have also encountered in Chapter 6. Let’s load the usual packages of
Numpy, Pandas, and Matplotlib:

In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Let’s also import Tensorflow:

In [3]: import tensorflow as tf

Tensorflow 2.0 compatibility

Tensorflow 2.0 enables Eager Execution by default. From our tests this seems to have a problem with the
allocation on GPU vs CPU. The issue is documented here. While the developers at Tensorflow figure out the
problem and find a fix, we will disable eager execution:

In [4]: tf.compat.v1.disable_eager_execution()

Convolutional model comparison

First, we load the data using a helper function that also rescales it and expands the labels to binary
categories. If you’re unfamiliar with these steps, we recommend you review Chapter 3, Chapter 4 and
Chapter 6 where they are repeated multiple times and explained in detail.

In [5]: from tensorflow.keras.datasets import cifar10


from tensorflow.keras.utils import to_categorical

In [6]: def cifar_train_data():


print("Loading CIFAR10 Data")
(X_train, y_train), _ = cifar10.load_data()
X_train = X_train.astype('float32') / 255.0
y_train_cat = to_categorical(y_train, 10)
return X_train, y_train_cat

X_conv, y_conv = cifar_train_data()

Loading CIFAR10 Data

Next, we define a function that creates the convolutional model. By now you should be familiar with every
line of code that follows, but just as a reminder, we create a Sequential model adding layers in sequence,
like pancakes in a stack. The layers in this network are:

• 2D Convolutional layer with 32 filters, each of size 3x3 and ReLU activation. Notice that in the first
layer we also specify the input shape of (32, 32, 3) which means our images are 32x32 pixels with
three colors: RGB.
• 2D Convolutional layer with 32 filters, each of size 3x3 and ReLU activation. We add a second
convolutional layer immediately after the first to effectively convolve over larger regions in the input
image.
• 2D Max Pooling layer with a pool size of 2x2. This will cut in half the height and the width of our feature maps, effectively making the calculations four times faster.
• Flatten layer to go from the order four tensors used by convolutional layers to an order-2 tensor
suitable for fully connected networks.
• Fully connected layer with 512 nodes and a ReLU activation
• Output layer with ten nodes and a Softmax activation

If you need to review these concepts, make sure to check out Chapter 6 for more details.

We also compile the model for a classification problem using the Categorical Cross-entropy loss function
and the RMSProp optimizer. These are explained in detail in Chapter 5.

Notice also that we import the time module to track the performance of our model:

In [7]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from time import time

In [8]: def convolutional_model():


print("Defining convolutional model")
t0 = time()
model = Sequential()
model.add(Conv2D(32, (3, 3),
padding='same',
input_shape=(32, 32, 3),
kernel_initializer='normal',
activation='relu'))
model.add(Conv2D(32, (3, 3), activation='relu',
kernel_initializer='normal'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(10, activation='softmax'))

print("{:0.3f} seconds.".format(time() - t0))

print("Compiling the model...")


t0 = time()
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])

print("{:0.3f} seconds.".format(time() - t0))


return model

Now we are ready to compare the CPU training time with the GPU training time. We can force Tensorflow to create the model on the CPU with the context setter with tf.device('cpu:0'). Let's create a model on the CPU:

In [9]: with tf.device('cpu:0'):


model = convolutional_model()

Defining convolutional model


0.089 seconds.
Compiling the model...
0.110 seconds.

In [10]: model.summary()

Model: "sequential"
_________________________________________________________________

Layer (type) Output Shape Param #


=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
conv2d_1 (Conv2D) (None, 30, 30, 32) 9248
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 7200) 0
_________________________________________________________________
dense (Dense) (None, 512) 3686912
_________________________________________________________________
dense_1 (Dense) (None, 10) 5130
=================================================================
Total params: 3,702,186
Trainable params: 3,702,186
Non-trainable params: 0
_________________________________________________________________

Now let’s train the CPU model for 2 epochs:

In [11]: print("Training convolutional CPU model...")


t0 = time()
model.fit(X_conv, y_conv,
batch_size=1024,
epochs=2,
shuffle=True)
print("{:0} seconds.".format(time() - t0))

Training convolutional CPU model...


Epoch 1/2
50000/50000 [==============================] - 99s 2ms/sample - loss: 2.0398
- accuracy: 0.2827
Epoch 2/2
50000/50000 [==============================] - 104s 2ms/sample - loss:
1.6681 - accuracy: 0.4196
203.2117199897766 seconds.

Now let’s compare the model with a model living on the GPU. We use a similar context setter: with
tf.device('gpu:0'):

In [12]: with tf.device('gpu:0'):


model = convolutional_model()

Defining convolutional model


0.086 seconds.

Compiling the model...


0.180 seconds.

And then we train the model on the GPU:

In [13]: print("Training convolutional GPU model...")


t0 = time()
model.fit(X_conv, y_conv,
batch_size=1024,
epochs=2,
shuffle=True)
print("{:0.3f} seconds.".format(time() - t0))

Training convolutional GPU model...


Epoch 1/2
50000/50000 [==============================] - 4s 75us/sample - loss: 2.0269
- accuracy: 0.2950
Epoch 2/2
50000/50000 [==============================] - 3s 67us/sample - loss: 1.6380
- accuracy: 0.4248
7.543 seconds.

As you can see training on the GPU is much faster than on the CPU. Also notice that the second epoch runs
much faster than the first one. The first epoch also includes the time to transfer the model to the GPU, while
for the following ones the model has already been transferred to the GPU. Pretty cool!

NVIDIA-SMI

We can check that the GPU is actually being utilized using nvidia-smi. The NVIDIA System Management
Interface is a tool that allows us to check the operation of our GPUs. To better understand how it works,
have a look at the documentation.

To use the NVIDIA System Management Interface:

1. Open a new terminal from the Jupyter interface.

2. Type nvidia-smi in the command line.
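
If you prefer to stay inside the notebook, you can also call nvidia-smi from a code cell. Here is a minimal sketch, assuming Python 3.7+ and that nvidia-smi is on the PATH:

import subprocess

# Run nvidia-smi and print its report on GPU utilization and memory usage
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)

If the GPU is being used for training, you should see the Python process listed along with its memory usage and a non-zero GPU utilization.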

Multiple GPUs
If your machine has more than one GPU, you can use multiple GPUs to improve your training even more.
There are several ways to distribute the training over several GPUs, and the tools to do this are improving
and changing very rapidly.

We will focus here on the general ideas and suggest a couple of ways to perform parallelization of a model.

Distribution strategies

There are many ways to distribute the training across multiple GPUs and even across several machines with
many GPUs. Tensorflow has iterated a lot on the API to do this, and the stable version at the time of
publication (TF 1.13) offers several distribution strategies through the tf.contrib module. All of these will
eventually be ported to TF 2.0, which is the version we are using in this book.

Let’s start from the basics.

One way to distribute the training across multiple GPUs is to replicate the same model on each GPU and give each GPU a different batch of data. This is called data parallelization or mirrored strategy. Using this strategy allows increasing the batch size to N times the original batch size, where N is the number of GPUs available. At each weight update, each GPU receives a different batch of data, runs the forward pass and the back-propagation and then communicates the weight updates to the CPU, where all the updates are averaged and distributed back to each model on each GPU.

With this strategy, the batch size is not limited by the GPU memory. The more GPUs we add, the larger a
batch size we can use. Many cloud providers offer instances equipped with eight or even sixteen GPUs, and
research groups worldwide published results using hundreds and even thousands of GPUs.

The only limitation of this strategy is that the whole model must fit in the GPU memory, so even though the
batch size is not capped, the model size is capped.

The other way to distribute training across multiple GPUs is to split the model itself across multiple GPUs, which goes by the name of model parallelization or model distribution. Why would one use this strategy at all? It turns out that many state-of-the-art results, especially those concerning language modeling and natural language understanding, require enormous models that exceed the capacity of a single GPU. Currently, only researchers and large companies like Google or Amazon use this strategy, but in the future, it will become more accessible and more common for other users.

In the rest of this chapter, we will focus on data parallelization.

We will introduce it with the most recent API offered by Tensorflow 2.0, but we will also mention a couple
of other ways to achieve multi GPU parallelization.

Data Parallelization using Tensorflow

Tensorflow makes it easy to parallelize training by distributing data across multiple GPUs through the
tf.distribute module. At the time of publishing, although many strategies are available in TF 1.13, TF 2.0
only implements the data parallelization strategy, which we will review here.

First, we need to create an instance of the MirroredStrategy distribution strategy:

In [14]: strategy = tf.distribute.MirroredStrategy()

Next we take our model and replicate it across multiple GPUs using the context setter with:

In [15]: with strategy.scope():


model = convolutional_model()

Defining convolutional model


0.164 seconds.
Compiling the model...
0.161 seconds.

At this point we can train the model normally, but with a larger batch. We define a flag with the number of
GPUs (2 in our case):

In [16]: # adjust this to the number of gpus in your machine


NGPU = 2

And then we train the model:

In [17]: print("Training recurrent GPU model on {} GPUs ...".format(NGPU))


t0 = time()

model.fit(X_conv, y_conv,
batch_size=1024*NGPU,
epochs=2,
shuffle=True)
print("{:0.3f} seconds.".format(time() - t0))

Training convolutional GPU model on 2 GPUs ...


Epoch 1/2
25/25 [==============================] - 4s 157ms/step - loss: 2.1736 -
accuracy: 0.2430
Epoch 2/2
25/25 [==============================] - 2s 67ms/step - loss: 1.8203 -
accuracy: 0.3727
13.722 seconds.

The API for tf.distribute is still in progress and we invite you to check it out periodically to learn about
new strategies that get added.

Data Parallelization using Keras

Keras also has an independent way to parallelize training by distributing data across multiple GPUs. This is
achieved through the multi_gpu_model command. Let’s import it from keras.utils:

In [18]: from tensorflow.keras.utils import multi_gpu_model

TIP: if you’re on floydhub the keras version is probably earlier than the one we are using in
the book. If you don’t find keras.utils.multi_gpu_model try with

from tensorflow.keras.utils.training_utils import multi_gpu_model

or update keras with pip install --upgrade keras

Now let’s create a new convolutional model (on the cpu):

In [19]: with tf.device("/cpu:0"):


model = convolutional_model()

Defining convolutional model


0.082 seconds.

Compiling the model...


0.107 seconds.

and let’s distribute it over 2 GPUs (this will only work if you have at least 2 GPUs on your machine):

In [20]: model = multi_gpu_model(model, NGPU, cpu_relocation=True)

TIP: you may need to change the cpu_relocation parameter to False if your machine has
NV-link. Check the Tensorflow documentation for more information.

Once the model has been parallelized, we need to re-compile it:

In [21]: model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])

Finally we can train the model in the exact same way as we did before. Notice that the multi_gpu_model
documentation explains how a batch is divided to the GPUs:

E.g. if your `batch_size` is 64 and you use `gpus=2`,


then we will divide the input into 2 sub-batches of 32 samples,
process each sub-batch on one GPU, then return the full
batch of 64 processed samples.

This also means that if we want to maximize GPU utilization we want to increase the batch size by a factor
equal to the number of GPUs, so we will use batch_size=1024*NGPU.

In [22]: print("Training recurrent GPU model on 2 GPUs ...")


t0 = time()
model.fit(X_conv, y_conv,
batch_size=1024*NGPU,
epochs=2,
shuffle=True)
print("{:0.3f} seconds.".format(time() - t0))

Training convolutional GPU model on 2 GPUs ...


Epoch 1/2

50000/50000 [==============================] - 4s 82us/sample - loss: 2.2467 - accuracy: 0.2126
Epoch 2/2
50000/50000 [==============================] - 4s 72us/sample - loss: 1.8653
- accuracy: 0.3554
8.458 seconds.

Since with 2 GPUs, each epoch takes only a few seconds, let’s run the training for a few more epochs:

In [23]: h = model.fit(X_conv, y_conv,


batch_size=1024*NGPU,
epochs=30,
shuffle=True,
verbose=0)

and let’s plot the history like we’ve done many times in this book:

In [24]: pd.DataFrame(h.history).plot()
plt.ylim(0, 1.1)
plt.axhline(1, color='black');

[Figure: training history plot showing the loss decreasing and the accuracy increasing over 30 epochs]

As you can see, with 30 epochs the model seems to be still improving. Having multiple GPUs allowed us to
iterate fast and explore the performance of a powerful convolutional model more rapidly. Cool!

Data Parallelization using Horovod

Horovod is an open source framework maintained by Uber that allows easy parallelization of Deep Learning models written in Tensorflow, Keras, PyTorch, and MXNet. Horovod stems from the realization that the High-Performance Computing community has been running programs on supercomputers with thousands of CPUs and GPUs for decades now.

Instead of designing and implementing an independent parallelization library as Tensorflow did, Horovod
took the approach to leverage the existing best practices from the HPC community and use their results to
distribute Deep Learning models on multiple devices. In particular, Horovod uses an open implementation
of the Message Passing Interface standard called Open MPI.

Horovod is currently not compatible with TF 2.0, so we will refer the interested reader to an example using
Keras and TF 1.13. There’s an open issue tracking this and Horovod’s developers are working on it, so stay
tuned.
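
To give a flavor of what this looks like in practice, here is a minimal, hypothetical sketch of how a Keras training script like ours is typically adapted to Horovod (assuming Horovod is installed with a compatible Tensorflow 1.x/Keras setup and the script is launched with horovodrun, one process per GPU; the exact API may differ slightly across Horovod versions):

import horovod.tensorflow.keras as hvd
from tensorflow.keras.optimizers import RMSprop

hvd.init()  # initialize Horovod; GPU pinning via hvd.local_rank() is omitted here

# Wrap the optimizer so that gradient updates are averaged across workers
opt = hvd.DistributedOptimizer(RMSprop())

model = convolutional_model()  # reuse the helper defined earlier
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# Broadcast the initial weights from rank 0 so all workers start in sync
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(X_conv, y_conv,
          batch_size=1024,
          epochs=2,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)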

Supercomputing with Tensorflow Mesh

Finally, we'd like to mention a new component in the Tensorflow ecosystem called Mesh TensorFlow. Mesh TensorFlow is aimed at supercomputers with many CPUs and GPUs and is not really for the everyday user yet; still, it is a very cool project that allows training incredibly large networks on an arbitrary computing architecture.

Conclusion
In this chapter we have seen how GPUs can easily be used to train faster on larger data. Before you move on to the next chapter, make sure to terminate all instances or you'll incur charges!

Exercises

Exercise 1

In Exercise 2 of Chapter 8 we introduced a model for sentiment analysis of the IMDB dataset provided in
Keras.

• Reload that dataset and prepare it for training a model:


– choose vocabulary size
– pad the sequences to a fixed length
• define a function recurrent_model(vocab_size, maxlen) similar to the
convolutional_model function defined earlier. The function should return a recurrent model.
• Train the model on 1 CPU and measure the training time.

TIP: This is currently broken. There's an issue open about it. The model definition seems to ignore the context setter on the CPU. Just skip this point for now.

• Train the model on 1 GPU and measure the training time


• Train the model on a machine with more than 1 GPU using multi_gpu_model or even better using
distribution strategy

In [ ]:

Exercise 2

Model parallelism is a technique used for models too large to fit in the memory of a single GPU. While this is not the case for the model we developed in Exercise 1, it is still possible to distribute the model across multiple GPUs using the with context setter. Define a new model with the following architecture:

1. Embedding
2. LSTM
3. LSTM
4. LSTM
5. Dense

Place layers 1 and 2 on the first GPU, layers 3 and 4 on the second GPU and the final Dense layer on the CPU.

Train the model and see if the performance improves.

In [ ]:
10
Performance Improvement

Congratulations! We’ve traveled very far along this Deep Learning journey together! We have learned about
fully connected, convolutional and recurrent architectures and we applied them to a variety of problems,
from image recognition to sentiment analysis.

One question we haven’t answered yet is what to do when a model is not performing well. This is very
common for Deep Learning models. We train a model and the performance on the test set is disappointing.

This issue could be due to many reasons:

• too little data


• wrong architecture
• too little training
• wrong hyper-parameters

How do we approach debugging and improving a model?

This chapter is about a few techniques to do that. We will start by introducing Learning Curves, a tool that is useful to decide if more data is needed. Then we will present several regularization techniques that may be useful to fight Overfitting. Some of these techniques have been invented very recently.

Finally, we will discuss data augmentation, which is useful in some cases, e.g., when the input data are images. We will conclude the chapter with a brief section on hyperparameter optimization. This is a vast topic that can be approached in several ways.

Let’s start as usual with a few imports:


In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Learning curves
The first tool we present is the Learning Curve. A learning curve plots the behavior of the training and validation scores as a function of how much training data we feed to the model.

Let’s load a simple dataset and explore how to build a learning curve. We will use the digits dataset from
Scikit Learn, which is quite small. First of all we import the load_digits function and use it:

In [3]: from sklearn.datasets import load_digits

Now let’s create a variable called digits we’ll fill as the result of calling load_digits():

In [4]: digits = load_digits()

Then we assign digits.data and digits.target to X and y respectively:

In [5]: X, y = digits.data, digits.target

Let’s look at the shape of the X data:

In [6]: X.shape

Out[6]: (1797, 64)

X is an array of 1797 images that have been unrolled as feature vectors of length 64.

In [7]: y.shape

Out[7]: (1797,)

In order to see the images we can always reshape them to the original 8x8 format. Let’s plot a few digits:

In [8]: for i in range(9):


plt.subplot(3,3,i+1)
plt.imshow(X.reshape(-1, 8, 8)[i], cmap='gray')
plt.title(y[i])
plt.tight_layout()

[Figure: a 3x3 grid showing the first nine 8x8 digit images, titled with their labels 0 to 8]

TIP: the function tight_layout automatically adjusts subplot parameters so that the subplots fit into the figure area. See the Documentation for further details.

Since digits is a Scikit Learn Bunch object, it has a property with the description of the data (in the DESCR
key). Let’s print it out:

In [9]: print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset


--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 5620


:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Di
gits

The data set contains images of hand-written digits: 10 classes where


each class refers to a digit.

Preprocessing programs made available by NIST were used to extract


normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.


T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their


Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.

From digits.DESCR we find that the input is made of integers in the range (0,16). Let’s check that it’s true
by calculating the minimum and maximum values of X:

In [10]: X.min()

Out[10]: 0.0

In [11]: X.max()

Out[11]: 16.0

Let’s also check the data type of X:

In [12]: X.dtype

Out[12]: dtype('float64')

As previously seen in Chapter 3, it's good practice to rescale the input so that its values lie between 0 and 1. Let's do this by dividing by the maximum possible value (16.0):

In [13]: X_sc = X / 16.0

y contains the labels as a list of digits:

In [14]: y[:20]

Out[14]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Although it could appear that the digits are sorted, actually they are not:

In [15]: plt.plot(y[:80], 'o-');



[Figure: plot of the first 80 labels in y, showing that the digits are not sorted]

As seen in Chapter 3, let’s convert them to 1-hot encoding, to substitute the categorical column with a set of
boolean columns, one for each category. First, let’s import the to_categorical method from
keras.utils:

In [16]: from tensorflow.keras.utils import to_categorical

Then let’s set the variable of y_cat to these categories:

In [17]: y_cat = to_categorical(y, 10)

Now we can split the data into a training and a test set. Let’s import the train_test_split function and
call it against our data and the target categories:

In [18]: from sklearn.model_selection import train_test_split

We will split the data with a 70/30 ratio and we will use a random_state here, so that we all get the exact same train/test split. We will also use the option stratify, to require that the ratio of classes be balanced, i.e. about 10% of the samples for each class (we already introduced this concept in Chapter 3 for the stratified K-fold cross validation).

In [19]: X_train, X_test, y_train, y_test = \


train_test_split(X_sc, y_cat, test_size=0.3,
random_state=0, stratify=y)

Let’s double check that we have balanced the classes correctly. Since y_test is now a 1-hot encoded vector,
we need first to recover the corresponding digits. We can do this using the function argmax:

In [20]: y_test_classes = np.argmax(y_test, axis=1)

y_test_classes is an array of digits:

In [21]: y_test_classes

Out[21]: array([1, 4, 5, 6, 9, 1, 2, 2, 2, 0, 7, 5, 4, 8, 6, 6, 8, 2, 0, 9, 7, 3,
9, 1, 3, 5, 2, 2, 9, 9, 8, 9, 7, 6, 1, 3, 1, 4, 7, 6, 7, 3, 5, 0,
1, 1, 7, 5, 4, 6, 0, 5, 8, 9, 0, 5, 4, 5, 3, 5, 5, 6, 5, 4, 9, 6,
5, 9, 6, 5, 7, 6, 6, 3, 0, 8, 4, 4, 3, 2, 9, 7, 2, 7, 9, 8, 8, 0,
1, 7, 2, 3, 3, 5, 5, 6, 0, 4, 3, 7, 1, 4, 1, 9, 0, 5, 3, 8, 9, 6,
4, 9, 2, 9, 2, 0, 6, 7, 8, 1, 9, 2, 8, 6, 3, 6, 5, 1, 3, 6, 2, 3,
0, 6, 5, 5, 9, 2, 8, 1, 0, 1, 4, 5, 1, 0, 3, 0, 0, 9, 8, 9, 2, 2,
5, 8, 1, 9, 3, 7, 6, 8, 7, 3, 1, 2, 5, 1, 1, 6, 3, 9, 6, 9, 8, 9,
9, 8, 9, 9, 8, 8, 4, 7, 6, 2, 6, 4, 3, 4, 4, 3, 8, 5, 4, 8, 3, 1,
3, 4, 1, 0, 7, 8, 7, 5, 0, 6, 0, 1, 8, 7, 0, 0, 3, 4, 8, 9, 4, 4,
1, 1, 2, 1, 9, 2, 7, 7, 6, 9, 2, 9, 6, 0, 5, 2, 4, 4, 4, 6, 4, 0,
1, 8, 3, 4, 0, 5, 9, 0, 2, 0, 0, 1, 3, 2, 8, 1, 6, 1, 1, 9, 2, 7,
8, 3, 8, 2, 1, 3, 3, 0, 7, 8, 6, 7, 1, 4, 8, 2, 1, 4, 2, 6, 0, 6,
0, 1, 0, 8, 0, 6, 5, 1, 6, 6, 9, 2, 9, 2, 8, 5, 9, 4, 3, 9, 2, 9,
7, 9, 1, 3, 0, 3, 9, 2, 6, 1, 0, 0, 6, 3, 5, 0, 0, 3, 8, 0, 3, 0,
7, 7, 6, 1, 8, 8, 7, 2, 7, 5, 8, 5, 3, 7, 8, 2, 5, 4, 5, 1, 5, 7,
5, 6, 4, 0, 6, 7, 1, 1, 6, 4, 0, 4, 0, 1, 3, 4, 4, 4, 5, 4, 5, 5,
4, 3, 7, 9, 1, 1, 4, 7, 2, 0, 2, 9, 7, 8, 4, 8, 2, 4, 8, 7, 9, 4,
8, 0, 7, 0, 6, 5, 4, 2, 3, 5, 3, 5, 7, 7, 4, 1, 3, 0, 1, 1, 8, 6,
5, 1, 8, 0, 0, 3, 7, 7, 4, 9, 0, 4, 6, 9, 0, 7, 9, 2, 9, 2, 9, 6,
6, 5, 4, 5, 7, 3, 7, 7, 5, 2, 2, 7, 8, 9, 3, 3, 2, 6, 3, 6, 2, 1,
7, 4, 8, 0, 8, 2, 4, 3, 7, 6, 3, 5, 7, 9, 3, 7, 9, 5, 3, 7, 7, 6,
4, 8, 0, 8, 4, 6, 8, 4, 1, 7, 6, 5, 9, 3, 4, 5, 9, 8, 2, 3, 2, 5,
6, 4, 9, 1, 5, 9, 8, 2, 6, 1, 3, 1, 0, 7, 5, 2, 8, 1, 5, 2, 2, 3,
0, 0, 7, 8, 5, 2, 3, 5, 2, 6, 1, 3])

There are many ways to count the number of each digit, the simplest is to temporarily wrap the array in a
Pandas Series and use the .value_counts() method:

In [22]: pd.Series(y_test_classes).value_counts()

Out[22]:

0
5 55
3 55
1 55
9 54
7 54
6 54
4 54
0 54
2 53
8 52

Great! Our classes are balanced, with around 54 samples per class. Let’s quickly train a model to classify
these digits. First we load the necessary libraries:

In [23]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense

We create a small, fully connected network with 64 inputs, a single inner layer with 16 nodes and 10 outputs
with a Softmax activation function:

In [24]: model = Sequential()


model.add(Dense(16, input_shape=(64,),
activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])

Let’s also save the initial weights so that we can always re-start from the same initial configuration:

In [25]: initial_weights = model.get_weights()

Now we fit the model on the training data for 100 epochs:

In [26]: model.fit(X_train, y_train, epochs=100, verbose=0)

Out[26]: <tensorflow.python.keras.callbacks.History at 0x7fe37c0f04e0>



The model converged and we can now evaluate the final training and test accuracies:

In [27]: _, train_acc = model.evaluate(X_train, y_train,


verbose=0)
_, test_acc = model.evaluate(X_test, y_test,
verbose=0)

TIP: The character _ means that part of the function result can be deliberately ignored, and that variable can be thrown away.

And print them out:

In [28]: print("Train accuracy: {:0.4}".format(train_acc))


print("Test accuracy: {:0.4}".format(test_acc))

Train accuracy: 0.9928


Test accuracy: 0.9741

The performance on the test set is lower than the performance on the training set, which indicates the
model is overfitting.

TIP: Overfitting is a fundamental concept in Machine Learning and Deep Learning. If you
are not familiar with it, have a look at Chapter 3.

Before we start playing with different techniques to reduce overfitting, it is legitimate to ask if we don’t have
enough data to solve the problem.

This is a very common situation: you collect data with labels, you train a model, and the model does not
perform as well as you hoped.

What should you do at that point? Should you collect more data? Alternatively, should you invest time in
searching for better features or a different model?

With the little information we have, it is hard to know which of these alternatives is more likely to help.
What is sure, on the other hand, is that all these alternatives carry a cost. For example, let’s say you think
that more data is what you need.

Collecting more labeled data could be as cheap and simple as downloading a new dataset from your source,
or it could be as involved and complicated as coordinating with the data collection team at your company,
hiring contractors to label the new data, and so on. In other words, the time and cost associated with new
data collection strongly vary and need to be assessed case by case.

If, on the other hand, you decided to experiment with new features and model architectures, this could be as
simple as adding a few layers and nodes to your model, or as complex as an R&D team dedicating several
months to discovering new features for your particular dataset. Again, the actual cost of this option strongly
depends on your specific use case.

Which of the two choices is more promising?

Do we need more data or a better model?

A learning curve is a tool we can use to answer that question. Here is how we build it.

First, we set X_test aside. Then we take increasingly large fractions of X_train and, for each of these fractions, we fit the model and evaluate it both on that fraction and on the test set. While the training fraction is small, we expect the model to overfit the training data and perform quite poorly on the test set.

As we gradually take more training data, the model should improve and learn to generalize better, i.e., the
test score should increase. We proceed like this until we have used all our training data.

At this point two cases are possible. If it looks like the test performance stopped increasing with the size of
the training set, we probably reached the maximum performance of our model. In this case, we should
invest time in looking for a better model to improve our performance.

In the second case, it would seem that the test score would continue to improve if only we had access to more training data. If that's the case, we should probably go out looking for more labeled data first and only then worry about changing the model.

So, now you know how to answer the big question of more data or better model: use a learning curve.

Let’s draw one together. First, we take increasing fractions of the training data using the function
np.linspace.

TIP: np.linspace returns evenly spaced numbers over a specified interval. In this case, we are creating five fractions, from 10% to 90% of the data.

In [29]: fracs = np.linspace(0.1, 0.90, 5)


fracs

Out[29]: array([0.1, 0.3, 0.5, 0.7, 0.9])

In [30]: train_sizes = list((len(X_train) * fracs).astype(int))


train_sizes

Out[30]: [125, 377, 628, 879, 1131]

Then we loop over the train sizes, and for each train_size we do the following:

• take exactly train_size data from the X_train


• reset the model to the initial weights
• train the model using only the fraction of training data
• evaluate the model on the fraction of training data
• evaluate the model on the test data
• append both scores to arrays for plotting

We will first walk through these steps manually for the first train_size in our train_sizes array, and then reuse that work to iterate over the full list of train_sizes.

Let’s create some variables where we’ll store our scores:

In [31]: train_scores = []
test_scores = []

Now let's take a fraction of the training data using the train_test_split function as we usually would:

In [32]: X_train_frac, _, y_train_frac, _ = \


train_test_split(X_train, y_train,
train_size=0.1,
test_size=None,
random_state=0,
stratify=y_train)

Let’s reset the weights to their initial values:

In [33]: model.set_weights(initial_weights)

Now we can train our model using the fit function, as normal:

In [34]: h = model.fit(X_train_frac, y_train_frac,


verbose=0,
epochs=100)

With our model trained, let’s evaluate it over our training set and save it into the train_scores variable
from above:

In [35]: r = model.evaluate(X_train_frac, y_train_frac,


verbose=0)
train_scores.append(r[-1])

Let’s do the same with our test set:

In [36]: e = model.evaluate(X_test, y_test, verbose=0)


test_scores.append(e[-1])

It’s kind of silly to do this manually for every train_size entry. Instead, let’s iterate over them and build up
our train_scores and test_scores variables:

In [37]: train_scores = []
test_scores = []

for train_size in train_sizes:


X_train_frac, _, y_train_frac, _ = \
train_test_split(X_train, y_train,
train_size=train_size,

test_size=None,
random_state=0,
stratify=y_train)

model.set_weights(initial_weights)

h = model.fit(X_train_frac, y_train_frac,
verbose=0,
epochs=100)

r = model.evaluate(X_train_frac, y_train_frac,
verbose=0)
train_scores.append(r[-1])

e = model.evaluate(X_test, y_test, verbose=0)


test_scores.append(e[-1])

print("Done size: ", train_size)

Done size: 125


Done size: 377
Done size: 628
Done size: 879
Done size: 1131

Let’s plot the training score and the test score as a function of increasing training size:

In [38]: plt.plot(train_sizes, train_scores, 'o-', label="Training score")


plt.plot(train_sizes, test_scores, 'o-', label="Test score")
plt.legend(loc="best");

[Figure: learning curve plotting the training score and the test score as a function of training set size]

Judging from the curve, it appears the test score would keep improving if we added more data. This is the
indication we were looking for. If on the other hand the test score was not improving, it would have been
more promising to improve the model first and only then go look for more data if needed.

Reducing Overfitting
Sometimes it’s not easy to go out and look for more data. It could be time consuming and expensive. There
are a few ways to improve a model and reduce its propensity to overfit without requiring additional data.
These fall into the big family of Regularization techniques.

The general idea here is the following. By now you should understand that the complexity of a model is somewhat represented by the number of parameters the model has. In simple terms, a model with many layers and many nodes is more complex than a model with a single layer and few nodes. More complexity gives the model more freedom to learn nuances in our training data. This is what makes Neural Networks so powerful.

On the other hand, the more freedom a model has, the more likely it will be to overfit on the training data,
losing the ability to generalize. We could try to reduce the model freedom by reducing the model
complexity, but this would not always be a great idea as it would make the model less able to pick up subtle
patterns in our data.

A different approach would be to keep the model very complex, but change something else in the model to
push it towards less complex solutions. In other words, instead of removing the complexity, we allow the
model to choose complex solutions, but we drive the model towards simpler, more regular, solutions.

Regularization refers to techniques to keep the complexity of a model from spinning out of control.

Let’s review a few ways to regularize a model, and to ease our comparison we will define a few helper
functions.

First, let’s define a helper function to repeat the training several times. This helper function will be useful to
average out any statistical fluctuations in the model behavior due to the random initialization of the weights.
We will reset the backend at each iteration to save memory and erase any previous training.

Let’s load the backend first:

In [39]: import tensorflow.keras.backend as K

And then let’s define the repeat_train helper function. This function expects an already created
model_fn as input, i.e. a function that returns a model, and it repeats the following process a number of
times specified by the input repeats:

1. clear the session

K.clear_session()

2. create a model using the model_fn

model = model_fn()

3. train the model using the training data

h = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
verbose=verbose,
batch_size=batch_size,
epochs=epochs)

4. retrieve the accuracy of the model on training data (acc) and test data (val_acc) and append the results to the histories array

histories.append([h.history['accuracy'], h.history['val_accuracy']])

Finally, the repeat_train function calculates the average history along with its standard deviation and
returns them.

In [40]: def repeat_train(model_fn, repeats=3, epochs=40,


verbose=0, batch_size=256):
"""

Repeatedly train a model on (X_train, y_train),


averaging the histories.

Parameters
----------
model_fn : a function with no parameters
Function that returns a Keras model

repeats : int, (default=3)


Number of times the training is repeated

epochs : int, (default=40)


Number of epochs for each training run

verbose : int, (default=0)


Verbose option for the `model.fit` function

batch_size : int, (default=256)


Batch size for the `model.fit` function

Returns
-------
mean, std : np.array, shape: (epochs, 2)
mean : array contains the accuracy
and validation accuracy history averaged
over the different training runs
std : array contains the standard deviation
over the different training runs of
accuracy and validation accuracy history
"""
histories = []

# repeat model definition and training


for repeat in range(repeats):
K.clear_session()
model = model_fn()

# train model on training data


h = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
verbose=verbose,
batch_size=batch_size,
epochs=epochs)

# append accuracy and val accuracy to list


histories.append([h.history['accuracy'],
h.history['val_accuracy']])
print(repeat, end=" ")

histories = np.array(histories)
print()

# calculate mean and std across repeats:


mean = histories.mean(axis=0)
std = histories.std(axis=0)
return mean, std

The repeat_train function expects an already created model_fn as input. Hence, let’s define a new
function that will create a fully connected Neural Network with 3 inner layers. We’ll call this function
base_model, since we will use this basic model for further comparison:

In [41]: def base_model():


"""
Return a fully connected model with 3 inner layers
with 1024 nodes each and relu activation function
"""
model = Sequential()
model.add(Dense(1024, input_shape=(64,),
activation='relu'))
model.add(Dense(1024,
activation='relu'))
model.add(Dense(1024,
activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model

TIP: Notice that this model is quite big for the problem we are trying to solve. We
purposefully make the model big, so that there are lots of parameters and it can overfit
easily.

Now we repeat the training of the base (non-regularized) model 5 times using the repeat_train helper function:

In [42]: ((m_train_base, m_test_base),


(s_train_base, s_test_base)) = \
repeat_train(base_model, repeats=5)

0 1 2 3 4

We can plot the histories for training and test. First, let's define an additional helper function plot_mean_std(), which plots the average history as a line and adds a colored area around it corresponding to +/- 1 standard deviation:

In [43]: def plot_mean_std(m, s):


"""
Plot the average history as a line
and add a colored area around it corresponding
to +/- 1 standard deviation
"""
plt.plot(m)
plt.fill_between(range(len(m)), m-s, m+s, alpha=0.2)

Then, let’s plot the results obtained training 5 times the base model:

In [44]: plot_mean_std(m_train_base, s_train_base)


plot_mean_std(m_test_base, s_test_base)

plt.title("Base Model Accuracy")


plt.legend(['Train', 'Test'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.ylim(0.85, 1.05);

[Figure: Base Model Accuracy. Train and test accuracy as a function of epochs]

Overfitting in this case is evident, with the test score saturating at a lower value than the training score.

Model Regularization

[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common procedure in Machine Learning and it has been used to improve the performance of complex models with many parameters.

Remember the Cost Function we have introduced in Chapter 3? The main goal of the cost function is to
make sure that the predictions of the model are close to the correct labels.

Regularization works by modifying the original cost function C with an additional term λCr , that somehow
penalizes the complexity of the model:

$$C' = C + \lambda C_r \tag{10.1}$$

The original cost function C would decrease as the model predictions got closer and closer to the actual
labels. In other words, the gradient descent algorithm for the original cost would push the parameters to the
region of parameter space that would give the best predictions on the training data. In complex models with
many parameters, this could result in overfitting because of all the freedom the model had.

The new penalty Cr pushes the model to be “simple”: it grows with the size of the model parameters, but it is entirely unrelated to the goodness of the prediction.

Weight Regularization

The total cost C′ is a combination of the two terms, and therefore the model will have to try to generate the
best predictions possible while retaining simplicity. In other words, the gradient descent algorithm is now
solving a constrained minimization problem, where some regions of the parameter space are too expensive
to use for a solution.

The hyper-parameter λ determines the relative strength of the regularization, and we can set it.

But how do we implement Cr in practice? There are several ways to do it. Weight Regularization assigns a
penalty proportional to the size of the weights, for example:

$$C_w = \sum_w |w| \tag{10.2}$$

or:

$$C_w = \sum_w w^2 \tag{10.3}$$

The first one is called l1-regularization and it is the sum of the absolute values of the weights. The second one is called l2-regularization and it is the sum of the squared values of the weights. While they both suppress complexity, their effect is different.

l1-regularization pushes most weights to be zero, except for a few that will be non-zero. In other words, the
net effect of l1-regularization is to make the weight matrix sparse.

l2-regularization, on the other hand, suppresses weights quadratically, which means that any weight that is much larger than the others will give a much higher contribution to Cr and therefore to the overall cost. The net effect of this is to make all weights equally small.

Similarly to weight regularization, Bias Regularization and Activity Regularization penalize the cost
function with a term proportional to the size of the biases and the activations respectively.
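
In Keras, all three flavors can be attached to a layer through the corresponding arguments. Here is a minimal sketch (the penalty strengths 0.001 are arbitrary, chosen only for illustration):

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

# A Dense layer with weight, bias and activity regularization attached
layer = Dense(1024,
              activation='relu',
              kernel_regularizer=l1(0.001),    # l1 penalty on the weights
              bias_regularizer=l2(0.001),      # l2 penalty on the biases
              activity_regularizer=l2(0.001))  # l2 penalty on the layer output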

Let's compare the behavior of our base model with a model with exactly the same architecture but endowed with l2 weight regularization.

We start by defining a helper function that creates a model with weight regularization: we start from the
function base_model, and we create the function regularized_model, adding the
kernel_regularizer option to each layer. First of all let’s import keras’s l2 regularizer function:

In [45]: from tensorflow.keras.regularizers import l2

In [46]: def regularized_model():


"""
Return an l2-weight-regularized, fully connected
model with 3 inner layers with 1024 nodes each
and relu activation function.
"""
reg = l2(0.005)

model = Sequential()
model.add(Dense(1024,
input_shape=(64,),
activation='relu',
kernel_regularizer=reg))
model.add(Dense(1024,
activation='relu',
kernel_regularizer=reg))
model.add(Dense(1024,
activation='relu',
kernel_regularizer=reg))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model

Now we compare the results of no regularization and l2-regularization. Let’s repeat the training 3 times.

In [47]: (m_train_reg, m_test_reg), (s_train_reg, s_test_reg) = \


repeat_train(regularized_model)

0 1 2

TIP: Notice that, since we didn't specify the number of times to train the model, it repeats the training according to the default parameter, i.e. 3 times.

Let’s now compare the performance of the weight regularized model with our base model. We will also plot
a dashed line at the maximum test accuracy obtained by the base model:

In [48]: plot_mean_std(m_train_base, s_train_base)


plot_mean_std(m_test_base, s_test_base)

plot_mean_std(m_train_reg, s_train_reg)
plot_mean_std(m_test_reg, s_test_reg)

plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')

plt.title("Regularized Model Accuracy")


plt.legend(['Base - Train', 'Base - Test',
'l2 - Train', 'l2 - Test',
'Max Base Test'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.ylim(0.85, 1.05);

[Figure: Regularized Model Accuracy. Train and test accuracy of the base and l2-regularized models as a function of epochs, with a dashed line at the maximum base test accuracy]

With this particular dataset, weight regularization does not seem to improve the model performance.

This is visually true at least within the small number of epochs we are running. It may be the case that if we
let the training run much longer regularization would help, but we don’t know for sure, and that can cost a
lot of time and money.

It’s however good to know that this technique exists and keep it in mind as one of the options to try. In
practice, weight regularization has been replaced by more modern regularization techniques such as
Dropout and Batch Normalization.

Dropout

[Dropout](https://en.wikipedia.org/wiki/Dropout_(neural_networks)) was introduced in 2014 by Srivastava et al. at the University of Toronto to address the problem of overfitting in large networks. The key idea of Dropout is to randomly drop units (along with their connections) from the Neural Network during training.

In other words, during the training phase, each unit has a non-zero probability not to emit its output to the
next layer. This prevents units from co-adapting too much.
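
To make the idea concrete, here is a toy NumPy sketch of what dropping units in a single forward pass looks like. This is just an illustration of the concept (using so-called inverted dropout scaling), not how Keras implements it internally:

import numpy as np

rng = np.random.RandomState(0)
rate = 0.5                                  # fraction of units to drop

activations = rng.rand(8)                   # pretend outputs of a small layer
mask = rng.rand(8) >= rate                  # each unit is kept with probability 1 - rate
dropped = activations * mask / (1 - rate)   # rescale kept units so the expected output is unchanged

print(mask)
print(dropped)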

Let’s reflect on this for a second. It looks as if we are damaging the network by dropping a fraction of the
units with non zero probability during training time. We are crippling the network and making it a lot
harder for it to learn. This is counter-intuitive! Why are we weakening our model?

[Figure: Dropout]

It turns out that the underlying principle is quite universal in Machine Learning: we make the network less
stable so that the solution found during training is more general, more robust, and more resilient to failure.
Another way to look at this is to say that we are adding noise at training time so that the network will need
to learn more general patterns that are resistant to noise.

The technique has similarities with ensemble techniques, because it's as if, during training, the network sampled from many different “thinned” networks, where a few of the nodes are not working. At test time, dropout is turned off, and we use the full network. This technique has been shown to improve the performance of Neural Networks on Supervised Learning tasks in vision, speech recognition, document classification, and many others.

We strongly encourage you to read the paper if you want to understand how dropout is implemented.

On the other hand, if you are eager to apply it, you’ll be happy to hear that Dropout is implemented in Keras
as a layer, so all we need to do is to add it between the layers. We’ll import it first:

In [49]: from tensorflow.keras.layers import Dropout

And then we define a dropout_model, again starting from the base_model and adding the dropout layers. We've tested several configurations, and we've found that with this dataset good results can be obtained with a dropout rate of 10% at the input and 50% in the inner layers. Feel free to experiment with different numbers and see what results you get.

TIP: according to the Documentation, in the Dropout layer the argument rate is a float
between 0 and 1, that gives the fraction of the input units to drop.

In [50]: def dropout_model():


"""
Return a fully connected model
with 3 inner layers with 1024 nodes each
and relu activation function. Dropout can
be applied by selecting the rate of dropout
"""
input_rate = 0.1
rate = 0.5

model = Sequential()
model.add(Dropout(input_rate, input_shape=(64,)))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(rate))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model

Let's train our network three times using the dropout_model:

In [51]: (m_train_dro, m_test_dro), (s_train_dro, s_test_dro) = \


repeat_train(dropout_model)

0 1 2

Next, let’s plot the accuracy of the dropout model against the base model:

In [52]: plot_mean_std(m_train_base, s_train_base)


plot_mean_std(m_test_base, s_test_base)

plot_mean_std(m_train_dro, s_train_dro)
plot_mean_std(m_test_dro, s_test_dro)

plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')

plt.title("Dropout Model Accuracy")



plt.legend(['Base - Train', 'Base - Test',


'Dropout - Train', 'Dropout - Test',
'Max Base Test'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.ylim(0.85, 1.05);

[Figure: Dropout Model Accuracy. Train and test accuracy of the base and dropout models as a function of epochs, with a dashed line at the maximum base test accuracy]

Nice! Adding Dropout to our model pushed our test score above the base model for the first time (although not by much)! This is great because we didn't have to add more data. Also, notice how the training score is lower than the test score, which indicates the model is not overfitting, and there seems to be even more room for improvement if we run the training for more epochs!

The Dropout paper also mentions the use of a global constraint to further improve the behavior of a Dropout network. Constraints can be added in Keras through the kernel_constraint parameter available in the definition of a layer. Following the paper, let's see what happens if we impose a max_norm constraint on the weights of the model. According to the Documentation, this is equivalent to saying that the norm of the incoming weights of each unit cannot be higher than a certain constant, which we specify below through the variable c.

Let’s load the max_norm constraint first:

In [53]: from tensorflow.keras.constraints import max_norm



Let’s define a new model function dropout_max_norm, that has both dropout and the max_norm
constraint:

In [54]: def dropout_max_norm():


"""
Return a fully connected model with Dropout
and Max Norm constraint.
"""
input_rate = 0.1
rate = 0.5
c = 2.0

model = Sequential()
model.add(Dropout(input_rate, input_shape=(64,)))
model.add(Dense(1024, activation='relu',
kernel_constraint=max_norm(c)))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu',
kernel_constraint=max_norm(c)))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu',
kernel_constraint=max_norm(c)))
model.add(Dropout(rate))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model

As before, we repeat the training three times and average the results:

In [55]: (m_train_dmn, m_test_dmn), (s_train_dmn, s_test_dmn) = \


repeat_train(dropout_max_norm)

0 1 2

and plot the comparison with the base model:

In [56]: plot_mean_std(m_train_base, s_train_base)


plot_mean_std(m_test_base, s_test_base)

plot_mean_std(m_train_dmn, s_train_dmn)
plot_mean_std(m_test_dmn, s_test_dmn)

plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')

plt.title("Dropout & Max Norm Model Accuracy")


plt.legend(['Base - Train', 'Base - Test',
'Dropout & Max Norm - Train', 'Dropout & Max Norm - Test',
'Max Base Test'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.ylim(0.85, 1.05);

Dropout & Max Norm Model Accuracy


1.050
1.025
1.000
0.975
Accuracy

0.950
0.925 Base - Train
Base - Test
0.900 Dropout & Max Norm - Train
0.875 Dropout & Max Norm - Test
Max Base Test
0.850
0 5 10 15 20 25 30 35 40
Epochs

In this particular case, the Max Norm constraint does not seem to produce results that are qualitatively
different from the simple Dropout, but there may be datasets where this constraint helps make the network
converge to a better result.

Batch Normalization

Batch Normalization was introduced in 2015 as an even better regularization technique, as described in this
paper. The authors of the paper started from the observation that training of deep Neural Networks is slow
because the distribution of the inputs to a layer changes during training, as the parameters of the previous
layers change. Since the inputs to a layer are the outputs of the previous layer, and these are determined by
the parameters of the previous layer, as training proceeds the distribution of the output may drift, making it
harder for the next layer to adapt.

The authors' solution to this problem is to introduce a normalization step between layers that takes the output values for the current batch and normalizes them by removing the mean and dividing by the standard deviation. They observe that their technique allows using much higher learning rates and being less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.

Let's walk through the batch normalization algorithm with a small code example. First we calculate the mean and standard deviation of the batch:

mu_B = X_batch.mean()
std_B = X_batch.std()

Then we subtract the mean and divide by the standard deviation:

X_batch_scaled = (X_batch - mu_B) / np.sqrt(std_B**2 + 0.0001)

Finally we rescale the batch with 2 parameters γ and β that are learned during training:

X_batch_norm = gamma * X_batch_scaled + beta

TIP: Using math notation, the complete algorithm for Batch Normalization is the following.
Given a mini-batch B = \{x_1, \ldots, x_m\}:

    \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i                                (10.4)

    \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2               (10.5)

    \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}          (10.6)

    y_i = \gamma \hat{x}_i + \beta                                        (10.7)
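To make the walk-through above concrete, here is a self-contained NumPy sketch of the same three steps on a
random batch. X_batch, gamma, and beta below are placeholders invented for this example; in a real layer,
gamma and beta are learned during training:

import numpy as np

X_batch = np.random.randn(32) * 3 + 5      # a fake mini-batch of 32 activations

mu_B = X_batch.mean()                      # batch mean
var_B = X_batch.var()                      # batch variance
eps = 0.0001                               # small constant for numerical stability

X_batch_scaled = (X_batch - mu_B) / np.sqrt(var_B + eps)

gamma, beta = 1.0, 0.0                     # learned scale and shift (identity here)
X_batch_norm = gamma * X_batch_scaled + beta

print(X_batch_norm.mean(), X_batch_norm.std())   # approximately 0 and 1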

Batch Normalization is very powerful, and Keras makes it available as a layer too, as described in the
Documentation. One important thing to note is that BN needs to be applied before the nonlinear activation
function. Let’s see how it’s done. First we load the BatchNormalization and Activation layers:

In [57]: from tensorflow.keras.layers import BatchNormalization, Activation

Then we define again a new model function batch_norm_model that adds Batch Normalization to our fully
connected network defined in the base_model:

In [58]: def batch_norm_model():


"""
Return a fully connected model with
Batch Normalization.

Returns
-------
model : a compiled keras model
"""
model = Sequential()

model.add(Dense(1024, input_shape=(64,)))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(10))
model.add(BatchNormalization())
model.add(Activation('softmax'))

model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model

Batch Normalization seems to work better with smaller batches, so we will run the repeat_train function
with a smaller batch_size.

Since smaller batches mean more weight updates at each epoch, we will also run the training for fewer
epochs.

Let’s do a quick back of the envelope calculation.

We have 1257 points in the training set. Previously, we used batches of 256 points, which gives five weight
updates per epoch, and a total of 200 updates in 40 epochs. If we reduce the batch size to 32, we will have 40
updates at each epoch, so we should run the training for only five epochs.
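The same back-of-the-envelope calculation in code (just arithmetic, shown for clarity):

from math import ceil

n_train = 1257
updates_per_epoch_256 = ceil(n_train / 256)              # 5 updates per epoch
total_updates = updates_per_epoch_256 * 40               # 200 updates over 40 epochs

updates_per_epoch_32 = ceil(n_train / 32)                # 40 updates per epoch
print(total_updates / updates_per_epoch_32)              # 5.0 equivalent epochs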

We will run it a bit longer to see the effectiveness of Batch Normalization. 10-15 epochs will suffice to bring
the model accuracy to a much higher value on the test set.

In [59]: (m_train_bn, m_test_bn), (s_train_bn, s_test_bn) = \
             repeat_train(batch_norm_model,
                          batch_size=32,
                          epochs=15)

0 1 2

Let’s plot the results and compare them with the base model:

In [60]: plot_mean_std(m_train_base, s_train_base)


plot_mean_std(m_test_base, s_test_base)

plot_mean_std(m_train_bn, s_train_bn)
plot_mean_std(m_test_bn, s_test_bn)

plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')

plt.title("Batch Norm Model Accuracy")


plt.legend(['Base - Train', 'Base - Test',
'Batch Norm - Train', 'Batch Norm - Test',
'Max Base Test'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.ylim(0.85, 1.05)
plt.xlim(0, 15);

Plot: "Batch Norm Model Accuracy", showing train and test accuracy vs. epochs for the base model and the
Batch Norm model, with a dashed line at the maximum base test accuracy.

Awesome! With the addition of Batch Normalization, the model converged to a solution that is better able
to generalize on the Test set, i.e., it is overfitting a lot less than the base solution.

Data augmentation
Another strong technique to improve the performance of a model without requiring the collection of new
data is Data Augmentation. Let’s consider the problem of image recognition, and to make things practical,
let’s consider this nice picture of a squirrel:

If your goal were to recognize the animal in this picture, you would still be able to solve the task effectively
even if we distorted the image or rotated it. There’s a variety of transformations that we could apply to the
image, without altering its information content, including:

• rotation
• shift (up, down, left, right)
• shear
• zoom
• flip (vertical, horizontal)
• rescale
• color correction and changes
• partial occlusion

Image of a squirrel

None of these transformations would destroy the information contained in the image; they would only change
the absolute values of the pixels. A human would still be able to recognize a rotated squirrel or a shifted panda,
very much as you can still recognize your friends even after all the filters they apply to their selfies. This
property means a good image recognition algorithm should also be resilient to this kind of transformation.

If we apply these transformations to an image in our training dataset, we can generate an infinite number of
variations of that image, giving us access to a much, much larger synthetic training dataset. This process is
what data augmentation is about: generating new labeled data points starting from existing data through the
use of valid transformations.

Although the example we provided is in the domain of image recognition, the same process can be applied
to augment other kinds of data, for example, speech samples for a speech recognition task. Given a
particular sound file, we can change its speed and pitch, add background noise, or add silences to generate
variations of the speech snippet that would still be correctly understood by a human.
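As a minimal illustration of this idea for audio (a NumPy-only sketch with made-up data, not part of the
book's image pipeline; a real project would typically use an audio library), here are three such augmentations
applied to a 1-D array of samples, assuming sr is its sample rate:

import numpy as np

sr = 16000                                     # assumed sample rate in Hz
signal = np.random.randn(sr)                   # placeholder: one second of "audio"

# 1. add low-level background noise
noisy = signal + 0.005 * np.random.randn(len(signal))

# 2. change the speed by resampling with linear interpolation
speed = 1.1                                    # 10% faster
new_idx = np.arange(0, len(signal), speed)
faster = np.interp(new_idx, np.arange(len(signal)), signal)

# 3. add 0.1 seconds of silence at the beginning and at the end
padded = np.pad(signal, (sr // 10, sr // 10), mode='constant')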

Let’s see how Keras allows us to do it easily for images. We need to load the ImageDataGenerator object:

In [61]: from tensorflow.keras.preprocessing.image import ImageDataGenerator

This class creates a generator that can apply all sorts of variations to an input image. Let’s initialize it with a
few parameters:

• We’ll set the rescale factor to 1/255 to normalize pixel values to the interval [0-1]
• We’ll set the width_shift_range and height_shift_range to ±10% of the total range
• We’ll set the rotation_range to ±20 degrees
• We’ll set the shear_range to ±0.3 degrees
• We’ll set the zoom_range to ±30%
• We’ll allow for horizontal_flip of the image

See the Documentation for a complete overview of all the available arguments.

In [62]: idg = ImageDataGenerator(rescale = 1./255,


width_shift_range=0.1,
height_shift_range=0.1,
rotation_range = 20,
shear_range = 0.3,
zoom_range = 0.3,
horizontal_flip = True)

The next step is to create an iterator that will generate images with the image data generator. We need to tell
it where our training data are. Here we use the method flow_from_directory, which is useful when we
have images stored in a directory, and we tell it to produce target images of size 128x128. The input folder
structure needs to be:

top/
class_0/
class_1/
...

Where top is the folder we will flow from, and the images are organized into one subfolder for each class.

In [63]: train_gen = idg.flow_from_directory(


'../data/generator',
target_size = (128, 128),
class_mode = 'binary')

Found 1 images belonging to 1 classes.

Let’s generate a few images and display them:

In [64]: plt.figure(figsize=(12, 12))

for i in range(16):
img, label = train_gen.next()
plt.subplot(4, 4, i+1)
plt.imshow(img[0])

A 4x4 grid of augmented versions of the squirrel image produced by the generator.

Great! In all of the images the squirrel is still visible and from a single image we have generated 16 different
images that we can use for training!

Let’s apply this technique to our digits and see if we can improve the score on the test set. We will use
slightly less dramatic transformations and also fill the empty space with zeros along the border.

In [65]: digit_idg = ImageDataGenerator(width_shift_range=0.1,


height_shift_range=0.1,
rotation_range = 10,
shear_range = 0.1,
zoom_range = 0.1,
fill_mode='constant')

We will need to reshape our data into tensors with 4 axes, in order to use it with the ImageDataGenerator,
so let’s do it:

In [66]: X_train_t = X_train.reshape(-1, 8, 8, 1)


X_test_t = X_test.reshape(-1, 8, 8, 1)

We can use the method .flow to flow directly from a dataset. We will need to provide the labels as well.

In [67]: train_gen = digit_idg.flow(X_train_t, y=y_train)

Notice that by default the .flow method generates a batch of 32 images with corresponding labels:

In [68]: imgs, labels = train_gen.next()

In [69]: imgs.shape

Out[69]: (32, 8, 8, 1)

Let’s display a few of them:

In [70]: plt.figure(figsize=(12, 12))


for i in range(16):
plt.subplot(4, 4, i+1)
plt.imshow(imgs[i,:,:,0], cmap='gray')
plt.title(np.argmax(labels[i]))

2 6 5 8
0 0 0 0
2 2 2 2
4 4 4 4
6 6 6 6
0 4 5 0 1 5 0 7 5 0 7 5
0 0 0 0
2 2 2 2
4 4 4 4
6 6 6 6
0 0 5 0 9 5 0 0 5 0 7 5
0 0 0 0
2 2 2 2
4 4 4 4
6 6 6 6
0 1 5 0 2 5 0 8 5 0 7 5
0 0 0 0
2 2 2 2
4 4 4 4
6 6 6 6
0 5 0 5 0 5 0 5

As you can see the digits are deformed, due to the very low resolution of the images. Will this help our
network or confuse it? Let’s find out!

We will need a model that can deal with a tensor input since the images are now tensors of order 4. Luckily,
it is effortless to adapt our base model to have a Flatten layer as input:

In [71]: from tensorflow.keras.layers import Flatten

In [72]: def tensor_model():


model = Sequential()
model.add(Flatten(input_shape=(8, 8, 1)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model

We also need to define a new repeat_train_generator function that allows training a model from a
generator. We can take the original repeat_train function and modify it. We will follow the same
procedure used before, with two differences:

1. We’ll define a generator that yields batches from X_train_t using the image data generator

2. We’ll replace the .fit function:

h = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
verbose=verbose,
batch_size=batch_size,
epochs=epochs)

with the .fit_generator function:

h = model.fit_generator(train_gen,
steps_per_epoch=steps_per_epoch,
epochs=epochs,
validation_data=(X_test_t, y_test),
verbose=verbose)

Notice that, since we are now feeding variations of the data in the train set, the definition of an epoch
becomes blurry. When does an epoch terminate if we flow random variations of the training data? The
model.fit_generator function allows us to define how many steps_per_epoch we want. We will use
the value of 5, with a batch_size of 256 like in most of the examples above.

In [73]: def repeat_train_generator(model_fn, repeats=3,


epochs=40, verbose=0,
steps_per_epoch=5,
batch_size=256):
"""
Repeatedly train a model on (X_train, y_train),
averaging the histories using a generator.

Parameters
----------
model_fn : a function with no parameters

Function that returns a Keras model

repeats : int, (default=3)


Number of times the training is repeated

epochs : int, (default=40)


Number of epochs for each training run

verbose : int, (default=0)


Verbose option for the `model.fit` function

steps_per_epoch : int, (default=5)


Steps_per_epoch for the `model.fit` function

batch_size : int, (default=256)


Batch size for the `model.fit` function

Returns
-------
mean, std : np.array, shape: (epochs, 2)
mean : array contains the accuracy
and validation accuracy history averaged
over the different training runs
std : array contains the standard deviation
over the different training runs of
accuracy and validation accuracy history
"""
# generator that flows batches from X_train_t
train_gen = digit_idg.flow(X_train_t, y=y_train,
batch_size=batch_size)

histories = []

# repeat model definition and training


for repeat in range(repeats):
K.clear_session()
model = model_fn()

# to train with a generator use .fit_generator()


h = model.fit_generator(train_gen,
steps_per_epoch=steps_per_epoch,
epochs=epochs,
validation_data=(X_test_t, y_test),
verbose=verbose)

# append accuracy and val accuracy to list


histories.append([h.history['accuracy'],
h.history['val_accuracy']])

print(repeat, end=" ")

histories = np.array(histories)
print()

# calculate mean and std across repeats:


mean = histories.mean(axis=0)
std = histories.std(axis=0)
return mean, std

Once the function is defined, we can run the training as usual:

In [74]: (m_train_gen, m_test_gen), (s_train_gen, s_test_gen) = \
             repeat_train_generator(tensor_model)

0 1 2

And compare the results with our base model:

In [75]: plot_mean_std(m_train_base, s_train_base)


plot_mean_std(m_test_base, s_test_base)

plot_mean_std(m_train_gen, s_train_gen)
plot_mean_std(m_test_gen, s_test_gen)

plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')

plt.title("Image Generator Model Accuracy")


plt.legend(['Base - Train', 'Base - Test',
'Generator - Train', 'Generator - Test',
'Max Base Test'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.ylim(0.85, 1.05);

Plot: "Image Generator Model Accuracy", showing train and test accuracy vs. epochs for the base model and
the model trained on the image generator, with a dashed line at the maximum base test accuracy.

As you can see, the Data Augmentation process improved the performance of the model on our test set.
By feeding variations of the input data during training, we have made the model more resilient to changes in the
input features.

Tensorflow Data API


Tensorflow 2.0 rationalized the data ingestion process in the tf.data module. This is a very powerful
module, well described in the documentation, and it can be used to build custom input data generators
from all sorts of files and filesystems. Let’s take a quick look at how to do it. Let’s import Tensorflow:

In [76]: import tensorflow as tf

We will use the Dataset class to flow images from the (X_train_t, y_train) tuple directly. We start
with creating a Dataset instance using the from_tensor_slices method:

In [77]: ds = tf.data.Dataset.from_tensor_slices(
(X_train_t, y_train))

Let’s print it out to check what it is:



In [78]: ds

Out[78]: <TensorSliceDataset shapes: ((8, 8, 1), (10,)), types: (tf.float64, tf.float32)>

ds is an instance of TensorSliceDataset. As you can see it knows the shape and type of our data but it
does not contain any data. This is basically a generator pointing to the location of the data. Datasets can be
created from a variety of sources including:

• Text files: tf.data.TextLineDataset
• TFRecords: tf.data.TFRecordDataset
• Lists, Tuples, Numpy Arrays & Pandas Dataframes: tf.data.Dataset.from_tensor_slices
• CSV files: tf.data.experimental.CsvDataset

and they can point to filesystems with Gigabytes or even Terabytes of data. This allows us to scale out to
large datasets very easily.
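For example, a plain text file could be streamed line by line like this (the path below is just a placeholder for
any newline-delimited file):

lines = tf.data.TextLineDataset('../data/some_text_file.txt')   # hypothetical path

for line in lines.take(3):      # lazily read the first three lines
    print(line.numpy())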

We can map functions to the elements of a dataset with the .map method. For example, let’s apply here a
function that rescales the pixel values to the interval [-1, 1]. Notice that since the dataset will return a tuple
of (images, labels), our function needs to be aware of the presence of the label:

In [79]: def rescale_pixels(image, label):


return (2 * image) - 1, label

Next we tell the dataset to apply the rescale function to every image when it’s loaded, as well as to shuffle the
images, repeat the dataset indefinitely and return batches of size 32:

In [80]: batch_size = 32

In [81]: ds = ds.map(rescale_pixels)
ds = ds.shuffle(buffer_size=2000)
ds = ds.repeat()
ds = ds.batch(batch_size)

In [82]: ds

Out[82]: <BatchDataset shapes: ((None, 8, 8, 1), (None, 10)), types: (tf.float64, tf.float32)>

ds is now an instance of BatchDataset. Let’s get a batch from it:

In [83]: for images, labels in ds.take(1):


print("Images batch shape:", images.shape)
print("Labels batch shape:", labels.shape)
print("Images minimum value:", images.numpy().min())
print("Images maximum value:", images.numpy().max())

Images batch shape: (32, 8, 8, 1)


Labels batch shape: (32, 10)
Images minimum value: -1.0
Images maximum value: 1.0

As you can see, the ds.take method returns an iterator that generates batches with the correct shape and
the rescaled values. We can fit a model on this dataset by simply calling model.fit:

In [84]: model = tensor_model()


model.fit(ds,
steps_per_epoch=len(X_train_t)//batch_size,
epochs=2
)

Epoch 1/2
39/39 [==============================] - 1s 22ms/step - loss: 0.5156 -
accuracy: 0.8349
Epoch 2/2
39/39 [==============================] - 0s 6ms/step - loss: 0.1200 -
accuracy: 0.9567

Out[84]: <tensorflow.python.keras.callbacks.History at 0x7fdf602e60b8>

This method can be applied to very large datasets. Take a look at this section of the Tensorflow
Documentation for tips on how to improve the performance of a data ingestion pipeline.
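As a sketch of what those tips look like in practice (the exact speed-up depends on your data and hardware),
we could cache the small digits dataset in memory, parallelize the map step, and prefetch batches so that
preprocessing overlaps with training:

AUTOTUNE = tf.data.experimental.AUTOTUNE

fast_ds = (tf.data.Dataset.from_tensor_slices((X_train_t, y_train))
           .cache()                                           # keep the small dataset in memory
           .shuffle(buffer_size=2000)
           .map(rescale_pixels, num_parallel_calls=AUTOTUNE)  # parallel preprocessing
           .repeat()
           .batch(batch_size)
           .prefetch(AUTOTUNE))                               # prepare batches ahead of time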

Hyperparameter optimization
One final note on hyper-parameter optimization. Neural Network models have a lot of hyper-parameters.
These are things like:

• model architecture
  - number of layers
  - type of layers
  - number of nodes
  - activation functions
  - ...
• optimizer parameters
  - optimizer type
  - learning rate
  - momentum
  - ...
• training parameters
  - batch size
  - learning rate scheduling
  - number of epochs
  - ...

These parameters are called Hyper-parameters because they define the training experiment and the model
is not allowed to change them while training. That said, they turn out to be important in determining the
success of a model in solving a particular problem.

The topic of hyper-parameter tuning is vast, and we don’t have space to cover it in detail. However, the task
is now simplified thanks to the introduction of a Hyperparameter tuning tool in Tensorboard. Let’s see a
quick example of how it works.

Hyper-parameter tuning in Tensorboard

This section follows parts of the Tensorboard documentation.



Let’s start by modifying the model generating function, allowing it to accept a dictionary of
hyper-parameters:

In [85]: def tensor_model(hparams):


model = Sequential()
model.add(Flatten(input_shape=(8, 8, 1)))

for i in range(hparams['n_layers']):
model.add(Dense(hparams['n_units'],
activation=hparams['activation']))
model.add(Dropout(hparams['dropout']))

model.add(Dense(10, activation='softmax'))

model.compile(optimizer=hparams['optimizer'],
loss='categorical_crossentropy',
metrics=['accuracy'])
return model

Next we create a train_test_hp function that will create the model, train it, evaluate it, and return the
accuracy. This function, too, takes hyper-parameters, like the batch_size, the number of epochs, and so on.

In [86]: def train_test_hp(hparams):


model = tensor_model(hparams)

model.fit(X_train_t, y_train,
epochs=hparams['epochs'],
batch_size=hparams['batch_size'],
verbose=0)

_, accuracy = model.evaluate(X_test_t, y_test,


verbose=0)
return accuracy

Let’s test that our function works with a set of hyper-parameters:

In [87]: hp_example = {
'n_layers': 3,
'n_units': 1024,
'activation': 'relu',
'optimizer': 'adam',
'epochs': 1,
'dropout': 0.0,
'batch_size': 32
}

In [88]: train_test_hp(hp_example)

Out[88]: 0.93333334

Great! Now that we know our function works, we will define a helper function to log the training runs in
Tensorboard. The code for this function is a little complicated, but you can use it as is without having to
worry too much about it. We also need to import a couple of additional functions:

In [89]: from tensorboard.plugins.hparams import api_pb2


from tensorboard.plugins.hparams import summary as hparams_summary

In [90]: def run_experiment(run_dir, hparams):


writer = tf.summary.create_file_writer(run_dir)
summary_start = hparams_summary.session_start_pb(
hparams=hparams)

with writer.as_default():
accuracy = train_test_hp(hparams)
summary_end = hparams_summary.session_end_pb(
api_pb2.STATUS_SUCCESS)

tf.summary.scalar('accuracy',
accuracy,
step=1,
description="The accuracy")
tf.summary.import_event(
tf.compat.v1.Event(summary=summary_start)
.SerializeToString())
tf.summary.import_event(
tf.compat.v1.Event(summary=summary_end)
.SerializeToString())
return accuracy

Let’s run a few experiments and visualize the results.

Grid Search and Random Search

There are various strategies to search for the optimal hyper-parameter combination. The two most common
strategies are:

• Grid Search: brute force search through all possible combinations of hyper-parameters

• Random Search: try random combinations of hyperparameters

In practice, Random Search is much more effective when dealing with large spaces and many
hyper-parameters.

Scikit-Learn offers two very convenient classes for hyper-parameter search: ParameterGrid and
ParameterSampler. They implement grid search and random search respectively. Let’s import them:

In [91]: from sklearn.model_selection import ParameterGrid, ParameterSampler

Let’s also import a couple of random distributions from Scipy:

In [92]: from scipy.stats.distributions import uniform, randint

and let’s define a larger set of parameters to try:

In [93]: hp_ranges = {
'n_layers': randint(1, 4),
'n_units': [64, 256, 1024],
'activation': ['relu', 'tanh'],
'optimizer': ['adam', 'rmsprop'],
'epochs': [5],
'dropout': uniform(loc=0.0, scale=0.6),
'batch_size': [16, 32, 64]
}
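For comparison, ParameterGrid enumerates every combination exhaustively, so it only accepts finite lists of
values (no continuous distributions like uniform above). A tiny illustrative grid, separate from hp_ranges:

small_grid = ParameterGrid({
    'n_units': [64, 256],
    'activation': ['relu', 'tanh'],
})

for params in small_grid:
    print(params)            # 2 x 2 = 4 combinations in total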

Let’s also define a small helper function to print the hyper-parameters:

In [94]: def print_hparams(d):


for k, v in d.items():
if type(v) == np.float64:
print(' {:<20}: {:0.3}'.format(k, v))
else:
print(' {:<20}: {}'.format(k, v))

Now let’s generate a couple of example experiments with the ParameterSampler:

In [95]: hp_sets = ParameterSampler(hp_ranges, n_iter=2, random_state=1)



for i, hp_set in enumerate(hp_sets):


print()
print("Hyperparameter Set {}:".format(i))
print_hparams(hp_set)

Hyperparameter Set 0:
activation : tanh
batch_size : 16
dropout : 0.56
epochs : 5
n_layers : 2
n_units : 256
optimizer : rmsprop

Hyperparameter Set 1:
activation : relu
batch_size : 16
dropout : 0.238
epochs : 5
n_layers : 2
n_units : 64
optimizer : adam

As you can see, the ParameterSampler samples from the possible combinations of parameters. Let’s run a
few experiments and check the results:

In [96]: import os

In [97]: experiment_num = 0
log_dir = '/tmp/ztdl/tensorboard/'

for hparams in ParameterSampler(hp_ranges, n_iter=10, random_state=0):


print('Experiment', experiment_num + 1)
print_hparams(hparams)

run_name = "run-{:d}".format(experiment_num)
accuracy = run_experiment(os.path.join(log_dir, run_name), hparams)

print("Accuracy: {:0.4}".format(accuracy))
print()

experiment_num += 1

Experiment 1
activation : relu
batch_size : 32
dropout : 0.507
epochs : 5
n_layers : 2
n_units : 256
optimizer : adam
Accuracy: 0.95

Experiment 2
activation : relu
batch_size : 64
dropout : 0.034
epochs : 5
n_layers : 1
n_units : 1024
optimizer : rmsprop
Accuracy: 0.937

Experiment 3
activation : relu
batch_size : 64
dropout : 0.341
epochs : 5
n_layers : 2
n_units : 256
optimizer : rmsprop
Accuracy: 0.9426

Experiment 4
activation : relu
batch_size : 32
dropout : 0.389
epochs : 5
n_layers : 1
n_units : 256
optimizer : adam
Accuracy: 0.9407

Experiment 5
activation : tanh
batch_size : 16
dropout : 0.587
epochs : 5
n_layers : 1
n_units : 256
optimizer : rmsprop
Accuracy: 0.9519

Experiment 6
activation : tanh
batch_size : 64
dropout : 0.432
epochs : 5
n_layers : 2
n_units : 256
optimizer : rmsprop
Accuracy: 0.9611

Experiment 7
activation : tanh
batch_size : 16
dropout : 0.313
epochs : 5
n_layers : 1
n_units : 1024
optimizer : rmsprop
Accuracy: 0.963

Experiment 8
activation : relu
batch_size : 16
dropout : 0.341
epochs : 5
n_layers : 1
n_units : 64
optimizer : rmsprop
Accuracy: 0.913

Experiment 9
activation : tanh
batch_size : 64
dropout : 0.133
epochs : 5
n_layers : 2
n_units : 64
optimizer : rmsprop
Accuracy: 0.9241

Experiment 10
activation : relu
batch_size : 64
dropout : 0.368
epochs : 5
n_layers : 2
n_units : 256
optimizer : rmsprop
Accuracy: 0.9519

We can now visualize our runs by starting tensorboard using the Tensorboard Notebook Extension. Just
uncomment the next two cells and you should see a window like this appear:

You can also run tensorboard in a separate terminal with the command:

tensorboard --logdir /tmp/ztdl/tensorboard/

and then open another browser window at the address http://localhost:6006.

In [98]: # %load_ext tensorboard.notebook

In [99]: # %tensorboard --logdir {log_dir}



Screenshot of the Tensorboard HParams dashboard (tensorboard_hparams.png)

Weights and Biases

Weights and Biases allows you to store parameters and associated performance of thousands of runs. You
can then quickly search for patterns and regions of interest in the Hyperparameter space.

Hyperopt and Hyperas

Hyperopt is a Python library that can perform generalized hyper-parameter tuning using a technique called
Bayesian Optimization.

Hyperas is a library that connects Hyperopt and Keras, making it easy to run parallel trainings of a keras
model with variations in the values of the hyper-parameters.

Cloud based tools

SigOpt is a cloud based implementation of Bayesian hyperparameter search.

AWS SageMaker and Google Cloud ML offer options for spawning parallel training experiments with
different hyper-parameter combinations.

Determined.ai and Pipeline.ai also offer this feature as part of their cloud training platform.

Exercises

Exercise 1

This is a long and complex exercise that should give you an idea of a real-world scenario. Feel free to look at
the solution if you feel lost. Also, feel free to run this on a GPU.

First of all download and unpack the male/female pictures from here into a subfolder of the ../data folder.
These images and labels were obtained from Crowdflower.

Your goal is to build an image classifier that will recognize the gender of a person from pictures.

• Have a look at the directory structure and inspect a couple of pictures


• Design a model that will take a color image of size 64x64 as input and return a binary output
(female=0/male=1)
• Feel free to introduce any regularization technique in your model (Dropout, Batch Normalization,
Weight Regularization)
• Compile your model with an optimizer of your choice
• Using ImageDataGenerator, define a train generator that will augment your images with some
geometric transformations. Feel free to choose the parameters that make sense to you.
• Define also a test generator, whose only purpose is to rescale the pixels by 1./255
• Use the function flow_from_directory to generate batches from the train and test folders. Make
sure you set the target_size to 64x64.
• Use the model.fit_generator function to fit the model on the batches generated from the
ImageDataGenerator. Since you are streaming and augmenting the data in real-time, you will have
to decide how many batches make an epoch and how many epochs you want to run
• Train your model (you should get to at least 85% accuracy)
• Once you are satisfied with your training, check a few of the misclassified pictures.
• Read about human bias in Machine Learning datasets

In [ ]:
11
Pretrained Models for Images

Let’s recap all the great work we’ve done so far.

We started our journey from the basics of Data Manipulation and Machine Learning, and then we
introduced Deep Learning and Neural Networks. We learned about Deep Learning Internals and the math
that makes Neural Networks function. Then we explored more complex architectures like Convolutional
Neural Networks for Images and Recurrent Neural Networks for Time Series and for Text Data. Finally, we
learned how to train our models on GPUs to speed up training and how to improve a model if it’s overfitting.
With Chapter 10 we conclude the part of the book that deals with understanding how Neural Networks
work and how to train them, and we shift gears to more recent applications.

Many of the techniques we learned are only a few years old, yet the field of Deep Learning is evolving very
fast, and in the last few years many new techniques have been invented and discovered.

This chapter walks through how to piggyback on the shoulders of giants. In fact, we will learn how to use
pre-trained networks, i.e., networks that have already been trained on a similar task, and adapt them to the
one we would like to perform.

These are often vast networks, with tens of millions of parameters, trained on massive datasets. It would cost
us a lot of computing power to re-train them from scratch. Luckily, we don’t need to do that. Let’s see how.

As usual, we start by importing the common files:

In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Recognizing sports from images


Let’s say we’d like to classify a set of images related to sports. We know that a Convolutional Neural Network
can solve the image classification task, but we do not have tens of thousands of images to train it. Can we
still achieve the goal? The answer is yes! Let’s see how.

First let’s load a dataset containing links to images of sports:

In [3]: df = pd.read_csv('../data/sports.csv')

and let’s inspect it with the .head() command:

In [4]: df.head()

Out[4]:

image_url class label:confidence


0 https://multimedia-commons... Formula racing 1.0000
1 https://multimedia-commons... Cross-country skiing 0.9594
2 https://multimedia-commons... Formula racing 1.0000
3 https://multimedia-commons... Formula racing 1.0000
4 https://multimedia-commons... Formula racing 0.6646

As you can see, the dataset contains 3 columns:

• the image url
• the class
• the label confidence

Let’s first have a look at how many classes there are using the .value_counts() method on the
df['class'] column. This method groups the entries in the column by value and counts how
many occurrences there are in each group:

In [5]: df['class'].value_counts()

Out[5]:

class
Cross-country skiing 1003
Beach volleyball 1002
Formula racing 1001

There are three classes, with approximately a thousand images each. These are not enough examples to train a
convolutional network from scratch. We’ll need to use transfer learning to solve the problem.

Before we dive into it, let’s prepare train and test datasets and let’s download all the images to disk. We first
import the train_test_split function from Scikit-Learn:

In [6]: from sklearn.model_selection import train_test_split

Then we split the dataframe df into 70% train and 30% test. Notice that we stratify the split according to the
class distribution, i.e. we make sure that the train set (and the test set) is composed of 1/3 skiing, 1/3 volleyball,
and 1/3 formula racing.

In [7]: train_df, test_df = train_test_split(


df,
test_size=0.3,
random_state=2,
stratify=df['class']
)

Now that we have set up our datasets, we need to download the images.

TIP: for your convenience we also packaged the download code in a convenient script that
you can find at data/sports/download_sport.py.

Let’s define a helper function that checks if an image has already been downloaded and downloads it if it
doesn’t exist. We import the os module to be able to create folders and files:

In [8]: import os

We also load the urlretrieve function (from the urllib.request library), which allows us to download an
image from a url:

In [9]: from urllib.request import urlretrieve

Then let’s define a maybe_download_image function:



In [10]: def maybe_download_image(save_dir, image_url, label):


"""
Download image to save_dir/label/. The function
will first check if the image already exists.
Returns 0 if found, 1 if downloaded

Args:
save_dir: Path where images are saved
image_url: An image url
label: The image class label
"""

# create the output path if it doesn't exist


os.makedirs(save_dir, exist_ok=True)

# create label subfolder if it doesn't exist


label_dir = os.path.join(save_dir, label)
os.makedirs(label_dir, exist_ok=True)

# split the file name from the url


url, fname = os.path.split(image_url)

# return 0 if file already there, 1 if downloaded


save_path = os.path.join(save_dir, label, fname)
if os.path.isfile(save_path):
return 0
else:
urlretrieve(image_url, save_path)
return 1

Let’s test our function on the first item in the dataframe. Let’s retrieve the first url:

In [11]: image_url = df['image_url'][0]


image_url

Out[11]: 'https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/7ad/a7b/7
ada7b21d671242e368c2390ac5ae7d7.jpg'

and the first label:

In [12]: label = df['class'][0]


label

Out[12]: 'Formula racing'



Now let’s download the image using our helper function:

In [13]: maybe_download_image('/tmp/ztdlbook/', image_url, label)

Out[13]: 0

The function returns 1 when the image is not yet on disk; here it returns 0 because the image had already
been downloaded. If we run it again, the function will again return 0:

In [14]: maybe_download_image('/tmp/ztdlbook/', image_url, label)

Out[14]: 0

Now we need to download all the images in the dataframe. We could simply loop over the rows, but this can
be tediously slow. Instead we’ll resort to asynchronous downloading and we’ll start many threads to
download the images concurrently. To do this we need to import the ThreadPoolExecutor from the
concurrent.futures library, as well as the as_completed function:

In [15]: from concurrent.futures import ThreadPoolExecutor


from concurrent.futures import as_completed

The as_completed function is an iterator over the given futures that yields each as it completes. With these
two components, let’s build another helper function that distributes all the urls to a ThreadPoolExecutor
and runs them in parallel.

In [16]: def get_images(save_dir, image_urls,


image_labels, max_workers=20):
"""
Download list of labeled images using threads.

Args:
save_dir: Path where images are saved
image_urls: A list of image urls
image_labels: A list of image labels
max_workers: Concurrent threads (default=20)
"""
with ThreadPoolExecutor(max_workers=max_workers) as executor:
# we build a dictionary with executors
# as keys and the urls as values
future_to_url = {}
for url, label in zip(image_urls, image_labels):
k = executor.submit(maybe_download_image,
save_dir, url, label)
future_to_url[k] = url

# we loop over executors as they complete


# and print their return values
for future in as_completed(future_to_url):
url = future_to_url[future]
try:
result = future.result()
print(result, end='')
except Exception as ex:
print('%r exception: %s' % (url, ex))

Let’s run the get_images function on the train dataset. We’ll save them in a sports/train folder inside
../data:

In [17]: train_path = '../data/sports/train/'

In [18]: get_images(train_path,
train_df['image_url'].values,
train_df['class'].values)

0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000

Similarly we’ll save the test images in a sports/test folder inside ../data:

In [19]: test_path = '../data/sports/test/'

In [20]: get_images(test_path,
test_df['image_url'].values,
test_df['class'].values)

0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000

Now that we have downloaded the data, we are ready to tackle transfer learning.

Keras applications
Keras offers many pre-trained models in the keras.applications module. All of them are models
trained for image classification on the Imagenet dataset and they have different architectures. Here we
summarize their main properties:

Keras pre-trained models comparison table

As you can see, some of them have a large memory footprint (up to over 500Mb), while some others trade a
bit of accuracy for a smaller footprint that makes them perfect for running on a mobile phone.

Pre-trained models are fantastic for two reasons:

1. we can use them without training to classify images of everyday objects

2. we can partially retrain them and adapt them to classify new objects, using only a few input images
and a laptop (no need for GPU)!

Let’s go ahead and explore both these applications. First of all we’re going to load the image module from
keras.preprocessing, which will allow us to load images from disk:

In [21]: from tensorflow.keras.preprocessing import image

Now let’s load an image from the ones we have downloaded previously. Let’s define the input_path:

In [22]: dir_ = '../data/sports/train/Beach volleyball/'


fname_ = '1d8a1f53f36487ac4f10e47e6937308.jpg'
input_path = dir_ + fname_

and then load the image:

In [23]: img = image.load_img(input_path, target_size=(299, 299))

Jupyter notebook can display images inline, so let’s have a look at it:

In [24]: img

Out[24]:

What type of Python object is img? We can check it with the type function and see that it’s a PIL Image
object (PIL is the Python Imaging Library).

In [25]: type(img)

Out[25]: PIL.Image.Image

Now let’s convert it to a numpy array so that we can feed it to the model. We will use the img_to_array
function from the image module we’ve just loaded:

In [26]: img_array = image.img_to_array(img)

Now the image is an order-3 tensor with 299 pixels in Height and Width and 3 color channels for RGB:

In [27]: img_array.shape

Out[27]: (299, 299, 3)



Keras convolutional models require an input with 4 axes, i.e. an order-4 tensor, where the first axis locates
the image in the dataset (in this case we only have one image, but we can still think of it as the first element in
an order-4 array). We can add this “dummy” dimension with the np.expand_dims function:

In [28]: img_tensor = np.expand_dims(img_array, axis=0)

Let’s double check that the shape of this new tensor is the one we want:

In [29]: img_tensor.shape

Out[29]: (1, 299, 299, 3)

Predict class with pre-trained Xception


Amongst all the pre-trained models provided, we really like the Xception network. Not only does it provide
great accuracy with a small footprint and load time, but it was also invented by the founder of Keras,
François Chollet.

In [30]: from tensorflow.keras.applications.xception import Xception

We can create a pre-trained model simply by creating an instance of Xception with the
weights='imagenet' parameter. This command will download the pre-trained weights and create a
model with the Xception architecture.

TIP: note that it could take a few minutes to download the weights.

In [31]: model = Xception(weights='imagenet')

Let’s have a look at the model architecture by printing the summary:

In [32]: model.summary()

Model: "xception"
____________________________________________________________________________
______________________
Layer (type) Output Shape Param # Connected


to
============================================================================
======================
input_1 (InputLayer) [(None, 299, 299, 3) 0
____________________________________________________________________________
______________________
block1_conv1 (Conv2D) (None, 149, 149, 32) 864
input_1[0][0]
____________________________________________________________________________
______________________
block1_conv1_bn (BatchNormaliza (None, 149, 149, 32) 128
block1_conv1[0][0]
____________________________________________________________________________
______________________
block1_conv1_act (Activation) (None, 149, 149, 32) 0
block1_conv1_bn[0][0]
____________________________________________________________________________
______________________
block1_conv2 (Conv2D) (None, 147, 147, 64) 18432
block1_conv1_act[0][0]
____________________________________________________________________________
______________________
block1_conv2_bn (BatchNormaliza (None, 147, 147, 64) 256
block1_conv2[0][0]
____________________________________________________________________________
______________________
block1_conv2_act (Activation) (None, 147, 147, 64) 0
block1_conv2_bn[0][0]
____________________________________________________________________________
______________________
block2_sepconv1 (SeparableConv2 (None, 147, 147, 128 8768
block1_conv2_act[0][0]
____________________________________________________________________________
______________________
block2_sepconv1_bn (BatchNormal (None, 147, 147, 128 512
block2_sepconv1[0][0]
____________________________________________________________________________
______________________
block2_sepconv2_act (Activation (None, 147, 147, 128 0
block2_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block2_sepconv2 (SeparableConv2 (None, 147, 147, 128 17536
block2_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block2_sepconv2_bn (BatchNormal (None, 147, 147, 128 512
block2_sepconv2[0][0]
____________________________________________________________________________
______________________
conv2d (Conv2D) (None, 74, 74, 128) 8192
block1_conv2_act[0][0]
____________________________________________________________________________
______________________
block2_pool (MaxPooling2D) (None, 74, 74, 128) 0
block2_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
batch_normalization_v1 (BatchNo (None, 74, 74, 128) 512


conv2d[0][0]
____________________________________________________________________________
______________________
add (Add) (None, 74, 74, 128) 0
block2_pool[0][0]
batch_normalization_v1[0][0]
____________________________________________________________________________
______________________
block3_sepconv1_act (Activation (None, 74, 74, 128) 0 add[0][0]
____________________________________________________________________________
______________________
block3_sepconv1 (SeparableConv2 (None, 74, 74, 256) 33920
block3_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block3_sepconv1_bn (BatchNormal (None, 74, 74, 256) 1024
block3_sepconv1[0][0]
____________________________________________________________________________
______________________
block3_sepconv2_act (Activation (None, 74, 74, 256) 0
block3_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block3_sepconv2 (SeparableConv2 (None, 74, 74, 256) 67840
block3_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block3_sepconv2_bn (BatchNormal (None, 74, 74, 256) 1024
block3_sepconv2[0][0]
____________________________________________________________________________
______________________
conv2d_1 (Conv2D) (None, 37, 37, 256) 32768 add[0][0]
____________________________________________________________________________
______________________
block3_pool (MaxPooling2D) (None, 37, 37, 256) 0
block3_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
batch_normalization_v1_1 (Batch (None, 37, 37, 256) 1024
conv2d_1[0][0]
____________________________________________________________________________
______________________
add_1 (Add) (None, 37, 37, 256) 0
block3_pool[0][0]
batch_normalization_v1_1[0][0]
____________________________________________________________________________
______________________
block4_sepconv1_act (Activation (None, 37, 37, 256) 0 add_1[0][0]
____________________________________________________________________________
______________________
block4_sepconv1 (SeparableConv2 (None, 37, 37, 728) 188672
block4_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block4_sepconv1_bn (BatchNormal (None, 37, 37, 728) 2912
block4_sepconv1[0][0]
____________________________________________________________________________
______________________
block4_sepconv2_act (Activation (None, 37, 37, 728) 0


block4_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block4_sepconv2 (SeparableConv2 (None, 37, 37, 728) 536536
block4_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block4_sepconv2_bn (BatchNormal (None, 37, 37, 728) 2912
block4_sepconv2[0][0]
____________________________________________________________________________
______________________
conv2d_2 (Conv2D) (None, 19, 19, 728) 186368 add_1[0][0]
____________________________________________________________________________
______________________
block4_pool (MaxPooling2D) (None, 19, 19, 728) 0
block4_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
batch_normalization_v1_2 (Batch (None, 19, 19, 728) 2912
conv2d_2[0][0]
____________________________________________________________________________
______________________
add_2 (Add) (None, 19, 19, 728) 0
block4_pool[0][0]
batch_normalization_v1_2[0][0]
____________________________________________________________________________
______________________
block5_sepconv1_act (Activation (None, 19, 19, 728) 0 add_2[0][0]
____________________________________________________________________________
______________________
block5_sepconv1 (SeparableConv2 (None, 19, 19, 728) 536536
block5_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block5_sepconv1_bn (BatchNormal (None, 19, 19, 728) 2912
block5_sepconv1[0][0]
____________________________________________________________________________
______________________
block5_sepconv2_act (Activation (None, 19, 19, 728) 0
block5_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block5_sepconv2 (SeparableConv2 (None, 19, 19, 728) 536536
block5_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block5_sepconv2_bn (BatchNormal (None, 19, 19, 728) 2912
block5_sepconv2[0][0]
____________________________________________________________________________
______________________
block5_sepconv3_act (Activation (None, 19, 19, 728) 0
block5_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block5_sepconv3 (SeparableConv2 (None, 19, 19, 728) 536536
block5_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block5_sepconv3_bn (BatchNormal (None, 19, 19, 728) 2912


block5_sepconv3[0][0]
____________________________________________________________________________
______________________
add_3 (Add) (None, 19, 19, 728) 0
block5_sepconv3_bn[0][0]
add_2[0][0]
____________________________________________________________________________
______________________
block6_sepconv1_act (Activation (None, 19, 19, 728) 0 add_3[0][0]
____________________________________________________________________________
______________________
block6_sepconv1 (SeparableConv2 (None, 19, 19, 728) 536536
block6_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block6_sepconv1_bn (BatchNormal (None, 19, 19, 728) 2912
block6_sepconv1[0][0]
____________________________________________________________________________
______________________
block6_sepconv2_act (Activation (None, 19, 19, 728) 0
block6_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block6_sepconv2 (SeparableConv2 (None, 19, 19, 728) 536536
block6_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block6_sepconv2_bn (BatchNormal (None, 19, 19, 728) 2912
block6_sepconv2[0][0]
____________________________________________________________________________
______________________
block6_sepconv3_act (Activation (None, 19, 19, 728) 0
block6_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block6_sepconv3 (SeparableConv2 (None, 19, 19, 728) 536536
block6_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block6_sepconv3_bn (BatchNormal (None, 19, 19, 728) 2912
block6_sepconv3[0][0]
____________________________________________________________________________
______________________
add_4 (Add) (None, 19, 19, 728) 0
block6_sepconv3_bn[0][0]
add_3[0][0]
____________________________________________________________________________
______________________
block7_sepconv1_act (Activation (None, 19, 19, 728) 0 add_4[0][0]
____________________________________________________________________________
______________________
block7_sepconv1 (SeparableConv2 (None, 19, 19, 728) 536536
block7_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block7_sepconv1_bn (BatchNormal (None, 19, 19, 728) 2912
block7_sepconv1[0][0]
____________________________________________________________________________
______________________
block7_sepconv2_act (Activation (None, 19, 19, 728) 0
block7_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block7_sepconv2 (SeparableConv2 (None, 19, 19, 728) 536536
block7_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block7_sepconv2_bn (BatchNormal (None, 19, 19, 728) 2912
block7_sepconv2[0][0]
____________________________________________________________________________
______________________
block7_sepconv3_act (Activation (None, 19, 19, 728) 0
block7_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block7_sepconv3 (SeparableConv2 (None, 19, 19, 728) 536536
block7_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block7_sepconv3_bn (BatchNormal (None, 19, 19, 728) 2912
block7_sepconv3[0][0]
____________________________________________________________________________
______________________
add_5 (Add) (None, 19, 19, 728) 0
block7_sepconv3_bn[0][0]
add_4[0][0]
____________________________________________________________________________
______________________
block8_sepconv1_act (Activation (None, 19, 19, 728) 0 add_5[0][0]
____________________________________________________________________________
______________________
block8_sepconv1 (SeparableConv2 (None, 19, 19, 728) 536536
block8_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block8_sepconv1_bn (BatchNormal (None, 19, 19, 728) 2912
block8_sepconv1[0][0]
____________________________________________________________________________
______________________
block8_sepconv2_act (Activation (None, 19, 19, 728) 0
block8_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block8_sepconv2 (SeparableConv2 (None, 19, 19, 728) 536536
block8_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block8_sepconv2_bn (BatchNormal (None, 19, 19, 728) 2912
block8_sepconv2[0][0]
____________________________________________________________________________
______________________
block8_sepconv3_act (Activation (None, 19, 19, 728) 0
block8_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block8_sepconv3 (SeparableConv2 (None, 19, 19, 728) 536536
block8_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block8_sepconv3_bn (BatchNormal (None, 19, 19, 728) 2912
block8_sepconv3[0][0]
____________________________________________________________________________
______________________
add_6 (Add) (None, 19, 19, 728) 0
block8_sepconv3_bn[0][0]
add_5[0][0]
____________________________________________________________________________
______________________
block9_sepconv1_act (Activation (None, 19, 19, 728) 0 add_6[0][0]
____________________________________________________________________________
______________________
block9_sepconv1 (SeparableConv2 (None, 19, 19, 728) 536536
block9_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block9_sepconv1_bn (BatchNormal (None, 19, 19, 728) 2912
block9_sepconv1[0][0]
____________________________________________________________________________
______________________
block9_sepconv2_act (Activation (None, 19, 19, 728) 0
block9_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block9_sepconv2 (SeparableConv2 (None, 19, 19, 728) 536536
block9_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block9_sepconv2_bn (BatchNormal (None, 19, 19, 728) 2912
block9_sepconv2[0][0]
____________________________________________________________________________
______________________
block9_sepconv3_act (Activation (None, 19, 19, 728) 0
block9_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block9_sepconv3 (SeparableConv2 (None, 19, 19, 728) 536536
block9_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block9_sepconv3_bn (BatchNormal (None, 19, 19, 728) 2912
block9_sepconv3[0][0]
____________________________________________________________________________
______________________
add_7 (Add) (None, 19, 19, 728) 0
block9_sepconv3_bn[0][0]
add_6[0][0]
____________________________________________________________________________
______________________
block10_sepconv1_act (Activatio (None, 19, 19, 728) 0 add_7[0][0]
____________________________________________________________________________
______________________
block10_sepconv1 (SeparableConv (None, 19, 19, 728) 536536
block10_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block10_sepconv1_bn (BatchNorma (None, 19, 19, 728) 2912
11.3. PREDICT CLASS WITH PRE-TRAINED XCEPTION 495

block10_sepconv1[0][0]
____________________________________________________________________________
______________________
block10_sepconv2_act (Activatio (None, 19, 19, 728) 0
block10_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block10_sepconv2 (SeparableConv (None, 19, 19, 728) 536536
block10_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block10_sepconv2_bn (BatchNorma (None, 19, 19, 728) 2912
block10_sepconv2[0][0]
____________________________________________________________________________
______________________
block10_sepconv3_act (Activatio (None, 19, 19, 728) 0
block10_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block10_sepconv3 (SeparableConv (None, 19, 19, 728) 536536
block10_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block10_sepconv3_bn (BatchNorma (None, 19, 19, 728) 2912
block10_sepconv3[0][0]
____________________________________________________________________________
______________________
add_8 (Add) (None, 19, 19, 728) 0
block10_sepconv3_bn[0][0]
add_7[0][0]
____________________________________________________________________________
______________________
block11_sepconv1_act (Activatio (None, 19, 19, 728) 0 add_8[0][0]
____________________________________________________________________________
______________________
block11_sepconv1 (SeparableConv (None, 19, 19, 728) 536536
block11_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block11_sepconv1_bn (BatchNorma (None, 19, 19, 728) 2912
block11_sepconv1[0][0]
____________________________________________________________________________
______________________
block11_sepconv2_act (Activatio (None, 19, 19, 728) 0
block11_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block11_sepconv2 (SeparableConv (None, 19, 19, 728) 536536
block11_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block11_sepconv2_bn (BatchNorma (None, 19, 19, 728) 2912
block11_sepconv2[0][0]
____________________________________________________________________________
______________________
block11_sepconv3_act (Activatio (None, 19, 19, 728) 0
block11_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
496 CHAPTER 11. PRETRAINED MODELS FOR IMAGES

block11_sepconv3 (SeparableConv (None, 19, 19, 728) 536536


block11_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block11_sepconv3_bn (BatchNorma (None, 19, 19, 728) 2912
block11_sepconv3[0][0]
____________________________________________________________________________
______________________
add_9 (Add) (None, 19, 19, 728) 0
block11_sepconv3_bn[0][0]
add_8[0][0]
____________________________________________________________________________
______________________
block12_sepconv1_act (Activatio (None, 19, 19, 728) 0 add_9[0][0]
____________________________________________________________________________
______________________
block12_sepconv1 (SeparableConv (None, 19, 19, 728) 536536
block12_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block12_sepconv1_bn (BatchNorma (None, 19, 19, 728) 2912
block12_sepconv1[0][0]
____________________________________________________________________________
______________________
block12_sepconv2_act (Activatio (None, 19, 19, 728) 0
block12_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block12_sepconv2 (SeparableConv (None, 19, 19, 728) 536536
block12_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block12_sepconv2_bn (BatchNorma (None, 19, 19, 728) 2912
block12_sepconv2[0][0]
____________________________________________________________________________
______________________
block12_sepconv3_act (Activatio (None, 19, 19, 728) 0
block12_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
block12_sepconv3 (SeparableConv (None, 19, 19, 728) 536536
block12_sepconv3_act[0][0]
____________________________________________________________________________
______________________
block12_sepconv3_bn (BatchNorma (None, 19, 19, 728) 2912
block12_sepconv3[0][0]
____________________________________________________________________________
______________________
add_10 (Add) (None, 19, 19, 728) 0
block12_sepconv3_bn[0][0]
add_9[0][0]
____________________________________________________________________________
______________________
block13_sepconv1_act (Activatio (None, 19, 19, 728) 0
add_10[0][0]
____________________________________________________________________________
______________________
block13_sepconv1 (SeparableConv (None, 19, 19, 728) 536536
block13_sepconv1_act[0][0]
11.3. PREDICT CLASS WITH PRE-TRAINED XCEPTION 497

____________________________________________________________________________
______________________
block13_sepconv1_bn (BatchNorma (None, 19, 19, 728) 2912
block13_sepconv1[0][0]
____________________________________________________________________________
______________________
block13_sepconv2_act (Activatio (None, 19, 19, 728) 0
block13_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block13_sepconv2 (SeparableConv (None, 19, 19, 1024) 752024
block13_sepconv2_act[0][0]
____________________________________________________________________________
______________________
block13_sepconv2_bn (BatchNorma (None, 19, 19, 1024) 4096
block13_sepconv2[0][0]
____________________________________________________________________________
______________________
conv2d_3 (Conv2D) (None, 10, 10, 1024) 745472
add_10[0][0]
____________________________________________________________________________
______________________
block13_pool (MaxPooling2D) (None, 10, 10, 1024) 0
block13_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
batch_normalization_v1_3 (Batch (None, 10, 10, 1024) 4096
conv2d_3[0][0]
____________________________________________________________________________
______________________
add_11 (Add) (None, 10, 10, 1024) 0
block13_pool[0][0]
batch_normalization_v1_3[0][0]
____________________________________________________________________________
______________________
block14_sepconv1 (SeparableConv (None, 10, 10, 1536) 1582080
add_11[0][0]
____________________________________________________________________________
______________________
block14_sepconv1_bn (BatchNorma (None, 10, 10, 1536) 6144
block14_sepconv1[0][0]
____________________________________________________________________________
______________________
block14_sepconv1_act (Activatio (None, 10, 10, 1536) 0
block14_sepconv1_bn[0][0]
____________________________________________________________________________
______________________
block14_sepconv2 (SeparableConv (None, 10, 10, 2048) 3159552
block14_sepconv1_act[0][0]
____________________________________________________________________________
______________________
block14_sepconv2_bn (BatchNorma (None, 10, 10, 2048) 8192
block14_sepconv2[0][0]
____________________________________________________________________________
______________________
block14_sepconv2_act (Activatio (None, 10, 10, 2048) 0
block14_sepconv2_bn[0][0]
____________________________________________________________________________
______________________
498 CHAPTER 11. PRETRAINED MODELS FOR IMAGES

avg_pool (GlobalAveragePooling2 (None, 2048) 0


block14_sepconv2_act[0][0]
____________________________________________________________________________
______________________
predictions (Dense) (None, 1000) 2049000
avg_pool[0][0]
============================================================================
======================
Total params: 22,910,480
Trainable params: 22,855,952
Non-trainable params: 54,528
____________________________________________________________________________
______________________

Wow! What a huge model! Notice that it has almost 23 million parameters and many convolutional layers
stacked on top of one another. Let’s test the pre-trained model on the task of recognizing an image, without
any further training on our part.

We will need to pre-process the image so that it has the correct format for the network. Luckily for us, the
keras.applications.xception module also contains a preprocess_input function that does precisely
what we need. Let’s load it:

In [33]: from tensorflow.keras.applications.xception import preprocess_input

and let’s apply it to a copy of our tensor image:

TIP: we apply it to a copy because the function alters the argument itself and we don’t want
to alter the original version of the image.

In [34]: img_scaled = preprocess_input(np.copy(img_tensor))

img_scaled is scaled such that the minimum value is -1 and the maximum value is 1. We can pass it to the
model and generate a prediction:

In [35]: preds = model.predict(img_scaled)

What do our predictions look like? What is the output of the model? Let’s first look at the shape of the
preds object:

In [36]: preds.shape

Out[36]: (1, 1000)

preds is an array with a single row of 1000 entries. This makes sense: they are the probabilities associated with
each of the 1000 classes of objects in the Imagenet dataset.
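
Since the output layer is a softmax, these 1000 values sum to (approximately) one. As a quick sanity check,
not part of the original notebook, we could inspect the array directly:

    probs = preds[0]       # drop the batch dimension
    probs.sum()            # ~1.0, because the output comes from a softmax
    probs.argmax()         # index of the most likely Imagenet class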

To recap, our pre-trained Xception model takes an image as input and returns a softmax classification
output over 1000 classes. To interpret the prediction, we will load the decode_predictions function:

In [37]: from tensorflow.keras.applications.xception import decode_predictions

and apply it to the preds vector to get the top three most likely labels for the photo.

In [38]: decode_predictions(preds, top=3)[0]

Out[38]: [('n04540053', 'volleyball', 0.9431327),


('n09421951', 'sandbar', 0.021305429),
('n04371430', 'swimming_trunks', 0.0068396116)]

Not bad! Our model thinks it’s very likely that the photo is about volleyball, which is not far from beach
volleyball at all! That is awesome. Now let’s do even better: let’s repurpose the model so that it works with exactly
the three categories of pictures we have. This is called Transfer Learning.

Transfer Learning
Transfer Learning consists in leveraging a pre-trained model to solve a similar task. In this case, we’re going
to take a network that was trained on Imagenet and re-purpose it to solve the sports image classification
task. By using a pre-trained network, we don’t need to train it completely from scratch, a great advantage
both for the computing power required and for the amount of data needed.

We will be able to adapt a huge network like Xception, which has more than 20 million parameters, using a
laptop and a few thousand images. This is an incredibly powerful technique! Let’s see how we do it.

First of all, we’re going to set a value for the img_size = 299. This is the correct input size for Xception,
and it corresponds to the size of images used to train it.

In [39]: img_size = 299

We reload the Xception model, but this time we include a couple more arguments besides the weights.

First of all, we specify include_top=False. This option says we don’t want the full model, but only the
convolutional part. If you remember what we learned in Chapter 6 on CNNs, convolutional models are
composed of a cascade of convolutional layers that yield more and more specialized feature maps.

At some point, the feature maps are flattened to an array which is fed to a Dense layer (or a series of Dense
layers) and finally to the output of the classification. Here we want to load all the layers of Xception up to the
layer before the last fully connected (the top layer). The reason we want to do this is simple to explain. We
want to use the pre-trained model as a giant pre-processing layer that takes an image in input and returns a
few thousand high-level features (we will see these are called bottleneck features). We will then use these
high-level features to perform a standard classification with only the classes of images present in our dataset,
i.e., the three sports.
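
To see concretely what include_top changes, here is a small illustrative sketch that only compares output
shapes (it uses weights=None so that nothing is downloaded; it is not part of the workflow below):

    full = Xception(weights=None)                      # includes the 1000-way classifier
    headless = Xception(weights=None, include_top=False,
                        input_shape=(299, 299, 3), pooling='avg')

    full.output_shape      # (None, 1000)
    headless.output_shape  # (None, 2048): the bottleneck features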

[Figure: Transfer Learning with the Xception model]

When loading the Xception model, we also specify the input_shape and an additional parameter
pooling='avg'. This last one specifies how we’d like to go from the order-4 tensor of the feature maps to
the order-2 tensor that goes in the fully connected top. pooling='avg' means we’re going to apply a Global
Average Pooling layer at the end.
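
Global Average Pooling simply averages each feature map over its spatial dimensions, turning a
(batch, height, width, channels) tensor into a (batch, channels) matrix. A minimal numpy sketch of the
operation (purely illustrative):

    fmaps = np.random.rand(1, 10, 10, 2048)   # one set of feature maps
    pooled = fmaps.mean(axis=(1, 2))          # average over height and width
    pooled.shape                              # (1, 2048)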

Let’s load this into a variable called base_model:

In [40]: base_model = Xception(include_top=False,


weights='imagenet',
input_shape=(img_size, img_size, 3),
pooling='avg')

Now that we’ve loaded the base model, we’re going to complete the model with a couple of dense layers.
First we load the Sequential model and the Dense and Dropout layers:

In [41]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense, Dropout

Let’s create a model with the following architecture.



First, we pass the whole base_model as the first layer. This will take an input image, process it with the
pre-trained Xception weights, and pass an array of numbers to the next layer. Then we’ll load a fully
connected layer with 256 nodes and a ReLU activation, then Dropout and finally the output layer with three
nodes and a Softmax. Remember that we have only three classes:

• Beach volleyball
• Cross-country skiing
• Formula racing

that are mutually exclusive, i.e., a picture is only about one of the three sports, so our output needs to have
three nodes with a Softmax:

In [42]: model = Sequential()


model.add(base_model)
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))

Let’s take a quick look at the model summary:

In [43]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
xception (Model) (None, 2048) 20861480
_________________________________________________________________
dense (Dense) (None, 256) 524544
_________________________________________________________________
dropout (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 3) 771
=================================================================
Total params: 21,386,795
Trainable params: 21,332,267
Non-trainable params: 54,528
_________________________________________________________________

Wow! This model still has more than 20 million parameters. Now here’s the trick: we’re going to freeze most of
them, i.e., backpropagation will not touch them at all! We obtain this by setting the .trainable
attribute of a layer to False. Since we added the base_model as the first layer, we only need to set that
flag:

In [44]: model.layers[0].trainable = False



Let’s recheck the model summary.

In [45]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
xception (Model) (None, 2048) 20861480
_________________________________________________________________
dense (Dense) (None, 256) 524544
_________________________________________________________________
dropout (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 3) 771
=================================================================
Total params: 21,386,795
Trainable params: 525,315
Non-trainable params: 20,861,480
_________________________________________________________________

Of the total number of parameters, only about half a million are now trainable: the ones that belong to the two
dense layers we added after the base_model. This is a much more tractable model than the original
one!
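
If you want to double-check that number without reading the summary, the trainable weights can be
counted directly (a quick sketch, not required for the rest of the chapter):

    from tensorflow.keras import backend as K

    n_trainable = sum(K.count_params(w) for w in model.trainable_weights)
    n_trainable   # 525315: the parameters of the two Dense layers we added on top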

Let’s go ahead and compile the model:

In [46]: model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])

We are now ready to train it.

Data augmentation
Since we don’t have much data available, we’ll use the trick of data augmentation learned in Chapter 10.
This consists of generating variations of an image with transformations such as zoom, rotation and shear. Let’s
load the ImageDataGenerator:

In [47]: from tensorflow.keras.preprocessing.image import ImageDataGenerator

Let’s set a batch_size = 32. The choice of this number is somewhat arbitrary, but since we have three
classes only, a batch of 32 images will contain on average about ten images for each class. This seems a good
number of examples to learn something from:

In [48]: batch_size = 32

Now let’s create an instance of ImageDataGenerator that applies transformations to the training set. It will
do the following operations:

• apply the preprocess_input function to each image
• rotate it by an angle between -15 and 15 degrees
• apply a shift in both width and height of up to ±20% of the image size
• apply a shear of up to 5 degrees
• apply a zoom between 0.8 and 1.2
• possibly flip the image left to right
• fill the borders with the nearest pixel

In [49]: train_datagen = ImageDataGenerator(


preprocessing_function=preprocess_input,
rotation_range=15,
width_shift_range=0.2,
height_shift_range=0.2,
shear_range=5,
zoom_range=0.2,
horizontal_flip=True,
fill_mode='nearest')

The train_datagen object contains the instructions about the transformations to apply; now we need to
tell it where to source the images from. We’ll use the very convenient .flow_from_directory method,
specifying the path of the training images, the target size, and the batch size:

In [50]: train_generator = train_datagen.flow_from_directory(


train_path,
target_size=(img_size, img_size),
batch_size=batch_size)

Found 2100 images belonging to 3 classes.

As you can see, it found 2100 images with three classes. This is because the images are arranged in 3
subfolders of the train path:

train_path
|- Beach volleyball
|- img1
|- img2
|- ...
|- Cross-country skiing
|- img1
|- img2
|- ...
|- Formula racing
|- img1
|- img2
|- ...

Now let’s also create a generator for the test images. Note that the test_path must have the same
subfolders to be compatible with the training set. We will not apply any transformation to test images, and
we will flow them as they are. The reason for this is to have reproducible test results:

In [51]: test_datagen = ImageDataGenerator(


preprocessing_function=preprocess_input)

We will flow test images from the test_path directory:

In [52]: test_generator = test_datagen.flow_from_directory(


test_path,
target_size=(img_size, img_size),
batch_size=batch_size,)

Found 902 images belonging to 3 classes.

We are now ready to train the model with our generator. Since the images are produced by a generator,
the concept of an epoch is no longer well defined. For this reason, we will specify how many update steps an epoch
includes.
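
One simple way to choose that number is to divide the number of training images by the batch size; the next
paragraph works this out in words, and here is the same computation as a small illustrative sketch:

    n_train_images = 2100
    steps_per_epoch = n_train_images // batch_size   # 65: roughly one pass over the training set
    steps_per_epoch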

Let’s see: we have 2100 images in the training folder and we feed batches of 32 images. This means that with
approximately 65 steps we have sent roughly as many images as there are in the training set. Let’s do this: we
train the model for one epoch with 65 update steps:

In [53]: model.fit_generator(
train_generator,
steps_per_epoch=65,
epochs=1)

65/65 [==============================] - 48s 741ms/step - loss: 0.5631 - accuracy: 0.7882

Out[53]: <tensorflow.python.keras.callbacks.History at 0x7fc3103c2a20>

It took a little bit of time, but it looks promising. In a single epoch we have re-purposed a huge
convolutional neural network, and it can now perform image recognition on new classes of images it had never
encountered before. Cool!

Let’s assess the accuracy of our model with the .evaluate_generator method on the test set:

In [54]: model.evaluate_generator(test_generator, steps=len(test_generator))

Out[54]: [0.23610197059039412, 0.8980044]

Not bad at all, considering that it only trained for one epoch. Also, let’s check the prediction on our original
image of the volleyball player:

In [55]: model.predict_classes(img_tensor)

Out[55]: array([1])

The output is the index of the predicted class (note that, for a prediction consistent with training, we should
really pass the preprocessed img_scaled rather than the raw img_tensor). We can map class indices to
class names by looking at the class_indices defined in the train_generator:

In [56]: train_generator.class_indices

Out[56]: {'Beach volleyball': 0, 'Cross-country skiing': 1, 'Formula racing': 2}

Awesome! We’ve performed transfer learning for the first time, and we’ve reused a giant pre-trained model
for our goal. This was good, but it took rather a long time to train. Can we speed that up? The answer is
yes! Let’s introduce bottleneck features.

Bottleneck features
Let’s stop for a second and think back about what we’ve just done. First, we’ve loaded a large convolutional
network, whose weights have been pre-trained on the Imagenet problem.

Then we’ve used the convolutional part of this network as the first layer of a new network, followed by fully
connected layers. Since its weights are frozen, the convolutional part of the network is acting as a feature
extractor, that extracts a vector of features from the given input image.

[Figure: Bottleneck features]

These features then go into the fully connected layers which perform the classification. The training process
is still slow because it still applies all the convolutions to the image at each training step.

On the other hand, since the weights are frozen, we could take a different approach. We could pre-process
all the images once: send them through the convolutional part of the network and extract a feature vector
for each of them. We could use this dataset of feature vectors to train a fully connected network for the
classification.

Another way to look at this process is to say that we are using the pre-trained network as a feature extraction
pipeline, not dissimilar from traditional pipelines involving Wavelets, Histograms and so on. The difference
here is that the bottleneck features come from a network pre-trained on the Imagenet classification and are
therefore optimized for that task.

Let’s start by wrapping the ImageDataGenerator and the .flow_from_directory method in a single
function. This function takes the input_path of the images and a couple of other parameters and returns a
generator ready to receive all the images in the input_path. We’ll feed this generator to the
base_model.predict_generator function, which will return the values of the last layer before the output
layer of the full model (remember we loaded the model with the parameter include_top=False). Also,
notice that we will set shuffle=False so that the image order is the same as that contained in
generator.classes and we can use it later.

Here’s the function:

In [57]: def bottleneck_generator(input_path,
                                  img_size=299,
                                  batch_size=32,
                                  shuffle=False):

             # ImageDataGenerator that applies preprocess_input to each image
             datagen = ImageDataGenerator(
                 preprocessing_function=preprocess_input)

             # return batches of preprocessed and scaled images
             # together with their labels. Images are taken
             # from input_path, labels are generated from the
             # subdir structure. In input_path there are
             # 3 subfolders, so there'll be 3 labels.
             generator = datagen.flow_from_directory(
                 input_path,
                 target_size=(img_size, img_size),
                 batch_size=batch_size,
                 class_mode='categorical',
                 shuffle=shuffle)

             return generator

Let’s use this function to generate the bottlenecks for the training images:

In [58]: train_generator = bottleneck_generator(train_path)


bottlenecks_train = base_model.predict_generator(
train_generator, steps=len(train_generator),
verbose=1)

Found 2100 images belonging to 3 classes.


66/66 [==============================] - 34s 514ms/step

Let’s also recover the training labels from the same generator.

In [59]: labels_train = train_generator.classes

Depending on your system, generating bottlenecks may take a long time, and having one or more GPUs
available will surely speed up the process. Since they are quite small, the code repository already contains a
saved version of these bottlenecks.

Now that we have created the bottlenecks let’s check them. Let’s start from the shape of the tensors:

In [60]: bottlenecks_train.shape

Out[60]: (2100, 2048)



The train bottlenecks are a matrix with as many rows as there are images in the training set and as many
columns as the outputs of the base model’s last layer, the GlobalAveragePooling layer from
Xception, i.e., 2048 features. Let’s plot a few of them to see what they look like. Let’s get a batch of images and
labels from the training generator:

In [61]: images, labels = bottleneck_generator(


train_path, shuffle=True, batch_size=256).next()

Found 2100 images belonging to 3 classes.

Moreover, let’s create a list with the label names for each of the images in the batch. We will do this in two
steps. First, let’s create a label map with the label names corresponding to the class indices:

In [62]: label_map = list(train_generator.class_indices.keys())


label_map

Out[62]: ['Beach volleyball', 'Cross-country skiing', 'Formula racing']

Then let’s use the label_map to convert the labels into the corresponding label names:

In [63]: label_names = [label_map[i] for i in labels.argmax(axis=1)]

label_names[:10]

Out[63]: ['Beach volleyball',


'Beach volleyball',
'Cross-country skiing',
'Beach volleyball',
'Cross-country skiing',
'Cross-country skiing',
'Cross-country skiing',
'Formula racing',
'Formula racing',
'Formula racing']

Great. Now let’s generate bottleneck features for the images in the batch:

In [64]: bottlenecks = base_model.predict(images, verbose=1)



256/256 [==============================] - 3s 11ms/sample

Will bottlenecks of images with the same label be similar? Let’s take a look at them on a plot and see if
there’s any pattern we can recognize. We will create a figure with three plots, one for each of the three
classes, and plot the values of the bottlenecks:

In [65]: fig, ax = plt.subplots(nrows=3, ncols=1,


sharex=True, figsize=(15, 5))

for bn, label in zip(bottlenecks, label_names):


idx = train_generator.class_indices[label]
ax[idx].plot(bn)
ax[idx].set_title(label)

plt.xlim(0, 2050)
plt.tight_layout()

[Figure: bottleneck feature values (indices 0-2050), one panel per class: Beach volleyball, Cross-country skiing, Formula racing]

Hmm, although the three plots look somewhat different, it’s hard to tell if anything is interesting. Let’s zoom
in to the range 980-1010:

In [66]: fig, ax = plt.subplots(nrows=3, ncols=1,


sharex=True, figsize=(15, 5))

for bn, label in zip(bottlenecks, label_names):


idx = train_generator.class_indices[label]
ax[idx].plot(bn)
ax[idx].set_title(label)

plt.xlim(980, 1010)
plt.tight_layout()

[Figure: zoom on bottleneck features 980-1010, one panel per class: Beach volleyball, Cross-country skiing, Formula racing]

Here it’s clearer that several of the Beach Volleyball bottlenecks have a high spike at features 984, 990 and
1001, while the other two sports do not have those peaks. Bottlenecks are like fingerprints of an image:
features extracted through convolutions that encode the content of the image.

Now that we understand a little more what bottleneck features are, we can save them to disk once and for all.
We can now experiment with fully connected architectures that classify the bottlenecks as input data. Let’s
save the bottlenecks and the labels as numpy arrays. We will use the gzip library for efficiency.

In [67]: import gzip

In [68]: fname_ = '../data/sports/bottlenecks_train.npy.gz'


np.save(gzip.open(fname_, 'wb'), bottlenecks_train)

In [69]: fname_ = '../data/sports/labels_train.npy'


np.save(open(fname_, 'wb'), labels_train)

Let’s also generate the bottlenecks for the test set:

In [70]: test_generator = bottleneck_generator(test_path)

         bottlenecks_test = base_model.predict_generator(
             test_generator, verbose=1,
             steps=len(test_generator))

Found 902 images belonging to 3 classes.


29/29 [==============================] - 15s 501ms/step

and the test labels:

In [71]: labels_test = test_generator.classes



and let’s save them too:

In [72]: fname_ = '../data/sports/bottlenecks_test.npy.gz'


np.save(gzip.open(fname_, 'wb'), bottlenecks_test)

In [73]: fname_ = '../data/sports/labels_test.npy'


np.save(open(fname_, 'wb'), labels_test)

Great! Now let’s see how we use the bottlenecks.

Train a fully connected on bottlenecks


Bottlenecks saved to disk can be restored by reading them. Let’s read them into two variables called
X_train and X_test.

In [74]: fname_ = '../data/sports/bottlenecks_train.npy.gz'


X_train = np.load(gzip.open(fname_, 'rb'))

fname_ = '../data/sports/bottlenecks_test.npy.gz'
X_test = np.load(gzip.open(fname_, 'rb'))

We can check the shape of the train data, and verify that it’s a matrix:

In [75]: X_train.shape

Out[75]: (2100, 2048)

Similarly, let’s load the labels:

In [76]: fname_ = '../data/sports/labels_train.npy'


y_train = np.load(open(fname_, 'rb'))

fname_ = '../data/sports/labels_test.npy'
y_test = np.load(open(fname_, 'rb'))

and check the shape as well:

In [77]: y_train.shape

Out[77]: (2100,)

It looks like we have to one-hot encode the labels, so let’s do that using the
keras.utils.to_categorical function that we have used many times in the book:

In [78]: from tensorflow.keras.utils import to_categorical

In [79]: y_train_cat = to_categorical(y_train)


y_test_cat = to_categorical(y_test)

Now we are finally ready to train a fully connected network on the bottleneck features. Let’s import the
Sequential model and Dense and Dropout layers.

In [80]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense, Dropout

We’ll define the model using the sequential API. We will build a very simple model: a Dropout layer on the input
that receives the bottleneck features, a small Dense hidden layer with a ReLU activation, another Dropout layer,
and a Dense output layer. The output layer must have 3 nodes because there are 3 classes, and it must have a
softmax activation function because the classes are mutually exclusive. Feel free to change the model definition
to something else if you’d like, keeping in mind that we only have a few thousand training data points, so
giving the model too much freedom may lead to overfitting.

Notice that instead of adding the layers like we did in other parts of the book we can pass a list of layers to
the model constructor:

In [81]: fc_model = Sequential([


Dropout(0.5, input_shape=(2048,)),
Dense(16, activation='relu'),
Dropout(0.5),
Dense(3, activation='softmax')
])

Let’s compile the model with our preferred optimizer using the categorical_crossentropy loss:

In [82]: fc_model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])

Here’s a summary of the model:



In [83]: fc_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dropout_1 (Dropout) (None, 2048) 0
_________________________________________________________________
dense_2 (Dense) (None, 16) 32784
_________________________________________________________________
dropout_2 (Dropout) (None, 16) 0
_________________________________________________________________
dense_3 (Dense) (None, 3) 51
=================================================================
Total params: 32,835
Trainable params: 32,835
Non-trainable params: 0
_________________________________________________________________

This simple model has a little over 30,000 parameters, so it will be very fast to train. Let’s train it for a few
epochs:

In [84]: history = fc_model.fit(X_train, y_train_cat,


epochs=20,
verbose=0,
batch_size=batch_size,
validation_data=(X_test, y_test_cat))

And let’s plot the accuracy:

In [85]: plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy')
plt.legend(['train', 'test'])
plt.xlabel('Epochs');

[Figure: 'Accuracy' plot over 20 epochs, train and test curves, accuracy between roughly 0.80 and 0.96]

This model trained really fast, and the accuracy on the test set is higher than the accuracy on the training set,
which is a really good sign that we are not overfitting. This is the value of bottleneck features: we can use them as
proxies for our original images and train a simple model on them using just a laptop.
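
If you prefer the raw numbers to the plot, the history object stores them (a quick sketch):

    history.history['accuracy'][-1]       # final training accuracy
    history.history['val_accuracy'][-1]   # final test (validation) accuracy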

Image search

A fun application of pre-trained models is image search.

Imagine the following situation: we have a dataset of images, and we would like to find the images most similar
to a specific one. For example, we have our collection of pictures on our laptop, and we’d like to see all
the ones with a particular person.

Solving this problem requires the definition of a distance measure between images, so that, given a picture,
we can look for images that are close to it. This is hard to do using the raw pixels as features because, as we
have seen many times, images with similar content may look completely different on every single pixel.

On the other hand, since bottleneck features capture high-level features from the images, we can exploit them to
locate similar images. We will do this using the DistanceMetric class from Scikit-Learn. Let’s start by
importing it:

In [86]: from sklearn.neighbors import DistanceMetric



and let’s get an instance of the Euclidean metric, which is the usual straight-line distance between two vectors:

In [87]: dist = DistanceMetric.get_metric('euclidean')

We used it to define the Mean Squared Error in Chapter 3; it is obtained as the square root of the sum of the
squared differences along each coordinate:

$$d(\mathbf{x}', \mathbf{x}) = \sqrt{(\mathbf{x}' - \mathbf{x})^2} = \sqrt{\sum_i (x'_i - x_i)^2} \qquad (11.1)$$

and given the two vectors:

In [88]: a = np.array([1, 2])

In [89]: b = np.array([2, -1])

it is calculated as:

In [90]: np.sqrt(np.square(a - b).sum())

Out[90]: 3.1622776601683795

Now that we have defined the euclidean distance metric we can calculate the pairwise distances between all
the bottlenecks and then use that for our image search engine.
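
As a sanity check (illustrative only), the DistanceMetric object returns the same number for the two toy
vectors above through its .pairwise method:

    dist.pairwise(np.vstack([a, b]))
    # array([[0.        , 3.16227766],
    #        [3.16227766, 0.        ]])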

Let’s take a few images as an example. Let’s get the training images from the training set using the bottleneck
data generator:

In [91]: images, labels = bottleneck_generator(


train_path, batch_size=2100).next()

Found 2100 images belonging to 3 classes.

and let’s display a few of them. Note that since our images were normalized during pre-processing, their
pixel values now lie between -1 and 1. plt.imshow requires floating point images to have values between 0
and 1, so we need to add 1 and divide by 2 to display each pre-processed image correctly. Let’s define a
helper function that does that.

In [92]: def imshow_scaled(img):


plt.imshow((img + 1) / 2)

In [93]: plt.subplot(1, 3, 1)
imshow_scaled(images[0])

plt.subplot(1, 3, 2)
imshow_scaled(images[1])

plt.subplot(1, 3, 3)
imshow_scaled(images[900])

[Figure: three sample images from the batch]

The first two are images of beach volleyball while the third one is of skiing.

The distance between the first and the second image is:

In [94]: np.sqrt(np.square(X_train[0] - X_train[1]).sum())

Out[94]: 6.739547

while the distance between the first and the third is:

In [95]: np.sqrt(np.square(X_train[0] - X_train[900]).sum())

Out[95]: 10.117036

As you can see, the first and the third image are further apart, which makes sense since the last one is very
different from the previous two. We will now proceed to calculate the distances between all of the images
in the training set using their bottleneck features.

We will do this calculation using the .pairwise method of the Euclidean distance object we created
previously. We could also do a double for loop over the bottlenecks and calculate the distances manually;
however, the .pairwise method is more efficient:

In [96]: bn_dist = dist.pairwise(X_train)

Let’s check the shape of the matrix we have obtained:

In [97]: bn_dist.shape

Out[97]: (2100, 2100)

Since we have 2100 images in the training set, the pairwise matrix is a square symmetric matrix that
contains all pairwise distances. Let’s visualize it to understand it a little bit better:

In [98]: plt.imshow(bn_dist, cmap='gray')

Out[98]: <matplotlib.image.AxesImage at 0x7fc2081a7940>

[Figure: the 2100x2100 pairwise distance matrix displayed as a grayscale image]

Notice a few things about this matrix (the first two points are verified in the short check below):

• The darker a pixel, the closer the two corresponding images are.
• The matrix is symmetric about the diagonal, which makes sense since the distance between image 1 and image 2 is the same as the distance between image 2 and image 1.
• The diagonal is the darkest of all, which also makes sense since an image is identical to itself and therefore has a distance of zero from itself.
• Three blocks are distinguishable along the diagonal, although a little fuzzy. This makes sense because images are sorted by class and, generally speaking, all the images in a class are expected to be more similar to one another than to images in other classes.
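
The first two observations can be verified directly on the matrix (a quick check, not in the original notebook):

    np.allclose(bn_dist, bn_dist.T)    # True: the matrix is symmetric
    np.allclose(np.diag(bn_dist), 0)   # True: zero distance on the diagonal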

Notice that we have obtained these distances using the bottlenecks from the pre-trained model, no
additional training needed. Awesome!

Let’s put this to use in our search engine! Let’s take an image:

In [99]: imshow_scaled(images[0])

[Figure: the selected image]

Let’s look for the top 9 closest images. All we have to do is select the row in the bn_dist matrix
corresponding to the index of the image selected, which is zero in this case. We will wrap this with a Pandas
Series so that we can use the indices later:

In [100]: dist_from_sel = pd.Series(bn_dist[0])

Let’s sort these and display the first few images:

In [101]: dist_from_sel.sort_values().head(9)

Out[101]: 0      0.000000
          15     5.382501
          431    5.448220
          32     5.697219
          636    5.848816
          429    6.040056
          656    6.046222
          48     6.079025
          270    6.202633

Let’s display these. We will display 9 images, in a grid of 3x3. Let’s make this configurable by defining a few
parameters:

In [102]: n_rows = 3
n_cols = 3
n_images = n_rows * n_cols

Now let’s take the top 9 images with the shortest distance from our original image. This can be done by
sorting the values in dist_from_sel and then using the .head method, which retrieves the first elements
of the series:

In [103]: retrieved = dist_from_sel.sort_values().head(n_images)

Now let’s loop over the index of the retrieved images and plot the images:

In [104]: plt.figure(figsize=(10, 10))


i = 1
for idx in retrieved.index:
plt.subplot(n_rows, n_cols, i)
imshow_scaled(images[idx])
i += 1
plt.tight_layout()

[Figure: 3x3 grid of the nine retrieved images]

Nice! The first image displayed is the one we had selected, and as you can see the other ones are all very
similar! Let’s try again with another image. We will define a function to make things easy:

In [105]: def image_search(img_index, n_rows=3, n_cols=3):

              n_images = n_rows * n_cols

              # create Pandas Series with distances from image
              dist_from_sel = pd.Series(bn_dist[img_index])

              # sort Series and get top n_images
              retrieved = dist_from_sel.sort_values().head(n_images)

              # create figure, loop over closest images indices
              # and display them
              plt.figure(figsize=(10, 10))
              i = 1
              for idx in retrieved.index:
                  plt.subplot(n_rows, n_cols, i)
                  imshow_scaled(images[idx])
                  if i == 1:
                      plt.title('Selected image')
                  else:
                      plt.title("Dist: {:0.4f}".format(retrieved[idx]))
                  i += 1
              plt.tight_layout()

In [106]: image_search(900)

[Figure: selected image 900 and its eight nearest neighbours, with distances ranging from 4.1975 to 4.5594]

In [107]: image_search(1600)

[Figure: selected image 1600 and its eight nearest neighbours, with distances ranging from 5.0262 to 6.0754]

In [108]: image_search(100)

[Figure: selected image 100 and its eight nearest neighbours, with distances ranging from 10.8105 to 11.2996]

Notice that we can also sort the distances in reverse order and find the images which are the furthest away
from a selected image. E.g., for this image:

In [109]: imshow_scaled(images[0])

[Figure: the selected image]

The most distant images are:

In [110]: retrieved = pd.Series(bn_dist[0]).sort_values(ascending=False).head(9)

In [111]: plt.figure(figsize=(10, 10))


i = 1
for idx in retrieved.index:
plt.subplot(n_rows, n_cols, i)
imshow_scaled(images[idx])
plt.title("Dist: {:0.4f}".format(retrieved[idx]))
i += 1
plt.tight_layout()

[Figure: the nine images furthest from the selected image, with distances ranging from 14.5389 down to 13.5701]

These clearly have very little in common with the image above!

In conclusion, Keras offers several pre-trained models for images that can be used for a variety of tasks,
including image recognition, transfer learning and image similarity search.

Exercises

Exercise 1

Use a pre-trained model on a different image.

• Download an image from the web
• Upload the image through the Jupyter home page
• Load the image as a numpy array
• Re-run the prediction to see if the pre-trained model can guess your image
• Can you find an image that is outside of the Imagenet classes? (You can see which classes are available here.)

In [ ]:

Exercise 2

Choose another pre-trained model from the ones provided at https://keras.io/applications/ and use it to
predict the same image. Do the predictions match?

In [ ]:

Exercise 3

The Keras documentation shows how to fine-tune the Inception V3 model by unfreezing some of the
convolutional layers. Try reproducing the results of the documentation on our dataset using the Xception
model and unfreezing some of the top convolutional layers.

In [ ]:
12 Pretrained Embeddings for Text
In the last part of Chapter 8 we introduced the concept of Embeddings. These are dense vectors that
represent words and are often used as a starting point when approaching NLP problems like language
translation or sentiment analysis. The word vectors we introduced in Chapter 8 were trained together with
the rest of the model and therefore were specific to the particular problem we were trying to solve.

For example, when we trained our Recurrent Model for the IMDB sentiment analysis task, the embedding
layer was the first layer in a Sequential model, followed by a recurrent layer and a classification head for the
sentiment prediction. The weights of the word vectors in the embedding layer were learned together with
the weights of the recurrent layer and the classification layer. The single task of predicting the sentiment of a
movie review would provide a value for the loss which would then propagate back through the network to
adjust both the recurrent and the embedding weights.

The approach we have just described has two drawbacks:

1) since the embedding layer has many weights (e.g., for a vocabulary of 10k words, each embedded
with 100 numbers, we have 1M weights), we need lots of data for this model to generalize well and
avoid overfitting

2) since the embeddings are trained on the sentiment analysis task, they will work well on that task but
will not necessarily learn general properties of the semantic space. In other words, we will not be able
to use those same embeddings for an entirely different NLP task, like machine translation.

To overcome these two limitations, researchers have proposed different approaches to building more generic
embeddings. These approaches try to capture the meaning of a word in a language and learn more general
embeddings where similar vectors represent words with similar meanings. Although this may seem crazy at
first, it works well in practice.


In this chapter, we will see a couple of different famous embeddings, and we will use them to do fun
operations with text.

Let’s get started.

“Unsupervised” supervised learning
How can we train a generic embedding layer that encodes the meaning of words? We’ll have to resort to a
trick, common when training large networks. This trick is often referred to as “Unsupervised Learning”
although, as we shall see, it is a special case of Supervised Learning where humans do not generate the labels.

Let’s think back to the sequence generation example we introduced in Chapter 8. There we built a network
that learned to predict the most likely letter after a sequence of 3 letters, using a corpus of English baby
names. The same approach can be used to build a model that is trained to predict the most likely word after
a sequence of words. For example, if trained on a corpus of songs by John Lennon, the model should
be able to learn that “heaven” is the most likely word after the words “imagine there’s no”. This task
is called language modeling because the model learns the structure of a language.

The output of this model is a Softmax over the vocabulary of the language, while the input is a sequence of
words encoded as vectors by the input embedding layer of the language model. Since the model is solving a
forecasting task (predict the most likely words after a sequence of words), we are still in the domain of
Supervised Learning. The labels, however, are contained in the corpus itself.

Consider for example this excerpt from the song Imagine by John Lennon:

In [1]: text = """


imagine there's no heaven
it's easy if you try
no hell below us
above us only sky
imagine all the people living for today

imagine there's no country


it isn't hard to do
nothing to kill or die for
and no religion too
imagine all the people living life in peace,
you

you may say i'm a dreamer


but i'm not the only one
i hope some day you'll join us
and the world will be as one

...
"""

From this text we can build the following pairs of inputs and labels:

Sequence              Label
imagine there's no    heaven
there's no heaven     it's
no heaven it's        easy
heaven it's easy      if
it's easy if          you
easy if you           try
...                   ...

Both the inputs and the labels come from the same corpus by merely sliding a window of fixed length and
asking the model to predict the word coming immediately after the window.
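
A minimal sketch of this windowing, assuming plain whitespace tokenization (a real pipeline would use a
proper tokenizer):

    words = text.lower().split()
    window = 3
    pairs = [(' '.join(words[i:i + window]), words[i + window])
             for i in range(len(words) - window)]
    pairs[:2]
    # [("imagine there's no", 'heaven'), ("there's no heaven", "it's")]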

This generic forecasting approach is amazingly powerful! Our ability to label data no longer limits us: we can
use any text. We could use the whole of Wikipedia and train a very generic language model that attempts to
predict the next word in a sequence. The embeddings of such a model will be more generic than the ones
trained on the sentiment problem!

Starting with this intuition, that you can obtain labels from the text, researchers have invented several
approaches to train generic embeddings. We will mention here a few of the most famous and show you
where to find them and how to use them.

Let’s start with a very common set of embeddings called GloVe, which stands for Global Vectors for Word
Representation.

GloVe embeddings
In [2]: with open('common.py') as fin:
exec(fin.read())

In [3]: with open('matplotlibconf.py') as fin:


exec(fin.read())

In the data/embeddings folder we provide a download script that downloads and extracts GloVe
embeddings from: https://nlp.stanford.edu/projects/glove/. Here is its content:

In [4]: cat ../data/embeddings/glove_download.sh

# Script to download and extract Glove
# word embeddings. More information at:
# https://nlp.stanford.edu/projects/glove/

# Uncomment the file you'd like to download
EMBEDDINGS=glove.6B
# EMBEDDINGS=glove.42B.300d
# EMBEDDINGS=glove.840B.300d

# download and extract
wget http://nlp.stanford.edu/data/$EMBEDDINGS.zip
unzip $EMBEDDINGS.zip
rm $EMBEDDINGS.zip

Go ahead and run the script to retrieve the glove.6B embeddings. Let’s take a look at them. First let’s
define a path variable:

In [5]: glove_path = '../data/embeddings/glove.6B.50d.txt'

And let’s look at the first line in the file:

In [6]: with open(glove_path, 'r', encoding='utf-8') as fin:


line = fin.readline()

In [7]: line

Out[7]: 'the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
-0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658
0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131
-0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594
-0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223
-0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411
-0.11514 -0.78581\n'

As you can see, the line contains the word the as its first element, followed by 50 space-separated floating point
numbers, which form the word vector. Let’s define a parse function that parses the line and returns the word
and the vector as a numpy array.

We should take care of removing the trailing \n character at the end, then split the line at spaces, which will
return a list. Finally, we’ll take the first element in the list as word and the remaining values as the vector.
Here’s the parse function:

In [8]: def parse_line(line):


values = line.strip().split()
word = values[0]
vector = np.asarray(values[1:], dtype='float32')
return word, vector

Now that we have defined a parse function let’s use it to load the word embeddings. We will use a Python
dictionary for the embeddings and one for the word index. Let’s create two empty dictionaries:

In [9]: embeddings = {}
word_index = {}

Let’s also create an empty list for the inverted index that will map numbers to words:

In [10]: word_inverted_index = []

Now we can loop over the lines in the file, parse each line and store it in the embeddings and word index
dictionary. We will enumerate the lines as we proceed with the loop so that we can also retrieve their
numeric index.

Let’s do it:

In [11]: with open(glove_path, 'r', encoding='utf-8') as fin:


for idx, line in enumerate(fin):
word, vector = parse_line(line) # parse a line

embeddings[word] = vector # add word vector


word_index[word] = idx # add idx
word_inverted_index.append(word) # append word

Let’s check a few entries in the indexes we built. For example, using word_index, we can retrieve the line
number at which the word good appears:

In [12]: word_index['good']

Out[12]: 219

Using the word_inverted_index we can do the reverse, i.e., given a line number, find the corresponding
word:

In [13]: word_inverted_index[219]

Out[13]: 'good'

The embeddings dictionary contains the actual word vectors, so for example, the word vector
corresponding to the word good is the following:

In [14]: embeddings['good']

Out[14]: array([-3.5586e-01, 5.2130e-01, -6.1070e-01, -3.0131e-01, 9.4862e-01,


-3.1539e-01, -5.9831e-01, 1.2188e-01, -3.1943e-02, 5.5695e-01,
-1.0621e-01, 6.3399e-01, -4.7340e-01, -7.5895e-02, 3.8247e-01,
8.1569e-02, 8.2214e-01, 2.2220e-01, -8.3764e-03, -7.6620e-01,
-5.6253e-01, 6.1759e-01, 2.0292e-01, -4.8598e-02, 8.7815e-01,
-1.6549e+00, -7.7418e-01, 1.5435e-01, 9.4823e-01, -3.9520e-01,
3.7302e+00, 8.2855e-01, -1.4104e-01, 1.6395e-02, 2.1115e-01,
-3.6085e-02, -1.5587e-01, 8.6583e-01, 2.6309e-01, -7.1015e-01,
-3.6770e-02, 1.8282e-03, -1.7704e-01, 2.7032e-01, 1.1026e-01,
1.4133e-01, -5.7322e-02, 2.7207e-01, 3.1305e-01, 9.2771e-01],
dtype=float32)

How many components does this vector have? Let’s check its length:

In [15]: embedding_size = len(embeddings['good'])


embedding_size

Out[15]: 50

We can also plot it:

In [16]: plt.plot(embeddings['good']);

[Figure: plot of the 50 components of the word vector for 'good']

It doesn’t tell us much, but, for example, we can compare the word vectors of a few words and see how they
look. Let’s plot a few numbers, like two, three, and four, and a few animals like cat, dog, and rabbit. As you will
see, the numbers will look very similar to one another, and the animals will be distinctly different from the numbers:

In [17]: plt.subplot(211)
plt.plot(embeddings['two'])
plt.plot(embeddings['three'])
plt.plot(embeddings['four'])
plt.title("A few numbers")
plt.ylim(-2, 5)

plt.subplot(212)
plt.plot(embeddings['cat'])
plt.plot(embeddings['dog'])
plt.plot(embeddings['rabbit'])
plt.title("A few animals")
plt.ylim(-2, 5)

plt.tight_layout()
Word vectors for a few numbers (top panel) and a few animals (bottom panel)

This is reminiscent of the bottleneck features we encountered in Chapter 11. Each word corresponds to a vector with fifty numbers, and words that carry similar semantic value are encoded with similar vectors. As we shall see, GloVe vectors are built by looking at the co-occurrence of words, so the above plots tell us that numbers like two, three and four are often found near each other, which makes sense since they appear in similar contexts. I can say "I ran two miles" or "I ran three miles" and both sentences make sense, while I cannot say "I ran cat miles". Two and three can be found in similar contexts and are therefore encoded as similar vectors.
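To make this comparison quantitative, we can compute the cosine similarity between a couple of these vectors using the embeddings dictionary we just loaded. This is only an illustrative sketch (the helper cosine_sim is ours, not a library function); if the embeddings behave as described, the first value should be much closer to 1 than the second:

import numpy as np

def cosine_sim(a, b):
    # cosine similarity: dot product of the two vectors divided by their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(embeddings['two'], embeddings['three']))  # numbers: similar
print(cosine_sim(embeddings['two'], embeddings['cat']))    # number vs animal: less similar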

Let’s see how many words are contained in our GloVe embeddings by checking the len of the embeddings
variable:

In [18]: vocabulary_size = len(embeddings)


vocabulary_size

Out[18]: 400000

There are 400000 words in the embeddings, which will cover most of our needs.

TIP: If you are curious to know more about how to build GloVe embeddings, we encourage
you to read the original paper or take a look at the source code.

Loading pre-trained embeddings in Keras


Let’s import tensorflow:

In [19]: import tensorflow as tf

As we have seen in Chapter 8, Keras has an Embedding layer that can be trained to build custom
embeddings. Here we will learn how to initialize it using pre-trained embeddings like the GloVe
embeddings we have just loaded. First, let’s load the Sequential model and the Embedding layer from
Keras:

In [20]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Embedding

Next, we are going to arrange our pre-trained embeddings as a giant matrix of shape (vocabulary_size,
embedding_size). We will do this in 2 steps. First, let’s create a zero matrix with the correct shape:

In [21]: embedding_weights = np.zeros((vocabulary_size,


embedding_size))

Then let’s iterate over the items in our word_index dictionary and let’s assign each vector in the
embeddings dictionary to a line in the matrix. For example, we know from above that the word good has
index 219. We will assign its vector to the row in the matrix corresponding to index 219 (i.e., the 220-th
row). Let’s do it:

In [22]: for word, index in word_index.items():
             embedding_weights[index, :] = embeddings[word]

Now that we have our pre-trained weights arranged in a matrix, we can create a model with a single
Embedding layer, and we will then set the weights to be the pre-trained weights.

We start by creating the embedding layer:

In [23]: emb_layer = Embedding(input_dim=vocabulary_size,


output_dim=embedding_size,
mask_zero=False,
trainable=False)

Notice that we specified the following parameters:

• input_dim: the number of distinct words to embed, equal to the vocabulary_size.
• output_dim: the embedding dimension, which has to coincide with the size of the GloVe vectors, i.e. 50 in our case.
• mask_zero: this tells Keras whether we are using the index 0 as a special padding value or as a word. In our case, the index zero corresponds to the word 'the', so we need to set this flag to False.
• trainable: this tells Keras whether the weights in the embedding layer should be trainable. Since we take the weights from GloVe, this should also be False.

TIP: for more info on the mask_zero flag, here is its documentation:

mask_zero: Whether or not the input value 0 is a special "padding"


value that should be masked out.
This is useful when using [recurrent layers](recurrent.md)
which may take variable length input.
If this is `True` then all subsequent layers
in the model need to support masking or an exception will be raised.
If mask_zero is set to True, as a consequence, index 0 cannot be
used in the vocabulary (input_dim should equal size of
vocabulary + 1).

In our case, we have used the index 0 in the vocabulary for the word the, as we can check in the
word_inverted_index:

In [24]: word_inverted_index[0]

Out[24]: 'the'

so we have to set mask_zero to False. Had we started enumerating the word vectors from 1, we could have reserved the value 0 for padding, which, as the documentation says, is useful when using recurrent layers. A sketch of that alternative is shown below.
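As a purely illustrative sketch of that alternative (not needed for the rest of this chapter), we could build a matrix with one extra row, keep row 0 as all zeros for padding, and shift every word index by one; per the documentation quoted above, input_dim then becomes the vocabulary size + 1 and mask_zero can be set to True:

# reserve row 0 for padding and shift every word index by one
padded_weights = np.zeros((vocabulary_size + 1, embedding_size))
for word, index in word_index.items():
    padded_weights[index + 1, :] = embeddings[word]

emb_layer_padded = Embedding(input_dim=vocabulary_size + 1,
                             output_dim=embedding_size,
                             mask_zero=True,
                             trainable=False)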

The trainable=False flag tells Keras that this layer is not trainable, i.e., its weights cannot be changed during training. We used this earlier when working with pre-trained models for images in Chapter 11.

Notice that simply passing the matrix to the Embedding constructor is not enough. We need to put this
layer in a model in order for Keras to actually create a Tensorflow graph with it. Let’s do it:

In [25]: model = Sequential()


model.add(emb_layer)

Finally we need to set the weights to be the Glove weights:



In [26]: model.set_weights([embedding_weights])

Now that we have created a model we can check that the embedding layer does use the pre-trained weights.
Let’s check the embeddings for the word cat. Here are the original values we loaded from the file:

In [27]: embeddings['cat']

Out[27]: array([ 0.45281 , -0.50108 , -0.53714 , -0.015697, 0.22191 , 0.54602 ,


-0.67301 , -0.6891 , 0.63493 , -0.19726 , 0.33685 , 0.7735 ,
0.90094 , 0.38488 , 0.38367 , 0.2657 , -0.08057 , 0.61089 ,
-1.2894 , -0.22313 , -0.61578 , 0.21697 , 0.35614 , 0.44499 ,
0.60885 , -1.1633 , -1.1579 , 0.36118 , 0.10466 , -0.78325 ,
1.4352 , 0.18629 , -0.26112 , 0.83275 , -0.23123 , 0.32481 ,
0.14485 , -0.44552 , 0.33497 , -0.95946 , -0.097479, 0.48138 ,
-0.43352 , 0.69455 , 0.91043 , -0.28173 , 0.41637 , -1.2609 ,
0.71278 , 0.23782 ], dtype=float32)

Now let’s retieve the index of the word cat and let’s pass it to the model.predict method. First we retrieve
the index of the word cat using the word_index:

In [28]: cat_index = word_index['cat']

The index is:

In [29]: cat_index

Out[29]: 5450

Now that we have the index, we run model.predict on a double nested list containing the single index of
the word cat:

TIP: we need to use a doubly nested list here because the predict method expects as input an integer matrix of size (batch, input_length), as explained in the documentation; here we have a batch of 1 point with a sequence of 1 word.

In [30]: model.predict([[cat_index]])

Out[30]: array([[[ 0.45281 , -0.50108 , -0.53714 , -0.015697, 0.22191 ,


0.54602 , -0.67301 , -0.6891 , 0.63493 , -0.19726 ,
0.33685 , 0.7735 , 0.90094 , 0.38488 , 0.38367 ,
0.2657 , -0.08057 , 0.61089 , -1.2894 , -0.22313 ,
-0.61578 , 0.21697 , 0.35614 , 0.44499 , 0.60885 ,
-1.1633 , -1.1579 , 0.36118 , 0.10466 , -0.78325 ,
1.4352 , 0.18629 , -0.26112 , 0.83275 , -0.23123 ,
0.32481 , 0.14485 , -0.44552 , 0.33497 , -0.95946 ,
-0.097479, 0.48138 , -0.43352 , 0.69455 , 0.91043 ,
-0.28173 , 0.41637 , -1.2609 , 0.71278 , 0.23782 ]]],
dtype=float32)

As you can see the method returns exactly the same values, so we have successfully initialized a Keras model
with a pre-trained embedding.

Gensim
Gensim is a topic modelling library in Python that contains a lot of functions related to extracting meaning
and manipulating text. Let’s import it and have some fun with word embeddings:

In [31]: import gensim

In order to load Glove embeddings using Gensim we need to convert them into the appropriate format.
Luckily for us Gensim has a function for that. We just need to import the glove2word2vec script:

In [32]: from gensim.scripts.glove2word2vec import glove2word2vec

and then run it. We first set input and output paths:

In [33]: glove_path = '../data/embeddings/glove.6B.50d.txt'


glove_w2v_path = '../data/embeddings/glove.6B.50d.txt.vec'

In [34]: glove2word2vec(glove_path, glove_w2v_path)

Out[34]: (400000, 50)

Next we use the gensim.models.KeyedVectors.load_word2vec_format function to load the word vectors from the file:

In [35]: from gensim.models import KeyedVectors



In [36]: glove_model = KeyedVectors.load_word2vec_format(


glove_w2v_path, binary=False)

Now that we have loaded the vectors into a Gensim model, we have access to a lot of functionality. For
example, we can quickly find what are the most similar words to a given word.

Here’s how we look for the five closest words to the word good:

In [37]: glove_model.most_similar(positive=['good'], topn=5)

Out[37]: [('better', 0.9284390807151794),


('really', 0.9220625162124634),
('always', 0.9165270328521729),
('sure', 0.9033513069152832),
('something', 0.9014205932617188)]

The .most_similar method allows for both a list of positive and negative words. Feel free to play with
the list of words to get a feel for how they affect the output. The closest words to good are words that can
appear in the same context as good, so it’s quite obvious that we should get similar adjectives like better or
adverbs like really and always.

If we try with the number two we should get other numbers:

In [38]: glove_model.most_similar(positive=['two'], topn=5)

Out[38]: [('three', 0.9885902404785156),


('four', 0.9817472696304321),
('five', 0.9644663333892822),
('six', 0.964131236076355),
('seven', 0.9512959718704224)]

Word Analogies

Since word vectors are vectors, we can do any vector operation with them, including addition, subtraction
and dot products. For example we can perform operations between words like:

result = king - man + woman

where the vector result is a perfectly valid vector in the embedding space. Using the .most_similar
method, we can look for the three vectors closest to result. Can you guess which vector will be the closest?

If you guessed queen, which is the feminine counterpart of king, you guessed right. Let’s see it in action:

In [39]: glove_model.most_similar(positive=['king', 'woman'],


negative=['man'], topn=3)

Out[39]: [('queen', 0.8523603677749634),


('throne', 0.7664333581924438),
('prince', 0.7592144012451172)]

i.e. we have found that

queen ~ king - man + woman

Another way to look at this is to say that the vector queen - king is similar to the vector woman - man.
This is often described with this famous picture:

Semantic relations in embedding space

In this figure, we imagine an embedding space that has only two axes (instead of 50 or 300), and we represent
the words as points in the embedded space. The arrows represent the vector distances between the words.
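We can check this relation directly on the GloVe vectors we loaded earlier. The following sketch (ours, for illustration only) computes the cosine similarity between the two difference vectors with numpy; if the analogy holds, the result should be clearly positive:

diff_royal = embeddings['queen'] - embeddings['king']
diff_gender = embeddings['woman'] - embeddings['man']

similarity = np.dot(diff_royal, diff_gender) / (
    np.linalg.norm(diff_royal) * np.linalg.norm(diff_gender))
print(similarity)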

Since this chart can be useful to understand how the model represents the semantic space, it's legitimate to ask if we can visualize all of the GloVe words in a similar chart using a dimensionality reduction technique. The answer is yes, and we can leverage TensorBoard for this. We'll see how in the next section.

Visualization
Tensorboard also contains a projector that allows us to explore word embeddings visually. Let’s save our
word embeddings and let’s visualize them in Tensorboard. First we need to create an output folder. We’ll use
the /tmp/ztdl_models/embeddings/ folder for output. We will need the os module:

In [40]: import os

Then let’s define a path variable that we’ll use later too:

In [41]: model_dir = '/tmp/ztdl_models/embeddings/'

Let’s also load the rmtree function from shutil so that we can delete the directory if it already exists:

In [42]: from shutil import rmtree

In [43]: rmtree(model_dir, ignore_errors=True)

Finally let’s create the folder:

In [44]: os.makedirs(model_dir)

For the purposes of this visualization we will limit our Embedding layer to the 4000 most frequent words in the GloVe set. Let's set a variable called n_viz to 4000 (you can change this number if you wish):

In [45]: n_viz = 4000

Let’s create a new embedding layer, with only 4000 x 50 weights. Notice that we still pass the
mask_zero=False parameter since our first vector, corresponding to the index 0, is the word the:

In [46]: emb_layer_viz = Embedding(n_viz,


embedding_size,
mask_zero=False,
trainable=False)

Let’s stick this layer into a Sequential model so that the weights get initialized:

In [47]: model = Sequential([emb_layer_viz])

and let’s set the weights to Glove:

In [48]: model.set_weights([embedding_weights[:n_viz]])

Now let’s visualize these embeddings. The most recent documentation on how to do this suggests to create a
separate tsv file for the weights and the metadata. We will choose a different route, saving the model as a
Checkpoint and attaching the metadata to it with a small config file.

Note that in Tensorflow you can save a model as a Checkpoint or as a SavedModel. We'll use checkpoints here, and saved models in the next chapter for serving.

We need to accomplish three things, which are independent:

• save a model checkpoint
• save a file with the words (metadata)
• save a configuration file that binds the metadata to the embedding tensor

Let’s start by saving the model:

In [49]: checkpoint = tf.train.Checkpoint(model=model)


checkpoint.save(os.path.join(model_dir, 'model.ckpt'))

Out[49]: '/tmp/ztdl_models/embeddings/model.ckpt-1'

This operation creates a few files in the model_dir folder, as you can see with the os.listdir command:

In [50]: os.listdir(model_dir)

Out[50]: ['model.ckpt-1.index', 'model.ckpt-1.data-00000-of-00001', 'checkpoint']

These files contain the weights, but have no information about which word corresponds to each vector.
What we need is a metadata.tsv file with the list of words. We can easily create it by looping over the
indexes from 0 to n_viz adding one word per line to the file:

In [51]: fname = os.path.join(model_dir, 'metadata.tsv')

         with open(fname, 'w', encoding="utf-8") as fout:
             for index in range(0, n_viz):
                 word = word_inverted_index[index]
                 fout.write(word + '\n')

You can check the content of this file and see that it contains one word per line like:

the
,
.
of
to
and
...

This file can be associated with the model using a small configuration file.

In [52]: config_string = """


embeddings {
tensor_name: "model/layer_with_weights-0/embeddings/.ATTRIBUTES/VARIABLE_VALUE"
metadata_path: "metadata.tsv"
}
"""

In [53]: fname = os.path.join(model_dir, 'projector_config.pbtxt')

         with open(fname, 'w', encoding="utf-8") as fout:
             fout.write(config_string)

TIP: As noted above, we could also save the weights and the metadata as tsv and then load
them in the Tensorboard projector. This is explained in the Tensorflow 2.0 guide.

We can now start Tensorboard:

tensorboard --logdir=/tmp/ztdl_models/embeddings/

and point our browser to:

http://localhost:6006/#projector

We should see the word embedding projector spinning. Using the Search tab on the right, let’s look for a
specific word, for example, the word network and see what the closest words are. You should see something
like this:

Tensorflow projector

You can also uncomment the next two cells to display Tensorboard in the notebook.

In [54]: # %load_ext tensorboard.notebook

In [55]: # %tensorboard --logdir /tmp/ztdl_models/embeddings/

Other pre-trained embeddings


We have learned how to use and visualize pre-trained embeddings using GloVe. GloVe is a commonly used set of pre-trained embeddings, but it's not the only one. We will briefly introduce two other popular sets of pre-trained embeddings: Word2Vec and FastText.

Word2Vec

Word2Vec is a set of word vectors introduced by Google in 2013. These vectors also try to encapsulate the
meaning of a word by looking at its context, i.e., the words that precede it and follow it.

You can find a detailed tutorial on how to build Word2Vec vectors in the Tensorflow Tutorials page. The two
main useful ideas are the Skip-gram Model and the noise-contrastive estimation (NCE) loss. Let’s take a look
at these in a bit more detail.

Skip-grams

Skip-grams are simply pairs of words that appear in the same context. Let's consider the first two lines of the song "Imagine":

imagine there's no heaven it's easy if you try

and let’s focus on the word heaven. If we choose a context of -2, +2 words, we see that the following words
appear in the context of heaven: - there’s - no - it’s - easy

We could, therefore, try to build a model that takes a word as input and tries to predict the probability that another word in the dictionary appears in its context, using input/output pairs like:

INPUT OUTPUT
heaven there’s
heaven no
heaven it’s
heaven easy
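As an illustration of how such pairs can be generated, here is a minimal sketch (the function make_skipgram_pairs is ours, not part of any library) that slides a ±2-word window over the sentence:

def make_skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # look at the words up to `window` positions before and after the center
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "imagine there's no heaven it's easy if you try".split()
make_skipgram_pairs(sentence)[:8]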

We could use a Softmax over the whole dictionary and eventually learn these probabilities. However, this
would require a massive amount of data, since the dictionary size is enormous.

Hence, Word2Vec is trained using a trick called Negative Sampling.

Negative Sampling and the noise-contrastive estimation (NCE) loss

Imagine solving a slightly different problem where instead of having a word as input and a word as output,
we have a pair as input and a binary label as output. We can use the pairs above as positive examples, since
they are actual pairs found in the training text, and we can build fake pairs that have the same first word and
a random second word. Our data will look like this:

INPUT OUTPUT
(heaven, there’s) 1
(heaven, no) 1
(heaven, it’s) 1
(heaven, easy) 1
(heaven, cat) 0
(heaven, brain) 0
(heaven, swimming) 0
(heaven, chair) 0
... ...

We have constructed negative pairs by randomly choosing words from the dictionary. This model learns to predict the probability that, given a first word, the second word is in its context or not, which is essentially the same problem as before. However, this model is much faster to train, because at each batch we only consider a small set of negative examples instead of the whole dictionary.
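To make the construction of the training data concrete, here is an illustrative sketch (not the actual Word2Vec implementation) that labels the real skip-gram pairs with 1 and adds a couple of randomly sampled negative pairs labeled 0 for each of them:

import random

vocabulary = ['heaven', "there's", 'no', "it's", 'easy',
              'cat', 'brain', 'swimming', 'chair', 'table']
positive_pairs = [('heaven', "there's"), ('heaven', 'no'),
                  ('heaven', "it's"), ('heaven', 'easy')]

data = []
for first, second in positive_pairs:
    data.append(((first, second), 1))     # real pair -> label 1
    for _ in range(2):                    # 2 negative samples per positive pair
        fake = random.choice(vocabulary)  # a real implementation would avoid true context words
        data.append(((first, fake), 0))   # random pair -> label 0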

These two tricks allow training Word2Vec quite easily. In our case, we will not even train it but rely on a
pre-trained version of these embeddings that’s been trained using data from Google News.

TIP: if you want to learn more about Word2Vec we encourage you to read the Wikipedia
page and to go through the tutorial mentioned earlier.

FastText

FastText is a library for efficient learning of word representations and sentence classification developed by
Facebook Research. It is open-source, free and lightweight and it allows users to learn text representations
and text classifiers. Here is the Github repository and here you can read the blog post announcing its
publication.

FastText has two exciting aspects.

1) fastText word vectors are built from vectors of the character substrings contained in the word (see the sketch after this list). This allows building vectors even for misspelled words or concatenations of words.
2) fastText has been designed to work on a variety of languages by taking advantage of their morphological structure. So, pre-trained vectors are available for many languages besides English.
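To get a feel for point 1, here is a small sketch of the character n-grams that could be extracted from a word. We assume an n-gram range of 3 to 6 characters and the < and > boundary markers; the exact settings are configurable in fastText, and this code is only illustrative:

def char_ngrams(word, n_min=3, n_max=6):
    # wrap the word in boundary markers before extracting n-grams
    wrapped = '<' + word + '>'
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

char_ngrams('where')[:10]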

You can download the pre-trained English vectors here, and the pre-trained vectors for many other languages here.

In the exercises we will use these vectors and compare their results with GloVe and Word2Vec.

Exercises

Exercise 1

Compare the representations of Word2Vec, GloVe and FastText. In the data/embeddings folder we provided you with two additional scripts to download FastText and Word2Vec. Go ahead and download each of them into the data/embeddings folder. Then load each of the 3 embeddings in a separate Gensim model and complete the following steps:

1. define a list of words containing the following words: ‘good’, ‘bad’, ‘fast’, ‘tensor’, ‘teacher’, ‘student’.

• create a function called get_top_5(words, model) that retrieves the top 5 most similar words to
the list of words and compare what the 3 different embeddings give you

• apply the same function to each word in the list separately and compare the lists of the 3 embeddings.

• explore the following word analogies:


man:king = woman:?          ==> expected queen
france:paris = germany:?    ==> expected berlin
teacher:teach = student:?   ==> expected learn
cat:kitten = dog:?          ==> expected puppy
english:friday = italiano:? ==> expected venerdì

Can word analogies be used for translation?

Note that loading the vector may take several minutes depending on your computer.

In [ ]:

Exercise 2

The Reuters Newswire topic classification dataset is a dataset of 11,228 newswires from Reuters, labeled over
46 topics. This dataset is provided in the keras.datasets module and it’s easy to use.

Let’s compare the performance of a model using pre-trained embeddings with a model using random
embeddings on the topic classification task.

• Load the data from keras.datasets.reuters


• Retrieve the word index and create the reverse_word_idx as done for IMDB in Chapter 8.
• Augment the reverse word index with pad_char, start_char and oov_char at indices 0, 1, 2
respectively.
• Check the maximum length of a newswire and use the pad_sequences function to pad everything to 100 words.
• Create and train two models, one using pre-trained embeddings and the other using a randomly
initialized embedding
• Compare their performance on this dataset using a recurrent model. In particular, check which of the
two models shows the worst overfitting.

In [ ]:
13. Serving Deep Learning Models
In this chapter, we will learn how to serve a trained model. The goal of training a Machine Learning model is to use it to generate predictions. Deployment of Machine Learning models is a vast topic; we could write a whole book just about it. In this chapter, we will present two ways of deploying a model and outline a set of things we need to consider for deployment. In the end, how we decide to deploy a model depends on the requirements the application has to satisfy, and this will change case by case.

In this chapter, we want to achieve a few goals. We want to explain at a high level how to think about the deployment process, highlighting the issues involved and outlining the possible choices. This chapter will help you understand the decisions you'll need to make when deploying a model, as well as equip you with a list of resources that you can tap into, according to your needs.

Then we will show you two ways of deploying a model: a simple Flask application using a Python server and
a more general deployment using Tensorflow Serving. These are not the only two ways to deploy a model,
and we’ll make sure to point you to additional resources, companies, and products that simplify the
management of the model deployment cycle.

So, let’s start with the model development/deployment cycle.

The model development cycle


The concepts explained in this part are general to Machine Learning, not just Deep Learning, and they are
independent of the framework used. At a high level, the model development cycle includes these seven steps:

1. Data Collection
2. Data Processing
3. Model Development


4. Model Evaluation
5. Model Exporting
6. Model Deployment
7. Model Monitoring

These steps are part of a continuous deployment cycle: we never stop improving our models and learning from new data. After deploying our first model, a second, a third, and more will come. For each new model, we compare its performance with the performance of the current one. Traffic is gradually shifted towards the new model, as happens with any other release in a continuous integration setup.

The seven steps are not mutually exclusive. They happen in parallel. In other words, while you are working
on developing (3) and evaluating (4) the current version of a model, you are already monitoring the
previous version of the model (7) and collecting (1) and processing (2) additional data and labels.

Model deployment cycle

Let’s now look at each step in greater detail.

Data Collection

Throughout this book, we used datasets based on files. These were either tabular files (CSV, Excel) or folders
containing images or documents. In the real world, there is usually a process involving data collection,
where data is stored in a database or a distributed file system for later use.

Examples of this process are:

• a database with the actions of your users in your website or app



• a data-lake with millions of documents that we would like to classify


• a cloud object store with images that we would like to recognize

Depending on the type of data, on its frequency and its size, you will design different collection and storage
systems.

Let’s consider a few examples.

As a first case let’s consider a bank that would like to train a model to decide which people are credit-worthy.
The information used as input for the model are things like:

• user information
• account activity
• past loans
• history of credit

This information will typically be stored in several tables in a database. We will be able to create a dataset to
train our model by simply joining data from a few tables of the database. Furthermore, we can probably
work with a sample of the whole data as a starting point, especially if our first model is not specific to each
user. This “snapshot” of the world, a dataset we extracted at some point in time, is going to be valid for at
least a few days, if not a few weeks or even months. That is to say, the general lending behavior of a population will evolve with time, but it will do so quite slowly, not from one day to the next. Another way to put this is that the statistical distribution of our users is stationary, or quasi-stationary, i.e., independent of time.

These facts allow us to train a model on a file, maybe a large file, but a fixed snapshot, as we did throughout
this book, and then use that model in production. Once we have trained and evaluated our model, we will
deploy it to our branches and the managers will have an “AI helper” to decide when to issue a loan or not.

Let’s consider a very different case now. Let’s say we want to build a system that will decide which
advertisement to show based on the actions of a user in our web application. In this case, the input will be a
series of events in the app. Both the app and the ads inventory will change much more frequently in time,
due to new feature releases, new clients and so on. In this case, we will re-train our model much more
regularly, possibly every day, using the most recent data.

Besides the frequency of re-training, other things to consider are:

• the kind of data, are these files, documents, images, text, numbers in a table?
• the amount of data, how many new data points do we collect per day? 100? 1000? 1 million? 1 billion?
The data collection and storage process will change dramatically based on that.

Modern Machine Learning products usually work as a continuous pipeline, where a model is continuously
learning from new data, and sometimes a snapshot of the most current model is saved and used in
production for inference.

Labels

As we know very well by now, to train a model with supervision, we need labels. Here too, there can be
many different scenarios.

In some cases, we may not have those labels at all. For example, let’s say we are training an algorithm to
recognize offensive pictures in our user-generated content. We will need to collect a sample of images and
have human supervisors manually label the offensive photos with labels such as “violence”, “nudity”, “toxic”
and so on. This labeling process will be slow and costly, but it will be necessary before we can proceed with
any training.

Additionally, if we randomly sample our images, there will likely be very few offensive images, which would
make labeling very slow because our human supervisors would receive mostly normal pictures. This is why
most websites implement a button for users to report offensive content. This will effectively triage the
photos bringing the offensive ones to the surface so that human supervisors can review them and generate
labels accordingly.

On the opposite end, if we are training a model for advertising products, i.e., to predict the likelihood that a
user will click on a particular product, the labels, i.e., the past clicks, are automatically recorded by our
system.

In general, the win-win strategy for label generation is when the product manager can design a product in
such a way that the users or the process automatically generate labels. Examples of this win-win approach
are:

• tagging your friends on Facebook => labels for face recognition algorithm
• recording purchase actions on Amazon => labels for the recommendation of other products
• “flag this post” button on Craigslist => labels for fraud/spam detection algorithm
• captchas that ask you to recognize street signs => labels for image recognition algorithm

What is the process for label generation in your case?

Data Processing

This process is usually referred to as ETL in an enterprise setting. It involves going from the raw data in your
data store to features ready for consumption by the Machine Learning model.

At this stage, you will focus on operations like data cleaning, data imputation, feature extraction, and feature
engineering. Once again, this will depend on your specific situation, but it is essential to keep in mind that
when NULL values are present, i.e., when some data is missing, we need to stop and ask ourselves why it is
missing. Discussion of the different cases of missing data is beyond the scope of this book, but we invite you
to read this Wikipedia article on the topic, to be cognizant of the issues involved when dealing with it.

Other data processing steps may involve generating features, augmenting the data, one-hot encoding, or using pre-trained models for feature extraction.
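As an illustration of these operations, here is a hedged Pandas sketch on a small made-up DataFrame (the column names are invented for the example):

import numpy as np
import pandas as pd

raw = pd.DataFrame({'age': [34, np.nan, 51],
                    'income': [40000, 52000, np.nan],
                    'segment': ['a', 'b', 'a']})

# simple imputation: fill missing numeric values with the column median
clean = raw.fillna(raw.median(numeric_only=True))

# one-hot encode the categorical column
features = pd.get_dummies(clean, columns=['segment'])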

Model Development

Model development and data processing are strongly interconnected steps. This is where we focus our attention on deciding which model to apply.

• Will you attempt with a classical Machine Learning model first?


• Will you go straight to Neural Networks and Deep Learning?
• What will your model architecture look like?
• Which hyper-parameters will you start with?
• Are you going to train the model with gradient descent?
• If so, which optimizer will you use?

Again, the choices will depend on the particular situation. The general approach to this phase is to keep
your feedback loop as rapid as possible. It’s not a mystery that a quick feedback loop is an excellent
strategy in software development (e.g., Agile development). In Machine Learning this is just as true, so if
you are considering two options to improve your model and one takes one hour to test, and the other one
week, you should choose the former over the latter.

Let’s look at one example. Let’s say that we have some indication our model will improve with more data.
Let’s also suppose that we are not sure that we have chosen the right architecture for the model. In some
cases, getting more data could be as simple as running a new SQL query to extract a few more million data
points from our database, in some other cases it could be much more complicated, involving manual label
generation with a team of supervisors, which would likely require days if not weeks of delay.

On the other hand, if re-training the model takes minutes or even just a few hours we could spin up a new
copy of the model with a different architecture and train it quickly. If training the model takes one week
that’s not an easy option.

We will have to take all these factors into account when developing the model and choosing where to start
first.

Model Evaluation

Once we have decided what model architecture we are going to use, we need to train the model on the data.
The majority of this book focuses on this process, so you should be pretty familiar with terms like train/test splitting, cross-validation, and hyper-parameter tuning. It is important that at this stage you know what
baseline you measure against and what metric you are going to use. If this is your first model attempt, you
are probably comparing the performance with a dummy model (i.e., one that always predicts the average
label or the majority class). On the other hand, if you have previously deployed other models, you will
compare the model performance with that of the previous model.

You will have to consider what overall goal you are trying to achieve. In the case of a binary classification
problem, you will consider metrics like precision and recall to evaluate if your model has lots of false
positives or false negatives. The choice you make will depend on your business goal and your data. For
example, if you are deploying a model that predicts patient sickness you will try to avoid false negatives
because you wouldn’t want to leave any screened patient with a false impression that they are healthy when
they are not. On the other hand, if you are developing a system for flagging spam, you will focus more
on avoiding false positives, which would route legitimate emails to the spam folder, creating a bad user
experience.

In summary, you will need to decide:

• the metric you are trying to optimize


• the baseline for that metric
• what a significant improvement over the baseline is
• the kind of errors you would like to minimize

These considerations will guide you during the training process.

If you plan to do hyper-parameter training, you must split your total dataset into three parts:

• training
• validation
• test

The training data will be used to train the model. The validation data will serve as “test” data for
hyper-parameter tuning. I.e., for each new combination of hyper-parameters you will train the model on the
training portion and validate the model on the validation portion. This split is like having two nested
training loops. The inner training loop will choose the weights and biases of your network. The outer
training loop will choose the hyper-parameters like learning rate, batch size, and the number of layers.

None of these models will see the test set until the end. Once you have chosen the best hyper-parameters
and you have trained the best model on that data, only then, you will test your trained model on the test set
to get a sense of how well your model is going to perform with out-of-sample data, i.e., an indication of how
well your model is going to do when deployed.
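A minimal sketch of such a three-way split, using two calls to Scikit-Learn's train_test_split and assuming X and y hold the features and labels (the 60/20/20 proportions are just an example):

from sklearn.model_selection import train_test_split

# first carve out the test set (20% of the data)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# then split the remainder into training and validation (25% of X_tmp, i.e. 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)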

Model Exporting

After training and evaluating the model, it is time to prepare it for serving predictions. What happens at this
stage will depend on many requirements including desired latency, footprint, the device you are planning to
use for serving and many more.

At one end of the spectrum, this step is as simple as saving the trained model to disk as is. The trained model is composed of two parts: the model architecture and the trained weights. This method is perfect when we use Keras to build our model and plan to serve it as part of a Python/Flask application. It is not optimized at all, but if all we care about is building a proof of concept and we don't need to support high traffic, then this is fast to execute. This is the first method we will explore in this chapter.

On the other end of the spectrum are large scale deployments. If we are planning to use our model in a high
availability production environment, we need to optimize it for serving predictions within the constraints
required by our application.

For example, if we plan to use a model to make real-time decisions on serving ads or recommending
products, we will have very stringent latency specifications, usually a few tens of milliseconds at most. This
requirement will influence the choices we make when designing the model as well as when saving it.

The topic of model optimization is vast, and it requires tools that go beyond the scope of this book. We will,
therefore, limit ourselves to pointing out what kind of optimizations are possible and where to look for
information about them.

Model optimization techniques may involve:

• stripping away all operations not needed for inference. The Tensorflow graph underlying our Keras model still contains all the training-related ops, including the gradient calculations and the optimizer. None of these are relevant at inference time, and we should strip them away from the graph. Tensorflow has a Graph Transform Tool that includes many options to check out. Common cases covered by the tool are:

– Optimizing for Deployment


– Fixing Missing Kernel Errors on Mobile
– Shrinking File Size
– Eight-bit Calculations

• low-level compilation of tensorflow operations using the Accelerated Linear Algebra compiler (XLA).
This compiler optimizes operations for the specific platform used for deployment, and it can help in
the following areas:

– Improve execution speed


– Improve memory usage
– Reduce reliance on custom Ops
– Reduce mobile footprint
– Improve portability

In addition to model optimization, we can improve inference performance by choosing the hardware platform that is best suited to our model. Currently, Tensorflow supports CPU, GPU, and TPU training and inference. In the coming years, we'll see a flourishing of hardware platforms dedicated to Deep Learning model training and serving, which will bring additional options to the table.

In this chapter, we’ll see how to save a Keras model in Tensorflow format, so that all the above tools can be
applied.

Model Deployment

Model deployment refers to how we are going to make our model available to the rest of the world. In several cases, this is a Python/Flask application that loads the model into memory and then runs model.predict when requested. This is the first method we will explore in this chapter. It's a great way to deploy a proof of concept in situations where we do not need high throughput.

The natural extension of this method is to containerize the Flask app with Docker so that we can replicate
the model multiple times and adjust our model to the load requested by our application. While this works, it
is not the recommended solution when scaling out operations. Tensorflow offers a server which is the
preferred way to deploy models at scale. Tensorflow 2.0 includes Tensorflow Extended (TFX) which covers
all the steps in the deployment cycle.

That’s why in this chapter we will also go through a minimal deployment with Tensorflow Serving.
Tensorflow Serving is a powerful package developed with large deployments in mind. We will introduce it
and guide you to more resources if you need to scale out operations with your models.

More generally, we deploy a new model in parallel to an existing model and validate its performance with live traffic in a classic A/B test scenario. Here traffic is only partially routed to the new model, and its performance is monitored for some time before completely adopting it and phasing out the old one.

This strategy is why deployment must also include monitoring of the model performance.

Model Monitoring

Last but not least, when we deploy a model we want to monitor its performance. It is important to sample
the predictions of the model and send a few of them to human supervision to verify their quality. In other
words, label collection never ends. We need to keep measuring the performance of our model against a
known set of labels.

In some cases, this process is automated, for example, the case where our model predicts future values of a
time series, e.g., the price of a stock for trading purposes. In this case, as soon as we get the next number in
the time series, we can immediately compare it with the prediction from our model and monitor its quality
in real-time.

In other cases, where human supervisors generate labels, we need to keep sending data to a QA team that
will label them. We can then compare the predictions with the labels and decide how to improve the model
on the cases where it failed.

This process never ends; we can always come up with better models. However, we should not be
discouraged by this. As British mathematician George E. P. Box said: “All models are wrong; some models
are useful”. You can reap enormous benefits from a model that is not perfect.

Let’s deploy our first model. We will build an API that can predict the location of a user based on the
strength of WiFi signals detected.

Deploy a model to predict indoor location


As an example of deployment we will develop an app that can determine the indoor location of a user based on the WiFi signal strength observed on a smartphone. The data comes from the UCI Machine Learning repository and it is made available to you in data/wifi_location.csv. We will quickly load and train a model and then focus on exporting this model to build an API that can predict the location of a user based on the WiFi signal strength.

Data exploration

Let’s start by loading the usual packages:

In [1]: with open('common.py') as fin:
            exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:
            exec(fin.read())

And let’s load the data with Pandas:

In [3]: df = pd.read_csv('../data/wifi_location.csv')

Let’s quickly inspect the data to get a sense for what we have:

In [4]: df.head()

Out[4]:

a_1 a_2 a_3 a_4 a_5 a_6 a_7 location


0 -64 -56 -61 -66 -71 -82 -81 0
1 -68 -57 -61 -65 -71 -85 -85 0
2 -63 -60 -60 -67 -76 -85 -84 0
3 -61 -60 -68 -62 -77 -90 -80 0
4 -63 -65 -60 -63 -77 -81 -87 0

It looks like we have 7 features, presumably the strengths of the wifi signals coming from 7 different access
points. There’s also a column called location that will be our label. Let’s see how many locations there are
in the dataset:

In [5]: df['location'].value_counts()

Out[5]:

location
3 500
2 500
1 500
0 500

Great! The dataset is balanced and it has 500 examples of each location! Since we have only 2000 points
total, we can plot the features and take a look at them:

In [6]: df.plot(figsize=(12, 8))


plt.axvline(500)
plt.axvline(1000)
plt.axvline(1500)
plt.title('Indoor location dataset')
plt.xlabel('Sample number')
plt.ylabel('Wifi strength (dB)');

Indoor location dataset: WiFi strength (dB) of the 7 access points (a_1 to a_7) vs. sample number, with the location label overlaid

From the plot we can clearly see that the wifi signal strengths are different in the 4 locations, and therefore we can hope to be able to predict the location of a person based on these features. To further emphasize this point, let's do a pairplot using Seaborn and color the data by location:

In [7]: import seaborn as sns

In [8]: sns.pairplot(df, hue='location');


Pairplot of the features a_1 through a_7, colored by location

Model definition and training

It is very clear that the 4 locations are quite well defined and therefore we can hope to train a good model.
Let’s do that! First let’s define our usual X and y arrays of features and labels:

In [9]: X = df.drop('location', axis=1).values


y = df['location'].values

Then we’ll split our data into training and test, using a 25 test split.

In [10]: from sklearn.model_selection import train_test_split



In [11]: X_train, X_test, y_train, y_test = \


train_test_split(X, y, test_size=0.25,
random_state=0)

Now let’s build a fully connected model using the Functional API in Keras. Let’s import the Model class as
well as a few layers:

In [12]: from tensorflow.keras.models import Model


from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.layers import BatchNormalization

In particular, notice that we'll use the BatchNormalization layer right after the input, since our features take large negative values, which may slow down the convergence of our model. Let's build a fully connected model with the following architecture:

• Input
• Batch Normalization
• Fully connected inner layer with 50 nodes and a ReLU activation
• Fully connected inner layer with 30 nodes and a ReLU activation
• Fully connected inner layer with 10 nodes and a ReLU activation
• Output layer with 4 nodes and a Softmax activation

The functional API makes it very easy to build this model:

In [13]: inputs = Input(shape=X_train.shape[1:])


x = BatchNormalization()(inputs)
x = Dense(50, activation='relu')(x)
x = Dense(30, activation='relu')(x)
x = Dense(10, activation='relu')(x)
predictions = Dense(4, activation='softmax')(x)

model = Model(inputs=inputs, outputs=predictions)

Let’s display a model summary and make sure that we have built exactly what we wanted:

In [14]: model.summary()

Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================

input_1 (InputLayer) [(None, 7)] 0


_________________________________________________________________
batch_normalization_v2 (Batc (None, 7) 28
_________________________________________________________________
dense (Dense) (None, 50) 400
_________________________________________________________________
dense_1 (Dense) (None, 30) 1530
_________________________________________________________________
dense_2 (Dense) (None, 10) 310
_________________________________________________________________
dense_3 (Dense) (None, 4) 44
=================================================================
Total params: 2,312
Trainable params: 2,298
Non-trainable params: 14
_________________________________________________________________

Great! Now we can compile the model. Notice that since our labels are not one-hot encoded we should use the sparse_categorical_crossentropy loss instead of the usual categorical_crossentropy:

In [15]: model.compile('adam',
'sparse_categorical_crossentropy',
metrics=['accuracy'])

We are now ready to train the model. Let’s train it for 40 epochs, using the test data to validate the
performance:

In [16]: h = model.fit(X_train, y_train,


batch_size=128,
epochs=40,
verbose=0,
validation_data=(X_test, y_test))

As we have done several times in the book, we can display the history of training leveraging Pandas plotting
capabilities:

In [17]: pd.DataFrame(h.history).plot()
plt.ylim(0, 1);
Training history: loss, accuracy, val_loss and val_accuracy over the 40 epochs

The training graph looks very good. The model has converged to almost perfect accuracy and there is no
sign of overfitting. This is great! We are ready to export the model for deployment.

Export the model with Keras

Tensorflow offers several ways to export a model. The simplest way is to save the model architecture and the weights as separate files, using the model.to_json and model.save_weights methods. This saves the model in a framework-agnostic format where the model structure is specified as a json file and the weights are saved as an array. We can import a model saved in this way into other frameworks that are not necessarily Keras. You can read more about the various ways of saving a model here. Let's start by importing the os, json and shutil packages:
shutil packages:

In [18]: import os # Miscellaneous operating system interfaces


import json # JSON encoder and decoder
import shutil # High-level file operations

Next we define the output path to save our model. This path will be composed of three parts:

• a base path, in this case it’s going to be /tmp/ztdl_models/wifi/


• a subpath, referring to the type of deployment system we’d like to use, here: /flask
• a version number, starting from 1. We use this in case we’d like to deploy a new version of the model
later on.

In [19]: base_path = '/tmp/ztdl_models/wifi'


sub_path = 'flask'
version = 1

Let’s combine these in a single path using join:

In [20]: from os.path import join

In [21]: export_path = join(base_path, sub_path, str(version))


export_path

Out[21]: '/tmp/ztdl_models/wifi/flask/1'

Next we create the export path. We delete it first and then re-create it as an empty path:

In [22]: shutil.rmtree(export_path, ignore_errors=True) # delete path, if exists


os.makedirs(export_path) # create path

Now we are ready to save the model. Let’s have a look at the json description of the model:

In [23]: json.loads(model.to_json())

Out[23]: {'class_name': 'Model',


'config': {'name': 'model',
'layers': [{'name': 'input_1',
'class_name': 'InputLayer',
'config': {'batch_input_shape': [None, 7],
'dtype': 'float32',
'sparse': False,
'name': 'input_1'},
'inbound_nodes': []},
{'name': 'batch_normalization_v2',
'class_name': 'BatchNormalizationV2',
'config': {'name': 'batch_normalization_v2',
'trainable': True,
'dtype': 'float32',
'axis': [1],
'momentum': 0.99,
'epsilon': 0.001,
'center': True,

'scale': True,
'beta_initializer': {'class_name': 'Zeros', 'config': {}},
'gamma_initializer': {'class_name': 'Ones', 'config': {}},
'moving_mean_initializer': {'class_name': 'Zeros', 'config': {}},
'moving_variance_initializer': {'class_name': 'Ones', 'config': {}},
'beta_regularizer': None,
'gamma_regularizer': None,
'beta_constraint': None,
'gamma_constraint': None},
'inbound_nodes': [['input_1', 0, 0, {}]]},
{'name': 'dense',
'class_name': 'Dense',
'config': {'name': 'dense',
'trainable': True,
'dtype': 'float32',
'units': 50,
'activation': 'relu',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None},
'inbound_nodes': [['batch_normalization_v2', 0, 0, {}]]},
{'name': 'dense_1',
'class_name': 'Dense',
'config': {'name': 'dense_1',
'trainable': True,
'dtype': 'float32',
'units': 30,
'activation': 'relu',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None},
'inbound_nodes': [['dense', 0, 0, {}]]},
{'name': 'dense_2',
'class_name': 'Dense',
'config': {'name': 'dense_2',
'trainable': True,

'dtype': 'float32',
'units': 10,
'activation': 'relu',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None},
'inbound_nodes': [['dense_1', 0, 0, {}]]},
{'name': 'dense_3',
'class_name': 'Dense',
'config': {'name': 'dense_3',
'trainable': True,
'dtype': 'float32',
'units': 4,
'activation': 'softmax',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None},
'inbound_nodes': [['dense_2', 0, 0, {}]]}],
'input_layers': ['input_1', 0, 0],
'output_layers': ['dense_3', 0, 0]},
'keras_version': '2.2.4-tf',
'backend': 'tensorflow'}

Nice! The whole model is specified in a few lines! To save it we'll open a model.json file and write the json version of the model to it:

In [24]: with open(join(export_path, 'model.json'), 'w') as fout:
             fout.write(model.to_json())

Next we save the weights. We do this with the .save_weights method of the model:

In [25]: model.save_weights(join(export_path, 'weights.h5'))



Let’s check the content of the export_path using the the os.listdir command:

In [26]: os.listdir(export_path, )

Out[26]: ['model.json', 'weights.h5']

As you can see there are 2 files, the json description of the model and the weights. Great! Let’s see how one
would re-load these into a new model. First we need to import the model_from_json function:

In [27]: from tensorflow.keras.models import model_from_json

Next we create a model by reading the json file:

In [28]: with open(join(export_path, 'model.json')) as fin:
             loaded_model = model_from_json(fin.read())

The loaded model has random weights, as we can verify by generating predictions on the test set and then
comparing them with the labels. Notice that since the model was defined using the functional API, there is
no .predict_classes method. Let’s use the .predict method to obtain the probabilities for each class:

In [29]: probas = loaded_model.predict(X_test)


probas

Out[29]: array([[1.0000000e+00, 1.6200459e-26, 1.7103024e-31, 6.8443832e-21],


[1.0000000e+00, 2.2648366e-27, 5.6504009e-32, 4.1134971e-20],
[1.0000000e+00, 5.8601245e-27, 9.9094268e-32, 7.0851257e-20],
...,
[1.0000000e+00, 9.0107229e-27, 4.4861510e-31, 1.0041991e-19],
[1.0000000e+00, 5.9906324e-27, 6.1687434e-32, 1.4923703e-19],
[1.0000000e+00, 1.7123073e-30, 2.4330698e-35, 5.2379228e-24]],
dtype=float32)

To retrieve the predicted classes we need to use the argmax function from Numpy:

In [30]: preds = np.argmax(probas, axis=1)


preds

Out[30]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Finally we can check the accuracy of these predictions using the accuracy_score function from Scikit-Learn:

In [31]: from sklearn.metrics import accuracy_score

In [32]: accuracy_score(y_test, preds)

Out[32]: 0.264

As expected, this model is not trained. Let’s load the weights now:

In [33]: loaded_model.load_weights(join(export_path, 'weights.h5'))

And let’s repeat the steps above:

In [34]: probas = loaded_model.predict(X_test) # class probabilities


preds = np.argmax(probas, axis=1) # class prediction
accuracy_score(y_test, preds) # accuracy score

Out[34]: 0.978

Great! The model is now using the trained weights, so we can use it for inference in deployment.

Notice that this model is not trainable. If you tried to run the command:

loaded_model.fit(X_train, y_train)

You would get a RuntimeError like this one:

RuntimeError: You must compile a model before training/testing.
Use `model.compile(optimizer, loss)`.

As the message explains, to train the model we need to compile it first, i.e., add to the graph all the
operations concerning gradient calculation, loss calculation, and optimizer. We don’t need any of this for
deployment, so let’s not compile the model.

A simple deployment with Flask

WARNING: The simple script we run here is not for production use. Please make sure to
read how to deploy a Flask app to production in the Flask documentation.

As the documentation says, Flask is a microframework for Python based on Werkzeug, Jinja 2, and good
intentions. Also, before you ask: It’s BSD licensed!

It’s a popular choice for simple websites, APIs, and in general web development. We’ll use it here to load our
model in a simple application that will launch from a script.

We will here go through the commands that compose the script and tell you how to run it from the shell.
Let’s get started.

TIP: if you have installed the most recent version of our ztdlbook environment file, you
should already have Flask installed. Otherwise, go back to Chapter 1 and check the
instructions on how to create or update the environment.

The script will first import the Flask and request classes. Flask is the main app, while request will be
used to collect the data received by the app.

In [35]: from flask import Flask


from flask import request

We also import tensorflow which we’ll need when loading the model:

In [36]: import tensorflow as tf

Then we define a global variable for the model and its export path.

In [37]: export_path = '/tmp/ztdl_models/wifi/flask/1/'


loaded_model = None

Next we create the flask app, which is also a global variable:

In [38]: app = Flask(__name__)

The next step is to define a load_model function that loads the model from the export_path like we did
before:

In [39]: def load_model():


"""
Load model and tensorflow graph
into global variables.
"""

# global variable
global loaded_model

# load model architecture from json


with open(join(export_path, 'model.json')) as fin:
loaded_model = model_from_json(fin.read())

# load weights
loaded_model.load_weights(join(export_path, 'weights.h5'))
print("Model loaded.")

The second function we define is a preprocess function that can be used to perform any normalization,
feature engineering or other preprocessing. In the current scenario we use this function to convert the data
from json to a Numpy array.

In [40]: def preprocess(data):


"""
Generic function for normalization
and feature engineering.
Convert data from json to numpy array.
"""
res = json.loads(data)
return np.array(res['data'])

Next we define a function called predict, which performs the following operations:

• take the data from request.data


• preprocess the data with the preprocess function
• use the loaded model to predict probabilities
• extract the predicted classes from probabilities using np.argmax
• return a json version of the predictions

Notice that we will “decorate” this function with the decorator:

@app.route('/', methods=["POST"])

This decorator tells Flask that the function should be called when a POST request is received at the / route.
For more information on how this is done in Flask, please make sure to check the extensive documentation.

In [41]: @app.route('/', methods=["POST"])


def predict():
"""
Generate predictions with the model
when receiving data as a POST request
"""
if request.method == "POST":
# get data from the request
data = request.data

# preprocess the data


processed = preprocess(data)

# run predictions
probas = loaded_model.predict(processed)

# obtain predicted classes from probabilities


preds = np.argmax(probas, axis=1)

# print in backend
print("Received data:", data)
print("Predicted labels:", preds)

return jsonify(preds.tolist())

Finally we complete the script with an if statement that runs the app in debug mode:

if __name__ == "__main__":
print("* Loading model and starting Flask server...")
load_model()
app.run(host='0.0.0.0', debug=True)

Please note that this is not the preferred mode to run a flask app. Please refer to the documentation for more
information.

Full script

Let’s take a look at the whole script using the cat shell command.

TIP: if this doesn’t work on your system, simply open the script in your favorite text editor:

In [42]: !cat 13_flask_serve_model.py


#!pygmentize -O style=monokai -g 13_flask_serve_model.py

import os
import json
import numpy as np
from tensorflow.keras.models import model_from_json

from flask import Flask


from flask import request, jsonify
import tensorflow as tf

loaded_model = None

app = Flask(__name__)

def load_model(export_path):
"""
Load model and tensorflow graph
into global variables.

"""

# global variable
global loaded_model

# load model architecture from json


with open(os.path.join(export_path, 'model.json')) as fin:
loaded_model = model_from_json(fin.read())

# load weights
loaded_model.load_weights(os.path.join(export_path, 'weights.h5'))

print("Model loaded.")

def preprocess(data):
"""
Generic function for normalization
and feature engineering.
Convert data from json to numpy array.
"""
res = json.loads(data)
return np.array(res['data'])

@app.route('/', methods=["POST"])
def predict():
"""
Generate predictions with the model
when receiving data as a POST request
"""
if request.method == "POST":
# get data from the request
data = request.data

# preprocess the data


processed = preprocess(data)

# run predictions
probas = loaded_model.predict(processed)

# obtain predicted classes from predicted probabilities


preds = np.argmax(probas, axis=1)

# print in backend
print("Received data:", data)
print("Predicted labels:", preds)

return jsonify(preds.tolist())

if __name__ == "__main__":
from sys import argv
print("* Loading model and starting Flask server...")
if len(argv) > 1:
export_path = argv[1]
else:
export_path = '/tmp/ztdl_models/wifi/flask/1/'
load_model(export_path)

app.run(host='0.0.0.0', debug=True)

Run the script

We can run this script from the course folder as:

python 13_flask_serve_model.py

Make sure to check Flask Documentation if you encounter any issues with the above steps.

You should see the following output:

Using TensorFlow backend.


* Loading model and starting Flask server...
2018-06-18 11:34:31.339142: I tensorflow/core/platform/cpu_feature_guard.cc:140]
Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2 FMA
Model loaded.
* Serving Flask app "13_flask_serve_model" (lazy loading)
* Environment: development
* Debug mode: on
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
* Restarting with stat
Using TensorFlow backend.
* Loading model and starting Flask server...
2018-06-18 11:37:57.101110: I tensorflow/core/platform/cpu_feature_guard.cc:140]
Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2 FMA
Model loaded.
* Debugger is active!
* Debugger PIN: xxx-xxx-xxx

Get Predictions from the API

Now that the server is running, let’s send some data to it and get predictions. We can test the application
with a simple CURL request like:

curl -d '{"data": [[-62, -58, -59, -59, -67, -80, -77],


[-49, -53, -50, -48, -67, -78, -88],
[-52, -57, -49, -50, -66, -80, -80]]}' \
-H "Content-Type: application/json" \
-X POST http://localhost:5000

Which should return:

[
0,
2,
2
]

What did we do? We have sent the wifi signal detected by three mobile phones and obtained their location.
The first one is in zone 0, and the other two are in zone 2. Great!

We can also ping our API using Python from the notebook by importing the requests module:

In [43]: import requests

We set the api_url variable:

In [44]: api_url = "http://localhost:5000/"

Get a few points from the test dataset:

In [45]: data = X_test[:5].tolist()

In [46]: data

Out[46]: [[-62, -58, -59, -59, -67, -80, -77],


[-49, -53, -50, -48, -67, -78, -88],
[-52, -57, -49, -50, -66, -80, -80],
[-40, -55, -52, -43, -60, -76, -72],
[-64, -59, -51, -67, -43, -88, -92]]

Create payload and headers dictionaries:

In [47]: payload = {'data': data}


headers = {'content-type': 'application/json'}

Finally, we send a POST request to the api_url with our data in JSON format. We collect the request's response
into a response variable:

In [48]: response = requests.post(api_url,


data=json.dumps(payload),
headers=headers)

Let’s check the response:

In [49]: response

Out[49]: <Response [200]>

If you see: <Response [200]> it means the request worked. Let’s check the response we obtained:

In [50]: response.json()

Out[50]: [0, 2, 2, 1, 3]

We can compare that with our labels:

In [51]: y_test[:5]

Out[51]: array([0, 2, 2, 1, 3])

The deployed model is working pretty well! Very nice! There are many options to host your deployed model, including:

• hosting the Flask app on AWS, GCloud, or Azure
• deploying it on Heroku
• deploying it on Floydhub

Now go ahead and amaze your friends!

The chapter continues by introducing a different way to export and deploy a model, which leverages
Tensorflow Serving. This is the preferred approach for larger production deployments.

Deployment with Tensorflow Serving


As the documentation says, TensorFlow Serving is a flexible, high-performance serving system for Machine
Learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new
algorithms and experiments while keeping the same server architecture and APIs. TensorFlow Serving
provides out-of-the-box integration with TensorFlow models but can be easily extended to serve other types
of models and data.

Tensorflow Serving can accommodate both small and large deployments, and it is ready for production. It
used to be difficult to use but, with the recent release of Tensorflow 2.0, it got greatly simplified. If you are
serious about using it, we recommend you take a look at the Architecture overview that explains many
concepts like Servables, Managers, and Sources.

In this part of the book, we will just show you how to export a model for serving and how to ping a
Tensorflow serving server using both the REST and the gRPC interfaces.

Saving a model for Tensorflow Serving

Let’s get started by exporting the model for Tensorflow Serving. Let’s start by defining an export path:

In [52]: base_path = '/tmp/ztdl_models/wifi'


sub_path = 'tfserving'
version = 1

Notice that we can bump up the version number if we save a new model later on. Like before, we can
combine these:

In [53]: export_path = join(base_path, sub_path, str(version))


export_path

Out[53]: '/tmp/ztdl_models/wifi/tfserving/1'

Let’s clear the export_path in case it already exists:

In [54]: shutil.rmtree(export_path, ignore_errors=True)

Saving a Tensorflow model for serving used to be quite complicated. However, Tensorflow 2.0 makes it as
easy as calling the tf.saved_model.save function. Let’s do it:

In [55]: tf.saved_model.save(model, export_path)

Now let’s check what’s been saved:

In [56]: os.listdir(export_path)

Out[56]: ['assets', 'saved_model.pb', 'variables']



The export_path contains a couple of folders and a saved_model.pb artifact, which is the model
architecture serialized using protocol buffer.

The variables folder contains the weights of the trained model that we have saved, in the form of
checkpoints:

In [57]: os.listdir(join(export_path, 'variables'))

Out[57]: ['variables.data-00000-of-00001', 'variables.index']

The assets folder is empty in this case, but it is used for example for text files to initialize vocabulary tables.

In [58]: os.listdir(join(export_path, 'assets'))

Out[58]: []
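
As a quick sanity check (this step is our own addition and not required for serving), the SavedModel can also be loaded back in Python with tf.saved_model.load and inspected programmatically:

# Load the SavedModel back and list its serving signatures.
reloaded = tf.saved_model.load(export_path)
print(list(reloaded.signatures.keys()))   # e.g. ['serving_default']

# The concrete function behind the default signature exposes its outputs:
infer = reloaded.signatures['serving_default']
print(infer.structured_outputs)           # the 'dense_3' output tensor spec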

We can use the saved_model_cli command line function to check the content of the saved model:

In [59]: !saved_model_cli show --dir {export_path} --all

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
The given SavedModel SignatureDef contains the following input(s):
The given SavedModel SignatureDef contains the following output(s):
outputs['__saved_model_init_op'] tensor_info:
dtype: DT_INVALID
shape: unknown_rank
name: NoOp
Method name is:

signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input_1'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 7)
name: serving_default_input_1:0
The given SavedModel SignatureDef contains the following output(s):
outputs['dense_3'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 4)
name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

The above lines are not obvious, so let’s see what they tell us. The first line tells us that the model meta-graph
was tagged with serve tag by default. Tags are an advanced concept in Tensorflow Serving, and they are
used to identify the specific meta-graph to load and restore, along with the shared set of variables and assets.

More interestingly, we see a serving_default signature that exposes two tensors:

• an input tensor available at key input_1
• an output tensor available at key dense_3

These correspond to the default names assigned by Keras to the input and output layers in our model. We
can change them by setting the name of a layer explicitly, for example like this: Dense(..., name='input').

We will need to remember these signatures when sending a call to the Tensorflow server.
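
As a minimal sketch of how this renaming would look (the hidden layer and the names 'wifi_signals' and 'location' below are our own hypothetical choices, not the ones used for the model in this chapter), we could rebuild the model with explicit layer names so that the signature keys become easier to remember:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Hypothetical names: they would replace the auto-generated keys
# 'input_1' and 'dense_3' in the serving signature.
inputs = Input(shape=(7,), name='wifi_signals')
x = Dense(64, activation='relu')(inputs)   # placeholder hidden layer
outputs = Dense(4, activation='softmax', name='location')(x)
named_model = Model(inputs, outputs)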

Inference with Tensorflow Serving using Docker and the Rest API

By far the easiest way to get tensorflow serving up and running is to use the pre-built Docker image as
explained in the documentation.

TIP: If you are new to Docker this may be a bit unfamiliar and complicated. Feel free to
either skip this section or read more about Docker and how it works in the comprehensive
documentation.

Assuming you have Docker installed and running on your machine, let’s pull the tensorflow/serving
docker container:

docker pull tensorflow/serving:latest

Next, let’s run the docker container with the following command:

docker run \
-v /tmp/ztdl_models/wifi/tfserving/:/models/wifi \
-e MODEL_NAME=wifi \
-e MODEL_PATH=/models/wifi \
-p 8500:8500 \
-p 8501:8501 \
-t tensorflow/serving

Let’s go through the options selected in detail:

• -v: This bind mounts a volume; it tells Docker to map the host directory /tmp/ztdl_models/wifi/tfserving/
to the path /models/wifi inside the container.

• -e: Sets environment variables, in this case, we set the MODEL_NAME and MODEL_PATH variables
• -p: Publishes a container’s port to the host. In this case, we are publishing port 8500 (default gRPC)
and 8501 (default REST).
• -t: Allocate a pseudo-TTY
• tensorflow/serving is the name of the container we are running.

Since Tensorflow 1.8, Tensorflow Serving exposes both gRPC and REST endpoints by default, so we can
test our running server by simply using curl. The correct command for this is:

curl -d '{"signature_name": "serving_default",


"instances": [[-62.0, -58.0, -59.0, -59.0, -67.0, -80.0, -77.0],
[-49.0, -53.0, -50.0, -48.0, -67.0, -78.0, -88.0],
[-52.0, -57.0, -49.0, -50.0, -66.0, -80.0, -80.0]]}' \
-H "Content-Type: application/json" \
-X POST http://localhost:8501/v1/models/wifi:predict

Go ahead and run that in a shell, you should receive an output that looks similar to the following:

{
"predictions": [[0.997524, 1.19462e-05, 0.00171472, 0.000749083],
[3.40611e-06, 0.00262853, 0.997005, 0.000363284],
[2.52653e-05, 0.00507444, 0.993813, 0.00108718]
]
}

We can also pass a request from Jupyter notebook similarly to what we did using flask. Let’s use the same
data we have used for the Flask example:

In [60]: data

Out[60]: [[-62, -58, -59, -59, -67, -80, -77],


[-49, -53, -50, -48, -67, -78, -88],
[-52, -57, -49, -50, -66, -80, -80],
[-40, -55, -52, -43, -60, -76, -72],
[-64, -59, -51, -67, -43, -88, -92]]

Create a new payload with the correct structure:

In [61]: payload = {"signature_name": "serving_default",


"instances": data}
headers = {'content-type': 'application/json'}

And send a POST request to the model REST API endpoint:

In [62]: response = requests.post("http://localhost:8501/v1/models/wifi:predict",


data=json.dumps(payload),
headers=headers)

If we’ve done things properly we should get a json object with the predicted probabilities:

In [63]: response.json()

Out[63]: {'predictions': [[0.99733, 0.000386354, 0.00011253, 0.00217141],


[1.00856e-05, 0.00778797, 0.99196, 0.000242161],
[5.23613e-05, 0.00740026, 0.990805, 0.0017419],
[2.83991e-06, 0.940482, 0.0595128, 1.88323e-06],
[5.46268e-07, 3.34143e-10, 4.94043e-06, 0.999995]]}

Wonderful! You have just run your first model using Tensorflow Serving and Docker.

To stop the server, go back to the command prompt, press CTRL+C to exit from the tty session. Then run:

docker container ls

to list all the containers currently running. The output should look similar to this:

CONTAINER ID IMAGE COMMAND ...


fdd7c0958cdf tensorflow/serving "/bin/sh -c 'tensorf..." ...

Find the id of the tensorflow/serving container and then run:

docker stop fdd7c0958cdf

to stop it from running. You can always restart it later if you need it.

The gRPC API

Tensorflow serving can also receive data serialized as protocol buffers, so we will need to do a little bit more
work to use our server for predictions.

First of all let’s create a prediction service. We’ll need to import the insecure_channel from grpc:

In [64]: from grpc import insecure_channel

Next let’s create an insecure channel to localhost (or to your server) on port 8500, which is the port we chose
for tensorflow serving:

In [65]: channel = insecure_channel('localhost:8500')

In [66]: channel

Out[66]: <grpc._channel.Channel at 0x7f04085da7f0>

Through this channel we’ll be able to perform RPCs. Next we are going to create an instance of
PredictionServiceStub from tensorflow_serving.apis.prediction_service_pb2_grpc. Notice
that most of the documentation you can find online is outdated and uses the legacy beta API. We are using
the most recent version of the gRPC API:

In [67]: from tensorflow_serving.apis.prediction_service_pb2_grpc import PredictionServiceStub

A PredictionService provides access to machine-learned models loaded by model_servers. Let's create a stub:

In [68]: stub = PredictionServiceStub(channel)

We are almost ready to send data to our server. The last thing we need to do is convert the data to protocol
buffers. Let’s convert the same data used previously to Protocol Buffers. We use the make_tensor_proto
function from Tensorflow v1 to serialize our data. Notice that we will need to wrap our data in a numpy array
since it was passed as a list to the Flask application:

In [69]: data_np = np.array(data)

Let’s make the protobufs:

In [70]: data_pb = tf.compat.v1.make_tensor_proto(data_np,


dtype='float',
shape=data_np.shape)

What do protobufs look like? Let’s print out data_pb:



In [71]: data_pb

Out[71]: dtype: DT_FLOAT


tensor_shape {
dim {
size: 5
}
dim {
size: 7
}
}
tensor_content: "\000\000x\302\000\000h\302\000\000l\302\000\000l\302\000\00
0\206\302\000\000\240\302\000\000\232\302\000\000D\302\000\000T\302\000\000H
\302\000\000@\302\000\000\206\302\000\000\234\302\000\000\260\302\000\000P\3
02\000\000d\302\000\000D\302\000\000H\302\000\000\204\302\000\000\240\302\00
0\000\240\302\000\000 \302\000\000\\\302\000\000P\302\000\000,\302\000\000p\
302\000\000\230\302\000\000\220\302\000\000\200\302\000\000l\302\000\000L\30
2\000\000\206\302\000\000,\302\000\000\260\302\000\000\270\302"

As you can see it’s a binary file, with a text header. In the header we can read the data type and the tensor
shape, while the values have been converted to binary values. Now that we have prepared the data, we are
ready to create an instance of PredictRequest, which is a class in
tensorflow_serving.apis.predict_pb2:

In [72]: from tensorflow_serving.apis.predict_pb2 import PredictRequest

In [73]: request = PredictRequest()

When we started our tensorflow serving server, we specified wifi as the model name, so let’s use wifi as
the model name for the request:

In [74]: request.model_spec.name = 'wifi'

Let's also indicate the signature name, which is serving_default:

In [75]: request.model_spec.signature_name = 'serving_default'

Finally let’s pass our serialized data to the request input:

In [76]: request.inputs['input_1'].CopyFrom(data_pb)

In [77]: request

Out[77]: model_spec {
name: "wifi"
signature_name: "serving_default"
}
inputs {
key: "input_1"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 5
}
dim {
size: 7
}
}
tensor_content: "\000\000x\302\000\000h\302\000\000l\302\000\000l\302\00
0\000\206\302\000\000\240\302\000\000\232\302\000\000D\302\000\000T\302\000\
000H\302\000\000@\302\000\000\206\302\000\000\234\302\000\000\260\302\000\00
0P\302\000\000d\302\000\000D\302\000\000H\302\000\000\204\302\000\000\240\30
2\000\000\240\302\000\000 \302\000\000\\\302\000\000P\302\000\000,\302\000\0
00p\302\000\000\230\302\000\000\220\302\000\000\200\302\000\000l\302\000\000
L\302\000\000\206\302\000\000,\302\000\000\260\302\000\000\270\302"
}
}

Great! Now let's pass the request to the stub's Predict.future method, which will invoke the underlying RPC
asynchronously. This method returns an object that is both a Call for the RPC and a Future. In the event of
RPC completion, the returned Call-Future's result value will be the response message of the RPC. Should the
event terminate with a non-OK status, the returned Call-Future's exception value will be an RpcError.

In [78]: result_future = stub.Predict.future(request, 5.0)

Let’s get the result of this future:

In [79]: result = result_future.result()


result

Out[79]: outputs {
key: "dense_3"
value {

dtype: DT_FLOAT
tensor_shape {
dim {
size: 5
}
dim {
size: 4
}
}
float_val: 0.997329592704773
float_val: 0.00038635419332422316
float_val: 0.00011253042612224817
float_val: 0.002171410247683525
float_val: 1.008555045700632e-05
float_val: 0.007787971291691065
float_val: 0.9919597506523132
float_val: 0.0002421610406599939
float_val: 5.2361316193128005e-05
float_val: 0.007400262635201216
float_val: 0.990805447101593
float_val: 0.001741902669891715
float_val: 2.8399131224432494e-06
float_val: 0.940482497215271
float_val: 0.05951283499598503
float_val: 1.8832306523108855e-06
float_val: 5.462679268930515e-07
float_val: 3.341430188097405e-10
float_val: 4.940428880217951e-06
float_val: 0.9999945163726807
}
}
model_spec {
name: "wifi"
version {
value: 1
}
signature_name: "serving_default"
}

Wonderful! Our Tensorflow server returned the predicted probabilities. We can convert them back to a
familiar numpy array using the make_ndarray function:

In [80]: scores = tf.make_ndarray(result.outputs['dense_3'])

Here we are, back with our familiar array:



In [81]: scores

Out[81]: array([[9.97329593e-01, 3.86354193e-04, 1.12530426e-04, 2.17141025e-03],


[1.00855505e-05, 7.78797129e-03, 9.91959751e-01, 2.42161041e-04],
[5.23613162e-05, 7.40026264e-03, 9.90805447e-01, 1.74190267e-03],
[2.83991312e-06, 9.40482497e-01, 5.95128350e-02, 1.88323065e-06],
[5.46267927e-07, 3.34143019e-10, 4.94042888e-06, 9.99994516e-01]],
dtype=float32)

We can retrieve the classes by using argmax, like we previously did:

In [82]: prediction = np.argmax(scores, axis=1)


prediction

Out[82]: array([0, 2, 2, 1, 3])

and we can compare this with the local model we still have in memory:

In [83]: model.predict(np.array(data)).argmax(axis=1)

Out[83]: array([0, 2, 2, 1, 3])

Wonderful! We have successfully retrieved predictions from a Tensorflow serving server. This barely
scratches the surface of what’s possible with Tensorflow Serving. If you are serious about bringing your
models to production we strongly encourage you to read the Documentation as well as to complete the
Basic Tutorial and the Advanced Tutorial.

We conclude here the chapter on deployment. Remember to stop your Docker container if you don’t need it
any longer.

Exercises

Exercise 1

Let’s deploy an image recognition API using Tensorflow Serving. The main difference from the API we have
deployed in this chapter is that we will have to deal with how to pass an image to the model through
tensorflow serving. Since this chapter focuses on deployment, we will take a shortcut and deploy a pre-
trained model that uses Imagenet. In particular, we will deploy the Xception model. If you are unsure
about how to use a pre-trained model, please go back to Chapter 11 for a refresher.

Here are the steps you will need to complete:



• load the model in Keras


• export the model for tensorflow serving:
– set the learning phase to zero
– save the model with tf.saved_model.save
• run the model server
• write a short script that:
– loads an image
– pre-processes it with the appropriate function
– serializes the image to Protobuf
– sends the image to the server
– receives a prediction
– decodes the prediction with Keras decode_prediction function

In [ ]:

Exercise 2

The above method of serving a pre-trained model has an issue: we are doing pre-processing and prediction
decoding on the client side. This is not a best practice, because it requires the client to be aware of what kind
of pre-processing and decoding functions the model needs.

We want a server that takes the image as it is and returns a string with the name of the object found.

The easy way to do this is to use the Flask app implementation we have shown in this chapter and move
pre-processing and decoding on the server side.

Go ahead and build a Flask version of the API that takes an image URL as a JSON string, applies
pre-processing, runs and decodes the prediction and returns a string with the response.

You will not use tensorflow serving for this exercise.

Once your script is ready, save it as 13_flask_serve_xception.py, run it as:

python 13_flask_serve_xception.py

and test the prediction with the following command:

curl -d "http://bit.ly/2wb7uqN" \
-H "Content-Type: application/json" \
-X POST http://localhost:5000

If you’ve done things correctly, this should return:



"king_penguin"

Disclaimer: this script is not for production purposes. Retrieving a file from a URL is not secure, and
you should avoid building an API that retrieves a file from a URL provided from the client. Here we
used the URL retrieval trick to make the curl command shorter.

In [ ]:
14
Conclusions and Next Steps

We have reached the end of the book, and hopefully the beginning of an incredible journey for you, the
reader!

We hope we have demystified Deep Learning for you and prepared you to venture into this field as a
practitioner or a researcher. Above all else, we hope you will build some great stuff using this book.

Before we part, we would like to give you some pointers to resources you can tap into. Remember, this field
evolves fast, so make sure to keep yourself updated by discussing with people and checking resources online.

Where to go next
You may be wondering where to go next after you have built the foundation. Here are a few ideas, and two
words of advice.

First of all, we advise you to choose a specific project. Start with a goal in mind, and then obtain the resources
you need to achieve it. The field is rapidly evolving, so it’s more useful to get one project to completion and
get some experience through it, rather than trying to know everything like an encyclopedia.

Here are some examples of projects you could pursue:

Find an interesting dataset, build a visualization or a model You can look for datasets in many places
including:

• UCI Machine Learning Repository


• Figure8 Datasets


• AWS Datasets
• Kaggle Datasets
• Awesome Datasets
• Open Government Data
• Data.World

Reproduce the results of a cool deep learning repo Many researchers publish their code, and if they
don’t, others try to reproduce their claims. Look on Github for an exciting project and try to run its code.
Most likely you will encounter hiccups along the way, and these will teach you valuable lessons.

Participate in a Kaggle competition Kaggle is a website that hosts machine learning competitions.
Competitions are a great way to test your understanding of a topic, to make friends with similar interests
and possibly to win some money.

Learn how to deploy models on web apps using Tensorflow and Tensorflow.js Tensorflow.js is a
JavaScript library for training and deploying ML models in the browser and on Node.js. Now that you
understand how to train Neural Networks, it should be fairly straightforward to adapt your knowledge to a
different programming language.

Online resources
By far the most comprehensive list of resources on Deep Learning is the Awesome Deep Learning list on
Github.

This list contains sublists of: Free Online Books, Courses, Videos and Lectures, Papers, Tutorials,
Researchers, Websites, Datasets, Conferences, Frameworks, Tools and more.

Another broad list of research papers is the Awesome Deep Learning Resources list on Github, which collects
recent papers regarding deep learning and deep reinforcement learning, sorted by date so that the most
recent papers appear first.

Bootcamp
Finally, as you may have noticed on our website, we also run an in-person Bootcamp. It is a 5-day immersive
and hands-on training, covering the content of this book and more. If you like the material here and need
some more help getting started, make sure you check out our Bootcamp at:
www.zerotodeeplearning.com

We hope you have enjoyed our book, Zero to Deep Learning, and will keep practicing in the field!
15
Appendix

In [1]: with open('common.py') as fin:


exec(fin.read())

In [2]: with open('matplotlibconf.py') as fin:


exec(fin.read())

Throughout the book we use several mathematical concepts drawn from linear algebra and calculus. In this
appendix we review them in little more detail. This is meant to be for the curious reader and it’s not
necessary in order to complete the book.

Matrix multiplication
We introduced matrices in chapter 1. As you know, an N × M matrix is an array of numbers organized in
N rows and M columns.

Matrices are multiplied with the same rule of the dot product. Two matrices A and B can be multiplied if the
number of columns of the first is equal to the number of rows of the second. If A is 2x3 and B is 3x2, they
can be multiplied and the resulting matrix will have shape 2x2 if we do A.B and 3x3 if we do B.A.

However, if A is 2x4 and B is 3x5, we cannot multiply the two matrices.

The figure below shows how the elements of this matrix are calculated:

Let’s create 2 matrices in numpy using 2D-array method and check this formula:


In [3]: A = np.array([[0, 1, 2],


[2, 3, 0]])

B = np.array([[0, 1],
[2, 3],
[4, 5]])

C = np.array([[0, 1],
[2, 3],
[4, 5],
[0, 1],
[2, 3],
[4, 5]])

print("A is a {} matrix".format(A.shape))
print("B is a {} matrix".format(B.shape))
print("C is a {} matrix".format(C.shape))

A is a (2, 3) matrix
B is a (3, 2) matrix
C is a (6, 2) matrix

The matrix product in Numpy is a function called dot. We can access it as a method of an array:

In [4]: A.dot(B)

Out[4]: array([[10, 13],


[ 6, 11]])

or as a function in Numpy:

In [5]: np.dot(A, B)

Out[5]: array([[10, 13],


[ 6, 11]])

Notice that if we invert the order we do get a 3x3 matrix instead:

In [6]: B.dot(A)

Out[6]: array([[ 2, 3, 0],


[ 6, 11, 4],
[10, 19, 8]])

Or, using the np.dot() version, we get the same as these two methods are functionally equivalent:

In [7]: np.dot(B, A)

Out[7]: array([[ 2, 3, 0],


[ 6, 11, 4],
[10, 19, 8]])

We can also perform the matrix multiplication C.dot(A), however, matrix multiplications are only possible
along axes with the same length. So, for example, we cannot perform the multiplication A.dot(C).

In [8]: C.dot(A)

Out[8]: array([[ 2, 3, 0],


[ 6, 11, 4],
[10, 19, 8],
[ 2, 3, 0],
[ 6, 11, 4],
[10, 19, 8]])

For example, uncomment the next line to get the following error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-9c5b5a616184> in <module>()
1 # uncomment the next line to get an error
----> 2 A.dot(C)

ValueError: shapes (2,3) and (6,2) not aligned: 3 (dim 1) != 6 (dim 0)

In [9]: # A.dot(C)

TIP: remember this ValueError for mismatching shapes. It’s very common when building
Neural Networks.

Chain rule

Univariate functions

The chain rule is a rule to calculate the derivative of nested functions.

For example, let’s say we have a function:

h(x) = log(2 + cos(x)) (15.1)

How do we calculate the derivative of this function with respect to x? This function is a composition of the
two functions

f (y) = log(y) g(x) = 2 + cos(x) (15.2)

so we can write h as a nested function:



h(x) = f (g(x)) (15.3)

We can calculate the derivative of h with respect to x by applying the chain rule:

• first we calculate the derivative of g with respect to x


• then we calculate the derivative of f with respect to g
• finally we multiply the two

dh(x)/dx = d/dx f(g(x)) = df/dg ⋅ dg/dx    (15.4)

Using the table above we find:

d/dx (2 + cos(x)) = −sin(x)        d/dg log(g) = 1/g    (15.5)

So finally, the derivative of our nested function h of x is the product of the two derivatives:

d/dx log(2 + cos(x)) = −sin(x) / (2 + cos(x))    (15.6)

Notice that we substituted g with 2 + cos(x) in the denominator.

Code example

Let’s define all the above functions and verify that the derivative of h calculated with the chain rule is
equivalent to the derivative calculated with the np.diff function.

In [10]: def f(x):


return np.log(x)

def g(x):
return 2 + np.cos(x)

def h(x):
return f(g(x))

def df(x):
return 1/x

def dg(x):
return -np.sin(x)

def dh(x):
return df(g(x)) * dg(x)

In [11]: x = np.linspace(-5, 5, 500)

plt.subplot(211)
plt.plot(x, dh(x))
plt.legend(['using the chain rule'])

plt.subplot(212)
plt.plot(x[:-1], np.diff(h(x))/np.diff(x))
plt.legend(['using np.diff'])

Out[11]: <matplotlib.legend.Legend at 0x7f86955f7630>

[Figure: the derivative of h computed with the chain rule (top panel) and with np.diff (bottom panel).]

The two results are the same (up to numerical errors), as expected!

Multivariate functions

The chain rule can be easily extended to the case where f has multiple functions as arguments:

h(x) = f (g(x), k(x)) (15.7)

We simply distribute the chain rule to all the arguments that depend on x:

dh(x)/dx = (∂f/∂g) ⋅ (dg/dx) + (∂f/∂k) ⋅ (dk/dx)    (15.8)

Notice that here we are using the partial derivative symbol ∂ which simply means we are taking the
derivative with respect to one of the variables while keeping all the others fixed.
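
As a quick numerical check (our own example, analogous to the univariate one above), take f(g, k) = g ⋅ k with g(x) = sin(x) and k(x) = x², so that dh/dx = k(x) cos(x) + 2x g(x). We can compare this with a finite-difference estimate:

import numpy as np

def g(x): return np.sin(x)
def k(x): return x**2
def h(x): return g(x) * k(x)          # f(g, k) = g * k

def dh(x):
    # chain rule: (df/dg) * dg/dx + (df/dk) * dk/dx
    return k(x) * np.cos(x) + g(x) * 2 * x

x = np.linspace(-5, 5, 500)
numeric = np.diff(h(x)) / np.diff(x)             # finite-difference estimate
mid = (x[:-1] + x[1:]) / 2                       # compare at interval midpoints
print(np.allclose(dh(mid), numeric, atol=1e-2))  # True, up to numerical error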

Exponentially Weighted Moving Average (EWMA)

The EWMA is the most important algorithm of your life. We often use this joke in classes to get the
attention of our students. Although this may or may not be true in your particular case, it is true that this
algorithm crops up everywhere, from financial time series to signal processing to Neural Networks.

Different domains name it in different ways but it’s actually always the same thing and it’s worth knowing
how it works in detail.

Let us have a look at how it works. Let’s say we have a sequence of ordered datapoints. These could be the
values of a stock, temperature measurements, anything that is measured in a sequence.

In [12]: points = pd.DataFrame([1.04, 1.30, 3.35, 4.79, 4.15,


3.46, 3.19, 5.04, 6.19, 5.26,
2.62, 3.13, 3.58, 3.01, 3.73,
3.59, 4.12, 2.53, 4.03, 3.91,
4.95, 4.78, 4.56, 7.47, 7.85,
6.36, 7.81, 8.65, 7.33, 7.48,
6.55, 6.19, 6.63, 6.53, 3.27,
3.59, 4.13, 4.09, 3.21, 4.32,
5.46, 3.75, 4.11, 3.01, 2.82,
3.08, 3.07, 3.65, 3.30, 5.08],
columns=['data'])
points.plot(title='Just some data that looks noisy')
plt.show()
[Figure: plot of the points series, titled 'Just some data that looks noisy'.]

If this data is noisy, we may want to reduce the noise in order to obtain a more accurate estimation of the
underlying actual values. One easy way to remove noise from a time series is to perform a rolling average
or moving average: you wait to accumulate a certain number of observations and use their average as
the estimation of the current value. This method works, but it requires holding the past values in a memory
buffer and constantly updating that buffer when a new data point of the sequence arrives. So if we want to
average over a long window, we have to keep the whole window in memory, and we also cannot calculate
the first average until we have observed at least as many points as the window contains (unless we pad with
zeros).

Rolling averages are available in Pandas through the .rolling() method. Let’s plot a few examples:

In [13]: points['rolling mean 5 points'] = \


points['data'].rolling(5).mean()
points['rolling mean 10 points'] = \
points['data'].rolling(10).mean()
points.plot(title='Moving average smoothing')
plt.show()
[Figure: 'Moving average smoothing' plot showing the data together with the 5-point and 10-point rolling means.]

EWMA differs from the moving average because it only requires knowledge of the current value of the
data and of the previous value of the EWMA itself.

Let’s indicate the values of our sequence as x0 , x1 , x2 , . . . , x n . We can calculate the value of the corresponding
EWMA recursively as:

y0 = x0 (15.9)
y n = (1 − α) y n−1 + α x n (15.10)

The two extreme cases of this formula are α = 0, in which case the value of y n will remain fixed to x0 forever
and α = 1, in which case y n will be exactly tracking the value of x n .

If α is between 0 and 1, the EWMA will smooth the signal reducing its fluctuations. Let’s walk through an
example with α = 0.9 to clarify how it works.

When the first point x0 comes in, the EWMA is set to be equal to the raw data, so y0 = x0 .

Then, when the second raw value x1 comes in, we take 90% of it and add it to 10% of the previous value of the
moving average y0:

y1 = 0.1 y0 + 0.9 x1 (15.11)

Since, y0 = x0 , the previous formula is equivalent to:

y1 = 0.1 x0 + 0.9 x1 (15.12)

So, the value of the EWMA will be mostly determined by the new value, with a 90% contribution from x1
and a 10% contribution from the initial value x0.

Then, the third point x2 comes in. Again, we take 90% of its value and add it to 10% of the current EWMA
value y1.

y2 = 0.1 y1 + 0.9 x2 (15.13)


= 0.1 (0.1 x0 + 0.9 x1 ) + 0.9 x2 (15.14)
= 0.01 x0 + 0.09 x1 + 0.9 x2 (15.15)
(15.16)

This third point will still be mostly influenced by the initial point, but it will also contain contributions from
the most recent two points.

Let’s look at a couple more steps. Here’s y3 :

y3 = 0.1 y2 + 0.9 x3 (15.17)


= 0.1 (0.01 x0 + 0.09 x1 + 0.9 x2 ) + 0.9 x3 (15.18)
= 0.001 x0 + 0.009 x1 + 0.09 x2 + 0.9 x3 (15.19)
(15.20)

and here’s y4 :

y4 = 0.1 y3 + 0.9 x4 (15.21)


= 0.1 (0.001 x0 + 0.009 x1 + 0.09 x2 + 0.9 x3 ) + 0.9 x4 (15.22)
= 0.0001 x0 + 0.0009 x1 + 0.009 x2 + 0.09 x3 + 0.9 x4 (15.23)

As you can see the value of y4 is influenced by all the previous values of x in an exponentially decreasing
fashion.

We can continue playing this game at each new point, and all we need to keep in memory is the previous
value of the EWMA y n−1 until we have mixed it with the current raw value of the signal x n .

This formula is great for two reasons:

1. We only keep the last values of the EWMA in memory, no need for a buffer.
2. We can calculate it from the beginning of the sequence instead of waiting to accumulate some values.

This formula is very popular and goes under different names in different domains. Statisticians would call it
an autoregressive integrated moving average model with no constant term, or ARIMA(0,1,1). Signal
processing people would call it a first order Infinite Impulse Response (IIR) filter, but it's the same thing.

The idea is simple. Each new value of the smoothed sequence is the sum of two terms: its own previous value
and the current new value of the sequence. The ratio of the mixing is controlled by the parameter α
(pronounced alpha): large values skew the mix towards the raw data, with very little smoothing, while
small values skew the mix towards the previous smoothed value, and therefore give very strong smoothing.
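
To make the recursion concrete, here is a short sketch (our own check, not part of the original text) that implements it directly and compares it with pandas; note that we pass adjust=False so that pandas uses exactly the recursion above:

import numpy as np
import pandas as pd

def ewma(x, alpha):
    # y[0] = x[0]; y[n] = (1 - alpha) * y[n-1] + alpha * x[n]
    y = np.zeros(len(x))
    y[0] = x[0]
    for n in range(1, len(x)):
        y[n] = (1 - alpha) * y[n - 1] + alpha * x[n]
    return y

x = np.array([1.04, 1.30, 3.35, 4.79, 4.15, 3.46, 3.19, 5.04])
manual = ewma(x, alpha=0.5)
with_pandas = pd.Series(x).ewm(alpha=0.5, adjust=False).mean().values
print(np.allclose(manual, with_pandas))   # True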

EWMAs are also available in pandas, let’s plot a few:

In [14]: points['ewma a=0.5'] = points['data'].ewm(alpha=0.5).mean()


points['ewma a=0.1'] = points['data'].ewm(alpha=0.1).mean()

cols_ = ['data', 'ewma a=0.5', 'ewma a=0.1']


points[cols_].plot(title='EWMA smoothing')
plt.show()
[Figure: 'EWMA smoothing' plot showing the data together with the EWMA for a=0.5 and a=0.1.]

You can notice a couple of things when comparing this plot with the previous one:

1. the smoothed curves start immediately; we don't have to wait in order to calculate the EWMA
2. a smaller value of alpha gives stronger smoothing

This algorithm is simple and beautiful, and you will encounter it in many places, beyond optimizers for
neural nets.

Tensors
Let’s create a couple of test tensors. We will create a tensor A of order 4 and a tensor B of order 2:

In [15]: A = np.random.randint(10, size=(2, 3, 4, 5))


B = np.random.randint(10, size=(2, 3))

In [16]: A

Out[16]: array([[[[3, 2, 3, 0, 8],


[1, 2, 4, 5, 9],
[0, 8, 1, 2, 2],
[8, 2, 6, 4, 7]],

[[6, 1, 3, 1, 3],
[9, 6, 6, 7, 2],
[1, 8, 2, 1, 7],
[3, 6, 1, 8, 7]],

[[3, 9, 4, 9, 5],
[9, 5, 7, 0, 7],
[6, 7, 6, 1, 7],
[3, 1, 4, 0, 6]]],

[[[2, 2, 8, 2, 7],
[8, 8, 6, 2, 5],
[0, 8, 3, 4, 4],
[9, 5, 2, 3, 9]],

[[1, 2, 9, 7, 1],
[0, 4, 8, 3, 1],
[5, 6, 5, 8, 6],
[3, 8, 0, 4, 3]],

[[0, 2, 7, 5, 7],
[0, 5, 7, 9, 1],
[7, 4, 2, 6, 9],
[9, 3, 7, 1, 7]]]])

In [17]: B

Out[17]: array([[1, 4, 7],


[0, 7, 1]])

A single number in A is located by four coordinates, so for example:

In [18]: A[0, 1, 0, 3]

Out[18]: 1

Tensors can be multiplied by a scalar, and their shape remains the same:

In [19]: A2 = 2 * A
A2

Out[19]: array([[[[ 6, 4, 6, 0, 16],


[ 2, 4, 8, 10, 18],
[ 0, 16, 2, 4, 4],
[16, 4, 12, 8, 14]],

[[12, 2, 6, 2, 6],
[18, 12, 12, 14, 4],
[ 2, 16, 4, 2, 14],
[ 6, 12, 2, 16, 14]],

[[ 6, 18, 8, 18, 10],


[18, 10, 14, 0, 14],
[12, 14, 12, 2, 14],
[ 6, 2, 8, 0, 12]]],

[[[ 4, 4, 16, 4, 14],


[16, 16, 12, 4, 10],
[ 0, 16, 6, 8, 8],
[18, 10, 4, 6, 18]],

[[ 2, 4, 18, 14, 2],


[ 0, 8, 16, 6, 2],
[10, 12, 10, 16, 12],
[ 6, 16, 0, 8, 6]],

[[ 0, 4, 14, 10, 14],


[ 0, 10, 14, 18, 2],
[14, 8, 4, 12, 18],
[18, 6, 14, 2, 14]]]])

In [20]: A.shape == A2.shape

Out[20]: True

We can also add tensors of the same shape element by element to obtain a third tensor with the same shape:

In [21]: A + A2

Out[21]: array([[[[ 9, 6, 9, 0, 24],


[ 3, 6, 12, 15, 27],

[ 0, 24, 3, 6, 6],
[24, 6, 18, 12, 21]],

[[18, 3, 9, 3, 9],
[27, 18, 18, 21, 6],
[ 3, 24, 6, 3, 21],
[ 9, 18, 3, 24, 21]],

[[ 9, 27, 12, 27, 15],


[27, 15, 21, 0, 21],
[18, 21, 18, 3, 21],
[ 9, 3, 12, 0, 18]]],

[[[ 6, 6, 24, 6, 21],


[24, 24, 18, 6, 15],
[ 0, 24, 9, 12, 12],
[27, 15, 6, 9, 27]],

[[ 3, 6, 27, 21, 3],


[ 0, 12, 24, 9, 3],
[15, 18, 15, 24, 18],
[ 9, 24, 0, 12, 9]],

[[ 0, 6, 21, 15, 21],


[ 0, 15, 21, 27, 3],
[21, 12, 6, 18, 27],
[27, 9, 21, 3, 21]]]])

Tensor Dot Product

One of the most important operations between tensors is the product. If we think about the product between
two scalars, we have no doubts about how to perform it. If we think about two vectors a = {a_i} and b = {b_i}, we
can perform different types of product (for example the dot product or the cross product).

Here we focus on the so-called Dot Product. The dot product p between a and b is given by:

p = ∑_i a_i b_i

The operation consists in summing up the products between the components of the two vectors. As you may
observe, the result of the dot product between two vectors is a scalar, which is an entity with a lower order
if compared with the two factors. For this reason, this operation is also called contraction.
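
For instance, here is a trivial numeric check of the formula above (our own example):

import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])

# dot product: 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))     # 32.0
print(np.sum(a * b))    # same result, with the sum written out explicitly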

A similar operation can also be performed between two tensors of higher order, if the two tensors have an
axis with the same length. In this case we can perform a dot product (or a contraction) along that axis. The
shape of the resulting tensor depends on the shapes of the original two tensors that got contracted.

Let’s see a couple of examples. Here are the shapes of A and B. A has order 4, B has order 2:

In [22]: A.shape

Out[22]: (2, 3, 4, 5)

In [23]: B.shape

Out[23]: (2, 3)

Since both A and B have a first axis of length 2, we can perform a tensor dot product along the first axis using
the tensordot function from numpy. In order to perform this product we have to specify not only the two
arguments A and B, but also that we want to perform the operation along the first axis in each of the 2
tensors. This can be done through the argument axes=([0], [0]), as explained in the np.tensordot
documentation.

In [24]: T = np.tensordot(A, B, axes=([0], [0]))

Let’s check the shape of T:

In [25]: T.shape

Out[25]: (3, 4, 5, 3)

Interesting! Can you see what happened? T has four axes, i.e. it has order 4, T = {t_jkln}. We can calculate
that by counting how many free indices remained in A and B after the contraction on axis 0. The elements of
A are indicated by four indices, A = {a_ijkl}, and the elements of B are indicated by two indices, B = {b_mn}.
Mathematically, the tensor product performs the operation:

T = {t_jkln} = ∑_i a_ijkl b_in

so, the elements of the resulting tensor T are located by 4 indices: 3 coming from the tensor A and 1 coming
from the tensor B.

Let’s do another example. What will be the shape of the tensor product of A and B if we contract along the
first 2 axes? First of all we have to check that the first two axes have the same length. Then we have to change
the argument into axes=([0, 1], [0, 1]).

In [26]: T = np.tensordot(A, B, axes=([0, 1], [0, 1]))


T

Out[26]: array([[ 55, 85, 113, 121, 69],


[100, 94, 140, 63, 74],
[ 88, 135, 88, 75, 130],
[ 71, 92, 45, 65, 105]])

In [27]: T.shape

Out[27]: (4, 5)

Since both axis 0 and axis 1 have been contracted, the only remaining 2 indices come from axis 2 and 3 of
tensor A. This yields a tensor of order 2.

Wonderful! We have learned to perform a few operations using tensors! While this may seem really abstract
and removed from the practical applications of Deep Learning, actually it is not. We need to understand
how to arrange our data using tensors if we want to leverage Neural Networks with their full potential.

Now that we know how to operate with tensors, it is time to dive into convolutions!

We will start from 1D convolutions and then extend the definition to 2D arrays.

Convolutions
In chapter 6 we introduced Convolutional Neural Networks and we went a bit fast when talking about
convolutions. Let’s introduce convolutions here in a bit more detail.

1D Convolution & Correlation

Let’s start with two arrays, a and v.

In [28]: a_ = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
a = np.array(a_, dtype='float32')
a

Out[28]: array([0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
dtype=float32)

In [29]: v = np.array([-1, 1], dtype='float32')


v

Out[29]: array([-1., 1.], dtype=float32)

v is a short array of only two elements, while a is a longer array of several elements. The general question we
are looking to answer is how similar the two arrays are. Since they are not the same length we cannot
perform a dot product between the two. We can, however, define two operations involving a and v: the
correlation and the convolution. These operations try to gauge the similarity between the two arrays,
acknowledging the fact that they don’t have the same length and performing a sort of “rolling dot product”.

In both cases we start from the left-side of a and we take a short sub-array with the same length as v, in this
case 2 numbers. In Machine Learning this sub-array is called receptive field.

Then we perform a dot product of v with the receptive field of a, i.e. we multiply the elements of v by
the elements of the sub-array and we sum the products. We then store the result as the first element of our
result array c.

Then we shift the window in a by one number and again perform a product between the new sub-array and
v, also summing at the end. This second value gets stored in the result array as well.

We can continue shifting the window and performing dot products until we reach the end of the array a and
no more shifting is possible.

The difference between convolution and correlation is that the array v is flipped before the multiplication.

For a more precise mathematical definition of the correlation and convolution of a with v we invite the user
to consult the many detailed resources that can be found online.
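
Before turning to the built-in functions, here is a minimal sketch of the "rolling dot product" described above (our own illustration, correlation only; flipping v first would give the convolution):

import numpy as np

def correlate_1d(a, v):
    # 'valid' mode: slide v over a and take a dot product at each position
    out_len = len(a) - len(v) + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        out[i] = np.dot(a[i:i + len(v)], v)   # receptive field . filter
    return out

a = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype='float32')
v = np.array([-1, 1], dtype='float32')
print(correlate_1d(a, v))
print(np.correlate(a, v, mode='valid'))       # should match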

Now, let's see how we can perform these operations with Numpy. The functions np.correlate and
np.convolve perform correlation and convolution of 1D arrays:

In [30]: cc = np.correlate(a, v, mode='valid')


cc

Out[30]: array([ 0., 0., 0., 0., 1., 0., 0., 0., 0., -1., 0., 0., 0.,
0.], dtype=float32)

In [31]: c = np.convolve(a, v, mode='valid')


c

Out[31]: array([ 0., 0., 0., 0., -1., 0., 0., 0., 0., 1., 0., 0., 0.,
0.], dtype=float32)

Let’s plot a and c and see what they look like:

In [32]: fig, ax = plt.subplots(3, 1, sharex=True)


ax[0].plot(range(0, len(a)), a, 'o-')
ax[0].set_title('a')

ax[1].plot(range(1, len(a)), c, 'o-')


ax[1].set_title('convolution')

ax[2].plot(range(1, len(a)), cc, 'o-')


ax[2].set_title('correlation')
plt.tight_layout()
[Figure: three stacked panels showing the array a (top), its convolution with v (middle), and its correlation with v (bottom).]

Looking at the plot we notice that both convolution and correlation have spikes when there’s a jump in the
array a. Our short filter v looks for jumps in a and the resulting convolution array represents how similar
each window in a is with v.

TIP: Since in a convolutional Neural Network the filter v is learned during the training
process, it makes no difference whether we flip the filter or not. In fact, if we perform the
flip, the network will simply learn flipped weights. For this reason, convolutional layers in a
Neural Network are actually calculating correlations. We still call them convolutional
layers, but the operation performed is actually a correlation. In what follows we will keep
talking about convolutions, but we’ll keep in mind that flipping the array is not actually
necessary in practice.

2D Convolution

We can easily extend the 1D convolution to 2D convolutions using 2D arrays instead of 1D.

Let's say we have an 11x11 array A and a 3x3 filter V.

A contains a pattern in the shape of an "X". For the sake of simplicity, we will also rescale the values of the
array so that the minimum value is -1 and the maximum is +1, but the same concepts apply for different ranges of
values.

In [33]: A = np.zeros(shape=(11, 11))


A[2:-2, 2:-2] = np.diag(np.ones((7,))) + \
np.flip(np.diag(np.ones((7,))), 0)
A[5, 5] = 1
A = A * 2 - 1
A

Out[33]: array([[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
[-1., -1., 1., -1., -1., -1., -1., -1., 1., -1., -1.],
[-1., -1., -1., 1., -1., -1., -1., 1., -1., -1., -1.],
[-1., -1., -1., -1., 1., -1., 1., -1., -1., -1., -1.],
[-1., -1., -1., -1., -1., 1., -1., -1., -1., -1., -1.],
[-1., -1., -1., -1., 1., -1., 1., -1., -1., -1., -1.],
[-1., -1., -1., 1., -1., -1., -1., 1., -1., -1., -1.],
[-1., -1., 1., -1., -1., -1., -1., -1., 1., -1., -1.],
[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.]])

Let’s display A with a gray colormap (because we’re working with numbers between -1 and 1):

In [34]: plt.figure(figsize=(5.5, 5.5))


plt.imshow(A, cmap='gray')
plt.title('A')

Out[34]: Text(0.5, 1.0, 'A')


[Figure: the 11x11 array A displayed with a gray colormap, showing the 'X' pattern.]

V is a 3x3 filter with a diagonal line:

In [35]: V = np.diag([1, 1, 1])


V = V * 2 - 1
V

Out[35]: array([[ 1, -1, -1],


[-1, 1, -1],
[-1, -1, 1]])

Let’s plot the v filter as well:

In [36]: plt.figure(figsize=(1.5, 1.5))


plt.imshow(V, cmap = 'gray')
plt.title('V')

Out[36]: Text(0.5, 1.0, 'V')

[Figure: the 3x3 filter V displayed with a gray colormap.]

The 2D convolution can be calculated with the convolve2d function from scipy.signal. Let's import
the function first:

In [37]: from scipy.signal import convolve2d, correlate2d

Let’s run the convolve2d function over A using the V tensor:

In [38]: C = convolve2d(A, V, mode='valid')


C

Out[38]: array([[ 5., 1., 1., 3., 3., 3., 5., 1., 1.],
[ 1., 7., -1., 1., 3., 5., -1., 3., 1.],
[ 1., -1., 9., -1., 3., -1., 1., -1., 5.],
[ 3., 1., -1., 9., -3., 1., -1., 5., 3.],
[ 3., 3., 3., -3., 5., -3., 3., 3., 3.],
[ 3., 5., -1., 1., -3., 9., -1., 1., 3.],
[ 5., -1., 1., -1., 3., -1., 9., -1., 1.],
[ 1., 3., -1., 5., 3., 1., -1., 7., 1.],
[ 1., 1., 5., 3., 3., 3., 1., 1., 5.]])

The convolved array is obtained by taking the filter V, flipping it on both axes, and then multiplying it with a
3x3 patch in the image. Then we shift the patch to the right and repeat. We start at the first patch on the top
left of the image, A[0:3, 0:3], multiply this patch with V_rev element by element, then sum all the values.

We are effectively contracting the flipped filter V_rev with the patch over both axes:

In [39]: V_rev = np.flip(np.flip(V, 1), 0)



In [40]: np.tensordot(A[0:3, 0:3], V_rev)

Out[40]: array(5.)

This produces the first pixel in the output convolution. We then shift to the right by one pixel in A and repeat
the contraction operation:

In [41]: np.tensordot(A[0:3, 1:4], V_rev)

Out[41]: array(1.)

We can continue doing this and accumulate the result in a new 2D array.

Functionally, we can do the exact same thing that convolve2d does manually, although in practice we
never need to do this thanks to scipy:

In [42]: win_h = V_rev.shape[0]


win_w = V_rev.shape[1]

out_h = A.shape[0] - win_h + 1


out_w = A.shape[1] - win_w + 1

res = np.zeros((out_h, out_w))

for i in range(out_h):
for j in range(out_w):
patch_ij = A[i:i+win_h, j:j+win_w]
try:
res[i, j] = np.tensordot(patch_ij, V_rev)
except Exception as ex:
print(i, j)
print(patch_ij)
print(V)
raise ex

np.allclose(res, C)

Out[42]: True

TIP: the function np.allclose returns True if two arrays are element-wise equal within a
tolerance. See the documentation for details.

Notice that we can rescale the product by the number of elements in the filter V, which is 9, to obtain:

In [43]: C_resc = C / 9
C_resc.round(2)

Out[43]: array([[ 0.56, 0.11, 0.11, 0.33, 0.33, 0.33, 0.56, 0.11, 0.11],
[ 0.11, 0.78, -0.11, 0.11, 0.33, 0.56, -0.11, 0.33, 0.11],
[ 0.11, -0.11, 1. , -0.11, 0.33, -0.11, 0.11, -0.11, 0.56],
[ 0.33, 0.11, -0.11, 1. , -0.33, 0.11, -0.11, 0.56, 0.33],
[ 0.33, 0.33, 0.33, -0.33, 0.56, -0.33, 0.33, 0.33, 0.33],
[ 0.33, 0.56, -0.11, 0.11, -0.33, 1. , -0.11, 0.11, 0.33],
[ 0.56, -0.11, 0.11, -0.11, 0.33, -0.11, 1. , -0.11, 0.11],
[ 0.11, 0.33, -0.11, 0.56, 0.33, 0.11, -0.11, 0.78, 0.11],
[ 0.11, 0.11, 0.56, 0.33, 0.33, 0.33, 0.11, 0.11, 0.56]])

In [44]: plt.imshow(C_resc, cmap='gray')

Out[44]: <matplotlib.image.AxesImage at 0x7f868f68fef0>

[Figure: the rescaled convolution C_resc displayed with a gray colormap.]

Four pixels in the resulting convolution are exactly equal to 1, corresponding to a perfect match of the filter
with the image at those locations. The other pixels have smaller values with varying degrees, indicating
partial match only.

Image filters with convolutions

Convolutions can be used to apply filters to images, for example to blur them or to detect edges. Let's have a
look at one example. We load an example image from keras.datasets.mnist:

In [45]: from tensorflow.keras.datasets import mnist

In [46]: (x_train, y_train), (x_test, y_test) = mnist.load_data()

In [47]: img = x_train[0]

In [48]: img.shape

Out[48]: (28, 28)

In [49]: plt.imshow(img, cmap='gray')

Out[49]: <matplotlib.image.AxesImage at 0x7f8642a17e10>


[Figure: the first MNIST training digit, displayed in grayscale]

Let’s filter this image with 3x3 kernels that recognize lines:

In [50]: f1 = np.array([[ 1, 0, 0],


[ 0, 1, 0],
[ 0, 0, 1]])

f2 = np.array([[ 0, 0, 1],
[ 0, 1, 0],
[ 1, 0, 0]])

f3 = np.array([[-1, -1, -1],


[ 0, 0, 0],
[ 1, 1, 1]])

f4 = np.array([[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]])

Let’s see what these kernels look like visually with the method imshow():

In [51]: plt.figure(figsize=(6, 6))


plt.subplot(221)
ax = plt.imshow(f1, cmap='gray')
plt.title('filter f1')

plt.subplot(222)
plt.imshow(f2, cmap='gray')
plt.title('filter f2')

plt.subplot(223)
plt.imshow(f3, cmap='gray')
plt.title('filter f3')

plt.subplot(224)
plt.imshow(f4, cmap='gray')
plt.title('filter f4')

plt.tight_layout()
plt.show()
[Figure: the four filters f1, f2, f3 and f4 displayed as 3x3 grayscale images]

Now let’s run the 2D convolution on the image and see what these convolutions produce:

In [52]: plt.figure(figsize=(6, 6))

plt.subplot(221)
res = convolve2d(img, f1, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f1')

plt.subplot(222)
res = convolve2d(img, f2, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f2')

plt.subplot(223)
res = convolve2d(img, f3, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f3')

plt.subplot(224)
res = convolve2d(img, f4, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f4')

plt.tight_layout()
plt.show()

[Figure: the MNIST digit filtered with f1, f2, f3 and f4]

Great! We have seen how convolutions can be used to filter images. Each pixel in the filtered image is the
result of a tensor contraction of the filter with a patch in the original image. In this respect, the convolution
is the operation that allows us to leverage the fact that the information in an image is encoded in the spatial
patterns of nearby pixels.

TIP: If you’ve ever used an image program like Adobe Photoshop, these convolutions are
how the image filters are created for images.
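
For instance, a blur filter and an edge-detection filter can be applied to the MNIST digit loaded above with
the same convolve2d call used throughout this section. This is a minimal sketch; the kernel values are
common textbook choices (a 3x3 box blur and a Laplacian-style kernel), not the exact kernels used by any
particular image editor:

# a 3x3 box blur: each output pixel is the average of a 3x3 neighborhood
f_blur = np.ones((3, 3)) / 9.0

# a Laplacian-style kernel: responds where the intensity changes abruptly
f_edge = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])

plt.subplot(121)
plt.imshow(convolve2d(img, f_blur, mode='valid'), cmap='gray')
plt.title('blurred')

plt.subplot(122)
plt.imshow(convolve2d(img, f_edge, mode='valid'), cmap='gray')
plt.title('edges')

plt.tight_layout()
plt.show()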

Backpropagation for Recurrent Networks


The recurrent relations in the general case can be written as:

$$
\begin{aligned}
z_t &= w\, h_{t-1} + u\, x_t && \text{(15.24)} \\
h_t &= \phi(z_t) && \text{(15.25)} \\
r_t &= v\, h_t && \text{(15.26)} \\
\hat{y}_t &= \phi(r_t) && \text{(15.27)}
\end{aligned}
$$

where we replaced the tanh activation function with a generic activation ϕ and allowed for different weights
on the recurrent relation and the output relation.
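
As a concrete reference, here is a minimal NumPy sketch of this forward pass for the scalar case, using
tanh as the generic activation ϕ; the weights and the input sequence are made-up example values, not
taken from any model in the book:

import numpy as np

phi = np.tanh                          # generic activation
w, u, v = 0.5, 1.0, 2.0                # example scalar weights
x = np.array([0.1, -0.3, 0.7, 0.2])    # example input sequence
T = len(x)

z = np.zeros(T); h = np.zeros(T); r = np.zeros(T); y_hat = np.zeros(T)
h_prev = 0.0                           # initial hidden state

for t in range(T):
    z[t] = w * h_prev + u * x[t]       # eq. (15.24)
    h[t] = phi(z[t])                   # eq. (15.25)
    r[t] = v * h[t]                    # eq. (15.26)
    y_hat[t] = phi(r[t])               # eq. (15.27)
    h_prev = h[t]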

The graph of this more general network looks like this:

[Figure: recurrent_2.png — graph of the general recurrent network]

The backpropagation relations can be written as:

$$
\begin{aligned}
\bar{\hat{y}}_t &= \frac{\partial J}{\partial \hat{y}_t} && \text{(15.29)} \\
\bar{r}_t &= \bar{\hat{y}}_t\, \phi'(r_t) && \text{(15.30)} \\
\bar{h}_t &= \bar{r}_t\, v + \bar{z}_{t+1}\, w && \text{(15.31)} \\
\bar{z}_t &= \bar{h}_t\, \phi'(z_t) && \text{(15.32)} \\
\bar{u} &= \sum_{t=0}^{T} \bar{z}_t\, x_t && \text{(15.33)} \\
\bar{v} &= \sum_{t=0}^{T} \bar{r}_t\, h_t && \text{(15.34)} \\
\bar{w} &= \sum_{t=0}^{T} \bar{z}_{t+1}\, h_t && \text{(15.35)}
\end{aligned}
$$

As you can see, these relations are very similar to the fully connected backpropagation relations we saw in
Chapter 5, with one big difference: the updates to the weights require a summation over the contributions
from all time steps.
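
To make that time summation explicit, here is a minimal NumPy sketch of these backward relations,
continuing the scalar forward-pass sketch above. It assumes the arrays z, h, r and y_hat from that sketch
are available and that the loss is a sum of squared errors against some made-up targets, so that
∂J/∂ŷ_t = ŷ_t − y_t:

def phi_prime(a):
    # derivative of tanh: phi'(a) = 1 - tanh(a)^2
    return 1.0 - np.tanh(a) ** 2

targets = np.array([0.0, 0.1, 0.5, 0.3])   # made-up targets
y_hat_bar = y_hat - targets                # eq. (15.29) for a squared-error loss

r_bar = np.zeros(T); h_bar = np.zeros(T); z_bar = np.zeros(T)

# walk backwards through time: h_bar[t] needs z_bar[t + 1]
for t in reversed(range(T)):
    r_bar[t] = y_hat_bar[t] * phi_prime(r[t])            # eq. (15.30)
    z_bar_next = z_bar[t + 1] if t + 1 < T else 0.0
    h_bar[t] = r_bar[t] * v + z_bar_next * w             # eq. (15.31)
    z_bar[t] = h_bar[t] * phi_prime(z[t])                # eq. (15.32)

# the weight gradients sum contributions over all time steps
h_shifted = np.concatenate([[0.0], h[:-1]])   # h_{t-1}, with h_{-1} = 0
u_bar = np.sum(z_bar * x)                     # eq. (15.33)
v_bar = np.sum(r_bar * h)                     # eq. (15.34)
w_bar = np.sum(z_bar * h_shifted)             # eq. (15.35)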

16 Getting Started Exercises Solutions
In [1]: with open('../course/common.py') as fin:
exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:


exec(fin.read())

Exercise 1
Let’s practice a little bit with numpy:

• generate an array of zeros with shape=(10, 10), call it a


• set every other element of a to 1, both along columns and rows, so that you obtain a nice
checkerboard pattern of zeros and ones
• generate a second array to be the sequence from 5 included to 15 excluded, call it b
• multiply a times b in such a way that the first row of a is an alternation of zeros and fives, the second
row is an alternation of zeros and sixes and so on. Call this new array c. To complete this part, you
will have to reshape b as a column array
• calculate the mean and the standard deviation of c along rows and columns
• create a new array of shape=(10, 5) and fill it with the non-zero values of c, call it d
• add random Gaussian noise to d, centered in zero and with a standard deviation of 0.1, call this new
array e

In [3]: a = np.zeros((10, 10))


In [4]: a[::2, ::2] = 1


a[1::2, 1::2] = 1

In [5]: a

Out[5]: array([[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.]])

In [6]: b = np.arange(5, 15)

In [7]: b

Out[7]: array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

In [8]: # c = a * b[:, None]


c = a * b.reshape((10, 1))

In [9]: c

Out[9]: array([[ 5., 0., 5., 0., 5., 0., 5., 0., 5., 0.],
[ 0., 6., 0., 6., 0., 6., 0., 6., 0., 6.],
[ 7., 0., 7., 0., 7., 0., 7., 0., 7., 0.],
[ 0., 8., 0., 8., 0., 8., 0., 8., 0., 8.],
[ 9., 0., 9., 0., 9., 0., 9., 0., 9., 0.],
[ 0., 10., 0., 10., 0., 10., 0., 10., 0., 10.],
[11., 0., 11., 0., 11., 0., 11., 0., 11., 0.],
[ 0., 12., 0., 12., 0., 12., 0., 12., 0., 12.],
[13., 0., 13., 0., 13., 0., 13., 0., 13., 0.],
[ 0., 14., 0., 14., 0., 14., 0., 14., 0., 14.]])

In [10]: c.mean(axis=0)

Out[10]: array([4.5, 5. , 4.5, 5. , 4.5, 5. , 4.5, 5. , 4.5, 5. ])



In [11]: c.mean(axis=1)

Out[11]: array([2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. ])

In [12]: c.std(axis=0)

Out[12]: array([4.9244289 , 5.38516481, 4.9244289 , 5.38516481, 4.9244289 ,


5.38516481, 4.9244289 , 5.38516481, 4.9244289 , 5.38516481])

In [13]: c.std(axis=1)

Out[13]: array([2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. ])

In [14]: d = c[c>0].reshape(10, 5)

In [15]: d

Out[15]: array([[ 5., 5., 5., 5., 5.],


[ 6., 6., 6., 6., 6.],
[ 7., 7., 7., 7., 7.],
[ 8., 8., 8., 8., 8.],
[ 9., 9., 9., 9., 9.],
[10., 10., 10., 10., 10.],
[11., 11., 11., 11., 11.],
[12., 12., 12., 12., 12.],
[13., 13., 13., 13., 13.],
[14., 14., 14., 14., 14.]])

In [16]: noise = np.random.normal(scale=0.1, size=(10, 5))

In [17]: e = d + noise

Exercise 2
Practice plotting with matplotlib:

• use plt.imshow() to display the array a as an image, does it look like a checkerboard?
• display c, d and e using the same function, change the colormap to grayscale

• plot e using a line plot, assigning each row to a different data series. This should produce a plot with
noisy horizontal lines. You will need to transpose the array to obtain this.
• add a title, axes labels, legend and a couple of annotations

In [18]: plt.imshow(a);


In [19]: plt.imshow(c, cmap='Greys');



In [20]: plt.imshow(d, cmap='Greys');



In [21]: plt.imshow(e, cmap='Greys');



In [22]: plt.plot(e.transpose())
plt.title("Noisy lines")
plt.xlabel("the x axis")
plt.xlabel("the y axis")
plt.annotate(xy=(1, 14), xytext=(0, 12.3),
s="The light blue line",
arrowprops={"arrowstyle": '-|>'},
fontsize=12);

[Figure: line plot of the rows of e, titled "Noisy lines", with an annotation pointing at the light blue line]

Exercise 3
Reuse your code

Encapsulate the code that calculates the decision boundary in a nice function called
plot_decision_boundary with the signature:

def plot_decision_boundary(model, X, y):


....

In [23]: def plot_decision_boundary(model, X, y):


hticks = np.linspace(X.min()-0.1, X.max()+0.1, 101)
vticks = np.linspace(X.min()-0.1, X.max()+0.1, 101)
aa, bb = np.meshgrid(hticks, vticks)
ab = np.c_[aa.ravel(), bb.ravel()]
c = model.predict(ab)
cc = c.reshape(aa.shape)
plt.figure(figsize=(7, 7))

plt.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)


plt.plot(X[y==0, 0], X[y==0, 1], 'ob', alpha=0.5)
plt.plot(X[y==1, 0], X[y==1, 1], 'xr', alpha=0.5)
plt.title("Blue circles and Red crosses");

Exercise 4
Practice retraining the model on different data:

• use the functions make_blobs and make_moons from Scikit-Learn to generate new datasets with two
classes
• plot the data to make sure you understand it
• re-train your model on each of these datasets
• display the decision boundary for each of these models

In [24]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

In [25]: from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000,
noise=0.1,
factor=0.2,
random_state=0)

In [26]: model = Sequential()


model.add(Dense(4, input_shape=(2,), activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(SGD(lr=0.5),
'binary_crossentropy',
metrics=['accuracy'])
model.fit(X, y, epochs=30, verbose=0);

In [27]: plot_decision_boundary(model, X, y)

[Figure: decision boundary for the circles dataset, titled "Blue circles and Red crosses"]

In [28]: from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000,
centers=2,
random_state=0)

In [29]: model = Sequential()


model.add(Dense(4, input_shape=(2,), activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(SGD(lr=0.5),
'binary_crossentropy',

metrics=['accuracy'])
model.fit(X, y, epochs=30, verbose=0);

In [30]: plot_decision_boundary(model, X, y)

[Figure: decision boundary for the blobs dataset, titled "Blue circles and Red crosses"]

In [31]: from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000,
noise=0.1,
random_state=0)

In [32]: model = Sequential()


model.add(Dense(4, input_shape=(2,), activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(SGD(lr=0.5),
'binary_crossentropy',
metrics=['accuracy'])
model.fit(X, y, epochs=30, verbose=0);

In [33]: plot_decision_boundary(model, X, y)

[Figure: decision boundary for the moons dataset, titled "Blue circles and Red crosses"]

17 Data Manipulation Exercises Solutions
In [1]: with open('../course/common.py') as fin:
exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:


exec(fin.read())

Exercise 1
• load the dataset: ../data/international-airline-passengers.csv
• inspect it using the .info() and .head() commands
• use the function pd.to_datetime() to change the column type of ‘Month’ to a DateTime type (you
can find the doc here)
• set the index of df to be a DateTime index using the column ‘Month’ and the df.set_index()
method
• choose the appropriate plot and display the data
• choose appropriate scale
• label the axes

In [3]: fname_ = '../data/international-airline-passengers.csv'


df = pd.read_csv(fname_)

In [4]: # - inspect it using the .info() and .head() commands


df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
Month 144 non-null object
Thousand Passengers 144 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.3+ KB

In [5]: df.head()

Out[5]:

Month Thousand Passengers


0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121

In [6]: # - use the function to_datetime() to change the
# column type of 'Month' to a DateTime type
# - set the index of df to be a DateTime index using
# the column 'Month' and the set_index() method

df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')

In [7]: df.head()

Out[7]:

Thousand Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121

In [8]: # - choose the appropriate plot and display the data


# - choose appropriate scale
# - label the axes

df.plot();

[Figure: line plot of Thousand Passengers by Month, 1949 to 1960]

Exercise 2
• load the dataset: ../data/weight-height.csv
• inspect it
• plot it using a scatter plot with Weight as a function of Height
• plot the male and female populations with two different colors on a new scatter plot
• remember to label the axes

In [9]: # - load the dataset: ../data/weight-height.csv


# - inspect it
df = pd.read_csv('../data/weight-height.csv')
df.head()

Out[9]:

Gender Height Weight


0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801

In [10]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
Gender 10000 non-null object
Height 10000 non-null float64
Weight 10000 non-null float64
dtypes: float64(2), object(1)
memory usage: 234.5+ KB

In [11]: df.describe()

Out[11]:

Height Weight
count 10000.000000 10000.000000
mean 66.367560 161.440357
std 3.847528 32.108439
min 54.263133 64.700127
25 63.505620 135.818051
50 66.318070 161.212928
75 69.174262 187.169525
max 78.998742 269.989699

In [12]: df['Gender'].value_counts()

Out[12]:

Gender
Male 5000
Female 5000

In [13]: # - plot it using a scatter plot with Weight as a


# function of Height
df.plot(kind='scatter', x='Height', y='Weight');

[Figure: scatter plot of Weight vs. Height]

In [14]: # - plot the male and female populations with 2


# different colors on a new scatter plot
# - remember to label the axes

# this can be done in several ways, showing 3 here:


# method 1
males = df[df['Gender'] == 'Male']
females = df.query('Gender == "Female"')
fig, ax = plt.subplots()

males.plot(kind='scatter', x='Height', y='Weight',


ax=ax, color='blue', alpha=0.3,
title='Male & Female Populations')

females.plot(kind='scatter', x='Height', y='Weight',


ax=ax, color='red', alpha=0.3);

Male & Female Populations

250

200
Weight

150

100

55 60 65 70 75 80
Height

In [15]: # method 2
mfmap = {'Male': 'blue', 'Female': 'red'}
df['Gendercolor'] = df['Gender'].map(mfmap)
df.head()

Out[15]:

Gender Height Weight Gendercolor


0 Male 73.847017 241.893563 blue
1 Male 68.781904 162.310473 blue
2 Male 74.110105 212.740856 blue
3 Male 71.730978 220.042470 blue
4 Male 69.881796 206.349801 blue

In [16]: df.plot(kind='scatter',
x='Height',
y='Weight',
c=df['Gendercolor'],

alpha=0.3,
title='Male & Female Populations');

Male & Female Populations

250

200
Weight

150

100

55 60 65 70 75 80
Height

In [17]: # method 3
fig, ax = plt.subplots()
ax.plot(males['Height'], males['Weight'], 'ob',
females['Height'], females['Weight'], 'or',
alpha=0.3)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Male & Female Populations');

Male & Female Populations

250

200
Weight

150

100

55 60 65 70 75 80
Height

Exercise 3
• plot the histogram of the heights for males and females on the same plot
• use alpha to control transparency in the plot command
• plot a vertical line at the mean of each population using plt.axvline()
• bonus: plot the cumulative distributions

In [18]: males['Height'].plot(kind='hist',
bins=50,
range=(50, 80),
alpha=0.3,
color='blue')

females['Height'].plot(kind='hist',
bins=50,
range=(50, 80),
alpha=0.3,
color='red')

plt.title('Height distribution')

plt.legend(["Males", "Females"])
plt.xlabel("Heigth (in)")

plt.axvline(males['Height'].mean(),
color='blue', linewidth=2)

plt.axvline(females['Height'].mean(),
color='red', linewidth=2);

[Figure: histogram "Height distribution" for Males and Females, with vertical lines at each population mean]

In [19]: males['Height'].plot(kind='hist',
bins=200,
range=(50, 80),
alpha=0.3,
color='blue',
cumulative=True,
normed=True)

females['Height'].plot(kind='hist',
bins=200,

range=(50, 80),
alpha=0.3,
color='red',
cumulative=True,
normed=True)

plt.title('Height distribution')
plt.legend(["Males", "Females"])
plt.xlabel("Heigth (in)")

plt.axhline(0.8)
plt.axhline(0.5)
plt.axhline(0.2);

[Figure: cumulative height distributions for Males and Females, with horizontal lines at 0.2, 0.5 and 0.8]

Exercise 4
• plot the weights of the males and females using a box plot
• which one is easier to read?
• (remember to put in titles, axes, and legends)

In [20]: dfpvt = df.pivot(columns = 'Gender', values = 'Weight')

In [21]: dfpvt.head()

Out[21]:

Gender Female Male


0 NaN 241.893563
1 NaN 162.310473
2 NaN 212.740856
3 NaN 220.042470
4 NaN 206.349801

In [22]: dfpvt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 2 columns):
Female 5000 non-null float64
Male 5000 non-null float64
dtypes: float64(2)
memory usage: 234.4 KB

In [23]: dfpvt.plot(kind='box')
plt.title('Weight Box Plot')
plt.ylabel("Weight (lbs)");

[Figure: box plot "Weight Box Plot" of Weight (lbs) for Female and Male]

Exercise 5
• load the dataset: ../data/titanic-train.csv
• learn about scattermatrix here
• display the data using a scattermatrix

In [24]: df = pd.read_csv('../data/titanic-train.csv')
df.head()

Out[24]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Hea... female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [25]: from pandas.plotting import scatter_matrix

In [26]: scatter_matrix(df.drop('PassengerId', axis=1),


figsize=(10, 10));

[Figure: scatter matrix of the Titanic dataset features]
18 Machine Learning Exercises Solutions
In [1]: with open('../course/common.py') as fin:
exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:


exec(fin.read())

Exercise 1
You just started working at a real estate investment firm, and they would like you to build a model for
pricing houses. You receive a dataset that contains data for house prices and a few features like “number of
bedrooms”, “size in square feet” and “age of the house”. Let’s see if you can build a model that can predict the
price. In this exercise, we extend what we have learned about linear regression to a dataset with more than
one feature. Here are the steps to complete it:

1. load the dataset ../data/housing-data.csv

• plot the histograms for each feature


• create two variables called X and y: X shall be a matrix with three columns (sqft, bdrms, age) and y
shall be a vector with one column (price)
• create a linear regression model in Keras with the appropriate number of inputs and output
• split the data into train and test with a 20% test size
• train the model on the training set and check its accuracy on training and test set
• how’s your model doing? Is the loss growing smaller?


• try to improve your model with these experiments:

– normalize the input features with one of the rescaling techniques mentioned above
– use a different value for the learning rate of your model
– use a different optimizer

• once you’re satisfied with the training, check the R 2 on the test set

In [3]: # Load the dataset ../data/housing-data.csv


df = pd.read_csv('../data/housing-data.csv')
df.head()

Out[3]:

sqft bdrms age price


0 2104 3 70 399900
1 1600 3 28 329900
2 2400 3 44 369000
3 1416 2 49 232000
4 3000 4 75 539900

In [4]: # plot the histograms for each feature


plt.figure(figsize=(15, 5))
for i, feature in enumerate(df.columns):
plt.subplot(1, 4, i+1)
df[feature].plot(kind='hist', title=feature)
plt.xlabel(feature)

plt.tight_layout()

[Figure: histograms of sqft, bdrms, age and price]

In [5]: # create 2 variables called X and y:


# X shall be a matrix with 3 columns (sqft,bdrms,age)
# and y shall be a vector with 1 column (price)
X = df[['sqft', 'bdrms', 'age']].values
y = df['price'].values

In [6]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD

In [7]: # create a linear regression model in Keras


# with the appropriate number of inputs and output
model = Sequential()
model.add(Dense(1, input_dim=3))
model.compile(Adam(lr=0.8), 'mean_squared_error')

In [8]: from sklearn.model_selection import train_test_split

In [9]: # split the data into train and test with a 20% test
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.2)

In [10]: # train the model on the training set and check its
# accuracy on training and test set
# how's your model doing? Is the loss growing smaller?
model.fit(X_train, y_train, epochs=20, verbose=0);

In [11]: df.describe()

Out[11]:

sqft bdrms age price


count 47.000000 47.000000 47.000000 47.000000
mean 2000.680851 3.170213 42.744681 340412.659574
std 794.702354 0.760982 22.873440 125039.899586
min 852.000000 1.000000 5.000000 169900.000000
25 1432.000000 3.000000 24.500000 249900.000000
50 1888.000000 3.000000 44.000000 299900.000000
75 2269.000000 4.000000 61.500000 384450.000000
max 4478.000000 5.000000 79.000000 699900.000000

In [12]: # try to improve your model with these experiments:


# - normalize the input features with one of the
# rescaling techniques mentioned above
# - use a different value for the learning rate of
# your model
# - use a different optimizer
df['sqft1000'] = df['sqft']/1000.0
df['age10'] = df['age']/10.0
df['price100k'] = df['price']/1e5

In [13]: X = df[['sqft1000', 'bdrms', 'age10']].values


y = df['price100k'].values

In [14]: X_train, X_test, y_train, y_test = \


train_test_split(X, y, test_size=0.2)

In [15]: model = Sequential()


model.add(Dense(1, input_dim=3))
model.compile(Adam(lr=0.1), 'mean_squared_error')
model.fit(X_train, y_train, epochs=20, verbose=0);

In [16]: from sklearn.metrics import r2_score

In [17]: # once you're satisfied with training, check the


# R2score on the test set

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

r_ = r2_score(y_train, y_train_pred)
print("R2 score on Train set is:\t{:0.3f}".format(r_))

r_ = r2_score(y_test, y_test_pred)
print("R2 score on Test set is:\t{:0.3f}".format(r_))

R2 score on Train set is: 0.664


R2 score on Test set is: 0.809

Exercise 2
Your boss was delighted with your work on the housing price prediction model and decided to entrust you
with a more challenging task. They’ve seen many people leave the company recently and they would like to
understand why that’s happening. They have collected historical data on employees, and they would like you
to build a model that can predict which employee will leave next. They would like a model that is better than
random guessing. They also prefer false negatives to false positives in this first phase. Fields in the dataset
include:

• Employee satisfaction level


• Last evaluation
• Number of projects
• Average monthly hours
• Time spent at the company
• Whether they have had a work accident
• Whether they have had a promotion in the last five years
• Department
• Salary
• Whether the employee has left

Your goal is to predict the binary outcome variable left using the rest of the data. Since the outcome is
binary, this is a classification problem. Here are some things you may want to try out:

1. load the dataset at ../data/HR_comma_sep.csv, inspect it with .head(), .info() and .describe().

• Establish a benchmark: what would be your accuracy score if you predicted everyone stays?
• Check if any feature needs rescaling. You may plot a histogram of the feature to decide which
rescaling method is more appropriate
• convert the categorical features into binary dummy columns. You will then have to combine them
with the numerical features using pd.concat
• do the usual train/test split with a 20% test size
• play around with learning rate and optimizer
• check the confusion matrix, precision, and recall
• check if you still get the same results if you use 5-Fold cross-validation on all the data
• Is the model good enough for your boss?

As you will see in this exercise, this logistic regression model is not good enough to help your boss. In the
next chapter, we will learn how to go beyond linear models.

This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under CC


BY-SA 4.0 License.

In [18]: # load the dataset at ../data/HR_comma_sep.csv, inspect


# it with `.head()`, `.info()` and `.describe()`.

df = pd.read_csv('../data/HR_comma_sep.csv')

In [19]: df.head()

Out[19]:

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

In [20]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level 14999 non-null float64
last_evaluation 14999 non-null float64
number_project 14999 non-null int64
average_montly_hours 14999 non-null int64
time_spend_company 14999 non-null int64
Work_accident 14999 non-null int64
left 14999 non-null int64
promotion_last_5years 14999 non-null int64
sales 14999 non-null object
salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

In [21]: df.describe()

Out[21]:

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years


count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

In [22]: # Establish a benchmark: what would be your accuracy


# score if you predicted everyone stay?

df.left.value_counts() / len(df)

Out[22]:

left
0 0.761917
1 0.238083

Predicting 0 all the time would yield an accuracy of 76%.

In [23]: # Check if any feature needs rescaling.


# You may plot a histogram of the feature to decide
# which rescaling method is more appropriate.
df['average_montly_hours'].plot(kind='hist');

[Figure: histogram of average_montly_hours]

In [24]: df['average_montly_hours_100'] = \
df['average_montly_hours']/100.0

In [25]: df['time_spend_company'].plot(kind='hist');

[Figure: histogram of time_spend_company]

In [26]: # convert the categorical features into binary dummy columns.


# You will then have to combine them with
# the numerical features using `pd.concat`.
df_dummies = pd.get_dummies(df[['sales', 'salary']])

In [27]: df.columns

Out[27]: Index(['satisfaction_level', 'last_evaluation', 'number_project',


'average_montly_hours', 'time_spend_company', 'Work_accident',
'left',
'promotion_last_5years', 'sales', 'salary',
'average_montly_hours_100'],
dtype='object')

In [28]: X = pd.concat([df[['satisfaction_level',
'last_evaluation',
'number_project',
'time_spend_company',
'Work_accident',
'promotion_last_5years',
'average_montly_hours_100']],

df_dummies], axis=1).values
y = df['left'].values

In [29]: X.shape

Out[29]: (14999, 20)

In [30]: # do the usual train/test split with a 20% test size

X_train, X_test, y_train, y_test = \


train_test_split(X, y, test_size=0.2)

In [31]: # play around with learning rate and optimizer

model = Sequential()
model.add(Dense(1, input_dim=20, activation='sigmoid'))
model.compile(Adam(lr=0.5),
'binary_crossentropy',
metrics=['accuracy'])

In [32]: model.fit(X_train, y_train);

11999/11999 [==============================] - 1s 78us/sample - loss: 0.5453


- accuracy: 0.7622

In [33]: y_test_pred = model.predict_classes(X_test)

In [34]: from sklearn.metrics import confusion_matrix


from sklearn.metrics import classification_report

In [35]: def pretty_confusion_matrix(y_true, y_pred,


labels=["False", "True"]):
cm = confusion_matrix(y_true, y_pred)
pred_labels = ['Predicted '+ l for l in labels]
df = pd.DataFrame(cm,
index=labels,
columns=pred_labels)
return df

In [36]: # check the confusion matrix, precision and recall

pretty_confusion_matrix(y_test, y_test_pred,
labels=['Stay', 'Leave'])

Out[36]:

Predicted Stay Predicted Leave


Stay 2304 1
Leave 695 0

In [37]: print(classification_report(y_test, y_test_pred))

precision recall f1-score support

0 0.77 1.00 0.87 2305


1 0.00 0.00 0.00 695

micro avg 0.77 0.77 0.77 3000


macro avg 0.38 0.50 0.43 3000
weighted avg 0.59 0.77 0.67 3000

In [38]: from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

In [39]: # check if you still get the same results if you use a 5-Fold cross-validation on all the data

def build_logistic_regr():
model = Sequential()
model.add(Dense(1, input_dim=20, activation='sigmoid'))
model.compile(Adam(lr=0.5),
'binary_crossentropy',
metrics=['accuracy'])
return model

model = KerasClassifier(build_fn=build_logistic_regr,
epochs=10, verbose=0)

In [40]: from sklearn.model_selection import cross_val_score, KFold

In [41]: cv = KFold(5, shuffle=True)


scores = cross_val_score(model, X, y, cv=cv)

print("Cross val accuracy is {:0.4f} ± {:0.4f}".format(


scores.mean(), scores.std()))

Cross val accuracy is 0.7631 ± 0.0317

In [42]: scores

Out[42]: array([0.79066664, 0.76866668, 0.70133334, 0.77766669, 0.77725911])

In [43]: # Is the model good enough for your boss?

No, the model is not good enough for my boss, since it performs no better than the benchmark.

19 Deep Learning Exercises Solutions
In [1]: with open('../course/common.py') as fin:
exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:


exec(fin.read())

In [3]: import seaborn as sns

Exercise 1
The Pima Indians dataset is a very famous dataset distributed by UCI and originally collected from the
National Institute of Diabetes and Digestive and Kidney Diseases. It contains data from clinical exams for
women aged 21 and above of Pima Indian origin. The objective is to predict, based on diagnostic
measurements, whether a patient has diabetes.

It has the following features:

• Pregnancies: Number of times pregnant


• Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• BloodPressure: Diastolic blood pressure (mm Hg)
• SkinThickness: Triceps skin fold thickness (mm)
• Insulin: 2-Hour serum insulin (mu U/ml)
• BMI: Body mass index (weight in kg / (height in m)^2)


• DiabetesPedigreeFunction: Diabetes pedigree function


• Age: Age (years)

The last column is the outcome, and it is a binary variable.

In this first exercise we will explore it through the following steps:

1. Load the ../data/diabetes.csv dataset, use pandas to explore the range of each feature

• For each feature draw a histogram. Bonus points if you draw all the histograms in the same figure.
• Explore correlations of features with the outcome column. You can do this in several ways, for
example using the sns.pairplot we used above or drawing a heatmap of the correlations.
• Do features need standardization? If so what standardization technique will you use? MinMax?
Standard?
• Prepare your final X and y variables to be used by an ML model. Make sure you define your target
variable well. Will you need dummy columns?

In [4]: df = pd.read_csv('../data/diabetes.csv')
df.head()

Out[4]:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome


0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

In [5]: df.hist(figsize=(12, 10))


plt.tight_layout()

[Figure: histograms of Age, BMI, BloodPressure, DiabetesPedigreeFunction, Glucose, Insulin, Outcome, Pregnancies and SkinThickness]

In [6]: sns.pairplot(df, hue='Outcome');



[Figure: seaborn pairplot of all features, colored by Outcome]

In [7]: sns.heatmap(df.corr(), annot = True);



[Figure: annotated heatmap of the feature correlations]

In [8]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [9]: df.describe()

Out[9]:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome


count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

In [10]: from sklearn.preprocessing import StandardScaler

In [11]: from tensorflow.keras.utils import to_categorical

In [12]: sc = StandardScaler()
X = sc.fit_transform(df.drop('Outcome', axis=1))
y = df['Outcome'].values
y_cat = to_categorical(y)

Exercise 2
Build a fully connected NN model that predicts diabetes. Follow these steps:

1. split your data in a train/test with a test size of 20% and a random_state = 22

• define a sequential model with at least one inner layer. You will have to make choices for the following
things:
– what is the size of the input?
– how many nodes will you use in each layer?
– what is the size of the output?
– what activation functions will you use in the inner layers?
– what activation function will you use at the output?
– what loss function will you use?
– what optimizer will you use?
• fit your model on the training set, using a validation_split of 0.1
• test your trained model on the test data from the train/test split
• check the accuracy score, the confusion matrix and the classification report

In [13]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, Adam
from sklearn.model_selection import train_test_split

In [14]: X.shape

Out[14]: (768, 8)

In [15]: X_train, X_test, y_train, y_test = \


train_test_split(X, y_cat,
random_state=22, test_size=0.2)

In [16]: model = Sequential()


model.add(Dense(32, input_shape=(8,), activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(Adam(lr=0.05),
loss='categorical_crossentropy',
metrics=['accuracy'])

In [17]: model.fit(X_train, y_train, epochs=20,


verbose=0, validation_split=0.1);

In [18]: y_pred = model.predict(X_test)

In [19]: y_test_class = np.argmax(y_test, axis=1)


y_pred_class = np.argmax(y_pred, axis=1)

In [20]: from sklearn.metrics import accuracy_score


from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [21]: accuracy_score(y_test_class, y_pred_class)

Out[21]: 0.7337662337662337

In [22]: print(classification_report(y_test_class, y_pred_class))

precision recall f1-score support

0 0.74 0.92 0.82 100


1 0.72 0.39 0.51 54

micro avg 0.73 0.73 0.73 154


macro avg 0.73 0.65 0.66 154
weighted avg 0.73 0.73 0.71 154

In [23]: confusion_matrix(y_test_class, y_pred_class)

Out[23]: array([[92, 8],


[33, 21]])

Exercise 3
Compare your work with the results presented in this notebook. Are your Neural Network results better or
worse than the results obtained by traditional Machine Learning techniques?

• Try training a Support Vector Machine or a Random Forest model on the same train/test split. Is the
performance better or worse?
• Try restricting your features to only four features like in the suggested notebook. How does model
performance change?

In [24]: from sklearn.ensemble import RandomForestClassifier


from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

for mod in [RandomForestClassifier(), SVC(), GaussianNB()]:


mod.fit(X_train, y_train[:, 1])
y_pred = mod.predict(X_test)
print("="*80)
print(mod)
print("-"*80)
acc_ = accuracy_score(y_test_class, y_pred)
print("Accuracy score: {:0.3}".format(acc_))
print("Confusion Matrix:")
print(confusion_matrix(y_test_class, y_pred))
print()

============================================================================
====
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
----------------------------------------------------------------------------
----
Accuracy score: 0.727
Confusion Matrix:
[[88 12]
[30 24]]

============================================================================
====
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
----------------------------------------------------------------------------
----
Accuracy score: 0.721
Confusion Matrix:
[[89 11]
[32 22]]

============================================================================
====
GaussianNB(priors=None, var_smoothing=1e-09)
----------------------------------------------------------------------------
----
Accuracy score: 0.708
Confusion Matrix:
[[87 13]
[32 22]]
20 Deep Learning Internals Exercises Solutions
In [1]: with open('../course/common.py') as fin:
exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:


exec(fin.read())

Exercise 1
You’ve just started to work at a wine company, and they would like you to help them build a model that
predicts the quality of their wine based on several measurements. They give you a dataset with wine:

• load the ../data/wines.csv into Pandas


• use the column called “Class” as the target
• check how many classes are there in the target, and if necessary use dummy columns for a Multiclass
classification
• use all the other columns as features, check their range and distribution (using seaborn pairplot)
• rescale all the features using either MinMaxScaler or StandardScaler
• build a deep model with at least one hidden layer to classify the data
• choose the cost function, what will you use? Mean Squared Error? Binary Cross-Entropy?
Categorical Cross-Entropy?
• choose an optimizer
• choose a value for the learning rate. You may want to try with several values
• choose a batch size
• train your model on all the data using a validation_split=0.2. Can you converge to 100%
validation accuracy?


• what’s the minimum number of epochs to converge?


• repeat the training several times to verify how stable your results are

In [3]: df = pd.read_csv('../data/wines.csv')

In [4]: df.head()

Out[4]:
Class Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280-OD315_of_diluted_wines Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [5]: y = df['Class']

In [6]: y.value_counts()

Out[6]:

Class
2 71
1 59
3 48

In [7]: y_cat = pd.get_dummies(y)

In [8]: y_cat.head()

Out[8]:

1 2 3
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0

In [9]: X = df.drop('Class', axis=1)



In [10]: X.shape

Out[10]: (178, 13)

In [11]: import seaborn as sns

In [12]: sns.pairplot(df, hue='Class')

Out[12]: <seaborn.axisgrid.PairGrid at 0x7ff5ba3bef28>

[Figure: seaborn pairplot of the wine features, colored by Class]

In [13]: from sklearn.preprocessing import StandardScaler

In [14]: sc = StandardScaler()

In [15]: Xsc = sc.fit_transform(X)

In [16]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, Adam, Adadelta, RMSprop
import tensorflow.keras.backend as K

In [17]: K.clear_session()
model = Sequential()
model.add(Dense(5, input_shape=(13,),
kernel_initializer='he_normal',
activation='relu'))
model.add(Dense(3, activation='softmax'))

model.compile(RMSprop(lr=0.1),
'categorical_crossentropy',
metrics=['accuracy'])

model.fit(Xsc, y_cat.values,
batch_size=8,
epochs=10,
verbose=0,
validation_split=0.2);

Exercise 2
Since this dataset has 13 features, we can only visualize pairs of features as we did in the pairplot. We could,
however, exploit the fact that a Neural Network is a function to extract two high-level features to represent
our data.

• build a deep fully connected network with the following structure:


– Layer 1: 8 nodes
– Layer 2: 5 nodes
– Layer 3: 2 nodes
– Output: 3 nodes
• choose activation functions, initializations, optimizer, and learning rate so that it converges to 100%
accuracy within 20 epochs (not easy)

• remember to train the model on the scaled data


• define a Feature Function as we did above between the input of the 1st layer and the output of the 3rd
layer
• calculate the features and plot them on a 2-dimensional scatter plot
• can we distinguish the three classes well?

In [18]: K.clear_session()
model = Sequential()
model.add(Dense(8, input_shape=(13,),
kernel_initializer='he_normal',
activation='tanh'))
model.add(Dense(5, kernel_initializer='he_normal',
activation='tanh'))
model.add(Dense(2, kernel_initializer='he_normal',
activation='tanh'))
model.add(Dense(3, activation='softmax'))

model.compile(RMSprop(lr=0.05),
'categorical_crossentropy',
metrics=['accuracy'])

model.fit(Xsc, y_cat.values,
batch_size=16,
epochs=20,
verbose=0);

In [19]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 8) 112
_________________________________________________________________
dense_1 (Dense) (None, 5) 45
_________________________________________________________________
dense_2 (Dense) (None, 2) 12
_________________________________________________________________
dense_3 (Dense) (None, 3) 9
=================================================================
Total params: 178
Trainable params: 178
Non-trainable params: 0
_________________________________________________________________

In [20]: inp = model.layers[0].input


out = model.layers[2].output

In [21]: features_function = K.function([inp], [out])

In [22]: features = features_function([Xsc])[0]

In [23]: features.shape

Out[23]: (178, 2)

In [24]: plt.scatter(features[:, 0], features[:, 1], c=y);

[Figure: scatter plot of the two extracted features, colored by class]

Exercise 3
Keras functional API. So far we’ve always used the Sequential model API in Keras. However, Keras also
offers a Functional API, which is much more powerful. You can find its documentation here. Let’s see how
we can leverage it.

• define an input layer called inputs


• define two hidden layers as before, one with eight nodes, one with five nodes
• define a second_to_last layer with 2 nodes

• define an output layer with three nodes


• create a model that connects input and output
• train it and make sure that it converges
• define a function between inputs and second_to_last layer
• recalculate the features and plot them

In [25]: from tensorflow.keras.layers import Input


from tensorflow.keras.models import Model

In [26]: K.clear_session()

inputs = Input(shape=(13,))
x = Dense(8, kernel_initializer='he_normal',
activation='tanh')(inputs)
x = Dense(5, kernel_initializer='he_normal',
activation='tanh')(x)
second_to_last = Dense(2, kernel_initializer='he_normal',
activation='tanh')(x)
outputs = Dense(3, activation='softmax')(second_to_last)

model = Model(inputs=inputs, outputs=outputs)

model.compile(RMSprop(lr=0.05),
'categorical_crossentropy',
metrics=['accuracy'])

model.fit(Xsc, y_cat.values, batch_size=16,


epochs=20, verbose=0);

In [27]: features_function = K.function([inputs], [second_to_last])

In [28]: features = features_function([Xsc])[0]

In [29]: plt.scatter(features[:, 0], features[:, 1], c=y);



[Figure: scatter plot of the two features from the second_to_last layer, colored by class]

Exercise 4
Keras offers the possibility to call a function at each epoch. These are Callbacks, and their documentation is
here. Callbacks allow us to add some neat functionality. In this exercise, we’ll explore a few of them.

• Split the data into train and test sets with a test_size = 0.3 and random_state=42
• Reset and recompile your model
• train the model on the train data using validation_data=(X_test, y_test)
• Use the EarlyStopping callback to stop your training if the val_loss doesn’t improve
• Use the ModelCheckpoint callback to save the trained model to disk once training is over
• Use the TensorBoard callback to output your training information to a /tmp/ subdirectory

You can use tensorboard in the notebook by running the following two commands:

%load_ext tensorboard.notebook

%tensorboard --logdir /tmp/ztdlbook/tensorboard/

You can also run tensorboard in a separate terminal with the command:

tensorboard --logdir /tmp/ztdlbook/tensorboard/



and then open another browser window at address: http://localhost:6006.

In [30]: from tensorflow.keras.callbacks import ModelCheckpoint


from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import TensorBoard

In [31]: checkpointer = ModelCheckpoint(


filepath="/tmp/ztdlbook/weights.hdf5",
verbose=1, save_best_only=True)

In [32]: earlystopper = EarlyStopping(


monitor='val_loss', min_delta=0, patience=1,
verbose=1, mode='auto')

In [33]: tensorboard = TensorBoard(


log_dir='/tmp/ztdlbook/tensorboard/')

In [34]: from sklearn.model_selection import train_test_split

In [35]: X_train, X_test, y_train, y_test = \


train_test_split(Xsc, y_cat.values,
test_size=0.3, random_state=42)

In [36]: K.clear_session()

inputs = Input(shape=(13,))

x = Dense(8, kernel_initializer='he_normal',
activation='tanh')(inputs)

x = Dense(5, kernel_initializer='he_normal',
activation='tanh')(x)

second_to_last = Dense(2, kernel_initializer='he_normal',


activation='tanh')(x)

outputs = Dense(3, activation='softmax')(second_to_last)

model = Model(inputs=inputs, outputs=outputs)

model.compile(RMSprop(lr=0.05),
'categorical_crossentropy',

metrics=['accuracy'])

callbacks_ = [checkpointer, earlystopper, tensorboard]

model.fit(X_train, y_train, batch_size=32,


epochs=20, verbose=0,
validation_data=(X_test, y_test),
callbacks=callbacks_);

Epoch 00001: val_loss improved from inf to 0.43786, saving model to


/tmp/ztdlbook/weights.hdf5

Epoch 00002: val_loss improved from 0.43786 to 0.31872, saving model to


/tmp/ztdlbook/weights.hdf5

Epoch 00003: val_loss improved from 0.31872 to 0.23610, saving model to


/tmp/ztdlbook/weights.hdf5

Epoch 00004: val_loss improved from 0.23610 to 0.18592, saving model to


/tmp/ztdlbook/weights.hdf5

Epoch 00005: val_loss did not improve from 0.18592


Epoch 00005: early stopping

To run Tensorboard uncomment the next two cells

In [37]: # %load_ext tensorboard.notebook

In [38]: # %tensorboard --logdir /tmp/ztdlbook/tensorboard/


21 Convolutional Neural Networks Exercises Solutions
In [1]: with open('../course/common.py') as fin:
exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:


exec(fin.read())

Exercise 1
You’ve been hired by a shipping company to overhaul the way they route mail, parcels, and packages. They
want to build an image recognition system capable of recognizing the digits in the zip code on a package
automatically route it to the correct location. You are tasked to build the digit recognition system. Luckily,
you can rely on the MNIST dataset for the initial training of your model!

Build a deep convolutional Neural Network with at least two convolutional and two pooling layers before
the fully connected layer:

• start from the network we have just built


• insert one more Conv2D, MaxPooling2D and Activation pancake. You will have to choose the
number of filters in this convolutional layer
• retrain the model
• does performance improve?
• how many parameters does this new model have? More or less than the previous model? Why?
• how long did this second model take to train? Longer or shorter than the previous model? Why?
• did it perform better or worse than the previous model?


In [3]: from tensorflow.keras.utils import to_categorical

In [4]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D
from tensorflow.keras.layers import Flatten, Activation
import tensorflow.keras.backend as K

In [5]: from tensorflow.keras.datasets import mnist

In [6]: (X_train, y_train), (X_test, y_test) = mnist.load_data()

In [7]: X_train.shape

Out[7]: (60000, 28, 28)

In [8]: X_train = X_train.astype('float32') / 255.0


X_test = X_test.astype('float32') / 255.0

X_train = X_train.reshape(-1, 28, 28, 1)


X_test = X_test.reshape(-1, 28, 28, 1)

y_train_cat = to_categorical(y_train, 10)


y_test_cat = to_categorical(y_test, 10)

In [9]: model = Sequential()

model.add(Conv2D(32, (3, 3), kernel_initializer='normal',


input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation('relu'))

model.add(Conv2D(32, (3, 3), kernel_initializer='normal'))


model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation('relu'))

model.add(Flatten())

model.add(Dense(64, activation='relu'))

model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',

optimizer='rmsprop',
metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
activation (Activation) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 11, 11, 32) 9248
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 32) 0
_________________________________________________________________
activation_1 (Activation) (None, 5, 5, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 800) 0
_________________________________________________________________
dense (Dense) (None, 64) 51264
_________________________________________________________________
dense_1 (Dense) (None, 10) 650
=================================================================
Total params: 61,482
Trainable params: 61,482
Non-trainable params: 0
_________________________________________________________________

In [10]: model.fit(X_train, y_train_cat, batch_size=128,


epochs=5, verbose=1, validation_split=0.3);

Train on 42000 samples, validate on 18000 samples


Epoch 1/5
42000/42000 [==============================] - 3s 81us/sample - loss: 0.3360
- accuracy: 0.8978 - val_loss: 0.1210 - val_accuracy: 0.9626
Epoch 2/5
42000/42000 [==============================] - 2s 44us/sample - loss: 0.0944
- accuracy: 0.9715 - val_loss: 0.0835 - val_accuracy: 0.9729
Epoch 3/5
42000/42000 [==============================] - 2s 44us/sample - loss: 0.0637
- accuracy: 0.9799 - val_loss: 0.1037 - val_accuracy: 0.9669
Epoch 4/5
42000/42000 [==============================] - 2s 44us/sample - loss: 0.0494
- accuracy: 0.9847 - val_loss: 0.0756 - val_accuracy: 0.9769
Epoch 5/5
42000/42000 [==============================] - 2s 44us/sample - loss: 0.0391
- accuracy: 0.9878 - val_loss: 0.0505 - val_accuracy: 0.9855

In [11]: model.evaluate(X_test, y_test_cat)

10000/10000 [==============================] - 1s 53us/sample - loss: 0.0359


- accuracy: 0.9871

Out[11]: [0.03590720935157733, 0.9871]

Exercise 2
Pleased with your performance with the digits recognition task, your boss decides to challenge you with a
harder task. Their online branch allows people to upload images to a website that generates and prints a
postcard and ships it to its destination. Your boss would like to know what images people are loading on the
site to provide targeted advertising on the same page, so he asks you to build an image recognition system
capable of recognizing a few objects. Luckily for you, there's a ready-made dataset with a collection of labeled
images. This is the CIFAR-10 dataset, a very famous dataset that contains images for ten different categories:

• airplane
• automobile
• bird
• cat
• deer
• dog
• frog
• horse
• ship
• truck

In this exercise, we will reach the limit of what you can achieve on your laptop. In later chapters, we will
learn how to leverage GPUs to speed up training.

Here’s what you have to do: - load the cifar10 dataset using keras.datasets.cifar10.load_data() -
display a few images, see how hard/easy it is for you to recognize an object with such low resolution - check
the shape of X_train, does it need reshaping? - check the scale of X_train, does it need rescaling? - check
the shape of y_train, does it need reshaping? - build a model with the following architecture, and choose
the parameters and activation functions for each of the layers: - conv2d - conv2d - maxpool - conv2d -
conv2d - maxpool - flatten - dense - output - compile the model and check the number of parameters -
attempt to train the model with the optimizer of your choice. How fast does training proceed? - If training is
too slow, feel free to stop it and read ahead. In the next chapters, you’ll learn how to use GPUs to

In [12]: from tensorflow.keras.datasets import cifar10

In [13]: (X_train, y_train), (X_test, y_test) = cifar10.load_data()



In [14]: X_train.shape

Out[14]: (50000, 32, 32, 3)

In [15]: plt.imshow(X_train[1]);

[Figure: output of plt.imshow(X_train[1]), a low-resolution 32x32 CIFAR-10 training image]

In [16]: X_train = X_train.astype('float32') / 255.0


X_test = X_test.astype('float32') / 255.0

In [17]: y_train.shape

Out[17]: (50000, 1)

In [18]: y_train_cat = to_categorical(y_train, 10)


y_test_cat = to_categorical(y_test, 10)

In [19]: y_train_cat.shape

Out[19]: (50000, 10)

In [20]: model = Sequential()

         model.add(Conv2D(32, (3, 3),
                          padding='same',
                          input_shape=(32, 32, 3),
                          kernel_initializer='normal',
                          activation='relu'))
         model.add(Conv2D(32, (3, 3), activation='relu',
                          kernel_initializer='normal'))
         model.add(MaxPooling2D(pool_size=(2, 2)))

         model.add(Conv2D(64, (3, 3), padding='same',
                          kernel_initializer='normal',
                          activation='relu'))
         model.add(Conv2D(64, (3, 3), activation='relu',
                          kernel_initializer='normal'))
         model.add(MaxPooling2D(pool_size=(2, 2)))

         model.add(Flatten())
         model.add(Dense(512, activation='relu'))
         model.add(Dense(10, activation='softmax'))

In [21]: model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])

In [22]: model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
conv2d_3 (Conv2D) (None, 30, 30, 32) 9248
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 15, 15, 32) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 15, 15, 64) 18496
_________________________________________________________________
conv2d_5 (Conv2D) (None, 13, 13, 64) 36928
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 6, 6, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 2304) 0
_________________________________________________________________
dense_2 (Dense) (None, 512) 1180160
_________________________________________________________________
dense_3 (Dense) (None, 10) 5130
=================================================================
Total params: 1,250,858
Trainable params: 1,250,858
Non-trainable params: 0
_________________________________________________________________

In [23]: model.fit(X_train, y_train_cat,
                   batch_size=256,
                   epochs=2,
                   validation_data=(X_test, y_test_cat),
                   shuffle=True);

Train on 50000 samples, validate on 10000 samples


Epoch 1/2
50000/50000 [==============================] - 6s 125us/sample - loss:
1.8650 - accuracy: 0.3282 - val_loss: 1.5813 - val_accuracy: 0.4103
Epoch 2/2
50000/50000 [==============================] - 6s 112us/sample - loss:
1.4362 - accuracy: 0.4858 - val_loss: 1.3471 - val_accuracy: 0.5249
22 Time Series and Recurrent Neural Networks Exercises Solutions

In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
Your manager at the power company is quite satisfied with the work you’ve done predicting the electric load
of the next hour and would like to push it further. He is curious to know if your model can predict the load
on the next day or even the next week instead of the next hour.

• Go ahead and use the helper function create_lagged_Xy_win we created above to generate new X
and y pairs where the start_lag is 36 hours or even further. You may want to extend the window
size to a little longer than a day.
• Train your best model on this data. You may have to use more than one layer. In which case,
remember to use the return_sequences=True argument in all layers except for the last one so that
they pass sequences to one another.
• Check the goodness of your model by comparing it with test data as well as looking at the R² score.

In [3]: df = pd.read_csv('../data/ZonalDemands_2003-2016.csv.bz2',
compression='bz2',
engine='python')


In [4]: def combine_date_hour(row):
            date = pd.to_datetime(row['Date'])
            hour = pd.Timedelta("%d hours" % row['Hour'])
            return date + hour

        idx = df.apply(combine_date_hour, axis=1)
        df = df.set_index(idx)

In [5]: split_date = pd.Timestamp('01-01-2014')


train = df.loc[:split_date, ['Total Ontario']].copy()
test = df.loc[split_date:, ['Total Ontario']].copy()

In [6]: offset = 10000


scale = 5000

train_sc = (train - offset) / scale


test_sc = (test - offset) / scale

In [7]: def create_lagged_Xy_win(data, start_lag=1,
                                 window_len=1):
            X = data.shift(start_lag + window_len - 1).copy()
            X.columns = ['T_{}'.format(start_lag + window_len - 1)]

            if window_len > 1:
                for s in range(window_len, 0, -1):
                    col_ = 'T_{}'.format(start_lag + s - 1)
                    X[col_] = data.shift(start_lag + s - 1)

            X = X.dropna()
            idx = X.index
            y = data.loc[idx]
            return X, y

In [8]: start_lag = 36
        window_len = 72

        X_train, y_train = create_lagged_Xy_win(
            train_sc, start_lag, window_len)

        X_test, y_test = create_lagged_Xy_win(
            test_sc, start_lag, window_len)

In [9]: X_train_t = X_train.values.reshape(-1, window_len, 1)
        X_test_t = X_test.values.reshape(-1, window_len, 1)

        y_train_t = y_train.values
        y_test_t = y_test.values

In [10]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import LSTM, Dense
import tensorflow.keras.backend as K
from tensorflow.keras.optimizers import Adam

In [11]: K.clear_session()

model = Sequential()
model.add(LSTM(12, input_shape=(window_len, 1),
kernel_initializer='normal',
return_sequences=True))
model.add(LSTM(6, kernel_initializer='normal'))
model.add(Dense(1))

model.compile(optimizer=Adam(lr=0.05),
loss='mean_squared_error')

In [12]: model.fit(X_train_t, y_train_t,
                   epochs=5,
                   batch_size=256,
                   verbose=1);

Epoch 1/5
93445/93445 [==============================] - 5s 53us/sample - loss: 0.2839
Epoch 2/5
93445/93445 [==============================] - 3s 36us/sample - loss: 0.1914
Epoch 3/5
93445/93445 [==============================] - 3s 36us/sample - loss: 0.1039
Epoch 4/5
93445/93445 [==============================] - 3s 36us/sample - loss: 0.0835
Epoch 5/5
93445/93445 [==============================] - 3s 36us/sample - loss: 0.0756

Let’s compare the predictions on the test set. We will a few days of data and put vertical bars to mark an
interval of 36 hours:

In [13]: y_pred = model.predict(X_test_t, batch_size=256)

         plt.figure(figsize=(15,5))
         plt.plot(y_test_t, label='y_test')
         plt.plot(y_pred, label='y_pred')
         plt.legend()
         plt.xlim(1100,1500)
         plt.axvline(1300)
         plt.axvline(1336);

[Figure: y_test and y_pred on the test set, zoomed to samples 1100-1500, with vertical lines at 1300 and 1336 marking a 36-hour interval]
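The exercise also asks for the R² score, which is not computed explicitly above; here is a minimal sketch
using scikit-learn (an addition, assuming y_test_t and y_pred as defined in the previous cells):

    from sklearn.metrics import r2_score

    # R² of the recurrent model's predictions on the test set
    print(r2_score(y_test_t, y_pred))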

Exercise 2
Gated Recurrent Units (GRU) are a more modern and simpler implementation of a cell that retains
longer-term memory.

Keras makes them available in keras.layers.GRU. Try swapping the LSTM layer with a GRU layer and
re-train the model. Does its performance improve on the 36 hours lag task?

In [14]: from tensorflow.keras.layers import GRU

In [15]: K.clear_session()

model = Sequential()
model.add(GRU(12, input_shape=(window_len, 1),
kernel_initializer='normal',
return_sequences=True))
model.add(GRU(6, kernel_initializer='normal'))
model.add(Dense(1))

model.compile(optimizer=Adam(lr=0.05),
loss='mean_squared_error')

In [16]: model.fit(X_train_t, y_train_t,
                   epochs=5,
                   batch_size=256,
                   verbose=1);

Epoch 1/5
93445/93445 [==============================] - 4s 40us/sample - loss: 0.1784
Epoch 2/5
93445/93445 [==============================] - 3s 37us/sample - loss: 0.0775
Epoch 3/5
93445/93445 [==============================] - 3s 37us/sample - loss: 0.0623
Epoch 4/5
93445/93445 [==============================] - 3s 37us/sample - loss: 0.0600
Epoch 5/5
93445/93445 [==============================] - 3s 37us/sample - loss: 0.0579

In [17]: y_pred = model.predict(X_test_t, batch_size=256)


plt.figure(figsize=(15,5))
plt.plot(y_test_t, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.xlim(1100,1500)
plt.axvline(1300)
plt.axvline(1336);

[Figure: y_test and y_pred for the GRU model on the test set, zoomed to samples 1100-1500, with vertical lines at 1300 and 1336]

GRU not only trains faster, but also seems to reach a better performance than LSTM on this task.

Exercise 3
Does a fully connected model work well with windows? Let's find out! Try to train a fully connected model
on the lagged data with windows, which will probably train much faster:

• reshape the input data back to an Order-2 tensor, i.e., eliminate the 3rd axis
• build a fully connected model with one or more layers
• train the fully connected model on the windowed data. Does it work well? Is it faster to train?

In [18]: X_train = X_train_t.squeeze()


X_test = X_test_t.squeeze()

In [19]: model = Sequential()

model.add(Dense(24, input_dim=window_len, activation='relu'))


model.add(Dense(12, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(1))

model.compile(optimizer='adam', loss='mean_squared_error')

In [20]: model.fit(X_train, y_train_t,
                   epochs=50,
                   batch_size=256,
                   verbose=0);

In [21]: y_pred = model.predict(X_test, batch_size=256)


plt.figure(figsize=(15,5))
plt.plot(y_test_t, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.xlim(1100,1500)
plt.axvline(1300)
plt.axvline(1336);

[Figure: y_test and y_pred for the fully connected model on the test set, zoomed to samples 1100-1500, with vertical lines at 1300 and 1336]

Exercise 4

Disclaimer: past performance is no guarantee of future results. This is not investment advice.

Predicting the price of Bitcoin from historical data.

You may have heard people talk about Bitcoin and how fast it has been growing, so you decide to put your
newly acquired Deep Learning skills to the test and try to beat the market. The idea is simple: if we could
predict what Bitcoin is going to do in the future, we could trade and profit using that knowledge.

The simplest formulation of this forecasting problem is to try to predict whether the price of Bitcoin is going
to go up or down in the future, i.e., we can frame the issue as a binary classification that answers the
question: is Bitcoin going up?

Here are the steps to complete this exercise:

1. Load the data from ../data/poloniex_usdt_btc.json.gz into a Pandas DataFrame. We obtained
   this data through the public API of the Poloniex cryptocurrency exchange.

• Check out the data using df.head(). Notice that the dataset contains the following columns:
– close: last price (in USD) in a 30 minute interval (candle)
– high: highest price in a 30 minute candle
– low: lowest price in a 30 minute candle
– open: first price in a 30 minute candle
– quoteVolume and volume: total amount traded on the exchange
– weightedAverage: this will be our outcome variable
• Convert the date column to a datetime object using pd.to_datetime and set it as the index of the
DataFrame.
• Plot the value of df['close'] to inspect the data. You will notice that it’s not periodic at all and it
has an overall enormous upward trend, so we will need to transform the data into a stationary time
series. We will use percentage changes, i.e., we will look at relative movements in the price instead of
absolute values.
• Create a new dataset df_percent with percent changes using the formula:

  v_t = 100 × (x_t − x_{t−1}) / x_{t−1}        (22.1)

  this is what we will use next.
• Inspect df_percent and notice that it contains both infinity and nan values. Drop the null values
and replace the infinity values with zero.
• Split the data on January 1st, 2017, using the data before then as training and the data after that as the
test.
• Use the window method to create an input training tensor X_train_t with the shape (n_windows,
window_len, n_features). This is the main part of the exercise since you’ll have to make a few choices
and be careful not to leak information from the future. In particular, you will have to:
– decide the window_len you want to use
– decide which features you'd like to use as input (don't use weightedAverage, since we'll need it
  for the output)
– decide what lag you want to introduce between the last timestep in your input window and the
timestep of the output.

– You can start from the create_lagged_Xy_win function we defined in Chapter 7, but you will
have to modify it to work with numpy arrays because Pandas DataFrames are only good with
one feature.

• Create a binary outcome variable corresponding to df_percent_train['weightedAverage'] >= 0.
  This variable is going to be our label; do the same thing for the test set.
• Create a model to work with this data. Make sure the input layer has the right input_shape and the
  output layer has one node with a Sigmoid activation function. Also, make sure to use the
  binary_crossentropy loss and to track the accuracy of the model.
• Train the model on the training data
• Test the model on the test data. Is the accuracy better than a baseline guess? Are you going to be rich?

Again disclaimer: past performance is no guarantee of future results. This is not investment
advice.

In [22]: df = pd.read_json('../data/poloniex_usdt_btc.json.gz',
compression='gzip')

In [23]: df['date'] = pd.to_datetime(df['date'])


df.set_index('date', inplace=True)

In [24]: df['close'].plot();
[Figure: plot of df['close'] (Bitcoin price in USD) from 2015 to 2018, rising to nearly 20000 toward the start of 2018]

In [25]: df_percent = ((df - df.shift()) / df.shift()) * 100.0
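As a side note (not part of the original solution), pandas has a built-in helper that computes the same
relative change, so an equivalent one-liner would be:

    # equivalent to ((df - df.shift()) / df.shift()) * 100.0
    df_percent = df.pct_change() * 100.0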

In [26]: df_percent.head()

Out[26]:

close high low open quoteVolume volume weightedAverage


date
2015-02-19 19:00:00 NaN NaN NaN NaN NaN NaN NaN
2015-02-19 19:30:00 0.000000 0.000000 0.000000 0.000000 -100.000000 -100.000000 0.000000
2015-02-19 20:00:00 6.666667 6.666667 0.000000 0.000000 inf inf 5.818701
2015-02-19 20:30:00 1.666667 1.666667 8.444444 8.444444 -53.317086 -52.158715 2.481361
2015-02-19 21:00:00 0.000000 0.000000 0.000000 0.000000 -100.000000 -100.000000 0.000000

In [27]: df_percent = df_percent.dropna()\
                                .replace(-np.inf, 0)\
                                .replace(np.inf, 0)
         df_percent['y'] = df_percent['weightedAverage'] >= 0

In [28]: split_date = pd.Timestamp('01-01-2017')


train = df_percent.loc[:split_date].copy()
test = df_percent.loc[split_date:].copy()

In [29]: def create_lagged_Xy_win_t(data, start_lag=1,
                                    window_len=1):
             X = data[['close', 'high', 'low', 'open']].copy()
             y = data['y']

             rows, columns = X.shape
             shape_ = (rows - window_len - 1, window_len, columns)
             X_t = np.zeros(shape_)
             y_t = y.values[window_len + 1:]

             if window_len > 1:
                 for s in range(window_len, 0, -1):
                     all_values = X.shift(start_lag + s - 1).values
                     X_t[:, window_len - s, :] = all_values[window_len + 1:]

             return X_t, y_t

In [30]: start_lag = 1
window_len = 36

In [31]: X_train_t, y_train_t = create_lagged_Xy_win_t(
             train, start_lag, window_len)

         X_test_t, y_test_t = create_lagged_Xy_win_t(
             test, start_lag, window_len)

In [32]: train.head(10)

Out[32]:

close high low open quoteVolume volume weightedAverage y


date
2015-02-19 19:30:00 0.000000 0.000000 0.000000 0.000000 -100.000000 -100.000000 0.000000e+00 True
2015-02-19 20:00:00 6.666667 6.666667 0.000000 0.000000 0.000000 0.000000 5.818701e+00 True
2015-02-19 20:30:00 1.666667 1.666667 8.444444 8.444444 -53.317086 -52.158715 2.481361e+00 True
2015-02-19 21:00:00 0.000000 0.000000 0.000000 0.000000 -100.000000 -100.000000 0.000000e+00 True
2015-02-19 22:30:00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 True
2015-02-19 23:00:00 0.000000 0.000000 0.000000 0.000000 -100.000000 -100.000000 0.000000e+00 True
2015-02-20 06:00:00 -1.536838 -1.536837 -1.536838 -1.536837 0.000000 0.000000 -1.536838e+00 False
2015-02-20 06:30:00 0.000000 -0.000001 0.000000 -0.000001 -100.000000 -100.000000 -4.162332e-09 False
2015-02-20 08:30:00 1.977058 1.977058 1.560825 1.560825 0.000000 0.000000 1.909331e+00 True
2015-02-20 09:00:00 0.000000 0.000000 0.409836 0.409836 -100.000000 -100.000000 6.645834e-02 True

In [33]: X_train_t.shape

Out[33]: (23956, 36, 4)

In [34]: y_train_t.shape

Out[34]: (23956,)

In [35]: K.clear_session()

model = Sequential()
model.add(GRU(24, input_shape=(window_len, 4),
kernel_initializer='normal',
return_sequences=True))
model.add(GRU(18, kernel_initializer='normal',
return_sequences=True))
model.add(GRU(12, kernel_initializer='normal'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=Adam(lr=0.02),
loss='binary_crossentropy',
metrics=['accuracy'])

In [36]: model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
unified_gru (UnifiedGRU) (None, 36, 24) 2160
_________________________________________________________________
unified_gru_1 (UnifiedGRU) (None, 36, 18) 2376
_________________________________________________________________
unified_gru_2 (UnifiedGRU) (None, 12) 1152
_________________________________________________________________
dense (Dense) (None, 1) 13
=================================================================
Total params: 5,701
Trainable params: 5,701
Non-trainable params: 0
_________________________________________________________________

In [37]: h = model.fit(X_train_t, y_train_t,
                       epochs=20,
                       batch_size=512,
                       validation_split=0.1,
                       verbose=1);

Train on 21560 samples, validate on 2396 samples


Epoch 1/20
21560/21560 [==============================] - 1s 60us/sample - loss: 0.6721
- accuracy: 0.5785 - val_loss: 0.6834 - val_accuracy: 0.5492
Epoch 2/20
21560/21560 [==============================] - 1s 24us/sample - loss: 0.6526
- accuracy: 0.6160 - val_loss: 0.6723 - val_accuracy: 0.5826
Epoch 3/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6483
- accuracy: 0.6200 - val_loss: 0.6699 - val_accuracy: 0.5839
Epoch 4/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6461
- accuracy: 0.6245 - val_loss: 0.6731 - val_accuracy: 0.5918
Epoch 5/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6466
- accuracy: 0.6253 - val_loss: 0.6720 - val_accuracy: 0.5872
Epoch 6/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6434
- accuracy: 0.6258 - val_loss: 0.6754 - val_accuracy: 0.5835
Epoch 7/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6419
- accuracy: 0.6269 - val_loss: 0.6679 - val_accuracy: 0.5972
Epoch 8/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6410
- accuracy: 0.6315 - val_loss: 0.6721 - val_accuracy: 0.5885
Epoch 9/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6401
- accuracy: 0.6313 - val_loss: 0.6788 - val_accuracy: 0.5705
Epoch 10/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6362
- accuracy: 0.6315 - val_loss: 0.6731 - val_accuracy: 0.5931
Epoch 11/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6347
- accuracy: 0.6360 - val_loss: 0.6689 - val_accuracy: 0.6027
Epoch 12/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6331
- accuracy: 0.6364 - val_loss: 0.6759 - val_accuracy: 0.5881
Epoch 13/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6300
- accuracy: 0.6348 - val_loss: 0.6803 - val_accuracy: 0.5618
Epoch 14/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6282
- accuracy: 0.6412 - val_loss: 0.6729 - val_accuracy: 0.5897
Epoch 15/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6238
- accuracy: 0.6458 - val_loss: 0.6782 - val_accuracy: 0.5676
Epoch 16/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6258
- accuracy: 0.6407 - val_loss: 0.6737 - val_accuracy: 0.5847
Epoch 17/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6234
- accuracy: 0.6435 - val_loss: 0.6785 - val_accuracy: 0.5684
Epoch 18/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6230
- accuracy: 0.6486 - val_loss: 0.6742 - val_accuracy: 0.5793
Epoch 19/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6213
- accuracy: 0.6493 - val_loss: 0.6769 - val_accuracy: 0.5793
Epoch 20/20
21560/21560 [==============================] - 0s 22us/sample - loss: 0.6229
- accuracy: 0.6466 - val_loss: 0.6765 - val_accuracy: 0.5914

In [38]: pd.DataFrame(h.history).plot()

Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x7faa57a1c860>

[Figure: training history of the Bitcoin classifier: loss, accuracy, val_loss and val_accuracy over the 20 epochs]

In [39]: model.evaluate(X_train_t, y_train_t)

23956/23956 [==============================] - 2s 103us/sample - loss: 0.6231 - accuracy: 0.6465

Out[39]: [0.6230709266734255, 0.6465186]



In [40]: pd.Series(y_train_t).value_counts() / len(y_train_t)

Out[40]:

0
True     0.553974
False    0.446026

In [41]: model.evaluate(X_test_t, y_test_t)

19633/19633 [==============================] - 2s 103us/sample - loss: 0.6374 - accuracy: 0.6461

Out[41]: [0.6373963824251365, 0.6460551]

In [42]: pd.Series(y_test_t).value_counts() / len(y_test_t)

Out[42]:

0
True 0.527326
False 0.472674

23 Natural Language Processing and Text Data Exercises Solutions

In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
For our Spam detection model, we used a CountVectorizer with a vocabulary size of 3000. Was this the
best size? Let’s find out:

• reload the spam dataset


• do a train test split with random_state=0 on the SMS data frame
• write a function train_for_vocab_size that takes vocab_size as input and does the following:
– initialize a CountVectorizer with max_features=vocab_size
– fit the vectorizer on the training messages
– transform both the training and the test messages to count matrices
– train the model on the training set
– return the model accuracy on the training and test set
• plot the behavior of the train and test set accuracies as a function of vocab_size for a range of
different vocab sizes


Let’s reload the sms data we have previously saved:

In [3]: df = pd.read_csv('../data/sms_spam.csv')
df.head()

Out[3]:

message spam
0 Hi Princess! Thank you for... 0
1 Hello my little party anim... 0
2 And miss vday the parachut... 0
3 Maybe you should find some... 0
4 What year. And how many mi... 0

Train/Test split on the messages, notice that we use Numpy Arrays, not Pandas Dataframes:

In [4]: from sklearn.model_selection import train_test_split

In [5]: docs_train, docs_test, y_train, y_test = \
            train_test_split(df['message'].values,
                             df['spam'].values,
                             random_state=0)

Now let’s write the function:

In [6]: from sklearn.feature_extraction.text import CountVectorizer


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [7]: def train_for_vocab_size(vocab_size):
            vect = CountVectorizer(decode_error='ignore',
                                   stop_words='english',
                                   max_features=vocab_size)

            vect.fit(docs_train)
            X_train_sparse = vect.transform(docs_train)
            X_train = X_train_sparse.todense()

            X_test_sparse = vect.transform(docs_test)
            X_test = X_test_sparse.todense()

            input_dim = X_train.shape[1]

            model = Sequential()
            model.add(Dense(1, input_dim=input_dim, activation='sigmoid'))

            model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

            model.fit(X_train, y_train, epochs=20, verbose=0)

            train_acc = model.evaluate(X_train, y_train, verbose=0)[1]
            test_acc = model.evaluate(X_test, y_test, verbose=0)[1]
            return input_dim, train_acc, test_acc

Now let’s try a few vocab_sizes with increasing separation:

In [8]: sizes = [2, 3, 5, 10, 30, 50, 100, 300, \
                 500, 1000, 3000, 5000, 10000]
        idx = []
        train_accs = []
        test_accs = []

        for v in sizes:
            i, tra, tea = train_for_vocab_size(v)

            idx.append(i)
            train_accs.append(tra)
            test_accs.append(tea)

            print("Done vocab size: ", i)

Done vocab size: 2


Done vocab size: 3
Done vocab size: 5
Done vocab size: 10
Done vocab size: 30
Done vocab size: 50
Done vocab size: 100
Done vocab size: 300
Done vocab size: 500
Done vocab size: 1000
Done vocab size: 3000
Done vocab size: 5000
Done vocab size: 7150

Let’s organize the results in a DataFrame


706 CHAPTER 23. NATURAL LANGUAGE PROCESSING AND TEXT DATA EXERCISES SOLUTIONS

In [9]: resdf = pd.DataFrame(train_accs,
                             columns=['Train'],
                             index=idx)

        resdf['Test'] = test_accs

and let’s plot the results using the logarithmic scale for the x axis. Remember that our benchmark accuracy
is 86.6, so we will add a baseline at that level:

In [10]: resdf.plot(logx=True, style='-o', title='Accuracy')


plt.xlabel('vocab size')
plt.ylim(0.85, 1)
plt.axhline(0.866, color='red');

[Figure: 'Accuracy' plot of Train and Test accuracy versus vocab size (log-scale x axis), with a red horizontal line marking the 0.866 baseline]
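For reference, the 0.866 benchmark presumably corresponds to always predicting the majority class
(non-spam); a minimal sketch to recompute it from the DataFrame loaded above (an assumption, not shown
in the original):

    # accuracy of a classifier that always predicts "not spam"
    baseline = 1 - df['spam'].mean()
    print(baseline)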

Exercise 2
Keras provides a large dataset of movie reviews extracted from the Internet Movie Database for sentiment
analysis purposes. This dataset is much larger than the one we have used, and it's already encoded as
sequences of integers. Let's put what we have learned to good use and build a sentiment classifier for movie
reviews:

• decide what size of vocabulary you are going to use and set the vocab_size variable
• import the imdb module from keras.datasets
• load the train and test sets using num_words=vocab_size
• check the data you have just loaded; they should be sequences of integers
• pad the sequences to a fixed length of your choice. You will need to:
  – decide what a reasonable length to express a movie review is
  – decide if you are going to truncate the beginning or the end of reviews that are longer than such
    length
  – decide if you are going to pad with zeros at the beginning or the end for reviews that are shorter
    than such length
• build a model to do sentiment analysis on the truncated sequences
• train the model on the training set
• evaluate the performance of the model on the test set

Bonus points: can you convert back the sentences to their original text form? You should look at
imdb.get_word_index() to download the word index:

In [11]: vocab_size=20000

In [12]: from tensorflow.keras.datasets import imdb

In [13]: (X_train, y_train), (X_test, y_test) = \
             imdb.load_data(num_words=vocab_size)

In [14]: X_train.shape

Out[14]: (25000,)

In [15]: X_train[0][:10]

Out[15]: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

Let’s use a maximum review length of 80 words. This seems long enough to express an opinion about the
movie:

In [16]: maxlen = 80

We will pad sequences using the default padding='pre' and truncating='pre' parameters.
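To make the 'pre' behaviour concrete, here is a tiny illustrative sketch (not from the original text): short
sequences are padded with zeros at the front, and long sequences are truncated from the front:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # with maxlen=4: [1, 2] becomes [0, 0, 1, 2]; [1, 2, 3, 4, 5, 6] keeps only its last 4 entries
    print(pad_sequences([[1, 2], [1, 2, 3, 4, 5, 6]], maxlen=4))
    # [[0 0 1 2]
    #  [3 4 5 6]]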

In [17]: from tensorflow.keras.preprocessing.sequence import pad_sequences

In [18]: X_train_pad = pad_sequences(X_train, maxlen=maxlen)


X_test_pad = pad_sequences(X_test, maxlen=maxlen)

Let’s build the model:

In [19]: embedding_size = 100

In [20]: from tensorflow.keras.layers import LSTM, Embedding

In [21]: model = Sequential()


model.add(Embedding(vocab_size, embedding_size))
model.add(LSTM(64, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])

TIP: in the above model we have used dropout, which has not yet been formally
introduced. For now just know that it’s a technique aimed at reducing overfitting.

In [22]: model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 100) 2000000
_________________________________________________________________
unified_lstm (UnifiedLSTM) (None, 64) 42240
_________________________________________________________________
dense_13 (Dense) (None, 1) 65
=================================================================
Total params: 2,042,305
Trainable params: 2,042,305
Non-trainable params: 0
_________________________________________________________________

Let’s train the model for a couple of epochs. If you run this model on your laptop it may take a few minutes
for each epoch:

In [23]: model.fit(X_train_pad, y_train,
                   batch_size=32,
                   epochs=2,
                   validation_split=0.3);

Train on 17500 samples, validate on 7500 samples


Epoch 1/2
17500/17500 [==============================] - 7s 405us/sample - loss:
0.4405 - accuracy: 0.7917 - val_loss: 0.3638 - val_accuracy: 0.8364
Epoch 2/2
17500/17500 [==============================] - 6s 317us/sample - loss:
0.2486 - accuracy: 0.9010 - val_loss: 0.3832 - val_accuracy: 0.8316

And let’s evaluate the training and test accuracies

In [24]: train_loss, train_acc = model.evaluate(X_train_pad, y_train)
         print('Train loss:', train_loss)
         print('Train accuracy:', train_acc)

25000/25000 [==============================] - 2s 87us/sample - loss: 0.2329 - accuracy: 0.9168
Train loss: 0.2329213989830017
Train accuracy: 0.91684

In [25]: test_loss, test_acc = model.evaluate(X_test_pad, y_test)
         print('Test loss:', test_loss)
         print('Test accuracy:', test_acc)

25000/25000 [==============================] - 2s 86us/sample - loss: 0.3924 - accuracy: 0.8267
Test loss: 0.3924247473049164
Test accuracy: 0.82672

Not bad! We have a sentiment analysis model that we can unleash on the social media of our choice. Time
to go to an investor and raise money! Not quite, but it’s nice to see how easy it has become to build a model
that would have been unthinkable just a few years ago.

Finally, for the bonus question, let's get the word index:

In [26]: idx = imdb.get_word_index()

and let’s create the reverse index. Notice that the documentation of imdb.load_data reads:

"""
Signature: imdb.load_data(path='imdb.npz', num_words=None, skip_top=0,
maxlen=None, seed=113, start_char=1, oov_char=2, index_from=3, **kwargs)
Docstring:
Loads the IMDB dataset.

path: where to cache the data (relative to `~/.keras/dataset`).


num_words: max number of words to include. Words are ranked
by how often they occur (in the training set) and only
the most frequent words are kept
skip_top: skip the top N most frequently occurring words
(which may not be informative).
maxlen: truncate sequences after this length.
seed: random seed for sample shuffling.
start_char: The start of a sequence will be marked with this character.
Set to 1 because 0 is usually the padding character.
oov_char: words that were cut out because of the `num_words`
or `skip_top` limit will be replaced with this character.
index_from: index actual words with this index and higher.

Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.


"""

so we will need to shift all indices by three to recover meaningful sentences:

In [27]: rev_idx = {v+3:k for k,v in idx.items()}

Also, following the documentation, let's add the start character and the out-of-vocabulary character:

In [28]: rev_idx[1] = 'start_char'


rev_idx[2] = 'oov_char'

We can then apply the reverse index to recover the text of a review:

In [29]: example_review = ' '.join([rev_idx[word] for word in X_train[0]])


example_review

Out[29]: "start_char this film was just brilliant casting location scenery story
direction everyone's really suited the part they played and you could just
imagine being there robert oov_char is an amazing actor and now the same
being director oov_char father came from the same scottish island as myself
so i loved the fact there was a real connection with this film the witty
remarks throughout the film were great it was just brilliant so much that i
bought the film as soon as it was released for retail and would recommend it
to everyone to watch and the fly fishing was amazing really cried at the end
it was so sad and you know what they say if you cry at a film it must have
been good and this definitely was also congratulations to the two little
boy's that played the oov_char of norman and paul they were just brilliant
children are often left out of the praising list i think because the stars
that play them all grown up are such a big profile for the whole film but
these children are amazing and should be praised for what they have done
don't you think the whole story was so lovely because it was true and was
someone's life after all that was shared with us all"

Great! These are indeed movie reviews.


24 Training with GPUs Exercises Solutions
In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
In Exercise 2 of Chapter 8 we introduced a model for sentiment analysis of the IMDB dataset provided in
Keras.

• Reload that dataset and prepare it for training a model:
  – choose vocabulary size
  – pad the sequences to a fixed length
• define a function recurrent_model(vocab_size, maxlen) similar to the convolutional_model
  function defined earlier. The function should return a recurrent model.
• Train the model on 1 CPU and measure the training time
  > TIP: This is currently broken. There's an issue open about it. The model definition seems to ignore
  the context setter on the CPU. Just skip this point for now.
• Train the model on 1 GPU and measure the training time
• Train the model on a machine with more than 1 GPU using multi_gpu_model or even better using
  distribution strategy


In [3]: from time import time


import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import multi_gpu_model

In [4]: vocab_size = 10000
        maxlen = 80

In [5]: (X_train, y_train), (X_test, y_test) = \
            imdb.load_data(num_words=vocab_size)

        X_train_pad = pad_sequences(X_train, maxlen=maxlen)
        X_test_pad = pad_sequences(X_test, maxlen=maxlen)

In [6]: def recurrent_model(vocab_size, maxlen):
            print("Defining recurrent model")
            t0 = time()
            model = Sequential()
            model.add(Embedding(vocab_size, 100, input_length=maxlen))
            model.add(LSTM(64, dropout=0.2))
            model.add(Dense(1, activation='sigmoid'))

            print("{:0.3f} seconds.".format(time() - t0))

            print("Compiling the model...")
            t0 = time()
            model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

            print("{:0.3f} seconds.".format(time() - t0))

            return model

In [7]: # broken in TF 2.0 alpha release
        # with tf.device('cpu:0'):
        #     model = recurrent_model(vocab_size, maxlen)

In [8]: # print("Training recurrent CPU model...")
        # t0 = time()
        # model.fit(X_train_pad, y_train,
        #           batch_size=1024,
        #           epochs=2,
        #           shuffle=True)
        # print("{:0} seconds.".format(time() - t0))
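A generic workaround, not part of the original solution and only a sketch, is to hide the GPUs from
TensorFlow before it initializes any device, so that the model is necessarily placed on the CPU:

    import os

    # must run before TensorFlow creates any device context,
    # i.e. at the very top of the notebook or script
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'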

In [9]: with tf.device('gpu:0'):
            model = recurrent_model(vocab_size, maxlen)

Defining recurrent model
0.717 seconds.
Compiling the model...
0.093 seconds.

In [10]: print("Training recurrent GPU model...")


t0 = time()
model.fit(X_train_pad, y_train,
batch_size=1024,
epochs=2,
shuffle=True)
print("{:0} seconds.".format(time() - t0))

Training recurrent GPU model...


Epoch 1/2
25000/25000 [==============================] - 3s 107us/sample - loss:
0.6588 - accuracy: 0.6424
Epoch 2/2
25000/25000 [==============================] - 1s 47us/sample - loss: 0.4336
- accuracy: 0.8075
4.5373570919036865 seconds.

In [11]: NGPU = 2

In [12]: model = recurrent_model(vocab_size, maxlen)

model = multi_gpu_model(model, NGPU, cpu_relocation=True)

Defining recurrent model


0.373 seconds.
Compiling the model...
0.093 seconds.

In [13]: model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])

In [14]: print("Training recurrent GPU model on {} GPUs ...".format(NGPU))


t0 = time()
model.fit(X_train_pad, y_train,
batch_size=1024*NGPU,
epochs=2,
shuffle=True)
print("{:0} seconds.".format(time() - t0))

Training recurrent GPU model on 2 GPUs ...


Epoch 1/2
25000/25000 [==============================] - 2s 87us/sample - loss: 0.6567
- accuracy: 0.6478
Epoch 2/2
25000/25000 [==============================] - 2s 65us/sample - loss: 0.4692
- accuracy: 0.7861
5.121852397918701 seconds.

In [15]: strategy = tf.distribute.MirroredStrategy()

In [16]: with strategy.scope():
             model = recurrent_model(vocab_size, maxlen)

Defining recurrent model
0.342 seconds.
Compiling the model...
1.187 seconds.

In [17]: print("Training recurrent GPU model on {} GPUs ...".format(NGPU))


t0 = time()
model.fit(X_train_pad, y_train,
batch_size=1024*NGPU,
epochs=2,
shuffle=True)
print("{:0.3f} seconds.".format(time() - t0))

Training recurrent GPU model on 2 GPUs ...


Epoch 1/2
13/13 [==============================] - 3s 193ms/step - loss: 0.6883 -
accuracy: 0.5892
Epoch 2/2
13/13 [==============================] - 1s 52ms/step - loss: 0.5986 -
accuracy: 0.7420
8.971 seconds.

Exercise 2
Model parallelism is a technique used for models too large to fit in the memory of a single GPU. While this
is not the case for the model we developed in Exercise 1, it is still possible to distribute the model across
multiple GPUs using the with context setter. Define a new model with the following architecture:

1. Embedding
2. LSTM
3. LSTM
4. LSTM
5. Dense

Place layers 1 and 2 on the first GPU, layers 3 and 4 on the second GPU and the final Dense layer on the CPU.

Train the model and see if the performance improves.

In [18]: import tensorflow.keras.backend as K

In [19]: K.clear_session()

In [20]: model = Sequential()

         with tf.device('gpu:0'):
             model.add(Embedding(input_dim=vocab_size,
                                 output_dim=100,
                                 input_length=maxlen))
             model.add(LSTM(64, dropout=0.2,
                            return_sequences=True))
         with tf.device('gpu:1'):
             model.add(LSTM(64, dropout=0.2,
                            return_sequences=True))
             model.add(LSTM(64, dropout=0.2))
         with tf.device('cpu:0'):
             model.add(Dense(1, activation='sigmoid'))

         model.compile(loss='binary_crossentropy',
                       optimizer='rmsprop',
                       metrics=['accuracy'])

         print("{:0.3f} seconds.".format(time() - t0))

         print("Compiling the model...")
         t0 = time()
         model.compile(loss='binary_crossentropy',
                       optimizer='rmsprop',
                       metrics=['accuracy'])
         print("{:0.3f} seconds.".format(time() - t0))

9.747 seconds.
Compiling the model...
0.128 seconds.

In [21]: print("Training distributed recurrent model...")


t0 = time()
model.fit(X_train_pad, y_train,
batch_size=1024,
epochs=2,
shuffle=True)
print("{:0} seconds.".format(time() - t0))

Training distributed recurrent model...


Epoch 1/2
25000/25000 [==============================] - 3s 119us/sample - loss:
0.6162 - accuracy: 0.6496
Epoch 2/2
25000/25000 [==============================] - 2s 97us/sample - loss: 0.4159
- accuracy: 0.8146
7.327203273773193 seconds.

25 Performance Improvement Exercises Solutions
In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
This is a long and complex exercise that should give you an idea of a real-world scenario. Feel free to look at
the solution if you feel lost. Also, feel free to run this on a GPU.

First of all download and unpack the male/female pictures from here into a subfolder of the ../data folder.
These images and labels were obtained from Crowdflower.

Your goal is to build an image classifier that will recognize the gender of a person from pictures.

• Have a look at the directory structure and inspect a couple of pictures
• Design a model that will take a color image of size 64x64 as input and return a binary output
  (female=0/male=1)
• Feel free to introduce any regularization technique in your model (Dropout, Batch Normalization,
  Weight Regularization; see the weight-regularization sketch right after this list)
• Compile your model with an optimizer of your choice
• Using ImageDataGenerator, define a train generator that will augment your images with some
  geometric transformations. Feel free to choose the parameters that make sense to you.
• Also define a test generator, whose only purpose is to rescale the pixels by 1./255
• use the function flow_from_directory to generate batches from the train and test folders. Make
  sure you set the target_size to 64x64.
• Use the model.fit_generator function to fit the model on the batches generated from the
  ImageDataGenerator. Since you are streaming and augmenting the data in real-time, you will have
  to decide how many batches make an epoch and how many epochs you want to run
• Train your model (you should get to at least 85% accuracy)
• Once you are satisfied with your training, check a few of the misclassified pictures.
• Read about human bias in Machine Learning datasets
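The solution below relies on Batch Normalization only; as a complementary illustration (an addition, not
part of the original solution), weight regularization could be added to a Dense layer like this:

    from tensorflow.keras import regularizers
    from tensorflow.keras.layers import Dense

    # L2 penalty on the layer's weights, added to the training loss
    dense = Dense(128, activation='relu',
                  kernel_regularizer=regularizers.l2(1e-4))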

In [3]: %%bash
        if [ ! -d ../data/male_female ]; then
            A=https://www.zerotodeeplearning.com/
            B=media/z2dl/45bzty/
            C=male_female.tgz
            wget $A$B$C -O male_female.tgz
            tar -xzvf male_female.tgz --directory ../data/
            rm male_female.tgz
        fi

In [4]: data_path = '../data/male_female/'

In [5]: from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import BatchNormalization
from itertools import islice
from tensorflow.keras import backend as K
from tensorflow.keras.utils import multi_gpu_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf

In [6]: from tensorflow.python.client import device_lib

In [7]: tf.compat.v1.disable_eager_execution()

In [8]: def create_model():
            model = Sequential()
            model.add(Conv2D(32, (3, 3),
                             input_shape=(64, 64, 3),
                             activation='relu'))
            model.add(MaxPooling2D(pool_size=(2, 2)))
            model.add(BatchNormalization())

            model.add(Conv2D(64, (3, 3), activation='relu'))
            model.add(MaxPooling2D(pool_size=(2, 2)))
            model.add(BatchNormalization())

            model.add(Conv2D(64, (3, 3), activation='relu'))
            model.add(MaxPooling2D(pool_size=(2, 2)))
            model.add(BatchNormalization())

            model.add(Flatten())
            model.add(Dense(128, activation='relu'))
            model.add(Dense(1, activation='sigmoid'))
            return model

In [9]: # gpus = K.tensorflow_backend._get_available_gpus()
        gpus = ['gpu:0', 'gpu:1']

In [10]: NGPU = len(gpus)

In [11]: if NGPU <= 1:
             model = create_model()
             ncopies = 1  # for batch size
         else:
             with tf.device("/cpu:0"):
                 model = create_model()
             model = multi_gpu_model(model, gpus=NGPU)
             ncopies = NGPU

In [12]: model.summary()

Model: "model"
__________________________________________________________________________________________
Layer (type)                 Output Shape         Param #    Connected to
==========================================================================================
conv2d_input (InputLayer)    [(None, 64, 64, 3)]  0
__________________________________________________________________________________________
lambda (Lambda)              (None, 64, 64, 3)    0          conv2d_input[0][0]
__________________________________________________________________________________________
lambda_1 (Lambda)            (None, 64, 64, 3)    0          conv2d_input[0][0]
__________________________________________________________________________________________
sequential (Sequential)      (None, 1)            352129     lambda[0][0]
                                                             lambda_1[0][0]
__________________________________________________________________________________________
dense_1 (Concatenate)        (None, 1)            0          sequential[1][0]
                                                             sequential[2][0]
==========================================================================================
Total params: 352,129
Trainable params: 351,809
Non-trainable params: 320
__________________________________________________________________________________________

In [13]: model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])

In [14]: batch_size = 16

In [15]: train_gen = ImageDataGenerator(rescale=1./255,
                                        width_shift_range=0.1,
                                        height_shift_range=0.1,
                                        rotation_range=10,
                                        shear_range=0.2,
                                        zoom_range=0.2,
                                        horizontal_flip=True)

         test_gen = ImageDataGenerator(rescale=1./255)

In [16]: train = train_gen.flow_from_directory(
             data_path + '/train', target_size=(64, 64),
             batch_size=batch_size * ncopies,
             class_mode='binary')

         test = test_gen.flow_from_directory(
             data_path + '/test', target_size=(64, 64),
             batch_size=batch_size * ncopies,
             class_mode='binary')

Found 11663 images belonging to 2 classes.


Found 2920 images belonging to 2 classes.

In [17]: test.class_indices

Out[17]: {'0_female': 0, '1_male': 1}

In [18]: label_to_class = {0: 'female', 1: 'male'}

In [19]: model.fit_generator(train,
steps_per_epoch=600,
epochs=3);

Epoch 1/3
600/600 [==============================] - 74s 123ms/step - loss: 0.6128 -
accuracy: 0.6974
Epoch 2/3
600/600 [==============================] - 70s 117ms/step - loss: 0.4655 -
accuracy: 0.7679
Epoch 3/3
600/600 [==============================] - 70s 117ms/step - loss: 0.4331 -
accuracy: 0.7935

In [20]: model.evaluate_generator(test, steps=len(test))

Out[20]: [0.39456120862261107, 0.8099315]

In [21]: X_test = []
y_test = []
for ts in islice(test, 50):
X_test.append(ts[0])
y_test.append(ts[1])

X_test = np.concatenate(X_test)
y_test = np.concatenate(y_test)

In [22]: y_test

Out[22]: array([1., 0., 1., ..., 0., 1., 0.], dtype=float32)



In [23]: y_pred = model.predict(X_test).ravel().round(0)


y_pred

Out[23]: array([1., 0., 1., ..., 0., 1., 0.], dtype=float32)

In [24]: wrong_idx = np.argwhere(y_test != y_pred).ravel()


wrong_idx

Out[24]: array([ 9, 10, 15, 16, 17, 21, 47, 52, 53, 54, 61,
63, 79, 83, 84, 86, 87, 89, 90, 91, 95, 102,
110, 127, 128, 129, 130, 133, 135, 156, 157, 158, 159,
161, 164, 172, 173, 178, 182, 184, 193, 198, 199, 200,
205, 213, 214, 220, 222, 223, 232, 237, 247, 248, 252,
254, 261, 264, 265, 270, 272, 275, 282, 286, 287, 288,
302, 308, 316, 320, 330, 357, 363, 365, 366, 371, 375,
376, 380, 385, 388, 395, 403, 406, 411, 420, 426, 427,
447, 456, 460, 462, 465, 469, 470, 471, 477, 489, 490,
494, 495, 507, 513, 514, 516, 522, 537, 556, 558, 563,
564, 568, 572, 577, 581, 582, 585, 598, 609, 627, 639,
640, 643, 644, 653, 661, 668, 669, 676, 680, 699, 702,
705, 707, 712, 718, 721, 724, 728, 730, 733, 735, 739,
740, 741, 747, 751, 773, 775, 777, 781, 786, 792, 807,
808, 813, 814, 828, 832, 852, 862, 863, 865, 868, 871,
873, 888, 891, 893, 895, 897, 900, 902, 922, 924, 942,
945, 950, 953, 961, 973, 976, 979, 981, 988, 995, 999,
1000, 1013, 1022, 1023, 1026, 1032, 1040, 1044, 1045, 1046, 1047,
1052, 1056, 1067, 1076, 1078, 1087, 1104, 1105, 1112, 1114, 1116,
1123, 1124, 1137, 1140, 1143, 1144, 1158, 1163, 1169, 1173, 1177,
1181, 1188, 1192, 1194, 1197, 1214, 1219, 1221, 1233, 1237, 1253,
1254, 1256, 1257, 1260, 1264, 1269, 1270, 1271, 1275, 1281, 1283,
1291, 1294, 1295, 1297, 1303, 1304, 1308, 1315, 1318, 1327, 1329,
1336, 1340, 1346, 1351, 1354, 1357, 1368, 1376, 1380, 1382, 1383,
1387, 1389, 1393, 1396, 1402, 1403, 1410, 1423, 1424, 1425, 1430,
1437, 1441, 1444, 1450, 1456, 1459, 1468, 1472, 1475, 1491, 1493,
1494, 1503, 1505, 1507, 1508, 1518, 1520, 1521, 1522, 1526, 1527,
1532, 1538, 1539, 1549, 1559, 1565, 1571, 1577, 1578, 1581, 1585])

In [25]: plt.figure(figsize=(10, 10))

         i = 1
         for idx in wrong_idx[:16]:
             plt.subplot(4, 4, i)
             plt.imshow(X_test[idx])
             label = label_to_class[int(y_test[idx])]
             pred = label_to_class[int(y_pred[idx])]
             plt.title("Label: {}\nPred: {}".format(label, pred))
             i += 1

         plt.tight_layout()

[Figure: a 4x4 grid of misclassified test images, each titled with its true label and the model's (wrong) prediction]

The model still has a lot to learn about humans

26 Pretrained Models for Images Exercises Solutions
In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
Use a pre-trained model on a different image.

• Download an image from the web
• Upload the image through the Jupyter home page
• load the image as a numpy array
• re-run the prediction to see if the pre-trained model can guess your image
• can you find an image that is outside of the Imagenet classes? (you can see which classes are available
  here)

In [3]: from urllib.request import urlretrieve


from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.xception import Xception
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.applications.xception import decode_predictions

In [4]: img_url = "http://bit.ly/2VKhzWb"


In [5]: def load_image_from_url(url, target_size=(299, 299)):
            # download the image at `url` to a temporary file
            path, response = urlretrieve(
                url, filename='/tmp/temp_img.jpg')

            img = image.load_img(path, target_size=target_size)

            img_tensor = np.expand_dims(
                image.img_to_array(img), axis=0)

            return img, img_tensor

In [6]: img, img_tensor = load_image_from_url(img_url)

In [7]: img_scaled = preprocess_input(np.copy(img_tensor))

In [8]: img

Out[8]: [Image: the downloaded picture, resized to 299x299]

In [9]: model = Xception(weights='imagenet')

In [10]: preds = model.predict(img_scaled)

In [11]: decode_predictions(preds, top=3)[0]

Out[11]: [('n03100240', 'convertible', 0.32023576),


('n04285008', 'sports_car', 0.16641538),
('n03459775', 'grille', 0.08980309)]

Exercise 2
Choose another pre-trained model from the ones provided at https://keras.io/applications/ and use it to
predict the same image. Do the predictions match?

In [12]: from tensorflow.keras.applications.vgg16 import VGG16


from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.applications.vgg16 import decode_predictions

In [13]: model = VGG16(weights='imagenet')

In [14]: img, img_tensor = load_image_from_url(


img_url, target_size=(224, 224))

In [15]: img_scaled = preprocess_input(np.copy(img_tensor))

In [16]: preds = model.predict(img_scaled)

In [17]: decode_predictions(preds, top=3)[0]

Out[17]: [('n03594945', 'jeep', 0.19831586),


('n03770679', 'minivan', 0.15666518),
('n03100240', 'convertible', 0.111178935)]

Exercise 3
The Keras documentation shows how to fine-tune the Inception V3 model by unfreezing some of the
convolutional layers. Try reproducing the results of the documentation on our dataset using the Xception
model and unfreezing some of the top convolutional layers.

In [18]: from tensorflow.keras.models import Model


from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.optimizers import SGD

In [19]: img_size = 299

In [20]: base_model = Xception(include_top=False, weights='imagenet',
                               input_shape=(img_size, img_size, 3),
                               pooling='avg')

In [21]: x = base_model.output
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(3, activation='softmax')(x)

In [22]: model = Model(inputs=base_model.input,
                       outputs=predictions)

In [23]: for layer in base_model.layers:
             layer.trainable = False

In [24]: model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])

In [25]: train_datagen = ImageDataGenerator(
             preprocessing_function=preprocess_input,
             rotation_range=15,
             width_shift_range=0.2,
             height_shift_range=0.2,
             shear_range=5,
             zoom_range=0.2,
             horizontal_flip=True,
             fill_mode='nearest')

In [26]: batch_size = 32

In [27]: train_path = '../data/sports/train/'



In [28]: train_generator = train_datagen.flow_from_directory(
             train_path,
             target_size=(img_size, img_size),
             batch_size=batch_size)

Found 2100 images belonging to 3 classes.

In [29]: model.fit_generator(
             train_generator,
             steps_per_epoch=65,
             epochs=1)

65/65 [==============================] - 49s 754ms/step - loss: 0.5518 - accuracy: 0.7964

Out[29]: <tensorflow.python.keras.callbacks.History at 0x7f0b5824ee10>

In [30]: for i, layer in enumerate(base_model.layers):
             print(i, layer.name)

0 input_3
1 block1_conv1
2 block1_conv1_bn
3 block1_conv1_act
4 block1_conv2
5 block1_conv2_bn
6 block1_conv2_act
7 block2_sepconv1
8 block2_sepconv1_bn
9 block2_sepconv2_act
10 block2_sepconv2
11 block2_sepconv2_bn
12 conv2d_4
13 block2_pool
14 batch_normalization_v1_4
15 add_12
16 block3_sepconv1_act
17 block3_sepconv1
18 block3_sepconv1_bn
19 block3_sepconv2_act
20 block3_sepconv2
21 block3_sepconv2_bn
22 conv2d_5
23 block3_pool
24 batch_normalization_v1_5
25 add_13
26 block4_sepconv1_act
27 block4_sepconv1
28 block4_sepconv1_bn
29 block4_sepconv2_act
30 block4_sepconv2
31 block4_sepconv2_bn
32 conv2d_6
33 block4_pool
34 batch_normalization_v1_6
35 add_14
36 block5_sepconv1_act
37 block5_sepconv1
38 block5_sepconv1_bn
39 block5_sepconv2_act
40 block5_sepconv2
41 block5_sepconv2_bn
42 block5_sepconv3_act
43 block5_sepconv3
44 block5_sepconv3_bn
45 add_15
46 block6_sepconv1_act
47 block6_sepconv1
48 block6_sepconv1_bn
49 block6_sepconv2_act
50 block6_sepconv2
51 block6_sepconv2_bn
52 block6_sepconv3_act
53 block6_sepconv3
54 block6_sepconv3_bn
55 add_16
56 block7_sepconv1_act
57 block7_sepconv1
58 block7_sepconv1_bn
59 block7_sepconv2_act
60 block7_sepconv2
61 block7_sepconv2_bn
62 block7_sepconv3_act
63 block7_sepconv3
64 block7_sepconv3_bn
65 add_17
66 block8_sepconv1_act
67 block8_sepconv1
68 block8_sepconv1_bn
69 block8_sepconv2_act
70 block8_sepconv2
71 block8_sepconv2_bn
72 block8_sepconv3_act
73 block8_sepconv3
74 block8_sepconv3_bn
75 add_18
76 block9_sepconv1_act
77 block9_sepconv1
78 block9_sepconv1_bn
79 block9_sepconv2_act
80 block9_sepconv2
81 block9_sepconv2_bn
82 block9_sepconv3_act
83 block9_sepconv3
84 block9_sepconv3_bn
85 add_19
86 block10_sepconv1_act
87 block10_sepconv1
88 block10_sepconv1_bn
89 block10_sepconv2_act
90 block10_sepconv2
91 block10_sepconv2_bn
92 block10_sepconv3_act
93 block10_sepconv3
94 block10_sepconv3_bn
95 add_20
96 block11_sepconv1_act
97 block11_sepconv1
98 block11_sepconv1_bn
99 block11_sepconv2_act
100 block11_sepconv2
101 block11_sepconv2_bn
102 block11_sepconv3_act
103 block11_sepconv3
104 block11_sepconv3_bn
105 add_21
106 block12_sepconv1_act
107 block12_sepconv1
108 block12_sepconv1_bn
109 block12_sepconv2_act
110 block12_sepconv2
111 block12_sepconv2_bn
112 block12_sepconv3_act
113 block12_sepconv3
114 block12_sepconv3_bn
115 add_22
116 block13_sepconv1_act
117 block13_sepconv1
118 block13_sepconv1_bn
119 block13_sepconv2_act
120 block13_sepconv2
121 block13_sepconv2_bn
122 conv2d_7
123 block13_pool
124 batch_normalization_v1_7
125 add_23
126 block14_sepconv1
127 block14_sepconv1_bn
128 block14_sepconv1_act
129 block14_sepconv2
130 block14_sepconv2_bn
131 block14_sepconv2_act
132 global_average_pooling2d

In [31]: # we chose to train the top 2 separable convolution
         # blocks, i.e. we will freeze the first 126 layers
         # and unfreeze the rest:

         split_layer = 126

         for layer in model.layers[:split_layer]:
             layer.trainable = False
         for layer in model.layers[split_layer:]:
             layer.trainable = True
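
After toggling the trainable flags, it can be useful to confirm that the split landed where intended before recompiling (the change only takes effect once the model is compiled again, which the next cell does). A minimal optional check, not part of the original solution, assuming the model object defined above is in scope:

from tensorflow.keras import backend as K

# Count how many parameters are trainable vs. frozen after setting the flags.
n_trainable = sum(K.count_params(w) for w in model.trainable_weights)
n_frozen = sum(K.count_params(w) for w in model.non_trainable_weights)
print("Trainable parameters:", n_trainable)
print("Frozen parameters:", n_frozen)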

In [32]: model.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])

In [33]: model.fit_generator(
train_generator,
steps_per_epoch=65,
epochs=1);

65/65 [==============================] - 50s 773ms/step - loss: 0.4687 - accuracy: 0.8283

In [ ]:
27 Pretrained Embeddings for Text Exercises Solutions
In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
Compare the representations of Word2Vec, Glove and FastText. In the data/embeddings folder we
provided you with two additional scripts to download FastText and Word2Vec. Go ahead and download
each of them into the data/embeddings folder. Then load each of the 3 embeddings in a separate Gensim
model and complete the following steps:

1. define a list of words containing the following words: ‘good’, ‘bad’, ‘fast’, ‘tensor’, ‘teacher’, ‘student’.

• create a function called get_top_5(words, model) that retrieves the top 5 most similar words to
the list of words and compare what the 3 different embeddings give you

• apply the same function to each word in the list separately and compare the lists of the 3 embeddings.

• explore the following word analogies:


man:king=woman:? ==> expected queen
france:paris=germany:? ==> expected berlin
teacher:teach=student:? ==> expected learn
cat:kitten=dog:? ==> expected puppy
english:friday=italiano:? ==> expected venerdì


Can word analogies be used for translation?

Note that loading the vectors may take several minutes depending on your computer.
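
If load time or memory is a concern while experimenting, gensim's KeyedVectors.load_word2vec_format accepts a limit argument that reads only the first N (most frequent) vectors instead of the full vocabulary. A small optional sketch, using the same Word2Vec file loaded below; the cutoff of 200,000 is an arbitrary choice:

from gensim.models import KeyedVectors

# Load only the 200,000 most frequent Word2Vec vectors to save time and memory.
w2v_small = KeyedVectors.load_word2vec_format(
    '../data/embeddings/GoogleNews-vectors-negative300.bin',
    binary=True, limit=200000)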

In [3]: import gensim

In [4]: from gensim.models import KeyedVectors

In [5]: w2v_path = '../data/embeddings/GoogleNews-vectors-negative300.bin'
        w2v_gs = KeyedVectors.load_word2vec_format(
            w2v_path, binary=True)

In [6]: glove_path = '../data/embeddings/glove.6B.50d.txt.vec'
        glove_gs = KeyedVectors.load_word2vec_format(
            glove_path, binary=False)

In [7]: fasttext_path = '../data/embeddings/wiki-news-300d-1M.vec'
        fasttext_gs = KeyedVectors.load_word2vec_format(
            fasttext_path, binary=False)

In [8]: word_list = ['good', 'bad',
                     'fast', 'tensor',
                     'teacher', 'student']

In [9]: def get_top_5(words, gs_model):
            res = gs_model.most_similar(positive=words, topn=5)
            return [r[0] for r in res]

In [10]: for word in word_list:
             print(word)
             print("W2V : ", get_top_5([word], w2v_gs))
             print("Glove : ", get_top_5([word.lower()], glove_gs))
             print("FastText: ", get_top_5([word], fasttext_gs))
             print()

good
W2V : ['great', 'bad', 'terrific', 'decent', 'nice']
Glove : ['better', 'really', 'always', 'sure', 'something']
FastText: ['bad', 'excellent', 'decent', 'nice', 'great']

bad
W2V : ['good', 'terrible', 'horrible', 'Bad', 'lousy']
Glove : ['worse', 'unfortunately', 'too', 'really', 'little']
FastText: ['good', 'terrible', 'horrible', 'lousy', 'awful']

fast
W2V : ['quick', 'rapidly', 'Fast', 'quickly', 'slow']
Glove : ['slow', 'faster', 'pace', 'turning', 'better']
FastText: ['slow', 'rapid', 'quick', 'Fast', 'faster']

tensor
W2V : ['uniaxial', 'τ ', 'θ ', 'φ', 'wavefunction']
Glove : ['scalar', 'tensors', 'coefficients', 'coefficient',
'formula_12']
FastText: ['tensors', 'Tensor', 'stress-energy', 'pseudotensor',
'tensorial']

teacher
W2V : ['teachers', 'Teacher', 'guidance_counselor', 'elementary',
'PE_teacher']
Glove : ['student', 'graduate', 'teaching', 'taught', 'teaches']
FastText: ['teachers', 'educator', 'Teacher', 'student', 'pupil']

student
W2V : ['students', 'Student', 'teacher', 'stu_dent', 'faculty']
Glove : ['teacher', 'students', 'teachers', 'graduate', 'school']
FastText: ['students', 'teacher', 'Student', 'university', 'graduate']

In [11]: def word_analogy(model,
                          thing='man',
                          is_to='king',
                          like='woman'):
             res = model.most_similar(positive=[is_to, like],
                                      negative=[thing],
                                      topn=3)
             return [r[0] for r in res]

In [12]: word_analogies = ['man:king=woman:queen',
                           'france:paris=germany:berlin',
                           'teacher:teach=student:learn',
                           'cat:kitten=dog:?',
                           'english:friday=italiano:?']

         for analogy in word_analogies:
             first, second = analogy.split('=')
             thing, is_to = first.split(':')
             like, answer = second.split(':')

             print(analogy)
             print("W2V : ", word_analogy(
                 w2v_gs, thing, is_to, like))
             print("Glove : ", word_analogy(
                 glove_gs, thing, is_to, like))
             print("FastText: ", word_analogy(
                 fasttext_gs, thing, is_to, like))
             print()

man:king=woman:queen
W2V : ['queen', 'monarch', 'princess']
Glove : ['queen', 'throne', 'prince']
FastText: ['queen', 'monarch', 'princess']

france:paris=germany:berlin
W2V : ['berlin', 'german', 'lindsay_lohan']
Glove : ['berlin', 'frankfurt', 'vienna']
FastText: ['berlin', 'munich', 'dresden']

teacher:teach=student:learn
W2V : ['educate', 'learn', 'teaches']
Glove : ['students', 'teachers', 'teaching']
FastText: ['learn', 'educate', 'attend']

cat:kitten=dog:?
W2V : ['puppy', 'pup', 'pit_bull']
Glove : ['puppy', 'rottweiler', 'retriever']
FastText: ['puppy', 'puppies', 'pup']

english:friday=italiano:?
W2V : ['noche', 'fatto', 'la_versione']
Glove : ['exxonmobil', 'eni', 'newmont']
FastText: ['dopo', 'meglio', 'lavoro']

Exercise 2
The Reuters Newswire topic classification dataset is a dataset of 11,228 newswires from Reuters, labeled over
46 topics. This dataset is provided in the keras.datasets module and it’s easy to use.

Let’s compare the performance of a model using pre-trained embeddings with a model using random
embeddings on the topic classification task.

• Load the data from keras.datasets.reuters


• Retrieve the word index and create the reverse_word_idx as done for IMDB in Chapter 8.
• Augment the reverse word index with pad_char, start_char and oov_char at indices 0, 1, 2
respectively.
• Check the maximum length of a newswire and use the pad_sequences function to pad everything to
100 words.
• Create and train two models, one using pre-trained embeddings and the other using a randomly
initialized embedding
• Compare their performance on this dataset using a recurrent model. In particular, check which of the
two models shows the worst overfitting.

In [13]: from tensorflow.keras.datasets import reuters

In [14]: vocab_size=20000

In [15]: (X_train, y_train), (X_test, y_test) = \
             reuters.load_data(num_words=vocab_size, index_from=2)

In [16]: reuters_word_idx = reuters.get_word_index()

In [17]: reverse_reuters_word_idx = {index+2:word for word, index
                                     in reuters_word_idx.items()}

In [18]: reverse_reuters_word_idx[0] = 'pad_char'
         reverse_reuters_word_idx[1] = 'start_char'
         reverse_reuters_word_idx[2] = 'oov_char'

In [19]: from tensorflow.keras.preprocessing.sequence import pad_sequences

In [20]: ' '.join([reverse_reuters_word_idx[i] for i in X_train[0]])

Out[20]: 'start_char oov_char oov_char said as a result of its december acquisition
          of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per
          share up from 70 cts in 1986 the company said pretax net should rise to nine
          to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19
          to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year
          should be 2 50 to three dlrs reuter 3'

In [21]: max([len(seq) for seq in X_train])

Out[21]: 2376

In [22]: maxlen=100

In [23]: X_train_pad = pad_sequences(X_train, maxlen=maxlen)
         X_test_pad = pad_sequences(X_test, maxlen=maxlen)

In [24]: embedding_model = fasttext_gs



In [25]: embedding_size = embedding_model.vector_size

In [26]: reuters_emb_weights = np.zeros((vocab_size, embedding_size))

         not_found = 0

         for i in range(1, vocab_size):
             word = reverse_reuters_word_idx[i]
             try:
                 reuters_emb_weights[i] = embedding_model[word]
             except KeyError:
                 not_found += 1
                 # word not found in the pre-trained embedding:
                 # fall back to a random vector
                 reuters_emb_weights[i] = np.random.random(
                     size=embedding_size)

         print("{} out of {} words not found"
               " in pre-trained embedding.".format(not_found, vocab_size))

4260 out of 20000 words not found in pre-trained embedding.
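
A side note on those missing words: the wiki-news .vec file is a plain-text export and does not carry FastText's subword information. If your gensim version provides load_facebook_vectors and you have the corresponding full .bin model available (the path below is hypothetical, it is not part of the book's data folder), out-of-vocabulary words get vectors composed from character n-grams, which would reduce the number of randomly initialized rows. A sketch under those assumptions:

from gensim.models.fasttext import load_facebook_vectors

# Hypothetical path to a full FastText binary model (.bin), which includes the
# character n-gram matrices needed to compose vectors for unseen words.
ft_full = load_facebook_vectors('../data/embeddings/wiki-news-300d-1M-subword.bin')

# Unlike the .vec KeyedVectors, this lookup also works for words never seen
# during training, e.g. rare tokens from the Reuters vocabulary.
vec = ft_full['unseenwordxyz']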

In [27]: from tensorflow.keras.layers import LSTM, Embedding, Dense
         from tensorflow.keras.models import Sequential

In [28]: def build_train_eval(embedding_weights=None):
             model = Sequential([
                 Embedding(vocab_size,
                           embedding_size,
                           mask_zero=True,
                           input_length=maxlen),
                 LSTM(64, dropout=0.2),
                 Dense(46, activation='softmax')
             ])

             if embedding_weights is not None:
                 model.layers[0].set_weights([embedding_weights])
                 model.layers[0].trainable = False

             model.compile(loss='sparse_categorical_crossentropy',
                           optimizer='adam',
                           metrics=['accuracy'])

             h = model.fit(X_train_pad, y_train,
                           batch_size=32,
                           epochs=5,
                           validation_split=0.1)

             train_loss, train_acc = model.evaluate(X_train_pad,
                                                    y_train)
             print('Train loss:', train_loss)
             print('Train accuracy:', train_acc)

             test_loss, test_acc = model.evaluate(X_test_pad,
                                                  y_test)
             print('Test loss:', test_loss)
             print('Test accuracy:', test_acc)

             return h, model

In [29]: h, random_model = build_train_eval()

Train on 8083 samples, validate on 899 samples


Epoch 1/5
8083/8083 [==============================] - 146s 18ms/sample - loss: 2.1761
- accuracy: 0.4587 - val_loss: 1.8156 - val_accuracy: 0.5439
Epoch 2/5
8083/8083 [==============================] - 145s 18ms/sample - loss: 1.6472
- accuracy: 0.5904 - val_loss: 1.9135 - val_accuracy: 0.5373
Epoch 3/5
8083/8083 [==============================] - 144s 18ms/sample - loss: 1.3937
- accuracy: 0.6579 - val_loss: 1.4670 - val_accuracy: 0.6452
Epoch 4/5
8083/8083 [==============================] - 145s 18ms/sample - loss: 1.0325
- accuracy: 0.7477 - val_loss: 1.3999 - val_accuracy: 0.6685
Epoch 5/5
8083/8083 [==============================] - 146s 18ms/sample - loss: 0.7611
- accuracy: 0.8175 - val_loss: 1.3693 - val_accuracy: 0.6986
8982/8982 [==============================] - 16s 2ms/sample - loss: 0.6364 -
accuracy: 0.8602
Train loss: 0.6364203458425861
Train accuracy: 0.86016476
2246/2246 [==============================] - 4s 2ms/sample - loss: 1.4060 -
accuracy: 0.6808
Test loss: 1.4059620158330521
Test accuracy: 0.6807658

In [30]: pd.DataFrame(h.history).plot();

[Plot: training history of the randomly initialized embedding model - loss, accuracy, val_loss, and val_accuracy over the 5 epochs.]

In [31]: h, fixed_model = build_train_eval(reuters_emb_weights)

Train on 8083 samples, validate on 899 samples


Epoch 1/5
8083/8083 [==============================] - 109s 14ms/sample - loss: 2.2472
- accuracy: 0.4470 - val_loss: 2.0583 - val_accuracy: 0.4794
Epoch 2/5
8083/8083 [==============================] - 108s 13ms/sample - loss: 1.9728
- accuracy: 0.4892 - val_loss: 1.9353 - val_accuracy: 0.4816
Epoch 3/5
8083/8083 [==============================] - 108s 13ms/sample - loss: 1.7839
- accuracy: 0.5346 - val_loss: 1.6880 - val_accuracy: 0.5862
Epoch 4/5
8083/8083 [==============================] - 107s 13ms/sample - loss: 1.6180
- accuracy: 0.5891 - val_loss: 1.5743 - val_accuracy: 0.6196
Epoch 5/5
8083/8083 [==============================] - 107s 13ms/sample - loss: 1.5100
- accuracy: 0.6187 - val_loss: 1.5230 - val_accuracy: 0.6318
8982/8982 [==============================] - 16s 2ms/sample - loss: 1.4570 -
accuracy: 0.6293
Train loss: 1.4569933932068033
Train accuracy: 0.6292585
2246/2246 [==============================] - 4s 2ms/sample - loss: 1.5246 -
accuracy: 0.6104
Test loss: 1.5246085031480525
Test accuracy: 0.6104185

In [32]: pd.DataFrame(h.history).plot();

[Plot: training history of the model with frozen pre-trained embeddings - loss, accuracy, val_loss, and val_accuracy over the 5 epochs.]

Comparing the two runs, the model with randomly initialized embeddings overfits more: it reaches about 0.86 train accuracy against 0.68 test accuracy, while the model with frozen pre-trained embeddings ends at roughly 0.63 train against 0.61 test accuracy.

In [ ]:
28 Serving Deep Learning Models Exercises Solutions
In [1]: with open('../course/common.py') as fin:
            exec(fin.read())

In [2]: with open('../course/matplotlibconf.py') as fin:
            exec(fin.read())

Exercise 1
Let’s deploy an image recognition API using Tensorflow Serving. The main difference from the API we have
deployed in this chapter is that we will have to deal with how to pass an image to the model through
tensorflow serving. Since this chapter focuses on deployment, we will take a shortcut and deploy a pre-
trained model that uses Imagenet. In particular, we will deploy the Xception model. If you are unsure
about how to use a pre-trained model, please go back to Chapter 11 for a refresher.

Here are the steps you will need to complete:

• load the model in Keras


• export the model for tensorflow serving:
– set the learning phase to zero
– save the model with tf.saved_model.save
• run the model server
• write a short script that:
– loads an image


– pre-processes it with the appropriate function


– serializes the image to Protobuf
– sends the image to the server
– receives a prediction
– decodes the prediction with Keras decode_prediction function

In [3]: import os
        from os.path import join
        import shutil

        import tensorflow as tf
        import numpy as np

        from tensorflow.keras.preprocessing import image
        from tensorflow.keras.applications.xception import Xception
        from tensorflow.keras.applications.xception import preprocess_input
        from tensorflow.keras.applications.xception import decode_predictions

        from grpc import insecure_channel

In [4]: from tensorflow_serving.apis.prediction_service_pb2_grpc \
            import PredictionServiceStub
        from tensorflow_serving.apis.predict_pb2 \
            import PredictRequest

Save Xception as a TensorFlow model

In [5]: model = Xception(weights='imagenet')

In [6]: tf.keras.backend.set_learning_phase(0)

In [7]: base_path = '/tmp/ztdl_models/xception'
        sub_path = 'tfserving'
        version = 1

In [8]: export_path = join(base_path, sub_path, str(version))
        shutil.rmtree(export_path, ignore_errors=True)

In [9]: tf.saved_model.save(model, export_path)

In [10]: !saved_model_cli show --dir {export_path} --all



MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
The given SavedModel SignatureDef contains the following input(s):
The given SavedModel SignatureDef contains the following output(s):
outputs['__saved_model_init_op'] tensor_info:
dtype: DT_INVALID
shape: unknown_rank
name: NoOp
Method name is:

signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input_1'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 299, 299, 3)
name: serving_default_input_1:0
The given SavedModel SignatureDef contains the following output(s):
outputs['predictions'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 1000)
name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

Start Server

docker run \
-v /tmp/ztdl_models/xception/tfserving/:/models/xception \
-e MODEL_NAME=xception \
-e MODEL_PATH=/models/xception \
-p 8502:8500 \
-p 8503:8501 \
-t tensorflow/serving
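
Once the container is up, you can optionally verify that the model loaded correctly before sending gRPC requests. TensorFlow Serving also exposes a REST endpoint on the port mapped to 8503 in the docker command above, which reports the model's version status; a quick check from Python, assuming the container started successfully:

from urllib.request import urlopen

# Query TensorFlow Serving's model status endpoint over REST (port 8503 above).
# A healthy server answers with a JSON document whose state is "AVAILABLE".
print(urlopen('http://localhost:8503/v1/models/xception').read().decode())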

Convert image to protobuf

In [11]: img = image.load_img(
             './13_penguin.jpg', target_size=(299, 299))

In [12]: img_tensor = np.expand_dims(
             image.img_to_array(img), axis=0)

In [13]: img_scaled = preprocess_input(img_tensor)

In [14]: data_pb = tf.compat.v1.make_tensor_proto(
             img_scaled, dtype='float', shape=img_scaled.shape)

Send request and retrieve response

In [15]: channel = insecure_channel('localhost:8502')

In [16]: stub = PredictionServiceStub(channel)

In [17]: request = PredictRequest()

In [18]: request.model_spec.name = 'xception'

In [19]: request.model_spec.signature_name = 'serving_default'

In [20]: request.inputs['input_1'].CopyFrom(data_pb)

In [21]: result_future = stub.Predict.future(request, 5.0)

In [22]: result = result_future.result()

Decode predictions

In [23]: scores = tf.make_ndarray(result.outputs['predictions'])

In [24]: preds = decode_predictions(scores, top=1)[0][0][1]

In [25]: preds

Out[25]: 'king_penguin'

Exercise 2
The above method of serving a pre-trained model has an issue: we are doing pre-processing and prediction
decoding on the client side. This is not a best practice, because it requires the client to be aware of what kind
of pre-processing and decoding functions the model needs.

We want a server that takes the image as it is and returns a string with the name of the object found.

The easy way to do this is to use the Flask app implementation we have shown in this chapter and move
pre-processing and decoding to the server side.

Go ahead and build a Flask version of the API that takes an image URL as a JSON string, applies
pre-processing, runs and decodes the prediction and returns a string with the response.

You will not use tensorflow serving for this exercise.

Once your script is ready, save it as 13_flask_serve_xception.py, run it as:

python 13_flask_serve_xception.py

and test the prediction with the following command:

curl -d "http://bit.ly/2wb7uqN" \
-H "Content-Type: application/json" \
-X POST http://localhost:5000

If you’ve done things correctly, this should return:

"king_penguin"

Disclaimer: this script is not for production purposes. Retrieving a file from a URL is not secure, and
you should avoid building an API that retrieves a file from a URL provided by the client. Here we
used the URL retrieval trick to make the curl command shorter.
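
If you prefer to test from a notebook instead of curl, the same request can be sent with the Python standard library. A small sketch, assuming the Flask server from 13_flask_serve_xception.py (listed below) is already running on port 5000:

from urllib.request import Request, urlopen

# Mirror the curl command: send the image URL as the raw POST request body.
req = Request('http://localhost:5000',
              data=b'http://bit.ly/2wb7uqN',
              headers={'Content-Type': 'application/json'})
print(urlopen(req).read().decode())  # expected: "king_penguin"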

In [26]: !cat 13_flask_serve_xception.py

import os
import json
import numpy as np

from flask import Flask
from flask import request, jsonify

import tensorflow as tf
from urllib.request import urlretrieve
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.xception import Xception
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.applications.xception import decode_predictions

loaded_model = None

app = Flask(__name__)

def load_model():
    """
    Load model and tensorflow graph
    into global variables.
    """

    # global variables
    global loaded_model

    loaded_model = Xception(weights='imagenet')
    print("Model loaded.")

def load_image_from_url(url, target_size=(299, 299)):
    path, response = urlretrieve(url, filename='/tmp/temp_img')
    img = image.load_img(path, target_size=target_size)
    img_tensor = np.expand_dims(image.img_to_array(img), axis=0)
    return img, img_tensor

def preprocess(data):
    url = data.decode('utf-8')
    img, img_tensor = load_image_from_url(url)
    img_scaled = preprocess_input(img_tensor)
    return img_scaled

@app.route('/', methods=["POST"])
def predict():
    """
    Generate predictions with the model
    when receiving data as a POST request
    """
    if request.method == "POST":
        # get url from the request
        data = request.data

        # preprocess the data
        processed = preprocess(data)

        # run predictions
        preds = loaded_model.predict(processed)

        # obtain predicted classes from predicted probabilities
        result = decode_predictions(preds, top=1)[0][0][1]

        # print in backend
        print("Received data:", data)
        print("Predicted labels:", result)

        return jsonify(result)

if __name__ == "__main__":
    print("* Loading model and starting Flask server...")
    load_model()
    app.run(host='0.0.0.0', debug=True)

# Test this with the following command:
# curl -d 'http://bit.ly/2wb7uqN' -H "Content-Type: application/json" -X POST http://localhost:5000

In [ ]:
