Data Science for Decision Makers: Enhance your leadership skills with data science and AI expertise
Ebook, 751 pages, 4 hours

Language: English
Publisher: Packt Publishing
Release date: Jul 26, 2024
ISBN: 9781837638345
Author

Jon Howells

Jon Howells is a seasoned AI and data science professional with a decade of experience in the field. He runs an AI consultancy called QualifAI and has worked with various companies, including Unilever, Permira, and Capgemini, developing and deploying data science services and solutions. He holds a Master's degree in Computational Statistics & Machine Learning from UCL. Jon is particularly interested in the application of Large Language Models (LLMs) in consumer-focused businesses, such as using LLMs for consumer research and feedback analysis, personalized content generation, and enhanced customer support, ultimately helping businesses better understand and engage with their customers.

    Data Science for Decision Makers

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Ali Abidi

    Publishing Product Manager: Tejashwini R

    Book Project Manager: Hemangi Lotlikar

    Content Development Editor: Joseph Sunil

    Technical Editor: Rahul Limbachiya

    Copy Editor: Safis Editing

    Proofreader: Joseph Sunil

    Indexer: Rekha Nair

    Production Designer: Ponraj Dhandapani

    DevRel Marketing Coordinator: Vinishka Kalra

    First published: June 2024

    Production reference: 1190624

    Published by Packt Publishing Ltd.

    Grosvenor House, 11 St Paul’s Square, Birmingham, B3 1RB, UK

    ISBN 978-1-83763-729-4

    www.packtpub.com

    To my mother and father, Caroline and Robert, for instilling in me the values of education and constant curiosity. To my partner, Yeshica, for your unwavering support, and to my sister, Felicity, for your keen eye in reviewing and shaping this book.

    – Jon Howells

    Contributors

    About the author

    Jon Howells, director of AI consultancy QualifAI, is an experienced professional in data science and machine learning, with over a decade of experience in the consumer goods, market research, and public sectors. He has worked within consultancies including KPMG and Capgemini and with multinational clients such as Unilever and Permira, as well as public sector bodies such as the UK Home Office and the US Food and Drug Administration (FDA).

    With an MSc in computational statistics and machine learning from UCL, Jon specializes in applying large language models (LLMs) to consumer-focused businesses, leveraging them for consumer research, personalized content generation, and enhanced customer support. His expertise helps businesses better understand and engage with their customers, driving innovation and unlocking the potential of data-driven decision-making.

    About the reviewer

    As a principal architect at T-Mobile, Tanmaya Gaur has more than 10 years of web development experience and a passion for delivering technical and architectural leadership for key technology initiatives and business capabilities. In the latest chapter of his professional career, he has been instrumental in shaping the architecture of T-Mobile’s primary CRM solution, which is built using modular micro-frontend architecture and enhances the digital experience for their care representatives and customers.

    His expertise in web, infrastructure, and microservices enables him to design and deliver scalable solutions that are performant, secure, and resilient. He works closely with other business and IT partner teams in a highly collaborative environment and is committed to driving the best customer experience across mobile, desktop, point-of-sale, and other emerging devices.

    Table of Contents

    Preface

    Part 1: Understanding Data Science and Its Foundations

    1

    Introducing Data Science

    Data science, AI, and ML – what’s the difference?

    The mathematical and statistical underpinnings of data science

    Statistics and data science

    What is statistics?

    Descriptive and inferential statistics

    Sampling strategies

    Probability

    Probability distribution

    Conditional probability

    Describing our samples

    Measures of central tendency

    Measures of dispersion

    Degrees of freedom

    Correlation, causation, and covariance

    The shape of data

    Probability distributions

    Discrete probability distributions

    Continuous probability distributions

    Summary

    2

    Characterizing and Collecting Data

    What are the key criteria to consider when evaluating datasets?

    Data quantity

    Data velocity

    Data variety

    Data quality

    First-, second-, and third-party data

    First-party data – the treasure trove within

    Second-party data – building bridges through collaboration

    Third-party data – broadening horizons with external expertise

    Structured, unstructured, and semi-structured data

    Structured data

    Unstructured data

    Semi-structured data

    Methods for collecting data

    Storing and processing data

    Cloud, on-premises, and hybrid solutions – navigating the data storage and analysis landscape

    Cloud computing – scalable services in the cloud

    On-premises – maintaining control within your walls

    Hybrid – the best of both worlds?

    Data processing

    Summary

    3

    Exploratory Data Analysis

    Getting started with Google Colab

    What is Google Colab?

    A step-by-step guide to setting up Google Colab

    Understanding the data you have

    EDA techniques and tools

    Descriptive statistics

    Data visualization

    Histograms

    Density curves

    Boxplots

    Heatmaps

    Dimensionality reduction

    Correlation analysis

    Outlier detection

    Summary

    4

    The Significance of Significance

    The idea of testing hypotheses

    What is a hypothesis?

    How does hypothesis testing work?

    Formulating null and alternative hypotheses

    Determining the significance level

    Understanding errors

    Getting to grips with p-values

    Significance tests for a population proportion – making informed decisions about proportions

    The z-test – comparing a sample proportion to a population proportion

    Z-test example made easy

    Significance tests for a population average (mean)

    Writing hypotheses for a significance test about a mean

    Conditions for a t-test about a mean

    When to use z or t statistics in significance tests

    Example – calculating the t-statistic for a test about a mean

    Using a table to estimate the p-value from the t-statistic

    Comparing the p-value from the t-statistic to the significance level

    One-tailed and two-tailed tests

    Walking through a case study

    Summary

    5

    Understanding Regression

    How can I benefit from understanding regression?

    Introduction to trend lines

    Fitting a trend line to data

    Estimating the line of best fit

    Calculating the equations of the lines of best fit

    Interpreting the slope of a regression line

    Interpreting the intercept of a regression line

    Understanding residuals

    Evaluating the goodness of fit in least-squares regression

    Summary

    Part 2: Machine Learning – Concepts, Applications, and Pitfalls

    6

    Introducing Machine Learning

    From statistics to machine learning

    What is machine learning?

    How does machine learning relate to statistics?

    Why is machine learning important?

    Customer personalization and segmentation

    Fraud detection and security

    Supply chain and inventory optimization

    Predictive maintenance

    Healthcare diagnostics and treatment

    The different types of machine learning

    Supervised learning

    Unsupervised learning

    Semi-supervised learning

    Reinforcement learning

    Transfer learning

    Popular machine learning algorithms

    Linear regression

    Logistic regression

    Decision trees

    Random forests

    Support vector machines

    k-nearest neighbors

    Neural networks

    The machine learning process

    Training a supervised machine learning model

    Validation of a supervised machine learning model

    Testing a supervised machine learning model

    Evaluating machine learning models

    Risks and limitations of machine learning

    Overfitting and underfitting

    Bias and variance

    Balanced dataset

    Models are approximations of reality

    Machine learning on unstructured data

    Natural language processing (NLP)

    Computer vision

    Deep learning and artificial intelligence

    Artificial intelligence

    Deep learning

    Summary

    7

    Supervised Machine Learning

    Defining supervised learning

    Applications of supervised learning

    The two types of supervised learning

    Key factors in supervised learning

    Steps within supervised learning

    Data preparation – laying the foundation

    Algorithm selection – choosing the right tool

    Model training – learning from data

    Model evaluation – assessing performance

    Prediction and deployment – putting the model to work

    Characteristics of regression and classification algorithms

    Regression algorithms

    Classification algorithms

    Key considerations in supervised learning

    Evaluation metrics

    Applications of supervised learning

    Consumer goods

    Retail

    Manufacturing

    Summary

    8

    Unsupervised Machine Learning

    Defining UL

    Practical examples of UL

    Steps in UL

    Step 1 – Data collection

    Step 2 – Data preprocessing

    Step 3 – Choosing the right model

    Step 4 – Training the model

    Step 5 – Interpretation and evaluation

    In summary

    Clustering – unveiling hidden patterns in your data

    What is clustering?

    How does clustering work?

    k-means clustering

    Practical applications of clustering

    Evaluation metrics for clustering

    In summary

    Association rule learning

    What is association rule learning?

    The Apriori algorithm – a practical example

    Evaluation metrics

    In summary

    Applications of UL

    Market segmentation

    Anomaly detection

    Feature extraction

    Summary

    9

    Interpreting and Evaluating Machine Learning Models

    How do I know whether this model will be accurate?

    Evaluating on test (holdout) data

    Understanding evaluation metrics

    Evaluating regression models

    R-squared

    Root mean squared error

    Mean absolute error

    When and how to use each metric

    Practical evaluation strategies

    Summarizing the evaluation of regression models

    Evaluating classification models

    Classification model evaluation metrics

    Precision, recall, and F1-Score

    Recall

    F1-score

    Methods for explaining machine learning models

    Making sense of regression models – the power of coefficients

    Decoding classification models – unveiling feature importance

    Beyond specific models – universal insights using SHAP values

    Summary

    10

    Common Pitfalls in Machine Learning

    Understanding the complexity

    Dirty data, damaged models – how data quantity and quality impact ML

    The importance of adequate training data

    Dealing with poor data quality

    Conclusion

    Overcoming overfitting and underfitting

    Navigating training-serving skew and model drift

    Ensuring fairness

    Mastering overfitting and underfitting for optimal model performance

    Overfitting – when your model is too specific

    Underfitting – when your model is too simplistic

    Spotting the problem

    Conclusion

    Training-serving skew and model drift

    Training-serving skew

    Model drift

    Key takeaways

    Bias and fairness

    Understanding bias

    Understanding fairness

    Mitigating bias and ensuring fairness

    Key takeaways

    Summary

    Part 3: Leading Successful Data Science Projects and Teams

    11

    The Structure of a Data Science Project

    The various types of data science projects

    Data products

    Reports and analytics

    Research and methodology

    The stages of a data product

    Identifying use cases

    Evaluating use cases

    Planning the data product

    Developing a data product

    Data preparation and exploratory analysis

    Model design and development

    Evaluation and testing

    Deploying and monitoring a data product

    General best practices for data product development

    Evaluating impact

    Predictive maintenance in manufacturing

    Fraud detection in banking

    Customer churn prediction in telecom

    Demand forecasting in retail

    Personalized recommendations in e-commerce

    Predictive maintenance in energy

    Workforce optimization in quick service restaurants

    Chatbot-assisted customer support

    Summary

    12

    The Data Science Team

    Assembling your data science team – key roles and considerations

    Data scientists

    Machine learning engineers

    Data engineers

    MLOps engineers

    Analytics engineers

    Software engineers (full stack, frontend, backend)

    Product managers

    Business analysts

    Data storytellers/visualization experts

    Considerations when assembling your team

    Data science teams within larger organizations

    The hub and spoke model

    What is the hub and spoke model?

    Practical applications of the hub and spoke model

    Building a hub and spoke model

    The art of recruitment

    Where to find technical talent

    How high-performing data science teams operate

    Cross-functional collaboration is essential

    Diversity of perspectives drives innovation

    Start with the right problem to solve

    Invest in tooling, infrastructure, and workflow

    Continuous adaption and learning are a must

    Focus ruthlessly on outcomes over activity

    Summary

    13

    Managing the Data Science Team

    Day-to-day management of a data science team

    Enabling rapid experimentation and innovation

    Managing inherent uncertainty

    Balancing research and application

    Communicating effectively in data science and artificial intelligence

    Fostering a culture of curiosity and continuous learning

    Embracing peer review and collaboration

    Common challenges in managing a data science team

    Challenge 1 – recruiting and retaining top talent

    Challenge 2 – aligning projects with business goals

    Challenge 3 – managing inherent uncertainty

    Challenge 4 – scaling and operationalizing models

    Challenge 5 – deploying robust, reliable, fair models ethically

    Empowering and motivating your data science team

    Working with other teams and external stakeholders and empowering them to use data

    Summary

    14

    Continuing Your Journey as a Data Science Leader

    Navigating the landscape of emerging technologies

    Specializing in an industry

    Specializing in a field

    Embracing continuous learning

    Online courses

    Cloud certifications

    Technical tutorials and documentation

    Learning plan framework

    Staying up to date with current DS/ML/AI news and trends

    Promoting data-driven thinking within your organization

    Host internal learning sessions

    Collaborate on cross-functional projects

    Share success stories and lessons learned

    Mentor and upskill colleagues

    Establish a data science community of practice

    Networking beyond your organization

    Attend industry conferences and events

    Join online communities and forums

    Engage with local meetups and user groups

    Collaborate on side projects or research

    Offer mentorship or seek mentors

    Summary

    Index

    Other Books You May Enjoy

    Preface

    Data science, machine learning, and artificial intelligence (AI) are transforming the business landscape.

    Organizations in every industry are harnessing these powerful tools to uncover insights, make predictions, and gain a competitive edge. This trend has only accelerated with the rise of large language models and generative AI.

    But decision makers without a data science background, and those stepping up from being a data scientist to leading data teams, face myriad challenges. It can be difficult to understand the underlying concepts of statistics, machine learning, and AI; to manage data teams effectively; and, most importantly, to translate complex models into tangible business outcomes that deliver real, bottom-line value to an organization, not just vanity metrics and shiny demos.

    This book is your guide. In Data Science for Decision Makers, you’ll gain the essential knowledge and skills to lead in the age of AI. Through clear explanations and practical examples, you’ll learn how to interpret machine learning models, identify valuable use cases, and drive measurable results. Step by step, you’ll learn the foundations of statistics and machine learning. You’ll discover how to plan and execute successful data science initiatives from start to finish.

    Along the way, you’ll pick up best practices for building and empowering high-performing teams. Most importantly, you’ll learn how to bridge the gap between the technical world of data science and the business needs of your organization. Whether you’re an executive, a manager, or a data scientist moving into leadership, this book will help you leverage data-driven insights to inform your decisions and propel your company forward.

    Who this book is for

    Are you an executive seeking to harness the power of data science and AI? A manager eager to lead data-driven teams to success? Or perhaps a data scientist ready to step into a leadership role? If so, this book is for you.

    Data Science for Decision Makers is designed for leaders who want to leverage data insights effectively. You don’t need a formal background in statistics or machine learning. What you do need is a desire to understand these concepts, ask the right questions, and make informed decisions.

    If you work with data scientists and machine learning engineers, this book will help you interpret their models with confidence. You’ll learn how to recognize valuable opportunities for AI and plan projects that deliver real business value.

    Executives will gain a solid foundation in data science methods. Managers will discover how to build and guide high-performing teams. Data scientists will develop the skills to become influential leaders. Wherever you are in your career, this book will help you succeed in the age of AI.

    What this book covers

    This book is structured into three parts. Firstly, we cover data science and its foundations in statistics. Then, we cover machine learning as it relates to data science, including core machine learning concepts, applications, and pitfalls to avoid. Finally, we cover how to lead successful data science projects and teams. If you are already familiar with the foundations of data science and the core statistical concepts covered in Part 1, you may wish to skip ahead to Part 2 or refresh your knowledge.

    Part 1: Understanding Data Science and Its Foundations

    Chapter 1, Introducing Data Science, will provide you with a foundational understanding of data science, its relationship to AI and machine learning, and key statistical concepts. It explores descriptive and inferential statistics, probability, and data distributions, establishing a common language for readers.

    Chapter 2, Characterizing and Collecting Data, will give you the knowledge of how to distinguish between different types of data, including first-, second-, and third-party data, as well as structured, unstructured, and semi-structured data. It explores technologies and methods for collecting, storing, and processing data, and provides guidance on navigating the landscape of data-focused solutions, including cloud, on-premises, and hybrid solutions.

    Chapter 3, Exploratory Data Analysis, introduces the process of exploratory data analysis (EDA) and its importance in understanding data, developing hypotheses, and building better models. The chapter provides hands-on code examples in Python to reinforce the concepts, with step-by-step explanations suitable for readers with no prior experience in Python.

    Chapter 4, The Significance of Significance, explores the concept of statistical significance and its importance in making data-driven decisions. It covers hypothesis testing, also known as significance testing, and provides practical examples to illustrate its application in business scenarios, such as reducing customer churn and evaluating machine learning model improvements.

    Chapter 5, Understanding Regression, introduces regression as a powerful statistical tool for uncovering patterns and relationships within data. It explores various use cases for regression in a business context. The chapter begins with the foundational concept of trend lines before delving into the complexities of regression analysis.

    Part 2: Machine Learning – Concepts, Applications, and Pitfalls

    Chapter 6, Introducing Machine Learning, provides an overview of machine learning and its importance in data-driven decision-making. It covers the progression from traditional statistics to machine learning, the various types of machine learning techniques, and the process of training, validating, and testing models.

    Chapter 7, Supervised Machine Learning, focuses on one of the most utilized and beneficial subfields of machine learning. It discusses the steps involved in training and deploying supervised machine learning models and core supervised learning algorithms, as well as factors to consider when training and evaluating these models and their applications.

    Chapter 8, Unsupervised Machine Learning, explores the field of unsupervised learning, where algorithms discover hidden patterns and insights from unlabeled data. The chapter covers practical examples of unsupervised learning, the key steps involved, and techniques such as clustering, anomaly detection, dimensionality reduction, and association rule learning. It emphasizes the distinct nature of unsupervised learning compared to supervised learning and highlights its potential for uncovering valuable information in data without prior training.

    Chapter 9, Interpreting and Evaluating Machine Learning Models, equips readers with the skills needed to assess the accuracy and reliability of machine learning models. You will learn how to use evaluation metrics to measure model performance and understand the importance of using holdout (test) data for unbiased evaluation. The chapter provides insights into the differences between evaluation metrics for regression and classification models, enabling readers to effectively interpret and validate the quality of machine learning models, ensuring their successful implementation in real-world scenarios.

    Chapter 10, Common Pitfalls in Machine Learning, provides readers with the knowledge to identify and address common challenges in developing and deploying machine learning models. It covers issues such as inadequate or poor-quality training data, overfitting and underfitting, training-serving skew, model drift, and bias and fairness. You will learn practical strategies to mitigate these pitfalls, ensuring your models are reliable, accurate, and equitable, ultimately leading to better business decisions and outcomes.

    Part 3: Leading Successful Data Science Projects and Teams

    Chapter 11, The Structure of a Data Science Project, provides a comprehensive framework for planning and executing data science projects, focusing on delivering impactful data products. You will learn how to identify, evaluate, and prioritize use cases that align with your organization’s goals and have the potential to drive real business value. The chapter covers the key stages of data product development, from data preparation to model design, evaluation, and deployment. You will also learn how to evaluate the business impact of your data products by selecting relevant metrics and KPIs, enabling you to demonstrate the tangible value and ROI of your initiatives and secure ongoing support for your projects.

    Chapter 12, The Data Science Team, looks at the art and science of assembling a high-performing data science team. You will learn about the key roles that make up a successful team, including data scientists, machine learning engineers, and data engineers, along with the skills and expertise each role brings to the table. The chapter explores different operating models for structuring data science teams within larger organizations.

    Chapter 13, Managing the Data Science Team, explores the unique challenges and best practices for leading data science teams effectively. It covers strategies for enabling rapid experimentation, managing uncertainty, balancing research and production work, communicating effectively, fostering continuous learning, and promoting collaboration. The chapter also discusses common challenges such as aligning projects with business goals, scaling and deploying models, ensuring fairness and ethics, and driving the adoption of data science solutions.

    Chapter 14, Continuing Your Journey as a Data Science Leader, provides guidance on navigating the rapidly evolving landscape of data science, machine learning, and AI. It explores strategies for staying current with emerging technologies, specializing in specific industries or fields, and embracing continuous learning. The chapter also discusses the importance of staying informed about the latest trends and news and how data science leaders can promote data-driven thinking within their organizations.

    To get the most out of this book, some familiarity with basic mathematical concepts such as algebra, probability, and statistics is helpful but not required. The real prerequisites are curiosity, a willingness to learn, and a drive to use data for the good of your organization. If you bring those qualities, this book will supply the knowledge and practical skills you need. Step by step, you’ll learn to wield the tools of data science and AI with clarity, confidence, and purpose.

    Setup instructions will be provided in the chapters where there are code exercises.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Click on the cell to activate it, type print("Hello, world!"), and then click the play button to run the code.

    A block of code is set as follows:

    # Calculate median (middle value)

    median_sales = sales_data_year1.median()

    print(f"The median monthly sales, a typical sales month, is {round(median_sales)} units.")

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    # Calculate standard deviation (measure of the amount of variation)

    std_dev_sales = sales_data_year1.std()

    print(f"The standard deviation, showing the typical variation from the mean sales, is {round(std_dev_sales)} units.")

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Click File, then choose New Notebook from the dropdown.

    Tips or important notes

    Appear like this.
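The two sample snippets in the conventions above refer to a variable, sales_data_year1, that this preview does not define. What follows is a minimal, self-contained sketch of the same two calculations. It assumes a made-up list of twelve monthly sales figures, and it swaps the .median() and .std() method calls for Python's standard-library statistics module so that it runs without installing anything. The numbers are illustrative only and do not come from the book.

```python
import statistics

# Hypothetical monthly unit sales for one year (illustrative values, not from the book)
sales_data_year1 = [120, 135, 150, 110, 160, 175, 180, 165, 140, 130, 156, 190]

# Median: the middle value once the twelve months are sorted
median_sales = statistics.median(sales_data_year1)

# Sample standard deviation: the typical variation of a month around the mean
std_dev_sales = statistics.stdev(sales_data_year1)

print(f"The median monthly sales, a typical sales month, is {round(median_sales)} units.")
print(f"The standard deviation, showing the typical variation from the mean sales, is {round(std_dev_sales)} units.")
```

statistics.stdev computes the sample standard deviation (dividing by n - 1). That matches the default behavior of the pandas .std() method, which the original snippets appear to use; if they do, the same data loaded into pandas would give the same figure.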

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address
