Data Science with R: Beginner to Expert
Ebook, 523 pages, 1 hour

About this ebook

About the book

Data Science with R covers the R language, statistics, graphing, and machine learning. It is beginner-friendly and easy to follow. The book explains data science concepts in simple language.

The book takes you through data analysis, the underlying mathematics, and prediction modeling. The mathematics chapter covers linear algebra, probability, and probability distributions. These are the foundations for learning machine learning. The book covers all aspects of data science, from data collection to the presentation of prediction models. It covers data preparation, regression, and classification extensively. The chapters include to-do exercises for practicing machine learning.

The book uses examples based on real datasets. The hands-on projects are detailed step-by-step guides for data analysis and model creation. They provide a practical approach to applying data science techniques.

The book concludes with interview questions.

About the target audience

This is a beginner-level book that requires no prior knowledge.

Language: English
Publisher: Narayana Nemani
Release date: Jan 13, 2025
ISBN: 9798230174318
    Data Science with R

    Author - Narayana Nemani

    2025-02-09

    Preface & Table of Contents

    Figure 0.1: Book cover

    0.1 Preface

    Data Science is an emerging field. A large number of organizations are using it for research and business improvement. Glassdoor ranked data science as one of the best careers. There is an ever-growing demand for data scientists.

    0.2 About this book

    This book is for beginners and domain experts who want to start their data science journey. It aims to be precise and complete, and to offer a fast way to learn data science. It covers all aspects of data science, including graphing and machine learning. As this is a beginner-level book, prior knowledge is not needed, although knowing mathematics, statistics, and programming would be helpful.

    Book chapters -

    Getting started

    Predictive models

    Statistics

    R language

    Mathematics

    Data wrangling

    Data table package

    GGplot2

    Exploratory data analysis

    Advanced GGplot2

    Other visualization packages

    Machine learning

    Data preparation

    Machine learning categories

    Regression

    Classification

    Advanced supervised learning

    Regression hands-on project

    Classification hands-on project

    Supervised learning todo exercises

    Clustering

    Data Science use cases

    Interview questions

    0.3 About the Author

    Narayana Nemani

    Narayana Nemani is a Lead Data Scientist. He is involved in the teaching and research of data science.

    Copyright

    Published by Narayana Nemani

    © 2025 Visakhapatnam

    All rights reserved. No part of this book may be reproduced or modified in any form, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

    The scanning, uploading, and distribution of this book via the internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions, and do not participate in or encourage piracy of copyrighted materials.

    1 Getting Started

    1.1 Introduction

    Data science is the study and use of data.

    Analyzing and predicting data are the primary objectives of data science.

    Data science is an emerging field. It is a subset of artificial intelligence.

    Popular data science applications are self-driving cars, gaming AI, search engines (Google, DuckDuckGo), and virtual assistants (SIRI, Alexa).

    Figure 1.1: Applications of Data Science

    Data science jobs are among the highest-paid occupations. Both programmers and domain experts fill these positions.

    1.2 Data science use cases

    Below are use cases of data science -

    E-commerce applications recommend products based on previous purchases.

    Supermarkets estimate sales and stock their inventory accordingly.

    1.3 Learning Data science

    Data science needs knowledge from various fields. Statistics, domain knowledge, and programming are the pillars of data science.

    Figure 1.2: Pillars of Data Science

    1.4 Data sources

    Data is the core of data science.

    Typical data sources are the organization’s internal data, government data, and surveys.

    For example, news channels conduct voting surveys before elections.

    1.5 Predictive models

    Create models for understanding and predicting data. Models use existing data as input and predict outcomes.

    The steps for creating a model are -

    Understand the problem statements

    Transform data

    Analyze data

    Build the model with algorithms

    Creating models is an iterative process. After finding a new insight in a step, make relevant changes in other steps.

    Figure 1.3: Steps of model creation

    1.6 Editors

    1.6.1 RStudio

    RStudio is the recommended IDE for the R language. This ebook itself is written in RStudio. Install it locally or access it via the cloud.

    The graphical interface has four areas. Each area has one or more panes.

    Figure 1.4: RStudio

    Source code editor pane - Write the actual code in the editor pane. To create a new script file in RStudio, select the File > New File > R Script option. Save the script file so it can be reused.

    Console pane - Run commands and view results at the console. It is a command-line interface.

    The example below runs the date function from the editor pane. The console displays the result.

    Figure 1.5: Editor and console panes
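A minimal sketch of what the figure shows, assuming the script contains only a call to base R's date function. The variable name `today` is an illustrative choice, not taken from the book.

```r
# date() returns the current system date and time as a character string.
today <- date()

# Running this line from the editor (Ctrl + Enter) sends it to the console,
# which prints the result.
print(today)
```

The printed value depends on when the code runs, so no fixed output is shown here.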

    The Environment pane displays the variables currently loaded in memory and their values.

    Figure 1.6: Environment pane

    The history pane shows the previously executed code commands.

    Figure 1.7: History pane

    Frequently used RStudio keyboard shortcuts are -

    Ctrl + Enter - Run current line or selected code

    Ctrl + Alt + R - Run entire document

    Ctrl + Shift + C - Comment/Uncomment current line or selected code

    Ctrl + L - Clear console

    Esc - Interrupt currently executing command

    1.6.2 JupyterLab

    JupyterLab is another popular IDE for data science projects. It supports many programming languages, including R and Python. JupyterLab splits code into multiple blocks called cells.

    Frequently used JupyterLab keyboard shortcuts are -

    Ctrl + Enter - Run current cell or selected cells

    A - Insert cell above the current cell

    B - Insert cell below the current cell

    D, D - Delete current cell

    Figure 1.8: JupyterLab

    1.7 R Projects

    It is advisable to use multiple script files, at least one for each task. This improves the maintainability of the code.

    Typically, create scripts for cleaning, transforming, and model creation. Use folders for multiple files. For example, if there are four transformation scripts, place all of them in a transform folder.

    RStudio provides a project feature. An R project groups all the associated files. Add the script files and data files to the project.

    The files pane displays the code and data files.

    Figure 1.9: Typical project structure
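The layout described above can also be sketched from R itself. This is only an illustration: the project name, folder names, and script names are hypothetical, and the folders are created in a temporary directory so that nothing in a real project is touched.

```r
# Hypothetical project root inside R's temporary directory.
project_dir <- file.path(tempdir(), "stock-analysis")

# One folder per kind of task, plus a folder for data files (assumed names).
folders <- c("data", "clean", "transform", "model")
for (folder in folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Four hypothetical transformation scripts, all kept in the transform folder.
scripts <- file.path(project_dir, "transform",
                     paste0("transform_", 1:4, ".R"))
created <- file.create(scripts)

# List everything in the project, similar to what the Files pane shows.
print(list.files(project_dir, recursive = TRUE))
```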

    Create projects in RStudio with the File > New Project option. Select the New Project option in the project type step.

    Figure 1.10: New project wizard

    2 Predictive Models

    2.1 Introduction

    Construct predictive models to comprehend and forecast data. Models understand existing data and predict outcomes.

    This chapter discusses creating a predictive model without delving into the code.

    Consider a scenario where an investor plans to buy a stock in 2025. The investor has the stock prices from 2021 to 2024 and needs to predict the price of the stock in 2025. The prediction helps in deciding whether the stock is a good investment.

    The stock prices are -

    Figure 2.1: Stock prices chart

    The steps for creating the predictive model are -

    Define the problem statement

    Data Wrangling (Transform data)

    Exploratory Data Analysis (Analyze data)

    Build model

    Predict

    2.2 Define the problem statement

    Predict the stock price in the year 2025.

    2.3 Data Wrangling (Transform data)

    Data wrangling is the process of transforming and cleaning data.

    This step includes -

    Handle invalid data.

    Create new columns based on existing columns.

    In this particular example, we are not transforming data.
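Although the stock example needs no transformation, the two wrangling tasks listed above might look like this in base R. The data frame below is hypothetical: only the 2024 price of 12 appears in the text, and the -1 is a made-up stand-in for an invalid entry.

```r
# Hypothetical yearly prices; -1 marks an invalid record.
prices <- data.frame(
  year  = c(2021, 2022, 2023, 2024),
  price = c(5, -1, 9, 12)
)

# Handle invalid data: treat negative prices as missing, then drop those rows.
prices$price[prices$price < 0] <- NA
clean <- prices[!is.na(prices$price), ]

# Create a new column based on an existing column:
# the price change compared with the previous remaining row.
clean$change <- c(NA, diff(clean$price))

print(clean)
```

Dropping invalid rows is only one option. Replacing them with an estimate is another common choice, depending on the problem.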

    2.4 Exploratory Data Analysis (Analyze data)

    Exploratory Data Analysis (EDA) is analyzing data to gain insights into it.

    Graph the data to understand the data patterns.

    Figure 2.2: Stock price chart

    2.5 Build model

    Create a model by using the existing data as input.

    We create the model using data from 2021 to 2024.

    model = BuildModel(data)

    2.6 Predict

    The model can predict the future stock prices. Predict the stock price in 2025.

    prediction.2025 = model.predict(year=2025)
    print(prediction.2025)

    ## [1] 21

    Figure 2.3: Stock price prediction chart

    The price of the stock is 12 in 2024. As per the model, the stock price almost doubles to 21 in 2025. Based on this prediction, the stock looks like a good investment.
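The build and predict steps are shown in the chapter as pseudocode. One way they could be written in runnable R is sketched below, assuming a straight-line fit with `lm`. The chapter does not say which algorithm it uses, and the prices for 2021 to 2023 are made up (only the 2024 value of 12 comes from the text), so the result will not match the 21 shown above.

```r
# Hypothetical yearly prices; only the 2024 value is taken from the text.
data <- data.frame(
  year  = c(2021, 2022, 2023, 2024),
  price = c(3, 5, 8, 12)
)

# Build model: fit a straight line of price against year (an assumed choice).
model <- lm(price ~ year, data = data)

# Predict: ask the fitted model for the price in 2025.
prediction.2025 <- predict(model, newdata = data.frame(year = 2025))
print(unname(prediction.2025))
```

With these made-up numbers, the fitted line rises by 3 per year and the predicted 2025 price is 14.5. A different model or dataset would give a different figure.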

    2.7 Tip

    The subsequent three chapters (Statistics, R Language, and Mathematics) cover the fundamentals needed for data science. They are optional. You can skip them and go to Chapter 6 - Data Wrangling.

    3 Statistics

    3.1 Introduction

    Statistics is the science of collecting, analyzing, and presenting data. It is built on mathematical principles.

    Use statistics for making decisions and formulating strategies.

    3.2 Common statistical terms

    An individual is the data of a single person or thing. Individuals are usually the rows of a table.

    A variable is an attribute of the individuals. Variables are usually the columns of a table.

    Figure 3.1: Individuals and Variables

    The population (N) is the data collected from all the members.

    The sample (n) is the data collected from some population members.

    Figure 3.2: Population and Sample

    Collecting data on a large population is often impractical. Instead, explore sample data to understand the data patterns.

    3.3 Types of sampling

    Sampling can be done in three ways -

    Random sampling is selecting individuals randomly.

    Systematic sampling is sorting the individuals and selecting every nth one.

    Stratified sampling is selecting individuals from each category.

    Figure 3.3: Types of sampling
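The three sampling methods can be sketched in base R as follows. The population of 20 individuals, the two categories, and the sample sizes are all hypothetical values chosen for illustration.

```r
set.seed(42)  # make the random draws repeatable

# Hypothetical population: 20 individuals in two categories.
population <- data.frame(
  id    = 1:20,
  group = rep(c("A", "B"), each = 10)
)

# Random sampling: pick 4 individuals at random.
random_sample <- population[sample(nrow(population), 4), ]

# Systematic sampling: sort, then take every 5th individual.
sorted <- population[order(population$id), ]
systematic_sample <- sorted[seq(5, nrow(sorted), by = 5), ]

# Stratified sampling: pick 2 individuals at random from each category.
strata <- split(population, population$group)
stratified_sample <- do.call(
  rbind,
  lapply(strata, function(s) s[sample(nrow(s), 2), ])
)

print(systematic_sample$id)
```

The systematic sample always contains ids 5, 10, 15, and 20, while the random and stratified samples depend on the seed.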

    3.4 Statistics categories

    Statistics is of two types -

    Descriptive statistics describes the data with graphs or parameters.

    Examples of parameters are the average, minimum, and maximum.

    Inferential statistics is understanding the data patterns in sample data and predicting the population outcomes.
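As a small illustration of descriptive statistics, the average, minimum, and maximum mentioned above can be computed with base R functions. The heights are hypothetical values in centimeters.

```r
# Hypothetical heights of five individuals, in centimeters.
heights <- c(150, 160, 165, 170, 180)

average <- mean(heights)  # 165
lowest  <- min(heights)   # 150
highest <- max(heights)   # 180

print(c(average = average, minimum = lowest, maximum = highest))
```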

    3.5 Data classification

    Figure 3.4: Data classification

    3.5.1 Classification by Values

    Classify the data based on the values in the variable.

    Figure 3.5: Data classification based on the types of values

    Quantitative (numerical)

    Measure quantitative data in numbers.

    For example, the heights of individuals.

    Quantitative data is of two types -

    Continuous data can take decimal values. Examples of continuous data are temperature, height, and distance.

    Discrete data takes only whole-number values. Examples of discrete data are student roll numbers and the number of houses in a locality.
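In R, this distinction loosely corresponds to the double and integer storage types. The sketch below uses hypothetical values. The mapping is a convenience rather than a rule, since R stores whole numbers as doubles unless the `L` suffix is used.

```r
# Continuous data: temperatures can take decimal values (stored as double).
temperature <- c(36.6, 37.2, 38.0)

# Discrete data: counts of houses are whole numbers (the L suffix makes integers).
houses <- c(12L, 40L, 7L)

print(typeof(temperature))  # "double"
print(typeof(houses))       # "integer"
```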

    Qualitative (categorical)

    Qualitative data consists of the names given to features.

    For example, labeling materials as brittle or flexible.

    Qualitative data is of two types -

    The nominal data does not have a value associated with it. A nominal value is not greater or lesser than
