Data Science with R: Beginner to Expert
About the book
Data Science with R covers the R language, statistics, graphing, and machine learning. It is beginner-friendly and easy to follow. The book explains data science concepts in simple language.
The book takes you through data analysis, the underlying mathematics, and prediction modeling. The mathematics chapter covers linear algebra, probability, and probability distributions. These are the foundations for learning machine learning. The book covers all aspects of data science, from data collection to the presentation of prediction models. It covers data preparation, regression, and classification extensively. The chapters include to-do exercises for practicing machine learning.
The book uses examples based on real datasets. The hands-on projects are detailed step-by-step guides to data analysis and model creation. They provide a practical approach to applying data science techniques.
The book concludes with interview questions.
About the target audience
This is a beginner-level book that requires no prior knowledge.
Author - Narayana Nemani
2025-02-09
Preface & Table of Contents
Figure 0.1: Book cover
0.1 Preface
Data Science is an emerging field. A large number of organizations are using it for research and business improvement. Glassdoor ranked data science as one of the best careers. There is an ever-growing demand for data scientists.
0.2 About this book
This book is for beginners and domain experts who want to start their Data Science journey. The book is precise and complete. It is one of the fastest ways to learn data science. It covers all the aspects of data science, including graphing and machine learning. As this is a beginner-level book, prior knowledge is not needed, although knowing mathematics, statistics, and programming would be helpful.
Book chapters -
Getting started
Predictive models
Statistics
R language
Mathematics
Data wrangling
Data table package
GGplot2
Exploratory data analysis
Advanced GGplot2
Other visualization packages
Machine learning
Data preparation
Machine learning categories
Regression
Classification
Advanced supervised learning
Regression hands-on project
Classification hands-on project
Supervised learning todo exercises
Clustering
Data Science use cases
Interview questions
0.3 About the Author
Narayana Nemani
Narayana Nemani is a Lead Data Scientist. He is involved in the teaching and research of data science.
Copyright
Published by Narayana Nemani
© 2025 Visakhapatnam
All rights reserved. No part of this book may be reproduced or modified in any form, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.
The scanning, uploading, and distribution of this book via the internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions, and do not participate in or encourage piracy of copyrighted materials.
1 Getting Started
1.1 Introduction
Data science is studying and using data.
Analyzing and predicting data are the primary objectives of data science.
Data science is an emerging field. It is a subset of artificial intelligence.
Popular data science applications are self-driving cars, gaming AI, search engines (Google, DuckDuckGo), and virtual assistants (Siri, Alexa).
Figure 1.1: Applications of Data Science
Data science jobs are among the highest-paid occupations. Both programmers and domain experts fill these positions.
1.2 Data science use cases
Below are use cases of data science -
The e-commerce applications recommend products based on previous purchases.
Supermarkets estimate sales and stock their inventory accordingly.
1.3 Learning Data science
Data science needs knowledge from various fields. Statistics, domain knowledge, and programming are the pillars of data science.
Figure 1.2: Pillars of Data Science
1.4 Data sources
Data is the core of data science.
Typical data sources are the organization’s internal data, government data, and surveys.
For example, news channels conduct voting surveys before elections.
1.5 Predictive models
Create models for understanding and predicting data. Models use existing data as input and predict outcomes.
The steps for creating a model are -
Understand the problem statement
Transform data
Analyze data
Build the model with algorithms
Creating models is an iterative process. After finding a new insight in a step, make relevant changes in other steps.
Figure 1.3: Steps of model creation
1.6 Editors
1.6.1 RStudio
RStudio is the recommended IDE for the R language. This ebook itself is written in RStudio. Install it locally or access it via the cloud.
The graphical interface has four areas. Each area contains one or more panes.
Figure 1.4: RStudio
Source code editor pane - Write the actual code in the editor pane. To create a new script file in RStudio, select the File > New File > R Script option. Save the script file so that it can be reused.
Console pane - Run commands and view results at the console. It is a command-line interface.
The example below runs the date function from the editor pane. The console displays the result.
Figure 1.5: Editor and console panes
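As a minimal sketch of what such a script might contain, the lines below call the base R date function and print the result. The variable names current and today are only illustrative. Sys.Date is shown alongside as an alternative that returns a Date object rather than a character string.

```r
# Get the current date and time as a character string
current <- date()
print(current)

# Sys.Date() returns a Date object instead of a string
today <- Sys.Date()
print(format(today, "%Y-%m-%d"))
```

Place the cursor on a line and press Ctrl + Enter to send it to the console.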
The Environment pane displays the variables currently loaded in memory and their values.
Figure 1.6: Environment pane
The History pane shows the previously executed commands.
Figure 1.7: History pane
Frequently used RStudio keyboard shortcuts are -
Ctrl + Enter - Run current line or selected code
Ctrl + Alt + R - Run entire document
Ctrl + Shift + C - Comment/Uncomment current line or selected code
Ctrl + L - Clear Console
Esc - Interrupt currently executing command
1.6.2 JupyterLab
JupyterLab is another popular IDE for data science projects. It supports many programming languages, including R and Python. JupyterLab splits code into multiple blocks. The code blocks are called cells.
Frequently used JupyterLab keyboard shortcuts are -
Ctrl + Enter - Run current cell or selected cells
A - Insert cell above the current cell
B - Insert cell below the current cell
D, D - Delete current cell
Figure 1.8: JupyterLab
1.7 R Projects
It is advisable to use multiple script files, at least one for each task. This improves the maintainability of the code.
Typically, create scripts for cleaning, transforming, and model creation. Use folders for multiple files. For example, if there are four transformation scripts, place all of them in a transform folder.
RStudio provides a project feature. An R project groups all the associated files. Add the script files and data files to the project.
The Files pane displays the code and data files.
Figure 1.9: Typical project structure
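The layout described above can also be created from the console. The following is a rough sketch only: the project name stock-analysis, the folder names, and the script name add_columns.R are hypothetical, and everything is created inside a temporary directory so that no real files are touched.

```r
# Hypothetical project layout, created in a temporary directory
project_dir <- file.path(tempdir(), "stock-analysis")
folders <- c("data", "clean", "transform", "model")

# Create one folder per task
for (folder in folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Add an empty transformation script to the transform folder
file.create(file.path(project_dir, "transform", "add_columns.R"))

# List everything that was created
print(list.files(project_dir, recursive = TRUE, include.dirs = TRUE))
```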
Create projects in RStudio with the File > New Project option. Select the New Project option in the project type step.
Figure 1.10: New project wizard
2 Predictive Models
2.1 Introduction
Construct predictive models to comprehend and forecast data. Models understand existing data and predict outcomes.
This chapter discusses creating a predictive model without delving into the code.
Consider a scenario where an investor plans to buy a stock in 2025. He has the stock prices from 2021 to 2024. He needs to predict the price of a stock in the year 2025. Price prediction helps him in deciding whether the stock is a good investment.
The stock prices are -
Figure 2.1: Stock prices chart
The steps for creating the predictive model are -
Define the problem statement
Data Wrangling (Transform data)
Exploratory Data Analysis (Analyze data)
Build model
Predict
2.2 Define the problem statement
Predict the stock price in the year 2025.
2.3 Data Wrangling (Transform data)
Data wrangling is the process of transforming and cleaning data.
This step includes -
Handle invalid data.
Create new columns based on existing columns.
In this particular example, we are not transforming data.
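Although the stock example needs no transformation, the two wrangling steps listed above can be sketched in base R. The data frame below is entirely hypothetical (including the shares column) and exists only to show one invalid-data rule and one derived column.

```r
# Hypothetical raw data: one price is missing (NA) and one is negative
raw <- data.frame(
  year   = c(2021, 2022, 2023, 2024),
  price  = c(3, NA, -8, 12),
  shares = c(10, 10, 20, 20)
)

# Handle invalid data: keep rows with a known, non-negative price
clean <- raw[!is.na(raw$price) & raw$price >= 0, ]

# Create a new column based on existing columns
clean$value <- clean$price * clean$shares
print(clean)
```

Dropping rows is only one way to handle invalid values. Depending on the problem, replacing them with an estimate may be more appropriate.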
2.4 Exploratory Data Analysis (Analyze data)
Exploratory Data Analysis (EDA) is analyzing data to gain insights into it.
Graph the data to understand the data patterns.
Figure 2.2: Stock price chart
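A chart like this can be drawn with base R graphics. In the sketch below the yearly prices are hypothetical placeholders. Only the 2024 value of 12 is taken from the text, so the plot will not match the book's figure exactly.

```r
# Hypothetical yearly prices; only the 2024 value (12) comes from the text
prices <- data.frame(
  year  = 2021:2024,
  price = c(3, 5, 8, 12)
)

# Numeric summary of the prices
print(summary(prices$price))

# Line chart with points to show the trend over the years
plot(prices$year, prices$price, type = "b",
     xlab = "Year", ylab = "Price", main = "Stock price by year")
```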
2.5 Build model
Create a model by using the existing data as input.
We create the model using data from 2021 to 2024.
model = BuildModel(data)
2.6 Predict
The model can predict the future stock prices. Predict the stock price in 2025.
prediction.2025 = model.predict(year=2025)
print(prediction.2025)
## [1] 21
Figure 2.3: Stock price prediction chart
The price of the stock is 12 in 2024. As per the model, the stock price rises to 21 in 2025, an increase of 75%. If the prediction holds, the stock is a good investment.
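The build and predict steps can be tried out in runnable R. The sketch below is an assumption-laden stand-in, not the model used in this chapter: it fits a straight line with lm, and the prices for 2021 to 2023 are hypothetical (only the 2024 value of 12 comes from the text). Its prediction therefore differs from the 21 shown above.

```r
# Hypothetical yearly prices; only the 2024 value (12) comes from the text
prices <- data.frame(
  year  = 2021:2024,
  price = c(3, 5, 8, 12)
)

# Build the model: fit a straight line of price against year
model <- lm(price ~ year, data = prices)

# Predict: estimate the price for 2025
prediction_2025 <- predict(model, newdata = data.frame(year = 2025))
print(unname(prediction_2025))
```

A straight line is the simplest possible choice here. Prices that grow faster each year, as in the chart, may call for a different model.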
2.7 Tip
The subsequent three chapters (Statistics, R Language, and Mathematics) cover the fundamentals needed for data science. They are optional. You can skip them and go to Chapter 6 - Data Wrangling.
3 Statistics
3.1 Introduction
Statistics is the science of collecting, analyzing, and presenting data. It is built on mathematical principles.
Use statistics for making decisions and formulating strategies.
3.2 Common statistical terms
An individual is the data of a single person or thing. Individuals are usually the rows of a table.
A variable is an attribute of the individuals. Variables are usually the columns of a table.
Figure 3.1: Individuals and Variables
The population (N) is the data collected from all the members.
The sample (n) is the data collected from some population members.
Figure 3.2: Population and Sample
Collecting data on a large population is often not practical. Instead, explore sample data to understand the data patterns.
3.3 Types of sampling
There are three common ways to sample -
Random sampling is selecting individuals randomly.
Systematic sampling is sorting the individuals and selecting every nth one.
Stratified sampling is selecting individuals from each category.
Figure 3.3: Types of sampling
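The three sampling methods above can be sketched in base R. The people data frame, its group column, and the sample sizes are hypothetical and chosen only to keep the example small.

```r
set.seed(42)  # make the random draws repeatable

# Hypothetical population of 12 individuals in 3 categories
people <- data.frame(
  id    = 1:12,
  group = rep(c("A", "B", "C"), each = 4)
)

# Random sampling: pick 4 rows at random
random_sample <- people[sample(nrow(people), size = 4), ]

# Systematic sampling: sort, then take every 3rd row
sorted <- people[order(people$id), ]
systematic_sample <- sorted[seq(from = 1, to = nrow(sorted), by = 3), ]

# Stratified sampling: pick 2 rows at random from each group
stratified_sample <- do.call(
  rbind,
  lapply(split(people, people$group),
         function(g) g[sample(nrow(g), size = 2), ])
)

print(systematic_sample$id)
print(table(stratified_sample$group))
```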
3.4 Statistics categories
Statistics has two branches -
Descriptive statistics describes the data with graphs or parameters.
Examples of parameters are the average, minimum, and maximum.
Inferential statistics finds patterns in sample data and uses them to predict outcomes for the population.
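To make the distinction concrete, here is a small sketch in base R. The heights are hypothetical values, and using a sample mean as the estimate is just one simple illustration of inference.

```r
# Hypothetical heights (cm) of a whole population of 8 people
population <- c(150, 155, 160, 165, 170, 175, 180, 185)

# Descriptive statistics: summarize the data with parameters
described <- c(
  average = mean(population),
  minimum = min(population),
  maximum = max(population)
)
print(described)

# Inferential statistics: estimate the population average from a sample
set.seed(1)
heights_sample <- sample(population, size = 4)
estimated_average <- mean(heights_sample)
print(estimated_average)
```

The estimate usually differs somewhat from the true average, and it tends to get closer as the sample grows.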
3.5 Data classification
Figure 3.4: Data classification
3.5.1 Classification by Values
Classify the data based on the values in the variable.
Figure 3.5: Data classification based on the types of values
Quantitative (numerical)
Measure quantitative data in numbers.
For example, the heights of individuals.
Quantitative data is of two types -
Continuous data can take decimal values. Examples of continuous data are temperature, height, and distance.
Discrete data takes only whole-number values. Examples of discrete data are student roll numbers and the number of houses in a locality.
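In R, the two kinds of quantitative data map naturally onto the double and integer types. The sketch below uses hypothetical values. The L suffix marks a number as an integer.

```r
# Continuous data: temperatures can take decimal values
temperature <- c(21.5, 19.8, 23.1)

# Discrete data: counts of houses are whole numbers
houses <- c(12L, 30L, 7L)

print(is.double(temperature))
print(is.integer(houses))
```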
Qualitative (categorical)
Qualitative data consists of names or labels given to features.
For example, labeling materials as brittle or flexible.
Qualitative data is of two types -
The nominal data does not have a value associated with it. A nominal value is not greater or lesser than