IDS Unit I
Introduction: Definition of Data Science – Big Data and Data Science hype, and getting past the hype – Datafication – Current landscape of perspectives – Statistical Inference – Populations and samples – Statistical modeling, probability distributions, fitting a model – Overfitting. Basics of R: Introduction, R Environment Setup, Programming with R, Basic Data Types.
Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets which are typically huge in volume. The field encompasses preparing data for analysis, analyzing it, and presenting findings to inform high-level decisions in an organization. As such, it incorporates skills from computer science, mathematics, statistics, information visualization, graphic design, and business.
Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate the data so that you
can find something new and meaningful.
Data science uses the most powerful hardware, programming systems, and the most efficient algorithms to solve data-related problems. It is the future of artificial intelligence.
Note: A few important steps to help you work more successfully on data science projects:
Setting the research goal: Understanding the business or activity that our data science project is part of is key to ensuring its success, and it is the first phase of any sound data analytics project. Defining the what, the why, and the how of our project in a project charter is the foremost task. Then we sit down to define a timeline and concrete key performance indicators; this is the essential first step to kick-start our data initiative.
Retrieving data: Finding and getting access to the data needed for our project is the next step. Mixing and merging data from as many data sources as possible is what makes a data project great, so look as far as possible. This data is either found within the company or retrieved from a third party. A few ways to get usable data: connecting to a database, using APIs, or looking for open data.
Data preparation: The next data science step is the dreaded data preparation process, which typically takes up to 80% of the time dedicated to a data project. It involves checking and remediating data errors, enriching the data with data from other sources, and transforming it into a suitable format for your models.
Data exploration: Now that we have cleaned our data, it's time to manipulate it to get the most value out of it. Diving deeper into our data using descriptive statistics and visual techniques is how we explore it. One example is enriching the data by creating time-based features, such as extracting date components (month, hour, day of the week, week of the year, etc.), calculating differences between date columns, or flagging national holidays (see the R sketch after this list). Another way of enriching data is by joining datasets: essentially, retrieving columns from one dataset or table into a reference dataset.
Presentation and automation: Presenting our results to the stakeholders and industrializing our analysis process for repetitive reuse and integration with other tools. When we are dealing with large volumes of data, visualization is the best way to explore and communicate our findings, and it is the next phase of our data analytics project.
Data modeling: Using machine learning and statistical techniques is the step that takes us further toward our project goal of predicting future trends. By working with clustering algorithms, we can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express which features are decisive in these results.
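As a small illustration of the data exploration step above, here is a minimal R sketch that derives time-based features from a timestamp column; the data frame and column names are hypothetical.
# Hypothetical data: one row per order, with a timestamp column
orders <- data.frame(order_time = as.POSIXct(c("2023-01-15 09:30:00", "2023-03-02 18:05:00")))
orders$month   <- as.integer(format(orders$order_time, "%m"))  # month component
orders$hour    <- as.integer(format(orders$order_time, "%H"))  # hour component
orders$weekday <- weekdays(orders$order_time)                  # day of the week
orders$week    <- as.integer(format(orders$order_time, "%U"))  # week of the year
print(orders)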
Big Data and Data Science hype
Big Data: This is a term for extracting meaningful insight by analyzing the huge amounts of complex, variously formatted data generated at high speed that cannot be handled or processed by traditional systems.
Big data refers to significant volumes of data that cannot be processed effectively
with the traditional applications that are currently used. The processing of big data
begins with raw data that isn’t aggregated and is most often impossible to store in the
memory of a single computer.
Big data is a buzzword used to describe immense volumes of data, both
unstructured and structured, that can inundate a business on a day-to-day basis.
Big data is analyzed for insights, which can lead to better decisions and strategic business moves.
Social Media: In today's world, a good percentage of the total world population is engaged with social media like Facebook, WhatsApp, Twitter, YouTube, Instagram, etc. Each activity on such media, like uploading a photo or video, sending a message, making a comment, or putting a like, creates data.
Sensors placed in various places: Sensors placed at various places in a city gather data on temperature, humidity, etc. A camera placed beside the road gathers information about traffic conditions and creates data. Security cameras placed in sensitive areas like airports, railway stations, and shopping malls create a lot of data.
Customer Satisfaction Feedback: Customer feedback on the products or services of various companies on their websites creates data. For example, retail commercial sites like Amazon, Walmart, Flipkart, and Myntra gather customer feedback on the quality of their products and delivery time. Telecom companies and other service-provider organizations seek customer experience with their service. All of this creates a lot of data.
IoT Appliances: Electronic devices that are connected to the internet create data for their smart functionality; examples are a smart TV, smart washing machine, smart coffee machine, smart AC, etc. This is machine-generated data created by sensors kept in various devices. For example, a smart printing machine is connected to the internet. A number of such printing machines connected to a network can transfer data among themselves. So, if anyone loads a file into one printing machine, the system stores that file's content, and another printing machine kept in another building or on another floor can print out a hard copy of that file. Such data transfer between various printing machines generates data.
E-commerce: In e-commerce transactions, business transactions, banking, and the stock market, the many records stored are considered a source of big data. Payments through credit cards, debit cards, or other electronic means are all recorded as data.
Global Positioning System (GPS): GPS in a vehicle helps monitor the movement of the vehicle and shorten the path to a destination to cut fuel and time consumption. This system creates huge amounts of data on vehicle position and movement.
Big Data for Financial Services
Credit card companies, retail banks, private wealth management advisories, insurance firms, venture funds, and institutional investment banks all use big data for their financial services. The common problem among them all is the massive amounts of multi-structured data living in multiple disparate systems, which big data can solve. As such, big data is used in several ways, including:
1. Customer analytics
2. Compliance analytics
3. Fraud analytics
4. Operational analytics
Big Data in Communications
Gaining new subscribers, retaining customers, and expanding within current subscriber bases are top priorities for telecommunication service providers. The solutions to these challenges lie in the ability to combine and analyze the masses of customer-generated and machine-generated data being created every day.
Big Data for Retail
Whether it's a brick-and-mortar company or an online retailer, the answer to staying in the game and being competitive is understanding the customer better. This requires the ability to analyze all the disparate data sources that companies deal with every day, including weblogs, customer transaction data, social media, store-branded credit card data, and loyalty program data.
Datafication
The Data Science Landscape
Data science is part of the computer sciences. It comprises the disciplines of i) analytics, ii) statistics, and iii) machine learning.
Artificial Intelligence:
"It is a branch of computer science by which we can create intelligent machines which can behave like humans, think like humans, and are able to make decisions."
Intelligence, as we know, is the ability to acquire and apply knowledge. Knowledge is the information acquired through experience. Experience is the knowledge gained through exposure (training). Summing the terms up, we get artificial intelligence as "a copy of something natural (i.e., human beings) that is capable of acquiring and applying the information it has gained through exposure."
Machine Learning
Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own.
Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed.
A machine learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data, as a huge amount of data helps build a better model which predicts the output more accurately.
Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic as per the data and predicts the output. Machine learning has changed our way of thinking about such problems.
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means some input data is already tagged with the correct output.
Supervised learning is when the model is trained on a labelled dataset. A labelled dataset is one that has both input and output parameters. In this type of learning, both the training and validation datasets are labelled.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model for each shape:
If the given shape has four sides, and all the sides are equal, it will be labelled as a square.
If the given shape has three sides, it will be labelled as a triangle.
If the given shape has six equal sides, it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
The figures referenced above (not reproduced here) show labelled datasets. Figure B is a meteorological dataset that serves the purpose of predicting wind speed from different parameters. Input: dew point, temperature, pressure, relative humidity, wind direction. Output: wind speed.
Training the system: While training the model, the data is usually split in the ratio of 80:20, i.e., 80% as training data and the rest as testing data. For the training data, we feed in both the input and the output. The model learns from the training data only. We use different machine learning algorithms to build our model; learning means that the model builds some logic of its own. Once the model is ready, it is good to be tested. At the time of testing, the input is fed from the remaining 20% of the data, which the model has never seen before; the model predicts some value, and we compare it with the actual output to calculate the accuracy, as in the sketch below.
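A minimal sketch of this 80:20 split in R, using the built-in mtcars data; the linear model is just a stand-in for whichever algorithm is chosen.
set.seed(42)                                    # make the random split reproducible
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.8 * n))   # 80% of the row indices
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
model <- lm(mpg ~ wt + hp, data = train)        # learn from the 80%
pred  <- predict(model, newdata = test)         # predict the unseen 20%
print(sqrt(mean((pred - test$mpg)^2)))          # compare predictions with actual output (RMSE)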
Unsupervised Learning
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. An unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to the similarities between images.
Here, we have taken unlabeled input data, which means it is not categorized, and corresponding outputs are also not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find hidden patterns, and then a suitable algorithm such as k-means clustering, hierarchical clustering, etc. is applied.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
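A small R sketch of this idea, clustering the built-in iris measurements with k-means; the species labels are dropped so the algorithm sees only unlabeled data.
features <- iris[, 1:4]                 # drop the Species column: unlabeled data
set.seed(7)
km <- kmeans(features, centers = 3)     # ask for 3 groups
print(table(km$cluster, iris$Species))  # compare discovered groups with the true labels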
Statistical Inference
Types of Statistics
Statistics can be classified into two different categories. The two different types of Statistics are:
Descriptive Statistics
Inferential Statistics
Descriptive statistics summarizes or describes the characteristics of a data set.
Descriptive statistics consists of three basic categories of measures: measures of
central tendency, measures of variability (or spread), and frequency distribution.
Measures of central tendency describe the center of the data set (mean, median,
mode).
Measures of variability describe the dispersion of the data set (variance, standard
deviation).
Measures of frequency distribution describe the occurrence of data within the data set (counts).
Inferential statistics can be defined as a field of statistics that uses analytical tools to draw conclusions about a population by examining random samples. The goal of inferential statistics is to make generalizations about a population. In inferential statistics, a statistic is taken from the sample data (e.g., the sample mean) and used to make inferences about the population parameter (e.g., the population mean).
In short, descriptive statistics describe the data, whereas inferential statistics help you make predictions from the data. In inferential statistics, the data are taken from a sample, which allows you to generalize to the population.
In general, an inference is a conclusion drawn from evidence, so statistical inference means drawing conclusions about the population. To draw a conclusion about the population, it uses various statistical analysis techniques.
Population:
A population is the complete set of individuals or items under study. For example, let us assume that there are 5 employees in my company; those 5 people form a complete set, and hence they represent the population of my company. If I want to find the average age in my company, I simply add the ages and divide by N, the size of the population:
ages = {23, 45, 12, 34, 22}
μ = (Σ xᵢ) / N = (23 + 45 + 12 + 34 + 22) / 5 = 136 / 5 = 27.2
The result says that the average age in my company is 27.2 years; this is what we call the population mean.
Sample:
A sample represents a group drawn from the population of interest, which we use to represent the data. The sample is an unbiased (balanced) subset of the population that represents the whole data.
A sample is a group of the elements actually participating in the survey or study.
A sample is a representation of manageable size.
Samples are collected, and statistics are calculated from the samples, so one can make inferences or extrapolations from the sample to the population.
This process of collecting information from a sample is called sampling.
The sample size is denoted by n.
500 people from the total population of Rajasthan state would be considered a sample.
143 chess players out of the total number of chess players would be considered a sample.
The sample mean is denoted by x̄ (x-bar).
Example: Let us assume the population of India is 10 million, and recent elections were conducted in India between two parties, 'party A' and 'party B'. Researchers want to find which party is winning, so we create a group of a few people, let's say 10,000, from different regions and age groups so that the sample is not biased. Then we ask them whom they voted for, and we get an exit poll. This is what most of our media do during elections, showing stats such as "party A has a 55% chance of winning the elections."
Samples are used when:
The population is too large to collect data from.
The data collected is not reliable.
The population is hypothetical (proposed) and is unlimited in size.
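The difference between the population mean μ and a sample mean x̄ can be seen in a few lines of R; the ages are the hypothetical ones from the example above.
ages <- c(23, 45, 12, 34, 22)    # the whole population (N = 5)
mu <- mean(ages)                 # population mean = 27.2
set.seed(3)
s <- sample(ages, 3)             # a random sample of size n = 3
x_bar <- mean(s)                 # sample mean: an estimate of mu
print(c(mu = mu, x_bar = x_bar))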
Overfitting
Overfitting is a concept in data science which occurs when a statistical model fits exactly against its training data. When this happens, the algorithm unfortunately cannot perform accurately on unseen data, defeating its purpose.
When machine learning algorithms are constructed, they leverage (use) a sample dataset to train the model. However, when the model trains for too long on the sample data or when the model is too complex, it can start to learn the "noise," or irrelevant information, within the dataset.
When the model memorizes the noise and fits too closely to the training set, the model
becomes “overfitted,” and it is unable to generalize well to new data.
If a model cannot generalize well to new data, then it will not be able to perform the
classification or prediction tasks that it was intended for.
If the training data has a low error rate and the test data has a high error rate, it signals
overfitting.
Early stopping: This method seeks to pause training before the model starts learning the noise within the data.
Train with more data: Expanding the training set to include more data can increase the accuracy of the model. This is most effective when clean, relevant data is injected into the model.
Data augmentation: While it is better to inject clean, relevant data into your training data, sometimes noisy data is added to make a model more stable. Data augmentation increases the amount of data by adding slightly modified copies of already existing data, or newly created synthetic data derived from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model. However, this method should be used sparingly (in small quantities).
Regularization: If overfitting occurs when a model is too complex, it makes sense for us to reduce the number of features. But what if we don't know which inputs to eliminate during the feature selection process? If we don't know which features to remove from our model, regularization methods can be particularly helpful, e.g., L1 regularization (Lasso regularization), which shrinks the coefficients of less important features toward zero.
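A small sketch of overfitting on synthetic (hypothetical) data: a degree-12 polynomial memorizes the training points, giving a low training error but a worse test error than a simpler degree-3 fit.
set.seed(1)
x <- runif(60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)   # noisy synthetic data
d <- data.frame(x, y)
train <- 1:45; test <- 46:60
simple  <- lm(y ~ poly(x, 3),  data = d[train, ])
complex <- lm(y ~ poly(x, 12), data = d[train, ])
rmse <- function(m, rows) sqrt(mean((predict(m, d[rows, ]) - d$y[rows])^2))
print(c(simple_train = rmse(simple, train), simple_test = rmse(simple, test)))
print(c(complex_train = rmse(complex, train), complex_test = rmse(complex, test)))  # low train, high test error signals overfitting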
Statistical modeling
The first step in developing a statistical model is gathering data, which may be sourced
from spreadsheets, databases, data lakes, or the cloud.
The most common statistical modeling methods for analyzing this data are categorized
as either supervised learning or unsupervised learning.
Some popular statistical model examples include logistic regression, time-series,
clustering, and decision trees.
Supervised learning techniques include regression models and classification models:
Example: Suppose we want to do weather forecasting; for this, we will use a regression algorithm. In weather prediction, the model is trained on past data, and once the training is completed, it can easily predict the weather for future days. For such a case we need regression analysis, a statistical method used in machine learning and data science. Regression estimates the relationship between the target (dependent) variable and the independent variables.
Linear Regression: Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a straight line to the observed data.
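For instance, a simple linear regression can be fit in base R on the built-in mtcars data:
model <- lm(mpg ~ wt, data = mtcars)                   # mpg as a linear function of car weight
print(summary(model)$coefficients)                     # estimated intercept and slope
print(predict(model, newdata = data.frame(wt = 3.0)))  # predicted mpg for a 3000 lb car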
Decision Tree is a supervised learning algorithm which can be used for solving both classification and regression problems. It can solve problems for both categorical and numerical data.
Decision tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents a result of the test, and each leaf node represents the final decision or result.
A decision tree is constructed starting from the root node (the full dataset), which splits into left and right child nodes (subsets of the dataset). These child nodes are further divided into their own children, themselves becoming the parent nodes of those nodes. For example, a decision tree regression model might predict a person's choice between a sports car and a luxury car (the figure illustrating this is not reproduced here).
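A minimal regression tree sketch using the rpart package (shipped with R); the car-choice dataset from the figure is not reproduced, so the built-in mtcars data stands in.
library(rpart)
fit <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")  # regression tree
print(fit)                                                    # each internal node shows a test on an attribute
print(predict(fit, newdata = data.frame(wt = 2.5, hp = 110)))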
Additional notes:
Classification:
Classification is the process of finding a function which helps divide the dataset into classes based on different parameters. In classification, a computer program is trained on the training dataset, and based on that training, it categorizes the data into different classes.
Example: The best example for understanding the classification problem is email spam detection. The model is trained on millions of emails with different parameters, and whenever it receives a new email, it identifies whether the email is spam or not. If the email is spam, it is moved to the Spam folder.
Types of ML Classification Algorithms:
Logistic Regression
K-Nearest Neighbours
Support Vector Machines
Kernel SVM
Naïve Bayes
Decision Tree Classification
Random Forest Classification
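As one example from the list above, logistic regression for a binary class can be fit in base R with glm(); here the mtcars transmission column (am, coded 0/1) stands in for a spam/not-spam label.
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # am is already 0/1
probs <- predict(model, type = "response")                    # predicted class probabilities
pred_class <- ifelse(probs > 0.5, 1, 0)                       # threshold at 0.5
print(table(predicted = pred_class, actual = mtcars$am))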
Probability
1. Trial and Event: The performance of an experiment is called a trial, and the set of its outcomes is termed an event.
Example: Tossing two coins is a trial; getting at least one head is the event {HT, TH, HH}.
Examples of experiments:
1. Tossing a coin
2. Rolling a die
2. Sample Space: The set of all possible outcomes of an experiment is called the sample space and is denoted by S. For example, for a single coin toss, S = {H, T}; for a die roll, S = {1, 2, 3, 4, 5, 6}.
Probability distribution
Probability distribution is a function that is used to give the probability of all the possible
values that a random variable can take.
A probability distribution is a mathematical function that describes the probability of
different possible values of a variable.
Probability distributions are often depicted using graphs or probability tables.
A probability distribution gives the possibility of each outcome of a random experiment or event. It provides the probabilities of different possible occurrences.
A probability distribution is a statistical function that describes all the possible values
and probabilities for a random variable within a given range.
This range is bounded by the minimum and maximum possible values, but where a given value falls on the probability distribution is determined by a number of factors.
EX 1: A discrete distribution has a range of values that are countable. For example, the ages on birthday cards have a possible range from 0 to 122 (122 being the age of Jeanne Calment, the oldest person who ever lived).
EX 2: Suppose a fair die is rolled and a discrete probability distribution has to be created. The possible outcomes are {1, 2, 3, 4, 5, 6}, so the total number of outcomes is 6, and every number has a fair chance of turning up. This means that the probability of getting any one number is 1/6. Using this, the discrete probability distribution table for a die roll is:
x:        1    2    3    4    5    6
P(X = x): 1/6  1/6  1/6  1/6  1/6  1/6
A continuous distribution has a range of values that are infinite, and therefore uncountable. For example, time is infinite: you could go from 0 seconds to a billion seconds, a trillion seconds, and so on.
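Both kinds of distribution can be explored directly in R, whose d/p/r-prefixed functions give densities, cumulative probabilities, and random draws.
print(table(sample(1:6, 60000, replace = TRUE)) / 60000)  # simulated die rolls: each outcome comes out near 1/6
print(dbinom(2, size = 10, prob = 0.5))   # discrete: P(exactly 2 heads in 10 fair coin tosses)
print(dnorm(0, mean = 0, sd = 1))         # continuous: normal density at 0
print(pnorm(1.96))                        # P(Z <= 1.96), about 0.975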
Basics of R: Introduction
R is a leading tool for machine learning, statistics, and data analysis. Objects, functions, and packages can easily be created in R.
It's a platform-independent language. This means it can be used on all operating systems.
It's an open-source, free language. That means anyone can install it in any organization without purchasing a license.
R is not only a statistics package; it also allows us to integrate with other languages (C, C++). Thus, you can easily interact with many data sources and statistical packages.
The R programming language has a vast community of users, and it's growing day by day.
R is currently one of the most requested programming languages in the data science job market, which makes it one of the hottest trends nowadays.
Statistical Features of R:
Basic Statistics: The most common basic statistics terms are the mean, mode, and
median. These are all known as “Measures of Central Tendency.” So using the R
language we can measure central tendency very easily.
Static graphics: R is rich with facilities for creating and developing interesting static
graphics. R contains functionality for many plot types including graphic maps, mosaic
plots, biplots, and the list goes on.
Probability distributions: Probability distributions play a vital role in statistics and by
using R we can easily handle various types of probability distribution such as Binomial
Distribution, Normal Distribution, Chi-squared Distribution and many more.
Data analysis: It provides a large, coherent and integrated collection of tools for data
analysis.
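A quick sketch of the basic statistics mentioned above (R has no built-in mode function, so the mode is derived from table()):
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
print(mean(x))      # mean: 5
print(median(x))    # median: 4.5
print(as.numeric(names(which.max(table(x)))))  # mode: the most frequent value, 4
print(var(x))       # variance
print(sd(x))        # standard deviation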
Disadvantages of R:
In the R programming language, the standard of some packages is less than perfect. Also, R puts little emphasis on memory management, so an R program may consume all available memory.
In R, there is basically nobody to complain to if something doesn't work.
R programming language is much slower than other programming languages such as
Python and MATLAB.
Applications of R:
We use R for Data Science. It gives us a broad variety of libraries related to statistics. It
also provides the environment for statistical computing and design.
R is used by many quantitative analysts as its programming tool. Thus, it helps in data
importing and cleaning.
R is the most prevalent language. So many data analysts and research programmers use
it. Hence, it is used as a fundamental tool for finance.
Tech giants like Google, Facebook, Bing, Twitter, Accenture, Wipro, and many more are using R nowadays.
Note: "R is an interpreted computer programming language which was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand." The R Development Core Team currently develops R. It is also a software environment used for statistical analysis, graphical representation, reporting, and data modeling.
R Environment Setup
Install R in Windows
Step 1: First, visit the CRAN website (https://cran.r-project.org), click on Download R for Windows, and choose the latest version (e.g., R 3.6.1).
Step 2: When we click on Download R 3.6.1 for Windows, the download of the R setup will start. Once the download is finished, we run the setup of R in the following way:
1) Select the path where we want to install R and proceed to Next.
2) Select all the components which we want to install, and then proceed to Next.
3) In the next step, select either a customized startup or accept the defaults, and then proceed to Next.
4) When we proceed to Next, the installation of R on our system will start.
5) At the end, click on Finish to successfully install R on our system.
Install R in Linux
Step 1: In the first step, we update all the required files in our system using the command:
sudo apt-get update
Step 2: In the second step, we install R in our system with:
sudo apt-get install r-base
Step 3: In the last step, we type R and press Enter to work in the R editor.
RStudio IDE
Installation of RStudio
RStudio Desktop is available for both Windows and Linux. The open-source edition of RStudio Desktop is very simple to install on both operating systems. The licensed version of RStudio has some more features than the open-source one.
Installation on Windows/Linux
On Windows and Linux, it is quite simple to install RStudio, and the process is the same on both operating systems. The following steps install RStudio on Windows/Linux:
Step 1: In the first step, we visit the RStudio official site and click on Download RStudio.
Step 2: In the next step, we select the RStudio Desktop open-source license and click on Download.
Step 3: In the next step, we select the appropriate installer. When we select the installer, the download of the RStudio setup will start.
Step 4: In the next step, we run our setup in the following way:
1) Click on Next.
2) Click on Install.
3) Click on Finish.
Features of R programming
R is a domain-specific programming language which aims at data analysis. It has some unique features which make it very powerful, the most important arguably being the notion of vectors. Vectors allow us to perform a complex operation on a set of values in a single command. R has the following features:
1. It is a simple and effective programming language which has been well developed.
2. It is a well-designed, easy, and effective language which has the concepts of user-defined functions, looping, conditionals, and various I/O facilities.
3. It has a consistent and integrated set of tools which are used for data analysis.
4. For different types of calculations on arrays, lists, and vectors, R contains a suite of operators.
History of R Programming
The history of R goes back about 20-30 years. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and the R Development Core Team currently develops it. The language's name is taken from the first names of both developers. The first project was considered in 1992. The initial version was released in 1995, and a stable beta version was released in 2000.
Why use R Programming?
There are several tools available in the market to perform data analysis, and learning new languages is time-consuming. A data scientist can use two excellent tools, i.e., R and Python. We may not have time to learn them both when we are getting started with data science. Learning statistical modeling and algorithms is more important than learning a programming language, since a programming language is only used to compute and communicate our discoveries.
The important task in data science is the way we deal with data: import, cleaning, feature engineering, and feature selection. This should be our primary focus. The data scientist's job is to understand the data, manipulate it, and expose the best approach. For machine learning, the best algorithms can be implemented with R; Keras and TensorFlow allow us to create high-end machine learning techniques. R also has a package for XGBoost, one of the best-performing algorithms in Kaggle competitions.
R communicates with other languages and can call Python, Java, or C++. The big data world is also accessible to R: we can connect R with big data systems like Spark or Hadoop. In brief, R is a great tool to investigate and explore data. Elaborate analyses such as clustering, correlation, and data reduction are done with R.
Applications of R
Several applications use R in real time. Some of the popular ones are as follows:
Facebook
Google
Twitter
HRDAG
Sunlight Foundation
RealClimate
NDAA
XBOX ONE
ANZ
FDA
Syntax of R Programming
R Command Prompt
To work at the R command prompt, we must already have the R environment installed in our system. After the installation of the R environment, we can easily start the R command prompt by typing R in our Windows command prompt. When we press Enter after typing R, it launches the R interpreter, and we get a prompt at which we can write our program.
"Hello, World!" Program
The code for "Hello, World!" in R programming can be written as:
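print("Hello, World!")
Output
[1] "Hello, World!"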
A variable can store different types of values, such as numbers, characters, etc. These different types of data that we can use in our code are called data types.
For example,
x <- 123L
Here, 123L is an integer data. So the data type of the variable x is integer. We can verify
this by printing the class of x.
x <- 123L
# print value of x
print(x)
# print type of x
print(class(x))
Output
[1] 123
[1] "integer"
Here, x is a variable of data type integer.
Different Types of Data Types
In R, there are 6 basic data types: logical, numeric, integer, complex, character, and raw.
1. Logical Data Type
The logical data type has only two possible values: TRUE and FALSE. For example,
bool1 <- TRUE
print(bool1)
print(class(bool1))
bool2 <- FALSE
print(bool2)
print(class(bool2))
Output
[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"
In the above example, bool1 has the value TRUE and bool2 has the value FALSE. Here, we get "logical" when we check the type of both variables.
Note: You can also define logical variables with a single letter - T for TRUE or F for FALSE. For example,
is_weekend <- F
print(class(is_weekend)) # "logical"
2. Numeric Data Type
The numeric data type represents numbers with or without decimal values. For example,
weight <- 63.5
print(weight)
print(class(weight))
height <- 182
print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"
Here, both weight and height are variables of numeric type.
3. Integer Data Type
The integer data type specifies whole-number values without decimal points. We use the suffix L to specify integer data.
For example,
integer_variable <- 186L
print(class(integer_variable))
Output
[1] "integer"
Here, 186L is integer data, so we get "integer" when we print the class of integer_variable.
4. Complex Data Type
The complex data type is used to specify values with an imaginary component. For example,
z <- 3 + 2i
print(class(z))
Output
[1] "complex"
5. Character Data Type
The character data type is used to specify character or string values in a variable. For example,
fruit <- "Apple"
my_char <- "A"
print(class(fruit))
print(class(my_char))
Output
[1] "character"
[1] "character"
Here, both the variables - fruit and my_char - are of the character data type.
6. Raw Data Type
The raw data type stores data as raw bytes. The charToRaw() and rawToChar() functions convert between character and raw data. For example,
raw_variable <- charToRaw("Welcome to Programiz")
print(raw_variable)
print(class(raw_variable))
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"
We have first used the charToRaw() function to convert the string "Welcome to Programiz" to raw bytes. This is why we get "raw" as output when we print the class of raw_variable.
Then, we have used the rawToChar() function to convert the data in raw_variable back to character form. This is why we get "character" as output when we print the class of char_variable.
Basic Programs
In R, the readline() method takes input in string format. If one inputs an integer, it is read in as a string; let's say one wants to input 255, then it is read in as "255", like a string. So one needs to convert that inputted value to the format one needs. In this case, the string "255" is converted to the integer 255. To convert an inputted value to the desired data type, R provides functions such as:
as.integer(n) converts to an integer
as.numeric(n) converts to a numeric type (float, double, etc.)
as.complex(n) converts to a complex number (e.g., 3+2i)
as.Date(n) converts to a date, and so on
Syntax:
var <- readline()
var <- as.integer(var)
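Putting the pieces together, a small sketch that reads a number from the console and converts it before use (run interactively, since readline() waits for keyboard input):
n <- readline(prompt = "Enter an integer: ")  # read as a string, e.g. "255"
n <- as.integer(n)                            # convert "255" -> 255
print(n + 1)                                  # now arithmetic works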