Data science is an amalgamation of different scientific methods, algorithms and systems which enable us to gain insights and derive knowledge from data in various forms. Various organizations
like Google, Facebook, Uber, Netflix, etc. are already leveraging data science to provide better
experiences to their end users.
Although data science techniques have been conceptualized and in use for several decades, the current demand for data science is fueled by the high availability of digital data and computational resources.
This course serves as an introduction to various Data Science concepts such as Probability,
Statistics, Linear Algebra, Machine Learning and Computer Science. At the end of this course, you
should be familiar with the key ideas behind these concepts.
In addition to these, we must acquire knowledge about the domain or industry vertical in which we plan to apply Data Science, such as retail, banking & finance, healthcare, e-commerce, life sciences, telecom, etc.
What is Probability?
Probability is a mathematical subject which enables us to determine or predict how likely it is that an event will happen. The probability of occurrence is assigned a value from 0 to 1. When the value assigned is 1, it implies that the event will happen with complete certainty. On the other hand, when it is 0, it implies that the event will not take place. Thus, we can be more certain of an event's occurrence when its probability is higher.
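As a minimal illustration (the die-rolling scenario and the number of simulated rolls are our own, not from the course material), the probability of rolling a six with a fair die is 1/6, and a simulation should produce an estimate close to that value:

import random

# Theoretical probability of rolling a six with a fair die is 1/6 (a value between 0 and 1).
rolls = 100_000
sixes = sum(1 for _ in range(rolls) if random.randint(1, 6) == 6)

print(1 / 6)          # theoretical probability
print(sixes / rolls)  # estimated probability from the simulation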
What is Statistics?
Statistics is another mathematical subject which deals primarily with data. It helps us draw
inferences from data by having procedures in place for collecting, classifying and presenting the
data in an organized manner. The analysis and interpretation of the refined data provide further insights.
Since both statistics and probability have their roots in mathematics, computation is needed as a tool to perform quantitative analysis. The use of computers is also necessary to perform complex calculations while processing statistical data.
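For instance, a few basic summary statistics can be computed with Python's standard library; the sales figures below are invented purely for illustration:

import statistics

# Hypothetical monthly sales figures (in thousands of units).
sales = [23, 19, 31, 25, 22, 28, 35, 27, 21, 30]

print(statistics.mean(sales))    # central tendency of the data
print(statistics.median(sales))  # middle value after ordering the data
print(statistics.stdev(sales))   # how spread out the data points are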
Linear Algebra works as a computational engine for most data science problems because of its performance advantages over iterative methods. Let us discuss a simple example to understand the difference between the two methods.
Problem statement
● We need to transmit a message over the network: “PREPARE to NEGOTIATE”.
● When transmitting, we need to encrypt the message and at the receiving end, we need to
decrypt the message.
● To encrypt and decrypt, we need to use a confidential piece of information, usually referred
to as a key.
● The prime objective is to ensure confidentiality and privacy of data during transmission.
Solution
Step 1: The message is encrypted by assigning a number for each letter in the message. Thus, the
message becomes:
Step 4: At the receiving end, the message is decrypted by multiplying this matrix with the inverse of
the encoding matrix. The inverse of the encoding matrix is:
Step 5: After multiplication, we will get back the original enumerated matrix. The original message
can now be decoded from this matrix.
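Since the actual encoding matrix and the intermediate matrices are not reproduced above, the following sketch uses an invertible 2 x 2 matrix of our own choosing to show the same encrypt-and-decrypt idea with matrix multiplication:

import numpy as np

message = "PREPARE TO NEGOTIATE"

# Step 1: assign a number to each letter (A=1 ... Z=26, space=0).
numbers = [0 if ch == " " else ord(ch) - ord("A") + 1 for ch in message]

# Steps 2-3 (assumed): arrange the numbers into a 2-row matrix (padding with
# zeros if needed) and encrypt it by multiplying with an invertible key matrix.
if len(numbers) % 2:
    numbers.append(0)
plain = np.array(numbers).reshape(-1, 2).T
encode = np.array([[2, 1],
                   [1, 1]])        # illustrative key matrix, determinant 1
cipher = encode @ plain

# Step 4: decrypt by multiplying with the inverse of the encoding matrix.
recovered = np.rint(np.linalg.inv(encode) @ cipher).astype(int)

# Step 5: map the numbers back to letters to decode the original message.
decoded = "".join(" " if n == 0 else chr(n + ord("A") - 1)
                  for n in recovered.T.flatten())
print(decoded)   # PREPARE TO NEGOTIATE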
Problem statement
Currents I1, I2 and I3 need to be determined for the following electrical network:
Solution
Step 1: The equations for current are written based on Kirchhoff’s Law.
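The circuit diagram and its Kirchhoff equations are not reproduced here, so the coefficients below are purely illustrative; the point is that the resulting simultaneous equations form a linear system that can be solved in one shot:

import numpy as np

# Assumed equations (for illustration only):
#   I1 - I2 - I3 = 0
#   2*I1 + 3*I2  = 12
#   3*I2 - 4*I3  = 0
A = np.array([[1, -1, -1],
              [2,  3,  0],
              [0,  3, -4]], dtype=float)
b = np.array([0, 12, 0], dtype=float)

I1, I2, I3 = np.linalg.solve(A, b)   # solve A x = b for the three currents
print(I1, I2, I3)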
How can we use these relationships to extract more information about them and predict their
proposed activities?
Solution
Step 1: These relationships can be converted into a relationship chart in which “1” indicates related
and “0” indicates unrelated:
Step 2: From the chart created in the previous step, the adjacency matrix for the directed graph is:
Step 3: The adjacency matrix may be used as a data structure for representing graphs in computer
programs for manipulation.
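A small sketch of how such an adjacency matrix can be built and used in a program follows; the four people and their directed relations are assumed for illustration, since the original relationship chart is not reproduced here:

import numpy as np

people = ["A", "B", "C", "D"]
relations = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")]   # assumed edges

index = {name: i for i, name in enumerate(people)}
adjacency = np.zeros((len(people), len(people)), dtype=int)
for src, dst in relations:
    adjacency[index[src], index[dst]] = 1      # "1" = related, "0" = unrelated

print(adjacency)

# Powers of the adjacency matrix count walks of a given length - one way to
# extract more information about indirect connections between the people.
print(np.linalg.matrix_power(adjacency, 2))    # two-step connections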
● Linear Algebra makes scientific computing easy, as most complex equations can be converted into linear equations with the help of vectors and matrices, where vectors can be viewed as single-dimensional matrices.
● Linear Algebra helps represent large sets of data as matrices, enabling us to better visualize the given data.
● All the operations/processes performed on matrices are batch processes. This means we can process millions of data points simultaneously instead of processing each data point individually.
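As a small illustration of this batch behavior (the data and weights below are made up), one vectorized matrix operation scores a million data points at once:

import numpy as np

data = np.random.rand(1_000_000, 3)    # one million 3-feature data points
weights = np.array([0.2, 0.5, 0.3])

scores = data @ weights                # a single matrix-vector product
print(scores.shape)                    # (1000000,) - every point processed together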
● Machine Learning is the field of scientific study that concentrates on induction algorithms
and on other algorithms that can be said to "learn". (Ref. Stanford glossary of terms)
● A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E. (Ref. Tom M. Mitchell)
● Analysis of data by a human being involves huge cost, time and effort. Note that we are talking about data that is huge in volume, comes in a lot of variety and arrives with high velocity.
● Human intervention is not sustainable (e.g. If we want to navigate on Mars and we don’t have
the expertise available, we can make a machine learn and let it navigate on unknown territory
without any human intervention).
● Human expertise cannot always be explained (e.g. speech recognition, image processing).
● A solution needs to be adapted to a particular case (e.g. user biometrics).
In our example, out of eight test images, the machine was able to classify 6 images correctly and 2 images incorrectly. Hence, the accuracy of this supervised machine learning model is 6/8, i.e. 75%.
● In the first step, we train the machine with known data so that it learns something from it.
● In the second step we expect the machine to utilize the knowledge it gained in the previous
step and classify a new unknown data point.
● In the third step, the model is evaluated on the basis of how accurately it has classified the
unknown data.
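The three steps above can be sketched with scikit-learn; the digits data set and the k-nearest-neighbors classifier are our own choices for illustration, not the image data used in the course example:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)                   # labelled images
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier()
model.fit(X_train, y_train)                 # step 1: train on known data
predictions = model.predict(X_test)         # step 2: classify unseen data
print(accuracy_score(y_test, predictions))  # step 3: evaluate the accuracy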
For example, assume that a company wants to predict whether the budget period of a new project it has acquired will be 'short-term' or 'long-term', based on various input attributes of the project such as the number of resources required, software requirements, hardware requirements, etc. We will need to use the classification machine learning technique here.
For example, if we are trying to predict the approximate budget requirement of a new project that the company has acquired, as an actual quantifiable figure, based on various input attributes of the project such as the number of resources required, software requirements, hardware requirements, etc., then we use the regression technique.
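A minimal regression sketch follows; the project attributes and budget figures below are invented purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: [resources required, software licences, hardware units];
# target: actual budget spent (in lakhs) on past projects.
X = np.array([[ 5, 2, 10],
              [10, 4, 25],
              [ 3, 1,  8],
              [ 8, 3, 20]])
y = np.array([12.0, 30.5, 8.0, 24.0])

model = LinearRegression().fit(X, y)

new_project = np.array([[6, 2, 15]])
print(model.predict(new_project))   # estimated budget as a quantifiable figure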
Reinforcement learning
Reinforcement learning is a reward-based technique with immediate feedback. Here, the machine's goal is to maximize the numerical reward at each and every step. In the process of learning, the machine is not provided any supervision, unlike the ML algorithms we have discussed so far. Instead, the machine is expected to figure out, all on its own and without any interference, the optimum actions that will reap the maximum reward at each step.
The actions that the machine takes at each step might not only affect the immediate reward but may also affect all the subsequent rewards. The ultimate aim is to reach the maximum possible reward in the fewest possible steps. Thus, a trial-and-error search methodology and immediate feedback in the form of a numerical reward are the two main characteristics of reinforcement learning.
An example of reinforcement learning is a machine learning to play chess: it decides whether a move is right by planning the possible moves, anticipating the corresponding counter-moves, and finally choosing one based on the reward associated with a particular position or set of moves. Another example is a trash-collecting bot whose charge is about to reach a critical level and which must decide whether to clean one more room before heading to the charging station or to rush to the nearest charging station immediately. The decision taken by the bot depends on the ease with which it can reach the charging station, based on its prior knowledge.
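A toy sketch of these two characteristics (trial-and-error search plus an immediate numerical reward) is an epsilon-greedy agent choosing among three actions; the reward probabilities below are invented for illustration:

import random

true_reward_probability = [0.2, 0.5, 0.8]   # unknown to the agent
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                               # how often the agent explores

for step in range(10_000):
    # Trial and error: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: estimates[a])

    # Immediate feedback: a numerical reward of 1 or 0.
    reward = 1 if random.random() < true_reward_probability[action] else 0

    # Update the running average reward estimate for the chosen action.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # should approach [0.2, 0.5, 0.8]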
Some famous machine learning algorithms
We shall focus on supervised and unsupervised learning algorithms in the forthcoming courses.
Computer Science provides us with the necessary programming languages, database management
systems, statistical analysis and machine learning tools.
● Why is the project being started? What is missing currently and what exactly is required?
● What are they currently doing to fix the problem, and why isn’t it working?
● What resources are needed? What kind of data is available? Is domain expertise available within the team? What computational resources are available or required?
● How does the business organization plan to deploy the derived results? What kind of
problems need to be addressed for successful deployment?
● Classification: Determining which among the given categories a data point falls under
● Scoring: Predicting or estimating a quantifiable value
● Ranking: Ordering the data points depending on the priorities involved
● Clustering: Grouping similar items based on certain parameters
● Finding relations: Finding associations between various features of the data
● Characterization: Creating plots, graphs and various reports for understanding the data
better
If the answer to either of the above questions is NO, we need to revisit the previous steps.
● Present the details of the model to all the collaborators, clients and sponsors.
● Provide everyone in charge of usage and maintenance of the model, once deployed, with
documentation that covers all aspects of the working of the model.
We must keep in mind that each group of people involved in the project requires a different kind of treatment when it comes to presentations and documentation. Hence, specific data visualization techniques must be used for each of them. What works for one audience might not work for another.
Churn Prediction
Churn implies loss of customers to competition. For any company, it costs more to acquire new
customers than to retain the old ones. As churn prediction aids in customer retention, it is extremely
important especially for businesses with a repeat customer base. The application of this model cuts
across domains such as Banking, E-Retail, Telecom, Energy and Utilities.
Sentiment Analysis
Also referred to as opinion mining, it is the process of computationally identifying what customers
like and dislike about a product or a brand. A domain which relentlessly makes use of sentiment
analysis is the Retail industry. Companies like Amazon, Flipkart, Reliance, Paytm use customer
feedback from social networking sites like Facebook, Twitter, etc. or their own company websites to
find out what their customers are talking about and how they feel i.e. positive, negative, or neutral.
They leverage this information to reposition their products and provide better/new services.
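A bare-bones sketch of the idea (the reviews and labels below are invented, and real systems are far more sophisticated) classifies short pieces of feedback as positive or negative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["great product, loved it", "terrible quality, very disappointed",
           "excellent service", "bad experience, will not buy again"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)          # bag-of-words features

classifier = MultinomialNB().fit(X, labels)

new_feedback = ["loved the fast service"]
print(classifier.predict(vectorizer.transform(new_feedback)))   # ['positive']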
Online Advertisement
The incremental growth in the complexity of the advertising industry is due to the ease of access to the internet via a wide variety of devices around the world. This gives advertisers an opportunity to study user preferences and online trends. The insights offered through these analyses translate into actionable items on issues and opportunities such as reducing ad-blindness or optimizing cost-per-action (CPA) and click-through rates (CTR).
Recommendations
Many companies like Amazon, Netflix, Spotify, Best Buy and YouTube, among many others, use recommender systems to improve a customer's shopping experience. This offers the companies a chance to gather information on customers' preferences, purchases and other browsing patterns, which lends insights that can amplify their return on investment.
News Aggregation
A news aggregator gathers and clusters stories on the same topic from several leading news websites, and also traces the original source of a news item and how the story has developed. It has a special interactive timeline that allows the reader to flip swiftly between headlines, refining their search by country or by specific news sites. Notable examples include Google News, Reddit, Flipboard and Pulse.
Scalability
Scalability refers to an enterprise's ability to handle increased demand. In the corporate environment, a scalable company is one that can maintain or improve profit margins while sales volume increases. Many a time, the process is slowed down by human intervention in decision making. For example, the credit operations teams in banks invest substantial time in assessing the creditworthiness of a client. Client management teams take a long time to suggest the right product or alternatives to the customer. The client help desk takes a long time to provide the desired information to the client. If these processes can be automated, the business can scale up. Data science helps build systems such as recommender systems and chatbots to achieve scalability.
A few more platforms that use content discovery algorithms are Facebook and YouTube. The content that appears in an individual's Facebook news feed and the videos that appear in the "Recommended for You" section of a YouTube user's account are both tailored according to each user's past behavior and personal preferences.
Intelligent Learning
Intelligent learning has become a part of our day-to-day lives in various forms. For example, Google Maps uses anonymized location data from various smart devices to predict the flow of traffic in real time. It also utilizes user-submitted reports on incidents that might affect traffic, such as road construction and accidents, to help suggest the fastest travel routes to users.
Another example would be ride-sharing apps like Uber and Ola. They optimize the ride experience not only by minimizing the ride time but also by matching users with other passengers so that shared rides involve the fewest detours.
Personalized Medicine
In many cases, the success of a particular treatment for a patient's condition cannot be predicted beforehand. Thus, many medical practitioners follow a non-optimal trial-and-error approach.
In personalized medicine, a doctor studies an individual's genes, environment and lifestyle. This helps tailor treatments for specific medical conditions, as opposed to a trial-and-error approach. It also enables pharmaceutical researchers to create combination drugs targeting a specific genomic profile, which in turn increases safety and efficacy.
Companies that are active in the field of personalized medicine are Roche, Novartis, Johnson &
Johnson among others.
Data Science solutions are offered as products by various vendors in the market today. A few of the popular vendors are as follows:
We can observe that in both the Data Analytics and Machine Learning fields, IBM Watson emerges as the top player.
Platform comparison
A consolidated view of the platform comparison based on Data Science technologies is shown below.