DSV Sem Exam

1) Difference Between Data Science and Data Analytics
Data Science: Data Science is a field that deals with extracting meaningful information and insights from structured and unstructured data by applying various algorithms, processes, and scientific methods. The field is closely related to big data and is one of the most in-demand skills today. Data science draws on mathematics, computation, statistics, programming, and related disciplines to gain meaningful insights from the large amounts of data provided in various formats.
Data Analytics: Data Analytics is used to draw conclusions by processing raw data. It is helpful to businesses because it allows a company to make decisions based on the conclusions drawn from the data. In essence, data analytics converts large volumes of figures into plain-language conclusions that support decision-making.
Feature | Data Science | Data Analytics
Coding Language | Python is the most commonly used language for data science, along with other languages such as C++, Java, and Perl. | Knowledge of Python and the R language is essential for data analytics.
Programming Skills | In-depth knowledge of programming is required. | Basic programming skills are sufficient.
Use of Machine Learning | Makes use of machine learning algorithms to gain insights. | Generally does not make use of machine learning.
Other Skills | Makes use of data mining activities to extract meaningful insights. | Hadoop-based analysis is used to draw conclusions from raw data.
Scope | The scope of data science is large (macro). | The scope of data analytics is micro, i.e., small.
Goals | Deals with exploration and new innovation. | Makes use of existing resources.
Data Type | Mostly deals with unstructured data. | Deals with structured data.
Statistical Skills | Statistical skills are necessary in the field of data science. | Statistical skills are of minimal or no use in data analytics.

2) Predictive analysis in data science

Predictive analysis is a subset of data science that involves using historical data and
statistical algorithms to make predictions about future events or behaviors. It is the
process of using data, statistical algorithms, and machine learning techniques to
identify the likelihood of future outcomes based on historical data. The goal of
predictive analysis is to forecast future trends and behaviors, allowing organizations
to make better decisions based on data-driven insights.

Predictive analysis is widely used in business, healthcare, marketing, finance, and many other fields. In business, for example, predictive analysis can be used to
forecast sales, identify potential customers, and optimize pricing strategies. In
healthcare, it can be used to identify patients at high risk for certain conditions and
develop targeted interventions to prevent them.

The predictive analysis process typically involves several steps, including data
collection and preparation, exploratory data analysis, feature engineering, model
selection, model training, and model evaluation. The data used in predictive analysis
can come from a variety of sources, including structured data (e.g., databases) and
unstructured data (e.g., social media).
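To make these steps concrete, here is a minimal, purely illustrative sketch in Python with pandas and scikit-learn; the file name "customers.csv", its column names, and the choice of logistic regression are assumptions for the example, not part of any particular project.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection and preparation ("customers.csv" and its columns are hypothetical)
df = pd.read_csv("customers.csv").dropna()

# Feature selection: predictors and the outcome to forecast
X = df[["age", "monthly_spend", "tenure_months"]]
y = df["churned"]  # 1 = customer left, 0 = customer stayed

# Model training and evaluation on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))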

One of the key benefits of predictive analysis is its ability to identify patterns and
relationships in large and complex datasets that might otherwise be difficult to
discern. Machine learning algorithms can identify hidden patterns and correlations
that can be used to make accurate predictions. However, it's important to note that
predictive analysis is not a panacea and is only as good as the quality of the data
used and the accuracy of the models developed.

Overall, predictive analysis is a powerful tool for businesses and organizations looking to gain insights and make data-driven decisions. As more data becomes
available and machine learning algorithms become more sophisticated, predictive
analysis will likely play an increasingly important role in a wide range of fields.
3) OBJECTIVE OF PRIOR KNOWLEDGE IN DATA SCIENCE

Prior knowledge is essential in data science as it forms the foundation upon which data
scientists build their models, analyses and predictions. It involves a range of skills, including
statistical analysis, programming, data visualization, and domain expertise. Having prior
knowledge helps data scientists to make sense of the data they are analyzing, and to identify
relevant patterns and insights that can be used to make informed decisions.

One of the key objectives of prior knowledge in data science is to enable data scientists to
ask the right questions when approaching a dataset. This includes having an understanding
of the underlying business problem or research question that the data is being used to
address, as well as an awareness of the key variables and factors that may influence the
outcome. This knowledge helps data scientists to design appropriate models and statistical
analyses that are tailored to the specific requirements of the project.

Another objective of prior knowledge in data science is to enable data scientists to interpret
the results of their analyses accurately. This involves having a solid understanding of
statistical concepts such as p-values, hypothesis testing, and confidence intervals. It also
requires knowledge of data visualization techniques and the ability to communicate results
effectively to stakeholders who may not have the same level of technical expertise.
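As a small, invented illustration of the statistical concepts mentioned above, a two-sample t-test and a confidence interval can be computed with SciPy (the numbers below are made up):

import numpy as np
from scipy import stats

group_a = np.array([12.1, 13.4, 11.8, 14.2, 12.9])  # invented measurements, variant A
group_b = np.array([10.9, 11.5, 12.0, 10.4, 11.1])  # invented measurements, variant B

# Two-sample t-test: the p-value quantifies how surprising the observed difference
# would be if the two groups actually had the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# 95% confidence interval for the mean of group A
ci = stats.t.interval(0.95, df=len(group_a) - 1,
                      loc=group_a.mean(), scale=stats.sem(group_a))
print("95% CI for the mean of group A:", ci)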

Finally, prior knowledge in data science is essential for identifying potential sources of bias in
the data and ensuring that analyses are conducted in an ethical and responsible manner. This
includes an understanding of issues such as data privacy, data security, and the ethical
implications of using data to make decisions that may impact people's lives.

Overall, prior knowledge plays a critical role in data science, enabling data scientists to
approach problems with a deep understanding of the underlying domain and the technical
skills required to analyze data effectively. By leveraging prior knowledge, data scientists can
produce insights and predictions that are both accurate and meaningful, driving informed
decision-making and ultimately leading to improved outcomes for businesses and society as
a whole.
4) DATA EXPLORATION IN DATA SCIENCE IN 300 WORDS
Data exploration is a critical step in the data science process that involves examining
and understanding the characteristics and structure of a dataset. It is an iterative
process that involves a combination of statistical analysis, data visualization, and
domain expertise to uncover patterns, trends, and relationships in the data.

The main objective of data exploration is to gain insights into the data and identify
any potential issues or anomalies that may impact the accuracy or reliability of
subsequent analyses. It helps data scientists to understand the distribution of the
data, detect outliers or missing values, and identify any correlations or dependencies
among variables.

There are several techniques that data scientists use to explore data, including
statistical summaries, histograms, scatterplots, and heatmaps. These techniques can
be used to identify trends and patterns in the data, detect relationships between
variables, and visualize the distribution of the data.
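A minimal exploration pass using pandas, Matplotlib, and seaborn might look like the sketch below; the file name and column names are placeholders.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("sales.csv")                 # hypothetical dataset

print(df.describe())                          # statistical summary of numeric columns
print(df.isna().sum())                        # missing values per column

df["revenue"].hist(bins=30)                   # histogram: distribution of one variable
plt.show()

df.plot.scatter(x="ad_spend", y="revenue")    # scatterplot: relationship between two variables
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True)  # heatmap: correlations among variables
plt.show()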

In addition to identifying patterns and trends, data exploration is also useful for
identifying potential sources of bias in the data. This is especially important when
working with datasets that may be subject to sampling bias, selection bias, or other
forms of bias that can impact the accuracy and reliability of analyses.

Another key objective of data exploration is to inform the selection of appropriate modeling techniques. By understanding the characteristics and structure of the data,
data scientists can choose the most appropriate statistical models and machine
learning algorithms to analyze the data and make accurate predictions.

Overall, data exploration is a critical step in the data science process that enables
data scientists to gain insights into the data and identify potential issues or
anomalies that may impact the accuracy or reliability of subsequent analyses. By
leveraging statistical analysis, data visualization, and domain expertise, data scientists
can uncover patterns and trends in the data and make informed decisions about how
best to analyze and model the data to achieve their objectives.
5) TECHNIQUES FOR SAMPLING IN DATA SCIENCE

Sampling is a critical technique in data science that involves selecting a subset of data from a
larger population in order to estimate characteristics of the population. This technique is used
when it is not feasible or practical to analyze the entire population, and instead, a smaller sample
can be analyzed to draw inferences about the larger population.

There are several techniques for sampling in data science, including simple random sampling,
stratified sampling, cluster sampling, and systematic sampling.

Simple random sampling is the most straightforward technique, where each individual in the
population has an equal probability of being selected for the sample. This technique is useful
when the population is homogenous and there is no significant variation in the characteristics of
the individuals in the population.

Stratified sampling involves dividing the population into subgroups or strata based on specific
characteristics, such as age, gender, or income level. Samples are then taken from each subgroup
in proportion to its size in the population. This technique is useful when there is significant
variation within the population, and it ensures that the sample is representative of the entire
population.

Cluster sampling involves dividing the population into clusters based on geographic location or
other characteristics, and then randomly selecting clusters to include in the sample. This
technique is useful when the population is spread out over a large area, and it can be more cost-
effective than simple random sampling.

Systematic sampling involves selecting every nth individual from the population to include in the
sample. This technique is useful when the population is large and it is not feasible to select a
sample using other techniques.
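The sketch below shows one way the simple random, stratified, and systematic approaches could be expressed with pandas; the DataFrame, its "region" column, and the 10% sampling fraction are assumptions made only for illustration.

import pandas as pd

df = pd.read_csv("population.csv")   # placeholder for the full population

# Simple random sampling: 10% of rows, each row equally likely to be chosen
simple = df.sample(frac=0.10, random_state=42)

# Stratified sampling: 10% from each stratum (here, each region), proportional to its size
stratified = df.groupby("region", group_keys=False).sample(frac=0.10, random_state=42)

# Systematic sampling: every 10th row after a random starting offset
start = 3
systematic = df.iloc[start::10]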

In addition to these techniques, data scientists may also use other methods such as convenience
sampling or snowball sampling, although these methods may not always result in a
representative sample.

Overall, sampling is an important technique in data science that enables data scientists to analyze
a smaller subset of data in order to draw inferences about a larger population. By using
appropriate sampling techniques, data scientists can ensure that their analyses are accurate,
reliable, and representative of the population of interest.
6) TYPES OF DATASETS IN DATA SCIENCE
In data science, there are various types of datasets that are used for analysis,
modeling, and prediction. Here are some of the common types of datasets:

1. Cross-sectional datasets: These datasets are collected at a single point in time, and they provide a snapshot of the characteristics of a population at that
particular moment. Examples include survey data, census data, and data
collected from a single experiment.
2. Time series datasets: These datasets are collected over time, and they track
changes in a particular variable or set of variables over a specified period.
Examples include stock prices, weather data, and economic indicators.
3. Longitudinal datasets: These datasets are collected over a longer period of
time, and they track changes in a population or individual over time. Examples
include medical records, educational records, and longitudinal studies.
4. Panel datasets: These datasets are similar to longitudinal datasets, but they
involve a fixed set of individuals or units that are measured at multiple points
in time. Panel datasets are useful for studying changes in behavior or attitudes
over time.
5. Spatial datasets: These datasets include geographic or spatial data, such as
maps, satellite images, and GPS data. Spatial datasets are useful for studying
patterns and relationships across different regions or areas.
6. Text datasets: These datasets include unstructured data such as text, audio,
and video. Text datasets are often analyzed using natural language processing
techniques to extract insights and information from the data.
7. Image datasets: These datasets include visual data such as photographs, X-
rays, and scans. Image datasets are often analyzed using computer vision
techniques to identify patterns and relationships in the data.

Overall, the type of dataset used in data science depends on the research question
and the goals of the analysis. Data scientists must choose the appropriate type of
dataset based on the research question and the data available, and then use
appropriate statistical techniques to analyze the data and draw meaningful insights
and conclusions.
7) techniques for sampling in data science

Sampling is a critical aspect of data science that involves selecting a representative subset of data from a larger population. Here are some
common techniques for sampling in data science:

1. Simple random sampling: This involves randomly selecting data points from the entire population. It is commonly used when the
population is homogeneous and there is no prior knowledge about its
distribution.
2. Stratified sampling: This involves dividing the population into
distinct strata based on some characteristic and then randomly
selecting data points from each stratum. It is commonly used when
the population is heterogeneous and there are known differences
in the distribution of the characteristic of interest.
3. Cluster sampling: This involves dividing the population into clusters
and then randomly selecting clusters to sample from. It is
commonly used when it is impractical or expensive to sample
individual data points.
4. Systematic sampling: This involves selecting data points from the
population at regular intervals. It is commonly used when the
population is large and ordered in some way.
5. Oversampling and undersampling: These techniques involve
intentionally biasing the sample towards certain characteristics by
oversampling or undersampling data points with those
characteristics. They are commonly used when certain
characteristics of the population are of particular interest.
6. Adaptive sampling: This involves using statistical models to
iteratively refine the sample based on previous observations. It is
commonly used in situations where the population is dynamic or
the distribution of the characteristic of interest is unknown.

Overall, the choice of sampling technique will depend on the specific characteristics of the population being studied and the research
questions being asked.
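As a hedged illustration of technique 5 above, minority-class records can be oversampled (or majority-class records undersampled) with plain pandas; the file and the "label" column are hypothetical.

import pandas as pd

df = pd.read_csv("transactions.csv")     # placeholder dataset with an imbalanced label
minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Oversampling: draw minority rows with replacement until they match the majority count
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])

# Undersampling: keep only as many majority rows as there are minority rows
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=42),
    minority,
])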
8) UNIVARIATE, BIVARIATE, AND MULTIVARIATE EXPLORATION in data science

In data science, univariate exploration, bivariate exploration, and multivariate exploration are different approaches to analyzing and
understanding the relationship between variables in a dataset.

1. Univariate exploration: This approach focuses on analyzing and understanding a single variable in a dataset. Univariate analysis
typically involves calculating basic statistics such as mean, median,
and standard deviation, as well as visualizing the data using
histograms, box plots, or other graphical representations.
Univariate analysis is used to gain insights into the distribution,
range, and outliers of a single variable.
2. Bivariate exploration: This approach involves analyzing the
relationship between two variables in a dataset. Bivariate analysis
typically involves calculating basic statistics such as correlation
coefficients and visualizing the data using scatter plots or other
graphical representations. Bivariate analysis is used to identify
patterns and trends between two variables and to understand how
they are related.
3. Multivariate exploration: This approach involves analyzing the
relationship between three or more variables in a dataset.
Multivariate analysis typically involves using statistical models such
as regression analysis or principal component analysis to identify
patterns and trends between multiple variables. Multivariate
analysis is used to identify complex relationships between multiple
variables and to gain insights into how they interact with each
other.

Overall, the choice of exploration technique will depend on the specific research questions being asked and the nature of the dataset being
analyzed. Univariate exploration is often used as a starting point to gain
a basic understanding of the data, while bivariate and multivariate
exploration are used to gain deeper insights into the relationships
between variables.
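A compact sketch of the three levels of exploration, assuming a pandas DataFrame with numeric columns (the file and column names are invented):

import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("measurements.csv")      # hypothetical dataset

# Univariate: distribution of a single variable
print(df["height"].describe())            # mean, standard deviation, quartiles, range

# Bivariate: relationship between two variables
print(df["height"].corr(df["weight"]))    # Pearson correlation coefficient

# Multivariate: structure across many variables at once
numeric = df.select_dtypes("number").dropna()
components = PCA(n_components=2).fit_transform(numeric)
print(components[:5])                     # first rows projected onto two principal components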
9) DATA SET PROCESS IN DATA SCIENCE

In data science, the process of working with a dataset typically involves several key steps:

1. Data collection: The first step in working with a dataset is to collect the data from various sources, such as databases, files, APIs, or web
scraping. It is important to ensure that the data is collected in a
consistent and organized manner.
2. Data cleaning: Once the data is collected, it is essential to clean
and preprocess the data to remove any errors, inconsistencies, or
missing values. Data cleaning involves techniques such as data
imputation, outlier removal, and normalization.
3. Data exploration: After cleaning the data, the next step is to
explore the data to gain insights and identify patterns. This
involves using visualization tools and statistical methods to identify
trends, correlations, and outliers.
4. Feature engineering: Feature engineering involves selecting and
transforming the relevant variables or features in the dataset to
improve the accuracy of the model. This involves techniques such
as dimensionality reduction, feature selection, and feature scaling.
5. Modeling: Once the data is cleaned and preprocessed, the next
step is to build a predictive model. This involves selecting an
appropriate machine learning algorithm, training the model on the
dataset, and testing the model to evaluate its performance.
6. Model evaluation: After building the model, it is important to
evaluate its performance using various metrics such as accuracy,
precision, recall, and F1 score. This helps to identify areas for
improvement and refine the model.
7. Deployment: Once the model is validated, the final step is to
deploy the model in a production environment. This involves
integrating the model with other systems and ensuring that it can
handle real-world data.

Overall, the process of working with a dataset in data science involves several iterative steps of data collection, cleaning, exploration, feature
engineering, modeling, evaluation, and deployment.
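Tying the steps together, a minimal end-to-end sketch with pandas and scikit-learn could look like the following; the dataset, column names, and model choice are assumptions made only for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Steps 1-2: collection and cleaning ("patients.csv" and its columns are hypothetical)
df = pd.read_csv("patients.csv").dropna()

# Steps 3-4: exploration and feature engineering (kept minimal here)
X = df[["age", "bmi", "blood_pressure"]]
y = df["high_risk"]

# Step 5: modeling, with feature scaling and the classifier in one pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Pipeline([("scale", StandardScaler()),
                  ("clf", RandomForestClassifier(n_estimators=200, random_state=0))])
model.fit(X_train, y_train)

# Step 6: evaluation with accuracy, precision, recall, and F1
print(classification_report(y_test, model.predict(X_test)))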
10) DESCRIPTIVE ANALYTICS AND DIAGNOSTIC ANALYTICS IN DATA SCIENCE

Descriptive analytics and diagnostic analytics are two of the primary branches of data analytics that are used to extract insights from data.
While both are important, they serve different purposes and are used in
different contexts.

Descriptive analytics involves analyzing data to describe what happened in the past. It focuses on summarizing and visualizing data to gain
insights into patterns, trends, and relationships within the data.
Descriptive analytics is useful for understanding the historical
performance of a business or organization, identifying areas for
improvement, and making data-driven decisions.

Diagnostic analytics, on the other hand, goes beyond describing what happened and seeks to identify why it happened. It involves analyzing
data to uncover the root cause of a problem or issue, as well as
identifying any contributing factors. Diagnostic analytics is useful for
solving problems, optimizing processes, and improving performance.

In practice, descriptive analytics often serves as a precursor to diagnostic analytics. By first understanding what happened in the past, analysts can
begin to identify potential issues or areas for improvement. They can
then use diagnostic analytics to drill down into the data and identify the
root cause of any problems.

Overall, both descriptive and diagnostic analytics play important roles in data science, and analysts often use both approaches in combination to
gain a deeper understanding of the data and to make more informed
decisions.
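As a rough, invented illustration of the two approaches: descriptive analytics summarizes what happened, while diagnostic analytics drills into why. The DataFrame and columns below are hypothetical.

import pandas as pd

df = pd.read_csv("orders.csv")            # placeholder: one row per order

# Descriptive: what happened? Total revenue per month.
monthly = df.groupby("month")["revenue"].sum()
print(monthly)

# Diagnostic: why was the weakest month weak? Break it down by region and product.
worst_month = monthly.idxmin()
drilldown = (df[df["month"] == worst_month]
             .groupby(["region", "product"])["revenue"]
             .sum()
             .sort_values())
print(drilldown.head())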
11) what is supervised and unsupervised learning?
Supervised learning: Supervised learning can be separated into two types of
problems when data mining: classification and regression:
• Classification problems use an algorithm to accurately assign test data into
specific categories, such as separating apples from oranges. Or, in the real
world, supervised learning algorithms can be used to classify spam in a
separate folder from your inbox. Linear classifiers, support vector machines,
decision trees and random forest are all common types of classification
algorithms.
• Regression is another type of supervised learning method that uses an
algorithm to understand the relationship between dependent and independent
variables. Regression models are helpful for predicting numerical values based
on different data points, such as sales revenue projections for a given business.
Some popular regression algorithms are linear regression, logistic regression
and polynomial regression.
Unsupervised Learning: Unsupervised learning models are used for three
main tasks: clustering, association and dimensionality reduction:
Clustering is a data mining technique for grouping unlabeled data based on
their similarities or differences. For example, K-means clustering algorithms
assign similar data points into groups, where the K value represents the size of
the grouping and granularity. This technique is helpful for market
segmentation, image compression, etc.
Association is another type of unsupervised learning method that uses
different rules to find relationships between variables in a given dataset. These
methods are frequently used for market basket analysis and recommendation
engines, along the lines of “Customers Who Bought This Item Also Bought”
recommendations.
Dimensionality reduction is a learning technique used when the number of
features (or dimensions) in a given dataset is too high. It reduces the number
of data inputs to a manageable size while also preserving the data integrity.
Often, this technique is used in the data preprocessing stage, such as when
autoencoders remove noise from visual data to improve picture quality.
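The sketch below contrasts the two settings on scikit-learn's built-in iris dataset: a supervised classifier learns from the labels, while clustering and dimensionality reduction ignore them. The specific algorithm choices are illustrative only.

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Supervised (classification): trained on labeled examples
clf = SVC().fit(X, y)
print("Training accuracy:", clf.score(X, y))

# Unsupervised (clustering): group the same points without looking at the labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", [int((clusters == k).sum()) for k in range(3)])

# Unsupervised (dimensionality reduction): compress 4 features down to 2
X_2d = PCA(n_components=2).fit_transform(X)
print("Reduced shape:", X_2d.shape)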
12) Difference between Data Science and Traditional Programming?

Data Science and Traditional Programming are two distinct areas of computing that serve different purposes. Here are some key differences
between the two:

1. Purpose: Traditional programming is focused on creating software applications and systems that perform specific functions. Data
science, on the other hand, is focused on extracting insights and
knowledge from data.
2. Input Data: Traditional programming typically works with
structured data that is already well-defined, while data science
works with unstructured or semi-structured data that requires
significant pre-processing and cleaning before analysis.
3. Tools and Techniques: Traditional programming typically relies on
a specific set of programming languages and frameworks for
application development, while data science employs a range of
tools and techniques such as statistical analysis, machine learning,
and data visualization.
4. Outputs: Traditional programming is generally focused on creating
applications or systems that perform specific tasks or functions,
while data science is focused on generating insights, predictions, or
recommendations based on data analysis.
5. Skillset: Traditional programming requires expertise in
programming languages, software development methodologies,
and problem-solving skills, while data science requires expertise in
statistical analysis, mathematics, machine learning, and data
visualization.

In summary, while traditional programming is focused on creating software applications and systems, data science is focused on extracting
insights and knowledge from data. The two fields require different tools,
techniques, and skillsets.
Criteria | Data Science | Traditional Programming
Goal | Extract insights from data | Automate tasks and solve specific problems
Input Data | Large, complex, and unstructured | Structured and predefined
Output | Models, predictions, visualizations | Functional software or application
Approach | Iterative and exploratory | Sequential and methodical
Tools | Statistical models, machine learning, big data | Programming languages, libraries, and frameworks
Data Cleaning | Major part of the process | Minor part of the process
Domain Knowledge | Essential to understand the data | Less crucial to understand the problem and solution
Testing | Emphasis on testing the model's accuracy | Emphasis on testing the code's functionality
Performance | Focus on accuracy and interpretability | Focus on efficiency and speed
Communication | Strong communication and storytelling skills | Strong technical writing and documentation skills

Note: This table is not exhaustive and there may be overlap in some areas between
the two fields. The differences listed are generalizations and not applicable to every
situation.

13) What is a decision tree and a decision boundary tree, and what is their role in data science?
A Decision Tree is a tree-like structure used in decision analysis, decision
support systems, and machine learning to identify decisions and their possible
consequences. It starts with a root node that represents the first decision or test to make, and branches off into multiple nodes that represent the possible outcomes of each decision. Each node then becomes the root of further branching until a final decision is reached at a leaf. Decision trees are commonly used in classification
and regression problems.
A Decision Boundary Tree is a variation of a Decision Tree used in binary
classification problems, where the goal is to divide a data set into two separate
classes. It is a tree-like structure where each internal node represents a test of
the input features, each branch represents a possible outcome of the test, and
each leaf node represents a prediction of the target class. The decision
boundary is represented by the path from the root to a leaf node.
In data science, Decision Trees and Decision Boundary Trees are used as a way
to model and analyze complex data sets, and to make predictions about new
data points based on the relationships between input features and target
variables. They provide a visual representation of the decisions and
relationships in a data set, which can be used for exploratory data analysis,
feature selection, and model interpretation.

14) DECISION TREES IN DATA SCIENCE


Decision trees are a popular algorithm used in data science for both classification and regression
problems. They are a type of supervised learning algorithm that is used for solving complex
problems by making decisions based on a set of rules or conditions.

The decision tree algorithm works by creating a tree-like structure where each node represents a
decision based on a particular feature or attribute. The root node of the tree represents the most
significant feature or attribute that provides the most information gain, and the subsequent
nodes represent the other features or attributes that are used to make further decisions.

At each node, the decision tree algorithm evaluates the feature or attribute and selects the
branch that provides the most information gain. The information gain is calculated using a
statistical measure called entropy, which measures the degree of randomness or impurity in the
dataset. The goal of the algorithm is to minimize the entropy of the dataset by selecting the
features that provide the most information gain at each node.
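As a worked illustration of entropy and information gain (the tiny label set below is invented), both quantities can be computed directly:

import numpy as np

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 4 positives, 6 negatives
left   = np.array([1, 1, 1, 1, 0])                   # one branch of a candidate split
right  = np.array([0, 0, 0, 0, 0])                   # the other branch

gain = entropy(parent) \
       - (len(left) / len(parent)) * entropy(left) \
       - (len(right) / len(parent)) * entropy(right)
print(f"Parent entropy: {entropy(parent):.3f}, information gain of the split: {gain:.3f}")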

Decision trees are popular in data science because they are easy to understand and interpret, and
they can be used for both categorical and continuous data. They also provide a way to handle
missing values and noisy data by creating a default path in the tree.

However, decision trees can be prone to overfitting, which occurs when the algorithm creates a
complex tree structure that fits the training data too closely and does not generalize well to new
data. To avoid overfitting, techniques such as pruning and limiting the depth of the tree can be
used.

Overall, decision trees are a powerful tool in data science for solving classification and regression
problems, and they can be combined with other algorithms such as random forests and gradient
boosting to improve their performance.
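A minimal scikit-learn sketch that fits an entropy-based decision tree and limits its depth to guard against the overfitting described above; the built-in iris dataset is used purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion="entropy" chooses splits by information gain; max_depth limits overfitting
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # text rendering of the learned decision rules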
