DATA SCIENCE COMPREHENSION WORKSHEETS
Name: Date:
WHAT IS DATA SCIENCE?
Data Science is the study of data to extract meaningful
insights and knowledge. It combines methods from statistics,
mathematics, and computer science with domain expertise. Data can be structured,
such as numbers in spreadsheets, or unstructured, such as
videos, social media posts, and images. The ultimate goal of
Data Science is not only to analyze past data but also to
predict future outcomes.
The process of Data Science follows several steps. First, data
is collected from various sources. Then it is cleaned to
remove errors and inconsistencies. After this, analysis begins
using programming languages like Python and R. Data
visualization tools, such as Tableau or Power BI, help present
the results in charts and dashboards that are easy to
understand.
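The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a real project: the survey scores and the function names (clean, analyze, present) are made up for this example.

```python
from statistics import mean

# Step 1: collect. Hypothetical raw survey answers (scores from 1 to 5),
# stored as text the way they might arrive from a form.
raw_scores = ["4", "5", "", "3", "five", "4"]

def clean(records):
    """Step 2: keep only entries that are whole numbers from 1 to 5."""
    cleaned = []
    for text in records:
        if text.isdigit() and 1 <= int(text) <= 5:
            cleaned.append(int(text))
    return cleaned

def analyze(scores):
    """Step 3: summarize the cleaned scores."""
    return {"count": len(scores), "average": mean(scores)}

def present(summary):
    """Step 4: turn the summary into a sentence that is easy to read."""
    return f"{summary['count']} valid answers, average score {summary['average']:.1f}"

scores = clean(raw_scores)
summary = analyze(scores)
report = present(summary)
print(report)
```

The empty answer and the word "five" are dropped during cleaning, so only four scores reach the analysis step.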
Data Science is widely used in everyday life. For example, in
healthcare it helps predict disease risks, in finance it helps detect
fraudulent transactions, and in retail it recommends products
to customers. Because of its broad applications, Data Science
has become one of the most important and fast-growing
fields of technology today. Businesses and organizations
depend on it to make better decisions based on evidence
rather than guesswork.
Answer the Following Questions
1. What is the main purpose of Data Science?
2. Give an example of structured and unstructured data.
3. Why is data cleaning an important step?
4. Name two tools or languages used in Data Science.
5. How is Data Science applied in healthcare?
Answer No 1 :
Answer No 2 :
Answer No 3 :
Answer No 4 :
Answer No 5 :
Choose the correct answer
Data Science combines:
a) Only programming
b) Only statistics
c) Statistics, computer science, and domain expertise
d) None of the above
Which is unstructured data?
a) Numbers in spreadsheets
b) Tables
c) Social media posts
d) Database records
Tableau is used for:
a) Collecting data
b) Cleaning data
c) Visualization
d) Programming
Data Science helps businesses:
a) Make evidence-based decisions
b) Avoid technology
c) Work without data
d) Eliminate all risks
Which industry uses Data Science to detect fraud?
a) Retail
b) Finance
c) Education
d) Agriculture
Fill in the blanks
1. Data Science extracts __________ from data.
2. __________ data includes videos, images, and text.
3. Python and __________ are common programming languages
in Data Science.
4. Tableau and Power BI are used for __________.
5. Data Science is one of the most __________ fields today.
Mark True or False.
1. Data Science only studies structured data.
2. Cleaning data prevents wrong conclusions.
3. Python is not used in Data Science.
4. Data Science is applied in healthcare, finance, and retail.
5. The purpose of Data Science is only to look at the past.
DATA COLLECTION AND CLEANING
The first step in Data Science is data collection. Data can
come from many sources: surveys, customer transactions,
sensors, websites, or social media. The quality of the data
collected has a big impact on the accuracy of results.
However, raw data is rarely perfect. It may contain errors,
duplicates, or missing values.
That is why data cleaning is considered one of the most
important and time-consuming steps in Data Science. During
cleaning, data scientists remove incorrect entries, fix missing
values, and standardize formats. For example, if one dataset
records “USA” and another records “United States,” the entries
must be made consistent. Likewise, an entry with a negative
age must be corrected or removed.
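The two cleaning examples in this paragraph might look like this in Python. It is a minimal sketch under stated assumptions: the records, the COUNTRY_NAMES lookup table, and the clean_record function are hypothetical, and the sketch chooses to remove a record with a negative age rather than correct it.

```python
# Hypothetical lookup table: every known spelling maps to one standard name.
COUNTRY_NAMES = {"usa": "United States", "united states": "United States"}

def clean_record(record):
    """Return a cleaned copy of one record, or None if the age is impossible."""
    age = record["age"]
    if age < 0:
        return None  # a negative age cannot be right, so this sketch drops it
    key = record["country"].strip().lower()
    # Unknown spellings are kept as they are, only trimmed of extra spaces.
    country = COUNTRY_NAMES.get(key, record["country"].strip())
    return {"name": record["name"], "age": age, "country": country}

# Made-up raw records with mixed country spellings and one bad age.
raw = [
    {"name": "Ana", "age": 34, "country": "USA"},
    {"name": "Ben", "age": -5, "country": "United States"},
    {"name": "Cy", "age": 41, "country": " united states "},
]

cleaned = [r for r in (clean_record(x) for x in raw) if r is not None]
for row in cleaned:
    print(row)
```

After cleaning, two records remain and both use the same country name.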
If data is not cleaned properly, analysis can lead to wrong or
misleading conclusions. A survey with duplicate responses
might give a false picture of customer satisfaction. Clean and
reliable data ensures that the insights generated are
trustworthy. In other words, without accurate data, even the
most advanced algorithms will not produce correct results.
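The duplicate-response problem can be shown with a small Python sketch. The survey data is invented for illustration, and it assumes each response carries a respondent ID so that repeats can be spotted.

```python
from statistics import mean

# Hypothetical survey rows: (respondent ID, satisfaction score from 1 to 5).
# Respondent 1 submitted the same answer three times.
responses = [(1, 5), (2, 2), (1, 5), (1, 5), (3, 2)]

def deduplicate(rows):
    """Keep only the first response from each respondent."""
    seen = set()
    unique = []
    for respondent, score in rows:
        if respondent not in seen:
            seen.add(respondent)
            unique.append((respondent, score))
    return unique

with_duplicates = mean(score for _, score in responses)
without_duplicates = mean(score for _, score in deduplicate(responses))

print("Average with duplicates:", with_duplicates)
print("Average without duplicates:", without_duplicates)
```

With the repeated answers left in, the average looks higher than it does when each respondent is counted once.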
Answer the Following Questions
1. What is the first step in Data Science?
2. Give two examples of data sources.
3. Why is data cleaning time-consuming?
4. What should be done when two datasets use different formats for
the same country?
5. What is the danger of using unclean data?
Answer No 1 :
Answer No 2 :
Answer No 3 :
Answer No 4 :
Answer No 5 :
Choose the correct answer
Which is an example of raw data?
a) Final report
b) Customer transaction logs
c) Clean tables
d) Dashboards
The purpose of data cleaning is:
a) To collect more data
b) To make data reliable
c) To delete all data
d) To create charts
A negative age is an example of:
a) Missing data
b) Clean data
c) Incorrect data
d) Structured data
Standardizing formats means:
a) Using the same representation across datasets
b) Making data larger
c) Creating duplicates
d) Ignoring errors
Without cleaning, results may be:
a) Trustworthy
b) Misleading
c) Accurate
d) Perfect
Fill in the blanks
1. Data collection involves gathering __________ data.
2. Raw data may have errors, duplicates, or __________ values.
3. Data cleaning ensures that data is __________.
4. Duplicate survey responses may lead to __________
conclusions.
5. Making “USA” and “United States” consistent is called
__________.
Mark True or False.
1. Data collection is the second step of Data Science.
2. Raw data is always clean and ready to use.
3. Standardizing formats makes data consistent.
4. Duplicate entries can distort analysis.
5. Clean data improves accuracy of results.
DATA VISUALIZATION
Data visualization is the practice of turning numbers into
visual forms like charts, graphs, and dashboards. People usually
understand visuals faster than rows of raw numbers.
For this reason, visualization is one of the most powerful tools
in Data Science.
Different types of visualizations serve different purposes. Bar
charts compare categories, line graphs show changes over
time, and scatter plots display relationships between two
variables. Heatmaps are used to reveal patterns across large
datasets. A well-designed visualization highlights the
important insights, while a poorly designed one can confuse
or mislead the audience.
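To see how a bar chart compares categories, here is a minimal Python sketch that draws one with text characters. The sales numbers and the text_bar_chart function are made up for this example; a real project would more likely use a plotting library.

```python
def text_bar_chart(counts):
    """Return one line per category, with one '#' for each unit."""
    width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        # Pad the labels so that all the bars start in the same column.
        lines.append(f"{label.ljust(width)} | {'#' * value} {value}")
    return "\n".join(lines)

# Hypothetical number of items sold in each product category.
sales = {"Books": 7, "Toys": 3, "Games": 5}

chart = text_bar_chart(sales)
print(chart)
```

Even in this simple form, the longest bar shows at a glance which category sold the most.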
Data scientists use several tools for visualization. Python has
libraries like Matplotlib and Seaborn, while R has ggplot2.
Software like Tableau and Power BI helps create interactive
dashboards. These tools make it easier to present findings in
a clear and effective way. Visualization is therefore essential
for communicating results to decision-makers who may not
be familiar with coding or statistics.
Answer the Following Questions
1. Why is data visualization important?
2. Name two types of graphs and their uses.
3. What is the risk of poor visualization?
4. Mention one Python library used for visualization.
5. Who benefits from clear visualizations?
Answer No 1 :
Answer No 2 :
Answer No 3 :
Answer No 4 :
Answer No 5 :
Choose the correct answer
Which chart shows trends over time?
a) Bar chart
b) Line graph
c) Scatter plot
d) Pie chart
Which graph shows relationships between variables?
a) Scatter plot
b) Line graph
c) Histogram
d) Bar chart
Tableau is used for:
a) Collecting data
b) Creating dashboards
c) Data cleaning
d) Programming
Poor visualization can:
a) Simplify data
b) Distort meaning
c) Eliminate errors
d) Replace coding
Which library belongs to R?
a) Matplotlib
b) Seaborn
c) ggplot2
d) TensorFlow
Fill in the blanks
1. Data visualization turns numbers into __________.
2. Bar charts are used to compare __________.
3. Scatter plots show __________ between variables.
4. Heatmaps are useful for showing __________.
5. __________ and Power BI create interactive dashboards.
Mark True or False.
1. Humans understand visuals more quickly than numbers.
2. Line graphs are best for comparing categories.
3. Poor visualization can mislead people.
4. Seaborn is a library in Python.
5. Tableau is a programming language.
MACHINE LEARNING IN DATA SCIENCE
Machine Learning (ML) is a branch of Artificial Intelligence that
teaches computers to learn from data without being explicitly
programmed. It is one of the most important tools in Data
Science for making predictions and recognizing patterns.
There are three main types of ML. Supervised learning uses
labeled data, where both input and output are known. One
example is predicting house prices based on size and location.
Unsupervised learning works with unlabeled data and finds
hidden patterns, such as grouping customers by buying
behavior. Reinforcement learning allows a computer to learn by
trial and error, often applied in robotics and games.
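The house-price example of supervised learning can be sketched in plain Python. This is a minimal illustration that uses size only and fits a straight line by least squares. The sizes and prices are invented, and they are perfectly linear to keep the arithmetic easy; real data would be noisier, and a library such as scikit-learn would normally do the fitting.

```python
def fit_line(xs, ys):
    """Find the slope and intercept of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Labeled training data (made up): the input is the house size,
# the known output is the price.
sizes = [50, 80, 100, 120]      # square metres
prices = [150, 240, 300, 360]   # thousands of dollars

slope, intercept = fit_line(sizes, prices)

# Use the learned line to predict the price of a house not in the data.
predicted = slope * 90 + intercept
print("Predicted price for 90 square metres:", predicted)
```

The program is never told the pricing rule. It works the rule out from the labeled examples, which is the idea behind supervised learning.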
High-quality data is necessary for machine learning. If the data is
biased or inaccurate, predictions will be poor. Data scientists use
Python libraries like scikit-learn and TensorFlow to build and train
ML models. These models usually improve when they are trained on
more good-quality data.
Applications of ML include spam filters in email, personalized
recommendations in Netflix or YouTube, fraud detection in
banking, and medical image analysis. Machine learning is
becoming a vital technology across almost every industry.
Answer the Following Questions
1. What is Machine Learning?
2. Name the three types of ML.
3. Give one example of supervised learning.
4. Why is high-quality data important?
5. List two applications of ML.
Answer No 1 :
Answer No 2 :
Answer No 3 :
Answer No 4 :
Answer No 5 :
Choose the correct answer
Which type of learning uses labeled data?
a) Reinforcement
b) Supervised
c) Unsupervised
d) Deep
Clustering customers is an example of:
a) Supervised learning
b) Reinforcement learning
c) Unsupervised learning
d) Deep learning
Which library is used for ML in Python?
a) scikit-learn
b) Excel
c) Tableau
d) Power BI
Reinforcement learning is often used in:
a) Robotics
b) Healthcare
c) Banking
d) Education
Machine learning improves with:
a) Less data
b) Random guessing
c) More data
d) No data
Fill in the blanks
1. Machine Learning helps computers learn from __________.
2. Predicting house prices is an example of __________ learning.
3. Customer grouping is an example of __________ learning.
4. Reinforcement learning is often used in __________.
5. __________-quality data is needed for ML.
Mark True or False.
1. Machine Learning is a branch of Artificial Intelligence.
2. Unsupervised learning uses labeled data.
3. Spam filters are an application of ML.
4. Scikit-learn is a data visualization tool.
5. ML models get better with more data.
BIG DATA IN DATA SCIENCE
Big Data refers to datasets that are too large and complex for
traditional tools like spreadsheets. It is described by the “3Vs”:
Volume, Velocity, and Variety. Volume means the massive amount
of data created every second, like online transactions and social
media posts. Velocity is the speed at which data is generated and
processed. Variety means the different formats of data, including
text, images, videos, and audio.
Big Data requires special tools and platforms to store and
process. Hadoop and Spark are two common systems used for
this purpose. Cloud platforms like AWS and Google Cloud provide
infrastructure that allows businesses to scale up easily and
manage big data effectively.
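One idea behind tools of this kind is to split the data into chunks, process each chunk separately, and then combine the partial results. The Python sketch below shows only that idea on three tiny made-up chunks. It does not use Hadoop or Spark, and the function names are hypothetical.

```python
from collections import Counter

def count_words(chunk):
    """'Map' step: count the words in one chunk of text."""
    return Counter(chunk.lower().split())

def combine(partial_counts):
    """'Reduce' step: add the partial counts together."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Made-up chunks. In a real system each chunk could sit on a different
# machine and be counted at the same time as the others.
chunks = ["big data big tools", "data moves fast", "big ideas"]

totals = combine(count_words(chunk) for chunk in chunks)
print(totals.most_common(2))
```

Because each chunk is counted on its own, the work could be shared among many computers, which is what makes very large datasets manageable.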
Big Data is very important in industries today. In healthcare, it is
used to track disease outbreaks in real time. In retail, it helps
personalize marketing by studying customer behavior. In
transportation, it supports smart cities through GPS and traffic
analysis. Companies that use Big Data gain valuable insights that
help them stay competitive in a fast-moving world.
Answer the Following Questions
1. What are the 3Vs of Big Data?
2. Why can’t spreadsheets handle Big Data?
3. Name two platforms used for processing Big Data.
4. Give one application of Big Data in healthcare.
5. How does Big Data give companies an advantage?
Answer No 1 :
Answer No 2 :
Answer No 3 :
Answer No 4 :
Answer No 5 :
Choose the correct answer
Which is NOT one of the 3Vs?
a) Volume
b) Variety
c) Velocity
d) Validation
Hadoop and Spark are used for:
a) Visualization
b) Storage and processing
c) Data cleaning
d) Data collection
Which cloud service handles Big Data?
a) Excel
b) AWS
c) PowerPoint
d) SPSS
In transportation, Big Data helps with:
a) Traffic and GPS analysis
b) Gaming
c) Writing reports
d) Manufacturing
The main advantage of Big Data is:
a) It creates more spreadsheets
b) It gives deeper insights
c) It avoids technology
d) It replaces algorithms
Fill in the blanks
1. Big Data is described by Volume, Velocity, and __________.
2. Traditional __________ cannot handle petabytes of data.
3. Hadoop and __________ are platforms for Big Data.
4. In retail, Big Data supports __________ marketing.
5. Big Data provides a __________ advantage to companies.
Mark True or False.
1. The 3Vs of Big Data are Volume, Velocity, and Variety.
2. Spreadsheets are good for managing Big Data.
3. Spark is a Big Data processing tool.
4. Big Data is used in smart city development.
5. Big Data has no impact on competition.
Answer Key
What is Data Science?
Questions/Answers:
1. The main purpose of Data Science is to extract insights and make
predictions from data.
2. Structured: numbers in spreadsheets; Unstructured: videos, social
media posts, or images.
3. Data cleaning is important to remove errors and prevent wrong
conclusions.
4. Python and R.
5. It is used to predict disease risks.
MCQs:
1. c) Statistics, computer science, and domain expertise
2. c) Social media posts
3. c) Visualization
4. a) Make evidence-based decisions
5. b) Finance
Fill in the blanks:
1. insights
2. Unstructured
3. R
4. visualization
5. fast-growing
True or False:
1. False
2. True
3. False
4. True
5. False
Data Collection and Cleaning
Questions/Answers:
1. Data collection.
2. Surveys, customer transactions (any two).
3. Because raw data often contains errors, duplicates, and
inconsistencies.
4. Standardize them so they are consistent.
5. Results may be misleading or wrong.
MCQs:
1. b) Customer transaction logs
2. b) To make data reliable
3. c) Incorrect data
4. a) Using the same representation across datasets
5. b) Misleading
Fill in the blanks:
1. raw
2. missing
3. reliable
4. misleading
5. standardization
True or False:
1. False
2. False
3. True
4. True
5. True
Data Visualization
Questions/Answers:
1. It helps people understand data quickly and clearly.
2. Bar charts (compare categories), Line graphs (show trends over
time).
3. Poor visualization can distort meaning and mislead the audience.
4. Matplotlib (or Seaborn).
5. Decision-makers who may not know coding/statistics.
MCQs:
1. b) Line graph
2. a) Scatter plot
3. b) Creating dashboards
4. b) Distort meaning
5. c) ggplot2
Fill in the blanks:
1. visuals
2. categories
3. relationships
4. patterns
5. Tableau
True or False:
1. True
2. False
3. True
4. True
5. False
Machine Learning in Data Science
Questions/Answers:
1. A branch of AI that teaches computers to learn from data without
explicit programming.
2. Supervised learning, Unsupervised learning, Reinforcement learning.
3. Predicting house prices using size and location.
4. Because poor-quality data produces inaccurate predictions.
5. Spam detection, personalized recommendations, fraud detection,
medical imaging (any two).
MCQs:
1. b) Supervised
2. c) Unsupervised
3. a) scikit-learn
4. a) Robotics
5. c) More data
Fill in the blanks:
1. data
2. supervised
3. unsupervised
4. robotics
5. High
True or False:
1. True
2. False
3. True
4. False
5. True
Big Data in Data Science
Questions/Answers:
1. Volume, Velocity, Variety.
2. Because they cannot handle huge, complex datasets like petabytes
of data.
3. Hadoop and Spark.
4. Tracking disease outbreaks in real time.
5. By providing deeper insights that improve competitiveness.
MCQs:
1. d) Validation
2. b) Storage and processing
3. b) AWS
4. a) Traffic and GPS analysis
5. b) It gives deeper insights
Fill in the blanks:
1. Variety
2. spreadsheets
3. Spark
4. personalized
5. competitive
True or False:
1. True
2. False
3. True
4. True
5. False