
Department of Chemical Engineering

Subject Code: OCS353


Subject Name: DATA SCIENCE FUNDAMENTALS
COURSE OBJECTIVES:
● Familiarize students with the data science process.
● Understand the data manipulation functions in Numpy and Pandas.
● Explore different types of machine learning approaches.
● Understand and practice visualization techniques using tools.
● Learn to handle large volumes of data with case studies
UNIT I INTRODUCTION 6
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – data preparation - Exploratory Data analysis – build the model
– presenting findings and building applications - Data Mining - Data Warehousing – Basic
statistical descriptions of Data
UNIT II DATA MANIPULATION 9
Python Shell - Jupyter Notebook - IPython Magic Commands - NumPy Arrays-Universal
Functions – Aggregations – Computation on Arrays – Fancy Indexing – Sorting arrays –
Structured data – Data manipulation with Pandas – Data Indexing and Selection – Handling
missing data – Hierarchical indexing – Combining datasets – Aggregation and Grouping – String
operations – Working with time series – High performance
UNIT III MACHINE LEARNING 5
The modeling process - Types of machine learning - Supervised learning - Unsupervised learning
- Semi- supervised learning- Classification, regression - Clustering – Outliers and Outlier
Analysis
UNIT IV DATA VISUALIZATION 5
Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density and contour
plots – Histograms – legends – colors – subplots – text and annotation – customization – three dimensional
plotting - Geographic Data with Basemap - Visualization with Seaborn
UNIT V HANDLING LARGE DATA 5
Problems - techniques for handling large volumes of data - programming tips for dealing with large data
sets- Case studies: Predicting malicious URLs, Building a recommender system - Tools and techniques
needed - Research question - Data preparation - Model building – Presentation and automation.

30 PERIODS
PRACTICAL EXERCISES: 30 PERIODS
LAB EXERCISES
1. Download, install and explore the features of Python for data analytics.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Basic plots using Matplotlib
5. Statistical and Probability measures a) Frequency distributions b) Mean, Mode, Standard Deviation c)
Variability d) Normal curves e) Correlation and scatter plots f) Correlation coefficient g) Regression
6. Use the standard benchmark data set for performing the following: a) Univariate Analysis: Frequency,
Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis. b) Bivariate Analysis: Linear
and logistic regression modelling.
7. Apply supervised learning algorithms and unsupervised learning algorithms on any data set.
8. Apply and explore various plotting functions on any data set. Note: Example data sets like: UCI, Iris,
Pima Indians Diabetes etc.

COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Gain knowledge on the data science process.
CO2: Perform data manipulation functions using Numpy and Pandas.
CO3: Understand different types of machine learning approaches.
CO4: Perform data visualization using tools.
CO5: Handle large volumes of data in practical scenarios.
TOTAL: 60 PERIODS

TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016.
2. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
REFERENCES
1. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
2. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press,2014.
UNIT I NOTES

UNIT I : Introduction
Syllabus
Data Science: Benefits and uses - Facets of data - Data Science Process: Overview - Defining research goals - Retrieving data - Data preparation - Exploratory Data Analysis - Build the model - Presenting findings and building applications - Data Mining - Data Warehousing - Basic statistical descriptions of data.
Data Science
• Data is measurable units of information gathered or captured from activity of people, places and things.
• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of
data. At its core, Data Science aims to discover and extract actionable knowledge from data that can be used
to make sound business decisions and predictions. Data science combines math and statistics, specialized
programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization's data.
• Data science uses advanced analytical theory and methods such as time series analysis to predict the future. Instead of merely reporting how many products were sold in the previous quarter, data science uses historical data to forecast future product sales and revenue more accurately.
• Data science is devoted to the extraction of clean information from raw data to form actionable insights.
Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio and
more to produce artificial intelligence systems to perform tasks that ordinarily require human intelligence.
• The data science field is growing rapidly and revolutionizing so many industries. It has incalculable
benefits in business, research and our everyday lives.
• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data and
they often use advanced algorithms and implement machine learning models to help businesses and
organizations make accurate assessments and predictions.
Data science and big data evolved from statistics and traditional data management but are now considered to
be distinct disciplines.
Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data extraction.
2. Maintain: Data warehousing, data cleansing, data staging, data processing and data architecture.
3. Process: Data mining, clustering and classification, data modeling and data summarization.
4. Analyze: Exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.
5. Communicate: Data reporting, data visualization, business intelligence and decision making.

Big Data
• Big data can be defined as very large volumes of data available at various sources, in varying degrees of
complexity, generated at different speed i.e. velocities and varying degrees of ambiguity, which cannot be
processed using traditional technologies, processing methods, algorithms or any commercial off-the-shelf
solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store or process it efficiently.

Characteristics of Big Data


• Characteristics of big data are volume, velocity and variety. They are often referred to as the three V's.
1. Volume: Volumes of data are larger than conventional relational database infrastructure can cope with, consisting of terabytes or petabytes of data.
2. Velocity: The term 'velocity' refers to the speed of generation of data. How fast the data is generated and
processed to meet the demands, determines real potential in the data. It is being created in or near real-time.
3. Variety: It refers to heterogeneous sources and the nature of data, both structured and unstructured.
• These three dimensions are also called the three V's of Big Data.

• Two other characteristics of big data are veracity and value.


a) Veracity:
• Veracity refers to source reliability, information credibility and content validity.
• Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is
representative? Every good manager knows that there are inherent discrepancies in all the data collected.
• Spatial veracity: For vector data (imagery based on points, lines and polygons), the quality varies. It
depends on whether the points have been GPS determined or determined by unknown origins or manually.
Also, resolution and projection issues can alter veracity.
• For geo-coded points, there may be errors in the address tables and in the point location algorithms
associated with addresses.
• For raster data (imagery based on pixels), veracity depends on accuracy of recording instruments in
satellites or aerial devices and on timeliness.
b) Value :
• It represents the business value to be derived from big data.
• The ultimate objective of any big data project should be to generate some sort of value for the company doing all the analysis. Otherwise, the user is just performing a technological task for technology's sake.
• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
• Exploration of data trends can include spatial proximities and relationships.
• Once spatial big data are structured, formal spatial analytics can be applied, such as spatial
autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.

Difference between Data Science and Big Data


Comparison between Cloud Computing and Big Data

Benefits and Uses of Data Science


• Data science example and applications :
a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as "important"
c) Forecasting: Sales, revenue and customer retention
d) Pattern detection: Weather patterns, financial market patterns
e) Recognition : Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines can refer user to movies,
restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on amenities
h) Optimization: Scheduling ride-share pickups and package deliveries

Benefits and Use of Big Data


• Benefits of Big Data :
1. Improved customer service
2. Businesses can utilize outside intelligence while taking decisions
3. Reducing maintenance costs
4. Re-develop our products : Big Data can also help us understand how others perceive our products so that
we can adapt them or our marketing, if need be.
5. Early identification of risk to the product/services, if any
6. Better operational efficiency
• Some of the examples of big data are:
1. Social media: Social media is one of the biggest contributors to the flood of data we have today. Facebook generates around 500+ terabytes of data every day in the form of content generated by users, such as status messages, photo and video uploads, messages and comments.
2. Stock exchange : Data generated by stock exchanges is also in terabytes per day. Most of this data is the
trade data of users and companies.
3. Aviation industry: A single jet engine can generate around 10 terabytes of data during a 30 minute
flight.
4. Survey data: Online or offline surveys conducted on various topics typically have hundreds of thousands of responses, which need to be processed for analysis and visualization by grouping the population and their associated responses.
5. Compliance data: Many organizations in healthcare, hospitals, life sciences, finance etc. have to file compliance reports.


Facets of Data
• Big data and data science generate and work with very large amounts of data. This data comes in various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images

Structured Data
• Structured data is arranged in a row and column format. This helps applications retrieve and process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most
common form of structured data or records is a database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within content. Structured data is understood by
computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.

Unstructured Data
• Unstructured data is data that does not follow a specified format. Row and columns are not used for
unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no
identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.
• Even today in most of the organizations more than 80 % of the data are in unstructured form. This carries
lots of information. But extracting information from these various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.

Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences, then apply
meaning and understanding to that information. This helps machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many modern real-world
applications. The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, it must go through speech
recognition, natural language understanding and machine translation. It is an iterative process comprised of
several layers of text analysis.

Machine - Generated Data


• Machine-generated data is information created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our customers, users,
transactions, applications, servers, networks, factory machinery and so on.
• It's configuration data, data from APIs and message queues, change events, the output of diagnostic
commands and call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data.
Machine data is generated continuously by every processor-based system, as well as many consumer-
oriented systems.
• It can be either structured or unstructured. In recent years, the increase of machine data has surged. The
expansion of mobile devices, virtual servers and desktops, as well as cloud- based services and RFID
technologies, is making IT infrastructures more complex.

Graph-based or Network Data


• Graphs are data structures to describe relationships and interactions between entities in complex systems. In
general, a graph contains a collection of entities called nodes and another collection of interactions between
a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By
connecting nodes with edges, we will end up with a graph (network) of nodes.
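As a minimal sketch of this node/edge idea, the following Python adjacency list represents a small, made-up social network (the names and "follows" edges are hypothetical) and counts incoming edges to hint at potential influencers:

```python
# A small, hypothetical social graph as an adjacency list.
graph = {
    "Asha":  ["Ravi", "Meena"],   # Asha follows Ravi and Meena
    "Ravi":  ["Meena"],
    "Meena": ["Asha"],
    "Kumar": [],                  # an isolated node with no edges
}

# Count incoming edges (followers) for each node.
followers = {node: 0 for node in graph}
for node, neighbours in graph.items():
    for neighbour in neighbours:
        followers[neighbour] += 1

print(followers)   # {'Asha': 1, 'Ravi': 1, 'Meena': 2, 'Kumar': 0}
```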
• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like we
might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model, allowing
a very flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with specialized query languages such
as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use
relationships to process financial and purchase transactions in near-real time. With fast graph queries, we are
able to detect that, for example, a potential purchaser is using the same email address and credit card as
included in a known fraud case.
• Graph databases can also help user easily detect relationship patterns such as multiple people associated
with a personal email address or multiple people sharing the same IP address but residing in different
physical addresses.
• Graph databases are a good choice for recommendation applications. With graph databases, we can store
in a graph relationships between information categories such as customer interests, friends and purchase
history. We can use a highly available graph database to make product recommendations to a user based on
which products are purchased by others who follow the same sport and have similar purchase history.
• Graph theory is probably the main method in social network analysis in the early history of the social
network concept. The approach is applied to social network analysis in order to determine important features
of the network such as the nodes and links (for example influencers and the followers).
• Influencers on social network have been identified as users that have impact on the activities or opinion of
other users by way of followership or influence on decision made by other users on the network as shown in
Fig. 1.2.1.

• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is
because it is capable of by-passing the building of an actual visual representation of the data to run directly
on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are
trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to the time-based media storage formats for sound/music and
moving pictures information. Audio and video digital recording, also referred as audio and video codecs, can
be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use cases.
• It is important to remark that multimedia data is one of the most important sources of information and
knowledge; the integration, transformation and indexing of multimedia data bring significant challenges in
data management and analysis. Many challenges have to be addressed including big data, multidisciplinary
nature of Data Science and heterogeneity.
• Data Science is playing an important role to address these challenges in multimedia data. Multimedia data
usually contains various forms of media, such as text, image, video, geographic coordinates and even pulse
waveforms, which come from multiple sources. Data Science can be a key instrument covering big data,
machine learning and data mining solutions to store, handle and analyze such heterogeneous data.

Streaming Data
Streaming data is data that is generated continuously by thousands of data sources, which typically send in
the data records simultaneously and in small sizes (order of Kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers using your mobile
or web applications, ecommerce purchases, in-game player activity, information from social networks,
financial trading floors or geospatial services and telemetry from connected devices or instrumentation in
data centers.
Difference between Structured and Unstructured Data

Data Science Process


Data science process consists of six stages :
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
• Fig. 1.3.1 shows data science design process.
• Step 1: Discovery or Defining research goal
This step involves understanding the business problem and defining the research goal: what the company expects from the project, why the outcome matters and how it fits into the bigger strategic picture.
• Step 2: Retrieving data
This step involves acquiring data from all the identified internal and external sources, which helps to answer the business question. It is the collection of the data required for the project and the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it, as well as what each of the data points means in terms of the company. If we are given a data set from a client, for example, we shall need to know what each column and row represents.
• Step 3: Data preparation
Data can have many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition the data before modeling. Clean data gives better predictions.
• Step 4: Data exploration
Data exploration is related to gaining a deeper understanding of the data. Try to understand how variables interact with each other, the distribution of the data and whether there are outliers. To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis.
• Step 5: Data modeling
In this step, the actual model building process starts. Here, Data scientist distributes datasets for training and
testing. Techniques like association, classification and clustering are applied to the training data set. The
model, once prepared, is tested against the "testing" dataset.
• Step 6: Presentation and automation
Deliver the final baselined model with reports, code and technical documents in this stage. Model is
deployed into a real-time production environment after thorough testing. In this stage, the key findings are
communicated to all stakeholders. This helps to decide if the project results are a success or a failure based
on the inputs from the model.

Defining Research Goals


• To understand the project, three concepts must be understood: what, why and how.
a) What is expectation of company or organization?
b) Why does a company's higher authority define such research value?
c) How is it part of a bigger strategic picture?
• The goal of the first phase is to answer these three questions.
• In this phase, the data science team must learn and investigate the problem, develop context and
understanding and learn about the data sources needed and available for the project.

1. Learning the business domain :
• Understanding the domain area of the problem is essential. In many cases, data scientists will have deep
computational and quantitative knowledge that can be broadly applied across many disciplines.
• Data scientists have deep knowledge of the methods, techniques and ways for applying heuristics to a
variety of business and conceptual problems.
2. Resources :
• As part of the discovery phase, the team needs to assess the resources available to support the project. In
this context, resources include technology, tools, systems, data and people.
3. Frame the problem :
• Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to
write down the problem statement and share it with the key stakeholders.
• Each team member may hear slightly different things related to the needs and the problem and have
somewhat different ideas of possible solutions.
4. Identifying key stakeholders:
• The team can identify the success criteria, key risks and stakeholders, which should include anyone who
will benefit from the project or will be significantly impacted by the project.
• When interviewing stakeholders, learn about the domain area and any relevant history from similar
analytics projects.
5. Interviewing the analytics sponsor:
• The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem.
• At the outset, project sponsors may have a predetermined solution that may not necessarily realize the
desired outcome.
• In these cases, the team must use its knowledge and expertise to identify the true underlying problem and
appropriate solution.
• When interviewing the main stakeholders, the team needs to take time to thoroughly interview the project
sponsor, who tends to be the one funding the project or providing the high-level requirements.
• This person understands the problem and usually has an idea of a potential working solution.
6. Developing initial hypotheses:
• This step involves forming ideas that the team can test with data. Generally, it is best to come up with a
few primary hypotheses to test and then be creative about developing several more.
• These initial hypotheses form the basis of the analytical tests the team will use in later phases and serve as the foundation for the findings of the project.
7. Identifying potential data sources:
• Consider the volume, type and time span of the data needed to test the hypotheses. Ensure that the team
can access more than simply aggregated data. In most cases, the team will need the raw data to avoid
introducing bias for the downstream analysis.

Retrieving Data
• Retrieving required data is second phase of data science project. Sometimes Data scientists need to go into
the field and design a data collection process. Many companies will have already collected and stored the
data and what they don't have can often be bought from third parties.
• A lot of high-quality data is freely available for public and commercial use. Data can be stored in various formats, such as text files or tables in a database. Data may be internal or external.
1. Start working on internal data, i.e. data stored within the company
• The first step for data scientists is to verify the internal data. Assess the relevance and quality of the data that's readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning
work may already be done. This data can be stored in official data repositories such as databases, data marts,
data warehouses and data lakes maintained by a team of IT professionals.
• Data repository is also known as a data library or data archive. This is a general term to refer to a data set
isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure,
several databases that collect, manage and store data sets for data analysis, sharing and reporting.
• Data repository can be used to describe several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from multiple sources or segments
of a business, without the data being necessarily related.
b) Data lake is a large data repository that stores unstructured data that is classified and tagged with
metadata.
c) Data marts are subsets of the data repository. These data marts are more targeted to what the data user
needs and easier to use.
d) Metadata repositories store data about data and databases. The metadata explains where the data came from, how it was captured and what it represents.
e) Data cubes are lists of data with three or more dimensions stored as a table.

Advantages of data repositories:


i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data reporting.
iii. Database administrators have an easier time tracking problems.
iv. There is value to storing and analyzing data.
Disadvantages of data repositories :
i. Growing data sets could slow down systems.
ii. A system crash could affect all the data.
iii. Unauthorized users can access all sensitive data more easily than if it was distributed across several
locations.
2. Do not be afraid to shop around
• If the required data is not available within the company, it can be obtained from other companies that provide such databases. For example, Nielsen and GfK provide data for the retail industry. Data scientists can also use data from Twitter, LinkedIn and Facebook.
• Government's organizations share their data for free with the world. This data can be of excellent quality;
it depends on the institution that creates and manages it. The information they share covers a broad range of
topics such as the number of accidents or amount of drug abuse in a certain region and its demographics.
3. Perform data quality checks to avoid later problem
• Allocate some time for data correction and data cleaning. Collecting suitable, error-free data is key to the success of a data science project.
• Most of the errors encountered during the data gathering phase are easy to spot, but being too careless will make data scientists spend many hours solving data issues that could have been prevented during data import.
• Data scientists must investigate the data during the import, data preparation and exploratory phases. The
difference is in the goal and the depth of the investigation.
• In the data retrieval process, verify whether the data is of the right data type and is the same as in the source document.
• In the data preparation process, more elaborate checks are performed; check whether any shortcuts were used, for example in time and date formats.
• During the exploratory phase, Data scientists focus shifts to what he/she can learn from the data. Now
Data scientists assume the data to be clean and look at the statistical properties such as distributions,
correlations and outliers.

Data Preparation
• Data preparation means cleansing, integrating and transforming data.

Data Cleaning
• Data is cleansed through processes such as filling in missing values, smoothing the noisy data or
resolving the inconsistencies in the data.
• Data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
• Data cleaning is the first step in data pre-processing; it is used to fill in missing values, smooth noisy data, recognize outliers and correct inconsistencies.
• Missing values: Such dirty data will affect the mining procedure and lead to unreliable and poor output; therefore data cleaning routines are important. For example, suppose that the average salary of staff is Rs. 65000/-; this value can be used to replace missing salary values.
• Data entry errors: Data collection and data entry are error-prone processes. They often require human
intervention and because humans are only human, they make typos or lose their concentration for a second
and introduce an error into the chain. But data collected by machines or computers isn't free from errors
either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure.
Examples of errors originating from machines are transmission errors or bugs in the extract, transform and
load phase (ETL).
• Whitespace error: Whitespaces tend to be hard to detect but cause errors like other redundant characters
would. To remove the spaces present at start and end of the string, we can use strip() function on the string
in Python.
• Fixing capital letter mismatches: Capital letter mismatches are common problem. Most programming
languages make a distinction between "Chennai" and "chennai".
• Python provides string conversion functions such as lower() and upper(). The lower() function converts the input string to lowercase and the upper() function converts it to uppercase.
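A small sketch combining the strip(), lower() and upper() operations just described; the dirty city values below are made up for illustration:

```python
# Hypothetical dirty values for a "city" field.
raw_values = ["  Chennai ", "chennai", "CHENNAI  "]

# strip() removes leading/trailing whitespace; lower() normalizes case,
# so "Chennai" and "chennai" are treated as the same value.
cleaned = [value.strip().lower() for value in raw_values]
print(cleaned)            # ['chennai', 'chennai', 'chennai']

# upper() converts a string to uppercase.
print("chennai".upper())  # CHENNAI
```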

Outlier
• Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data.
The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
• Fig. 1.6.1 shows outliers detection. Here O1 and O2 seem outliers from the rest.

• An outlier may be defined as a piece of data or observation that deviates drastically from the given norm
or average of the data set. An outlier may be caused simply by chance, but it may also indicate
measurement error or that the given data set has a heavy-tailed distribution.
• Outlier analysis and detection have various applications in numerous fields such as fraud detection, credit card fraud, computer intrusion and criminal behaviour detection, medical and public health outlier detection and industrial damage detection.
• The general idea of these applications is to find data which deviates from the normal behaviour of the data set.

Dealing with Missing Value


• Such dirty data will affect the mining procedure and lead to unreliable and poor output; therefore data cleaning routines are important.

How to handle missing data in data mining?


• The following methods are used for handling missing data:
1. Ignore the tuple: Usually done when the class label is missing. This method is not good unless the tuple
contains several attributes with missing values.
2. Fill in the missing value manually : It is time-consuming and not suitable for a large data set with many
missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average salary of staff is Rs 65000/-. Use this value to replace the missing value for salary (see the sketch after this list).
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value.
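A minimal pandas sketch of method 4 above; the column name and salary figures are illustrative only:

```python
import pandas as pd

# Hypothetical staff salaries with one missing value.
df = pd.DataFrame({"salary": [60000, 70000, None, 65000]})

# Replace the missing salary with the column mean (Rs. 65000 here).
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df)
```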

Correct Errors as Early as Possible


• If errors are not corrected in the early stages of a project, they create problems in later stages, and much time is then spent finding and correcting them. Retrieving data is a difficult task and organizations spend millions of dollars on it in the hope of making better decisions. The data collection process is error-prone and in a big organization it involves many steps and teams.
• Data should be cleansed when acquired for many reasons:
a) Not everyone spots the data anomalies. Decision-makers may make costly mistakes on information based
on incorrect data from applications that fail to correct for the faulty data.
b) If errors are not corrected early on in the process, the cleansing will have to be done for every project that
uses that data.
c) Data errors may point to a business process that isn't working as designed.
d) Data errors may point to defective equipment, such as broken transmission lines and defective sensors.
e) Data errors can point to bugs in software or in the integration of software that may be critical to the company.

Combining Data from Different Data Sources


1. Joining table
• Joining tables allows user to combine the information of one observation found in one table with the
information that we find in another table. The focus is on enriching a single observation.
• A primary key is a value that cannot be duplicated within a table. This means that one value can only be
seen once within the primary key column. That same key can exist as a foreign key in another table which
creates the relationship. A foreign key can have duplicate instances within a table.
• Fig. 1.6.2 shows Joining two tables on the CountryID and CountryName keys.
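A hedged sketch of such a join using pandas; the table contents and the CountryID key below are made-up examples mirroring the idea of Fig. 1.6.2:

```python
import pandas as pd

# Hypothetical tables sharing the CountryID key.
countries = pd.DataFrame({"CountryID": [1, 2],
                          "CountryName": ["India", "Japan"]})
sales = pd.DataFrame({"CountryID": [1, 1, 2],
                      "Amount": [100, 250, 80]})

# The join enriches each sales observation with its country name.
enriched = sales.merge(countries, on="CountryID", how="left")
print(enriched)
```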
2. Appending tables
• Appending tables is also called stacking tables. It effectively adds observations from one table to another table. Fig. 1.6.3 shows the appending of tables.

• Table 1 contains an x3 value of 3 and Table 2 contains an x3 value of 33. The result of appending these tables is
a larger one with the observations from Table 1 as well as Table 2. The equivalent operation in set theory
would be the union and this is also the command in SQL, the common language of relational databases.
Other set operators are also used in data science, such as set difference and intersection.
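A minimal sketch of appending (stacking) two tables with pandas; the column names and values are made up to mirror the Table 1/Table 2 example:

```python
import pandas as pd

# Two hypothetical tables with the same columns.
table1 = pd.DataFrame({"x1": [1], "x2": [2], "x3": [3]})
table2 = pd.DataFrame({"x1": [11], "x2": [22], "x3": [33]})

# Appending stacks the observations of table2 under table1,
# the equivalent of a UNION in SQL.
combined = pd.concat([table1, table2], ignore_index=True)
print(combined)
```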
3. Using views to simulate data joins and appends
• Duplication of data can be avoided by using a view instead of a physical append, because an appended table requires more storage space. If the table size is in terabytes of data, it becomes problematic to duplicate the data. For this reason, the concept of a view was invented.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually into a yearly sales
table instead of duplicating the data.

Transforming Data
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Relationships between an input variable and an output variable aren't always linear.
• Reducing the number of variables: Having too many variables in the model makes the model difficult to handle and certain techniques don't perform well when the user overloads them with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data scientists use
special methods to reduce the number of variables but retain the maximum amount of data.
Euclidean distance :
• Euclidean distance is used to measure the similarity between observations. It is calculated as the square
root of the sum of differences between each point.
Euclidean distance = √((X1 - X2)² + (Y1 - Y2)²)
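For example, the distance between two observations can be computed directly from this formula; the points below are arbitrary values chosen for illustration:

```python
import numpy as np

# Two hypothetical observations (x, y).
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Square root of the sum of squared differences.
distance = np.sqrt(np.sum((p1 - p2) ** 2))
print(distance)   # 5.0
```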
Turning variables into dummies:
• Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0). They're used to indicate the presence or absence of a categorical effect that may explain the observation.
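One common way to create such dummy variables is pandas' get_dummies(); the weekday column below is a made-up example:

```python
import pandas as pd

# Hypothetical categorical variable.
df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon"]})

# Each category becomes a separate true(1)/false(0) column.
dummies = pd.get_dummies(df["weekday"], prefix="weekday")
print(dummies)
```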
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple
summary statistics and graphic visualizations in order to gain a deeper understanding of data.
• EDA is used by data scientists to analyze and investigate data sets and summarize their main
characteristics, often employing data visualization methods. It helps determine how best to manipulate data
sources to get the answers user need, making it easier for data scientists to discover patterns, spot anomalies,
test a hypothesis or check assumptions.
• EDA is an approach/philosophy for data analysis that employs a variety of techniques to:
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
• With EDA, the following functions are performed:
1. Describe the user's data
2. Closely explore data distributions
3. Understand the relations between variables
4. Notice unusual or unexpected situations
5. Place the data into groups
6. Notice unexpected patterns within groups
7. Take note of group differences
• Box plots are an excellent tool for conveying location and variation information in data sets, particularly
for detecting and illustrating location and variation changes between different groups of data.
• Exploratory data analysis is majorly performed using the following methods:
1. Univariate analysis: Provides summary statistics for each field in the raw data set, i.e., a summary of one variable at a time. Ex: CDF, PDF, box plot.
2. Bivariate analysis: Performed to find the relationship between each variable in the dataset and the target variable of interest, i.e., using two variables and finding the relationship between them. Ex: box plot, violin plot.
3. Multivariate analysis: Performed to understand interactions between different fields in the dataset, i.e., finding interactions between more than two variables.
• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.

1. Minimum score: The lowest score, excluding outliers.
2. Lower quartile: 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts.
4. Upper quartile: 75% of the scores fall below the upper quartile value.
5. Maximum score: The highest score, excluding outliers.
6. Whiskers: The upper and lower whiskers represent scores outside the middle 50%.
7. The interquartile range: This is the box of the box plot, showing the middle 50% of scores.
• Boxplots are also extremely useful for visually checking group differences. Suppose we have four groups of scores and we want to compare them by teaching method. Teaching method is our categorical grouping variable and score is the continuous outcome variable that the researchers measured.
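A minimal Matplotlib sketch of such a group comparison; the four groups of scores below are randomly generated for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores for four teaching methods.
rng = np.random.default_rng(0)
scores = [rng.normal(loc=60 + 5 * i, scale=10, size=30) for i in range(4)]

# One box per group: the box shows the interquartile range,
# the line inside is the median, whiskers show the remaining spread.
plt.boxplot(scores)
plt.xticks([1, 2, 3, 4], ["Method A", "Method B", "Method C", "Method D"])
plt.ylabel("Score")
plt.title("Scores by teaching method")
plt.show()
```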
Build the Models
• To build the model, the data should be clean and its content properly understood. The components of
model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostic and model comparison
• Building a model is an iterative process. Most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Model and Variable Selection
• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
1. Must the model be moved to a production environment and, if so, would it be easy to implement?
2. How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
3. Does the model need to be easy to explain?

Model Execution
• Various programming languages can be used for implementing the model. For model execution, Python
provides libraries like StatsModels or Scikit-learn. These packages use several of the most popular
techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the
process. Following are the remarks on output:
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that
the influence is there.
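A hedged StatsModels sketch showing the three outputs listed above (model fit, coefficients and predictor significance); the data is randomly generated for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends linearly on x plus noise.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(size=100)

# Add an intercept term and fit an ordinary least squares model.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

print(model.rsquared)   # model fit (R-squared)
print(model.params)     # coefficients (intercept and slope)
print(model.pvalues)    # predictor significance (p-values)
```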
• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best-known methods.
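A minimal scikit-learn sketch of k-nearest neighbors classification on the Iris data set (one of the example data sets suggested in the lab exercises):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a standard benchmark data set and split it for training/testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Classify each test flower by the majority class of its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set
```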
• Following commercial tools are used :
1. SAS enterprise miner: This tool allows users to run predictive and descriptive models based on large
volumes of data from across the enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through a GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data
exploration.
4. Alpine miner: This tool provides a GUI front end for users to develop analytic workflows and interact
with Big Data tools and platforms on the back end.
• Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational modeling, has some of the
functionality of Matlab.
3. WEKA: It is a free data mining software package with an analytic workbench. The functions created in
WEKA can be executed within Java code.
4. Python is a programming language that provides toolkits for machine learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.

Model Diagnostics and Model Comparison


• Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps the user pick the best-performing model.
• In Holdout Method, the data is split into two different datasets labeled as a training and a testing dataset.
This can be a 60/40 or 70/30 or 80/20 split. This technique is called the hold-out validation technique.
Suppose we have a database with house prices as the dependent variable and two independent variables
showing the square footage of the house and the number of rooms. Now, imagine this dataset has 30 rows.
The whole idea is that you build a model that can predict house prices accurately.
• To 'train' our model or see how well it performs, we randomly subset 20 of those rows and fit the model. The second step is to predict the values of the 10 rows that we excluded and measure how good our predictions were.
• As a rule of thumb, experts suggest to randomly sample 80% of the data into the training set and 20% into
the test set.
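A short sketch of this 80/20 holdout split with scikit-learn; the house-price data below is randomly generated to mirror the hypothetical example above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical data set: 30 houses with square footage and number of rooms.
rng = np.random.default_rng(7)
sqft = rng.uniform(500, 3000, size=30)
rooms = rng.integers(1, 6, size=30)
price = 50 * sqft + 10000 * rooms + rng.normal(0, 20000, size=30)

X = np.column_stack([sqft, rooms])

# Hold out 20% of the rows as a test set; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R-squared on the held-out rows
```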
• The holdout method has two basic drawbacks:
1. It requires an extra dataset.
2. It is a single train-and-test experiment; the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split.

Presenting Findings and Building Applications


• The team delivers final reports, briefings, code and technical documents.
• In addition, team may run a pilot project to implement the models in a production environment.
• The last stage of the data science process is where user soft skills will be most useful.
• This involves presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.

Data Mining
• Data mining refers to extracting or mining knowledge from large amounts of data. It is a process of
discovering interesting patterns or Knowledge from a large amount of data stored either in databases, data
warehouses or other information repositories.

Reasons for using data mining:


1. Knowledge discovery: To identify the invisible correlation, patterns in the database.
2. Data visualization: To find sensible way of displaying data.
3. Data correction: To identify and correct incomplete and inconsistent data.
Functions of Data Mining
• Different functions of data mining are characterization, association and correlation analysis,
classification, prediction, clustering analysis and evolution analysis.
1. Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be summarized, generating a profile of all first-year engineering students in the university.
2. Association is the discovery of association rules showing attribute-value conditions that occur frequently
together in a given set of data.
3. Classification differs from prediction. Classification constructs a set of models that describe and
distinguish data classes and prediction builds a model to predict some missing data values.
4. Clustering can also support taxonomy formation. The organization of observations into a hierarchy of
classes that group similar events together.
5. Data evolution analysis describes and models regularities or trends for objects whose behaviour changes over time. It may include characterization, discrimination, association, classification or clustering of time-related data.
Data mining tasks can be classified into two categories: descriptive and predictive.

Predictive Mining Tasks


• To make predictions, predictive mining tasks perform inference on the current data. Predictive analysis answers questions about the future, using historical data as the chief basis for decisions.
• It involves the supervised learning functions used for the prediction of the target value. The methods falling under this mining category are classification, time-series analysis and regression.
• Data modeling is the necessity of the predictive analysis, which works by utilizing some variables to
anticipate the unknown future data values for other variables.
• It provides organizations with actionable insights based on data. It provides an estimation regarding the
likelihood of a future outcome.
• To do this, a variety of techniques are used, such as machine learning, data mining, modeling and game
theory.
• Predictive modeling can, for example, help to identify any risks or opportunities in the future.
• Predictive analytics can be used in all departments, from predicting customer behaviour in sales and
marketing, to forecasting demand for operations or determining risk profiles for finance.
• A very well-known application of predictive analytics is credit scoring used by financial services to
determine the likelihood of customers making future credit payments on time. Determining such a risk
profile requires a vast amount of data, including public and social data.
• Historical and transactional data are used to identify patterns and statistical models and algorithms are
used to capture relationships in various datasets.
• Predictive analytics has taken off in the big data era and there are many tools available for organisations to
predict future outcomes.

Descriptive Mining Task


• Descriptive analytics is the conventional form of business intelligence and data analysis; it seeks to provide a depiction or "summary view" of facts and figures in an understandable format, to either inform or prepare data for further analysis.
• Two primary techniques are used for reporting past events : data aggregation and data mining.
• It presents past data in an easily digestible format for the benefit of a wide business audience.
• A set of techniques for reviewing and examining the data set to understand the data and analyze business
performance.
• Descriptive analytics helps organisations to understand what happened in the past. It helps to understand
the relationship between product and customers.
• The objective of this analysis is to understand what approach to take in the future. If we learn from past behaviour, it helps us to influence future outcomes.
• It also helps to describe and present data in such format, which can be easily understood by a wide variety
of business readers.
Architecture of a Typical Data Mining System
• Data mining refers to extracting or mining knowledge from large amounts of data. It is a process of
discovering interesting patterns or knowledge from a large amount of data stored either in databases, data
warehouses.
• It is the computational process of discovering patterns in huge data sets involving methods at the
intersection of AI, machine learning, statistics and database systems.
• Fig. 1.10.1 (See on next page) shows typical architecture of data mining system.
• Components of data mining system are data source, data warehouse server, data mining engine, pattern
evaluation module, graphical user interface and knowledge base.
• Database, data warehouse, WWW or other information repository: This is a set of databases, data warehouses, spreadsheets or other kinds of data repositories. Data cleaning and data integration techniques may be applied to the data.

• Data warehouse server: Based on the user's data request, the data warehouse server is responsible for fetching the relevant data.
• Knowledge base is helpful in the whole data mining process. It might be useful for guiding the search or
evaluating the interestingness of the result patterns. The knowledge base might even contain user beliefs and
data from user experiences that can be useful in the process of data mining.
• The data mining engine is the core component of any data mining system. It consists of a number of
modules for performing data mining tasks including association, classification, characterization, clustering,
prediction, time-series analysis etc.
• The pattern evaluation module is mainly responsible for the measure of interestingness of the pattern by
using a threshold value. It interacts with the data mining engine to focus the search towards interesting
patterns.
• The graphical user interface module communicates between the user and the data mining system. This
module helps the user use the system easily and efficiently without knowing the real complexity behind the
process.
• When the user specifies a query or a task, this module interacts with the data mining system and displays
the result in an easily understandable manner.

Classification of DM System
• Data mining system can be categorized according to various parameters. These are database technology,
machine learning, statistics, information science, visualization and other disciplines.
• Fig. 1.10.2 shows classification of DM system.

• Multi-dimensional view of data mining classification.

Data Warehousing
• Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries and decision making. Data warehousing involves data cleaning, data
integration and data consolidations.
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision-making process. A data warehouse stores historical data for purposes of
decision support.
• A database is an application-oriented collection of data that is organized, structured and coherent, with minimum and controlled redundancy, and which may be accessed by several users in due time.
• Data warehousing provides architectures and tools for business executives to systematically organize,
understand and use their data to make strategic decisions.
• A data warehouse is a subject-oriented collection of data that is integrated, time-variant, non-volatile,
which may be used to support the decision-making process.
• Data warehouses are databases that store and maintain analytical data separately from transaction-oriented
databases for the purpose of decision support. Data warehouses separate analysis workload from transaction
workload and enable an organization to consolidate data from several sources.
• Data organization in data warehouses is based on areas of interest, i.e. the major subjects of the organization (customers, products, activities etc.), whereas databases organize data based on the enterprise applications resulting from the organization's functions.
• The main objective of a data warehouse is to support the decision-making system, focusing on the
subjects of the organization. The objective of a database is to support the operational system and information
is organized on applications and processes.
• A data warehouse usually stores many months or years of data to support historical analysis. The data in a
data warehouse is typically loaded through an extraction, transformation and loading (ETL) process from
multiple data sources.
• Databases and data warehouses are related but not the same.
• A database is a way to record and access information from a single source. A database is often handling
real-time data to support day-to-day business processes like transaction processing.
• A data warehouse is a way to store historical information from multiple sources to allow you to analyse
and report on related data (e.g., your sales transaction data, mobile app data and CRM data). Unlike a
database, the information isn't updated in real-time and is better for data analysis of broader trends.
• Modern data warehouses are moving toward an Extract, Load, Transformation (ELT) architecture in
which all or most data transformation is performed on the database that hosts the data warehouse.
• Goals of data warehousing:
1. To help reporting as well as analysis.
2. Maintain the organization's historical information.
3. Be the foundation for decision making.
"How are organizations using the information from data warehouses ?"
• Most of the organizations makes use of this information for taking business decision like :
a) Increasing customer focus: It is possible by performing analysis of customer buying.
b) Repositioning products and managing product portfolios by comparing performance against last year's sales.
c) Analysing operations and looking for sources of profit.
d) Managing customer relationships, making environmental corrections and managing the cost of corporate
assets.

Characteristics of Data Warehouse


1. Subject-oriented: Data are organized based on how the users refer to them. A data warehouse can be used to analyse a particular subject area. For example, "sales" can be a particular subject.
2. Integrated: All inconsistencies regarding naming convention and value representations are removed. For
example, source A and source B may have different ways of identifying a product, but in a data warehouse,
there will be only a single way of identifying a product.
3. Non-volatile: Data are stored in read-only format and do not change over time. Typical activities such as
deletes, inserts and changes that are performed in an operational application environment are completely
non-existent in a DW environment.
4. Time variant : Data are not merely current values but normally form a time series. Historical information
is kept in a data warehouse. For example, one can retrieve data from 3, 6 or 12 months ago, or even older
data, from a data warehouse.
Key characteristics of a Data Warehouse
1. Data is structured for simplicity of access and high-speed query performance.
2. End users are time-sensitive and desire speed-of-thought response times.
3. Large amounts of historical data are used.
4. Queries often retrieve large amounts of data, perhaps many thousands of rows.
5. Both predefined and ad hoc queries are common.
6. The data load involves multiple sources and transformations.

Multitier Architecture of Data Warehouse


• Data warehouse architecture is a data storage framework's design of an organization. A data warehouse
architecture takes information from raw sets of data and stores it in a structured and easily digestible format.
• A data warehouse system is constructed in three ways. These approaches are classified by the number of
tiers in the architecture.
a) Single-tier architecture.
b) Two-tier architecture.
c) Three-tier architecture (Multi-tier architecture).
• Single-tier warehouse architecture focuses on creating a compact data set and minimizing the amount of
data stored. While it is useful for removing redundancies, it is not effective for organizations with large data
needs and multiple data streams.
• Two-tier warehouse structures physically separate the available resources from the warehouse itself. This
approach is most commonly used in small organizations where a server is used as a data mart. While it is
more effective at storing and sorting data, the two-tier architecture is not scalable and supports only a
minimal number of end-users.
Three tier (Multi-tier) architecture:
• Three tier architecture creates a more structured flow for data from raw sets to actionable insights. It is
the most widely used architecture for data warehouse systems.
• Fig. 1.11.1 shows the three-tier architecture. Three-tier architecture is sometimes called multi-tier architecture.
• The bottom tier is the database of the warehouse, where the cleansed and transformed data is loaded. The
bottom tier is a warehouse database server.

• The middle tier is the application layer giving an abstracted view of the database. It arranges the data to
make it more suitable for analysis. This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
• OLAP servers can interact with both relational databases and multidimensional databases, which lets them
collect data better based on broader parameters.
• The top tier is the front-end of an organization's overall business intelligence suite. The top-tier is where
the user accesses and interacts with data via queries, data visualizations and data analytics tools.
• The top tier represents the front-end client layer. The client level which includes the tools and Application
Programming Interface (API) used for high-level data analysis, inquiring and reporting. User can use
reporting tools, query, analysis or data mining tools.
Needs of Data Warehouse
1) Business user: Business users require a data warehouse to view summarized data from the past. Since
these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: Data warehouse is required to store the time variable data from the past. This input
is made to be used for various purposes.
3) Make strategic decisions: Some strategies may be depending upon the data in the data warehouse. So,
data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place, the user
can effectively enforce uniformity and consistency in the data.
5) High response time: A data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and a quick response time.

Benefits of Data Warehouse


a) Understand business trends and make better forecasting decisions.
b) Data warehouses are designed to perform well with enormous amounts of data.
c) The structure of data warehouses is more accessible for end-users to navigate, understand and query.
d) Queries that would be complex in many normalized databases could be easier to build and maintain in
data warehouses.
e) Data warehousing is an efficient method to manage demand for lots of information from lots of users.
f) Data warehousing provides the capability to analyze a large amount of historical data.
Difference between ODS and Data Warehouse

Metadata
• Metadata is simply defined as data about data. The data that is used to represent other data is known as
metadata. In data warehousing, metadata is one of the essential aspects.
• We can define metadata as follows:
a) Metadata is the road-map to a data warehouse.
b) Metadata in a data warehouse defines the warehouse objects.
c) Metadata acts as a directory. This directory helps the decision support system to locate the contents of a
data warehouse.
• In a data warehouse, we create metadata for the data names and definitions of a given data warehouse.
Along with this metadata, additional metadata is also created for time-stamping any extracted data and for
recording the source of the extracted data.
Why is metadata necessary in a data warehouse ?
a) First, it acts as the glue that links all parts of the data warehouses.
b) Next, it provides information about the contents and structures to the developers.
c) Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.
• Fig. 1.11.2 shows warehouse metadata.

Basic Statistical Descriptions of Data


• For data preprocessing to be successful, it is essential to have an overall picture of our data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values should
be treated as noise or outliers.
• Basic statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
• For data preprocessing tasks, we want to learn about data characteristics regarding both central tendency
and dispersion of the data.
• Measures of central tendency include mean, median, mode and midrange.
• Measures of data dispersion include quartiles, interquartile range (IQR) and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.

Measuring the Central Tendency


• We look at various ways to measure the central tendency of data, include: Mean, Weighted mean,
Trimmed mean, Median, Mode and Midrange.
1. Mean :
• The mean of a data set is the average of all the data values. The sample mean x is the point estimator of
the population mean μ.
• Sample mean: x̄ = (sum of the values of the n observations) / (number of observations in the sample)
• Population mean: μ = (sum of the values of the N observations) / (number of observations in the population)
2. Median :
• The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data. A
few extremely large incomes of property values can inflate the mean.
• For an odd number of observations:
7 observations= 26, 18, 27, 12, 14, 29, 19.
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29
• The median is the middle value.
Median=19
• For an even number of observations :
8 observations = 26, 18, 29, 12, 14, 27, 30, 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30
The median is the average of the middle two values.
Median = (19 + 26) / 2 = 22.5
3. Mode:
• The mode of a data set is the value that occurs with greatest frequency. The greatest frequency can occur
at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have
more than two modes, the data are multimodal.
• Weighted mean: Sometimes, each value in a set may be associated with a weight, the weights reflect the
significance, importance or occurrence frequency attached to their respective values.
• Trimmed mean: A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a
small number of extreme values can corrupt the mean. The trimmed mean is the mean obtained after cutting
off values at the high and low extremes.
• For example, we can sort the values and remove the top and bottom 2 % before computing the mean. We
should avoid trimming too large a portion (such as 20 %) at both ends as this can result in the loss of
valuable information.
• Holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be
computed by partitioning the given data into subsets and merging the values obtained for the measure in
each subset.
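As a quick illustration, the central tendency measures above can be computed with NumPy, pandas and SciPy on the seven-observation sample used in the median example (the weights for the weighted mean are made up for illustration):

import numpy as np
import pandas as pd
from scipy import stats

x = np.array([26, 18, 27, 12, 14, 29, 19])            # the 7-observation example above

print(np.mean(x))                                      # mean
print(np.median(x))                                    # median -> 19.0
print(list(pd.Series(x).mode()))                       # mode(s); every value occurs once here, so all are returned
print(np.average(x, weights=[1, 1, 1, 2, 1, 1, 1]))    # weighted mean with example weights
print(stats.trim_mean(x, 0.2))                         # trimmed mean, cutting 20% of values at each end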

Measuring the Dispersion of Data


• An outlier is an observation that lies an abnormal distance from other values in a random sample from a
population.
• First quartile (Q1): The first quartile is the value, where 25% of the values are smaller than Q1 and 75%
are larger.
• Third quartile (Q3): The third quartile is the value, where 75% of the values are smaller than Q3 and
25% are larger.
• The box plot is a useful graphical display for describing the behavior of the data in the middle as well as
at the ends of the distributions. The box plot uses the median and the lower and upper quartiles. If the lower
quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or
IQR.
• Range: Difference between highest and lowest observed values
Variance :
• The variance is a measure of variability that utilizes all the data. It is based on the difference between the
value of each observation (xi) and the mean (x̄ for a sample, μ for a population).
• The variance is the average of the squared differences between each data value and the mean.
Standard Deviation :
• The standard deviation of a data set is the positive square root of the variance. It is measured in the same
units as the data, making it more easily interpreted than the variance.
• The standard deviation is computed as the square root of the variance: s = √( Σ(xi − x̄)² / (n − 1) ) for a
sample and σ = √( Σ(xi − μ)² / N ) for a population.

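A small NumPy sketch of the dispersion measures above, using the eight-observation sample from the median example:

import numpy as np

x = np.array([12, 14, 18, 19, 26, 27, 29, 30])

q1, q3 = np.percentile(x, [25, 75])
print(q1, q3, q3 - q1)            # first quartile, third quartile and interquartile range (IQR)
print(x.max() - x.min())          # range: highest minus lowest observed value
print(np.var(x, ddof=1))          # sample variance (ddof=1 divides by n - 1)
print(np.std(x, ddof=1))          # sample standard deviation (square root of the variance)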
Difference between Standard Deviation and Variance

Graphic Displays of Basic Statistical Descriptions


• There are many types of graphs for the display of data summaries and distributions, such as Bar charts,
Pie charts, Line graphs, Boxplot, Histograms, Quantile plots and Scatter plots.
1. Scatter diagram
• Also called scatter plot, X-Y graph.
• While working with statistical data it is often observed that there are connections between sets of data.
For example, the mass and height of persons are related: the taller the person, the greater his/her mass.
• To find out whether or not two sets of data are connected, scatter diagrams can be used. A scatter diagram
can, for example, show the relationship between children's age and height.
• A scatter diagram is a tool for analyzing relationship between two variables. One variable is plotted on the
horizontal axis and the other is plotted on the vertical axis.
• The pattern of their intersecting points can graphically show relationship patterns. Commonly a scatter
diagram is used to prove or disprove cause-and-effect relationships.
• While a scatter diagram shows relationships, it does not by itself prove that one variable causes the other.
In addition to showing possible cause-and-effect relationships, a scatter diagram can show that two variables
stem from a common cause that is unknown, or that one variable can be used as a surrogate for the other.
2. Histogram
• A histogram is used to summarize discrete or continuous data. In a histogram, the data are grouped into
ranges (e.g. 10-19, 20-29) and then plotted as connected bars. Each bar represents a range of data.
• To construct a histogram from a continuous variable you first need to split the data into intervals, called
bins. Each bin contains the number of occurrences of scores in the data set that are contained within that bin.
• The width of each bar is proportional to the width of each category and the height is proportional to the
frequency or percentage of that category.
3. Line graphs
• It is also called a stick graph. It shows relationships between variables.
• Line graphs are usually used to show time series data that is how one or more variables vary over a
continuous period of time. They can also be used to compare two different variables over time.
• Typical examples of the types of data that can be presented using line graphs are monthly rainfall and
annual unemployment rates.
• Line graphs are particularly useful for identifying patterns and trends in the data such as seasonal effects,
large changes and turning points. Fig. 1.12.1 shows a line graph.

• As well as time series data, line graphs can also be appropriate for displaying data that are measured over
other continuous variables such as distance.
• For example, a line graph could be used to show how pollution levels vary with increasing distance from a
source or how the level of a chemical varies with depth of soil.
• In a line graph the x-axis represents the continuous variable (for example, year or distance from the initial
measurement) whilst the y-axis has a scale and indicates the measurement.
• Several data series can be plotted on the same line chart and this is particularly useful for analysing and
comparing the trends in different datasets.
• Line graph is often used to visualize rate of change of a quantity. It is more useful when the given data has
peaks and valleys. Line graphs are very simple to draw and quite convenient to interpret.
4. Pie charts
• A type of graph in which a circle is divided into sectors that each represent a proportion of the whole. Each
sector shows the relative size of each value.
• A pie chart displays data, information and statistics in an easy to read "pie slice" format with varying slice
sizes telling how much of one data element exists.
• Pie chart is also known as a circle graph. The bigger the slice, the more of that particular data was gathered.
The main use of a pie chart is to show comparisons. Fig. 1.12.2 shows a pie chart.
• Various applications of pie charts can be found in business, school and at home. For business pie charts
can be used to show the success or failure of certain products or services.
• At school, pie chart applications include showing how much time is allotted to each subject. At home, pie
charts can be useful for seeing how monthly income is spent on different needs.
• Reading a pie chart is as easy as figuring out which slice of an actual pie is the biggest.
Limitations of pie charts :
• It is difficult to tell the difference between estimates of similar size. Error bars or confidence limits cannot
be shown on a pie graph. Legends and labels on pie graphs are hard to align and read.
• The human visual system is more efficient at perceiving and discriminating between lines and line lengths
rather than two-dimensional areas and angles.
• Pie graphs simply don't work when comparing data.
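The chart types described above can be produced with Matplotlib; the snippet below is a sketch on synthetic data (the age/height numbers and the pie proportions are invented for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.uniform(5, 15, 50)                             # synthetic ages
height = 80 + 6 * age + rng.normal(0, 5, 50)             # synthetic heights related to age

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(age, height)                          # scatter diagram: relationship between two variables
axes[0, 0].set_title("Scatter: age vs height")
axes[0, 1].hist(height, bins=10)                         # histogram: data grouped into 10 bins
axes[0, 1].set_title("Histogram")
axes[1, 0].plot(range(1, 13), rng.uniform(20, 120, 12))  # line graph: e.g. monthly rainfall over a year
axes[1, 0].set_title("Line graph")
axes[1, 1].pie([40, 30, 20, 10], labels=["A", "B", "C", "D"])  # pie chart: proportions of a whole
axes[1, 1].set_title("Pie chart")
plt.tight_layout()
plt.show()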

Two Marks Questions with Answers


Q.1 What is data science?
Ans. : Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms
of data.
• At its core, data science aims to discover and extract actionable knowledge from data that can be used to
make sound business decisions and predictions.
• Data science uses advanced analytical theory and various methods such as time series analysis for
predicting future.
Q.2 Define structured data.
Ans. Structured data is arranged in rows and column format. It helps for application to retrieve and process
data easily. Database management system is used for storing structured data. The term structured data refers
to data that is identifiable because it is organized in a structure.
Q.3 What is a data set?
Ans. Data set is collection of related records or information. The information may be on some entity or
some subject area.
Q.4 What is unstructured data ?
Ans. Unstructured data is data that does not follow a specified format. Row and columns are not used for
unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no
identifiable structure.
Q.5 What is machine - generated data ?
Ans. Machine-generated data is an information that is created without human interaction as a result of a
computer process or application activity. This means that data entered manually by an end-user is not
recognized to be machine-generated.
Q.6 Define streaming data.
Ans. : Streaming data is data that is generated continuously by thousands of data sources, which typically
send in the data records simultaneously and in small sizes (order of Kilobytes).
Q.7 List the stages of data science process.
Ans.: Stages of data science process are as follows:
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation

Q.8 What are the advantages of data repositories?


Ans.: Advantages are as follows:
i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data reporting.
iii. Database administrators have easier time tracking problems.
iv. There is value to storing and analyzing data.

Q.9 What is data cleaning?


Ans. Data cleaning means removing the inconsistent data or noise and collecting necessary information of a
collection of interrelated data.
Q.10 What is outlier detection?
Ans. : Outlier detection is the process of detecting and subsequently excluding outliers from a given set of
data. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
Q.11 Explain exploratory data analysis.
Ans. : Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple
summary statistics and graphic visualizations in order to gain a deeper understanding of data. EDA is used
by data scientists to analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods.
Q.12 Define data mining.
Ans. : Data mining refers to extracting or mining knowledge from large amounts of data. It is a process of
discovering interesting patterns or Knowledge from a large amount of data stored either in databases, data
warehouses, or other information repositories.
Q.13 What are the three challenges to data mining regarding data mining
methodology?
Ans. Challenges to data mining regarding data mining methodology include the following:
1. Mining different kinds of knowledge in databases,
2. Interactive mining of knowledge at multiple levels of abstraction,
3. Incorporation of background knowledge.

Q.14 What is predictive mining?


Ans. : Predictive mining tasks perform inference on the current data in order to make predictions. Predictive
analysis answers queries about the future by using historical data as the chief basis for decisions.
Q.15 What is data cleaning?
Ans. Data cleaning means removing the inconsistent data or noise and collecting necessary information of a
collection of interrelated data.
Q.16 List the five primitives for specifying a data mining task.
Ans. :
1. The set of task-relevant data to be mined
2. The kind of knowledge to be mined
3. The background knowledge to be used in the discovery process
4. The interestingness measures and thresholds for pattern evaluation
5. The expected representation for visualizing the discovered pattern.

Q.17 List the stages of data science process.


Ans. Data science process consists of six stages:
1. Discovery or Setting the research goal 2. Retrieving data 3. Data preparation
4. Data exploration 5. Data modeling 6. Presentation and automation

Q.18 What is data repository?


Ans. Data repository is also known as a data library or data archive. This is a general term to refer to a data
set isolated to be mined for data reporting and analysis. The data repository is a large database
infrastructure, several databases that collect, manage and store data sets for data analysis, sharing and
reporting.
Q.19 List the data cleaning tasks.
Ans. : Data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data

Q.20 What is Euclidean distance ?


Ans. : Euclidean distance is used to measure the similarity between observations. It is calculated as the square
root of the sum of squared differences between the corresponding coordinates of the two points.
UNIT III MACHINE LEARNING

The modeling process in machine learning

The modeling process in machine learning typically involves several key steps, including data
preprocessing, model selection, training, evaluation, and deployment. Here's an overview of the general
modeling process:
1. Data Collection: Obtain a dataset that contains relevant information for the problem you want to solve.
This dataset should be representative of the real-world scenario you are interested in.

2. Data Preprocessing: Clean the dataset by handling missing values, encoding categorical variables,
and scaling numerical features. This step ensures that the data is in a suitable format for modeling.

3. Feature Selection/Engineering: Select relevant features (columns) from the dataset or create new features
based on domain knowledge. This step helps improve the performance of the model by focusing on the
most important information.

4. Splitting the Data: Split the dataset into training, validation, and test sets. The training set is used to
train the model, the validation set is used to tune hyperparameters, and the test set is used to evaluate the
final model.

5. Model Selection: Choose the appropriate machine learning model(s) for your problem. This decision
is based on factors such as the type of problem (classification, regression, clustering, etc.), the size of
the dataset, and the nature of the data.

6. Training the Model: Train the selected model(s) on the training data. During training, the model
learns patterns and relationships in the data that will allow it to make predictions on new, unseen data.

7. Hyperparameter Tuning: Use the validation set to tune the hyperparameters of the model.
Hyperparameters are parameters that control the learning process of the model (e.g., learning
rate, regularization strength) and can have a significant impact on performance.

8. Model Evaluation: Evaluate the model(s) using the test set. This step involves measuring
performance metrics such as accuracy, precision, recall, F1 score, or mean squared error, depending on
the type of problem.

9. Model Deployment: Once you are satisfied with the performance of the model, deploy it to a production
environment where it can make predictions on new data. This step may involve packaging the model into
a software application or integrating it into an existing system.

10. Monitoring and Maintenance: Continuously monitor the performance of the deployed model and update
it as needed to ensure that it remains accurate and reliable over time.

This is a high-level overview of the modeling process in machine learning. The specific details of each step
may vary depending on the problem you are working on and the tools and techniques you are using.
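The steps above can be compressed into a short scikit-learn sketch; the Iris dataset, the scaler and the random forest are illustrative choices, not the only way to realize each step:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # data collection (a benchmark dataset)

X_train, X_test, y_train, y_test = train_test_split(     # splitting the data
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([                                         # preprocessing + model selection
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

grid = GridSearchCV(pipe,                                 # training + hyperparameter tuning (5-fold CV)
                    {"model__n_estimators": [50, 100]},
                    cv=5)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)                             # evaluation on the held-out test set
print("Test accuracy:", accuracy_score(y_test, y_pred))

Deployment and monitoring would then wrap the fitted best estimator in an application or service that serves predictions on new data.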
Types of machine learning

Machine learning can be broadly categorized into three main types based on the
nature of the learning process and the availability of labeled data:
1. Supervised Learning: In supervised learning, the model is trained on a labeled dataset, where each
example is paired with a corresponding label or output. The goal of the model is to learn a mapping from
inputs to outputs so that it can predict the correct output for new, unseen inputs. Examples of supervised
learning algorithms include linear regression, logistic regression, decision trees, random forests, support
vector machines (SVM), and neural networks.
2. Unsupervised Learning: In unsupervised learning, the model is trained on an unlabeled dataset, and the
goal is to find hidden patterns or structures in the data. The model learns to group similar data points
together and identify underlying relationships without explicit guidance. Clustering and dimensionality
reduction are common tasks in unsupervised learning. Examples of unsupervised learning algorithms
include K-means clustering, hierarchical clustering, principal component analysis (PCA), and t-distributed
stochastic neighbor embedding (t-SNE).
3. Reinforcement Learning: Reinforcement learning is a type of machine learning where an agent learns to
make decisions by interacting with an environment. The agent receives feedback in the form of rewards or
penalties based on its actions, and the goal is to learn a policy that maximizes the cumulative reward over
time. Reinforcement learning is commonly used in applications such as game playing, robotics, and
autonomous driving. Examples of reinforcement learning algorithms include Q-learning, deep Q-networks
(DQN), and policy gradient methods.

These are the main types of machine learning, but there are also other subfields and
specialized approaches, such as semi-supervised learning, where the model is trained
on a combination of labeled and unlabeled data, and transfer learning, where
knowledge gained from one task is applied to another related task.

Supervised learning

Supervised learning is a type of machine learning where the model is trained on a labeled dataset, meaning
that each example in the dataset is paired with a
corresponding label or output. The goal of supervised learning is to learn a mapping
from inputs to outputs so that the model can predict the correct output for new,
unseen inputs.

Supervised learning can be further divided into two main categories:


1. Classification: In classification tasks, the goal is to predict a categorical label or
class for each input. Examples of classification tasks include spam detection
(classifying emails as spam or not spam), image classification (classifying images
into different categories), and sentiment analysis (classifying text as positive,
negative, or neutral).

2. Regression: In regression tasks, the goal is to predict a continuous value for each
input. Examples of regression tasks include predicting house prices based on
features such as size, location, and number of bedrooms, predicting stock prices
based on historical data, and predicting the amount of rainfall based on weather
patterns.
Supervised learning algorithms learn from the labeled data by finding patterns and
relationships that allow them to make accurate predictions on new, unseen data.
Some common supervised learning algorithms include:

 Linear Regression: Used for regression tasks where the relationship between the input features and the
output is assumed to be linear.
 Logistic Regression: Used for binary classification tasks where the output is a
binary label
(e.g., spam or not spam).
 Decision Trees: Used for both classification and regression tasks,
decision trees make decisions based on the values of input features.
 Random Forests: An ensemble method that uses multiple decision
trees to improve performance and reduce overfitting.
 Support Vector Machines (SVM): Used for both classification and
regression tasks, SVMs find a hyperplane that separates different classes or
fits the data with the largest margin.
 Neural Networks: A versatile class of models inspired by the structure of the
human brain, neural networks can be used for a wide range of tasks including
classification, regression, and even reinforcement learning.

Overall, supervised learning is a powerful and widely used approach in machine learning, with applications
in areas such as healthcare, finance, marketing, and more.
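As a small hedged example of supervised classification, the sketch below trains a logistic regression model on the labeled breast-cancer benchmark dataset (an illustrative choice of data and algorithm):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)                 # labeled data: features X, class labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)                                  # learn the mapping from inputs to labels

print(classification_report(y_test, clf.predict(X_test)))  # precision, recall and F1 per class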

Unsupervised learning in machine learning

Unsupervised learning is a type of machine learning where the model is trained on an unlabeled dataset,
meaning that the data does not have any corresponding output labels. The goal of unsupervised learning is to
find hidden patterns or structures in the data.

Unlike supervised learning, where the model learns from labeled examples to predict outputs for new
inputs, unsupervised learning focuses on discovering the underlying structure of the data without any
guidance on what the output should be. This makes unsupervised learning particularly useful for exploratory
data analysis and understanding the relationships between data points.

There are several key tasks in unsupervised learning:


1. Clustering: Clustering is the task of grouping similar data points together. The goal is to partition the data
into clusters such that data points within the same cluster are more similar to each other than to those in
other clusters. K-means clustering and hierarchical clustering are common clustering algorithms.

2. Dimensionality Reduction: Dimensionality reduction techniques are used to reduce the number of
features in the dataset while preserving as much information as possible. This can help in visualizing high-
dimensional data and reducing the computational complexity of models. Principal Component Analysis
(PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are popular dimensionality reduction
techniques.

3. Anomaly Detection: Anomaly detection, also known as outlier detection, is the task of identifying data
points that deviate from the norm in a dataset. Anomalies may indicate errors in the data, fraudulent
behavior, or other unusual patterns. One-class SVM and Isolation Forest are common anomaly detection
algorithms.

4. Association Rule Learning: Association rule learning is the task of discovering interesting relationships
between variables in large datasets. It is often used in market basket analysis to identify patterns in
consumer behavior. Apriori and FP-growth are popular association rule learning algorithms.

Unsupervised learning is widely used in various fields such as data mining, pattern recognition, and
bioinformatics. It can help in gaining insights from data that may not be immediately apparent and can be a
valuable tool in exploratory data analysis and knowledge discovery.
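A brief unsupervised-learning sketch, combining dimensionality reduction (PCA) and clustering (K-means); the Iris features are used here only as convenient unlabeled data:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)                     # treat the data as unlabeled
X_scaled = StandardScaler().fit_transform(X)

X_2d = PCA(n_components=2).fit_transform(X_scaled)    # reduce 4 features to 2 components

km = KMeans(n_clusters=3, n_init=10, random_state=0)  # group similar points into 3 clusters
labels = km.fit_predict(X_2d)
print(labels[:10])                                    # cluster assignments of the first 10 points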

Semi-supervised learning in Machine Learning

Semi-supervised learning is a type of machine learning that falls between supervised learning and
unsupervised learning. In semi-supervised learning, the model is trained on a dataset that contains both
labeled and unlabeled examples. The goal of semi-supervised learning is to leverage the unlabeled data to
improve the performance of
the model on the task at hand.

The main idea behind semi-supervised learning is that labeled data is often expensive
or time-consuming to obtain, while unlabeled data is often abundant and easy to
acquire. By using both labeled and unlabeled data, semi-supervised learning
algorithms aim to make better use of the available data and improve the
performance of the model.

There are several approaches to semi-supervised learning, including:


1. Self-training: In self-training, the model is initially trained on the labeled data. Then,
it uses this model to predict labels for the unlabeled data. The predictions with high
confidence are added to the labeled dataset, and the model is retrained on the
expanded dataset. This process iterates until convergence.

2.Co-training: In co-training, the model is trained on multiple views of the data, each
of which contains a different subset of features. The model is trained on the
labeled data from each view
and then used to predict labels for the unlabeled data in each view. The predictions
from each view are then combined to make a final prediction.

3. Semi-supervised Generative Adversarial Networks (GANs): GANs can be used for semi-supervised
learning by training a generator to produce realistic data
samples and a discriminator to distinguish between real and generated samples. The
generator is trained using both labeled and unlabeled data, while the discriminator is
trained using only labeled data.

Semi-supervised learning is particularly useful in scenarios where labeled data is
scarce but unlabeled data is abundant, such as in medical imaging, speech
recognition, and natural language processing. By effectively leveraging both types of
data, semi-supervised learning can improve the performance of machine learning
models and reduce the need for large amounts of labeled data.
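The self-training idea can be sketched with scikit-learn's SelfTrainingClassifier, where unlabeled examples are marked with -1; the digits dataset and the 90% label-hiding rate are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)

rng = np.random.default_rng(0)
y_semi = y.copy()
hidden = rng.random(len(y)) < 0.9                    # hide 90% of the labels
y_semi[hidden] = -1                                  # -1 marks an unlabeled example

base = LogisticRegression(max_iter=5000)
model = SelfTrainingClassifier(base, threshold=0.9)  # add confident predictions each iteration
model.fit(X, y_semi)

print("Accuracy on the originally hidden labels:",
      model.score(X[hidden], y[hidden]))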

Classification, regression in machine learning

Classification and regression are two fundamental types of supervised learning in machine learning.

1. Classification :

 Classification is a supervised learning task where the goal is to predict the categorical label of a new
input based on past observations.
 In classification, the output variable is discrete and belongs to a specific class or
category.
 Examples of classification tasks include spam detection (classifying emails as
spam or not spam), sentiment analysis (classifying movie reviews as positive
or negative), and image classification (classifying images into different
categories).
 Common algorithms for classification include logistic regression, decision
trees, random forests, support vector machines (SVM), and neural networks.
 Evaluation metrics for classification include accuracy, precision, recall, F1 score, and area under the
receiver operating characteristic curve (ROC-AUC).
2. Regression :

 Regression is a supervised learning task where the goal is to predict a continuous value for a new input
based on past observations.
 In regression, the output variable is continuous and can take any value within a
range.
 Examples of regression tasks include predicting house prices based on features
such as size and location, predicting stock prices based on historical data, and
predicting the temperature based on weather patterns.
 Common algorithms for regression include linear regression, polynomial
regression, decision trees, random forests, and neural networks.
 Evaluation metrics for regression include mean squared error (MSE), root
mean squared error (RMSE), mean absolute error (MAE), and R-squared.

Both classification and regression are important tasks in machine learning and are
used in a wide range of applications. The choice between classification and
regression depends on the nature of the output variable and the specific problem
being addressed.
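To complement the classification example given earlier, here is a small regression sketch using linear regression on the bundled diabetes benchmark dataset (an illustrative choice):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True)              # continuous target: disease progression score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)     # fit a linear model to the training data
y_pred = reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))  # mean squared error
print("R^2:", r2_score(y_test, y_pred))            # coefficient of determination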

Clustering in machine learning

Clustering is an unsupervised learning technique used to group similar data points together in such a way
that data points in the same group (or cluster) are more
similar to each other than to those in other groups. Clustering is commonly used in
exploratory data analysis to identify patterns, group similar objects together, and
reduce the complexity of data.

There are several types of clustering algorithms, each with its own strengths and weaknesses:
1. K-means Clustering: K-means is one of the most commonly used clustering algorithms. It partitions
the data into K clusters, where each data point belongs to the cluster with the nearest mean. K-means
aims to minimize the sum of squared distances between data points and their corresponding cluster
centroids.
2. Hierarchical Clustering: Hierarchical clustering builds a hierarchy of clusters, where each data point starts
in its own cluster and clusters are successively merged or split based on their similarity. Hierarchical
clustering can be agglomerative (bottom-up) or divisive (top-down).
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a density- based
clustering algorithm that groups together closely packed data points and identifies outliers as noise.
DBSCAN does not require the number of clusters to be specified in advance.
4. Mean Shift: Mean shift is a clustering algorithm that assigns each data point to the cluster
corresponding to the nearest peak in the density estimation of the data. Mean shift can
automatically determine the number of clusters based on the data.
5. Gaussian Mixture Models (GMM): GMM is a probabilistic model that assumes that the data is
generated from a mixture of several Gaussian distributions. GMM can be used for clustering by fitting
the model to the data and assigning each data point to the most likely cluster.
6. Agglomerative Clustering: Agglomerative clustering is a bottom-up hierarchical clustering algorithm
that starts with each data point as a singleton cluster and iteratively merges clusters based on their
similarity.
Clustering is used in various applications such as customer segmentation, image
segmentation, anomaly detection, and recommender systems. The choice of
clustering algorithm depends on the nature of the data and the specific requirements
of the problem.
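A minimal clustering sketch with K-means on synthetic blobs, using the silhouette score as an internal quality measure (the data and the choice of four clusters are invented for illustration):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X)                          # assign each point to its nearest centroid's cluster

print("Cluster centers:\n", km.cluster_centers_)
print("Silhouette score:", silhouette_score(X, labels))  # closer to 1 means better-separated clusters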

Outliers and Outlier Analysis

Outliers are data points that significantly differ from other observations in a dataset.
They can arise due to errors in data collection, measurement variability, or genuine
rare events. Outliers can have a significant impact on the results of data analysis and
machine learning models, as they can skew statistical measures and distort the
learning process.

Outlier analysis is the process of identifying and handling outliers in a dataset. There
are several approaches to outlier analysis:

1. Statistical Methods: Statistical methods such as Z-score, modified Z-score, and Tukey's method (based on
the interquartile range) can be used to detect outliers.
These methods identify data points that fall significantly far from the mean or
median of the dataset.

2. Visualization: Visualization techniques such as box plots, scatter plots, and histograms can be used to
identify outliers visually. Outliers often appear as points
that are far away from the main cluster of data points.

3. Clustering: Clustering algorithms such as K-means can be used to cluster data points and identify outliers
as data points that do not belong to any cluster or belong
to small clusters.

4. Distance-based Methods: Distance-based methods such as DBSCAN (Density-Based Spatial Clustering
of Applications with Noise) can be used to identify outliers
as data points that are far away from dense regions of the data.

Once outliers are identified, there are several approaches to handling them:

1. Removing Outliers: One approach is to remove outliers from the dataset. However,
this approach should be used with caution, as removing outliers can lead to loss of
information and bias in the data.

2. Transforming Variables: Another approach is to transform variables to make the distribution more
normal, which can reduce the impact of outliers.

3. Treating Outliers as Missing Values: Outliers can be treated as missing values and imputed using
techniques such as mean, median, or mode imputation.

4. Using Robust Statistical Methods: Robust statistical methods, such as robust regression or robust
clustering, which are less sensitive to outliers, can be used.
It's important to carefully analyze outliers and consider the context of the data before
deciding on the appropriate approach for handling them.
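The Z-score and Tukey (IQR) rules mentioned above can be sketched in a few lines of NumPy; the sample values below are made up, with 95 planted as an obvious outlier:

import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12, 13])

# Z-score method: flag points far from the mean in standard-deviation units
# (2.5 is used as the cutoff here; 3 is another common choice).
z = (x - x.mean()) / x.std()
print("Z-score outliers:", x[np.abs(z) > 2.5])

# Tukey's IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outside = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print("IQR outliers:", x[outside])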

Here are some multiple-choice questions (MCQs) with answers related to machine
learning:
1. What is machine learning? A) A technique to write algorithms B) A field of study that
gives computers the ability to learn without being explicitly programmed C) A method
to automate tasks
D) A software application
Answer: B) A field of study that gives computers the ability to learn without being
explicitly programmed

2. Which of the following is a supervised learning technique? A) K-means clustering B) Principal
Component Analysis (PCA) C) Linear regression D) Apriori algorithm

Answer: C) Linear regression

3. What is the goal of unsupervised learning? A) To classify data into predefined classes
B) To find patterns in data without any labels C) To predict continuous values D) To
learn from a teacher
Answer: B) To find patterns in data without any labels

4. Which algorithm is used for clustering in machine learning? A) Decision tree B) K-nearest neighbors
(KNN) C) Support Vector Machine (SVM) D) K-means

Answer: D) K-means

5. What is overfitting in machine learning? A) When a model is too simple to capture the
underlying structure of the data B) When a model performs well on the training data
but fails to generalize to new data C) When a model is too complex and memorizes
the training data D) When a model is not trained long enough

Answer: C) When a model is too complex and memorizes the training data

6. Which evaluation metric is commonly used for classification tasks? A) Mean Squared
Error (MSE) B) R-squared C) Accuracy D) Root Mean Squared Error (RMSE)

Answer: C) Accuracy
7. What is the purpose of feature scaling in machine learning? A) To make the data
easier to visualize
B) To speed up the training process C) To ensure all features have the same scale D)
To reduce the number of features
Answer: C) To ensure all features have the same scale

8. Which of the following is an example of a reinforcement learning problem? A) Image classification
B) Spam detection C) Playing a game of chess D) Customer segmentation
Answer: C) Playing a game of chess

These are just a few examples of MCQs related to machine learning. Depending on
the complexity of the topic and the level of the audience, the questions can vary.

1. What is the first step in the machine learning modeling process? A) Data preprocessing B) Model selection
C) Data collection D) Model evaluation

Answer: C) Data collection

2. What is the purpose of data preprocessing in machine learning? A) To clean and prepare the data for
modeling B) To select the best model for the data C) To evaluate the performance of the model D) To
deploy the model in a production environment

Answer: A) To clean and prepare the data for modeling

3. What is the purpose of model selection in machine learning? A) To clean and prepare the data for modeling
B) To select the best model for the data C) To evaluate the performance of the model D) To deploy the
model in a production environment

Answer: B) To select the best model for the data

4. Which of the following is NOT a step in the machine learning modeling process? A) Data preprocessing B)
Model evaluation C) Model deployment D) Data visualization

Answer: D) Data visualization

5. What is the purpose of model evaluation in machine learning? A) To clean and prepare the data for
modeling B) To select the best model for the data C) To evaluate the performance of the model D) To
deploy the model in a production environment

Answer: C) To evaluate the performance of the model

6. What is the final step in the machine learning modeling process? A) Data preprocessing B) Model selection
C) Model evaluation D) Model deployment
Answer: D) Model deployment

7. What is the goal of data preprocessing in machine learning? A) To create new features from existing data B)
To remove outliers from the data C) To scale the data to a standard range D) To clean and prepare the data
for modeling

Answer: D) To clean and prepare the data for modeling

8. Which of the following is NOT a common evaluation metric used in machine learning? A) Accuracy B)
Mean Squared Error (MSE) C) R-squared D) Principal Component Analysis (PCA)

Answer: D) Principal Component Analysis (PCA)

These questions cover the basic steps of the machine learning modeling process, including data
preprocessing, model selection, model evaluation, and model deployment.

1. What are the main types of machine learning? A) Supervised learning, unsupervised
learning, and reinforcement learning B) Classification, regression, and clustering C)
Neural networks, decision trees, and SVMs D) Linear regression, logistic regression,
and K-means clustering
Answer: A) Supervised learning, unsupervised learning, and reinforcement learning

2. Which type of machine learning is used when the data is labeled? A) Supervised
learning B) Unsupervised learning C) Reinforcement learning D) Semi-supervised
learning
Answer: A) Supervised learning

3. What is the goal of unsupervised learning? A) To predict a continuous value B) To classify data into
predefined classes C) To find patterns in data without any labels D)
To learn from a teacher
Answer: C) To find patterns in data without any labels

4. Which type of machine learning is used when the data is not labeled? A) Supervised
learning B) Unsupervised learning C) Reinforcement learning D) Semi-supervised
learning
Answer: B) Unsupervised learning

5. Which type of machine learning is used when the model learns from its own
experience? A) Supervised learning B) Unsupervised learning C) Reinforcement
learning D) Semi-supervised learning

Answer: C) Reinforcement learning


6. What is the goal of semi-supervised learning? A) To predict a continuous value B) To
classify data into predefined classes C) To find patterns in data without any labels D)
To leverage both labeled and unlabeled data for learning

Answer: D) To leverage both labeled and unlabeled data for learning

7. Which type of machine learning is used for anomaly detection? A) Supervised learning
B) Unsupervised learning C) Reinforcement learning D) Semi-supervised learning

Answer: B) Unsupervised learning

8. Which type of machine learning is used for customer segmentation? A) Supervised


learning B) Unsupervised learning C) Reinforcement learning D) Semi-supervised
learning
Answer: B) Unsupervised learning

These questions cover the main types of machine learning, including supervised
learning, unsupervised learning, and reinforcement learning, as well as their goals
and applications.

Supervised learning of machine learning MCQ with answers

Here are some multiple-choice questions (MCQs) with answers related to supervised
learning in machine learning:

1. What is supervised learning? A) A type of learning where the model learns from its
own experience
B) A type of learning where the model learns from labeled data C) A type of learning where the model
learns without any labels D) A type of learning where the model learns from reinforcement
Answer: B) A type of learning where the model learns from labeled data

2. Which of the following is an example of a supervised learning task? A) Clustering B) Dimensionality
reduction C) Classification D) Anomaly detection

Answer: C) Classification

3. What is the goal of regression in supervised learning? A) To classify data into predefined classes B) To
predict a continuous value C) To find patterns in data
without any labels D) To learn from a teacher

Answer: B) To predict a continuous value


4. Which of the following is a common algorithm used for classification in supervised
learning? A) K-means clustering B) Decision tree C) Principal Component Analysis
(PCA) D) Apriori algorithm
Answer: B) Decision tree

5. What is the purpose of the training data in supervised learning? A) To evaluate the
performance of the model B) To select the best model for the data C) To clean and
prepare the data for modeling
D) To teach the model to make predictions
Answer: D) To teach the model to make predictions

6. Which of the following is NOT a common evaluation metric used in classification tasks? A) Accuracy
B) Mean Squared Error (MSE) C) Precision D) Recall

Answer: B) Mean Squared Error (MSE)

7. What is the goal of feature selection in supervised learning? A) To clean and prepare
the data for modeling B) To select the best model for the data C) To reduce the
number of features to improve model performance D) To ensure all features have the
same scale
Answer: C) To reduce the number of features to improve model performance

8. Which of the following is an example of a regression task? A) Predicting whether an email is spam or
not B) Predicting house prices based on features such as size and
location C) Clustering customer data to identify segments D) Classifying images into
different categories
Answer: B) Predicting house prices based on features such as size and location

These questions cover the basics of supervised learning in machine learning,


including the goals, algorithms, evaluation metrics, and applications of supervised
learning.

Unsupervised learning of machine learning MCQ with answers

Here are some multiple-choice questions (MCQs) with answers related to unsupervised learning in machine
learning:

1. What is unsupervised learning? A) A type of learning where the model learns from labeled data B) A type of
learning where the model learns from its own experience C) A type of learning where the model learns
without any labels D) A type of learning where the model learns from reinforcement

Answer: C) A type of learning where the model learns without any labels
2. Which of the following is an example of an unsupervised learning task? A) Image classification B)
Clustering C) Spam detection D) Sentiment analysis

Answer: B) Clustering

3. What is the goal of clustering in unsupervised learning? A) To predict a continuous value B) To classify data
into predefined classes C) To find patterns in data without any labels D) To learn from a teacher

Answer: C) To find patterns in data without any labels

4. Which of the following is a common algorithm used for clustering in unsupervised learning? A) Decision
tree B) K-means C) Support Vector Machine (SVM) D) Linear regression

Answer: B) K-means

5. What is the purpose of dimensionality reduction in unsupervised learning? A) To reduce the number of
features to improve model performance B) To select the best model for the data C) To ensure all features
have the same scale D) To clean and prepare the data for modeling

Answer: A) To reduce the number of features to improve model performance

6. Which of the following is an example of an anomaly detection task? A) Predicting house prices based on
features such as size and location B) Classifying images into different categories C) Identifying fraudulent
transactions in financial data D) Clustering customer data to identify segments

Answer: C) Identifying fraudulent transactions in financial data

7. What is the goal of feature extraction in unsupervised learning? A) To clean and prepare the data for
modeling B) To reduce the number of features to improve model performance C) To select the best model
for the data D) To ensure all features have the same scale

Answer: B) To reduce the number of features to improve model performance

8. Which of the following is an example of a dimensionality reduction technique? A) K-means clustering
B) Decision tree C) Principal Component Analysis (PCA) D) Apriori algorithm

Answer: C) Principal Component Analysis (PCA)

These questions cover the basics of unsupervised learning in machine learning, including the goals,
algorithms, and applications of unsupervised learning.

Here are some multiple-choice questions (MCQs) with answers related to semi-
supervised learning in machine learning:

1. What is semi-supervised learning? A) A type of learning where the model learns from
labeled data
B) A type of learning where the model learns from its own experience C) A type of learning where the
model learns from both labeled and unlabeled data D) A type of learning where
the model learns without any labels

Answer: C) A type of learning where the model learns from both labeled and unlabeled
data
2. Which of the following is an example of a semi-supervised learning task? A) Image
classification B) Clustering C) Sentiment analysis with a small labeled dataset and a
large unlabeled dataset D) Regression

Answer: C) Sentiment analysis with a small labeled dataset and a large unlabeled
dataset
3. What is the goal of semi-supervised learning? A) To predict a continuous value B) To
classify data into predefined classes C) To leverage both labeled and unlabeled data
for learning D) To learn from a teacher

Answer: C) To leverage both labeled and unlabeled data for learning

4. Which of the following is a common approach used in semi-supervised learning? A) Self-training
B) K-means clustering C) Support Vector Machine (SVM) D) Principal Component Analysis (PCA)
Answer: A) Self-training

5. What is the purpose of self-training in semi-supervised learning? A) To clean and


prepare the data for modeling B) To select the best model for the data C) To predict
labels for unlabeled data based on a model trained on labeled data D) To ensure all
features have the same scale
Answer: C) To predict labels for unlabeled data based on a model trained on labeled
data
6. Which of the following is a benefit of using semi-supervised learning? A) It requires a
large amount of labeled data B) It can improve model performance by leveraging
unlabeled data C) It is computationally expensive D) It is only suitable for certain types
of machine learning tasks
Answer: B) It can improve model performance by leveraging unlabeled data

7. What is the main challenge of using semi-supervised learning? A) It requires a large


amount of labeled data B) It can lead to overfitting C) It can be difficult to predict
labels for unlabeled data accurately D) It is not suitable for complex machine learning
tasks
Answer: C) It can be difficult to predict labels for unlabeled data accurately

8. Which of the following is an example of a semi-supervised learning algorithm? A) K-means clustering
B) Decision tree C) Label Propagation D) Linear regression

Answer: C) Label Propagation


These questions cover the basics of semi-supervised learning in machine learning,
including its goals, approaches, benefits, and challenges.

Classification, regression of machine learning MCQ with answers

Here are some multiple-choice questions (MCQs) with answers related to classification and regression in
machine learning:

1. What is the goal of classification in machine learning? A) To predict a continuous value B) To classify data
into predefined classes C) To find patterns in data without any labels D) To learn from a teacher

Answer: B) To classify data into predefined classes

2. Which of the following is an example of a classification task? A) Predicting house prices based on features
such as size and location B) Classifying emails as spam or not spam C) Clustering customer data to identify
segments D) Predicting a student's grade based on the number of hours studied

Answer: B) Classifying emails as spam or not spam

3. What is the goal of regression in machine learning? A) To classify data into predefined classes B) To predict
a continuous value C) To find patterns in data without any labels D) To learn from a teacher

Answer: B) To predict a continuous value

4. Which of the following is an example of a regression task? A) Classifying images into different categories
B) Predicting house prices based on features such as size and location C) Clustering customer data to
identify segments D) Predicting whether a customer will buy a product or not

Answer: B) Predicting house prices based on features such as size and location

5. Which evaluation metric is commonly used for classification tasks? A) Mean Squared Error (MSE) B) R-
squared C) Accuracy D) Root Mean Squared Error (RMSE)

Answer: C) Accuracy

6. Which evaluation metric is commonly used for regression tasks? A) Accuracy B) Mean Squared Error
(MSE) C) Precision D) Recall

Answer: B) Mean Squared Error (MSE)

7. What is the main difference between classification and regression? A) Classification predicts a continuous
value, while regression predicts a discrete class label B) Classification predicts a discrete class label, while
regression predicts a continuous value C) Classification uses labeled data, while regression uses unlabeled
data D) Regression uses labeled data, while classification uses unlabeled data
Answer: B) Classification predicts a discrete class label, while regression predicts a continuous value

8. Which of the following algorithms is commonly used for classification tasks? A) Linear regression B)
Decision tree C) K-means clustering D) Principal Component Analysis (PCA)

Answer: B) Decision tree

These questions cover the basics of classification and regression in machine learning, including their goals,
examples, evaluation metrics, and algorithms.

Clustering of machine learning MCQ with answers

Here are some multiple-choice questions (MCQs) with answers related to clustering in
machine learning:

1. What is clustering in machine learning? A) A type of learning where the model learns
from labeled data B) A type of learning where the model learns from its own
experience C) A type of learning where the model learns without any labels D) A type
of learning where the model learns from reinforcement

Answer: C) A type of learning where the model learns without any labels

2. Which of the following is an example of a clustering task? A) Image classification B) Predicting house prices based on features such as size and location C) Clustering customer data to identify segments D) Classifying emails as spam or not spam

Answer: C) Clustering customer data to identify segments

3. What is the goal of clustering in machine learning? A) To predict a continuous value B) To classify data into predefined classes C) To find patterns in data without any labels D) To learn from a teacher

Answer: C) To find patterns in data without any labels

4. Which of the following is a common algorithm used for clustering in machine learning?
A) Decision tree B) K-means C) Support Vector Machine (SVM) D) Linear regression

Answer: B) K-means

5. What is the purpose of clustering in machine learning? A) To reduce the number of features to improve model performance B) To select the best model for the data C) To find patterns in data without any labels D) To ensure all features have the same scale

Answer: C) To find patterns in data without any labels

6. Which of the following is an example of an evaluation metric used for clustering? A) Accuracy B) Mean Squared Error (MSE) C) Silhouette score D) Precision

Answer: C) Silhouette score

7. Which of the following is NOT a common approach used in clustering? A) K-means clustering B) Hierarchical clustering C) DBSCAN D) Linear regression

Answer: D) Linear regression

8. What is the main difference between clustering and classification? A) Clustering predicts a continuous value, while classification predicts a discrete class label B) Clustering uses labeled data, while classification uses unlabeled data C) Clustering predicts a discrete class label, while classification predicts a continuous value D) Clustering is a type of unsupervised learning, while classification is a type of supervised learning

Answer: D) Clustering is a type of unsupervised learning, while classification is a type of supervised learning

These questions cover the basics of clustering in machine learning, including its
goals, examples, algorithms, and evaluation metrics.

Outliers and Outlier Analysis of machine learning MCQ with answers

Here are some multiple-choice questions (MCQs) with answers related to outliers and
outlier analysis in machine learning:

1. What is an outlier in a dataset? A) A data point that is missing a value B) A data point
that is significantly different from other observations C) A data point that is incorrectly
labeled D) A data point that is located at the center of the dataset

Answer: B) A data point that is significantly different from other observations

2. Why are outliers important in data analysis? A) They help to reduce the complexity of
the dataset
B) They can provide valuable insights into the data C) They have no impact on the
results of data analysis D) They make the dataset more difficult to analyze
Answer: B) They can provide valuable insights into the data
3. Which of the following is a common method for detecting outliers? A) Z-score method
B) Mean Squared Error (MSE) C) Root Mean Squared Error (RMSE) D) Silhouette score

Answer: A) Z-score method

4. What is the Z-score method used for in outlier analysis? A) To calculate the mean of
the dataset B) To calculate the standard deviation of the dataset C) To identify data
points that are significantly different from the mean D) To calculate the range of the
dataset
Answer: C) To identify data points that are significantly different from the mean

5. Which of the following is a common approach for handling outliers? A) Removing


outliers from the dataset B) Keeping outliers in the dataset C) Replacing outliers with
the mean of the dataset D) Ignoring outliers in the analysis

Answer: A) Removing outliers from the dataset

6. What is the impact of outliers on statistical measures such as mean and standard deviation? A) Outliers have no impact on these measures B) Outliers increase the mean and standard deviation C) Outliers decrease the mean and standard deviation D) The impact of outliers depends on their value

Answer: D) The impact of outliers depends on their value (an outlier always inflates the standard deviation, but it can pull the mean up or down depending on its direction)

7. Which of the following is a disadvantage of removing outliers from a dataset? A) It can lead to biased results B) It can improve the accuracy of the analysis C) It can make the dataset easier to analyze D) It can reduce the complexity of the dataset

Answer: A) It can lead to biased results

8. What is the purpose of outlier analysis in machine learning? A) To identify errors in the dataset B) To improve the accuracy of machine learning models C) To reduce the complexity of the dataset D) To increase the number of data points in the dataset

Answer: B) To improve the accuracy of machine learning models

These questions cover the basics of outliers and outlier analysis in machine
learning, including their detection, impact, and handling.
UNIT II DATA MANIPULATION

Python Shell
The Python Shell, also known as the Python interactive interpreter or Python REPL (Read-Eval-Print Loop),
is a command-line tool that allows you to interactively execute Python code. It provides a convenient way to
experiment with Python code, test small snippets, and learn about Python features.

To start the Python Shell, you can open a terminal or command prompt and type python or

python3 depending on your Python installation. This will launch the Python interpreter, and you
will see a prompt (>>>) where you can start entering Python code. Here is
an example of using the Python Shell:
$ python

Python 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, world!")
Hello, world!
>>> x = 5
>>> y = 10
>>> print(x + y)
15
>>> exit()

In this example, we start the Python interpreter, print a message, perform some basic arithmetic

operations, and then exit the Python interpreter using the exit() function.

Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations, and narrative text.
It supports various programming languages, including Python, R, and Julia, among
others. Jupyter Notebook is widely used for data cleaning, transformation, numerical
simulation, statistical modeling, data visualization, machine learning, and more.

To start using Jupyter Notebook, you first need to have Python installed on your
computer. You can then install Jupyter Notebook using pip, the Python package
installer, by running the following command in your terminal or command prompt:

pip install notebook

Once Jupyter Notebook is installed, you can start it by running the following command in your terminal or
command prompt:

jupyter notebook

This will launch the Jupyter Notebook server and open a new tab in your web browser
with the Jupyter Notebook interface. From there, you can create a new notebook or
open an existing one. You can write and execute code in the notebook cells, add text
and equations using Markdown, and create visualizations using libraries like
Matplotlib and Seaborn.

Jupyter Notebook is a powerful tool for interactive computing and is widely used in
data science and research communities.

IPython Magic Commands

IPython magic commands are special commands that allow you to perform various
tasks in IPython, the enhanced interactive Python shell. Magic commands are
prefixed by one or two percentage signs (% or %%) and provide additional functionality
beyond what standard Python syntax offers. Here are some commonly used IPython
magic commands:
1. %run: Run a Python script inside the IPython session. Usage: %run script.py.

2. %time and %timeit: Measure the execution time of a statement; %time runs it once, while %timeit runs it repeatedly and reports an average timing.

3. %load: Load code into the current IPython session. Usage: %load file.py.
4. %matplotlib: Enable inline plotting of graphs and figures in IPython. Usage: %matplotlib inline.

5. %reset: Reset the IPython namespace by removing all variables, functions, and imports. Usage:
%reset -f.

6. %who and %whos: List all variables in the current IPython session (%who) or list all variables with
additional information such as type and value (%whos).

7. %%time and %%timeit: Measure the execution time of an entire cell in IPython, either once (%%time) or averaged over repeated runs (%%timeit).

8. %magic: Display information about IPython magic commands and their usage. Usage: %magic.

9. %history: Display the command history for the current IPython session. Usage: %history.

10. %pdb: Activate the interactive debugger (Python debugger) for errors in the IPython session. Usage: %pdb.

These are just a few examples of IPython magic commands. IPython provides many more magic commands for various purposes. You can explore them by typing %lsmagic to list all available magic commands, and %<command>? for help on a specific magic command (e.g., %time? for help on the %time command).
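
For illustration, here is a minimal hypothetical IPython session using a few of these magics (the array size is an arbitrary choice, and the timing output is omitted since it varies by machine):

In [1]: import numpy as np

In [2]: %timeit np.arange(1_000_000).sum()   # runs the statement many times and reports an average time

In [3]: %time total = sum(range(1_000_000))  # times a single execution of the statement

In [4]: %who                                 # lists the variables defined so far (np, total)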

NumPy Arrays

NumPy is a Python library that provides support for creating and manipulating arrays
and matrices. NumPy arrays are the core data structure used in NumPy to store and
manipulate data efficiently. Here's a brief overview of NumPy arrays:

1. Creating NumPy Arrays : NumPy arrays can be created using the numpy.array() function by
passing a Python list as an argument. For example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])

Array Attributes: NumPy arrays have several attributes that provide information about the array, such as its
shape, size, and data type. Some common attributes include shape, size, and dtype.

print(arr.shape) # (5,) - shape of the array


print(arr.size) # 5 - number of elements in the array
print(arr.dtype) # int64 - data type of the array elements

Array Operations: NumPy arrays support element-wise operations, such as addition, subtraction, multiplication, and division. These operations are performed on each element of the array.

arr1 = np.array([1, 2, 3])


arr2 = np.array([4, 5, 6])
result = arr1 + arr2 # [5, 7, 9]

Indexing and Slicing: NumPy arrays support indexing and slicing operations to access and modify elements of the array.

print(arr[0]) # 1 - access the first element of the array


print(arr[1:3]) # [2, 3] - slice the array from index 1 to 2

Array Broadcasting: NumPy arrays support broadcasting, which allows operations to be performed on
arrays of different shapes.
arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 2
result = arr * scalar # [[2, 4, 6], [8, 10, 12]]

1. Array Functions: NumPy provides a variety of functions for creating and manipulating arrays, such as
np.arange() , np.zeros() , np.ones() , np.linspace() , np.concatenate() , and more.

NumPy arrays are widely used in scientific computing, data analysis, and machine learning due to their
efficiency and versatility.

Universal Functions of data manipulation


Universal functions (ufuncs) in NumPy are functions that operate element-wise on
arrays. They are essential for numerical operations on arrays, as they provide a way
to efficiently perform operations without the need for explicit loops. Ufuncs support
various mathematical operations and can be used to manipulate data in arrays. Here
are some common ufuncs used for data manipulation in NumPy:

1. Mathematical Functions: NumPy provides ufuncs for basic mathematical operations such as np.add(),
np.subtract(), np.multiply(), np.divide(), np.power(), np.sqrt(), np.exp(), np.log(), and more. These
functions can be used to perform element-wise arithmetic operations on arrays.
2. Trigonometric Functions: NumPy provides ufuncs for trigonometric functions such as np.sin(), np.cos(),
np.tan(), np.arcsin(), np.arccos(), np.arctan(), and more. These functions operate element-wise on
arrays and are useful for mathematical calculations involving angles.
3. Statistical Functions: NumPy provides ufuncs for statistical functions such as np.mean(), np.median(),
np.std(), np.var(), np.sum(), np.min(), np.max(), and more. These functions can be used to calculate
various statistical measures of arrays.
4. Logical Functions: NumPy provides ufuncs for logical operations such as np.logical_and(),
np.logical_or(), np.logical_not(), and more. These functions operate element-wise on boolean arrays
and are useful for logical operations.
5. Comparison Functions: NumPy provides ufuncs for comparison operations such as np.equal(),
np.not_equal(), np.greater(), np.greater_equal(), np.less(), np.less_equal(), and more. These functions
compare elements of arrays and return boolean arrays indicating the result of the comparison.
6. Bitwise Functions: NumPy provides ufuncs for bitwise operations such as np.bitwise_and(),
np.bitwise_or(), np.bitwise_xor(), np.bitwise_not(), and more. These functions operate element-wise
on integer arrays and perform bitwise operations.

These are just a few examples of the many ufuncs available in NumPy for data
manipulation. Ufuncs are an important part of NumPy and are widely used for
performing efficient and vectorized operations on arrays.
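
As a small illustration, the following sketch applies a few of these ufuncs element-wise to two arrays:

import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([2.0, 2.0, 2.0])

print(np.add(a, b))      # [ 3.  6. 11.]
print(np.sqrt(a))        # [1. 2. 3.]
print(np.power(a, 2))    # [ 1. 16. 81.]
print(np.greater(a, b))  # [False  True  True]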

Aggregations of data manipulation

Aggregations in NumPy refer to the process of performing a computation on an array and summarizing the result. NumPy provides several functions for aggregations, which can be used to calculate various statistical measures of an array. Some common aggregation functions in NumPy include:

1. np.sum : Calculates the sum of all elements in the array or along a specified axis.

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
total_sum = np.sum(arr) # 21

np.mean: Calculates the mean (average) of all elements in the array or along a specified axis.
mean_value = np.mean(arr) # 3.5

np.median: Calculates the median of all elements in the array or along a specified axis.

median_value = np.median(arr) # 3.5

np.min and np.max: Calculate the minimum and maximum values in the array or along a specified axis.

min_value = np.min(arr) # 1
max_value = np.max(arr) # 6

np.std and np.var: Calculate the standard deviation and variance of the elements in the array or along a specified axis.

std_value = np.std(arr) # 1.7078


var_value = np.var(arr) # 2.9167

np.sum(axis=0): Calculate the sum of elements along a specified axis (0 for columns, 1 for rows).
col_sum = np.sum(arr, axis=0) # array([5, 7, 9])

np.prod(): Calculate the product of all elements in the array or along a specified axis.
prod_value = np.prod(arr) # 720
These aggregation functions are useful for summarizing and analyzing data in NumPy arrays. They provide
efficient ways to calculate various statistical measures and perform calculations on arrays.

Computation on Arrays

Computation on arrays in NumPy allows you to perform element-wise operations, broadcasting, and
vectorized computations efficiently. Here are some key concepts and examples:

1. Element-wise operations: NumPy allows you to perform arithmetic operations (addition, subtraction,
multiplication, division) on arrays of the same shape element-wise.


import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
z = x + y # [6, 8, 10, 12]

Broadcasting: Broadcasting is a powerful mechanism that allows NumPy to work with arrays of different
shapes when performing arithmetic operations.

x = np.array([[1, 2, 3], [4, 5, 6]])


y = np.array([10, 20, 30])
z = x + y # [[11, 22, 33], [14, 25, 36]]

Universal functions (ufuncs): NumPy provides a set of mathematical functions that operate element-wise
on arrays. These functions are called universal functions (ufuncs).

x = np.array([1, 2, 3, 4])
y = np.sqrt(x) # [1. 1.41421356 1.73205081 2. ]
Aggregation functions: NumPy provides functions for aggregating data in arrays, such as sum, mean, min,
max, std, and var.

x = np.array([1, 2, 3, 4])
sum_x = np.sum(x)    # 10
mean_x = np.mean(x)  # 2.5

Vectorized computations: NumPy allows you to express batch operations on data without writing any for
loops, which can lead to more concise and readable code.

x = np.array([[1, 2], [3, 4]])


y = np.array([[5, 6], [7, 8]])
z = x * y # Element-wise multiplication: [[5, 12], [21, 32]]

NumPy's array operations are optimized and implemented in C, making them much faster than equivalent
Python operations using lists. This makes NumPy a powerful tool for numerical computation and data
manipulation in Python.

Fancy Indexing

Fancy indexing in NumPy refers to indexing using arrays of indices or boolean arrays.
It allows you to access and modify elements of an array in a more flexible way than
simple indexing. Here are some examples of fancy indexing:

1. Indexing with arrays of indices :


import numpy as np
x = np.array([10, 20, 30, 40, 50])
indices = np.array([1, 3, 4])
y = x[indices] # [20, 40, 50]

Indexing with boolean arrays:
x = np.array([10, 20, 30, 40, 50])
mask = np.array([False, True, False, True, True])
y = x[mask] # [20, 40, 50]

Combining multiple boolean conditions:
x = np.array([10, 20, 30, 40, 50])


mask = (x > 20) & (x < 50)
y = x[mask] # [30, 40]

Assigning values using fancy indexing:
x = np.array([10, 20, 30, 40, 50])
indices = np.array([1, 3, 4])
x[indices] = 0
# x is now [10, 0, 30, 0, 0]

Indexing multi-dimensional arrays:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.array([0, 2])
col_indices = np.array([1, 2])
y = x[row_indices, col_indices] # [2, 9]

Fancy indexing can be very useful for selecting and modifying specific elements of arrays based on complex
conditions. However, it is important to note that fancy indexing creates copies of the data, not views, so
modifying the result of fancy indexing will not affect the original array.

Sorting arrays

In NumPy, you can sort arrays using the np.sort() function or the sort() method of the array object.
Both functions return a sorted copy of the array without modifying the original array. Here are some
examples of sorting arrays in NumPy:
Sorting 1D arrays:
import numpy as np

x = np.array([3, 1, 2, 5, 4])
sorted_x = np.sort(x)
# sorted_x: [1, 2, 3, 4, 5]

Sorting 2D arrays by rows or columns:
x = np.array([[3, 1, 2], [6, 4, 5], [9, 7, 8]])
# Sort each row
sorted_rows = np.sort(x, axis=1)
# sorted_rows: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Sort each column


sorted_cols = np.sort(x, axis=0)
# sorted_cols: [[3, 1, 2], [6, 4, 5], [9, 7, 8]]

Sorting with argsort: NumPy's argsort() function returns the indices that would sort an array. This can
be useful for sorting one array based on the values in another array.

x = np.array([3, 1, 2, 5, 4])
indices = np.argsort(x)
sorted_x = x[indices]
# sorted_x: [1, 2, 3, 4, 5]

Sorting in-place: If you want to sort an array in-place (i.e., modify the original array), you can use the sort() method of the array object.
x = np.array([3, 1, 2, 5, 4])
x.sort()
# x: [1, 2, 3, 4, 5]
Sorting with complex numbers: Sorting works with complex numbers as well, with the real part used for
sorting. If the real parts are equal, the imaginary parts are used.

x = np.array([3+1j, 1+2j, 2+3j, 5+4j, 4+5j])
sorted_x = np.sort(x)
# sorted_x: [1.+2.j, 2.+3.j, 3.+1.j, 4.+5.j, 5.+4.j]

Structured data

Structured data in NumPy refers to arrays where each element can contain multiple fields or columns,
similar to a table in a spreadsheet or a database table. NumPy provides the numpy.ndarray class to represent structured data, and you can create structured arrays using the numpy.array() function with a dtype parameter specifying the data type for each field. Here's an example:

import numpy as np

# Define the data type for the structured array
dtype = [('name', 'U10'), ('age', int), ('height', float)]

# Create a structured array


data = np.array([('Alice', 25, 5.6), ('Bob', 30, 6.0)], dtype=dtype)

# Accessing elements in a structured array


print(data['name']) # ['Alice' 'Bob']
print(data['age']) # [25 30]
print(data['height']) # [5.6 6. ]
In this example, we define a dtype for the structured array with three fields: 'name' (string of
length 10), 'age' (integer), and 'height' (float). We then create a structured array data with two elements,
each containing values for the three fields.

You can also access and modify individual elements or slices of a structured array using the field names. For
example, to access the 'name' field of the first element, you can use data[0]['name'].

Structured arrays are useful for representing and manipulating tabular data in NumPy, and they provide a
way to work with heterogeneous data in a structured manner.

Data manipulation with Pandas


Pandas is a popular Python library for data manipulation and analysis. It provides powerful data structures, such as Series and DataFrame, that allow you to work with structured data easily. Here's an overview of how to perform common data manipulation tasks with Pandas:

1. Importing Pandas :
import pandas as pd
Creating a DataFrame: You can create a DataFrame from various data sources, such as lists, dictionaries,
NumPy arrays, or from a file (e.g., CSV, Excel).
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

Reading and Writing Data: Pandas provides functions to read data from and write data to various file
formats, such as CSV, Excel, SQL, and more.
# Read data from a CSV file
df = pd.read_csv('data.csv')

# Write data to a CSV file


df.to_csv('data.csv', index=False)
Viewing Data: Pandas provides functions to view the data in a DataFrame, such as head(), tail(), and sample().
print(df.head())    # View the first few rows
print(df.tail())    # View the last few rows
print(df.sample(2)) # View a random sample of rows

Selecting Data: You can select columns or rows from a DataFrame using indexing and slicing.
# Select a single column
print(df['Name'])

# Select multiple columns


print(df[['Name', 'Age']])

# Select rows based on a condition
print(df[df['Age'] > 30])

Adding and Removing Columns: You can add new columns to a DataFrame or remove existing columns.

# Add a new column


df['Gender'] = ['Female', 'Male', 'Male']

# Remove a column
df = df.drop('City', axis=1)

Grouping and Aggregating Data: Pandas allows you to group data based on one or more columns and
perform aggregation

# Group data by 'City' and calculate the mean age in each city
print(df.groupby('City')['Age'].mean())
Handling Missing Data: Pandas provides functions to handle missing data, such as dropna(), fillna(), and
isnull().

# Drop rows with missing values
df = df.dropna()

# Fill missing values with a specific value
df = df.fillna(0)

# Check for missing values


print(df.isnull().sum())

Merging and Joining DataFrames: Pandas provides functions to merge or join multiple DataFrames based
on a common column.

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})


df2 = pd.DataFrame({'A': [3, 4, 5], 'C': [7, 8, 9]})

# Merge DataFrames based on the 'A' column
merged_df = pd.merge(df1, df2, on='A')

These are just a few examples of how you can manipulate data with Pandas. Pandas provides a wide range
of functions and methods for data cleaning, transformation, and analysis, making it a powerful tool for data
manipulation in Python.

Data Indexing and Selection of pandas


Data indexing and selection in pandas are fundamental operations for working with
data frames and series. Here's a brief overview:
1. Indexing with []: You can use square brackets to select columns of a DataFrame or specific elements of a Series.
 DataFrame: df['column_name'] or df[['column_name1', 'column_name2']]
 Series: s[index]
2. Label-based indexing with .loc[]: Use .loc[] for label-based indexing, where you specify the row and column labels.
 DataFrame: df.loc[row_label, column_label]
 Series: s.loc[label]
3. Position-based indexing with .iloc[]: Use .iloc[] for position-based indexing, where you specify the row and column positions (0-based index).
 DataFrame: df.iloc[row_index, column_index]
 Series: s.iloc[index]
4. Boolean indexing: You can use boolean arrays for selection, which allows you to filter rows based on conditions.

df[df['column_name'] > 0]
5. Attribute access: If your column names are valid Python identifiers, you can use attribute access to
select columns.

df.column_name

6. Callable indexing with .loc[] and .iloc[]: You can use callables with .loc[] and .iloc[] for more advanced selection.
df.loc[lambda df: df['column_name'] > 0]

These are the basic ways to index and select data in pandas. Each method has its strengths, so choose the
one that best fits your use case.
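
To put these methods side by side, here is a short self-contained sketch using a small made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]},
                  index=['a', 'b', 'c'])

print(df['Name'])                       # [] column selection
print(df.loc['b', 'Age'])               # label-based: row 'b', column 'Age' -> 30
print(df.iloc[0, 1])                    # position-based: first row, second column -> 25
print(df[df['Age'] > 28])               # boolean indexing: rows where Age > 28
print(df.loc[lambda d: d['Age'] < 35])  # callable indexing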

Handling missing data in pandas


Handling missing data in pandas is crucial for data analysis and modeling. Pandas
provides several methods for dealing with missing data:

1. Identifying Missing Data:
 isna(), isnull(): Return a boolean mask indicating missing values.
 notna(), notnull(): Return the opposite of isna() and isnull().
2. Removing Missing Data:
 dropna(): Removes rows or columns with missing values.
df.dropna(axis=0)  # Remove rows with missing values
df.dropna(axis=1)  # Remove columns with missing values
3. Filling Missing Data:
 fillna(): Fills missing values with a specified value or method.
df.fillna(0)                            # Fill NaN with 0
df.replace(to_replace=np.nan, value=0)  # Equivalent using replace()

4. Interpolating Missing Data:
 interpolate(): Performs linear interpolation to fill missing values.
df.interpolate() # Perform linear interpolation
5. Ignoring Missing Data:
 Many operations in pandas have an NA-aware counterpart that ignores missing values (e.g., sum(), mean(), min(), max()).
6. Filling Missing Data with Group-specific Values:
 groupby() with transform(): Fill missing values within groups based on a group statistic.
df.groupby('group_column')['value_column'].transform(lambda x: x.fillna(x.mean()))

7. Using Sentinel Values:
 Sometimes, missing values are represented by sentinel values (e.g., -999).
df.replace(to_replace=-999, value=np.nan)

These methods provide flexibility in handling missing data in pandas, allowing you to choose the approach
that best suits your data and analysis needs.
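
A brief runnable sketch (with a made-up DataFrame) that combines several of these methods:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                   'B': [np.nan, 2.0, 3.0, 4.0]})

print(df.isna())         # boolean mask of missing values
print(df.dropna())       # keep only rows without missing values
print(df.fillna(0))      # replace NaN with 0
print(df.interpolate())  # linear interpolation along each column
print(df['A'].mean())    # NA-aware: ignores NaN -> 2.0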

Hierarchical indexing in pandas


Hierarchical indexing, also known as MultiIndexing, enables you to work with higher-
dimensional data in pandas by allowing you to have multiple index levels on an axis.
This is particularly useful for representing higher-dimensional data in a two-
dimensional DataFrame. Here's a basic overview of hierarchical indexing in pandas:

1. Creating a MultiIndex : You can create a MultiIndex by passing a list of index levels to the index
parameter when creating a DataFrame.
import pandas as pd

arrays = [
['A', 'A', 'B', 'B'], [1,
2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))

df = pd.DataFrame({'data': [1, 2, 3, 4]}, index=index)

Indexing with a MultiIndex: You can use tuples to index into the DataFrame at multiple levels.
# Selecting a single value
df.loc[('A', 1)]

# Selecting a single level
df.loc['A']

# Selecting on both levels
df.loc[('A', 1):('B', 1)]

# Selecting on the second level only
df.xs(1, level='second')

MultiIndex columns: You can also have a MultiIndex for columns.


columns = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'three'), ('B', 'four')],
names=('first', 'second'))
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=['foo', 'bar'], columns=columns)

Indexing with MultiIndex columns: Indexing with MultiIndex columns is similar to indexing with
MultiIndex rows.
# Selecting a single column
df[('A', 'one')]

# Selecting on the first level of columns
df['A']

# Selecting on both levels of columns
df.loc[:, ('A', 'one'):('B', 'three')]

Creating from a dictionary with tuples: You can also create a Series with a MultiIndex from a dictionary where keys are tuples representing the index levels.
data = {('A', 1): 1, ('A', 2): 2, ('B', 1): 3, ('B', 2): 4}
s = pd.Series(data)
Hierarchical indexing provides a powerful way to represent and manipulate higher-dimensional datasets in
pandas. It allows for more flexible data manipulation and analysis.

Combining datasets in pandas

Combining datasets in pandas typically involves operations like merging, joining, and
concatenating DataFrames. Here's an overview of each:

1. Concatenation :
 Use pd.concat() to concatenate two or more DataFrames along a particular axis (row or column).
 By default, it concatenates along axis=0 (rows), but you can specify axis=1 to concatenate
columns.

df_concatenated = pd.concat([df1, df2], axis=0)

Merging :
 Use pd.merge() to merge two DataFrames based on a common column or index.
 Specify the on parameter to indicate the column to join on.
merged_df = pd.merge(df1, df2, on='common_column')

Joining :
 Use the .join() method to join two DataFrames on their indexes.
 By default, it performs a left join ( how='left' ), but you can specify other types of joins.
joined_df = df1.join(df2, how='inner')

Appending :
 Older pandas versions provided a .append() method to append rows of one DataFrame to another; it was deprecated and removed in pandas 2.0.
 The equivalent operation is concatenation along axis=0 with pd.concat().
appended_df = pd.concat([df1, df2], ignore_index=True)
Merging on Index :
 You can merge DataFrames based on their index using left_index=True and
right_index=True .

merged_on_index = pd.merge(df1, df2, left_index=True, right_index=True)

Specifying Merge Keys :


 For more complex merges, you can specify multiple columns to merge on using the
left_on and right_on parameters.
merged_df = pd.merge(df1, df2, left_on=['key1', 'key2'], right_on=['key1', 'key2'])

Handling Overlapping Column Names :


 If the DataFrames have overlapping column names, you can specify suffixes to add to the
column names in the merged DataFrame.
merged_df = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

These methods provide flexible ways to combine datasets in pandas, allowing you to perform various types
of joins and concatenations based on your data's structure and requirements.
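
The following compact sketch (two made-up DataFrames) shows concatenation and merging side by side:

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

stacked = pd.concat([df1, df2], axis=0, ignore_index=True)  # stack rows
inner = pd.merge(df1, df2, on='key')                        # keep only keys present in both
outer = pd.merge(df1, df2, on='key', how='outer')           # keep all keys, fill gaps with NaN

print(inner)  # rows for keys 'b' and 'c'
print(outer)  # rows for keys 'a', 'b', 'c' and 'd'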

Aggregation and Grouping in pandas

Aggregation and grouping are powerful features in pandas that allow you to perform
operations on groups of data. Here's an overview:

1. GroupBy :
 Use groupby() to group data based on one or more columns
grouped = df.groupby('column_name')
Aggregation Functions :
 Apply aggregation functions like sum(), mean(), count(), min(), max(), etc., to calculate
summary statistics for each group.
grouped.sum()

Custom Aggregation :
 You can also apply custom aggregation functions using agg() with a dictionary mapping column
names to functions.
grouped.agg({'column1': 'sum', 'column2': 'mean'})

Applying Multiple Aggregations :


 You can apply multiple aggregation functions to the same column or multiple columns.

grouped['column_name'].agg(['sum', 'mean', 'count'])

Grouping with Multiple Columns:
 You can group by multiple columns to create hierarchical groupings.
Iterating Over Groups:
 You can iterate over the groups produced by groupby() to perform more complex operations.
for name, group in grouped:
    print(name)
    print(group)

Filtering Groups :

 You can filter groups based on group properties using filter() .


grouped.filter(lambda x: x['column_name'].sum() > threshold)

Grouping with Time Series Data :

 For time series data, you can use resample() to group by a specified frequency.
df.resample('M').sum()

Grouping with Categorical Data :


 For categorical data, you can use groupby() directly on the categorical column.
df.groupby('category_column').mean()
These are some of the key concepts and techniques for aggregation and grouping in pandas. They allow
you to perform a wide range of operations on grouped data efficiently.
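
A small runnable sketch (made-up sales data) that combines grouping, aggregation, and filtering:

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA', 'LA'],
                   'sales': [10, 20, 5, 15, 25]})

grouped = df.groupby('city')
print(grouped['sales'].sum())                           # total sales per city
print(grouped['sales'].agg(['mean', 'count']))          # several aggregations at once
print(grouped.filter(lambda g: g['sales'].sum() > 30))  # keep groups with total sales > 30 (only 'LA')
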
String operations in pandas

String operations in pandas are used to manipulate string data in Series and
DataFrame columns. Pandas provides a wide range of string methods that are
vectorized, meaning they can operate on each element of a Series without the need
for explicit looping. Here are some common string operations in pandas:

1. Accessing String Methods :


 Use the .str accessor to access string methods.
df['column_name'].str.method_name()

Lowercasing/Uppercasing :
 Convert strings to lowercase or uppercase.
df['column_name'].str.lower()
df['column_name'].str.upper()
String Length :
 Get the length of each string.
df['column_name'].str.len()

String Concatenation :
 Concatenate strings with other strings or Series.
df['column_name'].str.cat(sep=',')

Substrings :
 Extract substrings using slicing or regular expressions.
df['column_name'].str.slice(start=0, stop=3)
df['column_name'].str.extract(r'(\d+)')
String Splitting :
 Split strings into lists using a delimiter.
df['column_name'].str.split(',')

String Stripping :
 Remove leading and trailing whitespace.
df['column_name'].str.strip()

String Replacement :
 Replace parts of strings with other strings.
df['column_name'].str.replace('old', 'new')

String Counting :
 Count occurrences of a substring.
df['column_name'].str.count('substring')

Checking for Substrings :


 Check if a substring is contained in each string.
df['column_name'].str.contains('substring')

String Alignment :
 Left or right align strings.
df['column_name'].str.ljust(width)
df['column_name'].str.rjust(width)

String Padding :
 Pad strings with a specified character to reach a desired length.
df['column_name'].str.pad(width, side='left', fillchar='0')
These are just some of the string operations available in pandas. They are efficient for working with string
data and can be used to clean and transform text data in your DataFrame.
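
A short sketch applying a few of these vectorized string methods to a made-up column:

import pandas as pd

df = pd.DataFrame({'name': ['  Alice Smith ', 'bob JONES', 'Carol Lee']})

clean = df['name'].str.strip().str.lower()  # trim whitespace and normalise case
print(clean.str.len())                      # length of each cleaned string
print(clean.str.contains('smith'))          # True, False, False
print(clean.str.split().str[0])             # first token: alice, bob, carol
print(clean.str.replace('lee', 'li'))       # simple substring replacement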

Working with time series in pandas

Working with time series data in pandas involves using the DateTime functionality provided by pandas to manipulate, analyze, and visualize data that is indexed by dates or times. Here's a basic overview of working with time series in pandas:

1. Creating a DateTimeIndex :

 Ensure your DataFrame has a DateTimeIndex, which can be set using the pd.to_datetime()
function.
df.index = pd.to_datetime(df.index)

Resampling :
 Use resample() to change the frequency of your time series data (e.g., from daily to
monthly).
df.resample('M').mean()

Indexing and Slicing :


 Use DateTimeIndex to index and slice your data based on dates.
df['2019-01-01':'2019-12-31']

Shifting :

 Use shift() to shift your time series data forward or backward in time.
df.shift(1)
Rolling Windows:
 Use rolling() to calculate rolling statistics (e.g., rolling mean, sum) over a specified window size.
df.rolling(window=3).mean()

Time Zone Handling :

 Use tz_localize() and tz_convert() to handle time zones in your data.


df.tz_localize('UTC').tz_convert('US/Eastern')

Date Arithmetic :
 Perform arithmetic operations with dates, like adding or subtracting time deltas.
df.index + pd.DateOffset(days=1)

Resampling with Custom Functions :

 Use apply() with resample() to apply custom aggregation functions.


df.resample('M').apply(lambda x: x.max() - x.min())

Handling Missing Data :

 Use fillna() or interpolate() to handle missing data in your time series.


df.fillna(method='ffill')

Time Series Plotting :

 Use plot() to easily visualize your time series data.


df.plot()

These are some common operations for working with time series data in pandas. The
functionality in pandas makes it easy to handle and analyze time series data efficiently.
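
A minimal runnable sketch using synthetic daily data to combine several of these operations:

import numpy as np
import pandas as pd

# 60 days of synthetic daily values
idx = pd.date_range('2023-01-01', periods=60, freq='D')
ts = pd.Series(np.random.randn(60).cumsum(), index=idx)

monthly_mean = ts.resample('M').mean()      # downsample to month-end means
rolling_week = ts.rolling(window=7).mean()  # 7-day rolling average
january = ts.loc['2023-01']                 # partial-string indexing for January
lagged = ts.shift(1)                        # lag the series by one day

print(monthly_mean)
print(january.head())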

UNIT IV DATA VISUALIZATION

Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
It can be used to create a wide range of plots and charts, including line plots, bar plots, histograms, scatter
plots, and more. Here's a basic overview of using Matplotlib for plotting:

Installing Matplotlib :
 You can install Matplotlib using pip:
pip install matplotlib

Importing Matplotlib :
 Import the matplotlib.pyplot module, which provides a MATLAB-like plotting interface.
import matplotlib.pyplot as plt

Creating a Simple Plot :

 Use the plot() function to create a simple line plot.


x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y)
plt.show()

Adding Labels and Title :


 Use xlabel() , ylabel() , and title() to add labels and a title to your plot.
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Title')
Customizing Plot Appearance :
 Use various formatting options to customize the appearance of your plot.
plt.plot(x, y, color='red', linestyle='--', marker='o', label='data')
plt.legend()
Creating Multiple Plots :

 Use subplot() to create multiple plots in the same figure.


plt.subplot(2, 1, 1)
plt.plot(x, y)

plt.subplot(2, 1, 2)
plt.scatter(x, y)
Saving Plots:
 Use savefig() to save your plot as an image file (e.g., PNG, PDF, SVG).
plt.savefig('plot.png')

Other Types of Plots:
 Matplotlib supports many other types of plots, including bar plots, histograms, scatter plots, and more.
plt.bar(x, y)
plt.hist(data, bins=10)
plt.scatter(x, y)
Matplotlib provides a wide range of customization options and is highly flexible, making it a powerful tool
for creating publication-quality plots and visualizations in Python.

Simple line plots in Matplotlib


Creating a simple line plot in Matplotlib involves specifying the x-axis and y-axis values and then using the plot() function to create the plot. Here's a basic example:
import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# Create a simple line plot


plt.plot(x, y)

# Add labels and title


plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Simple Line Plot')

# Display the plot


plt.show()
This code will create a simple line plot with the given x and y values, and display it with labeled axes and a title. You can customize the appearance of the plot further by using additional arguments in the plot() function, such as color, linestyle, and marker.

Simple scatter plots in Matplotlib

Creating a simple scatter plot in Matplotlib involves specifying the x-axis and y-axis values and then using the scatter() function to create the plot. Here's a basic example:
import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create a simple scatter plot
plt.scatter(x, y)

# Add labels and title


plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Simple Scatter Plot')

# Display the plot


plt.show()

This code will create a simple scatter plot with the given x and y values, and display it with labeled axes and
a title. You can customize the appearance of the plot further by using additional arguments in the
scatter() function, such as color, s (size of markers), and alpha (transparency).

visualizing errors in Matplotlib


Visualizing errors in Matplotlib can be done using error bars or shaded regions to
represent uncertainty or variability in your data. Here are two common ways to
visualize errors:
1. Error Bars:

 Use the errorbar() function to plot data points with error bars.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
yerr = [0.5, 0.3, 0.7, 0.4, 0.8] # Error values

plt.errorbar(x, y, yerr=yerr, fmt='o', capsize=5)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Error Bar Plot')
plt.show()

2. Shaded Regions:
 Use the fill_between() function to plot shaded regions representing errors or uncertainties.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)


y = np.sin(x)
error = 0.1 # Error value

plt.plot(x, y)
plt.fill_between(x, y - error, y + error, alpha=0.2)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Shaded Error Region')
plt.show()

These examples demonstrate how to visualize errors in your data using Matplotlib. You can adjust the error
values and plot styles to suit your specific needs and data.

density and contour plots in Matplotlib

Density and contour plots are useful for visualizing the distribution and density of data points in a 2D space. Matplotlib provides several functions to create these plots, such as imshow() and hist2d() for density plots and contour() for contour plots. Here's how you can create them:
1. Density Plot (imshow) :
 Use the imshow() function to create a density plot. You can use a 2D histogram or a kernel density
estimation (KDE) to calculate the density.
import numpy as np
import matplotlib.pyplot as plt

# Generate random data


x = np.random.normal(size=1000)
y = np.random.normal(size=1000)
# Create density plot
plt.figure(figsize=(8, 6))
plt.hist2d(x, y, bins=30, cmap='Blues')
plt.colorbar(label='Density')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Density Plot')
plt.show()

Contour Plot (contour) :


 Use the contour() function to create a contour plot. You can specify the number of contour
levels and the colormap.
import numpy as np
import matplotlib.pyplot as plt

# Generate random data


x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(X**2 + Y**2)

# Create contour plot


plt.figure(figsize=(8, 6))
plt.contour(X, Y, Z, levels=20, cmap='RdGy')
plt.colorbar(label='Intensity')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Contour Plot')
plt.show()

These examples demonstrate how to create density and contour plots in Matplotlib. You can customize the
plots by adjusting parameters such as the number of bins, colormap, and contour levels to better visualize
your data.

Histograms in Matplotlib
Histograms are a useful way to visualize the distribution of a single numerical variable. Matplotlib provides the hist() function to create histograms. Here's a basic example:

import numpy as np
import matplotlib.pyplot as plt

# Generate random data


data = np.random.normal(loc=0, scale=1, size=1000)

# Create a histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black')

# Add labels and title


plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Random Data')

# Display the plot


plt.show()
In this example, data is a NumPy array containing random data sampled from a normal distribution. The hist() function creates a histogram with 30 bins, colored in sky blue with black edges. The x-axis represents the values, and the y-axis represents the frequency of each value.

You can customize the appearance of the histogram by adjusting parameters such as bins, color, edgecolor, and adding labels and a title to make the plot more informative.

legends in Matplotlib

Legends in Matplotlib are used to identify different elements of a plot, such as lines,
markers, or colors, and associate them with labels. Here's how you can add legends
to your plots:
1. Basic Legend :
 Use the legend() function to add a legend to your plot. You can specify the labels for each element
in the legend.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [5, 4, 3, 2, 1]

plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()
plt.show()

Customizing Legend Location :


 You can specify the location of the legend using the loc parameter. Common location values
are 'upper left', 'upper right', 'lower left', 'lower right'.
plt.legend(loc='upper left')

Adding Legend Title :

 You can add a title to the legend using the title parameter.
plt.legend(title='Legend Title')

Customizing Legend Labels :


 You can customize the labels in the legend by passing a list of labels to the labels
parameter.
plt.legend(labels=['Label 1', 'Label 2'])

Adding Legend to Specific Elements :


 You can add legends to specific plot elements by passing the label parameter to the plot functions.

plt.plot(x, y1, label='Line 1')


plt.plot(x, y2, label='Line 2')

Multiple Legends :
 Calling legend() again normally replaces the existing legend; to keep more than one legend on the same axes, add the first legend back as an artist.
plt.plot(x, y1)
plt.plot(x, y2)
first_legend = plt.legend(['Line 1', 'Line 2'], loc='upper left')
plt.gca().add_artist(first_legend)
plt.legend(['Line 3', 'Line 4'], loc='lower right')

Removing Legend :
 You can remove an existing legend from your plot by calling plt.gca().get_legend().remove().

These are some common ways to add and customize legends in Matplotlib. Legends
are useful for explaining the components of your plot and making it easier for viewers
to understand the data.

colors in Matplotlib

In Matplotlib, you can specify colors in several ways, including using predefined color
names, RGB or RGBA tuples, hexadecimal color codes, and more. Here's how you can
specify colors in Matplotlib:

1. Predefined Color Names :


 Matplotlib provides a set of predefined color names, such as 'red', 'blue', 'green', etc.
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color='red') # Plot with red color


plt.show()

RGB or RGBA Tuples:


 You can specify colors using RGB or RGBA tuples, where each value ranges from 0 to 1.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color=(0.1, 0.2, 0.5)) # Plot with RGB color
plt.show()

Hexadecimal Color Codes :


 You can also specify colors using hexadecimal color codes.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color='#FF5733') # Plot with hexadecimal color
plt.show()

Short Color Codes :


 Matplotlib also supports short color codes, such as 'r' for red, 'b' for blue, 'g' for green, etc.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color='g') # Plot with green color
plt.show()
Color Maps :
 You can use color maps (colormaps) to automatically assign colors based on a range of values.

import numpy as np

x = np.linspace(0, 10, 100)


y = np.sin(x)

plt.scatter(x, y, c=x, cmap='viridis') # Scatter plot with colormap


plt.colorbar() # Add colorbar to show the mapping
plt.show()

These are some common ways to specify colors in Matplotlib. Using colors effectively can enhance the
readability and visual appeal of your plots.

subplots in Matplotlib

Subplots in Matplotlib allow you to create multiple plots within the same figure. You can arrange subplots
in a grid-like structure and customize each subplot independently. Here's a basic example of creating
subplots:
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting


x = np.linspace(0, 2*np.pi, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create a figure and a grid of subplots


fig, axs = plt.subplots(2, 1, figsize=(8, 6))

# Plot data on the first subplot


axs[0].plot(x, y1, label='sin(x)', color='blue')
axs[0].set_title('Plot of sin(x)')
axs[0].legend()

# Plot data on the second subplot


axs[1].plot(x, y2, label='cos(x)', color='red')
axs[1].set_title('Plot of cos(x)')
axs[1].legend()

# Adjust layout and display the plot


plt.tight_layout()
plt.show()

In this example, plt.subplots(2, 1) creates a figure with 2 rows and 1 column of subplots. The axs variable is a NumPy array containing the axes objects for each subplot. You can then use these axes objects to plot data and customize each subplot independently.
text and annotation in Matplotlib

Text and annotations in Matplotlib are used to add descriptive text, labels, and
annotations to your plots. Here's how you can add text and annotations:

1. Adding Text :

 Use the text() function to add text at a specific location on the plot.
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])


plt.text(2, 10, 'Example Text', fontsize=12, color='red')
plt.show()

Adding Annotations:
 Use the annotate() function to add annotations with arrows pointing to specific points on the plot.
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])


plt.annotate('Example Annotation', xy=(2, 4), xytext=(3, 8),
arrowprops=dict(facecolor='black', shrink=0.05))
plt.show()

Customizing Text Properties :


 You can customize the appearance of text and annotations using various properties like
fontsize , color , fontstyle , fontweight , etc.
plt.text(2, 10, 'Example Text', fontsize=12, color='red', fontstyle='italic', fontweight='bold')

Text Alignment :

 Use the ha and va parameters to specify horizontal and vertical alignment of text.
plt.text(2, 10, 'Example Text', ha='center', va='top')

Adding Mathematical Expressions :


 You can use LaTeX syntax to include mathematical expressions in text and annotations.
plt.text(2, 10, r'$\alpha > \beta$', fontsize=12)
Rotating Text:
 Use the rotation parameter to rotate text by a given angle (in degrees).
plt.text(2, 10, 'Example Text', rotation=45)

Adding Background Color :

 Use the bbox parameter to add a background color to text.


plt.text(2, 10, 'Example Text', bbox=dict(facecolor='red', alpha=0.5))
These are some common techniques for adding text and annotations to your plots in Matplotlib. They can be
useful for providing additional information and context to your visualizations.

customization in Matplotlib

Customization in Matplotlib allows you to control various aspects of your plots, such
as colors, line styles, markers, fonts, and more. Here are some common
customization options:
1. Changing Figure Size :
 Use figsize in plt.subplots() or plt.figure() to set the size of the figure.
fig, ax = plt.subplots(figsize=(8, 6))

Changing Line Color, Style, and Width :


 Use color , linestyle , and linewidth parameters in plot functions to customize the lines.
plt.plot(x, y, color='red', linestyle='--', linewidth=2)

Changing Marker Style and Size :


 Use marker, markersize, and markerfacecolor parameters to customize markers in scatter plots.

plt.scatter(x, y, marker='o', s=100, c='blue')

Setting Axis Limits :

 Use xlim() and ylim() to set the limits of the x and y axes.
plt.xlim(0, 10)
plt.ylim(0, 20)

Setting Axis Labels and Title :

 Use xlabel() , ylabel() , and title() to set axis labels and plot title.
plt.xlabel('X-axis Label', fontsize=12)
plt.ylabel('Y-axis Label', fontsize=12)
plt.title('Plot Title', fontsize=14)

Changing Tick Labels :

 Use xticks() and yticks() to set custom tick labels on the x and y axes.
plt.xticks([1, 2, 3, 4, 5], ['A', 'B', 'C', 'D', 'E'])

Adding Gridlines:
 Use grid() to add gridlines to the plot.
plt.grid(True)

Changing Font Properties :

 Use fontdict parameter in text functions to set font properties.


plt.text(2, 10, 'Example Text', fontdict={'family': 'serif', 'color': 'blue', 'size': 12})
Adding Legends :

 Use legend() to add a legend to the plot.


plt.legend(['Line 1', 'Line 2'], loc='upper left')
These are some common customization options in Matplotlib. You can combine these options to create
highly customized and visually appealing plots for your data.

three dimensional plotting in Matplotlib

Matplotlib provides a toolkit called mplot3d for creating 3D plots. You can create 3D scatter plots, surface
plots, wireframe plots, and more. Here's a basic example of creating a 3D scatter plot:

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Generate random data


x = np.random.normal(size=500)
y = np.random.normal(size=500)
z = np.random.normal(size=500)

# Create a 3D scatter plot


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, c='b', marker='o')

# Set labels and title


ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
ax.set_title('3D Scatter Plot')

# Show plot
plt.show()
In this example, fig.add_subplot(111, projection='3d') creates a 3D subplot, and
ax.scatter(x, y, z, c='b', marker='o') creates a scatter plot in 3D space. You can customize
the appearance of the plot by changing parameters such as c (color), marker, and adding labels and a title.

You can also create surface plots and wireframe plots using the plot_surface() and

plot_wireframe() functions, respectively. Here's an example of a 3D surface plot:

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Generate data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))

# Create a 3D surface plot


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')

# Set labels and title


ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
ax.set_title('3D Surface Plot')

# Show plot
plt.show()

These examples demonstrate how to create basic 3D plots in Matplotlib. You can explore the mplot3d
toolkit and its functions to create more advanced 3D visualizations.

Geographic Data with Basemap in Matplotlib


Basemap is a toolkit for Matplotlib that allows you to create maps and plot geographic data. It provides
various map projections and features for customizing maps. Here's a basic example of plotting geographic
data using Basemap:

import matplotlib.pyplot as plt


from mpl_toolkits.basemap import Basemap

# Create a map
plt.figure(figsize=(10, 6))
m = Basemap(projection='mill',llcrnrlat=-90,urcrnrlat=90,\
llcrnrlon=-180,urcrnrlon=180,resolution='c')
m.drawcoastlines()
m.drawcountries()
m.fillcontinents(color='lightgray',lake_color='aqua')
m.drawmapboundary(fill_color='aqua')

# Plot cities
lons = [-77.0369, -122.4194, 120.9660, -0.1276]
lats = [38.9072, 37.7749, 14.5995, 51.5074]
cities = ['Washington, D.C.', 'San Francisco', 'Manila', 'London']
x, y = m(lons, lats)
m.scatter(x, y, marker='o', color='r')

# Add city labels


for city, xpt, ypt in zip(cities, x, y):
plt.text(xpt+50000, ypt+50000, city, fontsize=10, color='blue')

# Add a title
plt.title('Cities Around the World')

# Show the map


plt.show()
In this example, we first create a Basemap instance with the desired projection and
map extent. We then draw coastlines, countries, continents, and a map boundary.
Next, we plot cities on the map using the scatter() method and add labels for each city using plt.text(). Finally, we add a title to the plot and display the map.

Basemap offers a wide range of features for working with geographic data, including
support for various map projections, drawing political boundaries, and plotting points,
lines, and shapes on maps. You can explore the Basemap documentation for more
advanced features and customization options.

Visualization with Seaborn

Seaborn is a Python visualization library based on Matplotlib that provides a high-


level interface for creating attractive and informative statistical graphics. It is
particularly useful for visualizing data from Pandas DataFrames and NumPy arrays.
Seaborn simplifies the process of creating complex visualizations such as categorical
plots, distribution plots, and relational plots. Here's a brief overview of some of the
key features of Seaborn:

1. Installation :
 You can install Seaborn using pip:
pip install seaborn

Importing Seaborn :
 Import Seaborn as sns conventionally:
import seaborn as sns

Loading Example Datasets :


 Seaborn provides several built-in datasets for practice and exploration:
tips = sns.load_dataset('tips')

Categorical Plots :
 Seaborn provides several functions for visualizing categorical data, such as sns.catplot(),
sns.barplot() , sns.countplot() , and sns.boxplot() .
sns.catplot(x='day', y='total_bill', data=tips, kind='box')

Distribution Plots :
 Seaborn offers various functions for visualizing distributions, including sns.histplot(), sns.kdeplot(), and sns.displot() (the older sns.distplot() is deprecated in recent Seaborn releases).
sns.histplot(tips['total_bill'])
Relational Plots :
 Seaborn provides functions for visualizing relationships between variables, such as
sns.relplot() , sns.scatterplot() , and sns.lineplot() .
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter')
Heatmaps :

 Seaborn can create heatmaps to visualize matrix-like data using sns.heatmap() .


flights = sns.load_dataset('flights').pivot(index='month', columns='year', values='passengers')
sns.heatmap(flights, annot=True, fmt='d')

Pairplots :

 Pairplots are useful for visualizing pairwise relationships in a dataset using sns.pairplot() .
sns.pairplot(tips, hue='sex')

Styling and Themes:
 Seaborn allows you to customize the appearance of plots using styling functions (sns.set_style(), sns.set_context(), sns.set()) and themes (sns.set_theme()).

Other Plots:
 Seaborn offers many other types of plots and customization options. The official Seaborn documentation provides detailed examples and explanations for each type of plot.
Seaborn is built on top of Matplotlib and integrates well with Pandas, making it a
powerful tool for visualizing data in Python.

UNIT V HANDLING LARGE DATA

techniques for handling large volumes of data

Handling large volumes of data requires a combination of techniques to efficiently


process, store, and analyze the data. Some common techniques include:

1. Distributed computing: Using frameworks like Apache Hadoop and Apache Spark
to distribute data processing tasks across multiple nodes in a cluster, allowing for
parallel processing of large datasets.
2. Data compression: Compressing data before storage or transmission to reduce
the amount of space required and improve processing speed.
3. Data partitioning: Dividing large datasets into smaller, more manageable partitions
based on certain criteria (e.g., range, hash value) to improve processing efficiency.
4. Data deduplication: Identifying and eliminating duplicate data to reduce storage
requirements and improve data processing efficiency.
5. Database sharding: Partitioning a database into smaller, more manageable parts
called shards, which can be distributed across multiple servers for improved
scalability and performance.
6. Stream processing: Processing data in real-time as it is generated, allowing for
immediate analysis and decision-making.
7. In-memory computing: Storing data in memory instead of on disk to improve
processing speed, particularly for frequently accessed data.
8. Parallel processing: Using multiple processors or cores to simultaneously execute
data processing tasks, improving processing speed for large datasets.
9. Data indexing: Creating indexes on data fields to enable faster data retrieval,
especially for queries involving large datasets.
10. Data aggregation: Combining multiple data points into a single, summarized value
to reduce the overall volume of data while retaining important information.

These techniques can be used individually or in combination to handle large volumes
of data effectively and efficiently.
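
As a small illustration of two of these techniques (compression and partitioning, plus a
simple aggregation), the sketch below writes a synthetic dataset to compressed, partitioned
Parquet files using Pandas. The column names and output directory are made up for the
example, and the pyarrow package is assumed to be installed:

import pandas as pd
import numpy as np

# Hypothetical dataset: one million sensor readings spread over several years
df = pd.DataFrame({
    'year': np.random.choice([2021, 2022, 2023], size=1_000_000),
    'sensor_id': np.random.randint(0, 100, size=1_000_000),
    'reading': np.random.randn(1_000_000),
})

# Data compression + partitioning: write the data as Parquet files,
# compressed with Snappy and split into one partition per year
df.to_parquet('sensor_data', partition_cols=['year'], compression='snappy')

# Data aggregation: reduce the volume by summarizing readings per sensor
summary = df.groupby('sensor_id')['reading'].agg(['mean', 'count'])
print(summary.head())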

Programming tips for dealing with large data sets

When dealing with large datasets in programming, it's important to use efficient
techniques to manage memory, optimize processing speed, and avoid common
pitfalls. Here are some programming tips for dealing with large datasets:

1. Use efficient data structures: Choose data structures that are optimized for the
operations you need to perform. For example, use hash maps for fast lookups,
arrays for sequential access, and trees for hierarchical data.
2. Lazy loading: Use lazy loading techniques to load data into memory only when it
is needed, rather than loading the entire dataset at once. This can help reduce
memory usage and improve performance.
3. Batch processing: Process data in batches rather than all at once, especially for
operations like data transformation or analysis. This can help avoid memory issues
and improve processing speed.
4. Use streaming APIs: Use streaming APIs and libraries to process data in a
streaming fashion, which can be more memory-efficient than loading the entire
dataset into memory.
5. Indexing and caching: Use indexes and caching to optimize data access, especially
for large datasets, to reduce the time it takes to access and retrieve data.
6. Parallel processing: Use parallel processing techniques, such as multithreading or
multiprocessing, to speed up data processing tasks across multiple CPU cores.
7. Use efficient algorithms: Choose algorithms that are optimized for large datasets,
such as sorting algorithms that use divide and conquer techniques or algorithms that
can be parallelized.
8. Optimize I/O operations: Minimize I/O operations and use buffered I/O where possible
to reduce the overhead of reading and writing data to disk.
9. Monitor memory usage: Keep an eye on memory usage and optimize your code to
minimize memory leaks and excessive memory consumption.
10. Use external storage solutions: For extremely large datasets that cannot fit into
memory, consider using external storage solutions such as databases or distributed
file systems.
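
The following sketch illustrates the batch-processing and lazy-loading tips using Pandas'
chunked CSV reader. The file name 'large_dataset.csv' and the 'amount' column are
hypothetical placeholders:

import pandas as pd

# Batch processing / lazy loading: read a large CSV file in chunks of
# 100,000 rows instead of loading it all into memory at once.
total_rows = 0
running_sum = 0.0

for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    # Process each batch independently, e.g. accumulate a running total
    total_rows += len(chunk)
    running_sum += chunk['amount'].sum()

print('Rows processed:', total_rows)
print('Average amount:', running_sum / total_rows)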

Case studies: Predicting malicious URLs

Predicting malicious URLs is a critical task in cybersecurity to protect users from
phishing attacks, malware distribution, and other malicious activities. Machine
learning models can be used to classify URLs as either benign or malicious based on
features such as URL length, domain age, presence of certain keywords, and
historical data. Here are two case studies that demonstrate how machine learning
can be used to predict malicious URLs:

1. Google Safe Browsing:


 Google Safe Browsing is a service that helps protect users from malicious websites by
identifying and flagging unsafe URLs.
 The service uses machine learning models to analyze URLs and classify them as safe or
unsafe.
 Features used in the model include URL length, domain reputation, presence of suspicious
keywords, and similarity to known malicious URLs.
 The model is continuously trained on new data to improve its accuracy and effectiveness.

2. Microsoft SmartScreen:
 Microsoft SmartScreen is a feature in Microsoft Edge and Internet Explorer browsers that helps
protect users from phishing attacks and malware.
 SmartScreen uses machine learning models to analyze URLs and determine their safety.
 The model looks at features such as domain reputation, presence of phishing keywords, and
similarity to known malicious URLs.
 SmartScreen also leverages data from the Microsoft Defender SmartScreen service to
improve its accuracy and coverage.

In both cases, machine learning is used to predict the likelihood that a given URL is
malicious based on various features and historical data. These models help protect
users from online threats and improve the overall security of the web browsing
experience.
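
The sketch below is a highly simplified illustration of the general idea, not the actual
approach used by Google Safe Browsing or Microsoft SmartScreen. It builds a tiny
logistic-regression classifier on hand-crafted URL features; the keyword list, example
URLs, and labels are all made up:

import re
import numpy as np
from sklearn.linear_model import LogisticRegression

SUSPICIOUS_WORDS = ['login', 'verify', 'update', 'free', 'bank']

def url_features(url):
    """Turn a URL into a small numeric feature vector."""
    return [
        len(url),                                              # URL length
        url.count('.'),                                        # number of dots
        sum(c.isdigit() for c in url),                         # digit count
        int(any(w in url.lower() for w in SUSPICIOUS_WORDS)),  # suspicious keyword flag
        int(bool(re.match(r'^https://', url))),                # uses HTTPS
    ]

# Tiny, made-up training set: 1 = malicious, 0 = benign
urls = [
    'https://www.example.com/about',
    'http://free-bank-login.verify-account.ru/update',
    'https://docs.python.org/3/library/re.html',
    'http://192.168.10.5/login.php?free=1',
]
labels = [0, 1, 0, 1]

X = np.array([url_features(u) for u in urls])
model = LogisticRegression().fit(X, labels)

# Score a new URL (estimated probability of being malicious)
new_url = 'http://secure-update-bank.example.ru/verify'
print(model.predict_proba([url_features(new_url)])[0][1])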
Case studies: Building a recommender system
Building a recommender system involves predicting the "rating" or "preference" that
a user would give to an item. These systems are widely used in e-commerce, social
media, and content streaming platforms to personalize recommendations for users.
Here are two case studies that demonstrate how recommender systems can be built:

1. Netflix Recommendation System:


 Netflix uses a recommendation system to suggest movies and TV shows to its users.
 The system uses collaborative filtering, which involves analyzing user behavior (e.g., viewing
history, ratings) to identify patterns and make recommendations.
 Netflix also incorporates content-based filtering, which considers the characteristics of the items
(e.g., genre, cast, director) to make recommendations.
 The system uses machine learning algorithms such as matrix factorization and deep learning to
improve the accuracy of its recommendations.
 Netflix continuously collects data on user interactions and feedback to refine its
recommendation algorithms.

2. Amazon Product Recommendation System:


 Amazon uses a recommendation system to suggest products to its customers based on their
browsing and purchase history.
 The system uses collaborative filtering to identify products that are popular among similar users.
 Amazon also uses item-to-item collaborative filtering, which recommends products that are similar
to those that a user has previously viewed or purchased.
 The system incorporates user feedback and ratings to improve the relevance of its
recommendations.
 Amazon's recommendation system is powered by machine learning algorithms that analyze large
amounts of data to make personalized recommendations.

In both cases, the recommendation systems use machine learning and data analysis
techniques to analyze user behavior and make personalized recommendations. These
systems help improve user engagement, increase sales, and enhance the overall user
experience.
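
As an illustration of the item-to-item collaborative filtering idea (not Netflix's or
Amazon's actual systems), the sketch below computes cosine similarity between items in a
small, made-up rating matrix and recommends items similar to one the user liked:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = not rated)
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=['user1', 'user2', 'user3', 'user4'],
    columns=['item_a', 'item_b', 'item_c', 'item_d'],
)

# Item-to-item collaborative filtering: similarity between item rating columns
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# Recommend items most similar to one the user liked, excluding the item itself
liked_item = 'item_a'
recommendations = item_sim[liked_item].drop(liked_item).sort_values(ascending=False)
print(recommendations.head(2))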

Tools and techniques needed for dealing with large data

Dealing with large datasets requires a combination of tools and techniques to
manage, process, and analyze the data efficiently. Here are some key tools and
techniques:
1. Big Data Frameworks: Frameworks such as Apache Hadoop, Apache Spark, and Apache Flink
provide tools for distributed storage and processing of large datasets.
2. Data Storage: Use of distributed file systems like Hadoop Distributed File System (HDFS),
cloud storage services like Amazon S3, or NoSQL databases like Apache Cassandra or MongoDB
for storing large volumes of data.
3. Data Processing: Techniques such as MapReduce, Spark RDDs, and Spark DataFrames for
parallel processing of data across distributed computing clusters.
4. Data Streaming: Tools like Apache Kafka or Apache Flink for processing real-time streaming data.
5. Data Compression: Techniques like gzip, Snappy, or Parquet for compressing data to reduce
storage requirements and improve processing speed.
6. Data Partitioning: Divide large datasets into smaller, more manageable partitions based on
certain criteria to improve processing efficiency.
7. Distributed Computing: Use of cloud computing platforms like Amazon Web Services (AWS),
Google Cloud Platform (GCP), or Microsoft Azure for scalable and cost-effective processing of
large datasets.
8. Data Indexing: Create indexes on data fields to enable faster data retrieval, especially for
queries involving large datasets.
9. Machine Learning: Use of machine learning algorithms and libraries (e.g., scikit-learn,
TensorFlow) for analyzing and deriving insights from large datasets.
10. Data Visualization: Tools like Matplotlib, Seaborn, or Tableau for visualizing large datasets
to gain insights and make data-driven decisions.

By leveraging these tools and techniques, organizations can effectively manage and
analyze large volumes of data to extract valuable insights and drive informed
decision-making.
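
For example, a distributed aggregation with Apache Spark might look like the minimal
PySpark sketch below. The input file 'large_dataset.csv' and the 'category' column are
hypothetical, and the pyspark package is assumed to be installed:

from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.appName('LargeDataExample').getOrCreate()

# Read a hypothetical large CSV file; Spark splits the work across partitions
df = spark.read.csv('large_dataset.csv', header=True, inferSchema=True)

# Distributed aggregation: count records per category in parallel
df.groupBy('category').count().show()

spark.stop()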

Data preparation for dealing with large data


Data preparation is a crucial step in dealing with large datasets, as it ensures that the
data is clean, consistent, and ready for analysis. Here are some key steps involved in
data preparation for large datasets:

1. Data Cleaning: Remove or correct any errors or inconsistencies in the data, such as
missing values, duplicate records, or outliers.
2. Data Integration: Combine data from multiple sources into a single dataset, ensuring
that the data is consistent and can be analyzed together.
3. Data Transformation: Convert the data into a format that is suitable for
analysis, such as converting categorical variables into numerical ones or
normalizing numerical variables.
4. Data Reduction: Reduce the size of the dataset by removing unnecessary features
or aggregating data to a higher level of granularity.
5. Data Sampling: If the dataset is too large to analyze in its entirety, use sampling
techniques to extract a representative subset of the data for analysis.
6. Feature Engineering: Create new features from existing ones to improve the
performance of machine learning models or better capture the underlying
patterns in the data.
7. Data Splitting: Split the dataset into training, validation, and test sets to evaluate
model performance reliably.
8. Data Visualization: Visualize the data to explore its characteristics and identify any
patterns or trends that may be present.
9. Data Security: Ensure that the data is secure and protected from unauthorized access
or loss, especially when dealing with sensitive information.
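
A minimal Pandas sketch of several of these steps (cleaning, transformation, and sampling)
is shown below; the file 'raw_data.csv' and the 'age' and 'gender' columns are hypothetical:

import pandas as pd

# Hypothetical raw dataset with missing values and duplicates
raw = pd.read_csv('raw_data.csv')

# Data cleaning: drop duplicate records and fill missing numeric values
clean = raw.drop_duplicates()
clean = clean.fillna({'age': clean['age'].median()})

# Data transformation: encode a categorical column as numeric dummy variables
clean = pd.get_dummies(clean, columns=['gender'])

# Data reduction / sampling: work with a 10% random sample of a very large table
sample = clean.sample(frac=0.1, random_state=42)

print(sample.shape)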

Model building for dealing with large data


When building models for large datasets, it's important to consider scalability,
efficiency, and performance. Here are some key techniques and considerations for
model building with large data:
1. Use Distributed Computing: Utilize frameworks like Apache Spark or TensorFlow with distributed
computing capabilities to process large datasets in parallel across multiple nodes.
2. Feature Selection: Choose relevant features and reduce the dimensionality of the dataset
to improve model performance and reduce computation time.
3. Model Selection: Use models that are scalable and efficient for large datasets, such as
gradient boosting machines, random forests, or deep learning models.
4. Batch Processing: If real-time processing is not necessary, consider batch processing
techniques to handle large volumes of data in scheduled intervals.
5. Sampling: Use sampling techniques to create smaller subsets of the data for model building
and validation, especially if the entire dataset cannot fit into memory.
6. Incremental Learning: Implement models that can be updated incrementally as new data
becomes available, instead of retraining the entire model from scratch.
7. Feature Engineering: Create new features or transform existing features to better represent
the underlying patterns in the data and improve model performance.
8. Model Evaluation: Use appropriate metrics to evaluate model performance, considering the
trade-offs between accuracy, scalability, and computational resources.
9. Parallelization: Use parallel processing techniques within the model training process to
speed up computations, such as parallelizing gradient computations in deep learning models.
10. Data Partitioning: Partition the data into smaller subsets for training and validation to
improve efficiency and reduce memory requirements.

By employing these techniques, data scientists and machine learning engineers can
build models that are scalable, efficient, and capable of handling large datasets
effectively.
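
As an example of the incremental-learning idea, the sketch below trains a scikit-learn
SGDClassifier chunk by chunk with partial_fit(), so the full dataset never has to be held
in memory at once. The file and column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Incremental learning: update a linear classifier one chunk at a time
model = SGDClassifier()
classes = np.array([0, 1])  # all class labels must be declared up front

for chunk in pd.read_csv('large_training_data.csv', chunksize=50_000):
    X = chunk[['feature1', 'feature2']].to_numpy()
    y = chunk['label'].to_numpy()
    model.partial_fit(X, y, classes=classes)

print('Model trained incrementally on all chunks')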

Presentation and automation for dealing with large data


Presentation and automation are key aspects of dealing with large datasets
to effectively communicate insights and streamline data processing tasks.
Here are some strategies for presentation and automation:

1. Visualization: Use data visualization tools like Matplotlib, Seaborn, or Tableau to create
visualizations that help stakeholders understand complex patterns and trends in the data.
2. Dashboarding: Build interactive dashboards using tools like Power BI or Tableau that
allow users to explore the data and gain insights in real-time.
3. Automated Reporting: Use tools like Jupyter Notebooks or R Markdown to create automated
reports that can be generated regularly with updated data.
4. Data Pipelines: Implement data pipelines using tools like Apache Airflow or Luigi to
automate data ingestion, processing, and analysis tasks.
5. Model Deployment: Use containerization technologies like Docker to deploy machine
learning models as scalable and reusable components.
6. Monitoring and Alerting: Set up monitoring and alerting systems to track the performance
of data pipelines and models, and to be notified of any issues or anomalies.
7. Version Control: Use version control systems like Git to track changes to your data
processing scripts and models, enabling collaboration and reproducibility.
8. Cloud Services: Leverage cloud services like AWS, Google Cloud Platform, or Azure for
scalable storage, processing, and deployment of large datasets and models.

By incorporating these strategies, organizations can streamline their data
processes, improve decision-making, and derive more value from their large
datasets.
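
A very small automated-reporting sketch is shown below: it summarizes a hypothetical daily
sales file and saves an HTML table and a chart that a scheduler such as cron or Apache
Airflow could regenerate on a regular schedule. The file names and column names are made up:

import pandas as pd
import matplotlib.pyplot as plt

# Load the latest data and compute a summary per region
df = pd.read_csv('daily_sales.csv')
summary = df.groupby('region')['revenue'].sum()

# Save the summary table for the report
summary.to_frame('total_revenue').to_html('sales_report.html')

# Save a bar chart of revenue by region
summary.plot(kind='bar', title='Revenue by Region')
plt.tight_layout()
plt.savefig('sales_report.png')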
