DSF Notes
30 PERIODS
PRACTICAL EXERCISES: 30 PERIODS
LAB EXERCISES
1. Download, install and explore the features of Python for data analytics.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Basic plots using Matplotlib
5. Statistical and Probability measures a) Frequency distributions b) Mean, Mode, Standard Deviation c)
Variability d) Normal curves e) Correlation and scatter plots f) Correlation coefficient g) Regression
6. Use the standard benchmark data set for performing the following: a) Univariate Analysis: Frequency,
Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis. b) Bivariate Analysis: Linear
and logistic regression modelling.
7. Apply supervised learning algorithms and unsupervised learning algorithms on any data set.
8. Apply and explore various plotting functions on any data set.
Note: Use example data sets such as the UCI repository data sets, Iris, Pima Indians Diabetes, etc.
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Gain knowledge on the data science process.
CO2: Perform data manipulation functions using NumPy and Pandas.
CO3: Understand different types of machine learning approaches.
CO4: Perform data visualization using tools.
CO5: Handle large volumes of data in practical scenarios.
TOTAL: 60 PERIODS
TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016.
2. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
REFERENCES
1. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
2. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
UNIT I NOTES
UNIT I : Introduction
Syllabus
Data Science: Benefits and uses - Facets of data - Defining research goals - Retrieving data - Data preparation
- Exploratory data analysis - Build the model - Presenting findings and building applications - Data
warehousing - Basic statistical descriptions of data.
Data Science
• Data is measurable units of information gathered or captured from activity of people, places and things.
• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of
data. At its core, Data Science aims to discover and extract actionable knowledge from data that can be used
to make sound business decisions and predictions. Data science combines math and statistics, specialized
programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization's data.
• Data science uses advanced analytical theory and methods such as time series analysis to predict the future.
Instead of merely reporting how many products were sold in the previous quarter, data science uses that
historical data to forecast future product sales and revenue more accurately.
• Data science is devoted to the extraction of clean information from raw data to form actionable insights.
Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio and
more to produce artificial intelligence systems to perform tasks that ordinarily require human intelligence.
• The data science field is growing rapidly and revolutionizing so many industries. It has incalculable
benefits in business, research and our everyday lives.
• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data and
they often use advanced algorithms and implement machine learning models to help businesses and
organizations make accurate assessments and predictions.
Data science and big data evolved from statistics and traditional data management but are now considered to
be distinct disciplines.
Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data extraction.
2. Maintain: Data warehousing, data cleansing, data staging, data processing and data architecture.
3. Process: Data mining, clustering and classification, data modeling and data summarization.
4. Analyze: Exploratory and confirmatory analysis, predictive analysis, regression, text mining and
qualitative analysis.
5. Communicate: Data reporting, data visualization, business intelligence and decision making.
Big Data
• Big data can be defined as very large volumes of data available at various sources, in varying degrees of
complexity, generated at different speed i.e. velocities and varying degrees of ambiguity, which cannot be
processed using traditional technologies, processing methods, algorithms or any commercial off-the-shelf
solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially
with time. In short, such data are so large and complex that none of the traditional data management tools
can store or process them efficiently.
Facets of Data
• Big data and data science generate very large amounts of data. These data come in various types, and the
main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a row and column format, which makes it easy for applications to retrieve
and process the data. Database management systems are used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most
common form of structured data or records is a database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within content. Structured data is understood by
computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for
unstructured data, so it is difficult to retrieve the required information. Unstructured data has no
identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio,
video and images. Email is an example of unstructured data.
• Even today, in most organizations more than 80% of the data is in unstructured form. It carries a lot of
information, but extracting information from such varied sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences, then apply
meaning and understanding to that information. This helps machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many modern real-world
applications. The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion and sentiment analysis.
•For natural language processing to help machines understand human language, it must go through speech
recognition, natural language understanding and machine translation. It is an iterative process comprised of
several layers of text analysis.
Graph-Based or Network Data
• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is
because it is capable of bypassing the building of an actual visual representation of the data to run directly
on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are
trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and
moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs,
can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use case.
• It is important to remark that multimedia data is one of the most important sources of information and
knowledge; the integration, transformation and indexing of multimedia data bring significant challenges in
data management and analysis. Many challenges have to be addressed including big data, multidisciplinary
nature of Data Science and heterogeneity.
• Data Science is playing an important role to address these challenges in multimedia data. Multimedia data
usually contains various forms of media, such as text, image, video, geographic coordinates and even pulse
waveforms, which come from multiple sources. Data Science can be a key instrument covering big data,
machine learning and data mining solutions to store, handle and analyze such heterogeneous data.
Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which typically send
the data records simultaneously and in small sizes (on the order of kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers using your mobile
or web applications, ecommerce purchases, in-game player activity, information from social networks,
financial trading floors or geospatial services and telemetry from connected devices or instrumentation in
data centers.
Difference between Structured and Unstructured Data
• Structured data follows a predefined row-and-column format; unstructured data has no identifiable structure.
• Structured data is stored in database management systems and is easily searchable by data type; unstructured
data (text, email, audio, video, images) is harder to store, search and retrieve.
• Structured data is readily processed by traditional tools, whereas extracting information from unstructured
sources is a much bigger challenge.
Retrieving Data
• Retrieving the required data is the second phase of a data science project. Sometimes data scientists need to
go into the field and design a data collection process. Many companies will have already collected and stored
the data, and what they don't have can often be bought from third parties.
• Most of the high quality data is freely available for public and commercial use. Data can be stored in
various format. It is in text file format and tables in database. Data may be internal or external.
1. Start working on internal data, i.e. data stored within the company
• The first step for data scientists is to verify the internal data. Assess the relevance and quality of the data
that is readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning
work may already be done. This data can be stored in official data repositories such as databases, data marts,
data warehouses and data lakes maintained by a team of IT professionals.
• Data repository is also known as a data library or data archive. This is a general term to refer to a data set
isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure,
several databases that collect, manage and store data sets for data analysis, sharing and reporting.
• Data repository can be used to describe several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from multiple sources or segments
of a business, without the data being necessarily related.
b) Data lake is a large data repository that stores unstructured data that is classified and tagged with
metadata.
c) Data marts are subsets of the data repository. These data marts are more targeted to what the data user
needs and easier to use.
d) Metadata repositories store data about data and databases. The metadata explains where the data came
from, how it was captured and what it represents.
e) Data cubes are lists of data with three or more dimensions stored as a table.
Data Preparation
• Data preparation means cleansing, integrating and transforming the data.
Data Cleaning
• Data is cleansed through processes such as filling in missing values, smoothing the noisy data or
resolving the inconsistencies in the data.
• Data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
• Data cleaning is the first step in data pre-processing. It is used to fill in missing values, smooth noisy data,
recognize outliers and correct inconsistencies.
• Missing values: Such dirty data will affect the mining procedure and lead to unreliable and poor output, so
data cleaning routines are important. For example, suppose that the average salary of staff is Rs. 65,000; this
value can be used to replace a missing value for salary.
• Data entry errors: Data collection and data entry are error-prone processes. They often require human
intervention and because humans are only human, they make typos or lose their concentration for a second
and introduce an error into the chain. But data collected by machines or computers isn't free from errors
either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure.
Examples of errors originating from machines are transmission errors or bugs in the extract, transform and
load phase (ETL).
• Whitespace error: Whitespaces tend to be hard to detect but cause errors like other redundant characters
would. To remove the spaces present at start and end of the string, we can use strip() function on the string
in Python.
• Fixing capital letter mismatches: Capital letter mismatches are common problem. Most programming
languages make a distinction between "Chennai" and "chennai".
• Python provides string conversion functions such as lower() and upper(). The lower() function converts the
input string to lowercase, and the upper() function converts the input string to uppercase.
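• The following minimal sketch (the city names are made up for illustration) shows how strip(), lower() and upper() can be used to clean whitespace and capitalisation problems:

cities = ["  Chennai ", "chennai", "CHENNAI  "]
# strip() removes leading/trailing spaces; lower() forces a single capitalisation
cleaned = [c.strip().lower() for c in cities]
print(cleaned)            # ['chennai', 'chennai', 'chennai']
print(cities[0].upper())  # '  CHENNAI ' - upper() alone does not remove spaces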
Outlier
• Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data.
The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
• Fig. 1.6.1 shows outliers detection. Here O1 and O2 seem outliers from the rest.
• An outlier may be defined as a piece of data or observation that deviates drastically from the given norm
or average of the data set. An outlier may be caused simply by chance, but it may also indicate
measurement error or that the given data set has a heavy-tailed distribution.
• Outlier analysis and detection has various applications in numerous fields such as fraud detection, credit
card, discovering computer intrusion and criminal behaviours, medical and public health outlier detection,
industrial damage detection.
• General idea of application is to find out data which deviates from normal behaviour of data set.
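• As a rough illustration, the hypothetical salary values below contain two extreme points; inspecting the minimum and maximum, or drawing a box plot with Matplotlib, is enough to reveal them:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical salaries (in thousands); 300 and 5 play the role of outliers O1 and O2.
salaries = np.array([62, 65, 61, 64, 300, 63, 66, 5, 65, 62])
print(salaries.min(), salaries.max())   # 5 300 - already hints at outliers

plt.boxplot(salaries)                   # the box plot shows the outlying points directly
plt.title("Box plot of salaries")
plt.show()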
2. Appending tables
• Appending a table is effectively adding observations from one table to another. Suppose Table 1 contains an
x3 value of 3 and Table 2 contains an x3 value of 33. The result of appending these tables is a larger table
with the observations from Table 1 as well as Table 2. The equivalent operation in set theory is the union,
and this is also the command in SQL, the common language of relational databases. Other set operators, such
as set difference and intersection, are also used in data science.
3. Using views to simulate data joins and appends
• Duplication of data is avoided by using a view instead of an append. An appended table requires more
storage space; if the table size is in terabytes, duplicating the data becomes problematic. For this reason, the
concept of a view was invented.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually into a yearly sales
table instead of duplicating the data.
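• A small pandas sketch of the same idea (the table names and values are hypothetical): pd.concat() appends tables much like a SQL UNION ALL, while a database view exposes the combined result without physically duplicating the data.

import pandas as pd

# Two hypothetical monthly sales tables with the same columns.
jan = pd.DataFrame({"product": ["A", "B"], "sales": [100, 150]})
feb = pd.DataFrame({"product": ["A", "B"], "sales": [120, 130]})

# Appending the tables (the pandas analogue of SQL UNION ALL).
yearly = pd.concat([jan, feb], ignore_index=True)
print(yearly)
# In a relational database, a view (CREATE VIEW ...) would expose this combined
# result virtually, without storing the duplicated rows.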
Transforming Data
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Relationships between an input variable and an output variable aren't always linear.
• Reducing the number of variables: Having too many variables in the model makes the model difficult to
handle, and certain techniques don't perform well when overloaded with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data scientists use
special methods to reduce the number of variables but retain the maximum amount of data.
Euclidean distance :
• Euclidean distance is used to measure the similarity between observations. It is calculated as the square
root of the sum of the squared differences between the corresponding coordinates of two points:
Euclidean distance = √((X1 − X2)² + (Y1 − Y2)²)
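• For example, a quick NumPy computation of the Euclidean distance between two hypothetical points:

import numpy as np

p1 = np.array([1.0, 2.0])     # point (X1, Y1)
p2 = np.array([4.0, 6.0])     # point (X2, Y2)
distance = np.sqrt(np.sum((p1 - p2) ** 2))
print(distance)               # 5.0, since sqrt(3**2 + 4**2) = 5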
Turning variables into dummies:
• Categorical variables can be turned into dummy variables. Dummy variables can only take two values:
true (1) or false (0). They are used to indicate the presence or absence of a categorical effect that may explain
the observation.
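• A minimal pandas sketch (the city column is invented for illustration) showing how get_dummies() turns a categorical variable into 0/1 dummy columns:

import pandas as pd

df = pd.DataFrame({"city": ["Chennai", "Mumbai", "Chennai"], "sales": [10, 20, 15]})
# Each category becomes a separate dummy (0/1) column.
dummies = pd.get_dummies(df, columns=["city"])
print(dummies)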
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple
summary statistics and graphic visualizations in order to gain a deeper understanding of data.
• EDA is used by data scientists to analyze and investigate data sets and summarize their main
characteristics, often employing data visualization methods. It helps determine how best to manipulate data
sources to get the answers user need, making it easier for data scientists to discover patterns, spot anomalies,
test a hypothesis or check assumptions.
• EDA is an approach/philosophy for data analysis that employs a variety of techniques to:
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
• With EDA, the following functions are performed:
1. Describe the data
2. Closely explore data distributions
3. Understand the relations between variables
4. Notice unusual or unexpected situations
5. Place the data into groups
6. Notice unexpected patterns within groups
7. Take note of group differences
• Box plots are an excellent tool for conveying location and variation information in data sets, particularly
for detecting and illustrating location and variation changes between different groups of data.
• Exploratory data analysis is majorly performed using the following methods:
1. Univariate analysis: Provides summary statistics for each field in the raw data set (or) a summary of
only one variable. Ex: CDF, PDF, box plot.
2. Bivariate analysis is performed to find the relationship between each variable in the dataset and the
target variable of interest (or) using two variables and finding relationship between them. Ex: Boxplot,
Violin plot.
3. Multivariate analysis is performed to understand interactions between different fields in the dataset (or)
finding interactions between variables more than 2.
• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of
numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.
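• As a small illustration on invented data, describe() gives univariate summary statistics for each numeric field, and a box plot grouped by a category gives a quick bivariate view:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "A"],
    "score": [55, 61, 72, 68, 75, 58],
})

print(df.describe())                      # count, mean, std, quartiles, min, max

df.boxplot(column="score", by="group")    # location and spread of score per group
plt.show()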
Model Execution
• Various programming languages can be used to implement the model. For model execution, Python
provides libraries such as StatsModels and Scikit-learn. These packages implement several of the most
popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the
process. Following are the remarks on output:
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that
the influence is there.
• Linear regression works if we want to predict a value, but to classify something, classification models
are used. The k-nearest neighbors method is one of the simplest and most widely used of these.
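• A brief Scikit-learn sketch on toy data, showing a linear regression (with its coefficient and R-squared) and a k-nearest neighbors classifier; the numbers are purely illustrative:

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [4], [5]]
y_value = [1.9, 4.1, 6.0, 8.1, 9.9]      # continuous target -> regression
y_class = [0, 0, 0, 1, 1]                # discrete target   -> classification

reg = LinearRegression().fit(X, y_value)
print(reg.coef_, reg.score(X, y_value))  # coefficient and R-squared (model fit)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(knn.predict([[4.5]]))              # predicted class for a new observation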
• Following commercial tools are used :
1. SAS enterprise miner: This tool allows users to run predictive and descriptive models based on large
volumes of data from across the enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through a GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data
exploration.
4. Alpine miner: This tool provides a GUI front end for users to develop analytic workflows and interact
with Big Data tools and platforms on the back end.
• Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational modeling, has some of the
functionality of Matlab.
3. WEKA: It is a free data mining software package with an analytic workbench. The functions created in
WEKA can be executed within Java code.
4. Python is a programming language that provides toolkits for machine learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop
analytical tools.
Data Mining
• Data mining refers to extracting or mining knowledge from large amounts of data. It is a process of
discovering interesting patterns or Knowledge from a large amount of data stored either in databases, data
warehouses or other information repositories.
• Data warehouse server: based on the user's data request, the data warehouse server is responsible for
fetching the relevant data.
• Knowledge base is helpful in the whole data mining process. It might be useful for guiding the search or
evaluating the interestingness of the result patterns. The knowledge base might even contain user beliefs and
data from user experiences that can be useful in the process of data mining.
• The data mining engine is the core component of any data mining system. It consists of a number of
modules for performing data mining tasks including association, classification, characterization, clustering,
prediction, time-series analysis etc.
• The pattern evaluation module is mainly responsible for the measure of interestingness of the pattern by
using a threshold value. It interacts with the data mining engine to focus the search towards interesting
patterns.
• The graphical user interface module communicates between the user and the data mining system. This
module helps the user use the system easily and efficiently without knowing the real complexity behind the
process.
• When the user specifies a query or a task, this module interacts with the data mining system and displays
the result in an easily understandable manner.
Classification of DM System
• Data mining systems can be categorized according to various parameters, such as database technology,
machine learning, statistics, information science, visualization and other disciplines.
• Fig. 1.10.2 shows classification of DM system.
Data Warehousing
• Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries and decision making. Data warehousing involves data cleaning, data
integration and data consolidations.
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision-making process. A data warehouse stores historical data for purposes of
decision support.
• A database is an application-oriented collection of data that is organized, structured, coherent, with
minimum and controlled redundancy, and that may be accessed by several users in due time.
• Data warehousing provides architectures and tools for business executives to systematically organize,
understand and use their data to make strategic decisions.
• A data warehouse is a subject-oriented collection of data that is integrated, time-variant, non-volatile,
which may be used to support the decision-making process.
• Data warehouses are databases that store and maintain analytical data separately from transaction-oriented
databases for the purpose of decision support. Data warehouses separate analysis workload from transaction
workload and enable an organization to consolidate data from several sources.
• Data organization in data warehouses is based on areas of interest, i.e. on the major subjects of the
organization: customers, products, activities, etc. Databases, in contrast, organize data based on enterprise
applications resulting from their functions.
• The main objective of a data warehouse is to support the decision-making system, focusing on the
subjects of the organization. The objective of a database is to support the operational system and information
is organized on applications and processes.
• A data warehouse usually stores many months or years of data to support historical analysis. The data in a
data warehouse is typically loaded through an extraction, transformation and loading (ETL) process from
multiple data sources.
• Databases and data warehouses are related but not the same.
• A database is a way to record and access information from a single source. A database is often handling
real-time data to support day-to-day business processes like transaction processing.
• A data warehouse is a way to store historical information from multiple sources to allow you to analyse
and report on related data (e.g., your sales transaction data, mobile app data and CRM data). Unlike a
database, the information isn't updated in real-time and is better for data analysis of broader trends.
• Modern data warehouses are moving toward an Extract, Load, Transformation (ELT) architecture in
which all or most data transformation is performed on the database that hosts the data warehouse.
• Goals of data warehousing:
1. To help reporting as well as analysis.
2. Maintain the organization's historical information.
3. Be the foundation for decision making.
"How are organizations using the information from data warehouses ?"
• Most organizations make use of this information for taking business decisions such as:
a) Increasing customer focus: This is possible by analysing customer buying patterns.
b) Repositioning products and managing product portfolios by comparing the performance of last year's
sales.
c) Analysing operations and looking for sources of profit.
d) Managing customer relationships, making environmental corrections and managing the cost of corporate
assets.
• The middle tier is the application layer giving an abstracted view of the database. It arranges the data to
make it more suitable for analysis. This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
• OLAP servers can interact with both relational databases and multidimensional databases, which lets them
collect data better based on broader parameters.
• The top tier is the front-end of an organization's overall business intelligence suite. The top-tier is where
the user accesses and interacts with data via queries, data visualizations and data analytics tools.
• The top tier represents the front-end client layer. The client level which includes the tools and Application
Programming Interface (API) used for high-level data analysis, inquiring and reporting. User can use
reporting tools, query, analysis or data mining tools.
Needs of Data Warehouse
1) Business user: Business users require a data warehouse to view summarized data from the past. Since
these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: Data warehouse is required to store the time variable data from the past. This input
is made to be used for various purposes.
3) Make strategic decisions: Some strategies may be depending upon the data in the data warehouse. So,
data warehouse contributes to making strategic decisions.
4) For data consistency and quality: Bringing data from different sources to a common place allows the user
to effectively enforce uniformity and consistency in the data.
5) High response time: Data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.
Metadata
• Metadata is simply defined as data about data. The data that is used to represent other data is known as
metadata. In data warehousing, metadata is one of the essential aspects.
• We can define metadata as follows:
a) Metadata is the road-map to a data warehouse.
b) Metadata in a data warehouse defines the warehouse objects.
c) Metadata acts as a directory. This directory helps the decision support system to locate the contents of a
data warehouse.
• In a data warehouse, we create metadata for the data names and definitions of a given data warehouse.
Along with this metadata, additional metadata is also created for time-stamping any extracted data and
recording the source of the extracted data.
Why is metadata necessary in a data warehouse ?
a) First, it acts as the glue that links all parts of the data warehouses.
b) Next, it provides information about the contents and structures to the developers.
c) Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.
• Fig. 1.11.2 shows warehouse metadata.
• As well as time series data, line graphs can also be appropriate for displaying data that are measured over
other continuous variables such as distance.
• For example, a line graph could be used to show how pollution levels vary with increasing distance from a
source or how the level of a chemical varies with depth of soil.
• In a line graph the x-axis represents the continuous variable (for example, year or distance from the initial
measurement) whilst the y-axis has a scale and indicates the measurement.
• Several data series can be plotted on the same line chart and this is particularly useful for analysing and
comparing the trends in different datasets.
• Line graph is often used to visualize rate of change of a quantity. It is more useful when the given data has
peaks and valleys. Line graphs are very simple to draw and quite convenient to interpret.
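• A minimal Matplotlib sketch of such a line graph, using hypothetical pollution readings taken at increasing distance from a source:

import matplotlib.pyplot as plt

distance_km = [0, 1, 2, 3, 4, 5]          # continuous variable on the x-axis
pollution = [95, 70, 52, 40, 33, 30]      # measurement on the y-axis (hypothetical)

plt.plot(distance_km, pollution, marker="o")
plt.xlabel("Distance from source (km)")
plt.ylabel("Pollution level")
plt.title("Pollution level vs. distance")
plt.show()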
4. Pie charts
• A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the
whole. Each sector shows the relative size of each value.
• A pie chart displays data, information and statistics in an easy to read "pie slice" format with varying slice
sizes telling how much of one data element exists.
• A pie chart is also known as a circle graph. The bigger the slice, the more of that particular data was
gathered. The main use of a pie chart is to show comparisons. Fig. 1.12.2 shows a pie chart.
• Various applications of pie charts can be found in business, school and at home. For business pie charts
can be used to show the success or failure of certain products or services.
• At school, pie chart applications include showing how much time is allotted to each subject. At home pie
charts can be useful to see expenditure of monthly income in different needs.
• Reading a pie chart is as easy as figuring out which slice of an actual pie is the biggest.
Limitations of pie charts:
• It is difficult to tell the difference between estimates of similar size.
• Error bars or confidence limits cannot be shown on a pie graph.
• Legends and labels on pie graphs are hard to align and read.
• The human visual system is more efficient at perceiving and discriminating between lines and line lengths
rather than two-dimensional areas and angles.
• Pie graphs simply don't work well when comparing data.
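• A minimal Matplotlib sketch of a pie chart, using hypothetical monthly expenditure shares:

import matplotlib.pyplot as plt

categories = ["Rent", "Food", "Travel", "Savings", "Other"]
shares = [30, 25, 15, 20, 10]             # hypothetical percentages of income

plt.pie(shares, labels=categories, autopct="%1.0f%%")   # label each slice with its share
plt.title("Monthly expenditure")
plt.show()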
The modeling process in machine learning typically involves several key steps, including data
preprocessing, model selection, training, evaluation, and deployment. Here's an overview of the general
modeling process:
1. Data Collection: Obtain a dataset that contains relevant information for the problem you want to solve.
This dataset should be representative of the real-world scenario you are interested in.
2. Data Preprocessing: Clean the dataset by handling missing values, encoding categorical variables,
and scaling numerical features. This step ensures that the data is in a suitable format for modeling.
3. Feature Selection/Engineering: Select relevant features (columns) from the dataset or create new features
based on domain knowledge. This step helps improve the performance of the model by focusing on the
most important information.
4. Splitting the Data: Split the dataset into training, validation, and test sets. The training set is used to
train the model, the validation set is used to tune hyperparameters, and the test set is used to evaluate the
final model.
5. Model Selection: Choose the appropriate machine learning model(s) for your problem. This decision
is based on factors such as the type of problem (classification, regression, clustering, etc.), the size of
the dataset, and the nature of the data.
6. Training the Model: Train the selected model(s) on the training data. During training, the model
learns patterns and relationships in the data that will allow it to make predictions on new, unseen data.
7. Hyperparameter Tuning: Use the validation set to tune the hyperparameters of the model.
Hyperparameters are parameters that control the learning process of the model (e.g., learning
rate, regularization strength) and can have a significant impact on performance.
8. Model Evaluation: Evaluate the model(s) using the test set. This step involves measuring
performance metrics such as accuracy, precision, recall, F1 score, or mean squared error, depending on
the type of problem.
9. Model Deployment: Once you are satisfied with the performance of the model, deploy it to a production
environment where it can make predictions on new data. This step may involve packaging the model into
a software application or integrating it into an existing system.
10. Monitoring and Maintenance: Continuously monitor the performance of the deployed model and update
it as needed to ensure that it remains accurate and reliable over time.
This is a high-level overview of the modeling process in machine learning. The specific details of each step
may vary depending on the problem you are working on and the tools and techniques you are using.
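As an illustration of steps 1-8, the following Scikit-learn sketch splits the Iris data set, scales the features, trains a simple model and evaluates it on the held-out test set (logistic regression is used here only as an example model):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: collect and preprocess data (the Iris data set is already clean).
X, y = load_iris(return_X_y=True)

# Step 4: split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2 (scaling): fit the scaler on the training data only, then transform both sets.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Steps 5-6: select and train a model.
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Step 8: evaluate on the held-out test set.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))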
Types of machine learning
Machine learning can be broadly categorized into three main types based on the
nature of the learning process and the availability of labeled data:
1. Supervised Learning: In supervised learning, the model is trained on a labeled dataset, where each
example is paired with a corresponding label or output. The goal of the model is to learn a mapping from
inputs to outputs so that it can predict the correct output for new, unseen inputs. Examples of supervised
learning algorithms include linear regression, logistic regression, decision trees, random forests, support
vector machines (SVM), and neural networks.
2. Unsupervised Learning: In unsupervised learning, the model is trained on an unlabeled dataset, and the
goal is to find hidden patterns or structures in the data. The model learns to group similar data points
together and identify underlying relationships without explicit guidance. Clustering and dimensionality
reduction are common tasks in unsupervised learning. Examples of unsupervised learning algorithms
include K-means clustering, hierarchical clustering, principal component analysis (PCA), and t-distributed
stochastic neighbor embedding (t-SNE).
3. Reinforcement Learning: Reinforcement learning is a type of machine learning where an agent learns to
make decisions by interacting with an environment. The agent receives feedback in the form of rewards or
penalties based on its actions, and the goal is to learn a policy that maximizes the cumulative reward over
time. Reinforcement learning is commonly used in applications such as game playing, robotics, and
autonomous driving. Examples of reinforcement learning algorithms include Q-learning, deep Q-networks
(DQN), and policy gradient methods.
These are the main types of machine learning, but there are also other subfields and
specialized approaches, such as semi-supervised learning, where the model is trained
on a combination of labeled and unlabeled data, and transfer learning, where
knowledge gained from one task is applied to another related task.
Supervised learning
2. Regression: In regression tasks, the goal is to predict a continuous value for each
input. Examples of regression tasks include predicting house prices based on
features such as size, location, and number of bedrooms, predicting stock prices
based on historical data, and predicting the amount of rainfall based on weather
patterns.
Supervised learning algorithms learn from the labeled data by finding patterns and
relationships that allow them to make accurate predictions on new, unseen data.
Common supervised learning algorithms include linear regression, logistic regression,
decision trees, random forests, support vector machines (SVM) and neural networks.
Unsupervised learning is a type of machine learning where the model is trained on an unlabeled dataset,
meaning that the data does not have any corresponding output labels. The goal of unsupervised learning is to
find hidden patterns or structures in the data.
Unlike supervised learning, where the model learns from labeled examples to predict outputs for new
inputs, unsupervised learning focuses on discovering the underlying structure of the data without any
guidance on what the output should be. This makes unsupervised learning particularly useful for exploratory
data analysis and understanding the relationships between data points.
2. Dimensionality Reduction: Dimensionality reduction techniques are used to reduce the number of
features in the dataset while preserving as much information as possible. This can help in visualizing high-
dimensional data and reducing the computational complexity of models. Principal Component Analysis
(PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are popular dimensionality reduction
techniques.
3. Anomaly Detection: Anomaly detection, also known as outlier detection, is the task of identifying data
points that deviate from the norm in a dataset. Anomalies may indicate errors in the data, fraudulent
behavior, or other unusual patterns. One-class SVM and Isolation Forest are common anomaly detection
algorithms.
4. Association Rule Learning: Association rule learning is the task of discovering interesting relationships
between variables in large datasets. It is often used in market basket analysis to identify patterns in
consumer behavior. Apriori and FP-growth are popular association rule learning algorithms.
Unsupervised learning is widely used in various fields such as data mining, pattern recognition, and
bioinformatics. It can help in gaining insights from data that may not be immediately apparent and can be a
valuable tool in exploratory data analysis and knowledge discovery.
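A short Scikit-learn sketch of two of these tasks on randomly generated data: PCA for dimensionality reduction and Isolation Forest for anomaly detection (the data and the planted anomaly are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # unlabeled data with 5 features
X[0] = [8, 8, 8, 8, 8]                 # one planted anomaly

# Dimensionality reduction: keep the two main directions of variation.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                      # (100, 2)

# Anomaly detection: -1 marks points the model considers outliers.
labels = IsolationForest(random_state=0).fit_predict(X)
print(labels[0])                       # expected to be -1 for the planted anomaly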
The main idea behind semi-supervised learning is that labeled data is often expensive
or time-consuming to obtain, while unlabeled data is often abundant and easy to
acquire. By using both labeled and unlabeled data, semi-supervised learning
algorithms aim to make better use of the available data and improve the
performance of the model.
2.Co-training: In co-training, the model is trained on multiple views of the data, each
of which contains a different subset of features. The model is trained on the
labeled data from each view
and then used to predict labels for the unlabeled data in each view. The predictions
from each view are then combined to make a final prediction.
1. Classification: In classification tasks, the goal is to predict a discrete class label for each
input, for example classifying emails as spam or not spam.
Both classification and regression are important tasks in machine learning and are
used in a wide range of applications. The choice between classification and
regression depends on the nature of the output variable and the specific problem
being addressed.
There are several types of clustering algorithms, each with its own strengths and weaknesses:
1. K-means Clustering: K-means is one of the most commonly used clustering algorithms. It partitions
the data into K clusters, where each data point belongs to the cluster with the nearest mean. K-means
aims to minimize the sum of squared distances between data points and their corresponding cluster
centroids.
2. Hierarchical Clustering: Hierarchical clustering builds a hierarchy of clusters, where each data point starts
in its own cluster and clusters are successively merged or split based on their similarity. Hierarchical
clustering can be agglomerative (bottom-up) or divisive (top-down).
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a density- based
clustering algorithm that groups together closely packed data points and identifies outliers as noise.
DBSCAN does not require the number of clusters to be specified in advance.
4. Mean Shift: Mean shift is a clustering algorithm that assigns each data point to the cluster
corresponding to the nearest peak in the density estimation of the data. Mean shift can
automatically determine the number of clusters based on the data.
5. Gaussian Mixture Models (GMM): GMM is a probabilistic model that assumes that the data is
generated from a mixture of several Gaussian distributions. GMM can be used for clustering by fitting
the model to the data and assigning each data point to the most likely cluster.
6. Agglomerative Clustering: Agglomerative clustering is a bottom-up hierarchical clustering algorithm
that starts with each data point as a singleton cluster and iteratively merges clusters based on their
similarity.
Clustering is used in various applications such as customer segmentation, image
segmentation, anomaly detection, and recommender systems. The choice of
clustering algorithm depends on the nature of the data and the specific requirements
of the problem.
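A minimal Scikit-learn sketch of K-means on two artificially separated groups of points (the data are synthetic and purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))   # points around (0, 0)
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(20, 2))   # points around (5, 5)
X = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # cluster assignment for each point
print(kmeans.cluster_centers_)                   # one centroid per cluster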
Outliers are data points that significantly differ from other observations in a dataset.
They can arise due to errors in data collection, measurement variability, or genuine
rare events. Outliers can have a significant impact on the results of data analysis and
machine learning models, as they can skew statistical measures and distort the
learning process.
Outlier analysis is the process of identifying and handling outliers in a dataset. Common
approaches to detecting outliers include statistical methods such as the Z-score and
interquartile range (IQR) methods, visual methods such as box plots and scatter plots,
and model-based methods such as Isolation Forest and one-class SVM.
Once outliers are identified, there are several approaches to handling them:
1. Removing Outliers: One approach is to remove outliers from the dataset. However,
this approach should be used with caution, as removing outliers can lead to loss of
information and bias in the data.
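As a small pandas/NumPy sketch, Z-scores can be used to flag and drop extreme values (the measurements are invented; the threshold of 2 is chosen only because this sample is tiny, while 3 is a common default on larger data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95, 11, 10]})

# Z-score: how many standard deviations each point lies from the mean.
z = (df["value"] - df["value"].mean()) / df["value"].std()

# Keep only rows whose Z-score is within the chosen threshold (use with caution).
filtered = df[np.abs(z) < 2]
print(filtered)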
Here are some multiple-choice questions (MCQs) with answers related to machine
learning:
1. What is machine learning? A) A technique to write algorithms B) A field of study that
gives computers the ability to learn without being explicitly programmed C) A method
to automate tasks
D) A software application
Answer: B) A field of study that gives computers the ability to learn without being
explicitly programmed
3. What is the goal of unsupervised learning? A) To classify data into predefined classes
B) To find patterns in data without any labels C) To predict continuous values D) To
learn from a teacher
Answer: B) To find patterns in data without any labels
Answer: D) K-means
5. What is overfitting in machine learning? A) When a model is too simple to capture the
underlying structure of the data B) When a model performs well on the training data
but fails to generalize to new data C) When a model is too complex and memorizes
the training data D) When a model is not trained long enough
Answer: C) When a model is too complex and memorizes the training data
6. Which evaluation metric is commonly used for classification tasks? A) Mean Squared
Error (MSE) B) R-squared C) Accuracy D) Root Mean Squared Error (RMSE)
Answer: C) Accuracy
7. What is the purpose of feature scaling in machine learning? A) To make the data
easier to visualize
B) To speed up the training process C) To ensure all features have the same scale D)
To reduce the number of features
Answer: C) To ensure all features have the same scale
These are just a few examples of MCQs related to machine learning. Depending on
the complexity of the topic and the level of the audience, the questions can vary.
1. What is the first step in the machine learning modeling process? A) Data preprocessing B) Model selection
C) Data collection D) Model evaluation
2. What is the purpose of data preprocessing in machine learning? A) To clean and prepare the data for
modeling B) To select the best model for the data C) To evaluate the performance of the model D) To
deploy the model in a production environment
3. What is the purpose of model selection in machine learning? A) To clean and prepare the data for modeling
B) To select the best model for the data C) To evaluate the performance of the model D) To deploy the
model in a production environment
4. Which of the following is NOT a step in the machine learning modeling process? A) Data preprocessing B)
Model evaluation C) Model deployment D) Data visualization
5. What is the purpose of model evaluation in machine learning? A) To clean and prepare the data for
modeling B) To select the best model for the data C) To evaluate the performance of the model D) To
deploy the model in a production environment
6. What is the final step in the machine learning modeling process? A) Data preprocessing B) Model selection
C) Model evaluation D) Model deployment
Answer: D) Model deployment
7. What is the goal of data preprocessing in machine learning? A) To create new features from existing data B)
To remove outliers from the data C) To scale the data to a standard range D) To clean and prepare the data
for modeling
8. Which of the following is NOT a common evaluation metric used in machine learning? A) Accuracy B)
Mean Squared Error (MSE) C) R-squared D) Principal Component Analysis (PCA)
These questions cover the basic steps of the machine learning modeling process, including data
preprocessing, model selection, model evaluation, and model deployment.
1. What are the main types of machine learning? A) Supervised learning, unsupervised
learning, and reinforcement learning B) Classification, regression, and clustering C)
Neural networks, decision trees, and SVMs D) Linear regression, logistic regression,
and K-means clustering
Answer: A) Supervised learning, unsupervised learning, and reinforcement learning
2. Which type of machine learning is used when the data is labeled? A) Supervised
learning B) Unsupervised learning C) Reinforcement learning D) Semi-supervised
learning
Answer: A) Supervised learning
4. Which type of machine learning is used when the data is not labeled? A) Supervised
learning B) Unsupervised learning C) Reinforcement learning D) Semi-supervised
learning
Answer: B) Unsupervised learning
5. Which type of machine learning is used when the model learns from its own
experience? A) Supervised learning B) Unsupervised learning C) Reinforcement
learning D) Semi-supervised learning
7. Which type of machine learning is used for anomaly detection? A) Supervised learning
B) Unsupervised learning C) Reinforcement learning D) Semi-supervised learning
These questions cover the main types of machine learning, including supervised
learning, unsupervised learning, and reinforcement learning, as well as their goals
and applications.
Here are some multiple-choice questions (MCQs) with answers related to supervised
learning in machine learning:
1. What is supervised learning? A) A type of learning where the model learns from its
own experience
B) A type of learning where the model learns from labeled data C) A type of learning
where the model learns without any labels D) A type of learning where the model
learns from reinforcement
Answer: B) A type of learning where the model learns from labeled data
Answer: C) Classification
5. What is the purpose of the training data in supervised learning? A) To evaluate the
performance of the model B) To select the best model for the data C) To clean and
prepare the data for modeling
D) To teach the model to make predictions
Answer: D) To teach the model to make predictions
7. What is the goal of feature selection in supervised learning? A) To clean and prepare
the data for modeling B) To select the best model for the data C) To reduce the
number of features to improve model performance D) To ensure all features have the
same scale
Answer: C) To reduce the number of features to improve model performance
Here are some multiple-choice questions (MCQs) with answers related to unsupervised learning in machine
learning:
1. What is unsupervised learning? A) A type of learning where the model learns from labeled data B) A type of
learning where the model learns from its own experience C) A type of learning where the model learns
without any labels D) A type of learning where the model learns from reinforcement
Answer: C) A type of learning where the model learns without any labels
2. Which of the following is an example of an unsupervised learning task? A) Image classification B)
Clustering C) Spam detection D) Sentiment analysis
Answer: B) Clustering
3. What is the goal of clustering in unsupervised learning? A) To predict a continuous value B) To classify data
into predefined classes C) To find patterns in data without any labels D) To learn from a teacher
4. Which of the following is a common algorithm used for clustering in unsupervised learning? A) Decision
tree B) K-means C) Support Vector Machine (SVM) D) Linear regression
Answer: B) K-means
5. What is the purpose of dimensionality reduction in unsupervised learning? A) To reduce the number of
features to improve model performance B) To select the best model for the data C) To ensure all features
have the same scale D) To clean and prepare the data for modeling
6. Which of the following is an example of an anomaly detection task? A) Predicting house prices based on
features such as size and location B) Classifying images into different categories C) Identifying fraudulent
transactions in financial data D) Clustering customer data to identify segments
7. What is the goal of feature extraction in unsupervised learning? A) To clean and prepare the data for
modeling B) To reduce the number of features to improve model performance C) To select the best model
for the data D) To ensure all features have the same scale
These questions cover the basics of unsupervised learning in machine learning, including the goals,
algorithms, and applications of unsupervised learning.
Here are some multiple-choice questions (MCQs) with answers related to semi-
supervised learning in machine learning:
1. What is semi-supervised learning? A) A type of learning where the model learns from
labeled data
B) A type of learning where the model learns from its own experience C) A type of
learning where the model learns from both labeled and unlabeled data D) A type of
learning where the model learns without any labels
Answer: C) A type of learning where the model learns from both labeled and unlabeled
data
2. Which of the following is an example of a semi-supervised learning task? A) Image
classification B) Clustering C) Sentiment analysis with a small labeled dataset and a
large unlabeled dataset D) Regression
Answer: C) Sentiment analysis with a small labeled dataset and a large unlabeled
dataset
3. What is the goal of semi-supervised learning? A) To predict a continuous value B) To
classify data into predefined classes C) To leverage both labeled and unlabeled data
for learning D) To learn from a teacher
Here are some multiple-choice questions (MCQs) with answers related to classification and regression in
machine learning:
1. What is the goal of classification in machine learning? A) To predict a continuous value B) To classify data
into predefined classes C) To find patterns in data without any labels D) To learn from a teacher
2. Which of the following is an example of a classification task? A) Predicting house prices based on features
such as size and location B) Classifying emails as spam or not spam C) Clustering customer data to identify
segments D) Predicting a student's grade based on the number of hours studied
3. What is the goal of regression in machine learning? A) To classify data into predefined classes B) To predict
a continuous value C) To find patterns in data without any labels D) To learn from a teacher
4. Which of the following is an example of a regression task? A) Classifying images into different categories
B) Predicting house prices based on features such as size and location C) Clustering customer data to
identify segments D) Predicting whether a customer will buy a product or not
Answer: B) Predicting house prices based on features such as size and location
5. Which evaluation metric is commonly used for classification tasks? A) Mean Squared Error (MSE) B) R-
squared C) Accuracy D) Root Mean Squared Error (RMSE)
Answer: C) Accuracy
6. Which evaluation metric is commonly used for regression tasks? A) Accuracy B) Mean Squared Error
(MSE) C) Precision D) Recall
7. What is the main difference between classification and regression? A) Classification predicts a continuous
value, while regression predicts a discrete class label B) Classification predicts a discrete class label, while
regression predicts a continuous value C) Classification uses labeled data, while regression uses unlabeled
data D) Regression uses labeled data, while classification uses unlabeled data
Answer: B) Classification predicts a discrete class label, while regression predicts a continuous value
8. Which of the following algorithms is commonly used for classification tasks? A) Linear regression B)
Decision tree C) K-means clustering D) Principal Component Analysis (PCA)
These questions cover the basics of classification and regression in machine learning, including their goals,
examples, evaluation metrics, and algorithms.
Here are some multiple-choice questions (MCQs) with answers related to clustering in
machine learning:
1. What is clustering in machine learning? A) A type of learning where the model learns
from labeled data B) A type of learning where the model learns from its own
experience C) A type of learning where the model learns without any labels D) A type
of learning where the model learns from reinforcement
Answer: C) A type of learning where the model learns without any labels
4. Which of the following is a common algorithm used for clustering in machine learning?
A) Decision tree B) K-means C) Support Vector Machine (SVM) D) Linear regression
Answer: B) K-means
These questions cover the basics of clustering in machine learning, including its
goals, examples, algorithms, and evaluation metrics.
Here are some multiple-choice questions (MCQs) with answers related to outliers and
outlier analysis in machine learning:
1. What is an outlier in a dataset? A) A data point that is missing a value B) A data point
that is significantly different from other observations C) A data point that is incorrectly
labeled D) A data point that is located at the center of the dataset
2. Why are outliers important in data analysis? A) They help to reduce the complexity of
the dataset
B) They can provide valuable insights into the data C) They have no impact on the
results of data analysis D) They make the dataset more difficult to analyze
Answer: B) They can provide valuable insights into the data
3. Which of the following is a common method for detecting outliers? A) Z-score method
B) Mean Squared Error (MSE) C) Root Mean Squared Error (RMSE) D) Silhouette score
4. What is the Z-score method used for in outlier analysis? A) To calculate the mean of
the dataset B) To calculate the standard deviation of the dataset C) To identify data
points that are significantly different from the mean D) To calculate the range of the
dataset
Answer: C) To identify data points that are significantly different from the mean
6. What is the impact of outliers on statistical measures such as mean and standard
deviation? A) Outliers have no impact on these measures B) Outliers increase the
mean and standard deviation
C) Outliers decrease the mean and standard deviation D) The impact of outliers
depends on their value
These questions cover the basics of outliers and outlier analysis in machine
learning, including their detection, impact, and handling.
UNIT II DATA MANIPULATION
Python Shell
The Python Shell, also known as the Python interactive interpreter or Python REPL (Read-Eval-Print Loop),
is a command-line tool that allows you to interactively execute Python code. It provides a convenient way to
experiment with Python code, test small snippets, and learn about Python features.
To start the Python Shell, you can open a terminal or command prompt and type python or
python3 depending on your Python installation. This will launch the Python interpreter, and you
will see a prompt (>>>) where you can start entering Python code. Here is
an example of using the Python Shell:
$ python
>>> print("Hello, Python!")
Hello, Python!
>>> x = 5
>>> y = 10
>>> print(x + y)
15
>>> exit()
In this example, we start the Python interpreter, print a message, perform a basic arithmetic
operation, and then exit the Python interpreter using the exit() function.
Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations, and narrative text.
It supports various programming languages, including Python, R, and Julia, among
others. Jupyter Notebook is widely used for data cleaning, transformation, numerical
simulation, statistical modeling, data visualization, machine learning, and more.
To start using Jupyter Notebook, you first need to have Python installed on your
computer. You can then install Jupyter Notebook using pip, the Python package
installer, by running the following command in your terminal or command prompt:
pip install notebook
Once Jupyter Notebook is installed, you can start it by running the following command in your terminal or
command prompt:
jupyter notebook
This will launch the Jupyter Notebook server and open a new tab in your web browser
with the Jupyter Notebook interface. From there, you can create a new notebook or
open an existing one. You can write and execute code in the notebook cells, add text
and equations using Markdown, and create visualizations using libraries like
Matplotlib and Seaborn.
Jupyter Notebook is a powerful tool for interactive computing and is widely used in
data science and research communities.
IPython magic commands are special commands that allow you to perform various
tasks in IPython, the enhanced interactive Python shell. Magic commands are
prefixed by one or two percentage signs (% or %%) and provide additional functionality
beyond what standard Python syntax offers. Here are some commonly used IPython
magic commands:
1. %run: Run a Python script inside the IPython session. Usage: %run script.py.
2. %time and %timeit: Measure the execution time of a single statement; %time runs it once,
while %timeit runs it repeatedly and reports an average time per loop.
3. %load: Load code into the current IPython session. Usage: %load file.py.
4. %matplotlib: Enable inline plotting of graphs and figures in IPython. Usage: %matplotlib inline.
5. %reset: Reset the IPython namespace by removing all variables, functions, and imports. Usage:
%reset -f.
6. %who and %whos: List all variables in the current IPython session (%who) or list all variables with
additional information such as type and value (%whos).
7. %%time and %%timeit: Measure the execution time of a cell (%%time) or a cell statement
(%%timeit) in IPython.
8. %magic: Display information about IPython magic commands and their usage. Usage: %magic.
9. %history: Display the command history for the current IPython session. Usage: %history.
10. %pdb: Activate the interactive debugger (Python debugger) for errors in the IPython session. Usage:
%pdb.
These are just a few examples of IPython magic commands. IPython provides many more magic
commands for various purposes; you can list all available magic commands by typing %lsmagic,
and get help on a specific magic command by appending a question mark (e.g., %time? for help
on the %time command).
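For example, a short IPython session using a few of these magics might look like the following
sketch (the script name is hypothetical, and the printed output is omitted):
In [1]: %run analysis.py          # run a script in the current session (hypothetical file)
In [2]: %timeit sum(range(1000))  # time the expression over many runs and report an average
In [3]: %whos                     # list current variables with their types and values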
NumPy Arrays
NumPy is a Python library that provides support for creating and manipulating arrays
and matrices. NumPy arrays are the core data structure used in NumPy to store and
manipulate data efficiently. Here's a brief overview of NumPy arrays:
1. Creating NumPy Arrays : NumPy arrays can be created using the numpy.array() function by
passing a Python list as an argument. For example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
Array Attributes: NumPy arrays have several attributes that provide information about the array, such as its
shape, size, and data type. Some common attributes include shape, size, and dtype.
Indexing and Slicing: NumPy arrays support indexing and slicing operations to access and modify
elements of the array.
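A minimal illustration of indexing and slicing a one-dimensional array:
arr = np.array([10, 20, 30, 40, 50])
arr[1]       # 20 (single element)
arr[1:4]     # array([20, 30, 40]) (slice)
arr[0] = 99  # modify an element in place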
Array Broadcasting: NumPy arrays support broadcasting, which allows operations to be performed on
arrays of different shapes.
arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 2
result = arr * scalar # [[2, 4, 6], [8, 10, 12]]
1. Array Functions: NumPy provides a variety of functions for creating and manipulating arrays, such as
np.arange() , np.zeros() , np.ones() , np.linspace() , np.concatenate() , and more.
NumPy arrays are widely used in scientific computing, data analysis, and machine learning due to their
efficiency and versatility.
1. Mathematical Functions: NumPy provides ufuncs for basic mathematical operations such as np.add(),
np.subtract(), np.multiply(), np.divide(), np.power(), np.sqrt(), np.exp(), np.log(), and more. These
functions can be used to perform element-wise arithmetic operations on arrays.
2. Trigonometric Functions: NumPy provides ufuncs for trigonometric functions such as np.sin(), np.cos(),
np.tan(), np.arcsin(), np.arccos(), np.arctan(), and more. These functions operate element-wise on
arrays and are useful for mathematical calculations involving angles.
3. Statistical Functions: NumPy provides ufuncs for statistical functions such as np.mean(), np.median(),
np.std(), np.var(), np.sum(), np.min(), np.max(), and more. These functions can be used to calculate
various statistical measures of arrays.
4. Logical Functions: NumPy provides ufuncs for logical operations such as np.logical_and(),
np.logical_or(), np.logical_not(), and more. These functions operate element-wise on boolean arrays
and are useful for logical operations.
5. Comparison Functions: NumPy provides ufuncs for comparison operations such as np.equal(),
np.not_equal(), np.greater(), np.greater_equal(), np.less(), np.less_equal(), and more. These functions
compare elements of arrays and return boolean arrays indicating the result of the comparison.
6. Bitwise Functions: NumPy provides ufuncs for bitwise operations such as np.bitwise_and(),
np.bitwise_or(), np.bitwise_xor(), np.bitwise_not(), and more. These functions operate element-wise
on integer arrays and perform bitwise operations.
These are just a few examples of the many ufuncs available in NumPy for data
manipulation. Ufuncs are an important part of NumPy and are widely used for
performing efficient and vectorized operations on arrays.
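A few of these ufuncs in action on a small array (illustrative values):
import numpy as np
a = np.array([1, 4, 9, 16])
np.sqrt(a)        # array([1., 2., 3., 4.])
np.add(a, 1)      # array([ 2,  5, 10, 17])
np.greater(a, 5)  # array([False, False,  True,  True])
np.mean(a)        # 7.5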
Aggregations for data manipulation
1. np.sum : Calculates the sum of all elements in the array or along a specified axis.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
total_sum = np.sum(arr) # 21
np.mean: Calculates the mean (average) of all elements in the array or along a specified axis.
mean_value = np.mean(arr) # 3.5
np.median: Calculates the median of all elements in the array or along a specified axis.
np.min and np.max: Calculate the minimum and maximum values in the array or along a specified axis.
min_value = np.min(arr) # 1
max_value = np.max(arr) # 6
np.std and np.var: Calculate the standard deviation and variance of the elements in the array or along a
specified axis.
np.sum(axis=0): Calculate the sum of elements along a specified axis (0 for columns, 1 for rows).
col_sum = np.sum(arr, axis=0) # array([5, 7, 9])
np.prod(): Calculate the product of all elements in the array or along a specified axis.
prod_value = np.prod(arr) # 720
These aggregation functions are useful for summarizing and analyzing data in NumPy arrays. They provide
efficient ways to calculate various statistical measures and perform calculations on arrays.
Computation on Arrays
Computation on arrays in NumPy allows you to perform element-wise operations, broadcasting, and
vectorized computations efficiently. Here are some key concepts and examples:
1. Element-wise operations: NumPy allows you to perform arithmetic operations (addition, subtraction,
multiplication, division) on arrays of the same shape element-wise.
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
z = x + y # [6, 8, 10, 12]
Broadcasting: Broadcasting is a powerful mechanism that allows NumPy to work with arrays of different
shapes when performing arithmetic operations.
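For example, a one-dimensional array can be broadcast across each row of a two-dimensional array:
a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([10, 20, 30])             # shape (3,)
a + b   # [[11, 22, 33], [14, 25, 36]]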
Universal functions (ufuncs): NumPy provides a set of mathematical functions that operate element-wise
on arrays. These functions are called universal functions (ufuncs).
x = np.array([1, 2, 3, 4])
y = np.sqrt(x) # [1. 1.41421356 1.73205081 2. ]
Aggregation functions: NumPy provides functions for aggregating data in arrays, such as sum, mean, min,
max, std, and var.
Vectorized computations: NumPy allows you to express batch operations on data without writing any for
loops, which can lead to more concise and readable code.
NumPy's array operations are optimized and implemented in C, making them much faster than equivalent
Python operations using lists. This makes NumPy a powerful tool for numerical computation and data
manipulation in Python.
Fancy Indexing
Fancy indexing in NumPy refers to indexing using arrays of indices or boolean arrays.
It allows you to access and modify elements of an array in a more flexible way than
simple indexing. Here are some examples of fancy indexing:
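Selecting elements with an array of indices (a minimal example):
import numpy as np
x = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 2, 4])
x[indices]   # array([10, 30, 50])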
Assigning values using fancy indexing:
x = np.array([10, 20, 30, 40, 50])
indices = np.array([1, 3, 4])
x[indices] = 0
# x is now [10, 0, 30, 0, 0]
Indexing multi-dimensional arrays:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.array([0, 2])
col_indices = np.array([1, 2])
y = x[row_indices, col_indices] # [2, 9]
Fancy indexing can be very useful for selecting and modifying specific elements of arrays based on complex
conditions. However, it is important to note that fancy indexing creates copies of the data, not views, so
modifying the result of fancy indexing will not affect the original array.
Sorting arrays
In NumPy, you can sort arrays using the np.sort() function or the sort() method of the array object.
np.sort() returns a sorted copy without modifying the original array, while the sort() method sorts the
array in place. Here are some examples of sorting arrays in NumPy:
import numpy as np
Sorting 1D arrays:
x = np.array([3, 1, 2, 5, 4])
sorted_x = np.sort(x)
# sorted_x: [1, 2, 3, 4, 5]
Sorting with argsort: NumPy's argsort() function returns the indices that would sort an array. This can
be useful for sorting one array based on the values in another array.
x = np.array([3, 1, 2, 5, 4])
indices = np.argsort(x)
sorted_x = x[indices]
# sorted_x: [1, 2, 3, 4, 5]
Sorting in-place: If you want to sort an array in-place (i.e., modify the original array), you can use the
sort() method of the array object.
x = np.array([3, 1, 2, 5, 4])
x.sort()
# x: [1, 2, 3, 4, 5]
Sorting with complex numbers: Sorting works with complex numbers as well, with the real part used for
sorting. If the real parts are equal, the imaginary parts are used.
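For example:
c = np.array([3 + 4j, 1 + 5j, 3 + 2j])
np.sort(c)   # array([1.+5.j, 3.+2.j, 3.+4.j])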
Structured data
Structured data in NumPy refers to arrays where each element can contain multiple fields or columns,
similar to a table in a spreadsheet or a database table. NumPy represents structured data with the
numpy.ndarray class, and you can create structured arrays using the numpy.array() function with a
dtype parameter specifying the data type for each field. Here's an example:
import numpy as np
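The sketch below builds a small structured array with hypothetical 'name' and 'age' fields:
# Each element has a 'name' (string) field and an 'age' (integer) field
data = np.array([('Alice', 25), ('Bob', 30), ('Charlie', 35)],
                dtype=[('name', 'U10'), ('age', 'i4')])
data['name']             # array(['Alice', 'Bob', 'Charlie'], dtype='<U10')
data[data['age'] > 28]   # rows where age is greater than 28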
You can also access and modify individual elements or slices of a structured array using the field names. For
example, to access the 'name' field of the first element, you can use data[0]['name'].
Structured arrays are useful for representing and manipulating tabular data in NumPy, and they provide a
way to work with heterogeneous data in a structured manner.
Data Manipulation with Pandas
1. Importing Pandas :
import pandas as pd
Creating a DataFrame: You can create a DataFrame from various data sources, such as lists, dictionaries,
NumPy arrays, or from a file (e.g., CSV, Excel).
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
Reading and Writing Data: Pandas provides functions to read data from and write data to various file
formats, such as CSV, Excel, SQL, and more.
# Read data from a CSV file
df = pd.read_csv('data.csv')
Selecting Data: You can select columns or rows from a DataFrame using indexing and slicing.
# Select a single column
print(df['Name'])
Adding and Removing Columns: You can add new columns to a DataFrame or remove existing columns.
# Remove a column
df = df.drop('City', axis=1)
Grouping and Aggregating Data: Pandas allows you to group data based on one or more columns and
perform aggregation
# Group data by 'City' and calculate the mean age in each city print(df.groupby('City')['Age'].mean())
Handling Missing Data: Pandas provides functions to handle missing data, such as dropna(), fillna(), and
isnull().
Merging and Joining DataFrames: Pandas provides functions to merge or join multiple DataFrames based
on a common column.
These are just a few examples of how you can manipulate data with Pandas. Pandas provides a wide range
of functions and methods for data cleaning, transformation, and analysis, making it a powerful tool for data
manipulation in Python.
Attribute access: You can select a single column as an attribute, e.g. df.column_name.
Callable indexing with .loc[] and .iloc[]: You can use callables with .loc[] and .iloc[] for more
advanced selection.
df.loc[lambda df: df['column_name'] > 0]
These are the basic ways to index and select data in pandas. Each method has its strengths, so choose the
one that best fits your use case.
1. Identifying Missing Data :
isna(), isnull(): Return a boolean mask indicating missing values.
notna(), notnull(): Return the opposite of isna() and isnull().
2. Removing Missing Data :
dropna(): Removes rows or columns with missing values.
df.dropna(axis=0)  # Remove rows with missing values
df.dropna(axis=1)  # Remove columns with missing values
3. Filling Missing Data :
fillna(): Fills missing values with a specified value or method. You can also fill missing values per
group, for example:
df.groupby('group_column')['value_column'].transform(lambda x: x.fillna(x.mean()))
These methods provide flexibility in handling missing data in pandas, allowing you to choose the approach
that best suits your data and analysis needs.
Hierarchical Indexing (MultiIndex)
1. Creating a MultiIndex : You can create a MultiIndex by passing a list of index levels to the index
parameter when creating a DataFrame.
import pandas as pd
arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
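A DataFrame can then be built on this index (the values below are hypothetical):
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)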
Indexing with a MultiIndex: You can use tuples to index into the DataFrame at multiple levels.
# Selecting a single value
df.loc[('A', 1)]
Indexing with MultiIndex columns: Indexing with MultiIndex columns is similar to indexing with
MultiIndex rows.
# Selecting a single column
df[('A', 'one')]
# Selecting on the first level of columns
df['A']
Creating from a dictionary with tuples: You can also create a DataFrame with a MultiIndex from a
dictionary where keys are tuples representing the index levels.
data = {('A', 1): 1, ('A', 2): 2, ('B', 1): 3, ('B', 2): 4}
df = pd.Series(data)
Hierarchical indexing provides a powerful way to represent and manipulate higher-dimensional datasets in
pandas. It allows for more flexible data manipulation and analysis.
Combining datasets in pandas typically involves operations like merging, joining, and
concatenating DataFrames. Here's an overview of each:
1. Concatenation :
Use pd.concat() to concatenate two or more DataFrames along a particular axis (row or column).
By default, it concatenates along axis=0 (rows), but you can specify axis=1 to concatenate
columns.
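A minimal sketch with two small hypothetical DataFrames:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
pd.concat([df1, df2])           # stack rows (axis=0)
pd.concat([df1, df2], axis=1)   # place side by side as columns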
Merging :
Use pd.merge() to merge two DataFrames based on a common column or index.
Specify the on parameter to indicate the column to join on.
merged_df = pd.merge(df1, df2, on='common_column')
Joining :
Use the .join() method to join two DataFrames on their indexes.
By default, it performs a left join ( how='left' ), but you can specify other types of joins.
joined_df = df1.join(df2, how='inner')
Appending :
Use the .append() method to append rows of one DataFrame to another.
This is similar to concatenation along axis=0, but with more concise syntax. Note that
DataFrame.append() is deprecated in recent pandas versions (and removed in pandas 2.0);
pd.concat() is the recommended replacement.
appended_df = df1.append(df2)
Merging on Index :
You can merge DataFrames based on their index using left_index=True and
right_index=True .
These methods provide flexible ways to combine datasets in pandas, allowing you to perform various types
of joins and concatenations based on your data's structure and requirements.
Aggregation and grouping are powerful features in pandas that allow you to perform
operations on groups of data. Here's an overview:
1. GroupBy :
Use groupby() to group data based on one or more columns
grouped = df.groupby('column_name')
Aggregation Functions :
Apply aggregation functions like sum(), mean(), count(), min(), max(), etc., to calculate
summary statistics for each group.
grouped.sum()
Custom Aggregation :
You can also apply custom aggregation functions using agg() with a dictionary mapping column
names to functions.
grouped.agg({'column1': 'sum', 'column2': 'mean'})
Filtering Groups :
Use filter() to keep only the groups that satisfy a condition, e.g. grouped.filter(lambda g: len(g) > 2).
Resampling :
For time series data, you can use resample() to group by a specified frequency.
df.resample('M').sum()
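Putting these together, a short sketch on a hypothetical DataFrame:
df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'sales': [10, 20, 30, 5]})
grouped = df.groupby('city')
grouped['sales'].sum()                           # total sales per city
grouped.agg({'sales': ['mean', 'max']})          # several aggregations at once
grouped.filter(lambda g: g['sales'].sum() > 25)  # keep only groups with total sales above 25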
String operations in pandas are used to manipulate string data in Series and
DataFrame columns. Pandas provides a wide range of string methods that are
vectorized, meaning they can operate on each element of a Series without the need
for explicit looping. Here are some common string operations in pandas:
Lowercasing/Uppercasing :
Convert strings to lowercase or uppercase.
df['column_name'].str.lower()
df['column_name'].str.upper()
String Length :
Get the length of each string.
df['column_name'].str.len()
String Concatenation :
Concatenate strings with other strings or Series.
df['column_name'].str.cat(sep=',')
Substrings :
Extract substrings using slicing or regular expressions.
df['column_name'].str.slice(start=0, stop=3)
df['column_name'].str.extract(r'(\d+)')
String Splitting :
Split strings into lists using a delimiter.
df['column_name'].str.split(',')
String Stripping :
Remove leading and trailing whitespace.
df['column_name'].str.strip()
String Replacement :
Replace parts of strings with other strings.
df['column_name'].str.replace('old', 'new')
String Counting :
Count occurrences of a substring.
df['column_name'].str.count('substring')
String Alignment :
Left or right align strings.
df['column_name'].str.ljust(width)
df['column_name'].str.rjust(width)
String Padding :
Pad strings with a specified character to reach a desired length.
df['column_name'].str.pad(width, side='left', fillchar='0')
These are just some of the string operations available in pandas. They are efficient for working with string
data and can be used to clean and transform text data in your DataFrame.
Working with time series data in pandas involves using the DateTime functionality provided by
pandas to manipulate, analyze, and visualize data that is indexed by dates or times.
Here's a basic overview of working with time series in pandas:
1. Creating a DateTimeIndex :
Ensure your DataFrame has a DateTimeIndex, which can be set using the pd.to_datetime()
function.
df.index = pd.to_datetime(df.index)
Resampling :
Use resample() to change the frequency of your time series data (e.g., from daily to
monthly).
df.resample('M').mean()
Shifting :
Use shift() to shift your time series data forward or backward in time.
df.shift(1)
Rolling Windows :
Use rolling() to calculate rolling statistics (e.g., rolling mean, sum) over a specified window size.
df.rolling(window=3).mean()
Date Arithmetic :
Perform arithmetic operations with dates, like adding or subtracting time deltas.
df.index + pd.DateOffset(days=1)
These are some common operations for working with time series data in pandas. The
functionality in pandas makes it easy to handle and analyze time series data efficiently.
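For instance, a minimal sketch using a hypothetical daily series:
import pandas as pd
idx = pd.date_range('2023-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)
ts.resample('M').sum()        # aggregate to monthly totals
ts.shift(1)                   # shift values forward by one period
ts.rolling(window=3).mean()   # 3-day rolling mean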
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
It can be used to create a wide range of plots and charts, including line plots, bar plots, histograms, scatter
plots, and more. Here's a basic overview of using Matplotlib for plotting:
Installing Matplotlib :
You can install Matplotlib using pip:
pip install matplotlib
Importing Matplotlib :
Import the matplotlib.pyplot module, which provides a MATLAB-like plotting interface.
import matplotlib.pyplot as plt
Creating a Plot :
plt.subplot(2, 1, 2)  # select the second panel of a 2x1 grid of subplots
plt.scatter(x, y)     # draw a scatter plot of x against y in that panel
Saving Plots :
Use savefig() to save your plot as an image file (e.g., PNG, PDF, SVG).
plt.savefig('plot.png')
Other Types of Plots :
Matplotlib supports many other types of plots, including bar plots, histograms,
scatter plots, and more.
plt.bar(x, y)
plt.hist(data, bins=10)
plt.scatter(x, y)
Matplotlib provides a wide range of customization options and is highly flexible, making it a powerful tool
for creating publication-quality plots and visualizations in Python.
Creating a simple scatter plot in Matplotlib involves specifying the x-axis and y-axis values and then using
the scatter() function to create the plot. Here's a basic example:
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create a simple scatter plot
plt.scatter(x, y)
plt.show()
This code will create a simple scatter plot with the given x and y values. You can add labeled axes and a
title with plt.xlabel(), plt.ylabel(), and plt.title(), and customize the appearance of the plot further by using
additional arguments in the scatter() function, such as color, s (size of markers), and alpha (transparency).
Use the errorbar() function to plot data points with error bars.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
yerr = [0.5, 0.3, 0.7, 0.4, 0.8]  # Error values
plt.errorbar(x, y, yerr=yerr, fmt='o', capsize=3)  # markers with vertical error bars
plt.show()
You can also show a shaded error region around a line with fill_between():
import numpy as np
import matplotlib.pyplot as plt
# Example data (assumed): measurements with an error band
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 7, 11])
error = np.array([0.5, 0.3, 0.7, 0.4, 0.8])
plt.plot(x, y)
plt.fill_between(x, y - error, y + error, alpha=0.2)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Shaded Error Region')
plt.show()
These examples demonstrate how to visualize errors in your data using Matplotlib. You can adjust the error
values and plot styles to suit your specific needs and data.
Density and contour plots are useful for visualizing the distribution and density of
data points in a 2D space. Matplotlib provides several functions to create these plots,
such as imshow() for density plots and contour() for contour plots. Here's how you can create them:
1. Density Plot (imshow) :
Use the imshow() function to create a density plot. You can use a 2D histogram or a kernel density
estimation (KDE) to calculate the density.
import numpy as np
import matplotlib.pyplot as plt
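The plotting code below is a minimal sketch, using hypothetical random data for the density plot
and a simple function on a grid for the contour plot:
# Density plot from a 2D histogram of random points
x = np.random.randn(1000)
y = np.random.randn(1000)
hist, xedges, yedges = np.histogram2d(x, y, bins=30)
plt.imshow(hist.T, origin='lower',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]], cmap='viridis')
plt.colorbar(label='Counts')
plt.title('Density Plot')
plt.show()
# Contour plot of z = exp(-(x^2 + y^2)) evaluated on a grid
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = np.exp(-(X**2 + Y**2))
plt.contour(X, Y, Z, levels=10)
plt.title('Contour Plot')
plt.show()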
These examples demonstrate how to create density and contour plots in Matplotlib. You can customize the
plots by adjusting parameters such as the number of bins, colormap, and contour levels to better visualize
your data.
Histograms in Matplotlib
Histograms are a useful way to visualize the distribution of a single numerical variable. Matplotlib provides
the hist() function to create histograms. Here's a basic example:
import numpy as np
import matplotlib.pyplot as plt
# Sample data (assumed here): 1000 random values from a standard normal distribution
data = np.random.randn(1000)
# Create a histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
You can customize the appearance of the histogram by adjusting parameters such as
bins, color, and edgecolor, and by adding labels and a title to make the plot more informative.
Legends in Matplotlib
Legends in Matplotlib are used to identify different elements of a plot, such as lines,
markers, or colors, and associate them with labels. Here's how you can add legends
to your plots:
1. Basic Legend :
Use the legend() function to add a legend to your plot. You can specify the labels for each element
in the legend.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [5, 4, 3, 2, 1]
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()  # builds the legend from the label arguments
You can add a title to the legend using the title parameter.
plt.legend(title='Legend Title')
Multiple Legends :
Calling legend() again replaces the existing legend, so to show more than one legend on the
same axes you need to add the first legend back as an artist:
plt.plot(x, y1)
plt.plot(x, y2)
first_legend = plt.legend(['Line 1', 'Line 2'], loc='upper left')
plt.gca().add_artist(first_legend)
plt.legend(['Line 3', 'Line 4'], loc='lower right')
1. Removing a Legend :
You can remove the legend from your plot by calling remove() on the legend object, for
example plt.gca().get_legend().remove().
These are some common ways to add and customize legends in Matplotlib. Legends
are useful for explaining the components of your plot and making it easier for viewers
to understand the data.
Colors in Matplotlib
In Matplotlib, you can specify colors in several ways, including using predefined color
names, RGB or RGBA tuples, hexadecimal color codes, and more. Here's how you can
specify colors in Matplotlib:
import numpy as np
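A short sketch showing the different ways a color can be given (the data below is hypothetical):
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y, color='green')                                 # predefined color name
plt.plot(x, [v + 1 for v in y], color='#1f77b4')              # hexadecimal color code
plt.plot(x, [v + 2 for v in y], color=(0.8, 0.2, 0.2))        # RGB tuple
plt.plot(x, [v + 3 for v in y], color=(0.1, 0.2, 0.5, 0.3))   # RGBA tuple with transparency
plt.show()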
These are some common ways to specify colors in Matplotlib. Using colors effectively can enhance the
readability and visual appeal of your plots.
Subplots in Matplotlib
Subplots in Matplotlib allow you to create multiple plots within the same figure. You can arrange subplots
in a grid-like structure and customize each subplot independently. Here's a basic example of creating
subplots:
import matplotlib.pyplot as plt
import numpy as np
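A minimal sketch of a 2x2 grid of subplots:
x = np.linspace(0, 2 * np.pi, 100)
fig, axes = plt.subplots(2, 2, figsize=(8, 6))   # 2 rows x 2 columns of axes
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('sin(x)')
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title('cos(x)')
axes[1, 0].scatter(x, np.sin(x), s=5)
axes[1, 1].hist(np.sin(x), bins=20)
plt.tight_layout()   # prevent overlapping labels
plt.show()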
Text and annotations in Matplotlib are used to add descriptive text, labels, and
annotations to your plots. Here's how you can add text and annotations:
1. Adding Text :
Use the text() function to add text at a specific location on the plot.
import matplotlib.pyplot as plt
2. Adding Annotations :
Use the annotate() function to add an annotation, optionally with an arrow pointing at a specific
data point (via the arrowprops argument).
Text Alignment :
Use the ha and va parameters to specify horizontal and vertical alignment of text.
plt.text(2, 10, 'Example Text', ha='center', va='top')
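A small sketch combining text() and annotate() on hypothetical data:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y)
plt.text(1.5, 8, 'A note at (1.5, 8)')
plt.annotate('peak', xy=(5, 11), xytext=(3.5, 9),
             arrowprops=dict(arrowstyle='->'))   # arrow pointing at the last data point
plt.show()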
Customization in Matplotlib
Customization in Matplotlib allows you to control various aspects of your plots, such
as colors, line styles, markers, fonts, and more. Here are some common
customization options:
1. Changing Figure Size :
Use figsize in plt.subplots() or plt.figure() to set the size of the figure.
fig, ax = plt.subplots(figsize=(8, 6))
2. Setting Axis Limits :
Use xlim() and ylim() to set the limits of the x and y axes.
plt.xlim(0, 10)
plt.ylim(0, 20)
3. Labels and Title :
Use xlabel() , ylabel() , and title() to set axis labels and plot title.
plt.xlabel('X-axis Label', fontsize=12)
plt.ylabel('Y-axis Label', fontsize=12)
plt.title('Plot Title', fontsize=14)
4. Custom Ticks :
Use xticks() and yticks() to set custom tick labels on the x and y axes.
plt.xticks([1, 2, 3, 4, 5], ['A', 'B', 'C', 'D', 'E'])
5. Adding Gridlines :
Use grid() to add gridlines to the plot.
plt.grid(True)
Matplotlib provides a toolkit called mplot3d for creating 3D plots. You can create 3D scatter plots, surface
plots, wireframe plots, and more. Here's a basic example of creating a 3D scatter plot:
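The example code is sketched below with random (hypothetical) points:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (implicit in recent Matplotlib)
# Hypothetical random 3D points
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, c='b', marker='o')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.title('3D Scatter Plot')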
# Show plot
plt.show()
In this example, fig.add_subplot(111, projection='3d') creates a 3D subplot, and
ax.scatter(x, y, z, c='b', marker='o') creates a scatter plot in 3D space. You can customize
the appearance of the plot by changing parameters such as c (color), marker, and adding labels and a title.
You can also create surface plots and wireframe plots using the plot_surface() and plot_wireframe()
functions. For example:
# Generate data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))
# Create the surface plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')
# Show plot
plt.show()
These examples demonstrate how to create basic 3D plots in Matplotlib. You can explore the mplot3d
toolkit and its functions to create more advanced 3D visualizations.
Basemap (from the mpl_toolkits.basemap toolkit) can be used to plot data on map projections. For example:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Create a map
plt.figure(figsize=(10, 6))
m = Basemap(projection='mill', llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180, resolution='c')
m.drawcoastlines()
m.drawcountries()
m.fillcontinents(color='lightgray',lake_color='aqua')
m.drawmapboundary(fill_color='aqua')
# Plot cities
lons = [-77.0369, -122.4194, 120.9660, -0.1276]
lats = [38.9072, 37.7749, 14.5995, 51.5074]
cities = ['Washington, D.C.', 'San Francisco', 'Manila', 'London']
x, y = m(lons, lats)
m.scatter(x, y, marker='o', color='r')
# Add a title
plt.title('Cities Around the World')
Basemap offers a wide range of features for working with geographic data, including
support for various map projections, drawing political boundaries, and plotting points,
lines, and shapes on maps. You can explore the Basemap documentation for more
advanced features and customization options.
Seaborn
1. Installation :
You can install Seaborn using pip:
pip install seaborn
Importing Seaborn :
Import Seaborn as sns conventionally:
import seaborn as sns
Categorical Plots :
Seaborn provides several functions for visualizing categorical data, such as sns.catplot(),
sns.barplot() , sns.countplot() , and sns.boxplot() . The examples below use Seaborn's built-in
tips dataset, loaded with tips = sns.load_dataset('tips').
sns.catplot(x='day', y='total_bill', data=tips, kind='box')
Distribution Plots :
Seaborn offers various functions for visualizing distributions, including sns.kdeplot(),
sns.histplot() , and sns.displot() ; the older sns.distplot() still appears in many examples but is
deprecated in recent Seaborn versions.
sns.distplot(tips['total_bill'])
Relational Plots :
Seaborn provides functions for visualizing relationships between variables, such as
sns.relplot() , sns.scatterplot() , and sns.lineplot() .
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter')
Heatmaps :
Use sns.heatmap() to visualize matrix-like data such as a correlation matrix, e.g.
sns.heatmap(corr_matrix, annot=True) (where corr_matrix is a precomputed correlation matrix).
Pairplots :
Pairplots are useful for visualizing pairwise relationships in a dataset using sns.pairplot() .
sns.pairplot(tips, hue='sex')
1. Styling and Themes :
Seaborn allows you to customize the appearance of plots using styling functions
( sns.set() , sns.set_style() , sns.set_context() ) and themes ( sns.set_theme() ).
2. Other Plots :
Seaborn offers many other types of plots and customization options. The
official Seaborn documentation provides detailed examples and explanations
for each type of plot.
Seaborn is built on top of Matplotlib and integrates well with Pandas, making it a
powerful tool for visualizing data in Python.
Handling Large Volumes of Data
Common techniques for handling and processing large volumes of data include:
1. Distributed computing: Using frameworks like Apache Hadoop and Apache Spark
to distribute data processing tasks across multiple nodes in a cluster, allowing for
parallel processing of large datasets.
2. Data compression: Compressing data before storage or transmission to reduce
the amount of space required and improve processing speed.
3. Data partitioning: Dividing large datasets into smaller, more manageable partitions based
on certain criteria (e.g., range, hash value) to improve processing efficiency.
4. Data deduplication: Identifying and eliminating duplicate data to reduce storage
requirements and improve data processing efficiency.
5. Database sharding: Partitioning a database into smaller, more manageable parts called
shards, which can be distributed across multiple servers for improved scalability and performance.
6. Stream processing: Processing data in real-time as it is generated, allowing for immediate
analysis and decision-making.
7. In-memory computing: Storing data in memory instead of on disk to improve processing
speed, particularly for frequently accessed data.
8. Parallel processing: Using multiple processors or cores to simultaneously execute data
processing tasks, improving processing speed for large datasets.
9. Data indexing: Creating indexes on data fields to enable faster data retrieval, especially
for queries involving large datasets.
10. Data aggregation: Combining multiple data points into a single, summarized value
to reduce the overall volume of data while retaining important information.
When dealing with large datasets in programming, it's important to use efficient
techniques to manage memory, optimize processing speed, and avoid common
pitfalls. Here are some programming tips for dealing with large datasets:
1. Use efficient data structures: Choose data structures that are optimized for the
operations you need to perform. For example, use hash maps for fast lookups,
arrays for sequential access, and trees for hierarchical data.
2. Lazy loading: Use lazy loading techniques to load data into memory only when it
is needed, rather than loading the entire dataset at once. This can help reduce
memory usage and improve performance.
3. Batch processing: Process data in batches rather than all at once, especially for
operations like data transformation or analysis. This can help avoid memory issues
and improve processing speed.
4. Use streaming APIs: Use streaming APIs and libraries to process data in a
streaming fashion, which can be more memory-efficient than loading the entire
dataset into memory.
5. Use indexes and caching: Use indexes and caching to optimize data access, especially for
large datasets, and reduce the time it takes to access and retrieve data.
6. Parallel processing: Use parallel processing techniques, such as multithreading or
multiprocessing, to speed up work on large datasets.
7. Use efficient algorithms: Choose algorithms that are optimized for large datasets, such as
sorting algorithms that use divide-and-conquer techniques or algorithms that can be parallelized.
8. Optimize I/O operations: Minimize I/O operations and use buffered I/O where possible
to reduce the overhead of reading and writing data to disk.
9. Monitor memory usage: Keep an eye on memory usage and optimize your code to
minimize memory leaks and excessive memory consumption.
10. Use external storage solutions: For extremely large datasets that cannot fit into memory,
consider using external storage solutions such as databases or distributed file systems.
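As an illustration of batch processing, pandas can read a large CSV file in chunks instead of
loading it all at once (the file name and column below are hypothetical):
import pandas as pd
total = 0.0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):  # process 100k rows at a time
    total += chunk['amount'].sum()   # aggregate each chunk, keeping only a running total in memory
print(total)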
2. Microsoft SmartScreen:
Microsoft SmartScreen is a feature in Microsoft Edge and Internet Explorer browsers that helps
protect users from phishing attacks and malware.
SmartScreen uses machine learning models to analyze URLs and determine their safety.
The model looks at features such as domain reputation, presence of phishing keywords, and
similarity to known malicious URLs.
SmartScreen also leverages data from the Microsoft Defender SmartScreen service to
improve its accuracy and coverage.
In both cases, machine learning is used to predict the likelihood that a given URL is
malicious based on various features and historical data. These models help protect
users from online threats and improve the overall security of the web browsing
experience.
Case studies: Building a recommender system
Building a recommender system involves predicting the "rating" or "preference" that
a user would give to an item. These systems are widely used in e-commerce, social
media, and content streaming platforms to personalize recommendations for users.
Here are two case studies that demonstrate how recommender systems can be built:
In both cases, the recommendation systems use machine learning and data analysis
techniques to analyze user behavior and make personalized recommendations. These
systems help improve user engagement, increase sales, and enhance the overall user
experience.
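As a minimal, generic sketch of the collaborative-filtering idea behind such systems (the user-item
ratings below are hypothetical, and 0 stands for "not rated"):
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# Rows are users, columns are items
ratings = pd.DataFrame(
    {'item_a': [5, 4, 0], 'item_b': [4, 5, 1], 'item_c': [0, 1, 5]},
    index=['user1', 'user2', 'user3'])
# Item-item similarity based on the users' rating patterns
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)
# Items most similar to 'item_a' (excluding itself) could be recommended
# to users who rated 'item_a' highly
print(item_sim['item_a'].drop('item_a').sort_values(ascending=False))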
By leveraging these tools and techniques, organizations can effectively manage and
analyze large volumes of data to extract valuable insights and drive informed
decision-making.
By employing these techniques, data scientists and machine learning engineers can
build models that are scalable, efficient, and capable of handling large datasets
effectively.
1. Visualization: Use data visualization tools like Matplotlib, Seaborn, or Tableau to create
visualizations that help stakeholders understand complex patterns and trends in the data.
2. Dashboarding: Build interactive dashboards using tools like Power BI or Tableau that allow
users to explore the data and gain insights in real-time.
3. Automated Reporting: Use tools like Jupyter Notebooks or R Markdown to create automated
reports that can be generated regularly with updated data.
4. Data Pipelines: Implement data pipelines using tools like Apache Airflow or Luigi to automate
data ingestion, processing, and analysis tasks.
5. Model Deployment: Use containerization technologies like Docker to deploy machine learning
models as scalable and reusable components.
6. Monitoring and Alerting: Set up monitoring and alerting systems to track the performance of
data pipelines and models, and to be notified of any issues or anomalies.
7. Version Control: Use version control systems like Git to track changes to your data processing
scripts and models, enabling collaboration and reproducibility.
8. Cloud Services: Leverage cloud services like AWS, Google Cloud Platform, or Azure for scalable
storage, processing, and deployment of large datasets and models.