IPL Match Winner Prediction with ML
Submitted by:
Deepshekhar Dey
Chandan Kumar Gupta
Himadri Sikhar Gogoi
Ritusman Kashyap Bhuyan
INTERNSHIP REPORT
A report submitted in partial fulfillment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
by
Himadri Sikhar Gogoi
ASTU regd. No.: 279307118
Ritusman Kashyap Bhuyan
ASTU regd. No.: 294307118
Chandan Kumar Gupta
ASTU regd. No.: 277007118
Deepshekhar Dey
ASTU regd. No.: 277907118
Under Supervision of
Sri Ranjit Das
Deputy Director(Technical),
National Institute of Electronics & Information Technology (NIELIT), Guwahati
(Duration: 21st June, 2021 to 18th July, 2021)
Certificate from NIELIT Guwahati
Acknowledgement
First, we would like to express our sincere gratitude to NIELIT, Guwahati, for giving us the
opportunity to undertake our internship at their esteemed institute. A special thanks to the
coordinator of the program, Sri Rintu Das sir, for organizing the internship, without whom we would
never have been able to work on this project.
During these four weeks we learned a lot about Machine Learning and Computer Science as a
whole, and it was all due to the presence of such dedicated instructors as Sri David Ray sir and
Sri Apurba Dey sir. We could not have imagined having better teachers than them.
Though our time here passed quickly, we will be sure to carry forward the knowledge that we
gained during our stay.
A heartfelt appreciation for everyone's efforts in teaching us.
Declaration
We, Himadri Sikhar Gogoi, Chandan Kumar Gupta, Ritusman Kashyap Bhuyan and Deepshekhar Dey, hereby
declare that this internship report, titled “IPL Match Winner Prediction using ML”, was uniquely
prepared by us after the completion of our four-week internship under the guidance of NIELIT,
Guwahati.
We also declare that the report has been prepared for our academic requirements and not for any other
purpose. It may not be used against the interests of the institute.
-------------------------------------------- -------------------------------------------
Himadri Sikhar Gogoi Ritusman Kashyap Bhuyan
Jorhat Engineering College Jorhat Engineering College
-------------------------------------------- -------------------------------------------
Chandan Kumar Gupta Deepshekhar Dey
Jorhat Engineering College Jorhat Engineering College
INDEX
S.no CONTENTS
1. Introduction
2. Theoretical background
   2.1 Python
       2.1.1 Python for ML
   2.2 Machine Learning
       2.2.1 Importance of ML
       2.2.2 Types of ML
       2.2.3 Some Python libraries
   2.3 Linear Regression
       2.3.1 Preparation of Data for Linear Regression
3. Project Execution
   3.1 Importing required dataset
   3.2 Missing Value Detection and Removal
   3.3 Training, Testing, Splitting and Prediction
   3.4 Comparing different models
       3.4.1 Predictions through Logistic Regression
       3.4.2 Predictions through Decision Tree
       3.4.3 Predictions through Random Forest
       3.4.4 Predictions through SVC
4. Results
5. Conclusion
6. Areas of further Improvement
7. Bibliography
Internship Objectives
➢ Internships are generally thought of as opportunities for college students to gain
experience in a particular field. Through this internship we hope to learn from the
NIELIT institute, gain real-world experience and develop our skills.
➢ The internship aims to give students the opportunity to understand the real-life use and
working of Machine Learning, so that the skills gathered here can be applied more
broadly.
➢ Understanding the broader scope of ML in the future.
➢ Understanding concepts of Data Science, AI, Deep Learning and ML.
➢ The internship is a great way to build our resumes and develop skills that can be emphasized
in them for future jobs.
Weekly Overview of Internship Activities
Week 1
➢ Python general syntax
➢ Range in Python
➢ Loops in Python
➢ Keywords in Python
➢ Control structures in Python
➢ Break, continue, pass in Python
➢ Common Python runtime errors
➢ Lists in Python
➢ Strings and string formatting in Python

Week 2
➢ NumPy library
➢ Pandas library

Week 3
➢ Matplotlib library
➢ Seaborn library
➢ Plotly library
➢ Linear regression
➢ Random state

Week 4
➢ ML statistics
➢ T-test
➢ Covariance
➢ Bayes' theorem
➢ K-NN algorithm
1. Introduction
According to Arthur Samuel, Machine Learning algorithms enable the computers to learn
from data, and even improve themselves, without being explicitly programmed.
Machine learning (ML) is a category of algorithms that allows software applications to
become more accurate in predicting outcomes without being explicitly programmed. The
basic idea of machine learning is to build algorithms that can receive input data and use
statistical analysis to predict an output, while updating those outputs as new data becomes available.
Thanks to statistics, machine learning rose to prominence in the 1990s. The intersection of
computer science and statistics gave birth to probabilistic approaches in AI, shifting the
field further toward data-driven methods. With large-scale data available, scientists
started to build intelligent systems that were able to analyze and learn from large amounts of
data. Machine Learning is nowadays used everywhere, from the most mundane tasks to
large-scale industrial applications.
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people.
ML differs from traditional approaches in the sense that in traditional methods instructions
are provided beforehand to produce a strictly calculated result, while in ML we train our
algorithms based on input data and use statistical analysis to produce a range of output
values. Thus, ML allows us to build models from sample data in order to automate decision
making processes based on input data.
2. Theoretical background
2.1 Python
Python is an interpreted high-level general-purpose programming language. Python's design
philosophy emphasizes code readability with its notable use of significant indentation.
Its language constructs as well as its object-oriented approach aim to help programmers write
clear, logical code for small and large-scale projects.
Python is dynamically-typed and garbage-collected. It supports multiple programming
paradigms, including structured (particularly, procedural), object-oriented and functional
programming.
2.1.1 Python for ML
Python is presently the most sought-after programming language in the machine learning
field. Its popularity may be due to the growing number of deep learning frameworks
developed for the language in recent times, and no one can deny the convenience of the
language itself.
Python offers concise and readable code. While complex algorithms and versatile workflows
stand behind machine learning and AI, Python's simplicity allows developers to write
reliable systems. It is considered by many to be one of the easiest programming languages,
if not the easiest. There are, in any case, many reasons for Python's standing as one of
the most used programming languages.
Additionally, Python is appealing to many developers as it's easy to learn. Python code is
understandable by most, making it easier to build models for machine learning. One key to
Python's popularity is that it's a platform-independent language. Python is supported by
many platforms including Linux, Windows, and macOS. Python code can be used to create
standalone executable programs for most common operating systems, which means that
Python software can be easily distributed and used on those operating systems without a
Python interpreter.
To reduce development time, Python provides us with various frameworks and libraries.
Libraries are pre-written code that solves common programming tasks. These frameworks and
libraries make our work easier and enable us to complete it faster. Some libraries that we
have used are:
➢ Numpy
➢ Pandas
➢ Seaborn
2.2 Machine Learning (ML)
Machine learning (ML) is a type of Artificial Intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being explicitly
programmed to do so. Machine learning algorithms use historical data as input to predict new
output values.
Some popular uses of ML include fraud detection, spam filtering, malware threat detection,
process automation and predictive maintenance.
2.2.1 Importance of ML
Machine Learning provides us with a way to view trends in business operations and to check
customer behavior patterns. It has a wide range of applications.
In this day and age, the nearly limitless quantity of available data, affordable data storage,
and the growth of more powerful processing machines has propelled the growth of machine
learning, with many industries developing more and more machine learning models capable
of analyzing bigger and more complex data while delivering faster, more accurate results on
vast scales. Machine learning tools enable organizations to more quickly identify profitable
opportunities and potential risks using reliable predictive models.
The practical applications of machine learning drive business results which can dramatically
affect a company's bottom line. New techniques in the field are evolving rapidly,
expanding the application of machine learning to nearly limitless possibilities. Industries and
companies that depend on vast quantities of data, and need a system to analyze it efficiently
and accurately, have embraced machine learning as the best way to build models, strategize,
and plan.
Healthcare, government systems, marketing and sales, e-commerce and social media,
logistics, business companies, manufacturing and financial industries all use machine
learning.
The future of machine learning remains wide open, with the potential to reshape nearly
every industry.
2.2.2 Types of ML
There are four basic approaches to machine learning:
a) Supervised learning
In this type of machine learning, data scientists supply algorithms with labeled training data
and define the variables they want the algorithm to assess for correlations. Both the input and
the output of the algorithm are specified.
It is good for the following tasks:
➢ Binary classification
➢ Multi-class classification
➢ Regression modeling
➢ Ensembling
b) Unsupervised learning
This type of machine learning involves algorithms that train on unlabeled data. The algorithm
scans through data sets looking for any meaningful connections. Neither the data that the
algorithms train on nor the predictions or recommendations they output are predetermined.
It is good for the following tasks:
➢ Clustering
➢ Anomaly detection
➢ Association mining
➢ Dimensionality reduction
c) Semi-supervised learning
This approach to machine learning involves a mix of the two preceding types. Data scientists
may feed an algorithm mostly labeled training data, but the model is free to explore the data
on its own and develop its own understanding of the data set.
It is good for the following tasks:
➢ Machine translation
➢ Fraud detection
➢ Labelling data
d) Reinforcement learning
Data scientists typically use reinforcement learning to teach a machine to complete a
multi-step process for which there are clearly defined rules. Data scientists program an
algorithm to complete a task and give it positive or negative cues as it works out how to
complete the task. But for the most part, the algorithm decides on its own what steps to
take along the way.
It is good for the following tasks:
➢ Robotics
➢ Video gameplay
➢ Resource management
2.2.3 Some python libraries
Numpy
NumPy is one of the most used libraries for tasks involving modern scientific computation and
evolving yet powerful domains like Data Science and Machine Learning. The two vital benefits that
NumPy has to offer are support for powerful N-dimensional array objects and built-in tools for
performing intensive mathematical and scientific calculations.
Other impressive features of NumPy include the use of an optimized C core for delivering high
performance, interoperability with various computing platforms and hardware, and ease of use.
NumPy also plays well with many other visualization libraries, such as Matplotlib, Seaborn,
and Plotly.
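As a brief illustration of the N-dimensional array object and its vectorized operations, here is a minimal sketch (a generic example, not taken from the report):

```python
import numpy as np

# Build a 2-D array and apply vectorized (element-wise) operations,
# which run in NumPy's optimized C core rather than in a Python loop.
scores = np.array([[10, 20, 30],
                   [40, 50, 60]])

print(scores.shape)         # dimensions of the array: (2, 3)
print(scores.mean(axis=0))  # column-wise means
print(scores * 2)           # element-wise arithmetic via broadcasting
```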
Pandas
Similar to NumPy, Pandas is another popular high-performance Python library that is being widely
used today for solving modern Data Science and Machine Learning problems. By offering
developers access to flexible yet extremely responsive data structures for working with time series
and structured data along with the stack of other vital features, Pandas aims to become the best data
analysis tool available for solving real-world problems. A brief rundown of the features offered by
Pandas include:
➢ An efficient DataFrame object for data manipulation
➢ Easy reshaping and pivoting of data sets
➢ Merging and joining of data sets
➢ Label-based data slicing, indexing, and subsetting
➢ Allows working with time-series data
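A minimal sketch of the DataFrame features listed above (the team abbreviations and values are illustrative, not the project's data):

```python
import pandas as pd

# A tiny DataFrame standing in for a structured dataset.
matches = pd.DataFrame({
    "team1": ["CSK", "MI", "RCB"],
    "team2": ["MI", "RCB", "CSK"],
    "winner": ["CSK", "MI", "CSK"],
})

# Label-based subsetting: rows where CSK won.
csk_wins = matches[matches["winner"] == "CSK"]
print(csk_wins)

# A quick aggregation over the winner column.
print(matches["winner"].value_counts())
```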
Matplotlib
Matplotlib is undoubtedly one of the most popular visualization libraries for Python. Used by
hundreds of companies and individuals, Matplotlib lets us visualize our data in several different
ways.
We can use it to create a wide variety of visualizations, including line plots, histograms, bar charts,
pie charts, scatter plots, tables, and many other styles.
Matplotlib’s visualizations are not only restricted to static visualizations. We can create interactive
and animated visualizations as well using it. Also, the publication-quality visuals created using
matplotlib are fully customizable and can be exported seamlessly for other applications.
Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface
for drawing attractive and informative statistical graphics, which are integral to exploring and
understanding data. Seaborn has various dataset-oriented plotting functions that operate on data
frames and arrays containing whole datasets; it internally performs the necessary statistical
aggregation and mapping to create the informative plots the user desires. Seaborn graphics include
bar charts, histograms, scatterplots, error charts, etc. Seaborn also has various tools for choosing
color palettes that can reveal patterns in the data.
Plotly
Plotly is a free open-source graphing library that can be used to form data visualizations. Plotly
(plotly.py) can be used to create web-based data visualizations that can be displayed in Jupyter
notebooks or web applications. Plotly provides more than 40 unique chart types like scatter plots,
histograms, line charts, bar charts, pie charts, error bars, box plots, multiple axes, sparklines,
dendrograms, 3-D charts, etc. Plotly also provides contour plots, which are not that common in other
data visualization libraries. In addition to all this, Plotly can be used offline with no internet
connection.
2.3 Linear Regression
Linear regression is perhaps one of the most well known and well understood algorithms in statistics
and machine learning. Linear regression was developed in the field of statistics and is studied as a
model for understanding the relationship between input and output numerical variables, but has been
borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.
Linear regression is an attractive model because the representation is so simple.
The representation is a linear equation that combines a specific set of input values (x) the solution to
which is the predicted output for that set of input values (y). As such, both the input values (x) and
the output value are numeric.
2.3.1 Preparation of Data for Linear Regression
Linear regression has been studied at great length, and there has been much discussion on how
data should be structured to make the best use of the model. There are therefore a few steps we
must follow to prepare our data before fitting a linear regression model:
➢ Cleaning the data
➢ Removing noise
➢ Removing Collinearity
➢ Gaussian distributions
➢ Outlier treatment
➢ Training and testing the model
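Once the data is prepared, fitting the model is straightforward. Here is a minimal sketch using scikit-learn's LinearRegression on synthetic data (the data and coefficients are invented for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following y = 3x + 2 with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=100)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the line; the learned slope and intercept should be
# close to the true values 3 and 2.
model = LinearRegression().fit(X_train, y_train)
print(model.coef_[0], model.intercept_)
print(model.score(X_test, y_test))  # R^2 on held-out data
```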
3. Project Execution
IPL MATCH WINNER PREDICTION
Our objective is to create a model for IPL Match Winner Prediction using previous years' data.
3.1 Importing required dataset
For this purpose we downloaded a dataset of previous IPL seasons (2008 to 2020) from kaggle.com.
Firstly we have imported the important libraries that will be required for our prediction and the
imported libraries are:
➢ NumPy: NumPy is the fundamental package for scientific computing in Python. It is a
Python library that provides a multidimensional array object, various derived objects (such as
masked arrays and matrices), and an assortment of routines for fast operations on arrays,
including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier
transforms, basic linear algebra, basic statistical operations, random simulation and much
more.
➢ Pandas: Pandas has been one of the most popular and favorite data science tools used
in the Python programming language for data wrangling and analysis. Data is unavoidably
messy in the real world, and Pandas is seriously a game changer when it comes to cleaning,
transforming, manipulating and analyzing data. In simple terms, Pandas helps to clean the
mess.
➢ Matplotlib: Matplotlib is a low-level graph plotting library in Python that serves as a
visualization utility. Most of the Matplotlib utilities lie under the pyplot submodule.
➢ Warnings: Warning messages are typically issued in situations where it is useful to alert the
user to some condition in a program, where that condition (normally) doesn't warrant raising
an exception and terminating the program. Here we used the warnings module to filter the
warnings and ignore them.
Then we imported our dataset (“ipl matches 2008-2020 datasheet.csv”) using pandas.
3.2 Missing Value Detection and Removal
Then we checked for missing values in our dataset and found that some of the city values were
missing. We fixed them by checking the venue column and the stadium it names: the rows whose
venue is Dubai International Cricket Stadium were missing the city Dubai. So, we filled the
missing cities with Dubai.
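The city fix can be sketched as follows, using a small stand-in DataFrame since the full Kaggle file is not reproduced here (column names follow the report; the rows are illustrative):

```python
import pandas as pd

# Stand-in for the full dataset: some rows are missing the city.
matches_df = pd.DataFrame({
    "venue": ["Eden Gardens",
              "Dubai International Cricket Stadium",
              "Dubai International Cricket Stadium"],
    "city": ["Kolkata", None, None],
})

# All rows with missing cities turned out to be at the Dubai venue,
# so fill the gaps with "Dubai".
matches_df["city"] = matches_df["city"].fillna("Dubai")

print(matches_df["city"].isnull().sum())  # no missing values remain
```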
Then we saw that the winner column, which is our target, has some missing values, which happens
only when the match produced no winner. That means the match was a draw, so we filled the
missing values with the value 'Draw'.
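The draw fix can be sketched the same way (a stand-in DataFrame; the team names are illustrative):

```python
import pandas as pd

# Stand-in winner column: NaN marks matches with no result.
matches_df = pd.DataFrame({
    "team1": ["CSK", "MI"],
    "team2": ["MI", "RCB"],
    "winner": ["CSK", None],
})

# A missing winner means the match had no result, so label it "Draw".
matches_df["winner"] = matches_df["winner"].fillna("Draw")

print(matches_df["winner"].tolist())
```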
Then we extracted the useful columns on which we are going to base the prediction and stored them
again in matches_df. From the dataset, the useful columns we identified are 'team1', 'team2',
'city', 'toss_decision', 'toss_winner', 'venue' and 'winner'.
After handling the missing values we have no missing values in our required dataset.
The next thing we noticed is that our features are of the object data type, which can be a
problem for our prediction, so we had to convert them into numerical form.
For the teams, we created a dictionary that gives each team a number, which converted this
categorical data into numerical data.
For the remaining categorical data, except the winner column, we used LabelEncoder.
LabelEncoder can be used to normalize labels, and to transform non-numerical labels
(as long as they are hashable and comparable) into numerical labels. We fit a label encoder to
each of our categorical columns (city, toss_decision, venue), transforming them into numerical labels.
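The two encoding steps described above can be sketched as follows (the team names, codes and values are illustrative, not the project's actual mapping; in the real data the winner column also contains the 'Draw' label):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the cleaned dataset.
matches_df = pd.DataFrame({
    "team1": ["CSK", "MI", "RCB"],
    "team2": ["MI", "RCB", "CSK"],
    "city": ["Chennai", "Mumbai", "Bangalore"],
    "toss_decision": ["bat", "field", "bat"],
    "venue": ["Chepauk", "Wankhede", "Chinnaswamy"],
    "winner": ["CSK", "MI", "CSK"],
})

# 1. A hand-made dictionary maps each team name to a number.
team_codes = {"CSK": 1, "MI": 2, "RCB": 3}
for col in ["team1", "team2", "winner"]:
    matches_df[col] = matches_df[col].map(team_codes)

# 2. LabelEncoder converts the remaining categorical columns.
for col in ["city", "toss_decision", "venue"]:
    matches_df[col] = LabelEncoder().fit_transform(matches_df[col])

print(matches_df.dtypes)  # every column is now numeric
```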
After Data Handling and converting our categorical data into numerical data our dataset is now ready
for training and testing.
3.3 Training, Testing, splitting and Prediction
For training, testing and splitting we require the function train_test_split, which we can import
from sklearn.model_selection. After importing it, we split our data 80/20: we set aside 20%
of the dataset for testing, and the remaining data is used for training our model.
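The split can be sketched as follows (a stand-in feature matrix and target; with the real dataset, X would hold the encoded feature columns and y the encoded winner column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows, 6 encoded feature columns.
rng = np.random.default_rng(42)
X = rng.integers(0, 10, size=(100, 6))
y = rng.integers(1, 9, size=100)  # encoded winning-team labels

# Hold out 20% of the rows for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 80 20
```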
3.4 Comparing different models
After training, testing and splitting, we used different models for prediction, because we wanted
the best possible result. The different models we used are LogisticRegression,
DecisionTreeClassifier, RandomForestClassifier and SVC.
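The comparison loop might look like the sketch below (stand-in data, so the accuracies printed here will not match the report's figures):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; with the real dataset, X holds the encoded feature
# columns and y the encoded winner column.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 6))
y = rng.integers(1, 9, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```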
3.4.1 Predictions through Logistic Regression
Firstly, we used LogisticRegression to predict the required result and then checked its accuracy,
which came out at 25%, which is very low. We then compared the actual and predicted winners from
the LogisticRegression model.
3.4.2 Predictions through Decision Tree
Secondly, we used DecisionTreeClassifier to predict the required result and then checked its
accuracy, which came out at 52%, which is acceptable. We then compared the actual and predicted
winners from the DecisionTreeClassifier model.
3.4.3 Predictions through Random Forest
Thirdly, we used RandomForestClassifier to predict the required result and then checked its
accuracy, which came out at 51%, which is also acceptable. We then compared the actual and
predicted winners from the RandomForestClassifier model.
3.4.4 Predictions through SVC
Lastly, we used SVC to predict the required result and then checked its accuracy, which came out
at 40%, which is moderate. We then compared the actual and predicted winners from the SVC model.
4. Results
For our final model we chose the Random Forest model, as it gives us acceptable accuracy. The
popularity of the Random Forest model is explained by its various advantages:
➢ Accurate and efficient when running on large databases.
➢ Multiple trees reduce the variance and bias of a smaller set or single tree.
➢ Resistant to overfitting.
➢ Can handle thousands of input variables without variable deletion.
➢ Can estimate what variables are important in classification.
➢ Provides effective methods for estimating missing data.
➢ Maintains accuracy when a large proportion of the data is missing.
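One advantage listed above, estimating which variables matter in classification, is exposed in scikit-learn as the feature_importances_ attribute. A generic sketch on synthetic data (not the project's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only the first feature actually drives the label;
# the other three columns are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; the first feature should dominate.
print(forest.feature_importances_)
```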
5. Conclusion
We have been learning ML for four weeks and have finally been able to complete a project that
shows how Machine Learning can be used to calculate probabilities in a simulation. However, we
must keep in mind that a model cannot always reach 100% accuracy, due to the various faults in its
dataset. Nevertheless, we have been able to obtain satisfactory results. We cleaned our dataset,
trained and tested our model, and thus succeeded in creating a working model to determine the
outcome of an IPL match using data from previous years. We also compared different models on our
dataset and settled on the random forest model, as it gives us adequate accuracy.
6. Areas of further Improvement
➢ Dataset: find a dataset with fewer anomalies, and take into account individual players' data
to assess the quality of each team's players.
➢ Try more complex Machine Learning algorithms like XGBoost.
➢ A confusion matrix would be a great way to analyse which games the model got wrong.
➢ We could use ensembling, i.e. stacking more models together, to improve the accuracy.
➢ Going even further, we could build a model based on player statistics.
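The confusion-matrix suggestion above can be sketched with scikit-learn (the labels here are illustrative encoded team numbers, not the project's actual predictions):

```python
from sklearn.metrics import confusion_matrix

# Stand-in labels: encoded team numbers for actual vs. predicted winners.
y_test = [1, 2, 3, 1, 2, 3]
y_pred = [1, 2, 1, 1, 3, 3]

# Rows are actual classes, columns are predicted classes; off-diagonal
# entries show which teams the model confuses with which.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```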
7. Bibliography
➢ Google
➢ Wikipedia
➢ YouTube