Air quality index analysis of Bengaluru city air pollutants using Expectation Maximization clustering

2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA) | 978-1-6654-2829-3/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICAECA52838.2021.9675669

Dr. R. Senthil Kumar
Associate Professor, Department of Computer Science and Engineering
New Horizon College of Engineering, Marathalli, Bengaluru
[email protected]

Dr. Anidha Arulanandham
Associate Professor, Department of Computer Science and Engineering
New Horizon College of Engineering, Marathalli, Bengaluru
[email protected]

Dr. Suresh Arumugam
Associate Professor, Department of Artificial Intelligence and Machine Learning
New Horizon College of Engineering, Marathalli, Bengaluru
[email protected]

Abstract: Local air quality is important because it affects human breathing and life. Air quality changes from day to day, like the weather. Information about outdoor air quality is reported through the air quality index (AQI), which indicates the level of pollution in the air. The AQI measuring system informs people about the air quality at their location. The objective of the proposed method is to analyze and visualize the AQI of Bengaluru city. The data were collected from ten different stations of Bengaluru city over the six years from 2015 to 2020. The proposed work analyses the important pollutants, which are selected from the data set of all the stations using attribute selection methods such as a decision tree and a correlation matrix. The Expectation Maximization (EM) clustering technique is applied to analyze the data, and the results are discussed.

Keywords: Air quality index, Machine Learning, Classification, ELM, SVM

I. INTRODUCTION

The AQI is divided into six levels. If the AQI value is between 0 and 50, the air quality is good and satisfactory and poses no health risk. If the AQI is between 51 and 100, it is moderate, but sensitive individuals may experience respiratory symptoms. If the value is between 151 and 200, the AQI is unhealthy: people may experience minor health effects, and sensitive people may experience some serious health effects. If the value is between 201 and 300, the AQI is very unhealthy, which triggers a health alert. If the index is over 300, the AQI is termed hazardous, a condition that may cause serious health issues.

Bengaluru, the capital of Karnataka state, is the third most populous city in India. The major sources of pollution in Bengaluru are the roads, that is, the vehicles on the road and the road dust they raise into the atmosphere. This dust consists of fine particulate matter from the gases released by vehicles and of dirt particles from the particular area. The number of vehicles on the road is extremely large, and so is the quantity of emissions. Diesel vehicles emit far more pollution than vehicles running on other available fuels. Fossil fuels and other organic matter combine with the many chemicals produced by vehicular emissions, such as Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2) and Ozone (O3), and the resulting pollutants have detrimental effects on human health. Vehicles and the construction industry are therefore the main culprits in the high air pollution in various places of Bengaluru. Machine learning classification is a two-step process. The first step selects a subset of significant and relevant features from the dataset, and the second step implements a classification model that predicts which pollutants have a major impact. The goal of the attribute subset selection method is to determine a small subset of highly informative attributes, which reduces processing complexity and provides higher classification accuracy.

The remainder of this paper is organized as follows. Section II gives the details of the related work, its implementation and analysis. Section III gives the details of the preprocessing techniques [13] applied to the data set and the selection of attributes using the attribute subset selection method. The results are presented and discussed in Section IV, and Section V concludes the paper.

II. LITERATURE

Kingsy et al. [1] calculated the AQI with an enhanced k-means algorithm. The quality of air is indicated in numeric values, and the accuracy of k-means is greater than that of the other implemented algorithms. Ojeda-Magaña [2] developed a technique to monitor PM10 in the air. Particulate matter smaller than 2.5 micrometers in diameter is dangerous for health as per the WHO standard. Austin [3] proposed pollutant monitoring of NO2, O3 and PM2.5 for about 48 hours. The proposed model uses machine learning techniques and proves more accurate than the linear regression technique.

Yajie et al. [4] introduced a technique for a pollutant dataset of PM10, PM2.5, CO and O3 from London, which uses grid sensor data. Pearl Pullan et al. [5] proposed an Air Quality Management System which considers pollution levels through PM2.5 and calculates the Air Quality Index (AQI). S Sankar Ganesh et al. [6] developed Support Vector Regression (SVR) for the AQI, which depends on the pollutants NO2, CO, O3, PM2.5, PM10 and SO2. Among the models compared, SVR exhibited high performance in terms of quality measures.

Liye Song [7] analyzed the effects of pollutants such as PM2.5, NO2, PM10, SO2, CO and O3 on the AQI in Jinan in 2016. The effects were analyzed using correlation analysis and path analysis. Ruijun Yang et al. [8] developed a Bayesian network that evaluates air quality characteristics to check city air quality. The network forms a directed acyclic graph (DAG); the training and validation data sets contain Shanghai data, and the experimental results coincide with the real situation. Ranjana Waman Gore et al. [9] conducted AQI analysis using data on pollutants such as SO2, CO, NO2 and O3. The Kaggle dataset contains information on air pollutants and their AQI values. The proposed work applies

Naive Bayesian [12] classification and the Decision tree J48 classification algorithm to predict the health concerns related to AQI. Jana Shafi et al. [10] experimented with a K-Means clustering technique, which shows quick changes in AQI from the lowermost toxic level to the highest toxic level at the same place, based on the fire pollutants on an hourly basis.

III. AIR QUALITY ANALYSIS AND VISUALIZATION

A. Data pre processing

The Central Pollution Control Board (CPCB) has installed ten active stations in and around Bengaluru city. The data set was collected from the Kaggle website and gives the pollution data of BTM Layout, Kadabesanahalli, Bapuji Nagar, City Railway Station, Hebbal, Hombegowda Nagar, Jayanagar 5th Block, Peenya, Sanegurava Halli and Silk Board. The data set comprises the numerical values of the PM2.5, PM10, NO, NO2, CO, SO2, NOx, NH3, O3, Benzene, Toluene and Xylene pollutants at these ten stations, as shown before preprocessing in Fig. 1.

Fig. 1. Raw data before pre-processing in .csv format

Particulate matter (PM10 and PM2.5) is a mixture of many chemical solids and contains inorganic ions, metallic compounds, elemental carbon, organic compounds and compounds from the earth's crust. Nitrogen oxides (NOx) refer to nitrogen monoxide (NO) and nitrogen dioxide (NO2). Nitrogen monoxide (NO) is a colorless gas. Nitrogen dioxide (NO2) is a reddish-brown gas with a pungent, acrid odour and is one of the several oxides of nitrogen. Carbon monoxide (CO) is a toxic air pollutant produced by the combustion of carbon-containing fuels such as gasoline, natural gas and oil. Sulfur dioxide (SO2) is emitted by the burning of fossil fuels that contain sulfur and is found in the atmosphere. Ozone (O3) is produced when temperature causes chemical reactions involving oxides of nitrogen (NOx). Benzene, toluene and xylene are Volatile Organic Compounds (VOCs) present in the atmosphere as precursors for ground-level ozone production.

The AQI values are also classified into six standard classes: good, moderate, satisfactory, poor, very poor and severe. The data set contains daily values of all the above attributes from 2015 to 2020. The original data contain many missing values, which are handled by filling in the mean value of the attribute.

The actual AQI bucket values contain six categories as per the CPCB standard, which are reduced to three categories: Moderate, Satisfactory and Poor. The six category values are converted to three classes by applying min-max normalization, as represented in Table I. The normalization is based on the minimum and maximum values of the data set. The lower two and the upper two boundaries of the actual data set hold very few values, which are scaled down to the three levels mentioned above.

Fig. 2. Data after pre-processing using WEKA 3.8

Principal component analysis with the ranker algorithm is applied in the WEKA 3.8 tool for attribute subset selection. Similar attributes are removed from the data set based on the correlation matrix. A J4.8 decision tree algorithm is applied to remove the attributes with a low gain ratio. The Benzene, Toluene and Xylene pollutants had a low gain ratio and were removed from the data set for the further

analysis. Fig. 2 shows the preprocessed data taken for the analysis.

TABLE I. CLUSTER RANGE VS NORMALIZED CLUSTER RANGE

AQI Range    AQI Value       Normalized Cluster Range
0-50         Good            Cluster-0 (Good)
51-100       Moderate        Cluster-0 (Good)
101-150      Satisfactory    Cluster-1 (Satisfactory)
151-200      Poor            Cluster-2 (Poor)
201-250      Very Poor       Cluster-2 (Poor)
251-300      Severe          Cluster-2 (Poor)

The analysis applies the Expectation Maximization (EM) algorithm, which first initializes the hypothesis for the means of each attribute to h = (µ1, µ2), where µ1 and µ2 are arbitrary initial values. The algorithm then re-estimates h by repeating the two steps below until the procedure converges to a constant h, which is the best mean value for improving the clustering.

Step-1. The expected value E[zij] is calculated for each hidden variable zij, assuming that the current hypothesis h = (µ1, µ2) holds.

Step-2. The new maximum likelihood hypothesis h' = (µ1', µ2') is calculated, assuming that the value taken by each hidden variable zij is its expected value E[zij] calculated in Step-1. The hypothesis h = (µ1, µ2) is then replaced by the new hypothesis h' = (µ1', µ2') and the procedure iterates. The new maximum likelihood [11] value µj is updated using

\[ \mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} \qquad \text{Eq. (1)} \]

where x_i denotes the i-th of the m observed values.

The EM algorithm estimates the means of a mixture of k Gaussian distributions. The current µj is used to estimate the unobserved missing values, and the expected values of any feature can be used for analysis. The estimated mean values are applied in k-means clustering, which produces improved accuracy.

IV. RESULT ANALYSIS

The analysis of AQI was performed on the four important features: particulate matter of 10 micrometers (PM10), particulate matter of 2.5 micrometers (PM2.5), Carbon Monoxide (CO) and Ozone (O3). These features were evaluated by constructing a J48 tree, which was analyzed using a confusion matrix and gives 94% accuracy on these features. The tree was constructed using 10-fold cross validation.

The EM-based clustering analysis considers the ten stations of Bengaluru, which are BTM Layout, Kadabesanahalli, Bapuji Nagar, City Railway Station, Hebbal, Hombegowda Nagar, Jayanagar 5th Block, Peenya, Sanegurava Halli and Silk Board, on the X-axis. The Y-axis considers the features PM10, PM2.5, CO and O3 separately and visualizes the areas with a poor AQI. The three clusters Good, Moderate and Poor are represented in blue, red and green respectively. The 10 stations are numbered from KA-2 to KA-11, which is represented on the Y-axis.

Fig. 3. PM10 visualization of stations

Fig. 4. PM2.5 visualization of all stations

Fig. 5. O3 visualization of all stations

The visualizations in Fig. 3 for PM10, Fig. 4 for PM2.5, Fig. 5 for O3 and Fig. 6 for CO show clearly that stations KA-9 and KA-10, located in Peenya and Sanegurava Halli respectively, have a poor air quality index. The Peenya area is one of the biggest industrial areas of Bengaluru and comprises a large number of small-scale, medium-scale and large-scale industries. The Sanegurava Halli area is in the center of the city, where the maximum number of vehicles cross.

Table II represents the fine-tuned mean values of the pollutants, where the improved mean value is obtained by mixing two normally distributed values. The results in Table II are drawn from these improved mean values, which are what actually contribute to the analysis.
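The two-step mean re-estimation described in Section III, following the textbook formulation cited as [11], can be sketched as below. This is a minimal illustration and not the WEKA procedure used in the study. It assumes a single feature, two Gaussian components with equal mixing weights and a known, shared standard deviation `sigma`. The function name `em_two_means` and the synthetic readings are hypothetical and are not taken from the Bengaluru data set.

```python
import math
import random


def em_two_means(xs, mu_init, sigma, tol=1e-6, max_iter=500):
    """Estimate the two means of a 1-D Gaussian mixture with EM.

    Assumptions (not from the paper): equal mixing weights and a
    known standard deviation `sigma` shared by both components.
    """
    mu = list(mu_init)
    for _ in range(max_iter):
        # Step-1 (E-step): E[z_ij] is the probability that x_i was
        # generated by component j under the current hypothesis h.
        resp = []
        for x in xs:
            d = [((x - m) ** 2) / (2.0 * sigma ** 2) for m in mu]
            d_min = min(d)  # shift exponents to avoid underflow to 0/0
            w = [math.exp(-(dj - d_min)) for dj in d]
            total = sum(w)
            resp.append([wj / total for wj in w])

        # Step-2 (M-step): each new mean is the E[z_ij]-weighted
        # average of the observations, as in Eq. (1).
        new_mu = []
        for j in range(len(mu)):
            weight = sum(r[j] for r in resp)
            new_mu.append(sum(r[j] * x for r, x in zip(resp, xs)) / weight)

        shift = max(abs(a - b) for a, b in zip(mu, new_mu))
        mu = new_mu
        if shift < tol:  # h has converged to a constant value
            break
    return mu


# Synthetic readings for illustration only: two groups of values
# centred near 40 and 120 with standard deviation 10.
random.seed(0)
readings = [random.gauss(40.0, 10.0) for _ in range(200)]
readings += [random.gauss(120.0, 10.0) for _ in range(200)]

low_mean, high_mean = em_two_means(
    readings, (min(readings), max(readings)), sigma=10.0
)
print(round(low_mean, 2), round(high_mean, 2))
```

Starting the two means at the smallest and largest readings is one simple way to keep both components active. The paper states only that the initial values are arbitrary, so other initializations are equally valid.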

Fig. 6. CO visualization of all stations

TABLE II. FINETUNED MEAN DISTRIBUTIONS FOR LABELLED CLUSTERS

Cluster-Type    Good    Satisfactory    Poor
K-Means         38%     60%             2%
EM              42%     35%             23%

Table II shows the percentage of instances fine-tuned from the clusters of the K-means model to the clusters of the EM model. The K-means clusters are formed using Euclidean distance over the daily recordings of all pollutants in the entire data set. EM clustering improves the mean values of each pollutant, and the AQI is calculated with the modified instances. Most of the satisfactory instances are redistributed to the other two clusters, and the number of instances in the 'Poor' cluster increases under the EM algorithm. The result is displayed in the following chart.

Fig. 7. Comparison of K-means and EM

Fig. 7 shows the percentage change in instances from K-means clustering to EM clustering. EM clustering assigns the instances to the proper clusters with improved mean values. The EM method identifies the major pollutants, which are PM10, PM2.5, CO and O3, from the ten available AQI monitoring stations of Bengaluru city, and is more accurate than the K-means clustering technique.

V. CONCLUSION

The air quality index of the four major features PM10, PM2.5, O3 and CO is analyzed using the Expectation Maximization clustering technique. Feature selection used the J48 decision tree to select the features with the maximum gain ratio. Correlation matrix analysis was used to remove similar features from the input data. The results of the 10-fold cross validation test on the standard data sets show the areas of Bengaluru city with poor air quality. The work can be extended in future by relating the health issues of the people living in these specific areas to the pollutants present in the air.

REFERENCES

[1] R. Kingsy Grace, R. Manimegalai, M. S. Geetha Devasena, S. Rajathi, K. Usha, N. Raabiathul Baseria, "Air Pollution Analysis using Enhanced K-Means Clustering Algorithm for Real Time Sensor Data", IEEE Region 10 Conference, 22nd to 25th Nov, 2016.
[2] B. Ojeda-Magaña, R. Ruelas, L. Gómez-Barba, M. A. Corona-Nakamura, J. M. Barrón-Adame, M. G. Cortina-Januchs, J. Quintanilla-Domínguez, A. Vega-Corona, "Air Pollution Analysis with a Possibilistic and Fuzzy Clustering Algorithm Applied in a Real Database of Salamanca", Vol. 14, pp. 1262-1263, 2013.
[3] E. Austin, B. Coull, D. Thomas, P. Koutrakis, "A framework for identifying distinct multi pollutant profiles in air pollution data", Environment, Vol. 45, pp. 12-121, 2012.
[4] Yajie Ma, Mark Richards, Moustafa Ghanem, Yike Guo and John Hassard, "Air Pollution Monitoring and Mining Based on Sensor Grid in London", Vol. 8, pp. 3601-3623, 2008.
[5] Pearl Pullan, Chitra Gautam, Vandana Niranjan, "Air Quality Management System", 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), 2-4 Oct. 2020, IEEE Xplore.
[6] S Sankar Ganesh, Sri Harsha Modali, Soumith Reddy Palreddy, P Arulmozhivarman, "Forecasting air quality index using regression models: A case study on Delhi and Houston", 2017 International Conference on Trends in Electronics and Informatics (ICEI), 11-12 May 2017, IEEE.
[7] Liye Song, "Impact Analysis of Air Pollutants on the Air Quality Index in Jinan Winter", ICEI Conference, IEEE Xplore.
[8] Ruijun Yang, Feng Yan, Nan Zhao, "Urban air quality based on Bayesian network", International Conference on Communication Software and Networks (ICCSN), 6th to 8th May 2017, IEEE Xplore.
[9] Ranjana Waman Gore, Deepa S. Deshpande, "An approach for classification of health risks based on air quality levels", ICISIM, 5th to 6th October 2017, IEEE Xplore.
[10] Jana Shafi, Amtul Waheed, "K-Means Clustering Analysing Abrupt Changes in Air Quality", ICECA, 5th to 7th November 2020, IEEE Xplore.
[11] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[12] R. S. Kumar and C. Ramesh, "A study on prediction of rainfall using datamining technique," 2016 International Conference on Inventive Computation Technologies (ICICT), 2016, pp. 1-9, doi: 10.1109/INVENTIVE.2016.7830208.
[13] R Senthil Kumar, C. Ramesh, "Extreme Precipitation Events in Chennai Metro City Using Data Mining", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8 Issue-11, September 2019, DOI: 10.35940/ijitee.J9978.0981119
