NAAC Accredited "A"
D. E. Society’s
Kirti College of Arts, Science And Commerce
Department Of Computer Science & IT
(2018-2019)
A PROJECT SYNOPSIS ON
“RAINFALL IN INDIA ANALYSIS”
Exam Seat No: 30541
SUBMITTED
TO
UNIVERSITY OF MUMBAI
Guided By: Prof. Aniruddha Phadke
Developed By: Y Patil
Certificate
This is to certify that the project synopsis "RAINFALL IN INDIA ANALYSIS"
has been successfully completed by Examination Seat No. 30541 under the
guidance of Prof. Aniruddha Phadke, as per the syllabus and in partial
fulfillment of the requirements of [Link] II Sem III in Computer Science from
the University of Mumbai.
It is also to certify that this is the original work of the candidate during the
academic year 2018-2019.
Date:
Place: MUMBAI
Project Guide Head of Department External Examiner
Index
Sr. No  Topic
1       Introduction
2       Rainfall Analysis (Related Work)
3       Objective
4       Methodology
5       References
Introduction:
Climate of India
The Climate of India comprises a wide range of weather conditions across a vast
geographic scale and varied topography, making generalisations difficult.
The country's meteorological department follows the international standard of four
climatological seasons with some local adjustments: winter (December, January and
February), summer (March, April and May), a monsoon rainy season (June to
September), and a post-monsoon period (October to November).
A product of southeast trade winds originating from a high-pressure mass centred over
the southern Indian Ocean, the monsoonal torrents supply over 80% of India's annual
rainfall.
Rainfall Analysis:
New monthly, seasonal and annual rainfall time series for the 36 meteorological
subdivisions of India were constructed using monthly rainfall data for the
period 1901–2015 from a fixed network of 1,476 rain gauge stations.
Related Work:
BIG DATA AND HADOOP
BIG DATA
Big Data is a term used to describe huge data sets (several gigabytes, terabytes or
petabytes of data). The data is so large and complex that it is difficult to process
using traditional data processing applications. Big data requires a new set of tools,
applications and frameworks to process and manage it.
Evolution of Data/Big Data
Data has always been around and there has been a need to store, process and manage
data since the beginning of modern human civilization. However, the amount of data
captured, stored, processed and managed depends on various factors including the necessity
felt by humans for certain information, available tools and technologies needed for making
decisions based on the data analysis and so on.
In today’s world, due to advancements in technology, a huge amount of data (several
terabytes or petabytes) is constantly being captured [4]. Natural curiosity about
truly important things, like whether more teenagers than millennials like Justin
Bieber, demands processing Twitter data, which is huge.
Characteristics of Big Data
VOLUME
Volume refers to the size of the data that the user is working with. Due to
advancements in technology, the amount of data being generated is growing rapidly.
Data is spread across different places, in different formats, in volumes ranging from
gigabytes to terabytes to petabytes. Data is not only generated by humans but by
machines too. Nowadays the data generated by machines is surpassing the data
generated by humans; weather data is a good example.
VARIETY
Variety refers to the different formats in which data is generated. Apart from
structured data like spreadsheets and traditional flat files, a large amount of
unstructured data is being generated in the form of weblogs, sensor data, social
media, etc. Enterprises make use of both structured and unstructured data for
analysis, thereby making better business decisions to stay competitive.
VELOCITY
Velocity refers to the speed at which data is generated. Different applications
in different fields have different requirements, so we see data being generated at
different speeds based on those requirements.
HADOOP
Hadoop is an open source framework capable of processing large data sets in a
distributed fashion across clusters of machines using a simplified programming model.
Hadoop provides a reliable way to store, process and analyze data.
Hadoop Architecture
Hadoop works in a master-slave fashion and has two core components:
HDFS (Hadoop Distributed File System) and MapReduce.
Hadoop Components
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
HDFS offers reliable and distributed storage. It replicates the data across multiple
nodes on cloud or commodity hardware. Unlike a regular file system, when data is
pushed into HDFS it is internally split into multiple data blocks (a configurable
parameter with a default size of 64 MB). All blocks that make up a file are of the
same size, except the last block, which may be smaller depending on the size of the
incoming file.
HDFS also replicates the data across various data nodes (the replication rate is
configurable), ensuring fault tolerance and reliability. It also ensures that the
replication factor is maintained, so that if a node goes down, the data can be
recovered from the replicas on other nodes.
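The block arithmetic above can be sketched in a few lines of Python (an illustrative sketch, not part of HDFS itself; the 64 MB block size and a replication factor of 3 are common defaults, and both are configurable in a real cluster):

```python
BLOCK_SIZE_MB = 64   # HDFS default block size (configurable)
REPLICATION = 3      # a common default replication factor (configurable)

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file of the given size is split into."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    blocks = [BLOCK_SIZE_MB] * full_blocks
    remainder = file_size_mb % BLOCK_SIZE_MB
    if remainder:
        blocks.append(remainder)  # the last block may be smaller than 64 MB
    return blocks

def raw_storage_mb(file_size_mb):
    """Total cluster storage consumed once every block is replicated."""
    return sum(split_into_blocks(file_size_mb)) * REPLICATION

# A 200 MB file becomes three 64 MB blocks plus one 8 MB block,
# and occupies 600 MB of raw storage at replication factor 3.
blocks = split_into_blocks(200)
```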
HDFS is capable of storing large amounts of data which can be structured or unstructured.
The computers present in the cluster can be present in any location and there is no
physical location dependency. HDFS works in a master/slave fashion.
NameNode: NameNode is the master component and holds information about all
other nodes in the Hadoop cluster, the files present, and their locations in the cluster.
There is only one NameNode per cluster.
DataNode: DataNode is a slave node and holds the user data in the form of data blocks.
There can be many DataNodes in a Hadoop cluster.
MAPREDUCE
MapReduce offers a framework/analysis system which performs complex
computations on large datasets in a parallelized fashion. This system breaks down the
complex computations into multiple smaller tasks and assigns those to individual slave nodes
and takes care of the co-ordination and consolidation of the results [4]. These tasks run
independently on various nodes across the cluster. There are primarily two types of tasks:
Map tasks and Reduce tasks.
As in HDFS, MapReduce (computation part) also works in master/slave fashion.
JobTracker: Keeps track of the tasks assigned and co-ordinates the exchange of
information and the results with the slave nodes. Its responsibility also includes
rescheduling of failed tasks and monitoring the overall progress of the job. There is
only one JobTracker per cluster.
TaskTracker: Acts as a slave and is responsible for running the tasks assigned by the
JobTracker and returning the results to the JobTracker. There can be multiple
TaskTracker nodes in a cluster.
Figure 2.3. Data processing using MapReduce framework.
FileInputFormat: This is the input file/data that needs to be processed.
Split: Hadoop splits the incoming data into several blocks.
RecordReader: RecordReader reads the data line by line and converts it into
key/value pairs that are passed as input to the Mapper.
Mapper: Mapper contains the logic to process input data. The Map function
transforms the input records to intermediate records.
Combiner: This is an optional step, often used to improve performance by
reducing the amount of data transferred across the network.
Shuffle: Output of all the mappers is collected, shuffled and sorted, to be sent to the
Reducer.
Reducer: Reducer applies logic to aggregate the data and passes the result to a
FileOutputFormat class.
FileOutputFormat: A pre-defined class provided by the MapReduce framework
through which the final output is written to HDFS.
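The stages above can be simulated in miniature. The following is a hypothetical Python sketch of the map → shuffle → reduce flow (Hadoop itself would run these as Java map and reduce tasks; the rainfall records here are made up for illustration), totalling rainfall per state:

```python
from collections import defaultdict

# RecordReader: each input line becomes one record
lines = ["Kerala 300", "Kerala 250", "Goa 120", "Goa 80"]

def mapper(line):
    """Map: transform an input record into an intermediate key/value pair."""
    state, rainfall_mm = line.split()
    yield state, int(rainfall_mm)

# Shuffle: collect every mapper's output and group it by key
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

def reducer(key, values):
    """Reduce: aggregate all values observed for one key."""
    return key, sum(values)

totals = dict(reducer(k, v) for k, v in groups.items())
# totals == {"Kerala": 550, "Goa": 200}
```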
Hadoop Characteristics
Hadoop provides a reliable shared storage system (HDFS) and data analysis
system (MapReduce).
Cost effective, as it can work with commodity hardware and doesn’t need
expensive hardware.
Flexible and can process both structured and unstructured data sets.
Optimized for large and very large data sets. It takes far less data processing
time than traditional database management systems because of parallel processing.
Highly scalable: a Hadoop cluster can contain hundreds or thousands
of servers.
Provides a very reliable system as data is replicated across multiple nodes
(replication factor is configurable).
Hive
Hive is a data warehouse infrastructure which is built on top of the Hadoop
distributed system and it provides tools to enable easy ETL to join, aggregate and filter
different data sets. It also allows programmers to build custom MapReduce functionalities.
Hive provides an SQL like query interface called HiveQL which internally does MapReduce
operations. Hive is extremely useful when processing large amounts of data (terabytes). Hive
is easier to use as it abstracts the complexity of Hadoop. Lots of companies support Hive, a
simple reason being to encourage SQL based queries on top of Hadoop.
HIVE ARCHITECTURE
Figure 2.4. Hive Architecture block diagram.
When a user logs in to the Hive terminal through a CLI (command line interface) or a
web graphical user interface, the connection reaches the Hive drivers, either directly
or through the Thrift server. The queries written by users are received by the drivers
and sent to Hadoop, where Hadoop fetches the data and divides the work using the
NameNode, DataNode, JobTracker and TaskTracker.
HIVE COMPONENTS
– Thrift server
This component is optional. It allows a remote client to submit requests to Hive and
retrieve results; a variety of programming languages can be used to do so.
– Driver
Driver is a very important component that takes all requests from the CLI (command
line interface), the web interface or the Thrift server, and performs the compilation,
optimization and execution of the queries.
– Meta Store
This component stores all the structural information of the various tables and
partitions in the warehouse, including column and column-type information, the
serializers and deserializers necessary to read and write data, and the corresponding
HDFS files where the data is stored.
– HDFS
All data is stored in HDFS; a detailed explanation of HDFS is given in the HDFS
section above. Hive uses HDFS as its storage layer.
Disadvantages of Hive
- It’s not designed for online transaction processing.
- There is a built-in latency for every job.
- When Hive compiles a query into a set of MapReduce jobs it has to co-ordinate
and launch the jobs on the cluster.
Pig
Pig is a high-level data flow system that provides a simple language, popularly known
as Pig Latin, for manipulating and querying data. Pig was developed at Yahoo in 2006
to provide an ad-hoc way of creating and executing MapReduce jobs on huge data sets.
Pig has relational database features, is built on top of Hadoop, and makes it easier
to clean and analyze big data sets without having to write vanilla MapReduce jobs in
Hadoop. The Pig tool itself converts all high-level operations into MapReduce jobs.
It follows a multi-query approach and helps cut down the number of times the data is
scanned. Performance of Pig is on par with that of raw MapReduce. The structure of
Pig programs is amenable to substantial parallelization, which enables them to handle
very large data sets. Pig is a natural fit for ETL tasks, as it can handle
unstructured data.
PIG ARCHITECTURE
Figure 2.5. Pig Architecture block diagram.
PIG LATIN COMPILER
The Pig Latin compiler converts Pig Latin code into executable code in the form of
MapReduce jobs. The resulting sequence of MapReduce programs enables Pig programs to
do data processing and analysis in parallel.
BENEFITS OF PIG
Learning curve is not steep.
Decreased development time compared with vanilla MapReduce jobs, due to
reduced complexity and maintenance needs.
Helps with faster prototyping of algorithms thanks to the ease of the Pig Latin
language.
Effective for unstructured data.
It's procedural, providing better expressiveness in the transformation of data at
every step.
DISADVANTAGES OF PIG
It is not very mature; even though it has been around for quite some time, it is
still in development.
Doesn't clearly distinguish the type of error: it just reports an execution error
when something goes wrong, without specifying whether it is a syntax, runtime or
type error.
Support: Google and Stack Overflow searches don't generally lead to good solutions
for problems.
Pig is typically not used for complex business logic involving encryption of data;
the Java cryptography API is picked over Pig in such cases.
Objective:
Analysis of the rainfall scenario at the national level:
- State- and year-wise distribution of rainfall.
- Month-wise distribution of rainfall.
- Areas/states with maximum and minimum rainfall in India.
Analysis of the rainfall scenario at the state level.
Methodology:
Flume: Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of streaming data into
the Hadoop Distributed File System (HDFS).
MapReduce: MapReduce is a processing technique and a programming model for
distributed computing based on Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce task takes the output from a map as its input and combines those data
tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task
is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
MapReduce Dataflow
The frozen part of the MapReduce framework is a large distributed sort. The hot
spots, which the application defines, are:
An input reader
A Map function
A partition function
A compare function
A Reduce function
An output writer
Input reader
The input reader divides the input into appropriately sized 'splits' (in practice,
typically 64 MB to 128 MB), and the framework assigns one split to each Map function.
The input reader reads data from stable storage (typically a distributed file system)
and generates key/value pairs. A common example reads a directory full of text files
and returns each line as a record.
Map function
The Map function takes a series of key/value pairs, processes each, and generates
zero or more output key/value pairs. The input and output types of the map can be
(and often are) different from each other.
If the application is doing a word count, the map function breaks the line into
words and outputs a key/value pair for each word, with the word as the key and the
number of occurrences of that word in the line as the value.
Partition function
Each Map function output is allocated to a particular reducer by the
application's partition function for sharding purposes. The partition function is
given the key and the number of reducers and returns the index of the
desired reducer.
A typical default is to hash the key and use the hash value modulo the number
of reducers. It is important to pick a partition function that gives an approximately
uniform distribution of data per shard for load-balancing purposes, otherwise the
MapReduce operation can be held up waiting for slow reducers to finish (i.e. the
reducers assigned the larger shares of the non-uniformly partitioned data).
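The default hash-modulo scheme described above can be sketched as follows (a hypothetical Python illustration using CRC32 as the hash; Hadoop's actual default is the Java HashPartitioner, which applies the same hash-then-modulo idea):

```python
import zlib

NUM_REDUCERS = 4

def partition(key, num_reducers=NUM_REDUCERS):
    """Hash the key and return the index of the reducer it is routed to."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of a key is routed to the same reducer...
assert partition("Kerala") == partition("Kerala")
# ...and every index is a valid reducer number.
assert all(0 <= partition(s) < NUM_REDUCERS for s in ["Goa", "Kerala", "Assam"])
```

CRC32 is used here only because it is deterministic and built in; any hash with a roughly uniform output distribution serves the load-balancing goal described above.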
Between the map and reduce stages, the data are shuffled (parallel-sorted /
exchanged between nodes) in order to move the data from the map node that
produced them to the shard in which they will be reduced. The shuffle can
sometimes take longer than the computation time depending on network
bandwidth, CPU speeds, data produced and time taken by map and reduce
computations.
Comparison function
The input for each Reduce is pulled from the machine where the Map ran and
sorted using the application's comparison function.
Reduce function
The framework calls the application's Reduce function once for each unique key in
the sorted order. The Reduce can iterate through the values that are associated with
that key and produce zero or more outputs.
In the word count example, the Reduce function takes the input values, sums them,
and generates a single output of the word and the final sum.
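Putting the Map and Reduce functions of the word count example together (a minimal Python sketch of the logic only; a real Hadoop job would implement these as Java Mapper and Reducer classes):

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts observed for one word."""
    return word, sum(counts)

# Group the intermediate pairs by key, as the shuffle stage would
groups = defaultdict(list)
for line in ["to be or not to be"]:
    for word, one in map_fn(line):
        groups[word].append(one)

word_counts = dict(reduce_fn(w, c) for w, c in groups.items())
# word_counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```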
Techniques:
We will be using the Hadoop framework to build the project. The Hadoop framework
consists of many tools, such as HDFS, MapReduce, Pig, Sqoop, Flume and others.
HDFS is used to store the data; MapReduce is a data processing framework. Pig,
along the same lines, is a processing framework, but its code is written in Pig
Latin, whereas MapReduce code is written in Java. Our data will be copied into
HDFS and processed using MapReduce as well as Pig.
Hardware and Software
Hardware:
- 8 GB RAM
- 2.5 GHz clock speed
- 10 GB HDD
- i5 Processor
- Hardware with virtualization support
Software:
Linux
Hadoop Distributed File System
MapReduce
Pig
Cloudera
Windows
VM Workstation
MS Excel
Notepad++
References:
[Link]er_India
[Link]term_rainfall_trends_in_India
[Link]mperature_and_Rainfall_in_India
[Link]_Analysis_of_long-term_rainfall_trends_in_India.pdf
[Link]
[Link]
[Link]
[Link]
[Link]