Big Data
Airlines Project
ZIYAD SALEH
What is Big Data?
Big data is a broad term for data sets so large or complex that they are
difficult to process using traditional data processing applications.
Big data means terabytes (1 TB = 1024 GB) of data to be processed and
analyzed. Terabytes of new data are generated daily, so the speed of
analyzing this flow of data is itself a challenge.
Big data can be described by the 4 Vs: Volume, Velocity, Variety and
Veracity.
Big Data Airline Project at UAEU
Small Data Vs. Big Data
Map Reduce
Map Reduce model
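The three stages of the model (map, shuffle, reduce) can be sketched in memory with plain Java collections. This is only an illustration of the idea, not Hadoop code and not the project's implementation: the class name MapReduceSketch, the "carrier,delayMinutes" input format, and the sample values are all assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map -> shuffle -> reduce stages (no Hadoop involved). */
class MapReduceSketch {

    // Map stage: turn one input line "carrier,delayMinutes" into a (key, value) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return new SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Shuffle stage: group all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: collapse the values of one key into a single result (here, a sum).
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three stages in order over a small list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input, not real flight data.
        System.out.println(run(Arrays.asList("AA,10", "WN,5", "AA,20")));
    }
}
```

On a real cluster the map and reduce calls run in parallel on different nodes and the framework performs the shuffle; here everything runs in one process so the data flow is easy to follow.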
Project Scope
The scope is limited to:
1. Installing and configuring the Hadoop Map/Reduce platform.
2. Analyzing a big data sample of U.S. domestic flight performance and
delays over 5 years, covering:
   1. Top carriers experiencing delays.
   2. Top airports and states with departure delays.
   3. Plotting state delays on a thematic map of the USA.
Source of Data for the Project
Datasets will be collected from:
U.S. Department of Transportation (DOT) – Statistical Computing
Size of Data
The dataset will be between 500 GB and 1 TB in size and will cover
5 years of flight statistics.
Field Name          Description
Year                Year of the scheduled flight.
Month               Month of the scheduled flight (1–12).
Day                 Day of the month (1–31).
DepTime             Actual departure time of the flight.
CRSDepTime          Scheduled departure time.
ArrTime             Actual arrival time in HH/MM format.
CRSArrTime          Scheduled arrival time.
FlightNum           Flight number.
ArrDelay            Arrival delay, in minutes.
DepDelay            Departure delay, in minutes.
CarrierDelay        Delay (in minutes) caused by factors within the control of the carrier.
WeatherDelay        Delay (in minutes) caused by extreme weather conditions.
NASDelay            Delay (in minutes) within the control of the National Airspace System (NAS).
SecurityDelay       Delay (in minutes) caused by security reasons.
LateAircraftDelay   Delay (in minutes) due to the same aircraft arriving late at a previous airport.
Table 1: Airline Dataset Dictionary.
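As a rough illustration of how the fields in Table 1 might be read from a text record, the sketch below looks values up by header name. It assumes comma-separated records with a header line and "NA" as the marker for a missing value; the slides state none of this, and the class FlightRecordParser and the sample row are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Looks fields up by header name, so no fixed column order is assumed. */
class FlightRecordParser {
    private final Map<String, Integer> columnIndex = new HashMap<>();

    // Assumption: the first line of the file is a comma-separated header.
    FlightRecordParser(String headerLine) {
        String[] names = headerLine.split(",");
        for (int i = 0; i < names.length; i++) {
            columnIndex.put(names[i].trim(), i);
        }
    }

    /** Returns the named field in minutes, or empty when it is blank or "NA". */
    OptionalInt minutes(String recordLine, String fieldName) {
        Integer idx = columnIndex.get(fieldName);
        if (idx == null) {
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        String[] values = recordLine.split(",", -1); // -1 keeps trailing empty fields
        String raw = idx < values.length ? values[idx].trim() : "";
        if (raw.isEmpty() || raw.equals("NA")) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(raw));
    }

    public static void main(String[] args) {
        FlightRecordParser parser =
            new FlightRecordParser("Year,Month,Day,ArrDelay,DepDelay,WeatherDelay");
        String row = "2008,1,3,-14,8,NA"; // made-up sample row
        System.out.println(parser.minutes(row, "ArrDelay"));
        System.out.println(parser.minutes(row, "WeatherDelay"));
    }
}
```

Returning an empty value for missing fields, rather than 0, keeps "no delay recorded" distinct from "zero minutes of delay" in later aggregation.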
Data Pre-Processing, Processing and Analytics
Data pre-processing:
Data will be cleansed and artifacts will be filtered out as
necessary. Many fields in the airline data set will be discarded because
they are irrelevant to flight delays, the subject of this project.
Data processing and analytics:
Data will be processed with Java programs on Map/Reduce to
reduce the size of the data and produce smaller, organized
datasets.
Next, the resulting datasets will be analyzed using additional tools
such as R.
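The pre-processing step described above (filtering out bad records and discarding irrelevant fields) could look roughly like the following. This is a sketch under assumptions: records are comma-separated, a record is treated as malformed when its field count is wrong, and the class PreProcessSketch, the column positions, and the sample rows are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Drops malformed rows and keeps only the columns needed for delay analysis. */
class PreProcessSketch {

    /**
     * Keeps only the columns listed in keepIdx, in that order.
     * Rows that do not have exactly expectedFields fields are filtered out.
     */
    static List<String> clean(List<String> rows, int expectedFields, int[] keepIdx) {
        List<String> cleaned = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",", -1); // -1 keeps trailing empty fields
            if (fields.length != expectedFields) {
                continue; // malformed row: skip it
            }
            StringBuilder kept = new StringBuilder();
            for (int i = 0; i < keepIdx.length; i++) {
                if (i > 0) {
                    kept.append(',');
                }
                kept.append(fields[keepIdx[i]].trim());
            }
            cleaned.add(kept.toString());
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Hypothetical layout: Year,Month,Day,Carrier,ArrDelay,DepDelay
        List<String> rows = Arrays.asList("2008,1,3,WN,-14,8", "bad,row", "2008,1,4,AA,25,30");
        System.out.println(clean(rows, 6, new int[] {3, 4, 5}));
    }
}
```

Shrinking each record to a few fields before aggregation is one way to get the smaller, organized datasets mentioned above.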
Data Storage
Data will be stored across multiple HDFS storage nodes, with a total
size between 500 GB and 1 TB.
[Diagram labels: Airlines Big Data; HDFS]
Target Analysis:
Across 5 years of flight information for all U.S. domestic airlines:
1. Which carriers have the most aggregated delay in their flights?
2. Which states have the most delays?
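Both questions reduce to ranking keys (carriers or states) by their total delay once the totals have been aggregated. A minimal sketch of that ranking step follows; it assumes the per-key totals are already available as a map, and the class TopDelaysSketch, the method topN, and the sample totals are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Ranks keys (for example carrier codes or states) by total delay, largest first. */
class TopDelaysSketch {

    static List<String> topN(Map<String, Long> totalDelayByKey, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totalDelayByKey.entrySet());
        // Sort by total delay descending; break ties alphabetically for a stable result.
        entries.sort((a, b) -> {
            int byDelay = Long.compare(b.getValue(), a.getValue());
            return byDelay != 0 ? byDelay : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = new HashMap<>();
        totals.put("AA", 30L); // made-up totals, in minutes
        totals.put("WN", 50L);
        totals.put("DL", 30L);
        System.out.println(topN(totals, 2));
    }
}
```

The same method serves both questions: pass totals keyed by carrier for the first and totals keyed by state for the second.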
Design
Airlines Project Workflow and Design
[Diagram labels: Master Node (Name Node, Job Tracker); Node 1, Node 2, Node 3, Node 4; Airlines Big Data; HDFS; Task; Java Code; Mapper; Reducer; Reducer Node; Top Airlines]
Implementation
Software and Tools
1. CentOS Linux operating system
2. Apache Hadoop
3. Cloudera CDH 5.3 virtual machine
4. Oracle VM VirtualBox Manager
5. Eclipse IDE
6. Java (Oracle JDK)
7. Maven
8. Microsoft Excel and Access 2010
9. The R statistical tool
Mapper:
Reducer:
R:
Findings
US Airlines Delay (Per Carrier)
[Chart: vertical axis from 0 to 1.2; carriers: WN, AA, OO, MQ, US, DL, UA, XE, NW, CO, EV, 9E, FL, YV, OH, B6, AS, F9, HA, AQ, PI, HP, EA, PS, TW; series: ArrivalOnTime, ArrivalDelays, DepartureOnTime, DepartureDelays, Cancellations, Diversions]
Thematic Map of US Airlines Delay (Per State)
Conclusion
• Big data is the large amount of continuously generated data that cannot be processed and
analyzed using traditional data management tools.
• Big data is a new topic that is growing rapidly and reshaping the future. Demand for big data
scientists is high and will continue in the coming years.
• Hadoop is an open source framework for storing and processing large datasets using clusters
of commodity hardware.
• Big data analytics is attracting both businesses and policy makers, who use it to make
more informed decisions and to plan for the future.
• Big data now, normal data tomorrow.
Big Data Tutorials
Online Big Data Tutorials:
1. Udemy: https://www.udemy.com/course/subscribe/?courseId=336982&dtcode=lGCe31035ujY
2. Udacity: https://www.udacity.com/courses#!/data-science
3. EMC: https://education.emc.com/guest/campaign/data_science.aspx
4. Coursera: https://www.coursera.org/course/datasci
5. Caltech, Learning from Data: http://work.caltech.edu/telecourse.html
6. MIT OpenCourseWare: http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/index.htm
7. Stanford OpenClassroom: http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
8. Big Data University: https://bigdatauniversity.com/curriculum-map/
Thank You
Ziyad Saleh
O Allah, teach us what benefits us, benefit us through what You have taught us,
and increase us in knowledge.
