
Uber Case-Study

We have all heard of Uber.

But does anyone know how much data Uber has to handle?
What made Uber jump from a traditional database to Hadoop?

[Diagram: what Uber's data covers]
• End-user details (drivers, customers, employees, etc.)
• 100 petabytes of analytical data
• Forecasting, traffic jams
Uber Using a Traditional Database (Before 2014)
To leverage the data, their engineers had to access each database or table
individually. At that time, they didn't have global access or a global view of all
their stored data. In fact, their data was scattered across different OLTP
databases.
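As a rough illustration (not Uber's actual code; the connection strings and table names below are made up), here is what "no global view" looks like in practice: answering even one question means connecting to each OLTP database separately and stitching the results together in application code.

# Hypothetical example: two separate OLTP databases, no shared warehouse.
import sqlalchemy as sa

trips_db = sa.create_engine("postgresql://analytics@trips-db:5432/trips")           # assumed
drivers_db = sa.create_engine("mysql+pymysql://analytics@drivers-db:3306/drivers")  # assumed

with trips_db.connect() as conn:
    trips = conn.execute(sa.text("SELECT driver_id, fare FROM completed_trips")).fetchall()

with drivers_db.connect() as conn:
    drivers = conn.execute(sa.text("SELECT driver_id, city FROM driver_profiles")).fetchall()

# The "join" happens by hand, because no single system holds both tables.
city_by_driver = {driver_id: city for driver_id, city in drivers}
fares_with_city = [(fare, city_by_driver.get(driver_id)) for driver_id, fare in trips]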

You might ask: why, then, did they want to switch to a data warehouse?
[Diagram labels: Elastic MapReduce (EMR); Extract, Transform and Load (ETL)]

Limitations:
• Since data was ingested through ad hoc ETL jobs and there was no formal schema communication
mechanism, data reliability became a concern. Most of the source data was in JSON format, and
ingestion jobs were not resilient to changes in the producer code (see the sketch after this list).
• As the company was still growing, scaling the data warehouse became increasingly expensive.
To cut down on costs, they started deleting older, obsolete data to free up space for new data.
• The same data could be ingested multiple times if different users performed different transformations
during ingestion. This resulted in multiple copies of almost the same data being stored in the
warehouse, further increasing storage costs.
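To make the first limitation concrete, here is a minimal PySpark sketch (assumed tooling and paths, not Uber's actual pipeline) of why schema-less JSON ingestion is fragile: relying on schema inference means any change in the producer's JSON silently changes the ingested table, while a pinned, communicated schema makes such changes fail loudly.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("adhoc-json-ingest").getOrCreate()

# Ad hoc approach: infer the schema from whatever JSON the producer emitted today.
trips_inferred = spark.read.json("s3://example-bucket/raw/trips/2014-01-01/")  # hypothetical path

# With a formal schema communication mechanism, the schema is pinned explicitly,
# so a producer-side change surfaces as an ingestion error instead of silently
# corrupting downstream tables.
trip_schema = StructType([
    StructField("trip_id", StringType(), nullable=False),
    StructField("driver_id", StringType(), nullable=True),
    StructField("fare", DoubleType(), nullable=True),
    StructField("completed_at", TimestampType(), nullable=True),
])
trips_typed = spark.read.schema(trip_schema).json("s3://example-bucket/raw/trips/2014-01-01/")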
Introducing Hadoop

Limitations:
❑ As the company continued scaling, with tens of petabytes of data stored in the ecosystem, they
faced a new set of challenges.
❑ Data latency was still far from what the business needed. New data was only accessible to users
once every 24 hours, which was too slow for real-time decisions.
❑ Since HDFS and Parquet do not support data updates, all ingestion jobs needed to create new
snapshots from the updated source data, ingest the new snapshot into Hadoop, convert it into
Parquet format, and then swap the output tables to expose the new data.
A big part of each job involved converting both historical and new data from the latest snapshot.
While only a little over 100 gigabytes of new data was added every day for each table, each run of the
ingestion job had to convert the entire, over 100 terabyte dataset for that specific table. The same was
true for ETL and modeling jobs that recreated derived tables on every run. These jobs had
to rely on snapshot-based ingestion of the source data because of the high ratio of updates on
historical data. By nature, the data contains a lot of update operations (e.g., rider and driver ratings or
support fare adjustments a few hours or even days after a completed trip).
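To make the snapshot-based pattern concrete, here is a hedged PySpark sketch (the paths and table name are assumptions, not Uber's actual jobs): every run reads the full latest snapshot, rewrites the whole thing as Parquet, and swaps the table location, even when only a small fraction of the rows actually changed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snapshot-ingest").getOrCreate()

# 1. Read the entire latest snapshot of the source table (100+ TB in the case above),
#    even though only ~100 GB of it is new or changed.
snapshot = spark.read.json("hdfs:///ingest/source_snapshots/trips/latest/")

# 2. Convert the whole snapshot to Parquet under a new, versioned location.
staging_path = "hdfs:///warehouse/trips_parquet/_staging_v2"
snapshot.write.mode("overwrite").parquet(staging_path)

# 3. "Swap" the output table so readers see the new data, e.g. by repointing an
#    external table at the staging path (one possible way to do the swap).
spark.sql("ALTER TABLE trips SET LOCATION 'hdfs:///warehouse/trips_parquet/_staging_v2'")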
What’s
next ?
Generation 4

HDFS scalability limitation, faster data in Hadoop, support for updates and deletes in Hadoop
and Parquet, faster ETL and modeling.

They built Hadoop Upserts anD Incremental (Hudi), an open source Spark library that provides
an abstraction layer on top of HDFS and Parquet to support the required update and delete operations.

Using the Hudi library, they were able to move away from the snapshot-based ingestion of raw data
to an incremental ingestion model that enabled them to reduce data latency from 24 hours to less
than one hour.
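For comparison with the snapshot sketch above, here is a minimal sketch of an incremental upsert using Hudi's Spark datasource (the paths, table name, and key fields are assumptions for illustration, not Uber's production configuration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-incremental-ingest")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  # recommended for Hudi
    .getOrCreate()
)

# Only the incremental batch of changed rows (e.g. updated ratings, fare adjustments).
changes = spark.read.json("hdfs:///ingest/trips_changelog/latest/")

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # identifies each row
    "hoodie.datasource.write.precombine.field": "updated_at",  # keeps the latest version of a row
    "hoodie.datasource.write.operation": "upsert",             # update-or-insert instead of full rewrite
}

# Hudi merges these records into the existing Parquet files on HDFS, so downstream
# readers see updates without rewriting the entire table.
changes.write.format("hudi").options(**hudi_options).mode("append").save("hdfs:///warehouse/trips_hudi/")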

They also formalized the hand-over of upstream data store changes between the storage and big data
teams through Apache Kafka for generic data ingestion.
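As an illustration of that hand-over (assuming a kafka-python consumer and a hypothetical changelog topic name), the big data side can consume upstream change events generically:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "datastore.trips.changelog",          # hypothetical topic published by the storage team
    bootstrap_servers=["kafka:9092"],
    group_id="bigdata-generic-ingestion",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Each message describes one upstream change (insert/update/delete); a real
    # ingestion job would batch these and hand them to an incremental (e.g. Hudi) upsert.
    print(change.get("op"), change.get("table"), change.get("key"))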
