Tamr: Unifying Hadoop Data Lakes

Tamr integrates related datasets across Hadoop nodes very efficiently and enables customers to finally realize a return on their Big Data investments. Tamr's matching engine aligns attributes from different datasets to a unified schema, identifies records referring to the same entities, and produces clean integrated datasets within Hadoop to enable downstream analytics without moving data. The platform continuously matches new data as it is added to maintain updated integrated datasets for ongoing analytics as companies leverage Hadoop as an enterprise data source.

Uploaded by

cgarciahe5427


Tamr and the Data Lake

Tamr Unifies Datasets In Hadoop To Unlock Hidden Insights

Companies Struggle With Integrating Data In Hadoop
Hadoop has helped organizations significantly reduce the cost of data processing by spreading work over
clusters built on commodity hardware, and it gives companies the ability to host massive amounts
of heterogeneous data. With the growing popularity of Hadoop, a significant number of
organizations have been creating Data Lakes, where they store data derived from structured and unstructured
data sources in its raw format. However, these companies struggle to connect and transform the data
into a unified dataset for business analysis without a significant investment of time and money. This is largely
because schema proliferation is rampant, and very rarely are any two datasets structured exactly alike.

Tamr’s Matching Engine Unifies Data Within Hadoop

Tamr solves the biggest challenge in unifying datasets in Hadoop, namely connecting and cleaning the data so
that it’s ready for analytics. Tamr is a data unification platform that leverages machine learning and customer
expertise to create integrated, clean datasets with unrivaled speed and scalability. In particular, Tamr focuses
on profiling datasets, creating ideal target schemas, and deduplicating records in order to prepare datasets
for analysis.

Tamr’s core offering for Hadoop consists of two components:

+ A module for training, administration, and expert sourcing that runs on top of a relational database on an
edge node of the customer’s Hadoop cluster.

+ A matching engine that runs distributed on the Hadoop cluster where pertinent data is stored.

Because of the scale of the data, it is very expensive to move Hadoop-based data outside of the Data Lake.
Therefore, Tamr avoids this by doing all data-scale processing within the Data Lake, thus eliminating the need to
replicate the entire data set.

Tamr’s Data Lake capabilities can be best explained by example. Let’s take a (simplified) example where
customer data is the focus of what’s stored in a company’s Data Lake and the company is trying to generate a
360-degree view of its customers. The particular data sets stored are:

+ CRM data with information about each customer the organization does business with, often with
several duplicates

+ Clickstream data from the company’s website

+ Transactional data such as prior purchases by each customer

Let’s assume the following data structures:

CRM Data

Clickstream Data

Transactional Data


The ‘Clickstream Data’ and ‘Transactional Data’ are transactional in nature and, hence, will likely grow to be very
large in size. As an example, if a company has 1 million customers, the clickstream data might contain billions of
rows. Moving it out of the Data Lake is not an option due to cost and technical challenges.

Tamr Deployment

Below is the process for how Tamr will be deployed in the environment described:

Registration
To begin, a user would deploy a Tamr instance on the edge node of the company’s Data Lake (Hadoop). The
user then registers data sources within Hadoop that are relevant to the particular analysis being conducted. For
example, in this case the three nodes with the data sources mentioned above would be registered.

Schema Mapping
Once registration of sources is complete, Tamr will read all relevant data sources in Hadoop and pull in samples
of the data in order to conduct schema mapping. Each sample contains only a small number of rows, enough to help the
user build the schema mapping. Through a combination of machine learning and expert sourcing, Tamr will
work with the end user and business experts to create a schema for the unified dataset, called the ‘Unified
Schema’.

Let’s assume that the schema mapping is identified as follows. Note that most attributes from transactional data
sources are not mapped, as they do not help to identify a customer. The table generated through this schema
could later be joined with a transactional database for analytics.
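
A schema mapping of this kind can be pictured as a lookup from each source's attribute names to the attributes of the Unified Schema. The following is a minimal Python sketch under stated assumptions: the SCHEMA_MAPPING layout, the to_unified helper, and all attribute names other than the 'Cust_Name', 'Full Name', and 'Consumer_Name' variants mentioned in this paper are hypothetical and do not represent Tamr's actual interface.

```python
# Hypothetical mapping: source name -> {source attribute: unified attribute}.
# Attributes that do not help identify a customer are simply left unmapped.
SCHEMA_MAPPING = {
    "crm": {"Cust_Name": "customer_name", "Email_Addr": "email"},
    "clickstream": {"Full Name": "customer_name", "user_email": "email"},
    "transactions": {"Consumer_Name": "customer_name"},
}


def to_unified(source, record):
    """Rename mapped attributes to the unified schema and drop the rest."""
    mapping = SCHEMA_MAPPING[source]
    return {
        unified_attr: record[source_attr]
        for source_attr, unified_attr in mapping.items()
        if source_attr in record
    }


# A transactional row keeps only the attribute that identifies the customer.
row = {"Consumer_Name": "Bob Smith", "order_id": 17, "amount": 9.5}
unified_row = to_unified("transactions", row)
```

Keeping the mapping as plain data means business experts can review and correct it without touching the code that applies it.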

Entity Matching

Once critical attributes are identified, Tamr will bring in records from each dataset for entity matching. Because
only the unified attributes are relevant, Tamr’s intelligent sampling can bring in only those records that differ in
these attributes. As an example, the billions of rows in the clickstream data will be reduced to
millions of unique records.
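
The reduction described above can be approximated by projecting each row onto the unified attributes and keeping one row per distinct combination. This is a simplified sketch, not Tamr's sampling method: the function name unique_projections, the attribute names, and the sample rows are assumptions made for illustration.

```python
def unique_projections(rows, unified_attrs):
    """Keep one row per distinct combination of the unified attributes."""
    seen = set()
    unique = []
    for row in rows:
        # Project the row onto the unified attributes only.
        key = tuple(row.get(attr) for attr in unified_attrs)
        if key not in seen:
            seen.add(key)
            unique.append(dict(zip(unified_attrs, key)))
    return unique


# Hypothetical clickstream rows: many page views, few distinct customers.
clickstream = [
    {"customer_name": "Bob Smith", "email": "bob@example.com", "page": "/home"},
    {"customer_name": "Bob Smith", "email": "bob@example.com", "page": "/cart"},
    {"customer_name": "Bob Smith", "email": "bob@example.com", "page": "/checkout"},
    {"customer_name": "Ann Lee", "email": "ann@example.com", "page": "/home"},
]
candidates = unique_projections(clickstream, ["customer_name", "email"])
```

Here four page-view rows collapse to two candidate records for matching, because the page attribute is not part of the unified schema.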

Tamr will then use machine learning to automate most of the entity matching, while ensuring the highest levels
of accuracy through Tamr’s native capabilities around expert sourcing. Tamr’s machine learning-based approach
has the added benefit of learning as experts feed insight into the product (i.e., verification of matches), such that
future matching processes require progressively less human intervention.

Matching Engine Deployment On Hadoop To Create Clean, Integrated Datasets

Tamr produces ‘reference maps’ of keys as a result of this sampling exercise, which contain IDs related to all
unique entities Tamr has identified as well as IDs related to the unified attributes identified. These reference
maps are then pushed to Tamr’s matching engine, which is natively deployed on Hadoop.
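
One way to picture a reference map is as a small key-to-key lookup that can be shipped to the cluster in place of the data itself. The sketch below assumes a layout in which each (source, record ID) pair points to an entity ID; the identifiers, the record_to_entity name, and the records_by_entity helper are hypothetical, and the paper does not specify the actual format.

```python
from collections import defaultdict

# Hypothetical reference map: (source, source record ID) -> entity ID.
record_to_entity = {
    ("crm", "c-001"): "E1",
    ("crm", "c-002"): "E1",          # duplicate CRM entry for the same customer
    ("clickstream", "u-77"): "E1",
    ("transactions", "t-9"): "E2",
}


def records_by_entity(reference_map):
    """Invert the map so each entity lists the source records that refer to it."""
    grouped = defaultdict(list)
    for record_key, entity_id in reference_map.items():
        grouped[entity_id].append(record_key)
    return dict(grouped)


entity_index = records_by_entity(record_to_entity)
```

Because the map holds only keys, it stays small relative to the source data, which fits the goal of doing data-scale processing inside the Data Lake.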


Tamr’s matching engine will leverage the reference maps to:

+ Align all relevant source attributes from across datasets registered in Hadoop to unified schema
attributes, regardless of the multitude of naming conventions associated with the source attributes. In
the Customer 360 example, Tamr could identify all source attributes that relate to a unified attribute,
for example ‘Customer Name’, even if the source attributes are called ‘Cust_Name’, ‘Full Name’, or
‘Consumer_Name’.

+ Identify and cluster records referencing the same entity across multiple datasets in Hadoop. In the
Customer 360 example, Tamr would identify a cluster of names that are all related to ‘Robert Smith’
even if the source records listed the customer as ‘R. Smith’, ‘Bob Smith’, or ‘Rob S’.
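
The record clustering in the second point can be sketched with a pairwise match rule and a union-find structure. This is an illustrative stand-in, not Tamr's machine learning: the NICKNAMES table, the initial-matching rule, and the function names are assumptions, and the all-pairs comparison would not scale to Hadoop-sized data without some form of candidate pruning.

```python
# Hypothetical nickname table; a real system would learn or curate this.
NICKNAMES = {"bob": "robert", "rob": "robert"}


def parse(name):
    """Return (first, last) tokens, lowercased, with nicknames expanded."""
    tokens = name.replace(".", "").lower().split()
    first, last = tokens[0], tokens[-1]
    return NICKNAMES.get(first, first), last


def compatible(a, b):
    """Tokens agree if equal, or if one is a single-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def same_person(n1, n2):
    f1, l1 = parse(n1)
    f2, l2 = parse(n2)
    return compatible(f1, f2) and compatible(l1, l2)


def cluster(names):
    """Group names into clusters using union-find over pairwise matches."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if same_person(names[i], names[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


clusters = cluster(["Robert Smith", "R. Smith", "Bob Smith", "Rob S", "Alice Jones"])
```

With these rules the four Smith variants fall into one cluster and the unrelated name stays on its own.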

The application of Tamr’s matching engine on the Hadoop nodes will enable a customer to produce data tables
of interest. In the Customer 360 example, Tamr would help produce a customer table which can be used in the
construction of a data mart or in downstream analytics tools. This table represents the clean, integrated dataset
missing from many Hadoop implementations today.

Iterative, Ongoing Analytics

As additional data gets added to registered data sets, Tamr continues to match it to the reference data set
initially created. In the Customer 360 example, as new customers get added, there might not be a match to
the existing reference data set. In this case, Tamr will automatically add the record to the reference data set or
send it to experts for review. This is enabled through Tamr’s ability to create master lists of entities (in this case,
customers) being modeled and match new data sets to them at scale. The capability allows the organization’s
downstream analytics to remain updated as they continue to use Hadoop as a source for critical enterprise data.
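
The routing decision for a newly arrived record can be sketched as a comparison against the master list with two thresholds. Everything here is an assumption for illustration: the token-overlap score, the threshold values, the three-way split, and the entity ID scheme are hypothetical and stand in for Tamr's learned matching model.

```python
def token_jaccard(a, b):
    """Similarity as the overlap between the lowercase word sets of two names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def route_new_record(name, master_list, accept=0.9, review=0.3):
    """Match a new record to the master list, flag it for review, or add it."""
    best_id, best_score = None, 0.0
    for entity_id, reference_name in master_list.items():
        score = token_jaccard(name, reference_name)
        if score > best_score:
            best_id, best_score = entity_id, score

    if best_score >= accept:
        return ("matched", best_id)
    if best_score >= review:
        return ("expert_review", best_id)

    # No plausible match: register the record as a new entity.
    new_id = f"E{len(master_list) + 1}"
    master_list[new_id] = name
    return ("added", new_id)


master_list = {"E1": "Robert Smith"}
decision = route_new_record("robert smith", master_list)
```

Records that land in the review band are where expert feedback would be collected, which is the input the paper describes as reducing human intervention over time.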

Unleash The Power Of Your Hadoop Implementation

Investment in Hadoop is expected to continue for some time as companies look to capture and analyze data that
was not previously accessible to them. One of the major factors that turn Data Lakes into data swamps is the
failure to integrate related datasets across Hadoop nodes. Moreover, there is a need to conduct this integration
without having to spend the time and money extracting the data from Hadoop. Tamr solves both of these problems
very efficiently and enables customers to finally realize a return on their Big Data investments.

About Tamr
Tamr, Inc., provides a data unification platform that dramatically reduces the time and effort of connecting and
enriching multiple data sources to achieve a unified view of siloed enterprise data. Using Tamr, organizations are
able to complete data unification projects in days or weeks versus months or quarters. For your own personalized
Tamr demo, visit www.tamr.com.
