Cobrix – A COBOL Data Source for Spark
Ruslan Iushchenko (ABSA), Felipe Melo (ABSA)
• Who we are
• Motivation
• Mainframe files and copybooks
• Loading simple files
• Loading hierarchical databases
• Performance and results
• Cobrix in ABSA Big Data space
Outline
2/40
About us
• ABSA is a Pan-African financial services provider
  - With Apache Spark at the core of its data engineering
• We fill gaps in the Hadoop ecosystem when we find them
• Contributions to Apache Spark
• Spark-related open-source projects (https://github.com/AbsaOSS)
  - Spline - a data lineage tracking and visualization tool
  - ABRiS - Avro SerDe for structured APIs
  - Atum - a data quality library for Spark
  - Enceladus - a dynamic data conformance engine
  - Cobrix - a COBOL library for Spark (the focus of this presentation)
3/40
The market for Mainframes is strong, with no signs of cooling down. Mainframes:
• Are used by 71% of Fortune 500 companies
• Are responsible for 87% of all credit card transactions in the world
• Are part of the IT infrastructure of 92 out of the 100 biggest banks in the world
• Handle 68% of the world’s production IT workloads, while accounting for only 6% of IT costs
For companies relying on Mainframes, becoming data-centric can be prohibitively expensive:
• High cost of hardware
• Expensive business model for data science related activities
Source: http://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/
Business Motivation
4/40
Technical Motivation
• The legacy process (shown below) takes 11 days for a 600 GB file
• Legacy data models (hierarchical)
• Need for performance, scalability, flexibility, etc.
• SPOILER alert: we brought it down to 1.1 hours
5/40
[Diagram: the legacy pipeline - 1. Extract fixed-length text files from the mainframes using proprietary tools, 2./3. Transform them on a PC into CSV, 4. Load the CSV into HDFS]
• Run analytics / Spark on mainframes
• Message Brokers (e.g. MQ)
• Sqoop
• Proprietary solutions
• But ...
• Pricey
• Slow
• Complex (especially for legacy systems)
• Require human involvement
What can you do?
6/40
How Cobrix can help
• Decreasing human involvement
• Simplifying the manipulation of hierarchical structures
• Providing scalability
• Open-source
7/40
[Diagram: a mainframe file (EBCDIC) and its schema (copybook) are read by Cobrix inside an Apache Spark application; the resulting DataFrame flows through transformations (df -> df -> df) and a writer produces the output (Parquet, JSON, CSV, ...)]
Cobrix – a custom Spark data source
8/40
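Cobrix plugs in as a regular Spark data source, so the only setup is putting the library on the application's classpath. A minimal sbt sketch is shown below; the artifact coordinates and version numbers are assumptions from memory, so check the project README for the currently published ones.

// build.sbt (sketch) - coordinates and versions are illustrative, not verified
val sparkVersion  = "2.4.4"
val cobrixVersion = "2.0.0"
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"   % sparkVersion % Provided,
  "za.co.absa.cobrix" %% "spark-cobol" % cobrixVersion
)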
A copybook is a schema definition
A data file is a collection of binary records
[Illustration: a record layout (Name, Age, Company, Phone #, Zip), a filled-in record (Name: JOHN, Age: 32, Company: FOO.COM, Phone #: +2311-327, Zip: 12000), and the same record as raw EBCDIC bytes]
9/40
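To make "binary records" concrete, here is a tiny sketch of decoding one EBCDIC text field on the JVM. It assumes the standard Cp037 (EBCDIC US) charset is available; this only illustrates the encoding and is not how Cobrix decodes fields internally.

// Sketch: decode a PIC X(4) field stored as EBCDIC bytes (byte values are illustrative)
import java.nio.charset.Charset

val ebcdic   = Charset.forName("Cp037")
val rawField = Array(0xD1, 0xD6, 0xC8, 0xD5).map(_.toByte) // "JOHN" in EBCDIC
val decoded  = new String(rawField, ebcdic)
println(decoded) // JOHN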
Similar to IDLs of Avro, Thrift, Protocol Buffers, etc.

Thrift:
struct Company {
  1: required i64 id,
  2: required string name,
  3: optional list<string> contactPeople
}

Protocol Buffers:
message Company {
  required int64 id = 1;
  required string name = 2;
  repeated string contact_people = 3;
}

COBOL:
10 COMPANY.
   15 ID             PIC 9(12) COMP.
   15 NAME           PIC X(40).
   15 CONTACT-PEOPLE PIC X(20)
                     OCCURS 10.

Equivalent generic record:
record Company {
  int64 id;
  string name;
  array<string> contactPeople;
}
10/40
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
01 RECORD.
05 COMPANY-ID PIC 9(10).
05 COMPANY-NAME PIC X(40).
05 ADDRESS PIC X(60).
05 REG-NUM PIC X(8).
05 ZIP PIC X(6).
[Raw EBCDIC bytes of the input file]
COMPANY_ID COMPANY_NAME ADDRESS REG_NUM ZIP
100 ABCD Ltd. 10 Garden st. 8791237 03120
101 ZjkLPj 11 Park ave. 1233971 23111
102 Robotrd Inc. 12 Forest st. 0382979 12000
103 Xingzhoug 8 Mountst. 2389012 31222
104 Example.co 123 Tech str. 3129001 19000
Loading Mainframe Data
11/40
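The derived schema can be inspected the usual Spark way. The output below is indicative only; the exact mapping of PIC clauses to Spark types is up to the library and is not confirmed here.

// Sketch: inspect the schema Cobrix derived from the copybook (types are indicative)
df.printSchema()
// root
//  |-- COMPANY_ID: decimal(10,0)   (from PIC 9(10))
//  |-- COMPANY_NAME: string        (from PIC X(40))
//  |-- ADDRESS: string             (from PIC X(60))
//  |-- REG_NUM: string             (from PIC X(8))
//  |-- ZIP: string                 (from PIC X(6))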
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Example").getOrCreate()

val df = spark
  .read
  .format("cobol")
  .option("copybook", "data/example.cob")
  .load("data/example")

// ...Business logic goes here...

df.write.parquet("data/output")
This App is
● Distributed
● Scalable
● Resilient
EBCDIC to Parquet examples
12/40
[Illustration: the same raw EBCDIC bytes interpreted two ways - as a single COMPANY-NAME field or as FIRST-NAME plus LAST-NAME occupying the same space]
• Redefined fields, AKA:
  • Unchecked unions
  • Untagged unions
  • Variant type fields
• Several fields occupy the same space
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
Redefined Fields
13/40
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
• Cobrix applies all redefines for each
record
• Some fields can clash
• It’s up to the user to apply business logic
to separate correct and wrong data
IS_COMPANY | COMPANY                               | PERSON                                              | ADDRESS              | ZIP
1          | {"COMPANY_NAME": "September Ltd."}    | {"FIRST_NAME": "Septem", "LAST_NAME": "ber Ltd."}   | 74 Lawn ave., Denver | 39023
0          | {"COMPANY_NAME": "Beatrice Gagliano"} | {"FIRST_NAME": "Beatrice", "LAST_NAME": "Gagliano"} | 10 Garden str.       | 33113
1          | {"COMPANY_NAME": "January Inc."}      | {"FIRST_NAME": "Januar", "LAST_NAME": "y Inc."}     | 122/1 Park ave.      | 31234
Redefined Fields
14/40
df.select($"IS_COMPANY",
when($"IS_COMPANY" === true, "COMPANY_NAME")
.otherwise(null).as("COMPANY_NAME"),
when($"IS_COMPANY" === false, "CONTACTS")
.otherwise(null).as("FIRST_NAME")),
...
IS_COMPANY COMPANY_NAME FIRST_NAME LAST_NAME ADDRESS ZIP
1 September Ltd. 74 Lawn ave., Denver 39023
0 Beatrice Gagliano 10 Garden str. 33113
1 January Inc. 122/1 Park ave. 31234
Clean Up Redefined Fields + flatten structs
15/40
Hierarchical DBs
• Several record types
• AKA segments
• Each segment type has its
own schema
• Parent-child relationships
between segments
[Illustration: a root COMPANY segment (ID, Name, Address) with multiple CONTACT-PERSON child segments (Name, Phone #)]
16/40
• The combined copybook has to contain all the segments as redefined
fields:
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY-ID PIC X(10).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
(SEGMENT-ID and COMPANY-ID are the common data; the COMPANY group is segment 1; the CONTACT redefine is segment 2)
[Illustration: a COMPANY segment (Name, Address, Reg-Num) and a CONTACT segment (Phone #, Contact Person) occupying the same record space]
Defining a Copybook
17/40
• The code snippet for reading the data:
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.load("examples/multisegment_data")
Reading all the segments
18/40
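Once the file is loaded, a plain aggregation is a quick sanity check of how many records of each segment type it contains. This is a usage sketch, not from the original deck; it assumes spark.implicits._ is imported for the $ column syntax.

// Sketch: count records per segment type
import spark.implicits._
df.groupBy($"SEGMENT_ID").count().show()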
• The dataset for the whole copybook:
• Invalid redefines are shown as [ invalid ]
SEGMENT_ID COMPANY_ID COMPANY CONTACT
C 1005918818 [ ABCD Ltd. ] [ invalid ]
P 1005918818 [ invalid ] [ Cliff Wallingford ]
C 1036146222 [ DEFG Ltd. ] [ invalid ]
P 1036146222 [ invalid ] [ Beatrice Gagliano ]
C 1045855294 [ Robotrd Inc. ] [ invalid ]
P 1045855294 [ invalid ] [ Doretha Wallingford ]
P 1045855294 [ invalid ] [ Deshawn Benally ]
P 1045855294 [ invalid ] [ Willis Tumlin ]
C 1057751949 [ Xingzhoug ] [ invalid ]
P 1057751949 [ invalid ] [ Mindy Boettcher ]
Reading all the segments
19/40
[Illustration: the raw EBCDIC byte stream parses into interleaved COMPANY and PERSON segment records, which are then separated into two DataFrames]
Id Name Address Reg_Num
100 Example.com 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 ABCD Ltd. 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Jane +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Separate segments by dataframes
20/40
• Filter segment #1 (companies)
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"COMPANY_ID",
    $"COMPANY.NAME".as("COMPANY_NAME"),
    $"COMPANY.ADDRESS",
    $"COMPANY.REG_NUM")
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Reading root segments
21/40
• Filter segment #2 (people)
val dfContacts = df
.filter($"SEGMENT_ID"==="P")
.select($"COMPANY_ID",
$"CONTACT.CONTACT_PERSON",
$"CONTACT.PHONE_NUMBER")
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Reading child segments
22/40
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
The two segments can now be joined by Company_Id
23/40
• Joining segments 1 and 2
val dfJoined =
  dfCompanies.join(dfContacts, Seq("COMPANY_ID"))
Results:
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
Joining in Spark is easy
24/40
• The joined table can also be denormalized for document storage
import org.apache.spark.sql.functions.{collect_list, struct}

val dfCombined =
  dfJoined
    .groupBy($"COMPANY_ID",
      $"COMPANY_NAME",
      $"ADDRESS",
      $"REG_NUM")
    .agg(
      collect_list(
        struct($"CONTACT_PERSON",
          $"PHONE_NUMBER"))
        .as("CONTACTS"))
{
"COMPANY_ID": "8216281722",
"COMPANY_NAME": "ABCD Ltd.",
"ADDRESS": "74 Lawn ave., New York",
"REG_NUM": "33718594",
"CONTACTS": [
{
"CONTACT_PERSON": "Cassey Norgard",
"PHONE_NUMBER": "+(595) 641 62 32"
},
{
"CONTACT_PERSON": "Verdie Deveau",
"PHONE_NUMBER": "+(721) 636 72 35"
},
{
"CONTACT_PERSON": "Otelia Batman",
"PHONE_NUMBER": "+(813) 342 66 28"
}
]
}
Denormalize data
25/40
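The denormalized DataFrame can be persisted as documents with the standard Spark writer; a small usage sketch follows (the output path is illustrative).

// Sketch: write the denormalized records as JSON documents
dfCombined.write.mode("overwrite").json("data/companies_json")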
Restore parent-child relationships
• In our example, the COMPANY_ID field was present in all segments
• In real copybooks this is often not the case
• What can we do?
[Illustration: a root COMPANY segment (ID, Name, Address) with multiple CONTACT-PERSON child segments (Name, Phone #)]
26/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• If COMPANY_ID is not part of all segments, Cobrix can generate it for you
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.option("segment_field", "SEGMENT-ID")
.option("segment_id_level0", "C")
.option("segment_id_prefix", "ID")
.load("examples/multisegment_data")
(Note: this copybook has no COMPANY-ID field)
Id Generation
27/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• Seg0_Id can be used to restore the parent-child relationship between segments (see the join sketch after this slide)
(Note: this copybook has no COMPANY-ID field)
SEGMENT_ID Seg0_Id COMPANY CONTACT
C ID_0_0 [ ABCD Ltd. ] [ invalid ]
P ID_0_0 [ invalid ] [ Cliff Wallingford ]
C ID_0_2 [ DEFG Ltd. ] [ invalid ]
P ID_0_2 [ invalid ] [ Beatrice Gagliano ]
C ID_0_4 [ Robotrd Inc. ] [ invalid ]
P ID_0_4 [ invalid ] [ Doretha Wallingford ]
P ID_0_4 [ invalid ] [ Deshawn Benally ]
Id Generation
28/40
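Below is a minimal sketch of the join mentioned above: the generated Seg0_Id plays the role that COMPANY_ID played earlier. Column and segment names follow the tables on this slide; the rest is an illustration, not code from the deck.

// Sketch: rebuild the parent-child link via the generated Seg0_Id
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"Seg0_Id", $"COMPANY.NAME".as("COMPANY_NAME"))

val dfContacts = df
  .filter($"SEGMENT_ID" === "P")
  .select($"Seg0_Id", $"CONTACT.CONTACT_PERSON", $"CONTACT.PHONE_NUMBER")

val dfLinked = dfCompanies.join(dfContacts, Seq("Seg0_Id"))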
• When transferred from a mainframe, a hierarchical database becomes
  • A sequence of records
  • To read the next record, the previous record must be read first
  • A sequential format by its nature
• How to make it scalable?
[Illustration: a data file is a continuous stream of EBCDIC bytes that parses into a sequence of COMPANY and PERSON records of varying lengths]
Variable Length Records (VLR)
29/40
Performance challenge of VLRs
• Naturally sequential files
• To read the next record, the prior record needs to be read first
• Each record has a length field
  • It acts as a pointer to the next record
• No record delimiter when reading a file from the middle
VLR structure
30/40
[Chart: throughput (MB/s) vs. number of Spark cores, variable record length - sequential processing stays flat at about 10 MB/s regardless of core count]
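To see why reading is inherently sequential, here is a minimal sketch that walks records by a length prefix. It assumes a 4-byte header whose first two bytes hold the payload length in big-endian order (similar in spirit to an RDW); the header layout is an assumption for illustration, not Cobrix's actual record-format handling.

// Sketch: walk variable-length records sequentially using an assumed 4-byte header
// whose first two bytes are the big-endian payload length.
def recordOffsets(bytes: Array[Byte]): Seq[(Long, Int)] = {
  val offsets = scala.collection.mutable.ArrayBuffer.empty[(Long, Int)]
  var pos  = 0
  var done = false
  while (!done && pos + 4 <= bytes.length) {
    val len = ((bytes(pos) & 0xFF) << 8) | (bytes(pos + 1) & 0xFF)
    if (len <= 0) done = true
    else {
      offsets += ((pos.toLong + 4, len)) // payload offset and length
      pos += 4 + len                     // the only way to find the next record
    }
  }
  offsets.toSeq
}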
Spark on HDFS
[Diagram: each HDFS data node stores blocks and hosts a Spark executor that processes the corresponding partitions; the HDFS namenode and the Spark driver coordinate the work]
31/40
[Diagram: Cobrix, Spark and the HDFS data nodes cooperating to parallelize a sequential read]
1. Cobrix reads the record headers sequentially and builds a list of (offset, length) pairs.
2. The offsets and lengths are parallelized across the Spark cluster.
3. Records are parsed in parallel from the parallelized offsets and lengths.
Parallelizing Sequential Reads
32/40
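A simplified sketch of the same idea, not Cobrix internals: the sparse index of (offset, length) pairs is built sequentially, parallelized, and each task then decodes only its own byte ranges. File access is faked with an in-memory byte array to keep the example self-contained; all values are illustrative.

// Sketch of parallelizing record parsing from a sparse index
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vlr-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val fileBytes: Array[Byte] = Array.fill(1024)(0x40.toByte)            // stand-in for the real file
val index: Seq[(Long, Int)] = Seq((4L, 100), (108L, 80), (192L, 120)) // from the header scan

val parsed = sc
  .parallelize(index, numSlices = 4)   // distribute the sparse index
  .map { case (offset, length) =>
    // each task decodes only its own byte range (decoding is simplified here)
    new String(fileBytes.slice(offset.toInt, offset.toInt + length), "Cp037").trim
  }
parsed.collect()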
[Diagram: Cobrix, the HDFS namenode and the Spark cluster negotiating data locality]
1. Cobrix reads the header of VLR 1: offset = 3000, length = 100.
2. It asks the HDFS namenode which nodes hold offsets 3000 to 3100.
3. The namenode reports nodes 3, 18 and 41.
4. The preferred location for loading VLR 1 is set to node 3.
5. Spark launches the task on an executor hosted on node 3.
Enabling Data Locality for VLRs
33/40
Throughput when sparse indexes are used
• Experiments were run on our lab cluster
  • 4 nodes
  • 380 cores
  • 10 Gbit network
• Scalable - for bigger files, using more executors yields higher throughput
34/40
[Chart: throughput (MB/s) vs. number of Spark cores, variable record length - 10 GB, 20 GB and 40 GB files all scale with added cores, while the sequential baseline stays flat]
Comparison with fixed-length record performance
● For fixed-length records, distribution and locality are handled completely by Spark
● For variable-length records, parallelism is achieved using sparse indexes
35/40
[Charts: throughput (MB/s) vs. number of Spark cores for a 40 GB file - fixed-length records reach about 150 MB/s; variable-length records with sparse indexes reach about 145 MB/s; sequential processing stays at about 10 MB/s]
Cobrix in ABSA Data Infrastructure - Batch
[Diagram: batch pipeline - 180+ mainframe sources are ingested into HDFS; 2. Parse with Cobrix on Spark; 3. Conform with Enceladus; 4. Track lineage with Spline; 5. Re-ingest; 6. Consume]
36/40
Cobrix in ABSA Data Infrastructure - Stream
[Diagram: streaming pipeline - 1. Ingest 180+ mainframe sources into Kafka; 2. Parse with the Cobrix parser; 3. Convert to Avro with ABRiS; 4. Conform with Enceladus; 5. Track lineage with Spline; 6. Consume]
37/40
Cobrix in ABSA Data Infrastructure
[Diagram: the batch pipeline (HDFS, Cobrix on Spark, Enceladus, Spline) and the streaming pipeline (Kafka, Cobrix parser, ABRiS, Enceladus, Spline) side by side, both fed by 180+ mainframe sources: ingest, parse, conform, track lineage, re-ingest / convert to Avro, consume]
38/40
● Thanks to the following people, who made the project possible and helped all along the way:
○ Andrew Baker, Francois Cillers, Adam Smyczek,
Jan Scherbaum, Peter Moon, Clifford Lategan,
Rekha Gorantla, Mohit Suryavanshi, Niel Steyn
• Thanks to the authors of the original COBOL parser:
○ Ian De Beer, Rikus de Milander
(https://github.com/zenaptix-lab/copybookStreams)
Acknowledgment
39/40
• Let's combine expertise to make accessing mainframe data in Hadoop seamless
• Our goal is to support the widest range of use cases possible
• Report a bug
• Request a new feature
• Create a pull request
Our home: https://github.com/AbsaOSS/cobrix
Your Contribution is Welcome
40/40