Cobrix – A COBOL Data Source for Spark
Ruslan Iushchenko (ABSA), Felipe Melo (ABSA)
• Who we are
• Motivation
• Mainframe files and copybooks
• Loading simple files
• Loading hierarchical databases
• Performance and results
• Cobrix in ABSA Big Data space
Outline
2/40
About us
• ABSA is a Pan-African financial services provider
  - With Apache Spark at the core of its data engineering
• We fill gaps in the Hadoop ecosystem when we find them
• Contributions to Apache Spark
• Spark-related open-source projects (https://github.com/AbsaOSS)
  - Spline - a data lineage tracking and visualization tool
  - ABRiS - Avro SerDe for structured APIs
  - Atum - a data quality library for Spark
  - Enceladus - a dynamic data conformance engine
  - Cobrix - a COBOL library for Spark (the focus of this presentation)
3/40
The market for Mainframes is strong, with no signs of cooling down. Mainframes:
• Are used by 71% of Fortune 500 companies
• Are responsible for 87% of all credit card transactions in the world
• Are part of the IT infrastructure of 92 out of the 100 biggest banks in the world
• Handle 68% of the world’s production IT workloads, while accounting for only 6% of IT costs
For companies relying on Mainframes, becoming data-centric can be prohibitively expensive:
• High cost of hardware
• Expensive business model for data science related activities
Source: http://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/
Business Motivation
4/40
Technical Motivation
• The legacy process (shown below) takes 11 days for a 600 GB file
• Legacy data models (hierarchical)
• Need for performance, scalability, flexibility, etc.
• SPOILER alert: we brought it down to 1.1 hours
5/40
[Diagram: the legacy pipeline - 1. Extract fixed-length text files from the mainframes using proprietary tools, 2./3. Transform them on a PC into CSV, 4. Load the CSV into HDFS]
• Run analytics / Spark on mainframes
• Message Brokers (e.g. MQ)
• Sqoop
• Proprietary solutions
• But ...
• Pricey
• Slow
• Complex (especially for legacy systems)
• Require human involvement
What can you do?
6/40
How Cobrix can help
• Decreasing human involvement
• Simplifying the manipulation of hierarchical structures
• Providing scalability
• Open-source
7/40
[Diagram: a mainframe file (EBCDIC) and its schema (copybook) are read by Cobrix inside an Apache Spark application; the resulting DataFrame flows through transformations (df -> df -> df) and a writer produces the output (Parquet, JSON, CSV, ...)]
Cobrix – a custom Spark data source
8/40
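Cobrix plugs in as a regular Spark data source, so the only setup is putting the library on the application's classpath. A minimal sbt sketch is shown below; the artifact coordinates and version numbers are assumptions from memory, so check the project README for the currently published ones.

// build.sbt (sketch) - coordinates and versions are illustrative, not verified
val sparkVersion  = "2.4.4"
val cobrixVersion = "2.0.0"
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"   % sparkVersion % Provided,
  "za.co.absa.cobrix" %% "spark-cobol" % cobrixVersion
)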
A copybook is a schema definition
A data file is a collection of binary records
[Illustration: a record layout (Name, Age, Company, Phone #, Zip), a filled-in record (Name: JOHN, Age: 32, Company: FOO.COM, Phone #: +2311-327, Zip: 12000), and the same record as raw EBCDIC bytes]
9/40
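To make "binary records" concrete, here is a tiny sketch of decoding one EBCDIC text field on the JVM. It assumes the standard Cp037 (EBCDIC US) charset is available; this only illustrates the encoding and is not how Cobrix decodes fields internally.

// Sketch: decode a PIC X(4) field stored as EBCDIC bytes (byte values are illustrative)
import java.nio.charset.Charset

val ebcdic   = Charset.forName("Cp037")
val rawField = Array(0xD1, 0xD6, 0xC8, 0xD5).map(_.toByte) // "JOHN" in EBCDIC
val decoded  = new String(rawField, ebcdic)
println(decoded) // JOHN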
Similar to IDLs of Avro, Thrift, Protocol Buffers, etc.

Thrift:
struct Company {
  1: required i64 id,
  2: required string name,
  3: optional list<string> contactPeople
}

Protocol Buffers:
message Company {
  required int64 id = 1;
  required string name = 2;
  repeated string contact_people = 3;
}

COBOL:
10 COMPANY.
   15 ID             PIC 9(12) COMP.
   15 NAME           PIC X(40).
   15 CONTACT-PEOPLE PIC X(20)
                     OCCURS 10.

Equivalent generic record:
record Company {
  int64 id;
  string name;
  array<string> contactPeople;
}
10/40
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
01 RECORD.
05 COMPANY-ID PIC 9(10).
05 COMPANY-NAME PIC X(40).
05 ADDRESS PIC X(60).
05 REG-NUM PIC X(8).
05 ZIP PIC X(6).
[Raw EBCDIC bytes of the input file]
COMPANY_ID COMPANY_NAME ADDRESS REG_NUM ZIP
100 ABCD Ltd. 10 Garden st. 8791237 03120
101 ZjkLPj 11 Park ave. 1233971 23111
102 Robotrd Inc. 12 Forest st. 0382979 12000
103 Xingzhoug 8 Mountst. 2389012 31222
104 Example.co 123 Tech str. 3129001 19000
Loading Mainframe Data
11/40
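The derived schema can be inspected the usual Spark way. The output below is indicative only; the exact mapping of PIC clauses to Spark types is up to the library and is not confirmed here.

// Sketch: inspect the schema Cobrix derived from the copybook (types are indicative)
df.printSchema()
// root
//  |-- COMPANY_ID: decimal(10,0)   (from PIC 9(10))
//  |-- COMPANY_NAME: string        (from PIC X(40))
//  |-- ADDRESS: string             (from PIC X(60))
//  |-- REG_NUM: string             (from PIC X(8))
//  |-- ZIP: string                 (from PIC X(6))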
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Example").getOrCreate()

val df = spark
  .read
  .format("cobol")
  .option("copybook", "data/example.cob")
  .load("data/example")

// ...Business logic goes here...

df.write.parquet("data/output")
This App is
● Distributed
● Scalable
● Resilient
EBCDIC to Parquet examples
12/40
[Illustration: the same raw EBCDIC bytes interpreted two ways - as a single COMPANY-NAME field or as FIRST-NAME plus LAST-NAME occupying the same space]
• Redefined fields, AKA:
  • Unchecked unions
  • Untagged unions
  • Variant type fields
• Several fields occupy the same space
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
Redefined Fields
13/40
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
• Cobrix applies all redefines for each
record
• Some fields can clash
• It’s up to the user to apply business logic
to separate correct and wrong data
IS_COMPANY | COMPANY                               | PERSON                                              | ADDRESS              | ZIP
1          | {"COMPANY_NAME": "September Ltd."}    | {"FIRST_NAME": "Septem", "LAST_NAME": "ber Ltd."}   | 74 Lawn ave., Denver | 39023
0          | {"COMPANY_NAME": "Beatrice Gagliano"} | {"FIRST_NAME": "Beatrice", "LAST_NAME": "Gagliano"} | 10 Garden str.       | 33113
1          | {"COMPANY_NAME": "January Inc."}      | {"FIRST_NAME": "Januar", "LAST_NAME": "y Inc."}     | 122/1 Park ave.      | 31234
Redefined Fields
14/40
df.select($"IS_COMPANY",
when($"IS_COMPANY" === true, "COMPANY_NAME")
.otherwise(null).as("COMPANY_NAME"),
when($"IS_COMPANY" === false, "CONTACTS")
.otherwise(null).as("FIRST_NAME")),
...
IS_COMPANY COMPANY_NAME FIRST_NAME LAST_NAME ADDRESS ZIP
1 September Ltd. 74 Lawn ave., Denver 39023
0 Beatrice Gagliano 10 Garden str. 33113
1 January Inc. 122/1 Park ave. 31234
Clean Up Redefined Fields + flatten structs
15/40
Hierarchical DBs
• Several record types
• AKA segments
• Each segment type has its
own schema
• Parent-child relationships
between segments
[Illustration: a root COMPANY segment (ID, Name, Address) with multiple CONTACT-PERSON child segments (Name, Phone #)]
16/40
• The combined copybook has to contain all the segments as redefined
fields:
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY-ID PIC X(10).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
(SEGMENT-ID and COMPANY-ID are the common data; the COMPANY group is segment 1; the CONTACT redefine is segment 2)
[Illustration: a COMPANY segment (Name, Address, Reg-Num) and a CONTACT segment (Phone #, Contact Person) occupying the same record space]
Defining a Copybook
17/40
• The code snippet for reading the data:
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.load("examples/multisegment_data")
Reading all the segments
18/40
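Once the file is loaded, a plain aggregation is a quick sanity check of how many records of each segment type it contains. This is a usage sketch, not from the original deck; it assumes spark.implicits._ is imported for the $ column syntax.

// Sketch: count records per segment type
import spark.implicits._
df.groupBy($"SEGMENT_ID").count().show()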
• The dataset for the whole copybook:
• Invalid redefines are shown as [ invalid ]
SEGMENT_ID COMPANY_ID COMPANY CONTACT
C 1005918818 [ ABCD Ltd. ] [ invalid ]
P 1005918818 [ invalid ] [ Cliff Wallingford ]
C 1036146222 [ DEFG Ltd. ] [ invalid ]
P 1036146222 [ invalid ] [ Beatrice Gagliano ]
C 1045855294 [ Robotrd Inc. ] [ invalid ]
P 1045855294 [ invalid ] [ Doretha Wallingford ]
P 1045855294 [ invalid ] [ Deshawn Benally ]
P 1045855294 [ invalid ] [ Willis Tumlin ]
C 1057751949 [ Xingzhoug ] [ invalid ]
P 1057751949 [ invalid ] [ Mindy Boettcher ]
Reading all the segments
19/40
[Illustration: the raw EBCDIC byte stream parses into interleaved COMPANY and PERSON segment records, which are then separated into two DataFrames]
Id Name Address Reg_Num
100 Example.com 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 ABCD Ltd. 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Jane +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Separate segments by dataframes
20/40
• Filter segment #1 (companies)
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"COMPANY_ID",
    $"COMPANY.NAME".as("COMPANY_NAME"),
    $"COMPANY.ADDRESS",
    $"COMPANY.REG_NUM")
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Reading root segments
21/40
• Filter segment #2 (people)
val dfContacts = df
.filter($"SEGMENT_ID"==="P")
.select($"COMPANY_ID",
$"CONTACT.CONTACT_PERSON",
$"CONTACT.PHONE_NUMBER")
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Reading child segments
22/40
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
The two segments can now be joined by Company_Id
23/40
• Joining segments 1 and 2
val dfJoined =
  dfCompanies.join(dfContacts, Seq("COMPANY_ID"))
Results:
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
Joining in Spark is easy
24/40
• The joined table can also be denormalized for document storage
import org.apache.spark.sql.functions.{collect_list, struct}

val dfCombined =
  dfJoined
    .groupBy($"COMPANY_ID",
      $"COMPANY_NAME",
      $"ADDRESS",
      $"REG_NUM")
    .agg(
      collect_list(
        struct($"CONTACT_PERSON",
          $"PHONE_NUMBER"))
        .as("CONTACTS"))
{
"COMPANY_ID": "8216281722",
"COMPANY_NAME": "ABCD Ltd.",
"ADDRESS": "74 Lawn ave., New York",
"REG_NUM": "33718594",
"CONTACTS": [
{
"CONTACT_PERSON": "Cassey Norgard",
"PHONE_NUMBER": "+(595) 641 62 32"
},
{
"CONTACT_PERSON": "Verdie Deveau",
"PHONE_NUMBER": "+(721) 636 72 35"
},
{
"CONTACT_PERSON": "Otelia Batman",
"PHONE_NUMBER": "+(813) 342 66 28"
}
]
}
Denormalize data
25/40
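The denormalized DataFrame can be persisted as documents with the standard Spark writer; a small usage sketch follows (the output path is illustrative).

// Sketch: write the denormalized records as JSON documents
dfCombined.write.mode("overwrite").json("data/companies_json")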
Restore parent-child relationships
• In our example, the COMPANY_ID field was present in all segments
• In real copybooks this is often not the case
• What can we do?
[Illustration: a root COMPANY segment (ID, Name, Address) with multiple CONTACT-PERSON child segments (Name, Phone #)]
26/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• If COMPANY_ID is not part of all segments, Cobrix can generate it for you
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.option("segment_field", "SEGMENT-ID")
.option("segment_id_level0", "C")
.option("segment_id_prefix", "ID")
.load("examples/multisegment_data")
(Note: this copybook has no COMPANY-ID field)
Id Generation
27/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• Seg0_Id can be used to restore the parent-child relationship between segments (see the join sketch after this slide)
(Note: this copybook has no COMPANY-ID field)
SEGMENT_ID Seg0_Id COMPANY CONTACT
C ID_0_0 [ ABCD Ltd. ] [ invalid ]
P ID_0_0 [ invalid ] [ Cliff Wallingford ]
C ID_0_2 [ DEFG Ltd. ] [ invalid ]
P ID_0_2 [ invalid ] [ Beatrice Gagliano ]
C ID_0_4 [ Robotrd Inc. ] [ invalid ]
P ID_0_4 [ invalid ] [ Doretha Wallingford ]
P ID_0_4 [ invalid ] [ Deshawn Benally ]
Id Generation
28/40
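Below is a minimal sketch of the join mentioned above: the generated Seg0_Id plays the role that COMPANY_ID played earlier. Column and segment names follow the tables on this slide; the rest is an illustration, not code from the deck.

// Sketch: rebuild the parent-child link via the generated Seg0_Id
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"Seg0_Id", $"COMPANY.NAME".as("COMPANY_NAME"))

val dfContacts = df
  .filter($"SEGMENT_ID" === "P")
  .select($"Seg0_Id", $"CONTACT.CONTACT_PERSON", $"CONTACT.PHONE_NUMBER")

val dfLinked = dfCompanies.join(dfContacts, Seq("Seg0_Id"))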
• When transferred from a mainframe, a hierarchical database becomes
  • A sequence of records
  • To read the next record, the previous record must be read first
  • A sequential format by its nature
• How to make it scalable?
[Illustration: a data file is a continuous stream of EBCDIC bytes that parses into a sequence of COMPANY and PERSON records of varying lengths]
Variable Length Records (VLR)
29/40
Performance challenge of VLRs
• Naturally sequential files
• To read the next record, the prior record needs to be read first
• Each record has a length field
  • It acts as a pointer to the next record
• No record delimiter when reading a file from the middle
VLR structure
30/40
[Chart: throughput (MB/s) vs. number of Spark cores, variable record length - sequential processing stays flat at about 10 MB/s regardless of core count]
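To see why reading is inherently sequential, here is a minimal sketch that walks records by a length prefix. It assumes a 4-byte header whose first two bytes hold the payload length in big-endian order (similar in spirit to an RDW); the header layout is an assumption for illustration, not Cobrix's actual record-format handling.

// Sketch: walk variable-length records sequentially using an assumed 4-byte header
// whose first two bytes are the big-endian payload length.
def recordOffsets(bytes: Array[Byte]): Seq[(Long, Int)] = {
  val offsets = scala.collection.mutable.ArrayBuffer.empty[(Long, Int)]
  var pos  = 0
  var done = false
  while (!done && pos + 4 <= bytes.length) {
    val len = ((bytes(pos) & 0xFF) << 8) | (bytes(pos + 1) & 0xFF)
    if (len <= 0) done = true
    else {
      offsets += ((pos.toLong + 4, len)) // payload offset and length
      pos += 4 + len                     // the only way to find the next record
    }
  }
  offsets.toSeq
}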
Spark on HDFS
[Diagram: each HDFS data node stores blocks and hosts a Spark executor that processes the corresponding partitions; the HDFS namenode and the Spark driver coordinate the work]
31/40
[Diagram: Cobrix, Spark and the HDFS data nodes cooperating to parallelize a sequential read]
1. Cobrix reads the record headers sequentially and builds a list of (offset, length) pairs.
2. The offsets and lengths are parallelized across the Spark cluster.
3. Records are parsed in parallel from the parallelized offsets and lengths.
Parallelizing Sequential Reads
32/40
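A simplified sketch of the same idea, not Cobrix internals: the sparse index of (offset, length) pairs is built sequentially, parallelized, and each task then decodes only its own byte ranges. File access is faked with an in-memory byte array to keep the example self-contained; all values are illustrative.

// Sketch of parallelizing record parsing from a sparse index
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vlr-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val fileBytes: Array[Byte] = Array.fill(1024)(0x40.toByte)            // stand-in for the real file
val index: Seq[(Long, Int)] = Seq((4L, 100), (108L, 80), (192L, 120)) // from the header scan

val parsed = sc
  .parallelize(index, numSlices = 4)   // distribute the sparse index
  .map { case (offset, length) =>
    // each task decodes only its own byte range (decoding is simplified here)
    new String(fileBytes.slice(offset.toInt, offset.toInt + length), "Cp037").trim
  }
parsed.collect()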
[Diagram: Cobrix, the HDFS namenode and the Spark cluster negotiating data locality]
1. Cobrix reads the header of VLR 1: offset = 3000, length = 100.
2. It asks the HDFS namenode which nodes hold offsets 3000 to 3100.
3. The namenode reports nodes 3, 18 and 41.
4. The preferred location for loading VLR 1 is set to node 3.
5. Spark launches the task on an executor hosted on node 3.
Enabling Data Locality for VLRs
33/40
Throughput when sparse indexes are used
• Experiments were run on our lab cluster
  • 4 nodes
  • 380 cores
  • 10 Gbit network
• Scalable - for bigger files, using more executors yields higher throughput
34/40
[Chart: throughput (MB/s) vs. number of Spark cores, variable record length - 10 GB, 20 GB and 40 GB files all scale with added cores, while the sequential baseline stays flat]
Comparison with fixed-length record performance
● For fixed-length records, distribution and locality are handled completely by Spark
● For variable-length records, parallelism is achieved using sparse indexes
35/40
[Charts: throughput (MB/s) vs. number of Spark cores for a 40 GB file - fixed-length records reach about 150 MB/s; variable-length records with sparse indexes reach about 145 MB/s; sequential processing stays at about 10 MB/s]
Cobrix in ABSA Data Infrastructure - Batch
[Diagram: batch pipeline - 180+ mainframe sources are ingested into HDFS; 2. Parse with Cobrix on Spark; 3. Conform with Enceladus; 4. Track lineage with Spline; 5. Re-ingest; 6. Consume]
36/40
Cobrix in ABSA Data Infrastructure - Stream
[Diagram: streaming pipeline - 1. Ingest 180+ mainframe sources into Kafka; 2. Parse with the Cobrix parser; 3. Convert to Avro with ABRiS; 4. Conform with Enceladus; 5. Track lineage with Spline; 6. Consume]
37/40
Cobrix in ABSA Data Infrastructure
[Diagram: the batch pipeline (HDFS, Cobrix on Spark, Enceladus, Spline) and the streaming pipeline (Kafka, Cobrix parser, ABRiS, Enceladus, Spline) side by side, both fed by 180+ mainframe sources: ingest, parse, conform, track lineage, re-ingest / convert to Avro, consume]
38/40
● Thanks to the following people, who made the project possible and helped all along the way:
○ Andrew Baker, Francois Cillers, Adam Smyczek,
Jan Scherbaum, Peter Moon, Clifford Lategan,
Rekha Gorantla, Mohit Suryavanshi, Niel Steyn
• Thanks to the authors of the original COBOL parser:
○ Ian De Beer, Rikus de Milander
(https://github.com/zenaptix-lab/copybookStreams)
Acknowledgment
39/40
• Let's combine expertise to make accessing mainframe data in Hadoop seamless
• Our goal is to support the widest range of use cases possible
• Report a bug
• Request a new feature
• Create a pull request
Our home: https://github.com/AbsaOSS/cobrix
Your Contribution is Welcome
40/40