SysML Based Conceptual ETL Process Modeling

Neepa Biswas(1), Samiran Chattopadhyay(1), Gautam Mahapatra(2),
Santanu Chatterjee(2), and Kartick Chandra Mondal(1)(B)

(1) Department of Information Technology, Jadavpur University, Kolkata, India
[email protected]
(2) Research Centre Imarat, DRDO, Ministry of Defence,
Government of India, Kurmalguda, India

Abstract. Data generated from various sources can be erroneous or
incomplete, which can directly affect business analysis. ETL
(Extraction-Transformation-Loading) is a well-known process that
extracts data from different sources, transforms it into the required
format and finally loads it into a target data warehouse (DW). ETL plays
an important role in a data warehouse environment. Configuring an
ETL process is one of the key factors that directly affect the cost, time
and effort of establishing a successful data warehouse. Conceptual
modeling of ETL gives a high-level view of the system activities. It
allows early identification of system errors, cost minimization, and
assessment of scope and risk. Some research has modeled the ETL process
at the conceptual level using UML, BPMN and the Semantic Web. In this
paper, we propose a new approach for conceptual modeling of the ETL
process using a newer standard, the Systems Modeling Language (SysML).
SysML extends UML with clearer semantics from a systems engineering
point of view. We show the usefulness of our approach through a
use case scenario.

Keywords: ETL · Data warehouse · Conceptual model · MBSE · SysML

1 Introduction

A data warehouse [13] is a repository of historical data consolidated in a
multidimensional format. In a warehouse, data obtained by integrating the
different operational sources of an organization is stored in a standard
structure. Business analysts [8,30] can access that data, perform analysis,
apply business intelligence tools, make predictions and take strategic
decisions. In maintaining a data warehouse, the main focus is to manage the
large amount of data generated by different types of systems (SAP, ERP, Oracle,
Mainframe etc.) and to store it in a uniform structure. ETL plays a very
important role in keeping the data uniform. ETL is a widely used process
in business organizations. It identifies and extracts data from various
sources, filters and customizes the data according to the required format, and
finally integrates and updates it into the data warehouse [33].

© Springer Nature Singapore Pte Ltd. 2017
J.K. Mandal et al. (Eds.): CICBA 2017, Part II, CCIS 776, pp. 242–255, 2017.
DOI: 10.1007/978-981-10-6430-2_19

Data modeling [7] gives an abstract view of how data will be arranged and
managed in an organization. Data modeling techniques make the relationships
between different data items visible, which helps an organization manage its
data in a structured way. It is highly recommended to model and design the
complete workflow carefully at the starting phase. Because warehouse
implementation is expensive, good modeling and documentation should be
maintained. According to the report [11], designing a well-established ETL
workflow consumes almost one third of the cost and effort of a DW
implementation. A well-designed ETL process is therefore one of the important
aspects of an effective DW. Each vendor-provided tool has its own methodology
for designing the ETL process [9,18], which requires an understanding of the
functionality, language, standards etc. of that particular tool. Moreover, a
design built in one tool is not suitable for execution on another platform.
In ETL processing, a conceptual model reflects a high-level view of the
entities and the relationships among them. It provides an abstract view of the
workflow rather than the implementation details. Various research works have
addressed conceptual modeling of ETL; UML, BPMN and the Semantic Web are the
techniques most commonly used so far. We propose a new way of modeling an ETL
process using the Systems Modeling Language (SysML). Although many
contributions to abstract ETL modeling exist, we think that SysML is a new
direction for conceptualizing and validating an ETL workflow. There is
considerable research scope in using SysML to implement ETL models in practice,
including validation, simulation and executable code production, for the
benefit of both technical and non-technical users.
This paper proposes a new technique for designing a conceptual model of ETL
using the SysML standard, which supports the Model-Based Systems Engineering
(MBSE) approach. SysML is a general-purpose system modeling language that
supports the identification, analysis, design, testing and validation of
systems [14]. It supports system modeling for broad categories of
organizations such as aerospace, automotive and health care. SysML is a newer
modeling language standardized by the Object Management Group (OMG) [2] and the
International Council on Systems Engineering (INCOSE) [15,16]. It can be used
to model a high-level view of the ETL process and to validate the system
through simulation.
The rest of the paper is structured as follows. Section 2 briefly discusses
existing work on ETL conceptual modeling. Section 3 gives an overview of
Model-Based Systems Engineering and of the SysML notation for requirement and
activity diagrams, with their characteristics. Section 4 explains our proposed
approach to ETL modeling with a suitable example using SysML diagrams. Finally,
Sect. 5 concludes and discusses probable future directions of this work.

2 Related Work
Various approaches have been proposed for the conceptual design of ETL
processes in the last few decades. These works can be classified by the
modeling technique they use: UML, BPMN or the Semantic Web. This section
briefly discusses these techniques.
The very first attempt at conceptual ETL modeling was made by Vassiliadis
et al. in [34]. A customizable generic meta-model is proposed, with a set of
notations to represent ETL activities. The model establishes the relationships
among attributes of the sources and the data warehouse. It supports
customizable transformation templates such as primary or foreign key checking
and null value checking. Finally, a set of candidate relationships is
established for updating the warehouse from multiple source databases. The
authors further enriched their work in [26] by proposing a methodology that
gives a step-by-step procedure from source selection to warehouse population,
along with attribute relationship mapping and the handling of runtime obstacles.
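The transformation templates mentioned above (null value checking, primary key checking, foreign key checking) can be illustrated with a minimal Python sketch. The function names, the row layout and the sample data are assumptions made for illustration only; they are not taken from the meta-model in [34].

```python
# Hypothetical row-level templates; names and data are illustrative only.
Row = dict[str, object]


def not_null(rows: list[Row], column: str) -> list[Row]:
    """Keep rows whose `column` is present and not None."""
    return [r for r in rows if r.get(column) is not None]


def primary_key_unique(rows: list[Row], key: str) -> list[Row]:
    """Keep the first row seen for each key value and drop later duplicates."""
    seen: set[object] = set()
    kept: list[Row] = []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            kept.append(r)
    return kept


def foreign_key_exists(rows: list[Row], column: str, valid_keys: set) -> list[Row]:
    """Keep rows whose `column` refers to a key that exists in the target."""
    return [r for r in rows if r[column] in valid_keys]


orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 1, "customer_id": 10},    # duplicate primary key
    {"order_id": 2, "customer_id": None},  # null foreign key
    {"order_id": 3, "customer_id": 99},    # customer 99 does not exist
    {"order_id": 4, "customer_id": 11},
]
known_customers = {10, 11}

clean = foreign_key_exists(
    primary_key_unique(not_null(orders, "customer_id"), "order_id"),
    "customer_id",
    known_customers,
)
print([r["order_id"] for r in clean])
```

Each template is a plain function over a list of rows, so templates can be chained in any order that suits the source data.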
Trujillo and Luján-Mora [32] designed the ETL workflow using a UML-based
modeling approach. This was the first approach to conceptual model design using
standard UML notations. The authors use UML class diagrams to establish the
relationships between databases and their attributes. Their model supports
various transformation processes such as aggregation, conversion, filter and
join, with zoom-in and zoom-out facilities for different levels of design.
Another research effort using UML 2.0 was proposed in [19], where the authors
focus on the extraction phase only. They identify six classes and present a
class diagram, a use case diagram and a sequence diagram for the extraction
phase using standard UML notation. The transformation and loading phases are
not included in their work.
Muñoz et al. [20] modeled a complete ETL process using UML activity diagrams.
The activities involved in the ETL process are expressed as diagrams with
control flow sequences that support various transformation activities. They
further enriched their work in [21] by proposing automatic code generation from
conceptual models based on Model Driven Architecture (MDA). A conceptual model
based on their previous work [20] is designed as a PIM (Platform Independent
Model) using UML features. A PIM gives a functional view of the system without
regard to the platform. Different PSMs (Platform Specific Models), each showing
a logical view, can be produced from the PIM. Code for creating the data
structures is generated automatically from each individual PSM. The
transformation from PIM to PSM is done with the QVT (Query/View/Transformation)
language.
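The idea of deriving platform-specific output from one platform-independent model can be sketched as follows. This is a hedged illustration in plain Python string templating, not QVT and not the tooling of [21]; the class names, the logical type names and the two type mappings are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    logical_type: str  # platform-independent type: "integer", "string", "decimal"


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple[Column, ...]


# Hypothetical mappings from logical types to the types of each target platform.
TYPE_MAPS = {
    "postgresql": {"integer": "INTEGER", "string": "VARCHAR(255)", "decimal": "NUMERIC(12,2)"},
    "oracle": {"integer": "NUMBER(10)", "string": "VARCHAR2(255)", "decimal": "NUMBER(12,2)"},
}


def to_ddl(table: Table, platform: str) -> str:
    """Render the platform-independent table as DDL for one platform."""
    type_map = TYPE_MAPS[platform]
    cols = ",\n".join(f"  {c.name} {type_map[c.logical_type]}" for c in table.columns)
    return f"CREATE TABLE {table.name} (\n{cols}\n);"


# One platform-independent description of a (shortened) product dimension.
pim = Table(
    "Dimension_Product",
    (
        Column("Product_key", "integer"),
        Column("Prod_name", "string"),
        Column("Unit_price", "decimal"),
    ),
)

print(to_ddl(pim, "postgresql"))
print(to_ddl(pim, "oracle"))
```

The same `pim` object yields two different DDL texts, which mirrors the one-PIM, many-PSM relationship described above.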
BPMN (Business Process Model and Notation) consists of standard graphical
notations that help in understanding the business processes within an
organization. The first attempt at using BPMN notations in ETL conceptual
modeling was proposed by Akkaoui and Zimányi in [3]. They describe the
formation of the conceptual model and convert BPMN to BPEL (Business Process
Execution Language) in order to execute the designed model and to implement
relations with web services. In a sequel to this work [6], a vendor-independent
BPMN meta-model based on Model-Driven Development (MDD) is created, and
automatic code generation for any vendor-specific platform is proposed.
They advanced their work further in [5] by proposing model-to-text and
model-to-model transformations for code generation. Model updating, the
required maintenance factors and the design of a BPMN meta-model for the
conceptual view of ETL were addressed in [4]. Oliveira and Belo [22,23]
designed a set of generalized ETL meta-models for some specific tasks using
BPMN notations and validated their models with a case study. They then advanced
their work towards automatic generation of physical models from conceptual ones
in [10,24].
Skoutas and Simitsis modeled a high-level view of the ETL process using
ontologies [27,28]. The ontology helps to identify the schemas of the data
sources and of the data warehouse. Transformations and data selection from the
sources for populating the warehouse are then derived automatically. Extending
this work in [29], they propose a framework that uses the ontology together
with semantic specifications of the source and the target. A set of graph
transformation rules is formulated to guide the flow of ETL operations. In
another work [25], Simitsis et al. converted the conceptual model into a
natural language explanation for readers without a technical background.
Hoang Thi and Nguyen proposed a new semantic approach to ETL workflows in
[17,31] using the Common Warehouse Metamodel (CWM) design standard. CWM
supports modeling of structured, unstructured and multidimensional metadata of
objects in a data warehouse.
The drawback of UML is its software-centric point of view and its lack of
clear semantics. Moreover, relationships between software and hardware cannot
be represented in UML. SysML offers more facilities than UML by adopting some
of its core features and extending them in many new directions. BPMN, in turn,
is suitable for business users who model the complex business processes of an
organization graphically. Business users create an initial model of the overall
process, and technical developers then implement that model. Implementing a
SysML model is more straightforward for technical developers, as SysML is
developed from a systems engineering viewpoint. SysML is derived from UML, but
it is more flexible and expressive: it is capable of better requirement
analysis and can define performance and quantitative parameters of a broad
range of systems from the perspective of a systems engineer rather than from a
software-centric view. SysML can efficiently capture the continuous nature of a
system, along with the requirements and parametric relations of a system model.

3 Model Based Systems Engineering (MBSE)

MBSE is an approach to systems engineering, supported by the OMG and defined by
INCOSE [12,15], that features requirement-driven and functional analysis,
design, integration, validation and simulation of a system design throughout
the system development life-cycle. MBSE promotes model-based approaches instead
of the prevalent document-oriented design methods. A model-oriented approach
helps to capture architectural descriptions of the system and promotes a better
understanding of how the system is constructed and how it performs. Using a
model, the complete system can be visualized and validated at an early stage of
design. After that, the model can be mapped to a physical implementation. UML
and SysML are standard visual modeling languages that can be used to describe
the system model.
MBSE is gaining popularity in industry for creating complex systems that merge
multi-disciplinary environments. SysML is one of the key components of MBSE,
with properties for capturing requirements, architecture, constraints and
hierarchical or multi-layered views of a system model. It allows models that
come from different engineering disciplines to be linked. MBSE [1] improves
system modeling through better communication, better management of system
complexity, standard data management, better product quality, improved
information capture and risk minimization.

3.1 System Modeling Language (SysML)

SysML is a general-purpose graphical modeling language that can be described
as an extended version of UML. The evolution of visual modeling languages from
SA/SD (Structured Analysis/Structured Design) to SysML is shown in Fig. 1.
SysML 1.0 was adopted by the OMG in 2006.

[Figure: timeline of visual modeling languages from 1975 to 2015, with
evolution on the horizontal axis and language expressiveness on the vertical
axis. It shows SA/SD, OMT, Booch, OOSE, UML 1.x, UML 2.x, BPMN 1.x, BPMN 2.x
and SysML, linked by «derive», «refine» and «influence» relationships.]

Fig. 1. Evolution of visual modeling languages

For modeling a system, SysML supports the requirements and the functional and
behavioral structure of the system, together with their inter-relationships.
As it originated from UML, it reuses many UML notations with some additional
extensions [2]. SysML supports various types of diagrams to represent the
structural and behavioral nature of a system, as shown in Fig. 2. The activity
diagram, block definition diagram and internal block diagram, indicated by
bubble boxes, are modified versions of basic UML diagrams. The parametric and
requirement diagrams, indicated by dashed boxes, are introduced in SysML. The
other basic UML diagrams can also be drawn using SysML.

[Figure: taxonomy of SysML diagrams]
SysML Diagrams
  Behavior Diagram
    Use Case Diagram
    State Machine Diagram
    Sequence Diagram
    Activity Diagram
  Requirement Diagram
  Structure Diagram
    Block Definition Diagram
    Internal Block Diagram
    Package Diagram
    Parametric Diagram

Fig. 2. SysML supported diagrams

3.2 SysML Notation


In this work, we use the SysML requirement diagram and activity diagram to
express ETL processes. A requirement diagram represents text-based requirements
using graphical constructs, whereas an activity diagram describes system
behavior by showing the flow of control and data between activities. In SysML,
each modeling element can be characterized by its stereotype. A set of standard
stereotypes is available for SysML diagrams. The stereotype notation provides a
way to define system elements according to user requirements. A stereotype is
written by enclosing its name within double chevrons, for example «discrete»,
«continuous» and «allocated».

Requirement Diagram Notation. The requirement diagram is a completely new
concept compared to UML diagrams. It supports text-based requirements, their
relationships and the test cases that verify them. A basic SysML requirement
block is displayed in Fig. 3. A SysML requirement is drawn as a rectangular
block containing its stereotype «requirement», its unique identifier (for
example Id = RQ1.1) and a Text = "#" field that holds the textual requirement
details. The designer can also select extended requirement properties such as
verification method, source, priority and risk. Requirements can be customized
into additional sub-categories such as business, functional, interface,
usability, performance and physical. Derive, Satisfy, Nesting, Trace, Verify
and Refine are the relationship types that can be used in a requirement
diagram.
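The requirement block and its relationship types can be mirrored in a small data structure. The sketch below is only an illustration: the class name, the lowercase relationship labels and the two sample requirement texts are assumptions and do not come from the requirement diagram of this paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Relationship types named in the text, written here as lowercase labels.
RELATION_KINDS = {"derive", "satisfy", "nesting", "trace", "verify", "refine"}


@dataclass
class Requirement:
    req_id: str
    text: str
    relations: list[tuple[str, str]] = field(default_factory=list)  # (kind, target id)

    def relate(self, kind: str, target: Requirement) -> None:
        """Record a typed relationship from this requirement to `target`."""
        if kind not in RELATION_KINDS:
            raise ValueError(f"unknown relationship type: {kind}")
        self.relations.append((kind, target.req_id))


def render(req: Requirement) -> str:
    """Render the block the way it is described in the text: stereotype, Id, Text."""
    return "\n".join(["«requirement»", f"Id = {req.req_id}", f'Text = "{req.text}"'])


# Hypothetical requirements for illustration.
rq1 = Requirement("RQ1", "Load daily sales into the data warehouse")
rq1_1 = Requirement("RQ1.1", "Reject source rows with missing key attributes")
rq1_1.relate("derive", rq1)

print(render(rq1_1))
print(rq1_1.relations)
```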

Fig. 3. SysML requirement block

Fig. 4. Activity diagram notations of UML

Activity Diagram Notation. In the SysML activity diagram, some notations are
the same as in the UML activity diagram, as shown in Fig. 4, and some new
notations are incorporated. For example, an activity edge can be characterized
by a stereotype such as «discrete» or «continuous». The actual rate of an
object flow can be stated using a constraint notation such as
{rate = expression}. Assigning a probability to an activity edge (mostly a
control flow), written as {probability = value%}, is another new feature of the
SysML diagram. It expresses the probability that the particular edge is
traversed.
The behavior of an object node can be expressed more precisely using the
stereotype «nobuffer» or «overwrite». In the first case, a token arriving at
the object node is discarded if the next action is not ready to receive it. In
the second case, the token held by the object node is overwritten by the newly
arriving one if the next action is not ready to receive it. An interruptible
region, drawn as a dashed box, identifies a group of elements in an activity
diagram separately.
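The difference between «nobuffer» and «overwrite» can be made concrete with a toy simulation. This is a simplified sketch under the assumption of a single-slot object node and a receiving action that is either ready or busy; the class and function names are hypothetical and this is not a full SysML execution semantics.

```python
class ObjectNode:
    """Single-slot object node between a producer and a possibly busy action."""

    def __init__(self, mode: str) -> None:
        if mode not in ("nobuffer", "overwrite"):
            raise ValueError(f"unknown stereotype: {mode}")
        self.mode = mode
        self.held = None  # token waiting for the next action, if any

    def offer(self, token, receiver_ready: bool):
        """Offer a token; return it if the next action accepts it right away."""
        if receiver_ready:
            self.held = None
            return token
        if self.mode == "overwrite":
            self.held = token  # the newest token replaces the waiting one
        # under "nobuffer" the refused token is simply dropped
        return None

    def take(self):
        """Called when the next action becomes ready; returns None if nothing waits."""
        token, self.held = self.held, None
        return token


def simulate(mode: str, tokens):
    """Send all tokens while the receiver is busy, then let it take one."""
    node = ObjectNode(mode)
    for t in tokens:
        node.offer(t, receiver_ready=False)
    return node.take()


print(simulate("nobuffer", [1, 2, 3]))   # nothing was kept
print(simulate("overwrite", [1, 2, 3]))  # only the latest token was kept
```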

4 Conceptual Modeling of ETL Processes


The main purpose of conceptual ETL modeling is to establish the relationship
between the source data schema and the target warehouse data schema. It
provides a high-level view of the system that does not include any logical or
physical implementation details.
We propose a design for a high-level model of the ETL process. First, we design
a SysML requirement diagram for the ETL scenario. After that, we model the
conceptual ETL process using a SysML activity diagram. Each element of the
SysML model is specified by its simulation-specific characteristics.

4.1 Example Scenario

To represent the ETL scenario, we take the example of an e-commerce system in
which a database is maintained for daily transactions. Operational data is
stored in relational format and needs to be converted and deposited according
to the data warehouse format. For the e-commerce system, the total sales for
each day are calculated and stored in the data warehouse. In addition, all
information related to customers, suppliers, the website and products is
stored in the warehouse.
The structure of the target data warehouse schema is shown in Fig. 5. The fact
table contains the key attributes of the dimension tables, basic facts and
derived facts. Here, the Fact_Sales table has six dimensions: Customer,
Supplier, Product, Date_Time, Website and Promotion. A dimension table can have
an aggregation-level hierarchy. Dimension_Website → Dimension_Navigation is an
example of such a hierarchy. Dimension_Address → Dimension_State →
Dimension_City is a three-level hierarchy shared by both Dimension_Customer and
Dimension_Supplier.

4.2 Requirement Diagram

Before starting the conceptual modeling, we need to identify the requirements
of the ETL process. A SysML requirement diagram helps to visualize the
requirements and their interrelations. Figure 6 shows the requirements of the
ETL process for the e-commerce system, drawn using MagicDraw.
The operational databases provide the data to be loaded into the data
warehouse. Two other data sources are shown, from which the address and product
data are derived. The warehouse data is derived from these source databases.
The restrictions that apply before loading into the warehouse are described in
the constraint block.

4.3 Activity Diagram

The proposed ETL process is shown as a SysML activity diagram in Fig. 7,
created using MagicDraw. It shows the flow of data and control between the
activities that load an e-commerce sales data warehouse. From the start node to
the end node, the stereotype of each object flow and control flow is indicated
to describe the nature of the flow. Opaque actions and call behavior actions
describe unit activities and sub-activities as required. The value type of each
input and output pin of an action node is specified. Parallel edges are joined
by a join node, and a single path is split into parallel outgoing edges by a
fork node.

Fact_Sales: Customer_key, Supplier_key, Website_key, Product_key,
  Date_time_key, Promo_key, Order_id, Item_price, Item_quantity, Item_discount,
  Item_vat, Item_shipping_charge, Item_total_no, Item_total_price,
  Average_item_price, Average_discount, Cust_total_no
Dimension_Customer: Customer_key, Cust_greetings, Cust_first_name,
  Cust_middle_name, Cust_last_name, Cust_gender, Cust_marital_status,
  Cust_date_of_birth, Cust_education, Address_key, Cont_ph_code, Contact_ph_no,
  Cust_email, Cust_website, Cust_payment_type, Cust_credit_status
Dimension_Supplier: Supplier_key, Supplier_name, Address_key, Shipping_mode,
  Shipper_name, Contact_ph_code, Contact_ph_no, Contact_email
Dimension_Address: Address_key, Country_name, State_key, Country_capital
Dimension_State: State_key, City_key, State_name, State_capital, State_Region,
  District_name
Dimension_City: City_key, City_name, City_road_name, City_area, landmark,
  Pin_code
Dimension_Date_Time: Date_time_key, Calerdar_date, Day_no_week, Day_name_week,
  Day_no, Week_no, Month_name, Quarter_No, Year, Fiscal_year, Holiday_flag,
  Weekend_flag, Time_24_hr_clock, Time_12_hr_clock, AM_PM_info, Fnoon_flag,
  Noon_flag, Anoon_flag, Evening_flag
Dimension_Website: Website_key, Wpage_name, Wpage_URL, Wpage_type,
  Wpage_designer, Wpage_metainfo, Navigation_type, Navigation_key, flag_logo,
  place_logo, Total_item_on_page, name_banner, type_banner, flag_image,
  image_place
Dimension_Navigation: Navigation_key, home_page_URL, 1st_page_URL,
  2nd_page_URL, 3rd_page_URL, 4th_page_URL, 5th_page_URL, 6th_page_URL,
  7th_page_URL, 8th_page_URL, 9th_page_URL, 10th_page_URL, Exit_page_URL,
  search_page_flag, help_page_flag, signout_page_flag
Dimension_Product: Product_key, Category_key, SKU_no, Prod_name, Prod_info,
  Unit_price, Category_info
Dimension_Category: Category_key, Subcat, Manufacturer_info, Brand_name,
  Color_info, Size_info, Weight_info, parcel_type, parcel_size,
  retail_case_units, shipping_case_units, pallet_case_no
Dimension_promotion: Promo_key, Promo_name, Promo_type, Price_discount,
  Advertisement_type, Adv_media, Coupon_info, Promo_cost, Start_date,
  Close_date

Fig. 5. Schema of the data warehouse for an E-commerce system

Fig. 6. SysML requirement diagram of the E-commerce system for ETL process

First, the source databases are accessed. After the key attributes are
verified, the loader updates the dimension data into the respective dimension
tables. During dimension loading, the aggregation-level hierarchy is
maintained. For example, Dimension_Navigation is loaded before
Dimension_Website. Product data is loaded first into Dimension_Category and
then into Dimension_Product. The addresses of customers and suppliers come from
the Area.XML file, and the product catalog is fetched from the Product.CSV
file. The sub-activity for loading the area is given in Fig. 8. Once
Dimension_Address is loaded, it is shared by Dimension_Supplier and
Dimension_Customer, which are loaded afterwards.
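The loading order described above is a dependency ordering, so it can be computed with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The dependencies on Navigation, Category and Address are stated in the text; the position of Dimension_City and Dimension_State before Dimension_Address is an assumption read off the foreign keys in Fig. 5, and loading Fact_Sales after its six dimensions follows from the schema description.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables that must be loaded before it.
depends_on = {
    "Dimension_Website": {"Dimension_Navigation"},
    "Dimension_Product": {"Dimension_Category"},
    # Assumed from the foreign keys in Fig. 5 (State_key, City_key):
    "Dimension_State": {"Dimension_City"},
    "Dimension_Address": {"Dimension_State"},
    "Dimension_Customer": {"Dimension_Address"},
    "Dimension_Supplier": {"Dimension_Address"},
    "Fact_Sales": {
        "Dimension_Customer",
        "Dimension_Supplier",
        "Dimension_Product",
        "Dimension_Date_Time",
        "Dimension_Website",
        "Dimension_promotion",
    },
}

# static_order() yields every table after all of its prerequisites.
load_order = list(TopologicalSorter(depends_on).static_order())

for position, table in enumerate(load_order, start=1):
    print(position, table)
```

Several valid orders exist, so the exact sequence of independent dimensions may vary; only the relative order of dependent tables is guaranteed.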
After the six dimensions are loaded, the basic facts (price, quantity,
discount), the derived facts (VAT, shipping charge, total price, total number
of items) and the non-additive facts (average item price, average discount) are
stored in the fact table. The overall ETL process is executed every 12 h, as
indicated in Fig. 7. This example shows the extraction and loading processes.
Other common ETL transformation tasks such as aggregation, filtering,
correction, conversion, joining, splitting, merging and log generation can also
be represented in the conceptual model.
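The three kinds of facts can be illustrated with a short calculation. The paper does not give the formulas, so everything below is an assumption made for illustration: the VAT rate, the flat shipping charge per item, the way the total price is composed and the definitions of the two averages.

```python
from dataclasses import dataclass

VAT_RATE = 0.10          # hypothetical VAT rate
SHIPPING_PER_ITEM = 2.0  # hypothetical flat shipping charge per item


@dataclass(frozen=True)
class OrderLine:
    price: float     # basic fact: unit price
    quantity: int    # basic fact: number of items
    discount: float  # basic fact: absolute discount on the line


def derived_facts(line: OrderLine) -> dict[str, float]:
    """Facts computed from the basic facts of a single order line."""
    net = line.price * line.quantity - line.discount
    vat = net * VAT_RATE
    shipping = SHIPPING_PER_ITEM * line.quantity
    return {
        "vat": vat,
        "shipping_charge": shipping,
        "total_price": net + vat + shipping,
        "item_total_no": line.quantity,
    }


def non_additive_facts(lines: list[OrderLine]) -> dict[str, float]:
    """Averages are recomputed from sums; they cannot be added across rows."""
    total_price = sum(derived_facts(l)["total_price"] for l in lines)
    total_items = sum(l.quantity for l in lines)
    total_discount = sum(l.discount for l in lines)
    return {
        "average_item_price": total_price / total_items,
        "average_discount": total_discount / len(lines),
    }


lines = [OrderLine(10.0, 2, 0.0), OrderLine(20.0, 1, 5.0)]
print(derived_facts(lines[0]))
print(non_additive_facts(lines))
```

Keeping the averages in a separate function reflects why they are called non-additive: summing two stored averages gives a meaningless number, while summing totals and dividing again stays correct.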
After the system model is designed, the SysML model is transformed into its
corresponding code. XMI (XML Metadata Interchange) is the standard
platform-independent format for representing a SysML model, and the conceptual
model can be exported to it. Part of the XMI code is given in Listing 1.1.
Fig. 7. Example of ETL conceptual model using SysML activity diagram

Fig. 8. Sub-activity for loading address using SysML


Listing 1.1. XMI code of SysML diagram

<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:DSL_Customization="http://www.magicdraw.com/schemas/DSL_Customization.xmi"
         xmlns:MagicDraw_Profile="http://www.omg.org/spec/UML/20131001/MagicDrawProfile"
         xmlns:sysml="http://www.omg.org/spec/SysML/20150709/SysML"
         xmlns:Validation_Profile="http://www.magicdraw.com/schemas/Validation_Profile.xmi"
         xmlns:StandardProfile="http://www.omg.org/spec/UML/20131001/StandardProfile"
         xmlns:MD_Customization_for_Requirements__additional_stereotypes="http://www.magicdraw.com/spec/Customization/180/Requirements"
         xmlns:MD_Customization_for_SysML__additional_stereotypes="http://www.magicdraw.com/spec/Customization/180/SysML"
         xmlns:xmi="http://www.omg.org/spec/XMI/20131001"
         xmlns:uml="http://www.omg.org/spec/UML/20131001">
  <xmi:Documentation>
    <xmi:exporter>MagicDraw UML</xmi:exporter>
    <xmi:exporterVersion>18.4 v2</xmi:exporterVersion>
  </xmi:Documentation>
  <xmi:Extension extender="MagicDraw UML 18.4">
    <plugin pluginVersion="18.4" pluginName="SysML"/>
    <plugin pluginVersion="18.4" pluginName="Cameo Requirements Modeler"/>
    <req_resource resourceValueName="SysML Activity Diagram" resourceName="SysML" resourceID="1440"/>
    <req_resource resourceValueName="Cameo Requirements Modeler" resourceName="Cameo Requirements Modeler" resourceID="1480"/>
    <req_resource resourceValueName="SysML" resourceName="SysML" resourceID="1440"/>
  </xmi:Extension>
  <uml:Model name="Model" xmi:id="eee_1045467100313_135436_1" xmi:type="uml:Model">
    <ownedComment xmi:id="_18_4_56101ea_1478679986869_405111_3723" xmi:type="uml:Comment"
                  body="Author:Admin. Created:11/9/16 1:56 PM. Title:. Comment:.">
      <annotatedElement xmi:idref="eee_1045467100313_135436_1"/>
    </ownedComment>
    <packagedElement name="ETL_Sales" xmi:id="_18_4_56101ea_1478679991104_204369_13728" xmi:type="uml:Activity">
      <xmi:Extension extender="MagicDraw UML 18.4">
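An exported XMI file can be inspected with any XML parser. The following minimal sketch uses Python's standard `xml.etree.ElementTree` to list the activities in a model. The embedded sample is a shortened, well-formed stand-in written for this illustration (its `xmi:id` values and the `Load_Dimension_Address` node are hypothetical), not the full MagicDraw export.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/spec/XMI/20131001"
UML_NS = "http://www.omg.org/spec/UML/20131001"

# Shortened, hypothetical stand-in for an exported model.
sample = f"""
<xmi:XMI xmlns:xmi="{XMI_NS}" xmlns:uml="{UML_NS}">
  <uml:Model name="Model" xmi:id="m1" xmi:type="uml:Model">
    <packagedElement name="ETL_Sales" xmi:id="a1" xmi:type="uml:Activity">
      <node name="Load_Dimension_Address" xmi:id="n1" xmi:type="uml:CallBehaviorAction"/>
    </packagedElement>
  </uml:Model>
</xmi:XMI>
""".strip()


def list_elements(xmi_text: str, uml_type: str) -> list[str]:
    """Return the names of packaged elements whose xmi:type equals `uml_type`."""
    root = ET.fromstring(xmi_text)
    type_attr = f"{{{XMI_NS}}}type"  # ElementTree spells namespaced attributes as {uri}name
    return [
        el.get("name")
        for el in root.iter("packagedElement")
        if el.get(type_attr) == uml_type
    ]


print(list_elements(sample, "uml:Activity"))
```

Because XMI is plain XML, the same approach works for other element types by changing the tag or the `xmi:type` value that is matched.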

5 Conclusion

The ETL process is responsible for selecting and extracting data from several
sources, cleaning and transforming it into the desired format, and finally
loading it into a DW. ETL process modeling is a way to design the organization
of the data and to establish their relationships throughout the ETL processing
activity. In this paper, the main focus is to model an ETL process at the
conceptual level. A significant number of works have modeled the ETL process
with UML, BPMN or Semantic Web based methods. We propose an MBSE-oriented
system model of the ETL process for a data warehouse environment. To do so, we
use SysML, a newer modeling language that is gaining popularity. It is derived
from UML and gives additional facilities to systems engineers. With SysML, the
system model can be designed in a more expressive and flexible way. An
e-commerce example of ETL process modeling is discussed in this work; in
particular, the propagation of data from the sources to the DW is explained as
a use case of the model. Our model is platform independent by nature and simple
to understand for both technical and non-technical users. After the ETL model
is designed in SysML, its corresponding XMI code is generated. In future, we
intend to simulate the proposed model to analyze system behavior and
requirements more precisely, and to extend the model to the logical and
physical levels.

References
1. MBSE wiki. http://www.omgwiki.org/MBSE/doku.php
2. OMG systems modeling language. http://www.omgsysml.org/
3. Akkaoui, E.E., Zimányi, E.: Defining ETL worfklows using BPMN and BPEL. In:
Proceedings of the ACM Twelfth International Workshop on Data Warehousing
and OLAP, pp. 41–48. ACM (2009)
4. El Akkaoui, Z., Mazón, J.-N., Vaisman, A., Zimányi, E.: BPMN-based
conceptual modeling of ETL processes. In: Cuzzocrea, A., Dayal, U. (eds.)
DaWaK 2012. LNCS, vol. 7448, pp. 1–14. Springer, Heidelberg (2012).
doi:10.1007/978-3-642-32584-7_1
5. Akkaoui, Z.E., Zimányi, E., López, J.N.M., Mondéjar, J.C.T., et al.: A BPMN-
based design and maintenance framework for ETL processes (2013)
6. Akkaoui, Z.E., Zimányi, E., Mazón, J.N., Trujillo, J.: A model-driven framework
for ETL process development. In: Proceedings of the 14th International Workshop
on Data Warehousing and OLAP, pp. 45–52. ACM (2011)
7. Çağıltay, N.E., Topallı, D., Aykaç, Y.E., Tokdemir, G.: Abstract conceptual database
model approach. In: Conference on Science and Information, pp. 275–281 (2013)
8. Ayhan, S., Pesce, J., Comitz, P., Sweet, D., Bliesner, S., Gerberick, G.: Predictive
analytics with aviation big data. In: Conference on Integrated Communications,
Navigation and Surveillance (ICNS 2013), pp. 1–13 (2013)
9. Barateiro, J., Galhardas, H.: A survey of data quality tools. Datenbank-Spektrum
14(15–21), 48 (2005)
10. Belo, O., Gomes, C., Oliveira, B., Marques, R., Santos, V.: Automatic
generation of ETL physical systems from BPMN conceptual models. In:
Bellatreche, L., Manolopoulos, Y. (eds.) MEDI 2015. LNCS, vol. 9344,
pp. 239–247. Springer, Cham (2015). doi:10.1007/978-3-319-23781-7_19
11. Eckerson, W., White, C.: Evaluating ETL and data integration platforms. Report
of The Data Warehousing Institute 184 (2003)
12. Estefan, J.A.: Survey of model-based systems engineering (MBSE) methodologies.
Incose MBSE Focus Group 25(8) (2007)
13. Franconi, E., Kamblet, A.: A data warehouse conceptual data model. In: Pro-
ceedings of 16th International Conference on Scientific and Statistical Database
Management, pp. 435–436 (2004)
14. Friedenthal, S., Moore, A., Steiner, R.: A Practical Guide to SysML: The Systems
Modeling Language. Morgan Kaufmann, San Francisco (2014)
15. Hart, L.E.: Introduction to model-based system engineering (MBSE) and SysML,
30 July 2015.
http://www.incose.org/docs/default-source/delaware-valley/mbse-overview-incose-30-july-2015.pdf
16. Hause, M.: The sysml modelling language. In: 15th European Systems Engineering
Conference, vol. 9 (2006)
17. Hoang, A.D.T., Nguyen, B.T.: An integrated use of CWM and ontological mod-
eling approaches towards ETL processes. In: IEEE International Conference on
e-Business Engineering (ICEBE 2008), pp. 715–720, October 2008
18. Kherdekar, V.A., Metkewar, P.S.: A technical comprehensive survey of ETL tools.
Int. J. Appl. Eng. Res. 11(4), 2557–2559 (2016)
19. Mrunalini, M., Kumar, T.S., Kanth, K.R.: Simulating secure data extraction in
extraction transformation loading (ETL) processes. In: Third UKSim European
Symposium on Computer Modeling and Simulation (EMS 2009), pp. 142–147.
IEEE (2009)
20. Muñoz, L., Mazón, J.-N., Pardillo, J., Trujillo, J.: Modelling ETL processes of data
warehouses with UML activity diagrams. In: Meersman, R., Tari, Z., Herrero, P.
(eds.) OTM 2008. LNCS, vol. 5333, pp. 44–53. Springer, Heidelberg (2008).
doi:10.1007/978-3-540-88875-8_21
21. Muñoz, L., Mazón, J.N., Trujillo, J.: Automatic generation of ETL processes from
conceptual models. In: Proceedings of the ACM Twelfth International Workshop
on Data Warehousing and OLAP, pp. 33–40. ACM (2009)
22. Oliveira, B., Belo, O.: BPMN patterns for ETL conceptual modelling and
validation. In: Chen, L., Felfernig, A., Liu, J., Raś, Z.W. (eds.) ISMIS 2012.
LNCS, vol. 7661, pp. 445–454. Springer, Heidelberg (2012).
doi:10.1007/978-3-642-34624-8_50
23. Oliveira, B., Belo, O.: ETL standard processes modelling - a novel BPMN approach.
In: Proceedings of the 15th International Conference on Enterprise Information
Systems, pp. 120–127 (2013)
24. Oliveira, B., Santos, V., Belo, O.: Pattern-based ETL conceptual modelling. In:
Cuzzocrea, A., Maabout, S. (eds.) MEDI 2013. LNCS, vol. 8216, pp. 237–248.
Springer, Heidelberg (2013). doi:10.1007/978-3-642-41366-7_20
25. Simitsis, A., Skoutas, D., Castellanos, M.: Representation of conceptual etl designs
in natural language using semantic web technology. Data Knowl. Eng. 69(1), 96–
115 (2010)
26. Simitsis, A., Vassiliadis, P.: A methodology for the conceptual modelling of ETL
processes. In: Proceedings of DSE (2003)
27. Skoutas, D., Simitsis, A.: Designing ETL processes using semantic web technolo-
gies. In: Proceedings ACM 9th International Workshop on Data Warehousing and
OLAP (DOLAP 2006), Arlington, Virginia, USA, pp. 67–74 (2006)
28. Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for
both structured and semi-structured data. Int. J. Semant. Web Inf. Syst. (IJSWIS)
3(4), 1–24 (2007)
29. Skoutas, D., Simitsis, A., Sellis, T.: Ontology-driven conceptual design of ETL
processes using graph transformations. In: Spaccapietra, S., Zimányi, E., Song,
I.-Y. (eds.) Journal on Data Semantics XIII. LNCS, vol. 5530, pp. 120–146.
Springer, Heidelberg (2009). doi:10.1007/978-3-642-03098-7_5
30. Snezana, S., Violeta, M.: Business intelligence tools for statistical data analysis.
In: Proceedings of the 32nd International Conference on Information Technology
Interfaces (ITI 2010), pp. 199–204 (2010)
31. Thi, A.D.H., Nguyen, B.T.: A semantic approach towards CWM-based ETL
processes. In: Proceedings of I-SEMANTICS 2008, pp. 58–66 (2008)
32. Trujillo, J., Luján-Mora, S.: A UML based approach for modeling ETL processes
in data warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P.
(eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003).
doi:10.1007/978-3-540-39648-2_25
33. Vassiliadis, P.: A survey of extract - transform - load technology. Int. J. Data
Warehouse. Min. 5(3), 1–27 (2009)
34. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL
processes. In: Proceedings DOLAP, pp. 14–21 (2002)
