SysML Based Conceptual ETL Process Modeling
1 Introduction
and customize those data according to the required format, and finally
integrate and load them into the data warehouse [33].
Data modeling [7] gives an abstract view of how data are arranged and managed
in an organization. By applying data modeling techniques, the relationships
between different data items can be visualized. Modeling thus helps an
organization manage its data in a structured way. At the starting phase, it is
highly recommended to model and design the complete workflow carefully. Because
warehouse implementation is expensive, good modeling as well as documentation
should be maintained. According to the report in [11], designing a
well-established ETL workflow consumes almost one third of the cost and effort
of a DW implementation. A well-designed ETL process is one of the most
important factors in building an effective DW. Each vendor-provided tool has
its own methodology for designing the ETL process [9,18], and using it requires
an understanding of the functionality, language, standards, etc. of that
particular tool. Moreover, a design integrated with one tool is not suitable
for execution on another platform.
In ETL processing, conceptual modeling reflects a high-level view of the
entities and the relationships among them. It provides only an abstract view of
the workflow rather than the implementation details. Considerable research has
been done on conceptual modeling of ETL; UML, BPMN and semantic web
technologies are the most commonly used techniques so far. We propose a new way
of modeling an ETL process using the Systems Modeling Language (SysML).
Although there are many contributions towards abstract ETL modeling, we think
that SysML is a new direction for conceptualizing and validating an ETL
workflow. There is considerable research scope in using SysML for ETL modeling,
validation, simulation and executable code production, for the benefit of both
technical and non-technical users.
This paper proposes a new technique for designing a conceptual model of ETL
using the SysML standard, following the Model-Based Systems Engineering (MBSE)
approach. SysML is a general-purpose system modeling language that supports the
identification, analysis, design, testing and validation of systems [14]. It
supports system modeling for a broad range of domains such as aerospace,
automotive and health care. SysML is a new modeling language standardized by
the Object Management Group (OMG) [2] and the International Council on Systems
Engineering (INCOSE) [15,16]. It can be used to model a high-level view of the
ETL process and to validate the system through simulation.
The rest of the paper is structured as follows: Sect. 2 briefly discusses
existing work in the area of ETL conceptual modeling. Section 3 gives an
overview of Model-Based Systems Engineering and of the SysML notations for
requirement and activity diagrams, with their characteristics. Section 4
explains our proposed approach to ETL modeling with a suitable example using
SysML diagrams. Finally, Sect. 5 presents the conclusion and probable future
directions of this work.
244 N. Biswas et al.
2 Related Work
Various approaches have been proposed for the conceptual design of ETL
processes in the last few decades. These works can be classified by the
modeling language they use: UML, BPMN or semantic web based. This section
briefly discusses these techniques.
The very first attempt at conceptual ETL modeling was made by Vassiliadis
et al. in [34]. A customizable generic meta-model is proposed, together with a
set of notations to represent ETL activities. The model establishes the
relationships between the attributes of the sources and those of the data
warehouse. It supports customizable transformation templates such as primary or
foreign key checking and null value checking. Finally, a candidate relationship
set is established for updating data in the warehouse from multiple source
databases. The authors further enriched their work in [26] by proposing a
methodology that gives a step-by-step procedure from source selection to
warehouse population, along with attribute relationship mapping and the
handling of runtime obstacles.
J. Trujillo et al. [32] designed the ETL workflow using a UML modeling
approach. This was the first approach to conceptual model design using standard
UML notations. The authors use UML class diagrams to establish the
relationships between databases and their attributes. Their model supports
various transformation processes such as aggregation, conversion, filter and
join, with zoom-in and zoom-out facilities for different levels of design.
Another research effort using UML 2.0 was proposed in [19], where the authors
address the extraction phase only. They identify six classes and present a
class diagram, a use case diagram and a sequence diagram for the extraction
phase using standard UML notation. The transformation and loading phases are
not included in their work.
L. Muñoz et al. [20] modeled a complete ETL process using UML activity
diagrams. The activities involved in the ETL process are expressed as diagrams
with control flow sequences that support various transformation activities.
They further enriched their work in [21] by proposing automatic code generation
from conceptual models based on Model Driven Architecture (MDA). A conceptual
model based on their previous work [20] is designed as a PIM (Platform
Independent Model) using UML features. A PIM gives a functional view of the
system without regard to the platform. Different PSMs (Platform Specific
Models), each giving a logical model view, can be produced from the PIM. Code
for creating the data structures is generated automatically from each
individual PSM. The PIM-to-PSM transformation is done in the QVT
(Query/View/Transformation) language.
BPMN (Business Process Model and Notation) consists of standard graphical
notations that help to understand business processes within an organization.
The first attempt at using BPMN notations in ETL conceptual modeling was made
by Akkaoui et al. in [3]. They describe how the conceptual model is formed and
convert it from BPMN to BPEL (Business Process Execution Language) in order to
execute the designed model and to implement relations with web services. In a
sequel to their work [6], a Model-Driven Development (MDD) based,
vendor-independent BPMN meta-model is created and automatic code generation for
any vendor-specific platform is proposed.
MBSE is an OMG-supported new standard for the systems engineering domain,
featuring requirement-driven and functional analysis, design, integration,
validation and simulation of system designs throughout the life-cycle of system
development, as defined by INCOSE [12,15]. MBSE promotes model-based approaches
instead of the prevalent document-oriented design methods. A model-oriented
approach helps to capture the architectural description of a system and
promotes a better understanding of its construction and performance. Using a
model, the complete system can be visualized and validated at an early stage of
design. After that, the model can be mapped onto a physical implementation. UML
and SysML are standard visual modeling languages that can be used to describe
the system model.
MBSE is gaining popularity in industry for creating complex systems in
multi-disciplinary environments. SysML is one of the key components of MBSE,
with constructs for capturing requirements, architecture, constraints and
hierarchical or multi-layered views of the system model. It allows models that
come from different engineering disciplines to be linked. MBSE [1] improves
system modeling through better communication, better management of system
complexity, standard data management, better product quality, improved
information capture and risk minimization.
[Figure: evolution of modeling languages. Axes: method evolution versus language
expressiveness. Nodes: SA/SD, OMT, Booch, OOSE, UML 1.x, UML 2.x, SysML,
BPMN 1.x, BPMN 2.x. Edge labels: «derive», «refine», «influence».]
nature of a system, as shown in Fig. 2. The activity diagram, block definition
diagram and internal block diagram, indicated by bubble boxes, are modified
versions of basic UML diagrams. The parametric and requirement diagrams,
indicated by dashed boxes, are newly introduced in SysML. Other basic UML
diagrams can also be drawn using SysML.
[Figure: SysML diagrams. Labels in the figure: Requirement Diagram, Use Case
Diagram, Sequence Diagram, Activity Diagram, Structure Diagram, Block Definition
Diagram, Internal Block Diagram, Package Diagram, Parametric Diagram.]
relationship and test cases to verify the requirements. A basic SysML
requirement block is displayed in Fig. 3. A SysML requirement is drawn as a
rectangular block containing its stereotype «requirement», its unique
identifier (for example Id=RQ1.1) and a Text=“#” field holding the textual
requirement details. The designer can also select extended requirement
properties such as verification method, source priority, risk, etc.
Requirements can be further divided into sub-categories such as business,
functional, interface, usability, performance and physical. Derive, Satisfy,
Nesting, Trace, Verify and Refine are the relationship types that can be used
in a requirement diagram.
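The requirement block described above can be pictured as a small data
structure. The following Python sketch is only an illustration under our own
assumptions: the class name, field names and the two sample requirements are
hypothetical and are not taken from the SysML specification or from any
modeling tool.

```python
from dataclasses import dataclass, field

# Relationship kinds named in the text; the lowercase spelling is our choice.
RELATION_KINDS = {"derive", "satisfy", "nesting", "trace", "verify", "refine"}


@dataclass
class Requirement:
    """One requirement block: identifier, text and optional extended properties."""
    req_id: str
    text: str
    properties: dict = field(default_factory=dict)  # e.g. priority, risk
    relations: list = field(default_factory=list)   # (kind, target id) pairs

    def relate(self, kind: str, target: "Requirement") -> None:
        """Record a typed relationship from this requirement to another one."""
        if kind not in RELATION_KINDS:
            raise ValueError(f"unknown relationship kind: {kind}")
        self.relations.append((kind, target.req_id))


# Hypothetical requirements, for illustration only.
rq1 = Requirement("RQ1", "Load sales data into the warehouse")
rq11 = Requirement("RQ1.1", "Extract customer addresses", {"priority": "high"})
rq11.relate("derive", rq1)
```

Restricting `relate` to the six named kinds mirrors the closed set of
relationship types listed in the text; a real tool would attach further
semantics to each kind.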
Activity Diagram Notation. In a SysML activity diagram, some of the notations
are the same as in a UML activity diagram, as shown in Fig. 4, and some new
notations are incorporated. For example, an activity edge can be characterized
by stereotypes such as «discrete» or «continuous». The actual rate of an object
flow can be specified using a constraint notation such as {rate = expression}.
Assigning a probability to an activity edge (mostly a control flow), written as
{probability = value%}, is another new feature of SysML diagrams. It expresses
the probability that a particular edge is traversed.
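To make the probability annotation concrete, the sketch below picks one
outgoing edge of a node according to its {probability = value%} labels. It is a
minimal illustration, assuming that the probabilities on the edges leaving one
node are meant to sum to 100%; the function names and the edge names `load` and
`reject` are hypothetical.

```python
import random


def check_probabilities(edges: dict) -> None:
    """Raise ValueError unless the edge probabilities (in percent) sum to 100."""
    total = sum(edges.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"edge probabilities sum to {total}, expected 100")


def pick_edge(edges: dict, rng: random.Random) -> str:
    """Choose one outgoing edge, weighted by its annotated probability."""
    check_probabilities(edges)
    names = list(edges)
    weights = [edges[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical decision with two outgoing control flows.
outgoing = {"load": 90.0, "reject": 10.0}
chosen = pick_edge(outgoing, random.Random(0))
```

Repeating `pick_edge` many times gives traversal frequencies close to the
annotated percentages, which is how such annotations are typically used when a
model is simulated.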
The behavior of an object node can be expressed more precisely using the
stereotype «nobuffer» or «overwrite». In the first case, an object arriving at
the node is discarded if the next action is not ready to receive it. In the
second case, the object held at the node is overwritten by the newly arriving
one if the next action is not ready to receive it. Using interruptible regions,
a group of elements in an activity diagram can be set apart by a dashed box.
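The difference between the two object node stereotypes can be sketched with a
toy simulation. This is a simplified illustration, not the normative SysML
token semantics: the class `ObjectNode`, its `mode` values and the default
queueing mode `buffer` are our own assumptions.

```python
class ObjectNode:
    """Toy object node that keeps, drops or overwrites waiting objects."""

    def __init__(self, mode: str = "buffer") -> None:
        if mode not in {"buffer", "nobuffer", "overwrite"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.held = []  # objects waiting for the next action

    def offer(self, obj, receiver_ready: bool):
        """Offer an object; return what is passed downstream, or None."""
        if receiver_ready:
            # Queue the new object and hand over the oldest waiting one.
            self.held.append(obj)
            return self.held.pop(0)
        if self.mode == "nobuffer":
            return None            # the arriving object is discarded
        if self.mode == "overwrite":
            self.held = [obj]      # the newest object replaces the waiting one
        else:
            self.held.append(obj)  # plain buffering keeps every object
        return None


dropping = ObjectNode("nobuffer")
dropping.offer("row-1", receiver_ready=False)

latest_only = ObjectNode("overwrite")
latest_only.offer("row-1", receiver_ready=False)
latest_only.offer("row-2", receiver_ready=False)
```

After these calls `dropping.held` is empty, while `latest_only.held` contains
only the most recent object, which matches the two cases described in the text.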
The proposed ETL process is shown as a SysML activity diagram in Fig. 7,
created using MagicDraw. It shows the flow of data and control among the
activities that load an e-commerce sales data warehouse. From the start node to
the end node, the stereotype of each object flow and control flow is indicated
to describe the nature of the flow. Opaque actions and call behavior actions
are used to describe unit activities and sub-activities as per the requirement.
The value type of each input and output pin of an action node is specified.
Parallel edges are joined by a join node, and a single path is split into
parallel outgoing edges by a fork node.

[Figure: schema of the e-commerce sales data warehouse; tables and attributes
as recovered from the figure text]
Fact_Sales: Customer_key, Supplier_key, Website_key, Product_key,
  Date_time_key, Promo_key, Order_id, Item_price, Item_quantity, Item_discount,
  Item_vat, Item_shipping_charge, Item_total_no, Item_total_price,
  Average_item_price, Average_discount, Cust_total_no
Dimension_Customer: Customer_key, Cust_greetings, Cust_first_name,
  Cust_middle_name, Cust_last_name, Cust_gender, Cust_marital_status,
  Cust_date_of_birth, Cust_education, Address_key, Cont_ph_code, Contact_ph_no,
  Cust_email, Cust_website, Cust_payment_type, Cust_credit_status
Dimension_Supplier: Supplier_key, Supplier_name, Address_key, Shipping_mode,
  Shipper_name, Contact_ph_code, Contact_ph_no, Contact_email
Dimension_Address: Address_key, Country_name, State_key, Country_capital
Dimension_State: State_key, City_key, State_name, State_capital, State_Region,
  District_name
Dimension_City: City_key, City_name, City_road_name, City_area, landmark,
  Pin_code
Dimension_Date_Time: Date_time_key, Calerdar_date, Day_no_week, Day_name_week,
  Day_no, Week_no, Month_name, Quarter_No, Year, Fiscal_year, Holiday_flag,
  Weekend_flag, Time_24_hr_clock, Time_12_hr_clock, AM_PM_info, Fnoon_flag,
  Noon_flag, Anoon_flag, Evening_flag
Dimension_Website: Website_key, Wpage_name, Wpage_URL, Wpage_type,
  Wpage_designer, Wpage_metainfo, Navigation_type, Navigation_key, flag_logo,
  place_logo, Total_item_on_page, name_banner, type_banner, flag_image,
  image_place
Dimension_Navigation: Navigation_key, home_page_URL, 1st_page_URL,
  2nd_page_URL, 3rd_page_URL, 4th_page_URL, 5th_page_URL, 6th_page_URL,
  7th_page_URL, 8th_page_URL, 9th_page_URL, 10th_page_URL, Exit_page_URL,
  search_page_flag, help_page_flag, signout_page_flag
Dimension_Product: Product_key, Category_key, SKU_no, Prod_name, Prod_info,
  Unit_price, Category_info
Dimension_Category: Category_key, Subcat, Manufacturer_info, Brand_name,
  Color_info, Size_info, Weight_info, parcel_type, parcel_size,
  retail_case_units, shipping_case_units, pallet_case_no
Dimension_promotion: Promo_key, Promo_name, Promo_type, Price_discount,
  Advertisement_type, Adv_media, Coupon_info, Promo_cost, Start_date,
  Close_date

Fig. 6. SysML requirement diagram of the E-commerce system for ETL process
First, the source databases are accessed. After the key attributes are
verified, the dimension data are loaded by the loader into their respective
dimension tables. During dimension loading, the aggregation level hierarchy is
maintained. For example, Dimension Navigation is loaded before Dimension
Website, and product data are loaded first into Dimension Category and then
into Dimension Product. The addresses of customers and suppliers come from the
Area.XML file, and the product catalog is fetched from the Product.CSV file.
The sub-activity for loading the Area is given in Fig. 8. Once Dimension
Address has been loaded, it is shared by Dimension Supplier and Dimension
Customer, which are loaded afterwards.
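The loading order stated above is a set of precedence constraints, so a valid
order can be derived with a topological sort. The sketch below uses Python's
standard `graphlib` module (Python 3.9 or later); the dictionary only encodes
the three precedences named in the text, and the underscore table names follow
the schema figure. It is an illustration, not part of the proposed SysML model.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables that must be loaded before it.
load_after = {
    "Dimension_Website": {"Dimension_Navigation"},
    "Dimension_Product": {"Dimension_Category"},
    "Dimension_Supplier": {"Dimension_Address"},
    "Dimension_Customer": {"Dimension_Address"},
}

# static_order() yields every table after all of its prerequisites.
load_order = list(TopologicalSorter(load_after).static_order())
```

Any order returned this way respects the hierarchy; tables without a mutual
constraint, such as Dimension_Category and Dimension_Address, may appear in
either relative order.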
After the six dimensions are loaded, the basic facts (price, quantity,
discount), derived facts (VAT, shipping charge, total price, item total no) and
non-additive facts (average item price, average discount) are stored in the
fact table. The overall ETL process is executed every 12 h, as indicated in
Fig. 7. This example shows the extraction and loading processes. Other common
ETL transformation tasks such as aggregation, filtering, correction,
conversion, joining, splitting, merging and log generation can also be
represented in the conceptual model.
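Why an average is treated as a non-additive fact can be seen from a small
worked example. The order lines below are invented for illustration, and the
helper functions are our own; the point is only that totals can be summed
across groups, whereas averages must be recomputed from the underlying totals.

```python
# Hypothetical order lines as (item_price, item_quantity) pairs.
group_a = [(10.0, 1), (20.0, 1)]
group_b = [(60.0, 1)]


def total_price(lines):
    """Additive fact: sum of price times quantity."""
    return sum(price * qty for price, qty in lines)


def total_items(lines):
    """Additive fact: number of items sold."""
    return sum(qty for _, qty in lines)


def average_item_price(lines):
    """Non-additive fact: must be derived from the two totals."""
    return total_price(lines) / total_items(lines)


combined = group_a + group_b

# Totals add up across groups.
summed_total = total_price(group_a) + total_price(group_b)   # 90.0

# Averaging the two group averages gives the wrong overall figure.
average_of_averages = (
    average_item_price(group_a) + average_item_price(group_b)
) / 2                                                         # 37.5
true_average = average_item_price(combined)                   # 30.0
```

This is why a fact table usually keeps the additive components alongside any
stored average, so that the average can be recomputed at each aggregation
level.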
After the system model is designed, the SysML model is transformed into its
corresponding executable code. XMI (XML Metadata Interchange) is the standard
platform-independent format for a SysML model, and the conceptual model can be
transformed into this format. Part of the XMI code is given in Listing 1.1.
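Since XMI is an XML format, an exported model can be inspected with any XML
parser. The sketch below reads a small hand-written fragment with Python's
standard library and lists the action nodes of an activity. The fragment is
hypothetical: it is not the listing referred to above and not MagicDraw output,
and the namespace URIs, element names and action names are assumptions chosen
to resemble a typical XMI export.

```python
import xml.etree.ElementTree as ET

# Assumed XMI namespace URI; real exports may use a different version.
XMI_NS = "http://www.omg.org/spec/XMI/20131001"

# Hypothetical, hand-written fragment for illustration only.
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/spec/XMI/20131001"
         xmlns:uml="http://www.omg.org/spec/UML/20131001">
  <uml:Model xmi:id="m1" name="ETL_Model">
    <packagedElement xmi:type="uml:Activity" xmi:id="a1" name="LoadSalesDW">
      <node xmi:type="uml:OpaqueAction" xmi:id="n1" name="Access sources"/>
      <node xmi:type="uml:CallBehaviorAction" xmi:id="n2" name="Load Area"/>
    </packagedElement>
  </uml:Model>
</xmi:XMI>"""


def list_actions(xmi_text: str) -> list:
    """Return (xmi:type, name) for every <node> element in the document."""
    root = ET.fromstring(xmi_text)
    type_attr = f"{{{XMI_NS}}}type"  # namespaced attribute key in ElementTree
    return [(node.get(type_attr), node.get("name")) for node in root.iter("node")]


actions = list_actions(SAMPLE)
```

Here the opaque action and the call behavior action correspond to the unit
activity and sub-activity notions used in the activity diagram.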
5 Conclusion
The ETL process is responsible for selecting and extracting data from several
sources, cleaning and transforming them into the desired format, and finally
loading them into a DW. ETL process modeling is a way to design the orientation
of data and to establish their relationships throughout the ETL processing
activity. The main focus of this paper is to model an ETL process at the
conceptual level. A significant amount of work has been done on ETL process
modeling with UML, BPMN or semantic web based methods. In this paper, we
proposed an MBSE-oriented system model of the ETL process for a data warehouse
environment, using SysML, a new modeling language that is gaining popularity
nowadays. SysML is derived from UML and offers some additional facilities to
systems engineers. With SysML, the system model can be designed in a more
expressive and flexible way. An example of ETL process modeling for an
e-commerce system is discussed in this work; in particular, the propagation of
data from the sources to the DW is explained as a use case of the model. The
developed model is platform independent by nature and simple to understand for
both technical and non-technical users. After the ETL model is designed in
SysML, its corresponding executable XMI code is generated. In future, we intend
to simulate the proposed model to analyze system behavior and requirements more
precisely, and to extend the model view to the logical and physical levels.
References
1. MBSE wiki. http://www.omgwiki.org/MBSE/doku.php
2. OMG systems modeling language. http://www.omgsysml.org/
3. Akkaoui, E.E., Zimányi, E.: Defining ETL workflows using BPMN and BPEL. In:
Proceedings of the ACM Twelfth International Workshop on Data Warehousing
and OLAP, pp. 41–48. ACM (2009)
4. El Akkaoui, Z., Mazón, J.-N., Vaisman, A., Zimányi, E.: BPMN-based
conceptual modeling of ETL processes. In: Cuzzocrea, A., Dayal, U. (eds.)
DaWaK 2012. LNCS, vol. 7448, pp. 1–14. Springer, Heidelberg (2012).
doi:10.1007/978-3-642-32584-7_1
5. Akkaoui, Z.E., Zimányi, E., López, J.N.M., Mondéjar, J.C.T., et al.: A BPMN-
based design and maintenance framework for ETL processes (2013)
6. Akkaoui, Z.E., Zimányi, E., Mazón, J.N., Trujillo, J.: A model-driven framework
for ETL process development. In: Proceedings of the 14th International Workshop
on Data Warehousing and OLAP, pp. 45–52. ACM (2011)
7. Çağıltay, N.E., Topallı, D., Aykaç, Y.E., Tokdemir, G.: Abstract conceptual database
model approach. In: Conference on Science and Information, pp. 275–281 (2013)
8. Ayhan, S., Pesce, J., Comitz, P., Sweet, D., Bliesner, S., Gerberick, G.: Predictive
analytics with aviation big data. In: Conference on Integrated Communications,
Navigation and Surveillance (ICNS 2013), pp. 1–13 (2013)
9. Barateiro, J., Galhardas, H.: A survey of data quality tools. Datenbank-Spektrum
14(15–21), 48 (2005)
10. Belo, O., Gomes, C., Oliveira, B., Marques, R., Santos, V.: Automatic
generation of ETL physical systems from BPMN conceptual models. In:
Bellatreche, L., Manolopoulos, Y. (eds.) MEDI 2015. LNCS, vol. 9344,
pp. 239–247. Springer, Cham (2015). doi:10.1007/978-3-319-23781-7_19
11. Eckerson, W., White, C.: Evaluating ETL and data integration platforms. Report
of The Data Warehousing Institute 184 (2003)
12. Estefan, J.A.: Survey of model-based systems engineering (MBSE) methodologies.
Incose MBSE Focus Group 25(8) (2007)
13. Franconi, E., Kamblet, A.: A data warehouse conceptual data model. In:
Proceedings of 16th International Conference on Scientific and Statistical
Database Management, pp. 435–436 (2004)
14. Friedenthal, S., Moore, A., Steiner, R.: A Practical Guide to SysML: The Systems
Modeling Language. Morgan Kaufmann, San Francisco (2014)
15. Hart, L.E.: Introduction to model-based system engineering (MBSE) and SysML,
30 July 2015.
http://www.incose.org/docs/default-source/delaware-valley/mbse-overview-incose-30-july-2015.pdf
16. Hause, M.: The SysML modelling language. In: 15th European Systems Engineering
Conference, vol. 9 (2006)
17. Hoang, A.D.T., Nguyen, B.T.: An integrated use of CWM and ontological
modeling approaches towards ETL processes. In: IEEE International Conference on
e-Business Engineering (ICEBE 2008), pp. 715–720, October 2008
18. Kherdekar, V.A., Metkewar, P.S.: A technical comprehensive survey of ETL tools.
Int. J. Appl. Eng. Res. 11(4), 2557–2559 (2016)
19. Mrunalini, M., Kumar, T.S., Kanth, K.R.: Simulating secure data extraction in
extraction transformation loading (ETL) processes. In: Third UKSim European
Symposium on Computer Modeling and Simulation (EMS 2009), pp. 142–147.
IEEE (2009)
20. Muñoz, L., Mazón, J.-N., Pardillo, J., Trujillo, J.: Modelling ETL processes of data
warehouses with UML activity diagrams. In: Meersman, R., Tari, Z., Herrero, P.
(eds.) OTM 2008. LNCS, vol. 5333, pp. 44–53. Springer, Heidelberg (2008).
doi:10.1007/978-3-540-88875-8_21
21. Muñoz, L., Mazón, J.N., Trujillo, J.: Automatic generation of ETL processes from
conceptual models. In: Proceedings of the ACM Twelfth International Workshop
on Data Warehousing and OLAP, pp. 33–40. ACM (2009)
22. Oliveira, B., Belo, O.: BPMN patterns for ETL conceptual modelling and
validation. In: Chen, L., Felfernig, A., Liu, J., Raś, Z.W. (eds.) ISMIS 2012.
LNCS, vol. 7661, pp. 445–454. Springer, Heidelberg (2012).
doi:10.1007/978-3-642-34624-8_50
23. Oliveira, B., Belo, O.: ETL standard processes modelling - a novel BPMN approach.
In: Proceedings of the 15th International Conference on Enterprise Information
Systems, pp. 120–127 (2013)
24. Oliveira, B., Santos, V., Belo, O.: Pattern-based ETL conceptual modelling. In:
Cuzzocrea, A., Maabout, S. (eds.) MEDI 2013. LNCS, vol. 8216, pp. 237–248.
Springer, Heidelberg (2013). doi:10.1007/978-3-642-41366-7_20
25. Simitsis, A., Skoutas, D., Castellanos, M.: Representation of conceptual ETL designs
in natural language using semantic web technology. Data Knowl. Eng. 69(1),
96–115 (2010)
26. Simitsis, A., Vassiliadis, P.: A methodology for the conceptual modelling of ETL
processes. In: Proceedings of DSE (2003)
27. Skoutas, D., Simitsis, A.: Designing ETL processes using semantic web
technologies. In: Proceedings ACM 9th International Workshop on Data
Warehousing and OLAP (DOLAP 2006), Arlington, Virginia, USA, pp. 67–74 (2006)
28. Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for
both structured and semi-structured data. Int. J. Semant. Web Inf. Syst. (IJSWIS)
3(4), 1–24 (2007)
29. Skoutas, D., Simitsis, A., Sellis, T.: Ontology-driven conceptual design of ETL
processes using graph transformations. In: Spaccapietra, S., Zimányi, E., Song,
I.-Y. (eds.) Journal on Data Semantics XIII. LNCS, vol. 5530, pp. 120–146.
Springer, Heidelberg (2009). doi:10.1007/978-3-642-03098-7_5
30. Snezana, S., Violeta, M.: Business intelligence tools for statistical data analysis.
In: Proceedings of the 32nd International Conference on Information Technology
Interfaces (ITI 2010), pp. 199–204 (2010)
31. Thi, A.D.H., Nguyen, B.T.: A semantic approach towards CWM-based ETL
processes. In: Proceedings of I-SEMANTICS 2008, pp. 58–66 (2008)
32. Trujillo, J., Luján-Mora, S.: A UML based approach for modeling ETL processes
in data warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P.
(eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003).
doi:10.1007/978-3-540-39648-2_25
33. Vassiliadis, P.: A survey of extract - transform - load technology. Int. J. Data
Warehouse. Min. 5(3), 1–27 (2009)
34. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL
processes. In: Proceedings DOLAP, pp. 14–21 (2002)