1 The SDIL Smart Data Testbed
Authors: Prof. Dr. Michael Beigl, Prof. Dr. Bernhard Neumair, Till Riedel, Nico
Schlitter, KIT/Smart Data Innovation Lab
The Smart Data Innovation Lab (SDIL) offers big data researchers unique access
to a large variety of big data and in-memory technologies. Industry and science
collaborate closely in order to find hidden value in big data and generate smart data.
Projects focus on the strategic research areas of Industrie 4.0, Energy, Smart
Cities and Medicine.
SDIL bridges the gap between cutting-edge research and industrial big data
applications. The main goal of the SDIL is to accelerate innovation cycles using
smart data. In order to close today's gap between academic research
and industry problems through a data-driven innovation cycle, the SDIL provides
extensive support to all collaborative research projects free of charge.
Figure 1: The SDIL Innovation Cycle
1.1 Platform
The hardware and software provided by the SDIL platform enable
researchers to perform their analytics on unique, state-of-the-art systems
without, for example, acquiring separate licenses or dealing with complicated
cost structures. Industrial data providers, in turn, get the chance to analyze
their data together with an academic partner in a fully secured on-premise
environment.
Figure 2: The SDIL Platform
1.1.1 SAP HANA
SAP HANA is a platform that allows customers to explore and
analyze large volumes of data in real time, create flexible analytic models,
and develop and deploy real-time applications. The SAP HANA in-memory
appliance is available on the SDIL Platform.
In addition, we installed the Application Function Library (AFL) on the HANA
instances. The AFL is a collection of pre-delivered, commonly used
business, predictive and other algorithms for use in projects or
solutions that run on SAP HANA. These algorithms can be leveraged directly
in development projects, speeding them up by avoiding the need to write
complex custom algorithms. AFL operations also offer very fast performance,
as AFL functions run in the core of the SAP HANA in-memory database. The AFL
package includes:
The Predictive Analysis Library (PAL) is a set of functions in the AFL.
It contains pre-built, parameter-driven, commonly used algorithms
primarily related to predictive analysis and data mining, and supports
multiple algorithms, e.g. K-Means, Association Analysis, C4.5
Decision Trees, Multiple Linear Regression, and Exponential Smoothing.
Please refer to the official SAP HANA PAL user guide for further
information (SAP HANA PAL Library Documentation).
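On the platform, PAL functions are invoked through SQL procedures and configured via parameter tables; purely to illustrate the kind of parameter-driven algorithm PAL ships, here is a minimal stand-alone K-Means sketch in plain Python (all names and values are our own illustration, not PAL's API):

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain K-Means sketch: assign each point to the nearest centre,
    then move every centre to the mean of its assigned points."""
    centres = [list(p) for p in points[:k]]  # first k points as start centres
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centre.
        labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        # Update step: each centre becomes the mean of its member points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centres[j] = [sum(coord) / len(members) for coord in zip(*members)]
    return labels, centres

# Two well-separated point clouds; the algorithm should recover them as clusters.
pts = [(0.0, 0.0)] * 20 + [(5.0, 5.0)] * 20
labels, centres = kmeans(pts, k=2)
```

In PAL, the corresponding settings (number of clusters, maximum iterations, distance measure) are passed in a parameter table rather than as function arguments.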
The Business Function Library (BFL) is a set of functions in the AFL. It
contains pre-built, parameter-driven, commonly used algorithms and is
primarily related to the analysis of financial data. Please refer to the
official SAP HANA BFL user guide for further information (SAP HANA
BFL Library Documentation).
System: SAP HANA
Cores: 320 (4 servers with 80 cores each)
RAM: 4 TB (1 TB per server)
Disk space: 80 TB (20 TB per server)
Network: 10 Gbit/s Ethernet
Software:
SAP HANA Database System
Predictive Analysis Library
Business Function Library
Figure 3: Hardware and software configuration for the SAP HANA System
1.1.2 Terracotta BigMemory Max
Terracotta BigMemory Max is an in-memory data management platform for
real-time big data applications, developed by Software AG. It supports a
distributed in-memory data-storage topology that enables the sharing of
data among multiple caches and in-memory data stores across multiple JVMs.
BigMemory Max uses a Terracotta Server Array to manage data that is
shared by multiple application nodes in a cluster. Furthermore, the use of off-
heap memory enables Java applications to leverage virtually all the available
RAM for in-memory data storage without causing garbage collection pauses.
The BigMemory Max kit is installed and available on the SDIL Platform. A
single active Terracotta Server is configured and running on this
machine. The server manages Terracotta clients, coordinates shared objects
and persists data. Terracotta clients run on the application servers along with
the applications being clustered by Terracotta. The data is held on the remote
server, with a subset of recently used data held in each application node.
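This topology, with the full data set on the server and a recently used subset in each application node, can be sketched as follows (in Python for brevity; BigMemory Max itself is used from Java via the Ehcache API, and all class names below are our own illustration):

```python
from collections import OrderedDict

class RemoteStore:
    """Stands in for the Terracotta Server Array: holds the full data set."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class ClientNode:
    """An application node: keeps an LRU subset of recently used entries."""
    def __init__(self, remote, capacity=2):
        self.remote = remote
        self.capacity = capacity
        self._local = OrderedDict()  # local "near cache" of recent entries
    def get(self, key):
        if key in self._local:
            self._local.move_to_end(key)      # local hit: refresh recency
            return self._local[key]
        value = self.remote.get(key)          # miss: fetch from the server
        self._local[key] = value
        if len(self._local) > self.capacity:
            self._local.popitem(last=False)   # evict least recently used
        return value
    def put(self, key, value):
        self.remote.put(key, value)           # writes go to the shared server
        self._local[key] = value
        self._local.move_to_end(key)
        if len(self._local) > self.capacity:
            self._local.popitem(last=False)

shared = RemoteStore()
node_a, node_b = ClientNode(shared), ClientNode(shared)
node_a.put("sensor-42", 17.5)    # write lands on the shared server
value = node_b.get("sensor-42")  # another node sees it via the server
```

A real Terracotta cluster additionally keeps the local subsets coherent across nodes; this sketch only shows the read-through and LRU-eviction behaviour.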
1.1.3 IBM Open Platform with Hadoop and Spark
The SDIL Platform is running a Hadoop cluster with Spark that can be used
to perform analytics following the map-reduce paradigm.
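As an illustration of the map-reduce paradigm, the classic word count can be sketched in plain Python (on the cluster itself one would use the Spark or Hadoop APIs instead; all function names here are our own):

```python
def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for every distinct word.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]  # map step
    return reduce_phase(mapped)                                    # shuffle + reduce

counts = word_count(["smart data needs big data", "big data"])
```

In Spark the same pipeline would be expressed with flatMap over the lines and reduceByKey over the pairs, distributed across the cluster.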
IBM SPSS Modeler
In addition, we provide specialized tools that build upon Hadoop for further
analytics. IBM SPSS Modeler is a data mining and text
analytics software application. It provides a range of advanced algorithms
and techniques, including text and entity analytics, decision
management, and optimization, in order to build predictive models and
conduct a wide range of data analysis tasks.
([Link]
0/en/modelerusersguide_book.pdf)
IBM SPSS Analytic Server
In order to start an analysis stream on the IBM SPSS Modeler Server, one
first needs to import the data. The IBM SPSS Modeler Server provides a number
of ways to transfer data into the analytic streams: via files (CSV, JSON, XML
and other common formats), via a DB2 database server, or via the SPSS
Analytic Server.
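For the file-based route, the data only has to be written in one of the supported formats; a minimal sketch producing a CSV file ready for import (the file name and columns are an invented example):

```python
import csv

# Invented example rows; real projects would export their own measurements.
rows = [
    {"sensor_id": 1, "timestamp": "2016-01-01T00:00:00", "value": 21.5},
    {"sensor_id": 2, "timestamp": "2016-01-01T00:00:00", "value": 19.8},
]

with open("measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sensor_id", "timestamp", "value"])
    writer.writeheader()     # header row with the column names
    writer.writerows(rows)   # one data row per measurement
```

The resulting file can then be selected as a source node in a Modeler stream.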
System: IBM Watson Foundation Power 8
Cores: 140 (7 servers with 20 cores each)
RAM: 4 TB
Disk space: 300 TB
Network: 40 Gbit/s Ethernet
Software:
IBM Open Platform with Hadoop/Spark
SPSS Modeler
SPSS Analytic Server
DB2 with BLU Acceleration
Figure 4: Hardware and software configuration for the IBM Watson System
1.1.4 Virtualization and Resource Allocation
HTCondor
In order to use the SDIL resources efficiently and to avoid interference
between users, we make use of the HTCondor batch system. A program runs,
consumes memory (RAM) and CPU while doing so, and returns once it is
finished. If many users run many programs at once, the total available memory
might not be sufficient, and a program or even the whole compute server might
crash. A batch system takes care of resource management, guarantees that
users get exclusive access to the requested resources, and thereby avoids
system overload and crashes. To this end, users specify which computing task
they would like to perform and what resources this task requires. This
so-called job is submitted to the batch system, which executes it as soon as
the requested resources become available. Users can get an overview of their
submitted and running jobs via an API. Additionally, users can be informed
via email when their job is finished.
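A job is described in a small submit file that names the executable and the requested resources, for example (all file names and sizes below are invented example values; the keywords are HTCondor's submit-file syntax):

```
# example.sub -- submit with: condor_submit example.sub
executable     = analyze.py
arguments      = input.csv
request_cpus   = 4
request_memory = 16GB
log    = job.log
output = job.out
error  = job.err
notification = Complete
notify_user  = user@example.org
queue
```

condor_q then lists the user's submitted and running jobs, and the notification settings trigger the email once the job completes.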
1.2 Communities
SDIL provides access to experts and domain-specific skills within Data
Innovation Communities, fostering the exchange of project results. These
communities further provide opportunities for open innovation and bilateral
matchmaking between industrial partners and academic institutions.
1.2.1 Data Innovation Community “Industrie 4.0”
Industrie 4.0 is a powerful driver of large data growth, and directly connected
with the “Internet of Things”. Through the Web, real and virtual worlds grow
together to form the Internet of Things. In production, machines as well as
production lines and warehousing systems are increasingly capable of
exchanging information on their own, triggering actions and controlling each
other. The aim is to significantly improve processes in the areas of
development and construction, manufacturing and service. This fourth
industrial revolution represents the linking of industrial manufacturing and
information technology – creating a new level of efficiency and effectiveness.
Industrie 4.0 creates new information spaces linking ERP systems,
databases, the Internet and real-time information from production facilities,
supply chains and products.
The Data Innovation Community “Industrie 4.0” wants to explore important
data-driven aspects of the fourth industrial revolution, such as proactive
service and maintenance of production resources or finding anomalies in
production processes.
The Data Innovation Community “Industrie 4.0” addresses all companies and
research institutions interested in conducting joint research with regard to
these aspects. This includes user companies as well as companies from the
automation and IT industries.
1.2.2 Data Innovation Community “Energy”
The energy industry is facing fundamental changes. The move towards
renewable energies; the EU stipulation to install smart meters; the
development of new, customer-centred business models: all these changes
combine to form entirely new challenges for the IT infrastructure of the energy
industry. By analysing comprehensive data, both structured and
unstructured, e.g. data generated by mobile device apps, web portals or
social media, utility companies will be able to optimise their business
processes and develop new business models. A case in point: Big Data
analyses enable better consumption forecasts so that energy providers will
be able to better manage and control their energy purchases on the energy
markets. Thanks to Big Data, consumption rate models can be better tailored
towards specific user groups, and unhappy customers can be identified more
quickly – allowing for measures aimed at ensuring higher customer retention.
The Data Innovation Community “Energy” wants to explore important data-
driven aspects in the area of energy, such as the demand-driven fine-tuning
of consumption rate models based on smart meter generated data.
The Data Innovation Community “Energy” addresses all companies and
research institutions interested in conducting joint research with regard to
these aspects. This includes energy industry user companies as well as
companies from the automation and IT industries.
1.2.3 Data Innovation Community “Smart Cities”
Urban development and traffic management are also areas where Big Data
analyses open up entirely new possibilities. By means of integrated transport
communication solutions and intelligent traffic management systems, traffic in
fast-growing, densely populated urban areas can be managed better. In
cities, immense masses of data are generated by subway trains, busses,
taxis and traffic cameras, to name just a few. The existing IT environment
hardly allows for forecasts, let alone for extended data analyses that play
through different traffic and transport scenarios. Yet that is the only
way to improve the respective services and further urban planning. Once
information can be analysed in real-time, correctly interpreted and put into
context with historical data, then traffic jams and dangerous situations can be
identified at an early stage, leading to a significant decrease in traffic volume,
emissions and driving time.
The Data Innovation Community “Smart Cities” wants to explore important
data-driven aspects of urban life, such as traffic control, but also waste
disposal or disaster control.
The Data Innovation Community “Smart Cities” addresses all companies and
research institutions interested in conducting joint research with regard to
these aspects, but also public bodies. This includes user companies as well
as companies from the automation and IT industries.
1.2.4 Data Innovation Community “Personalised Medicine”
Modern medicine, too, generates increasingly large data quantities.
Reasons for this are higher-resolution data from state-of-the-art diagnostic
methods like magnetic resonance imaging (MRI), IT-controlled medical
technology, comprehensive medical documentation and ever more
detailed knowledge about the human genome. A case in point: personalised
cancer therapy. There, increasing use of software aims at taking terabytes of
clinical, molecular and medication data in diverse formats and
distilling from them effective treatment options for each individual patient in
real time, in order to significantly improve treatment results.
Within the Data Innovation Community “Personalised Medicine”, important
data-driven aspects of personalised medicine are to be explored, such as the
need-driven care of patients, IT controlled medical technology or even web-
based patient care.
The Data Innovation Community “Personalised Medicine” addresses all
companies and research institutions interested in conducting joint research
with regard to these aspects. This includes industry user companies and
clinics but also companies from the automation and IT industries.
1.4 Legal, security and curation as cross-cutting activities
Template agreements and processes ensure fast project initiation with
maximum legal security, tailored to the common technological platform. A
standardized process allows anyone to set up a new collaborative project at
SDIL within two weeks.
Once a project has successfully registered for the SDIL service, its partners
are allowed to upload and work with their data on the SDIL Platform. Data
providers can upload their data using the SFTP or SCP protocols. All users
get a dedicated private home directory for their files. For projects involving
multiple users, a project directory is available that is accessible only to the
project members.
The SDIL platform is protected by several layers of firewalls. Access to the
platform is only possible via dedicated login machines and only for users
who were approved beforehand in our identity management system. The
hardware itself is operated in a segregated server room with dedicated
access control. All data processing takes place in compliance with
German data protection rules and regulations. Data sources are only
accessible if such access was expressly granted by the data provider in
advance. To protect against data loss, we perform frequent encrypted backups
to our tape library. All data is deleted from the platform after the project
has finished.
The SDIL guarantees a sustainable investment to all partners by curating
industrial data sources, best practices and code artifacts that are contributed
on a fair-share basis. Furthermore, it actively includes open data and open
source developments to augment the unique industrial-grade solutions
provided within the platform.