Addressing Big Data Issues in Scientific Data Infrastructure
Abstract—Big Data are becoming a new technology focus both in science and in industry. This paper discusses the challenges that are imposed by Big Data on the modern and future Scientific Data Infrastructure (SDI). The paper discusses the nature and definition of Big Data, including such features as Volume, Velocity, Variety, Value and Veracity. The paper refers to different scientific communities to define requirements on data management, access control and security. The paper introduces the Scientific Data Lifecycle Management (SDLM) model that includes all the major stages and reflects the specifics of data management in modern e-Science. The paper proposes the SDI generic architecture model that provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. The paper explains how the proposed SDLM and SDI models can be naturally implemented using the modern cloud-based infrastructure services provisioning model and suggests the major infrastructure components for Big Data.

Keywords - Big Data Science, Scientific Data Infrastructure (SDI), Scientific Data Lifecycle Management (SDLM), Cloud Infrastructure Service, Big Data Infrastructure.

I. INTRODUCTION

Big Data technologies are becoming a current focus and a new "buzz-word" both in science and in industry. The emergence of Big Data or data-centric technologies indicates the beginning of a new form of continuous technology advancement that is characterized by overlapping technology waves related to different aspects of human activity, from production and consumption to collaboration and general social activity. In this context data-intensive science plays a key role.

Big Data are becoming related to almost all aspects of human activity, from simply recording events to research, design, production and digital services or product delivery to the final consumer. Current technologies such as Cloud Computing and ubiquitous network connectivity provide a platform for the automation of all processes in data collection, storing, processing and visualization.

Modern e-Science infrastructures allow targeting new large-scale problems whose solution was not possible before, e.g. genome, climate, and global warming research. e-Science typically produces a huge amount of data that need to be supported by a new type of e-Infrastructure capable of storing, distributing, processing, preserving, and curating these data [1, 2]; we refer to this new infrastructure as Scientific Data e-Infrastructure (SDI).

In e-Science, scientific data are complex multifaceted objects with complex internal relations; they are becoming an infrastructure of their own and need to be supported by corresponding physical or logical infrastructures to store, access and manage these data.

The emerging SDI should allow different groups of researchers to work on the same data sets, build their own (virtual) research and collaborative environments, safely store intermediate results, and later share the discovered results. New data provenance, security and access control mechanisms and tools will allow researchers to link their scientific results with the initial data (sets) and intermediate data to allow future re-use/re-purposing of data, e.g. with improved research techniques and tools.

This paper analyses new challenges imposed on modern e-Science infrastructures by the emerging Big Data technologies; it proposes a general approach and architecture solutions that constitute a new Scientific Data Lifecycle Management (SDLM) model and the generic SDI architecture model that provides a basis for heterogeneous SDI component interoperability and integration, in particular based on cloud infrastructure technologies.

This paper is primarily focused on SDI; however, it also analyses the Big Data nature in both e-Science and industry, examines their commonalities and differences, and discusses possible cross-fertilisation between the two domains.

This paper continues the authors' work on defining the Big Data infrastructure for e-Science initially presented in paper [3] and significantly extends it with new results and a wider scope to investigate relations between Big Data technologies in e-Science and industry. With its long tradition of working with constantly increasing volumes of data, modern science can offer industry its scientific analysis methods, while industry can bring Big Data technologies and tools to the wider public.

The paper is organised as follows. Section II looks into the Big Data definition and the Big Data nature in industry and science, analysing also the main drivers for Big Data technology development. Section III gives an overview of the main research communities and summarizes requirements for the future SDI. Section IV discusses challenges to data management in Big Data Science, including the SDLM discussion. Section V introduces the proposed e-SDI architecture model that is intended to answer the future Big Data challenges and requirements. Section VI discusses SDI implementation using cloud technologies. Section VII discusses security and trust related issues in handling data and summarises specific requirements for the access control infrastructure for modern and future SDI.
II. BIG DATA DEFINITION AND ANALYSIS

A. Big Data Nature in e-Science and Industry

Science has traditionally dealt with the challenges of handling large volumes of data in complex scientific research experiments. Scientific research typically includes the collection of data in passive observation or active experiments which aim to verify one or another scientific hypothesis. Scientific research and discovery methods are typically based on an initial hypothesis and a model which can be refined based on the collected data. The refined model may lead to a new, more advanced and precise experiment and/or re-evaluation of the previous data. Another distinctive feature of modern scientific research is that it requires wide cooperation between researchers to tackle complex problems and run complex scientific instruments.

In industry, private companies will not share data or expertise. When dealing with data, companies will always intend to keep control over their information assets. They may use shared third-party facilities, like clouds, but special measures need to be taken to ensure data protection, including data sanitization. It may also be the case that companies use shared facilities only for proof of concept and do production data processing at private facilities. In this respect, we need to accept that science and industry cannot be done in the same way, and consequently this will be reflected in how they can interact and how the Big Data infrastructure and tools can be built.

With the proliferation of digital technologies into all aspects of business activities and the emerging Big Data technologies, industry is entering a new playground where it needs to use scientific methods to benefit from the possibility to collect and mine data for desirable information, such as market prediction, customer behavior prediction, social group activity prediction, etc.

A number of discussions and blog articles [4, 5, 6] suggest that Big Data technologies need to adopt scientific discovery methods that include iterative model improvement and collection of improved data, and re-use of collected data with an improved model.

We can quote here a blog article by Mike Gualtieri from Forrester [7, 8, 9]: "Firms increasingly realize that [big data] must use predictive and descriptive analytics to find nonobvious information to discover value in the data. Advanced analytics uses advanced statistical, data mining and machine learning algorithms to dig deeper to find patterns that you can't see using traditional BI (Business Intelligence) tools, simple queries, or rules."

B. 5 Vs of Big Data

Although "Big Data" has become a new buzz-word, there is no consistent definition of Big Data, nor a detailed analysis of this new emerging technology. Most discussions now take place in the blogosphere, where however the most significant features and incentives of Big Data have been identified and become commonly accepted. In this section we attempt to summarise the available definitions and propose a consolidated view on the generic Big Data features that would help us define requirements for a supporting Big Data infrastructure and in particular the Scientific Data Infrastructure.

As a starting point, we can refer to the simple definition given in [10]: "Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques."

A related definition of data-intensive science is given in the book "The Fourth Paradigm: Data-Intensive Scientific Discovery" by the computer scientist Jim Gray [11]: "The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration."

In a number of discussion blog posts and articles, Big Data are attributed such characteristics as Volume, Velocity, and Variety, called the "3 Vs of Big Data". Based on our analysis, and concurring with some other articles [5, 6, 12], we propose a wider definition of Big Data as 5 Vs: Volume, Velocity, Variety and, additionally, Value and Veracity.

Figure 1 below illustrates the features related to the 5 Vs, which we analyse below.

Figure 1. 5 Vs of Big Data
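To make the 5 Vs concrete, the following sketch shows one possible way to annotate a data set along the five dimensions. It is purely illustrative: the field names, units and example values are our own assumptions and are not part of any proposed metadata standard.

    from dataclasses import dataclass, field

    @dataclass
    class BigDataProfile:
        """Illustrative annotation of a data set along the 5 Vs."""
        volume_tb: float            # Volume: total size at rest, in terabytes
        velocity_mb_per_s: float    # Velocity: sustained ingest rate
        variety: list               # Variety: structural forms present in the data
        value: str                  # Value: intended analytic use of the data
        veracity: dict = field(default_factory=dict)  # Veracity: trust attributes

    # Hypothetical profile of a sensor-network archive
    profile = BigDataProfile(
        volume_tb=1200.0,
        velocity_mb_per_s=350.0,
        variety=["structured", "semi-structured", "unstructured"],
        value="predictive maintenance analysis",
        veracity={"trusted_source": True, "integrity_checked": True},
    )
    print(profile)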
1) Volume

Volume is the most important and distinctive feature of Big Data, which imposes additional and specific requirements on all traditional technologies and tools currently used.

In e-Science, the growth of data volume is caused by advancements in both scientific instruments and SDI. In many areas the trend is actually to include data collected from all observed events, activities and sensors, which has become possible and is important for social activities and social sciences.

Big Data volume includes such features as size, scale, amount and dimension for tera- and exascale data, recorded either from data-rich processes or collected from many transactions and stored in individual files or databases; all of it needs to be accessible, searchable, processable and manageable.
Two examples from e-Science illustrate the different characters of data and the different processing requirements involved:

The Large Hadron Collider (LHC) produces on average 5 PB of data per month, generated in a number of short collisions that make them unique events. The collected data are filtered, stored and extensively searched for single events that may confirm a scientific hypothesis.

LOFAR (Low Frequency Array) is a radio telescope that collects about 5 PB every hour; however, the data are processed by a correlator and only correlated data are stored.
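A back-of-the-envelope comparison of the two sustained rates (a sketch assuming decimal units, 1 PB = 10^15 bytes, and a 30-day month) shows why the two cases lead to very different processing strategies:

    PB = 10**15  # bytes, decimal convention

    lhc_rate = 5 * PB / (30 * 24 * 3600)   # 5 PB per month as a sustained rate
    lofar_rate = 5 * PB / 3600             # 5 PB per hour as a sustained rate

    print(f"LHC   ~ {lhc_rate / 1e9:.1f} GB/s sustained")     # ~1.9 GB/s
    print(f"LOFAR ~ {lofar_rate / 1e12:.1f} TB/s sustained")  # ~1.4 TB/s

The difference of almost three orders of magnitude helps explain why LOFAR data are reduced by the correlator before storage, whereas the LHC monthly output can be stored and searched offline.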
In industry, global service providers such as Google, Facebook and Twitter produce, analyze and store data in huge amounts as part of their regular activity/production services. Although some of their tools and processes are proprietary, they actually prove the feasibility of solving Big Data problems at the global scale and significantly push the development of Open Source Big Data tools.

2) Velocity

Big Data are often generated at high speed, including data generated by arrays of sensors or multiple events, and need to be processed in real time, near real time, in batch, or as streams (as in the case of visualisation).

As an example, the LHC ATLAS detector [http://atlas.ch/] uses about 10^8 readout channels and collects up to 1 PB of unfiltered data per second, which is reduced to approximately 100 MB per second. This corresponds to observing up to 40 million collision events per second.
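Taking the quoted figures at face value, a short calculation makes the scale of the required online data reduction explicit (illustrative arithmetic only):

    RAW = 1e15      # ~1 PB/s of unfiltered detector data, as quoted above
    KEPT = 1e8      # ~100 MB/s retained after online filtering
    EVENTS = 40e6   # ~40 million collision events per second

    print(f"online reduction factor : {RAW / KEPT:.0e}")     # ~1e+07
    print(f"retained bytes per event: {KEPT / EVENTS:.1f}")  # ~2.5 bytes

Only a few bytes per event could be retained on average if the budget were spread evenly, which illustrates why such instruments must select a small subset of interesting events in real time rather than keeping every event in full.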
Industry can also provide numerous examples where data registration, processing or visualization impose similar challenges.

3) Variety

Variety deals with the complexity of Big Data and of the information and semantic models behind these data. This results in data collected as structured, unstructured, semi-structured and mixed data. Data variety imposes new requirements on data storage and database design, which should support dynamic adaptation to the data format, in particular scaling up and down.

Data variety will in particular increase when biological, human and societal systems become a subject of closer research and monitoring. An example of the latter is the urban environment, which requires operating, monitoring and evolving numerous processes, individuals and associations. Adopting data technologies in traditionally non-computer-oriented areas such as psychology and behavior research, history and archeology will generate especially rich data sets.

4) Value

Value is an important feature of the data, defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. Data value will depend on the events or processes they represent, such as stochastic, probabilistic, regular or random ones. Depending on this, requirements may be imposed to collect all data, to store them for a longer period (for some possible event of interest), etc. In this respect data value is closely related to data volume and variety.

5) Veracity

The veracity dimension of Big Data includes two aspects: data consistency (or certainty), which can be defined by their statistical reliability; and data trustworthiness, which is defined by a number of factors including data origin, collection and processing methods, and the trustworthiness of the infrastructure and facility used.

Big Data veracity ensures that the data used are trusted, authentic and protected from unauthorised access and modification. The data must be secured during their whole lifecycle, from collection from trusted sources to processing on trusted compute facilities and storage on protected and trusted storage facilities.

The following aspects define data veracity and need to be addressed to ensure it:
- Integrity of data and linked data (e.g., for complex hierarchical data, distributed data)
- Data authenticity and (trusted) origin
- Identification of both data and source
- Computer and storage platform trustworthiness
- Availability and timeliness
- Accountability and reputation

Data veracity relies entirely on the security infrastructure deployed and available from the Big Data infrastructure.
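As a minimal illustration of the integrity, identification and authenticity aspects listed above (a sketch only, not a prescription for the SDI security infrastructure), the following fragment fingerprints a data object and binds the digest and source identification into a provenance record protected by a keyed MAC. The record layout, identifiers and key handling are our own assumptions.

    import hashlib
    import hmac
    import json

    def sha256_of_file(path: str) -> str:
        """Integrity: content fingerprint of a data object."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def provenance_record(path: str, source: str, secret: bytes) -> dict:
        """Authenticity: bind data identity, source and digest with a keyed MAC."""
        record = {
            "data_id": path,                 # identification of the data
            "source": source,                # identification of the source
            "sha256": sha256_of_file(path),  # integrity of the content
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["mac"] = hmac.new(secret, payload, hashlib.sha256).hexdigest()
        return record

    # Hypothetical usage; the MAC key would be managed by the SDI security services:
    # provenance_record("raw_run_0421.dat", "instrument:LOFAR-CS002", b"shared-secret")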
III. GENERAL REQUIREMENTS TO BIG DATA E-SCIENCE INFRASTRUCTURE

A. Paradigm change in Big Data e-Science

Big Data Science is becoming a new technology driver and requires re-thinking a number of infrastructure components, solutions and processes to address the following general challenges [2, 3]:
- Exponential growth of the data volume produced by different research instruments and/or collected from sensors
- The need to consolidate e-Infrastructures as persistent research platforms to ensure research continuity and cross-disciplinary collaboration, and to deliver/offer persistent services, with an adequate governance model.

The recent advancements in general ICT and Big Data technologies facilitate a paradigm change in modern e-Science that is characterized by the following features:
- Automation of all e-Science processes including data collection, storing, classification, indexing and other components of general data curation and provenance.
- Transformation of all processes, events and products into digital form by means of multi-dimensional, multi-faceted measurements, monitoring and control; digitising existing artifacts and other content.
- Possibility to re-use the initial and published research data, with possible data re-purposing for secondary research.
- Global data availability and access over the network for cooperative groups of researchers, including wide public access to scientific data.
- Existence of the necessary infrastructure components and management tools that allow fast composition, adaptation and provisioning of infrastructures and services on demand for specific research projects and tasks.
- Advanced security and access control technologies that ensure secure operation of complex research infrastructures and scientific instruments and allow creating a trusted secure environment for cooperating groups and individual researchers.
The future SDI should support the whole data lifecycle and explore the benefits of data storage/preservation, aggregation and provenance at a large scale and over a long/unlimited period of time. Importantly, this infrastructure must ensure data security (integrity, confidentiality, availability, and accountability) and data ownership protection. With the current need to process Big Data requiring powerful computation, there should be a possibility to enforce data/dataset policies requiring that the data be processed on trusted systems and/or comply with other requirements. Researchers must trust the SDI to process their data on SDI facilities and be assured that their stored research data are protected from non-authorised access. Privacy issues also arise from the distributed remote character of SDI, which can span multiple countries with different local policies. This should be provided by the Access Control and Accounting Infrastructure (ACAI), which is an important component of SDI [13, 14].

B. Research communities and specific SDI requirements

A short overview of some research infrastructures and communities, in particular the ones defined for the European Research Area (ERA) [3], allows us to analyse specific requirements for future SDIs to address Big Data challenges. Existing studies of European e-Infrastructures analyze scientific communities' practices and requirements; examples are those undertaken by the SIENA Project [15], the EIROforum Federated Identity Management Workshop [14], the European Grid Infrastructure (EGI) Strategy Report [16], and the UK Future Internet Strategy Group Report [17].

The High Energy Physics community represents a large number of researchers, unique expensive instruments, and huge amounts of data that are generated and need to be processed continuously. This community already has the operational Worldwide LHC Computing Grid (WLCG) [18] infrastructure to manage and access data, protect their integrity and support the whole scientific data lifecycle. WLCG development was an important step in the evolution of European e-Infrastructures, which currently serve multiple scientific communities in Europe and internationally. The EGI cooperation [16] manages European and worldwide infrastructure for HEP and other communities.

Material science, analytical and low-energy physics (proton, neutron, laser facilities) are characterized by short projects and experiments and consequently a highly dynamic user community. They require a highly dynamic supporting infrastructure and an advanced data management infrastructure to allow wide data access and distributed processing.

The Environmental and Earth science community and projects target regional/national and global problems. They collect huge amounts of data from land, sea, air and space and require an ever-increasing amount of storage and computing power. This SDI requires reliable fine-grained access control to huge data sets, enforcement of regional policies, and policy-based data filtering (data may contain national security related information), while tracking data use and keeping data integrity.

Biological and Medical Sciences (also defined as Life Sciences) have a general focus on health, drug development, new species identification, and new instrument development. They generate massive amounts of data and new demands for computing power, storage capacity, and network performance for distributed processing, data sharing and collaboration. Biomedical data (healthcare, clinical case data) are privacy-sensitive data and must be handled according to the European policy on Personal Data processing [19].

Social Science and Humanities communities and projects are characterized by multi-lateral and often global collaborations between researchers from all over the world, who need to be engaged into collaborative groups/communities and supported by a collaborative infrastructure to share data and discovery/research results and to cooperatively evaluate results. The current trend to digitize all currently collected physical artifacts will create in the near future a huge amount of data that must be widely and openly accessible.

C. General SDI Requirements

From the overview we just gave we can extract the following general infrastructure requirements for an SDI for emerging Big Data Science:
- Support for long running experiments and large data volumes generated at high speed
- Multi-tier inter-linked data distribution and replication
- On-demand infrastructure provisioning to support data sets and scientific workflows, and mobility of data-centric scientific applications
- Support of virtual scientist communities, addressing dynamic user group creation and management and federated identity management
- Trusted environment for data storage and processing
- Support for data integrity, confidentiality, accountability
- Policy binding to data to protect privacy, confidentiality and IPR
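To illustrate the last requirement, policy binding to data, the sketch below attaches a machine-readable policy to a data set and checks a processing platform against it before use. The attribute names and values are invented for illustration and do not correspond to any specific SDI policy language.

    # Hypothetical policy bound to a data set (all names and values are illustrative)
    dataset_policy = {
        "dataset_id": "doi:10.0000/example-dataset",
        "confidentiality": "restricted",    # privacy / IPR protection
        "allowed_regions": ["EU"],          # regional / legal constraints
        "trusted_platform_required": True,  # process only on trusted systems
        "retention_years": 10,              # long-term preservation
    }

    def may_process(policy: dict, platform: dict) -> bool:
        """Check a processing platform against the policy bound to the data."""
        region_ok = platform.get("region") in policy["allowed_regions"]
        trust_ok = platform.get("trusted", False) or not policy["trusted_platform_required"]
        return region_ok and trust_ok

    print(may_process(dataset_policy, {"region": "EU", "trusted": True}))  # True
    print(may_process(dataset_policy, {"region": "US", "trusted": True}))  # False

In a real SDI such a policy would be expressed in a standard policy language and evaluated by the access control infrastructure rather than by the application itself.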
IV. DATA MANAGEMENT IN BIG DATA SCIENCE

The emergence of computer-aided research methods is transforming the way research is done and scientific data are used. The following types of scientific data are defined [13]:
- Raw data collected from observation and from experiment (according to an initial research model)
- Structured data and datasets that went through data filtering and processing (supporting some particular formal model)
- Published data that support one or another scientific hypothesis, research result or statement
- Data linked to publications to support wide research consolidation, integration, and openness.

Once the data are published, it is essential to allow other scientists to validate and reproduce the data that they are interested in, and possibly to contribute new results. Capturing information about the processes involved in the transformation from raw data up until the generation of published data becomes an important aspect of scientific data management. Scientific data provenance thus becomes an issue that also needs to be taken into consideration by SDI providers [20].

Another aspect to take into consideration is guaranteeing the reusability of published data within the scientific community. Understanding the semantics of the published data becomes an important issue for reusability, and this has traditionally been done manually.
However, as we anticipate the unprecedented scale of published data that will be generated in Big Data Science, attaching clear data semantics becomes a necessary condition for the efficient reuse of published data. Learning from best practices in the semantic web community on how to provide reusable published data will be one of the considerations to be addressed by SDI.

Big Data are typically distributed both on the collection side and on the processing/access side: data need to be collected (sometimes in a time-sensitive way or with other environmental attributes), distributed and/or replicated. Linking distributed data is one of the problems to be addressed by SDI.

The European Commission's initiative to support Open Access to scientific data from publicly funded projects suggests the introduction of the following mechanisms to allow linking publications and data [21, 22]:
- PID – persistent data identifier
- ORCID – Open Researcher and Contributor Identifier [23].
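A minimal sketch of how such identifiers could link a publication to its underlying data and contributors is shown below; all identifier values are placeholders and the record layout is only illustrative.

    # Illustrative linkage record using persistent identifiers (placeholder values)
    publication_link = {
        "publication_doi": "doi:10.0000/placeholder-article",
        "datasets": [
            {"pid": "hdl:0000/raw-dataset-0001", "role": "raw data"},
            {"pid": "hdl:0000/structured-dataset-0001", "role": "structured data"},
        ],
        "contributors": [
            {"orcid": "0000-0000-0000-0000", "role": "data collection"},
        ],
    }

Resolvable identifiers of this kind allow later researchers to retrieve exactly the data (and data versions) on which a published result was based.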
The required new approach to data management and handling in e-Science is reflected in the Scientific Data Lifecycle Management (SDLM) model (see Figure 2), which we propose as a result of an analysis of the existing practices in different scientific communities. Our proposed model is compliant with the data lifecycle study results presented in [24].

The generic scientific data lifecycle includes a number of consecutive stages: research project or experiment planning; data collection; data processing; publishing research results; discussion and feedback; and archiving (or discarding).
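The stage sequence can be written down explicitly; the following encoding is purely illustrative and is not part of the SDLM specification.

    from enum import Enum, auto

    class SDLMStage(Enum):
        """Stages of the generic scientific data lifecycle described above."""
        PLANNING = auto()     # research project or experiment planning
        COLLECTION = auto()   # data collection
        PROCESSING = auto()   # data processing
        PUBLISHING = auto()   # publishing research results
        DISCUSSION = auto()   # discussion, feedback
        ARCHIVING = auto()    # archiving (or discarding)

    # Simple linear ordering; re-use of archived data would start a new cycle
    LIFECYCLE = list(SDLMStage)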
The proposed generic SDI architecture model defines the following layers:

Layer D1: Network infrastructure layer, represented by the general purpose Internet infrastructure and dedicated network infrastructure
Layer D2: Datacenters and computing resources/facilities
Layer D3: Infrastructure virtualisation layer, represented by the Cloud/Grid infrastructure services and middleware supporting specialised scientific platform deployment and operation
Layer D4: (Shared) scientific platforms and instruments specific to different research areas
Layer D5: Federation and Policy layer that includes federation infrastructure components, including policy and collaborative user group support functionality
Layer D6: Scientific applications and user portals/clients

Note: the "D" prefix denotes relation to the data infrastructure.
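The layering can be summarised as an ordered stack from the network up to the applications; the short sketch below is just an illustrative encoding of the D1 to D6 layers, not an implementation of the architecture.

    # Illustrative encoding of the SDI architecture layers (bottom to top)
    SDI_LAYERS = [
        ("D1", "Network infrastructure (general purpose Internet and dedicated networks)"),
        ("D2", "Datacenters and computing resources/facilities"),
        ("D3", "Infrastructure virtualisation (Cloud/Grid services and middleware)"),
        ("D4", "(Shared) scientific platforms and instruments per research area"),
        ("D5", "Federation and policy (federation components, user group support)"),
        ("D6", "Scientific applications and user portals/clients"),
    ]

    for tag, description in SDI_LAYERS:
        print(f"{tag}: {description}")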