


Addressing Big Data Issues in Scientific Data Infrastructure

Yuri Demchenko, Paola Grosso, Cees de Laat
System and Network Engineering Group
University of Amsterdam
Amsterdam, The Netherlands
e-mail: {y.demchenko, p.grosso, C.T.A.M.deLaat}@uva.nl

Peter Membrey
Hong Kong Polytechnic University
Hong Kong SAR, China
e-mail: [email protected]

Abstract—Big Data are becoming a new technology focus both in science and in industry. This paper discusses the challenges that are imposed by Big Data on the modern and future Scientific Data Infrastructure (SDI). The paper discusses the nature and definition of Big Data, including such features as Volume, Velocity, Variety, Value and Veracity. The paper refers to different scientific communities to define requirements on data management, access control and security. The paper introduces the Scientific Data Lifecycle Management (SDLM) model that includes all the major stages and reflects the specifics of data management in modern e-Science. The paper proposes the generic SDI architecture model that provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. The paper explains how the proposed SDLM and SDI models can be naturally implemented using the modern cloud based infrastructure services provisioning model, and suggests the major infrastructure components for Big Data.

Keywords—Big Data Science, Scientific Data Infrastructure (SDI), Scientific Data Lifecycle Management (SDLM), Cloud Infrastructure Service, Big Data Infrastructure.

I. INTRODUCTION

Big Data technologies are becoming a current focus and a new "buzz-word" both in science and in industry. The emergence of Big Data or data centric technologies indicates the beginning of a new form of continuous technology advancement that is characterized by overlapping technology waves related to different aspects of human activity, from production and consumption to collaboration and general social activity. In this context data intensive science plays a key role.

Big Data are becoming related to almost all aspects of human activity, from simply recording events to research, design, production and the delivery of digital services or products to the final consumer. Current technologies such as Cloud Computing and ubiquitous network connectivity provide a platform for automation of all processes in data collection, storing, processing and visualization.

Modern e-Science infrastructures allow targeting new large scale problems whose solution was not possible before, e.g. genome, climate, global warming. e-Science typically produces a huge amount of data that need to be supported by a new type of e-Infrastructure capable of storing, distributing, processing, preserving, and curating these data [1, 2]; we refer to these new infrastructures as Scientific Data e-Infrastructure (SDI).

In e-Science, the scientific data are complex multifaceted objects with complex internal relations; they are becoming an infrastructure of their own and need to be supported by corresponding physical or logical infrastructures to store, access and manage these data.

The emerging SDI should allow different groups of researchers to work on the same data sets, build their own (virtual) research and collaborative environments, safely store intermediate results, and later share the discovered results. New data provenance, security and access control mechanisms and tools will allow researchers to link their scientific results with the initial data (sets) and intermediate data to allow future re-use/re-purposing of data, e.g. with improved research techniques and tools.

This paper analyses the new challenges imposed on modern e-Science infrastructures by the emerging Big Data technologies; it proposes a general approach and architecture solutions that constitute a new Scientific Data Lifecycle Management (SDLM) model and the generic SDI architecture model that provides a basis for the interoperability and integration of heterogeneous SDI components, in particular based on cloud infrastructure technologies.

This paper is primarily focused on SDI; however, it also analyses the nature of Big Data in both e-Science and industry, analyses their commonalities and differences, and discusses possible cross-fertilisation between the two domains.

This paper continues the authors' work on defining the Big Data infrastructure for e-Science, initially presented in [3], and significantly extends it with new results and a wider scope to investigate the relations between Big Data technologies in e-Science and industry. With a long tradition of working with constantly increasing volumes of data, modern science can offer industry its scientific analysis methods, while industry can bring Big Data technologies and tools to the wider public.

The paper is organised as follows. Section II looks into the Big Data definition and the nature of Big Data in industry and science, analysing also the main drivers for Big Data technology development. Section III gives an overview of the main research communities and summarizes the requirements for the future SDI. Section IV discusses challenges to data management in Big Data Science, including the SDLM. Section V introduces the proposed e-SDI architecture model that is intended to answer the future Big Data challenges and requirements. Section VI discusses SDI implementation using cloud technologies. Section VII discusses security and trust related issues in handling data and summarises specific requirements for the access control infrastructure for modern and future SDI.
II. BIG DATA DEFINITION AND ANALYSIS

A. Big Data Nature in e-Science and Industry

Science has traditionally dealt with the challenge of handling large volumes of data in complex scientific research experiments. Scientific research typically includes the collection of data in passive observation or active experiments which aim to verify one or another scientific hypothesis. Scientific research and discovery methods are typically based on an initial hypothesis and a model which can be refined based on the collected data. The refined model may lead to a new, more advanced and precise experiment and/or re-evaluation of the previous data. Another distinctive feature of modern scientific research is that it requires wide cooperation between researchers to tackle complex problems and run complex scientific instruments.

In industry, private companies will not share data or expertise. When dealing with data, companies will always intend to keep control over their information assets. They may use shared third party facilities, like clouds, but special measures need to be taken to ensure data protection, including data sanitization. It might also be the case that companies can use shared facilities only for proof of concept and do production data processing at private facilities. In this respect, we need to accept that science and industry cannot be done in the same way, and consequently this will be reflected in the way they can interact and in how the Big Data infrastructure and tools can be built.

With the proliferation of digital technologies into all aspects of business activities and the emerging Big Data technologies, industry is entering a new playground where it needs to use scientific methods to benefit from the possibility to collect and mine data for desirable information, such as market prediction, customer behavior predictions, social group activity predictions, etc.

A number of discussions and blog articles [4, 5, 6] suggest that the Big Data technologies need to adopt scientific discovery methods that include iterative model improvement, collection of improved data, and re-use of the collected data with the improved model.

We can quote here a blog article by Mike Gualtieri from Forrester [7, 8, 9]: "Firms increasingly realize that [big data] must use predictive and descriptive analytics to find nonobvious information to discover value in the data. Advanced analytics uses advanced statistical, data mining and machine learning algorithms to dig deeper to find patterns that you can't see using traditional BI (Business Intelligence) tools, simple queries, or rules."

B. 5 Vs of Big Data

Although "Big Data" has become a new buzz-word, there is no consistent definition of Big Data, nor a detailed analysis of this new emerging technology. Most discussions now take place in the blogosphere, where however the most significant features and incentives of Big Data have been identified and have become commonly accepted. In this section we attempt to summarise the available definitions and propose a consolidated view on the generic Big Data features that would help us to define the requirements for a supporting Big Data infrastructure, in particular the Scientific Data Infrastructure.

As a starting point, we can refer to the simple definition given in [10]: "Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques."

A related definition of data-intensive science is given in the book "The Fourth Paradigm: Data-Intensive Scientific Discovery" by the computer scientist Jim Gray [11]: "The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration."

In a number of discussion blogposts and articles, Big Data are attributed such characteristics as Volume, Velocity, and Variety, called the "3 Vs of Big Data". Based on our analysis, and concurring with some other articles [5, 6, 12], we propose a wider definition of Big Data as 5 Vs: Volume, Velocity, Variety, and additionally Value and Veracity.

Figure 1 below illustrates the features related to the 5 Vs, which we analyse below.

Figure 1. 5 Vs of Big Data

1) Volume

Volume is the most important and distinctive feature of Big Data, which imposes additional and specific requirements on all the traditional technologies and tools currently used.

In e-Science, the growth of data volume is caused by advancements in both scientific instruments and SDI. In many areas the trend is actually to include data collection from all observed events, activities and sensors, which has become possible and is important for social activities and social sciences.

Big Data volume includes such features as size, scale, amount and dimension for tera- and exascale data, recorded either from data rich processes or collected from many transactions and stored in individual files or databases; all of this needs to be accessible, searchable, processable and manageable.

Two examples from e-Science illustrate the different characters of data and the different processing requirements:
The Large Hadron Collider (LHC) produces on average 5 PB of data a month, generated in a number of short collisions that make them unique events. The collected data are filtered, stored and extensively searched for single events that may confirm a scientific hypothesis.

LOFAR (Low Frequency Array) is a radio telescope that collects about 5 PB every hour; however, the data are processed by a correlator and only the correlated data are stored.

In industry, global service providers such as Google, Facebook and Twitter are producing, analyzing and storing data in huge amounts as part of their regular activity/production services. Although some of their tools and processes are proprietary, they actually prove the feasibility of solving Big Data problems at the global scale and significantly push the development of Open Source Big Data tools.

2) Velocity

Big Data are often generated at high speed, including data generated by arrays of sensors or multiple events, and need to be processed in real-time, near real-time, in batch, or as streams (as in the case of visualisation).

As an example, the LHC ATLAS detector [http://atlas.ch/] uses about 80 readout channels and collects up to 1 PB of unfiltered data per second, which is reduced to approx. 100 MB per second. This corresponds to recording up to 40 million collision events per second.

Industry can also provide numerous examples where data registration, processing or visualization impose similar challenges.
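To put such rates into perspective, the following back-of-the-envelope sketch works only with the figures quoted above (decimal units are assumed; the calculation itself is an illustration added here, not part of the original analysis):

```python
# Illustrative arithmetic on the detector data rates quoted in the text.
PB = 10**15   # bytes, decimal petabyte
MB = 10**6    # bytes, decimal megabyte

unfiltered_rate = 1 * PB      # ATLAS: ~1 PB of unfiltered data per second
stored_rate = 100 * MB        # reduced to ~100 MB per second after online filtering

reduction_factor = unfiltered_rate / stored_rate
stored_per_year_pb = stored_rate * 3600 * 24 * 365 / PB  # if recording ran continuously

print(f"online data reduction factor: ~1:{reduction_factor:,.0f}")   # ~1:10,000,000
print(f"stored data per year at 100 MB/s: ~{stored_per_year_pb:.1f} PB")  # ~3.2 PB
```

Even after a reduction of roughly seven orders of magnitude, the retained stream still accumulates petabytes per year, which is the storage and curation load the SDI has to absorb.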
3) Variety

Variety deals with the complexity of Big Data and of the information and semantic models behind these data. This results in data being collected as structured, unstructured, semi-structured, and mixed data. Data variety imposes new requirements on data storage and database design, which should support dynamic adaptation to the data format, in particular scaling up and down.

Data variety will in particular increase when biological, human and societal systems become a subject of closer research and monitoring. An example of the latter is the urban environment, which requires operating, monitoring and evolving numerous processes, individuals and associations. Adopting data technologies in traditionally non-computer oriented areas such as psychology and behavior research, history and archeology will generate especially rich data sets.

4) Value

Value is an important feature of the data, defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. Data value will depend on the events or processes they represent, such as stochastic, probabilistic, regular or random. Depending on this, requirements may be imposed to collect all data, to store them for a longer period (for some possible event of interest), etc. In this respect data value is closely related to data volume and variety.

5) Veracity

The veracity dimension of Big Data includes two aspects: data consistency (or certainty), which can be defined by their statistical reliability; and data trustworthiness, which is defined by a number of factors including the data origin and the collection and processing methods, including trusted infrastructure and facilities.

Big Data veracity ensures that the data used are trusted, authentic and protected from unauthorised access and modification. The data must be secured during their whole lifecycle, from collection from trusted sources to processing on trusted compute facilities and storage on protected and trusted storage facilities.

The following aspects need to be defined and addressed to ensure data veracity:
• Integrity of data and linked data (e.g., for complex hierarchical data, distributed data)
• Data authenticity and (trusted) origin
• Identification of both data and source
• Computer and storage platform trustworthiness
• Availability and timeliness
• Accountability and Reputation

Data veracity relies entirely on the security infrastructure deployed and available from the Big Data infrastructure.
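To make the first three aspects above concrete, a minimal sketch of how a dataset could be fingerprinted and bound to its declared origin at ingest time is given below. The record layout, field names and identifiers are hypothetical and are only meant to illustrate the integrity, authenticity and source-identification requirements, not a prescribed SDI interface.

```python
import hashlib
import json
from datetime import datetime, timezone

def register_dataset(path: str, source_id: str, collector_orcid: str) -> dict:
    """Create a minimal ingest record binding a data file to its declared origin.

    The SHA-256 digest supports later integrity checks; the source and collector
    identifiers support origin identification and accountability.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)

    return {
        "data_object": path,
        "sha256": sha256.hexdigest(),        # integrity of the stored data
        "source": source_id,                 # (trusted) origin, e.g. an instrument
        "collected_by": collector_orcid,     # accountability
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage:
# record = register_dataset("run-2013-05/eventset-0001.dat",
#                           source_id="urn:sensor:lofar:station-CS002",
#                           collector_orcid="0000-0002-XXXX-XXXX")
# print(json.dumps(record, indent=2))
```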
III. GENERAL REQUIREMENTS TO BIG DATA E-SCIENCE INFRASTRUCTURE

A. Paradigm change in Big Data e-Science

Big Data Science is becoming a new technology driver and requires re-thinking a number of infrastructure components, solutions and processes to address the following general challenges [2, 3]:
• Exponential growth of the data volume produced by different research instruments and/or collected from sensors
• The need to consolidate e-Infrastructures as persistent research platforms to ensure research continuity and cross-disciplinary collaboration, and to deliver/offer persistent services, with an adequate governance model.

The recent advancements in general ICT and Big Data technologies facilitate the paradigm change in modern e-Science that is characterized by the following features:
• Automation of all e-Science processes including data collection, storing, classification, indexing and other components of general data curation and provenance.
• Transformation of all processes, events and products into digital form by means of multi-dimensional, multi-faceted measurements, monitoring and control; digitising existing artifacts and other content.
• Possibility to re-use the initial and published research data, with possible data re-purposing for secondary research.
• Global data availability and access over the network for cooperative groups of researchers, including wide public access to scientific data.
• Existence of the necessary infrastructure components and management tools that allow fast composition, adaptation and provisioning of infrastructures and services on demand for specific research projects and tasks.
• Advanced security and access control technologies that ensure secure operation of the complex research infrastructures and scientific instruments and allow creating a trusted secure environment for cooperating groups and individual researchers.
The future SDI should support the whole data lifecycle and explore the benefits of data storage/preservation, aggregation and provenance at a large scale and over a long/unlimited period of time. Importantly, this infrastructure must ensure data security (integrity, confidentiality, availability, and accountability) and data ownership protection. With the current need to process big data that require powerful computation, there should be a possibility to enforce a data/dataset policy requiring that they be processed on trusted systems and/or comply with other requirements. Researchers must trust the SDI to process their data on SDI facilities and be assured that their stored research data are protected from non-authorised access. Privacy issues also arise from the distributed, remote character of SDI, which can span multiple countries with different local policies. This should be provided by the Access Control and Accounting Infrastructure (ACAI), which is an important component of SDI [13, 14].

B. Research communities and specific SDI requirements

A short overview of some research infrastructures and communities, in particular those defined for the European Research Area (ERA) [3], allows us to analyse specific requirements for future SDIs to address Big Data challenges. Existing studies of European e-Infrastructures analyze the scientific communities' practices and requirements; examples are those undertaken by the SIENA Project [15], the EIROforum Federated Identity Management Workshop [14], the European Grid Infrastructure (EGI) Strategy Report [16], and the UK Future Internet Strategy Group Report [17].

The High Energy Physics community represents a large number of researchers, unique expensive instruments, and a huge amount of data that are generated and need to be processed continuously. This community already has the operational Worldwide Large Hadron Collider Grid (WLCG) [18] infrastructure to manage and access data, protect their integrity and support the whole scientific data lifecycle. The WLCG development was an important step in the evolution of the European e-Infrastructures that currently serve multiple scientific communities in Europe and internationally. The EGI cooperation [16] manages the European and worldwide infrastructure for HEP and other communities.

Material science, analytical and low energy physics (proton, neutron, laser facilities) is characterized by short projects and experiments and consequently a highly dynamic user community. It requires a highly dynamic supporting infrastructure and an advanced data management infrastructure to allow wide data access and distributed processing.

The Environmental and Earth science community and projects target regional/national and global problems. They collect huge amounts of data from land, sea, air and space and require ever increasing amounts of storage and computing power. This SDI requires reliable fine-grained access control to huge data sets, enforcement of regional issues, and policy based data filtering (data may contain national security related information), while tracking data use and keeping data integrity.

Biological and Medical Sciences (also defined as Life Sciences) have a general focus on health, drug development, new species identification, and new instrument development. They generate massive amounts of data and new demands for computing power, storage capacity, and network performance for distributed processing, data sharing and collaboration. Biomedical data (healthcare, clinical case data) are privacy sensitive and must be handled according to the European policy on Personal Data processing [19].

Social Science and Humanities communities and projects are characterized by multi-lateral and often global collaborations between researchers from all over the world, who need to be engaged in collaborative groups/communities and supported by a collaborative infrastructure to share data and discovery/research results and to cooperatively evaluate results. The current trend to digitize all currently collected physical artifacts will create in the near future a huge amount of data that must be widely and openly accessible.

C. General SDI Requirements

From the overview we just gave, we can extract the following general infrastructure requirements for an SDI for emerging Big Data Science:
• Support for long running experiments and large data volumes generated at high speed
• Multi-tier inter-linked data distribution and replication
• On-demand infrastructure provisioning to support data sets and scientific workflows, and mobility of data-centric scientific applications
• Support of virtual scientist communities, addressing dynamic user group creation and management, and federated identity management
• Trusted environment for data storage and processing
• Support for data integrity, confidentiality, accountability
• Policy binding to data to protect privacy, confidentiality and IPR

IV. DATA MANAGEMENT IN BIG DATA SCIENCE

The emergence of computer aided research methods is transforming the way research is done and scientific data are used. The following types of scientific data are defined [13]:
• Raw data collected from observation and from experiment (according to an initial research model)
• Structured data and datasets that went through data filtering and processing (supporting some particular formal model)
• Published data that supports one or another scientific hypothesis, research result or statement
• Data linked to publications to support wide research consolidation, integration, and openness.

Once the data are published, it is essential to allow other scientists to validate and reproduce the data that they are interested in, and possibly to contribute new results. Capturing information about the processes involved in the transformation from raw data up until the generation of published data becomes an important aspect of scientific data management. Scientific data provenance becomes an issue that also needs to be taken into consideration by SDI providers [20].

Another aspect to take into consideration is guaranteeing the reusability of published data within the scientific community. Understanding the semantics of the published data becomes an important issue for reusability, and this has traditionally been done manually.
However, as we anticipate an unprecedented scale of published data to be generated in Big Data Science, attaching clear data semantics becomes a necessary condition for the efficient reuse of published data. Learning from best practices in the semantic web community on how to provide reusable published data will be one of the considerations to be addressed by SDI.

Big data are typically distributed both on the collection side and on the processing/access side: data need to be collected (sometimes in a time sensitive way or with other environmental attributes), distributed and/or replicated. Linking distributed data is one of the problems to be addressed by SDI.

The European Commission's initiative to support Open Access to scientific data from publicly funded projects suggests the introduction of the following mechanisms to allow linking publications and data [21, 22]:
• PID – persistent data ID
• ORCID – Open Researcher and Contributor Identifier [23].

The required new approach to data management and handling in e-Science is reflected in the Scientific Data Lifecycle Management (SDLM) model (see Figure 2), which we propose as a result of an analysis of the existing practices in different scientific communities. Our proposed model is compliant with the data lifecycle study results presented in [24].

The generic scientific data lifecycle includes a number of consequent stages: research project or experiment planning; data collection; data processing; publishing research results; discussion and feedback; archiving (or discarding).

Figure 2. Scientific Data Lifecycle Management in e-Science

The new SDLM requires data storage and preservation at all stages, which should allow data re-use/re-purposing and secondary research on the processed data and published results. However, this is possible only if full data identification, cross-referencing and linkage are implemented in SDI. Data integrity, access control and accountability must be supported during the whole data lifecycle. Data curation is an important component of the discussed SDLM and must also be done in a secure and trustworthy way.

Data security and access control must be supported for scientific data throughout their lifecycle: data acquisition (experimental data), initial data filtering, specialist processing, research data storage and secondary data mining, and data and research information archiving.
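The following minimal sketch shows how a dataset record could carry a persistent identifier, the contributor's ORCID, its current lifecycle stage and its linkage to source data and publications. The stage names follow the lifecycle listed above; the record layout, helper names and identifier values are hypothetical and are not prescribed by the SDLM model.

```python
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    # Stages of the generic scientific data lifecycle described above
    PLANNING = "planning"
    COLLECTION = "data collection"
    PROCESSING = "data processing"
    PUBLICATION = "publishing research results"
    DISCUSSION = "discussion, feedback"
    ARCHIVING = "archiving (or discarding)"

@dataclass
class DatasetRecord:
    pid: str                       # persistent data identifier (PID)
    title: str
    contributor_orcid: str         # ORCID of the responsible researcher
    stage: Stage = Stage.PLANNING
    linked_publications: list = field(default_factory=list)  # e.g. DOIs
    derived_from: list = field(default_factory=list)         # PIDs of source datasets

    def advance(self, new_stage: Stage) -> None:
        """Record a lifecycle transition (a real SDLM service would also log provenance)."""
        self.stage = new_stage

# Hypothetical usage: link a processed dataset to its raw source and a publication.
raw = DatasetRecord(pid="hdl:21.XXXX/raw-0001", title="Raw sensor readings",
                    contributor_orcid="0000-0002-XXXX-XXXX")
processed = DatasetRecord(pid="hdl:21.XXXX/proc-0001", title="Filtered readings",
                          contributor_orcid="0000-0002-XXXX-XXXX",
                          derived_from=[raw.pid])
processed.advance(Stage.PROCESSING)
processed.linked_publications.append("doi:10.XXXX/example")
```

Keeping the stage, identifiers and linkage in one record is what enables the cross-referencing, re-use and accountability requirements stated above to be enforced by infrastructure services rather than by manual convention.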
V. PROPOSED SDI ARCHITECTURE MODEL

We also propose the SDI Architecture for e-Science (e-SDI), illustrated in Figure 3. This model contains the following layers:

Layer D1: Network infrastructure layer, represented by the general purpose Internet infrastructure and dedicated network infrastructure
Layer D2: Datacenters and computing resources/facilities
Layer D3: Infrastructure virtualisation layer, represented by the Cloud/Grid infrastructure services and middleware supporting the deployment and operation of specialised scientific platforms
Layer D4: (Shared) scientific platforms and instruments specific to different research areas
Layer D5: Federation and Policy layer, which includes federation infrastructure components, including policy and collaborative user group support functionality
Layer D6: Scientific applications and user portals/clients

Note: the "D" prefix denotes the relation to data infrastructure.

Figure 3. The proposed SDI architecture model

We also define three cross-layer planes: Operational Support and Management System; Security plane; and Metadata and Lifecycle Management.

The dynamic character of SDI and its support of distributed multi-faceted communities are guaranteed by the dedicated layers: D3, the Infrastructure Virtualisation layer, which typically uses modern cloud technologies; and D5, the Federation and Policy layer, which incorporates the related federated infrastructure management and access technologies [13, 25, 26]. Introducing the Federation and Policy layer reflects the current practice in building and managing complex SDIs (and also enterprise infrastructures) and allows independently managed infrastructures to share resources and support inter-organisational cooperation.

The network infrastructure is presented as a separate lower layer in e-SDI. Network aspects in Big Data are becoming even more important than they were, e.g., with Computer Grids and clouds. Although the dilemma of moving data to computing facilities, or vice versa moving computing to the data location, can be solved in some particular cases, processing highly distributed data on MPP (Massively Parallel Processing) infrastructures will require a special design of the internal MPP network infrastructure. The authors refer to their long-time research on high speed optical networking and their experience in building optical network infrastructure for e-Science [27, 28].
VI. CLOUD BASED INFRASTRUCTURE SERVICES FOR SDI

Figure 4 illustrates the typical e-Science or enterprise collaborative infrastructure that is created on demand and includes enterprise proprietary and cloud based computing and storage resources, instruments, a control and monitoring system, a visualization system, and users represented by user clients, typically residing in real or virtual campuses.

The main goal of the enterprise or scientific infrastructure is to support the enterprise or scientific workflow and the operational procedures related to process monitoring and data processing. Cloud technologies simplify the building of such infrastructure and its provisioning on demand. Figure 4 illustrates how an example enterprise or scientific workflow can be mapped to cloud based services and later deployed and operated as an instant inter-cloud infrastructure. It contains cloud infrastructure segments IaaS (VR3-VR5) and PaaS (VR6, VR7), separate virtualised resources or services (VR1, VR2), two interacting campuses A and B, and the network infrastructure interconnecting them, which in many cases may need to use dedicated network links for guaranteed performance.

Figure 4. From scientific workflow to cloud based infrastructure.

Efficient operation of such infrastructure will require both overall infrastructure management and the ability of individual services and infrastructure segments to interact between themselves. This task is typically out of the scope of the existing cloud service provider models but will be required to support the perceived benefits of the future e-SDI. These topics are the subject of other research we did on the InterCloud Architecture Framework [29, 30].

Besides the general cloud based infrastructure services (storage, compute, infrastructure/VM management), the following specific applications and services will be required to support Big Data and other data centric applications [31]:
• Cluster services
• Hadoop related services and tools
• Specialist data analytics tools (logs, events, data mining, etc.)
• Databases/Servers SQL, NoSQL
• MPP (Massively Parallel Processing) databases
• Big Data Management tools
• Registries, indexing/search, semantics, namespaces
• Security infrastructure (access control, policy enforcement, confidentiality, trust, availability, privacy)
• Collaborative environment (groups management)

Big Data analytics tools are currently offered by the major cloud service providers, such as Amazon Elastic MapReduce and Dynamo [32], Microsoft Azure HDInsight [33], and IBM Big Data Analytics [34]. Scalable Hadoop and data analytics tools and services are offered by a few companies that position themselves as Big Data companies, such as Cloudera [35] and a few others [36].
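As a minimal illustration of the kind of Hadoop related tooling listed above, the sketch below shows a word-count style job written for Hadoop Streaming, which allows the map and reduce steps to be expressed as plain scripts reading stdin and writing stdout. It is a generic example added for clarity, not taken from the paper; the input/output paths and jar location in the comment are hypothetical.

```python
#!/usr/bin/env python3
# wordcount.py: mapper and reducer for Hadoop Streaming, combined for brevity.
# Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar -input /data/logs -output /data/counts \
#       -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce"
import sys

def mapper():
    # Emit one "<word>\t1" line per token read from the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a given word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```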
VII. SECURITY INFRASTRUCTURE FOR BIG DATA

A. Security and Trust in Cloud based Infrastructure

Ensuring data veracity in Big Data infrastructure and applications requires a deeper analysis of all factors affecting data security and trustworthiness during the whole data lifecycle. Figure 5 illustrates the main actors and their relations when processing data on a remote system. The user/customer and the service provider are the two actors concerned with their own data/content security and with each other's system/platform trustworthiness: the user wants to be sure that their data are secure when processed or stored on the remote system.

Figure 5. Security and Trust in Data Services and Infrastructure.

Figure 5 illustrates the complexity of trust and security relations even in the simple use case of direct user/provider interaction. In clouds, the data security and trust model needs to be extended to a distributed, multi-domain and multi-provider environment.

In the general case of a multi-provider and multi-tenant e-Science cooperative environment, the e-SDI security infrastructure should support on-demand created and dynamically configured user groups and associations, potentially re-using the existing experience in managing Virtual Organisations (VO) and VO-based access control in Computer Grids [37, 38].

Data centric security models, when used in a generically distributed and multi-provider e-SDI environment, will require policy binding to data and fine grained data access policies that should allow flexible policy definition based on the semantic data model. Based on the authors' experience, the XACML (eXtensible Access Control Mark-up Language) policy language can provide a good basis for such functionality [39, 40]. However, support of the data lifecycle and the related provenance information will require additional research on policy definition and the underlying trust management models.
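A minimal sketch of what "policy binding to data" can mean in practice is given below. It mimics the attribute-based model used by XACML (subject, resource, action, permit/deny) in plain Python rather than in actual XACML syntax; the class, attribute and platform names are hypothetical, and a real deployment would express the rules in XACML and evaluate them with a policy decision point.

```python
from dataclasses import dataclass

@dataclass
class BoundPolicy:
    """An access policy attached to a dataset and evaluated against subject attributes."""
    allowed_groups: set       # e.g. VO or project membership
    allowed_actions: set      # e.g. {"read", "process"}
    trusted_platforms: set    # platforms the data may be processed on

    def decide(self, subject: dict, action: str, platform: str) -> str:
        # Simple deny-by-default evaluation over subject, action and platform attributes.
        if (subject.get("group") in self.allowed_groups
                and action in self.allowed_actions
                and platform in self.trusted_platforms):
            return "Permit"
        return "Deny"

@dataclass
class Dataset:
    pid: str
    policy: BoundPolicy       # the policy travels with the data wherever it is replicated

# Hypothetical usage:
ds = Dataset(
    pid="hdl:21.XXXX/proc-0001",
    policy=BoundPolicy(allowed_groups={"vo.cosmo-sim"},
                       allowed_actions={"read", "process"},
                       trusted_platforms={"certified-cluster-A"}),
)
print(ds.policy.decide({"group": "vo.cosmo-sim"}, "process", "certified-cluster-A"))  # Permit
print(ds.policy.decide({"group": "vo.cosmo-sim"}, "process", "public-cloud-X"))       # Deny
```

Because the policy is stored with the dataset rather than in a single provider's access control list, the same decision logic can be enforced consistently across the multi-provider, multi-domain environment discussed above.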
B. General Requirements to Access Control Infrastructure

To support secure data processing, the future SDI should be supported by a corresponding Access Control and Accounting Infrastructure (ACAI) that would ensure normal infrastructure operation, assets and information protection, and allow user identification/authentication and policy enforcement in a distributed multi-organisational environment.

Moving to Open Access [21] may require a partial change of the business practices of the currently existing scientific information repositories and libraries, and consequently the future ACAI should allow such a transition, fine grained access control, and flexible policy definition and control.

Taking into account that the future SDI should support the whole data lifecycle and explore the benefits of data storage/preservation, aggregation and provenance at a large scale and over a long/unlimited period of time, the future ACAI should also support all stages of the data lifecycle, including policy attachment to data to ensure the persistency of data policy enforcement during continuous online and offline processes.

The required ACAI should support the following features of the future SDI:
• Empower researchers (and make them trust the infrastructure) to do their data processing on the shared facilities of large datacentres with guaranteed data and information security
• Motivate/enable researchers to share/open their research environment to other researchers by providing tools for the instantiation of customised pre-configured infrastructures, to allow other researchers to work with existing or their own data sets.
• Protect data policy, ownership and linkage (with other data sets and newly produced scientific/research data) when providing (long term) data archiving. (Data preservation technologies should themselves ensure data readability and accessibility as technologies change.)

VIII. FUTURE RESEARCH AND DEVELOPMENT

Future research and development will include further work on the Big Data definition initially presented in this paper. At this stage we have tried to summarise and re-think some widely used definitions related to Big Data; further research will require a more formal approach and a taxonomy of the general Big Data use cases both in science and industry.

Although the currently proposed SDLM definition has been accepted as a European Commission Study recommendation [13], we plan to move the further definition of the related metadata, procedures and protocols to the Research Data Alliance (RDA) [41] community, recently established to coordinate standardisation in the area of research data.

As part of the general infrastructure research we will continue work on the infrastructure issues in Big Data, targeting a more detailed and technology oriented definition of SDI and of the related security infrastructure. Special attention will be given to defining the whole cycle of provisioning SDI services on demand, specifically tailored to support instant scientific workflows using cloud IaaS and PaaS platforms. This research will also be supported by the development of the corresponding Cloud and InterCloud architecture framework to support the Big Data e-Science processes and infrastructure operation.

ACKNOWLEDGMENT

This work was motivated and partly supported by the European Commission "Study on Authentication, Authorization and Accounting (AAA) Platforms for Scientific data/information Resources in Europe", which resulted in the report currently published as [13]. The authors value the wide discussions between the consortium members on the different aspects of the existing research infrastructures and AAA technologies, whose findings found further development in this paper. The proposed cloud based architecture for SDI is the outcome of the EU funded FP7 projects The Generalized Architecture for Dynamic Infrastructure Services (GEYSERS, FP7-ICT-248657) and GEANT (Grant Agreement No. 238875).

REFERENCES
[1] Global Research Data Infrastructures: Towards a 10-year vision for global research data infrastructures. Final Roadmap, March 2012. [online] http://www.grdi2020.eu/Repository/FileScaricati/6bdc07fb-b21d-4b90-81d4-d909fdb96b87.pdf
[2] Riding the wave: How Europe can gain from the rising tide of scientific data. Final report of the High Level Expert Group on Scientific Data. October 2010. [online] http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
[3] Y. Demchenko, Z. Zhao, P. Grosso, A. Wibisono, C. de Laat, Addressing Big Data Challenges for Scientific Data Infrastructure. The 4th IEEE Conf. on Cloud Computing Technologies and Science (CloudCom2012), 3-6 December 2012, Taipei, Taiwan. ISBN: 978-1-4673-4509-5
[4] Reflections on Big Data, Data Science and Related Subjects. Blog by Irving Wladawsky-Berger. [online] http://blog.irvingwb.com/blog/2013/01/reflections-on-big-data-data-science-and-related-subjects.html
[5] E. Dumbill, What is big data? An introduction to the big data landscape. [online] http://strata.oreilly.com/2012/01/what-is-big-data.html
[6] What is big data? IBM. [online] http://www-01.ibm.com/software/data/bigdata/
[7] Roundup of Big Data Pundits' Predictions for 2013. Blog post by David Pittman. January 18, 2013. [online] http://www.ibmbigdatahub.com/blog/roundup-big-data-pundits-predictions-2013
[8] Big Data prediction for 2013. Blog by Mike Gualtieri. [online] http://blogs.forrester.com/mike_gualtieri
[9] The Forrester Wave: Big Data Predictive Analytics Solutions, Q1 2013. Mike Gualtieri, January 13, 2013. [online] http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-LTEQDI
[10] The Big Data Long Tail. Blog post by Jason Bloomberg, January 17, 2013. [online] http://www.devx.com/blog/the-big-data-long-tail.html
[11] The Fourth Paradigm: Data-Intensive Scientific Discovery. Edited by Tony Hey, Stewart Tansley, and Kristin Tolle. Microsoft Corporation, October 2009. ISBN 978-0-9825442-0-4. [online] http://research.microsoft.com/en-us/collaboration/fourthparadigm/
[12] The 3Vs that define Big Data. Posted by Diya Soubra on July 5, 2012. [online] http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
[13] European Union. A Study on Authentication and Authorisation Platforms For Scientific Resources in Europe. Brussels: European Commission, 2012. Final Report. Internal identification SMART-Nr 2011/0056. [online] http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/aaa-study-final-report.pdf
[14] Federated Identity Management for Research Collaborations. Final version. Reference CERN-OPEN-2012-006. [online] https://cdsweb.cern.ch/record/1442597
[15] SIENA European Roadmap on Grid and Cloud Standards for e-Science and Beyond. SIENA Project report. [online] http://www.sienainitiative.eu/Repository/Filescaricati/8ee3587a-f255-4e5c-aed4-9c2dc7b626f6.pdf
[16] Seeking new horizons: EGI's role for 2020. [online] http://www.egi.eu/blog/2012/03/09/seeking_new_horizons_egis_role_for_2020.html
[17] Future Internet Report. UK Future Internet Strategy Group. May 2011. [online] https://connect.innovateuk.org/c/document_library/get_file?folderId=861750&name=DLFE-33761.pdf
[18] Worldwide Large Hadron Collider Grid (WLCG). [online] http://wlcg.web.cern.ch/
[19] European Data Protection Directive. [online] http://ec.europa.eu/justice/data-protection/index_en.htm
[20] D. Koop, et al., A Provenance-Based Infrastructure to Support the Life Cycle of Executable Papers, International Conference on Computational Science, ICCS 2011. [online] http://vgc.poly.edu/~juliana/pub/vistrails-executable-paper.pdf
[21] Open Access: Opportunities and Challenges. European Commission for UNESCO. [online] http://ec.europa.eu/research/science-society/document_library/pdf_06/open-access-handbook_en.pdf
[22] OpenAIRE – Open Access Infrastructure for Research in Europe. [online] http://www.openaire.eu/
[23] Open Researcher and Contributor ID. [online] http://about.orcid.org/
[24] Data Lifecycle Models and Concepts. [online] http://wgiss.ceos.org/dsig/whitepapers/Data%20Lifecycle%20Models%20and%20Concepts%20v8.docx
[25] EGI federated cloud task force. [online] http://www.egi.eu/infrastructure/cloud/cloudtaskforce.html
[26] eduGAIN – Federated access to network services and applications. [online] http://www.edugain.org
[27] L. Smarr, M. Brown, C. de Laat, Editorial: Special section: OptIPlanet – the OptIPuter global collaboratory. Future Generation Computer Systems 25 (2), 109-113.
[28] R. Grossman, Y. Gu, X. Hong, A. Antony, J. Blom, F. Dijkstra, and C. de Laat, Teraflows over Gigabit WANs with UDT, Journal of Future Generation Computer Systems, Elsevier Press, Volume 21, Number 4, 2005, pages 501-513.
[29] Y. Demchenko, C. Ngo, M. Makkes, R. Strijkers, C. de Laat, Defining Inter-Cloud Architecture for Interoperability and Integration. The 3rd Int'l Conf. on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2012), July 22-27, 2012, Nice, France.
[30] Cloud Reference Framework. Internet-Draft, version 0.4, December 27, 2012. [online] http://www.ietf.org/id/draft-khasnabish-cloud-reference-framework-04.txt
[31] A chart of the big data ecosystem, take 2. By Matt Turck. [online] http://mattturck.com/2012/10/15/a-chart-of-the-big-data-ecosystem-take-2/
[32] Amazon Big Data. [online] http://aws.amazon.com/big-data/
[33] Microsoft Azure Big Data. [online] http://www.windowsazure.com/en-us/home/scenarios/big-data/
[34] IBM Big Data Analytics. [online] http://www-01.ibm.com/software/data/infosphere/bigdata-analytics.html
[35] Cloudera Impala Big Data Platform. [online] http://www.cloudera.com/content/cloudera/en/home.html
[36] 10 hot big data startups to watch in 2013, 10 January 2013. [online] http://beautifuldata.net/2013/01/10-hot-big-data-startups-to-watch-in-2013/
[37] Y. Demchenko, C. de Laat, V. Ciaschini, VO-based dynamic security associations in collaborative grid environment. Collaborative Technologies and Systems, 2006. CTS 2006. International …
[38] Y. Demchenko, A. Wan, M. Cristea, C. de Laat, Authorisation infrastructure for on-demand network resource provisioning. Grid Computing, 2008. 9th IEEE/ACM International Conference on, pp. 95-103.
[39] Y. Demchenko, C. de Laat, L. Gommans, B. Oudenaarde, A. Tokmakoff, M. Snijders, Job-centric security model for open collaborative environment. Collaborative Technologies and Systems, 2005. Proceedings of the 2005 …
[40] Y. Demchenko, M. Cristea, C. de Laat, XACML policy profile for multidomain network resource provisioning and supporting authorisation infrastructure. Policies for Distributed Systems and Networks, 2009. POLICY 2009. IEEE …
[41] Research Data Alliance (RDA). [online] http://rd-alliance.org/
