Big Data: The Driver for Innovation in Databases

Big data has emerged as a new frontier that challenges traditional database and data warehousing technologies. As companies collect vast amounts of user data, ensuring security and privacy is important. Cloud computing is being used by companies to manage large data centers for applications like social networks and web search, but it is not mature enough for enterprise processes. The massive amounts of diverse data, while posing challenges, also provide opportunities to make data-driven decisions if scalable data management and analytics can be developed.

National Science Review, 1: 27–30, 2014
doi: 10.1093/nsr/nwt020
Advance access publication 6 January 2014

PERSPECTIVE
COMPUTER SCIENCE

Big data: the driver for innovation in databases

Bin Cui1,*, Hong Mei1 and Beng Chin Ooi2

INTRODUCTION

Advances in the technology frontier have resulted in major disruptions and transformations in enterprise-wide information technology infrastructures. For the past three decades, classical database management systems have maintained a feverish pace in realizing significant efficiencies in dealing with the vast amount of information that needs to be maintained to model the operational characteristics of large-scale enterprises. Database research and development advances have primarily been focused in the areas of advanced data models, declarative query languages, high-throughput transaction processing, database reliability, etc. In the intervening years, especially in the 1990s, data warehousing and data analysis emerged as a major research and technology frontier. In particular, it was realized that transactional information at the enterprise level can be collated and analyzed to enable data-centric decision making.
Database management systems gained significant prominence in the era of Web-based services and E-commerce deployments. In what is now considered the classical design, a typical web-service architecture encompasses database management systems (DBMSs) as the core tier that provisions services and applications, coupled with two critical IT components referred to as the Web servers and Application servers. In the context of Web- and Internet-enabled database services, one of the major research challenges that emerged was how to deal with a potentially unlimited number of users accessing the service over the Internet. It is now widely acknowledged that although the Web and Application server tiers in the three-tiered Web-based architecture can be scaled easily to handle a large number of users, the database tier becomes a scalability bottleneck since it cannot be easily scaled by deploying additional hardware or machines. Companies such as Google and Amazon have in fact abandoned traditional DBMS technology in favor of proprietary data stores referred to as key-value stores [1,2].

While enterprises are struggling with the problem of poor database scalability, a new challenge has emerged that further cripples the capability of modern IT infrastructures. This challenge has been labeled the 'big data' problem. In principle, while earlier DBMSs focused on modeling the operational characteristics of enterprises, big data systems are now expected to model vast amounts of heterogeneous and complex data. Classical approaches to data warehousing and data analysis are no longer viable for dealing with both the scale of the data and the sophisticated analysis that needs to be conducted, often in real time (e.g., online fraud detection). None of the commercial DBMS and data warehousing technologies provides an adequate solution in this regard, which is evident from the efforts led by companies such as Facebook, Google and Baidu to build proprietary solutions. Clearly, scalable data management and complex data analytics in the context of big data have emerged as a new research frontier for the foreseeable future.

As an orthogonal challenge in the context of 'big data', since enterprises maintain vast amounts of sensitive user interaction data for their clients, it is imperative that adequate mechanisms are provided to ensure the security and privacy of user data. Recently, an emerging technology referred to as cloud computing has been increasingly used in proprietary environments to manage large data centers. In particular, companies such as Amazon, Google, Yahoo and Microsoft have each developed their respective proprietary versions of cloud computing technology and have enjoyed unprecedented success. However, although cloud computing has demonstrated its superiority in the context of Web-based applications (e.g., social networking, Web search, E-commerce), it is not mature enough to facilitate the enterprise processes that are based on DBMSs.

APPLICATIONS AS DEMAND DRIVERS

Due to the wide adoption of computing technologies, data from different sources and in different formats are being collected at an unprecedented scale. This gives rise to the so-called 3V characteristics of big data: volume, velocity and variety. Although the massive data pose many challenges and invalidate earlier designs, they also provide many great opportunities; most of all, instead of making decisions based on small data sets or calibration, decisions can now be made based on the data itself. Below, we briefly examine some of the big data applications.

Social networking

Online social networks such as Facebook, LinkedIn and Twitter provide new platforms for social interaction at a worldwide scale. These new sources of data also allow for new kinds of data-analysis applications, e.g. understanding social behaviors at an aggregate scale. The enormous size of social networks allows unprecedented and new forms of analysis and brings new processing challenges. First, in many online social networks, data analysis makes use of the social graph. This requires new data-analysis algorithms that can cope with massive graphs with O(10^9) nodes. Existing graph algorithms are not designed to deal with graphs at such a scale and are mainly tailored for graphs residing in main memory. Second, the scale of interactions in the social network and the difficulty of clustering the data make it especially challenging to answer queries about the interactions in a timely manner.
Enterprise data management

Currently, in many enterprises, users independently source, model, manage and store data to support their own areas of responsibility and functionality. This is mainly due to two reasons. First, users have better domain knowledge and are thus able to customize the database to suit their own needs. Second, such a decentralized approach allows users to break the data requirements of the enterprise into smaller subsets and address these requirements using smaller, independent databases. However, such an uncoordinated data management approach can result in data conflicts and quality inconsistencies within the enterprise, making it difficult for users to trust the data when it is used for operations and reporting at a higher level within the enterprise.

Scientific applications

Much of science is now about experiments that create large amounts of data and about the subsequent analysis of that data. This is the challenge posed by Gray [3]: how to support the data-intensive scientific discovery paradigm. The scale of the data is significantly larger than that in most enterprise applications, e.g. science experiments can generate terabytes of data daily. This means that query processing techniques that rely on indexing may not be feasible simply because there is insufficient time to build indexes on the data. Clearly, controls over execution time are needed, but there has been little work in this direction. The result of a query could range from the empty set to a significant fraction of the database. However, although a very large result is a complete and correct answer, it might not be a very useful answer and may also take too long to compute. This suggests that, rather than returning large results, summarizing the data and applying sampling may be more appropriate.
Mobile computing

Smart phones and other connected mobile and embedded devices are becoming increasingly prevalent. This begets new conveniences for individuals to manage data, perform complex computations and obtain real-time information that aids their activities, thereby improving personal productivity. The key challenge will be how the mobile and cloud platforms can be integrated holistically into a single computing experience. Another new dimension is the use of crowdsourcing and real-time data mining to further enhance the quality of the real-time, location-sensitive information that is available to authorities, providers and end-users.

The sheer size of the data and the rate at which it is created present many challenges in data management and processing. As mentioned above, existing database technologies are not able to handle the challenges presented by big data. First, relational databases use well-defined schemas and require application data to fit into the relational paradigm of rows and columns. Unfortunately, a lot of data are unstructured. Data may be collected from various sources, such as search logs, click streams and crawled pages. They may have various formats, and programmers must preprocess them and load them into the database before performing any analysis. Second, it is hard for a relational database system to scale. Though expensive high-end servers are used to run parallel database systems, they are not sufficient for handling the volumes of data seen today. Once a database's upper capacity is exceeded, database engineers must redistribute the data across multiple databases to break it up. Third, it is difficult for relational database systems to deal effectively with complicated queries. The increasing demand for data exploration and knowledge discovery requires more complicated ad hoc queries in data analysis, which makes it difficult to build indexes and views in the databases based on requirements and assumptions, even for the most agile systems. Below, we describe two immediate challenges and opportunities.

Scalable and elastic data management

Scalable and elastic data management has been a great challenge to the database research community for more than 20 years, and different distributed database systems have been proposed to deal with large datasets. However, the scalability of these systems is still limited by some common problems in distributed environments, such as synchronization costs and node failures. Therefore, most of the existing cloud systems, such as BigTable and Cassandra, exploit different solutions to improve system scalability. The techniques adopted by such systems include: (a) a simple data model, (b) separation of metadata from application data and (c) relaxed consistency requirements. However, they introduce different sets of problems, such as lack of support for real-time updates, weak consistency and high latency of selective retrieval. It is therefore important to design a system architecture that can dynamically support the elastic requirements of applications for the storage and processing of multi-tenant data over a distributed cluster of commodity compute nodes. Both horizontal and vertical partitioning strategies could be employed to partition and distribute the data so as to achieve high performance for query processing and updates. To facilitate the efficient location of selective data without having to scan the whole database over all the compute nodes, efficient and light-weight indexing schemes should be designed. Based on the access methods and data distribution, efficient query processing strategies could then be designed.
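To make the partitioning and indexing ideas above concrete, the following minimal Python sketch (purely illustrative; the class and method names are ours and are not taken from BigTable, Cassandra or any other system) hash-partitions records "horizontally" across a set of nodes and keeps a light-weight secondary index so that a selective lookup does not have to scan every partition.

```python
import hashlib
from collections import defaultdict

class PartitionedStore:
    """Toy horizontally partitioned key-value store (illustrative sketch only)."""

    def __init__(self, num_nodes):
        # One dict per "node"; a real system would place these on separate machines.
        self.nodes = [dict() for _ in range(num_nodes)]
        # Light-weight secondary index: attribute value -> set of primary keys.
        self.secondary = defaultdict(set)

    def _node_for(self, key):
        # Hash partitioning spreads keys evenly across the nodes.
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, record, index_attr=None):
        self._node_for(key)[key] = record
        if index_attr is not None:
            self.secondary[record[index_attr]].add(key)

    def get(self, key):
        return self._node_for(key).get(key)

    def lookup_by_attr(self, value):
        # Selective retrieval goes through the index instead of scanning every node.
        return [self.get(k) for k in self.secondary.get(value, ())]

store = PartitionedStore(num_nodes=4)
store.put("user:1", {"name": "alice", "city": "Beijing"}, index_attr="city")
store.put("user:2", {"name": "bob", "city": "Singapore"}, index_attr="city")
print(store.lookup_by_attr("Beijing"))  # [{'name': 'alice', 'city': 'Beijing'}]
```

In a real elastic store the "nodes" would be separate machines and the index itself would be partitioned and replicated, but the routing and indexing logic follows the same pattern.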


Scalable data analytics

The widespread adoption of computing technologies has resulted in a large amount of data in a wide variety of forms. Besides the traditional business data that are structured, social media applications generate graph data; location-based services and urban sensing applications produce spatial–temporal data; and multimedia applications generate images and videos that are typically represented as high-dimensional features. There is therefore a need to study how best to represent and process these complex data structures efficiently, and to propose new paradigms of data analytics over different data types. Furthermore, user interaction and visualization are important for exploring large quantities of data of complex forms. Developing methods for interactive visual analytics is critical for successful scalable data analysis. To improve system usability, there is a need for a declarative programming model that is both communication-centric (optimizing inter-process communication) and data-centric (optimizing data processing and supporting large-scale concurrency).

Furthermore, a white paper [4] was recently created through a distributed conversation among more than 20 prominent researchers, which illustrated the challenges and opportunities of big data and discussed the research agenda in this field. The analysis of big data involves multiple distinct phases, as shown in Fig. 1. The major steps in big data analysis appear in the flow at the top of Fig. 1 and include acquisition, extraction, integration, analysis and interpretation; below them, the challenges introduced by big data are shown. The authors also discuss both what has already been done and what challenges remain in exploiting big data.

Figure 1. The big data analysis pipeline [4].

EMERGING TECHNOLOGIES

DBMSs have become a ubiquitous operational platform for managing huge amounts of business data. They have evolved over the last four decades and are now functionally rich. However, with the arrival of the big data era, these database systems have shown their deficiencies in handling big data.

Recently, a new distributed data-processing framework called MapReduce was proposed [5], whose fundamental idea is to simplify parallel processing using a distributed computing platform that offers only two interfaces: map and reduce. Programmers implement their own map and reduce functions, while the system is responsible for scheduling and synchronizing the map and reduce tasks. By defining the map and reduce functions, MapReduce applications are able to deal with much more complicated tasks than SQL queries. Unlike relational databases, MapReduce uses a much simpler data model and views data as key-value pairs. As a result, programmers are free to structure their data as they wish, with data parsing and loading deferred until analysis is conducted. MapReduce applications can handle data stored either in an unstructured file system or in a structured database.
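As an illustration of the two-interface programming model described above, here is a minimal Python sketch of a word count (our own illustrative code, not the API of [5] or of Hadoop): the user supplies only a map function and a reduce function, while the surrounding driver stands in for the framework's shuffling, grouping and scheduling.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum the counts collected for each word.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the framework: group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

lines = [(0, "big data drives innovation"), (1, "big data challenges databases")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'big': 2, 'challenges': 1, 'data': 2, 'databases': 1, 'drives': 1, 'innovation': 1}
```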
Meanwhile, the key-value data model used in MapReduce has led to the emergence and popularity of many key-value stores such as Bigtable [1]. As the key-value abstraction naturally allows horizontal partitioning, key-value stores can provide better scalability and availability than relational databases. MapReduce is being used increasingly in applications such as data mining, data analytics and scientific computation. Its wide adoption and success lie in its distinguishing features, including flexibility, scalability, efficiency and fault tolerance.

Along with the praise earned by its simplicity and flexibility, MapReduce is also criticized for its reduced functionality. Much effort has already been devoted to addressing these problems, yet active areas of research remain to be explored, such as SQL-like declarative language enhancements for ease of use, DBMS operator implementations for richer functionality, and performance improvements for iterative computation (for a detailed survey, please refer to [6]).

Generally speaking, MapReduce systems are good at complex analytics and extract-transform-load tasks at large scale, while parallel databases perform better at efficient querying of large data sets. Some researchers have attempted to incorporate the best characteristics of both parallel databases and MapReduce systems. For example, by using the MapReduce framework as its middle layer and distributed PostgreSQL as its bottom layer, HadoopDB [7] benefits from both the scalability of MapReduce and the efficiency of parallel databases.
There also exist many other distributed data processing systems that go beyond the MapReduce framework. These systems have been designed to address various problems not well handled by MapReduce, as described below.

Interactive analysis

MapReduce is optimized for batch processing rather than fast interactive analysis. Google tried to overcome this problem by building a completely different system, named Dremel, to support fast interactive analysis. Dremel [8] splits data into different fields and stores these fields in different files before execution, so that the time spent on data parsing and loading at runtime is reduced. Further, Dremel uses multi-level serving trees to execute queries, so that intermediate aggregation can reduce the amount of data that needs to flow through the system. Popular Dremel-like systems include Apache's Drill, Cloudera's Impala and Metamarkets' Druid.
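The storage idea behind Dremel, splitting records into per-field files so that a query reads only the fields it needs, can be sketched roughly as follows in Python (a simplified illustration of the general columnar principle; it does not reproduce Dremel's actual nested-column encoding or its serving trees).

```python
# Row-oriented input records.
rows = [
    {"url": "a.com", "clicks": 10, "country": "CN"},
    {"url": "b.com", "clicks": 3,  "country": "SG"},
    {"url": "c.com", "clicks": 7,  "country": "CN"},
]

# "Split data into different fields": one column (list) per field,
# standing in for Dremel's one-file-per-field layout.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query such as SUM(clicks) WHERE country = 'CN' now reads only two
# columns instead of parsing every full record.
total = sum(c for c, ctry in zip(columns["clicks"], columns["country"])
            if ctry == "CN")
print(total)  # 17
```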


Graph analysis

Directly running a graph analysis task in MapReduce leads to massive data movement, since MapReduce does not exploit the underlying graph structure. Recently, several graph computation frameworks have emerged to address this challenge. Google introduced a vertex-centric graph computation system called Pregel [9], which stores the underlying graph in memory to speed up random access, and executes graph computation using a bulk synchronous parallel (BSP) model. There are several other implementations for graph analysis, including Apache Hama, GoldenOrb, Giraph, Phoebus, GPS and GraphLab.
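The vertex-centric BSP style can be sketched in a few lines of Python (an illustration of the programming model only, not Pregel's actual API): in each superstep every active vertex sends messages to its neighbours, and each vertex updates its state from the messages it receives; a classic example is propagating the maximum value through the graph.

```python
def bsp_max_value(graph, values):
    """Vertex-centric BSP sketch: each vertex keeps the maximum value it has
    seen; supersteps repeat until no vertex changes (all vertices halt)."""
    active = set(graph)                    # vertices that still have work to do
    while active:
        # Message-passing phase of the superstep.
        inbox = {v: [] for v in graph}
        for v in active:
            for nbr in graph[v]:
                inbox[nbr].append(values[v])
        # Compute phase: each vertex applies its update function.
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)              # changed vertices stay active
    return values

# Undirected 4-vertex chain; the value 9 propagates to every vertex.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(bsp_max_value(graph, {"a": 1, "b": 9, "c": 2, "d": 5}))
# {'a': 9, 'b': 9, 'c': 9, 'd': 9}
```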
Real-time analysis or stream processing

The MapReduce framework works well for offline, batched analytics, but it was not designed for real-time decision making. Recently, systems such as S4 (simple scalable streaming system) [10] have been proposed to handle real-time data processing. S4 is a distributed stream processing engine that allows programmers to develop applications for continuous stream processing. It combines the Actors model and the MapReduce model, and hence applications can be massively concurrent with a simple programming interface.
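A rough Python sketch of this actor-like, keyed processing-element pattern is shown below (an illustration of the general style of such stream engines, not S4's real API): each incoming event is routed by key to a small stateful processing element, created on demand, which updates a running count.

```python
class CounterPE:
    """Stateful processing element: maintains a running count for one key."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += event.get("amount", 1)
        print(f"{self.key}: {self.count}")

class Stream:
    """Routes each event to the processing element owning the event's key,
    creating processing elements on demand (actor-per-key style)."""
    def __init__(self, pe_class, key_field):
        self.pe_class = pe_class
        self.key_field = key_field
        self.pes = {}

    def emit(self, event):
        key = event[self.key_field]
        if key not in self.pes:
            self.pes[key] = self.pe_class(key)
        self.pes[key].process(event)

clicks = Stream(CounterPE, key_field="page")
for ev in [{"page": "home"}, {"page": "search"}, {"page": "home"}]:
    clicks.emit(ev)   # prints running counts: home: 1, search: 1, home: 2
```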
Generic data processing

Efforts have been made to develop alternative parallel processing platforms that have a MapReduce flavor but are more general. One example of this line of work is epiC [11]. EpiC was designed to handle a variety of data (e.g., structured and unstructured), a variety of storage (e.g., databases and file systems) and a variety of processing (e.g., SQL and proprietary APIs). The important characteristic of epiC, from a MapReduce or data management perspective, is that it simultaneously supports both data-intensive OLAP and OLTP.

CONCLUSIONS

With the advancement and wide adoption of technologies, data have been created at an unprecedented rate. Coupled with the problems of size and heterogeneity, we have the 3V problems to handle and value to create out of the data. The value of data is unleashed when it can be integrated and made sense of together with other data. Big data presents challenges and opportunities in designing new processing platforms for integrating, managing and processing massive data, in providing contextual analysis by working with domain and subject experts, and in visualizing the massive data. The potential research topics in this field lie in all phases of the data management pipeline, which includes data acquisition, data integration, data modeling, query processing, data analysis, etc. Besides, big data also brings great challenges and opportunities to other computer science disciplines, such as system architecture, storage systems, system software and software engineering, which are beyond the scope of this paper. Overall, data are indeed at the root of the problems, and they drive the development of many new technologies and much decision making.

Bin Cui, Hong Mei and Beng Chin Ooi
1. School of EECS and Key Laboratory of High Confidence Software Technologies, Peking University, China
2. School of Computing, National University of Singapore, Singapore
*Corresponding author. E-mail: [email protected]

REFERENCES
1. Chang, F, Dean, J and Ghemawat, S et al. Proceedings of the Seventh USENIX Symposium on Operating Systems Design and Implementation. 2006; pp. 205–18.
2. DeCandia, G, Hastorun, D and Jampani, M et al. Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles. 2007; pp. 205–20.
3. Hey, T, Tansley, S and Tolle, K, eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
4. Jagadish, HV. Challenges and opportunities with big data, 2012. http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf.
5. Dean, J and Ghemawat, S. Proceedings of the Fifth USENIX Symposium on Operating Systems Design and Implementation. 2004; pp. 137–50.
6. Li, F, Ooi, BC and Ozsu, T et al. ACM Comput Surv 2013; 46.
7. Abouzeid, A, Bajda-Pawlikowski, K and Abadi, D et al. Proceedings of the VLDB Endowment. 2009; 2(1): 922–33.
8. Hall, A, Bachmann, O and Büssow, R et al. Proceedings of the VLDB Endowment. 2012; 5(11): 1436–46.
9. Malewicz, G, Austern, M and Bik, A et al. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010; pp. 135–46.
10. Neumeyer, L, Robbins, B and Nair, A et al. Proceedings of the 2010 IEEE International Conference on Data Mining Workshops. 2010; pp. 170–77.
11. Chen, C, Chen, G and Jiang, D et al. Proceedings of the 11th International Conference on Web Information Systems Engineering. 2010; pp. 1–19.