Big Data The Driver For Innovation in Databases
© The Author(s) 2014. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For Permissions, please email: journals.
[email protected]
28 National Science Review, 2014, Vol. 1, No. 1 PERSPECTIVE
worldwide scale. These new sources of data also allow for new kinds of data-analysis applications, e.g. understanding social behaviors at an aggregate scale. The enormous size of social networks allows unprecedented and new forms of analysis and brings new processing challenges. First, in many online social networks, data analysis makes use of the social graph. This requires new data-analysis algorithms that must be able to cope with massive graphs with O(10^9) nodes. Existing graph algorithms are not designed to deal with graphs at such scale and are mainly tailored for graphs residing in main memory. Second, the scale of interactions in the social network and the dif-

cessing techniques that rely on indexing may not be feasible simply because there is insufficient time to build indexes on the data. Clearly, controls over the execution time are needed but there has been little work in this direction. The result of a query could range from the empty set to a significant fraction of the database. However, although a very large result is a complete and correct answer, it might not be a very useful answer and may also take too long to compute. This suggests that rather than large results, a summary of the data and applying sampling may be more appropriate.

systems, they are not sufficient for handling the volumes of data today. Once a database’s upper capacity is exceeded, database engineers must redistribute the data across multiple databases to break them up. Third, it is difficult for relational database systems to effectively deal with complicated queries. The increasing demand for data exploration and knowledge discovery needs more complicated ad hoc queries in data analysis, which makes it difficult to build indexes and views in the databases based on requirements and assumptions, even for those most agile systems. Below, we provide two immediate challenges and opportunities.
Mobile computing
should be designed. Based on the access methods and data distribution, efficient query processing strategies could then be designed.

Scalable data analytics
The widespread adoption of computing technologies has resulted in a large amount of data in a wide variety of forms. Besides the traditional business data that are structured, social media applications generate graph data; location-based services and urban sensing applications produce spatial–temporal data; multimedia applications generate images and videos that are typically represented as high-

EMERGING TECHNOLOGIES
DBMSs have become a ubiquitous operational platform for managing huge amounts of business data. DBMSs have evolved over the last four decades and are now functionally rich. However, with the arrival of the big data era, these database systems have shown deficiencies in handling big data.
Recently, a new distributed data-processing framework called MapReduce was proposed [5], whose fundamental idea is to simplify parallel processing using a distributed computing platform that offers only two interfaces: map and reduce. Programmers implement their own map and reduce functions, while the system is responsible for scheduling and synchronizing the map and reduce tasks. By defining the ‘Map and Reduce’

better scalability and availability than relational databases. MapReduce is being used increasingly in applications such as data mining, data analytics and scientific computation. Its wide adoption and success lie in its distinguishing features, including flexibility, scalability, efficiency and fault tolerance.
Along with the praise brought by its simplicity and flexibility, MapReduce is also criticized for its reduced functionality. Much effort has already been devoted to addressing these problems; yet active areas of research remain to be explored, such as SQL-like declarative language enhancement for ease of use, DBMS operator implementation for functional-

Figure 1. The big data analysis pipeline [4].
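
The two-interface programming model described above, in which programmers supply map and reduce functions and the system handles scheduling and synchronization, can be illustrated with a minimal single-process sketch. The names map_words, reduce_counts and run_mapreduce are hypothetical and do not belong to any real MapReduce API; run_mapreduce merely stands in for the grouping and coordination that an actual framework performs across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(_doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """User-supplied map function: emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """User-supplied reduce function: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy stand-in for the framework (an assumption for illustration only).

    A real system would run map and reduce tasks in parallel on a cluster;
    here both phases run sequentially in one process.
    """
    # Map phase: apply map_fn to each input record and group values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: apply reduce_fn once per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [("d1", "big data drives databases"), ("d2", "big data big graphs")]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
```

Only the two user-supplied functions change from one application to another; the driver stays the same, which is the simplification the model aims for.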
distributed data processing systems that go beyond the MapReduce framework. These systems have been designed to address various problems not well handled by MapReduce, which are listed as follows.

Interactive analysis
MapReduce is optimized for batch processing but not for fast interactive analysis. Google tried to overcome the problem by building a completely different system named Dremel to support fast interactive analysis. Dremel [8] splits data into different fields and stores these fields in different files before execution,

were proposed to handle real-time data processing. S4 is a distributed stream processing engine that allows programmers to develop applications for continuous stream processing. It combines the Actors model and the MapReduce model, and hence applications can be massively concurrent with a simple programming interface.

Generic data processing
Efforts have been made in developing alternative parallel processing platforms that have a MapReduce flavor, but are more general. One example of this line of work is epiC [11]. EpiC was designed

software and software engineering, which are beyond the scope of the paper. Overall, data are indeed the root of the problems, and they drive the development of many new technologies and decision making.

Bin Cui, Hong Mei and Beng Chin Ooi
1. School of EECS and Key Laboratory of High Confidence Software Technologies, Peking University, China
2. School of Computing, National University of Singapore, Singapore
∗ Corresponding author. E-mail: [email protected]

REFERENCES