Implementing a Smart Data Platform
How Enterprises Survive in the Era of Smart Data
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is
subject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.
978-1-491-98348-5
Table of Contents

5. SmartDP Solutions
   Data Market
   Platform Products
   Data Applications
   Consulting and Services
6. SmartDP Reference Architecture
   Data Application Layer
   Operation Management Layer
7. Case Studies
   SmartDP Drives Growth in Banks
   Real Estate Development Groups Integrate Online and Offline Marketing with SmartDP
   Common Market Practices and Disadvantages
   Methodology
   Description of the Overall Plan
   Conclusion
CHAPTER 1
The Advent of the Smart Data Era
Figure 1-1. Artificial intelligence global yearly financing history, 2010–2015, in millions of dollars (source: CB Insights)
SmartDP, along with the three basic capabilities that SmartDP should possess: data management, data science, and data engineering. We also introduce the SmartDP reference framework and detail the functions of each layer. Finally, we will look at how SmartDP is adopted in real scenarios to deepen our understanding of smart data.
tion perception, acceleration, geomagnetism, gyroscope, distance, pressure, RGB light, temperature, humidity, Hall effect, heartbeat, fingerprint, and more. If all sensors are activated, each mobile phone could acquire up to 1 GB of data per day. Although this data can truly reflect the contexts of mobile users, most of it is discarded.
With both the scale and the dimensionality of data rapidly increasing, enterprises are unable to effectively prepare and gain insight from data, making it hard for data to support business decision-making. According to a 2015 report by the Boston Consulting Group (BCG), only 34% of the data generated by financial institutions (which have a relatively high degree of IT support) was actually used. And according to a 2016 survey by Experian Data Quality, nearly 60% of American enterprises could not actively detect or deal with data quality issues and had no dedicated departments or roles responsible for managing data quality. There is clearly still a long way to go in terms of managing complicated data. If not effectively utilized, a large amount of data never becomes an asset and thus produces no value, which in turn means huge costs for enterprises.
Enterprises struggle with these challenges for a variety of reasons: some have no advanced technical platform, some are deficient in data management, some have not built standard data engineering systems, and others simply lag behind in their understanding of the value of data science. All of these factors have hampered the transformation of traditional enterprises into intelligent, data-driven ones. Let's look at each of these challenges more closely.
• Their unified data management strategy can be used to maintain data views that are consistent across the enterprise, efficiently gather data (including self-owned and third-party data), and efficiently output data and data services.
• Their end-to-end data engineering capacity can support data management for the business and help form a closed loop that continuously optimizes business operations.

Smart enterprises are the companies that are armed with these three capabilities.
In order to become data-driven, smart enterprises need a new platform to support them, a platform that promotes an environment focused on data. This platform is called SmartDP (smart data platform). SmartDP refers to a platform that explores the commercial value of data through smart data applications and enables proper data management, data engineering, and data science. Composed of a set of modern data solutions, SmartDP helps enterprises build an end-to-end closed data loop, from data acquisition to decision to action, in order to provide the capacity for flexible data insight and data value mining as well as flexible and scalable support for contextual data applications. As we'll see later in this report, adopting SmartDP can improve enterprises' data management, data engineering, and data science capabilities. We'll now review each of these aspects in general terms.
Data Management
Data management refers to the process by which data is effectively acquired, stored, processed, and applied, so that data can deliver its full value. In terms of business, data management includes metadata management, data quality management, and data security management.
Metadata Management
Metadata can help us to find and use data, and it constitutes the
basis of data management.
Normally, metadata is divided into the following three types:
• Business metadata refers to a description of a dataset from the business point of view, mainly its significance to business users, including business names, business descriptions, business labels, and data-masking strategies.
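To make this concrete, the sketch below models the business metadata of one dataset in Python; the field names and sample values are hypothetical and only illustrate the kind of information business metadata carries:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class BusinessMetadata:
        """Business-facing description of one dataset (illustrative fields only)."""
        business_name: str
        business_description: str
        business_labels: List[str] = field(default_factory=list)
        masking_strategy: str = "none"   # e.g., hash phone numbers before display

    meta = BusinessMetadata(
        business_name="Monthly Active Customers",
        business_description="Customers with at least one transaction in the month",
        business_labels=["marketing", "customer"],
        masking_strategy="hash_pii",
    )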
access data in a convenient and efficient manner while ensuring data
security.
Data Engineering
Most traditional enterprises are challenged by poor implementation of data acquisition, organization, analytics, and action procedures when they transform themselves for the smart era. It is therefore urgent that enterprises build end-to-end data engineering capacity throughout their data acquisition, organization, analytics, and action procedures, so as to ensure a data- and procedure-driven business structure, rational data, and a closed-loop approach, and to turn deeper insight into commercial value. The search engine is the simplest example. Once a search engine captures users' interactive behavior as data, it can optimize the presentation of search results according to the duration of the user's stay, the number of clicks, and other signals, so as to improve the search experience and attract more users. This optimization in turn generates more data for further optimization. This is a closed loop of data, which brings about continuous business optimization.
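The closed loop can be sketched in a few lines of Python. The example below is an illustration only, assuming a single ranking weight that is re-tuned from observed click-through data; all names and numbers are hypothetical:

    def collect_feedback(impressions):
        """Acquisition: compute the click-through rate from logged impressions."""
        clicks = sum(1 for imp in impressions if imp["clicked"])
        return clicks / len(impressions) if impressions else 0.0

    def update_ranking_weight(weight, ctr, target_ctr=0.1):
        """Decision and action: nudge the ranking weight toward the target CTR."""
        return weight + 0.5 * (target_ctr - ctr)

    weight = 1.0
    for day in range(3):                           # each iteration is one pass around the loop
        impressions = [{"clicked": day % 2 == 0}]  # stand-in for real logged behavior
        weight = update_ranking_weight(weight, collect_feedback(impressions))

Each pass generates new behavior data, which feeds the next adjustment; that feedback is what makes the loop closed.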
In the smart data era, due to the complexity of data and data application contexts, data engineering needs to integrate both AI and human wisdom to maximize its effectiveness. For example, a search engine aims to solve the problem of information ingestion after the surge in the volume of information on the internet. Since tens of millions of web pages cannot be handled with manually curated URL directories, algorithms must be used to index information and rank search results according to users' characteristics. In order to adapt to the increasingly complex web environment, Google has gradually improved the intelligence of its search ranking, from the earliest PageRank algorithm, to Hummingbird in 2013, to the addition of the machine learning algorithm RankBrain as the third most important ranking signal in 2015. The Google search engine uses over 200 ranking signals, and the variant signals or subsignals may number in the tens of thousands and are continuously changing. Normally, new ranking signals need to be discovered, analyzed, and evaluated by humans in order to determine their effects on ranking results. Thus, even with powerful algorithms and massive data, human wisdom is absolutely necessary and plays a key role in efficient data engineering.
Data Acquisition
Data acquisition focuses on generated data and captures data into
the system for processing. It is divided into two stages—data harvest
and data ingestion.
Different data application contexts have different demands for the
latency of the data acquisition process. There are three main modes:
Real time
Data should be processed in a real-time manner without any
time delay. Normally, there would be a demand for real-time
processing in trading-related contexts. For example:
• For online trade fraud prevention, the data of trading parties should be processed by an anti-fraud model at the fastest possible speed, so as to judge whether there is any fraud and promptly report any deviant behavior to the authorities.
• The commodities of an ecommerce website should be recommended in real time according to clients' historical data and their current web page browsing behavior.
• Computer manufacturers should, according to their sales conditions, make real-time adjustments to inventories, production plans, and parts supply orders.
• Manufacturers should, based on sensor data, make real-time judgments about production line risks, promptly conduct troubleshooting, and guarantee production.
Micro batch
Data is processed periodically, on the order of minutes. It is not necessary that data be processed in real time; some delay is allowed. For example, the effect of an advertisement may be monitored every five minutes in order to determine a future release strategy, which requires that the accumulated data be processed in a centralized manner every five minutes.
Mega batch
Data is processed periodically with a time span of several hours; there is no high volume of data ingested in real time, and a long processing delay is acceptable. For example, some web pages are not frequently updated, so their content may be crawled and updated once a day.
Streaming data is not necessarily acquired in real time. It may also be acquired in batches, depending on the application context. For example, the click event stream of a mobile app is uploaded continuously. However, if we only wish to count new or retained users for the current day, we only need to collect all click-stream logs for that day into one file and upload it to the system as a mega batch for analytics.
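As an illustration of the mega-batch case just described, the following Python sketch appends click events to one file per day and ships the previous day's file as a single batch; the directory layout and the upload step are hypothetical:

    import datetime
    import json
    import pathlib

    def append_click_event(event, log_dir="/tmp/clickstream"):
        """Write one click event to the current day's log file."""
        day = datetime.date.today().isoformat()
        path = pathlib.Path(log_dir)
        path.mkdir(parents=True, exist_ok=True)
        with open(path / f"{day}.log", "a") as f:
            f.write(json.dumps(event) + "\n")

    def upload_yesterdays_batch(log_dir="/tmp/clickstream"):
        """Once a day, ship the previous day's file to the platform as one mega batch."""
        day = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
        batch_file = pathlib.Path(log_dir) / f"{day}.log"
        if batch_file.exists():
            pass  # the actual upload call depends on the ingestion interface, which is not specified here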
Data ingestion
Data ingestion refers to a process by which the data acquired from data sources is brought into your system, so the system can start acting upon it. It concerns how to acquire data.
Data ingestion typically involves three operations, namely discover, connect, and sync. Generally, the values are not modified in any way, to avoid information loss.
Discover refers to a process by which accessible data sources are searched for in the corporate environment. Active scanning, connection, and metadata ingestion help to automate the process and reduce the workload of data ingestion.
Connect refers to a process by which the data sources that are confirmed to exist are connected. Once connected, the system may directly access data from a data source. For example, building a connection to a MySQL database actually involves configuring the connection string of the data source, including IP address, username and password, database name, and so on.
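A minimal sketch of such a connection in Python, assuming the SQLAlchemy library with the PyMySQL driver; the host, credentials, database name, and table are hypothetical:

    from sqlalchemy import create_engine, text

    # The connection string bundles IP address, port, username, password, and database name.
    engine = create_engine("mysql+pymysql://user:password@10.0.0.5:3306/sales_db")

    with engine.connect() as conn:
        # Once connected, the system can read directly from the source.
        order_count = conn.execute(text("SELECT COUNT(*) FROM orders")).scalar()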
Sync refers to a process by which data is copied into a system under your control. Sync is not always necessary after a connection is made. For example, in an environment with strict data security requirements, certain data sources may only be connected to; copying their data is not allowed.
Data Organization
Data organization refers to a process to make data more available
through various operations. It is divided into two stages, namely
data preparation and data enrichment.
Data preparation
Data preparation refers to a process by which data quality is improved using tools. In general, data integrity, timeliness, accuracy, and consistency are treated as the indicators to improve, preparing the data for further analytics.
Common data preparation operations include:
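Typical examples are deduplication, missing-value handling, and type normalization. As an illustration, a pandas sketch of such steps might look like this (the column names and values are hypothetical):

    import pandas as pd

    raw = pd.DataFrame({
        "user_id": ["u1", "u1", "u2", None],
        "amount":  ["10.5", "10.5", "7", "3.2"],
        "ts":      ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-03"],
    })

    prepared = (
        raw.dropna(subset=["user_id"])    # integrity: drop records missing a key
           .drop_duplicates()             # consistency: remove duplicate rows
           .assign(amount=lambda d: pd.to_numeric(d["amount"]),  # accuracy: fix types
                   ts=lambda d: pd.to_datetime(d["ts"]))         # timeliness checks need real timestamps
    )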
Data enrichment
In contrast to data preparation, data enrichment is more context-oriented. It can be understood as a higher-level, context-driven data preparation process.
Common data enrichment operations include:
Data labels
Labels are highly contextual. They may have different meanings in different contexts, so they should be discussed within a specific context. For example, gender labels have different meanings in contexts such as ecommerce, fundamental demography, and social networking (see the sketch after this list).
Data modeling
This targets the algorithm models of a business, for example, a graph model built to screen the age group of e-connoisseurs in the internet finance field.
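A minimal sketch of contextual labels in Python, assuming a device profile whose labels are grouped by context so that the same label name can carry different meanings; all identifiers and values are hypothetical:

    profile = {
        "device_id": "abc123",
        "labels": {
            "ecommerce":   {"gender": "female"},   # inferred from shopping behavior
            "demographic": {"gender": "male"},     # registered or declared attribute
            "social":      {"gender": "unknown"},  # inferred from the social graph
        },
    }

    def get_label(profile, context, name):
        """Look up a label only within the requested context."""
        return profile["labels"].get(context, {}).get(name)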
human intervention. And the cost of such solutions is far lower than
that of traditional banks.
Data analytics is divided into two stages, namely data insight and
data decisions.
Data insight
Data insight refers to a process by which data is understood through
data analytics. Data insights are usually presented in the form of
documents, figures, charts, or other visualizations.
Data insight can be divided into the following types depending on
the time delay from data ingestion to data insight:
Real time
Applicable to contexts where data insight needs to be obtained in real time. Server system monitoring is a simple example: an alarm and response plan should be triggered immediately when key indicators (such as disk and network metrics) exceed a designated threshold (a minimal sketch follows this list). In complicated contexts such as P2P fraud prevention, a judgment must be made about the possibility of fraud according to contextual data (the borrower's data and characteristics) and third-party data (the borrower's credit data), and an alarm should be triggered based on that judgment.
Interactive
Applicable to contexts where the insight needs to be obtained interactively. For example, a business expert studying the reason for a recent fall in the sales volume of a particular product cannot get the answer from a single query. Clues must be gathered through successive queries, each determining the target of the next. As interactive insight requires, the response speed of each query should be close to real time.
Batch
Applicable to contexts where the insight only needs to be produced once every time interval. For example, behavior statistics of mobile app users (such as new, daily active, and retained users) generally have no real-time requirements.
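For the server-monitoring example of real-time insight, a minimal Python sketch might compare incoming metrics against designated thresholds and emit alerts; the metric names and limits are hypothetical:

    THRESHOLDS = {"disk_usage": 0.90, "network_error_rate": 0.05}

    def check_metrics(metrics):
        """Return an alert message for every indicator that exceeds its threshold."""
        alerts = []
        for name, limit in THRESHOLDS.items():
            value = metrics.get(name, 0)
            if value > limit:
                alerts.append(f"ALERT: {name}={value} exceeds {limit}")
        return alerts

    print(check_metrics({"disk_usage": 0.95, "network_error_rate": 0.01}))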
The depth and completeness of data insight results greatly affect the quality of decisions.
Action
An action is a process by which the decision generated in the analytics stage is put into use and the effect is assessed. It includes two stages, namely deployment and assessment.
Deployment
Deployment is a process by which action strategies are implemented. Simple deployment includes presenting a visualized result or reaching users during a marketing campaign. However, common deployments are more complicated. Usually, deployment relates to shifting the data strategy from natural accumulation to active acquisition. The data acquisition stage involves deployment through construction, including the offline installation of IoT devices (especially beacon devices, such as iBeacon and Eddystone) and WiFi probe devices, as well as improvement of business operation flows so as to capture specific data points (such as Shake interactions, QR code scanning, and WiFi connection events).
Assessment
Assessment is a process by which the action result is measured; it
aims to provide a basis for optimization of all data engineering.
In practice, although problems with the action result appear to be derived from the decision, they are more often a reflection of data quality. Data quality may relate to all the stages of data engineering, including acquisition, harvest, preparation, enrichment, insight, decision, and action. Thus, it is necessary to track the processing procedures of each stage, which helps to locate the root causes of problems.
Sometimes, for the purpose of fairness and objectivity, enterprises may employ third-party service providers to make an assessment; in this situation, all participants in the action should reach a consensus on the assessment criteria. For example, an app advertiser finds through analytics that users in a particular region have large potential value and thus hopes to advertise in a targeted way in that region. In the marketing campaign, the app advertiser employs a third-party monitoring service to follow up on the marketing effect. The results indicate that quite a lot of the activated users are not within the region. Does this mean an erroneous action was taken in the release channel? Further analysis reveals that the app advertiser, the channel, and the third-party monitoring service provider do not use consistent standards for judging the location of the audience. In the mobile field, due to the complex network environment and mobile phone structure (applications and sensors may be affected), enterprises should pay particular attention to the calibration of location data, especially when looking at deviations in assessments.
Figure 4-3. Roles of the data engineering team (figure courtesy of Wenfeng Xiao)
Data steward
A data steward is responsible for planning and managing the data assets of an enterprise, including data purchases, utilization, and maintenance, so as to provide stable, easily accessible, and high-quality data.
A data steward should have the following capabilities:
Data engineer
A data engineer is responsible for the architecture and the technical platform and tools needed in data engineering, including data connectors, the data storage and computing engine, data visualization, the workflow engine, and so on. A data engineer should ensure data is processed in a stable and reliable way and provide support for the smooth operation of the work of the data steward, data scientist, and data analyst.
The capabilities of a data engineer should include, but are not limited to, the following aspects:
Data scientist
Some enterprises classify data scientists as data analysts, as they undertake similar tasks (i.e., acquiring insight from data to guide decisions).
In fact, the roles do not require identical skills. Data scientists must clear even higher thresholds and be able to deal with more complex data contexts. A data scientist should have a deep background in computer science, statistics, mathematics, and software engineering as well as industry knowledge, and should have the capacity to undertake algorithm research (such as algorithm optimization or new algorithm modeling). Thus, they are able to solve more complex data issues, such as how to optimize websites to increase user retention or how to promote game apps so as to better realize users' lifetime value.
If data analysts favor summaries and analytics (descriptive and diagnostic analytics), data scientists emphasize forward-looking analytics (predictive analytics and independent decision analytics). In order to continuously create profits for enterprises, data scientists should have a deep understanding of the business.
The capabilities of a data scientist should include, but are not limited to, the following aspects:
• Traditional data science tools, including SPSS, MATLAB, and
SAS
Data analyst
Data analysts are responsible for exploring data based on data platforms, tools, and algorithm models and gaining business insight so as to satisfy the demands of business users. They should not only understand data but also master specialized business knowledge in areas such as accounting, financial risk control, weather, and game app operations.
To some extent, data analysts may be regarded as entry-level data scientists who do not need a solid mathematical foundation or algorithm research skills. Nevertheless, they must master Excel, SQL, basic statistics and statistical tools, and data visualization.
A data analyst should have the following core capabilities:
Programming
A general understanding of a programming language, such as Python or Java, or of a database language can effectively help data analysts process data at scale and in a tailored way, improving the efficiency of analytics.
security. Data product managers should also consider business demands more comprehensively and develop data products that remain dependable even in abnormal environments.
Data Science
As required by the smart data era, data science spans computer science, statistics, mathematics, software engineering, industry knowledge, and other fields. It studies how to analyze data and gain insight from it.
With the emergence of big data, smart enterprises must deal with a greater data scale and more complex data types on smart data platforms through data science. Some traditional fields share similar concepts with data science, including advanced analytics, data mining, and predictive analytics.
Data science continues some ideas of statistics, for example, statistical search, comparison, clustering, classification, and other ways of analyzing and summarizing large amounts of data. Its conclusions are correlations rather than necessary cause-and-effect relationships. Although data science relies heavily on computation, it is not based on a known mathematical model, which distinguishes it from computer simulation. Instead, it replaces cause-and-effect relationships and rigorous theories and models with large-scale data correlation, and acquires new “knowledge” based on such correlation.
Data Market
Data requires flow, interaction, and integration to bring its greatest value into play. The data exchange and trade market enables data suppliers to upload, introduce, publicize, and transfer data, and lets purchasers try out, inspect, and acquire data at scale, with the market acting as middleman and guarantor in the trade.
In addition to necessary metering and billing, the more crucial characteristics of the data market include solving the problems of data conversion, conforming to laws and regulations, fraud prevention, standard unification, quality verification, data convergence, and so on.
The key functional points of the data market include:
Conforming to laws and regulations
• Incorporating a checking mechanism to avoid exposing personally identifiable information (PII) and easing privacy issues through asymmetric cryptography
• Providing a guarantee and conversion platform to assist both buyer and seller in obtaining data results while the data remains “available but invisible”
• Validly processing data through ID conversion (see the sketch after this list)
Fraud prevention
Adding anti-fraud rules and preventing data fraud by some suppliers
Standard unification
Providing standards for data input and input interfaces as well as industry-based and type-based standards for business data
Quality verification
Verifying data quality through cross verification, business feedback, sampling, spot inspections, etc.
Data convergence
Converging and converting scattered small data sources and increasing the availability of data
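As an illustration of ID conversion, the sketch below maps raw device identifiers to keyed hashes so that buyer and seller can join records without exchanging raw PII. The report mentions asymmetric cryptography; keyed hashing is shown here only as a simple stand-in, and the shared secret and identifier are hypothetical:

    import hashlib
    import hmac

    SHARED_SECRET = b"pre-agreed-between-buyer-and-seller"   # hypothetical secret

    def convert_id(raw_id):
        """Replace a raw identifier with a keyed hash before data exchange."""
        return hmac.new(SHARED_SECRET, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()

    print(convert_id("IMEI-860123456789012"))   # hypothetical device identifier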
Platform Products
SmartDP products should be able to support data management, data engineering, and data science. The platform should not only satisfy the requirements for data management, but also complete data acquisition, organization, analytics, and action, and support the building of data science algorithms and models. In terms of the ecosystem, the platform should be capable of generating smart data applications and providing support for the data market through its data processing and production capabilities.
Data Management
In terms of data management, SmartDP products should include the
following three elements:
Multisource gathering
Self-owned data, self-owned data on third-party platforms,
third-party data
Quality enrichment
Mainly monitoring and enhancing data quality
Strategy control
Including management and control of data assets (such as safety, validity, access security, and authorization)
Data Science
In terms of data science, SmartDP products should include the following elements:
Training mode
As the initial link of data science, the platform should be able to construct, verify, and adjust models based on the analysis of business demands.
Precision of verification
As the verification link of data science, the platform should be able to confirm the precision of a model based on prior data and human experience.
Application of execution
As the application link of data science, the platform should be able to realize the execution, operation, and maintenance of algorithm models.
Data Applications
Data applications in the smart data age are the logical packaging of business needs on top of SmartDP's capabilities (data management, data science, and data engineering), built to solve specific business issues and realize business value.
Data applications are usually developed by data product managers, who provide services to business users. Therefore, they need to consider user experience, including the simplicity of the operation process and the clarity of visual delivery.
In practice, data applications may come either from the enterprise or from third parties. Third-party data application providers may have proven capabilities in some vertical fields, such as financial risk control, customer value prediction algorithms, and context awareness, which can supplement the enterprise's experience in these fields and avoid reinvention. However, when the market offers too few third-party data applications to solve practical business context problems, enterprises must build customized data applications in a targeted way through independent development or subcontracting.
Data applications can support each other to help realize the reusability of data and functions.
Figure 6-1. SmartDP reference architecture (figure courtesy of Wenfeng Xiao)
Self-Owned Data
Self-owned data refers to data that is owned by an enterprise and that can be completely controlled and managed by the enterprise, whether it is stored within the enterprise or on the external platform of a third party.
The self-owned data stored in an enterprise generally includes:
methods in mainstream environments and platforms. The techniques include but are not limited to:
Third-Party Data
Third-party data is data that is not owned by the enterprise and is provided by a third-party data provider.
Normally, third-party data includes:
Infrastructure Layer
The infrastructure layer provides the basic technical capacity for data applications at upper layers, including data storage, data computing, data science, workflow management, the message bus, and the service bus.
Data Storage
Data storage capacity supports long-term data storage.
Data storage can be divided into the following types according to
the type of data stored:
Structured
It is required that a predefined strict data model (schema) or a predefined organization mode be established for storage. Normally, structured storage includes row-oriented storage (traditional RDBMSs such as MySQL), column-oriented storage (column-oriented databases such as HBase and Vertica), and graph storage (graph databases such as GraphSQL).
Semi-structured
This has no strict definition of a data model but does have a certain format, which is freely scalable, such as JSON and XML.
Unstructured
There is no strict definition of a data model, and there is no way to pre-organize the storage. Unstructured storage includes scalable or freely organized files, such as server logs, emails, compressed files, and videos.
Data storage can be divided into the following types in terms of the
form of stored content:
Data Computing
Data computing capacity supports all operations of data organiza‐
tion and analytics.
Data computing can be divided into the following types based on
computing type:
Workflow Management
Workflow management is used to manage data-related work and flow processes. It arranges the various automatic or manual tasks necessary for data engineering and ensures their completion and implementation. Workflow management relates to task security, error handling, dependency arbitration, and other coordination tasks.
The common tools for workflow management include Azkaban and Oozie.
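Dependency arbitration is the core of such tools. As an illustration (real deployments would use Azkaban or Oozie), the Python sketch below declares tasks with their dependencies and runs them in a valid order; the task names are hypothetical:

    from graphlib import TopologicalSorter   # standard library, Python 3.9+

    tasks = {
        "ingest":  [],
        "prepare": ["ingest"],
        "enrich":  ["prepare"],
        "report":  ["enrich"],
    }

    def run(name):
        print(f"running {name}")   # placeholder for the real task body

    for task in TopologicalSorter(tasks).static_order():
        run(task)                  # each task starts only after its dependencies finish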
Common message buses include Kafka, and common service buses include microservice governance frameworks such as Eureka and Dubbo.
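A minimal sketch of publishing an event onto a Kafka message bus, assuming the kafka-python client; the broker address, topic name, and event fields are hypothetical:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.internal:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("device-events", {"device_id": "abc123", "event": "app_open"})
    producer.flush()   # ensure the event is actually delivered before exiting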
Data Catalog
Equivalent to a search engine for data, the data catalog helps users search for, understand, and use data by organizing information about the data.
Generally, the data catalog is maintained by the data steward and may be used by all other roles of the data engineering team.
Normally, the data catalog undertakes data management tasks and includes the following core functions:
Metadata management
Management of data lineage, data summaries, and data formats
Data quality management
Management of record rearrangement, unit error correction, and data correlation
Data security management
Management of auditing, masking, tokenization, and access control
Data Factory
A data factory is used to build production flow lines for data engineering and ensure these lines operate normally.
Generally, a data factory is maintained by data engineers and may be used by all other roles of the data engineering team.
Normally, a data factory includes the following core functions:
Business Applications
Business applications provide services to business departments.
In general, business applications are used by business users and may
be developed either by the enterprise or by a third party.
Common business applications include marketing release, mobile
analysis, advertisement monitoring, offline visitor flow analysis,
competition analysis, financial risk control, and identity verification.
Business Operations
Business operations can help operators manage the SmartDP
account as well as the measurement and billing of services provided
externally.
Data Operations
Data operations can enable operators to manage compliance, anonymization, and access control rules as well as the audit strategies of the data services they provide, and improve the efficiency of data service releases.
independently developed by TalkingData, a query and computation
of millions of pieces of data takes only minutes to complete.
Application Contexts
High-value client mining and marketing
Financial enterprises exhibit a typical Pareto effect: 20% of their clients contribute 80% of operating revenues. TalkingData discovered through data analytics that 8% of the financing clients on one bank's mobile channel owned about 75% of the bank's total assets. The bank hoped to find more high-value clients in order to market to them and improve the sales performance of its financing products.
With 30,000 high-value clients as seeds and the variables related to high-value clients as input, TalkingData calculated, among millions of mobile devices, the devices most similar to those of high-value clients using the lookalike algorithm of its Atom engine, and then carried out marketing using the Push and SMS functions of its digital marketing tools. In the SmartDP model used to mine high-net-worth clients, TalkingData used data from several dimensions as input variables, including device concentration point, application name, device model, transaction information, and customer information, and searched for potential high-value clients in data with 50 million dimensions.
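Lookalike scoring can be sketched in general terms: represent each device as a numeric feature vector and rank candidates by similarity to the seed group. The Python sketch below uses cosine similarity to the seed centroid; it is an illustration only and does not reproduce the Atom engine's algorithm, and the data and dimensions are placeholders:

    import numpy as np

    def lookalike_scores(seed_vectors, candidates):
        """Score candidates by cosine similarity to the centroid of the seed group."""
        centroid = seed_vectors.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        cand_norm = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
        return cand_norm @ centroid

    seeds = np.random.rand(30, 8)          # stand-in for 30,000 seed clients
    candidates = np.random.rand(1000, 8)   # stand-in for millions of devices
    top = np.argsort(-lookalike_scores(seeds, candidates))[:10]   # ten most similar devices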
By this method, the bank sold millions of dollars of financial products within two months. Compared with traditional marketing means, costs were reduced by 95% and the bank saw a 15% increase in high-value clients.
As shown in Figure 7-2, the TPU (Traffic, Product, and User) methodology highlights the relationship among Channel/Traffic, Product, and User. TalkingData will establish a label system for target groups (for instance, real estate buyers) to profile dimensions such as demographics, wealth, hobbies and interests, brand preference, and real-life locations; establish a network of connections and relations between devices, scenarios, and audiences, working with the developer's first-party data dimensions, such as unit models and volume of transactions; and filter for the target groups for future marketing based on these profiles. This label system can be deployed and established in SmartDP, realizing a 360-degree panorama of the target group and driving follow-up marketing based on the aggregation and interconnection of client data and external data.
We can learn about the general situation of a city, understand population traffic, and identify promotion channels according to urban development and the general trend of population migration. Through analysis and targeting of competing products, we gain a more informed view of clients (real-estate developers and agents), and more targeted marketing strategies can be formulated.
Conclusion
The era of smart data has arrived, whether you have realized it or not. SmartDP's advanced technical platform can help enterprises respond to the challenges smart data presents in terms of data management, data engineering, and data science, while building an end-to-end closed loop of data. As we've discussed, SmartDP can provide flexible and scalable support for contextual data applications with agile data insight and data mining capability. TalkingData users have demonstrated that SmartDP can also greatly reduce the obstacles they encountered when transforming to a data-driven model, obstacles related to personnel, workflows, and tools for data acquisition, organization, analytics, and action. SmartDP ultimately improved their ability to drive contextual applications using data and explore commercial value, thus making them smart enterprises.
About the Authors
Yifei Lin is the cofounder and executive vice president of TalkingData, in charge of Big Data Collaboration with Industrial Customers. In this role, he focuses on Big Data Collaboration with enterprises from the finance, securities, insurance, telecom, retail, aviation, and automobile industries, helping traditional enterprises discover business value in mobile big data.
He has over 15 years of development, consulting, and sales experience, as well as 12 years of team management experience. He served as the General Manager of Enterprise Architecture Consulting and General Manager of Middleware Technical Consulting for the Greater China Region at Oracle, the Senior Manager of the communications industry technical division at BEA, and the Senior Architecture Consultant at AsiaInfo. He has also worked with several major Chinese banks (CCB, CUP, ICBC, SPDB, etc.), the three major telecom operators, large-scale diversified enterprises (China Resources, Haier), major automobile companies (FAW, SAIC-GM), and major high-tech enterprises (Huawei, ZTE).
Xiao Wenfeng is the CTO of TalkingData. He holds a master's degree from Tsinghua University and has worked in software development and development management for major companies such as Lucent, BEA/Oracle, and Microsoft. He joined BEA's telecom technical division in 2006, worked on the development of WLSS 4.0 (a SIP signaling container based on WebLogic) as an architect and core developer, and also led the development of BEA's first ISM full-service client project.
In 2008, he joined Microsoft to lead quality assurance for BizTalk middleware servers. In 2013, he joined Qihoo 360 as the lead of the PC Cleaner/Accelerator product division, where he managed production, technology, and operations for multiple product families including the 360 Cleaner, and applied for over 11 technical patents. In 2014, he joined TalkingData as CTO and leads the development of all production lines.