1.3 FACETS OF DATA
Data science involves processing complex datasets and building predictive
models from that data. It includes a wide range of processes, from the upstream processes
of acquiring, cleaning, and integrating data to the downstream processes of analysis, modeling,
and prediction.
There are many facets of data science, including:
Identifying the structure of data
Cleaning, filtering, reorganizing, augmenting and aggregating data
Visualizing the data
Data analysis, statistics and modeling
Machine Learning
Assembling data processing pipelines to link these steps
Leveraging high-end computational resources for large-scale problems
Different tools address different parts of this process. Therefore, interoperability
among tools, based on common data structures and interfaces, is an important element in
enabling the construction of complex, multifaceted data analysis pipelines.
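The idea of linking steps through a common data structure can be sketched in a few lines of Python. This is only an illustration: the record layout (a list of dictionaries with `region` and `amount` keys), the step functions, and the sample values are all hypothetical and do not come from any particular tool.

```python
from collections import defaultdict

# Assumption: every step consumes and returns a list of dict records.
# That shared structure is what lets the steps be chained freely.

def clean(records):
    """Drop records with a missing amount and normalize region names."""
    return [
        {**r, "region": r["region"].strip().lower()}
        for r in records
        if r.get("amount") is not None
    ]

def aggregate(records):
    """Sum the amounts per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return [{"region": k, "total": v} for k, v in sorted(totals.items())]

def run_pipeline(records, steps):
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        records = step(records)
    return records

# Hypothetical raw data with inconsistent names and a missing value.
raw = [
    {"region": " North ", "amount": 10.0},
    {"region": "north", "amount": 5.0},
    {"region": "South", "amount": None},
    {"region": "south", "amount": 2.5},
]
result = run_pipeline(raw, [clean, aggregate])
print(result)
```

Because each step only relies on the shared record format, a visualization or modeling step could be appended to the list without changing the existing ones.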
The main categories of data are:
a) Structured Data
b) Unstructured Data
c) Natural language Data
d) Machine-generated Data
e) Graph-based Data
f) Audio, video and images Data
g) Streaming Data
Fundamentals of Data Science and Analytics
1.3.1 Structured Data
Structured data is data that depends on a data model and resides in a fixed field
within a record. As such, it is often easy to store structured data in tables within databases
or Excel files. Structured Query Language (SQL) is the preferred way to manage and
query data that resides in databases. Some structured data may be difficult to store in a
traditional relational database. Hierarchical data such as a family tree is one such example.
[Figure: query result grid of employee records with the columns emp_id, lastname, first_name, hire_date, and job_title]
Fig.1.3. Excel File
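As a minimal sketch of how structured data in fixed fields can be queried with SQL, the following uses Python's built-in `sqlite3` module with an in-memory database. The column names are loosely modeled on the figure above; the table name `employees` and all row values are invented for illustration.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    " emp_id INTEGER PRIMARY KEY,"
    " lastname TEXT, first_name TEXT, hire_date TEXT, job_title TEXT)"
)

# Hypothetical rows: each value sits in a fixed field of its record.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Doe", "Alex", "2016-03-01", "Staff Accountant"),
        (2, "Roe", "Sam", "2004-02-07", "Software Engineer"),
        (3, "Poe", "Kim", "2010-07-02", "Operations Manager"),
    ],
)

# ISO-formatted dates sort correctly as text, so a string comparison works.
hired_before_2012 = conn.execute(
    "SELECT lastname, job_title FROM employees "
    "WHERE hire_date < ? ORDER BY hire_date",
    ("2012-01-01",),
).fetchall()
print(hired_before_2012)
```

The query works because every record follows the same data model. A family tree, by contrast, would need self-referencing rows or recursive queries, which is why hierarchical data fits less naturally into a single flat table.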
1.3.2 Unstructured Data
In most cases, data is unstructured. Unstructured data is data that is not easy
to fit into a data model because the content is context-specific or varying. One example
of unstructured data is the regular email. Although email contains structured elements
such as the sender, title, and body text, it is a challenge to find the number of people who
have written an email complaint about a specific employee because there are many ways
to refer to a person. The thousands of different languages and dialects out there further
complicate this. A human-written email is also a perfect example of natural language
data. Other common sources of unstructured data include surveys and web forms.
[Figure: icons for sources of unstructured data; the only legible label is "Agent Notes"]
Fig.1.4. Common Sources of Unstructured Data
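The email example can be made concrete with Python's standard `email` package. The message below is entirely made up, and the employee name "John Smith" is a hypothetical target. The sketch shows that the structured header fields are trivial to read, while an exact-match search of the free-text body misses other ways of referring to the same person.

```python
from email import message_from_string

# Hypothetical complaint email; addresses and names are invented.
raw_email = """\
From: customer@example.com
To: support@example.com
Subject: Complaint

I spoke to Mr. Smith yesterday and he was unhelpful.
Later, J. Smith from your billing desk hung up on me.
"""

msg = message_from_string(raw_email)

# Structured elements: fixed header fields with predictable names.
sender = msg["From"]
subject = msg["Subject"]

# Unstructured element: free text with no fixed schema.
body = msg.get_payload()

# A naive exact match finds nothing, although the employee is
# mentioned twice under different forms of the name.
naive_hits = body.count("John Smith")
surname_hits = body.count("Smith")
print(sender, subject, naive_hits, surname_hits)
```

Matching on the surname alone finds both mentions here, but it would also count any other person named Smith, which is the kind of ambiguity that makes unstructured data hard to analyze.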
1.3.3 Natural language Data
Natural language is a special type of unstructured data. It is challenging to process
because it requires linguistic knowledge and specific data science techniques. Natural
language processing involves tasks such as entity recognition, topic recognition, summarization,
text completion, and sentiment analysis. However, models trained in one domain do not perform
well in other domains. Even state-of-the-art techniques are not able to decipher the
meaning of every piece of text. This is because the meaning of natural language varies
according to context.
Fig.1.5. Natural Language Processing
Example: Two people listening to the same conversation may understand different
meanings. The meaning of the same words can vary when coming from someone upset or
joyous.
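A deliberately naive sketch illustrates why context matters. The word lists and the scoring rule below are assumptions chosen for illustration, not an established sentiment method. The scorer simply counts positive and negative words, so it cannot tell sincere praise from sarcasm or negation.

```python
import re

# Hypothetical, tiny word lists; real lexicons are far larger.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def naive_sentiment(text: str) -> int:
    """Return (#positive words) - (#negative words), ignoring context."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives

sincere = naive_sentiment("The service was great.")
sarcastic = naive_sentiment("Oh great, another delay.")
negated = naive_sentiment("This is not good.")
print(sincere, sarcastic, negated)
```

All three sentences receive the same positive score, even though a human reader would judge the second and third as negative. Handling such cases requires models that take the surrounding context into account.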
1.3.4 Machine-generated Data
Machine-generated data is information that is automatically created by a computer,
process, application, or other machine without human intervention. Machine-generated
data is becoming a major data resource. The analysis of machine data relies on highly
scalable tools, due to its high volume and speed. Examples of machine data are web
server logs, call detail records, network event logs, and telemetry.
[Figure: excerpt of a web server access log; each line records a GET request, the HTTP version, a status code (mostly 200), and a response size in bytes]
Fig.1.6. Web server log
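Because machine-generated records follow a regular layout, they can usually be parsed with a simple pattern. The sketch below assumes the widely used Common Log Format; the sample line, including its address and path, is invented for illustration and is not taken from the figure.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line: str):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A dash means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical log line.
sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 -0300] '
    '"GET /img/logo.jpg HTTP/1.1" 200 8835'
)
record = parse_line(sample)
print(record["path"], record["status"], record["size"])
```

At real volumes the same parsing logic would typically run inside a scalable log-processing tool rather than a single script, but the structure of each record stays the same.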
1.3.5 Graph-based Data
In graph theory, a graph is a mathematical structure that is used to model pairwise
relationships between objects. Graph or network data focuses on the relationship or
adjacency of objects. The graph structures use nodes, edges, and properties to represent
and store graphical data. Graph-based data is a natural way to represent social networks,
and its structure helps in calculating specific metrics such as the influence of a person
and the shortest path between two people. The follower list on Twitter is an example of
graph-based data.
Fig.1.7. Graph-based data
Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL. Graph data poses its own challenges, but interpreting
audio and image data can be even more difficult for a computer.
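The two metrics mentioned above, the shortest path between two people and a simple measure of influence, can be sketched on a small follower graph. The graph is stored as an adjacency list in a plain dictionary; the user names are hypothetical, and follower count is used here only as a crude stand-in for influence.

```python
from collections import Counter, deque

# Hypothetical follower graph: each key follows the users in its list.
follows = {
    "ana": ["ben", "cara"],
    "ben": ["dev"],
    "cara": ["dev", "eli"],
    "dev": ["eli"],
    "eli": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of nodes or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

def follower_counts(graph):
    """Count incoming edges (followers) for every followed user."""
    return Counter(target for targets in graph.values() for target in targets)

path = shortest_path(follows, "ana", "eli")
counts = follower_counts(follows)
print(path, counts["dev"], counts["eli"])
```

Breadth-first search explores the graph level by level, so the first path that reaches the goal is one with the fewest edges. A graph database would answer the same question with a query instead of hand-written traversal code.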
1.3.6 Audio, video, and images Data
Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks such as recognizing objects in pictures are trivial for human beings, but
they are challenging for computers.
Fig. 1.8. Object Detection
In 2014, Major League Baseball Advanced Media (MLBAM) announced that
video capture will be increased to approximately 7 TB per game for the purpose of live,
in-game analytics. High-speed cameras placed at stadiums will capture ball and athlete
movements to calculate the path taken by a defender relative to two baselines in real
time. Recently, a company called DeepMind succeeded in creating an algorithm that is
capable of learning how to play video games. This algorithm uses deep learning to interpret
the video. The learning algorithm takes in data as it is produced by the computer game
(this data is called streaming data).
1.3.7 Streaming Data
Streaming data can take any of the previous forms of data, such as audio, video,
and images. The most important property of streaming data is that the data flows into the
system whenever an event occurs instead of being loaded into a data store in a batch. It
is not a different type of data, but it is treated as such because the process must be adapted
to deal with this type of information.
[Figure: an input data stream flowing into a processing engine]
Fig.1.9. Processing of Streaming Data
[Figure: Twitter "Trending in USA" panel listing #MyFavoriteRadioMemories and #TheProudFamily (22.5K Tweets)]
Fig.1.10. Trending topics on Twitter
Examples: Trending topics on Twitter, live sporting and music events, and the stock
market.
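The difference between batch loading and streaming can be sketched with a Python generator that updates its result each time an event arrives. The event source, the messages, and the hashtag-counting rule are all hypothetical; a real system would read from a message queue or a network socket instead.

```python
from collections import Counter
from typing import Iterable, Iterator

def trending(events: Iterable[str], top_n: int = 1) -> Iterator[list]:
    """Yield the current top hashtags after every incoming event."""
    counts = Counter()
    for text in events:
        # Update the running state with this single event only.
        counts.update(w.lower() for w in text.split() if w.startswith("#"))
        yield counts.most_common(top_n)

def event_source() -> Iterator[str]:
    """Stand-in for a live feed; messages are invented."""
    yield "Listening to #jazz tonight"
    yield "#jazz and #blues forever"
    yield "More #blues please"
    yield "Nothing beats #blues"

# Each snapshot is available as soon as its event has been processed,
# without waiting for the whole data set to be loaded.
snapshots = list(trending(event_source()))
print(snapshots[0], snapshots[-1])
```

The processing function never holds the full set of messages, only the running counts, which is the adaptation of the process that streaming data calls for.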