Facets of Data

Foundation of data science
1.3 FACETS OF DATA

Data science involves processing complex datasets and building predictive models from those data. It includes a wide range of processes, from upstream processes of acquiring, cleaning and integrating data to downstream processes of analysis, modeling and prediction.

There are many facets of data science, which include:
• Identifying the structure of data
• Cleaning, filtering, reorganizing, augmenting and aggregating data
• Visualizing the data
• Data analysis, statistics and modeling
• Machine learning
• Assembling data processing pipelines to link these steps
• Leveraging high-end computational resources for large-scale problems

Different tools address different parts of this process. Therefore, interoperability among tools, based on common data structures and interfaces, is an important element in enabling the construction of complex, multifaceted data analysis pipelines.

The main categories of data are:
a) Structured Data
b) Unstructured Data
c) Natural language Data
d) Machine-generated Data
e) Graph-based Data
f) Audio, video and images Data
g) Streaming Data

Fundamentals of Data Science and Analytics

1.3.1 Structured Data

Structured data is data that depends on a data model and resides in a fixed field within a record. As a result, it is often easy to store structured data in tables within databases or Excel files. Structured Query Language (SQL) is the preferred way to manage and query data that resides in databases. Some structured data may be difficult to store in a traditional relational database. Hierarchical data such as a family tree is one such example.

[Figure: result grid of employee records with the columns emp_id, last_name, first_name, hire_date and job_title]
Fig. 1.3. Excel File

1.3.2 Unstructured Data

In most cases, data is unstructured. Unstructured data is data that is not easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is the regular email. Although email contains structured elements such as the sender, title, and body text, it is a challenge to find the number of people who have written an email complaint about a specific employee, because there are many ways to refer to a person. The thousands of different languages and dialects further complicate this. A human-written email is also a perfect example of natural language data.

Introduction to Data Science

Some other common sources of unstructured data are surveys and web forms.

[Figure: icons of unstructured data sources, including agent notes]
Fig. 1.4. Common Sources of Unstructured Data

1.3.3 Natural language Data

Natural language is a special type of unstructured data. It is challenging to process because it requires linguistic knowledge and specific data science techniques. Natural language processing involves entity recognition, topic recognition, summarization, text completion, and sentiment analysis. However, models trained in one domain do not perform well in other domains. Even state-of-the-art techniques are not able to decipher the meaning of every piece of text, because the meaning of natural language varies with context.

Fig. 1.5. Natural Language Processing

Example: Two people listening to the same conversation may understand different meanings.
The meaning of the same words can vary when coming from someone upset or joyous.

1.3.4 Machine-generated Data

Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. Machine-generated data is becoming a major data resource. Because of its high volume and speed, the analysis of machine data relies on highly scalable tools. Examples of machine data are web server logs, call detail records, network event logs, and telemetry.

[Figure: web server access log entries, each showing a GET request for a page or image, the HTTP version, the status code 200 and the response size]
Fig. 1.6. Web server log

1.3.5 Graph-based Data

In graph theory, a graph is a mathematical structure that is used to model pair-wise relationships between objects. Graph or network data focuses on the relationship or adjacency of objects.
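The idea of adjacency can be made concrete with a small sketch. The example below is only an illustration, not a method prescribed by the text: it stores a made-up network as an adjacency dictionary and uses breadth-first search to find a shortest chain of connections between two people. The names, the `graph` variable and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical network: each key is a person, each value is the set of
# people directly connected to that person (all names are made up).
graph = {
    "Asha": {"Ben", "Chen"},
    "Ben": {"Asha", "Dita"},
    "Chen": {"Asha", "Dita"},
    "Dita": {"Ben", "Chen", "Eli"},
    "Eli": {"Dita"},
}


def degree(graph, node):
    """Number of direct connections of a node (a very simple metric)."""
    return len(graph[node])


def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph.

    Returns a list of nodes from start to goal with the fewest edges,
    or None when the two nodes are not connected.
    """
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])  # each queue entry is a path explored so far
    while queue:
        path = queue.popleft()
        # sorted() only makes the order of exploration deterministic
        for neighbour in sorted(graph[path[-1]]):
            if neighbour in visited:
                continue
            if neighbour == goal:
                return path + [neighbour]
            visited.add(neighbour)
            queue.append(path + [neighbour])
    return None


path = shortest_path(graph, "Asha", "Eli")
print(path, degree(graph, "Dita"))
```

A dictionary of sets is chosen here because it keeps neighbour lookups simple; a dedicated graph library or graph database would be the usual choice for real network data.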
Graph structures use nodes, edges, and properties to represent and store graph data. Graph-based data is a natural way to represent social networks, and its structure helps in calculating specific metrics such as the influence of a person and the shortest path between two people. The follower list on Twitter is an example of graph-based data.

Fig. 1.7. Graph-based data

Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL. Graph data poses its own challenges, but audio and image data can be even more difficult for a computer to interpret.

1.3.6 Audio, video, and images Data

Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks such as recognizing objects in pictures are trivial for human beings, but they are challenging for computers.

Fig. 1.8. Object Detection

In 2014, Major League Baseball Advanced Media (MLBAM) announced that video capture would be increased to approximately 7 TB per game for the purpose of live, in-game analytics. High-speed cameras placed at stadiums capture ball and athlete movements to calculate, in real time, the path taken by a defender relative to two baselines.

Recently, a company called DeepMind succeeded in creating an algorithm that is capable of learning how to play video games. This algorithm uses deep learning to interpret the video. The learning algorithm takes in data as it is produced by the computer game (this data is called streaming data).

1.3.7 Streaming Data

Streaming data can take any of the previous forms of data, such as audio, video and images. The most important property of streaming data is that the data flows into the system whenever an event occurs, instead of being loaded into a data store in a batch. Streaming data is not really a different type of data, but it is treated as such here because the processing must be adapted to deal with this type of information.

[Figure: an input data stream feeding a processing engine]
Fig. 1.9. Processing of Streaming Data

[Figure: topics trending in the USA, such as #MyFavoriteRadioMemories and #TheProudFamily (22.5K Tweets)]
Fig. 1.10. Trending topics on Twitter

Examples: Trending topics on Twitter, live sporting or music events, and the stock market.
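The contrast between event-by-event processing and batch loading can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a real stream processing engine: a Python generator stands in for the live source, and the hashtags, the `hashtag_stream` function and the `trending` function are all hypothetical. The counts are updated as each event arrives, so a current ranking is available at every step rather than only after all data has been loaded.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def hashtag_stream(events: Iterable[str]) -> Iterator[str]:
    """Stand-in for a live source: hands over one event at a time."""
    for event in events:
        yield event


def trending(stream: Iterable[str], top_n: int = 2) -> Iterator[List[Tuple[str, int]]]:
    """Update the counts on every arriving event and yield the current top_n."""
    counts: Counter = Counter()
    for tag in stream:
        counts[tag] += 1
        yield counts.most_common(top_n)


# Made-up events; a real system would read them from a network source.
sample_events = ["#music", "#sports", "#music", "#stocks", "#music", "#sports"]

snapshots = list(trending(hashtag_stream(sample_events)))
print(snapshots[0])   # ranking after the first event
print(snapshots[-1])  # ranking after the last event
```

Because `trending` is itself a generator, it never needs the whole input in memory, which is the property that distinguishes this style from a batch job.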
