
Lec 2 Types of Data

The document outlines various types of data encountered in data science, including structured, unstructured, natural language, machine-generated, graph-based, audio/video/images, and streaming data. Each type has unique characteristics and challenges, such as structured data being easily stored in databases while unstructured data is context-specific and difficult to categorize. The document emphasizes the importance of using appropriate tools and techniques for analyzing these diverse data types.

Uploaded by

Ahmed kaleem
Copyright
© All Rights Reserved

Types of data

In data science and big data you’ll come across many different types of data, and each
of them tends to require different tools and techniques. The main categories of data are
these:

• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming

Let’s explore all these interesting data types.

1. Structured data

Structured data is data that depends on a data model and resides in a fixed field within
a record. As such, it’s often easy to store structured data in tables within databases or
Excel files. SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases. Some structured data, however, is hard to store
in a traditional relational database. Hierarchical data such as a family tree is one
such example.
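
As a minimal sketch of both points, the example below stores a small table in an in-memory SQLite database and queries it with SQL, then walks a family tree stored in the same table. The `person` table, its columns, and the sample rows are assumptions made up for illustration. The recursive query shows why hierarchical data is awkward in a relational store: a plain SELECT is not enough to reach every generation.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical family-tree table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT,"
        " birth_year INTEGER,"
        " parent_id INTEGER REFERENCES person(id))"
    )
    # Illustrative rows only: each record has the same fixed fields.
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?, ?)",
        [
            (1, "Ada", 1940, None),
            (2, "Ben", 1965, 1),
            (3, "Cleo", 1968, 1),
            (4, "Dan", 1990, 2),
            (5, "Eve", 2015, 4),
        ],
    )
    return conn


def born_before(conn, year):
    """Flat query: fixed fields make filtering and sorting straightforward."""
    rows = conn.execute(
        "SELECT name FROM person WHERE birth_year < ? ORDER BY birth_year",
        (year,),
    )
    return [row[0] for row in rows]


def descendants(conn, ancestor_id):
    """Hierarchical query: needs a recursive common table expression."""
    rows = conn.execute(
        """
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 1 FROM person WHERE parent_id = ?
            UNION ALL
            SELECT p.id, p.name, tree.depth + 1
            FROM person AS p JOIN tree ON p.parent_id = tree.id
        )
        SELECT name, depth FROM tree ORDER BY depth, name
        """,
        (ancestor_id,),
    )
    return rows.fetchall()


conn = build_db()
print(born_before(conn, 1970))
print(descendants(conn, 1))
```

The flat query reads naturally, while the descendant lookup has to recurse through `parent_id` one generation at a time.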

2. Unstructured data

Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying. One example of unstructured data is your regular email.
Although email contains structured elements such as the sender, title, and body text, it’s
a challenge to find the number of people who have written an email complaint about a
specific employee because so many ways exist to refer to a person, for example. The
thousands of different languages and dialects out there further complicate this.
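
The email example can be sketched with Python's standard `email` module. The messages, addresses, and the list of name variants below are invented for illustration. The headers parse cleanly because they are structured, but finding mentions of one employee in the body depends on guessing every way a writer might refer to that person. The sketch also only detects mentions; deciding whether a message is actually a complaint is a harder problem it does not attempt.

```python
import re
from email import message_from_string

# Hypothetical messages: structured headers followed by a free-text body.
RAW_EMAILS = [
    "From: ann@example.com\nSubject: Complaint\n\n"
    "Robert Smith was rude to me on the phone.",
    "From: raj@example.com\nSubject: Service issue\n\n"
    "I want to complain about Bob from support.",
    "From: li@example.com\nSubject: Thanks\n\n"
    "Great help from Maria yesterday.",
    "From: ann@example.com\nSubject: Follow-up\n\n"
    "Mr. Smith still has not replied.",
]

# Assumed variants for one employee; a real list would never be complete.
NAME_VARIANTS = re.compile(
    r"\b(?:robert\s+smith|bob|mr\.?\s+smith)\b", re.IGNORECASE
)


def senders_mentioning(raw_emails, pattern):
    """Return the distinct senders whose message body matches the pattern."""
    senders = set()
    for raw in raw_emails:
        msg = message_from_string(raw)
        body = msg.get_payload()  # plain string for non-multipart messages
        if pattern.search(body):
            senders.add(msg["From"])  # structured field: trivial to read
    return senders


print(len(senders_mentioning(RAW_EMAILS, NAME_VARIANTS)))
```

A message that calls the same employee "Rob" or "that guy at the front desk" would be missed, which is the difficulty the paragraph above describes.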

3. Natural language

Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.

The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain don’t generalize well to other domains. Even state-of-the-art techniques
aren’t able to decipher the meaning of every piece of text. This shouldn’t be a surprise
though: humans struggle with natural language as well. It’s ambiguous by nature. The
concept of meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can
vary when coming from someone upset or joyous.

4. Machine-generated data

Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention. Machine-generated
data is becoming a major data resource and is likely to remain one.

The analysis of machine data relies on highly scalable tools, due to its high volume and
speed. Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.
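
To make the web server log example concrete, here is a small sketch that parses lines in the widely used Common Log Format and counts responses by status code. The sample lines and the regular expression are assumptions for illustration; real logs vary by server configuration, and at production volume this kind of work usually runs on scalable tooling rather than a single Python loop.

```python
import re
from collections import Counter

# Hypothetical log lines; the last one is deliberately malformed.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:55:37 +0000] "GET /missing HTTP/1.1" 404 153',
    '203.0.113.5 - - [10/Oct/2023:13:55:39 +0000] "POST /login HTTP/1.1" 200 512',
    "not a log line",
]

# One named group per field of interest.
LOG_PATTERN = re.compile(
    r"(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def status_counts(lines):
    """Count parsed records per HTTP status code, skipping bad lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["status"]] += 1
    return counts


print(status_counts(LOG_LINES))
```

Because a machine writes every line in the same layout, one pattern recovers the fields reliably, which is what makes this kind of data practical to analyze in bulk.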

5. Graph-based or network data

“Graph data” can be a confusing term because any data can be shown in a graph.
“Graph” in this case points to mathematical graph theory. In graph theory, a graph is a
mathematical structure to model pair-wise relationships between objects. Graph or
network data is, in short, data that focuses on the relationship or adjacency of objects.
Graph structures use nodes, edges, and properties to represent and store graph
data. Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and the
shortest path between two people.
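
The two metrics mentioned above can be sketched on a toy social network. The names and friendships below are invented, the shortest path is found with a breadth-first search, and "influence" is approximated by the number of direct connections (degree), which is only one of several possible measures.

```python
from collections import deque

# Hypothetical undirected friendship graph: node -> set of neighbours.
FRIENDS = {
    "Ana": {"Ben", "Cy"},
    "Ben": {"Ana", "Dee"},
    "Cy": {"Ana", "Dee"},
    "Dee": {"Ben", "Cy", "Eli"},
    "Eli": {"Dee"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted so the result is deterministic when paths tie in length.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def degree(graph):
    """Number of direct connections per person, a crude influence proxy."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


print(shortest_path(FRIENDS, "Ana", "Eli"))
print(max(degree(FRIENDS).items(), key=lambda item: item[1]))
```

Storing the data as nodes and edges makes both questions a short traversal; the same questions against flat tables would need repeated self-joins.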

Examples of graph-based data can be found on many social media websites. For
instance, on LinkedIn you can see who you know at which company. Your follower list
on Twitter is another example of graph-based data.

6. Audio, image, and video

Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers.

A company called DeepMind succeeded at creating an algorithm that’s capable of
learning how to play video games. This algorithm takes the video screen as input and
learns to interpret everything via a complex process of deep learning. It’s a
remarkable feat that prompted Google to buy the company in 2014 for its own artificial
intelligence (AI) development plans. The learning algorithm takes in data as it’s
produced by the computer game; in other words, it works on streaming data.

7. Streaming data

While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being
loaded into a data store in a batch. Although this isn’t really a different type of data, we
treat it here as such because you need to adapt your process to deal with this type of
information.
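
The difference in process can be sketched as follows, assuming a made-up feed of stock price events. Instead of loading every price into a store and averaging afterwards, the code updates a running mean each time an event arrives, so it never needs to hold the full history. The event source, the `XYZ` symbol, and the prices are illustrative only.

```python
def price_events():
    """Stand-in for a live feed: yields one hypothetical event at a time."""
    for price in [100.0, 102.0, 101.0, 105.0]:
        yield {"symbol": "XYZ", "price": price}


def running_mean(events):
    """Yield the mean price so far after each event, using constant memory."""
    count = 0
    mean = 0.0
    for event in events:
        count += 1
        # Incremental update: no list of past prices is kept.
        mean += (event["price"] - mean) / count
        yield mean


for current in running_mean(price_events()):
    print(current)
```

A batch version would collect all prices first and compute the mean once; the streaming version has an answer available after every event.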

Examples are the “What’s trending” list on Twitter, live sporting or music events, and
the stock market.
