Types of data
In data science and big data you’ll come across many different types of data, and each
of them tends to require different tools and techniques. The main categories of data are
these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Let’s explore all these interesting data types.
1.2.1. Structured data
Structured data is data that depends on a data model and resides in a fixed field within
a record. As such, it’s often easy to store structured data in tables within databases or
Excel files. SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases. You may also come across structured data that
is hard to store in a traditional relational database. Hierarchical
data such as a family tree is one such example.
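As a minimal sketch of what "fixed fields within a record" means in practice, the following uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and rows are made up for illustration.

```python
import sqlite3

# In-memory database; the table, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("Ann", 34), ("Bob", 27), ("Cleo", 41)],
)

# Every record has the same fixed fields, so SQL can address them by name.
older_than_30 = [
    row[0]
    for row in conn.execute("SELECT name FROM people WHERE age > 30 ORDER BY name")
]
print(older_than_30)
conn.close()
```

Because each row follows the same model, filtering and sorting need no interpretation of the content itself.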
1.2.2. Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying. One example of unstructured data is your regular email.
Although email contains structured elements such as the sender, title, and body text, it’s
a challenge to find, for example, the number of people who have written an email
complaint about a specific employee, because so many ways exist to refer to a person. The
thousands of different languages and dialects out there further complicate this.
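The contrast between the structured and unstructured parts of an email can be sketched with Python's standard email module. The message, the addresses, and the employee name below are invented for illustration.

```python
from email import message_from_string

# A made-up message; the addresses and names are hypothetical.
raw = """From: customer@example.com
To: support@example.com
Subject: Complaint

Yesterday I was treated rudely by Mr. J. Smith. John never called me back.
"""

msg = message_from_string(raw)

# The headers are structured: each one is a named field.
sender = msg["From"]
subject = msg["Subject"]

# The body is free text. A naive exact match on the full name finds nothing,
# even though the writer refers to the same person twice.
body = msg.get_payload()
exact_match = "John Smith" in body
print(sender, subject, exact_match)
```

Reading the sender is a simple field lookup, whereas deciding who the complaint is about requires interpreting the text.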
1.2.3. Natural language
Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain don’t generalize well to other domains. Even state-of-the-art techniques
aren’t able to decipher the meaning of every piece of text. This shouldn’t be a surprise
though: humans struggle with natural language as well. It’s ambiguous by nature. The
concept of meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can
vary when coming from someone upset or joyous.
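To make the ambiguity concrete, here is a deliberately naive word-list sentiment scorer. It is a toy sketch and not a technique from the natural language processing literature; the word lists and example sentences are assumptions chosen to show where such an approach breaks.

```python
# Tiny, hand-picked word lists; purely illustrative.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}


def naive_sentiment(text):
    """Score text by counting positive and negative words, ignoring context."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Sarcasm: a human hears annoyance, the word count sees "great".
sarcastic = naive_sentiment("Oh great, another delay.")
# Negation: "not bad" is mild praise, the word count sees "bad".
negated = naive_sentiment("The service was not bad.")
print(sarcastic, negated)
```

Both sentences are misread because the meaning depends on tone and context rather than on the individual words.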
1.2.4. Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention. Machine-generated
data is becoming a major data resource and will continue to be one.
The analysis of machine data relies on highly scalable tools, due to its high volume and
speed. Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.
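As a small sketch of working with machine data, the following parses a few web server log lines and counts responses per status code. It assumes the lines follow the Common Log Format; the sample lines are invented, and real logs often deviate from the pattern.

```python
import re
from collections import Counter

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample lines.
log_lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 512',
]

status_counts = Counter()
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:  # Skip lines that do not fit the expected layout.
        status_counts[match.group("status")] += 1

print(status_counts)
```

At production volume the same logic would typically run on a distributed or streaming platform instead of a single Python loop.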
1.2.5. Graph-based or network data
“Graph data” can be a confusing term because any data can be shown in a graph.
“Graph” in this case points to mathematical graph theory. In graph theory, a graph is a
mathematical structure to model pair-wise relationships between objects. Graph or
network data is, in short, data that focuses on the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graph
data. Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and the
shortest path between two people.
Examples of graph-based data can be found on many social media websites. For
instance, on LinkedIn you can see who you know at which company. Your follower list
on Twitter is another example of graph-based data.
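The two metrics mentioned above can be sketched on a toy social network stored as an adjacency list. The people and connections are invented, breadth-first search is one common way to find a shortest path in an unweighted graph, and the number of connections is used here only as a crude stand-in for influence.

```python
from collections import deque

# Toy undirected social network; names and edges are hypothetical.
graph = {
    "Ann": ["Bob", "Cleo"],
    "Bob": ["Ann", "Dev"],
    "Cleo": ["Ann", "Dev"],
    "Dev": ["Bob", "Cleo", "Eve"],
    "Eve": ["Dev"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(graph, "Ann", "Eve")

# Degree (number of direct connections) as a very rough influence measure.
degree = {person: len(friends) for person, friends in graph.items()}
most_connected = max(degree, key=degree.get)
print(path, most_connected)
```

Dedicated graph libraries and graph databases offer richer influence measures, such as centrality scores, than this simple degree count.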
1.2.6. Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers.
Recently a company called DeepMind succeeded at creating an algorithm that’s
capable of learning how to play video games. This algorithm takes the video screen as
input and learns to interpret everything via a complex process of deep learning. It’s a
remarkable feat that prompted Google to buy the company for their own Artificial
Intelligence (AI) development plans. The learning algorithm takes in data as it’s
produced by the computer game; it’s streaming data.
1.2.7. Streaming data
While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being
loaded into a data store in a batch. Although this isn’t really a different type of data, we
treat it here as such because you need to adapt your process to deal with this type of
information.
Examples are the “What’s trending” list on Twitter, live sporting or music events, and the
stock market.
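The difference between batch and streaming processing can be sketched with Python generators: each value is handled as it arrives, and the result is updated without holding the full data set. The price values and the price_stream source are hypothetical; a real stream would come from something like a socket or a message queue.

```python
def price_stream():
    """Hypothetical event source that yields one price per event."""
    for price in [10.0, 12.0, 11.0, 13.0]:
        yield price


def running_mean(events):
    """Update the mean after every event, keeping only a count and a total."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Collected into a list only so the results can be inspected here;
# a streaming consumer would act on each value as it is produced.
means = list(running_mean(price_stream()))
print(means)
```

A batch process would instead load all prices first and compute the mean once at the end.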