
IT 222 POINTERS TO REVIEW

BIG DATA CHARACTERISTICS


(1) VOLUME: Quantity of data to be stored
• As the quantity of data needing to be stored increases, the need for larger storage devices increases as well.
• Scaling up is keeping the same number of systems but migrating each one to a larger system.
• Scaling out means that when the workload exceeds the capacity of a single server, it is spread out across several servers.

(2) VELOCITY: Speed at which data is entered into the system and must be processed
• Can be broken down into two categories (see the sketch after this list):
  - Stream processing focuses on input processing and requires analysis of the data stream as it enters the system.
  - Feedback loop processing refers to the analysis of the data to produce actionable results.
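A minimal Python sketch of these two categories, using an invented stream of sensor readings (the values, threshold, and alert rule are illustrative assumptions, not part of the reviewer): each value is analyzed as it arrives (stream processing), and the running analysis triggers an action when a limit is crossed (feedback loop processing).

# Illustrative only: a toy "stream" of sensor readings (hypothetical data).
readings = [12.0, 14.5, 30.2, 11.8, 45.9, 13.3]

def stream(values):
    """Yield values one at a time, as if they were arriving continuously."""
    for v in values:
        yield v

total, count = 0.0, 0
for value in stream(readings):
    # Stream processing: analyze each item as it enters the system.
    total += value
    count += 1
    running_avg = total / count

    # Feedback loop processing: turn the analysis into an actionable result.
    if running_avg > 20.0:          # hypothetical threshold
        print(f"ALERT: running average {running_avg:.1f} exceeds the limit")
    else:
        print(f"ok: running average {running_avg:.1f}")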
(3) VARIETY: Variations in the structure of the data to be stored
  - Structured data fits into a predefined data model.
  - Unstructured data is not organized to fit into a predefined data model.
  - Semi-structured data combines elements of both – some parts of the data fit a predefined model while other parts do not.
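A small Python sketch of the three structure levels above; the field names and sample text are invented for illustration only.

import json

# Structured: every record follows the same predefined model (fixed columns).
structured_row = ("C-1001", "Dela Cruz", "Juan", 1250.75)   # hypothetical customer row

# Semi-structured: self-describing tags, but fields can vary between records.
semi_structured = json.loads('{"id": "C-1002", "name": "Reyes, Ana", "tags": ["vip"]}')

# Unstructured: free text with no predefined data model at all.
unstructured = "Delivery was late again, but the support agent was very helpful."

print(structured_row[1])          # access by known position in the model
print(semi_structured["name"])    # access by tag; other keys may or may not exist
print(len(unstructured.split()))  # only generic processing is possible without parsing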
VARIABILITY: Changes in the meaning of the data based on context
• Sentiment analysis attempts to determine the attitude of a statement (positive, negative, neutral); a toy sketch appears after the VISUALIZATION item below.

VERACITY: Trustworthiness of the data

VALUE: The degree to which the data can be analyzed to provide meaningful insights

VISUALIZATION: The ability to graphically present the data in such a way as to make it understandable to users
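To make the sentiment-analysis bullet under VARIABILITY concrete, here is a toy rule-based tagger in Python; the cue-word lists and sample statements are invented, and real sentiment analysis relies on much richer language models.

# Toy cue-word lists (assumptions); real systems use trained language models.
POSITIVE = {"great", "helpful", "fast", "love"}
NEGATIVE = {"late", "slow", "broken", "hate"}

def sentiment(statement):
    """Classify a statement as positive, negative, or neutral by counting cue words."""
    words = statement.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for s in ["The support team was great and fast",
          "Delivery was late and the box arrived broken",
          "The package arrived on Tuesday"]:
    print(sentiment(s), "-", s)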
HADOOP
• De facto standard for most Big Data storage and processing
• Java-based framework for distributing and processing very large data sets across clusters of computers
• Most important components:
  - Hadoop Distributed File System (HDFS): low-level distributed file processing system that can be used directly for data storage
  - MapReduce: programming model that supports processing large data sets
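Hadoop's real MapReduce API is Java-based; the following is only a minimal pure-Python sketch of the programming model itself (map, shuffle/group, reduce), counting words in a few invented lines of text.

from collections import defaultdict

# Hypothetical input lines (in Hadoop these would be read from HDFS blocks).
lines = ["big data needs big storage", "hadoop processes big data"]

def map_phase(line):
    """Map: emit (key, value) pairs -- here (word, 1) for every word."""
    return [(word, 1) for word in line.split()]

# Shuffle/sort: group all emitted values by key across every mapper's output.
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)   # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'processes': 1}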
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• Approach based on several key assumptions:
  - High volume: the default block size is 64 MB and can be configured to even larger values.
  - Write-once, read-many: this model simplifies concurrency issues and improves data throughput.
  - Streaming access: Hadoop is optimized for batch processing of entire files as a continuous stream of data.
  - Fault tolerance: HDFS is designed to replicate data across many different devices so that when one fails, the data is still available from another device.
• A data node communicates with the name node by regularly sending block reports and heartbeats.
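A rough Python sketch of the block-and-replica bookkeeping described above; the 64 MB block size comes from the high-volume note, while the replication factor of 3, the file size, and the node names are assumptions made for illustration.

BLOCK_SIZE_MB = 64      # default HDFS block size noted above
REPLICATION = 3         # commonly cited HDFS default (assumption here)
data_nodes = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical cluster

file_size_mb = 300      # hypothetical file
num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division -> 5 blocks

# Simplified name-node view: which data nodes hold a replica of each block.
block_map = {}
for block_id in range(num_blocks):
    # Round-robin placement; real HDFS placement is rack-aware and more involved.
    block_map[block_id] = [data_nodes[(block_id + i) % len(data_nodes)]
                           for i in range(REPLICATION)]

for block_id, replicas in block_map.items():
    print(f"block {block_id}: replicas on {replicas}")

# Fault tolerance: if one data node fails, every block still has replicas elsewhere.
failed = "node2"
for replicas in block_map.values():
    assert any(node != failed for node in replicas)   # holds for every block here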
HADOOP ECOSYSTEM
• MapReduce Simplification Applications:
  - Hive is a data warehousing system that sits on top of HDFS and supports its own SQL-like language.
  - Pig compiles a high-level scripting language (Pig Latin) into MapReduce jobs for execution in Hadoop.
• Data Ingestion Applications:
  - Flume is a component for ingesting data into Hadoop.
  - Sqoop is a tool for converting data back and forth between a relational database and the HDFS.
• Direct Query Applications:
  - HBase is a column-oriented NoSQL database designed to sit on top of the HDFS that quickly processes sparse datasets.
  - Impala was the first SQL-on-Hadoop application.
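For orientation only, the same word count expressed in Hive's SQL-like language and in Pig Latin, kept as Python string literals rather than runnable jobs; the table/relation names are invented, and the exact syntax should be verified against the Hive and Pig documentation before use.

# Hypothetical word count written two ways; the strings are printed, not executed --
# they would normally be submitted through the Hive and Pig clients, respectively.

hive_word_count = """
SELECT word, COUNT(*) AS n
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
"""

pig_word_count = """
lines  = LOAD 'docs' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grp    = GROUP words BY word;
counts = FOREACH grp GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""

print(hive_word_count)
print(pig_word_count)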
DATA ANALYTICS
• Subset of business intelligence (BI) functionality that encompasses the mathematical, statistical, and modeling techniques used to extract knowledge from data
• A continuous spectrum of knowledge acquisition that goes from discovery to explanation to prediction
• Explanatory analytics focuses on discovering and explaining data characteristics based on existing data.
• Predictive analytics focuses on predicting future data outcomes with a high degree of accuracy.

PREDICTIVE ANALYTICS
• Refers to the use of advanced mathematical, statistical, and modeling tools to predict future business outcomes with a high degree of accuracy
• Focuses on creating actionable models to predict future behaviors and events
• Most BI vendors are dropping the term data mining and replacing it with predictive analytics.
• Models are used in customer service, fraud detection, targeted marketing, and optimized pricing.
• Can add value in many different ways but needs to be monitored and evaluated to determine its return on investment
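A minimal predictive-analytics sketch, assuming the scikit-learn library and an entirely invented customer-churn data set; it shows only the generic fit-then-predict pattern behind the "actionable models" above, not any specific method from the course.

# Requires scikit-learn (assumed installed: pip install scikit-learn).
from sklearn.linear_model import LogisticRegression

# Hypothetical history: [monthly_spend, support_calls] per customer,
# and whether that customer eventually churned (1) or stayed (0).
X_history = [[120, 0], [35, 4], [80, 1], [20, 6], [95, 0], [15, 5]]
y_history = [0, 1, 0, 1, 0, 1]

model = LogisticRegression()
model.fit(X_history, y_history)          # learn from existing data

# Predictive side: score new customers and turn the score into an action.
new_customers = [[110, 1], [25, 5]]
for features, prob in zip(new_customers, model.predict_proba(new_customers)[:, 1]):
    action = "offer a retention promo" if prob > 0.5 else "no action"
    print(features, f"churn probability {prob:.2f} -> {action}")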

NoSQL
• Name given to the non-relational database technologies developed to address the challenges of Big Data
• Key-value (KV) databases store data as a collection of key-value pairs organized as buckets, which are the equivalent of tables (see the sketch after this list).
• Document databases store data in key-value pairs in which the value components are tag-encoded documents grouped into logical groups called collections.
• Column-oriented databases refer to two technologies:
  - Column-centric storage: data is stored in blocks that hold data from a single column across many rows.
  - Row-centric storage: data is stored in a block that holds data from all columns of a given set of rows.
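A toy Python sketch of the key-value "bucket" and document "collection" ideas above, using plain dictionaries; the bucket, collection, and field names are invented for illustration.

# Key-value store: opaque values looked up only by key, grouped into buckets.
kv_store = {
    "sessions": {                       # a bucket (roughly the equivalent of a table)
        "sess:9001": "user=42;cart=3",  # the value is opaque to the database
        "sess:9002": "user=77;cart=0",
    }
}
print(kv_store["sessions"]["sess:9001"])   # the only access path is the key

# Document store: values are tag-encoded documents, grouped into collections.
doc_store = {
    "customers": [                                        # a collection
        {"_id": 1, "name": "Juan", "orders": [501, 502]},
        {"_id": 2, "name": "Ana", "city": "Cebu"},        # structure can differ per document
    ]
}
# Unlike the KV store, the database can see inside the documents' tags.
print([d["name"] for d in doc_store["customers"] if "orders" in d])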
• Graph databases store relationship-rich data as a collection of nodes and edges.
  - Properties are the attributes of a node or edge that are of interest to a user.
  - Traversal is a query in a graph database.
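A small Python sketch of nodes, edges, properties, and a traversal, using an invented social-graph example; a real graph database (Neo4j, for instance) would provide its own storage engine and query language for this.

# Nodes and edges, each carrying properties of interest to the user.
nodes = {
    "A": {"name": "Alice", "role": "analyst"},
    "B": {"name": "Bob", "role": "engineer"},
    "C": {"name": "Carol", "role": "manager"},
}
edges = {            # adjacency list: node -> nodes it "knows"
    "A": ["B"],
    "B": ["C"],
    "C": [],
}

def traverse(start, depth):
    """Traversal: follow edges outward from a start node up to a given depth."""
    frontier, seen = [start], {start}
    for _ in range(depth):
        frontier = [nxt for node in frontier for nxt in edges[node] if nxt not in seen]
        seen.update(frontier)
    return [nodes[n]["name"] for n in seen]

print(traverse("A", 2))   # everyone reachable from Alice within two hops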
DATA INFORMATION DECISION MAKING CYCLE

CASE TOOLS
Computer-Aided Systems Engineering
• Automated framework for the SDLC
• Structured methodologies and powerful graphical interfaces
• Front-end CASE tools provide support for the planning, analysis, and design phases.
• Back-end CASE tools provide support for the coding and implementation phases.
• A typical CASE tool has five components.

CASE TOOL COMPONENTS
Graphic design
• Produces structured diagrams (DFDs, ERDs, class diagrams, object diagrams)
Screen painters and report generators
• Produce the information system’s input and output formats (end-user interface)
Integrated repository
• Stores and cross-references the system design data; includes a comprehensive data dictionary
Analysis segment
• Provides a fully automated check on system consistency, syntax, and completeness
Program documentation generator

DBMS FACILITIES xxx
ADVANTAGES OF SQL xxx

DBMS facilitates:
• Interpretation and presentation of data
• Distribution of data and information
• Preservation and monitoring of data
• Control over data duplication and use

MANAGING USERS AND ESTABLISHING SECURITY
User: a uniquely identifiable object
• Allows a given person to log on to the database
Role: a named collection of database access privileges
• Authorizes a user to connect to the database and use system resources
Profile: a named collection of settings
• Controls how much of a resource a given user can use
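The user/role/profile terms above follow Oracle-style database administration. Below is a hedged sketch of what creating them might look like, written as SQL string literals inside Python; the names, password, privileges, and limits are invented, the statements are only printed rather than executed, and the exact syntax varies by DBMS and should be checked against its documentation.

# Illustrative Oracle-style DDL only; this script just prints the statements.
ddl_statements = [
    # Profile: a named collection of settings that limits resource usage.
    "CREATE PROFILE clerk_profile LIMIT SESSIONS_PER_USER 2 CONNECT_TIME 60",
    # User: a uniquely identifiable object that can log on to the database.
    "CREATE USER jdelacruz IDENTIFIED BY S3cret_pw PROFILE clerk_profile",
    # Role: a named collection of database access privileges.
    "CREATE ROLE sales_clerk",
    "GRANT CREATE SESSION TO sales_clerk",
    "GRANT SELECT, INSERT ON orders TO sales_clerk",
    # Authorize the user by granting the role.
    "GRANT sales_clerk TO jdelacruz",
]

for stmt in ddl_statements:
    print(stmt + ";")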
