IT 222 POINTERS TO REVIEW
BIG DATA CHARACTERISTICS
(1) VOLUME: Quantity of data to be stored
• As the quantity of data needing to be stored increases, the need for larger storage devices increases as well.
• Scaling up means keeping the same number of systems but migrating each one to a larger system.
• Scaling out means that when the workload exceeds the capacity of a server, it is spread out across several servers.
(2) VELOCITY: Speed at which data is entered into the system and must be processed
• Can be broken down into two categories:
  - Stream processing focuses on input processing and requires analysis of the data stream as it enters the system.
  - Feedback loop processing refers to the analysis of the data to produce actionable results.
(3) VARIETY: Variations in the structure of the data to be stored
• Structured data fits into a predefined data model.
• Unstructured data is not organized to fit into a predefined data model.
• Semi-structured data combines elements of both – some parts of the data fit a predefined model while other parts do not.
(4) VARIABILITY: Changes in the meaning of the data based on context
• Sentiment analysis attempts to determine the attitude of a statement (positive, negative, or neutral).
(5) VERACITY: Trustworthiness of the data
(6) VALUE: The degree to which the data can be analyzed to provide meaningful insights
(7) VISUALIZATION: The ability to graphically present the data in such a way as to make it understandable to users

HADOOP
• De facto standard for most Big Data storage and processing
• Java-based framework for distributing and processing very large data sets across clusters of computers
• Most important components:
  - Hadoop Distributed File System (HDFS): low-level distributed file processing system that can be used directly for data storage
  - MapReduce: programming model that supports processing large data sets (a minimal sketch of the idea follows the HDFS notes below)

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• Approach is based on several key assumptions:
  - High volume: the default block size is 64 MB and can be configured to even larger values.
  - Write-once, read-many: this model simplifies concurrency issues and improves overall data throughput.
  - Streaming access: Hadoop is optimized for batch processing of entire files as a continuous stream of data.
  - Fault tolerance: HDFS is designed to replicate data across many different devices so that when one device fails, the data is still available from another.
• A data node communicates with the name node by regularly sending block reports and heartbeats.
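To make the MapReduce programming model mentioned above concrete, the following is a minimal pure-Python sketch of the map → shuffle → reduce flow for a word count. It is only an illustration of the idea, not how Hadoop itself is programmed (real jobs are typically written in Java or run through Hadoop Streaming), and the function names used here are invented for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in an input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all counts emitted for one word."""
    return (key, sum(values))

documents = ["big data needs big storage", "data drives decision making"]

# Simulate the framework: map every split, shuffle by key, then reduce per key.
mapped = [pair for doc in documents for pair in map_phase(doc)]
for word, counts in sorted(shuffle(mapped).items()):
    print(reduce_phase(word, counts))   # e.g. ('big', 2), ('data', 2), ...
```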
HADOOP ECOSYSTEM
• MapReduce simplification applications:
  - Hive is a data warehousing system that sits on top of HDFS and supports its own SQL-like query language.
  - Pig compiles a high-level scripting language (Pig Latin) into MapReduce jobs for execution in Hadoop.
• Data ingestion applications:
  - Flume is a component for ingesting data into Hadoop.
  - Sqoop is a tool for converting data back and forth between a relational database and HDFS.
• Direct query applications:
  - HBase is a column-oriented NoSQL database designed to sit on top of HDFS that quickly processes sparse data sets.
  - Impala was the first SQL-on-Hadoop application.

NOSQL
• Name given to the non-relational database technologies developed to address Big Data challenges
• Key-value (KV) databases store data as a collection of key-value pairs organized as buckets, which are the equivalent of tables.
• Document databases store data in key-value pairs in which the value components are tag-encoded documents grouped into logical groups called collections.
• Column-oriented databases refer to two technologies:
  - Column-centric storage: data is stored in blocks that hold data from a single column across many rows.
  - Row-centric storage: data is stored in blocks that hold data from all columns of a given set of rows.
• Graph databases store relationship-rich data as a collection of nodes and edges (see the sketch after this list).
  - Properties are the attributes of a node or edge that are of interest to a user.
  - Traversal is a query in a graph database.
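As a purely illustrative sketch of how the same information might be shaped in a key-value bucket, a document collection, and a graph, here is a small example using plain Python data structures; the names (customers_bucket, orders_collection, and so on) are invented for this example and do not correspond to any particular NoSQL product.

```python
# Key-value style: a "bucket" maps opaque keys to values (buckets ~ tables).
customers_bucket = {
    "cust:1001": "Alice Reyes",
    "cust:1002": "Bob Cruz",
}

# Document style: a "collection" of tag-encoded (here JSON-like) documents.
orders_collection = [
    {"_id": 1, "customer": "cust:1001", "items": ["laptop", "mouse"], "total": 1250.00},
    {"_id": 2, "customer": "cust:1002", "items": ["keyboard"], "total": 45.50},
]

# Graph style: nodes and edges, each with properties; a traversal follows edges.
nodes = {
    "cust:1001": {"label": "Customer", "name": "Alice Reyes"},
    "cust:1002": {"label": "Customer", "name": "Bob Cruz"},
    "prod:laptop": {"label": "Product", "name": "Laptop"},
}
edges = [
    ("cust:1001", "BOUGHT", "prod:laptop", {"qty": 1}),
    ("cust:1002", "KNOWS", "cust:1001", {"since": 2020}),
]

def traverse(start, relationship):
    """A minimal 'traversal': nodes reachable from start via one edge type."""
    return [dst for src, rel, dst, _props in edges if src == start and rel == relationship]

print(traverse("cust:1002", "KNOWS"))   # ['cust:1001']
```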
DATA INFORMATION DECISION MAKING CYCLE

DATA ANALYTICS
• Subset of business intelligence (BI) functionality that encompasses the mathematical, statistical, and modeling techniques used to extract knowledge from data
• A continuous spectrum of knowledge acquisition that goes from discovery to explanation to prediction
• Explanatory analytics focuses on discovering and explaining data characteristics based on existing data.
• Predictive analytics focuses on predicting future data outcomes with a high degree of accuracy.
• Used to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more-informed business decisions.

PREDICTIVE ANALYTICS
• Refers to the use of advanced mathematical, statistical, and modeling tools to predict future business outcomes with a high degree of accuracy
• Focuses on creating actionable models to predict future behaviors and events
• Most BI vendors are dropping the term data mining and replacing it with predictive analytics.
• Models are used in customer service, fraud detection, targeted marketing, and optimized pricing.
• Can add value in many different ways, but needs to be monitored and evaluated to determine the return on investment (a toy sketch follows this list).
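As a toy illustration of the "predict future outcomes from existing data" idea rather than of any specific vendor tool, the sketch below fits a simple least-squares trend line to made-up monthly sales figures and extrapolates one month ahead.

```python
# Toy predictive model: fit a least-squares trend line to past sales, then extrapolate.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 160, 178, 195]   # made-up historical figures

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Ordinary least-squares slope and intercept for y = slope * x + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

next_month = 7
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month}: {forecast:.1f} units")
```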
CASE TOOLS
• Computer-Aided Systems Engineering
• Automated framework for the SDLC
• Structured methodologies and powerful graphical interfaces
• Front-end CASE tools provide support for the planning, analysis, and design phases.
• Back-end CASE tools provide support for the coding and implementation phases.
• A typical CASE tool has five components.

CASE TOOL COMPONENTS
• Graphics design: produces structured diagrams (DFDs, ERDs, class diagrams, object diagrams)
• Screen painters and report generators: produce the information system's input and output formats (the end-user interface)
• Integrated repository: stores and cross-references the system design data; includes a comprehensive data dictionary
• Analysis segment: provides a fully automated check on system consistency, syntax, and completeness
• Program documentation generator

DBMS FACILITIES xxx
A DBMS facilitates:
• Interpretation and presentation of data
• Distribution of data and information
• Preservation and monitoring of data
• Control over data duplication and use

ADVANTAGES OF SQL xxx

MANAGING USERS AND ESTABLISHING SECURITY
• User: a uniquely identifiable object that allows a given person to log on to the database
• Role: a named collection of database access privileges that authorizes a user to connect to the database and use system resources
• Profile: a named collection of settings that controls how much of a resource a given user can use
(A conceptual sketch of how these three objects relate follows this list.)
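As a purely conceptual illustration of how users, roles, and profiles relate, here is a small Python sketch; the class names, privilege strings, and limit values are invented for the example. In a real DBMS such as Oracle these objects are created with SQL statements such as CREATE USER, CREATE ROLE, and CREATE PROFILE rather than with application code.

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    """A named collection of database access privileges."""
    name: str
    privileges: set = field(default_factory=set)

@dataclass
class Profile:
    """A named collection of settings that limits how much of a resource a user can use."""
    name: str
    sessions_per_user: int = 1
    idle_time_minutes: int = 30

@dataclass
class User:
    """A uniquely identifiable object that can log on to the database."""
    username: str
    roles: list = field(default_factory=list)
    profile: Profile = None

    def has_privilege(self, privilege: str) -> bool:
        # A user holds a privilege if any of their assigned roles grants it.
        return any(privilege in role.privileges for role in self.roles)

# Hypothetical example: one role, one profile, one user.
clerk = Role("clerk", {"CREATE SESSION", "SELECT ON orders"})
limited = Profile("limited", sessions_per_user=2, idle_time_minutes=15)
alice = User("alice", roles=[clerk], profile=limited)

print(alice.has_privilege("SELECT ON orders"))   # True
print(alice.has_privilege("DROP TABLE orders"))  # False
```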