
UNIT 5
Applications of Big Data using :
1. Pig :
Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction for processing over MapReduce.
It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.
Applications :

1. Exploring large datasets with Pig scripts.
2. Running ad-hoc queries across large datasets.
3. Prototyping algorithms for processing large datasets.
4. Processing time-sensitive data loads.
5. Collecting large amounts of data such as search logs and web crawls.
6. Deriving analytical insights using sampling.

2. Hive :
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing
easy.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Benefits :

1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity

3. HBase :
HBase is a column-oriented non-relational database management system that runs on
top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in
many big data use cases.
HBase supports writing applications using Apache Avro, REST, and Thrift APIs.
Applications :

1. Medical
2. Sports
3. Web
4. Oil and petroleum
5. e-commerce
PIG
Introduction to PIG :
Apache Pig's name is derived from its ability to handle various types of
data, much like a pig can eat almost anything. The name also reflects the
project's goal to simplify the process of analyzing large datasets by
providing a high-level programming language and infrastructure on top of
Hadoop's MapReduce framework.

Pig is a high-level platform or tool which is used to process large datasets.


It provides a high level of abstraction for processing over MapReduce.
It provides a high-level scripting language, known as Pig Latin which is used
to develop the data analysis codes.
Pig Latin and Pig Engine are the two main components of the Apache Pig tool.
The result of Pig is always stored in the HDFS.
One limitation of MapReduce is that the development cycle is very long.
Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task.
Apache Pig reduces the time of development using the multi-query approach.
Pig is beneficial for programmers who are not from Java backgrounds.
200 lines of Java code can often be written in only about 10 lines using the Pig Latin language.
Programmers who have SQL knowledge need less effort to learn Pig Latin.

Difference between Pig and MapReduce

Apache Pig | MapReduce
It is a scripting language. | It is a compiled programming language.
Abstraction is at a higher level. | Abstraction is at a lower level.
It has fewer lines of code compared to MapReduce. | Lines of code are more.
Less development effort is needed for Apache Pig. | More development effort is required for MapReduce.
Code efficiency is less as compared to MapReduce. | Efficiency of code is higher as compared to Pig.
Pig provides built-in functions for ordering, sorting and union. | Such data operations are hard to perform.
It allows nested data types like map, tuple and bag. | It does not allow nested data types.

Execution Modes of Pig :


Apache Pig scripts can be executed in three ways :
Interactive Mode (Grunt shell) :

You can run Apache Pig in interactive mode using the Grunt shell.
In this shell, you can enter the Pig Latin statements and get the output (using the Dump
operator).
Batch Mode (Script) :
You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file
with the .pig extension.
Embedded Mode (UDF) :
Apache Pig allows us to define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
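For illustration, the interactive and batch modes can be invoked from the command line roughly as follows (the script name is assumed for the example) :
$ pig (starts the Grunt shell in interactive mode)
$ pig wordcount.pig (runs the Pig Latin statements in wordcount.pig in batch mode)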
Comparison of Pig with Databases :

PIG | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
In Apache Pig, the schema is optional; we can store data without designing a schema (fields are then referenced by position, as $0, $1, etc.). | Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.

Grunt :
After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to
that, there are certain useful shell and utility commands provided by the Grunt shell.
This section explains the shell and utility commands provided by the Grunt shell.
The Grunt shell is an interactive command-line shell.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Pig scripts can be executed in the Grunt shell, which is the native shell provided by Apache Pig to execute Pig queries.
We can invoke shell commands using sh and fs.
Syntax of the sh command :
grunt> sh ls
Syntax of the fs command :
grunt> fs -ls

Pig Latin :
The Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce
idiom into a notation.
The Pig Latin statements are used to process the data.
Each statement is an operator that accepts a relation as input and generates another relation as output.
· It can span multiple lines.
· Each statement must end with a semi-colon.
· It may include expressions and schemas.
· By default, these statements are processed using multi-query execution.
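A minimal Pig Latin script, for illustration (the input path, delimiter, and field names below are assumed for the example) :

-- load, filter, group and count student records
records = LOAD '/data/students.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
adults = FILTER records BY age >= 18;
grouped = GROUP adults BY name;
counts = FOREACH grouped GENERATE group AS name, COUNT(adults) AS total;
DUMP counts;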

User-Defined Functions :
Apache Pig provides extensive support for User Defined Functions (UDFs).
Using these UDFs, we can define our own functions and use them.
The UDF support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDFs, complete support is provided in Java and limited support is provided in all the remaining languages.
Using Java, you can write UDFs involving all parts of the processing like data load/store, column transformation, and aggregation.
Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in other languages.
Types of UDFs in Java :
Filter Functions :
• The filter functions are used as conditions in filter statements.
• These functions accept a Pig value as input and return a Boolean value.
Eval Functions :
• The Eval functions are used in FOREACH...GENERATE statements.
• These functions accept a Pig value as input and return a Pig result.
Algebraic Functions :
• The Algebraic functions act on inner bags in a FOREACH...GENERATE statement.
• These functions are used to perform full MapReduce operations on an inner bag.
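As an illustration, a simple Eval UDF in Java might look like the following sketch (the class name and behaviour are assumed, not taken from the original notes) :

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Example Eval UDF: converts its first input field to upper case.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for empty or missing input.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

After packaging the class into a jar, it could be registered and used in a script roughly as follows :
REGISTER myudfs.jar;
upper_names = FOREACH records GENERATE UpperCase(name);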

Data Processing Operators :


Apache Pig operators form a high-level procedural language for querying large data sets using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
These operators are the main tools Pig Latin provides to operate on the data. They allow you to transform data by sorting, grouping, joining, projecting, and filtering. The Apache Pig operators can be classified as :

Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
Some of the Relational Operators are :
LOAD: The LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
FOREACH: This operator generates data transformations based on columns of data.
It is used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator is used to perform an inner (equijoin) join of two or more relations based on common field values.
ORDER BY: Order By is used to sort a relation based on one or more fields in either
ascending or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group key
(key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
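The following sketch shows several relational operators together (the file paths, delimiter, and schemas are assumed for the example) :

-- load two datasets, then filter, join, group, count and sort them
emps = LOAD '/data/employees.csv' USING PigStorage(',') AS (id:int, name:chararray, dept_id:int, salary:double);
depts = LOAD '/data/departments.csv' USING PigStorage(',') AS (dept_id:int, dept_name:chararray);
high_paid = FILTER emps BY salary > 50000.0;
joined = JOIN high_paid BY dept_id, depts BY dept_id;
by_dept = GROUP joined BY depts::dept_name;
dept_counts = FOREACH by_dept GENERATE group AS dept_name, COUNT(joined) AS emp_count;
sorted = ORDER dept_counts BY emp_count DESC;
STORE sorted INTO '/output/dept_counts' USING PigStorage(',');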
Diagnostic Operator :
The load statement will simply load the data into the specified relation in Apache Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the
results on the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular
relation. The DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: The ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements, and is very helpful when debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and
MapReduce execution plans of a relation.
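For illustration, these diagnostic operators could be used in the Grunt shell as follows (the relation names come from the assumed example above) :

grunt> DESCRIBE emps;
grunt> DUMP emps;
grunt> EXPLAIN dept_counts;
grunt> ILLUSTRATE dept_counts;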

Hive
Apache Hive Architecture :

The architecture of Apache Hive comprises the following major components :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in languages such as Python, Java, C++, and Ruby, using JDBC, ODBC, and Thrift drivers, for performing queries on Hive.
Hence, one can easily write a Hive client application in a language of one's own choice.
Hive clients are categorized into three types :
1. Thrift Clients : The Hive server is based on Apache Thrift so that it can serve
the request from a thrift client.
2. JDBC client : Hive allows for the Java applications to connect to it using the
JDBC driver. JDBC driver uses Thrift to communicate with the Hive Server.
3. ODBC client : Hive ODBC driver allows applications based on the ODBC
protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift
to communicate with the Hive Server.
HIVE SERVICE :
To perform all queries, Hive provides various services such as HiveServer2, Beeline, etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat
PROCESSING AND RESOURCE MANAGEMENT :
Hive internally uses the MapReduce framework as the de facto engine for executing queries.
MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware.
A MapReduce job works by splitting data into chunks, which are processed by map and reduce tasks.

DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File
System for the distributed storage.

Hive Shell :
The Hive shell is the primary way to interact with Hive.
It is the default service in Hive.
It is also called the CLI (command-line interface).
The Hive shell is similar to the MySQL shell.
Hive users can run HQL queries in the Hive shell.
In the Hive shell, the up and down arrow keys are used to scroll through previous commands.
HiveQL is case-insensitive (except for string comparisons).
The Tab key autocompletes Hive keywords and functions (it provides suggestions as you type).
Hive Shell can run in two modes :
Non-Interactive mode :
In non-interactive mode, the shell executes the statements in a script file instead of reading commands interactively.
Hive Shell can run in the non-interactive mode with the -f option.
Example:
$ hive -f script.q, where script.q is a file containing HiveQL statements.
Interactive mode :
Hive works in interactive mode when the command “hive” is typed directly in the terminal.
Example:
$ hive
hive> show databases;
What is Partition in Hive?
Apache Hive organizes tables into partitions. Partitioning is a way to divide a table into related sections based on the values of particular columns such as date, town, and section. Every table in Hive can have one or more partition keys that identify a particular partition. Using partitions, it is quick to run queries on slices of the data in Hadoop.
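For illustration, a partitioned table might be created and queried as follows (the table name, columns, and partition values are assumed for the example) :

CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (sale_date STRING);

LOAD DATA INPATH '/data/sales_jan01' INTO TABLE sales PARTITION (sale_date = '2024-01-01');

SELECT item, amount FROM sales WHERE sale_date = '2024-01-01';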

Hive Services :
The following are the services provided by Hive :
• Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands.
• Hive Web User Interface: The Hive Web UI is just an alternative of Hive CLI.
It provides a web-based GUI for executing Hive queries and commands.
• Hive metastore: It is a central repository that stores all the structural information of various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
• Hive Server: It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver.
• Hive Driver: It receives queries from different sources like web UI, CLI, Thrift,
and JDBC/ODBC driver. It transfers the queries to the compiler.
• Hive Compiler: The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts
HiveQL statements into MapReduce jobs.
• Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the order of their dependencies.

MetaStore :

Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a
backend RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode,
which represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to
HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS instances operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.

HMS connects directly to Ranger and the NameNode (HDFS), and so does
HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.
Comparison with Traditional Database :

RDBMS | HIVE
It is used to maintain a database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
Schema is fixed in RDBMS. | Schema varies in Hive.
Normalized data is stored. | Both normalized and de-normalized data can be stored.
Tables in RDBMS are sparse. | Tables in Hive are dense.
It does not support partitioning. | It supports automatic partitioning.
No partition method is used. | The sharding method is used for partitioning.

HiveQL :
Even though based on SQL, HiveQL does not strictly follow the full SQL-92
standard.
HiveQL offers extensions not in SQL, including multitable inserts and create table as
select.
HiveQL initially lacked support for transactions and materialized views, and offered only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available
with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
DROP TABLE IF EXISTS docs;
CREATE TABLE docs
(line STRING);

This checks whether the table docs exists and drops it if it does, then creates a new table called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (In this case “input_file”) into the table.
OVERWRITE specifies that the target table to which the data is being loaded is to be
re-written; Otherwise, the data would be appended.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count.
This query draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp.
The inner query splits each input line into words, placing each word in its own row of a temporary table aliased as temp.
The GROUP BY word clause groups the results by word, so the count column holds the number of occurrences of each word.
The ORDER BY word clause sorts the words alphabetically.
Tables :
Here are the types of tables in Apache Hive:
Managed Tables :

In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data
warehouse, which is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table, then
Hive will delete both the Schema (the description of the table) and the data files
associated with the table.
Default location is /user/hive/warehouse.
The syntax for Managed Tables :
CREATE TABLE IF NOT EXISTS stocks (exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the data
file(s) there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated with
the table.
The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Querying Data :
A query is a request for data or information from a database table or a combination of
tables.
This data may be generated as results returned by Structured Query Language (SQL)
or as pictorials, graphs or complex results, e.g., trend analyses from data-mining tools.
One of several different query languages may be used to perform a range of simple to
complex database queries.
SQL, the most well-known and widely used query language, is familiar to most database administrators (DBAs).

User-Defined Functions :
In Hive, the users can define their own functions to meet certain client requirements.

These are known as UDFs in Hive.
User-Defined Functions are written in Java for specific modules.
Some UDFs are specifically designed for the reusability of code in application frameworks.
The developer develops these functions in Java and integrates the UDFs with Hive.
During query execution, the developer can directly use the code, and the UDFs will return outputs according to the user-defined tasks.
This provides high performance in terms of coding and execution.
The general type of UDF accepts a single input value and produces a single output value.
We can use two different interfaces for writing Apache Hive User-Defined Functions :
1. Simple API
2. Complex API
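For illustration, a UDF using the simple API might look like the following sketch (the class name and behaviour are assumed, not taken from the original notes) :

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Example simple-API UDF: returns the upper-cased form of a string.
public class ToUpper extends UDF {
    public Text evaluate(Text input) {
        // Return null when the input is null.
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}

After packaging the class into a jar, it could be registered and used roughly as follows :
ADD JAR my_udfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'ToUpper';
SELECT to_upper(name) FROM students;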
Sorting And Aggregating :

Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but
there is a catch.
ORDER BY produces a result that is totally sorted, as expected, but to do so it sets the
number of reducers to one, making it very inefficient for large datasets.
When a globally sorted result is not required, and in many cases it isn't, you can use Hive's nonstandard extension, SORT BY, instead.
SORT BY produces a sorted file per reducer.
If you want to control which reducer a particular row goes to, typically so that you can perform some subsequent aggregation, you can use Hive's DISTRIBUTE BY clause.
Example :
To sort the weather dataset by year and temperature, so that all the rows for a given year end up in the same reducer partition :

hive> FROM records2
    > SELECT year, temperature
    > DISTRIBUTE BY year
    > SORT BY year ASC, temperature DESC;
1949 78
1950 22
1950 0
1950 -11

MapReduce Scripts in Hive / Hive Scripts :

Similar to any other scripting language, Hive scripts are used to execute a set of Hive
commands collectively.
Hive scripting helps us to reduce the time and effort invested in writing and executing
the individual commands manually.
Hive scripting is supported in Hive 0.10.0 and higher versions.
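As an illustration, a small Hive script and its invocation might look like this (the file name, database, and table are assumed for the example) :

-- sample_script.q
CREATE DATABASE IF NOT EXISTS college;
USE college;
CREATE TABLE IF NOT EXISTS students (roll_no INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SELECT * FROM students LIMIT 10;

The script can then be run with :
$ hive -f sample_script.q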
Joins and SubQueries :
JOINS :
Join queries can be performed on two tables present in Hive.
Joins are of 4 types :
· Inner join: The Records common to both tables will be retrieved by
this Inner Join.
· Left outer Join: Returns all the rows from the left table even though there
are no matches in the right table.
· Right Outer Join: Returns all the rows from the Right table even though
there are no matches in the left table.
· Full Outer Join: It combines the records of both tables based on the JOIN condition given in the query. It returns all the records from both tables and fills in NULL values for the columns where no match is found on either side.
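For illustration, a left outer join in HiveQL might be written as follows (the tables and columns are assumed for the example) :

SELECT c.id, c.name, o.amount
FROM customers c
LEFT OUTER JOIN orders o ON (c.id = o.customer_id);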

SUBQUERIES :
In SQL, a subquery can be defined as a query embedded within another query. It is
often used in the WHERE, HAVING, or FROM clauses of a statement.
Subqueries are commonly used with SELECT, UPDATE, INSERT,
and DELETE statements to achieve complex filtering and data manipulation.
They are an essential tool when we need to perform operations like:

• Filtering: Getting specific records based on conditions derived from another
query.
• Aggregating: Performing aggregate functions like SUM, COUNT, or AVG based
on subquery results.
• Updating: Dynamically updating records based on values from other tables.
• Deleting: Deleting records from one table using criteria based on another.
While there is no universal syntax for subqueries, they are commonly used in SELECT statements as follows. This general syntax allows the outer query to use the results of the inner subquery for filtering or other operations. Subqueries are typically used :
· To get a particular value combined from two column values from different tables.
· When the values of one table depend on another table.
· For comparative checking of one column's values against other tables.
Syntax :

Subquery in FROM clause :
SELECT <column names 1, 2...n> FROM (SubQuery) <TableName_Main>;

Subquery in WHERE clause :
SELECT <column names 1, 2...n> FROM <TableName_Main> WHERE col1 IN (SubQuery);

SELECT column_name
FROM table_name
WHERE column_name expression operator
(SELECT column_name FROM table_name WHERE ...);
Key Characteristics of Subqueries
1. Nested Structure: A subquery is executed within the context of an outer query.
2. Parentheses: Subqueries must always be enclosed in parentheses ().
3. Comparison Operators: Subqueries can be used with operators
like =, >, <, IN, NOT IN, LIKE, etc.
4. Single-Row vs. Multi-Row Subqueries: Subqueries may return a single value
(e.g., a single row) or multiple values. Depending on the result, different SQL
constructs may be required.
Common SQL Clauses for Subqueries
Subqueries are frequently used in specific SQL clauses to achieve more
complex results. Here are the common clauses where subqueries are used:

1. WHERE Clause: Subqueries in the WHERE clause help filter data based
on the results of another query. For example, you can filter records based on
values returned by a subquery.
2. FROM Clause: Subqueries can be used in the FROM clause to treat the
result of the subquery as a derived table or temporary table that can be
joined with other tables.
3. HAVING Clause: Subqueries in the HAVING clause allow you to filter
aggregated data after performing group operations.

Types of Subqueries
1. Single-Row Subquery: Returns a single value (row). Useful with comparison
operators like =, >, <.
2. Multi-Row Subquery: Returns multiple values (rows). Useful with operators
like IN, ANY, ALL.
3. Correlated Subquery: Refers to columns from the outer query in the subquery.
Unlike regular subqueries, the subquery depends on the outer query for its values.
4. Non-Correlated Subquery: Does not refer to the outer query and can be executed
independently.
Examples of Using SQL Subqueries
These examples showcase how subqueries can be used for various
operations like selecting, updating, deleting, or inserting data, providing
insights into their syntax and functionality. Through these examples, we
will understand the flexibility and importance of subqueries in
simplifying complex database tasks. Consider the following two tables:
1. DATABASE table
2. STUDENT table

Example 1: Fetching Data Using Subquery in WHERE Clause


This example demonstrates how a subquery retrieves the roll numbers of students in section ‘A’, and the outer query uses those roll numbers to fetch the corresponding details (name, location, and phone number) from the DATABASE table. This enables filtering based on results from another table.
Query:
SELECT NAME, LOCATION, PHONE_NUMBER
FROM DATABASE
WHERE ROLL_NO IN (
SELECT ROLL_NO FROM STUDENT WHERE SECTION='A'
);
Output
NAME LOCATION PHONE_NUMBER

Ravi Salem 8989898989

Raj Coimbatore 8877665544

HBASE
HBase Concepts :
Apache HBase is a column-oriented, non-relational, open-source, and distributed
database management system that is developed based on the concept of Google's
Bigtable. Apache HBase is written in the Java language and is used for real-time processing and random read and write operations on huge datasets. It can easily be deployed and run on Hadoop.
Apache HBase is not a relational database management system and hence it does not support the structured query language (SQL). We can write an HBase application in the Java language, similar to Apache MapReduce; HBase also supports creating applications in other frameworks such as Apache Avro, REST, and so on.
Apache HBase stores data in tables in the form of rows and columns, just like a traditional database management system. Each table has a primary key (the row key) that is used to access it. Apache HBase uses ZooKeeper for coordination. For a production environment, it is suggested to use a dedicated ZooKeeper cluster that is integrated with the Apache HBase cluster.

HBase Vs RDBMS :

RDBMS | HBase
It requires SQL (Structured Query Language). | No SQL is required.
It has a fixed schema. | It has no fixed schema.
It is row-oriented. | It is column-oriented.
It is not scalable. | It is scalable.
It is static in nature. | It is dynamic in nature.
Slower retrieval of data. | Faster retrieval of data.
It follows the ACID (Atomicity, Consistency, Isolation and Durability) properties. | It follows the CAP (Consistency, Availability, Partition-tolerance) theorem.
It can handle structured data. | It can handle structured, unstructured as well as semi-structured data.
It cannot handle sparse data. | It can handle sparse data.

Schema Design :
An HBase table can scale to billions of rows and any number of columns based on your requirements.
Such a table allows you to store terabytes of data in it.
The HBase table supports high read and write throughput at low latency.
A single value in each row is indexed; this value is known as the row key.
The HBase schema design is very different compared to the relational database
schema design.
Some of the general concepts that should be followed while designing schema in
Hbase:
· Row key: Each table in HBase is indexed on the row key. There are no secondary indices available on an HBase table.
· Atomicity: Avoid designing a table that requires atomicity across all rows. All operations on HBase rows are atomic at the row level.
· Even distribution: Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows to increase read efficiency.
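For illustration, a simple table with one column family could be created and used from the HBase shell as follows (the table name, column family, and values are assumed for the example) :

create 'patients', 'info'
put 'patients', 'row1', 'info:name', 'Ravi'
put 'patients', 'row1', 'info:age', '30'
get 'patients', 'row1'
scan 'patients'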

Zookeeper :
ZooKeeper is a distributed coordination service that also helps to manage a large set
of hosts.
Managing and coordinating a service especially in a distributed environment is a
complicated process, so ZooKeeper solves this problem due to its simple architecture
as well as API.
ZooKeeper allows developers to focus on core application logic.
For instance, to track the status of distributed data, Apache HBase uses ZooKeeper.

It can also support a large Hadoop cluster easily.
To retrieve information, each client machine communicates with one of the servers.
It keeps an eye on the synchronization as well as coordination across the cluster.
Some of the best Apache ZooKeeper features are :
· Simplicity: With the help of a shared hierarchical namespace, it coordinates.
· Reliability: The system keeps performing, even if more than one node fails.
· Speed: In the cases where ‘Reads’ are more common, it runs with the ratio of
10:1.
· Scalability: By deploying more machines, the performance can be enhanced.

IBM Big Data Strategy :


IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data strategy, under which the company offered solutions to store, manage, and analyze the huge amounts of data generated daily and equipped large and small companies to make informed business decisions.
The company believed that its Big Data and analytics products and services would
help its clients become more competitive and drive growth.
Issues :
· Understand the concept of Big Data and its importance to large, medium, and
small companies in the current industry scenario.
· Understand the need for implementing a Big Data strategy and the various issues
and challenges associated with this.
· Analyze the Big Data strategy of IBM.
· Explore ways in which IBM’s Big Data strategy could be improved further.

Introduction to InfoSphere :
InfoSphere Information Server provides a single platform for data integration and
governance.
The components in the suite combine to create a unified foundation for enterprise
information architectures, capable of scaling to meet any information volume
requirements.
You can use the suite to deliver business results faster while maintaining data quality
and integrity throughout your information landscape.

InfoSphere Information Server helps your business and IT personnel collaborate to
understand the meaning, structure, and content of information across a wide variety
of sources.
By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency,
and lower risk.
BigInsights :

BigInsights is a software platform for discovering, analyzing, and visualizing


data from disparate sources.
The flexible platform is built on an Apache Hadoop open-source framework that
runs in parallel on commonly available, low-cost hardware.

Big Sheets :

BigSheets is a browser-based analytic tool included in the InfoSphere BigInsights


Console that you use to break large amounts of unstructured data into
consumable, situation-specific business contexts.
These deep insights help you to filter and manipulate data from sheets even further.

Intro to Big SQL :

IBM Big SQL is a high performance massively parallel processing (MPP) SQL
engine for Hadoop that makes querying enterprise data from across the organization
an easy and secure experience.
A Big SQL query can quickly access a variety of data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single
database connection or single query for best-in-class analytic capabilities.
Big SQL provides tools to help you manage your system and your databases, and
you can use popular analytic tools to visualize your data.
Big SQL's robust engine executes complex queries for relational data and
Hadoop data.
