Rec It-It17701 Data Analytics Unit 1 Part - II
• PIG has two parts: Pig Latin, the language, and the Pig runtime, the
execution environment. The relationship is comparable to Java and the JVM.
• It supports the Pig Latin language, which has an SQL-like command structure.
• Not everyone comes from a programming background, and Apache PIG
helps them. You might be curious to know how.
• Well, there is an interesting fact:
• 10 lines of Pig Latin = approx. 200 lines of Map-Reduce Java code
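To give a feel for why a data-flow language can be so compact, the sketch below mimics a Pig-style word count (load, tokenize, group, count) in plain Python. It is only an illustration of the data-flow idea, not Pig itself: the function name `word_count` and the sample lines are assumptions made for this example.

```python
from collections import Counter

def word_count(lines):
    """Mimic a load -> tokenize -> group -> count data flow on in-memory lines."""
    # Tokenize: flatten every line into individual lower-cased words.
    words = [word.lower() for line in lines for word in line.split()]
    # Group and count: tally how often each distinct word occurs.
    return dict(Counter(words))

# Hypothetical input standing in for a file stored in HDFS.
sample_lines = ["pig latin is a data flow language", "pig runs on hadoop"]
counts = word_count(sample_lines)
print(counts["pig"])
```

Each step is a single high-level statement, which is the same reason a short Pig Latin script can replace a much longer hand-written Map-Reduce program.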
• Facebook created HIVE for people who are fluent in SQL.
• Basically, HIVE is a data warehousing component that reads, writes,
and manages large data sets in a distributed environment through an
SQL-like interface.
• HIVE + SQL = HQL
• A Flume agent ingests streaming data from various data sources into
HDFS. In the diagram, the web server represents the data source.
Twitter is one of the best-known sources of streaming data.
• When we submit a Sqoop command, the main task is divided into
sub-tasks, each handled internally by an individual Map Task. Each Map
Task imports part of the data into the Hadoop ecosystem. Collectively,
the Map Tasks import the whole data set.
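The divide-and-combine idea above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Sqoop's actual implementation: the helper names `split_range` and `map_task`, the in-memory `table`, and the choice of four mappers are all hypothetical.

```python
def split_range(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    splits, start = [], min_id
    for i in range(num_mappers):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: skip empty chunks
        splits.append((start, start + size - 1))
        start += size
    return splits

def map_task(table, lo, hi):
    """One sub-task: import only the rows whose key falls in [lo, hi]."""
    return [row for row in table if lo <= row["id"] <= hi]

# Hypothetical source table with primary keys 1..10.
table = [{"id": i, "name": f"user{i}"} for i in range(1, 11)]
splits = split_range(1, 10, 4)

# Each map task imports its own chunk; together they cover the whole table.
imported = [row for lo, hi in splits for row in map_task(table, lo, hi)]
print(splits)
```

Because the chunks are contiguous and do not overlap, concatenating the results of all map tasks reproduces the full table exactly once.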
10/11/2024 REC\IT-IT17701_Data Analytics_UNIT 1_Part -II 51
APACHE SQOOP
Export also works in a similar manner. When we submit our job, it is
mapped into Map Tasks, which bring chunks of data from HDFS. These
chunks are exported to a structured data destination. Combining all the
exported chunks, we receive the whole data at the destination, which in
most of the ...
APACHE SOLR & LUCENE
• Apache Solr and Apache Lucene are the two services used for
searching and indexing in the Hadoop ecosystem.
• Apache Lucene is a Java-based library that also helps with spell checking.
• If Apache Lucene is the engine, Apache Solr is the car built around it.
Solr is a complete application built around Lucene.
• Solr uses the Lucene Java search library as its core for search and
full indexing.
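To make "indexing and searching" concrete, here is a minimal sketch of an inverted index, the general data structure behind full-text search engines. It is a toy illustration in plain Python, not the Lucene or Solr API: the functions `build_index` and `search` and the sample documents are assumptions made for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: return ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents to index.
docs = {
    1: "Hadoop stores big data",
    2: "Solr searches big indexes",
    3: "Lucene builds indexes",
}
index = build_index(docs)
hits = search(index, "big indexes")
print(hits)
```

Looking up a term is a dictionary access rather than a scan over every document, which is why the index is built once up front and then reused for each query.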