
Saturday, September 19, 2020

Introduction - PySpark for Big Data

Spark is written in Scala and runs on the JVM; through PySpark we can use all of its features from Python. Programs written in PySpark can be submitted to a Spark cluster and run in a distributed manner.

PySpark is the Python API for Spark, allowing Apache Spark and Python to work together.

Apache Spark is made up of several components; at its core, Spark is a generic engine for processing large amounts of data.

A PySpark program isn’t that much different from a regular Python program, but the execution model can be very different, especially when running on a cluster.

Advantages of using PySpark:

  1. Python is a nearly 29-year-old language that is easy to learn and use
  2. Python has very strong community support for dealing with most problems
  3. Py4J is a popular library integrated within PySpark that allows Python to dynamically interface with JVM objects
  4. It provides a simple and comprehensive API
  5. With Python, code readability, maintenance, and familiarity are far better
  6. Python offers various options for data visualization, which is difficult in Scala or Java

How to set up PySpark on your machine?

Version — spark-3.0.0-bin-hadoop3.2

Notes — create a spark directory on your desktop, put the above Spark distribution there, and then create the following three system variables –

SPARK_HOME: this variable must point to your spark directory,

HADOOP_HOME: this variable should point to the Hadoop directory inside the spark directory, such as %SPARK_HOME%\hadoop

PYTHONPATH: this variable should include the Python directories inside the spark directory, such as %SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
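Once these variables are in place, a quick sanity check from Python looks roughly like the following. This is a minimal sketch; the local master and app name are assumptions, and the printed version should match the distribution above.

```python
import os

# Confirm the variables described above are visible to Python.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("HADOOP_HOME"))

from pyspark.sql import SparkSession

# Start a local Spark session to confirm that PySpark and Py4J are wired up
# (local master and app name are assumptions for this check).
spark = (SparkSession.builder
         .master("local[*]")
         .appName("setup-check")
         .getOrCreate())
print(spark.version)   # expected to print 3.0.0 for spark-3.0.0-bin-hadoop3.2
spark.stop()
```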

Components of PySpark

  1. Cluster — A cluster is nothing more than a platform on which to install Spark; Apache Spark is a Big Data processing engine. Spark can run in distributed mode on the cluster, with at least one driver and a master, and the remaining nodes as Spark workers. The Spark driver interacts with the master to find out where the workers are, and then the driver distributes tasks to the workers for computation.
  2. SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create a SparkContext; it acts as the master of the Spark application.
  3. SQLContext is the main entry point for Spark SQL functionality. A SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.
  4. Native Spark: If we use Spark data frames and libraries, Spark will natively parallelize and distribute our task. First we convert the Pandas data frame to a Spark data frame and then perform the required business operations (see the sketch after this list).
  5. Thread Pools: The multiprocessing library can be used to run concurrent Python threads, and even to perform operations with Spark data frames.
  6. Pandas UDFs — With this feature, we can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where our function is applied, and then the results are combined back into one large Spark data frame.
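As a short illustration of points 2, 4, and 6, here is a minimal local sketch. The app name, column names, and the 18% tax rate are illustrative assumptions, a local master is assumed, and the Pandas UDF requires the pyarrow package; in Spark 3.x the SparkSession wraps both the SparkContext and the older SQLContext entry points.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Local sketch only: app name, column names, and the tax rate are illustrative.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("components-demo")
         .getOrCreate())
sc = spark.sparkContext              # the SparkContext behind the session

# "Native Spark": convert a Pandas data frame into a Spark data frame.
pdf = pd.DataFrame({"amount": [10.0, 20.0, 30.0]})
sdf = spark.createDataFrame(pdf)

# A Pandas UDF (requires pyarrow): partitions are converted to Pandas Series,
# the function is applied, and the results are combined back into a Spark column.
@pandas_udf(DoubleType())
def add_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.18             # hypothetical 18% tax rate

sdf.select(add_tax("amount").alias("amount_with_tax")).show()
spark.stop()
```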

How does PySpark work?

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
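The scheduler mode is exposed as an ordinary Spark configuration property. The sketch below (local master and app name are assumptions) shows how to switch from the default FIFO behaviour to fair scheduling and read the setting back.

```python
from pyspark.sql import SparkSession

# A minimal sketch: spark.scheduler.mode accepts "FIFO" (the default) or "FAIR".
spark = (SparkSession.builder
         .master("local[*]")
         .appName("scheduler-demo")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.scheduler.mode"))  # -> FAIR
spark.stop()
```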

  1. MultiThreading — The threading module uses threads, which run in the same memory space. Since threads share memory, precautions have to be taken so that two threads do not write to the same memory at the same time. It is a good option for I/O-bound applications (a combined sketch follows this list). Benefits –

a. Multithreading is concurrency

b. Multithreading is for hiding latency

c. Multithreading is best for I/O

  2. MultiProcessing — The multiprocessing module uses processes, which have separate memory. Multiprocessing gets around the Global Interpreter Lock and takes advantage of multiple CPUs and cores. Benefits –

a. Multiprocessing is parallelism

b. Multiprocessing is for increasing speed

c. Multiprocessing is best for computations

  3. Map() function

map() applies a function to each item in an iterable, and it always produces a 1-to-1 mapping of the original items.
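To make the three options concrete, here is a small, self-contained sketch. The worker functions, pool sizes, and numbers are illustrative assumptions, and nothing here is a Spark API.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

# Illustrative workloads (assumptions for this sketch only).
def slow_io_task(n):
    time.sleep(0.1)          # pretend network/disk wait: threads hide this latency
    return n

def cpu_task(n):
    return sum(math.sqrt(i) for i in range(n))   # pure computation: processes add speed

if __name__ == "__main__":
    # 1. MultiThreading: concurrency, good for I/O-bound work.
    with ThreadPoolExecutor(max_workers=4) as pool:
        io_results = list(pool.map(slow_io_task, range(8)))

    # 2. MultiProcessing: parallelism across CPUs/cores, good for computation.
    with Pool(processes=4) as pool:
        cpu_results = pool.map(cpu_task, [100_000] * 4)

    # 3. map(): a 1-to-1 mapping over an iterable (lazy in Python 3, so wrap in list).
    squares = list(map(lambda x: x * x, range(5)))

    print(io_results, cpu_results[:1], squares)
```

In PySpark, the same 1-to-1 idea behind map() appears as rdd.map(...) on an RDD.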

Key features of PySpark — PySpark comes with various features as given below:

  1. Real-time Computation — PySpark provides near real-time computation on large amounts of data because it focuses on in-memory processing, which keeps latency low
  2. Support for Multiple Languages — The Spark framework works with several programming languages, such as Scala, Java, Python, SQL, and R. This compatibility makes it a preferable framework for processing huge datasets
  3. Caching and disk persistence — PySpark provides powerful caching and good disk persistence (a short caching sketch follows this list)
  4. Swift Processing — It allows us to achieve a high data processing speed, about 100 times faster in memory and 10 times faster on disk, as stated by the Spark development team
  5. Works well with RDDs — The Python programming language is dynamically typed, which helps when working with RDDs
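The caching sketch referenced in point 3, as a rough illustration only; the data size, app name, and local master are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

# A minimal sketch of caching and disk persistence.
spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

df = spark.range(1_000_000)          # a simple DataFrame of ids

df.cache()                           # keep it around (MEMORY_AND_DISK by default for DataFrames)
print(df.count())                    # the first action materializes the cache
print(df.count())                    # later actions reuse the cached data

df.unpersist()
df.persist(StorageLevel.DISK_ONLY)   # explicit disk persistence
print(df.count())
spark.stop()
```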

Tuesday, July 17, 2018

Data Lake Vs Data Warehouse


We know that data is a business asset for any organisation, and it must always be kept secure and accessible to business users whenever it is required.
In the current era, two techniques are very popular for storing data for business insights, so we are going to differentiate them on a few technical points.

One is the Data Warehouse, a highly structured store of data that requires a significant amount of discovery, planning, data modeling, and development work before the data becomes available for analysis by business users.

The second is the Data Lake, a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that a Data Lake is a more organic store of data, kept without regard for the perceived value or structure of the data.

Data lakes are a big opportunity to store large amounts of data affordably without having to decide upfront how it must be structured and used. They are typically used to complement traditional data warehouses, which are still better suited to highly trusted, tightly governed data such as your financial figures, but there is some overlap between the two repositories.

Data Warehouses compared to Data Lakes - Depending on the business requirements, a typical organization will require both a data warehouse and a data lake, as they serve different needs and use cases.
  1. Type of data stored
     Data Warehouse: Structured data (most often in columns and rows in a relational database) from transactional systems, operational databases, and line-of-business applications
     Data Lake: Any data structure, in any format, including structured, semi-structured, and unstructured data from IoT devices, web sites, mobile apps, social media, and corporate applications
  2. Best way to ingest data
     Data Warehouse: Batch processes
     Data Lake: Streaming, micro-batch, or batch processes
  3. Schema
     Data Warehouse: Designed prior to the DW implementation (schema-on-write)
     Data Lake: The structure of the data is defined at the time of analysis (schema-on-read)
  4. Typical load pattern
     Data Warehouse: ETL (Extract, Transform, then Load)
     Data Lake: ELT (Extract, Load, and Transform at the time the data is loaded)
  5. Price/Performance
     Data Warehouse: Fastest query results, using higher-cost storage
     Data Lake: Query results getting faster, using low-cost storage
  6. Data Quality
     Data Warehouse: Highly curated data that serves as the central version of the truth
     Data Lake: Any data, which may or may not be curated (i.e. raw data)
  7. Users
     Data Warehouse: Business analysts
     Data Lake: Data scientists, data developers, and business analysts (using curated data)
  8. Analytics pattern
     Data Warehouse: Determine structure, acquire data, then analyze it, iterating back to change the structure as needed; used for batch reporting, BI, and visualizations
     Data Lake: Acquire data, analyze it, then iterate to determine its final structured form; used for machine learning, predictive analytics, data discovery, and profiling
During the development of a traditional data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes, profiling data, and modeling data.
In contrast, the default expectation for a data lake is to acquire all of the data and retain all of the data.
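To make the schema-on-write versus schema-on-read contrast concrete, here is a hedged sketch. It borrows PySpark from the post above purely for illustration; the file paths, schema, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative sketch only: paths and columns are assumptions.
spark = SparkSession.builder.master("local[*]").appName("schema-demo").getOrCreate()

# Schema-on-read (data lake style): land the raw JSON first and let the
# structure be discovered at analysis time.
raw = spark.read.json("/data/lake/raw/orders/")    # hypothetical landing path
raw.printSchema()

# Schema-on-write (data warehouse style): the structure is designed up front
# and enforced when the data is loaded.
orders_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("amount",   DoubleType(), True),
])
curated = spark.read.schema(orders_schema).json("/data/lake/raw/orders/")
curated.write.mode("overwrite").parquet("/data/warehouse/orders/")  # hypothetical target
spark.stop()
```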
Please visit us to learn more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse - Introduction
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers.
  14. Preparation for a successful Data Lake in the cloud
  15. Why does cloud make Data Lakes Better?