
Wednesday, November 8, 2023

Data Engineering — Azure Databricks or Azure Synapse Analytics

 The cloud is the fuel that powers today’s digital companies, with businesses paying solely for the specific services or resources that they consume over time.

Azure Synapse Analytics bridges the gap between the worlds of data warehousing and Big Data analytics by providing a unified experience for ingesting, preparing, managing, and serving data for immediate BI and machine learning needs.

 

Databricks is ideal for the "processing" layer, whereas Azure Synapse Analytics is ideal for the serving layer due to its access control, Active Directory integration, and integration with other Microsoft products.

 

Both Azure Databricks and Azure Synapse Analytics are widely used for machine learning; Synapse is also a data warehouse and is therefore optimised for OLAP.

Azure Synapse Analytics vs Azure Databricks

Apache Spark powers both Databricks and Synapse Analytics. With optimized Apache Spark support, Databricks allows users to select GPU-enabled clusters that deliver faster data processing and higher data concurrency.


Azure Synapse Analytics is an umbrella term for a variety of analytics solutions. It is a combination of Azure Data Factory, Azure Synapse SQL Pools (essentially what was formerly known as Azure SQL Data Warehouse), and added capabilities such as serverless Spark clusters and Jupyter-style notebooks, all within a browser-based IDE.

Azure Synapse architecture comprises the Storage, Processing, and Visualization layers. The Storage layer uses Azure Data Lake Storage, while the Visualization layer uses Power BI.

 

Azure Synapse Pipelines is a lightweight orchestrator and is ideal for basic extract-load procedures that require highly parameterized copy activities with ADLS Gen2 or dedicated SQL pool integration.

Some ideal features of Azure Synapse Analytics

  1. Azure Synapse offers cloud data warehousing, dashboarding, and machine learning analytics in a single workspace.
  2. It ingests all types of data, including relational and non-relational data, and it lets you explore this data with SQL.
  3. Azure Synapse uses massively parallel processing (MPP) database technology, which allows it to manage analytical workloads and to aggregate and process large volumes of data efficiently.
  4. It is compatible with a wide range of scripting languages like Scala, Python, .Net, Java, R, SQL, T-SQL, and Spark SQL.
  5. It facilitates easy integration with Microsoft and Azure solutions like Azure Data Lake, Azure Blob Storage, and more.
  6. It includes the latest security and privacy technologies such as real-time data masking, dynamic data masking, always-on encryption, Azure Active Directory authentication, and more.

Azure Synapse is a limitless analytics service that combines enterprise data warehousing and Big Data analytics. It allows you to query data on your own terms, using either serverless on-demand or provisioned resources, at scale.

Azure Databricks is built largely on open-source software and runs on the cloud provider's compute and storage, so you pay those costs directly. Its advantage is that it integrates easily into the Azure ecosystem and is substantially more streamlined, working well right out of the box. It is not entirely knob-free, yet given what it can accomplish, it requires very little configuration.

 

Databricks, on the other hand, is a complete, cloud-native ecosystem. It lets you write SQL, Python, R, and Scala. It was built by the founders of Spark and comes with tools that improve Spark's capabilities, for example in query performance and speed, and with Delta Lake, a lake storage format with version control and the ability to clone data between environments.

 

The Databricks architecture is not entirely a data warehouse. It follows a Lakehouse architecture that combines the best elements of data lakes and data warehouses for metadata management and data governance.

 

Azure Databricks offers streamlined workflows and an interactive workspace for collaboration between data scientists, data engineers, and business analysts.

 

Some ideal features of Azure Databricks: Databricks offers a lot more customizability, and it ships some internal libraries that are useful for data engineers. For instance, Mosaic is a useful geospatial library that was created inside Databricks. Some of its ideal features are given below:

  1. Databricks is more open, and the features it keeps releasing cover most needs, such as data governance, security, and change data capture.
  2. Databricks uses Python, Spark, R, Java, or SQL for performing Data Engineering and Data Science activities using notebooks.
  3. Databricks has AutoML, and instead of a black box at the end for inference, you receive a notebook with the code that built the model you want.
  4.  AutoML is a wonderful starting point for most ML use cases that can subsequently be customised; it’s also ideal for learning and transforming “citizen” Data Scientists into coding practitioners.
  5. Databricks compute will not auto-start, which means you have to leave clusters running if you want users to be able to query Databricks data.
  6. Databricks has a CDC option on its tables that lets you track changes. You can use this feature to get the list of rows that changed even on a trillion-row Delta table (around a billion changes a day); see the sketch after this list.
  7. Databricks is generally cheaper (cost for a given level of performance), so it is easier to keep a shared auto-scaling cluster running in Databricks.
  8. Databricks provides a platform for integrated data science and advanced analysis, as well as secure connectivity for these domains.
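
As a minimal sketch of the change data capture idea in point 6 above (the table name sales and the version numbers are placeholders, and the change data feed must first be enabled on the table):

from pyspark.sql import SparkSession

# On Databricks a SparkSession is already available as `spark`;
# this line only matters if you run the sketch elsewhere.
spark = SparkSession.builder.getOrCreate()

# Turn on the change data feed for an existing Delta table (hypothetical name).
spark.sql("ALTER TABLE sales SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read only the rows that changed between two table versions.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)    # placeholder version
    .option("endingVersion", 10)     # placeholder version
    .table("sales")
)

# Each row carries _change_type (insert/update/delete), _commit_version and _commit_timestamp.
changes.show()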

 

Databricks also has Delta Sharing, an interesting idea that makes it easier to integrate with your lakehouse. The biggest selling point for us is that Databricks has understood that data platforms today are about machine learning and advanced analytics.

 

In an ideal business scenario, data engineers can use Databricks to build data pipelines that process data and save everything as Delta tables in ADLS Gen2, then use a Synapse Analytics serverless pool to consume those Delta tables for further data analysis and reporting, as sketched below.
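
A minimal PySpark sketch of that flow, assuming a Databricks cluster that already has access to an ADLS Gen2 account (the storage account, container, and folder names below are placeholders):

from pyspark.sql import functions as F

# Hypothetical ADLS Gen2 locations; replace with your own account and containers.
raw_path   = "abfss://raw@mystorageacct.dfs.core.windows.net/sales/"
delta_path = "abfss://curated@mystorageacct.dfs.core.windows.net/delta/sales/"

# Read the raw files, apply a simple transformation, and save the result as a Delta table.
# `spark` is the session Databricks provides in every notebook.
raw_df = spark.read.option("header", "true").csv(raw_path)
curated_df = raw_df.withColumn("load_date", F.current_date())
curated_df.write.format("delta").mode("overwrite").save(delta_path)

On the Synapse side, a serverless SQL pool can then query the same Delta folder (for example with OPENROWSET and FORMAT = 'DELTA'), so reports and ad hoc analysis never need to touch the Databricks cluster.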

 

Both Azure Synapse and Databricks support notebooks, which help developers perform quick experiments. Synapse provides notebook co-authoring, with the caveat that one person must save the notebook before the other can see the changes.

 

However, Databricks Notebooks support real-time co-authoring along with automated version control.

 

Azure Synapse Analytics over Azure Databricks

  1. Azure Synapse Analytics can be less costly than Azure Databricks when Spark pools are used appropriately. Databricks is more expensive, yet it provides benefits that many teams will never use.
  2. With Azure Synapse serverless you pay per query, based on the amount of data processed (per GB). If per-query, serverless pricing suits your workload, Synapse appears to be a suitable fit.
  3. If you are planning to take up machine learning and data science later, then go with Synapse. Synapse is like Databricks, Data Factory, and SQL Server in one place.

 Azure Databricks over Azure Synapse Analytics

  1. Databricks comes with what can be seen as Spark plus multiple optimizations, which can perform up to 50 times better. They have built Photon, an engine designed to outperform open-source Spark.
  2. Azure Synapse has no version control in notebooks, whereas Databricks does.

 

Microsoft Azure Synapse Analytics is a scalable, cloud-based data warehousing solution that includes business intelligence, data analytics, and machine learning capabilities for both relational and non-relational data.

 

Azure Databricks is a fast, simple, and collaborative big data analytics service built on Apache Spark. It is intended for data science and data engineering, and it is designed to store all of your data in a single, open Lakehouse while unifying all of your analytics and AI workloads.

 

We suggest that you assess your requirements and choose as follows:

  1. If you want a lot of product knobs at the sacrifice of some productivity, use Synapse. To be clear, Azure Synapse Analytics is a collection of products under the same umbrella; in terms of data processing, it is a bit like IBM Watson.
  2. If you want a more refined experience at the sacrifice of certain capabilities, choose Azure Databricks. Databricks, for example, does not provide no-code ML, although AzureML does.
  3. If you want to construct pipelines without writing code, Azure Synapse Analytics is a no-brainer.
  4. Use Azure Databricks for sophisticated analytics, large amounts of data processing, machine learning, and notebooks.

To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima

 

Saturday, September 19, 2020

Introduction - PySpark for Big Data

Spark is written in Scala and runs on the JVM, and we can use all of its features in Python through PySpark. Programs written in PySpark can be submitted to a Spark cluster and run in a distributed manner.

PySpark is a Python API for Spark to support the collaboration of Apache Spark and Python.

Apache Spark is made up of several components; at its core, Spark is a general-purpose engine for processing large amounts of data.

A PySpark program isn’t that much different from a regular Python program, but the execution model can be very different from a regular Python program, especially if we’re running on a cluster.

Advantages of using PySpark:

  1. Python is an almost 29-year-old language that is easy to learn and use
  2. Python has very strong community support for dealing with most problems
  3. Py4J is a popular library integrated within PySpark that allows Python to dynamically interface with JVM objects
  4. It provides a simple and comprehensive API
  5. With Python, code readability, maintenance, and familiarity are far better
  6. It offers various options for data visualization, which is difficult with Scala or Java

How to setup PySpark on your machine?

Version — spark-3.0.0-bin-hadoop3.2

Notes: create a spark directory on your desktop, put the Spark version above in it, and then create these three system variables:

SPARK_HOME: this variable must be mapped to your spark directory,

HADOOP_HOME: this variable should be mapped to the Hadoop directory inside the spark directory, such as %SPARK_HOME%\hadoop

PYTHONPATH: this variable should be mapped to the python directory inside the spark directory, such as %SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
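
Once the variables are set, a quick way to confirm the setup works is to start a local session from a Python shell (a minimal sketch for the spark-3.0.0-bin-hadoop3.2 build mentioned above):

from pyspark.sql import SparkSession

# "local[*]" runs Spark locally using all available cores.
spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
print(spark.version)   # should print 3.0.0 for the build above
spark.stop()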

Components of PySpark

  1. Cluster: a cluster is simply the platform on which Spark is installed; Apache Spark is a Big Data processing engine. Spark can run in distributed mode on a cluster, with at least one driver and a master, and the remaining nodes as Spark workers. The Spark driver talks to the master to find out where the workers are, and then delegates tasks to the workers for computation.
  2. SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create the SparkContext. It acts as the master of the Spark application.
  3. SQLContext is the main entry point for Spark SQL functionality. A SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.
  4. Native Spark: if we use Spark data frames and libraries, Spark will natively parallelize and distribute our task. First, we convert the Pandas data frame to a Spark data frame and then perform the required business operations.
  5. Thread Pools: the multiprocessing library can be used to run concurrent Python threads and even perform operations with Spark data frames.
  6. Pandas UDFs: with this feature, we can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where our function is applied, and then the results are combined back into one large Spark data frame (see the sketch after this list).
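
A small sketch tying points 2, 3, and 6 together (the column names and the tax calculation are purely illustrative, and Pandas UDFs additionally require the pyarrow package):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("components-demo").getOrCreate()
sc = spark.sparkContext        # the SparkContext behind the session

# In recent versions the SparkSession also covers SQLContext functionality.
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "amount"])
df.createOrReplaceTempView("payments")
spark.sql("SELECT COUNT(*) AS cnt FROM payments").show()

# Pandas UDF: each partition is converted to a pandas Series, processed,
# and the results are combined back into a Spark data frame.
@pandas_udf("double")
def add_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.18       # illustrative 18% tax

df.withColumn("amount_with_tax", add_tax(df["amount"])).show()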

How does PySpark work?

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
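
The scheduling mode is configurable. A small sketch of switching from the default FIFO scheduler to fair scheduling when the session is built (spark.scheduler.mode is a standard Spark property):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scheduler-demo")
    .config("spark.scheduler.mode", "FAIR")   # the default is FIFO
    .getOrCreate()
)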

  1. MultiThreading: the threading module uses threads, which run in the same memory space. Since threads share memory, precautions have to be taken so that two threads do not write to the same memory at the same time. It is a good option for I/O-bound applications. Benefits:

a. Multithreading is concurrency

b. Multithreading is for hiding latency

c. Multithreading is best for IO

2. MultiProcessing: the multiprocessing module uses processes, which have separate memory. Multiprocessing gets around the Global Interpreter Lock and takes advantage of multiple CPUs and cores (see the short example after the benefits below). Benefits:

a. Multiprocessing is parallelism

b. Multiprocessing is for increasing speed

c. Multiprocessing is best for computations
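
A minimal, Spark-free illustration of multiprocessing (the square function is just a stand-in for any CPU-bound computation):

from multiprocessing import Pool

def square(n):
    return n * n                               # stand-in for CPU-bound work

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # four separate processes with separate memory
        print(pool.map(square, range(10)))     # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]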

3. Map() function

map() applies a function to each item in an iterable, and it always produces a 1-to-1 mapping of the original items.
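
For example, a trivial sketch:

prices = [100, 250, 400]
doubled = list(map(lambda p: p * 2, prices))   # one output item per input item
print(doubled)                                 # [200, 500, 800]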

Key features of PySpark — PySpark comes with various features as given below:

  1. Real-time Computation: PySpark provides real-time computation on large amounts of data because it focuses on in-memory processing, which gives it low latency
  2. Support for Multiple Languages: the PySpark framework works with various programming languages such as Scala, Java, Python, SQL, and R. This compatibility makes it one of the preferable frameworks for processing huge datasets
  3. Caching and disk persistence: the PySpark framework provides powerful caching and good disk persistence (see the sketch after this list)
  4. Swift Processing: it allows us to achieve high data processing speed, about 100 times faster in memory and 10 times faster on disk, as stated by its development team
  5. Works well with RDDs: the Python programming language is dynamically typed, which helps when working with RDDs
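
A small sketch of the caching and disk persistence mentioned in point 3 above (it assumes an existing SparkSession named spark; the row count is arbitrary):

from pyspark import StorageLevel

df = spark.range(1_000_000)        # a demo DataFrame
df.cache()                         # DataFrames are cached to memory and disk by default
df.count()                         # the first action materialises the cache

df2 = spark.range(1_000_000).persist(StorageLevel.DISK_ONLY)   # explicit disk-only persistence
df2.count()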
To learn more, please follow us - http://www.sql-datatools.com
To Learn more, please visit our YouTube channel at - http://www.youtube.com/c/Sql-datatools
To Learn more, please visit our Instagram account at - https://www.instagram.com/asp.mukesh/
To Learn more, please visit our twitter account at - https://twitter.com/macxima
To Learn more, please visit our Medium account at - https://medium.com/@macxima

Monday, August 24, 2020

Python — Filtering data with Pandas Dataframe

If you are working as a Python developer, you have to accomplish a lot of data cleansing tasks. One of those tasks is removing unwanted data from your dataframe. Pandas is one of the most important packages; it makes importing and analyzing data much easier with the help of its strong library.

For analyzing data, a programmer requires a lot of filtering operations. Pandas provides many methods to filter a DataFrame, and DataFrame.query() is one of them.

To understand the filtering features of Pandas, we create some sample data using Python lists.

In this example, the dataframe is filtered on multiple conditions.

# Import pandas library

import pandas as pd

 

# initialise data as a dictionary of lists.

data = {'Name':['Ryan Arjun', 'Kimmy Wang', 'Rose Gray', 'Will Smith'],

        'Age':[20, 21, 19, 18],

        'Country':['India','Taiwan','Canada','Greenland'],

        'Sex':['Male','Female','Female','Male']}

 

# Create DataFrame

df = pd.DataFrame(data)

 

#show data in the dataframe

df

=======================================
   Age |   Country |       Name |    Sex
----------------------------------------
0   20 |     India | Ryan Arjun |   Male
1   21 |    Taiwan | Kimmy Wang | Female
2   19 |    Canada |  Rose Gray | Female
3   18 | Greenland | Will Smith |   Male
=======================================

 

# filtering with query method

# Where sex must be male

# and Country must be India

# and age must be greater than 15

df.query('Sex =="Male" and Country =="India" and Age>15', inplace = True)

 

#show data in the dataframe

df

 

===================================
   Age | Country |       Name |  Sex
------------------------------------
0   20 |   India | Ryan Arjun | Male
===================================

Using the query feature of pandas in Python can save a lot of data processing time because we can apply multiple filter conditions in a single go.
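
For comparison, the same filter can also be written with a plain boolean mask on the dataframe as originally created above; this is just an equivalent alternative to query():

# Equivalent filtering with a boolean mask instead of query()
mask = (df['Sex'] == 'Male') & (df['Country'] == 'India') & (df['Age'] > 15)
filtered_df = df[mask]
filtered_df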

To learn more, please follow us -

http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at - 

http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -

https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -

https://twitter.com/macxima

To Learn more, please visit our Medium account at -

https://medium.com/@macxima

Monday, June 18, 2018

What is Data Engineering

Data engineering ensures that all the right data (internal/external, structured/unstructured) is identified, sourced, cleaned, analyzed, and modelled, and that decisions are implemented, without losing granularity or value as the data travels this path.
Data engineering helps businesses by building robust capabilities to deal with the volume, velocity, reliability, and variety of data, and by making this data available for business users to consume, both as traditional marts and warehouses and as new-age big data ecosystems.
Data engineering deals with data: data lakes, clouds, pipelines, and platforms. The Data Warehouse is the base of a BI (Business Intelligence) project, and ETL (Extract, Transform and Load) is the base of the Data Warehouse.

Data Approaches: There are many data engineering approaches that help in understanding the different techniques, as given below:
1. Implement Data Lakes / Data Warehouses / Data Marts: help lay or enlarge the enterprise data foundation so that a range of analytics solutions can be built on top
2. Develop Data Pipelines: facilitate production-grade, end-to-end data-to-value pipelines that take data solutions from sandbox environments and roll them out to end users