What is a Cluster:
==================
---> A cluster is a group of virtual machines (nodes) across which workloads or
tasks are distributed, so that large volumes of data can be processed in
parallel.
An Azure Databricks cluster is a group of compute resources and configurations on
which you run data engineering, data science, and data analytics workloads, such as
production ETL pipelines, streaming analytics, ad-hoc analytics, and machine
learning models.
You run these workloads as a set of commands in a notebook or as an automated job.
Azure Databricks makes a distinction between all-purpose clusters and job clusters.
---> A cluster is a compute engine.
---> Compute refers to the processing capacity (CPU cores and memory) available
to run workloads.
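The idea of splitting work across nodes and processing the pieces in parallel
can be illustrated locally. The sketch below is only an analogy that uses
threads on one machine in place of cluster nodes; the function names
split_into_partitions and parallel_sum are hypothetical and are not part of any
Databricks API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous, roughly equal chunks (one per worker)."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def parallel_sum(data, num_workers=4):
    """Sum each partition on its own worker, then combine the partial sums."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
```

On a real cluster the partitions would live on different nodes and the engine
(Spark) would handle the distribution; the split, process, combine pattern is
the same.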
Types of Azure Databricks Cluster:
----------------------------------
1) All-purpose Cluster (Interactive Cluster | Standard Cluster)
2) Job Cluster (Automated Cluster | On-Demand Cluster)
All-purpose Cluster (Interactive Cluster):
------------------------------------------
---> You can create an all-purpose cluster using the UI, CLI, or REST API.
---> You can manually terminate and restart an all-purpose cluster.
---> Multiple users can share such clusters to do collaborative interactive
analysis.
---> An all-purpose cluster is also known as an "Interactive Cluster" or
"Standard Cluster".
---> Interactive cluster: a data engineer can interactively test each small
piece of code before proceeding to the next cell.
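As a rough sketch of the REST API route mentioned above, the snippet below
builds the JSON body that would be sent to the Clusters API create endpoint
(POST /api/2.0/clusters/create). The helper name, the runtime version string,
and the VM size are illustrative assumptions; check which values are available
in your workspace before using them.

```python
import json


def build_all_purpose_cluster_spec(cluster_name, num_workers,
                                   autotermination_minutes=60):
    """Return a request body for creating an all-purpose cluster.

    Hypothetical helper: it only builds the payload and sends nothing.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return {
        "cluster_name": cluster_name,
        # Assumed example runtime; list the valid versions in your workspace.
        "spark_version": "13.3.x-scala2.12",
        # Assumed example Azure VM size for the nodes.
        "node_type_id": "Standard_DS3_v2",
        "num_workers": num_workers,
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": autotermination_minutes,
    }


if __name__ == "__main__":
    spec = build_all_purpose_cluster_spec("team-notebooks", num_workers=2)
    print(json.dumps(spec, indent=2))
```

The printed JSON would be sent with a workspace URL and an access token, both
of which are left out here on purpose.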
What is a Job:
==============
---> A job is an automated, scheduled task, such as running a notebook, running a
Python file, or running a JAR file (a package of compiled Java or Scala classes).
Job Cluster:
------------
---> The Azure Databricks job scheduler creates a job cluster when a job run
begins and terminates the cluster when the job is complete.
---> A job cluster in Azure Databricks is a temporary cluster that's created to run
a specific job or task.
---> The cluster is created when the job begins and is terminated when the job
finishes.
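---> Job clusters are designed to improve the performance and reliability of data
pipelines: each job runs on its own dedicated cluster, so it does not compete
for resources with other workloads.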
---> Azure Databricks job clusters are clusters that are created on-demand to run a
specific job or notebook.
---> They are automatically terminated when the job or notebook execution is
completed
---> You cannot restart a job cluster.
---> A job is a scheduled task that runs on a job cluster.
---> A task can be a scheduled notebook, a scheduled JAR file, or a scheduled
Python file.
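The relationship between a job, its task, and its job cluster can be sketched as
a Jobs API request body (POST /api/2.1/jobs/create). In this minimal sketch the
task carries a new_cluster block, which is what makes the scheduler create a
cluster for the run and terminate it afterwards. The helper name, the notebook
path, the runtime version, and the VM size are assumptions for illustration.

```python
import json


def build_scheduled_notebook_job(job_name, notebook_path, cron_expression,
                                 num_workers=2):
    """Return a request body for a scheduled notebook job on a job cluster.

    Hypothetical helper: it only builds the payload and sends nothing.
    """
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                # A notebook task; a Python file or a JAR would instead use
                # "spark_python_task" or "spark_jar_task".
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" requests a job cluster created for this run.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # assumed example
                    "node_type_id": "Standard_DS3_v2",    # assumed example
                    "num_workers": num_workers,
                },
            }
        ],
        # Quartz cron syntax: seconds, minutes, hours, day, month, weekday.
        "schedule": {
            "quartz_cron_expression": cron_expression,
            "timezone_id": "UTC",
        },
    }


if __name__ == "__main__":
    job = build_scheduled_notebook_job(
        "nightly-etl", "/Shared/etl/nightly", "0 0 2 * * ?")
    print(json.dumps(job, indent=2))
```

The cron expression in the usage example would run the notebook every day at
02:00 UTC.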