Cloud Computing Unit 2 Notes
Direct Attached Storage (DAS)
• It is the most basic and traditional form of storage, typically used for individual systems.
• Storage devices like Hard Disk Drives (HDD), Solid State Drives (SSD), or Optical Drives (CD/DVD) are physically connected to the system via interfaces such as SATA, SAS, SCSI, or USB.
• The host computer fully controls the DAS and manages the data.
• Internal Storage: storage devices mounted inside the host machine (e.g., an internal HDD or SSD).
• External Storage: storage devices attached to the machine from outside (e.g., a USB hard drive or external SSD).
Advantages of DAS
Disadvantages of DAS
Storage Area Network (SAN)
• SAN connects storage devices (like disk arrays, tape libraries) to servers, allowing them to access storage as if it were locally attached.
• It is mainly used in data centers and large enterprises to handle huge volumes of data
efficiently.
Components of SAN:
• SAN uses switches to route the requests to the appropriate storage device.
• Storage devices respond and deliver data over the same dedicated network.
• SAN uses protocols like Fibre Channel (FC), iSCSI (Internet SCSI), or FCoE (Fibre Channel over
Ethernet).
Advantages of SAN
1. High Performance — Fast data access suitable for databases and enterprise apps.
Network Attached Storage (NAS)
Components of NAS:
Advantages of NAS
1. Easy File Sharing — Centralized storage accessible from any networked device.
Disadvantages of NAS
1. Limited by LAN Bandwidth — Heavy usage may affect overall network performance.
Google File System (GFS)
GFS Architecture:
GFS Clients:
• They are computer programs or applications that request files from the system.
• Requests may read or modify existing files, or add new files to the system.
GFS Master Server:
• Stores metadata about file system:
• File and directory names.
• Mapping from files to chunks.
• Chunk locations (which chunkserver has which chunk).
• Manages chunk leases for write operations.
• Coordinates system operations like chunk creation, deletion, and replication.
GFS Chunk Servers:
• Store file data in chunks (default 64 MB per chunk).
• Each chunk is replicated (typically 3 copies) across different chunkservers for
fault tolerance.
• Handle read/write requests from clients.
• GFS replicates each chunk and stores the copies on different chunk servers to ensure reliability; the default is three copies. Each copy is referred to as a replica.
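To make this division of responsibilities concrete, here is a minimal Python sketch of the metadata a GFS-style master might keep. The class name, chunk server names, and placement policy are illustrative assumptions, not part of GFS itself; only the 64 MB chunk size and the default of three replicas come from the description above.

```python
import uuid

CHUNK_SIZE = 64 * 1024 * 1024   # default GFS chunk size: 64 MB
REPLICATION = 3                 # default number of replicas per chunk

class MasterMetadata:
    """Toy model of the metadata a GFS-style master keeps in memory."""

    def __init__(self, chunkservers):
        self.chunkservers = chunkservers   # names of available chunk servers
        self.file_to_chunks = {}           # file name -> ordered list of chunk IDs
        self.chunk_locations = {}          # chunk ID -> chunk servers holding a replica

    def create_file(self, name, size_bytes):
        # Split the file into fixed-size chunks and pick servers for each replica.
        num_chunks = max(1, -(-size_bytes // CHUNK_SIZE))   # ceiling division
        chunk_ids = []
        for i in range(num_chunks):
            chunk_id = str(uuid.uuid4())
            # Naive round-robin placement; real GFS also considers load and racks.
            servers = [self.chunkservers[(i + r) % len(self.chunkservers)]
                       for r in range(REPLICATION)]
            self.chunk_locations[chunk_id] = servers
            chunk_ids.append(chunk_id)
        self.file_to_chunks[name] = chunk_ids
        return chunk_ids

master = MasterMetadata(["cs1", "cs2", "cs3", "cs4"])
master.create_file("/logs/clicks.dat", 200 * 1024 * 1024)   # 200 MB file -> 4 chunks
print(master.chunk_locations)   # each chunk mapped to 3 different chunk servers
```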
How GFS Works? (Workflow)
File Write Operation:
1. Client requests master for chunk locations.
2. Master returns primary and secondary chunkservers for replication.
3. Client sends data to all chunkservers simultaneously (pipelined).
4. Primary chunkserver coordinates write among replicas.
5. Once all replicas acknowledge, client gets confirmation of successful write.
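The five steps above can be sketched as plain Python, purely as an illustration; the class and function names below are hypothetical, and real GFS clients, the master, and chunkservers communicate over RPC, with the data push pipelined from server to server.

```python
# A toy model of the write path above. Class and method names are illustrative;
# real GFS uses RPCs between the client, the master, and the chunkservers.

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.buffered = None   # data pushed by the client but not yet applied
        self.stored = []       # committed mutations, in the order set by the primary

    def push(self, data):      # steps 2-3: client pushes data to every replica
        self.buffered = data

    def apply(self):           # step 4: apply the buffered mutation and acknowledge
        self.stored.append(self.buffered)
        self.buffered = None
        return True

def gfs_write(data, primary, secondaries):
    # Step 3: push the data to all replicas (GFS pipelines this server-to-server).
    for server in [primary, *secondaries]:
        server.push(data)
    # Step 4: the primary applies the write, then tells the secondaries to do the same.
    acks = [primary.apply()] + [secondary.apply() for secondary in secondaries]
    # Step 5: the client sees success only when every replica has acknowledged.
    return all(acks)

# Steps 1-2 (asking the master for the primary and secondary replicas) are assumed
# to have returned these three chunkservers:
replicas = [ChunkServer("cs1"), ChunkServer("cs2"), ChunkServer("cs3")]
print(gfs_write(b"append this record", replicas[0], replicas[1:]))   # True
```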
Advantages of GFS
1. Supports Large Files — Optimized for big files used in data analysis.
Disadvantages of GFS
1. Not the best fit for small files.
2. Master may act as a bottleneck.
3. No efficient support for random writes (GFS is optimized for appends and large sequential reads).
Hadoop File System
• Hadoop Distributed File System (HDFS) is an open-source distributed file
system designed to store and process large datasets across multiple machines.
• It is a core component of Apache Hadoop, built to handle big data storage
with fault tolerance and high throughput.
Architecture:
• NameNode (Master): stores file system metadata such as the directory tree and the mapping of files to blocks, and coordinates operations like replication.
• DataNodes: store the actual data blocks, replicated across nodes (default replication factor of 3) for fault tolerance, and serve client read/write requests.
Advantages of HDFS
1. Supports Large Files — Optimized for big files used in data analysis.
Disadvantages of HDFS
1. Not the best fit for small files.
2. Master (NameNode) may act as a bottleneck.
3. No efficient support for random writes.
Dynamo: Distributed Data Storage System
Dynamo is a highly available and scalable distributed key-value storage system developed
by Amazon to handle large-scale, highly available e-commerce applications like Amazon's
shopping cart service.
Working of Dynamo
• Write Process:
1. Client sends a PUT (write) request to any node.
2. Node stores data and replicates to other nodes based on replication factor.
3. Acknowledgment is sent back to the client.
• Read Process:
1. Client sends a GET (read) request to any node.
2. Node fetches data from multiple replicas.
3. If versions differ, conflicts are resolved using vector clocks or client-side logic.
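As a rough sketch of these read and write paths, the toy code below replicates a PUT to several nodes and uses vector clocks on GET to discard obsolete versions. The node names, replication handling, and data layout are simplifications assumed for illustration; real Dynamo also relies on consistent hashing, read/write quorums, and hinted handoff.

```python
# Toy sketch of Dynamo-style replication and vector-clock versioning.

N = 3  # replication factor: each key is written to N nodes

def put(stores, coordinator, key, value, clock=None):
    """Write path: the coordinator bumps its vector-clock entry and replicates."""
    clock = dict(clock or {})
    clock[coordinator] = clock.get(coordinator, 0) + 1
    for node in list(stores)[:N]:
        stores[node][key] = (value, clock)
    return clock

def descends(a, b):
    """True if vector clock a already includes every update recorded in b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def get(stores, key):
    """Read path: gather replica versions and discard the obsolete ones."""
    versions = []
    for store in stores.values():
        if key in store and store[key] not in versions:
            versions.append(store[key])
    latest = [(v, c) for v, c in versions
              if not any(descends(c2, c) and c2 != c for _, c2 in versions)]
    return latest   # more than one entry means a conflict the client must resolve

stores = {"node-a": {}, "node-b": {}, "node-c": {}}
clock = put(stores, "node-a", "cart:42", ["book"])
put(stores, "node-b", "cart:42", ["book", "pen"], clock)
print(get(stores, "cart:42"))   # [(['book', 'pen'], {'node-a': 1, 'node-b': 1})]
```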
Write a short note on MapReduce
MapReduce is a programming model and processing technique developed by Google for
handling and processing large datasets in a distributed computing environment.
Key Concepts:
1. Map Phase:
o Takes input data and converts it into a set of key-value pairs.
o Each piece of data is processed independently in parallel.
2. Shuffle & Sort Phase:
o Intermediate key-value pairs are grouped and sorted based on keys.
o Prepares data for reduction.
3. Reduce Phase:
o Aggregates or summarizes the data with the same key.
o Produces the final output.
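The classic illustration of these three phases is counting words (listed under the use cases below). The following sketch simulates the phases as ordinary Python functions in a single process; in a real framework such as Hadoop, the map and reduce tasks would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: turn each input record into (key, value) pairs -- here (word, 1).
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle & sort: group all values that share the same key, sorted by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key -- here, sum the counts.
    return {key: sum(values) for key, values in grouped}

docs = ["the cloud stores data", "the cloud scales"]
counts = reduce_phase(shuffle_and_sort(map_phase(docs)))
print(counts)   # {'cloud': 2, 'data': 1, 'scales': 1, 'stores': 1, 'the': 2}
```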
Features:
• Scalable and fault-tolerant.
• Parallel processing over distributed systems (like Hadoop).
• Suitable for big data analytics and batch processing.
Example Use Cases:
• Word count in large documents.
• Log analysis.
• Data aggregation.
Explain how Cloud Data Management Works
Cloud Data Management is the process of storing, organizing, and managing data on cloud
platforms instead of local servers or personal computers. It allows secure, scalable, and
flexible access to data from anywhere using the internet.
How Does It Work?
1. Data Collection & Ingestion
• Data is collected from various sources like apps, devices, databases, sensors, etc.
• The data is uploaded to cloud storage (e.g., AWS S3, Google Cloud Storage, Azure
Blob Storage).
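As an illustration of this ingestion step, the snippet below uploads a locally collected file to AWS S3 using boto3. The bucket name, object key, and file path are hypothetical, and valid AWS credentials are assumed to be configured.

```python
# Illustrative ingestion step using AWS S3 (one of the storage services above).
import boto3

s3 = boto3.client("s3")

# Upload a locally collected data file into a cloud storage bucket.
s3.upload_file(
    Filename="sensor_readings.csv",        # local file produced by an app or device
    Bucket="example-ingestion-bucket",     # hypothetical bucket name
    Key="raw/2024/sensor_readings.csv",    # object key used to organize the data
)
```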
2. Storage Management
• Data is stored in different formats: files, databases, data lakes, data warehouses.
• Storage systems provide automatic scaling, backup, and replication to protect
against data loss.
3. Data Organization & Classification
• Data is organized into folders, buckets, or tables.
• Metadata is used to describe data (e.g., type, owner, date).
• Classification helps in quick searching and access control.
4. Data Security & Access Control
• Encryption protects data during transfer and at rest.
• Authentication and Authorization ensure that only authorized users can access data.
• Role-based access (RBAC) and policies manage who can read, write, or delete data.
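A minimal sketch of the RBAC idea is shown below; the roles and permissions are hypothetical, and real cloud platforms usually express such rules through IAM policies rather than application code.

```python
# Minimal role-based access control (RBAC) sketch with hypothetical roles.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin":  {"read", "write", "delete"},
}

USER_ROLES = {"alice": "admin", "bob": "viewer"}

def is_allowed(user, action):
    # Look up the user's role, then check whether that role grants the action.
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("bob", "read"))     # True  - viewers can read
print(is_allowed("bob", "delete"))   # False - only admins can delete
```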
5. Backup, Replication & Recovery
• Data is backed up automatically and replicated across multiple regions for disaster
recovery.
• Enables quick restoration in case of failure or loss.
6. Data Sharing & Collaboration
• Data can be shared securely with other users or organizations.
• Enables real-time collaboration and data integration.
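One common sharing mechanism, sketched below, is a time-limited (presigned) URL for an object in cloud storage, which lets data be shared without making the bucket public. The bucket and key names are hypothetical, carried over from the earlier ingestion example.

```python
# Illustrative sharing step: generate a presigned URL for an S3 object.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-ingestion-bucket",
            "Key": "raw/2024/sensor_readings.csv"},
    ExpiresIn=3600,   # the link is valid for one hour
)
print(url)
```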
Cloud Data Management enables organizations to store, manage, secure, and analyze their
data efficiently using cloud platforms, offering scalability, security, and flexibility without
investing in physical infrastructure.
Data-Intensive Computing deals with large-scale data processing, storage, and management. In
cloud computing, data-intensive technologies handle big data efficiently, ensuring high
performance, scalability, and fault tolerance.
Key technologies and examples:
• Distributed processing frameworks: a framework for processing large data sets in parallel across distributed clusters.
o Examples: Hadoop MapReduce, Apache Spark
• Cloud data warehouses:
o Examples: Google BigQuery, Amazon Redshift, Snowflake
• Machine learning frameworks:
o Examples: TensorFlow, PyTorch
Challenges of Cloud Data Storage
Cloud data storage offers scalability and flexibility, but it comes with several challenges that need to be addressed to ensure security, reliability, and cost-efficiency.
1. Data Security and Privacy
• Challenge: Storing sensitive data on third-party cloud platforms raises concerns about unauthorized access, data breaches, and privacy violations.
• Key Issues:
2. Data Availability and Reliability
• Challenge: Ensuring that data is always accessible even during system failures or outages.
• Key Issues:
3. Data Integrity and Consistency
• Challenge: Maintaining the accuracy and consistency of stored data over time.
• Key Issues:
4. Vendor Lock-in
• Challenge: Difficulty in migrating data between different cloud providers due to proprietary
formats and APIs.
• Key Issues:
5. Cost Management
• Key Issues:
6. Data Backup and Recovery
• Challenge: Ensuring proper backup strategies and quick data recovery in case of data loss or attack.
• Key Issues: