CCD chapter 3

• cloud data governance


− Cloud data governance refers to the set of policies, procedures, and
technologies used to manage, protect, and ensure the quality, security, and
compliance of data stored in cloud environments.
− Data governance is a principled approach to managing data throughout its
life cycle.
− Data governance is everything you do to ensure data is secure, private,
accurate, available, and usable.
− It helps organizations maintain control over their data.
− It identifies and classifies data based on its sensitivity, importance, and
requirements. Examples: public, confidential.
− It defines access control, that is, who can access the data.
− It includes applying encryption, firewalls, and other security measures to
protect data from breaches or unauthorized access.
− It implements procedures to maintain data accuracy and consistency, and
monitors for duplicate and incomplete data. (A small classification and
access-control sketch follows this list.)
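
To make the classification and access-control ideas concrete, here is a
minimal Python sketch; the sensitivity labels, roles, and clearance policy
are hypothetical examples, not part of any particular cloud service:

    # Hypothetical sensitivity labels, ranked from least to most restricted.
    SENSITIVITY = {"public": 0, "internal": 1, "confidential": 2}

    # Assumed policy: the highest label each role is cleared to read.
    ROLE_CLEARANCE = {"guest": 0, "analyst": 1, "admin": 2}

    def can_access(role: str, label: str) -> bool:
        """Return True if the role's clearance covers the data's label."""
        return ROLE_CLEARANCE.get(role, -1) >= SENSITIVITY[label]

    record = {"name": "salary_report.csv", "label": "confidential"}
    print(can_access("analyst", record["label"]))  # False: not cleared
    print(can_access("admin", record["label"]))    # True: cleared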

advantages :
1) Better Data Security: Ensures sensitive data is protected from breaches
and unauthorized access.
2) Improved Data Quality: Keeps data accurate and reliable for better
decision-making.
3) Access Control: Ensures only authorized users can access specific data.
4) Disaster Recovery: Provides backups and recovery plans to keep data
safe during failures.
5) Cost Management: Optimizes storage and processing costs by removing
redundant data.
6) Continuous Monitoring: Tracks who accessed or modified data, making it
easy to audit.
disadvantages :
1) Hard to Set Up: It takes time and effort to create and manage rules for
data in the cloud.
2) Dependency on the Cloud Provider: Organizations must trust cloud
providers to maintain security.
3) Training Needs: Employees must learn how to handle data properly,
which takes time and resources.
4) Needs Constant Monitoring: You have to keep an eye on the data all the
time, which can be challenging.

• key-value databases


− A key-value database is a type of NoSQL database that stores data as a
collection of key-value pairs.
− Each key is a unique identifier, and the associated value can be any type of
data, such as a string, JSON object, or binary file.
− Key: A unique identifier for the data (e.g., userID123).
Value: The actual data associated with the key (e.g., { "name": "Ishwari",
"age": 20 }).
− In this database, values are retrieved or manipulated using their keys, as
shown in the sketch after the examples below.

advantages :
1) Scalability: Easily expands to handle increasing amounts of data.
2) Easy to Use: The simple key-value structure makes it easy to work with.
3) Speed: Extremely fast for simple queries.
4) Flexible: Can store different types of data.
disadvantages :
1) Limited Searching: You can only search for data using keys, not the
values.
2) No Data Relationships: Cannot link data the way relational databases do.
3) Not Ideal for Large Values: Performance can slow down if the stored
values are too large.
4) Harder to Manage Complex Data: Managing and retrieving complex data
becomes challenging.
use cases :
1) E-commerce Applications: Manages shopping carts and user
preferences.
2) Real-Time Analytics: Tracks events and metrics for monitoring
dashboards.
Examples of Key-Value Databases in Cloud Computing
1) Amazon DynamoDB
2) Azure Table Storage
3) Google Cloud Datastore
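
As a minimal sketch of key-value access, the snippet below uses Amazon
DynamoDB through the boto3 library. It assumes AWS credentials are
configured and that a table named Users with partition key userID already
exists; the table and attribute names are hypothetical.

    import boto3

    # Hypothetical table "Users" with partition key "userID" (must exist).
    table = boto3.resource("dynamodb").Table("Users")

    # Write: the key uniquely identifies the value (an arbitrary document).
    table.put_item(Item={"userID": "userID123", "name": "Ishwari", "age": 20})

    # Read: values are fetched by key; you cannot search inside the values.
    response = table.get_item(Key={"userID": "userID123"})
    print(response.get("Item"))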
• batch and streaming data in machine learning
Batch data :
− Batch data refers to a collection of data that is processed in groups or
batches at specific intervals.
− It involves analyzing large datasets that have already been collected and
stored.
− Typically has higher latency since data is not processed immediately.
− This approach is widely used in machine learning (ML) for tasks that do not
require real-time processing and focus on analyzing historical data to train
predictive models.
workflow :
1) Data Collection: Collect data from various sources over a fixed period.
2) Storage: Store the collected data in a database, data warehouse, or file
system (e.g., Amazon S3).
3) Preprocessing: Clean, filter, and prepare the data for analysis. Common
preprocessing tasks include handling missing values, removing duplicates,
and normalizing data.
4) Batch Processing: Analyze or process the data in bulk using tools like
Hadoop or Python libraries.
5) Model Training: Use the processed data to train machine learning models.
6) Evaluation and Deployment: Evaluate the trained model's performance
using test datasets, then deploy the model to make predictions on new data,
as sketched below.
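
A minimal sketch of this batch workflow in Python, using pandas for
preprocessing and scikit-learn for training; the file name, columns, and
label are hypothetical stand-ins for a real historical dataset:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Steps 1-3: load a stored batch and preprocess it (hypothetical file).
    df = pd.read_csv("sales_history.csv")
    df = df.drop_duplicates().dropna()  # remove duplicates, handle missing values

    # Steps 4-5: train a model on the full historical batch.
    X, y = df[["price", "quantity"]], df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression().fit(X_train, y_train)

    # Step 6: evaluate on held-out data before deployment.
    print("test accuracy:", model.score(X_test, y_test))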
Advantages :
1) Efficient for Large Data Volumes: Handles large datasets in a single run,
making it suitable for big data applications.
2) Cost-Effective: It optimizes resource usage and costs.
3) Supports Complex Workflows: Handles complex transformations over large
datasets.
4) Data Consistency: Processes entire datasets at once, ensuring uniformity
and reducing inconsistencies.
5) Suitable for Model Training: Allows machine learning models to be trained
on large, historical datasets.
Disadvantages :
1) You have to wait until all the data is collected and processed, so it’s not
good for instant results.
2) Processing large amounts of data at once requires a lot of computing
resources.
3) It needs sufficient space to store the data before it’s processed.
4) If there’s an error, you might need to start the process all over again,
which wastes time.
5) It is not suitable for real-time data.
Examples :
1) Weather Data
2) Customer Feedback Analysis
3) Image or Video Processing
4) Generating monthly employee salaries
5) Generating academic records at the end of a semester

Streaming data :
− Streaming data refers to the continuous flow of data generated and
processed in real time.
− It involves analysing data immediately as it arrives.
− Offers lower latency because data is processed immediately as it flows
into the system.
− It is used in scenarios where data arrives in small increments over time.
Workflow :
1) Data Generation: Data is generated from sources like sensors and user
clicks.
2) Ingestion: Data is ingested using tools like Google Pub/Sub.
3) Preprocessing: Data is cleaned and transformed in real time to prepare
it for analysis.
4) Model Application: Pretrained models are used to make predictions.
5) Output: Results are used for real-time decision-making, as in the
sketch below.
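
A minimal Python sketch of this loop: events are scored one at a time as
they arrive instead of in a stored batch. The event generator and the
threshold rule are simulated stand-ins for a real ingestion service (e.g.,
a Pub/Sub subscriber) and a real pretrained model:

    import random
    import time
    from itertools import islice

    def event_stream():
        """Simulated source; in production, a Pub/Sub or Kafka consumer."""
        while True:
            yield {"user": "u1", "amount": random.uniform(1, 500)}
            time.sleep(0.1)  # events trickle in over time

    def looks_fraudulent(event):
        # Stand-in for a pretrained model: flag unusually large amounts.
        return event["amount"] > 400

    # Process each event the moment it arrives (first 20 events for the demo).
    for event in islice(event_stream(), 20):
        if looks_fraudulent(event):
            print("ALERT:", event)  # real-time decision on arrival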
advantages :
1) Real-Time Insights: You can get results and take action immediately as
data arrives.
2) Always Updated: Works with the latest data, ensuring decisions are
based on current information.
3) Ideal for Quick Decisions: Used when decisions must be made
immediately, on the spot.
disadvantages :
1) Needs Complex Systems: Requires complex tools, setup, and expertise
for real-time processing.
2) Expensive: Requires more computing power, which can increase costs.
3) Hard to Ensure Data Quality: It’s challenging to clean or verify the data
quickly because processing happens immediately.
4) Limited for Historical Analysis: Because it focuses on real-time data, it
is difficult to process historical data.
examples
1) Social Media Monitoring
2) Fraud Detection
3) Stock Market Analysis
4) Traffic Updates

• cloud data warehouse – AWS Redshift


− A cloud data warehouse is a database that stores, processes, and integrates
data in a public cloud environment, designed for analyzing and querying
large datasets to support business intelligence and decision-making.
− It provides a scalable, cost-effective, and flexible alternative to traditional
on-premises data warehouses.
− A cloud data warehouse is hosted on the cloud and managed by a service
provider, eliminating the need for physical hardware.
− The system can handle increasing data volumes and workloads by scaling
resources up or down based on demand.
− Cloud data warehouses easily integrate with various data sources, business
intelligence tools, and third-party applications.
− Providers implement robust security measures such as encryption, access
control, and compliance certifications to safeguard data.
how a cloud data warehouse functions :
− Data Ingestion: Data from multiple sources (e.g., databases, applications,
IoT devices) is collected and ingested into the cloud data warehouse using
pipelines or ETL (Extract, Transform, Load) processes.
− Data Storage: The data is stored in an optimized format for analytical
queries, often leveraging columnar storage for faster read operations.
− Data Processing: Data is processed and transformed to make it suitable for
analysis. Cloud warehouses use distributed processing to handle large
datasets efficiently.
− Query Execution: Users can run SQL or similar queries on the stored data.
The system leverages parallel processing to provide quick results.
− Analytics and Reporting: The queried data is visualized using business
intelligence tools or dashboards to derive actionable insights.
advantages :
− Reduced Setup Time: Organizations can quickly deploy a cloud data
warehouse without the need for extensive hardware or software
installations.
− Performance Optimization: Cloud data warehouses use advanced
technologies like in-memory computing and massively parallel processing
(MPP) for high-speed analytics.
− Global Accessibility: Data can be accessed from anywhere with an
internet connection.
− Support for Big Data: Cloud data warehouses are built to handle
massive datasets and complex analytics workloads.
challenges:
− Moving data from on-premises systems to the cloud can be time-
consuming.
− Mismanagement of resources can lead to higher costs.
− Ensuring compliance with data regulations can be complex.
− It needs a high-bandwidth network.
− Although providers implement robust security measures, storing
sensitive data in the cloud can expose it to potential breaches.
− Vendor Lock-In: Switching providers or transitioning back to an on-
premises system can be difficult due to proprietary technologies and data
formats.

AWS Redshift:
− Amazon Redshift is a fully managed, cloud-based data warehousing
service provided by AWS.
− It is designed for storing and analyzing large volumes of structured and
semi-structured data using SQL-based tools and business
intelligence applications.
− Redshift is optimized for high-performance analytics, supporting
complex queries on large datasets with scalability, speed, and cost-
effectiveness.
workflow :

1. Data Loading:
Data is uploaded to Redshift from sources like files, databases, or other
cloud services (e.g., Amazon S3).
2. Data Storage:
Data is organized in a columnar format, compressed, and stored in the
cloud for easy access.
3. Querying Data:
You can run SQL queries to analyze data, generate reports, or find
insights, as in the sketch below.
4. Processing Queries:
Redshift splits tasks across multiple servers to process large queries
faster.
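
Because Redshift speaks the PostgreSQL wire protocol, this workflow can
be sketched with a standard driver such as psycopg2. The cluster endpoint,
credentials, table, S3 path, and IAM role below are all hypothetical
placeholders:

    import psycopg2

    # Hypothetical connection details; Redshift listens on port 5439 by default.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    cur = conn.cursor()

    # 1) Data loading: pull a batch from Amazon S3 with Redshift's COPY command.
    cur.execute("""
        COPY sales FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)
    conn.commit()

    # 3) Querying data: ordinary SQL; Redshift parallelizes the work internally.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()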

Architecture :
1. Leader Node: The leader node acts like a manager or controller. It:
• Receives queries from users or applications.
• Creates a plan to execute those queries.
• Coordinates the work with compute nodes.
• Returns the final results to the user.
2. Compute Nodes : Compute nodes are like the workers. They:
• Store the actual data in the form of slices (chunks of data).
• Perform the heavy lifting by processing queries sent by the leader
node.
• Work in parallel to speed up tasks.
3. Storage Layer: Redshift stores data in a columnar format
(organized by columns, not rows).
This format is faster for analytical queries because only the relevant
columns are read.
4. Network and Integration: Redshift is connected to AWS services
and external tools to ingest and export data easily. Data can be loaded
from Amazon S3, databases, or streaming services.
features :

1. Fast Data Processing:
Redshift divides work among multiple servers (computers) to
process data quickly.
2. Column-Based Storage:
Instead of storing data row by row, it stores data in columns. This
makes it faster to find and analyze data.
3. Scalable:
You can start small and add more storage or processing power as
your data grows.
4. SQL Support:
You can use SQL (a standard language for databases) to query and
analyze your data.
5. Integration with AWS Tools:
Works well with other AWS services like S3 (cloud storage) and
Glue (data preparation).
6. Cost-Effective:
You only pay for what you use, and it offers ways to save money by
optimizing storage and computing.
7. Data Security:
Your data is encrypted (protected) when it's being stored and
transferred.

Advantages :

− Easy to Use: You don’t need to manage servers; AWS handles that
for you.
− Faster Analytics: Optimized for analyzing big datasets quickly.
− Flexible Scaling: Adjust resources (storage and compute) as needed.
− Affordable: Pay only for what you use, and store large amounts of
data cost-effectively.

Disadvantages :

− Complex Queries Can Slow Down: Very complex queries or too
many users may affect performance.
− Costs Can Add Up: Without proper monitoring, costs can increase
for large-scale usage.
− No Real-Time Processing: It’s designed for analyzing data in bulk,
not for real-time applications.
• various cloud-based tools used for data science in ML – GCP BigQuery
various cloud-based tools for data science are :
1) Amazon SageMaker: A fully managed service to build, train, and deploy
machine learning models.
2) AWS Lambda: Runs ML inference tasks triggered by events without
managing servers (serverless computing).
3) Amazon Redshift: A fast data storage and analytics tool for large datasets.
4) Azure Machine Learning: A platform for training and deploying ML
models with simple tools.
5) BigQuery: A tool for fast data analytics, with built-in ML capabilities.

GCP BigQuery :
− BigQuery is a fully-managed, serverless, and highly scalable data
warehouse built on Google Cloud Platform (GCP) for performing fast
and SQL-based queries on massive datasets.
− It's widely used in data science and machine learning (ML) for storing,
analyzing, and managing large datasets quickly and efficiently.

key features :

1. Serverless and Scalable:
o Serverless means you don't need to manage infrastructure (no
servers to configure or scale).
o BigQuery automatically scales resources based on the size of your
data and the complexity of your queries.
2. Fast Querying:
o BigQuery is optimized for fast querying over large datasets, thanks to
its columnar storage format and distributed architecture.
o It supports SQL queries, which makes it accessible for data scientists
familiar with SQL.
3. Fully Managed:
o No need to worry about data backups, updates, or hardware failures.
Google handles everything, making it easy to focus on analytics.
4. Integrated with Google Cloud Tools:
o BigQuery seamlessly integrates with other Google Cloud services
like Google Cloud AI, Google Cloud Storage, and Google Cloud
Dataproc for advanced analytics and ML.

workflow :
1. Data Storage
o Data is stored in datasets, which are collections of tables.
o Tables contain data organized in columns and rows (like a typical
relational database).
o BigQuery stores data in the columnar format, making it fast for
analytical queries.
2. Loading Data
o You can load data into BigQuery from multiple sources like
Google Cloud Storage (GCS), Google Sheets, or external
databases.
o It supports both batch loading (uploading large amounts of data
at once) and streaming (real-time data insertion).
3. Running Queries
o Use SQL queries to interact with data stored in BigQuery, as in
the sketch after this list.
o It uses distributed computing to run queries in parallel,
speeding up the processing time for large datasets.
4. Query Optimization
o BigQuery automatically optimizes queries by determining the
most efficient way to execute them, which minimizes the need
for manual tuning.
5. Storage and Pricing
o BigQuery has two main pricing components:
1. Storage: Charges based on the amount of data stored.
2. Queries: Charges based on the amount of data processed
when executing queries.
o The cost of querying is calculated by the amount of data read
during query execution, not the number of queries.
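
A minimal sketch of running a query through the official
google-cloud-bigquery Python client; it assumes GCP credentials are
configured, and the project, dataset, and table names are hypothetical.
Note that the bytes scanned by the query, not the number of queries,
drive the cost:

    from google.cloud import bigquery

    # Hypothetical project; credentials come from the environment.
    client = bigquery.Client(project="my-project")

    query = """
        SELECT country, COUNT(*) AS users
        FROM `my-project.analytics.customers`
        GROUP BY country
        ORDER BY users DESC
        LIMIT 10
    """
    # client.query() starts a job; .result() waits for it and returns rows.
    for row in client.query(query).result():
        print(row["country"], row["users"])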

BigQuery for Data Science and ML

1. Data Storage:
BigQuery stores huge amounts of data (like millions of customer
records) in an organized way (tables and columns).
2. Data Querying:
You can query that data using SQL, so if you want to see which
customers are most likely to leave, you can write a simple SQL query
to filter and analyze the data.
3. Machine Learning Models:
Instead of exporting the data to a separate tool, you can build
machine learning models directly in BigQuery using BigQuery ML.
For example, you can predict customer behavior based on past data
using a simple SQL query, as in the sketch after this list.
4. Integration with Other Tools:
BigQuery can send and receive data from other Google Cloud tools
like Google Cloud Storage (for storing data), TensorFlow (for deep
learning), and Google AI (for additional machine learning models),
making it a central hub for your data science and ML needs.
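
Since BigQuery ML is driven entirely by SQL, the churn example above can
be sketched as two statements submitted through the same Python client;
every dataset, table, and column name here is a hypothetical stand-in:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a logistic-regression model directly where the data lives.
    client.query("""
        CREATE OR REPLACE MODEL `analytics.churn_model`
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT tenure_months, monthly_spend, churned
        FROM `analytics.customers`
    """).result()

    # Score new customers with ML.PREDICT, still in plain SQL; the output
    # includes a predicted_<label> column, here predicted_churned.
    rows = client.query("""
        SELECT customer_id, predicted_churned
        FROM ML.PREDICT(MODEL `analytics.churn_model`,
                        (SELECT customer_id, tenure_months, monthly_spend
                         FROM `analytics.new_customers`))
    """).result()
    for row in rows:
        print(row["customer_id"], row["predicted_churned"])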

Advantages :

1. Speed:
BigQuery’s architecture is designed to quickly analyze and process
large datasets, saving time during data exploration, model training,
and evaluation.
2. Cost-Effective:
BigQuery uses a pay-as-you-go pricing model, meaning you only
pay for the data you query. This is useful for data scientists as they
can run analysis and ML tasks without paying upfront for storage or
processing.
3. Seamless Integration:
BigQuery works well with other cloud-based tools and machine
learning frameworks, providing a seamless workflow for data science
projects.
4. Ease of Use:
BigQuery uses standard SQL, which is a familiar language for data
analysts and data scientists. The built-in BigQuery ML feature
makes machine learning accessible without requiring extensive
coding.
5. Scalable:
As your data grows, BigQuery automatically scales to handle larger
datasets, making it ideal for enterprises or applications with massive
data storage and processing needs.

Use cases : customer segmentation, real-time analytics, large-scale data
processing, and predictive analytics.
