CCD chapter 3 notes
• data governance in the cloud
Advantages:
1) Better Data Security: Ensures sensitive data is protected from breaches
and unauthorized access.
2) Improved Data Quality: Keeps data accurate and reliable for better
decision-making.
3) Access Control: Ensures only authorized users can access specific data.
4) Disaster Recovery: Provides backups and recovery plans to ensure data is
safe during failures.
5) Cost Management: Optimizes storage and processing costs by removing
redundant data.
6) Continuous Monitoring: Tracks who accessed or modified data, making it
easy to audit.
Disadvantages:
1) Hard to Set Up: It takes time and effort to create and manage rules for
data in the cloud.
2) Dependency on Cloud Provider: Organizations must trust cloud providers
to maintain security.
3) Training Needs: Employees must learn how to handle data properly,
which takes time and resources.
4) Needs Constant Monitoring: Data must be watched continuously, which
can be challenging.
• key-value databases in cloud computing
Advantages:
1) Scalability: Easily expands to handle increasing amounts of data.
2) Easy to Use: The simple key-value structure makes it straightforward to work with.
3) Speed: Extremely fast for simple queries.
4) Flexible: Can store different types of data as values.
Disadvantages:
1) Limited Searching: You can only search for data using keys, not values.
2) No Data Relationships: Cannot link records the way relational databases can.
3) Not Ideal for Large Values: Performance can slow down if the stored
values are too large.
4) Harder to Manage Complex Data: Managing and retrieving complex,
interrelated data becomes challenging.
Use Cases:
1) E-commerce Applications: Manages shopping carts and user
preferences (a sketch follows the examples list below).
2) Real-Time Analytics: Tracks events and metrics for monitoring
dashboards.
Examples of Key-Value Databases in Cloud Computing
1) Amazon DynamoDB
2) Azure Table Storage
3) Google Cloud Datastore
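A minimal sketch of the key-value pattern for the shopping-cart use case above, using Amazon DynamoDB through the boto3 Python library. The table name "ShoppingCarts", the key "cart_id", and the item layout are hypothetical, and the table is assumed to already exist with AWS credentials configured:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    carts = dynamodb.Table("ShoppingCarts")  # hypothetical table name

    # Write: the whole cart is the value stored under one key.
    carts.put_item(Item={
        "cart_id": "user-123",
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    })

    # Read: lookup is by key only; you cannot search inside the values,
    # which is the "limited searching" drawback listed above.
    response = carts.get_item(Key={"cart_id": "user-123"})
    print(response.get("Item"))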
• batch and streaming data in machine learning
Batch Data:
− Batch data refers to a collection of data that is processed in groups or
batches at specific intervals.
− It involves analyzing large datasets that have already been collected and
stored.
− Typically has higher latency since data is not processed immediately.
− This approach is widely used in machine learning (ML) for tasks that do not
require real-time processing and focus on analyzing historical data to train
predictive models.
Workflow:
1) Data Collection: Collect data from various sources over a fixed period.
2) Storage: Store the collected data in a database, data warehouse, or file
system (e.g., Amazon S3).
3) Preprocessing: Clean, filter, and prepare the data for analysis. Common
preprocessing tasks include handling missing values, removing duplicates,
and normalizing data.
4) Batch Processing: Analyze or process the data in bulk using tools like
Hadoop or Python libraries.
5) Model Training: Use the processed data to train machine learning models.
6) Evaluation and Deployment: Evaluate the trained model's performance
using test datasets. Deploy the model to make predictions on new data.
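A minimal sketch of this batch workflow in Python with pandas and scikit-learn; the file name, feature columns, and label column are hypothetical placeholders:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # 1-2) Collection and storage: load a dataset gathered earlier.
    df = pd.read_csv("historical_data.csv")  # hypothetical file

    # 3) Preprocessing: handle missing values, remove duplicates, normalize.
    df = df.dropna().drop_duplicates()
    features = ["age", "income"]  # hypothetical feature columns
    df[features] = (df[features] - df[features].mean()) / df[features].std()

    # 4-5) Batch processing and model training over the full dataset.
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["label"], test_size=0.2, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)

    # 6) Evaluation on held-out test data before deployment.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))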
Advantages:
1) Efficient for Large Data Volumes: Handles large datasets in a single run,
making it suitable for big data applications.
2) Cost-Effective: Optimizes resource usage and costs, e.g., by running jobs
during off-peak hours.
3) Supports Complex Workflows: Handles complex transformations over large
datasets.
4) Data Consistency: Processes entire datasets at once, ensuring uniformity
and reducing inconsistencies.
5) Suitable for Model Training: Allows machine learning models to be trained
on large, historical datasets.
Disadvantages:
1) You have to wait until all the data is collected and processed, so it is not
suitable for instant results.
2) Processing large amounts of data at once requires substantial computing
resources.
3) You need enough storage space to hold the data before it is processed.
4) If there is an error, you might need to start the process all over again,
which wastes time.
5) It is not suitable for real-time data.
Examples:
1) Weather Data
2) Customer Feedback Analysis
3) Image or Video Processing
4) Generating monthly employee salaries
5) Generating academic records at the end of a semester
Streaming Data:
− Streaming data refers to the continuous flow of data generated and
processed in real time.
− It involves analyzing data immediately as it arrives.
− Offers lower latency because data is processed immediately as it flows
into the system.
− It is used in scenarios where data arrives in small increments over time.
Workflow:
1) Data Generation: Data is generated from sources like sensors and user
clicks.
2) Ingestion: Data is ingested using tools like Google Pub/Sub.
3) Preprocessing: Data is cleaned and transformed in real time to prepare
it for analysis.
4) Model Application: Pretrained models are used to make predictions.
5) Output: Results are used for real-time decision-making.
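A minimal sketch of this streaming workflow in Python. The event source is simulated with a generator; in practice ingestion would come from a service like Google Pub/Sub, and the simple threshold rule stands in for a pretrained model:

    import itertools
    import random
    import time

    def event_stream():
        # Simulate a continuous source of transaction events.
        while True:
            yield {"user": "u42", "amount": random.uniform(1, 5000)}
            time.sleep(0.1)

    THRESHOLD = 4000  # stand-in for a pretrained fraud model's decision rule

    # Each event is preprocessed and scored immediately as it arrives,
    # rather than being collected into a batch first.
    for event in itertools.islice(event_stream(), 50):
        if event["amount"] > THRESHOLD:
            print("possible fraud, act now:", event)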
Advantages:
1) Real-Time Insights: You can get results and take action immediately as
data arrives.
2) Always Updated: Works with the latest data, ensuring decisions are
based on current information.
3) Ideal for Quick Decisions: Used when decisions must be made
immediately, on the spot.
Disadvantages:
1) Needs Complex Systems: Real-time processing requires complex tools,
careful setup, and expertise.
2) Expensive: Requires more computing power, which can increase costs.
3) Hard to Ensure Data Quality: It is challenging to clean or verify the data
quickly because processing happens immediately.
4) Limited for Historical Analysis: Because it focuses on real-time data, it
is harder to analyze historical data.
Examples:
1) Social Media Monitoring
2) Fraud Detection
3) Stock Market Analysis
4) Traffic Updates
AWS Redshift:
− Amazon Redshift is a fully managed, cloud-based data warehousing
service provided by AWS.
− It is designed for storing and analyzing large volumes of structured and
semi-structured data using SQL-based tools and business
intelligence applications.
− Redshift is optimized for high-performance analytics, supporting
complex queries on large datasets with scalability, speed, and
cost-effectiveness.
Workflow:
1. Data Loading:
Data is uploaded to Redshift from sources like files, databases, or other
cloud services (e.g., Amazon S3).
2. Data Storage:
Data is organized in a column format, compressed, and stored in the
cloud for easy access.
3. Querying Data:
You can run SQL queries to analyze data, generate reports, or find
insights.
4. Processing Queries:
Redshift splits tasks across multiple servers to process large queries
faster.
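A minimal sketch of this workflow from Python using psycopg2 (Redshift speaks the PostgreSQL wire protocol). The cluster endpoint, credentials, table, S3 path, and IAM role are all hypothetical placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
        port=5439, dbname="dev", user="awsuser", password="...",
    )
    cur = conn.cursor()

    # 1) Data loading: bulk-load rows from Amazon S3 with the COPY command.
    cur.execute("""
        COPY sales FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV;
    """)

    # 3-4) Querying: the leader node plans this query and the compute
    # nodes execute it in parallel over their data slices.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    print(cur.fetchall())

    conn.commit()
    cur.close()
    conn.close()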
Architecture:
1. Leader Node: The leader node acts like a manager or controller. It:
• Receives queries from users or applications.
• Creates a plan to execute those queries.
• Coordinates the work with compute nodes.
• Returns the final results to the user.
2. Compute Nodes : Compute nodes are like the workers. They:
• Store the actual data in the form of slices (chunks of data).
• Perform the heavy lifting by processing queries sent by the leader
node.
• Work in parallel to speed up tasks.
3. Storage Layer: Redshift stores data in a columnar format
(organized by columns, not rows).
This format is faster for analytical queries because only the relevant
columns are read.
4. Network and Integration: Redshift is connected to AWS services
and external tools to ingest and export data easily. Data can be loaded
from Amazon S3, databases, or streaming services.
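A toy Python illustration of the columnar idea from point 3: to aggregate one column, a columnar layout touches only that column, while a row layout would read every field of every row:

    # Row-oriented: each record is stored together.
    rows = [
        {"id": 1, "region": "EU", "amount": 120.0},
        {"id": 2, "region": "US", "amount": 80.0},
    ]

    # Column-oriented: each column is stored together.
    columns = {
        "id": [1, 2],
        "region": ["EU", "US"],
        "amount": [120.0, 80.0],
    }

    # An analytical query like AVG(amount) reads only the "amount" column.
    print(sum(columns["amount"]) / len(columns["amount"]))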
Advantages:
− Easy to Use: You don’t need to manage servers; AWS handles that
for you.
− Faster Analytics: Optimized for analyzing big datasets quickly.
− Flexible Scaling: Adjust resources (storage and compute) as needed.
− Affordable: Pay only for what you use, and store large amounts of
data cost-effectively.
GCP BigQuery:
− BigQuery is a fully-managed, serverless, and highly scalable data
warehouse built on Google Cloud Platform (GCP) for performing fast
and SQL-based queries on massive datasets.
− It's widely used in data science and machine learning (ML) for storing,
analyzing, and managing large datasets quickly and efficiently.
Workflow:
1. Data Storage
o Data is stored in datasets, which are collections of tables.
o Tables contain data organized in columns and rows (like a typical
relational database).
o BigQuery stores data in the columnar format, making it fast for
analytical queries.
2. Loading Data
o You can load data into BigQuery from multiple sources like
Google Cloud Storage (GCS), Google Sheets, or external
databases.
o It supports both batch loading (uploading large amounts of data
at once) and streaming (real-time data insertion).
3. Running Queries
o Use standard SQL queries to interact with data stored in BigQuery.
o It uses distributed computing to run queries in parallel,
speeding up the processing time for large datasets.
4. Query Optimization
o BigQuery automatically optimizes queries by determining the
most efficient way to execute them, which minimizes the need
for manual tuning.
5. Storage and Pricing
o BigQuery has two main pricing components:
1. Storage: Charges based on the amount of data stored.
2. Queries: Charges based on the amount of data processed
when executing queries.
o The cost of querying is calculated by the amount of data read
during query execution, not the number of queries.
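A minimal sketch of loading and querying with the google-cloud-bigquery Python client; the project, dataset, table, and GCS path are hypothetical placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.customers"  # hypothetical

    # 2) Loading: batch-load a CSV file from Google Cloud Storage.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/customers.csv",  # hypothetical
        table_id,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,  # infer the table schema from the file
        ),
    )
    load_job.result()  # wait for the load to finish

    # 3-5) Querying: standard SQL, billed by bytes scanned, so selecting
    # only the columns you need keeps queries cheaper.
    query = f"SELECT country, COUNT(*) AS n FROM `{table_id}` GROUP BY country"
    for row in client.query(query).result():
        print(row.country, row.n)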
How BigQuery is used in data science and ML:
1. Data Storage:
BigQuery stores huge amounts of data (like millions of customer
records) in an organized way (tables and columns).
2. Data Querying:
You can query that data using SQL, so if you want to see which
customers are most likely to leave, you can write a simple SQL query
to filter and analyze the data.
3. Machine Learning Models:
Instead of exporting the data to a separate tool, you can build
machine learning models directly in BigQuery using BigQuery ML.
For example, you can predict customer behavior based on past data
using a simple SQL query.
4. Integration with Other Tools:
BigQuery can send and receive data from other Google Cloud tools
like Google Cloud Storage (for storing data), TensorFlow (for deep
learning), and Google AI (for additional machine learning models),
making it a central hub for your data science and ML needs.
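A hedged sketch of the BigQuery ML step from point 3: the model is trained and applied with SQL inside BigQuery, with no data export. The dataset, tables, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a logistic regression churn model directly inside BigQuery.
    client.query("""
        CREATE OR REPLACE MODEL `my_dataset.churn_model`
        OPTIONS (model_type = 'logistic_reg',
                 input_label_cols = ['churned']) AS
        SELECT tenure, monthly_charges, churned
        FROM `my_dataset.customers`;
    """).result()

    # Score new customers; BigQuery ML adds a predicted_<label> column.
    rows = client.query("""
        SELECT customer_id, predicted_churned
        FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                        (SELECT customer_id, tenure, monthly_charges
                         FROM `my_dataset.new_customers`));
    """).result()
    for row in rows:
        print(row.customer_id, row.predicted_churned)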
Advantages:
1. Speed:
BigQuery’s architecture is designed to quickly analyze and process
large datasets, saving time during data exploration, model training,
and evaluation.
2. Cost-Effective:
BigQuery uses a pay-as-you-go pricing model: you pay for the data your
queries scan, plus a separate storage charge, with no upfront
infrastructure costs. Data scientists can run analysis and ML tasks
without provisioning servers.
3. Seamless Integration:
BigQuery works well with other cloud-based tools and machine
learning frameworks, providing a seamless workflow for data science
projects.
4. Ease of Use:
BigQuery uses standard SQL, which is a familiar language for data
analysts and data scientists. The built-in BigQuery ML feature
makes machine learning accessible without requiring extensive
coding.
5. Scalable:
As your data grows, BigQuery automatically scales to handle larger
datasets, making it ideal for enterprises or applications with massive
data storage and processing needs.