
Exam Topics - PDE - Questions-7w1dhd9jefy8p8w9ucpjurqidy

The document outlines various Google Cloud services and their corresponding use cases based on specific keywords. It includes services like Cloud Bigtable for IoT data, BigQuery for analytics, and Cloud Spanner for relational databases with ACID guarantees. Additionally, it discusses data processing tools like Dataproc and Dataflow, as well as storage options such as multi-regional and nearline storage.

Uploaded by

Kasani Ravikumar

Important Keywords and Answers:

If a question contains these keywords, we can conclude the corresponding answer:

keywords → answer

Google Stackdriver Monitoring → performance, NOT missing data

IoT, sensor, clicks, low latency, HBase API, ingest large volumes → Cloud Bigtable

warehouse, analytics, SQL → BigQuery

SQL, relational database, multiple regions, transactions, ACID (atomicity, consistency, isolation, durability) guarantees, global (all regional names listed), transaction tables scale horizontally → Cloud Spanner

SQL, relational database, transactions, PostgreSQL, single or particular region specified, up to 30 TB storage → Cloud SQL

Spark, Hadoop → Cloud Dataproc

messaging service, queuing service, subscription → Cloud Pub/Sub

images and containers → Cloud Build

data transformation pipeline → Cloud Data Fusion

quick checks, within a minute, one second → Cloud Functions

Ruby, minimal code, serverless → App Engine

backend, mobile app → Firestore

data preparation for analytics and ML → Cloud Dataprep

SDK, batch & stream processing → Cloud Dataflow

data access with high availability, multi-regional/global → Multi-Regional Storage

data access in a single region → Regional Storage

data accessed once per month / once per 30 days → Nearline Storage

data accessed once per quarter / once per 90 days → Coldline Storage

data stored permanently but rarely accessed → Archive Storage
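The access-frequency tiers above map naturally onto a GCS Object Lifecycle Management policy. A minimal sketch of such a policy as a Python dict (the age thresholds mirror the keywords above; a real policy would be saved as JSON and applied with `gsutil lifecycle set`):

```python
# Sketch of a GCS Object Lifecycle Management policy that steps objects
# through the storage classes listed above. Age thresholds are illustrative:
# 30 days -> Nearline, 90 days -> Coldline, 365 days -> Archive.
lifecycle_policy = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365}},
    ]
}

def storage_class_after(days, policy):
    """Simulate which storage class an object would hold after `days` days."""
    current = "STANDARD"
    for rule in policy["rule"]:  # rules are ordered by ascending age here
        if days >= rule["condition"]["age"]:
            current = rule["action"]["storageClass"]
    return current
```

The helper only simulates which class an object ends up in; GCS evaluates the rules server-side.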
AutoML → custom models trained for a specific use case.

INTERNAL USE
Cloud Vision API → pre-trained models to detect labels, faces, and words.

Stackdriver is used to track access logs for BigQuery.

Two key points: persist data beyond the life of the cluster → Google Cloud Storage (GCS); managed Hadoop cluster → Dataproc. Persistent storage: GCS (Dataproc uses the Cloud Storage connector to connect to GCS).

Description: Dataproc is used to migrate Hadoop and Spark jobs to GCP. Dataproc connected to GCS through the Cloud Storage connector keeps data stored after the life of the cluster. When a job is highly I/O intensive, create a larger persistent disk.

Ans: B, C, D

B - Not labelled as fraud or not, so unsupervised.
C - Clustering can be done based on location, amount, etc.
D - Location is already given, so labelled; hence supervised.

First rule of dataproc is to keep data in GCS.


Dataproc storage: the cost-effective option is Cloud Storage.

The custom endpoint is not acknowledging the message; that is why Pub/Sub sends the message again and again. When you do not acknowledge a message before its acknowledgement deadline has expired, Pub/Sub resends the message. As a result, Pub/Sub can send duplicate messages.
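The redelivery behaviour can be illustrated with a toy in-memory model. This is not the real Pub/Sub client API, just a sketch of the at-least-once contract: any message that has not been acked is eligible for delivery again.

```python
# Toy model of Pub/Sub at-least-once delivery: a message that is never
# acknowledged stays in the unacked set and is delivered again on the
# next pull, producing a duplicate.
class ToyBroker:
    def __init__(self):
        self.unacked = {}  # message_id -> payload

    def publish(self, message_id, data):
        self.unacked[message_id] = data

    def pull(self):
        # Every still-unacked message is eligible for (re)delivery.
        return dict(self.unacked)

    def ack(self, message_id):
        self.unacked.pop(message_id, None)

broker = ToyBroker()
broker.publish("m1", "payload")

first = broker.pull()   # the endpoint receives the message...
second = broker.pull()  # ...but never acks, so it is delivered again
```

In the real service, acknowledging before the deadline (or extending the deadline) is what prevents the duplicate delivery.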

Datalab, before it was deprecated; now the answer is Vertex AI.

Keywords: API, project sink

reading only relevant columns

perform analytics → BigQuery

Stackdriver can tell us about performance but not about missing data.

Apache Spark is faster than Hadoop/Pig/MapReduce.

Spark > Hadoop, Pig, Hive

ML → BQ

Keywords → gsutil , TAR

Hadoop/Spark jobs are run on Dataproc, and preemptible machines cost 80% less.

A & B - Need to build your own model, so discard those options.
C or D can do the job using the Cloud Video Intelligence API; Bigtable is the better option, so C is correct.
IoT → Bigtable

Answer: C - best suited for the purpose, with autoscaling and the Google-recommended transform engine (Dataflow) between Pub/Sub and BigQuery.

Entity analysis → identify entities within documents (receipts, invoices, and contracts) and label them by types such as date, person, contact information, organization, location, events, products, and media.

Sentiment analysis → understand the overall opinion, feeling, or attitude expressed in a block of text. Avoid custom models.

Spanner allows transaction tables to scale horizontally and secondary indexes for range queries.

Bigtable can take in data from Dataproc, Spark, and Hadoop.


HBase API, ingest large volumes → Cloud Bigtable; Apache/Hadoop → Bigtable

The link on authorized views (https://cloud.google.com/bigquery/docs/share-access-views) explicitly states


"Authorized views should be created in a different dataset from the source data. That way, data owners can give
users access to the authorized view without simultaneously granting access to the underlying data." therefore B is
the correct answer because we are to create a new dataset and view within that dataset.

By subsampling the training data, you reduce the training time. In the case of D, increasing the number of layers may increase the model's accuracy, but it will not reduce the time required to train the model.

Speed of data transfer depends on Bandwidth

Cloud SQL is a cheap, relational DB.

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and
unstructured data for analysis, reporting, and machine learning.

A - 56% and C - 44%

highly available = multi-regional
recovery strategy of this data that minimizes cost = point-in-time snapshot

Since we do not know when the load job will finish, we cannot use a fixed scheduler or cron job. With Composer we can define logic and dependencies to first check whether the load job has finished and then run the Dataprep job. Dataprep can be run on Dataflow using a template, and Cloud Composer creates the dependency on the previous job. For dependency creation, the only valid option here is Cloud Composer (Apache Airflow). When the Cloud Dataprep job executes, it creates a Dataflow template that is stored in GCS; this can be exported from there and used to create the workflow in Cloud Composer.

Managed Service - Cloud Composer

Cloud Composer is a fully managed workflow orchestration service, enabling you to create, schedule, monitor, and
manage workflows that span across clouds and on-premises data centers.

Add a ParDo transform in Cloud Dataflow to discard corrupt elements


By creating an authorized view, one ensures that the data is current and avoids taking more storage space (and cost) in order to share a dataset. B and D are not cost-optimal, and C does not guarantee that the data is kept updated.

The table is already partitioned by ingestion date, so cluster on package-tracking ID.

Dataflow → streaming and batch. Dataproc → Hadoop

A good row key is an ID followed by a timestamp; the stock symbol works as the ID in this case.
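The ID-then-timestamp pattern can be sketched in a few lines (symbols and timestamps are illustrative). Zero-padding the timestamp keeps Bigtable's lexicographic row order aligned with time order:

```python
def make_row_key(symbol, epoch_seconds):
    # Bigtable sorts rows lexicographically, so prefix with the ID (stock
    # symbol) and zero-pad the timestamp so string order matches time order.
    return f"{symbol}#{epoch_seconds:010d}"

# Rows for the same symbol end up contiguous and time-ordered, which makes
# range scans like "all GOOG ticks in an interval" cheap.
keys = sorted([
    make_row_key("GOOG", 1700000000),
    make_row_key("AAPL", 1700000060),
    make_row_key("GOOG", 1700000060),
])
```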

Keywords: subscription, increase/decrease

The Google Cloud native alternative to Kafka is Pub/Sub, and Dataflow combined with Pub/Sub is the Google-recommended option.

Denormalization helps performance by reducing query time. Append has better performance than update.

Multi-region increases high availability, and PDFs can be stored in GCS.

This is a case of underfitting, not overfitting (for overfitting, the model would have extremely low training error but high testing error), so we need to make the model more complex.

Bigtable provides the lowest latency; the requirement is to serve predictions within 100 ms.

The n1-standard-1 instance is a low configuration and needs to be larger, so B should definitely be one of the options. Increasing max workers increases parallelism and processes data faster, provided a larger, multi-core machine type is chosen. Option A can be a better step.

"The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise."

"Adding nodes to the original cluster: You can add 3 nodes to the cluster, for a total of 6 nodes. The write throughput for the instance doubles, but the instance's data is available in only one zone."

An aggregated log sink creates a single sink for all projects; the destination can be a Cloud Storage bucket, a Pub/Sub topic, a BigQuery table, or a Cloud Logging bucket. Without an aggregated sink, this would have to be done for each project individually, which is cumbersome.

Transfer Appliance for moving offline data, large data sets, or data from a source with limited bandwidth. Transfer
Appliance is a high-capacity storage device that enables you to transfer and securely ship your data to a Google
upload facility, where we upload your data to Cloud Storage.

Vote for 'A', because of requirement - Enabling non-developer analysts to modify transformations.
Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and
unstructured data for analysis, reporting, and machine learning. Because Dataprep is serverless and works at any
scale, there is no infrastructure to deploy or manage. Your next ideal data transformation is suggested and
predicted with each UI input, so you don’t have to write code.

AutoML -> custom specific models trained for specific use case.
Cloud Vision API -> pre-trained models to detect labels, faces, words.

It is now feasible to provide table-level access, allowing a user to query a single table while no other tables in the same dataset are visible to the user.

For I/O intensive jobs, increasing the disk size resolves the issue.

Geospatial and ML functionality is available in BigQuery.

Reasons:
a) KafkaIO with Dataflow is a valid option for interconnect (regardless of where Kafka is located: on-prem, Google Cloud, or another cloud).
b) A sliding window will help calculate the average.
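The sliding-window average can be sketched in plain Python (window size and hop period are illustrative; in Beam the equivalent would be `beam.WindowInto(window.SlidingWindows(size, period))` followed by a mean combiner):

```python
def sliding_window_averages(values, size, period):
    # Average over windows of `size` elements that start every `period`
    # elements, mimicking a sliding (hopping) window over a stream.
    averages = []
    for start in range(0, len(values) - size + 1, period):
        window = values[start:start + size]
        averages.append(sum(window) / size)
    return averages
```

Each element can contribute to several overlapping windows, which is exactly why a sliding window (rather than a fixed one) suits a moving average.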

These are functionalities that are currently lacking/not available in Pub/Sub. Pub/Sub can retain messages only for 31 days at most.

The question asks for cost-effective, so use HDD persistent disks, which are cheaper than SSDs.

If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization
action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. Without
access to the Internet, you can enable Private Google Access, and place job dependencies in Cloud Storage; cluster
nodes can download the dependencies from Cloud Storage from internal IPs.

It specifically asks for scaling up, which can be done in Cloud SQL, and the data can be queried using SQL. Cloud SQL continues to add storage until it reaches the maximum of 30 TB.

Cloud SQL (30 TB)

A tall and narrow table has a small number of events per row, which could be just one event, whereas a short and
wide table has a large number of events per row. As explained in a moment, tall and narrow tables are best suited
for time-series data. For time series, you should generally use tall and narrow tables. This is for two reasons:
Storing one event per row makes it easier to run queries against your data. Storing many events per row makes it
more likely that the total row size will exceed the recommended maximum (see Rows can be big but are not
infinite).

AAD (additional authenticated data) is used to decrypt the data, so it is better to keep it outside GCP for safety.

Monitoring not only provides you with access to Dataflow-related metrics, but also lets you create alerting policies and dashboards so you can chart time series of metrics and choose to be notified when these metrics reach specified values.

ACID compliance for Spanner.

Spanner supports read-write transactions for use cases such as handling bank transactions.

Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs. The table has already been created with ingest-date partitioning.

When you want to move your Apache Spark workloads from an on-premises environment to Google Cloud, we
recommend using Dataproc to run Apache Spark/Apache Hadoop clusters. Dataproc is a fully managed, fully
supported service offered by Google Cloud. It allows you to separate storage and compute, which helps you to
manage your costs and be more flexible in scaling your workloads.
Migrating Hive data from your on-premises or other cloud-based source cluster to BigQuery has two steps: 1.
Copying data from a source cluster to Cloud Storage 2. Loading data from Cloud Storage into BigQuery.

The Seek feature extends subscriber functionality by allowing you to alter the acknowledgement state of
messages in bulk. For example, you can replay previously acknowledged messages or purge messages in
bulk. In addition, you can copy the state of one subscription to another by using seek in combination
with a Snapshot.

Using the TRANSFORM clause, you can specify all preprocessing during model creation. The preprocessing is
automatically applied during the prediction and evaluation phases of machine learning.

moving average → sliding window

TUMBLE=> fixed windows. A tumbling window represents a consistent, disjoint time interval in the data stream.
HOP=> sliding windows.
SESSION=> session windows.
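The difference between TUMBLE and HOP can be made concrete by computing which window start times a given event timestamp falls into (a pure-Python sketch, not Beam or BigQuery code; sizes and periods are illustrative):

```python
def tumbling_window(ts, size):
    # An event belongs to exactly one fixed, disjoint window.
    return [ts - ts % size]

def hopping_windows(ts, size, period):
    # An event can belong to several overlapping (sliding) windows: every
    # window starting at a multiple of `period` that still covers `ts`.
    start = ts - ts % period
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= period
    return sorted(starts)
```

For an event at t=17: with TUMBLE(size=10) it lands only in the window starting at 10, while with HOP(size=10, period=5) it lands in the windows starting at 10 and 15.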

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage
and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you
can control costs by reducing the number of bytes read by a query.
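The cost effect can be sketched with a toy model: a query with a partition filter only scans matching partitions, so fewer bytes are read (partition keys and row contents are illustrative):

```python
# Toy model of partition pruning: rows are grouped by date partition, and a
# date-filtered query only reads the matching partitions.
table = {
    "2024-01-01": [{"id": 1}, {"id": 2}],
    "2024-01-02": [{"id": 3}],
    "2024-01-03": [{"id": 4}, {"id": 5}],
}

def query(table, dates):
    scanned = [d for d in table if d in dates]   # pruned partition list
    rows = [row for d in scanned for row in table[d]]
    return rows, len(scanned)

rows, scanned_count = query(table, {"2024-01-02"})
```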

Keywords: You want to ensure that the sensitive data is masked but still maintains referential integrity.
Part 1 - data is masked: create a pseudonym by replacing PII data with a cryptographic token.
Part 2 - still maintains referential integrity: with a cryptographic format-preserving token.
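Why a deterministic token preserves referential integrity: identical inputs always map to identical tokens, so joins across tables still line up. A toy sketch using a keyed hash (HMAC); a real deployment would use Cloud DLP's deterministic or format-preserving encryption transforms, not this:

```python
import hmac
import hashlib

SECRET_KEY = b"demo-key"  # illustrative only; use a managed key in practice

def pseudonymize(value: str) -> str:
    # Deterministic keyed hash: the same input always yields the same token,
    # so the token can still serve as a join key (referential integrity).
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# The same SSN tokenizes identically in both tables, so the join still works.
orders = [{"ssn": pseudonymize("123-45-6789"), "amount": 10}]
customers = [{"ssn": pseudonymize("123-45-6789"), "segment": "retail"}]
```

Note this sketch is deterministic but not format-preserving; a format-preserving token would additionally keep the `NNN-NN-NNNN` shape.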

Denormalization is a common strategy for increasing read performance for relational datasets that were previously
normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best
to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child
relationships.

If the call takes 1 second on average, that would cause massive backpressure on the pipeline. In these circumstances you should consider batching these requests instead.
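Batching can be sketched as buffering elements and making one remote call per batch instead of one per element (`call_service` is a hypothetical stand-in for the external endpoint):

```python
def batched_calls(elements, batch_size, call_service):
    # Buffer elements and issue one call per batch, amortizing the per-call
    # latency over `batch_size` elements instead of paying it per element.
    results, buffer = [], []
    for element in elements:
        buffer.append(element)
        if len(buffer) == batch_size:
            results.extend(call_service(buffer))
            buffer = []
    if buffer:  # flush the final partial batch
        results.extend(call_service(buffer))
    return results
```

In a Beam DoFn the same buffering idea is typically done per bundle, flushing any remainder when the bundle finishes.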

The gsutil tool is the standard tool for small- to medium-sized transfers (less than 1 TB) over a typical enterprise-
scale network, from a private data center to Google Cloud.

While your job is running, you might encounter errors or exceptions in your worker code. These errors generally
mean that the DoFns in your pipeline code have generated unhandled exceptions, which result in failed tasks in
your Dataflow job. Exceptions in user code (for example, your DoFn instances) are reported in the Dataflow
monitoring interface.

Cloud Storage as restricted API

When you create a Cloud Spanner instance, you must configure it as either regional (that is, all the resources are contained within a single Google Cloud region) or multi-region (that is, the resources span more than one region). You can change the instance configuration to multi-regional (or global) at any time.

Like gsutil, Storage Transfer Service for on-premises data enables transfers from network file system (NFS) storage
to Cloud Storage. Although gsutil can support small transfer sizes (up to 1 TB), Storage Transfer Service for on-
premises data is designed for large-scale transfers (up to petabytes of data, billions of files).

There are 2 parts, and they are related to each other: 1. Overfitting is fixed by decreasing the number of input features (select only essential features). 2. Accuracy is improved by increasing the amount of training data examples.

Dialogflow is a natural language understanding platform that makes it easy to design and integrate a
conversational user interface into your mobile app, web application, device, bot, interactive voice response
system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your
product. Dialogflow can analyze multiple types of input from your customers, including text or audio inputs (like
from a phone or voice recording). It can also respond to your customers in a couple of ways, either through text or
with synthetic speech.

The source is a proprietary format. Dataflow wouldn't have a built-in template to read the file. You will have to
create something custom.

We exclude [C] as non ACID and [D] for being invalid (location is configured on Dataset level, not Table). Then, let's
focus on "minimal human intervention in case of a failure" requirement in order to eliminate one answer among
[A] and [B]. Basically, we have to compare point-in-time recovery with high availability. It doesn't matter whether
it's about MySQL or PostgreSQL since both databases support those features. - Point-in-time recovery logs are
created automatically, but restoring an instance in case of failure requires manual steps - High availability, in case
of failure requires no human intervention: "If an HA-configured instance becomes unresponsive, Cloud SQL
automatically switches to serving data from the standby instance.

Shared VPC enables organizations to establish budgeting and access control boundaries at the project level while
allowing for secure and efficient communication using private IPs across those boundaries. In the Shared VPC
configuration, Cloud Composer can invoke services hosted in other Google Cloud projects in the same organization
without exposing services to the public internet. Shared VPC requires that you designate a host project to which
networks and subnetworks belong and a service project, which is attached to the host project.

In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased
performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever
possible reads only delta changes from the base tables to compute up-to-date results. Materialized views can be
queried directly or can be used by the BigQuery optimizer to process queries to the base tables. Queries that use
materialized views are generally faster and consume fewer resources than queries that retrieve the same data only
from the base tables. Materialized views can significantly improve the performance of workloads that have the
characteristic of common and repeated queries.

