Important Keywords and Answers:
If a question contains any of these keywords, the answer is most likely the service listed next to them.
keywords --- answer
Google Stackdriver Monitoring -- Ans : reports performance, NOT missing data
IoT, sensors, clicks, low latency, HBase API, ingest large volumes -- Ans : Cloud Bigtable
warehouse, analytics, SQL -- Ans : BigQuery
SQL, relational database, multiple regions, transactions, ACID (atomicity, consistency, isolation, and durability) guarantee, global (all regional names listed), transaction tables scale horizontally -- Cloud Spanner
SQL, relational database, transactions, PostgreSQL, single or particular region specified, up to 30 TB storage -- Cloud SQL
Spark, Hadoop -- Cloud Dataproc
messaging service, queuing service, subscription -- Cloud Pub/Sub
images and containers -- Cloud Build
data transformation pipeline -- Cloud Data Fusion
quick checks, within a minute, one second -- Cloud Functions
Ruby, minimal code, serverless -- App Engine
backend, mobile app -- Firestore
analytics and ML -- Cloud Dataprep
SDK, batch and stream processing -- Cloud Dataflow
highly available data access across multiple regions / global -- multi-regional storage
data access in a single region -- regional storage
data accessed once per month / once per 30 days -- Nearline storage
data accessed once per quarter / 90 days -- Coldline storage
data that must be stored permanently but is rarely accessed -- Archive storage
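The storage-class keywords above can be read as a simple lookup on how often the data is accessed. The sketch below is only an illustration: the function name is hypothetical, and the 30 / 90 / 365 day cut-offs are assumed from the minimum storage durations of the Nearline, Coldline, and Archive classes.

```python
def suggest_storage_class(days_between_accesses: int) -> str:
    """Map an expected access interval (in days) to a Cloud Storage class.

    Thresholds are assumptions for illustration, not an official rule.
    """
    if days_between_accesses < 30:
        return "STANDARD"   # frequently accessed data
    if days_between_accesses < 90:
        return "NEARLINE"   # roughly once per month
    if days_between_accesses < 365:
        return "COLDLINE"   # roughly once per quarter
    return "ARCHIVE"        # kept long term, rarely read


for days in (1, 30, 90, 400):
    print(days, suggest_storage_class(days))
```

Regional versus multi-regional is a separate choice about location, so it is not part of this lookup.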
AutoML -> custom models trained for a specific use case.
INTERNAL USE
Cloud Vision API -> pre-trained models to detect labels, faces, words.
Stackdriver is used to track access logs for BigQuery.
Two key points:
Persist data beyond the life of the cluster -> Google Cloud Storage (GCS)
Managed Hadoop cluster -> Dataproc
Persistent storage: GCS (Dataproc uses the GCS connector to connect to GCS)
Description: Dataproc is used to migrate Hadoop and Spark jobs to GCP. Dataproc, connected to GCS through the
Google Cloud Storage connector, keeps data available after the life of the cluster. When the job is highly I/O
intensive, we need to create a small persistent disk.
Ans: B,C,D
B - The data is not labelled as fraud or not fraud, so unsupervised.
C - Clustering can be done based on location, amount, etc.
D - Location is already given, so the data is labelled. Hence supervised.
First rule of Dataproc is to keep data in GCS.
Dataproc storage: the cost-effective option is Cloud Storage.
The custom endpoint is not acknowledging the message, which is why Pub/Sub sends the message again and again.
When you do not acknowledge a message before its acknowledgement deadline has expired, Pub/Sub
resends the message. As a result, Pub/Sub can send duplicate messages.
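The redelivery behaviour described above can be sketched with a small in-memory simulation. This is not the Pub/Sub client library: `deliver`, the `acks` callback, and `max_attempts` are hypothetical names used only to show why a missing acknowledgement produces duplicates.

```python
from collections import deque


def deliver(messages, acks, max_attempts=3):
    """Simulate at-least-once delivery: unacknowledged messages are re-queued.

    acks(message, attempt) returns True when the endpoint acknowledges.
    Returns the list of every delivery that was made, duplicates included.
    """
    queue = deque((m, 1) for m in messages)
    delivered = []
    while queue:
        msg, attempt = queue.popleft()
        delivered.append(msg)
        if not acks(msg, attempt) and attempt < max_attempts:
            queue.append((msg, attempt + 1))  # deadline expired: send again
    return delivered


# The endpoint acknowledges "a" immediately but "b" only on the third attempt.
log = deliver(["a", "b"], lambda m, n: m == "a" or n >= 3)
print(log)  # ['a', 'b', 'b', 'b']
```

The practical fix is on the endpoint side: acknowledge within the deadline, and make processing tolerant of duplicates.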
Datalab was the answer before it was deprecated; now use Vertex AI.
Keywords : API , project sink
reading only relevant columns
perform analytics → BigQuery
Stackdriver can tell us about performance, but it does not log missing data.
Apache Spark is faster than Hadoop/Pig/MapReduce
Spark > Hadoop, Pig, Hive
ML →BQ
Keywords → gsutil , TAR
Hadoop/Spark jobs are run on Dataproc, and preemptible machines cost up to 80% less.
A & B - You would need to build your own model, so they are discarded as options.
C or D can do the job here using the Cloud Video Intelligence API. Bigtable is the better option, so C is correct.
IoT – Bigtable
Answer: C - best suited for the purpose, with autoscaling; it is the Google-recommended transform engine between
Pub/Sub and BigQuery.
Entity analysis -> Identify entities within documents (receipts, invoices, and contracts) and label them by types such
as date, person, contact information, organization, location, events, products, and media.
Sentiment analysis -> Understand the overall opinion, feeling, or attitude expressed in a block of text.
Avoid custom models.
Spanner allows transaction tables to scale horizontally and supports secondary indexes for range queries.
Bigtable can take in data from Dataproc, Spark, and Hadoop.
HBase API, ingest large volumes -- Ans : Cloud Bigtable. Apache/Hadoop -> Bigtable
The link on authorized views (https://cloud.google.com/bigquery/docs/share-access-views) explicitly states:
"Authorized views should be created in a different dataset from the source data. That way, data owners can give
users access to the authorized view without simultaneously granting access to the underlying data." Therefore B is
the correct answer, because we are to create a new dataset and a view within that dataset.
By subsampling the training data, you reduce the training time. In the case of D, increasing the
number of layers may improve the model's accuracy, but it will not reduce the time required to
train the model.
Speed of data transfer depends on Bandwidth
Cloud SQL is a cheap relational DB.
Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and
unstructured data for analysis, reporting, and machine learning.
? A- 56% and C – 44%
highly available = multi-regional
recovery strategy of this data that minimizes cost = point-in-time snapshot
Since we do not know when the load job will finish, we cannot use a fixed scheduler or cron job. With Composer
we can define logic and dependencies to first check whether the load job has finished and then run the Dataprep job.
Dataprep can be run on Dataflow using a template, and Cloud Composer will create the dependency on the previous job.
For dependency creation, the only valid option among those listed is Cloud Composer (Apache Airflow). When the Cloud
Dataprep job executes, it creates a Dataflow template, which is stored in GCS. The template can be exported from there
and used in creating the workflow in Cloud Composer.
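The dependency logic described above (check that the load has finished, then start the Dataprep job) can be sketched in plain Python. This is not Airflow or Composer code: `run_after`, `load_status`, and `dataprep_job` are hypothetical names, and in a real Composer DAG the polling would be expressed with a sensor task followed by a downstream task.

```python
import time


def run_after(load_status, dataprep_job, poll_seconds=0.0, max_polls=100):
    """Poll the upstream job and start the downstream job once it is done.

    load_status() returns a status string; dataprep_job() starts the next step.
    """
    for _ in range(max_polls):
        if load_status() == "DONE":
            return dataprep_job()
        time.sleep(poll_seconds)  # wait before checking again
    raise TimeoutError("load job did not finish in time")


# Simulated load job that finishes on the third status check.
statuses = iter(["RUNNING", "RUNNING", "DONE"])
result = run_after(lambda: next(statuses), lambda: "dataprep started")
print(result)  # dataprep started
```

A fixed cron schedule cannot express this, because the start time of the second job depends on an event rather than on the clock.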
Managed Service - Cloud Composer
Cloud Composer is a fully managed workflow orchestration service, enabling you to create, schedule, monitor, and
manage workflows that span across clouds and on-premises data centers.
Add a ParDo transform in Cloud Dataflow to discard corrupt elements
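A ParDo can emit zero, one, or many outputs per input element, which is what makes it suitable for dropping corrupt records. The sketch below imitates that contract in plain Python rather than using the Apache Beam SDK; `parse_or_drop`, `par_do`, and the rule that a valid record is a JSON object with an `id` field are assumptions for illustration.

```python
import json


def parse_or_drop(element: str):
    """Yield the parsed record, or nothing at all if the element is corrupt."""
    try:
        record = json.loads(element)
    except json.JSONDecodeError:
        return  # zero outputs: the corrupt element is discarded
    if isinstance(record, dict) and "id" in record:
        yield record


def par_do(fn, elements):
    """Apply fn to each element and flatten its outputs (stand-in for ParDo)."""
    for element in elements:
        yield from fn(element)


raw = ['{"id": 1}', "not json", '{"name": "x"}', '{"id": 2}']
clean = list(par_do(parse_or_drop, raw))
print(clean)  # [{'id': 1}, {'id': 2}]
```

In a real pipeline the same function body would live in the `process` method of a DoFn, and discarded elements are often routed to a side output instead of being silently dropped.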
By creating an authorized view one assures that the data is current and avoids taking more storage space (and
cost) in order to share a dataset. B and D are not cost optimal and C does not guarantee that the data is kept
updated.
The table is already partitioned by ingestion date, so cluster on the package-tracking ID.
Dataflow → streaming and batch. Dataproc → Hadoop
A good row key is an ID followed by a timestamp. The stock symbol in this case works as the ID.
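A minimal sketch of the "ID followed by timestamp" row key, assuming a `#` separator and a zero-padded millisecond timestamp (both are illustrative choices, not a required format). Padding keeps lexicographic order aligned with time order within one symbol.

```python
def row_key(symbol: str, epoch_millis: int) -> str:
    """Build a row key of the form SYMBOL#TIMESTAMP."""
    # Zero-pad to 13 digits so string ordering matches numeric ordering.
    return f"{symbol}#{epoch_millis:013d}"


keys = [
    row_key("GOOG", 1700000000500),
    row_key("AAPL", 1700000000000),
    row_key("GOOG", 1700000000000),
]
print(sorted(keys))
# ['AAPL#1700000000000', 'GOOG#1700000000000', 'GOOG#1700000000500']
```

Because the symbol comes first, all rows for one stock are contiguous and writes are spread across symbols, whereas a key that starts with the timestamp would send all current writes to the same part of the key space.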
subscription, increase – decrease
The Google Cloud native alternative to Kafka is Pub/Sub, and Dataflow paired with Pub/Sub is the
Google-recommended option.
Denormalization helps performance by reducing query time. Appends perform better than updates.
Multi-region increases availability, and PDFs can be stored in GCS.
This is a case of underfitting, not overfitting (with overfitting, the model would have an extremely low training error
but a high testing error), so we need to make the model more complex.
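The distinction above can be written as a rough rule of thumb on the two error values. The function and its thresholds (`high`, `gap`) are hypothetical and only illustrate the reasoning; real diagnosis depends on the problem and the metric.

```python
def diagnose(train_error: float, test_error: float,
             high: float = 0.2, gap: float = 0.1) -> str:
    """Classify a model fit from its training and testing error."""
    if train_error >= high:
        # The model cannot even fit the training data: make it more complex.
        return "underfitting"
    if test_error - train_error >= gap:
        # Low training error but much higher testing error.
        return "overfitting"
    return "ok"


print(diagnose(0.35, 0.37))  # underfitting
print(diagnose(0.02, 0.30))  # overfitting
```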
Bigtable provides the lowest latency. The requirement is to serve predictions within 100 ms.
The n1-standard-1 instance type is a small configuration, so a larger one is needed; B should definitely be one of the options.
Increasing max workers increases parallelism, so the job can process faster, provided an instance type with more CPU and
multiple cores is chosen. Option A can be a better step.
"The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can
be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or
otherwise." "Adding nodes to the original cluster: You can add 3 nodes to the cluster, for a total of 6 nodes. The write
throughput for the instance doubles, but the instance's data is available in only one zone:"
An aggregated log sink creates a single sink for all projects; the destination can be a Cloud Storage bucket, a Pub/Sub
topic, a BigQuery table, or a Cloud Logging bucket. Without an aggregated sink, this would have to be done for each
project individually, which is cumbersome.
Transfer Appliance for moving offline data, large data sets, or data from a source with limited bandwidth. Transfer
Appliance is a high-capacity storage device that enables you to transfer and securely ship your data to a Google
upload facility, where we upload your data to Cloud Storage.
Vote for 'A', because of requirement - Enabling non-developer analysts to modify transformations.
Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and
unstructured data for analysis, reporting, and machine learning. Because Dataprep is serverless and works at any
scale, there is no infrastructure to deploy or manage. Your next ideal data transformation is suggested and
predicted with each UI input, so you don’t have to write code.
It is now possible to grant table-level access, allowing a user to query a single table while no other table in the
same dataset is visible to that user.
For I/O intensive jobs, increasing the disk size resolves the issue.
Geospatial and ML functionality is available in BigQuery.
Reasons:
a) KafkaIO with Dataflow is a valid option for the interconnect (regardless of where Kafka is located: on-prem, Google
Cloud, or another cloud)
b) A sliding window will help to calculate the average.
These are functionalities that are currently lacking or not available in Pub/Sub. Pub/Sub can retain messages
for a maximum of 31 days.
The question asks for a cost-effective option, so use HDD persistent disks, which are cheaper than SSD.
If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization
action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. Without
access to the Internet, you can enable Private Google Access, and place job dependencies in Cloud Storage; cluster
nodes can download the dependencies from Cloud Storage from internal IPs.
It specifically asks for scaling up which can be done in Cloud SQL and can be queried using SQL.
Cloud SQL continues to add storage until it reaches the maximum of 30 TB.
Cloud SQL (30TB)
A tall and narrow table has a small number of events per row, which could be just one event, whereas a short and
wide table has a large number of events per row. For time series, you should generally use tall and narrow tables.
This is for two reasons:
Storing one event per row makes it easier to run queries against your data.
Storing many events per row makes it more likely that the total row size will exceed the recommended maximum.
AAD (additional authenticated data) is needed to decrypt the data, so it is better to keep it outside GCP for safety.
Monitoring not only provides you with access to Dataflow-related metrics, but also lets you create alerting
policies and dashboards so you can chart time series of metrics and choose to be notified when these metrics
reach specified values.
ACID compliance -> Spanner.
Spanner supports read-write transactions for use cases such as handling bank transactions.
Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns.
Clustered tables can improve query performance and reduce query costs. The table has already been created with
ingest-date partitioning.
When you want to move your Apache Spark workloads from an on-premises environment to Google Cloud, we
recommend using Dataproc to run Apache Spark/Apache Hadoop clusters. Dataproc is a fully managed, fully
supported service offered by Google Cloud. It allows you to separate storage and compute, which helps you to
manage your costs and be more flexible in scaling your workloads.
Migrating Hive data from your on-premises or other cloud-based source cluster to BigQuery has two steps: 1.
Copying data from a source cluster to Cloud Storage 2. Loading data from Cloud Storage into BigQuery.
The Seek feature extends subscriber functionality by allowing you to alter the acknowledgement state of
messages in bulk. For example, you can replay previously acknowledged messages or purge messages in
bulk. In addition, you can copy the state of one subscription to another by using seek in combination
with a Snapshot.
Using the TRANSFORM clause, you can specify all preprocessing during model creation. The preprocessing is
automatically applied during the prediction and evaluation phases of machine learning.
moving average → sliding window
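A moving average is computed over a window that slides forward with each new value, which is why the keyword maps to sliding windows. The sketch below uses a count-based window in plain Python as a stand-in; in a streaming pipeline the window would normally be defined by event time rather than by element count.

```python
from collections import deque


def moving_average(values, size):
    """Return the average of the last `size` values after each new value."""
    window = deque(maxlen=size)  # old values fall out automatically
    averages = []
    for value in values:
        window.append(value)
        averages.append(sum(window) / len(window))
    return averages


print(moving_average([10, 20, 30, 40], size=2))  # [10.0, 15.0, 25.0, 35.0]
```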
TUMBLE=> fixed windows. A tumbling window represents a consistent, disjoint time interval in the data stream.
HOP=> sliding windows.
SESSION=> session windows.
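The three window types differ in how a timestamp is assigned to windows. The functions below are a plain-Python sketch with hypothetical names and integer timestamps; they are not the Dataflow SQL or Beam implementations.

```python
def tumble(ts, size):
    """Fixed window: each timestamp belongs to exactly one [start, end)."""
    start = ts - ts % size
    return [(start, start + size)]


def hop(ts, size, period):
    """Sliding window: windows of length `size` start every `period`,
    so one timestamp can belong to several overlapping windows."""
    last_start = ts - ts % period
    # Keep every start s with s + size > ts, walking backwards by period.
    return [(s, s + size) for s in range(last_start, ts - size, -period)]


def sessions(timestamps, gap):
    """Session windows: a new session starts after `gap` of inactivity."""
    result = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][-1] < gap:
            result[-1].append(ts)
        else:
            result.append([ts])
    return result


print(tumble(7, size=5))                           # [(5, 10)]
print(hop(7, size=10, period=5))                   # [(5, 15), (0, 10)]
print(sessions([1, 2, 3, 20, 21, 50], gap=10))     # [[1, 2, 3], [20, 21], [50]]
```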
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage
and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you
can control costs by reducing the number of bytes read by a query.
Keywords: You want to ensure that the sensitive data is masked but still maintains referential integrity.
Part 1 - data is masked: create a pseudonym by replacing PII data with a cryptographic token.
Part 2 - still maintains referential integrity: use a cryptographic format-preserving token.
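Referential integrity survives masking when the same input always maps to the same token, so joins on the masked column still work. The sketch below shows that property with a keyed hash from the standard library. It is a stand-in, not the Cloud DLP API, and it is not format-preserving; the key and function name are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-do-not-use"  # hypothetical key, for illustration only


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a PII value with a deterministic keyed token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability


orders = [("alice@example.com", 30), ("bob@example.com", 12), ("alice@example.com", 7)]
masked = [(pseudonymize(email), amount) for email, amount in orders]

# Both of Alice's orders carry the same token, so they can still be joined.
print(masked[0][0] == masked[2][0])  # True
print(masked[0][0] == masked[1][0])  # False
```

A format-preserving token additionally keeps the length and character set of the original value, which matters when downstream schemas validate the field format.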
Denormalization is a common strategy for increasing read performance for relational datasets that were previously
normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best
to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child
relationships.
If the call takes 1 second on average, that would cause massive backpressure on the pipeline. In these circumstances
you should consider batching these requests instead.
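Batching means grouping elements so that one slow call serves many of them. A minimal sketch, assuming a hypothetical `call_service` that accepts a list; the batch size of 4 is arbitrary.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def call_service(batch):
    """Hypothetical external call that handles a whole batch at once."""
    return [x * 2 for x in batch]


calls = 0
results = []
for batch in batched(range(10), size=4):
    calls += 1
    results.extend(call_service(batch))

print(calls)    # 3 calls instead of 10
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

With a 1 second call, ten elements cost about 3 seconds of waiting here instead of about 10, and the saving grows with the batch size.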
The gsutil tool is the standard tool for small- to medium-sized transfers (less than 1 TB) over a typical enterprise-
scale network, from a private data center to Google Cloud.
While your job is running, you might encounter errors or exceptions in your worker code. These errors generally
mean that the DoFns in your pipeline code have generated unhandled exceptions, which result in failed tasks in
your Dataflow job. Exceptions in user code (for example, your DoFn instances) are reported in the Dataflow
monitoring interface.
Cloud Storage as restricted API
When you create a Cloud Spanner instance, you must configure it as either regional (that is, all the resources are
contained within a single Google Cloud region) or multi-region (that is, the resources span more than one region).
You can change the instance configuration to multi-regional (or global) at any time.
Like gsutil, Storage Transfer Service for on-premises data enables transfers from network file system (NFS) storage
to Cloud Storage. Although gsutil can support small transfer sizes (up to 1 TB), Storage Transfer Service for on-
premises data is designed for large-scale transfers (up to petabytes of data, billions of files).
There are 2 parts, and they are related to each other: 1. Overfitting is fixed by decreasing the number of input features
(select only essential features). 2. Accuracy is improved by increasing the number of training examples.
Dialogflow is a natural language understanding platform that makes it easy to design and integrate a
conversational user interface into your mobile app, web application, device, bot, interactive voice response
system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your
product. Dialogflow can analyze multiple types of input from your customers, including text or audio inputs (like
from a phone or voice recording). It can also respond to your customers in a couple of ways, either through text or
with synthetic speech.
The source is a proprietary format. Dataflow wouldn't have a built-in template to read the file. You will have to
create something custom.
We exclude [C] as non-ACID and [D] for being invalid (location is configured at the dataset level, not the table level).
Then, let's focus on the "minimal human intervention in case of a failure" requirement in order to eliminate one answer
among [A] and [B]. Basically, we have to compare point-in-time recovery with high availability. It doesn't matter whether
it's about MySQL or PostgreSQL, since both databases support those features.
- Point-in-time recovery: logs are created automatically, but restoring an instance in case of failure requires manual steps.
- High availability: in case of failure, no human intervention is required: "If an HA-configured instance becomes
unresponsive, Cloud SQL automatically switches to serving data from the standby instance."
Shared VPC enables organizations to establish budgeting and access control boundaries at the project level while
allowing for secure and efficient communication using private IPs across those boundaries. In the Shared VPC
configuration, Cloud Composer can invoke services hosted in other Google Cloud projects in the same organization
without exposing services to the public internet. Shared VPC requires that you designate a host project to which
networks and subnetworks belong and a service project, which is attached to the host project.
In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased
performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever
possible reads only delta changes from the base tables to compute up-to-date results. Materialized views can be
queried directly or can be used by the BigQuery optimizer to process queries to the base tables. Queries that use
materialized views are generally faster and consume fewer resources than queries that retrieve the same data only
from the base tables. Materialized views can significantly improve the performance of workloads that have the
characteristic of common and repeated queries.