ExamTopics PDE Questions Continued

Flowlogistic is a logistics provider facing challenges with outdated infrastructure and the need for real-time shipment tracking and analytics. They aim to migrate to the cloud to enhance their operations and use their proprietary technology for better resource deployment and predictive analytics. MJTelco, a telecom startup, seeks to build a scalable data infrastructure for real-time analysis and machine learning, emphasizing security and efficient data transport as they expand their network capabilities.


#34.

Flowlogistic Case Study -

Company Overview -
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world
manage their resources and transport them to their final destination. The company has grown rapidly,
expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background -
The company started as a regional trucking company, and then expanded into other logistics markets. Because
they have not updated their infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real
time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache
Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and
shipments to determine how best to deploy their resources.

Solution Concept -
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their
loads
✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured
data, to determine how best to deploy resources, and which markets to expand into. They also want to use
predictive analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment -


Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
3 physical servers
- Cassandra - metadata, tracking messages
10 Kafka servers - tracking message aggregation and batch insert
✑ Application servers - customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL Server storage
- Network-attached storage (NAS) - image storage, logs, backups
✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements -
✑ Build a reliable and reproducible environment with scaled parity of production

✑ Aggregate data in a centralized Data Lake for analysis

✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements -
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment

CEO Statement -
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and
efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they
are shipping.

CTO Statement -
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I
have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the analytics, and figuring out how to
implement the CFO's tracking technology.

CFO Statement -
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing
where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop
and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that
is common to both workloads. What should they do?

• A. Store the common data in BigQuery as partitioned tables.

• B. Store the common data in BigQuery and expose authorized views.

• C. Store the common data encoded as Avro in Google Cloud Storage.

• D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

#35.
Continuation of question 34.
You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking
software. The system must be able to ingest data from a variety of global sources, process and query in real-
time, and store the data reliably. Which combination of GCP products should you choose?

• A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

• B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD

• C. Cloud Pub/Sub, Cloud SQL, and Cloud Storage

• D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

#36.
Flowlogistic's CEO wants to gain rapid insight into their customer base so his sales team can be better
informed in the field. This team is not very technical, so they've purchased a visualization tool to simplify the
creation of BigQuery reports. However, they've been overwhelmed by all the data in the table, and are spending
a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-
effective way. What should you do?

• A. Export the data into a Google Sheet for visualization.

• B. Create an additional table with only the necessary columns.

• C. Create a view on the table to present to the visualization tool.

• D. Create identity and access management (IAM) roles on the appropriate columns, so only they appear
in a query.
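
If the route taken is a view over the table (option C), the sketch below shows roughly what that could look like with the BigQuery Python client. Project, dataset, table, and column names are hypothetical; the point is that the view exposes only the columns the sales team needs, without copying data.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project/dataset/table names for illustration.
view = bigquery.Table("my-project.sales_reports.customer_summary_view")
view.view_query = """
    SELECT customer_id, region, total_orders, last_order_date
    FROM `my-project.sales.customer_master`
"""
view = client.create_table(view)  # creates a logical view, no data is copied
print(f"Created view {view.full_table_id}")
```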

#37.
Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-
tracking messages, which will now go to a single
Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the
messages for real-time reporting and store them in
Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?

• A. Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are
received.

• B. Attach the timestamp and Package ID on the outbound message from each publisher device as they
are sent to Cloud Pub/Sub.

• C. Use the NOW() function in BigQuery to record the event's time.

• D. Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
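
If the approach chosen is to attach the event timestamp and package ID at the publisher (option B), a minimal sketch with the Pub/Sub Python client follows; the topic name, attribute names, and payload are hypothetical.

```python
import json
import time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "package-tracking")

payload = json.dumps({"lat": 52.52, "lng": 13.40}).encode("utf-8")

# Attach the event timestamp and package ID as message attributes at the
# publisher, so downstream consumers and BigQuery keep the true event time
# regardless of when Pub/Sub receives or delivers the message.
future = publisher.publish(
    topic_path,
    payload,
    event_timestamp=str(int(time.time() * 1000)),
    package_id="PKG-000123",
)
print(future.result())  # message ID once Pub/Sub has accepted the message
```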

#38.
MJTelco Case Study -

Company Overview -
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The
company has patents for innovative optical communications hardware. Based on these patents, they can
create many reliable, high-speed backbone links with inexpensive hardware.

Company Background -
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome
communications challenges in space. Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their
topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating many-to-many relationships
between data consumers and providers in their system. After careful consideration, they decided public cloud is
the perfect environment to support their needs.

Solution Concept -
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than
50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology
definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to
meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements -
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed research workers
✑ Maintain isolated environments that support rapid iteration of their machine-learning models without
affecting their customers.

Technical Requirements -
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2 years of data, storing approximately 100M records/day
✑ Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in
telemetry flows and in production learning cycles.

CEO Statement -
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware
is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement -
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to
work as we iterate.

CFO Statement -
The project is too large for us to maintain the hardware and software required for the data and analysis. Also,
we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and
infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-
value problems instead of problems with our data pipelines.

MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You
want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline
configuration setting should you update?

• A. The zone

• B. The number of workers

• C. The disk size per worker

• D. The maximum number of workers
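
The setting that raises the autoscaling ceiling is the maximum number of workers (option D). A minimal sketch, assuming the Apache Beam Python SDK and hypothetical project/region/bucket values, of passing that option when launching the pipeline:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project/region/bucket values for illustration; submitting
# this to Dataflow would also require valid credentials.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=100,  # raises the ceiling Dataflow can scale up to
)

with beam.Pipeline(options=options) as p:
    (p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2))
```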

#39.
You need to compose visualizations for operations teams with the following requirements:
✑ The report must include telemetry data from all 50,000 installations for the most recent 6 weeks (sampling
once every minute).
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.
✑ Suboptimal links can be grouped and filtered by regional geography.
✑ User response time to load the report must be <5 seconds.
Which approach meets the requirements?

• A. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show
only suboptimal links in a table.

• B. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates
the metric, and shows only suboptimal rows in a table in Google Sheets.

• C. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that
queries all rows, applies a function to derive the metric, and then renders results in a table using the
Google charts and visualization API.

• D. Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to
your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a
table.

#40.
You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its
data source. It is company policy to ensure employees can view only the data associated with their region, so
you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)

• A. Ensure all the tables are included in global dataset.

• B. Ensure each table is included in a dataset for a region.

• C. Adjust the settings for each table to allow a related region-based security group view access.

• D. Adjust the settings for each view to allow a related region-based security group view access.

• E. Adjust the settings for each dataset to allow a related region-based security group view access.

#41.
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2
years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the
device and a data record. The most common query is for all the data for a given device for a given day.
Which schema should you use?

• A. Rowkey: date#device_id Column data: data_point

• B. Rowkey: date Column data: device_id, data_point

• C. Rowkey: device_id Column data: date, data_point

• D. Rowkey: data_point Column data: device_id, date

• E. Rowkey: date#data_point Column data: device_id

Ans: None of the options as written; the row key should be device_id#date, with column data: data_point.
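
A minimal sketch of what a device_id#date row key looks like in practice, using the Python Bigtable client; instance, table, column family, and key values are hypothetical, and the "data" column family is assumed to exist.

```python
from google.cloud import bigtable

# Hypothetical instance/table names for illustration.
client = bigtable.Client(project="my-project", admin=False)
table = client.instance("telemetry-instance").table("device_records")

# Row key: device_id#date, so all of a device's records for one day are
# contiguous and the common single-day query becomes a cheap prefix scan.
row_key = b"device-4711#20240115"
row = table.direct_row(row_key)
row.set_cell("data", "data_point", b'{"signal_db": -42.5}')
row.commit()

# Most common query: all data for a given device on a given day.
rows = table.read_rows(
    start_key=b"device-4711#20240115",
    end_key=b"device-4711#20240115\xff",
)
for r in rows:
    print(r.row_key)
```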

#80.

MJTelco Case Study - (same case study as in #38 above)

MJTelco is building a custom interface to share data. They have these requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan specific time range rows with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?

• A. Cloud Datastore and Cloud Bigtable

• B. Cloud Bigtable and Cloud SQL

• C. BigQuery and Cloud Bigtable

• D. BigQuery and Cloud Storage

#81.
You need to compose visualizations for operations teams with the following requirements:
✑ Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every
minute)
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.
✑ Suboptimal links can be grouped and filtered by regional geography.
✑ User response time to load the report must be <5 seconds.


You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see
multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest
data without any changes to your visualizations. You want to avoid creating and updating new visualizations
each month. What should you do?

• A. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.

• B. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

• C. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.

• D. Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

#82.
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of
Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data
table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-
grained analysis of each day's events. They also want to use streaming ingestion. What should you do?

• A. Create a table called tracking_table and include a DATE column.

• B. Create a partitioned table called tracking_table and include a TIMESTAMP column.

• C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

• D. Create a table called tracking_table with a TIMESTAMP column to represent the day.

#83.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data
volume for their real-time inventory tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking
software. The system must be able to ingest data from a variety of global sources, process and query in real-
time, and store the data reliably. Which combination of GCP products should you choose?

• A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

• B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD

• C. Cloud Pub/Sub, Cloud SQL, and Cloud Storage

• D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

• E. Cloud Dataflow, Cloud SQL, and Cloud Storage

Stackdriver Monitoring = Cloud Monitoring

#214. Not sure if B is correct?

#220. NAT gateway is not secure.

#221. BigQuery Omni allows you to query data in Azure and AWS object stores directly without physically moving it to BigQuery, reducing data transfer costs and delays. BigLake tables provide a unified view of both BigQuery tables and external object storage files, enabling seamless querying across multi-cloud data.

#222.
- Dataprep is a serverless, no-code data preparation tool that allows users to visually explore, cleanse, and
prepare data for analysis.
- It's designed for business analysts, data scientists, and others who want to work with data without writing
code.
- Dataprep can directly access and transform data in Cloud Storage, making it a suitable choice for a team that
prefers a low-code, user-friendly solution.

#223.
- Dataform provides a feature called "assertions," which are essentially SQL-based tests that you can define to
verify the quality of your data.
- Assertions in Dataform are a built-in way to perform data quality checks, including checking for uniqueness
and null values in your tables.

#225. Answer C is correct.

#226.
Setting up a perimeter around project A is future-proof; the question asks to "ensure that project B and any future project cannot access data in the project A topic", and IAM is not future-proof. Reference:
https://cloud.google.com/vpc-service-controls/docs/overview#isolate
P.S.: VPC Service Controls is not the same thing as a VPC; it is a security layer on top of a VPC and should be used together with IAM, not one or the other (https://cloud.google.com/vpc-service-controls/docs/overview#how-vpc-service-controls-works).

#228.
Another way to replay messages that have been acknowledged is to seek to a timestamp. To seek to a
timestamp, you must first configure the subscription to retain acknowledged messages using retain-acked-
messages. If retain-acked-messages is set, Pub/Sub retains acknowledged messages for 7 days. You only need
to do this step if you intend to seek to a timestamp, not to a snapshot.
https://cloud.google.com/pubsub/docs/replay-message
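
A rough sketch of that replay flow with the Pub/Sub Python client (project and subscription names are hypothetical): first enable retain-acked-messages on the subscription, then seek it back to an earlier timestamp.

```python
from datetime import datetime, timedelta, timezone

from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2, timestamp_pb2

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project/subscription names.
sub_path = subscriber.subscription_path("my-project", "tracking-sub")

# 1. The subscription must retain acknowledged messages before seek-to-time
#    can replay them; Pub/Sub then keeps acked messages for up to 7 days.
subscription = pubsub_v1.types.Subscription(
    name=sub_path, retain_acked_messages=True
)
update_mask = field_mask_pb2.FieldMask(paths=["retain_acked_messages"])
subscriber.update_subscription(
    request={"subscription": subscription, "update_mask": update_mask}
)

# 2. Seek back 60 minutes; already-acknowledged messages newer than that
#    point in time are redelivered to the subscription.
target = datetime.now(timezone.utc) - timedelta(minutes=60)
ts = timestamp_pb2.Timestamp()
ts.FromDatetime(target)
subscriber.seek(request={"subscription": sub_path, "time": ts})
```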

#229.

#234.
D.
1. Store the historical data in BigQuery for analytics.
2. In a Cloud SQL table, store the last state of the product after every product change.
3. Serve the last state data directly from Cloud SQL to the API.
This approach leverages BigQuery's scalability and efficiency for handling large datasets for analytics. BigQuery
is well-suited for managing the 10 PB of historical product data. Meanwhile, Cloud SQL provides the necessary
performance to handle the API queries with the required low latency. By storing the latest state of each product
in Cloud SQL, you can efficiently handle the high QPS with sub-second latency, which is crucial for the API's
performance. This combination of BigQuery and Cloud SQL offers a balanced solution for both the large-scale
analytics and the high-performance API needs.

#235.
Option C is correct.
This also explains why it's not D: the requirement is a maintainable workflow that processes hundreds of tables and provides the freshest data to your end users. How is creating a DAG for each of the hundreds of tables maintainable?

- AEAD cryptographic functions in BigQuery allow for encryption and decryption of data at the column level.
- You can encrypt specific data fields using a unique key per user and manage these keys outside of BigQuery
(for example, in your application or using a key management system).
- By "deleting" or revoking access to the key for a specific user, you effectively make their data unreadable,
achieving crypto-deletion.
- This method provides fine-grained encryption control but requires careful key management and integration
with your applications.
https://cloud.google.com/bigquery/docs/aead-encryption-concepts
https://cloud.google.com/bigquery/docs/reference/standard-sql/aead_encryption_functions
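
A hedged sketch of that column-level pattern, run through the BigQuery Python client; the table, column, and keyset handling below are hypothetical, and in practice each user's keyset would live in your own application or key management system, passed in only at query time.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Encrypt a column with a per-user keyset held outside BigQuery and passed
# in as a query parameter. Destroying that keyset later crypto-deletes the
# user's data, since the ciphertext can no longer be decrypted.
sql = """
SELECT
  user_id,
  AEAD.ENCRYPT(FROM_BASE64(@keyset_b64), email, user_id) AS email_enc
FROM `my-project.crm.users`   -- hypothetical table
"""
job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "keyset_b64", "STRING", "<base64-encoded keyset>"
            )
        ]
    ),
)
for row in job.result():
    print(row.user_id)

# Decryption follows the same shape with
# AEAD.DECRYPT_STRING(FROM_BASE64(@keyset_b64), email_enc, user_id).
```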

1. Use a dual-region Cloud Storage bucket with turbo replication enabled.
2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
3. Seek the subscription back in time by 60 minutes to recover the acknowledged messages.
4. Start the Dataflow job in a secondary region.
An RPO of 15 minutes is guaranteed when turbo replication is used:
https://cloud.google.com/storage/docs/availability-durability

Gemini suggested C. Here's why it's the best of the limited choices:
- Calculates price_per_sqft: it includes the calculation for the target variable your model needs.
- Handles nulls: it uses IFNULL(feature1, 0) to replace nulls in feature1 with 0, similar to COALESCE.
- Most comprehensive: while it excludes the original price, square_feet, and feature1 columns, it still retains any other columns that might be present in the training_data table.
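
The actual answer query is not reproduced in this dump, so the following is only a reconstruction of the pattern the commentary describes (derive the label, null-fill feature1, drop the raw columns, keep everything else); the project and dataset names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical reconstruction of the described query shape.
sql = """
SELECT
  * EXCEPT (price, square_feet, feature1),
  IFNULL(feature1, 0) AS feature1,        -- null-fill the sparse feature
  price / square_feet AS price_per_sqft   -- derived training label
FROM `my-project.housing.training_data`
"""
training_rows = client.query(sql).result()
```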

Option C:
You've just concluded processing the data, ending up with clean and prepared data for the model. Now you need to decide how to split the data for testing and training. Only afterwards can you train the model, evaluate it, fine-tune it and, eventually, predict with it.

Cloud Data Loss Prevention (DLP) is the best tool for identifying sensitive information such as street
addresses in BigQuery.

A deep inspection job scans all tables in the dataset to detect occurrences of street addresses, using
predefined infoTypes like STREET_ADDRESS.

This method is scalable and works even if the address format varies, unlike simple regex-based queries.
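
As a simplified illustration of the STREET_ADDRESS infoType with the DLP Python client, the sketch below inspects a single text value; a real dataset-wide scan would instead create a DLP inspection job pointed at the BigQuery tables. The project ID and sample text are hypothetical.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # hypothetical project

# Inspect one value for street addresses; a full scan would use a DLP
# inspection job with a BigQuery storage config instead of inspect_content.
response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [{"name": "STREET_ADDRESS"}],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        },
        "item": {"value": "Ship to 1600 Amphitheatre Parkway, Mountain View, CA"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```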

A data mesh is designed to decentralize data ownership, allowing domain teams to manage their own data,
rather than relying on a central data platform team (which was a bottleneck).
✔ Creating one lake per domain (airlines, hotels, ride-hailing) allows domain-specific governance and
autonomy.
✔ Creating separate zones for analytics and data science teams ensures logical data separation while
keeping collaboration within the same domain.
✔ Delegating management of data assets to each domain eliminates the bottleneck, ensuring faster
insights and pipeline updates.

- The table structure shows that the vCPU data is stored in a nested field within the components column.
- Use the UNNEST operator to flatten the nested field and apply the filter.
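
The machine-inventory table itself is not shown in this dump, so the query below is only a sketch of the UNNEST pattern with hypothetical column names (components, name, vcpus).

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and columns; UNNEST flattens the repeated/nested
# components field so the WHERE clause can filter on its subfields.
sql = """
SELECT
  m.machine_id,
  c.vcpus
FROM `my-project.inventory.machines` AS m,
     UNNEST(m.components) AS c
WHERE c.name = 'vCPU'
  AND c.vcpus >= 16
"""
for row in client.query(sql).result():
    print(row.machine_id, row.vcpus)
```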

- Autoclass automatically moves objects between storage classes without impacting performance or
availability, nor incurring retrieval costs.
- It continuously optimizes storage costs based on access patterns without the need to set specific lifecycle
management policies.
https://cloud.google.com/storage/docs/autoclass
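
A minimal sketch of enabling Autoclass at bucket creation with the Cloud Storage Python client. The bucket name is hypothetical, and the autoclass_enabled property name is an assumption based on recent google-cloud-storage releases, so verify it against the library version in use.

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical bucket name; autoclass_enabled is assumed to be the client
# property for the Autoclass setting (check your library version).
bucket = client.bucket("analytics-raw-landing")
bucket.autoclass_enabled = True  # let GCS move objects between classes
new_bucket = client.create_bucket(bucket, location="US")
print(new_bucket.autoclass_enabled)
```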

Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong choice for de-identifying PII like email
addresses. FPE maintains the format of the data and ensures that the same input results in the same encrypted
output consistently. This means the email fields in both datasets can be encrypted to the same value, allowing
for accurate joins in BigQuery while keeping the actual email addresses hidden.

- Setting a retention policy on a Cloud Storage bucket prevents objects from being deleted for the duration of
the retention period.
- Locking the policy makes it immutable, meaning that the retention period cannot be reduced or removed,
thus ensuring that the documents cannot be deleted or overwritten until the retention period expires.
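
A short sketch of that two-step flow (set the retention period, then lock it) with the Cloud Storage Python client; the bucket name and period are hypothetical, and locking is irreversible.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("compliance-documents")  # hypothetical bucket

# Step 1: set a retention period (here, 3 years). Objects cannot be deleted
# or overwritten until they are older than this.
bucket.retention_period = 3 * 365 * 24 * 60 * 60
bucket.patch()

# Step 2: lock the policy. After this the period can no longer be reduced
# or removed; this is permanent, so it is a deliberate final step.
bucket.reload()  # refresh metageneration before locking
bucket.lock_retention_policy()
```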

- Private Google Access for services allows VM instances with only internal IP addresses in a VPC network or
on-premises networks (via Cloud VPN or Cloud Interconnect) to reach Google APIs and services.
- When you launch a Dataflow job, you can specify that it should use worker instances without external IP
addresses if Private Google Access is enabled on the subnetwork where these instances are launched.
- This way, your Dataflow workers will be able to access Cloud Storage and BigQuery without violating the
organizational constraint of no external IPs.
ref - https://cloud.google.com/dataflow/docs/guides/routes-firewall
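
A hedged sketch of launching such a pipeline with the Beam Python SDK: public IPs are disabled and a subnetwork with Private Google Access is specified. Project, region, subnetwork, and bucket values are hypothetical.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project/network values; the key settings are disabling
# public IPs and pointing workers at a subnetwork that has Private
# Google Access enabled, so they can still reach GCS and BigQuery.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/my-project/"
        "regions/us-central1/subnetworks/private-subnet"
    ),
    use_public_ips=False,  # workers get internal IP addresses only
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | beam.io.WriteToText("gs://my-bucket/output/out"))
```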

- Datastream is a serverless and easy-to-use change data capture (CDC) and replication service.
- You would create a Datastream service that sources from your Oracle database and targets BigQuery, with
private connectivity configuration to the same VPC.
- This option is designed to minimize the need to manage infrastructure and is a fully managed service.

----------------------------------
#256.
Correct option is B
✔ Option B is the best choice because it:

• Uses Cloud Storage notifications to reactively trigger the DAG.

• Uses a Cloud Function to invoke the Cloud Composer API securely.

• Uses Serverless VPC Access to reach Cloud Composer in a private network.

- Cloud Data Fusion is a fully managed, code-free, GUI-based data integration service that allows you to
visually connect, transform, and move data between various sources and sinks.
- It supports various file formats and can write to Cloud Storage.
- You can configure it to use Customer-Managed Encryption Keys (CMEK) for the buckets where it writes data.

#261.

✔ TPT (Teradata Parallel Transporter) is built for high-performance data transfers.
✔ DTS handles the migration efficiently without requiring extra storage.
✔ Google Cloud specifically recommends using TPT with DTS for migrating Teradata to BigQuery.
✔ It requires the least programming and minimizes infrastructure management.
Even though Option A (JDBC with FastExport) may have received more votes online, it is not the best choice for this specific scenario.
✔ Option C (BigQuery DTS with TPT) is the best choice for large-scale, efficient Teradata migrations.

- Cloud EKM allows you to use encryption keys managed in external key management systems, including on-
premises HSMs, while using Google Cloud services.
- This means that the key material remains in your control and environment, and Google Cloud services use it
via the Cloud EKM integration.
- This approach aligns with the need to generate and store encryption material only on your on-premises HSM
and is the correct way to integrate such keys with BigQuery.
======
Why not Option C - Cloud HSM is a fully managed service by Google Cloud that provides HSMs for your
cryptographic needs. However, it's a cloud-based solution, and the keys generated or managed in Cloud HSM
are not stored on-premises. This option doesn't align with the requirement to use only on-premises HSM for key
storage.

✔ Option A is correct because using Reshuffle forces Dataflow to separate execution steps, helping
isolate bottlenecks and improve performance.
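
A minimal sketch of the technique: beam.Reshuffle() placed between a fan-out and an expensive step breaks fusion so the two stages can be scaled and profiled separately. The transforms around it are placeholders.

```python
import apache_beam as beam

def expensive_transform(record):
    # Placeholder for the slow step whose cost you want to isolate.
    return record

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(range(1000))
     | "Fanout" >> beam.FlatMap(lambda x: [x] * 100)
     # Reshuffle redistributes elements across workers and prevents the
     # runner from fusing the fan-out with the expensive step, so the
     # expensive step shows up as its own parallelizable stage.
     | "BreakFusion" >> beam.Reshuffle()
     | "Process" >> beam.Map(expensive_transform))
```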

- Lowest RPO: Time travel offers point-in-time recovery for the past seven days by default, providing the
shortest possible recovery point objective (RPO) among the given options. You can recover data to any state
within that window.
- No Additional Costs: Time travel is a built-in feature of BigQuery, incurring no extra storage or operational
costs.
- Managed Service: BigQuery handles time travel automatically, eliminating manual backup and restore
processes.
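
A small sketch of recovering data via time travel with the BigQuery Python client; the table names and the two-hour offset are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Restore-by-query using time travel: read the table as it was 2 hours ago
# and write that snapshot to a recovery table (names are hypothetical).
sql = """
CREATE OR REPLACE TABLE `my-project.sales.orders_recovered` AS
SELECT *
FROM `my-project.sales.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
"""
client.query(sql).result()
```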

- It utilizes Data Catalog's native support for both BigQuery datasets and Pub/Sub topics.
- For PostgreSQL tables running on a Compute Engine instance, you'd use Data Catalog APIs to create custom
entries, as Data Catalog does not automatically discover external databases like PostgreSQL.
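
A hedged sketch of creating such a custom entry with the Data Catalog Python client; the project, location, entry group, and table names are hypothetical, and the exact field names should be checked against the google-cloud-datacatalog version in use.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
# Hypothetical project/location values.
parent = "projects/my-project/locations/us-central1"

# Custom entries live in an entry group you create yourself.
entry_group = client.create_entry_group(
    parent=parent,
    entry_group_id="onprem_postgresql",
    entry_group=datacatalog_v1.EntryGroup(display_name="PostgreSQL on GCE"),
)

# user_specified_system/type mark this as a custom (non-Google-managed) entry.
entry = datacatalog_v1.Entry(
    display_name="orders",
    user_specified_system="postgresql",
    user_specified_type="table",
    description="Orders table on the Compute Engine PostgreSQL instance",
)
created = client.create_entry(
    parent=entry_group.name, entry_id="orders", entry=entry
)
print(created.name)
```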

- Detailed investigation of logs and jobs: checking for duplicate rows targets the potential immediate cause of the issue.
- Checking the BigQuery audit logs helps identify which jobs might be contributing to the increased data volume.
- Using Cloud Monitoring to correlate job starts with pipeline versions helps identify if a specific version of the pipeline is responsible.
- Managing multiple versions of pipelines ensures that only the intended version is active, addressing any versioning errors that might have occurred during deployment.

#272.
Tug of war between C & D.

https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#watermarks
A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If
the watermark has progressed past the end of the window and new data arrives with a timestamp within the
window, the data is considered late data. For more information, see Watermarks and late data in the Apache
Beam documentation.
Dataflow tracks watermarks because of the following reasons:
Data is not guaranteed to arrive in time order or at predictable intervals. Data events are not guaranteed to
appear in pipelines in the same order that they were generated.
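
A short Beam Python sketch of how a pipeline declares what happens to late data relative to the watermark: the window fires when the watermark passes its end, late elements are still accepted for a bounded allowed_lateness, and results accumulate across firings. The source and values are placeholders.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterCount,
    AfterWatermark,
)
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as p:
    (p
     | beam.Create([("pkg-1", 1)])  # stand-in for a streaming source
     | beam.WindowInto(
         FixedWindows(60),
         # Fire when the watermark passes the end of the window, then once
         # more per late element that arrives within allowed_lateness.
         trigger=AfterWatermark(late=AfterCount(1)),
         allowed_lateness=Duration(seconds=300),
         accumulation_mode=AccumulationMode.ACCUMULATING)
     | beam.CombinePerKey(sum))
```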

Minimize cost. https://cloud.google.com/alloydb?hl=en
AlloyDB offers superior performance, 4x faster than standard PostgreSQL for transactional workloads. That
does not come without cost.
