ExamTopics PDE Questions (Continued)
Company Overview -
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world
manage their resources and transport them to their final destination. The company has grown rapidly,
expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background -
The company started as a regional trucking company, and then expanded into other logistics markets. Because
they have not updated their infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real
time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache
Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and
shipments to determine how best to deploy their resources.
Solution Concept -
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their
loads
✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured
data, to determine how best to deploy resources, which markets to expand into. They also want to use
predictive analytics to learn earlier when a shipment will be delayed.
Business Requirements -
✑ Build a reliable and reproducible environment with scaled parity of production.
Technical Requirements -
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment
CEO Statement -
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and
efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they
are shipping.
CTO Statement -
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I
have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the analytics, and figuring out how to
implement the CFO's tracking technology.
CFO Statement -
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing
where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop
and Spark workloads that they cannot move to
BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should
they do?
• D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.
• D. Create identity and access management (IAM) roles on the appropriate columns, so only they appear
in a query.
• A. Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are
received.
• B. Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.
• D. Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
Company Overview -
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The
company has patents for innovative optical communications hardware. Based on these patents, they can
create many reliable, high-speed backbone links with inexpensive hardware.
Company Background -
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome
communications challenges in space. Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their
topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating many-to-many relationships between data consumers and providers in their system. After careful consideration, they decided the public cloud is the perfect environment to support their needs.
Solution Concept -
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than
50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology
definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to
meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements -
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed
Technical Requirements -
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately
100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in
telemetry flows and in production learning cycles.
CEO Statement -
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware
is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement -
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to
work as we iterate.
CFO Statement -
The project is too large for us to maintain the hardware and software required for the data and analysis. Also,
we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and
infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-
value problems instead of problems with our data pipelines.
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You
want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline
configuration setting should you update?
• A. The zone
• A. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show
only suboptimal links in a table.
• B. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates
the metric, and shows only suboptimal rows in a table in Google Sheets.
• C. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that
queries all rows, applies a function to derive the metric, and then renders results in a table using the
Google charts and visualization API.
#40.
You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its
data source. It is company policy to ensure employees can view only the data associated with their region, so
you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)
• C. Adjust the settings for each table to allow a related region-based security group view access.
• D. Adjust the settings for each view to allow a related region-based security group view access.
• E. Adjust the settings for each dataset to allow a related region-based security group view access.
#80.
Company Overview -
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The
company has patents for innovative optical communications hardware. Based on these patents, they can
create many reliable, high-speed backbone links with inexpensive hardware.
Company Background -
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome
communications challenges in space. Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their
topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating many-to-many relationships between data consumers and providers in their system. After careful consideration, they decided the public cloud is the perfect environment to support their needs.
Solution Concept -
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than
50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology
definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to
meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements -
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed
in an unpredictable, distributed telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed research workers
✑ Maintain isolated environments that support rapid iteration of their machine-learning models without
affecting their customers.
Technical Requirements -
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement -
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware
is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement -
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
CFO Statement -
The project is too large for us to maintain the hardware and software required for the data and analysis. Also,
we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and
infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-
value problems instead of problems with our data pipelines.
MJTelco is building a custom interface to share data. They have these requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan rows within a specific time range with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?
#81.
You need to compose visualizations for operations teams with the following requirements:
✑ Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every
minute)
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.
✑ Suboptimal links can be grouped and filtered by regional geography.
• A. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.
• B. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.
• C. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.
• D.
#82.
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of
Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data
table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-
grained analysis of each day's events. They also want to use streaming ingestion. What should you do?
• A.
• B.
• C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
• D. Create a table called tracking_table with a TIMESTAMP column to represent the day.
#83.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data
volume for their real-time inventory tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking
software. The system must be able to ingest data from a variety of global sources, process and query in real-
time, and store the data reliably. Which combination of GCP products should you choose?
• A.
• B.
• C.
• D.
• E.
#221.
BigQuery Omni allows you to query data in Azure and AWS object stores directly without physically moving it into BigQuery, reducing data transfer costs and delays. BigLake tables provide a unified view of both BigQuery tables and external object-storage files, enabling seamless querying across multi-cloud data.
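As a rough illustration (the project, dataset, connection, and bucket names below are hypothetical, not from the question), a BigLake table can be defined with a CREATE EXTERNAL TABLE statement that references a Cloud resource connection, run here through the BigQuery Python client:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# DDL for a BigLake table: an external table backed by object storage but
# governed through a BigQuery (Cloud resource) connection.
ddl = """
CREATE EXTERNAL TABLE `my-project.analytics.shipments_biglake`
WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/shipments/*.parquet']
)
"""

client.query(ddl).result()  # waits for the DDL job to finish
```

With BigQuery Omni, the same pattern points the connection and URIs at data in an AWS or Azure region instead of Cloud Storage.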
#222.
- Dataprep is a serverless, no-code data preparation tool that allows users to visually explore, cleanse, and
prepare data for analysis.
- It's designed for business analysts, data scientists, and others who want to work with data without writing
code.
- Dataprep can directly access and transform data in Cloud Storage, making it a suitable choice for a team that
prefers a low-code, user-friendly solution.
#226.
Setting up a perimeter around project A is future-proof; the question asks to "ensure that project B and any future project cannot access data in the project A topic", and IAM alone is not future-proof. Reference:
https://cloud.google.com/vpc-service-controls/docs/overview#isolate
P.S.: VPC Service Controls is not the same thing as a VPC; it is a security layer on top of a VPC and it should be used together with IAM, not one or the other (https://cloud.google.com/vpc-service-
#228.
Another way to replay messages that have been acknowledged is to seek to a timestamp. To seek to a
timestamp, you must first configure the subscription to retain acknowledged messages using retain-acked-
messages. If retain-acked-messages is set, Pub/Sub retains acknowledged messages for 7 days. You only need
to do this step if you intend to seek to a timestamp, not to a snapshot.
https://cloud.google.com/pubsub/docs/replay-message
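A minimal sketch of that flow with the Pub/Sub Python client (the project and subscription names are placeholders): first enable retain-acked-messages on the subscription, then seek to a timestamp to replay.

```python
from datetime import datetime, timedelta, timezone

from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# 1. Retain acknowledged messages (Pub/Sub keeps them for up to 7 days).
subscriber.update_subscription(
    request={
        "subscription": {"name": subscription_path, "retain_acked_messages": True},
        "update_mask": field_mask_pb2.FieldMask(paths=["retain_acked_messages"]),
    }
)

# 2. Seek back one hour: retained messages (including acked ones) published
#    after this point become available to subscribers again.
replay_from = datetime.now(timezone.utc) - timedelta(hours=1)
subscriber.seek(request={"subscription": subscription_path, "time": replay_from})
```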
Gemini suggested C. Here's why it's the best of the limited choices:
- Calculates price_per_sqft: it includes the calculation for the target variable your model needs.
- Handles nulls: it uses IFNULL(feature1, 0) to replace nulls in feature1 with 0, similar to COALESCE.
- Most comprehensive: while it excludes the original price, square_feet, and feature1 columns, it still retains any other columns that might be present in the training_data table.
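For illustration only (the actual answer options are not reproduced above; the table and column names come from the explanation), a query along the lines option C describes might look like this when run through the BigQuery Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Derive the label, null-fill feature1, and drop the raw columns the model
# should not see, while keeping everything else in training_data.
query = """
SELECT
  * EXCEPT (price, square_feet, feature1),
  SAFE_DIVIDE(price, square_feet) AS price_per_sqft,
  IFNULL(feature1, 0) AS feature1
FROM
  `my-project.my_dataset.training_data`
"""

for row in client.query(query).result():
    print(dict(row))
```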
Cloud Data Loss Prevention (DLP) is the best tool for identifying sensitive information such as street
addresses in BigQuery.
A deep inspection job scans all tables in the dataset to detect occurrences of street addresses, using
predefined infoTypes like STREET_ADDRESS.
This method is scalable and works even if the address format varies, unlike simple regex-based queries.
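A hedged sketch of such an inspection job using the DLP Python client, for a single table (the project, dataset, and table IDs are hypothetical; scanning every table in a dataset would mean creating one such job per table):

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

inspect_job = {
    # Deep inspection of one BigQuery table for street addresses.
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "customer_data",
                "table_id": "orders",
            }
        }
    },
    "inspect_config": {
        "info_types": [{"name": "STREET_ADDRESS"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    # Persist findings to a results table for later review.
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_results",
                        "table_id": "street_address_findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(f"Started inspection job: {job.name}")
```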
- The table structure shows that the vCPU data is stored in a nested field within the components column.
- Use the UNNEST operator to flatten the nested field and apply the filter.
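As an illustrative sketch (the real schema is not shown, so the table and field names here are assumptions), a query that flattens a repeated components record and filters on vCPUs could look like:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed schema: each machine row has a REPEATED RECORD column `components`
# with fields such as `name` and `vcpus`; UNNEST flattens it for filtering.
query = """
SELECT
  machine_id,
  component.name AS component_name,
  component.vcpus
FROM
  `my-project.inventory.machines`,
  UNNEST(components) AS component
WHERE
  component.vcpus >= 8
"""

for row in client.query(query).result():
    print(row.machine_id, row.component_name, row.vcpus)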
Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong choice for de-identifying PII like email
addresses. FPE maintains the format of the data and ensures that the same input results in the same encrypted
output consistently. This means the email fields in both datasets can be encrypted to the same value, allowing
for accurate joins in BigQuery while keeping the actual email addresses hidden.
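A rough sketch of that transformation with the DLP Python client; the project, key ring, wrapped key, and alphabet below are placeholders, and the alphabet must cover every character that can appear in the matched values:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

# Deterministic, format-preserving transform: the same email always encrypts
# to the same token, so the field stays joinable across datasets as long as
# both datasets are de-identified with the same KMS-wrapped key.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": b"<wrapped-key-bytes>",
                                "crypto_key_name": (
                                    "projects/my-project/locations/global/"
                                    "keyRings/dlp/cryptoKeys/fpe-key"
                                ),
                            }
                        },
                        "custom_alphabet": "abcdefghijklmnopqrstuvwxyz0123456789@.-_",
                    }
                },
            }
        ]
    }
}

item = {"value": "Contact jane.doe@example.com about the delayed shipment."}
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)  # email replaced by a same-format, deterministic token
```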
- Private Google Access for services allows VM instances with only internal IP addresses in a VPC network or
on-premises networks (via Cloud VPN or Cloud Interconnect) to reach Google APIs and services.
- When you launch a Dataflow job, you can specify that it should use worker instances without external IP
addresses if Private Google Access is enabled on the subnetwork where these instances are launched.
- This way, your Dataflow workers will be able to access Cloud Storage and BigQuery without violating the
organizational constraint of no external IPs.
ref - https://cloud.google.com/dataflow/docs/guides/routes-firewall
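A minimal Apache Beam (Python) sketch of launching such a job; the project, bucket, and subnetwork names are hypothetical, and the subnetwork is assumed to already have Private Google Access enabled:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumes Private Google Access is already enabled on the subnetwork, e.g.:
#   gcloud compute networks subnets update private-subnet \
#       --region=us-central1 --enable-private-ip-google-access
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    subnetwork="regions/us-central1/subnetworks/private-subnet",
    use_public_ips=False,  # workers get internal IPs only
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/results")
    )
```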
----------------------------------
#256.
Correct option is B
✔ Option B is the best choice because:
- Cloud EKM allows you to use encryption keys managed in external key management systems, including on-
premises HSMs, while using Google Cloud services.
- This means that the key material remains in your control and environment, and Google Cloud services use it
via the Cloud EKM integration.
- This approach aligns with the need to generate and store encryption material only on your on-premises HSM
and is the correct way to integrate such keys with BigQuery.
======
Why not Option C - Cloud HSM is a fully managed service by Google Cloud that provides HSMs for your
cryptographic needs. However, it's a cloud-based solution, and the keys generated or managed in Cloud HSM
are not stored on-premises. This option doesn't align with the requirement to use only on-premises HSM for key
storage.
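For illustration (the project, dataset, and key names are hypothetical), once a Cloud KMS key of protection level EXTERNAL has been created via Cloud EKM, a BigQuery table can reference it as its customer-managed encryption key through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# A Cloud KMS key of type EXTERNAL (Cloud EKM): the key material stays in the
# on-premises HSM; BigQuery only references it as a CMEK key.
ekm_key = (
    "projects/my-project/locations/us/keyRings/ekm-ring/cryptoKeys/onprem-hsm-key"
)

table = bigquery.Table("my-project.secure_dataset.tracking_table")
table.schema = [
    bigquery.SchemaField("event_time", "TIMESTAMP"),
    bigquery.SchemaField("payload", "STRING"),
]
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name=ekm_key
)
client.create_table(table)
```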
#272.
Tug of war between C & D,