Ready to unlock the secrets hidden in your product data? This tutorial demonstrates how BigQuery Data Preparation can clean and transform your raw data into actionable business intelligence, using a realistic example from the Fashion and Beauty industry.
Ready to get started? First, ensure you have the roles and permissions required for BigQuery Data Preparation, listed below.
In this tutorial, you will open datasets from the Analytics Hub in data preparation, learn how to prepare the data using Gemini, join the data, and load the results into a new BigQuery table.
Let's jump right in! Our first step is connecting to BigQuery and subscribing to the dataset we’ll use for this tutorial. Here's how.
We'll be using the "bq data preparation demo" dataset for this tutorial. Follow these steps to find and subscribe to it:
1. Open BigQuery Studio: In the Google Cloud Console, navigate to BigQuery. In BigQuery Studio, click the Add button.
2. Select Analytics Hub: In the "Additional sources" section of the new window, choose "Analytics Hub" (in the bottom left corner).
3. Search for the Listing: Within the Analytics Hub interface, use the search bar to find our dataset. Type 'bq data preparation demo' and press Enter.
4. Subscribe to the Listing: Select the "bq data preparation demo" listing from the search results.
5. On the listing details page, click the "Subscribe" button.
6. Review any confirmation dialogs and update the project/dataset if needed.
7. Access the Dataset in BigQuery: Once you've successfully subscribed, the datasets within the listing will be linked to your BigQuery project.
8. Return to the main BigQuery console and open one of the linked tables with Data Preparation.
This will open the table in the Data Preparation interface, ready for you to begin transforming your data.
As you can see in the data preview below, we have some data challenges we'll tackle: the Price column contains both the amount and currency, making analysis difficult. Similarly, the Product column mixes the product name and category, separated by a pipe symbol (|).
Let's tackle the Price column. As we've seen, it contains both the currency and the amount. Our goal is to separate these into two distinct columns: Currency and Amount.
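To make the goal concrete, here is a minimal Python sketch of the kind of split we're after. The sample value format ('USD 19.99') and the regular expression are illustrative assumptions; inside Data Preparation, Gemini generates the actual transformation for you.

```python
import re

def split_price(price):
    """Split a raw price like 'USD 19.99' into (currency, amount).

    The 'three-letter code plus number' format is an assumed sample
    layout for illustration only.
    """
    match = re.match(r"\s*([A-Z]{3})\s*([\d.]+)\s*$", price)
    if match:
        return match.group(1), float(match.group(2))
    return None, None  # unparseable values yield NULL-like results

print(split_price("USD 19.99"))  # ('USD', 19.99)
```

The key point mirrors the tutorial: the original string holds two facts, and analysis gets easier once each fact lives in its own column.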
Gemini will identify recommendations for the Price column, including one to remove the currency. While tempting, applying this directly would mean losing the currency information. Instead, let's click the "Edit" button to examine the proposed transformation.
The edit window shows the regular expression Gemini uses to remove the currency. Notice that the target column is Price, and Gemini warns us that "This action will overwrite the target column values."
That's not what we want! To keep our original Price data intact, we'll create a new column to store just the numerical amount.
From the dropdown menu, select "CREATE NEW COLUMN." A prompt will appear asking for the new column name. Enter "Amount" and click "Save."
Click the "Preview" button before applying the transformation to confirm that the new Amount column has been created and that it contains only the numerical value, without the currency symbol. With the Amount column created, click "Apply" to save the changes.
Next, we'll create the Currency column. As Gemini hasn't offered a specific recommendation for this step, we'll leverage the natural language code generation feature. Click "ADD STEP" at the bottom right of the interface, and then choose "Transformation." Now, let's tell Gemini what we want to do. In the input field, enter the instruction "Extract the currency from the Price column" and then click the "Play" button. Gemini will process your request and generate the necessary script.
Following the same process, create a new target column for the currency and name it "Currency." Remember to preview the transformation before applying it to verify that the Currency column is populated correctly. Using Gemini's recommendations and the natural language scripting feature, we've now separated the Price column into two distinct columns: Currency and Amount.
With the Price information now split into Amount and Currency columns, the original Price column is no longer needed. To delete it, navigate to the "Schema" tab. Select the Price column, click the three vertical dots (or "kebab menu") in the "Actions" column, and then select "Delete."
Remember to save your work periodically. Click the "Save" button next to the pipeline name. You can also rename your pipeline by clicking on the name and editing it.
The Product column contains both the product name and category, separated by a pipe (|). While we could use natural language again, let's explore another of Gemini's powerful features. Select the category portion of a product entry and delete it. Gemini will intelligently recognize this pattern and suggest a transformation to apply to the entire column.
Gemini's recommendation is spot on: it removes everything after the '|' character, effectively isolating the product name. But, just like before, we don't want to overwrite our original data. We'll use the target column dropdown to create a new column named ProductName.
Click "Edit" to open the transformation editor. Create a new column to store the product name. Before applying, preview the changes to make sure everything looks good. Then, apply the transformation.
Now, to extract the product category, let's add a new transformation step. Using natural language, we'll instruct Gemini to "extract the word after the pipe (|) in the Product column." This extracted value will be stored in a new column called Category.
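The two transformations on the Product column amount to splitting one string on the pipe character. A minimal Python equivalent, with an assumed sample value for illustration:

```python
def split_product(product):
    """Split a 'name|category' string into (ProductName, Category).

    Assumes exactly one pipe separator, matching the tutorial's
    sample data layout.
    """
    name, _, category = product.partition("|")
    return name.strip(), category.strip()

# Hypothetical sample row, for illustration only.
print(split_product("Velvet Matte Lipstick|Makeup"))
```

Everything before the pipe becomes ProductName; the word after it becomes Category, just as the two Gemini-generated steps do in the pipeline.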
Applying the transformation will give you the following results. With the ProductName and Category columns created, the original Product column is no longer needed. Delete it from the "Schema" tab. Our product data is now clean and ready to go. Often, you'll want to enrich your data with information from other sources. In our example, we'll join our product data with extended product attributes, stg_extended_product, from a third-party table. This table includes details like brand and launch date.
To join the data, click "ADD STEP," select "Join," browse to the stg_extended_product table, select the join keys and join type, and then click "Create."
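Conceptually, the Join step enriches each product row with the matching attributes from stg_extended_product, keyed on the product identifier. A tiny Python sketch of that lookup (column names and sample values are illustrative assumptions, and this mimics a left join where unmatched rows keep only their original columns):

```python
# Product rows produced by the earlier transformation steps (sample data).
products = [
    {"ProductID": 1, "ProductName": "Lipstick"},
    {"ProductID": 2, "ProductName": "Serum"},
]

# Extended attributes keyed by ProductID, standing in for stg_extended_product.
extended = {
    1: {"Brand": "GlowCo", "LaunchDate": "2023-01-15"},
}

# Left-join behavior: keep every product, add attributes when a match exists.
joined = [{**row, **extended.get(row["ProductID"], {})} for row in products]

print(joined)
```

The join type you pick in the dialog controls what happens to unmatched rows; the sketch above corresponds to a left join on ProductID.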
While the join was successful, the extended attributes data requires some cleaning. The LaunchDate column has inconsistent date formats, and the Brand column contains some missing values.
We'll begin by addressing the LaunchDate column. Before creating any transformations, we started by looking at Gemini's recommendations. It provided an excellent suggestion: not only harmonizing the LaunchDate format but also converting it to a date data type. We'll click the "Edit" button to examine and then apply Gemini's suggested transformation.
The LaunchDate column now has a consistent date format. To be transparent, while Gemini usually handles all date formats, we had to add the %Y/%m/%d format to the expression: COALESCE(SAFE.PARSE_DATE('%Y-%m-%d', LaunchDate), SAFE.PARSE_DATE('%Y/%m/%d', LaunchDate)).
Next, we'll address the empty values in the Brand column. We'll replace these with the default value "Unknown." As demonstrated earlier, entering "Unknown" into an empty Brand cell will trigger Gemini to suggest applying this change to all empty cells in the column.
Gemini's recommendation is shown below. After reviewing the generated transformation, we realized it only addressed NULL values, not empty strings ('').
We refined the natural language description as shown in the screenshot below, regenerated the transformation, and then obtained the desired result.
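The distinction that tripped up the first attempt is worth spelling out: NULL and the empty string are different values, and a rule that only checks for NULL leaves '' untouched. A small Python sketch of the refined rule:

```python
def fill_brand(brand):
    """Replace both NULL (None) and empty/blank strings with 'Unknown'.

    The first generated transformation only handled None; this version
    reflects the refined instruction that also covers '' and whitespace.
    """
    if brand is None or brand.strip() == "":
        return "Unknown"
    return brand

print([fill_brand(b) for b in ["GlowCo", None, "", "  "]])
```

When you phrase the natural language instruction, being explicit about "NULL or empty" avoids a second round of editing.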
We've also identified empty values in the Manufacturer column. For these records, we want to implement a data quality check and move them to an error table for further review.
To create the error table, click the "MORE" button, navigate to the "Settings" section, and select "Error Table." Check the "Enable error table" box, configure the table settings (including the retention period for error data), and then click "Save."
Select the Manufacturer column. Gemini has already identified a relevant transformation: keeping only rows where the Manufacturer field is not empty. Click the "Edit" button on this recommendation to review it. Ensure the "Failed validation rows go to error table" option is checked, and then click the "Apply" button.
At any point, you can review, modify, or delete the transformations you've applied by clicking on the "Applied steps" button. The ProductID_1 column, which duplicates the ProductID from our joined table, can now be deleted. Use the same process as before: go to the "Schema" tab and delete the column.
Our dataset is now clean and ready for loading into a dimension table in our data warehouse.
To load the data, click "ADD STEP," select "Destination," fill in the required parameters, and click "Save."
We've now worked with the "Data" and "Schema" tabs. In addition to these, BigQuery Data Preparation provides a "Graph" view that visually displays the sequence of transformation steps in your pipeline.
We're now ready to run the data preparation job and validate our entire pipeline. Once we're satisfied with the results, we can schedule the job to run automatically.
To run the job in interactive mode, click the "Run" button and select the "Configure" option. A prompt will appear, asking you to select a service account or create a new one. Choose a service account that has been granted the required permissions (listed above) and then save your selection.
With the service account configured, you're ready to run the pipeline.
BigQuery Data Preparation will automatically create the target table if it doesn't already exist.
By default, the table is fully refreshed using a truncate and load operation. To modify this behavior, click the "MORE" button and change the "Write mode" setting.
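The difference between the default full refresh and an append is easy to see with a toy example. Here the target "table" is just a Python list, purely for illustration:

```python
# Existing contents of the target table before the pipeline runs.
table = ["old_row_1", "old_row_2"]
new_batch = ["new_row_1"]

# Truncate and load (the default): the table ends up with only the new rows.
truncated_result = list(new_batch)

# Append: the new rows are added after the existing ones.
appended_result = table + new_batch

print(truncated_result)
print(appended_result)
```

Pick truncate-and-load for dimension tables you rebuild in full each run, and an append-style mode when each run delivers only new records.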
You'll be presented with the following three options.
We're ready to run the pipeline in interactive mode. Click the "Run" button to start the process. You can track the pipeline's progress in real-time using the execution window.
To get more information about the pipeline execution, such as duration, target table details, the service account used, and the job ID, select "View details."
The DimProduct table should now be created and populated with your transformed data.
Review the error table to verify that records with missing Manufacturer values were correctly rejected.
With the pipeline validated, you can now schedule it to run on a recurring basis. Click the "Schedule" button, provide a name for the schedule, select the service account to use, and configure the details of the schedule frequency (e.g., daily, weekly, monthly).
View scheduled job execution details in the "Scheduling" tab of BigQuery, under "Pipelines & Integration."
BigQuery Data Preparation makes data transformation accessible to everyone. With its intuitive interface, AI-powered recommendations, and natural language processing capabilities, you can quickly clean, transform, and enrich your data for analysis. Start exploring BigQuery Data Preparation today and unlock the full potential of your data!
To get support or provide feedback for BigQuery data preparation with Gemini, contact [email protected].