Ready to unlock the secrets hidden in your product data? This tutorial demonstrates how BigQuery Data Preparation can clean and transform your raw data into actionable business intelligence, using a realistic example from the Fashion and Beauty industry.
Ready to get started? First, ensure you have the roles and permissions required for BigQuery Data Preparation, listed below.
In this tutorial, you will open datasets from the Analytics Hub in data preparation, learn how to prepare the data using Gemini, join the data, and load the results into a new BigQuery table.
Let's jump right in! Our first step is connecting to BigQuery and subscribing to the dataset we’ll use for this tutorial. Here's how.
We'll be using the "bq data preparation demo" dataset for this tutorial. Follow these steps to find and subscribe to it:
1. Open BigQuery Studio: In the Google Cloud Console, navigate to BigQuery. In BigQuery Studio, click the Add button.
2. Select Analytics Hub: In the "Additional sources" section of the new window, choose "Analytics Hub" (in the bottom left corner).
3. Search for the Listing: Within the Analytics Hub interface, use the search bar to find our dataset. Type 'bq data preparation demo' and press Enter.
4. Subscribe to the Listing: Select the "bq data preparation demo" listing from the search results.
5. On the listing details page, click the "Subscribe" button.
6. Review any confirmation dialogs and update the project/dataset if needed.
7. Access the Dataset in BigQuery: Once you've successfully subscribed, the datasets within the listing will be linked to your BigQuery project.
8. Return to the main BigQuery console and open one of the linked tables with Data Preparation.
This will open the table in the Data Preparation interface, ready for you to begin transforming your data.
As you can see in the data preview below, we have some data challenges we'll tackle: the Price column contains both the amount and currency, making analysis difficult. Similarly, the Product column mixes the product name and category, separated by a pipe symbol (|).
Let's tackle the Price column. As we've seen, it contains both the currency and the amount. Our goal is to separate these into two distinct columns: Currency and Amount.
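To make the goal concrete, here is a minimal Python sketch of the kind of split we're after. The sample value format ('USD 19.99') and the regular expression are illustrative assumptions; inside Data Preparation, Gemini generates the actual transformation for you.

```python
import re

def split_price(price):
    """Split a raw price like 'USD 19.99' into (currency, amount).

    The 'three-letter code plus number' format is an assumed sample
    layout for illustration only.
    """
    match = re.match(r"\s*([A-Z]{3})\s*([\d.]+)\s*$", price)
    if match:
        return match.group(1), float(match.group(2))
    return None, None  # unparseable values yield NULL-like results

print(split_price("USD 19.99"))  # ('USD', 19.99)
```

The key point mirrors the tutorial: the original string holds two facts, and analysis gets easier once each fact lives in its own column.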
Gemini will identify recommendations for the Price column, including one to remove the currency. While tempting, applying this directly would mean losing the currency information. Instead, let's click the "Edit" button to examine the proposed transformation.
The edit window shows the regular expression Gemini uses to remove the currency. Notice that the target column is Price, and Gemini warns us that "This action will overwrite the target column values."
That's not what we want! To keep our original Price data intact, we'll create a new column to store just the numerical amount.
From the dropdown menu, select "CREATE NEW COLUMN." A prompt will appear asking for the new column name. Enter "Amount" and click "Save."
Click the "Preview" button before applying the transformation to confirm that the new Amount column has been created and that it contains only the numerical value, without the currency symbol. With the Amount column created, click "Apply" to save the changes.
Next, we'll create the Currency column. As Gemini hasn't offered a specific recommendation for this step, we'll leverage the natural language code generation feature. Click "ADD STEP" at the bottom right of the interface, and then choose "Transformation." Now, let's tell Gemini what we want to do. In the input field, enter the instruction "Extract the currency from the Price column" and then click the "Play" button. Gemini will process your request and generate the necessary script.
Following the same process, create a new target column for the currency and name it "Currency." Remember to preview the transformation before applying it to verify that the Currency column is populated correctly. Using Gemini's recommendations and the natural language scripting feature, we've now separated the Price column into two distinct columns: Currency and Amount.
With the Price information now split into Amount and Currency columns, the original Price column is no longer needed. To delete it, navigate to the "Schema" tab. Select the Price column, click the three vertical dots (or "kebab menu") in the "Actions" column, and then select "Delete."
Remember to save your work periodically. Click the "Save" button next to the pipeline name. You can also rename your pipeline by clicking on the name and editing it.
The Product column contains both the product name and category, separated by a pipe (|). While we could use natural language again, let's explore another of Gemini's powerful features. Select the category portion of a product entry and delete it. Gemini will intelligently recognize this pattern and suggest a transformation to apply to the entire column.
Gemini's recommendation is spot on: it removes everything after the '|' character, effectively isolating the product name. But, just like before, we don't want to overwrite our original data. We'll use the target column dropdown to create a new column named ProductName.
Click "Edit" to open the transformation editor. Create a new column to store the product name. Before applying, preview the changes to make sure everything looks good. Then, apply the transformation.
Now, to extract the product category, let's add a new transformation step. Using natural language, we'll instruct Gemini to "extract the word after the pipe (|) in the Product column." This extracted value will be stored in a new column called Category.
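The two transformations on the Product column amount to splitting one string on the pipe character. A minimal Python equivalent, with an assumed sample value for illustration:

```python
def split_product(product):
    """Split a 'name|category' string into (ProductName, Category).

    Assumes exactly one pipe separator, matching the tutorial's
    sample data layout.
    """
    name, _, category = product.partition("|")
    return name.strip(), category.strip()

# Hypothetical sample row, for illustration only.
print(split_product("Velvet Matte Lipstick|Makeup"))
```

Everything before the pipe becomes ProductName; the word after it becomes Category, just as the two Gemini-generated steps do in the pipeline.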
Applying the transformation will give you the following results. With the ProductName and Category columns created, the original Product column is no longer needed. Delete it from the "Schema" tab. Our product data is now clean and ready to go. Often, you'll want to enrich your data with information from other sources. In our example, we'll join our product data with extended product attributes, stg_extended_product, from a third-party table. This table includes details like brand and launch date.
To join the data, click "ADD STEP," select "Join," browse to the stg_extended_product table, select the join keys and join type, and then click "Create."
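Conceptually, the Join step enriches each product row with the matching attributes from stg_extended_product, keyed on the product identifier. A tiny Python sketch of that lookup (column names and sample values are illustrative assumptions, and this mimics a left join where unmatched rows keep only their original columns):

```python
# Product rows produced by the earlier transformation steps (sample data).
products = [
    {"ProductID": 1, "ProductName": "Lipstick"},
    {"ProductID": 2, "ProductName": "Serum"},
]

# Extended attributes keyed by ProductID, standing in for stg_extended_product.
extended = {
    1: {"Brand": "GlowCo", "LaunchDate": "2023-01-15"},
}

# Left-join behavior: keep every product, add attributes when a match exists.
joined = [{**row, **extended.get(row["ProductID"], {})} for row in products]

print(joined)
```

The join type you pick in the dialog controls what happens to unmatched rows; the sketch above corresponds to a left join on ProductID.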
While the join was successful, the extended attributes data requires some cleaning. The LaunchDate column has inconsistent date formats, and the Brand column contains some missing values.
We'll begin by addressing the LaunchDate column. Before creating any transformations, we started by looking at Gemini's recommendations. It provided an excellent suggestion: not only harmonizing the LaunchDate format but also converting it to a date data type. We'll click the "Edit" button to examine and then apply Gemini's suggested transformation.
The LaunchDate column now has a consistent date format. To be transparent, while Gemini usually handles all date formats, we had to add the %Y/%m/%d format to the expression: COALESCE(SAFE.PARSE_DATE('%Y-%m-%d', LaunchDate), SAFE.PARSE_DATE('%Y/%m/%d', LaunchDate)).
Next, we'll address the empty values in the Brand column. We'll replace these with the default value "Unknown." As demonstrated earlier, entering "Unknown" into an empty Brand cell will trigger Gemini to suggest applying this change to all empty cells in the column.
Gemini's recommendation is shown below. After reviewing the generated transformation, we realized it only addressed NULL values, not empty strings ('').
We refined the natural language description as shown in the screenshot below, regenerated the transformation, and then obtained the desired result.
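The distinction that tripped up the first attempt is worth spelling out: NULL and the empty string are different values, and a rule that only checks for NULL leaves '' untouched. A small Python sketch of the refined rule:

```python
def fill_brand(brand):
    """Replace both NULL (None) and empty/blank strings with 'Unknown'.

    The first generated transformation only handled None; this version
    reflects the refined instruction that also covers '' and whitespace.
    """
    if brand is None or brand.strip() == "":
        return "Unknown"
    return brand

print([fill_brand(b) for b in ["GlowCo", None, "", "  "]])
```

When you phrase the natural language instruction, being explicit about "NULL or empty" avoids a second round of editing.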
We've also identified empty values in the Manufacturer column. For these records, we want to implement a data quality check and move them to an error table for further review.
To create the error table, click the "MORE" button, navigate to the "Settings" section, and select "Error Table." Check the "Enable error table" box, configure the table settings (including the retention period for error data), and then click "Save."
Select the Manufacturer column. Gemini has already identified a relevant transformation: keeping only rows where the Manufacturer field is not empty. Click the "Edit" button on this recommendation to review it. Ensure the "Failed validation rows go to error table" option is checked, and then click the "Apply" button.
At any point, you can review, modify, or delete the transformations you've applied by clicking on the "Applied steps" button. The ProductID_1 column, which duplicates the ProductID from our joined table, can now be deleted. Use the same process as before: go to the "Schema" tab and delete the column.
Our dataset is now clean and ready for loading into a dimension table in our data warehouse.
To load the data, click "ADD STEP," select "Destination," fill in the required parameters, and click "Save."
We've now worked with the "Data" and "Schema" tabs. In addition to these, BigQuery Data Preparation provides a "Graph" view that visually displays the sequence of transformation steps in your pipeline.
We're now ready to run the data preparation job and validate our entire pipeline. Once we're satisfied with the results, we can schedule the job to run automatically.
To run the job in interactive mode, click the "Run" button and select the "Configure" option. A prompt will appear, asking you to select a service account or create a new one. Choose a service account that has been granted the required permissions (listed above) and then save your selection.
With the service account configured, you're ready to run the pipeline.
BigQuery Data Preparation will automatically create the target table if it doesn't already exist.
By default, the table is fully refreshed using a truncate and load operation. To modify this behavior, click the "MORE" button and change the "Write mode" setting.
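The difference between the default full refresh and an append is easy to see with a toy example. Here the target "table" is just a Python list, purely for illustration:

```python
# Existing contents of the target table before the pipeline runs.
table = ["old_row_1", "old_row_2"]
new_batch = ["new_row_1"]

# Truncate and load (the default): the table ends up with only the new rows.
truncated_result = list(new_batch)

# Append: the new rows are added after the existing ones.
appended_result = table + new_batch

print(truncated_result)
print(appended_result)
```

Pick truncate-and-load for dimension tables you rebuild in full each run, and an append-style mode when each run delivers only new records.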
You'll be presented with the following three options.
We're ready to run the pipeline in interactive mode. Click the "Run" button to start the process. You can track the pipeline's progress in real-time using the execution window.
To get more information about the pipeline execution, such as duration, target table details, the service account used, and the job ID, select "View details."
The DimProduct table should now be created and populated with your transformed data.
Review the error table to verify that records with missing Manufacturer values were correctly rejected.
With the pipeline validated, you can now schedule it to run on a recurring basis. Click the "Schedule" button, provide a name for the schedule, select the service account to use, and configure the details of the schedule frequency (e.g., daily, weekly, monthly).
View scheduled job execution details in the "Scheduling" tab of BigQuery, under "Pipelines & Integration."
BigQuery Data Preparation makes data transformation accessible to everyone. With its intuitive interface, AI-powered recommendations, and natural language processing capabilities, you can quickly clean, transform, and enrich your data for analysis. Start exploring BigQuery Data Preparation today and unlock the full potential of your data!
To get support or provide feedback for BigQuery data preparation with Gemini, contact [email protected].