BigQuery For Data Warehouse Practitioners - Solutions - Google Cloud

BigQuery can serve as a data warehouse, with datasets analogous to data marts and tables/views functioning similarly to a traditional data warehouse. BigQuery organizes data into projects, datasets and tables. It provides dynamic resource allocation without requiring provisioning, and handles backup/recovery through table snapshots. Permissions are managed through IAM roles and views can enable row-level security.

Uploaded by

Siddharth Phalke
Copyright
© All Rights Reserved

BigQuery for data warehouse practitioners

Updated September 2017

This article explains how to use BigQuery (/bigquery/what-is-bigquery) as a data warehouse, first
mapping common data warehouse concepts to those in BigQuery, and then describing how to
perform standard data-warehousing tasks in BigQuery.

Service model comparison

The following table maps standard data-warehouse concepts to those in BigQuery:

| Data warehouse | BigQuery |
| --- | --- |
| Data warehouse | The BigQuery service replaces the typical hardware setup for a traditional data warehouse. That is, it serves as a collective home for all analytical data in an organization. |
| Data mart | Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a Google Cloud project. |
| Data lake | Your data lake might contain files in Cloud Storage (/storage) or Google Drive (https://www.google.com/drive/) or transactional data in Bigtable (/bigtable). BigQuery can define a schema and issue queries directly on external data as federated data sources (/bigquery/federated-data-sources). |
| Tables and views | Tables and views function the same way in BigQuery as they do in a traditional data warehouse. |
| Grants | Identity and Access Management (IAM) is used to grant permission to perform specific actions in BigQuery. |

Datasets

BigQuery organizes data tables into units called datasets. These datasets are scoped to your Google
Cloud project. When you reference a table from the command line, in SQL queries, or in code, you
refer to it by using the following construct:

project.dataset.table

These multiple scopes (project, dataset, and table) can help you structure your information logically.
You can use multiple datasets to separate tables pertaining to different analytical domains, and you
can use project-level scoping to isolate datasets from each other according to your business needs.
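
As a minimal sketch of the three-level naming construct, the following Python helpers join and split a table reference. The function names and the sample identifiers (my-project, sales, orders) are hypothetical and are not part of any BigQuery client library.

```python
def qualified_table(project: str, dataset: str, table: str) -> str:
    """Join the three scopes into a project.dataset.table reference."""
    parts = (project, dataset, table)
    if not all(parts):
        raise ValueError("project, dataset, and table must all be non-empty")
    return ".".join(parts)


def split_table(reference: str) -> tuple:
    """Split a project.dataset.table reference back into its three scopes."""
    parts = reference.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected project.dataset.table, got {reference!r}")
    return tuple(parts)


# Hypothetical identifiers, used only for illustration.
reference = qualified_table("my-project", "sales", "orders")
print(reference)
print(split_table(reference))
```

Keeping the scopes separate until the last moment makes it easier to point the same code at a different project or dataset.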

Here is a structural overview of BigQuery:

Provisioning and system sizing

You don't need to provision resources before using BigQuery, unlike many RDBMS systems. BigQuery
allocates storage and query resources dynamically based on your usage patterns.

Storage resources are allocated as you consume them and deallocated as you remove data or
drop tables.

Query resources are allocated according to query type and complexity. Each query uses some
number of slots, which are units of computation that comprise a certain amount of CPU and
RAM.

You don't have to make a minimum usage commitment to use BigQuery. The service allocates and
charges for resources based on your actual usage. By default, all BigQuery customers have access to
2,000 slots for query operations. You can also reserve a fixed number of slots for your project. For
details about which approach to use, see the Costs (#costs) section.

To start using BigQuery, you create a project to host your data, and then you enable billing. For instructions, see the
BigQuery Quickstart (/bigquery/quickstart-web-ui#before-you-begin).

Storage management

Internally, BigQuery stores data in a proprietary columnar format called Capacitor
(/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format), which has a
number of benefits for data warehouse workloads. BigQuery uses a proprietary format because it can
evolve in tandem with the query engine, which takes advantage of deep knowledge of the data layout
to optimize query execution. BigQuery uses query access patterns to determine the optimal number
of physical shards and how they are encoded.

The data is physically stored on Google's distributed file system, called Colossus
(https://cloud.google.com/files/storage_architecture_and_challenges.pdf), which ensures durability by using
erasure encoding (https://wikipedia.org/wiki/Erasure_code) to store redundant chunks of the data on
multiple physical disks. Moreover, the data is replicated to multiple data centers.

You can also run BigQuery queries on data outside of BigQuery storage, such as data stored in Cloud
Storage, Google Drive, or Bigtable, by using federated data sources (/bigquery/federated-data-sources).
However, these sources are not optimized for BigQuery operations, so they might not perform as well
as data stored in BigQuery storage.

Maintenance

BigQuery is a fully-managed service, which means that the BigQuery engineering team takes care of
updates and maintenance for you. Upgrades shouldn't require downtime or hinder system
performance.

Many traditional systems require resource-intensive vacuum processes to run at various intervals to
reshuffle and sort data blocks and recover space. BigQuery has no equivalent of the vacuum process,
because the storage engine continuously manages and optimizes how data is stored and replicated.
Also, because BigQuery doesn't use indexes on tables, you don't need to rebuild indexes.

Backup and recovery

BigQuery addresses backup and disaster recovery at the service level. Also, by maintaining a
complete 7-day history of changes against your tables, BigQuery allows you to query a point-in-time
snapshot of your data by using either table decorators (/bigquery/table-decorators#snapshot_decorators)
or FOR SYSTEM_TIME AS OF in the FROM clause (/bigquery/docs/reference/standard-sql/query-syntax#from-clause).
You can easily revert changes without having to request a recovery from backups. (When a table is
explicitly deleted, its history is flushed after 7 days.)
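
The point-in-time query can be sketched as a small Python helper that builds a standard SQL statement with a FOR SYSTEM_TIME AS OF clause and rejects timestamps outside the 7-day history. The helper name, the table name, and the explicit `now` parameter are assumptions made for illustration; the helper only builds the query text and does not call BigQuery.

```python
from datetime import datetime, timedelta, timezone

HISTORY_WINDOW = timedelta(days=7)  # retention period stated in the text


def snapshot_query(table: str, as_of: datetime, now: datetime) -> str:
    """Build a query that reads `table` as it was at `as_of`.

    Both datetimes must be timezone-aware.
    """
    if as_of > now:
        raise ValueError("snapshot time is in the future")
    if now - as_of > HISTORY_WINDOW:
        raise ValueError("snapshot time is older than the 7-day history")
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return (
        f"SELECT * FROM `{table}` "
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{stamp}'"
    )


# Hypothetical table and clock values, used only for illustration.
now = datetime(2017, 9, 10, 12, 0, tzinfo=timezone.utc)
print(snapshot_query("my-project.sales.orders", now - timedelta(hours=1), now))
```

Passing `now` explicitly keeps the helper deterministic, which makes the retention check easy to test.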

Managing workflows

This section discusses administrative tasks, such as organizing datasets, granting permissions, and
onboarding work in BigQuery. The section also discusses how to manage concurrent workloads,
monitor the health of your data warehouse, and audit user access.

Organizing datasets

You can segment datasets into separate projects based on class of data or business unit, or
consolidate them into common projects for simplicity.

You can invite a data analyst to collaborate on an existing dataset in any limited role that you define.
When a data analyst logs into the BigQuery web UI (https://bigquery.cloud.google.com), they see only the
datasets that have been shared with them across projects. The activities that they can perform
against datasets vary, based on their role against each dataset.

Granting permissions

In a traditional RDBMS system, you grant permissions to view or modify tables by creating SQL grants
and applying them to a given user within the database system. In addition, some RDBMS systems
allow you to grant permissions to users in an external directory, such as LDAP. The BigQuery model
for managing users and permissions resembles the latter model.

BigQuery provides predefined roles (/bigquery/docs/access-control#bigquery) for controlling access to
resources. You can also create custom IAM roles (/bigquery/docs/access-control) consisting of your
defined set of permissions, and then assign those roles to users or groups. You can assign a role to a
Google email address or to a G Suite Group (https://support.google.com/a/answer/33329?hl=en).

An important aspect of operating a data warehouse is allowing shared but controlled access against
the same data to different groups of users. For example, finance, HR, and marketing departments all
access the same tables, but their levels of access differ. Traditional data warehousing tools make this
possible by enforcing row-level security. You can achieve the same results in BigQuery by defining
authorized views (/bigquery/docs/authorized-views) and row-level permissions
(/bigquery/docs/authorized-views#row-level-permissions).

Onboarding

Traditionally, onboarding new data analysts involved significant lead time. To enable analysts to run
simple queries, you had to show them where data sources resided and set up ODBC connections and
tools and access rights. Using Google Cloud, you can greatly accelerate an analyst's time to
productivity.

To onboard an analyst on Google Cloud, you grant access to relevant project(s)
(https://support.google.com/cloud/answer/6158846#add-members), introduce them to the Google Cloud
Console and BigQuery web UI, and share some queries to help them get acquainted with the data:

The Cloud Console (https://console.cloud.google.com/) provides a centralized view of all assets in
your Google Cloud environment. The most relevant asset to data analysts might be Cloud
Storage buckets (/storage/docs/creating-buckets), where they can collaborate on files.

The BigQuery web UI presents the list of datasets that the analyst has access to. Analysts can
perform tasks in the Cloud Console according to the role you grant them, such as viewing
metadata, previewing data, executing, and saving and sharing queries.

Managing workloads and concurrency

BigQuery limits the maximum rate of incoming requests and enforces appropriate quotas on a
per-project basis. Specific policies vary depending on resource availability, user profile, service usage
history, and other factors. For details, see the BigQuery quota policy (/bigquery/quota-policy).

BigQuery offers two types of query priorities (/bigquery/querying-data#interactive-batch): interactive and
batch. By default, BigQuery runs interactive queries, which means that the query is executed as soon
as possible. Interactive queries count towards query quotas (/bigquery/quota-policy#queries). Batch
queries are queued and executed as soon as idle resources are available, usually within a few
minutes.

BigQuery doesn't support fine-grained prioritization of interactive or batch queries. Given the speed
and scale at which BigQuery operates, many traditional workload issues aren't applicable. If you need
explicit query prioritization, you can separate your sensitive workloads into a project with an explicit
number of reserved slots. Contact your Google representative to assist in becoming a flat-rate
customer.

Monitoring and auditing

You can monitor BigQuery using Monitoring (/bigquery/docs/monitoring), where various charts and
alerts are defined based on BigQuery metrics (/bigquery/docs/monitoring#metrics). For example, you can
monitor system throughput using the Query Time metric or visualize query demand trends based on
the Slots Allocated metric. When you need to plan ahead for a demanding query, you can use the
Slots Available metric. To stay proactive about system health, you can create alerts
(/bigquery/docs/monitoring#create-alert) based on thresholds that you define. Monitoring provides a
self-service web-based portal. You can control access to the portal with a Monitoring Workspace
(/monitoring/accounts/guide).

BigQuery automatically creates audit logs of user actions. You can export audit logs to another
BigQuery dataset in a batch or as a data stream and use your preferred analysis tool to visualize the
logs. For details, see Analyzing audit logs using BigQuery (/bigquery/audit-logs).

Managing data

This section discusses schema design considerations, denormalization
(https://wikipedia.org/wiki/Denormalization), how partitioning works, and methods for loading data into
BigQuery. The section concludes with a look at handling change in the warehouse while maintaining
zero analysis downtime.

Designing schema

Follow these general guidelines to design the optimal schema for BigQuery:

Denormalize a dimension table that is larger than 10 gigabytes, unless you see strong evidence
that the costs of data manipulation (UPDATE and DELETE operations) outweigh the benefits of optimal
queries.

Keep a dimension table that is smaller than 10 gigabytes normalized, unless the table rarely
goes through UPDATE and DELETE operations.

Take full advantage of nested and repeated fields in denormalized tables.
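
The first two guidelines can be read as a simple decision rule. The following Python sketch encodes that reading; the function and parameter names are hypothetical, and treating a table of exactly 10 gigabytes as "small" is an assumption, because the guidelines don't cover that boundary.

```python
SIZE_THRESHOLD_GB = 10  # threshold quoted in the guidelines


def recommend_layout(size_gb: float,
                     rarely_updated: bool = False,
                     dml_cost_dominates: bool = False) -> str:
    """Suggest a layout for a dimension table.

    rarely_updated: the table rarely sees UPDATE and DELETE operations.
    dml_cost_dominates: there is strong evidence that UPDATE and DELETE
        costs outweigh the benefits of faster queries.
    """
    if size_gb > SIZE_THRESHOLD_GB:
        # Large dimension: denormalize unless DML costs clearly dominate.
        return "normalized" if dml_cost_dominates else "denormalized"
    # Small dimension (assumption: exactly 10 GB counts as small):
    # keep it normalized unless it rarely changes.
    return "denormalized" if rarely_updated else "normalized"


print(recommend_layout(50))
print(recommend_layout(2))
```

The rule is only a starting point; measured query and DML costs for your own workload should override it.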

Denormalization

The conventional method of denormalizing data involves writing a fact, along with all its dimensions,
into a flat table structure. For example, for sales transactions, you would write each fact to a record,
along with the accompanying dimensions, such as order and customer information.

In contrast, the preferred method for denormalizing data takes advantage of BigQuery's native
support for nested and repeated structures in JSON or Avro input data. Expressing records using
nested and repeated structures can provide a more natural representation of the underlying data. In
the case of the sales order, the outer part of a JSON structure contains the order and customer
information, and the inner part of the structure contains the individual line items of the order, which
are represented as nested, repeated elements.

{
  "orderID": "ORDER",
  "custID": "EMAIL",
  "custName": "NAME",
  "timestamp": "TIME",
  "location": "LOCATION",
  "purchasedItems": [
    {
      "sku": "SKU",
      "description": "DESCRIPTION",
      "quantity": "QTY",
      "price": "PRICE"
    },
    {
      "sku": "SKU",
      "description": "DESCRIPTION",
      "quantity": "QTY",
      "price": "PRICE"
    }
  ]
}

Expressing records by using nested and repeated fields simplifies data load using JSON or Avro files.
After you've created such a schema, you can perform SELECT, INSERT, UPDATE, and DELETE operations
on any individual fields using a dot notation, for example, Order.Item.SKU. For examples, see the
BigQuery documentation (/bigquery/docs/data).
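
To contrast the two denormalization styles, the following Python sketch builds one nested order record, serializes it as newline-delimited JSON (one record per line), and also flattens it into the conventional one-row-per-line-item form. The sample values and the helper names are made up for illustration.

```python
import json

# Hypothetical order using the field names from the structure above.
order = {
    "orderID": "ORDER-1",
    "custID": "jane@example.com",
    "custName": "Jane",
    "timestamp": "2017-09-10T12:00:00Z",
    "location": "NYC",
    "purchasedItems": [
        {"sku": "SKU-1", "description": "widget", "quantity": 2, "price": 9.99},
        {"sku": "SKU-2", "description": "gadget", "quantity": 1, "price": 24.5},
    ],
}


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


def flatten(record):
    """Conventional flat form: repeat the order fields on every line item."""
    header = {k: v for k, v in record.items() if k != "purchasedItems"}
    return [{**header, **item} for item in record["purchasedItems"]]


print(to_ndjson([order]))
for row in flatten(order):
    print(row)
```

The nested form stores the order and customer fields once per order, while the flat form repeats them on every line item.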

Advantages of denormalization

BigQuery is essentially an analytical engine. It supports DML
(https://wikipedia.org/wiki/Data_manipulation_language) actions, but it isn't meant to be used as an online
transaction processing (OLTP) store. The discussion about Changing data (#handling_change) provides
guidelines for dealing with changes while maintaining zero analysis downtime and delivering optimal
online analytical processing (OLAP) performance. While normalized or partially normalized data
structures, such as star schema or snowflake, are suitable for update/delete operations, they aren't
optimal for OLAP workloads. When performing OLAP operations on normalized tables, multiple tables
have to be JOINed to perform the required aggregations. JOINs (/bigquery/query-reference#joins) are
possible with BigQuery and sometimes recommended on small tables. However, they are typically not
as performant as denormalized structures.

The following graph compares query performance using JOINs to simple filters in relation to table
size. Query performance shows a much steeper decay in presence of JOINs.

Disadvantages of denormalization

Denormalized schemas aren't storage-optimal, but BigQuery's low cost of storage addresses
concerns about storage inefficiency. You can contrast costs against gains in query speed to see why
storage isn't a significant factor.

One challenge when you work with a denormalized schema is maintaining data integrity. Depending on
the frequency of change and how widespread it is, maintaining data integrity can require increased
machine time and sometimes human time for testing and verification.

Partitioning tables

BigQuery supports partitioning tables by date (/bigquery/docs/creating-partitioned-tables). You enable
partitioning during the table-creation process. BigQuery creates new date-based partitions
automatically, with no need for additional maintenance. In addition, you can specify an expiration time
for data in the partitions.

New data that is inserted into a partitioned table is written to the raw partition at the time of insert. To
explicitly control which partition the data is loaded to, your load job can specify a particular date
partition.
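
A load job targets a particular date partition by appending a `$YYYYMMDD` suffix to the table name, the same form that the bq commands later in this article use. The following Python helper is a minimal sketch of building that reference; the function name is hypothetical.

```python
from datetime import date


def partition_decorator(table: str, day: date) -> str:
    """Return the table reference for a single date partition."""
    return f"{table}${day:%Y%m%d}"


print(partition_decorator("flight_data.fact_flights_part", date(2014, 9, 10)))
```

When the result is passed to a shell command, keep it in single quotes so that the shell does not expand the `$` sign.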

Loading data

Before data can be loaded into BigQuery for analytical workloads, it is typically stored in a Cloud
Storage product (/products/storage) and in a format that is native to its origin. During early stages of
migration to Google Cloud, the common pattern is to use existing extract, transform, and load (ETL)
tools to transform data into the ideal schema for BigQuery. After data is transformed, it is transferred
to Cloud Storage as CSV, JSON, or Avro files, and from there loaded into BigQuery by using load jobs
(/bigquery/loading-data) or streaming (/bigquery/streaming-data-into-bigquery). Alternatively, you can
transfer files to Cloud Storage in the schema that is native to the existing on-premises data storage,
load them into a set of staging tables in BigQuery, and then transform them into the ideal schema for
BigQuery by using BigQuery SQL commands. These two approaches are visualized here:

As you expand your footprint in Google Cloud, you will probably capture your source data directly in
Bigtable (/bigtable), Datastore (/datastore), or Cloud Spanner (/spanner) and use Dataflow (/dataflow) to
ETL data into BigQuery in batch or streams.

Using load jobs

This section assumes that your data is in Cloud Storage as a collection of files in a supported file
format. For more information about each data format, as well as specific requirements and features
to consider when choosing a format, see BigQuery data formats (/bigquery/data-formats).

In addition to CSV, you can also use data files with delimiters other than commas by using the
--field_delimiter flag. For details, see bq load flags (/bigquery/bq-command-line-tool#bq-load-flags).

BigQuery supports loading gzip compressed
(/bigquery/preparing-data-for-loading#loading_compressed_and_uncompressed_data) files. However, loading
compressed files isn't as fast as loading uncompressed files. For time-sensitive scenarios or
scenarios in which transferring uncompressed files to Cloud Storage is bandwidth- or
time-constrained, conduct a quick loading test to see which alternative works best.

Because load jobs are asynchronous, you don't need to maintain a client connection while the job is
being executed. More importantly, load jobs don't affect your other BigQuery resources.

A load job creates a destination table if one doesn't already exist.

BigQuery determines the data schema as follows:

If your data is in Avro format, which is self-describing, BigQuery can determine the schema
directly.

If the data is in JSON or CSV format, BigQuery can auto-detect the schema
(/bigquery/bq-command-line-tool#autodetect), but manual verification
(/bigquery/bq-command-line-tool#autodetect) is recommended.

You can specify a schema explicitly by passing the schema as an argument to the load job
(/bigquery/docs/loading-data-cloud-storage). Ongoing load jobs can append to the same table using the
same procedure as the initial load, but do not require the schema to be passed with each job.

If your CSV files always contain a header row that needs to be ignored after the initial load and table
creation, you can use the --skip_leading_rows flag to ignore the row. For details, see bq load flags
(/bigquery/bq-command-line-tool#bq-load-flags).

BigQuery sets daily limits on the number and size of load jobs that you can perform per project and
per table. In addition, BigQuery sets limits on the sizes of individual load files and records. For details,
see Quota policy (/bigquery/quota-policy#import).

Because tables are set to a hard limit of 1,000 load jobs per day, micro-batching isn't advised. To achieve efficient
high-volume or real-time loading of data, use streaming inserts in place of micro-batching.
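
A quick calculation shows why micro-batching runs into the per-table limit. The Python sketch below takes the figure of 1,000 load jobs per day from the note above (check the current quota policy before relying on it); the function names are hypothetical.

```python
import math

DAILY_LOAD_JOB_LIMIT = 1_000  # per-table limit quoted in the note above
SECONDS_PER_DAY = 24 * 60 * 60


def load_jobs_per_day(interval_seconds: int) -> int:
    """Number of load jobs started per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def min_interval_seconds(limit: int = DAILY_LOAD_JOB_LIMIT) -> int:
    """Shortest whole-second interval that stays within the daily limit."""
    return math.ceil(SECONDS_PER_DAY / limit)


print(load_jobs_per_day(60))    # one job per minute
print(load_jobs_per_day(300))   # one job every five minutes
print(min_interval_seconds())
```

Under this limit, one load job per minute (1,440 jobs per day) exceeds the quota, whereas one job every five minutes (288 jobs per day) fits comfortably.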

You can launch load jobs through the BigQuery web UI. To automate the process, you can set up a
Cloud Function (/functions) to listen to a Cloud Storage event (/functions/docs/calling/storage) that is
associated with the arrival of new files in a given bucket and launch the BigQuery load job.

Using streaming inserts

For an alternate and complementary approach, you can also stream data directly into BigQuery.
Streamed data is made available immediately and can be queried alongside existing table data in
real-time.

For situations that can benefit from real-time information, such as fraud detection or monitoring
system metrics, streaming can be a significant differentiator. However, unlike load jobs, which are free
in BigQuery, there is a charge for streaming data. Therefore, it's important to use streaming in
situations where the benefits outweigh the costs.

When you stream data to the BigQuery tables, you send your records directly to BigQuery
(/bigquery/streaming-data-into-bigquery) by using the BigQuery API. If you use Cloud Logging, you can
also stream your Google Cloud project's logs directly into BigQuery
(/logging/docs/export/configure_export), including request logs from App Engine and custom log
information sent to Cloud Logging.

Handling change

Many data warehouses operate under strict Service Level Agreements (SLAs), demanding little to no
downtime. While Google handles BigQuery's uptime, you control the availability and responsiveness
of your datasets with your approach to reflecting change in the data.

All table modifications in BigQuery are ACID (https://wikipedia.org/wiki/ACID) compliant. This applies to
DML operations, queries with destination tables, and load jobs. A table that goes through inserts,
updates, and deletes while serving user queries handles the concurrency gracefully and transitions
from one state to the next in an atomic fashion. Therefore, modifying a table doesn't require
downtime. However, your internal process might require a testing and validation phase before making
newly refreshed data available for analysis. Also, because DML operations compete against analytical
workload over slots, you might prefer to isolate them. For these reasons, you might introduce
downtime. This article uses the term "analysis downtime" to avoid confusion with BigQuery service
downtime.

You can apply most of the old and proven techniques for handling analysis downtime. This section
expands on some of the known challenges and remedies.

Using BigQuery as an OLTP store is considered an anti-pattern. Because OLTP stores have a high volume of updates and
deletes, they are a mismatch for the data warehouse use case. To decide which storage option best fits your use case, review the
Cloud storage products (/products/storage) table.

Sliding time window

A traditional data warehouse, unlike a data lake, retains data only for a fixed amount of time, for
example, the last 5 years. On each update cycle, new data is added to the warehouse and the oldest
data rolls off, keeping the duration fixed. For the most part, this concept was employed to work
around the limitations of older technologies.

BigQuery is built for scale and can scale out as the size of the warehouse grows, so there is no need
to delete older data. By keeping the entire history, you can deliver more insight on your business. If the
storage cost is a concern, you can take advantage of BigQuery's long term storage pricing
(/bigquery/pricing#long-term-storage) by archiving older data and using it for special analysis when the
need arises. If you still have good reasons for dropping older data, you can use BigQuery's native
support for date-partitioned tables (/bigquery/docs/creating-partitioned-tables) and partition expiration
(/bigquery/docs/managing-partitioned-tables#partition-expiration). In other words, BigQuery can
automatically delete older data.

Changing schemas

While a data warehouse is designed and developed, it is typical to tweak table schemas by adding,
updating, or dropping columns or even adding or dropping whole tables. Unless the change is in the
form of an added column or table, it could break saved queries and reports that reference a deleted
table, a renamed column, and so on.

After the data warehouse is in production, such changes go through strict change control. You might
decide to handle minor schema changes during an analysis downtime, but for the most part
schema changes are scheduled as version upgrades. You design, develop, and test the upgrade in
parallel while the previous version of the data warehouse is serving the analysis workloads. You
follow the same approach in applying schema changes to a BigQuery data warehouse.

Slowly changing dimensions

A normalized data schema minimizes the impact of Slowly Changing Dimensions (SCD)
(https://wikipedia.org/wiki/Slowly_changing_dimension) by isolating the change in the dimension tables. It
is generally favorable over a denormalized schema, where SCD can cause widespread updates to the
flat fact table. However, as discussed in the schema design section, use normalization carefully for
BigQuery.

When it comes to SCD, there is no one-size-fits-all solution. It is important to understand the nature of
the change and apply the most relevant solution or combinations of solutions to your problem. The
remainder of this section outlines a few solutions and how to apply them to SCD types.

It is important to address slowly changing dimensions in the context of the ideal schema for BigQuery. Often you must
sacrifice efficient SCD handling in exchange for optimized query performance, or the opposite.

Technique 1: view switching

This technique is based on two views of the data: "main" vs. "shadow". The trick is to hide the actual
table and expose the "main" view to the users. On update cycles, the "shadow" view is
created/updated and goes through data correctness tests while the users work against the "main"
view. At switchover time, the "main" view is swapped with "shadow." The old "main" and now "shadow"
could be torn down until the next update cycle or kept around for some workflows depending on the
rules and processes defined by the organization.

The two views could be based on a common table and differentiated by a column, for example,
"view_type," or based on distinct tables. The former method is not recommended, because DML
operations against the "shadow" view of the table could slow down user queries against the "main"
view without offering any real benefits.
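
The mechanics of view switching over two distinct tables can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in here so that the example runs locally; it is not BigQuery, and the table names, view names, and helper functions are all hypothetical.

```python
import sqlite3


def point_view(conn, view, table):
    """(Re)define `view` as a pass-through over `table`."""
    conn.execute(f"DROP VIEW IF EXISTS {view}")
    conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {table}")


def switch_views(conn, targets):
    """Swap the tables behind the main and shadow views."""
    targets["main"], targets["shadow"] = targets["shadow"], targets["main"]
    point_view(conn, "sales_main", targets["main"])
    point_view(conn, "sales_shadow", targets["shadow"])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_a (region TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_b (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales_a VALUES ('NA', 100.0)")
targets = {"main": "sales_a", "shadow": "sales_b"}
point_view(conn, "sales_main", targets["main"])
point_view(conn, "sales_shadow", targets["shadow"])

# Update cycle: refresh the shadow table while users keep reading sales_main.
conn.execute("INSERT INTO sales_b VALUES ('NA', 100.0), ('APAC', 250.0)")
before = conn.execute("SELECT COUNT(*) FROM sales_main").fetchone()[0]

# Switchover: after validation, users see the refreshed data through sales_main.
switch_views(conn, targets)
after = conn.execute("SELECT COUNT(*) FROM sales_main").fetchone()[0]
print(before, after)
```

Users only ever query sales_main, so the refresh and its validation stay invisible to them until the switchover.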

While view switching offers zero analysis downtime, it has a higher cost because during the update
cycle, two copies of the data exist. More importantly, if update cycles happen more often than every 90
days, this approach could prevent your organization from taking advantage of long-term storage
pricing (/bigquery/pricing#long-term-storage). Ninety days is based on the pricing policy at the time of this
writing. Be sure to check the latest policy.

Sometimes different segments of data change at their own pace. For instance, sales data in North
America is updated on a daily basis, while data for Asia Pacific is updated on a biweekly basis. In
such situations, it is best to partition the table based on the driving factor for the change, Country in
this example. View switching is then applied to the impacted partitions and not the entire data
warehouse. At the time of this writing, you can only partition based on a custom data attribute, such
as Country, by explicitly splitting the data into multiple tables.

Technique 2: in-place partition loading

When the change in data can be isolated by a partition and brief analysis downtime is tolerated, view
switching might be overkill. Instead, data for the affected partitions can be staged in other BigQuery
tables or exported to files in Cloud Storage, where they can be replaced during analysis downtime.

To replace data in a target partition with data from a query of another table:

bq query --use_legacy_sql=false --replace \
  --destination_table 'flight_data.fact_flights_part$20140910' \
  'select * from `ods.load_flights_20140910`'

To replace data in a target partition by loading from Cloud Storage:

bq load --replace \
  --source_format=NEWLINE_DELIMITED_JSON \
  'flight_data.fact_flights_part$20140910' \
  gs://{bucket}/load_flights_20140910.json

Technique 3: update data masking

A small and frequently changing dimension is a prime candidate for normalization. In this technique,
updates to such a dimension are staged in an isolated table or view that is conditionally joined with
the rest of the data:

SELECT f.order_id as order_id, f.customer_id as customer_id,
  IFNULL(u.customer_first_name, f.customer_first_name) as customer_first_name,
  IFNULL(u.customer_last_name, f.customer_last_name) as customer_last_name
FROM fact_table f
LEFT OUTER JOIN pending_customer_updates u
ON f.customer_id = u.customer_id
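
To see the masking join at work, the following Python sketch runs the same query shape against an in-memory SQLite database. SQLite stands in for BigQuery only so that the example is runnable locally, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_table (
        order_id INTEGER, customer_id INTEGER,
        customer_first_name TEXT, customer_last_name TEXT);
    CREATE TABLE pending_customer_updates (
        customer_id INTEGER,
        customer_first_name TEXT, customer_last_name TEXT);
    INSERT INTO fact_table VALUES
        (1, 10, 'Ann', 'Lee'),
        (2, 11, 'Bob', 'Ray');
    -- Only customer 11 has a pending update.
    INSERT INTO pending_customer_updates VALUES (11, 'Robert', 'Ray');
""")

MASKED_QUERY = """
    SELECT f.order_id, f.customer_id,
           IFNULL(u.customer_first_name, f.customer_first_name),
           IFNULL(u.customer_last_name, f.customer_last_name)
    FROM fact_table f
    LEFT OUTER JOIN pending_customer_updates u
      ON f.customer_id = u.customer_id
    ORDER BY f.order_id
"""

rows = conn.execute(MASKED_QUERY).fetchall()
for row in rows:
    print(row)
```

Rows without a pending update fall back to the values stored in the fact table, because the outer join yields NULL for the update columns and IFNULL picks the second argument.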

SCD Type 1: overwrite attribute value

Type 1 SCD overwrites the value of an attribute with new data without maintaining the history. For
example, if the product "awesome moisturizer cream" was part of the "health and beauty" category
and is now categorized as "cosmetics", the change looks like this:

Before:

| PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY |
| --- | --- | --- | --- |
| 123 | ABC | awesome moisturizer cream - 100 oz | health and beauty |

After:

| PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY |
| --- | --- | --- | --- |
| 123 | ABC | awesome moisturizer cream - 100 oz | cosmetics |

If the attribute is in a normalized dimension table, the change is very isolated. You simply update the
impacted row in the dimension table. For smaller dimension tables with frequent type 1 updates, use
Technique 3: update data masking (#technique_3).

If the attribute is embedded in the fact table in a denormalized fashion, the change is rather
widespread. You will have to update all fact rows where the attribute is repeated. In this case, use
either Technique 2: in-place partition loading (#technique_2), or Technique 1: view switching.
(#technique_1)

SCD Type 2: change attribute value and maintain history

This method tracks unlimited historical data by creating multiple records for a given natural key
(https://wikipedia.org/wiki/Natural_key) with separate surrogate keys
(https://wikipedia.org/wiki/Surrogate_key). For example, the same change that is illustrated in SCD type 1
would be handled as below:

Before:

| PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY | START_DATE | END_DATE |
| --- | --- | --- | --- | --- | --- |
| 123 | ABC | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | NULL |

After:

| PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY | START_DATE | END_DATE |
| --- | --- | --- | --- | --- | --- |
| 123 | ABC | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | 18-JUL-2017 |
| 124 | ABC | awesome moisturizer cream - 100 oz | cosmetics | 19-JUL-2017 | NULL |

If the attribute is in a normalized dimension table, the change is isolated. You simply update the
previous row and add a new one in the dimension table. For smaller dimension tables with frequent
type 1 updates, use Technique 3: update data masking (#technique_3).
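
The type 2 update on a normalized dimension table can be sketched in plain Python: close the current row for the natural key, then append a new row with a new surrogate key. The function name and the in-memory list of dictionaries are assumptions for illustration; in BigQuery the same two steps would be an UPDATE and an INSERT.

```python
from datetime import date, timedelta


def apply_scd2(rows, natural_key, changes, effective, next_sk):
    """Close the open row for `natural_key` and append a new version."""
    current = next(
        r for r in rows
        if r["PRD_ID"] == natural_key and r["END_DATE"] is None
    )
    # The old version stays valid through the day before the change.
    current["END_DATE"] = effective - timedelta(days=1)
    new_row = {**current, **changes,
               "PRD_SK": next_sk, "START_DATE": effective, "END_DATE": None}
    rows.append(new_row)
    return new_row


dimension = [{
    "PRD_SK": 123, "PRD_ID": "ABC",
    "PRD_DESC": "awesome moisturizer cream - 100 oz",
    "PRD_CATEGORY": "health and beauty",
    "START_DATE": date(2009, 1, 31), "END_DATE": None,
}]

apply_scd2(dimension, "ABC", {"PRD_CATEGORY": "cosmetics"},
           effective=date(2017, 7, 19), next_sk=124)
for row in dimension:
    print(row)
```

The example reproduces the before and after states of the product row described above, with the old version closed on 18 July 2017 and the new version open from 19 July 2017.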

If the attribute is embedded in the fact table in a denormalized fashion, the situation can be more
favorable, as long as you don't maintain explicit start and end dates for the value and instead rely on
the transaction dates. Because the previous value remains true for the date and time the previous
transactions occurred, you don't need to change previous fact table rows. The fact table would look
like this:

TRANSACTION_DATE PRD_SK PRD_ID PRD_DESC PRD_CATEGORY UNITS AMOUNT

18-JUL-2017 123 ABC awesome moisturizer cream - 100 oz health and beauty 2 25.16

19-JUL-2017 124 ABC awesome moisturizer cream - 100 oz cosmetics 1 13.50
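
The dimension-table change described in this section (close the previous row, then add a new one under a fresh surrogate key) can be sketched as follows. This is a minimal in-memory illustration, not BigQuery code: the function name `apply_scd2_change` and the "highest key plus one" surrogate-key scheme are assumptions made for the example.

```python
from datetime import date, timedelta


def apply_scd2_change(rows, prd_id, new_category, change_date):
    """SCD Type 2: close the current row for prd_id and append a new version."""
    current = next(
        r for r in rows if r["PRD_ID"] == prd_id and r["END_DATE"] is None
    )
    # The old version stays valid up to the day before the change.
    current["END_DATE"] = change_date - timedelta(days=1)

    new_row = dict(current)
    new_row.update(
        PRD_SK=max(r["PRD_SK"] for r in rows) + 1,  # assumed surrogate-key scheme
        PRD_CATEGORY=new_category,
        START_DATE=change_date,
        END_DATE=None,
    )
    rows.append(new_row)
    return new_row


products = [
    {
        "PRD_SK": 123,
        "PRD_ID": "ABC",
        "PRD_DESC": "awesome moisturizer cream - 100 oz",
        "PRD_CATEGORY": "health and beauty",
        "START_DATE": date(2009, 1, 31),
        "END_DATE": None,
    }
]

new_version = apply_scd2_change(products, "ABC", "cosmetics", date(2017, 7, 19))
```

Running this against the "Before" row reproduces the "After" table: row 123 keeps its original category with an end date of 18 July 2017, and row 124 carries the new category with an open end date.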

Querying data

BigQuery supports standard SQL queries and is compatible with ANSI SQL 2011. BigQuery's SQL
reference (/bigquery/sql-reference) provides a comprehensive description of all functions, operators, and
regex capabilities that are supported.

Prior to supporting standard SQL, BigQuery supported an alternate SQL version that is now referred to as Legacy SQL. We
recommend using the updated SQL standard in your queries. For more information, see Enabling standard SQL
(/bigquery/sql-reference/enabling-standard-sql).

Because BigQuery supports nested and repeated fields as part of the data model, its SQL support has
been extended to specifically support these field types. For example, using the GitHub public dataset
(/bigquery/public-data/github), you could use the UNNEST (/bigquery/sql-reference/query-syntax#unnest)
operator, which lets you iterate over a repeated field:

SELECT
  name, count(1) as num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos
DESC limit 10
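
To make the semantics of the query above concrete, the following plain-Python sketch mimics what UNNEST does to a repeated field. The sample rows are hypothetical and only shaped like the public table (one row per repository, with `language` as a repeated field of records); nothing here calls BigQuery.

```python
from collections import Counter

# Hypothetical rows: one per repository, `language` is a repeated field.
rows = [
    {"repo_name": "a/one", "language": [{"name": "Python"}, {"name": "C"}]},
    {"repo_name": "b/two", "language": [{"name": "Python"}]},
    {"repo_name": "c/three", "language": []},
]

# UNNEST(language): emit one flattened record per element of the repeated field.
flattened = [lang for row in rows for lang in row["language"]]

# GROUP BY name, ORDER BY num_repos DESC, LIMIT 10
num_repos = Counter(lang["name"] for lang in flattened)
top_languages = num_repos.most_common(10)
```

A repository with an empty `language` array contributes no flattened records, which matches the behavior of the comma (cross) join with UNNEST in the query.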

Interactive queries

The BigQuery web UI allows interactive querying of datasets and provides a consolidated view of
datasets across projects that you have access to. The console also provides several useful features
such as saving and sharing ad-hoc queries, tuning and editing historical queries, exploring tables and
schemas, and gathering table metadata. Refer to the BigQuery web UI (/bigquery/bigquery-web-ui) for
more details.

Automated queries

It is a common practice to automate execution of queries based on a schedule/event and cache the
results for later consumption.

If you are using Airflow to orchestrate other automated activities and are already familiar with the tool,
use the Apache Airflow API for BigQuery
(http://airflow.incubator.apache.org/integration.html?highlight=bigquery#gcp) for this purpose. This blog post
(/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow) walks you through the
process of installing Airflow and creating a workflow against BigQuery.

For simpler orchestrations, you can rely on cron jobs. This blog post
(/blog/big-data/2017/04/how-to-build-a-bi-dashboard-using-google-data-studio-and-bigquery) shows you how to
encapsulate a query as an App Engine app and run it as a scheduled cron job.

Query optimization

Each time BigQuery executes a query, it executes a full-column scan. BigQuery doesn't use or support
indexes. Because BigQuery performance and query costs are based on the amount of data scanned
during a query, design your queries so that they reference only the columns that are relevant to the
query. When using date-partitioned tables, ensure that only the relevant partitions are scanned. You can
achieve this by using partition filters based on _PARTITIONTIME or _PARTITIONDATE
(/bigquery/docs/querying-partitioned-tables).
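
As a rough illustration of both recommendations, the helper below builds a standard SQL query that names its columns explicitly and bounds the `_PARTITIONTIME` pseudo column. The function name `partitioned_query` and the table name are hypothetical, and the values are interpolated directly into the string, so this sketch assumes trusted inputs.

```python
from datetime import date


def partitioned_query(table, columns, start, end):
    """Build a query that lists its columns and filters on _PARTITIONTIME."""
    if not columns:
        raise ValueError("list the columns you need instead of selecting everything")
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM `{table}`\n"
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start.isoformat()}')\n"
        f"  AND _PARTITIONTIME < TIMESTAMP('{end.isoformat()}')"
    )


sql = partitioned_query(
    "my-project.sales.transactions",  # hypothetical date-partitioned table
    ["PRD_ID", "AMOUNT"],
    date(2017, 7, 18),
    date(2017, 7, 20),
)
```

The half-open date range (inclusive start, exclusive end) keeps adjacent ranges from overlapping when a report is run for consecutive periods.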

To understand the performance characteristics after a query executes, take a look at the detailed
query plan explanation (/bigquery/query-plan-explanation). The explanation breaks down the stages that
the query went through, the number of input/output rows handled at each stage, and the timing profile
within each stage. Using the results from the explanation can help you understand and optimize your
queries.

External sources

You can run queries on data that exists outside of BigQuery by using federated data sources
(/bigquery/federated-data-sources), but this approach has performance implications. Use federated data
sources only if the data must be maintained externally. You can also use query federation to perform
ETL from an external source to BigQuery. This approach allows you to define ETL using familiar SQL
syntax.

User-defined functions

BigQuery also supports user-defined functions (/bigquery/user-defined-functions) (UDFs) for queries that
exceed the complexity of SQL. UDFs allow you to extend the built-in SQL functions; they take a list of
values, which can be arrays or structs, and return a single value, which can also be an array or struct.
UDFs are written in JavaScript and can include external resources, such as encryption or other
libraries.

Query sharing

BigQuery allows collaborators to save and share queries between team members. This feature can be
especially useful in data exploration exercises or as a means of coming up to speed on a new dataset
or query pattern. For more information, see Saving and sharing queries
(/bigquery/docs/saving-sharing-queries).

Analyzing data

This section presents various ways that you can connect to BigQuery and analyze the data. To take
full advantage of BigQuery as an analytical engine, you should store the data in BigQuery storage.
However, your specific use case might benefit from analyzing external sources either by themselves
or JOINed with data in BigQuery storage.

Off-the-shelf tools

Google Data Studio (/data-studio), available in beta at the time of this writing, as well as many partner
tools (/bigquery/partners) that are already integrated with BigQuery, can be used to draw analytics from
BigQuery and build sophisticated interactive data visualizations.

If you find yourself in a situation where you have to choose a tool, you can find comprehensive vendor
comparisons in Gartner's magic quadrant report
(https://www.gartner.com/doc/3611117/magic-quadrant-business-intelligence-analytics) and the G2 score report
(https://www.g2crowd.com/grid_report/documents/grid-for-business-intelligence-platforms-winter-2017) by G2
Crowd. Gartner's report can be obtained from many of our partner sites, such as Tableau
(https://www.tableau.com/asset/2017-gartner-magic-quadrant).

Custom development

To build custom applications and platforms on top of BigQuery, you can use client libraries
(/bigquery/client-libraries), which are available for most common programming languages, or you can
use BigQuery's REST API (/bigquery/docs/reference/rest/v2) directly.

For a concrete example, refer to this tutorial (/solutions/bokeh-and-bigquery-dashboards), which uses
Python libraries to connect to BigQuery and generate custom interactive dashboards.

All the methods for connecting to BigQuery essentially provide a wrapper around BigQuery's REST API. All connections to
the BigQuery API are encrypted by using HTTPS, and enforce permissions by using IAM policies (/bigquery/docs/access-contro

Third-party connectors

To connect to BigQuery from an application that isn't natively integrated with BigQuery at the API
level, you can use the BigQuery JDBC and ODBC drivers (/bigquery/partners/simba-beta-drivers). The
drivers provide a bridge to interact with BigQuery for legacy applications or applications that cannot
be easily modified, such as Microsoft Excel
(/blog/big-data/2016/11/how-to-connect-bigquery-to-microsoft-excel-and-other-apps-with-our-new-odbc-driver).
Although ODBC and JDBC support interacting with BigQuery using SQL, the drivers aren't as
expressive as dealing with the API directly.

Costs

Most data warehouses serve multiple business entities within the organization. A common challenge
is to analyze cost of operation per business entity. For guidance on slicing your bill and attributing
cost to consumption, see Visualize Google Cloud billing using BigQuery and Data Studio
(https://medium.com/google-cloud/visualize-gcp-billing-using-bigquery-and-data-studio-d3e695f90c08).

There are three primary cost dimensions for BigQuery: loading, storage, and query costs. This section
discusses each dimension in detail.

Storing data

Storage pricing is prorated per MB, per second.

If a table hasn't been edited for 90 consecutive days, it is categorized as long-term storage and the
price of storage for that table automatically drops by 50 percent to $0.01 per GB per month. There is
no degradation of performance, durability, availability, or any other functionality when a table is
considered long-term storage. When the data in a table is modified, BigQuery resets the timer on the
table, and any data in the table returns to the normal storage price. Actions that don't directly
manipulate the data, such as querying and creating views, don't reset the timer.
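
The long-term storage rule can be expressed as a small calculation. This sketch uses only the figures quoted in this article (a 50 percent drop to $0.01 per GB per month after 90 days without edits); the function name `monthly_storage_cost` is hypothetical, proration is ignored, and current prices should be checked on the pricing page.

```python
LONG_TERM_AFTER_DAYS = 90
LONG_TERM_PRICE = 0.01              # USD per GB per month, as quoted in the text
ACTIVE_PRICE = LONG_TERM_PRICE * 2  # implied by the 50 percent drop


def monthly_storage_cost(size_gb, days_since_last_edit):
    """Estimate one month of storage cost for a single table."""
    long_term = days_since_last_edit >= LONG_TERM_AFTER_DAYS
    price = LONG_TERM_PRICE if long_term else ACTIVE_PRICE
    return size_gb * price


recently_edited = monthly_storage_cost(1000, days_since_last_edit=10)
untouched = monthly_storage_cost(1000, days_since_last_edit=90)
```

Because modifying the data resets the timer, `days_since_last_edit` goes back to zero after any edit to the table, while queries and view creation leave it unchanged.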

For more details, see BigQuery storage pricing (/bigquery/pricing#storage).

Loading data

You can load data into BigQuery by using a conventional load job, at no charge. After data is loaded,
you pay for the storage as discussed above.

Streaming inserts are charged based on the amount of data that is being streamed. For details, see
costs of streaming inserts listed under BigQuery storage pricing (/bigquery/pricing#storage).

Querying data

For queries, BigQuery offers two pricing models: on-demand and flat-rate.

In a multi-project situation where data is hosted in one project and made available for queries to users of other projects, the
cost of storage and streaming is incurred in the hosting project, but the cost of queries is incurred in the project where the queries
are issued from.

On-demand pricing

In the on-demand model, BigQuery charges for the amount of data accessed during query execution.
Because BigQuery uses a columnar storage format, only the columns relevant to your query are
accessed. If you only run reports on a weekly or monthly basis, and you've performed queries on less
than 1 TB of your data, you might find the cost of queries on your bill is very low. For more details on
how queries are charged, see BigQuery query pricing (/bigquery/pricing#queries).

To determine beforehand how much data a given query is going to scan, you can use the query
validator in the web UI. In the case of custom development, you can set the dryRun flag in the API
request so that BigQuery doesn't run the job and instead returns statistics about it, such as how
many bytes would be processed. Refer to the query API (/bigquery/docs/reference/rest/v2/jobs/query) for
more details.
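
A dry run can be sketched as two small helpers: one that builds the request body and one that turns the reported byte count into a cost estimate. This is a sketch under stated assumptions rather than a client implementation: it does not send the request, the helper names and the sample response are hypothetical, the price per TB is a placeholder you must supply from the pricing page, and 1 TB is assumed to mean 2^40 bytes.

```python
import json


def dry_run_request_body(sql):
    """Body for a jobs.query call that sizes the query without running it."""
    return {"query": sql, "useLegacySql": False, "dryRun": True}


def estimated_cost_usd(response, usd_per_tb):
    """Turn totalBytesProcessed from a dry-run response into a cost estimate."""
    # The REST API encodes int64 values as JSON strings.
    bytes_processed = int(response["totalBytesProcessed"])
    return bytes_processed / 2**40 * usd_per_tb  # assumes 1 TB = 2**40 bytes


body = json.dumps(dry_run_request_body("SELECT name FROM `my-project.ds.t`"))

# Hypothetical response fragment; a real one comes back from the API.
sample_response = {"totalBytesProcessed": "549755813888"}
estimate = estimated_cost_usd(sample_response, usd_per_tb=5.0)  # placeholder price
```

Keeping the price as a parameter avoids hard-coding a figure that changes over time and lets the same helper be reused across billing regions or pricing updates.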

Flat-rate pricing

Customers who prefer more consistency of monthly expenses can choose to enable flat-rate pricing.
To learn more, see BigQuery flat-rate pricing (/bigquery/pricing#flat_rate_pricing).

What's next?

BigQuery how-tos (/bigquery/docs/how-to)

BigQuery public datasets (/bigquery/public-data)

BigQuery on the Google Cloud podcast (https://www.gcppodcast.com/categories/bigquery)

BigQuery on Stack Overflow (http://stackoverflow.com/questions/tagged/google-bigquery)

r/bigquery on Reddit (https://www.reddit.com/r/bigquery)

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License
(https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License
(https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies
(https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.
