Google BigQuery is a fully managed data warehouse tool. It allows scalable analysis over petabytes of data, querying using ANSI SQL, integration with various applications, and more. To access all these features conveniently, you need to understand BigQuery architecture, maintenance, pricing, and security.
This guide decodes the most important components of Google BigQuery: BigQuery Architecture, Maintenance, Performance, Pricing, and Security.
What Is Google BigQuery?
Google BigQuery is a Cloud Data Warehouse run by Google. It is capable of analysing terabytes of data in seconds. If you know how to write SQL queries, you already know how to query it. In fact, there are plenty of interesting public datasets shared in BigQuery, ready to be queried by you.
You can access BigQuery by using the GCP console or the classic web UI, by using a command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python. There are also a variety of third-party tools that you can use to interact with BigQuery, such as for visualising or loading the data.
Key Features of Google BigQuery
Why did Google release BigQuery and why would you use it instead of a more established data warehouse solution?
- Ease of Implementation: Building your own data warehouse is expensive, time-consuming, and difficult to scale. With BigQuery, you simply load your data and pay only for what you use.
- Speed: It processes billions of rows in seconds and handles real-time analysis of streaming data.
Google BigQuery Architecture
BigQuery Architecture is based on Dremel technology. Dremel is a query engine that has been used internally at Google for about 10 years.
- Dremel: BigQuery Architecture dynamically apportions slots to queries on an as-needed basis, maintaining fairness amongst multiple users who are all querying at once. A single user can get thousands of slots to run their queries. It takes more than just a lot of hardware to make your queries run fast. BigQuery requests are powered by the Dremel query engine.
- Colossus: BigQuery Architecture relies on Colossus, Google’s latest generation distributed file system. Each Google data centre has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash) and distributed management.
- Jupiter Network: It is the internal data centre network that allows BigQuery to separate storage and compute.
- Columnar storage.
- Nested/Repeated fields.
- No indexes: queries run as full table scans.
- The query is implemented in Tree Architecture.
- The query is executed using tens of thousands of machines over a fast Google Network.
BigQuery’s Columnar Database
Google BigQuery Architecture uses column-based storage, or a columnar storage structure, that helps it achieve faster query processing with fewer resources. It is the main reason Google BigQuery can handle large quantities of data and deliver excellent speed.
A row-based storage structure is used in Relational Databases, where data is stored in rows, because that is an efficient layout for transactional Databases. Storing data in columns is efficient for analytical purposes, which need fast reading of a few columns across many rows.
Suppose a table has 1,000 columns but a query needs only 10 of them. If we store data in a row-based structure, the query will take more time because every row must be read in full, touching all 1,000 columns, to produce the output. But this is not the case in Google BigQuery’s Columnar Database, where all the data is stored in columns instead of rows.
The columnar database will process only the 10 columns of interest to the query, which in turn makes the overall query processing faster.
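The effect of columnar storage can be illustrated with a small back-of-the-envelope calculation (a toy model of the idea, not BigQuery internals):

```python
# Toy illustration (not BigQuery internals): compare how many values a
# row-oriented store vs. a column-oriented store must read to answer a
# query that touches only a few columns.

NUM_ROWS = 1_000          # rows in the table
NUM_COLUMNS = 1_000       # columns in the table
COLUMNS_IN_QUERY = 10     # columns the query actually references

# Row-based layout: every row is read in full, so all columns are touched.
values_read_row_store = NUM_ROWS * NUM_COLUMNS

# Columnar layout: only the referenced columns are read, for every row.
values_read_column_store = NUM_ROWS * COLUMNS_IN_QUERY

print(f"Row store reads    {values_read_row_store:>9,} values")
print(f"Column store reads {values_read_column_store:>9,} values")
print(f"Reduction factor:  {values_read_row_store // values_read_column_store}x")
```

For this query shape, the columnar layout reads 100x less data; the narrower the query relative to the table, the bigger the win.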
The Google Ecosystem
Google BigQuery is a Cloud Data Warehouse that is part of Google Cloud Platform (GCP), which means it can easily integrate with other Google products and services. Google Cloud Platform is a suite of Google services for storing and processing data, such as Google Cloud Storage, Google Bigtable, Google Drive, databases, and other data processing tools.
Google BigQuery can process data stored in these other Google products. Google BigQuery uses standard SQL queries to create and execute Machine Learning models, and it integrates with Business Intelligence tools like Looker and Tableau.
Google BigQuery Comparison with Other Database and Data Warehouses
Here, you will be looking at how Google BigQuery is different from other Databases and Data Warehouses:
1) Comparison with MapReduce and NoSQL
MapReduce vs. Google BigQuery
NoSQL Datastore vs. Google BigQuery
2) Comparison with Redshift and Snowflake
| | Amazon Redshift | Google BigQuery | Snowflake |
|---|---|---|---|
| Description | Large scale data warehouse service for use with business intelligence tools | Large scale data warehouse service with append-only tables | Cloud-based data warehousing service for structured and semi-structured data |
| Primary database model | Relational DBMS | Relational DBMS | Relational DBMS |
| APIs and other access methods | JDBC | RESTful HTTP/JSON API | CLI Client |
| User concepts | Fine-grained access rights according to SQL-standard | Access privileges (owner, writer, reader) for whole datasets, not for individual tables | Users with fine-grained authorization concepts, user roles and pluggable authentication |
Some Important Considerations about these Comparisons:
- If you have a reasonable volume of data, say, dozens of terabytes, that you rarely query, and it’s acceptable to have query response times of up to a few minutes, then Google BigQuery is an excellent candidate for your scenario.
- If you need to analyze a large amount of data (e.g., up to a few terabytes) by running many queries, each of which should be answered very quickly, and you don’t need to keep the data available once the analysis is done, then an on-demand cloud solution like Amazon Redshift is a great fit.
But keep in mind that, unlike Google BigQuery, Redshift does need to be configured and tuned in order to perform well.
- BigQuery Architecture is a good fit if the speed of data updates does not matter to you. Compared to Redshift, Google BigQuery only supports hourly syncs as its fastest update frequency. This made us choose Redshift, as we needed a solution with support for close to real-time data integration.
Key Concepts of Google BigQuery
Now, you will get to know about the key concepts associated with Google BigQuery:
1) Google BigQuery Working
BigQuery is a data warehouse, implying a degree of centralization, and a typical query is applied to a single dataset.
However, the benefits of BigQuery become even more apparent when we do joins of datasets from completely different sources or when we query against data that is stored outside BigQuery.
If you’re a power user of Sheets, you’ll probably appreciate the ability to do more fine-grained research with data in your spreadsheets. It’s a sensible enhancement for Google to make, as it unites BigQuery with more of Google’s own existing services. Previously, Google made it possible to analyse Google Analytics data in BigQuery.
These sorts of integrations could make BigQuery Architecture a better choice in the market for cloud-based data warehouses, which is increasingly how Google has positioned BigQuery. Public cloud market leader Amazon Web Services (AWS) has Redshift, but no widely used tool for spreadsheets.
Microsoft Azure’s SQL Data Warehouse, which has been in preview for several months, does not currently have an official integration with Microsoft Excel, surprising though it may be.
2) Google BigQuery Querying
Google BigQuery Architecture supports standard SQL queries and is compatible with ANSI SQL 2011. BigQuery SQL support has been extended to cover the nested and repeated field types that are part of the data model.
For example, you can query the GitHub public dataset and use the UNNEST operator, which lets you iterate over a repeated field:

```sql
SELECT
  name,
  COUNT(1) AS num_repos
FROM
  `bigquery-public-data.github_repos.languages`,
  UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10
```
A) Interactive Queries
Google BigQuery Architecture supports interactive querying of datasets and provides a consolidated view of the datasets across projects that you can access. The console provides features such as saving and sharing ad-hoc queries and exploring tables and schemas.
B) Automated Queries
You can automate the execution of your queries based on an event and cache the results for later use. You can use the Airflow API to orchestrate automated activities; for simple orchestrations, you can use cron jobs. To encapsulate a query as an App Engine app and run it as a scheduled cron job, you can refer to this blog.
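A minimal sketch of such an automated query in Python (the table and column names are hypothetical; the actual job submission, which requires GCP credentials, is shown in comments):

```python
import datetime

# Hypothetical helper: build the SQL for a daily aggregation job. A cron
# entry or an Airflow task would call run_daily_report() on a schedule and
# submit the query with the google-cloud-bigquery client.
TABLE = "my_project.analytics.events"  # hypothetical table name

def build_daily_query(day: datetime.date) -> str:
    """Return the SQL for one day's report, filtering on a date column."""
    return (
        "SELECT event_name, COUNT(*) AS num_events "
        f"FROM `{TABLE}` "
        f"WHERE event_date = '{day.isoformat()}' "
        "GROUP BY event_name ORDER BY num_events DESC"
    )

def run_daily_report(day: datetime.date) -> None:
    sql = build_daily_query(day)
    # In a real deployment (requires credentials):
    #   from google.cloud import bigquery
    #   client = bigquery.Client()
    #   rows = client.query(sql).result()  # blocks until the job finishes
    print(sql)

run_daily_report(datetime.date(2021, 1, 1))
```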
C) Query Optimization
Each time Google BigQuery executes a query, it performs a full scan of the columns referenced; it does not support indexes. Since the performance and cost of a query depend on the amount of data scanned, design your queries to reference only the columns that are strictly relevant.
When you are using date-partitioned tables, make sure that only the relevant partitions are scanned.
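As a rough sketch of why this matters, the snippet below pairs a partition-pruned, column-limited query with a cost estimate based on bytes scanned (the table, column, and project names are hypothetical; the dry-run call is shown in comments because it requires GCP credentials):

```python
# Sketch: estimate what a query will scan before running it. The SQL below
# selects only the needed columns and filters on the partition column, so
# BigQuery can prune partitions.
SQL = """
SELECT user_id, event_name
FROM `my_project.analytics.events`
WHERE _PARTITIONDATE = '2021-01-01'   -- prune to a single partition
"""

ON_DEMAND_PRICE_PER_TB = 5.0  # USD, on-demand query pricing

def estimated_query_cost(bytes_processed: int) -> float:
    """On-demand cost in USD for a given number of bytes scanned."""
    return bytes_processed / 10**12 * ON_DEMAND_PRICE_PER_TB

# With credentials, a dry run reports bytes without executing the query:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   job = client.query(SQL, job_config=bigquery.QueryJobConfig(dry_run=True))
#   print(estimated_query_cost(job.total_bytes_processed))

print(estimated_query_cost(250 * 10**9))  # 250 GB scanned -> 1.25
```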
You can also refer to the detailed blog here that can help you to understand the performance characteristics after a query executes.
D) External sources
With federated data sources, you can run queries on data that exists outside of Google BigQuery. But this method has performance implications. You can also use query federation to perform ETL from an external source into Google BigQuery.
E) User-defined functions
Google BigQuery also supports user-defined functions (UDFs), which let you express query logic in SQL or JavaScript and reuse it across queries.
F) Query sharing
Collaborators can save queries and share them between team members. This makes data exploration exercises and getting up to the desired speed on a new dataset or query pattern much easier.
3) Google BigQuery ETL/Data Load
There are various approaches to loading data into BigQuery. If you are moving data from Google applications – like Google Analytics, Google AdWords, etc. – Google provides a robust BigQuery Data Transfer Service. This is Google’s own intra-product data migration tool.
Data load from other data sources – databases, cloud applications and more can be accomplished by deploying engineering resources to write custom scripts.
The broad steps would be to extract data from the data source, transform it into a format that BigQuery accepts, upload this data to Google Cloud Storage (GCS) and finally load this to Google BigQuery from GCS.
A word of caution though – custom coding scripts to move data to Google BigQuery is both a complex and cumbersome process. A third party data pipeline platform such as Hevo can make this a hassle-free process for you.
Hevo Data helps you directly transfer data from 100+ other data sources (including 30+ free sources) to Business Intelligence tools, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
Hevo takes care of all your data preprocessing needs required to set up the integration and lets you focus on key business activities and draw much more powerful insights on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.
Check out what makes Hevo amazing:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
4) Google BigQuery Pricing Model
| | Pricing | Notes |
|---|---|---|
| Storage Cost | $0.020 per GB per month | |
| Query Cost | $5 per TB | 1st TB per month is free |
A) Google BigQuery Storage Cost
- Active – Monthly charge for stored data modified within 90 days.
- Long-term – Monthly charge for stored data that has not been modified for 90 consecutive days. This rate is lower than the active storage rate.
B) Google BigQuery Query Cost
- On-demand – Based on data usage.
- Flat rate – Fixed monthly cost, ideal for enterprise users.
Free usage is available for the below operations:
- Loading data (network pricing applies for inter-region loads).
- Copying data.
- Exporting data.
- Deleting datasets.
- Metadata operations.
- Deleting tables, views, and partitions.
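Putting the rates above together, a rough monthly bill can be sketched as follows (a simplified estimate assuming on-demand pricing, active storage rates, and only the 1 TB free query tier):

```python
# Back-of-the-envelope monthly bill using the on-demand rates above.
STORAGE_PER_GB_MONTH = 0.020   # USD
QUERY_PER_TB = 5.0             # USD
FREE_QUERY_TB = 1.0            # first TB of queries per month is free

def monthly_cost(storage_gb: float, query_tb: float) -> float:
    """Estimated monthly bill in USD for storage plus on-demand queries."""
    storage = storage_gb * STORAGE_PER_GB_MONTH
    queries = max(0.0, query_tb - FREE_QUERY_TB) * QUERY_PER_TB
    return storage + queries

# 2 TB stored, 5 TB queried in a month: 2000*0.02 + 4*5 = 60.0
print(monthly_cost(2_000, 5))
```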
5) Google BigQuery Maintenance
Google has managed to solve a lot of common data warehouse concerns by throwing orders of magnitude more hardware at the existing problems, eliminating many of them altogether. Unlike Amazon Redshift, Google BigQuery has no VACUUM operation to run.
Google BigQuery is specifically architected without the need for the resource-intensive VACUUM maintenance that is recommended for Redshift. BigQuery pricing is also structured very differently from Redshift’s.
Keep in mind that, by design, Google BigQuery is append-only: to update or delete data, you’ll need to truncate the entire table and recreate it with new data.
However, Google has implemented ways in which users can reduce the amount of data processed.
Users can partition their tables by specifying the partition date in their queries, and use wildcard tables to shard their data by an attribute.
6) Google BigQuery Security
The fastest hardware and most advanced software are of little use if you can’t trust them with your data. BigQuery’s security model is tightly integrated with the rest of Google’s Cloud Platform, so it is possible to take a holistic view of your data security.
BigQuery uses Google’s Identity and Access Management (IAM) access control system to assign specific permissions to individual users or groups of users. BigQuery also ties in tightly with Google’s Virtual Private Cloud (VPC) policy controls, which can protect against users who try to access data from outside your organization, or who try to export it to third parties.
Both IAM and VPC controls are designed to work across Google cloud products, so you don’t have to worry that certain products create a security hole.
BigQuery is available in every region where Google Cloud has a presence, enabling you to process the data in the location of your choosing. At the time of writing, Google Cloud has more than two dozen data centres around the world, and new ones are being opened at a fast rate.
If you have business reasons for keeping data in the US, it is possible to do so. Just create your dataset with the US region code, and all of your queries against the data will be done within that region.
Know more about Google BigQuery security from here.
7) Google BigQuery Features
Some features of Google BigQuery Data Warehouse are listed below:
- Just upload your data and run SQL.
- No cluster deployment, no virtual machines, no setting keys or indexes, and no software.
- Separate storage and computing.
- No need to deploy multiple clusters and duplicate data into each one. Manage permissions on projects and datasets with access control lists. Seamlessly scales with usage.
- Compute scales with usage, without cluster resizing.
- Thousands of cores are used per query.
- Deployed across multiple data centres by default, with multiple replicas to maximize data durability and service uptime.
- Stream millions of rows per second for real-time analysis.
- Analyse terabytes of data in seconds.
- Storage scales to Petabytes.
Know more about Google BigQuery features from here.
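The streaming capability listed above can be sketched with the client library’s `insert_rows_json` call (the table name and row fields are hypothetical; the API call is commented out because it requires GCP credentials):

```python
# Sketch of streaming rows into a table with the google-cloud-bigquery
# client's insert_rows_json API.
rows_to_insert = [
    {"user_id": "u1", "event_name": "page_view", "ts": "2021-01-01T00:00:00Z"},
    {"user_id": "u2", "event_name": "click", "ts": "2021-01-01T00:00:01Z"},
]

# With credentials configured:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   errors = client.insert_rows_json("my_project.analytics.events", rows_to_insert)
#   if errors:
#       print(f"Insert failed: {errors}")  # non-empty list means row errors

print(f"Prepared {len(rows_to_insert)} rows for streaming insert")
```

Streamed rows become available for querying within seconds, which is what enables the real-time analysis described above.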
8) Google BigQuery Interaction
A) Web User Interface
- Run queries and examine results.
- Manage databases and tables.
- Save queries and share them across the organization for re-use.
- Detailed Query history.
B) Visualize with Data Studio
- View BigQuery results with charts, pivots and dashboards.
C) REST API
- A programmatic way to access Google BigQuery.
D) Service Limits for Google BigQuery
- The concurrent rate limit for on-demand, interactive queries: 50.
- Daily query size limit: Unlimited by default.
- Daily destination table update limit: 1,000 updates per table per day.
- Query execution time limit: 6 hours.
- A maximum number of tables referenced per query: 1,000.
- Maximum unresolved query length: 256 KB.
- Maximum resolved query length: 12 MB.
- The concurrent rate limit for on-demand, interactive queries against Cloud Bigtable external data sources: 4.
E) Integrating with Tensorflow
BigQuery has a new feature, BigQuery ML, that lets you create and use simple Machine Learning (ML) models as well as deep learning predictions with TensorFlow models. This is the key technology for integrating the scalable data warehouse with the power of ML.
The solution enables a variety of smart data analytics, such as logistic regression on a large dataset, similarity search and recommendation on images, documents, products or users, by processing feature vectors of the contents. Or you can even run TensorFlow model prediction inside BigQuery.
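As a sketch of what this looks like in practice, BigQuery ML models are created and queried with plain SQL statements; the dataset, table, and column names below are hypothetical:

```python
# Sketch: BigQuery ML lets you train a model with plain SQL. The statement
# below trains a logistic regression classifier; submit it like any other
# query via client.query().
CREATE_MODEL_SQL = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  plan_type,
  monthly_usage,
  churned AS label        -- BigQuery ML treats the `label` column as target
FROM `my_project.analytics.customers`
"""

# Once trained, predictions are also plain SQL via ML.PREDICT:
PREDICT_SQL = """
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                (SELECT plan_type, monthly_usage
                 FROM `my_project.analytics.customers`))
"""

print(CREATE_MODEL_SQL.strip().splitlines()[0])
```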
Now, imagine what would happen if you could use BigQuery for deep learning as well. After your data scientists have trained a cutting-edge neural network model with TensorFlow or Google Cloud Machine Learning, you can move the model to BigQuery and execute predictions with it inside BigQuery.
This means any employee in your company can use the power of BigQuery for their daily data analytics tasks, including image analytics and business data analytics on terabytes of data, processed in tens of seconds, solely on BigQuery and without any engineering knowledge.
Google BigQuery Performance
Google BigQuery rose from Dremel, Google’s distributed query engine. Dremel held the capability to handle terabytes of data in seconds flat by leveraging distributed computing within a serverless BigQuery Architecture.
This BigQuery architecture allows it to process complex queries with the help of multiple servers working in parallel, significantly improving processing speed. In the following sections, you will take a look at the critical components of Google BigQuery performance:
BigQuery Architecture and Dremel can scale to thousands of machines by structuring computations as an execution tree. A root server receives an incoming query and relays it to branches, also known as mixers, which modify incoming queries and deliver them to leaf nodes, also known as slots.
Working in parallel, the leaf nodes handle the nitty-gritty of filtering and reading the data. The results are then moved back down the tree where the mixers accumulate the results and send them to the root as the answer to the query.
In most Data Warehouse environments, organizations have to specify and commit to the server hardware on which computations are run. Administrators have to provision for performance, elasticity, security, and reliability.
A serverless model can come in handy in solving this constraint. In a serverless model, processing can automatically be distributed over a large number of machines working simultaneously.
By leveraging Google BigQuery’s serverless model, database administrators and data engineers can focus less on infrastructure and server provisioning, and more on extracting actionable insights from data.
SQL and Programming Language Support
Users can access BigQuery Architecture through standard SQL, which many users are quite familiar with. Google BigQuery also has client libraries for writing applications that access data in Python, Java, Go, C#, PHP, Ruby, and Node.js.
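A minimal sketch of querying BigQuery from Python with the official google-cloud-bigquery client library (running it requires GCP credentials, so the query call is wrapped in a function rather than executed here):

```python
# Query a BigQuery public dataset with the google-cloud-bigquery client.
# Requires `pip install google-cloud-bigquery` and default credentials.
PUBLIC_DATASET_SQL = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 5
"""

def top_names():
    from google.cloud import bigquery
    client = bigquery.Client()                        # picks up default credentials
    rows = client.query(PUBLIC_DATASET_SQL).result()  # waits for the job to finish
    return [(row.name, row.total) for row in rows]

print(PUBLIC_DATASET_SQL.strip().splitlines()[0])
```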
Google BigQuery can also run and process reports on real-time data by using other GCP resources and services. Data Warehouses provide support for analytics after data from multiple sources is accumulated and stored, which often happens in batches throughout the day.
Apart from Batch Processing, Google BigQuery Architecture also supports streaming at a rate of millions of rows of data every second.
9) Use Cases of Google BigQuery
You can use Google BigQuery Data Warehouse in the following cases:
- Use it when you have queries that take longer than five seconds to run in a relational database. The idea of BigQuery is to run complex analytical queries, which means there is no point in running queries that do simple aggregation or filtering. BigQuery is suitable for “heavy” queries, those that operate over a big set of data.
The bigger the dataset, the more you’re likely to gain performance by using BigQuery. The dataset that I used was only 330 MB (megabytes, not even gigabytes).
- BigQuery is good for scenarios where data does not change often, as it has a built-in cache. What does this mean? If you run the same query and the data in the tables has not changed (been updated), BigQuery will simply use the cached results and will not try to execute the query again. Also, BigQuery does not charge for cached queries.
- You can also use BigQuery when you want to reduce the load on your relational database. Analytical queries are “heavy” and overusing them under a relational database can lead to performance issues. So, you could eventually be forced to think about scaling your server.
However, with BigQuery you can move these running queries to a third-party service, so they would not affect your main relational database.
BigQuery is a sophisticated, mature service that has been around for many years. It is feature-rich, economical and fast. BigQuery’s integration with Google Drive and the free Data Studio visualization toolset is very useful for the comprehension and analysis of Big Data, and it can process several terabytes of data within a few seconds. The service is deployed across existing and future Google Cloud Platform (GCP) regions. Serverless is certainly the next best option to obtain maximized query performance with minimal infrastructure cost.
If you want to integrate your data from various sources and load it in Google BigQuery, then try Hevo.
Businesses can use automated platforms like Hevo Data to set the integration and handle the ETL process. It helps you directly transfer data from various Data Sources to a Data Warehouse, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you with a hassle-free experience.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
So, what are your thoughts on Google BigQuery? Let us know in the comments!