Serverless computing is one of the fastest-growing cloud services (Lambda, Cloud Functions, Azure Functions) as businesses continue to push provisioning and infrastructure management upstream. Serverless data warehousing has been around longer than serverless compute, but has received far less attention. This article is about BigQuery, one of the most established serverless data warehouses.
BigQuery is an SQL-based Data Warehouse as a Service (DWaaS) that requires no upfront hardware provisioning or management. A single query can cover petabyte-sized datasets and utilise thousands of CPUs distributed across hundreds of machines. When you aren’t running queries, you pay only a small amount for stored data ($0.02 per GB per month, or less).
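As a rough illustration of the storage price quoted above, here is a minimal sketch of the arithmetic. It assumes the flat $0.02 per GB per month figure and decimal units (1 TB = 1,000 GB); the function name and the example sizes are illustrative only, and actual bills depend on current pricing.

```python
# Storage price quoted in the article (USD per GB per month).
STORAGE_PRICE_PER_GB_MONTH = 0.02


def monthly_storage_cost(size_gb: float,
                         price_per_gb: float = STORAGE_PRICE_PER_GB_MONTH) -> float:
    """Return the monthly storage cost in USD for size_gb gigabytes."""
    if size_gb < 0:
        raise ValueError("size_gb must not be negative")
    return size_gb * price_per_gb


# Illustrative sizes only, assuming 1 TB = 1,000 GB.
for label, size_gb in [("100 GB", 100), ("1 TB", 1_000), ("7 TB", 7_000)]:
    print(f"{label}: ${monthly_storage_cost(size_gb):,.2f} per month")
```

Under these assumptions, even a multi-terabyte table costs on the order of a hundred dollars or so per month to keep at rest.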
FaaS (functions-as-a-service), the compute-side equivalent, has a lot of parallels with BigQuery, e.g. on-demand execution of functions without upfront provisioning of resources (virtual machines, RAM, CPU, etc.). Incidentally, AWS explicitly calls Athena, its BigQuery competitor, “serverless”. Much like BigQuery, FaaS hides the complexity of provisioning and management, allowing engineers to focus fully on their application (or queries, in the case of BigQuery).
“Serverless continues to be attractive to organizations since it doesn’t require management of the infrastructure,” the report stated. “As companies migrate increasingly to the cloud and continue to build cloud-native architectures, we think the pace of serverless adoption will also continue to grow.”
Abstraction of hardware provisioning allows engineers to focus on the main game.
Both approaches offer high scalability with greatly reduced engineering costs and much higher hardware utilisation. While FaaS is commonplace across all cloud providers, the BigQuery data warehouse model has been underappreciated for a long time, although AWS and others are now coming to the party.
Data engineers often complain about bottlenecks, under- or over-provisioning, outages, index configuration, backups and other infrastructure issues. The BigQuery model makes all of these issues redundant.
BigQuery Pros and Cons
The pros: You can have a full data warehouse up and running in minutes with virtually zero ongoing operational overhead. Into this you can bulk load or stream in data programmatically (>100k rows per second per project; edit: we previously said per table, which was incorrect. Thanks, Graham Polley), which you can then query across multiple tables and views, with crazy joins and anything else you can think of. All of this comes without any thought for indexes (you don’t need them), storage, hardware or any of the typical data warehouse management issues. And if your query can be cached, you don’t pay anything to run it again.
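To give a feel for what streaming data in programmatically might look like, below is a minimal sketch that groups rows into batches before handing them to a sender. Everything here is an assumption for illustration: the names `batched` and `stream_rows` are hypothetical, the batch size of 500 is an arbitrary default, and `send` is a stand-in callable. In a real setup it could wrap a client-library call (for example `insert_rows_json` in the google-cloud-bigquery Python client), but that wiring is not shown.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def stream_rows(rows: Iterable[Row],
                send: Callable[[List[Row]], object],
                batch_size: int = 500) -> int:
    """Pass rows to send() one batch at a time; return the number of calls."""
    requests = 0
    for batch in batched(rows, batch_size):
        send(batch)  # hypothetical sender, e.g. a wrapper around a streaming insert
        requests += 1
    return requests


# Demo with an in-memory sender instead of a real BigQuery call.
sent_batches: List[List[Row]] = []
events = [{"event": "keypress", "seq": i} for i in range(1200)]
request_count = stream_rows(events, sent_batches.append, batch_size=500)
print(request_count, [len(b) for b in sent_batches])
```

Injecting `send` keeps the batching logic testable without credentials or network access; error handling and retries are left out of this sketch.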
Example query uses over 3000 CPU cores, hundreds of disks and a 330Gb network to run a regex over 4TB of data in under 30 secs.
The above shows a regex query being run over a 7TB table with ~100 billion rows (BigQuery is a columnar storage engine, so you only pay for the columns processed, 4TB in this case). It’s hard to imagine provisioning and managing the resources needed to brute force this type of query in less than 30 seconds (not to mention the wages of the people you would need to pay to do this). Most importantly, those resources were only active during the query. You can read more about BigQuery under the hood here.
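The columnar billing point can be sketched in a few lines: the data processed by a query is roughly the sum of the sizes of the columns it references, not the size of the whole table. The column names and per-column sizes below are hypothetical, chosen only so that they add up to a 7TB table with one 4TB text column, in line with the figures quoted for the example query.

```python
from typing import Dict, Iterable


def tb_scanned(column_sizes_tb: Dict[str, float],
               referenced: Iterable[str]) -> float:
    """Sum the sizes (in TB) of the distinct columns a query references."""
    return sum(column_sizes_tb[name] for name in set(referenced))


# Hypothetical column sizes for a 7TB table (illustrative values only).
columns_tb = {"title": 4.0, "comment": 2.0, "id": 0.5, "timestamp": 0.5}

table_size = sum(columns_tb.values())
regex_scan = tb_scanned(columns_tb, ["title"])               # regex over one column
wide_scan = tb_scanned(columns_tb, ["title", "comment", "id"])

print(f"table: {table_size} TB, regex query: {regex_scan} TB, wide query: {wide_scan} TB")
```

The practical takeaway under this model is to select only the columns a query needs, since every extra column referenced adds to the data processed.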
Infinite storage, thousands of CPU cores, massive network bandwidth and virtually zero operational overhead
The cons: it’s a black box, resource controls are out of reach, and it’s possible (usually when you’re doing something stupid) to hit quota issues and get stuck. The other main con is geo-locality of data: if you have data sovereignty issues you don’t have a lot of options, as BigQuery is currently only available in several regions. Apart from that, the predefined schema constraints can be a bit annoying, but once you understand why things are done that way you tend to architect your tables accordingly, so this is less of a big deal.
Reducing the data warehouse overhead
Traditional data warehouses fit into an on-premises or, more recently, a cloud-based IaaS model (see below). The shift to IaaS has taken time, but the cost savings have been dramatic, even though cloud providers are growing fast and making tidy profits (AWS: >20% margin on ~$20B annual revenue; Google Cloud: >$4B annual revenue).
PaaS and Serverless are now taking this even further, reducing infrastructure costs and shifting more spend to cloud providers. In 2015 Urs Hölzle predicted Google's cloud revenue would eclipse advertising revenue by 2020. This isn’t surprising given Google has spent two decades perfecting infrastructure services internally and had the jump on container-based services by over 10 years. Google have come to market with more advanced infrastructure components: Spanner, Kubernetes, gRPC, the Jupiter network, software-defined networking and more. Each of these solves scalability problems and simplifies application design. BigQuery is no different.
BigQuery fits somewhere in-between PaaS and Serverless.
The above image shows the continual reduction of responsibility as infrastructure is offloaded to third parties. At each step to the right the technology cost increases, but is typically offset by much greater infrastructure management cost savings. The exception is the SaaS column where the cost savings are more evident upfront but can be less so (or negative) over time.
We started out using BigQuery several years ago because we had minimal engineering resources but produced a lot of data (every keypress in a search query to the Search.io API adds at least one row to BigQuery). We needed a data warehouse but we didn’t want to spend a lot of time on it or pay a lot of money, especially as we scaled to 10x, 100x and 1000x throughput.
Fast forward and we’re still very happy with BigQuery. We haven’t had any scaling issues (API quotas, cough cough) and we record everything.
From an architecture perspective we use BigQuery as a data catch-all. Engineers can define events to record, push them into BigQuery effortlessly and analyze them as needed.
We’ve now stored billions of rows and stream in additional rows at rates of ~1000 / second. We query this data constantly, both ad hoc and programmatically (machine learning processes, customer dashboards, monitoring, Data Studio and more). Admittedly we’ve optimised our queries and we architect for maximum cacheability, but it’s still mind-blowing that BigQuery is one of our smallest infrastructure costs at only several hundred dollars per month.
While FaaS has become trendy for compute loads, it’s time for people to start looking at their Data Warehouse through the same lens.
If you want a data warehouse with thousands of CPU cores and infinite storage, and no provisioning, maintenance or manual upgrading, then take a look at BigQuery. You might be surprised just how cost effective it is.
And we’re hiring!