Many of our customers ask us which data warehousing option is cheaper: BigQuery or Redshift? TL;DR: Redshift is cheaper as soon as you don’t leverage multi-stage querying and don’t care a lot about SQL query structure. BigQuery can be much more cost effective if you structure your data warehouse querying very well and split it into stages.
In this blog post, we’re going to break down BigQuery vs Redshift pricing structures and see how they work in detail.
Redshift pricingRedshift pricing is pretty simple to understand. They charge as usual for AWS: per machine, per hour. The cheapest Redshift cluster you can spin up will cost you $0.25 per hour, or about $180 per month. It’ll contain one dc2.large node in the US region. AWS constantly updates prices so please check their site for up-to-date information.
Storage is bound to computing power for Redshift, unlike EC2 deployments. This means Redshift pricing will depend on your data size. Per TB pricing is $0.425 / TB / hour for HDD storage and $1.5625 / TB / hour for SSD storage.
So the cheapest pricing is $306 / TB / mo. This is on-demand pricing and as usual, AWS provides significant discounts if you pay upfront. Also, there’s separate Redshift Spectrum pricing, which is out of scope for this post.
BigQuery pricing is much more complicated compared to Redshift. BigQuery charges separately for storage at $20 / TB / month and $5 / TB processed in query. Ingestion into a BigQuery warehouse is usually free of charge, but this is not the case for data streaming. You can check updated BigQuery pricing.
Bytes processed in query, also known as bytes billed, are pretty tricky to understand and this usually makes up about 95% of your BigQuery costs. Let’s review some simple SQL queries to see how BigQuery calculates it.
Bytes billed explained
BigQuery uses columnar storage, and bills are based on scanned data within columns and not within rows. We’ll review a simple count query on the sample Hacker News BigQuery data set, which is publicly available.
SELECT count(*) FROM `bigquery-public-data`.hacker_news.full
It will cost you $0 because no columns are processed in this query.
Let’s introduce a column here:
SELECT max(timestamp) FROM `bigquery-public-data`.hacker_news.full
This query will cost you 130MB. Here and below we’re using bytes billed cost for the sake of simplicity. This price is calculated as total
timestamp column size. A data type for the
timestamp column is
TIMESTAMP, which is 8 bytes. So there’re about 16M timestamps stored in this column, which correlates with 17M of rows in this table.
Note that the
timestamp column is not stored for every row due to the nature of BigQuery’s Capacitor storage format.
A much more interesting use case is filtering. Look at the billing for this query:
SELECT max(timestamp) FROM `bigquery-public-data`.hacker_news.full WHERE timestamp BETWEEN '2015-01-01' AND '2015-02-01'
You may expect that such a query allows you to reduce the costs, but actually it doesn’t. It’ll be the same 130MB. When a column is involved in a query, BigQuery calculates data as a whole column scan. This happens even if BigQuery doesn’t scan this data due to internal storage optimizations. In this case, querying performance and pricing aren’t aligned.
So how can query costs be reduced for such a filtering query? This is where BigQuery sharding and partitioning come into play. BigQuery sharding is implemented as wildcard table querying. It works as a UNION ALL operation on a scale. You can query at once up to 1000 tables with a specified suffix. This approach is fast as well as cost effective.
Here is an example query on a NOAA Global Surface Summary of the Day Weather Data dataset:
SELECT max(temp) FROM `bigquery-public-data`.noaa_gsod.`gsod*` WHERE _TABLE_SUFFIX BETWEEN '2010' AND '2018'
Unlike in the previous example, costs will be reduced to the size of columns of tables starting with
gsod2010 up to
gsod2018. It’ll be 262MB instead of 1.11GB in the case where all tables are scanned. BigQuery partitioning works in a pretty same way but it’s not separated unlike sharding tables.
Leveraging multi-stage querying
Partitioning and sharding are not the only way to reduce your BigQuery costs. From our perspective, they are not the main one as well. Let’s consider an upgraded version of the previous example.
1 2 3
SELECT CONCAT(year, '-', mo, '-', da), max(temp) FROM `bigquery-public-data`.noaa_gsod.`gsod*` WHERE CONCAT(year, '-', mo, '-', da) BETWEEN '2010-01-01' AND '2018-01-01' GROUP BY 1
It’ll cost you 3.04GB. If you use a table suffix filter instead of a date filter, you can reduce costs to 720MB per query. In such types of queries, the date filter is usually variable, which leads to a lot of queries. Although the table suffix filter allows you to reduce costs dramatically, they will still be high if you query too often. The typical solution here is to introduce a roll up table first and then query it.
At the moment, BigQuery supports the CREATE TABLE AS statement in beta mode, as well as it supports load jobs, which you can use to create a table from an SQL query.
So, for our query we can introduce the following roll up table:
1 2 3
SELECT CONCAT(year, '-', mo, '-', da) as date, max(temp) as max_temp FROM `bigquery-public-data`.noaa_gsod.`gsod*` GROUP BY 1
And to query it we can run something like:
1 2 3
SELECT date, max(max_temp) FROM rollups.max_temp_rollup WHERE date BETWEEN '2010-01-01' AND '2018-01-01' GROUP BY 1
The first query costs us 3.04GB, which is obviously more than in the case of table suffix filtering. On the other hand, the second query is 634KB, which is at least 1000 times less than suffix filtering. So even if you use this query often, it won’t significantly affect your bill.
As you see, there is no absolute winner in the Bigquery vs Redshift comparison. On-demand Redshift may cost you $10 / TB / day and up no matter how intensive the workload is. Meanwhile, BigQuery will allow you to query only about 10-50 queries per 1TB of data stored for that price per day. At the same time, we saw a lot of BigQuery deployments, which are at least 4 times cheaper than Redshift due to multi-stage querying.
This means that if you are not cautious about queries issued to BigQuery, it’ll become much more expensive than the Redshift option pretty fast.