Optimizers are a great candidate for a love-hate relationship for data analysts – that is, if they know optimizers exist. An optimizer can improve query performance by 100x, but sometimes it is too smart for its own good and ruins query performance instead. For most optimizers, the former happens much more often than the latter, but for anyone who hath experienced the fury of bad query performance caused by the optimizer, let me attempt to persuade you to fall back in love with it.
Why is it important to have an optimizer for data analytics? In my opinion it comes down to language and communication, specifically how we tell our database systems what we want versus how we want it retrieved. Let's take a very simple SQL query on a database managing soccer players.
In this scenario, I want to know which pairs of players with a combined age of 50 share a similar last-name prefix. In the Netherlands “De” is a very common prefix in last names, so we will use that.
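To make the examples concrete, here is a minimal schema I'll assume for the players table (the column names come from the dataframe example below; nothing else about the table matters here):

create table players (
    first_name varchar,
    last_name varchar,
    age integer
);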
There are two classes of languages that can express this: SQL, a declarative language, and dataframe libraries, which you can also call imperative languages.
With SQL, we could have a query like the following
select *
from
players p1,
players p2
where
p1.age + p2.age = 50 and
regexp_matches(p1.last_name, '^De') and
regexp_matches(p2.last_name, '^De')
We join the players table with itself to get all combinations of two players, then keep only the pairs whose combined age is 50.
In a dataframe library like datafusion or dplyr the query looks like the following
filtered_players <- players %>%
  filter(grepl("^De", last_name)) %>%
  select(first_name, last_name, age)
# Self-join (cross join) and keep the pairs whose combined age is 50
player_pairs <- filtered_players %>%
  cross_join(filtered_players) %>%
  filter(age.x + age.y == 50)
Here it is a bit more tricky. We can’t really self-join in dplyr without first declaring the filtered dataframe, and because the join condition is a sum of ages rather than a plain equality, the simplest route (in dplyr 1.1 or newer) is a cross_join followed by a filter.
# TODO: datafusion
This is a little more complex, but still readable. We can see that we first need to filter the players table, then build all combinations of two players and check the sum of their ages. The key difference is that in SQL you only describe what you want, while in dplyr you describe both what you want and how you want it computed. This is the difference between declarative languages (SQL) and imperative languages (dataframe languages).
An attentive reader may notice that the imperative approach (i.e. the dataframe library) requires a bit more verbiage. This is natural, since you first need to tell dplyr how to get some of the data, and only then how to combine it. Sometimes this is nice, but depending on how the query is written and whether execution is eager or lazy, it’s actually not that nice. It’s also more code, which can lead to more errors. And it’s not portable: SQL is always SQL, while dplyr is not pandas, which is not datafusion. So why is SQL better, and what do optimizers have to do with this?
No more need to describe “how” you want the data.
If you are using a system that supports SQL, you get to benefit from the fact that it is a declarative language (usually; sometimes you can tell the database exactly how you want data stored or how some other task should be done). You can just say what you want and be done with it. This is where the optimizer comes in. The optimizer takes your statement and figures out the how. It translates what you want into something the db-engine understands and can execute. Not only that, it is so tight with the engine that it can make optimizations a human cannot make by hand. (Yes, I skipped the step where the parser translates the SQL into a query tree, but to avoid explaining how a database system works in its entirety I’ve left that out.)
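You can usually inspect what the optimizer decided. A small sketch in DuckDB, the engine I know best (the setting and the plan rendering are DuckDB-specific):

-- show the plan the optimizer produced for the players query
explain
select *
from players p1, players p2
where
    p1.age + p2.age = 50 and
    regexp_matches(p1.last_name, '^De') and
    regexp_matches(p2.last_name, '^De');
-- DuckDB also has an explain_output setting ('all', 'optimized_only', ...) if you want to
-- compare the logical plan before and after the optimizer runs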
It is true that some imperative and dataframe languages have optimizers as well, but by removing the requirement to describe “how” you want the data, you write less code, which usually means fewer bugs. It also makes your code more portable, since SQL in DuckDB is SQL in BigQuery.
OK, but what is the optimizer doing that I can’t do?
In my opinion, optimizations fall into three categories: optimizations that are easy to write by hand and always improve query performance; optimizations that can be written by hand, but need to be rewritten when the data changes; and optimizations that are impossible to express without engine-specific functionality.
Hand optimizations
Here is a list of optimizations that you can do by hand on your queries. Keep in mind that by doing these optimizations by hand, you run the risk of introducing bugs into your code.
Expression rewriting
This is a simple one. Suppose you have a join condition like select * from t1, t2 where a + b = 50. This could be written as select * from t1, t2 where a = 50 - b. Rewriting the expression in this way allows the planner to plan a hash join instead of a nested loop join. To the human eye, however, a + b = 50 might be nicer to write because it communicates the combined age of two players.
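Applied to the players query from the introduction, the hand-rewritten form would look roughly like this (the optimizer performs an equivalent rewrite internally):

select *
from players p1, players p2
where
    p1.age = 50 - p2.age and  -- an equality between the two sides, so a hash join is possible
    regexp_matches(p1.last_name, '^De') and
    regexp_matches(p2.last_name, '^De');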
Regex->Like Optimizations
Suppose your queries use regex matching to match the prefix or suffix of a string, e.g. select * from t1 where regexp_matches(a, '^prefix'). This is a simple prefix lookup, and without an optimizer, a regex library will be used to run the prefix check. Certain db-engines have optimized prefix lookups and can perform them 100x faster than a regex library. Again, this is possible by hand, but it’s nice when it happens automatically.
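The hand-written equivalent is a plain LIKE predicate; this sketch shows the two forms, which are interchangeable whenever the pattern is a simple anchored prefix:

-- what you might write
select * from t1 where regexp_matches(a, '^prefix');
-- what the engine can run instead: a simple prefix lookup
select * from t1 where a like 'prefix%';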
Filter pullup and pushdown
Suppose you have a query like select * from t1, t2 where a + b = 50 and regexp_matches(c, '^prefix'), where a comes from t1 while b and c come from t2. Filter pullup and pushdown will automatically turn the a + b = 50 filter into a join filter, and will push the filter on column c down into the scan of t2 instead of applying it at the top of the plan.
Let's make this concrete. Without these optimizations, the plan is a cross product of t1 and t2 with both filters applied on top of it; with them, the prefix filter runs inside the scan of t2 and a + b = 50 is evaluated as part of the join.
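Written out by hand, the pushed-down version would look roughly like this (again assuming b and c live in t2):

select *
from
    t1,
    (select * from t2 where regexp_matches(c, '^prefix')) t2
where t1.a + t2.b = 50;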
Optimizations that are specific to the state of the data
The optimizations below are possible by hand. If the state of the data changes, however, it’s possible that the query performance becomes worse by a factor of 100x.
Join Order Optimization
Let me try to summarize my MSc thesis in one paragraph. When executing any query, it’s important to avoid keeping large amounts of data in memory. For join-heavy queries, predicting the size of the join output can be difficult, and if we predict a small join output when in fact it is very large, query performance suffers, since handling large intermediate results is slow. Join order optimization attempts to prevent this by looking at table statistics and predicting the cardinality of each join, so that an exploding join doesn’t produce an unmanageable amount of data. Join order optimization is possible by hand, of course, but when the underlying data changes, the intermediate join outputs can grow, which leads to poor query performance.
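As a sketch, using the same TPC-H-style tables that show up later in this post (the chosen nation is just for illustration):

select count(*)
from orders, customer, nation
where
    o_custkey = c_custkey and
    c_nationkey = n_nationkey and
    n_name = 'GERMANY';
-- A good join order joins customer with nation first: only the customers from one nation survive,
-- so the intermediate result stays small. Joining orders with customer first materializes every
-- order of every customer, only to throw most of that work away at the nation filter.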
Statistics Propagation

The optimizer can carry table statistics (such as row counts and column min/max values) through the plan as it builds it, and use them for later decisions, for example which join strategy to pick or whether a filter can be dropped entirely.
Build-Side/Probe-Side Optimization

Certain operators, like the hash join, treat their two inputs differently: one side is used to build a hash table and the other side probes it. It is usually much cheaper to build on the smaller input, so the optimizer uses cardinality estimates to decide which side is which (this is the swap described in the join order example further below).
An optimizer can perform these optimizations with up-to-date knowledge of the data every single time, so the optimization is valid for almost every state your data is in. If an analyst were to hand optimize a query, and later dump a huge amount of data into certain tables, query performance would significantly decrease, which could cause a number of problems for downstream processes.
Optimizations that require engine-specific functionality
The optimizations below are, in my opinion, impossible to express without an optimizer.
Top N optimizations
This is a relatively simple optimization. If you have some SQL like select a, sum(b) from t1 group by a order by a limit 5, there is room to optimize this to only keep track of 5 aggregate values, which means you won’t need to store the result of sum(b) for the rest of the values of a.
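You can see whether your engine does something like this by looking at the plan. In DuckDB, for example (the operator name is DuckDB-specific), an ORDER BY plus LIMIT shows up as a single Top-N operator rather than a full sort followed by a limit:

explain select a, sum(b) from t1 group by a order by a limit 5;
-- the physical plan contains a TOP_N operator instead of a separate ORDER_BY and LIMIT,
-- so only the current top 5 rows ever need to be kept around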
Join Filter Pushdown
This is an optimization in DuckDB that can dramatically improve the performance of hash joins and the table scans underneath them. During the hash join, the keys of one table are stored in a hash table. Join filter pushdown records the min and max of the values stored in that hash table. Then, when the probe side is executed to start checking for matches, the table scan at the bottom of the probe side first performs a zone-map filter on the data it is scanning, to see whether each tuple can possibly fall within the min and max values from the build side. Confused? Imagine explaining that in SQL.
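A sketch of the kind of query where this kicks in (the table and column names here are made up for illustration):

select *
from sales s
join promotions p on s.promo_id = p.promo_id
where p.start_date >= date '2024-03-01';
-- The filtered promotions table, being much smaller, becomes the build side, so only a handful of
-- promo_id values end up in the hash table. Their min and max are pushed into the scan of sales,
-- and chunks of sales whose promo_id range falls entirely outside [min, max] are skipped before
-- the probe ever sees them.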
Table filter scans

Closely related: filters that the optimizer pushes all the way into the table scan itself can be evaluated against zone maps (per-row-group min/max statistics), so whole blocks of a table are skipped without ever being read.
What's nice is that SQL is declarative: I can just write, in SQL, exactly what I want. If I used pandas, I would not only have to write what I want, but also how I want it retrieved.
# TODO
With this kind of a script, you can also run into issues around eager and lazy evaluation, but that’s a separate blog post. When I look at these two examples, I see the following reasons why an optimizer is so important:
- There is more human error when a human must explain what and how
- Optimizers can tell the db-engine how to get the data much better than any human
- Optimizers can optimize in ways a human cannot easily express
- An optimizer can change how the data is retrieved based on changes in the data
Why I think an optimizer is so important is because it can tell the db-engine how to get the data better than any human can, and in ways that we as humans cannot express without engine-specific functionality.
Why can't a human just do all of this by hand? Well, there are a few reasons. Sometimes the general physical makeup of the data can influence the performance of a non-optimized query. Other times, changes in the makeup of the data can render a previously handwritten piece of ETL code inefficient. And finally, some logical optimizations are impossible to write by hand.
You can’t express physical execution in SQL
Some optimizations are purely about how the data physically moves through the CPU and memory to execute the logical task. Some of these optimizations are possible by hand, but most are not. Below is a query that benefits from three optimizations that change how data physically moves from memory to the CPU.
select c_custkey, count(*) count
from customer, orders
where
c_custkey = o_custkey and
c_nationkey = 20
group by c_custkey
order by count desc
limit 5;
This will show us the top 5 customers with the highest number of orders from the nation with nation key 20.
If we run this query with and without optimization, these are the results:

|  | Runtime |
|---|---|
| With optimization | 0.016 s |
| Without optimization | > 1 min |
The problem without optimization is that the filter is applied after a full cross product. An optimization most systems would perform is to turn the cross product into an inner join and apply the c_nationkey filter to just the customer table. We can hand-rewrite the query as follows.
select c_custkey, count(*) count
from
(select * from customer where c_nationkey = 20) customer
JOIN orders ON
c_custkey = o_custkey
group by c_custkey
order by count desc
limit 5;
Now the execution times look like the following:

|  | Runtime |
|---|---|
| With optimization | 0.016 s |
| Hand-optimized | 0.060 s |
| Without optimization | > 1 min |
But this is a hand-optimized query. Look at how much uglier it is! Imagine doing this for every query you need to write. And even then, the machine-optimized query is still almost 4x faster!
Let us now investigate where this kind of performance difference comes from.
Suppose we have a simple join query in our ETL pipeline. We want to join orders and customers so we can have the customer information for every order, and vice versa. If we are working with a system that does not have a join order optimizer, this can have detrimental effects.
From a pure brain-to-SQL perspective, the query could look like this
select * from customer, orders where c_custkey = o_custkey;
But without any optimizations, this gives you a cross product followed by a filter. So let’s hand-optimize the query a little bit to give ourselves a fair fight with the optimizer.
select * from customer join orders on c_custkey = o_custkey;
The above query is a bit better. On my machine it completes in 2.694 seconds on scale factor 10.
Now, let’s run the same query with the optimizer on
pragma enable_optimizer;
select * from customer join orders on c_custkey = o_custkey;
Wow! That took 1.132 seconds, so it is more than 2x faster. Why?
Well, join order optimization came into play. This optimizer knows which physical operators will be planned, and can then optimize for those specific operators. In this case a physical hash join will be planned, and, without boring you with the details, it is better to have the smaller table on the right-hand side of the join (although this varies between systems). The join order optimizer inspects the table cardinalities and performs this swap.
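As a sketch of what that swap means in DuckDB-style terms (where, as the paragraph above says, the right-hand input of the hash join is the build side; other engines differ):

-- with the optimizer off, the textual join order decides which side builds the hash table
select * from customer join orders on c_custkey = o_custkey;  -- builds on orders (15M rows at SF10): slow
select * from orders join customer on o_custkey = c_custkey;  -- builds on customer (1.5M rows): much better
-- with the optimizer on, both forms end up with the same plan, because the optimizer checks
-- the cardinalities and puts the smaller table on the build side itself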
You may now think: oh, but I can just do that myself when I write the query. You may be right, but that leads me to my next reason why optimizers are extremely important.
Data can change, which means the query should change (physically)
Take join order optimization: what happens when your data changes? What happens when a table that used to be tiny suddenly grows by a few orders of magnitude? The join order you hand-picked yesterday can start producing huge intermediate results, while an optimizer simply re-plans the query using the new table statistics.
OK, I need an optimizer. What systems have one?
Most databases have an optimizer. Systems that don’t are mostly dataframe libraries: in R, data.table and dplyr (among others); in Python, pandas. Engines such as datafusion, on the other hand, do ship an optimizer.
In many cases an ETL pipeline, a SQL query, or some data transfer is a set-it-and-forget-it kind of thing. This means that once the code to perform some logic is written, it is no longer actively maintained. The problem, however, is that while the code is no longer maintained, the physical makeup of the data it is accessing does change.
What can a good optimizer do that a normal human being might miss?
- As data changes in an ETL pipeline, the most efficient way to retrieve the data can also change. The makeup of the data can shift to the point where cheap data retrievals become much more expensive because more records were added. So now, instead of this retrieval being one of your first operations, it should be your last.
- Think about Regex-range optimizations (I still don’t know what these are)
- But you do have regex-to-LIKE optimizations.
- Expression rewriting. There are many of these: constant folding, the distributivity rule, arithmetic simplification, case simplification, conjunction simplification, etc. (see the small example after this list).
- Unused columns & column lifetime analyzer. This is a big one if you write your queries incorrectly. You end up holding a lot of data in memory that goes completely unused.
- Common aggregates. Less important, but if you want to compute percentages like count(i)/count(*), you could end up computing certain aggregate expressions twice.
- TopN optimizer. If you do an order by and a limit, then you don’t need to sort all values, just the top N values you’ve seen. This again helps with memory costs.
- Filter reordering. Which filters to apply first, based on cost, though really this should be based on statistics.
- Filter pull up and filter pushdown
- IN-clause rewriting.
- Think about filter pushdown in the taxi case: https://github.com/Tmonster/duckplyr_demo/blob/main/duckplyr/q03_popular_manhattan_cab_rides.R. The last filter is on pickup/dropoff borough = Manhattan, but the zone lookups can also be filtered up front so that only the Manhattan rows are ever scanned.
select pickup.zone pickup_zone, dropoff.zone dropoff_zone, count(*) as num_trips from zone_lookups pickup, zone_lookups dropoff, taxi_data_2019 data where pickup.LocationID = data.pickup_location_id and dropoff.LocationID = data.dropoff_location_id and pickup.Borough = 'Manhattan' and dropoff.Borough = 'Manhattan' group by pickup_zone, dropoff_zone order by num_trips desc;
Naively, someone might not filter the zone_lookups to include only the Manhattan borough.
- And potentially modifying certain ETL paths so that a query selects extra data (suppose a window expression). How does a naive fix in an optimizer-less system compare to a naive fix in a system with an optimizer?
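As a tiny illustration of the expression-rewriting bullet above (the rewrites shown are typical, though the exact rule set is engine-specific):

select * from t1 where a + 0 = 25 * 4 and true;
-- constant folding evaluates 25 * 4 once, arithmetic simplification drops the "+ 0",
-- and conjunction simplification removes the redundant "and true",
-- leaving the plain predicate a = 100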
I want to know: why is it good to have an optimizer? What systems have an optimizer? What systems allow you to turn certain parts of the optimizer on and off? Can you see how your optimizer is affecting the plan?
- Polars
- You can see the query plan by calling show_graph https://docs.pola.rs/user-guide/lazy/query-plan/
- Datafusion
- Databend
- ???
- Spark catalyst optimizer.
- apparently you can see the optimizer effects by looking at the string() method. https://stackoverflow.com/questions/55614122/how-do-i-get-a-spark-dataframe-to-print-its-explain-plan-to-a-string
- not much else about what optimizers are available on the system.
- Pandas
- I don’t think there really is any optimization workflow happening here.
Why is maintaining an optimizer difficult? So many downstream effects.
Why is working on an optimizer so difficult? I guess normally the workflow for performance programming is “Here is a problem with some variables, it needs to be faster”, and then you can work on it until it gets faster. This is the case with optimization, except once the thing is faster, you have to make sure nothing else is slower, and this is the part that takes forever. I implemented the Join Order Optimizer for DuckDB, then refactored it, then refactored it again. Most of the time spent on these PRs was not spent coding, but testing the rest of our functionality to make sure it wasn’t slower.
I work on the optimizer at DuckDB and have been wanting to write a blog post about how amazing and powerful optimizers are for a while. Before I can really emphasize the importance of optimizers, though, I need to talk about different types of query languages.
There are two main query language types: imperative and declarative. Imperative languages let you express the granular details of what should be done and when; dataframe libraries are usually imperative. Declarative languages let you express what you want, and the execution engine figures out how to get it by itself.
The main difference between imperative and declarative languages is how programmers approach writing them. With an imperative language, you usually have to express two things: what data you want, and how you want the data retrieved and processed. With a declarative language, you only have to express one thing, namely what data you want.
Let’s look at a quick example, starting with an imperative language (the dplyr library). Let’s retrieve some information from the taxi dataset, specifically which neighborhood-to-neighborhood trips in Manhattan are the most popular, i.e. have the most rides.
popular_manhattan_cab_rides <- taxi_data_2019 |>
inner_join(zone_map, by=join_by(pickup_location_id == LocationID)) |>
inner_join(zone_map, by=join_by(dropoff_location_id == LocationID)) |>
filter(Borough.x == "Manhattan", Borough.y=="Manhattan") |>
select(start_neighborhood = Zone.x, end_neighborhood = Zone.y) |>
summarise(
num_trips = n(),
.by = c(start_neighborhood, end_neighborhood),
) |>
arrange(desc(num_trips))
Let’s compare that to a SQL query
-- q3 in sql
select pickup.zone pickup_zone, dropoff.zone dropoff_zone, count(*) as num_trips
from
zone_lookups pickup,
zone_lookups dropoff,
taxi_data_2019 data
where
pickup.LocationID = data.pickup_location_id and
dropoff.LocationID = data.dropoff_location_id
and
pickup.Borough = 'Manhattan' and
dropoff.Borough = 'Manhattan'
group by pickup_zone, dropoff_zone
order by num_trips desc;
At first glance the two seem similar; after all, they produce the same results. They read very differently, however: one as a list of instructions, the other as a description of the data you want. In dplyr, we can read the query as a list of instructions like so
# Read the taxi_data_2019 data set
taxi_data_2019 |>
# inner join it with zone map on condition X
inner_join(zone_map, ...) |>
# inner join it again with zone map on condition Y
inner_join(zone_map, ...) |>
# now filter the result so pickup and dropoff are in Manhattan
filter(Borough.x == "Manhattan", Borough.y=="Manhattan") |>
# Now select just two columns
select(start_neighborhood = Zone.x, end_neighborhood = Zone.y) |>
# and summarise and group them
summarise(
num_trips = n(),
.by = c(start_neighborhood, end_neighborhood),
) |>
# and order them
arrange(desc(num_trips))
In SQL, the query logically summarizes into the following
select
-- Produce these columns
pickup.zone pickup_zone,
dropoff.zone dropoff_zone,
count(*) as num_trips
from ...
-- from these tables
where ...
-- with these filters.
group by pickup_zone, dropoff_zone
-- group the results by pickup_zone, dropoff_zone,
-- counting the number of entries in each group
order by num_trips desc;
-- order them in decreasing order.
Hopefully the difference is now more clear. With dplyr, the verbs usually occur in the order the user wants them to be carried out when the query executes. With SQL, the clauses express what the state of the data should be when it is returned.