SQL to compare rows within two tables
Oracle Corporation’s developer Vadim Tropashko has some interesting notes on tuning Oracle SQL queries that compare the contents of two tables, showing several SQL solutions and their performance within the Oracle cost-based optimizer.
Vadim Tropashko’s is the author of “SQL Design Patterns: The Expert Guide to SQL Programming“, the definitive reference for SQL design patterns, a critical knowledge area for any Oracle developer.
Vadim also shows a great example of using the hidden parameter _convert_set_to_join to improve SQL execution speed for queries that find the semantic difference between two tables, and he shows how to compare the rows and columns of two tables with Oracle SQL syntax.
(Note: Hidden parameters are extremely powerful and you should always thoroughly test all use of hidden SQL tuning parameters and only use hidden parms like _coinvert_set_to_join at the direction of Oracle technical support).
Vadim Tropashko’s new book “SQL Design Patterns: The Expert Guide to SQL Programming“, is the very first book to apply the mathematical foundation of SQL in-terms of design patterns, a must-own book for any professional SQL developer.
Computing table content differences with Oracle SQL
The question “what is the difference between two tables? is simple to ask, but difficult to code. It’s like an “anti-union” where we seek only the unique rows within each table.
Tropashko has a great analysis of this common SQL problem from both a mathematical perspective and a practical Oracle perspective. It also shows an undocumented technique for improving execution speed for querying the non-intersection of two tables.
Vadim notes that the _convert_set_to_join parameter can be used when comparing the contents of two tables. This type of SQL query is called a “semantic difference”, or an “anti-union” operation. Tropashko depicts the table comparison problem below:
In the above figure we read the “/” as “not in” and the “∪” as “union”. In sum, we want the table comparison SQL to find:
Rows in table A that are not in table B plus Rows in table B that are not in table A
Vadim notes that the obvious solution van have poor performance:
The figure 1 with the expressions (A \ B) ∪ (B \ A) and (A ∪ B) \ (A ∩ B) pretty much exhaust all the theory involved. It is straightforward to translate them into SQL queries. Here is the first one:(select * from A minus select * from B) — Rows in A that are not in B union all ( select * from B minus select * from A ) — rows in B that are not in A
In practice, however, this query is a sluggish performer. With a naïve evaluation strategy, the execution flow and the operators are derived verbatim from the SQL which we have written. First, each table has to be scanned twice. Then, four sort operators are applied in order to exclude duplicates. Next, the two set differences are computed, and, finally, the two results are combined together with the union operator. . .
After executing the symmetric difference query, and capturing the row-source execution statistics (from the v$sql_plan_statistics view) we get the following result:
Execution plan for comparing the contents of two tables with MINUS operator
Rewriting Oracle SQL MINUS operator with a NOT IN subquery
Vadim notes that the Oracle query rewrite might come into play where a “set operation” (the minus operator) might be converted into a join by the CBO:
It confirms our intuition on the number of table scans and sorts. It also reveals that the time scanning a single table (about 100 msec) is small compared to the time spent when sorting each table (about 450 msec). Therefore we could try to aim lower than 2.4 sec of total execution time.
This is not the only possible execution plan for the symmetric difference query. Oracle has quite sophisticated query rewrite capabilities, and transforming set operation into join is one of them.
In our example, both minus operators could be rewritten as anti-joins. As of version 10.2, however, this transformation is not enabled by default, and used as part of view merging/ query unnesting framework only. It can be enabled with
alter session set “_convert_set_to_join”= true;
The other alternative is to rewrite the SQL query manually [replacing the minus operator with a NOT IN subquery] evidences about 30% improvement in execution time . . .
where (col1,col2,…) not in
(select col1,col2,… from B)
select * from B
where (col1,col2,…) not in
(select col1,col2,… from A);
The new execution statistics are very different and faster, using hash joins:
Execution plan for comparing the contents of two tables with NOT IN subquery
Disabling sort merge and hash joins
Tropashko notes that the hash joins may not be the fastest table join method and he removes them by unsetting thehash_join_enabled parameter and reviews the resulting nested loops table join method:
alter session set “_hash_join_enabled” = false;
alter session set “_optimizer_sortmerge_join_enabled” = false;
The particular query that we are studying in this section has 5 query blocks: 2 pairs of selects from the tables A and B, and one set operation. After this query is transformed, it would consist of only 3 query blocks which we might witness from the QBLOCK_NAME column of the PLAN_TABLE.
Execution plan for comparing the contents of two tables with sort merge disabled
Forcing nested loop joins
Vadim next tries forcing nested loop joins with the use_nl hint and discovers that the execution speed gets worse, not faster:
Those query block names allow directing the hints to the query blocks where they are supposed to be applied. In our case, the joint method can be hinted as
/*+ use_nl(@”SEL$74086987″ A)use_nl(@”SET$D8486D66″ B)*/
which produces the following rowsource level execution statistics:
Execution plan for comparing the contents of two tables with nested loop joins
From the statistics output we see significant increase in buffer gets which is not offset by any noticeable improvement in the execution time. It is fair to conclude that indexed nested loops didn’t meet our performance expectations in this case.
Using SQL aggregation to compute the difference between two tables
Tropashko notes that this anti-union semantic difference problem can also be solved using the SQL CASE operator with in-line views (select statements in the FROM clause). He notes a decrease in buffer gets, but no change in overall execution time:
select * from (
select id, name,
sum(case when src=1 then 1 else 0 end) cnt1,
sum(case when src=2 then 1 else 0 end) cnt2
select id, name, 1 src from A
select id, name, 2 src from B
group by id, name
where cnt1 <> cnt2;
This is appeared as rather elegant solution where each table has to be scanned once only, until we capture execution statistics. Although we succeeded decreasing buffer gets count in half, the execution time still remained around 2 sec.
Execution plan for comparing the contents of two tables with SQL aggregation
Hash Value based Table Comparison SQL Syntax
Tropashko notes yet another technique for using SQL to compare the contents of two tables using a hash-based method:
When comparing data in two tables with identical signatures there actually are two questions that one might want to ask:
- Is there any difference? The expected answer is Boolean.
- What are the rows that one table contain, and the other doesn’t.
Answering the second question implies that we can answer the first, yet, it is possible that the first query might have better performance. We’ll approach it with the hash-based technique in mind.
The standard disclaimer of any hash-based method is that it is theoretically possible to get a wrong result. The rhetorical question, however, is how often did the reader experience a problem due to hash value collision? I never did. Would a user accept a tradeoff: significant performance boost of the table difference query at the expense of remote possibility of getting an incorrect result?
The idea of hash based method is associating a single hash value with a table. This hash value behaves like aggregate, and therefore, it can be calculated incrementally: if a row is added into a table, then a new hash value is a function of the old hash value and the added row. Incremental evaluation means good performance, and the two hash values comparison is done momentarily.
Comparing the contents of two tables with SQL hash weighting
Figure 2 is a method to “weight” a table into a single value involves concatenating all the fields per each row first. Then a concatenated string is mapped into a number — hash value. Finally all hash values are aggregated into a single one. Had a single field in a single row have change, the table weight would change as well.
A naive implementation is as simple as this SQL:
We concatenate all the row fields (with a distinct separator, in order to guarantee uniqueness), then translate this string into a hash value.
Using get_hash_values to computer table row differences
Tropashko notes that you can use the dbms_utility packages get_hash_value procedure to rewrite this query with the POWER clause:
Alternatively, we could have calculated hash values for each field, and then apply some asymmetric function in order for the resulting hash value to be sensitive to column permutations:
Row hash values are added together with ordinary sum aggregate, but we could have written a modulo216-1 user-defined aggregate hash_sum in the spirit of CRC (Cyclic Redundancy Check) technique.
For more on advanced Oracle SQL tuning, see Vadim Tropashko’s book “SQL Design Patterns: The Expert Guide to SQL Programming“