onsdag 30 juni 2010

Problems with subquery transformations

MySQL transforms IN predicates into nearly-equivalent EXISTS predicates, and uses a set of mechanisms to catch all special cases. This transformation, outlined below, also applies to NOT IN predicates since a NOT IN b parses to NOT a IN b. Expressions such as

SELECT * FROM table1 WHERE column1 IN (SELECT column2 FROM table2)

get transformed into

SELECT * FROM table1 WHERE EXISTS
( SELECT column2 FROM table2
WHERE column1 = column2 OR column2 IS NULL )

since the latter can be executed more efficiently. For instance an index can be used for accessing only the rows of table2 that will actually participate in the final result. The added WHERE condition is called a guarded condition, or trigger condition, both terms are used. In order to get correct result, the condition column1 = column2 must be deactivated if we are comparing against a NULL value for column1. This guarantees correct results for most cases but there are some additional cases to consider.

If the IN predicate is used directly in the WHERE clause (e.g. not as a sub-expression of IS UNKNOWN,) a NULL value is treated as false. There is a special property top_level_item of an expression node that tells the execution to filter out NULL values from the subquery. But there are problems with it.
  1. It does not propagate through NOT nodes (or any other nodes for that matter.)
  2. It is only applied to the WHERE and HAVING clauses. Since MySQL allows quantified predicates in the SELECT list - this is not in the SQL standard - an extra hack is needed to make it work. The solution is to add - when top_level_item is false - another guarded condition as HAVING clause. Remember that MySQL allows HAVING without GROUP BY - yet again extending the standard. Hence we end up with
SELECT * FROM table1 WHERE EXISTS
( SELECT column2
FROM table2
WHERE column1 = column2 OR column2 IS NULL
HAVING NOT column2 IS NULL )

This will filter out NULL values causing the result of the subquery to be empty.

1 and 2 together caused Bug#51070. For row valued predicands, e.g.

SELECT * FROM t WHERE (a, b) NOT IN (SELECT c, d FROM u)

there may be partial matches for any two rows. A partial match occurs when the corresponding positions in two rows are either equal, or at least one is NULL. The design above handles these cases just fine as long as NULL values come from the outer table. When there is a NULL value in the row from the inner table, however, it gets filtered out by the HAVING clause. Hence the subquery, now rewritten into

SELECT c, d FROM u WHERE c = a OR b = d
HAVING NOT c IS NULL AND NOT d IS NULL

yields an empty result and the value of IN is false, causing NOT IN to yield true.