2008年12月16日星期二

in vs exists

in和exists的问题我们都谈论了好多年了,今天又有课长提到这个问题,并且宣称in没有exists高效。一直以来大家都这么想,也就是说in比较慢。因此大家都用exists来替代in,其实这是一种不一定正确的做法。

在很多情况下,exists是没有办法替代in的,比如select * from emp where deptno in(10,20);我就看不出来这种条件下exists怎么能替代in。再者,如果exists真的能完全替代in的话,那么in为什么不从最新的sql 2003标准里除名呢。要知道制定这个标准的是数据库领域内的专家。

当然了,在大部分情况下,是可以使用exists来替代in的,同理也可以使用not exists来替代not in。但是即使在能替代的情况下,也不能保证二者究竟谁更高效一些。他们的效率问题,tom老人家早有论述,原话如下:

Well, the two are processed very very differently.

Select * from T1 where x in ( select y from T2 )

is typically processed as:

select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;

The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to
the original table -- typically.


As opposed to

select * from t1 where exists ( select null from t2 where y = x )

That is processed more like:


for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x )
then
OUTPUT THE RECORD
end if
end loop

It always results in a full scan of T1 whereas the first query can make use of an index
on T1(x).


So, when is where exists appropriate and in appropriate?

Lets say the result of the subquery
( select y from T2 )

is "huge" and takes a long time. But the table T1 is relatively small and executing (
select null from t2 where y = x.x ) is very very fast (nice index on t2(y)). Then the
exists will be faster as the time to full scan T1 and do the index probe into T2 could be
less then the time to simply full scan T2 to build the subquery we need to distinct on.


Lets say the result of the subquery is small -- then IN is typicaly more appropriate.


If both the subquery and the outer table are huge -- either might work as well as the
other -- depends on the indexes and other factors.

英语不错的同学可以看看,翻译过来的意思是:in实际上是拿到内层查询的所有值,然后在去外层查找对应的值,得到查询结果;而exists实际上是拿到内层的每一个值都去外层查找一次。因此in更适合处理小结果集合的内层查询,而exists更适合处理大结果的内层查询。

以下是他给的例子:
rem create table big as select * from all_objects;
rem insert /*+ append */ into big select * from big;
rem commit;
rem insert /*+ append */ into big select * from big;
rem commit;
rem insert /*+ append */ into big select * from big;
rem create index big_idx on big(object_id);
rem
rem
rem create table small as select * from all_objects where rownum < 100;
rem create index small_idx on small(object_id);
rem
rem analyze table big compute statistics
rem for table
rem for all indexes
rem for all indexed columns
rem /
rem analyze table small compute statistics
rem for table
rem for all indexes
rem for all indexed columns
rem /

so, small has 99 rows, big has 133,000+

select count(subobject_name)
from big
where object_id in ( select object_id from small )

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.01 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.02 0.02 0 993 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.03 0.03 0 993 0 1

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
792 MERGE JOIN
100 SORT (JOIN)
100 VIEW OF 'VW_NSO_1'
99 SORT (UNIQUE)
792 INDEX GOAL: ANALYZED (FULL SCAN) OF 'SMALL_IDX'
(NON-UNIQUE)
891 SORT (JOIN)
0 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'BIG'


versus:

select count(subobject_name)
from big
where exists ( select null from small where small.object_id = big.object_id )

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 4.12 4.12 0 135356 15 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 4.12 4.12 0 135356 15 1

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
792 FILTER
135297 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'BIG'
133504 INDEX GOAL: ANALYZED (RANGE SCAN) OF 'SMALL_IDX'
(NON-UNIQUE)

That shows if the outer query is "big" and the inner query is "small", in is generally more
efficient then NOT EXISTS.

Now:

select count(subobject_name)
from small
where object_id in ( select object_id from big )

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.01 0 0 0 0
Execute 2 0.00 0.00 0 0 0 0
Fetch 2 0.51 0.82 50 298 22 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 5 0.52 0.83 50 298 22 1



Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
99 MERGE JOIN
16913 SORT (JOIN)
16912 VIEW OF 'VW_NSO_1'
16912 SORT (UNIQUE)
135296 INDEX GOAL: ANALYZED (FAST FULL SCAN) OF 'BIG_IDX'
(NON-UNIQUE)
99 SORT (JOIN)
99 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'SMALL'


versus:
select count(subobject_name)
from small
where exists ( select null from big where small.object_id = big.object_id )

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.01 0.01 0 204 12 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.01 0.01 0 204 12 1

EGATE)
99 FILTER
100 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'SMALL'
99 INDEX GOAL: ANALYZED (RANGE SCAN) OF 'BIG_IDX' (NON-UNIQUE)

shows that is the outer query is "small" and the inner query is "big" -- a WHERE EXISTS can be
quite efficient.

不过这个例子我在ORACLE 11g上运行的时候并没有得到相同的结果,主要是因为我开启了查询重写,使得ORACLE可以自己重写查询,根据实际情况将exists和in进行互换(你不能指望每次数据库的查询重写都正确,特别是在数据库设计的不怎么好的情况下,查询重写总会出错)。禁用查询重写后得到了类似的结论,在9i和10g上我没有试过,有兴趣的话,可以测试一下。

欢迎有问题一起讨论。

没有评论: