September 2010
M T W T F S S
« Apr    
 12345
6789101112
13141516171819
20212223242526
27282930  

Selectivity and Cardinality and their Formulas

Two important and related concepts inside CBO is selectivity and cardinality. The cardinality means that how many rows should be returned by CBO after evaluating the predicates. And this values is always equal to (number of input rows)* selectivity. So we can guess that selectivity means that how many percents of the input rows will be returned. The selectivity plan an crucial part in choosing the join order and optimal index.

Recall the index access path cost calculation, selectivity also play as a factor to the number of IO blocks. Below we will focus on selectivity calculation.

For the simplest predicate column = constant, such as month_no  = 12, the selectivity is simple:

selectivity = 1/num_distinct; (when histogram non-available)
or
selectivity = density; (when histogram available)

both of num_distinct and density comes from user_tab_columns/user_tab_col_statistics view. And we can get the “number of input rows” from the num_rows of user_tables view, so we can get the candinality as num_rows * selectivity.

When we have null values in the column, then we must multiply this selectivity with (num_rows - num_nulls)/num_rows. As below

selectivity = (num_rows - num_nulls)/(num_distinct*num_rows); (when histogram non-available)
or
selectivity = density * (num_rows - num_nulls)/num_rows; (when histogram available)

When we use the in-list predicate, such as month_no in (2,3,4). The correct behavior should be to sum up all the possibilities of each member.

What about the values outside the valid scope? such as month_on in (13, 14)?  What Oracle does is to assume that the further away you get from the known low/high value, the less chance you will get the data back. Oracle uses a straight-line decay to predict the variation, decaying to zero when you exceed the range by the difference between low and high.

What about not in predicates? Maybe you could guess that it should be (1 - selectivity(in predicate)), actually, the answer is NO. As Oracle transform the (C not in(A,B)) into (C <>A and C<>B), whose selectivity should be (1 - selectivity(A))*(1- selectivity(B)). There are a minor difference between (1 - selectivity(in predicate)) and (1 - selectivity(A))*(1- selectivity(B)). But Oracle seems to remain this in-consistence in all versions, including 11g. Let’s consider the in predicate again, actually Oracle transform this predicate (C in (A,B)) into (C = A or C = B), whose selectivity should be (selectivity(A) + selectivity(B) - selectivity(A) * selectivity(B). Although we know this value is wrong, as A and B in independent variable. But this wrong value is just the Oracle 8i counts, Oracle 9i and forwards corrects this mistake.

Let’s come to the range predicate, such as month_no between 2 and 5.

selectivity = "required range" divided by "total available range" + n/num_distinct;

(n is the number of closed ends)

And then, we come to the multiple predicates, such as A = B or A = C, the following formulas applying.

selectivity(predicate1 AND predicate2) = selectivity(predicate1) * selectivity(predicate2).
selectivity(predicate1 OR predicate2) = selectivity(predicate1) + selectivity(predicate2) - selectivity(predicate1 AND predicate2)
selectivity(NOT predicate1) = 1 – selectivity(predicate1)       //except for bind variable problems

Of course, these formula comes from Probability theory, as these formula indicates, the predicate1 and predicate2 should be  independent, otherwise, the result will be erroneous. Below are some erroneous examples:

month_no > = 8 or month_no < 8 will not cover all the month_no.
month_no in (8,9) will not cover 2/12 in Oracle 8i.
month_no not in (8,9) will not cover 10/12 in Oracle 8i onwards.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>