Chinar Aliyev
chinaraliyev@gmail.com
As is well known, the Query Optimizer tries to select the best plan for a query: it generates the possible plans, estimates the cost of each of them, and selects the cheapest plan as the optimal one. Estimating the cost of a plan is a complex process, but cost is directly proportional to the number of I/Os: there is a functional dependence between the number of rows retrieved from the database and the number of I/Os. So the cost of a plan depends on the estimated number of rows retrieved in each step of the plan – the cardinality of the operation. Therefore the optimizer should accurately estimate the cardinality of each step in the execution plan. In this paper we are going to analyze how the Oracle optimizer calculates join selectivity and cardinality in different situations: how does the CBO calculate join selectivity when histograms are available (including the new types of histograms introduced in 12c)? What factors does the estimation error depend on? In general, two main join cardinality estimation methods exist: histogram based and sampling based.
Thanks to Jonathan Lewis for writing the book "Cost Based Oracle Fundamentals". It actually helped me to understand the optimizer's internals and to open the "Black Box". In 2007 Alberto Dell'Era did an excellent piece of work investigating join size estimation with histograms. However, it left some open questions, such as the introduction of a "special cardinality" concept. In this paper we are going to review this matter as well.
For simplicity we are going to use single-column joins and columns containing no null values. Assume we have two tables t1 and t2 with corresponding join columns j1 and j2, and the remaining columns are filter1 and filter2. Our queries are:
(Q0)
SELECT COUNT (*)
FROM t1, t2
WHERE t1.j1 = t2.j2
AND t1.filter1 ='value1'
AND t2.filter2 ='value2'
(Q1)
SELECT COUNT (*)
FROM t1, t2
WHERE t1.j1 = t2.j2;
(Q2)
SELECT COUNT (*)
FROM t1, t2;
As you know, the query Q2 is a Cartesian product, so its join cardinality Card_cartesian is:

Card_cartesian = num_rows(t1) * num_rows(t2)

Here num_rows(ti) is the number of rows of the corresponding table. When we add the join condition to the query (giving Q1), we actually get some fraction of the Cartesian product. To identify this fraction, Join Selectivity has been introduced. Therefore we can write:

Card_Q1 <= Card_cartesian
Card_Q1 = Jsel * Card_cartesian = Jsel * num_rows(t1) * num_rows(t2)    (1)
Definition: Join selectivity is the ratio of the "pure" (natural) cardinality to the Cartesian product. I call Card_Q1 the "pure" cardinality because it does not contain any filter conditions. Here Jsel is the Join Selectivity. This is our main formula. You should know that when the optimizer estimates the join cardinality (JC) it first calculates Jsel. Therefore we can use the same Jsel and write the appropriate formula for query Q0 as:

Card_Q0 = Jsel * Card(t1) * Card(t2)    (2)

Here Card(ti) is the final cardinality after applying the filter predicate to the corresponding table. In other words, Jsel is the same in both formulas (1) and (2), because Jsel does not depend on the filter columns unless the filter conditions include the join columns. According to formula (1):

Jsel = Card_Q1 / (num_rows(t1) * num_rows(t2))    (3)

or

Card_Q0 = Card_Q1 * Card(t1) * Card(t2) / (num_rows(t1) * num_rows(t2))    (4)
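We can illustrate formulas (1)-(4) with a minimal Python sketch. The filtered cardinalities below are illustrative; with Card(t1)=40 and Card(t2)=1000 the sketch reproduces the 2272 estimate that appears in the example later in the paper.

def join_cardinality(j_sel, card_t1, card_t2):
    # formula (2): join cardinality after the filter predicates have been applied
    return j_sel * card_t1 * card_t2

num_rows_t1, num_rows_t2 = 1000, 1000
card_q1 = 56800                                   # "pure" join cardinality (no filters)
j_sel = card_q1 / (num_rows_t1 * num_rows_t2)     # formula (3): 0.0568

card_t1, card_t2 = 40, 1000                       # illustrative filtered cardinalities
print(join_cardinality(j_sel, card_t1, card_t2))  # 2272.0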
Based on this we have to find out the estimation mechanism of the expected cardinality Card_Q1. Now assume that there is no histogram of any type on the join columns ji of the tables ti. In this case the optimizer assumes a uniform distribution and the join selectivity reduces to:

Jsel = 1 / max(num_dist(j1), num_dist(j2))    (5)

As can be seen, we have got formula (5). Without a histogram the optimizer is not aware of the data distribution, because the data dictionary does not contain the "(distinct value, frequency)" pairs that describe the column distribution. Because of this, the optimizer assumes a uniform distribution and calculates an "average frequency" as num_rows(t1)/num_dist(j1). Based on this "average frequency" the optimizer calculates the "pure" expected cardinality and then the join selectivity. If a table column has a histogram, the optimizer will calculate join selectivity based on that histogram (depending on its type); in that case the "(distinct value, frequency)" pairs are not formed from the "average frequency", but from the information given by the histogram.
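A minimal sketch of this uniform-distribution ("average frequency") estimate, with illustrative numbers:

def pure_card_no_histogram(num_rows1, ndv1, num_rows2, ndv2):
    avg_freq1 = num_rows1 / ndv1         # average frequency of t1.j1
    avg_freq2 = num_rows2 / ndv2         # average frequency of t2.j2
    common_values = min(ndv1, ndv2)      # distinct values assumed to match on both sides
    return common_values * avg_freq1 * avg_freq2

num_rows1, ndv1 = 1000, 25
num_rows2, ndv2 = 1000, 20
card_q1 = pure_card_no_histogram(num_rows1, ndv1, num_rows2, ndv2)
j_sel = card_q1 / (num_rows1 * num_rows2)        # equals 1/max(ndv1, ndv2) = 0.04
print(card_q1, j_sel)                            # 40000.0 0.04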
2 - access("T1"."J1"="T2"."J2")
3 - filter("T1"."F1"=13)
The estimate is good enough in this situation, but it is not exact. Why? How did the optimizer calculate the cardinality of the join as 2272?
If we enable SQL trace for the query, we will see that Oracle queries only the histgrm$ dictionary table. The information about the columns and tables is as follows:

SELECT table_name, num_rows FROM user_tables WHERE table_name IN ('T1','T2');

table_name   num_rows
T1           1000
T2           1000
(Freq_values1)
Frequency histograms describe the column distribution exactly, so the "(column value, frequency)" pairs give us every opportunity to estimate the cardinality of any kind of operation. Now we have to estimate the pure cardinality Card_Q1; then we can find Jsel according to formula (3). First we have to find the common data range of the join columns. This data lies between max(min_value(j1), min_value(j2)) and min(max_value(j1), max_value(j2)). It means we are not interested in rows whose j2 value is greater than 10. We also have to take the equal values, so we get the following table:

tab t1, col j1              tab t2, col j2
value    frequency          value    frequency
0        40                 0        100
2        80                 2        40
3        100                3        120
4        160                4        20
5        60                 5        40
6        260                6        100
8        120                8        40
9        60                 9        20
As we can see, this is the same number as in the execution plan above. The other question was why we did not get the exact cardinality – 2260. Although join selectivity by definition does not depend on filter columns and conditions, filtering actually influences this process: the optimizer does not reconsider the join column's value range, min/max values, spread and number of distinct values after the filter is applied in line 3 of the execution plan. This is not easy to resolve; it would require additional estimation algorithms, and the whole estimation process could become more expensive. If we remove the filter condition from the query above, we get an exact estimate.
---------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | HASH JOIN | | 1 | 56800 | 56800 |
| 3 | TABLE ACCESS FULL| T1 | 1 | 1000 | 1000 |
| 4 | TABLE ACCESS FULL| T2 | 1 | 1000 | 1000 |
---------------------------------------------------------------
2 - access("T1"."J1"="T2"."J2")
It means the optimizer calculates an "average" join selectivity. I do not think it is an issue in general. As a result we get the following formula for join selectivity:

Jsel = ( SUM over common values i, from max_min to min_max, of freq(t1.j1_i) * freq(t2.j2_i) ) / ( num_rows(t1) * num_rows(t2) )    (7)
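A minimal sketch of formula (7) using the "(value, frequency)" pairs from the table above; it reproduces the 56800 estimate shown in the plan.

t1_j1 = {0: 40, 2: 80, 3: 100, 4: 160, 5: 60, 6: 260, 8: 120, 9: 60}
t2_j2 = {0: 100, 2: 40, 3: 120, 4: 20, 5: 40, 6: 100, 8: 40, 9: 20}
num_rows_t1, num_rows_t2 = 1000, 1000

# only values inside [max(min, min), min(max, max)] that exist on both sides can match
lo = max(min(t1_j1), min(t2_j2))
hi = min(max(t1_j1), max(t2_j2))
common = [v for v in t1_j1 if lo <= v <= hi and v in t2_j2]

card_q1 = sum(t1_j1[v] * t2_j2[v] for v in common)   # "pure" cardinality = 56800
j_sel = card_q1 / (num_rows_t1 * num_rows_t2)        # formula (7) = 0.0568
print(card_q1, j_sel)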
For the column j1 there is a height-balanced (HB) histogram and for the column j2 a frequency (FQ) histogram is available. The corresponding information from the user_tab_histograms dictionary view is shown in Table 3.
Table 3
tab t1, col j1                       tab t2, col j2
column value   frequency   ep        column value   frequency   ep
1              0           0         1              2           2
9              1           1         7              2           4
16             1           2         48             3           7
24             1           3         64             4           11
32             1           4
40             1           5
48             2           7
56             1           8
64             2           10
72             2           12
80             3           15
The frequency column for t1.j1 in Table 3 does not express the real frequency of each value; it is actually the "frequency of the bucket". First we have to identify the common values, so we have to ignore HB histogram buckets whose endpoint value is greater than 10. We have exact "(value, frequency)" pairs for the t2.j2 column, therefore our base source must be the values of the t2.j2 column. But for t1.j1 we do not have exact frequencies. An HB histogram contains buckets which hold approximately the same number of rows. We can also find the number of distinct values per bucket. Then for every value of the frequency histogram we can identify the appropriate bucket of the HB histogram. Within an HB bucket we can also assume a uniform distribution, and then we can estimate the size of this disjoint subset – {value of FQ, bucket of HB} – as sketched below.
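A rough sketch of this approach, assuming integer column values and estimating the number of distinct values in a bucket from the width of the bucket's value range (the endpoints below are only illustrative):

import bisect

def hb_value_frequency(value, hb_endpoints, num_rows):
    # each bucket of a height-balanced histogram covers roughly the same number of rows
    num_buckets = len(hb_endpoints) - 1              # endpoint 0 holds the column minimum
    rows_per_bucket = num_rows / num_buckets
    if value < hb_endpoints[0] or value > hb_endpoints[-1]:
        return 0.0                                   # value outside the histogram range
    idx = bisect.bisect_left(hb_endpoints, value, lo=1)
    # distinct values assumed to lie uniformly inside this bucket's value range
    distinct_in_bucket = hb_endpoints[idx] - hb_endpoints[idx - 1]
    return rows_per_bucket / max(distinct_in_bucket, 1)

endpoints = [1, 16, 32, 48, 64, 80]                  # illustrative HB endpoints of t1.j1
for v in (7, 48, 64):                                # values taken from the FQ histogram of t2.j2
    print(v, hb_value_frequency(v, endpoints, num_rows=130))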
Although this approach gave me some approximation of the join cardinality, it did not give me the exact numbers which the Oracle optimizer calculates and reports in the 10053 trace file. We have to find out what additional information is needed to improve this approach.
Alberto Dell'Era first investigated joins based on histograms in 2007 (Join Over Histograms). His approach was based on grouping values into three major categories:
- "populars matching populars"
- "populars not matching populars"
T1 11 T1.J1 30
T2 130 T2.J2 4
We have got all the "(value, frequency)" pairs, so according to formula (7) we can calculate the join selectivity.

tab t1, col j1                   tab t2, col j2
column value   frequency         column value   frequency   freq*freq
1              2.00005           1              2           4.0001
7              2.00005           7              2           4.0001
48             17.33333333       48             3           52
64             17.33333333       64             4           69.33333
                                                Sum         129.3335

And finally:

Jsel = 129.3335 / (num_rows(t1) * num_rows(t2)) = 129.3335 / (11 * 130) = 0.090443
So our "pure" cardinality is Card_Q1 = 129. The execution plan of the query is as follows:
---------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | HASH JOIN | | 1 | 129 | 104 |
| 3 | TABLE ACCESS FULL| T2 | 1 | 11 | 11 |
| 4 | TABLE ACCESS FULL| T1 | 1 | 130 | 130 |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"."J1"="T2"."J2")
It means we were able to figure out the exact estimation mechanism in this case. The execution plan of the second query (Case2 q2) is as follows:
It actually confirms our approach. However, the execution plan shows the cardinality of the single table t1 as 5; this is correct because it must be rounded up, but during the join estimation process the optimizer considers the original (unrounded) values.
Reviewing Alberto Dell'Era's complete formula (join_histogram_complete.sql)
We can list the column information from the dictionary as below:

tab t1, col value                tab t2, col value
column value   frequency         column value   frequency
20             1                 10             1
40             1                 30             2
50             1                 50             1
60             1                 60             4
70             2                 70             2
                                 80             2
                                 90             1
                                 99             1

We have to find the common values. As you can see, min(t1.value)=20, so we must ignore t2.value=10; also max(t1.value)=70, so we have to ignore column values t2.value>70. In addition we do not have the value 40 in t2.value, therefore we have to remove it as well. Because of this we get the following:
num_rows(t1)=12; num_buckets(t1.value)=6; num_distinct(t1.value)=8, so:

newdensity = num_unpop_buckets / (unpop_ndv * num_buckets) = (6-2) / ((8-1)*6) = 0.095238095

and the corresponding column value frequencies based on the HB histogram will be:
t1.value freq calculated
30 1.142857143 num_rows*newdensity
50 1.142857143 num_rows*newdensity
60 1.142857143 num_rows*newdensity
70 4 num_rows*freq/num_buckets
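A minimal sketch of this "new density" calculation with the numbers of the example:

num_rows, num_buckets, num_distinct = 12, 6, 8
pop_buckets, pop_ndv = 2, 1                      # the value 70 spans 2 buckets, so it is popular
unpop_buckets = num_buckets - pop_buckets        # 6 - 2 = 4
unpop_ndv = num_distinct - pop_ndv               # 8 - 1 = 7

new_density = unpop_buckets / (unpop_ndv * num_buckets)    # 4 / 42 = 0.095238...
freq_unpopular = num_rows * new_density                    # 1.142857...
freq_popular_70 = num_rows * 2 / num_buckets               # 12 * 2 / 6 = 4
print(new_density, freq_unpopular, freq_popular_70)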
2 - access("T1"."VALUE"="T2"."VALUE")
Here num_rows(t1)=20, num_rows(t2)=11, num_dist(t1.value)=11, num_dist(t2.value)=5, and density(t1.value) = (10-6)/((11-3)*10) = 0.05. The mechanism above does not give us exactly the number the optimizer estimates, because in this case, to estimate the frequency of unpopular values, Oracle does not use the density; instead it uses the number of distinct values per bucket and the number of rows per distinct value. To prove this we can use join_histogram_essentials1.sql. In this case the t1 table is the same as in join_histogram_essentials.sql, and the column t2.value has only one value, 20, with frequency one.
t1.value freq EP t2.value freq EP
10 2 2 20 1 1
20 1 3
30 2 5
40 1 6
50 1 7
60 1 8
70 2 10
In this case Oracle computes the join cardinality as 2, rounded up from 1.818182. We can see it in the trace file:

Join Card: 1.818182 = outer (20.000000) * inner (1.000000) * sel (0.090909)

Tests show that in such cases the cardinality of the join is computed from the frequency of the t2.value. So the frequency used for a non-popular t1.value will be:

Frequency(non-popular t1.value) = num_rows_bucket / num_dist_bucket,   if frequency of t2.value = 1
Frequency(non-popular t1.value) = 1,                                   if frequency of t2.value > 1
or
Cardinality = max (frequency of t2.val, number of rows per distinct value within bucket)
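A minimal sketch of this rule, using the numbers of the join_histogram_essentials1.sql example:

def nonpopular_join_contribution(t2_freq, num_rows, num_buckets, num_distinct):
    # rows per distinct value inside one bucket of the height-balanced histogram
    rows_per_bucket = num_rows / num_buckets
    distinct_per_bucket = num_distinct / num_buckets
    rows_per_distinct = rows_per_bucket / distinct_per_bucket    # 2 / 1.1 = 1.818182
    return max(t2_freq, rows_per_distinct)

print(nonpopular_join_contribution(t2_freq=1, num_rows=20, num_buckets=10, num_distinct=11))
# 1.8181818..., which Oracle rounds up to 2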
The question is why? In such cases I think the optimizer tries to minimize estimation errors. Therefore:
tab t1, col value                tab t2, col value
column value   frequency         column value   frequency   freq*freq
10             4                 10             2           8
20             1.818181818       20             1           1.818181818
50             1                 50             3           3
60             1.818181818       60             1           1.818181818
70             4                 70             4           16
                                                sum         30.63636364
We get 30.64 ≈ 31 as the expected cardinality. Let's look at the trace file and execution plan:
Join Card: 31.000000 = outer (11.000000) * inner (20.000000) * sel (0.140909)
Join Card - Rounded: 31 Computed: 31.000000
---------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | HASH JOIN | | 1 | 31 | 29 |
| 3 | TABLE ACCESS FULL| T2 | 1 | 11 | 11 |
| 4 | TABLE ACCESS FULL| T1 | 1 | 20 | 20 |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"."VALUE"="T2"."VALUE")
As can be seen, the common column values are between 0 and 9, so we are not interested in buckets which contain column values greater than or equal to 10. A hybrid histogram gives us more information for estimating single-table and join selectivity than a height-balanced histogram; in particular the endpoint repeat count column is used by the optimizer to estimate the endpoint values exactly. But how does the optimizer use this information to estimate the join? The principle of building "(value, frequency)" pairs from a hybrid histogram is the same as for a height-balanced histogram: it depends on the popularity of the value. Here the average bucket size is 7, and Oracle considers a value popular when the corresponding endpoint repeat count is greater than or equal to the average bucket size; if a value is popular then its frequency is taken as the corresponding endpoint repeat count. Also, in our case the density is (crdn - popfreq)/((NDV - popCnt)*crdn) = (100-28)/((20-4)*100) = 0.045. If we enable the 10053 trace event we can clearly see the column and table statistics. Therefore the "(value, frequency)" pairs will be:
t1.j1 popular frequency calculated
0 N 4.5 density*num_rows
1 N 4.5 density*num_rows
2 Y 7 endpoint_repeat_count
3 N 4.5 density*num_rows
4 N 4.5 density*num_rows
5 N 4.5 density*num_rows
6 N 4.5 density*num_rows
7 Y 7 endpoint_repeat_count
8 N 4.5 density*num_rows
9 N 4.5 density*num_rows
t1.j1 t2.j2
value frequency value frequency freq*freq
0 4.5 0 3 13.5
1 4.5 1 6 27
2 7 2 6 42
3 4.5 3 8 36
4 4.5 4 11 49.5
5 4.5 5 3 13.5
6 4.5 6 3 13.5
7 7 7 9 63
8 4.5 8 6 27
9 4.5 9 5 22.5
sum 307.5
Join sel 0.05125
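A minimal sketch of this hybrid-histogram case, using the numbers above; it reproduces the 307.5 sum and the 0.05125 join selectivity:

num_rows_t1, num_rows_t2 = 100, 60
density_t1, avg_bucket_size = 0.045, 7

repeat_counts = {2: 7, 7: 7}          # endpoint repeat counts of the popular t1.j1 values
t2_freq = {0: 3, 1: 6, 2: 6, 3: 8, 4: 11, 5: 3, 6: 3, 7: 9, 8: 6, 9: 5}

def t1_freq(value):
    rc = repeat_counts.get(value, 0)
    # popular when the repeat count reaches the average bucket size, otherwise density-based
    return rc if rc >= avg_bucket_size else density_t1 * num_rows_t1

card_q1 = sum(t1_freq(v) * f for v, f in t2_freq.items())
print(card_q1, card_q1 / (num_rows_t1 * num_rows_t2))     # 307.5 0.05125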
By the definition of the top-frequency histogram, there are two kinds of buckets: Oracle placed the high-frequency values into their own buckets, and the rest of the values of the table were effectively "placed" into another bucket. So we have "high frequency" and "low frequency" values. For the "high frequency" values we have exact frequencies, while for the "low frequency" values we can use a uniform-distribution assumption. First we have to build the high-frequency pairs from the common values. Here max(min(t1.j1), min(t2.j2)) = 4 and min(max(t1.j1), max(t2.j2)) = 100, so in principle we have to gather the common values which are between 4 and 100. After identifying the common values, for popular values we are going to use the exact frequency and for non-popular values the new density. Therefore we could create the following table:
In this case there is a frequency histogram for the column t2.j2 and we have the exact common values {1, 2, 4}. But test cases show that the optimizer also considers all the values from the top-frequency histogram which are between max(min(t1.j1), min(t2.j2)) and min(max(t1.j1), max(t2.j2)). This is quite an interesting case: since we have a frequency histogram it should be our main source, and this case should have been handled similarly to case 3.
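As a side sketch of the top-frequency principle described above – exact frequencies for the values kept in the histogram and a uniform assumption for the remaining rows (the examples below use num_rows * density for the same purpose) – with purely illustrative numbers:

num_rows, num_distinct = 42, 11
top_freq = {4: 9, 7: 8, 10: 7, 12: 6}            # illustrative (value, frequency) pairs

captured_rows = sum(top_freq.values())           # rows covered by the top-frequency buckets
low_freq_rows = num_rows - captured_rows         # rows in the implicit "low frequency" bucket
low_freq_ndv = num_distinct - len(top_freq)      # distinct values not kept in the histogram

def estimated_frequency(value):
    if value in top_freq:
        return top_freq[value]                   # exact frequency
    return low_freq_rows / low_freq_ndv          # uniform assumption for the rest

print(estimated_frequency(7), estimated_frequency(3))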
Table Stats::
Table: T2 Alias: T2
#Rows: 16 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00
SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:
0.000000
Column (#1): J2(NUMBER)
AvgLen: 3 NDV: 4 Nulls: 0 Density: 0.062500 Min: 0.000000 Max: 4.000000
Histogram: Freq #Bkts: 4 UncompBkts: 16 EndPtVals: 4 ActualVal: yes
***********************
Table Stats::
Table: T1 Alias: T1
#Rows: 42 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00
SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:
0.000000
Column (#1): J1(NUMBER)
AvgLen: 3 NDV: 11 Nulls: 0 Density: 0.023810 Min: 1.000000 Max: 25.000000
Histogram: Top-Freq #Bkts: 41 UncompBkts: 41 EndPtVals: 10 ActualVal: yes
Here for the value 3, j2.freq is calculated as num_rows(t2)*density = 16*0.0625 = 1. And in the 10053 file:
Join Card: 47.000000 = outer (16.000000) * inner (42.000000) * sel (0.069940)
Join Card - Rounded: 47 Computed: 47.000000
But if we compare estimated cardinality with actual values then we will see:
---------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | HASH JOIN | | 1 | 73 | 42 |
| 3 | TABLE ACCESS FULL| T3 | 1 | 18 | 18 |
| 4 | TABLE ACCESS FULL| T1 | 1 | 42 | 42 |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"."J1"="T3"."J3")
As we can see there is a significant difference: 73 vs 42, a fairly big estimation error. That is why we said before that this is quite an interesting case: the optimizer should consider only the values from the frequency histogram, and these values should be the main source of the estimation process, as in case 3. If we walk only the values of the frequency histogram as the common values, we get the following table:
common val j1.freq calculated j3.freq calculated freq*freq
1 3 freq 7 freq 21
2 3 freq 2 freq 6
You can clearly see that such an estimate is much closer to the actual number of rows.
It is quite hard to interpret the case when one of the join columns has a top-frequency histogram (Hybrid_topfreq.sql). For example, here there is a hybrid histogram for t1.j1 and a top-frequency histogram for t2.j2. The column information from the dictionary:
The high-frequency common values are located between 1 and 7, and we have two popular values for the t1.j1 column: {3, 4}.
Table Stats::
Table: T2 Alias: T2
#Rows: 30 SSZ: 0 LGR: 0 #Blks: 5 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00 SPC
0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0
IMCQuotient: 0.000000
Column (#1): J2(NUMBER)
AvgLen: 3 NDV: 12 Nulls: 0 Density: 0.033333 Min:
1.000000 Max: 30.000000
Histogram: Top-Freq #Bkts: 27 UncompBkts: 27
EndPtVals: 9 ActualVal: yes
***********************
Table Stats::
Table: T1 Alias: T1
#Rows: 40 SSZ: 0 LGR: 0 #Blks: 5 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00 SPC
0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0
IMCQuotient: 0.000000
Column (#1): J1(NUMBER)
AvgLen: 3 NDV: 13 Nulls: 0 Density: 0.063636 Min:
1.000000 Max: 13.000000
Histogram: Hybrid #Bkts: 8 UncompBkts: 40
EndPtVals: 8 ActualVal: yes
The above test case was quite simple because the popular values of the hybrid histogram are also located within the range of the high-frequency values of the top-frequency histogram: the popular values {1, 5, 6} of the hybrid histogram are located in the 1-6 range of the top-frequency histogram. Let's see another example.
CREATE TABLE t1(j1 NUMBER);
INSERT INTO t1 VALUES(6);
INSERT INTO t1 VALUES(2);
INSERT INTO t1 VALUES(7);

EXPLAIN PLAN FOR
SELECT COUNT (*)
  FROM t1, t2
 WHERE t1.j1 = t2.j2;

SELECT *
  FROM table (DBMS_XPLAN.display);
So our average bucket size is 3 and we have 2 popular values {6, 7}. These values are not part of the high-frequency values in the top-frequency histogram. The table and column statistics from the optimizer trace file:
Table Stats::
Table: T2 Alias: T2
#Rows: 20 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00
SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:
0.000000
Column (#1): J2(NUMBER)
AvgLen: 3 NDV: 8 Nulls: 0 Density: 0.062500 Min: 1.000000 Max: 20.000000
Histogram: Top-Freq #Bkts: 15 UncompBkts: 15 EndPtVals: 4 ActualVal: yes
***********************
Table Stats::
Table: T1 Alias: T1
#Rows: 20 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00
SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:
0.000000
Column (#1): J1(NUMBER)
So our cardinality for the high-frequency values is 17.7273, and we also have num_rows(t1) - popular_rows(t1) = 20 - 15 = 5 unpopular rows. But as you can see, Oracle computed the final cardinality as 31. In my opinion the popular rows of the hybrid histogram play a role here: test cases show that in such situations the optimizer also tries to take advantage of the popular values. In our case the values 6 and 7 are popular and the popular frequency is 7 (the sum of the popular frequencies). If we try to find the frequencies of these values based on the top-frequency histogram, we have to use the density. So the cardinality for the popular values will be:

popular frequency * num_rows(t1) * density(j2) = 7 * 20 * 0.0625 = 8.75

Moreover, every "low frequency" value has a frequency of 1.18182 ≈ 1, and we have 5 "low frequency" values (the unpopular rows of the j2 column), therefore the cardinality for the "low frequency" part can be taken as 5. Eventually we can figure out the final cardinality:

CARD = CARD(high frequency values) + CARD(low frequency values) + CARD(unpopular rows) = 17.7273 + 8.75 + 5 = 31.4773
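A minimal sketch reproducing this arithmetic:

card_high_freq = 17.7273                     # matches among the high-frequency values
pop_freq_t1, num_rows_t1, density_j2 = 7, 20, 0.0625
card_low_freq = pop_freq_t1 * num_rows_t1 * density_j2    # popular values 6 and 7: 8.75
card_unpopular = 5                           # ~1 row for each of the 5 remaining values

print(card_high_freq + card_low_freq + card_unpopular)    # 31.4773, rounded to 31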
And execution plan
---------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | HASH JOIN | | 1 | 31 | 26 |
| 3 | TABLE ACCESS FULL| T1 | 1 | 20 | 20 |
| 4 | TABLE ACCESS FULL| T2 | 1 | 20 | 20 |
---------------------------------------------------------------
So this is the expected cardinality. In general, though, there can be estimation or approximation errors related to rounding.
CREATE TABLE t2
AS SELECT * FROM dba_objects;
Note
-----
- dynamic statistics used: dynamic sampling (level=AUTO)
As we can see, without a histogram there is a significant difference between the actual and estimated rows, but when automatic (adaptive) sampling is enabled the estimate is good enough. The question is how did the optimizer actually get the cardinality 58728? How did it calculate it? To explain this we can use the 10046 and 10053 trace events. In the SQL trace file we can see the following lines:
SQL ID: 1bgh7fk6kqxg7
Plan Hash: 3696410285
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
optimizer_features_enable(default) no_parallel result_cache(snapshot=3600)
*/ SUM(C1)
FROM
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2#0") */ 1 AS C1 FROM
"T2" SAMPLE BLOCK(51.8135, 8) SEED(1) "T2#0", "T1" "T1#1" WHERE
("T1#1"."USERNAME"="T2#0"."OWNER")) innerQuery
During parsing Oracle executed this SQL statement and its result was used to estimate the size of the join. The SQL statement used sampling (in an undocumented format) and actually read about 50 percent of the T2 table blocks. Sampling was not applied to the T1 table because its size is quite small compared to the second table, and 100% sampling of the T1 table does not consume a lot of time during parsing. It means Oracle first identifies the appropriate sampling size based on the table size and the time the sampling query takes during parsing.
Let's see what will happen if we increase the sizes of both tables – using multiple

insert into t select * from t

table_name   blocks   num_rows   size (MB)
T1           3186     172032     25
T2           6158     368076     49

In this case Oracle completely ignores adaptive sampling and uses the uniform distribution to estimate the join size.
Table Stats::
Table: T2 Alias: T2
#Rows: 368076 SSZ: 0 LGR: 0 #Blks: 6158 AvgRowLen: 115.00 NEB: 0 ChainCnt:
0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000
Column (#1): OWNER(VARCHAR2)
AvgLen: 6 NDV: 31 Nulls: 0 Density: 0.032258
***********************
Table Stats::
Table: T1 Alias: T1
#Rows: 172032 SSZ: 0 LGR: 0 #Blks: 3186 AvgRowLen: 127.00 NEB: 0 ChainCnt:
0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000
Column (#1): USERNAME(VARCHAR2)
AvgLen: 9 NDV: 42 Nulls: 0 Density: 0.023810
It is obvious that Oracle stopped execution of this SQL during parsing; we can see it from the rows column of the execution statistics and also from the row source statistics. Oracle did not complete the HASH JOIN operation in this SQL, which is confirmed by the result of the above SQL and the row source statistics. The sizes of the tables are actually not big, so why did the optimizer ignore sampling and decide to continue with the previous approach? In my opinion there could be two factors: although the sample size is not small, in our case the sampling SQL took quite a long time during parsing (1.8 sec elapsed time), therefore Oracle stopped it. I have added one filter predicate to the query:
SELECT COUNT (*)
FROM t1, t2
WHERE t1.username = t2.owner AND t2.object_type = 'TABLE';
********************************************************************************
It means Oracle first tried to estimate the size of the T2 table, because it has a filter predicate and the optimizer thinks using ADS could be very efficient. If we had added a predicate like t2.owner='HR' then the optimizer would also have tried to estimate the T1 table cardinality. But the principle of estimating a subset of the join and then estimating the whole join was actually not applied in this case; only the T2 table has been estimated. We can easily see this fact from the trace file:
BASE STATISTICAL INFORMATION
***********************
Table Stats::
In this case Oracle completely ignored ADS and used the statistics from the dictionary to estimate the sizes of the tables and the join cardinality.
Summary
This paper has explained the mechanism the Oracle optimizer uses to calculate join selectivity and cardinality. We learned that the optimizer first calculates join selectivity based on the "pure" cardinality. To estimate the "pure" cardinality the optimizer identifies "(distinct value, frequency)" pairs for each column based on the column distribution, and the column distribution is described by the histogram. A frequency histogram gives us the complete data distribution of the column. A top-frequency histogram gives us enough information for the high-frequency values, and for the less significant values we can fall back on a uniform-distribution assumption. Moreover, if there are hybrid histograms on the join columns in the dictionary, the optimizer can use the endpoint repeat counts to derive frequencies. In addition, the optimizer has the chance to estimate join cardinality via sampling, although this process is influenced by time restrictions and the size of the tables; as a result the optimizer can completely ignore adaptive dynamic sampling.
References
• Lewis, Jonathan. Cost-Based Oracle Fundamentals. Apress, 2006.
• Alberto Dell'Era. Join Over Histograms. 2007
• http://www.adellera.it/investigations/join_over_histograms/JoinOverHistograms.pdf
• Chinar Aliyev. Automatic Sampling in Oracle 12c. 2014
• https://www.toadworld.com/platforms/oracle/w/wiki/11036.automaticadaptive-dynamic-
sampling-in-oracle-12c-part-2
• https://www.toadworld.com/platforms/oracle/w/wiki/11052.automaticadaptive-dynamic-
sampling-in-oracle-12c-part-3