Okay, so we talked about algebraic optimization, and then we talked about declarative languages on top of the algebra, in order to simplify expressions and in order to avoid specifying to the computer exactly how to do things. Right, we want to leave that open and let the database figure it out. But we stopped at what I call logical optimization. And what I want to talk a little bit about now is physical-level optimization.
And what I mean by this, and we hinted at this last time, is that even after you specify the order of operations, we haven't yet specified every detail needed in order to actually evaluate the query.
Okay, and let me give an example of that. So here is a simplified version of a query we looked at last time, where we say: for every order, we want to find all the corresponding items that were part of that order. And that's it. Last time we had an extra condition.
Oops, I'm actually pointing with the mouse but you can't see that, because I'm on the wrong screen. So for every order, find the corresponding items that match. And last time we had another predicate down here, and this time I've taken that out.
And so the algebraic plan that this translates into is very simple. It's just a join of the two tables, and that's it. So you'd think we're done, right? We're going to join order and item and we're finished. But we've still got to specify how we're going to do that join. And so let me tell you about a couple of options here.
So one, in sort of very high-level pseudocode, looks like this. We could say: for each record i in item, and for each record o in order, check to see if those two records agree on the order attribute, and if so, return the pair. And that's a join result. Okay. You know they match. So, fine.
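To make option 1 concrete, here's a minimal sketch in Python. The list-of-dicts layout and the attribute name order_id are placeholders of mine; the lecture only says the two records must agree on the order attribute.

```python
# Option 1: nested loop join. A sketch only; the record layout and the
# attribute name "order_id" are placeholders, not an engine's actual storage.
def nested_loop_join(item, order):
    results = []
    for i in item:                              # outer loop: every item record
        for o in order:                         # inner loop: every order record
            if i["order_id"] == o["order_id"]:  # the join condition
                results.append((i, o))          # emit the matching pair
    return results
```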
Another option is: for each record i in item, insert that record into some sort of data structure. Here I'm going to call it a hash table; I'm not too concerned about what exactly that is. And then second, for each record o in order, go look up the corresponding records in that data structure that we built, and return all the matching pairs.
Okay.
And if it is actually a hash table that we're talking about, then this lookup could be pretty efficient. Right? It could be constant time, amortized constant time. Right?
And so now this first one says: well, for every record in item, go scan every single record in order. And so we have kind of an n-squared complexity going on here. And here we say: well, for every record in item, put it into a data structure. And then after that, for every record in order, go look up those records in the hash table. And indeed, if that lookup is amortized constant time, then this is sort of a linear-time algorithm.
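And here's option 2 in the same sketched style, with the same placeholder layout as above: build the hash table on item, then probe it with order.

```python
# Option 2: hash join. Same placeholder record layout as the sketch above.
def hash_join(item, order):
    table = {}
    for i in item:                                   # pass 1: build
        table.setdefault(i["order_id"], []).append(i)
    results = []
    for o in order:                                  # pass 2: probe
        for i in table.get(o["order_id"], []):       # amortized O(1) lookup
            results.append((i, o))
    return results
```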
So I argue that there are two different ways to implement this join, and both of these are valid.
Okay, so which one is faster? Well, I've sort of hinted that perhaps option 2 is faster, but in practice it may or may not be. And so, you know, I would pause here and ask the class to answer the question, but since it's over video I can't do that. I'll give you a moment to think about it. But I want you to think about why this one in particular might be faster in some cases than the other, even though it seems like it should never be.
Okay, we'll see an example in a second. So, leaving that question hanging open, I want to make the point that you have access to this underlying algebra. This isn't something that's purely theoretical, alright? This is something that you can use tomorrow if you work with databases at your job. For example, in this particular product, Microsoft SQL Server, and in fact in all the products you're going to use, there's the same sort of mechanism: you can explain a query, and that will give you access to some form of this algebra that I've been talking about.
Okay, so if you take a query, and here I've changed the schema yet again, this Reuters table is one you'll be working with in the homework. I ran a query here and I've explained it, and what SQL Server Management Studio gives back to me is a little algebraic tree, kind of like the ones I've been drawing here, just, you know, in PowerPoint.
Okay, and so this one says a hash match is going to be used to implement this join condition. This one's kind of a complicated join condition, for a reason I'm not going to explain right now. But it has two leaves, and they get joined with this thing called a hash match inner join. Okay, so this is very much like the hash table example I gave on the previous slide.
But I want you to take a look at something. So here I've taken the exact same query, but I've added an extra condition where I'm only looking for terms equal to 'parliament'. I probably should explain this schema a little bit. So the Reuters dataset gives you term frequencies. You have three columns: doc ID, or let's just say doc, term, and frequency, where the frequency is how often that term appears in that document. Okay? And so this is the table you'll be looking at.
Right. And so here what I've said is: I'm looking for pairs of terms that co-occur in a single document; that's the previous query I was looking at. And now I've said, well look, I don't want all pairs of terms, I only want terms that co-occur with the term 'parliament'. Right, so perhaps 'lawyer' co-occurs with 'parliament' frequently. So I'm looking for all the terms that co-occur in some document with 'parliament'; that's what this predicate is expressing. Okay?
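Written out as SQL (held in a Python string here), the query might look roughly like this; the table name frequency and the column names doc and term are my assumptions based on the schema just described.

```python
# A sketch of the co-occurrence query described above. The table name
# "frequency" and the column names (doc, term) are assumptions.
COOCCUR_SQL = """
SELECT a.term AS term_a, b.term AS term_b
FROM   frequency a
JOIN   frequency b ON a.doc = b.doc   -- the two terms share a document
WHERE  b.term = 'parliament'          -- the extra predicate added here
"""
```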
So now, what I want you to notice is that when I explain this query, I get a different physical plan. The logical plan looks the same: it's still got scan, scan, and a join. But the algorithm to compute the join has changed, and now it's this thing called nested loops. And nested loops corresponds exactly to the pseudocode from before; that's why they call it nested loops, the outer loop and the inner loop. So it's exactly the same thing. And so the optimizer chose to do this nested loops plan even though we argued that it was an n-squared algorithm and it probably wouldn't be chosen very often.
So why was it in this case? So, if you think about it, one of the sides of this join is only dealing with the occurrences of the term 'parliament' in a document, which is a very small relation. And since it's a very small relation, this nested loops algorithm can be very, very efficient, and faster than dealing with the overhead of actually constructing this hash table, or constructing some data structure.
Okay.
So the main takeaway here, as opposed to the details, is that different physical algorithms are appropriate at different times. And thanks to declarative languages and thanks to algebraic optimization, the programmer doesn't have to worry about any of that. They don't have to make that choice. Okay, so this is a very, very powerful idea. You just express the query and the database does the rest.
Alright.
So, fine.
And just to point out, this is not just something unique to SQL Server; you can generate these kinds of algebraic plans in Postgres by using EXPLAIN. And in fact they look kind of nicer, and here are these hash joins again. This actually shows you, whoops, excuse me, this shows you where it's building a hash table as step one, and then probing it as step two. And same thing here.
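If you want to do this programmatically, here's a minimal sketch using psycopg2; the connection string and the table and column names are placeholders.

```python
import psycopg2

# A minimal sketch of running EXPLAIN from Python with psycopg2.
# The connection string and table/column names are placeholders.
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("""
    EXPLAIN
    SELECT a.term, b.term
    FROM frequency a JOIN frequency b ON a.doc = b.doc
    WHERE b.term = 'parliament'
""")
for (line,) in cur.fetchall():  # each row is one line of the plan text
    print(line)                 # e.g. "Hash Join  (cost=...)"
```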

And this is another operator that we


didn't talk about where you are, say
you're going to, count all the records
that match them the for [LAUGH], count
all the members of some group, I'll put
it that way.
And so the hash here is on group ID.
And you can apply aggregate functions to
the rest of it.
But I shouldn't give such a high-level
view of that without talking about it
more, so let me skip that altogether.
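That said, just for intuition, here's a rough sketch of what a hash aggregate does, counting the members of each group; an illustration of mine, not how any particular engine implements it.

```python
from collections import defaultdict

# A sketch of a hash aggregate: roughly COUNT(*) ... GROUP BY group_key.
def hash_count(records, group_key):
    counts = defaultdict(int)
    for r in records:
        counts[r[group_key]] += 1  # hash on the group id, bump its count
    return dict(counts)
```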
Okay. So fine. So the algebra really does exist; you can look at it directly just by using the keyword EXPLAIN, and I advise you to do so. If you work with databases, you should be using EXPLAIN all the time to try to understand what's going on. Alright.
Another point I'll make is just that this matters. This example is not directly from SQL, and in fact it's not from a commercial database; it's from some research that we do in my group, but the point is the same. These are actually different physical plans for the exact same query. And in fact, here I'm doing something in parallel, so there's also a number of processors being applied. And as you go from four to 16 processors, the times go down a little bit, not as much as we'd like, actually; they should be going down quite a bit more. But the point is that each one of these plans takes a very different amount of time. Well, these two are kind of the same, but the difference is pretty important.
And so ignoring these opportunities and
sticking with only the plan that the
programmer specifies would be a big
mistake.
Okay. And then another illustration of this, one that's a little bit hard to stare at, but let me give it a whirl and try to explain what's going on here. This is some very nice work by Haritsa et al. in VLDB 2010, and there's a whole series of papers on this work. They tried to visualize the space of possible query plans, and so the two axes here, well, this is all for a single query, but the parameters to that query are changing.

And so this axis in fact says something about the supplier account balance, and this one is a parameter on sort of the extended price, and they change the values of these parameters in the query. So imagine the same syntax, the same SELECT star FROM something, something, WHERE some condition equals extended price and some other condition equals account balance.
And just by varying those two knobs, you get this really rich tapestry of different plans being selected by the optimizer. So each color in this space represents a different query plan, a different algebraic query plan, selected by the optimizer.
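As a sketch of what such a parameterized query could look like (the column names follow the TPC-H schema, which this line of work uses, but the exact query Haritsa et al. vary is an assumption of mine):

```python
# A sketch in the spirit of the plan-diagram experiments. Column names
# follow TPC-H (l_extendedprice, s_acctbal); the exact query is assumed.
PLAN_DIAGRAM_SQL = """
SELECT *
FROM   lineitem l JOIN supplier s ON l.l_suppkey = s.s_suppkey
WHERE  l.l_extendedprice <= %(price)s
  AND  s.s_acctbal       <= %(balance)s
"""
# Sweeping (price, balance) over a grid and explaining the query at each
# point is what paints the colored regions in the plan diagram.
```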
Okay, and so I think the takeaway here is that there is a very complex decision being made by the database, and necessarily so. These different plans actually matter. They don't show that here, but you can actually show that the databases tend to do a pretty good job of finding the right plan.
And, you know, I argued on the last slide that this can actually matter, that the difference in time can be pretty significant. Okay, so leaving this kind of complexity up to the programmer can be a big source of loss. Right, hiding this complexity is a huge, huge win.
