2 - 7 - SQL For Data Science - Interpreting Complicated SQL (12-12)

[MUSIC].
Okay.
So now I want to talk about how to
interpret, or give some examples of how
to interpret, SQL statements, sort of, in
terms of relational algebra.
But, we're not going to actually write
out the plans.
But, I want to give you some experience
staring at what may seem sort of
complicated and kind of teasing out
what's actually going on here.
And so for people that have spent a lot
of time around databases and SQL, these
may or may not seem particularly
complicated, but if you're just starting
out, they probably do.
So in this first example [SOUND] what do
we see here?
Well, what you want to do when you're
sort of staring at something that may
seem sort of hairy is, you know, look
for the from clause here.
And so in this case, it's a little funny,
right, because we see that the from
clause does not have a table name
mentioned, it has a nested query within
it.
And we remember that that's perfectly
fine because of this closure property of
the underlying algebra.
We know that any relational algebra
expression, and therefore any SQL
statement, is going to return a table.
And, it operates on tables.
So, if you can operate on tables, if
you're operating on tables, and you know
you return a table, then you can sort of
chain these operations together, and you
have this nice closure property.
So, we know that we're allowed to query
derived results just like we're allowed
to query base tables.
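As a minimal sketch of that closure property, the snippet below runs a nested query through Python's built-in sqlite3 module. The table `readings` and its columns are hypothetical, chosen only for illustration; they are not the tables from the slide.

```python
import sqlite3

# Hypothetical table and columns, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# The inner SELECT returns a table, so the outer SELECT can query it
# exactly as if it were a base table.
rows = conn.execute(
    """
    SELECT sensor, doubled
    FROM (SELECT sensor, value * 2 AS doubled FROM readings) AS derived
    WHERE doubled > 5
    ORDER BY sensor
    """
).fetchall()
print(rows)
```

The outer query never needs to know whether `derived` is a stored table or the result of another query.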
Okay, and so in this case we're querying
a derived result.
Now, well also, let's go down to the
other from clause.
Well, here we see another nested query,
another layer of nesting, again,
perfectly fine.
And one more layer, now we see this table
here.
And so where this data came from was a
sensor that was mounted underneath an
oceanographic research vessel that was
collecting measurements of a variety of
different variables.
Here a few of them are mentioned,
fluorescence, oxygen, nitrates, and
there are several more.
And then this is latitude and longitude
where the ship actually was located at
that point in time.
And they're also tagged with this time
stamp.
So they're tagged with
latitude/longitude, time stamp, and a
bunch of measured variables.
And so what this operation is actually
doing is aggregating, binning, these
measurements into 5-minute windows.
Okay.
And so we can see how that's done.
So in this inner query, we call this
function to cast the timestamp to a
float.
That's not particularly important.
And here we see this trick where we just
use a constant value right there in the
select clause, and so what that does is
sort of append a new column with the
value 5 for every record.
Okay.
And I'll say maybe why this was done in
this particular case in a moment.
And then in the next layer up, we see
this kind of hairy expression involving
binsize twice.
And what it's doing is rounding down the
timestamp to the nearest 5 minute window.
Okay, so, you know, 6 minutes and 32
seconds becomes 5 minutes, and 11 minutes
and 29 seconds becomes 10 minutes, and so
on.
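The rounding step can be sketched as follows. This is an illustration under assumptions, not the query from the slide: the table `obs`, the column `ts_min` (elapsed minutes stored as a float), and the use of `CAST ... AS INTEGER` as the rounding mechanism are all hypothetical. Note that `binsize` appears twice in the expression, once to divide and once to multiply back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ts_min is a hypothetical column holding elapsed minutes as a float.
conn.execute("CREATE TABLE obs (ts_min REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?)",
    [(6 + 32 / 60,), (11 + 29 / 60,)],  # 6 min 32 s and 11 min 29 s
)

# Inner query: append the constant 5 as a column named binsize.
# Outer query: divide by binsize, truncate to an integer, multiply back.
# For non-negative values, truncation is the same as rounding down.
rows = conn.execute(
    """
    SELECT CAST(ts_min / binsize AS INTEGER) * binsize AS binid
    FROM (SELECT 5 AS binsize, ts_min FROM obs) AS with_binsize
    ORDER BY ts_min
    """
).fetchall()
print(rows)
```

The two sample timestamps land in the 5-minute and 10-minute bins, matching the example in the lecture.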
Okay.
And in both cases, by the way, we see
this star here meaning that all other
columns are going to be passed through.
And finally in the outer query, we have
bin id which we computed here.
Notice that we've got this renaming
operator.
We can have this complicated expression
and just give it a nice, convenient name.
So you got that passed through, and then
we compute the average of these other
values.
The average latitude and the average
longitude within that 5-minute window as
well.
Okay.
So, given that we're doing an average, we
should expect to see a group by, and in
fact, we do.
We're grouping on the binid, which would
make sense.
And then we happen to be sorting by binid
just to make sure that the records come
out in time stamp order.
Because, perhaps, some application
requires it that way.
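Putting the three layers together, a query of the shape described above might look like the sketch below. The table `obs`, its columns, and the sample values are assumptions made for illustration; the real query on the slide also casts a timestamp and averages more variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: elapsed minutes, ship position, one measured variable.
conn.execute("CREATE TABLE obs (ts_min REAL, lat REAL, lon REAL, oxygen REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?, ?)",
    [
        (1.0, 47.0, -122.0, 6.0),
        (4.0, 47.5, -122.5, 8.0),
        (6.5, 48.0, -123.0, 5.0),
        (11.0, 48.5, -123.5, 4.0),
        (12.0, 49.5, -124.5, 6.0),
    ],
)

query = """
SELECT binid,
       AVG(lat)    AS avg_lat,
       AVG(lon)    AS avg_lon,
       AVG(oxygen) AS avg_oxygen
FROM (
    -- middle layer: compute the bin and pass all other columns through
    SELECT CAST(ts_min / binsize AS INTEGER) * binsize AS binid, *
    FROM (
        -- inner layer: append the constant bin size to every record
        SELECT 5 AS binsize, * FROM obs
    ) AS with_binsize
) AS binned
GROUP BY binid
ORDER BY binid
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row is one 5-minute window with the averaged position and measurement, in time order.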
Okay, so why is this?
Why is this sort of overly complicated?
Why not just collapse this all into one
expression?
Well, you could, but it's for the same
reason that you might sort of abstract
things or refactor things in an
imperative language.
There's a little bit of software
engineering being applied here, so that
this complicated expression can be reused
in multiple places.
In this case, it's only being used once,
so you could perhaps move it there.
But it sort of separates 2 different
blocks of logic.
Okay.
So, in this slide, I've color coded the,
you know, 3 different blocks of logic.
Red, blue, and green.
So you see the layers of nesting.
But the main thing I want you to take
away is, one, that nesting is perfectly
fine.
You may see it, and there's no need to,
you know, worry when you see it being
used in practice.
And second, just to kind of convey that
as you start to do more and more
analysis in SQL, more complicated
analysis, there are ways to kind of
refactor the complicated queries so they
don't necessarily look so complicated.
Another thing that you can do here, that
we will talk about perhaps in the next
segment, is save this result as a view,
right?
Give it a name, and then you can refer to
it in the outer query just as a table.
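A minimal sketch of that idea, assuming a hypothetical table `obs` and a view named `binned` (neither name comes from the slide): the binning logic is stored once under a name, and the outer query then reads from it like any table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: elapsed minutes and one measured variable.
conn.execute("CREATE TABLE obs (ts_min REAL, oxygen REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1.0, 6.0), (4.0, 8.0), (6.5, 5.0)],
)

# Give the derived result a name by saving it as a view.
conn.execute(
    """
    CREATE VIEW binned AS
    SELECT CAST(ts_min / 5 AS INTEGER) * 5 AS binid, oxygen
    FROM obs
    """
)

# The outer query refers to the view just as it would to a table.
rows = conn.execute(
    "SELECT binid, AVG(oxygen) FROM binned GROUP BY binid ORDER BY binid"
).fetchall()
print(rows)
```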
Okay, I'll talk more about that next
time.
But these are some of the tricks you can
play when you're sort of working in SQL.
And, in fact, you know, these are things
you may see people do, even if you're not
planning on doing as much SQL authoring
yourself.
Okay.
Fine.
So here's another example.
Same thing.
The first step is to look at the from
clause and see what you see.
And here we see two tables, and there's
this keyword inner join.
Now the join is explicit, and the join
condition here is this, where we have
some sort of ID and some sort of other
ID.
By the way, one of the other things that
I wanted to kind of do here is to show
that you can kind of analyze the
structure of a SQL statement to
understand what is going on, even if you
really have no idea what the data is all
about.
And it's just kind of helpful to do so.
This is something you'll be presented
with in a data science context.
Someone will say, look, you know, we need
to predict what the average sales for
next month are going to be, and you say,
okay, great, give me the data.
And they'll say, well I don't know.
It's in some database over there.
Right.
And you'll go over there and talk to the
DBA, or maybe there won't even be a DBA
and you'll just sort of analyze what's
going on in that database on your own.
So, that means staring at the schema,
which we haven't done.
But it also means staring at certain
people's queries, [INAUDIBLE].
Okay.
So having a little bit of skill at
analyzing these complicated queries can
be important.
Okay, so far, this looks like a simple
join condition, but if you look at the
WHERE clause, I want to make sort of a
point here:
the table referred to as x, this hot
spot deserts table, and the table
referred to as w, are involved in
additional conditions down here.
Okay.
And so, these are actually join
conditions as well, right.
And even though only this one is
explicitly listed in the ON clause, you
know, the inner join is on this
particular condition, anything that
applies a condition to attributes from
both tables is a join condition.
So this whole thing is all sort of one
complicated join condition.
Alright.
So I went in a little bit different
order than the order I wanted to go in.
But let me backtrack and come back to
that.
So hold that thought.
Popping back to the top.
This other piece of complicated logic
here.
Well, this is a particular syntax that's
available in SQL called, you know, a case
statement.
And it acts about the same way as the
case [INAUDIBLE] in other languages.
So that's not too bad, but in particular
you can just ignore all this logic.
You can just collapse all this down and
say, well look, there is some function
that's computing len_overlap.
Where do I get that name?
Well, that's what they named the result
of this complicated expression:
the length of the overlap.
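To make the CASE syntax concrete, here is a sketch of one way an expression named `len_overlap` could be written. The actual expression on the slide is not in the transcript, so the table `pairs`, its columns, and the min-end-minus-max-start formula are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per pair of intervals (x and w).
conn.execute(
    "CREATE TABLE pairs (x_start INTEGER, x_end INTEGER, "
    "w_start INTEGER, w_end INTEGER)"
)
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?, ?, ?)",
    [(10, 20, 15, 30), (5, 50, 15, 30)],
)

# Each CASE picks one of two values per row, like a conditional in other
# languages: the smaller end minus the larger start. The whole expression
# is then renamed to len_overlap.
rows = conn.execute(
    """
    SELECT
        (CASE WHEN x_end < w_end THEN x_end ELSE w_end END)
      - (CASE WHEN x_start > w_start THEN x_start ELSE w_start END)
        AS len_overlap
    FROM pairs
    ORDER BY x_start
    """
).fetchall()
print(rows)
```

Once the expression has a name, the rest of the query can be read as if `len_overlap` were a single column.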
And in fact, you know, I happen to know a
little bit about where this query came
from.
What they're working on is genetic
sequences.
And you may be able to deduce that if you
stare at this.
There's SNP region, where SNP stands for
single nucleotide polymorphism.
And there's strain, which gives you a
hint.
And bp is base pair.
And noncoding regions, noncoding
positions, gives you a bit of a hint,
if you've done any work in
bioinformatics.
But if you haven't, that's okay.
The point is, len_overlap appears to be
the name of this thing.
So it's almost like there exists a
function len_overlap that involves these
attributes.
And we don't even care what's inside it.
It's just a function.
So that helps us sort of see the
underlying simplicity of this query in
this case.
Okay.
Then, back to this other complicated
expression.
Let me show you a little about what's
going on here, just 'cause I think it's
kind of a fun example.
If you break these out into these 3
conditions, and you happen to know
something about where this data is coming
from, you can see that what this is
saying is, well look, we want the start
base pair from the x table to be greater
than the start base pair of the w table.
And we want the end base pair from the x
table to be less than the end base pair
from the w.
So, it's this picture, right?
We want the blue x interval, right?
It's a sequence.
The x table is filled with ranges, all
right?
Start and end ranges, intervals.
And the blue interval needs to be
completely contained within the red
interval.
You know?
Or the red interval needs to be
completely contained within the blue
interval.
Or the red interval needs to straddle
the x.start base pair,
which you can deduce if you stare at
this condition.
Okay, so they're doing kind of interval
logic right there in SQL.
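The three cases just described can be sketched as a single WHERE condition. The table names `x_regions` and `w_regions`, the columns `start_bp` and `end_bp`, and the choice of inclusive comparisons are assumptions; the slide's actual identifiers and operators are not in the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each row is an interval of base-pair positions.
conn.execute("CREATE TABLE x_regions (id INTEGER, start_bp INTEGER, end_bp INTEGER)")
conn.execute("CREATE TABLE w_regions (id INTEGER, start_bp INTEGER, end_bp INTEGER)")
conn.execute("INSERT INTO x_regions VALUES (1, 10, 20)")
conn.executemany(
    "INSERT INTO w_regions VALUES (?, ?, ?)",
    [
        (1, 5, 30),   # contains the x interval
        (2, 12, 18),  # contained within the x interval
        (3, 5, 12),   # straddles x.start_bp
        (4, 40, 50),  # disjoint from the x interval
    ],
)

matches = conn.execute(
    """
    SELECT w.id
    FROM x_regions AS x, w_regions AS w
    WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)    -- x inside w
       OR (w.start_bp >= x.start_bp AND w.end_bp <= x.end_bp)    -- w inside x
       OR (w.start_bp <= x.start_bp AND w.end_bp >= x.start_bp)  -- w straddles x.start_bp
    ORDER BY w.id
    """
).fetchall()
print(matches)
```

Only the disjoint interval is rejected; the other three each satisfy at least one of the three cases.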
And so the point of maybe looking at
this in enough detail to understand
what's going on is a couple of things.
One is this point about the join
condition: even if you don't understand
what's going on, you can sort of see
the structural details to understand
that it's just a join condition.
But also that you can actually do
certain kinds of analytics directly in
SQL.
This is a fairly nontrivial operation,
and many people, especially people who
have either not had good experience with
databases or have heard from their
friends that have not had good experience
with databases, would consider this sort
of impossible.
And it's not impossible, and nor is it
even a bad idea.
It's actually kind of a natural thing to
do.
So analytics in the database should be
a part of your bag of tricks.
You know, the first step shouldn't be to
pull everything out of the database and
use imperative code.
Okay.
This example alone, I would guess,
doesn't convince you of that claim I just
made, you know, that keeping the analysis
in the database is a good idea.
But there's going to be a sequence of
arguments that I make, probably
throughout the quarter, here and there.
Okay.
Meanwhile, I should put the caveat that,
you know, I'm not going to be pushing
databases as the ultimate solution to
data science by any stretch of the
imagination.
But there is a role for them to play.
Okay.
Alright, and so now, you know, we
collapsed this into one function.
We can also collapse the join condition
into these 2 things.
It must match on this CHR field.
And then it must have this kind of
overlaps condition be true that we saw in
the last slide.
And so this is just an example of a theta
join where there's some non-trivial
function being applied on each pair of
tuples.
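A sketch of such a theta join follows, under stated assumptions: the tables `x_regions` and `w_regions`, the columns `chr`, `start_bp`, and `end_bp`, and the simplified overlap test are all hypothetical stand-ins for what was on the slide. It also shows that, for an inner join, a condition on attributes from both tables gives the same result whether it is written in the WHERE clause or in the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: intervals of base-pair positions on a chromosome.
conn.execute("CREATE TABLE x_regions (chr TEXT, start_bp INTEGER, end_bp INTEGER)")
conn.execute("CREATE TABLE w_regions (chr TEXT, start_bp INTEGER, end_bp INTEGER)")
conn.executemany(
    "INSERT INTO x_regions VALUES (?, ?, ?)",
    [("chr1", 10, 20), ("chr2", 10, 20)],
)
conn.executemany(
    "INSERT INTO w_regions VALUES (?, ?, ?)",
    [("chr1", 15, 30), ("chr1", 40, 50), ("chr2", 100, 200)],
)

# Form 1: equality in ON, the overlap test in WHERE.
in_where = conn.execute(
    """
    SELECT x.chr, x.start_bp AS x_start, w.start_bp AS w_start
    FROM x_regions AS x INNER JOIN w_regions AS w ON x.chr = w.chr
    WHERE x.start_bp <= w.end_bp AND w.start_bp <= x.end_bp
    ORDER BY x.chr, x_start, w_start
    """
).fetchall()

# Form 2: the whole thing written as one join condition.
in_on = conn.execute(
    """
    SELECT x.chr, x.start_bp AS x_start, w.start_bp AS w_start
    FROM x_regions AS x INNER JOIN w_regions AS w
      ON x.chr = w.chr
     AND x.start_bp <= w.end_bp
     AND w.start_bp <= x.end_bp
    ORDER BY x.chr, x_start, w_start
    """
).fetchall()

print(in_where)
print(in_on)
```

Both forms test every pair of tuples against the same predicate, which is what makes this a theta join rather than a plain equality join.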
Let me stop there and pick up on user
defined functions next time.
