5 - 6 - Recap and Big Data (11-39)

[MUSIC].
Okay, so where are we now?

So we motivated the discussion of
statistical inference and estimation by
bringing up this so-called decline
effect, where the effect size of
scientific results seems to be going down
over time and reproducibility is
suffering.
And so we gave some reasons for this:
publication bias, mistakes and
fraud, and the multiple hypothesis
problem.
And so we used these motivations, these
scenarios, to bring up various topics and
techniques.
So we talked a little bit about basic
statistical inference, where I just gave
you an overview and that's it.
We talked about effect size.
We brought up the specific term
heteroskedasticity.
For fraud detection, we brought up
Benford's Law.
And then we talked about multiple
hypothesis testing, which is
perhaps the most important part of the
discussion. We talked about the familywise
error rate and the false discovery rate
and gave correction procedures for both
of these.
Okay, and so this hopefully was a tour of
not just some basic concepts but also
some, if not advanced, at least things
that don't necessarily come up in a
Stats 101 course,
but that I think are pretty important for us
data scientists to understand.
In fact, there's a
view amongst statisticians that these
topics are not very well understood by
data scientists.
And in fact, they'll point to typical
machine learning classes where understanding
the population, understanding the various
biases, and understanding how to correct
for the problems that can arise are not
taught at all, and it's more of a
blind application of
algorithms.
So I think it's pretty important to go
over this choice of topics now.
So, what about big data?
What changes?
Well, Brad Efron,
who's a world-renowned statistician,
describes it this way.
He says classical statistics was fashioned
for small problems: a few hundred data
points at most, and just a few parameters.
And the bottom line is that we've entered
an era of massive scientific data
collection, with a demand for answers to
large-scale inference problems that lie
beyond the scope of classical statistics.
And so this suggests that something is
changing in the era of big data.
Now, what can go wrong here?
Well, as we've talked about, you can find
spurious relationships in big data, and so
this is a picture that I got from a
colleague recently, that was emailed to
him.
It's a plot that someone took the
time to make, which may or may not have been
a joke, but as you can see here, it says
"Internet Explorer versus the murder
rate." Okay.
And so this is the murders in the US in
blue, along with the market share of
Internet Explorer in green.
And the corresponding
discussion that went along with this
plot was somewhat
amusing,
talking about various theories for why
the murder rate might go up as
Internet Explorer market share goes up, or why
the murder rate goes down as
Internet Explorer market share also goes down.
But the point here is that without some
common sense, or without the [UNKNOWN]
application of understanding the scenario
of the problem, you can make
discoveries of this form.
Okay.
Alright, and so other examples that have
been talked about in the literature,
again brought up as bad
examples: the number of police officers
and the number of crimes.
So, why might these two things be
correlated?
You know, maybe police officers cause
crimes.
Well, no, it's probably because in
densely populated areas there
are both more police officers and
more crimes.
By the way, just to point out again,
these authors here are not
authors that made these claims.
These are authors that brought up the
mistake.
Okay, amount of ice cream sold and deaths
by drowning.
Why would these things be correlated?
Well, there's a seasonality,
right?
In the summertime you sell more ice cream
and more people go swimming.
And then one is stork sightings
and population increase, used as
evidence that storks do indeed bring
newborns to families.
Well again, in more densely populated
areas there are more people to actually
see the storks, and so you get an
increase in sightings.
So these kinds of procedures to remove
bias, and to understand
the population you are sampling from and
the possibilities as far as
correlations,
are taught in statistics
programs, but are not typically taught in
machine learning classes.
Okay.
So what does that have to do with big
data?
Well, there
might be a view that there's
a curse of big data. As Vincent
Granville put it, this is the fact that when
you search for patterns in very, very
large data sets, with billions or
trillions of data points and thousands of
metrics, you are bound to identify
coincidences that have no predictive power.
And so the example he gives is to
consider stock prices for some large
number of companies over a one-month
period.
And then you check for correlations
between all pairs.
And actually it doesn't stop there, because
that would be over the same exact one-month
period.
But you might want to account for lags.
Maybe the stock price of Google
affects, a few days later, the stock price
of smaller companies that depend on it.
So now you're not just doing the roughly
500 squared comparisons, checking the pairwise
correlation of these time series, but you
are also checking the pairwise, slightly
offset ones. Okay, and so these are the
cross-correlation procedures.
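
As a rough illustration of the lagged check described here, the following Python sketch computes the correlation between one series and a later-shifted copy of another, and counts how many comparisons a full search would involve. The function names `lagged_correlation` and `count_comparisons`, and the choice of a maximum lag of 5 steps in the usage note, are assumptions made for illustration; they do not come from the lecture.

```python
import numpy as np


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag], for lag >= 0.

    Hypothetical helper: asks whether x "leads" y by `lag` time steps.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(x), len(y)) - lag  # number of overlapping points
    if n < 2:
        raise ValueError("not enough overlapping points for this lag")
    return float(np.corrcoef(x[:n], y[lag:lag + n])[0, 1])


def count_comparisons(num_series, max_lag):
    """Number of correlations checked across all pairs and lags.

    At lag 0 the pair (a, b) is the same as (b, a); at each positive lag
    the direction matters, so ordered pairs are counted.
    """
    unordered_pairs = num_series * (num_series - 1) // 2
    ordered_pairs = num_series * (num_series - 1)
    return unordered_pairs + max_lag * ordered_pairs
```

For example, `count_comparisons(500, 5)` gives 1,372,250 correlations to inspect, compared with 124,750 when only the unlagged pairs are checked.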
So, very basic time series analysis: this
is just to measure the correlation, and I
just wanted to throw the formulas up here,
where the covariance of two data sets is
measured this way.
Alright, so you take the data point x_i and
subtract the mean of x,
and multiply that by y_i minus the mean
of y.
And you add all that up, and that's the
covariance.
And then you divide the covariance by the
standard deviation of each data set,
multiplied by each other.
And so this gives you the correlation.
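
The spoken description above can be sketched in plain Python as follows. The slide's exact formulas are not in the transcript, so the division by n (the population convention) is an assumption; any normalization works for the correlation as long as the covariance and the standard deviations use the same one, because the factor cancels.

```python
import math


def covariance(x, y):
    """Average of (x_i - mean_x) * (y_i - mean_y).

    Assumption: normalized by n (population convention).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


def std_dev(x):
    """Standard deviation, using the same normalization as covariance()."""
    return math.sqrt(covariance(x, x))


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))
```

Two series that move in lockstep, such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]`, have a correlation of 1, and a series paired with its mirror image has a correlation of -1.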
Okay.
So, what does this experiment look like?
Well, I generated this plot by running
random walks for stock prices that start
at $10.
They all start at the same
point, and at each step, which is an hour
of simulated time,
I draw a sample from a normal
distribution where the mean is the
current stock price
and the standard deviation is one
percent of that current stock price.
Okay.
And this is not especially defensible,
but you can see, just sort of visually,
that it does generate stock-price-looking
things.
And you do get some variance here.
Alright.
So, clearly this is random.
This plot shows the number of correlations
at a level of 0.9,
which is a pretty strong
correlation, as a function of the number
of stock prices tracked.
So I went up from 10 to 100; I didn't
go all the way up to 500, which is what
Vincent Granville described in the
thought experiment.
This is the number of spurious
correlations you find. Okay, and
this is also not doing the lagged
cross-correlation; this is just
directly [INAUDIBLE] the correlation of
these two [INAUDIBLE] of time series
across this month.
And that's a pretty long period,
across a month.
So what's the point?
Well, [INAUDIBLE] gives more opportunities
for spurious findings.
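
A minimal sketch of an experiment along these lines, not the lecturer's actual code: each simulated price starts at $10, and every hourly step draws the next price from a normal distribution centered on the current price with a standard deviation of 1% of it. The 720 steps (30 days of 24 hours), the fixed seed, and counting a pair when the absolute correlation is at least 0.9 are assumptions filled in for illustration.

```python
import numpy as np


def simulate_prices(num_stocks, num_steps=720, start=10.0, rel_sd=0.01, seed=0):
    """Independent random walks, one row per simulated stock.

    Assumption: 720 hourly steps stands in for "one month".
    """
    rng = np.random.default_rng(seed)
    prices = np.empty((num_stocks, num_steps + 1))
    prices[:, 0] = start
    for t in range(num_steps):
        current = prices[:, t]
        # Next price: normal draw with mean = current price, sd = 1% of it.
        prices[:, t + 1] = rng.normal(loc=current, scale=rel_sd * current)
    return prices


def count_strong_correlations(prices, threshold=0.9):
    """Count pairs of rows whose absolute correlation meets the threshold."""
    corr = np.corrcoef(prices)  # rows are treated as variables
    upper = np.triu_indices(prices.shape[0], k=1)  # each pair counted once
    return int(np.sum(np.abs(corr[upper]) >= threshold))


if __name__ == "__main__":
    for n in range(10, 101, 10):
        found = count_strong_correlations(simulate_prices(n))
        print(f"{n:3d} stocks: {found} pairs with |r| >= 0.9")
```

Every series here is generated independently, so any pair that clears the threshold is spurious by construction. The exact counts depend on the seed and on the assumptions above.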
Okay.
Now, it's not all bad news.
So, how is big data different?
Well, there's a notion of big p versus
big n,
where big p is sort of the number of
columns
and big n is the number of rows.
And in this experiment we just did with
the time series,
this was sort of a big p scenario.
We looked at an increasing
number of companies, and then we looked at
all possible correlations between them, so
this was growing sort of quadratically.
Okay.
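
To make the quadratic growth concrete, here is a small Python sketch that counts the distinct pairs among p columns. The helper name `num_pairs` is just for illustration.

```python
from math import comb


def num_pairs(p):
    """Number of distinct pairs among p columns: p * (p - 1) / 2."""
    return comb(p, 2)


if __name__ == "__main__":
    for p in (10, 100, 500):
        print(p, num_pairs(p))  # 45, 4950 and 124750 pairs respectively
```

Going from 10 to 100 columns multiplies the number of columns by 10 but the number of pairwise comparisons by 110.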
So the thing about big data, though, is that the
marginal cost of increasing the number of
records is essentially zero.
It's gotten cheaper and cheaper and
cheaper to collect data.
Okay.
Great.
Now that's very, very powerful, right?
The increase in the number of
records adds statistical power and helps
us get lower and lower
p-values, but it also amplifies bias.
If you're collecting the wrong data, if
you're looking at the wrong population,
you're going to make so-called
discoveries that are simply false.
And so, for example, if you log all the clicks
to your website, you have a very, very
large data set and you can very precisely
model user behavior.
But that would only model your current
users,
when perhaps the
whole point of modeling
user behavior is to try to attract new
customers.
Well, for example, if
your current user base is
early adopters, you're only going to be
modeling their behavior.
You haven't actually sampled the
population at large.
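
The early-adopter problem can be sketched with a small simulation. All the numbers below (a metric averaging 5.0 for early adopters and 2.0 for everyone else, with early adopters making up 10% of the population) are hypothetical values chosen for illustration, not figures from the lecture. The point is that adding rows shrinks the standard error while the estimate stays centered on the wrong value.

```python
import numpy as np

# Hypothetical population, for illustration only.
EARLY_MEAN = 5.0     # average metric among early adopters
GENERAL_MEAN = 2.0   # average metric among everyone else
EARLY_SHARE = 0.1    # early adopters as a fraction of the population
POPULATION_MEAN = EARLY_SHARE * EARLY_MEAN + (1 - EARLY_SHARE) * GENERAL_MEAN


def biased_estimate(n, seed=0):
    """Estimate the mean from n records drawn only from early adopters.

    Returns (sample mean, standard error of the mean).
    """
    rng = np.random.default_rng(seed)
    sample = rng.normal(loc=EARLY_MEAN, scale=1.0, size=n)
    mean = float(sample.mean())
    std_err = float(sample.std(ddof=1) / np.sqrt(n))
    return mean, std_err


if __name__ == "__main__":
    print(f"true population mean: {POPULATION_MEAN:.2f}")
    for n in (100, 10_000, 1_000_000):
        mean, std_err = biased_estimate(n)
        print(f"n={n:>9,d}  estimate={mean:.3f}  std err={std_err:.4f}")
```

More records make the estimate look more precise, but it stays near the early-adopter mean rather than the population mean, because the sampling frame never included the rest of the population.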
Another example is mobile data.
And this comes up in polling for, say,
the presidential election.
You're only sampling
people that have cell phones,
and this may or may not be the same
population you want to be
sampling.
Okay, this may ignore lower income groups
or different age groups.
You need to be careful about multiple
hypothesis tests as well, as we pointed
out.
So there's a fantastic comic from XKCD
that makes this point very, very clear,
where they sort of demonstrate that green
jelly beans cause acne.
Right, and the story here is that there are
20 different [SOUND] colors of jelly
beans, and at a p-value threshold of 0.05 [SOUND]
we do 20 experiments.
And sure enough, we find one of the colors
indeed causes acne.
But that would be expected purely by
chance.
And so I encourage you to look up that
comic.
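
The arithmetic behind the jelly bean story can be sketched as follows, assuming the 20 tests are independent and that no color has any real effect. Those two assumptions, and the function names, are illustrative rather than taken from the lecture. The last helper applies the Bonferroni adjustment, one standard correction for the familywise error rate.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of tests that cross the threshold by chance alone."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Familywise error rate for independent tests with no real effects."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Per-test threshold that keeps the familywise error rate below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    print(expected_false_positives(20, 0.05))
    print(prob_at_least_one_false_positive(20, 0.05))
    print(bonferroni_threshold(20, 0.05))
```

With 20 tests at 0.05, one false positive is expected on average, and the chance of seeing at least one is roughly 64%. Testing each color at 0.05 / 20 = 0.0025 instead brings the familywise rate back under 5%.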
And the other comment I'll make, that we
will probably come back to, is Nassim
Taleb's Black Swan events.
So these are things that are sort of
inherently unpredictable, or whose
distribution does not follow a
normal distribution, sort of a bell curve
distribution, where the tails of the bell
curve mean that extreme values become
exponentially more rare.
That's sort of the definition of the
normal distribution.
But in some cases, extreme values are not
exponentially less common.
They happen, okay?
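
To give a sense of how quickly the tails of a normal distribution thin out, this short Python sketch computes the probability of landing more than k standard deviations from the mean. The helper name is an assumption; the calculation uses the standard identity P(|Z| > k) = erfc(k / sqrt(2)) for a standard normal Z.

```python
import math


def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    for k in range(1, 7):
        print(f"beyond {k} standard deviations: {normal_two_sided_tail(k):.3e}")
```

Under a normal model a 6-sigma event has a probability of about two in a billion, which is why data that produce such events with any regularity are a sign that the normal assumption does not hold.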
And so the example that he uses in this
case is that if a
turkey were to model your behavior, it
would get increasingly more confident
that you mean it no harm,
and you mean it good will.
Every day you come and feed the turkey,
and every day you take care of it and you
look out for its well-being.
But then on the day before
Thanksgiving it gets slaughtered,
perhaps. And so that was Taleb's argument
for a Black Swan event.
A black swan itself refers to the fact
that people didn't believe black swans
existed and then
found out that they did, so it was an
unexpected event, okay.
All right, we'll talk more about that in
some detail.
[SOUND]
