
• Goal of Statistical Analysis

• Univariate Statistics
– Measure of location
– Measures of spread
– Measure of shape
– Tendency toward normal or lognormal behavior
– Data transforms
• Bivariate Statistics
– Correlation coefficient
– Linear regression
– Q-Q / P-P plots

9/9/2017 Thai Ba Ngoc – Faculty of Geology & Petroleum Engineering ‐ HCMUT 1


Outline of Statistical Analysis



Descriptive versus Inferential Process



Sampling
•A subset of the population
•Used to develop an understanding of the population's behaviour
•Most commonly used to predict future behaviour
•An area of specialization within statistics



Two Separate Approaches
•Sampling with replacement
–The number of members of the population does not change with the
sampling process
–The probability of occurrence of any event does not change when replacement
occurs
–Applies to most production, plant and economic decisions
•Sampling without replacement
–The number of members of the population changes with the sampling
process (i.e. some are removed)
–The probability of occurrence of any event changes when no replacement
occurs
–Most exploration programs and reservoir evaluations – though the impact
on the population of the latter is minor



Typical Data for Exploratory Data Analysis

Goals of statistical analysis:
• organize, explore and describe data
• condense and consolidate information
• share the learned knowledge



After-Class Exercise: Data-plotting

•Follow the instructions in handout EX‐ Log Stats‐PROB.doc with data in EX‐Log Stats‐DATA.xls



• We are not interested in the sample statistics themselves; we need the underlying population parameters
• A model is required to go beyond the known data
• Because earth science phenomena involve complex processes, they appear random. It is important to keep in mind that
actual data are not the result of a random process.



•Population (Sample Space): the range of all possible outcomes or
events that can possibly take place, expressed as a whole finite number
or assumed to be infinite in size.
•Sample: a single value or group of values of the elements within the
total population obtained by tests, interpretation of test results, or from
direct counting.
•Raw Data: data collected which have not been organized in some
numerical fashion.
•Statistics is the methodology of ordering raw data to answer specific
questions. It provides a process for analyzing raw data to draw
inferences about the population.
•A Statistic is a number computed from a sample by applying an
algorithm (or function). It quantifies a characteristic of the sample (e.g.
its mean) and is often used to estimate the corresponding population parameter.
• Geological sample: a specimen
• Statistical sample: a set of observations that represents a small fraction of the total number of potential observations
• Population: the total set of potential observations
• Sampled population: the population from which the sample is actually drawn
• Target population: the population of interest, about which we wish to make inferences and draw conclusions
• Statistical theory lets us make inferences, with known chances of error, from the sample to the sampled population
• Empirical and theoretical frequency distributions
– 14 wells cored, 1335 core samples
– Sampled population: Changnan Block, Changqin Gas Field, Ordovician formation
– 1 statistical sample
– 5-20 intervals
• An empirical frequency distribution is a table that summarizes numerical information.
• A theoretical frequency distribution (tfd) is an abstraction that represents all potential observations.
• The value of a tfd is that we can manipulate it mathematically.
• Observation: a numerical value obtained by measuring, counting or ranking, for example
– porosity of a core plug
– number of fractures in the wellbore image
– commercial HC reserves



• We would like to know the following about our data:
– measure of location
– measures of spread
– measure of shape
– tendency toward normal or lognormal behavior



Frequency Plots and Histograms
Given a set of data:
1. Look for the min & max values
2. Divide the range of values into a number of “sensible” class intervals (bins)
• 10 – 30 is good if there is enough data
3. Count the number of data values that fall within each interval (bin)
• Option: if there are extreme values, put all data below the upper bound of the first
class interval into the first bin and all data above the lower bound of the last class interval
into the last bin. State that you have done so!
4. Make a frequency table by dividing the count in each bin by the total number of data values
5. Make a histogram by plotting a bar chart of the data in the frequency table
(strictly, a histogram shows the counts rather than the relative frequencies)
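
The steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: equal-width bins between the minimum and maximum, and a made-up list of porosity values (`porosity` and `frequency_table` are hypothetical names, not part of the course material).

```python
def frequency_table(values, n_bins):
    """Count values in n_bins equal-width class intervals between min and max."""
    lo, hi = min(values), max(values)          # step 1: min and max
    width = (hi - lo) / n_bins                 # step 2: class interval width
    counts = [0] * n_bins
    for v in values:                           # step 3: count per bin
        # The maximum value is placed in the last bin rather than a new one.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    total = len(values)
    edges = [lo + i * width for i in range(n_bins + 1)]
    rel_freq = [c / total for c in counts]     # step 4: relative frequencies
    return edges, counts, rel_freq


# Hypothetical porosity values (fraction), for illustration only.
porosity = [0.13, 0.15, 0.16, 0.16, 0.17, 0.19, 0.20, 0.20, 0.21, 0.23, 0.24, 0.27]
edges, counts, rel_freq = frequency_table(porosity, n_bins=4)
for left, right, c, f in zip(edges, edges[1:], counts, rel_freq):
    print(f"{left:.3f} - {right:.3f}: count={c}, relative frequency={f:.3f}")
```

Plotting `counts` gives the histogram in the strict sense; plotting `rel_freq` gives the relative frequency plot.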



Note re Excel
•When plotting a bar chart, Excel does not know that the
categories on the X-axis are numbers
–So it just treats the class intervals used in the Frequency
functions as labels and puts them in under the middle of
the bar (default centred)
–So it's not always clear what interval the bar represents
•Best to create a separate label for each bin that
describes its upper & lower limits, e.g.
=A10&" – "&B10 gives a label with the value of A10 as the
lower limit and B10 as the upper limit
Histograms – class widths
•Class widths (bin sizes) are usually CONSTANT
–so the HEIGHT of each bar is proportional to the number
of values in it
•If class widths are VARIABLE
–the AREA of each bar is proportional to the number of
values in it
•For small samples, the shape of the histogram (and
therefore our decisions about population distribution) can
be very sensitive to the number & definition of the class
intervals



Sensitivity to Number / Size of Class Intervals (Bins) (for a large sample)



Cumulative Frequency Table
Define a series of critical values (or cut-offs) and count the number of values
less than each cut-off
– If you use cut-offs that are the same as the upper class limits for the histogram, you can simply accumulate the
histogram frequencies
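
As a small illustration of accumulating histogram frequencies, the sketch below uses the 17 bin counts from the porosity frequency table shown in these slides. The variable names are illustrative only.

```python
from itertools import accumulate

# Bin counts from the porosity frequency table (bins 1 to 17).
frequency = [354, 246, 167, 156, 99, 76, 73, 62, 33, 29, 16, 12, 8, 1, 0, 2, 1]

cumulative = list(accumulate(frequency))          # running total of counts
n = cumulative[-1]                                # total number of samples
cumulative_pct = [100 * c / n for c in cumulative]

for b, (c, p) in enumerate(zip(cumulative, cumulative_pct), start=1):
    print(f"bin {b:2d}: cumulative = {c:4d}, cumulative % = {p:6.2f}")
```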



Cumulative Histogram



Cumulative Histogram: Show as a bar chart, not
a line chart (it’s the number in an interval)



Cumulative Histogram using individual data points
• Unlike the histogram, the CUMULATIVE
frequency (or cumulative histogram) can
be plotted at the level of the data, i.e. use
the data points themselves as the cut-offs
– Just order the N data points from
minimum to maximum,
– give each a frequency of 1/N
– then plot each point against its
cumulative frequency
• This makes it particularly useful for
reading quantiles (see later)
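
A minimal sketch of this construction, using the twelve values from the box plot example later in the slides. It assigns cumulative frequency i/N to the i-th ordered value; other plotting positions such as (i - 0.5)/N are also in common use, so treat the choice here as an assumption.

```python
def empirical_cdf(data):
    """Return (value, cumulative frequency) pairs, each point carrying weight 1/N."""
    ordered = sorted(data)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
ecdf = empirical_cdf(data)
for value, cum in ecdf:
    print(f"{value:3d}  {cum:.3f}")
```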



*Which is most correct – imagine the CDF of the data points



• Mean m is the arithmetic average of the data values
• Median M is the midpoint of the observed values if they are arranged in
increasing order. Half the values are below the median and half are above
the median.
• Mode is the value that occurs most frequently
• Quartiles: in the same way that the median splits the data into halves, the
quartiles split the data into quarters



• Discrete random variable
• The arithmetic average of a statistical sample
• In fluid flow studies the effective permeability of a stratified sequence is the
arithmetic mean whenever the flow is parallel to the strata.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$


• f_i - frequency of the i-th group
• x_i - midpoint of the i-th group
• k - number of groups; n - total number of observations (the sum of the f_i)

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$


Bin   Frequency   Cumulative frequency   Relative frequency   Cumulative %
1     354         354                    26.52%               26.52%
2     246         600                    18.43%               44.94%
3     167         767                    12.51%               57.45%
4     156         923                    11.69%               69.14%
5     99          1022                   7.42%                76.55%
6     76          1098                   5.69%                82.25%
7     73          1171                   5.47%                87.72%
8     62          1233                   4.64%                92.36%
9     33          1266                   2.47%                94.83%
10    29          1295                   2.17%                97.00%
11    16          1311                   1.20%                98.20%
12    12          1323                   0.90%                99.10%
13    8           1331                   0.60%                99.70%
14    1           1332                   0.07%                99.78%
15    0           1332                   0.00%                99.78%
16    2           1334                   0.15%                99.93%
17    1           1335                   0.07%                100.00%


• Calculate sample mean using grouped data (the products below use the bin midpoints x_i and the relative frequencies f_i/n from the frequency table)

$$\bar{x} = \frac{\sum_i f_i x_i}{n}$$

x̄ = 0.5*0.2652 + 1.5*0.1843 + 2.5*0.1251 + 3.5*0.1169 + 4.5*0.0742 + 5.5*0.0569 + 6.5*0.0547 +
7.5*0.0464 + 8.5*0.0247 + 9.5*0.0217 + 10.5*0.0120 + 11.5*0.0090 + 12.5*0.0060 +
13.5*0.0007 + 15.5*0.0015 + 16.5*0.0007 = 3.25%
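
The grouped-data mean can be checked with a short script. This is a sketch that assumes, as the slide calculation does, that bin k has midpoint k - 0.5 (in porosity %); the counts are taken from the frequency table.

```python
# Bin midpoints 0.5, 1.5, ..., 16.5 and the bin counts from the frequency table.
midpoints = [0.5 + i for i in range(17)]
frequency = [354, 246, 167, 156, 99, 76, 73, 62, 33, 29, 16, 12, 8, 1, 0, 2, 1]

n = sum(frequency)                                             # total number of samples
grouped_mean = sum(f * x for f, x in zip(frequency, midpoints)) / n
print(f"n = {n}, grouped mean = {grouped_mean:.2f} %")
```

Working with the counts and dividing by n once avoids the rounding error that comes from using relative frequencies quoted to four decimals.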


• Continuous random variable
• Integral

$$\mu = E(x) = \int_{-\infty}^{\infty} x\, f(x)\, dx$$

[Figure: histogram (Frequency) with Cumulative % curve; x-axis: Porosity Range, bins from 0.13 to 0.27]


$$E(c) = c$$
$$E(x + c) = E(x) + c$$
$$E\left(\sum_i a_i x_i + \sum_j b_j y_j\right) = \sum_i a_i E(x_i) + \sum_j b_j E(y_j)$$

$$E(xy) = E(x)\, E(y)$$ when x and y are independent random variables


• For random variable x with p.d.f.

$$f(x) = 2\exp(-2x), \quad x > 0$$
$$f(x) = 0, \quad x \le 0$$

$$\mu = E(x) = \int_{-\infty}^{\infty} x\, f(x)\, dx$$
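
A worked version of this expectation, assuming the density is f(x) = 2 exp(-2x) for x > 0 (the minus sign is needed for f to integrate to 1). Integration by parts with u = x and dv = 2 exp(-2x) dx gives:

```latex
\begin{aligned}
E(x) &= \int_{0}^{\infty} x \cdot 2e^{-2x}\, dx \\
     &= \Big[-x e^{-2x}\Big]_{0}^{\infty} + \int_{0}^{\infty} e^{-2x}\, dx \\
     &= 0 + \Big[-\tfrac{1}{2} e^{-2x}\Big]_{0}^{\infty} \\
     &= \tfrac{1}{2}
\end{aligned}
```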


• The arithmetic average of a statistical sample without outliers
• 10% trimmed mean drops the lowest 10% and the highest 10%
of the data

$$\bar{x}_{trim} = \frac{\sum_{i=a}^{b} x_i}{b - a + 1}$$

where the x_i are sorted in increasing order and a and b are the ranks of the first and last values kept
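
A minimal sketch of a trimmed mean, assuming the number of values dropped from each end is the trim proportion times N, rounded down. The data set is made up to show the effect of one outlier.

```python
def trimmed_mean(data, proportion=0.10):
    """Mean of the data after dropping `proportion` of the values from each end."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)         # values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print("arithmetic mean :", sum(sample) / len(sample))
print("10% trimmed mean:", trimmed_mean(sample, 0.10))
```

The single extreme value pulls the arithmetic mean far above the bulk of the data, while the trimmed mean stays near the centre.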


Normal Distribution Lognormal Distribution



• The nth root of the product of n numbers
• In fluid flow studies the effective permeability of a stratified sequence is the
geometric mean whenever the flow is neither parallel nor perpendicular to
the strata.

$$\mu_g = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$


• In fluid flow studies the effective permeability of the stratified sequence is the harmonic mean
whenever the flow is perpendicular to the strata.

$$k_{harm} = \frac{\sum_{i=1}^{n} h_i}{\sum_{i=1}^{n} h_i / k_i}$$

where h_i is the thickness and k_i the permeability of layer i

• In general:
• Arithmetic ≥ Geometric ≥ Harmonic (equal only when all values are the same)
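
The three averages can be compared on a small layered example. This is a sketch with made-up thicknesses and permeabilities; all three means are thickness-weighted here (the geometric mean formula on the slide is the unweighted special case of equal thicknesses).

```python
import math

# Hypothetical three-layer sequence: thickness (m) and permeability (md).
h = [2.0, 1.0, 3.0]
k = [100.0, 10.0, 50.0]

total_h = sum(h)
k_arith = sum(hi * ki for hi, ki in zip(h, k)) / total_h              # flow parallel to strata
k_geom = math.exp(sum(hi * math.log(ki) for hi, ki in zip(h, k)) / total_h)
k_harm = total_h / sum(hi / ki for hi, ki in zip(h, k))               # flow perpendicular to strata

print(f"arithmetic = {k_arith:.1f} md")
print(f"geometric  = {k_geom:.1f} md")
print(f"harmonic   = {k_harm:.1f} md")
```

The low-permeability layer dominates the harmonic mean, which is why flow across the layering is controlled by the tightest bed.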
• The midpoint of observed values arranged in increasing
order

• 50th percentile



[Figure: permeability histogram (Frequency) with Cumulative % curve; the median is marked at 50% cumulative frequency; x-axis: Permeability Range, md]


[Figure: porosity histogram (Frequency) with Cumulative % curve; x-axis: Porosity Range, %, bins 0 to 17]
• The cumulative frequency is the number, or the cumulative fraction, of samples less than a given threshold


Median ~ 2.4 (from the porosity frequency table: the cumulative % passes 50% inside bin 3, between 44.94% at bin 2 and 57.45% at bin 3)
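
The median can be read from the grouped table by linear interpolation inside the bin where the cumulative count crosses 50%. The sketch below assumes each bin label is the upper bound of a 1% wide porosity class, which is consistent with the midpoints 0.5, 1.5, ... used in the grouped mean; `grouped_quantile` is a hypothetical helper name.

```python
def grouped_quantile(p, upper_bounds, frequency, width=1.0):
    """Interpolate the p-quantile linearly inside the bin that contains it."""
    n = sum(frequency)
    target = p * n                       # cumulative count to reach
    below = 0                            # cumulative count below the current bin
    for ub, f in zip(upper_bounds, frequency):
        if f > 0 and below + f >= target:
            return (ub - width) + width * (target - below) / f
        below += f
    return upper_bounds[-1]


upper_bounds = list(range(1, 18))
frequency = [354, 246, 167, 156, 99, 76, 73, 62, 33, 29, 16, 12, 8, 1, 0, 2, 1]
median = grouped_quantile(0.5, upper_bounds, frequency)
print(f"interpolated median = {median:.2f} %")
```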


• The value that occurs most frequently

11, 13, 15, 15, 16, 17, 17, 17, 19, 21, 25, 26

mode = 17


Order of the measures of location along the axis (m = mean, TM = trimmed mean, Md = median, Mo = mode):

Distribution skewed to the right: Mo, Md, TM, m

Distribution skewed to the left: m, TM, Md, Mo


[Figure: observed and theoretical frequency of Ln B6 with two fitted modes (mode 1, mode 2); y-axis: Frequency, 0 to 0.12]

[Figure: observed and theoretical cumulative distribution of B6 with cumulative curves for mode 1 and mode 2; x-axis: B6, log scale from 1.E+01 to 1.E+06]


Measures of Location: Quantiles
What is another name for the 2nd quartile?
•Quartiles
–in the same way that the median splits the data into two halves, the quartiles split the data into quarters. If
data values are arranged in increasing order, then a quarter of the data falls below the first quartile, and a
quarter of the data falls above the third quartile.
•Deciles
–split the data into tenths; one tenth of the data falls below the first or lowest decile, two tenths fall
below the second decile. The fifth decile corresponds to the median.
•Percentiles
–split the data into hundredths; the 25th percentile is the same as the first quartile, the 50th percentile is
the same as the median, and the 75th percentile is the same as the third quartile. Often referred to as P25, P50,
etc. This is what we typically use.
•Quantiles
–are a generalization: they split the data at any fraction.


Use of Cumulative Frequency to calculate Quantiles



• A quantile is the variable-value that corresponds to a fixed cumulative frequency
• first quartile = 0.25 quantile
• second quartile = median = 0.5 quantile
• third quartile = 0.75 quantile
• can read any quantile from the cumulative frequency plot

[Figure: distribution divided into four 25% parts by the quartiles; the IQR covers the middle two parts and Md marks the boundary between them]
[Figure: permeability histogram (Frequency) with Cumulative % curve; x-axis: Permeability Range, md]

[Figure: nine panels showing the samples below successive thresholds: V < 15, V < 30, V < 45, V < 60, V < 75, V < 90, V < 105, V < 120, V < 135]


Visual Summary of Sample Statistics: Box Plots: Simple Version



Visual Summary of Sample Statistics: Box Plots: Full Version
(developed by Tukey)



Example: Box Plots

Data for Box Plots: 4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11
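
The numbers behind a box plot of this data set can be sketched as follows. The quartile convention is an assumption: this uses linear interpolation between ordered values (the "inclusive" method of Python's `statistics.quantiles`); other conventions give slightly different quartiles. The fences at 1.5 IQR follow the usual Tukey rule.

```python
from statistics import quantiles

data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]

q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey fences: points beyond them are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```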



• The variance is the average squared difference of the
observed values from their mean
• The standard deviation is the square root of the variance; it is expressed in
the same units as the data


Sample Measures of Dispersion (Spread)



• The variance is the average squared difference of the observed values from their mean.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• Use of n-1 is not arbitrary. It is needed to make the sample variance an unbiased estimator of the
population variance
• If we take a large number of samples of size n from a population with variance σ², the average
of s² = σ²
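
The role of n-1 can be illustrated by simulation. This sketch draws many small samples from a uniform(0, 1) population, whose variance is known to be 1/12, and compares the average of the sums of squares divided by n-1 and by n. The population, sample size and number of trials are arbitrary choices for the illustration.

```python
import random

random.seed(0)                      # fixed seed so the run is repeatable
n = 5                               # sample size
trials = 20000                      # number of samples drawn
population_variance = 1 / 12        # variance of the uniform(0, 1) distribution

sum_unbiased = 0.0
sum_biased = 0.0
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations
    sum_unbiased += ss / (n - 1)
    sum_biased += ss / n

avg_unbiased = sum_unbiased / trials
avg_biased = sum_biased / trials
print(f"population variance       = {population_variance:.4f}")
print(f"average s^2 (divisor n-1) = {avg_unbiased:.4f}")
print(f"average s^2 (divisor n)   = {avg_biased:.4f}")
```

Dividing by n underestimates the population variance by a factor of (n-1)/n on average, which is noticeable for small samples.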


• The standard deviation is the square root of the variance. Unlike s², s is not an exactly unbiased
estimator of the population standard deviation, although the bias is small for large n

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$


Measures of Dispersion: Standard Deviation

$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

$$s = \sqrt{s^2}$$


Measures of Dispersion: Variance (Var) and Standard Deviation, s.d.



Practical formula for calculating Sample
Variance (and Standard Deviation)



$$s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{n - 1}$$

f_i - frequency of the i-th class
x_i - midpoint of the i-th class


$$s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{n - 1}$$

Midpoint x_i   Frequency f_i   f_i (x_i - x̄)²
0.5            354             2,672.39
1.5            246             751.28
2.5            167             93.33
3.5            156             9.94
4.5            99              155.29
5.5            76              385.58
6.5            73              772.22
7.5            62              1,121.16
8.5            33              910.41
9.5            29              1,133.70
10.5           16              841.56
11.5           12              817.23
12.5           8               684.86
13.5           1               105.11
14.5           0               -
15.5           2               300.24
16.5           1               175.63

x̄ = 3.25
n = 1335
s² = 10,929.93 / 1334 ≈ 8.19

$$s^2 = \frac{1}{n - 1}\left[\sum_i f_i x_i^2 - \frac{\left(\sum_i f_i x_i\right)^2}{n}\right]$$

f_i - frequency of the i-th class
x_i - midpoint of the i-th class
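
Both grouped-data formulas for the variance can be checked against each other. This sketch reuses the porosity bin counts and assumes midpoints 0.5, 1.5, ..., 16.5 as in the grouped-mean calculation.

```python
midpoints = [0.5 + i for i in range(17)]
frequency = [354, 246, 167, 156, 99, 76, 73, 62, 33, 29, 16, 12, 8, 1, 0, 2, 1]

n = sum(frequency)
sum_fx = sum(f * x for f, x in zip(frequency, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(frequency, midpoints))
xbar = sum_fx / n

# Definition: weighted squared deviations from the grouped mean.
s2_definition = sum(f * (x - xbar) ** 2 for f, x in zip(frequency, midpoints)) / (n - 1)

# Computational form: no deviations needed, only two running sums.
s2_shortcut = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

print(f"mean = {xbar:.3f}")
print(f"s^2 (definition) = {s2_definition:.3f}")
print(f"s^2 (shortcut)   = {s2_shortcut:.3f}")
```

The two forms agree to rounding error; the result is close to the grouped variance quoted in the slides and slightly above the value computed from the ungrouped data, as expected when values are replaced by bin midpoints.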


Porosity
Permeability

Well    Depth    X      Y       Poro   Perm      Layer  Seq  LogK
10100   7158.9   672.7  2886.0  27.00  554.000   1      51   2.744
10100   7160.4   672.8  2886.0  28.60  1560.000  1      51   3.193
10100   7160.7   672.8  2886.0  30.70  991.000   1      51   2.996
10100   7161.0   672.8  2886.0  28.40  1560.000  1      51   3.193
10100   7161.9   672.9  2886.0  31.00  3900.000  1      51   3.591
10100   7162.3   672.9  2886.0  32.40  4100.000  1      51   3.613

$$\sigma^2 = Var(x) = \int_{-\infty}^{\infty} (x - E(x))^2 f(x)\, dx$$

$$\sigma^2 = E\left[(x - E(x))^2\right] = E(x^2) - [E(x)]^2$$

f(x) - probability density function
E(x) - expectation


• For random variable x with p.d.f.

$$f(x) = x, \quad 0 < x < 1$$
$$f(x) = 0, \quad x \le 0;\ x \ge 1$$

$$\sigma^2 = E\left[(x - E(x))^2\right] = E(x^2) - [E(x)]^2$$


• For random variable x with p.d.f.

$$f(x) = 2\exp(-2x), \quad x > 0$$
$$f(x) = 0, \quad x \le 0$$

$$\sigma^2 = E\left[(x - E(x))^2\right] = E(x^2) - [E(x)]^2$$
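
A worked version of this variance, assuming the density is f(x) = 2 exp(-2x) for x > 0, for which E(x) = 1/2. The second moment uses the standard integral of x^m exp(-ax) over (0, ∞), equal to m!/a^(m+1):

```latex
\begin{aligned}
E(x^2) &= \int_{0}^{\infty} x^2 \cdot 2e^{-2x}\, dx
        = 2 \cdot \frac{2!}{2^{3}} = \frac{1}{2} \\
\sigma^2 &= E(x^2) - [E(x)]^2
          = \frac{1}{2} - \left(\frac{1}{2}\right)^{2}
          = \frac{1}{4}
\end{aligned}
```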


$$Var(c) = 0$$
$$Var(x + c) = Var(x)$$

$$Var(cx) = E\left[(cx)^2\right] - [E(cx)]^2 = c^2 E(x^2) - c^2 [E(x)]^2 = c^2\, Var(x)$$

$$Var(x + y) = Var(x) + Var(y)$$ when x and y are independent random variables


• Skewness measures the deviation of the distribution from
symmetry. If the skewness is clearly different from 0, then the
distribution is asymmetrical, while normal distributions are
perfectly symmetrical.
• The coefficient of variation (CV) is used as an alternative to the
coefficient of skewness for describing the shape of a distribution


• Many variables in earth science data have distributions that are not close to Gaussian
• It is very common to have many small values and a few very large values (e.g. permeability)


• A Gaussian distribution is one of many distributions for which a
concise mathematical description exists
• The Gaussian distribution has properties that favor its use in
theoretical approaches to estimation
• Gaussian simulation techniques are extremely fast and easy to
implement
• It is important to know how close the distribution of your data
comes to being Gaussian
• The probability distribution in a Gaussian model is fully
specified by the mean and standard deviation

$$f(w) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{w - \mu}{\sigma}\right)^2\right]$$

$$Amplitude = \frac{1}{\sigma\sqrt{2\pi}}$$
$$Area = \sqrt{2\pi} \cdot \sigma \cdot Amplitude = 1$$
$$Center = \mu$$
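
A short sketch of the Gaussian density and of the fraction of a normal population lying within k standard deviations of the mean. The coverage uses the standard identity P(|w - μ| < kσ) = erf(k/√2); the function names are illustrative.

```python
import math


def normal_pdf(w, mu=0.0, sigma=1.0):
    """Gaussian probability density with center mu and standard deviation sigma."""
    amplitude = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return amplitude * math.exp(-0.5 * ((w - mu) / sigma) ** 2)


def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


print(f"peak density of the standard normal = {normal_pdf(0.0):.4f}")
for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```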


[Figure: standard normal pdf and cumulative distribution for values from -3 to 3 standard deviations; 0.6827 of the area lies within ±1, 0.9545 within ±2 and 0.9973 within ±3 standard deviations]


[Figure: cumulative probability function (CPF) of porosity with a fitted normal curve; m = 0.2056, s = 0.02951, r2 = 99.09%, FitStdErr = 0.0274804, Fstat = 9406.59; x-axis: Porosity, fraction, 0.125 to 0.275]


Measure of symmetry in the distribution of the data values

$$Skew(x) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$

Distribution skewed to the right: positive skewness; order along the axis is Mo, Md, mean
Distribution skewed to the left: negative skewness; order along the axis is mean, Md, Mo


Measures the “flatness” or “peakedness” of the distribution

$$Kurt(x) = \left[\frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4\right] - 3$$

Positive: more peaked than a normal distribution
Negative: flatter than a normal distribution
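
The shape measures can be sketched directly from these formulas. The code assumes s is the sample standard deviation with the n-1 divisor, as defined earlier in the slides; spreadsheet and library functions often apply additional small-sample corrections, so their values can differ slightly. The two data sets are made up.

```python
from statistics import mean, stdev


def skewness(values):
    """Average cubed standardized deviation (slide formula)."""
    xbar, s, n = mean(values), stdev(values), len(values)
    return sum(((x - xbar) / s) ** 3 for x in values) / n


def excess_kurtosis(values):
    """Average fourth-power standardized deviation minus 3 (slide formula)."""
    xbar, s, n = mean(values), stdev(values), len(values)
    return sum(((x - xbar) / s) ** 4 for x in values) / n - 3


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return stdev(values) / mean(values)


symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical sample with a long right tail

print(f"skewness (symmetric)    = {skewness(symmetric):.3f}")
print(f"skewness (right-skewed) = {skewness(right_skewed):.3f}")
print(f"CV (right-skewed)       = {coefficient_of_variation(right_skewed):.3f}")
```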


$$CV = \frac{s}{\bar{x}}$$


• If CV > 50%, the population is skewed
• If the population is unimodal and symmetric:
– x̄ ± s contains 68% of the data
– x̄ ± 2s contains 95% of the data
– s ~ Range/4


[Figure: histogram (Frequency) and Cumulative Frequency of the Calculated Permeability Data Set; logarithmically spaced bins from 0.001 to 10.000]

[Figure: PROBABILITY PLOT of the Estimated Permeability Data Set, 109 data points, Mean = 0.36, Median = 0.10; y-axis: Calculated Permeability Data Set, log scale from 0.001 to 10; x-axis: Probability, % Less Than, 1 to 99]


[Figure: histogram (Frequency) with Cumulative % curve; Mean = 1.3467, St. Dev. = 1.201; bins from 0 to 4 in steps of 0.25]

Histogram of Sample mean

[Figure: histogram (Frequency) of the sample means with Cumulative % curve; Mean = 1.3467, St. Dev = 0.1042; bins from 1.07 to 1.64]


• Sample size n, n large
• Population with finite mean m and standard deviation s
• Distribution of sample means is approximately normal with mean
m and standard deviation s/√n
• Mean of Sample means = Population mean = 1.3467
• St. Dev of Sample means = 0.1042 ~ 1.201/11
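
The behaviour of sample means can be illustrated by simulation. This sketch does not use the course data: it assumes a strongly skewed exponential population with mean 1 and standard deviation 1, and an arbitrary sample size and number of samples.

```python
import math
import random
from statistics import stdev

random.seed(1)                 # fixed seed so the run is repeatable
n = 100                        # size of each sample
trials = 2000                  # number of samples drawn
pop_mean, pop_sd = 1.0, 1.0    # exponential(1) has mean 1 and standard deviation 1

sample_means = []
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = sum(sample_means) / trials
sd_of_means = stdev(sample_means)

print(f"mean of sample means     = {mean_of_means:.4f} (population mean {pop_mean})")
print(f"st. dev. of sample means = {sd_of_means:.4f}")
print(f"s / sqrt(n)              = {pop_sd / math.sqrt(n):.4f}")
```

Even though the population is far from normal, a histogram of `sample_means` is close to a bell shape centred on the population mean.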


• Attributes such as permeability, with highly skewed data distributions, present problems in
variogram calculation; the extreme values have a significant impact on the variogram.
• Many geostatistical techniques require the data to be transformed to a Gaussian (normal)
distribution.
• Data transformations fall into two categories:
– Transformations used to correct for spatial trends in the data
– Transformations used to correct for statistical deviations from the Gaussian (normal)
distribution


• Truncate Observation Transformation
– Corrects for data outliers
• Box-Cox / Log Transformation
– Corrects for skewness in the data
• Linear Trend Transformation
– Corrects for spatial linear trends
• Normal Score Transformation
– Transformed data match a normal distribution
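
A minimal sketch of a normal score transformation: each value is replaced by the standard normal quantile of its rank. The plotting position (rank + 0.5)/N and the handling of ties (tied values simply receive consecutive scores) are assumptions of this illustration, and the permeability values are made up.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map each value to the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # indices from smallest to largest
    standard_normal = NormalDist()                       # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        p = (rank + 0.5) / n                             # cumulative probability of this rank
        scores[idx] = standard_normal.inv_cdf(p)
    return scores


# Hypothetical, strongly skewed permeability values (md).
permeability = [0.01, 0.1, 1.0, 10.0, 1000.0]
scores = normal_scores(permeability)
for k_value, z in zip(permeability, scores):
    print(f"{k_value:8.2f} md -> normal score {z:+.3f}")
```

The transform keeps the ordering of the data but discards the spacing between values, so a back-transformation table is needed to return to the original units.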




Porosity From Core
Mean                       3.253
Median                     2.38
Mode                       0.74
Standard Deviation         2.850
Sample Variance            8.124
Coefficient of Variation   0.876
Kurtosis                   1.215
Skewness                   1.234
Range                      16.56
Minimum                    0.04
Maximum                    16.6
Sum                        4342
Count                      1335
Q3 Largest(334)            4.7
Q1 Smallest(334)           0.95
IQR                        3.75

[Figure: porosity histogram (Frequency) with Cumulative % curve; x-axis: Porosity Range, %, bins 0 to 16]


[Figure: porosity histogram (Frequency) with Cumulative % curve; x-axis: Porosity Range, %, bins 0 to 17]

[Figure: permeability histogram (Frequency) with Cumulative % curve; x-axis: K Range, md]

[Figure: permeability histogram (Frequency) with Cumulative % curve; x-axis: Permeability Range, md]


Comparing Distributions: Visual
–Box-Plots
–Normal Probability Plots
–Q-Q and P-P Plots



Comparing Distributions

•Why?
•Identifying key spatial controls on reservoir properties
–Which facies or lithotypes or layers are really different
and need to be modelled separately
•Quality control - comparing predicted (estimated)
distributions
•Establishing statistical homogeneity



Fundamental Principle: Classify the data into groups that are as
homogeneous as possible (Step 2)
- minimize variance within categories and
- maximize difference between categories



Example: Permeability distributions within each of 3
facies



Porosity Distributions in 4 Wells
Burgan Cretaceous Sand



Box plots make it easier to compare
Burgan Cretaceous Sand



Box Plots of Resistivity by Facies



Univariate Statistics …
Keep the geology in mind!



Comparing to a Normal (Gaussian) Distribution:
Normal Probability Plots
•Y-axis scaled such that the cumulative frequency of a normal distribution will plot as a straight line





Comparing to a Log-Normal Distribution: Log-Normal
Probability Plot
•As before but with X-Axis a log scale – a log-normal distribution will plot as a straight line



Comparing Any Two Distributions: Q-Q Plot (Quantile-Quantile)
•A Q-Q plot is a plot on which the quantiles from two distributions
are plotted against each other.
–It has the units of the data (so the min & max of each axis are
determined by the data).
–A Q-Q plot of two identical distributions will plot as the straight
line y = x.
–Departures of the Q-Q plot from the line y = x reveal where the
distributions differ.
–If the Q-Q plot of two distributions is some straight line other than
y = x, the two distributions have the same shape but their mean
and/or variance are different.

• Q-Q and P-P plots each compare two univariate distributions
• A Q-Q plot is a plot of matching quantiles
• a straight line implies that the two distributions have the same shape.
• A P-P plot is a plot of matching cumulative probabilities
• a straight line (y = x) implies that the two distributions are the same.
• A Q-Q plot has the units of the data
• P-P plots are always scaled between 0 and 1



Making a Q-Q Plot:



Interpreting Q-Q plots

• If the points lie on the 45° line (y = x) then the two distributions are the
same
• A systematic departure above or below the 45° line (but parallel
to it) indicates a difference in the MEANS
– Above the 45° line: mean of Y > mean of X
– Below the 45° line: mean of Y < mean of X
• A slope different from 45° indicates a difference in spread
– Slope greater than 1 (>45°): variance of Y > variance of X
– Slope less than 1 (<45°): variance of Y < variance of X
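
The interpretation rules above can be checked with a small Python sketch. It is illustrative only: the helper names `quantile` and `qq_pairs`, the linear-interpolation quantile rule and the sample values are assumptions, not part of the course material.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of already-sorted data, 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def qq_pairs(x, y, n_quantiles=9):
    """Return (x-quantile, y-quantile) pairs at evenly spaced probabilities."""
    xs, ys = sorted(x), sorted(y)
    probs = [(i + 1) / (n_quantiles + 1) for i in range(n_quantiles)]
    return [(quantile(xs, p), quantile(ys, p)) for p in probs]


# Hypothetical samples: y has the same shape as x, stretched by 2 and shifted by 5
x = [float(v) for v in range(1, 21)]
y = [2.0 * v + 5.0 for v in x]

pairs = qq_pairs(x, y)
# Slope of the Q-Q line between the first and last plotted quantiles
slope = (pairs[-1][1] - pairs[0][1]) / (pairs[-1][0] - pairs[0][0])
print(pairs)
print("slope:", slope)
```

In this made-up case every Q-Q point lies on a straight line with slope greater than 1, which by the rules above indicates the same shape with a larger spread in Y.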



Comparing Distributions: Q-Q Plot Example



[Figure: log-log plot of Permeability (1.E-03 to 1.E+02) versus Porosity (0.001 to 1.000)]



Comparing Distributions: P-P Plot (Percent-Percent)
•A P-P plot is a plot on which the cumulative probabilities
(percents) of the same data value in two distributions are
plotted against each other.
–It is always scaled between 0 and 1.
–Samples drawn from the same distribution will fall on the
straight line y = x.
•Also called a Probability-Probability plot
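
A minimal Python sketch of how the P-P points could be computed from two samples. The function names `ecdf` and `pp_pairs` and the sample values are hypothetical; the empirical CDF is used as the estimate of cumulative probability.

```python
def ecdf(sample, value):
    """Fraction of the sample that is <= value (empirical CDF)."""
    return sum(1 for s in sample if s <= value) / len(sample)


def pp_pairs(x, y):
    """Cumulative probabilities of the same data values in both samples."""
    values = sorted(set(x) | set(y))
    return [(ecdf(x, v), ecdf(y, v)) for v in values]


# Hypothetical samples: y is x shifted by +1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 4.0, 5.0, 6.0]

pairs = pp_pairs(x, y)   # points fall off the line y = x because of the shift
same = pp_pairs(x, x)    # identical samples: every point lies on y = x
print(pairs)
print(same)
```

Both axes run from 0 to 1 whatever the units of the data, which is the main practical difference from the Q-Q plot.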



Making a P-P Plot:



Comparing Distributions: P-P Plot Example



Exercise: Q-Q and P-P plots
•Do the questions in the handout
EX-QQ & PP-PROB.doc
with data in
EX-QQ & PP-DATA.xls



Comparing Distributions:
Quantitative Statistical Test
–Kolmogorov-Smirnov Test



Comparing Distributions: Kolmogorov-Smirnov Test
•K-S is used to compare CDFs. Two cases:
–Compare a sample against a theoretical distribution (commonly
the Gaussian): use the K-S one-sample test
–Compare two samples against each other: use the K-S two-sample
test
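
As a rough illustration of the two-sample case, the sketch below computes the K-S statistic D, the largest vertical gap between the two empirical CDFs. The names `ecdf` and `ks_two_sample_statistic` and the data are assumptions made for this example; deciding whether D is significant still needs a critical value from a table or a p-value, which this sketch does not compute.

```python
def ecdf(sample, value):
    """Fraction of the sample that is <= value (empirical CDF)."""
    return sum(1 for s in sample if s <= value) / len(sample)


def ks_two_sample_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs.

    Both CDFs are step functions, so the maximum is reached at a data value.
    """
    values = sorted(set(x) | set(y))
    return max(abs(ecdf(x, v) - ecdf(y, v)) for v in values)


# Hypothetical samples
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 4.0, 5.0, 6.0]

d_shifted = ks_two_sample_statistic(x, y)
d_same = ks_two_sample_statistic(x, x)
print(d_shifted, d_same)
```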



Hypothesis Testing: general 6 step procedure
1.Specify a null hypothesis, H0 (e.g. the two CDFs are the same, for the K-S test;
other tests address differences in means or s.d.), and an alternative hypothesis,
H1 (e.g. the CDFs are different)
–H0 is deliberately set up for the purpose of being rejected in favour of H1
2.Choose a statistical test (here the K-S test)
3.Choose a small probability, α (typically 1% or 5%), that corresponds to the
probability we are willing to live with of wrongly rejecting H0
–α is often called the “significance level”
4.Compute the “test statistic” (the maximum difference between the CDFs, for the K-S test) and
determine its frequency distribution from a table, or calculate it
5.Define the critical region: the part of the test-statistic frequency distribution
where there is a probability of α or less of observing the test statistic given
that H0 is true
6.Reject H0 (and accept H1) if the test statistic is in the critical region

Critical Value & Critical Region: One-tailed test



Critical Value & Critical Region: Two-tailed test



Comparing Distributions: Hypothesis Testing



Example







Comparing Distributions: Kolmogorov-Smirnov One-Sample Test
Define the null hypothesis that the sample of 25 porosity measurements is
drawn from a normal distribution

N = 25 and α = 0.05, so the critical value is 0.27

Since D = 0.30 lies above the critical value (falls in the critical region) we REJECT the null
hypothesis that the sample is drawn from a normal distribution.
If the null hypothesis is in fact true, there is a 5% chance that we are wrong in rejecting it
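
The decision above can be sketched in Python. The 25 porosity values are not listed here, so the code only reuses the quoted D = 0.30 and critical value 0.27; the function names and the tiny demo samples are hypothetical. The statistic assumes a fully specified reference CDF (parameters not estimated from the same sample).

```python
from statistics import NormalDist


def ks_one_sample_statistic(sample, cdf):
    """Largest gap between the sample's empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x: check both sides
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def reject_h0(d, critical_value):
    """Reject H0 when the test statistic falls in the critical region."""
    return d > critical_value


# Numbers quoted in the example: D = 0.30, critical value 0.27 (N = 25, alpha = 0.05)
decision = reject_h0(0.30, 0.27)

# Tiny made-up checks of the statistic itself
d_uniform = ks_one_sample_statistic([0.25, 0.75], lambda v: v)  # uniform CDF on [0, 1]
d_normal = ks_one_sample_statistic([0.0], NormalDist(0.0, 1.0).cdf)
print(decision, d_uniform, d_normal)
```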



Comparing Distributions: Kolmogorov-Smirnov Two-Sample Test



Handout

•Hypothesis Testing (K-S example)


•Hypothesis Testing using p-value
–the norm now





Data Transformations
•Transforming the data so that their distribution matches a
prescribed (target) distribution
•Sometimes we must transform the data
–So that they behave better
–So that they are more representative, e.g. honour a different
histogram
–So that they are computationally easier to use
•Can you name one variable we almost always transform?
•Many geostatistical modeling methods require that the random
variable has a Normal (Gaussian) or Standard-Normal
distribution



Transforming a Normal to a Standard Normal, N[0,1]
•Standard normal has a mean of 0 and variance of 1
•Define a transformation from x to z
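
A minimal sketch of the usual standardization, z = (x - mean) / standard deviation, assuming the population standard deviation is wanted. The function name `standardize` and the data values are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift and scale values so they have mean 0 and variance 1."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (an assumption)
    return [(v - m) / s for v in values]


# Hypothetical data with mean 5 and standard deviation 2
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(x)
print(z)
```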



Transforming any distribution to a normal: Normal Score
Transform

•Match quantiles
•Can you think of any problems that might arise?
1)Tie-breaking spikes (vertical sections of the CDF)
2)Reproducing tails on back transformation
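
One possible way to match quantiles is sketched below in Python. The plotting position (rank - 0.5)/n, the function name `normal_scores` and the data are assumptions of this example; ties are broken by data order simply because Python's sort is stable.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable: ties keep data order
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        p = (rank - 0.5) / n  # cumulative probability assigned to this rank
        scores[idx] = NormalDist().inv_cdf(p)
    return scores


# Hypothetical, strongly skewed values (e.g. permeabilities in md)
perm = [10.0, 0.1, 1000.0, 1.0, 100.0]
scores = normal_scores(perm)
print(scores)
```

The transform preserves the ordering of the data; only the spacing between values changes.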



De-spiking

•Need to break ties (equal-valued data points)
–by order of data?
–randomly (add a small random component and then sort)?
–by local averages, then sort?
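
The random option can be sketched as follows. This is only an illustration: the function name `despike_random`, the noise size and the seed are assumptions, and the noise must be smaller than the smallest gap between distinct data values so that only ties are reordered.

```python
import random


def despike_random(values, noise=1e-6, seed=0):
    """Return ranks (1 = smallest) with ties broken by a small random jitter."""
    rng = random.Random(seed)
    jittered = [(v + rng.uniform(0.0, noise), i) for i, v in enumerate(values)]
    ranks = [0] * len(values)
    for rank, (_, i) in enumerate(sorted(jittered), start=1):
        ranks[i] = rank
    return ranks


# Hypothetical porosities with a spike of three equal values
poro = [0.10, 0.10, 0.10, 0.20, 0.05]
ranks = despike_random(poro)
print(ranks)
```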



De-spiking example



Data Transformations: Honouring a different histogram
•Remember the Core and Log porosity histograms?

•Why different? Preferential sampling.


•Core values can be transformed (calibrated) to match log
histogram by matching quantiles
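
A rough Python sketch of matching quantiles between two samples. Everything named here is hypothetical: the helper names, the rank/(n - 1) probability convention and the porosity values are chosen only to make the example runnable.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of already-sorted data, 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def match_quantiles(source, target):
    """Map each source value to the target value with the same cumulative probability."""
    n = len(source)
    order = sorted(range(n), key=lambda i: source[i])
    tgt = sorted(target)
    out = [0.0] * n
    for rank, idx in enumerate(order):
        p = rank / (n - 1) if n > 1 else 0.5
        out[idx] = quantile(tgt, p)
    return out


# Hypothetical core and log porosities (fractions)
core = [0.30, 0.10, 0.20]
log = [0.05, 0.15, 0.25, 0.12, 0.18]
calibrated = match_quantiles(core, log)
print(calibrated)
```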



Dependency



Partial Dependency



Common Variables that are Related
•Porosity and permeability
•Porosity and water saturation
•Density and molecular weight
•Oil density and viscosity
•Formation thickness and productivity
•Sand-body width and thickness



Example: Is there a different relationship between
porosity & permeability for each facies?



Scatterplots
•Bivariate display, typically
–two covariates (e.g. porosity and permeability) at same location
–the same variable at different locations - separated by some distance vector
–estimated value versus true value
•Good for spotting aberrant data





Another Example: Identifying facies controls on
poro-perm



Scatter Plot of Log10(Permeability) and Porosity
measured on 500 samples



[Figure: core plug permeability (md, log scale from 1.0E-03 to 1.0E+03) versus porosity (%, 0 to 20); series: core plug k and conditional median k]



• What we want to learn about our data:
–Are the variables correlated or not?
–Regression and correlation coefficients between variables
–Detect and handle errors in the data



Types of correlation



• The correlation coefficient measures the strength of the linear association between two variables.
Correlation coefficients can range from -1.00 to +1.00. A value of -1.00 represents a perfect
negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of
0.00 represents a lack of linear correlation.



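• Least squares regression lines

$$Y = a_0 + a_1 X$$

$$a_0 = \frac{\left(\sum Y\right)\left(\sum X^2\right) - \left(\sum X\right)\left(\sum XY\right)}{N \sum X^2 - \left(\sum X\right)^2}$$

$$a_1 = \frac{N \sum XY - \left(\sum X\right)\left(\sum Y\right)}{N \sum X^2 - \left(\sum X\right)^2}$$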



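• Least squares regression lines

$$X = b_0 + b_1 Y$$

$$b_0 = \frac{\left(\sum X\right)\left(\sum Y^2\right) - \left(\sum Y\right)\left(\sum XY\right)}{N \sum Y^2 - \left(\sum Y\right)^2}$$

$$b_1 = \frac{N \sum XY - \left(\sum X\right)\left(\sum Y\right)}{N \sum Y^2 - \left(\sum Y\right)^2}$$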



• Regression equations Y=a0+ a1*X and X=b0+ b1*Y are identical
if and only if there is a perfect correlation between X and Y
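
The two regressions can be compared numerically with a short Python sketch. The function name `fit_line` and both data sets are made up for illustration; the sums follow the least squares formulas for the intercept and slope.

```python
def fit_line(x, y):
    """Least squares intercept and slope for predicting y from x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = n * sxx - sx * sx
    intercept = (sy * sxx - sx * sxy) / denom
    slope = (n * sxy - sx * sy) / denom
    return intercept, slope


# Hypothetical, imperfectly correlated data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a0, a1 = fit_line(x, y)   # Y = a0 + a1 * X
b0, b1 = fit_line(y, x)   # X = b0 + b1 * Y

# Hypothetical perfectly correlated data: y2 = 1 + 2x
y2 = [3.0, 5.0, 7.0, 9.0, 11.0]
p0, p1 = fit_line(x, y2)
q0, q1 = fit_line(y2, x)
print(a0, a1, b0, b1)
print(p0, p1, q0, q1)
```

For the scattered data the X-on-Y line, rewritten as Y = (X - b0)/b1, has slope 1/b1 = 1.0 rather than a1 = 0.6, so the two lines differ. For the perfectly correlated data 1/q1 equals p1 and the lines coincide.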



• Total variation of Y = Explained variation of Y + Unexplained
variation of Y

$$\sum \left(Y - \bar{Y}\right)^2 = \sum \left(Y_{est} - \bar{Y}\right)^2 + \sum \left(Y - Y_{est}\right)^2$$



Coefficient of determination r² = Explained variation of Y / Total variation of Y

$$r^2 = \frac{\sum \left(Y_{est} - \bar{Y}\right)^2}{\sum \left(Y - \bar{Y}\right)^2}$$
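
A small Python check of the variation split and of r², using made-up data. The variable names are assumptions of this sketch, and the line is fitted with the equivalent mean-centred form of the least squares formulas.

```python
from statistics import mean

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(x), mean(y)
# Least squares slope and intercept for Y on X
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a0 = y_bar - a1 * x_bar
y_est = [a0 + a1 * xi for xi in x]

total = sum((yi - y_bar) ** 2 for yi in y)
explained = sum((ye - y_bar) ** 2 for ye in y_est)
unexplained = sum((yi - ye) ** 2 for yi, ye in zip(y, y_est))
r_squared = explained / total
print(total, explained, unexplained, r_squared)
```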



$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}}$$

where
$$x = X - \bar{X}, \qquad y = Y - \bar{Y}$$
This formula automatically gives the proper sign of r
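
A minimal Python sketch of this formula, with a hypothetical function name `pearson_r` and made-up data, showing that the sign of r follows the direction of the trend.

```python
from math import sqrt
from statistics import mean


def pearson_r(X, Y):
    """Correlation coefficient from deviations about the means."""
    x_bar, y_bar = mean(X), mean(Y)
    x = [v - x_bar for v in X]
    y = [v - y_bar for v in Y]
    sum_xy = sum(a * b for a, b in zip(x, y))
    return sum_xy / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


# Hypothetical data: an increasing trend and its mirror image
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y_up = [2.0, 4.0, 5.0, 4.0, 5.0]
Y_down = list(reversed(Y_up))

r_up = pearson_r(X, Y_up)
r_down = pearson_r(X, Y_down)
print(r_up, r_down)
```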



Quantifying Linear dependency between two samples:
(Pearson) Correlation Coefficient



Example – Correlation Coefficient



[Figure: log k versus porosity (fraction, 0.00 to 0.30) with linear fit: log k = 27.376 × porosity − 3.7015, R² = 0.74291]



Correlation Coefficient - Interpretation
• What do I do with this number?
• 0.5683 indicates that porosity and permeability are partially
correlated
• A correlation coefficient of 0.5683 means that 32.3% (= 0.5683²)
of the variation in permeability is explained by the variability in
porosity.
• r is a measure of how close the points come to falling on a
straight line

Correlation   Negative          Positive
Small         −0.29 to −0.10    0.10 to 0.29
Medium        −0.49 to −0.30    0.30 to 0.49
Large         −1.00 to −0.50    0.50 to 1.00



Correlation Coefficient - Interpretation
•Since r is a measure of how close the points come to
falling on a straight line, it is an indicator of how successful
we might be in predicting one variable from another (see
later: regression)
–If |r| is high, then for a given value of one variable we
know that the other variable is restricted to only a small
range of values
–If |r| is low, then knowing the value of one variable does
not give us much information on the other



Interpreting Correlation Coefficients



Correlation: Problem areas – always look at the scatter plot



Correlation: Always look at the scatter plot: Anscombe’s quartet



• When linear correlation is a poor measure, we can correlate
the ranks of the values instead (using the Pearson r)
– Rank is the position of a data value when sorted in ascending order.
– Smallest data value has Rank = 1 and largest has Rank = N
– Sort in ascending order of the first (independent) variable

$$r_{rank} = 1 - \frac{6 \sum d^2}{n\left(n^2 - 1\right)}$$

where
d = difference between the ranks of corresponding x and y
n = number of pairs of values (x, y) in the data

Spearman’s rank correlation coefficient is a technique which can be used to summarise the
strength and direction (negative or positive) of a relationship between two variables.
Comparison of r and r_rank gives an idea about the significance of the correlation.



• Create a table from your data.

• Rank the two data sets. Ranking is achieved by giving the rank '1' to the
biggest number in a column, '2' to the second biggest value and so on. The
smallest value in the column will get the lowest ranking. This should be done for
both sets of measurements. (Ranking in ascending or descending order gives the
same r_rank, provided the same direction is used for both variables.)

• Tied scores are given the mean (average) rank.

• Find the difference in the ranks (d): this is the difference between the ranks of
the two values on each row of the table. The rank of the second value (price) is
subtracted from the rank of the first (distance from the museum).

• Square the differences (d²) to remove negative values and then sum them
(Σd²).



Convenience  Distance from  Rank      Price of 50cl  Rank   Difference between  d²
Store        CAM (m)        distance  bottle (€)     price  ranks (d)
1            50             10        1.80           2      8                   64
2            175            9         1.20           3.5    5.5                 30.25
3            270            8         2.00           1      7                   49
4            375            7         1.00           6      1                   1
5            425            6         1.00           6      0                   0
6            580            5         1.20           3.5    1.5                 2.25
7            710            4         0.80           9      -5                  25
8            790            3         0.60           10     -7                  49
9            890            2         1.00           6      -4                  16
10           980            1         0.85           8      -7                  49

Σd² = 285.5
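
The table can be reproduced with a short Python sketch. The function names are hypothetical; the ranking follows the convention used in the table (rank 1 for the largest value, ties given the mean rank), and the result is plugged into r_rank = 1 - 6Σd² / (n(n² - 1)). With tied values this formula is an approximation to correlating the ranks directly.

```python
def average_ranks_descending(values):
    """Rank 1 = largest value; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_from_d(x, y):
    """Return (sum of squared rank differences, rank correlation)."""
    rx = average_ranks_descending(x)
    ry = average_ranks_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n * n - 1))


# Values from the table
distance = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]
price = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]

sum_d2, r_rank = spearman_from_d(distance, price)
print(sum_d2, round(r_rank, 4))
```

The sum of squared differences matches the table (285.5), and the resulting coefficient is about -0.73: price tends to fall as distance increases.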



[Figure: log k versus porosity (fraction, 0 to 0.3) with linear fit: log k = 23.067 × porosity − 2.534, r = 0.9232, r_rank = 0.8850]



[Figure: y versus x on linear axes (0 to 400), annotated y = 1.0274x + 0.0394, r = 0.9996, r_rank = 0.2664]



[Figure: y versus x on log-log axes (0.001 to 1000), annotated y = 1.0274x + 0.0394, r = 0.9996, r_rank = 0.2664]



Example: Sample of 100 Porosities



Example: 100 Permeabilities measured on same samples as the 100 porosities



Correlation coefficients between porosity and permeability measurements



Exercise

• Calculate r and r_rank for y = x² for x = 1 to 20



Interpreting Rank Correlation
•Learning from the difference between r_rank and r
•If r_rank > r
–then a few outliers are spoiling an otherwise good correlation
•If r_rank < r
–then a few outliers are enhancing an otherwise poor correlation
•If r_rank = 1
–then the relationship is monotonic, but not necessarily linear
–in these cases a non-linear transform of one covariate can make
r = 1
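
The last case can be illustrated with a small Python sketch on made-up data, y = x³. The function names are hypothetical, and the simple ranking used here assumes there are no tied values.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks_ascending(values):
    """Rank 1 = smallest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def spearman_r(x, y):
    """Rank correlation from the sum of squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_ascending(x), ranks_ascending(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical monotonic but non-linear relationship
x = [float(v) for v in range(1, 11)]
y = [v ** 3 for v in x]

r = pearson_r(x, y)                               # less than 1: not a straight line
r_rank = spearman_r(x, y)                         # exactly 1: perfectly monotonic
r_after = pearson_r(x, [v ** (1 / 3) for v in y])  # cube-root transform restores linearity
print(r, r_rank, r_after)
```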



Exercise: Bi-variate Distributions & Correlation

•Complete the questions in the handout


EX-Scatter & Correl-PROB.doc
with data in
EX-Scatter & Correl-DATA.xls



• A strong relationship between two variables can
help us predict one variable if the other is known.
Linear regression assumes the dependence of one
variable on the other can be described by the
equation of a straight line



Linear Regression: Estimating one variable from another
(data integration)



Regression: Derivation of m and b



Regression



Regression – Example



Regression: What does it mean

• Tells how much of the variance in the dependent variable (y) is
due to variance in the independent, or explanatory, variable (x)
• The independent variable is assumed to be error-free, but the
dependent variable is assumed to have random error associated
with it
– If there is perfect correlation then r = +1 or –1 (R² = 1) and all of
the variation in y is explained by variation in x
– If there is zero correlation (R² = 0), then none of the variation in y is
explained by variation in x
– If 0 < R² < 1 then some of the variation in y is explained by
variation in x



Linear Regression: Example & Residual Plot



Regression
•Application to what generic types of variables? And some specific
examples?
–Spatial: x is location, y is the variable
 e.g. depth, or distance from a well
–Hard (x) - Soft (y)
 e.g. core permeability - log porosity
–Time (x) - Seismic amplitude (y)
•How will the histogram of the predicted points compare to that of
the data points from which the predictor was derived?



• Linear regression does not guarantee that predicted values are positive
• Linear regression does not guarantee that predicted values will fall within a reasonable tolerance of the data



Linear Regression Models



Comparison of Regression Models



Regression: Generalization to higher order equations



Regression: Generalization to multiple variables



Correlation vs Regression



Exercise: Regression

•Follow directions in the handout


EX-Regression-PROB.doc
with data in
EX-Regression-DATA.xls


