
Basic course 1 in biostatistics
BY
DALIA AHMED
ASSISTANT PROFESSOR OF PUBLIC HEALTH AND COMMUNITY MEDICINE

dr dalia ahmed


Day 1: (theoretical)
Introduction
Variable types and data presentation
Statistical parameters
Day 2: (practical)
Rules for data entry
Data coding and entry
Data cleaning
Day 3: (practical)
Frequency and crosstabulation
Statistical parameters calculations
Graphs

Objectives
Acquiring knowledge about variable types, data presentation and statistical parameters.
Acquiring skills in data coding, entry and cleaning.
Acquiring skills in data presentation and parameter calculation.

Clinical epidemiology

I am not an expert but I will do my best!
If you know and I don't, speak up!
These are the basics; there is a major new literature to become familiar with and new skills to learn.

www.bradfordvts.co.uk

Definition of Research

Research is a systematic plan for discovering facts or principles in any field of knowledge.

Steps of research
1. Problem definition and objective setting.
2. Study type (design).
3. Sample type and size.
4. Data collection tools and procedure.
5. Ethical considerations.
6. Proper data analysis and presentation.
7. Interpretation of findings and report writing.

MEDICAL STATISTICS

Definitions:

Statistics is a scientific field that deals with the collection, classification, description, analysis, interpretation and presentation of data.
Medical statistics is a branch of statistics concerned with medical topics.

Variables

A variable is a type of information or characteristic that varies among the studied cases (study units) and can be measured (it varies and is measurable).

Quantitative variables

Quantitative variables: the measure is expressed in numbers.
Quantitative variables could be:
Discrete variables: the values of the variable are integers (i.e. separate whole numbers, no decimals).
Examples: number of live births per mother, number of cigarettes per day, pulse rate.
Continuous variables: the values can take on any numerical value within the variable's range (with decimals).
Examples: age, cholesterol level, blood sugar.

Qualitative variables
Qualitative variables: the measure is expressed as a description.
Qualitative variables could be:
a- Nominal variables: the categories have no specific order.
Example: sex (male and female).
b- Ordinal variables: the categories can be arranged in a certain order.
Example: disease severity (mild, moderate, severe).

Data and Information:

Data is a measurement with a precise definition, e.g. a systolic pressure of "120 mmHg" in an adult male is data.
Information is the translation of the measurement into meaningful knowledge, e.g. the adult male in the previous example is "normotensive".
Other examples?

Research variables are:
A. Background variables
B. The outcome variables
C. The confounders

What is a confounding factor?

A "confounding" factor is one that is associated with both the exposure and the disease and is distributed unequally between the study and control groups.

How can we conclude that caffeine intake increases the possibility of coronary heart disease (CHD) if we know that:
Heavy caffeine drinkers are also heavy smokers
Smoking is a risk factor for CHD

Confounding

[Figure: Gambling and cancer. Observed association, presumed causation: gambling and cancer. Unobserved association: gambling with smoking, alcohol and other factors. True association: smoking, alcohol and other factors with cancer.]

Solution
Match groups according to possible confounders
Consider confounders during analysis

Variables:
What is the correct statement:
The presence of diabetes is affected by the gender
OR
The gender is affected by the presence of diabetes
AND WHY?

Variables:
What is the correct statement:
The presence of diabetes is affected by the gender: true
The gender is affected by the presence of diabetes: false
As gender is the independent variable and DM (diabetes mellitus) is the dependent variable.

Proper data presentation
Tables: When details of data are needed.
Graphs: When only impressions are needed.
Parameters: Precise mathematical summary, useful for comparison.

1- Tables: a spreadsheet of 18 medical statistics course students (Kasr El-Aini 1998).

No   Sex      Age   Result   Degree
1    Male     35    Fail     50
2    Female   30    Pass     70
3    Male     40    Pass     90
4    Male     45    Pass     80
5    Female   33    Fail     40
6    Female   34    Fail     55
7    Female   32    Pass     60
8    Female   28    Fail     50
9    Male     25    Fail     45
10   Male     30    Pass     70
11   Male     40    Pass     72
12   Male     42    Pass     75
13   Male     45    Pass     80
14   Male     41    Fail     55
15   Female   55    Pass     66
16   Female   44    Pass     67
17   Male     36    Pass     90

Characteristics of the tables prepared for research reports:
1- No. of the table
2- Title of the table describing its contents
3- Suitable number of rows (4-12); avoid rows with zero frequencies
4- Title for each column and each row
5- Totals
6- Meaningful percent from row or column
7- Make sure that the tables and text refer to each other (through the table number)
8- Not everything displayed in the table needs to be mentioned in the text
9- Try to set a standard for comparisons across the tables. Values are compared more easily down columns than across rows
10- It is useful to round numbers in the table to one or two decimal places. Numbers are easier to understand when the number of digits is reduced

Types of tables
1) Simple tables showing a single variable
a. Tables with data on a qualitative variable (nominal) (e.g. percent distribution of the studied sample by sex).
b. Tables with data on a quantitative variable (continuous) (e.g. percent distribution of the studied sample by age).
2) Contingency tables or cross tabulation of two variables.
In such tables two variables are presented, e.g. obesity and diabetes.

Simple tables for single variable

Qualitative data

Sex distribution of medical statistics course students (Kasr El-Aini 1998)

Sex       Number   Percent
Males     11       61.1
Females   7        38.9
Total     18       100.0

Quantitative data

Age distribution of medical statistics course students (Kasr El-Aini 1998).

[Table: age groups 25-, 30-, 35-, 40-, 45-, 50-, 55- with their frequencies; total 18. Annotations on the slide: title, class interval (open end).]

Contingency tables (or cross tabulation) of two variables

Frequency distribution of exam results according to students' sex

[Table: sex (male, female, total) by exam result (fail, pass, total); 18 students in total.]
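The percent column of a simple table can be checked with a short calculation. The following is only a minimal sketch in Python: the list of sexes is rebuilt from the totals in the sex distribution table (11 males, 7 females), and the function name `frequency_table` is a hypothetical helper, not part of the course material.

```python
from collections import Counter

# Sex of the 18 students, rebuilt from the table totals (11 males, 7 females).
# The order of the individual students is an assumption and does not matter here.
sexes = ["Male"] * 11 + ["Female"] * 7


def frequency_table(values):
    """Return rows of (category, number, percent) followed by a total row."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = [(cat, n, round(100 * n / total, 1)) for cat, n in counts.items()]
    rows.append(("Total", total, 100.0))
    return rows


table = frequency_table(sexes)
for category, number, percent in table:
    print(f"{category:<8}{number:>6}{percent:>8.1f}")
```

The same function could be applied to any qualitative column of the spreadsheet, such as the exam result.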

Graphs

Graphs to present data derived from qualitative variables
To present data derived from qualitative variables (nominal and ordinal variables) there are two types of charts: pie chart and bar chart.

The graphs for the qualitative data

[Figure: pie chart and bar chart of the students' sex distribution (male, female). Bar chart: X axis Sex, Y axis Number.]

Pie Chart characteristics

It is a circle that has been partitioned according to the percentage distribution of a qualitative variable.
To construct the pie chart, the total area is 100%, with 1% equivalent to 3.6° of the circle.
We should use the percentages corresponding to each category, rather than the absolute frequency of each category.
Read the pie chart by beginning at the 12 o'clock position and proceeding clockwise.
Use no more than six sectors in a given pie chart; clarity is lost with more than six sectors.
Make sure that the sum of the pie chart sectors equals 100%.
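The pie chart rules above can be sketched as a small calculation that turns percentages into sector angles. This is an illustrative sketch only: the function name `pie_angles` and the rounding tolerance are assumptions, and the percentages are the sex distribution of the 18 students (61.1% males, 38.9% females).

```python
def pie_angles(percentages):
    """Convert category percentages into sector angles (1% = 3.6 degrees)."""
    total = sum(percentages.values())
    # The sectors should add up to 100%; a small tolerance allows for rounding.
    if abs(total - 100.0) > 0.5:
        raise ValueError(f"percentages sum to {total}, expected 100")
    # Clarity is lost with more than six sectors.
    if len(percentages) > 6:
        raise ValueError("use no more than six sectors")
    return {cat: round(pct * 3.6, 1) for cat, pct in percentages.items()}


angles = pie_angles({"Males": 61.1, "Females": 38.9})
print(angles)
```

With these percentages the two sectors come to about 220° and 140°, which together make the full circle of 360°.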

Bar Chart characteristics

In constructing bar charts, the category labels are usually listed horizontally in systematic order.
Vertical bars are drawn to represent the frequency or percentage in each category.
A space separates each bar to emphasize the nominal or ordinal nature of the variable.
Bar charts could make it easier to compare univariate distributions for two or more groups. Example: a bar chart can compare the percent distribution of males versus females according to the level of the disease (mild, moderate, severe).
Bar charts are suitable for presenting data when the total is more than 100%, as when respondents may give more than one response, e.g. the percent distribution of respondents according to their views.

[Figure: Disease severity among males and females. Bar chart; Y axis: percent (0 to 100); bars for males and females divided into mild, moderate and severe.]

[Fig.: Percent of patients according to their view of the tested drug. Bar chart; categories: no felt side effects, one tablet per day, no interference with daily work, good method of administration; values shown: 45%, 30%, 20%, 15%.]

Maps to present information

e.g. If the study explored the epidemiology of cholera, a map could be produced showing the geographical distribution of cholera cases, together with the distribution of protected water sources, thus illustrating that there is an association.

Graphs for quantitative data
Histogram
Frequency polygon
Frequency curve
Box plot
Error bar
Scatter plot

Histogram
It is appropriate for continuous variables (interval).
It is similar to the bar chart, but in the histogram the bars are placed side by side.
The bar length represents the percent (frequency) falling within each interval.
Each histogram has a total area of 100%.
For discrete variables (e.g. the number of children per family), the numbers representing the values should be centered below each bar to emphasize the discrete nature of the variable.

Quantitative data

[Fig. 10: Age distribution of medical statistics course students (Kasr El-Aini 1998). Three panels: histogram, frequency polygon and frequency curve. X axis: Age (years), class midpoints 22.5 to 52.5; Y axis: Frequency.]

Box plot

Error bar

Scatter diagrams

Scatter diagrams are useful for showing information on two variables which are possibly related.

[Figure 22.5: Weight of five-year-olds according to annual family income.]

Scatter plot

[Figure: three scatter plots with X as the independent variable and Y as the dependent variable. X increases, Y increases (labelled +1); X changes, Y does not follow; X increases, Y decreases (labelled -1).]

Line graphs
A line graph is particularly useful for numerical data if you wish to show a trend over time.
It is easy to show two or more distributions in one graph (e.g. males and females).

[Figure: Daily number of malaria patients at the health centers in District X.]

Parameters
A good summary of data.
Used for statistical comparison and testing.

Parameter
Qualitative: Proportion, Ratio
Quantitative: Central Tendency, Dispersion

For qualitative variables

Proportion = (Part / Total) × 100
Ratio = Part / Part = a / b

For quantitative variables:
1. Measures of Central Tendency (Location).
2. Measures of Scatter, Variations, Dispersion.

Measures of Central Tendency

Central tendency measures are parameters used to summarize data and include: arithmetic mean, mode, median and midrange.
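The two parameters for qualitative variables can be illustrated with the sex distribution of the 18 students (11 males, 7 females). The function names `proportion` and `ratio` below are hypothetical helpers chosen for this sketch.

```python
def proportion(part, total):
    """Proportion expressed as a percentage: (part / total) x 100."""
    return part / total * 100


def ratio(a, b):
    """Ratio of one part to another part: a / b."""
    return a / b


males, females = 11, 7
male_percent = round(proportion(males, males + females), 1)
male_to_female = round(ratio(males, females), 2)
print(male_percent)     # proportion of males among all students
print(male_to_female)   # male-to-female ratio
```

Note the difference: in a proportion the numerator is part of the denominator, while in a ratio the two parts are separate groups.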

DATA
Number of children in 10 houses (arranged from lowest to highest): 1, 1, 2, 2, 3, 3, 3, 4, 5, 6

a. The midrange:
(the smallest observation + the largest observation) / 2
Midrange = (1 + 6) / 2 = 3.5 children

b. The mode:
(The most frequent value).
The mode is 3 children, as three families have three children each.

c. The median:
(The value in the middle of an arranged group of values)
The value that has 50% of the observations equal to or more than it and the other 50% of observations equal to or less than it.
After arranging the 10 houses, the two middle houses (ranks 5 and 6) have 3 children each, so the median = (3 + 3) / 2 = 3 children.

General rule for median value calculation:
1- Arrange values from lowest to highest.
2- Depending on the number of observations (n), proceed:
* If n is odd, the median rank = (n + 1) / 2.
** If n is even, there will be two values in the middle, with ranks n/2 and (n/2) + 1; the median is their sum divided by 2.

d. Arithmetic mean (average)
Sum of all measurements divided by the number of measurements.
Mean = Σx / n = (1 + 1 + 2 + 2 + 3 + 3 + 3 + 4 + 5 + 6) / 10 = 30 / 10 = 3 children/family
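The four measures of central tendency can be checked for the ten houses with a short script. This is a minimal sketch using Python's standard `statistics` module; `midrange` and `median_by_rank` are hypothetical helpers written to follow the rules stated above.

```python
import statistics

# Number of children in the 10 houses of the worked example.
children = [1, 1, 2, 2, 3, 3, 3, 4, 5, 6]


def midrange(values):
    """(smallest observation + largest observation) / 2."""
    return (min(values) + max(values)) / 2


def median_by_rank(values):
    """Median following the rank rule for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # rank (n + 1) / 2, counted from 1
    lower = ordered[n // 2 - 1]            # rank n / 2
    upper = ordered[n // 2]                # rank (n / 2) + 1
    return (lower + upper) / 2


summary = {
    "midrange": midrange(children),
    "mode": statistics.mode(children),
    "median": median_by_rank(children),
    "mean": statistics.mean(children),
}
print(summary)
```

The hand-written median agrees with `statistics.median`, which applies the same rule.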

Parameter: advantages and disadvantages

Mid range
Advantages: Easy to calculate. Used for rapid evaluation.
Disadvantages: Does not include all values. Of no value in comparing groups.

Mode
Advantages: Used for rapid evaluation.
Disadvantages: May not show the center or middle of the group. There may be no value that occurs more than any other.

Median
Advantages: Takes the rank of all values into consideration. Its value is not affected by the extreme values (very high or very low).
Disadvantages: Of limited value for comparison.

Arithmetic mean
Advantages: Takes all values into consideration. The best mathematical parameter. Used for statistical analysis. Used for comparing between groups.
Disadvantages: Affected by the extreme values.

2. Measures of Scatter, Variations, Dispersion
The measures of central tendency alone give the common finding and help to know where to locate the group.
We can measure the extent of variation by using a measure of dispersion.

a. Range:
(It is the difference between the smallest and the largest observation.)
Range = 6 - 1 = 5 children
It is easy to calculate but does not consider each observation.
It is easily affected by extreme outliers.

b. Mean deviation around the mean
It is the average of the differences between the mean and each of the values of the individual observations.
Mean = 3

Family       1    2    3    4    5    6    7    8    9    10
Children     1    1    2    2    3    3    3    4    5    6
Difference   -2   -2   -1   -1   0    0    0    +1   +2   +3

Instead of taking the differences as calculated, since their sum = zero
{- 2 - 2 - 1 - 1 + 0 + 0 + 0 + 1 + 2 + 3 = 0 (zero)},
we add them together irrespective of the sign, taking the absolute differences (deviations) around the mean:

Sum of the absolute deviations = Σ|x - mean| = 12
The mean deviation around the mean = 12 / 10 = 1.2 children

c- Variance
In the previous calculations, mathematicians do not accept ignoring the negative sign; instead they suggested squaring the differences to remove the negative sign, then dividing the result by the number of observations minus one.

Σ(x - mean)² = 4 + 4 + 1 + 1 + 0 + 0 + 0 + 1 + 4 + 9 = 24
Average squared deviation around the mean = 24 / 9 = 2.7 (square child!!)

This value is called the variance (average squared differences around the mean).
The variance is used to calculate the standard deviation.

d- The standard deviation
We want this value to express the number of children that vary around the mean, so we take the square root of the variance and call it the standard deviation.
SD = √Variance

Example of Variance & SD

Mean = 4, n = 10

Measurements (x)   Deviations (x - mean)   Square of deviations
3                  -1                      1
5                  1                       1
5                  1                       1
1                  -3                      9
7                  3                       9
2                  -2                      4
6                  2                       4
7                  3                       9
0                  -4                      16
4                  0                       0
Sum: 40            0                       54

Variance = 54 / 9 = 6
Standard deviation = √Variance = square root of 6 = ±2.45 (about ±2.5)

It is a measure of spread.
Notice that the larger the deviations (positive or negative), the larger the variance.

Percentiles and quartiles

The percentiles are points that divide all the measurements (observations) into 100 equal parts; quartiles divide them into 4 equal parts.
To determine percentiles, the observations should first be arranged from the lowest to the highest value.
A percentile is a score value above which and below which a certain percentage of the values in the distribution fall.
The 50th percentile value corresponds to the median (half of the observations are higher and half are lower than the median value).

Percentile

Example: a sample of 100 newborn children was weighed, and the data were arranged from the lowest to the highest.
The value (score) of the 95th percentile was found to be 3.8 kg. This finding indicates that 95% of the children had a body weight less than 3.8 kg and 5% had a body weight more than 3.8 kg.
The value (score) of the 5th percentile was found to be 2.5 kg. This finding indicates that 5% of the children had a body weight less than 2.5 kg and 95% had a body weight more than 2.5 kg. Children are considered within normal weight if they are between 2.5 and 3.8 kg, i.e. between the 5th and 95th percentiles.

Measures of central tendency and their corresponding measures of dispersion

Central Tendency (Location)   Dispersion
Mid range                     Range
Mode                          Minimum, Maximum
Median                        Percentiles, quartiles
Arithmetic Mean               ADAM, Variance, Standard deviation
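The dispersion measures listed above can be recomputed for the two worked examples (the ten families and the ten measurements with mean 4). The sketch below uses plain Python; the function name `dispersion` and the dictionary keys are assumptions made for illustration.

```python
import math
import statistics


def dispersion(values):
    """Range, mean deviation, variance (n - 1 denominator) and SD of a sample."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    return {
        "range": max(values) - min(values),
        # Absolute deviations are averaged over n, as in the worked example.
        "mean_deviation": sum(abs(d) for d in deviations) / n,
        "variance": variance,
        "sd": math.sqrt(variance),
    }


children = [1, 1, 2, 2, 3, 3, 3, 4, 5, 6]       # children per family, mean = 3
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]   # second example, mean = 4

families = dispersion(children)
example = dispersion(measurements)
print(families)
print(example)
```

For the families this gives a range of 5, a mean deviation of 1.2 and a variance of 24/9 (about 2.7); for the second example the variance is 6 and the SD about 2.45.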

Estimating the population mean based on sample mean:

What is meant by:
Coefficient of variation
Standard error of the mean
95% confidence interval of the mean

SAMPLE   No.   MEAN    SD      SE     95% Confidence Interval for Mean
1        5     115.8   10.87   4.86   102.3 to 129.3
2        5     112.0   7.58    3.39   102.6 to 121.4
3        5     121.0   15.97   7.14   101.2 to 140.8
4        5     118.0   17.89   8.00   95.8 to 140.2
5        5     129.0   7.42    3.32   119.8 to 138.2
Total    25    119.2   12.98   2.59   113.8 to 124.5

Standard error of first sample mean = √(Variance / n) = √(118.2 / 5) = 4.86

Standard Error of sample mean

The standard error of the sample mean helps in estimation by calculating a range of values that contains the population mean.
The range of values = sample mean ± 2 SE
We can say with 95% confidence (not 100% sure) that the population mean, estimated from the first sample, is a value between 102.3 and 129.3 mmHg.
We may be mistaken in our prediction of the population mean in 5% of samples.
All the samples' 95% confidence intervals contain the population mean of 120 mmHg.

Normal distribution curve:
It shows how the data are generally distributed.
When the frequency curve is drawn for a large group of values it tends to take the shape called a normal distribution curve (mean = 120 & SD = ±5).

[Figure: Normal distribution curve. X axis: mmHg (105 to 135); Y axis: Frequency; arithmetic mean = 120, SD = 5; the range from -SD to +SD covers 68%; the 95% confidence interval is marked.]
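The standard error and 95% confidence interval of the first sample (mean 115.8, SD 10.87, n = 5) can be recomputed as a check. This is a sketch under stated assumptions: the helper names are hypothetical, the multiplier 2 is the rule of thumb given in the text, and the multiplier 2.776 (the t value for 4 degrees of freedom) is an assumption that happens to reproduce the interval printed in the sample table.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n), i.e. sqrt(variance / n)."""
    return sd / math.sqrt(n)


def confidence_interval(mean, se, multiplier=2.0):
    """Range of values: mean - multiplier x SE to mean + multiplier x SE."""
    return (mean - multiplier * se, mean + multiplier * se)


se = standard_error(10.87, 5)

# Rule of thumb from the text: sample mean +/- 2 SE.
low, high = confidence_interval(115.8, se)

# Assumed t multiplier for n = 5 (4 degrees of freedom); this matches the table.
t_low, t_high = confidence_interval(115.8, se, multiplier=2.776)

print(round(se, 2))
print(round(low, 1), round(high, 1))
print(round(t_low, 1), round(t_high, 1))
```

Both intervals contain the population mean of 120 mmHg; the wider one reflects the extra uncertainty of a sample of only five readings.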

Normal distribution curve of 1000 SBP readings of healthy adult males.

Characteristics of the normal distribution curve:
1- A bell-shaped curve with most of the values clustered near the mean and a few values out near the tails.
2- The values are distributed symmetrically around the mean.
3- The mean, the median and the mode of a normal distribution have the same value.
4- The percent (number) of individuals with a certain range of values is known:
* Mean ± 1 SD = 68% (68% of the observations have values within a range of the mean ± 1 SD).
In the example, 68% (n = 680) of normal males have blood pressure values between 115 and 125 mmHg.
* Mean ± 2 SD = 95% (95% of the observations have values within a range of the mean ± 2 SD).
In the example, 95% (n = 950) of normal males have blood pressure values between 110 and 130 mmHg.
5- The variations of the closest 95% of the data to the mean could be considered normal variations.

The normal distribution curve does not mean it is the distribution of normal healthy people; the word normal refers to the distribution of values.
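The 68% and 95% figures can be checked against a theoretical normal distribution with mean 120 and SD 5, as in the blood pressure example. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an assumption.

```python
from statistics import NormalDist

# Theoretical distribution of the example: mean 120 mmHg, SD 5 mmHg.
sbp = NormalDist(mu=120, sigma=5)


def share_within(dist, k):
    """Limits of mean +/- k SD and the proportion of values between them."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)


low1, high1, p1 = share_within(sbp, 1)
low2, high2, p2 = share_within(sbp, 2)
print(low1, high1, round(p1, 2))
print(low2, high2, round(p2, 2))
```

The exact proportions are slightly above the rounded slide values (about 68.3% and 95.4%), which is why 680 and 950 out of 1000 readings are approximations.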
