
ANALYSIS AND DESIGN OF RESEARCH STUDIES

LECTURE 1: VARIABLES AND DISTRIBUTIONS


SUMMARISING DATA

Objectives

After completing this session students should be able to:

Identify the variables in a study, their types and whether they are explanatory or response
variables.
Construct and interpret frequency tables for qualitative and quantitative variables.
Display and interpret qualitative and quantitative variables graphically.
Construct and interpret summary statistics for quantitative and qualitative variables.

Subjects (units of analysis) and variables

The approach in most investigations is to study and compare subjects in groups of reasonable
size, rather than to study individual subjects. In research, subjects (or units of analysis) are
not always people. They may be mosquitoes, mice, hospitals, villages etc. and will depend on
the study being carried out.

To analyse data from a study we form groups and compare them with respect to different
characteristics of the subjects. For example, we may compare count of CD4 cells per micro-
litre between HIV-1 and HIV-2 infected hospital patients at their first visit to hospital. In
statistical terminology, we call these characteristics variables, because the values they take (e.g.
CD4 cell count per micro-litre and HIV type) vary from subject to subject.

Explanatory and response variables

A variable in a study may be defined as:

1. An outcome (or response or dependent) variable. This is the variable that is the main
interest in our study. In the example above, CD4 count is considered the outcome of
interest.
2. An explanatory (or exposure or independent) variable. This is a variable that may influence
the value of our outcome. In the example above, HIV type is considered the explanatory
variable.

The distinction between explanatory and outcome variables is dependent on the context and
the objectives of the question being answered. In lecture 9 we will see the importance of
being able to identify the outcome of interest in a study, and how to design the study to be
able to answer the question about that outcome.

In randomised controlled trials, it is often a drug or treatment that is the main explanatory
variable. In most studies there is more than one explanatory variable that can influence the
response or outcome of interest. When we have more than one explanatory variable we call
the other variables covariates.

Types of variables

The main types of variables are:

Categorical or binary variables

Categorical variables. These variables can take one of several possible values related to
the different categories (or groups or levels) that are distinct from each other. For
example, ethnic group, blood group, marital status or HIV-subtype.

Binary variables. The simplest type of categorical variable which has only two categories.
For example, sex, whether or not child has parasitaemia, whether anaemic or not, IgE
seropositive or not.

Ordered categorical variables. These have different categories which are naturally ordered
on some scale. For example, severity of disease, performance scores for HIV-patients,
age-groups.

Quantitative variables

These variables describe a quantity which can be counted or measured numerically. Hence
they are sometimes called numerical variables. There are two different types:

Continuous variables. These can be quantified on a well-defined scale with units and, within
their range, do not have any restriction on the set of values that they can take. For example,
a child's weight can take not only the values 10kg, 11kg, 12kg but also 10.1kg, 10.2kg etc.

Discrete variables. Again these can be quantified on a well-defined scale with units but
the values that they can take are restricted to distinct, separate values, usually whole
numbers. They are usually the result of a count. For example, the number of live-born
children can only take the values 0, 1, 2, 3 etc.

Describing your data - displaying categorical variables

Before we begin to answer any question of interest from our study we need to summarise and
display our data to get some idea of what it is telling us. For example, we can look at how
the values of variables change from subject to subject i.e. what is the distribution of values
taken by a single variable or the association between two variables. We can summarise data
through tables or graphs. These presentations are purely descriptive with each having
advantages and disadvantages. Careful consideration must be given to what you would like
to show.

Presentation of categorical data using tables

Tables (and diagrams) should be well labelled and self-explanatory; you should be able to
obtain all the information you require from the table without any text to describe it. To
ensure this the title should be informative; the outcome and explanatory variables should be
clear; the percentages should be clearly derived; and footnotes should be provided for missing
values and abbreviations. However, tables must not be cluttered with too much information.

Summarising categorical variables is straightforward. For each category of a variable the
number of subjects is counted. These counts are known as frequencies. One-way tables
show the frequency of categories (or values) of each variable.

Example 1: A study was carried out among 400 children aged 3-14 years in Guinea-Bissau
to investigate whether the prevalence of atopy is lower in children who have been vaccinated
with BCG in infancy than in children who have not been vaccinated (footnote 1).

In Table 1 we have summarised the BCG status of 400 children from Guinea-Bissau, a
categorical variable with 3 categories: not vaccinated (i.e. no documents/no scar),
vaccinated - scar only, and vaccinated according to documents.

Table 1: BCG status of 400 children from Guinea-Bissau


BCG status                 No. of children   Percentage
Not vaccinated                          53         13.3
Vaccinated - scar only                  76         19.0
Vaccinated - documented                271         67.7
Total                                  400        100
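
As a minimal sketch of how a one-way table such as Table 1 is built, the counts can be turned into percentages of the total. The variable names are illustrative and not part of the original study; the percentages are printed to two decimal places so that the rounding to one decimal place used in Table 1 stays visible.

```python
# Counts taken from Table 1; the names used here are illustrative.
bcg_counts = {
    "Not vaccinated": 53,
    "Vaccinated - scar only": 76,
    "Vaccinated - documented": 271,
}

# Total number of children across all categories.
total = sum(bcg_counts.values())

# Percentage of all children falling in each category.
bcg_percentages = {
    status: 100 * count / total for status, count in bcg_counts.items()
}

for status, count in bcg_counts.items():
    print(f"{status:<26}{count:>5}{bcg_percentages[status]:>8.2f}")
print(f"{'Total':<26}{total:>5}")
```

The percentages always use the total number of subjects as the denominator, which is why the total should be shown alongside them.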

As well as examining the distribution of a single variable it is often of interest to display the
way in which the distribution of one variable relates to the distribution of another. When
both variables are categorical two-way tables (sometimes called contingency tables) can be
used (see Table 2). Often these represent the association between the outcome variable and
the categories of the explanatory variable.

Table 2: Comparison of atopy by BCG status among children in Guinea-Bissau


                             Atopic status
BCG status                Atopic      Not atopic    Total
Not documented/no scar    21 (40%)     32 (60%)        53
Scar only                 27 (36%)     49 (64%)        76
Documented                57 (21%)    214 (79%)       271

Atopic is defined as reaction (2mm) to any of the allergens Dermatophagoides pteronyssinus,
Dermatophagoides farinae and cockroach mix.

Since the question of interest is whether the prevalence of atopy is lower in children who have
been vaccinated with BCG than in children who have not been vaccinated, the outcome
variable is atopy, which is binary (yes or no), and the explanatory variable is BCG status.
Percentages show the distribution more clearly; a useful guide is that the percentages should
correspond to the exposure variable since we are interested in whether the distribution of the
outcome varies across the different categories of the exposure. In this example, the exposure is
BCG status and is presented as the row variable and therefore row percentages are appropriate.

1 Aaby P et al. Early BCG vaccination and reduction in atopy in Guinea-Bissau. Clinical and Experimental
Allergy 2000; 30: 644-50

In this example the proportion of non-vaccinated children who were atopic (40%) was higher
than the proportion of children vaccinated according to documentation who were atopic (21%).
Therefore there is some suggestion that the prevalence of atopy is lower in children who have
been vaccinated with BCG.
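
The row percentages in Table 2 can be reproduced with a short sketch like the one below, in which each count is divided by the total for its own row (its exposure category). The function name and data layout are assumptions made for illustration.

```python
# Counts from Table 2: (atopic, not atopic) for each BCG status.
table2 = {
    "Not documented/no scar": (21, 32),
    "Scar only": (27, 49),
    "Documented": (57, 214),
}


def row_percentages(table):
    """Return, for each row, each count as a percentage of that row's total."""
    result = {}
    for row_label, counts in table.items():
        row_total = sum(counts)
        result[row_label] = tuple(100 * c / row_total for c in counts)
    return result


row_pct = row_percentages(table2)
for label, (atopic, not_atopic) in row_pct.items():
    print(f"{label:<24}{atopic:6.1f}%{not_atopic:7.1f}%")
```

Had BCG status been presented as the column variable, the same reasoning would call for column percentages instead.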

Presentation of categorical data using graphs

Graphical representation can often be used to show the same information as a table but in a
more vivid manner. Graphs are particularly useful for presentations and talks. Frequencies
are often illustrated in two forms:

Pie charts - In a pie chart the frequencies or the percentages are represented by the angles
in different sectors (slices) of a circle; the total (360 degrees) is equal to 100%, as shown
in Figure 1

Bar charts - In a bar chart the numbers or percentages are represented by the lengths of
the bars, as shown in Figure 2.

Figure 1: Causes of death among Gambian children aged 1-4 years between 1989 and 1993

[Pie chart. Slices (cause; number of deaths; percentage):
Malaria; 39; 39%
Pneumonia; 14; 14%
Malnutrition; 12; 12%
Acute Gastro; 10; 10%
No agreement; 10; 10%
Unknown; 7; 7%
Meningitis; 3; 3%
Miscellaneous; 3; 3%
Measles; 1; 1%
Septicaemia; 1; 1%]

Malaria is the most common cause, accounting for 39% of the deaths in this age group.
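
To illustrate how the slices of a pie chart are sized, the sketch below converts the counts shown in Figure 1 into angles, using the fact that the whole circle (360 degrees) corresponds to 100% of the deaths. The dictionary and variable names are illustrative only.

```python
# Numbers of deaths by cause, as labelled in Figure 1.
deaths = {
    "Malaria": 39, "Pneumonia": 14, "Malnutrition": 12,
    "Acute Gastro": 10, "No agreement": 10, "Unknown": 7,
    "Meningitis": 3, "Miscellaneous": 3, "Measles": 1, "Septicaemia": 1,
}

total_deaths = sum(deaths.values())

# Each slice takes the same share of 360 degrees as its share of the deaths.
angles = {cause: 360 * n / total_deaths for cause, n in deaths.items()}

for cause, angle in angles.items():
    print(f"{cause:<14}{deaths[cause]:>4}{angle:>8.1f} degrees")
```

On this basis the malaria slice spans about 140 degrees, a little over a third of the circle.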

Pie charts are useful for presenting the distribution of a single categorical variable, in this
case the categorical variable (with 10 groups) cause of death. When presenting distributions
for two groups, it is best to use a bar chart. Each bar represents a level of each categorical
variable and the lengths of the bars are drawn proportional to the frequencies or percentages.
Unlike a histogram (see later in this lecture) the bars are equally spaced and have the same
width.

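Figure 2: Proportion of children dying from pneumonia and malaria among Gambian
children aged 1-4 years

[Bar chart: year on the horizontal axis, percentage on the vertical axis, with separate bars
for pneumonia and malaria.]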

Figure 2 shows the percentage of children who died each year from pneumonia or malaria.
The horizontal axis shows the year, while the vertical axis shows the percentage. Each year,
many more children died from malaria than from pneumonia in this age group. A bar chart is
often useful for showing how the y-axis (outcome) variable changes in relation to the x-axis
(explanatory) variable, for example, does mortality from malaria increase/decrease/remain
stable over time?

Describing your data - displaying quantitative variables

Presentation of quantitative data using tables

The frequencies with which different possible values of a quantitative variable occur may be
summarised as a frequency distribution. The frequency distribution of individual values is
seldom helpful, unless the overall number of observations is quite small. It is more useful to
group the values taken by the variable and to report the numbers and the frequencies (or
percentage frequencies) of subjects in each group.

The first step when forming a frequency distribution is to identify the lowest and highest
values. Then the number and size of the groups is determined. The number of groups will
depend on the observations; if the number of groups is too few (width of the groups is large)
too much information will be lost, while too many groups (width of the groups is small) may
be impractical. Where possible each group should be the same width and the starting points
should be whole numbers with no gaps.

Example 2: In a study of HIV infected patients presenting at a hospital in the Gambia, a total
of 1084 patients were classified by CD4 cell count (cells/µl) and HIV type at their first
presentation to hospital.

The raw data would look like this:

Obs     Hosp No.   HIV type   CD4 count
1       960123     1          150
2       960246     2          398
3       960369     2          756
4       970001     2          443
.
.
1081    010006     1          349
1082    010118     2          606
1083    010101     1          630
1084    010121     1          210

The lowest CD4 count is 132 cells/µl and the highest count 802 cells/µl (minimum and
maximum values not shown in raw data example). Three categories were chosen to reflect
the distribution of the CD4 counts, the number of patients, and literature on the relationship
between CD4 count and the initiation of treatment.

Table 3 shows the frequency distribution of the variable CD4 cell count according to the
distinct categories <200, 200-499 and >499 separately for patients infected with HIV-1 and
those infected with HIV-2. This is an example of how a frequency distribution can be used to
examine the association between a categorical variable (HIV type) and a quantitative variable
(CD4 count). In a similar way the distribution of the CD4 counts could have been examined
ignoring HIV type by producing a single frequency distribution.

Table 3: Frequency distribution of CD4 counts (cells/µl) for 1084 HIV-1 and HIV-2
patients

CD4 count     HIV-1                  HIV-2
(cells/µl)    No. of patients (%)    No. of patients (%)
<200          255 (43)               160 (33)
200-499       224 (38)               154 (31)
>499          116 (19)               175 (36)
Total         595 (100)              489 (100)

Note the use of percentages. It is important to include these, because the results can then be
compared more easily between groups of different size.
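
The grouping step behind a table like Table 3 can be sketched as follows. Only the eight records visible in the raw-data listing are used, so the output does not reproduce Table 3; record 1082 is read here as HIV type 2 with a CD4 count of 606, which is an assumption about the listing. The function and variable names are hypothetical.

```python
from collections import Counter


def cd4_group(count):
    """Assign a CD4 count (cells/µl) to one of the three groups used in Table 3."""
    if count < 200:
        return "<200"
    if count <= 499:
        return "200-499"
    return ">499"


# (HIV type, CD4 count) for the eight records shown in the raw-data listing.
records = [(1, 150), (2, 398), (2, 756), (2, 443),
           (1, 349), (2, 606), (1, 630), (1, 210)]

# Frequency of each CD4 group within each HIV type.
freq = Counter((hiv_type, cd4_group(cd4)) for hiv_type, cd4 in records)

for hiv_type in (1, 2):
    n = sum(k for (t, _), k in freq.items() if t == hiv_type)
    for group in ("<200", "200-499", ">499"):
        k = freq[(hiv_type, group)]
        print(f"HIV-{hiv_type}  {group:<8}{k:>3}  ({100 * k / n:.0f}%)")
```

The percentages are calculated within each HIV type, so that groups of different size can be compared directly.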

Question: Is there a difference in CD4 cell count between HIV-1 and HIV-2 patients? If so,
how can we tell whether this difference is meaningful?

The answer to this question will be given during the course of lectures in this study module.
However, this table allows us to make an initial comparison of the distribution of CD4 cell
count between the two HIV groups. HIV-1 patients appear to have lower CD4 counts than
HIV-2 patients.

Presentation of quantitative data using graphs

Graphical presentation of quantitative variables can take three forms:

Histograms - A histogram is similar to a bar chart, with (usually) the values of the
variable grouped into several categories and the bars can then represent the frequencies.
The bars touch one another to indicate the continuous nature of the variable. If the widths
of the groups are different this should be reflected in the histogram through the thickness
of the bars; the area of the bar should be proportional to the frequency. A histogram can
be used to illustrate the distribution of a single quantitative variable or the distribution of
a quantitative variable across the levels of a categorical variable.
Cumulative frequency curves - The cumulative frequency is the number of observations less
than (or equal to) a particular value.
Scatter plots - A simple graph used to examine the relationship between two quantitative
variables. Each pair of values is represented by a symbol where the horizontal position is
determined by the value of the first variable (exposure) and the vertical position by the
value of the second variable (outcome).

These initial displays of the data are particularly useful for identifying outliers or unusual
values, and revealing possible errors.

Example 3: Haemoglobin levels (g/dL) for 70 women were collected. Some of the raw values
were as follows. The highest value (15.1) and the lowest value (8.8) are included.

10.2 10.5 .
13.3 12.9 .
10.6 13.5 .
12.1 11.4 .
9.3 15.1 .
12.0 11.1 11.8
13.4 . 11.2
11.9 . 8.8
11.2 . 13.0
14.6 . 13.2
13.7 . 10.2
12.9 . 9.7

A histogram of the haemoglobin (hb) values in the 70 women is given in Figure 3. In a
histogram, it is the area of the rectangle which represents the frequency (or percentage) - the
vertical scale is measured in frequency per unit of value and the horizontal scale is measured
in unit values. Note that the rectangles are drawn from 8 up to 9, 9 up to 10 etc, not from 8 up
to 8.9, 9 up to 9.9 etc which would correspond to the actual range of recorded values.

Figure 3: Distribution of haemoglobin (g/dL) among 70 women attending the clinic

[Histogram: Hb (g/dL) on the horizontal axis, no. of women on the vertical axis.]

If we had combined the last two groups (14.0-15.9 g/dL) this interval would be twice the
width of the other intervals and therefore the height of the bar would be half the frequency for
this group.
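
The rule that area, not height, represents frequency can be illustrated with a small sketch. It uses the frequencies 5 and 1 for the last two haemoglobin groups (as tabulated in Table 4); the function name is an assumption made for illustration.

```python
def bar_height(frequency, lower, upper):
    """Height of a histogram bar chosen so that height x width equals the frequency."""
    return frequency / (upper - lower)


# Two separate one-unit intervals: 14 up to 15 and 15 up to 16 g/dL.
h_14 = bar_height(5, 14, 15)
h_15 = bar_height(1, 15, 16)

# The same six women placed in one combined interval, 14 up to 16 g/dL.
h_combined = bar_height(5 + 1, 14, 16)

print(h_14, h_15, h_combined)
```

The combined bar holds 6 women but is drawn 3 units high, because it is 2 units wide.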

An alternative to the histogram is to display the cumulative frequency distribution. The
cumulative frequencies start from the lowest value and show how the number of individuals
increases as the values of the variable increase. These are calculated below for the
haemoglobin data:

Table 4: Cumulative frequency and percentage for different ranges of Hb levels of 70
women attending a clinic

Haemoglobin (g/dL)   Frequency   Cumulative   Cumulative
                                 frequency    percentage
8-8.9                     1           1           1.4
9-9.9                     3           4           5.7
10-10.9                  14          18          25.7
11-11.9                  19          37          52.9
12-12.9                  14          51          72.9
13-13.9                  13          64          91.4
14-14.9                   5          69          98.6
15-15.9                   1          70         100

The table shows that the cumulative percentage of women whose haemoglobin is below 9 is
1.4%, the cumulative percentage of women whose haemoglobin is below 10 is 5.7% and so
on. In the cumulative percentage frequency curve, shown in Figure 4, the cumulative
percentage frequencies are plotted against the right hand ends of their intervals.

Figure 4: Cumulative frequency distribution of Hb levels of 70 women attending a clinic
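
The cumulative columns of Table 4 follow directly from the frequency column, as the sketch below shows. It uses `itertools.accumulate` from the Python standard library to form running totals; the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies for the intervals 8-8.9, 9-9.9, ..., 15-15.9 g/dL (Table 4).
frequencies = [1, 3, 14, 19, 14, 13, 5, 1]

# Running total of the frequencies, starting from the lowest interval.
cumulative = list(accumulate(frequencies))

# Each running total expressed as a percentage of all observations.
n = cumulative[-1]
cumulative_pct = [100 * c / n for c in cumulative]

for f, c, p in zip(frequencies, cumulative, cumulative_pct):
    print(f"{f:>3}{c:>5}{p:>8.1f}")
```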

It is not as easy to see the shape of a distribution from a cumulative curve as from a
histogram, but the curve is steep where the majority of Hb values are concentrated and if the
Hb levels were evenly distributed then the curve would increase at a constant rate.
Cumulative curves are easier to use when there are different grouping intervals (they avoid
the problems of calculating the heights of the bars to equalise the areas). They also have other
more specialised applications. For example, they are more useful for calculating medians (see
below; briefly, the median haemoglobin level is obtained by drawing a horizontal line
through the 50% point on the vertical axis and noting the haemoglobin level at the point
where it cuts the curve).
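
Reading the median off the cumulative curve can be mimicked numerically. The sketch below assumes the curve is made of straight-line segments joining the plotted points (right-hand interval ends against the cumulative percentages of Table 4, starting from 0% at 8 g/dL); that assumption, and the function name, are not part of the original notes.

```python
# (Hb value at the right-hand end of each interval, cumulative percentage),
# preceded by the starting point of the curve at 0%.
curve = [(8, 0.0), (9, 1.4), (10, 5.7), (11, 25.7), (12, 52.9),
         (13, 72.9), (14, 91.4), (15, 98.6), (16, 100.0)]


def read_off(curve, target_pct):
    """Return the value at which the curve reaches target_pct (linear interpolation)."""
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if y0 <= target_pct <= y1:
            if y1 == y0:  # flat segment: no interpolation needed
                return x0
            return x0 + (target_pct - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target percentage is outside the range of the curve")


median_hb = read_off(curve, 50)
print(f"Median haemoglobin is approximately {median_hb:.1f} g/dL")
```

The 50% point falls between 11 and 12 g/dL, where the cumulative percentage rises from 25.7% to 52.9%.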

When it is of interest to compare two quantitative variables, the relationship between
these variables can be examined through a scatter plot. Figure 5 shows how height (outcome)
varies with age (exposure). The plot suggests that height increases roughly linearly as age
increases, but that perhaps from the age of 15 years the increase in height slows down.

Figure 5: Scatter plot showing the relationship between height and age among 1384
children aged 3-17 years

[Scatter plot: age (years) on the horizontal axis, from 0 to 20; height (cm) on the vertical
axis, from 80 to 180.]

Summary statistics

Frequency distributions give a general picture of the distribution of a variable but it is often
useful to summarise the variable further through summary statistics.

Distributions of categorical variables (including binary and ordered categorical variables) can
be expressed as percentages, which can be contrasted between groups, as shown in Example
1. The distribution of a binary variable is expressible with only one percentage (with its
accompanying total for example). There is little more that can be done to summarise the
categorical variables.

For a quantitative variable however other measures summarise the information in the values
taken by the variable. The two parameters needed to summarise quantitative data are

1. a measure of the middle of the data (also known as the location)
2. a measure of the spread of the data above and below the middle.

Symmetrically distributed data

Provided the distribution of the data is roughly bell-shaped or symmetric, as in Example 3,
the mean is a good measure of the central value and the standard deviation a good measure
of the spread.

For such data, roughly 70% of the observations lie within one standard deviation either side
of the mean and about 95% of observations lie within two standard deviations of the mean. This is
based on a theoretical frequency distribution called the normal distribution (see Table A1) and
will be discussed further in lecture 2.
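
These two percentages can be checked against the normal distribution itself. The sketch below uses `NormalDist` from the Python standard library (`statistics` module, Python 3.8 or later); the helper function name is an assumption made for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_one = proportion_within(1)   # roughly 0.68
within_two = proportion_within(2)   # roughly 0.95

print(f"{100 * within_one:.1f}% within 1 sd, {100 * within_two:.1f}% within 2 sd")
```

Real data only follow these proportions approximately, and only when the distribution is close to normal.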

Below, we use the letter n to denote the number of observations and the letter x to denote the
values themselves.

The mean

The most commonly used measure of the central value of a distribution is the (arithmetic)
mean, or the average. It is the sum of the observations divided by the number of
observations. A formula that corresponds to this calculation is

$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The symbol $\bar{x}$ (x-bar) is commonly used to represent a sample mean, and the symbol
$\sum$ indicates summation, to add up over the different values which follow, i.e.
$\sum_{i=1}^{n} x_i$ represents the sum of all the $x_i$ values.

We use i in mathematical notation to denote each observation. If we have a sample of size 10,
then i can take the values 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10. So the notation $\sum_{i=1}^{n} x_i$
(with n = 10) means that we take the sum of all values $x_1$, $x_2$, ... up to $x_{10}$, that is,
sum all the values from the 1st one to the 10th one where $x_1$ is the first one and $x_{10}$ is
the 10th one.

The standard deviation

This is a measure of spread. It is based on the deviations of the observations from the mean,
that is, the difference between each observation and the mean.

1. These deviations are squared and added.
2. The result is divided by (n-1). This is called the variance.
3. The standard deviation is the square root of the variance.

$$sd = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The abbreviation sd is often used to denote standard deviation in a sample (s is also
commonly used).

Example 3 continued: The mean haemoglobin level and corresponding standard deviation
of the 70 values can be calculated as follows:

The arithmetic mean haemoglobin level is:

$$\bar{x} = \frac{10.2 + 13.3 + 10.6 + 12.1 + \dots + 13.0 + 13.2 + 10.2 + 9.7}{70} = 11.98 \text{ g/dL}$$

The variance is:

$$s^2 = \frac{(10.2 - 12.0)^2 + (13.3 - 12.0)^2 + \dots + (10.2 - 12.0)^2 + (9.7 - 12.0)^2}{70 - 1} = 2.00$$

Hence the standard deviation is:

$$s = \sqrt{2.00} = 1.42 \text{ g/dL}$$

An additional measure of variation is the range, which is the difference between the largest
and smallest values. The disadvantage with the range is that because it only uses these two
values it provides no information on the distribution of the values that lie in between. Hence
the standard deviation is the preferred measure of variation since it uses all the observations.
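
The three steps for the standard deviation can be sketched in code. Because only part of the haemoglobin data is listed, the example uses just the 12 values in the first column of the Example 3 listing, so its results differ from the mean and standard deviation quoted for all 70 women. The function names are illustrative.

```python
import math


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)


def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


# Subset only: first column of the haemoglobin values listed in Example 3.
hb_subset = [10.2, 13.3, 10.6, 12.1, 9.3, 12.0,
             13.4, 11.9, 11.2, 14.6, 13.7, 12.9]

print(f"mean = {mean(hb_subset):.2f} g/dL")
print(f"sd   = {standard_deviation(hb_subset):.2f} g/dL")
print(f"range = {max(hb_subset) - min(hb_subset):.1f} g/dL")
```

The range printed on the last line depends only on the two extreme values, whereas the standard deviation changes whenever any observation changes.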

Non-symmetric distributions

The mean and standard deviation may be satisfactory for reasonably symmetric distributions,
but are less so for distributions that are clearly not symmetric.

Example 4: The number of days spent in hospital by 17 subjects after an operation, arranged
in increasing length of stay, was as follows:

3 4 4 6 8 8 8 10 10 12 14 14 17 25 27 37 42

The distribution is not symmetric because low values are closer together and often repeated
compared with the string of a few high values. In this example the data are right skewed or
positively skewed (there is a long right-hand tail). The mean is 14.6 days. This is clearly not
in the centre of the distribution as seen in Figure 6.

Figure 6: Frequency distribution for the number of days spent in hospital

[Chart: no. of days spent in hospital on the horizontal axis, frequency on the vertical axis.]

The median

The median (50th percentile) is an alternative measure of central value that works better for
such a skewed distribution. It is the value that halves the distribution, with 50% of the
observations below it and 50% above. If the observations are arranged in increasing order the
median is:

$$\text{median} = \frac{(n+1)}{2}\text{th value}$$
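
A short sketch shows how the median resists the long right-hand tail that pulls the mean upwards. It takes the 17 lengths of stay in Example 4 to be the values listed in the code (an assumption: 27 is counted once, which reproduces the stated mean of 14.6 days). The function name is illustrative.

```python
def median(values):
    """Median as the ((n + 1) / 2)th value of the sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2           # 1-based position; ends in .5 when n is even
    lower = ordered[int(position) - 1]
    if n % 2 == 1:
        return lower                 # odd n: a single middle value
    upper = ordered[int(position)]
    return (lower + upper) / 2       # even n: average of the two middle values


# Lengths of stay (days); 17 values, with 27 assumed to occur once.
stay_days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

mean_stay = sum(stay_days) / len(stay_days)
median_stay = median(stay_days)

print(f"mean = {mean_stay:.1f} days, median = {median_stay} days")
```

With n = 17 the median is the 9th value, which is well below the mean.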

Quartiles

A useful measure of the spread of the data that can be presented alongside the median is the
inter-quartile range. This is defined as the distance between the first quartile (25th
percentile) and the third quartile (75th percentile) of the distribution. The median is the
second quartile (50th percentile).

For a distribution with a large number of observations the quartiles are most easily found
from a cumulative relative frequency distribution (such as in Figure 4), by reading off the
values that correspond to 25% (first quartile), 50% (median) and 75% (third quartile).

For a smaller number of observations, the median can be found directly by arranging the
observations in order from the lowest to the highest value and striking off values at both ends
until either one or two values remain. If one value remains, then this is the median; if two
values remain, then the median is half way between them (i.e. it is the average of the two
values).

The procedure for finding the Pth percentile can be more systematically generalised to:
1. Sort the data values in ascending order, so that $x_1$ is the smallest value and $x_n$ is the
largest value, where n is the sample size.
2. Calculate n×P/100 (i.e. n×25/100 for the first quartile, n×50/100 for the median, etc.)
3. Find the ith data value, $x_i$, where i is the integer just larger than n×P/100.
4. If (i − 1) = n×P/100, the Pth percentile is $(x_{i-1} + x_i)/2$; otherwise the Pth percentile
is $x_i$.

Example 4 continued: To calculate the number of hospital stay days corresponding to the
first quartile (25th percentile):

1. Sort the data in ascending order, i.e. $x_1 = 3$, ..., $x_n = 42$.
2. There are 17 observations so n = 17. Calculate 17×25/100 = 4.25.
3. The integer i just larger than 4.25 is 5. The 5th data value, $x_5$, is 8.
4. Since (i − 1) = 5 − 1 = 4 is not equal to 4.25, the data value $x_5 = 8$ is the first
quartile.
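
The four-step percentile procedure can be sketched as a function. Integer arithmetic is used to test whether n×P/100 is a whole number, so P is assumed to be a whole number between 1 and 99. The 17 lengths of stay are taken to be the values listed in the code (27 counted once, an assumption consistent with the stated mean of 14.6 days), and the function name is hypothetical.

```python
def percentile(values, p):
    """Return the Pth percentile using the four-step procedure.

    p is assumed to be a whole number strictly between 0 and 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    x = sorted(values)                      # step 1: ascending order
    n = len(x)
    whole, remainder = divmod(n * p, 100)   # step 2: n*P/100 = whole + remainder/100
    i = whole + 1                           # step 3: integer just larger than n*P/100
    if remainder == 0:                      # step 4: (i - 1) equals n*P/100 exactly
        return (x[i - 2] + x[i - 1]) / 2    # average of x_(i-1) and x_i
    return x[i - 1]                         # x_i (Python lists are 0-based)


# Lengths of stay (days); 17 values, with 27 assumed to occur once.
stay_days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

first_quartile = percentile(stay_days, 25)
median_stay = percentile(stay_days, 50)
third_quartile = percentile(stay_days, 75)

print(min(stay_days), first_quartile, median_stay, third_quartile, max(stay_days))
```

Other definitions of a percentile, including those used by statistical software, can give slightly different values for small samples.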

Note there are other ways to define a percentile, but they do not concern us in this study
module.

The geometric mean

When the distribution is skewed it may be more appropriate to transform the data to make the
distribution more normal and take the (arithmetic) mean of this transformed distribution. A
common form of transformation is to take the logarithm of the data. The mean of the
log-transformed data, converted back to the original scale by taking its antilogarithm, is
known as the geometric mean (see Lecture 6 for further details).
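
As a sketch of the calculation, the geometric mean can be obtained by averaging the natural logarithms of the values and then taking the exponential of that average. Applying it to the lengths of stay from Example 4 is an illustration only; the 17 values are taken to be those listed in the code (27 counted once, an assumption), and the function name is illustrative.

```python
import math


def geometric_mean(values):
    """Antilog of the arithmetic mean of the log-transformed (positive) values."""
    logs = [math.log(x) for x in values]
    mean_log = sum(logs) / len(logs)
    return math.exp(mean_log)


# Lengths of stay (days); 17 values, with 27 assumed to occur once.
stay_days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

gm_stay = geometric_mean(stay_days)
am_stay = sum(stay_days) / len(stay_days)

print(f"geometric mean = {gm_stay:.1f} days, arithmetic mean = {am_stay:.1f} days")
```

For these right-skewed data the geometric mean is roughly 11 days, closer to the median than the arithmetic mean is. The method requires all values to be positive.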

The mode

The mode is another measure of the average. It is the value which occurs most often. It is
seldom used: it cannot be determined when all values of the variable are different and,
even when it can be determined, it may be misleading.

Overall summary of a distribution

The following five numbers give a general purpose summary of a distribution, for both
non-symmetric and symmetric distributions:

minimum
first quartile
median
third quartile
maximum

These numbers are often shown in a figure called a box-whisker plot, or simply called box-
plot, where the box includes 50% of the distribution (from the first quartile to the third
quartile), and the whiskers can either mark the full range of the data i.e. from the minimum
value to the maximum value, or the full range of the data excluding outliers.

For the days in hospital, the five number summary is 3, 8, 10, 17 and 42 days. The figure
below shows these data in the form of a box-plot.

[Box-plot of length of stay (in days): minimum = 3, lower quartile = 8, median = 10,
upper quartile = 17, maximum = 42. The largest values are marked as extreme values.]

Box-plots are useful for presenting several distributions in one figure, enabling them to be
compared easily.

Figure 7: Box-plot showing the distribution of IFN-g after PPD stimulation at 0, 2 and 4
months of age

[Box-plots: age at vaccination (months: 0, 2, 4) on the horizontal axis, IFN-g after PPD
stimulation (0 to 800) on the vertical axis.]

Outliers or extreme values

There is no absolute definition of what constitutes an outlier, and although it is usually best
practice to show outliers there is nothing that states that they must be excluded from the
whiskers. Each software package, and each individual, will use their own definition about
what constitutes an outlier. One commonly used definition is:

- Take the distance between the median and the upper (or lower) quartile
- Double the distance
- Any value above the upper quartile plus the doubled distance, or any value below the lower
quartile minus the doubled distance, can be considered an outlier

Some other commonly used definitions include:

- Any data observation outside the distance corresponding to the IQR above the upper
quartile and below the lower quartile
- Any data observation which lies more than 1.5*IQR lower than the first quartile or
1.5*IQR higher than the third quartile. (Stata statistical software uses this definition)
- Any data observation lower than the 2.5th percentile or greater than the 97.5th percentile

The important message is that you should always check for outliers and if you use a box plot,
just show them clearly.
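
As an illustration of one of these definitions, the sketch below applies the 1.5*IQR rule to the lengths of stay from Example 4, using the quartiles of 8 and 17 days from the five number summary. The 17 values are taken to be those listed in the code (27 counted once, an assumption), and the function name and the parameter k are hypothetical.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Values lying more than k * IQR below the first or above the third quartile."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Lengths of stay (days); 17 values, with 27 assumed to occur once.
stay_days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

# IQR = 17 - 8 = 9, so the fences are 8 - 13.5 = -5.5 and 17 + 13.5 = 30.5.
outliers = iqr_outliers(stay_days, q1=8, q3=17)
print(outliers)
```

Setting k=1 gives the definition based on one IQR beyond the quartiles, which flags more values; this shows how much the choice of definition matters.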

Where outliers or unusual data points are observed, it is important to go back and check these
values for measurement and recording errors. If this does not yield alternative values
for these points then they must be assumed to be real values. Only if the data values are
impossible can they be treated as missing values in an analysis, for example, a value that
could not possibly have come from the test kit, or an age of 150 or a value of 7 for a
categorical variable with only 4 categories.
