Instructor: A. Tetteh
Lecture 1
September 10, 2018
Recommended reading:
Hartl and Clark, Chapter 2
Assigned reading:
Fan, J.-B., Chee, M. S. and Gunderson, K. L. 2006. Highly parallel genomic assays. Nat. Rev.
Genet. 7: 632-644.
Sharp, A. J., Cheng, Z. and Eichler, E. E. 2006. Structural variation in the human genome. Annu.
Rev. Genomics Hum. Genet. 7: 407-442.
Goal
The goal of studying population genetics is to understand the factors that give rise to variation in
populations, that is, the distribution of allele frequencies and how they change in time and space.
Population genetics is the study of the distribution of allele frequencies and interaction of alleles
and genes in populations and how the frequencies change under the influence of the four main
evolutionary forces leading to adaptation and speciation. The four main evolutionary processes
are:
natural selection
genetic drift
mutation
gene flow
The theory of population genetics also encompasses other factors that shape the patterning of
variation within and between populations and species by altering the distribution of allele
frequencies. These include recombination, mating systems, population size, patterns of migration,
segregation, transposition of mobile elements, population subdivision and population structure.
The discipline of population genetics was founded by Sewall G. Wright, J. B. S. Haldane and R.
A. Fisher, who also laid the foundations for the related discipline of quantitative genetics.
[Figure: portraits of Sewall Wright, J.B.S. Haldane and R.A. Fisher]
Sewall Wright: December 21, 1889 - March 3, 1988, an American geneticist known for his
influential work on evolutionary theory and on path analysis.
J.B.S. Haldane: November 5, 1892 - December 1, 1964, a British geneticist and evolutionary
biologist credited with a central role in the development of neo-Darwinian thinking.
R.A. Fisher: February 17, 1890 - July 29, 1962, an English statistician who made major
contributions to statistics, evolutionary biology and genetics.
What is a population?
A population of a species is typically divided into subpopulations, also called local populations
or demes. The collection of subpopulations that together form a single large population is called
a metapopulation. Within a metapopulation, individuals can be separated by time, space, or some
social structure, so that the subpopulations are not continuous and panmixis (random mating
across the whole population) may not exist. As a result, different areas of the metapopulation
can have distinct gene frequencies.
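The idea that demes of one metapopulation can differ in gene frequencies can be made concrete with a small calculation. The sketch below assumes a single locus with two alleles, A and a, and uses made-up genotype samples from two hypothetical demes; the function name and the data are illustrative only.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} for a list of diploid genotypes such as 'Aa'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical genotype samples from two demes of one metapopulation.
deme_1 = ["AA"] * 6 + ["Aa"] * 3 + ["aa"] * 1
deme_2 = ["AA"] * 1 + ["Aa"] * 3 + ["aa"] * 6

freq_1 = allele_frequencies(deme_1)
freq_2 = allele_frequencies(deme_2)
freq_pooled = allele_frequencies(deme_1 + deme_2)

print("deme 1:", freq_1)
print("deme 2:", freq_2)
print("pooled:", freq_pooled)
```

In this made-up sample the pooled frequency of A is intermediate between the two demes, which is why pooling subpopulations can hide differences in gene frequencies among them.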
Population genetics generally deals with genetic variants that segregate as Mendelian factors,
such as variation in DNA sequences or mutations with major phenotypic effects. Population
geneticists seek to describe the patterning of variation within and between populations,
increasingly on a whole genome scale, and to infer from these observations what demographic
and evolutionary forces have acted historically to generate the observed patterns, by comparing
the observations with predictions of theoretical models.
Quantitative genetics is concerned with subtle differences in phenotypes between individuals.
Examples are differences in aspects of morphology (such as height and weight), physiology
(such as blood pressure), behavior, and susceptibility to common diseases. Quantitative genetics
considers the variation between individuals that can be readily observed as phenotypes that are
typically continuously distributed in populations. This continuous phenotypic variation is caused
by the joint segregation of multiple genes affecting the trait, as well as variation caused by the
effects of the environment on expression of the alleles. Quantitative geneticists are concerned
with the inheritance of phenotypic measurements, which integrates population genetic
principles with the rules of Mendelian inheritance applied to multiple loci.
Alleles are alternative forms of genes. A gene is a DNA sequence that codes for a protein or a
functional RNA such as ribosomal RNA. Genes are the particulate factors that are transmitted
from parents to offspring. DNA is a polynucleotide.
[Figure: bases and sugars of nucleic acids]
Nomenclature
Bases: no sugar or phosphate
Nucleosides = sugar + base
Nucleotides = sugar + phosphate group + base
[Figure: a nucleoside, nucleotides and a trinucleotide]
Alleles differ from one another in the sequence of DNA at the chromosomal locus.
Variation in DNA sequences may arise due to mutations.
Mendelian genetics
In 1865 Gregor Mendel, an Augustinian monk in Brno (now in the Czech Republic), reported his
findings on the inheritance of seven different traits in the garden pea. Mendel was the first to
describe how hereditary factors, now called genes, are transmitted between generations.
[Figure: Gregor Mendel, 1822-1884]
Different versions of each gene are called alleles. One allele can be dominant over the other, the
recessive allele. Mendel performed a monohybrid cross (a cross for a single trait) between
green-seeded and yellow-seeded garden peas. All progeny in the first filial (F1) generation were
yellow seeded, demonstrating that yellow color was dominant and green color was recessive. When
the F1 progeny were selfed to produce the F2, some green-seeded peas reappeared. Mendel
concluded that the allele from the green-seeded parent must have been preserved in the F1
generation even though it did not affect the seed color. Each parent carried two copies of the
gene, i.e. the parents were diploid for that trait. Homozygotes had two copies of the same allele.
Heterozygotes had one copy of each allele. Gametes carried only one copy of the gene, i.e. they
were haploid.
In Mendelian genetics, the progeny could be assigned to discrete classes of either green or
yellow, with no mixed phenotype. The ratio of yellow-seeded to green-seeded plants in the F2 was
3:1, as expected for the inheritance of a single major gene controlling a qualitative trait.
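The 3:1 ratio follows from pairing the gametes of two F1 heterozygotes in all equally likely ways. The sketch below is a minimal illustration, assuming a single locus with alleles Y (yellow, dominant) and y (green, recessive); the function names are chosen for this example only.

```python
from collections import Counter
from itertools import product

def cross(parent_1, parent_2):
    """Count offspring genotypes from all equally likely gamete pairings."""
    offspring = Counter()
    for gamete_1, gamete_2 in product(parent_1, parent_2):
        # Sort so that 'yY' and 'Yy' are counted as the same genotype.
        offspring["".join(sorted(gamete_1 + gamete_2))] += 1
    return offspring

def seed_color(genotype):
    """Y (yellow) is dominant over y (green)."""
    return "yellow" if "Y" in genotype else "green"

# Selfing the F1 heterozygote (Yy x Yy) gives the F2.
f2_genotypes = cross("Yy", "Yy")
f2_phenotypes = Counter()
for genotype, n in f2_genotypes.items():
    f2_phenotypes[seed_color(genotype)] += n

print(dict(f2_genotypes))
print(dict(f2_phenotypes))
```

The genotype counts come out as 1 YY : 2 Yy : 1 yy, and because Y is dominant the first two classes are both yellow, giving 3 yellow : 1 green.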
Law of segregation
Homologous chromosomes separate during the production of gametes, so that in a heterozygous
plant half of the gametes carry one allele and half carry the other allele.
A dihybrid cross, that is, a cross between plants differing in two traits, produced progeny whose
traits segregated independently of one another, giving F2 phenotypes in the ratio 9:3:3:1.
The traits of interest here are plant height (T (tall) and t (short) alleles) and seed color
(Y (yellow) and y (green) alleles).
Law of independent assortment
Chromosomes from different homologous chromosome pairs separate independently of one
another during the production of gametes.
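Independent assortment means that an F1 double heterozygote (TtYy) produces four gamete types in equal proportions, and pairing them at random yields the 9:3:3:1 ratio. The sketch below assumes the two loci are unlinked and that T and Y are fully dominant; the function names are illustrative.

```python
from collections import Counter
from itertools import product

def gametes(genotype):
    """All gametes of a two-locus genotype such as 'TtYy' (one allele per locus)."""
    height_alleles, color_alleles = genotype[:2], genotype[2:]
    return [h + c for h, c in product(height_alleles, color_alleles)]

def phenotype(gamete_1, gamete_2):
    """Phenotype of the zygote formed by two gametes, with T and Y dominant."""
    height = "tall" if "T" in (gamete_1[0], gamete_2[0]) else "short"
    color = "yellow" if "Y" in (gamete_1[1], gamete_2[1]) else "green"
    return (height, color)

f1_gametes = gametes("TtYy")  # TY, Ty, tY, ty in equal proportions
f2 = Counter(phenotype(g1, g2) for g1, g2 in product(f1_gametes, repeat=2))

for classes, n in f2.most_common():
    print(classes, n)
```

Each trait taken alone still shows 3:1 (12 tall : 4 short, 12 yellow : 4 green); the 9:3:3:1 ratio is simply the product of two independent 3:1 ratios.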
Population genetics is the application of Mendel's laws and other genetic principles to the study
of variation within and between populations of species.
Mendelian traits are characterized by the inheritance of a single major gene, and phenotypes can
be classified into discrete categories such as tall and short. Mendelian traits are not appreciably
influenced by the environment. Inheritance conforms to segregation ratios in the F2 and backcross
generations.
However, most traits are more complex than Mendelian traits. They are controlled by the action
and interaction of many minor genes (polygenic traits), exhibit continuous variation to show a
wide range of phenotypes, and are strongly influenced by environmental factors. These are
quantitative traits.
Francis Galton (1822-1911) was the pioneer of quantitative genetics. Quantitative genetics is the
study of continuous variation where phenotypes of organisms are measured on a quantitative
scale rather than discrete classes. Examples of quantitative traits are:
Height
Weight
Skin color, etc.
Continuous variation is caused by the joint segregation of multiple minor genes, each having
a small effect on the phenotype. The segregation of any one gene is obscured by the segregation
of the other genes affecting the trait, so individual genes cannot be identified by their
segregation ratios in the F2 and first backcross (BC1) generations. Genetic segregation is further
obscured by environmental effects. The distribution of the phenotypes conforms closely to a
normal distribution.
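A short calculation shows why discrete Mendelian classes blur into a bell-shaped distribution as the number of genes grows. The sketch below rests on simplifying assumptions that are not part of the lecture: unlinked loci, two equally frequent alleles per locus, equal additive effects, and no environmental effect. Under these assumptions the number of trait-increasing ('+') alleles in an F2 individual follows a binomial distribution.

```python
from math import comb

def f2_trait_distribution(n_loci):
    """Expected F2 proportions for each count of '+' alleles (0 .. 2 * n_loci).

    Assumes unlinked loci, two equally frequent alleles per locus, equal and
    additive effects, and no environmental effect.
    """
    n_alleles = 2 * n_loci
    return [comb(n_alleles, k) / 2 ** n_alleles for k in range(n_alleles + 1)]

for n_loci in (1, 2, 6):
    dist = f2_trait_distribution(n_loci)
    print(n_loci, "loci:", len(dist), "phenotypic classes")
    for k, p in enumerate(dist):
        # Crude text histogram: one '#' per 1/60 of the population.
        print(f"  {k:2d} {'#' * round(p * 60)}")
```

With one locus the classes are the familiar 1:2:1; with six loci there are already 13 classes, and adding environmental variation would smooth the steps between them further.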
Quantitative geneticists study the inheritance of individual differences in phenotypic
measurements. Though Mendelian ratios cannot be applied, the inheritance of quantitative traits
depends on genes subject to the same laws of transmission displayed by qualitative differences.
Quantitative genetics is therefore an extension of Mendelian genetics.
Because ratios cannot be observed, single progenies do not provide enough information.
The unit of study is made up of larger groups of individuals: populations of many progenies. The
trait to be measured is given a score, not classified into discrete groups.
Most traits of importance to agriculture are quantitative.
Although many quantitative genetic predictions can be made without understanding the
underlying genetic details, advances in molecular genetics have facilitated identifying gene
regions harboring variants affecting quantitative traits, and in some cases cloning the relevant
loci. Emerging technologies for assessing genome-wide population genetic variation for
thousands of individuals (by microarray analysis) bring the promise of using population genetic
principles to rapidly map genes and variants affecting quantitative traits, including susceptibility
to common diseases, in many organisms.
Sources of variation
Mutation
Recombination
Migration
Transposable elements
Mutation
Mutation is the process by which the nucleotide sequence of DNA changes, through point
mutations or chromosome rearrangements such as duplication, inversion or translocation.
Most mutations that affect function are deleterious. Mutations in somatic cells are not
inherited, but they can lead to cell death or to cancer. Mutations in the germline are passed on
to the next generation and are the ultimate source of the variation on which evolution acts.
These are the mutations relevant to population and quantitative genetics. Mutations that
segregate in populations are called polymorphisms.
Mutations occur at random and can vary in their effect. They may be neutral with no phenotypic
expression, or cause variations to an individual's phenotype, which may range from small-scale
to large-scale. It is important that not too many mutations occur in a single DNA molecule at a
time. For a cell or an organism to be able to evolve through time, the base sequence of its DNA
must be capable of change.
Sources of mutation
Mutations arise from errors in DNA replication and repair under the following circumstances:
- Substitution of an incorrect base
- Accidental insertion or deletion of a base in the daughter strand
- An inefficient DNA repair mechanism that does not faithfully correct the error
- DNA damaged by a mutagenic agent
There are two principal mechanisms of mutation. In the first, a base is chemically altered
(by a chemical or radiation) in a way that gives it new hydrogen-bonding properties and thus
causes a different base to be incorporated upon replication. In the second, an error is made
during replication itself, as in the substitutions, insertions and deletions listed above. In both
cases the new sequence must persist so that progeny cells will have the new sequence (i.e. the
change must be heritable).
In order to ensure cell survival, the mutation rate must be kept low. The following mechanisms
keep mutation rates low:
- The hydrophobic core formed by the DNA bases in the interior of the double helix reduces
their accessibility to attacking molecules
- The cell has evolved several repair mechanisms for correcting alterations or replication
errors
The repair systems are not completely efficient, so mutations still occur at a rate that is very
low but useful in an evolutionary sense.
Kinds of Mutation
1. POINT MUTATION
This is a change in only a single base pair from the wild type.
A point mutation may be a:
- Base substitution
- Base insertion
- Base deletion
Polymorphisms
DNA polymorphisms are DNA sequences that vary between related genomes. They usually lie
outside genes. The single nucleotide polymorphism (SNP) is the ultimate molecular marker: SNPs
do not have to be associated with a restriction site or a specific PCR primer. Techniques to
follow SNPs are still evolving. In certain applications, one can simultaneously follow thousands
of polymorphisms in a single experiment.
The reading frame following a nonsense mutation is still the same, unlike with a frameshift
mutation.
Other types of mutations lead to sequences that are repeated in the genome. The repeated
sequences can be arranged tandemly in one location, or dispersed throughout the genome.
Examples of tandemly repeated sequences are microsatellites (also called simple sequence
repeats, or SSRs) and minisatellites. Microsatellites are simple sequences of two, three or four
nucleotides that are repeated 10-100 times. The number of repeats can vary substantially among
individuals, leading to polymorphism in repeat copy number. Minisatellites, also called VNTRs
(variable number of tandem repeats), are similar to microsatellites in that a core sequence is
tandemly repeated many times at a single location, but the repeated sequence is more
complicated, containing 10-100 base pairs. The high amount of polymorphism in numbers of
microsatellite and minisatellite repeats makes them ideal for mapping genes in pedigrees and for
individual identification.
Microsatellite markers
• Microsatellite markers were developed in 1989
• Dinucleotide repeat: CA CA CA CA
• Trinucleotide repeat: CAG CAG CAG CAG
• Tetranucleotide repeat: TAGC TAGC TAGC TAGC
• The number of repeats can vary substantially among individuals, leading to
polymorphism in repeat copy number
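Polymorphism in repeat copy number can be illustrated by counting tandem copies of a motif in a sequence. The sketch below uses a regular expression to find the longest uninterrupted run of a motif; the two allele sequences and their flanking bases are made up for the example, and real genotyping would size PCR products rather than scan raw sequence.

```python
import re

def longest_repeat_run(sequence, motif):
    """Return the largest number of uninterrupted tandem copies of motif."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)

# Hypothetical alleles at one dinucleotide (CA) microsatellite locus,
# with the same flanking sequence but different repeat numbers.
allele_1 = "GGTT" + "CA" * 12 + "AGGC"
allele_2 = "GGTT" + "CA" * 15 + "AGGC"

for name, allele in (("allele 1", allele_1), ("allele 2", allele_2)):
    print(name, "CA repeats:", longest_repeat_run(allele, "CA"))
```

The two alleles differ in length by six base pairs (three CA units), which is the kind of size difference that makes microsatellite alleles distinguishable on a gel.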
3. TRANSPOSABLE ELEMENTS
Dispersed repetitive sequences are typically transposable elements (TEs), or selfish genetic
elements, which can replicate by jumping to different genomic locations. Barbara McClintock
was the first scientist to predict that transposable elements, mobile pieces of the genetic material
(DNA), were present in eukaryotic genomes.
• She performed her work on corn and specifically followed seed color phenotypes.
• Later, other TEs were found in Drosophila, yeast, and bacteria.
[Figure: Barbara McClintock, 1902-1992, 1983 Nobel Laureate in Physiology or Medicine]
There are many different families of TEs, or simply transposons, distinguished by their size,
structure and mechanism of transposition. TEs are broadly classified as retrotransposons
(Class I transposable elements) or DNA transposons (Class II transposable elements).
Retrotransposons
These transposable elements replicate themselves in a genome via transposition through an RNA
intermediate. Retrotransposons are abundant in plants, where they are often a principal
component of nuclear DNA.
They make up 49-78% of the maize genome and about 42% of the human genome.
The replication and transposition processes occur rapidly to increase the copy numbers of
elements and thereby can increase genome size. The following steps occur:
1. Retrotransposons copy themselves into RNA
2. The RNA is converted back to DNA by a reverse transcriptase, which the retrotransposon
encodes
3. The DNA is integrated back to the genome.
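The three steps above amount to a copy-and-paste mechanism: the original element stays in place and a new copy is added, so the genome grows. The toy sketch below represents the genome as a string and is only a cartoon of the process. It models transcription as replacing T with U on the sense strand, and the element sequence, genome and insertion position are all made up.

```python
def retrotranspose(genome, element, insert_at):
    """Copy-and-paste: the original element stays, a new copy is inserted."""
    if element not in genome:
        raise ValueError("the element must already be present in the genome")
    rna = element.replace("T", "U")   # 1. the element is copied into RNA
    cdna = rna.replace("U", "T")      # 2. reverse transcription back to DNA
    return genome[:insert_at] + cdna + genome[insert_at:]  # 3. integration

element = "GATTACAGG"                  # hypothetical element sequence
genome = "CCCC" + element + "CCCCCCCC"
new_genome = retrotranspose(genome, element, insert_at=17)

print("copies before:", genome.count(element), "length:", len(genome))
print("copies after: ", new_genome.count(element), "length:", len(new_genome))
```

Each round adds one copy and lengthens the genome by the length of the element, which is how repeated retrotransposition can increase genome size.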
Types of retrotransposons
There are two sub-types, the first being those having
(A) a long terminal repeat at each end (LTR retrotransposons); the DNA at the
insertion site is duplicated. They range from ~100 bp to over 5 kb in size. These encode
reverse transcriptase and are similar to retroviruses.
LTR retrotransposons are in turn divided into two broad categories, which differ in the size of
the duplication:
1. copia elements, which generate 5-bp direct duplications
2. gypsy elements in Drosophila, which create 4-bp direct duplications
• Molecular analysis has determined that many of the classical Drosophila mutations
are the result of transposon insertion.
• Insertion does not always eliminate the function of the gene product, but can instead
change its function enough to result in another phenotype.
filling in by DNA polymerase) followed by a series of inverted repeats important for the TE
excision by transposase. Cut-and-paste TEs may be duplicated if their transposition takes place
during S phase of the cell cycle. Such duplications at the target site can result in gene
duplication, which plays an important role in evolution.
Examples of DNA transposons include the Ac/Ds elements of maize, the first transposable
element system to be described; and the P element of Drosophila, which has been harnessed as
an important transformation vector. The genome copy number of TEs can vary from less than 10
to over one million, such as the Alu TE (a SINE) in humans. There can be polymorphism for
both the total number of transposons and individual insertion sites. Segmental duplications
(also called low copy repeats) are also interspersed sequences, consisting of highly homologous
segments of 1-400 kb located in multiple genomic regions.
In addition to changes in state and copy number, mutations can also result in a change in gene
order. Inversions occur when a segment of the genome within a chromosome is reversed, and can
be polymorphic in natural populations. Translocations occur when a segment of the genome is
deleted from one region of the genome and re-inserted in another – either on the same or a
different chromosome. Translocations may involve a single DNA segment, or be reciprocal
translocations of two different segments.
Effects of Mutation
The consequences of mutations depend on where they occur. Eukaryotic protein-coding genes
are typically split into several exons separated by non-coding introns, with non-coding 5' and 3'
regulatory sequences. Adjacent genes may be separated by stretches of sequence with no known
function, or genes may be nested in the same or opposite orientation within other genes.
Because of the degeneracy of the genetic code, point mutations in coding regions can be silent
(or synonymous) if the mutation does not change the amino acid encoded by the codon.
Non-synonymous mutations change the codon either to one that specifies a different amino acid
(missense mutations) or to a stop codon, leading to a truncated protein (nonsense mutations).
Insertions and deletions in coding regions can alter the reading frame, which can lead to a
non-functional protein (frameshift mutation), or produce a potentially altered protein if the
insertion or deletion is an in-frame multiple of three bases. Similarly, TE insertions in coding
regions can result in non-functional proteins or proteins with novel function. Mutations in
regulatory regions could alter the timing, efficiency or tissue-specific patterns of gene
expression, or alter the number and size of transcripts by changing splice sites.
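The categories of coding-region mutation described above can be illustrated by translating a reference and a mutant sequence and comparing the results. The sketch below builds the standard genetic code from its conventional TCAG ordering. The reference sequence and the `classify` helper are made up for illustration, and the classification is deliberately simple (for instance, it does not treat loss of a stop codon as a separate case).

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons in TCAG order; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def classify(reference, mutant):
    """Classify a mutant coding sequence relative to the reference."""
    if (len(mutant) - len(reference)) % 3 != 0:
        return "frameshift"
    if len(mutant) != len(reference):
        return "in-frame insertion/deletion"
    ref_protein, mut_protein = translate(reference), translate(mutant)
    if mut_protein == ref_protein:
        return "synonymous"
    if len(mut_protein) < len(ref_protein):
        return "nonsense"
    return "missense"

reference = "ATGGCTGAAAAATGA"  # hypothetical gene: Met-Ala-Glu-Lys-stop
mutants = {
    "GCT -> GCC": "ATGGCCGAAAAATGA",
    "GAA -> GTA": "ATGGCTGTAAAATGA",
    "GAA -> TAA": "ATGGCTTAAAAATGA",
    "delete one base": "ATGGCGAAAAATGA",
    "delete one codon": "ATGGCTAAATGA",
}
for label, mutant in mutants.items():
    print(f"{label:18s} {classify(reference, mutant)}")
```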
In population genetic models, the only mutational effect that matters is the effect on reproductive
fitness. Mutations may have no effect on fitness and be selectively neutral; have slight to strong
deleterious effects and cause reduced fitness or even lethality; or be advantageous and favored
by natural selection. In quantitative genetic models, the effect of the mutation on the measured
value of the trait, as well as the effect on fitness, matters. The trait can be an external measure of
the organism's phenotype (counts, linear dimensions, weights, performance indices) or an
intermediate endophenotype (transcript, metabolite or protein abundance).
Recombination
Recombination is another source of variation in a population. It occurs when two homologous
chromosomes exchange some of their genetic material, producing two chromosomes that are
genetically distinct from the parental chromosomes. Recombination increases the amount of
genetic diversity in the population by generating new combinations of alleles at different loci.
The further apart two points are on the chromosome, the more recombination there is between
them. Because the amount of recombination increases with distance along the chromosome, we
can obtain relative positions for loci on a genetic map.
Recombination, r, is quantified by the ratio of the number of recombinant gametes to the total
number of gametes produced by one generation of meiosis. If r = 0.5, one-half of the gametes
produced in each meiosis are recombinant, and one half are non-recombinant, as expected by
Mendelian segregation of chromosomes. For example, in a cross of individuals containing two
unlinked mutations at two different loci, with free recombination (r = 0.5) all four gamete types
will be produced by the F1 progeny. With complete linkage (for example, two different mutations
of the same nucleotide), recombination cannot separate the mutations and only two gamete types
will be produced by the F1 progeny. Thus, one can estimate r by counting the number of
recombinant and non-recombinant gametes in a dihybrid cross.
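Estimating r from a dihybrid cross is a matter of counting. The sketch below assumes gametes from an F1 double heterozygote whose parental (non-recombinant) gamete types are AB and ab; all counts are made up for illustration.

```python
def recombination_fraction(gamete_counts, parental_types):
    """r = number of recombinant gametes / total number of gametes."""
    total = sum(gamete_counts.values())
    recombinant = sum(n for gamete, n in gamete_counts.items()
                      if gamete not in parental_types)
    return recombinant / total

parental = {"AB", "ab"}

# Hypothetical gamete counts under three degrees of linkage.
linked = {"AB": 430, "ab": 420, "Ab": 70, "aB": 80}
unlinked = {"AB": 250, "ab": 250, "Ab": 250, "aB": 250}
complete_linkage = {"AB": 500, "ab": 500}

r_linked = recombination_fraction(linked, parental)
r_unlinked = recombination_fraction(unlinked, parental)
r_complete = recombination_fraction(complete_linkage, parental)
print(r_linked, r_unlinked, r_complete)
```

The three cases span the possible range: r = 0 with complete linkage, r = 0.5 with free recombination, and an intermediate value for partially linked loci.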
There are two metrics of the distance between genetic variants. One is the physical distance,
measured in base pairs; the other is the recombination distance, measured in centiMorgans (cM).
For recombination fractions r < 0.1, the relationship between recombination fraction and
recombination distance is approximately linear; i.e. r = 0.1 corresponds to about 10 cM. For
recombination fractions r > 0.1, unobserved double cross-overs mean the relationship between
the real recombination distance and the observed fraction of recombinants is not linear. Several
mapping functions have been derived to account for this nonlinearity, the most well-known of
which is Haldane's mapping function: $m = -\tfrac{1}{2}\ln(1 - 2r)$, where m is the true map
distance in Morgans.
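Haldane's mapping function in its standard form is m = -(1/2) ln(1 - 2r), with inverse r = (1/2)(1 - exp(-2m)); it assumes that cross-overs occur independently of one another (no interference). The sketch below evaluates it for a few recombination fractions to show where the linear approximation starts to fail; the function names are chosen for this example.

```python
from math import exp, log

def haldane_map_distance(r):
    """Map distance in Morgans from recombination fraction r (0 <= r < 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -0.5 * log(1 - 2 * r)

def haldane_recombination_fraction(m):
    """Expected recombination fraction for a map distance of m Morgans."""
    return 0.5 * (1 - exp(-2 * m))

for r in (0.01, 0.05, 0.10, 0.20, 0.40):
    cm = 100 * haldane_map_distance(r)
    print(f"r = {r:.2f} -> {cm:.1f} cM (linear approximation: {100 * r:.0f} cM)")
```

For small r the two columns nearly agree, while for large r the map distance grows much faster than r, and r itself never exceeds 0.5 however far apart the loci are.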
Recombination is not constant across the genome of any species, or between species. For
example, in Drosophila, recombination only occurs in females, and is lower at the ends than in
the middle of chromosomes. In humans, recombination is highly heterogeneous across the genome,
with 'hot' and 'cold' spots. In other words, the relationship between r and physical distance is
not constant. Thus, we need to consider the recombination landscape as well as mutation in models
predicting the fate of new mutations in populations.
Recombination is common between homologous chromosomes. However, when there are repeat
sequences in the genome, recombination can occur between non-homologous repeats, such as
between tandemly repeated sequences or genes, or between interspersed repeats such as TEs and
segmental duplications. Non-homologous recombination between tandem repeats leads to
increases and decreases in copy number of the repeats, accounting for the high variance in copy
number of microsatellites and minisatellites. Non-homologous recombination between segmental
duplications or between TEs of the same family can lead to duplications, deletions and
inversions.
October 2006, intending to award $10 million to "the first Team that can build a device and use it
to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one
error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the
genome, and at a recurring cost of no more than $10,000 (US) per genome."
For many applications we do not need to know complete genome sequences, but do require large
numbers of SNP genotypes for large numbers of individuals. Again, there are several
competing methods for high-throughput SNP genotyping, including Illumina's GoldenGate
assay, Affymetrix oligonucleotide microarray-based SNP chips, and pyrosequencing (a
re-sequencing by synthesis method). Lower throughput methods include identifying restriction
fragment length polymorphisms (RFLPs). With this simple method one amplifies a genomic
region of interest using PCR, digests the sample with a restriction endonuclease, and runs a
gel to deduce the sizes of the resulting fragments. Two fragments will be visualized if the
restriction site is present in the sample, but there will be only one larger fragment if there is
a SNP that alters the restriction site. Much of the population genetics literature from the
1960s-1980s was based on differences in the mobility of enzymes (allozymes) when subjected to
gel electrophoresis. This method only detects variants in coding regions that alter the charge
of the protein.
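The logic of the RFLP assay described above can be sketched as a simulated digest. The example below assumes a made-up 200-bp PCR product and uses the EcoRI recognition site (GAATTC, cut after the first G) purely as an illustration; the function name and sequences are hypothetical, and only one strand is modeled.

```python
def digest(amplicon, site, cut_offset=1):
    """Return fragment lengths after cutting at every occurrence of site."""
    fragments, start = [], 0
    position = amplicon.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(cut - start)
        start = cut
        position = amplicon.find(site, position + 1)
    fragments.append(len(amplicon) - start)
    return fragments

ECORI_SITE = "GAATTC"  # EcoRI cuts after the first G: G^AATTC

# Hypothetical 200-bp amplicons that differ by a single base in the site.
allele_with_site = "A" * 120 + "GAATTC" + "C" * 74
allele_with_snp = "A" * 120 + "GAGTTC" + "C" * 74

print("site present:", digest(allele_with_site, ECORI_SITE))
print("site altered:", digest(allele_with_snp, ECORI_SITE))
```

The allele carrying the intact site yields two fragments whose lengths sum to the amplicon length, while the allele with the SNP remains a single uncut fragment, which is the banding difference read off the gel.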
The importance of copy number variation (CNV) is becoming increasingly apparent. The most
common population-scale method for detecting CNV is hybridization of genomic DNA to cDNA or
oligonucleotide microarrays. Tiling arrays, which represent the entire genome, are particularly
informative.
Detecting Phenotypic Variation
The first step to assess variation in phenotypes is to define the trait or traits of interest, and the
second is to devise an assay to obtain a measure of the trait. For example, human height can
be measured in centimeters, mouse weight in grams, and Drosophila bristle number by a hair
count. Aspects of cognitive performance can be assessed by the amount of time it takes to learn
or to complete a task. Many traits can be simply scored as present or absent; or, in the case of
human diseases, affected or not affected. It is very important to define the trait precisely. “Body
size” could refer to a linear measure of stature, weight, or a combined function of the two, and
will change as the individual ages or as food intake or another aspect of the environment is
altered. Precise definitions are particularly important in studies of human disease and psychiatric
disorders, where it is possible that not all patients are diagnosed using the same criteria.
One can also measure variation in molecular phenotypes, such as transcript, protein and
metabolite abundance. There are several commercially available platforms for quantifying
transcript abundance on a genome-wide scale for organisms with complete genome sequences.
More generally, quantitative PCR techniques can be used to assess the relative abundance of
any transcript of interest. However, the abundance of transcripts, proteins and metabolites is
highly dynamic and changes with stage of development/age, tissue, and the environment to
which an individual is exposed, so precise definition of the conditions under which the
measurements are taken is critical.