METAGENOMIC ANALYSIS OF
MARIANA TRENCH SEDIMENT
SAMPLES

Vera Maria Leal Carvalho






MSc IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY














So long, and thanks for all the fish!





Abstract
The emergence of metagenomics has allowed the study of the microbial community at the
deepest point on Earth: the Challenger Deep in the Mariana Trench. Its extreme
conditions (a water depth of almost 11 km, a temperature of 2.5 degrees Celsius and a
pressure of around 112 MPa) made a comprehensive study of its microecology very
difficult while such studies depended on culturing methods. This metagenomic analysis
included taxonomic identification and exploration of part of the functional potential of
the genomic sequences of the community, generated by Illumina next-generation
sequencing, therefore bypassing the need for cloning. Here we show that
Proteobacteria clearly dominate this environment but that there is no obvious
correlation between sediment depth and community composition. Moreover, the
abundance of enzymes involved in oxidative phosphorylation in all samples suggests
aerobic activity within the sediment. This supports the finding that oxygen is consumed
along the depth of the sediment. An extensive description of all the data generated was
not feasible; however, as soon as the data become available, they will be open to the
public to search for features of interest.

Keywords: metagenomics, Mariana Trench, Challenger Deep, extreme environments,
Illumina, community structure, energy metabolism


The advent of metagenomics made it possible to study the microbial community at the
deepest point on Earth: the Challenger Deep in the Mariana Trench. The extreme
conditions found there (a water column of almost 11 km, a temperature of 2.5 °C and a
pressure of around 112 MPa) made an in-depth study of its microecology very difficult
to carry out, given the earlier dependence on methods involving laboratory cultures.
This metagenomic analysis included taxonomic identification and a survey of the
functional potential of the genomic sequences of the community, generated with
Illumina's next-generation sequencing technology, thereby overcoming the need for
cloning. This work demonstrates that Proteobacteria clearly dominate this habitat, but
that there is no unequivocal correlation between sediment depth and community
composition. In addition, the abundance of enzymes involved in oxidative
phosphorylation in all samples suggests aerobic activity in the sediment. This
supports the finding that oxygen is consumed along the depth of the sediment. An
extensive description of all the data generated was prohibitive; however, once the data
become public, they will be accessible to anyone who wishes to investigate them
according to their interests.

Keywords: metagenomics, Mariana Trench, Challenger Deep, extreme environments,
Illumina, community structure, energy metabolism
















Introduction


1. Introduction
1.1. Background
The Challenger Deep in the Mariana Trench is one of the most extreme environments
on Earth, with a depth of almost 11 km, a temperature of 2.5 degrees Celsius[1] and a
pressure of around 111.79 MPa, calculated assuming a mean sea water density of
1036 kg/m³ [2] and a gravitational acceleration of 9.81 m/s² [3]. It is located at roughly
11° 22.1' N, 142° 25.8' E [1] (Figure 1).
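
The pressure quoted above can be checked with the hydrostatic relation P = ρ·g·h. The sketch below is only an illustration: the density and gravity are the values given in the text, while the round depth of 11,000 m is an assumption made here (the text says "almost 11 km"), and atmospheric pressure and the compressibility of sea water are ignored.

```python
# Hydrostatic (gauge) pressure of a water column: P = rho * g * h
RHO_SEAWATER = 1036.0  # mean density of sea water in kg/m^3 (value from the text)
G = 9.81               # gravitational acceleration in m/s^2 (value from the text)
DEPTH_M = 11_000.0     # ASSUMPTION: round depth in metres ("almost 11 km")


def hydrostatic_pressure_mpa(rho, g, depth_m):
    """Return the pressure exerted by a water column, in megapascals."""
    return rho * g * depth_m / 1e6  # Pa -> MPa


pressure = hydrostatic_pressure_mpa(RHO_SEAWATER, G, DEPTH_M)
print(f"{pressure:.2f} MPa")
```

Under these assumptions the result rounds to the 111.79 MPa stated in the text.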


Figure 1 - Challenger Deep location (11° 22.1' N, 142° 25.8' E)

It has been the subject of human curiosity for many years[4]; so far, however, there has
been no detailed study of its microecology. With the emergence of metagenomics, it is
now finally possible to unravel which organisms live at the deepest point on Earth, and
what they are doing.
The term "metagenomics" was first used in 1998 by Jo Handelsman[5], in an effort to
study the microflora as a unit, the metagenome, instead of addressing each type of
organism individually.
Previously, it was thought necessary to study morphology, physiology and pathogenic
characters in order to classify a microorganism[6], but since Woese in
1977 pioneered the use of 16S sequences for classification[7], sequence comparison
has been widely used and accepted as valid to do so.
With the development of sequencing technology, one can now take a sample directly
from the environment, extract its DNA, sequence it, and infer the microbial
composition of the sample, therefore overcoming the bottleneck of growing pure
cultures in the laboratory. This method enables the discovery of new forms of life that
are not cultivable, and the assessment of the genetic richness and diversity, as well as
the metabolic potential, of a community of organisms as a whole[8].
Metagenomic analysis can accordingly be defined as the identification, and functional
and evolutionary analysis, of the genomic sequences of a community of organisms[9].
Moreover, the paradigm that most of the microbial world was already known gave way
to the acknowledgement that there is still a lot to learn and to explore[10]. Discovering
new forms of life in extreme environments can provide insights into a variety of topics,
such as the biogeochemical activities that occur in the ocean[11] and the impact that
human activity may have on them[12].
1.2. Metagenomic Analysis
To analyse a metagenome, several steps are typically involved, from the experimental
design to sharing the data[13]. Firstly, one has to obtain the samples. Ideally, true
replicates should be taken as well. Afterwards, one may filter the samples to target a
(more or less) specific group of organisms[14].
The following step is sequencing. There are several technologies to sequence DNA,
each with its own advantages and weaknesses. The Mariana Trench sediment
samples were sequenced using Illumina's paired-end assay. Its advantage is that it is
cheap and generates a large number of reads per run; however, the reads are very
short (50 to 250 bp), which can pose a problem for assembly and comparison, since it
becomes more difficult to assign a read unequivocally to a template[15].
Illumina's technology consists of attaching random DNA fragments to a surface,
amplifying them to form clusters of the same sequence, and then using them as
templates for repeated cycles of polymerase-directed single-base extension.
Single-base extension is guaranteed by using 3'-modified nucleotides labeled with a
removable fluorophore. After the identity of the incorporated nucleotide is determined
by laser-induced excitation of the fluorophores, these, as well as the side arm (which
prevents the incorporation of more than one nucleotide per cycle), are removed. The
images of the fluorescent signal are
used to determine the sequence (each nucleotide is attached to a fluorophore of a
different colour) and its quality, defined as the likelihood of each call being correct[16].
The paired-end option means that a fragment is sequenced in both directions (5' to 3'
and 3' to 5'), which helps the assembly[17].
Assembly is the next step in the Metagenomic analysis pipeline, although it is
sometimes skipped. Its usefulness is debatable[18], given that the accuracy of the
assemblers is difficult to assess, since there is currently no microbial community with
known reference sequences to compare to[13].
The main problem with assembly is that it distorts abundance information, since
abundant fragments will be considered as belonging to the most abundant species,
when in reality they may be present in rare species[18]. Moreover, some fragments
may be incorrectly discarded as mistakes or repeats, or joined up in the wrong places
or orientations[19]. Nonetheless, if these setbacks are taken into account during the
analysis, assembly can be advantageous, as it produces longer sequences that are
easier to annotate unambiguously.
Gene prediction and annotation usually follow. The first classifies the sequences as
coding or non-coding, and the second tries to find homology between the coding
sequences and known sequences stored in databases. Once again, these methods
have their own flaws, mainly because they are based on models, hence failing to
predict exceptions that can occur in the biological world.
Typically, the final step is to share the sequence data on public databases together
with the metadata. Contextual data is necessary to compare with other datasets,
essentially making the sequences useful for the database and the scientific community.
By complying with standard languages for metadata, such as MIMS, the data becomes
more accessible, as complex searches will retrieve more information[20].
The drawbacks surrounding metagenomic analysis are not at all surprising if one
considers that it is still a very young field. A quick search on Web of Knowledge[21] for
the total number of articles featuring the term "metagenome" or "metagenomics" gives
a very clear picture of how novel this field is, and of how much data has been
produced (Figure 2).
With the popularity of the field expanding, a multitude of tools has been developed,
making the choice of which one to use far from trivial. There is still no evident
consensus on which is the best tool for each step (not even for sequencing), so the
errors in the
data are most likely directly related to the flaws in each method, which means that a
different set of methods will yield a different set of errors.


Figure 2 - Total number of metagenomics articles published since 1998

Given this explosion of data, an obvious question concerns its applicability. One
example is bioremediation[22]. The process of biodegradation encompasses several
metabolic pathways which, when considered on a community basis instead of an
individual basis, lead to a global understanding of what is essential and what is
superfluous, easing the design of such a system. Moreover, the industry sector is
always in search of novel enzymes and processes[23].
Even so, metagenomics tends to be regarded as exploratory research, raising more
questions than it answers. Accordingly, the aim of this project was not only to answer
some simple questions, but also to raise some more, and hopefully to encourage
further studies in this environment.
1.3. Objective
This project dealt solely with the analysis of the raw data output by sequencing. The
goal was to assess the taxonomic distribution of the community along the depth of
the sediment and to explore its metabolic potential, using the most adequate tools.
However, with the publication of the article [1], which included these sediment samples,
the focus turned to assessing whether the data generated by this analysis would
corroborate the published data, namely to confirm the O2 consumption throughout the
sediment depth.
[Figure 2 chart data. Title: Number of articles with the keyword metagenome* or
metagenomic*. x-axis: Year; y-axis: Total number of articles. Values as plotted: 1, 3, 4,
7, 19, 52, 110, 225, 383, 637, 1,046, 1,689, 6,538, 9,381, 13,106, 13,853]
1.4. Structure of the thesis
This report is organized in five chapters. The Introduction presents background
information on the sampling site, as well as on the technology and pipeline typically
employed in this kind of study. The second chapter describes the methodology, with an
explanation of each method and its output. Chapter three presents the selected
results, and chapter four discusses them critically. The last chapter contains the
conclusion, with final remarks and some future directions for similar studies.



















Methods


2. Methods
Seven out of eight sequenced samples from different depths were analysed. Each
sample corresponded to a 5 cm depth interval, from 0-5 cm (sample 1) to 35-40 cm
(sample 8). The data were cleaned using the Biopieces collection of tools[24], the
reads were assembled using IDBA-UD[25][26], and both the generated contigs and the
clean reads were submitted to MG-RAST[27].
2.1. Sample collection, preparation and sequencing
The upstream methodology was carried out by collaborators and consisted of the
following: DNA was extracted from 5 g of sediment collected at different sediment
depths from the Challenger Deep, Mariana Trench, at 10,900 m, using the PowerMax
soil DNA isolation kit (MoBio Laboratories, CA, USA). Eight DNA samples, each
corresponding to a different depth, were sent to BGI-Shenzhen (China) for library
preparation and sequencing. Since one of the samples (sample 4: 15-20 cm) did not
contain enough DNA for library preparation (as reported in the Sample Test Report of
BGI), 14 FASTQ files were received back, 2 for each of the seven samples: one with
the forward and another with the reverse reads.
2.2. Preliminary Analysis
The initial number of reads on each sample ranged from around 41 million to almost 84
million (Table 1).
Table 1 - Percentage of reads removed with the cleaning
Sample   Number of raw reads    Number of clean reads   Percentage of reads
         (forward + reverse)                            removed
1        58,814,066             43,569,096              25.921%
2        45,533,260             18,717,708              58.892%
3        47,163,190             34,419,612              27.020%
5        83,968,942             36,784,382              56.193%
6        61,751,498             43,891,904              28.922%
7        46,894,236             33,786,396              27.952%
8        41,030,848             28,508,242              30.520%
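
The last column of Table 1 follows directly from the two read counts. As a small check, the sketch below recomputes it as the share of raw reads that did not survive cleaning; the function name is only illustrative, and the counts are copied from the table.

```python
def percent_removed(raw_reads, clean_reads):
    """Percentage of raw reads discarded by the cleaning step."""
    return 100.0 * (raw_reads - clean_reads) / raw_reads


# (raw reads, clean reads) per sample, copied from Table 1
read_counts = {
    1: (58_814_066, 43_569_096),
    2: (45_533_260, 18_717_708),
    3: (47_163_190, 34_419_612),
    5: (83_968_942, 36_784_382),
    6: (61_751_498, 43_891_904),
    7: (46_894_236, 33_786_396),
    8: (41_030_848, 28_508_242),
}

for sample, (raw, clean) in read_counts.items():
    print(f"sample {sample}: {percent_removed(raw, clean):.3f}% removed")
```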


2.3. Biopieces
Sub-quality residues were removed from the ends of the reads, as were the adaptors
used in the sequencing. Reads shorter than 30 bp were also excluded, in addition to
reads with a local mean score under 15, to overcome errors propagated from cycle to
cycle[28]. The cleaning removed from about 26% to almost 59% of the reads in the
samples (Table 1). The Biopieces script used is shown in Figure 3.
The tool trim_seq removes residues from the ends of sequences whose quality,
according to the scores in the FASTQ file, is below the specified minimum (in this case
25). The flag -l makes sure that residues are removed until a stretch of at least 3
residues with good quality is found, to avoid premature termination due to an isolated
good-quality residue at the end. This step is necessary to overcome the effect of
phasing and pre-phasing. These are caused by incomplete removal of the 3'
terminators and fluorophores, by sequences missing an incorporation cycle, or by the
incorporation of nucleotides without effective 3' terminators[28]. This means that each
cycle's signal is affected by the signal of the previous and subsequent cycles, hindering
the detection of the right base.
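
To make the end-trimming rule concrete, here is a minimal Python sketch of the behaviour described above. It is not the Biopieces implementation: the function names are hypothetical, and the only parameters taken from the text are the minimum quality (25) and the minimum stretch length (3).

```python
def first_good_run(quals, min_qual, min_len):
    """Index where the first run of `min_len` scores >= `min_qual` starts, or None."""
    run = 0
    for i, q in enumerate(quals):
        run = run + 1 if q >= min_qual else 0
        if run == min_len:
            return i - min_len + 1
    return None


def trim_ends(seq, quals, min_qual=25, min_len=3):
    """Trim both ends of a read up to the outermost runs of good-quality bases."""
    start = first_good_run(quals, min_qual, min_len)
    if start is None:  # no acceptable stretch: the whole read is trimmed away
        return "", []
    # Scan the reversed scores to find where the last good run ends.
    end = len(quals) - first_good_run(quals[::-1], min_qual, min_len)
    return seq[start:end], quals[start:end]


read = "ACGTACGTAC"
scores = [10, 30, 12, 30, 30, 30, 28, 9, 31, 5]
print(trim_ends(read, scores))
```

In this toy read, the isolated good scores at positions 1 and 8 do not stop the trimming, because each is followed or preceded by a poor score; only the stretch of four good bases in the middle is kept.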
read_fastq -i - |
trim_seq -m 25 -l 3 |
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |
clip_adaptor |
merge_pair_seq |
grab -e 'SEQ_LEN_LEFT >= 30' |
grab -e 'SEQ_LEN_RIGHT >= 30' |
mean_scores -l |
grab -e 'SCORES_MEAN_LOCAL >= 15' |
split_pair_seq |
write_fastq -x
Figure 3 - Cleaning script

The tool find_adaptor searches the reads for the given adaptors (forward:
ACACGACGCTCTTCCGATCT and reverse: AGATCGGAAGAGCACACGTC), or for
partial adaptors at least 6 residues long (flag -l for the forward and -L for the reverse
adaptor). By default, a percentage of the adaptor length is allowed for mismatches,
insertions, and deletions (10%, 5% and 5%, respectively).
Once the adaptors are found, clip_adaptor removes them, based on the keys output by
find_adaptor: ADAPTOR_POS_RIGHT, ADAPTOR_POS_LEFT, and ADAPTOR_LEN_LEFT.
The tool merge_pair_seq merges paired sequences, as long as they are interleaved.
Sequence names must be either in Illumina 1.3/1.5 format, with a trailing /1 or /2, or in
Illumina 1.8 format, containing 1: or 2:. The sequence names should also match.
The tool grab is an improved version of Unix's grep. It selects records that match a
pattern, a regular expression, or a numerical evaluation. In this case, reads with a
length of at least 30 bp were selected by examining the keys SEQ_LEN_LEFT and
SEQ_LEN_RIGHT, output by merge_pair_seq.
Afterwards, mean_scores -l was used to calculate the local mean scores: instead of
calculating the mean as the sum of all the scores over the length of the string, it
calculates means over a sliding window and returns the smallest value.
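
The difference between a global and a local mean score can be illustrated with a short Python sketch. This is not the Biopieces code: the function name is hypothetical and the window size of 5 is an assumption chosen for the example, since the text does not state the window used by mean_scores.

```python
def local_mean_min(scores, window=5):
    """Smallest mean quality over all sliding windows of `window` scores.

    ASSUMPTION: window=5 is illustrative only; reads shorter than the
    window fall back to the plain mean.
    """
    if not scores:
        raise ValueError("empty score list")
    if len(scores) <= window:
        return sum(scores) / len(scores)
    window_sum = sum(scores[:window])
    smallest = window_sum
    for i in range(window, len(scores)):
        # Slide the window one position to the right.
        window_sum += scores[i] - scores[i - window]
        smallest = min(smallest, window_sum)
    return smallest / window


# A read with a short stretch of poor quality in the middle.
quality = [30] * 10 + [5] * 5 + [30] * 10
print(sum(quality) / len(quality))  # global mean
print(local_mean_min(quality))      # local (worst-window) mean
```

With the cut-off of 15 used in this project, such a read would pass a filter on the global mean but fail the filter on the local mean, which is the reason for using the latter.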
Finally, split_pair_seq was used to split the sequences merged with merge_pair_seq.
To speed up the process, this script was run with GNU parallel[29] with the -L 8 option,
which takes two records at a time (each record has 4 lines), to avoid breaking the
pairs. GNU parallel allows Biopieces to be executed in parallel using multiple CPUs on
multiple cores and servers[24].
The merge_pair_seq and split_pair_seq tools were created within this project to
overcome speed and memory problems caused by the use of order_pairs. The latter
interleaves the sequences, as long as their names follow the Illumina 1.5 or 1.8
scheme, and adds a key stating whether the read is paired or orphan. It should be
used after the trimming and grabbing steps, and subsequently only the paired reads
should be grabbed.
Example of a script using order_pairs (Figure 4):

read_fastq -i - |
trim_seq -m 25 -l 3 |
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |
clip_adaptor |
grab -e 'SEQ_LEN >= 30' |
mean_scores -l |
grab -e 'SCORES_MEAN_LOCAL >= 15' |
order_pairs |
grab -p pair -k ORDER |
write_fastq -x
Figure 4 - order_pairs script


2.4. Assembly
The decision to assemble smaller reads into larger contigs was based on the premise
that "the longer the sequence information, the better is the ability to obtain accurate
information". Annotation becomes easier, since longer sequences yield more
information to compare with the databases; the same applies to the classification of
DNA fragments. Assembly also raises the confidence in accuracy, compensating for
the lower quality of single reads by having multiple reads cover the same segment of
information, provided that the coverage is high enough[13]. The IDBA-UD algorithm is
based on de Bruijn graphs, adapted for metagenomic sequencing data with uneven
sequencing depths[26].
In a de Bruijn graph, every possible (k-1)-mer is assigned to a node, and a node has a
directed edge to another one if there is some k-mer whose prefix is the former and
whose suffix is the latter. This means that the edges in the graph represent all possible
k-mers. The idea is to find an Eulerian cycle[30], which spells the shortest superstring
that contains each k-mer exactly once (Figure 5).
Because each edge is visited only once, the time to run the algorithm is roughly
proportional to the number of edges[31]. This is unlike finding a Hamiltonian cycle[32],
where each node is visited only once, which is an NP-complete problem[33] (meaning
that the time to solve it is believed to grow very quickly with the size of the input).
Applied to genome assembly, all the k-mers are the ones present in the reads
generated by sequencing[31], so ideally, the Eulerian cycle would generate the
genome. In practice this method cannot be applied directly, since there are some
assumptions that do not hold. Firstly, we cannot be sure that all the k-mers present in
the genome were generated; secondly, k-mers are not error-free; thirdly, each k-mer is
very likely to appear more than once in the genome; and lastly, we should not assume
that the genome is a single circular chromosome.
To deal with the first problem, instead of trying to assemble the reads, the algorithm
breaks them into smaller k-mers which are more likely to be representative of the whole
genome. To handle errors, the assembler chooses the path which is supported by
higher coverage. Regarding repeats, if a k-mer appears more than once in the
genome, it shall be represented by several edges connecting the same two nodes.
Finally, rather than searching for an Eulerian cycle, if the algorithm is modified to
search for an Eulerian path[34], then it is not required to end in the same node where it
began[31].
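
The Eulerian-path idea can be sketched in a few lines of Python. This is a toy illustration on error-free k-mers from a single short sequence, not the IDBA-UD algorithm: the function names and the example sequence are hypothetical, and the path search uses Hierholzer's algorithm, a standard way of finding an Eulerian path.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; each k-mer adds one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Visit every edge exactly once (Hierholzer's algorithm)."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_degree[node] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in graph if len(graph[n]) - in_degree.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the nodes of a path into the sequence they spell."""
    return path[0] + "".join(node[-1] for node in path[1:])


sequence = "ATGGCGTGCA"  # hypothetical toy sequence
kmers = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
assembled = spell(eulerian_path(de_bruijn_graph(kmers)))
print(assembled)
```

Because the (k-1)-mers TG and GC each occur twice in this toy sequence, more than one Eulerian path exists, so the spelled sequence contains exactly the same k-mers as the input but is not guaranteed to be identical to it. This is the repeat problem mentioned above in miniature.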

Figure 5 - Genome assembly strategies: Hamiltonian and Eulerian cycles[31].

The main problem with metagenomic data is that species with different abundances will
be represented by reads with uneven depth, and this cannot be disregarded as, e.g.,
an amplification bias. IDBA-UD solves this problem by adopting variable thresholds on
the multiplicity of the k-mers, making them dependent on the sequencing depth of the
neighboring contigs. The idea is that contigs with much lower sequencing depths than
their neighbors are more likely to be erroneous[26]. Moreover, IDBA-UD uses
paired-end information, namely the distance between the pairs, to solve issues such as
missing k-mers and repeats.
The assembler IDBA-UD was first used with the default minimum contig size setting
(200 bp), which yielded N50 values from 3545 to 9240. N50 is the length of the shortest
contig in the smallest set of largest contigs whose combined length represents no
less than 50% of the assembly. It is one of the common assembly statistics[35].
A higher minimum contig size of 500 bp was then chosen, which improved the N50
values, so these contigs were uploaded to the MG-RAST server[27]. The complete
analysis of both assemblies (using the Biopiece analyze_assembly) is shown in
Table 2 and Table 3, including N50, contig length (maximum, minimum, mean and
total) and the number of contigs.
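
As an illustration of the N50 statistic, the following sketch computes it from a list of contig lengths. It is a generic implementation of the definition, not the code of analyze_assembly, and the contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths, total 54: 10 + 9 + 8 = 27 reaches 50%.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Raising the minimum contig size removes many short contigs, which lowers the total length only slightly but shifts the 50% point towards longer contigs; this is why the N50 values increase with the 500 bp setting.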
Table 2 - Analysis of the assembly with minimum contig size 200 bp.
Sample              1            2           3           5           6           7           8
N50                 3545         4705        4397        6430        8726        7136        9240
Max length (bp)     614,662      215,848     466,951     305,081     551,041     305,025     452,197
Min length (bp)     200          200         200         200         200         200         200
Mean length (bp)    1439         1956        1651        2206        2068        1822        2234
Total length (bp)   106,683,337  42,414,452  79,943,730  68,959,229  61,784,508  70,790,963  64,146,341
Number of contigs   74,124       21,681      48,418      31,250      29,868      38,839      28,704


Table 3 - Analysis of the assembly with minimum contig size 500 bp.
Sample              1            2           3           5           6           7           8
N50                 14,106       6,199       14,122      8,662       17,856      16,340      16,698
Max length (bp)     614,662      215,848     548,284     305,081     551,037     337,423     551,034
Min length (bp)     504          502         503         501         505         518         503
Mean length (bp)    3,261        3,180       3,384       3,883       4,277       4,235       4,492
Total length (bp)   76,906,252   36,624,750  60,016,272  59,604,646  51,333,136  55,864,260  54,104,132
Number of contigs   23,581       11,514      17,732      15,349      12,000      13,190      12,044

2.5. MG-RAST
MG-RAST[27] uses several bioinformatics tools in its pipeline. Firstly, it filters
sequences based on length, number of ambiguous bases and quality values. All the
contigs from all 7 uploaded samples passed this preprocessing stage.
Then, technical replicates, identified as sequences with identical first 50 base pairs,
are removed in a step called dereplication. Between 0.7% (surface sample) and 2.3%
(sample 7) of the contigs were removed in this step, but no reads were removed. This
can be explained by the use of the same reads for different contigs.
After that, FragGeneScan[36] is used to predict coding regions. This tool is an ab initio
gene-calling algorithm that uses a hidden Markov model for coding and non-coding
regions, and that was developed specially for metagenomes. It includes codon usage
bias, sequencing error models and start/stop codon patterns. A gene is reported if it is
longer than 60 bp, begins with either a start codon or an internal codon of a gene, and
ends with either a stop codon or an internal codon. This way, both complete and partial
genes are predicted. From 29,239 (sample 2) to 63,877 (sample 1) coding sequences were
predicted within the contigs, and from 16,387,405 (sample 2) to 40,199,546 (sample 6)
within the reads.
The sequences output from FragGeneScan are then clustered at 90% identity with
qiime-uclust. QIIME[37] is a software package developed specially for high throughput
amplicon sequencing data, although it also supports metagenomic data. It incorporates
many third party tools, such as UCLUST[38]. This algorithm clusters sequences based
on their similarity, according to a threshold set by the user (or in this case by MG-
RAST). Each cluster is therefore represented by a sequence, and all the sequences in
it should have a similarity higher than the threshold to the sequence representing the
cluster (centroid), and centroids should have similarity below the threshold to the other
centroids. The algorithm starts with no centroids, and each sequence is compared to
the list of centroids and it is either assigned to a cluster or selected as a new centroid.
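
The greedy, centroid-based strategy described above can be sketched as follows. This is a toy version, not UCLUST: the identity function is a hypothetical position-by-position comparison without alignment, and the sequences are invented for the example.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions relative to the longer
    sequence. ASSUMPTION: no alignment is performed, unlike a real tool."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first centroid it matches at or above
    the threshold; otherwise make it a new centroid."""
    clusters = {}  # centroid -> list of member sequences
    for seq in sequences:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


toy_sequences = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAATT"]
for centroid, members in greedy_cluster(toy_sequences).items():
    print(centroid, members)
```

Note that the last sequence is 90% identical to the second one but only 80% identical to the first centroid, so it starts its own cluster: membership is decided against centroids only, and the result depends on the order in which the sequences are processed.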
The centroids and the singletons (unclustered sequences) are then searched using
BLAT[39] against the M5NR protein database. M5NR is a non-redundant protein
database which incorporates data from GO[40], KEGG[41][42], NCBI[43][44],
SEED[45][46], UniProt[47], VBI[48] and eggNOG[49], and has almost 16,000,000
sequences. BLAT builds an index of the database and then scans linearly through the
query sequence, unlike BLAST, which builds an index of the query sequence and then
scans linearly through the database. This makes BLAT faster, since it does not have to
scan through a database of gigabases of sequence but only through a relatively short
query sequence. BLAT, however, loses to BLAST in terms of sensitivity, since it needs
an exact or nearly exact match to find a hit, making it suitable mostly for closely related
species. The alignment identified between 25,261 (sample 2) and 50,816 (sample 1)
protein features in the contigs, and from 4,859,593 (sample 2) to 10,890,942 (sample
6) in the reads. The number of protein features in the reads proved to be correlated at
98% with the number of dereplicated reads, using Pearson's coefficient:





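$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Where $x_i$ and $y_i$ are the number of dereplicated reads and the number of protein
features of sample $i$, $n$ is the number of samples, and $\bar{x}$ and $\bar{y}$ are
the averages of the number of dereplicated reads and the number of protein features,
respectively.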
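
For reference, Pearson's coefficient can be computed with a few lines of plain Python. The sketch below is generic, and the input values are made up for illustration; they are not the read and feature counts of this project.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical values: y grows linearly with x, so r is 1 (up to rounding).
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```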
The results of the search against the M5NR database were retrieved for each of the
samples, at 90% identity, and mapped onto metabolic pathway maps based on KEGG
data, using KEGG Mapper[41][42] and iPath[50][51].
Besides being the input for the dereplication step, the filtered sequences are
pre-screened to identify ribosomal sequences at 70% identity, and these are then
clustered using UCLUST at 97% identity. The clusters are then searched for similarity
against the M5RNA database (Greengenes[52], SILVA[53] and RDP[54]) using
BLAT[39]. This alignment identified between 36 (sample 2) and 72 (sample 1) rRNA
features in the contigs, whilst in the reads the number ranged from 19,014 (sample 2)
to 38,639 (sample 1).
MG-RAST also automatically calculated the alpha diversity of each sample, to
summarize the distribution of species-level annotations in that sample, using the
following equation:

$$ \alpha = 10^{-\sum_{i=1}^{m} p_i \log_{10} p_i} $$

Where $p_i$ is the ratio of the number of annotations for species $i$ to the total number
of annotations, and $m$ is the total number of different species annotations, using all
the annotation source databases incorporated by MG-RAST[27].
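
A small sketch may help to read this measure. It assumes that the alpha diversity is the antilog of the Shannon index of the species-level annotations, which is consistent with the definitions of p and m above and with diversity values reported in units of species; the function name and the counts are hypothetical. The result does not depend on the base of the logarithm, as long as the same base is used for the antilog.

```python
from math import exp, log


def alpha_diversity(annotation_counts):
    """Antilog of the Shannon index (an effective number of species).

    ASSUMPTION: this mirrors the measure described in the text; it is not
    taken from the MG-RAST source code.
    """
    total = sum(annotation_counts)
    shannon = -sum(
        (count / total) * log(count / total)
        for count in annotation_counts
        if count > 0
    )
    return exp(shannon)


print(alpha_diversity([10, 10, 10, 10]))  # four equally abundant species
print(alpha_diversity([97, 1, 1, 1]))     # one dominant species
```

With four equally abundant species the value is 4; when one species dominates, the value drops towards 1 although the number of species is unchanged.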
Based on the abundances of each species in each sample (using the reads), the R
package vegan[55] was used to calculate the beta diversity, as suggested in the
manual[56]. It was calculated pairwise between samples, using the Sørensen index of
dissimilarity:

$$ \beta_{sor} = \frac{b + c}{2a + b + c} $$

Where $a$ is the number of species shared by the two samples, and $b$ and $c$ are
the numbers of species unique to each sample; as well as the widely known
Whittaker's species turnover:

$$ \beta_w = \frac{\gamma}{\bar{\alpha}} - 1 $$

Where $\gamma$ is the total number of species in the collection of samples (gamma
diversity), and $\bar{\alpha}$ is the average richness per sample. Subtraction of one
guarantees that $\beta_w = 0$ means that there are no excess species or no
heterogeneity between samples.
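
Both beta diversity measures are simple to compute from species lists. The sketch below uses Python sets instead of the R package vegan that was actually used; the function names and the species labels are hypothetical.

```python
def sorensen_dissimilarity(species_1, species_2):
    """(b + c) / (2a + b + c) for two sets of species."""
    a = len(species_1 & species_2)  # shared species
    b = len(species_1 - species_2)  # unique to the first sample
    c = len(species_2 - species_1)  # unique to the second sample
    return (b + c) / (2 * a + b + c)


def whittaker_turnover(samples):
    """gamma / mean(alpha) - 1 for a collection of species sets."""
    gamma = len(set().union(*samples))
    mean_richness = sum(len(s) for s in samples) / len(samples)
    return gamma / mean_richness - 1


sample_a = {"sp1", "sp2", "sp3"}
sample_b = {"sp2", "sp3", "sp4"}
print(sorensen_dissimilarity(sample_a, sample_b))
print(whittaker_turnover([sample_a, sample_b]))
```

For two identical samples both measures are 0; for two samples without any shared species the Sørensen dissimilarity is 1.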
Rarefaction curves were also automatically generated. The idea behind them is to
repeatedly re-sample the pool of reads at random, plotting the average number of
species represented by 1, 2, ..., N reads[57].
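
The re-sampling idea can be sketched as follows. This is an illustration of the principle only, not the MG-RAST implementation: the function name, the number of repeats, the step size and the toy community are all assumptions made for the example.

```python
import random


def rarefaction_curve(read_labels, step=10, repeats=20, seed=0):
    """Average number of distinct species found in random subsamples
    of increasing size, drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for size in range(step, len(read_labels) + 1, step):
        richness = [
            len(set(rng.sample(read_labels, size))) for _ in range(repeats)
        ]
        curve.append((size, sum(richness) / repeats))
    return curve


# Hypothetical community: each read is labelled with the species it was assigned to.
reads = ["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 5
for size, mean_species in rarefaction_curve(reads):
    print(size, mean_species)
```

A curve that is still rising at the largest subsample size indicates that additional reads would probably reveal additional species.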
Krona[58] was used to view the percentage of reads with predicted proteins and
ribosomal RNA genes annotated based on all the databases.






















Results


3. Results
The reads and contigs submitted to MG-RAST were automatically assigned unique
IDs, as indicated in Table 4.

Table 4 - IDs of the reads and contigs submitted to MG-RAST
Sample   Reads       Contigs
1        4525786.3   4518922.3
2        4525785.3   4518923.3
3        4525784.3   4518924.3
5        4525781.3   4518925.3
6        4525782.3   4518926.3
7        4525783.3   4518927.3
8        4525787.3   4518928.3

To compare the abundances among the samples, the results were extracted from the
reads, whereas to assess presence or absence of a defined feature, the contigs
results were retrieved.

3.1. Taxonomic Hits Distribution
Extracting the best-hit classification from the reads compared to M5NR, using a
maximum e-value of 1e-5, a minimum identity of 90%, and a minimum alignment length
of 15 aa, it is clear that Bacteria, and more specifically Proteobacteria, largely dominate
in all 7 samples (Figure 6 and Figure 20).
In terms of class, Betaproteobacteria seem to comprise 78% of Proteobacteria in
sample 1, unlike the other samples, where Gammaproteobacteria seem to be the
dominant class (Figure 7). Sample 3 shows a larger representation of
Alphaproteobacteria compared to the other samples.
Most of the Gammaproteobacteria in sample 1 are Pseudoalteromonas and in sample
2 Pseudomonas, whereas from sample 3 to sample 8 other genera, namely
Marinobacter, become just as dominant (see Figure 21 to Figure 27).

Figure 6 - Taxonomic distribution of the reads at the domain level



Figure 7 - Taxonomic distribution of the reads at the class level (Proteobacteria)

In terms of α-diversity, calculated using the reads against all the annotation databases
used by MG-RAST, sample 1 shows the highest value: 430.83 species. The other
samples have diversities between 184.10 species (sample 6) and 252.14 species
(sample 7). The values of α-diversity for all the samples are shown in Table 5.

Table 5 - α-diversity

            α-diversity
Sample 1    430.83
Sample 2    213.47
Sample 3    232.97
Sample 5    210.42
Sample 6    184.10
Sample 7    252.14
Sample 8    240.39

The β-diversity value, using Whittaker's species turnover, was 1.181461, and the
pairwise comparisons are shown in Table 6 and Figure 8.
Table 6 - Pairwise β-diversity

     Sample 1   Sample 2   Sample 3   Sample 5   Sample 6   Sample 7
2    0.422489
3    0.353043   0.319049
5    0.382264   0.283298   0.292187
6    0.364884   0.30632    0.292165   0.288654
7    0.360278   0.333708   0.314927   0.307126   0.287154
8    0.365677   0.324216   0.309876   0.306393   0.292684   0.294254


Figure 8 - β-diversity bar chart

A correlation analysis of the distance between samples and their β-diversity shows no
relation between them (Figure 28).
The rarefaction curves of annotated species richness for all the samples rise quickly at
first and then become flatter, but without leveling off towards an asymptote (Figure 9).
This means that more species would probably have been found if there had been
more reads. Even so, these results allow a reasonable estimate of the community
structure.


Figure 9 - Rarefaction curve of annotated species richness

The Principal Coordinate Analysis (PCoA) of the reads of the 7 samples, annotated
against the M5RNA database using the Bray-Curtis measure (chosen for showing a
robust relationship with ecological distance[59]), an e-value of 1e-5 and a minimum
identity of 97%, does not show a clear trend; neither does the analysis using the M5NR
database with a minimum identity of 90% (Figure 10). See Figure 29 and Figure 30 for
the heatmaps with the same thresholds and values normalized to the size of the
samples.

Figure 10 - PCoA using the M5RNA database (left) and the M5NR database (right)

Nevertheless, when the samples are compared with metagenomes from 1) the gut
microbiota of 91 pregnant women of varying pre-pregnancy BMIs and gestational
diabetes status, and their infants
(http://metagenomics.anl.gov/linkin.cgi?project=265), and 2) activated sludge from 2
full-scale tannery wastewater treatment plants
(http://metagenomics.anl.gov/linkin.cgi?project=922), the Mariana Trench samples
clearly cluster together in a very distinct group. As these two environments are
expected to be very different and quite different, respectively, from the deep-sea
Mariana Trench samples, this is a good indicator of the reliability of the latter. See, for
example, Figure 11 for a comparison against the M5NR database at 90% minimum
identity and an e-value of 1e-5.


Figure 11 - PCoA of the reads against the M5NR database. Red - Mariana Trench; Blue - Activated
Sludge; Green - Gut Microbiota











3.2. Functional Category Hits Distribution
Comparing the number of features annotated from the reads with those annotated from
the contigs, it is noticeable that the latter provide a much more reliable source for
annotation, as seen from the range of e-values, which was expected. See, for example,
sample 7 in Figure 12 and Figure 13. More features were predicted from the reads,
but there were also more reads than contigs.


Figure 12 - Number of features in the reads of sample 7 annotated by the different databases


Figure 13 - Number of features in the contigs of sample 7 annotated by the different databases

Moreover, taking again sample 7 as an example, only 50.7% of the predicted protein
features in the reads could be annotated with similarity to a protein of known function,
whereas 84.9% of the predicted protein features of the contigs were annotated.

Of all the databases used to compare the protein sequences generated from the
contigs, SEED Subsystems[45] had the highest number of annotations (Figure 12 and
Figure 13). It is worth noting, however, that each database holds a different type
of annotation data, hence the different number of hits. Since the tools used to analyse
the pathways (KEGG Mapper and iPath) rely on the KEGG database, the focus was put on
the functional hierarchy given by KEGG Orthology (KO)[41][42].
Comparing the reads to KO, using a maximum e-value of 1e-5, a minimum identity of
90% and a minimum alignment length of 15, on average 53% (±0.03) of the reads with
predicted protein functions were annotated as belonging to the Metabolism category.
Of those, 14% (±0.05) belong to Energy metabolism.
In sample 1, roughly 100% of the Energy metabolism reads correspond to oxidative
phosphorylation; in the rest of the samples, this value lies around 77% (±0.07).
In fact, the F-type H+-transporting ATPase subunit beta (K02112), involved in both
oxidative phosphorylation (Figure 14) and photosynthesis (Figure 15), is the second
most abundant hit in sample 1 (out of 54 hits), with an average identity of 91.06% and
an average e-value of -6.14.

Figure 14 - Oxidative Phosphorylation, pathway ko00190.


In sample 2, K02112 appears in 11th place (out of 239 hits) with an abundance of 9187,
together with F-type H+-transporting ATPase subunit alpha (K02111) in 10th place with
an abundance of 9307.
In sample 3, K02112 has an abundance of 9513 and K02111 of 9758, appearing in 8th
and 6th place, respectively, when sorting by abundance. For sample 5 the values are
13405 for K02112 and 12764 for K02111 (10th and 12th). Sample 6 has even higher
abundances for K02112 and K02111: 16632 and 16260 (8th and 9th most abundant). In
samples 7 and 8 they appear in 5th and 6th place, out of 108 and 115 hits, with
abundances of 11492 and 11257, and 10691 and 10294. In all samples from the
second to the seventh, these subunits have an average identity above 91.5%.
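
The ranking used above (position of a KO when hits are sorted by abundance) can be sketched as below. Only the two abundances for K02111 and K02112 come from the text (sample 3); every `placeholder_*` entry is hypothetical and exists solely to pad the table so that the ranks can be computed, and the function name is an assumption for the example.

```python
def rank_by_abundance(abundances):
    """Map each KO identifier to its 1-based rank, most abundant first."""
    ordered = sorted(abundances.items(), key=lambda item: item[1], reverse=True)
    return {ko: position for position, (ko, _count) in enumerate(ordered, start=1)}

# K02111 and K02112 carry the sample 3 abundances quoted in the text.
# Every "placeholder_*" entry is hypothetical and only pads the table.
sample_3 = {
    "placeholder_1": 15000,
    "placeholder_2": 14000,
    "placeholder_3": 13000,
    "placeholder_4": 12000,
    "placeholder_5": 11000,
    "K02111": 9758,
    "placeholder_6": 9600,
    "K02112": 9513,
    "placeholder_7": 9000,
}
ranks = rank_by_abundance(sample_3)
```

With this padding, `ranks["K02111"]` and `ranks["K02112"]` reproduce the 6th and 8th places reported for sample 3.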

Figure 15 - Photosynthesis, pathway ko00195.

Using the contigs, with the same settings, only K02112 was found, and only in samples
2 and 8. However, the average alignment length of the hits was 356.55 and 332.22,
respectively, whereas for the reads it was 27.67 and 27.57. Nevertheless, other hits
also classified as belonging to Oxidative Phosphorylation were found, such as
NADH-quinone oxidoreductase subunit (K13380 and K13378), NADH-quinone oxidoreductase
subunits (K00338 and K00340), F-type H+-transporting ATPase subunit c (K02110),
V-type H+-transporting ATPase subunits (K02118 and K02122), cytochrome c oxidase
assembly protein subunit 17 (K02260), nucleosome-remodeling factor 38 kDa subunit
(K11726), cytochrome o ubiquinol oxidase subunit III (K02299), cytochrome o ubiquinol
oxidase operon protein cyoD (K02300) and NAD(P)H-quinone oxidoreductase subunit
5 (K05577).
To address, with some degree of confidence, whether alternative energy metabolism
processes occur in any of the samples, the contigs results were further explored.
Indeed, all samples contained contigs involved in Methane Metabolism (Figure 16).

Figure 16 - Methane metabolism, pathway ko00680. In red the enzymes found in the samples.

In addition, contigs from samples 2, 5 and 8 matched hits from nitrogen metabolism
(Figure 17). In all three samples, nitric oxide reductase subunit B (K04561)
(EC:1.7.2.5) was present, which is involved in denitrification (nitrate to nitrogen).
Sample 2 also had a nitrogenase iron protein NifH (K02588) (EC:1.18.6.1), a
nitrogenase molybdenum-cofactor synthesis protein NifE (K02587) and a nitrogen
fixation protein NifX (K02596).

Figure 17 - Nitrogen metabolism, pathway ko00910. In red the enzymes found in samples 2, 5 and 8.

Finally, the map generated with iPath (Figure 18) gives a general overview of the
pathways present when combining all samples. It is worth noting that photosynthesis
appears mapped; however, this is most likely a misleading mapping, since the enzyme
identified is an F-type H+-transporting ATPase, which is involved in photosynthesis but
also in oxidative phosphorylation, as mentioned earlier.
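
One way to catch this kind of misleading mapping is to flag pathways whose only evidence comes from KOs that also belong to another pathway. The sketch below is an illustration, not a feature of iPath or KEGG Mapper: the function name is hypothetical, and the KO-to-pathway table is deliberately restricted to memberships mentioned in the text, so it is not a complete KEGG mapping.

```python
from collections import defaultdict

def pathways_with_only_shared_evidence(ko_to_pathways, observed_kos):
    """Return pathways whose every observed KO also maps to another pathway."""
    support = defaultdict(list)
    for ko in observed_kos:
        for pathway in ko_to_pathways.get(ko, ()):
            support[pathway].append(ko)
    flagged = []
    for pathway, kos in support.items():
        # A pathway is suspicious if none of its supporting KOs is specific to it.
        if all(len(ko_to_pathways[ko]) > 1 for ko in kos):
            flagged.append(pathway)
    return sorted(flagged)

# Memberships limited to those stated in the text (illustrative, incomplete):
# K02112 is in oxidative phosphorylation (ko00190) and photosynthesis (ko00195),
# K00338 was classified under oxidative phosphorylation,
# K04561 under nitrogen metabolism (ko00910).
ko_to_pathways = {
    "K02112": ("ko00190", "ko00195"),
    "K00338": ("ko00190",),
    "K04561": ("ko00910",),
}
observed = ["K02112", "K00338", "K04561"]
flagged = pathways_with_only_shared_evidence(ko_to_pathways, observed)
```

Here photosynthesis (ko00195) is flagged because its only supporting KO is shared with oxidative phosphorylation, which has independent support from K00338.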

Figure 18 - Metabolic map of the seven samples

4. Discussion
Marine sediments, and in particular hadal trenches, receive substantial deposition of
microbes and organic matter from the upper water layer[1], and provide a matrix of
complex nutrients and solid surfaces for microbial growth[60]. However, the low
temperature and the extreme hydrostatic pressure demand a certain degree of
adaptation from the organisms inhabiting such an environment. Even so, there seems
to be a fairly high diversity along the sediment depth, as seen in Table 5 and Figure 9.
Proteobacteria are the largest and most metabolically diverse group of Bacteria. They
are all gram-negative and divide into 5 classes: alpha, beta, gamma, delta and
epsilon[61]. The dominance of Gammaproteobacteria is in accordance with a study
from the Pacific Arctic Ocean, where the temperatures are also very low[62], and
partly with the study of sediments at 4000 m depth in the Pacific Ocean, where not
only Gammaproteobacteria but also Alphaproteobacteria dominate the community[63].
Intriguingly, the outer layer of an actively venting black-smoker chimney from a
hydrothermal vent field on the Juan de Fuca Ridge[64] is also dominated by
Gammaproteobacteria, even though its temperature lies above 310 °C.
The PCoA graphs group together samples that exhibit similar abundance profiles, in
terms of taxonomy or function. However, when comparing the seven samples, there is
no obvious trend in the community with the depth of the sediment (Figure 10).
Nevertheless, the fact that this project's samples group together, and very distinctly
from the samples of other projects, is a good indicator that this environment has its
own community structure.
The poor correlation between β-diversity and distance between samples also supports
the PCoA results (Table 6 and Figure 28). This means that the difference in microbial
community composition (as defined in [65]) is most likely due to factors other than
depth. It is possible that, under such high pressure, a few centimeters of sediment do
not really make a difference in the community structure. Alternatively, there might have
been some mixing of the communities during the sampling process.
It should be noted, however, that the fact that the community as a whole does not show
a shift along the depth of the sediment does not exclude the hypothesis that some
taxa correlate with it.
Regarding the decision to assemble, the range of e-values of the features annotated
with the different databases, as well as the percentage of predicted protein features
that were annotated, should provide some degree of confidence in the assembly.
The high number of hits to the oxidative phosphorylation pathway supports the
predictions from [1] that there is intensified O2 consumption within the sediment, unlike
in the sediment of the reference site (6000 m of water depth), where the microbial
activity has reduced rates. These predictions were supported by measurements of the O2
concentration throughout the depth of the sediment. Attenuation in the O2
concentration reflects higher rates of its consumption[1] (Figure 19), which is consistent
with the presence of genes involved in aerobic respiration in all the samples.


Figure 19 - Oxygen micro-profiles at 6,018 m water depth (a) and at Challenger Deep (b)[1].

Even though oxidative phosphorylation dominates the energy metabolism processes,
methane and nitrogen metabolism still play a part in the community's energetic
potential.
Normally, methanogenesis is associated with anoxic environments; still, it is known that
even in oxic environments, anoxic microenvironments can form where
methanogenesis takes place[61].
Once more, the predictions that there is intensified mineralization mediated by the
prokaryotic community at Challenger Deep[1] are supported by the contigs with
homology to features involved in nitrogen metabolism.
Finally, the misleading mapping of the ATPase (Figure 18) should be taken as an
example that care and critical judgement are fundamental when using automated tools.

5. Conclusion
This study was a first description of both the community structure and its functional
potential in the Mariana Trench, an environment made unique by its extreme conditions.
The amount of data generated made it prohibitive to describe it in full. Energy
metabolism was selected for this thesis, since it was interesting to compare with the
results from [1]. The finding that there are enzymes involved in the oxidative
phosphorylation pathway in all 7 samples supported the published measurements of
oxygen consumption throughout the sediment.
A taxonomic and/or functional gradient along the depth of the sediment was expected,
but that does not seem to be the case. Further investigation of this matter would help
to determine whether there are any signature taxa of the depth.
The data used in the study will soon be publicly available on MG-RAST, and therefore
accessible for additional investigation. However, in the future, it would be sensible to
sample with true replicates and take a broader range of environmental
measurements, to make the data more comparable to other studies. It would also
be interesting to take samples from sediments at other depths along the Challenger
Deep, to assess whether the community uniqueness is due to the extreme depth or to
the overall conditions at that site.
To conclude, it is probable that in 10 years' time, with the development of new tools or
the improvement of the existing ones, all of these results will be proved inaccurate.
However, the aim of this thesis was neither to develop new tools nor to compare the
existing ones, but to use them wisely and understand their purpose for this analysis.
Hence, the argument of this project is that with this set of tools, this is the product.


References
[1] R. N. Glud, F. Wenzhöfer, M. Middelboe, K. Oguri, R. Turnewitsch, D. E. Canfield, and
H. Kitazato, High rates of microbial carbon turnover in sediments in the deepest oceanic
trench on earth, Nature Geoscience, vol. 6, no. 4, pp. 284-288, Mar. 2013.
[2] R. Pawlowicz, Key physical variables in the ocean: temperature, salinity, and density,
Nature Education Knowledge, vol. 4, no. 4, p. 13, 2013.
[3] The international system of units. Bureau International des Poids et Mesures, 2006.
[4] R. A. Lutz and P. G. Falkowski, Ocean science. A dive to Challenger Deep., Science
(New York, N.Y.), vol. 336, no. 6079, pp. 301-2, Apr. 2012.
[5] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman, Molecular
biological access to the chemistry of unknown soil microbes: a new frontier for natural
products, Chemistry & Biology, vol. 5, no. 10, pp. R245-R249, Oct. 1998.
[6] Society of American Bacteriologists, Bergey's manual of determinative bacteriology, 1st
ed. Baltimore, Williams & Wilkins Co., 1923.
[7] C. R. Woese and G. E. Fox, Phylogenetic structure of the prokaryotic domain: The
primary kingdoms, Proceedings of the National Academy of Sciences, vol. 74, no. 11,
pp. 5088-5090, Nov. 1977.
[8] P. Hugenholtz and G. W. Tyson, Microbiology: metagenomics., Nature, vol. 455, no.
7212, pp. 481-3, Sep. 2008.
[9] E. M. Glass and F. Meyer, Analysis of metagenomics data, in Bioinformatics for High
Throughput Sequencing, N. Rodríguez-Ezpeleta, M. Hackenberg, and A. M. Aransay,
Eds. New York, NY: Springer New York, 2012, pp. 219-229.
[10] J. Handelsman, Metagenomics: application of genomics to uncultured
microorganisms., Microbiology and molecular biology reviews: MMBR, vol. 68, no. 4,
pp. 669-85, Dec. 2004.
[11] X. Hao and T. Chen, OTU analysis using metagenomic shotgun sequencing data,
PLoS ONE, vol. 7, no. 11, p. e49785, Nov. 2012.
[12] V. Iverson, R. M. Morris, C. D. Frazar, C. T. Berthiaume, R. L. Morales, and E. V.
Armbrust, Untangling genomes from metagenomes: revealing an uncultured class of
marine Euryarchaeota., Science (New York, N.Y.), vol. 335, no. 6068, pp. 587-90, Feb.
2012.
[13] T. Thomas, J. Gilbert, and F. Meyer, Metagenomics - a guide from sampling to data
analysis., Microbial informatics and experimentation, vol. 2, no. 1, p. 3, Jan. 2012.
[14] J. C. Wooley, A. Godzik, and I. Friedberg, A primer on metagenomics., PLoS
computational biology, vol. 6, no. 2, p. e1000667, Feb. 2010.
[15] N. Whiteford, N. Haslam, G. Weber, A. Prügel-Bennett, J. W. Essex, P. L. Roach, M.
Bradley, and C. Neylon, An analysis of the feasibility of short read sequencing., Nucleic
acids research, vol. 33, no. 19, p. e171, Jan. 2005.

[16] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown,
K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J. Carter,
R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J.
Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger, L. J.
Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, I. M. J. Rasolonjatovo, M. T.
Reed, R. Rigatti, C. Rodighiero, M. T. Ross, A. Sabot, S. V Sankar, A. Scally, G. P.
Schroth, M. E. Smith, V. P. Smith, A. Spiridou, P. E. Torrance, S. S. Tzonev, E. H.
Vermaas, K. Walter, X. Wu, L. Zhang, M. D. Alam, C. Anastasi, I. C. Aniebo, D. M. D.
Bailey, I. R. Bancarz, S. Banerjee, S. G. Barbour, P. A. Baybayan, V. A. Benoit, K. F.
Benson, C. Bevis, P. J. Black, A. Boodhun, J. S. Brennan, J. A. Bridgham, R. C. Brown,
A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M.
Chiara E Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D.
Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore,
S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V Fuentes Fajardo, W. Scott
Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E.
Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M.
M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V Ivanov, M. Q. Johnson, T.
James, T. A. Huw Jones, G.-D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P.
Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S.
E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P.
G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B.
Ling Ng, S. M. Novo, M. J. O'Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L.
Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J.
Podhasky, V. J. Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P.
M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J.
Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M.
Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J.
Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L.
Tregidgo, G. Turcatti, S. Vandevondele, Y. Verhovsky, S. M. Virk, S. Wakelin, G. C.
Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M.
E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R.
Durbin, and A. J. Smith, Accurate whole human genome sequencing using reversible
terminator chemistry., Nature, vol. 456, no. 7218, pp. 53-9, Nov. 2008.
[17] W. Zhang, J. Chen, Y. Yang, Y. Tang, J. Shang, and B. Shen, A practical comparison of
de novo genome assembly software tools for next-generation sequencing technologies.,
PloS one, vol. 6, no. 3, p. e17915, Jan. 2011.
[18] H. Teeling and F. O. Glöckner, Current opportunities and challenges in microbial
metagenome analysis--a bioinformatic perspective., Briefings in bioinformatics, Sep.
2012.
[19] M. Baker, De novo genome assembly: what every biologist should know, Nature
Methods, vol. 9, no. 4, pp. 333-337, Mar. 2012.
[20] P. Yilmaz, R. Kottmann, D. Field, R. Knight, J. R. Cole, L. Amaral-Zettler, J. A. Gilbert, I.
Karsch-Mizrachi, A. Johnston, G. Cochrane, R. Vaughan, C. Hunter, J. Park, N.
Morrison, P. Rocca-Serra, P. Sterk, M. Arumugam, M. Bailey, L. Baumgartner, B. W.
Birren, M. J. Blaser, V. Bonazzi, T. Booth, P. Bork, F. D. Bushman, P. L. Buttigieg, P. S.
G. Chain, E. Charlson, E. K. Costello, H. Huot-Creasy, P. Dawyndt, T. DeSantis, N.
Fierer, J. A. Fuhrman, R. E. Gallery, D. Gevers, R. A. Gibbs, I. San Gil, A. Gonzalez, J. I.
Gordon, R. Guralnick, W. Hankeln, S. Highlander, P. Hugenholtz, J. Jansson, A. L. Kau,
S. T. Kelley, J. Kennedy, D. Knights, O. Koren, J. Kuczynski, N. Kyrpides, R. Larsen, C.
L. Lauber, T. Legg, R. E. Ley, C. A. Lozupone, W. Ludwig, D. Lyons, E. Maguire, B. A.
Methé, F. Meyer, B. Muegge, S. Nakielny, K. E. Nelson, D. Nemergut, J. D. Neufeld, L.
K. Newbold, A. E. Oliver, N. R. Pace, G. Palanisamy, J. Peplies, J. Petrosino, L. Proctor,
E. Pruesse, C. Quast, J. Raes, S. Ratnasingham, J. Ravel, D. A. Relman, S. Assunta-
Sansone, P. D. Schloss, L. Schriml, R. Sinha, M. I. Smith, E. Sodergren, A. Spo, J.
Stombaugh, J. M. Tiedje, D. V Ward, G. M. Weinstock, D. Wendel, O. White, A.
Whiteley, A. Wilke, J. R. Wortman, T. Yatsunenko, and F. O. Glöckner, Minimum

information about a marker gene sequence (MIMARKS) and minimum information about
any (x) sequence (MIxS) specifications., Nature biotechnology, vol. 29, no. 5, pp. 415-20,
May 2011.
[21] Web of Knowledge. [Online]. Available: www.webofknowledge.com.
[22] J. L. Fox, Natural-born eaters., Nature biotechnology, vol. 29, no. 2, pp. 103-6, Feb.
2011.
[23] P. Lorenz and J. Eck, Metagenomics and industrial applications., Nature reviews.
Microbiology, vol. 3, no. 6, pp. 510-6, Jun. 2005.
[24] www.biopieces.org.
[25] Y. Peng, H. Leung, S. Yiu, and F. Chin, IDBA - a practical iterative de Bruijn graph de
novo assembler, in 14th RECOMB 2010, 2010, pp. 426-440.
[26] Y. Peng, H. C. M. Leung, S. M. Yiu, and F. Y. L. Chin, IDBA-UD: a de novo assembler
for single-cell and metagenomic sequencing data with highly uneven depth.,
Bioinformatics (Oxford, England), vol. 28, no. 11, pp. 1420-8, Jun. 2012.
[27] F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A.
Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards, The metagenomics
RAST server - a public resource for the automatic phylogenetic and functional analysis
of metagenomes., BMC bioinformatics, vol. 9, no. 1, p. 386, Jan. 2008.
[28] M. Kircher, U. Stenzel, and J. Kelso, Improved base calling for the Illumina Genome
Analyzer using machine learning strategies., Genome biology, vol. 10, no. 8, p. R83,
Jan. 2009.
[29] O. Tange, GNU Parallel: the command-line power tool | USENIX, ;login: The USENIX
Magazine, pp. 42-47, 2011.
[30] E. W. Weisstein, Eulerian Cycle -- from Wolfram MathWorld. Wolfram Research, Inc.
[31] P. E. C. Compeau, P. A. Pevzner, and G. Tesler, How to apply de Bruijn graphs to
genome assembly., Nature biotechnology, vol. 29, no. 11, pp. 987-91, Nov. 2011.
[32] E. W. Weisstein, Hamiltonian Cycle -- from Wolfram MathWorld. Wolfram Research,
Inc.
[33] E. W. Weisstein, NP-Complete Problem -- from Wolfram MathWorld. Wolfram
Research, Inc.
[34] E. W. Weisstein, Eulerian Path -- from Wolfram MathWorld. Wolfram Research, Inc.
[35] J. R. Miller, S. Koren, and G. Sutton, Assembly algorithms for next-generation
sequencing data., Genomics, vol. 95, no. 6, pp. 315-27, Jun. 2010.
[36] M. Rho, H. Tang, and Y. Ye, FragGeneScan: predicting genes in short and error-prone
reads., Nucleic acids research, vol. 38, no. 20, p. e191, Nov. 2010.
[37] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello,
N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D.
Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M.
Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T.

Yatsunenko, J. Zaneveld, and R. Knight, QIIME allows analysis of high-throughput
community sequencing data., Nature methods, vol. 7, no. 5, pp. 335-6, May 2010.
[38] R. C. Edgar, Search and clustering orders of magnitude faster than BLAST.,
Bioinformatics (Oxford, England), vol. 26, no. 19, pp. 2460-1, Oct. 2010.
[39] W. J. Kent, BLAT--the BLAST-like alignment tool., Genome research, vol. 12, no. 4, pp.
656-64, Apr. 2002.
[40] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K.
Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis,
S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock,
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.,
Nature genetics, vol. 25, no. 1, pp. 25-9, May 2000.
[41] M. Kanehisa and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes., Nucleic
acids research, vol. 28, no. 1, pp. 27-30, Jan. 2000.
[42] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, KEGG for integration and
interpretation of large-scale molecular data sets., Nucleic acids research, vol. 40, no.
Database issue, pp. D109-14, Jan. 2012.
[43] E. W. Sayers, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M.
Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y.
Kapustin, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, I. Mizrachi,
J. Ostell, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, M. Shumway, K. Sirotkin,
A. Souvorov, G. Starchenko, T. A. Tatusova, L. Wagner, E. Yaschenko, and J. Ye,
Database resources of the National Center for Biotechnology Information., Nucleic
acids research, vol. 37, no. Database issue, pp. D5-15, Jan. 2009.
[44] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, GenBank.,
Nucleic acids research, vol. 37, no. Database issue, pp. D26-31, Jan. 2009.
[45] R. Overbeek, T. Begley, R. M. Butler, J. V Choudhuri, H.-Y. Chuang, M. Cohoon, V. de
Crécy-Lagard, N. Diaz, T. Disz, R. Edwards, M. Fonstein, E. D. Frank, S. Gerdes, E. M.
Glass, A. Goesmann, A. Hanson, D. Iwata-Reuyl, R. Jensen, N. Jamshidi, L. Krause, M.
Kubal, N. Larsen, B. Linke, A. C. McHardy, F. Meyer, H. Neuweger, G. Olsen, R. Olson,
A. Osterman, V. Portnoy, G. D. Pusch, D. A. Rodionov, C. Rückert, J. Steiner, R.
Stevens, I. Thiele, O. Vassieva, Y. Ye, O. Zagnitko, and V. Vonstein, The subsystems
approach to genome annotation and its use in the project to annotate 1000 genomes.,
Nucleic acids research, vol. 33, no. 17, pp. 5691-702, Jan. 2005.
[46] R. K. Aziz, D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S.
Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman, R. A.
Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D. Pusch, C. Reich, R.
Stevens, O. Vassieva, V. Vonstein, A. Wilke, and O. Zagnitko, The RAST Server: rapid
annotations using subsystems technology., BMC genomics, vol. 9, p. 75, Jan. 2008.
[47] The UniProt Consortium, Reorganizing the protein space at the Universal Protein
Resource (UniProt)., Nucleic acids research, vol. 40, no. Database issue, pp. D71-5,
Jan. 2012.
[48] J. J. Gillespie, A. R. Wattam, S. A. Cammer, J. L. Gabbard, M. P. Shukla, O. Dalay, T.
Driscoll, D. Hix, S. P. Mane, C. Mao, E. K. Nordberg, M. Scott, J. R. Schulman, E. E.
Snyder, D. E. Sullivan, C. Wang, A. Warren, K. P. Williams, T. Xue, H. S. Yoo, C. Zhang,
Y. Zhang, R. Will, R. W. Kenyon, and B. W. Sobral, PATRIC: the comprehensive
bacterial bioinformatics resource with a focus on human pathogenic species., Infection
and immunity, vol. 79, no. 11, pp. 4286-98, Nov. 2011.

[49] S. Powell, D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei,
I. Letunic, T. Doerks, L. J. Jensen, C. von Mering, and P. Bork, eggNOG v3.0:
orthologous groups covering 1133 organisms at 41 different taxonomic ranges., Nucleic
acids research, vol. 40, no. Database issue, pp. D284-9, Jan. 2012.
[50] I. Letunic, T. Yamada, M. Kanehisa, and P. Bork, iPath: interactive exploration of
biochemical pathways and networks., Trends in biochemical sciences, vol. 33, no. 3, pp.
101-3, Mar. 2008.
[51] T. Yamada, I. Letunic, S. Okuda, M. Kanehisa, and P. Bork, iPath2.0: interactive
pathway explorer., Nucleic acids research, vol. 39, no. Web Server issue, pp. W412-5,
Jul. 2011.
[52] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D.
Dalevi, P. Hu, and G. L. Andersen, Greengenes, a chimera-checked 16S rRNA gene
database and workbench compatible with ARB., Applied and environmental
microbiology, vol. 72, no. 7, pp. 5069-72, Jul. 2006.
[53] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O.
Glöckner, The SILVA ribosomal RNA gene database project: improved data processing
and web-based tools., Nucleic acids research, vol. 41, no. Database issue, pp. D590-6,
Jan. 2013.
[54] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-Syed-
Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, The Ribosomal
Database Project: improved alignments and new tools for rRNA analysis., Nucleic acids
research, vol. 37, no. Database issue, pp. D141-5, Jan. 2009.
[55] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O'Hara,
G. L. Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner, vegan: Community
Ecology Package. R package version 2.0-7. 2013.
[56] J. Oksanen, Vegan: ecological diversity.
[57] N. J. Gotelli and R. K. Colwell, Quantifying biodiversity: procedures and pitfalls in the
measurement and comparison of species richness, Ecology Letters, vol. 4, no. 4, pp.
379-391, Jul. 2001.
[58] B. D. Ondov, N. H. Bergman, and A. M. Phillippy, Interactive metagenomic visualization
in a web browser., BMC bioinformatics, vol. 12, p. 385, Jan. 2011.
[59] D. P. Faith, P. R. Minchin, and L. Belbin, Compositional dissimilarity as a robust
measure of ecological distance, Vegetatio, vol. 69, no. 1-3, pp. 57-68, Apr. 1987.
[60] Y. Wang, H.-F. Sheng, Y. He, J.-Y. Wu, Y.-X. Jiang, N. F.-Y. Tam, and H.-W. Zhou,
Comparison of the levels of bacterial diversity in freshwater, intertidal wetland, and
marine sediments by using millions of illumina tags., Applied and environmental
microbiology, vol. 78, no. 23, pp. 8264-71, Dec. 2012.
[61] M. T. Madigan, J. M. Martinko, P. V. Dunlap, and D. P. Clark, Brock Biology of
Microorganisms, 12th ed. Pearson, 2009.
[62] H. Li, Y. Yu, W. Luo, Y. Zeng, and B. Chen, Bacterial diversity in surface sediments
from the Pacific Arctic Ocean., Extremophiles: life under extreme conditions, vol. 13,
no. 2, pp. 233-46, Mar. 2009.

[63] K. T. Konstantinidis, J. Braff, D. M. Karl, and E. F. DeLong, Comparative metagenomic
analysis of a microbial community residing at a depth of 4,000 meters at station ALOHA
in the North Pacific subtropical gyre., Applied and environmental microbiology, vol. 75,
no. 16, pp. 5345-55, Aug. 2009.
[64] W. Xie, F. Wang, L. Guo, Z. Chen, S. M. Sievert, J. Meng, G. Huang, Y. Li, Q. Yan, S.
Wu, X. Wang, S. Chen, G. He, X. Xiao, and A. Xu, Comparative metagenomics of
microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting
chemistries., The ISME journal, vol. 5, no. 3, pp. 414-26, Mar. 2011.
[65] J. Wang, Y. Wu, H. Jiang, C. Li, H. Dong, Q. Wu, J. Soininen, and J. Shen, High beta
diversity of bacteria in the shallow terrestrial subsurface, Environmental Microbiology,
vol. 10, no. 10, pp. 2537-2549, Oct. 2008.


Appendix


Figure 20 - Taxonomic distribution of the reads from the seven samples at the phylum level




Figure 21 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 1


Figure 22 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 2


Figure 23 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 3


Figure 24 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 5



Figure 25 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 6


Figure 26 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 7



Figure 27 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 8



Figure 28 - Beta diversity related to spatial distance. Scatter plot titled "Correlation
between Spatial Distance and Dissimilarity"; x-axis: Distance (cm), 0 to 40; y-axis:
Dissimilarity, 0 to 0.45; linear fit y = 0.0013x + 0.3036, R² = 0.0999.


Figure 29 - Heatmap of the reads against the M5RNA database at 97% identity



Figure 30 - Heatmap of the reads against the M5NR database at 90% identity
