100 kb pairs) arrays of chromosome-specific simple repeats, the centromeres of GGA5, GGA27, and GGAZ are remarkably short (~30 kb) and lack the usual repeat structure (Shang et al., 2010). Being able to clone and manipulate these centromeres by homologous recombination (Shang et al., 2013) promises to make the chicken the primary model system for the study of vertebrate centromeres. A final point is that the zebra finch and probably other birds possess a germline-restricted chromosome, with a function that remains obscure (Itoh et al., 2009).

PART 1 Undergirding Themes

1.4 GENOME SEQUENCES

1.4.1 Approach

All bird genomes sequenced to date have employed a whole-genome shotgun method, in which overlaps between millions of random reads are used to assemble contiguous blocks of sequence (i.e., contigs) along the genome. Due to their relatively low repeat content, avian genomes are ideal for shotgun sequencing. Contigs are then assembled into scaffolds (i.e., aligned groups of contigs containing size-calibrated gaps), using mate-pair reads in which both ends are sequenced from DNA fragments within a selected size range. Even for genomes with deep coverage, this generates hundreds to thousands of scaffolds that, ideally, are ordered and aligned using physical (based on mapping of recombinant clones in bacterial artificial chromosome (BAC) vectors) and/or linkage maps (Table 1.1).

The chicken (International Chicken Genome Sequencing Consortium, 2004) and zebra finch (Warren et al., 2010) were sequenced by the Sanger method, in which reads are derived one-by-one from recombinant clone libraries. This currently remains the gold standard for genome sequencing but is no longer cost-effective with the advent of next-generation sequencing (NGS) methods, which directly sequence collections of (uncloned) DNA fragments in a multiparallel manner.
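The overlap-based assembly of reads into contigs described in Section 1.4.1 can be illustrated with a toy sketch. This is a greedy merge on error-free, same-strand reads, chosen only for brevity; it is not the algorithm used by any of the assemblers behind the genomes discussed here, and the function names, the `min_len` threshold, and the example reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Toy assembler: repeatedly merge the pair with the longest overlap.

    Assumes error-free reads from one strand; real assemblers must also
    handle sequencing errors, reverse complements, and repeats.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three made-up reads that tile a 13-base sequence.
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Reads that share no sufficiently long overlap with anything else remain as separate contigs, which is one simple way to see why repeat-poor genomes assemble more cleanly.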
NGS read lengths often are shorter and sometimes more error-prone than Sanger reads, but NGS compensates by much higher coverage, such that the consensus sequence is at least as accurate. Various NGS methods have been developed (Metzker, 2010). The first avian genome to be sequenced via NGS was that of the turkey (Table 1.1), and we can anticipate an onslaught of new bird genomes soon (Genome 10K Community of Scientists, 2009).

1.4.2 Coverage

Most current avian genome sequence assemblies contain about 90-95% of their respective euchromatic genomes (typically 1.1-1.2 Gb; Table 1.1). Coverage is usually estimated by the fraction of different mRNA transcripts that can be found within the assembly. Highly repetitive heterochromatic sequences, especially when repeated in tandem, are nearly impossible to assemble and are missing from all vertebrate genomes, but these contain few genes. For example, centromeres (however, see Shang et al., 2010), telomeres, and rDNA (tandem repeats that encode ribosomal RNA, on GGA16) are generally missing altogether or shown as gaps, and very little of the repeat-rich/gene-poor W chromosome is usually assembled. Sequence scaffolds are ordered and aligned along chromosomes for birds that have dense linkage maps and/or BAC contig physical maps, sometimes assuming a common local order with closely related genomes (comparative maps); however, most NGS-derived avian genomes currently are unordered (Table 1.1). Sequence scaffolds that cannot be placed are arbitrarily clustered on chrUn (chromosome unknown) or, for example, chr_random if the chromosome but not the location is known, or simply provided as a list of unplaced scaffolds. Even for the chicken, it has been impossible to align sequence scaffolds with specific smaller microchromosomes (GGA29-31, GGA33-38, and most of GGA16 and 32), so any such sequence is on chrUn.
In part, this is due to a paucity of aligning markers; however, more generally, microchromosomal DNA is poorly represented in sequence reads. The reasons remain unclear, but they likely relate to microchromosomes being rich in repetitive sequences and high in GC content. It was initially thought that this made microchromosomal DNA refractory to recombinant DNA cloning (and, indeed, it is rare in clone libraries), but these reads remain underrepresented even in uncloned NGS sequences.

Chapter 1 Avian Genomics

TABLE 1.1 Avian Reference Genome Sequence Assemblies
[Columns: Species, WGS Project; Assembly; Method; Fold Coverage; Sequenced Size (Gb); Aligned to Chromosomes; Scaffold N50; Contig N50 (kb); Approximate Coverage; References. Table body not recoverable.]
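Table 1.1 summarizes assembly contiguity by scaffold and contig N50. As a reminder of what that statistic measures, the following minimal sketch computes N50 from a list of sequence lengths: the length L such that sequences of length L or longer together contain at least half of all assembled bases. The function name and the example lengths are made up for illustration and are not taken from any assembly in the table.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a set of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembled bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units).
    print(n50([10, 20, 30, 40]))
```

Because N50 is weighted by length, a few very long scaffolds can yield a high N50 even when many short, unplaced scaffolds remain, so it complements rather than replaces the coverage estimates discussed above.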
The smallest chicken chromosome with reasonable sequence representation is GGA25 (~2.2 Mb), but the sequence assembly is problematic for this and at least two other small chromosomes (GGA28, Gordon et al., 2007; GGA16, Shiina et al., 2007), in part due to repeated sequences. Even though they may be rich in repeats, for the most part, microchromosomes are also gene-rich (International Chicken Genome Sequencing Consortium, 2004), although one cannot be certain about GGA29-38. It seems likely that much of the missing 5-10% of current assemblies (Table 1.1) lies on microchromosomes and W chromosomes. (Falcon assemblies claim 97-99% coverage (Zhan et al., 2013), but this probably is not because these genomes contain fewer microchromosomes, but rather because the authors measured coverage by the frequency with which cloned sequences are found, so their test set is biased away from microchromosomes.)

1.5 ANNOTATION

Much of the value of the reference genome sequence depends on annotation (Yandell and Ence, 2012), which links the DNA sequence to all the information available on component genes, mRNAs, proteins, etc. Once the genome is sequenced, there are two broad classes of annotation: (1) evidence-based, which uses RNA or proteomic data (see Chapters 3 and 4), as well as homology to genes in other species; and (2) ab initio annotation, employing computer searches for open reading frames, likely stop codons, splice junctions, and other sequence-based characteristics to predict the existence of genes for which experimental evidence is lacking. Transposable elements are annotated based on their repetition in the genome, relatedness to transposons in other species, and their characteristic end structures (Jurka et al., 2005).
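The open-reading-frame search mentioned under ab initio annotation can be sketched in a few lines. This is a deliberately naive illustration, not a gene predictor: it scans only the forward strand, ignores introns and splice junctions, and treats any ATG-to-stop stretch above a length cutoff as a candidate. The function name, the `min_codons` parameter, and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) coordinates of forward-strand ORFs.

    Coordinates are 0-based and half-open, so seq[start:end] runs from the
    ATG through the stop codon. `min_codons` counts codons from the ATG up
    to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"  # made-up sequence with one short ORF
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

In practice, such sequence-only signals are combined with the evidence-based data described above, which is one reason predictions for lineage-specific genes deserve extra caution.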
Annotating regulatory sequences (such as transcription factor binding sites) is more problematic; it also relies on both comparisons to other genomes and evidence from genome-wide DNA methylation and chromatin immunoprecipitation (ChIP) analyses (see Chapter 2). This is exemplified by the human ENCODE project (The ENCODE Project Consortium, 2012), but it will be some time before that level of data is available for any bird. Much of the annotation of avian genome sequences has relied on comparisons to other genomes and has not been manually curated. Thus, the annotations are frequently inaccurate, especially for those genes and other elements whose functions are lineage-specific (i.e., only found in a given species or only in birds). One should therefore be hesitant to accept conclusions based solely on computer analysis of avian genome sequences in their current state.

1.6 GENOME BROWSERS

Most of the user community depends on one or more genome browsers to utilize sequence data. There are three major browsers: University of California at Santa Cruz (UCSC) Genome Bioinformatics (www.genome.ucsc.edu), Ensembl (www.ensembl.org), and the National Center for Biotechnology Information (NCBI) Map Viewer (www.ncbi.nlm.nih.gov/genome); there are also avian-focused sites such as Avian Genomes (aviangenomes.org) and BirdBase (birdbase.arizona.edu/birdbase). The browsers all employ the same reference sequence information as a series of chromosomes, scaffolds, or both. Any property that is sequence-specific (genes, ChIP binding sites, RNA sequences, homology with other sequences, etc.) can be displayed as a track on the genome (Figure 1.1). Genome browsers are only as good as the underlying sequence assembly and annotation. Not all avian genomes are available at every browser site, and not all annotation tracks are available for each build (i.e., updated assemblies based on new data). The various options are in constant flux.
1.7 GENES

All bird genomes evolved via two whole-genome duplication events that preceded the ancestral vertebrate genome (Van de Peer et al., 2009). A commonly cited outcome is the four clusters of HOX homeobox developmental transcription factor genes found in most vertebrates (e.g., chicken HOXA cluster on GGA2; HOXB on GGA27, Figure 1.1; HOXD on GGA7; and HOXC on chrUn, probably on a microchromosome). In most instances, one or more of the potential four ancestral genes or clusters has been lost during subsequent evolution or, as in the case of the HOX clusters, has diverged to perform different functions, thereby providing a selective force leading to its retention. Another major force in gene evolution has been the (usually local) expansion and contraction of gene families. For example, the γ-c clade of olfactory receptor genes (always among the most rapidly diverging gene families) is highly expanded in the chicken and zebra finch, but falcon genomes have only one or two copies (Zhan et al., 2013).

Depending on the methods employed and the available data, avian genomes are estimated to contain 15,000-20,000 protein-coding genes, but keep in mind that each gene locus may generate multiple transcripts and proteins due to alternative splicing, transcriptional start sites, and polyadenylation sites. This number may end up being slightly low once additional transcriptome and proteome data accumulate (Chapters 3 and 4). There is some evidence of a greater rate of gene loss versus gene gain during avian evolution (International Chicken Genome Sequencing Consortium, 2004), but this must be viewed cautiously, given the less fully annotated state of bird genomes. The most reliably identified genes are
