Cis-regulatory elements (CREs, e.g., promoters and enhancers) regulate gene expression, and variants within CREs can modulate disease risk. Next-generation sequencing has enabled the rapid generation of genomic data that predict the locations of CREs, but a bottleneck lies in functionally interpreting these data. To address this issue, massively parallel reporter assays (MPRAs) have emerged, in which barcoded reporter libraries are introduced into cells, and the resulting barcoded transcripts are quantified by next-generation sequencing. Thus far, MPRAs have been largely restricted to assaying short CREs in a limited repertoire of cultured cell types. Here, we present two advances that extend the biological relevance and applicability of MPRAs. First, we adapt exome capture technology to instead capture candidate CREs, thereby tiling across the targeted regions and markedly increasing the length of CREs that can be readily assayed. Second, we package the library into adeno-associated virus (AAV), thereby allowing delivery to target organs in vivo. As a proof of concept, we introduce a capture library of about 46,000 constructs, corresponding to roughly 3500 DNase I hypersensitive (DHS) sites, into the mouse retina by ex vivo plasmid electroporation and into the mouse cerebral cortex by in vivo AAV injection. We demonstrate tissue-specific cis-regulatory activity of DHSs and provide examples of high-resolution truncation mutation analysis for multiplex parsing of CREs. Our approach should enable massively parallel functional analysis of a wide range of CREs in any organ or species that can be infected by AAV, such as nonhuman primates and human stem cell–derived organoids.
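
The barcode quantification step described above can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that activity is scored as the log2 ratio of each CRE's share of RNA barcode reads to its share of DNA (input library) barcode reads, a common MPRA convention. The function name `mpra_activity`, the input dictionaries, and the `pseudocount` parameter are all hypothetical.

```python
import math
from collections import defaultdict


def mpra_activity(rna_counts, dna_counts, barcode_to_cre, pseudocount=1.0):
    """Return a log2(RNA share / DNA share) score per CRE.

    Assumption: RNA barcode reads are normalised to DNA barcode reads
    from the input library; the abstract does not specify the scoring.
    """
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    if rna_total == 0 or dna_total == 0:
        raise ValueError("both RNA and DNA libraries need at least one read")

    # Pool reads from all barcodes that tag the same CRE.
    rna_by_cre = defaultdict(float)
    dna_by_cre = defaultdict(float)
    for barcode, cre in barcode_to_cre.items():
        rna_by_cre[cre] += rna_counts.get(barcode, 0)
        dna_by_cre[cre] += dna_counts.get(barcode, 0)

    activity = {}
    for cre in dna_by_cre:
        rna_share = (rna_by_cre[cre] + pseudocount) / rna_total
        dna_share = (dna_by_cre[cre] + pseudocount) / dna_total
        activity[cre] = math.log2(rna_share / dna_share)
    return activity


if __name__ == "__main__":
    # Toy data: two barcodes, one per hypothetical CRE.
    rna = {"AAA": 30, "CCC": 10}
    dna = {"AAA": 10, "CCC": 10}
    mapping = {"AAA": "cre1", "CCC": "cre2"}
    print(mpra_activity(rna, dna, mapping, pseudocount=0.0))
```

A positive score would indicate a CRE that is over-represented among transcripts relative to the input library. The pseudocount keeps the ratio defined when a barcode has no reads in one of the libraries.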

Massively parallel sequencing has revolutionized many areas of biology, but sequencing large amounts of DNA in many individuals is cost-prohibitive and unnecessary for many studies. Genomic complexity reduction techniques such as sequence capture and restriction enzyme-based methods enable the analysis of many more individuals per unit cost. Despite their utility, current complexity reduction methods have limitations, especially when large numbers of individuals are analyzed. Here we develop a much improved restriction site-associated DNA (RAD) sequencing protocol and a new method called Rapture (RAD capture). The new RAD protocol improves versatility by separating RAD tag isolation and sequencing library preparation into two distinct steps. This protocol also recovers more unique (nonclonal) RAD fragments, which improves both standard RAD and Rapture analysis. Rapture then uses an in-solution capture of chosen RAD tags to target sequencing reads to desired loci. Rapture combines the benefits of both RAD and sequence capture, i.e., very inexpensive and rapid library preparation for many individuals as well as high specificity in the number and location of genomic loci analyzed. Our results demonstrate that Rapture is a rapid and flexible technology capable of analyzing a very large number of individuals with minimal sequencing and library preparation cost. The methods presented here should improve the efficiency of genetic analysis for many aspects of agricultural, environmental, and biomedical science.

Metazoan genomes are spatially organized at multiple scales, from packaging of DNA around individual nucleosomes to segregation of whole chromosomes into distinct territories. At the intermediate scale of kilobases to megabases, which encompasses the sizes of genes, gene clusters and regulatory domains, the three-dimensional (3D) organization of DNA is implicated in multiple gene regulatory mechanisms, but understanding this organization remains a challenge. At this scale, the genome is partitioned into domains of different epigenetic states that are essential for regulating gene expression. Here we investigate the 3D organization of chromatin in different epigenetic states using super-resolution imaging. We classified genomic domains in Drosophila cells into transcriptionally active, inactive or Polycomb-repressed states, and observed distinct chromatin organizations for each state. All three types of chromatin domains exhibit power-law scaling between their physical sizes in 3D and their domain lengths, but each type has a distinct scaling exponent. Polycomb-repressed domains show the densest packing and most intriguing chromatin folding behaviour, in which chromatin packing density increases with domain length. Distinct from the self-similar organization displayed by transcriptionally active and inactive chromatin, the Polycomb-repressed domains are characterized by a high degree of chromatin intermixing within the domain. Moreover, compared to inactive domains, Polycomb-repressed domains spatially exclude neighbouring active chromatin to a much stronger degree. Computational modelling and knockdown experiments suggest that reversible chromatin interactions mediated by Polycomb-group proteins play an important role in these unique packaging properties of the repressed chromatin. 
Taken together, our super-resolution images reveal distinct chromatin packaging for different epigenetic states at the kilobase-to-megabase scale, a length scale that is directly relevant to genome regulation.
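
The power-law scaling between physical size and domain length can be written as R = a * L^c, where c is the scaling exponent. As a minimal sketch, assuming the exponent is estimated by ordinary least squares on log-transformed values (the abstract does not describe the fitting procedure), the fit could look like the following. The function name `fit_power_law` and the toy numbers are hypothetical and are not data from the study.

```python
import math


def fit_power_law(lengths, sizes):
    """Fit sizes = prefactor * lengths ** exponent by least squares in log-log space."""
    if len(lengths) != len(sizes) or len(lengths) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(v) for v in lengths]
    ys = [math.log(v) for v in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("domain lengths must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx                      # slope of the log-log line
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor


if __name__ == "__main__":
    # Synthetic domains (lengths in kb) that follow R = 2 * L ** 0.3 exactly.
    domain_lengths = [10, 100, 1000]
    physical_sizes = [2 * length ** 0.3 for length in domain_lengths]
    print(fit_power_law(domain_lengths, physical_sizes))
```

Fitting each epigenetic state separately would give one exponent per state, which is the quantity the abstract compares across active, inactive and Polycomb-repressed domains.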

The 5S and 45S rDNA sites are useful chromosome landmarks and can provide valuable information about karyotype evolution and species interrelationships. In this study, we employed fluorescence in situ hybridization (FISH) to determine the number and chromosomal location of 5S and 45S rDNA loci in 8 diploid Cucumis species. Two oligonucleotide painting probes specific for the rDNA-bearing chromosomes in C. melo were hybridized to other Cucumis species in order to investigate the homeologies among the rDNA-carrying chromosomes in Cucumis species. The analyzed diploid species showed 3 types of rDNA distribution patterns, which provided clear cytogenetic evidence on the divergence between C. melo and wild diploid African Cucumis species. The present results not only show species interrelationships in the genus Cucumis, but the rDNA FISH patterns can also be used as cytological markers for the discrimination of closely related species. The data will be helpful for breeders to choose the most suitable species from various wild species for improvement of cultivated melon.

Exon-capture studies have typically been restricted to relatively shallow phylogenetic scales due primarily to hybridization constraints. Here, we present an exon-capture system for an entire class of marine invertebrates, the Ophiuroidea, built upon a phylogenetically diverse transcriptome foundation. The system captures approximately 90% of the 1,552-exon target across all major lineages of the quarter-billion-year-old extant crown group. Key features of our system are 1) basing the target on an alignment of orthologous genes determined from 52 transcriptomes spanning the phylogenetic diversity and trimmed to remove anything difficult to capture, map, or align; 2) use of multiple artificial representatives based on ancestral state reconstructions rather than exemplars to improve capture and mapping of the target; 3) mapping reads to a multi-reference alignment; and 4) using patterns of site polymorphism to distinguish among paralogy, polyploidy, allelic differences, and sample contamination. The resulting data give a well-resolved tree (currently standing at 417 samples, 275,352 sites, 91% data-complete) that will transform our understanding of ophiuroid evolution and biogeography.
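
A figure such as "91% data-complete" is usually the fraction of cells in the samples-by-sites alignment matrix that hold a called base. The sketch below assumes that definition and assumes that gaps (`-`), `?` and `N` mark missing data. Neither assumption is confirmed by the abstract, and the function name `data_completeness` is hypothetical.

```python
def data_completeness(alignment, missing="-?Nn"):
    """Return the fraction of non-missing cells in an aligned sequence matrix.

    alignment: dict mapping sample name -> aligned sequence (equal lengths).
    missing:   characters treated as missing data (an assumption).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    if n_sites == 0:
        raise ValueError("alignment has no sites")
    present = sum(
        1 for seq in alignment.values() for base in seq if base not in missing
    )
    return present / (len(alignment) * n_sites)


if __name__ == "__main__":
    # Toy alignment with two samples and four sites.
    toy = {"sample_a": "ACGT", "sample_b": "A--?"}
    print(data_completeness(toy))
```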

The qualification of orthology is a significant challenge when developing large, multilocus phylogenetic data sets from assembled transcripts. Transcriptome assemblies have various attributes, such as fragmentation, frameshifts and mis-indexing, which pose problems to automated methods of orthology assessment. Here, we identify a set of orthologous single-copy genes from transcriptome assemblies for the land snails and slugs (Eupulmonata) using a thorough approach to orthology determination involving manual alignment curation, gene tree assessment and sequencing from genomic DNA. We qualified the orthology of 500 nuclear, protein-coding genes from the transcriptome assemblies of 21 eupulmonate species to produce the most complete phylogenetic data matrix for a major molluscan lineage to date, both in terms of taxon and character completeness. Exon capture targeting 490 of the 500 genes (those with at least one exon >120 bp) from 22 species of Australian Camaenidae successfully captured sequences of 2825 exons (representing all targeted genes), with only a 3.7% reduction in the data matrix due to the presence of putative paralogs or pseudogenes. The automated pipeline Agalma retrieved the majority of the manually qualified 500 single-copy gene set and identified a further 375 putative single-copy genes, although it failed to account for fragmented transcripts, resulting in lower data matrix completeness when considering the original 500 genes. This could potentially explain the minor inconsistencies we observed in the supported topologies for the 21 eupulmonate species between the manually curated and ‘Agalma-equivalent’ data set (sharing 458 genes). Overall, our study confirms the utility of the 500 gene set to resolve phylogenetic relationships at a range of evolutionary depths and highlights the importance of addressing fragmentation at the homolog alignment stage for probe design.

By combining high-throughput sequencing with target enrichment (‘hybridization capture’), researchers are able to obtain molecular data from genomic regions of interest for projects that are otherwise constrained by sample quality (e.g. degraded and contamination-rich samples) or a lack of a priori sequence information (e.g. studies on nonmodel species). Despite the use of hybridization capture in various fields of research for many years, the impact of enrichment conditions on capture success is not yet thoroughly understood. We evaluated the impact of a key parameter – hybridization temperature – on the capture success of mitochondrial genomes across the carnivoran family Felidae. Capture was carried out for a range of sample types (fresh, archival, ancient) with varying levels of sequence divergence between bait and target (i.e. across a range of species) using pools of individually indexed libraries on Agilent SureSelect™ arrays. Our results suggest that hybridization capture protocols require specific optimization for the sample type that is being investigated. Hybridization temperature affected the proportion of on-target sequences following capture: for degraded samples, we obtained the best results with a hybridization temperature of 65 °C, while a touchdown approach (65 °C down to 50 °C) yielded the best results for fresh samples. Evaluation of capture performance at a regional scale (sliding window approach) revealed no significant improvement in the recovery of DNA fragments with high sequence divergence from the bait at any of the tested hybridization temperatures, suggesting that hybridization temperature may not be the critical parameter for the enrichment of divergent fragments.
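
The regional (sliding window) evaluation can be sketched as follows. The sketch assumes that each window is summarised by two numbers: the fraction of mismatched positions between an aligned bait and target sequence, and the mean read depth over the same positions. The abstract does not give the window size, step, or exact metrics, so `window_summary`, its parameters, and the toy inputs are hypothetical.

```python
def sliding_windows(length, window, step):
    """Yield (start, end) coordinates of full windows along a sequence."""
    for start in range(0, length - window + 1, step):
        yield start, start + window


def window_summary(bait, target, coverage, window=4, step=2):
    """Return (start, divergence, mean_coverage) for each window.

    bait, target: aligned sequences of equal length.
    coverage:     per-position read depth after capture.
    """
    if not (len(bait) == len(target) == len(coverage)):
        raise ValueError("bait, target and coverage must have equal length")
    rows = []
    for start, end in sliding_windows(len(bait), window, step):
        mismatches = sum(
            1 for b, t in zip(bait[start:end], target[start:end]) if b != t
        )
        mean_cov = sum(coverage[start:end]) / window
        rows.append((start, mismatches / window, mean_cov))
    return rows


if __name__ == "__main__":
    # Toy region: the second half diverges from the bait and is covered less.
    bait_seq = "AAAAAAAA"
    target_seq = "AAAATTAA"
    depth = [10, 10, 10, 10, 2, 2, 10, 10]
    for row in window_summary(bait_seq, target_seq, depth, window=4, step=4):
        print(row)
```

Comparing the coverage column against the divergence column at each hybridization temperature is one way to check whether divergent regions are recovered better under any condition.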

After evolving in Africa at the close of the Miocene, mammoths (Mammuthus sp.) spread through much of the northern hemisphere, diversifying morphologically as they entered various habitats. Paleontologically, these morphs are conventionally recognized as species. In Pleistocene North America alone, several mammoth species have been recognized, inhabiting environments as different as cold tundra-steppe in the north and the arid grasslands or temperate savanna-parklands of the south. Yet mammoth phylogeographic studies have overwhelmingly focused on permafrost-preserved remains of only one of these species, Mammuthus primigenius (woolly mammoth). Here we challenge this bias by performing a geographically and taxonomically wide survey of mammoth genetic diversity across North America. Using a targeted enrichment technique, we sequenced 67 complete mitochondrial genomes from non-primigenius specimens representing M. columbi (Columbian mammoth), M. jeffersonii (Jeffersonian mammoth), and M. exilis (pygmy mammoth), including specimens from contexts not generally associated with good DNA preservation. While we uncovered clear phylogeographic structure in mammoth matrilines, their phylogeny as recovered from mitochondrial DNA is not compatible with existing systematic interpretations of their paleontological record. Instead, our results strongly suggest that various nominal mammoth species interbred, perhaps extensively. We hypothesize that at least two distinct stages of interbreeding between conventional paleontological species are likely responsible for this pattern – one between Siberian woolly mammoths and resident American populations that introduced woolly mammoth phenotypes to the continent, and another between ecomorphologically distinct populations of woolly and Columbian mammoths in North America south of the ice.

With the increasing availability of high-throughput sequencing, phylogenetic analyses are no longer constrained by the limited availability of a few loci. Here, we describe a sequence capture methodology, which we used to collect data for analyses of diversification within Sabal (Arecaceae), a palm genus native to the south-eastern USA, Caribbean, Bermuda and Central America. RNA probes were developed and used to enrich DNA samples for putatively low copy nuclear genes and the plastomes for all Sabal species and two outgroup species. Sequence data were generated on an Illumina MiSeq sequencer and target sequences were assembled using custom workflows. Both coalescence and supermatrix analyses of 133 nuclear genes were used to estimate species tree relationships. Plastid genomes were also analysed, yielding generally poor resolution with regard to species relationships. Species relationships described in both nuclear gene and plastome sequences largely reflect the biogeography of the group and, to a lesser extent, previous morphology-based hypotheses. Beyond the biological implications, this research validates a high-throughput methodology for generating a large number of genes for coalescence-based phylogenetic analyses in plant lineages.

In an era of ever-increasing amounts of whole-genome sequence data for individuals and populations, the utility of traditional single nucleotide polymorphism (SNP) array-based genome scans is uncertain. We previously performed a SNP array-based genome scan to identify candidate genes under selection in six distinct grey wolf (Canis lupus) ecotypes. Using this information, we designed a targeted capture array for 1040 genes, including all exons and flanking regions, as well as 5000 1-kb nongenic neutral regions, and resequenced these regions in 107 wolves. Selection tests revealed striking patterns of variation within candidate genes relative to noncandidate regions and identified potentially functional variants related to local adaptation. We found that 27% and 47% of candidate genes from the previous SNP array study had functional changes that were outliers in sweed and bayenv analyses, respectively. This result verifies the use of genomewide SNP surveys to tag genes that contain functional variants between populations. We highlight nonsynonymous variants in APOB, LIPG and USH2A that occur in functional domains of these proteins, and that demonstrate high correlation with precipitation seasonality and vegetation. We find that Arctic and High Arctic wolf ecotypes have higher numbers of genes under selection, which highlights their conservation value and heightened threat due to climate change. This study demonstrates that combining genomewide genotyping arrays with large-scale resequencing and environmental data provides a powerful approach to discern candidate functional variants in natural populations.