
WO2018057779A1 - Compositions of synthetic transposons and methods of use thereof - Google Patents

Compositions of synthetic transposons and methods of use thereof

Info

Publication number
WO2018057779A1
Authority
WO
WIPO (PCT)
Prior art keywords
complementary region
strand
synthetic
nucleic acid
sequence
Prior art date
Application number
PCT/US2017/052776
Other languages
French (fr)
Inventor
Jianbiao Zheng
Original Assignee
Jianbiao Zheng
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jianbiao Zheng
Publication of WO2018057779A1


Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA

Definitions

  • the present invention relates to the field of genomics, in particular, sequencing and analysis of nucleic acids.
  • kits to provide phasing information of sequencing reads that facilitate assembly of whole genome sequences and other long-range sequences.
  • Commercial kits are available from e.g., Complete Genomics, Illumina, or 10x Genomics. Also see, for example, Peters B.A. et al., Nature 487: 190-195, 2012; Kaper F. et al., Proc. Natl. Acad. Sci. 110: 5552-5557, 2013; Amini S. et al., Nature Genetics 46: 1343-1349, 2014; McCoy R.C. et al., PLOS One 9: e10668, 2014; Zheng G. X. Y.
  • Transposases can be used to introduce mutations or insert sequences in nucleic acids. Previously, transposases were used for in vitro or in vivo mutagenesis (e.g., US6,159,736) or for producing protein tags (e.g., US5,652,128). Several companies including NEB, Epicentre (now part of Illumina) and Finnzymes have provided kits for these purposes. Transposases have also been used to fragment target DNA and to introduce primer binding sequences at the same time. See, for example, US6,593,113, 2003; US9,115,396; US9,145,623; and Adey A. et al., Genome Biol. 11: R119, 2010. Commercial kits are available, including, for example, NEXTERA® DNA Sample Prep kits by Illumina/Epicentre and MUSEEK Library Preparation kits by Thermo Scientific.
  • the present invention provides compositions, methods, kits and analysis tools for high-quality sequencing of nucleic acids, haplotyping and quantification of whole genome or targeted sequences.
  • the compositions comprise one or more synthetic transposons having two non-complementary regions linked to each other, and the synthetic transposons may or may not contain molecular barcodes.
  • One aspect of the present application provides a synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides.
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides.
  • the cleavable nucleotide is a uracil nucleotide.
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other.
  • the first single-stranded linker and the second single-stranded linker hybridize to each other.
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the cleavable nucleotide is a uracil nucleotide.
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • ME mosaic element
  • the synthetic transposon further comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region.
  • the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
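  • The barcode length requirement above can be made concrete with a short calculation. The following is a minimal sketch, assuming every barcode position is drawn uniformly and independently from A, C, G and T (the application also allows degenerate designs, which this does not model). The function names are hypothetical and chosen only for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def barcode_space(n_random: int) -> int:
    """Number of distinct barcodes available from n fully random positions."""
    return len(NUCLEOTIDES) ** n_random


def random_barcode(n_random: int, rng: random.Random) -> str:
    """Draw one barcode of n_random uniformly random nucleotides."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(n_random))


def collision_probability(n_random: int, n_transposons: int) -> float:
    """Probability that at least two of n_transposons share a barcode.

    Standard birthday-problem product; assumes uniform, independent barcodes.
    """
    space = barcode_space(n_random)
    p_all_unique = 1.0
    for i in range(n_transposons):
        p_all_unique *= max(space - i, 0) / space
    return 1.0 - p_all_unique


if __name__ == "__main__":
    print(barcode_space(5))                   # distinct 5-nt barcodes
    print(random_barcode(5, random.Random(0)))
    print(collision_probability(5, 50))       # chance of a shared barcode among 50
```

  • Five random positions give only 4^5 = 1024 distinct barcodes, so under these assumptions longer barcodes are needed whenever many transposons must remain distinguishable in one reaction.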
  • One aspect of the present application provides a method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with any one of the compositions described above, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • step (c) comprises treating the repaired target nucleic acid with an endonuclease.
  • the endonuclease is uracil DNA glycosylase (UDG).
  • step (c) comprises denaturing of the repaired target nucleic acid.
  • the method further comprises treating the denatured repaired target nucleic acid with an exonuclease.
  • the method further comprises amplifying the library of template nucleic acids.
  • the amplifying is whole-genome amplification.
  • the amplifying is targeted amplification.
  • the library of template nucleic acids is amplified by a polymerase chain reaction (PCR).
  • the PCR comprises contacting the template nucleic acids with a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
  • the library of template nucleic acids is amplified by rolling circle amplification (RCA) using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • the method further comprises circularizing the template nucleic acids prior to the RCA.
  • the polymerase is T4 DNA polymerase.
  • the transposase is Tn5 transposase.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
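  • The primer rule stated for the PCR step above (a primer that hybridizes to an adapter sequence or to its reverse complement) can be sketched as follows. This is an illustration only: the adapter strings are hypothetical placeholders, not the adapter sequences of the application, and hybridization is simplified to exact reverse complementarity with no mismatch or melting-temperature model.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_for(target: str) -> str:
    """Primer (5'->3') that anneals to the given target sequence."""
    return reverse_complement(target)


def hybridizes(primer: str, target: str) -> bool:
    """True if the primer is the exact reverse complement of the target."""
    return primer.upper() == reverse_complement(target).upper()


if __name__ == "__main__":
    adapter_f = "AAGGCT"   # hypothetical placeholder for the first adapter (F)
    adapter_r = "GATTCC"   # hypothetical placeholder for the second adapter (R)
    # A primer against F itself, and one against the reverse complement of R.
    print(primer_for(adapter_f))
    print(primer_for(reverse_complement(adapter_r)))
```

  • A primer designed against the reverse complement of an adapter has the same sequence as the adapter itself, which is why the text allows either form.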
  • One aspect of the present application provides a method of analyzing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using the method of any one of the methods of preparing a library of template nucleic acids described above; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids.
  • step (c) comprises: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the method is used for genome assembly, haplotyping, detection of mutation, chromosomal conformation analysis, or methylation analysis.
  • the mutation is selected from the group consisting of substitution, indel, structural variation, and copy number variation.
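  • The barcode-based assembly and copy counting described above can be sketched in a few lines. This is a simplified illustration, not the pipeline of the application: it assumes each template fragment has already been reduced to a (left barcode, target sequence, right barcode) tuple on a common strand, that barcodes are error-free and unique per insertion, and it ignores the duplicated target sequence flanking each insertion. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (left barcode, target sequence, right barcode); None marks a molecule end.
Fragment = Tuple[Optional[str], str, Optional[str]]


def assemble_contigs(fragments: List[Fragment]) -> List[str]:
    """Chain fragments whose right barcode equals another fragment's left barcode."""
    unique = list(dict.fromkeys(fragments))  # collapse duplicate reads, keep order
    by_left: Dict[str, Fragment] = {}
    right_barcodes = set()
    for left, _, right in unique:
        if left is not None:
            by_left[left] = (left, _, right)
        if right is not None:
            right_barcodes.add(right)

    contigs = []
    for left, seq, right in unique:
        if left is not None and left in right_barcodes:
            continue  # this fragment continues a chain; it is not a start
        pieces = [seq]
        seen = set()
        while right is not None and right in by_left and right not in seen:
            seen.add(right)
            _, seq, right = by_left[right]
            pieces.append(seq)
        contigs.append("".join(pieces))
    return contigs


def count_copies(fragments: List[Fragment]) -> int:
    """Count one copy of the target per assembled contiguous sequence."""
    return len(assemble_contigs(fragments))


if __name__ == "__main__":
    reads = [
        ("B1", "CCC", "B2"),
        (None, "AAA", "B1"),
        ("B2", "GGG", None),
        ("B1", "CCC", "B2"),   # duplicate read of the same fragment
        (None, "TTT", "B9"),
    ]
    print(assemble_contigs(reads))
    print(count_copies(reads))
```

  • Because every read that joins a chain contributes to a single contig, amplification duplicates do not inflate the copy count in this model.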
  • kits for preparing a library of template nucleic acids comprising: (a) the composition according to any one of the compositions described above; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c) instructions for preparing the library of template nucleic acids.
  • the kit further comprises a polymerase, such as a T4 DNA polymerase.
  • the kit further comprises a ligase.
  • the transposase is Tn5 transposase.
  • the kit further comprises an
  • the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • Reference to "about" a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, description referring to "about X" includes description of "X".
  • reference to "not" a value or parameter generally means and describes "other than" a value or parameter.
  • the method is not used to treat cancer of type X means the method is used to treat cancer of types other than X.
  • FIG. 1 illustrates integration of a paired-barcode synthetic transposon having a regular double-stranded structure into template DNA followed by amplification using dual PCR primers
  • F denotes a first adapter sequence
  • R denotes a second adapter sequence.
  • Primers designed to match the F and R sequences (i.e., having the same or reverse complementary sequences) are used in the amplification step.
  • FIG. 2A depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205).
  • FIG. 2B depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205), and a single-stranded linker (206) disposed between F and R.
  • FIG. 2C depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a molecular barcode (202/202rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205).
  • FIG. 2D depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a molecular barcode (202/202rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205), and a single-stranded linker (206) disposed between F and R.
  • FIG. 2E depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and no molecular barcodes, wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *).
  • FIG. 2F depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein only one strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2G depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein only one strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 2H depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2I depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 2J depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (206, 206rc) disposed between F and R, wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2K depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (206, 206rc) disposed between F and R, wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 2L depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (208, 209) disposed between F and R, and a bridge nucleic acid (207), wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers to the bridge nucleic acid, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2M depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and
  • FIG. 2N depicts an exemplary synthetic transposon having one blunt end and one end with a hairpin structure (210), comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2O depicts an exemplary synthetic transposon having one blunt end and one end with a hairpin structure (210), comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 3 shows an exemplary method of preparing the synthetic transposons of FIG. 2F and FIG. 2G having two identical molecular barcode sequences.
  • 301/301rc sequences containing transposon binding sites
  • 302/302rc sequences containing molecular barcodes
  • 303/303rc stuff sequences for first priming
  • 304/304rc stuff sequences for second priming
  • 305 fixed sequences that may contain PCR primer 1 or F if needed
  • 306 fixed sequences that may contain PCR primer 2 or R if needed.
  • the designation "rc" after a number indicates a reverse complementary sequence.
  • "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
  • FIG. 4 shows an exemplary method of preparing the synthetic transposons of FIG. 2H and FIG. 2I having two identical molecular barcode sequences.
  • 401/401rc sequences containing transposon binding sites
  • 402/402rc sequences containing molecular barcodes
  • 403/403rc stuff sequences for first priming
  • 404/404rc stuff sequences for second priming
  • 405 fixed sequences that may contain PCR primer 1 or F
  • 406 fixed sequences that may contain PCR primer 2 or R.
  • the designation "rc" after a number indicates a reverse complementary sequence.
  • "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
  • FIG. 5 shows an exemplary method of preparing the synthetic transposons of FIG. 2J and FIG. 2K having two identical molecular barcode sequences.
  • 501/501rc sequences containing transposon binding sites
  • 502/502rc sequences containing molecular barcodes
  • 503/503rc stuff sequences
  • 504/504rc fixed sequences for 1st priming
  • 505/505rc fixed sequences with modified blocked 3'-end for 2nd priming
  • 506 fixed sequence that may contain PCR primer 1 or F
  • 507/507rc linker sequences connecting the two non-complementary regions
  • 508 fixed sequence that may contain PCR primer 2 or R
  • 509 additional sequence for flexibility of the structures.
  • "*" indicates a cleavable nucleotide or a cleavage site.
  • the 3'-end in 505rc may contain a phosphate group (P) or a reversible dideoxynucleotide.
  • the 3'-phosphoryl group can be removed by T4 polynucleotide kinase (T4 PNK) available commercially (e.g., NEB T4 PNK, catalogue # M0201L).
  • FIG. 6 shows an exemplary method of preparing the synthetic transposons of FIG. 2L and FIG. 2M.
  • 601/601rc sequence containing transposon recognition sites
  • 602/602rc sequence containing molecular barcodes
  • 603/603rc stuff sequences
  • 604/604rc fixed sequences for 1st priming
  • 605/605rc fixed sequences with blocked 3'-end for 2nd priming after being deblocked
  • 606 fixed sequence that may contain PCR primer 1 or F
  • 607 linker sequence connecting the first non-complementary region to bridge oligo (607rc+611+610rc)
  • 608 fixed sequence that may contain PCR primer 2 or R
  • 609 additional sequence to provide flexibility of the structure
  • 610 linker sequence connecting the second non-complementary region to bridge oligo
  • FIG. 7 shows an exemplary method of preparing the synthetic transposons of FIG. 2N and FIG. 2O.
  • 701/701rc sequences containing transposon binding sites
  • 702/702rc sequences containing molecular barcodes
  • 703/703rc stuff sequences for 1st priming
  • 704/704rc stuff sequences for 2nd priming
  • 705 fixed sequences that may contain PCR primer 1 or F
  • 706 fixed sequences that may contain PCR primer 2 or R.
  • the designation "rc" after a number indicates a reverse complementary sequence.
  • U is a uracil nucleotide and is used as an example for the cleavable nucleotide.
  • "*" denotes another cleavable nucleotide such as an RNA nucleotide.
  • FIG. 8A shows an exemplary synthetic transposon of FIG. 2I having two different molecular barcodes (801 and 802), and an exemplary method of preparing the synthetic transposon.
  • the synthetic transposon can be prepared using two DNA oligos having adapter sequences F and R that correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers. On each strand of the synthetic transposon, the F and R sequences are fused to each other via a uracil nucleotide.
  • FIG. 8B shows an exemplary synthetic transposon of FIG. 2E having no molecular barcodes, and an exemplary method of preparing the synthetic transposon.
  • the synthetic transposon can be prepared using two DNA oligos having adapter sequences F and R that correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers. On each strand of the synthetic transposon, the F and R sequences are fused to each other via a UU dinucleotide.
  • FIG. 8C shows an exemplary synthetic transposon fragment of FIG. 2C having a molecular barcode (803), and an exemplary method of preparing the synthetic transposon fragment.
  • the adapter sequences F and R correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers.
  • FIG. 8D shows an exemplary synthetic transposon fragment of FIG. 2D comprising a single oligonucleotide forming a hairpin structure that does not contain a molecular barcode, and an exemplary method of preparing the synthetic transposon fragment.
  • the adapter sequences F and R correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers.
  • FIG. 8E shows exemplary primers that can be used for multiplexed paired-end sequencing of libraries prepared using the synthetic transposons of FIG. 8A-8D.
  • FIG. 9 shows an exemplary method of preparing a library of template nucleic acids for sequencing using a plurality of synthetic transposons of FIG. 2I.
  • the synthetic transposons are integrated into a target DNA, followed by repair, UDG treatment, and PCR amplification with dual primers, which may contain a sequence matching F or R and any additional adapter sequences (shown as dotted lines) needed for sequencing or analysis.
  • FIG. 10 shows an exemplary method of preparing a library of template nucleic acids using a plurality of synthetic transposons of FIG. 2G.
  • the synthetic transposons are integrated into a target DNA, followed by repair, UDG treatment, and nick ligation to circularize the nucleic acid fragments, thereby allowing downstream analysis, such as rolling circle amplification (RCA) or single molecule sequencing.
  • RCA rolling circle amplification
  • FIG. 11 shows an exemplary method of preparing a library of template nucleic acids using a plurality of synthetic transposons of FIG. 2M.
  • the synthetic transposons are integrated into a target DNA, followed by repair, denaturation, and exonuclease treatment to provide circular nucleic acid fragments, which can be amplified by rolling circle amplification (RCA), or analyzed by RCA sequencing or single molecule sequencing methods.
  • 1001 molecular barcode from the synthetic transposon shown on the left
  • 1002 molecular barcode from the synthetic transposon shown on the right.
  • FIG. 12 shows an exemplary pipeline for analyzing sequencing data from Illumina reads of a sequencing library prepared using the synthetic transposons of the present application.
  • the sequencing library may be a PCR-amplified library prepared as in FIG. 9.
  • the present application discloses synthetic transposons, methods and kits for preparing sequencing libraries from a target nucleic acid, which can be analyzed using next-generation sequencing methods.
  • the synthetic transposons of the present application comprise two non-complementary regions that comprise adapter sequences, wherein the two non-complementary regions are connected to each other and are located between two stem fragments each containing a transposase recognition site. Transposition of the synthetic transposons into the target nucleic acid and subsequent steps that separate the non-complementary regions result in fragmentation of the target nucleic acid and introduction of the adapter sequences at the same time.
  • the resulting product may be sequenced directly, or amplified in a subsequent step prior to sequencing using primers that match the adapter sequences.
  • the synthetic transposons are designed to comprise a molecular barcode disposed between the transposase recognition site and the non-complementary region.
  • a plurality of synthetic transposons each having a different molecular barcode may be used to prepare a library preserving the contiguity information in the target nucleic acid through the molecular barcodes.
  • the compositions, methods, kits and analysis tools described herein are useful for many applications, including haplotyping, de novo assembly of whole genomes or long contiguous sequences, sequencing of repetitive regions, detection of structural variations and copy number variations, and methylation analysis.
  • FIG. 1 illustrates a method for preparing a sequencing library using a regular double-stranded synthetic transposon having paired barcodes and a pair of adapter sequences disposed in between the paired barcodes, such as the synthetic transposons described in US patent No. 8,829,171.
  • the synthetic transposons are integrated into the target DNA, the product of which is repaired, and subsequently PCR amplified using primers that match the adapter sequences (i.e., having the same sequences or reverse complementary sequences as the adapter sequences).
  • the synthetic transposons can be inserted in two opposite orientations, yielding three different potential configurations (Config. 1, 2, and 3 in FIG. 1) for a fragment of the target nucleic acid surrounded by a pair of synthetic transposons with respect to the orientation of the adapter sequences.
  • Config. 1 yields template 1 that can be amplified with high efficiency.
  • Templates 2 and 3 have either F primer binding sites at both ends or R primer binding sites at both ends, which leads to self-hairpin structures when the strands renature after the denaturation step.
  • target sequence fragments having configurations 2 and 3 may become missing or under-represented in the amplified library prepared using such method, leading to difficulty in linking the fragment sequences together for haplotyping purposes, or to errors in quantification of the fragments.
  • the sequencing cost could also be increased due to missing or biased amplification when using the library preparation method of FIG. 1.
  • the synthetic transposons and methods described herein solve this problem by incorporating the adapter sequences in the non-complementary regions, and introducing two adapter sequence pairs for each insertion site, thereby yielding only one fragment configuration with respect to the adapter orientations, which is amenable to PCR amplification.
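  • The orientation argument above can be checked with a small enumeration. The following is a simplified sketch under stated assumptions: each transposon is modelled as two halves, each half as a (top strand, bottom strand) pair of labels, and strand polarity is abstracted away. The labels `F`, `R` (with lowercase `f`, `r` for their complements), `A1`, `A2`, and the function names are hypothetical and are not taken from the application.

```python
from itertools import product


def flip(duplex):
    """Rotate a two-part duplex by 180 degrees.

    This swaps the two parts and swaps the top and bottom strand of each part.
    """
    first, second = duplex
    return (second[::-1], first[::-1])


def configurations(transposon):
    """Return the distinct end configurations of a fragment lying between
    two insertions of the same transposon, over all insertion orientations."""
    seen = set()
    for upstream_flipped, downstream_flipped in product((False, True), repeat=2):
        upstream = flip(transposon) if upstream_flipped else transposon
        downstream = flip(transposon) if downstream_flipped else transposon
        # The fragment keeps the right half of the upstream insertion
        # and the left half of the downstream insertion.
        ends = (upstream[1], downstream[0])
        # A fragment and its 180-degree rotation are the same molecule.
        seen.add(min(ends, flip(ends)))
    return seen


# Regular double-stranded transposon: adapter F on one side, R on the other.
conventional = (("F", "f"), ("R", "r"))
# Design described here: each non-complementary region carries both adapters,
# with the strand assignment swapped between the two regions.
described = (("A1", "A2"), ("A2", "A1"))

print(len(configurations(conventional)))
print(len(configurations(described)))
```

  • In this model the regular transposon gives three distinct configurations, while the design with swapped adapter strands is unchanged by flipping and therefore gives one.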
  • some embodiments of the synthetic transposons described herein are used to insert the adapter sequences into target nucleic acid, which is subsequently fragmented using simple denaturation or enzymatic cleavage steps that separate the non-complementary regions. The resulting fragments can be directly sequenced without further ligation to sequencing adapters.
  • Y-shaped adapters comprising sequencing adapters are ligated to fragmented nucleic acids.
  • Such methods require end-processing steps prior to the ligation, such as blunt-end polishing, or addition of T or A to the ends of the fragments.
  • end-processing steps may have varying efficiency for different end sequences, which results in biased coverage of fragments in the target nucleic acids.
  • the synthetic transposons and methods described herein overcome such challenges by introducing adapter sequences and fragmenting the target nucleic acid in a single process.
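  • Some embodiments described in this application join the strands of the two non-complementary regions via one or more cleavable nucleotides, such as a uracil nucleotide. The following string-level sketch only illustrates the idea that cleaving at such a nucleotide separates the two regions. It does not model the enzymatic chemistry, and the sequences and the function name `separate_at_cleavable` are hypothetical.

```python
from typing import List


def separate_at_cleavable(strand: str, cleavable: str = "U") -> List[str]:
    """Split a strand at every cleavable nucleotide and drop that nucleotide."""
    return [piece for piece in strand.split(cleavable) if piece]


# Hypothetical strand: one adapter-bearing region, a uracil, then the other region.
joined_strand = "ACGTAC" + "U" + "TTGCAA"
print(separate_at_cleavable(joined_strand))
```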
  • one aspect of the present application provides a synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the synthetic transposon further comprises a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region.
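  • The arrangement recited above can be summarized as a small data model. This is an illustrative sketch only: the class and field names are hypothetical, and sequences are represented by placeholder labels rather than real nucleotide sequences.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NonComplementaryRegion:
    first_strand_adapter: str
    second_strand_adapter: str


@dataclass(frozen=True)
class Fragment:
    recognition_site: str              # transposase recognition site in the stem
    region: NonComplementaryRegion
    barcode: Optional[str] = None      # optional molecular barcode in the stem


@dataclass(frozen=True)
class SyntheticTransposon:
    first: Fragment
    second: Fragment

    def has_swapped_adapters(self) -> bool:
        """True when the second region carries the same two adapters as the
        first region, but on the opposite strands."""
        a, b = self.first.region, self.second.region
        return (a.first_strand_adapter == b.second_strand_adapter
                and a.second_strand_adapter == b.first_strand_adapter)


# Placeholder labels; "ME", "A1", "A2", and "BC1" are not real sequences.
transposon = SyntheticTransposon(
    first=Fragment("ME", NonComplementaryRegion("A1", "A2"), barcode="BC1"),
    second=Fragment("ME", NonComplementaryRegion("A2", "A1"), barcode="BC1"),
)
print(transposon.has_swapped_adapters())
```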
  • compositions comprising a plurality of synthetic transposons, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complement
  • Another aspect of the present application provides a method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with any of the synthetic transposons or compositions comprising a plurality of the synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • One aspect of the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first adapter and the fourth adapter have the same sequence. In some embodiments, the second adapter and the third adapter have the same sequence. In some embodiments, the first non-complementary region and the second non-complementary region comprise different adapters.
  • the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein each of the first strand and the second strand of the first non-complementary region is connected to one strand of the first stem; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter, wherein each of the first strand and the second strand of the second non-complementary region is connected to one strand of the second stem; and wherein the first non-complementary region and the second non-complementary region are connected to each
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first molecular barcode and the second molecular barcode have the same sequence. In some embodiments, the first molecular barcode and the second molecular barcode have different sequences. In some embodiments, the first adapter and the fourth adapter have the same sequence. In some embodiments, the second adapter and the third adapter have the same sequence. In some embodiments, the first non-complementary region and the second non-complementary region comprise different adapters.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence, wherein each of the first strand and the second strand of the first non-complementary region is connected to one strand of the first stem; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence, wherein each of the first strand and the second strand of the second non-complementary region is connected to one strand of the second stem; and wherein the first non-complementary region and the second non-complementary region are connected to each
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); and wherein the second strand of the first non-complementary region is fused to the second cleavable nucleotides (such
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-strand
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-strand
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence.
  • the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein the synthetic transposon further
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the second strand of the first non-complementary region is fused to the second
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fuse
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fuse
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fuse
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • synthetic transposon fragments corresponding to the first fragment or second fragment of any one of the synthetic transposons described herein. Synthetic transposon fragments that are not connected to each other may be used for fragmenting a target nucleic acid, and to allow amplification of the fragments by PCR using primers corresponding to the first and second adapters.
  • a synthetic transposon fragment comprising a stem and a non-complementary region comprising a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a blunt end.
  • a synthetic transposon fragment comprising a stem and a non-complementary region comprising a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a hairpin structure on the end distal from the non-complementary region.
  • a synthetic transposon fragment comprising a stem, a non-complementary region, and a molecular barcode disposed between the stem and the non-complementary region, wherein the non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a blunt end.
  • a synthetic transposon fragment comprising a stem, a non-complementary region, and a molecular barcode disposed between the stem and the non-complementary region, wherein the non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a hairpin structure on the end distal from the non-complementary region.
  • compositions comprising any one of the synthetic transposons described herein.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
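  • By way of illustration only, the mirrored arrangement of adapters recited above (the first non-complementary region carrying the first adapter on its first strand and the second adapter on its second strand, and the second non-complementary region carrying them the other way around) can be modeled with a small data structure. The class names, function names, and the placeholder labels "ME", "A1", and "A2" are hypothetical and are not sequences or identifiers taken from this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonComplementaryRegion:
    # Adapter carried by each of the two unpaired strands of the region.
    first_strand_adapter: str
    second_strand_adapter: str


@dataclass(frozen=True)
class Fragment:
    # Transposase recognition site located in the stem of the fragment.
    recognition_site: str
    region: NonComplementaryRegion


def make_transposon(site, adapter_1, adapter_2):
    """Build the two fragments with the adapters swapped between them."""
    first = Fragment(site, NonComplementaryRegion(adapter_1, adapter_2))
    second = Fragment(site, NonComplementaryRegion(adapter_2, adapter_1))
    return first, second


def adapters_are_swapped(first, second):
    """True when the second fragment mirrors the first fragment's adapters."""
    a, b = first.region, second.region
    return (a.first_strand_adapter == b.second_strand_adapter
            and a.second_strand_adapter == b.first_strand_adapter)


# Placeholder labels only; real sequences would be substituted by the user.
first, second = make_transposon("ME", "A1", "A2")
print(adapters_are_swapped(first, second))
```

  • The sketch uses the same recognition site for both stems, which corresponds to only one of the options described; embodiments with different recognition sites would pass two site values instead.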
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complement
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiment
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the second strand of the first non-complementary region is fused
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiment
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complement
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiment
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complement
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complement
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
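  • As a non-limiting sketch of how barcode sequences comprising randomly designed nucleotides might be drawn in silico, the following example generates a pool of barcodes of a chosen length and reports the number of possible sequences for that length. The function names, the default length of 5, and the choice to place the same barcode on both stems (one of the options described above) are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def random_barcode(length, rng):
    """Draw one barcode of the given length, each position chosen uniformly."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))


def barcode_space(length):
    """Number of distinct sequences possible at this length (4 per position)."""
    return len(NUCLEOTIDES) ** length


def make_barcoded_pool(n_transposons, length=5, seed=0):
    """Return (first-stem barcode, second-stem barcode) pairs.

    Each transposon here carries the same barcode on both stems.
    Barcodes are drawn independently, so repeats across transposons
    are possible and are not filtered out.
    """
    rng = random.Random(seed)
    pool = []
    for _ in range(n_transposons):
        barcode = random_barcode(length, rng)
        pool.append((barcode, barcode))
    return pool


pool = make_barcoded_pool(10)
print(pool[0], "out of", barcode_space(5), "possible 5-mers")
```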
  • transposase can form a functional complex (i.e., a transposome) with one or more transposase recognition sites, and is capable of catalyzing a transposition reaction.
  • a complex comprising a synthetic transposon and a transposase, wherein the synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the transposase is a dimeric transposase.
  • the transposase is Tn5 transposase, such as a hyperactive Tn5 transposase, for example, EZ-Tn5™.
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence.
  • the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • a composition comprising a plurality of complexes each comprising a synthetic transposon and a transposase, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first copy of a molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and the second stem comprises a second transposase recognition site and a second copy of the molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the transposase is Tn5 transposase, such as a hyperactive Tn5 transposase, for example, EZ-Tn5™.
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • each synthetic transposon has a different barcode sequence.
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • the complexes can be prepared by mixing the plurality of synthetic transposons and the transposase.
  • the synthetic transposons and the transposase are incubated for at least about any one of 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour or more to form the complexes.
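  • Where each synthetic transposon in a plurality is intended to have a different barcode sequence and the barcodes are randomly designed, the barcode length limits how many transposons can plausibly all be distinct. The following is a minimal sketch of that arithmetic, assuming barcodes are drawn independently and uniformly from the four nucleotides; this sampling model and the function name are assumptions for illustration and are not stated in this disclosure.

```python
def prob_all_distinct(n_transposons, barcode_length):
    """Probability that n independently drawn barcodes are all different.

    Assumes each barcode is drawn uniformly from the 4**barcode_length
    possible sequences. The k-th draw must avoid the k barcodes already
    taken, giving the product of (space - k) / space for k = 0..n-1.
    """
    space = 4 ** barcode_length
    if n_transposons > space:
        # More transposons than sequences: a repeat is unavoidable.
        return 0.0
    probability = 1.0
    for already_taken in range(n_transposons):
        probability *= (space - already_taken) / space
    return probability


# Compare a few barcode lengths for a pool of 1000 transposons.
for length in (5, 10, 15):
    print(length, prob_all_distinct(1000, length))
```

  • Under this model, longer barcodes sharply reduce the chance of two transposons sharing a sequence, which is one reason a pool may use more than the minimum number of random nucleotides.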
  • the synthetic transposons described herein are nucleic acids containing two synthetic transposon fragments. Unless described otherwise, all elements of the synthetic transposons, including fragments, stems, non-complementary regions, transposase recognition sites, molecular barcodes, adaptors, strands, stuff sequences, bridge nucleic acids, etc., are nucleic acids.
  • the two synthetic transposon fragments are arranged in the same orientation with respect to each other, i.e., the fragments are connected to each other via direct or indirect interaction between the 3' end of one fragment and the 5' end of the other fragment on one strand or both strands. Each fragment may be fully double-stranded, partially single-stranded, or a hairpin. Each fragment has two ends. The two fragments are connected to each other via the non-complementary regions disposed at one end of each fragment.
  • each fragment contains a stem comprising a transposase recognition site.
  • the term "stem" as used herein refers to a nucleic acid fragment having extensive fully complementary regions. Each stem typically has two strands that can be separate from each other, or connected to each other on one end via a loop to form a hairpin structure. With the exception of stems having hairpin structures on the ends, the ends of the stems are fully complementary and double-stranded. Stems with hairpin structures on the ends have the hairpin structures connected to fully complementary and double-stranded regions.
  • the stems may have a small single-stranded region no more than about any of 20, 15, 10, or 5 nucleotides long, or have an internal non-complementary region of no more than about any of 15, 10, 8, 5, or 2 nucleotides long.
  • each stem has two nucleic acid strands that are fully complementary to each other.
  • one strand contains a single-stranded gap, for example, in the molecular barcode region.
  • One end of the stem (referred to herein as the "proximal end") is fused to the non-complementary region.
  • the other end of the stem (referred to herein as the "distal end") can be a blunt end, or a hairpin.
  • the distal end(s) of one or both stems comprise nucleotides flanking the transposase recognition sites.
  • the stem further comprises a molecular barcode placed between the transposase recognition site and the non-complementary region.
  • the stem further comprises one or more stuff sequences, which are nucleic acids having pre-determined (also referred to as "fixed") sequences. The stuff sequences may be placed between the end of the stem and the transposase recognition site, between the transposase recognition site and the molecular barcode, between the transposase recognition site and the non-complementary region, and/or between the molecular barcode and the non-complementary region.
  • the stuff sequences may provide priming sites, balance G/C contents, and/or minimize secondary structures that facilitate preparation of the synthetic transposons. Additionally, stuff sequences may be chosen to complement the molecular barcodes and the non-complementary regions to allow enough space and flexibility in the synthetic transposon to facilitate binding of the transposase to the transposase recognition sites. The stuff sequences can also facilitate data analysis steps (such as for easy alignment and clustering of sequencing reads).
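One way to think about using stuff sequences to balance G/C content can be sketched in code. This is only an illustrative sketch and not a method from the disclosure: the function names, the candidate sequences, and the 50% target are all assumptions made for the example.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G or C bases in a DNA sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def pick_balancing_stuff(region: str, candidates: list[str], target: float = 0.5) -> str:
    """Pick the candidate stuff sequence that brings region + stuff closest
    to the target G/C fraction. The 0.5 target is an assumed default."""
    return min(candidates, key=lambda s: abs(gc_fraction(region + s) - target))


# Hypothetical example: a G/C-rich region is balanced by an A/T-rich stuff sequence.
chosen = pick_balancing_stuff("GGGGCCCC", ["ATATATAT", "GCGCGCGC"])
```

A real design would also weigh priming-site requirements and secondary structure, which this sketch deliberately ignores.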
  • one or more of the 5' ends (also referred to herein as 5' termini) of the polynucleotide strands in the synthetic transposons are phosphorylated, or the 5' terminal nucleotide has a 5' phosphate group.
  • Phosphorylated 5' ends facilitate ligation to other nucleic acids, such as adapters, extended, or gap-filled nucleic acid strands (e.g., for nick-sealing).
  • the 5' terminus of the distal end of the first stem and/or the second stem is phosphorylated.
  • the first stem or the second stem comprises a single-stranded region
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region or is single-stranded
  • the 5' terminus adjacent to the single-stranded region is phosphorylated.
  • one or more of the 5' ends of the polynucleotide strands in the synthetic transposons are unphosphorylated, for example, the 5' terminal nucleotide has a 5' free hydroxyl group. Synthetic transposons having 5' hydroxyl ends may be phosphorylated in the library construction steps to enable ligation to other nucleic acids or nick-sealing.
  • the non-complementary regions of the synthetic transposons allow processing, efficient amplification, and haplotyping of a target nucleic acid inserted with the synthetic transposon.
  • Each non-complementary region comprises two non-complementary strands of nucleic acids.
  • Each of the non-complementary strands in the non-complementary region is connected to one strand of the corresponding stem region.
  • the two strands of a non-complementary region do not hybridize to each other at normal pH and ionic conditions (such as pH 7 and 150 mM salt).
  • the two strands of a non-complementary region have no more than about any of 60%, 50%, 40%, 30%, 20%, 10%, 5%, or less sequence homology.
  • the two strands of a non-complementary region have no more than about any of 5, 4, 3 or 2 consecutive nucleotides that are complementary to each other. In some embodiments, each strand of a non-complementary region does not form any significant secondary structure.
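The limit on consecutive complementary nucleotides can be checked computationally. The following is a minimal sketch, assuming both strands are written 5' to 3' with only A, C, G, and T, and that hybridization is antiparallel; the function names and the default threshold are illustrative assumptions rather than part of the disclosure.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_complementary_run(strand_a: str, strand_b: str) -> int:
    """Length of the longest stretch of strand_a that could base-pair,
    antiparallel, with a stretch of strand_b. Computed as the longest
    common substring of strand_a and the reverse complement of strand_b."""
    a = strand_a.upper()
    b = reverse_complement(strand_b)
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_non_complementary(strand_a: str, strand_b: str, max_run: int = 5) -> bool:
    """True if no complementary stretch exceeds max_run nucleotides."""
    return longest_complementary_run(strand_a, strand_b) <= max_run
```

This only screens for contiguous Watson-Crick pairing; it does not model mismatches, bulges, or the secondary structure of each strand on its own.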
  • Each strand of a non-complementary region comprises an adapter sequence (also referred to herein as an "adapter").
  • the adapter sequences serve as priming sites to allow amplification of a nucleic acid fragment inserted with the synthetic transposon.
  • the two non-complementary regions in a synthetic transposon are identical, but are placed in opposite orientations.
  • each non-complementary region comprises a first strand comprising an adapter sequence F, and a second strand comprising an adapter sequence R, and F of the first non-complementary region is connected to R of the second non-complementary region, and/or R of the first non-complementary region is connected to F of the second non-complementary region.
  • a pair of primers may be designed to comprise the sequence of F or R, or to comprise the complementary sequence of F or R for use in amplification of a nucleic acid fragment comprising the non-complementary regions inserted at both ends.
  • the adapter sequences may be of any suitable length, for example, at least about any of 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, or more nucleotides long.
  • the two non-complementary regions have different sets of adapter sequences.
  • the two non-complementary regions have the same set of adapter sequences, but comprise different stuff sequences on one or both strands.
  • Each non-complementary region may comprise two separate strands, or a single fused strand comprising the first strand and the second strand.
  • each non-complementary region is V-shaped, comprising a first strand and a second strand.
  • the first strand and the second strand of each non-complementary region are fused to each other via a single-stranded linker.
  • the first non-complementary region comprises a first strand comprising a first adapter sequence, a second strand comprising a second adapter sequence, and a first single-stranded linker disposed between the first strand and the second strand; and the second non-complementary region comprises a first strand comprising the second adapter sequence, a second strand comprising the first adapter sequence, and a second single-stranded linker disposed between the first strand and the second strand.
  • the first single-stranded linker can hybridize to the second single-stranded linker.
  • the first single-stranded linker is fully complementary to the second single-stranded linker.
  • the first single-stranded linker is complementary to the second single-stranded linker except for one or more cleavable nucleotides.
  • the clustering primer and sequencing primer sequences can be included in the non-complementary strands to allow PCR-free direct next generation sequencing.
  • the non-complementary regions are connected to each other either covalently or non-covalently (such as via hybridization of two sequences).
  • the first strand of the first non-complementary region can be fused to the first strand of the second non-complementary region.
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region.
  • the single-stranded linker of the first non-complementary region is hybridized to the single-stranded linker of the second non-complementary region.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first sequence that hybridizes to the first single-stranded linker of the first non-complementary region, and a second sequence that hybridizes to the second single-stranded linker of the second non-complementary region, thereby, the two non-complementary regions are connected to each other via the hybridization of the bridge nucleic acid to the first single-stranded linker and the second single-stranded linker.
  • the synthetic transposon comprises one or more cleavable nucleotides at the junction(s) between the first non-complementary region and the second non-complementary region, in the bridge nucleic acid, or in the single-stranded linker of the first non-complementary region and/or the second non-complementary region. Cleavage of the one or more cleavable nucleotides results in separation of the first non-complementary region from the second non-complementary region.
  • the first strand of the first non-complementary region and the first strand of the second non-complementary region are fused to each other via one or more cleavable nucleotides
  • the second strand of the first non-complementary region and the second strand of the second non-complementary region are fused to each other via one or more cleavable nucleotides.
  • the single-stranded linker comprises one or more cleavable nucleotides.
  • the bridge nucleic acid comprises one or more cleavable nucleotides in the sequences that are complementary to the single-stranded linkers.
  • the one or more cleavable nucleotides may be one or more uracil nucleotides, other modified nucleobases with specific nucleases that recognize such nucleobases (such as 8-oxoguanine), a restriction site, or RNA nucleotides wherein the synthetic transposon is a DNA transposon.
  • Uracil DNA glycosylase combined with a DNA glycosylase lyase can be used to cleave a uracil deoxyribonucleotide; and RNA nucleotides can be cleaved by an RNA endonuclease.
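The effect of cleaving at uracil nucleotides can be modeled very simply as splitting a strand wherever a U occurs. The sketch below is a simplified model under stated assumptions: the function name and the example sequence are hypothetical, the uracil is treated as removed entirely, and the chemistry of the resulting ends is ignored.

```python
def cleave_at_uracil(strand: str) -> list[str]:
    """Split a strand (written 5' to 3') at every uracil and drop the
    uracil itself. Empty pieces, e.g. between adjacent uracils, are
    discarded."""
    return [piece for piece in strand.upper().split("U") if piece]


# Hypothetical fused strand: two adapter-bearing arms joined by a UU dinucleotide.
fused = "ACGTACGT" + "UU" + "TTGACCTA"
arms = cleave_at_uracil(fused)  # the two arms, now separated
```

Splitting the fused strand into two pieces corresponds to the separation of the first non-complementary region from the second.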
  • the synthetic transposon further comprises a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region.
  • the first molecular barcode and the second molecular barcode may have the same sequence, or different sequences.
  • the first molecular barcode has the same sequence as the second molecular barcode, which allows matching of the molecular barcode sequences from sequencing reads to extract contiguity information in a target nucleic acid inserted with the synthetic transposon.
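Matching barcodes across sequenced fragments to recover contiguity can be sketched as a grouping step. This is an illustrative sketch only: the input format (pairs of fragment identifier and observed barcode), the function name, and the example data are assumptions, and sequencing errors in barcodes are not handled.

```python
from collections import defaultdict
from itertools import combinations


def link_fragments_by_barcode(observations):
    """Given (fragment_id, barcode) pairs, return the set of fragment-id
    pairs that share a barcode. Two fragments sharing a barcode are
    inferred to have flanked the same transposon insertion."""
    fragments_by_barcode = defaultdict(set)
    for fragment_id, barcode in observations:
        fragments_by_barcode[barcode].add(fragment_id)

    links = set()
    for fragments in fragments_by_barcode.values():
        for a, b in combinations(sorted(fragments), 2):
            links.add((a, b))
    return links


# Hypothetical data: each fragment carries one barcode at each end.
observations = [
    ("frag1", "ACGTA"),
    ("frag2", "ACGTA"),
    ("frag2", "TTGCA"),
    ("frag3", "TTGCA"),
]
links = link_fragments_by_barcode(observations)
```

Chaining such links (frag1 to frag2, frag2 to frag3) orders the fragments along the original target nucleic acid.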
  • Synthetic transposons having no molecular barcodes or two different molecular barcodes can be used for preparing libraries of template nucleic acids useful for a variety of sequencing applications (except for haplotyping) in the same way as synthetic transposons having molecular barcodes.
  • the molecular barcode comprises a plurality of nucleotides that are randomly or degenerately designed, thereby yielding a highly diverse sequence that can be used to identify each individual synthetic transposon, and the target nucleic acid or fragment thereof that the synthetic transposon inserts into.
  • the molecular barcode is double-stranded.
  • the molecular barcode comprises a single-stranded region, or is single-stranded.
  • the composition may comprise any number of synthetic transposons having different molecular barcodes.
  • the composition comprises a single copy of each synthetic transposon having a different molecular barcode.
  • the composition comprises more than one copy of each synthetic transposon having a different molecular barcode.
  • the plurality of synthetic transposons have at least about any one of 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12, 10^13, 10^14, 10^15, 10^16, 10^17, or more different molecular barcodes.
  • the plurality of synthetic transposons have at least about any one of 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12, 10^13, 10^14, 10^15, 10^16, 10^17, or more sources of clonal molecular barcodes.
  • the nucleotide can be a ribonucleotide, or a deoxyribonucleotide.
  • the molecular barcode can thus be used to identify a particular fragment of a target nucleic acid that the synthetic transposon carrying the molecular barcode inserts into.
  • the molecular barcode may further comprise nucleotides having the same identity for all synthetic transposons (i.e., "fixed" or specifically designed nucleotides).
  • the additional fixed nucleotides or sequences can be placed on either side of the randomly or degenerately designed sequence or interspersed among the randomly or degenerately designed nucleotides.
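A barcode design that mixes fixed positions with randomly or degenerately designed positions can be written as a pattern using the standard IUPAC nucleotide ambiguity codes. The sketch below draws one concrete barcode from such a pattern; the function name and the example pattern are hypothetical and chosen only for illustration.

```python
import random

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def sample_barcode(pattern: str, rng: random.Random) -> str:
    """Draw one concrete barcode from a pattern. Fixed letters (A, C, G, T)
    are copied as-is; degenerate letters are resolved to one allowed base."""
    return "".join(rng.choice(IUPAC[symbol]) for symbol in pattern.upper())


# Hypothetical pattern: a fixed ACGT core flanked by random positions.
rng = random.Random(0)
barcode = sample_barcode("NNNNACGTNNNN", rng)
```

In this pattern the fixed core sits between two random stretches, matching the case where fixed nucleotides are interspersed among the randomly designed ones.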
  • the molecular barcode comprises double-stranded regions. In some embodiments, the molecular barcode is double-stranded. In some embodiments, the molecular barcode is single-stranded. In some embodiments, the molecular barcode is partially single-stranded (i.e., partially double-stranded). In some embodiments, the molecular barcode has a single-stranded region having at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50 or more nucleotides.
  • the randomly and/or degenerately designed nucleotides in the molecular barcode are in a single-stranded region of the molecular barcode.
  • the double-stranded region of the at least partially single-stranded molecular barcode comprises fixed nucleotides.
  • the double-stranded region of the at least partially single-stranded molecular barcode consists essentially of fixed nucleotides.
  • the molecular barcode comprises at least about any one of 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100 or more consecutive nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40 or more randomly designed nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40 or more degenerately designed nucleotides.
  • the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40 or more fixed (i.e., specifically designed) nucleotides.
  • the molecular barcode is a mixture of randomly designed, degenerately designed or fixed nucleotides. The number of randomly and/or degenerately designed nucleotides in the molecular barcode depends on the actual need.
  • a long target nucleic acid (such as chromosome) may need a plurality of synthetic transposons with higher diversity, i.e., a large number of randomly and/or degenerately designed nucleotides, to provide enough distinct molecular barcodes to tag the large number of segments of the target nucleic acid in order to extract contiguity information.
  • a short target nucleic acid such as a plasmid of a few kilobases long, may only need a small number of randomly and/or degenerately designed nucleotides to provide enough distinct molecular barcodes for tagging.
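The relationship between the number of random positions and the available barcode diversity follows from simple counting: n fully random positions give 4^n possible sequences. A minimal sketch of the sizing calculation is shown below; the function name is hypothetical, and the calculation gives only a lower bound, since a practical design would likely want far more possible barcodes than insertions to keep accidental duplicates rare.

```python
def min_random_positions(required_barcodes: int) -> int:
    """Smallest number n of fully random (N) positions such that
    4**n >= required_barcodes."""
    n = 0
    capacity = 1
    while capacity < required_barcodes:
        capacity *= 4
        n += 1
    return n


# A small target needing about 10**4 distinct barcodes versus a large
# target needing about 10**9.
small = min_random_positions(10**4)
large = min_random_positions(10**9)
```

For 10^4 barcodes the bound is 7 positions (4^7 = 16,384), and for 10^9 barcodes it is 15 positions (4^15 = 1,073,741,824).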
  • duplicated sequences endogenous to the target nucleic acid flanking the insertion sites of the synthetic transposons may be used in combination with the molecular barcodes in the synthetic transposons to provide contiguity information for the target nucleic acids. Having both randomly designed and specific nucleotides may minimize potential undesired non-specific interactions during the process of synthesizing the synthetic transposons.
  • Exemplary synthetic transposons and fragments are shown in FIGs. 2A-2O.
  • FIGs. 2A-2D show exemplary synthetic transposon fragments each comprising a single transposon recognition site.
  • FIG. 2E shows an exemplary synthetic transposon with no molecular barcodes.
  • FIGs. 2F-2O show exemplary synthetic transposons having two molecular barcodes, and various structures for the non-complementary regions and distal ends of the stems.
  • any of the exemplary synthetic transposons of FIGs. 2F-2O can be modified by replacing the molecular barcodes with stuff sequences or other sequences needed to make corresponding exemplary synthetic transposons that do not have molecular barcodes.
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having no molecular barcodes.
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a single-stranded linker disposed between the first strand and the second strand, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having no molecular barcodes.
  • a first synthetic transposon fragment of FIG. 2B having a first single-stranded linker and a second synthetic transposon fragment of FIG. 2B having a second single-stranded linker that can hybridize to the first single-stranded linker can be mixed together to provide a synthetic transposon.
  • a first synthetic transposon fragment having a first single-stranded linker and a second synthetic transposon fragment having a second single-stranded linker can be mixed together with a bridge nucleic acid that can hybridize to both the first single-stranded linker and the second single-stranded linker to provide a synthetic transposon. If the transposon fragments in Fig.
  • stem-loop common sequences e.g., containing sequencing primer
  • stem-loop common sequences e.g., containing sequencing primer
  • repairing e.g., extension and ligation
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site, a molecular barcode, and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having two molecular barcodes.
  • FIG. 8C shows an exemplary synthetic transposon fragment of FIG. 2C.
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site, a molecular barcode, and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a single-stranded linker disposed between the first strand and the second strand, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having two molecular barcodes.
  • a first synthetic transposon fragment of FIG. 2B having a first single-stranded linker and a second synthetic transposon fragment of FIG. 2B having a second single-stranded linker that can hybridize to the first single-stranded linker can be mixed together to provide a synthetic transposon of FIG. 2K.
  • a first synthetic transposon fragment having a first single-stranded linker and a second synthetic transposon fragment having a second single-stranded linker can be mixed together with a bridge nucleic acid that can hybridize to both the first single-stranded linker and the second single-stranded linker to provide a synthetic transposon of FIG. 2M.
  • FIG. 8D shows an exemplary synthetic transposon fragment of FIG. 2D.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleav
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • FIG. 8B shows an exemplary synthetic transposon of FIG. 2E.
  • FIG. 8E shows primers that can be used to amplify nucleic acid fragments obtained from insertion of a plurality of the synthetic transposon in a target nucleic acid followed by enzymatic cleavage of the UU dinucleotide that separates the two non-complementary regions in each synthetic transposon.
  • the primers in FIG. 8E contain sequences from sequencing primers of the Illumina sequencing platform that allow direct sequencing of the amplified nucleic acid fragments on an Illumina instrument. Randomly designed index tag sequences can be included in one primer to serve as a sample barcode, which allows multiple samples to be sequenced at the same time and subsequently de-multiplexed during data analysis.
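De-multiplexing by sample barcode can be sketched as a lookup on the index tag of each read. The sketch below is a simplification under stated assumptions: the index is modeled as a fixed-length prefix of the read (on a real instrument the index is typically reported separately), exact matching is used with no error tolerance, and the function name, index sequences, and sample names are all hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample, index_length):
    """Assign each read to a sample using its leading index tag.

    reads: iterable of read sequences whose first index_length bases
           are the sample index (a modeling assumption).
    index_to_sample: dict mapping index sequence to sample name.
    Reads with an unknown index go to the "undetermined" bin.
    """
    reads_by_sample = defaultdict(list)
    for read in reads:
        index, body = read[:index_length], read[index_length:]
        sample = index_to_sample.get(index, "undetermined")
        reads_by_sample[sample].append(body)
    return dict(reads_by_sample)


# Hypothetical two-sample run with 4-base index tags.
index_to_sample = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCGGG", "GGGGTTTTTT"]
result = demultiplex(reads, index_to_sample, index_length=4)
```

After this step each sample's reads can be analyzed independently, for example by grouping them on their molecular barcodes.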
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; and where
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; and
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • FIG. 8A shows an exemplary synthetic transposon of FIG. 2I having two non-identical short molecular barcodes.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end adjacent to the gap left by the missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal end of the first stem is a blunt end, and the distal end
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end adjacent to the gap left by the missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal end of the first stem is a blunt end, and the distal end of
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • the synthetic transposons provided herein can be prepared by a variety of methods.
  • the synthetic transposons are prepared by direct synthesis, including chemical synthesis. Such methods are well known in the art, e.g., solid phase synthesis using phosphoramidite precursors such as those derived from protected 2'-deoxynucleosides, ribonucleosides, or nucleoside analogues.
  • Synthetic transposons comprising modified nucleotides may also be chemically synthesized by including modified nucleotide building blocks in the oligo synthesis steps.
  • an unmodified synthetic transposon may first be synthesized, and the 5-methyl group may be added to the target dC nucleobase using a CpG methyltransferase.
  • Long oligos of up to 180-250 nucleotides (nt), as required in this application, can be obtained commercially from multiple sources such as IDT (ultramers of up to 200 nt), Sigma-Aldrich (up to 180 nt), or Biosynthesis (Ubermers of up to 250 nt regularly, and possibly as long as 400 nt).
  • Incorporation of modified bases such as LNA or PNA in some common sequences allows the use of shorter sequences with the same binding stability.
  • Modified bases such as uracil can be incorporated easily to allow the cleavage of the strand before library amplification. Incorporation of phosphorothioate bonds, for example, can help to minimize degradation of transposons by exonucleases or endonucleases during their storage.
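As a rough illustration of how an incorporated uracil lets a strand be cut before library amplification, the Python sketch below models a strand as a string written 5' to 3' and splits it at each uracil. The example sequence and the function name `cleave_at_uracil` are hypothetical, and the cleavage chemistry itself (for example, the end groups left behind) is not modeled.

```python
def cleave_at_uracil(strand):
    """Split a strand (written 5'->3') at every uracil, dropping the uracil itself."""
    return [piece for piece in strand.split("U") if piece]


# Hypothetical strand: two placeholder segments joined by a single uracil.
linked = "ACGTTGCA" + "U" + "GGATCCAA"
pieces = cleave_at_uracil(linked)
print(pieces)  # ['ACGTTGCA', 'GGATCCAA']
```

A strand without uracil is returned as a single piece, which mirrors the idea that only strands carrying the cleavable nucleotide are separated.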
  • the synthetic transposons are prepared by annealing two oligos, which are then subjected to extension by polymerases to provide the full product.
  • Synthetic transposons having no molecular barcodes or having two different molecular barcodes can be prepared by such methods.
  • FIG. 8A shows a method of preparing an exemplary synthetic transposon of FIG. 2I having two different molecular barcodes.
  • FIG. 8B shows a method of preparing an exemplary synthetic transposon of FIG. 2E having no molecular barcodes.
  • Synthetic transposons with one or two hairpin structures can be conveniently prepared using a single long strand of oligonucleotide with complementary regions that hybridize to provide the synthetic transposons.
  • the synthetic transposons are PCR amplified with common primers, such as primers that hybridize to the stuff sequences to prepare the synthetic transposons.
  • the synthetic transposons are prepared by linking the non-complementary regions of two synthetic transposon fragments.
  • the synthetic transposon fragment is prepared by chemical synthesis.
  • the synthetic transposon fragment is prepared by extending chemically synthesized
  • FIG. 8C shows a method of preparing an exemplary synthetic transposon fragment of FIG. 2C having a molecular barcode.
  • FIG. 8D shows a method of preparing an exemplary synthetic transposon fragment of FIG. 2B having no molecular barcode.
  • Synthetic transposons having two molecular barcodes with the same sequences comprising randomly or degenerately designed nucleotides are prepared using a combination of chemical synthesis and extension by polymerase (also referred to as "primer extension") to obtain double-stranded molecular barcodes, and to ensure that the two molecular barcodes have the same sequences.
  • the synthetic transposons having identical paired molecular barcodes are prepared using starting oligos containing only one molecular barcode, followed by a first intramolecular or intermolecular priming to replicate the molecular barcode.
  • a second intramolecular or intermolecular priming is used to displace the replicated molecular barcode sequence.
  • FIGs. 3-7 illustrate exemplary methods for preparing various synthetic transposons having two identical molecular barcodes that contain randomly or degenerately designed nucleotides.
  • a first synthesized oligo (5'-301+302+303+304+305+U+306+303rc-3') is provided, which comprises a single-stranded molecular barcode region (302) having randomly or degenerately designed nucleotides.
  • the first synthesized oligo is extended by a DNA polymerase, which is then denatured and hybridized to a second synthesized oligo comprising the adapter sequence 306 and the complementary sequence of stuff sequence 304 (i.e. 304rc).
  • the hybridized oligos are then extended by a DNA polymerase to make the first fragment of the synthetic transposon, which is subsequently annealed to a second synthesized oligo comprising the adapter sequence 305 and the stuff sequence 303, and a third synthesized oligo comprising the transposase recognition sequence 301, to provide a synthetic transposon of FIG. 2F.
  • Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (302rc) provide a synthetic transposon of FIG. 2G.
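The segment notation used above for the first synthesized oligo (5'-301+302+303+304+305+U+306+303rc-3') can be read as a concatenation of named segments, where the suffix "rc" denotes a reverse complement and "U" a single uracil. The following Python sketch assembles such an oligo from that notation. All segment sequences, the 12-nt barcode length, and the helper names are hypothetical placeholders chosen for illustration; they are not sequences disclosed in this application.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def random_barcode(length, rng):
    """Draw a random barcode, standing in for degenerate (N) positions."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def assemble_oligo(layout, segments):
    """Concatenate segments named in a '+'-separated layout string."""
    parts = []
    for name in layout.split("+"):
        if name == "U":
            parts.append("U")  # single cleavable uracil
        elif name.endswith("rc"):
            parts.append(reverse_complement(segments[name[:-2]]))
        else:
            parts.append(segments[name])
    return "".join(parts)


rng = random.Random(0)
segments = {  # placeholder sequences only
    "301": "GATTACAGATTACAGATC",     # transposase recognition site
    "302": random_barcode(12, rng),  # molecular barcode
    "303": "CCTGAAGTCA",             # stuff sequence
    "304": "TTGACCGTAA",             # stuff sequence
    "305": "ACACGACGCTCTTCC",        # adapter sequence
    "306": "GTGACTGGAGTTCAG",        # adapter sequence
}
oligo = assemble_oligo("301+302+303+304+305+U+306+303rc", segments)
print(len(oligo), oligo)
```

Because the 3'-terminal segment is the reverse complement of 303, the 3' end of such an oligo can in principle fold back onto segment 303, which is consistent with the intramolecular priming described above.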
  • a first synthesized oligo (5'-401+402+403+404+405+U+406+403rc-3') is provided, which comprises a single-stranded molecular barcode region (402) having randomly or degenerately designed nucleotides.
  • the first synthesized oligo is extended by a DNA polymerase, which is then denatured and hybridized to a second synthesized oligo comprising the complementary sequence of stuff sequence 404 (i.e., 404rc), adapter sequence 406, a uracil nucleotide, adapter sequence 405, and stuff sequence 403.
  • the hybridized oligos are then extended by a DNA polymerase to make the first fragment of the synthetic transposon connected to the second non-complementary region, which is subsequently annealed to a second synthesized oligo comprising the transposase recognition sequence 401, to provide a synthetic transposon of FIG. 2H. Further extension and ligation steps to fill in the gap with a
  • a first synthesized oligo (5'-501+502+503+504+505+506+507+508+505rc-3') and a second synthesized oligo (5'-503+506+507+508+509+504rc-3') are provided, which are hybridized and extended by a DNA polymerase.
  • the 3' end of the first synthesized oligo is a reversibly blocked nucleotide.
  • the 3' end of the first synthesized oligo may first comprise a 3' phosphate group, which blocks extension of this 3' end during the first extension step.
  • Prior to the second extension step, the 3' phosphate group can be removed by T4 polynucleotide kinase (T4 PNK) treatment, thereby allowing extension from this 3' end.
  • a further round of extension followed by hybridization to a third synthesized oligo comprising the transposase recognition sequence 501 provides a synthetic transposon of FIG. 2J.
  • Single-stranded linker sequences 507 and 507rc each have one or more cleavable nucleotides.
  • a first synthesized oligo (5'-601+602+603+604+605+606+607+608+605rc-3'), a second synthesized oligo (5'-603+606+609+608+610+604rc-3'), and a third synthesized oligo (5'-607rc+611+609rc, i.e., a bridge nucleic acid) are provided, which are denatured and hybridized, and then extended by a DNA polymerase.
  • the 3' end of the first synthesized oligo is a reversibly blocked nucleotide.
  • the 3' end of the first synthesized oligo may first comprise a 3' phosphate group, which blocks extension of this 3' end during the first extension step.
  • Prior to the second extension step, the 3' phosphate group can be removed by T4 polynucleotide kinase (T4 PNK) treatment, thereby allowing extension from this 3' end.
  • a further round of extension followed by hybridization to a fourth synthesized oligo comprising the transposase recognition sequence 601 provides a synthetic transposon of FIG. 2L.
  • the bridge nucleic acid may contain one or more cleavable nucleotides in the 607rc and 609rc fragments.
  • the hairpin fragment 707 and the transposase recognition site 701 each have one or more cleavable nucleotides.
  • the oligo is denatured, hybridized, and extended by DNA polymerase.
  • the one or more cleavable nucleotides in 707 and 701 are then cleaved, and the product is denatured and hybridized to a second synthesized oligo (5'-705-703-U-706-704rc-3').
  • the duplex is then extended by DNA polymerase to provide a synthetic transposon of FIG. 2N. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (702rc) provide a synthetic transposon of FIG. 2O.
  • One aspect of the present application provides a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with any one of the compositions comprising a plurality of synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
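To give a feel for what an insertion frequency of about once per 500 bases means for fragment sizes, the Python sketch below places insertion sites uniformly at random along a target and measures the distances between neighboring sites. The uniform-random placement, the 100,000-base target length, the fixed random seed, and the function name are all assumptions made for illustration; actual transposase insertion is only substantially random and is not modeled here.

```python
import random


def simulate_fragments(target_length, mean_spacing, rng):
    """Return fragment lengths after cutting at randomly placed insertion sites."""
    n_insertions = target_length // mean_spacing
    # Distinct interior positions, so every fragment has a positive length.
    sites = sorted(rng.sample(range(1, target_length), n_insertions))
    bounds = [0] + sites + [target_length]
    return [right - left for left, right in zip(bounds, bounds[1:])]


rng = random.Random(42)
fragments = simulate_fragments(100_000, 500, rng)
print("fragments:", len(fragments))
print("mean length:", sum(fragments) / len(fragments))
print("shortest / longest:", min(fragments), max(fragments))
```

With 200 insertions in 100,000 bases the mean fragment length is necessarily close to 500 bases, but individual fragments vary widely around that mean under this model.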
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with any one of the compositions comprising a plurality of synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) amplifying the repaired target nucleic acid to provide the library of template nucleic acids.
  • the amplifying is Whole Genome Amplification (WGA). In some embodiments, the amplifying is targeted amplification of loci of interest.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a
  • the first single-stranded linker and the second single-stranded linker each comprises one or more cleavable nucleotides.
  • the method further comprises contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a
  • the method further comprises treating the denatured repaired target nucleic acid with an exonuclease.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non
  • the first single-stranded linker and the second single-stranded linker each comprises one or more cleavable nucleotides.
  • the method further comprises contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a barcoded target nucleic acid comprising a plurality of synthetic transposons inserted randomly or substantially randomly among the endogenous sequence of the barcoded target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand compris
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • each synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • FIGs. 9-11 show exemplary methods of preparing libraries of template nucleic acids using the synthetic transposons described herein. Additionally, synthetic transposons that do not have molecular barcodes (e.g., FIG. 2E and FIG. 8B) can be used for fragmentation and library construction. The intramolecular or intermolecular binding between the two transposase recognition sites and transposase in a transposed target nucleic acid allows the stable
  • a composition comprising a plurality of synthetic transposons of FIG. 2I each having a different barcode sequence is contacted with a target DNA and a transposase.
  • the plurality of synthetic transposons are inserted into the target DNA, resulting in single-stranded gaps surrounding the insertion sites.
  • the inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks.
  • the repaired target DNA is treated with UDG (such as USER™) to cleave the uracil nucleotide, yielding fragmented template nucleic acids, which are PCR amplified with primers having adapter sequences F and R, or their reverse complements.
  • PCR amplification leads to two products that have different read orientations during sequencing.
  • Additional adapter sequences such as sequencing primer sequences, and/or index tags, may be introduced to each amplified nucleic acid by including the adapter sequences and index tags in the PCR primers.
  • the amplified nucleic acid library can then be sequenced using any suitable massively parallel shotgun sequencing method (such as next generation sequencing, or NGS method).
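One way sequenced fragments might be ordered after such a library is sequenced is sketched below in Python. It assumes that each inserted transposon carries the same barcode on both of its stems (the identical paired barcodes described above), so that the two fragments flanking one insertion share a barcode at their adjoining ends. The tuple layout, the barcode labels, and the function name `chain_fragments` are hypothetical; read orientation, sequencing errors, and barcode collisions are ignored.

```python
def chain_fragments(fragments):
    """Order fragments by matching each right-end barcode to a left-end barcode.

    fragments: list of (left_barcode, fragment_id, right_barcode) tuples, with
    None marking an end that was not created by a transposon insertion.
    """
    by_left = {left: (frag_id, right) for left, frag_id, right in fragments}
    frag_id, right = by_left[None]  # start at the fragment with no left barcode
    order = [frag_id]
    while right is not None and right in by_left:
        frag_id, right = by_left[right]
        order.append(frag_id)
    return order


# Hypothetical fragments listed out of order, as reads would be.
reads = [
    ("BC2", "frag_C", None),
    (None, "frag_A", "BC1"),
    ("BC1", "frag_B", "BC2"),
]
print(chain_fragments(reads))  # ['frag_A', 'frag_B', 'frag_C']
```

The dictionary lookup keeps the chaining linear in the number of fragments, which is adequate for a sketch but would need to tolerate missing or duplicated barcodes in real data.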
  • WGA can be performed using either random hexamers or sequences complementary to F and/or R.
  • When WGA is used in the library preparation method, separation of the non-complementary regions is not a required step, and thus synthetic transposons that do not have modified nucleotide(s) linking the two non-complementary regions can be used.
  • a composition comprising a plurality of synthetic transposons of FIG. 2G each having a different barcode sequence is contacted with a target DNA and a transposase.
  • the plurality of synthetic transposons are inserted into the target DNA, resulting in single-stranded gaps surrounding the insertion sites.
  • the inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks.
  • the repaired target DNA is treated with UDG (such as USER™) to cleave the uracil nucleotide, yielding fragmented template nucleic acids.
  • the fragmented template nucleic acids are then hybridized to an oligonucleotide comprising a first sequence that is complementary to the first adapter sequence F and a second sequence that is complementary to the second adapter sequence R.
  • the hybridized fragments are then treated with ligase to circularize the fragmented template nucleic acids.
  • the circularized template nucleic acids can be further analyzed by RCA, or by single-molecule sequencing.
  • a composition comprising a plurality of synthetic transposons of FIG. 2M each having a different barcode sequence is contacted with a target DNA and a transposase, resulting in single-stranded gaps surrounding the insertion sites.
  • the inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks.
  • the repaired target DNA is then denatured, and subsequently treated with exonuclease to remove the bridge nucleic acids, thereby yielding a library of circularized template nucleic acids.
  • the library of circularized template nucleic acids can then be analyzed by RCA or single-molecule sequencing.
  • the plurality of synthetic transposons can be inserted into target nucleic acids by the transposase that binds to the transposase recognition sites of the synthetic transposons.
  • the plurality of synthetic transposons and the transposase may be pre-mixed to form a complex composition comprising a plurality of complexes each comprising a transposase bound to a synthetic transposon prior to contacting the complex composition with the target nucleic acid.
  • the plurality of synthetic transposons and the transposase are contacted with the target nucleic acids simultaneously, but as separate compositions.
  • synthetic transposons with molecular barcodes having high diversity comprising more than about any one of 5, 10, 15, 20, 25, or more randomly and/or degenerately designed nucleotides are used to ensure that each insertion site in the target nucleic acid has a different molecular barcode.
  • an excess amount of synthetic transposons is contacted with the target nucleic acid to ensure unique labeling of the sites in the target nucleic acid.
  • no more than about any one of 50%, 40%, 30%, 20%, 10%, 5%, 2%, 1%, 0.1%, 0.01%, 0.001%, 0.0001% or less of possible synthetic transposons with distinct molecular barcodes are inserted into the target nucleic acid.
  • 100 cells of human genomic DNA (about 0.6 ng) have a total of 300 × 10⁹ base pairs.
  • synthetic transposons each having a molecular barcode comprising 25 randomly designed nucleotides at an average of 150-bp distance
  • 2 × 10⁹ synthetic transposons are inserted out of 10¹⁵ possible distinct synthetic transposons available.
  • transposase duplicated sequences e.g., 9-nt duplicate sequence of Tn5 transposase
  • the molecular barcode sequences it would be easy to differentiate and align sequencing reads derived from neighboring fragments in a single target molecule.
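The figures quoted above can be checked with a short calculation. The following is a minimal sketch, not part of the described method: it takes the 300 × 10⁹ base pairs, 150-bp average spacing, and 25-nucleotide random barcode from the text, and assumes (as a simplification) that barcodes are drawn uniformly and independently from all 4²⁵ possible sequences. The function name `barcode_stats` is hypothetical.

```python
def barcode_stats(total_bp, spacing_bp, barcode_len):
    """Estimate insertion count, barcode diversity, and expected barcode reuse.

    Assumes each insertion draws its barcode uniformly at random from
    4**barcode_len sequences (an idealization of a random barcode pool).
    """
    insertions = total_bp / spacing_bp          # one insertion per spacing_bp
    diversity = 4 ** barcode_len                # number of distinct barcodes
    fraction_used = insertions / diversity      # share of the barcode space used
    # Birthday-style approximation: expected number of insertion pairs
    # that happen to carry the same barcode.
    expected_shared_pairs = insertions * (insertions - 1) / (2 * diversity)
    return insertions, diversity, fraction_used, expected_shared_pairs


if __name__ == "__main__":
    n, d, frac, pairs = barcode_stats(300e9, 150, 25)
    print(f"insertions:            {n:.3g}")
    print(f"possible barcodes:     {d:.3g}")
    print(f"fraction of space used: {frac:.3g}")
    print(f"expected shared pairs: {pairs:.0f}")
```

Under these assumptions the calculation reproduces the 2 × 10⁹ insertions and roughly 10¹⁵ possible barcodes stated above, and it suggests that only a very small fraction of insertions would be expected to share a barcode by chance.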
  • the term "at least a portion” or grammatical equivalents thereof can refer to any fraction of a whole amount.
  • “at least a portion” can refer to at least about any one of 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, 99.9% or 100% of a whole amount.
  • at least about any one of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99% or more of the plurality of synthetic transposons is inserted in the target nucleic acid.
  • the frequency (i.e., density) of the synthetic transposons inserted in the target nucleic acid can be controlled in various ways, including adjusting the contacting time and temperature, the amount of synthetic transposons, the type and amount of the transposase, and the composition of the buffer.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about any one of 10 kb, 5 kb, 4 kb, 3 kb, 2 kb, 1 kb, 900 bases, 800 bases, 700 bases, 600 bases, 500 bases, 400 bases, 300 bases, 250 bases, 200 bases, 150 bases, 100 bases, or fewer.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of once per any one of about 100 bases to about 200 bases, about 150 bases to about 250 bases, about 250 bases to about 500 bases, about 500 bases to about 750 bases, about 750 bases to about 1 kb, about 1 kb to about 5 kb, about 5 kb to about 10 kb, about 100 bases to about 1 kb, or about 100 bases to about 10 kb.
  • synthetic transposons described herein may be particularly useful and effective for preparing sequencing libraries for whole genome sequencing requiring high quality (for example, error rate lower than about 1 in 10⁶ bases), targeted capture sequencing, or microbiome sequencing in a clinical setting.
  • the target nucleic acid can include any nucleic acid of interest.
  • Target nucleic acids can include DNA, RNA, peptide nucleic acid, morpholino nucleic acid, locked nucleic acid, glycol nucleic acid, threose nucleic acid, mixtures thereof, and hybrids thereof.
  • the target nucleic acid is genomic DNA, such as a whole genome, part of a genome (e.g., individual chromosomes or fragments thereof), or mixed genomes (e.g., a microbiome). Intact chromosomes in live cells or isolated intact chromosomes can be used to achieve the longest possible contigs for any given species.
  • the target nucleic acid is mitochondrial DNA.
  • the target nucleic acid is chloroplast DNA.
  • the target nucleic acid is cDNA, synthetic or modified DNA after certain chemical or enzymatic treatments, including bisulfite treatment (e.g., for CpG methylation detection).
  • the target nucleic acid can be of any length.
  • the synthetic transposons and the methods described herein are particularly useful for preparing barcoded libraries to be sequenced and assembled to analyze long, contiguous target nucleic acids having a length of at least about any one of 10 kb, 20 kb, 50 kb, 100 kb, 200 kb, 500 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 20 Mb, 50 Mb, 100 Mb, 200 Mb, or more.
  • the target nucleic acid can comprise any nucleotide sequences. In some embodiments, the target nucleic acid comprises homopolymer sequences.
  • the target nucleic acid can also include repeat sequences.
  • Repeat sequences can be any of a variety of lengths including, for example, at least about any one of 2, 5, 10, 20, 30, 40, 50, 100, 250, 500, 1000 nucleotides or more. Repeat sequences can be repeated, either contiguously or non-contiguously, any of a variety of times including, for example, at least about any one of 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20 times or more.
  • the plurality of synthetic transposons is inserted in a single target nucleic acid.
  • the plurality of synthetic transposons is inserted in a plurality of target nucleic acids.
  • a plurality of target nucleic acids can include a plurality of the same target nucleic acids, a plurality of different target nucleic acids wherein some target nucleic acids are the same, or a plurality of target nucleic acids wherein all target nucleic acids are different.
  • Embodiments that involve a plurality of target nucleic acids can be carried out in multiplex formats such that reagents can be delivered simultaneously to the target nucleic acids, for example, in one or more compartments or on an array surface.
  • the plurality of target nucleic acids can include substantially all of a particular organism's genome.
  • the plurality of target nucleic acids can include at least a portion of a particular organism's genome, including, for example, at least about any one of 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
  • the portion can have an upper limit that is at most about any one of 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
  • Target nucleic acids can be obtained from any source.
  • target nucleic acids may be prepared from nucleic acid molecules obtained from a single organism or from populations of nucleic acid molecules obtained from natural sources that include one or more organisms.
  • Sources of nucleic acid molecules include, but are not limited to, organelles, cells, tissues, organs, or organisms.
  • Cells that may be used as sources of target nucleic acid molecules may be prokaryotic (bacterial cells, for example, Escherichia, Bacillus, Serratia, Salmonella, Staphylococcus, Streptococcus, Clostridium, Chlamydia, Neisseria, Treponema, Mycoplasma, Borrelia, Legionella, Pseudomonas, Mycobacterium, Helicobacter, Erwinia, Agrobacterium, Rhizobium, and Streptomyces genera); archaeal, such as crenarchaeota, nanoarchaeota or euryarchaeota; or eukaryotic, such as fungi (for example, yeasts), plants, protozoans and other parasites, and animals (including insects (for example, Drosophila spp.), nematodes (for example, Caenorhabditis elegans), and mammals (for example, rat, mouse, monkey, non
  • the target nucleic acid has damaged or modified bases before, during and after preparation due to aging, exposure to acid, heat or radiation.
  • modifications include nicks, abasic sites, thymidine dimers, oxidized guanine and pyrimidines, and deaminated cytosines. If left untreated, these modifications could prevent further amplification and sequencing, and affect the accurate counting of the target nucleic acids and sequence quality.
  • Commercial repair kits are available, including NEB's PreCR Repair Mix (Cat# M0309S) and Sigma's Restorase (with DNA polymerase, Cat# R1028).
  • reagents often include DNA repair enzymes (such as uracil DNA glycosylase, Fpg, T4 Endonuclease V, Endonuclease IV and Endonuclease VIII), DNA polymerases, and ligases (such as Taq DNA ligase). Enzymes such as ligase can ligate nicks in double-stranded DNA.
  • a transposase (such as Tn5 transposase) binds the transposase recognition sites, makes staggered cuts at random sites in a target nucleic acid, and inserts synthetic transposons at the cut sites, resulting in a pair of single-stranded gaps of a fixed length flanking the inserted synthetic transposon sequence in the target nucleic acid.
  • the single-stranded gaps have duplicated sequences derived from the target nucleic acid.
  • the duplicated sequences are characteristic for each transposase, for example, the duplicated sequences are 9-nt long for Tn5 transposase, 5-nt long for Tn7 and Mu transposases, 4-nt long for murine leukemia virus, and 2-nt long for the Tc1/mariner family.
  • Transposition events are random or substantially random. For example, some studies show certain transposition biases (see, e.g., Green B et al, "Insertion site preference of Mu, Tn5, and Tn7 transposons" Mobile DNA 3:3, 2012).
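The relationship between the staggered cut and the duplicated flanking sequence can be illustrated with a simple string model. The sketch below is an assumption-laden simplification: it represents only one strand of the target, treats the staggered cut as spanning `dup_len` bases starting at `cut_pos`, and shows the sequence as it would read once the single-stranded gaps have been filled in. The function name and the example sequences are hypothetical.

```python
def insert_with_duplication(target, transposon, cut_pos, dup_len=9):
    """Model one insertion event on a single strand after gap filling.

    The dup_len bases of the target starting at cut_pos appear on both
    sides of the inserted transposon (9 nt by default, as stated for Tn5).
    Returns the resulting sequence and the duplicated bases.
    """
    if not 0 <= cut_pos <= len(target) - dup_len:
        raise ValueError("cut_pos leaves no room for the staggered cut")
    duplicated = target[cut_pos:cut_pos + dup_len]
    left = target[:cut_pos + dup_len]    # ends with the duplicated bases
    right = target[cut_pos:]             # starts with the duplicated bases
    return left + transposon + right, duplicated


if __name__ == "__main__":
    target = "AAAACCCCGGGGTTTTACGTACGT"
    product, dup = insert_with_duplication(target, "[TRANSPOSON]", cut_pos=4)
    print(dup)
    print(product)
```

Because the same short target-derived sequence flanks each insert, fragments produced on either side of one insertion site carry a matching end sequence in this model.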
  • the target nucleic acids inserted with the synthetic transposons can be repaired with a polymerase without strand displacement activity and a ligase in vitro to provide repaired target nucleic acids.
  • the polymerase without strand displacement activity allows gap filling of any single-stranded nucleic acid created surrounding the insertion sites (such as single-stranded gaps having duplicated sequences endogenous to the target nucleic acid).
  • the ligase allows nick sealing for nicks having a 5' phosphate.
  • the gap filling reaction catalyzed by the polymerase without strand displacement, and the ligation reaction catalyzed by the ligase can be carried out in a single step, or in separate steps comprising first contacting the target nucleic acid inserted with the synthetic transposons with the polymerase without strand displacement activity and nucleotides, followed by contacting the resulting product with the ligase.
  • Many polymerases and ligases may be suitable for this step.
  • the polymerase is T4 DNA polymerase.
  • the repaired target nucleic acid is then fragmented by separating the first non-complementary region from the second non-complementary region in each inserted synthetic transposon.
  • a suitable separation step may be chosen based on the nature of the connection between the first non-complementary region and the second non-complementary region.
  • an endonuclease, or a combination of endonuclease with lyase may be used to cleave one or more cleavable nucleotides that are used to fuse the first strands or the second strands of the non-complementary regions, to cleave one or more cleavable nucleotides in the single-stranded linkers of the non-complementary regions, or to cleave one or more cleavable nucleotides in the bridge nucleic acid.
  • the repaired target nucleic acid may be treated with a combination of UDG and DNA lyase, such as USER™, to separate the non-complementary regions.
  • the endonuclease treatment step may occur simultaneously with the repair step, or after the repair step.
  • the repaired target nucleic acid may be denatured, for example by heating, and/or contacting with a denaturing buffer (e.g., formamide), to separate the non-complementary regions.
  • the fragmented target nucleic acids are further treated with an exonuclease, such as a single-strand DNA exonuclease to remove the bridge nucleic acid, and/or other undesired single-stranded nucleic acid.
  • the repaired target nucleic acid is both contacted with an endonuclease to cleave the one or more cleavable nucleic acids and subjected to denaturing conditions to separate the non-complementary regions.
  • the nucleic acid fragments obtained after the step of separating the non-complementary regions may be used directly for single molecule sequencing, or amplified by PCR or Rolling Circle Amplification (RCA).
  • PCR can be used to amplify a small number of copies of template DNA to generate thousands to millions of copies of a particular DNA sequence. It usually requires 2 short oligos as primers (e.g., 18-36-mer) and a heat-stable DNA polymerase (e.g., Taq DNA polymerase) in the presence of dNTPs and buffer. Generally, it starts with an initial heating step (e.g.
  • PCR has many applications, including disease diagnosis and forensic identification, and many variations are available, including multiplex PCR, digital PCR, and allele-specific PCR.
  • RCA is an isothermal enzymatic process where long strand nucleic acid sequences containing multiple copies are synthesized from circular molecules of DNA or RNA, such as plasmids, bacteriophages or circular RNA genome of viroids.
  • Kits are available to use RCA technology to amplify circular nucleic acids from small or limited amount of samples in hours at a constant temperature without thermal cycling, for example, TempliPhi from GE Healthcare.
  • Some NGS platforms, such as Complete Genomics, can directly sequence RCA products.
  • the template nucleic acids that are not circular may be circularized first.
  • an oligonucleotide comprising sequences that are complementary to the first adapter sequence and the second adapter sequence may be used to anneal to the non-complementary regions after the separation step, followed by treatment with ligase to circularize the nucleic acid fragments.
  • the template nucleic acids are amplified by RCA using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • the template nucleic acids are amplified by PCR using a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
  • the template nucleic acids are amplified by PCR using a first primer having the same sequence as the first adapter sequence, and a second primer having the same sequence as the second adapter sequence.
  • the template nucleic acids are amplified by PCR using a first primer having a sequence complementary to the first adapter sequence, and a second primer having a sequence complementary to the second adapter sequence.
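The primer options above differ only in whether a primer matches an adapter sequence or its reverse complement. As a minimal sketch, the helper below computes a reverse complement for a DNA sequence; the adapter sequences shown are made-up placeholders and are not taken from this document, and the names `reverse_complement`, `ADAPTER_F`, and `ADAPTER_R` are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


# Placeholder adapter sequences, for illustration only.
ADAPTER_F = "AACCGGTTACGT"
ADAPTER_R = "GATTACAGATTC"

if __name__ == "__main__":
    # Primers having the same sequence as the adapters.
    print("same-sequence primers:  ", ADAPTER_F, ADAPTER_R)
    # Primers that are the reverse complements of the adapters.
    print("reverse-complement pair:", reverse_complement(ADAPTER_F),
          reverse_complement(ADAPTER_R))
```

Applying the function twice returns the original sequence, which is a quick sanity check when preparing primer lists.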
  • the whole-genome sequence is amplified.
  • primers that selectively hybridize to sequences of interest may be used for amplification of targeted sequences.
  • additional adapters and/or sample tags (also referred to herein as "index tags") may be included in the primers for amplification.
  • the amplification step may need long annealing/extension time to obtain products of appropriate size.
  • the method may further comprise purification step(s) to remove short, unwanted products with only the transposon sequences.
  • the method does not comprise a step of separating the non-complementary regions, and the method comprises subjecting the repaired target nucleic acid to whole genome amplification to provide the library of template nucleic acids.
  • WGA is a method for robust amplification of an entire genome, starting with a small amount of DNA, and can yield thousands- to millions-fold amplification. WGA may be especially useful for preparing a library of template nucleic acids for sequencing from a limited or precious sample, such as a single cell.
  • Exemplary techniques used for WGA include, but are not limited to, Multiple Displacement Amplification (MDA), Degenerate Oligonucleotide PCR (DOP-PCR) and Primer Extension Preamplification (PEP).
  • Exemplary commercial kits for WGA include ILLUSTRA™ Single Cell GenomiPhi DNA Amplification kit from GE Healthcare,
  • the method may comprise a dilution step to separate the nucleic acid sample, such as the target nucleic acid, the inserted target nucleic acid, the repaired target nucleic acid, or the template nucleic acids into a plurality of compartments (such as wells in a multi-well plate).
  • the nucleic acid sample is diluted into at least about any of 5, 10, 20, 50, 100, 200, 300, 500 or more compartments to allow subsequent steps in the methods, such as amplification, to be carried out within the individual compartments.
  • each compartment comprises no more than about any of 5000, 1000, 500, 200, 100, 50, 20, 10, 5, or fewer molecules.
  • Compartment tags may be introduced to the template nucleic acids in the amplification step. Samples from the compartments can be pooled together during sequencing, and the sequencing reads may be de-multiplexed using the compartment tags. The dilution may facilitate mapping of sequencing reads to individual target nucleic acids or segments thereof.
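De-multiplexing pooled reads by compartment tag can be sketched as a simple grouping step. The example below assumes, purely for illustration, that each read begins with a fixed-length compartment tag followed by the insert sequence; the actual tag position and length depend on the primer design and sequencing platform and are not specified here. The function name `demultiplex` is hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_len):
    """Group reads by compartment tag.

    Assumes the first tag_len bases of every read are the compartment tag
    (an illustrative layout); the remainder is kept as the insert sequence.
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        bins[tag].append(insert)
    return dict(bins)


if __name__ == "__main__":
    pooled = ["AAACGT", "CCCTTG", "AAAGGA"]
    for tag, inserts in demultiplex(pooled, tag_len=3).items():
        print(tag, inserts)
```

A production pipeline would typically also tolerate sequencing errors in the tag, for example by matching against a known tag list within a small edit distance; that step is omitted here.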
  • the present application further provides methods of analyzing a target nucleic acid by sequencing libraries of template nucleic acids prepared using any of the methods described above.
  • a method of analyzing a target nucleic acid, or sequencing a target nucleic acid comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using any one of the methods described in the "Methods of library preparation" section; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
  • each synthetic transposon comprises a different barcode sequence
  • the method further comprises assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the molecular barcodes of the synthetic transposons in the template nucleic acids.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
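Assembly based on molecular barcodes can be pictured as chaining fragments whose ends carry matching barcodes. The sketch below is a toy model under stated assumptions: every fragment is already oriented the same way, each fragment is summarized as a tuple of (left barcode, sequence, right barcode), the two halves of one inserted transposon carry the same barcode, and `None` marks an end of the original target molecule. The function name `chain_fragments` is hypothetical, and real data would additionally require handling of read orientation, sequencing errors, and missing fragments.

```python
def chain_fragments(fragments):
    """Order fragments so each right barcode matches the next left barcode.

    fragments: list of (left_barcode, sequence, right_barcode) tuples.
    Returns a list of contigs, each a list of sequences in order.
    """
    by_left = {f[0]: f for f in fragments if f[0] is not None}
    right_barcodes = {f[2] for f in fragments if f[2] is not None}
    # A chain starts at a fragment whose left end has no matching partner.
    starts = [f for f in fragments
              if f[0] is None or f[0] not in right_barcodes]
    contigs = []
    for frag in starts:
        contig, seen = [], set()
        while frag is not None and id(frag) not in seen:
            seen.add(id(frag))
            contig.append(frag[1])
            frag = by_left.get(frag[2])   # follow the shared barcode
        contigs.append(contig)
    return contigs


if __name__ == "__main__":
    frags = [
        ("b2", "GGG", "b3"),
        (None, "AAA", "b1"),
        ("b1", "CCC", "b2"),
        ("b3", "TTT", None),
    ]
    print(chain_fragments(frags))
```

In this model, each resulting chain corresponds to one original target molecule, so the number of chains gives a count of molecules represented in the input.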
  • a method of analyzing a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other; (b) contacting the inserted target nucleic acid with a polymerase, nu
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of analyzing a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapt
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of analyzing a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapt
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the library of template nucleic acids prepared using the methods described in the "Methods of library preparation" section can be sequenced directly or subject to any one or more of library construction steps known in the art, including, but not limited to, end repair, ligation to adapters, amplification, and sample tag addition.
  • the library construction method comprises an exome capture step.
  • the processes described herein can be used in conjunction with a variety of sequencing techniques and platforms.
  • the process to determine the nucleotide sequence of a target nucleic acid can be an automated process.
  • the sequencing is next generation sequencing.
  • the sequencing method is a massively parallel shotgun sequencing method.
  • the sequencing method yields short sequencing reads, such as sequencing reads of no more than about any one of 500 bases, 400 bases, 300 bases, 250 bases, 200 bases, 150 bases, 100 bases, or fewer.
  • Exemplary sequencing platforms include, but are not limited to, Roche 454 platforms, Illumina HISEQ™, MISEQ™, and NEXTSEQ™ platforms, Life Technologies SOLID™ platforms, ION
  • Some embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release.” Analytical
  • cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in U.S. Pat. No. 7,427,673, U.S. Pat. No. 7,414,116 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference.
  • Solexa now Illumina Inc.
  • WO 07/123,744 filed in the United States patent and trademark Office as U.S. Ser. No.
  • Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate short oligonucleotides and identify the incorporation of such short oligonucleotides.
  • Example SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
  • Some embodiments can include techniques such as next-next technologies.
  • One example can include nanopore sequencing techniques (Deamer, D. W. & Akeson, M.
  • the nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin.
  • each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
  • nanopore sequencing techniques can be useful to confirm sequence information generated by the methods described herein.
  • Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity.
  • Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference in its entirety) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No.
  • the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P.
  • SMRT real-time
  • a SMRT chip comprises a plurality of zero-mode waveguides (ZMW).
  • Each ZMW comprises a cylindrical hole tens of nanometers in diameter perforating a thin metal film supported by a transparent substrate.
  • attenuated light may penetrate the lower 20-30 nm of each ZMW creating a detection volume of about 1×10⁻²¹ L. Smaller detection volumes increase the sensitivity of detecting fluorescent signals by reducing the amount of background that can be observed.
  • SMRT chips and similar technology can be used in association with nucleotide monomers fluorescently labeled on the terminal phosphate of the nucleotide (Korlach J. et al., "Long, processive enzymatic DNA synthesis using 100% dye-labeled terminal phosphate-linked nucleotides." Nucleosides, Nucleotides and Nucleic Acids, 27:1072-1083, 2008; incorporated by reference in its entirety).
  • the label is cleaved from the nucleotide monomer on incorporation of the nucleotide into the polynucleotide. Accordingly, the label is not incorporated into the polynucleotide, increasing the signal:background ratio. Moreover, the need for conditions to cleave a label from a labeled nucleotide monomer is reduced.
  • a sequencing platform that may be used in association with some of the embodiments described herein is provided by Helicos Biosciences Corp.
  • true single molecule sequencing can be utilized (Harris T. D. et al., "Single Molecule DNA Sequencing of a Viral Genome," Science 320:106-109 (2008), incorporated by reference in its entirety).
  • a library of target nucleic acids can be prepared by the addition of a 3' poly(A) tail to each target nucleic acid. The poly(A) tail hybridizes to poly(T) oligonucleotides anchored on a glass cover slip.
  • the poly(T) oligonucleotide can be used as a primer for the extension of a polynucleotide complementary to the target nucleic acid.
  • fluorescently-labeled nucleotide monomer namely, A, C, G, or T
  • Incorporation of a labeled nucleotide into the polynucleotide complementary to the target nucleic acid is detected, and the position of the fluorescent signal on the glass cover slip indicates the molecule that has been extended.
  • the fluorescent label is removed before the next nucleotide is added to continue the sequencing cycle. Tracking nucleotide incorporation in each polynucleotide strand can provide sequence information for each individual target nucleic acid.

Analysis
  • Sequencing reads can be analyzed with various methods.
  • an automated process such as computer software, is used to analyze the sequencing reads to provide a contiguous sequence of the target nucleic acid.
  • Analysis software can be developed from scratch or based on current bioinformatics tools to include molecular barcode identification and clustering algorithms described herein for sequence assembly (de novo or using a reference).
  • Data analysis of the sequencing reads includes at least the following three steps: 1) find identical (or near-identical, for example, differing by 1 base to accommodate sequencing error) molecular barcodes, optionally together with the surrounding transposase recognition sites and other stuff sequences, and combine reads with the same barcode into one molecule (error correction); 2) use the molecular barcodes, with the help of the sequences duplicated during transposition, to link molecules together with haplotype information (mBC-assisted contig assembly); and 3) use the actual target sequences, especially variants, to help confirm the assembly of the molecules (validation).
  • Such a process can remove polymerase extension errors or recombination introduced during amplification and sequencing. Additionally, the process allows absolute molecule counting. Furthermore, cross-contamination from one sample to another sample in the lab can be removed or reduced by using the molecular barcodes.
  • FIG. 12 shows an exemplary data analysis pipeline.
  • high quality pair-end sequencing reads are used.
  • the sequencing data are first de-multiplexed into separate sample folders.
  • reads with near identical molecular barcodes and target sequence similarity in a sample are clustered into individual, original target nucleic acid molecules.
  • At least two pair-end reads are required to form a cluster representing a single molecule; reads that fail to cluster are mostly singletons. Reads per molecule can also be calculated.
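The clustering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline of FIG. 12: reads are assumed to be already reduced to (barcode, sequence) pairs, near-identical barcodes are grouped with a simple greedy seed-based rule (Hamming distance of at most 1), and the two-read minimum is applied. The function names and the greedy strategy are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Count mismatched positions; barcodes of unequal length never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatch=1, min_reads=2):
    """Group (barcode, sequence) pairs by near-identical barcode.

    The most abundant barcodes become cluster seeds; a rarer barcode joins
    the first seed within `max_mismatch` mismatches (an assumed heuristic).
    Returns (molecules, leftovers): clusters with at least `min_reads`
    reads, and clusters that fall below that minimum.
    """
    reads = list(reads)
    counts = Counter(bc for bc, _ in reads)
    seeds, seed_of = [], {}
    for bc in sorted(counts, key=lambda b: (-counts[b], b)):
        match = next((s for s in seeds if hamming(bc, s) <= max_mismatch), None)
        if match is None:
            seeds.append(bc)
            match = bc
        seed_of[bc] = match

    clusters = defaultdict(list)
    for bc, seq in reads:
        clusters[seed_of[bc]].append(seq)

    molecules = {bc: s for bc, s in clusters.items() if len(s) >= min_reads}
    leftovers = {bc: s for bc, s in clusters.items() if len(s) < min_reads}
    return molecules, leftovers
```

With this sketch, `len(molecules)` gives a molecule count for the sample and `len(molecules[bc])` gives the reads per molecule for a given barcode.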
  • molecular barcodes and duplicated sequences generated during transposition e.g.
  • Gaps or outliers in the sequences may be present and may limit the contig size. Gaps are due to several factors. First, with bias and randomness in transposition, there may be long sequences between two transposition sites.
  • the middle regions of some long sequences may not be sequenced, or the whole long fragments may be missed, especially on sequencing platforms producing short read lengths.
  • some fragments may be missed if not all are sampled.
  • the quality of the starting nucleic acids, including fragmentation, base modification, and nicks that are not repaired during the library construction process, can lead to gaps in sequences. For example, even with high efficiency, the gap-filling extension or nick ligation may not be 100% efficient during library preparation, and fragments with gaps may be missed in sequencing.
  • the factors above contribute to incomplete sequences. However, with multiple cells or genome input molecules (for example, 50 equivalent genomes), such problems are significantly reduced, as a long sequence gap in one genome can be covered by another molecule. With more sequencing coverage, fewer gaps will be present. Longer sequencing reads will also help.
  • the frequency of transposition can also be increased to reduce large gaps. If multiple cells are used for sequencing, it is possible to have some cell-to-cell differences in the sequences, which allows analysis of sequence variation at the single-cell level, although this is limited by the contig size that can be achieved.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence.
  • step (ii) comprises aligning sequencing reads having the same molecular barcodes in the synthetic transposons and the same duplicated sequences of the single-stranded gaps to provide aligned sequencing reads, and/or step (iii) comprises clustering the sequencing reads based on the molecular barcodes in the synthetic transposons and the duplicated sequences of the single-stranded gaps.
  • step (iii) comprises deriving a contig from the clustered sequencing reads and removing the sequences of the synthetic transposons (and if applicable, one copy of the duplicated sequences of the single-stranded gaps) from the contig to provide the contiguous sequence.
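The clean-up in step (iii) can be illustrated with a short Python sketch. It assumes a contig in which each inserted transposon is still present as one continuous element, bounded on the top strand by the reverse complement of the 19-bp Tn5 mosaic end on the left and by the mosaic end itself on the right. The mosaic end sequence used below is the commonly cited one and is not given in this section, so treat it, the function names, and the regular-expression approach as assumptions rather than the method of the disclosure.

```python
import re

# Commonly cited 19-bp Tn5 mosaic end (assumption; not stated in this section).
ME = "AGATGTGTATAAGAGACAG"


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def strip_transposons(contig, dup_len=9):
    """Remove inserted transposons and one copy of each flanking duplication.

    An insert is taken to be the shortest span from revcomp(ME) to ME.
    If the `dup_len` bases after the insert repeat the bases just before
    it, the second copy is dropped so the target sequence is restored.
    """
    pattern = re.compile(re.escape(revcomp(ME)) + ".*?" + re.escape(ME))
    pieces, pos = [], 0
    for m in pattern.finditer(contig):
        left = contig[pos:m.start()]
        right_dup = contig[m.end():m.end() + dup_len]
        duplicated = len(right_dup) == dup_len and left.endswith(right_dup)
        pieces.append(left)
        pos = m.end() + (dup_len if duplicated else 0)
    pieces.append(contig[pos:])
    return "".join(pieces)
```

The sketch does not handle two insertions that fall within one duplication length of each other; a fuller implementation would also use the molecular barcodes inside each insert.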
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the sequencing reads are assembled to provide a contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same first molecular barcode and the same second molecular barcode; (iii) determining a consensus sequence for each group of aligned sequencing reads; (iv) linking the consensus sequences together based on the molecular barcodes in the synthetic transposons to provide a contig; and (v) removing the sequences of the synthetic transposons (and if applicable, one copy of the duplicated sequences of the single-stranded gaps) from the contig.
  • step (ii) comprises aligning sequencing reads having the same first molecular barcodes, the same second molecular barcodes, and the same duplicated sequences of the single-stranded gaps; and/or step (iv) comprises linking the consensus sequences together based on the molecular barcodes in the synthetic transposons and the duplicated sequences of the single-stranded gaps to provide the contig.
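Linking consensus sequences through the duplicated gap sequences can be sketched as follows. This Python example is a simplification under stated assumptions: transposon sequences have already been removed from each consensus fragment, every fragment begins and ends with the duplicated sequence it shares with its neighbor, and each duplicated sequence is unique within the molecule. In practice the molecular barcodes would resolve duplicated sequences that occur more than once; the function name is hypothetical.

```python
def link_fragments(fragments, dup_len=9):
    """Chain fragments whose end tag equals another fragment's start tag.

    Each junction keeps a single copy of the shared `dup_len`-base tag.
    Raises ValueError when the starting fragment cannot be identified.
    """
    by_start = {f[:dup_len]: f for f in fragments}
    end_tags = {f[-dup_len:] for f in fragments}

    # The first fragment is the one whose start tag ends no other fragment.
    heads = [f for f in fragments if f[:dup_len] not in end_tags]
    if len(heads) != 1:
        raise ValueError("cannot identify a unique starting fragment")

    contig, used = heads[0], 1
    while used < len(fragments) and contig[-dup_len:] in by_start:
        nxt = by_start[contig[-dup_len:]]
        contig += nxt[dup_len:]  # skip the shared tag already in the contig
        used += 1
    return contig
```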
  • a consensus sequence is determined for each group having at least three aligned sequencing reads.
  • a mismatch nucleotide in a group of aligned sequencing reads is considered to be an amplification or sequencing error if no more than 1/3 of the aligned sequencing reads in the group have the mismatch nucleotide.
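The consensus rule can be written as a short Python sketch. It assumes the reads in a group are already aligned and trimmed to equal length, requires at least three reads, and treats a minority base as an error when it appears in no more than one third of the reads. Reporting `N` at a position where a minority base exceeds that fraction is an assumption of this example, since the text does not say how such positions are handled.

```python
from collections import Counter


def consensus(aligned_reads, min_reads=3):
    """Column-wise consensus of equal-length aligned reads.

    Returns None when the group has fewer than `min_reads` reads.
    A minority base seen in at most 1/3 of the reads is treated as an
    amplification or sequencing error; otherwise the position is 'N'.
    """
    n = len(aligned_reads)
    if n < min_reads:
        return None
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("reads must be aligned to the same length")

    called = []
    for column in zip(*aligned_reads):
        counts = Counter(column)
        major, _ = counts.most_common(1)[0]
        # Integer comparison avoids floating-point issues with 1/3.
        minority_ok = all(3 * c <= n for b, c in counts.items() if b != major)
        called.append(major if minority_ok else "N")
    return "".join(called)
```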
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the sequencing data with the base calls and sample tag information are analyzed through a special pipeline to allow de-multiplexing of samples followed by clustering, error correction and assembly. Sequences of the transposase recognition sites can be used to identify the location of the synthetic transposons in the sequencing reads. In the cases of Tn5 synthetic transposons, a total of 38-bp Tn5 recognition sequences (2×19-bp,
  • the stuff sequences in the synthetic transposons, or the fixed nucleotides in the molecular barcode sequences can also serve as additional known bases for identification of the synthetic transposons among the sequencing reads.
  • the distinct molecular barcode sequence between the transposase recognition sequences in a synthetic transposon can serve as exogenous tags.
  • the duplicate gap sequences can serve as endogenous tags.
  • Tn5 generates 9-bp duplicated sequences (4⁹, or >2×10⁵, combinations) flanking the insertion sites, which provides information on the distinct positions of insertion.
  • the duplicated gap sequence can provide additional insertion-specific information for mapping sequencing reads comprising the synthetic transposons to the original location in the target nucleic acid molecule.
  • Tn5 synthetic transposons having 20 randomly designed nucleotides in the molecular barcodes, a total of greater than 2×10¹⁷ combinations of different sequences can theoretically be used for tagging and extracting contiguity information in a target nucleic acid. This large diversity of molecular barcodes allows the inserted sequences to be different in all positions.
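The tag diversity quoted above can be checked with simple arithmetic. The short Python sketch below computes the number of possible 9-bp duplicated sequences and 20-nt random barcodes. Reading the figure of more than 2×10¹⁷ as the product of the two (4²⁹) is an interpretation made here, not something the text states explicitly.

```python
DUP_LEN = 9       # Tn5 duplicated gap sequence length, from the text
BARCODE_LEN = 20  # randomly designed barcode nucleotides, from the text

endogenous = 4 ** DUP_LEN                 # 262144, about 2.6e5
exogenous = 4 ** BARCODE_LEN              # 1099511627776, about 1.1e12
combined = 4 ** (DUP_LEN + BARCODE_LEN)   # 288230376151711744, about 2.9e17

print(f"endogenous tags: {endogenous:.2e}")
print(f"exogenous tags:  {exogenous:.2e}")
print(f"combined tags:   {combined:.2e}")
```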
  • each combination of exogenous and optionally endogenous tag sequences uniquely identifies the surrounding sequences from the target nucleic acid.
  • the distinct molecular barcodes and the duplicate gap sequences from target nucleic acids on one or both ends of the synthetic transposon can serve as unique identifiers to cluster sequencing reads with the same molecular barcode and duplicated gap sequence.
  • Amplification or sequencing errors are corrected and amplification bias is eliminated in the clustering process.
  • Such methods can be particularly useful for assembling repetitive sequence regions, such as Alu repeats, so that the contiguity of the repetitive sequences can be resolved. Insertion of the synthetic transposons can break the repetitiveness of many sequences, thereby allowing better amplification and sequencing of sequences that are difficult to amplify or sequence. Consensus sequences derived from the clustered reads are then assembled together to obtain a phased uninterrupted sequence for the target nucleic acid.
  • the synthetic transposons can be identified using the 2 transposase recognition sequences (2×19-bp for Tn5 transposase recognition sites).
  • the randomly designed sequences in the molecular barcodes (exogenous tags) and/or the duplicate gap sequences flanking the synthetic transposon insertion position (endogenous tags; e.g., 9-nt for Tn5 transposase, which yields 4⁹ possible sequences) can be used to trace back the original position of the insertion site in the target nucleic acid and count the original target nucleic acid once for each cluster of reads mapping to the same original target nucleic acid.
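Extracting the exogenous and endogenous tags from a read can be sketched as below. The read layout is an assumption made for illustration: a 20-nt molecular barcode lies immediately upstream of the 19-bp Tn5 recognition sequence, and the first 9 target bases after the recognition sequence are the duplicated gap sequence. The recognition sequence shown is the commonly cited Tn5 mosaic end, which this section does not spell out, and the function name is hypothetical.

```python
# Commonly cited 19-bp Tn5 mosaic end (assumption; not stated in this section).
ME = "AGATGTGTATAAGAGACAG"


def extract_tags(read, barcode_len=20, dup_len=9):
    """Return (barcode, duplicated_tag, target) or None if the read is unusable.

    Assumed layout: ...[barcode][ME][duplicated tag + rest of target]
    """
    i = read.find(ME)
    if i < barcode_len:  # ME missing (i == -1) or barcode truncated
        return None
    target = read[i + len(ME):]
    if len(target) < dup_len:
        return None
    barcode = read[i - barcode_len:i]
    return barcode, target[:dup_len], target
```

Reads sharing the same `(barcode, duplicated_tag)` pair would then be treated as copies of one insertion event and counted once.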
  • the overlapped sequences among different clustered reads should be the same except for errors from amplification, and/or sequencing, and/or analysis steps. Therefore, a contig representing the error-corrected consensus sequence can be obtained from the sequencing reads clustered based on the sequences of the synthetic transposons and/or the duplicated gap sequences.
  • the library preparation, sequencing, and/or analysis methods described herein may further be supplemented by additional steps and measures in order to obtain high quality, complete sequences in a cost-effective way.
  • the target nucleic acids can be repaired before, during, and/or after transposition; transposition frequency may be increased to minimize the length of sequences between two inserted transposons; loss of nucleic acids may be minimized during processing, for example, by using single-tube processing methods, avoiding purification steps, and/or directly lysing cells to provide target nucleic acids; cluster generation for Illumina sequencing can be optimized to allow pair-end sequencing of long templates; the number of cells for each experiment can be optimized; high quality reference sequences can be used; and internal standards may be used for sequencing.
  • the methods of analyzing or sequencing a target nucleic acid as described above can be used in a variety of applications, including, but not limited to, high quality sequencing, haplotyping, de novo sequencing, resequencing (such as mutation and cancer sequencing, disease diagnosis, forensic applications, and aging analysis), single-cell sequencing, sequencing of genetically engineered species (such as plants), sequencing of highly repetitive regions, pseudogenes and structurally difficult sequences, metagenomics sequencing, structural variation detection, copy number measurement, methylation analysis, and genetic linkage analysis for identification of genes involved in disease etiology.
  • the methods have reduced amplification and sequencing errors, and reduced contamination, such as from products of previous experiments.
  • a method of haplotyping a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of assembly (such as de novo assembly, resequencing, or metagenomic sequencing) of a target nucleic acid (such as genomic DNA, mitochondrial DNA, or microbial DNA), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method determines sequences of the target nucleic acids at single cell level.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the methods of assembly disclosed herein may be used to generate reference genome sequences for human or other species of interest using multiple platforms or replicates with extremely low error rates (e.g., with lower than about 1/10, 1/100, 1/1000, or 1/10,000 the error rate of current reference genome sequences).
  • the reference genomes can then be used to speed up the assembly process for new sequences from individuals in a species.
  • a 370-bp segment in the 5' untranslated region of murine gene Foxd3 is resistant to amplification, sequencing and cloning (Nelms BL and Labosky PA, A predicted hairpin cluster correlates with barriers to PCR, sequencing and possibly BAC recombineering. Scientific Reports 1:106, 2011).
  • the random insertion of the synthetic transposons can help to reduce difficulty in sequencing due to repetitive or hairpin cluster sequences.
  • a method of sequencing repetitive regions in a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of detecting a mutation comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the
  • the molecular barcode is double-stranded.
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of detecting a structural variation in a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
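  • As a hedged illustration of assembly steps (i) to (iii) above, the following Python sketch orders the fragments of one target molecule by the barcodes they share. It assumes, as a simplification, that each inserted transposon leaves the same barcode on the two fragments flanking it and that every barcode is unique. The `Fragment` tuple and the `chain_by_barcode` function are hypothetical names used only for illustration, and the sketch works on already-extracted barcodes rather than on raw sequencing reads.

```python
from typing import List, Optional, Tuple

# (left barcode, target-derived sequence, right barcode); None marks an end
# of the original molecule where no transposon was inserted (assumption).
Fragment = Tuple[Optional[str], str, Optional[str]]


def chain_by_barcode(fragments: List[Fragment]) -> List[str]:
    """Order fragments so that each right barcode matches the next left barcode."""
    by_left = {left: (seq, right) for left, seq, right in fragments}
    right_barcodes = {right for _, _, right in fragments if right is not None}

    # The chain starts at the fragment whose left barcode is not the
    # right barcode of any other fragment.
    starts = [left for left in by_left if left not in right_barcodes]
    if len(starts) != 1:
        raise ValueError("expected exactly one chain start")

    ordered: List[str] = []
    seen = set()
    current = starts[0]
    while current in by_left and current not in seen:
        seen.add(current)
        seq, right = by_left[current]
        ordered.append(seq)
        if right is None:  # reached the other end of the molecule
            break
        current = right
    return ordered


example = [
    ("BC2", "CCC", "BC3"),
    (None, "AAA", "BC1"),
    ("BC1", "GGG", "BC2"),
    ("BC3", "TTT", None),
]
print(chain_by_barcode(example))
```

  • In this toy input the fragments arrive in arbitrary order, and the shared barcodes alone are enough to restore their order along the original molecule.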
  • a method of detecting a copy number variation in a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the method further comprises capturing or enhancing the target nucleic acid or barcoded target nucleic acid, such as by using probes that hybridize to the target nucleic acid or barcoded target nucleic acid.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
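  • The idea of counting one copy of the target nucleic acid for all sequencing reads assembled to the same contiguous sequence can be sketched as follows. This is a minimal illustration, assuming each read has already been assigned to a locus and to an assembled molecule; the function name `copies_per_locus` and the `(locus, molecule_id)` input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def copies_per_locus(reads: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct assembled molecules per locus instead of raw reads."""
    molecules: Dict[str, Set[str]] = defaultdict(set)
    for locus, molecule_id in reads:
        # Reads from the same assembled molecule collapse into one set entry,
        # so amplification duplicates do not inflate the count.
        molecules[locus].add(molecule_id)
    return {locus: len(ids) for locus, ids in molecules.items()}


reads = [("geneA", "m1"), ("geneA", "m1"), ("geneA", "m2"), ("geneB", "m3")]
print(copies_per_locus(reads))
```

  • Counting molecules rather than reads is the design choice here: two reads that assemble to the same barcoded contiguous sequence contribute a single copy.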
  • DNA methylation is a widespread epigenetic modification that plays a pivotal role in the regulation of the genomes of diverse organisms.
  • the most prevalent and widely studied form of DNA methylation in mammalian genomes occurs at the 5 carbon position of cytosine residues, usually in the context of the CpG dinucleotide.
  • Microarrays, and more recently massively parallel sequencing, have enabled the interrogation of cytosine methylation (5mC) on a genome-wide scale (Zilberman and Henikoff 2007).
  • Methods of whole genome bisulfite sequencing that can be used to detect 5mC have been described (e.g., Cokus et al. 2008; Lister et al. 2009; Harris et al. 2010).
  • Treatment of genomic DNA with sodium bisulfite chemically deaminates cytosines much more rapidly than 5mC, preferentially converting them to uracils (Clark et al. 1994).
  • Combined with massively parallel sequencing, these conversions can be detected on a genome-wide scale at single base-pair resolution.
  • Any of the known whole genome bisulfite sequencing workflows can be applied to genomic DNA samples barcoded with the synthetic transposons of the present application to provide methods of methylation analysis with high accuracy and efficiency.
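  • The logic of bisulfite-based detection described above can be illustrated in silico. The sketch below is a simplified model, not a published workflow: it treats a single strand only, assumes complete conversion of unmethylated cytosines, and writes the resulting uracil as `T` because uracil is read as thymine after amplification and sequencing. The function names are hypothetical.

```python
from typing import Dict, Set


def bisulfite_convert(seq: str, methylated_positions: Set[int]) -> str:
    """Convert unmethylated C to T; methylated C (5mC) is left unchanged."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference: str, converted_read: str) -> Dict[int, bool]:
    """For each reference C, report True if the read still shows C (methylated)."""
    calls: Dict[int, bool] = {}
    for i, (ref, obs) in enumerate(zip(reference, converted_read)):
        if ref != "C":
            continue
        if obs == "C":
            calls[i] = True
        elif obs == "T":
            calls[i] = False
        # Any other base is left uncalled (sequencing error or variant).
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

  • Only the cytosine marked as methylated survives conversion, so comparing the converted read against the reference recovers the methylation state at each cytosine.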
  • a method of analyzing methylation status of a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • chromosome conformation capture techniques (see, for example, Barutcus AR et al., J. Cell Physiol. 231:31-35, 2016), such as 3C, circularized 3C (i.e., 4C), carbon-copy 3C (i.e., 5C), or chromatin immunoprecipitation-based methods (such as ChIP-loop), and genome conformation capture techniques may be combined with any one of the methods of inserting synthetic transposons described herein to assess chromosome interactions.
  • Chromatin immunoprecipitation methods can be used to isolate protein-DNA complexes (such as chromatin-DNA complexes), which can then be barcoded with the synthetic transposons of the present application and sequenced to determine the locations in the genome with which the proteins (such as histones) are associated.
  • a method of analyzing conformation of a chromosome comprising: (a) crosslinking the chromosome in vivo (such as within a cell); (b) isolating the crosslinked chromosome; (c) fragmenting (such as mechanically or enzymatically) the crosslinked chromosome to provide crosslinked chromosomal fragments; (d) ligating the ends of the crosslinked chromosomal fragments to provide ligated fragments; (e) reversing the ligated fragments to provide target nucleic acids; (f) contacting the target nucleic acids with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • any of the methods and applications described above can be used for diagnosing a disease or a condition in an individual based on the sequence, contiguity information (such as haplotype or 3-dimensional chromosome conformation), and/or quantity of a target nucleic acid in the individual.
  • the target nucleic acid may be present in a sample obtained from the individual, including, but not limited to, a biopsy sample, a buccal swab, a blood sample, or a sample of another bodily fluid.
  • the target nucleic acid of the individual is compared to a reference from a healthy individual to provide the diagnosis.
  • a method of diagnosing a disease or a condition of an individual based on status of a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising
  • the diagnosis comprises mutations, such as structural variations or copy number variations in a diseased tissue (such as tumor).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • Some embodiments described herein comprise comparing the contiguous sequence of the target nucleic acid in a sample to a reference sequence, the copy number of the target nucleic acid in a sample to a reference value, and/or comparing the contiguous sequence and/or copy number of the target nucleic acid of one sample to that of a reference sample.
  • the reference sequence and reference values may be obtained from a database.
  • the reference sample may be a sample from a healthy or wildtype individual, tissue, or cell. For example, in some
  • the target nucleic acid from a tumor cell of an individual is analyzed and compared to the nucleic acid from a healthy cell of the same individual to provide a diagnosis.
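  • Comparing the copy number of a target nucleic acid in a sample to a reference value can be sketched as a simple ratio test. This is an illustrative sketch only: the function name `compare_to_reference`, the per-locus count dictionaries, and the `tolerance` threshold of 0.25 are assumptions chosen for the example and are not values taken from the text.

```python
from typing import Dict


def compare_to_reference(
    sample_counts: Dict[str, int],
    reference_counts: Dict[str, int],
    tolerance: float = 0.25,  # arbitrary illustrative threshold (assumption)
) -> Dict[str, str]:
    """Label each reference locus as 'gain', 'loss', or 'neutral' in the sample."""
    calls: Dict[str, str] = {}
    for locus, ref in reference_counts.items():
        if ref <= 0:
            continue  # cannot form a ratio against an empty reference
        ratio = sample_counts.get(locus, 0) / ref
        if ratio > 1 + tolerance:
            calls[locus] = "gain"
        elif ratio < 1 - tolerance:
            calls[locus] = "loss"
        else:
            calls[locus] = "neutral"
    return calls


tumor = {"locusA": 4, "locusB": 2, "locusC": 1}
healthy = {"locusA": 2, "locusB": 2, "locusC": 2}
print(compare_to_reference(tumor, healthy))
```

  • The reference counts could come from a database or from a healthy cell of the same individual, as described above; a practical analysis would also need normalization for sequencing depth, which this sketch omits.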
  • kits and articles of manufacture comprising a plurality of any of the synthetic transposons described herein, and for methods of library preparation, analyzing target nucleic acids, or various applications described herein.
  • kits for preparing a library of template nucleic acids comprising: (a) a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c)
  • the kit further comprises a polymerase without strand displacement activity, such as T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the kit further comprises nucleotides (such as dNTPs). In some embodiments, the transposase is Tn5 transposase, including a modified Tn5 transposase with enhanced activity, such as EZ-Tn5™. In some embodiments, wherein the first non-complementary region and/or the second non-complementary region comprises one or more cleavable nucleotides, the kit further comprises an endonuclease, such as UDG, for example, USER™.
  • each synthetic transposon further comprises a bridge nucleic acid
  • the kit further comprises a single-strand exonuclease.
  • the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • a kit for preparing a library of template nucleic acids comprising: (a) a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the
  • the kit further comprises a polymerase without strand displacement activity, such as T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the kit further comprises nucleotides (such as dNTPs). In some embodiments, the transposase is Tn5 transposase, including a modified Tn5 transposase with enhanced activity, such as EZ-Tn5™. In some embodiments, wherein the first non-complementary region and/or the second non-complementary region comprises one or more cleavable nucleotides, the kit further comprises an endonuclease, such as UDG, for example, USER™.
  • each synthetic transposon further comprises a bridge nucleic acid
  • the kit further comprises a single-strand exonuclease.
  • the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • kits may contain one or more additional components, such as containers, buffers, reagents, cofactors, or additional agents, such as a denaturing agent.
  • the kit components may be packaged together and the package may contain or be accompanied by instructions for using the kit.
  • Embodiment 1 A synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • Embodiment 2 The synthetic transposon of embodiment 1, wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides.
  • Embodiment 3 The synthetic transposon of embodiment 1 or embodiment 2, wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides.
  • Embodiment 4 The synthetic transposon of embodiment 1, wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other.
  • Embodiment 5 The synthetic transposon of embodiment 4, wherein the first single-stranded linker and the second single-stranded linker hybridize to each other.
  • Embodiment 6 The synthetic transposon of embodiment 5, wherein each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide.
  • Embodiment 7 The synthetic transposon of embodiment 4, further comprising a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • Embodiment 8 The synthetic transposon of any one of embodiments 2, 3, and 6, wherein the cleavable nucleotide is a uracil nucleotide.
  • Embodiment 9 The synthetic transposon of any one of embodiments 1-8, wherein the first stem or the second stem comprises a terminal hairpin structure.
  • Embodiment 10 The synthetic transposon of any one of embodiments 1-8, wherein the first stem and the second stem comprise blunt ends.
  • Embodiment 11 The synthetic transposon of any one of embodiments 1-10, wherein the synthetic transposon is a DNA transposon.
  • Embodiment 12 The synthetic transposon of any one of embodiments 1-11, wherein the synthetic transposon comprises one or more modified nucleotides.
  • Embodiment 13 The synthetic transposon of any one of embodiments 1-12, wherein the first transposase recognition site and the second transposase recognition site have the same sequence.
  • Embodiment 14 The synthetic transposon of any one of embodiments 1-12, wherein the first transposase recognition site and the second transposase recognition site have different sequences.
  • Embodiment 15 The synthetic transposon of any one of embodiments 1-14, wherein the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • Embodiment 16 The synthetic transposon of any one of embodiments 1-15, further comprising a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region.
  • Embodiment 17 The synthetic transposon of embodiment 16, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • Embodiment 18 The synthetic transposon of embodiment 16 or embodiment 17, wherein the first molecular barcode and the second molecular barcode are double-stranded.
  • Embodiment 19 The synthetic transposon of embodiment 16 or embodiment 17, wherein the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • Embodiment 20 The synthetic transposon of embodiment 19, wherein the 5' terminus adjacent to the single-stranded region is phosphorylated.
  • Embodiment 21 A composition comprising a plurality of synthetic transposons of any one of embodiments 1-20.
  • Embodiment 22 The composition of embodiment 21, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence.
  • Embodiment 23 The composition of embodiment 22, wherein the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
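  • To give a sense of scale for embodiments 22 and 23, the sketch below computes how many distinct barcodes a stretch of random nucleotides can encode and estimates the chance that at least two transposons in a pool share a barcode. It assumes each barcode position is drawn uniformly and independently from the four bases and uses the standard birthday-problem approximation; the function names are hypothetical and none of the pool sizes are taken from the text.

```python
import math


def barcode_space(n_random_bases: int) -> int:
    """Number of distinct barcodes for n fully random positions (A, C, G, T)."""
    return 4 ** n_random_bases


def collision_probability(n_transposons: int, n_random_bases: int) -> float:
    """Approximate probability that at least two transposons share a barcode.

    Birthday approximation: 1 - exp(-k(k-1) / (2N)), assuming uniform,
    independent barcodes (an assumption of this sketch).
    """
    space = barcode_space(n_random_bases)
    exponent = n_transposons * (n_transposons - 1) / (2.0 * space)
    return -math.expm1(-exponent)  # numerically stable form of 1 - exp(-x)


print(barcode_space(5))
print(collision_probability(100, 5))
print(collision_probability(100, 20))
```

  • Five random nucleotides give 4^5 = 1024 possible barcodes, so longer barcodes are needed whenever the number of transposons inserted into one sample is large relative to that space.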
  • Embodiment 24 A method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with the composition of any one of embodiments 21-23, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • Embodiment 25 The method of embodiment 24, wherein step (c) comprises treating the repaired target nucleic acid with an endonuclease.
  • Embodiment 26 The method of embodiment 25, wherein the endonuclease is uracil DNA glycosylase (UDG).
  • Embodiment 27 The method of embodiment 24, wherein step (c) comprises denaturing of the repaired target nucleic acid.
  • Embodiment 28 The method of embodiment 27, further comprising treating the denatured repaired target nucleic acid with an exonuclease.
  • Embodiment 29 The method of any one of embodiments 24-28, further comprising amplifying the library of template nucleic acids.
  • Embodiment 30 The method of embodiment 29, wherein the amplifying is whole-genome amplification.
  • Embodiment 31 The method of embodiment 29, wherein the amplifying is targeted amplification.
  • Embodiment 32 The method of any one of embodiments 29-31, wherein the library of template nucleic acids is amplified by a polymerase chain reaction (PCR).
  • Embodiment 33 The method of embodiment 32, wherein the PCR comprises contacting the template nucleic acids with a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
  • Embodiment 34 The method of any one of embodiments 29-31, wherein the library of template nucleic acids is amplified by rolling circle amplification (RCA) using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • Embodiment 35 The method of embodiment 34, further comprising circularizing the template nucleic acids prior to the RCA.
  • Embodiment 36 The method of any one of embodiments 24-35, wherein the polymerase is T4 DNA polymerase.
  • Embodiment 37 The method of any one of embodiments 24-36, wherein the transposase is Tn5 transposase.
  • Embodiment 38 The method of any one of embodiments 24-37, wherein the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • Embodiment 39 The method of any one of embodiments 24-38, wherein the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • Embodiment 40 The method of any one of embodiments 24-39, wherein the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • Embodiment 41 A method of analyzing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using the method of any one of embodiments 24-40; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
  • Embodiment 42 The method of embodiment 41, wherein the sequencing is massively parallel shotgun sequencing.
  • Embodiment 43 The method of embodiment 41, wherein the sequencing is single molecule sequencing.
  • Embodiment 44 The method of any one of embodiments 41-43, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids.
  • Embodiment 45 The method of embodiment 44, wherein step (c) comprises: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • Embodiment 46 The method of embodiment 44 or embodiment 45, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
  • Embodiment 47 The method of any one of embodiments 44-46, further comprising counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • Embodiment 48 The method of any one of embodiments 41-47, wherein the method is used for genome assembly, haplotyping, detection of mutation, chromosomal conformation analysis, or methylation analysis.
  • Embodiment 49 The method of embodiment 48, wherein the mutation is selected from the group consisting of substitution, indel, structural variation, and copy number variation.
  • Embodiment 50 A kit for preparing a library of template nucleic acids, comprising: (a) the composition of any one of embodiments 21-23; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c) instructions for preparing the library of template nucleic acids.
  • Embodiment 51 The kit of embodiment 50, further comprising a polymerase.
  • Embodiment 52 The kit of embodiment 51, wherein the polymerase is a T4 DNA polymerase.
  • Embodiment 53 The kit of any one of embodiments 50-52, further comprising a ligase.
  • Embodiment 54 The kit of any one of embodiments 50-53, wherein the transposase is Tn5 transposase.
  • Embodiment 55 The kit of any one of embodiments 50-54, further comprising an endonuclease.
  • Embodiment 56 The kit of embodiment 55, wherein the endonuclease is UDG.
  • Embodiment 57 The kit of any one of embodiments 50-56, further comprising a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • Embodiment 58 The kit of any one of embodiments 50-56, further comprising a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • Identical twins have identical genomic sequences except for only a few mutations.
  • the specific mutations can be determined by NGS methods, and confirmed by Sanger sequencing methods. Therefore, data from whole genome sequencing of identical twins can be used for checking sequencing errors using the library preparation methods described in the present application.
  • An exemplary method of whole genome sequencing of identical human twins is described below.
  • Human gDNA is extracted from a buccal swab or a drop of blood, and the purity and yield of the gDNA are measured. Alternatively, about 10-20 human cells from each person are lysed without purification to minimize the loss of DNA.
  • a composition comprising a plurality of synthetic transposons as shown in FIG. 21 is prepared. Illumina sequencing primers Read1 and Read2 are incorporated as the first adapter sequence (e.g., F) and second adapter sequence (e.g., R) respectively in the non-complementary regions.
  • the molecular barcodes contain a total of 20 randomly or degenerately designed nucleotides intermixed with fixed nucleotides.
  • Duplicate samples of the gDNA inserted with the plurality of synthetic transposons are prepared.
  • about 0.3 ng gDNA is contacted with the composition comprising the plurality of synthetic transposons and Tn5 transposase under a condition that allows insertion at a frequency of about 150 bp between adjacent transposition sites.
  • the single-stranded gaps are filled-in with dNTPs and a DNA polymerase without strand displacement activity, such as T4 DNA polymerase.
  • nicks in the DNA are ligated with E. coli ligase, which can be done separately or simultaneously with the gap filling step.
  • the product is treated with USER™ enzyme (NEB) to cleave the uracil nucleotide joining the two non-complementary regions in the inserted synthetic transposons to provide a library of templates.
  • the USER™ enzyme treatment step can be done separately or simultaneously with the gap filling step and the ligation step.
  • the library of templates is subsequently PCR amplified with two corresponding primers: a first primer having Illumina sequence i5 and sequence Read1, and a second primer having Illumina sequence i7 and sequence Read2.
  • the PCR products are then purified, quantified, and sequenced with 2x300-base paired-end reads using an Illumina NGS instrument.
  • the sequencing reads are subsequently analyzed.
  • the sequencing reads contain stuff sequences, unique molecular barcode sequence, 19-base Tn5 recognition site, 9-base duplicate sequence, and target sequence in both sequencing directions.
  • the sequencing reads may additionally contain an additional copy of 9-base duplicate, 19-base Tn5 recognition site, unique molecular barcode sequence and stuff sequences.
  • the sequencing reads in both directions are matched with each other and combined to yield a single sequence.
  • sequences having identical molecular barcodes are aligned and merged into a single consensus sequence to yield the error-corrected target sequence.
  • Target sequences are assembled to provide whole genome sequence with high quality, which contains haplotype information and any structural variation or mutations.
  • the genomic sequences from the twins are compared to each other to identify mutations, which are verified by Sanger sequencing. Unverified mutations are attributed to sequencing errors and used to calculate an error rate for the sequencing method described herein, which is compared to error rates of other sequencing methods that use conventional methods (such as commercial kits) to prepare sequencing libraries.
  • microbial gDNAs are extracted from human skin surface using a swab-scrape-swab procedure. The purity and yield of the microbial gDNAs are measured.
  • a composition comprising a plurality of synthetic transposons as shown in FIG. 2H is prepared.
  • PacBio adapter sequences are incorporated as the first and second adapter sequences (i.e., F and R) respectively in the two non-complementary regions.
  • the molecular barcodes contain a total of 20 randomly or degenerately designed nucleotides intermixed with fixed nucleotides. Duplicate samples of the gDNA inserted with the plurality of synthetic transposons are prepared.
  • nanograms of gDNA are used to contact with the composition comprising the plurality of synthetic transposons, and Tn5 transposase under a condition that allows insertion at a frequency of about 1500-bp between adjacent transposition sites.
  • the single-stranded gaps are filled-in with dNTPs and a DNA polymerase without strand displacement activity, such as T4 DNA polymerase.
  • nicks in the DNA are ligated with E. coli ligase, which can be done separately or simultaneously with the gap filling step.
  • the product is subsequently denatured, and treated with exonucleases to remove any linear nucleic acids.
  • the resulting sample is sequenced with a PacBio SMRT ® instrument.
  • Sequencing data is analyzed, and microbial genomes are assembled from the sequencing data. Abundance of each microbial genome is also obtained. The data is further compared to metagenome data in databases. In this case, molecular barcodes are mainly used to link as many fragments as possible from the same original genome.
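The consensus step described in the twin-sequencing example above (reads with identical molecular barcodes are aligned and merged into one error-corrected sequence) can be sketched as follows. This is a minimal illustration only, not the analysis tool of the application: it assumes that reads sharing a barcode are already aligned and of equal length, and the function name and toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads):
    """Group (barcode, sequence) pairs by barcode and majority-vote each position.

    Assumption for this sketch: reads with the same barcode are pre-aligned
    and have the same length, so columns can be compared directly.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        # zip(*seqs) yields one column of bases per aligned position.
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Hypothetical toy data: three reads share one barcode, one read has an error.
reads = [
    ("ACGTT", "GATTACA"),
    ("ACGTT", "GATTACA"),
    ("ACGTT", "GATCACA"),  # differs from the other two at one position
    ("TTGCA", "CCGGAAT"),
]
result = consensus_by_barcode(reads)
```

In this toy input the single discordant base is outvoted by the two matching reads, which is the sense in which barcode families allow sequencing errors to be separated from true variants.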

Landscapes

  • Genetics & Genomics (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Wood Science & Technology (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Plant Pathology (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides synthetic transposons having two non-complementary regions comprising adapter sequences and linked to each other. The synthetic transposons may further comprise molecular barcodes. Also disclosed are compositions comprising a plurality of the synthetic transposons, methods, and kits for library preparation. The compositions, methods, kits and analysis tools described herein have many applications, including high-quality sequencing, haplotyping, error correction, sequencing of repetitive regions, detection of structural variations and copy number variations, methylation analysis and quantification of target nucleic acids.

Description

COMPOSITIONS OF SYNTHETIC TRANSPOSONS AND METHODS OF USE THEREOF
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the priority benefit of United States Provisional Application No. 62/399,188, filed on September 23, 2016, the contents of which are hereby incorporated herein by reference in their entirety.
SUBMISSION OF SEQUENCE LISTING ON ASCII TEXT FILE
[0002] The content of the following submission on ASCII text file is incorporated herein by reference in its entirety: a computer readable form (CRF) of the Sequence Listing (file name: 761972000240SEQLIST.txt, date recorded: September 21, 2017, size: 6 KB).
FIELD OF THE INVENTION
[0003] The present invention relates to the field of genomics, in particular, sequencing and analysis of nucleic acids.
BACKGROUND OF THE INVENTION
[0004] Current next or third generation sequencing platforms, including those developed by Illumina, Ion Torrent, Complete Genomics, Pacific Biosciences or Oxford Nanopore Technologies, can generate a large amount of sequencing data with read lengths of hundreds of bases to tens of thousands of bases and sequencing errors in the range of 0.1% or lower per base. For the human genome, such a sequencing error rate results in millions of errors per genome sequenced. At the single-read level, the error rate could range from 0.3% to 15% or more per base. Although a reasonably low overall error rate has been obtained and there is generally good correlation between different sequencing platforms, platform-specific errors still exist (for example, Luo et al., Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLOS One 7: e30087, 2012). For clinical (e.g., identification of tumor somatic mutation) and forensic applications (e.g., differentiation of identical twins) of genome sequencing, it is critical to reduce the sequencing errors to negligible levels (e.g., Weber-Lehmann J. et al., Forensic Science International Genetics 9: 42-46, 2014).
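The "millions of errors per genome" figure follows from simple arithmetic, sketched below. The haploid human genome size of roughly 3.1 billion bases is an assumption supplied here for illustration and is not stated in this document; the function name is likewise hypothetical.

```python
# Approximate haploid human genome size in bases (assumption, not from the text).
HAPLOID_GENOME_BASES = 3.1e9


def expected_errors(per_base_error_rate, bases=HAPLOID_GENOME_BASES):
    """Expected number of erroneous bases at 1x coverage of `bases` positions."""
    return per_base_error_rate * bases


# 0.1% per-base error, the consensus-level rate quoted in paragraph [0004].
errors_at_0_1_percent = expected_errors(0.001)

# 0.3% and 15% per-base error, the single-read range quoted in paragraph [0004].
errors_low = expected_errors(0.003)
errors_high = expected_errors(0.15)
```

Under these assumptions a 0.1% rate corresponds to roughly three million erroneous positions per genome, consistent with the order of magnitude given in the paragraph.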
[0005] Additional challenges exist in current sequencing technologies. For example, gaps exist in genome sequences assembled from short sequencing reads due to difficulties in amplification or sequencing, which often result from complex structures or repetitiveness of the sequences. Repetitive sequences are abundant in many species and account for almost half of the human genome. Mapping of these sequences is also challenging, even with advanced bioinformatics tools (e.g., Treangen TJ et al., Nature Reviews Genetics 13: 36-46, 2012). Furthermore, haplotyping information is very difficult to obtain from current next-generation sequencing methods alone due to short reads. Moreover, if PCR or similar amplification steps are involved in library preparation or sequencing, potential recombination during such steps can lead to chimeric products in the range of < 1% to as much as 7% (e.g., Yu et al., BioTechniques 40: 499-507, 2006).
[0006] Recently, several groups have developed methods and kits to provide phasing information of sequencing reads that facilitate assembly of whole genome sequences and other long-range sequences. Commercial kits are available from, e.g., Complete Genomics, Illumina, or 10x Genomics. Also see, for example, Peters B.A. et al., Nature 487: 190-195, 2012; Kaper F. et al., Proc. Natl. Acad. Sci. 110: 5552-5557, 2013; Amini S. et al., Nature Genetics 46: 1343-1349, 2014; McCoy R.C. et al., PLOS One 9: e10668, 2014; Zheng G. X. Y. et al., Nature Biotechnology 34: 303-311, 2016. Another method used for precise haplotyping of long sequences or chromosomes involves insertion of artificial transposons with molecular barcodes in nucleic acids during library construction. See, for example, US8,829,171, US20130203605, US9,328,382; US2016034480. Molecular barcodes (mBCs) or molecular tags (mTags) have also been used in library construction methods to reduce errors introduced by PCR or ligation steps (see, e.g., Kinde I et al., Proc. Natl. Acad. Sci. USA 108: 9530-9535, 2011; Schmitt MW et al., Proc. Natl. Acad. Sci. USA 109: 14508-14513, 2012). In these cases, introduction of mBCs is typically done after fragmentation. Thus, the mBCs cannot be used to provide sequence contiguity information, which is required for haplotyping or resolving repetitive sequences based on short-read sequencing results.
[0007] Transposases can be used to introduce mutations or insert sequences in nucleic acids. Previously, transposases were used for in vitro or in vivo mutagenesis (e.g., US6,159,736) or for producing protein tags (e.g., US5,652,128). Several companies including NEB, Epicentre (now part of Illumina) and Finnzymes have provided kits for these purposes. Transposases have also been used to fragment target DNA and to introduce primer binding sequences at the same time. See, for example, US6,593,113, 2003; US9,115,396; US9,145,623; and Adey A. et al., Genome Biol. 11: R119, 2010. Commercial kits are available, including, for example, NEXTERA® DNA Sample Prep kits by Illumina/Epicentre and MUSEEK Library Preparation kits by Thermo Scientific.
[0008] The disclosures of all publications, patents, patent applications and published patent applications referred to herein are hereby incorporated herein by reference in their entirety.
BRIEF SUMMARY OF THE INVENTION
[0009] The present invention provides compositions, methods, kits and analysis tools for high-quality sequencing of nucleic acids, haplotyping and quantification of whole genome or targeted sequences. The compositions comprise one or more synthetic transposons having two non-complementary regions linked to each other, and the synthetic transposons may or may not contain molecular barcodes.
[0010] One aspect of the present application provides a synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
[0011] In some embodiments of the synthetic transposon described above, the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides. In some embodiments, the cleavable nucleotide is a uracil nucleotide.
[0012] In some embodiments of the synthetic transposon described above, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other. In some embodiments, the first single-stranded linker and the second single-stranded linker hybridize to each other. In some embodiments, each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the cleavable nucleotide is a uracil nucleotide.
[0013] In some embodiments according to any one of the synthetic transposons described above, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
[0014] In some embodiments according to any one of the synthetic transposons described above, the synthetic transposon is a DNA transposon.
[0015] In some embodiments according to any one of the synthetic transposons described above, the synthetic transposon comprises one or more modified nucleotides.
[0016] In some embodiments according to any one of the synthetic transposons described above, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0017] In some embodiments according to any one of the synthetic transposons described above, the synthetic transposon further comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region. In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the 5' terminus adjacent to the single-stranded region is phosphorylated.
[0018] One aspect of the present application provides a composition comprising a plurality of any one of the synthetic transposons described above. In some embodiments, each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
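To give a sense of scale for the number of random barcode positions, the sketch below computes the number of distinct barcodes available and a birthday-style estimate of the chance that two transposons in a pool share a barcode. It assumes each random position is drawn uniformly and independently from the four bases, which is an idealization introduced here; real degenerate designs and synthesis bias will differ. The function names are hypothetical.

```python
import math


def barcode_diversity(n_random_positions):
    """Number of distinct barcodes with n fully random positions (A, C, G, T)."""
    return 4 ** n_random_positions


def collision_probability(n_random_positions, n_transposons):
    """Birthday approximation of P(at least two transposons share a barcode).

    Assumes barcodes are sampled uniformly and independently (idealized).
    """
    d = barcode_diversity(n_random_positions)
    k = n_transposons
    # 1 - exp(-k(k-1)/(2d)), written with expm1 for accuracy at small values.
    return -math.expm1(-k * (k - 1) / (2 * d))


diversity_5 = barcode_diversity(5)    # lower bound mentioned in the text
diversity_20 = barcode_diversity(20)  # 20 random positions
p_small_pool = collision_probability(5, 100)
p_large_space = collision_probability(20, 1000)
```

Under this idealized model, five random positions give only 1,024 barcodes, so collisions become likely once more than a few dozen transposons are in play, whereas 20 random positions give about 1.1 x 10^12 barcodes.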
[0019] One aspect of the present application provides a method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with any one of the compositions described above, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, step (c) comprises treating the repaired target nucleic acid with an endonuclease. In some embodiments, the endonuclease is uracil DNA glycosylase (UDG). In some embodiments, step (c) comprises denaturing of the repaired target nucleic acid. In some embodiments, the method further comprises treating the denatured repaired target nucleic acid with an exonuclease.
[0020] In some embodiments according to any one of the methods of preparing a library of template nucleic acids described above, the method further comprises amplifying the library of template nucleic acids. In some embodiments, the amplifying is whole-genome amplification. In some embodiments, the amplifying is targeted amplification. In some embodiments, the library of template nucleic acids is amplified by a polymerase chain reaction (PCR). In some embodiments, the PCR comprises contacting the template nucleic acids with a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof. In some embodiments, the library of template nucleic acids is amplified by rolling circle amplification (RCA) using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence. In some embodiments, the method further comprises circularizing the template nucleic acids prior to the RCA.
[0021] In some embodiments according to any one of the methods of preparing a library of template nucleic acids described above, the polymerase is T4 DNA polymerase.
[0022] In some embodiments according to any one of the methods of preparing a library of template nucleic acids described above, the transposase is Tn5 transposase.
[0023] In some embodiments according to any one of the methods of preparing a library of template nucleic acids described above, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
[0024] In some embodiments according to any one of the methods of preparing a library of template nucleic acids described above, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
[0025] In some embodiments according to any one of the methods of preparing a library of template nucleic acids described above, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0026] One aspect of the present application provides a method of analyzing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using the method of any one of the methods of preparing a library of template nucleic acids described above; and (b) sequencing the library of template nucleic acids to obtain sequencing reads. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing.
[0027] In some embodiments according to any one of the methods of analyzing a target nucleic acid described above, each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids. In some embodiments, step (c) comprises: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and the duplicated sequences are used to assemble the contiguous sequence.
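The barcode-based assembly of step (c) can be pictured with a simplified model: because both halves of an inserted transposon carry the same barcode, the two template fragments on either side of an insertion share that barcode, and they also share the short duplicated target sequence at the junction. The sketch below chains fragments on that basis. It is a toy illustration under stated assumptions (one linear molecule, unique barcodes, no missing fragments, error-free reads, a fixed 9-base duplication as in the Tn5 example), and the function name and data layout are hypothetical rather than taken from the application.

```python
def chain_fragments(fragments, dup_len=9):
    """Order (left_barcode, target_sequence, right_barcode) fragments into a contig.

    A fragment at the end of the molecule has None on its outer side.
    Adjacent fragments share a barcode and overlap by `dup_len` target bases.
    """
    by_left = {left: (seq, right) for left, seq, right in fragments}
    rights = {right for _, _, right in fragments if right is not None}
    # The chain starts at the fragment whose left barcode is nobody's right barcode.
    starts = [left for left in by_left if left not in rights]
    if len(starts) != 1:
        raise ValueError("expected exactly one chain start")

    contig = ""
    barcode = starts[0]
    while barcode in by_left:
        seq, barcode = by_left.pop(barcode)
        if contig:
            # The duplicated target bases must agree across the junction.
            if contig[-dup_len:] != seq[:dup_len]:
                raise ValueError("junction duplication mismatch")
            seq = seq[dup_len:]  # keep a single copy of the duplication
        contig += seq
    return contig


# Hypothetical toy molecule with two insertions and their 9-base duplications.
dup1, dup2 = "TGCATGCAA", "CATCATCAT"
fragments = [
    ("BC1", dup1 + "GGCCTT" + dup2, "BC2"),  # middle fragment, listed out of order
    ("BC2", dup2 + "TTAA", None),
    (None, "ACGTACGT" + dup1, "BC1"),
]
contig = chain_fragments(fragments)
```

Checking the duplicated bases at each junction gives an independent confirmation that two fragments linked by a barcode really were adjacent in the original molecule.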
[0028] In some embodiments according to any one of the methods of analyzing a target nucleic acid described above, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
[0029] In some embodiments according to any one of the methods of analyzing a target nucleic acid described above, the method is used for genome assembly, haplotyping, detection of mutation, chromosomal conformation analysis, or methylation analysis. In some embodiments, the mutation is selected from the group consisting of substitution, indel, structural variation, and copy number variation.
[0030] Also provided by the present application is a kit for preparing a library of template nucleic acids, comprising: (a) the composition according to any one of the compositions described above; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c) instructions for preparing the library of template nucleic acids. In some embodiments, the kit further comprises a polymerase, such as a T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the transposase is Tn5 transposase. In some embodiments, the kit further comprises an endonuclease. In some embodiments, the endonuclease is UDG. In some embodiments, the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof. In some embodiments, the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
[0031] It is understood that aspects and embodiments of the invention described herein include "consisting" and/or "consisting essentially of" aspects and embodiments.
[0032] Reference to "about" a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, description referring to "about X" includes description of "X".
[0033] As used herein, reference to "not" a value or parameter generally means and describes "other than" a value or parameter. For example, the method is not used to treat cancer of type X means the method is used to treat cancer of types other than X.
[0034] The term "about X-Y" used herein has the same meaning as "about X to about Y."
[0035] As used herein and in the appended claims, the singular forms "a," "or," and "the" include plural referents unless the context clearly dictates otherwise.
[0036] These and other aspects and advantages of the present invention will become apparent from the subsequent detailed description and the appended claims. It is to be understood that one, some, or all of the properties of the various embodiments described herein may be combined to form other embodiments of the present invention just as if each and every combination was individually and explicitly disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] FIG. 1 illustrates integration of a paired-barcode synthetic transposon having a regular double-stranded structure into template DNA followed by amplification using dual PCR primers. F denotes a first adapter sequence, and R denotes a second adapter sequence. Primers designed to match the F and R sequences (i.e., same or reverse complementary sequences) are used in the amplification step.
[0038] FIG. 2A depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205).
[0039] FIG. 2B depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205), and a single-stranded linker (206) disposed between F and R.
[0040] FIG. 2C depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a molecular barcode (202/202rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205).
[0041] FIG. 2D depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a molecular barcode (202/202rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205), and a single-stranded linker (206) disposed between F and R.
[0042] FIG. 2E depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and no molecular barcodes, wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *).
[0043] FIG. 2F depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein only one strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
[0044] FIG. 2G depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein only one strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
[0045] FIG. 2H depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
[0046] FIG. 2I depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
[0047] FIG. 2J depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (206, 206rc) disposed between F and R, wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein one molecular barcode is single-stranded.
[0048] FIG. 2K depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (206, 206rc) disposed between F and R, wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein both molecular barcodes are double-stranded.
[0049] FIG. 2L depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (208, 209) disposed between F and R, and a bridge nucleic acid (207), wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers to the bridge nucleic acid, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein one molecular barcode is single-stranded.
[0050] FIG. 2M depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (208, 209) disposed between F and R, and a bridge nucleic acid (207), wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers to the bridge nucleic acid, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein both molecular barcodes are double-stranded.
[0051] FIG. 2N depicts an exemplary synthetic transposon having one blunt end and one end with a hairpin structure (210), comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
[0052] FIG. 2O depicts an exemplary synthetic transposon having one blunt end and one end with a hairpin structure (210), comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
[0053] FIG. 3 shows an exemplary method of preparing the synthetic transposons of FIG. 2F and FIG. 2G having two identical molecular barcode sequences. 301/301rc: sequences containing transposon binding sites; 302/302rc: sequences containing molecular barcodes; 303/303rc: stuff sequences for first priming; 304/304rc: stuff sequences for second priming; 305: fixed sequences that may contain PCR primer 1 or F if needed; 306: fixed sequences that may contain PCR primer 2 or R if needed. The designation "rc" after a number indicates a reverse complementary sequence. "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
[0054] FIG. 4 shows an exemplary method of preparing the synthetic transposons of FIG. 2H and FIG. 2I having two identical molecular barcode sequences. 401/401rc: sequences containing transposon binding sites; 402/402rc: sequences containing molecular barcodes; 403/403rc: stuff sequences for first priming; 404/404rc: stuff sequences for second priming; 405: fixed sequences that may contain PCR primer 1 or F; 406: fixed sequences that may contain PCR primer 2 or R. The designation "rc" after a number indicates a reverse complementary sequence. "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
[0055] FIG. 5 shows an exemplary method of preparing the synthetic transposons of FIG. 2J and FIG. 2K having two identical molecular barcode sequences. 501/501rc: sequences containing transposon binding sites; 502/502rc: sequences containing molecular barcodes; 503/503rc: stuff sequences; 504/504rc: fixed sequences for 1st priming; 505/505rc: fixed sequences with modified blocked 3'-end for 2nd priming; 506: fixed sequence that may contain PCR primer 1 or F; 507/507rc: linker sequences connecting the two non-complementary regions; 508: fixed sequence that may contain PCR primer 2 or R; 509: additional sequence for flexibility of the structures. "*" indicates a cleavable nucleotide or a cleavage site. The 3'-end in 505rc may contain a phosphate group (P) or a reversible dideoxynucleotide. The 3'-phosphoryl group can be removed by T4 polynucleotide kinase (T4 PNK) available commercially (e.g., NEB T4 PNK, catalogue # M0201L).
[0056] FIG. 6 shows an exemplary method of preparing the synthetic transposons of FIG. 2L and FIG. 2M. 601/601rc: sequence containing transposon recognition sites; 602/602rc: sequence containing molecular barcodes; 603/603rc: stuff sequences; 604/604rc: fixed sequences for 1st priming; 605/605rc: fixed sequences with blocked 3'-end for 2nd priming after being deblocked; 606: fixed sequence that may contain PCR primer 1 or F; 607: linker sequence connecting the first non-complementary region to bridge oligo (607rc+611+610rc); 608: fixed sequence that may contain PCR primer 2 or R; 609: additional sequence to provide flexibility of the structure; 610: linker sequence connecting the second non-complementary region to bridge oligo (607rc+611+610rc); 611: part of the bridge oligo (607rc+611+610rc) connecting the two non-complementary regions. Additional modified bases (*) may be used such as for cleavage if needed.
[0057] FIG. 7 shows an exemplary method of preparing the synthetic transposons of FIG. 2N and FIG. 2O. 701/701rc: sequences containing transposon binding sites; 702/702rc: sequences containing molecular barcodes; 703/703rc: stuff sequences for 1st priming; 704/704rc: stuff sequences for 2nd priming; 705: fixed sequences that may contain PCR primer 1 or F; 706: fixed sequences that may contain PCR primer 2 or R. The designation "rc" after a number indicates a reverse complementary sequence. "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide. "*" denotes another cleavable nucleotide such as an RNA nucleotide.
[0058] FIG. 8A shows an exemplary synthetic transposon of FIG. 2I having two different molecular barcodes (801 and 802), and an exemplary method of preparing the synthetic transposon. The synthetic transposon can be prepared using two DNA oligos having adapter sequences F and R that correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers. On each strand of the synthetic transposon, the F and R sequences are fused to each other via a uracil nucleotide.
[0059] FIG. 8B shows an exemplary synthetic transposon of FIG. 2E having no molecular barcodes, and an exemplary method of preparing the synthetic transposon. The synthetic transposon can be prepared using two DNA oligos having adapter sequences F and R that correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers. On each strand of the synthetic transposon, the F and R sequences are fused to each other via a UU dinucleotide.
[0060] FIG. 8C shows an exemplary synthetic transposon fragment of FIG. 2C having a molecular barcode (803), and an exemplary method of preparing the synthetic transposon fragment. The adapter sequences F and R correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers.
[0061] FIG. 8D shows an exemplary synthetic transposon fragment of FIG. 2D comprising a single oligonucleotide forming a hairpin structure that does not contain a molecular barcode, and an exemplary method of preparing the synthetic transposon fragment. The adapter sequences F and R correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers.
[0062] FIG. 8E shows exemplary primers that can be used for multiplexed paired-end sequencing of libraries prepared using the synthetic transposons of FIGS. 8A-8D.
[0063] FIG. 9 shows an exemplary method of preparing a library of template nucleic acids for sequencing using a plurality of synthetic transposons of FIG. 2I. The synthetic transposons are integrated into a target DNA, followed by repair and UDG treatment, PCR amplification with dual primers, which may contain a sequence matching F or R, and any additional adapter sequences (shown as dotted lines) needed for sequencing or analysis.
[0064] FIG. 10 shows an exemplary method of preparing a library of template nucleic acids using a plurality of synthetic transposons of FIG. 2G. The synthetic transposons are integrated into a target DNA, followed by repair, UDG treatment, and nick ligation to circularize the nucleic acid fragments, thereby allowing downstream analysis, such as rolling circle amplification (RCA) or single molecule sequencing.
[0065] FIG. 11 shows an exemplary method of preparing a library of template nucleic acids using a plurality of synthetic transposons of FIG. 2M. The synthetic transposons are integrated into a target DNA, followed by repair, denaturation, and exonuclease treatment to provide circular nucleic acid fragments, which can be amplified by rolling circle amplification (RCA), or analyzed by RCA sequencing or single molecule sequencing methods. 1001: molecular barcode from the synthetic transposon shown on the left; 1002: molecular barcode from the synthetic transposon shown on the right.
[0066] FIG. 12 shows an exemplary pipeline for analyzing sequencing data from Illumina reads of a sequencing library prepared using the synthetic transposons of the present application. For example, the sequencing library may be a PCR-amplified library prepared as in FIG. 9.
DETAILED DESCRIPTION OF THE INVENTION
[0067] The present application discloses synthetic transposons, methods and kits for preparing sequencing libraries from a target nucleic acid, which can be analyzed using next-generation sequencing methods. The synthetic transposons of the present application comprise two non-complementary regions that comprise adapter sequences, wherein the two non-complementary regions are connected to each other and are located between two stem fragments each containing a transposase recognition site. Transposition of the synthetic transposons into the target nucleic acid and subsequent steps that separate the non-complementary regions result in fragmentation of the target nucleic acid and, at the same time, introduction of the adapter sequences. The resulting product may be sequenced directly, or amplified in a subsequent step prior to sequencing using primers that match the adapter sequences. In some embodiments, the synthetic transposons are designed to comprise a molecular barcode disposed between the transposase recognition site and the non-complementary region. A plurality of synthetic transposons each having a different molecular barcode may be used to prepare a library preserving the contiguity information in the target nucleic acid through the molecular barcodes. The compositions, methods, kits and analysis tools described herein are useful for many applications, including haplotyping, de novo assembly of whole genomes or long contiguous sequences, sequencing of repetitive regions, detection of structural variations and copy number variations, and methylation analysis.
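The way molecular barcodes can preserve contiguity information may be illustrated with a minimal sketch. It assumes that each inserted synthetic transposon carries the same barcode on both of its halves, so that two neighboring fragments share one barcode at their junction, and that every barcode is unique and read without error. The function name `order_fragments`, the input format, and the placeholder end labels are hypothetical and are not taken from the present application.

```python
def order_fragments(fragments):
    """Order fragments by chaining the barcodes shared at their junctions.

    fragments: dict mapping a fragment name to a (left_barcode, right_barcode)
    tuple (hypothetical input format). Returns the fragment names in their
    inferred order along the target nucleic acid.
    """
    by_left = {left: name for name, (left, _right) in fragments.items()}
    right_barcodes = {right for _left, right in fragments.values()}
    # The chain starts at the fragment whose left barcode is not the
    # right barcode of any other fragment.
    starts = [name for name, (left, _right) in fragments.items()
              if left not in right_barcodes]
    if len(starts) != 1:
        raise ValueError("expected exactly one chain start")
    order = [starts[0]]
    while True:
        _left, right = fragments[order[-1]]
        nxt = by_left.get(right)
        if nxt is None or nxt in order:
            break
        order.append(nxt)
    return order


# Placeholder labels END_L and END_R stand for the two ends of the target.
fragments = {
    "f2": ("BC1", "BC2"),
    "f1": ("END_L", "BC1"),
    "f3": ("BC2", "END_R"),
}
print(order_fragments(fragments))
```

A real analysis would also need to handle reverse-complemented reads, sequencing errors in the barcodes, and fragments lost during library preparation, none of which this sketch attempts.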
[0068] The synthetic transposons and methods of use described in the present application differ from those currently known in the art. By way of contrast, FIG. 1 illustrates a method for preparing a sequencing library using a regular double-stranded synthetic transposon having paired barcodes and a pair of adapter sequences disposed between the paired barcodes, such as the synthetic transposons described in US Patent No. 8,829,171. The synthetic transposons are integrated into the target DNA, the product of which is repaired and subsequently PCR amplified using primers that match the adapter sequences (i.e., having the same sequences as, or sequences reverse complementary to, the adapter sequences). However, during transposition, the synthetic transposons can be inserted in two opposite orientations, yielding three different potential configurations (Config. 1, 2, and 3 in FIG. 1) for a fragment of target nucleic acid surrounded by a pair of synthetic transposons with respect to the orientation of the adapter sequences. As template strands having self-hairpin structures due to hybridization between complementary adapter sequences are not amplified with high efficiency, only Config. 1 yields a template (template 1) that can be amplified with high efficiency. Templates 2 and 3 have either F primer binding sites at both ends or R primer binding sites at both ends, leading to self-hairpin structures during renaturation after the denaturation step. The self-hairpin structures resulting from intramolecular interactions are more stable than intermolecular interactions between the adapter sites and the F/R primers, which may lead to little or no PCR amplification of the template in the next cycle. As a result, target sequence fragments having configurations 2 and 3 may be missing or under-represented in the amplified library prepared using such a method, leading to difficulty in linking the fragment sequences together for haplotyping purposes, or errors in quantification of the fragments. As multiple reads per fragment are required to assemble reads from barcoded fragments, the sequencing cost could also be increased due to missing or biased amplification using the library preparation method of FIG. 1. The synthetic transposons and methods described herein solve this problem by incorporating the adapter sequences in the non-complementary regions, and introducing two adapter sequence pairs for each insertion site, thereby yielding only one fragment configuration with respect to the adapter orientations, which is amenable to PCR amplification.
[0069] Alternatively, some embodiments of the synthetic transposons described herein are used to insert the adapter sequences into target nucleic acid, which is subsequently fragmented using simple denaturation or enzymatic cleavage steps that separate the non-complementary regions. The resulting fragments can be directly sequenced without further ligation to sequencing adapters. By contrast, in some currently known library construction methods, Y-shaped adapters comprising sequencing adapters are ligated to fragmented nucleic acids. Such methods require end-processing steps prior to the ligation, such as blunt-end polishing, or addition of T or A to the ends of the fragments. However, such end-processing steps may have varying efficiency for different end sequences, which results in biased coverage of fragments in the target nucleic acids. The synthetic transposons and methods described herein overcome such challenges by introducing adapter sequences and fragmenting the target nucleic acid in a single process.
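The orientation problem that paragraph [0068] describes for the regular double-stranded transposon of FIG. 1 can be illustrated by enumerating the possible insertion orientations. The sketch below is only a model under stated assumptions: a forward-oriented transposon is assumed to carry F on its left half and R on its right half, a reversed transposon swaps them, and the two orientations are treated as equally likely. The names `fragment_ends` and `amplifiable` are hypothetical.

```python
from itertools import product


def fragment_ends(left_orientation, right_orientation):
    """Return the adapters facing a fragment flanked by two transposons.

    Assumed layout: a forward ("fwd") transposon has F on its left half and
    R on its right half; a reversed ("rev") transposon has them swapped.
    """
    # The fragment's left end touches the right half of the left transposon.
    left_end = "R" if left_orientation == "fwd" else "F"
    # The fragment's right end touches the left half of the right transposon.
    right_end = "F" if right_orientation == "fwd" else "R"
    return left_end, right_end


def amplifiable(ends):
    """A fragment amplifies efficiently only with one F end and one R end."""
    return ends[0] != ends[1]


outcomes = {o: fragment_ends(*o) for o in product(("fwd", "rev"), repeat=2)}
fraction = sum(amplifiable(e) for e in outcomes.values()) / len(outcomes)

for orientations, ends in outcomes.items():
    print(orientations, ends, "amplifiable" if amplifiable(ends) else "self-hairpin")
print("fraction amplifiable:", fraction)
```

Under these assumptions, the two mixed-orientation cases give F/F or R/R ends (corresponding to configurations 2 and 3), so only half of the orientation combinations yield a fragment with one F end and one R end. The actual proportion in a real library is not stated in the present application and may differ.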
[0070] Accordingly, one aspect of the present application provides a synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the synthetic transposon further comprises a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region.
[0071] One aspect of the present application provides a composition comprising a plurality of synthetic transposons, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence.
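The barcode constraints of this composition (the same barcode sequence on both halves of one synthetic transposon, and a different barcode sequence for each synthetic transposon) can be sketched as follows. The barcode length, the random generation scheme, and the names `make_barcode_pool` and `make_transposons` are assumptions made for illustration only and do not describe how barcodes are designed or synthesized in practice.

```python
import random


def make_barcode_pool(n, length, seed=0):
    """Generate n distinct random DNA barcodes of the given length."""
    if n > 4 ** length:
        raise ValueError("not enough distinct barcodes of this length")
    rng = random.Random(seed)
    pool = set()
    while len(pool) < n:
        pool.add("".join(rng.choice("ACGT") for _ in range(length)))
    return sorted(pool)


def make_transposons(barcodes):
    """Pair each barcode with itself: both halves of a transposon match."""
    return [{"first_barcode": bc, "second_barcode": bc} for bc in barcodes]


# Hypothetical parameters: 5 transposons with 12-nucleotide barcodes.
transposons = make_transposons(make_barcode_pool(n=5, length=12))
for t in transposons:
    print(t["first_barcode"], t["second_barcode"])
```

Keeping the pool in a set guarantees that no two synthetic transposons in the sketch share a barcode sequence, which mirrors the final limitation recited above.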
[0072] Another aspect of the present application provides a method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with any of the synthetic transposons or compositions comprising a plurality of the synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
[0073] Further provided are methods of analyzing a target nucleic acid using the library of template nucleic acids prepared using the synthetic transposons, kits, articles of manufacture, and analysis tools.
Synthetic transposons
[0074] One aspect of the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first adapter and the fourth adapter have the same sequence. In some embodiments, the second adapter and the third adapter have the same sequence. In some embodiments, the first non-complementary region and the second non-complementary region comprise different adapters.
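The arrangement of the four adapters may be easier to follow as a small data structure. The sketch below is a hypothetical representation, not a definition from the present application: the class names, field names, and the labels "ME", "F", and "R" are illustrative. The method `has_mirrored_adapters` checks the embodiments in which the first adapter and the fourth adapter have the same sequence and the second adapter and the third adapter have the same sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonComplementaryRegion:
    first_strand_adapter: str
    second_strand_adapter: str


@dataclass(frozen=True)
class SyntheticTransposon:
    first_recognition_site: str
    second_recognition_site: str
    first_region: NonComplementaryRegion   # carries the first and second adapters
    second_region: NonComplementaryRegion  # carries the third and fourth adapters

    def has_mirrored_adapters(self):
        """True when first == fourth adapter and second == third adapter."""
        return (
            self.first_region.first_strand_adapter
            == self.second_region.second_strand_adapter
            and self.first_region.second_strand_adapter
            == self.second_region.first_strand_adapter
        )


example = SyntheticTransposon(
    first_recognition_site="ME",
    second_recognition_site="ME",
    first_region=NonComplementaryRegion("F", "R"),
    second_region=NonComplementaryRegion("R", "F"),
)
print(example.has_mirrored_adapters())
```

In the example, the first region carries F on its first strand and R on its second strand, while the second region carries them in the opposite order, so the mirrored-adapter check passes.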
[0075] The present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein each of the first strand and the second strand of the first non-complementary region is connected to one strand of the first stem; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter, wherein each of the first strand and the second strand of the second non-complementary region is connected to one strand of the second stem; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
[0076] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first molecular barcode and the second molecular barcode have the same sequence. In some embodiments, the first molecular barcode and the second molecular barcode have different sequences. In some embodiments, the first adapter and the fourth adapter have the same sequence. In some embodiments, the second adapter and the third adapter have the same sequence. In some embodiments, the first non-complementary region and the second non-complementary region comprise different adapters.
[0077] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0078] The present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence, wherein each of the first strand and the second strand of the first non-complementary region is connected to one strand of the first stem; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence, wherein each of the first strand and the second strand of the second non-complementary region is connected to one strand of the second stem; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
[0079] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0080] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0081] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
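The effect of fusing two strands through a cleavable nucleotide can be pictured with a simplified sketch. The Python example below is illustrative only: the function name `cleave_at_uracil` and the adapter sequences are hypothetical, and enzymatic removal of uracil is approximated as a plain split of the strand at each `U`, which ignores the chemistry of the actual cleavage reaction.

```python
from typing import List


def cleave_at_uracil(strand: str) -> List[str]:
    """Split a strand at every uracil, dropping the uracil itself.

    Assumption: cleavage is modeled as string splitting; empty pieces
    produced by adjacent uracils are discarded.
    """
    return [piece for piece in strand.upper().split("U") if piece]


if __name__ == "__main__":
    # Hypothetical adapter sequences joined through a single uracil.
    first_adapter = "ACGTACGT"
    second_adapter = "TTGACCAA"
    fused = first_adapter + "U" + second_adapter
    print(cleave_at_uracil(fused))  # ['ACGTACGT', 'TTGACCAA']
```

In this picture, a strand fused via one or more uracils stays intact until cleavage, after which the two adapter-bearing strands separate.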
[0082] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker are connected to each other. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0083] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0084] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0085] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0086] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0087] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0088] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0089] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker are connected to each other. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0090] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0091] In some embodiments, there is provided a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
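The bridge nucleic acid embodiment relies on each single-stranded linker hybridizing to one segment of the bridge. As a rough illustration, the Python sketch below treats hybridization as perfect reverse complementarity, which is a simplifying assumption (real hybridization depends on length, temperature, and tolerated mismatches). The function names and linker sequences are hypothetical and are not taken from the application.

```python
from typing import Tuple

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def make_bridge(linker1: str, linker2: str) -> Tuple[str, str]:
    """Design the two single-stranded bridge segments.

    The first segment is complementary to the first linker and the second
    segment is complementary to the second linker.
    """
    return reverse_complement(linker1), reverse_complement(linker2)


def bridge_connects(bridge: Tuple[str, str], linker1: str, linker2: str) -> bool:
    """True if each bridge segment perfectly pairs with its linker."""
    first, second = bridge
    return (first == reverse_complement(linker1)
            and second == reverse_complement(linker2))


if __name__ == "__main__":
    # Hypothetical linker sequences chosen only for the example.
    linker_a, linker_b = "AACCGG", "TTTGCA"
    bridge = make_bridge(linker_a, linker_b)
    print(bridge_connects(bridge, linker_a, linker_b))
```

Because the two linkers pair with the bridge rather than with each other, their sequences need not be complementary to one another in this model.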
[0092] Also provided are synthetic transposon fragments corresponding to the first fragment or second fragment of any one of the synthetic transposons described herein. Synthetic transposon fragments that are not connected to each other may be used for fragmenting a target nucleic acid, and to allow amplification of the fragments by PCR using primers corresponding to the first and second adapters.
[0093] In some embodiments, there is provided a synthetic transposon fragment comprising a stem and a non-complementary region comprising a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a blunt end.
[0094] In some embodiments, there is provided a synthetic transposon fragment comprising a stem and a non-complementary region comprising a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a hair-pin structure on the end distal from the non-complementary region.
[0095] In some embodiments, there is provided a synthetic transposon fragment comprising a stem, a non-complementary region, and a molecular barcode disposed between the stem and the non-complementary region, wherein the non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a blunt end.
[0096] In some embodiments, there is provided a synthetic transposon fragment comprising a stem, a non-complementary region, and a molecular barcode disposed between the stem and the non-complementary region, wherein the non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a hair-pin structure on the end distal from the non-complementary region.
[0097] Also provided are compositions comprising any one of the synthetic transposons described herein.
[0098] Thus, in some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
[0099] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. 
In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single- stranded region.
[0100] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. 
In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
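As an illustration of the barcode requirement above (each synthetic transposon carrying a different barcode sequence of at least about 5 random or degenerate nucleotides), the following minimal Python sketch computes how many distinct barcodes a given number of fully random positions can supply and draws a set of unique barcodes. The function names `barcode_space` and `draw_unique_barcodes`, the uniform A/C/G/T sampling, and the fixed seed are assumptions made for this example only; they are not part of the described compositions.

```python
import random


def barcode_space(n_random: int) -> int:
    """Number of distinct sequences available from n fully random A/C/G/T positions."""
    if n_random < 0:
        raise ValueError("n_random must be non-negative")
    return 4 ** n_random


def draw_unique_barcodes(count: int, n_random: int, seed: int = 0) -> list[str]:
    """Draw `count` distinct random barcodes of length `n_random` (hypothetical helper)."""
    if count > barcode_space(n_random):
        raise ValueError("more barcodes requested than the barcode space allows")
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    seen: set[str] = set()
    while len(seen) < count:
        seen.add("".join(rng.choice("ACGT") for _ in range(n_random)))
    return sorted(seen)


if __name__ == "__main__":
    print(barcode_space(5))  # 4**5 = 1024 possible 5-nucleotide barcodes
    print(draw_unique_barcodes(3, 5))
```

Under these assumptions, 5 random positions give at most 1024 distinct barcodes, so a plurality of differently barcoded transposons larger than that would need longer random regions.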
[0101] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of the second non- complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. 
In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0102] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. 
In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0103] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of the second non- complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. 
In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0104] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein the first single-stranded linker and the second single-stranded linker are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. 
In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0105] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single- stranded linker; wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide). In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. 
In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single- stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0106] In some embodiments, there is provided a composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein each synthetic transposon further comprises a bridge nucleic acid comprising a first single- stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker; wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. 
In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0107] Further provided are complexes comprising any of the synthetic transposons described herein and a transposase, and compositions comprising a plurality of complexes each comprising any of the synthetic transposons described herein and a transposase. In such complexes, the transposase can form a functional complex (i.e., transposome) with one or more transposase recognition sites, and is capable of catalyzing a transposition reaction.
[0108] For example, in some embodiments, there is provided a complex comprising a synthetic transposon and a transposase, wherein the synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the transposase is a dimeric transposase. In some embodiments, the transposase is Tn5 transposase, such as a hyperactive Tn5 transposase, for example, EZ-Tn5™. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. 
In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
[0109] In some embodiments, there is provided a composition comprising a plurality of complexes each comprising a synthetic transposon and a transposase, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first copy of a molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and the second stem comprises a second transposase recognition site and a second copy of the molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other. In some embodiments, the transposase is Tn5 transposase, such as a hyperactive Tn5 transposase, for example, EZ-Tn5™. In some embodiments, the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends. 
In some embodiments, the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first strand of the first non- complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single- stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. 
In some embodiments, each synthetic transposon has a different barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single- stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0110] The complexes can be prepared by mixing the plurality of synthetic transposons and the transposase. In some embodiments, the synthetic transposons and the transposase are incubated for at least about any one of 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour or more to form the complexes.
Elements of synthetic transposon
[0111] The synthetic transposons described herein are nucleic acids containing two synthetic transposon fragments. Unless described otherwise, all elements of the synthetic transposons, including fragments, stems, non-complementary regions, transposase recognition sites, molecular barcodes, adapters, strands, stuff sequences, bridge nucleic acids, etc., are nucleic acids. In some embodiments, the two synthetic transposon fragments are arranged in the same orientation with respect to each other, i.e., the fragments are connected to each other via direct or indirect interaction between the 3' end of one fragment and the 5' end of the other fragment on one strand or both strands. Each fragment may be fully double-stranded, partially single-stranded, or a hairpin. Each fragment has two ends. The two fragments are connected to each other via the non-complementary regions disposed at one end of each fragment.
[0112] The other end of each fragment contains a stem comprising a transposase recognition site. "Stem" as used herein refers to a nucleic acid fragment having extensive fully complementary regions. Each stem typically has two strands that can be separate from each other, or connected to each other on one end via a loop to form a hairpin structure. With the exception of stems having hairpin structures on the ends, the ends of the stems are fully complementary and double-stranded. Stems with hairpin structures on the ends have the hairpin structures connected to fully complementary and double-stranded regions. The stems may have a small single-stranded region no more than about any of 20, 15, 10, or 5 nucleotides long, or have an internal non-complementary region of no more than about any of 15, 10, 8, 5, or 2 nucleotides long. In some embodiments, each stem has two nucleic acid strands that are fully complementary to each other. In some embodiments, one strand contains a single-stranded gap, for example, in the molecular barcode region. One end of the stem (referred to herein as the "proximal end") is fused to the non-complementary region. The other end of the stem (referred to herein as the "distal end") can be a blunt end, or a hairpin. In some embodiments, the distal end(s) of one or both stems comprise nucleotides flanking the transposase recognition sites. In some embodiments, the stem further comprises a molecular barcode placed between the transposase recognition site and the non-complementary region. In some embodiments, the stem further comprises one or more stuff sequences, which are nucleic acids having pre-determined (also referred to as "fixed") sequences. 
The stuff sequences may be placed between the end of the stem and the transposase recognition site, between the transposase recognition site and the molecular barcode, between the transposase recognition site and the non-complementary region, and/or between the molecular barcode and the non-complementary region. The stuff sequences may provide priming sites, balance G/C contents, and/or minimize secondary structures that facilitate preparation of the synthetic transposons. Additionally, stuff sequences may be chosen to complement the molecular barcodes and the non-complementary regions to allow enough space and flexibility in the synthetic transposon to facilitate binding of the transposase to the transposase recognition sites. The stuff sequences can also facilitate data analysis steps (such as for easy alignment and clustering of sequencing reads).
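One role attributed to stuff sequences above is balancing G/C content. The following Python sketch shows one way such a balance could be checked computationally: it computes the G/C fraction of a sequence and picks, from a list of candidate stuff sequences, the one that brings a barcode-plus-stuff segment closest to a target G/C fraction. The helper names `gc_fraction` and `pick_stuff`, the 0.5 target, and the example sequences are all hypothetical and are not taken from the specification.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_stuff(barcode: str, candidates: list[str], target: float = 0.5) -> str:
    """Return the candidate stuff sequence whose combination with the barcode
    has a G/C fraction closest to `target` (assumed design goal)."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda s: abs(gc_fraction(barcode + s) - target))


if __name__ == "__main__":
    # A G-rich barcode is paired with the A/T-rich candidate to move toward 50% G/C.
    print(pick_stuff("GGGGG", ["ATATA", "GCGCG"]))  # ATATA
```

A real design would likely also weigh priming-site requirements and secondary-structure predictions, which this sketch does not attempt.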
[0113] In some embodiments, one or more of the 5' ends (also referred to herein as 5' termini) of the polynucleotide strands in the synthetic transposons are phosphorylated, or the 5' terminal nucleotide has a 5' phosphate group. Phosphorylated 5' ends facilitate ligation to other nucleic acids, such as adapters, extended, or gap-filled nucleic acid strands (e.g., for nick-sealing). For example, in some embodiments, the 5' terminus of the distal end of the first stem and/or the second stem is phosphorylated. In some embodiments, wherein the first stem or the second stem comprises a single-stranded region, for example, the first molecular barcode or the second molecular barcode comprises a single-stranded region or is single-stranded, the 5' terminus adjacent to the single-stranded region is phosphorylated. In some embodiments, one or more of the 5' ends of the polynucleotide strands in the synthetic transposons are unphosphorylated, for example, the 5' terminal nucleotide has a 5' free hydroxyl group. Synthetic transposons having 5' hydroxyl ends may be phosphorylated in the library construction steps to enable ligation to other nucleic acids or nick-sealing.
Non-complementary regions
[0114] The non-complementary regions of the synthetic transposons allow processing, efficient amplification, and haplotyping of a target nucleic acid inserted with the synthetic transposon. Each non-complementary region comprises two non-complementary strands of nucleic acids. Each of the non-complementary strands in the non-complementary region is connected to one strand of the corresponding stem region. The two strands of a non-complementary region do not hybridize to each other at normal pH and ionic conditions (such as pH 7 and 150 mM salt). In some embodiments, the two strands of a non-complementary region have no more than about any of 60%, 50%, 40%, 30%, 20%, 10%, 5%, or less sequence homology. In some embodiments, the two strands of a non-complementary region have no more than about any of 5, 4, 3 or 2 consecutive nucleotides that are complementary to each other. In some embodiments, each strand of a non-complementary region does not form any significant secondary structure.
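The "consecutive complementary nucleotides" criterion above can be illustrated with a minimal Python sketch. The function names and the default threshold of 5 (one of the values listed in the text) are illustrative assumptions rather than part of the disclosure, and the check considers only Watson-Crick pairing between antiparallel strands written 5' to 3'.

```python
# Hypothetical helper names; a sketch only, not a method taken from the disclosure.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def longest_complementary_run(strand_a, strand_b):
    """Longest stretch of strand_a that could base-pair with a stretch of strand_b.

    Pairing is antiparallel, so this equals the longest common substring of
    strand_a and the reverse complement of strand_b (dynamic programming).
    """
    a = strand_a.upper()
    target = reverse_complement(strand_b)
    best = 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if a[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_non_complementary(strand_a, strand_b, max_run=5):
    """True if no complementary run between the two strands exceeds max_run."""
    return longest_complementary_run(strand_a, strand_b) <= max_run
```

A candidate adapter pair F and R could be screened this way before synthesis; the sketch does not model secondary structure within a single strand or hybridization thermodynamics.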
[0115] Each strand of a non-complementary region comprises an adapter sequence (also referred to herein as an "adapter"). In some embodiments, the adapter sequences serve as priming sites to allow amplification of a nucleic acid fragment inserted with the synthetic transposon. In some embodiments, the two non-complementary regions in a synthetic transposon are identical, but are placed in opposite orientations. For example, each non-complementary region comprises a first strand comprising an adapter sequence F, and a second strand comprising an adapter sequence R, and F of the first non-complementary region is connected to R of the second non-complementary region, and/or R of the first non-complementary region is connected to F of the second non-complementary region. A pair of primers may be designed to comprise the sequence of F or R, or to comprise the complementary sequence of F or R for use in amplification of a nucleic acid fragment comprising the non-complementary regions inserted at both ends. The adapter sequences may be of any suitable length, for example, at least about any of 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, or more nucleotides long. In some embodiments, the two non-complementary regions have different sets of adapter sequences. In some embodiments, the two non-complementary regions have the same set of adapter sequences, but comprise different stuff sequences on one or both strands.
[0116] Each non-complementary region may comprise two separate strands, or a single fused strand comprising the first strand and the second strand. In some embodiments, each non-complementary region is V-shaped, comprising a first strand and a second strand. In some embodiments, the first strand and the second strand of each non-complementary region are fused to each other via a single-stranded linker. 
In some embodiments, the first non-complementary region comprises a first strand comprising a first adapter sequence, a second strand comprising a second adapter sequence, and a first single-stranded linker disposed between the first strand and the second strand; and the second non-complementary region comprises a first strand comprising the second adapter sequence, a second strand comprising the first adapter sequence, and a second single-stranded linker disposed between the first strand and the second strand. In some embodiments, the first single-stranded linker can hybridize to the second single-stranded linker. In some embodiments, the first single-stranded linker is fully complementary to the second single-stranded linker. In some embodiments, the first single-stranded linker is complementary to the second single-stranded linker except for one or more cleavable nucleotides. In some embodiments, the clustering primer and sequencing primer sequences can be included in the non-complementary strands to allow PCR-free direct next generation sequencing.
[0117] The non-complementary regions are connected to each other either covalently or non-covalently (such as via hybridization of two sequences). For example, the first strand of the first non-complementary region can be fused to the first strand of the second non-complementary region. Alternatively or in addition, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region. In some embodiments, wherein the first strand and the second strand of each non-complementary region are fused to each other via a single-stranded linker, the single-stranded linker of the first non-complementary region is hybridized to the single-stranded linker of the second non-complementary region. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first sequence that hybridizes to the first single-stranded linker of the first non-complementary region, and a second sequence that hybridizes to the second single-stranded linker of the second non-complementary region, such that the two non-complementary regions are connected to each other via the hybridization of the bridge nucleic acid to the first single-stranded linker and the second single-stranded linker.
[0118] In some embodiments, the synthetic transposon comprises one or more cleavable nucleotides at the junction(s) between the first non-complementary region and the second non-complementary region, in the bridge nucleic acid, or in the single-stranded linker of the first non-complementary region and/or the second non-complementary region. Cleavage of the one or more cleavable nucleotides results in separation of the first non-complementary region from the second non-complementary region. For example, in some embodiments, the first strand of the first non-complementary region and the first strand of the second non-complementary region are fused to each other via one or more cleavable nucleotides, and/or the second strand of the first non-complementary region and the second strand of the second non-complementary region are fused to each other via one or more cleavable nucleotides. In some embodiments, wherein the first strand and second strand of each non-complementary region are fused to each other via a single-stranded linker, the single-stranded linker comprises one or more cleavable nucleotides. In some embodiments, the bridge nucleic acid comprises one or more cleavable nucleotides in the sequences that are complementary to the single-stranded linkers. The one or more cleavable nucleotides may be one or more uracil nucleotides, other modified nucleobases with specific nucleases that recognize such nucleobases (such as 8-oxoguanine), a restriction site, or RNA nucleotides wherein the synthetic transposon is a DNA transposon. For example, uracil DNA glycosylase combined with a DNA glycosylase-lyase (such as endonuclease VIII) can be used to cleave a uracil deoxyribonucleotide; and RNA nucleotides can be cleaved by an RNA endonuclease.
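As a toy illustration of how cleavable uracil nucleotides separate two fused strands, the following Python sketch models a strand as a 5' to 3' string and splits it wherever a U occurs. The function name and the example sequence are hypothetical, and the model assumes the uracil residue itself is lost on cleavage, leaving a one-nucleotide gap.

```python
def cleave_at_uracil(strand):
    """Split a DNA strand (5'->3' string) at every U and drop the U residues.

    Simplified model: each uracil is excised and the backbone is cut on both
    sides of it, so adjacent uracils (e.g. a UU dinucleotide) yield one break.
    """
    return [piece for piece in strand.upper().split("U") if piece]


# Hypothetical fused strand: adapter-like sequence, UU junction, second sequence.
fused = "ACGTTGCA" + "UU" + "GGATCCAA"
pieces = cleave_at_uracil(fused)
```

Here `pieces` holds the two separated strands; a strand with no uracil is returned unchanged as a single piece.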
Molecular barcode
[0119] In some embodiments, the synthetic transposon further comprises a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region. The first molecular barcode and the second molecular barcode may have the same sequence, or different sequences. In some embodiments, the first molecular barcode has the same sequence as the second molecular barcode, which allows matching of the molecular barcode sequences from sequencing reads to extract contiguity information in a target nucleic acid inserted with the synthetic transposon. Synthetic transposons having no molecular barcodes or two different molecular barcodes can be used for preparing libraries of template nucleic acids useful for a variety of sequencing applications (except for haplotyping) in the same way as synthetic transposons having molecular barcodes.
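The barcode-matching idea can be sketched in Python under simplifying assumptions: each sequenced fragment has already been reduced to the barcode read at its left end and the barcode read at its right end, both barcodes of a given synthetic transposon are identical, and every fragment is reported in the same orientation. The function and fragment names are hypothetical.

```python
def order_fragments(fragments):
    """Chain fragments whose end barcodes match.

    fragments: dict mapping fragment name -> (left_barcode, right_barcode);
    None marks an end with no barcode (e.g. the end of the target molecule).
    Returns a list of chains, each a list of fragment names in inferred order.
    """
    by_left = {left: name for name, (left, _right) in fragments.items()
               if left is not None}
    rights = {right for _left, right in fragments.values() if right is not None}
    # A chain starts at a fragment whose left barcode matches no right barcode.
    starts = [name for name, (left, _right) in fragments.items()
              if left is None or left not in rights]
    chains = []
    for start in starts:
        chain = [start]
        current = start
        while True:
            nxt = by_left.get(fragments[current][1])
            if nxt is None or nxt in chain:
                break
            chain.append(nxt)
            current = nxt
        chains.append(chain)
    return chains


# Hypothetical example: three fragments from one target molecule.
example = {
    "frag_b": ("BC1", "BC2"),
    "frag_a": (None, "BC1"),
    "frag_c": ("BC2", None),
}
ordered = order_fragments(example)
```

Real data would also need tolerance for sequencing errors in the barcodes and for reads in either orientation, neither of which is handled here.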
[0120] In some embodiments, the molecular barcode comprises a plurality of nucleotides that are randomly or degenerately designed, thereby yielding a highly diverse sequence that can be used to identify each individual synthetic transposon, and the target nucleic acid or fragment thereof that the synthetic transposon inserts into. In some embodiments, the molecular barcode is double-stranded. In some embodiments, the molecular barcode comprises a single-stranded region, or is single-stranded.
[0121] The composition may comprise any number of synthetic transposons having different molecular barcodes. In some embodiments, the composition comprises a single copy of each synthetic transposon having a different molecular barcode. In some embodiments, the composition comprises more than one copy of each synthetic transposon having a different molecular barcode. In some embodiments, the plurality of synthetic transposons have at least about any one of 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12, 10^13, 10^14, 10^15, 10^16, 10^17, or more different molecular barcodes. In some embodiments, the plurality of synthetic transposons have at least about any one of 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12, 10^13, 10^14, 10^15, 10^16, 10^17, or more sources of clonal molecular barcodes.
[0122] In some embodiments, the molecular barcode of each synthetic transposon is different because it contains nucleotide sequences comprising randomly designed (i.e., having any of the four nucleobases A, C, T, G) or degenerately designed (i.e., having one of a set of at least two types of nucleobases, for example, B=C/G/T; D=A/G/T; H=A/C/T; V=A/C/G; W=A/T; S=C/G; R=A/G; Y=C/T) nucleotides. The nucleotide can be a ribonucleotide, or a deoxyribonucleotide. The molecular barcode can thus be used to identify a particular fragment of a target nucleic acid that the synthetic transposon carrying the molecular barcode inserts into. The molecular barcode may further comprise nucleotides having the same identity for all synthetic transposons (i.e., "fixed" or specifically designed nucleotides). The additional fixed nucleotides or sequences can be placed on either side of the randomly or degenerately designed sequence or interspersed among the randomly or degenerately designed nucleotides.
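To make the degenerate codes above concrete, the following Python sketch counts and enumerates the sequences encoded by a barcode design string. It uses the degenerate symbols listed in the paragraph plus the standard IUPAC symbol N for a fully random position, which is an addition not spelled out in the text. The function names and example designs are hypothetical.

```python
from itertools import product

# Symbols B, D, H, V, W, S, R, Y follow the paragraph above; N (any base) is
# the standard IUPAC code for a randomly designed position.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
    "N": "ACGT",
}


def barcode_diversity(design):
    """Number of distinct sequences encoded by a design string."""
    count = 1
    for symbol in design.upper():
        count *= len(IUPAC[symbol])
    return count


def expand_barcode(design):
    """Enumerate every concrete sequence encoded by a (short) design string."""
    choices = (IUPAC[symbol] for symbol in design.upper())
    return ["".join(combo) for combo in product(*choices)]
```

For example, a design of two random positions followed by W and S encodes 4 x 4 x 2 x 2 = 64 sequences, while fixed nucleotides contribute a factor of 1. Enumeration is practical only for short designs; `barcode_diversity` alone suffices for long ones.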
[0123] In some embodiments, the molecular barcode comprises double-stranded regions. In some embodiments, the molecular barcode is double-stranded. In some embodiments, the molecular barcode is single-stranded. In some embodiments, the molecular barcode is partially single-stranded (i.e., partially double-stranded). In some embodiments, the molecular barcode has a single-stranded region having at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50 or more nucleotides. In some embodiments, the randomly and/or degenerately designed nucleotides in the molecular barcode are in the single-stranded region of the molecular barcode. In some embodiments, the double-stranded region of the at least partially single-stranded molecular barcode comprises fixed nucleotides. In some embodiments, the double-stranded region of the at least partially single-stranded molecular barcode consists essentially of fixed nucleotides.
[0124] In some embodiments, the molecular barcode comprises at least about any one of 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100 or more consecutive nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40 or more randomly designed nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40 or more degenerately designed nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40 or more fixed (i.e., specifically designed) nucleotides. In some embodiments, the molecular barcode is a mixture of randomly designed, degenerately designed or fixed nucleotides. The number of randomly and/or degenerately designed nucleotides in the molecular barcode depends on the actual need. For example, a long target nucleic acid (such as a chromosome) may need a plurality of synthetic transposons with higher diversity, i.e., a large number of randomly and/or degenerately designed nucleotides, to provide enough distinct molecular barcodes to tag the large number of segments of the target nucleic acid in order to extract contiguity information. By contrast, a short target nucleic acid, such as a plasmid a few kilobases long, may only need a small number of randomly and/or degenerately designed nucleotides to provide enough distinct molecular barcodes for tagging. In some cases, duplicated sequences endogenous to the target nucleic acid flanking the insertion sites of the synthetic transposons (e.g., 9-nt duplicate sequences for Tn5 transposase) may be used in combination with the molecular barcodes in the synthetic transposons to provide contiguity information for the target nucleic acids. 
Having both randomly designed and specific nucleotides may minimize potential undesired non-specific interactions during the process of synthesizing the synthetic transposons.
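The relationship between barcode length and diversity described above can be made concrete with a small Python sketch. It computes the smallest number of fully random positions whose sequence space covers a required number of distinct barcodes. The function name is hypothetical, and the calculation is only a lower bound: it ignores the extra positions a practical design would likely add to keep accidental barcode collisions rare.

```python
def min_random_positions(num_barcodes_needed, alphabet_size=4):
    """Smallest n such that alphabet_size ** n >= num_barcodes_needed.

    alphabet_size is 4 for fully random positions (A, C, G, T); a degenerate
    position such as W or S would use 2, and B, D, H or V would use 3.
    """
    positions = 0
    capacity = 1
    while capacity < num_barcodes_needed:
        capacity *= alphabet_size
        positions += 1
    return positions
```

Under this bound, 10^4 distinct barcodes need at least 7 random positions (4^7 = 16,384) and 10^12 need at least 20 (4^20 is about 1.1 x 10^12).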
Exemplary synthetic transposons
[0125] Exemplary synthetic transposons and fragments are shown in FIGs. 2A-2O. FIGs. 2A-2D show exemplary synthetic transposon fragments each comprising a single transposase recognition site. FIG. 2E shows an exemplary synthetic transposon with no molecular barcodes. FIGs. 2F-2O show exemplary synthetic transposons having two molecular barcodes, and various structures for the non-complementary regions and distal ends of the stems. One of skill in the art would readily appreciate that any of the exemplary synthetic transposons of FIGs. 2F-2O can be modified by replacing the molecular barcodes with stuff sequences or other sequences needed to make corresponding exemplary synthetic transposons that do not have molecular barcodes. It is possible to mix synthetic transposon fragments, such as those shown in FIG. 2A with those shown in FIG. 2B, to provide full synthetic transposons that can be inserted into target nucleic acids. For example, after transposition of the mixture of synthetic transposon fragments catalyzed by the corresponding transposase, stem-loop and Y-shaped common sequences can be added to each end of fragmented target DNA pieces at the same time, and such libraries after repairing (e.g., extension and ligation) can be used for certain sequencing applications such as Oxford Nanopore single molecule sequencing, or Pacific Biosciences single molecule sequencing.
[0126] As shown in FIG. 2A, in some embodiments, there is provided a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R, wherein the distal end of the stem is a blunt end. Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having no molecular barcodes.
[0127] As shown in FIG. 2B, in some embodiments, there is provided a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a single-stranded linker disposed between the first strand and the second strand, wherein the distal end of the stem is a blunt end. Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having no molecular barcodes. For example, a first synthetic transposon fragment of FIG. 2B having a first single-stranded linker and a second synthetic transposon fragment of FIG. 2B having a second single-stranded linker that can hybridize to the first single-stranded linker can be mixed together to provide a synthetic transposon. Alternatively, a first synthetic transposon fragment having a first single-stranded linker and a second synthetic transposon fragment having a second single-stranded linker can be mixed together with a bridge nucleic acid that can hybridize to both the first single-stranded linker and the second single-stranded linker to provide a synthetic transposon. If the transposon fragments in FIG. 2B are used for transposition (with transposase) without linkage, stem-loop common sequences (e.g., containing a sequencing primer) can be added to both ends of the fragmented target DNA pieces, and such libraries after repairing (e.g., extension and ligation) can be used for certain sequencing applications such as single molecule sequencing.
[0128] As shown in FIG. 2C, in some embodiments, there is provided a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site, a molecular barcode, and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R, wherein the distal end of the stem is a blunt end. Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having two molecular barcodes. FIG. 8C shows an exemplary synthetic transposon fragment of FIG. 2C.
[0129] As shown in FIG. 2D, in some embodiments, there is provided a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site, a molecular barcode, and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a single-stranded linker disposed between the first strand and the second strand, wherein the distal end of the stem is a blunt end. Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having two molecular barcodes. For example, a first synthetic transposon fragment of FIG. 2D having a first single-stranded linker and a second synthetic transposon fragment of FIG. 2D having a second single-stranded linker that can hybridize to the first single-stranded linker can be mixed together to provide a synthetic transposon of FIG. 2K. Alternatively, a first synthetic transposon fragment having a first single-stranded linker and a second synthetic transposon fragment having a second single-stranded linker can be mixed together with a bridge nucleic acid that can hybridize to both the first single-stranded linker and the second single-stranded linker to provide a synthetic transposon of FIG. 2M. FIG. 8D shows an exemplary synthetic transposon fragment of FIG. 2D.
[0130] As shown in FIG. 2E, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons. FIG. 8B shows an exemplary synthetic transposon of FIG. 2E. FIG. 8E shows primers that can be used to amplify nucleic acid fragments obtained from insertion of a plurality of the synthetic transposons in a target nucleic acid followed by enzymatic cleavage of the UU dinucleotide that separates the two non-complementary regions in each synthetic transposon. The primers in FIG. 8E contain sequences from sequencing primers of the Illumina sequencing platform that allow direct sequencing of the amplified nucleic acid fragments on an Illumina instrument. Randomly designed index tag sequences can be included in one primer to serve as a sample barcode, which allows multiple samples to be sequenced at the same time and subsequently de-multiplexed during data analysis.
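The de-multiplexing step mentioned above can be illustrated with a minimal Python sketch that bins reads by their index tag. It assumes each read has already been split into an index sequence and an insert sequence and that index tags are matched exactly. The function name, the index sequences, and the sample names are hypothetical.

```python
def demultiplex(reads, sample_indexes):
    """Group insert sequences by sample using exact index-tag matching.

    reads: iterable of (index_seq, insert_seq) tuples.
    sample_indexes: dict mapping index tag sequence -> sample name.
    Reads whose index tag is not recognized go to the "undetermined" bin.
    """
    bins = {name: [] for name in sample_indexes.values()}
    bins["undetermined"] = []
    for index_seq, insert_seq in reads:
        sample = sample_indexes.get(index_seq.upper(), "undetermined")
        bins[sample].append(insert_seq)
    return bins


# Hypothetical index tags and reads, for illustration only.
indexes = {"ACGTAC": "sample_1", "TTGCAA": "sample_2"}
reads = [
    ("ACGTAC", "GGATTC"),
    ("TTGCAA", "CCATGA"),
    ("ACGTAC", "TTTAGC"),
    ("GGGGGG", "AACCGG"),
]
binned = demultiplex(reads, indexes)
```

Production pipelines typically tolerate one or more mismatches in the index tag; exact matching is used here only to keep the sketch short.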
[0131] As shown in FIG. 2F, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. The 5' end neighboring the gap of the missing strand in the single-stranded molecular barcode is phosphorylated. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0132] As shown in FIG. 2G, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0133] As shown in FIG. 2H, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. The 5' end neighboring the gap of the missing strand in the single-stranded molecular barcode is phosphorylated. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0134] As shown in FIG. 2I, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons. FIG. 8A shows an exemplary synthetic transposon of FIG. 2I having two non-identical short molecular barcodes.
[0135] As shown in FIG. 2J, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence F, and a second single-stranded linker disposed between the first strand and the second strand; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first single-stranded linker hybridizes to the second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker each comprises one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. The 5' end neighboring the gap of the missing strand in the single-stranded molecular barcode is phosphorylated. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0136] As shown in FIG. 2K, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence F, and a second single-stranded linker disposed between the first strand and the second strand; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first single-stranded linker hybridizes to the second single-stranded linker; and wherein the first single-stranded linker and the second single-stranded linker each comprise one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0137] As shown in FIG. 2L, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence F, and a second single-stranded linker disposed between the first strand and the second strand; and (3) a bridge nucleic acid; wherein the distal ends of the first stem and the second stem are both blunt ends; and wherein the first non-complementary region and the second non-complementary region are connected to each other via hybridization of the bridge nucleic acid to the first single-stranded linker and the second single-stranded linker. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. The 5' end adjacent to the gap left by the missing strand in the single-stranded molecular barcode is phosphorylated. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0138] As shown in FIG. 2M, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence F, and a second single-stranded linker disposed between the first strand and the second strand; and (3) a bridge nucleic acid; wherein the distal ends of the first stem and the second stem are both blunt ends; and wherein the first non-complementary region and the second non-complementary region are connected to each other via hybridization of the bridge nucleic acid to the first single-stranded linker and the second single-stranded linker. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0139] As shown in FIG. 2N, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal end of the first stem is a blunt end, and the distal end of the second stem has a hairpin structure; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. The 5' end adjacent to the gap left by the missing strand in the single-stranded molecular barcode is phosphorylated. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
[0140] As shown in FIG. 2O, in some embodiments, there is provided a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal end of the first stem is a blunt end, and the distal end of the second stem has a hairpin structure; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides. The two transposase recognition sites can have the same or different sequences. The two molecular barcodes may have the same or different sequences. Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid. The one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
Methods of preparing synthetic transposons
[0141] The synthetic transposons provided herein can be prepared by a variety of methods. In some embodiments, the synthetic transposons are prepared by direct synthesis, including chemical synthesis. Such methods are well known in the art, e.g., solid phase synthesis using phosphoramidite precursors such as those derived from protected 2'-deoxynucleosides, ribonucleosides, or nucleoside analogues. Synthetic transposons comprising modified nucleotides (such as uracil nucleotides, or 5-methyl dC) may also be chemically synthesized by including modified nucleotide building blocks in the oligo synthesis steps. Alternatively, for synthetic transposons having a 5-methyl C in a CpG sequence, an unmodified synthetic transposon may first be synthesized, and the 5-methyl group may be added to the target dC nucleobase using a CpG methyltransferase. Long oligos of up to 180-250 nucleotides (nt), as required in this application, can be obtained commercially from multiple sources such as IDT (Ultramers of up to 200 nt), Sigma-Aldrich (up to 180 nt), or Biosynthesis (Ubermers of up to 250 nt regularly, and possibly as long as 400 nt). Incorporation of modified bases such as LNA or PNA in some common sequences allows the use of shorter sequences with the same binding stability. Modified bases such as uracil can be incorporated easily to allow cleavage of the strand before library amplification. Incorporation of phosphorothioate bonds, for example, can help to minimize degradation of transposons by exonucleases or endonucleases during storage.
[0142] In some embodiments, the synthetic transposons are prepared by annealing two oligos, which are then subjected to extension by polymerases to provide the full product. Synthetic transposons having no molecular barcodes or having two different molecular barcodes can be prepared by such methods. For example, FIG. 8A shows a method of preparing an exemplary synthetic transposon of FIG. 2I having two different molecular barcodes. FIG. 8B shows a method of preparing an exemplary synthetic transposon of FIG. 2E having no molecular barcodes. Synthetic transposons with one or two hairpin structures can be conveniently prepared using a single long strand of oligonucleotide with complementary regions that hybridize to provide the synthetic transposons. In some embodiments, the synthetic transposons are PCR amplified with common primers, such as primers that hybridize to the stuff sequences, to prepare the synthetic transposons.
[0143] In some embodiments, the synthetic transposons are prepared by linking the non-complementary regions of two synthetic transposon fragments. In some embodiments, the synthetic transposon fragment is prepared by chemical synthesis. In some embodiments, the synthetic transposon fragment is prepared by extending chemically synthesized oligonucleotide(s) by a polymerase. For example, FIG. 8C shows a method of preparing an exemplary synthetic transposon fragment of FIG. 2C having a molecular barcode. FIG. 8D shows a method of preparing an exemplary synthetic transposon fragment of FIG. 2B having no molecular barcode.
[0144] Synthetic transposons having two molecular barcodes with the same sequences comprising randomly or degenerately designed nucleotides are prepared using a combination of chemical synthesis and extension by polymerase (also referred to as "primer extension") to obtain double-stranded molecular barcodes, and to ensure that the two molecular barcodes have the same sequences. In some embodiments, the synthetic transposons having identical paired molecular barcodes are prepared using starting oligos containing only one molecular barcode, followed by a first intramolecular or intermolecular priming to replicate the molecular barcode. In some embodiments, a second intramolecular or intermolecular priming is used to displace the replicated molecular barcode sequence. During the design of molecular barcodes, use of fixed or less random degenerate bases (such as R) in the molecular barcodes may eliminate or minimize the interaction of molecular barcodes with other sequences in the synthetic transposons. FIGs. 3-7 illustrate exemplary methods for preparing various synthetic transposons having two identical molecular barcodes that contain randomly or degenerately designed nucleotides.
[0145] For example, in FIG. 3, a first synthesized oligo (5'-301+302+303+304+305+U+306+303rc-3') is provided, which comprises a single-stranded molecular barcode region (302) having randomly or degenerately designed nucleotides. The first synthesized oligo is extended by a DNA polymerase, and the extension product is then denatured and hybridized to a second synthesized oligo comprising the adapter sequence 306 and the complementary sequence of stuff sequence 304 (i.e., 304rc). The hybridized oligos are then extended by a DNA polymerase to make the first fragment of the synthetic transposon, which is subsequently annealed to a third synthesized oligo comprising the adapter sequence 305 and the stuff sequence 303, and a fourth synthesized oligo comprising the transposase recognition sequence 301, to provide a synthetic transposon of FIG. 2F. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (302rc) provide a synthetic transposon of FIG. 2G.
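The trade-off noted above between fully random barcodes and barcodes containing fixed or less degenerate bases (such as R) can be made concrete by counting how many distinct sequences a barcode design can encode. The following is a minimal sketch only; the two 12-nt barcode patterns are hypothetical examples chosen for illustration and are not designs taken from this application.

```python
# Number of bases each IUPAC nucleotide code can stand for.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def barcode_diversity(pattern: str) -> int:
    """Count the distinct sequences a degenerate barcode pattern can encode."""
    total = 1
    for code in pattern.upper():
        total *= IUPAC_CHOICES[code]
    return total


# Hypothetical 12-nt designs: fully random versus every third base limited to R.
fully_random = barcode_diversity("N" * 12)
restricted = barcode_diversity("NNRNNRNNRNNR")
print(fully_random, restricted)
```

Restricting some positions lowers the number of available barcodes, so a design of this kind would typically be checked against the number of synthetic transposons expected to be inserted.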
[0146] In FIG. 4, a first synthesized oligo (5'-401+402+403+404+405+U+406+403rc-3') is provided, which comprises a single-stranded molecular barcode region (402) having randomly or degenerately designed nucleotides. The first synthesized oligo is extended by a DNA polymerase, and the extension product is then denatured and hybridized to a second synthesized oligo comprising the complementary sequence of stuff sequence 404 (i.e., 404rc), adapter sequence 406, a uracil nucleotide, adapter sequence 405, and stuff sequence 403. The hybridized oligos are then extended by a DNA polymerase to make the first fragment of the synthetic transposon connected to the second non-complementary region, which is subsequently annealed to a third synthesized oligo comprising the transposase recognition sequence 401, to provide a synthetic transposon of FIG. 2H. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (402rc) provide a synthetic transposon of FIG. 2I.
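The segment notation used for the synthesized oligos (numbered elements joined by "+", with "rc" denoting the reverse complement and "U" a uracil nucleotide) can be expanded into a concrete sequence as sketched below. This is an illustrative sketch only: the segment sequences are arbitrary placeholders, not sequences disclosed in this application, and the helper names are hypothetical.

```python
# Complement table; U pairs with A, and the degenerate code N stays N.
COMPLEMENT = str.maketrans("ACGTUN", "TGCAAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def assemble(layout: str, segments: dict) -> str:
    """Expand a layout such as '401+402+403rc' into a 5'-to-3' sequence."""
    parts = []
    for token in layout.split("+"):
        token = token.strip()
        if token == "U":
            parts.append("U")  # single cleavable uracil nucleotide
        elif token.endswith("rc"):
            parts.append(reverse_complement(segments[token[:-2]]))
        else:
            parts.append(segments[token])
    return "".join(parts)


# Placeholder segment sequences (hypothetical, for illustration only).
SEGMENTS = {
    "401": "AAACCCGGG",   # stands in for the transposase recognition sequence
    "402": "NNNNNNNN",    # stands in for the molecular barcode region
    "403": "GATTACA",     # stands in for a stuff sequence
    "404": "CCGGTTAA",    # stands in for a stuff sequence
    "405": "ACACACAC",    # stands in for an adapter sequence
    "406": "GTGAGTGA",    # stands in for an adapter sequence
}

oligo = assemble("401+402+403+404+405+U+406+403rc", SEGMENTS)
print(oligo)
```

In this layout the 3'-terminal 403rc segment is complementary to the internal 403 segment of the same oligo, which is the kind of arrangement that would permit the intramolecular priming described above.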
[0147] In FIG. 5, a first synthesized oligo (5'-501+502+503+504+505+506+507+508+505rc-3') and a second synthesized oligo (5'-503+506+507+508+509+504rc-3') are provided, which are hybridized and extended by a DNA polymerase. The 3' end of the first synthesized oligo is a reversibly blocked nucleotide. Alternatively, the 3' end of the first synthesized oligo may initially comprise a 3' phosphate group, which blocks extension of this 3' end during the first extension step. Prior to the second extension step, the 3' phosphate group can be removed by T4 polynucleotide kinase (T4 PNK) treatment, thereby allowing extension from this 3' end. After the deblocking of the 3' end of the first synthesized oligo, a further round of extension followed by hybridization to a third synthesized oligo comprising the transposase recognition sequence 501 provides a synthetic transposon of FIG. 2J. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (502rc) provide a synthetic transposon of FIG. 2K. Single-stranded linker sequences 507 and 507rc each have one or more cleavable nucleotides.
[0148] In FIG. 6, a first synthesized oligo (5'-601+602+603+604+605+606+607+608+605rc-3'), a second synthesized oligo (5'-603+606+609+608+610+604rc-3'), and a third synthesized oligo (5'-607rc+611+609rc, i.e., the bridge nucleic acid) are provided, which are denatured and hybridized, and then extended by a DNA polymerase. The 3' end of the first synthesized oligo is a reversibly blocked nucleotide. Alternatively, the 3' end of the first synthesized oligo may initially comprise a 3' phosphate group, which blocks extension of this 3' end during the first extension step. Prior to the second extension step, the 3' phosphate group can be removed by T4 polynucleotide kinase (T4 PNK) treatment, thereby allowing extension from this 3' end. After the deblocking of the 3' end of the first synthesized oligo, a further round of extension followed by hybridization to a fourth synthesized oligo comprising the transposase recognition sequence 601 provides a synthetic transposon of FIG. 2L. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (602rc) provide a synthetic transposon of FIG. 2M. The bridge nucleic acid may contain one or more cleavable nucleotides in the 607rc and 609rc fragments.
[0149] In FIG. 7, a single synthesized oligo (5'-703rc+706+U+705+704+703+702+701+707+701rc-3') is provided, in which the hairpin fragment 707 and the transposase recognition site 701 each have one or more cleavable nucleotides. The oligo is denatured, hybridized, and extended by DNA polymerase. The one or more cleavable nucleotides in 707 and 701 are then cleaved, and the product is denatured and hybridized to a second synthesized oligo (5'-705-703-U-706-704rc-3'). The duplex is then extended by DNA polymerase to provide a synthetic transposon of FIG. 2N. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (702rc) provide a synthetic transposon of FIG. 2O.
Methods of library preparation
[0150] One aspect of the present application provides a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with any one of the compositions comprising a plurality of synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0151] In some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with any one of the compositions comprising a plurality of synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) amplifying the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the amplifying is Whole Genome Amplification (WGA). In some embodiments, the amplifying is targeted amplification of loci of interest. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
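As a rough arithmetic illustration of the insertion frequency stated above (at least once per about 500 bases), the sketch below estimates a lower bound on the number of insertions for a target of a given length. The function names and the 50,000-base example are assumptions made for illustration; the fragment count assumes a single linear target molecule in which every inserted synthetic transposon is subsequently separated.

```python
def min_expected_insertions(target_length_bases: int,
                            bases_per_insertion: int = 500) -> int:
    """Lower bound on insertions at one insertion per `bases_per_insertion`."""
    return target_length_bases // bases_per_insertion


def fragments_from_linear_target(insertions: int) -> int:
    """Separating n insertions in one linear molecule yields n + 1 fragments."""
    return insertions + 1


# Hypothetical 50,000-base linear target.
insertions = min_expected_insertions(50_000)
print(insertions, fragments_from_linear_target(insertions))
```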
[0152] Thus, for example, in some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides and/or the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides, thereby separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
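The effect of cleaving at uracil nucleotides can be pictured with a simple string model, sketched below under stated assumptions: a strand is written 5' to 3' as text, each "U" marks a cleavable nucleotide, and cleavage is modeled as removal of the uracil with a strand break on both sides. The example strand and the function name are hypothetical, and the model ignores base pairing and the chemistry of the resulting ends.

```python
def cleave_at_uracil(strand: str) -> list[str]:
    """Split a strand at every U, dropping the excised uracil itself."""
    return [piece for piece in strand.upper().split("U") if piece]


# Hypothetical strand: two adapter-like stretches joined by a single uracil.
joined = "ACACACAC" + "U" + "GTGAGTGA"
print(cleave_at_uracil(joined))
```

In this model, a strand of one non-complementary region fused to a strand of the other through a uracil is resolved into two separate pieces, which corresponds to the separation of the two non-complementary regions in step (c).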
[0153] In some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker hybridize to each other; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) denaturing the repaired target nucleic acid to separate the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the first single-stranded linker and the second single-stranded linker each comprise one or more cleavable nucleotides. In some embodiments, the method further comprises contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0154] In some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein each synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker; wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) denaturing the repaired target nucleic acid to separate the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the method further comprises treating the denatured repaired target nucleic acid with an exonuclease. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0155] In some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; wherein each synthetic transposon has a different barcode sequence; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides and/or the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic 
acid; and (c) contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides, thereby separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0156] In some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, wherein the first single-stranded linker and the second single-stranded linker hybridize to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, 
nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) denaturing the repaired target nucleic acid to separate the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the first single-stranded linker and the second single-stranded linker each comprises one or more cleavable nucleotides. In some embodiments, the method further comprises contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0157] In some embodiments, there is provided a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein each synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker; wherein the first single-stranded linker and the second single-stranded linker are 
connected to each other via hybridization to the bridge nucleic acid; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) denaturing the repaired target nucleic acid to separate the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids. In some embodiments, the method further comprises treating the denatured repaired target nucleic acid with an
exonuclease. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0158] Further provided are libraries of template nucleic acids prepared by any of the methods described herein, or intermediates prepared in any of the methods, such as the inserted target nucleic acids or the repaired target nucleic acids. The repaired target nucleic acids can be stored without losing the contiguity information in the target nucleic acid.
[0159] In some embodiments, there is provided a barcoded target nucleic acid comprising a plurality of synthetic transposons inserted randomly or substantially randomly among the endogenous sequence of the barcoded target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence. 
In some embodiments, the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, each synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
[0160] FIGs. 9-11 show exemplary methods of preparing libraries of template nucleic acids using the synthetic transposons described herein. Additionally, synthetic transposons that do not have molecular barcodes (e.g., FIG. 2E and FIG. 8B) can be used for fragmentation and library construction. The intramolecular or intermolecular binding between the two transposase recognition sites and the transposase in a transposed target nucleic acid allows stable manipulation when needed, for example dilution of the transposon-integrated nucleic acids, without breaking them down before further processing.
[0161] In FIG. 9, a composition comprising a plurality of synthetic transposons of FIG. 2I each having a different barcode sequence is contacted with a target DNA and a transposase. The plurality of synthetic transposons are inserted into the target DNA, resulting in single-stranded gaps surrounding the insertion sites. The inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks. In the same step or in a separate step, the repaired target DNA is treated with UDG (such as USER™) to cleave the uracil nucleotide, yielding fragmented template nucleic acids, which are PCR amplified with primers having adapter sequences F and R, or their reverse complements. PCR amplification leads to two products that have different read orientations during sequencing. Additional adapter sequences, such as sequencing primer sequences, and/or index tags, may be introduced to each amplified nucleic acid by including the adapter sequences and index tags in the PCR primers. The amplified nucleic acid library can then be sequenced using any suitable massively parallel shotgun sequencing method (such as a next generation sequencing, or NGS, method). Alternatively, whole genome amplification (WGA) can be used to pre-amplify the repaired target DNA inserted with transposons without first cleaving the modified nucleotides. For example, WGA can be performed using either random hexamers or sequences complementary to F and/or R. One of skill in the art would readily appreciate that when WGA is used in the library preparation method, separation of the non-complementary regions is not a required step, and thus, synthetic transposons that do not have modified nucleotide(s) linking the two non-complementary regions can be used.
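Because the two molecular barcodes of each synthetic transposon have the same sequence, the two fragments released on either side of a cleaved transposon carry the same barcode at their shared junction. The following is a minimal illustrative sketch in Python, not part of the described methods, of how that shared barcode could be used to order fragments from one target molecule. The tuple layout, the function name `order_fragments`, and the barcode labels are hypothetical, and the sketch ignores strand orientation, sequencing errors, and barcode collisions.

```python
def order_fragments(fragments):
    """Order fragments of one molecule by matching junction barcodes.

    Each fragment is a tuple (left_barcode, label, right_barcode).
    None marks a molecule end that carries no transposon half.
    This layout is an assumption made for illustration only.
    """
    # Index each fragment by the barcode found at its left end.
    by_left = {left: (label, right) for left, label, right in fragments}
    ordered = []
    visited = set()
    barcode = None  # start at the fragment whose left end is a molecule end
    while barcode in by_left and barcode not in visited:
        visited.add(barcode)  # guard against cycles from barcode collisions
        label, right = by_left[barcode]
        ordered.append(label)
        if right is None:  # reached the right end of the molecule
            break
        barcode = right  # the neighbor carries this barcode at its left end
    return ordered


# Hypothetical example: three fragments reported in arbitrary order.
fragments = [
    ("BC2", "frag3", None),
    (None, "frag1", "BC1"),
    ("BC1", "frag2", "BC2"),
]
print(order_fragments(fragments))
```

In practice the barcodes would be read from the ends of the sequencing reads, and the transposase-duplicated sequence at each junction could serve as an additional consistency check.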
[0162] In FIG. 10, a composition comprising a plurality of synthetic transposons of FIG. 2G each having a different barcode sequence is contacted with a target DNA and a transposase. The plurality of synthetic transposons are inserted into the target DNA, resulting in single-stranded gaps surrounding the insertion sites. The inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks. In the same step or in a separate step, the repaired target DNA is treated with UDG (such as USER™) to cleave the uracil nucleotide, yielding fragmented template nucleic acids. The fragmented template nucleic acids are then hybridized to an oligonucleotide comprising a first sequence that is complementary to the first adapter sequence F and a second sequence that is complementary to the second adapter sequence R. The hybridized fragments are then treated with ligase to circularize the fragmented template nucleic acids. The circularized template nucleic acids can be further analyzed by RCA, or by single-molecule sequencing.
[0163] In FIG. 11, a composition comprising a plurality of synthetic transposons of FIG. 2M each having a different barcode sequence is contacted with a target DNA and a transposase; insertion of the synthetic transposons results in single-stranded gaps surrounding the insertion sites. The inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks. The repaired target DNA is then denatured, and subsequently treated with exonuclease to remove the bridge nucleic acids, thereby yielding a library of circularized template nucleic acids. The library of circularized template nucleic acids can then be analyzed by RCA or single-molecule sequencing.
[0164] The plurality of synthetic transposons can be inserted into target nucleic acids by the transposase that binds to the transposase recognition sites of the synthetic transposons. In some embodiments, the plurality of synthetic transposons and the transposase may be pre-mixed to form a complex composition comprising a plurality of complexes each comprising a transposase bound to a synthetic transposon prior to contacting the complex composition with the target nucleic acid. In some embodiments, the plurality of synthetic transposons and the transposase are contacted with the target nucleic acids simultaneously, but as separate compositions.
[0165] In some embodiments, synthetic transposons with molecular barcodes having high diversity, for example, comprising more than about any one of 5, 10, 15, 20, 25, or more randomly and/or degenerately designed nucleotides are used to ensure that each insertion site in the target nucleic acid has a different molecular barcode. In some embodiments, an excess amount of synthetic transposons is contacted with the target nucleic acid to ensure unique labeling of the sites in the target nucleic acid. In some embodiments, no more than about any one of 50%, 40%, 30%, 20%, 10%, 5%, 2%, 1%, 0.1%, 0.01%, 0.001%, 0.0001% or less of possible synthetic transposons with distinct molecular barcodes are inserted into the target nucleic acid. For example, 100 cells of human genomic DNA (about 0.6 ng) have a total of 300 × 10^9 base pairs. After insertion of synthetic transposons each having a molecular barcode comprising 25 randomly designed nucleotides at an average of 150-bp distance, a total of 2 × 10^9 synthetic transposons are inserted out of 10^15 possible distinct synthetic transposons available. Thus, there is a 1 in 500,000 chance to have identical synthetic transposons at two different sites in the barcoded genomic DNA. By combining the transposase duplicated sequences (e.g., the 9-nt duplicate sequence of Tn5 transposase) and the molecular barcode sequences, it would be easy to differentiate and align sequencing reads derived from neighboring fragments in a single target molecule.
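The arithmetic in the example above can be reproduced with a short Python sketch. The function name `barcode_stats` is hypothetical, and the calculation assumes four equally likely bases at each of the 25 barcode positions. With the exact value 4^25 (about 1.13 × 10^15) the ratio comes out near 1 in 560,000, which agrees with the rounded figure of 1 in 500,000 obtained from 10^15.

```python
def barcode_stats(total_bp, spacing_bp, barcode_len):
    """Return (insertions, possible_barcodes, barcodes_per_insertion).

    Assumes one insertion every `spacing_bp` bases on average and
    four equally likely bases at each barcode position.
    """
    insertions = total_bp // spacing_bp
    possible = 4 ** barcode_len
    return insertions, possible, possible / insertions


# Values taken from the example: 300 x 10^9 bp, 150-bp spacing, 25-nt barcode.
insertions, possible, ratio = barcode_stats(300 * 10**9, 150, 25)
print(f"insertions:             {insertions:.1e}")
print(f"possible barcodes:      {possible:.2e}")
print(f"barcodes per insertion: {ratio:,.0f}")
```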
[0166] As used herein, the term "at least a portion" or grammatical equivalents thereof can refer to any fraction of a whole amount. For example, "at least a portion" can refer to at least about any one of 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, 99.9% or 100% of a whole amount. In some embodiments, at least about any one of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99% or more of the plurality of synthetic transposons is inserted in the target nucleic acid.
[0167] The frequency (i.e., density) of the synthetic transposons inserted in the target nucleic acid can be controlled in various ways, including adjusting the contacting time and temperature, the amount of synthetic transposons, the type and amount of the transposase, and the composition of the buffer. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about any one of 10 kb, 5 kb, 4 kb, 3 kb, 2 kb, 1 kb, 900 bases, 800 bases, 700 bases, 600 bases, 500 bases, 400 bases, 300 bases, 250 bases, 200 bases, 150 bases, 100 bases, or fewer. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of once per any one of about 100 bases to about 200 bases, about 150 bases to about 250 bases, about 250 bases to about 500 bases, about 500 bases to about 750 bases, about 750 bases to about 1 kb, about 1 kb to about 5 kb, about 5 kb to about 10 kb, about 100 bases to about 1 kb, or about 100 bases to about 10 kb.
[0168] It should be recognized by persons skilled in the art that there is an increased sequencing cost associated with an increased density of synthetic transposon insertion. Insertion of 75-nucleotide synthetic transposons at a frequency of once per about 150 bases results in about 50% higher cost, based on the number of bases that need to be sequenced. By contrast, a barcoded target nucleic acid with the same synthetic transposons and an insertion frequency of once per about 300 bases results in about 25% higher sequencing cost than sequencing the non-barcoded target nucleic acid. Therefore, a tradeoff between sequencing cost and quality may be considered when using libraries prepared with the methods described herein. For example, synthetic transposons described herein may be particularly useful and effective for preparing sequencing libraries for whole genome sequencing requiring high quality (for example, an error rate lower than about 1 in 10^6 bases), targeted capture sequencing, or microbiome sequencing in a clinical setting. With advancements in sequencing technologies, the sequencing cost per base has been dropping, and we expect that per-base sequencing cost will not be the main cost for many of the applications described herein in the future.
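The cost figures above follow from the ratio of transposon length to insertion spacing. A minimal Python sketch of that estimate is given below; the function name `sequencing_overhead` is hypothetical, and the estimate counts only the extra transposon bases, ignoring read-length effects and the duplicated target sequence at each insertion site.

```python
def sequencing_overhead(transposon_len, spacing):
    """Fraction of extra bases to sequence relative to the bare target.

    One transposon of `transposon_len` bases is assumed to be inserted
    every `spacing` bases of target on average.
    """
    return transposon_len / spacing


for spacing in (150, 300):
    extra = sequencing_overhead(75, spacing)
    print(f"once per {spacing} bases: about {extra:.0%} more bases")
```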
[0169] The target nucleic acid can include any nucleic acid of interest. Target nucleic acids can include DNA, RNA, peptide nucleic acid, morpholino nucleic acid, locked nucleic acid, glycol nucleic acid, threose nucleic acid, mixtures thereof, and hybrids thereof. In some embodiments, the target nucleic acid is genomic DNA, such as a whole genome, part of a genome (e.g., individual chromosomes or fragments thereof), or mixed genomes (e.g., a microbiome). Intact chromosomes in live cells or isolated intact chromosomes can be used to achieve the longest possible contigs for any given species. Careful isolation of intact chromosomes has been demonstrated previously (e.g., Howe B. et al., Chromosome preparation from cultured cells. J. Vis. Exp. 83: e50203, 2014). In some embodiments, the target nucleic acid is mitochondrial DNA. In some embodiments, the target nucleic acid is chloroplast DNA. In some embodiments, the target nucleic acid is cDNA, or synthetic or modified DNA obtained after certain chemical or enzymatic treatments, including bisulfite treatment (e.g., for CpG methylation detection).
[0170] The target nucleic acid can be of any length. The synthetic transposons and the methods described herein are particularly useful for preparing barcoded libraries to be sequenced and assembled to analyze long, contiguous target nucleic acids having a length of at least about any one of 10 kb, 20 kb, 50 kb, 100 kb, 200 kb, 500 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 20 Mb, 50 Mb, 100 Mb, 200 Mb, or more. The target nucleic acid can comprise any nucleotide sequences. In some embodiments, the target nucleic acid comprises homopolymer sequences. The target nucleic acid can also include repeat sequences. Repeat sequences can be any of a variety of lengths including, for example, at least about any one of 2, 5, 10, 20, 30, 40, 50, 100, 250, 500, 1000 nucleotides or more. Repeat sequences can be repeated, either contiguously or non-contiguously, any of a variety of times including, for example, at least about any one of 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20 times or more.
[0171] In some embodiments, the plurality of synthetic transposons is inserted in a single target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted in a plurality of target nucleic acids. In such embodiments, a plurality of target nucleic acids can include a plurality of the same target nucleic acids, a plurality of different target nucleic acids wherein some target nucleic acids are the same, or a plurality of target nucleic acids wherein all target nucleic acids are different. Embodiments that involve a plurality of target nucleic acids can be carried out in multiplex formats such that reagents can be delivered simultaneously to the target nucleic acids, for example, in one or more compartments or on an array surface. In some embodiments, the plurality of target nucleic acids can include substantially all of a particular organism's genome. The plurality of target nucleic acids can include at least a portion of a particular organism's genome, including, for example, at least about any one of 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome. In particular embodiments, the portion can have an upper limit that is at most about any one of 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
[0172] Target nucleic acids can be obtained from any source. For example, target nucleic acids may be prepared from nucleic acid molecules obtained from a single organism or from populations of nucleic acid molecules obtained from natural sources that include one or more organisms. Sources of nucleic acid molecules include, but are not limited to, organelles, cells, tissues, organs, or organisms. Cells that may be used as sources of target nucleic acid molecules may be prokaryotic (bacterial cells, for example, Escherichia, Bacillus, Serratia, Salmonella, Staphylococcus, Streptococcus, Clostridium, Chlamydia, Neisseria, Treponema, Mycoplasma, Borrelia, Legionella, Pseudomonas, Mycobacterium, Helicobacter, Erwinia, Agrobacterium, Rhizobium, and Streptomyces genera); archaeal, such as crenarchaeota, nanoarchaeota, or euryarchaeota; or eukaryotic, such as fungi (for example, yeasts), plants, protozoans and other parasites, and animals (including insects (for example, Drosophila spp.), nematodes (for example, Caenorhabditis elegans), and mammals (for example, rat, mouse, monkey, non-human primate and human)).
[0173] In some embodiments, the target nucleic acid has bases that were damaged or modified before, during, or after preparation due to aging or exposure to acid, heat, or radiation. These modifications include nicks, abasic sites, thymidine dimers, oxidized guanines and pyrimidines, and deaminated cytosines. If left untreated, these modifications could prevent further amplification and sequencing, and affect the accurate counting of the target nucleic acids and sequence quality. Commercial repair kits are available, including NEB's PreCR Repair Mix (Cat# M0309S) and Sigma's Restorase (with DNA polymerase, Cat# R1028). These reagents often include DNA repair enzymes (such as uracil DNA glycosylase, Fpg, T4 Endonuclease V, Endonuclease IV and Endonuclease VIII), DNA polymerases, and ligases (such as Taq DNA ligase). Enzymes such as ligase can ligate nicks in double-stranded DNA.
[0174] In some embodiments, a transposase (such as Tn5 transposase) binds the transposase recognition sites, makes staggered cuts at random sites in a target nucleic acid, and inserts synthetic transposons at the cut sites, resulting in a pair of single-stranded gaps of a fixed length flanking the inserted synthetic transposon sequence in the target nucleic acid. The single-stranded gaps have duplicated sequences derived from the target nucleic acid. The duplicated sequences are characteristic for each transposase; for example, the duplicated sequences are 9-nt long for Tn5 transposase, 5-nt long for Tn7 and Mu transposases, 4-nt long for murine leukemia virus, and 2-nt long for the Tc1/mariner family. Transposition events are random or substantially random. For example, some studies show certain transposition biases (see, e.g., Green B. et al., "Insertion site preference of Mu, Tn5, and Tn7 transposons" Mobile DNA 3:3, 2012).
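The duplicated sequences described above can be illustrated with a simplified single-strand string model in Python. This is only a sketch under stated assumptions: the function name `insert_with_duplication`, the placeholder transposon string, and the toy target sequence are hypothetical, and the model shows the top strand after gap repair rather than the double-stranded intermediate with its two single-stranded gaps.

```python
# Duplication lengths as listed in the text above.
DUPLICATION_NT = {"Tn5": 9, "Tn7": 5, "Mu": 5, "MLV": 4, "Tc1/mariner": 2}


def insert_with_duplication(target, cut_pos, transposon, dup_len):
    """Return the repaired top strand after a staggered-cut insertion.

    The `dup_len` bases starting at `cut_pos` lie between the two
    staggered cuts, so after gap filling they appear on both sides
    of the inserted transposon.
    """
    if not 0 <= cut_pos <= len(target) - dup_len:
        raise ValueError("cut site does not fit inside the target")
    left = target[:cut_pos + dup_len]   # ends with the duplicated bases
    right = target[cut_pos:]            # starts with the duplicated bases
    return left + transposon + right


target = "AAAACCCCGGGGTTTTACGT"  # toy sequence, for illustration only
product = insert_with_duplication(target, 4, "[TN]", DUPLICATION_NT["Tn5"])
print(product)
```

In this toy example the nine bases starting at position 4 of the target flank the placeholder transposon on both sides, which is the signature that can be combined with the molecular barcode to pair neighboring fragments.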
[0175] The target nucleic acids inserted with the synthetic transposons can be repaired with a polymerase without strand displacement activity and a ligase in vitro to provide repaired target nucleic acids. The polymerase without strand displacement activity allows gap filling of any single-stranded nucleic acid created surrounding the insertion sites (such as single-stranded gaps having duplicated sequences endogenous to the target nucleic acid). The ligase allows nick sealing for nicks having a 5' phosphate. The gap filling reaction catalyzed by the polymerase without strand displacement, and the ligation reaction catalyzed by the ligase can be carried out in a single step, or in separate steps comprising first contacting the target nucleic acid inserted with the synthetic transposons with the polymerase without strand displacement activity and nucleotides, followed by contacting the resulting product with the ligase. Many polymerases and ligases may be suitable for this step. In some embodiments, the polymerase is T4 DNA polymerase.
[0176] The repaired target nucleic acid is then fragmented by separating the first non-complementary region from the second non-complementary region in each inserted synthetic transposon. A suitable separation step may be chosen based on the nature of the connection between the first non-complementary region and the second non-complementary region. For example, in some embodiments, an endonuclease, or a combination of endonuclease with lyase may be used to cleave one or more cleavable nucleotides that are used to fuse the first strands or the second strands of the non-complementary regions, to cleave one or more cleavable nucleotides in the single-stranded linkers of the non-complementary regions, or to cleave one or more cleavable nucleotides in the bridge nucleic acid. In some embodiments, wherein the one or more cleavable nucleotides are uracil nucleotides, the repaired target nucleic acid may be treated with a combination of UDG and DNA lyase, such as USER™, to separate the non-complementary regions. The endonuclease treatment step may occur simultaneously with the repair step, or after the repair step.
[0177] In some embodiments, wherein the first non-complementary region is connected to a second non-complementary region via hybridization of two single-stranded linkers in the non-complementary regions, or via hybridization of a bridge nucleic acid to the two single-stranded linkers in the non-complementary regions, the repaired target nucleic acid may be denatured, for example by heating, and/or contacting with a denaturing buffer (e.g., formamide), to separate the non-complementary regions. In some embodiments, the fragmented target nucleic acids are further treated with an exonuclease, such as a single-strand DNA exonuclease, to remove the bridge nucleic acid and/or other undesired single-stranded nucleic acid. In some embodiments, the repaired target nucleic acid is both contacted with an endonuclease to cleave the one or more cleavable nucleotides and subjected to denaturing conditions to separate the non-complementary regions.
[0178] The nucleic acid fragments (also referred to herein as "template nucleic acids") obtained after the step of separating the non-complementary regions may be used directly for single-molecule sequencing, or amplified by PCR or Rolling Circle Amplification (RCA). PCR can be used to amplify a small number of copies of template DNA to generate thousands to millions of copies of a particular DNA sequence. It usually requires two short oligonucleotides as primers (e.g., 18-36mers) and a heat-stable DNA polymerase (e.g., Taq DNA polymerase) in the presence of dNTPs and buffer. Generally, it starts with an initial heating step (e.g., 95 °C for 2 min) to denature the template DNA, followed by usually 20-40 cycles of denaturation (e.g., 95 °C for 20 sec), annealing (e.g., 60 °C for 30 sec) and extension (e.g., 72 °C for 30 sec to 3 min) to amplify the target DNA exponentially. A final elongation step (e.g., 72 °C for 7 min) is also used for most applications, followed by a final hold at low temperature (e.g., 4-15 °C). PCR has many applications, including disease diagnosis and forensic identification, and many variations are available, including multiplex PCR, digital PCR, and allele-specific PCR.
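As a rough illustration of the cycling scheme just described, the Python sketch below adds up the programmed step times and gives the theoretical upper bound on copy number from perfect doubling in every cycle. The function names and the choice of 30 cycles with a 60-second extension are assumptions made for the example; instrument ramp times and real amplification efficiencies, which are below 100%, are ignored.

```python
def program_seconds(cycles, denature_s=20, anneal_s=30, extend_s=60,
                    initial_s=120, final_s=420):
    """Total programmed time of a PCR run, excluding ramping and hold.

    Defaults follow the example values in the text: 2 min initial
    denaturation, 20 s / 30 s cycling steps, 7 min final elongation.
    The 60 s extension is an assumed value within the 30 s-3 min range.
    """
    per_cycle = denature_s + anneal_s + extend_s
    return initial_s + cycles * per_cycle + final_s


def ideal_copies(start_copies, cycles):
    """Upper bound on copies, assuming perfect doubling each cycle."""
    return start_copies * 2 ** cycles


cycles = 30  # assumed; the text gives a range of 20-40 cycles
print(f"programmed time: {program_seconds(cycles) / 60:.0f} min")
print(f"ideal copies from one template: {ideal_copies(1, cycles):.2e}")
```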
[0179] RCA is an isothermal enzymatic process where long strand nucleic acid sequences containing multiple copies are synthesized from circular molecules of DNA or RNA, such as plasmids, bacteriophages or circular RNA genome of viroids. Kits are available to use RCA technology to amplify circular nucleic acids from small or limited amount of samples in hours at a constant temperature without thermal cycling, for example, TempliPhi from GE Healthcare. Some NGS platforms, such as Complete Genomics, can directly sequence RCA products. For RCA or single-molecule sequencing that involves RCA, the template nucleic acids that are not circular may be circularized first. For example, an oligonucleotide comprising sequences that are complementary to the first adapter sequence and the second adapter sequence may be used to anneal to the non-complementary regions after the separation step, followed by treatment with ligase to circularize the nucleic acid fragments. In some embodiments, the template nucleic acids are amplified by RCA using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
[0180] In some embodiments, the template nucleic acids are amplified by PCR using a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof. In some embodiments, the template nucleic acids are amplified by PCR using a first primer having the same sequence as the first adapter sequence, and a second primer having the same sequence as the second adapter sequence. In some embodiments, the template nucleic acids are amplified by PCR using a first primer having a sequence complementary to the first adapter sequence, and a second primer having a sequence complementary to the second adapter sequence. In some embodiments, the whole-genome sequence is amplified. In some embodiments, primers that selectively hybridize to sequences of interest, such as exome probes, may be used for amplification of targeted sequences. In some embodiments, additional adapters and/or sample tags (also referred to herein as "index tags") may be included in the primers for amplification. The amplification step may need a long annealing/extension time to obtain products of appropriate size. The method may further comprise purification step(s) to remove short, unwanted products containing only the transposon sequences.
[0181] In some embodiments, the method does not comprise a step of separating the non-complementary regions, and the method comprises subjecting the repaired target nucleic acid to whole genome amplification (WGA) to provide the library of template nucleic acids. WGA is a method for robust amplification of an entire genome, starting with a small amount of DNA, and can result in thousands- to millions-fold amplification. WGA may be especially useful for preparing a library of template nucleic acids for sequencing from a limited or precious sample, such as a single cell. Exemplary techniques used for WGA include, but are not limited to, Multiple Displacement Amplification (MDA), Degenerate Oligonucleotide PCR (DOP-PCR) and Primer Extension Preamplification (PEP). Exemplary commercial kits for WGA include ILLUSTRA™ Single Cell GenomiPhi DNA Amplification kit from GE Healthcare, PICOPLEX™ WGA kit from Rubicon Genomics and New England BioLabs, REPLI-G™ Single Cell WGA kit from Qiagen, and GENOMEPLEX™ Complete Whole Genome Amplification kit from Sigma.
[0182] In some embodiments, the method may comprise a dilution step to separate the nucleic acid sample, such as the target nucleic acid, the inserted target nucleic acid, the repaired target nucleic acid, or the template nucleic acids, into a plurality of compartments (such as wells in a multi-well plate). In some embodiments, the nucleic acid sample is diluted into at least about any of 5, 10, 20, 50, 100, 200, 300, 500 or more compartments to allow subsequent steps in the methods, such as amplification, to be carried out within the individual compartments. In some embodiments, each compartment comprises no more than about any of 5000, 1000, 500, 200, 100, 50, 20, 10, 5, or fewer molecules. Compartment tags may be introduced to the template nucleic acids in the amplification step. Samples from the compartments can be pooled together for sequencing, and the sequencing reads may be de-multiplexed using the compartment tags. The dilution may facilitate mapping of sequencing reads to individual target nucleic acids or segments thereof.
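The de-multiplexing of pooled reads by compartment tag can be illustrated with a minimal sketch. It assumes, purely for illustration, that the compartment tag occupies a fixed number of bases at the start of each read; the tag sequences, the tag length, and the function name demultiplex_by_compartment are hypothetical and are not specified by the methods described herein.

```python
from collections import defaultdict

def demultiplex_by_compartment(reads, tag_to_compartment, tag_len):
    """Group reads by the compartment tag found in their first tag_len bases.

    Assumption (illustrative only): the tag sits at the 5' end of the read.
    Reads whose tag is not recognized are collected under the key None.
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        bins[tag_to_compartment.get(tag)].append(insert)
    return dict(bins)

# Hypothetical tags for two wells of a multi-well plate
tags = {"ACGT": "well_A1", "TGCA": "well_A2"}
pooled = ["ACGTGGGCCC", "TGCATTTAAA", "ACGTCCCGGG", "NNNNAAAA"]
by_well = demultiplex_by_compartment(pooled, tags, tag_len=4)
```

In this sketch, reads with an unrecognized tag are kept under a separate key rather than discarded, so that the fraction of unassignable reads can be inspected.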
Methods of analysis
[0183] The present application further provides methods of analyzing a target nucleic acid by sequencing libraries of template nucleic acids prepared using any of the methods described above.
[0184] In some embodiments, there is provided a method of analyzing a target nucleic acid, or sequencing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using any one of the methods described in the "Methods of library preparation" section; and (b) sequencing the library of template nucleic acids to obtain sequencing reads. In some embodiments, wherein each synthetic transposon comprises a different barcode sequence, the method further comprises assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the molecular barcodes of the synthetic transposons in the template nucleic acids. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing.
[0185] In some embodiments, there is provided a method of analyzing a target nucleic acid, comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non- complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non- complementary region are connected to each other; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; and (d) sequencing the library of template nucleic acids to obtain sequencing reads. In some embodiments, the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). 
In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0186] In some embodiments, there is provided a method of analyzing a target nucleic acid, comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to 
provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; and (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
[0187] In some embodiments, there is provided a method of analyzing a target nucleic acid, comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to 
provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; (e) identifying sequences of the synthetic transposons in the sequencing reads; (f) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (g) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and the duplicated sequences are used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
Sequencing
[0188] The library of template nucleic acids prepared using the methods described in the "Methods of library preparation" section can be sequenced directly or subject to any one or more of library construction steps known in the art, including, but not limited to, end repair, ligation to adapters, amplification, and sample tag addition. In some embodiments, the library construction method comprises an exome capture step.
[0189] The methods described herein can be used in conjunction with a variety of sequencing techniques and platforms. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing method is a massively parallel shotgun sequencing method. In some embodiments, the sequencing method yields short sequencing reads, such as sequencing reads of no more than about any one of 500 bases, 400 bases, 300 bases, 250 bases, 200 bases, 150 bases, 100 bases, or fewer. Exemplary sequencing platforms include, but are not limited to, Roche 454 platforms, Illumina HISEQ™, MISEQ™, and NEXTSEQ™ platforms, Life Technologies SOLID™ platforms, ION TORRENT™ platforms, and Pacific Biosciences PacBio RS platforms.
[0190] Some embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) "A sequencing method based on real-time pyrophosphate." Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons.
[0191] In another example type of sequencing by synthesis (SBS) technique, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in U.S. Pat. No. 7,427,673, U.S. Pat. No. 7,414,116 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744 (filed in the United States Patent and Trademark Office as U.S. Ser. No. 12/295,337), each of which is incorporated herein by reference in its entirety. The availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
[0192] Additional example SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199 and PCT Publication No. WO 07/010,251, the disclosures of which are incorporated herein by reference in their entireties.
[0193] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate short oligonucleotides and identify the incorporation of such short oligonucleotides. Example systems and methods for such techniques which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0194] Some embodiments can include techniques such as next-next technologies. One example can include nanopore sequencing techniques (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis." Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope." Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). In some such embodiments, nanopore sequencing techniques can be useful to confirm sequence information generated by the methods described herein.
[0195] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference in their entireties) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference in its entirety) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference in their entireties). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). In one example, single molecule, real-time (SMRT) DNA sequencing technology provided by Pacific Biosciences Inc. can be utilized with the methods described herein. In some embodiments, a SMRT chip or the like may be utilized (U.S. Pat. Nos. 7,181,122, 7,302,146, 7,313,308, incorporated by reference in their entireties).
A SMRT chip comprises a plurality of zero-mode waveguides (ZMW). Each ZMW comprises a cylindrical hole tens of nanometers in diameter perforating a thin metal film supported by a transparent substrate. When the ZMW is illuminated through the transparent substrate, attenuated light may penetrate the lower 20-30 nm of each ZMW creating a detection volume of about 1×10⁻²¹ L. Smaller detection volumes increase the sensitivity of detecting fluorescent signals by reducing the amount of background that can be observed.
[0196] SMRT chips and similar technology can be used in association with nucleotide monomers fluorescently labeled on the terminal phosphate of the nucleotide (Korlach J. et al., "Long, processive enzymatic DNA synthesis using 100% dye-labeled terminal phosphate-linked nucleotides." Nucleosides, Nucleotides and Nucleic Acids, 27:1072-1083, 2008; incorporated by reference in its entirety). The label is cleaved from the nucleotide monomer on incorporation of the nucleotide into the polynucleotide. Accordingly, the label is not incorporated into the polynucleotide, increasing the signal-to-background ratio. Moreover, the need for conditions to cleave a label from a labeled nucleotide monomer is reduced.
[0197] An additional example of a sequencing platform that may be used in association with some of the embodiments described herein is provided by Helicos Biosciences Corp. In some embodiments, true single molecule sequencing can be utilized (Harris T. D. et al., "Single Molecule DNA Sequencing of a viral Genome" Science 320:106-109 (2008), incorporated by reference in its entirety). In one embodiment, a library of target nucleic acids can be prepared by the addition of a 3' poly(A) tail to each target nucleic acid. The poly(A) tail hybridizes to poly(T) oligonucleotides anchored on a glass cover slip. The poly(T) oligonucleotide can be used as a primer for the extension of a polynucleotide complementary to the target nucleic acid. In one embodiment, fluorescently-labeled nucleotide monomers, namely, A, C, G, or T, are delivered one at a time to the target nucleic acid in the presence of DNA polymerase. Incorporation of a labeled nucleotide into the polynucleotide complementary to the target nucleic acid is detected, and the position of the fluorescent signal on the glass cover slip indicates the molecule that has been extended. The fluorescent label is removed before the next nucleotide is added to continue the sequencing cycle. Tracking nucleotide incorporation in each polynucleotide strand can provide sequence information for each individual target nucleic acid.
Analysis
[0198] Sequencing reads can be analyzed with various methods. In some embodiments, an automated process, such as computer software, is used to analyze the sequencing reads to provide a contiguous sequence of the target nucleic acid. Analysis software can be developed from scratch or based on current bioinformatics tools to include molecular barcode identification and clustering algorithms described herein for sequence assembly (de novo or using a reference).
[0199] Data analysis of the sequencing reads includes at least the following three steps: 1) find identical (or near-identical, for example, with a 1-base difference to accommodate sequencing error) molecular barcodes, optionally together with the surrounding transposase recognition site and other stuff sequences, to combine reads with the same barcode into 1 molecule (error-correction); 2) use molecular barcodes to link molecules together with haplotype information with the help of the duplicated sequences generated during transposition (mBCs-assisted contig assembly); and 3) use actual target sequences, especially variants, to help confirm the assembly of the molecules (validation). Such a process can remove polymerase extension errors or recombination introduced during amplification and sequencing. Additionally, the process allows absolute molecule counting. Furthermore, cross-contamination from one sample to another sample in the lab can be removed or reduced by using the molecular barcodes.
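Step 1 above (grouping identical or near-identical molecular barcodes) can be sketched as follows. This is only one possible approach, a greedy clustering by Hamming distance in which the most abundant barcodes are taken as cluster seeds; the function names, the 8-base example barcodes, and the choice of a greedy strategy are assumptions made for illustration and are not prescribed by the methods described herein.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_barcodes(barcodes, max_mismatch=1):
    """Greedy clustering: the most abundant barcodes become seeds first.

    Returns a dict mapping each observed barcode to its seed barcode.
    """
    counts = Counter(barcodes)
    seeds = []
    assignment = {}
    # most_common() orders barcodes by decreasing read count
    for bc, _ in counts.most_common():
        for seed in seeds:
            if len(seed) == len(bc) and hamming(seed, bc) <= max_mismatch:
                assignment[bc] = seed  # treated as a sequencing error of seed
                break
        else:
            seeds.append(bc)
            assignment[bc] = bc
    return assignment

# Hypothetical barcodes: one read carries a single-base error
observed = ["AACCGGTT"] * 5 + ["AACCGGTA"] + ["TTGGCCAA"] * 3
assignment = cluster_barcodes(observed)
distinct_barcodes = len(set(assignment.values()))
```

Counting the distinct seeds after clustering gives a simple form of the absolute counting mentioned above, since reads that differ only by a tolerated barcode error are collapsed into one group.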
[0200] FIG. 12 shows an exemplary data analysis pipeline. Generally, high quality paired-end sequencing reads are used. For small genomes, targeted panels or metagenomes, it is possible to combine multiple samples or experiments in a single high-throughput sequencing run, and the sequencing data are first de-multiplexed into separate sample folders. Next, reads with near-identical molecular barcodes and target sequence similarity in a sample are clustered into individual, original target nucleic acid molecules. Two or more paired-end reads are required to cluster into a single molecule, and the reads that fail to cluster are mostly singletons. Reads per molecule can also be calculated. Using molecular barcodes and duplicated sequences generated during transposition (e.g., 9-bp sequence by Tn5), molecules can be linked into individual contigs. With deeper sequencing coverage, the length of individual contigs can be increased. Each contig represents a sequence from one of the 2 haploid chromosomes in a diploid genome. Either de novo assembly of all contigs or alignment to a reference genome can be performed to generate a final consensus sequence with high quality and information on coverage, gaps, mutations, copy numbers, etc.
[0201] Gaps or outliers in the sequences may be present and may limit the contig size. Gaps are due to several factors. First, with bias and randomness in transposition, there may be long sequences between 2 transposition sites. Consequently, the middle regions of some long sequences may not be sequenced, or the whole long fragments may be missed, especially on sequencing platforms producing short read lengths. Second, as sampling of the molecules is rather random, some fragments may be missed if not all are sampled. Third, the quality of the starting nucleic acids, including fragmentation, base modification and nicks that are not repaired during the library construction process, can lead to gaps in sequences. For example, even with high efficiency, the gap-filling extension or nick ligation may not be 100% efficient during library preparation, and fragments with gaps may be missed in sequencing. At the single genome level, the factors above contribute to incomplete sequences. However, with multiple cells or genome input molecules (for example, 50 equivalent genomes), such problems are significantly reduced, as a long sequence gap in one genome can be covered by another molecule. With more sequencing coverage, fewer gaps will be present. Longer sequencing reads will also help. Additionally, the frequency of transposition can also be increased to reduce large gaps. If multiple cells are used for sequencing, it is possible to have some cell-to-cell differences in the sequences, which allows analysis of sequence variation at the single cell level, although this is limited by the contig size that can be achieved.
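The linking of molecules into contigs using molecular barcodes and the duplicated sequences generated during transposition can be sketched as below. The sketch assumes, for illustration only, that each molecule is represented as a (left barcode, sequence, right barcode) tuple on a common strand, that each barcode is unique to one transposon, and that the duplication is 9 bases long; the function name link_fragments and the example sequences are hypothetical. A missing fragment simply ends the chain, which mirrors how gaps limit contig size.

```python
def link_fragments(fragments, dup_len=9):
    """Chain fragments whose right barcode equals another's left barcode.

    fragments: list of (left_barcode, sequence, right_barcode); None marks
    an end that carries no transposon. Assumes a common strand and one
    unique barcode per transposon (illustrative simplifications).
    """
    by_left = {l: (s, r) for l, s, r in fragments if l is not None}
    rights = {r for _, _, r in fragments if r is not None}
    contigs = []
    for left, seq, right in fragments:
        # A contig starts at a fragment that no other fragment leads into
        if left is not None and left in rights:
            continue
        contig = seq
        while right in by_left:
            next_seq, right = by_left[right]
            # Neighbours share the duplicated target bases; keep one copy
            if contig[-dup_len:] == next_seq[:dup_len]:
                next_seq = next_seq[dup_len:]
            contig += next_seq
        contigs.append(contig)
    return contigs

# Hypothetical target with two insertion sites (D1 and D2 are duplicated)
P1, D1, P2, D2, P3 = "TTGACA", "ACGTTGCAA", "GGCTA", "TCCGATAGG", "CATTG"
target = P1 + D1 + P2 + D2 + P3
fragments = [
    ("BC1", D1 + P2 + D2, "BC2"),
    ("BC2", D2 + P3, None),
    (None, P1 + D1, "BC1"),
]
contigs = link_fragments(fragments)
```

With all three fragments present, the chain reconstructs the original target; removing the middle fragment from the input would instead yield two shorter contigs.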
[0202] In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, step (ii) comprises aligning sequencing reads having the same molecular barcodes in the synthetic transposons and the same duplicated sequences of the single-stranded gaps to provide aligned sequencing reads, and/or step (iii) comprises clustering the sequencing reads based on the molecular barcodes in the synthetic transposons and the duplicated sequences of the single-stranded gaps. In some embodiments, step (iii) comprises deriving a contig from the clustered sequencing reads and removing the sequences of the synthetic transposons (and if applicable, one copy of the duplicated sequences of the single-stranded gaps) from the contig to provide the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
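Steps (i) and (ii) above, identifying the synthetic transposon sequence in each read and grouping reads that carry the same molecular barcode, can be sketched as follows. The read layout (barcode, then transposase recognition site, then target sequence), the 8-base barcode length, and the use of the 19-base Tn5 mosaic end as the recognition site are assumptions chosen for illustration; actual layouts depend on the transposon design and the sequencing primers used.

```python
import re
from collections import defaultdict

ME = "AGATGTGTATAAGAGACAG"  # Tn5 mosaic end, used as an example site
BARCODE_LEN = 8             # hypothetical barcode length

# Assumed read layout: <barcode><recognition site><target sequence>
READ_PATTERN = re.compile(rf"^([ACGT]{{{BARCODE_LEN}}}){ME}([ACGT]+)$")

def group_reads_by_barcode(reads):
    """Return {barcode: [target sequences]} with transposon bases removed."""
    groups = defaultdict(list)
    for read in reads:
        m = READ_PATTERN.match(read)
        if m:  # reads without a recognizable transposon are skipped
            barcode, target_seq = m.groups()
            groups[barcode].append(target_seq)
    return dict(groups)

reads = [
    "AACCGGTT" + ME + "GATTACA",
    "AACCGGTT" + ME + "GATTACA",
    "TTGGCCAA" + ME + "CCCGGG",
    "ACGTACGTACGT",  # no recognition site
]
groups = group_reads_by_barcode(reads)
```

Because the pattern captures only the bases downstream of the recognition site, the grouped sequences are already free of transposon sequence, which corresponds to the removal described for step (iii).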
[0203] In some embodiments, wherein the template nucleic acids each (except for those derived from the ends of the target nucleic acid) comprise a first synthetic transposon comprising a first molecular barcode at one end and a second synthetic transposon comprising a second molecular barcode at the other end, the sequencing reads are assembled to provide a contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same first molecular barcode and the same second molecular barcode; (iii) determining a consensus sequence for each group of aligned sequencing reads; (iv) linking the consensus sequences together based on the molecular barcodes in the synthetic transposons to provide a contig; and (v) removing the sequences of the synthetic transposons (and if applicable, one copy of the duplicated sequences of the single-stranded gaps) from the contig. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, step (ii) comprises aligning sequencing reads having the same first molecular barcodes, the same second molecular barcodes, and the same duplicated sequences of the single-stranded gaps; and/or step (iv) comprises linking the consensus sequences together based on the molecular barcodes in the synthetic transposons and the duplicated sequences of the single-stranded gaps to provide the contig. In some embodiments, a consensus sequence is determined for each group having at least three aligned sequencing reads. In some embodiments, a mismatch nucleotide in a group of aligned sequencing reads is considered to be an amplification or sequencing error if no more than 1/3 of the aligned sequencing reads in the group have the mismatch nucleotide.
In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
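The consensus rule described above (at least three aligned reads per group, and a mismatch treated as an error when no more than 1/3 of the reads carry it) can be sketched as follows. This is a simplified illustration that assumes the reads in a group are already aligned and of equal length. Pooling all non-majority bases in a column and reporting unresolved columns as "N" are choices made for this sketch only.

```python
from collections import Counter

MIN_READS = 3  # groups with fewer aligned reads get no consensus


def consensus(aligned_reads):
    """Column-wise consensus for equal-length, pre-aligned reads.

    Returns (consensus_string, corrected_positions), or None when the group
    has fewer than MIN_READS reads. A column is corrected to the majority
    base when no more than 1/3 of the reads disagree with it; otherwise the
    column is reported as 'N' (an illustrative choice, not from the source).
    """
    n = len(aligned_reads)
    if n < MIN_READS:
        return None
    bases, corrected = [], []
    for i, column in enumerate(zip(*aligned_reads)):
        base, count = Counter(column).most_common(1)[0]
        mismatches = n - count
        if mismatches == 0:
            bases.append(base)
        elif 3 * mismatches <= n:  # no more than 1/3 of the reads disagree
            bases.append(base)
            corrected.append(i)
        else:
            bases.append("N")
    return "".join(bases), corrected
```

Integer arithmetic (`3 * mismatches <= n`) is used so that the 1/3 threshold is applied exactly, without floating-point rounding.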
[0204] In some embodiments, the sequencing data with the base calls and sample tag information are analyzed through a special pipeline to allow de-multiplexing of samples followed by clustering, error correction and assembly. Sequences of the transposase recognition sites can be used to identify the location of the synthetic transposons in the sequencing reads. In the cases of Tn5 synthetic transposons, a total of 38-bp Tn5 recognition sequences (2×19-bp, 4³⁸ or ~7×10²² possibilities among 38-bp) can be used quite uniquely for transposon identification in a large genome such as human (about 3×10⁹ bases). The stuff sequences in the synthetic transposons, or the fixed nucleotides in the molecular barcode sequences can also serve as additional known bases for identification of the synthetic transposons among the sequencing reads. Once the transposons are identified, the distinct molecular barcode sequence between the transposase recognition sequences in a synthetic transposon (for example, a molecular barcode with 20 randomly designed nucleotides yields about 10¹² distinct sequences) can serve as exogenous tags. Additionally, when applicable, the duplicate gap sequences can serve as endogenous tags. For example, Tn5 generates 9-bp duplicated sequences (4⁹ or ~2×10⁵ combinations) flanking the insertion sites, which provides information on the distinct positions of insertion. The duplicated gap sequence can provide additional insertion-specific information for mapping sequencing reads comprising the synthetic transposons to the original location in the target nucleic acid molecule. In embodiments with Tn5 synthetic transposons having 20 randomly designed nucleotides in the molecular barcodes, a total of greater than 2×10¹⁷ combinations of different sequences can theoretically be used for tagging and extracting contiguity information in a target nucleic acid. This large diversity of molecular barcodes allows the inserted sequences to be different in all positions. Therefore, each combination of exogenous and optionally endogenous tag sequences uniquely identifies the surrounding sequences from the target nucleic acid. The distinct molecular barcodes and the duplicate gap sequences from target nucleic acids on one or both ends of the synthetic transposon can serve as unique identifiers to cluster sequencing reads with the same molecular barcode and duplicated gap sequence.
Amplification or sequencing errors are corrected and amplification bias is eliminated in the clustering process. Such methods can be particularly useful for assembling repetitive sequence regions, such as Alu repeats, so that the contiguity of the repetitive sequences can be resolved. Insertion of the synthetic transposons can break the repetitiveness of many sequences, and can therefore allow better amplification and sequencing of sequences that are otherwise difficult to amplify or sequence. Consensus sequences derived from the clustered reads are then assembled together to obtain a phased uninterrupted sequence for the target nucleic acid.
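The tag-space sizes quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that every position in a recognition sequence, barcode, or duplicated gap sequence can independently be any of the four nucleotides; the variable names are illustrative.

```python
# Number of distinct sequences of length L over {A, C, G, T} is 4 ** L.
recognition_space = 4 ** 38   # 2 x 19-bp Tn5 recognition sequences
gap_space = 4 ** 9            # 9-bp duplicated gap sequence (endogenous tag)
barcode_space = 4 ** 20       # 20 random barcode nucleotides (exogenous tag)
combined_space = barcode_space * gap_space  # barcode plus duplicated gap

for name, value in [
    ("recognition", recognition_space),
    ("gap", gap_space),
    ("barcode", barcode_space),
    ("combined", combined_space),
]:
    print(f"{name}: {value:.1e}")
```

The exact values are 4³⁸ ≈ 7.6×10²², 4⁹ = 262,144, 4²⁰ ≈ 1.1×10¹², and 4²⁹ ≈ 2.9×10¹⁷, in line with the approximate figures given in the text.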
[0205] In analysis, several parameters can be used to help cluster and assemble the sequencing reads to obtain maximal haplotyping information and lead to final counting of the original target molecules. For example, the synthetic transposons can be identified using the 2 transposase recognition sequences (2×19-bp for Tn5 transposase recognition sites). Then the randomly designed sequences in the molecular barcodes (exogenous tags) and/or the duplicate gap sequences flanking the synthetic transposon insertion position (endogenous tags; e.g., 9-nt for Tn5 transposase, which yields 4⁹ possible sequences) can be used to trace back the original position of the insertion site in the target nucleic acid and count the original target nucleic acid once for each cluster of reads mapping to the same original target nucleic acid. Although it is possible to only use the molecular barcode sequences in the synthetic transposons, use of the duplicated gap sequences can provide additional information for assembly of the sequencing reads. For target nucleic acids in homogenous samples, the overlapped sequences among different clustered reads should be the same except for errors from amplification, and/or sequencing, and/or analysis steps. Therefore, a contig representing the error-corrected consensus sequence can be obtained from the sequencing reads clustered based on the sequences of the synthetic transposons and/or the duplicated gap sequences.
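How barcodes and duplicated gap sequences together allow consensus fragments to be linked back into one contig can be sketched as follows. The sketch assumes that transposon sequences have already been trimmed, that each fragment is described by the barcodes at its two ends (`None` at an end of the target), and that the 9-nt duplication at each insertion site appears at the end of one fragment and at the start of the next. The data layout and the function name are hypothetical.

```python
DUP_LEN = 9  # length of the duplicated gap sequence for Tn5


def link_fragments(fragments):
    """Chain fragments into one contig using shared end barcodes.

    fragments: iterable of (left_barcode, target_sequence, right_barcode).
    At each junction the duplicated gap sequences must agree, and one copy
    of the duplication is dropped. Linking stops at the first missing
    neighbour, so a gap yields a shorter contig.
    """
    by_left = {left: (seq, right) for left, seq, right in fragments}
    if None not in by_left:
        raise ValueError("no fragment from the left end of the target")
    contig, barcode = by_left[None]
    seen = set()
    while barcode is not None:
        if barcode in seen or barcode not in by_left:
            break  # cycle or missing neighbour: stop extending
        seen.add(barcode)
        seq, next_barcode = by_left[barcode]
        if contig[-DUP_LEN:] != seq[:DUP_LEN]:
            raise ValueError("duplicated gap sequences disagree at " + barcode)
        contig += seq[DUP_LEN:]  # keep only one copy of the duplication
        barcode = next_barcode
    return contig
```

Because all fragments linked this way trace back to one original molecule, the resulting contig would be counted as a single copy of the target nucleic acid regardless of how many reads supported it.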
[0206] The library preparation, sequencing, and/or analysis methods described herein may further be supplemented by additional steps and measures in order to obtain high quality, complete sequences in a cost-effective way. For example, the target nucleic acids can be repaired before, during, and/or after transposition; transposition frequency may be increased to minimize the length of sequences between two inserted transposons; loss of nucleic acids may be minimized during processing, for example, by using single-tube processing methods, avoiding purification steps, and/or directly lysing cells to provide target nucleic acids; cluster generation for Illumina sequencing can be optimized to allow paired-end sequencing of long templates; the number of cells for each experiment can be optimized; high quality reference sequences can be used; and internal standards may be used for sequencing.
Applications
[0207] The methods of analyzing or sequencing a target nucleic acid as described above can be used in a variety of applications, including, but not limited to, high quality sequencing, haplotyping, de novo sequencing, resequencing (such as mutation and cancer sequencing, disease diagnosis, forensic applications, and aging analysis), single-cell sequencing, sequencing of genetically engineered species (such as plants), sequencing of highly repetitive regions, pseudogenes and structurally difficult sequences, metagenomics sequencing, structural variation detection, copy number measurement, methylation analysis, and genetic linkage analysis for identification of genes involved in disease etiology. The methods have reduced amplification and sequencing errors, and reduced contamination, such as from products of previous experiments.
[0208] In some embodiments, there is provided a method of haplotyping a target nucleic acid (such as genomic DNA, for example, a chromosome) comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region
comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; and (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids whereby the contiguous sequence provides haplotype information of the target nucleic acid. In some embodiments, the second strand of the first non- complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non- complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. 
In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single- stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some
embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
[0209] In some embodiments, there is provided a method of assembly (such as de novo assembly, resequencing, or metagenomic sequencing) of a target nucleic acid (such as genomic DNA, mitochondrial DNA, or microbial DNA), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary 
region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; and (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. 
In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method determines sequences of the target nucleic acids at single cell level. In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
[0210] The methods of assembly disclosed herein may be used to generate reference genome sequences for human or other species of interest using multiple platforms or replicates with extremely low error rates (e.g., with lower than about 1/10, 1/100, 1/1000, or 1/10,000 the error rate of current reference genome sequences). The reference genomes can then be used to speed up the assembly process for new sequences from individuals in a species.
[0211] In many genomes, there are highly repetitive regions (such as microsatellites, short interspersed elements (SINEs), or long interspersed elements (LINEs)), pseudogenes, or unique sequences that are difficult to sequence or assemble. For example, in the human genome, up to 50% of the genome is highly repetitive, including Alu repeats, which belong to SINEs, are about 300 bp long, and make up more than 10% of the human genome (see Batzer M.A. et al., Alu repeats and human genomic diversity, Nature Reviews Genetics 3: 370-379, 2002). Also, some specific sequences in the DNA could be difficult for PCR or sequencing. For example, a 370 bp segment in the 5' untranslated region of murine gene Foxd3 is resistant to amplification, sequencing and cloning (Nelms BL and Labosky PA, A predicted hairpin cluster correlates with barriers to PCR, sequencing and possibly BAC recombineering. Scientific Reports 1: 106, 2011). The random insertion of the synthetic transposons can help to reduce difficulty in sequencing due to repetitive or hairpin cluster sequences.
[0212] In some embodiments, there is provided a method of sequencing repetitive regions in a target nucleic acid (such as genomic DNA, for example, a chromosome), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; and (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the
synthetic transposons in the template nucleic acids. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some
embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single- stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. 
In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
[0213] In some embodiments, there is provided a method of detecting a mutation (such as SNP, indel, structural variation, translocation, or copy number variation) in a target nucleic acid (e.g. , at single-cell level), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non- complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non- complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary 
region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids; and (f) comparing the contiguous sequence with a reference sequence to detect the mutation in the target nucleic acid. In some embodiments, the molecular barcode is double- stranded. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non- complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single- stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. 
In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
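Step (f), comparing the assembled contiguous sequence with a reference sequence, can be illustrated for the simplest case of single-nucleotide substitutions. The sketch assumes the contig and the reference are already aligned and of equal length; detecting indels, structural variations, or copy number changes would require a proper alignment step that is not shown. The function name is hypothetical.

```python
def find_substitutions(contig, reference):
    """Return (position, reference_base, contig_base) for each mismatch.

    Positions are 0-based. Both sequences are assumed to be pre-aligned and
    of equal length, so only substitutions can be reported.
    """
    if len(contig) != len(reference):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (i, ref_base, obs_base)
        for i, (ref_base, obs_base) in enumerate(zip(reference, contig))
        if ref_base != obs_base
    ]
```

Because the contig is an error-corrected consensus, differences reported by such a comparison are more likely to reflect real variation in the target nucleic acid than amplification or sequencing errors.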
[0214] In some embodiments, there is provided a method of detecting a structural variation in a target nucleic acid (such as genomic DNA, for example, a chromosome), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of
each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids; and (f) comparing the contiguous sequence with a reference sequence to detect the structural variation in the target nucleic acid. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non- complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single- stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. 
In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
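The barcode-based assembly of steps (i) to (iii) can be illustrated with a minimal sketch. It assumes that each template nucleic acid has already been reduced to a record holding the barcode read at its left end, the target-derived insert, and the barcode read at its right end; that all inserts are reported in the same orientation; and that the duplicated flanking sequences have already been trimmed. The names `Fragment` and `chain_fragments` are hypothetical and are not part of the present application.

```python
from typing import List, NamedTuple, Optional


class Fragment(NamedTuple):
    left: Optional[str]   # barcode at the left end (None at an end of the molecule)
    insert: str           # target-derived sequence between two transposon halves
    right: Optional[str]  # barcode at the right end (None at an end of the molecule)


def chain_fragments(fragments: List[Fragment]) -> List[str]:
    """Join fragments whose right barcode equals the left barcode of another fragment."""
    by_left = {f.left: f for f in fragments if f.left is not None}
    right_barcodes = {f.right for f in fragments if f.right is not None}
    # A chain starts at a fragment whose left barcode matches no right barcode.
    starts = [f for f in fragments if f.left is None or f.left not in right_barcodes]
    contigs = []
    for start in starts:
        parts, current, seen = [start.insert], start, set()
        while current.right in by_left and current.right not in seen:
            seen.add(current.right)          # guard against barcode collisions
            current = by_left[current.right]
            parts.append(current.insert)
        contigs.append("".join(parts))
    return contigs


fragments = [
    Fragment("BC2", "GGG", "BC3"),
    Fragment(None, "AAA", "BC1"),
    Fragment("BC1", "CCC", "BC2"),
    Fragment("BC3", "TTT", None),
]
print(chain_fragments(fragments))
```

Because the two halves of one synthetic transposon carry the same barcode, neighbouring fragments share a barcode at their junction, which is what allows the chaining above regardless of the order in which the fragments were sequenced.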
[0215] In some embodiments, there is provided a method of detecting a copy number variation in a target nucleic acid (such as a chromosome, exosome, or target sequences), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary region and the second non-complementary region of 
each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids; (f) counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence; and (g) comparing the copy number of the target nucleic acid with a reference to detect the copy number variation in the target nucleic acid. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single- stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single- stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. 
In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre- mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method further comprises capturing or enhancing the target nucleic acid or barcoded target nucleic acid, such as by using probes that hybridize to the target nucleic acid or barcoded target nucleic acid. In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence.
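The counting and comparison of steps (f) and (g) can be sketched as follows. The sketch assumes that every sequencing read has already been assigned to an assembled contiguous sequence and to a locus, and it normalizes against a control locus, which is an illustrative choice rather than a requirement of the method. The function names and locus labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_copies(read_assignments: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count one copy per contiguous sequence, however many reads support it.

    read_assignments yields (contig_id, locus) pairs, one per sequencing read.
    """
    contigs_per_locus = defaultdict(set)
    for contig_id, locus in read_assignments:
        contigs_per_locus[locus].add(contig_id)
    return {locus: len(ids) for locus, ids in contigs_per_locus.items()}


def copy_number_ratio(sample: Dict[str, int], reference: Dict[str, int],
                      locus: str, control: str) -> float:
    """Ratio of control-normalized copy counts, sample over reference."""
    return (sample[locus] / sample[control]) / (reference[locus] / reference[control])


reads = [("c1", "locus_A"), ("c1", "locus_A"), ("c2", "locus_A"),
         ("c3", "control"), ("c3", "control"), ("c3", "control")]
print(count_copies(reads))
```

Counting contiguous sequences rather than reads keeps amplification duplicates from inflating the estimate, since all reads carrying the same barcode chain collapse to a single copy.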
[0216] Methods of bisulfite sequencing for analyzing methylation status of target nucleic acids (such as genomic DNA) are provided herein. DNA methylation is a widespread epigenetic modification that plays a pivotal role in the regulation of the genomes of diverse organisms. The most prevalent and widely studied form of DNA methylation in mammalian genomes occurs at the carbon-5 position of cytosine residues, usually in the context of the CpG dinucleotide.
Microarrays, and more recently massively parallel sequencing, have enabled the interrogation of cytosine methylation (5mC) on a genome-wide scale (Zilberman and Henikoff 2007). Methods of whole genome bisulfite sequencing that can be used to detect 5mC have been described (e.g., Cokus et al. 2008; Lister et al. 2009; Harris et al. 2010). Treatment of genomic DNA with sodium bisulfite chemically deaminates cytosines much more rapidly than 5mC, preferentially converting them to uracils (Clark et al. 1994). With massively parallel sequencing, the converted positions can be detected on a genome-wide scale at single base-pair resolution. Any of the known whole genome bisulfite sequencing workflows can be applied to genomic DNA samples barcoded with the synthetic transposons of the present application to provide methods of methylation analysis with high accuracy and efficiency.
[0217] In some embodiments, there is provided a method of analyzing methylation status of a target nucleic acid (such as genomic DNA, for example, a chromosome), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non- complementary region, and a second fragment comprising a second stem and a second non- complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non- complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) subjecting the repaired barcoded target nucleic acid to bisulfite treatment; (d) separating the 
first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (e) sequencing the library of template nucleic acids to obtain sequencing reads; (f) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids; and (g) comparing the contiguous sequence with a reference sequence of the target nucleic acid to determine methylation positions in the target nucleic acid. In some
embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non- complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single- stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. 
In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
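The comparison of a bisulfite-treated contiguous sequence with a reference sequence can be illustrated with a minimal sketch. It assumes complete conversion of unmethylated cytosines, considers only the forward strand, and assumes that the contiguous sequence is already aligned base-for-base with the reference (no insertions or deletions). The function names are hypothetical, and `bisulfite_convert` merely simulates the chemistry for the purpose of the example.

```python
from typing import Dict, Set


def bisulfite_convert(sequence: str, methylated: Set[int]) -> str:
    """Simulate conversion: unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference: str, converted: str) -> Dict[int, bool]:
    """Return {position: is_methylated} for every reference cytosine that can be called."""
    if len(reference) != len(converted):
        raise ValueError("sequences must be aligned to the same length")
    calls = {}
    for pos, (ref_base, observed) in enumerate(zip(reference, converted)):
        if ref_base != "C":
            continue
        if observed == "C":
            calls[pos] = True    # protected from conversion, hence methylated
        elif observed == "T":
            calls[pos] = False   # converted, hence unmethylated
        # any other base is a mismatch or sequencing error and is left uncalled
    return calls


reference = "ACGTCCGA"
treated = bisulfite_convert(reference, methylated={1})
print(treated, call_methylation(reference, treated))
```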
[0218] Methods of determining chromosomal conformations (such as a native 3-D structure of the genome) and protein-target nucleic acid interactions are provided herein. Various chromosome conformation capture techniques (see, for example, Barutcu AR et al., J. Cell Physiol. 231:31-35, 2016), such as 3C, circularized 3C (i.e., 4C), carbon-copy 3C (i.e., 5C), or chromatin immunoprecipitation-based methods (such as ChIP-loop), and genome conformation capture techniques may be combined with any one of the methods of inserting synthetic transposons described herein to assess chromosome interactions. Various chromatin immunoprecipitation (ChIP) methods (see, for example, P. Collas, Molecular Biotechnology 45(1):87-100, 2010) can be used to isolate protein-DNA complexes (such as chromatin-DNA complexes), which can then be barcoded with the synthetic transposons of the present application, and sequenced to determine the locations in the genome with which the protein (such as a histone) is associated.
[0219] In some embodiments, there is provided a method of analyzing conformation of a chromosome, comprising: (a) crosslinking the chromosome in vivo (such as within a cell); (b) isolating the crosslinked chromosome; (c) fragmenting (such as mechanically or enzymatically) the crosslinked chromosome to provide crosslinked chromosomal fragments; (d) ligating the ends of the crosslinked chromosomal fragments to provide ligated fragments; (e) reversing the crosslinks of the ligated fragments to provide target nucleic acids; (f) contacting the target nucleic acids with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acids to provide inserted target nucleic acids, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the 
same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (g) contacting the inserted target nucleic acids with a polymerase, nucleotides, and a ligase to provide repaired target nucleic acids; (h) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acids to provide the library of template nucleic acids; (i) sequencing the library of template nucleic acids to obtain sequencing reads; (j) assembling contiguous sequences of the target nucleic acids from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids; and (k) comparing the contiguous sequences with a reference sequence of the chromosome to determine conformation of the chromosome. In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some
embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single- stranded linker, and a second single-stranded sequence that hybridizes to the second single- stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. 
In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
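One way to read step (k) is that a contiguous sequence spanning a ligation junction aligns to two or more separate places in the reference, and that each such pair of places is evidence of spatial proximity. A minimal sketch of tallying those pairs follows. It assumes that an upstream aligner has already reported, for each contiguous sequence, the reference coordinates of its aligned segments; the bin size, the function name `contact_counts`, and the coordinates are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple

Segment = Tuple[str, int]  # (chromosome name, 0-based start position)


def contact_counts(contig_alignments: Iterable[List[Segment]],
                   bin_size: int = 1_000_000) -> Counter:
    """Count pairs of genomic bins joined within the same contiguous sequence."""
    counts = Counter()
    for segments in contig_alignments:
        # Collapse segments to distinct bins so one contig votes once per bin pair.
        bins = sorted({(chrom, pos // bin_size) for chrom, pos in segments})
        for a, b in combinations(bins, 2):
            counts[(a, b)] += 1
    return counts


contigs = [
    [("chr1", 100_000), ("chr1", 5_200_000)],   # spans a ligation junction
    [("chr1", 150_000), ("chr1", 5_900_000)],   # supports the same contact
    [("chr2", 10)],                             # no junction, no contact
]
print(contact_counts(contigs))
```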
[0220] Any of the methods and applications described above can be used for diagnosing a disease or a condition in an individual based on the sequence, contiguity information (such as haplotype or 3-dimensional chromosome conformation), and/or quantity of a target nucleic acid in the individual. The target nucleic acid may be present in a sample obtained from the individual, including, but not limited to, a biopsy sample, a buccal swab, a blood sample, or a sample of another bodily fluid. In some embodiments, the target nucleic acid of the individual is compared to a reference from a healthy individual to provide the diagnosis.
[0221] In some embodiments, there is provided a method of diagnosing a disease or a condition of an individual based on status of a target nucleic acid (such as genomic DNA, for example, a chromosome) from the individual, comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g. , hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; (c) separating the first non-complementary 
region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids; (d) sequencing the library of template nucleic acids to obtain sequencing reads; (e) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids; (f) optionally counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence; and (g) providing a diagnosis based on the contiguous sequence and/or the copy number of the target nucleic acid. In some embodiments, the diagnosis comprises mutations, such as structural variations or copy number variations in a diseased tissue (such as tumor). In some embodiments, the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide). In some embodiments, the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other. In some embodiments, the synthetic transposon further comprises a bridge nucleic acid comprising a first single- stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single - stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid. In some embodiments, the sequencing is next generation sequencing. 
In some embodiments, the sequencing is massively parallel shotgun sequencing. In some embodiments, the sequencing is single molecule sequencing. In some embodiments, the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA. In some embodiments, the plurality of synthetic transposons and the transposase are pre- mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid. In some embodiments, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single- stranded gaps having duplicated sequences endogenous to the target nucleic acid, the duplicated sequences are further used to assemble the contiguous sequence.
[0222] Some embodiments described herein comprise comparing the contiguous sequence of the target nucleic acid in a sample to a reference sequence, comparing the copy number of the target nucleic acid in a sample to a reference value, and/or comparing the contiguous sequence and/or copy number of the target nucleic acid of one sample to that of a reference sample. The reference sequence and reference values may be obtained from a database. The reference sample may be a sample from a healthy or wildtype individual, tissue, or cell. For example, in some embodiments, the target nucleic acid from a tumor cell of an individual is analyzed and compared to the nucleic acid from a healthy cell of the same individual to provide a diagnosis.
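The comparison of a contiguous sequence with a reference sequence can be sketched with the Python standard library. The example uses `difflib.SequenceMatcher`, a general-purpose sequence comparison tool chosen here only for illustration; it is not an aligner described in the present application, and a practical workflow would use a dedicated genomic aligner. The function name `sequence_differences` and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Difference = Tuple[str, int, int, int, int]


def sequence_differences(reference: str, sample: str) -> List[Difference]:
    """List the non-matching blocks between a reference and a sample sequence.

    Each entry is (tag, ref_start, ref_end, sample_start, sample_end), where
    tag is 'delete', 'insert', or 'replace' relative to the reference.
    """
    matcher = SequenceMatcher(None, reference, sample, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


healthy = "AAAACCCCGGGGTTTT"
tumor = "AAAAGGGGTTTT"      # toy example: the CCCC block is absent
for tag, r1, r2, s1, s2 in sequence_differences(healthy, tumor):
    print(tag, healthy[r1:r2], "->", tumor[s1:s2])
```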
Kits and articles of manufacture
[0223] The present application further provides kits and articles of manufacture comprising a plurality of any of the synthetic transposons described herein, for use in the methods of library preparation, the methods of analyzing target nucleic acids, or the various applications described herein.
[0224] In some embodiments, there is provided a kit for preparing a library of template nucleic acids, comprising: (a) a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other; (b) a transposase that recognizes the first transposase recognition site and the second transposase recognition site; and (c) instructions for preparing the library of template nucleic acids. In some embodiments, the kit further comprises a polymerase without strand displacement activity, such as T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the kit further comprises nucleotides (such as dNTPs). In some embodiments, the transposase is Tn5 transposase, including a modified Tn5 transposase with enhanced activity, such as EZ-Tn5™. In some embodiments, wherein the first non-complementary region and/or the second non-complementary region comprises one or more cleavable nucleotides, the kit further comprises an endonuclease, such as UDG, for example, USER™. In some embodiments, wherein each synthetic transposon further comprises a bridge nucleic acid, the kit further comprises a single-strand exonuclease. 
In some embodiments, the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof. In some embodiments, the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
[0225] In some embodiments, there is provided a kit for preparing a library of template nucleic acids, comprising: (a) a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary region are connected to each other; wherein the first molecular barcode and the second molecular barcode have the same barcode sequence; and wherein each synthetic transposon has a different barcode sequence; (b) a transposase that recognizes the first transposase recognition site and the second transposase recognition site; and (c) instructions for preparing the library of template nucleic acids. In some embodiments, the kit further comprises a polymerase without strand displacement activity, such as T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the kit further comprises nucleotides (such as dNTPs). In some embodiments, the transposase is Tn5 transposase, including a modified Tn5 transposase with enhanced activity, such as EZ-Tn5™. 
In some embodiments, wherein the first non-complementary region and/or the second non-complementary region comprises one or more cleavable nucleotides, the kit further comprises an endonuclease, such as UDG, for example, USER™. In some embodiments, wherein each synthetic transposon further comprises a bridge nucleic acid, the kit further comprises a single-strand exonuclease. In some embodiments, the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof. In some embodiments, the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
[0226] The kits may contain one or more additional components, such as containers, buffers, reagents, cofactors, or additional agents, such as a denaturing agent. The kit components may be packaged together and the package may contain or be accompanied by instructions for using the kit.
[0227] It will be appreciated by persons skilled in the art that numerous variations, combinations and/or modifications may be made to the invention as shown without departing from the spirit of the invention as broadly described.
EXEMPLARY EMBODIMENTS
[0228] The invention provides the following embodiments:
[0229] Embodiment 1. A synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
[0230] Embodiment 2. The synthetic transposon of embodiment 1, wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides.
[0231] Embodiment 3. The synthetic transposon of embodiment 1 or embodiment 2, wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides.
[0232] Embodiment 4. The synthetic transposon of embodiment 1, wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other.
[0233] Embodiment 5. The synthetic transposon of embodiment 4, wherein the first single-stranded linker and the second single-stranded linker hybridize to each other.
[0234] Embodiment 6. The synthetic transposon of embodiment 5, wherein each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide.
[0235] Embodiment 7. The synthetic transposon of embodiment 4, further comprising a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
[0236] Embodiment 8. The synthetic transposon of any one of embodiments 2, 3, and 6, wherein the cleavable nucleotide is a uracil nucleotide.
[0237] Embodiment 9. The synthetic transposon of any one of embodiments 1-8, wherein the first stem or the second stem comprises a terminal hairpin structure.
[0238] Embodiment 10. The synthetic transposon of any one of embodiments 1-8, wherein the first stem and the second stem comprise blunt ends.
[0239] Embodiment 11. The synthetic transposon of any one of embodiments 1-10, wherein the synthetic transposon is a DNA transposon.
[0240] Embodiment 12. The synthetic transposon of any one of embodiments 1-11, wherein the synthetic transposon comprises one or more modified nucleotides.
[0241] Embodiment 13. The synthetic transposon of any one of embodiments 1-12, wherein the first transposase recognition site and the second transposase recognition site have the same sequence.
[0242] Embodiment 14. The synthetic transposon of any one of embodiments 1-12, wherein the first transposase recognition site and the second transposase recognition site have different sequences.
[0243] Embodiment 15. The synthetic transposon of any one of embodiments 1-14, wherein the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
[0244] Embodiment 16. The synthetic transposon of any one of embodiments 1-15, further comprising a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region.
[0245] Embodiment 17. The synthetic transposon of embodiment 16, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence.
[0246] Embodiment 18. The synthetic transposon of embodiment 16 or embodiment 17, wherein the first molecular barcode and the second molecular barcode are double-stranded.
[0247] Embodiment 19. The synthetic transposon of embodiment 16 or embodiment 17, wherein the first molecular barcode or the second molecular barcode comprises a single-stranded region.
[0248] Embodiment 20. The synthetic transposon of embodiment 19, wherein the 5' terminus adjacent to the single-stranded region is phosphorylated.
[0249] Embodiment 21. A composition comprising a plurality of synthetic transposons of any one of embodiments 1-20.
[0250] Embodiment 22. The composition of embodiment 21, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence.
[0251] Embodiment 23. The composition of embodiment 22, wherein the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
[0252] Embodiment 24. A method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with the composition of any one of embodiments 21-23, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
[0253] Embodiment 25. The method of embodiment 24, wherein step (c) comprises treating the repaired target nucleic acid with an endonuclease.
[0254] Embodiment 26. The method of embodiment 25, wherein the endonuclease is uracil DNA glycosylase (UDG).
[0255] Embodiment 27. The method of embodiment 24, wherein step (c) comprises denaturing of the repaired target nucleic acid.
[0256] Embodiment 28. The method of embodiment 27, further comprising treating the denatured repaired target nucleic acid with an exonuclease.
[0257] Embodiment 29. The method of any one of embodiments 24-28, further comprising amplifying the library of template nucleic acids.
[0258] Embodiment 30. The method of embodiment 29, wherein the amplifying is whole-genome amplification.
[0259] Embodiment 31. The method of embodiment 29, wherein the amplifying is targeted amplification.
[0260] Embodiment 32. The method of any one of embodiments 29-31, wherein the library of template nucleic acids is amplified by a polymerase chain reaction (PCR).
[0261] Embodiment 33. The method of embodiment 32, wherein the PCR comprises contacting the template nucleic acids with a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
[0262] Embodiment 34. The method of any one of embodiments 29-31, wherein the library of template nucleic acids is amplified by rolling circle amplification (RCA) using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
[0263] Embodiment 35. The method of embodiment 34, further comprising circularizing the template nucleic acids prior to the RCA.
[0264] Embodiment 36. The method of any one of embodiments 24-35, wherein the polymerase is T4 DNA polymerase.
[0265] Embodiment 37. The method of any one of embodiments 24-36, wherein the transposase is Tn5 transposase.
[0266] Embodiment 38. The method of any one of embodiments 24-37, wherein the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
[0267] Embodiment 39. The method of any one of embodiments 24-38, wherein the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
[0268] Embodiment 40. The method of any one of embodiments 24-39, wherein the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
[0269] Embodiment 41. A method of analyzing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using the method of any one of embodiments 24-40; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
[0270] Embodiment 42. The method of embodiment 41, wherein the sequencing is massively parallel shotgun sequencing.
[0271] Embodiment 43. The method of embodiment 41, wherein the sequencing is single molecule sequencing.
[0272] Embodiment 44. The method of any one of embodiments 41-43, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids.
[0273] Embodiment 45. The method of embodiment 44, wherein step (c) comprises: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
[0274] Embodiment 46. The method of embodiment 44 or embodiment 45, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
[0275] Embodiment 47. The method of any one of embodiments 44-46, further comprising counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
[0276] Embodiment 48. The method of any one of embodiments 41-47, wherein the method is used for genome assembly, haplotyping, detection of mutation, chromosomal conformation analysis, or methylation analysis.
[0277] Embodiment 49. The method of embodiment 48, wherein the mutation is selected from the group consisting of substitution, indel, structural variation, and copy number variation.
[0278] Embodiment 50. A kit for preparing a library of template nucleic acids, comprising: (a) the composition of any one of embodiments 21-23; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c) instructions for preparing the library of template nucleic acids.
[0279] Embodiment 51. The kit of embodiment 50, further comprising a polymerase.
[0280] Embodiment 52. The kit of embodiment 51, wherein the polymerase is a T4 DNA polymerase.
[0281] Embodiment 53. The kit of any one of embodiments 50-52, further comprising a ligase.
[0282] Embodiment 54. The kit of any one of embodiments 50-53, wherein the transposase is Tn5 transposase.
[0283] Embodiment 55. The kit of any one of embodiments 50-54, further comprising an endonuclease.
[0284] Embodiment 56. The kit of embodiment 55, wherein the endonuclease is UDG.
[0285] Embodiment 57. The kit of any one of embodiments 50-56, further comprising a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
[0286] Embodiment 58. The kit of any one of embodiments 50-56, further comprising a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
EXAMPLES
[0287] The examples below are intended to be purely exemplary of the invention and should therefore not be considered to limit the invention in any way. The following examples and detailed description are offered by way of illustration and not by way of limitation.
Example 1: Whole genome sequencing of identical twins
[0288] Identical twins have identical genomic sequences except for only a few mutations. The specific mutations can be determined by NGS methods, and confirmed by Sanger sequencing methods. Therefore, data from whole genome sequencing of identical twins can be used for checking sequencing errors using the library preparation methods described in the present application. An exemplary method of whole genome sequencing of identical human twins is described below.
[0289] Human gDNA is extracted from a buccal swab or a drop of blood, and the purity and yield of the gDNA are measured. Alternatively, about 10-20 human cells from each person are lysed without purification to minimize the loss of DNA. A composition comprising a plurality of synthetic transposons as shown in FIG. 21 is prepared. Illumina sequencing primers read1 and read2 are incorporated as the first adapter sequence (e.g., F) and second adapter sequence (e.g., R), respectively, in the non-complementary regions. The molecular barcodes contain a total of 20 randomly or degenerately designed nucleotides intermixed with fixed nucleotides. Duplicate samples of the gDNA inserted with the plurality of synthetic transposons are prepared. In each sample, about 0.3 ng gDNA is contacted with the composition comprising the plurality of synthetic transposons and Tn5 transposase under a condition that allows insertion at a spacing of about 150 bp between adjacent transposition sites. The single-stranded gaps are filled in with dNTPs and a DNA polymerase without strand displacement activity, such as T4 DNA polymerase. Nicks in the DNA are ligated with E. coli ligase, which can be done separately or simultaneously with the gap filling step. The product is treated with USER™ enzyme (NEB) to cleave the uracil nucleotide joining the two non-complementary regions in the inserted synthetic transposons to provide a library of templates. The USER™ enzyme treatment step can be done separately or simultaneously with the gap filling step and the ligation step. The library of templates is subsequently PCR amplified with two corresponding primers: a first primer having Illumina sequence i5 and sequence Read1, and a second primer having Illumina sequence i7 and sequence Read2. The PCR products are then purified, quantified, and sequenced with 2x300-base paired-end reads using an Illumina NGS instrument.
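As a rough sanity check on the quantities in this protocol, the following Python sketch estimates how many transposon insertions 0.3 ng of gDNA implies at about 150 bp spacing, and how that compares with the number of distinct barcodes available from 20 random positions. It is illustrative only and rests on assumptions not stated in the text: an average mass of about 650 g/mol per base pair of double-stranded DNA, all 20 barcode positions being fully random (four choices each), and barcodes being drawn uniformly. The function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23      # molecules per mole
GRAMS_PER_MOL_PER_BP = 650.0  # assumed average mass of one base pair of dsDNA


def base_pairs_in_mass(nanograms):
    """Approximate number of base pairs contained in a given mass of dsDNA."""
    grams = nanograms * 1e-9
    return grams / GRAMS_PER_MOL_PER_BP * AVOGADRO


def expected_insertions(nanograms, spacing_bp):
    """Approximate number of insertion sites at one insertion per spacing_bp."""
    return base_pairs_in_mass(nanograms) / spacing_bp


def barcode_space(n_random_positions):
    """Number of distinct barcodes if every random position can be A, C, G or T."""
    return 4 ** n_random_positions


def fraction_sharing_barcode(n_transposons, n_random_positions):
    """Probability that a given transposon shares its barcode with at least one
    other transposon, assuming barcodes are drawn uniformly at random."""
    space = barcode_space(n_random_positions)
    # 1 - (1 - 1/space)^(n - 1), computed in a numerically stable way
    return -math.expm1((n_transposons - 1) * math.log1p(-1.0 / space))


if __name__ == "__main__":
    n = expected_insertions(0.3, 150)
    print(f"insertions: {n:.3g}")
    print(f"barcode space: {barcode_space(20):.3g}")
    print(f"fraction sharing a barcode: {fraction_sharing_barcode(n, 20):.3%}")
```

Under these assumptions the estimate is on the order of two billion insertion sites against roughly a trillion possible barcodes, so only a small fraction of transposons (well under 1%) would be expected to share a barcode by chance. Degenerate rather than fully random positions would shrink the barcode space and raise that fraction.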
[0290] The sequencing reads are subsequently analyzed. The sequencing reads contain stuff sequences, unique molecular barcode sequence, 19-base Tn5 recognition site, 9-base duplicate sequence, and target sequence in both sequencing directions. For short target sequences, the sequencing reads may additionally contain an additional copy of 9-base duplicate, 19-base Tn5 recognition site, unique molecular barcode sequence and stuff sequences. The sequencing reads in both directions are matched with each other and combined to yield a single sequence.
Additionally, sequences having identical molecular barcodes are aligned and merged into a single consensus sequence to yield the error-corrected target sequence. Target sequences are assembled to provide a high-quality whole genome sequence, which contains haplotype information and any structural variations or mutations. The genomic sequences from the twins are compared to each other to identify mutations, which are verified by Sanger sequencing. Unverified mutations are attributed to sequencing errors and are used to calculate an error rate for the sequencing method described herein. This error rate is compared to error rates of other sequencing methods that use conventional methods (such as commercial kits) to prepare sequencing libraries.
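The barcode-based error correction described above can be sketched in Python as follows. This is a minimal illustration, not the analysis pipeline of the application: it assumes each read has already been parsed into a (barcode, target sequence) pair, that reads sharing a barcode start at the same position so that a simple column-by-column comparison stands in for alignment, and that a majority vote per position is an acceptable consensus rule. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_barcode(reads):
    """Group target sequences by their molecular barcode.

    reads: iterable of (barcode, target_sequence) pairs.
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


def consensus(sequences):
    """Majority-vote consensus over sequences assumed to share a start position.

    Only positions covered by every sequence are considered. On a tie, the
    base seen first at that position wins.
    """
    length = min(len(s) for s in sequences)
    bases = []
    for i in range(length):
        column = Counter(s[i] for s in sequences)
        bases.append(column.most_common(1)[0][0])
    return "".join(bases)


def error_corrected_targets(reads):
    """Return one consensus target sequence per molecular barcode."""
    return {bc: consensus(seqs) for bc, seqs in group_by_barcode(reads).items()}


if __name__ == "__main__":
    example_reads = [
        ("AAC", "ACGT"),
        ("AAC", "ACGT"),
        ("AAC", "ACTT"),  # one read carries a sequencing error at position 3
        ("GGT", "TTTT"),
    ]
    print(error_corrected_targets(example_reads))
```

A real analysis would replace the column-wise comparison with a proper alignment and would typically weight bases by quality score; the sketch only shows how reads with identical barcodes collapse into a single error-corrected sequence.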
Example 2: Single molecule sequencing of human skin metagenomes
[0291] An exemplary method of sequencing human skin metagenomes is described below.
[0292] First, microbial gDNAs are extracted from the human skin surface using a swab-scrape-swab procedure. The purity and yield of the microbial gDNAs are measured. A composition comprising a plurality of synthetic transposons as shown in FIG. 2H is prepared. PacBio adapter sequences are incorporated as the first and second adapter sequences (i.e., F and R), respectively, in the two non-complementary regions. The molecular barcodes contain a total of 20 randomly or degenerately designed nucleotides intermixed with fixed nucleotides. Duplicate samples of the gDNA inserted with the plurality of synthetic transposons are prepared. In each sample, nanograms of gDNA are contacted with the composition comprising the plurality of synthetic transposons and Tn5 transposase under a condition that allows insertion at a spacing of about 1500 bp between adjacent transposition sites. The single-stranded gaps are filled in with dNTPs and a DNA polymerase without strand displacement activity, such as T4 DNA polymerase. Nicks in the DNA are ligated with E. coli ligase, which can be done separately or simultaneously with the gap filling step. The product is subsequently denatured, and treated with exonucleases to remove any linear nucleic acids. The resulting sample is sequenced with a PacBio SMRT® instrument. Sequencing data is analyzed, and microbial genomes are assembled from the sequencing data. Abundance of each microbial genome is also obtained. The data is further compared to metagenome data in databases. In this case, molecular barcodes are mainly used to link as many fragments as possible from the same original genome.
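The linking of fragments through shared molecular barcodes can be illustrated with a short Python sketch. It assumes a simplified data model that is not taken from the application: each sequenced template is represented as a tuple (left barcode, target sequence, right barcode), where a barcode is None when that end of the template carries no transposon, and two templates are neighbors in the original molecule when the right barcode of one equals the left barcode of the other. Strand orientation, barcode collisions, and the 9-base duplicated sequences at each insertion site are ignored. The function name is hypothetical.

```python
def link_fragments(fragments):
    """Order fragments into chains by matching shared barcodes.

    fragments: list of (left_barcode, sequence, right_barcode) tuples.
    Returns a list of chains, each chain being a list of sequences in order.
    """
    # Index fragments by the barcode found at their left end.
    by_left = {f[0]: f for f in fragments if f[0] is not None}
    right_barcodes = {f[2] for f in fragments if f[2] is not None}

    # A chain starts at a fragment whose left barcode is absent or unmatched.
    starts = [f for f in fragments if f[0] is None or f[0] not in right_barcodes]

    chains = []
    for start in starts:
        chain = [start[1]]
        visited = {id(start)}
        current = start
        # Follow right barcode -> matching left barcode until the chain ends.
        while current[2] is not None and current[2] in by_left:
            nxt = by_left[current[2]]
            if id(nxt) in visited:  # guard against unexpected loops
                break
            chain.append(nxt[1])
            visited.add(id(nxt))
            current = nxt
        chains.append(chain)
    return chains


if __name__ == "__main__":
    example = [
        ("B2", "CCC", "B3"),
        (None, "AAA", "B1"),
        ("B1", "GGG", "B2"),
        ("X1", "TTT", None),
    ]
    for chain in link_fragments(example):
        print(" -> ".join(chain))
```

In this toy input the first three fragments chain together through barcodes B1 and B2, while the fourth has no matching neighbor and forms its own chain. Each chain corresponds to fragments inferred to come from the same original molecule, which is the property the example relies on for assembling genomes and estimating abundance.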

CLAIMS
What is claimed is:
1. A synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region;
wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site;
wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence;
wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence;
and wherein the first non-complementary region and the second non-complementary region are connected to each other.
2. The synthetic transposon of claim 1, wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides, and/or wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides.
3. The synthetic transposon of claim 1, wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other.
4. The synthetic transposon of claim 3, wherein the first single-stranded linker and the second single-stranded linker hybridize to each other.
5. The synthetic transposon of claim 4, wherein each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide.
6. The synthetic transposon of claim 3, further comprising a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
7. The synthetic transposon of claim 2, wherein the cleavable nucleotide is a uracil nucleotide.
8. The synthetic transposon of claim 1, wherein the synthetic transposon is a DNA transposon.
9. The synthetic transposon of claim 8, wherein the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
10. The synthetic transposon of any one of claims 1-9, further comprising a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region.
11. A composition comprising a plurality of synthetic transposons of any one of claims 1-9.
12. The composition of claim 11, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence.
13. The composition of claim 12, wherein the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
14. A method of preparing a library of template nucleic acids, comprising:
(a) contacting a target nucleic acid with the composition of claim 11, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid;
(b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and
(c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
15. The method of claim 14, wherein step (c) comprises treating the repaired target nucleic acid with an endonuclease.
16. The method of claim 15, wherein the endonuclease is uracil DNA glycosylase (UDG).
17. The method of claim 14, wherein step (c) comprises denaturing of the repaired target nucleic acid.
18. The method of claim 17, further comprising treating the denatured repaired target nucleic acid with an exonuclease.
19. The method of any one of claims 14-18, further comprising amplifying the library of template nucleic acids.
20. The method of any one of claims 14-19, wherein the transposase is Tn5 transposase.
21. The method of any one of claims 14-20, wherein the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
22. The method of any one of claims 14-21, wherein the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
23. A method of analyzing a target nucleic acid, comprising:
(a) preparing a library of template nucleic acids from the target nucleic acid using the method of claim 14; and
(b) sequencing the library of template nucleic acids to obtain sequencing reads.
24. The method of claim 23, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids.
25. A kit for preparing a library of template nucleic acids, comprising:
(a) the composition of any one of claims 11-13;
(b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and
(c) instructions for preparing the library of template nucleic acids.
PCT/US2017/052776 2016-09-23 2017-09-21 Compositions of synthetic transposons and methods of use thereof WO2018057779A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662399188P 2016-09-23 2016-09-23
US62/399,188 2016-09-23

Publications (1)

Publication Number Publication Date
WO2018057779A1 true WO2018057779A1 (en) 2018-03-29

Family

ID=61691129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/052776 WO2018057779A1 (en) 2016-09-23 2017-09-21 Compositions of synthetic transposons and methods of use thereof

Country Status (1)

Country Link
WO (1) WO2018057779A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012061832A1 (en) * 2010-11-05 2012-05-10 Illumina, Inc. Linking sequence reads using paired code tags
US20130289251A1 (en) * 2010-12-23 2013-10-31 Roche Diagnostics Operations, Inc. Binding agent
US20150368638A1 (en) * 2013-03-13 2015-12-24 Illumina, Inc. Methods and compositions for nucleic acid sequencing
US20150176071A1 (en) * 2013-12-20 2015-06-25 Illumina, Inc. Preserving genomic connectivity information in fragmented genomic dna samples
US20160177359A1 (en) * 2014-02-03 2016-06-23 Thermo Fisher Scientific Baltics Uab Method for controlled dna fragmentation
WO2016061517A2 (en) * 2014-10-17 2016-04-21 Illumina Cambridge Limited Contiguity preserving transposition
WO2016130704A2 (en) * 2015-02-10 2016-08-18 Illumina, Inc. Methods and compositions for analyzing cellular components

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10227574B2 (en) 2016-12-16 2019-03-12 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US11111483B2 (en) 2016-12-16 2021-09-07 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems and methods
US11162084B2 (en) 2016-12-16 2021-11-02 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US11278570B2 (en) 2016-12-16 2022-03-22 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US12037611B2 (en) 2016-12-16 2024-07-16 R&D Systems, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US12097220B2 (en) 2016-12-16 2024-09-24 R&D Systems, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US11760983B2 (en) 2018-06-21 2023-09-19 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
WO2021077415A1 (en) * 2019-10-25 2021-04-29 Peking University Methylation detection and analysis of mammalian DNA
CN114391043A (en) * 2019-10-25 2022-04-22 北京大学 Methylation detection and analysis of mammalian DNA
CN114391043B (en) * 2019-10-25 2024-03-15 昌平国家实验室 Methylation detection and analysis of mammalian DNA
EP4165203A4 (en) * 2020-06-12 2024-07-17 Harvard College Compositions and methods for DNA methylation analysis

Similar Documents

Publication Publication Date Title
US11319534B2 (en) Methods and compositions for nucleic acid sequencing
US11505795B2 (en) Error detection in sequence tag directed sequencing reads
US20180087050A1 (en) Methods of inserting molecular barcodes
EP2427569B1 (en) The use of class IIB restriction endonucleases in 2nd generation sequencing applications
IL287853B2 (en) Contiguity preserving transposition
US20120003657A1 (en) Targeted sequencing library preparation by genomic DNA circularization
US20200283839A1 (en) Methods of attaching adapters to sample nucleic acids
US20220127597A1 (en) Haplotagging - haplotype phasing and single-tube combinatorial barcoding of nucleic acid molecules using bead-immobilized Tn5 transposase
WO2018057779A1 (en) Compositions of synthetic transposons and methods of use thereof
US20140228223A1 (en) High throughput paired-end sequencing of large-insert clone libraries
JP2009529876A (en) Methods and means for sequencing nucleic acids
US20210403904A1 (en) Methods for haplotyping with short read sequence technology
US20240271126A1 (en) Oligo-modified nucleotide analogues for nucleic acid preparation
WO2012008831A1 (en) Simplified de novo physical map generation from clone libraries

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 17853918; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 17853918; Country of ref document: EP; Kind code of ref document: A1