Sequencing tools

[Badges: miniconda CI | Documentation Status | Docker Cloud Build Status]

This repo stores different utilities for TGIRT-seq/NGS pipelines. It consists of several ready-to-use scripts and two modules for operations on fastq files and BAM alignments.


Installation

git clone https://github.com/wckdouglas/sequencing_tools.git
cd sequencing_tools
pip install -r requirements.txt
python setup.py install --user

Prerequisites:

  • pysam>0.11.0
  • numpy>=1.12.1
  • cython>=0.25
  • matplotlib>=2.0.0
  • scipy>=0.19.0
  • seaborn>=0.7.1
  • networkx
  • ujson
  • six
  • pytest
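
After installation, the seqtools entry point referenced by the commands below should be available (with a --user install it may end up in ~/.local/bin, which needs to be on your PATH). A quick sanity check:

seqtools --help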

Ready-to-use scripts

UMI clipper

Clip UMIs from paired-end fastq files.

usage: seqtools clipFQ [-h] [-o OUT_FILE] -1 FASTQ1 -2 FASTQ2 [-x IDXBASE]
                     [-q BARCODECUTOFF] [-a MISMATCH] [-r {read1,read2}]

Clip the barcode sequence and attach it to the front of the seq id

optional arguments:
  -h, --help            show this help message and exit
  -o OUT_FILE, --out_file OUT_FILE
                        Interleaved Fastq files (default: -)
  -1 FASTQ1, --fastq1 FASTQ1
                        Paired end Fastq file 1 with four line/record
  -2 FASTQ2, --fastq2 FASTQ2
                        Paired end Fastq file 2 with four line/record
  -x IDXBASE, --idxBase IDXBASE
                        how many base in 5' end as index? (default:
                        XXXXXXXXXXXXX) X as umi bases, can also add constant
                        regions at the back, such as XXXXCATGC, if CATGC is
                        the constant region
  -q BARCODECUTOFF, --barcodeCutOff BARCODECUTOFF
                        Average base calling quality for barcode sequence
                        (default=20)
  -a MISMATCH, --mismatch MISMATCH
                        Allow how many mismatch in constant region (default:
                        1)
  -r {read1,read2}, --read {read1,read2}
                        Which read is the UMI on?
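
A minimal example invocation (file names and the 6-nt UMI pattern below are placeholders), clipping a UMI from the 5' end of read 1 and writing interleaved reads to a file:

seqtools clipFQ -1 sample_R1.fastq -2 sample_R2.fastq -x XXXXXX -r read1 -o clipped_interleaved.fq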

Paired end fastq merger

Merge paired-end fastq reads into a fragment; only reads that are successfully merged are output.

usage: seqtools mergepe [-h] [-i INTERLEAVED] [-o OUTFILE] [-m MIN_LEN]
                      [-e ERROR] [-a]

Merging interleaved, paired-end fastq file and output overlapped regions only
with error correction using cutadapt module to find overlapping regions

optional arguments:
  -h, --help            show this help message and exit
  -i INTERLEAVED, --interleaved INTERLEAVED
                        Interleaved Fastq files (default: -)
  -o OUTFILE, --outfile OUTFILE
                        Merged fastq file (default: -)
  -m MIN_LEN, --min_len MIN_LEN
                        Minimum length of sequence to output (default: 18)
  -e ERROR, --error ERROR
                        Maximum error rate of alignment (default: 0.1)
  -a, --all             Output all bases (default: only overlapping regions)
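
A minimal example, piping the interleaved output of the UMI clipper above into the merger (file names are placeholders):

seqtools clipFQ -1 sample_R1.fastq -2 sample_R2.fastq -x XXXXXX -r read1 | seqtools mergepe -i - -o merged.fq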

Unique bam splitter

This script splits a bowtie2/hisat2 BAM file into uniquely mapped and multiply mapped alignments.

usage: seqtools splitBam [-h] [-a {bowtie2,hisat2}] [-i INBAM] -o OUTPREFIX

Splitting bam file to uniquely mapped and multiple mapped (only support
bowtie2 and hisat2)

optional arguments:
  -h, --help            show this help message and exit
  -a {bowtie2,hisat2}, --aligner {bowtie2,hisat2}
                        Program generating the bam file (default: bowtie2)
  -i INBAM, --inBam INBAM
                        bam file name, or stdin (default: stdin)
  -o OUTPREFIX, --outprefix OUTPREFIX
                        output prefix ($OUTPUTPREFIX.unique.bam,
                        $OUTPUTPREFIX.multi.bam)
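
A minimal example, assuming a bowtie2 alignment in aligned.bam (a placeholder name); this should produce sample.unique.bam and sample.multi.bam:

seqtools splitBam -a bowtie2 -i aligned.bam -o sample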

Multimap BAM reducer

This script processes multiply mapped reads and outputs either ribosomal reads or the shortest read pairs from the alignments. The input BAM file should be generated by the unique bam splitter (split_uniq_bam.py).

usage: seqtools filterMulti [-h] -i INFILE -o OUTFILE [-b] [-z]

Process multiply mapped bam and get either shortest insert size or ribosomal
reads

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Input bam/sam file (use < - > for stdin)
  -o OUTFILE, --outfile OUTFILE
                        Output bam/sam file (use < - > for stdout)
  -b, --bam_in          Provide this flag if bam instead of sam is used as
                        input
  -z, --bam_out         Provide this flag if bam is needed for output
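
A minimal example, assuming the multiply mapped BAM produced by splitBam above (placeholder names); -b and -z request BAM input and BAM output:

seqtools filterMulti -i sample.multi.bam -o sample.filtered.bam -b -z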

BAM to bed converter

usage: seqtools bam2bed [-h] -i IN_BAM [-o OUT_BED] [-m MIN_SIZE] [-M MAX_SIZE]
                     [-t TAG] [-a]

Making paired-end bam into bed file for every fragment

optional arguments:
  -h, --help            show this help message and exit
  -i IN_BAM, --in_bam IN_BAM
                        BAM file name, or stdin (-) ** name sorted
  -o OUT_BED, --out_bed OUT_BED
                        BED file output (default: - )
  -m MIN_SIZE, --min_size MIN_SIZE
                        minimum fragment size to report
  -M MAX_SIZE, --max_size MAX_SIZE
                        maximum fragment size to report
  -t TAG, --tag TAG     tag to extract
  -a, --all             output supplementary as single fragment
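
A minimal example, assuming a name-sorted BAM and illustrative fragment-size cutoffs (file names are placeholders):

seqtools bam2bed -i name_sorted.bam -o fragments.bed -m 18 -M 1000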

UMI tagger

Add the UMI sequence as the RX tag in a BAM file. Alignments must be from fastq files generated by the UMI clipper (clip_fastq.py).

usage: seqtools bamTag [-h] -i IN_BAM [-o OUT_BAM] [-t TAG] [-d DELIM]
                      [-f FRAGMENT]

Putting UMI as RX tag in bam, enables picard Markduplicates with
BARCODE_TAG=RX

optional arguments:
  -h, --help            show this help message and exit
  -i IN_BAM, --in_bam IN_BAM
                        BAM file name, or stdin (-)
  -o OUT_BAM, --out_bam OUT_BAM
                        BAM file output (default: - )
  -t TAG, --tag TAG     Tag id (default: RX )
  -d DELIM, --delim DELIM
                        Deliminator separating read id and bc (default: _ )
  -f FRAGMENT, --fragment FRAGMENT
                        after splitting read name using {delim}, which fragment
                        is UMI? can use -1 as last piece (default: 0)
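
A minimal example using the default tag (RX), delimiter (_), and UMI field (file names are placeholders); the tagged BAM can then be passed to picard MarkDuplicates with BARCODE_TAG=RX:

seqtools bamTag -i aligned.bam -o tagged.bam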

Deduplicate BED

Deduplicate fragments with UMI in a bed file.

usage: seqtools dedup [-h] [-i INFILE] [-o OUTFILE] [-t THRESHOLD]
                          [-d DELIM] [-f F]

Demultiplexing UMI bed file with the following columns: 1. chrom name 2. start
3. end 4. {$UMI}_{$READ_ID} 5. score 6. strand. The program internally uses a
hamming distance matrix of the barcodes to generate a connected network and
identify UMI clusters

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        input bedfile, can be "-" or "/dev/stdin" for stdin
                        (default: -). Format should be sorted BED: 1. chrom 2.
                        start 3. end 4. {barcode}_{id} 6. strand
  -o OUTFILE, --outfile OUTFILE
                        output bedfile, can be "-" or "/dev/stdout" for stdout
                        (default: -).
  -t THRESHOLD, --threshold THRESHOLD
                        How many error between barcodes can be tolerated?
                        (default = 1)
  -d DELIM, --delim DELIM
                        Deliminator separating read id and bc (default: _ )
  -f F                  after splitting read name using {delim}, which
                        fragment is UMI? can use -1 as last piece (default: 0)
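
A minimal example, assuming a sorted BED with {UMI}_{READ_ID} fragment names (e.g. from bam2bed above; file names are placeholders) and tolerating one mismatch between UMIs:

sort -k1,1 -k2,2n -k3,3n fragments.bed | seqtools dedup -i - -o deduplicated.bed -t 1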

Deinterleave fastq

Deinterleave an interleaved fastq file.

usage: seqtools deinterleave [-h] [-i INFQ] -1 READ1 [-2 READ2]

Deinterleaving an interleaved fastq file

optional arguments:
  -h, --help            show this help message and exit
  -i INFQ, --infq INFQ  interleaved fq (default: -)
  -1 READ1, --read1 READ1
                        read 1 fastq output
  -2 READ2, --read2 READ2
                        read 2 fastq output
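
A minimal example, splitting a placeholder interleaved file back into separate read 1 and read 2 files:

seqtools deinterleave -i interleaved.fq -1 sample_R1.fastq -2 sample_R2.fastq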

Adjust fragment counts for UMI saturation

UMI saturation can be a problem for highly expressed genes; this tool adjusts the fragment count using a Poisson model, n ≈ -m * ln(1 - k/m),

where m is the diversity of the UMI (4^nt, where nt is the number of UMI bases), k is the number of distinct UMIs detected for the fragment, and n is the true number of fragments that we are interested in.
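
As an illustrative calculation under this model: with a 6-nt UMI, m = 4^6 = 4096; if k = 1000 distinct UMIs are observed for a fragment, the adjusted count is n ≈ -4096 × ln(1 - 1000/4096) ≈ 1146, i.e. somewhat more than the 1000 raw UMI counts.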

Input for this tool is a 6-column BED file with each fragment named as {UMI}_{READ_ID}

usage: seqtools calcUMI [-h] -i IN_BED [-o OUT_BED] [--umi UMI]

Adjusting fragment count for UMI saturation, see paper: Counting individual
DNA molecules by the stochastic attachment of diverse labels.

optional arguments:
  -h, --help            show this help message and exit
  -i IN_BED, --in_bed IN_BED
                        BED file name, or stdin (-) ** name sorted
  -o OUT_BED, --out_bed OUT_BED
                        BED file output (default: - )
  --umi UMI             Number of nucleotide as umi (default: 6)
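
A minimal example, assuming a BED file with {UMI}_{READ_ID} fragment names and 6-nt UMIs (file name is a placeholder):

seqtools calcUMI -i fragments.bed -o adjusted.bed --umi 6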

Extracting base counts from BAM along the genome/genomic regions

usage: seqtools pileup [-h] -i BAM -f FASTA [-b BASES] [-q QUAL]
                              [-c CROP] [-r BED] [--no_indel]
                              [--min_coverage MIN_COVERAGE]
                              [--concensus_fasta CONCENSUS_FASTA]

Pileup whole genome, only output bases where coverage > 0

optional arguments:
  -h, --help            show this help message and exit
  -i BAM, --bam BAM     Input bam file (indexed)
  -f FASTA, --fasta FASTA
                        reference fasta file
  -b BASES, --bases BASES
                        number of bases to look at every iteration (default:
                        100000)
  -q QUAL, --qual QUAL  base quality to filter (default: 30)
  -c CROP, --crop CROP  Crop how many bases from ends (default: 0)
  -r BED, --bed BED     bed file for regions (default: whole genome)
  --no_indel            Not considering alignments with Indel
  --min_coverage MIN_COVERAGE
                        Minimum coverage to output
  --concensus_fasta CONCENSUS_FASTA
                        Generate concensus fasta (only work if bed file is
                        provided)
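
A minimal example, assuming an indexed, coordinate-sorted BAM and its reference fasta (placeholder names); since no output option is listed, results are assumed to be written to standard output:

seqtools pileup -i sorted.bam -f genome.fa -q 30 > base_counts.tsv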

Modules

Detailed usage can be found at readthedocs

viz_tools.cor_plot

A specialized paired correlation plot

from sequencing_tools.viz_tools import cor_plot
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')

# simulate a 100-row x 10-column data frame of random values
d = np.random.rand(100, 10)
d = pd.DataFrame(d)

# draw the pairwise correlation plot onto a new figure and save it
fig = plt.figure()
cor_plot(d, fig)
fig.savefig('cor.png', bbox_inches='tight')