[go: nahoru, domu]

Skip to content

cmks/DAS_Tool

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DAS Tool for genome resolved metagenomics

DAS Tool

DAS Tool is an automated method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly.

Reference

Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, Matthias Hess, Susannah G. Tringe & Jillian F. Banfield (2018). Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nature Microbiology. https://doi.org/10.1038/s41564-018-0171-1.

Usage

DAS_Tool [options] -i <contig2bin> -c <contigs_fasta> -o <outputbasename>

Options:
   -i --bins=<contig2bin>                   Comma separated list of tab separated contigs to bin tables.
   -c --contigs=<contigs>                   Contigs in fasta format.
   -o --outputbasename=<outputbasename>     Basename of output files.
   -l --labels=<labels>                     Comma separated list of binning prediction names.
   --search_engine=<search_engine>          Engine used for single copy gene identification (diamond/blastp/usearch) [default: diamond].
   -p --proteins=<proteins>                 Predicted proteins (optional) in prodigal fasta format (>contigID_geneNo).
                                            Gene prediction step will be skipped.
   --write_bin_evals                        Write evaluation of input bin sets.
   --write_bins                             Export bins as fasta files.
   --write_unbinned                         Write unbinned contigs.
   -t --threads=<threads>                   Number of threads to use [default: 1].
   --score_threshold=<score_threshold>      Score threshold until selection algorithm will keep selecting bins (0..1) [default: 0.5].
   --duplicate_penalty=<duplicate_penalty>  Penalty for duplicate single copy genes per bin (weight b).
                                            Only change if you know what you are doing (0..3) [default: 0.6].
   --megabin_penalty=<megabin_penalty>      Penalty for megabins (weight c). Only change if you know what you are doing (0..3) [default: 0.5].
   --dbDirectory=<dbDirectory>              Directory of single copy gene database [default: db].
   --resume                                 Use existing predicted single copy gene files from a previous run.
   --debug                                  Write debug information to log file.
   -v --version                             Print version number and exit.
   -h --help                                Show this.

Input file format

  • Bins [--bins, -i]: Tab separated files of contig-IDs and bin-IDs. Contigs to bin file example:
Contig_1	bin.01
Contig_8	bin.01
Contig_42	bin.02
Contig_49	bin.03
  • Contigs [--contigs, -c]: Assembled contigs in fasta format:
>Contig_1
ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA...
>Contig_2
GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG...
  • Proteins (optional) [--proteins]: Predicted proteins in prodigal fasta format. Header contains contig-ID and gene number:
>Contig_1_1
MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG...
>Contig_1_2
MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN...

Output files

  • Summary of output bins including quality and completeness estimates (*_DASTool_summary.tsv).
  • Contigs to bin file of output bins (*_DASTool_contigs2bin.tsv).
  • Quality and completeness estimates of input bin sets, if --write_bin_evals is set (*_allBins.eval).
  • Bins in fasta format if --write_bins is set (DASTool_bins).

Examples: Running DAS Tool on sample data.

Example 1: Run DAS Tool on binning predictions of MetaBAT, MaxBin, CONCOCT and tetraESOMs. Output files will start with the prefix DASToolRun1:

DAS_Tool  -i sample_data/sample.human.gut_concoct_contigs2bin.tsv,\
sample_data/sample.human.gut_maxbin2_contigs2bin.tsv,\
sample_data/sample.human.gut_metabat_contigs2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_contigs2bin.tsv \
-l concoct,maxbin,metabat,tetraESOM \
-c sample_data/sample.human.gut_contigs.fa \
-o sample_output/DASToolRun1

Example 2: Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step, output evaluations of input bins, set the number of threads to 2 and score threshold to 0.6. Output files will start with the prefix DASToolRun2:

DAS_Tool -i sample_data/sample.human.gut_concoct_contigs2bin.tsv,\
sample_data/sample.human.gut_maxbin2_contigs2bin.tsv,\
sample_data/sample.human.gut_metabat_contigs2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_contigs2bin.tsv \
-l concoct,maxbin,metabat,tetraESOM \
-c sample_data/sample.human.gut_contigs.fa \
-o sample_output/DASToolRun2 \
--proteins sample_output/DASToolRun1_proteins.faa \
--write_bin_evals \
--threads 2 \
--score_threshold 0.6

Dependencies

*) The free version of USEARCH only can use up to 4Gb RAM. Therefore, the use of DIAMOND or BLAST+ is recommended for big datasets.

Installation

# Download and extract DASTool.zip archive:
unzip DAS_Tool-1.x.x.zip
cd ./DAS_Tool-1.x.x

# Unzip SCG database:
unzip ./db.zip -d db

# Run DAS Tool:
./DAS_Tool -h

Installation of dependent R-packages:

$ R
> repo='http://cran.us.r-project.org' #select a repository
> install.packages('data.table', repos=repo, dependencies = T)
> install.packages('magrittr', repos=repo, dependencies = T)
> install.packages('docopt', repos=repo, dependencies = T)
> q() #quit R-session

Installation using conda or homebrew

DAS Tool now can also be installed via bioconda and homebrew.

Bioconda

Bioconda repository: https://bioconda.github.io/recipes/das_tool/README.html. Thanks @keuv-grvl and @silask!.

Add bioconda channel:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Install DAS Tool using conda:

conda install -c bioconda das_tool

Homebrew

Homebrew-bio repository: https://github.com/brewsci/homebrew-bio. Thanks @gaberoo!

Install DAS Tool using Homebrew:

brew install brewsci/bio/das_tool

Docker

It is also possible to run DAS Tool using Docker. A Docker image can be built using the Dockerfile included in the repository:

cd ./DAS_Tool-1.x.x
docker build -t cmks/das_tool .

To test the build run:

docker run --rm -it -v $(pwd)/sample_data:/sample_data -v $(pwd)/sample_output:/sample_output cmks/das_tool \
DAS_Tool -i /sample_data/sample.human.gut_concoct_contigs2bin.tsv,\
/sample_data/sample.human.gut_maxbin2_contigs2bin.tsv,\
/sample_data/sample.human.gut_metabat_contigs2bin.tsv,\
/sample_data/sample.human.gut_tetraESOM_contigs2bin.tsv \
-l concoct,maxbin,metabat,tetraESOM \
-c /sample_data/sample.human.gut_contigs.fa \
-o /sample_output/dockerTest

Preparation of input files

Not all binning tools provide results in a tab separated file of contig-IDs and bin-IDs. A helper script can be used to convert a set of bins in fasta format to tabular contigs2bin file, which can be used as input for DAS Tool: src/Fasta_to_Contigs2Bin.sh -h.

Usage:

Fasta_to_Contigs2Bin: Converts genome bins in fasta format to contigs-to-bin table.

Usage: Fasta_to_Contigs2Bin.sh -e fasta > my_contigs2bin.tsv

   -e, --extension            Extension of fasta files. (default: fasta)
   -i, --input_folder         Folder with bins in fasta format. (default: ./)
   -h, --help                 Show this message.

Example: Converting MaxBin fasta output into tab separated contigs2bin file:

$ ls /maxbin/output/folder
maxbin.001.fasta   maxbin.002.fasta   maxbin.003.fasta...

$ src/Fasta_to_Contigs2Bin.sh -i /maxbin/output/folder -e fasta > maxbin.contigs2bin.tsv

$ head gut_maxbin2_contigs2bin.tsv
NODE_10_length_127450_cov_375.783524	maxbin.001
NODE_27_length_95143_cov_427.155298	maxbin.001
NODE_51_length_78315_cov_504.322425	maxbin.001
NODE_84_length_66931_cov_376.684775	maxbin.001
NODE_87_length_65653_cov_460.202156	maxbin.001

Some binning tools (such as CONCOCT) provide a comma separated tabular output. To convert a comma separated file into a tab separated file a one liner can be used: perl -pe "s/,/\t/g;" contigs2bin.csv > contigs2bin.tsv.

Example: Converting CONCOCT csv output into tab separated contigs2bin file:

$ head concoct_clustering_gt1000.csv
NODE_2_length_147519_cov_33.166976,42
NODE_3_length_141012_cov_38.678171,42
NODE_4_length_139685_cov_35.741896,42

$ perl -pe "s/,/\tconcoct./g;" concoct_clustering_gt1000.csv > concoct.contigs2bin.tsv

$ head concoct.contigs2bin.tsv
NODE_2_length_147519_cov_33.166976	concoct.42
NODE_3_length_141012_cov_38.678171	concoct.42
NODE_4_length_139685_cov_35.741896	concoct.42

Troubleshooting/ known issues

Docopt issue

Problem: When executing DAS Tool a truncated version of the help message is displayed (see below). This is a known bug of the current version of the docopt R package, which occurs if the command-line syntax is violated.

Error: DAS Tool

Usage:
  DAS_Tool [options] -i <contig2bin> -c <contigs_fasta> -o <outputbasename>
  DAS_Tool [--help]

Options:
   -i --bins=<contig2bin>                   Comma separated list of tab separated contigs to bin tables.
   -c --contigs=<contigs>                   Contigs in fasta format.
   -o --outputbasename=<outputbasename>     Basename of output files.
   -l --labels=<labels>                     Comma separated list of binning prediction names.
   --search_engine=<search_engine>          Engine used for single copy gene identification (di
Execution halted

Solution: Check command line for any typos.

Dependencies not found

Problem: All dependencies are installed and the environmental variables are set but DAS Tool still claims that specific depencendies are missing.

Solution: Make sure that the dependency executable names are correct. For example USEARCH has to be executable with the command If your USEARCH binary is called differently (e.g. usearch9.0.2132_i86linux32) you can either rename it or add a symbolic link called usearch:

$ ln -s usearch9.0.2132_i86linux32 usearch

Memory limit of 32-bit usearch version exceeded

Problem: Running DAS Tool with the free version of USEARCH on a large metagenomic dataset results in the following error:

---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required
makeblastdb did not work for my_proteins.faa, please check your input file

Solution: Use DIAMOND or BLAST as alignment tool (--search_engine diamond or --search_engine blast):