A Python program for evaluating site-level concordance of a query VCF against a truth VCF.
The summary metric CSV file contains:
- Variant type: SNV or INDEL
- Total number of variants in the truth and query VCF files
- Total true-positive, false-positive, and false-negative calls
- Recall and Precision
This tool compares two variant callsets against each other and produces a CSV file with summary metrics.
In the following examples, we assume that the code has been installed to the directory ${VCFCompare}
.
$ python3 ${VCFCompare}/src/python/VCFCompare.py \
-t example/gatk_variants.vcf \
-q example/bcftools_variants.vcf \
-o test
$ ls test.*
test.csv
The example above compares an example run of GATK 4.1.0.0 against an example run of bcftools 1.9 on the same random sample.
The summary metric CSV file contains:
Type | TRUTH.TOTAL | TP | FP | FN | QUERY.TOTAL | Recall | Precision |
---|---|---|---|---|---|---|---|
SNV | 3610 | 3573 | 538 | 37 | 4111 | 0.989750693 | 0.869131598 |
INDEL | 205 | 104 | 101 | 101 | 247 | 0.507317073 | 0.421052632 |
If you want to visualize the difference and intersection between the truth and query VCF files, you can use the upset.R script under src/R/ The upset.R script runs on a set of VCFCompare.py results produced with the -o flag, as shown in the example above
$ Rscript ${VCFCompare}/src/R/upset.R \
-i test.csv \
-o TestRun
$ ls TestRun.*
TestRun.INDEL.pdf TestRun.SNV.pdf
This will produce two PDF files: one for SNVs and one for INDELs. Below is a screenshot for SNVs