Combining aws-java-nio-spi-for-s3 with GATK. #8672
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Introduction
Recently, Amazon has created the tool aws-java-nio-spi-for-s3 that allows java-based applications to read and/or write to aws without the need for recompilation during runtime. Since then, we've utilised this tool, in conjunction with a locally modified version of gatk, to communicate with aws. Since we had the code that allows for communication with aws anyway, we decided to share it and maybe it can be part of the gatk toolkit in the future.
How does it work?
The user is able to provide an additional parameter '--s3', adding the nio-spi-for-s3-2.0.0-dev-all.jar file to the java classpath. File locations starting with 's3://' are then able to be provided, resulting of reading/writing of these files to aws. When using this option, however, the aws credentials have to be set correctly, for which you can find more information here. Currently, I haven't implemented it for --spark due to a lack of need/inexperience with spark.
Current Issues
We found some issues for which we do not know any solution. If this tool was to be implemented in GATK in the future, these have to be resolved eventually.
Doesn't work for picard-based tools
First, 'aws-java-nio-spi-for-s3' doesn't seem work for (most) picard tools, since most of them utilise the java.io.File package, which is limited to local filesystem files, as opposed to java.nio.Path (we think).
Issues reading genome reference files from AWS
Secondy, most tools that require a reference genome (i.e. BaseRecalibrator, HaplotypeCaller..) do not seem function when provided with a reference genome file stored on AWS. The error we receive can be found underneath and is much less clear. We believe that the issue lies in the interaction between the caching of the indexed reference file and 'aws-java-nio-spi-for-s3', since we tested in a custom java script that the package 'htsjdk' works like intended when the reference genome is read from AWS.
Notably, some tools do not have this issue, such as the the vqsr tools (VariantRecalibrator and applyVQSR).