Combining aws-java-nio-spi-for-s3 with GATK. #8672

DanishIntizar · 2024-02-01T13:26:18Z

Introduction

Amazon recently released aws-java-nio-spi-for-s3, a tool that lets Java-based applications read from and write to S3 without being recompiled. We have been using it, together with a locally modified version of GATK, to communicate with AWS. Since we already had the code, we decided to share it; perhaps it can become part of the GATK toolkit in the future.

How does it work?

The user can pass an additional parameter, '--s3', which adds the nio-spi-for-s3-2.0.0-dev-all.jar file to the Java classpath. File locations starting with 's3://' can then be provided, and those files are read from or written to AWS. When using this option, however, the AWS credentials have to be set correctly, for which you can find more information here. I haven't implemented it for --spark yet, due to a lack of need and inexperience with Spark.
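For reviewers unfamiliar with the mechanism: Java NIO discovers file system providers from the classpath and selects one by URI scheme, which is why adding the jar is enough. The sketch below is not part of this PR; the class name `SchemeCheck` and the bucket name are hypothetical. It only checks whether a provider for a given scheme is installed before converting a URI to a Path.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileSystemProvider;
import java.util.Optional;

// Hypothetical helper, not GATK code: shows how NIO picks a provider by URI scheme.
final class SchemeCheck {

    // True if some installed FileSystemProvider claims this scheme (e.g. "file", "s3").
    static boolean hasProvider(String scheme) {
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            if (provider.getScheme().equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    // Converts a URI string to a Path only when a matching provider is installed.
    static Optional<Path> toPath(String uri) {
        URI parsed = URI.create(uri);
        if (parsed.getScheme() == null || !hasProvider(parsed.getScheme())) {
            return Optional.empty();
        }
        return Optional.of(Paths.get(parsed));
    }

    public static void main(String[] args) {
        // Expected to report false unless the S3 provider jar is on the classpath.
        System.out.println("s3 provider installed: " + hasProvider("s3"));
        System.out.println("file provider installed: " + hasProvider("file"));
    }
}
```

Running this with and without the jar on the classpath is a quick way to confirm that the '--s3' option actually made the provider visible to the JVM.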

Current Issues

We found some issues for which we do not know a solution. If this tool were to be included in GATK in the future, these would have to be resolved eventually.

Doesn't work for picard-based tools

First, 'aws-java-nio-spi-for-s3' doesn't seem to work for (most) picard tools, since most of them use the java.io.File class, which is limited to local filesystem files, as opposed to java.nio.file.Path (we think).
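A minimal sketch of why a File-based code path cannot reach S3, assuming a Unix-like system (the class name `FileVsPath` and the bucket name are hypothetical, and this is not picard code): java.io.File treats the location as a plain local path string and is always bound to the default file system, whereas a URI keeps the scheme that NIO needs to pick a provider.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.FileSystems;

// Hypothetical illustration, not picard code. Assumes a Unix-like platform.
final class FileVsPath {

    // What java.io.File makes of the location: a normalised local path string.
    static String viaFile(String location) {
        return new File(location).getPath();
    }

    // File.toPath() always yields a Path on the default (local) file system.
    static boolean fileIsLocalOnly(String location) {
        return new File(location).toPath().getFileSystem() == FileSystems.getDefault();
    }

    // A URI preserves the scheme that selects the NIO provider.
    static String scheme(String location) {
        return URI.create(location).getScheme();
    }

    public static void main(String[] args) {
        String location = "s3://example-bucket/sample.bam";
        System.out.println("File sees:        " + viaFile(location));
        System.out.println("File is local:    " + fileIsLocalOnly(location));
        System.out.println("URI scheme:       " + scheme(location));
    }
}
```

Once a location has passed through File, the 's3://' prefix is no longer intact, so no provider jar on the classpath can help that tool.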

Issues reading genome reference files from AWS

Secondly, most tools that require a reference genome (e.g. BaseRecalibrator, HaplotypeCaller) do not seem to function when provided with a reference genome file stored on AWS. The error we receive can be found underneath and is much less clear. We believe that the issue lies in the interaction between the caching of the indexed reference file and 'aws-java-nio-spi-for-s3', since we verified with a small custom Java program that the 'htsjdk' package works as intended when the reference genome is read from AWS.
Notably, some tools do not have this issue, such as the VQSR tools (VariantRecalibrator and ApplyVQSR).

lbergelson · 2024-02-06T15:49:38Z

@DanishIntizar Hello! Thank you for this PR. It's great to see an official plugin from Amazon available. I appreciate that you took the time to make it an optional include. I think if we're going to include it we might as well just add it as one of our normal dependencies though. Assuming there aren't any dependency conflicts it should (always a risky statement) be independent from everything else.

Thanks also for identifying the different issues you mentioned. It's expected that it won't work with most picard tools, as you discovered, but we're actively in the process of updating more of them to support Paths instead of Files, so that will slowly improve.

The second issue is more worrisome. We regularly use an equivalent provider with Google to read reference files through the exact same code, so I suspect there is some sort of mismatched assumption in the way they are handling things. Maybe something strange with the Path.resolve methods or the like. (Or, in the much worse potential case, a bug in their look-ahead caching.)

I'd like to look into that before we'd merge this. Ideally we would have tests for this. Are there any public AWS paths we could read from without any secret authentication?
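As a starting point for such a test, here is a sketch of the kind of Path operation that reference loading depends on: a FASTA is normally accompanied by a `.fai` index and a `.dict` sequence dictionary, and these companions are typically located relative to the FASTA path. The class `ReferenceSiblings` and its method names are hypothetical and this is not the actual GATK or htsjdk code; it handles only a single file extension. Running the same checks against an 's3://' Path and a local Path could show whether the provider's resolve behaviour differs.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical probe, not GATK/htsjdk code: derives companion files of a FASTA.
final class ReferenceSiblings {

    // ref.fasta -> ref.fasta.fai, located next to the FASTA.
    static Path fastaIndex(Path fasta) {
        return fasta.resolveSibling(fasta.getFileName().toString() + ".fai");
    }

    // ref.fasta -> ref.dict, located next to the FASTA (last extension replaced).
    static Path sequenceDictionary(Path fasta) {
        String name = fasta.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        return fasta.resolveSibling(stem + ".dict");
    }

    // Resolved siblings should stay on the same provider as the original path.
    static boolean sameProvider(Path a, Path b) {
        return a.getFileSystem().provider() == b.getFileSystem().provider();
    }

    public static void main(String[] args) {
        Path reference = Paths.get("data", "ref.fasta");
        System.out.println(fastaIndex(reference));
        System.out.println(sequenceDictionary(reference));
        System.out.println(sameProvider(reference, fastaIndex(reference)));
    }
}
```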

DanishIntizar · 2024-02-12T07:30:45Z

Hello! Unfortunately, I am unable to provide public AWS paths myself, since I tested it using our own AWS credentials. What I can do, however, is provide the reference data we used (although I believe you have enough testing data already). I can also try to put you in contact with the developer of this tool, if you want.

added support of reading and writing directly to AWS using the nio-sp…

c9696d8

…i-s3 tool. Currently only works locally.

DanishIntizar marked this pull request as draft February 2, 2024 07:56

DanishIntizar marked this pull request as ready for review February 2, 2024 13:24

DanishIntizar force-pushed the master branch from 67f358a to c9696d8 Compare February 6, 2024 09:10
