Coercion of String to File outside a task: where does it happen? #581

adamnovak opened this issue Jul 28, 2023 · 1 comment
@adamnovak

In WDL 1.1, String has an "obvious and unambiguous" coercion to File:

https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#type-coercion

What is this conversion, exactly?

For example, a natural way to implement a scatter is to have each copy of the body execute in parallel on a different machine. So if you do something like:

workflow Example {

    input {
        String filename = "/etc/hostname"
    }

    File firstFile = filename
    String firstContent = read_string(firstFile)
    scatter(iteration in range(10000)) {
        File file = filename
        String content = read_string(file)
    }

    output {
        String first = firstContent
        Array[String] others = content
    }

}

What is supposed to be true about the others array? Are the values all the same? Are they all the same as first? For that matter, is first always the same thing you would see if you ran cat /etc/hostname instead of executing the workflow?

When working with functions that take either File or String, the spec says that Strings are interpreted "relative to the current working directory of the task"; presumably that would apply to coercions as well. But here we're not in a task, and a workflow as a distributed thing doesn't seem to me to have a natural working directory.
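
For contrast, inside a task that rule is at least well defined. Here is a minimal sketch of what I understand it to mean (the file name is invented for illustration):

task ReadRelative {

    command <<<
        # Create a file in the task's working directory.
        echo "hello" > produced.txt
    >>>

    output {
        # Per the coercion rule, the String "produced.txt" is resolved
        # relative to the task's working directory when coerced to File.
        File producedFile = "produced.txt"
        String producedContent = read_string(producedFile)
    }

}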

In the Toil workflow runner's WDL interpreter, I really want to be allowed to move workflow code to any machine at any line, or even between lines, so I can ship fragments of it to the cluster when they're ready to run, just like the tasks. But that seems to be breaking some of the GATK workflows which assume that, at least at top-level scope, a workflow can always read the same filesystem its caller sees. So I'm trying to work out how legitimate that assumption is, and whether there are guarantees that go further than that.

@patmagee

@adamnovak great question, and this is only a half answer. I would really like to tease out your use case a bit more, because federation is something I am personally interested in as well...

For the time being, though, you are correct in stating that there seems to be an assumption that the workflow itself has access to the same file system as the individual tasks. That is, in general, how engines have been implemented and what the expected behaviour is.

Whether this is the correct behaviour, or the logical behaviour, is a totally different matter. It has led to a lot of complexity within engines and the community.

The path representation of a file is likely different within the context of a task versus a workflow. In a workflow, the path is often the absolute reference to the file (think an object in an object store, or a file on a shared file system). However, within a task, the engine has a bit of liberty to change the path and make it something relative to within the container.
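
As a rough sketch of what I mean (the bucket URL and the localization behaviour are made up for illustration; exactly how localization happens is engine specific):

workflow PathViews {

    input {
        # At workflow scope, the File is typically represented by its
        # absolute reference: an object in an object store, or a path
        # on a shared file system.
        File data = "gs://example-bucket/inputs/data.txt"
    }

    call ShowPath { input: f = data }

}

task ShowPath {

    input {
        File f
    }

    command <<<
        # Inside the task, the engine may have localized the file, so
        # ~{f} usually expands to a container-local path rather than
        # the original gs:// URL.
        echo "~{f}"
    >>>

    output {
        String pathSeen = read_string(stdout())
    }

}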
