Allow the concurrent run of multiple pipeline revisions #2870

Open
pditommaso opened this issue May 13, 2022 · 33 comments · May be fixed by #4659
Comments

@pditommaso
Member

Summary

Nextflow relies on built-in integration with Git to pull and run a workflow.

When the user specifies the Git repository URL on the run command line, Nextflow carries out a Git clone, stores the pipeline code in the $HOME/.nextflow/assets directory, and launches the execution from there.

When the user specifies the -r (revision) CLI option, the repository is checked out at the specified revision, i.e. a branch, tag or even commit id.

However, this poses a problem when two or more users run different versions at the same time, because the last one to perform the operation overrides the repository code checked out by the previous one, which can be disruptive.
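A minimal sketch of the conflict described above, using a throwaway local repository in place of a real pipeline. The paths, branch names and file contents are made up for illustration, and plain git commands stand in for what Nextflow does internally.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
upstream="$tmp/upstream"
# hypothetical upstream pipeline with two revisions: main and dev
git init --quiet "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
echo 'println "main"' > "$upstream/main.nf"
git -C "$upstream" add main.nf
git -C "$upstream" commit --quiet -m "main version"
git -C "$upstream" checkout --quiet -b dev
echo 'println "dev"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version"
git -C "$upstream" checkout --quiet main

# single shared checkout, standing in for $HOME/.nextflow/assets/<project>
assets="$tmp/assets/demo/pipeline"
git clone --quiet "$upstream" "$assets"

# user A starts a long run on the default revision
seen_by_a_at_start=$(cat "$assets/main.nf")

# user B then asks for the dev revision, which re-checks out the same directory
git -C "$assets" checkout --quiet dev

# user A's run now reads different code from the same path
seen_by_a_later=$(cat "$assets/main.nf")
echo "A at start: $seen_by_a_at_start"
echo "A later:    $seen_by_a_later"
```

Because both runs resolve the pipeline through the same directory, the second checkout silently changes what the first run sees.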

This is not such an unlikely event considering a pipeline execution can last for hours or even days.

To mitigate this problem, Nextflow refuses to perform a run if the project is currently checked out at a non-default revision and the run does not explicitly specify the revision to be executed. However, this causes other unexpected side effects. See here.

Goal

The goal of this enhancement is to allow the concurrent use of multiple pipeline revisions on the same computer and to deprecate the need for the sticky revision check.

This could be achieved by downloading the Git repository with a bare clone instead of a normal clone, and checking out the work tree into a separate subdirectory named after the commit id associated with the specified revision.

For example, if the user runs

nextflow run https://github.com/nextflow-io/hello

Nextflow should clone the repo above with the bare option and store it in the path $HOME/.nextflow/assets/nextflow-io/hello.git

The default branch is then implicitly checked out, so the associated commit should be retrieved, e.g. 4eab81bd42eed592f4371cd91b755ec78df25fe9, and the following path should be created to hold the work tree used for the execution

$HOME/.nextflow/assets/nextflow-io/hello.git/.nextflow/revs/4eab81bd42eed592f4371cd91b755ec78df25fe9
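The bare-clone layout described above could look roughly like this in plain git. This is only a sketch: it uses a throwaway local repository instead of GitHub, the helper name checkout_rev is hypothetical, and extracting the tree with git archive is an assumption about how the work tree might be materialised, not something the proposal specifies.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
upstream="$tmp/upstream"
git init --quiet "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
echo 'println "main"' > "$upstream/main.nf"
git -C "$upstream" add main.nf
git -C "$upstream" commit --quiet -m "main version"
git -C "$upstream" checkout --quiet -b dev
echo 'println "dev"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version"
git -C "$upstream" checkout --quiet main

# "$tmp/assets" stands in for $HOME/.nextflow/assets
bare="$tmp/assets/nextflow-io/hello.git"
git clone --quiet --bare "$upstream" "$bare"

checkout_rev() {
  # $1 = revision (branch, tag or commit id); prints the work tree path
  commit=$(git --git-dir="$bare" rev-parse --verify "$1^{commit}")
  rev_dir="$bare/.nextflow/revs/$commit"
  if [ ! -d "$rev_dir" ]; then
    mkdir -p "$rev_dir"
    # export the files of that commit without touching the bare clone's HEAD
    git --git-dir="$bare" archive "$commit" | tar -xf - -C "$rev_dir"
  fi
  echo "$rev_dir"
}

main_tree=$(checkout_rev HEAD)
dev_tree=$(checkout_rev dev)
echo "default revision: $main_tree"
echo "dev revision:     $dev_tree"
```

Each revision ends up in its own directory keyed by commit id, so two runs on different revisions never share a work tree.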

When the user specifies a different revision, e.g.

nextflow run https://github.com/nextflow-io/hello -r dev

A new subdirectory with the corresponding commit id should be created.

The commit id should be resolved against the local git clone, unless the -latest option is specified.
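As a rough illustration of that rule, the sketch below resolves a revision against the local bare clone and only contacts the remote when asked to. The function name resolve_commit and its "latest" argument are hypothetical stand-ins for the -latest option, and the repository is a throwaway local one.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
upstream="$tmp/upstream"
git init --quiet "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
echo 'println "main"' > "$upstream/main.nf"
git -C "$upstream" add main.nf
git -C "$upstream" commit --quiet -m "main version"
git -C "$upstream" checkout --quiet -b dev
echo 'println "dev"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version"

bare="$tmp/assets/demo/pipeline.git"
git clone --quiet --bare "$upstream" "$bare"

resolve_commit() {
  # $1 = bare clone, $2 = revision, $3 = "latest" to refresh from the remote first
  if [ "${3:-}" = "latest" ]; then
    git --git-dir="$1" fetch --quiet origin \
      '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
  fi
  git --git-dir="$1" rev-parse --verify "$2^{commit}"
}

old=$(resolve_commit "$bare" dev)

# the upstream dev branch moves on after the local clone was made
echo 'println "dev v2"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version 2"
new_upstream=$(git -C "$upstream" rev-parse dev)

stale=$(resolve_commit "$bare" dev)          # local answer, no network access
fresh=$(resolve_commit "$bare" dev latest)   # refreshed from the remote
echo "without latest: $stale"
echo "with latest:    $fresh"
```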

@jorgeaguileraseqera
Contributor

@pditommaso one question:

do we need to check whether the repo is already present in the asset directory in the "old" format (not bare) and, in that case, not use the bare feature? i.e. some kind of backward compatibility, or will we force users to remove and recreate local repos?

@pditommaso
Member Author

Good point. If it already exists, I think it should report a warning message, maybe?

@jorgeaguileraseqera
Contributor

so, report with a warning message (maybe with some instructions to remove the current repo) and stop the command, right?

@pditommaso
Member Author

No, I mean show a warning message, i.e. log.warn, and do not stop. Usually Nextflow only stops on error.

@jorgeaguileraseqera
Contributor

ah ok, so if I understand correctly we try to identify which kind of repo we are working on at startup

  • if it's a working git (no bare) we use it as usual (with a warning message)
  • if it's a bare git we "activate" this feature
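One way the startup check in the two bullets above might be done with plain git is sketched here. The function name repo_layout and the want_bare flag are hypothetical; only git rev-parse --is-bare-repository is an actual git feature.

```shell
set -eu
tmp=$(mktemp -d)
git init --quiet "$tmp/legacy"            # old-style asset: normal working repo
git init --quiet --bare "$tmp/new.git"    # new-style asset: bare repo

repo_layout() {
  # prints "bare", "worktree" or "missing" for the given asset path
  is_bare=$(git -C "$1" rev-parse --is-bare-repository 2>/dev/null) || {
    echo missing
    return 0
  }
  case "$is_bare" in
    true) echo bare ;;
    *)    echo worktree ;;
  esac
}

want_bare=true   # hypothetical config flag or env variable enabling the feature
for asset in "$tmp/legacy" "$tmp/new.git" "$tmp/absent"; do
  layout=$(repo_layout "$asset")
  if [ "$want_bare" = true ] && [ "$layout" = worktree ]; then
    # warn and carry on, rather than stopping the run
    echo "WARN: $asset is a non-bare clone, using it as usual" >&2
  fi
  echo "$asset -> $layout"
done
```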

@pditommaso
Member Author
pditommaso commented Jun 13, 2022

I see your point. In principle, the bare repo should have been created when the feature was enabled with a config flag or env variable, right?

If so, I think a warning should be reported when this option does not match the repo format.

@pditommaso
Member Author

@jorgeaguileraseqera any ETA for this?

@jorgeaguileraseqera
Contributor

I hope to have it in the next few days

(it's a little tedious due to the API rate limit breaks sometimes to run all tests )

@pditommaso
Member Author

Can you please open at least a draft PR asap?

@pditommaso
Member Author

due to the API rate limit breaks sometimes to run all tests

Do you mean GitHub rate limits? Are you using your GITHUB_TOKEN for tests?

@jorgeaguileraseqera
Contributor

yes, I've created one and configured the env to run the tests

@pditommaso
Member Author

Weird, but such tests should not depend on GitHub. You can create a small test repo and then use it for testing.

There's something similar for testing Git submodules
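A small local fixture of the kind suggested above might be built like this. It is only a sketch: the helper name make_test_repo, the file contents, and the branch and tag names are made up, and this is not the fixture used by the existing submodule tests.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

make_test_repo() {
  # $1 = directory for the throwaway upstream repository
  git init --quiet "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/main
  echo 'println "main"' > "$1/main.nf"
  git -C "$1" add main.nf
  git -C "$1" commit --quiet -m "main version"
  git -C "$1" tag v1.0
  git -C "$1" checkout --quiet -b dev
  echo 'println "dev"' > "$1/main.nf"
  git -C "$1" commit --quiet -am "dev version"
  git -C "$1" checkout --quiet main
}

tmp=$(mktemp -d)
make_test_repo "$tmp/upstream"

# the code under test can now clone over file:// with no network or API limits
git clone --quiet "file://$tmp/upstream" "$tmp/clone"
git -C "$tmp/clone" branch -r
git -C "$tmp/clone" tag --list
```

The fixture provides a default branch, a second branch and a tag, which covers the revision kinds that -r accepts apart from raw commit ids (those can be read with git rev-parse).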

@pditommaso pditommaso modified the milestones: 22.10.0, 23.04.0 Oct 14, 2022
@notestaff

Implementing the functionality in this issue would also solve issue #2655 .

Maybe also, clarify in the issue title that "concurrent run" is only for runs from different working directories (with different work/ and .nextflow subdirs).

@ewels
Member
ewels commented Dec 2, 2022

Maybe also, clarify in the issue title that "concurrent run" is only for runs from different working directories (with different work/ and .nextflow subdirs).

Note that we're talking about the NXF_HOME folder (~/.nextflow), not the hidden .nextflow folder in the launch directory here.

@pditommaso
Member Author

We lost the momentum with this feature :/

@lukbut
lukbut commented Dec 5, 2022

Hi! This was recently brought to my attention. Just flagging that this would likely impact our engineers who might be developing on different feature branches but on the same workflow repo, on our development environments (which currently only run on our on-prem infrastructure).

@pditommaso
Member Author

Impacting in a good or bad way?

@lukbut
lukbut commented Dec 5, 2022

Hi @pditommaso, impact in a bad way, I'm afraid! Our current idea for developing workflows within our organisation is for engineers to have their own branch in a workflow repository. They would implement changes in their own branch, and potentially run said workflows on our on-prem infrastructure to test their implementations. I believe that due to this bug, the engineers would end up overwriting each other's workflow implementations if multiple implementations of the same workflow are tested at the same time?

@pditommaso
Member Author

Understood, but it's not a bug. Nextflow has always worked this way. The goal of this issue is exactly to overcome this limitation.

@leonorpalmeira

Don't know how the solution to this issue will be implemented, but don't forget (see #2655 (comment)) the use case where a developer has their own repository (outside of Nextflow's built-in integration of pull and run commands) and switches between branches during the execution of a pipeline. The solution to this issue should be that the execution shouldn't be affected by modifications of the original repository. Thanks :-)

@lukbut
lukbut commented Oct 18, 2023

Hi! This issue has just come up again at Genomics England as it is likely that our engineers would want to run different branches of the workflows simultaneously. Is there any chance that this is getting implemented soon?

@bentsherman
Member

Hey Luke, we are planning to implement this but no set timeline yet.

@pditommaso
Member Author

Indeed, it is something to prioritize. Tagging @marcodelapierre for visibility

@pditommaso pditommaso modified the milestones: 23.10.0, 24.04.0 Oct 30, 2023
@marcodelapierre marcodelapierre self-assigned this Nov 1, 2023
@marcodelapierre
Member
marcodelapierre commented Nov 1, 2023

Paolo I have found a git functionality for this.

Let's bash code:

# for ease of description
ROOT_DIR="/path/to/.nextflow/assets"
repo="nextflow-io/hello"
revision="rocket"
def_remote="origin"

# user
nextflow run $repo -r $revision

# behind the scenes

# first revision requested: clone the default branch into a directory
# named after it, reachable through the first_branch symlink
if [ ! -d "$ROOT_DIR/$repo" ] ; then
  mkdir -p "$ROOT_DIR/$repo"
  git clone "https://github.com/$repo" "$ROOT_DIR/$repo/first"
  cd "$ROOT_DIR/$repo/first"
  def_branch=$( git remote show "$def_remote" | sed -n '/HEAD branch/s/.*: //p' )
  cd ..
  mv first "$def_branch"
  ln -s "$def_branch" first_branch
fi

# additional revision: only if its work tree is not there already
if [ ! -d "$ROOT_DIR/$repo/$revision" ] ; then
  cd "$ROOT_DIR/$repo/first_branch"
  git worktree add --track -b "$revision" "../$revision" "$def_remote/$revision"
fi

The key functionality is this one:

git worktree add --track -b dsl2 ../dsl2 origin/dsl2 

Docs: https://git-scm.com/docs/git-worktree

Found here: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously
And also here: https://stackoverflow.com/questions/6270193/how-can-i-have-multiple-working-directories-with-git/30185564#30185564

What do you think?

If you like it, I can give it a shot myself, soon after I have worked on another couple of pending work items.

@marcodelapierre
Member

Forgot to mention the key advantage: only the repo file tree is duplicated, whereas all the Git related files such as in .git/ exist only once
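This property can be checked on a throwaway repository, as in the sketch below. The directory names are arbitrary; the point is only that the linked work tree gets a small .git file pointing back at the main checkout rather than its own .git directory.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
upstream="$tmp/upstream"
git init --quiet "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
echo 'println "main"' > "$upstream/main.nf"
git -C "$upstream" add main.nf
git -C "$upstream" commit --quiet -m "main version"
git -C "$upstream" checkout --quiet -b dev
echo 'println "dev"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version"
git -C "$upstream" checkout --quiet main

root="$tmp/assets/demo/pipeline"
git clone --quiet "$upstream" "$root/main"

# add a second work tree for dev next to the main checkout
git -C "$root/main" worktree add --track -b dev ../dev origin/dev >/dev/null 2>&1

# the main checkout owns the object store; the linked tree only references it
ls -ld "$root/main/.git" "$root/dev/.git"
git -C "$root/main" worktree list
```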

@marcodelapierre
Member

@pditommaso keen on your take on my proposed solution before I work on the implementation

@pditommaso
Member Author

This is indeed an excellent idea. This could simplify the solution compared to the use of the bare repository approach.

Using the worktree solution, the main/master checkout should remain in the current location. When -r <revision> is requested, a new work tree should instead be created under the path $NXF_ASSETS/revisions/<unique-id>, where unique-id is computed as the sipHash24 of Project URI + revision.

The --detach flag is likely to be useful as well.
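A sketch of that layout with plain git, under stated assumptions: sipHash24 is not available from the shell, so git hash-object is used here purely as a stand-in hash, the project URI is only hashed and never contacted, and the revision is assumed to be a remote branch.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
upstream="$tmp/upstream"
git init --quiet "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
echo 'println "main"' > "$upstream/main.nf"
git -C "$upstream" add main.nf
git -C "$upstream" commit --quiet -m "main version"
git -C "$upstream" checkout --quiet -b dev
echo 'println "dev"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version"
git -C "$upstream" checkout --quiet main

NXF_ASSETS="$tmp/assets"
main_checkout="$NXF_ASSETS/nextflow-io/hello"   # stays where it is today
git clone --quiet "$upstream" "$main_checkout"

project_uri="https://github.com/nextflow-io/hello"
revision="dev"
# stand-in for sipHash24(project URI + revision)
unique_id=$(printf '%s %s' "$project_uri" "$revision" | git hash-object --stdin)
rev_dir="$NXF_ASSETS/revisions/$unique_id"

if [ ! -d "$rev_dir" ]; then
  mkdir -p "$NXF_ASSETS/revisions"
  # --detach avoids creating a local branch for every requested revision
  git -C "$main_checkout" worktree add --detach "$rev_dir" "origin/$revision" >/dev/null 2>&1
fi
echo "work tree for $revision: $rev_dir"
```

Keying the directory on a hash of URI and revision keeps the path flat and safe even when the revision name contains slashes.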

@marcodelapierre
Member

maybe this path for non-master revisions: $NXF_ASSETS/revisions/$repo/$revision

@marcodelapierre
Member

Apologies @pditommaso, I had to prioritise other activities with larger customer impact.

I am keen to get this one done, on top of my list for when I am back in January.

@notestaff

Ideally, the worktree should be checked out with all submodules recursively cloned, or there should be an option to do so. But if this complicates things, can be left for a later release.

Thanks a lot for working on this!

@marcodelapierre
Member
marcodelapierre commented Jan 8, 2024

Working on it.
Turns out that the eclipse.jgit project we currently rely on does not support git worktree; there is a PR (https://bugs.eclipse.org/bugs/show_bug.cgi?id=477475) that has been open for years and only adds support for managing existing worktrees, not even for creating new ones.

Proposed steps for way forward:

  1. start by implementing just a change in the clone directory structure and repo management, so that multiple revisions of a repo are supported;
  2. consider whether to strip .git from cloned pipelines to save disk space;
  3. if relevant, explore alternatives to jgit (if any) that have wider git support (the main advantage of worktree is indeed avoiding the .git duplicates, so I don't think this step needs exploring).

At this stage, I believe 1. can already be good enough. In its basic implementation it would duplicate the .git files; however, is a local collection of revisions of a pipeline very much different from one of multiple pipelines?

So, going to proceed with 1. to begin with.
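Step 1 could be approximated with one full clone per revision, as sketched below. The directory layout ($assets/<project>/<revision>) and the helper name clone_revision are assumptions for illustration, not the structure that was eventually implemented, and git clone --branch only accepts branch or tag names, not commit ids.

```shell
set -eu
# identity for the throwaway commits below
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
upstream="$tmp/upstream"
git init --quiet "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
echo 'println "main"' > "$upstream/main.nf"
git -C "$upstream" add main.nf
git -C "$upstream" commit --quiet -m "main version"
git -C "$upstream" checkout --quiet -b dev
echo 'println "dev"' > "$upstream/main.nf"
git -C "$upstream" commit --quiet -am "dev version"
git -C "$upstream" checkout --quiet main

assets="$tmp/assets"   # stands in for $HOME/.nextflow/assets

clone_revision() {
  # $1 = repository URL, $2 = project name, $3 = branch or tag name
  dest="$assets/$2/$3"
  if [ ! -d "$dest" ]; then
    # every revision gets its own complete clone, including its own .git
    git clone --quiet --branch "$3" "$1" "$dest"
  fi
  echo "$dest"
}

main_dir=$(clone_revision "$upstream" demo/pipeline main)
dev_dir=$(clone_revision "$upstream" demo/pipeline dev)
echo "main: $main_dir"
echo "dev:  $dev_dir"
```

The cost is the duplicated .git directory per revision, which is the trade-off weighed in the list above.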
