[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

samuelklee
Copy link
Contributor
@samuelklee samuelklee commented Oct 23, 2023

Copying over some discussion from Slack, with some slight modifications:

I took a quick stab at updating the environment for gCNV. Even taking out TensorFlow (assuming that the CNN will not be supported by this environment), it's a difficult task:

  1. The goal is to update Python from 3.6 to 3.10+, since Terra now requires the latter for officially supported images.
  2. However, gCNV relies on the PyMC3 package. PyMC3 3.1 is currently used in GATK master. 3.1 was released in 2017, not long before our release of gCNV in 2018, but it's very old now.
  3. The latest version of Python that is supported by PyMC3 3.1 in conda is Python 3.6.
  4. @asmirnov239 has a draft PR (Add pytorch to the conda environment #8094) that updates PyMC3 to 3.5 and Python to 3.7, which clearly still falls short of Python 3.10+. This PR also updated some gCNV code to make it compatible with PyMC3 3.5. (It also removed TensorFlow and added PyTorch.)
  5. @asmirnov239 also merged a PR that added tests for numerical reproducibility of GermlineCNVCaller in cohort mode in Added gCNV integration test to detect numerical differences in the outputs. #7889.
  6. The earliest version of PyMC that supports Python 3.10+ is PyMC 4, released in 2022.
  7. However, PyMC 4 introduces API changes, which will also require additional gCNV code changes and numerical testing.
  8. These API changes are because the underlying computational backend for PyMC was updated from Theano (think of this as an old alternative to TensorFlow) to Aesara.
  9. Since then, PyMC 5.9 has been released and the underlying backend has been updated again, from Aesara to PyTensor.
  10. So if we are going to update the environment to support Python 3.10+, it probably makes sense to go all the way to PyMC 5.9.

I've made some strides in this PR; as of 6b08f3a, I've made enough updates to accommodate API changes so that cohort-mode inference for both GermlineCNVCaller and DetermineGermlineContigPloidy runs successfully under Python 3.10 and PyMC 5.9.0---although note that 5.9.1 has been released in the interim!

However, our work has just begun. Results now produced in the numerical tests mentioned above are quite far off from the original expected results. It remains to be seen whether this is due to the randomness of inference, some slight changes to the model prior that were necessitated by the API changes, or some bugs introduced in other code updates. (Also note that I believe Andrey's PR in item 4 already broke these tests, although the numerical differences were much smaller and more reasonable---but perhaps he can confirm. Also noting here that I think determinism is still currently broken as of this commit---there have been some changes to PyTensor/PyMC seeding so that our previous theano/PyMC3 hack no longer applies.)

So I think the next step is to just go to scientific-level testing and see what the fallout is. Ideally, we'd still get good performance (or perhaps better! at least on the runtime side, hopefully...) and we can just update the numerical tests. But if performance tanks, then we might need to see whether I've introduced any bugs. @mwalker174 @asmirnov239 perhaps you can comment on what might be the appropriate test suite here----1kGP?

I'll also highlight again that this PR will remove TensorFlow and might require that the corresponding CNN implementations be supported by an alternate strategy, at least until the PyTorch implementation goes in.

@samuelklee samuelklee marked this pull request as draft October 23, 2023 18:31
@samuelklee samuelklee changed the title Sl python version update Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. Oct 23, 2023
@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee samuelklee force-pushed the sl_python_version_update branch 2 times, most recently from 6534430 to 558ccaf Compare November 9, 2023 20:48
@mwalker174
Copy link
Contributor

Thanks for your work on this @samuelklee! Testing on both wes and wgs would be ideal. For wgs we can use the gatk-sv reference panel, which is our standard (I can help with this once a docker is ready). For wes, 1kgp would work although it's definitely showing its age. Are the integration test differences large?

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee
Copy link
Contributor Author
samuelklee commented Dec 8, 2023

OK, I think things are looking good! Updated a bunch of things, including the following:

  • conda 23.1.0 -> 23.10.0; in the base Docker, also disabled conda auto-updating and set the solver to the much faster libmamba
  • python 3.6.10 -> 3.10.13
  • pymc 3.1 -> 5.10.0
  • theano 1.0.4 -> pytensor 2.18.1
  • added pytorch 2.1.0
  • removed tensorflow 1.15.0 and other CNN dependencies
  • added libblas-dev to the base Docker; I think MKL versions of all packages are being used, but we should verify!

These and other packages (numpy, scipy, etc.) are all pretty much at the latest available versions for python 3.10. I've also bumped version numbers for our internal python packages.

I also made all of the changes to the gCNV code to accommodate any changes introduced by PyMC/Pytensor. For the most part, these were minor renamings of theano/tt/etc. to pytensor/pt/etc.

However, there were some more nontrivial changes, including to 1) model priors (since some of the distributions previously used were removed or are now supported differently), 2) the implementation of posterior sampling, 3) some shape/dimshuffle operations, and other things along these lines.

Using a single test shard of 20 1kGP WES samples x 1000 intervals, I have verified determinism/reproducibility for DetermineGermlineContigPloidy COHORT/CASE modes, GermlineCNVCaller COHORT/CASE modes, and PostprocessGermlineCNVCalls. Numerical results are also relatively close to those from 4.4.0.0 for all identifiable call and model quantities (albeit far outside any reasonable exact-match thresholds, most likely due to differences in RNG, sampling, and the aforementioned priors).

Some remaining TODOs:

  • Rebuild and push the base Docker. EDIT: Mostly covered by Update the GATK base image to a newer LTS ubuntu release #8610, but this also includes an addition of libblas-dev.
  • Update expected results for integration tests, perhaps add any that might be missing. EDIT: These were generated on WSL Ubuntu 20.04.2, we'll see if things pass on 22.04. Note that changing the ARD priors does change the names of the expected files, since the transform is appended to the corresponding variable name. DetermineGermlineContigPloidy and PostprocessGermlineCNVCalls are missing exact-match tests and should probably have some, but I'll leave that to someone else.
  • Update other python integration tests.
  • Clean up some of the changes to the priors.
  • Clean up some TODO comments that I left to track code changes that might result in changed numerics. I'll try to go through and convert these to PR comments in an initial review pass.
  • Test over multiple shards on WGS and WES. Probably some scientific tests on ~100 samples in both cohort and case mode would do the trick. We should also double check runtime/memory performance (I noted ~1.5x speedups, but didn't measure carefully; I also want to make sure the changes to posterior sampling didn't introduce any memory issues). @mwalker174 will ping you when a Docker is ready! Might be good to loop in Isaac and/or Jack as well.
  • Perhaps add back the fix for 2-interval shards in Number of intervals edge case gCNV fix #8180, which I removed since the required functionality wasn't immediately available in Pytensor. Not sure if this actually broke things though---need to check. (However, I don't actually think this is a very important use case to support...)
  • Delete/deprecate/etc. CNN tools/tests as appropriate. Note that this has to be done concurrently, since we remove Tensorflow. @droazen perhaps I can take a first stab at this in a subsequent commit to this PR once more of the gCNV dust settles and/or has undergone a preliminary review? EDIT: Disabled integration/WDL tests. We should add some deprecation messages to the tools---we can note that they should still work in previous environments but will be untested. I might set up a separate PR for deletion, to be done at the appropriate time (but I call dibs on this, can't have @davidbenjamin overtaking my all-time record for number of lines deleted 😛).

@gatk-bot
Copy link
gatk-bot commented Dec 8, 2023

Github actions tests reported job failures from actions build 7143821808
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7143821808.3 logs

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee
Copy link
Contributor Author
samuelklee commented Apr 3, 2024

Here's one, if anyone would like to try it out! us.gcr.io/broad-dsde-methods/broad-gatk-snapshots/gatk@sha256:c405d1d7fc42a008bb11a2f60f95bb65f1c487cf4986850a7d5d7d4bdfc37965

@samuelklee
Copy link
Contributor Author
samuelklee commented Apr 26, 2024

Finishing up a tieout using old MalariaGEN Pf7 CNV genotyping runs, and I think things look good from this perspective, at least.

This tieout uses a subset of the Pf7 samples containing 300 cohort and 1683 case samples (which were indeed treated as a cohort-case cluster in the original Pf7 CNV genotyping analysis). ~4k genomic bins are covered. We compare this branch against 4.5.0.0, as well as this branch against itself (checking for reproducibility).

Costs for this branch ($10.92) and 4.5.0.0 ($10.96) were quite comparable. Note that a small portion of these costs derives from Pf7-specific genotyping steps, which I did not bother to remove from the workflow.

Runtime for the ploidy modeling and postprocessing steps were comparable. Interestingly, runtime for the gCNV was ~20-25% longer with this branch than with 4.5.0.0, but memory usage fell by a factor of ~3 (~6GB to ~2GB)! I am not sure if we could recoup the runtime with some more tweaking of the environment (perhaps double checking that optimized BLAS/MKL/etc. packages are properly used, changing environment variables/flags, etc.), but I think the decrease in memory usage is quite nice.

Concordance was checked for the following quantities (4.5.0.0 is on the x-axis and this branch is on the y-axis in all plots below):

  1. Variational posterior means (mu_*) and standard deviations (std_*) for all analogous variables in the ploidy and gCNV models. There were some slight changes to the gCNV model in this branch (e.g., the functional form of the ARD prior was changed), which means some variables are no longer directly comparable. Furthermore, some variables (such as the bias factors W) are degenerate and cannot be immediately compared. Otherwise, there is good concordance between the remaining variables, e.g.:

image
image
image
image
image
image

  1. ... Will update more later!

@droazen
Copy link
Collaborator
droazen commented Apr 26, 2024

Thanks for sharing these results @samuelklee ! That memory vs runtime tradeoff does not seem at all bad to me.

What else needs to be done before we can feel confident about merging this (apart from deprecating the old CNN tools)? Is the posterior sampling issue resolved?

@samuelklee
Copy link
Contributor Author

Yes, that's been resolved! Phew.

If anyone wants to start taking an initial review pass, that would be appreciated! There are probably a few very minor cleanups I could do (e.g., removing stray commented code and TODOs), but I don't think that should hold anything up.

@samuelklee
Copy link
Contributor Author

Also, I think @mwalker174 should have the final say on merging, as he's the current gCNV owner. Would be great if he could run any GATK-SV-specific tests and report back on runtime/accuracy!

@droazen
Copy link
Collaborator
droazen commented May 20, 2024

@samuelklee It's great to see this inching closer to finally getting merged! I think we should plan on merging it shortly after the upcoming 4.6 release, assuming the remaining testing doesn't reveal any issues. Since 4.6 will contain a number of important bug fixes and user-requested changes, I want to get it out before making a big breaking change to the Python environment. After 4.6 is out, we can meet briefly to discuss how best to handle the deprecation of the legacy CNN tools, and hopefully get this in at last.

Copy link
Contributor
@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for documenting all the API changes. This looks reasonable to me (along with the malaria plots). There are a couple more TODOs that I think are to-done. @droazen is going to help us come up with a CNN deprecation/transition plan so we don't have to turn off tests to merge

@@ -295,7 +295,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_VCF_SITE_LEVEL_FILTERING_WDL' ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may already be a WDL for the NVIDIA CNN in the Terra workspace, but this will be part of the CNNScoreVariants Deprecation Plan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen leaving this as a reminder that we'll need to deal with the CNN tests/code/etc. disabled here in a subsequent PR.

@cmnbroad
Copy link
Collaborator

@droazen When we finally remove CNNScoreVariants, don't forget that you can put an entry in the DeprecatedToolsRegistry with a helpful message about the replacement.

@mwalker174
Copy link
Contributor

Very excited that this is almost ready! I'm presently going to run some tests on matched BGE/exome samples to assess runtime performance and results with gatk-sv.

@samuelklee
Copy link
Contributor Author

@mwalker174 reported last week that exomes look good, with perhaps more details on cost/accuracy forthcoming!

I'll be OOO next week, but will address comments after I get back---already looked through them and I think it should be quick (famous last words).

@mwalker174
Copy link
Contributor
mwalker174 commented Jun 24, 2024

We've run some basic WGS tests in gatk-sv, and the results look good. Using our 161-sample 30x NYGC 1KGP test panel, here is a summary of per-sample raw calls from gCNV+cnMOPs. That is, records look like:

chr1	103639835	103663835	1kgp_161_bwa_scramble_DEL_9527	HG01552	DEL	cnmops,gcnv
chr1	103643835	103645835	1kgp_161_bwa_scramble_DEL_9531	HG01494	DEL	gcnv
chr1	103643835	103652200	1kgp_161_bwa_scramble_DEL_9534	HG02236	DEL	cnmops,gcnv
chr1	103645600	103724500	1kgp_161_bwa_scramble_DEL_9538	HG03370	DEL	cnmops,gcnv
chr1	103647835	103649835	1kgp_161_bwa_scramble_DEL_9541	HG01495	DEL	gcnv
chr1	103647835	103681000	1kgp_161_bwa_scramble_DEL_9542	HG01494	DEL	cnmops,gcnv
…

Master:
DEL: 641700
DUP: 699063

Branch:
DEL: 640669
Master intersection: 635178 (99.14% sensitivity)
DUP: 691469
Master intersection: 677687 (98.00%)

Note that this is a simple bedtools intersection requiring 90% reciprocal overlap, but not matching samples.

Subsetting just to NA19420:

Master:
DEL: 4974
DUP: 4254

Branch:
DEL: 4958
Master intersection: 4797 (96.75%)
DUP: 4210
Master intersection: 3802 (90.30%)

Further subsetting to NA19240 variants over 5kbp, which is the default gatk-sv threshold for depth-only calls:

Master:
DEL: 1133
DUP: 916

Branch:
DEL: 1132
Master intersection: 1060 (93.64%)
DUP: 893
Master intersection: 757 (84.77%)

While it does appear there's an appreciable difference here, if we subset to NA19240 variants that survive gatk-sv filtering to the "CleanVcf" stage (in the current master), the differences are much less:

Master:
DEL: 225
DUP: 61

Branch:
DEL: 226
Master intersection: 223 (98.67%)
DUP: 58
Master intersection: 58 (95.08%)

I think the differences in raw calls is probably "in the noise" for WGS, given the high concordance in this test sample after gatk-sv filtering and genotyping are applied.

Edit: Thanks to @kirtanav98 for running these tests.

@mwalker174
Copy link
Contributor

Also we've noticed a slight drop in peak memory usage from 8.7 GB to 7.2 (17% decrease) in both cohort and case modes for GermlineCNVCaller which is a welcome improvement!

@JMF47
Copy link
JMF47 commented Jun 24, 2024

Here's the performance benchmarking
sen_del sen_dup ppv_del ppv_dup sen_del_hist sen_dup_hist ppv_del_hist ppv_dup_hist
1 0.2000000 0.4121864 0.9387755 0.8521127 345 279 98 142
2 0.4598540 0.6086957 0.9647059 0.8602941 137 184 85 136
3 0.6352941 0.7181208 0.9565217 0.8650794 85 149 69 126
4 0.7796610 0.7744361 0.9636364 0.8606557 59 133 55 122
5 0.8913043 0.8715596 0.9583333 0.8495575 46 109 48 113
6 0.9024390 0.9019608 0.9750000 0.8545455 41 102 40 110
7 0.9428571 0.9263158 0.9722222 0.8666667 35 95 36 105
8 0.9310345 0.9534884 0.9666667 0.8736842 29 86 30 95
9 0.9285714 0.9518072 1.0000000 0.9213483 28 83 26 89
10 0.9565217 0.9487179 1.0000000 0.9259259 23 78 23 81

NEW gCNV docker

  sen_del   sen_dup   ppv_del   ppv_dup sen_del_hist sen_dup_hist ppv_del_hist ppv_dup_hist

1 0.2034884 0.4107143 0.9387755 0.8581560 344 280 98 141
2 0.4705882 0.6086957 0.9647059 0.8676471 136 184 85 136
3 0.6547619 0.7248322 0.9565217 0.8730159 84 149 69 126
4 0.7931034 0.7819549 0.9636364 0.8677686 58 133 55 121
5 0.9111111 0.8807339 0.9583333 0.8558559 45 109 48 111
6 0.9250000 0.9019608 0.9750000 0.8584906 40 102 40 106
7 0.9705882 0.9263158 0.9722222 0.8712871 34 95 36 101
8 0.9642857 0.9534884 0.9655172 0.8791209 28 86 29 91
9 0.9629630 0.9518072 1.0000000 0.9186047 27 83 25 86
10 1.0000000 0.9487179 1.0000000 0.9113924 22 78 22 79

@samuelklee
Copy link
Contributor Author

Great, thanks @mwalker174 and @JMF47! I will try to get this tidied up before heading out for 7/4.

@matthdsm
Copy link
matthdsm commented Jul 1, 2024

Hi all,

Any chance this will make it into a release soon? I was hoping this got merged with the recent docker image overhaul.

Thanks
Matthias

@samuelklee
Copy link
Contributor Author

@matthdsm this was intentionally left out of the recent 4.6 release, but should go into the next minor release. Would of course appreciate any testing/feedback from the community before then!

@samuelklee samuelklee marked this pull request as ready for review July 2, 2024 04:11
@samuelklee
Copy link
Contributor Author
samuelklee commented Jul 2, 2024

Released gatkbase-3.3.0 to broadinstitute/gatk:gatkbase-3.3.0, but getting Permission "artifactregistry.repositories.uploadArtifacts" denied on resource "projects/broad-gatk/locations/us/repositories/us.gcr.io" when trying to push to us.gcr.io/broad-gatk/gatk:gatkbase-3.3.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants