Doc(Transfer learning and fine-tuning) is quite different from real executive result. #66696

Open
lida2003 opened this issue Apr 30, 2024 · 6 comments
Labels: comp:keras (Keras related issues), TF 2.15 (For issues related to 2.15.x), type:performance (Performance Issue)

Comments

@lida2003
lida2003 commented Apr 30, 2024

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.15.0+nv24.03 GPU version, check this link: https://forums.developer.nvidia.com/t/multiple-executive-warnings-after-switching-tensorflow-from-2-16-1-cpu-to-v60dp-tensorflow-2-15-0-nv24-03-gpu-version/291208

Custom code

No

OS platform and distribution

Jetson Orin Nano ubuntu 22.04 Jammy

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

CUDA12.2.140/cuDNN8.9.4.25

GPU model and memory

sm90 8GB

Current behavior?

The execution result (both the trend of the curves and the absolute values) is quite different from the documentation.

[image]

[image]

[image]

Standalone code to reproduce the issue

100% reproducible

Relevant log output

Also tried Colab, which is consistent with documentation:

[image]

@tilakrayal
Contributor

@lida2003,
Could you please confirm whether the difference in accuracy and loss occurs on both GPU and CPU, and with both TensorFlow v2.15 and v2.16? I will also try to debug this issue further and provide a resolution. Thank you!
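A minimal sketch for recording the details asked about here (TensorFlow version and whether a GPU is visible) on each machine being compared, for example the Jetson and Colab. The helper name `describe_environment` is hypothetical and is not part of the tutorial; it relies only on `tf.__version__` and `tf.config.list_physical_devices`, and it reports `None` if TensorFlow cannot be imported.

```python
import platform


def describe_environment() -> dict:
    """Collect basic facts about the runtime for side-by-side comparison.

    Hypothetical helper for illustration; not part of the tutorial.
    """
    info = {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "tensorflow": None,  # stays None when TensorFlow is not installed
        "gpus": [],          # names of GPUs visible to TensorFlow, if any
    }
    try:
        import tensorflow as tf
    except ImportError:
        return info

    info["tensorflow"] = tf.__version__
    info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running the script on each setup and posting the output next to the resulting curves makes it easier to tell whether a difference follows the TensorFlow build or the device.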

@tilakrayal tilakrayal added TF 2.15 For issues related to 2.15.x type:performance Performance Issue labels May 2, 2024
@lida2003
Author
lida2003 commented May 2, 2024

@lida2003,
Could you please confirm whether the difference in Accuracy & Loss has happened in GPU/CPU with both tensorflow v2.15, v2.16?

  • NVIDIA 2.15.0+nv24.03 GPU version: fails 100% of the time
  • Colab: consistent with the TensorFlow documentation (but I don't know which version it runs)

PS: The Jetson Orin Nano has 8GB of memory, shared between the CPU and the GPU.

v2.16

At the very beginning, I installed the 2.16 CPU version (pip binary installation) on the Jetson Orin Nano. Running Keras-Fine-Tuning-Pre-Trained-Models produced no resource warnings, possibly because the CPU version can fall back on the swap area.

The result also differs from the documentation; see the link below:

When I switched to NVIDIA 2.15.0+nv24.03 GPU version: Tensorflow v2.16.1 GPU version local build on Jetson Orin Nano failed

Also I will also try to debug more on this issue and provide the resolution. Thank you!

There are also a couple of other things that might be a clue for you. Here is a link on the NVIDIA forum:

Please take a look at those warnings and the memory issue. I think we need a sanity check before software is packaged for release (put on the repo).

EDIT: Keeping this in sync with NVIDIA's feedback.

@tilakrayal tilakrayal added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 16, 2024
@tilakrayal tilakrayal removed the type:support Support issues label Jun 11, 2024
@tilakrayal
Contributor

@lida2003,
As you mentioned, I executed the transfer learning code on Colab with TensorFlow v2.16 and v2.15 and observed that the loss and accuracy are the same as in the documentation. I suspect there might be an issue with the GPU. I will re-check and dig deeper into this.
[image]

Thank you!

@tilakrayal tilakrayal added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jun 11, 2024
@lida2003
Author

@tilakrayal

Please check Inconsistency of NVIDIA 2.15.0+nv24.03 v.s. Colab v.s. Tensorflow Documentation

  • JetPack 6.0 DP + tensorflow 2.15.0+nv24.04 ==> BAD
  • JetPack 6.0 + tensorflow 2.15.0+nv24.05 ==> OK

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 12, 2024
@tilakrayal tilakrayal added the comp:keras Keras related issues label Jun 17, 2024
@tilakrayal
Contributor
tilakrayal commented Jun 17, 2024

@lida2003,
Could you please let me know whether you are facing the same issue on TensorFlow v2.16, where Keras 3.0 is included by default? Keras 3 is a full rewrite of Keras that enables you to run your Keras workflows on top of either JAX, TensorFlow, or PyTorch, and that unlocks brand new large-scale model training and deployment capabilities.

Also, please try commenting out this particular line and executing the code again:
predictions = tf.nn.sigmoid(predictions)

Thank you!
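A small sketch of what commenting out the sigmoid line changes, assuming the prediction step then thresholds the values at 0.5 to obtain class labels (that thresholding step and the sample logits below are assumptions for illustration, not taken from this thread). It uses plain Python rather than TensorFlow so it runs anywhere; the helper names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, the scalar equivalent of tf.nn.sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def labels_with_sigmoid(logits, threshold=0.5):
    # Original flow: convert logits to probabilities, then threshold.
    return [0 if sigmoid(z) < threshold else 1 for z in logits]


def labels_without_sigmoid(logits, threshold=0.5):
    # Flow with the sigmoid line commented out: raw logits are thresholded.
    return [0 if z < threshold else 1 for z in logits]


# Made-up logits for illustration only.
logits = [-2.0, -0.1, 0.2, 0.4, 3.0]
with_sig = labels_with_sigmoid(logits)
without_sig = labels_without_sigmoid(logits)

print("with sigmoid:   ", with_sig)
print("without sigmoid:", without_sig)
```

Under that assumption, logits between 0 and 0.5 map to class 1 with the sigmoid and to class 0 without it, so only the displayed predictions for those borderline examples would change.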

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label Jun 17, 2024
@lida2003
Author

Could you please let me know if you are facing the same issue on TensorFlow v2.16 where keras3.0 is by default included.

Sorry, I ran into difficulties building from source (I asked about some of the build steps, but could not get a working build), so I used the NVIDIA binary.

[image]

[image]

But I think it might be an issue with the NVIDIA V60DP version; check this for details: Inconsistency of NVIDIA 2.15.0+nv24.03 v.s. Colab v.s. Tensorflow Documentation
