Doc(Transfer learning and fine-tuning) is quite different from real executive result. #66696

lida2003 · 2024-04-30T11:08:24Z

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.15.0+nv24.03 GPU version, check this link: https://forums.developer.nvidia.com/t/multiple-executive-warnings-after-switching-tensorflow-from-2-16-1-cpu-to-v60dp-tensorflow-2-15-0-nv24-03-gpu-version/291208

Custom code

No

OS platform and distribution

Jetson Orin Nano ubuntu 22.04 Jammy

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

CUDA12.2.140/cuDNN8.9.4.25

GPU model and memory

sm90 8GB

Current behavior?

The execution result (trend of the curves and absolute values) is quite different from the document.

My result: transfer_learning.zip

Document (https://github.com/tensorflow/docs/blob/master/site/en/tutorials/images/transfer_learning.ipynb) says:

Standalone code to reproduce the issue

100% reproducible

Relevant log output

Also tried Colab, which is consistent with documentation:

tilakrayal · 2024-05-02T12:43:20Z

@lida2003,
Could you please confirm whether the difference in Accuracy & Loss has happened in GPU/CPU with both tensorflow v2.15, v2.16? I will also try to debug more on this issue and provide the resolution. Thank you!

lida2003 · 2024-05-02T22:18:42Z

@lida2003,
Could you please confirm whether the difference in Accuracy & Loss has happened in GPU/CPU with both tensorflow v2.15, v2.16?

NVIDIA 2.15.0+nv24.03 GPU version failed 100%
Colab is consistent with the TensorFlow documentation (but I don't know which version it runs)
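Since the Colab version is unknown here, a minimal sketch of how the installed versions could be reported in each environment (Colab, the Jetson, or elsewhere) may help the comparison. It assumes the packages are registered under the distribution names `tensorflow` and `keras`, which may not hold for every vendor build; `installed_version` is a hypothetical helper, not something from the tutorial.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Distribution names are an assumption; adjust them if the build uses others.
for name in ("tensorflow", "keras"):
    print(f"{name}: {installed_version(name)}")
```

Running the same cell in both places gives a like-for-like record of which versions produced which curves.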

PS: Jetson Orin Nano 8GB, CPU&GPU shared memory

v2.16

Well, at the very beginning, I installed the 2.16 CPU version (pip binary installation) on the Jetson Orin Nano. Running Keras-Fine-Tuning-Pre-Trained-Models produced no resource warnings. This might be due to the way the CPU uses the swap area.

The result is also different from the document, see the link below:

Multiple executive warnings after switching tensorflow from 2.16.1 CPU to v60dp tensorflow==2.15.0+nv24.03 GPU version

When I switched to NVIDIA 2.15.0+nv24.03 GPU version: Tensorflow v2.16.1 GPU version local build on Jetson Orin Nano failed

We have confirmed the resource issue with NVIDIA.

I will also try to debug more on this issue and provide the resolution. Thank you!

There are also a couple of other things that might be a clue for you. Here is a link on the NVIDIA forum:

Please take a look at those warnings and the memory issue. I think we need a sanity check before software is packaged for release (put on the repo).

EDIT: Keep sync with NVIDIA feedback.

tilakrayal · 2024-06-11T13:13:10Z

@lida2003,
As you mentioned, I tried to execute the transfer learning code on Colab with TensorFlow v2.16 and v2.15 and observed that the loss and accuracy are the same as in the document. I suspect there might be an issue with the GPU. I will re-check and dig deeper into this.

Thank you!

lida2003 · 2024-06-12T22:45:56Z

@tilakrayal

Please check Inconsistency of NVIDIA 2.15.0+nv24.03 v.s. Colab v.s. Tensorflow Documentation

JetPack 6.0 DP and tf2.15.0+nv24.04 ==> BAD
JetPack v6.0 + tensorflow2.15.0+nv24.05 ==>OK

tilakrayal · 2024-06-17T10:29:08Z

@lida2003,
Could you please let me know if you are facing the same issue on TensorFlow v2.16 where keras3.0 is by default included. Keras 3 is a full rewrite of Keras that enables you to run your Keras workflows on top of either JAX, TensorFlow, or PyTorch, and that unlocks brand new large-scale model training and deployment capabilities.

Also please try to comment out this particular line and execute the code.
predictions = tf.nn.sigmoid(predictions)
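
The thread does not say why this line is suspected. One scenario in which it would matter, offered here only as an assumption, is a model whose output is already a probability, so the sigmoid ends up applied twice. A minimal sketch in plain Python (the logit values are made up for illustration):

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits, not taken from the tutorial or the reporter's run.
logits = [-4.0, -1.0, 0.0, 1.0, 4.0]

once = [sigmoid(z) for z in logits]   # probabilities spread over (0, 1)
twice = [sigmoid(p) for p in once]    # squeezed into a narrow band above 0.5

print("once: ", [round(p, 3) for p in once])
print("twice:", [round(p, 3) for p in twice])

# With a 0.5 threshold, every doubly-squashed value lands in the same class.
labels_twice = [0 if p < 0.5 else 1 for p in twice]
print("labels after double sigmoid:", labels_twice)
```

If that scenario applied, commenting out the line would restore the spread of predicted probabilities.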

Thank you!

lida2003 · 2024-06-17T11:03:14Z

Could you please let me know if you are facing the same issue on TensorFlow v2.16 where keras3.0 is by default included.

Sorry, I ran into difficulties building from source (I asked about some of the build steps, but had no luck with the build), so I have used the NVIDIA binary.

2.15 is their latest build for v60

2.15 is their latest build for v60dp

But I think it might be some issue with NVIDIA V60DP version, check this for details: Inconsistency of NVIDIA 2.15.0+nv24.03 v.s. Colab v.s. Tensorflow Documentation

google-ml-butler bot added the type:support Support issues label Apr 30, 2024

google-ml-butler bot assigned tilakrayal Apr 30, 2024

tilakrayal added TF 2.15 For issues related to 2.15.x type:performance Performance Issue labels May 2, 2024

tilakrayal added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 16, 2024

tilakrayal removed the type:support Support issues label Jun 11, 2024

tilakrayal added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jun 11, 2024

google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 12, 2024

tilakrayal added the comp:keras Keras related issues label Jun 17, 2024

tilakrayal added the stat:awaiting response Status - Awaiting response from author label Jun 17, 2024

google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 17, 2024

mihaimaruseac mentioned this issue Jun 17, 2024

Transfer learning and fine-tuning doc seems to have unexpected results #69480

Closed
