
TF 2.17.0 RC0 Fails to work with GPUs (and TF 2.16 too) #63362

Open
JuanVargas opened this issue Mar 10, 2024 · 141 comments · Fixed by #70293
Assignees
Labels
2.17 (Issues related to 2.17 release) · awaiting review (Pull request awaiting review) · comp:gpu (GPU related issues) · stat:awaiting tensorflower (Status - Awaiting response from tensorflower) · TF 2.16 · type:bug (Bug)

Comments

@JuanVargas

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

TF 2.16.1

Custom code

No

OS platform and distribution

Linux Ubuntu 22.04.4 LTS

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

12.4

GPU model and memory

No response

Current behavior?

I created a Python venv in which I installed TF 2.16.1 following your instructions: pip install tensorflow.
When I run python, import tensorflow as tf, and issue tf.config.list_physical_devices('GPU'),
I get an empty list [].

I created another Python venv and installed TF 2.16.1, only this time with the instructions:

python3 -m pip install tensorflow[and-cuda]

When I run that version, import tensorflow as tf, and issue

tf.config.list_physical_devices('GPU')

I also get an empty list.

BTW, I have no problems running TF 2.15.1 with GPUs on my box. Julia also works just fine with GPUs, and so does PyTorch.

Standalone code to reproduce the issue

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-03-09 19:15:45.018171: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-09 19:15:50.412646: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> tf.__version__
'2.16.1'

tf.config.list_physical_devices('GPU') 
2024-03-09 19:16:28.923792: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-09 19:16:29.078379: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
>>>

Relevant log output

No response

@sh-shahrokhi commented Mar 10, 2024

It does not work with Python 3.12.2 either; same error. I installed TensorFlow with pip install tensorflow[and-cuda].

@damadorPL

The same error on bare Ubuntu and on WSL2; 2.15 works without any problems with Python 3.11.

@DiegoMont

I have the same problem with Ubuntu 22.04.4 with the following environment:

  • tensorflow==2.16.1
  • Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
  • cuDNN 8.6.0.163
  • gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@AlpriElse

I'm not sure if this is the root cause, but I resolved my own issue which also surfaced as a "Cannot dlopen some GPU libraries." error when trying to run python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

To resolve my issue, I followed the tested build versions here:
https://www.tensorflow.org/install/source#gpu

and I needed to update my existing installations from cuDNN 9 -> 8.9 and CUDA 12.4->12.3

When you're on an NVIDIA download page like this one for CUDA Toolkit, don't just download the latest version. See previous versions by hitting "Archive of Previous CUDA Releases"

@JuanVargas can you try moving your existing CUDA installation to a tested build configuration for TF 2.16 by uninstalling it and downgrading to CUDA 12.3?

I followed this post to uninstall my existing cuda installation:
https://askubuntu.com/questions/530043/removing-nvidia-cuda-toolkit-and-installing-new-one

@DiegoMont can you try upgrading your cuDNN to 8.9 and CUDA to 12.3?
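A quick way to compare an existing setup against the tested-build table is to ask TensorFlow itself which CUDA/cuDNN versions the wheel was compiled against. A small diagnostic sketch (the `get_build_info` keys shown are the ones recent releases expose; on a CPU-only wheel they may simply be absent):

```python
# Report the CUDA/cuDNN versions this TensorFlow binary was built against,
# for comparison with https://www.tensorflow.org/install/source#gpu
try:
    import tensorflow as tf
    build = tf.sysconfig.get_build_info()
    report = {
        "tensorflow": tf.__version__,
        "built_for_cuda": build.get("cuda_version"),
        "built_for_cudnn": build.get("cudnn_version"),
    }
except ImportError:
    report = {"tensorflow": "not installed in this environment"}
print(report)
```

If the reported versions differ from what `nvcc --version` and your cuDNN packages show, that mismatch is the first thing to fix.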

@Gwyki commented Mar 11, 2024

I am having the same issue. Brand new Ubuntu 22.04 WSL2 image. A blank conda environment with either Python 3.12.* or 3.11.* fails to correctly set up TensorFlow for GPU use when following the recommended:
pip install tensorflow[and-cuda]

Trying to list the physical devices results in:

2024-03-11 02:00:00.294704: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 02:00:00.709325: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-11 02:00:01.180225: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 02:00:01.180445: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
cuDNN 8.9.*
Cuda 12.3
Tensorflow 2.16.1
TensorRT 8.6.1

Is this a new issue caused by the fact that no system CUDA appears to need separate installation in WSL2 anymore? I certainly didn't install one manually, yet nvidia-smi happily reports CUDA version 12.3. It probably comes down to some env paths not being set correctly, but playing around with $CUDA_PATH and guessing the location within the conda environment has not resolved anything. TensorRT doesn't seem to be picked up either, yet it is definitely installed in the conda environment. PyTorch GPU visibility works as expected.
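The guessing at library locations can be scripted: `pip install tensorflow[and-cuda]` drops the CUDA libraries into `site-packages/nvidia/<pkg>/lib` (a layout observed in the current wheels, not a stable contract), so a sketch like this collects the directories an `LD_LIBRARY_PATH` would need. The demo runs against a fake site-packages tree so it works anywhere:

```python
import os
import tempfile

def nvidia_lib_dirs(site_packages):
    """Collect every nvidia/<pkg>/lib directory under a site-packages tree.

    Mirrors the layout the nvidia-*-cu12 wheels currently install into;
    treat that layout as an assumption, not a stable interface.
    """
    root = os.path.join(site_packages, "nvidia")
    if not os.path.isdir(root):
        return []
    return sorted(
        os.path.join(root, pkg, "lib")
        for pkg in os.listdir(root)
        if os.path.isdir(os.path.join(root, pkg, "lib"))
    )

# Demo on a fake site-packages tree (stands in for the real environment):
with tempfile.TemporaryDirectory() as fake_site:
    for pkg in ("cudnn", "cublas", "cuda_runtime"):
        os.makedirs(os.path.join(fake_site, "nvidia", pkg, "lib"))
    lib_dirs = nvidia_lib_dirs(fake_site)
    ld_library_path = ":".join(lib_dirs)
    print(len(lib_dirs), "lib directories collected")
```

In a real environment the starting point would be `sysconfig.get_paths()["purelib"]` rather than a temporary directory, and `ld_library_path` would be exported before launching Python.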

@SuryanarayanaY SuryanarayanaY added comp:gpu GPU related issues TF 2.16 labels Mar 11, 2024
@SuryanarayanaY (Collaborator)

Hi @JuanVargas ,

For the GPU package you need to ensure the CUDA driver is installed, which can be verified with the nvidia-smi command. Then install the TF CUDA package with pip install tensorflow[and-cuda], which automatically installs the required CUDA/cuDNN libraries.

I have checked in Colab and was able to detect the GPU. Please refer to the attached gist.

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label Mar 11, 2024
@damadorPL commented Mar 11, 2024

Double quotes in pip install because of zsh:

pip install "tensorflow[and-cuda]==2.16.1"                                                                       
 

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: tensorflow==2.16.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (2.16.1)
Requirement already satisfied: absl-py>=1.0.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.1.0)
Requirement already satisfied: astunparse>=1.6.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.6.3)
Requirement already satisfied: flatbuffers>=23.5.26 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (24.3.7)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.5.4)
Requirement already satisfied: google-pasta>=0.1.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.2.0)
Requirement already satisfied: h5py>=3.10.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.10.0)
Requirement already satisfied: libclang>=13.0.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (16.0.6)
Requirement already satisfied: ml-dtypes~=0.3.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.3.2)
Requirement already satisfied: opt-einsum>=2.3.2 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.3.0)
Requirement already satisfied: packaging in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (24.0)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (4.25.3)
Requirement already satisfied: requests<3,>=2.21.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.31.0)
Requirement already satisfied: setuptools in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (69.1.1)
Requirement already satisfied: six>=1.12.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.16.0)
Requirement already satisfied: termcolor>=1.1.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.4.0)
Requirement already satisfied: typing-extensions>=3.6.6 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (4.10.0)
Requirement already satisfied: wrapt>=1.11.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.16.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.62.1)
Requirement already satisfied: tensorboard<2.17,>=2.16 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.16.2)
Requirement already satisfied: keras>=3.0.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.0.5)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.36.0)
Requirement already satisfied: numpy<2.0.0,>=1.23.5 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.26.4)
Requirement already satisfied: nvidia-cublas-cu12==12.3.4.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.4.1)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.3.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.101)
Requirement already satisfied: nvidia-cuda-nvcc-cu12==12.3.107 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.107)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.3.107 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.107)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.3.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.101)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.7.29 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (8.9.7.29)
Requirement already satisfied: nvidia-cufft-cu12==11.0.12.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (11.0.12.1)
Requirement already satisfied: nvidia-curand-cu12==10.3.4.107 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (10.3.4.107)
Requirement already satisfied: nvidia-cusolver-cu12==11.5.4.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (11.5.4.101)
Requirement already satisfied: nvidia-cusparse-cu12==12.2.0.103 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.2.0.103)
Requirement already satisfied: nvidia-nccl-cu12==2.19.3 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (2.19.3)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.3.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.101)
Requirement already satisfied: wheel<1.0,>=0.23.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from astunparse>=1.6.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.42.0)
Requirement already satisfied: rich in ./miniconda3/envs/tf/lib/python3.11/site-packages (from keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (13.7.1)
Requirement already satisfied: namex in ./miniconda3/envs/tf/lib/python3.11/site-packages (from keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.0.7)
Requirement already satisfied: dm-tree in ./miniconda3/envs/tf/lib/python3.11/site-packages (from keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.1.8)
Requirement already satisfied: charset-normalizer<4,>=2 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2024.2.2)
Requirement already satisfied: markdown>=2.6.8 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.5.2)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.0.1)
Requirement already satisfied: MarkupSafe>=2.1.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from werkzeug>=1.0.1->tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.1.5)
Requirement already satisfied: markdown-it-py>=2.2.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from rich->keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from rich->keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.17.2)
Requirement already satisfied: mdurl~=0.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich->keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.1.2)
nvidia-smi             
                                                                                           
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.60.01              Driver Version: 551.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   39C    P5             10W /  285W |    4334MiB /  12282MiB |     13%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        41      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

python3

Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-03-11 09:36:29.601060: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-11 09:36:29.921637: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 09:36:30.793353: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> print(tf.config.list_physical_devices('GPU'))
2024-03-11 09:36:33.878560: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 09:36:33.980099: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
>>>

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 11, 2024
@damadorPL
nvcc -V 
                                                                                                          
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

@damadorPL commented Mar 11, 2024

Got it to work :) First, go to
https://developer.nvidia.com/rdp/cudnn-archive

then download the Local Installer for Ubuntu22.04 x86_64 (Deb),

unpack and install libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb:

sudo dpkg -i libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb   
                                                           
Selecting previously unselected package libcudnn8.
(Reading database ... 47318 files and directories currently installed.)
Preparing to unpack libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb ...
Unpacking libcudnn8 (8.9.7.29-1+cuda12.2) ...
Setting up libcudnn8 (8.9.7.29-1+cuda12.2) ...

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  

                             
2024-03-11 10:27:47.879686: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-11 10:27:47.909157: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 10:27:48.316717: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-11 10:27:48.664469: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 10:27:48.688059: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 10:27:48.688111: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

@JuanVargas (Author) commented Mar 11, 2024 via email

@sh-shahrokhi commented Mar 11, 2024 via email

@JuanVargas (Author) commented Mar 11, 2024 via email

@JuanVargas (Author) commented Mar 11, 2024 via email

@sh-shahrokhi commented Mar 11, 2024 via email

@damadorPL commented Mar 11, 2024

You can get the .deb file directly from https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

@Gwyki commented Mar 11, 2024

Thanks @sh-shahrokhi. I thought it was path-related. I modified it slightly to make it Python-version independent if you put it in your conda environment activation ([environment]/etc/activate.d/env_vars.sh):

# Prepend the lib/ directory of every pip-installed nvidia package
NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))
for dir in "$NVIDIA_DIR"/*; do
    if [ -d "$dir/lib" ]; then
        export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
    fi
done

This is not a resolution, as this post-install step should not be necessary.

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

I can't seem to do similar tricks to resolve the TensorRT issue when TensorRT is installed similarly into the conda environment. Any ideas?

@sh-shahrokhi commented Mar 11, 2024

> I can't seem to do similar tricks to resolve the TensorRT issues when installed similarly into the conda environment. Any ideas?

I don't actually use TensorRT, but I would check whether the required .so file for it is visible to TensorFlow. You may need to find the name of the required file in the TensorFlow source code.
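One way to do that check without reading the source is to try to dlopen the TensorRT runtime the way TF-TRT would. A minimal sketch (the soname `libnvinfer.so.8` matches the TensorRT 8.x series that TF 2.16 was tested against; adjust it for other versions):

```python
import ctypes

# Attempt to load the TensorRT runtime libraries through the dynamic
# loader; this succeeds only if they are on the default search path or
# on LD_LIBRARY_PATH, which is exactly what the TF-TRT warning is about.
results = {}
for soname in ("libnvinfer.so.8", "libnvinfer_plugin.so.8"):
    try:
        ctypes.CDLL(soname)
        results[soname] = "found"
    except OSError:
        results[soname] = "not visible to the dynamic loader"

for soname, status in results.items():
    print(soname, "->", status)
```

If both report "not visible", adding the directory containing them to LD_LIBRARY_PATH (as done above for cuDNN) is the same class of workaround.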

This doesn't change the fact that new TensorFlow versions should be tested by the Google team before release, or the bugs should be fixed. It seems they only care about having a working Docker image, not anything else.

@Gwyki commented Mar 12, 2024

I have given up on TensorRT. I guess I won't be using it either.

> This actually doesn't change the fact that the new tensorflow version should be tested by google team before release, or the bugs should be fixed. It seems they only care about having a working docker image, not anything else.

Agreed. Installing TF has always been hit or miss and it seems that in the many years since I last used TF that hasn't changed one bit.

@moozoo64

Well, I wasted 8 hours of my Sunday on this, setting up another PC from scratch, before reverting to the old version. Now looking to move off TensorFlow.

@mihaimaruseac (Collaborator) commented Mar 12, 2024

In general, we used to test RC versions before release. For example, we used to have RC0, RC1 and RC2 for TF 2.9. This gave people and downstream teams enough time to test and report issues.

It seems that 2.16.1 only had an RC0 (for 2.16.0).

The release process is (was?) like this:

  • cut the release branch (e.g., r2.17)
  • immediately trigger the release pipeline. This would create a few PRs to update version numbers, release notes, but after this step RC0 should be as close as possible to the version on master branch at the time the release branch has been cut. There should not be any code changes to the release branch at this point (except to maybe cherrypick fixes from master from hard bugs caused by cutting the branch at a wrong commit)
  • have at least a week of testing for downstream teams to test RC0
  • get fixes to discovered bugs landed on master, cherrypick them to release branch, after they are already tested on nightly releases
  • trigger RC1 pipeline. Again, no other code changes should occur now, except to fix bugs discovered during building
  • wait a week for downstream teams to test. If there are bugs, repeat the steps above for another RC, otherwise repeat the steps above for the final version.

Overall, this process would take number_of_RCs + 1 weeks with a possibility of a few more weeks of delay.

However, for the 2.16 release, although the branch was cut on Feb 8th, there has been only one RC. Most likely the issues can be solved by a patch release.


@JuanVargas (Author)

I am closing this (unresolved issue) because I am told by the Keras/TF team that the issue is related to TF.

@eabase commented Jun 14, 2024

Can someone explain why TF >2.10 cannot be run with a GPU on native Windows?
This makes no sense whatsoever, as everything else (other hardware, WSL, Conda) works with the GPU, including other Python packages such as Torch. So what is going on?

I.e., what is the problem, and why is it not being addressed by the community?

@sh-shahrokhi commented Jun 14, 2024

> Can someone care to explain why TF >2.10 cannot be run with GPU in native windows? This totally makes no sense whatsoever, as all other HW, WSL, and Conda works with GPU. Including other python packages, such as Torch. So what is going on?
>
> I.e. What is the problem and why is it not being addressed by the community?

Google removed the native Windows CUDA build starting with TF 2.11.
There is nothing you can do about it; building from source with CUDA will also fail on Windows.

@mihaimaruseac (Collaborator)

Everyone who cared about full support of TF is no longer on the team. See the comments above for more details and differences.

@eabase commented Jun 15, 2024

@sh-shahrokhi

> Google removed the native windows cuda build starting TF 2.11

Unfortunately that doesn't say anything. I don't see how you can "remove" any of that, apart from breaking the build scripts. Whatever you "remove" must still be present for all other *nix builds. WSL is not that different from MSYS or MinGW, which is no longer too far from VS C/C++ builds.

@sh-shahrokhi commented Jun 15, 2024

> Unfortunately that doesn't say anything. I don't see how you can "remove" any of that, apart from breaking the build scripts.

#58629
Also:
#59918

@ben-jy commented Jun 17, 2024

Also kindly note that the current issue opened "TF 2.16.1 Fails to work with GPUs" involves Linux Operating Systems and potentially the additional steps to be specified in the official TensorFlow documentation in order to utilize GPUs locally.

I started a not very pleasant acquaintance with TensorFlow with this version. As I understand it, the specific reason is 2.16.1, and it does not work in WSL, because nothing worked for me. The question is which version can be installed so that it works normally in WSL.
Also, for the future, I will say that installing Anaconda does not help either; you can install at most version 2.10 on it.

@MrOxMasTer I totally understand your frustration, but I assure you that TensorFlow 2.16.1 can actually work with your CUDA-enabled GPU.

You can try the following:

  1. Create a fresh conda virtual environment in WSL and activate it, like this:
conda create --name tf python=3.11
conda activate tf
  2. Within the fresh conda virtual environment tf created in the previous step, run the following commands sequentially:
pip install --upgrade pip
pip install tensorflow[and-cuda]
  3. Set environment variables:

Note: This step is required in order to utilize your GPU but is not yet included in the official TensorFlow documentation. All NVIDIA libs are installed with TensorFlow because you ran pip install tensorflow[and-cuda] in the previous step!

Locate the directory for the conda environment in your terminal window by running in the terminal:

echo $CONDA_PREFIX

Enter that directory and create these subdirectories and files:

cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

Edit ./etc/conda/activate.d/env_vars.sh as follows:

#!/bin/sh

# Store the original LD_LIBRARY_PATH and PATH so they can be restored on deactivation
export ORIGINAL_LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
export ORIGINAL_PATH="${PATH}"

# Get the cuDNN directory
CUDNN_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))

# Prepend the cuDNN library directories to LD_LIBRARY_PATH (the sed strips
# find's trailing colon so no empty path entry is introduced)
export LD_LIBRARY_PATH="$(find ${CUDNN_DIR}/*/lib/ -type d -printf '%p:' | sed 's/:$//')${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Get the ptxas directory
PTXAS_DIR=$(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)")))

# Prepend the directory containing ptxas to PATH
export PATH="$(find ${PTXAS_DIR}/*/bin/ -type d -printf '%p:' | sed 's/:$//')${PATH:+:${PATH}}"
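The `${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}` expansion in the script above is what prevents a dangling separator when the variable starts out empty. A minimal standalone sketch of that shell idiom (the `/opt/new/lib` path is just an example value):

```shell
# Sketch of the ${VAR:+...} expansion used above: the ":old-value" suffix is
# appended only when VAR is already set and non-empty.
VAR=""
RESULT="/opt/new/lib${VAR:+:${VAR}}"
echo "$RESULT"    # /opt/new/lib  (no dangling colon)

VAR="/usr/lib:/usr/local/lib"
RESULT="/opt/new/lib${VAR:+:${VAR}}"
echo "$RESULT"    # /opt/new/lib:/usr/lib:/usr/local/lib
```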

Edit ./etc/conda/deactivate.d/env_vars.sh as follows:

#!/bin/sh

# Restore the original LD_LIBRARY_PATH and PATH (the ${VAR-fallback} form
# keeps the current value if the original was never saved)
export LD_LIBRARY_PATH="${ORIGINAL_LD_LIBRARY_PATH-$LD_LIBRARY_PATH}"
export PATH="${ORIGINAL_PATH-$PATH}"

# Clean up helper variables
unset ORIGINAL_LD_LIBRARY_PATH ORIGINAL_PATH CUDNN_DIR PTXAS_DIR

Verify the GPU setup: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
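The activation script locates the package directory with a nested `dirname` of the module's `__file__`. The same pattern can be tried standalone with a package that is always available (`json` here merely stands in for `nvidia.cudnn`, which exists only after the pip install above):

```shell
# Same dirname/dirname pattern as CUDNN_DIR above, applied to the stdlib
# 'json' package so the sketch runs in any environment.
PKG_DIR=$(dirname $(dirname $(python3 -c "import json; print(json.__file__)")))
echo "$PKG_DIR"
test -d "$PKG_DIR" && echo "directory exists"
```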

Additionally, as I was informed, the next version of TensorFlow will hopefully arrive within the next few days!

I hope it helps!

Doesn't work for me :/ I even completely reinstalled WSL, but I still get an empty list when listing the available devices... Should CUDA be uninstalled on the Windows side? When I use "nvidia-smi", it says that I have CUDA version 12.5, even though I didn't install anything in WSL... Is that normal?

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01              Driver Version: 555.99          CUDA Version: 12.5    |
|-----------------------------------------+------------------------+----------------------+

@sgkouzias

@ben-jy frankly, I have no clue. Did you check the official documentation? Does your setup meet the technical requirements? What is the Python version in WSL2, and is it compatible with TensorFlow 2.16.1? Which NVIDIA GPU do you have? The output of the command nvidia-smi in WSL2 seems normal, since your GPU driver is installed on the Windows side. However, you could try reinstalling everything (a compatible GPU driver, afterwards WSL2, and then TensorFlow)...

@sh-shahrokhi
sh-shahrokhi commented Jun 17, 2024 via email

@ben-jy
ben-jy commented Jun 18, 2024

@sgkouzias I checked the official documentation, but I find it not very clear, and it seems a bit contradictory: the software requirements state that CUDA and cuDNN should be installed on the machine, but the pip package should install them automatically with TensorFlow, right? Besides, this Medium tutorial explains that CUDA should not be installed on the Windows side, nor on the WSL side, but instead via the pip package. Maybe I should try to uninstall everything CUDA-related on Windows...
Concerning your other questions:

  1. I have an RTX 3070 Ti, which is in the list of CUDA-enabled product.
  2. I use conda and I tried the install with Python 3.10 and 3.11, which are in the software requirements of the official documentation. Those versions are said to be compatible with TensorFlow 2.16.1, according to the PyPI package tags.

I will try a clean reinstall of my GPU driver, as well as uninstalling CUDA on the Windows side. If it doesn't work, I think it is better to install CUDA and cuDNN manually, along with an older TensorFlow version. It is still a shame that the official documentation of such a large and important library is so unclear.

@tilakrayal
Contributor

@learning-to-play

@mihaimaruseac
Collaborator

Can you test the 2.17.0 RC0, please? It is too late to update 2.16, but if 2.17 RC0 doesn't work, maybe there will be a chance to fix by RC1/final

@sgkouzias

Can you test the 2.17.0 RC0, please? It is too late to update 2.16, but if 2.17 RC0 doesn't work, maybe there will be a chance to fix by RC1/final

@mihaimaruseac I just tested but unfortunately it has the same issue.

@mihaimaruseac mihaimaruseac changed the title TF 2.16.1 Fails to work with GPUs TF 2.17.0 RC0 Fails to work with GPUs (and TF 2.16 too) Jun 19, 2024
@mihaimaruseac
Collaborator

@learning-to-play maybe this can get fixed before final release? TF does not work with GPUs, started failing since TF 2.16 release.

@sgkouzias
sgkouzias commented Jun 19, 2024

A tested workaround to utilize GPU for Linux users:

  1. Create a virtual environment with venv:
    python3 -m venv tf

  2. Activate the environment
    source tf/bin/activate

  3. Upgrade pip
    pip install --upgrade pip

  4. Install TensorFlow 2.17.0rc0
    pip install tensorflow[and-cuda]==2.17.0rc0

  5. Create symbolic links to NVIDIA shared libraries

pushd $(dirname $(python -c 'print(__import__("tensorflow").__file__)'))
ln -svf ../nvidia/*/lib/*.so* .
popd
  6. Verify installation
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
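A hedged dry-run sketch of step 5: list the NVIDIA shared objects the `ln` command would pick up, without creating anything. The `nvidia/*/lib` layout is the one produced by `pip install tensorflow[and-cuda]`; the fallback branch simply reports when TensorFlow is absent.

```shell
# Dry run of step 5: show which .so files the symlink command would target.
TF_DIR=$(python3 - <<'EOF'
import importlib.util, os
spec = importlib.util.find_spec("tensorflow")
print(os.path.dirname(spec.origin) if spec and spec.origin else "")
EOF
)
if [ -n "$TF_DIR" ]; then
  for lib in "$TF_DIR"/../nvidia/*/lib/*.so*; do
    if [ -e "$lib" ]; then
      echo "would link: $(basename "$lib")"
    fi
  done
else
  echo "tensorflow is not installed in this environment"
fi
```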

I have created a respective pull request, still pending review, in good faith and for the sake of all users, as TensorFlow is "An Open Source Machine Learning Framework for Everyone".

@learning-to-play
Collaborator
learning-to-play commented Jun 19, 2024

@SeeForTwo @poulsbo Could you please take a look? If there is a fix that needs to be cherry picked to 2.16.2 or 2.17.0, please follow these steps:

  • Submit a fix to TensorFlow HEAD
  • Ensure nightly builds are green.
  • Create a cherry pick PR to the corresponding release branches r2.16 and r2.17 and assign to @rtg0795

@sgkouzias Does this issue happen for both TF 2.16.1 and 2.17.0rc0?

@sgkouzias
sgkouzias commented Jun 19, 2024

@SeeForTwo @poulsbo Could you please take a look? If there is a fix that needs to be cherry picked to 2.16.2 or 2.17.0, please follow these steps:

  • Submit a fix to TensorFlow HEAD
  • Ensure nightly builds are green.
  • Create a cherry pick PR to the corresponding release branches r2.16 and r2.17 and assign to @rtg0795

@sgkouzias Does this issue happen for both TF 2.16.1 and 2.17.0rc0?

@learning-to-play yes indeed. The only difference is that on version 2.17.0rc0 you only need the symlinks to the NVIDIA libs in order to utilize GPUs, while on version 2.16.1 you must, in addition to creating the symlinks to the NVIDIA libs, create a symlink to ptxas.
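For 2.16.1 specifically, the extra ptxas step could be sketched like this, assuming the `nvidia.cuda_nvcc` wheel layout installed by `pip install tensorflow[and-cuda]` (the `bin/ptxas` location follows the pattern used in the activation script earlier in this thread; the fallback branch just reports when the wheel is absent):

```shell
# Sketch for TF 2.16.1: put the pip-installed ptxas on PATH in addition to
# the library symlinks. Wheel layout (nvidia/cuda_nvcc/bin/ptxas) is assumed.
NVCC_DIR=$(python3 - <<'EOF'
try:
    import os, nvidia.cuda_nvcc
    print(os.path.dirname(nvidia.cuda_nvcc.__file__))
except ImportError:
    print("")
EOF
)
if [ -n "$NVCC_DIR" ]; then
  export PATH="$NVCC_DIR/bin:$PATH"
  command -v ptxas || echo "ptxas not found under $NVCC_DIR/bin"
else
  echo "nvidia.cuda_nvcc is not installed in this environment"
fi
```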

copybara-service bot pushed a commit that referenced this issue Jun 24, 2024
…_deps`.

Should fix #63362

Reverts changelist 582804278

PiperOrigin-RevId: 646146985
@belitskiy belitskiy reopened this Jun 24, 2024
@tensorflow tensorflow deleted a comment from google-ml-butler bot Jun 24, 2024
@learning-to-play learning-to-play added the 2.17 Issues related to 2.17 release label Jun 25, 2024
tensorflow-jenkins pushed a commit that referenced this issue Jun 25, 2024
…_deps`.

Should fix #63362

Reverts changelist 582804278

PiperOrigin-RevId: 646182849