failed to set new cublas math mode: CUBLAS_STATUS_INVALID_VALUE #57359
Comments
@Co1lin |
@sushreebarsa I understand. Currently I use a dynamic model generation technique, and the code is really complex. I will try to manually build the same model (so the code will be simple) as the one leading to the crash and see whether it reproduces the same issue. |
@Co1lin Thank you for the response! |
I'm getting the same error on 2.9.0, but I reproduced it in 2.8.0 and 2.9.1 too. I'll see if I can create a small enough example to post. 2022-08-23 16:26:25.102620: E tensorflow/stream_executor/cuda/cuda_blas.cc:197] failed to set new cublas math mode: CUBLAS_STATUS_INVALID_VALUE |
Here's more of my trace:
Node: 'EfficientNet/predictions/MatMul' |
I'm sorry that I am currently not able to provide a minimal example for reproduction. I use a dynamic graph generation technique, and the code is not publicly available yet, though it will be made public later. I tried to build the same graph manually and statically to see whether it reproduces the same issue, but unfortunately it does not. However, I found a workaround that works for me. Below is how I found it; I hope it provides enough information for you to fix this issue.
First, let's focus on the most useful error lines among the outputs:
2022-08-21 23:09:42.546282: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2022-08-21 23:09:42.546307: E tensorflow/stream_executor/cuda/cuda_blas.cc:197] failed to set new cublas math mode: CUBLAS_STATUS_INVALID_VALUE
2022-08-21 23:09:42.546320: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at matmul_op_impl.h:438 : INTERNAL: Failed initializing math mode
outputs= (shape=(2, 2, 2, 2) dtype=<dtype: 'float32'>)
From these lines we can locate the source code and the position reporting the error (the file path in the log is not exactly the same as in the repository). The code near that position shows that the failure happens when setting the cuBLAS math mode. Some posts say this math mode is used to accelerate "TF32 Tensor Core operations", which matches this line:
2022-08-21 23:09:42.546282: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
So this error is caused by TF32-related optimizations, and we can disable them by adding this line to the Python code:
tf.config.experimental.enable_tensor_float_32_execution(False)
From the cuBLAS-related source code above, we can also see another common error message. That one is discussed in #9489 and can be solved by enabling GPU memory growth; note that it is not the same issue as the one discussed here.
To sum up, we can add the following lines to avoid two common issues caused by cuBLAS, though I hope the internal issue can be fixed in the future:
tf.config.experimental.enable_tensor_float_32_execution(False)
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
@jhuus Maybe you can have a try. |
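To make the workaround above concrete, here is a minimal sketch (not the code from this issue) of how these lines might be placed at the top of a script, before any model is built or run; the tiny Dense example at the end is purely illustrative:

import tensorflow as tf

# Disable TensorFloat-32 so cuBLAS is not asked to switch math modes
# (works around "failed to set new cublas math mode: CUBLAS_STATUS_INVALID_VALUE").
tf.config.experimental.enable_tensor_float_32_execution(False)

# Allocate GPU memory on demand instead of grabbing it all up front
# (the separate cuBLAS issue referenced above, see #9489).
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Illustrative forward pass through a Dense layer after the settings are applied.
layer = tf.keras.layers.Dense(3)
print(layer(tf.random.normal((2, 4))).shape)  # (2, 3)

Disabling TF32 trades a bit of matrix-multiply speed on Ampere-class GPUs for full float32 precision, which is why it is described later in the thread as only a small performance sacrifice.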
I have a new discovery. In my environment, the error does not occur if I remove the torch import. |
@Co1lin Thank you for the update! |
@sushreebarsa Hi! I am wondering whether it would be better to output a friendlier error message for this assertion failure. Only logging
Node: 'dense/MatMul'
Failed initializing math mode
[[{{node dense/MatMul}}]] [Op:__inference_call_156]
is quite confusing. If it's OK, I would like to add some extra information here, like:
|
In my case, I am trying to use torchaudio with tensorflow. If I use librosa it works, but if I use torchaudio I get the error. Since torchaudio uses the GPU it is much faster. If this can't be fixed I'll just rewrite the whole thing to use torch, I guess.
|
@jhuus Could you try tf.config.experimental.enable_tensor_float_32_execution(False)? I think it only sacrifices a little performance but enables you to use torch and tensorflow at the same time. |
Actually, just importing tensorflow before I import torchaudio fixed the
problem! It makes me a little worried about other possible compatibility
issues between torchaudio and tensorflow though.
|
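For reference, a minimal sketch of the import ordering described in the comment above, assuming both tensorflow and torchaudio are installed; only the order of the imports changes:

# Importing TensorFlow before torchaudio avoided the cuBLAS math mode error
# in that environment; no other code changes are needed.
import tensorflow as tf  # imported first
import torchaudio        # imported afterwards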
Updated 10 min later: the problem went away after restarting the kernel.
I'm facing the same errors, and they started to appear only after I ran $ pip install setfit. Now I can reproduce the problem with this simple snippet; the exact same code below was working fine before I installed setfit.
from sklearn.base import BaseEstimator, TransformerMixin
import tensorflow_hub as hub

class UseEmbedder(TransformerMixin, BaseEstimator):
    def __init__(self):
        self._embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    def fit(self, X, y=None, sample_weight=None):
        return self

    def transform(self, X):
        return self._embed(X).numpy()

    def fit_transform(self, X, y=None, sample_weight=None):
        return self.transform(X)

embedding_transformer = UseEmbedder()
embedding_transformer.transform(['why did this just break']) |
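If it helps with diagnosis, the TF32 setting discussed earlier in the thread can be checked and toggled in the same session; this is only a suggestion for anyone hitting the error, not something the commenter above reported trying:

import tensorflow as tf

# Report whether TensorFloat-32 execution is currently enabled, then disable it.
print(tf.config.experimental.tensor_float_32_execution_enabled())
tf.config.experimental.enable_tensor_float_32_execution(False)
print(tf.config.experimental.tensor_float_32_execution_enabled())  # now False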
It worked for me, big thanks! Removing the torch import did not work for me, though. Since I haven't posted in this thread before: I was having the same issue (I think, anyway). Let me know if you want me to post my entire traceback. End of traceback:
|
I am running into this error with TF 2.11.0 and am wondering whether there is any concrete solution. |
There are several suggested above. |
I am having the exact same issue when running with TF 2.11, and the workaround works for me:
I would like to know the root cause and whether there is a plan to fix it. |
Issue Type
Bug
Source
binary
Tensorflow Version
v2.9.0-18-gd8ce9f9c301 2.9.1
Custom Code
No
OS Platform and Distribution
Linux Ubuntu 20.04.4 LTS
Mobile device
No response
Python version
No response
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current Behaviour?
I have a dynamic keras.Model named symbol_net. When executing the forward computation (calling its call method), it sometimes crashes as follows if there is a Dense layer in the model. I have searched the Internet and tried many solutions, including combinations of them, but none of them work. I have a GPU with 12 GiB of memory; on the multi-user machine, about 12000 MiB were free while I was running the code, so memory should not be the problem. My model is quite small and won't take a lot of memory.
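For illustration only (this is hypothetical, and is neither the reporter's dynamically generated model nor the standalone reproduction requested below), a model of the shape described here is just a subclassed keras.Model whose call method goes through a Dense layer:

import tensorflow as tf

class SymbolNet(tf.keras.Model):  # hypothetical stand-in for the dynamic symbol_net
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(2)  # the Dense layer mentioned above

    def call(self, x):
        return self.dense(x)

net = SymbolNet()
out = net(tf.random.normal((2, 2)))  # the forward computation that crashes for the reporter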
Standalone code to reproduce the issue
Relevant log output