

set_intra_op_parallelism_threads and set_inter_op_parallelism_threads have no impact on thread usage #48772

Open
matt-houk opened this issue Apr 27, 2021 · 8 comments
Assignees
Labels
comp:gpu (GPU related issues), TF 2.4 (for issues related to TF 2.4), type:support (Support issues)

Comments

@matt-houk

System information

  • I am using a custom implementation of DepthwiseConv3D that extends the Conv3D class and is partially based on code from the following repository, adapted to work in TensorFlow 2.4.1: https://github.com/alexandrosstergiou/keras-DepthwiseConv3D
  • CentOS Linux 7
  • Installed from source
  • TensorFlow version 2.4.1
  • Python 3.8.8
  • CUDA 11.0, cuDNN 8
  • Happens on GTX 1080 and RTX 2080

Describe the current behavior

I am running code on a compute cluster, hence the different GPUs. The cluster admin requires me to restrict thread usage when possible and referred me to the functions in tf.config.threading in the TensorFlow documentation. I set both intra-op and inter-op parallelism to 2 and used an interactive session on the node to monitor thread usage with top; however, setting these parameters appears to have no impact on thread usage. I still observe the python process using all available threads.

My understanding from the documentation is that all that is required is to call these functions with the desired values. No errors related to threading have been raised.
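For reference, the configuration amounts to a minimal sketch like the following (NUM_THREADS is a placeholder for the value I pass, 2 in my runs; the full script is in the repository linked below):

```python
import tensorflow as tf

NUM_THREADS = 2  # placeholder for the value used in my runs

# These calls must run before TensorFlow creates its thread pools,
# i.e. before any ops are executed or a model is built.
tf.config.threading.set_intra_op_parallelism_threads(NUM_THREADS)
tf.config.threading.set_inter_op_parallelism_threads(NUM_THREADS)
```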

My GitHub repository is linked below. I use the mult.csh file under the EEGNet folder to execute the code, which runs runMB3D.py using the network model from MB3DEEGNet.py.
https://github.com/matt-houk/MB3DCNN

I have attached both the stderr and stdout output; I canceled the run when I noticed it using excessive threads.

err-mult.txt
out-mult.txt

@matt-houk matt-houk added the type:bug Bug label Apr 27, 2021
@UsharaniPagadala UsharaniPagadala added the TF 2.4 for issues related to TF 2.4 label Apr 28, 2021
@Saduf2019 Saduf2019 added comp:dist-strat Distribution Strategy related issues type:support Support issues and removed type:bug Bug labels Apr 29, 2021
@nikitamaia nikitamaia added comp:gpu GPU related issues and removed comp:dist-strat Distribution Strategy related issues labels May 3, 2021
@nikitamaia nikitamaia assigned sanjoy and unassigned nikitamaia May 3, 2021
@matt-houk
Author
matt-houk commented May 5, 2021

I just wanted to include some follow-up info from my efforts to debug the issue over the past week.

I attempted to set OMP_NUM_THREADS prior to running the Python file; this also appears to have no impact.

I have also set the environment variables TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS prior to application runtime, checking them via os.environ from within Python both before and after TensorFlow is imported, in addition to making the tf.config.threading.set_intra/inter_op_parallelism_threads calls. The environment variables remain unchanged at 1; however, when observing the python process with top, the nTH column shows it using 13 threads.
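For completeness, the in-Python version of that check looks roughly like this (a sketch; the values are placeholders matching what I set in the job script):

```python
import os

# Must be set before `import tensorflow` for the runtime to see them.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["TF_NUM_INTEROP_THREADS"] = "1"
os.environ["TF_NUM_INTRAOP_THREADS"] = "1"

import tensorflow as tf  # imported only after the variables are set

# Sanity check: the values still read back as "1" here, as described above.
print(os.environ.get("TF_NUM_INTEROP_THREADS"),
      os.environ.get("TF_NUM_INTRAOP_THREADS"))
```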

Please let me know if it seems I am making an incorrect assumption on anything here. Thank you.

@matt-houk
Author

If there are any other relevant issues that I have been unable to unearth in my searches, I would appreciate it if somebody could link them. While this issue doesn't prevent me from being able to do work, it dramatically slows the rate at which I can do so, and none of my attempts to resolve it have been successful.

@matt-houk
Author
matt-houk commented May 13, 2021

Can someone clarify for me whether the functions
tf.config.threading.set_intra_op_parallelism_threads(NUM_THREADS)
tf.config.threading.set_inter_op_parallelism_threads(NUM_THREADS)
are meant to control the threads used by the python process or the nvidia-cuda-mps process?

Adding
tf.config.set_soft_device_threading(True)
was able to restrict nvidia-cuda-mps to 4 threads when observed with top, which would be in line with my case where NUM_THREADS = 2. If that is the case, I simply need to determine why Python is choosing to use as many threads as it can. While I would appreciate any advice anyone might have, that might be outside of TensorFlow and instead a Python issue. Thanks!
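For anyone following along, this is roughly how I am double-checking what TensorFlow reports for its own thread pools (a sketch; the getters only reflect the configured values, not what the OS scheduler actually does):

```python
import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)

# The getters return the configured values (0 would mean "let TensorFlow decide").
print("intra-op:", tf.config.threading.get_intra_op_parallelism_threads())
print("inter-op:", tf.config.threading.get_inter_op_parallelism_threads())
```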

@matt-houk
Author
matt-houk commented Jun 17, 2021

Is there any possibility of getting a response on this issue? Otherwise, it is probably best to close it. The issue is still unresolved, but there has been no activity outside of my own updates. Any additional insight would be appreciated.

Also, just to clarify my understanding from above: the functions set_inter_op_parallelism_threads and set_intra_op_parallelism_threads should affect all processes involved with TensorFlow, including both the nvidia-cuda-mps and python processes. My issue is that Python is scaling to all available threads, and I want to make sure I am understanding correctly that these functions should affect the Python process, not another process.

@rohan100jain
Member
rohan100jain commented Jun 19, 2021

Apologies for the delayed response. Would it be possible to provide some sample D_cropped.npy and K_cropped.npy files so that I can try to reproduce the issue? Generally, the inter-op and intra-op threads are meant for the C++ TensorFlow executor runtime: to be precise, the inter-op threadpool controls the number of C++ threads we use to dispatch ops while executing functions / graphs, and the intra-op threadpool controls the size of the Eigen threadpool for individual ops. The GPU runtime from NVIDIA may have its own threadpool, and that's why that needs to be configured as well. We usually do very little multi-threading in Python, but I'd like to reproduce the issue to figure out what might be going on.
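To make that mapping concrete, here is a rough sketch of the equivalent v1-style configuration; both fields end up in the session options that the C++ runtime reads (the tf.config.threading setters are the 2.x front end for the same values):

```python
import tensorflow as tf

# inter_op = number of threads dispatching ops; intra_op = size of the Eigen
# pool available to an individual op.
config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=2,
    intra_op_parallelism_threads=2,
)
sess = tf.compat.v1.Session(config=config)
```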

One other suggestion to get more data would be to run the TensorFlow profiler (https://www.tensorflow.org/guide/profiler), which would clearly enumerate the list of threads in action. That might provide some insight as well.
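For example, a minimal way to capture a trace around a few training steps is sketched below (the logdir path is a placeholder; the resulting trace can be viewed in TensorBoard's Profile tab):

```python
import tensorflow as tf

logdir = "/tmp/tf_profile"  # placeholder output directory

tf.profiler.experimental.start(logdir)
# Run the work to be profiled here (e.g. a few model.fit steps);
# a trivial matmul stands in for it in this sketch.
_ = tf.matmul(tf.random.normal([256, 256]), tf.random.normal([256, 256]))
tf.profiler.experimental.stop()
```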

@matt-houk
Author

Here is a sample of the files,
Data files.zip

Sorry for the delayed response, I will also look into the profiler on my end!

@rohan100jain
Member

Thanks... I was able to reproduce the issue, and I do indeed see a large number of threads. One hypothesis is that these are tf.data threads you're seeing, but it will be a lot clearer with the profile.
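If it does turn out to be tf.data, one thing worth trying is capping the input pipeline's private thread pool, roughly as sketched below (the toy dataset is a placeholder for your real pipeline; in TF 2.4 these knobs live under experimental_threading):

```python
import tensorflow as tf

# Toy dataset standing in for the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 4])).batch(2)

options = tf.data.Options()
# Cap the threads this pipeline may create.
options.experimental_threading.private_threadpool_size = 2
options.experimental_threading.max_intra_op_parallelism = 1
dataset = dataset.with_options(options)
```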

@matt-houk
Author

I just got the opportunity to try the profiler; sorry for the further delay, I was working to meet some deadlines. Any attempt I make to use profiling appears to immediately cause a segfault. I will update the GitHub repository (I've been behind on that anyway); the relevant files can be found under the Groups directory. I made some modifications to the code in an attempt to improve runtime, using a different method of implementing a DepthwiseConv3D layer that I found.

Below is the stack trace from GDB; everything else should be about the same. The segmentation fault consistently occurs at the model.fit call on line 149 of runMotor.py.

#0 0x00002aaaaaf70c3c in free () from /lib64/libc.so.6
#1 0x00002aaab742b65d in Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorReverseOp<Eigen::array<long, 5ul> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 5> const, Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorVolumePatchOp<-1l, -1l, -1l, Eigen::TensorForcedEvalOp<Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const> const, Eigen::TensorForcedEvalOp<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const, Eigen::NoOpOutputKernel const> const> const> const> const> const, Eigen::ThreadPoolDevice, true, (Eigen::internal::TiledEvaluation)1>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorReverseOp<Eigen::array<long, 5ul> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 5> const, Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorVolumePatchOp<-1l, -1l, -1l, Eigen::TensorForcedEvalOp<Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const> const, Eigen::TensorForcedEvalOp<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const, Eigen::NoOpOutputKernel const> const> const> const> const> const&, Eigen::ThreadPoolDevice const&) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00002aaab742c42e in tensorflow::functor::CuboidConvolutionBackwardFilter<Eigen::ThreadPoolDevice, float>::operator()(Eigen::ThreadPoolDevice const&, Eigen::TensorMap<Eigen::Tensor<float, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer>, int, int, int) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00002aaab7435294 in tensorflow::Conv3DCustomBackpropFilterOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00002aaae1004707 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long long) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#5 0x00002aaab4f65d32 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00002aaab4f62047 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00002aaae10a3d3c in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#8 0x00002aaaaacd6e25 in start_thread () from /lib64/libpthread.so.0
#9 0x00002aaaaafe9bad in clone () from /lib64/libc.so.6

This might warrant a separate issue; if so, let me know and I will open one. I am close to concluding the project I was working on, but I am fully willing to continue working to debug this issue as I am able.

Let me know if you have any other questions!
