

set_intra_op_parallelism_threads and set_inter_op_parallelism_threads have no impact on thread usage #48772

Open
matt-houk opened this issue Apr 27, 2021 · 8 comments
Assignees
Labels
comp:gpu (GPU related issues), TF 2.4 (for issues related to TF 2.4), type:support (Support issues)

Comments

@matt-houk

System information

  • I am using a custom implementation of DepthwiseConv3D that extends the Conv3D class and is partially based on code from the following repository, adapted to work in TensorFlow 2.4.1: https://github.com/alexandrosstergiou/keras-DepthwiseConv3D
  • CentOS Linux 7
  • Installed from source
  • TensorFlow version 2.4.1
  • Python 3.8.8
  • CUDA 11.0, cuDNN 8
  • Happens on GTX 1080 and RTX 2080

Describe the current behavior

I am running code on a compute cluster, hence the different GPUs. The cluster admin requires me to restrict thread usage when possible and referred me to the functions in tf.config.threading in the TensorFlow documentation. I set both intra-op and inter-op parallelism to 2 and used an interactive session on the node to monitor thread usage with top; however, setting these parameters appears to have no impact on thread usage. I still observe the python process using all available threads.

My understanding from the documentation is that all that is required is to call these functions with the desired values. No errors related to threading have been raised.
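For reference, the configuration amounts to a minimal sketch like the following (NUM_THREADS is a placeholder for the value I pass, 2 in my runs; the full script is in the repository linked below):

```python
import tensorflow as tf

NUM_THREADS = 2  # placeholder for the value used in my runs

# These calls must run before TensorFlow creates its thread pools,
# i.e. before any ops are executed or a model is built.
tf.config.threading.set_intra_op_parallelism_threads(NUM_THREADS)
tf.config.threading.set_inter_op_parallelism_threads(NUM_THREADS)
```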

My GitHub repository is linked below. I use the mult.csh file under the EEGNet folder to execute the code, which runs runMB3D.py using the network model from MB3DEEGNet.py.
https://github.com/matt-houk/MB3DCNN

I have attached both the stderr and stdout output; I canceled the run when I noticed it using excessive threads.

err-mult.txt
out-mult.txt

@matt-houk matt-houk added the type:bug Bug label Apr 27, 2021
@UsharaniPagadala UsharaniPagadala added the TF 2.4 for issues related to TF 2.4 label Apr 28, 2021
@Saduf2019 Saduf2019 added comp:dist-strat Distribution Strategy related issues type:support Support issues and removed type:bug Bug labels Apr 29, 2021
@nikitamaia nikitamaia added comp:gpu GPU related issues and removed comp:dist-strat Distribution Strategy related issues labels May 3, 2021
@nikitamaia nikitamaia assigned sanjoy and unassigned nikitamaia May 3, 2021
@matt-houk
Author
matt-houk commented May 5, 2021

I just wanted to include some follow-up info from my efforts to debug the issue over the past week.

I attempted to set OMP_NUM_THREADS prior to running the Python file; this also appears to have no impact.

I have also set the environment variables TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS prior to application runtime, checking them via os.environ from within Python both before and after TensorFlow is imported, in addition to making the tf.config.threading.set_intra/inter_op_parallelism_threads calls. The environment variables remain unchanged at 1; however, when observing the python process with top, the nTH column shows it using 13 threads.
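For completeness, the in-Python version of that check looks roughly like this (a sketch; the values are placeholders matching what I set in the job script):

```python
import os

# Must be set before `import tensorflow` for the runtime to see them.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["TF_NUM_INTEROP_THREADS"] = "1"
os.environ["TF_NUM_INTRAOP_THREADS"] = "1"

import tensorflow as tf  # imported only after the variables are set

# Sanity check: the values still read back as "1" here, as described above.
print(os.environ.get("TF_NUM_INTEROP_THREADS"),
      os.environ.get("TF_NUM_INTRAOP_THREADS"))
```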

Please let me know if it seems I am making an incorrect assumption on anything here. Thank you.

@matt-houk
Author

If there are any other relevant issues that I have been unable to unearth in my searches, I would appreciate it if somebody could link them. While this issue doesn't prevent me from being able to do work, it dramatically slows the rate at which I can do so, and none of my attempts to resolve it have been successful.

@matt-houk
Author
matt-houk commented May 13, 2021

Can someone clarify for me whether the functions
tf.config.threading.set_intra_op_parallelism_threads(NUM_THREADS)
tf.config.threading.set_inter_op_parallelism_threads(NUM_THREADS)
are meant to control the threads used by the python process or the nvidia-cuda-mps process?

Adding
tf.config.set_soft_device_threading(True)
was able to restrict nvidia-cuda-mps to 4 threads when observed with top, which would be in line with my case where NUM_THREADS = 2. If that is the case, I simply need to determine why Python is choosing to use as many threads as it can. While I would appreciate any advice anyone might have, that might be outside of TensorFlow and instead a Python issue. Thanks!
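For anyone following along, this is roughly how I am double-checking what TensorFlow reports for its own thread pools (a sketch; the getters only reflect the configured values, not what the OS scheduler actually does):

```python
import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)

# The getters return the configured values (0 would mean "let TensorFlow decide").
print("intra-op:", tf.config.threading.get_intra_op_parallelism_threads())
print("inter-op:", tf.config.threading.get_inter_op_parallelism_threads())
```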

@matt-houk
Author
matt-houk commented Jun 17, 2021

Is there any possibility of getting a response on this issue? Otherwise, it is probably best to close it. The issue is still unresolved, but there has been no activity outside of my own updates. Any additional insight would be appreciated.

Also, just to clarify my understanding from above: the functions set_inter_op_parallelism_threads and set_intra_op_parallelism_threads should affect all processes involved with TensorFlow, including both the nvidia-cuda-mps and python processes. My issue is that Python is scaling to all available threads, and I want to make sure I am understanding correctly that these functions should affect the Python process, not another process.

@rohan100jain
Member
rohan100jain commented Jun 19, 2021

Apologies for the delayed response. Would it be possible to provide some sample D_cropped.npy and K_cropped.npy files so that I can try to reproduce the issue? Generally, the inter-op and intra-op threads are meant for the C++ TensorFlow executor runtime: to be precise, the inter-op threadpool controls the number of C++ threads we use to dispatch ops while executing functions / graphs, and the intra-op threadpool controls the size of the Eigen threadpool for individual ops. The GPU runtime from NVIDIA may have its own threadpool, and that's why that needs to be configured as well. We usually do very little multi-threading in Python, but I'd like to reproduce the issue to figure out what might be going on.
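To make that mapping concrete, here is a rough sketch of the equivalent v1-style configuration; both fields end up in the session options that the C++ runtime reads (the tf.config.threading setters are the 2.x front end for the same values):

```python
import tensorflow as tf

# inter_op = number of threads dispatching ops; intra_op = size of the Eigen
# pool available to an individual op.
config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=2,
    intra_op_parallelism_threads=2,
)
sess = tf.compat.v1.Session(config=config)
```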

One other suggestion to get more data would be to run the TensorFlow profiler (https://www.tensorflow.org/guide/profiler), which would clearly enumerate the list of threads in action. That might provide some insight as well.
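For example, a minimal way to capture a trace around a few training steps is sketched below (the logdir path is a placeholder; the resulting trace can be viewed in TensorBoard's Profile tab):

```python
import tensorflow as tf

logdir = "/tmp/tf_profile"  # placeholder output directory

tf.profiler.experimental.start(logdir)
# Run the work to be profiled here (e.g. a few model.fit steps);
# a trivial matmul stands in for it in this sketch.
_ = tf.matmul(tf.random.normal([256, 256]), tf.random.normal([256, 256]))
tf.profiler.experimental.stop()
```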

@matt-houk
Author

Here is a sample of the files,
Data files.zip

Sorry for the delayed response, I will also look into the profiler on my end!

@rohan100jain
Member

Thanks... I was able to reproduce the issue, and I do indeed see a large number of threads. One hypothesis is that these are tf.data threads you're seeing, but it will be a lot clearer with the profile.
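If it does turn out to be tf.data, one thing worth trying is capping the input pipeline's private thread pool, roughly as sketched below (the toy dataset is a placeholder for your real pipeline; in TF 2.4 these knobs live under experimental_threading):

```python
import tensorflow as tf

# Toy dataset standing in for the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 4])).batch(2)

options = tf.data.Options()
# Cap the threads this pipeline may create.
options.experimental_threading.private_threadpool_size = 2
options.experimental_threading.max_intra_op_parallelism = 1
dataset = dataset.with_options(options)
```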

@matt-houk
Author

I just got the opportunity to try the profiler; sorry for the further delay, I was working to meet some deadlines. Any attempt I make to use profiling appears to immediately cause a segfault. I will update the GitHub repository (I've been behind on that anyway); the relevant files can be found under the Groups directory. I made some modifications to the code in an attempt to improve runtime, using a different method of implementing a DepthwiseConv3D layer that I found.

Below is the stack trace from GDB; everything else should be about the same. The segmentation fault consistently occurs at the model.fit call on line 149 of runMotor.py.

#0 0x00002aaaaaf70c3c in free () from /lib64/libc.so.6
#1 0x00002aaab742b65d in Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorReverseOp<Eigen::array<long, 5ul> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 5> const, Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorVolumePatchOp<-1l, -1l, -1l, Eigen::TensorForcedEvalOp<Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const> const, Eigen::TensorForcedEvalOp<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const, Eigen::NoOpOutputKernel const> const> const> const> const> const, Eigen::ThreadPoolDevice, true, (Eigen::internal::TiledEvaluation)1>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorReverseOp<Eigen::array<long, 5ul> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 5> const, Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorVolumePatchOp<-1l, -1l, -1l, Eigen::TensorForcedEvalOp<Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const> const, Eigen::TensorForcedEvalOp<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorShufflingOp<Eigen::array<long, 5ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer> const> const> const> const, Eigen::NoOpOutputKernel const> const> const> const> const> const&, Eigen::ThreadPoolDevice const&) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00002aaab742c42e in tensorflow::functor::CuboidConvolutionBackwardFilter<Eigen::ThreadPoolDevice, float>::operator()(Eigen::ThreadPoolDevice const&, Eigen::TensorMap<Eigen::Tensor<float, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, Eigen::MakePointer>, int, int, int) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00002aaab7435294 in tensorflow::Conv3DCustomBackpropFilterOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00002aaae1004707 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long long) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#5 0x00002aaab4f65d32 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00002aaab4f62047 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00002aaae10a3d3c in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /usr/local/usrapps/multibranch/mjhouk/tf241_py377/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#8 0x00002aaaaacd6e25 in start_thread () from /lib64/libpthread.so.0
#9 0x00002aaaaafe9bad in clone () from /lib64/libc.so.6

This might warrant a separate issue; if so, let me know and I will open one. I am close to concluding the project I was working on, but I am fully willing to continue working to debug this issue as I am able.

Let me know if you have any other questions!
