[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows heap corruption caused by upgrading protobuf #53234

Closed
meteorcloudy opened this issue Nov 29, 2021 · 5 comments
Closed

Windows heap corruption caused by upgrading protobuf #53234

meteorcloudy opened this issue Nov 29, 2021 · 5 comments
Assignees
Labels
2.6.0 comp:grappler Grappler related issues type:build/install Build and install issues

Comments

@meteorcloudy
Copy link
Member
meteorcloudy commented Nov 29, 2021

System information

Describe the problem

This is a summary of the Windows heap corruption bug discovered when upgrading protobuf in #52853 (comment).

In the PR, we tried to upgrade protobuf from 3.9.2 to 3.19.0, but some tf optimizer related tests are failing on Windows with

Windows fatal exception: code 0xc0000374

The error code indicates a heap corruption. After some investigation, we discovered the problem is caused by:

  • How protobuf is linked in TensorFlow on Windows
    On Windows, most of TF C++ code is linked into _pywrap_tensorflow_internal.pyd and some extension code is linked into individual pyd files, like _pywrap_tf_optimizer.pyd. The later is also linked to the former. However, the protobuf library is statically linked into both dynamic libraries.
  • How protobuf works
    Basically, protobuf deletes some global string in a proto desctructor after comparing the address of some global default string. When there are two protobuf runtimes, there are two default strings with different addresses, which caused some memory to be accidentally deleted when mixed together.

@acozzette also explained why protobuf doesn't work well when linked in multiple places at #52853 (comment)

Provide the exact sequence of commands / steps that you executed before running into the problem

To reproduce the original error in a full build:

git clone https://github.com/meteorcloudy/tensorflow.git
cd tensorflow
git fetch origin upgrade_protobuf_grpc
configure
bazel test --announce_rc --config=opt tensorflow/python/grappler:memory_optimizer_test --test_arg=MemoryOptimizerSwapTest.testNoSwapping

However, the full TF build isn't debuggable, I have constructed some smaller targets to imitate the situation.
To build the minimal reproduce case in a debug build:

git clone https://github.com/meteorcloudy/tensorflow.git
cd tensorflow
git fetch origin reproduce_win_heap_corruption_2
configure
bazel run --announce_rc --config=dbg tensorflow/python/grappler:tf_optimizer_wrapper_bin

When debugging in Visual Studio, you should be able to see the following:
image

Possible Solution

To properly solve this problem, we probably need to migrate TF to cc_shared_library, which makes dynamic linking more controllable.

@meteorcloudy meteorcloudy added the type:build/install Build and install issues label Nov 29, 2021
@meteorcloudy
Copy link
Member Author

@mohantym
Copy link
Contributor

Hi @sanatmpa1 ! Could you please look at this Issue?

@mohantym mohantym assigned sanatmpa1 and unassigned mohantym Nov 30, 2021
copybara-service bot pushed a commit that referenced this issue Feb 10, 2022
… Python eager context to enable access to the functions defined via tf.function.

There are some (hopefully temporary) shortcuts:
 * protobufs crossing pybind boundaries are serialized, until the Windows builds are fixed, see #53234
 * declarations dependent on MLIR types will use opaque pointers to hide the type information, until we've created a header-only library for them, or until we get rid of header-only libraries
 * when we add MLIR interfaces to pybind, MLIR objects will be serialized similarly with protos, until we've added pybind conversions

PiperOrigin-RevId: 427737336
Change-Id: Ie14b6d85211fd19fee846f0fd5e34f5e48a2a374
copybara-service bot pushed a commit that referenced this issue May 20, 2022
See also: #53234
See also: protocolbuffers/protobuf#9954
See also: #56077

PiperOrigin-RevId: 450054200
tensorflow-jenkins pushed a commit that referenced this issue May 20, 2022
See also: #53234
See also: protocolbuffers/protobuf#9954
See also: #56077

PiperOrigin-RevId: 450054200
mihaimaruseac added a commit that referenced this issue May 20, 2022
See also: #53234
See also: protocolbuffers/protobuf#9954
See also: #56077

PiperOrigin-RevId: 450054200
pranve pushed a commit to pranve/tensorflow that referenced this issue May 21, 2022
pranve pushed a commit to pranve/tensorflow that referenced this issue May 21, 2022
pranve pushed a commit to pranve/tensorflow that referenced this issue May 21, 2022
georgepaw pushed a commit to graphcore/tensorflow that referenced this issue Jul 25, 2022
Summary:
See also: tensorflow/tensorflow#53234
See also: protocolbuffers/protobuf#9954
See also: tensorflow/tensorflow#56077

TF1.15 Only

Reviewers: #tensorflow, #framework_ip_review_-_any_oss_or_third-party_code_use_has_been_approved, vladimirm

Reviewed By: #tensorflow, #framework_ip_review_-_any_oss_or_third-party_code_use_has_been_approved, vladimirm

Maniphest Tasks: T63128

Differential Revision: https://phabricator.sourcevertex.net/D67981
chxin66 pushed a commit to chxin66/tensorflow that referenced this issue Sep 5, 2022
@learning-to-play
Copy link
Collaborator

Adding @vam-google who's looking at protobuf upgrade.

@mihaimaruseac
Copy link
Collaborator

Closing due to 84f4092

@google-ml-butler
Copy link

Are you satisfied with the resolution of your issue?
Yes
No

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.6.0 comp:grappler Grappler related issues type:build/install Build and install issues
Projects
None yet
Development

No branches or pull requests

7 participants