
PGO build broken on Windows #111786

Open
mdboom opened this issue Nov 6, 2023 · 15 comments
Assignees
Labels
build (The build process and cross-build), OS-windows, type-bug (An unexpected behavior, bug, or error)

Comments

@mdboom
Contributor
mdboom commented Nov 6, 2023

Bug report

Bug description:

Building with --pgo on Windows seems to be broken on main (853b4b5).

My hunch, since this is failing in ceval.c, is that the recent merge of the Tier 1 and Tier 2 interpreters may have made the interpreter loop function too large for MSVC to handle.

Build command:

PCbuild\build.bat --pgo -c Release

Error:

Merging C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\amd64\python313!1.pgc
  C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\amd64\python313!1.pgc: Used 25.9% (15643384 / 60370944) of total space reserved.  0.0% of the counts were dropped due to overflow.
    Reading PGD file 1: C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\amd64\python313.pgd
     Creating library C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\amd64\python313.lib and object C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\amd64\python313.exp
  Generating code
  
  0 of 0 ( 0.0%) original invalid call sites were matched.
  0 new call sites were added.
  257 of 14721 (  1.75%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
C:\actions-runner\_work\benchmarking\benchmarking\cpython\Python\ceval.c(669): fatal error C1001: Internal compiler error. [C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pythoncore.vcxproj]
  (compiler file 'D:\a\_work\1\s\src\vctools\Compiler\Utc\src\p2\main.c', line 224)
   To work around this problem, try simplifying or changing the program near the locations listed above.
  If possible please provide a repro here: https://developercommunity.visualstudio.com/ 
  Please choose the Technical Support command on the Visual C++ 
   Help menu, or open the Technical Support help file for more information
    link!InvokeCompilerPass()+0x10e636
    link!InvokeCompilerPass()+0x10e636
    link!InvokeCompilerPass()+0x10e373
    link!InvokeCompilerPass()+0x10b310
    link!InvokeCompilerPass()+0x10b215
    link!InvokeCompilerPass()+0x102cea
  

Build FAILED.

Complete build log

We don't currently have a buildbot that builds with --pgo on Windows -- probably something to fix as well.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Windows

Linked issue

Microsoft Developer Community MSVC bug report

Linked PRs

@mdboom mdboom added the type-bug, OS-windows, and build labels Nov 6, 2023
@vstinner
Member
vstinner commented Nov 6, 2023

fatal error C1001: Internal compiler error.

That sounds like a MSVC bug.

To work around this problem, try simplifying or changing the program near the locations listed above.
If possible please provide a repro here: https://developercommunity.visualstudio.com/

Can you try to report the bug to MSVC support?

@zooba
Member
zooba commented Nov 6, 2023

Yeah, we'll need a link repro for this, but it should be easy enough to get them to diagnose it. Reporting through dev community is still the best option, even if you work there 😉

Hopefully there's a resolution that doesn't require a specific compiler version. Our infrastructure needs a revamp to handle that situation...

@mdboom mdboom self-assigned this Nov 6, 2023
@mdboom
Contributor Author
mdboom commented Nov 6, 2023

My plan is to explore configuration/pragma options to see if anything can be worked around (and benchmark the effect of that so we know what the performance impact of not doing certain optimizations on certain functions is). Barring that we may need to revert or redesign #111428.

Reporting the bug to MSVC also makes sense, and I plan to do that. But I don't think we can count on it to solve our issue -- there will always be a need to compile on a range of versions for a variety of reasons, including what @zooba already said.

@mdboom
Contributor Author
mdboom commented Nov 6, 2023

Reading the description of error C1001, it's clear this is an error in the optimizer. This led me to a description of MSVC's optimization pragmas as a way to control the optimization of just this one very large function (rather than spending time on anything more global).

The two main optimization flags are size ("s") and time ("t"). If you disable both, you create a new problem: the stack frames get larger, and the interpreter exceeds the C stack space during some unit tests (test_pickle being one of them). If you disable only size, the internal compiler error remains. However, if you disable only time ("t"), it successfully compiles.

It would seem that disabling "optimize for speed" would make things slower; however, running on our benchmarking machine, it is about 3% faster (by the HPT method, which is more stable between runs) or 6% faster by the geometric mean method.

So it seems like disabling the "t" optimization for just _PyEval_EvalFrameDefault has no downside.

I still plan to report this issue to MSVC. When there is a fix in MSVC, we can add an #ifdef to remove this workaround for that newer version of MSVC, assuming it helps performance.

@zooba
Member
zooba commented Nov 6, 2023

It would seem that disabling "optimize for speed" would make things slower; however, running on our benchmarking machine, it is about 3% faster

Large functions are often faster when optimised for size, as the instruction cache pressure tends to outweigh anything else.

That said, I'm fairly sure that PGO ignores all of those options and uses its own data, and generally it improves upon the /O options by 10-20%, if not more. So it may not be a suitable choice.

I've also found in the past that many of our optimiser crashes have been due to mishandled types between object files (e.g. a function argument is defined as int in one but unsigned int in the other). It might be worth looking for anywhere new that kind of thing may have happened - the compiler should never crash, but it's potentially crashing because of an error in our code that we'd want to fix anyway.

@gvanrossum
Member

@zooba

That said, I'm fairly sure that PGO ignores all of those options and uses its own data, and generally it improves upon the /O options by 10-20%, if not more. So it may not be a suitable choice.

How do you reconcile that with @mdboom's actual benchmark results? IIUC the benchmark suite is run with PGO. And I just tried build.bat --pgo with and without the pragma from Mike's PR, and the compiler error disappears when I add the pragma. So it has some effect.

@mdboom
Contributor Author
mdboom commented Nov 7, 2023

The #pragma definitely has effect on --pgo builds (and the error only ever occurs there anyway, without the #pragma). I suppose if we wanted to be more surgical, we could only set the #pragma if we know we are in a PGO build (if that's possible).

I've also found in the past that many of our optimizer crashes have been due to mishandled types between object files (e.g. a function argument is defined as int in one but unsigned int in the other). Might be worth looking for anywhere new that kind of thing may have happened - the compiler should never crash, but it's potentially crashing because of an error in our code that we'd want to fix anyway.

I spent a little time looking at this -- I didn't see any function signatures change in #111428 that might explain this. I think the more likely culprit is that the already large _PyEval_EvalFrameDefault more than doubled in size.

@zooba
Member
zooba commented Nov 7, 2023

How do you reconcile that with @mdboom's actual benchmark results?

I guess it doesn't ignore the settings? 🤷‍♂️ Perhaps only for pragmas, but since it doesn't fail during non-PGO builds, clearly PGO is still running in that section of code.

I suppose if we wanted to be more surgical, we could only set the #pragma if we know we are in a PGO build (if that's possible).

We don't, but we could set a preprocessor variable specific to PGO builds (maybe the compiler sets one anyway?). I'd be more inclined to see how it goes only setting the pragma for non-PGO builds though, and let the profiling decide how to handle the large function. At least theoretically, that ought to be the way it works best, but theoretically the compiler shouldn't crash, so we're well into "make it work with the reality we have right now" territory anyway.

@mdboom
Contributor Author
mdboom commented Nov 7, 2023

MSVC issue

gvanrossum pushed a commit that referenced this issue Nov 9, 2023
…r PGO (#111794)

In PGO mode, this function caused a compiler error in MSVC.
It turns out that optimizing for space only saves the day, and is even faster.
However, without PGO, this is neither necessary nor slower.
@mdboom mdboom reopened this Nov 14, 2023
@zooba
Member
zooba commented Nov 15, 2023

Judging by the timestamp, Mike's reopened this because I asked, so we can set up a better system for handling the workaround. Specifically, I want to have a build flag/property to disable all workarounds so that we can have a test run (probably privately) against the latest compilers to track when they get fixed.

I'm going to be offline for a couple of weeks, but will get to implementing or reviewing it after that.

@neonene
Contributor
neonene commented Nov 17, 2023

Without this PGO workaround, 3.13's EXTRA_CASES (version-specific) seems to require the opcode variable to be declared as uint8_t, rather than cast. The C1001 error can be seen even before the tiers were mixed, at 7e135a4, if opcode is of type int.

The error went away when I tried giving tier 1 and tier 2 their own opcode variables (uint8_t and int). An alternative was to replace EXTRA_CASES with default:, leaving the int opcode as-is. However, #pragma optimize("t", off) was realistically needed to speed them up after all.

@mdboom
Contributor Author
mdboom commented Nov 17, 2023

Thanks, @neonene. I will take some of these ideas, measure them, and file a PR.

@mdboom
Contributor Author
mdboom commented Nov 20, 2023

I measured this a few different ways and compared

  • The base for all of these is a recent main (2dbb2e0) which has the pragma workaround that turns off "optimize for speed" on _PyEval_EvalFrameDefault
  • sep-opcode uses a uint8_t for Tier 1 opcodes and a separate uint16_t for Tier 2 opcodes
  • sep-opcode_pragma_off does the above, but also removes the #pragma (since it's no longer needed to prevent the compiler crash)
  • Os-global sets /Os for all code during the PGInstrument phase as @neonene suggested as a workaround
  commit                           change
  sep-opcode (48b46e7)             1.03x faster (100% rel.)
  sep-opcode_pragma-off (c47be17)  1.00x faster (85% rel.)
  Os-global (04300ec)              1.00x slower (89% rel.)

So it seems like the best thing for performance is to separate the opcode variables, leaving everything else the same.

mdboom added a commit to mdboom/cpython that referenced this issue Nov 20, 2023
Suggested by @neonene: python#111786 (comment)

This makes Windows about 3% faster on pyperformance benchmarks.
mdboom added a commit to mdboom/cpython that referenced this issue Nov 20, 2023
Suggested by @neonene: python#111786 (comment)

This makes Windows about 3% faster on pyperformance benchmarks.
mdboom added a commit to mdboom/cpython that referenced this issue Nov 20, 2023
This makes Windows about 3% faster on pyperformance benchmarks.
gvanrossum pushed a commit that referenced this issue Nov 20, 2023
This makes Windows about 3% faster on pyperformance benchmarks.
@zooba
Member
zooba commented Nov 29, 2023

@mdboom Does your merged fix address this issue entirely? No point adding compatibility infrastructure if we've got a good workaround.

@neonene
Contributor
neonene commented Dec 4, 2023

It appears that the Windows PGO builds in the faster-cpython/benchmarking-public repo have slowed down with Tier 2 off, after the generated cases were sorted (1619f43).

If the switch cases cannot be rearranged as a workaround, a similar performance issue closed with low priority in MSVC's bug tracker would be worth checking; there they suggested the ssa-pre- option for cl or link, which recovers performance to some degree.

The hidden option worked for me in pythoncore.vcxproj with MSVC v.1936 and v.1938, specified at least at the PGUpdate phase. The optimize("t", off) pragma in ceval.c seems to work almost independently.

MSVC v.1929 (VS2019) is fine with/without this option.

aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…SVC for PGO (python#111794)

In PGO mode, this function caused a compiler error in MSVC.
It turns out that optimizing for space only saves the day, and is even faster.
However, without PGO, this is neither necessary nor slower.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…on#112289)

This makes Windows about 3% faster on pyperformance benchmarks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
6 participants
@mdboom @vstinner @zooba @gvanrossum @neonene and others