How Profile-Guided Optimization (PGO) works

Profile-guided optimization (also known as PGO, or "pogo") is a way of further optimizing an already-optimized build of your game, using information about how the game behaves when it is played in the real world. With that information, infrequently run code, such as error handling or edge cases, can be moved off your code's critical execution paths, speeding them up.

Figure 1. An overview of how PGO works.

To use PGO, you first instrument your build so that it generates profile data the compiler can work with. Then you exercise your code by running that build, producing one or more profile data files. Finally, you copy those files back from the device and feed them to the compiler, which uses the captured profile information to optimize your executable.

How optimized builds without PGO work

A build that is optimized without using profile data uses a number of heuristics when deciding how to generate optimized code.

Some heuristics are signaled explicitly by the developer - for example, in C++20 or later, by using branch-direction hints such as [[likely]] and [[unlikely]]. Another example is the inline keyword, or even __forceinline (although in general, it's better and more flexible to stick with the former). By default, some compilers assume that the first leg of a branch (that is, the if part, not the else part) is the more likely one. The optimizer may also make assumptions about how the code will execute based on static analysis - but this is usually limited in scope.
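For example, here is a minimal sketch of the kind of manual markup a developer might add; the function and its logic are hypothetical and are used only to illustrate the hints:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical per-frame update. The rare error path is annotated as
// [[unlikely]] so the compiler keeps the common path tight - this is the kind
// of hint PGO can later supply automatically, backed by real profile data.
void UpdateEntities(std::vector<float>& healths) {
    for (float& health : healths) {
        if (health <= 0.0f) [[unlikely]] {
            // Edge case: rarely taken in practice.
            std::puts("entity is already dead");
            continue;
        }
        // Hot path: assumed (and, with PGO, measured) to run most of the time.
        health -= 0.1f;
    }
}
```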

The problem with these heuristics is that they can't correctly help the compiler in all situations - even with exhaustive manual markup - so while the code that is generated is typically well-optimized, it's not as good as it could be if the compiler had more information about its behavior at runtime.

Generating a profile

When your executable is built with PGO enabled in instrumented mode, it is augmented with extra code at the beginning of every code block – for example, the beginning of a function, or the beginning of each arm of a branch. This code counts how many times each block is entered at runtime; the compiler can use those counts later to generate optimized code.

Some other tracking is also performed – for example, the size of typical copy operations in a block, so that fast, inlined versions of the operation can be generated later.
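Conceptually, the effect is as if the compiler inserted counters like the ones below. This is only an illustrative sketch - the real instrumentation is emitted by the compiler into its own profile data structures, not written by hand, and the names here are hypothetical:

```cpp
// Illustrative only: a hand-written approximation of block counters.
struct Player { float health = 100.0f; };

static unsigned long long g_count_take_damage_entry = 0;  // function entry
static unsigned long long g_count_take_damage_dead  = 0;  // "already dead" arm

void TakeDamage(Player& player, float amount) {
    ++g_count_take_damage_entry;       // block entered: function body
    if (player.health <= 0.0f) {
        ++g_count_take_damage_dead;    // block entered: this branch arm
        return;
    }
    player.health -= amount;
}
```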

After the game has performed some kind of representative work, the executable must call a function – __llvm_profile_write_file() – to write out the profile data to a customizable location on the device. This function is linked into your game automatically when your build configuration has PGO instrumentation enabled.
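As a minimal sketch, a game might trigger this as follows. The OnSceneEnd() hook and the file path are assumptions made for illustration; only the __llvm_profile_* functions come from the compiler's profiling runtime:

```cpp
#include <cstdio>

// Provided by the compiler's profiling runtime when PGO instrumentation is
// enabled; declared here so no extra header is needed.
extern "C" int  __llvm_profile_write_file(void);
extern "C" void __llvm_profile_set_filename(const char* name);

// Hypothetical hook called when the current game scene ends.
void OnSceneEnd() {
    // Choose a location your app can write to on the device; this path is an
    // example only - use your game's own writable data directory.
    __llvm_profile_set_filename("/data/data/com.example.game/files/scene.profraw");
    if (__llvm_profile_write_file() != 0) {
        std::puts("Failed to write PGO profile data");
    }
}
```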

The written profile data file should then be copied back to the host computer, and preferably kept in a location along with other profiles from the same build so that they can be used together.

For example, you can modify your game code to call __llvm_profile_write_file() when the current game scene ends. Then, to take a profile, you build your game with instrumentation turned on and deploy it to your Android device. While it is running, profile data is automatically captured – your QA engineer runs through the game, exercising different scenarios (or just goes about their normal test pass).

When you are done exercising different parts of your game, you can return to the main menu, which would end the current game scene and write out the profile data.

A script can then be used to copy the profile data off the test device and upload it to a central repository, where it can be stored for later use.

Merging profile data

Once a profile has been obtained from a device, the profile data file generated by the instrumented build needs to be converted into a form that the compiler can consume. AGDE does this for you automatically for any profile data files that you add to your project.

PGO is designed to combine the results of multiple instrumented profile runs – AGDE also does this for you automatically if you have multiple profile data files in a single project.

As an example of how merging profile data sets can be useful, let's say you have a lab full of QA engineers, all playing different levels of your game. Each playthrough is recorded and then used to generate profile data from a PGO-instrumented build of your game. Merging profiles lets you combine the results from all of these different test runs - which may execute wildly different parts of your code - to give better results.

Even better, when performing longitudinal tests, where you keep copies of profile data from internal release to internal release, rebuilding doesn't necessarily invalidate old profile data. For the most part, code is relatively stable from release to release, so profile data from older builds can still be useful and doesn't go stale immediately.

Generating Profile-Guided Optimized builds

Once you have added the profile data to your project, you can use it to build your executable by enabling PGO in Optimization mode in your build configuration.

This directs the compiler's optimizer to use the profile data you captured earlier when making optimization decisions.

When to use Profile-Guided Optimization

PGO is not intended to be something you enable at the beginning of development, or during day-to-day iteration on code. During development, you should focus on algorithmic and data-layout-based optimizations, as they'll give you much larger benefits.

PGO comes later in the development process, when you're polishing for release. Think of Profile-Guided Optimization as the cherry on top that lets you squeeze the last bit of performance out of your game after you've already spent time optimizing the code yourself.

Expected performance improvement with PGO

The improvement depends on a large number of factors, including how comprehensive and how up-to-date your profiles are, and how close to optimal your code would have been with a traditional optimized build.

In general, a very conservative estimate would be that CPU costs in key threads will drop by ~5%. You may see different results.

Instrumentation overhead

PGO's instrumentation is comprehensive, and while it is automatically generated, it's not free. The overhead of PGO instrumentation may vary depending on your codebase.

Performance cost of Profile-Guided Instrumentation

You might see a drop in frame-rate with instrumented builds. In some cases – depending on how close to 100% utilized your CPU is during normal operation – this drop might be so large as to make normal gameplay difficult.

We recommend that most developers build out a semi-deterministic replay mode for their game. This kind of functionality lets your QA team start the game at a known, repeatable starting location (such as a save game or a specific test level) and then record their input. The input recorded from the test build can then be fed into a PGO-instrumented build and played back to generate real-world profile data, regardless of how long it takes to process an individual frame – even if the game runs so slowly that it is unplayable.

This kind of functionality also has other major benefits, such as multiplying tester effort: one tester can record their input on a device, and then it can be played back across multiple different types of devices for smoke testing purposes.

A replay system like this can have huge benefits on Android, where there are a large number of device variants in the ecosystem – and the benefits don't end there: it can form a core part of your continuous integration build system too, allowing you to perform regular overnight performance-regression and smoke tests.

The recording should capture user input at the most appropriate point within your game's input mechanism (likely not direct touchscreen events, but their consequences, recorded as commands). Each recorded input should also carry a frame count that ticks up monotonically during gameplay, so that during playback the replay mechanism can wait for the appropriate frame on which it should trigger the event.
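As a rough sketch of what such a recording might contain - all of the type and field names here are hypothetical - a recorded command and its playback step could look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical command captured after the game has interpreted the player's
// input, rather than as a raw touchscreen event.
enum class CommandType : uint8_t { Move, Jump, Attack, Pause };

struct RecordedCommand {
    uint64_t    frame;  // monotonically increasing gameplay frame count
    CommandType type;   // what the input meant to the game
    float       x, y;   // command-specific payload (for example, a direction)
};

// During playback, apply every command whose frame matches the current frame.
void PlaybackStep(const std::vector<RecordedCommand>& recording,
                  std::size_t& nextCommand, uint64_t currentFrame,
                  void (*apply)(const RecordedCommand&)) {
    while (nextCommand < recording.size() &&
           recording[nextCommand].frame == currentFrame) {
        apply(recording[nextCommand]);
        ++nextCommand;
    }
}
```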

In playback mode, your game should avoid online sign-in, shouldn't show ads, and should operate at a fixed timestep (at your target frame-rate). You should consider disabling vsync.
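For example, a playback loop running at a fixed timestep might look like the following minimal sketch; the hook functions and their bodies are placeholders, not part of any real API:

```cpp
#include <cstdint>
#include <cstdio>

// Placeholder hooks into your game - the names and bodies are illustrative only.
void SimulateOneFrame(uint64_t frame, double deltaSeconds) {
    std::printf("frame %llu advanced by %.4f s\n",
                static_cast<unsigned long long>(frame), deltaSeconds);
}
bool PlaybackFinished(uint64_t frame) { return frame >= 1000; }

// Advance the simulation by a fixed step each frame (a 30 fps target here),
// regardless of how long the instrumented build takes to process each frame.
void RunPlayback() {
    constexpr double kFixedDeltaSeconds = 1.0 / 30.0;
    for (uint64_t frame = 0; !PlaybackFinished(frame); ++frame) {
        SimulateOneFrame(frame, kFixedDeltaSeconds);
    }
}
```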

It's not important that everything (for example, particle systems) in your game is perfectly deterministically repeatable, but the same actions should deliver the same in-game consequences and results – that is, gameplay should be the same.

Memory cost of Profile-Guided Instrumentation

The memory overhead of PGO instrumentation varies a lot based on the specific library being compiled. In our tests, we saw an overall increase in the size of the test executable of ~2.2x. This increase included both the extra code needed to instrument the code blocks and the space needed to store the counters. These tests were not exhaustive, and your experience may differ.

When to update or discard your profile data

You should update your profiles whenever you make a large change to your code (or game content).

What this means precisely depends on your build environment, and where you are in development.

As mentioned before, you shouldn't carry profile data across major build-environment changes; while doing so won't prevent you from building or break your build, it will reduce the performance benefit of using PGO, as very little of the profile data will be applicable to the new build environment. However, this isn't the only case where your profile data might become stale.

Let's start by assuming that you won't use PGO until you're nearing the end of development and preparing for a release, beyond perhaps gathering a weekly capture so that performance-focused engineers can verify that there won't be any unexpected hiccups closer to release.

This changes as you approach your release window, when your QA team is testing every day and running through the game exhaustively. During this phase you can generate profiles from that data daily, and use them to inform future builds for performance testing and for adjusting your own performance budgets.

When you are preparing for a release, you should lock the build version that you're planning to release, and then have QA run through it to generate your new profile data. You then build using this data to produce a final version of your executable.

QA can then give this optimized, shipping build a final run-through to ensure that it is good to release.