Main > Implementing an AOT pipeline for FEX-Emu
Startup time / Stutter is a common problem in many emulators doing Dynamic Binary Translation, and FEX-Emu is no exception.
In fact, the problem is worse than a typical console emulator, as the amount of translated code is usually far bigger, and a general purpose OS such as linux has far more executable code than a console-optimized OS.
While optimizing the translator is important, when there are hundreds of megabytes of code being translated, there’s only one solution: Do the translations before hand, aka Ahead Of Time (AOT) translation.
FEX-Emu has supported “capturing” and “loading” the generated IR for a while now, the next tag will include major improvements in the AOT IR pipeline.
AOT IR vs AOT OBJ
AOT IR is caching our internal Intermediate Representation (IR), but still doing the final IR -> Executable Code generation step. Generating and optimizing the IR from x86 source can be slow, with each fragment taking 0.1 ms on average, with some outliers taking as long as 100ms or more.
To implement an AOT IR scheme, the IR needs to be guest position independent (PIC), as different libraries are loaded at different addresses in different processes, as well as ASLR security measures. However, there are no worries about Executable Code PIC, as the Executable Code is still generated during runtime.
This means that with AOT IR, translation times are still a significant part of our runtime.
AOT OBJ, which caches generated “Object Code”, requires the Object Code to be also PIC, as well as thread sharable. The generated Object Code calls back to FEXCore, for things like syscall handling and further code translations, so it also needs to be linked / relocated on load as well. FEX-Emu currently doesn’t support this, though I’ll be working on it the coming months.
How AOT IR is implemented in FEX-Emu
FEX-Emu identifies binaries by tracking mmap system calls, via Context::AddNamedRegion and Context::RemoveNamedRegion. It then keeps a list of the files that actually contain executable code, and writes their file-id and full path in ~/.fex-emu/aotir/fileid.path.
At a later time, a script (FEXUpdateAOTIRCache) can be used to that calls FEXLoader generate the translations for those binaries. This still requires a launch of the application to collect the used binaries, though in theory one can do that once and distribute the resulting .path files.
FEXLoader has –aotirload and –aotirgenerate options that instruct it to load AOT IR files, or to generate them.
AOT IR files are designed so they can be memory mapped, and contain a sorted index that is binary searched when the JIT translator is looking for IR for a given block.
The entries include a guest code hash, using the xxh3 algorithm, to make sure the guest code actually matches what is included in the IR cache.
AOT IR generation is a bit more involved, as it requires us to scan an elf file for binary code. FEX-Emu currently uses exports and debug symbols as well as Exception Unwind Tables to seed the translation list. It then scans the parts of the file marked as executable for CALLs (0xE8 <32 bit branch target>) and uses some heuristics to further augment the translation list.
From that point on, FEX-Emu will translate iteratively, adding any discovered entrypoints from the frontend to the translation list.
This usually results to around 90% of the executable entrypoints being discovered in most binaries, which is Good Enough For Now (tm).
How does it perform?
Having a pregenerated AOT IR cache cuts clang launch time down to less than 1/3rd, from 1.3 to 0.38 seconds.
AOT IR generation takes ~ 2.5 minutes, but that involves translating all of LLVM and libclang-cpp. This results in 3.4GB of IR stored on the disk.
skmp@mangie:~/projects/FEX/build$ time FEXLoader --abinopf --abilocalflags --no-tsoenabled `which clang`
clang: error: no input files
real 0m1.288s
user 0m1.221s
sys 0m0.067s
skmp@mangie:~/projects/FEX/build$ time FEXUpdateAOTIRCache
Using FEXLoader
Processing clang-10965173196873158944-stLp.path (/usr/lib/llvm-10/bin/clang) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
ld-2.31.so-12644759886748246987-stLp.path has already been generated
Processing libbsd.so.0.10.0-17632796791215918337-stLp.path (/usr/lib/x86_64-linux-gnu/libbsd.so.0.10.0) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libc-2.31.so-7349183479499852091-stLp.path (/usr/lib/x86_64-linux-gnu/libc-2.31.so) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libclang-cpp.so.10-10471326509063107810-stLp.path (/usr/lib/llvm-10/lib/libclang-cpp.so.10) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libdl-2.31.so-17698566885211325079-stLp.path (/usr/lib/x86_64-linux-gnu/libdl-2.31.so) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libedit.so.2.0.63-13979542772503255021-stLp.path (/usr/lib/x86_64-linux-gnu/libedit.so.2.0.63) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libffi.so.7.1.0-12726661787665649382-stLp.path (/usr/lib/x86_64-linux-gnu/libffi.so.7.1.0) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libgcc_s.so.1-9915869280066639001-stLp.path (/usr/lib/x86_64-linux-gnu/libgcc_s.so.1) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libLLVM-10.so.1-14260893221645401968-stLp.path (/usr/lib/x86_64-linux-gnu/libLLVM-10.so.1) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libm-2.31.so-15503997489653830012-stLp.path (/usr/lib/x86_64-linux-gnu/libm-2.31.so) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libpthread-2.31.so-2514442363550555384-stLp.path (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing librt-2.31.so-2450112145572591240-stLp.path (/usr/lib/x86_64-linux-gnu/librt-2.31.so) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libstdc++.so.6.0.28-2517825116486181853-stLp.path (/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libtinfo.so.6.2-1330250129250458281-stLp.path (/usr/lib/x86_64-linux-gnu/libtinfo.so.6.2) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
Processing libz.so.1.2.11-699361181824784349-stLp.path (/usr/lib/x86_64-linux-gnu/libz.so.1.2.11) with --abinopf --abilocalflags --no-tsoenabled --smc=mman
real 2m29.773s
user 2m12.747s
sys 0m9.529s
skmp@mangie:~/projects/FEX/build$ time FEXLoader --aotirload --abinopf --abilocalflags --no-tsoenabled `which clang`
clang: error: no input files
real 0m0.377s
user 0m0.317s
sys 0m0.060s
skmp@mangie:~/projects/FEX/build$ du -hs ~/.fex-emu/aotir
3.4G /home/skmp/.fex-emu/aotir
Further Work
Multithreading the AOT translator, using less memory, and implementing AOT OBJ are the next steps in the AOT journey for FEX-Emu. You can track progress in the AOT planning ticket.