Automatic Parallel Compilation Viability: First Report

Complete Tasks

There were two tasks scheduled to be complete in this stage:

Update the driver to pass a hidden -fsplit-outputs to the compiler, writting down the path to each assembler file generated by the compiler.
Update the compiler to partition the program, and fork on each partition, generating one asm file per partition. The compiler should compile some programs correctly at this point.

Both of these tasks were completed at this point, as we have a version of GCC that bootstraps and have a small speedup when compiling insn-emit.c in parallel (with --enable-checking, 42s sequential w/o partitioning vs. 36s parallel).

First Implementation

The first working implementation consists of two parts:

A partitioner designed to handle the Compilation Unit as if it only have part of the program information, as opposed as the WPA partitioner that have the entire program callgraph. This imposes some restriction on us about our partitioner, that is:
1. Static functions that calls eachother are mapped to the same partition to avoid renaming and promotion to globals.
2. Functions that have references to a same global variable are mapped into the same partition to avoid multiple definitions of this partition.
3. Functions which address are taken partitioned together with whoever took the address and with calls that has the possibility to receive the address as argument.
4. COMDATs should be in the same partition.
A way to apply a ltrans_partition without dumping the partition to disk and reloading it. This is done by looking into the partitioner generated encoder and releasing information about functions that are not in the partition by releasing function bodies, dominator tree, entries at clone tree, and setting the variables accordingly.

This implementation lead to very unbalanced partitions, as it generate one really big partition and several very small partitions. However, the partitioner showed to be very fast in practice, with preliminary complexity of O(|E| * log*(|V|)), which is approximately linear on the number of edges of the graph, as log*(2^64) = 5. Sure the worst case scenario where the callgraph is the complete graph implies |E| = |V|*(|V| - 1)/2, but that looks very synthetic to a input program.

Ongoing Implementation

This implementation is still work on progress, but has showed significant improvements when compared to the first implementation. First we relax the condition 1. by not merging calls that were inlined, and relax condition 2. by not merging functions that references non-static global variables.

We managed to get this version to bootstrap by commenting some assertions at ipa-cp optimization pass, probably because node->local is incorrectly set in some cases, and calling remove_unreachable_nodes after applying the partition to the callgraph, but marking nodes inside the partition to be force_output before entering that function so there is no risk of being removed incorrectly, and unseting force_output afterwards for nodes that wasn't set beforehand.

This methodology yielded a speedup of 2x on a Quad Core Intel Core-i7 8565U CPU when compiling insn-emit.c (24.509s parallel vs. 45.909s sequential), but no speedup at all when compiling gimple-match.c, this file looks really problematic on current methodology because it has 3 entry points and is stuffed with static functions.

We are also exploring the possibility of promoting non-inlined static functions to globals in some cases where it might improve compilation performance, although it may slowdown the produced binary in some architectures, therefore completely removing restriction 1. Our results with that restriction relaxed on gimple-match.c showed a 2.12x improvement over serial compilation (36.184s parallel vs. 1m17s sequential).

Conclusions

So far the work has shows to be promising. Compilation of insn-emit.c is greatly improved with 2x speedup although no improvements to gimple-match.c was observed without relaxing condition 1. Once this new partitioner bootstrap, we will proceed to implementing support for the GNU Jobserver.