For the third evaluation, we expected to deliver the product as a series of patches for trunk. The patch series were in fact delivered [1], but several items must be fixed before merge.
This report mainly has a lot in common with the second report, as this phase was mostly bug fixing and rebasing.
Overall, the project works and speedups ranges from 0.95x to 3.3x. Bootstrap is working, and therefore this can be used in an experimental state.
git clone --single-branch --branch devel/autopar_devel \ git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
cd gcc_autopar_devel mkdir build && cd build ../configure --disable-bootstrap --enable-languages=c,c++ make -j 8 make DESTDIR=/tmp/gcc11_autopar install
If you want to test whether your version is working, just launch Gcc with -fparallel-jobs=2
when compiling a file with -c.
If you want to compile a project with this version it uses GNU Makefiles, you must modify the compilation rule command and prepend a +
token to it. For example, in Git's Makefile, Change:
$(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
to:
$(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
as well as point the CC variable to the installed gcc, and append a -fparallel-jobs=jobserver
on your CFLAGS variable.
In LTO, the Whole Program Analysis decides how to partition the callgraph for running the LTRANS stage in parallel. This project works very similar to this, however with some changes.
The first was to modify the LTO structure so that it accepts the compilation without IR streaming to files. This avoid an IO overhead when compiling in parallel.
The second was to use a custom partitioner to find which nodes should be in the same partition. This was mainly done to bring COMDAT together, as well as symbols that are part of other symbols, and even private symbols so that we do not output hidden global symbols.
However, experiment showed that bringing private symbols together did not yield a interesting speedup on some large files, and therefore we implemented two modes of partitioning:
Partition without static promotion. This is the safer method to use, as we do not modify symbols in the Compilation Unit. This may lead to speedups in files that have multiple entries points with low connectivity between then (such as insn-emit.c), however this will not provide speedups when this hypothesis is not true (gimple-match.c is an example of this). This is the default mode.
Partition with static promotion to global. This is a more aggressive method, as we can decide to promote some functions to global to increase parallelism opportunity. This also will change the final assembler name of the promoted function to avoid collision with functions of others Compilation Units. To use this mode, the user has to manually specify --param=promote-statics=1, as they must be aware of this.
Currently, partitioner 2. do not account the number of nodes to be promoted. Implementing this certainly will reduce impact on produced code.
We implemented a interface to communicate with the GNU Make's Jobserver that is able to detect when the GNU Make Jobserver is active, thanks to Nathan Sidwell. This works as follows:
When -fparallel-jobs=jobserver is provided, GCC will try to detect if there is a running Jobserver in which we can communicate to. If true, we return the token that Make originally gave to us, then we wait for make for a new token that, when provided, will launch a forked child process with the partition information; or fall back to default compilation if Jobserver is not detected with a warning. Now if -fparallel-jobs=auto, this warning will not be provided and the default number of jobs will be set to 2, unless a single core CPU is detected. For more information about the implementation, check gcc/jobserver.cc file in the autopar_devel branch.
Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U, and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376. As a comparison, the theoretical maximum speedup by parallelizing this part of compilation is 4x, so there are things that can be improved here.
Results are shown in the following table. There are three columns:
The test was the result of a single execution with a previous warm up execution. The compiled GCC had checking enabled, and therefore release version might have better timings in both sequential and parallel, but the speedup may remain the same.
CPU: Intel Core-i7 8565U (4 Cores, 8 Threads) | | | Without Static | With Static | | LTO Emulated | | File | Sequential | Promotion | Promotion | Speedup | Parallelism | |----------------|------------|----------------|-----------------------|--------------| | gimple-match.c | 60s | 63s | 34s | 1.7x | 32s | | insn-emit.c | 37s | 19s | 20s | 1.9x | 20s |
CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads) | | | Without Static | With Static | | LTO Emulated | | File | Sequential | Promotion | Promotion | Speedup | Parallelism | |----------------|------------|----------------|-----------------------|--------------| | gimple-match.c | 2m42s | 2m43s | 49s | 3.3x | 46s | | insn-emit.c | 1m35s | 30s | 30s | 3.1x | 32s |
There was also a reduction from 35m32s to 29m30s when bootstrapping GCC when -fparallel-jobs=8 on the Opteron machine.
This project can still be improved on a lot of places. For example, just by looking at the table above, we can conclude that:
Cosmetic:
[1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html