| # libCODY: COmpiler DYnamism<sup><a href="#1">1</a></sup> |
| |
| Copyright (C) 2020 Nathan Sidwell, nathan@acm.org |
| |
| libCODY is an implementation of a communication protocol between |
| compilers and build systems. |
| |
| **WARNING:** This is preliminary software. |
| |
| In addition to supporting C++ modules, this may also support LTO |
| requirements and could also deal with generated `#include` files |
| and feed the compiler with pre-pruned include paths and whatnot. (The |
| system calls involved in include searches can be quite expensive on |
| some build infrastructures.) |
| |
| * Client and Server objects |
| * Direct connection for in-process use |
| * Testing with Joust (that means nothing to you, doesn't it!) |
| |
| |
| ## Problem Being Solved |
| |
| The origin is in C++20 modules: |
| ``` |
| import foo; |
| ``` |
| |
| At that import, the compiler needs<sup><a href="#2">2</a></sup> to |
| load up the compiled serialization of module `foo`. Where is that |
| file? Does it even exist? Unless the build system already knows the |
| dependency graph, this might be a completely unknown module. Now, the |
| build system knows how to build things, but it might not have complete |
| information about the dependencies. The ultimate source of |
| dependencies is the source code being compiled, and specifying the |
| same thing in multiple places is a recipe for build skew. |
| |
| Hence, a protocol by which a compiler can query a build system. This |
| was originally described in <a |
| href="https://wg21.link/p1184r1">p1184r1:A Module Mapper</a>, along |
| with a proof-of-concept hack in GNU Make, described in <a |
| href="https://wg21.link/p1602">p1602:Make Me A Module</a>. The current |
| implementation has evolved and an update to p1184 will be forthcoming. |
| |
| ## Packet Encoding |
| |
| The protocol is turn-based. The compiler sends a block of one or more |
| requests to the builder, then waits for a block of responses to all of |
| those requests. If the builder needs to compile something to satisfy |
| a request, there may be some time before the response. A builder may |
| service multiple compilers concurrently, each as a separate connection. |
| |
| When multiple requests are in a block, the responses are also in a |
| block, and in corresponding order. The responses must not be |
| commenced eagerly -- they must wait until the incoming block has ended |
| (as mentioned above, it is turn-based). To do otherwise risks |
| deadlock, as there is no requirement for a sending end of the |
| communication to listen for incoming responses (or new requests) until |
| it has completed sending its current block. |
| |
| Every request has a response. |
| |
| Requests and responses are user-readable text. The protocol is not |
| intended as a transmission medium for large binary objects (such as |
| compiled modules). It is presumed the builder and the compiler share a |
| file system for that kind of thing.<sup><a href="#3">3</a></sup> |
| |
| Message characters are encoded in UTF-8. |
| |
| Messages are a sequence of octets ending with a NEWLINE (0xa). The lines |
| consist of a sequence of words, separated by WHITESPACE (0x20 or 0x9). |
| Words themselves do not contain WHITESPACE. Lines consisting solely |
| of WHITESPACE (or empty) are ignored. |
| |
| To encode a block of multiple messages, non-final messages end with a |
| single word of SEMICOLON (0x3b), immediately before the NEWLINE. Thus |
| a serial connection can determine whether a block is complete without |
| decoding the messages. |
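
As a minimal sketch of that check, the function below looks only at the tail of a raw line. `ContinuesBlock` is a hypothetical name, not part of libcody. Treating a trailing `foo;` as an ordinary word rather than a continuation is this sketch's reading of the rules, not something the text above states outright.

```cpp
#include <string>

// Returns true when `line` (one message, with or without its trailing
// NEWLINE) ends in a lone `;` word, meaning more messages follow.
bool ContinuesBlock (std::string const &line)
{
  std::size_t end = line.size ();

  // Drop the terminating NEWLINE, if present.
  if (end && line[end - 1] == '\n')
    --end;

  // The continuation marker must be the final character ...
  if (!end || line[end - 1] != ';')
    return false;

  // ... and a word on its own: WHITESPACE (or nothing) precedes it.
  return end == 1 || line[end - 2] == ' ' || line[end - 2] == '\t';
}
```

A reader on a serial connection could keep appending lines to a buffer until `ContinuesBlock` returns false, and only then hand the whole block to the message decoder.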
| |
| Words consisting solely of characters in the set [-+_/%.A-Za-z0-9] need |
| not be quoted. Words containing characters outside that set should be |
| quoted. A zero-length word may be achieved with `''`. |
| |
| Quoted words begin and end with APOSTROPHE (0x27). Within the quoted |
| word, BACKSLASH (0x5c) is used as an escape mechanism, with the |
| following meanings: |
| |
| * \\n - NEWLINE (0xa) |
| * \\t - TAB (0x9) |
| * \\' - APOSTROPHE (') |
| * \\\\ - BACKSLASH (\\) |
| |
| Characters in the range [0x00, 0x20) and 0x7f are encoded as BACKSLASH |
| followed by one or two lowercase hex characters. Octets in the range |
| [0x80,0xff) are UTF-8 encodings of Unicode characters outside the |
| traditional ASCII set and are passed as such. |
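
The encoding rules can be sketched as a single function. `EncodeWord` is a hypothetical helper, not libcody's own encoder. It always emits two hex digits for control characters, which is a choice of this sketch: the rules allow one digit, but two avoid ambiguity when the next character happens to look like a hex digit.

```cpp
#include <string>

// Encode one word for transmission, following the quoting rules above.
std::string EncodeWord (std::string const &word)
{
  auto is_plain = [] (unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
           || (c >= '0' && c <= '9') || c == '-' || c == '+'
           || c == '_' || c == '/' || c == '%' || c == '.';
  };

  // A zero-length word must be quoted, as must any word containing a
  // character outside the unquoted set.
  bool quote = word.empty ();
  for (unsigned char c : word)
    if (!is_plain (c))
      quote = true;
  if (!quote)
    return word;

  static char const hex[] = "0123456789abcdef";
  std::string out = "'";
  for (unsigned char c : word)
    switch (c)
      {
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;
      case '\'': out += "\\'"; break;
      case '\\': out += "\\\\"; break;
      default:
        if (c < 0x20 || c == 0x7f)
          {
            // Control character: BACKSLASH plus two lowercase hex digits.
            out += '\\';
            out += hex[c >> 4];
            out += hex[c & 0xf];
          }
        else
          out += char (c);  // Printable ASCII and UTF-8 octets pass through.
      }
  out += "'";
  return out;
}
```

For instance, `EncodeWord ("foo.cc")` leaves the word untouched, while `EncodeWord ("a b")` produces the quoted form `'a b'`.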
| |
| Decoding should be more relaxed. Unquoted words containing characters |
| in the range [0x20,0xff] other than BACKSLASH or APOSTROPHE should be |
| accepted. In a quoted sequence, `\` followed by one or two lowercase |
| hex characters decodes to that octet. Further, words can be |
| constructed from a mixture of abutted quoted and unquoted sequences. |
| For instance `FOO' 'bar` would decode to the word `FOO bar`. |
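
A relaxed decoder along these lines might look like the sketch below. `DecodeLine` and `HexValue` are hypothetical names, and the sketch makes two assumptions of its own: it rejects an unquoted BACKSLASH (the rules do not promise to accept one), and it leaves the block continuation marker to the caller.

```cpp
#include <string>
#include <vector>

// Value of a lowercase hex digit, or -1 if `c` is not one.
static int HexValue (char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  return -1;
}

// Split one message line (without its NEWLINE) into decoded words.
// Returns false on a malformed line.
bool DecodeLine (std::string const &line, std::vector<std::string> &words)
{
  std::size_t i = 0, n = line.size ();
  while (i < n)
    {
      if (line[i] == ' ' || line[i] == '\t')
        {
          ++i;  // WHITESPACE separates words.
          continue;
        }
      std::string word;
      while (i < n && line[i] != ' ' && line[i] != '\t')
        {
          if (line[i] == '\\')
            return false;  // Unquoted BACKSLASH: rejected by this sketch.
          if (line[i] != '\'')
            {
              word += line[i++];  // Unquoted octet, taken as-is.
              continue;
            }
          // Quoted sequence: runs to the closing APOSTROPHE.
          for (++i; ; ++i)
            {
              if (i == n)
                return false;  // Unterminated quote.
              char c = line[i];
              if (c == '\'')
                break;
              if (c != '\\')
                {
                  word += c;
                  continue;
                }
              if (++i == n)
                return false;  // Dangling BACKSLASH.
              c = line[i];
              if (c == 'n')
                word += '\n';
              else if (c == 't')
                word += '\t';
              else if (c == '\'' || c == '\\')
                word += c;
              else if (HexValue (c) >= 0)
                {
                  // One or two lowercase hex digits form a single octet.
                  int v = HexValue (c);
                  if (i + 1 < n && HexValue (line[i + 1]) >= 0)
                    v = v * 16 + HexValue (line[++i]);
                  word += char (v);
                }
              else
                return false;  // Unknown escape.
            }
          ++i;  // Step over the closing APOSTROPHE.
        }
      words.push_back (word);
    }
  return true;
}
```

Because quoted and unquoted sequences are appended to the same `word` until WHITESPACE is reached, abutted sequences such as `FOO' 'bar` fall out naturally as one word.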
| |
| Notice that the block continuation marker of `;` is not a valid |
| encoding of the word `;`, which would be `';'`. |
| |
| It is recommended that words are separated by single SPACE characters. |
| |
| ## Messages |
| |
| The message descriptions use `$metavariable` examples. |
| |
| The request messages are specific to a particular action. The response |
| messages are more generic, describing their value types, but not their |
| meaning. Message consumers need to know the request to decode them. |
| Notice the `Packet::GetRequest()` method records in response packets |
| what the request being responded to was. Do not confuse this with the |
| `Packet::GetCode ()` method. |
| |
| ### Responses |
| |
| The simplest response is a single: |
| |
| `OK` |
| |
| This indicates the request was successful. |
| |
| |
| An error response is: |
| |
| `ERROR $message` |
| |
| The message is a human-readable string. It indicates failure of the request. |
| |
| Pathnames are encoded with: |
| |
| `PATHNAME $pathname` |
| |
| Boolean responses use: |
| |
| `BOOL `(`TRUE`|`FALSE`) |
| |
| ### Handshake Request |
| |
| The first message is a handshake: |
| |
| `HELLO $version $compiler $ident` |
| |
| The `$version` is a numeric value, currently `1`. `$compiler` identifies |
| the compiler — builders may need to keep compiled modules from |
| different compilers separate. `$ident` is an identifier the builder |
| might use to identify the compilation it is communicating with. |
| |
| Responses are: |
| |
| `HELLO $version $builder [$flags]` |
| |
| A successful handshake. The communication is now connected and other |
| messages may be exchanged. An ERROR response indicates an unsuccessful |
| handshake. The communication remains unconnected. |
| |
| There is nothing restricting a handshake to its own message block. Of |
| course, if the handshake fails, subsequent non-handshake messages in |
| the block will fail (producing error responses). |
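
To illustrate a handshake sharing a block with other requests, here is a minimal sketch that joins already-encoded messages using the `;` continuation word. `MakeBlock` is a hypothetical helper, and the compiler name and ident in the usage note are made-up values.

```cpp
#include <string>
#include <vector>

// Join already-encoded messages into one block: every message but the
// last ends with a lone `;` word before its NEWLINE.
std::string MakeBlock (std::vector<std::string> const &messages)
{
  std::string block;
  for (std::size_t i = 0; i != messages.size (); ++i)
    {
      block += messages[i];
      if (i + 1 != messages.size ())
        block += " ;";
      block += '\n';
    }
  return block;
}
```

Calling `MakeBlock ({"HELLO 1 GCC main.cc", "MODULE-REPO"})` yields two lines, the first ending in ` ;`. The builder would answer with a block of two responses in the same order, and both would be error responses if the handshake failed.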
| |
| The `$flags` word, if present, allows a server to control what requests |
| might be given. See below. |
| |
| ### C++ Module Requests |
| |
| A set of requests are specific to C++ modules: |
| |
| #### Flags |
| |
| Several requests and one response have an optional `$flags` word. |
| This is the `Cody::Flags` value pertaining to that request. If |
| omitted, the value 0 is implied. The following flags are available: |
| |
| * `0`, `None`: No flags. |
| |
| * `1<<0`, `NameOnly`: The request is for the name only, and not the |
| CMI contents. |
| |
| The `NameOnly` flag may be provided in a handshake response, and |
| indicates that the server is interested in requests only for their |
| implied dependency information. It may be provided on a request to |
| indicate that only the CMI name is required, not its contents (for |
| instance, when preprocessing). Note that a compiler may still make |
| `NameOnly` requests even if the server did not ask for such. |
| |
| #### Repository |
| |
| All relative CMI file names are relative to a repository. (There are |
| usually no absolute CMI file names.) The repository may be determined |
| with: |
| |
| `MODULE-REPO` |
| |
| A PATHNAME response is expected. The `$pathname` may be an empty |
| word, which is equivalent to `.`. When the response is a relative |
| pathname, it must be relative to the client's current working |
| directory (which might be a process on a different host to the |
| server). You may set the repository to `/`, if you wish to use paths |
| relative to the root directory. |
| |
| #### Exporting |
| |
| A compilation of a module interface, partition or header unit can |
| inform the builder with: |
| |
| `MODULE-EXPORT $module [$flags]` |
| |
| This will result in a PATHNAME response naming the Compiled Module |
| Interface pathname to write. |
| |
| The `MODULE-EXPORT` request does not indicate the module has been |
| successfully compiled. At most one `MODULE-EXPORT` is to be made, and |
| as the connection is for a single compilation, the builder may infer |
| dependency relationships between the module being generated and import |
| requests made. |
| |
| Named module names and header unit names are distinguished by making |
| the latter unambiguously look like file names. Firstly, they must be |
| fully resolved according to the compiler's usual include path. If |
| that results in an absolute file name (beginning with `/`, or |
| certain other OS-specific sequences), all is well. Otherwise a |
| relative file name must be prefixed by `./` to be distinguished from a |
| similarly named named module. This prefixing must occur, even if the |
| header-unit's name contains characters that cannot appear in a named |
| module's name. |
| |
| It is expected that absolute header-unit names convert to relative CMI |
| names, to keep all CMIs within the CMI repository. This means that |
| steps must be taken to distinguish the CMIs for `/here` from `./here`, |
| and this can be achieved by replacing the leading `./` directory with |
| `,/`, which is visually similar but does not have the self-reference |
| semantics of dot. Likewise, header-unit names containing `..` |
| directories, can be remapped to `,,`. (When symlinks are involved |
| `bob/dob/..` might not be `bob`, of course.) C++ header-unit |
| semantics are such that there is no need to resolve multiple ways of |
| spelling a particular header-unit to a unique CMI file. |
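
One possible mapping, sketched under stated assumptions: absolute names are made relative by prepending `.`, the leading `./` of relative names becomes `,/`, and every `..` component becomes `,,`. `HeaderUnitCMIStem` is a hypothetical helper, it treats only `/` as a directory separator, and it appends no compiler-specific suffix. It is not presented as the scheme any particular resolver uses.

```cpp
#include <string>

// Map a header-unit name to a repository-relative CMI stem.
// Names that do not look like file names are returned unchanged.
std::string HeaderUnitCMIStem (std::string const &name)
{
  std::string result;
  if (!name.empty () && name[0] == '/')
    result = "." + name;             // "/here"  -> "./here"
  else if (name.compare (0, 2, "./") == 0)
    result = "," + name.substr (1);  // "./here" -> ",/here"
  else
    return name;                     // A named module, not a header unit.

  // Remap every ".." directory component to ",,".
  for (std::size_t pos = 0; pos < result.size (); )
    {
      std::size_t end = result.find ('/', pos);
      if (end == std::string::npos)
        end = result.size ();
      if (end - pos == 2 && result.compare (pos, 2, "..") == 0)
        result.replace (pos, 2, ",,");
      pos = end + 1;
    }
  return result;
}
```

With this mapping `/here` and `./here` land on the distinct stems `./here` and `,/here`, and neither result can climb out of the repository through a `..` component.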
| |
| Successful compilation of an interface is indicated with a subsequent: |
| |
| `MODULE-COMPILED $module [$flags]` |
| |
| request. This indicates the CMI file has been written to disk, so |
| that any other compilations waiting on it may proceed. Depending on |
| compiler implementation, the CMI may be written before the compilation |
| completes. A single OK response is expected. |
| |
| Compilation failure can be inferred by lack of a `MODULE-COMPILED` |
| request. It is presumed the builder can determine this, as it is also |
| responsible for launching and reaping the compiler invocations |
| themselves. |
| |
| #### Importing |
| |
| Importation, including that of header-units, uses: |
| |
| `MODULE-IMPORT $module [$flags]` |
| |
| A PATHNAME response names the CMI file to be read. Should the builder |
| have to invoke a compilation to produce the CMI, the response should |
| be delayed until that occurs. If such a compilation fails, an error |
| response should be provided to the requestor — which will then |
| presumably fail in some manner. |
| |
| #### Include Translation |
| |
| Include translation can be determined with: |
| |
| `INCLUDE-TRANSLATE $header [$flags]` |
| |
| The header name, `$header`, is the fully resolved header name, in the |
| above-mentioned unambiguous filename form. The response will either |
| be a BOOL response indicating textual inclusion, or a PATHNAME |
| response naming the CMI for such translation. The BOOL value is TRUE, |
| if the header is known to be a textual header, and FALSE if nothing is |
| known about it -- the latter might cause diagnostics about incomplete |
| knowledge. |
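
The three outcomes can be told apart from the decoded response words, as in this minimal sketch. The `Translation` enumeration and `InterpretIncludeTranslate` are hypothetical names, and the input is assumed to be the response line already split into words.

```cpp
#include <string>
#include <vector>

enum class Translation
{
  Textual,  // BOOL TRUE: known to be a textual header.
  Unknown,  // BOOL FALSE: nothing is known about the header.
  ToCMI,    // PATHNAME: translate the include to an import of this CMI.
  Failed    // ERROR, or anything unexpected.
};

// Interpret the decoded words of an INCLUDE-TRANSLATE response.
// On ToCMI, `cmi` receives the pathname.
Translation InterpretIncludeTranslate (std::vector<std::string> const &words,
                                       std::string &cmi)
{
  if (words.size () == 2 && words[0] == "BOOL")
    {
      if (words[1] == "TRUE")
        return Translation::Textual;
      if (words[1] == "FALSE")
        return Translation::Unknown;
    }
  else if (words.size () == 2 && words[0] == "PATHNAME")
    {
      cmi = words[1];
      return Translation::ToCMI;
    }
  return Translation::Failed;
}
```

A compiler would typically go on to include the header textually for both `Textual` and `Unknown`, perhaps issuing a diagnostic in the latter case.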
| |
| ### GCC LTO Messages |
| |
| This set of requests is used for GCC LTO jobserver integration with GNU Make. |
| |
| ## Building libCody |
| |
| Libcody is written in C++11. (It's intended for compilers, so |
| there'd be a bootstrapping problem if it used the latest and greatest.) |
| |
| ### Using configure and make |
| |
| It supports the usual `configure`, `make`, `make check` & `make install` |
| sequence. It does not support building in the source directory -- |
| that just didn't drop out, and it's not how I build things (because, |
| again, for compilers). Excitingly it uses my own `joust` test |
| harness, so you'll need to build and install that somewhere, if you |
| want the comfort of testing. |
| |
| The following configure options are available, in addition to the usual set: |
| |
| * `--enable-checking` Compile with assert-like checking. Defaults to on. |
| |
| * `--with-tooldir=DIR` Prepend `DIR` to `PATH` when building (`DIR` |
| need not already include the trailing `/bin`, and the right things |
| happen). Use this if you need to point to non-standard tools that |
| you usually don't have in your path. This path is also used when |
| the configure script searches for programs. |
| |
| * `--with-toolinc=DIR`, `--with-toollib=DIR`, include path and library |
| path variants of `--with-tooldir`. If these are siblings of the |
| tool bin directory, they'll be found automatically. |
| |
| * `--with-compiler=NAME` Specify a particular compiler to use. |
| Usually what configure finds is sufficiently usable. |
| |
| * `--with-bugurl=URL` Override the bugreporting URL. Do this if |
| you're providing libcody as part of a package that /you/ are |
| supporting. |
| |
| * `--enable-maintainer-mode` Specify that rules to rebuild things like |
| `configure` (with `autoconf`) should be enabled. When not enabled, |
| you'll get a message if these appear out of date, but that can |
| happen naturally after an update or clone as `git`, in common with |
| other VCs, doesn't preserve the relative ordering of file |
| modifications. You can use `make MAINTAINER=touch` to shut make up, |
| if this occurs (or manually execute the `autoconf` and related |
| commands). |
| |
| When building, you can override the default optimization flags with |
| `CXXFLAGS=$flags`. I often build a debuggable library with `make |
| CXXFLAGS=-g3`. |
| |
| The `Makefile` will also parallelize according to the number of CPUs, |
| unless you specify explicitly with a `-j` option. This is a little |
| clunky, as it's not possible to figure out inside the makefile whether |
| the user provided `-j`. (Or at least I've not figured out how.) |
| |
| ### Using cmake and make |
| |
| #### In the clang/LLVM project |
| |
| The primary motivation for a cmake implementation is to allow building |
| libcody "in tree" in clang/LLVM. In that case, a checkout of libcody |
| can be placed (or symbolically linked) into clang/tools. This will |
| configure and build the library along with other LLVM dependencies. |
| |
| *NOTE* This is not treated as an installable entity (it is present only |
| for use by the project). |
| |
| *NOTE* The testing targets would not be appropriate in this configuration; |
| it is expected that lit-based testing of the required functionality will be |
| done by the code using the library. |
| |
| #### Stand-alone |
| |
| For use on platforms that don't support configure & make effectively, it |
| is possible to use the cmake & make process in stand-alone mode (similar |
| to the configure & make process above). |
| |
| An example use: |
| ``` |
| cmake -DCMAKE_INSTALL_PREFIX=/path/to/installation -DCMAKE_CXX_COMPILER=clang++ /path/to/libcody/source |
| make |
| make install |
| ``` |
| Supported flags (additions to the usual cmake ones). |
| |
| * `-DCODY_CHECKING=ON,OFF`: Compile with assert-like checking. (defaults ON) |
| |
| * `-DCODY_WITHEXCEPTIONS=ON,OFF`: Compile with C++ exceptions and RTTI enabled. |
| (defaults OFF, to be compatible with GCC and LLVM). |
| |
| *TODO*: At present there is no support for `ctest` integration (this should be |
| feasible, provided that `joust` is installed and can be discovered by `cmake`). |
| |
| ## API |
| |
| The library defines entities in the `::Cody` namespace. |
| |
| There are 4 user-visible classes: |
| |
| * `Packet`: Responses to requests are `Packets`. These have a code, |
| indicating the response kind, and a payload. |
| |
| * `Client`: The compiler-end of a connection. Requests may be made |
| and responses are returned. |
| |
| * `Server`: The builder-end of a connection. Requests may be waited |
| for, and responses made. Builders that serve multiple concurrent |
| connections and spawn compilations to resolve dependencies may need |
| to derive from this class to provide response queuing. |
| |
| * `Resolver`: The processing engine of the builder side. User code is |
| expected to derive from this class and provide virtual function |
| overriders to affect the semantics of the resolver. |
| |
| In addition there are a number of helpers to set up connections. |
| |
| Logically the Client and the Server communicate via a sequential |
| channel. The channel may be provided by: |
| |
| * two pipes, with different file descriptors for reading and writing |
| at each end. |
| |
| * a socket, which will use the same file descriptor for reading and |
| writing. The socket can be created in a number of ways, including |
| Unix domain and IPv6 TCP, for which helpers are provided. |
| |
| * a direct, in-process, connection, using buffer swapping. |
| |
| The communication channel is presumed reliable. |
| |
| Refer to the (currently very sparse) doxygen-generated documentation |
| for details of the API. |
| |
| ## Examples |
| |
| To create an in-process resolver, use the following boilerplate: |
| |
| ``` |
| // Derive publicly, so a MyResolver * converts to a Cody::Resolver *. |
| class MyResolver : public Cody::Resolver { ... stuff here ... }; |
| |
| Cody::Client *MakeClient (char const *maybe_ident) |
| { |
|   auto *r = new MyResolver (...); |
|   auto *s = new Cody::Server (r); |
|   auto *c = new Cody::Client (s); |
| |
|   // Perform the handshake before making any other request. |
|   auto t = c->ConnectRequest ("ME", maybe_ident); |
|   if (t.GetCode () == Cody::Client::TC_CONNECT) |
|     ;// Yay! |
|   else if (t.GetCode () == Cody::Client::TC_ERROR) |
|     report_error (t.GetString ()); |
| |
|   return c; |
| } |
| ``` |
| |
| For a remotely connecting client: |
| ``` |
| Cody::Client *MakeClient (char const *name, int port, char const *maybe_ident) |
| { |
|   char const *err = nullptr; |
|   // The helper is assumed to live in the Cody namespace and to set |
|   // err on failure. |
|   int fd = Cody::OpenInet6 (&err, name, port); |
|   if (fd < 0) |
|     { ... error... return nullptr;} |
| |
|   auto *c = new Cody::Client (fd); |
| |
|   // Perform the handshake before making any other request. |
|   auto t = c->ConnectRequest ("ME", maybe_ident); |
|   if (t.GetCode () == Cody::Client::TC_CONNECT) |
|     ;// Yay! |
|   else if (t.GetCode () == Cody::Client::TC_ERROR) |
|     report_error (t.GetString ()); |
| |
|   return c; |
| } |
| ``` |
| |
| ## Future Directions |
| |
| * Current Directory. There is no mechanism to check the builder and |
| the compiler have the same working directory. Perhaps that should |
| be addressed. |
| |
| * Include path canonicalization and/or header file lookup. This can be |
| expensive, particularly with many `-I` options, due to the system |
| calls. Perhaps using a common resource would be cheaper? |
| |
| * Generated header file lookup/construction. This is essentially the |
| same problem as importing a module, and build systems are crap at |
| dealing with this. |
| |
| * Link-time compilations. Another place the compiler would like to |
| ask the build system to do things. |
| |
| * C++20 API entrypoints: `std::string_view` would be nice. |
| |
| * Exception-safety audit. Exceptions are not used, but memory |
| exhaustion could happen. And perhaps the user's resolver code employs |
| exceptions? |
| |
| <a name="1">1</a>: Or a small town in Wyoming |
| |
| <a name="2">2</a>: This describes one common implementation technique. |
| The standard itself doesn't require such serializations, but the ability to |
| create them is kind of the point. Also, 'compiler' is used where we |
| mean any consumer of a module, and 'build system' where we mean any |
| producer of a module. |
| |
| <a name="3">3</a>: Even when the builder is managing a distributed set |
| of compilations, the builder must have a mechanism to get source files |
| to, and object files from, the compilations. That scheme can also |
| transfer the CMI files. |