| \input texinfo |
| @setfilename cppinternals.info |
| @settitle The GNU C Preprocessor Internals |
| |
| @ifinfo |
| @dircategory Programming |
| @direntry |
| * Cpplib: (cppinternals). Cpplib internals. |
| @end direntry |
| @end ifinfo |
| |
| @c @smallbook |
| @c @cropmarks |
| @c @finalout |
| @setchapternewpage odd |
| @ifinfo |
| This file documents the internals of the GNU C Preprocessor. |
| |
| Copyright 2000, 2001 Free Software Foundation, Inc. |
| |
| Permission is granted to make and distribute verbatim copies of |
| this manual provided the copyright notice and this permission notice |
| are preserved on all copies. |
| |
| @ignore |
| Permission is granted to process this file through Tex and print the |
| results, provided the printed document carries copying permission |
| notice identical to this one except for the removal of this paragraph |
| (this paragraph not being relevant to the printed manual). |
| |
| @end ignore |
| Permission is granted to copy and distribute modified versions of this |
| manual under the conditions for verbatim copying, provided also that |
| the entire resulting derived work is distributed under the terms of a |
| permission notice identical to this one. |
| |
| Permission is granted to copy and distribute translations of this manual |
| into another language, under the above conditions for modified versions. |
| @end ifinfo |
| |
| @titlepage |
| @c @finalout |
| @title Cpplib Internals |
| @subtitle Last revised Jan 2001 |
| @subtitle for GCC version 3.0 |
| @author Neil Booth |
| @page |
| @vskip 0pt plus 1filll |
| @c man begin COPYRIGHT |
| Copyright @copyright{} 2000, 2001 |
| Free Software Foundation, Inc. |
| |
| Permission is granted to make and distribute verbatim copies of |
| this manual provided the copyright notice and this permission notice |
| are preserved on all copies. |
| |
| Permission is granted to copy and distribute modified versions of this |
| manual under the conditions for verbatim copying, provided also that |
| the entire resulting derived work is distributed under the terms of a |
| permission notice identical to this one. |
| |
| Permission is granted to copy and distribute translations of this manual |
| into another language, under the above conditions for modified versions. |
| @c man end |
| @end titlepage |
| @page |
| |
| @node Top, Conventions,, (DIR) |
| @chapter Cpplib - the core of the GNU C Preprocessor |
| |
| The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is |
| now implemented as a library, cpplib, so it can be easily shared between |
| a stand-alone preprocessor, and a preprocessor integrated with the C, |
| C++ and Objective C front ends. It is also available for use by other |
| programs, though this is not recommended as its exposed interface has |
| not yet reached a point of reasonable stability. |
| |
| This library has been written to be re-entrant, so that it can be used |
| to preprocess many files simultaneously if necessary. It has also been |
| written with the preprocessing token as the fundamental unit; the |
| preprocessor in previous versions of GCC would operate on text strings |
| as the fundamental unit. |
| |
| This brief manual documents some of the internals of cpplib, and a few |
| tricky issues encountered. It also describes certain behaviour we would |
| like to preserve, such as the format and spacing of its output. |
| |
| Identifiers, macro expansion, hash nodes, lexing. |
| |
| @menu |
| * Conventions:: Conventions used in the code. |
| * Lexer:: The combined C, C++ and Objective C Lexer. |
| * Whitespace:: Input and output newlines and whitespace. |
| * Hash Nodes:: All identifiers are hashed. |
| * Macro Expansion:: Macro expansion algorithm. |
| * Files:: File handling. |
| * Index:: Index. |
| @end menu |
| |
| @node Conventions, Lexer, Top, Top |
| @unnumbered Conventions |
| @cindex interface |
| @cindex header files |
| |
| cpplib has two interfaces - one is exposed internally only, and the |
| other is for both internal and external use. |
| |
| The convention is that functions and types that are exposed to multiple |
| files internally are prefixed with @samp{_cpp_}, and are to be found in |
| the file @samp{cpphash.h}. Functions and types exposed to external |
| clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}. For |
| historical reasons this is no longer quite true, but we should strive to |
| stick to it. |
| |
| We are striving to reduce the information exposed in cpplib.h to the |
| bare minimum necessary, and then to keep it there. This makes clear |
| exactly what external clients are entitled to assume, and allows us to |
| change internals in the future without worrying whether library clients |
| are perhaps relying on some kind of undocumented implementation-specific |
| behaviour. |
| |
| @node Lexer, Whitespace, Conventions, Top |
| @unnumbered The Lexer |
| @cindex lexer |
| @cindex tokens |
| |
| The lexer is contained in the file @samp{cpplex.c}. We want to have a |
| lexer that is single-pass, for efficiency reasons. We would also like |
| the lexer to only step forwards through the input files, and not step |
| back. This will make future changes to support different character |
| sets, in particular state or shift-dependent ones, much easier. |
| |
| This file also contains all information needed to spell a token, i.e. to |
| output it either in a diagnostic or to a preprocessed output file. This |
| information is not exported, but made available to clients through such |
| functions as @samp{cpp_spell_token} and @samp{cpp_token_len}. |
| |
| The most painful aspect of lexing ISO-standard C and C++ is handling |
| trigraphs and backlash-escaped newlines. Trigraphs are processed before |
| any interpretation of the meaning of a character is made, and unfortunately |
| there is a trigraph representation for a backslash, so it is possible for |
| the trigraph @samp{??/} to introduce an escaped newline. |
| |
| Escaped newlines are tedious because theoretically they can occur |
| anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token, |
| within the characters of an identifier, and even between the @samp{*} |
| and @samp{/} that terminates a comment. Moreover, you cannot be sure |
| there is just one - there might be an arbitrarily long sequence of them. |
| |
| So the routine @samp{parse_identifier}, that lexes an identifier, cannot |
| assume that it can scan forwards until the first non-identifier |
| character and be done with it, because this could be the @samp{\} |
| introducing an escaped newline, or the @samp{?} introducing the trigraph |
| sequence that represents the @samp{\} of an escaped newline. Similarly |
| for the routine that handles numbers, @samp{parse_number}. If these |
| routines stumble upon a @samp{?} or @samp{\}, they call |
| @samp{skip_escaped_newlines} to skip over any potential escaped newlines |
| before checking whether they can finish. |
| |
| Similarly code in the main body of @samp{_cpp_lex_token} cannot simply |
| check for a @samp{=} after a @samp{+} character to determine whether it |
| has a @samp{+=} token; it needs to be prepared for an escaped newline of |
| some sort. These cases use the function @samp{get_effective_char}, |
| which returns the first character after any intervening newlines. |
| |
| The lexer needs to keep track of the correct column position, |
| including counting tabs as specified by the @samp{-ftabstop=} option. |
| This should be done even within comments; C-style comments can appear in |
| the middle of a line, and we want to report diagnostics in the correct |
| position for text appearing after the end of the comment. |
| |
| Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers, |
| may be invalid and require a diagnostic. However, if they appear in a |
| macro expansion we don't want to complain with each use of the macro. |
| It is therefore best to catch them during the lexing stage, in |
| @samp{parse_identifier}. In both cases, whether a diagnostic is needed |
| or not is dependent upon lexer state. For example, we don't want to |
| issue a diagnostic for re-poisoning a poisoned identifier, or for using |
| @samp{__VA_ARGS__} in the expansion of a variable-argument macro. |
| Therefore @samp{parse_identifier} makes use of flags to determine |
| whether a diagnostic is appropriate. Since we change state on a |
| per-token basis, and don't lex whole lines at a time, this is not a |
| problem. |
| |
| Another place where state flags are used to change behaviour is whilst |
| parsing header names. Normally, a @samp{<} would be lexed as a single |
| token. After a @code{#include} directive, though, it should be lexed |
| as a single token as far as the nearest @samp{>} character. Note that |
| we don't allow the terminators of header names to be escaped; the first |
| @samp{"} or @samp{>} terminates the header name. |
| |
| Interpretation of some character sequences depends upon whether we are |
| lexing C, C++ or Objective C, and on the revision of the standard in |
| force. For example, @samp{::} is a single token in C++, but two |
| separate @samp{:} tokens, and almost certainly a syntax error, in C. |
| Such cases are handled in the main function @samp{_cpp_lex_token}, based |
| upon the flags set in the @samp{cpp_options} structure. |
| |
| Note we have almost, but not quite, achieved the goal of not stepping |
| backwards in the input stream. Currently @samp{skip_escaped_newlines} |
| does step back, though with care it should be possible to adjust it so |
| that this does not happen. For example, one tricky issue is if we meet |
| a trigraph, but the command line option @samp{-trigraphs} is not in |
| force but @samp{-Wtrigraphs} is, we need to warn about it but then |
| buffer it and continue to treat it as 3 separate characters. |
| |
| @node Whitespace, Hash Nodes, Lexer, Top |
| @unnumbered Whitespace |
| @cindex whitespace |
| @cindex newlines |
| @cindex escaped newlines |
| @cindex paste avoidance |
| @cindex line numbers |
| |
| The lexer has been written to treat each of @samp{\r}, @samp{\n}, |
| @samp{\r\n} and @samp{\n\r} as a single new line indicator. This allows |
| it to transparently preprocess MS-DOS, Macintosh and Unix files without |
| their needing to pass through a special filter beforehand. |
| |
| We also decided to treat a backslash, either @samp{\} or the trigraph |
| @samp{??/}, separated from one of the above newline indicators by |
| non-comment whitespace only, as intending to escape the newline. It |
| tends to be a typing mistake, and cannot reasonably be mistaken for |
| anything else in any of the C-family grammars. Since handling it this |
| way is not strictly conforming to the ISO standard, the library issues a |
| warning wherever it encounters it. |
| |
| Handling newlines like this is made simpler by doing it in one place |
| only. The function @samp{handle_newline} takes care of all newline |
| characters, and @samp{skip_escaped_newlines} takes care of arbitrarily |
| long sequences of escaped newlines, deferring to @samp{handle_newline} |
| to handle the newlines themselves. |
| |
| Another whitespace issue only concerns the stand-alone preprocessor: we |
| want to guarantee that re-reading the preprocessed output results in an |
| identical token stream. Without taking special measures, this might not |
| be the case because of macro substitution. We could simply insert a |
| space between adjacent tokens, but ideally we would like to keep this to |
| a minimum, both for aesthetic reasons and because it causes problems for |
| people who still try to abuse the preprocessor for things like Fortran |
| source and Makefiles. |
| |
| The token structure contains a flags byte, and two flags are of interest |
| here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE} |
| indicates that the token was preceded by whitespace; if this is the case |
| we need not worry about it incorrectly pasting with its predecessor. |
| The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and |
| indicates that paste avoidance by insertion of a space to the left of |
| the token may be necessary. Recursively, the first token of a macro |
| substitution, the first token after a macro substitution, the first |
| token of a substituted argument, and the first token after a substituted |
| argument are all flagged @samp{AVOID_LPASTE} by the macro expander. |
| |
| If a token flagged in this way does not have a @samp{PREV_WHITE} flag, |
| and the routine @var{cpp_avoid_paste} determines that it might be |
| misinterpreted by the lexer if a space is not inserted between it and |
| the immediately preceding token, then stand-alone CPP's output routines |
| will insert a space between them. To avoid excessive spacing, |
| @var{cpp_avoid_paste} tries hard to only request a space if one is |
| likely to be necessary, but for reasons of efficiency it is slightly |
| conservative and might recommend a space where one is not strictly |
| needed. |
| |
| Finally, the preprocessor takes great care to ensure it keeps track of |
| both the position of a token in the source file, for diagnostic |
| purposes, and where it should appear in the output file, because using |
| CPP for other languages like assembler requires this. The two positions |
| may differ for the following reasons: |
| |
| @itemize @bullet |
| @item |
| Escaped newlines are deleted, so lines spliced in this way are joined to |
| form a single logical line. |
| |
| @item |
| A macro expansion replaces the tokens that form its invocation, but any |
| newlines appearing in the macro's arguments are interpreted as a single |
| space, with the result that the macro's replacement appears in full on |
| the same line that the macro name appeared in the source file. This is |
| particularly important for stringification of arguments - newlines |
| embedded in the arguments must appear in the string as spaces. |
| @end itemize |
| |
| The source file location is maintained in the @var{lineno} member of the |
| @var{cpp_buffer} structure, and the column number inferred from the |
| current position in the buffer relative to the @var{line_base} buffer |
| variable, which is updated with every newline whether escaped or not. |
| |
| TODO: Finish this. |
| |
| @node Hash Nodes, Macro Expansion, Whitespace, Top |
| @unnumbered Hash Nodes |
| @cindex hash table |
| @cindex identifiers |
| @cindex macros |
| @cindex assertions |
| @cindex named operators |
| |
| When cpplib encounters an "identifier", it generates a hash code for it |
| and stores it in the hash table. By "identifier" we mean tokens with |
| type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as |
| well as keywords, directive names, macro names and so on. For example, |
| all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed |
| when lexed. |
| |
| Each node in the hash table contain various information about the |
| identifier it represents. For example, its length and type. At any one |
| time, each identifier falls into exactly one of three categories: |
| |
| @itemize @bullet |
| @item Macros |
| |
| These have been declared to be macros, either on the command line or |
| with @code{#define}. A few, such as @samp{__TIME__} are builtins |
| entered in the hash table during initialisation. The hash node for a |
| normal macro points to a structure with more information about the |
| macro, such as whether it is function-like, how many arguments it takes, |
| and its expansion. Builtin macros are flagged as special, and instead |
| contain an enum indicating which of the various builtin macros it is. |
| |
| @item Assertions |
| |
| Assertions are in a separate namespace to macros. To enforce this, cpp |
| actually prepends a @code{#} character before hashing and entering it in |
| the hash table. An assertion's node points to a chain of answers to |
| that assertion. |
| |
| @item Void |
| |
| Everything else falls into this category - an identifier that is not |
| currently a macro, or a macro that has since been undefined with |
| @code{#undef}. |
| |
| When preprocessing C++, this category also includes the named operators, |
| such as @samp{xor}. In expressions these behave like the operators they |
| represent, but in contexts where the spelling of a token matters they |
| are spelt differently. This spelling distinction is relevant when they |
| are operands of the stringizing and pasting macro operators @code{#} and |
| @code{##}. Named operator hash nodes are flagged, both to catch the |
| spelling distinction and to prevent them from being defined as macros. |
| @end itemize |
| |
| The same identifiers share the same hash node. Since each identifier |
| token, after lexing, contains a pointer to its hash node, this is used |
| to provide rapid lookup of various information. For example, when |
| parsing a @code{#define} statement, CPP flags each argument's identifier |
| hash node with the index of that argument. This makes duplicated |
| argument checking an O(1) operation for each argument. Similarly, for |
| each identifier in the macro's expansion, lookup to see if it is an |
| argument, and which argument it is, is also an O(1) operation. Further, |
| each directive name, such as @samp{endif}, has an associated directive |
| enum stored in its hash node, so that directive lookup is also O(1). |
| |
| @node Macro Expansion, Files, Hash Nodes, Top |
| @unnumbered Macro Expansion Algorithm |
| |
| @node Files, Index, Macro Expansion, Top |
| @unnumbered File Handling |
| @cindex files |
| |
| Fairly obviously, the file handling code of cpplib resides in the file |
| @samp{cppfiles.c}. It takes care of the details of file searching, |
| opening, reading and caching, for both the main source file and all the |
| headers it recursively includes. |
| |
| The basic strategy is to minimize the number of system calls. On many |
| systems, the basic @code{open ()} and @code{fstat ()} system calls can |
| be quite expensive. For every @code{#include}-d file, we need to try |
| all the directories in the search path until we find a match. Some |
| projects, such as glibc, pass twenty or thirty include paths on the |
| command line, so this can rapidly become time consuming. |
| |
| For a header file we have not encountered before we have little choice |
| but to do this. However, it is often the case that the same headers are |
| repeatedly included, and in these cases we try to avoid repeating the |
| filesystem queries whilst searching for the correct file. |
| |
| For each file we try to open, we store the constructed path in a splay |
| tree. This path first undergoes simplification by the function |
| @code{_cpp_simplify_pathname}. For example, |
| @samp{/usr/include/bits/../foo.h} is simplified to |
| @samp{/usr/include/foo.h} before we enter it in the splay tree and try |
| to @code{open ()} the file. CPP will then find subsequent uses of |
| @samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and |
| save system calls. |
| |
| Further, it is likely the file contents have also been cached, saving a |
| @code{read ()} system call. We don't bother caching the contents of |
| header files that are re-inclusion protected, and whose re-inclusion |
| macro is defined when we leave the header file for the first time. If |
| the host supports it, we try to map suitably large files into memory, |
| rather than reading them in directly. |
| |
| The include paths are intenally stored on a null-terminated |
| singly-linked list, starting with the @code{"header.h"} directory search |
| chain, which then links into the @code{<header.h>} directory chain. |
| |
| Files included with the @code{<foo.h>} syntax start the lookup directly |
| in the second half of this chain. However, files included with the |
| @code{"foo.h"} syntax start at the beginning of the chain, but with one |
| extra directory prepended. This is the directory of the current file; |
| the one containing the @code{#include} directive. Prepending this |
| directory on a per-file basis is handled by the function |
| @code{search_from}. |
| |
| Note that a header included with a directory component, such as |
| @code{#include "mydir/foo.h"} and opened as |
| @samp{/usr/local/include/mydir/foo.h}, will have the complete path minus |
| the basename @samp{foo.h} as the current directory. |
| |
| Enough information is stored in the splay tree that CPP can immediately |
| tell whether it can skip the header file because of the multiple include |
| optimisation, whether the file didn't exist or couldn't be opened for |
| some reason, or whether the header was flagged not to be re-used, as it |
| is with the obsolete @code{#import} directive. |
| |
| For the benefit of MS-DOS filesystems with an 8.3 filename limitation, |
| CPP offers the ability to treat various include file names as aliases |
| for the real header files with shorter names. The map from one to the |
| other is found in a special file called @samp{header.gcc}, stored in the |
| command line (or system) include directories to which the mapping |
| applies. This may be higher up the directory tree than the full path to |
| the file minus the base name. |
| |
| @node Index,, Files, Top |
| @unnumbered Index |
| @printindex cp |
| |
| @contents |
| @bye |