gcc/doc/analyzer.texi - gcc - Git at Google

 @c Copyright (C) 2019-2021 Free Software Foundation, Inc.
 @c This is part of the GCC manual.
 @c For copying conditions, see the file gcc.texi.
 @c Contributed by David Malcolm <dmalcolm@redhat.com>.

 @node Static Analyzer
 @chapter Static Analyzer
 @cindex analyzer
 @cindex static analysis
 @cindex static analyzer

 @menu
 * Analyzer Internals::       Analyzer Internals
 * Debugging the Analyzer::   Useful debugging tips
 @end menu

 @node Analyzer Internals
 @section Analyzer Internals
 @cindex analyzer, internals
 @cindex static analyzer, internals

 @subsection Overview

 The analyzer implementation works on the gimple-SSA representation.
 (I chose this in the hopes of making it easy to work with LTO to
 do whole-program analysis).

 The implementation is read-only: it doesn't attempt to change anything,
 just emit warnings.

 The gimple representation can be seen using @option{-fdump-ipa-analyzer}.
 @quotation Tip
 If the analyzer ICEs before this is written out, one workaround is to use
 @option{--param=analyzer-bb-explosion-factor=0} to force the analyzer
 to bail out after analyzing the first basic block.
 @end quotation

 First, we build a @code{supergraph} which combines the callgraph and all
 of the CFGs into a single directed graph, with both interprocedural and
 intraprocedural edges.  The nodes and edges in the supergraph are called
 ``supernodes'' and ``superedges'', and often referred to in code as
 @code{snodes} and @code{sedges}.  Basic blocks in the CFGs are split at
 interprocedural calls, so there can be more than one supernode per
 basic block.  Most statements will be in just one supernode, but a call
 statement can appear in two supernodes: at the end of one for the call,
 and again at the start of another for the return.

 The supergraph can be seen using @option{-fdump-analyzer-supergraph}.

 We then build an @code{analysis_plan} which walks the callgraph to
 determine which calls might be suitable for being summarized (rather
 than fully explored) and thus in what order to explore the functions.

 Next is the heart of the analyzer: we use a worklist to explore state
 within the supergraph, building an "exploded graph".
 Nodes in the exploded graph correspond to <point,@w{ }state> pairs, as in
      "Precise Interprocedural Dataflow Analysis via Graph Reachability"
      (Thomas Reps, Susan Horwitz and Mooly Sagiv).

 We reuse nodes for <point, state> pairs we've already seen, and avoid
 tracking state too closely, so that (hopefully) we rapidly converge
 on a final exploded graph, and terminate the analysis.  We also bail
 out if the number of exploded <end-of-basic-block, state> nodes gets
 larger than a particular multiple of the total number of basic blocks
 (to ensure termination in the face of pathological state-explosion
 cases, or bugs).  We also stop exploring a point once we hit a limit
 of states for that point.

 We can identify problems directly when processing a <point,@w{ }state>
 instance.  For example, if we're finding the successors of

 @smallexample
    <point: before-stmt: "free (ptr);",
     state: @{"ptr": freed@}>
 @end smallexample

 then we can detect a double-free of "ptr".  We can then emit a path
 to reach the problem by finding the simplest route through the graph.

 Program points in the analysis are much more fine-grained than in the
 CFG and supergraph, with points (and thus potentially exploded nodes)
 for various events, including before individual statements.
 By default the exploded graph merges multiple consecutive statements
 in a supernode into one exploded edge to minimize the size of the
 exploded graph.  This can be suppressed via
 @option{-fanalyzer-fine-grained}.
 The fine-grained approach seems to make things simpler and more debuggable
 that other approaches I tried, in that each point is responsible for one
 thing.

 Program points in the analysis also have a "call string" identifying the
 stack of callsites below them, so that paths in the exploded graph
 correspond to interprocedurally valid paths: we always return to the
 correct call site, propagating state information accordingly.
 We avoid infinite recursion by stopping the analysis if a callsite
 appears more than @code{analyzer-max-recursion-depth} in a callstring
 (defaulting to 2).

 @subsection Graphs

 Nodes and edges in the exploded graph are called ``exploded nodes'' and
 ``exploded edges'' and often referred to in the code as
 @code{enodes} and @code{eedges} (especially when distinguishing them
 from the @code{snodes} and @code{sedges} in the supergraph).

 Each graph numbers its nodes, giving unique identifiers - supernodes
 are referred to throughout dumps in the form @samp{SN': @var{index}} and
 exploded nodes in the form @samp{EN: @var{index}} (e.g. @samp{SN: 2} and
 @samp{EN:29}).

 The supergraph can be seen using @option{-fdump-analyzer-supergraph-graph}.

 The exploded graph can be seen using @option{-fdump-analyzer-exploded-graph}
 and other dump options.  Exploded nodes are color-coded in the .dot output
 based on state-machine states to make it easier to see state changes at
 a glance.

 @subsection State Tracking

 There's a tension between:
 @itemize @bullet
 @item
 precision of analysis in the straight-line case, vs
 @item
 exponential blow-up in the face of control flow.
 @end itemize

 For example, in general, given this CFG:

 @smallexample
       A
      / \
     B   C
      \ /
       D
      / \
     E   F
      \ /
       G
 @end smallexample

 we want to avoid differences in state-tracking in B and C from
 leading to blow-up.  If we don't prevent state blowup, we end up
 with exponential growth of the exploded graph like this:

 @smallexample

            1:A
           /   \
          /     \
         /       \
       2:B       3:C
        |         |
       4:D       5:D        (2 exploded nodes for D)
      /   \     /   \
    6:E   7:F 8:E   9:F
     |     |   |     |
    10:G 11:G 12:G  13:G    (4 exploded nodes for G)

 @end smallexample

 Similar issues arise with loops.

 To prevent this, we follow various approaches:

 @enumerate a
 @item
 state pruning: which tries to discard state that won't be relevant
 later on withing the function.
 This can be disabled via @option{-fno-analyzer-state-purge}.

 @item
 state merging.  We can try to find the commonality between two
 program_state instances to make a third, simpler program_state.
 We have two strategies here:

   @enumerate
   @item
      the worklist keeps new nodes for the same program_point together,
      and tries to merge them before processing, and thus before they have
      successors.  Hence, in the above, the two nodes for D (4 and 5) reach
      the front of the worklist together, and we create a node for D with
      the merger of the incoming states.

   @item
      try merging with the state of existing enodes for the program_point
      (which may have already been explored).  There will be duplication,
      but only one set of duplication; subsequent duplicates are more likely
      to hit the cache.  In particular, (hopefully) all merger chains are
      finite, and so we guarantee termination.
      This is intended to help with loops: we ought to explore the first
      iteration, and then have a "subsequent iterations" exploration,
      which uses a state merged from that of the first, to be more abstract.
   @end enumerate

 We avoid merging pairs of states that have state-machine differences,
 as these are the kinds of differences that are likely to be most
 interesting.  So, for example, given:

 @smallexample
       if (condition)
         ptr = malloc (size);
       else
         ptr = local_buf;

       .... do things with 'ptr'

       if (condition)
         free (ptr);

       ...etc
 @end smallexample

 then we end up with an exploded graph that looks like this:

 @smallexample

                    if (condition)
                      / T      \ F
             ---------          ----------
            /                             \
       ptr = malloc (size)             ptr = local_buf
           |                               |
       copy of                         copy of
         "do things with 'ptr'"          "do things with 'ptr'"
       with ptr: heap-allocated        with ptr: stack-allocated
           |                               |
       if (condition)                  if (condition)
           | known to be T                 | known to be F
       free (ptr);                         |
            \                             /
             -----------------------------
                          | ('ptr' is pruned, so states can be merged)
                         etc

 @end smallexample

 where some duplication has occurred, but only for the places where the
 the different paths are worth exploringly separately.

 Merging can be disabled via @option{-fno-analyzer-state-merge}.
 @end enumerate

 @subsection Region Model

 Part of the state stored at a @code{exploded_node} is a @code{region_model}.
 This is an implementation of the region-based ternary model described in
 @url{https://www.researchgate.net/publication/221430855_A_Memory_Model_for_Static_Analysis_of_C_Programs,
 "A Memory Model for Static Analysis of C Programs"}
 (Zhongxing Xu, Ted Kremenek, and Jian Zhang).

 A @code{region_model} encapsulates a representation of the state of
 memory, with a @code{store} recording a binding between @code{region}
 instances, to @code{svalue} instances.  The bindings are organized into
 clusters, where regions accessible via well-defined pointer arithmetic
 are in the same cluster.  The representation is graph-like because values
 can be pointers to regions.  It also stores a constraint_manager,
 capturing relationships between the values.

 Because each node in the @code{exploded_graph} has a @code{region_model},
 and each of the latter is graph-like, the @code{exploded_graph} is in some
 ways a graph of graphs.

 Here's an example of printing a @code{program_state}, showing the
 @code{region_model} within it, along with state for the @code{malloc}
 state machine.

 @smallexample
 (gdb) call debug (*this)
 rmodel:
 stack depth: 1
   frame (index 0): frame: ‘test’@@1
 clusters within frame: ‘test’@@1
   cluster for: ptr_3: &HEAP_ALLOCATED_REGION(12)
 m_called_unknown_fn: FALSE
 constraint_manager:
   equiv classes:
   constraints:
 malloc:
   0x2e89590: &HEAP_ALLOCATED_REGION(12): unchecked ('ptr_3')
 @end smallexample

 This is the state at the point of returning from @code{calls_malloc} back
 to @code{test} in the following:

 @smallexample
 void *
 calls_malloc (void)
 @{
   void *result = malloc (1024);
   return result;
 @}

 void test (void)
 @{
   void *ptr = calls_malloc ();
   /* etc.  */
 @}
 @end smallexample

 Within the store, there is the cluster for @code{ptr_3} within the frame
 for @code{test}, where the whole cluster is bound to a pointer value,
 pointing at @code{HEAP_ALLOCATED_REGION(12)}.  Additionally, this pointer
 has the @code{unchecked} state for the @code{malloc} state machine
 indicating it hasn't yet been checked against NULL since the allocation
 call.

 @subsection Analyzer Paths

 We need to explain to the user what the problem is, and to persuade them
 that there really is a problem.  Hence having a @code{diagnostic_path}
 isn't just an incidental detail of the analyzer; it's required.

 Paths ought to be:
 @itemize @bullet
 @item
 interprocedurally-valid
 @item
 feasible
 @end itemize

 Without state-merging, all paths in the exploded graph are feasible
 (in terms of constraints being satisfied).
 With state-merging, paths in the exploded graph can be infeasible.

 We collate warnings and only emit them for the simplest path
 e.g. for a bug in a utility function, with lots of routes to calling it,
 we only emit the simplest path (which could be intraprocedural, if
 it can be reproduced without a caller).

 We thus want to find the shortest feasible path through the exploded
 graph from the origin to the exploded node at which the diagnostic was
 saved.  Unfortunately, if we simply find the shortest such path and
 check if it's feasible we might falsely reject the diagnostic, as there
 might be a longer path that is feasible.  Examples include the cases
 where the diagnostic requires us to go at least once around a loop for a
 later condition to be satisfied, or where for a later condition to be
 satisfied we need to enter a suite of code that the simpler path skips.

 We attempt to find the shortest feasible path to each diagnostic by
 first constructing a ``trimmed graph'' from the exploded graph,
 containing only those nodes and edges from which there are paths to
 the target node, and using Dijkstra's algorithm to order the trimmed
 nodes by minimal distance to the target.

 We then use a worklist to iteratively build a ``feasible graph''
 (actually a tree), capturing the pertinent state along each path, in
 which every path to a ``feasible node'' is feasible by construction,
 restricting ourselves to the trimmed graph to ensure we stay on target,
 and ordering the worklist so that the first feasible path we find to the
 target node is the shortest possible path.  Hence we start by trying the
 shortest possible path, but if that fails, we explore progressively
 longer paths, eventually trying iterations through loops.  The
 exploration is captured in the feasible_graph, which can be dumped as a
 .dot file via @option{-fdump-analyzer-feasibility} to visualize the
 exploration.  The indices of the feasible nodes show the order in which
 they were created.  We effectively explore the tree of feasible paths in
 order of shortest path until we either find a feasible path to the
 target node, or hit a limit and give up.

 This is something of a brute-force approach, but the trimmed graph
 hopefully keeps the complexity manageable.

 This algorithm can be disabled (for debugging purposes) via
 @option{-fno-analyzer-feasibility}, which simply uses the shortest path,
 and notes if it is infeasible.

 The above gives us a shortest feasible @code{exploded_path} through the
 @code{exploded_graph} (a list of @code{exploded_edge *}).  We use this
 @code{exploded_path} to build a @code{diagnostic_path} (a list of
 @strong{events} for the diagnostic subsystem) - specifically a
 @code{checker_path}.

 Having built the @code{checker_path}, we prune it to try to eliminate
 events that aren't relevant, to minimize how much the user has to read.

 After pruning, we notify each event in the path of its ID and record the
 IDs of interesting events, allowing for events to refer to other events
 in their descriptions.  The @code{pending_diagnostic} class has various
 vfuncs to support emitting more precise descriptions, so that e.g.

 @itemize @bullet
 @item
 a deref-of-unchecked-malloc diagnostic might use:
 @smallexample
   returning possibly-NULL pointer to 'make_obj' from 'allocator'
 @end smallexample
 for a @code{return_event} to make it clearer how the unchecked value moves
 from callee back to caller
 @item
 a double-free diagnostic might use:
 @smallexample
   second 'free' here; first 'free' was at (3)
 @end smallexample
 and a use-after-free might use
 @smallexample
   use after 'free' here; memory was freed at (2)
 @end smallexample
 @end itemize

 At this point we can emit the diagnostic.

 @subsection Limitations

 @itemize @bullet
 @item
 Only for C so far
 @item
 The implementation of call summaries is currently very simplistic.
 @item
 Lack of function pointer analysis
 @item
 The constraint-handling code assumes reflexivity in some places
 (that values are equal to themselves), which is not the case for NaN.
 As a simple workaround, constraints on floating-point values are
 currently ignored.
 @item
 There are various other limitations in the region model (grep for TODO/xfail
 in the testsuite).
 @item
 The constraint_manager's implementation of transitivity is currently too
 expensive to enable by default and so must be manually enabled via
 @option{-fanalyzer-transitivity}).
 @item
 The checkers are currently hardcoded and don't allow for user extensibility
 (e.g. adding allocate/release pairs).
 @item
 Although the analyzer's test suite has a proof-of-concept test case for
 LTO, LTO support hasn't had extensive testing.  There are various
 lang-specific things in the analyzer that assume C rather than LTO.
 For example, SSA names are printed to the user in ``raw'' form, rather
 than printing the underlying variable name.
 @end itemize

 Some ideas for other checkers
 @itemize @bullet
 @item
 File-descriptor-based APIs
 @item
 Linux kernel internal APIs
 @item
 Signal handling
 @end itemize

 @node Debugging the Analyzer
 @section Debugging the Analyzer
 @cindex analyzer, debugging
 @cindex static analyzer, debugging

 @subsection Special Functions for Debugging the Analyzer

 The analyzer recognizes various special functions by name, for use
 in debugging the analyzer.  Declarations can be seen in the testsuite
 in @file{analyzer-decls.h}.  None of these functions are actually
 implemented.

 Add:
 @smallexample
   __analyzer_break ();
 @end smallexample
 to the source being analyzed to trigger a breakpoint in the analyzer when
 that source is reached.  By putting a series of these in the source, it's
 much easier to effectively step through the program state as it's analyzed.

 The analyzer handles:

 @smallexample
 __analyzer_describe (0, expr);
 @end smallexample

 by emitting a warning describing the 2nd argument (which can be of any
 type), at a verbosity level given by the 1st argument.  This is for use when
 debugging, and may be of use in DejaGnu tests.

 @smallexample
 __analyzer_dump ();
 @end smallexample

 will dump the copious information about the analyzer's state each time it
 reaches the call in its traversal of the source.

 @smallexample
 extern void __analyzer_dump_capacity (const void *ptr);
 @end smallexample

 will emit a warning describing the capacity of the base region of
 the region pointed to by the 1st argument.

 @smallexample
 __analyzer_dump_path ();
 @end smallexample

 will emit a placeholder ``note'' diagnostic with a path to that call site,
 if the analyzer finds a feasible path to it.

 The builtin @code{__analyzer_dump_exploded_nodes} will emit a warning
 after analysis containing information on all of the exploded nodes at that
 program point:

 @smallexample
   __analyzer_dump_exploded_nodes (0);
 @end smallexample

 will output the number of ``processed'' nodes, and the IDs of
 both ``processed'' and ``merger'' nodes, such as:

 @smallexample
 warning: 2 processed enodes: [EN: 56, EN: 58] merger(s): [EN: 54-55, EN: 57, EN: 59]
 @end smallexample

 With a non-zero argument

 @smallexample
   __analyzer_dump_exploded_nodes (1);
 @end smallexample

 it will also dump all of the states within the ``processed'' nodes.

 @smallexample
    __analyzer_dump_region_model ();
 @end smallexample
 will dump the region_model's state to stderr.

 @smallexample
 __analyzer_dump_state ("malloc", ptr);
 @end smallexample

 will emit a warning describing the state of the 2nd argument
 (which can be of any type) with respect to the state machine with
 a name matching the 1st argument (which must be a string literal).
 This is for use when debugging, and may be of use in DejaGnu tests.

 @smallexample
 __analyzer_eval (expr);
 @end smallexample
 will emit a warning with text "TRUE", FALSE" or "UNKNOWN" based on the
 truthfulness of the argument.  This is useful for writing DejaGnu tests.


 @subsection Other Debugging Techniques

 The option @option{-fdump-analyzer-json} will dump both the supergraph
 and the exploded graph in compressed JSON form.

 One approach when tracking down where a particular bogus state is
 introduced into the @code{exploded_graph} is to add custom code to
 @code{program_state::validate}.
	@c Copyright (C) 2019-2021 Free Software Foundation, Inc.
	@c This is part of the GCC manual.
	@c For copying conditions, see the file gcc.texi.
	@c Contributed by David Malcolm <dmalcolm@redhat.com>.

	@node Static Analyzer
	@chapter Static Analyzer
	@cindex analyzer
	@cindex static analysis
	@cindex static analyzer

	@menu
	* Analyzer Internals:: Analyzer Internals
	* Debugging the Analyzer:: Useful debugging tips
	@end menu

	@node Analyzer Internals
	@section Analyzer Internals
	@cindex analyzer, internals
	@cindex static analyzer, internals

	@subsection Overview

	The analyzer implementation works on the gimple-SSA representation.
	(I chose this in the hopes of making it easy to work with LTO to
	do whole-program analysis).

	The implementation is read-only: it doesn't attempt to change anything,
	just emit warnings.

	The gimple representation can be seen using @option{-fdump-ipa-analyzer}.
	@quotation Tip
	If the analyzer ICEs before this is written out, one workaround is to use
	@option{--param=analyzer-bb-explosion-factor=0} to force the analyzer
	to bail out after analyzing the first basic block.
	@end quotation

	First, we build a @code{supergraph} which combines the callgraph and all
	of the CFGs into a single directed graph, with both interprocedural and
	intraprocedural edges. The nodes and edges in the supergraph are called
	``supernodes'' and ``superedges'', and often referred to in code as
	@code{snodes} and @code{sedges}. Basic blocks in the CFGs are split at
	interprocedural calls, so there can be more than one supernode per
	basic block. Most statements will be in just one supernode, but a call
	statement can appear in two supernodes: at the end of one for the call,
	and again at the start of another for the return.

	The supergraph can be seen using @option{-fdump-analyzer-supergraph}.

	We then build an @code{analysis_plan} which walks the callgraph to
	determine which calls might be suitable for being summarized (rather
	than fully explored) and thus in what order to explore the functions.

	Next is the heart of the analyzer: we use a worklist to explore state
	within the supergraph, building an "exploded graph".
	Nodes in the exploded graph correspond to <point,@w{ }state> pairs, as in
	"Precise Interprocedural Dataflow Analysis via Graph Reachability"
	(Thomas Reps, Susan Horwitz and Mooly Sagiv).

	We reuse nodes for <point, state> pairs we've already seen, and avoid
	tracking state too closely, so that (hopefully) we rapidly converge
	on a final exploded graph, and terminate the analysis. We also bail
	out if the number of exploded <end-of-basic-block, state> nodes gets
	larger than a particular multiple of the total number of basic blocks
	(to ensure termination in the face of pathological state-explosion
	cases, or bugs). We also stop exploring a point once we hit a limit
	of states for that point.

	We can identify problems directly when processing a <point,@w{ }state>
	instance. For example, if we're finding the successors of

	@smallexample
	<point: before-stmt: "free (ptr);",
	state: @{"ptr": freed@}>
	@end smallexample

	then we can detect a double-free of "ptr". We can then emit a path
	to reach the problem by finding the simplest route through the graph.

	Program points in the analysis are much more fine-grained than in the
	CFG and supergraph, with points (and thus potentially exploded nodes)
	for various events, including before individual statements.
	By default the exploded graph merges multiple consecutive statements
	in a supernode into one exploded edge to minimize the size of the
	exploded graph. This can be suppressed via
	@option{-fanalyzer-fine-grained}.
	The fine-grained approach seems to make things simpler and more debuggable
	that other approaches I tried, in that each point is responsible for one
	thing.

	Program points in the analysis also have a "call string" identifying the
	stack of callsites below them, so that paths in the exploded graph
	correspond to interprocedurally valid paths: we always return to the
	correct call site, propagating state information accordingly.
	We avoid infinite recursion by stopping the analysis if a callsite
	appears more than @code{analyzer-max-recursion-depth} in a callstring
	(defaulting to 2).

	@subsection Graphs

	Nodes and edges in the exploded graph are called ``exploded nodes'' and
	``exploded edges'' and often referred to in the code as
	@code{enodes} and @code{eedges} (especially when distinguishing them
	from the @code{snodes} and @code{sedges} in the supergraph).

	Each graph numbers its nodes, giving unique identifiers - supernodes
	are referred to throughout dumps in the form @samp{SN': @var{index}} and
	exploded nodes in the form @samp{EN: @var{index}} (e.g. @samp{SN: 2} and
	@samp{EN:29}).

	The supergraph can be seen using @option{-fdump-analyzer-supergraph-graph}.

	The exploded graph can be seen using @option{-fdump-analyzer-exploded-graph}
	and other dump options. Exploded nodes are color-coded in the .dot output
	based on state-machine states to make it easier to see state changes at
	a glance.

	@subsection State Tracking

	There's a tension between:
	@itemize @bullet
	@item
	precision of analysis in the straight-line case, vs
	@item
	exponential blow-up in the face of control flow.
	@end itemize

	For example, in general, given this CFG:

	@smallexample
	A
	/ \
	B C
	\ /
	D
	/ \
	E F
	\ /
	G
	@end smallexample

	we want to avoid differences in state-tracking in B and C from
	leading to blow-up. If we don't prevent state blowup, we end up
	with exponential growth of the exploded graph like this:

	@smallexample

	1:A
	/ \
	/ \
	/ \
	2:B 3:C
	\| \|
	4:D 5:D (2 exploded nodes for D)
	/ \ / \
	6:E 7:F 8:E 9:F
	\| \| \| \|
	10:G 11:G 12:G 13:G (4 exploded nodes for G)

	@end smallexample

	Similar issues arise with loops.

	To prevent this, we follow various approaches:

	@enumerate a
	@item
	state pruning: which tries to discard state that won't be relevant
	later on withing the function.
	This can be disabled via @option{-fno-analyzer-state-purge}.

	@item
	state merging. We can try to find the commonality between two
	program_state instances to make a third, simpler program_state.
	We have two strategies here:

	@enumerate
	@item
	the worklist keeps new nodes for the same program_point together,
	and tries to merge them before processing, and thus before they have
	successors. Hence, in the above, the two nodes for D (4 and 5) reach
	the front of the worklist together, and we create a node for D with
	the merger of the incoming states.

	@item
	try merging with the state of existing enodes for the program_point
	(which may have already been explored). There will be duplication,
	but only one set of duplication; subsequent duplicates are more likely
	to hit the cache. In particular, (hopefully) all merger chains are
	finite, and so we guarantee termination.
	This is intended to help with loops: we ought to explore the first
	iteration, and then have a "subsequent iterations" exploration,
	which uses a state merged from that of the first, to be more abstract.
	@end enumerate

	We avoid merging pairs of states that have state-machine differences,
	as these are the kinds of differences that are likely to be most
	interesting. So, for example, given:

	@smallexample
	if (condition)
	ptr = malloc (size);
	else
	ptr = local_buf;

	.... do things with 'ptr'

	if (condition)
	free (ptr);

	...etc
	@end smallexample

	then we end up with an exploded graph that looks like this:

	@smallexample

	if (condition)
	/ T \ F
	--------- ----------
	/ \
	ptr = malloc (size) ptr = local_buf
	\| \|
	copy of copy of
	"do things with 'ptr'" "do things with 'ptr'"
	with ptr: heap-allocated with ptr: stack-allocated
	\| \|
	if (condition) if (condition)
	\| known to be T \| known to be F
	free (ptr); \|
	\ /
	-----------------------------
	\| ('ptr' is pruned, so states can be merged)
	etc

	@end smallexample

	where some duplication has occurred, but only for the places where the
	the different paths are worth exploringly separately.

	Merging can be disabled via @option{-fno-analyzer-state-merge}.
	@end enumerate

	@subsection Region Model

	Part of the state stored at a @code{exploded_node} is a @code{region_model}.
	This is an implementation of the region-based ternary model described in
	@url{https://www.researchgate.net/publication/221430855_A_Memory_Model_for_Static_Analysis_of_C_Programs,
	"A Memory Model for Static Analysis of C Programs"}
	(Zhongxing Xu, Ted Kremenek, and Jian Zhang).

	A @code{region_model} encapsulates a representation of the state of
	memory, with a @code{store} recording a binding between @code{region}
	instances, to @code{svalue} instances. The bindings are organized into
	clusters, where regions accessible via well-defined pointer arithmetic
	are in the same cluster. The representation is graph-like because values
	can be pointers to regions. It also stores a constraint_manager,
	capturing relationships between the values.

	Because each node in the @code{exploded_graph} has a @code{region_model},
	and each of the latter is graph-like, the @code{exploded_graph} is in some
	ways a graph of graphs.

	Here's an example of printing a @code{program_state}, showing the
	@code{region_model} within it, along with state for the @code{malloc}
	state machine.

	@smallexample
	(gdb) call debug (*this)
	rmodel:
	stack depth: 1
	frame (index 0): frame: ‘test’@@1
	clusters within frame: ‘test’@@1
	cluster for: ptr_3: &HEAP_ALLOCATED_REGION(12)
	m_called_unknown_fn: FALSE
	constraint_manager:
	equiv classes:
	constraints:
	malloc:
	0x2e89590: &HEAP_ALLOCATED_REGION(12): unchecked ('ptr_3')
	@end smallexample

	This is the state at the point of returning from @code{calls_malloc} back
	to @code{test} in the following:

	@smallexample
	void *
	calls_malloc (void)
	@{
	void *result = malloc (1024);
	return result;
	@}

	void test (void)
	@{
	void *ptr = calls_malloc ();
	/* etc. */
	@}
	@end smallexample

	Within the store, there is the cluster for @code{ptr_3} within the frame
	for @code{test}, where the whole cluster is bound to a pointer value,
	pointing at @code{HEAP_ALLOCATED_REGION(12)}. Additionally, this pointer
	has the @code{unchecked} state for the @code{malloc} state machine
	indicating it hasn't yet been checked against NULL since the allocation
	call.

	@subsection Analyzer Paths

	We need to explain to the user what the problem is, and to persuade them
	that there really is a problem. Hence having a @code{diagnostic_path}
	isn't just an incidental detail of the analyzer; it's required.

	Paths ought to be:
	@itemize @bullet
	@item
	interprocedurally-valid
	@item
	feasible
	@end itemize

	Without state-merging, all paths in the exploded graph are feasible
	(in terms of constraints being satisfied).
	With state-merging, paths in the exploded graph can be infeasible.

	We collate warnings and only emit them for the simplest path
	e.g. for a bug in a utility function, with lots of routes to calling it,
	we only emit the simplest path (which could be intraprocedural, if
	it can be reproduced without a caller).

	We thus want to find the shortest feasible path through the exploded
	graph from the origin to the exploded node at which the diagnostic was
	saved. Unfortunately, if we simply find the shortest such path and
	check if it's feasible we might falsely reject the diagnostic, as there
	might be a longer path that is feasible. Examples include the cases
	where the diagnostic requires us to go at least once around a loop for a
	later condition to be satisfied, or where for a later condition to be
	satisfied we need to enter a suite of code that the simpler path skips.

	We attempt to find the shortest feasible path to each diagnostic by
	first constructing a ``trimmed graph'' from the exploded graph,
	containing only those nodes and edges from which there are paths to
	the target node, and using Dijkstra's algorithm to order the trimmed
	nodes by minimal distance to the target.

	We then use a worklist to iteratively build a ``feasible graph''
	(actually a tree), capturing the pertinent state along each path, in
	which every path to a ``feasible node'' is feasible by construction,
	restricting ourselves to the trimmed graph to ensure we stay on target,
	and ordering the worklist so that the first feasible path we find to the
	target node is the shortest possible path. Hence we start by trying the
	shortest possible path, but if that fails, we explore progressively
	longer paths, eventually trying iterations through loops. The
	exploration is captured in the feasible_graph, which can be dumped as a
	.dot file via @option{-fdump-analyzer-feasibility} to visualize the
	exploration. The indices of the feasible nodes show the order in which
	they were created. We effectively explore the tree of feasible paths in
	order of shortest path until we either find a feasible path to the
	target node, or hit a limit and give up.

	This is something of a brute-force approach, but the trimmed graph
	hopefully keeps the complexity manageable.

	This algorithm can be disabled (for debugging purposes) via
	@option{-fno-analyzer-feasibility}, which simply uses the shortest path,
	and notes if it is infeasible.

	The above gives us a shortest feasible @code{exploded_path} through the
	@code{exploded_graph} (a list of @code{exploded_edge *}). We use this
	@code{exploded_path} to build a @code{diagnostic_path} (a list of
	@strong{events} for the diagnostic subsystem) - specifically a
	@code{checker_path}.

	Having built the @code{checker_path}, we prune it to try to eliminate
	events that aren't relevant, to minimize how much the user has to read.

	After pruning, we notify each event in the path of its ID and record the
	IDs of interesting events, allowing for events to refer to other events
	in their descriptions. The @code{pending_diagnostic} class has various
	vfuncs to support emitting more precise descriptions, so that e.g.

	@itemize @bullet
	@item
	a deref-of-unchecked-malloc diagnostic might use:
	@smallexample
	returning possibly-NULL pointer to 'make_obj' from 'allocator'
	@end smallexample
	for a @code{return_event} to make it clearer how the unchecked value moves
	from callee back to caller
	@item
	a double-free diagnostic might use:
	@smallexample
	second 'free' here; first 'free' was at (3)
	@end smallexample
	and a use-after-free might use
	@smallexample
	use after 'free' here; memory was freed at (2)
	@end smallexample
	@end itemize

	At this point we can emit the diagnostic.

	@subsection Limitations

	@itemize @bullet
	@item
	Only for C so far
	@item
	The implementation of call summaries is currently very simplistic.
	@item
	Lack of function pointer analysis
	@item
	The constraint-handling code assumes reflexivity in some places
	(that values are equal to themselves), which is not the case for NaN.
	As a simple workaround, constraints on floating-point values are
	currently ignored.
	@item
	There are various other limitations in the region model (grep for TODO/xfail
	in the testsuite).
	@item
	The constraint_manager's implementation of transitivity is currently too
	expensive to enable by default and so must be manually enabled via
	@option{-fanalyzer-transitivity}).
	@item
	The checkers are currently hardcoded and don't allow for user extensibility
	(e.g. adding allocate/release pairs).
	@item
	Although the analyzer's test suite has a proof-of-concept test case for
	LTO, LTO support hasn't had extensive testing. There are various
	lang-specific things in the analyzer that assume C rather than LTO.
	For example, SSA names are printed to the user in ``raw'' form, rather
	than printing the underlying variable name.
	@end itemize

	Some ideas for other checkers
	@itemize @bullet
	@item
	File-descriptor-based APIs
	@item
	Linux kernel internal APIs
	@item
	Signal handling
	@end itemize

	@node Debugging the Analyzer
	@section Debugging the Analyzer
	@cindex analyzer, debugging
	@cindex static analyzer, debugging

	@subsection Special Functions for Debugging the Analyzer

	The analyzer recognizes various special functions by name, for use
	in debugging the analyzer. Declarations can be seen in the testsuite
	in @file{analyzer-decls.h}. None of these functions are actually
	implemented.

	Add:
	@smallexample
	__analyzer_break ();
	@end smallexample
	to the source being analyzed to trigger a breakpoint in the analyzer when
	that source is reached. By putting a series of these in the source, it's
	much easier to effectively step through the program state as it's analyzed.

	The analyzer handles:

	@smallexample
	__analyzer_describe (0, expr);
	@end smallexample

	by emitting a warning describing the 2nd argument (which can be of any
	type), at a verbosity level given by the 1st argument. This is for use when
	debugging, and may be of use in DejaGnu tests.

	@smallexample
	__analyzer_dump ();
	@end smallexample

	will dump the copious information about the analyzer's state each time it
	reaches the call in its traversal of the source.

	@smallexample
	extern void __analyzer_dump_capacity (const void *ptr);
	@end smallexample

	will emit a warning describing the capacity of the base region of
	the region pointed to by the 1st argument.

	@smallexample
	__analyzer_dump_path ();
	@end smallexample

	will emit a placeholder ``note'' diagnostic with a path to that call site,
	if the analyzer finds a feasible path to it.

	The builtin @code{__analyzer_dump_exploded_nodes} will emit a warning
	after analysis containing information on all of the exploded nodes at that
	program point:

	@smallexample
	__analyzer_dump_exploded_nodes (0);
	@end smallexample

	will output the number of ``processed'' nodes, and the IDs of
	both ``processed'' and ``merger'' nodes, such as:

	@smallexample
	warning: 2 processed enodes: [EN: 56, EN: 58] merger(s): [EN: 54-55, EN: 57, EN: 59]
	@end smallexample

	With a non-zero argument

	@smallexample
	__analyzer_dump_exploded_nodes (1);
	@end smallexample

	it will also dump all of the states within the ``processed'' nodes.

	@smallexample
	__analyzer_dump_region_model ();
	@end smallexample
	will dump the region_model's state to stderr.

	@smallexample
	__analyzer_dump_state ("malloc", ptr);
	@end smallexample

	will emit a warning describing the state of the 2nd argument
	(which can be of any type) with respect to the state machine with
	a name matching the 1st argument (which must be a string literal).
	This is for use when debugging, and may be of use in DejaGnu tests.

	@smallexample
	__analyzer_eval (expr);
	@end smallexample
	will emit a warning with text "TRUE", FALSE" or "UNKNOWN" based on the
	truthfulness of the argument. This is useful for writing DejaGnu tests.


	@subsection Other Debugging Techniques

	The option @option{-fdump-analyzer-json} will dump both the supergraph
	and the exploded graph in compressed JSON form.

	One approach when tracking down where a particular bogus state is
	introduced into the @code{exploded_graph} is to add custom code to
	@code{program_state::validate}.