libgomp: let plugins handle allocating the target variable table

While examining BabelStream results on AMD GCN, I found that each
BabelStream kernel execution spends significant time allocating and
initializing memory in gomp_map_vars: ~55µs, against ~746µs for the
actual BabelStream code.  That alone inflates the time BabelStream
measures by about 7%.

Upon further examination, I found that the only reason gomp_map_vars
allocates and maps any memory in the first place is that it is
constructing the table of pointers to variables on the target, which
I've taken to calling the "target variable table".  Given that the GCN
plugin must already perform some memory allocation before starting up a
kernel, namely to allocate kernel arguments, it would be beneficial to
merge this allocation with the kernel arguments allocation.

In addition, since the kernel arguments live in host memory, they can
be populated using ordinary string functions (memcpy and friends),
without any need for expensive host2dev copies.

This patch introduces an opaque type for "offload sessions".  This type
is defined by each plugin and allows it to store data related to a
single offload job.  The sessions are allocated and managed by libgomp,
and initialized and utilized by the plugin.  Their lifetime starts with
a call to GOMP_OFFLOAD_session_start, and ends with
GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.

The patch then uses this framework to make management of the target
variable table more flexible: the plugin may elect to implement
GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
to attempt to allocate the target variable table in host memory.

If it fails, or if the plugin does not provide this function, libgomp
will perform this allocation as it does today - in target memory - and
tell the session about it using
GOMP_OFFLOAD_session_set_target_var_table.
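
As a rough sketch of that decision (not the code in this patch), the
logic could look like the following.  The struct layouts, the hook field
names and the helpers get_tvt, record_tvt and host_alloc_tvt are all
hypothetical stand-ins, and malloc plays the role of a target-memory
allocation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real types and hook signatures are
   defined by the patch itself.  */
struct session { void *tvt; };

struct device
{
  /* Optional hook: may be NULL, and may return NULL on failure.  */
  void *(*session_allocate_tvt) (struct session *, size_t);
  /* Records a table that the caller allocated itself.  */
  void (*session_set_tvt) (struct session *, void *);
  /* Stand-in for a target-memory allocation.  */
  void *(*alloc) (size_t);
};

/* Example hook implementations, for illustration only.  */
void record_tvt (struct session *s, void *tvt) { s->tvt = tvt; }

void *
host_alloc_tvt (struct session *s, size_t size)
{
  s->tvt = malloc (size);
  return s->tvt;
}

/* Return the table to fill in.  *ON_HOST is nonzero if the plugin
   supplied host memory, zero if the fallback path was taken and a
   host-to-device copy would still be needed.  */
void *
get_tvt (struct device *dev, struct session *s, size_t size, int *on_host)
{
  void *tvt = NULL;

  assert (dev && s && on_host);
  if (dev->session_allocate_tvt)
    tvt = dev->session_allocate_tvt (s, size);
  *on_host = tvt != NULL;

  if (!tvt)
    {
      /* Missing hook, or the hook failed: allocate as before and tell
         the session where the table lives.  */
      tvt = dev->alloc (size);
      dev->session_set_tvt (s, tvt);
    }
  return tvt;
}
```

Either way the session ends up knowing where the table is, so the
run/exec entry points only need the session.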

In the case of AMD GCN, upon a call to
GOMP_OFFLOAD_session_allocate_target_var_table, the plugin will
immediately allocate kernel arguments with enough space for the target
variable table, no matter what size libgomp asks for[1], and return
that pointer to libgomp.
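
To illustrate the idea only: the layout below is a simplified,
hypothetical stand-in for struct kernargs, calloc stands in for the
kernel-argument allocation, and alloc_kernargs, round_tvt_entries and
fill_tvt are made-up names.  The rounding to a multiple of 64 pointers
follows the ChangeLog entry for allocate_session_kernargs:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified layout; the real struct kernargs carries
   different fixed fields.  */
struct kernargs
{
  long fixed_args[4];        /* placeholder for the existing arguments */
  void *target_var_table[];  /* flexible array member, sized at allocation */
};

/* Round COUNT up to a multiple of 64 table entries.  */
size_t
round_tvt_entries (size_t count)
{
  return (count + 63) / 64 * 64;
}

/* Allocate the fixed arguments and the table in a single block.
   calloc is only a stand-in for the plugin's kernarg allocation.  */
struct kernargs *
alloc_kernargs (size_t tvt_entries)
{
  size_t size = sizeof (struct kernargs)
                + round_tvt_entries (tvt_entries) * sizeof (void *);
  return calloc (1, size);
}

/* The block is host-accessible, so filling the table is a plain
   memcpy rather than a host-to-device transfer.  */
void
fill_tvt (struct kernargs *ka, void *const *ptrs, size_t n)
{
  memcpy (ka->target_var_table, ptrs, n * sizeof (void *));
}
```

One allocation then serves both the kernel arguments and the table.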

This results in the runtime of gomp_map_vars effectively disappearing
from traces.

[1] It may be beneficial to cap this at some fixed amount, so that the
    future allocation cache gets a higher hit rate.  Whether that is
    worthwhile may also depend on whether hsa_memory_allocate for
    kernel arguments takes time proportional to the number of bytes it
    needs to allocate.

include/ChangeLog:

	* gomp-constants.h (GOMP_VERSION): Bump.  Signature of
	GOMP_OFFLOAD_run et al changed.

libgomp/ChangeLog:

	* libgomp-plugin.h (GOMP_OFFLOAD_run, GOMP_OFFLOAD_openacc_exec)
	(GOMP_OFFLOAD_async_run, GOMP_OFFLOAD_openacc_async_exec): Pass
	session in place of target variable table and devices.
	(struct gomp_offload_session): New.
	(GOMP_OFFLOAD_session_size): New.
	(GOMP_OFFLOAD_check_session_struct): New.
	(GOMP_OFFLOAD_session_boilerplate): New.
	(GOMP_OFFLOAD_session_start): New.
	(GOMP_OFFLOAD_session_allocate_target_var_table): New.
	(GOMP_OFFLOAD_session_set_target_var_table): New.
	* libgomp.h (struct gomp_target_task): Add offload_session
	field.
	(struct gomp_device_descr): Add offload session management
	functions.
	(gomp_offload_session_new): New.
	(goacc_map_vars): Add SESSION to signature.
	* oacc-host.c (struct gomp_offload_session): Define, for host
	offload fallback case.
	(host_session_size): New.  Implements GOMP_OFFLOAD_session_size.
	(host_session_start): New.  Implements
	GOMP_OFFLOAD_session_start.
	(host_session_set_target_var_table): New.  Implements
	GOMP_OFFLOAD_session_set_target_var_table.
	(host_run): Adjust to match GOMP_OFFLOAD_run.
	(host_openacc_exec): Adjust to match GOMP_OFFLOAD_openacc_exec.
	(host_openacc_async_exec): Adjust to match
	GOMP_OFFLOAD_openacc_async_exec.
	* oacc-mem.c (acc_map_data): Adjust call to goacc_map_vars.
	(goacc_enter_datum): Ditto.
	(goacc_enter_data_internal): Ditto.
	* oacc-parallel.c (GOACC_parallel_keyed): Allocate and pass
	offload session.
	(GOACC_data_start): Adjust call to goacc_map_vars.
	* plugin/plugin-gcn.c (struct kernel_dispatch): Remove
	kernarg_cache_node.
	(struct kernargs): Add a flexible array member for the target
	variable table.
	(struct kernel_launch): Store an offload session rather than
	target var. table pointer.
	(print_kernel_dispatch): Receive kernargs as parameter.
	(struct gomp_offload_session): Define.
	(init_session): New.
	(GOMP_OFFLOAD_session_start): Implement, using init_session.
	(release_session): New.
	(alloc_kernargs_on_agent): Rename to...
	(allocate_session_kernargs): ... this, store result in
	passed-in SESSION, and allocate extra room for target variable
	table (rounding it up to nearest multiple of 64 pointers).
	(GOMP_OFFLOAD_session_allocate_target_var_table): Implement
	using the previous function.
	(GOMP_OFFLOAD_session_set_target_var_table): Ditto.
	(create_kernel_dispatch): Remove kernarg allocation, instead
	receiving it as an argument.
	(release_kernel_dispatch): Receive kernargs as an argument,
	don't release them.
	(run_kernel): Adjust to use sessions.
	(destroy_module): Ditto.
	(GOMP_OFFLOAD_load_image): Ditto.
	(execute_queue_entry): Adjust to match changed struct
	kernel_launch.
	(queue_push_launch): Ditto.
	(gcn_exec): Receive and pass along session.
	(GOMP_OFFLOAD_run): Ditto.
	(GOMP_OFFLOAD_async_run): Ditto.
	(GOMP_OFFLOAD_openacc_exec): Ditto.
	(GOMP_OFFLOAD_openacc_async_exec): Ditto.
	* plugin/plugin-nvptx.c (struct gomp_offload_session): Define.
	(GOMP_OFFLOAD_session_start): Implement.
	(GOMP_OFFLOAD_session_set_target_var_table): Implement.
	(GOMP_OFFLOAD_openacc_exec): Adjust to receive session.
	(GOMP_OFFLOAD_openacc_async_exec): Ditto.
	(GOMP_OFFLOAD_run): Ditto.
	* target.c (gomp_get_tvt_size): Extract helper from...
	(gomp_map_vars_internal): ... here.  Receive SESSION, iff doing
	target offload.  Use a target variable table on the host
	allocated by GOMP_OFFLOAD_session_allocate_target_var_table if
	possible, or call GOMP_OFFLOAD_session_set_target_var_table with
	an allocated device pointer otherwise.
	(gomp_map_vars): Update to pass along session.
	(goacc_map_vars): Ditto.
	(GOMP_target): Allocate and pass along session.
	(GOMP_target_ext): Ditto.
	(gomp_target_data_fallback): Adjust call to gomp_map_vars.
	(GOMP_target_data): Ditto.
	(GOMP_target_data_ext): Ditto.
	(GOMP_target_enter_exit_data): Ditto.
	(gomp_target_task_fn): Start and pass along session, the storage
	for which is allocated by gomp_create_target_task.
	(DLSYM2): Rename from DLSYM, adding a new parameter for the
	variable to populate, akin to DLSYM_OPT.
	(DLSYM): Delegate to DLSYM2.
	(gomp_load_plugin_for_device): Populate session-related fields.
	* task.c (gomp_create_target_task): Allocate enough storage for
	an offload session.
	* testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c: New test.
	* testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c: New test.
12 files changed