x86: Properly implement AMX-TILE load/store intrinsics

ldtilecfg and sttilecfg take a 512-byte memory block.  With
_tile_loadconfig implemented as

extern __inline void
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_tile_loadconfig (const void *__config)
{
  __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void **)__config)));
}

GCC sees:

(parallel [
  (asm_operands/v ("ldtilecfg	%X0") ("") 0
   [(mem/f/c:DI (plus:DI (reg/f:DI 77 virtual-stack-vars)
                         (const_int -64 [0xffffffffffffffc0])) [1 MEM[(const void * *)&tile_data]+0 S8 A128])]
   [(asm_input:DI ("m"))]
   (clobber (reg:CC 17 flags))])

and the memory operand size is 1 byte.  As the result, the rest of 511
bytes is ignored by GCC.  Implement ldtilecfg and sttilecfg intrinsics
with a pointer to XImode to honor the 512-byte memory block.

gcc/ChangeLog:

	PR target/114098
	* config/i386/amxtileintrin.h (_tile_loadconfig): Use
	__builtin_ia32_ldtilecfg.
	(_tile_storeconfig): Use __builtin_ia32_sttilecfg.
	* config/i386/i386-builtin.def (BDESC): Add
	__builtin_ia32_ldtilecfg and __builtin_ia32_sttilecfg.
	* config/i386/i386-expand.cc (ix86_expand_builtin): Handle
	IX86_BUILTIN_LDTILECFG and IX86_BUILTIN_STTILECFG.
	* config/i386/i386.md (ldtilecfg): New pattern.
	(sttilecfg): Likewise.

gcc/testsuite/ChangeLog:

	PR target/114098
	* gcc.target/i386/amxtile-4.c: New test.

(cherry picked from commit 4972f97a265c574d51e20373ddefd66576051e5c)
5 files changed