vect: support vectorization of early break forced live IVs as scalar

Consider this simple loop:

long long arr[1024];
long long *f()
{
    int i;
    for (i = 0; i < 1024; i++)
      if (arr[i] == 42)
        break;
    return arr + i;
}

where today we generate this at -O3:

.L2:
        add     v29.4s, v29.4s, v25.4s
        add     v28.4s, v28.4s, v26.4s
        cmp     x2, x1
        beq     .L9
.L6:
        ldp     q30, q31, [x1], 32
        cmeq    v30.2d, v30.2d, v27.2d
        cmeq    v31.2d, v31.2d, v27.2d
        addhn   v31.2s, v31.2d, v30.2d
        fmov    x3, d31
        cbz     x3, .L2

which is highly inefficient.  This loop has 3 IVs (PR119577): one normal
scalar one and two vector ones, one counting up and one counting down
(PR115120).  It is also forcibly unrolled because the VF is increased by the
mismatch in modes between the IVs and the loop body (PR119860).

This patch fixes all three of these issues and we now generate:

.L2:
        add     w2, w2, 2
        cmp     w2, 1024
        beq     .L13
.L5:
        ldr     q31, [x1]
        add     x1, x1, 16
        cmeq    v31.2d, v31.2d, v30.2d
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x0, d31
        cbz     x0, .L2

or with SVE:

.L3:
        add     x1, x1, x3
        whilelo p7.d, w1, w2
        b.none  .L11
.L4:
        ld1d    z30.d, p7/z, [x0, x1, lsl 3]
        cmpeq   p7.d, p7/z, z30.d, z31.d
        b.none  .L3

which shows that IVopts efficiently merges the new scalar IV with the loop
control one.

To accomplish this the patch reworks how we handle "forced live inductions"
with regard to vectorization.

Prior to this change, when we vectorized a loop with early break, any induction
variables would be forced live.  Forcing live means that even though the values
aren't used inside the loop, we must preserve them so that when we start the
scalar loop we can pass the correct initial values.

However this had several side-effects:

1. We must be able to vectorize the induction.
2. The induction variable participates in VF determination.  This would often
   lead to a higher VF than would normally have been needed.  As such the
   vector loops become less profitable.
3. IVcanon on constant loop iterations inserts a downward counting IV in
   addition to the upwards one in order to support things like doloops.
   Normally this duplicate IV is removed by IVopts, but IVopts doesn't
   understand vector inductions.  As such we end up with 3 IVs.
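
As a rough illustration of point 2, the VF arithmetic for the example loop can
be sketched as below.  This is a simplified model and not vectorizer code:
lanes and model_vf are hypothetical helpers, and the model just assumes the VF
follows the statement needing the most lanes.

```c
/* Hypothetical helper, not a GCC function: number of lanes an element of
   ELEM_BYTES bytes gets in a vector of VECTOR_BITS bits.  */
unsigned
lanes (unsigned vector_bits, unsigned elem_bytes)
{
  return vector_bits / (elem_bytes * 8);
}

/* Simplified model (an assumption): the VF is the largest lane count among
   the statements that take part in vectorization.  The IV only takes part
   when it is vectorized, i.e. when it is forced live as a vector IV.  */
unsigned
model_vf (unsigned vector_bits, unsigned body_bytes, unsigned iv_bytes,
	  int iv_is_vectorized)
{
  unsigned vf = lanes (vector_bits, body_bytes);
  if (iv_is_vectorized && lanes (vector_bits, iv_bytes) > vf)
    vf = lanes (vector_bits, iv_bytes);
  return vf;
}
```

With 128-bit vectors, an 8-byte long long body and a 4-byte int IV,
model_vf (128, 8, 4, 1) gives 4 while model_vf (128, 8, 4, 0) gives 2.  This
matches the two loads per iteration in the old code and the single load in the
new code.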

This patch fixes all three of these by choosing instead to create a new scalar
IV that's adjusted within the loop and to update all the IV statements outside
the loop by using this new IV.
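
The effect on the example can be sketched with a scalar model in C.  This is
only an illustration of the idea, not the GIMPLE the vectorizer emits:
f_model, scal_iv and the fixed VF of 2 are assumptions made for the sketch.

```c
#define N 1024
#define VF 2	/* Assumed: two 64-bit lanes per 128-bit vector.  */

long long arr[N];

/* Scalar model of the vectorized f () from the example above.  */
long long *
f_model (void)
{
  /* The single scalar IV.  It advances by VF per vector iteration and is
     only bumped once the current block is known not to contain a match.  */
  int scal_iv = 0;
  for (; scal_iv < N; scal_iv += VF)
    {
      int any_match = 0;	/* Models the vector compare and reduction.  */
      for (int lane = 0; lane < VF; lane++)
	any_match |= arr[scal_iv + lane] == 42;
      if (any_match)
	break;			/* Early exit: scal_iv is the block start.  */
    }

  /* Code outside the vector loop derives i from scal_iv.  After an early
     exit the scalar loop rescans the block that contained the match.  */
  int i = scal_iv;
  for (; i < N; i++)
    if (arr[i] == 42)
      break;
  return arr + i;
}
```

If arr[5] is 42 the model leaves the block loop with scal_iv == 4 and the
scalar loop then stops at i == 5.  With no match it returns arr + 1024, like
the original f ().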

We now re-use vect_update_ivs_after_vectorizer for all exits and put in a dummy
value representing the IV that is to be generated later.

To do this we delay the call to vect_update_ivs_after_vectorizer until after
the skip_epilogue edge is created, and vect_update_ivs_after_vectorizer now
updates all out-of-loop uses of IVs and not just those on the merge edge to
the scalar loop.  This not only generates better code, but also removes the
need to fix up the "forced live" scalar IVs later on.

This new scalar IV is then materialized in
vect_update_ivs_after_vectorizer_for_early_breaks.  When peeling for alignment
(PFA) is done using masks, i.e. by skipping iterations, we now roll up the PFA
IV into the new scalar IV by starting the first iteration back at
start - niters_peel and then taking MAX <scal_iv, 0> to correctly handle the
first iteration.
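
One possible reading of the masked PFA case, again as a scalar model rather
than vectorizer output, is sketched below.  The function find_model, the fixed
VF of 4 and the assumption 0 <= niters_peel < VF are all made up for the
illustration.

```c
#define VF 4	/* Assumed vectorization factor for this sketch.  */

/* Return the index of the first element of A[0..N) equal to KEY, or N if
   there is none.  NITERS_PEEL models the lanes masked off at the start of
   the first vector iteration.  */
long
find_model (const long long *a, long n, long niters_peel, long long key)
{
  int early_exit = 0;
  long scal_iv = 0 - niters_peel;	/* start - niters_peel  */
  for (; scal_iv < n; scal_iv += VF)
    {
      for (long lane = 0; lane < VF; lane++)
	{
	  long idx = scal_iv + lane;
	  if (idx < 0 || idx >= n)
	    continue;			/* Lane is masked off.  */
	  early_exit |= a[idx] == key;
	}
      if (early_exit)
	break;
    }
  if (!early_exit)
    return n;				/* Main exit: nothing found.  */

  /* In the first iteration scal_iv can still be negative, so clamp it
     before handing it to the scalar loop: MAX <scal_iv, 0>.  */
  long i = scal_iv > 0 ? scal_iv : 0;
  for (; i < n; i++)
    if (a[i] == key)
      break;
  return i;
}
```

With niters_peel == 3 and a match at index 0, the early exit is taken with
scal_iv == -3.  Without the MAX the scalar loop would start before the
beginning of the array.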

Because we are now re-using vect_update_ivs_after_vectorizer we have an issue
with UB clamping on non-linear inductions.

At the moment, when doing early exit updating, I just ignore the possibility of
UB: if the main exit is OK, the early exit is one iteration behind the main
one and so should be OK too.

Things however get complicated with PEELED loops.

gcc/ChangeLog:

	PR tree-optimization/115120
	PR tree-optimization/119577
	PR tree-optimization/119860
	* tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for nonlinear
	mult induction and early break.
	(vect_update_ivs_after_vectorizer): Support early break exits.
	(vect_do_peeling): Support scalar IVs.
	* tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early break.
	(vect_update_nonlinear_iv): Use `unsigned_type_for` such that function
	works for both vector and scalar types.
	(vectorizable_induction, vectorizable_live_operation): Remove vector
	early break IV code.
	(vect_update_ivs_after_vectorizer_for_early_breaks): New.
	(vect_transform_loop): Support new scalar IV for early break.
	* tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early break
	IVs.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break IVs as
	completely unused rather than used_only_live.  They no longer
	contribute to the vector loop and so should not be analyzed.
	(can_vectorize_live_stmts): Remove vector early break IV code.
	* tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
	(class loop_vec_info): Add early_break_niters_var.

gcc/testsuite/ChangeLog:

	PR tree-optimization/115120
	PR tree-optimization/119577
	PR tree-optimization/119860
	* gcc.dg/vect/vect-early-break_39.c: Update.
	* gcc.dg/vect/vect-early-break_139.c: New testcase.
	* gcc.target/aarch64/sve/peel_ind_10.c: Update.
	* gcc.target/aarch64/sve/peel_ind_11.c: Update.
	* gcc.target/aarch64/sve/peel_ind_12.c: Update.
	* gcc.target/aarch64/sve/peel_ind_5.c: Update.
	* gcc.target/aarch64/sve/peel_ind_6.c: Update.
	* gcc.target/aarch64/sve/peel_ind_7.c: Update.
	* gcc.target/aarch64/sve/peel_ind_9.c: Update.
	* gcc.target/aarch64/sve/pr119351.c