vect: support vectorization of early break forced live IVs as scalar
Consider this simple loop:
  long long arr[1024];
  long long *f()
  {
    int i;
    for (i = 0; i < 1024; i++)
      if (arr[i] == 42)
        break;
    return arr + i;
  }
where today we generate this at -O3:
.L2:
        add     v29.4s, v29.4s, v25.4s
        add     v28.4s, v28.4s, v26.4s
        cmp     x2, x1
        beq     .L9
.L6:
        ldp     q30, q31, [x1], 32
        cmeq    v30.2d, v30.2d, v27.2d
        cmeq    v31.2d, v31.2d, v27.2d
        addhn   v31.2s, v31.2d, v30.2d
        fmov    x3, d31
        cbz     x3, .L2
which is highly inefficient. This loop has 3 IVs (PR119577): one normal
scalar one and two vector ones, one counting up and one counting down
(PR115120). It also has a forced unrolling due to an increase in VF because of
the mismatch in modes between the IVs and the loop body (PR119860).
This patch fixes all three of these issues and we now generate:
.L2:
        add     w2, w2, 2
        cmp     w2, 1024
        beq     .L13
.L5:
        ldr     q31, [x1]
        add     x1, x1, 16
        cmeq    v31.2d, v31.2d, v30.2d
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x0, d31
        cbz     x0, .L2
or with SVE:
.L3:
        add     x1, x1, x3
        whilelo p7.d, w1, w2
        b.none  .L11
.L4:
        ld1d    z30.d, p7/z, [x0, x1, lsl 3]
        cmpeq   p7.d, p7/z, z30.d, z31.d
        b.none  .L3
which shows that IVopts efficiently merges the new scalar IV with the loop
control IV.
To accomplish this the patch reworks how we handle "forced live inductions"
with regard to vectorization.
Prior to this change, when we vectorized a loop with early break, any induction
variables would be forced live. Forcing live means that even though the values
aren't used inside the loop, we must preserve them so that when we start the
scalar loop we can pass the correct initial values.
However this had several side-effects:
1. We must be able to vectorize the induction.
2. The induction variable participates in VF determination. This would often
lead to a higher VF than would normally have been needed. As such the
vector loops become less profitable.
3. IVcanon on constant loop iterations inserts a downward counting IV in
addition to the upwards one in order to support things like doloops.
Normally this duplicate IV is removed by IVopts, but IVopts doesn't understand
vector inductions. As such we end up with 3 IVs.
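Side-effect 2 can be illustrated with a little arithmetic. The sketch below is
a deliberately simplified model and not the vectorizer's actual VF computation:
it assumes a fixed 128-bit vector and that VF is set by the narrowest element
type taking part in the loop. The helper names (lanes, vf_for,
vectors_per_iter) are hypothetical and exist only for this illustration.

```c
/* Simplified model: assumes 128-bit Advanced SIMD vectors.  */
enum { VECTOR_BITS = 128 };

/* Number of elements of ELEM_BITS bits that fit in one vector.  */
static int lanes (int elem_bits)
{
  return VECTOR_BITS / elem_bits;
}

/* Assumed rule: the narrowest participating element decides the VF.  */
static int vf_for (int smallest_elem_bits)
{
  return lanes (smallest_elem_bits);
}

/* Vectors of ELEM_BITS-wide data needed to cover VF scalar iterations.  */
static int vectors_per_iter (int vf, int elem_bits)
{
  return vf / lanes (elem_bits);
}
```

Under these assumptions, a 32-bit int IV that participates gives
vf_for (32) == 4, so the 64-bit loads need vectors_per_iter (4, 64) == 2
vectors per iteration, which corresponds to the ldp of two q registers in the
first listing. With only the 64-bit data participating, vf_for (64) == 2 and a
single ldr suffices.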
This patch fixes all three of these by choosing instead to create a new scalar
IV that's adjusted within the loop and to update all the IV statements outside
the loop by using this new IV.
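As a rough scalar model of the resulting loop structure (a sketch for
illustration only, not what the vectorizer emits), the function below uses a
single scalar IV that steps by an assumed VF of 2 in the "vector" loop and is
then handed unchanged to the scalar loop. The names f_model, N and VF are
assumptions made for this example.

```c
#define N 1024
#define VF 2 /* assumed vectorization factor: two 64-bit lanes */

long long arr[N];

/* Scalar model of the vectorized early-break loop with one scalar IV.  */
long long *f_model (void)
{
  int i = 0; /* the single scalar IV */

  /* Models the vector loop: each iteration tests VF elements at once.  */
  for (; i + VF <= N; i += VF)
    {
      int any = 0;
      for (int lane = 0; lane < VF; lane++)
        any |= (arr[i + lane] == 42);
      if (any)
        break; /* early exit: i is still the start of the matching block */
    }

  /* Models the scalar loop: it resumes from the scalar IV's value and
     locates the exact element.  */
  for (; i < N; i++)
    if (arr[i] == 42)
      break;

  return arr + i;
}
```

Because the IV is scalar, every use outside the vector loop can read it
directly, and there is no vector induction whose value needs to be extracted.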
We re-use vect_update_ivs_after_vectorizer for all exits now and put in a dummy
value representing the IV that is to be generated later.
To do this we delay the call to vect_update_ivs_after_vectorizer until after
the skip_epilogue edge is created, and vect_update_ivs_after_vectorizer now
updates all out of loop usages of IVs and not just those on the merge edge to
the scalar loop. This not only generates better code, but also removes the need
to fix up the "forced live" scalar IVs later on.
This new scalar IV is then materialized in
vect_update_ivs_after_vectorizer_for_early_breaks. When peeling for alignment
(PFA) uses masks to skip iterations, we now roll the PFA IV into the new scalar
IV by adjusting the first iteration back to start - niters_peel and then taking
MAX <scal_iv, 0> to correctly handle the first iteration.
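The masked PFA adjustment can be modelled in scalar C as follows. This is a
sketch under stated assumptions, not the code the patch generates: VF is
assumed to be 4, start is taken to be 0, niters_peel is assumed to be smaller
than VF, and find_model and max_int are hypothetical names used only here.

```c
#define VF 4 /* assumed vectorization factor */

static int max_int (int a, int b)
{
  return a > b ? a : b;
}

/* Return the index of the first element of A equal to KEY, or N if there is
   none.  NITERS_PEEL models the number of leading lanes that are masked off
   in the first vector iteration.  */
int find_model (const long long *a, int n, long long key, int niters_peel)
{
  /* start - niters_peel with start == 0: the IV begins below zero.  */
  int scal_iv = -niters_peel;

  /* Models the masked vector loop.  */
  for (; scal_iv + VF <= n; scal_iv += VF)
    {
      int any = 0;
      for (int lane = 0; lane < VF; lane++)
        {
          int idx = scal_iv + lane;
          if (idx < 0)
            continue; /* lane masked off by PFA */
          any |= (a[idx] == key);
        }
      if (any)
        break;
    }

  /* MAX <scal_iv, 0>: an exit in the first iteration leaves scal_iv
     negative, so clamp it before the scalar loop takes over.  */
  int i = max_int (scal_iv, 0);
  for (; i < n; i++)
    if (a[i] == key)
      break;
  return i;
}
```

Without the clamp, an early exit in the first iteration would hand a negative
index to the scalar loop.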
Because we are now re-using vect_update_ivs_after_vectorizer we have an issue
with UB clamping on non-linear inductions.
At the moment, when updating for an early exit, I just ignore the possibility
of UB: if the main exit is OK, the early exit is one iteration behind the main
one and so should be OK too.
Things however get complicated with PEELED loops.
gcc/ChangeLog:
PR tree-optimization/115120
PR tree-optimization/119577
PR tree-optimization/119860
* tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for nonlinear
mult induction and early break.
(vect_update_ivs_after_vectorizer): Support early break exits.
(vect_do_peeling): Support scalar IVs.
* tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early break.
(vect_update_nonlinear_iv): Use `unsigned_type_for` such that function
works for both vector and scalar types.
(vectorizable_induction, vectorizable_live_operation): Remove vector
early break IV code.
(vect_update_ivs_after_vectorizer_for_early_breaks): New.
(vect_transform_loop): Support new scalar IV for early break.
* tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early break
IVs.
* tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break IVs as
completely unused rather than used_only_live. They no longer
contribute to the vector loop and so should not be analyzed.
(can_vectorize_live_stmts): Remove vector early break IV code.
* tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
(class loop_vec_info): Add early_break_niters_var.
gcc/testsuite/ChangeLog:
PR tree-optimization/115120
PR tree-optimization/119577
PR tree-optimization/119860
* gcc.dg/vect/vect-early-break_39.c: Update.
* gcc.dg/vect/vect-early-break_139.c: New testcase.
* gcc.target/aarch64/sve/peel_ind_10.c: Update.
* gcc.target/aarch64/sve/peel_ind_11.c: Update.
* gcc.target/aarch64/sve/peel_ind_12.c: Update.
* gcc.target/aarch64/sve/peel_ind_5.c: Update.
* gcc.target/aarch64/sve/peel_ind_6.c: Update.
* gcc.target/aarch64/sve/peel_ind_7.c: Update.
* gcc.target/aarch64/sve/peel_ind_9.c: Update.
* gcc.target/aarch64/sve/pr119351.c
15 files changed