^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) /* SPDX-License-Identifier: GPL-2.0 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) /*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) * arch/alpha/lib/ev6-copy_page.S
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) *
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) * Copy an entire page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) /* The following comparison of this routine vs the normal copy_page.S
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) was written by an unnamed ev6 hardware designer and forwarded to me
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) via Steven Hobbs <hobbs@steven.zko.dec.com>.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) First Problem: STQ overflows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) -----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) It would be nice if EV6 handled every resource overflow efficiently,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) but for some it doesn't, including store queue overflows. Such an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) overflow causes a trap and a restart of the pipe.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) To get around this we sometimes use (to borrow a term from a VSSAD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) researcher) "aeration". The idea is to slow the rate at which the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) processor receives valid instructions by inserting nops in the fetch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) path. In doing so, you can prevent the overflow and actually make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) the code run faster. You can, of course, take advantage of the fact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) that the processor can fetch at most 4 aligned instructions per cycle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) I inserted enough nops to force it to take 10 cycles to fetch the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) loop code. In theory, EV6 should be able to execute this loop in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) 9 cycles but I was not able to get it to run that fast -- the initial
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) conditions were such that I could not reach this optimum rate on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) (chaotic) EV6. I wrote the code such that everything would issue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) in order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Second Problem: Dcache index matches.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) If you are going to use this routine on random aligned pages, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) is a 25% chance that the pages will be at the same dcache indices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) Without care, this results in many nasty memory traps.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) The solution is to schedule the prefetches to avoid the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) conflicts. I schedule the wh64 prefetches farther ahead of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) read prefetches to avoid this problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) Third Problem: Needs more prefetching.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) To improve the code, I added deeper prefetching to take full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) advantage of EV6's bandwidth.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) I also prefetched the read stream. Note that adding the read prefetch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) forced me to add another cycle to the innermost kernel - up to 11
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) from the original 8 cycles per iteration. We could improve performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) further by unrolling the loop and doing multiple prefetches per cycle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) I think that the code below will be very robust and fast code for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) purposes of copying aligned pages. It is slower when both source and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) destination pages are in the dcache, but it is my guess that this is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) less important than the dcache miss case. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) #include <asm/export.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) .text
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) .align 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) .global copy_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) .ent copy_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) copy_page:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) .prologue 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) /* Prefetch 5 read cache lines; write-hint 10 cache lines. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) wh64 ($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) ldl $31,0($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) ldl $31,64($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) lda $1,1*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) wh64 ($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) ldl $31,128($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) ldl $31,192($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) lda $1,2*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) wh64 ($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) ldl $31,256($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) lda $18,118
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) lda $1,3*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) wh64 ($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) lda $1,4*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) lda $2,5*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) wh64 ($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) wh64 ($2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) lda $1,6*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) lda $2,7*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) wh64 ($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) wh64 ($2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) lda $1,8*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) lda $2,9*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) wh64 ($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) wh64 ($2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) lda $19,10*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) /* Main prefetching/write-hinting loop: one 64-byte cache line per pass. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 1: ldq $0,0($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) ldq $1,8($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) ldq $2,16($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) ldq $3,24($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) ldq $4,32($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) ldq $5,40($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) ldq $6,48($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) ldq $7,56($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) ldl $31,320($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) /* This gives the extra cycle of aeration above the minimum. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) wh64 ($19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) stq $0,0($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) subq $18,1,$18
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) stq $1,8($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) stq $2,16($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) addq $17,64,$17
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) stq $3,24($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) stq $4,32($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) stq $5,40($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) addq $19,64,$19
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) stq $6,48($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) stq $7,56($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) addq $16,64,$16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) bne $18, 1b
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) /* Prefetch the final 5 cache lines of the read stream. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) lda $18,10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) ldl $31,320($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) ldl $31,384($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) ldl $31,448($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) ldl $31,512($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) ldl $31,576($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) /* Non-prefetching, non-write-hinting cleanup loop for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) final 10 cache lines. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 2: ldq $0,0($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) ldq $1,8($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) ldq $2,16($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) ldq $3,24($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) ldq $4,32($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) ldq $5,40($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) ldq $6,48($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) ldq $7,56($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) stq $0,0($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) subq $18,1,$18
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) stq $1,8($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) addq $17,64,$17
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) stq $2,16($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) stq $3,24($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) stq $4,32($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) stq $5,40($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) stq $6,48($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) stq $7,56($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) addq $16,64,$16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) bne $18, 2b
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) ret
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) .end copy_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) EXPORT_SYMBOL(copy_page)