Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) /* SPDX-License-Identifier: GPL-2.0 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) /*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3)  * arch/alpha/lib/ev6-copy_page.S
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4)  *
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5)  * Copy an entire page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6)  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) /* The following comparison of this routine vs the normal copy_page.S
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9)    was written by an unnamed ev6 hardware designer and forwarded to me
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10)    via Steven Hobbs <hobbs@steven.zko.dec.com>.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11)  
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12)    First Problem: STQ overflows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13)    -----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) 	It would be nice if EV6 handled every resource overflow efficiently,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 	but for some it doesn't, store queue overflows included.  Such an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) 	overflow causes a trap and a restart of the pipe.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) 	To get around this we sometimes use (to borrow a term from a VSSAD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) 	researcher) "aeration".  The idea is to slow the rate at which the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 	processor receives valid instructions by inserting nops in the fetch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) 	path.  In doing so, you can prevent the overflow and actually make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) 	the code run faster.  You can, of course, take advantage of the fact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 	that the processor can fetch at most 4 aligned instructions per cycle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 	I inserted enough nops to force it to take 10 cycles to fetch the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) 	loop code.  In theory, EV6 should be able to execute this loop in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 	9 cycles but I was not able to get it to run that fast -- the initial
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) 	conditions were such that I could not reach this optimum rate on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 	(chaotic) EV6.  I wrote the code such that everything would issue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) 	in order. 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33)    Second Problem: Dcache index matches.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34)    -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 	If you are going to use this routine on random aligned pages, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 	is a 25% chance that the pages will be at the same dcache indices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 	Without care, this results in many nasty memory traps.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 	The solution is to schedule the prefetches to avoid the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 	conflicts.  I schedule the wh64 prefetches farther ahead of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 	read prefetches to avoid this problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44)    Third Problem: Needs more prefetching.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45)    --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) 	In order to improve the code I added deeper prefetching to take the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) 	most advantage of EV6's bandwidth.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 	I also prefetched the read stream. Note that adding the read prefetch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 	forced me to add another cycle to the inner-most kernel - up to 11
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 	from the original 8 cycles per iteration.  We could improve performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) 	further by unrolling the loop and doing multiple prefetches per cycle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55)    I think that the code below will be very robust and fast code for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56)    purposes of copying aligned pages.  It is slower when both source and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57)    destination pages are in the dcache, but it is my guess that this is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58)    less important than the dcache miss case.  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) #include <asm/export.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) 	.text
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) 	.align 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 	.global copy_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 	.ent copy_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) copy_page:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) 	.prologue 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 	/* Prefetch 5 read cache lines; write-hint 10 cache lines.  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) 	wh64	($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 	ldl	$31,0($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 	ldl	$31,64($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 	lda	$1,1*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) 	wh64	($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) 	ldl	$31,128($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 	ldl	$31,192($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 	lda	$1,2*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) 	wh64	($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) 	ldl	$31,256($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) 	lda	$18,118
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 	lda	$1,3*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) 	wh64	($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 	nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 	lda	$1,4*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 	lda	$2,5*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 	wh64	($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) 	wh64	($2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) 	lda	$1,6*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) 	lda	$2,7*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) 	wh64	($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 	wh64	($2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 	lda	$1,8*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 	lda	$2,9*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 	wh64	($1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 	wh64	($2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 	lda	$19,10*64($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 	nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) 	/* Main prefetching/write-hinting loop.  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 1:	ldq	$0,0($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 	ldq	$1,8($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 	ldq	$2,16($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 	ldq	$3,24($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 	ldq	$4,32($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 	ldq	$5,40($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 	ldq	$6,48($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 	ldq	$7,56($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 	ldl	$31,320($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) 	/* This gives the extra cycle of aeration above the minimum.  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 	wh64	($19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 	stq	$0,0($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 	subq	$18,1,$18
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 	stq	$1,8($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 	stq	$2,16($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) 	addq	$17,64,$17
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 	stq	$3,24($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 	stq	$4,32($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 	stq	$5,40($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 	addq	$19,64,$19
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 	stq	$6,48($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 	stq	$7,56($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 	addq	$16,64,$16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 	bne	$18, 1b
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 	/* Prefetch the final 5 cache lines of the read stream.  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 	lda	$18,10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 	ldl	$31,320($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 	ldl	$31,384($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 	ldl	$31,448($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 	ldl	$31,512($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 	ldl	$31,576($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 	nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 	nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) 	/* Non-prefetching, non-write-hinting cleanup loop for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 	   final 10 cache lines.  */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 2:	ldq	$0,0($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) 	ldq	$1,8($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) 	ldq	$2,16($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) 	ldq	$3,24($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) 	ldq	$4,32($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) 	ldq	$5,40($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) 	ldq	$6,48($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 	ldq	$7,56($17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) 	stq	$0,0($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) 	subq	$18,1,$18
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 	stq	$1,8($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) 	addq	$17,64,$17
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) 	stq	$2,16($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 	stq	$3,24($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) 	stq	$4,32($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) 	stq	$5,40($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) 	stq	$6,48($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) 	stq	$7,56($16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) 	addq	$16,64,$16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) 	bne	$18, 2b
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) 	ret
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) 	nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) 	unop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) 	nop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) 	.end copy_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) 	EXPORT_SYMBOL(copy_page)
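
The loop counts in the routine above can be cross-checked with a small C model. This is only a sketch of the control flow, not the kernel's implementation: the names copy_line, copy_page_model, and the enum constants are hypothetical, and the wh64 write hints and ldl prefetches are left out because they have no portable C equivalent. It assumes both buffers are 8-byte aligned. The main loop runs 118 times and the cleanup loop 10 times, so 128 cache lines of 64 bytes are copied, which is 8192 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    LINE_BYTES    = 64,   /* stride used by the addq ...,64 instructions */
    MAIN_LINES    = 118,  /* iteration count of loop 1 (lda $18,118) */
    CLEANUP_LINES = 10,   /* iteration count of loop 2 (lda $18,10) */
    PAGE_BYTES    = (MAIN_LINES + CLEANUP_LINES) * LINE_BYTES
};

/* Hypothetical helper: copy one cache line as eight 64-bit words,
   mirroring the eight ldq/stq pairs per iteration. */
static void copy_line(uint64_t *dst, const uint64_t *src)
{
    for (size_t i = 0; i < LINE_BYTES / sizeof(uint64_t); i++)
        dst[i] = src[i];
}

/* Model of the two-loop structure; returns the number of lines copied. */
static int copy_page_model(void *to, const void *from)
{
    /* Assumption of this sketch: callers pass 8-byte aligned buffers. */
    assert((uintptr_t)to % sizeof(uint64_t) == 0);
    assert((uintptr_t)from % sizeof(uint64_t) == 0);

    uint64_t *dst = to;
    const uint64_t *src = from;
    int lines = 0;

    /* Main loop. In the assembly, each pass also prefetches the source
       line 5 ahead and write-hints the destination line 10 ahead. */
    for (int n = MAIN_LINES; n != 0; n--) {
        copy_line(dst, src);
        src += LINE_BYTES / sizeof(uint64_t);
        dst += LINE_BYTES / sizeof(uint64_t);
        lines++;
    }

    /* Cleanup loop for the final 10 lines. A write hint 10 lines ahead
       would now fall outside the destination page, so none is issued. */
    for (int n = CLEANUP_LINES; n != 0; n--) {
        copy_line(dst, src);
        src += LINE_BYTES / sizeof(uint64_t);
        dst += LINE_BYTES / sizeof(uint64_t);
        lines++;
    }

    return lines;
}
```

Calling copy_page_model on two 8192-byte buffers should copy the whole page and report 128 lines, which matches the 118 + 10 split used by the assembly.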