Orange Pi5 kernel

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) ==========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) NAND Error-correction Code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ==========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) I felt there was room for optimisation. I bashed the code for a few hours
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) performing tricks like table lookup removing superfluous code etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) After that the speed was increased by 35-40%.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) Still I was not too happy as I felt there was additional room for improvement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) Bad! I was hooked.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) I decided to annotate my steps in this file. Perhaps it is useful to someone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) or someone learns something from it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) The problem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) NAND flash (at least SLC one) typically has sectors of 256 bytes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) However NAND flash is not extremely reliable so some error detection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) (and sometimes correction) is needed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) This is done by means of a Hamming code. I'll try to explain it in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) laymans terms (and apologies to all the pro's in the field in case I do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) not use the right terminology, my coding theory class was almost 30
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) years ago, and I must admit it was not one of my favourites).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) As I said before the ecc calculation is performed on sectors of 256
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) bytes. This is done by calculating several parity bits over the rows and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) columns. The parity used is even parity which means that the parity bit = 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) if the data over which the parity is calculated is 1 and the parity bit = 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) if the data over which the parity is calculated is 0. So the total
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) number of bits over the data over which the parity is calculated + the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) parity bit is even. (see wikipedia if you can't follow this).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) Parity is often calculated by means of an exclusive or operation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) sometimes also referred to as xor. In C the operator for xor is ^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) Back to ecc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) Let's give a small figure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) =========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ====
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53)            cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54)            cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55)            cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) =========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ====
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) This figure represents a sector of 256 bytes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) cp is my abbreviation for column parity, rp for row parity.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) Let's start to explain column parity.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) - cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65)   so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) - cp2 is the parity over bit0, bit1, bit4 and bit5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) - cp3 is the parity over bit2, bit3, bit6 and bit7.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) - cp4 is the parity over bit0, bit1, bit2 and bit3.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) - cp5 is the parity over bit4, bit5, bit6 and bit7.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) Note that each of cp0 .. cp5 is exactly one bit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) Row parity actually works almost the same.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) - rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) - rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) - rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81)   (so handle two bytes, then skip 2 bytes).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) - rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) - for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85)   so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) - and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) The story now becomes quite boring. I guess you get the idea.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) - rp6 covers 8 bytes then skips 8 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) - rp7 skips 8 bytes then covers 8 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) - rp8 covers 16 bytes then skips 16 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) - rp9 skips 16 bytes then covers 16 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) - rp10 covers 32 bytes then skips 32 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) - rp11 skips 32 bytes then covers 32 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) - rp12 covers 64 bytes then skips 64 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) - rp13 skips 64 bytes then covers 64 etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) - rp14 covers 128 bytes then skips 128
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) - rp15 skips 128 bytes then covers 128
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) In the end the parity bits are grouped together in three bytes as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) =====  ===== ===== ===== ===== ===== ===== ===== =====
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) =====  ===== ===== ===== ===== ===== ===== ===== =====
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) =====  ===== ===== ===== ===== ===== ===== ===== =====
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) I detected after writing this that ST application note AN1823
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) (http://www.st.com/stonline/) gives a much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) nicer picture.(but they use line parity as term where I use row parity)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) Oh well, I'm graphically challenged, so suffer with me for a moment :-)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) And I could not reuse the ST picture anyway for copyright reasons.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) Attempt 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Implementing the parity calculation is pretty simple.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) In C pseudocode::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)   for (i = 0; i < 256; i++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)   {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)     if (i & 0x01)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)        rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)        rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)     if (i & 0x02)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)        rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)        rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)     if (i & 0x04)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)       rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)       rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)     if (i & 0x08)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)       rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)       rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144)     if (i & 0x10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)       rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)       rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)     if (i & 0x20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)       rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)       rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)     if (i & 0x40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)       rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)       rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)     if (i & 0x80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)       rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)     else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)       rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)     cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)     cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)     cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)     cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)     cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)     cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) Analysis 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) C does have bitwise operators but not really operators to do the above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) efficiently (and most hardware has no such instructions either).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) Therefore without implementing this it was clear that the code above was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) not going to bring me a Nobel prize :-)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) Fortunately the exclusive or operation is commutative, so we can combine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) the values in any order. So instead of calculating all the bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) individually, let us try to rearrange things.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) For the column parity this is easy. We can just xor the bytes and in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) end filter out the relevant bits. This is pretty nice as it will bring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) all cp calculation out of the for loop.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) Similarly we can first xor the bytes for the various rows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) This leads to:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) Attempt 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)   const char parity[256] = {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)   void ecc1(const unsigned char *buf, unsigned char *code)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213)   {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214)       int i;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215)       const unsigned char *bp = buf;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)       unsigned char cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217)       unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)       unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)       unsigned char par;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221)       par = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)       rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)       rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)       rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)       rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)       for (i = 0; i < 256; i++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)       {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)           cur = *bp++;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)           par ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)           if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)           if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)           if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)           if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)           if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)           if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)           if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)           if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)       }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)       code[0] =
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)           (parity[rp7] << 7) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)           (parity[rp6] << 6) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)           (parity[rp5] << 5) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)           (parity[rp4] << 4) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)           (parity[rp3] << 3) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)           (parity[rp2] << 2) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)           (parity[rp1] << 1) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)           (parity[rp0]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)       code[1] =
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)           (parity[rp15] << 7) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)           (parity[rp14] << 6) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)           (parity[rp13] << 5) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)           (parity[rp12] << 4) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)           (parity[rp11] << 3) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)           (parity[rp10] << 2) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)           (parity[rp9]  << 1) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)           (parity[rp8]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)       code[2] =
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)           (parity[par & 0xf0] << 7) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)           (parity[par & 0x0f] << 6) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)           (parity[par & 0xcc] << 5) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)           (parity[par & 0x33] << 4) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)           (parity[par & 0xaa] << 3) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)           (parity[par & 0x55] << 2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)       code[0] = ~code[0];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266)       code[1] = ~code[1];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)       code[2] = ~code[2];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) Still pretty straightforward. The last three invert statements are there to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) all data is 0xff, so the checksum then matches.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) I also introduced the parity lookup. I expected this to be the fastest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) way to calculate the parity, but I will investigate alternatives later
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) on.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) Analysis 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) The code works, but is not terribly efficient. On my system it took
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) almost 4 times as much time as the linux driver code. But hey, if it was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) *that* easy this would have been done long before.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) No pain. no gain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) Fortunately there is plenty of room for improvement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) In step 1 we moved from bit-wise calculation to byte-wise calculation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) However in C we can also use the unsigned long data type and virtually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) every modern microprocessor supports 32 bit operations, so why not try
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) to write our code in such a way that we process data in 32 bit chunks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) Of course this means some modification as the row parity is byte by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) byte. A quick analysis:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) for the column parity we use the par variable. When extending to 32 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) we can in the end easily calculate rp0 and rp1 from it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) (because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) respectively, from MSB to LSB)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) also rp2 and rp3 can be easily retrieved from par as rp3 covers the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) first two MSBs and rp2 covers the last two LSBs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) Note that of course now the loop is executed only 64 times (256/4).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) And note that care must taken wrt byte ordering. The way bytes are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) ordered in a long is machine dependent, and might affect us.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) Anyway, if there is an issue: this code is developed on x86 (to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) precise: a DELL PC with a D920 Intel CPU)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) And of course the performance might depend on alignment, but I expect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) that the I/O buffers in the nand driver are aligned properly (and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) otherwise that should be fixed to get maximum performance).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) Let's give it a try...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) Attempt 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321)   extern const char parity[256];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323)   void ecc2(const unsigned char *buf, unsigned char *code)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)   {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)       int i;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)       const unsigned long *bp = (unsigned long *)buf;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327)       unsigned long cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)       unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329)       unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)       unsigned long par;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332)       par = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)       rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334)       rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)       rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336)       rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)       for (i = 0; i < 64; i++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)       {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340)           cur = *bp++;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341)           par ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342)           if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)           if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)           if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)           if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346)           if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347)           if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)       }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)       /*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)          we need to adapt the code generation for the fact that rp vars are now
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)          long; also the column parity calculation needs to be changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352)          we'll bring rp4 to 15 back to single byte entities by shifting and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)          xoring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)       */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)       rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356)       rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)       rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358)       rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359)       rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)       rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361)       rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362)       rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363)       rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364)       rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365)       rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)       rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367)       rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368)       rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369)       par ^= (par >> 16);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)       rp1 = (par >> 8); rp1 &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371)       rp0 = (par & 0xff);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372)       par ^= (par >> 8); par &= 0xff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374)       code[0] =
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375)           (parity[rp7] << 7) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376)           (parity[rp6] << 6) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377)           (parity[rp5] << 5) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378)           (parity[rp4] << 4) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379)           (parity[rp3] << 3) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380)           (parity[rp2] << 2) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381)           (parity[rp1] << 1) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382)           (parity[rp0]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383)       code[1] =
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384)           (parity[rp15] << 7) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385)           (parity[rp14] << 6) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386)           (parity[rp13] << 5) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387)           (parity[rp12] << 4) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388)           (parity[rp11] << 3) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389)           (parity[rp10] << 2) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390)           (parity[rp9]  << 1) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391)           (parity[rp8]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392)       code[2] =
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393)           (parity[par & 0xf0] << 7) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394)           (parity[par & 0x0f] << 6) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395)           (parity[par & 0xcc] << 5) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396)           (parity[par & 0x33] << 4) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397)           (parity[par & 0xaa] << 3) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398)           (parity[par & 0x55] << 2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399)       code[0] = ~code[0];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400)       code[1] = ~code[1];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401)       code[2] = ~code[2];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402)   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) The parity array is not shown any more. Note also that for these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) examples I kinda deviated from my regular programming style by allowing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) multiple statements on a line, not using { } in then and else blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) with only a single statement and by using operators like ^=
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) Analysis 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) The code (of course) works, and hurray: we are a little bit faster than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) the linux driver code (about 15%). But wait, don't cheer too quickly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) There is more to be gained.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) If we look at e.g. rp14 and rp15 we see that we either xor our data with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) rp14 or with rp15. However we also have par which goes over all data.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) This means there is no need to calculate rp14 as it can be calculated from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) (or if desired we can avoid calculating rp15 and calculate it from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) rp14).  That is why some places refer to inverse parity.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) Effectively this means we can eliminate the else clause from the if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) statements. Also we can optimise the calculation in the end a little bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) by going from long to byte first. Actually we can even avoid the table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) lookups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) Attempt 3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) Odd replaced::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433)           if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434)           if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435)           if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)           if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437)           if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438)           if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) with::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442)           if (i & 0x01) rp5 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443)           if (i & 0x02) rp7 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444)           if (i & 0x04) rp9 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445)           if (i & 0x08) rp11 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446)           if (i & 0x10) rp13 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447)           if (i & 0x20) rp15 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) and outside the loop added::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451)           rp4  = par ^ rp5;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452)           rp6  = par ^ rp7;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453)           rp8  = par ^ rp9;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454)           rp10  = par ^ rp11;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455)           rp12  = par ^ rp13;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456)           rp14  = par ^ rp15;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) And after that the code takes about 30% more time, although the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) statements is reduced. This is also reflected in the assembly code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) Analysis 3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) Very weird. Guess it has to do with caching or instruction parallellism
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) observation was that this one is only 30% slower (according to time)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) executing the code as my 3Ghz D920 processor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) Well, it was expected not to be easy so maybe instead move to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) different track: let's move back to the code from attempt2 and do some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) loop unrolling. This will eliminate a few if statements. I'll try
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) different amounts of unrolling to see what works best.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) Attempt 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) Unrolled the loop 1, 2, 3 and 4 times.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) For 4 the code starts with::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482)     for (i = 0; i < 4; i++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484)         cur = *bp++;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485)         par ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486)         rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487)         rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488)         rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489)         rp10 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490)         if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491)         if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492)         cur = *bp++;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493)         par ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494)         rp5 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495)         rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496)         ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) Analysis 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) Unrolling once gains about 15%
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) Unrolling twice keeps the gain at about 15%
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) Unrolling three times gives a gain of 30% compared to attempt 2.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) Unrolling four times gives a marginal improvement compared to unrolling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) three times.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) I decided to proceed with a four time unrolled loop anyway. It was my gut
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) feeling that in the next steps I would obtain additional gain from it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) The next step was triggered by the fact that par contains the xor of all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) bytes and rp4 and rp5 each contain the xor of half of the bytes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) eliminate rp5 (or rp4, but I already foresaw another optimisation).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) Attempt 5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) Effectively so all odd digit rp assignments in the loop were removed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) This included the else clause of the if statements.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) Of course after the loop we need to correct things by adding code like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529)     rp5 = par ^ rp4;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) Also the initial assignments (rp5 = 0; etc) could be removed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) Along the line I also removed the initialisation of rp0/1/2/3.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) Analysis 5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) Measurements showed this was a good move. The run-time roughly halved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) of the processor time compared to the current code in the linux kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) However, still I thought there was more. I didn't like all the if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) statements. Why not keep a running parity and only keep the last if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) statement. Time for yet another version!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) Attempt 6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) THe code within the for loop was changed to::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552)     for (i = 0; i < 4; i++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554)         cur = *bp++; tmppar  = cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555)         cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557)         cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559)         cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560)         cur = *bp++; tmppar ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562)         cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564)         cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565)         cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566)         cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567)         cur = *bp++; tmppar ^= cur; rp8 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569)         cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570)         cur = *bp++; tmppar ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572)         cur = *bp++; tmppar ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574)         par ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575)         if ((i & 0x1) == 0) rp12 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576)         if ((i & 0x2) == 0) rp14 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577)     }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) As you can see tmppar is used to accumulate the parity within a for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) iteration. In the last 3 statements is added to par and, if needed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) to rp12 and rp14.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) While making the changes I also found that I could exploit that tmppar
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) contains the running parity for this iteration. So instead of having:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) rp4 ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587) statement. A similar change was done for rp8 and rp10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) Analysis 6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) Measuring this code again showed big gain. When executing the original
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) linux code 1 million times, this took about 1 second on my system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595) (using time to measure the performance). After this iteration I was back
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) to 0.075 sec. Actually I had to decide to start measuring over 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597) million iterations in order not to lose too much accuracy. This one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598) definitely seemed to be the jackpot!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) There is a little bit more room for improvement though. There are three
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) places with statements::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) 	rp4 ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605) It seems more efficient to also maintain a variable rp4_6 in the while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606) loop; This eliminates 3 statements per loop. Of course after the loop we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607) need to correct by adding::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) 	rp4 ^= rp4_6;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610) 	rp6 ^= rp4_6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612) Furthermore there are 4 sequential assignments to rp8. This can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613) encoded slightly more efficiently by saving tmppar before those 4 lines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) and later do rp8 = rp8 ^ tmppar ^ notrp8;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615) (where notrp8 is the value of rp8 before those 4 lines).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616) Again a use of the commutative property of xor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617) Time for a new test!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 619) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 620) Attempt 7
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 621) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 622) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 623) The new code now looks like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 624) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 625)     for (i = 0; i < 4; i++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 626)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 627)         cur = *bp++; tmppar  = cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 628)         cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 629)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 630)         cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 631) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 632)         cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 633)         cur = *bp++; tmppar ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 634)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 635)         cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 636) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 637)         notrp8 = tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 638)         cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 639)         cur = *bp++; tmppar ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 640)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 641)         cur = *bp++; tmppar ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 642)         rp8 = rp8 ^ tmppar ^ notrp8;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 643) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 644)         cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 645)         cur = *bp++; tmppar ^= cur; rp6 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 646)         cur = *bp++; tmppar ^= cur; rp4 ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 647)         cur = *bp++; tmppar ^= cur;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 648) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649)         par ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650)         if ((i & 0x1) == 0) rp12 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651)         if ((i & 0x2) == 0) rp14 ^= tmppar;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652)     }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653)     rp4 ^= rp4_6;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654)     rp6 ^= rp4_6;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) Not a big change, but every penny counts :-)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660) Analysis 7
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663) Actually this made things worse. Not very much, but I don't want to move
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) into the wrong direction. Maybe something to investigate later. Could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) have to do with caching again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667) Guess that is what there is to win within the loop. Maybe unrolling one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) more time will help. I'll keep the optimisations from 7 for now.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) Attempt 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674) Unrolled the loop one more time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677) Analysis 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680) This makes things worse. Let's stick with attempt 6 and continue from there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681) Although it seems that the code within the loop cannot be optimised
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) further there is still room to optimize the generation of the ecc codes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683) We can simply calculate the total parity. If this is 0 then rp4 = rp5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) etc. If the parity is 1, then rp4 = !rp5;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687) in the result byte and then do something like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689)     code[0] |= (code[0] << 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691) Lets test this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694) Attempt 9
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697) Changed the code but again this slightly degrades performance. Tried all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) kind of other things, like having dedicated parity arrays to avoid the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699) shift after parity[rp7] << 7; No gain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) Change the lookup using the parity array by using shift operators (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) replace parity[rp7] << 7 with::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) 	rp7 ^= (rp7 << 4);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704) 	rp7 ^= (rp7 << 2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705) 	rp7 ^= (rp7 << 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706) 	rp7 &= 0x80;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708) No gain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710) The only marginal change was inverting the parity bits, so we can remove
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) the last three invert statements.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) Ah well, pity this does not deliver more. Then again 10 million
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714) iterations using the linux driver code takes between 13 and 13.5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) seconds, whereas my code now takes about 0.73 seconds for those 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) million iterations. So basically I've improved the performance by a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) factor 18 on my system. Not that bad. Of course on different hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) you will get different results. No warranties!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720) But of course there is no such thing as a free lunch. The codesize almost
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724) Correcting errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) For correcting errors I again used the ST application note as a starter,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) but I also peeked at the existing code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) The algorithm itself is pretty straightforward. Just xor the given and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731) the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) are 1 we have one correctable bit error. If there is 1 bit 1, we have an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) error in the given ecc code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735) It proved to be fastest to do some table lookups. Performance gain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) introduced by this is about a factor 2 on my system when a repair had to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737) be done, and 1% or so if no repair had to be done.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739) Code size increased from 330 bytes to 686 bytes for this function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740) (gcc 4.2, -O3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) Conclusion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) The gain when calculating the ecc is tremendous. Om my development hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748) embedded system with a MIPS core a factor 7 was obtained.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750) On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) 5 (big endian mode, gcc 4.1.2, -O3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753) For correction not much gain could be obtained (as bitflips are rare). Then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) again there are also much less cycles spent there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) It seems there is not much more gain possible in this, at least when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757) programmed in C. Of course it might be possible to squeeze something more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758) out of it with an assembler program, but due to pipeline behaviour etc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) this is very tricky (at least for intel hw).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761) Author: Frans Meulenbroeks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763) Copyright (C) 2008 Koninklijke Philips Electronics NV.