Re: [cc65] Macros in inline assembler

From: Andreas Rückert <a_rueckert1gmx.net>
Date: 2012-01-19 18:31:33

Hi!

-------- Original Message --------
> Date: Thu, 19 Jan 2012 18:17:36 +0100
> From: Ullrich von Bassewitz <uz@musoftware.de>
> To: cc65@musoftware.de
> Subject: Re: [cc65] Macros in inline assembler

> On Thu, Jan 19, 2012 at 05:33:47PM +0100, "Andreas Rückert" wrote:
> > We are talking about a modified version of this code:
> >
> > http://www.koders.com/c/fid0D6D481A7D85CEB963C3F4258F30CF903DA541F3.aspx
> >
> > or precisely about the unpacking starting in line 103.
> 
> Interesting. This code doesn't contain the undefined behaviour bug from
> your first mail:-)

??? The code is not from me...maybe that's why it's better... :-) 

> If you want to make that fast on the 6502, treat it as a block of bytes,
> not longs. What the loop on line 103 does is to change byte order of the
> first 16 32 bit words. This translates to the following asm code (untested):
> 
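>         ldy     #0              ; Y = byte offset into the source block
> Loop:   tya
>         tax                     ; X = offset of the current word in W
>         lda     (block),y
>         sta     W+3,x           ; source byte 0 -> destination byte 3
>         iny
>         lda     (block),y
>         sta     W+2,x           ; source byte 1 -> destination byte 2
>         iny
>         lda     (block),y
>         sta     W+1,x           ; source byte 2 -> destination byte 1
>         iny
>         lda     (block),y
>         sta     W+0,x           ; source byte 3 -> destination byte 0
>         iny
>         cpy     #64             ; 16 words * 4 bytes
>         bne     Loop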
> 
> No longs involved and therefore reasonably fast.
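
For comparison, the same byte-wise idea can be sketched in portable C. This is only an illustration of what the loop above is meant to do, not code from the original source; the names swap_words_bytewise, src, dst and nbytes are made up for this sketch, and nbytes is assumed to be a multiple of 4.

```c
#include <stddef.h>

/* Reverse the byte order of every 4-byte group, touching bytes only.
   All names are illustrative; nbytes is assumed to be a multiple of 4
   (64 for the 16 words handled by the asm loop). */
void swap_words_bytewise(const unsigned char *src, unsigned char *dst,
                         size_t nbytes)
{
    size_t i;

    for (i = 0; i < nbytes; i += 4) {
        dst[i + 3] = src[i + 0];
        dst[i + 2] = src[i + 1];
        dst[i + 1] = src[i + 2];
        dst[i + 0] = src[i + 3];
    }
}
```

Calling swap_words_bytewise(block, W_as_bytes, 64) would correspond to the 16 words above, with W_as_bytes standing for W viewed as an array of bytes.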

I actually had a solution very similar to this:

http://www.forum64.de/wbb3/board25-coder-unter-sich/board308-programmieren/board29-cross-development/45342-bitcoin-mining-auf-dem-c64/index2.html#post579877

Wp was a pointer to W, but as an array of bytes, not longs.

The code does not actually copy the first 16 longs; it copies 16 longs
from the data array into the first 16 longs of W. If data were an array
of longs, the code would be

W[0] = data[0];
W[1] = data[4];
W[2] = data[8];
...etc. So a simple memcpy plus byte swap does not help. That's why my
assembler code decrements the data pointer within each 4-byte block (a
long), but then adds 16 (four 4-byte longs) to get to the next long.
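
A rough C sketch of that access pattern, as I read the description: each destination long is taken from a source position 16 bytes after the previous one, and its four bytes are stored in reverse order. The function name gather_swapped and the parameter names are invented for this sketch, and it assumes the source holds at least 244 bytes; it is not the code from the forum post.

```c
/* Build 16 four-byte words in w_bytes from a strided source.
   Assumptions (not from the original code): word k is read from byte
   offset 16*k of data, its bytes are reversed on the way, and data
   holds at least 16*15 + 4 = 244 bytes. */
void gather_swapped(const unsigned char *data, unsigned char *w_bytes)
{
    unsigned char k;

    for (k = 0; k < 16; ++k) {
        const unsigned char *s = data + 16u * k;   /* every 4th long */
        unsigned char *d = w_bytes + 4u * k;       /* consecutive longs */

        d[0] = s[3];
        d[1] = s[2];
        d[2] = s[1];
        d[3] = s[0];
    }
}
```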

> > I want to run this code on several platforms, including PCs with
> > optional GPUs, so rewriting everything in assembler is not really
> > an option. Maybe some part with conditional compilation. But the
> > later sha256 rounds are too complex to unloop them by hand. So I'll
> > keep them as C for now.
> 
> Problem is that 32 bit integers common on other platforms are
> extraordinarily slow on the 6502. There is no chance to get that fast on
> the 6502 without changing the actual implementation.
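
To make the cost a bit more concrete: an 8-bit CPU has to build every 32-bit operation out of byte-sized steps. The following C sketch spells out a single 32-bit addition that way, roughly the work hidden behind one `+` on two longs. The name add32_bytewise and the little-endian 4-byte layout are assumptions for illustration only.

```c
/* Add two 32-bit values stored as 4 little-endian bytes each, one byte
   at a time with a carry, the way an 8-bit CPU has to do it.
   The result is a + b modulo 2^32. Names and layout are illustrative. */
void add32_bytewise(const unsigned char a[4], const unsigned char b[4],
                    unsigned char sum[4])
{
    unsigned int carry = 0;
    unsigned char i;

    for (i = 0; i < 4; ++i) {
        unsigned int t = (unsigned int)a[i] + b[i] + carry;
        sum[i] = (unsigned char)(t & 0xFFu);   /* low 8 bits of this step */
        carry = t >> 8;                        /* 0 or 1 for the next byte */
    }
}
```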

That's why I changed the implementation...somewhat... :-)

Ciao,
Andreas
