This repository contains the reverse engineered source code and explanations of one of the most famous Atari ST demos.
No doubt that if you owned an Atari ST at the end of the eighties you have been really impressed by one of the most famous megademos: The Cuddly Demos by The Carebears (TCB): Demozoo, Pouët
Technically incredible (Syncscroll, Fullscreen) and never seen effects pushed it into all ST users memory.
One screen that I have always loved is the The Star-Wars Scroller: it is not especially technically advanced, but it was the first demo (to my knowledge) to show a scroller like in the eponym movie.
And surprinsingly, this kind of scroller has not been very frequent (non exhaustive list here).
I have always loved this effect, and with my friend ZPK (from T.AL) we made a version of this effect on Falcon for the Place To Be Again in 1994 (unfortunately lost...), and I striked back during the 2020 lockdown, with a STE demo: Demozoo, Pouët, GitHub
I also made a tutorial about the way I have implemented it( GitHub, YouTube), and during a discussion on atari-forum about this tutorial, it was asked if anyone has looked into how TCB did theirs all those years back.
I haven't and didn't want to, in order not to be influenced before coding 😉 . And also I suppose that it is the habit I have taken from back in the days, i.e. to try to reproduce their demos without having access to their source code ! 😄
But after reading a blog article of Dbug who reverse engineered the 3D Doc Demo (Cuddly demos too !), I have thought the time has come to dig into TCB's code !
First have a look at the Dbug's article mentioned above, as I have followed his way of proceeding, you will find more details than here. Moreover there are a lot of common tricks used by both demos (which seems logical since both are from TCB) that are explained in his article.
So I managed to find an extracted standalone version of the demo from a compilation: TCB_STWR.PRG, size = 41 560 bytes.
The file is compressed of course, and once uncompressed its size is 104 634 bytes.
Then I have used Easy Rider to get a first version of the source code, and vasm to assemble it again.
The source code contains several sections: the music which is relocated to $6E000 at the begin of the demo, and the demo itself that is relocated to $AC00 at the very beginning of the execution.
The code is composed of a mix of PC relative (using labels) and absolute adressing (sometimes both are used for a same address on two consecutive lines of code...) which makes the readibility difficult and the association between labels and adresses mandatory for a correct understanding.
For this, the wonderful hrdb (Hatari Remote Debugger GUI) has been of a great help ! Thanks Tat for this amazing tool !
Finally I haved defined both constants (equ
) and labels to make the source understandable.
I think it could have been simpler by using the ORG
directive, but I did not manage to use it more than once with vasm...
I initially wanted to focus only on the Star Wars scroller part (SWSC) but it was not as straightforward as I thought, and I finally reversed, commented and explained most of the demo. Only the distorted scroller part is not fully reversed, because it doesn't seem to reveal big surprises and I had already spent too much time on this demo.
The commented source code is here.
First things we can learn:
- there is almost 25% of CPU left
- the SW scroller part takes 280 KBytes of memory
- the demo is refreshed at 50 Hz, but is not in the VBL code. The main code is in an infinte loop that syncs with VBL using a flag
- the VBL sets this flag, initializes rasters, and plays music
- there is an attract mode (see Dbug's article)
- it was impossible to quit the demo, either by attract mode or by space key press... I had to modify the code to restore the good behaviour; it seems this has been initially modified by the people who extracted the demo...
- the Union Wizz Coders logo is drawn only once during initialization, and appears/disappears by simple color change
- the drawing of the whole demo is made using double-buffering
Main figures of CPU usage:
jsr MUSIC_ADR+4 ; Plays Music : 11 scanlines = 5 600 cycles
bsr clear_SWSC ; Clear SWSC : 26 scanlines = 13 312 cycles
bsr clear_stars ; Clear stars : 6 scanlines = 3 072 cycles
bsr clear_moving_logo ; Clear Moving logo : 9 scanlines = 4 608 cycles
bsr draw_dist_scroller ; Draw distorted scroller : 70 scanlines = 35 840 cycles
bsr draw_stars ; Draw stars : 19 scanlines = 9 728 cycles
bsr draw_SWSC ; Draw SWSC : 71 scanlines = 36 352 cycles
bsr draw_moving_logo ; Draw Moving logo : 36 scanlines = 18 432 cycles
Rasters are implemented through daisy-chain Timer B launched at the end of the VBL. There is a first code dealing with the upper half of the screen, and another one with the gradient of the SW scroller. For the first half there are two color tables, which are alternated at each VBL so as to produce smoother gradient of colors.
; Rasters management
not.w RASTER_FLAG
bmi.s .vbl3
lea BUF_RASTER_1,a6
bra.s .vbl4
.vbl3: lea BUF_RASTER_2,a6
.vbl4: ; Set Timer B
move.l #TIMB_CODE_ADR,$120.l ; timb_code
move.b #0,$fffffa1b.w
move.b #2,$fffffa21.w ; 2 lines
move.b #8,$fffffa1b.w ; Event count mode
rte
timb_code: ; @ TIMB_CODE_ADR = $ACC4
; Rasters for first half screen
move.w (a6),$ffff8244.w
move.w (a6)+,$ffff8246.w
bmi.s timb_code_2
rte
timb_code_2: ; Change rasters table for SW Scroller
move.l #TIMB_CODE3_ADR,$120.l
move.b #0,$fffffa1b.w
move.b #2,$fffffa21.w
move.b #8,$fffffa1b.w
move.l a6,-(a7)
move.l a0,-(a7)
lea SW_Scroll_Rasters_Table(pc),a0
lea $ffff8240.w,a6
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
move.l (a0)+,(a6)+
movea.l (a7)+,a0
movea.l (a7)+,a6
rte
timb_code_3: ; @ TIMB_CODE3_ADR = $AD0E
; Rasters for SW Scroller part
move.w (a6)+,$ffff8248.w
rte
There are 100 stars displayed on two bitplanes. The drawing is performed using a 20 iterations loop which displays 5 stars. This is quite strange not to have completely unrolled the loop, or instead directly used a 100 iterations loop... The display is based on a precomputed included table which contains for each star:
- the delta screen address. This value is reminded in another buffer for the clearing at the next VBL
- the 2 bitplane pixel data. So the display code is very simple.
movea.l (a3),a5 ; Data address in sequence for current star
move.w (a5)+,d1 ; Get delta screen address
bmi.s .star1_rewind ; End of sequence for current star ?
.star1: move.l (a5)+,d2 ; 2 bitplane data (several possible colors for stars) for current star
or.l d2,0(a1,d1.w) ; Display star
move.w d1,(a4)+ ; Save the delta screen address in buffer for clearing at next iteration
move.l a5,(a3) ; Save new data address in sequence for current star
During the demo initialisation, the letters of the THE UNION
moving logo are X shifted and a mask is computed.
Then, as for the stars, all the display is based on a table, and each letter is displayed indepently. The addresses of the modified screen words are strored in a table so as to ease the cleaning.
draw_moving_logo_letter:
; d0 = X pos
; d1 = Y pos
; a1 = letter sprite address
; a3 = clear buffer
lea BUF_LINE_OFFSET,a2
add.w d1,d1
add.w d1,d1
movea.l 0(a2,d1.w),a0 ; offset of screen line
adda.l L_Log_Base(pc),a0 ; destination address
move.w d0,d1
lsr.w #1,d0
andi.w #$fff8,d0 ; X offset
adda.w d0,a0
move.l a0,(a3)+ ; Save address for further cleaning
subq.l #2,a0
andi.w #$f,d1 ; Get the shifted sprite address
add.w d1,d1
add.w d1,d1
lea MOVING_LOGO_SHIFT_OFST,a2
move.l 0(a2,d1.w),d1
adda.l d1,a1
REPT 10 ; 10 lines
movem.l (a0)+,d0-d3 ; Get screen data
and.w (a1),d0 ; Apply mask
and.l (a1)+,d1
and.w (a1),d2
and.l (a1)+,d3
or.w (a1)+,d0 ; Or sprite
or.l (a1)+,d1
or.w (a1)+,d2
or.l (a1)+,d3
movem.l d0-d3,-(a0) ; Write screen
lea 160(a0),a0 ; Next screen line
ENDR
rts
Sorry, I did not reverse the whole part of this scroller 😉 But I have commented a part of it, you can look at the source.
Here we are ! Now let's see how this scroller has been implemented !
First of all, it is based on using different sizes fo each char, and displaying them at precomputed X positions. No texture mapping, nor polygon filling approaches, which seems logical in order to be efficient on a ST.
The scroller height is 82 lines. And its width varies between 288 pixels and 80 pixels. So roughly 15 000 pixels.
The font is 16 x 10 x 1 bitplane. It contains 25 characters: alphabet minus "J", "Q" anf "Z" plus dot and space. The remove of unused letters from the font, and the limited number of symbols highlights the difficulties to limit the memory usage (280 KBytes used as mentioned before). The font is rather small and does not allow a good definition for the "3D projection" effect. But this is hidden by the high speed of the scrolling. Here is for example what it looks like when stopped and without rasters.
Even the closest lines are not really readable and the more distant ones are.. euh... But speed and rasters change everything ! Well done !
They have taken the approximation that the projection for all columns is the same (not true, but visually acceptable), so this resumes the problem to having different char sizes and to X-preshift them. This rescaling and X-shifting is performed during the demo initialisation and requires a final font buffer of 232 KBytes. This pre-processing is done with a per pixel subroutine, and is not really optimised: this explains the delay before the demo really starts.
swsc_rescale_pixel:
movem.l d0-d3,-(a7)
; d0 = pixel index
; d1 = line number
; d4 = currently processed pixel
; a0 = source font address
; a1 = rescale font address
; The current routine is called for consecutive d4 pixel values, and selected d0 values
; this is what allows the rescaling
lsl.b #1,d1 ; 2 bytes per line in source data, so multiply per 2
move.w 0(a0,d1.l),d2 ; Get the 16 pixels of the char line. Memory access done for each pixel... Not very efficient
lsl.b #1,d1 ; 32 pixels (4 bytes) per line in destination data to handle X-shift , so multiply again per 2
moveq #$f,d3 ; 16 pixels
sub.b d0,d3 ; Reverse order for pixel index
btst d3,d2 ; Is the pixel set in source ?
bne .srcPixelSet
move.w #$ffff,d0
moveq #$f,d3
sub.b d4,d3
bclr d3,d0
and.w d0,0(a1,d1.l) ; Clear pixel
bra .exit
.srcPixelSet:
moveq #0,d0
moveq #$f,d3
sub.b d4,d3
bset d3,d0
or.w d0,0(a1,d1.l) ; Set Pixel
.exit: movem.l (a7)+,d0-d3
rts
And the global pre-processing code:
swsc_rescale_shift_font:
lea swsc_font(pc),a0
lea SWSC_FONT_BUFF,a1
moveq #$1b,d6 ; 28 chars (alphabet + dot and space). But it seems that J, Q and Z are missing in the font data.
; This is managed later by a translation table
.nextChar: moveq #0,d4 ; Currently processed pixel index
moveq #$c,d3 ; 13 X-rescaling to be done
lea swsc_rescale_table(pc),a2 ; this table contains the list of "kept" pixels wen rescaling
.nextPixel: move.b (a2)+,d0 ; d0 contains the index of a pixel to be kept
tst.b d0
bmi .nextSize ; no more pixels to be kept for this size
moveq #0,d1 ; Current char line number
moveq #9,d5 ; Char height = 10
.nextCharLine: bsr swsc_rescale_pixel
addq.b #1,d1 ; Next line
dbf d5,.nextCharLine
addq.l #1,d4 ; The current pixel index has been processed for all lines, we can process the next pixel
bra .nextPixel
.nextSize: ; Now process the X-shifting
lea 40(a1),a3 ; a3 = end of char in buffer
moveq #$e,d0 ; 16 shifts to be done
moveq #1,d1 ; Current shift number
.nextXShift: moveq #9,d7 ; Char height = 10
.nextShiftedLine:
moveq #0,d2
move.w (a1),d2 ; Rescaled 16 pixels line
swap d2
lsr.l d1,d2 ; Shift
move.w d2,2(a3)
swap d2
move.w d2,(a3)
addq.l #4,a3 ; Next line
addq.l #4,a1
dbf d7,.nextShiftedLine
addq.l #1,d1 ; Next shift number
lea -40(a1),a1
dbf d0,.nextXShift
lea 640(a1),a1 ; Go after all shifts
moveq #0,d4
dbf d3,.nextPixel
lea 20(a0),a0
dbf d6,.nextChar
rts
Now if we look at the main code. First the cleaning is very simple: it is done by sections of lines which have roughly the same width:
clear_SWSC: ; Cleans the previous displayed SWSC per sections
moveq #0,d1
movea.l L_Log_Base(pc),a0
lea 18724(a0),a0 ; Line 118 of screen + 2nd bitplan
lea 56(a0),a0
moveq #$a,d0
.loop1: move.w d1,(a0)
move.w d1,8(a0)
move.w d1,16(a0)
move.w d1,24(a0)
move.w d1,32(a0)
move.w d1,40(a0) ; 96 pixels for first section
lea 160(a0),a0
dbf d0,.loop1
...
lea -8(a0),a0
moveq #$b,d0
.loop8: move.w d1,(a0)
move.w d1,8(a0)
move.w d1,16(a0)
move.w d1,24(a0)
move.w d1,32(a0)
move.w d1,40(a0)
move.w d1,48(a0)
move.w d1,56(a0)
move.w d1,64(a0)
move.w d1,72(a0)
move.w d1,80(a0)
move.w d1,88(a0)
move.w d1,96(a0)
move.w d1,104(a0)
move.w d1,112(a0)
move.w d1,120(a0)
move.w d1,128(a0)
move.w d1,136(a0) ; 288 pixels at max
lea 160(a0),a0
dbf d0,.loop8
rts
Then the display is split in two steps:
- Scroll of one line and fill one line of a buffer. This buffer contains a "non projected view" of the scrolltext containing the addresses of the font char lines (and not the bitmap itself)
- Display the "3D projection" of this buffer on the screen
Here is the code for the first part:
fill_SWSC_buffer_line:
; Fill an intermediate buffer line before displaying to screen
; this buffer contains a "non projected view" of the scrolltext containing the addresses of the font char lines (and not the bitmap itself)
move.l L_SWSC_Buf_Pos(pc),d3 ; Current position (offset) in buffer
move.l SWSC_BUF_POS,d2 ; Move.l d3,d2 would have been the same...
addi.l #$30,d2 ; Next line (12 chars * 4 bytes = $30)
cmp.l L_SWSC_Buf_Mid(pc),d2 ; Get middle of buffer
blt .setPos ; Current pos is higher the middle ?
sub.l L_SWSC_Buf_Mid(pc),d2 ; Go back !
.setPos: move.l d2,SWSC_BUF_POS
lea SWSC_BUFFER,a5 ; a5 = begin of buffer
lea 21120(a5),a4 ; a4 = middle of buffer
lea SWSC_FONT_BUFF,a0 ; a0 = begin of the rescaled/X-shifted font
lea L_swsc_text(pc),a2 ; a2 = begin of the text
lea swsc_Translation_Table(pc),a1 ; a1 = text translation table (for chars without font items)
move.w L_Swsc_Text_Pos(pc),d4
cmpi.b #$ff,0(a2,d4.w) ; End of text
bne .noTextEnd
clr.w SWSC_TEXT_POS ; Go to begin of text
.noTextEnd: cmpi.l #$28,SWSC_CHAR_LINE4 ; $28 = 40 = 4*10, so are we after the last line of the chars font ?
blt .readTextLine ; if no, read the text line
moveq #-1,d1 ; if yes, then we insert empty lines for spacing
move.l d1,0(a5,d3.w)
move.l d1,0(a4,d3.w)
move.l d1,4(a5,d3.w)
move.l d1,4(a4,d3.w)
move.l d1,8(a5,d3.w)
move.l d1,8(a4,d3.w)
move.l d1,12(a5,d3.w)
move.l d1,12(a4,d3.w)
move.l d1,16(a5,d3.w)
move.l d1,16(a4,d3.w)
move.l d1,20(a5,d3.w)
move.l d1,20(a4,d3.w)
move.l d1,24(a5,d3.w)
move.l d1,24(a4,d3.w)
move.l d1,28(a5,d3.w)
move.l d1,28(a4,d3.w)
move.l d1,32(a5,d3.w)
move.l d1,32(a4,d3.w)
move.l d1,36(a5,d3.w)
move.l d1,36(a4,d3.w)
move.l d1,40(a5,d3.w)
move.l d1,40(a4,d3.w)
move.l d1,44(a5,d3.w)
move.l d1,44(a4,d3.w)
addi.l #$30,d3
bra .endLine
.readTextLine: moveq #$b,d7 ; 12 chars per line
.nextChar: moveq #0,d1
move.b 0(a2,d4.w),d1 ; Get text char
move.b 0(a1,d1.w),d1 ; Translate it into Font char number (for chars without font)
lea swsc_char_to_fontaddr(pc),a3 ; Then into address for char in font buffer
lsl.w #2,d1
move.l 0(a3,d1.w),d1
addi.l #SWSC_FONT_BUFF,d1
add.l L_Swsc_Char_Line4(pc),d1
move.l d1,0(a5,d3.w)
move.l d1,0(a4,d3.w)
addq.l #4,d3
addq.l #1,d4
dbf d7,.nextChar
.endLine: cmp.l L_SWSC_Buf_Mid(pc),d3 ; Manage buffer length (as at the begin of this sub-routine)
blt .setLineNumber
sub.l L_SWSC_Buf_Mid(pc),d3
.setLineNumber: addq.l #4,SWSC_CHAR_LINE4 ; Next line (4 bytes per line)
cmpi.l #$48,SWSC_CHAR_LINE4 ; Have we inserted also 10 lines of spaces ?
blt .exit
subi.l #$48,SWSC_CHAR_LINE4 ; Yes, so we go back to line 0
addi.w #$c,SWSC_TEXT_POS ; And we go to next scroller line
.exit: rts
And now the drawing itself. It is based on two precomputed tables (already included in the executable):
swsc_X_size_table
: this table contains pairs of words. There are 12 pairs per line (12 chars), and for the displayed 82 lines. Each pair is composed of:
- horizontal jump (delta screen address)
- offset in rescaled/X-shifted font to get the correct char size & X position
swsc_Y_table
: because of the 3D projection, only some lines of the buffer are displayed on screen. This tables lists the jumps to be performed in the buffer to go to the next displayed line
draw_SWSC: bsr fill_SWSC_buffer_line
movea.l L_Log_Base(pc),a0
lea 18724(a0),a0 ; Line #117, 3rd bitplan
lea SWSC_BUFFER,a5
lea swsc_X_size_table(pc),a3
lea swsc_Y_table(pc),a4
adda.l L_SWSC_Buf_Pos(pc),a5 ; Set to current position in buffer
moveq #$51,d1 ; 82 lines
.loopLine: move.l (a5)+,d3 ; d3 = char address in font
bmi .emptyLine ; if -1, empty line
movea.l d3,a1 ; a1 = char address in font
adda.w (a3)+,a0 ; horizontal screen jump to word where the char will be displayed
adda.w (a3)+,a1 ; Select char size and X preshift to be used for display
move.w (a1)+,d0 ; Read the first 16 pixels
or.w d0,(a0) ; OR with previous char (because chars are shifted across columns)
move.w (a1),d0 ; Then read the next 16 pixels
or.w d0,8(a0) ; And OR again to screen
; Do the same thing for the next 11 chars
REPT 11
adda.w (a3)+,a0
movea.l (a5)+,a1
adda.w (a3)+,a1
move.w (a1)+,d0
or.w d0,(a0)
move.w (a1),d0
or.w d0,8(a0)
ENDR
.nextLine: adda.w (a4)+,a5 ; Select the next line in buffer to be displayed
dbf d1,.loopLine
rts
.emptyLine: ; Empty line, display nothing
; Only update pointers
lea 44(a5),a5 ; Next line in buffer (4*12)
adda.w (a3),a0 ; Perform X jumps on screen address
adda.w 4(a3),a0
adda.w 8(a3),a0
adda.w 12(a3),a0
adda.w 16(a3),a0
adda.w 20(a3),a0
adda.w 24(a3),a0
adda.w 28(a3),a0
adda.w 32(a3),a0
adda.w 36(a3),a0
adda.w 40(a3),a0
adda.w 44(a3),a0
lea 48(a3),a3
bra .nextLine
And that's all ! There is no complexity in the code itself (no specific, or advanced optimisation). That could appear quite simple but the hidden difficulty of course comes from the choice of the method and of the sizing in order to fit in memory and to use a reasonable amount of CPU time. And by experience I know how brain-intensive and time-consuming this phase can be !!
Of course we have no access to the tools that allowed to generate all the precomputed tables. The maths are quite simple, but here again, there is a lot of hidden work.
Nice job !
I hope that most of you have reached their objectives at this point, and have a good vision of how TCB implemented this effect.
But I think that you have also understood that my goal was not only to see how TCB did their demo, but also to compare with my implementation of the same effect: no war here, of course 😉, but are the approaches identical ? Are there different tricks ? Or are the solutions quite similar ?
If you are interested in this further step, it may be interesting to first read the tutorial I wrote( GitHub, YouTube).
Now let's compare the implementations.
First of course the design constraints have not been the same: fullscreen, 3D projection close to the movie, and slow scrolling. So a size of 400 x 152 x 187 (51 600 pixels) vs. 288 x 80 x 82 (15 000 pixels). This size combined with the slow scrolling have led to a 16x24 font size (vs. 16 x 10). Moreover, it was mandatory to have both uppercase and lowercase chars: 56 letters instead of 25.
It is also based on using different sizes fo each char, and displaying them at precomputed X positions. No surprise for me here, I do not see other solutions to be efficient.
The font is also reduced and X-shifted. Moreover the display is also split in the two same steps:
- Scroll of one line and fill one line of a buffer. This buffer contains a "non projected view" of the scrolltext containing the addresses of the font char lines (and not the bitmap itself)
- Display the "3D projection" of this buffer on the screen
I have also taken the approximation that the projection for all columns is the same (not true, but visually acceptable) in order to reduce the precomputed font.
But because of higer number of chars and of the size fonts, the required memory size was too huge: 990 KBytes vs. 232 KBytes. I therefore had go a step further: letters have a lot of common lines (within a same letter, and between letters): for the 56 letters, there are only 188 different lines.
I have therefore worked on reduced elementary lines. This has the inconvienent of requiring an additional indirection, but it saves a lot of memory: at the end 135 KBytes (vs. 232 KBytes).
I win ! 😉
Because of the display method I have used (see further), there is no need to clean the screen before displaying.
Moreover the height of the spacing between is reduced, so there are less empty lines. In order to have a fixed CPU time, I have chosen to display spaces for empty lines instead of not displaying them.
So no need for cleaning, but additional display time.
No winner...
The display method is very close. But except for the top lines where more than 2 columns fit in 16 pixels, it is possible to avoid making an OR between columns using a bitplanes trick: in most cases, a letter of a column will have to be “split” onto two consecutive 16 pixels groups. We use a single Move.L to copy 1st part of the letter on bitplane 3 of the first 16 pixels group, then the 2nd part on bitplane 0 of next 16 pixels group. And we do the same for the next column.
This allows to save some precious cycles !
I win again ! 😉
We have finally used the same approach, but I managed to bring additional optimisations to cope with the larger size. But at the end no doubt that TCB clearly won on the myth & iconic battlefields !! Thank you guys for this inspiring demo !
David aka Uko from T.AL (The Arctic Land)
uko.tal@gmail.com or uko at http://www.atari-forum.com