jurem/SicTools

Question: differentiation between instruction and data bytes in object code

Closed this issue · 3 comments

Hi, could somebody please help me understand how (or even if) a SIC simulator is supposed to differentiate between instruction opcodes and data bytes (created by e.g. BYTE or WORD directives) when running object code?

All SIC simulator/emulator implementations that I found simply treat incoming bytes as opcodes first, and if such an opcode does not exist, they just skip it and move on to the next byte. But as far as I understand, this approach would not work even for the simplest programs: ZERO WORD 0 translates to 000000, which would be interpreted as LDA 0 (since LDA's opcode is 00), right?

Am I missing something here, or should this be considered as a real limitation of SIC?

It is indeed a limitation of SIC, as described in the article SICSIM: A simulator of the educational SIC/XE computer for a system-software course:

Disassembly. Notice that there is no general mechanism to distinguish between the data and instruction bytes. Hence, SICSIM eagerly disassembles the content of the memory at the current address, i.e., it always tries to disassemble the bytes as a SIC/XE instruction. However, since a sequence of bytes may not represent a valid instruction this may not always be possible. In this case the disassembler shows a single byte of data as a BYTE directive.

On the other hand, the data bytes (e.g., the values of the variables) may accidentally represent a SIC/XE instruction. For example, in Figure 3 the bytes at addresses 00015 and 00016 are shown correctly, while the bytes 010203 at address 00017 are displayed as LDA #203.

Even more complex situations may appear if the data bytes are followed by instruction bytes. In particular, when the last few bytes of data and the first few bytes of code form a valid instruction, the disassembly becomes misaligned. For example, in Listing 1 the last two bytes of data initialization (i.e., bytes 01 02 at address 0001A) are merged with the first byte of the J instruction into a LDA instruction. The remaining two bytes of J instruction are displayed as BYTE directives (see Figure 3, addresses 0001A to 0001E).
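
To make the quoted behaviour concrete, here is a minimal Python sketch of eager disassembly. It is not SICSIM's or SicTools' actual code: the function name `disassemble`, the four-entry opcode table, and the output format are assumptions for illustration, and the decoder is deliberately simplified (3-byte instructions only, with the x, b, p and e flags ignored).

```python
# Tiny subset of the opcode table (LDA=00, STA=0C, ADD=18, J=3C).
# A real disassembler would list every SIC/XE opcode.
OPCODES = {0x00: "LDA", 0x0C: "STA", 0x18: "ADD", 0x3C: "J"}


def disassemble(mem, start=0):
    """Eagerly decode `mem` from `start`; return a list of (address, text)."""
    out = []
    pc = start
    while pc < len(mem):
        name = OPCODES.get(mem[pc] & 0xFC)  # top 6 bits select the opcode
        if name is None or pc + 3 > len(mem):
            # Not decodable as an instruction: show a single byte of data.
            out.append((pc, f"BYTE X'{mem[pc]:02X}'"))
            pc += 1
            continue
        n, i = (mem[pc] >> 1) & 1, mem[pc] & 1
        if n == 0 and i == 0:
            # Plain SIC format: 15-bit address (index bit ignored here).
            value = ((mem[pc + 1] & 0x7F) << 8) | mem[pc + 2]
            prefix = ""
        else:
            # SIC/XE format 3: 12-bit displacement (b/p/e flags ignored here).
            value = ((mem[pc + 1] & 0x0F) << 8) | mem[pc + 2]
            prefix = "#" if (n, i) == (0, 1) else "@" if (n, i) == (1, 0) else ""
        out.append((pc, f"{name} {prefix}{value:X}"))
        pc += 3
    return out


# Data that happens to look like an instruction:
print(disassemble(bytes.fromhex("010203")))      # [(0, 'LDA #203')]
# Two data bytes (01 02) followed by a J instruction (3F 2F FD):
print(disassemble(bytes.fromhex("01023F2FFD")))
```

In the second call the data bytes 01 02 swallow the first byte of the J instruction, and the two leftover bytes come out as BYTE directives, which is the misalignment the article describes.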

jurem commented

Hi. That's the way all processors work. If the byte at the PC address does not represent a valid opcode, the corresponding trap is triggered. In SicTools an exception is raised and "Invalid opcode" or a similar message is printed to stderr (as far as I remember; for details please examine the source code).
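
As a rough illustration of the point above (not the SicTools source), a fetch-decode-execute step could look like the Python sketch below. The names `step`, `word` and `InvalidOpcode` are hypothetical, only three plain SIC instructions with direct addressing are handled, and the index bit is ignored.

```python
class InvalidOpcode(Exception):
    """Hypothetical trap: the byte at PC is not a known opcode."""


def word(mem, addr):
    """Read a 3-byte big-endian SIC word."""
    return int.from_bytes(mem[addr:addr + 3], "big")


def step(mem, regs):
    """Fetch, decode and execute one 3-byte instruction at regs['PC']."""
    pc = regs["PC"]
    opcode = mem[pc]
    addr = ((mem[pc + 1] & 0x7F) << 8) | mem[pc + 2]
    regs["PC"] = pc + 3
    if opcode == 0x00:        # LDA
        regs["A"] = word(mem, addr)
    elif opcode == 0x18:      # ADD
        regs["A"] = (regs["A"] + word(mem, addr)) & 0xFFFFFF
    elif opcode == 0x3C:      # J
        regs["PC"] = addr
    else:
        raise InvalidOpcode(f"invalid opcode {opcode:02X} at {pc:05X}")


# ZERO WORD 0 at address 0, followed by a word FFFFFF at address 3.
mem = bytearray.fromhex("000000FFFFFF")
regs = {"PC": 0, "A": 5}
step(mem, regs)   # the data word 000000 is executed as LDA 0
try:
    step(mem, regs)   # FF is not an opcode in this sketch, so the trap fires
except InvalidOpcode as exc:
    print(exc)
```

The first `step` silently treats the data word as LDA 0; only the second one traps. This is why a program has to keep PC out of its data, for example by placing data after the last executed instruction or by jumping over it.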

PS: I guess there are also some (not widespread, mostly academic) architectures out there that work differently, e.g., that use tags or some other kind of mechanism to differentiate between code and data, but this is really an exception. Another (partial) solution is to use segment permissions, e.g., execute for TEXT segments and R/W for data segments.
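
The segment-permission idea can be sketched in a few lines of Python. This is purely illustrative: `Segment`, `AccessViolation` and `check_access` are hypothetical names, and nothing in this thread says SicTools implements such a check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int   # first address of the segment
    end: int     # one past the last address
    perms: str   # any combination of "r", "w", "x"


class AccessViolation(Exception):
    """Hypothetical fault raised on a forbidden or unmapped access."""


def check_access(segments, addr, kind):
    """Return the segment covering `addr` if it permits `kind`, else raise."""
    for seg in segments:
        if seg.start <= addr < seg.end:
            if kind in seg.perms:
                return seg
            raise AccessViolation(f"{kind!r} access denied at {addr:05X}")
    raise AccessViolation(f"address {addr:05X} is not mapped")


layout = [
    Segment(0x0000, 0x0100, "rx"),   # TEXT: readable and executable
    Segment(0x0100, 0x0200, "rw"),   # DATA: readable and writable
]

check_access(layout, 0x0010, "x")      # instruction fetch from TEXT is allowed
try:
    check_access(layout, 0x0120, "x")  # instruction fetch from DATA is refused
except AccessViolation as exc:
    print(exc)
```

A simulator would call such a check with kind "x" before every instruction fetch. The solution is only partial because constants placed inside an executable segment can still be decoded as instructions.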

Similarly, there are various approaches to disassembly. SicTools just eagerly disassembles the machine code, as explained above by kjenova. You could also look at Ghidra or IDA (reverse engineering tools) to see a plethora of other approaches.

Ok, I think I get it now, so I have to make sure myself that I do some kind of GOTO so that PC doesn't go into the data. Thanks guys!