Sarcasm Z80 Assembler

General Information

Sarcasm is a Z80 assembler written in Perl. Perhaps its best feature is that it is totally awesome. Second to that would be... ...yes, definitely the multiple instructions on a line separated by semicolons. If that idea turns you off, you'd best go elsewhere because there's a lot more where that came from. Can you say "pays more attention to whitespace than commas" without having a seizure? It truly is an awesome assembler, even if it seems that nineteen out of twenty people are completely uninterested after reading this page.

Download

The current version, 2016-05-05, is available in both ZIP and TGZ formats, and is also available as a bunch of files.

Note: This latest release of Sarcasm contains a significant change to how labels are specified. (A colon is now required, whereas in previous versions, using a colon would have been a syntax error.) Because of this, if you are compiling old code and don't want to have to update it, you'll want to download the previous version. Note that this is essentially the only change in this release, so there's no reason to update if you don't feel like changing all of your code at the moment.

The previous version, 2014-12-03, is also available in both ZIP and TGZ formats, as well as a bunch of files.

Archival copies of even older releases are available in the download area just in case you feel like using something with known bugs.

Windows users will likely need something like Strawberry Perl so that they can execute Perl scripts. Linux and FreeBSD users can likely use Sarcasm without any additional software.

Documentation

Information

Sarcasm is what I call a "search and replace" assembler.

Many years ago I wanted to write an assembler, but the thought of writing hundreds of lines of code to compare each opcode to string constants, then dozens more to compare each operand to string constants, then bunches of code to calculate the bytes that represent that opcode and those operands, seemed totally unlike anything I was interested in doing. ...but eventually I had an idea: What if I just made a file that contained every possible instruction one might type, and the byte sequence that represents that instruction? Then the assembler could be dumb and just look up the answers in a table. That sounded like a much easier programming challenge, one which I might actually complete.

Sure enough, I did complete it, because now we have Sarcasm the Z80 Assembler. There's a file that comes with it named "opcodes.txt" which isn't simply a list of opcodes it accepts, but rather, it's actually a part of Sarcasm and tells it what byte sequences to generate for each instruction. Once Sarcasm has cleaned up the formatting of your source code, and accounted for all of your labels and any directives, it just does a search-and-replace on what's left over. If you type "ld a [label]" it finds a line that reads "3A ld a [xxxx]" in opcodes.txt and subsequently knows that, once it calculates the value of your label, it just needs to output byte "3A" followed by the two byte representation of your label and it's done.
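
To make the search-and-replace idea concrete, here is a minimal Python sketch. It is not Sarcasm's actual Perl code, and the real opcodes.txt may be laid out differently: the two-line table, the load_table and assemble_line functions, and the assumption that the label's value is already known are all invented for illustration. Only the "3A ld a [xxxx]" entry comes from the text above.

```python
import re

# Hypothetical miniature of opcodes.txt: "<hex bytes> <instruction pattern>".
# Only the "3A ld a [xxxx]" line is taken from the text; the second entry is
# the standard Z80 NOP encoding, added so the table has a no-operand example.
OPCODES_TXT = """\
3A ld a [xxxx]
00 nop
"""

def load_table(text):
    """Map each instruction pattern to its opcode bytes."""
    table = {}
    for line in text.splitlines():
        hex_bytes, pattern = line.split(" ", 1)
        table[pattern] = bytes.fromhex(hex_bytes)
    return table

def assemble_line(line, table, labels):
    """Look the instruction up in the table instead of parsing it."""
    operand = None

    def to_placeholder(match):
        nonlocal operand
        operand = labels[match.group(1)]
        return "[xxxx]"

    # Swap the bracketed label for the placeholder used in the table.
    pattern = re.sub(r"\[(\w+)\]", to_placeholder, line)
    out = table[pattern]
    if operand is not None:
        out += operand.to_bytes(2, "little")  # Z80 words are low byte first
    return out

table = load_table(OPCODES_TXT)
machine_code = assemble_line("ld a [label]", table, {"label": 0x4009})
print(machine_code.hex(" "))  # 3a 09 40
```

The point of the design is that assemble_line contains no knowledge of any particular instruction; adding an instruction means adding a line to the table.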

General Syntax

Sarcasm, like most non-ancient programming languages, uses the # symbol for comments, so that, also like most non-ancient programming languages, it can use the ; symbol to separate multiple instructions on a single line.

However, unlike every other programming language, it pays no attention to commas whatsoever. This came about as I was writing the parser and somehow found looking for commas and spitting out error messages when they weren't present to be an insurmountable pain in the ass. So I decided there would be no commas. However, an unavoidable habit of typing commas in assembly code eventually forced me to make Sarcasm accept them, but they're treated as being no different from spaces; Sarcasm doesn't care whether you use them or where you put them as they simply aren't part of its syntax.

Sarcasm also uses square brackets [ and ] instead of parentheses ( and ) for dereferencing pointers. I can't say there's any more reason for this than that it's what I'm used to from having used NASM for many years, and parentheses are for math. Of course, Sarcasm won't recognize parentheses in math, which makes it all the more strange that it also won't recognize them for dereferencing pointers, but if you continue reading you'll realize this is relatively meaningless in the clusterfuck of how Sarcasm isn't like other Z80 assemblers.

Labels are declared with a : suffix, which functions just like a ;, but which also indicates that what precedes it is a label. There's much more information about labels in the section documenting the namespace directive.
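
The rules in this section (# starts a comment, commas count as spaces, ; separates statements, : ends a label and also a statement) can be sketched as a small Python scanner. This is only an illustration under assumptions, not Sarcasm's real parser: the split_statements name is made up, and the handling of quoted strings is inferred from examples such as "This is message #1." where a # inside a string clearly isn't a comment.

```python
def split_statements(line):
    """Split one source line into statements, each a list of tokens."""
    statements, tokens, token, quote = [], [], "", None

    def end_token():
        nonlocal token
        if token:
            tokens.append(token)
            token = ""

    def end_statement():
        nonlocal tokens
        end_token()
        if tokens:
            statements.append(tokens)
            tokens = []

    for ch in line:
        if quote:                  # inside a text string: copy verbatim
            token += ch
            if ch == quote:
                quote = None
                end_token()
        elif ch in "\"'":          # opening quotation mark
            end_token()
            quote = ch
            token = ch
        elif ch == "#":            # comment runs to the end of the line
            break
        elif ch in " \t,":         # commas are no different from spaces
            end_token()
        elif ch == ";":            # statement separator
            end_statement()
        elif ch == ":":            # label suffix, which also ends the statement
            token += ch
            end_statement()
        else:
            token += ch
    end_statement()
    return statements

print(split_statements("ld a, $00; out port_a  # set up the port"))
print(split_statements('message_1: data "This is message #1." $00'))
```

The first call yields two statements, [['ld', 'a', '$00'], ['out', 'port_a']], and the second yields the label followed by the data statement with the string kept intact.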

Directives

range [name] [lowest address] [highest address]

This directive allows you to define and name a range of memory. Memory ranges allow you to tell Sarcasm where code and data will exist in Z80 memory. They also allow Sarcasm to warn you if you add too much code or data and end up outside of the range of memory you wanted your code or data to exist in.

Examples:

range rom $0000 $7FFF
range ram $8000 $FFFF

...or perhaps something more elaborate...

range start $0000 $0037
range int   $0038 $0065
range nmi   $0066 $007F
range code  $0080 $3FFF
range data  $4000 $7FFF
range ram   $8000 $EFFF
range stack $F000 $FFFF

section [name]

This directive tells Sarcasm which memory range you want the following code or data to be assembled into. Sarcasm maintains a separate "address pointer" for each memory range so that you can switch back and forth between them without accidentally overwriting code previously added to the section.

Example:

range code $0080 $3FFF
range data $4000 $7FFF

section code
  ld hl message_1
  call print_message

section data
  message_1: data "This is message #1." $00

section code
  ld hl message_2
  call print_message

section data
  message_2: data "This is message #2." $00

section code
  print_message: # dummy label for non-existent function

output test.rom $0000 $7FFF

In the above example, despite being interspersed within the code instructions in the source file, the message strings will be separated into the 'data' section of the generated ROM file, and the two pieces of code within the 'code' section will be contiguous, as shown in this hex code dump:

$ hexdump -C test.rom
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  21 00 40 cd 8c 00 21 14  40 cd 8c 00 00 00 00 00  |!.@...!.@.......|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004000  54 68 69 73 20 69 73 20  6d 65 73 73 61 67 65 20  |This is message |
00004010  23 31 2e 00 54 68 69 73  20 69 73 20 6d 65 73 73  |#1..This is mess|
00004020  61 67 65 20 23 32 2e 00  00 00 00 00 00 00 00 00  |age #2..........|
00004030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008000
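
The bookkeeping behind that dump can be modelled in a few lines of Python. This is a sketch under assumptions rather than Sarcasm's own code: the Range class and emit function are hypothetical names, and the instruction bytes are copied straight from the hex dump instead of being assembled, since resolving the forward references to the labels is outside what this sketch covers.

```python
class Range:
    """A named memory range with its own address pointer."""
    def __init__(self, lowest, highest):
        self.lowest, self.highest = lowest, highest
        self.pointer = lowest

memory = bytearray(0x10000)  # 64 kB, zero-filled
ranges = {"code": Range(0x0080, 0x3FFF), "data": Range(0x4000, 0x7FFF)}

def emit(section, payload):
    """Write payload at the section's pointer and return its start address."""
    r = ranges[section]
    start = r.pointer
    if start + len(payload) - 1 > r.highest:
        raise ValueError(f"range '{section}' overflows past ${r.highest:04X}")
    memory[start:start + len(payload)] = payload
    r.pointer += len(payload)
    return start

# Same order as the source file: code, data, code, data.
first_code = emit("code", bytes.fromhex("21 00 40 cd 8c 00"))
message_1 = emit("data", b"This is message #1.\x00")
second_code = emit("code", bytes.fromhex("21 14 40 cd 8c 00"))
message_2 = emit("data", b"This is message #2.\x00")

print(f"${first_code:04X} ${second_code:04X} ${message_1:04X} ${message_2:04X}")
# $0080 $0086 $4000 $4014
```

Because each range remembers where it left off, switching sections never rewinds a pointer, which is why the two code fragments end up back to back at $0080 and $0086.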

At this point you probably feel like you know everything there is to know about how to use Sarcasm, but keep reading as there are seven more directives!

output [filename] [lowest address] [highest address]

Sarcasm internally keeps a 64 kB buffer which it assembles all instructions and data into. This directive tells it to write a portion of that memory to a file. You can use this directive multiple times to create multiple output files. The data written to the file is only what is in that 64 kB memory buffer at the time this directive is encountered, and so you generally want this directive to appear as the last line of your source code. If Sarcasm never encounters an output directive, it displays an error message.
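
As a rough Python model of what this directive does (the output function and the temporary path are stand-ins for illustration, not Sarcasm's real implementation), note that both addresses are inclusive, which is why the example dumps in this document end at offset 00008000:

```python
import os
import tempfile

memory = bytearray(0x10000)        # stand-in for the 64 kB buffer
memory[0x0080:0x0083] = b"one"     # pretend something was assembled here

def output(filename, lowest, highest):
    """Write memory[lowest..highest], both ends inclusive, to a file."""
    with open(filename, "wb") as f:
        f.write(memory[lowest:highest + 1])

path = os.path.join(tempfile.mkdtemp(), "test.rom")
output(path, 0x0000, 0x7FFF)
print(os.path.getsize(path))       # 32768 bytes, i.e. $8000
```

Anything written into the buffer after the call simply isn't in the file, which matches the advice to put the directive on the last line.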

goto [address]

This directive simply changes the "address pointer" that Sarcasm writes code or data to within the current range/section. The specified address must be within the current range/section, otherwise an error message is generated.

Example:

range test $0080 $3FFF

section test

  data "one"
  goto $1000; data "two"
  goto $0800; data "three"

output test.rom $0000 $7FFF

...which outputs this ROM file...

$ hexdump -C test.rom
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  6f 6e 65 00 00 00 00 00  00 00 00 00 00 00 00 00  |one.............|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000800  74 68 72 65 65 00 00 00  00 00 00 00 00 00 00 00  |three...........|
00000810  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  74 77 6f 00 00 00 00 00  00 00 00 00 00 00 00 00  |two.............|
00001010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008000

bytes [list of byte values]
words [list of word values]

These directives insert binary numbers into the assembly output. The bytes directive accepts only bytes and the words directive accepts only words. Each value may be a decimal number, a hexadecimal number (prefixed with a $ symbol), a code label, or simple arithmetic (addition and subtraction) involving any of those.

Example:

range test $4000 $7FFF

section test

  # Remember, commas are unnecessary and are effectively spaces,
  # and are included only to make the statements easier to read.

  bytes 1, 2, 1+2
  words $4567, random_label, random_label+$10

  random_label:
  data "random_label is here"

output test.rom $0000 $7FFF

...which outputs this ROM file...

$ hexdump -C test.rom
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004000  01 02 03 67 45 09 40 19  40 72 61 6e 64 6f 6d 5f  |...gE.@.@random_|
00004010  6c 61 62 65 6c 20 69 73  20 68 65 72 65 00 00 00  |label is here...|
00004020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008000
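
The byte order in that dump is the Z80's little-endian convention: each word is stored low byte first. A short Python sketch reproduces the nine bytes at $4000. The encode_bytes and encode_words helpers are hypothetical names, and the label's address is worked out by hand here rather than by an assembler.

```python
import struct

def encode_bytes(values):
    """One byte per value; struct raises an error outside 0..255."""
    return b"".join(struct.pack("<B", v) for v in values)

def encode_words(values):
    """Two bytes per value, low byte first; errors outside 0..65535."""
    return b"".join(struct.pack("<H", v) for v in values)

# Three bytes and three words (six bytes) precede the label at $4000.
random_label = 0x4000 + 3 + 3 * 2

out = encode_bytes([1, 2, 1 + 2])
out += encode_words([0x4567, random_label, random_label + 0x10])
print(out.hex(" "))  # 01 02 03 67 45 09 40 19 40
```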

data [list of data items]

This directive inserts binary data into the assembly output. It supports four data types: bytes, words, text strings, and byte strings. To differentiate between bytes and words, each is required to be exactly two or four hexadecimal digits in length. Text strings may be enclosed in single or double quotation marks. Byte strings are a string of hexadecimal digits prefixed with a ! symbol, an even number of digits in length, which will be stored in "big endian" order, a.k.a. the order in which you type the various bytes within the string.

Example:

range test $4000 $7FFF

section test

# The data directive allows easy mixing of data types:

data $01 $0203 "Four" !12345678 !DECADE

# Text strings can be in multiple formats:

data "This is a text string."
data 'This is also a string.'

output test.rom $0000 $7FFF

...which outputs this ROM file...

$ hexdump -C test.rom
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004000  01 03 02 46 6f 75 72 12  34 56 78 de ca de 54 68  |...Four.4Vx...Th|
00004010  69 73 20 69 73 20 61 20  74 65 78 74 20 73 74 72  |is is a text str|
00004020  69 6e 67 2e 54 68 69 73  20 69 73 20 61 6c 73 6f  |ing.This is also|
00004030  20 61 20 73 74 72 69 6e  67 2e 00 00 00 00 00 00  | a string.......|
00004040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008000
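
How the four data types can be told apart from their spelling alone is sketched below in Python. It is an illustration, not Sarcasm's actual logic: encode_data_item is a made-up name, and encoding text as ASCII is an assumption that merely happens to match the dump above.

```python
def encode_data_item(item):
    """Encode one item of a data directive into bytes."""
    if item[0] in "\"'":                       # text string
        if len(item) < 2 or item[-1] != item[0]:
            raise ValueError(f"unterminated string: {item}")
        return item[1:-1].encode("ascii")      # assumption: plain ASCII
    if item.startswith("!"):                   # byte string, stored as typed
        digits = item[1:]
        if len(digits) % 2:
            raise ValueError(f"odd number of digits: {item}")
        return bytes.fromhex(digits)
    if item.startswith("$"):
        digits = item[1:]
        if len(digits) == 2:                   # exactly two digits: a byte
            return bytes.fromhex(digits)
        if len(digits) == 4:                   # exactly four digits: a word
            return int(digits, 16).to_bytes(2, "little")
    raise ValueError(f"unrecognized data item: {item}")

items = ["$01", "$0203", '"Four"', "!12345678", "!DECADE"]
encoded = b"".join(encode_data_item(i) for i in items)
print(encoded.hex(" "))  # 01 03 02 46 6f 75 72 12 34 56 78 de ca de
```

Note the contrast between $0203, which is byte-swapped to 03 02 because it is a word, and !12345678, which is written out exactly in the order typed.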

replace [name] [replacement text]

This directive replaces "name" with the "replacement text" every time it is encountered in the source file. This allows you to create named constants, so that frequently used numbers can be easily changed simply by changing one line in the source code rather than dozens.

Unfortunately, these statements are parsed only after Sarcasm has reformatted all of the code for easy parsing, and so you can't do any arithmetic with these substitutions as Sarcasm won't understand what it sees, and correcting this seems beyond my ability at the moment. So you'll have to define a name for each constant, like this...

range test $0080 $3FFF

replace port_a $60
replace port_b $61
replace port_c $62

section test

  ld a $00; out port_a
  ld a $20; out port_b
  ld a $FC; out port_c

output test.rom $0000 $7FFF

...which really isn't a big deal in my opinion. It'd be nice if we could just define "port" as "$60" and then use "port+1" and "port+2" for the other two ports, but defining three separate constants is more in keeping with the purpose of named constants anyway. What if port_b becomes "port+5" in the future? Do you want to search your code for instances of "+2" and replace them with "+5" or do you want to just change the definition of "port_b" and be done?
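
One way to picture the limitation is the Python sketch below. It rests on a guess, namely that names are matched against whole tokens after the source has been split up, which may not be how Sarcasm actually does it; apply_replacements and the token lists are invented for illustration.

```python
# Names defined by the three replace directives in the example above.
replacements = {"port_a": "$60", "port_b": "$61", "port_c": "$62"}

def apply_replacements(tokens):
    """Substitute a token only when the whole token is a defined name."""
    return [replacements.get(tok, tok) for tok in tokens]

print(apply_replacements(["out", "port_b"]))    # ['out', '$61']
print(apply_replacements(["out", "port_a+1"]))  # ['out', 'port_a+1']
```

Under that assumption "port_a+1" never matches the name "port_a", so it passes through untouched and nothing downstream knows what to do with it.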

namespace [name]

This directive is perhaps the most difficult to explain, largely because Sarcasm features two levels of local variables.

In most assemblers, you'll create labels, and those labels will be accessible only within the source file you create them in, unless you export the labels with an "export" directive and then import them in another file with a "global" directive. Sarcasm similarly allows one source file to avoid polluting the namespace of another source file, but without the "export" and "global" directives.

In Sarcasm, each source file exists in its own namespace. By default, the name of this namespace is the name of the source file minus its file extension. If, for example, you have a file named "tree.asm" and it contains a label "leaf", you can access that label from other source files by typing "tree.leaf" so that Sarcasm knows to look in the "tree" namespace for the "leaf" label.

However, if you aren't happy with this default name for the namespace, you can specify your own with the namespace directive. Additionally, if you want, you can have multiple namespaces within a single file, and switch back and forth between them as much as you like.

...but we're not done yet. Labels in Sarcasm are kind of complex. It's a good thing, as assembly language is full of so many labels and having to rename them all every time you copy and paste a loop gets old real quick, and these complex label features help with that, but they're hard to explain.

Perhaps the best I can do is with this example code:

range test $0000 $7FFF
section test

namespace one

apple: data "apple in one"
peach: data "peach in one"

namespace two

apple: data "apple in two"
peach: data "peach in two"

# The identical labels are allowed because each is in a separate namespace.

# When you use a label, the current namespace is used if one is not specified:

namespace one
  ld hl apple         # loads address of "apple in one"
  ld hl one.apple     # loads address of "apple in one"
  ld hl two.apple     # loads address of "apple in two"

# However, to complicate things further, there are also sub-labels:

namespace three
  apple:
    .skin; data "apple.skin in three"
    .core; data "apple.core in three"
  pear:
    .skin; data "pear.skin in three"
    .core; data "pear.core in three"

# Sub labels can only be accessed in short form until a new label is declared.

    ld hl .skin       # loads address of "pear.skin in three"
    ld hl .core       # loads address of "pear.core in three"

  orange:
    .skin; data "orange.skin in three"

# Once you declare a new label, you need to specify sub-labels as such:

    ld hl apple.skin  # loads address of "apple.skin in three"

# ...or if you are in a different namespace...

namespace four

  ld hl three.apple.skin    # loads address of "apple.skin in three"

# So you might ask, what if you have this:

namespace xxx
  yyy: data "xxx.yyy"
    .zzz; data "xxx.yyy.zzz"

namespace whatever
  xxx: data "whatever.xxx"
    .yyy; data "whatever.xxx.yyy"

# ...and then you do something like this...

  ld hl xxx.yyy       # Is that 'yyy' in the 'xxx' namespace, or
                      # is it 'xxx.yyy' in the 'whatever' namespace?

# Well, the answer is that it is "whatever.xxx.yyy" so long as you are in
# the 'whatever' namespace, as Sarcasm prefers the more local match.

Well, I hope that explains it anyway.
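
The lookup order from that last case can also be written down as a short Python sketch. It assumes labels are stored under fully qualified names and that the current namespace is tried first; the resolve function and the symbols set are hypothetical, not taken from Sarcasm's source.

```python
# Fully qualified names of some labels from the examples above.
symbols = {
    "xxx.yyy", "xxx.yyy.zzz",
    "whatever.xxx", "whatever.xxx.yyy",
    "three.apple", "three.apple.skin",
    "three.pear", "three.pear.skin",
}

def resolve(ref, namespace, current_label=None):
    """Return the fully qualified name that a label reference refers to."""
    if ref.startswith("."):
        # Short form: a sub-label of the most recently declared label.
        candidates = [f"{namespace}.{current_label}{ref}"]
    else:
        # Prefer the more local match, then try ref as a full name.
        candidates = [f"{namespace}.{ref}", ref]
    for name in candidates:
        if name in symbols:
            return name
    raise KeyError(f"unknown label: {ref}")

print(resolve("xxx.yyy", "whatever", "xxx"))  # whatever.xxx.yyy
print(resolve("xxx.yyy", "four"))             # xxx.yyy
print(resolve(".skin", "three", "pear"))      # three.pear.skin
print(resolve("three.apple.skin", "four"))    # three.apple.skin
```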

Opcodes

Sarcasm's opcodes aren't identical to the typical Z80 opcode syntax you'll see elsewhere on the internet. I'll try to document all of the differences here.

IX, IY, IXL, IXH, IYL, IYH

The Z80 had such a nice scheme going on with its register names until those undocumented registers came along. Byte registers were one letter and word registers were two letters. Now we have three letter registers that aren't 24-bit registers? ...and it was so cool that the two 8-bit halves of HL were H and L.

IX in Sarcasm's syntax is ST, and its 8-bit halves are S and T
IY in Sarcasm's syntax is UV, and its 8-bit halves are U and V

RST, BIT, SET, RES, and IM

The way that Sarcasm works doesn't allow for instructions to include numbers as operands when those numbers don't become bytes in the machine code. For this reason, these instructions have a different format in Sarcasm.

RST $38 in Sarcasm's syntax is rst38
BIT 3, A in Sarcasm's syntax is bit3 a
SET 3, (IX+7), A in Sarcasm's syntax is set3 a [st+7]
IM 1 in Sarcasm's syntax is im1

EX AF, AF'

As Sarcasm uses the apostrophe as a quotation mark, having a single quote in an instruction just isn't possible. As such, this instruction in Sarcasm syntax is merely ex af with no second operand.

JP (HL)

This instruction's syntax is just the result of someone's confusion. As written in typical Z80 assembly syntax, one would think that it reads an address from the memory pointed to by HL and then jumps to that address, but in reality it simply jumps to the address stored in HL. To avoid this unnecessary confusion, in Sarcasm the syntax of this instruction is simply jp hl so that it looks like what it does.

RLC, RRC, RL, RR, RLCA, RRCA, RLA, RRA

These just confuse the fuck out of me. I'm far too used to seeing "C" as a symbol for the carry flag, but the ones with "C" in them are the ones that don't rotate through the carry flag. Then it becomes even more confusing when you realize that the "C" stands for "circular" and so you start thinking that the opcodes without a "C" in the name don't rotate the bits but instead simply shift them.

I think Intel 8086 assembly named these instructions much better, and so I've adopted those opcode names for Sarcasm:

RLC and RLCA in Sarcasm's syntax are rol, a.k.a. "rotate left"
RRC and RRCA in Sarcasm's syntax are ror, a.k.a. "rotate right"
RL and RLA in Sarcasm's syntax are rcl, a.k.a. "rotate carry left"
RR and RRA in Sarcasm's syntax are rcr, a.k.a. "rotate carry right"

As for the versions with "A" appended on the end, which generate one-byte opcodes instead of two-byte opcodes, in Sarcasm you just use the instruction without an operand to generate those.
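
For anyone still unsure which rotate is which, here is a small Python model of the two left rotates on an 8-bit value. It only tracks the result and the carry flag and ignores the other flags; the functions are named after the Sarcasm mnemonics purely for illustration.

```python
def rol(value, carry):
    """RLC/RLCA: bit 7 wraps around to bit 0 and is also copied to carry.
    The incoming carry is ignored."""
    bit7 = (value >> 7) & 1
    return ((value << 1) & 0xFF) | bit7, bit7

def rcl(value, carry):
    """RL/RLA: the old carry enters bit 0 and bit 7 leaves into carry,
    so value and carry together behave like one 9-bit rotate."""
    bit7 = (value >> 7) & 1
    return ((value << 1) & 0xFF) | carry, bit7

print(rol(0x81, 0))  # (3, 1)
print(rcl(0x81, 0))  # (2, 1)
print(rcl(0x01, 1))  # (3, 0)
```

The right rotates ror and rcr mirror these, with bit 0 and bit 7 swapping roles.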

CPL

I just couldn't remember CPL for the life of me. So I swapped it out with the 8086 opcode name.

CPL in Sarcasm's syntax is not, like the logic gate.

SCF, CCF

"Set carry flag" and "clear carry flag?" No, fuck you, it's "complement carry flag." Bullshit like this causes me to waste days debugging code, so I went with the less ambiguous 8086 opcodes:

SCF in Sarcasm's syntax is stc, "set carry"
CCF in Sarcasm's syntax is cmc, "complement carry"

Tossing that "m" in there makes it so much less ambiguous.

SLA, SRA, SLL, SRL

Well, fuck if SLL isn't a useless instruction made to resemble a useful instruction.

SLA in Sarcasm's syntax is SHL a.k.a. "shift left"
SRL in Sarcasm's syntax is SHR a.k.a. "shift right"
These two instructions are useful for scaling unsigned numbers.

SLA in Sarcasm's syntax is also SAL a.k.a. "shift arithmetic left"
SRA in Sarcasm's syntax is SAR a.k.a. "shift arithmetic right"
These two instructions are useful for scaling signed numbers.

Finally, SLL in Sarcasm's syntax is SIL a.k.a. "shift illogically left", to reflect the fact that what it does doesn't make a damn bit of sense and so you probably shouldn't be using it.
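
The difference between the shifts is easiest to see on actual bits. The Python sketch below models each one on an 8-bit value (assumed to be in the range 0 to 255) and returns the result together with the bit that falls out into the carry flag; other flags are ignored, and the functions are named after the Sarcasm mnemonics only for illustration.

```python
def shl(v):
    """SLA (shl/sal): shift left, bit 0 becomes 0."""
    return (v << 1) & 0xFF, v >> 7

def shr(v):
    """SRL (shr): shift right, bit 7 becomes 0 (unsigned divide by two)."""
    return v >> 1, v & 1

def sar(v):
    """SRA (sar): shift right, bit 7 keeps its value (signed divide by two)."""
    return (v >> 1) | (v & 0x80), v & 1

def sil(v):
    """Undocumented SLL (sil): shift left, but bit 0 becomes 1."""
    return ((v << 1) & 0xFF) | 1, v >> 7

for name, op in [("shl", shl), ("shr", shr), ("sar", sar), ("sil", sil)]:
    result, carry = op(0x81)
    print(f"{name} $81 -> ${result:02X} carry={carry}")
```

Run on $81, this gives $02, $40, $C0 and $03 respectively, each with the carry set; the stray 1 in the last result is what earns sil its name.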

ADD, ADC, SUB, SBC, AND, XOR, OR, CP

The 8-bit versions of these instructions take only one operand, as the destination register is always the A register. Seeing a register as an operand always causes me to assume I can specify a different register, which just wastes time when I rewrite code, attempt to compile, then I'm reminded that I don't actually have a choice, and so I have to restore the code to its original version.

IN, OUT

Similarly, you only get to choose one of the two operands to these instructions, so I dropped the non-optional operands.

IN A, $A5 in Sarcasm's syntax is in $A5
IN A, (C) in Sarcasm's syntax is in a
IN C, (C) in Sarcasm's syntax is in c
IN (C) in Sarcasm's syntax is in
OUT (C), 0 in Sarcasm's syntax is out

OTIR, OTDR

Tossing the U out of OUT just to avoid having a five letter opcode is dumb as fuck. Just like Perl's "elsif" which just makes me want to break someone's head open. Well, maybe not break someone's head open, but it definitely makes me want to cry. Why create countless typos just to avoid typing one fucking letter?

OTIR in Sarcasm's syntax is outir
OTDR in Sarcasm's syntax is outdr

Contact Information

Feel free to send comments and questions to my email address. As far as I know, only five people use Sarcasm, so it isn't as if I'm swamped with email or anything. In fact, learning that someone else uses Sarcasm would make me ecstatic. ...and might even convince me to make it even more awesome.

Also, since you're clearly interested in Z80 stuff, you might want to have a look at my Z80 EEPROM Programmer or my Z80 System Design as they're awesome too. Especially that EEPROM programmer, it's shit-your-pants awesome.