Case study: migration of MSX project from tniASM to sjasmplus
"The project" is "the abandoned project by Daemos" for the MSX computer. I don't mention its original name because, while it is a non-profit hobby project for a retro platform, mentioning the original name is believed to bring bad luck by clashing with the intellectual property of one big Ncompany.
As a contributor to sjasmplus, I stumbled upon a mention of it in an MSX forum post where the project's author was considering a migration away from tniASM 0.45 (technical reasons were making future use of tniASM very difficult for him). He had only two requirements for the new assembler: 1) it assembles the project, 2) it is open source. Easy?! sjasmplus was mentioned there together with 20+k errors (and no further details). The post was about one year old.
I was looking for a real-world, non-trivial Z80 assembly project to verify that sjasmplus can be used at "production" level, and I was curious what the hurdles of such a migration are and what those 20+k errors were about.
I offered to take a look at the project to evaluate whether such a migration is possible and how to do it, and within a couple of days I got an email from Daemos with the sources of the project: about half a million lines of assembly.
After a quick look through the sources I was even more curious and slightly confused, because the syntax of the two assemblers is "identical" at first look, and there were no complex macros/scripts; it was all straightforward assembly in Zilog syntax.
So where did those 20+k errors come from? It's all in the details. Let's go through them, to share the experience with anyone else interested in using sjasmplus for their projects (but used to a different assembler):
After reviewing the initial error list I noticed a few things which didn't seem to be a good fit for an automated migration script, yet could be fixed in a way that keeps the source working in tniASM. I decided to hunt these first and adjust the source manually (to make it more compatible with sjasmplus and increase the chances of automated migration) while keeping it in a working state (for simple verification of the final binary).
Initially I was getting only about 1500 errors, which didn't seem so bleak. It turned out the project contained a few local labels named .end:, which sjasmplus assembles as the directive END with a colon after it, ending the assembling process completely. So the first manual fix of the sources was to put all ".end:" label definitions at the beginning of the line, where a label definition must start in sjasmplus (it can't be defined later, after some whitespace).
But while doing so, I noticed the source of the project contains plenty of directives (like include), and even a few instructions, starting at the beginning of the line. There is actually the option --dirbol to enable DIRectives at Beginning Of Line, and it should ignore "end/.end" labels at the beginning of the line, i.e. a perfect fit for the situation, so I enabled the --dirbol option as the next step.
Here is the part where being a contributor to the sjasmplus project makes the migration considerably easier: I'm not only aware of these obscure options, but when it turns out they don't work properly (still ending the assembling process on the ".end:" label, even though it should have been ignored), I can simply add a few new automated tests to the sjasmplus project to simulate the failing state and fix the bug in the assembler (a non-contributor would have to work around the issue by changing the name of the label to something like ".endthis:", or report the bug and wait for a fix and a new release).
With the fixed version of the assembler the process finally got through more than just a few thousand source lines... only to end instantly on missing files in include/incbin directives. As with many sources developed in a Windows environment (and later with "wine" on Linux), the file names used throughout the source were not ready for a case-sensitive file system, and the paths contained the backslash as path delimiter. So I did the second big manual edit, fixing all file names and paths in the source to be Linux compatible, using only regular slashes and the correct upper/lower case letters.
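A check like the following could list the include/incbin paths which do not resolve on a case-sensitive file system. This is only a sketch and was not part of the original migration: the function name `list_missing_includes` is hypothetical, and it assumes one double-quoted path per directive, paths relative to a single base directory, and a case-sensitive file system such as the usual Linux ones.

```shell
# Hypothetical helper, not taken from the migration scripts.
# $1 = asm source file, $2 = directory the include paths are relative to.
# Prints every include/incbin path that does not exist under $2, after
# converting backslashes to regular slashes.
list_missing_includes() {
    # keep only lines starting with an include/incbin directive and a "path"
    grep -iE '^[[:space:]]*(include|incbin)[[:space:]]+"[^"]+"' "$1" \
    | sed -E 's/^[^"]*"([^"]+)".*/\1/' \
    | tr '\\' '/' \
    | while IFS= read -r path; do
        # -e is case-sensitive on a case-sensitive file system
        [ -e "$2/$path" ] || printf '%s\n' "$path"
    done
}
```

Running it over every source file before touching anything gives a to-do list of the paths which need their case or delimiter fixed.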
There were also a few instructions/directives in mixed case, like Equ. In the default configuration sjasmplus requires instructions/directives to be written in a single case (equ/EQU), and as there were only about 15 of these, I quickly fixed them by hand. There is the option --syntax=i to make instructions/directives case insensitive, but I prefer to use mixed case only for labels, so I didn't use the option and fixed the source instead.
Another type of error was about DB/DW lines exceeding the limit of 128 elements per line. This one is quite baked into the C-like implementation of sjasmplus, both the 128 elements and the 2048 total characters per line. So there was no easier way around it than to split those long DB/DW lines into multiple lines of source. I did it manually, as there were not that many of them (at least it seemed so), but it still took me about an hour or two. Maybe producing some regex to split them automatically would have been faster (and less error prone?).
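For illustration, the automatic split mused about above might look like the following awk filter. It is a sketch under strong assumptions, not what was actually used: the name `split_long_data_lines` is hypothetical, and it only handles lines which consist of indentation, `db` or `dw`, and plain comma-separated values (no label in front, no strings containing commas, no trailing comment).

```shell
# Hypothetical helper: re-wrap db/dw lines read from stdin so that no line
# carries more than $1 items (default 128, the sjasmplus limit).
split_long_data_lines() {
    awk -v max="${1:-128}" '
    # data line: optional indentation, db/dw in any case, then the item list
    match($0, /^[ \t]*[dD][bBwW][ \t]+/) {
        head = substr($0, 1, RLENGTH)            # indentation + directive
        n = split(substr($0, RLENGTH + 1), item, /[ \t]*,[ \t]*/)
        line = head
        for (i = 1; i <= n; i++) {
            line = line item[i]
            if (i == n || i % max == 0) {        # chunk is full or list ended
                print line
                line = head
            } else {
                line = line ","
            }
        }
        next
    }
    { print }                                    # all other lines untouched
    '
}
```

It would be used as a filter, for example `split_long_data_lines 128 < data.asm > data.new` (the file names are placeholders), followed by the same binary comparison as for the manual edits.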
All of these manual changes kept the source compatible with the original tniASM, so I assembled the modified source with tniASM and compared the resulting ROM binary with the original one to make sure I hadn't introduced any change. It was still the same.
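The comparison itself needs nothing more than `cmp`. A minimal sketch, with the hypothetical function name `compare_roms` and file names supplied by the caller: it stays silent for identical files and otherwise prints the zero-based hex offsets of the first few differing bytes.

```shell
# Hypothetical helper.
# $1 = reference ROM, $2 = freshly built ROM, $3 = max. differences to print
compare_roms() {
    if cmp -s "$1" "$2"; then
        return 0                      # byte-for-byte identical
    fi
    # "cmp -l" prints: <1-based decimal offset> <octal byte> <octal byte>
    cmp -l "$1" "$2" 2>/dev/null \
    | head -n "${3:-10}" \
    | while read -r off a b; do
        # a leading 0 makes the shell arithmetic read the values as octal
        printf '%06X: %02X -> %02X\n' "$(( off - 1 ))" "$(( 0$a ))" "$(( 0$b ))"
    done
    return 1
}
```

The printed offsets can then be looked up in an assembler listing file to find the source line which produced the differing byte.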
Assembling the same source with sjasmplus still produced about 25k errors (the number kept growing as sjasmplus reached further into the source thanks to the file name fixes), but now there were distinct groups of errors which I believed could be resolved by an automated migration script and/or macros, so "stage 1" of the migration (manual edits compatible with the previous assembler) was finished.
For "stage 2", an automated script to patch the source, I decided to use common GNU tools: bash, find and sed, writing a bash script which searches for all source files in the project and feeds them one by one into sed with the migration commands, patching the source automatically. The goal was to get a fully migrated source which assembles with sjasmplus with zero errors.
The first rule was to split lines where the original source defined more than one label on the same line, while in sjasmplus a label definition must start at the beginning of the line, i.e. a single label per line at most. These were reasonably easy to detect in the original source, because tniASM requires labels to be strictly followed by a colon character, and the colon is not used in any other context (except strings/comments), so three fine-tuned regular expressions resolved all of these without any false positive.
The next rule was to replace the pipe character used as a delimiter between multiple instructions on a single line. sjasmplus supports the same feature, but its delimiter is the colon, i.e. ldi | ldi was replaced by ldi : ldi (the pipe character in sjasmplus is the binary OR operator in expressions).
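The pipe rule can be tried in isolation on a few sample lines. The sketch below is equivalent in intent to the rule in the migration script, only spelled with POSIX character classes instead of the GNU `\s` shorthand; the function name `pipes_to_colons` is made up for this example.

```shell
# Hypothetical wrapper around the pipe rule, reading source lines from stdin.
# The address /^[^;]*\|/ selects only lines which have a pipe ahead of any
# semicolon, so a pipe appearing only inside a comment leaves the line alone.
pipes_to_colons() {
    sed -E '/^[^;]*\|/ s/[[:space:]]?\|[[:space:]]?/ : /g'
}
```

Note that once a line is selected, the `g` flag replaces every pipe on it, including any inside a trailing comment or a string, which is one reason the real rules had to be vetted against the actual diffs.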
But it turned out tniASM is even more relaxed about the syntax, and it actually allows lines like ldi ldi ldi ldi without any delimiter except a space. There's no equivalent for this in sjasmplus (although there's an alternative using the dot-repeater syntax: .4 ldi), so I added a few hard-coded rules to detect the five different instructions unrolled in this way and add colons between them, plus a few more specific rules to add a colon between some DB/DW data structures which were also written together without any delimiter.
After each new rule I ran the script and checked the differences in the git-cola GUI client to verify that only the expected lines got transformed, and only in the expected way (tuning the rules in case they also hit something unexpected). Then I assembled the sources with sjasmplus to get a refreshed list of errors, and rolled the sources back with git to their original state before writing more migration rules, as the plan was to run everything as a single big transformation with a working source at the end.
The next rule was changing the { tniASM multi-line block comments } to /* sjasmplus multi-line block comments */.
Another issue with the original source were binary numbers. Both tniASM and sjasmplus share the %00110101 syntax, but in tniASM you can add any amount of spaces between digits, and the project was using that to improve the readability of the source. sjasmplus allows the C++ numeric literal syntax, so one can put a single apostrophe between two digits. For a case like %0011 0101 it was a simple rule to turn those into %0011'0101, but there were about three places in the source where the spaces were used to spread the digits across half of the line, aligning them with the comment line above. This had to be ruined: the sjasmplus version has only a single apostrophe grouping the bits into logical groups (like %01'101'011), so they can't align with the long comment (still, the final result seems good enough and readable).
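The simple two-nibble case can be reproduced with a one-line filter. This sketch wraps the same substitution in a hypothetical function `group_binary_digits`; putting the sed program in double quotes avoids the `'\''` quoting needed inside the single-quoted script, and `\b` is a GNU sed extension.

```shell
# Hypothetical wrapper: turn "%0011 0101" into "%0011'0101" on stdin lines.
# Only the exact "4 digits, one space, 4 digits" shape is handled.
group_binary_digits() {
    sed -E "s/(%[01]{4}) ([01]{4})\b/\1'\2/g"
}
```

Literals split into more than two groups, or spread out with several spaces, are not covered and need their own tailored rule, like the one for SeeSfxReplayer.asm in the full script.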
As I mentioned earlier, there were directives and instructions starting at the very beginning of the line. While the directives have the --dirbol option to support this syntax, instructions in sjasmplus can't start at the beginning of the line; at least one space/tab is mandatory. Luckily this can be resolved easily with a regex substitution.
Conversely, there were also lines defining labels which didn't start at the beginning of the line. This is again impossible in sjasmplus, so another rule was added to trim all white-space ahead of a label definition.
The last rule in the long sed command list was to rename local labels like .1: to .n1:, because in sjasmplus even a local label starting with a dot must have an underscore or a letter as its second character, not a digit (labels consisting of digits do exist in sjasmplus as "temporary labels" similar to C/C++, but that's not a good fit for what the original source was trying to achieve; it needed an actual valid regular label).
The remaining assembling errors were about the unknown directives/instructions fname, rb, rw. I added a new source file sjasm_preamble.asm defining sjasmplus macros with identical names and similar functionality. The preamble file is assembled together with the main file of the project, so it fixes the missing directive errors.
So after all these, I could assemble the project with sjasmplus and I got zero errors, right? Not really. There were still about 1500 errors, because sjasmplus is also case-sensitive for labels, and the sources contained multiple case variants of the same label in some places. Still, having the assembling almost working allowed me to export a file with all symbols used by the source and sort it by name to have all variants together. In a couple of hours I wrote yet another bash script, which goes through the symbols file line by line, finds the names that are identical except for the case, and creates a second set of sed commands to replace those labels with a single variant.
After a few iterations the generated sed script replaced all problematic labels, and after a few more iterations, fixes and tuning, it stopped replacing other occurrences of those strings in places like file names... ;) :)
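The detection half of that idea, finding names which differ only by letter case, can be sketched with a short awk filter. This is an illustration rather than the script actually used: the function name `find_case_collisions` is hypothetical, and it expects plain label names on stdin, one per line, so the names have to be cut out of the symbol export first.

```shell
# Hypothetical helper: print each group of names differing only by case,
# one group per line, in order of first appearance.
find_case_collisions() {
    awk '
    NF == 0 { next }                              # skip empty lines
    {
        key = tolower($1)
        if (!(key in spellings)) {
            spellings[key] = $1                   # first spelling seen
            order[++groups] = key
        } else if (index(" " spellings[key] " ", " " $1 " ") == 0) {
            spellings[key] = spellings[key] " " $1   # a new case variant
            collided[key] = 1
        }
    }
    END {
        for (i = 1; i <= groups; i++)
            if (order[i] in collided)
                print spellings[order[i]]
    }
    '
}
```

Unlike the approach described above, this does not depend on sorted input, because it keeps all spellings in memory; choosing which variant wins and generating the sed rules still has to happen afterwards.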
So after restoring all the sources back to the "after stage 1" state (when tniASM was still capable of building an identical binary) and running these two automated scripts, I finally got sources which assemble with sjasmplus with zero errors and produce the ROM binary. "Stage 2" of the migration was finished, after three evenings spent on the migration effort.
Unfortunately the binary was obviously different. Having an extra 150kiB of zeroes at the end (the original ROM file has 1.7MiB) didn't worry me too much, but running hexdiff on the two files revealed hundreds of differences in the actual machine code. So I produced the listing file with sjasmplus, and through a bit of detective work I hunted down the differing bytes in the binary and their origin in the source, most of them being the result of label values shifted by +1 to +4 bytes. After about a hundred of these, I finally found the real culprit: the non-Zilog syntax of the sub instruction, sub a,7. This doesn't report an error in the default configuration of sjasmplus, because sjasmplus by default has the "multi-argument" feature, where you can chain arguments for the same instruction together, so ld a,4,b,5 will assemble as ld a,4 : ld b,5. This may be a very handy feature if you are used to it, but in the case of sub a,7 it produces two sub instructions: sub a : sub 7. As this is not the first time this feature bit somebody, we had already added an option to sjasmplus to deal with it, but it's not the default setting, so adding OPT --syntax=a into the preamble file fixed these. The option changes the multi-argument delimiter to ",,", and then sub a,7 is understood and assembled as a single sub 7, just like tniASM does.
Fixing this also resolved the extra offset in some labels, so the hexdiff was now pointing to a block of extra zeroes, a few hundred bytes long, near the beginning of the file. These turned out to be a by-product of the imperfect rb/rw replacement macros, which still emitted bytes into the binary, while tniASM only advances the program counter without emitting output bytes. Adding a few FPOS directives to the macros, moving the output file pointer a few bytes back, was enough to resolve this, and the extra block of zeroes was gone.
But the binary still differed, now only toward its end, somewhere beyond 1.5MiB of identical bytes. These differences turned out to be the ugliest part of the migration. The project is designed in a way where it opens the output file at the beginning of the source and keeps including data and code in the correct order to produce the final ROM binary in one smooth go without any interruption, changing the program counter with ORG directives every now and then and using PHASE/UNPHASE directives extensively to assemble machine code with the correct target addresses baked in (to make it work after the code is relocated at runtime to its final destination).
It turned out tniASM allows the program counter to grow this way even beyond the 16 bit address space of the Z80, producing label values like $2C0400, but sjasmplus was truncating these to 16 bit values with warnings about exceeding the memory limits of the Z80 CPU. Later, when these truncated labels were used in math expressions calculating bank numbers and similar, the results were different. A quick fix, restoring the truncated bits by adding magic numbers in the source, finally fixed the remaining differences in the output, and the project was producing identical binary output when assembled with sjasmplus (this magic value fix also removed the extra 150kiB of zeroes at the end of the file).
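A bit of shell arithmetic shows why the truncation matters for bank calculations. The 16 KiB bank size used here is an assumption made purely for illustration, not taken from the project; only the label value $2C0400 comes from the text above, and both function names are hypothetical.

```shell
# Illustration only: the bank size of 0x4000 (16 KiB) is an ASSUMED value.
bank_of() {
    echo $(( $1 / 0x4000 ))
}

# what remains of a label value in a 16-bit program counter
truncate16() {
    echo $(( $1 & 0xFFFF ))
}

full=0x2C0400
cut=$(truncate16 "$full")
echo "bank from full label:      $(bank_of "$full")"
echo "bank from truncated label: $(bank_of "$cut")"
```

With the full value the division lands in bank 176, while the truncated value (0x0400) lands in bank 0, so every expression deriving a bank number from such a label silently changes its result.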
But I was not happy about that last part, so I switched to the sjasmplus project and added the new --longptr option to allow this kind of project design (in new projects written from scratch for sjasmplus I would personally use the "virtual device" feature of sjasmplus, but that requires a different style of work with labels and banking, and in the end the way "the project" is written also makes a lot of sense and is usable with the new option too). With the new option available, I reverted the magic numbers from the last edit and verified the resulting binary one more time.
And that's it, the project can now be built with sjasmplus v1.14.3. While the total number of modified lines of the original source is not small, most of the changes are very minor, changing a few characters to cater for the small differences between the syntax of the two assemblers.
For those who prefer assembly source code examples instead of the text above, here is a quick summary of changes done in the source:
include "src1.asm"
=> directives in sjasmplus need whitespace ahead =>
=> or option --dirbol for assembling can be and was used to fix this.

    db 1,2,3,4,...,128,129,130,...,400
=> DB/DW in sjasmplus has limit of 128 elements per line =>
    db 1,2,3,4,...,128
    db 129,130,..,256
    db ...

include "File\Path\xYz.asm" ; while real file name is "file\path\Xyz.asm"
=> assembling in linux on case sensitive filesystem with regular slash =>
include "file/path/Xyz.asm"

x: Equ 1
    Ret
=> directives/instructions in sjasmplus are same-case =>
x: EQU 1
    ret

    label: ; label with whitespace at beginning of line
=> labels must start at beginning of line in sjasmplus =>
label:

data1: incbin "d1.bin" | data1len: equ $ - data1
=> only single label per line is possible in sjasmplus =>
data1: incbin "d1.bin"
data1len: equ $ - data1

Label: jp label
=> labels are case sensitive in sjasmplus =>
Label: jp Label

ret
=> instructions can't be at beginning of line =>
    ret

    ldi | ldi | ldi ldi ldi
=> sjasmplus does use colon for multiple instructions =>
    ldi : ldi : ldi : ldi : ldi

{ multi line
block comment
}
=> sjasmplus has these under /* */ (can also nest) =>
/* multi line
block comment
*/

    ld a,%0010 1100
=> sjasmplus support C++ way of grouping numeric literals =>
    ld a,%0010'1100 ; also ld hl,$3C'2A or ld de,12'345

.1: db 'x' ; creating full label "memoryblock.1"
=> in sjasmplus even local label can't start with digit =>
.n1: db 'x' ; creating full label "memoryblock.n1"

fname "output.rom"
=> similar directive "output" =>
output "output.rom" ; truncates file by default

x: sub a,7
=> by default this will assemble as "sub a : sub 7" =>
; adding global option --syntax=a to make this assemble as expected

    org $8000
    phase $2C000
LongPtrLabel: dw LongPtrLabel/$10 ; = $2C00 result expected
    dephase
=> added new option --longptr to sjasmplus project to support this.
Finally, I'm adding both bash scripts used for the automatic part of the conversion, but be warned: these are custom-tailored for the project, and every sed command was manually vetted to replace only the parts of the source I wanted to affect. Modifying them for another tniASM-syntax project is probably easier than writing them from scratch, but it will still require patience and labour, verifying each rule step by step and tuning it to avoid any unwanted results.
#!/usr/bin/env bash
echo -n -e "Searching for '*.asm' files, found: "
OLD_IFS=$IFS
IFS=$'\n'
# the patterns are quoted so the shell does not expand them before find sees them
# ASM_FILES=($(find * -iname '*.asm' -type f))
ASM_FILES=($(find * \( -iname '*.asm' -o -iname '*.gen' \) -type f))
IFS=$OLD_IFS
[[ -n $ASM_FILES ]] && echo ${#ASM_FILES[@]}
[[ -z $ASM_FILES ]] && echo "none" && exit 1
## go through all asm files and patch them
for f in "${ASM_FILES[@]}"; do
    echo -e "\033[96m$f\033[0m"
    cat "$f" | \
    sed -E '
# break "... | label: ..." into new lines, sjasmplus can have only one label per line
/^[^;]*\|/ s/([^;|]*?)\|\s*([_\.[:alpha:]][_\.[:alnum:]]*:[^;|]*(|;.*))/\1\n\2/g
# break "label1: equ ... label2: equ ..." into new lines (no separator!) (this command works only once per line :/)
s/^(\s*[._[:alpha:]][._?[:alnum:]]*:[^;\n]+?)(\b[._[:alpha:]][._?[:alnum:]]*:\s+equ\s+.+?)/\1\n\2/
# break similar case but instead of equ there are dw directives
s/^(\s*[._[:alpha:]][._?[:alnum:]]*:[^;\n]+?)(\b[._[:alpha:]][._?[:alnum:]]*:\s+dw\s+.+?)/\1\n\2/
# replace remaining "|" with colons
/^[^;]*\|/ s/\s?\|\s?/ : /g
# add colon between outi/ini/ldi unrolled blocks (which did not have the | between them)
s/\bldi\s+ldi(\s+|\b)/ldi : ldi : /g
s/\bini\s+ini(\s+|\b)/ini : ini : /g
s/\bouti\s+outi(\s+|\b)/outi : outi : /g
s/\bnop\s+nop(\s+|\b)/nop : nop : /g
s/\bhalt\s+halt(\s+|\b)/halt : halt : /g
# add colon to some data structures, customized rule for infolist.asm:
s/(\s+db\s+[-0-9][0-9]+\s*)(\ dw\s+[-0-9][0-9]+\s*)(\ db\s)/\1:\2:\3/
# customized rule for handleobjectmovement8.asm:
s/(\s+dw\s+[-0-9][0-9]+\s*\ )(\ db\s+)/\1:\2/
# customized rule for handleobjectmovement10.asm:
s/(\s+dw\s+\w+\+[0-9]{4}\s*)(\ db\s+)/\1 :\2/
# multiline comment {} to /* */ (must be at beginning of line => enough for this project)
/^\{/,/^\}/ {
s:^\{:/\*:
s:^\}:\*/:
}
# number "%0101 0101" must be formatted in C++ way with apostrophe as digit group delimiter
s/(%[01]{4})\ ([01]{4})\b/\1'\''\2/g
# (binary number formatting) customized rule for SeeSfxReplayer.asm:
s/%\ ([01])\ ([01])\s+([01])\ ([01])\ ([01])\s+([01])\ ([01])\ ([01])/%\1\2'\''\3\4\5'\''\6\7\8/g
# add space ahead of instructions starting at beginning of line
s/^(adc|add|and|bit|call|ccf|cp|cpd|cpdr|cpi|cpir|cpl|daa|dec|di|djnz|ei|ex|exx|halt|im|in|inc|ind|indr|ini|inir|jp|jr|ld|ldd|lddr|ldi|ldir|neg|nop|or|otdr|otir|out|outd|outi|pop|push|res|ret|reti|retn|rl|rla|rlc|rlca|rld|rr|rra|rrc|rrca|rrd|rst|sbc|scf|set|sla|sli|sra|srl|sub|xor)\b/ \1/I
# remove space ahead of labels
s/^\s+([._[:alpha:]][._?[:alnum:]]*:)/\1/
# rename local labels ".1" to ".9" to ".n#", as in sjasmplus even local label can not start with digit
s/^\.([0-9]):/\.n\1:/
s/^([^;]*)\.([0-9])\b/\1\.n\2/
    ' > "$f.new"
    mv "$f.new" "$f"
done # end of FOR (go through all asm files)
## try to assemble the result and create sorted list of labels
cd engine
sjasmplus --msg=err --dirbol --fullpath --sym=symbols.txt sjasm_preamble.asm Main.asm 2> ../errors_s1.txt
cd ..
sort -f -r engine/symbols.txt > symbols.txt && rm engine/symbols.txt || exit 1
[[ -z $1 ]] && echo "To run also labels fix, add any argument" && exit 0
# first stage assembling went reasonably well (symbols are available), try to fix labels
source labelsFix.bash
# and run second assembling
cd engine
sjasmplus --dirbol --fullpath sjasm_preamble.asm Main.asm 2> ../errors_s2.txt
cd ..
#!/usr/bin/env bash
# labelsFix.bash: sourced by the script above, it reuses its ASM_FILES array
# WARNING: this is by no means universal correct label fixer, the produced rules
# for sed were manually vetted (for example to see if the local labels substitution
# will not modify some other label due to naming collision).
# symbols are successfully sorted, try to parse through them and find duplicates
# (caused by case-sensitivity of sjasmplus), fix sources by search+replace
# start with a truly empty file, so the "-s" check after the loop can detect "no commands"
: > sedCommands.txt
while read -r line; do
    [[ -z $line ]] && break
    lab1=${line%:*}
    adr1=${line##*equ }
    [[ -z $lab1 || -z $adr1 ]] && break
    # nonlocal=${lab1%.*} (root part of local label, not needed actually)
    if [[ "${lab2^^}" == "${lab1^^}" ]]; then
        # same label, but different case, check which has non-zero EQU value, prefer that
        if [[ ${adr2^^} -lt ${adr1^^} ]]; then
            oldlab=$lab2; newlab=$lab1; newadr=$adr1
        else
            oldlab=$lab1; newlab=$lab2; newadr=$adr2
        fi
        echo "[$oldlab] -> [$newlab] ($newadr)" # debug output about substitution
        # add substitution rules for sed to the command list
        escapeOldLab=${oldlab//./\\.}
        escapeOldLab=${escapeOldLab//\?/\\?}
        echo "s/([-+(,[:space:]]|^)$escapeOldLab([-+),[:space:]]|$)/\\1${newlab}\\2/" >> sedCommands.txt
        oldLocal=${oldlab##*.}
        [[ "$oldLocal" != "$oldlab" && "$oldLocal" != "${newlab##*.}" ]] \
            && echo "s/\\.$oldLocal([-+),[:space:]]|$)/.${newlab##*.}\\1/" >> sedCommands.txt
        lab2=$newlab
        adr2=$newadr
    else
        lab2=$lab1
        adr2=$adr1
    fi
done < symbols.txt
[[ ! -s sedCommands.txt ]] && echo "No commands for sed?" && exit 1
## go through all asm files and patch them
echo -e "\033[92mPatching \033[97mlabels\033[0m"
for f in "${ASM_FILES[@]}"; do
    echo -e "\033[92m$f\033[0m"
    cat "$f" | \
    sed -E -f sedCommands.txt > "$f.new"
    mv "$f.new" "$f"
done # end of FOR (go through all asm files)
And the "preamble" file used to define the missing directives of tniASM and modify the default sjasmplus syntax to be more compatible with tniASM:
; tniASM compatibility macros defining unrecognized directives: rb, rw, fname
rb MACRO count?
    ds count?
    ; in tniASM the `rb/rw` does not emit bytes to output, so rewind the output
    FPOS -(count?)
    ENDM
rw MACRO count?
    ds 2 * count?
    ; in tniASM the `rb/rw` does not emit bytes to output, so rewind the output
    FPOS -(2 * count?)
    ENDM
fname MACRO name?
    DEFINE __CURRENT_OUTPUT_NAME__ name?
    OUTPUT name?,t ; truncate the file first
    OUTPUT name?,r ; reopen it to allow also position seeks (for rb/rw macros)
    ENDM
; switch multiarg delimiter to ",," to produce correct opcode for `sub a,7` ("a" option)
; treat "wholesome" round parentheses as memory access ("b" option)
; warn about any fake instruction ("f")
    OPT --syntax=abf
; command line to build migrated sources with sjasmplus (inside the "engine" folder):
; sjasmplus --sym=theproject.sym --dirbol --fullpath --longptr ../sjasm_preamble.asm Main.asm