INSNS

This work is licensed under the Creative Commons Attribution
3.0 Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/3.0/ or send a letter
to Creative Commons, 171 Second Street, Suite 300, San Francisco,
California, 94105, USA.


Instructions

0. Conventions
1. MOV
2. Special regs
 2.1. MOV from $c
 2.2. MOV to $c
 2.3. MOV from $a
 2.4. SHL to $a
 2.5. ADD from $a to $a
 2.6. MOV from sreg
3. Integer instructions
 3.1. Integer ADD family
 3.2. Integer short MUL
 3.3. Integer 24-bit MUL
 3.4. Integer MUL-ADD
 3.5. Integer SAD
 3.6. Integer MIN/MAX
 3.7. Integer SET
4. Bit instructions
 4.1. Bit operations
 4.2. Bit shifts
5. TBD

0. Conventions

  S(x): 31th bit of x for 32-bit x, 15th for 16-bit x.
  SEX(x): sign-extension of x
  ZEX(x): zero-extension of x

1. Normal MOV

  [lanemask] mov b32/b16 DST SRC

  lanemask assumed 0xf for short and immediate versions.

	if (lanemask & 1 << (laneid & 3)) DST = SRC;

  Short:	0x10000000 base opcode
  		0x00008000 0: b16, 1: b32
		operands: S*DST, S*SRC1/S*SHARED

  Imm:		0x10000000 base opcode
  		0x00008000 0: b16, 1: b32
		operands: L*DST, IMM

  Long:		0x10000000 0x00000000 base opcode
  		0x00000000 0x04000000 0: b16, 1: b32
  		0x00000000 0x0003c000 lanemask
		operands: LL*DST, L*SRC1/L*SHARED

2.1. MOV from $c

  mov DST COND

  DST is 32-bit $r.

	DST = COND;

  Long:		0x00000000 0x20000000 base opcode
  		operands: LDST, COND

2.2. MOV to $c

  mov CDST SRC

  SRC is 32-bit $r. Yes, the 0x40 $c write enable flag in second word is
  actually ignored.

	CDST = SRC;

  Long:		0x00000000 0xa0000000 base opcode
  		operands: CDST, LSRC1

2.3. MOV from $a

  mov DST AREG

  DST is 32-bit $r. Setting flag normally used for autoincrement mode doesn't
  work, but still causes crash when using non-writable $a's.

	DST = AREG;

  Long:		0x00000000 0x40000000 base opcode
  		0x02000000 0x00000000 crashy flag
  		operands: LDST, AREG

2.4. SHL to $a

  shl ADST SRC SHCNT

  SRC is 32-bit $r.

	ADST = SRC << SHCNT;

  Long:		0x00000000 0xc0000000 base opcode
  		operands: ADST, LSRC1/LSHARED, HSHCNT

2.5. ADD from $a to $a

  add ADST AREG OFFS

  Like mov from $a, setting flag normally used for autoincrement mode doesn't
  work, but still causes crash when using non-writable $a's.

	ADST = AREG + OFFS;

  Long:		0xd0000000 0x20000000 base opcode
  		0x02000000 0x00000000 crashy flag
  		operands: ADST, AREG, OFFS

2.6. MOV from sreg

  mov DST physid	S=0
  mov DST clock		S=1
  mov DST sreg2		S=2
  mov DST sreg3		S=3
  mov DST pm0		S=4
  mov DST pm1		S=5
  mov DST pm2		S=6
  mov DST pm3		S=7

  DST is 32-bit $r.

	DST = SREG;

  Long:		0x00000000 0x60000000 base opcode
  		0x00000000 0x0001c000 S
  		operands: LDST

3.1. Integer ADD family

  add [sat] b32/b16 [CDST] DST SRC1 SRC2		O2=0, O1=0
  sub [sat] b32/b16 [CDST] DST SRC1 SRC2		O2=0, O1=1
  subr [sat] b32/b16 [CDST] DST SRC1 SRC2		O2=1, O1=0
  addc [sat] b32/b16 [CDST] DST SRC1 SRC2 COND		O2=1, O1=1

  All operands are 32-bit or 16-bit according to size specifier.

	b16/b32 s1, s2;
	bool c;
	switch (OP) {
		case add: s1 = SRC1, s2 = SRC2, c = 0; break;
		case sub: s1 = SRC1, s2 = ~SRC2, c = 1; break;
		case subr: s1 = ~SRC1, s2 = SRC2, c = 1; break;
		case addc: s1 = SRC1, s2 = SRC2, c = COND.C; break;
	}
	res = s1+s2+c;	// infinite precision
	CDST.C = res >> (b32 ? 32 : 16);
	res = res & (b32 ? 0xffffffff : 0xffff);
	CDST.O = (S(s1) == S(s2)) && (S(s1) != S(res));
	if (sat && CDST.O)
		if (S(res)) res = (b32 ? 0x7fffffff : 0x7fff);
		else res = (b32 ? 0x80000000 : 0x8000);
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Short/imm:	0x20000000 base opcode
		0x10000000 O2 bit
		0x00400000 O1 bit
		0x00008000 0: b16, 1: b32
		0x00000100 sat flag
		operands: S*DST, S*SRC1/S*SHARED, S*SRC2/S*CONST/IMM, $c0

  Long:		0x20000000 0x00000000 base opcode
		0x10000000 0x00000000 O2 bit
		0x00400000 0x00000000 O1 bit
		0x00000000 0x04000000 0: b16, 1: b32
		0x00000000 0x08000000 sat flag
		operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC3/L*CONST3, COND

3.2. Integer short MUL

  mul [CDST] DST u16/s16 SRC1 u16/s16 SRC2

  DST is 32-bit, SRC1 and SRC2 are 16-bit.

	b32 s1, s2;
	if (src1_signed)
		s1 = SEX(SRC1);
	else
		s1 = ZEX(SRC1);
	if (src2_signed)
		s2 = SEX(SRC2);
	else
		s2 = ZEX(SRC2);
	b32 res = s1*s2;	// modulo 2^32
	CDST.O = 0;
	CDST.C = 0;
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Short/imm:	0x40000000 base opcode
  		0x00008000 src1 is signed
  		0x00000100 src2 is signed
  		operands: SDST, SHSRC/SHSHARED, SHSRC2/SHCONST/IMM

  Long:		0x40000000 0x00000000 base opcode
  		0x00000000 0x00008000 src1 is signed
  		0x00000000 0x00004000 src2 is signed
  		operands: MCDST, LLDST, LHSRC1/LHSHARED, LHSRC2/LHCONST2

3.3. Integer 24-bit MUL

  mul [CDST] DST [high] u24/s24 SRC1 SRC2

  All operands are 32-bit.

	b48 s1, s2;
	if (signed) {
		s1 = SEX((b24)SRC1);
		s2 = SEX((b24)SRC2);
	} else {
		s1 = ZEX((b24)SRC1);
		s2 = ZEX((b24)SRC2);
	}
	b48 m = s1*s2;	// modulo 2^48
	b32 res = (high ? m >> 16 : m & 0xffffffff);
	CDST.O = 0;
	CDST.C = 0;
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Short/imm:	0x40000000 base opcode
  		0x00008000 src are signed
  		0x00000100 high
  		operands: SDST, SSRC/SSHARED, SSRC2/SCONST/IMM

  Long:		0x40000000 0x00000000 base opcode
  		0x00000000 0x00008000 src are signed
  		0x00000000 0x00004000 high
  		operands: MCDST, LLDST, LSRC1/LSHARED, LSRC2/LCONST2

3.4. Integer MUL-ADD

  addop [CDST] DST mul u16 SRC1 SRC2 SRC3		O1=0 O2=000	S2=0 S1=0
  addop [CDST] DST mul s16 SRC1 SRC2 SRC3		O1=0 O2=001	S2=0 S1=1
  addop sat [CDST] DST mul s16 SRC1 SRC2 SRC3		O1=0 O2=010	S2=1 S1=0
  addop [CDST] DST mul u24 SRC1 SRC2 SRC3		O1=0 O2=011	S2=1 S1=1
  addop [CDST] DST mul s24 SRC1 SRC2 SRC3		O1=0 O2=100
  addop sat [CDST] DST mul s24 SRC1 SRC2 SRC3		O1=0 O2=101
  addop [CDST] DST mul high u24 SRC1 SRC2 SRC3	O1=0 O2=110
  addop [CDST] DST mul high s24 SRC1 SRC2 SRC3	O1=0 O2=111
  addop sat [CDST] DST mul high s24 SRC1 SRC2 SRC3	O1=1 O2=000

  addop is one of:

  add	O3=00	S4=0 S3=0
  sub	O3=01	S4=0 S3=1
  subr	O3=10	S4=1 S3=0
  addc	O3=11	S4=1 S3=1

  If addop is addc, insn also takes an additional COND parameter. DST and
  SRC3 are always 32-bit, SRC1 and SRC2 are 16-bit for u16/s16 variants,
  32-bit for u24/s24 variants. Only a few of the variants are encodable as
  short/immediate, and they're restricted to DST=SRC3.

	if (u24 || s24) {
		b48 s1, s2;
		if (s24) {
			s1 = SEX((b24)SRC1);
			s2 = SEX((b24)SRC2);
		} else {
			s1 = ZEX((b24)SRC1);
			s2 = ZEX((b24)SRC2);
		}
		b48 m = s1*s2;	// modulo 2^48
		b32 mres = (high ? m >> 16 : m & 0xffffffff);
	} else {
		b32 s1, s2;
		if (s16) {
			s1 = SEX(SRC1);
			s2 = SEX(SRC2);
		} else {
			s1 = ZEX(SRC1);
			s2 = ZEX(SRC2);
		}
		b32 mres = s1*s2;	// modulo 2^32
	}
	b32 s1, s2;
	bool c;
	switch (OP) {
		case add: s1 = mres, s2 = SRC3, c = 0; break;
		case sub: s1 = mres, s2 = ~SRC3, c = 1; break;
		case subr: s1 = ~mres, s2 = SRC3, c = 1; break;
		case addc: s1 = mres, s2 = SRC3, c = COND.C; break;
	}
	res = s1+s2+c;	// infinite precision
	CDST.C = res >> 32;
	res = res & 0xffffffff;
	CDST.O = (S(s1) == S(s2)) && (S(s1) != S(res));
	if (sat && CDST.O)
		if (S(res)) res = 0x7fffffff;
		else res = 0x80000000;
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Short/imm:	0x60000000 base opcode
  		0x00000100 S1
		0x00008000 S2
		0x00400000 S3
		0x10000000 S4
		operands: SDST, S*SRC/S*SHARED, S*SRC2/S*CONST/IMM, SDST, $c0

  Long:		0x60000000 0x00000000 base opcode
  		0x10000000 0x00000000 O1
		0x00000000 0xe0000000 O2
		0x00000000 0x0c000000 O3
		operands: MCDST, LLDST, L*SRC1/L*SHARED, L*SRC2/L*CONST2, L*SRC3/L*CONST3, COND

3.5. Integer SAD

  sad [CDST] DST u16/s16/u32/s32 SRC1 SRC2 SRC3

  Short variant is restricted to DST same as SRC3. All operands are 32-bit or
  16-bit according to size specifier.

	int s1, s2; // infinite precision
	if (signed) {
		s1 = SEX(SRC1);
		s2 = SEX(SRC2);
	} else {
		s1 = ZEX(SRC1);
		s2 = ZEX(SRC2);
	}
	b32 mres = abs(s1-s2);	// modulo 2^32
	res = mres+s3;		// infinite precision
	CDST.C = res >> (b32 ? 32 : 16);
	res = res & (b32 ? 0xffffffff : 0xffff);
	CDST.O = (S(mres) == S(s3)) && (S(mres) != S(res));
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Short:	0x50000000 base opcode
  		0x00008000 0: b16 1: b32
		0x00000100 src are signed
		operands: DST, SDST, S*SRC/S*SHARED, S*SRC2/S*CONST, SDST

  Long:		0x50000000 0x00000000 base opcode
		0x00000000 0x04000000 0: b16, 1: b32
  		0x00000000 0x08000000 src sre signed
		operands: MCDST, LLDST, L*SRC1/L*SHARED, L*SRC2/L*CONST2, L*SRC3/L*CONST3

3.6. Integer MIN/MAX

  min u16/u32/s16/s32 [CDST] DST SRC1 SRC2
  max u16/u32/s16/s32 [CDST] DST SRC1 SRC2

  All operands are 32-bit or 16-bit according to size specifier.

	if (SRC1 < SRC2) { // signed comparison for s16/s32, unsigned for u16/u32.
		res = (min ? SRC1 : SRC2);
	} else {
		res = (min ? SRC2 : SRC1);
	}
	CDST.O = 0;
	CDST.C = 0;
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Long:		0x30000000 0x80000000 base opcode
  		0x00000000 0x20000000 0: max, 1: min
		0x00000000 0x08000000 0: u16/u32, 1: s16/s32
		0x00000000 0x04000000 0: b16, 1: b32
		operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2

3.7 Integer SET

  set [CDST] DST cond u16/s16/u32/s32 SRC1 SRC2

  cond can be any subset of {l, g, e}.

  All operands are 32-bit or 16-bit according to size specifier.
 
	int s1, s2; // infinite precision
	if (signed) {
		s1 = SEX(SRC1);
		s2 = SEX(SRC2);
	} else {
		s1 = ZEX(SRC1);
		s2 = ZEX(SRC2);
	}
	bool c;
	if (s1 < s2)
		c = cond.l;
	else if (s1 == s2)
		c = cond.e;
	else /* s1 > s2 */
		c = cond.g;
	if (c) {
		res = (b32?0xffffffff:0xffff);
	} else {
		res = 0;
	}
	CDST.O = 0;
	CDST.C = 0;
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Long:		0x30000000 0x60000000 base opcode
		0x00000000 0x08000000 0: u16/u32, 1: s16/s32
		0x00000000 0x04000000 0: b16, 1: b32
		0x00000000 0x00010000 cond.g
		0x00000000 0x00008000 cond.e
		0x00000000 0x00004000 cond.l
		operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2

4.1. Bit operations

  and b32/b16 [CDST] DST [not] SRC1 [not] SRC2		O2=0, O1=0
  or b32/b16 [CDST] DST [not] SRC1 [not] SRC2		O2=0, O1=1
  xor b32/b16 [CDST] DST [not] SRC1 [not] SRC2		O2=1, O1=0
  mov2 b32/b16 [CDST] DST [not] SRC1 [not] SRC2		O2=1, O1=1

  Immediate forms only allows 32-bit operands, and cannot negate second op.

	s1 = (not1 ? ~SRC1 : SRC1);
	s2 = (not2 ? ~SRC2 : SRC2);
	switch (OP) {
		case and: res = s1 & s2; break;
		case or: res = s1 | s2; break;
		case xor: res = s1 ^ s2; break;
		case mov2: res = s2; break;
	}
	CDST.O = 0;
	CDST.C = 0;
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Imm:		0xd0000000 base opcode
		0x00400000 not1
		0x00008000 O2 bit
		0x00000100 O1 bit
		operands: SDST, SSRC/SSHARED, IMM
		assumed: not2=0 and b32.

  Long:		0xd0000000 0x00000000 base opcode
		0x00000000 0x04000000 0: b16, 1: b32
		0x00000000 0x00020000 not2
		0x00000000 0x00010000 not1
		0x00000000 0x00008000 O2 bit
		0x00000000 0x00004000 O1 bit
		operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2

4.2. Bit shifts

  shl b16/b32 [CDST] DST SRC1 SRC2
  shl b16/b32 [CDST] DST SRC1 SHCNT
  shr u16/u32 [CDST] DST SRC1 SRC2
  shr u16/u32 [CDST] DST SRC1 SHCNT
  shr s16/s32 [CDST] DST SRC1 SRC2
  shr s16/s32 [CDST] DST SRC1 SHCNT

    All operands 16/32-bit according to size specifier, except SHCNT. Shift
    counts are always treated as unsigned, passing negative value to shl
    doesn't get you a shr.

    	int size = (b32 ? 32 : 16);
	if (shl) {
		res = SRC1 << SRC2; // infinite precision, shift count doesn't wrap.
		if (SRC2 < size) { // yes, <. So if you shift 1 left by 32 bits, you DON'T get CDST.C set. but shift 2 left by 31 bits, and it gets set just fine.
			CDST.C = (res >> size) & 1; // basically, the bit that got shifted out.
		} else {
			CDST.C = 0;
		}
		res = res & (b32 ? 0xffffffff : 0xffff);
	} else {
		res = SRC1 >> SRC2; // infinite precision, shift count doesn't wrap.
		if (signed && S(SRC1)) {
			if (SRC2 < size)
				res |= (1<<size)-(1<<(size-SRC2)); // fill out the upper bits with 1's.
			else
				res |= (1<<size)-1;
		}
		if (SRC2 < size && SRC2 > 0) {
			CDST.C = (SRC1 >> (SRC2-1)) & 1;
		} else {
			CDST.C = 0;
		}
	}
	if (SRC2 == 1) {
		CDST.O = (S(SRC1) != S(res));
	} else {
		CDST.O = 0;
	}
	CDST.S = S(res);
	CDST.Z = res == 0;
	DST = res;

  Long:		0x30000000 0xc0000000 base opcode
  		0x00000000 0x20000000 0: shl, 1: shr
		0x00000000 0x08000000 0: u16/u32, 1: s16/s32 [shr only]
		0x00000000 0x04000000 0: b16, 1: b32
		0x00000000 0x00010000 0: use SRC2, 1: use SHCNT
		operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2/SHCNT

5. TBD

  interp [cent] [flat] DST v[] [SRC]

    Gets interpolated FP input, optionally multiplying by a given value

  rcp f32 DST SRC
  rsqrt f32 DST SRC
  lg2 f32 DST SRC
  sin f32 DST SRC
  cos f32 DST SRC
  ex2 f32 DST SRC

    Computes a transcendential function of the argument. rcp is 1/x, rsqrt is
    1/sqrt(x). sin, cos, ex2 need arguments preprocessed by appropriate pre
    insn. rcp, rsqrt, lg2 take a float argument directly.

  presin f32 DST SRC
  preex2 f32 DST SRC

    Preprocesses a float argument for use in subsequent sin/cos or ex2
    operation, respectively.

  mov lock CDST DST s[]

    Tries to lock a word of s[] memory and load a word from it. CDST tells
    you if it was successfully locked+loaded, or no. A successfully locked
    word can't be locked by any other thread until it is unlocked.

  mov unlock s[] SRC

    Stores a word to previously-locked s[] word and unlocks it.

  PREDICATE vote any/all CDST

    This instruction doesn't use the predicate field for conditional execution,
    abusing it instead as an input argument. vote any sets CDST to true iff the
    input predicate evaluated to true in any of the warp's active threads.
    vote all sets it to true iff the predicate evaluated to true in all acive
    threads of the current warp.

  set [CDST] DST <cmpop> f32/f64 SRC1 SRC2

    Does given comparison operation on SRC1 and SRC2. DST is set to 0xffffffff
    if comparison evaluats true, 0 if it evaluates false. if used, CDST.SZ are
    set according to DST.

  min f32/f64 DST SRC1 SRC2
  max f32/f64 DST SRC1 SRC2

    Sets DST to the smaller/larger of two SRC1 operands. If one operand is NaN,
    DST is set to the non-NaN operand. If both are NaN, DST is set to NaN.

  cvt <integer dst> <integer src>
  cvt <integer rounding modifier> <integer dst> <float src>
  cvt <rounding modifier> <float dst> <integer src>
  cvt <rounding modifier> <float dst> <float src>
  cvt <integer rounding modifier> <float dst> <float src>

    Converts between formats. For integer destinations, always clamps result
    to target type range.

  add [sat] rn/rz f32 DST SRC1 SRC2

    Adds two floating point numbers together.

  mul [sat] rn/rz f32 DST SRC1 SRC2

    Multiplies two floating point numbers together

  slct b32 DST SRC1 SRC2 f32 SRC3

    Sets DST to SRC1 if SRC3 is positive or 0, to SRC2 if SRC3 negative or NaN.

  quadop f32 <op1> <op2> <op3> <op4> DST <srclane> SRC1 SRC2

    Intra-quad information exchange instruction. Mad as a hatter.
    First, SRC1 is taken from the given lane in current quad. Then
    op<currentlanenumber> is executed on it and SRC2, results get
    written to DST. ops can be add [SRC1+SRC2], sub [SRC1-SRC2],
    subr [SRC2-SRC1], mov2 [SRC2]. srclane can be at least l0, l1,
    l2, l3, and these work everywhere. If you're running in FP, looks
    like you can also use dox [use current lane number ^ 1] and doy
    [use current lane number ^ 2], but using these elsewhere results
    in always getting 0 as the result...

  add f32 DST mul SRC1 SRC2 SRC3

    A multiply-add instruction. With intermediate rounding. Nothing
    interesting. DST = SRC1 * SRC2 + SRC3;

  fma f64 DST SRC1 SRC2 SRC3

    Fused multiply-add, with no intermediate rounding.

  texauto [deriv] live/all <texargs>

    Does a texture fetch. Inputs are: x, y, z, array index, dref [skip all
    that your current sampler setup doesn't use]. x, y, z, dref are floats,
    array index is integer. If running in FP or the deriv flag is on,
    derivatives are computed based on coordinates in all threads of current
    quad. Otherwise, derivatives are assumed 0. For FP, if the live flag
    is on, the tex instruction is only run for fragments that are going to
    be actually written to the render target, ie. for ones that are inside
    the rendered primitive and haven't been discarded yet. all executes
    the tex even for non-visible fragments, which is needed if they're going
    to be used for further derivatives, explicit or implicit.

  texbias [deriv] live/all <texargs>

    Same as texauto, except takes an additional [last] float input specifying
    the LOD bias to add. Note that bias needs to be the same for all threads
    in the current quad executing the texbias insn.

  texlod live/all <texargs>

    Does a texture fetch with given coordinates and LOD. Inputs are like
    texbias, except you have explicit LOD instead of the bias. Just like
    in texbias, the LOD should be the same for all threads involved.

  texsize live/all <texargs>

    Gives you (width, height, depth, mipmap level count) in output, takes
    integer LOD parameter as its only input.

  texfetch live/all <texargs>

    A single-texel fetch. The inputs are x, y, z, index, lod, and are all
    integer.

  emit

    GP-only instruction that emits current contents of $o registers as the
    next vertex in the output primitive and clears $o for some reason.

  restart

    GP-only instruction that finishes current output primitive and starts
    a new one.

  bra <code target>

    Branches to the given place in the code. If only some subset of threads
    in the current warp executes it, one of the paths is chosen as the active
    one, and the other is suspended until the active path exits or rejoins.

  call <code target>

    Pushes address of the next insn onto the stack and branches to given place.
    Cannot be predicated.

  ret

    Returns from a called function. If there's some not-yet-returned divergent
    path on the current stack level, switches to it. Otherwise pops off the
    entry from stack, rejoins all the paths to the pre-call state, and
    continues execution from the return address on stack. Accepts predicates.

  breakaddr <code target>

    Like call, except doesn't branch anywhere, uses given operand as the
    return address, and pushes a different type of entry onto the stack.

  break
  
    Like ret, except accepts breakaddr's stack entry type, not call's.

  quadon

    Temporarily enables all threads in the current quad, even if they were
    disabled before [by diverging, exitting, or not getting started at all].
    Nesting this is probably a bad idea, and so is using any non-quadpop
    control insns while this is active. For diverged threads, the saved PC
    is unaffected by this temporal enabling.

  quadpop

    Undoes a previous quadon command.

  bar sync <barrier number>

    Waits until all threads in the block arrive at the barrier, then continues
    execution... probably... somehow...

  trap

    Causes an error, killing the program instantly.

  joinat <code target>

    The arugment is address of a future join instruction and gets pushed
    onto the stack, together with a mask of currently active threads, for
    future rejoining.

  brkpt
  
    Doesn't seem to do anything, probably generates a breakpoint when enabled
    somewhere in PGRAPH, somehow.
  
  exit

    Actually, not a separate instruction, just a modifier available on all
    long insns. Finishes thread's execution after the current insn ends.

  join

    Also a modifier. Switches to other diverged execution paths on the same
    stack level, until they've all reached the join point, then pops off the
    entry and continues execution with a rejoined path.