8. Serializer
8.1. Overview
The "text" part of the txtvm
comes into play with the
serialization format. It is designed to only utilize
printable ascii characters.
Instead of using 8-bit bytes, txtvm uses 6-bit syllables, which track with the base64 encoding table: A-Z, a-z, 0-9, and + and /.
There are 95 printable characters, so this leaves 31 characters remaining. These characters are reserved for core opcode operations.
Those characters are:!"#$%&'()*,-.:;<=>?@[ /]^_`{|}~
space will be ignored, so really it's 30 characters.
Integer values can be 6, 12, 18, or 24 bits long, called int6, int12, int18, and int24. They are stored in little-endian.
Each integer operation begins with the command character, then reads subsequent characters (assumbed to be 6-bit syllables).
This means:
int6 (') is a 2 character operation.
int12 (#) is 3 characters.
int18 ($) is 4 characters.
int24 (%) is 5 characters.
should also be uint8, uint12, uint18, and uint24 as '&', -, '.', ':'.
long values are 32-bit values, and are 9 characters long. They can be signed or unsigned. represented as '?'.
An execute operation will pop an integer off the stack and use that as an look-up id for what word to execute.
Represented as '!'
Text is represented as '"'. It pops a size of the stack, then reads that many characters as a literal string. Only printable characters work here. Otherwise, blobs will have to be used.
A base64 blob is represented as '*'. It pop the size from the stack, then read that many bytes as literally base64 encoding.
Later on, ascii85 encoding may be used as well. The size of this string is expected to be a multiple of 5. Represented as '@'.
8.2. Opcodes
8.2.1. macros
Every opcode is defined as a macro with the corresponding ascii character. This helps makes code more readable.
<<type_macros>>
8.2.2. int6
The int6
opcode uses the character '
.
#define TYPE_INT6 '\''
It takes in 1 additional byte, which it then turns into a 6-bit signed integer value.
Write an int6 with txtvm_write_int6
.
int txtvm_write_int6(txtvm *vm, int x);
Because it's signed, some biasing happens to make it positive.
6-bit values have a range [-31, 31], so add 32 to make it 0-63.
int txtvm_write_int6(txtvm *vm, int x)
{
char b[2];
int nbytes;
x = (x + 32) & 0x3f; /* bias and mask */
b[0] = TYPE_INT6;
b[1] = base64_encode[x];
nbytes = txtvm_io_write(&vm->io, 2, b);
if (nbytes < 0) return TXTVM_NOT_OK;
else if (nbytes < 2) return TXTVM_INCOMPLETE_WRITE;
return TXTVM_OK;
}
Read an int6 with txtvm_read_int6
. This assumes the type
byte has already been written.
int txtvm_read_int6(txtvm *vm, int *x);
When the byte is read, the bias must be factored in to convert it back to a signed value.
int txtvm_read_int6(txtvm *vm, int *x)
{
int n;
char b;
n = txtvm_io_read(&vm->io, 1, &b);
if (n != 1) return TXTVM_INCOMPLETE_READ;
*x = base64_decode[b & 0x7f] - 32;
return TXTVM_OK;
}
8.2.3. int12
The int12 opcode has uses character #
.
#define TYPE_INT12 '#'
It takes in 2 additional bytes, which it then turns into a 12-bit integer value. It's stored as little endian, so the the byte order is LSB followed by MSB.
Write an int12 with txtvm_write_int12
.
int txtvm_write_int12(txtvm *vm, int x);
Because it's signed, bias with 2048 before encoding.
int txtvm_write_int12(txtvm *vm, int x)
{
char b[3];
int nbytes;
x += 2048; /* bias */
b[0] = TYPE_INT12;
b[1] = base64_encode[x & 0x3f];
b[2] = base64_encode[(x >> 6) & 0x3f];
nbytes = txtvm_io_write(&vm->io, 3, b);
if (nbytes < 0) return TXTVM_NOT_OK;
else if (nbytes < 3) return TXTVM_INCOMPLETE_WRITE;
return TXTVM_OK;
}
Read an int12 with txtvm_read_int12
.
int txtvm_read_int12(txtvm *vm, int *x);
Because it is signed, remember to remove the bias!
int txtvm_read_int12(txtvm *vm, int *x)
{
int n;
char b[2];
n = txtvm_io_read(&vm->io, 2, b);
if (n < 0) return TXTVM_NOT_OK;
else if (n != 2) return TXTVM_INCOMPLETE_WRITE;
*x = base64_decode[b[0] & 0x7f];
*x |= (base64_decode[b[1] & 0x7f]) << 6;
*x -= 2048;
return TXTVM_OK;
}
8.2.4. call
The call opcode uses the character !
.
#define TYPE_CALL '!'
No additional bytes are read. It runs under the assumption that a positive integer value is at the top of the stack. The call command, when run, will pop that value off, and then look and run up the corresponding entry in the dictinonary
Write a call with txtvm_write_call
.
int txtvm_write_call(txtvm *vm);
int txtvm_write_call(txtvm *vm)
{
char b;
int nbytes;
b = TYPE_CALL;
nbytes = txtvm_io_write(&vm->io, 1, &b);
if (nbytes < 0) return TXTVM_NOT_OK;
else if (nbytes != 1) return TXTVM_INCOMPLETE_WRITE;
return TXTVM_OK;
}
8.3. lookup tables
The following table is used to convert values in range 0-63 to ascii characters that map to the base64 character set: A..Z, a..z, 0..9, and + and /.
static const char * base64_encode =
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz"
"0123456789+/";
Decoding is a little bit more tricky since the values aren't exactly lined up. To keep things consistent, we're going to use a lookup table here as well, at the cost of having a bit more redundant memory. Those parts will be padded with zeros.
static const char base64_decode[128] = {
/* pad with 43 zeros to first ASCII character + */
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0,
/* + and /, spaced by 3 zeros */
62, 0, 0, 0, 63,
/* 0-9 */
52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
/* 7 zeros between 9 and A */
0, 0, 0, 0, 0, 0, 0,
/* A-Z */
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25,
/* 6 zeros between 9 and A */
0, 0, 0, 0, 0, 0,
/* a-z */
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
/* padding: 5 0's aligns to 128 bytes total */
0, 0, 0, 0, 0,
};
prev | home | next