3. An Outline of What Worgle does

This section gives a broad overview of how Orgle (and Worgle) work. Orgle is a bootstrap program written in C, used to generate C code for Worgle (this program here). At the highest level, the two programs share the same basic program structure.

3.1. Initialization

3.1.1. Initialize worgle data

The Worgle data is initialized before any files are loaded.

<<local_variables>>=
worgle_d worg;
<<initialization>>=
worgle_init(&worg);

3.1.2. Get and set filename

The filename is currently acquired from the command line, so the program must check that the right number of arguments has been supplied. If not, it returns an error.

<<local_variables>>=
char *filename;
<<initialization>>=
filename = NULL;
if(argc < 2) {
    fprintf(stderr, "Usage: %s filename.org\n", argv[0]);
    return 1;
}
<<parse_cli_args>>
<<check_filename>>

Check the filename. If the filename has not been set by the command line, return an error.

<<check_filename>>=
if(filename == NULL) {
    fprintf(stderr, "No filename specified\n");
    return 1;
}

3.1.3. Initialize return codes

The main return code determines the overall state of the program.

<<local_variables>>=
int rc;

By default, it is set to be okay, which is 0 on POSIX systems.

<<initialization>>=
rc = 0;

3.2. Load file into memory

The first thing the program will do is load the file.

While most parsers tend to parse things on a line by line basis via a file stream, this parser will load the entire file into memory. This is done due to the textual nature of the program. It is much easier to simply allocate everything in one big block and reference chunks of it than to allocate smaller chunks as you go.

3.2.1. Loadfile function

<<loading>>=
for(i = 0; i < worg.nbuffers; i++) {
    /* loadfile returns TRUE (1) on success, FALSE (0) on failure */
    if(!loadfile(&worg, i)) {
        rc = 1;
        goto cleanup;
    }
}

A file is loaded into a text buffer via the function loadfile. In the worg startup sequence, the buffer list has been preallocated with the filename after parsing the command line arguments. It is in this stage that the memory block is allocated and the file loaded into it. The file is loaded into the buffer located at index position file.

On success, the function will return TRUE (1). On failure, FALSE (0).

<<static_function_declarations>>=
static int loadfile(worgle_d *worg, int file);
<<functions>>=
static int loadfile(worgle_d *worg, int file)
{
<<loadfile_localvars>>
<<loadfile>>
    return 1;
}

3.2.2. Open file

The file is opened using a local file handle, fp.

<<loadfile_localvars>>=
FILE *fp;
char *filename;
worgle_textbuf *txt;
<<loadfile>>=
txt = &worg->buffers[file];
filename = txt->filename.str;
fp = fopen(filename, "r");

if(fp == NULL) {
    fprintf(stderr, "Could not find file %s\n", filename);
    return 0; /* FALSE: the file could not be opened */
}

3.2.3. Get file size

The size is acquired by going to the end of the file and getting the current file position.

<<loadfile_localvars>>=
size_t size;
<<loadfile>>=
fseek(fp, 0, SEEK_END);
size = ftell(fp);

3.2.4. Allocate memory, read, and close

Memory is allocated in a local buffer variable via calloc. The buffer is then stored inside of the worg struct.

<<loadfile_localvars>>=
char *buf;
<<loadfile>>=
buf = calloc(1, size);
worgle_textbuf_init(&worg->buffers[file], buf, size);

The file is rewound back to the beginning and then read into the buffer. The file is no longer needed at this point, so it is closed.

<<loadfile>>=
fseek(fp, 0, SEEK_SET);
fread(buf, size, 1, fp);
fclose(fp);
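
As a point of comparison, the same load-everything approach can be sketched as one standalone function. This is only an illustration under assumptions: the name sketch_slurp is made up for this example, it takes an already opened stream rather than a worgle_d, and unlike the chunks above it checks every return value and allocates one extra byte so the block is NUL-terminated.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Worgle: reads an entire open stream
 * into one heap-allocated block. Returns NULL on any failure. */
char *sketch_slurp(FILE *fp, size_t *size)
{
    long end;
    char *buf;

    assert(fp != NULL && size != NULL);
    *size = 0;

    /* Seek to the end; the file position there is the size in bytes. */
    if (fseek(fp, 0, SEEK_END) != 0) return NULL;
    end = ftell(fp);
    if (end < 0) return NULL;

    /* Rewind before reading. */
    if (fseek(fp, 0, SEEK_SET) != 0) return NULL;

    /* One zeroed block; the extra byte keeps it NUL-terminated. */
    buf = calloc(1, (size_t)end + 1);
    if (buf == NULL) return NULL;

    if (end > 0 && fread(buf, 1, (size_t)end, fp) != (size_t)end) {
        free(buf);
        return NULL;
    }

    *size = (size_t)end;
    return buf;
}
```

The caller owns the returned block and releases it with free once parsing is finished.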

3.3. Parsing

3.3.1. Top Level Parsing Function

The second phase of the program is the parsing stage.

The parsing stage will parse files line-by-line. The program will find a line by skimming through the block up to a line break character, then pass that off to be parsed. Line by line, the parser will read the program and produce a structure of the tangled code in memory.

Parsing is done via the function parse_file.

<<local_variables>>=
int i;
<<parsing>>=
for (i = 0; i < worg.nbuffers; i++) {
    rc = parse_file(&worg, i);
    if (rc) goto cleanup;
}
<<flush_last_block>>

The parse_file function will parse a file whose filename is located in the index position denoted by file.

<<function_declarations>>=
int parse_file(worgle_d *worg, int file);
<<functions>>=
int parse_file(worgle_d *worg, int file)
{
    char *buf;
    size_t size;
    worgle_textbuf *curbuf;
<<parser_local_variables>>

    curbuf = &worg->buffers[file];
    buf = curbuf->buf;
    size = curbuf->size;
    worg->curbuf = curbuf;
#ifndef WORGLITE
    worg->curorg = &worg->orgs[file];
    if (file > 0) {
        worg->curorg->prev = &worg->orgs[file - 1];
    } else {
        worg->curorg->prev = NULL;
    }
#endif
<<parser_initialization>>
    while (1) {
<<getline>>
        if(mode == MODE_ORG) {
<<parse_mode_org>>
        } else if(mode == MODE_CODE) {
<<parse_mode_code>>
        } else if(mode == MODE_BEGINCODE) {
<<parse_mode_begincode>>
        }
    }
    return rc;
}

3.3.2. Parsing Modes

The parser is implemented as a relatively simple state machine, whose behavior shifts between parsing org-mode markup (MODE_ORG), and code blocks (MODE_BEGINCODE and MODE_CODE). The state machine makes a distinction between the start of a new code block (MODE_BEGINCODE), which provides information like the name of the code block and optionally the name of the file to tangle to, and the code block itself (MODE_CODE).
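
To make the hand-off between the three modes concrete, here is a minimal sketch that is separate from Worgle itself. The function name sketch_count_code_lines and the SK_MODE_ constants are made up for this example, and instead of building blocks it only counts the lines found inside named code blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mode names, mirroring the three parser modes. */
enum { SK_MODE_ORG, SK_MODE_BEGINCODE, SK_MODE_CODE };

/* Returns the number of code lines inside named blocks,
 * or -1 if a #+NAME line is not followed by #+BEGIN_SRC. */
int sketch_count_code_lines(const char *buf, size_t size)
{
    size_t pos = 0;
    int mode = SK_MODE_ORG;
    int ncode = 0;

    assert(buf != NULL);
    while (pos < size) {
        const char *line = &buf[pos];
        size_t len = 0;

        /* Measure the current line, including its line break if present. */
        while (pos + len < size && line[len] != '\n') len++;
        if (pos + len < size) len++;
        pos += len;

        switch (mode) {
        case SK_MODE_ORG: /* wait for a named block */
            if (len >= 7 && !strncmp(line, "#+NAME:", 7))
                mode = SK_MODE_BEGINCODE;
            break;
        case SK_MODE_BEGINCODE: /* the next line must open the block */
            if (len >= 11 && !strncmp(line, "#+BEGIN_SRC", 11))
                mode = SK_MODE_CODE;
            else
                return -1;
            break;
        case SK_MODE_CODE: /* count lines until the block closes */
            if (len >= 9 && !strncmp(line, "#+END_SRC", 9))
                mode = SK_MODE_ORG;
            else
                ncode++;
            break;
        }
    }
    return ncode;
}
```

For a buffer holding one named block with two lines of code, the function returns 2. A #+NAME line that is not followed by #+BEGIN_SRC yields -1.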

<<enums>>=
enum {
<<parse_modes>>
};

3.3.2.1. MODE_ORG

<<parse_modes>>=
MODE_ORG,

3.3.2.1.1. Org Parse Top

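When the parser is in MODE_ORG, each line goes through up to three checks, in order. If a database is being generated, the line is first tested as a header. Next, the parser looks for the start of a named block. Finally, again only when a database is being generated, whatever is left is treated as content.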

<<parse_mode_org>>=
#ifndef WORGLITE
if (generate_db) {
<<parse_headers>>
}
#endif
<<find_next_named_block>>
#ifndef WORGLITE
if (generate_db) {
<<parse_content>>
}
#endif

3.3.2.1.2. Finding the next named block

When the parser is in MODE_ORG, it is mostly searching for the start of the next named block. When it finds a match, it extracts the name, gets ready to begin a new block, and changes the mode to MODE_BEGINCODE.

A common hard-to-find error happens when a colon is forgotten in the NAME tag. A special check will occur here to make sure that colon isn't forgotten.

<<find_next_named_block>>=
if(read >= 7) {
    if(!strncmp(line, "#+NAME", 6)) {
#ifndef WORGLITE
        if (generate_db) {
<<append_content_before_code>>
        }
#endif
        if(line[6] != ':') {
            fprintf(stderr,
                    "line %lu: expected ':'\n",
                    worg->linum);
            rc = 1;
            break;
        }
        mode = MODE_BEGINCODE;
        parse_name(line, read, &str);
        worgle_begin_block(worg, &str);
#ifndef WORGLITE
        continue;
#endif
    }
}

3.3.2.1.3. Extracting information from #+NAME

Name extraction of the current line is done with a function called parse_name.
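
The idea behind name extraction can be sketched in a few lines. This is a simplified stand-in rather than parse_name itself: the type sketch_string and the function sketch_parse_name are made up for this example, the struct layout is assumed from the str->str and str->size fields used in this document, and edge cases (such as an empty name) may be handled differently from the real function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: a pointer into the line plus a length. */
typedef struct {
    const char *str;
    size_t size;
} sketch_string;

/* Returns 1 if a name was found after "#+NAME:", 0 otherwise.
 * The name is not copied; it points into the original line. */
int sketch_parse_name(const char *line, size_t len, sketch_string *out)
{
    size_t n = 7; /* skip the 7 characters of "#+NAME:" */

    assert(line != NULL && out != NULL);
    out->str = NULL;
    out->size = 0;

    /* Skip the spaces between the colon and the name. */
    while (n < len && line[n] == ' ') n++;
    if (n >= len || line[n] == '\n') return 0;

    /* The name runs up to the line break. */
    out->str = &line[n];
    while (n < len && line[n] != '\n') {
        out->size++;
        n++;
    }
    return 1;
}
```

Because the result is a pointer and a size rather than a copy, the name stays valid only as long as the loaded file buffer does.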

<<static_function_declarations>>=
static int parse_name(char *line, size_t len, worgle_string *str);


<<functions>>=
static int parse_name(char *line, size_t len, worgle_string *str)
{
    size_t n;
    size_t pos;
    int mode;

    line+=7;
    len-=7;
    /* *namelen = 0; */
    str->size = 0;
    str->str = NULL;
    if(len <= 0) return 1;
    pos = 0;
    mode = 0;
    for(n = 0; n < len; n++) {
        if(mode == 2) break;
        switch(mode) {
            case 0:
                if(line[n] == ' ') {

                } else {
                    str->str = &line[n];
                    str->size++;
                    pos++;
                    mode = 1;
                }
                break;
            case 1:
                if(line[n] == 0xa) {
                    mode = 2;
                    break;
                }
                pos++;
                str->size++;
                break;
            default:
                break;
        }
    }
    /* *namelen = pos; */
    return 1;
}

3.3.2.1.4. Beginning a new block

A new code block is started with the function worgle_begin_block.

<<function_declarations>>=
void worgle_begin_block(worgle_d *worg, worgle_string *name);

When a new block begins, the current block in Worgle is set to be a value retrieved from the block dictionary.

<<functions>>=
void worgle_begin_block(worgle_d *worg, worgle_string *name)
{
    worg->curblock = worgle_hashmap_get(&worg->dict, name);
<<worgle_block_set_id>>
<<increment_nblocks>>
#ifndef WORGLITE
<<append_code_reference>>
#endif
}

3.3.2.1.5. DONE Parsing Header Information

CLOSED: [2019-09-12 Thu 07:10]

A valid header in org mode starts with one or more asterisks, followed by a space. Anything after this space is considered to be the name of the header. The number of asterisks indicates the header level.

If indeed the line is a header, both the header name and level are appended to the currently opened org file.
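
The header rule can be illustrated with a short standalone check. This is only a sketch under assumptions: the function sketch_header_level is made up for this example, it only computes the level, and it may treat edge cases (for instance, asterisks followed by a space and nothing else) differently from parse_header.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the header level (the number of leading asterisks),
 * or 0 if the line is not a header. */
int sketch_header_level(const char *line, size_t len)
{
    size_t n = 0;

    assert(line != NULL);

    /* Count the leading asterisks. */
    while (n < len && line[n] == '*') n++;
    if (n == 0) return 0;

    /* The asterisks must be followed by a space. */
    if (n >= len || line[n] != ' ') return 0;

    return (int)n;
}
```

A line such as "** Parsing" gives level 2, while "*bold* text" is rejected because the first asterisk is not followed by a space or another asterisk.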

A quick sanity check is done before the header is parsed via parse_header.

<<parse_headers>>=
if (read >= 2) {
    if (parse_header(worg, line, read)) {
        continue;
    }
}

The actual parsing logic happens in the function parse_header.

<<static_function_declarations>>=
#ifndef WORGLITE
static int parse_header(worgle_d *worg,
                        char *line,
                        size_t len);
#endif
<<functions>>=
#ifndef WORGLITE
static int parse_header(worgle_d *worg,
                        char *line,
                        size_t len)
{
    int mode;
    int rc;
    size_t s;
    char *header;
    worgle_string str;
    int lvl;
    mode = 0;

    if(line[0] != '*') return 0;

    rc = 0;
    worgle_string_init(&str);
    lvl = 1;
    for (s = 1; s < len; s++) {
        if (mode == 2) break;
        switch (mode) {
            case 0:
                if (line[s] == '*') {
                    lvl++;
                } else if (line[s] == ' '){
                    mode = 1;
                } else {
                    mode = 2;
                    rc = 0;
                }
                break;
            case 1:
                rc = 1;
                mode = 2;
                header = &line[s];
                str.str = header;
                str.size = len - s;
                str.size -= line[len - 1] == '\n';
<<append_content_before_header>>
                worgle_orgfile_append_header(worg,
                                             &str,
                                             lvl);
<<set_content_flag_after_header>>
                break;
        }
    }
    return rc;
}
#endif

3.3.2.1.6. DONE Content Parsing

CLOSED: [2019-12-10 Tue 20:26]

The text in between headers and code blocks is called content. It is assumed to be prose like this paragraph, but it can also contain comments and commands that Worgle doesn't yet understand.

Content parsing happens in MODE_ORG, and is the fallback option when no other pattern is picked up. When it reaches that point, the parser will take the current line and append it to the content block.

Appending content to the content block is a matter of extending the size of the block (text is mapped to a contiguous memory block).

<<parse_content>>=
#ifndef WORGLITE
<<setup_new_content_block>>
worg->segblock.size += read;
#endif

When a content block is started, the block variable must be reset. A content block starts in one of two situations: when a new header is found, or when content is found immediately after a code block ends.

To handle this, a flag is set any time a new content block has the potential to be started. The next time the parser arrives at a line that is considered to be content, it checks this flag and, if it is set, resets the block so that it starts at the current line.

<<setup_new_content_block>>=
if (worg->new_content) {
    worg->new_content = 0;
    worgle_string_reset(&worg->segblock);
    worg->segblock.str = line;
}
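
The combination of the flag and the growing block can be sketched on its own. This is an illustration under assumptions: sketch_span and sketch_append_content are made up for this example, and resetting a block is assumed to mean setting its size back to zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the content block: a span of the file buffer. */
typedef struct {
    const char *str;
    size_t size;
} sketch_span;

/* Adds one line to the content block. Since every line lives in the same
 * contiguous buffer, appending is just a matter of growing the size. */
void sketch_append_content(sketch_span *blk, int *new_content,
                           const char *line, size_t read)
{
    assert(blk != NULL && new_content != NULL && line != NULL);

    if (*new_content) {
        /* First content line since the flag was raised: restart here. */
        *new_content = 0;
        blk->str = line;
        blk->size = 0;
    }
    blk->size += read;
}
```

No text is copied at any point. This only works because consecutive lines are adjacent in memory, which is one practical benefit of loading the whole file in one block.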

The new_content flag is set at startup. It is also set when a code block ends, or after a header.

<<set_content_flag_after_block>>=
worg->new_content = 1;
<<set_content_flag_after_header>>=
worg->new_content = 1;

A content block is considered finished when a code block or new header section is reached, or when the document has ended.

No WORGLITE macro magic or generate_db conditionals are needed to append a content block before a header. At this level, it is already assumed.

<<append_content_before_header>>=
worgle_orgfile_append_content(worg, &worg->segblock);
worgle_string_reset(&worg->segblock);

A content block should be appended before a code block starts, which is when a code reference is appended.

<<append_content_before_code>>=
worgle_orgfile_append_content(worg, &worg->segblock);
worgle_string_reset(&worg->segblock);

At the end of all parsing, any content that remains in the last block must be flushed out. This is done in the top-level parsing code, once every file has been parsed.

<<flush_last_block>>=
#ifndef WORGLITE
if (generate_db) {
    worgle_orgfile_append_content(&worg, &worg.segblock);
}
#endif

3.3.2.1.7. DONE Code Reference

CLOSED: [2019-12-10 Tue 20:26]

Anytime a new code block begins, a reference to this new block is stored in the data representation of the file. This happens in worgle_begin_block.

<<append_code_reference>>=
worgle_orgfile_append_reference(worg, worg->curblock);

3.3.2.2. MODE_BEGINCODE

<<parse_modes>>=
MODE_BEGINCODE,

A parser set to mode MODE_BEGINCODE is only interested in finding the #+BEGIN_SRC line that begins the block. If it doesn't find one, it reports a syntax error. If it does, it goes on to extract a potential new filename to tangle to, which then gets appended to the Worgle file list.

<<parse_mode_begincode>>=
if (read >= 11) {
    if(!strncmp (line, "#+BEGIN_SRC",11)) {
<<begin_the_code>>
        if (parse_begin(line, read, &str) == 2) {
            worgle_append_file(worg, &str);
        }
        continue;
    } else {
        fwrite(line, read, 1, stderr);
        fprintf(stderr,
                "line %lu: Expected #+BEGIN_SRC\n",
                worg->linum);
        rc = 1;
        break;
    }
}
fprintf(stderr,
        "line %lu: Expected #+BEGIN_SRC\n",
        worg->linum);
rc = 1;
break; /* stop parsing, as the other error branch does */

3.3.2.2.1. Extracting information from #+BEGIN_SRC

The begin source flag in org-mode can have a number of options, but the only one we really care about for this tangler is the ":tangle" option.

<<static_function_declarations>>=
static int parse_begin(char *line, size_t len, worgle_string *str);

The state machine begins right after the BEGIN_SRC declaration, which is why the string is offset by 11.

The state machine for this parser is linear, and has 5 modes:

- mode 0: Skip whitespace after BEGIN_SRC
- mode 1: Find ":tangle" pattern
- mode 2: Ignore immediate whitespace after "tangle", and begin getting filename
- mode 3: Get filename size by reading up to the next space or line break
- mode 4: Don't do anything, wait for line to end.
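
The overall effect of these modes can be shown with a more compact linear scan. This is a sketch under assumptions and not the state machine used by Worgle: the function sketch_find_tangle is made up for this example, and it matches ":tangle" anywhere on the line instead of stepping through numbered modes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 and sets *name / *name_len when a ":tangle <file>" option
 * is found on the line, 0 otherwise. The name points into the line. */
int sketch_find_tangle(const char *line, size_t len,
                       const char **name, size_t *name_len)
{
    size_t n;

    assert(line != NULL && name != NULL && name_len != NULL);
    *name = NULL;
    *name_len = 0;

    for (n = 0; n + 7 <= len; n++) {
        if (!strncmp(line + n, ":tangle", 7)) {
            n += 7;
            /* Skip the spaces after ":tangle". */
            while (n < len && line[n] == ' ') n++;
            *name = line + n;
            /* The filename ends at the next space or line break. */
            while (n < len && line[n] != ' ' && line[n] != '\n') {
                (*name_len)++;
                n++;
            }
            return *name_len > 0;
        }
    }
    return 0;
}
```

For the line "#+BEGIN_SRC c :tangle worgle.c" the scan reports a filename of 8 characters starting at "worgle.c". A #+BEGIN_SRC line without a :tangle option reports nothing.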

<<functions>>=
static int parse_begin(char *line, size_t len, worgle_string *str)
{
    size_t n;
    int mode;
    int rc;

    line += 11;
    len -= 11;

    if(len <= 0) return 0;


    mode = 0;
    n = 0;
    rc = 1;
    str->str = NULL;
    str->size = 0;
    while(n < len) {
        switch(mode) {
            case 0: /* initial spaces after BEGIN_SRC */
                if(line[n] == ' ') {
                    n++;
                } else {
                    mode = 1;
                }
                break;
            case 1: /* look for :tangle */
                if(line[n] == ' ') {
                    mode = 0;
                    n++;
                } else {
                    if(line[n] == ':') {
                        if(!strncmp(line + n + 1, "tangle", 6)) {
                            n+=7;
                            mode = 2;
                            rc = 2;
                        }
                    }
                    n++;
                }
                break;
            case 2: /* save file name, spaces after tangle */
                if(line[n] != ' ') {
                    str->str = &line[n];
                    str->size++;
                    mode = 3;
                }
                n++;
                break;
            case 3: /* read up to next space or line break */
                if(line[n] == ' ' || line[n] == '\n') {
                    mode = 4;
                } else {
                    str->size++;
                }
                n++;
                break;
            case 4: /* countdown til end */
                n++;
                break;
        }
    }

    return rc;
}

3.3.2.2.2. Setting up code for a new read

When a new codeblock has indeed been found, the mode is switched to MODE_CODE, and the block_started boolean flag gets set. In addition, the string used to keep track of the new block is reset.

<<begin_the_code>>=
mode = MODE_CODE;
worg->block_started = 1;
worgle_string_reset(&worg->block);

3.3.2.2.3. Appending a new file

If a new file is found, the filename gets appended to the file list via the function worgle_append_file.

<<function_declarations>>=
void worgle_append_file(worgle_d *worg, worgle_string *filename);
<<functions>>=
void worgle_append_file(worgle_d *worg, worgle_string *filename)
{
    worgle_file *f;
    f = worgle_filelist_append(&worg->flist, filename, worg->curblock);
<<worgle_file_set_id>>
}

3.3.2.3. MODE_CODE

<<parse_modes>>=
MODE_CODE

In MODE_CODE, actual code is parsed inside of the code block. The parser will keep reading chunks of code until one of two things happens: a code reference is found, or the END_SRC command is found.

<<parse_mode_code>>=
if(read >= 9) {
    if(!strncmp(line, "#+END_SRC", 9)) {
        mode = MODE_ORG;
        worg->block_started = 0;
        worgle_append_string(worg);
#ifndef WORGLITE
<<set_content_flag_after_block>>
#endif
        continue;
    }
}

if(check_for_reference(line, read, &str)) {
    worgle_append_string(worg);
    worgle_append_reference(worg, &str);
    worg->block_started = 1;
    worgle_string_reset(&worg->block);
    continue;
}

worg->block.size += read;

if(worg->block_started) {
    worg->block.str = line;
    worg->block_started = 0;
    worg->curline = worg->linum;
}
<<function_declarations>>=
void worgle_append_string(worgle_d *worg);

In this function, the currently active string block is appended to the currently active code block. It is called when the parser is inside a code block (aka MODE_CODE).

The current line number is checked to see whether it is a valid (non-negative) value. A negative value indicates a properly initialized, but unset value. This will happen if the initial code block begins with a reference. A negative value will cause invalid line declarations in the generated code.

In some cases, Worgle will try to append an empty string to a block. While harmless for tangling, this can cause issues when doing metadata export. Empty strings will be ignored.

<<functions>>=
void worgle_append_string(worgle_d *worg)
{
    worgle_segment *seg;
    if (worg->curblock == NULL) return;
    if (worg->curline < 0) return;

    if (worg->block.size == 0) return;

    seg = worgle_block_append_string(worg->curblock,
                                     &worg->block,
                                     worg->curline,
                                     &worg->curbuf->filename);
<<worgle_segment_string_set_id>>
<<store_last_string_id>>
}
<<function_declarations>>=
void worgle_append_reference(worgle_d *worg, worgle_string *ref);
<<functions>>=
void worgle_append_reference(worgle_d *worg, worgle_string *ref)
{
    worgle_segment *seg;
    if(worg->curblock == NULL) return;
    seg = worgle_block_append_reference(worg->curblock,
                                        ref,
                                        worg->linum,
                                        &worg->curbuf->filename);
<<worgle_segment_reference_set_id>>
<<store_last_reference_id>>
}
<<static_function_declarations>>=
static int check_for_reference(char *line , size_t size, worgle_string *str);
<<functions>>=
static int check_for_reference(char *line , size_t size, worgle_string *str)
{
    int mode;
    size_t n;
    mode = 0;

    str->size = 0;
    str->str = NULL;
    for(n = 0; n < size; n++) {
        if(mode < 0) break;
        switch(mode) {
            case 0: /* spaces */
                if(line[n] == ' ') continue;
                else if(line[n] == '<') mode = 1;
                else mode = -1;
                break;
            case 1: /* second < */
                if(line[n] == '<') mode = 2;
                else mode = -1;
                break;
            case 2: /* word setup */
                str->str = &line[n];
                str->size++;
                mode = 3;
                break;
            case 3: /* the word */
                if(line[n] == '>') {
                    mode = 4;
                    break;
                }
                str->size++;
                break;
            case 4: /* last > */
                if(line[n] == '>') mode = 5;
                else mode = -1;
                break;
        }
    }

    return (mode == 5);
}
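
What counts as a code reference can be illustrated with a simplified check. This is a sketch under assumptions: the function sketch_is_reference is made up for this example, and unlike check_for_reference it rejects an empty name and is written as a plain scan rather than a numbered state machine.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the line consists of optional leading spaces followed by
 * <<name>>, and sets *name / *name_len to the text between the brackets.
 * Returns 0 otherwise. */
int sketch_is_reference(const char *line, size_t len,
                        const char **name, size_t *name_len)
{
    size_t n = 0;
    size_t start;

    assert(line != NULL && name != NULL && name_len != NULL);
    *name = NULL;
    *name_len = 0;

    /* Leading indentation is allowed. */
    while (n < len && line[n] == ' ') n++;

    /* Two opening angle brackets. */
    if (n + 1 >= len || line[n] != '<' || line[n + 1] != '<') return 0;
    n += 2;
    start = n;

    /* The name runs up to the first closing bracket. */
    while (n < len && line[n] != '>') n++;
    if (n == start) return 0;

    /* Two closing angle brackets. */
    if (n + 1 >= len || line[n + 1] != '>') return 0;

    *name = line + start;
    *name_len = n - start;
    return 1;
}
```

An indented line holding only <<loadfile>> is accepted with the name "loadfile", while ordinary code such as "x = y << 2;" is rejected because the line does not begin with the opening brackets.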

3.3.3. Parser Local Variables

The parsing stage requires a local variable called str to be used from time to time. Not sure where else to put this.

<<parser_local_variables>>=
worgle_string str;
<<parser_initialization>>=
worgle_string_init(&str);

line is the pointer that worgle_getline sets to the start of the current line.

<<parser_local_variables>>=
char *line;
<<parser_initialization>>=
line = NULL;

pos refers to the current buffer position.

<<parser_local_variables>>=
size_t pos;
<<parser_initialization>>=
pos = 0;

read stores the size of the current line, including the line break character.

<<parser_local_variables>>=
size_t read;

The overall parser mode state is set by the local variable mode.

<<parser_local_variables>>=
int mode;

It is set to be the initial mode of MODE_ORG.

<<parser_initialization>>=
mode = MODE_ORG;

The main return code determines the overall state of the program.

<<parser_local_variables>>=
int rc;

By default, it is set to be okay, which is 0 on POSIX systems.

<<parser_initialization>>=
rc = 0;

The getline function used by the parser returns a status code, which tells the program when it has reached the end of the file.

<<parser_local_variables>>=
int status;

This is set to 0 by default. The value is overwritten by worgle_getline before it is first checked.

<<parser_initialization>>=
status = 0;

3.3.4. Reading a line at a time

Despite being loaded into memory, the program still reads in code one line at a time. The parsing relies on new line feeds to denote the beginnings and endings of sections and code references.

Before reading the line, the line number inside worgle is incremented.

In order to handle multiple files, this value must explicitly be reset to zero every time inside of the parse_file function.

<<parser_initialization>>=
worg->linum = 0;

A special readline function has been written based on getline that reads lines of text from an allocated block of text. This function is called worgle_getline.

After the line has been read, the program checks the return code status. If all the lines of text have been read, the program breaks out of the while loop.

<<getline>>=
worg->linum++;
status = worgle_getline(buf, &line, &pos, &read, size);
if(!status) break;
<<static_function_declarations>>=
static int worgle_getline(char *fullbuf,
                  char **line,
                  size_t *pos,
                  size_t *line_size,
                  size_t buf_size);


fullbuf refers to the full text buffer.

line is a pointer where the current line will be stored.

pos is the current buffer position.

line_size is a variable written to that returns the size of the line. This includes the line break character.

buf_size is the size of the whole buffer.

<<functions>>=
static int worgle_getline(char *fullbuf,
                  char **line,
                  size_t *pos,
                  size_t *line_size,
                  size_t buf_size)
{
    size_t p;
    size_t s;
    *line_size = 0;
    p = *pos;
    *line = &fullbuf[p];
    s = 0;
    while(1) {
        s++;
        if(p >= buf_size) return 0;
        if(fullbuf[p] == '\n') {
            *pos = p + 1;
            *line_size = s;
            return 1;
        }
        p++;
    }
}
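
The way a caller steps through a buffer with this kind of function can be sketched as follows. This is an illustration under assumptions: sketch_getline and sketch_count_lines are made up for this example, and sketch_getline is a restatement of the logic above with a different parameter order. As with the function above, a final line that lacks a line break is not returned.

```c
#include <assert.h>
#include <stddef.h>

/* Finds the next line in buf starting at *pos. On success, *line points at
 * the line, *line_size includes the line break, and *pos moves past it. */
int sketch_getline(const char *buf, size_t size, size_t *pos,
                   const char **line, size_t *line_size)
{
    size_t p;

    assert(buf != NULL && pos != NULL && line != NULL && line_size != NULL);
    p = *pos;
    *line = &buf[p];
    *line_size = 0;

    while (p < size) {
        if (buf[p] == '\n') {
            *line_size = p - *pos + 1;
            *pos = p + 1;
            return 1;
        }
        p++;
    }
    return 0; /* ran out of buffer before finding a line break */
}

/* Counts the complete (newline-terminated) lines in a buffer. */
int sketch_count_lines(const char *buf, size_t size)
{
    size_t pos = 0;
    size_t read = 0;
    const char *line = NULL;
    int count = 0;

    while (sketch_getline(buf, size, &pos, &line, &read)) count++;
    return count;
}
```

The buffer itself is never modified or copied. Each call only hands back a pointer and a size, which is what lets the rest of the parser store names and blocks as spans of the original text.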

3.4. Generation

The last phase of the program is code generation.

A parsed file generates a structure of how the code will look. The generation stage involves iterating through the structure and producing the code.

Due to the hierarchical nature of the data structures, the generation stage is surprisingly elegant with a single expanding entry point.

At the very top, generation consists of writing all the files in the filelist. Each file will then go and write the top-most block associated with that file. A block will then write the segment list it has embedded inside of it. A segment will either write a string literal to disk, or recursively expand a block reference.

<<generation>>=
if(!rc && tangle_code) if(!worgle_generate(&worg)) rc = 1;
<<function_declarations>>=
int worgle_generate(worgle_d *worg);
<<functions>>=
int worgle_generate(worgle_d *worg)
{
    return worgle_filelist_write(&worg->flist, &worg->dict);
}
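
The recursive expansion described above can be sketched with a toy data structure. Everything here is an assumption made for illustration: the types sketch_block and sketch_segment and the function sketch_expand are not Worgle's, references are held as direct pointers rather than looked up by name in a dictionary, and the output goes into a string instead of a file on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct sketch_block sketch_block;

/* A segment is either literal text or a reference to another block. */
typedef struct {
    const char *text;        /* literal text, or NULL for a reference */
    const sketch_block *ref; /* referenced block, or NULL for text */
} sketch_segment;

struct sketch_block {
    const sketch_segment *segs;
    size_t nsegs;
};

/* Appends the expansion of blk to the NUL-terminated string in out.
 * Returns 1 on success, 0 if out (capacity cap) is too small. */
int sketch_expand(const sketch_block *blk, char *out, size_t cap)
{
    size_t i;

    assert(blk != NULL && out != NULL);
    for (i = 0; i < blk->nsegs; i++) {
        const sketch_segment *s = &blk->segs[i];
        if (s->text != NULL) {
            /* String segment: copy the literal text to the output. */
            size_t used = strlen(out);
            size_t n = strlen(s->text);
            if (used + n + 1 > cap) return 0;
            memcpy(out + used, s->text, n + 1);
        } else if (s->ref != NULL) {
            /* Reference segment: expand the referenced block in place. */
            if (!sketch_expand(s->ref, out, cap)) return 0;
        }
    }
    return 1;
}
```

A top-level block made of the text "begin", a reference to an inner block, and the text "end" expands to the inner block's text sandwiched between the two literals. The sketch has no protection against a block that references itself, so it assumes the references form no cycles.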

If the use_warnings flag is turned on, Worgle will scan the dictionary after generation and flag warnings about any unused blocks.

<<generation>>=
if(!rc && use_warnings) rc = worgle_warn_unused(&worg);

3.5. Cleanup

At the end of the program, all allocated memory is freed via worgle_free.

<<cleanup>>=
cleanup:
worgle_free(&worg);
return rc;


