(Updated Sat 2021-06-05)

Parsing in Fexl

As many computer programmers have observed: "Parsing is hard!" I concur, but it can be simplified. I like to think in terms of functions, so I take a functional programming approach instead of the more common imperative or stateful approach.

A parse function takes a list of characters (a "stream") as input and returns some type of result. In some cases it may return multiple types of results, such as one for failure and one for success. In that case the parse function would receive two handler functions as input, typically called no for failure and yes for success. This allows you to capture syntax errors directly, instead of having the code die per the usual practice.

When the parse function calls a handler function, it passes in the data resulting from the parse, but it also passes in the tail of the input stream after the parse. This allows you to chain parse functions sequentially, consuming the input stream in logical segments.

Basic parse functions

The two basic functions that work directly with an input stream are:

\scan=(\no\yes\xs xs (no []) yes)
\push=(\x\yes\xs yes [x;xs])

The scan function is what you use to examine the next character in the stream. It takes two handlers no and yes, along with the input stream xs, and returns (no []) if the stream is empty, or (yes x xt) if the stream consists of a character x followed by the tail stream xt.

The push function is what you use to push a character that you've already read back onto the front of the input stream, so it can be read subsequently by some other function.

Note that those are the only two parse functions which refer to the input stream xs directly. All other parse functions are built on top of those, using a point-free and monadic style so you can compose parse functions without passing the input stream around explicitly.

A parsing example

Now consider a rather interesting parsing task. Write a function which collects all characters in the input up to the next occurrence of a given terminator string. For example, if the input stream starts with "123~END456", and you use the terminator string "~END", the function should return "123", and leave the input sitting at the "4" character after the terminator.

Note that you have to handle the case where the terminator is not found. If the stream is "123END456", and you use the terminator "~END", then it is not found.

Here's the outline of that function:

# Collect a string up to the next occurrence of the terminator string.
# Return (yes str) if the term was found, otherwise return no.

\collect_string=
    (\term\no\yes
    ...
    )

Note that because I'm using the point-free monadic style, I do not mention the input stream xs explicitly. However, xs is in fact the fourth input to the collect_string function. Also, for the same reason, it "goes without saying" that the function will pass in the tail of the list to the handlers, so the actual return values will by (yes str xt) if the term was found, or (no xt) if not found — where xt is the tail of the list after the parse.

So here's an example of how it might be called:

collect_string term
    (
    say "missing terminator"
    )
    (\str
    say ["str  = "(as_str str)]
    )

I show the full implementation of collect_string in the next section.

Parsing the SSV format

In our investment accounting business I make extensive use of the "SSV" (space-separated value) format to specify event histories, such as trades, capital contributions, dividends, interest, etc. This is a series of lines, with each line consisting of a series of items separated by white space. An item can be a plain word such as "trade" (without the quotes), or a string enclosed in quotes such as "Alice Smith" (with the quotes), or a "tilde string" as used in Fexl itself, where you can choose an arbitrary delimiter such as "~END" and then enclose any block of text you like within a pair of those delimiters. Text within a quoted or tilde-quoted string may include line feeds, so you can have multi-line items.

The format allows comments, indicated by a "#" and continuing to the end of the line. (Of course, a "#" inside a quoted string does not indicate a comment.)

Here is the module for parsing the SSV format:

read_ssv.fxl:

# Parse the SSV (space-separated value) format.

# Collect a string up to the next occurrence of the terminator string.
\collect_string=
    (\t_ch\term\no\yes
    \buf_result=buf_new

    @\loop_start
    \fh_term=(readstr term)

    buffer_to t_ch buf_result no
        (
        \buf_match=buf_new
        buf_put buf_match t_ch

        @\loop_term
        \t=(sgetc fh_term)
        is_undef t (\str=(buf_get buf_result) yes str);

        scan no \x
        eq x t
            (
            buf_put buf_match x
            loop_term
            )
            (
            buf_put buf_result (buf_get buf_match)
            push x;
            loop_start
            )
        )
    )

# LATER 20211227 Insist on white space terminator, forbid QU and "~".
\get_plain_item=
    (
    collect_token
        (\x
        is_white x T;
        eq x QU T;
        eq x "~" T;
        F
        )
    )

\get_item=
    (\no\yes
    scan no \x
    is_eol x no;
    eq x QU (collect_to (eq QU) no yes);
    eq x "~"
        (
        collect_to is_white no \term
        collect_string x term no yes
        );
    push x;
    get_plain_item yes
    )

\get_row=
    (@\loop\no\yes
    skip_match (\x is_eol x F; is_white x);
    get_item no \item
    loop (yes [item]) \row yes [item;row]
    )

\get_rows=
    (@\loop\yes
    skip_match is_white;
    get_row (yes []) \row
    loop \rows yes [row;rows]
    )

\read_ssv_string=(read_string get_rows)
\read_ssv_chars=(read_chars get_rows)
\read_ssv_file=(read_file get_rows)

\form
def "read_ssv_string" read_ssv_string;
def "read_ssv_chars" read_ssv_chars;
def "read_ssv_file" read_ssv_file;
form

Parsing the CSV format

Data feeds from broker accounts often use the "CSV" (comma-separated value) format. Here is the module for parsing the CSV format:

read_csv.fxl:

# Parse the CSV (comma-separated value) format.
# NOTE: https://www.ietf.org/rfc/rfc4180.txt
# "Spaces are considered part of a field and should not be ignored."

\get_plain_item=
    (\sep
    collect_token
        (\x
        eq x sep T;
        is_eol x T;
        F
        )
    )

# Get a char inside a quoted string.  A single QU char is treated as end of
# string.  Two QU chars in a row are treated as a single QU character which
# appears in the string.
\get_quote_char=
    (\fail\done\keep
    scan fail \x
    eq x QU
        (
        scan done \x
        eq x QU
            (keep QU)
            (push x done)
        )
        (keep x)
    )

# Get a quoted item.
\get_quote_item=
    (\no\yes
    \buf=buf_new
    @\loop
    get_quote_char no
        (\str=(buf_get buf) yes str)
        (\x buf_put buf x loop)
    )

\get_item=
    (\sep
    \get_plain_item=(get_plain_item sep)
    \no\yes
    scan no \x
    eq x QU (get_quote_item no yes);
    push x;
    get_plain_item yes
    )

\get_row=
    (\sep
    \get_item=(get_item sep)
    @\loop\no\yes
    get_item no \item
    scan (yes [item]) \x
    eq x sep
        (loop (yes [item]) \row yes [item;row])
        (yes [item])
    )

\get_rows=
    (\sep
    \get_row=(get_row sep)
    @\loop\yes
    skip_match is_eol;
    get_row (yes []) \row
    loop \rows yes [row;rows]
    )

\read_rows=
    (\sep\read
    \get_rows=(get_rows sep)
    read get_rows
    )

# comma-separated
\read_csv_string=(read_rows ","  read_string)
\read_csv_chars=(read_rows "," read_chars)
\read_csv_file=(read_rows "," read_file)

# tab-separated
\read_tsv_string=(read_rows TAB read_string)
\read_tsv_chars=(read_rows TAB read_chars)
\read_tsv_file=(read_rows TAB read_file)

\form
def "read_csv_string" read_csv_string;
def "read_csv_chars" read_csv_chars;
def "read_csv_file" read_csv_file;
def "read_tsv_string" read_tsv_string;
def "read_tsv_chars" read_tsv_chars;
def "read_tsv_file" read_tsv_file;
form

Note: If you needed to use a terminator other than "," or TAB, you could do something like this:

# Use arbitrary separator.
\read_sep_string=(\sep read_rows sep read_string)
\read_sep_chars=(\sep read_rows sep read_chars)
\read_sep_file=(\sep read_rows sep read_file)

Parsing the JSON format

I also have a JSON parser which I use for many API data feeds, but it's not in the standard library yet. I'll write something about it here later.

Validating the parsers

To validate the SSV and CSV parsers, I have devised an extensive test suite, including:

This includes not only normal cases, but also very intricate corner cases, such as having the stream end without a line feed, or failing to close a quoted string properly. I have very deliberately "fuzzed" the test cases to exercise many possibilities.

Supporting modules

There is a supporting module "read.fxl" which defines some common functions used by read_ssv.fxl and read_csv.fxl above:

read.fxl:

#
\scan=(\no\yes\xs xs (no []) yes)
\push=(\x\yes\xs yes [x;xs])

\is_eol=(\x eq x LF T; eq x CR)
\is_white=(ge " ")

# Skip matching characters.
\skip_match=
    (\is_match\yes @\loop
    scan yes \x
    is_match x
        loop
        (push x yes)
    )

# Buffer characters up to a specific end char.
\buffer_to=
    (\t_ch\buf\no\yes
    @\loop
    scan no \x
    eq x t_ch
        yes
        (buf_put buf x loop)
    )

# Collect characters up to an end char.  The end char is skipped.
\collect_to=
    (\is_end\no\yes
    \buf=buf_new
    @\loop
    scan no \x
    is_end x
        (\str=(buf_get buf) yes str)
        (buf_put buf x loop)
    )

# Collect characters up to an end char or EOF.  The end char is kept.
\collect_token=
    (\is_end\yes
    \buf=buf_new
    \done==(\str=(buf_get buf) yes str)
    @\loop
    scan done \x
    is_end x
        (push x done)
        (buf_put buf x loop)
    )

\read_chars=(\read read (\data\xs data))

\stream_string=
    (\str
    \fh=(readstr str)
    stream_get (sgetc fh)
    )

\stream_file=
    (\name
    \fh=(fopen name "r")
    is_void fh void;
    stream_get (fgetc fh)
    )

\read_string=
    (\read\in
    stream_string in \xs
    read_chars read xs
    )

\read_file=
    (\read\in
    stream_file in \xs
    read_chars read xs
    )

\form
def "scan" scan;
def "push" push;
def "is_eol" is_eol;
def "is_white" is_white;
def "skip_match" skip_match;
def "buffer_to" buffer_to;
def "collect_to" collect_to;
def "collect_token" collect_token;
def "read_chars" read_chars;
def "read_string" read_string;
def "read_file" read_file;
form