Change symbol internally to support opaque octet strings #17

michaeljclark · 2024-02-18T22:49:49Z

I don't expect you to merge this; it's mostly a hint at a possible direction. It would be useful for hashing graphs of symbols that contain binary data, so as a first step the symbol should become an opaque octet string, i.e. change from a zero-terminated string to a length-prefixed octet string using the length stored in the object. To do this we need to 1) defer the size padding and alignment calculations and 2) as a discrete step, hoist strlen out of intern and make_symbol.

the first change defers padding and alignment of the object's length by storing the native length in the object, and it adds an obj_size function that derives the padded and aligned size for use inside alloc, forward and gc.
the second discrete change adds intern_opaque and make_symbol_opaque, with wrappers that call strlen. before this change the symbol is still zero-terminated; after it, symbols are binary data and we use memcmp in intern. this also changes print to use "%.*s", which I verified truncates the string to the given length; I tested it on Linux and Windows.
- after the second change the symbol no longer stores the terminating zero. this could be debated, as it makes some interop harder because old C-string code will need a temporary copy, but I think keeping the zero would actually be worse because it allows compatibility with old code that is not binary clean, and that is a bad thing™.

cat >a.c <<EOF
#include <stdio.h>
int main() { printf("%.*s\n", 4, "hello"); }
EOF
cc a.c
./a.out
hell

There should be no functionally visible change if this works. It passes all tests in test.sh. btw it's great that the code is so simple because a core change like this is rather easy to do without the noise in a larger codebase like minischeme.

addendum

I might consider adding some backslash escapes to read_symbol and print in my tree. that makes sense if we were to add a double-quoted octet string type like the Scheme string type. but there is also the question of round-tripping opaque symbols through lisp code using some escape sequence like \{#x0} in read_symbol and print, so that binary data survives parsing and emitting 100% intact in quoted or unquoted symbols.

the curly escape is used in LaTeX and termcap (it's percent escaped as %{num} in termcap) and the #x0 is a hex literal from Scheme. I like the curly escape and I like the Scheme method for hex literals, i.e. \{0} == \{#x0}. it could also support constants like character names from the Unicode database, such as "U+220E ∎ END OF PROOF", which we would escape as \{end-of-proof} if we had the Unicode symbol table. also, minischeme uses #\A for its char type, so \{#\A} would basically embed the char code for 'A' inside a symbol. the numeric integer or char code escape would be UTF-8 encoded, so while it is an integer code it in fact inserts a one-byte string.

C++23 has an N in front with its new \N{symbol} escape but I personally think the N is redundant. also I am not sure if you want to complicate minilisp with any of this?

in any case this is a preparatory change so that folks who derive from minilisp can support '\0' in their symbols in out-of-tree code. I think it makes sense to have something like this in-tree, even if more complicated things are kept out-of-tree, because it's a rather simple core infrastructure change. My motivation is that I want to use minilisp for testing some termcap entries for a terminal emulator project that I am working on. they have some zeroes in them, so I need to support zeroes in my tests, and I don't like that termcap stores zeros as \{#x80} and later clears the top bit, which makes entries only 7-bit clean and thus makes it impossible to cleanly support UTF-8 in termcap entries.

michaeljclark · 2024-02-19T06:17:17Z

@rui314

I am still thinking about minimal changes to minilisp to support modifying octet strings, which may simply be to add some methods to index bytes within symbols without adding any new types. I want to port a pure implementation of SHA-256 to minilisp as an exercise and possibly add it to examples; even if we ultimately use a native hash algorithm, it's a good exercise for testing a representation in lisp. so I am thinking about what the minimum is to make this practical. I made a port of the Stanford SHA-256 reference implementation to GLSL, glhash, that compiles to SPIR-V and does not need pointers, which is a reasonable starting point. I don't think we need an array type if the symbol itself is indexable.

a minimal escape sequence so that we can round-trip an octet string as a symbol, plus an indexed set and get, might be the bare minimum change for a compact in-memory representation of a byte array for a pure lisp hash algorithm. sure, we can simulate a byte array using a list but that is going to be very inefficient. even if we can index bytes in a symbol, we would need shift, and, and or to compose uints, as SHA-256 actually needs uint-typed arrays. if we took the approach of reusing symbol, we would just need to add an escape like \xNN so that non-printable characters, parentheses, quotes or whitespace can be escaped into symbols to make it easier to get data in and out. but to bind a mutable symbol one needs a defined identity to use as a reference to the symbol we want to mutate, otherwise we would need to copy the symbol for every byte we change.

curious about your thoughts. I'm just letting you know what I'm thinking of in case there is a preferred type of change that you might accept as a minimal change. perhaps it is just to use a list of integers... curious how you would do it?

michaeljclark · 2024-02-21T22:29:31Z

\{'o}_\{'o}

michaeljclark added 2 commits February 19, 2024 10:14

Change object to store unaligned length in bytes.

ad2e317

Change intern to compare names using length field.

a2dba94
