Change symbol internally to support opaque octet strings #17
I don't expect you to merge this; it's mostly a hint at a possible direction. It would be useful for hashing graphs of symbols containing binary data, so as a step the symbol should become an opaque octet string, i.e. change from a zero-terminated string to a length-prefixed octet string using the length stored in the object. To do this we need to 1) defer the size padding and alignment calculations and 2) as a discrete step, hoist `strlen` out of `intern` and `make_symbol`, add an `obj_size` function that derives the padded and aligned size to be used inside `alloc`, `forward` and `gc`, and add `intern_opaque` and `make_symbol_opaque`, keeping `intern` and `make_symbol` as thin wrappers that call `strlen`.
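Roughly, the shape I have in mind is something like this (only a sketch: the `Obj` layout, field names and signatures here are illustrative, not the actual minilisp definitions):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative layout only -- not the exact minilisp Obj.  The point is
 * that a symbol carries an explicit octet count instead of relying on a
 * terminating NUL. */
typedef struct Obj {
    int type;
    int size;                                /* total object size, padded and aligned */
    union {
        struct { struct Obj *car, *cdr; };   /* cons cell */
        struct {                             /* symbol */
            int name_len;                    /* number of octets in name */
            char name[1];                    /* name_len octets, not NUL-terminated */
        };
    };
} Obj;

/* One place that derives the padded and aligned allocation size from the
 * symbol payload length, shared by alloc, forward and gc. */
size_t obj_size(size_t payload) {
    size_t size = offsetof(Obj, name) + payload;
    return (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
}

/* Length-taking primitives: the name is an opaque octet string. */
Obj *make_symbol_opaque(void *root, const char *buf, size_t len);
Obj *intern_opaque(void *root, const char *buf, size_t len);

/* The existing entry points become thin wrappers that supply the length,
 * so existing callers passing C strings keep working. */
Obj *make_symbol(void *root, const char *name) {
    return make_symbol_opaque(root, name, strlen(name));
}

Obj *intern(void *root, const char *name) {
    return intern_opaque(root, name, strlen(name));
}
```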
Before this change the symbols are still zero-terminated; after it they are binary data and we use `memcmp` in `intern`. This also changed `print` to use `"%.*s"`, which I verified will truncate the string correctly; I tested it on Linux and Windows.
There should be no functionally visible change if this works; it passes all tests in `test.sh`. Btw, it's great that the code is so simple, because a core change like this is rather easy to make without the noise of a larger codebase like minischeme.

**addendum**
I might consider adding some backslash escapes to `read_symbol` and `print` in my tree. That makes sense if we were to add a double-quoted octet-string type like the Scheme string type. But there is also the question of an isomorphic forward and reverse mapping of opaque symbols in lisp code, using some escape sequence like `\{#x0}` in `read_symbol` and `print`, so that binary data can round-trip 100% when parsing and emitting quoted or unquoted symbols. The curly escape is used in LaTeX and termcap (it's percent-escaped as `%{num}` in termcap) and `#x0` is a hex literal from Scheme. I like the curly escape and I like the Scheme method for hex literals, i.e. `\{0}` == `\{#x0}`, and it could also support constants like character names from the Unicode database, such as "U+220E ∎ END OF PROOF", which we would escape as `\{end-of-proof}` if we had the Unicode symbol table. Also, minischeme uses `#\A` for its char type, so `\{#\A}` would basically be embedding the char code for 'A' inside a symbol using the numeric integer or char-code escape; it would be UTF-8 encoded, so while it is an integer code it in fact inserts a one-byte string. C has an N in front with its new `\N{symbol}` escape, but I personally think the N is redundant. Also, I am not sure if you want to complicate minilisp with any of this?
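To make the round-trip idea concrete, here is a minimal sketch of what the escaping side could look like (purely illustrative, not part of this patch; `emit_symbol` is a hypothetical helper and the choice of which bytes to escape is only an example):

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical escaping printer for the proposed \{#xNN} syntax: ordinary
 * printable bytes pass through, everything else (including '\0', plus the
 * escape characters themselves) is emitted as a curly hex escape so a
 * matching read_symbol could reverse it exactly. */
static void emit_symbol(const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (isprint(c) && c != '\\' && c != '{' && c != '}')
            putchar(c);
        else
            printf("\\{#x%02X}", c);
    }
}
```

Called on the three octets `a`, `\0`, `b` it prints `a\{#x00}b`, which a matching `read_symbol` could parse back into exactly those octets.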
In any case, this is a preparatory change so that folks who derive from minilisp can support '\0' in their symbols in out-of-tree code. I think it makes sense to have something like this in-tree even if more complicated things are kept out-of-tree, because it's a rather simple core-infrastructure change. My motivation is that I want to use minilisp for testing some termcap entries for a terminal emulator project I am working on; they have some zeroes in them, so I need to support zeroes in my tests, and I don't like that termcap stores zeros as `\{#x80}` and then later clears the top bit, making the entries only 7-bit clean and thus making it impossible to cleanly support UTF-8 in termcap entries.