From bb45c188e13741b64a33b77ea5cb27372d1a846e Mon Sep 17 00:00:00 2001
From: Dmitri Prime <bolms@google.com>
Date: Wed, 18 Dec 2024 15:09:31 -0800
Subject: [PATCH] Add docs for would-be contributors. (#213)

---
 doc/contributing.md     | 130 ++++++++
 doc/how-to-design.md    | 710 ++++++++++++++++++++++++++++++++++++++++
 doc/how-to-implement.md | 203 ++++++++++++
 doc/index.md            |   5 +
 4 files changed, 1048 insertions(+)
 create mode 100644 doc/contributing.md
 create mode 100644 doc/how-to-design.md
 create mode 100644 doc/how-to-implement.md

diff --git a/doc/contributing.md b/doc/contributing.md
new file mode 100644
index 0000000..f6146a4
--- /dev/null
+++ b/doc/contributing.md
@@ -0,0 +1,130 @@
+# Contributing to Emboss
+
+If you would like to fix a bug or add a new feature to Emboss, great!  This
+document is intended to help you understand the procedure, so that your change
+can land in the main Emboss repository.
+
+You do not have to take a change all the way from start to finish, either: if
+you can get a design approved, then someone else can implement it much more
+easily.  Conversely, if you are looking for a way to help, you might look for
+existing [feature
+requests](https://github.com/google/emboss/labels/enhancement) that have
+designs, or look at [open design sketches](design_docs/) that you might be
+able to implement.
+
+
+## All Changes
+
+Because Emboss is a Google project, in order to submit code you will need to
+sign a [Google Contributor License Agreement
+(CLA)](https://cla.developers.google.com/).
+
+**IMPORTANT**: if your contribution includes code that is not covered by a
+Google CLA and is not owned by Google, the Emboss project has to follow special
+procedures to include it.  Please let us know ([filing an issue on
+GitHub](https://github.com/google/emboss/issues/new) is probably the easiest
+way) so that we can walk you through the process.  In particular, we generally
+cannot accept any code from StackExchange or similar sites, and any code that
+comes from a non-Google open source project needs to have an acceptable license
+and be committed to the Emboss repository in a specific location.
+
+
+### How-To Guides
+
+This document covers the process of getting a change accepted: what you need
+to do, from proposal through review, to land your change in
+[the main Emboss repository](https://github.com/google/emboss/).
+
+[How to Implement Changes to Emboss](how-to-implement.md) has an overview of
+how to make code changes to Emboss.
+
+[How to Design Features for Emboss](how-to-design.md) has an overview of what
+to think about during your design.
+
+
+### Bug Fixes vs New Features
+
+The general process for bug fixes and new features is the same, but bug fixes
+usually require less design work, and therefore can go through lighter
+processes.
+
+
+## Very Small Changes
+
+If you have a tiny change (for example, a fix that does not change the design
+of `embossc`), you can jump directly into coding.
+
+This process works best if your change is small and not likely to be
+controversial, or if you have already completed [the steps for small
+changes](#small-changes).
+
+1.  [Code up your proposed change](how-to-implement.md) and open a [pull
+    request
+    (PR)](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request)
+    to [the main Emboss repository](https://github.com/google/emboss/).  If you
+    have not completed [the steps for small changes](#small-changes), this
+    gives the Emboss maintainers something more concrete to look at, but you
+    may end up doing more work if your initial proposal turns out to be the
+    wrong approach.
+
+2.  The Emboss maintainers will review your PR, and may request changes.  Don't
+    be discouraged if your PR is not immediately accepted — even PRs that
+    maintainers send to each other often have requests for changes!  We want
+    the Emboss code to be high-quality, which means helping you make your code
+    better.
+
+3.  Once your PR reaches a point where it is good enough, an Emboss maintainer
+    will merge it into the Emboss repository.
+
+
+## Small Changes
+
+If your change is small, but still requires some design work — for example,
+adding a new utility function in the C++ runtime library, or making a bug fix
+that involves re-structuring some of the `embossc` code — it is usually best to
+get some feedback before you start coding.
+
+1.  [File an issue on GitHub](https://github.com/google/emboss/issues/new), if
+    there is not an issue already.  It is best to use the *problem you want to
+    solve* for the issue title and description, and then propose your design in
+    a comment.
+
+2.  Once the Emboss maintainers have had a chance to review your proposal and
+    agree on the general outline, follow [the procedure for very small
+    changes](#very-small-changes).
+
+
+## Medium and Large Changes
+
+If you have a medium or large change — for example, introducing a new pass in
+`embossc`, adding a new data type, adding a new operator to the Emboss
+expression language, making a cross-cutting refactoring of `embossc`, etc. —
+then you should start by writing a *design sketch*.
+
+A design sketch is, basically, an informal design doc — it covers the topics
+that a design doc would cover, but may have open questions or alternatives that
+haven't been locked down.
+
+1.  [File an issue on GitHub](https://github.com/google/emboss/issues/new), if
+    there is not an issue already.  It is best to use the *problem you want to
+    solve* for the issue title and description.
+
+2.  Look at [existing design sketches](design_docs/) and [archived design docs
+    for changes that have already landed](design_docs/archive/) to get a feel
+    for what should be in a design sketch.
+
+3.  If you have not already done so, read [How to Design Features for
+    Emboss](how-to-design.md).
+
+4.  Write a draft design sketch for your change, and open a pull request
+    against [the main Emboss repository](https://github.com/google/emboss/).
+
+5.  It is very likely that your design sketch will need revision before it is
+    accepted.  If it does, do not be discouraged — we want your change to
+    succeed!
+
+6.  Once your design sketch has been accepted, you can move on to
+    implementation, following (more or less) the same procedure you would
+    follow for [small](#small-changes) or [very small
+    changes](#very-small-changes).  Depending on the complexity of the change,
+    you may need to split your implementation into multiple changes.
diff --git a/doc/how-to-design.md b/doc/how-to-design.md
new file mode 100644
index 0000000..04e28ea
--- /dev/null
+++ b/doc/how-to-design.md
@@ -0,0 +1,710 @@
+# Things to Think About When Designing Features for Emboss (An Incomplete List)
+
+Original Author:
+
+Ben Olmstead (aka reventlov, aka Dmitri Prime), original designer and author of
+Emboss
+
+
+# General Design Principles
+
+There are many, many books, articles, talks, classes, and exercises on good
+software design, and most general design principles apply to Emboss.  In this
+section, I will only cover the "most important" principles and those that I do
+not see highlighted in many other places.
+
+
+## Design to Real Problems, Not Hypotheticals
+
+In order to avoid "second system effect," designs that do not work in practice,
+and wasted effort, it is best to design to a specific problem — preferably a
+few instances of that problem, so that your design is more likely to solve a
+wide range of real world problems.
+
+For example, in Emboss if you wait until you have a specific data structure
+that is awkward or impossible to express, then try to find examples of other
+structures that are awkward in the same way, and then design a feature to
+handle those data structures, you are much more likely to come up with a
+solution that a) will actually be used, and b) will be used in more than one
+place.
+
+
+## Design to the Problem, Not the Solution
+
+Often, users will have a problem, think "I could solve this if I could do X,"
+and then ask for a feature for X without mentioning their original problem.  As
+a software designer, one of the first things you should do is try to figure out
+the original problem — usually by asking the user some probing questions — so
+that you can design to the problem, not to the user's solution.
+
+(Note that this is sometimes true even if you are the user: it is easy to get
+tunnel vision about a solution you came up with.  Sometimes you need to step
+back and try to find a different solution.)
+
+
+## Do Not Try to Do Everything
+
+Avoid the temptation to cover every possible use case, even if some of those
+would generally fit within the domain of your project.  A project like Emboss
+will attract extremely specific requests — requests whose solutions do not
+generalize.
+
+
+### Emboss is a "95% Solution"
+
+Instead of trying to cover every use case for every user, leave "escape
+hatches" in your design, so that users can use Emboss for the cases it covers,
+and integrate their own solutions in the places that Emboss does not cover.
+
+There will always be formats that Emboss cannot handle without becoming an
+actual programming language — even something as "basic" as compression is
+generally beyond what Emboss is meant to be capable of.
+
+
+## Be Conservative
+
+Emboss has strong backwards-compatibility guarantees: in particular, once a
+feature is "released," support for that feature is guaranteed more or less
+forever.  Because of this, new features should be narrow, even if there are
+"obvious" expansions, and even if narrowing the feature actually takes more
+code in the compiler.  You can always expand a feature later, but narrowing it
+or cutting it out would break Emboss's support guarantees.
+
+Although this principle is very standard for professional, publicly-released
+software, it may be a culture shock to developers who are used to
+"monorepo"[^mono] environments such as Google — it is not possible to just
+update all users in the real world!  Note that even many of Google's *open
+source* projects, such as Abseil, require their users to periodically update
+their code to the latest conventions, which imposes a cost on users of those
+projects.  Emboss is intended for smaller developers and embedded systems,
+which often do not have the resources for such migrations.
+
+[^mono]: In the several years that Emboss spent inside Google's monorepo it
+    underwent many large, backwards-incompatible changes that made the current
+    language significantly better.  Early incubation in a controlled
+    environment can be valuable for a new language!
+
+
+## Design for Later Expansion
+
+### Leave "Reserved Space" for Future Features
+
+Emboss uses `$` in many keyword names, but does not allow `$` to be used in
+user identifiers — this lets Emboss add `$` keywords without worrying about
+colliding with identifiers in existing code.  (This is in direct contrast to
+most programming languages, where introducing new keywords often breaks
+existing code.)
+
+As another example, Emboss disallows identifiers that collide with keywords in
+many programming languages — this gives room for Emboss to add back ends for
+those programming languages later, without having to figure out a convention
+for mangling identifiers that collide.  As a real-world counterexample,
+Protocol Buffers had to figure out a convention for handling field names that
+collide with C++ identifiers such as `class` — and `protoc` still generates
+broken C++ code if you have two fields named `class` and `class_` in the same
+`message`.
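+
+As a rough illustration of why reserving identifiers up front is easier than
+mangling them later, here is a minimal Python sketch.  The keyword list, the
+"append an underscore" rule, and the function names are all assumptions made
+for this example; this is not the actual code of `protoc` or `embossc`.
+
+```python
+# Hypothetical, abbreviated keyword list; a real back end would need the full
+# list for every target language.
+CPP_KEYWORDS = frozenset({"class", "struct", "namespace", "template"})
+
+
+def mangle(name):
+    """Naive mangling (assumed for illustration): append '_' to keywords."""
+    return name + "_" if name in CPP_KEYWORDS else name
+
+
+def find_collisions(field_names):
+    """Returns pairs of distinct field names that mangle to the same name."""
+    seen = {}  # Maps each mangled name to the original name that produced it.
+    collisions = []
+    for name in field_names:
+        mangled = mangle(name)
+        if mangled in seen:
+            collisions.append((seen[mangled], name))
+        else:
+            seen[mangled] = name
+    return collisions
+
+
+def is_allowed_identifier(name):
+    """The reserve-up-front alternative: reject the keyword outright."""
+    return name not in CPP_KEYWORDS
+```
+
+With these definitions, `find_collisions(["class", "class_"])` reports a
+collision, while a language that applies `is_allowed_identifier` when the
+definition is compiled never needs a mangling rule at all.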
+
+
+### Leave "Extension Points"
+
+An "extension point" is a place where someone should be able to hook into the
+system without changing the system.  This can be an API, a "hook," a defined
+data format, or something else entirely, but the defining factor is that it is
+a way to add new features or alter behavior without changing the existing
+software.
+
+In practice, many extension points won't "just work" until there are at least a
+few things using them, due to bugs or unexpected coupling, but in principle
+they should not require any modification.
+
+One extension point in the Emboss compiler is the full separation between front
+and back ends, so that future back ends (such as Rust, Protocol Buffers, PDF
+documentation, etc.) can be added without changing the overall design or
+(theoretically) any of the existing compiler.[^ext]
+
+[^ext]: This is not unique or original to Emboss: separate front and back ends
+    are totally standard in modern compiler design.
+
+In the physical world, an electrical outlet or a network port is an extension
+point — there is nothing there right now, but there is a defined place for
+something to be added later.
+
+
+### Leave "Lines of Cleavage"
+
+A "line of cleavage" is similar to an extension point, except that instead of
+being a ready-to-go place to add something new, it's a place where the major
+work was done, but there are still some pieces that need to be fixed up.
+
+A line of cleavage in the Emboss compiler is the use of a special `.emb` file
+(`prelude.emb`) to define "built-in" types, with the aim of eventually allowing
+end users to define their own types at the same level.  This feature still has
+open design decisions, such as:
+
+*   How will users define their type for the back end(s)?
+*   How will users define the range of an integer type for the expression
+    system?
+
+But these are relatively minor compared to the larger question of "how can
+Emboss allow end users to define their own basic types?"
+
+In software, lines of cleavage are usually invisible to end users, and can be
+difficult to see even for developers working on the code.
+
+In the physical world, an example of this is putting empty conduit into walls
+or ceilings: that way, new electrical or communication wires or pneumatic tubes
+can be pulled through the conduit and attached to new outlets, without having
+to open up *all* the walls.
+
+
+## Consider Known Potential Features
+
+Every complex software system has a cloud of potential features around it:
+features which, for one reason or another, have not been implemented yet, but
+which some stakeholder(s) want.  These features usually exist at every stage
+from "idle thought in a developer's mind" to "partially implemented, but not
+finished," and their chances of ever becoming finished features cover an
+equally wide range.
+
+When designing a new feature there are very good reasons to think about these
+potential features:
+
+First, you should ensure that your new feature does not make another
+highly-desirable feature impossible.  In Emboss, for example, if your new
+feature made it impossible to support a string type, that would be a very good
+reason to redesign your feature (or abandon it, if it is fundamentally
+incompatible).
+
+Second, sometimes you can tweak your design so that a potential feature becomes
+obsolete: fundamentally, every feature request exists to solve a problem, and
+often it is not the only way to solve that problem.  If you can solve it in a
+different way, you can make users happy and avoid some future work.  (Though be
+careful: it can be difficult to infer the full scope of a user's problem(s)
+from a feature request.)
+
+Third, thinking about specific potential features can help narrow the amount of
+"future design space" that you need to consider, which makes it easier to put
+extension points and lines of cleavage in your design in places where they will
+actually be used.
+
+
+# General Language Design Principles
+
+In contrast to general software design principles, there are far fewer sources
+on good *language* design.  I speculate that this is because there are far
+fewer language designers than software designers.  (There are tens of millions
+of software developers, but only tens of thousands of programming, markup, and
+data definition languages — and of those, maybe two thousand or so are
+"serious" languages with significant real-world use.)
+
+Luckily, there are many publicly available and documented languages to learn
+from directly.
+
+Language design can be very roughly divided into syntactic and semantic
+concerns: syntax is how the language *looks* (what symbols and keywords are
+used, and in what order), while semantics cover how the language *works* (what
+actually happens).  It might seem like semantics are more important, but syntax
+has a huge effect on how easy it is to understand existing code and to write
+correct code, which are both incredibly important in real-world use.
+
+In this section, I will try to outline language design principles that I have
+found or developed, particularly when they are useful for Emboss.
+
+
+## Be Mindful of the Power/Analysis Tradeoff
+
+[Turing-complete languages cannot be fully
+analyzed](https://en.wikipedia.org/wiki/Halting_problem).  This is one of the
+reasons that languages like HTML and CSS are not programming languages: the
+more expressive a language is, the more difficult it is to analyze.
+
+The `.emb` format is intended to be more on the declarative side, so that
+definitions can be analyzed and transformed as necessary.
+
+
+## Look at Other Languages
+
+Although Emboss is a data definition language (DDL), not a programming
+language, many lessons and principles from programming language design can be
+applied, as well as lessons from other DDLs, and sometimes even interface
+definition languages (IDLs), as well as markup and query languages.
+
+In particular, for Emboss it is often worth looking at:
+
+*   Popular programming languages: C, C++, Rust, JavaScript, TypeScript, C#,
+    Java, Go, Python 3, Swift, Objective-C, Lua.  "Systems" programming
+    languages such as C, C++, and Rust are usually the most relevant of these,
+    but it is useful to survey all the popular languages because many Emboss
+    users will be familiar with them.  Note that Lua is used for Wireshark
+    packet definitions.
+
+*   Selected "interesting" programming languages: Wuffs, Haskell, OCaml, Agda,
+    Coq.  These have some lessons for Emboss, especially its expression system
+    — in particular, they're all much more principled than "standard"
+    programming languages about how they handle types and values.  There are
+    many other programming languages that have interesting ideas (FORTH,
+    Prolog, D, Perl, Logo, Scratch, APL, so-called "esoteric" programming
+    languages), but they usually are not relevant to Emboss.
+
+*   DDLs: Kaitai Struct, Protocol Buffers, Cap'n Proto, SQL-DDL.  Kaitai Struct
+    is the closest of these to solving the same problem as Emboss (though it
+    has some fundamentally different design decisions which make it far worse
+    for embedded systems), but all have some lessons.  Some higher-level schema
+    languages like DTD, XML Schema, or JSON Schema tend to be less relevant to
+    Emboss.  Note that there are a number of DDLs that are also IDLs: in actual
+    use, some of them (Protocol Buffers) are used more often for their DDL
+    features, while others (XPIDL, COM) are used more for their IDL features.
+
+
+## Learn Academic Theory
+
+Many (most?) languages are designed by people who have minimal knowledge of the
+academic theories of how programming languages work.  For Emboss, Category
+Theory is particularly useful, and the computer science of parsers (especially
+LR(1) parsers) is useful for tweaking the parser generator or adding new
+syntax.
+
+This is a case where a little bit of learning goes a long way: you do not need
+to learn a *lot* about parsers or Category Theory to benefit from them.
+
+
+## Try to Acquire Practical Knowledge
+
+Many of the academic topics related to programming language design have
+corresponding industrial knowledge, and there are practical concerns that have
+very little to do with academic theory.
+
+The Emboss compiler is (loosely) based on the design of LLVM, with a series of
+transformation passes that operate somewhat independently, and independent back
+end code generators.[^designoops]
+
+[^designoops]: After many years of experience with this, I think that this is
+    not quite the right design for Emboss, and I would make two major changes:
+    first (and simplest), I would divide the current "front end" into a true
+    front end that only handled syntax and some types of syntax sugar, and a
+    "middle end" that handled all of the symbol resolution, bounds analysis,
+    constraint checking, etc.  Second, I would use a "compute-on-demand" (lazy
+    evaluation) approach in the middle end, which would allow certain
+    operations to be decoupled.  The LLVM design is more suited for independent
+    optimization passes, not for the kind of gradual annotation process in the
+    Emboss middle end.
+
+As another example, understanding how (and how well) Clang, GCC, and MSVC can
+optimize C++ code is crucial to generating high-performance code from Emboss
+(and Emboss leans very heavily on the C++ compiler to optimize its output).
+
+Some bits of practical knowledge are tiny little bits of almost-trivia.  For
+example, if you have C or C++ code in a (text) template, and you use `$` to
+indicate substitution variables (as in `$var` or `$var$`), then most editors
+and code formatters will treat your substitution variables as normal
+identifiers.  This is because almost every C and C++ compiler allows you to use
+`$` in identifiers, even though there has never been a C or C++ standard that
+allows those names, and it is rarely noted in any compiler, editor, or
+formatter's documentation.
+
+
+## Use Existing Syntax
+
+Emboss pulls many conventions from programming, data definition, and markup
+languages.  In general, if there is a feature in Emboss that works in a way
+that is the same as in other languages, it is best to pull syntax from
+elsewhere — ideally, pull in the most common syntax.  Many examples of this in
+Emboss are so common you might not even think about them:
+
+*   Arithmetic operators (`+`, `-`, `*`)
+*   Operator precedence (`*` binds more tightly than `+` and `-`, but also: see
+    the next section)
+
+Other examples are more specific, with no universal convention:
+
+*   `: Type` syntax for type annotation (TypeScript, Python, OCaml, Rust, ...)
+
+This is *especially* important for Emboss, because most people reading or
+writing Emboss code will not want to spend much time becoming an "Emboss
+expert" — where someone might be willing to spend days or weeks to learn how to
+write Rust code, they are more likely to spend hours or minutes learning to
+write Emboss.
+
+
+## Avoid Existing Syntax
+
+However, there are three main reasons to avoid using existing syntax:
+
+*   The "standard" syntax is error prone.  One example of this is operator
+    precedence in most programming languages: errors related to not knowing the
+    relative precedence of `&&` and `||` are so common that most compilers have
+    an option to warn if they are mixed without parentheses.  Emboss handles
+    this — and a few other error-prone constructs — by having a *partial
+    ordering* for precedence instead of the standard total ordering, and making
+    it a syntax error to mix operators such as `&&` and `||` that have
+    incomparable (neither equal, less than, nor greater than) precedence.  As
+    far as I can tell, this is a totally new innovation in Emboss: there is no
+    precedent (no pun intended) whatsoever for partial precedence order.
+
+    When avoiding syntax in this way, it is ideal to make the standard syntax
+    into a syntax error (so that no one can use it accidentally) and to add an
+    error message to the compiler that suggests the correct syntax.
+
+*   The existing syntax is not used consistently: if multiple programming
+    languages use the same syntax for slightly different semantics, it is
+    usually worth avoiding the syntax.  For example, `/` has quite a few
+    different semantics — in many languages, it is a type-parameterized
+    division, where the numeric result depends on the (static or dynamic) types
+    of its operands, and across languages, the "integer division" flavor is not
+    consistent — in most programming languages it is *truncating division* (`-7
+    / 3 == -2`), but in some programming languages it is *flooring division*
+    (`-7 / 3 == -3`).
+
+*   The semantics do not match: if an Emboss feature is *almost*, but *not
+    quite* equivalent to a feature in other languages, it is best to avoid
+    making the Emboss feature look like the other feature.
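+
+To make the idea of a *partial ordering* for precedence more concrete, here is
+a minimal Python sketch.  The operator table is hypothetical and chosen only
+for illustration (it is not the actual Emboss precedence table), and the
+function names are assumptions of this example.  Two operators may be mixed
+without parentheses only if they are comparable in the ordering.
+
+```python
+# Hypothetical precedence data for illustration; not Emboss's real table.
+# Each operator maps to the operators that it directly binds tighter than.
+BINDS_TIGHTER_THAN = {
+    "*": {"+", "-"},
+    "+": {"=="},
+    "-": {"=="},
+    "==": {"&&", "||"},
+    "&&": set(),
+    "||": set(),
+}
+# Operators that share a precedence level and may be mixed freely.
+SAME_LEVEL = [{"+", "-"}]
+
+
+def binds_tighter(a, b):
+    """True if `a` binds tighter than `b`, directly or transitively."""
+    stack = list(BINDS_TIGHTER_THAN[a])
+    visited = set()
+    while stack:
+        op = stack.pop()
+        if op == b:
+            return True
+        if op not in visited:
+            visited.add(op)
+            stack.extend(BINDS_TIGHTER_THAN[op])
+    return False
+
+
+def can_mix(a, b):
+    """True if `a` and `b` may appear together without parentheses."""
+    if a == b or any(a in level and b in level for level in SAME_LEVEL):
+        return True
+    return binds_tighter(a, b) or binds_tighter(b, a)
+```
+
+In this sketch, `can_mix("*", "&&")` is true because `*` is (transitively)
+tighter than `&&`, while `can_mix("&&", "||")` is false because neither
+operator is ordered relative to the other.  A parser built on such a table
+could report the second case as a syntax error and suggest parentheses.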
+
+
+## Poll Users/Programmers
+
+When designing a new feature, try to come up with several alternatives and poll
+Emboss users (or sometimes non-Emboss-using programmers) as to which one they
+prefer.
+
+For syntax, one especially powerful technique is to show an example of the
+proposed syntax to people who have never seen it, and ask "what do you think
+this means?" without any hinting or prompting.  This is the "gold standard" way
+of finding out whether your syntax is clear or not.
+
+
+## Avoid Error-Prone Constructs
+
+Computing now has roughly seventy years of experience with artificial languages
+(in programming, markup, data definition, query, etc. flavors), and we have
+learned a lot about what kinds of constructs are error-prone for humans to use.
+Avoid these, where possible!  Some examples include:
+
+*   Large semantic differences should not have small, easily-overlooked
+    syntactic differences.  For example, allowing single- and double-character
+    operators (`=` and `==`, `|` and `||`, etc.) in the same contexts: a
+    classic C-family programming error is to use `=` in a condition instead of
+    `==`.  Many modern languages either force `=` to be used only in "statement
+    context" (and some, like C#, also ban side-effectless statements such as `x
+    == y;`) or use a different operator like `:=` for assignment.  (Or both, as
+    in Python, which allows `:=` but not `=` for "expression assignment.")
+
+*   Syntax should have *consistent* semantic meaning.  For example, in
+    JavaScript these two snippets mean the same thing:
+
+    ```js
+    return f() + 10;
+    ```
+
+    ```js
+    return f() +
+              10;
+    ```
+
+    but this one is different (it returns `undefined`, thanks to JavaScript's
+    automatic `;` insertion):
+
+    ```js
+    return
+        f() + 10;
+    ```
+
+    A small difference in the placement of the line break leads to totally
+    different semantics!
+
+    C++ has a number of places where identical syntax can have wildly different
+    semantics, especially (ab)use of operator overloads and [the most vexing
+    parse](https://en.wikipedia.org/wiki/Most_vexing_parse).
+
+*   Hoare calls "null" his "billion-dollar mistake," and the way that null
+    pointers are handled in most programming languages, especially C and C++,
+    is particularly error-prone.  (But note that it isn't really "null" itself
+    that is problematic: it's that there is no way to mark a pointer as "not
+    null," and that dereferencing a null pointer leads to undefined
+    behavior.  However, some popular language features, such as the `?.`
+    operator found in several programming languages and the `std::optional<>`
+    type in C++, show that there is some utility to nullable types, as long as
+    there is language support for enforcing null checks and/or allowing null to
+    propagate in the same way that NaN can.)
+
+*   Edge cases, such as integer overflow, are difficult for humans to reason
+    about.  In systems programming languages like C and C++, this leads to a
+    significant percentage of security flaws.  (C and C++ compilers use the
+    "signed integer overflow is undefined" rule *extensively* in optimization,
+    so there are pragmatic trade-offs in general.  Emboss is used in smaller
+    contexts with tighter safety guarantees.)
+
+
+# Emboss-Specific Considerations
+
+Emboss sits in a section of design space that has very few alternatives, and as
+a result there are things to think about when designing Emboss features that do
+not apply to many other languages.
+
+Also, because Emboss already exists, there are a number of systems within
+Emboss-the-language that may interact with new features.
+
+And finally, if you want your feature to be implemented, it is necessary to
+consider how difficult it would be to implement in `embossc`.
+
+
+## Survey Data Formats
+
+Maybe the least fun (at least for me[^unfun]) part of designing Emboss features
+is reading through data sheets, programming manuals, RFCs, and user guides to
+understand the data formats used in the real world, so that any new feature can
+handle a reasonable subset of those formats.  Some sources to consider:
+
+*   Data sheets and programming manuals for:
+    *   complex sensors, such as LiDAR
+    *   GPS receivers
+    *   servos
+    *   LED panels and segmented displays
+    *   clock hardware
+    *   ADCs and DACs
+    *   camera sensors
+    *   power control devices
+    *   simple sensors such as barometers, hygrometers, current sensors,
+        voltage sensors, light sensors, etc. (though many very simple sensors
+        use analog outputs or very, very simple digital outputs that do not
+        have a "protocol" as such)
+*   RFCs for low-level protocols such as Ethernet, IP, ICMP, UDP, TCP, and ARP
+
+<!-- TODO: assemble a list of links to actual examples -->
+
+[^unfun]: One of my original motivations for creating Emboss is that I find
+    reading data sheets and implementing code to read/write the data formats
+    therein to be extremely tedious.
+
+
+## Structure Layout System
+
+The "heart" of Emboss is what may be called the "structure layout system:" the
+engine that determines which bits to read and write in order to produce or
+commit the values of fields.  When designing, consider:
+
+*   Does this feature require reaching "outside" of a scope?  For example,
+    referencing a sibling field from within a field's scope is currently
+    impossible, because each field has its own scope.  Allowing `[requires:
+    this == sibling]` means expanding that scope.
+
+*   Does this feature require information that is not (currently) available to
+    the layout engine, or not available at the right place or time?  For
+    example, if you are designing a feature to allow field sizes to be `$auto`,
+    how does that interact with structures that are variable size?
+
+*   Does this feature require information that is potentially circular, or
+    would it interact with another potential feature to require circular
+    information, and is there a way to resolve that?  For example: if you are
+    designing a feature to allow field sizes to be `$auto`, inferring their
+    size from their type, how will that interact with the potential feature to
+    allow `struct`s that grow to the size they are given?
+
+
+## Expression System
+
+Although most expressions in Emboss definitions are simple (such as `x*4` or
+even just `0`), the expression system in Emboss tracks a lot of information,
+such as:
+
+*   What is the type of every subexpression (e.g., integer, specific
+    enumeration, opaque, etc.)?
+*   For integer and boolean expressions, does the expression evaluate to a
+    fixed (constant) value?
+*   For integer expressions, what are the upper and lower bounds of the
+    expression?  (Used for determining the correct integer types to use in
+    generated code.)
+*   For integer expressions, is the value guaranteed to be equal to some fixed
+    value modulo some constant?  (Used for generating faster code for aligned
+    memory access.)
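+
+As a rough sketch of what bounds tracking involves, the following Python
+example propagates inclusive integer bounds through `+` and `*` and then picks
+a storage width.  The `Bounds` class, the helper names, and the list of
+candidate widths are assumptions made for this illustration; the real
+implementation in `embossc` may differ substantially.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Bounds:
+    """Inclusive lower and upper bounds of an integer expression."""
+    lo: int
+    hi: int
+
+    def __add__(self, other):
+        return Bounds(self.lo + other.lo, self.hi + other.hi)
+
+    def __mul__(self, other):
+        # The extremes of a product occur at the corners of the two ranges.
+        corners = [a * b for a in (self.lo, self.hi)
+                   for b in (other.lo, other.hi)]
+        return Bounds(min(corners), max(corners))
+
+
+def constant(value):
+    """Bounds of a constant expression: a single-value range."""
+    return Bounds(value, value)
+
+
+def smallest_unsigned_width(bounds, widths=(8, 16, 32, 64)):
+    """Returns the narrowest listed width that holds every value, or None."""
+    if bounds.lo < 0:
+        return None
+    for width in widths:
+        if bounds.hi < 2 ** width:
+            return width
+    return None
+```
+
+For a field `x` with `Bounds(0, 255)`, the expression `x * constant(4) +
+constant(1)` has `Bounds(1, 1021)`, so it no longer fits in 8 bits and the
+sketch selects a 16-bit type.  This also shows how a change that widens
+inferred bounds could change the types chosen for existing code.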
+
+When designing a feature, consider:
+
+*   Will any new types be `opaque` to the expression system, or will it be
+    possible to perform operations on them?  If they are `opaque` for now, will
+    they stay that way, or will it be possible to manipulate them in the
+    future?  For example, adding a string type in Emboss might start as
+    `opaque`, but allow operations like "value at index" or "substring" in the
+    future.
+*   When adding new operations, how will they interact with the bounds and
+    alignment tracking?  For example: truncating division often breaks
+    alignment tracking, whereas flooring division does not.
+*   Will this feature invalidate existing code?  Anything that causes the
+    inferred integer bounds of existing code to expand can break existing code.
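+
+To see why the rounding mode matters for alignment tracking, consider values
+that are known to be 1 modulo 8, divided by 4.  The Python sketch below is only
+a demonstration of the arithmetic, not Emboss code; `trunc_div` is a helper
+written for this example, since Python's own `//` operator uses flooring
+division.
+
+```python
+def trunc_div(a, b):
+    """Integer division that rounds toward zero, as in C++."""
+    quotient = abs(a) // abs(b)
+    return quotient if (a < 0) == (b < 0) else -quotient
+
+
+# Every x here satisfies x % 8 == 1 (Python's % uses flooring semantics).
+values = [8 * k + 1 for k in range(-3, 4)]
+
+# Collect the residues modulo 2 of the quotients under each rounding mode.
+floor_residues = {(x // 4) % 2 for x in values}
+trunc_residues = {trunc_div(x, 4) % 2 for x in values}
+```
+
+With flooring division every quotient is even, so `floor_residues` contains
+only 0 and the result is still known modulo 2.  With truncating division the
+negative inputs produce odd quotients, so `trunc_residues` contains both 0 and
+1, and nothing useful can be said about the alignment of the result.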
+
+Note that the entire point of Emboss is to provide a bridge between physical
+data layout (as defined in the structure layout system) and abstract values
+with no specific representation (as exposed through the expression system).
+
+
+## Parsing
+
+Any new syntax has to be added to the parser.  Aside from the language design
+considerations for new syntax (see the ["General Language Design Principles"
+section](#general-language-design-principles)), there are a few levels of
+concern for the actual implementation:
+
+*   Is it computationally feasible to parse this syntax in an intuitive,
+    unambiguous way?
+*   Is it humanly feasible to express this syntax as an LR(1) grammar that can
+    be parsed by Emboss's shift-reduce parser engine?
+*   Is it feasible to parse this syntax using a different parsing engine type
+    (Earley, recursive descent, TDOP, parser combinator, etc.)?
+
+The first consideration is more of a general language design consideration: if
+your language design says "users will be able to specify their program in
+English," that is not really feasible (or unambiguous).  (Not that it hasn't
+been tried, many times.)
+
+The second consideration — can you add this syntax to `embossc`? — is the most
+practical and important consideration for Emboss.  LR(1) grammars are pretty
+restrictive (though shift-reduce parsers have advantages — there are reasons
+Emboss is using one), and even when it is *possible* to express a particular
+syntactic construct in LR(1)[^zimm], it may be difficult for most programmers to
+actually do so.  As a practical matter, I recommend trying to actually add your
+syntax to `module_ir.py`.
+
+[^zimm]: I (Ben Olmstead) think it would be awesome to implement [[Zimmerman,
+    2022](https://arxiv.org/abs/2209.08383)] plus a few extensions of my own
+    devising in Emboss's shift-reduce engine, which would make the grammar
+    design space significantly larger.  I would also separate the parser
+    generator engine into its own project.
+
+The third consideration is more future-focused and abstract: does this syntax
+lock Emboss into using a shift-reduce parser in the future?  Ideally, no.
+Luckily(?), LR(1) grammars are one of the more restrictive types of grammars in
+common use, so it is likely that anything that can be handled by the current
+parser can be handled by many other types of parsers.
+
+
+## Generated Code
+
+Right now, there is only the generated C++ code, but there should be other back
+ends in the future.  Some new features are pure syntax sugar (e.g., `$next` or
+`a < b < c`) that are replaced in the IR long before it reaches the back end
+(e.g., with the offset+length of the syntactically-previous field, or the IR
+equivalent of `a < b && b < c`), while others require extensive changes to how
+code is generated.
+
+*   What information will the back end need in order to generate working code?
+*   Does this feature require embedded-unfriendly generated code?  (E.g.,
+    memory allocation, I/O.)
+*   Can the existing C++ back end, which just walks the IR tree in a single
+    pass while building up strings which are combined into a `.h`, handle this
+    feature in its current design?
+*   How will this feature interact with various generated templates?
+*   Can/should this feature be, itself, templated?
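+
+The `a < b < c` rewrite mentioned above can be sketched in a few lines of
+Python.  The tuple-based IR and the `desugar_chain` function are hypothetical,
+invented for this example, and are not taken from the Emboss source; the point
+is only that the back end never needs to see the chained form.
+
+```python
+def desugar_chain(operands, operators):
+    """Rewrites a comparison chain into a conjunction of binary comparisons.
+
+    For example, `a < b <= c` becomes the equivalent of `a < b && b <= c`,
+    represented here as nested (operator, left, right) tuples.
+    """
+    if not operators or len(operands) != len(operators) + 1:
+        raise ValueError("a chain needs N operators and N + 1 operands")
+    # Pair each operator with the operands on either side of it.
+    comparisons = [
+        (op, left, right)
+        for op, left, right in zip(operators, operands, operands[1:])
+    ]
+    # Fold the pairwise comparisons together with logical AND.
+    result = comparisons[0]
+    for comparison in comparisons[1:]:
+        result = ("&&", result, comparison)
+    return result
+
+
+chained = desugar_chain(["a", "b", "c"], ["<", "<"])
+```
+
+A real implementation would also need to decide how source locations and error
+messages refer back to the original, unexpanded syntax.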
+
+
+## C++ Runtime Library
+
+The runtime library will be included with every program that touches Emboss, so
+it is important to make it efficient.  When adding features, consider:
+
+*   Can the feature be added in such a way that it does not cost anything for
+    programs that do not use the feature?  A standalone C++ template will not
+    be included in a program unless the program instantiates the template, but
+    if the new code is used from somewhere in an existing function, it may be
+    included in programs that do not use it directly.
+
+*   Can the feature be added without allocating any heap memory?  Can it be
+    added with O(1) stack memory use?  Both of these are important for some
+    embedded systems, such as OS-less microcontroller and hard-real-time
+    environments.  Some features may intrinsically require memory allocation,
+    in which case it is best if they can be separated: for example, Emboss
+    structure-to-string conversion requires allocation, and even `#include`'ing
+    the appropriate headers can be too much for some environments, even if the
+    serialization code is never included in the final binary.
+
+*   How much can you rely on the C++ compiler to optimize things?  If you have
+    to implement your own optimizations, that will cost more development time
+    and add more complexity to the runtime library.
+
+
+## Compiler Complexity
+
+The Emboss compiler is already quite complex, and has many subsystems that
+interact.  It is already quite difficult to reason about some interactions.
+
+*   Can the feature be added at an "edge" of the compiler?  For example, if you
+    can implement your feature as syntax sugar that converts the new feature to
+    existing IR early in the compilation process, it is much easier to verify
+    that it will not cause problematic interactions.  Similarly, if you can
+    implement your feature entirely in the back end or in the runtime library,
+    you do not need to worry about interactions inside the front end.
+
+*   If a feature cannot be added at an edge, how can you design it to minimize
+    the complexity?  (Ideally, you could even unify existing systems in such a
+    way that the overall complexity of the compiler is lower at the end.)
+
+
+## Future Back Ends
+
+It is important to have some idea of how any feature would be implemented
+against future back ends.
+
+
+### Programming Language (Rust/Python/Java/Go/C#/Lua/etc) Back Ends
+
+Some features may be difficult to implement in other languages.  For example,
+Python does not have a native `switch` statement, so any `switch`-like feature
+in Emboss may be awkward to implement — but this does not necessarily mean that
+Emboss should not have a `switch`.
+
+As a rule of thumb, languages can be grouped into tiers:
+
+1.  "Systems"/embedded-friendly languages: C++, Rust, C.  Top support.
+2.  Languages used for parsing/analyzing raw sensor dumps: C#, Java, Go,
+    Python, etc.  Should have good support, but not gate any features.
+3.  Languages that are rarely used to touch binary data: JavaScript,
+    TypeScript, etc.  Can be mostly ignored.
+4.  Dead and obscure languages: Perl, COBOL, APL, INTERCAL, etc.  Can be
+    ignored entirely.
+
+(It may be difficult to classify some languages, such as FORTRAN, which is
+still hanging around in 2024.)
+
+Remember that other back ends may have different requirements and guarantees
+than the C++ back end: for example, it would be unreasonable for a Java back
+end to promise "no dynamic memory allocation."
+
+
+### Other Data Format (Protobuf/JSON/etc) Back Ends
+
+These back ends would translate binary structures into alternate
+representations that are easier for some tools to use: for example, Google has
+many, many tools for processing Protocol Buffers, and JSON is popular in the
+open-source world.
+
+Most other formats have limitations that may make some kinds of Emboss
+constructs difficult or impossible to correctly reproduce: for example, Emboss
+already supports "infinitely nested" `struct` types, like:
+
+```
+struct Foo:
+    0 [+10]  Foo  child_foo
+```
+
+Formats like Protobuf or JSON, which do not have any way of representing loops
+in their data graph, cannot handle this.
+
+Until the most recent versions of Protobuf, mismatches between Protobuf `enum`
+and Emboss `enum` made it functionally impossible to map any Emboss `enum`
+types onto Protobuf `enum` types: Emboss `enum` types are open (allow any
+value, even ones that are not listed in the `enum`), whereas all Protobuf
+`enum` types were closed (only allowed known values).  (The most recent
+Protobuf versions, Proto3 and Editions, allow you to have open `enum` types.)
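+
+The same open/closed mismatch can appear wherever the target only has closed
+enumerations.  As a sketch, Python's standard `enum.IntEnum` is closed
+(constructing it from an unlisted value raises `ValueError`), so a translation
+layer has to decide what to do with unknown values.  The `MessageKind` type and
+the `decode_open` helper below are hypothetical, invented for this example;
+falling back to the raw integer is just one possible policy.
+
+```python
+import enum
+
+
+class MessageKind(enum.IntEnum):
+    """Hypothetical closed enumeration in the target representation."""
+
+    REQUEST = 0
+    RESPONSE = 1
+
+
+def decode_open(enum_type, raw_value):
+    """Returns a named enum member if one exists, else the raw integer."""
+    try:
+        return enum_type(raw_value)
+    except ValueError:
+        # The value is legal in an open enum, but has no name here.
+        return raw_value
+
+
+known = decode_open(MessageKind, 1)
+unknown = decode_open(MessageKind, 7)
+```
+
+Here `known` is `MessageKind.RESPONSE`, while `unknown` is the plain integer 7,
+which preserves the value at the cost of a less uniform type.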
+
+Generally, it is not worth blocking an Emboss feature because of these kinds of
+mismatches, but it is worth thinking about how to avoid them, if possible.
+
+
+### Documentation (PDF/Markdown/etc) Back Ends
+
+These back ends would translate `.emb` files to a form of human-readable
+documentation, intended for publication on a web site, in an RFC, or as part of
+a PDF datasheet.  This type of back end is the motivation for having both `--`
+documentation blocks and `#` comments in Emboss.
+
+Since the output from these back ends would be intended for human consumption,
+for the most part you would only need to ensure that your feature can be
+understood by humans.
diff --git a/doc/how-to-implement.md b/doc/how-to-implement.md
new file mode 100644
index 0000000..452d2d1
--- /dev/null
+++ b/doc/how-to-implement.md
@@ -0,0 +1,203 @@
+# How to Implement Changes to Emboss
+
+<!-- TODO(bolms): write and link to guides on the `embossc` design -->
+
+## Getting the Code
+
+The master Emboss repository lives at https://github.com/google/emboss — you
+can `git clone` that repository directly, or make [a fork on
+GitHub](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/about-forks)
+and then `git clone` your fork.
+
+
+## Prerequisites
+
+In order to run Emboss, you will need [Python](https://www.python.org/).
+Emboss supports all versions of Python that are [still supported by the
+(C)Python developers](https://devguide.python.org/versions/), but versions
+older than that generally will not work.
+
+The Emboss tests run under [Bazel](https://bazel.build/).  In order to run the
+tests, you will need to [install Bazel](https://bazel.build/start) on your
+system.
+
+
+## Running Tests
+
+Emboss has a reasonably extensive test suite.  In order to run the test suite,
+`cd` into the top `emboss` directory, and run:
+
+```sh
+bazel test ...
+```
+
+Bazel will download the necessary prerequisites, compile the (C++) code, and
+run all the tests.  Tests will take a moment to compile and run; Bazel will
+show running status, like this:
+
+```
+Starting local Bazel server and connecting to it...
+[502 / 782] 22 actions, 21 running
+    Compiling runtime/cpp/test/emboss_memory_util_test.cc; 10s linux-sandbox
+    Compiling runtime/cpp/test/emboss_memory_util_test.cc; 10s linux-sandbox
+    Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/emboss_front_end.runfiles; 2s local
+    Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/synthetics_test.runfiles; 1s local
+    Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/make_parser_test.runfiles; 1s local
+    Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/back_end/cpp/header_generator_test.runfiles; 1s local
+    Compiling absl/strings/str_join.h; 0s linux-sandbox
+    Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/generate_cached_parser.runfiles; 0s local ...
+```
+
+You may see a few `WARNING` messages; these are generally harmless.
+
+Once Bazel finishes running tests, you should see a list of all tests and their
+status (all should be `PASSED` if you just cloned the main Emboss repo):
+
+```
+Starting local Bazel server and connecting to it...
+INFO: Analyzed 226 targets (98 packages loaded, 4080 targets configured).
+INFO: Found 116 targets and 110 test targets...
+INFO: Elapsed time: 65.577s, Critical Path: 22.22s
+INFO: 862 processes: 372 internal, 490 linux-sandbox.
+INFO: Build completed successfully, 862 total actions
+//compiler/back_end/cpp:alignments_test                                  PASSED in 0.2s
+//compiler/back_end/cpp:alignments_test_no_opts                          PASSED in 0.1s
+//compiler/back_end/cpp:anonymous_bits_test                              PASSED in 0.2s
+//compiler/back_end/cpp:anonymous_bits_test_no_opts                      PASSED in 0.1s
+[... many more tests ...]
+//runtime/cpp/test:emboss_prelude_test                                   PASSED in 0.2s
+//runtime/cpp/test:emboss_prelude_test_no_opts                           PASSED in 0.2s
+//runtime/cpp/test:emboss_text_util_test                                 PASSED in 0.2s
+//runtime/cpp/test:emboss_text_util_test_no_opts                         PASSED in 0.2s
+
+Executed 110 out of 110 tests: 110 tests pass.
+```
+
+If a test fails, you will see lines at the end like:
+
+```
+//compiler/back_end/cpp:alignments_test                                  FAILED in 0.0s
+  /usr/local/home/bolms/.cache/bazel/_bazel_bolms/444a471ee8e028e0535394d088883276/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test/test.log
+
+Executed 110 out of 110 tests: 109 tests pass and 1 fails locally.
+```
+
+You can read the `test.log` file to find out where the failure occurred.
+
+Note that each C++ test actually runs multiple times with different Emboss
+`#define` options, so a single failure may cause multiple Bazel tests to fail:
+
+```
+//compiler/back_end/cpp:alignments_test                                  FAILED in 0.0s
+  /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test/test.log
+//compiler/back_end/cpp:alignments_test_no_checks                        FAILED in 0.0s
+  /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test_no_checks/test.log
+//compiler/back_end/cpp:alignments_test_no_checks_no_opts                FAILED in 0.0s
+  /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test_no_checks_no_opts/test.log
+//compiler/back_end/cpp:alignments_test_no_opts                          FAILED in 0.0s
+  /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test_no_opts/test.log
+
+Executed 168 out of 168 tests: 164 tests pass and 4 fail locally.
+```
+
+(The Emboss repository goes one step further and runs each of *those* tests
+under multiple compilers and optimization options.)
+
+If you are working on fixing a failure in one particular test, you can tell
+Bazel to run just that test by specifying the name of the test on the command
+line:
+
+```sh
+bazel test //compiler/back_end/cpp:alignments_test
+```
+
+This can be quicker than re-running the entire test suite.
+
+
+### `docs_are_up_to_date_test`
+
+If you are making changes to the Emboss grammar, you can ignore failures in
+`docs_are_up_to_date_test` until you have your updated grammar finalized: that
+test ensures that certain generated documentation files are up to date when
+code reaches the main repository.  See [Checked-In Generated
+Code](#checked-in-generated-code), below.
+
+
+## Implementing a Feature
+
+The Emboss compiler is under [`compiler/`](../compiler/), with
+[`front_end/`](../compiler/front_end/), [`back_end/`](../compiler/back_end/),
+and [`util/`](../compiler/util/) directories for the front end, back end, and
+shared utilities, respectively.
+
+The C++ runtime library is under [`runtime/cpp/`](../runtime/cpp).
+
+
+### Coding Style
+
+For Python, Emboss uses the default style of
+the [Black](https://black.readthedocs.io/en/stable/) code formatter.[^genfile]
+
+[^genfile]: There is one, very large, generated `.py` file checked into the
+    Emboss repository that is intentionally excluded from code formatting —
+    both because it can hang the formatter and because the formatted version
+    takes noticeably longer for CPython to load.
+
+For C++, Emboss uses the `--style=Google` preset of
+[ClangFormat](https://clang.llvm.org/docs/ClangFormat.html).
+
+
+## Writing Tests
+
+Most code changes require tests: bug fixes should have at least one test that
+fails before the bug fix and passes after the fix, and new features should have
+many tests that cover all aspects of how the feature might be used.
+
+
+### Python
+
+The Emboss Python tests use the Python
+[`unittest`](https://docs.python.org/3/library/unittest.html) module.  Most
+[tests of the Emboss front end](../compiler/front_end/) are structured as:
+
+1.  Run a small `.emb` file through the front end, stopping immediately before
+    the step under test, and hold on to the resulting IR (intermediate
+    representation).
+2.  Run the step under test on that IR, making sure that there are either no
+    errors, or that the errors are expected.
+3.  For the "no errors" tests, check various properties of the resulting IR to
+    ensure that the step under test did what it was supposed to.
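+
+The three steps above might look roughly like the following in a `unittest`
+test case.  This is only a sketch: `run_front_end_before_step` and
+`check_unique_names` are hypothetical stand-ins written for this example, not
+functions from the Emboss front end, and the dictionary-based IR is likewise
+invented.
+
+```python
+import unittest
+
+
+def run_front_end_before_step(emb_text):
+    """Hypothetical stand-in for the front end passes before the tested step."""
+    fields = [line.split() for line in emb_text.strip().splitlines()]
+    return {"fields": fields}
+
+
+def check_unique_names(ir):
+    """Hypothetical step under test: returns errors, or annotates the IR."""
+    errors = []
+    seen = set()
+    for field in ir["fields"]:
+        name = field[-1]
+        if name in seen:
+            errors.append(f"duplicate field name: {name}")
+        seen.add(name)
+    if not errors:
+        ir["names_checked"] = True
+    return errors
+
+
+class CheckUniqueNamesTest(unittest.TestCase):
+    def test_accepts_unique_names(self):
+        # Step 1: build the IR up to, but not including, the step under test.
+        ir = run_front_end_before_step("UInt a\nUInt b")
+        # Step 2: run the step and check that it reports no errors.
+        self.assertEqual([], check_unique_names(ir))
+        # Step 3: check properties of the resulting IR.
+        self.assertTrue(ir["names_checked"])
+
+    def test_reports_duplicate_names(self):
+        ir = run_front_end_before_step("UInt a\nUInt a")
+        self.assertEqual(["duplicate field name: a"], check_unique_names(ir))
+```
+
+A test file like this would normally end with a call to `unittest.main()` so
+that it can be run directly.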
+
+
+### C++
+
+The Emboss C++ tests use [the GoogleTest
+framework](https://google.github.io/googletest/).
+
+[Pure runtime tests](../runtime/cpp/test) `#include` the C++ runtime library
+headers and manually instantiate them, then test various properties.
+
+[Generated code tests](../compiler/back_end/cpp/testcode/), which incidentally
+test the runtime library as well, work by using a header generated from a [test
+`.emb`](../testdata/) and interacting with the generated Emboss code the way
+that a user might do so.
+
+
+## Writing Documentation
+
+If you are adding a feature to Emboss, make sure to update [the
+documentation](../doc/).  In particular, the [language
+reference](../doc/language-reference.md) and the [C++ code
+reference](cpp-reference.md) are very likely to need to be updated.
+
+
+## Checked-In Generated Code
+
+There are several checked-in generated files in the Emboss source repository.
+As a general rule, this is not a best practice, but it is necessary in order to
+achieve the "zero installation" use of the Emboss compiler, where an end user
+can simply `git clone` the repository and run the `embossc` executable directly
+— even if the cloned repository lives on a read-only filesystem.
+
+In order to minimize the chances of any of those files becoming stale, each one
+has a unit test that checks that the file in the Emboss directory matches what
+its generator would currently generate.
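+
+The idea behind such a test can be sketched as follows.  The names here
+(`generate_grammar_doc`, `is_up_to_date`, and the `grammar.md` file) are
+hypothetical and chosen for illustration; the actual tests in the Emboss
+repository may be structured differently.
+
+```python
+import pathlib
+import tempfile
+
+
+def generate_grammar_doc():
+    """Hypothetical generator whose output is checked into the repository."""
+    productions = ["expression -> term", "term -> number"]
+    return "# Grammar\n\n" + "\n".join(productions) + "\n"
+
+
+def is_up_to_date(checked_in_path, generator):
+    """Returns True if the checked-in file matches fresh generator output."""
+    checked_in = pathlib.Path(checked_in_path).read_text(encoding="utf-8")
+    return checked_in == generator()
+
+
+# Demonstration, with a temporary directory standing in for the repository.
+with tempfile.TemporaryDirectory() as repo:
+    doc = pathlib.Path(repo) / "grammar.md"
+    doc.write_text(generate_grammar_doc(), encoding="utf-8")
+    fresh_ok = is_up_to_date(doc, generate_grammar_doc)
+    doc.write_text("# Grammar\n\nstale contents\n", encoding="utf-8")
+    stale_ok = is_up_to_date(doc, generate_grammar_doc)
+```
+
+In this sketch `fresh_ok` is true and `stale_ok` is false; a unit test built on
+the same comparison fails as soon as the generator and the checked-in file
+disagree.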
diff --git a/doc/index.md b/doc/index.md
index 6c383e8..7cfb5d6 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -16,3 +16,8 @@ Details of the textual representation Emboss uses for structures can be found in
 the [Emboss Text Format Reference](text-format.md).
 
 There is a tentative [roadmap of future development](roadmap.md).
+
+If you are interested in contributing to Emboss, please read [Contributing to
+Emboss](contributing.md), and you may wish to read [How to Design Features for
+Emboss](how-to-design.md) and [How to Implement Changes to
+Emboss](how-to-implement.md).