From bb45c188e13741b64a33b77ea5cb27372d1a846e Mon Sep 17 00:00:00 2001 From: Dmitri Prime Date: Wed, 18 Dec 2024 15:09:31 -0800 Subject: [PATCH] Add docs for would-be contributors. (#213) --- doc/contributing.md | 130 ++++++++ doc/how-to-design.md | 710 ++++++++++++++++++++++++++++++++++++++++ doc/how-to-implement.md | 203 ++++++++++++ doc/index.md | 5 + 4 files changed, 1048 insertions(+) create mode 100644 doc/contributing.md create mode 100644 doc/how-to-design.md create mode 100644 doc/how-to-implement.md diff --git a/doc/contributing.md b/doc/contributing.md new file mode 100644 index 0000000..f6146a4 --- /dev/null +++ b/doc/contributing.md @@ -0,0 +1,130 @@ +# Contributing to Emboss + +If you would like to fix a bug or add a new feature to Emboss, great! This +document is intended to help you understand the procedure, so that your change +can land in the main Emboss repository. + +You do not have to take a change all the way from start to finish, either: if +you can get a design approved, then someone else can implement it much more +easily. Conversely, if you are looking for a way to help, you might look for +existing [feature +requests](https://github.com/google/emboss/labels/enhancement) that have +designs or at [open design sketches](design_docs/) that you might be able to +implement. + + +## All Changes + +Because Emboss is a Google project, in order to submit code you will need to +sign a [Google Contributor License Agreement +(CLA)](https://cla.developers.google.com/). + +**IMPORTANT**: if your contribution includes code that is not covered by a +Google CLA and is not owned by Google, the Emboss project has to follow special +procedures to include it. Please let us know ([filing an issue on +GitHub](https://github.com/google/emboss/issues/new) is probably the easiest +way) so that we can walk you through the process. 
In particular, we generally +cannot accept any code from StackExchange or similar sites, and any code that +comes from a non-Google open source project needs to have an acceptable license +and be committed to the Emboss repository in a specific location. + + +### How-To Guides + +This document covers the process of getting a change into the main Emboss +repository — i.e., what you need to do to get your change into +[the main Emboss repository](https://github.com/google/emboss/). + +[How to Implement Changes to Emboss](how-to-implement.md) has an overview of +how to make code changes to Emboss. + +[How to Design Features for Emboss](how-to-design.md) has an overview of what +to think about during your design. + + +### Bug Fixes vs New Features + +The general process for bug fixes and new features is the same, but bug fixes +usually require less design work, and therefore can go through lighter +processes. + + +## Very Small Changes + +If you have a tiny change — for example, making a fix that does not change the +design of `embossc` — you can jump directly into coding. + +This process works best if your change is small and not likely to be +controversial, or if you have already completed [the steps for small +changes](#small-changes). + +1. [Code up your proposed change](how-to-implement.md) and open a [pull + request + (PR)](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) + to [the main Emboss repository](https://github.com/google/emboss/). If you + have not completed [the steps for small changes](#small-changes), this + gives the Emboss maintainers something more concrete to look at, but you + may end up doing more work if your initial proposal turns out to be the + wrong approach. + +2. The Emboss maintainers will review your PR, and may request changes. 
Don't + be discouraged if your PR is not immediately accepted — even PRs that + maintainers send to each other often have requests for changes! We want + the Emboss code to be high-quality, which means helping you make your code + better. + +3. Once your PR reaches a point where it is good enough, an Emboss maintainer + will merge it into the Emboss repository. + + +## Small Changes + +If your change is small, but still requires some design work — for example, +adding a new utility function in the C++ runtime library, or making a bug fix +that involves re-structuring some of the `embossc` code — it is usually best to +get some feedback before you start coding. + +1. [File an issue on GitHub](https://github.com/google/emboss/issues/new), if + there is not an issue already. It is best to use the *problem you want to + solve* for the issue title and description, and then propose your design in + a comment. + +2. Once the Emboss maintainers have had a chance to review your proposal and + agree on the general outline, follow [the procedure for very small + changes](#very-small-changes). + + +## Medium and Large Changes + +If you have a medium or large change — for example, introducing a new pass in +`embossc`, adding a new data type, adding a new operator to the Emboss +expression language, making a cross-cutting refactoring of `embossc`, etc. — +then you should start by writing a *design sketch*. + +A design sketch is, basically, an informal design doc — it covers the topics +that a design doc would cover, but may have open questions or alternatives that +haven't been locked down. + +1. [File an issue on GitHub](https://github.com/google/emboss/issues/new), if + there is not an issue already. It is best to use the *problem you want to + solve* for the issue title and description. + +2. 
Look at [existing design sketches](design_docs/) and [archived design docs + for changes that have already landed](design_docs/archive/) to get a feel + for what should be in a design sketch. + +3. If you have not already done so, read [How to Design Features for + Emboss](how-to-design.md). + +4. Write a draft design sketch for your change, and open a pull request + against [the main Emboss repository](https://github.com/google/emboss/). + +5. It is very likely that your design sketch will need revision before it is + accepted. If it does, do not be discouraged — we want your change to + succeed! + +6. Once your design sketch has been accepted, you can move on to + implementation, following (more or less) the same procedure you would + follow for [small](#small-changes) or [very small + changes](#very-small-changes). Depending on the complexity of the change, + you may need to split your implementation into multiple changes. diff --git a/doc/how-to-design.md b/doc/how-to-design.md new file mode 100644 index 0000000..04e28ea --- /dev/null +++ b/doc/how-to-design.md @@ -0,0 +1,710 @@ +# Things to Think About When Designing Features for Emboss (An Incomplete List) + +Original Author: + +Ben Olmstead (aka reventlov, aka Dmitri Prime), original designer and author of +Emboss + + +# General Design Principles + +There are many, many books, articles, talks, classes, and exercises on good +software design, and most general design principles apply to Emboss. In this +section, I will only cover the "most important" principles and those that I do +not see highlighted in many other places. + + +## Design to Real Problems, Not Hypotheticals + +In order to avoid "second system effect," designs that do not work in practice, +and wasted effort, it is best to design to a specific problem — preferably a +few instances of that problem, so that your design is more likely to solve a +wide range of real world problems. 
+ +For example, in Emboss if you wait until you have a specific data structure +that is awkward or impossible to express, then try to find examples of other +structures that are awkward in the same way, and then design a feature to +handle those data structures, you are much more likely to come up with a +solution that a) will actually be used, and b) will be used in more than one +place. + + +## Design to the Problem, Not the Solution + +Often, users will have a problem, think "I could solve this if I could do X," +and then ask for a feature for X without mentioning their original problem. As +a software designer, one of the first things you should do is try to figure out +the original problem — usually by asking the user some probing questions — so +that you can design to the problem, not to the user's solution. + +(Note that this is sometimes true even if you are the user: it is easy to get +tunnel vision about a solution you came up with. Sometimes you need to step +back and try to find a different solution.) + + +## Do Not Try to Do Everything + +Avoid the temptation to cover every possible use case, even if some of those +would generally fit within the domain of your project. A project like Emboss +will attract extremely specific requests — requests whose solutions do not +generalize. + + +### Emboss is a "95% Solution" + +Instead of trying to cover every use case for every user, leave "escape +hatches" in your design, so that users can use Emboss for the cases it covers, +and integrate their own solutions in the places that Emboss does not cover. + +There will always be formats that Emboss cannot handle without becoming an +actual programming language — even something as "basic" as compression is +generally beyond what Emboss is meant to be capable of. + + +## Be Conservative + +Emboss has strong backwards-compatibility guarantees: in particular, once a +feature is "released," support for that feature is guaranteed more or less +forever. 
Because of this, new features should be narrow, even if there are +"obvious" expansions, and even if narrowing the feature actually takes more +code in the compiler. You can always expand a feature later, but narrowing it +or cutting it out would break Emboss's support guarantees. + +Although this principle is very standard for professional, publicly-released +software, it may be a culture shock to developers who are used to +"monorepo"[^mono] environments such as Google — it is not possible to just +update all users in the real world! Note that even many of Google's *open +source* projects, such as Abseil, require their users to periodically update +their code to the latest conventions, which imposes a cost on users of those +projects. Emboss is intended for smaller developers and embedded systems, +which often do not have the resources for such migrations. + +[^mono]: In the several years that Emboss spent inside Google's monorepo it + underwent many large, backwards-incompatible changes that made the current + language significantly better. Early incubation in a controlled + environment can be valuable for a new language! + + +## Design for Later Expansion + +### Leave "Reserved Space" for Future Features + +Emboss uses `$` in many keyword names, but does not allow `$` to be used in +user identifiers — this lets Emboss add `$` keywords without worrying about +colliding with identifiers in existing code. (This is in direct contrast to +most programming languages, where introducing new keywords often breaks +existing code.) + +As another example, Emboss disallows identifiers that collide with keywords in +many programming languages — this gives room for Emboss to add back ends for +those programming languages later, without having to figure out a convention +for mangling identifiers that collide. 
As a real-world counterexample, +Protocol Buffers had to figure out a convention for handling field names that +collide with C++ identifiers such as `class` — and `protoc` still generates +broken C++ code if you have two fields named `class` and `class_` in the same +`message`. + + +### Leave "Extension Points" + +An "extension point" is a place where someone should be able to hook into the +system without changing the system. This can be an API, a "hook," a defined +data format, or something else entirely, but the defining factor is that it is +a way to add new features or alter behavior without changing the existing +software. + +In practice, many extension points won't "just work" until there are at least a +few things using them, due to bugs or unexpected coupling, but in principle +they should not require any modification. + +One extension point in the Emboss compiler is the full separation between front +and back ends, so that future back ends (such as Rust, Protocol Buffers, PDF +documentation, etc.) can be added without changing the overall design or +(theoretically) any of the existing compiler.[^ext] + +[^ext]: This is not unique or original to Emboss: separate front and back ends + are totally standard in modern compiler design. + +In the physical world, an electrical outlet or a network port is an extension +point — there is nothing there right now, but there is a defined place for +something to be added later. + + +### Leave "Lines of Cleavage" + +A "line of cleavage" is similar to an extension point, except that instead of +being a ready-to-go place to add something new, it's a place where the major +work was done, but there are still some pieces that need to be fixed up. + +A line of cleavage in the Emboss compiler is the use of a special `.emb` file +(`prelude.emb`) to define "built-in" types, with the aim of eventually allowing +end users to define their own types at the same level. 
This feature still has +open design decisions, such as: + +* How will users define their type for the back end(s)? +* How will users define the range of an integer type for the expression + system? + +But these are relatively minor compared to the larger question of "how can +Emboss allow end users to define their own basic types?" + +In software, lines of cleavage are usually invisible to end users, and can be +difficult to see even for developers working on the code. + +In the physical world, an example of this is putting empty conduit into walls +or ceilings: that way, new electrical or communication wires or pneumatic tubes +can be pulled through the conduit and attached to new outlets, without having +to open up *all* the walls. + + +## Consider Known Potential Features + +Every complex software system has a cloud of potential features around it: +features which, for one reason or another, have not been implemented yet, but +which some stakeholder(s) want. These features usually exist at every stage +from "idle thought in a developer's mind" to "partially implemented, but not +finished," and the likelihoods of each one to become a finished feature cover +an equally wide range. + +When designing a new feature there are very good reasons to think about these +potential features: + +First, you should ensure that your new feature does not make another +highly-desirable feature impossible. In Emboss, for example, if your new +feature made it impossible to support a string type, that would be a very good +reason to redesign your feature (or abandon it, if it is fundamentally +incompatible). + +Second, sometimes you can tweak your design so that a potential feature becomes +obsolete: fundamentally, every feature request exists to solve a problem, and +often it is not the only way to solve that problem. If you can solve it in a +different way, you can make users happy and avoid some future work. 
(Though be +careful: it can be difficult to infer the full scope of a user's problem(s) +from a feature request.) + +Third, thinking about specific potential features can help narrow the amount of +"future design space" that you need to consider, which makes it easier to put +extension points and lines of cleavage in your design in places where they will +actually be used. + + +# General Language Design Principles + +In contrast to general software design principles, there are far fewer sources +on good *language* design. I speculate that this is because there are far +fewer language designers than software designers. (There are tens of millions +of software developers, but only tens of thousands of programming, markup, and +data definition languages — and of those, maybe two thousand or so are +"serious" languages with significant real-world use.) + +Luckily, there are many publicly available and documented languages to learn +from directly. + +Language design can be very roughly divided into syntactic and semantic +concerns: syntax is how the language *looks* (what symbols and keywords are +used, and in what order), while semantics cover how the language *works* (what +actually happens). It might seem like semantics are more important, but syntax +has a huge effect on how easy it is to understand existing code and to write +correct code, which are both incredibly important in real-world use. + +In this section, I will try to outline language design principles that I have +found or developed, particularly when they are useful for Emboss. + + +## Be Mindful of the Power/Analysis Tradeoff + +[Turing-complete languages cannot be fully +analyzed](https://en.wikipedia.org/wiki/Halting_problem). This is one of the +reasons that languages like HTML and CSS are not programming languages: the +more expressive a language is, the more difficult it is to analyze. 
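
One way to see the analysis side of this tradeoff: when a layout is plain data rather than code, questions about it can be answered by inspection, without running anything. The sketch below is a hypothetical illustration in Python, not Emboss's actual representation; the `Field` type, the field names, and the offsets are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A hypothetical fixed-layout field: byte offset and byte size."""
    name: str
    offset: int
    size: int


def total_size(fields: list[Field]) -> int:
    """Smallest buffer size, in bytes, that holds every field."""
    return max((f.offset + f.size for f in fields), default=0)


def overlapping_pairs(fields: list[Field]) -> list[tuple[str, str]]:
    """Names of field pairs whose byte ranges intersect."""
    ordered = sorted(fields, key=lambda f: f.offset)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.offset < a.offset + a.size:
                pairs.append((a.name, b.name))
    return pairs


# Made-up layout for illustration only.
header = [
    Field("version", 0, 1),
    Field("flags", 1, 1),
    Field("length", 2, 2),
    Field("checksum", 3, 2),  # deliberately shares a byte with "length"
]

print(total_size(header))
print(overlapping_pairs(header))
```

Both functions terminate for any input list because they only inspect data. The same questions about a layout computed by arbitrary code could not be answered in general.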
+ +The `.emb` format is intended to be more on the declarative side, so that +definitions can be analyzed and transformed as necessary. + + +## Look at Other Languages + +Although Emboss is a data definition language (DDL), not a programming +language, many lessons and principles from programming language design can be +applied, along with lessons from other DDLs and sometimes even interface +definition languages (IDLs), markup languages, and query languages. + +In particular, for Emboss it is often worth looking at: + +* Popular programming languages: C, C++, Rust, JavaScript, TypeScript, C#, + Java, Go, Python 3, Swift, Objective-C, Lua. "Systems" programming + languages such as C, C++, and Rust are usually the most relevant of these, + but it is useful to survey all the popular languages because many Emboss + users will be familiar with them. Note that Lua is used for Wireshark + packet definitions. + +* Selected "interesting" programming languages: Wuffs, Haskell, OCaml, Agda, + Coq. These have some lessons for Emboss, especially its expression system + — in particular, they're all much more principled than "standard" + programming languages about how they handle types and values. There are + many other programming languages that have interesting ideas (FORTH, + Prolog, D, Perl, Logo, Scratch, APL, so-called "esoteric" programming + languages), but they usually are not relevant to Emboss. + +* DDLs: Kaitai Struct, Protocol Buffers, Cap'n Proto, SQL-DDL. Kaitai Struct + is the closest of these to solving the same problem as Emboss (though it + has some fundamentally different design decisions which make it far worse + for embedded systems), but all have some lessons. Some higher-level schema + languages like DTD, XML Schema, or JSON Schema tend to be less relevant to + Emboss. 
Note that there are a number of DDLs that are also IDLs: in actual + use, some of them (Protocol Buffers) are used more often for their DDL + features, while others (XPIDL, COM) are used more for their IDL features. + + +## Learn Academic Theory + +Many (most?) languages are designed by people who have minimal knowledge of the +academic theories of how programming languages work — for Emboss, Category +Theory is particularly useful, and the computer science of parsers (especially +LR(1) parsers) is useful for tweaking the parser generator or adding new +syntax. + +This is a case where a little bit of learning goes a long way: you do not need +to learn a *lot* about parsers or Category Theory to benefit from them. + + +## Try to Acquire Practical Knowledge + +Many of the academic topics related to programming language design have +corresponding industrial knowledge, and there are practical concerns that have +very little to do with academic theory. + +The Emboss compiler is (loosely) based on the design of LLVM, with a series of +transformation passes that operate somewhat independently, and independent back +end code generators.[^designoops] + +[^designoops]: After many years of experience with this, I think that this is + not quite the right design for Emboss, and I would make two major changes: + first (and simplest), I would divide the current "front end" into a true + front end that only handled syntax and some types of syntax sugar, and a + "middle end" that handled all of the symbol resolution, bounds analysis, + constraint checking, etc. Second, I would use a "compute-on-demand" (lazy + evaluation) approach in the middle end, which would allow certain + operations to be decoupled. The LLVM design is more suited for independent + optimization passes, not for the kind of gradual annotation process in the + Emboss middle end. 
+ +As another example, understanding how (and how well) Clang, GCC, and MSVC can +optimize C++ code is crucial to generating high-performance code from Emboss +(and Emboss leans very heavily on the C++ compiler to optimize its output). + +Some bits of practical knowledge are tiny little bits of almost-trivia. For +example, if you have C or C++ code in a (text) template, and you use `$` to +indicate substitution variables (as in `$var` or `$var$`), then most editors +and code formatters will treat your substitution variables as normal +identifiers. This is because almost every C and C++ compiler allows you to use +`$` in identifiers, even though there has never been a C or C++ standard that +allows those names, and it is rarely noted in any compiler, editor, or +formatter's documentation. + + +## Use Existing Syntax + +Emboss pulls many conventions from programming, data definition, and markup +languages. In general, if there is a feature in Emboss that works in a way +that is the same as in other languages, it is best to pull syntax from +elsewhere — ideally, pull in the most common syntax. Many examples of this in +Emboss are so common you might not even think about them: + +* Arithmetic operators (`+`, `-`, `*`) +* Operator precedence (`*` binds more tightly than `+` and `-`, but also: see + the next section) + +Other examples are more specific, with no universal convention: + +* `: Type` syntax for type annotation (TypeScript, Python, OCaml, Rust, ...) + +This is *especially* important for Emboss, because most people reading or +writing Emboss code will not want to spend much time becoming an "Emboss +expert" — where someone might be willing to spend days or weeks to learn how to +write Rust code, they are more likely to spend hours or minutes learning to +write Emboss. + + +## Avoid Existing Syntax + +However, there are three main reasons to avoid using existing syntax: + +* The "standard" syntax is error-prone. 
One example of this is operator + precedence in most programming languages: errors related to not knowing the + relative precedence of `&&` and `||` are so common that most compilers have + an option to warn if they are mixed without parentheses. Emboss handles + this — and a few other error-prone constructs — by having a *partial + ordering* for precedence instead of the standard total ordering, and making + it a syntax error to mix operators such as `&&` and `||` that have + incomparable (neither equal, less than, nor greater than) precedence. As + far as I can tell, this is a totally new innovation in Emboss: there is no + precedent (no pun intended) whatsoever for partial precedence order. + + When avoiding syntax in this way, it is ideal to make the standard syntax + into a syntax error (so that no one can use it accidentally) and to add an + error message to the compiler that suggests the correct syntax. + +* The existing syntax is not used consistently: if multiple programming + languages use the same syntax for slightly different semantics, it is + usually worth avoiding the syntax. For example, `/` has quite a few + different semantics — in many languages, it is a type-parameterized + division, where the numeric result depends on the (static or dynamic) types + of its operands, and across languages, the "integer division" flavor is not + consistent — in most programming languages it is *truncating division* (`-7 + / 3 == -2`), but in some programming languages it is *flooring division* + (`-7 / 3 == -3`). + +* The semantics do not match: if an Emboss feature is *almost*, but *not + quite* equivalent to a feature in other languages, it is best to avoid + making the Emboss feature look like the other feature. + + +## Poll Users/Programmers + +When designing a new feature, try to come up with several alternatives and poll +Emboss users (or sometimes non-Emboss-using programmers) as to which one they +prefer. 
+ +For syntax, one especially powerful technique is to show an example of the +proposed syntax to people who have never seen it, and ask "what do you think +this means?" without any hinting or prompting. This is the "gold standard" way +of finding out whether your syntax is clear or not. + + +## Avoid Error-Prone Constructs + +Computing now has roughly seventy years of experience with artificial languages +(in programming, markup, data definition, query, etc. flavors), and we have +learned a lot about what kinds of constructs are error-prone for humans to use. +Avoid these, where possible! Some examples include: + +* Large semantic differences should not have small, easily-overlooked + syntactic differences. For example, allowing single- and double-character + operators (`=` and `==`, `|` and `||`, etc.) in the same contexts: a + classic C-family programming error is to use `=` in a condition instead of + `==`. Many modern languages either force `=` to be used only in "statement + context" (and some, like C#, also ban side-effectless statements such as `x + == y;`) or use a different operator like `:=` for assignment. (Or both, as + in Python, which allows `:=` but not `=` for "expression assignment.") + +* Syntax should have *consistent* semantic meaning. For example, in + JavaScript these two snippets mean the same thing: + + ```js + return f() + 10; + ``` + + ```js + return f() + + 10; + ``` + + but this one is different (it returns `undefined`, thanks to JavaScript's + automatic `;` insertion): + + ```js + return + f() + 10; + ``` + + A small difference in the placement of the line break leads to totally + different semantics! + + C++ has a number of places where identical syntax can have wildly different + semantics, especially (ab)use of operator overloads and [the most vexing + parse](https://en.wikipedia.org/wiki/Most_vexing_parse). 
+ +* Hoare calls "null" his "billion-dollar mistake," and the way that null + pointers are handled in most programming languages, especially C and C++, + is particularly error-prone. (But note that it isn't really "null" itself + that is problematic — it's that there is no way to mark a pointer as "not + null," and that doing anything with a null pointer leads to undefined + behavior. However, some popular language features, such as the `?.` + operator found in several programming languages and the `std::optional<>` + type in C++, show that there is some utility to nullable types, as long as + there is language support for enforcing null checks and/or allowing null to + propagate in the same way that NaN can.) + +* Edge cases, such as integer overflow, are difficult for humans to reason + about. In systems programming languages like C and C++, this leads to a + significant percentage of security flaws. (C and C++ compilers use the + "integer overflow is undefined" rule *extensively* in optimization, so + there are pragmatic trade-offs in general. Emboss is used in smaller + contexts with tighter safety guarantees.) + + +# Emboss-Specific Considerations + +Emboss sits in a section of design space that has very few alternatives, and as +a result there are things to think about when designing Emboss features that do +not apply to many other languages. + +Also, because Emboss already exists, there are a number of systems within +Emboss-the-language that may interact with new features. + +And finally, if you want your feature to become implemented, it is necessary to +consider how difficult it would be to implement new features in `embossc`. + + +## Survey Data Formats + +Maybe the least fun (at least for me[^unfun]) part of designing Emboss features +is reading through data sheets, programming manuals, RFCs, and user guides to +understand the data formats used in the real world, so that any new feature can +handle a reasonable subset of those formats. 
Some sources to consider: + +* Data sheets and programming manuals for: + * complex sensors, such as LiDAR + * GPS receivers + * servos + * LED panels and segmented displays + * clock hardware + * ADCs and DACs + * camera sensors + * power control devices + * simple sensors such as barometers, hygrometers, current sensors, + voltage sensors, light sensors, etc. (though many very simple sensors + use analog outputs or very, very simple digital outputs that do not + have a "protocol" as such) +* RFCs for low-level protocols such as Ethernet, IP, ICMP, UDP, TCP, and ARP + + + +[^unfun]: One of my original motivations for creating Emboss is that I find + reading data sheets and implementing code to read/write the data formats + therein to be extremely tedious. + + +## Structure Layout System + +The "heart" of Emboss is what may be called the "structure layout system:" the +engine that determines which bits to read and write in order to produce or +commit the values of fields. When designing, consider: + +* Does this feature require reaching "outside" of a scope? For example, + referencing a sibling field from within a field's scope is currently + impossible, because each field has its own scope. Allowing `[requires: + this == sibling]` means expanding that scope. + +* Does this feature require information that is not (currently) available to + the layout engine, or not available at the right place or time? For + example, if you are designing a feature to allow field sizes to be `$auto`, + how does that interact with structures that are variable size? + +* Does this feature require information that is potentially circular, or + would it interact with another potential feature to require circular + information, and is there a way to resolve that? 
For example: if you are + designing a feature to allow field sizes to be `$auto`, inferring their + size from their type, how will that interact with the potential feature to + allow `struct`s that grow to the size they are given? + + +## Expression System + +Although most expressions in Emboss definitions are simple (such as `x*4` or +even just `0`), the expression system in Emboss tracks a lot of information, +such as: + +* What is the type of every subexpression (e.g., integer, specific + enumeration, opaque, etc.)? +* For integer and boolean expressions, does the expression evaluate to a fixed + (constant) value? +* For integer expressions, what are the upper and lower bounds of the + expression? (Used for determining the correct integer types to use in + generated code.) +* For integer expressions, is the value guaranteed to be equal to some fixed + value modulo some constant? (Used for generating faster code for aligned + memory access.) + +When designing a feature, consider: + +* Will any new types be `opaque` to the expression system, or will it be + possible to perform operations on them? If they are `opaque` for now, will + they stay that way, or will it be possible to manipulate them in the + future? For example, adding a string type in Emboss might start as + `opaque`, but allow operations like "value at index" or "substring" in the + future. +* When adding new operations, how will they interact with the bounds and + alignment tracking? For example: truncating division often breaks + alignment tracking, whereas flooring division does not. +* Will this feature invalidate existing code? Anything that causes the + inferred integer bounds of existing code to expand can break existing code. + +Note that the entire point of Emboss is to provide a bridge between physical +data layout (as defined in the structure layout system) and abstract values +with no specific representation (as exposed through the expression system). 
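
The point about division and alignment tracking can be made concrete with a small sketch. It is written in Python purely for illustration and is not taken from `embossc`; the helper names `trunc_div` and `floor_div` and the sample values are assumptions. It takes values that are all congruent to 2 modulo 8, divides each by 4, and checks whether the quotients still share one residue modulo 2.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division that rounds toward zero, as C and C++ `/` does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def floor_div(a: int, b: int) -> int:
    """Integer division that rounds toward negative infinity (Python's `//`)."""
    return a // b


# Hypothetical inputs: every value is congruent to 2 modulo 8.
values = [-14, -6, 2, 10]
assert all(v % 8 == 2 for v in values)

# Residues modulo 2 of each quotient after dividing by 4.
floor_residues = {floor_div(v, 4) % 2 for v in values}
trunc_residues = {trunc_div(v, 4) % 2 for v in values}

print(floor_residues)
print(trunc_residues)
```

With flooring division the quotients keep a single residue, so a fact of the form "equal to r modulo m" survives the operation. With truncating division the negative inputs land on a different residue than the non-negative ones, so the fact is lost unless the operand is known to be non-negative.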
+ + +## Parsing + +Any new syntax has to be added to the parser. Aside from the language design +considerations for new syntax (see the ["General Language Design Principles" +section](#general-language-design-principles)), there are a few levels of +concern for the actual implementation: + +* Is it computationally feasible to parse this syntax in an intuitive, + unambiguous way? +* Is it humanly feasible to express this syntax as an LR(1) grammar that can + be parsed by Emboss's shift-reduce parser engine? +* Is it feasible to parse this syntax using a different parsing engine type + (Earley, recursive descent, TDOP, parser combinator, etc.)? + +The first consideration is more of a general language design consideration: if +your language design says "users will be able to specify their program in +English," that is not really feasible (or unambiguous). (Not that it hasn't +been tried, many times.) + +The second consideration — can you add this syntax to `embossc`? — is the most +practical and important consideration for Emboss. LR(1) grammars are pretty +restrictive (though shift-reduce parsers have advantages — there are reasons +Emboss is using one), and even when it is *possible* to express a particular +syntactic construct in LR(1)[^zimm], it may be difficult for most programmers to +actually do so. As a practical matter, I recommend trying to actually add your +syntax to `module_ir.py`. + +[^zimm]: I (Ben Olmstead) think it would be awesome to implement [[Zimmerman, + 2022](https://arxiv.org/abs/2209.08383)] plus a few extensions of my own + devising in Emboss's shift-reduce engine, which would make the grammar + design space significantly larger. I would also separate the parser + generator engine into its own project. + +The third consideration is more future-focused and abstract: does this syntax +lock Emboss into using a shift-reduce parser in the future? Ideally, no. 
+Luckily(?), LR(1) grammars are one of the more restrictive types of grammars in +common use, so it is likely that anything that can be handled by the current +parser can be handled by many other types of parsers. + + +## Generated Code + +Right now, there is only the generated C++ code, but there should be other back +ends in the future. Some new features are pure syntax sugar (e.g., `$next` or +`a < b < c`) that are replaced in the IR long before it reaches the back end +(e.g., with the offset+length of the syntactically-previous field, or the IR +equivalent of `a < b && b < c`), while others require extensive changes to how +code is generated. + +* What information will the back end need in order to generate working code? +* Does this feature require embedded-unfriendly generated code? (E.g., + memory allocation, I/O.) +* Can the existing C++ back end, which just walks the IR tree in a single + pass while building up strings which are combined into a `.h`, handle this + feature in its current design? +* How will this feature interact with various generated templates? +* Can/should this feature be, itself, templated? + + +## C++ Runtime Library + +The runtime library will be included with every program that touches Emboss, so +it is important to make it efficient. When adding features, consider: + +* Can the feature be added in such a way that it does not cost anything for + programs that do not use the feature? A standalone C++ template will not + be included in a program unless the program instantiates the template, but + if the new code is used from somewhere in an existing function, it may be + included in programs that do not use it directly. + +* Can the feature be added without allocating any heap memory? Can it be + added with O(1) stack memory use? Both of these are important for some + embedded systems, such as OS-less microcontroller and hard-real-time + environments. 
Some features may intrinsically require memory allocation, + in which case it is best if they can be separated: for example, Emboss + structure-to-string conversion requires allocation, and even `#include`'ing + the appropriate headers can be too much for some environments, even if the + serialization code is never included in the final binary. + +* How much can you rely on the C++ compiler to optimize things? If you have + to implement your own optimizations, that will cost more development time + and add more complexity to the standard library. + + +## Compiler Complexity + +The Emboss compiler is already quite complex, and has many subsystems that +interact. It is already quite difficult to reason about some interactions. + +* Can the feature be added at an "edge" of the compiler? For example, if you + can implement your feature as syntax sugar that converts the new feature to + existing IR early in the compilation process, it is much easier to verify + that it will not cause problematic interactions. Similarly, if you can + implement your feature entirely in the back end or in the runtime library, + you do not need to worry about interactions inside the front end. + +* If a feature cannot be added at an edge, how can you design it to minimize + the complexity? (Ideally, you could even unify existing systems in such a + way that the overall complexity of the compiler is lower at the end.) + + +## Future Back Ends + +It is important to have some idea of how any feature would be implemented +against future back ends. + + +### Programming Language (Rust/Python/Java/Go/C#/Lua/etc) Back Ends + +Some features may be difficult to implement in other languages. For example, +Python does not have a native `switch` statement, so any `switch`-like feature +in Emboss may be awkward to implement — but this does not necessarily mean that +Emboss should not have a `switch`. + +As a rule of thumb, languages can be grouped into tiers: + +1. 
"Systems"/embedded-friendly languages: C++, Rust, C. Top support. +2. Languages used for parsing/analyzing raw sensor dumps: C#, Java, Go, + Python, etc. Should have good support, but not gate any features. +3. Languages that are rarely used to touch binary data: JavaScript, + TypeScript, etc. Can be mostly ignored. +4. Dead and obscure languages: Perl, COBOL, APL, INTERCAL, etc. Can be + ignored entirely. + +(It may be difficult to classify some languages, such as FORTRAN, which is +still hanging around in 2024.) + +Remember that other back ends may have different requirements and guarantees +than the C++ back end: for example, it would be unreasonable for a Java back +end to promise "no dynamic memory allocation." + + +### Other Data Format (Protobuf/JSON/etc) Back Ends + +These back ends would translate binary structures into alternate +representations that are easier for some tools to use: for example, Google has +many, many tools for processing Protocol Buffers, and JSON is popular in the +open-source world. + +Most other formats have limitations that may make some kinds of Emboss +constructs difficult or impossible to correctly reproduce: for example, Emboss +already supports "infinitely nested" `struct` types, like: + +``` +struct Foo: + 0 [+10] Foo child_foo +``` + +Formats like Protobuf or JSON, which do not have any way of representing loops +in their data graph, cannot handle this. + +Until the most recent versions of Protobuf, mismatches between Protobuf `enum` +and Emboss `enum` made it functionally impossible to map any Emboss `enum` +types onto Protobuf `enum` types: Emboss `enum` types are open (allow any +value, even ones that are not listed in the `enum`), where all Protobuf `enum` +types were closed (only allowed known values). (The most recent Protobuf +versions, Proto3 and Editions, allow you to have open `enum` types.) 
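
The open-versus-closed `enum` mismatch described above can be sketched in a few lines. This is only an illustration under stated assumptions: `Color`, `decode_closed`, and `decode_open` are hypothetical names, and Python's `enum.IntEnum` stands in for a closed `enum` type. It is not how Emboss or Protobuf implement enumerations.

```python
import enum


class Color(enum.IntEnum):
    """A closed enumeration: only the listed values can be constructed."""
    RED = 1
    GREEN = 2


def decode_closed(raw: int) -> Color:
    """Closed behavior: raises ValueError for a value that is not listed."""
    return Color(raw)


def decode_open(raw: int):
    """Open behavior: unlisted values are kept as plain integers."""
    try:
        return Color(raw)
    except ValueError:
        return raw


print(decode_open(1))  # a listed value maps to the named member
print(decode_open(7))  # an unlisted value survives as a raw integer
```

A translation from an open `enum` to a closed one has to decide what to do with values such as `7` here: reject the whole structure, drop the field, or carry the raw value in a separate field.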
 + +Generally, it is not worth blocking an Emboss feature because of these kinds of +mismatches, but it is worth thinking about how to avoid them, if possible. + + +### Documentation (PDF/Markdown/etc) Back Ends + +These back ends would translate `.emb` files to a form of human-readable +documentation, intended for publication on a web site, in an RFC, or as part of +a PDF datasheet. This type of back end is the motivation for having both `--` +documentation blocks and `#` comments in Emboss. + +Since the output from these back ends would be intended for human consumption, +for the most part you would only need to ensure that your feature can be +understood by humans. diff --git a/doc/how-to-implement.md b/doc/how-to-implement.md new file mode 100644 index 0000000..452d2d1 --- /dev/null +++ b/doc/how-to-implement.md @@ -0,0 +1,203 @@ +# How to Implement Changes to Emboss + + + +## Getting the Code + +The master Emboss repository lives at https://github.com/google/emboss — you +can `git clone` that repository directly, or make [a fork on +GitHub](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/about-forks) +and then `git clone` your fork. + + +## Prerequisites + +In order to run Emboss, you will need [Python](https://www.python.org/). +Emboss supports all versions of Python that are [still supported by the +(C)Python developers](https://devguide.python.org/versions/), but versions +older than that generally will not work. + +The Emboss tests run under [Bazel](https://bazel.build/). In order to run the +tests, you will need to [install Bazel](https://bazel.build/start) on your +system. + + +## Running Tests + +Emboss has a reasonably extensive test suite. In order to run the test suite, +`cd` into the top `emboss` directory, and run: + +```sh +bazel test ... +``` + +Bazel will download the necessary prerequisites, compile the (C++) code, and +run all the tests.
Tests will take a moment to compile and run; Bazel will +show running status, like this: + +``` +Starting local Bazel server and connecting to it... +[502 / 782] 22 actions, 21 running + Compiling runtime/cpp/test/emboss_memory_util_test.cc; 10s linux-sandbox + Compiling runtime/cpp/test/emboss_memory_util_test.cc; 10s linux-sandbox + Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/emboss_front_end.runfiles; 2s local + Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/synthetics_test.runfiles; 1s local + Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/make_parser_test.runfiles; 1s local + Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/back_end/cpp/header_generator_test.runfiles; 1s local + Compiling absl/strings/str_join.h; 0s linux-sandbox + Creating runfiles tree bazel-out/k8-fastbuild/bin/compiler/front_end/generate_cached_parser.runfiles; 0s local ... +``` + +You may see a few `WARNING` messages; these are generally harmless. + +Once Bazel finishes running tests, you should see a list of all tests and their +status (all should be `PASSED` if you just cloned the main Emboss repo): + +``` +Starting local Bazel server and connecting to it... +INFO: Analyzed 226 targets (98 packages loaded, 4080 targets configured). +INFO: Found 116 targets and 110 test targets... +INFO: Elapsed time: 65.577s, Critical Path: 22.22s +INFO: 862 processes: 372 internal, 490 linux-sandbox. +INFO: Build completed successfully, 862 total actions +//compiler/back_end/cpp:alignments_test PASSED in 0.2s +//compiler/back_end/cpp:alignments_test_no_opts PASSED in 0.1s +//compiler/back_end/cpp:anonymous_bits_test PASSED in 0.2s +//compiler/back_end/cpp:anonymous_bits_test_no_opts PASSED in 0.1s +[... many more tests ...] 
+//runtime/cpp/test:emboss_prelude_test PASSED in 0.2s +//runtime/cpp/test:emboss_prelude_test_no_opts PASSED in 0.2s +//runtime/cpp/test:emboss_text_util_test PASSED in 0.2s +//runtime/cpp/test:emboss_text_util_test_no_opts PASSED in 0.2s + +Executed 110 out of 110 tests: 110 tests pass. +``` + +If a test fails, you will see lines at the end like: + +``` +//compiler/back_end/cpp:alignments_test FAILED in 0.0s + /usr/local/home/bolms/.cache/bazel/_bazel_bolms/444a471ee8e028e0535394d088883276/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test/test.log + +Executed 110 out of 110 tests: 109 tests pass and 1 fails locally. +``` + +You can read the `test.log` file to find out where the failure occurred. + +Note that each C++ test actually runs multiple times with different Emboss +`#define` options, so a single failure may cause multiple Bazel tests to fail: + +``` +//compiler/back_end/cpp:alignments_test FAILED in 0.0s + /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test/test.log +//compiler/back_end/cpp:alignments_test_no_checks FAILED in 0.0s + /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test_no_checks/test.log +//compiler/back_end/cpp:alignments_test_no_checks_no_opts FAILED in 0.0s + /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test_no_checks_no_opts/test.log +//compiler/back_end/cpp:alignments_test_no_opts FAILED in 0.0s + /usr/local/home/bolms/.cache/bazel/_bazel_bolms/1c6e4694f903a02feef32c92ec3f1cae/execroot/_main/bazel-out/k8-fastbuild/testlogs/compiler/back_end/cpp/alignments_test_no_opts/test.log + +Executed 168 out of 168 tests: 164 tests pass and 4 fail locally. 
+``` + +(The Emboss repository goes one step further and runs each of *those* tests +under multiple compilers and optimization options.) + +If you are working on fixing a failure in one particular test, you can tell +Bazel to run just that test by specifying the name of the test on the command +line: + +``` +bazel test //compiler/back_end/cpp:alignments_test +``` + +This can be quicker than re-running the entire test suite. + + +### `docs_are_up_to_date_test` + +If you are making changes to the Emboss grammar, you can ignore failures in +`docs_are_up_to_date_test` until you have your updated grammar finalized: that +test ensures that certain generated documentation files are up to date when +code reaches the main repository. See [Checked-In Generated +Code](#checked-in-generated-code), below. + + +## Implementing a Feature + +The Emboss compiler is under [`compiler/`](../compiler/), with +[`front_end/`](../compiler/front_end/), [`back_end/`](../compiler/back_end/), +and [`util/`](../compiler/util/) directories for the front end, back end, and +shared utilities, respectively. + +The C++ runtime library is under [`runtime/cpp/`](../runtime/cpp). + + +### Coding Style + +For Python, Emboss uses the default style of +the [Black](https://black.readthedocs.io/en/stable/) code formatter.[^genfile] + +[^genfile]: There is one, very large, generated `.py` file checked into the + Emboss repository that is intentionally excluded from code formatting — + both because it can hang the formatter and because the formatted version + takes noticeably longer for CPython to load. + +For C++, Emboss uses the `--style=Google` preset of +[ClangFormat](https://clang.llvm.org/docs/ClangFormat.html). + + +## Writing Tests + +Most code changes require tests: bug fixes should have at least one test that +fails before the bug fix and passes after the fix, and new features should have +many tests that cover all aspects of how the feature might be used.
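
As a rough illustration of a test that "fails before the bug fix and passes after the fix", here is a minimal sketch that uses Python's standard `unittest` module. The function `align_up` and the bug it describes are hypothetical and are not taken from the Emboss code base; real Emboss tests exercise the compiler and runtime rather than a toy helper like this one.

```python
import unittest


def align_up(value: int, alignment: int) -> int:
    """Rounds `value` up to a multiple of `alignment` (hypothetical helper)."""
    # Hypothetical history: the buggy version was
    #     (value // alignment + 1) * alignment
    # which over-rounded values that were already aligned.
    return -(-value // alignment) * alignment


class AlignUpTest(unittest.TestCase):
    def test_already_aligned_value_is_unchanged(self):
        # Regression test: the buggy version returned 16 here.
        self.assertEqual(align_up(8, 8), 8)

    def test_unaligned_value_rounds_up(self):
        # This case also passed before the fix; it guards existing behavior.
        self.assertEqual(align_up(9, 8), 16)
```

Saved in a file, a test module like this can be run with `python -m unittest`. The regression test pins down the exact input that exposed the bug, and the second test checks that the fix did not change behavior that was already correct.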
+ + +### Python + +The Emboss Python tests use the Python +[`unittest`](https://docs.python.org/3/library/unittest.html) module. Most +[tests of the Emboss front end](../compiler/front_end/) are structured as: + +1. Run a small `.emb` file through the front end, stopping immediately before + the step under test, and hold the result IR (intermediate representation). +2. Run the step under test on that IR, making sure that there are either no + errors, or that the errors are expected. +3. For the "no errors" tests, check various properties of the resulting IR to + ensure that the step under test did what it was supposed to. + + +### C++ + +The Emboss C++ tests use [the GoogleTest +framework](https://google.github.io/googletest/). + +[Pure runtime tests](../runtime/cpp/test) `#include` the C++ runtime library +headers and manually instantiate them, then test various properties. + +[Generated code tests](../compiler/back_end/cpp/testcode/), which incidentally +test the runtime library as well, work by using a header generated from a [test +`.emb`](../testdata/) and interacting with the generated Emboss code the way +that a user might do so. + + +## Writing Documentation + +If you are adding a feature to Emboss, make sure to update [the +documentation](../doc/). In particular, the [language +reference](../doc/language-reference.md) and the [C++ code +reference](cpp-reference.md) are very likely to need to be updated. + + +## Checked-In Generated Code + +There are several checked-in generated files in the Emboss source repository. +As a general rule, this is not a best practice, but it is necessary in order to +achieve the "zero installation" use of the Emboss compiler, where an end user +can simply `git clone` the repository and run the `embossc` executable directly +— even if the cloned repository lives on a read-only filesystem. 
+ +In order to minimize the chances of any of those files becoming stale, each one +has a unit test that checks that the file in the Emboss directory matches what +its generator would currently generate. diff --git a/doc/index.md b/doc/index.md index 6c383e8..7cfb5d6 100644 --- a/doc/index.md +++ b/doc/index.md @@ -16,3 +16,8 @@ Details of the textual representation Emboss uses for structures can be found in the [Emboss Text Format Reference](text-format.md). There is a tentative [roadmap of future development](roadmap.md). + +If you are interested in contributing to Emboss, please read [Contributing to +Emboss](contributing.md), and you may wish to read [How to Design Features for +Emboss](how-to-design.md) and [How to Implement Changes to +Emboss](how-to-implement.md).