Update for 9.1 cc changes

angr · Dec 13, 2021 · 8598f3c · 8598f3c
1 parent bb32408
commit 8598f3c
Show file tree

Hide file tree

Showing 5 changed files with 142 additions and 103 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,19 @@
 This lists the *major* changes in angr.
 Tracking minor changes are left as an exercise for the reader :-)
 
+## angr 9.1
+
+- (#2961) Refactored SimCC to support passing and returning structs and arrays by value
+- (#2964) Functions from the knowledge base may now be pretty-printed, showing colors and reference arrows
+- Improved `import angr` speed substantially
+- (#2948) RDA's `dep_graph` can now be used to track dependencies between temporaries, constants, guard conditions, and function calls - if you want it!
+- (#2929) Basic support for structs with bitfields in SimType
+- There's a decompiler now
+
+## angr 9.0
+
+- Switched to a new versioning scheme: major.minor.build_id
+
 ## angr 8.19.7.25
 
 - (#1503) Implement necessary helpers and information storage for call pretty printing

diff --git a/MIGRATION.md b/MIGRATION.md
@@ -1,91 +1,29 @@
-# Migrating to angr 8
+# Migrating to angr 9.1
 
-angr has moved from python 2 to python 3!
-We took this opportunity of a major version bump to make a few breaking API changes that improve quality-of-life.
+angr 9.1 is here!
 
-## What do I need to know for migrating my scripts to python 3?
+## Calling Conventions and Prototypes
 
-To begin, just the standard py3k changes, the relevant parts of which we'll rehash here as a reference guide:
+The main change motivating angr 9.1 is [this large refactor of SimCC](https://github.com/angr/angr/pull/2961).
+Here are the breaking changes:
 
-- Strings and bytestrings
-  - Strings are now unicode by default, a new `bytes` type holds bytestrings
-  - Bytestring literals can be constructued with the b prefix, like `b'ABCD'`
-  - Conversion between strings and bytestrings happens with `.encode()` and `.decode()`, which use utf-8 as a default. The `latin-1` codec will map byte values to their equivilant unicode codepoints
-  - The `ord()` and `chr()` functions operate on strings, not bytestrings
-  - Enumerating over or indexing into bytestrings produces an unsigned 8 bit integer, not a 1-byte bytestring
-  - Bytestrings have all the string manipulation functions present on strings, including `join`, `upper`/`lower`, `translate`, etc
-  - `hex` and `base64` are no longer string encoding codecs. For hex, use `bytes.fromhex()` and `bytes.hex()`. For base64 use the `base64` module.
-- Builtin functions
-  - `print` and `exec` are now builtin functions instead of statements
-  - Many builtin functions previously returning lists now return iterators, such as `map`, `filter`, and `zip`. `reduce` is no longer a builtin; you have to import it from `functools`.
-- Numbers
-  - The `/` operator is explicitly floating-point division, the `//` operator is expliclty integer division. The magic functions for overriding these ops are `__truediv__` and `__floordiv__`
-  - The int and long types have been merged, there is only int now
-- Dictionary objects have had their `.iterkeys`, `.itervalues`, and `.iteritems` methods removed, and then non-iter versions have been made to return efficient iterators
-- Comparisons between objects of very different types (such as between strings and ints) will raise an exception
+### SimCCs can no longer be customized
 
-In terms of how this has affected angr, any string that represents data from the emulated program will be a bytestring.
-This means that where you previously said `state.solver.eval(x, cast_to=str)` you should now say `cast_to=bytes`.
-When creating concrete bitvectors from strings (including implicitly by just making a comparison against a string) these should be bytestrings. If they are not they will be utf-8 converted and a warning will be printed.
-Symbol names should be unicode strings.
+If you were using the `sp_delta`, `args`, or `ret_val` parameters to SimCC, you should use the new class
+`SimCCUsercall`, which lets (requires) you be explicit about the locations of each argument.
 
-For division, however, ASTs are strongly typed so they will treat both division operators as the kind of division that makes sense for their type.
+### Passing SimTypes is now mandatory
 
-## Clemory API changes
+Every method call on SimCC which interacts with typed data now requires a SimType to be passed in.
+Previously, the use of `is_fp` and `size` was optional, but now these parameters will no longer be accepted and a
+`SimType` will be required.
 
-The memory object in CLE (project.loader.memory, not state.memory) has had a few breaking API changes since the bytes type is much nicer to work with than the py2 string for this specific case, and the old API was an inconsistent mess.
+This has some fairly non-intuitive consequences - in order to accommodate more esoteric calling conventions (think: passing large structs by value via an "invisible reference") you have to specify a function's return type before you can extract any of its arguments.
 
-| Before | After |
-|--------|-------|
-| `memory.read_bytes(addr, n) -> list[str]` | `memory.load(addr, n) -> bytes` |
-| `memory.write_bytes(addr, list[str])` | `memory.store(addr, bytes)` |
-| `memory.get_byte(addr) -> str` | `memory[addr] -> int` |
-| `memory.read_addr_at(addr) -> int` | `memory.unpack_word(addr) -> int` |
-| `memory.write_addr_at(addr, value) -> int` | `memory.pack_word(addr, value)` |
-|  `memory.stride_repr -> list[(start, end, str)]` | `memory.backers() -> iter[(start, bytearray)]` |
+Additionally, some non-cc interfaces, such as `call_state` and `callable` and `SimProcedure.call()`, now _require_ a prototype to be passed to them.
+You'd be surprised how many bugs we found in our own code from enforcing this requirement!
 
-Additionally, `pack_word` and `unpack_word` now take optional `size`, `endness`, and `signed` parameters.
-We have also added `memory.pack(addr, fmt, *data)` and `memory.unpack(addr, fmt)`, which take format strings for use with the `struct` module.
+### `func_ty` -> `prototype`
 
-If you were using the `cbackers` or `read_bytes_c` functions, the conversion is a little more complicated - we were able to remove the split notion of "backers" and "updates" and replaced all backers with bytearrays that we mutate, so we can work directly with the backer objects.
-The `backers()` function iterates through all bottom-level backer objects and their start addresses. You can provide an optional address to the function, and it will skip over all backers that end before that address.
-
-Here is some sample code for producing a C-pointer to a given address:
-
-```python
-import cffi, cle
-ffi = cffi.FFI()
-ld = cle.Loader('/bin/true')
-
-addr = ld.main_object.entry
-try:
-    backer_start, backer = next(ld.memory.backers(addr))
-except StopIteration:
-    raise Exception("not mapped")
-
-if backer_start > addr:
-    raise Exception("not mapped")
-
-cbacker = ffi.from_buffer(backer)
-addr_pointer = cbacker + (addr - backer_start)
-```
-
-You should not have to use this if you aren't passing the data to a native library - the normal load methods should now be more than fast enough for intensive use.
-
-## CLE symbols changes
-
-Previously, your mechanisms for looking up symbols by their address were `loader.find_symbol()` and `object.symbols_by_addr`, where there was clearly some overlap.
-However, `symbols_by_addr` stayed because it was the only way to enumerate symbols in an object.
-This has changed! `symbols_by_addr` is deprecated and here is now `object.symbols`, a sorted list of Symbol objects, to enumerate symbols in a binary.
-
-Additionally, you can now enumerate all symbols in the entire project with `loader.symbols`.
-This change has also enabled us to add a `fuzzy` parameter to `find_symbol` (returns the first symbol before the given address) and make the output of `loader.describe_addr` much nicer (shows offset from closest symbol).
-
-## Deprecations and name changes
-
-- All parameters in cle that started with `custom_` - so, `custom_base_addr`, `custom_entry_point`, `custom_offset`, `custom_arch`, and `custom_ld_path` - have had the `custom_` removed from the beginning of their names.
-- All the functions that were deprecated more than a year ago (at or before the angr 7 release) have been removed.
-- `state.se` has been deprecated.
-  You should have been using `state.solver` for the past few years.
-- Support for immutable simulation managers has been removed.
-  So far as we're aware, nobody was actually useing this, and it was making debugging a pain.
+Every usage of the name func_ty has been replaced with the name prototype.
+This was done for consistency between the static analysis code and the dynamic FFI.
diff --git a/SUMMARY.md b/SUMMARY.md
@@ -43,5 +43,6 @@
   * [List of Claripy Operations](docs/appendices/ops.md)
   * [List of State Options](docs/appendices/options.md)
   * [Changelog](CHANGELOG.md)
-  * [Migrating to angr 8](MIGRATION.md)
+  * [Migrating to angr 9.1](MIGRATION.md)
+  * [Migrating to angr 8](docs/migration-8.md)
   * [Migrating to angr 7](docs/migration-7.md)
diff --git a/docs/migration-8.md b/docs/migration-8.md
@@ -0,0 +1,91 @@
+# Migrating to angr 8
+
+angr has moved from python 2 to python 3!
+We took this opportunity of a major version bump to make a few breaking API changes that improve quality-of-life.
+
+## What do I need to know for migrating my scripts to python 3?
+
+To begin, just the standard py3k changes, the relevant parts of which we'll rehash here as a reference guide:
+
+- Strings and bytestrings
+  - Strings are now unicode by default, a new `bytes` type holds bytestrings
+  - Bytestring literals can be constructued with the b prefix, like `b'ABCD'`
+  - Conversion between strings and bytestrings happens with `.encode()` and `.decode()`, which use utf-8 as a default. The `latin-1` codec will map byte values to their equivilant unicode codepoints
+  - The `ord()` and `chr()` functions operate on strings, not bytestrings
+  - Enumerating over or indexing into bytestrings produces an unsigned 8 bit integer, not a 1-byte bytestring
+  - Bytestrings have all the string manipulation functions present on strings, including `join`, `upper`/`lower`, `translate`, etc
+  - `hex` and `base64` are no longer string encoding codecs. For hex, use `bytes.fromhex()` and `bytes.hex()`. For base64 use the `base64` module.
+- Builtin functions
+  - `print` and `exec` are now builtin functions instead of statements
+  - Many builtin functions previously returning lists now return iterators, such as `map`, `filter`, and `zip`. `reduce` is no longer a builtin; you have to import it from `functools`.
+- Numbers
+  - The `/` operator is explicitly floating-point division, the `//` operator is expliclty integer division. The magic functions for overriding these ops are `__truediv__` and `__floordiv__`
+  - The int and long types have been merged, there is only int now
+- Dictionary objects have had their `.iterkeys`, `.itervalues`, and `.iteritems` methods removed, and then non-iter versions have been made to return efficient iterators
+- Comparisons between objects of very different types (such as between strings and ints) will raise an exception
+
+In terms of how this has affected angr, any string that represents data from the emulated program will be a bytestring.
+This means that where you previously said `state.solver.eval(x, cast_to=str)` you should now say `cast_to=bytes`.
+When creating concrete bitvectors from strings (including implicitly by just making a comparison against a string) these should be bytestrings. If they are not they will be utf-8 converted and a warning will be printed.
+Symbol names should be unicode strings.
+
+For division, however, ASTs are strongly typed so they will treat both division operators as the kind of division that makes sense for their type.
+
+## Clemory API changes
+
+The memory object in CLE (project.loader.memory, not state.memory) has had a few breaking API changes since the bytes type is much nicer to work with than the py2 string for this specific case, and the old API was an inconsistent mess.
+
+| Before | After |
+|--------|-------|
+| `memory.read_bytes(addr, n) -> list[str]` | `memory.load(addr, n) -> bytes` |
+| `memory.write_bytes(addr, list[str])` | `memory.store(addr, bytes)` |
+| `memory.get_byte(addr) -> str` | `memory[addr] -> int` |
+| `memory.read_addr_at(addr) -> int` | `memory.unpack_word(addr) -> int` |
+| `memory.write_addr_at(addr, value) -> int` | `memory.pack_word(addr, value)` |
+|  `memory.stride_repr -> list[(start, end, str)]` | `memory.backers() -> iter[(start, bytearray)]` |
+
+Additionally, `pack_word` and `unpack_word` now take optional `size`, `endness`, and `signed` parameters.
+We have also added `memory.pack(addr, fmt, *data)` and `memory.unpack(addr, fmt)`, which take format strings for use with the `struct` module.
+
+If you were using the `cbackers` or `read_bytes_c` functions, the conversion is a little more complicated - we were able to remove the split notion of "backers" and "updates" and replaced all backers with bytearrays that we mutate, so we can work directly with the backer objects.
+The `backers()` function iterates through all bottom-level backer objects and their start addresses. You can provide an optional address to the function, and it will skip over all backers that end before that address.
+
+Here is some sample code for producing a C-pointer to a given address:
+
+```python
+import cffi, cle
+ffi = cffi.FFI()
+ld = cle.Loader('/bin/true')
+
+addr = ld.main_object.entry
+try:
+    backer_start, backer = next(ld.memory.backers(addr))
+except StopIteration:
+    raise Exception("not mapped")
+
+if backer_start > addr:
+    raise Exception("not mapped")
+
+cbacker = ffi.from_buffer(backer)
+addr_pointer = cbacker + (addr - backer_start)
+```
+
+You should not have to use this if you aren't passing the data to a native library - the normal load methods should now be more than fast enough for intensive use.
+
+## CLE symbols changes
+
+Previously, your mechanisms for looking up symbols by their address were `loader.find_symbol()` and `object.symbols_by_addr`, where there was clearly some overlap.
+However, `symbols_by_addr` stayed because it was the only way to enumerate symbols in an object.
+This has changed! `symbols_by_addr` is deprecated and here is now `object.symbols`, a sorted list of Symbol objects, to enumerate symbols in a binary.
+
+Additionally, you can now enumerate all symbols in the entire project with `loader.symbols`.
+This change has also enabled us to add a `fuzzy` parameter to `find_symbol` (returns the first symbol before the given address) and make the output of `loader.describe_addr` much nicer (shows offset from closest symbol).
+
+## Deprecations and name changes
+
+- All parameters in cle that started with `custom_` - so, `custom_base_addr`, `custom_entry_point`, `custom_offset`, `custom_arch`, and `custom_ld_path` - have had the `custom_` removed from the beginning of their names.
+- All the functions that were deprecated more than a year ago (at or before the angr 7 release) have been removed.
+- `state.se` has been deprecated.
+  You should have been using `state.solver` for the past few years.
+- Support for immutable simulation managers has been removed.
+  So far as we're aware, nobody was actually using this, and it was making debugging a pain.