add all paper reading notes

kongjun18 · Aug 7, 2024 · cd8d1b1
1 parent b8335a7
Showing 74 changed files with 2,440 additions and 80 deletions.
diff --git a/config.yaml b/config.yaml
@@ -123,13 +123,13 @@ menu:
       url: "/tags/"
       title: ""
       weight: 3
-    - identifier: "archives"
-      pre: ""
-      post: ""
-      name: "Archives"
-      url: "/archives/"
-      title: ""
-      weight: 4
+    # - identifier: "archives"
+    #   pre: ""
+    #   post: ""
+    #   name: "Archives"
+    #   url: "/archives/"
+    #   title: ""
+    #   weight: 4
     - identifier: "about"
       pre: ""
       post: ""

diff --git a/content/archives/index.md → content/archives/index.md.backup b/content/archives/index.md → content/archives/index.md.backup
diff --git a/...ration-correctness-of-cloud-system-management/images/Acto-builtin-scenarios.png b/...ration-correctness-of-cloud-system-management/images/Acto-builtin-scenarios.png
diff --git a/...ation-correctness-of-cloud-system-management/images/Acto-consistency-oracle.png b/...ation-correctness-of-cloud-system-management/images/Acto-consistency-oracle.png
diff --git a/...tion-correctness-of-cloud-system-management/images/Acto-differential-oracle.png b/...tion-correctness-of-cloud-system-management/images/Acto-differential-oracle.png
diff --git a/...eration-correctness-of-cloud-system-management/images/Acto-property-mapping.png b/...eration-correctness-of-cloud-system-management/images/Acto-property-mapping.png
diff --git a/...-system-management/images/Acto-state-transition-of-different-test-trategies.png b/...-system-management/images/Acto-state-transition-of-different-test-trategies.png
diff --git a/...-for-operation-correctness-of-cloud-system-management/images/featured-image.png b/...-for-operation-correctness-of-cloud-system-management/images/featured-image.png
diff --git a/...nd-to-end-testing-for-operation-correctness-of-cloud-system-management/index.md b/...nd-to-end-testing-for-operation-correctness-of-cloud-system-management/index.md
@@ -0,0 +1,135 @@
+---
+title: "[Paper Reading] Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management"
+date: "2024-07-03"
+keywords: ""
+comment: true
+weight: 0
+author:
+  name: "Jun"
+  link: "https://github.com/kongjun18"
+  avatar: "/images/avatar.jpg"
+license: "All rights reserved"
+tags:
+- Distributed System
+- Reliability
+
+categories:
+- Distributed System
+- Reliability
+
+hiddenFromHomePage: false
+hiddenFromSearch: false
+
+summary: ""
+resources:
+- name: featured-image
+  src: images/featured-image.png
+- name: featured-image-preview
+  src: images/featured-image.png
+
+toc:
+  enable: true
+math:
+  enable: false
+lightgallery: false
+seo:
+  images: []
+
+repost:
+  enable: true
+  url: ""
+---
+
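+## Background
+Many systems deployed on modern cloud platforms such as Kubernetes rely on operators instead of manual deployment, but these operators usually lack thorough e2e tests, which seriously undermines the reliability of the distributed systems they manage.
+
+Writing comprehensive e2e tests by hand is largely infeasible for the following reasons:
+1. Developers struggle to construct good test cases in a huge state space. Hand-written e2e tests usually start from an ideal initial state and reach the final state in a single step (one spec change). Such tests cannot cover enough state transitions.
+2. The operator's developers and the managed system's developers are often different people, so operator developers rarely have enough knowledge to write complete e2e tests.
+3. The operator's reconcile loop involves a large number of state transitions, some of which depend on details of the managed system.
+
+The paper builds Acto, a framework that automatically generates e2e tests for operators. It found many bugs in the operators of popular systems, some of which were even caused by bugs in Kubernetes and the Go runtime.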
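+## Design
+Acto owes much to a distinctive trait of Kubernetes, namely that it is state-centric:
+1. Kubernetes is state-centric: users only declare the desired state through the declarative API, and Kubernetes and the operator drive the system from any starting state to the desired state through the reconcile loop.
+2. Many properties in an operator specification can be reduced to properties of Kubernetes built-in [objects](https://kubernetes.io/zh-cn/docs/concepts/overview/working-with-objects/).
+
+These two traits led to Acto:
+1. Acto is likewise state-centric: it judges whether the operator is correct by inspecting the Kubernetes state, and drives e2e tests through state transitions (changes to the operator spec).
+2. By reducing operator specification properties to properties of Kubernetes core resources, Acto generates operator e2e tests generically (without depending on a specific operator).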
+
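+Acto comes in two versions:
+1. Black-box: needs only the operator's interface (the operator spec), not its source code.
+2. White-box: needs both the operator's interface and its source code.
+
+Acto drives e2e tests by repeatedly performing state transitions. A state consists of the current state and the desired state set by the operator spec, and a transition is triggered simply by modifying the operator spec. This raises five key questions:
+1. What counts as a correct operator operation?
+2. How are the system's properties determined?
+3. How are dependencies between properties determined?
+4. Which state should be transitioned to next?
+5. How can Acto tell whether the current state is correct?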
+
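+Because Acto is state-centric, it also describes operator correctness in terms of state:
+1. The operator can drive the managed system from any correct state to the desired state.
+2. If the system enters an erroneous or unexpected state, the operator rolls back to the last correct state.
+3. The operator can cope with misoperations (the user supplies a wrong operator spec, e.g. a replica count of -1).
+
+Acto applies to all operators by reducing operator spec properties to properties of Kubernetes core resources. Acto directly modifies the properties of Kubernetes core resources to drive state transitions.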
+
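+Operator spec properties are reduced to Kubernetes core resources as follows:
+- Acto black-box: inferred from the naming of operator spec properties and Kubernetes core resource properties.
+For example, `cassandraDataVolumeClaimSpec` in the Cassandra CRD has the same structure as `VolumeClaimTemplates` in a Kubernetes StatefulSet, so Acto reduces `cassandraDataVolumeClaimSpec` to `VolumeClaimTemplates` and modifies the properties of `VolumeClaimTemplates` during state transitions.
+- Acto white-box: semantic analysis reduces the operator spec to Kubernetes core resources.
+Note that not every property of every operator spec can be mapped to a Kubernetes core resource by naming alone, but most can.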
+![](images/Acto-property-mapping.png)
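+
+The black-box mapping idea above can be sketched roughly as follows. This is a minimal illustration, not Acto's actual algorithm: the helper names (`tokens`, `guess_mapping`), the scoring rule, and the toy schemas are all assumptions made for this example.
+

```python
import re

def tokens(name):
    """Split a camelCase property name into lowercase word tokens."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z0-9]+", name)}

def field_names(schema):
    """Immediate sub-property names of an object schema (empty for scalars)."""
    return set(schema.get("properties", {}))

def guess_mapping(cr_name, cr_schema, core_fields):
    """Guess which core-resource field a custom-resource property maps to.

    Hypothetical scoring: one point per shared name token, plus two points
    when both schemas expose exactly the same sub-properties.
    Returns None when nothing matches at all.
    """
    best, best_score = None, 0
    for core_name, core_schema in core_fields.items():
        score = len(tokens(cr_name) & tokens(core_name))
        names = field_names(cr_schema)
        if names and names == field_names(core_schema):
            score += 2
        if score > best_score:
            best, best_score = core_name, score
    return best

# Toy schemas, heavily simplified for illustration.
pvc_like = {"type": "object", "properties": {"accessModes": {}, "resources": {}}}
core_fields = {
    "volumeClaimTemplates": pvc_like,
    "replicas": {"type": "integer"},
}
mapped = guess_mapping("cassandraDataVolumeClaimSpec", pvc_like, core_fields)
unmapped = guess_mapping("size", {"type": "integer"}, core_fields)
```

+
+With these toy inputs, `cassandraDataVolumeClaimSpec` shares the `volume` and `claim` tokens and the same sub-properties with `volumeClaimTemplates`, while `size` matches nothing and stays unmapped, which mirrors the case of properties Acto cannot interpret.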
+
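+Dependencies between properties are obtained in a similar way:
+- Acto black-box: inferred from naming.
+Typically, a property only takes effect when its `Enabled` sub-property is true. According to the paper's data, this heuristic covers 98.5% of dependencies.
+- Acto white-box: finds property dependencies through control-flow analysis.
+Acto white-box parses the source code, looking for cases where one property only takes effect when another property satisfies some condition.
+
+>[!NOTE]
+>Acto black-box's guesses are not always right, which causes false positives, but the paper reports a false positive rate of only 0.2%. This suggests most operators follow the patterns Acto relies on.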
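+
+As a rough sketch of the naming heuristic described above, the following walks a schema and marks every sibling of a boolean `enabled` flag as depending on that flag. The function name `infer_dependencies` and the toy schema are assumptions for illustration; the paper's actual rules may differ.
+

```python
def infer_dependencies(schema, path=""):
    """Map each property path to the `enabled` flag it is assumed to depend on.

    Heuristic (an assumption of this sketch): inside an object that has a
    boolean sub-property named `enabled` (case-insensitive), every other
    sub-property only takes effect when that flag is true.
    """
    deps = {}
    props = schema.get("properties", {})
    flag = next(
        (n for n, s in props.items()
         if n.lower() == "enabled" and s.get("type") == "boolean"),
        None,
    )
    for name, sub in props.items():
        child = f"{path}.{name}" if path else name
        if flag is not None and name != flag:
            deps[child] = f"{path}.{flag}" if path else flag
        deps.update(infer_dependencies(sub, child))
    return deps

# Toy operator spec schema.
spec_schema = {
    "type": "object",
    "properties": {
        "replicas": {"type": "integer"},
        "backup": {
            "type": "object",
            "properties": {
                "enabled": {"type": "boolean"},
                "schedule": {"type": "string"},
            },
        },
    },
}
dependencies = infer_dependencies(spec_schema)
```

+
+Here `backup.schedule` is reported as depending on `backup.enabled`, while `replicas` has no dependency. A test generator could use such a map to set the flag before changing the dependent property.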
+
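+Acto explores the state space with three test strategies:
+- Single operation: one operator spec change reaches the final state.
+- Operation sequence: several operator spec changes are needed to reach the final state.
+- Rollback: on reaching an error state, roll back to the last correct state and continue testing.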
+
+![](images/Acto-state-transition-of-different-test-trategies.png)
+
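+During a state transition, Acto generates property values with the following strategy:
+- Change one property at a time.
+- Prefer properties that have not been changed yet, so that every property is changed at least once during testing.
+- Property values are scenario-driven, e.g. scale up and then scale down.
+- Some properties cannot be mapped to Kubernetes core resources. For these properties that Acto does not understand, it generates values that satisfy the syntax and constraints, without regard to whether they make semantic sense.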
+![](images/Acto-builtin-scenarios.png)
+>[!NOTE]
+>A Kubernetes operator specification comes with a schema that records the type and constraints of each property. Acto generates syntactically valid, constraint-respecting values from this schema.
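+
+Schema-driven value generation can be sketched as below. This is a deliberately deterministic toy, assuming an OpenAPI-style schema with `type`, `enum`, `minimum`, `maximum`, `minLength` and `minItems` keywords; `generate_value` is a hypothetical helper, not Acto's generator.
+

```python
def generate_value(schema):
    """Produce one value that satisfies the schema's type and constraints.

    Only syntactic validity is considered; whether the value makes semantic
    sense for the managed system is ignored, as for unmapped properties.
    """
    if "enum" in schema:
        return schema["enum"][0]
    kind = schema.get("type")
    if kind == "integer":
        value = schema.get("minimum", 0)
        if "maximum" in schema:
            value = min(value, schema["maximum"])
        return value
    if kind == "boolean":
        return True
    if kind == "string":
        return "a" * schema.get("minLength", 1)
    if kind == "array":
        item_schema = schema.get("items", {})
        return [generate_value(item_schema) for _ in range(schema.get("minItems", 1))]
    if kind == "object":
        return {k: generate_value(v) for k, v in schema.get("properties", {}).items()}
    return None  # unknown or missing type

example = generate_value({
    "type": "object",
    "properties": {
        "replicas": {"type": "integer", "minimum": 1, "maximum": 5},
        "mode": {"type": "string", "enum": ["standalone", "cluster"]},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 2},
    },
})
```

+
+A real generator would presumably randomize within the constraints and also emit boundary or invalid values to exercise misoperation handling; this sketch always picks the smallest valid value to keep the behaviour easy to follow.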
+
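+The Acto oracles check whether the current state matches the expected state, mainly by three means:
+- Consistency Oracle: checks whether the operator's view agrees with the Kubernetes view; a mismatch means the operator is wrong.
+The typical scenario is that the operator believes the desired state has been reached and stops reconciling, while the Kubernetes state has not actually reached it. For example, RedisOp believes `minAvailable` is already 2 (in which case `redis-follower` in the Kubernetes `PodDisruptionBudget` must be non-empty), but in fact `redis-follower` in the `PodDisruptionBudget` is `null`.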
+![](images/Acto-consistency-oracle.png)
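+- Differential Oracle: the same operation must reach the same expected state from different starting states; otherwise the operator has a bug.
+Such failures mainly stem from an incomplete reconcile loop that can only reach the target state from specific starting states, or from a failed rollback.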
+![](images/Acto-differential-oracle.png)
+- Normal Check: checks status codes, error messages in logs, exceptions thrown by the system, and so on.
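+
+The consistency and differential oracles above can be illustrated with a toy model in which states are plain dictionaries. Everything here (the function names, the `mapping` argument, and the deliberately buggy `buggy_operator`) is hypothetical and only meant to show the shape of the checks.
+

```python
def consistency_oracle(spec, cluster_state, mapping):
    """Return the spec properties whose declared value differs from the
    value observed in the cluster for the mapped core-resource field."""
    return [
        prop for prop, field in mapping.items()
        if spec.get(prop) != cluster_state.get(field)
    ]

def differential_oracle(apply_operation, start_states, operation):
    """Return True if the same operation leads to the same end state
    from every start state."""
    end_states = [apply_operation(state, operation) for state in start_states]
    return all(end == end_states[0] for end in end_states)

def buggy_operator(state, operation):
    """Toy reconcile step with a bug: it only sets fields that are unset."""
    new_state = dict(state)
    for key, value in operation.items():
        new_state.setdefault(key, value)
    return new_state

# Consistency: the spec says minAvailable=2, but the cluster field is still unset.
mismatches = consistency_oracle(
    {"minAvailable": 2},
    {"pdb.minAvailable": None},
    {"minAvailable": "pdb.minAvailable"},
)

# Differential: the end state depends on where the operator started.
same_end_state = differential_oracle(
    buggy_operator,
    [{}, {"minAvailable": 1}],
    {"minAvailable": 2},
)
```

+
+In this toy run the consistency check flags `minAvailable`, and the differential check fails because the buggy operator leaves `minAvailable` at 1 when it was already set.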
+
+When a test fails, Acto records a snapshot and generates minimized test code to reproduce the bug.
+
+Also worth noting: Acto has a plugin mechanism for extending its functionality, for example letting users specify the meaning of a particular operator spec property.
+
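+## Evaluation
+Acto found a large number of bugs, and it takes only 8 hours to run all e2e tests for one operator.
+
+Acto black-box may produce false positives when property guesses fail, but the false positive rate is only 0.2%. Acto white-box produced no false positives.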
+
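+## Limitations
+
+- It can only test a single operator, whereas a real system may be managed by several operators.
+- It cannot inject faults, so it only tests operator correctness in an ideal environment.
+- It is entirely state-centric and cannot test whether the managed system itself behaves correctly. The state may be correct while an operator bug makes the managed system misbehave, for example by violating its specified consistency guarantees.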
+
+
diff --git a/...p-compiler-and-allocator-based-heap-memory-protection/images/featured-image.png b/...p-compiler-and-allocator-based-heap-memory-protection/images/featured-image.png
diff --git a/content/posts/camp-compiler-and-allocator-based-heap-memory-protection/index.md b/content/posts/camp-compiler-and-allocator-based-heap-memory-protection/index.md
@@ -0,0 +1,113 @@
+---
+title: "[Paper Reading] CAMP: Compiler and Allocator-based Heap Memory Protection"
+date: "2023-11-20"
+keywords: ""
+comment: true
+weight: 0
+author:
+  name: "Jun"
+  link: "https://github.com/kongjun18"
+  avatar: "/images/avatar.jpg"
+license: "All rights reserved"
+tags:
+- Security
+
+categories:
+- Security
+
+hiddenFromHomePage: false
+hiddenFromSearch: false
+
+summary: ""
+resources:
+- name: featured-image
+  src: images/featured-image.png
+- name: featured-image-preview
+  src: images/featured-image.png
+
+toc:
+  enable: true
+math:
+  enable: false
+lightgallery: false
+seo:
+  images: []
+
+repost:
+  enable: true
+  url: ""
+---
+
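+## Background
+Previous approaches to detecting memory safety bugs are usually carried out by the compiler or the memory allocator alone. The compiler understands program semantics but cannot act at run time; the memory allocator works at run time but is unaware of program semantics. This paper has the compiler and the memory allocator cooperate to detect memory safety bugs.
+
+## Goal
+Through compiler and allocator co-design, propose a faster and highly accurate detection method for two kinds of heap memory safety bugs: buffer overflow and use-after-free.
+
+The paper mainly targets C/C++ and requires that the source program contain no integer-to-pointer casts.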
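+## Method
+The compiler inserts memory check instructions and performs escape analysis at compile time; the memory allocator records allocation and deallocation information at run time; when execution reaches a memory check instruction, the memory safety check is performed.
+
+Buffer overflow detection:
+- The memory allocator records allocated memory regions at run time.
+- The compiler inserts a range check instruction before each memory access, carrying the base address, the accessed address, and the element type size.
+When execution reaches a range check instruction, the instruction (implemented in the allocator library) checks whether the accessed address lies within the allocated range and terminates the program if it is out of bounds.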
+
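+Use-after-free detection:
+- The compiler performs pointer escape analysis and inserts an escape instruction before each pointer copy.
+- The escape instruction takes the address of the pointer and the base address of the buffer it points to.
+- At run time, the escape instruction records these two arguments (the point-to relation, i.e. two pointers referring to the same memory) in the memory allocator.
+- On free, the memory allocator redirects every pointer that points to that buffer to a special region.
+- When the program later uses the freed pointer, it accesses the special region and the program terminates.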
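+
+The use-after-free steps above can be modelled in a few lines. This is a simulation under strong simplifying assumptions, not CAMP's implementation: pointer variables are modelled as named slots in a dictionary, every pointer refers to the buffer's base address, and `POISON` stands in for the special inaccessible region. The class and function names are hypothetical.
+

```python
POISON = -1  # stand-in for the special region that traps on access

class ToyAllocator:
    def __init__(self):
        self.next_addr = 0x1000
        self.points_to = {}  # buffer base -> names of pointer slots referring to it

    def malloc(self, size):
        base = self.next_addr
        self.next_addr += size
        self.points_to[base] = set()
        return base

    def escape(self, slots, slot, base):
        """Model of the inserted escape instruction: store the pointer and
        record the point-to relation (slot -> buffer) in the allocator."""
        slots[slot] = base
        self.points_to[base].add(slot)

    def free(self, slots, base):
        """Redirect every recorded pointer that still refers to the buffer."""
        for slot in self.points_to.pop(base):
            if slots.get(slot) == base:
                slots[slot] = POISON

def deref(slots, slot):
    """Model of a memory access through a pointer slot."""
    if slots[slot] == POISON:
        raise RuntimeError("use-after-free detected")
    return slots[slot]

allocator = ToyAllocator()
slots = {}
buf = allocator.malloc(16)
allocator.escape(slots, "p", buf)
allocator.escape(slots, "q", buf)  # q = p, a pointer copy
allocator.free(slots, buf)
```

+
+After `free`, both `p` and `q` hold the poison value, so any later `deref` raises instead of touching the freed buffer. In the real system the trap would come from the memory protection of the special region rather than an explicit check.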
+
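+## Implementation
+
+The compiler is a modified LLVM that implements the instruction insertion and escape analysis described above. The memory allocator is a modified tcmalloc.
+
+tcmalloc uses segregated free lists that point to spans of different size classes; a span is a contiguous memory region holding multiple objects of the same size. CAMP exploits tcmalloc's segregated lists for fast run-time support, including O(1) pointer range checks and efficient recording of point-to relations.
+
+When allocating memory, the CAMP allocator records the span's location and the element size within the span. Given a heap pointer, it can quickly compute the index within the span of the object the pointer refers to, and thus decide whether an access overflows the buffer. This lookup is clearly O(1).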
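+
+The constant-time range check can be sketched with integer arithmetic on a span descriptor. The `Span` fields and function names below are assumptions for illustration, and the sketch takes the span as given; in a real allocator it would first have to be looked up from the pointer.
+

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int        # address of the first object in the span
    object_size: int  # every object in a span has the same size
    count: int        # number of objects in the span

def object_bounds(span, ptr):
    """Return [lo, hi) of the object that `ptr` falls into, in O(1)."""
    index = (ptr - span.start) // span.object_size
    lo = span.start + index * span.object_size
    return lo, lo + span.object_size

def in_bounds(span, base_ptr, access_ptr, access_size):
    """True if the access [access_ptr, access_ptr + access_size) stays inside
    the object that `base_ptr` belongs to."""
    lo, hi = object_bounds(span, base_ptr)
    return lo <= access_ptr and access_ptr + access_size <= hi

span = Span(start=0x1000, object_size=32, count=4)
ok = in_bounds(span, 0x1020, 0x103C, 4)        # last 4 bytes of object 1
overflow = in_bounds(span, 0x1020, 0x103D, 4)  # crosses into object 2
```

+
+Because the object index comes from one subtraction and one division, the cost does not depend on how many objects are live, which is the property the note attributes to the span-based design.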
+
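+Each span maintains the point-to relations of its objects: pointers to an object are chained into that object's point-to list, which can be located from the object's index alone, so this is also O(1).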
+
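+## Review
+This is an interesting piece of work: through an ingenious compiler and allocator co-design, it achieves precise buffer overflow and use-after-free detection while keeping performance high.
+
+The work is only a prototype, but it has two clear advantages:
+1. It does not break the memory allocator's own memory safety bug detection.
+2. It requires no changes to the original code.
+
+If GCC/LLVM implemented this work in the compiler and the standard library allocator, it should significantly improve software memory safety.
+
+The work restricts the programming patterns allowed in the source code, e.g. no integer-to-pointer casts. Real C/C++ programs contain plenty of such anti-patterns, so detection on real programs will be somewhat less effective, but this is also a major problem of the C/C++ languages themselves.
+
+The paper hacks the compiler on top of LLVM, and LLVM serves as the backend for many languages. Could LLVM be modified directly so that this detection method applies to every LLVM-supported language? Perhaps the paper does not mention this because other languages have GC and largely do not suffer from these C/C++ problems?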
+
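+## Q&A
+- [x] How is high performance achieved?
+
+    Through cooperation between the compiler and the memory allocator.
+
+    Compiler: compile-time optimizations
+
+    - Use type information to reduce range checks
+
+    - Eliminate redundant instructions
+
+        - Repeated checks of the same memory address
+
+        - Checks of multiple addresses in the same memory region (determined by the compiler)
+
+    - Merge run-time calls: when checking multiple addresses in the same memory region (determined at run time), fetch the region's bounds once and compare each address against them, which avoids entering the library to fetch the bounds on every range check.
+
+    Memory allocator: segregated lists give O(1) lookup
+
+- [x] Why have a point-to cache, where point-to relations are first recorded in the cache and only flushed in batches to the allocator's point-to lists once the cache fills up?
+    1. To reduce the size of the point-to lists in the allocator. A program may repeatedly create the same point-to relation, e.g. in a loop; CAMP removes identical point-to relations in the cache, which shrinks the allocator's point-to lists. The paper is vague about the scenario; the gist is to cut duplicate point-to relations that need not be recorded.
+    2. CAMP considers switching from program code to allocator code (the paper calls this a context switch) to be fairly expensive, and batched writes reduce the number of such switches.
+
+- [x] What is an in-bound overflow?
+    The paper is vague here; the following is my understanding.
+    An in-bound overflow is an access that falls inside the memory chunk but outside the object the program allocated. For example, the program allocates a 16-byte array but the allocator hands out 32 bytes of memory; an access outside the array but inside the chunk is an in-bound overflow. An in-bound overflow modifies data outside the object, possibly the allocator's own metadata.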
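+
+The point-to cache from the second question can be sketched as a small deduplicating buffer that hands relations to the allocator in batches. The class name, the `flush` callback, and the "latest relation per pointer wins" rule are assumptions of this sketch; the notes only state that identical relations are removed and that writes are batched.
+

```python
class PointToCache:
    def __init__(self, capacity, flush):
        self.capacity = capacity
        self.flush = flush   # callback standing in for the allocator-side recording
        self.pending = {}    # pointer address -> buffer base

    def record(self, ptr_addr, base):
        """Record a point-to relation; identical relations collapse into one."""
        self.pending[ptr_addr] = base
        if len(self.pending) >= self.capacity:
            self.drain()

    def drain(self):
        """Hand all pending relations to the allocator in a single call."""
        if self.pending:
            self.flush(list(self.pending.items()))
            self.pending = {}

batches = []
cache = PointToCache(capacity=2, flush=batches.append)
for _ in range(100):              # a loop re-creating the same relation
    cache.record(0x10, 0x1000)
flushed_during_loop = len(batches)
cache.record(0x18, 0x1000)        # second distinct pointer fills the cache
```

+
+In this run the hundred identical relations produce no allocator call at all, and the allocator is entered once with two relations when the cache fills, which illustrates both reasons given in the answer.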
+
+
diff --git a/...eliable-scalable-and-high-performance-distributed-storage/images/Ceph-CRUSH.png b/...eliable-scalable-and-high-performance-distributed-storage/images/Ceph-CRUSH.png
diff --git a/...-scalable-and-high-performance-distributed-storage/images/Ceph-architecture.png b/...-scalable-and-high-performance-distributed-storage/images/Ceph-architecture.png
diff --git a/...alable-and-high-performance-distributed-storage/images/Ceph-dynamic-subtree.png b/...alable-and-high-performance-distributed-storage/images/Ceph-dynamic-subtree.png
diff --git a/...eph-reliable-scalable-and-high-performance-distributed-storage/images/featured-image.webp b/...eph-reliable-scalable-and-high-performance-distributed-storage/images/featured-image.webp