Seal January paper digests
paul356 committed Feb 1, 2024
1 parent 7a97060 commit 6445a85
Showing 4 changed files with 46 additions and 28 deletions.
17 changes: 11 additions & 6 deletions _org/2022-12-31-spark.org
@@ -99,22 +99,27 @@ SparkCatalog also depends on some other classes to implement its functionality
- *DataSourceV2Relation* (table: SparkTable, catalog: SparkCatalog, ...)
- V2ScanRelationPushDown.apply(plan: LogicalPlan) ???
- createScanBuilder(plan: LogicalPlan): LogicalPlan
- pushDownFilters(plan: LogicalPlan): LogicalPlan
- PushDownUtils.pushFilters(scanBuilder: ScanBuilder, filters: Seq[Expression])
- SparkScanBuilder.pushFilters(filters: Filter[])
- pruneColumns(plan: LogicalPlan): LogicalPlan
- *DataSourceV2ScanRelation* (relation: DataSourceV2Relation, scan: Scan, output: Seq[AttributeReference])
- SparkBatchQueryScan <-- SparkScanBuilder.build
- DataSourceV2Strategy.apply
- BatchScanExec(output: Seq[AttributeReference], scan: Scan, runtimeFilters: Seq[Expression], keyGroupedPartitioning: Option[Seq[Expression]])
- val inputRDD: RDD[InternalRow]
- val filteredPartitions: Seq[Seq[InputPartition]]
- SparkBatchQueryScan.filter(filters: Filter[])
- SparkPartitioningAwareScan.tasks()
- SnapshotScan.planFiles()
- DataTableScan.doPlanFiles()
- Snapshot.dataManifests(io)
- Snapshot.deleteManifests(io)
- ManifestGroup.planFiles()
- ManifestGroup.plan(ManifestGroup::createFileScanTasks)
- SparkBatch.planInputPartitions()
- input parameter taskGroups = SparkPartitioningAwareScan.taskGroups()
- SparkPartitioningAwareScan.tasks()
- SnapshotScan.planFiles()
- DataTableScan.doPlanFiles()
- Snapshot.dataManifests(io)
- Snapshot.deleteManifests(io)
- ManifestGroup.planFiles()
- ManifestGroup.plan(ManifestGroup::createFileScanTasks)
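The filter-pushdown handshake in the call tree above can be modeled with a small, self-contained sketch. This is a toy Python analogue of the `PushDownUtils.pushFilters` / `SparkScanBuilder.pushFilters` interaction, not Spark's actual API; every name in it is illustrative.

```python
# Toy model of the DataSourceV2 pushdown handshake: the scan builder keeps
# the filters the source can evaluate and hands back the rest, which the
# engine must re-apply after the scan. All names here are illustrative.

class ToyScanBuilder:
    """Stand-in for a source-side scan builder with partial filter support."""
    SUPPORTED_COLUMNS = {"id", "ts"}

    def __init__(self):
        self.pushed = []

    def push_filters(self, filters):
        # Returns the filters the source cannot evaluate; the caller keeps
        # them as post-scan filters (mirrors the pushFilters contract).
        unsupported = []
        for col, op, val in filters:
            if col in self.SUPPORTED_COLUMNS:
                self.pushed.append((col, op, val))
            else:
                unsupported.append((col, op, val))
        return unsupported

builder = ToyScanBuilder()
post_scan = builder.push_filters([("id", ">", 10), ("name", "=", "x")])
print(builder.pushed)   # [('id', '>', 10)]  -- evaluated by the source
print(post_scan)        # [('name', '=', 'x')] -- evaluated after the scan
```

The split return value is the key design point: the engine only drops a filter from its own plan once the source has claimed it, so correctness never depends on what the source chooses to push.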

** Iceberg Table Spec
- Table Metadata (json file)
6 changes: 5 additions & 1 deletion _org/2024-01-02-jan-papers.org
@@ -24,6 +24,10 @@ nav_order: {{ page.date }}
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | This paper improves model performance by breaking the limit of left-to-right Transformer architectures and learning from the whole text context. To prevent the model from directly copying the token it must predict, the authors randomly mask 15% of the input tokens and predict the masked ones. In this way they train a language model that learns representations from the full context. | arXiv 2018 | BERT, Language Model, GPT |
| Eddies: Continuously Adaptive Query Processing | Ron Avnur, Joseph M. Hellerstein | This paper argues that instead of trying to find an optimal query plan up front, the system should reorder joins during pipeline execution. The Eddy module routes tuples to operators based on operator availability, aiming to reduce processing time while maintaining correctness. | SIGMOD 2000 | Adaptive Query Processing |
| Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs | Jinyang Li, Binyuan Hui, Xuanhe Zhou, Kevin C.C. Chang, Reynold Cheng, Yongbin Li, Guoliang Li | This paper presents BIRD, a text-to-SQL benchmark for LLM-based methods. BIRD collects tables from many different domains, which tests the generalization ability of the evaluated models, and it also accounts for the irregular value types found in real databases. The new benchmark shows that LLM-based text-to-SQL methods are still inferior to humans, leaving room for further research. | NeurIPS 2023 | Large Language Model, BIRD, Text-to-SQL |
| Kepler: Robust Learning for Faster Parametric Query Optimization | Lyric Doshi, Vincent Zhuang, Gaurav Jain, Ryan Marcus | This paper introduces Kepler, a method for parametric query optimization. It uses Row Count Evolution, which perturbs cardinality estimates to generate a set of candidate plans from a base plan optimizer. It then trains a neural network per query template to classify the best query plan. The network also produces a confidence value, and the model falls back to the original plan when confidence is low. | SIGMOD 2023 | Row Count Evolution, Plan Optimization, Parametric Query Optimization |
| Parametric Query Optimization | Yannis E. Ioannidis, Raymond T. Ng, Kyuseok Shim, Timos K. Sellis | This paper introduces how to optimize query plans with respect to a given set of parameter values and how to use randomized algorithms to find the optimal plan for each parameter value. It also introduces a technique called Sideways Information Passing, which can optimize queries for a large number of buffer sizes in the same time the conventional method needs for one buffer size. | VLDB 1992 | Parametric Query Optimization |
| Bao: Making Learned Query Optimization Practical | Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska | This paper introduces Bao (the Bandit Optimizer), a query optimizer built on top of a traditional query optimizer. Bao uses optimizer hints to steer the traditional optimizer toward better query plans. To select the best hints, it models the selection problem as a multi-armed bandit problem and uses Thompson Sampling to train an evaluation network of the same form as Neo's. The experiments show that Bao overcomes many limits to practical application in previous learned optimizers, and it even surpasses the original optimizer in tail performance. | SIGMOD 2021 | Multi-armed Bandit Problem, Thompson Sampling |
| *Balsa: Learning a Query Optimizer Without Expert Demonstrations* | Zongheng Yang, Wei-Lin Chiang, Sifei Luan, Gautam Mittal, Michael Luo, Ion Stoica | This paper introduces Balsa, a learned query optimizer that needs no expert demonstrations. Balsa bootstraps itself from a simple cost estimator built on the PostgreSQL cardinality estimator, then uses on-policy reinforcement learning to learn from real execution latencies. The method can be seen as an improvement on Neo. With novel techniques such as Diversified Experiences and Multi-agent Training, Balsa can explore distinct plans and generate query plans that outperform expert-tuned plans and state-of-the-art methods like Bao. | SIGMOD 2022 | Learned Query Optimization, Machine Learning for Systems |
| | | | | |
|-----------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+----------------------------------------------------------------------------------------------------------------|
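The hint-selection idea in the Bao row above reduces to a multi-armed bandit solved with Thompson Sampling. The sketch below is a minimal Beta-Bernoulli version, assuming each "arm" is a hint set and the reward is 1 when the hinted plan beats the default; Bao itself uses a learned value network rather than these closed-form posteriors, so this only illustrates the sampling strategy.

```python
import random

# Minimal Thompson Sampling over 3 hypothetical hint sets: sample a
# win-probability from each arm's Beta posterior, pull the argmax arm,
# then update that arm's win/loss counts with the observed reward.

def thompson_select(wins, losses):
    # Beta(w+1, l+1) is the posterior under a uniform prior.
    samples = [random.betavariate(w + 1, l + 1)
               for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=lambda i: samples[i])

random.seed(0)
true_win_rate = [0.2, 0.7, 0.5]      # hidden quality of each hint set
wins, losses = [0, 0, 0], [0, 0, 0]
for _ in range(2000):
    arm = thompson_select(wins, losses)
    if random.random() < true_win_rate[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

best = max(range(3), key=lambda i: wins[i])
print(best)  # the arm with the highest true win rate dominates the pulls
```

The appeal for query optimization is that exploration is built in: arms with uncertain posteriors still get sampled occasionally, so a hint set that looks bad after a few unlucky queries is not discarded forever.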

17 changes: 11 additions & 6 deletions _posts/2022-12-31-spark.md
@@ -111,22 +111,27 @@ SparkCatalog also depends on some other classes to implement its functionality
- **DataSourceV2Relation** (table: SparkTable, catalog: SparkCatalog, &#x2026;)
- V2ScanRelationPushDown.apply(plan: LogicalPlan) ???
- createScanBuilder(plan: LogicalPlan): LogicalPlan
- pushDownFilters(plan: LogicalPlan): LogicalPlan
- PushDownUtils.pushFilters(scanBuilder: ScanBuilder, filters: Seq[Expression])
- SparkScanBuilder.pushFilters(filters: Filter[])
- pruneColumns(plan: LogicalPlan): LogicalPlan
- **DataSourceV2ScanRelation** (relation: DataSourceV2Relation, scan: Scan, output: Seq[AttributeReference])
- SparkBatchQueryScan <&#x2013; SparkScanBuilder.build
- DataSourceV2Strategy.apply
- BatchScanExec(output: Seq[AttributeReference], scan: Scan, runtimeFilters: Seq[Expression], keyGroupedPartitioning: Option[Seq[Expression]])
- val inputRDD: RDD[InternalRow]
- val filteredPartitions: Seq[Seq[InputPartition]]
- SparkBatchQueryScan.filter(filters: Filter[])
- SparkPartitioningAwareScan.tasks()
- SnapshotScan.planFiles()
- DataTableScan.doPlanFiles()
- Snapshot.dataManifests(io)
- Snapshot.deleteManifests(io)
- ManifestGroup.planFiles()
- ManifestGroup.plan(ManifestGroup::createFileScanTasks)
- SparkBatch.planInputPartitions()
- input parameter taskGroups = SparkPartitioningAwareScan.taskGroups()
- SparkPartitioningAwareScan.tasks()
- SnapshotScan.planFiles()
- DataTableScan.doPlanFiles()
- Snapshot.dataManifests(io)
- Snapshot.deleteManifests(io)
- ManifestGroup.planFiles()
- ManifestGroup.plan(ManifestGroup::createFileScanTasks)


## Iceberg Table Spec
