diff --git a/_org/2022-12-31-spark.org b/_org/2022-12-31-spark.org index 81df37d..9519625 100644 --- a/_org/2022-12-31-spark.org +++ b/_org/2022-12-31-spark.org @@ -76,6 +76,21 @@ SparkCatalog also relies on several other classes to implement its functionality - OutputFileFactory - Factory responsible for generating unique but recognizable data file names and file objects. - SparkFileWriteFactory - Return data file writer for Parquet, ORC or AVRO +** Create QueryExecution +- SessionState.executePlan(logicalPlan: LogicalPlan) + +** Analyze UnresolvedRelation(multipartIdentifier, ...) +- Dataset.ofRows + - SessionState.executePlan(LogicalPlan) + - createQueryExecution(LogicalPlan, mode) + - QueryExecution.assertAnalyzed() + - lazy val analyzed + - RuleExecutor.execute(LogicalPlan) + - batches.foreach(...) + - ResolveRelations.apply + - ResolveRelations.lookupRelations + - new Dataset[Row](QueryExecution, ...) + ** TODO Support a tiered data lake format (HTAP) - Organize files in an LSM structure + Lay out the first level using tiering diff --git a/_org/2023-12-1-december-papers.org b/_org/2023-12-1-december-papers.org index 70df29d..aec02c0 100644 --- a/_org/2023-12-1-december-papers.org +++ b/_org/2023-12-1-december-papers.org @@ -8,23 +8,24 @@ nav_order: {{ page.date }} --- #+END_EXPORT -|---------------------------------------------------------------------------------------------------+------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+-----------------------------------------------------------------------| -| Title | Authors | Synthesis | Publisher | Keywords | 
-|---------------------------------------------------------------------------------------------------+------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+-----------------------------------------------------------------------| -| Neural Packet Classification | Eric Liang, Ion Stoica | This paper proposes using RL to construct a decision tree for packet classifiers, and it shows how to model formulate the MDP for constructing the decision tree. | SIGCOMM 2019 | Reinforcement Learning, Decision Tree | -| Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data | Meghdad Kurmanji, Peter Triantafillou | This paper presents a learning framework called DDUp for detecting OODs and updating learned database components. A statistical test is used to test the hypothesis that the new data obey the same distribution as the learned model. Knowledge Distillation is used to transfer knowledge from the old model to the new model if the hypothesis is rejected. | SIGMOD 2023 | Knowledge Distillation, Transfer Learning | -| From Large Language Models to Databases and Back - A discussion on research and education | Sihem Amer-Yahia, .etc | A panel discussion on LLM and database research. They discussed about the potential impact of LLM towards databases, and what are advantages and limitations of incoperating LLM in databases. | DASFAA 2023 | LLM, Database, ChatGPT | -| *Fine-grained Partitioning for Aggressive Data Skipping* | Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin | This paper presents a feature based partitioning framework. A number of features are selected to calculate a feature vector for each tuple. 
The conjunction of all feature vectors in a block is used for data skipping. And data are partitioned to maximize the data skipping gain. The contributions are (1) a workload analyzer, which generates a set of features from a query log, (2) a partitioner, which computes a blocking scheme by solving a optimization problem, (3) a feature-based block skipping framework used in query execution. | SIGMOD 2014 | Partitioning, NP-Hard | -| Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing | Guido Moerkotte | This paper shows how to speed up data warehouses with summaries called Small Materialized Aggregates. These SMAs are lightweight and easy to generate. They are similiar to views or data cubes but only compact information is generated for each data bucket. | VLDB 1998 | SMA, Data Cube, Data Warehouse | -| Different Cube Computation Approaches: Survey Paper | Dhanshri S. Lad, Rasika P. Saste | This paper survey the different algorithms to compute data cubes. The authors also propose use MR to speed up the data cube computation. | IJCSIT 2014 | Data Cube, Mapreduce | -| High-Diemnsional OLAP: A Minimal Cubing Approach | Xiaolei Li, Jiawei Han, Hector Gonzalez | This paper proposes to decompose the data cube computation by precomputing small sized groups called fragements and a value-list inverted index. All dimensions are divided into 3/4 dimension groups called fragments. For each fragments all data cubes are computed as lists of tuple ids using the inverted index. This paper also shows how to serve point queries and subcube queries with these fragments. | VLDB 2004 | Data Cube, Shell Fragment, OLAP | -| *Data Cube: A Relational Aggregation Operator - Generalizing Group-By, Cross-Tab, and Sub-Totals* | Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh | This visionary paper explains to us what are data cubes and why we need them in analytics. 
It also shows us how to generate data cubes with the SQL group-by, proposes what enhancements we need make to group-bys. And it categorifies the aggregate functions. It shows how to compute data cubes with distributive and algebraic aggregates. | Data Mining and Knowledge Discovery 19997 | Data Cube, Group-By, Cross-Tab, Roll-Up, Drill-Down | -| Self-Organizing Data Containers | Samuel Madden, Jialin Ding, Tim Kraska, .etc | The authors envision a kind of systems called Self-Organizing Data Containers which employ a disaggregated and open system architechture. They name a few characteristics of these system - supporting efficient indexing, supporting concurrent acesses, supporting data envolving. They implement a prototype and compare with Delta Lake. The authors also point the directions - using replications to optimize the physical layout, incremental changes and auto-optimizations - for future research. | CIDR 2022 | Self-Organizing Data Container, Cloud Storage, Amazon S3 | -| Instance-Optimized Data Layouts for Cloud Analytics Workloads | Jialin Ding, Umar Farooq Minhas, Tim Kraska, .etc | This paper proposes a method called MTO which optimizes QD-Tree through Sideways Information Passing. It add Join-induced Predicates into QD-Tree. Effectively it should perform better than single table optimization. But Join-induced Predicates also poses a challenge when new data are added or old data are updated. Join-induced Predicates need to be refreshed and adjusted. | SIGMOD 2021 | QD-Tree, Sideways Information Passing, Instance-Optimized Data Layout | -| Tsunami: A Learned Multi-dimensional - Index for Correlated Data and Skewed Workloads | Jialin Ding, Umar Farooq Minhas, Tim Kraska, .etc | This paper proposes a learned multi-dimensional index called Tsunami which is an improved successor to Flood, another learned multi-dimentional index. 
The authors observe that there are query skew and data correlation which pose challenges to both tranditional multi-dimentional indices and Flood. This paper invents two structures - Grid Tree and Argumented Grid. Each structure is constructured as an optimization problem. The authors also formulate the optimization goals. | PVLDB Vol 14, No. 2, 2020 | Learned Index, Multi-dimensional Index, Skewed Workload | -| SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis | Teng Zhang, Jianling Sun, .etc | This paper presents SA-SLM which uses survival analysis to predict data access events per record. This ssystem employs a proactively compaction srategy to move cold data to slow media. The authors claim that they can reduce tail latency by up to 78.9% using SA-SLM. | PVLDB Vol 15 Issue 10, 2022 | LSM-Tree, Survival Analysis, Random Forest | -| Tiresias: Enabling Predictive Autonomous Storage and Indexing | Michael Abebe, Horatiu Lazu, Khuzaima Daudjee | This paper presents the Tiresias method which combine workload prediction and autonomous storage and indexing. The authors introduce how to predict future workoad and estimate plan cost under a specific storage layout. They also give a heuristic method to calculate the benefit of employing a storage change. Though this paper doesn't say how to propose a storage change I guess they should have a small fixed set of storage choices to select from. | SIGMOD 2021 | Autonomous Storage, Indexing, Workload Prediction | -| Stochastic Database Cracking: Towards Robust Adaptive - Indexing in Main-Memory Column-Stores | Felix Halim, Stratos Idreos, Panagiotis Karras, Roland H. C. Yap | This paper extends the origional idea of database cracking by introducing stochastic cracks. The origional database cracking only cracks a column exactly based on data predicates. This paper shows that for sequential workload the origional cracking method has no optimization compared to random workload. 
In order to address this problem this paper propose two different algorithms - DDC and DDR. The difference is that DDC always tries to cut in the center which requires a cost of finding the median, but DDR chooses to cut randomly. They also devise other variants with more lightweight initial cost based on these two algorithms. | PVLDB 2012 | Database Cracking, Column Store, Adaptive Indexing | -| | | | | | -|---------------------------------------------------------------------------------------------------+------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+-----------------------------------------------------------------------| +|---------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+-----------------------------------------------------------------------| +| Title | Authors | Synthesis | Publisher | Keywords | +|---------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+-----------------------------------------------------------------------| +| Neural Packet Classification | Eric Liang, Ion Stoica | This paper proposes using RL to construct a decision tree for packet classifiers, and it shows how to model formulate 
the MDP for constructing the decision tree. | SIGCOMM 2019 | Reinforcement Learning, Decision Tree | +| Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data | Meghdad Kurmanji, Peter Triantafillou | This paper presents a learning framework called DDUp for detecting out-of-distribution (OOD) data and updating learned database components. A statistical test checks the hypothesis that the new data follow the same distribution the model was trained on. If the hypothesis is rejected, knowledge distillation is used to transfer knowledge from the old model to the new model. | SIGMOD 2023 | Knowledge Distillation, Transfer Learning | +| From Large Language Models to Databases and Back - A discussion on research and education | Sihem Amer-Yahia, et al. | A panel discussion on LLM and database research. The panelists discussed the potential impact of LLMs on databases, and the advantages and limitations of incorporating LLMs into databases. | DASFAA 2023 | LLM, Database, ChatGPT | +| *Fine-grained Partitioning for Aggressive Data Skipping* | Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin | This paper presents a feature-based partitioning framework. A number of features are selected to calculate a feature vector for each tuple. The union (bitwise OR) of all feature vectors in a block is used for data skipping, and data are partitioned to maximize the data skipping gain. The contributions are (1) a workload analyzer, which generates a set of features from a query log, (2) a partitioner, which computes a blocking scheme by solving an optimization problem, and (3) a feature-based block skipping framework used in query execution. | SIGMOD 2014 | Partitioning, NP-Hard | +| Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing | Guido Moerkotte | This paper shows how to speed up data warehouses with summaries called Small Materialized Aggregates (SMAs). These SMAs are lightweight and easy to generate. 
They are similar to views or data cubes, but only compact summary information is generated for each data bucket. | VLDB 1998 | SMA, Data Cube, Data Warehouse | +| Different Cube Computation Approaches: Survey Paper | Dhanshri S. Lad, Rasika P. Saste | This paper surveys the different algorithms for computing data cubes. The authors also propose using MapReduce to speed up data cube computation. | IJCSIT 2014 | Data Cube, MapReduce | +| High-Dimensional OLAP: A Minimal Cubing Approach | Xiaolei Li, Jiawei Han, Hector Gonzalez | This paper proposes decomposing data cube computation by precomputing small groups called fragments together with a value-list inverted index. All dimensions are divided into groups of 3 or 4 dimensions called fragments. For each fragment, all data cubes are computed as lists of tuple ids using the inverted index. The paper also shows how to serve point queries and subcube queries with these fragments. | VLDB 2004 | Data Cube, Shell Fragment, OLAP | +| *Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals* | Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh | This visionary paper explains what data cubes are and why analytics needs them. It shows how to generate data cubes with the SQL group-by and proposes the enhancements that group-by needs. It also categorizes aggregate functions and shows how to compute data cubes with distributive and algebraic aggregates. | Data Mining and Knowledge Discovery 1997 | Data Cube, Group-By, Cross-Tab, Roll-Up, Drill-Down | +| Self-Organizing Data Containers | Samuel Madden, Jialin Ding, Tim Kraska, et al. | The authors envision a kind of system called Self-Organizing Data Containers, which employs a disaggregated and open system architecture. They name a few characteristics of these systems: support for efficient indexing, concurrent accesses, and data evolution. They implement a prototype and compare it with Delta Lake. 
The authors also point out directions for future research: using replication to optimize the physical layout, incremental changes, and auto-optimization. | CIDR 2022 | Self-Organizing Data Container, Cloud Storage, Amazon S3 | +| Instance-Optimized Data Layouts for Cloud Analytics Workloads | Jialin Ding, Umar Farooq Minhas, Tim Kraska, et al. | This paper proposes a method called MTO, which optimizes QD-Tree through Sideways Information Passing. It adds join-induced predicates to QD-Tree, so it should perform better than single-table optimization. However, join-induced predicates also pose a challenge when new data are added or old data are updated, because the predicates then need to be refreshed and adjusted. | SIGMOD 2021 | QD-Tree, Sideways Information Passing, Instance-Optimized Data Layout | +| Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads | Jialin Ding, Umar Farooq Minhas, Tim Kraska, et al. | This paper proposes a learned multi-dimensional index called Tsunami, an improved successor to Flood, another learned multi-dimensional index. The authors observe that query skew and data correlation pose challenges to both traditional multi-dimensional indices and Flood. The paper introduces two structures, Grid Tree and Augmented Grid. The construction of each structure is cast as an optimization problem, and the authors formulate the optimization goals. | PVLDB Vol 14, No. 2, 2020 | Learned Index, Multi-dimensional Index, Skewed Workload | +| SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis | Teng Zhang, Jianling Sun, et al. | This paper presents SA-LSM, which uses survival analysis to predict data access events per record. The system employs a proactive compaction strategy to move cold data to slow media. The authors claim that they can reduce tail latency by up to 78.9% using SA-LSM. 
| PVLDB Vol 15 Issue 10, 2022 | LSM-Tree, Survival Analysis, Random Forest | +| Tiresias: Enabling Predictive Autonomous Storage and Indexing | Michael Abebe, Horatiu Lazu, Khuzaima Daudjee | This paper presents the Tiresias method, which combines workload prediction with autonomous storage and indexing. The authors describe how to predict the future workload and estimate plan cost under a specific storage layout. They also give a heuristic for calculating the benefit of applying a storage change. The paper does not say how storage changes are proposed; my guess is that they are selected from a small fixed set of storage choices. | SIGMOD 2021 | Autonomous Storage, Indexing, Workload Prediction | +| Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores | Felix Halim, Stratos Idreos, Panagiotis Karras, Roland H. C. Yap | This paper extends the original idea of database cracking by introducing stochastic cracks. Original database cracking cracks a column only at the exact bounds given by query predicates. The paper shows that under a sequential workload the original cracking method provides little of the improvement it achieves under a random workload. To address this problem, the paper proposes two algorithms, DDC and DDR. The difference is that DDC always tries to cut in the center, which incurs the cost of finding the median, while DDR cuts at a random position. The authors also devise other variants of these two algorithms with a more lightweight initial cost. | PVLDB 2012 | Database Cracking, Column Store, Adaptive Indexing | +| Annotating Columns with Pre-trained Language Models | Yoshihiko Suhara, Jinfeng Li, Cagatay Demiralp, Chen Chen, Wang-Chiew Tan | This paper shows how to use a language model for representation learning to predict column types and column relations in a table. The authors show how to encode a table as one feature instead of using column-wise features, and how to combine these two tasks in one model. 
But they do not explain why the order of row and column values does not seem to matter much in their tests. | SIGMOD 2022 | Large Language Model, Column Annotation, Column Clustering | +| | | | | | +|---------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+-----------------------------------------------------------------------| diff --git a/_org/2023-12-26-esp32-micropython3_1.org b/_org/2023-12-26-esp32-micropython3_1.org new file mode 100644 index 0000000..3661cd8 --- /dev/null +++ b/_org/2023-12-26-esp32-micropython3_1.org @@ -0,0 +1,107 @@ +#+OPTIONS: ^:nil +#+BEGIN_EXPORT html +--- +layout: default +title: ESP32 for Makers (Part 4) - An Interlude While Controlling a Light with ESP32 and MQTT +tags: [esp32, MCU, micropython, MQTT] +nav_order: {{ page.date }} +sync_wexin: 1 +--- +#+END_EXPORT +* ESP32 for Makers (Part 4) - An Interlude While Controlling a Light with MicroPython and MQTT + +** A Small Interlude +The previous article showed how to connect to an MQTT broker with MicroPython. I have two ESP32 boards, and both run the main.py from that article. +#+begin_src python + import time + from umqtt.simple import MQTTClient + from machine import Pin + + # Publish test messages e.g. 
with: + # mosquitto_pub -t foo_topic -m hello + + server = "" + port_num = 9001 + switch_topic = b"home/bedroom/light" + p5 = Pin(5, Pin.OUT) + p5.off() + + # Turn the light on if the value is "on"; otherwise turn it off + def sub_cb(topic, msg): + if msg.decode().lower() == "on": + p5.on() + else: + p5.off() + + def main(): + print("main started ...") + c = MQTTClient("umqtt_client", server, port=port_num) + c.set_callback(sub_cb) + c.connect() + c.subscribe(switch_topic) + while True: + try: + if True: + # Blocking wait for message + c.wait_msg() + except Exception: + print("wait message fail") + c.disconnect() + + print("will enter main") + if __name__ == "__main__": + main() +#+end_src +However, I found that as soon as I powered on the second ESP32, the first one started failing and kept printing ~wait message fail~. At first I suspected that the MQTT broker did not allow multiple clients to subscribe to the same topic, but the documentation describes no such restriction, and the mosquitto_sub tool shows no such problem. Without having found the cause, I did what many programmers do and changed the code to work around the problem first. The modified code is below (it also needs ~import random~ at the top of main.py). As a result, both ESP32s reported an error about once per second and then reconnected; not perfect, but the light could still be switched on and off. +#+begin_src python + def main(): + print("main started ...") + connect = True + while True: + try: + if connect: + client_id = f"umqtt_client{random.randint(0, 1000)}" + c = MQTTClient("umqtt_client", server, port=port_num) + c.set_callback(sub_cb) + c.connect() + c.subscribe(switch_topic) + print("client ", client_id, " connects") + connect = False + if True: + # Blocking wait for message + c.wait_msg() + except Exception as e: + print("wait message fail, ", e) + c.disconnect() + time.sleep(1) + connect = True + c.disconnect() +#+end_src +Because this workaround was not perfect, I searched online to see whether anyone else had hit the same problem. Someone had indeed run into the same [[https://stackoverflow.com/questions/36184490/mqtt-client-disconnects-when-another-client-connects-to-the-server][problem]]. It turns out that Mosquitto keeps only one connection per client_id, and my main.py used the same client_id on both boards, so when the second ESP32 connected, the first one was disconnected. I changed the main function again as shown below, appending a random number to client_id so that the client_ids of different ESP32s will most likely differ. +#+begin_src python +def main(): + print("main started ...") + connect = True + while True: + try: + if connect: + client_id = f"umqtt_client{random.randint(0, 1000)}" + c = MQTTClient(client_id, server, port=port_num) + c.set_callback(sub_cb) + c.connect() 
+ c.subscribe(switch_topic) + print("client ", client_id, " connects") + connect = False + if True: + # Blocking wait for message + c.wait_msg() + except Exception as e: + print("wait message fail, ", e) + c.disconnect() + time.sleep(1) + connect = True + c.disconnect() +#+end_src + +** Summary +The lesson: it pays to read the API documentation carefully. diff --git a/_posts/2023-12-26-esp32-micropython3_1.md b/_posts/2023-12-26-esp32-micropython3_1.md new file mode 100644 index 0000000..2e1849a --- /dev/null +++ b/_posts/2023-12-26-esp32-micropython3_1.md @@ -0,0 +1,110 @@ +--- +layout: default +title: ESP32 for Makers (Part 4) - An Interlude While Controlling a Light with ESP32 and MQTT +tags: [esp32, MCU, micropython, MQTT] +nav_order: {{ page.date }} +sync_wexin: 1 +--- + + +# ESP32 for Makers (Part 4) - An Interlude While Controlling a Light with MicroPython and MQTT + + +## A Small Interlude + +The previous article showed how to connect to an MQTT broker with MicroPython. I have two ESP32 boards, and both run the main.py from that article. + + import time + from umqtt.simple import MQTTClient + from machine import Pin + + # Publish test messages e.g. with: + # mosquitto_pub -t foo_topic -m hello + + server = "" + port_num = 9001 + switch_topic = b"home/bedroom/light" + p5 = Pin(5, Pin.OUT) + p5.off() + + # Turn the light on if the value is "on"; otherwise turn it off + def sub_cb(topic, msg): + if msg.decode().lower() == "on": + p5.on() + else: + p5.off() + + def main(): + print("main started ...") + c = MQTTClient("umqtt_client", server, port=port_num) + c.set_callback(sub_cb) + c.connect() + c.subscribe(switch_topic) + while True: + try: + if True: + # Blocking wait for message + c.wait_msg() + except Exception: + print("wait message fail") + c.disconnect() + + print("will enter main") + if __name__ == "__main__": + main() + +However, I found that as soon as I powered on the second ESP32, the first one started failing and kept printing `wait message fail`. At first I suspected that the MQTT broker did not allow multiple clients to subscribe to the same topic, but the documentation describes no such restriction, and the mosquitto\_sub tool shows no such problem. Without having found the cause, I did what many programmers do and changed the code to work around the problem first. The modified code is below (it also needs `import random` at the top of main.py). As a result, both ESP32s reported an error about once per second and then reconnected; not perfect, but the light could still be switched on and off. + + def main(): + print("main started ...") + connect = True + while True: + try: + if connect: + client_id = f"umqtt_client{random.randint(0, 1000)}" + c = MQTTClient("umqtt_client", server, 
port=port_num) + c.set_callback(sub_cb) + c.connect() + c.subscribe(switch_topic) + print("client ", client_id, " connects") + connect = False + if True: + # Blocking wait for message + c.wait_msg() + except Exception as e: + print("wait message fail, ", e) + c.disconnect() + time.sleep(1) + connect = True + c.disconnect() + +Because this workaround was not perfect, I searched online to see whether anyone else had hit the same problem. Someone had indeed run into the same [problem](https://stackoverflow.com/questions/36184490/mqtt-client-disconnects-when-another-client-connects-to-the-server). It turns out that Mosquitto keeps only one connection per client\_id, and my main.py used the same client\_id on both boards, so when the second ESP32 connected, the first one was disconnected. I changed the main function again as shown below, appending a random number to client\_id so that the client\_ids of different ESP32s will most likely differ. + + def main(): + print("main started ...") + connect = True + while True: + try: + if connect: + client_id = f"umqtt_client{random.randint(0, 1000)}" + c = MQTTClient(client_id, server, port=port_num) + c.set_callback(sub_cb) + c.connect() + c.subscribe(switch_topic) + print("client ", client_id, " connects") + connect = False + if True: + # Blocking wait for message + c.wait_msg() + except Exception as e: + print("wait message fail, ", e) + c.disconnect() + time.sleep(1) + connect = True + c.disconnect() + + +## Summary + +The lesson: it pays to read the API documentation carefully. +
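Note on the final fix in the post: `random.randint(0, 1000)` makes a collision between two boards unlikely, not impossible. A minimal sketch of a deterministic alternative follows, assuming each board can supply a stable per-device byte string (on MicroPython, `machine.unique_id()` returns one). The helper name `make_client_id` and the `esp32` prefix are hypothetical and not part of the post's main.py.

```python
import binascii


def make_client_id(unique_id, prefix="esp32"):
    """Build an MQTT client ID that is stable per board and distinct across boards.

    unique_id is any per-device byte string; on an ESP32 running MicroPython
    it could be the value returned by machine.unique_id() (assumption).
    """
    # Hex-encode the bytes so the ID contains only letters and digits.
    return prefix + binascii.hexlify(unique_id).decode()


# Stand-in hardware IDs so the sketch also runs on desktop Python;
# on a real board one would pass machine.unique_id() instead.
board_a = make_client_id(b"\x24\x0a\xc4\x00\x01\x10")
board_b = make_client_id(b"\x24\x0a\xc4\x00\x01\x11")
print(board_a)
print(board_b)
```

With this helper, the line that builds `client_id` in `main()` could become `client_id = make_client_id(machine.unique_id())`, which would need `import machine`. The short prefix keeps a six-byte ID within the 23 alphanumeric characters that MQTT 3.1.1 requires every broker to accept.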