[Doc] 3.2.7 async cache population and other updates (backport #46230) (#46269)

Signed-off-by: evelyn.zhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
mergify[bot] and evelynzhaojie authored May 27, 2024
1 parent 44058fe commit 7283a5b
Showing 10 changed files with 78 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/en/administration/user_privs/ranger_plugin.md
@@ -44,7 +44,7 @@ After StarRocks is integrated with Apache Ranger, you can achieve the following
- All StarRocks FE machines have access to Apache Ranger. You can check this by running the following command on each FE machine:

```SQL
- telnet <ranger-ip> <ranger-host>
+ telnet <ranger-ip> <ranger-port>
```

If `Connected to <ip>` is displayed, the connection is successful.
26 changes: 26 additions & 0 deletions docs/en/data_source/data_cache.md
@@ -162,3 +162,29 @@ Table: lineitem
- __MAX_OF_BytesRead: 194.99 MB
- __MIN_OF_BytesRead: 81.25 MB
```

## Populate data cache

StarRocks supports populating the data cache in synchronous or asynchronous mode.

### Synchronous cache population (default)

In synchronous population mode, all the remote data read by the current query is cached locally. Synchronous population is efficient but may affect the performance of initial queries because it happens during data reading.

### Asynchronous cache population (since v3.2.7)

In asynchronous population mode, the system tries to cache the accessed data in the background, minimizing the impact on read performance. Asynchronous population reduces the impact of cache population on initial reads, but the cache fills more slowly: a single query usually cannot cache all the data it accesses, and several queries may be needed before the data is fully cached.

By default, the system uses synchronous cache population. You can enable asynchronous cache population by setting the session variable [enable_datacache_async_populate_mode](../reference/System_variable.md):

- Enable asynchronous cache population for a single session.

```sql
SET enable_datacache_async_populate_mode = true;
```

- Enable asynchronous cache population globally for all sessions.

```sql
SET GLOBAL enable_datacache_async_populate_mode = true;
```
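After switching modes, you can confirm the current setting from the same client (a quick sanity check; the variable name is the one documented above):

```sql
-- Check the current value of the population-mode switch
SHOW VARIABLES LIKE 'enable_datacache_async_populate_mode';
```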
6 changes: 6 additions & 0 deletions docs/en/reference/System_variable.md
@@ -321,6 +321,12 @@ This variable is introduced to solve compatibility issues.
Default value: `true`.
-->

### enable_datacache_async_populate_mode

* **Description**: Whether to populate the data cache in asynchronous mode. By default, the system populates the data cache in synchronous mode, that is, it populates the cache while reading data during queries.
* **Default**: false
* **Introduced in**: v3.2.7

### enable_connector_adaptive_io_tasks

* **Description**: Whether to adaptively adjust the number of concurrent I/O tasks when querying external tables. Default value is `true`. If this feature is not enabled, you can manually set the number of concurrent I/O tasks using the variable `connector_io_tasks_per_scan_operator`.
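As a sketch of the manual alternative mentioned above, you could disable adaptive adjustment and pin the concurrency yourself (the value 16 is only an illustrative choice, not a recommendation):

```sql
-- Turn off adaptive I/O task adjustment and set the concurrency manually
SET enable_connector_adaptive_io_tasks = false;
SET connector_io_tasks_per_scan_operator = 16;
```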
@@ -12,7 +12,7 @@ You can use Lateral Join with UNNEST to implement common conversions, for example

From v2.5, UNNEST can take a variable number of array parameters. The arrays can vary in type and length (number of elements). If the arrays have different lengths, the largest length prevails, which means nulls will be added to arrays that are less than this length. See [Example 2](#example-2-unnest-takes-multiple-parameters) for more information.
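A minimal sketch of the padding behavior described above (array literals and aliases are illustrative; see the linked example for the authoritative version):

```sql
-- arr2 is shorter, so its third element is padded with NULL,
-- producing the rows (1, 'x'), (2, 'y'), (3, NULL)
SELECT a, b
FROM (SELECT [1, 2, 3] AS arr1, ['x', 'y'] AS arr2) t,
     UNNEST(arr1, arr2) AS u(a, b);
```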

From v3.2.6, UNNEST can be used with LEFT JOIN ON TRUE, which is to retain all rows in the left table even if the corresponding rows in the right table are empty or have null values. NULLs are returned for such empty or NULL rows. See [Example 3](#example-3-unnest-left-join-on-true) for more information.
From v3.2.7, UNNEST can be used with LEFT JOIN ON TRUE, which retains all rows in the left table even if the corresponding right-side rows are empty or NULL. NULLs are returned for such rows. See [Example 3](#example-3-unnest-with-left-join-on-true) for more information.
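A sketch of the LEFT JOIN ON TRUE form described above (table and column names are illustrative):

```sql
-- Rows of t whose scores array is empty or NULL are kept,
-- with NULL returned in the unnested column
SELECT t.id, u.score
FROM t LEFT JOIN UNNEST(t.scores) AS u(score) ON TRUE;
```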

## Syntax

6 changes: 4 additions & 2 deletions docs/en/using_starrocks/Cost_based_optimizer.md
@@ -485,7 +485,9 @@ The task ID for a manual collection task can be obtained from SHOW ANALYZE STATUS

## Collect statistics of Hive/Iceberg/Hudi tables

Since v3.2.0, StarRocks supports collecting statistics of Hive, Iceberg, and Hudi tables. The syntax is similar to collecting StarRocks internal tables. **However, only manual and automatic full collection are supported. Sampled collection and histogram collection are not supported.** The collected statistics are stored in the `external_column_statistics` table of the `_statistics_` in the `default_catalog`. They are not stored in Hive Metastore and cannot be shared by other search engines. You can query data from the `default_catalog._statistics_.external_column_statistics` table to verify whether statistics are collected for a Hive/Iceberg/Hudi table.
Since v3.2.0, StarRocks supports collecting statistics of Hive, Iceberg, and Hudi tables. The syntax is similar to that for StarRocks internal tables. **However, only manual full collection, manual histogram collection (since v3.2.7), and automatic full collection are supported. Sampled collection is not supported.** Since v3.3.0, StarRocks supports collecting statistics of sub-fields in STRUCT columns.

The collected statistics are stored in the `external_column_statistics` table of the `_statistics_` database in the `default_catalog`. They are not stored in Hive Metastore and cannot be shared by other query engines. You can query the `default_catalog._statistics_.external_column_statistics` table to verify whether statistics are collected for a Hive/Iceberg/Hudi table.

Following is an example of querying statistics data from `external_column_statistics`.

@@ -512,7 +514,7 @@ partition_name:
The following limits apply when you collect statistics for Hive, Iceberg, Hudi tables:

1. You can collect statistics of only Hive, Iceberg, and Hudi tables.
2. Only full collection is supported. Sampled collection and histogram collection are not supported.
2. Only manual full collection, manual histogram collection (since v3.2.7), and automatic full collection are supported. Sampled collection is not supported.
3. For the system to automatically collect full statistics, you must create an Analyze job, which is different from collecting statistics of StarRocks internal tables where the system does this in the background by default.
4. For automatic collection tasks, you can only collect statistics of a specific table. You cannot collect statistics of all tables in a database or statistics of all databases in an external catalog.
5. For automatic collection tasks, StarRocks can detect whether data in Hive and Iceberg tables are updated and if so, collect statistics of only partitions whose data is updated. StarRocks cannot perceive whether data in Hudi tables are updated and can only perform periodic full collection.
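Within these limits, manual collection for an external table might look like the following (catalog, database, table, and column names are illustrative):

```sql
-- Manual full collection of statistics for a Hive table
ANALYZE FULL TABLE hive_catalog.tpch.lineitem;

-- Manual histogram collection (since v3.2.7) on a chosen column
ANALYZE TABLE hive_catalog.tpch.lineitem UPDATE HISTOGRAM ON l_orderkey;
```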
2 changes: 1 addition & 1 deletion docs/zh/administration/user_privs/ranger_plugin.md
@@ -44,7 +44,7 @@ After StarRocks is integrated with Apache Ranger, the following access control methods are available:
- Ensure that all StarRocks FE machines can access Ranger. You can run the following statement on each FE machine to check:

```SQL
- telnet <ranger-ip> <ranger-host>
+ telnet <ranger-ip> <ranger-port>
```

If `Connected to <ip>` is displayed, the connection is successful.
28 changes: 28 additions & 0 deletions docs/zh/data_source/data_cache.md
@@ -158,3 +158,31 @@ datacache_disk_size = 1288490188800
- __MAX_OF_BytesRead: 194.99 MB
- __MIN_OF_BytesRead: 81.25 MB
```

## Populate the data cache

### Asynchronous cache population

Data Cache supports populating the cache in synchronous or asynchronous mode.

- Synchronous population (default)

  In synchronous population mode, all the remote data read by the current query is cached locally. Synchronous population is efficient, but because cache population happens while data is being read, it may affect the performance of initial queries.

- Asynchronous population (v3.2.7 and later)

  In asynchronous population mode, the system tries to cache the accessed data in the background while minimizing the impact on read performance. Asynchronous population reduces the impact of cache population on initial reads, but the population efficiency is lower. Usually a single query cannot cache all the data it accesses, and multiple attempts may be needed.

By default, the system populates the cache synchronously. You can enable asynchronous population by setting the session variable [enable_datacache_async_populate_mode](../reference/System_variable.md):

- Enable asynchronous data cache population for a single session.

```sql
SET enable_datacache_async_populate_mode = true;
```

- Enable asynchronous data cache population globally for all sessions.

```sql
SET GLOBAL enable_datacache_async_populate_mode = true;
```
6 changes: 6 additions & 0 deletions docs/zh/reference/System_variable.md
@@ -194,6 +194,12 @@ SELECT /*+ SET_VAR
* Default: true
* Introduced in: v3.1.11, v3.2.5

### enable_datacache_async_populate_mode

* Description: Whether to populate the data cache in asynchronous mode. By default, the system populates the cache in synchronous mode, that is, it populates the cache while querying data.
* Default: false
* Introduced in: v3.2.7

### query_including_mv_names

* Description: Specifies the names of the asynchronous materialized views to be included during query execution. You can use this variable to limit the number of candidate materialized views and improve query rewrite performance in the optimizer. This variable takes precedence over `query_excluding_mv_names`.
@@ -12,7 +12,7 @@ UNNEST is a table function that expands an array into multiple rows

From v2.5, UNNEST can take multiple array parameters, and the arrays can vary in element type and length (number of elements). If the arrays have different lengths, the longest length prevails, and shorter arrays are padded with NULL elements. See [Example 2](#示例二unnest-接收多个参数).

From v3.2.6, UNNEST supports LEFT JOIN ON TRUE, which retains all rows in the left table; even if the right-side expression returns no rows, the corresponding rows are filled with NULLs. See [Example 3](#示例三unnest-支持left-join-on-true).
From v3.2.7, UNNEST supports LEFT JOIN ON TRUE, which retains all rows in the left table; even if the right-side expression returns no rows, the corresponding rows are filled with NULLs. See [Example 3](#示例三unnest-支持-left-join-on-true).

## Syntax

6 changes: 4 additions & 2 deletions docs/zh/using_starrocks/Cost_based_optimizer.md
@@ -484,7 +484,9 @@ KILL ANALYZE <ID>

## Collect statistics of Hive/Iceberg/Hudi tables

From v3.2, StarRocks supports collecting statistics of Hive, Iceberg, and Hudi tables. **The collection syntax is the same as for internal tables, but only manual full collection and automatic full collection are supported; sampled collection and histogram collection are not supported.** The collected statistics are written to the `external_column_statistics` table in the `_statistics_` database, not to the Hive Metastore, so they cannot be shared with other query engines. You can query the `default_catalog._statistics_.external_column_statistics` table to check whether statistics of a table have been written.
From v3.2, StarRocks supports collecting statistics of Hive, Iceberg, and Hudi tables. **The collection syntax is the same as for internal tables, but only manual full collection, manual histogram collection (since v3.2.7), and automatic full collection are supported; sampled collection is not supported.** Since v3.3.0, StarRocks supports collecting statistics of sub-fields in STRUCT columns.

The collected statistics are written to the `external_column_statistics` table in the `_statistics_` database, not to the Hive Metastore, so they cannot be shared with other query engines. You can query the `default_catalog._statistics_.external_column_statistics` table to check whether statistics of a table have been written.

The query returns information similar to the following:

@@ -511,7 +513,7 @@ partition_name:
The following limits apply when you collect statistics of Hive, Iceberg, and Hudi tables:

1. Only statistics of Hive, Iceberg, and Hudi tables can be collected.
2. Only full collection is supported; sampled collection and histogram collection are not supported.
2. Only manual full collection, manual histogram collection (since v3.2.7), and automatic full collection are supported; sampled collection is not supported.
3. For automatic full collection, you must create an Analyze job; the system does not automatically collect statistics of external data sources by default.
4. For automatic collection tasks, only statistics of a specified table can be collected; collecting statistics of all databases, or of all tables in a database, is not supported.
5. For automatic collection tasks, currently only Hive and Iceberg tables can be checked on each run for data updates; a collection task runs only when data has been updated, and only the partitions whose data has been updated are collected. StarRocks cannot detect data updates in Hudi tables, so it performs periodic full-table collection based on the collection interval.
