Skip to content

Commit

Permalink
[Doc] Add ds_hll_count_distinct doc (#54745)
Browse files Browse the repository at this point in the history
Signed-off-by: shuming.li <[email protected]>
Signed-off-by: 絵空事スピリット <[email protected]>
Co-authored-by: 絵空事スピリット <[email protected]>
  • Loading branch information
LiShuMing and EsoragotoSpirit authored Jan 17, 2025
1 parent cbb7eb1 commit 952799a
Show file tree
Hide file tree
Showing 3 changed files with 119 additions and 30 deletions.

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
# ds_hll_count_distinct

Returns the approximate value of aggregate function similar to the result of COUNT(DISTINCT col). APPROX_COUNT_DISTINCT(expr) is similar function.

ds_hll_count_distinct is faster than the COUNT and DISTINCT combination and uses a fixed-size memory, so less memory is required for columns of high cardinality.

It is slower than APPROX_COUNT_DISTINCT(expr) but with higher precision because it adopts of Apache Datasketches. For more information, see [HyperLogLog Sketches](https://datasketches.apache.org/docs/HLL/HllSketches.html).

## Syntax

```Haskell
ds_hll_count_distinct(expr, [log_k], [tgt_type])
```
- `log_k`: Integer. Range [4, 21]. Default: 17.
- `tgt_type`: Valid values are `HLL_4`, `HLL_6` (default) and `HLL_8`.

## Examples

```plain text
mysql> CREATE TABLE t1 (
-> id BIGINT NOT NULL,
-> province VARCHAR(64),
-> age SMALLINT,
-> dt VARCHAR(10) NOT NULL
-> )
-> DUPLICATE KEY(id)
-> DISTRIBUTED BY HASH(id) BUCKETS 4;
Query OK, 0 rows affected (0.02 sec)
mysql> insert into t1 select generate_series, generate_series, generate_series % 100, "2024-07-24" from table(generate_series(1, 100000));
Query OK, 100000 rows affected (0.29 sec)
mysql> select ds_hll_count_distinct(id), ds_hll_count_distinct(province), ds_hll_count_distinct(age), ds_hll_count_distinct(dt) from t1 order by 1, 2;
+---------------------------+---------------------------------+----------------------------+---------------------------+
| ds_hll_count_distinct(id) | ds_hll_count_distinct(province) | ds_hll_count_distinct(age) | ds_hll_count_distinct(dt) |
+---------------------------+---------------------------------+----------------------------+---------------------------+
| 100090 | 100140 | 100 | 1 |
+---------------------------+---------------------------------+----------------------------+---------------------------+
1 row in set (0.07 sec)
mysql> select ds_hll_count_distinct(id, 21), ds_hll_count_distinct(province, 21), ds_hll_count_distinct(age, 21), ds_hll_count_distinct(dt, 21) from t1 order by 1, 2;
+-------------------------------+-------------------------------------+--------------------------------+-------------------------------+
| ds_hll_count_distinct(id, 21) | ds_hll_count_distinct(province, 21) | ds_hll_count_distinct(age, 21) | ds_hll_count_distinct(dt, 21) |
+-------------------------------+-------------------------------------+--------------------------------+-------------------------------+
| 99995 | 100001 | 100 | 1 |
+-------------------------------+-------------------------------------+--------------------------------+-------------------------------+
1 row in set (0.07 sec)
mysql> select ds_hll_count_distinct(id, 10, "HLL_8"), ds_hll_count_distinct(province, 10, "HLL_8"), ds_hll_count_distinct(age, 10, "HLL_8"), ds_hll_count_distinct(dt, 10, "HLL_8") from t1 order by 1, 2;
+----------------------------------------+----------------------------------------------+-----------------------------------------+----------------------------------------+
| ds_hll_count_distinct(id, 10, 'HLL_8') | ds_hll_count_distinct(province, 10, 'HLL_8') | ds_hll_count_distinct(age, 10, 'HLL_8') | ds_hll_count_distinct(dt, 10, 'HLL_8') |
+----------------------------------------+----------------------------------------------+-----------------------------------------+----------------------------------------+
| 99844 | 101905 | 96 | 1 |
+----------------------------------------+----------------------------------------------+-----------------------------------------+----------------------------------------+
1 row in set (0.09 sec)
```

## Keywords

DS_HLL_COUNT_DISTINCT,APPROX_COUNT_DISTINCT
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# ds_hll_count_distinct

返回聚合函数的近似值,结果类似于 COUNT(DISTINCT col)。相似函数为 APPROX_COUNT_DISTINCT(expr)。

相较于 COUNT DISTINCT 速度更快,并且使用固定大小的内存,因此基于高基数列使用时内存占用更少。

相较于 APPROX_COUNT_DISTINCT(expr) 速度更慢,但由于 Apache Datasketches 的优势,导致其精度更高。更多信息,参考 [HyperLogLog Sketches](https://datasketches.apache.org/docs/HLL/HllSketches.html)

## 语法

```Haskell
ds_hll_count_distinct(expr, [log_k], [tgt_type])
```
- `log_k`:必须为整数。范围:[4, 21]。默认值:17。
- `tgt_type`:有效值为 `HLL_4``HLL_6`(默认)以及 `HLL_8`

## 示例

```plain text
mysql> CREATE TABLE t1 (
-> id BIGINT NOT NULL,
-> province VARCHAR(64),
-> age SMALLINT,
-> dt VARCHAR(10) NOT NULL
-> )
-> DUPLICATE KEY(id)
-> DISTRIBUTED BY HASH(id) BUCKETS 4;
Query OK, 0 rows affected (0.02 sec)
mysql> insert into t1 select generate_series, generate_series, generate_series % 100, "2024-07-24" from table(generate_series(1, 100000));
Query OK, 100000 rows affected (0.29 sec)
mysql> select ds_hll_count_distinct(id), ds_hll_count_distinct(province), ds_hll_count_distinct(age), ds_hll_count_distinct(dt) from t1 order by 1, 2;
+---------------------------+---------------------------------+----------------------------+---------------------------+
| ds_hll_count_distinct(id) | ds_hll_count_distinct(province) | ds_hll_count_distinct(age) | ds_hll_count_distinct(dt) |
+---------------------------+---------------------------------+----------------------------+---------------------------+
| 100090 | 100140 | 100 | 1 |
+---------------------------+---------------------------------+----------------------------+---------------------------+
1 row in set (0.07 sec)
mysql> select ds_hll_count_distinct(id, 21), ds_hll_count_distinct(province, 21), ds_hll_count_distinct(age, 21), ds_hll_count_distinct(dt, 21) from t1 order by 1, 2;
+-------------------------------+-------------------------------------+--------------------------------+-------------------------------+
| ds_hll_count_distinct(id, 21) | ds_hll_count_distinct(province, 21) | ds_hll_count_distinct(age, 21) | ds_hll_count_distinct(dt, 21) |
+-------------------------------+-------------------------------------+--------------------------------+-------------------------------+
| 99995 | 100001 | 100 | 1 |
+-------------------------------+-------------------------------------+--------------------------------+-------------------------------+
1 row in set (0.07 sec)
mysql> select ds_hll_count_distinct(id, 10, "HLL_8"), ds_hll_count_distinct(province, 10, "HLL_8"), ds_hll_count_distinct(age, 10, "HLL_8"), ds_hll_count_distinct(dt, 10, "HLL_8") from t1 order by 1, 2;
+----------------------------------------+----------------------------------------------+-----------------------------------------+----------------------------------------+
| ds_hll_count_distinct(id, 10, 'HLL_8') | ds_hll_count_distinct(province, 10, 'HLL_8') | ds_hll_count_distinct(age, 10, 'HLL_8') | ds_hll_count_distinct(dt, 10, 'HLL_8') |
+----------------------------------------+----------------------------------------------+-----------------------------------------+----------------------------------------+
| 99844 | 101905 | 96 | 1 |
+----------------------------------------+----------------------------------------------+-----------------------------------------+----------------------------------------+
1 row in set (0.09 sec)
```

## Keywords

DS_HLL_COUNT_DISTINCT,APPROX_COUNT_DISTINCT

0 comments on commit 952799a

Please sign in to comment.