ClickHouse Index-202301071508

Last updated Jan 7, 2023 Edit Source

# ClickHouse Index-202301071508

# Summary

How Clickhouse index work?

# Notes

介紹 Index 前, 先介紹一下 Clickhouse MergeTree 的儲存結構

# Partition

一個 Partition 可以有多個 parts. Clickhouse 每次 bulk insert 都會產生新的 part file, 相同 Partition 的 part 會在 insert 完後 10-15分鐘 merge, 合併一個 part file

Partition Format 有兩種: Wide, Compact

Wide: Column 分開儲存, default.
Compact: 所有 Column 一起儲存, 適合頻率大, 小量的 insert.

文件中有一段話講 partition key 的設計

A merge only works for data parts that have the same value for the partitioning expression.
This means you shouldn’t make overly granular partitions (more than about a thousand partitions). Otherwise, the SELECT query performs poorly because of an unreasonably large number of files in the file system and open file descriptors.

白話就是 partition 在 table 內不應該過多, 會降低 select 效能. 自己的心得就是針對不同的 data source 選擇適合的 partition key 可以有效改善效能瓶頸

# Primary Key

跟一般 RMDB 不同的是, Clickhouse Primary Key 並沒有表示重複數據的意思. 而是依照 Primary Key 排序. 由於 Clickhouse 內可以設定多個 Primary Key, 比如說 (Area, UserID), 表示說 Part 內數據先依照 Area 排序, Area 內再依照 UserID 排序.

# Granules

Granules 是 Clickhouse data 的最小單位, granules 包含一個數量的 rows. Clickhouse 會拿 granules 第一筆資料的 Primary Key 當 mark.

針對每個 Part, 會有個 index file .idx 儲存這些 mark value.
針對每個 Columns, 就算不是 Primary Key, 也會儲存這些 mark. .mrk (透過 mark 在 column file 找資料)

.idx 存這些 mark value, .mrk 則是存兩個偏移值 .bin 的偏移值定位到 compressed block & compressed block 解壓後的文件偏移量

# Block

Block 是 Clickhouse Column 的壓縮單位, 包含多個 Granules, 至於是多少要看 min_compress_block_size & max_compress_block_size決定.

每次寫入一個 Granules (default index_granularity 8192 rows), Clickhouse 會檢查 data size
- 如果 < min_compress_block_size 會繼續拿取 data
- 如果 > min_compress_block_size & < max_compress_block_size 會壓縮並寫入磁碟
- 如果 > max_compress_block_size, 會依照 compress_block_size 截斷並生成 block

一個 .bin (Date file) 內包含多個 compress block

結構圖如下:

# Index

所以 Select Index 如何發揮作用呢？

Create Table T ORDER BY A
- A 有 Primary Index, B 沒有

1
CREATE TABLE T(A Int, B Int) MergeTree order by A

這邊 insert 三次, 所以會有三個 part 產生

1
2
3
4
5
insert into T values(1, 1)
insert into T values(2, 1)
insert into T values(1, 2)

select database, table, name, primary_key_bytes_in_memory, marks_bytes, rows  from system.parts where table = 'T'

1
2
3
4
5
┌─database─┬─table─┬─name──────┬─primary_key_bytes_in_memory─┬─marks_bytes─┬─rows─┐
│ default  │ T     │ all_1_1_0 │                           8 │          80 │    1 │
│ default  │ T     │ all_2_2_0 │                           8 │          80 │    1 │
│ default  │ T     │ all_3_3_0 │                           8 │          80 │    1 │
└──────────┴───────┴───────────┴─────────────────────────────┴─────────────┴──────┘

開啟 Tracing

1
set send_logs_level = 'trace'

針對 B 做 Query

1
SELECT * FROM T where B > 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
... <Debug> executeQuery: (from 127.0.0.1:53134) select * from T where B > 1
... <Trace> ContextAccess (default): Access granted: SELECT(A, B) ON default.T

// primary index 沒作用

... <Debug> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Key condition: unknown
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Not using primary index on part all_1_1_0
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Not using primary index on part all_2_2_0
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Not using primary index on part all_3_3_0

// Clickhouse 掃過全部三個 Partition

... <Debug> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Selected 3 parts by partition key, 3 parts by primary key, 3 marks by primary key, 3 marks to read from 3 ranges
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Reading approx. 3 rows with 3 streams
... <Trace> InterpreterSelectQuery: FetchColumns -> Complete
┌─A─┬─B─┐
│ 1 │ 2 │
└───┴───┘
... <Information> executeQuery: Read 3 rows, 24.00 B in 0.004743004 sec., 632 rows/sec., 4.94 KiB/sec.
... <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.

1 rows in set. Elapsed: 0.007 sec. 

1
SELECT * FROM T where A > 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
... <Debug> executeQuery: (from 127.0.0.1:53134) SELECT * from T where A > 1
... <Trace> ContextAccess (default): Access granted: SELECT(A, B) ON default.T

// 找到 Primary key, 透過找 LEFT/RIGHT boundary 選擇 partition

... <Debug> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Key condition: (column 0 in [2, +inf))
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Running binary search on index range for part all_2_2_0 (2 marks)
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Running binary search on index range for part all_1_1_0 (2 marks)
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found (LEFT) boundary mark: 1
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found (RIGHT) boundary mark: 2
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found (LEFT) boundary mark: 1
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found empty range in 1 steps
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found (RIGHT) boundary mark: 2
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found empty range in 1 steps
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Running binary search on index range for part all_3_3_0 (2 marks)
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found (LEFT) boundary mark: 0
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found (RIGHT) boundary mark: 2
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Found continuous range in 2 steps

// 這邊顯示只選到 1 個 partition

... <Debug> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Selected 3 parts by partition key, 1 parts by primary key, 1 marks by primary key, 1 marks to read from 1 ranges
... <Trace> default.T (9eae30d1-c8fa-4630-af28-88484187ddbe) (SelectExecutor): Reading approx. 1 rows with 1 streams
... <Trace> InterpreterSelectQuery: FetchColumns -> Complete
┌─A─┬─B─┐
│ 2 │ 1 │
└───┴───┘
... <Information> executeQuery: Read 1 rows, 8.00 B in 0.005239063 sec., 190 rows/sec., 1.49 KiB/sec.
... <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.

# 所以整個過程是什麼 ?

Partition Key -> Partition
Mark LEFT/RIGHT -> 選 Partition
Primary key 找符合條件的 marks
mark -> 要讀取的範圍

Clickhouse Primary Index 是 Sparse Index, 針對特定 row 的 query 基本上很慢 (看整段過程就知道了), 但對條件過濾就友善很多了.

所以 Clickhouse Index 的概念是什麼？

如何跳過更多的資料, 找到需要掃過的範圍

跳過的 Block 多了, 解壓縮的 Block 就少了. 可以很大程度的減少 I/O.

就像切蛋糕, 看能夠切幾刀找到需要的資料

# Tips

Unique: 避免太獨特的 primary key, 效果會很差
Long data range: 反之太常見的 Primary key 就會失去效果, 需要掃過的 range 過長
Primary Key 順序: Order by (A, B, C) or (B, A, C). 因為是 Sparse Index, 如果順序錯了有機會多讀取很多資料. 慎重考慮 query 條件 && Primary Key 的設定
事後加入 Data Skipping Index, 對已經存在的資料並不會馬上發生作用. 而是需要 Part merge process 跑完後才會生效
預設的 min_compress_block_size & max_compress_block_size 已經優化過了, 沒事別動.

# References

https://www.slideshare.net/Altinity/webinar-secrets-of-clickhouse-query-performance-by-robert-hodges-173379008 https://stackoverflow.com/questions/60255863/how-understand-the-granularity-and-block-in-clickhouse https://github.com/ClickHouse/ClickHouse/issues/7473 https://stackoverflow.com/questions/60142967/how-to-understand-part-and-partition-of-clickhouse https://www.wangfenjin.com/posts/clickhouse-od/

jneo8.com