Bài 31 / 60~22 phútIndexing internalsMiễn phí… lượt xem

GIN/BRIN/partial/expression — chọn index theo data type và workload

Q: BRIN index size nhỏ hơn B-tree 1000x — tradeoff cụ thể là gì? Khi nào BRIN trở nên vô dụng?

Tradeoff: BRIN chỉ lưu min/max per block range (128 page mặc định) thay vì index từng row. Với 100 triệu row, B-tree cần ~4 GB để lưu mỗi row pointer; BRIN chỉ cần ~100 KB lưu vài nghìn (min, max) pair. Index nhỏ → toàn bộ BRIN fit trong buffer cache → lookup metadata nhanh. Nhược điểm: BRIN không cho phép random access — chỉ loại trừ block range chắc chắn không thỏa query. Với range query nhỏ (1 row), BRIN vẫn phải scan toàn bộ block range matching (~1 MB) thay vì trỏ đến row cụ thể như B-tree → kém hơn B-tree cho selective point lookup. Vô dụng khi: dữ liệu có 'correlation' thấp (gần 0). Ví dụ: bảng 'tasks' có UPDATE thay đổi thứ tự vật lý → mỗi block range có thể chứa row từ mọi khoảng thời gian → min/max per range cover gần như toàn bộ domain → BRIN không skip được block nào → tương đương seq scan.

Q: Phân biệt partial index và WHERE clause trong query thông thường. Khi nào partial index đáng tạo?

Partial index là index chỉ chứa row thỏa điều kiện khai báo lúc CREATE INDEX. Index nhỏ hơn proportional với subset. Planner dùng partial index khi WHERE clause của query bao gồm hoặc khớp điều kiện partial. WHERE clause trong query chỉ lọc kết quả sau khi fetch từ index hoặc table — không ảnh hưởng index size. Đáng tạo partial index khi: (1) subset row thỏa điều kiện chiếm nhỏ hơn 50% (lý tưởng dưới 30%) — storage save đáng kể; (2) điều kiện đó xuất hiện trong phần lớn query thực tế (vd 'WHERE deleted_at IS NULL' luôn có trong 95% query); (3) full index quá lớn ảnh hưởng buffer cache. Không đáng nếu subset quá nhỏ (dưới 1%) hoặc điều kiện hiếm gặp trong query thực tế.

Q: Phân biệt INCLUDE column vs thêm column vào sort key của composite index. Khác biệt nào quan trọng cho Index Only Scan?

Sort key (vd '(a, b, c)' ): cả ba column được lưu trong B-tree sorted structure → planner có thể dùng index cho filter/sort trên bất kỳ leftmost prefix. Write overhead tăng theo số column trong key. INCLUDE (vd '(a, b) INCLUDE (c)' ): column 'c' được lưu ở leaf page nhưng không tham gia sort order → planner không thể filter hay sort bằng 'c' ; chỉ dùng để cover SELECT list cho Index Only Scan — tránh heap fetch. Write overhead thấp hơn thêm vào sort key vì không cần maintain B-tree order cho column INCLUDE. Quan trọng cho Index Only Scan: nếu query chỉ cần 'SELECT c FROM t WHERE a = 1' , cần 'c' trong leaf page nhưng không cần sort theo 'c' → INCLUDE là lựa chọn đúng. Thêm 'c' vào sort key vừa không cần thiết vừa tốn storage và write overhead hơn.

Q: TaskFlow có query containment `metadata @> '{\"priority\":\"high\"}'` chạy 1000 lần/phút. Chọn GIN jsonb_path_ops hay jsonb_ops? Giải thích tradeoff.

Chọn 'jsonb_path_ops' cho use case này. jsonb_ops (mặc định): index cả key và value riêng lẻ. Hỗ trợ nhiều operator: '@>' (containment

B-tree không index được JSONB array tsvector. PG có 6 loại index. GIN inverted, BRIN block-range time-series, partial subset, expression function.

TaskFlow lưu metadata tag priority trong column metadata JSONB. Query WHERE metadata @> '{"priority":"high"}' chạy seq scan 2 triệu row dù đã có index — vì index đó là B-tree, không index được cấu trúc JSONB lồng nhau. Cùng lúc, bảng task_history 100 triệu row audit append-only có B-tree index trên event_at nặng 4 GB — chiếm buffer cache, slow range scan. Và users table có query WHERE LOWER(email) = '[email protected]' bỏ qua B-tree index vì function trên column.

Ba vấn đề, ba loại index khác nhau. PostgreSQL có 6 loại index — mỗi loại thiết kế cho data type và access pattern cụ thể. Bài này map decision tree thực tế + lý giải cơ chế bên trong từng loại, áp dụng vào TaskFlow.

1. Analogy — Đồ nghề khác cho công việc khác

Thợ sửa xe không dùng búa để vặn ốc. Mỗi loại công việc cần đúng dụng cụ — dùng sai không những không hiệu quả mà còn gây hỏng.

Đồ nghề	B-tree tương đương	Dùng khi
Búa (general purpose)	B-tree	Scalar equality, range, sort — 99% case thông thường
Cây kẹp nhíp (multi-element)	GIN	Array, JSONB, tsvector — dữ liệu đa phần tử
Thước cuộn (range nhanh)	BRIN	Time-series append-only lớn — dải phạm vi
Bút highlight (subset)	Partial	Chỉ index subset row thỏa WHERE clause
Nhãn dán ghi chú	Expression	Function transformation trên column

💡 Cách nhớ

B-tree là đồ nghề mặc định — chọn nó trước. Chỉ đổi sang specialized index khi data type hoặc access pattern thực sự khác: multi-value → GIN, time-series lớn → BRIN, subset row → Partial, function on column → Expression.

2. 6 loại index PostgreSQL — overview

CREATE INDEX không khai báo USING mặc định là B-tree. Năm loại còn lại cho use case cụ thể:

Type	Best for	Internal	Write cost so với B-tree
B-tree	Equality, range, sort	Balanced multi-way tree	1x (baseline)
Hash	Equality only	Hash table	~1x (PG 10+ WAL-logged, ít dùng)
GiST	Geometric, range, FTS	Generalized search tree	2–3x
GIN	Array, JSONB, FTS	Inverted index	5–10x
BRIN	Time-series append-only	Block-range min/max	0.1x (rất rẻ)
SP-GiST	Quadtree, k-d tree	Space-partitioned tree	2–3x

Trong thực tế, 99% developer chỉ cần B-tree. Bốn loại còn lại (GIN, BRIN, Partial, Expression) xuất hiện khi gặp đúng use case — biết chúng tồn tại là đủ để nhận ra tình huống cần dùng.

3. GIN — Generalized Inverted Index

GIN (Generalized Inverted Index) là inverted index — cùng cơ chế với search engine index token → document. Thay vì index row → key (như B-tree), GIN index element → list row chứa element đó.

Use case chính: column multi-value như JSONB, array, tsvector.

-- TaskFlow: tasks_v2 co JSONB metadata
CREATE TABLE tasks_v2 (
  id          BIGSERIAL PRIMARY KEY,
  title       TEXT NOT NULL,
  metadata    JSONB,
  created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Tao GIN index -- mac dinh operator class jsonb_ops
CREATE INDEX idx_tasks_metadata ON tasks_v2 USING GIN (metadata);

-- Hoac jsonb_path_ops: nho hon, chi ho tro @> (containment)
CREATE INDEX idx_tasks_metadata_path ON tasks_v2 USING GIN (metadata jsonb_path_ops);

-- Query containment
SELECT * FROM tasks_v2 WHERE metadata @> '{"priority":"high"}';
-- Plan: Bitmap Index Scan on idx_tasks_metadata -> Bitmap Heap Scan

Cơ chế inverted index bên trong:

Document (row) data:
  Row 1: {"priority":"high", "tag":"bug"}
  Row 2: {"priority":"low",  "tag":"feature"}
  Row 3: {"priority":"high", "tag":"feature"}

GIN index entries (element -> list row):
  "priority":"high"  ->  [row1, row3]
  "priority":"low"   ->  [row2]
  "tag":"bug"        ->  [row1]
  "tag":"feature"    ->  [row2, row3]

Query metadata @> '{"priority":"high"}' tra GIN → tìm entry "priority":"high" → nhận list [row1, row3] → fetch heap. Không cần scan toàn bảng.

Tradeoff:

✅ Multi-value containment query nhanh hơn seq scan nhiều bậc
❌ Write chậm hơn B-tree 5–10x — mỗi INSERT/UPDATE phải update inverted list cho từng element
❌ Storage lớn hơn 2–3x — lưu inverted list cho toàn bộ unique element

jsonb_ops vs jsonb_path_ops: jsonb_ops (mặc định) index cả key và value riêng lẻ, hỗ trợ nhiều operator (@>, ?, ?|, ?&). jsonb_path_ops chỉ index key+value cặp đôi, nhỏ hơn và nhanh hơn cho @>, nhưng không hỗ trợ ? (key existence). Nếu chỉ dùng containment @> → chọn jsonb_path_ops.

Forward: Module 9 của khoá này đi sâu JSONB operator class, jsonb_path_ops vs jsonb_ops tradeoff chi tiết, và GIN cho full-text search tiếng Việt.

4. BRIN — Block Range Index

BRIN (Block Range Index) lưu min/max của từng block range (nhóm page) thay vì index từng row. Index size nhỏ hơn B-tree 1000x, nhưng chỉ hiệu quả khi dữ liệu có correlation vật lý cao với column — tức là data được ghi theo thứ tự column đó.

Use case điển hình: timestamp trên bảng append-only.

-- task_history: append-only audit table, 100 trieu row
CREATE TABLE task_history (
  id          BIGSERIAL PRIMARY KEY,
  task_id     BIGINT NOT NULL,
  action      TEXT NOT NULL,
  event_at    TIMESTAMPTZ DEFAULT NOW(),
  payload     JSONB
);

-- BRIN index -- pages_per_range=128 la mac dinh (128 page = ~1 MB per range)
CREATE INDEX idx_task_history_event_at ON task_history USING BRIN (event_at);

-- So sanh size
SELECT pg_size_pretty(pg_relation_size('idx_task_history_event_at'));
-- ~48 KB (so voi B-tree ~4 GB cung data)

Cơ chế block range:

Table: task_history (append-only by event_at)

Block range 1 (page 1-128):   event_at min='2026-01-01', max='2026-01-31'
Block range 2 (page 129-256): event_at min='2026-02-01', max='2026-02-28'
Block range 3 (page 257-384): event_at min='2026-03-01', max='2026-03-31'
...

Query: WHERE event_at > '2026-03-01'
-> BRIN check: range 1 max='2026-01-31' < threshold -> skip
-> BRIN check: range 2 max='2026-02-28' < threshold -> skip
-> BRIN check: range 3 min='2026-03-01' -> scan range 3+
-> Faster than full seq scan (skip 2/3 blocks)

BRIN không cho phép random access như B-tree — chỉ loại trừ block range chắc chắn không chứa row thỏa mãn. Với range scan trên column có correlation cao, đây là lợi thế đủ lớn.

Khi nào BRIN trở nên vô dụng — data random không có correlation:

-- Kiem tra correlation trong pg_stats
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'task_history' AND attname = 'event_at';
-- correlation = 0.99 -> append-only theo thoi gian -> BRIN hieu qua

-- Neu tasks table co UPDATE lam thay doi physical order:
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'tasks' AND attname = 'created_at';
-- correlation = 0.45 -> du lieu xao tron -> BRIN kem hieu qua

Khi correlation gần 0 (random), BRIN không skip được block nào — mọi range có thể chứa row thỏa query → phải scan toàn bộ, không khác seq scan.

Ứng dụng TaskFlow:

task_history(event_at): append-only audit, correlation gần 1.0 → BRIN lý tưởng. Index nhỏ 1000x → buffer cache dồn cho index thực sự cần.
tasks(created_at): table có UPDATE → physical order bị xáo trộn → B-tree tốt hơn.

Forward: Module 7 của khoá này giải thích pg_stats.correlation và cách planner dùng statistics để chọn index. Module 11 của khoá này bàn BRIN tuning pages_per_range cho production time-series.

5. Partial index — index chỉ subset row

Partial index thêm WHERE clause vào CREATE INDEX — chỉ index row thỏa điều kiện. Index nhỏ hơn, scan nhanh hơn — nhưng query phải include điều kiện matching để planner dùng được.

-- Chi index task chua hoan thanh (active tasks ~30% data)
CREATE INDEX idx_tasks_active_due ON tasks(due_at)
WHERE status IN ('todo', 'doing');

-- Query PHAI include WHERE clause matching de dung partial index
SELECT * FROM tasks
WHERE status IN ('todo', 'doing') AND due_at > NOW();
-- Plan: Index Scan on idx_tasks_active_due

-- Query thieu dieu kien matching -> planner KHONG dung partial
SELECT * FROM tasks WHERE due_at > NOW();
-- Plan: Seq Scan (partial index khong dam bao subset)

Use case TaskFlow theo pattern thực tế:

-- Soft delete: 95% row co deleted_at IS NULL
CREATE INDEX idx_tasks_not_deleted ON tasks(created_at)
WHERE deleted_at IS NULL;

-- CMS draft/published: 90% la draft
CREATE INDEX idx_content_published ON content(updated_at)
WHERE published = true;

-- Active task filter la query pho bien nhat
CREATE INDEX idx_tasks_assignee_active ON tasks(assignee_id, due_at)
WHERE status IN ('todo', 'doing');

Tradeoff:

	Full index	Partial index
Index size	100% row	Chi subset (vd 30%)
Scan speed	Toan bo	Nho hon -> nhanh hon
Query constraint	Khong	Phai match WHERE clause
Maintenance	Build toan bo	Chi build subset

Partial index không thay thế WHERE clause trong query — planner chỉ dùng partial index khi WHERE clause của query bao gồm (hoặc khớp) điều kiện partial index. Nếu query không có điều kiện đó, planner sẽ chọn full index hoặc seq scan.

6. Expression index — index kết quả hàm

Expression index lưu kết quả tính toán của một expression (thường là function) thay vì giá trị gốc của column. Giải quyết vấn đề khi query có function transformation trên column — B-tree thường bỏ qua vì index chứa giá trị gốc, không phải kết quả hàm.

-- Query case-insensitive -> B-tree tren email khong dung duoc
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- Plan: Seq Scan (B-tree index tren email luu gia tri goc, khong phai LOWER())

-- Tao expression index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Query SAME nay dung duoc index
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- Plan: Index Scan on idx_users_email_lower

Use case phổ biến:

-- Date truncation: group by day
CREATE INDEX idx_tasks_created_day ON tasks(date_trunc('day', created_at));
SELECT date_trunc('day', created_at), COUNT(*)
FROM tasks GROUP BY 1;
-- Index scan cho group by

-- JSONB extract: filter theo field cu the
CREATE INDEX idx_tasks_priority ON tasks((metadata->>'priority'));
SELECT * FROM tasks WHERE metadata->>'priority' = 'high';

-- Computed value
CREATE INDEX idx_order_total ON orders((price * quantity));
SELECT * FROM orders WHERE price * quantity > 1000;

Yêu cầu bắt buộc: expression PHẢI immutable — cùng input luôn trả về cùng output. PG kiểm tra immutability khi tạo index, reject nếu function không phải immutable.

-- KHONG OK: now() khong immutable (ket qua thay doi theo thoi gian)
CREATE INDEX bad_idx ON tasks((created_at - NOW()));
-- ERROR: functions in index expression must be marked IMMUTABLE

-- OK: LOWER() la immutable
CREATE INDEX ok_idx ON users(LOWER(email));

-- OK: date_trunc voi literal timezone string la immutable
CREATE INDEX ok_day_idx ON tasks(date_trunc('day', created_at));

Expression index kết hợp INCLUDE cover SELECT columns:

-- Expression + INCLUDE: Index Only Scan khong can heap fetch
CREATE INDEX idx_users_email_lower_cover
ON users(LOWER(email)) INCLUDE (id, display_name);

SELECT id, display_name FROM users WHERE LOWER(email) = '[email protected]';
-- Plan: Index Only Scan (khong can fetch heap page)

7. Decision tree — chọn index theo workload

Buoc 1: Data type va access pattern?
|
+-- Scalar (int, text, date) + equality / range / sort
|   -> B-tree (default)
|
+-- Multi-value: array, JSONB, tsvector
|   -> GIN
|   -> GIN jsonb_path_ops neu chi dung @> containment
|
+-- Time-series append-only, table cuc lon
|   -> BRIN (kiem tra correlation truoc)
|
+-- Geometric: point, polygon, range type
|   -> GiST
|
+-- Function transformation tren column
    -> Expression index

Buoc 2: Chi index subset row?
+-- Co WHERE clause pho bien va skip phan lon data
    -> Partial index (ket hop voi type o Buoc 1)

Buoc 3: Query fetch nhieu column tu SELECT?
+-- Co the cover het bang INCLUDE
    -> INCLUDE columns -> Index Only Scan

Kết hợp hợp lệ:

-- Partial + B-tree (pho bien nhat)
CREATE INDEX idx_active ON tasks(due_at) WHERE status IN ('todo', 'doing');

-- Partial + GIN
CREATE INDEX idx_meta_active ON tasks USING GIN (metadata)
WHERE status IN ('todo', 'doing');

-- Expression + INCLUDE
CREATE INDEX idx_email_cover ON users(LOWER(email)) INCLUDE (id, display_name);

-- Expression + Partial
CREATE INDEX idx_email_verified ON users(LOWER(email))
WHERE verified = true;

8. Pitfall — dùng GIN cho scalar equality

Pitfall — GIN cho scalar column: write overhead vô ích

GIN thiết kế cho multi-value container (array, JSONB). Áp dụng GIN cho scalar column (TEXT, INT) dùng equality query là sai use case — write chậm hơn 5–10x mà không có lợi ích gì so với B-tree.

-- ANTI-PATTERN: GIN cho scalar TEXT equality
CREATE INDEX ON tasks USING GIN (status);  -- status la TEXT scalar
-- Sai: GIN tao inverted index cho tung ky tu / token cua status
-- Hau qua: write 5-10x cham, index lon, equality query khong nhanh hon B-tree

-- DUNG: B-tree cho scalar equality
CREATE INDEX ON tasks(status);  -- B-tree default, du dung

-- ANTI-PATTERN: BRIN cho table co nhieu UPDATE
CREATE INDEX ON tasks USING BRIN (created_at);
-- Sai: UPDATE lam xao tron physical order -> correlation thap -> BRIN vo dung
-- Fix: kiem tra correlation truoc

SELECT correlation FROM pg_stats
WHERE tablename = 'tasks' AND attname = 'created_at';
-- Neu < 0.7 -> BRIN kem hieu qua -> dung B-tree

-- ANTI-PATTERN: Partial index voi WHERE clause it gap
CREATE INDEX ON tasks(id) WHERE status = 'archived';
-- archived chi chiem 0.1% row -> partial index qua nho de co ich
-- Planner co the prefer seq scan do overhead

Specialized index mang overhead thực: GIN write 5–10x chậm hơn, BRIN vô dụng nếu data không có correlation, Partial index không được dùng nếu query không match WHERE clause. Luôn đo trước, optimize sau.

9. Applied — TaskFlow specialized index plan

Column / Pattern	Index choice	Lý do
`tasks.id` (PK)	B-tree (auto)	Scalar equality mặc định
`tasks(assignee_id, status, due_at)`	B-tree composite	Equality + range, bài 3 module này
`tasks.metadata JSONB` — containment `@>`	GIN jsonb_path_ops	Multi-value, chỉ cần `@>`
`tasks.body_tsv tsvector` (FTS)	GIN	Inverted token cho full-text search
`task_history(event_at)` — audit append-only	BRIN	Append-only lớn, correlation gần 1.0
`tasks(due_at) WHERE status IN ('todo','doing')`	Partial B-tree	70% storage save, active task filter
`users(LOWER(email))`	Expression B-tree	Function on column, immutable

Verify plan trước khi deploy:

-- Kiem tra GIN duoc dung cho containment query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tasks WHERE metadata @> '{"priority":"high"}';
-- Mong doi: Bitmap Index Scan on idx_tasks_metadata_path

-- Kiem tra BRIN duoc dung cho time range scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM task_history WHERE event_at > NOW() - INTERVAL '7 days';
-- Mong doi: Bitmap Index Scan on idx_task_history_event_at

-- Kiem tra partial index duoc dung
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tasks
WHERE status IN ('todo', 'doing') AND due_at > NOW();
-- Mong doi: Index Scan on idx_tasks_active_due

-- Kiem tra expression index duoc dung
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- Mong doi: Index Scan on idx_users_email_lower

Forward: Module 9 của khoá này đi sâu JSONB operator class và GIN cho full-text search tiếng Việt. Module 11 của khoá này bàn BRIN tuning pages_per_range và production monitoring.

10. Deep Dive — Specialized indexes

📚 Deep Dive — Specialized indexes

PostgreSQL Documentation Ch.11.2 "Index Types" — catalog chính thức 6 loại index, operator class, khi nào dùng từng loại.
PostgreSQL Documentation Ch.11.8 "Partial Indexes" — deep dive partial index: use case, planner constraint matching, và exclusion constraint.
Crunchy Data — "Choosing the right Postgres index" — practical decision guide từ team Crunchy, nhiều ví dụ thực tế.
PostgreSQL 14 Internals — Egor Rogov, Part V — GIN section — free PDF, GIN inverted index implementation chi tiết: posting list, pending list, và vacuum interaction.

Ghi chú: PG docs Ch.11.2 cho catalog tổng quan, Ch.11.8 cho partial deep dive. Crunchy Data blog cho decision guide nhanh. Rogov Part V cho GIN internals — đọc trước Module 9 của khoá này khi đi vào JSONB + FTS production.

11. Tóm tắt

PostgreSQL có 6 loại index: B-tree (default), Hash, GIN, BRIN, GiST, SP-GiST — mỗi loại tối ưu cho data type và access pattern khác nhau.
GIN cho multi-value column (array, JSONB, tsvector): inverted index → element → list row. Write chậm hơn B-tree 5–10x; jsonb_path_ops nhỏ hơn nếu chỉ cần @>.
BRIN cho time-series append-only lớn: lưu min/max per block range → index nhỏ hơn B-tree 1000x. Vô hiệu nếu correlation thấp — kiểm tra pg_stats trước.
Partial index chỉ index subset row thỏa WHERE clause → index nhỏ hơn, scan nhanh hơn. Query phải include điều kiện matching để planner dùng được.
Expression index lưu kết quả function trên column → giải quyết case-insensitive search, date truncation, computed value. Expression phải immutable.
INCLUDE columns cover SELECT → Index Only Scan, tránh heap fetch.
Forward: Module 7 của khoá này (planner chọn index theo statistics), Module 9 của khoá này (JSONB + GIN deep + FTS tiếng Việt), Module 11 của khoá này (BRIN production setup + bloat monitoring).

12. Tự kiểm tra

Tự kiểm tra

Vì sao GIN write chậm hơn B-tree 5–10x? Giải thích cơ chế cập nhật inverted list khi INSERT một row mới.

▸

B-tree insert một row: traverse từ root xuống leaf, insert key vào đúng vị trí sorted → ghi 1–2 page (kể cả page split). Đơn giản và local.

GIN insert một row JSONB: phải update inverted list cho từng element trong document mới. Một row có 5 key-value pair → 5 inverted list entries phải update → 5–10 page write thay vì 1. Thêm nữa, GIN dùng pending list buffer tạm thời — INSERT trước vào pending list (nhanh), sau đó background worker merge vào inverted index chính (chậm hơn, gây write amplification batch). Pending list merge là nguồn gốc của 5–10x overhead.

BRIN index size nhỏ hơn B-tree 1000x — tradeoff cụ thể là gì? Khi nào BRIN trở nên vô dụng?

▸

Tradeoff: BRIN chỉ lưu min/max per block range (128 page mặc định) thay vì index từng row. Với 100 triệu row, B-tree cần ~4 GB để lưu mỗi row pointer; BRIN chỉ cần ~100 KB lưu vài nghìn (min, max) pair. Index nhỏ → toàn bộ BRIN fit trong buffer cache → lookup metadata nhanh.

Nhược điểm: BRIN không cho phép random access — chỉ loại trừ block range chắc chắn không thỏa query. Với range query nhỏ (1 row), BRIN vẫn phải scan toàn bộ block range matching (~1 MB) thay vì trỏ đến row cụ thể như B-tree → kém hơn B-tree cho selective point lookup.

Vô dụng khi: dữ liệu có correlation thấp (gần 0). Ví dụ: bảng tasks có UPDATE thay đổi thứ tự vật lý → mỗi block range có thể chứa row từ mọi khoảng thời gian → min/max per range cover gần như toàn bộ domain → BRIN không skip được block nào → tương đương seq scan.

Phân biệt partial index và WHERE clause trong query thông thường. Khi nào partial index đáng tạo?

▸

Partial index là index chỉ chứa row thỏa điều kiện khai báo lúc CREATE INDEX. Index nhỏ hơn proportional với subset. Planner dùng partial index khi WHERE clause của query bao gồm hoặc khớp điều kiện partial.

WHERE clause trong query chỉ lọc kết quả sau khi fetch từ index hoặc table — không ảnh hưởng index size.

Đáng tạo partial index khi: (1) subset row thỏa điều kiện chiếm nhỏ hơn 50% (lý tưởng dưới 30%) — storage save đáng kể; (2) điều kiện đó xuất hiện trong phần lớn query thực tế (vd WHERE deleted_at IS NULL luôn có trong 95% query); (3) full index quá lớn ảnh hưởng buffer cache. Không đáng nếu subset quá nhỏ (dưới 1%) hoặc điều kiện hiếm gặp trong query thực tế.

Expression index yêu cầu function phải immutable. Vì sao? Cho 1 ví dụ function không immutable và hậu quả nếu PG cho phép.

▸

Index lưu giá trị precomputed của expression tại thời điểm INSERT/UPDATE row. Khi query tra index, PG so sánh giá trị query với giá trị đã lưu trong index. Nếu function không immutable — tức là cùng input cho output khác nhau theo thời gian hay ngữ cảnh — giá trị đã lưu trong index sẽ lỗi thời, không match với kết quả hàm tính lại lúc query. Index trả về kết quả sai.

Ví dụ: NOW() không immutable — mỗi lần gọi trả về thời điểm khác. Nếu PG cho phép CREATE INDEX ON tasks((created_at - NOW())): giá trị lưu trong index là khoảng cách tính lúc INSERT (vd "5 ngày trước"). Lúc query, NOW() đã khác → created_at - NOW() tính lại cho giá trị khác → không khớp index → query trả về row sai hoặc bỏ sót row. PostgreSQL reject ngay khi CREATE INDEX với lỗi "functions in index expression must be marked IMMUTABLE".

Phân biệt INCLUDE column vs thêm column vào sort key của composite index. Khác biệt nào quan trọng cho Index Only Scan?

▸

Sort key (vd (a, b, c)): cả ba column được lưu trong B-tree sorted structure → planner có thể dùng index cho filter/sort trên bất kỳ leftmost prefix. Write overhead tăng theo số column trong key.

INCLUDE (vd (a, b) INCLUDE (c)): column c được lưu ở leaf page nhưng không tham gia sort order → planner không thể filter hay sort bằng c; chỉ dùng để cover SELECT list cho Index Only Scan — tránh heap fetch. Write overhead thấp hơn thêm vào sort key vì không cần maintain B-tree order cho column INCLUDE.

Quan trọng cho Index Only Scan: nếu query chỉ cần SELECT c FROM t WHERE a = 1, cần c trong leaf page nhưng không cần sort theo c → INCLUDE là lựa chọn đúng. Thêm c vào sort key vừa không cần thiết vừa tốn storage và write overhead hơn.

TaskFlow có query containment `metadata @> '{"priority":"high"}'` chạy 1000 lần/phút. Chọn GIN jsonb_path_ops hay jsonb_ops? Giải thích tradeoff.

▸

Chọn jsonb_path_ops cho use case này.

jsonb_ops (mặc định): index cả key và value riêng lẻ. Hỗ trợ nhiều operator: @> (containment), ? (key existence), ?| (any key), ?& (all keys). Index lớn hơn vì lưu nhiều entry hơn per document.

jsonb_path_ops: index key+value theo cặp hash, không index key riêng lẻ. Chỉ hỗ trợ @>. Index nhỏ hơn 2–3x và lookup nhanh hơn cho @> vì ít entry hơn, hash lookup thay vì tree lookup trên key string.

Với workload 1000 query/phút, toàn bộ dùng @>, không cần ? key existence: jsonb_path_ops cho index nhỏ hơn (cache-friendly hơn) và lookup nhanh hơn. Chỉ chọn jsonb_ops nếu cần ?, ?|, ?& — thường khi query kiểm tra sự tồn tại của key không có value cụ thể.

⁂

Bài tiếp theo: Mini-challenge dashboard tuning 2.1s → 50ms

Bài này có giúp bạn hiểu bản chất không?

Hỏi đáp về bài này

Chưa có câu hỏi

Đặt câu hỏi

Có gì chưa rõ trong bài? Đặt câu hỏi đầu tiên — câu trả lời từ cộng đồng giúp bạn (và người sau).

Đặt câu hỏi đầu tiên

← Bài trước

Composite index ordering — leftmost prefix rule + INCLUDE covering

Bài tiếp

Mini-challenge M05 — TaskFlow dashboard 2.1s → 50ms qua index strategy