Bài 56 / 60~22 phútJSONB, full-text & pgvectorMiễn phí… lượt xem

GIN index cho JSONB — jsonb_ops vs jsonb_path_ops, partial GIN, write tax

Q: Tại sao B-tree không thể dùng cho query `WHERE task_metadata @> '{\"priority\":\"high\"}'` dù task_metadata là column có index B-tree? Giải thích cơ chế bên trong.

( B-tree được thiết kế để index một giá trị scalar duy nhất per row và so sánh theo tổng thứ tự (total order) — nhỏ hơn, bằng, lớn hơn. B-tree có thể so sánh hai JSONB document theo toàn bộ nội dung (equality toàn phần), nhưng không có cách nào để "decompose" document thành các key-value element riêng lẻ và index từng element đó. Containment '@>' hỏi "document này có chứa sub-document kia không?" — đây không phải total order

Q: "App TaskFlow có 3 query pattern trên task_metadata: (1) `@> ' \"priority\":\"high\" '`

( Một index duy nhất có thể cover cả 3: 'CREATE INDEX ON tasks USING gin(task_metadata)' (tức là 'jsonb_ops' mặc định) — hỗ trợ '@>'

Q: "GIN có 'pending list' — INSERT ghi vào pending trước

Vấn đề 1 — Stale read window: row vừa INSERT có thể chưa xuất hiện ngay trong GIN search kết quả (vẫn trong pending list, chưa merge vào main inverted index). Với workload read-after-write nhạy cảm (vd: task vừa tạo phải tìm thấy ngay), cần biết window này tồn tại — thường chỉ vài millisecond nhưng có thể dài hơn dưới load. Vấn đề 2 — Merge overhead: khi pending list đầy (mặc định 4MB), PostgreSQL trigger merge vào main index — đây là operation I/O-intensive, có thể gây latency spike cho write. Với burst INSERT (vd: bulk sync mỗi giờ), pending list đầy liên tục → nhiều merge trigger → write throughput không ổn định. Khi nên tăng 'gin_pending_list_limit' : workload có burst write ngắn (vd: 100k INSERT trong 30 giây mỗi giờ) — tăng limit giúp buffer toàn bộ burst, flush một lần sau khi xong thay vì nhiều lần nhỏ. Ngược lại, workload write đều đặn không cần tăng — merge frequency phân tán đều, không có spike. Sau bulk load lớn, chạy 'VACUUM tasks' để flush pending list ngay thay vì chờ autovacuum.

Q: Partial GIN với `WHERE task_metadata ? 'priority'` khác gì với full GIN index? Trong trường hợp nào partial GIN nhỏ hơn đáng kể nhưng vẫn cover query cần thiết?

( Full GIN index tất cả row trong bảng — bao gồm cả row không có key 'priority' trong task_metadata. Inverted list cho mỗi element chứa CTID từ toàn bộ bảng. Partial GIN với 'WHERE task_metadata ? \'priority\'' chỉ index row có key 'priority' . Nếu 60% task không có priority field (vd: task từ các project không dùng priority system), partial GIN nhỏ hơn 60%. Nhỏ hơn đáng kể khi: (1) subset thỏa WHERE clause chiếm nhỏ hơn 50% tổng row; (2) query luôn include điều kiện matching điều kiện partial — vd mọi query priority đều có 'AND task_metadata ? \'priority\'' hoặc 'AND task_metadata @> \' "priority": ... \'' . Nếu app có query '@>' trên priority mà không filter by key existence

Q: "Sau 6 tháng production

Bước 1 — Đo index usage thực tế: 'SELECT indexname, idx_scan, idx_tup_read, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_stat_user_indexes WHERE relname = \'tasks\' ORDER BY idx_scan DESC;' Index có 'idx_scan = 0' sau 7+ ngày production → không có query nào dùng → candidate để drop. Index có 'idx_scan' thấp nhưng size lớn → cost/benefit âm. Bước 2 — Measure write overhead: tạm thời disable index (hoặc test trên staging) và so sánh INSERT throughput. GIN thường chiếm 40–60% write overhead so với bảng không có index. Quyết định: giữ index có 'idx_scan' cao (thực sự được dùng trong query hot). Drop index 'idx_scan = 0' hoặc thấp. Hợp nhất nếu có thể: nếu cả jsonb_ops full và partial priority đều được dùng nhưng partial chỉ cover một phần nhỏ query, cân nhắc giữ một index jsonb_ops full và bỏ partial nếu query pattern không thực sự cần partial. Đo lại throughput sau mỗi index drop trên production (dùng 'pg_stat_user_tables.n_tup_ins' rate theo thời gian).

WHERE task_metadata @> '{"priority":"high"}' trên 5M row = Seq Scan 3 giây. GIN inverted index cho JSONB: jsonb_ops vs jsonb_path_ops, partial GIN, expression GIN, bulk load pattern, gin_pending_list_limit.

WHERE task_metadata @> '{"priority":"high"}' trên bảng tasks 5 triệu row — không có index — chạy Seq Scan và mất hơn 3 giây. Thêm B-tree index trên column JSONB? PostgreSQL báo lỗi: B-tree không biết cách so sánh JSONB containment. Cần một loại index khác.

Bài 4 Module 5 của khoá này đã giới thiệu GIN (Generalized Inverted Index) ở mức tổng quan cùng với BRIN, partial, và expression index. Bài này đi sâu vào GIN riêng cho JSONB: hai operator class, tradeoff về size và write speed, partial GIN cho subset hot, và các pitfall thực chiến khi production table có nhiều write.

1. Analogy — Mục lục sách vs Bảng tra từ điển đảo ngược

Hai loại tra cứu trong sách — mục lục chính và bảng tra chủ đề đảo ngược:

Tra cứu	Tương đương index	Cách hoạt động
Mục lục chính (trang → chủ đề)	B-tree	Sort theo key, binary search tìm chính xác
Bảng tra đảo ngược ("từ khoá → các trang chứa từ")	GIN	Mỗi element → danh sách trang/row chứa element đó
Tìm chương có từ "index"	GIN containment `@>`	Lookup "index" trong bảng đảo ngược → nhận list row
Đọc theo số trang tăng dần	B-tree range scan	Traverse sorted tree từ trái sang phải

💡 Cách nhớ

B-tree = mục lục theo thứ tự — nhanh cho tìm chính xác và range có sort. GIN = bảng tra "từ khoá → các trang chứa từ" — nhanh cho "document nào chứa element X?". JSONB containment (@>) là câu hỏi "document nào chứa cặp key-value X?" — đây chính là bài toán GIN được sinh ra để giải.

2. Vấn đề — `@>` trên 5M row không có index

TaskFlow mở rộng: bảng tasks 5 triệu row, mỗi row có task_metadata JSONB với priority, labels, estimate. Query filter theo priority:

-- Schema mo rong TaskFlow voi JSONB
CREATE TABLE tasks (
  id            BIGSERIAL PRIMARY KEY,
  project_id    BIGINT NOT NULL,
  title         TEXT NOT NULL,
  status        TEXT NOT NULL DEFAULT 'todo',
  assignee_id   BIGINT,
  due_at        TIMESTAMPTZ,
  task_metadata JSONB NOT NULL DEFAULT '{}',
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Vd du lieu trong task_metadata:
-- {"priority": "high", "labels": ["deploy", "prod"], "estimate": {"hours": 4, "sp": 3}}
-- {"priority": "low",  "labels": ["docs"],            "compliance_tag": "GDPR"}
-- {"sprint": 12,       "story_points": 8,             "epic": "security"}

-- Query: tim tat ca task priority high
EXPLAIN ANALYZE
SELECT * FROM tasks
WHERE task_metadata @> '{"priority":"high"}';

Kết quả không có index:

Seq Scan on tasks  (cost=0.00..198432.00 rows=5000 width=256)
                   (actual time=0.042..3127.451 rows=486234 loops=1)
  Filter: (task_metadata @> '{"priority": "high"}'::jsonb)
  Rows Removed by Filter: 4513766
Planning Time: 0.089 ms
Execution Time: 3189.204 ms

PostgreSQL phải đọc qua toàn bộ 5 triệu row, evaluate @> cho từng row. B-tree không giải quyết được vì B-tree không biết cách "decompose" JSONB document để index từng key-value pair.

3. GIN — inverted index hoạt động như thế nào

GIN (Generalized Inverted Index) lưu mapping ngược: từng element (key, value, key+value path) trong JSONB → danh sách CTID (row identifier) của các row chứa element đó.

Du lieu trong tasks:
  Row (0,1): {"priority":"high", "labels":["deploy","prod"]}
  Row (0,2): {"priority":"low",  "labels":["docs"]}
  Row (0,3): {"priority":"high", "labels":["docs","prod"]}

GIN inverted index (jsonb_ops -- luu key va value rieng):
  key "priority"      ->  [(0,1), (0,2), (0,3)]
  value "high"        ->  [(0,1), (0,3)]
  value "low"         ->  [(0,2)]
  key "labels"        ->  [(0,1), (0,2), (0,3)]
  value "deploy"      ->  [(0,1)]
  value "prod"        ->  [(0,1), (0,3)]
  value "docs"        ->  [(0,2), (0,3)]
  ...

Query: WHERE task_metadata @> '{"priority":"high"}'
  -> GIN lookup element "priority":"high"
  -> Lay list CTID: [(0,1), (0,3)]
  -> Bitmap Heap Scan: fetch 2 row tu heap
  -> Khong can scan 5 trieu row

Kết quả query sau khi có GIN index:

Bitmap Heap Scan on tasks  (cost=312.45..8934.12 rows=486234 width=256)
                            (actual time=12.341..45.892 rows=486234 loops=1)
  Recheck Cond: (task_metadata @> '{"priority": "high"}'::jsonb)
  ->  Bitmap Index Scan on idx_tasks_metadata_full
        (actual time=11.203..11.203 rows=486234 loops=1)
        Index Cond: (task_metadata @> '{"priority": "high"}'::jsonb)
Planning Time: 0.312 ms
Execution Time: 68.547 ms

Từ 3189ms xuống 68ms — gần 50x nhanh hơn. Bitmap Index Scan (bài 2 Module 7 của khoá này) build bitmap row positions từ GIN, sau đó Bitmap Heap Scan fetch heap page theo thứ tự sequential.

4. Hai operator class — `jsonb_ops` vs `jsonb_path_ops`

PostgreSQL cung cấp hai operator class cho GIN trên JSONB, với tradeoff rõ ràng về size, write speed, và operator support:

-- jsonb_ops: mac dinh khi khong khai bao
-- Index moi key VA moi value RIENG LE
CREATE INDEX idx_tasks_metadata_full
ON tasks USING gin(task_metadata);

-- Tuong duong viet tuong minh:
-- CREATE INDEX ... USING gin(task_metadata jsonb_ops);

-- jsonb_path_ops: phai khai bao tuong minh
-- Chi index hash cua full path (key + value chuoi voi nhau)
CREATE INDEX idx_tasks_metadata_path
ON tasks USING gin(task_metadata jsonb_path_ops);

Bảng tradeoff đầy đủ:

Aspect	`jsonb_ops` (default)	`jsonb_path_ops`
Index size	2–3x lớn hơn	Nhỏ hơn
Write speed	Chậm hơn	Nhanh hơn (~30% so với jsonb_ops)
`@>` containment	Nhanh	Nhanh hơn (hash lookup)
`?` key exists	Hỗ trợ	Không hỗ trợ
`?	` any key	Hỗ trợ
`?&` all keys	Hỗ trợ	Không hỗ trợ
JSONPath `@@`	Hỗ trợ	Không hỗ trợ
Use case chính	Cần `?` key existence	Chỉ dùng `@>` (phổ biến nhất)

Lý do jsonb_path_ops nhỏ hơn và nhanh hơn cho @>: thay vì lưu key "priority" và value "high" như 2 entry riêng, jsonb_path_ops lưu hash của path priority.high như 1 entry duy nhất. Ít entry hơn → index nhỏ hơn → fit cache tốt hơn → lookup nhanh hơn.

Verify bằng EXPLAIN với 3 operator khác nhau:

-- 1) Containment @> -> ca 2 index deu duoc su dung
EXPLAIN (COSTS OFF)
SELECT * FROM tasks
WHERE task_metadata @> '{"priority":"high"}';
-- Bitmap Index Scan on idx_tasks_metadata_path  (jsonb_path_ops duoc dung)

-- 2) Key exists ? -> CHI jsonb_ops ho tro; jsonb_path_ops bi planner bo qua
EXPLAIN (COSTS OFF)
SELECT * FROM tasks
WHERE task_metadata ? 'compliance_tag';
-- Bitmap Index Scan on idx_tasks_metadata_full  (jsonb_ops duoc dung)
-- (Neu chi co jsonb_path_ops: Seq Scan -- khong the dung)

-- 3) JSONPath @@ -> CHI jsonb_ops ho tro
EXPLAIN (COSTS OFF)
SELECT * FROM tasks
WHERE task_metadata @@ '$.estimate.hours > 5';
-- Bitmap Index Scan on idx_tasks_metadata_full  (jsonb_ops duoc dung)
-- (Neu chi co jsonb_path_ops: Seq Scan)

Recommendation: nếu app chỉ dùng @> (containment) — trường hợp phổ biến nhất — dùng jsonb_path_ops. Chỉ dùng jsonb_ops khi cần ? key existence hoặc JSONPath @@.

5. Partial GIN — index subset row

Partial index thêm WHERE clause vào CREATE INDEX — chỉ index row thỏa điều kiện. Kết hợp với GIN cho JSONB: index nhỏ hơn, nhanh hơn cho subset hot.

-- Chi index task co key 'priority' (subset co priority field)
-- Task khong co priority key bi loai khoi index
CREATE INDEX idx_tasks_metadata_priority
ON tasks USING gin(task_metadata)
WHERE task_metadata ? 'priority';

-- Query phai include dieu kien matching de planner dung partial index
EXPLAIN (COSTS OFF)
SELECT * FROM tasks
WHERE task_metadata @> '{"priority":"high"}'
  AND task_metadata ? 'priority';
-- Bitmap Index Scan on idx_tasks_metadata_priority

-- Query thieu dieu kien WHERE -> planner dung full index hoac seq scan
SELECT * FROM tasks
WHERE task_metadata @> '{"priority":"high"}';
-- Planner co the dung full GIN neu co, hoac seq scan neu khong co

Expression GIN trên một path cụ thể — index chỉ một sub-document hoặc array:

-- Chi index labels array (sub-document, khong phai toan bo task_metadata)
CREATE INDEX idx_tasks_labels
ON tasks USING gin((task_metadata -> 'labels'));

-- Query tuong ung: dung arrow operator -> trich labels truoc khi compare
SELECT * FROM tasks
WHERE task_metadata -> 'labels' @> '["prod"]';
-- Bitmap Index Scan on idx_tasks_labels

-- Nho hon nhieu so voi GIN toan bo task_metadata
-- Chi huu ich neu labels query la dominant workload

Partial GIN kết hợp jsonb_path_ops — nhỏ nhất và nhanh nhất cho subset:

-- Ket hop: partial (chi high priority task) + jsonb_path_ops (chi @>)
CREATE INDEX idx_tasks_metadata_hot
ON tasks USING gin(task_metadata jsonb_path_ops)
WHERE task_metadata @> '{"priority":"high"}';

-- Index nay chi chua row priority=high -> rat nho
-- Query: WHERE task_metadata @> '{"priority":"high"}' AND [dieu kien hot khac]

6. Pitfall — write tax, JSON thay vì JSONB, pending list

Pitfall — 4 vấn đề thực chiến với GIN trên JSONB

Pitfall 1 — Write tax 5–10x so với B-tree:

GIN phải update inverted list cho mỗi key-value pair trong document mới. Một JSONB row có 6 key-value pair → 6 inverted list updates mỗi INSERT. Bảng heavy write với 3–4 GIN index có thể giảm INSERT throughput 60%.

-- Audit bang GIN index su dung: idx_scan = 0 sau 7 ngay -> co the drop
SELECT indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'tasks'
  AND indexname LIKE '%metadata%';

-- GIN it dung nhung van chiem write overhead -> drop neu idx_scan thap

Pitfall 2 — JSONB vs JSON: GIN chỉ index được JSONB:

GIN hoạt động trên binary decomposed structure của JSONB. Column type JSON (text-based) không được hỗ trợ.

-- LOI: khong the tao GIN tren JSON column
CREATE INDEX bad_idx ON tasks USING gin(settings);
-- ERROR: data type json has no default operator class for access method "gin"

-- FIX: migrate column sang JSONB truoc
ALTER TABLE tasks
  ALTER COLUMN settings TYPE JSONB USING settings::jsonb;

-- Sau do tao GIN binh thuong
CREATE INDEX idx_tasks_settings ON tasks USING gin(settings jsonb_path_ops);

Pitfall 3 — gin_pending_list_limit và stale search ngay sau INSERT:

GIN có cơ chế "pending list" — INSERT ghi vào pending list trước (nhanh), sau đó background worker (hoặc VACUUM) merge vào main inverted index. Mặc định pending list giới hạn 4MB.

Hệ quả: trong vài millisecond sau INSERT, row vừa insert có thể chưa xuất hiện trong GIN search kết quả. Với bulk write workload, pending list đầy nhanh → trigger merge thường xuyên → write overhead tăng.

-- Kiem tra pending list size hien tai (phan chua merge)
SELECT pg_size_pretty(
  pg_relation_size(indexrelid)
) AS pending_size
FROM pg_index
WHERE indrelid = 'tasks'::regclass
  AND indisvalid = false;  -- pending entries chua merge

-- Tang pending list cho heavy write workload (giam tan suat merge)
-- Dat trong postgresql.conf hoac per-session:
SET gin_pending_list_limit = '64MB';  -- mac dinh 4MB

-- Manual flush pending list (trong maintenance window):
-- VACUUM tasks;  -- flush pending list vao main index

Pitfall 4 — GIN index trên cả bảng khi chỉ cần một path:

Tao GIN trên task_metadata index tất cả key-value pair — bao gồm cả những field hiếm dùng. Nếu chỉ query theo labels array, expression GIN trên (task_metadata -> 'labels') nhỏ hơn 5–10x.

7. Applied — TaskFlow product catalog attribute filter

Use case e-commerce-style: product catalog với flexible attribute per category. Filter WHERE attributes @> '{"color":"red","size":"M"}' là query dominant.

-- Product catalog voi attribute flexible
CREATE TABLE products (
  id         BIGSERIAL PRIMARY KEY,
  name       TEXT NOT NULL,
  category   TEXT NOT NULL,
  price      NUMERIC(10,2) NOT NULL,
  attributes JSONB NOT NULL DEFAULT '{}'
  -- vd: {"color":"red","size":"M","material":"cotton","tags":["sale","new"]}
  -- vd: {"screen_size":"15.6","ram_gb":16,"storage_gb":512,"brand":"Dell"}
  -- Schema khac nhau theo category -- JSONB la lua chon dung
);

-- GIN jsonb_path_ops: chi @> containment, nho hon, nhanh hon
CREATE INDEX idx_products_attr
ON products USING gin(attributes jsonb_path_ops);

-- Filter multi-attribute: color + size
SELECT id, name, price
FROM products
WHERE attributes @> '{"color":"red","size":"M"}';
-- Plan: Bitmap Index Scan on idx_products_attr -> Bitmap Heap Scan

-- Filter by tag array containment
SELECT id, name, price
FROM products
WHERE attributes @> '{"tags":["sale"]}';
-- Plan: Bitmap Index Scan on idx_products_attr

-- Ket hop GIN + B-tree tren column khac
SELECT id, name, price
FROM products
WHERE attributes @> '{"color":"red"}'
  AND price < 500000
  AND category = 'fashion';
-- Plan: Bitmap Index Scan on idx_products_attr (GIN)
--   + Index Scan on idx_products_category_price (B-tree composite)
--   -> BitmapAnd -> Bitmap Heap Scan

Event log query theo event type — GIN cho JSONB payload:

CREATE TABLE audit_logs (
  id          BIGSERIAL PRIMARY KEY,
  occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  user_id     BIGINT,
  event_type  TEXT NOT NULL,
  payload     JSONB NOT NULL DEFAULT '{}'
);

-- GIN de filter payload theo event detail
CREATE INDEX idx_audit_payload ON audit_logs USING gin(payload jsonb_path_ops);

-- Tim tat ca audit log lien quan task_id = 42
SELECT occurred_at, event_type, payload
FROM audit_logs
WHERE payload @> '{"task_id": 42}'
ORDER BY occurred_at DESC
LIMIT 50;
-- Plan: Bitmap Index Scan on idx_audit_payload

8. Bulk load best practice — tắt GIN khi insert hàng loạt

GIN viết inverted list cho mỗi key-value trong mỗi row. Bulk INSERT 10M row với GIN live trên bảng chậm hơn 5–10x so với insert không có index rồi rebuild sau.

-- Truoc bulk INSERT: drop GIN de tang throughput
DROP INDEX idx_tasks_metadata_full;

-- Bulk insert du lieu (vd tu CSV hoac ETL pipeline)
COPY tasks (project_id, title, status, task_metadata)
FROM '/path/to/tasks_export.csv'
WITH (FORMAT csv, HEADER true);
-- Hoac INSERT ... SELECT tu staging table

-- Sau khi bulk insert xong: tang maintenance_work_mem truoc khi rebuild
-- (GIN build dung nhieu RAM hon B-tree)
SET maintenance_work_mem = '1GB';  -- mac dinh 64MB

-- Rebuild GIN voi jsonb_path_ops (pho bien nhat)
CREATE INDEX idx_tasks_metadata_full
ON tasks USING gin(task_metadata jsonb_path_ops);

RESET maintenance_work_mem;

-- Nhanh hon 5-10x so voi giu index live suot qua trinh bulk insert
-- VACUUM ANALYZE de cap nhat statistics sau rebuild
VACUUM ANALYZE tasks;

Nếu không thể drop index (bảng production đang có traffic), dùng SET LOCAL maintenance_work_mem trong transaction để giảm merge frequency:

BEGIN;
SET LOCAL gin_pending_list_limit = '256MB';  -- giam tan suat merge khi insert batch
INSERT INTO tasks SELECT ... FROM staging_tasks;  -- batch insert
COMMIT;
-- GIN pending list tu merge sau COMMIT
VACUUM tasks;  -- flush pending list vao main index ngay

9. Deep Dive

Deep Dive — GIN internals và JSONB indexing

PostgreSQL Documentation Ch.70 "GIN Indexes" — reference chính thức: GIN structure, operator class API, gin_pending_list_limit, fastupdate, vacuum interaction. Đọc phần "GIN and Lost Updates" để hiểu stale-read window sau INSERT.
PostgreSQL 14 Internals — Egor Rogov, Part V (GIN Indexes section) — free PDF từ Postgres Pro; Part V giải thích GIN posting list, pending list buffer, merge mechanism và vacuum triggering chi tiết hơn docs chính thức. Đây là nguồn tốt nhất để hiểu tại sao GIN write 5–10x chậm hơn B-tree ở mức implementation.
Crunchy Data — "JSONB multi-column type casting and search" — practical guide từ Crunchy team: jsonb_ops vs jsonb_path_ops benchmark thực tế, expression index cho JSONB path cụ thể, và khi nào kết hợp GIN với B-tree composite.

10. Tóm tắt

GIN là inverted index — mỗi key/value element → danh sách CTID row chứa element đó. Query @> lookup element trong inverted index thay vì scan toàn bảng → 50x nhanh hơn trên 5M row.
jsonb_ops (default): index key và value riêng lẻ — hỗ trợ @>, ?, ?|, ?&, JSONPath @@. Index lớn hơn, write chậm hơn. Dùng khi cần ? key existence.
jsonb_path_ops: index hash của full path (key+value chained) — chỉ hỗ trợ @>. Index nhỏ hơn 2–3x, write nhanh hơn, lookup @> nhanh hơn. Dùng khi app chỉ cần @> — trường hợp phổ biến nhất.
Partial GIN: thêm WHERE clause vào CREATE INDEX — chỉ index subset row hot. Kết hợp jsonb_path_ops + partial = index nhỏ nhất và nhanh nhất cho workload cụ thể.
Expression GIN trên (task_metadata -> 'labels') — index chỉ một sub-document. Nhỏ hơn 5–10x so với GIN toàn bộ column khi chỉ query một path.
Write tax: GIN write 5–10x chậm hơn B-tree — mỗi INSERT update inverted list cho từng element. Bảng heavy write với 3–4 GIN index có thể giảm INSERT throughput 60%. Audit bằng pg_stat_user_indexes.idx_scan.
GIN chỉ hoạt động với JSONB — column JSON không được hỗ trợ. Migration: ALTER COLUMN ... TYPE JSONB USING ...::jsonb.
Bulk load pattern: drop GIN trước bulk INSERT, rebuild với maintenance_work_mem = '1GB' sau — nhanh hơn 5–10x so với giữ index live.

11. Tự kiểm tra

Tự kiểm tra

Tại sao B-tree không thể dùng cho query `WHERE task_metadata @> '{"priority":"high"}'` dù task_metadata là column có index B-tree? Giải thích cơ chế bên trong.

▸

B-tree được thiết kế để index một giá trị scalar duy nhất per row và so sánh theo tổng thứ tự (total order) — nhỏ hơn, bằng, lớn hơn. B-tree có thể so sánh hai JSONB document theo toàn bộ nội dung (equality toàn phần), nhưng không có cách nào để "decompose" document thành các key-value element riêng lẻ và index từng element đó.

Containment @> hỏi "document này có chứa sub-document kia không?" — đây không phải total order, không phải equality toàn phần, mà là một quan hệ set containment. Để answer câu hỏi này nhanh, cần inverted index biết element nào xuất hiện trong row nào — đúng là cấu trúc GIN cung cấp.

Nếu cố tạo B-tree trên JSONB column và query @>: PostgreSQL sẽ báo lỗi operator không tương thích, hoặc nếu dùng equality = thì B-tree so sánh toàn bộ document — không giúp gì cho containment partial match.

App TaskFlow có 3 query pattern trên task_metadata: (1) `@> '{"priority":"high"}'`, (2) `? 'compliance_tag'`, (3) `@@ '$.estimate.hours > 5'`. Nên tạo index nào? Có thể dùng một index duy nhất không?

▸

Một index duy nhất có thể cover cả 3: CREATE INDEX ON tasks USING gin(task_metadata) (tức là jsonb_ops mặc định) — hỗ trợ @>, ?, và JSONPath @@.

Không thể dùng jsonb_path_ops vì pattern (2) và (3) yêu cầu ? key existence và JSONPath @@ — cả hai đều không được hỗ trợ bởi jsonb_path_ops.

Tradeoff: jsonb_ops lớn hơn và write chậm hơn jsonb_path_ops. Nếu (1) chiếm 90% traffic và (2)+(3) hiếm, cân nhắc hai index: jsonb_path_ops cho @> query hot + jsonb_ops partial index cho subset row cần ?/@@. Audit pg_stat_user_indexes sau một tuần để xem index nào thực sự được dùng.

GIN có 'pending list' — INSERT ghi vào pending trước, sau đó merge vào main index. Điều này gây ra vấn đề gì trong production? Khi nào nên tăng `gin_pending_list_limit`?

▸

Vấn đề 1 — Stale read window: row vừa INSERT có thể chưa xuất hiện ngay trong GIN search kết quả (vẫn trong pending list, chưa merge vào main inverted index). Với workload read-after-write nhạy cảm (vd: task vừa tạo phải tìm thấy ngay), cần biết window này tồn tại — thường chỉ vài millisecond nhưng có thể dài hơn dưới load.

Vấn đề 2 — Merge overhead: khi pending list đầy (mặc định 4MB), PostgreSQL trigger merge vào main index — đây là operation I/O-intensive, có thể gây latency spike cho write. Với burst INSERT (vd: bulk sync mỗi giờ), pending list đầy liên tục → nhiều merge trigger → write throughput không ổn định.

Khi nên tăng gin_pending_list_limit: workload có burst write ngắn (vd: 100k INSERT trong 30 giây mỗi giờ) — tăng limit giúp buffer toàn bộ burst, flush một lần sau khi xong thay vì nhiều lần nhỏ. Ngược lại, workload write đều đặn không cần tăng — merge frequency phân tán đều, không có spike. Sau bulk load lớn, chạy VACUUM tasks để flush pending list ngay thay vì chờ autovacuum.

Partial GIN với `WHERE task_metadata ? 'priority'` khác gì với full GIN index? Trong trường hợp nào partial GIN nhỏ hơn đáng kể nhưng vẫn cover query cần thiết?

▸

Full GIN index tất cả row trong bảng — bao gồm cả row không có key priority trong task_metadata. Inverted list cho mỗi element chứa CTID từ toàn bộ bảng.

Partial GIN với WHERE task_metadata ? 'priority' chỉ index row có key priority. Nếu 60% task không có priority field (vd: task từ các project không dùng priority system), partial GIN nhỏ hơn 60%.

Nhỏ hơn đáng kể khi: (1) subset thỏa WHERE clause chiếm nhỏ hơn 50% tổng row; (2) query luôn include điều kiện matching điều kiện partial — vd mọi query priority đều có AND task_metadata ? 'priority' hoặc AND task_metadata @> '{"priority": ...}'. Nếu app có query @> trên priority mà không filter by key existence, planner có thể không dùng partial index (vì không thể đảm bảo partial index cover đủ row).

Sau 6 tháng production, bảng tasks có 3 GIN index trên task_metadata (jsonb_ops full, partial priority, expression labels). INSERT throughput giảm 55%. Cách diagnose và quyết định drop index nào?

▸

Bước 1 — Đo index usage thực tế:

SELECT indexname, idx_scan, idx_tup_read, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_stat_user_indexes WHERE relname = 'tasks' ORDER BY idx_scan DESC;

Index có idx_scan = 0 sau 7+ ngày production → không có query nào dùng → candidate để drop. Index có idx_scan thấp nhưng size lớn → cost/benefit âm.

Bước 2 — Measure write overhead: tạm thời disable index (hoặc test trên staging) và so sánh INSERT throughput. GIN thường chiếm 40–60% write overhead so với bảng không có index.

Quyết định: giữ index có idx_scan cao (thực sự được dùng trong query hot). Drop index idx_scan = 0 hoặc thấp. Hợp nhất nếu có thể: nếu cả jsonb_ops full và partial priority đều được dùng nhưng partial chỉ cover một phần nhỏ query, cân nhắc giữ một index jsonb_ops full và bỏ partial nếu query pattern không thực sự cần partial. Đo lại throughput sau mỗi index drop trên production (dùng pg_stat_user_tables.n_tup_ins rate theo thời gian).

⁂

Bài tiếp theo: Hybrid schema pattern — column cứng + JSONB flexible

Bài này có giúp bạn hiểu bản chất không?

Hỏi đáp về bài này