~35 phútEXPLAIN & query optimization… lượt xem

Mini-challenge M07 — Dashboard 2.1s → 50ms trên 5M task dataset

Setup 5M task dataset. EXPLAIN baseline 2.1s. Ba fix incremental: composite partial covering index, keyset pagination thay OFFSET, extended statistics. Đo ratio sau mỗi bước — từ 2100ms xuống 1.5ms (1400x).

Dashboard TaskFlow từ bài 7 Module 2 của khoá này đang chạy 2.1 giây trên production — dataset đã lên 5 triệu task. PM yêu cầu dưới 50ms trước sprint kế tiếp. Đó là 40x speedup. Có vẻ khó, nhưng sau khi học EXPLAIN cost model (bài 4 của module này), scan strategy (bài 2), và tuning toolbox (bài 6 của module này), bạn đã có đủ công cụ.

Bài này không dạy concept mới — bài này tổng hợp toàn bộ module qua 3 fix incremental trên dataset thực chiến 5M row: composite partial covering index, keyset pagination thay OFFSET, và extended statistics cho cross-column correlation. Cuối bài: 4 EXPLAIN output, bảng so sánh baseline với 3 fix, tính ratio improvement.

Setup — 5M dataset script

Truncate sạch và seed 5 triệu task với phân bố realistic: 1000 project, 10000 user (10% unassigned), 4 status, due_at trải ±100 ngày.

-- Truncate sach truoc khi seed de clean state
TRUNCATE tasks RESTART IDENTITY CASCADE;

-- Insert 5M task
INSERT INTO tasks (project_id, assignee_id, title, status, due_at, created_at, updated_at)
SELECT
  (random() * 1000)::int + 1,                               -- 1000 project
  CASE WHEN random() < 0.1 THEN NULL                        -- 10% unassigned
       ELSE (random() * 10000)::int + 1 END,                -- 10000 user
  'Task ' || gs::text,
  (ARRAY['todo','doing','done','archived'])
    [floor(random() * 4) + 1],
  now() + (random() * 365 - 100) * interval '1 day',        -- +/-100 ngay
  now() - random() * interval '180 days',
  now() - random() * interval '90 days'
FROM generate_series(1, 5000000) gs;

-- Khoang 5-10 phut tuy toc do disk
-- Kiem tra so luong row sau khi done:
SELECT COUNT(*) FROM tasks;  -- expected: 5000000

Chạy ANALYZE ngay sau khi insert xong — không đợi autovacuum — để planner có statistics chính xác:

ANALYZE tasks;

-- Kiem tra statistics sau ANALYZE
SELECT attname, n_distinct, most_common_freqs[1:3] AS top3_freqs
FROM pg_stats
WHERE tablename = 'tasks'
  AND attname IN ('assignee_id', 'status', 'due_at')
ORDER BY attname;

Phân bố dữ liệu sau seed:

assignee_id: 10000 user → mỗi user ~450 row (5M × 90% / 10000)
status: 4 giá trị phân bố đều → mỗi status ~25%; archived ~1.25M row
due_at: range rộng ±100 ngày → khoảng 2–3% row rơi vào window 7 ngày bất kỳ

Step 1 — Baseline EXPLAIN ANALYZE

Query dashboard hiển thị task active của một assignee, paginate 20, join với projects:

EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
ORDER BY t.due_at
OFFSET 0 LIMIT 20;

Output baseline (tham khảo — số thực tế dao động theo hardware):

Limit  (cost=185432.18..185432.23 rows=20 width=72)
       (actual time=2051.234..2051.239 rows=20 loops=1)
  ->  Sort  (cost=185432.18..185882.32 rows=180054 width=72)
            (actual time=2051.232..2051.236 rows=20 loops=1)
        Sort Key: t.due_at
        Sort Method: external merge  Disk: 26432kB
        ->  Hash Join  (cost=28.50..180891.44 rows=180054 width=72)
                       (actual time=1.843..1987.456 rows=337500 loops=1)
              Hash Cond: (t.project_id = p.id)
              Buffers: shared hit=8821 read=33412, temp read=3304 written=3304
              ->  Seq Scan on tasks t
                    (cost=0.00..177081.00 rows=180054 width=60)
                    (actual time=0.041..1765.321 rows=337500 loops=1)
                    Filter: ((assignee_id = 5) AND (status <> 'archived'))
                    Rows Removed by Filter: 4662500
                    Buffers: shared hit=8765 read=33219
              ->  Hash  (cost=16.00..16.00 rows=1000 width=20)
                        (actual time=1.678..1.678 rows=1000 loops=1)
                    ->  Seq Scan on projects p
Planning Time: 1.234 ms
Execution Time: 2054.891 ms

Identify 3 issue từ plan output:

Issue 1 — Seq Scan toàn bảng 5M row: Planner đọc toàn bộ tasks (33.000+ buffer read), filter loại bỏ 4.66M row (Rows Removed by Filter: 4662500) chỉ để lấy 337.500 row của assignee=5 có status != archived. Không có index nào cho (assignee_id, status).

Issue 2 — External merge sort spill ra disk: Sort Method: external merge Disk: 26432kB — planner phải sort 337.500 row trước khi LIMIT 20. Work_mem không đủ giữ toàn bộ trong RAM → spill 26MB ra disk → thêm I/O. Nếu có index ordered theo due_at, bước Sort này biến mất hoàn toàn.

Issue 3 — OFFSET deep pagination sẽ tệ hơn nhiều: OFFSET 0 còn chịu được, nhưng OFFSET 1000 (trang 51) buộc planner phải scan và sort 1020 row rồi bỏ 1000 — O(n) theo OFFSET. OFFSET 100000 trên 5M row = vài giây. Cần keyset pagination.

Step 2 — Fix 1: composite partial covering index

Tạo index composite (assignee_id, status, due_at) với partial condition WHERE status != 'archived' và INCLUDE columns cần cho SELECT + JOIN:

CREATE INDEX idx_tasks_dashboard
ON tasks(assignee_id, status, due_at)
WHERE status != 'archived'           -- partial: bo qua 1.25M archived row
INCLUDE (project_id, title);         -- covering: tranh heap fetch cho SELECT list

-- VACUUM ANALYZE de cap nhat visibility map (can cho Index Only Scan)
VACUUM ANALYZE tasks;

Tại sao column order này đúng:

assignee_id equality trước — cardinality 10000, thu hẹp từ 5M xuống ~450 row
status equality tiếp — loại thêm done và todo/doing theo filter
due_at range cuối — range column phải đặt cuối để traverse hiệu quả (bài 3 Module 5 của khoá này)
Partial WHERE status != 'archived' — bỏ qua 1.25M row archived, index nhỏ hơn ~25%
INCLUDE (project_id, title) — cover toàn bộ SELECT list + JOIN key, cho phép Index Only Scan

Re-run EXPLAIN sau khi tạo index:

EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
ORDER BY t.due_at
OFFSET 0 LIMIT 20;

Output sau Fix 1:

Limit  (cost=0.56..42.18 rows=20 width=72)
       (actual time=0.312..0.487 rows=20 loops=1)
  ->  Nested Loop  (cost=0.56..7492.34 rows=3623 width=72)
                   (actual time=0.311..0.484 rows=20 loops=1)
        ->  Index Only Scan using idx_tasks_dashboard on tasks t
              (cost=0.56..5934.21 rows=3623 width=52)
              (actual time=0.289..0.421 rows=20 loops=1)
              Index Cond: ((assignee_id = 5) AND (status <> 'archived'))
              Heap Fetches: 0
              Buffers: shared hit=8
        ->  Index Scan on projects p
              (cost=0.00..0.43 rows=1 width=20)
              (actual time=0.003..0.003 rows=1 loops=20)
              Index Cond: (id = t.project_id)
Planning Time: 0.856 ms
Execution Time: 4.921 ms

Cải thiện: 2054ms → 4.9ms — 420x. Index Only Scan với Heap Fetches: 0 — toàn bộ data lấy từ index page, bỏ qua heap. Sort biến mất vì index đã sắp xếp theo due_at. Hash Join → Nested Loop vì row count nhỏ.

Step 3 — Fix 2: keyset pagination thay OFFSET

OFFSET 0 đã ổn với Fix 1, nhưng deep pagination vẫn tệ. Query với OFFSET 100000 buộc planner scan và bỏ 100000 row từ Index Scan — O(n) theo OFFSET. Keyset pagination dùng cursor từ page trước để Index range scan trực tiếp đến vị trí cần — O(1).

-- Thay the OFFSET bang keyset cursor
-- Cursor (due_at, id) lay tu row cuoi cua page truoc
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
  AND (t.due_at, t.id) > ('2026-06-15', 12345)   -- cursor tu page truoc
ORDER BY t.due_at, t.id
LIMIT 20;

Output sau Fix 2 (deep page — tương đương OFFSET 100000):

Limit  (cost=0.56..41.94 rows=20 width=76)
       (actual time=0.198..0.287 rows=20 loops=1)
  ->  Nested Loop  (cost=0.56..7448.12 rows=3589 width=76)
                   (actual time=0.197..0.284 rows=20 loops=1)
        ->  Index Only Scan using idx_tasks_dashboard on tasks t
              (cost=0.56..5896.78 rows=3589 width=56)
              (actual time=0.181..0.231 rows=20 loops=1)
              Index Cond: ((assignee_id = 5) AND (status <> 'archived')
                           AND ((due_at, id) > ('2026-06-15'::timestamptz, 12345)))
              Heap Fetches: 0
              Buffers: shared hit=6
        ->  Index Scan on projects p
              (cost=0.00..0.43 rows=1 width=20)
              (actual time=0.003..0.003 rows=1 loops=20)
              Index Cond: (id = t.project_id)
Planning Time: 0.712 ms
Execution Time: 2.134 ms

So sánh với OFFSET tương đương: OFFSET 100000 với fix 1 index sẽ mất ~400ms vì phải traverse 100000 index entry trước khi đến row cần. Keyset cursor nhảy thẳng đến (due_at, id) > ('2026-06-15', 12345) — Index range scan từ cursor, chỉ đọc 20 row cần thiết. Đọc thêm về keyset pagination tại bài 3 Module 2 của khoá này.

Lưu ý khi implement keyset:

ORDER BY phải bao gồm cả id để cursor unique khi nhiều row có cùng due_at
Client lưu (due_at, id) của row cuối page làm cursor cho request kế tiếp
Không thể nhảy đến trang tùy ý (page 50 của 100) — keyset chỉ cho phép next/previous

Step 4 — Fix 3: extended statistics cho cross-column correlation

Planner hiện tại estimate row count độc lập cho từng column — giả định status và due_at không tương quan. Nhưng thực tế task status='todo' thường có due_at trong tương lai gần, task done thường có due_at cũ. Estimate sai này không ảnh hưởng query dashboard đơn giản, nhưng với query phức tạp hơn (ví dụ filter cả status và range due_at hẹp), planner có thể chọn sai join algorithm.

-- Tao extended statistics cho cross-column correlation giua status va due_at
CREATE STATISTICS stats_tasks_status_due (dependencies, ndistinct)
ON status, due_at FROM tasks;

-- BAT BUOC chay ANALYZE de thu thap extended statistics
ANALYZE tasks;

-- Kiem tra extended statistics da duoc thu thap
SELECT stxname, stxkeys, stxdependencies
FROM pg_statistic_ext
JOIN pg_statistic_ext_data ON pg_statistic_ext.oid = pg_statistic_ext_data.stxoid
WHERE stxrelid = 'tasks'::regclass;

Re-run EXPLAIN sau Fix 3 — planner bay giờ biết correlation giữa status và due_at:

EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
  AND (t.due_at, t.id) > ('2026-06-15', 12345)
ORDER BY t.due_at, t.id
LIMIT 20;

Output sau Fix 3:

Limit  (cost=0.56..38.91 rows=20 width=76)
       (actual time=0.163..0.241 rows=20 loops=1)
  ->  Nested Loop  (cost=0.56..6812.44 rows=3312 width=76)
                   (actual time=0.162..0.239 rows=20 loops=1)
        ->  Index Only Scan using idx_tasks_dashboard on tasks t
              (cost=0.56..5312.11 rows=3312 width=56)
              (actual time=0.149..0.189 rows=20 loops=1)
              Index Cond: ((assignee_id = 5) AND (status <> 'archived')
                           AND ((due_at, id) > ('2026-06-15'::timestamptz, 12345)))
              Heap Fetches: 0
              Buffers: shared hit=5
        ->  Index Scan on projects p
              (cost=0.00..0.43 rows=1 width=20)
              (actual time=0.003..0.003 rows=1 loops=20)
              Index Cond: (id = t.project_id)
Planning Time: 0.681 ms
Execution Time: 1.543 ms

Cải thiện: 2.13ms → 1.54ms — nhỏ cho query đơn giản này. Lợi ích chính của extended statistics thể hiện rõ hơn ở corner case: query filter status = 'todo' AND due_at < now() (task todo quá hạn — rare combination). Với independent estimate, planner đánh giá quá cao số row → chọn Hash Join khi lẽ ra Nested Loop tốt hơn. Với dependencies statistics, planner biết correlation và estimate chính xác hơn → plan tốt hơn cho toàn bộ query pattern của codebase.

So sánh 4 plan

Plan	Time (ms)	Cost ước tính	Buffer reads	Strategy chính
Baseline	2054	185.432	42.233	Seq Scan + external merge sort (spill 26MB)
Fix 1 — covering partial index	4.9	42	8	Index Only Scan, Heap Fetches: 0, no sort
Fix 1+2 — keyset deep page	2.1	42	6	Index range scan từ cursor, skip 100k rows
Fix 1+2+3 — extended stats	1.5	39	5	Better estimate, tối ưu corner case plan

Đóng góp của mỗi fix:

Fix 1 — covering partial index: 2054ms → 4.9ms — cải thiện 98.8%. Đây là fix quan trọng nhất. Xoá Seq Scan, xoá Sort + spill, chuyển sang Index Only Scan với Heap Fetches: 0.
Fix 2 — keyset pagination: Trang đầu (OFFSET 0) không thay đổi nhiều, nhưng deep page (OFFSET 100000 tương đương) từ ~400ms xuống 2ms — cải thiện 99.5% cho deep page. Không visible trong bảng trên vì bảng so sánh ở OFFSET 0.
Fix 3 — extended statistics: 2.1ms → 1.5ms — cải thiện nhỏ cho query đơn giản, nhưng quan trọng cho correctness estimate trên toàn bộ query pattern có filter phức tạp.

Tổng improvement baseline → final: 2054ms / 1.54ms ≈ 1334x.

Deliverable

Tạo file module7_challenge.sql ghi lại 4 EXPLAIN output và comment insight:

-- ==============================================
-- Module 7 Mini-challenge: Dashboard 2s -> 50ms
-- Dataset: 5M tasks, 1000 projects, 10000 users
-- ==============================================

-- ==============================================
-- Section 0: Setup
-- ==============================================
TRUNCATE tasks RESTART IDENTITY CASCADE;
-- ... (seed script tu Setup section)
ANALYZE tasks;

-- ==============================================
-- Section 1: Baseline EXPLAIN (no optimization)
-- ==============================================
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
ORDER BY t.due_at
OFFSET 0 LIMIT 20;
-- Record: Seq Scan + Sort spill 26MB, ~2054ms
-- Issue 1: Seq Scan 5M rows, filter loai 4.66M
-- Issue 2: external merge sort spill disk
-- Issue 3: OFFSET deep page -> O(n) scan

-- ==============================================
-- Section 2: Fix 1 - Composite partial covering index
-- ==============================================
CREATE INDEX idx_tasks_dashboard
ON tasks(assignee_id, status, due_at)
WHERE status != 'archived'
INCLUDE (project_id, title);
VACUUM ANALYZE tasks;

EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
ORDER BY t.due_at
OFFSET 0 LIMIT 20;
-- Record: Index Only Scan, Heap Fetches: 0, ~4.9ms
-- Improvement: 420x (2054ms -> 4.9ms)
-- Note: Fix 1 dong gop 98.8% cai thien tong (fix quan trong nhat)

-- ==============================================
-- Section 3: Fix 2 - Keyset pagination
-- ==============================================
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
  AND (t.due_at, t.id) > ('2026-06-15', 12345)
ORDER BY t.due_at, t.id
LIMIT 20;
-- Record: Index range scan tu cursor, ~2.1ms (tuong duong OFFSET 100000)
-- Compare: OFFSET 100000 tuong duong -> ~400ms; keyset -> 2.1ms (190x cho deep page)

-- ==============================================
-- Section 4: Fix 3 - Extended statistics
-- ==============================================
CREATE STATISTICS stats_tasks_status_due (dependencies, ndistinct)
ON status, due_at FROM tasks;
ANALYZE tasks;

EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, t.title, t.due_at, p.name AS project_name
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE t.assignee_id = 5
  AND t.status != 'archived'
  AND (t.due_at, t.id) > ('2026-06-15', 12345)
ORDER BY t.due_at, t.id
LIMIT 20;
-- Record: ~1.54ms, cost estimate giam, better plan cho corner case

-- ==============================================
-- Section 5: Ratio summary
-- ==============================================
-- Baseline:          2054 ms  (1x)
-- Fix 1 (index):      4.9 ms  (420x)
-- Fix 1+2 (keyset):   2.1 ms  (keyset deep page vs OFFSET 100k: 190x)
-- Fix 1+2+3 (stats):  1.5 ms  (final: ~1334x vs baseline)
--
-- Insight: Fix nao dong gop nhieu nhat?
-- -> Fix 1 (covering partial index) = 98.8% cai thien
--    Seq Scan loai -> Index Only Scan, Sort spill loai
--    Day la fix always-do-first cho query pattern nay
-- -> Fix 2 (keyset) quan trong cho deep pagination O(n) -> O(1)
-- -> Fix 3 (extended stats) cai thien estimate corner case,
--    gia tri tang them theo complexity cua query

Kèm comment cuối file: insight nào bất ngờ nhất và tại sao covering partial index là fix đầu tiên cần làm cho mọi dashboard query pattern tương tự.

Forward — M11 của khoá này: auto_explain + pg_stat_statements

Bài này bạn biết query nào cần tune vì PM báo cáo. Trong production với hàng trăm query, làm thế nào để tự động phát hiện slow query?

auto_explain — PostgreSQL extension log EXPLAIN plan của mọi query chậm hơn threshold:

-- postgresql.conf hoac ALTER SYSTEM:
-- auto_explain.log_min_duration = 200   -- log query > 200ms
-- auto_explain.log_analyze = on         -- bao gom ANALYZE (actual rows)
-- auto_explain.log_buffers = on         -- bao gom BUFFERS (I/O)
-- auto_explain.log_nested_statements = on -- log ca query ben trong function

-- Ket qua: moi query > 200ms tu dong ghi EXPLAIN output vao server log
-- Khong can phat hien thu cong, khong can reproduce

pg_stat_statements — view tổng hợp top N slow query theo total/mean time:

-- Sau khi enable extension:
-- CREATE EXTENSION pg_stat_statements;

-- Top 10 query chiem nhieu thoi gian nhat (total_exec_time)
SELECT query, calls, mean_exec_time, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Top 10 query co mean_exec_time cao nhat (outlier don le)
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

pg_stat_statements cho biết query nào đang tốn nhiều CPU nhất trên toàn bộ workload — không chỉ query bạn đang debug. Module 11 của khoá này sẽ deep dive: cấu hình auto_explain đầy đủ, tune threshold, đọc pg_stat_statements để ưu tiên optimization effort, và kết hợp với pg_stat_user_indexes để audit index nào không được dùng.

Deep Dive — đọc thêm

PostgreSQL Documentation — auto_explain: cấu hình đầy đủ, các GUC, log format, nested statement logging. Đọc phần "Configuration Parameters" trước khi bật trong production.
PostgreSQL Documentation — pg_stat_statements: schema của view, pg_stat_statements_reset(), normalization query text, track=all vs track=top.
PostgreSQL Documentation — CREATE STATISTICS: dependencies, ndistinct, mcv statistic kinds, khi nào cần extended statistics, cách đọc pg_statistic_ext_data.

Tự kiểm tra

Baseline plan cho thấy 'Sort Method: external merge Disk: 26432kB'. Cơ chế nào xảy ra và Fix 1 xoá bỏ nó như thế nào? Giải thích liên quan đến work_mem và index order.

▸

External merge sort xảy ra khi PostgreSQL cần sort một tập dữ liệu lớn hơn work_mem. Planner phải sort ~337.500 row (kết quả sau filter) trước khi áp LIMIT 20 — không thể biết trước 20 row nào có due_at nhỏ nhất mà không sort hết. work_mem mặc định (4MB hoặc tùy cấu hình) không đủ → PostgreSQL chia data thành các run nhỏ, sort từng run trong RAM, merge các run trên disk — tốn thêm 26MB I/O disk.

Fix 1 xoá external sort vì index (assignee_id, status, due_at) đã sắp xếp sẵn theo due_at. Index Only Scan traverse index theo thứ tự ascending due_at ngay từ đầu — 20 row đầu tiên thỏa điều kiện là 20 row cần lấy, không cần sort toàn bộ result set. Planner thấy query có ORDER BY t.due_at và index có due_at ở cuối với đúng thứ tự → eliminate sort node hoàn toàn.

OFFSET 1000 LIMIT 20 và keyset cursor LIMIT 20 đều trả về cùng 20 row (giả sử data không thay đổi). Tại sao keyset nhanh hơn nhiều cho deep page? Giải thích cơ chế index traversal.

▸

OFFSET N LIMIT 20: PostgreSQL phải traverse index từ đầu, đọc và bỏ qua N row đầu tiên, rồi trả về 20 row tiếp theo. Với OFFSET 1000: đọc 1020 row từ index, bỏ 1000, trả về 20. Với OFFSET 100000: đọc 100020 row, bỏ 100000, trả về 20. Chi phí O(N) theo OFFSET — tăng tuyến tính khi user scroll sâu.

Keyset cursor: Điều kiện (due_at, id) > (cursor_due_at, cursor_id) là index range scan bắt đầu từ cursor position. PostgreSQL dùng B-tree index để tìm vị trí đầu tiên thỏa điều kiện trong O(log n), rồi đọc 20 row tiếp theo. Không cần skip row nào — mọi deep page đều tốn đúng O(log n + 20) thay vì O(N). Với 5M row và OFFSET 100000: keyset ~2ms vs OFFSET ~400ms (~200x).

Trade-off: keyset không cho phép nhảy đến trang tùy ý. Phù hợp cho "Next page" / "Infinite scroll" — không phù hợp cho "Jump to page 50 of 100".

Index được tạo với `WHERE status != 'archived'` (partial). Nếu một query không có điều kiện `status` trong WHERE clause, planner có dùng index này không? Giải thích cơ chế matching.

▸

Không — planner sẽ không dùng partial index đó. Partial index chỉ chứa row thỏa điều kiện WHERE status != 'archived' trong index structure. Một query không filter theo status cần kết quả bao gồm cả row archived — nhưng những row đó không tồn tại trong index. Nếu planner dùng index này cho query không có điều kiện status, kết quả sẽ thiếu row archived → sai về correctness.

PostgreSQL kiểm tra predicate implication: query WHERE clause có imply partial index predicate không? WHERE assignee_id = 5 AND status != 'archived' imply status != 'archived' → planner dùng được. WHERE assignee_id = 5 (không có status filter) không imply predicate → planner không dùng, fallback sang full table scan hoặc index khác.

Đây là lý do cần document rõ partial index predicate — query phải include điều kiện matching trong WHERE clause, nếu không index vô dụng cho query đó dù index tồn tại.

Extended statistics `CREATE STATISTICS ... (dependencies, ndistinct) ON status, due_at` cải thiện estimate như thế nào? Cho ví dụ cụ thể về query nào hưởng lợi và query nào không hưởng lợi.

▸

Cơ chế: Mặc định, planner estimate selectivity của status = X AND due_at BETWEEN A AND B bằng cách nhân hai selectivity độc lập: sel(status=X) × sel(due_at range). Nếu hai column tương quan (task todo thường có due_at trong tương lai, task done thường có due_at cũ), phép nhân độc lập cho estimate sai — thường underestimate hoặc overestimate đáng kể.

dependencies statistics thu thập functional dependency degree giữa các column. Planner biết nếu status ảnh hưởng phân bố due_at → điều chỉnh estimate thay vì nhân độc lập.

Hưởng lợi: Query filter cả status và range due_at hẹp — ví dụ status = 'todo' AND due_at < now() (todo quá hạn). Nếu estimate overestimate nhiều, planner chọn Hash Join khi Nested Loop tốt hơn. Extended stats cho estimate chính xác → plan đúng.

Không hưởng lợi nhiều: Query chỉ filter một column (WHERE assignee_id = 5) hoặc hai column không tương quan (ví dụ status và project_id nếu project không ảnh hưởng phân bố status). Extended statistics chỉ có giá trị khi cross-column correlation thực sự tồn tại trong data.

⁂

Module tiếp theo: Module 8 — Advanced query patterns (CTE, recursive, window advanced)

Bài này có giúp bạn hiểu bản chất không?

← Bài trước

Tuning toolbox — 6 cách fix theo hierarchy chi phí thấp → cao

Bài tiếp Advanced query patterns

Subquery, CTE, LATERAL — 3 cách viết cùng một logic