~35 phútTìm kiếm nhanh — Hashing & TreeMiễn phí… lượt xem

Case Study: MySQL InnoDB B+tree index + Redis dict incremental rehash

MySQL và Redis đều giải bài toán 'tìm key nhanh' — nhưng chọn hai cấu trúc dữ liệu hoàn toàn khác nhau. Bài này dive source thật của InnoDB B+tree clustered index và Redis dict incremental rehash để hiểu vì sao mỗi quyết định đúng cho context của nó.

Module 3 đã đi qua hash table (bài 02), self-balancing tree (bài 06), và B+tree (bài 07). Tất cả đều giải cùng một vấn đề cốt lõi: tìm key trong tập dữ liệu lớn nhanh nhất có thể.

Vậy tại sao MySQL InnoDB — database được deploy nhiều nhất thế giới — lại chọn B+tree cho persistent index, trong khi Redis — in-memory key-value store phổ biến nhất thế giới — lại chọn hash table với incremental rehash? Cả hai đều cần tìm key nhanh. Cả hai đều phục vụ hàng triệu request mỗi ngày.

Câu trả lời không nằm ở thuật toán tốt hơn hay tệ hơn — mà ở workload khác nhau: MySQL sống trên disk và cần range query; Redis sống hoàn toàn trong RAM và chỉ cần point query. Workload khác → cấu trúc tối ưu khác.

Bài này dive vào hai system thật: MySQL InnoDB B+tree clustered index và Redis dict 2-table incremental rehash. Đọc source thật, hiểu vì sao mỗi quyết định đúng cho context của nó.

1. Tech profile A — MySQL InnoDB

Project: MySQL InnoDB (storage engine mặc định từ MySQL 5.5+) Version: MySQL 8.0 (current LTS) Source: https://github.com/mysql/mysql-server/tree/8.0/storage/innobase/btr/ License: GPL v2

InnoDB là storage engine mặc định của MySQL từ version 5.5. Trước đó là MyISAM — không có transaction, không có row-level lock, index chỉ là pointer đến heap. InnoDB thay thế MyISAM với ACID transaction, row-level locking, và clustered index — thiết kế quan trọng nhất mà bài này sẽ phân tích.

2. InnoDB B+tree clustered index — không phải pointer thông thường

Bài 07 đã giải thích B+tree: internal node chứa key để route, leaf node chứa data. Nhưng InnoDB làm một điều khác biệt căn bản so với hầu hết implementation khác: clustered index.

2.1 Clustered vs non-clustered

Clustered index (InnoDB primary key index): leaf node chứa toàn bộ row data — không phải pointer đến row, mà là row thật sự. Query theo primary key chỉ cần walk down tree một lần, lấy luôn data tại leaf.

Secondary index (InnoDB non-primary index): leaf node chứa primary key value — không phải row data, không phải physical pointer. Query theo secondary index cần hai bước: (1) walk xuống secondary B+tree để tìm primary key value, (2) walk lại primary B+tree để lấy row. Đây là double lookup.

InnoDB Clustered Index (PRIMARY KEY):

                     [20 | 50 | 80]          <- internal node
                    /    |    |    \
          [10|15|18] [25|30|45] [55|60|75] [85|90|95]  <- leaf nodes
          [row data] [row data] [row data]  [row data]  <- full row here!

InnoDB Secondary Index (e.g. INDEX on email):

                   [[email protected] | [email protected]]      <- internal node
                  /          |          \
        [a@...|b@...]   [c@...|m@...]   [n@...|z@...]   <- leaf nodes
        [pk=5 | pk=12]  [pk=3 | pk=22]  [pk=8 | pk=30]  <- primary keys only

Secondary index lookup:
  Step 1: traverse secondary tree -> find pk=12
  Step 2: traverse primary tree   -> fetch full row at pk=12

Hệ quả quan trọng: order primary key chiến lược. Nếu primary key tăng đơn điệu (AUTO_INCREMENT integer), insert luôn vào cuối leaf node cuối cùng — minimal page split. Nếu primary key random (UUID v4), mỗi insert rơi vào vị trí ngẫu nhiên trong tree — page split xảy ra khắp nơi, fragmentation tích lũy.

2.2 Fanout và height

InnoDB page size mặc định là 16KB. Với integer primary key (8 byte) và page overhead, một internal node chứa được khoảng 1000 key. Fanout 1000 → height 3-4 cho 100 triệu row:

Fanout = 1000 (approximate for int PK, 16KB page)

Height 1: 1,000 records (root is leaf)
Height 2: 1,000 * 1,000 = 1,000,000 records
Height 3: 1,000 * 1,000 * 1,000 = 1,000,000,000 (1 billion) records
Height 4: 1,000^4 = 1 trillion records

For 100M rows:
  log_1000(100,000,000) = log(10^8) / log(10^3) = 8/3 ≈ 2.67
  => Height 3 sufficient. Add 1 for root = height 3 total.
  => 3 disk seeks to fetch any row by primary key.

Ba disk seek trên SSD (mỗi seek khoảng 0.1ms) = 0.3ms. Buffer pool cache internal nodes thường xuyên truy cập — trên thực tế thường chỉ 1-2 seeks sau khi warm cache.

3. InnoDB page structure — B+tree trên disk

InnoDB không đọc/ghi từng byte — đơn vị làm việc là page 16KB. Mỗi page là một node trong B+tree.

InnoDB Page (16KB):
+------------------+
|  Page Header     |  38 bytes: page number, level, prev/next sibling pointer
|  (infimum rec)   |  pseudo-record: lower bound sentinel
|  (supremum rec)  |  pseudo-record: upper bound sentinel
+------------------+
|                  |
|   User Records   |  actual key-value pairs, sorted ascending
|                  |
+------------------+
|   Free Space     |  grows from bottom up
+------------------+
|  Page Directory  |  sparse index: pointer to every 4-8 records
|  (slot array)    |  for binary search within page
+------------------+
|  File Trailer    |  checksum for corruption detection
+------------------+

Page directory là sparse index trong page: thay vì duyệt tuyến tính 500 record trong page, InnoDB dùng page directory để binary search xuống còn 4-8 record rồi scan linear. Tổng: binary search qua pages (disk IO) + binary search trong page (RAM).

Buffer pool cache pages trong RAM. Production setting: innodb_buffer_pool_size = 70% RAM. Database server 64GB RAM → 45GB buffer pool → hàng triệu pages được cache, giảm disk IO xuống gần 0 cho working set.

InnoDB source code liên quan, file btr/btr0sea.cc:

/* Simplified from btr0sea.cc -- InnoDB adaptive hash search */
/* btr_search_get_n_fields: returns number of fields needed  */
/* for the hash search based on the current hash info.        */
ulint
btr_search_get_n_fields(
    ulint   n_fields,
    ulint   n_bytes)
{
    return(n_fields + (n_bytes > 0 ? 1 : 0));
}

4. Production gotcha MySQL

4.1 Composite index — column order matters

INDEX(a, b) là B+tree được sort theo (a, b). Query WHERE a = 1 AND b = 2 dùng được index. Query WHERE b = 2 không dùng được — không có prefix a, không thể dùng leftmost prefix của index.

-- Table: orders(id, customer_id, status, created_at)
-- Index: INDEX idx_cust_status (customer_id, status)

-- Uses index (both columns, leftmost prefix)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'pending';

-- Uses index (leftmost prefix only)
SELECT * FROM orders WHERE customer_id = 42;

-- Does NOT use index (skips leftmost column)
SELECT * FROM orders WHERE status = 'pending';

Rule: đặt column selectivity cao (nhiều giá trị distinct) làm leftmost prefix để index pruning hiệu quả nhất.

4.2 Low cardinality column — index vô dụng

INDEX(gender) với 3 giá trị distinct (M, F, X) trên bảng 10 triệu row: mỗi bucket index sẽ trỏ đến khoảng 3 triệu row. MySQL query optimizer thấy rằng scan full table còn rẻ hơn dùng index vì locality của heap traversal tốt hơn random B+tree lookup khi fetch nhiều row.

4.3 InnoDB Adaptive Hash Index

InnoDB tự động xây dựng hash index trên các page được truy cập thường xuyên. Khi pattern truy cập đủ đều (cùng key liên tục), InnoDB cache result của B+tree traversal vào hash table — biến point query O(log n) thành O(1).

Boost điển hình: 10-30% cho workload point query nặng. Disable bằng innodb_adaptive_hash_index = OFF nếu workload đa dạng (adaptive hash index overhead là negative khi pattern không đủ uniform).

4.4 UUID v4 PK và page split storm

UUID v4 là random 128-bit. Insert UUID v4 làm primary key → mỗi insert rơi vào vị trí ngẫu nhiên trong B+tree → page split ở khắp nơi.

AUTO_INCREMENT insert (sequential):
  [1][2][3][4][5]...[N] -> always appends to last leaf -> minimal split

UUID v4 insert (random):
  [7f3b...][2a1c...][8e4d...] -> random position -> frequent mid-page split
  Each split: InnoDB allocates new page, moves half records -> expensive
  After 1M inserts: tree fragmented, many half-full pages, wasted space

Solution: UUID v7 (RFC 9562, timestamp prefix monotonic) hoặc AUTO_INCREMENT BIGINT. UUID v7 bắt đầu bằng millisecond timestamp → insert gần-sequential → split ít hơn đáng kể.

5. Tech profile B — Redis dict

Project: Redis (in-memory data structure store) Version: Redis 7.0+ Source: https://github.com/redis/redis/blob/7.4/src/dict.c License: RSALv2/SSPL (Redis 7.4+), BSD-3 (phiên bản cũ hơn)

Redis là in-memory key-value store single-threaded (main command loop). Mọi GET, SET, DEL đều chạy trên một thread duy nhất — đây là quyết định thiết kế có chủ ý để tránh lock overhead và đảm bảo atomicity. Hệ quả: bất kỳ operation nào block thread chính đều block toàn bộ Redis.

6. Redis dict structure — 2-table design

Redis dict không phải một hash table đơn giản. Nó có hai hash table ht[0] và ht[1] để phục vụ incremental rehash:

/* Simplified from Redis src/dict.h */
typedef struct dict {
    dictType *type;
    dictEntry **ht_table[2];   /* two hash tables */
    unsigned long ht_used[2];  /* number of used entries in each table */
    long rehashidx;            /* rehashing index; -1 if not rehashing */
    /* ... */
} dict;

Steady state: chỉ ht[0] active, ht[1] là NULL. rehashidx = -1.

Khi cần resize (load factor vượt ngưỡng hoặc xuống quá thấp): ht[1] được alloc với capacity mới (thường gấp đôi), rehashidx = 0. Bắt đầu incremental rehash.

rehashidx là progress counter: chỉ đến bucket tiếp theo cần migrate từ ht[0] sang ht[1]. Không phải migrate một lần — migrate dần dần, mỗi lần một vài bucket.

Redis dict -- steady state:

  ht[0]: [bucket_0] -> [entry: "user:1" -> "Alice"]
         [bucket_1] -> [entry: "product:5" -> "Laptop"] -> [entry: "product:9" -> "Phone"]
         [bucket_2] -> NULL
         ...
  ht[1]: NULL
  rehashidx: -1

Redis dict -- during rehash (ht[0] size 4 -> ht[1] size 8):

  ht[0]: [bucket_0] -> NULL (migrated)
         [bucket_1] -> NULL (migrated)
         [bucket_2] -> [entry: "foo" -> "bar"]   <- not yet migrated
         [bucket_3] -> [entry: "baz" -> "qux"]   <- not yet migrated
  ht[1]: [bucket_0] -> [entry: "user:1" -> "Alice"]   <- migrated
         [bucket_1] -> [entry: "product:5" -> "Laptop"]
         ...8 buckets total...
  rehashidx: 2   <- next bucket to migrate

7. Vì sao Redis cần incremental rehash

Redis single-threaded command loop. Nếu resize hash table thông thường (blocking rehash):

Time     Event
t=0      SET command arrives
t=0.001  resize triggered: alloc new table, start migrating 1M entries
t=0.001  ... migrating ... migrating ... (100-500ms blocked)
t=0.500  all 1M entries migrated, resize complete
t=0.500  next command processed (499ms late)

499ms stall trên một Redis instance production:

Client timeout (thường set 100-200ms) → error cascade
Replication lag → replica out of sync
Monitoring alert spam
Toàn bộ throughput về 0 trong khoảng thời gian đó

Solution: incremental rehash. Chia 1 lần migrate 1M entry thành N lần migrate 10 bucket. Mỗi command thực hiện thêm một chút migration — latency mỗi command tăng từ khoảng 1µs lên khoảng 2µs, nhưng max latency vẫn dưới 1ms.

/* Simplified from Redis src/dict.c -- dictRehash() */
/* Performs n steps of incremental rehashing.        */
/* Returns 1 if rehashing still in progress, 0 done. */
int dictRehash(dict *d, int n) {
    int empty_visits = n * 10; /* max empty buckets to visit */

    if (!dictIsRehashing(d)) return 0;

    while (n-- && d->ht_used[0] != 0) {
        dictEntry *de, *nextde;

        /* skip empty buckets, but limit iterations */
        while (d->ht_table[0][d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }

        de = d->ht_table[0][d->rehashidx];
        /* migrate all entries in this bucket */
        while (de) {
            nextde = de->next;
            /* rehash key for new table size */
            unsigned int h = dictHashKey(d, de->key) & DICTHT_SIZE_MASK(d->ht_size_exp[1]);
            de->next = d->ht_table[1][h];
            d->ht_table[1][h] = de;
            d->ht_used[0]--;
            d->ht_used[1]++;
            de = nextde;
        }
        d->ht_table[0][d->rehashidx] = NULL;
        d->rehashidx++;
    }

    /* check if rehash complete */
    if (d->ht_used[0] == 0) {
        /* swap: ht[1] becomes ht[0], reset ht[1] */
        zfree(d->ht_table[0]);
        d->ht_table[0] = d->ht_table[1];
        d->ht_used[0] = d->ht_used[1];
        d->ht_size_exp[0] = d->ht_size_exp[1];
        d->ht_table[1] = NULL;
        d->ht_used[1] = 0;
        d->rehashidx = -1;
        return 0;
    }
    return 1;
}

Lookup trong rehash phase phải check cả 2 table: dict GET tìm trong ht[0] trước — nếu không thấy, tìm tiếp trong ht[1]. Điều này đảm bảo không bỏ sót key đã migrate sang ht[1] lẫn key chưa migrate còn ở ht[0].

/* GET in rehash phase: check both tables */
dictEntry *dictFind(dict *d, const void *key) {
    if (dictSize(d) == 0) return NULL;
    if (dictIsRehashing(d)) _dictRehashStep(d); /* do 1 step on each access */

    for (int table = 0; table <= 1; table++) {
        unsigned int h = dictHashKey(d, key) & DICTHT_SIZE_MASK(d->ht_size_exp[table]);
        dictEntry *he = d->ht_table[table][h];
        while (he) {
            if (dictCompareKeys(d, key, he->key))
                return he;
            he = he->next;
        }
        if (!dictIsRehashing(d)) break; /* only ht[0] if not rehashing */
    }
    return NULL;
}

8. Bảng so sánh quyết định MySQL vs Redis

Aspect	MySQL InnoDB	Redis dict
Storage	Disk + buffer pool (RAM cache)	RAM only
Workload	Mixed: point query + range query	Point query only (no range)
DS chọn	B+tree (16KB page, leaf chain)	Hash table + chaining
Lý do chọn	Range scan qua leaf linked list; disk fanout minimize IO	O(1) point query; no range needed
Resize cost concern	Thấp (ALTER TABLE online hoặc offline build)	Cao (single-threaded: blocking = full stall)
Resize strategy	One-shot rebuild	Incremental rehash (N steps per op)
Key ordering	Sorted (B+tree property)	Unordered (hash table)
Memory model	On-disk, page-buffered	In-memory, pointer-direct

Insight cốt lõi: cùng problem "tìm key" nhưng constraints hoàn toàn khác. InnoDB sống trên disk → minimize disk IO là ưu tiên → B+tree fanout 1000 + leaf chain cho range scan. Redis single-threaded in-memory → blocking resize là fatal → hash table O(1) + incremental rehash tránh stall.

9. Production gotcha Redis

9.1 Big key và rehash stall trên RDB save

Redis HGETALL trên Hash key với 10 triệu field — Redis phải iterate toàn bộ internal dict của key đó. Nếu dict đang rehash, traverse thêm phức tạp hơn. RDB fork snapshot đọc toàn bộ memory, các big key làm thời gian fork dài hơn.

Monitor key sizes định kỳ:

# scan for large keys (Redis 4.0+ with OBJECT FREQ/ENCODING)
redis-cli --bigkeys

# or manually check specific key
redis-cli OBJECT ENCODING mykey
redis-cli HLEN myhashkey  # number of fields in Hash

Nếu Hash key vượt vài triệu field, chia thành nhiều key nhỏ: user:1000:profile, user:1000:settings, ... thay vì một key user:1000 chứa tất cả.

9.2 SLOWLOG để catch slow operation

Redis SLOWLOG ghi lại các command tốn hơn ngưỡng (default 10ms):

redis-cli CONFIG SET slowlog-log-slower-than 10000  # 10ms in microseconds
redis-cli SLOWLOG GET 10  # last 10 slow entries

Các nguyên nhân phổ biến của slow op:

Rehash phase trên dict lớn: dictRehashMilliseconds() chạy background nhưng mỗi command vẫn thực hiện 1 rehash step — dict với triệu entry, một step migrate bucket lớn tốn vài ms.
Big key operation: SMEMBERS trên Set 100k member, LRANGE key 0 -1 trên List 500k element.
KEYS * pattern scan: O(N) trên toàn database — không bao giờ dùng trong production, thay bằng SCAN cursor.

9.3 maxmemory-policy và cross-link Module 2

Redis maxmemory-policy quyết định eviction khi RAM đầy:

allkeys-lru: evict key ít dùng nhất gần đây nhất (LRU)
allkeys-lfu: evict key ít dùng nhất về tần suất (LFU)
volatile-ttl: evict key có TTL ngắn nhất

LRU implementation của Redis dùng một doubly linked list và dict. Cross-link: Module 2 bài 07 LRU mini-challenge giải thích cơ chế đằng sau allkeys-lru.

10. Applied to tech

PostgreSQL — heap + B-tree, khác với InnoDB clustered

PostgreSQL dùng heap storage: rows được lưu trong heap files, không phải trong B-tree leaf. B-tree index của PostgreSQL lưu heap tuple ID (TID) — physical pointer (page_number, slot) trỏ đến row trong heap.

Hệ quả: tất cả index PostgreSQL đều là non-clustered về cơ bản. Query qua bất kỳ index nào cũng cần 2 random IO: (1) traverse B-tree → TID, (2) heap fetch. So với InnoDB clustered primary key chỉ cần 1 walk.

PostgreSQL 12+ thêm INCLUDE index (covering index): lưu thêm column vào leaf node để tránh heap fetch:

-- PostgreSQL covering index: no heap fetch for queries using id + email
CREATE INDEX idx_user_cover ON users (id) INCLUDE (email, name);
-- Query: SELECT email, name FROM users WHERE id = 42
-- -> traverse B-tree -> find leaf with email + name -> done, no heap fetch

MongoDB / WiredTiger — document B-tree

MongoDB dùng WiredTiger storage engine, cũng dùng B-tree. Document-level locking thay vì row-level. _id field mặc định là clustered index (ObjectID timestamp-ordered → sequential insert). Secondary index lưu _id reference giống InnoDB lưu primary key.

DynamoDB — partition key + sort key B-tree

DynamoDB dùng hash của partition key để route đến partition server. Bên trong mỗi partition: B-tree được sort theo sort key cho range query. Kết hợp: O(1) partition lookup (hash) + O(log n) range query trong partition (B-tree). Đây là hybrid tương tự cách InnoDB secondary index hoạt động nhưng ở distributed scale.

etcd / Consul — B-tree trên Raft

etcd lưu data trong bbolt B-tree (key-value sorted). Mọi write đi qua Raft consensus trước khi commit vào B-tree. B-tree sorted property cho phép RANGE query (watch prefix, list keys) — thiết yếu cho Kubernetes service discovery. Module 9 phân tích sâu distributed consensus layer.

11. Deep Dive tài liệu gốc

📚 Deep Dive — nguồn tham khảo

MySQL InnoDB:

MySQL 8.0 Reference — Chapter 15: InnoDB Storage Engine — clustered index, secondary index, buffer pool, adaptive hash index. Section 15.6.2.1 "Clustered and Secondary Indexes" là must-read.
MySQL High Performance, 4th Edition — Baron Schwartz et al. Chapter 5 (Indexing for High Performance): covering index, index selectivity, UUID key pitfalls.
InnoDB source — btr0btr.cc — B+tree insert, split, merge logic. Hàm btr_page_split_and_insert cho thấy chi tiết page split.

Redis:

Redis source — src/dict.c — toàn bộ incremental rehash logic. dictRehash(), dictRehashMilliseconds(), dictFind() trong rehash phase.
Redis internals documentation — memory optimization, bigkeys, latency monitoring.

Sách nền tảng:

Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3 — Storage and Retrieval: B-tree vs LSM-tree, write amplification, read amplification. Perspective rộng nhất về storage engine design.

Cross-link trong khóa học:

Bài 02 Module 3 (HashMap internals) — hash table chaining, resize 2x, load factor 0.75.
Bài 07 Module 3 (B-tree B+tree) — fanout, leaf chain, disk seek analysis.
Bài 07 Module 2 (LRU mini-challenge) — Redis allkeys-lru eviction mechanism.
Khóa SQL — Module Index — EXPLAIN ANALYZE, index selection, covering index trong PostgreSQL/MySQL.

12. Tóm tắt

InnoDB clustered index: primary key leaf chứa toàn bộ row — 1 walk để fetch row. Secondary index leaf chứa primary key → cần double lookup (2 tree traversals). Chọn primary key sequential (AUTO_INCREMENT hoặc UUID v7) để tránh page split storm.
InnoDB page 16KB: fanout khoảng 1000 với integer key → height 3-4 cho 100 triệu row → 3-4 disk seeks per query. Buffer pool cache pages trong RAM, hot working set về gần 0 disk IO.
Redis dict 2-table: ht[0] active + ht[1] resize target. rehashidx track migration progress. Mỗi command thực hiện thêm 1 migration step — incremental, không blocking.
Incremental rehash: Redis single-threaded → blocking resize 1M entry tốn 100-500ms → toàn bộ Redis stall. Incremental chia nhỏ thành N bước, max latency per command vẫn dưới 1ms.
Lookup trong rehash phase: check cả ht[0] và ht[1] để không bỏ sót key chưa hoặc đã migrate.
MySQL chọn B+tree vì: disk workload + range query cần leaf chain. Redis chọn hash table vì: in-memory + point query only + single-threaded cần incremental resize.
Production pitfall MySQL: UUID v4 PK gây page split, composite index column order, low cardinality index vô dụng.
Production pitfall Redis: big key làm rehash/RDB stall, KEYS * O(N) full scan, SLOWLOG để detect slow op.

13. Tự kiểm tra

Tự kiểm tra

InnoDB clustered index vs PostgreSQL heap + B-tree: khi query qua secondary index, số random IO khác nhau thế nào? Giải thích cơ chế.

▸

InnoDB secondary index: (1) traverse secondary B+tree → tìm primary key value, (2) traverse primary B+tree → fetch full row tại leaf. Hai tree traversal, nhưng primary B+tree leaf chứa luôn row data — tổng 2 tree walks.

PostgreSQL secondary index: (1) traverse B-tree → tìm heap TID (page number + slot), (2) random heap fetch tại TID. Tổng 2 IO. Tuy nhiên heap fetch là random IO vào heap file — khác với InnoDB primary B+tree walk theo cấu trúc có thứ tự. Khi fetch nhiều row cùng lúc, InnoDB clustered có lợi thế locality hơn PostgreSQL heap random access. PostgreSQL giải quyết bằng INCLUDE index để avoid heap fetch khi possible.

UUID v4 làm primary key gây page split storm là vì sao? UUID v7 giải quyết thế nào?

▸

UUID v4 là 128-bit random hoàn toàn — không có thứ tự tự nhiên. Mỗi INSERT với UUID v4 primary key rơi vào vị trí ngẫu nhiên trong B+tree. Khi leaf page đầy, InnoDB phải split page: alloc page mới, chuyển nửa record sang, update internal node pointer — tốn nhiều disk write. Với insert rate cao, split xảy ra liên tục ở mọi nơi trong tree → fragmentation, nhiều half-full page, index bloat, performance giảm.

UUID v7 (RFC 9562) bắt đầu bằng 48-bit millisecond timestamp, tiếp theo là version + random bits. Insert UUID v7 gần-sequential theo thời gian → luôn append vào cuối leaf chain → minimal split, tương tự AUTO_INCREMENT nhưng globally unique và không cần central counter.

Redis incremental rehash giải quyết stall problem thế nào? Tại sao single-threaded constraint của Redis biến resize thành vấn đề nghiêm trọng?

▸

Redis main command loop chạy trên một thread duy nhất. Mọi command — GET, SET, DEL — chạy tuần tự không có parallelism. Nếu resize hash table blocking (migrate 1M entry một lần), thread chính bị chiếm trong 100-500ms — không có command nào được xử lý trong thời gian đó, toàn bộ throughput về 0, client timeout.

Incremental rehash chia migrate thành N bước nhỏ: mỗi command thực hiện thêm 1 step (migrate vài bucket). Thay vì 1 lần tốn 500ms, có 100.000 lần tốn thêm khoảng 1µs mỗi lần — max latency per command vẫn dưới 1ms, throughput không bị gián đoạn. Hai table ht[0] và ht[1] coexist trong suốt thời gian rehash; lookup check cả hai để không bỏ sót key.

InnoDB page size 16KB, fanout khoảng 1000 với integer key. Tính height cần thiết cho bảng 100 triệu row. Nếu buffer pool cache toàn bộ internal node, còn bao nhiêu disk IO để fetch 1 row?

▸

Fanout 1000 nghĩa là mỗi internal node có 1000 children. Số row tối đa ở height h (leaf level là level 1):

Height 1 (chỉ root, root là leaf): 1000 records
Height 2 (root + leaf level): 1000 * 1000 = 1 triệu records
Height 3: 1000^3 = 1 tỷ records

100 triệu nằm giữa 1 triệu và 1 tỷ → height 3 đủ (root + 1 internal level + leaf level = 3 tầng, 2 transitions). Để fetch 1 row: đi qua root → internal node → leaf = 3 node.

Nếu buffer pool cache toàn bộ internal node (root + level 1 internal nodes): root và internal node đọc từ RAM, chỉ leaf cần disk IO. 1 disk seek để đọc leaf page. Trên SSD 0.1ms → khoảng 10.000 primary key lookup/giây per disk với 0 cache miss trên non-leaf.

Redis SLOWLOG ghi lại slow command. Kể 3 nguyên nhân phổ biến nhất gây slow op trên Redis production.

▸

Big key operation: SMEMBERS trên Set triệu member, LRANGE key 0 -1 trên List hàng trăm nghìn element, HGETALL trên Hash lớn — O(N) với N lớn chiếm thread chính hàng chục ms.
Rehash phase trên dict lớn: khi Redis dict đang incremental rehash, mỗi command thực hiện thêm 1 rehash step. Bucket chứa nhiều entry (degenerate chain do hash collision) làm step đó tốn nhiều thời gian hơn thông thường.
KEYS * hoặc SMEMBERS pattern scan: KEYS * iterate toàn bộ keyspace O(N total keys) — trên database 10 triệu key có thể tốn hàng trăm ms. Dùng SCAN cursor-based thay thế để chia nhỏ ra nhiều call.

Cho workload '100k point query/giây, 0 range query' trên một service mới. Chọn MySQL B+tree hay hash-table-backed store như Redis? Lý do kỹ thuật?

▸

Hash-table-backed store (Redis hoặc tương tự) phù hợp hơn cho workload này.

Lý do kỹ thuật:

O(1) point query: hash table trả về kết quả trong 1-2 hash + bucket lookup. B+tree cần O(log n) disk IO — với 100M record, 3 seeks × 0.1ms/seek = 0.3ms minimum, giới hạn throughput khoảng 3000 query/giây per disk. Hash table in-memory: khoảng 1µs per lookup → 1 triệu query/giây trên 1 thread.
Range query = 0: ưu điểm duy nhất của B+tree over hash table là sorted order cho range scan và prefix query. Nếu không có range query, B+tree overhead (tree traversal, leaf page fetch, buffer pool management) là pure cost không có benefit.
Tradeoff cần cân nhắc: Redis is in-memory → data size bị giới hạn bởi RAM và expensive. MySQL B+tree phù hợp hơn khi data lớn hơn RAM, cần persistence + ACID transaction, hoặc có mixed workload (một số range query dù ít).

⁂

Bài tiếp theo: Sort landscape — comparison vs non-comparison

Bài này có giúp bạn hiểu bản chất không?

← Bài trước

Mini-challenge — Autocomplete top-K với Trie + frequency