
JVM diagnostic CLI tools: jps, jstack, jmap, jstat, jcmd


Diagnose a production JVM without restarting it: jps, jstack (deadlocks), jmap (heap dumps), jstat, jcmd, and a workflow from symptom to root cause.

TL;DR: The JDK ships with a set of CLI tools for diagnosing a running JVM without a restart: jps (list processes), jstack (thread dumps for deadlocks and stuck threads), jmap (heap histogram and heap dump), jstat (real-time GC metrics, where a steadily climbing Old % is a leak red flag), and jcmd (a Swiss-army knife that replaces most of the older tools and adds Native Memory Tracking). Open heap dumps in Eclipse MAT and start with "Leak Suspects" and "Path to GC Roots". Golden rules: match the tool to the symptom (CPU 100% → jstack; leak → jmap + MAT; constant GC → jstat), take three thread dumps to confirm a stuck thread, and always capture diagnostics before restarting. Continuous profiling (JFR, async-profiler) is covered in lesson 08.

3 a.m. The pager fires: "API p99 latency 5s (SLA 200ms), CPU pegged at 100%". You SSH into the prod server. The app is still running and logs no errors. Heap usage is stable. So what is going on?

The answer lies in the JVM tools, the toolkit bundled with the JDK that lets you diagnose a running JVM without restarting it or deploying new code. Within a few minutes you can:

List the threads that are blocked or running → find the thread stuck in an infinite loop or a deadlock.
Snapshot the heap → find leaking objects.
Read GC metrics in real time → confirm a leak trend.

This skill separates junior from senior Java engineers. A junior guesses, restarts, and hopes. A senior uses the tools and finds the root cause in 30 minutes.

This lesson walks through the CLI tools: jps (list processes), jstack (thread dump), jmap (heap dump), jstat (GC metrics), jcmd (Swiss-army knife), and MAT (heap dump analysis). It ends with typical debug workflows, from symptom to root cause. Continuous profilers (JFR, async-profiler) are the subject of lesson 08.

1. Analogy: the electrician's toolkit

An electrician does not "listen" to a switchboard and guess what failed; they carry measuring instruments. Each tool fits a context, and picking the wrong one wastes time (for example, jstack tells you nothing about a memory leak; that needs jmap).

Everyday tool	JVM tool	What it measures
Multimeter (voltage/current)	`jps`, `jstat`	Current readings
Thermal camera	`jstack`	Which threads are hot or cold, who waits on whom
Opening the cabinet to inspect parts	`jmap` + MAT	Individual objects in the heap
Universal multimeter	`jcmd`	Almost everything
Data logger / oscilloscope	JFR / async-profiler (lesson 08)	Continuous recording, analyzed later

💡 How to remember

Symptom → tool: app hang → jstack thread dump; memory leak → jmap + MAT; constant GC → jstat; listing processes → jps; CPU 100% → jstack + top -H (go deeper with a flame graph, lesson 08). jcmd replaces most of the older tools, so prefer it.

2. `jps`: list JVM processes

First you need the PID of the running JVM.

jps -l
# 12345 com.myapp.MainApplication
# 23456 org.gradle.launcher.daemon.bootstrap.GradleDaemon
# 34567 sun.tools.jps.Jps

jps -v
# 12345 MainApplication -Xms2g -Xmx4g -XX:+UseG1GC ...

Flags:

-l: full main class name or JAR path.
-v: JVM arguments.
-m: arguments passed to main.

jps only sees JVMs owned by the same user. Use sudo to see JVMs owned by other users.

Docker containers: jps run outside the container does not see the JVM inside it. Run docker exec <container> jps instead.
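
To script the later steps you usually want the PID in a variable. A minimal sketch, assuming the main class from the listing above; pid_for_main is a hypothetical helper name, and the sample text stands in for real jps -l output so the snippet runs without a JVM:

```shell
# Sketch: resolve a PID from `jps -l` output by exact match on the
# main class (or JAR path) column. Reads the jps output on stdin.
# pid_for_main is a hypothetical helper name, not part of the JDK.
pid_for_main() {
  awk -v target="$1" '$2 == target { print $1 }'
}

# Sample `jps -l` output, copied from the listing above.
sample_jps='12345 com.myapp.MainApplication
23456 org.gradle.launcher.daemon.bootstrap.GradleDaemon
34567 sun.tools.jps.Jps'

printf '%s\n' "$sample_jps" | pid_for_main com.myapp.MainApplication
```

Against a live system this would be jps -l | pid_for_main com.myapp.MainApplication, or the same pipeline behind docker exec <container> for a containerized JVM.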

3. `jstack`: thread dump

Dumps the state of every thread. It answers "is the app hung?", "is one thread burning CPU?", and "is there a deadlock?".

jstack 12345 > thread-dump.txt

Sample output:

"http-nio-8080-exec-3" #45 daemon prio=5 cpu=12345.67ms nid=0x4567 waiting on condition
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000076ab12345> (a ...ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
        at com.myapp.Service.process(Service.java:42)
        at com.myapp.Controller.handle(Controller.java:25)

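How to read it: the thread name plus #45 daemon prio=5 (thread ID, daemon flag, priority); cpu= is the total CPU time consumed so far; nid= is the native (OS) thread ID in hex; Thread.State is the current state; the stack trace runs from the top frame downward; parking to wait for <0x...> names the synchronizer object the thread is parked on (here a ConditionObject).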

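Key states

RUNNABLE: running or ready to run. Many RUNNABLE threads usually point to CPU-bound work, though a thread blocked in native I/O (such as a socket read) also reports RUNNABLE.
BLOCKED: waiting to enter a monitor (synchronized). Many BLOCKED threads on the same lock address → contention.
WAITING / TIMED_WAITING: waiting for a notify or a condition. Usually idle (pool workers waiting for tasks).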
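
The "many BLOCKED threads on the same lock address" check can be done mechanically. A minimal sketch, assuming the usual jstack line format "- waiting to lock <0x...>"; lock_waiters is a hypothetical helper name, and the dump text is made-up sample data:

```shell
# Sketch: count how many threads wait on each monitor address in a
# thread dump read from stdin, busiest lock first.
# lock_waiters is a hypothetical helper name.
lock_waiters() {
  grep -o 'waiting to lock <0x[0-9a-f]*>' | sort | uniq -c | sort -rn
}

# Made-up sample dump: two threads queue on one monitor, one on another.
sample_dump='"exec-1" #41 daemon prio=5 nid=0x4561 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.myapp.Service.process(Service.java:42)
        - waiting to lock <0x000000076ab12345> (a java.lang.Object)

"exec-2" #42 daemon prio=5 nid=0x4562 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.myapp.Service.process(Service.java:42)
        - waiting to lock <0x000000076ab12345> (a java.lang.Object)

"exec-3" #43 daemon prio=5 nid=0x4563 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.myapp.Report.build(Report.java:17)
        - waiting to lock <0x000000076bc20000> (a java.lang.Object)'

printf '%s\n' "$sample_dump" | lock_waiters
```

With a real dump the input would be jstack <pid> | lock_waiters. A single address with a high count is the lock worth investigating; search the dump for "locked <that address>" to find the thread that holds it.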

Detect deadlock

jstack detects deadlocks automatically. Output:

Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x00007f0c... (object 0x000000076ab1),
  which is held by "Thread-2"
"Thread-2":
  waiting to lock monitor 0x00007f0c... (object 0x000000076bc2),
  which is held by "Thread-1"

The stack trace of each thread follows, showing exactly which thread, which lock, and which line of code. Fix: acquire locks in a consistent order, or use tryLock with a timeout.

High CPU on one thread

# Find the thread that burns the most CPU
top -H -p 12345
# PID  USER   ...   %CPU
# 12399 luatnq  ...  98.7
# 12400 luatnq  ...   2.3

# Convert the TID to hex to match the nid field in jstack
printf '%x\n' 12399
# 306f

# Find the thread with nid=0x306f in the dump
jstack 12345 | grep -A 30 'nid=0x306f'

That thread's stack shows which method is looping or busy-spinning.
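
The convert-and-grep steps can be wrapped so the whole stanza of the hot thread is printed rather than a fixed 30 lines. A minimal sketch, assuming stanzas are separated by blank lines as in plain jstack output; tid_to_nid and thread_stanza are hypothetical helper names, and the dump (including com.myapp.Parser) is made-up sample data:

```shell
# Sketch: turn a decimal TID from `top -H` into the nid=0x... token
# that jstack prints. tid_to_nid is a hypothetical helper name.
tid_to_nid() {
  printf 'nid=0x%x\n' "$1"
}

# Print the stanza of the thread whose header carries the given nid
# token. Reads a dump on stdin; awk paragraph mode (RS = "") treats
# each blank-line-separated stanza as one record. The trailing space
# in the match avoids hitting a longer nid with the same prefix.
thread_stanza() {
  awk -v nid="$1" 'BEGIN { RS = "" } index($0, nid " ")'
}

# Made-up sample dump with two threads.
sample_dump='"worker-1" #45 daemon prio=5 cpu=98123.45ms nid=0x306f runnable
   java.lang.Thread.State: RUNNABLE
        at com.myapp.Parser.scan(Parser.java:88)

"worker-2" #46 daemon prio=5 cpu=12.30ms nid=0x3070 waiting on condition
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(Native Method)'

printf '%s\n' "$sample_dump" | thread_stanza "$(tid_to_nid 12399)"
```

On a live system the pipeline would be jstack 12345 | thread_stanza "$(tid_to_nid 12399)".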

Triple dump pattern

A single dump that looks like a hang may only be a snapshot of a thread that paused briefly, which is not a bug. To confirm:

for i in 1 2 3; do jstack 12345 > dump-$i.txt; sleep 5; done
# Compare dump-1, dump-2, dump-3
# Thread still at the same stack frame in all three dumps (about 10s) -> stuck

Real bug: the thread sits at the same stack frame in all three dumps. If the dumps differ, the thread is making progress and is not stuck.
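
Comparing three dumps by eye is error-prone, so the check can be scripted for one thread at a time. A minimal sketch, assuming the thread is identified by its nid token and that "same top frame in every dump" is a good enough proxy for stuck; top_frame and same_frame_across are hypothetical helper names, and the dump files are generated sample data:

```shell
# Sketch: print the top stack frame of one thread (matched by nid token)
# from a dump on stdin. top_frame is a hypothetical helper name.
top_frame() {
  awk -v nid="$1" '
    found && $1 == "at" { sub(/^[ \t]*at /, ""); print; exit }  # first frame after the header
    found && NF == 0    { exit }                                # stanza ended without a frame
    index($0, nid " ")  { found = 1 }                           # header line of the thread
  '
}

# Report whether the thread shows the same top frame in every dump file.
same_frame_across() {
  sfa_nid=$1; shift
  sfa_first=$(top_frame "$sfa_nid" < "$1")
  [ -n "$sfa_first" ] || { echo "thread not found"; return 1; }
  for sfa_dump in "$@"; do
    [ "$(top_frame "$sfa_nid" < "$sfa_dump")" = "$sfa_first" ] || { echo "moving"; return 0; }
  done
  echo "same frame in all dumps: $sfa_first"
}

# Generated sample data: three identical dumps of one BLOCKED thread.
tmp=$(mktemp -d)
for i in 1 2 3; do
  cat > "$tmp/dump-$i.txt" <<'EOF'
"exec-3" #45 daemon prio=5 nid=0x306f waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.myapp.Service.process(Service.java:42)
        - waiting to lock <0x000000076ab12345> (a java.lang.Object)
EOF
done

result=$(same_frame_across nid=0x306f "$tmp"/dump-*.txt)
echo "$result"
rm -rf "$tmp"
```

A busy loop can move between lines of the same method, so a "moving" result for a RUNNABLE thread that keeps returning to one method still deserves a manual look.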

4. `jmap`: heap dump and stats

Heap histogram (live object count + size)

jmap -histo:live 12345 | head -20

Output:

 num     #instances         #bytes  class name
----------------------------------------------
   1:        500000      40000000  com.myapp.Order
   2:        500000      32000000  java.lang.String
   3:       1000000      16000000  java.util.HashMap$Node
   4:        500000      12000000  [B (byte array)
   ...

Top classes by size. To find a leak: an unexpected class with a high instance count is a candidate.

:live forces a full GC first, so only live objects are counted. Without it you also see objects awaiting collection, which adds noise.
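
One histogram shows what is large, not what is growing. Comparing two histograms taken a few minutes apart narrows the candidates. A minimal sketch, assuming the four-column jmap -histo layout shown above; histo_growth is a hypothetical helper name, and both histogram files are made-up sample data:

```shell
# Sketch: compare two `jmap -histo:live` outputs and list the classes
# whose byte total grew, largest growth first.
# histo_growth is a hypothetical helper name.
histo_growth() {
  awk '
    # Data rows look like: "   1:   500000   40000000  com.myapp.Order"
    $1 ~ /^[0-9]+:$/ {
      if (FNR == NR) before[$4] = $3                 # first file: bytes per class
      else if ($3 - before[$4] > 0)                  # second file: report growth
        print $3 - before[$4], $4
    }
  ' "$1" "$2" | sort -rn
}

# Made-up sample histograms, as if taken five minutes apart.
tmp=$(mktemp -d)
cat > "$tmp/before.txt" <<'EOF'
 num     #instances         #bytes  class name
----------------------------------------------
   1:        500000      40000000  com.myapp.Order
   2:        500000      32000000  java.lang.String
EOF
cat > "$tmp/after.txt" <<'EOF'
 num     #instances         #bytes  class name
----------------------------------------------
   1:        650000      52000000  com.myapp.Order
   2:        500100      32006400  java.lang.String
EOF

growth=$(histo_growth "$tmp/before.txt" "$tmp/after.txt")
echo "$growth"
rm -rf "$tmp"
```

Each :live histogram forces a full GC, so on a busy production JVM two samples are usually enough.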

Heap dump (full snapshot)

jmap -dump:live,format=b,file=heap.hprof 12345

Dumps a single .hprof file containing every object in the heap. Its size is roughly the heap size (several GB for a typical production service). The dump takes at least several seconds, and the app is paused while it runs.

-XX:+HeapDumpOnOutOfMemoryError configures the JVM ahead of time to dump automatically on OOM.

Open the file with:

Eclipse MAT (Memory Analyzer Tool): desktop application with the full feature set.
VisualVM: bundled with JDK 6 to 8 as jvisualvm, a separate download since JDK 9.
JDK Mission Control: production-oriented tool that includes a heap analyzer.

Tip: take two dumps 5 minutes apart and compare them in MAT. A class that grows a lot is a leak candidate.

5. `jstat`: real-time metrics

Tracks GC, class loading, and JIT activity in real time.

GC overview

jstat -gc 12345 1000
# Sample every 1000 ms

S0C    S1C    S0U    S1U    EC       EU       OC        OU       MC      MU      YGC    YGCT     FGC   FGCT     GCT
8192.0 8192.0 0.0    7234.5 65536.0  45123.2  131072.0  78901.2  20480.0 19234.5 145    1.234    3     0.567    1.801

Columns: *C = capacity (KB) and *U = used (KB) for Survivor 0/1 (S0, S1), Eden (E), Old (O), and Metaspace (M). YGC/YGCT = young GC count and total time; FGC/FGCT = full GC count and total time; GCT = total GC time.

Watch OU grow from sample to sample → the old generation is growing → a leak or a high promotion rate.
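
Old-generation utilization is simply OU / OC. A minimal sketch that derives it from jstat -gc output, looking the columns up by header name because the column set differs between JDK versions; old_util is a hypothetical helper name, and the sample is the row shown above:

```shell
# Sketch: derive Old-gen utilization (%) from `jstat -gc` output on stdin.
# old_util is a hypothetical helper name.
old_util() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header name -> column
    col["OC"] && col["OU"] && $(col["OC"]) > 0 {
      printf "%.2f\n", 100 * $(col["OU"]) / $(col["OC"])
    }
  '
}

# Sample `jstat -gc` output, copied from the listing above.
sample_gc='S0C    S1C    S0U    S1U    EC       EU       OC        OU       MC      MU      YGC    YGCT     FGC   FGCT     GCT
8192.0 8192.0 0.0    7234.5 65536.0  45123.2  131072.0  78901.2  20480.0 19234.5 145    1.234    3     0.567    1.801'

printf '%s\n' "$sample_gc" | old_util
```

For the sample row this gives 78901.2 / 131072.0, about 60.20%. Live usage would be jstat -gc <pid> 1000 | old_util, which prints one percentage per sample.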

Quick calculation

jstat -gcutil 12345 1000
# S0     S1     E      O      M      CCS    YGC     YGCT    FGC    FGCT     GCT
# 0.00  88.31  68.85  60.20  93.92  91.83    145    1.234     3    0.567    1.801

-gcutil shows percentages instead of KB, which makes health easier to judge. O at 60% is healthy; O at a sustained 95% is a red flag.
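
What matters is that the value stays high, not that one sample is high. A minimal sketch that flags a sustained reading, assuming a plain jstat -gcutil stream with a single header line; old_sustained is a hypothetical helper name, and the sample rows are illustrative values, not measurements:

```shell
# Sketch: print an alert when the O column of `jstat -gcutil` stays at or
# above a limit for N consecutive samples. Reads jstat output on stdin.
# old_sustained is a hypothetical helper name.
old_sustained() {
  awk -v limit="$1" -v need="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") o = i; next }  # locate the O column
    o == 0  { next }                                                  # no O column found
    { streak = ($o + 0 >= limit + 0) ? streak + 1 : 0 }               # reset on any dip
    streak == need { print "ALERT: Old >= " limit "% for " need " samples (now " $o "%)" }
  '
}

# Illustrative sample rows: O climbs from 85% to 95%.
sample_gcutil='  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  88.31  68.85  85.20  93.92  91.83    145    1.234     8    2.567    3.801
  0.00  90.50  70.12  90.85  93.92  91.83    146    1.245    12    4.123    5.368
  0.00  92.50  72.12  95.20  93.92  91.83    147    1.256    18    7.890    9.146'

printf '%s\n' "$sample_gcutil" | old_sustained 90 2
```

Live usage would look like jstat -gcutil <pid> 1000 | old_sustained 95 60, meaning 95% or more for 60 one-second samples. The thresholds are choices to tune per service, not recommendations from the JDK.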

6. `jcmd`: Swiss-army knife

jcmd (available since JDK 7) replaces many of the older tools. Prefer jcmd in production.

jcmd 12345 help                              # List the available commands
jcmd 12345 GC.heap_info                      # Current heap stats
jcmd 12345 GC.run                            # Force a GC (rarely a good idea)
jcmd 12345 GC.class_histogram                # = jmap -histo
jcmd 12345 Thread.print                      # = jstack
jcmd 12345 VM.flags                          # Effective JVM flags
jcmd 12345 VM.system_properties              # System properties
jcmd 12345 VM.uptime                         # JVM running time
jcmd 12345 VM.classloader_stats              # Loader stats, for debugging metaspace leaks

Note: the old GC.class_stats command was removed in JDK 16. Use VM.classloader_stats (statistics per classloader) or GC.class_histogram (count and size per class) instead.

JFR control (details in lesson 08):

jcmd 12345 JFR.start name=profile duration=60s filename=profile.jfr
jcmd 12345 JFR.dump name=profile filename=profile.jfr
jcmd 12345 JFR.stop name=profile

Native memory tracking:

java -XX:NativeMemoryTracking=summary MyApp
jcmd 12345 VM.native_memory summary
# Show malloc breakdown: heap, class, thread, code, GC, internal, ...

Essential for debugging "the JVM uses more memory than -Xmx": it shows the native memory breakdown.

7. Eclipse MAT: heap dump analysis

When you suspect a heap leak, take an .hprof dump and open it in MAT (Eclipse Memory Analyzer Tool).

Typical workflow

Generate the dump: jmap -dump:live,format=b,file=heap.hprof 12345 (or automatically via -XX:+HeapDumpOnOutOfMemoryError).
Open MAT → File → Open Heap Dump → the "Leak Suspects" report analyzes the dump automatically and suggests the top leak candidates.
Histogram view: sort classes by Retained Heap, descending. A class retaining more than 50% of the heap is a strong candidate. The Dominator Tree shows the objects that keep the most memory alive.

Path to GC Roots: starting from the suspected object, MAT shows the reference chain that keeps it alive:

MyOrder (1.5 KB)
└── elementData[42] of java.util.ArrayList (40 KB)
    └── orders field of com.myapp.OrderCache (instance)
        └── INSTANCE field of com.myapp.OrderCache (static)  <- GC root!

Root cause identified: the static field OrderCache.INSTANCE.orders holds an ArrayList that grows without bound → leak.

OQL: querying the heap

MAT has an SQL-like DSL for querying the heap:

SELECT s.value FROM java.lang.String s WHERE s.value.@length > 1000

This finds abnormally long Strings: log messages, SQL queries built without bound, and so on.

8. Typical debug workflows

Symptom: API p99 5s, CPU 100%

jps -l                            # 1. Find the PID
top -H -p <PID>                   # 2. Note the TID with the highest CPU
printf '%x\n' <TID>               # 3. Convert the TID to hex
jstack <PID> > dump.txt           # 4. Thread dump
grep -A 30 'nid=0x<hex>' dump.txt # 5. Stack frames: which method is busy-spinning

If CPU is spread across many threads → you need a flame graph (async-profiler, lesson 08).

Symptom: OOM Java heap space

# 1. Automatic dump (configure before the crash)
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/heap"

# 2. After the crash, copy the .hprof to your machine and open it in MAT
scp server:/var/log/heap/java_pid12345.hprof .
# Run "Leak Suspects Report" -> top retained class
# Path to GC Roots -> static field / cache / collection growing

Symptom: app hangs, no response

# 1. Triple dump to confirm the hang
for i in 1 2 3; do jstack <PID> > dump-$i.txt; sleep 5; done
diff dump-1.txt dump-2.txt

# 2. Check for deadlock
grep "Found one Java-level deadlock" dump-1.txt

# 3. Thread state distribution
grep "Thread.State" dump-1.txt | sort | uniq -c
# Many BLOCKED -> contention; all WAITING -> waiting for an external event

# 4. Find lock contention
grep -A 5 "BLOCKED" dump-1.txt | grep "waiting to lock"

Symptom "GC runs constantly, throughput is low": start with jstat -gcutil (section 5) to confirm the pattern, then move on to JFR allocation profiling. The full workflow is in lesson 08.

9. Common pitfalls

❌ Mistake 1: Restarting before capturing diagnostics.

Production hang -> restart -> thread state lost -> no root cause

✅ Capture jstack + jmap (if RAM allows) before restarting.

❌ Mistake 2: Taking a heap dump on a production instance that is serving traffic.

jmap -dump:format=b,file=heap.hprof 12345   # Pauses the app for several seconds

✅ Drain traffic or fail over first. Or rely on -XX:+HeapDumpOnOutOfMemoryError (which only fires once the JVM has already hit OOM).

❌ Mistake 3: Reading a single thread dump. The thread may just happen to be at that stack, not stuck. ✅ Take three dumps and compare.

❌ Mistake 4: Installing MAT on production. MAT needs a GUI and uses about as much RAM as the heap dump (it loads the whole file). ✅ Copy the .hprof to a dev machine and run MAT locally.

❌ Mistake 5: Using jstack for every problem. For a memory leak, jstack tells you nothing. ✅ Match the tool to the symptom (section 8).

❌ Mistake 6: Trusting jstat numbers for long-term trends. jstat samples at points in time and does not see every GC event. ✅ Use JFR for the long term (lesson 08) and jstat for quick real-time checks.

10. 📚 Deep Dive Oracle


Official specs and references:

Troubleshooting Guide for HotSpot VM (Java 21): the official manual, organized by symptom.
jcmd man page: the full diagnostic command reference.
Eclipse MAT: Memory Analyzer Tool.
"Java Performance: The Definitive Guide" by Scott Oaks: the standard book on JVM tuning and tooling.

Note: the official Troubleshooting Guide has a symptom-based "decision tree", so bookmark it for on-call. The Scott Oaks book is the most in-depth reference on performance and tooling, especially the chapters on GC tuning and profiling.

11. Summary

jps -lv: list JVM processes with their arguments.
jstack <pid>: thread dump. States: RUNNABLE, BLOCKED, WAITING. Detects deadlocks automatically. A triple dump (3 dumps 5s apart) confirms a thread is really stuck.
High CPU on one thread: top -H -p <pid> → TID in hex → match nid in the jstack output.
jmap -histo:live <pid>: count and size per class. jmap -dump:live,format=b,file=heap.hprof <pid>: full heap dump (pauses the app for several seconds). -XX:+HeapDumpOnOutOfMemoryError dumps automatically on OOM.
jstat -gcutil <pid> 1000: real-time % usage. Old staying at 95% or above is a red flag.
jcmd replaces many older tools (jcmd <pid> help lists the commands). Prefer it. NMT debugs native memory. GC.class_stats was removed in JDK 16; use VM.classloader_stats / GC.class_histogram.
Eclipse MAT: "Leak Suspects" suggests candidates automatically; the Dominator Tree shows the objects that retain the most; Path to GC Roots finds the chain behind a leak.
Workflow: CPU 100% → top -H + jstack; OOM → heap dump + MAT; hang → triple jstack + diff.
Containers: docker exec <container> jcmd ... runs the tool inside the JVM's namespace.
Capture diagnostics before restarting; a restart destroys the state.

12. Self-check


Which workflow debugs "API p99 latency 5s, CPU 100% on one core" in production?

Step-by-step:

List the PID: jps -l → find the app's PID (e.g. 12345).
Find the CPU-hungry thread: top -H -p 12345 → in the %CPU column, note the TID sitting at 98% (e.g. 12399).
Convert the TID to hex (jstack prints it in hex): printf '%x' 12399 → 306f.
Thread dump + grep: jstack 12345 > dump.txt, then grep -A 30 'nid=0x306f' dump.txt. The top stack frame shows which method is busy-spinning.
Triple dump to confirm: take 3 dumps 5s apart; if the thread is at the same stack frame in all three (about 10s), it is really stuck and not a coincidence.

If CPU is spread across many threads (none dominates) → flame graph with async-profiler (lesson 08).

Common patterns this reveals: infinite loop (while(true) missing an exit condition), catastrophic regex backtracking (stack in java.util.regex.Pattern), hash collisions (stack in HashMap.put with a long chain), lock contention (many threads BLOCKED on the same monitor). Do not restart before capturing, or you lose all the diagnostic state.

An 8GB heap dump. MAT reports com.myapp.OrderCache at the top of "Retained Heap" with 7GB. What is the next workflow to pin down the leak?

You already have a suspect; now trace the cause:

Path to GC Roots: Histogram → search for "OrderCache" → select the instance → "Path to GC Roots" → "exclude weak/soft references". Example result:
```
OrderCache@0x... (7 GB retained)
cache field of OrderService@0x...
  INSTANCE field of OrderManager (static)  <- GC root
```
The static field keeps the cache alive forever.
Inspect the contents: drill down into the orders field (a Map). 5 million entries at ~1.4 KB each → the cache is unbounded and never evicts.
Confirm with OQL: SELECT * FROM com.myapp.Order o WHERE o.timestamp < currentTimeMillis() - 86400000 (treat the right-hand side as pseudocode and substitute a literal epoch-millisecond cutoff). Orders older than 1 day are still cached → there is no TTL.
Cross-check the code: look for cache.put(...) calls with no eviction (typically a HashMap inside a singleton).
Fix: replace the HashMap with a Caffeine cache, Caffeine.newBuilder().maximumSize(10_000).expireAfterWrite(Duration.ofMinutes(30)).build(), or drop the cache if the DB query is fast enough. Verify with a heap dump after deploying: retained size should stay stable.

Common leak patterns found with MAT: a growing static collection (this case), listeners that are never unregistered, ThreadLocals that are never removed (Tomcat redeploy), inner classes holding their enclosing instance. Always two steps: identify what is retained (histogram), then trace the path to the GC root.

Why is a sustained "Old 95%" in `jstat -gcutil` a red flag even before any OOM?

A sustained Old 95% means two things: the promotion rate is high (objects that survive young GC pile up in Old), and GC cannot reclaim the space, because most objects in Old are still live and mark-sweep-compact does not free enough.

Chain of consequences: Old nearly full → major/mixed GCs run more often → long pauses (hundreds of ms) → latency spikes → full GC as a last resort (pauses of several seconds) → eventually OOM with "GC overhead limit exceeded" or "Java heap space". In jstat, a rapidly rising FGC means pressure and an approaching OOM:

# Red flag: O keeps rising, FGC climbs quickly
S0     S1     E      O      M      YGC    YGCT    FGC    FGCT
0.00  88.31  68.85  85.20  93.92    145    1.234     8    2.567
0.00  90.50  70.12  90.85  93.92    146    1.245    12    4.123
0.00  92.50  72.12  95.20  93.92    147    1.256    18    7.890

Two common causes and how to tell them apart (take two heap dumps 5 minutes apart and compare the top retained classes):

Memory leak: application classes (Order, User, CacheEntry) keep growing → trace GC roots in MAT.
Heap too small: the workload is steady and every class grows proportionally → raise -Xmx.

Quick actions: raise -Xmx temporarily to buy time, enable -XX:+HeapDumpOnOutOfMemoryError, and run a JFR allocation profile (lesson 08) to find hot allocation sites. Long term: ship Old % to a time series (Prometheus/Datadog) and alert when it stays above 80% for 10 minutes, so you catch the problem before users notice.

Why read thread dumps three times (triple dump) instead of once?

One thread dump is a snapshot of a single instant and is easy to misread: a thread may just happen to be in the method you suspect (not stuck), a wall of WAITING threads may only be pool workers idling between tasks, and a stack with 1-2 frames lacks context.

A triple dump (3 dumps 5 seconds apart) compares state and stack frames over time:

Thread X: BLOCKED in all 3 dumps, waiting for lock 0xABC at Service.process(L42)
=> stuck on the lock across all three dumps -> real contention

Thread Y: dump 1 in ArrayList.add, dump 2 in HashMap.put, dump 3 in String.equals
=> working normally through many methods -> not stuck

Thread Z: RUNNABLE in all 3 dumps at MyService.compute(L50)
=> infinite loop or hot spin

Quick diff: run grep "Thread.State" dump-1.txt | sort | uniq -c on each dump to see how the state distribution changes.

Production guideline: for hang or CPU issues always capture 3 dumps 5-10s apart. For a deadlock one dump is enough, since jstack detects it and prints "Found one Java-level deadlock".

Why is jcmd recommended over jstack / jmap in production?

First, a common misconception needs correcting: "jstack/jmap pause the JVM longer than jcmd" is false. By default, jstack and jmap use the same dynamic attach API as jcmd. Only when run with the -F flag (force, for a hung JVM that does not respond) do they switch to the Serviceability Agent (SA), which reads the process memory from outside. And every thread dump, whichever tool takes it, requires the JVM to reach a safepoint (lesson 11). The pause is equivalent.

The real reasons to prefer jcmd:

One unified interface: jcmd <pid> help lists every command, so there is no need to remember jstack vs jmap vs jstat vs jinfo.
An extensible diagnostic command framework: new commands (JFR control, ZGC stats, compiler directives) are added only to jcmd. The older tools are frozen.
Native Memory Tracking (NMT): jcmd <pid> VM.native_memory debugs "the JVM uses more memory than -Xmx", and no older tool offers an equivalent.

Side-by-side comparison:

Legacy tool	`jcmd` equivalent
`jstack <pid>`	`jcmd <pid> Thread.print`
`jmap -histo <pid>`	`jcmd <pid> GC.class_histogram`
`jmap -dump:format=b,file=h.hprof <pid>`	`jcmd <pid> GC.heap_dump h.hprof`
`jinfo <pid>`	`jcmd <pid> VM.flags` + `VM.system_properties`
—	`jcmd <pid> JFR.start ...` (JFR)
—	`jcmd <pid> VM.native_memory` (NMT)

Containers: docker exec <container> jcmd <pid> ... runs inside the JVM's namespace with no dependency on the host. The older tools still make sense for CI/CD scripts that already parse jstack's format. In production, jcmd + JFR is the standard combination.
