FastAPI – Performance and Optimization: Profiling and Memory Leak Detection

Introduction

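Even in a high-performance web framework like FastAPI, clean code can still become a bottleneck through careless resource use or inefficient algorithms.
When a service runs under high concurrency, high CPU usage, slow response times, and steadily growing memory (a memory leak) all directly affect user experience and operating costs.

Knowing how to profile and how to detect memory leaks therefore helps developers locate problems quickly and remove hidden risks before deployment, so that a FastAPI application stays both fast and stable.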

Core Concepts

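1. Why Profiling Matters

Profiling is the process of collecting execution time, call counts, and CPU and I/O usage while a program runs. It helps you:

Find the routes or functions that take the longest
Identify unnecessary repeated computation
Assess the performance impact of third-party packages

Common Python profiling tools include cProfile, pyinstrument, and yappi. Because FastAPI is asynchronous, choosing a tool that handles async code well is especially important.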

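2. What a Memory Leak Is

A memory leak occurs when objects that are no longer needed are still kept in memory, so memory usage keeps rising.
In a long-running API service, leaking even a few hundred KB per request can add up to exhaust all available memory within hours.

Python has automatic garbage collection, but reference cycles (which survive until the cyclic collector runs), global variables that are never released, and bugs in third-party C extensions can all cause memory to leak.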

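3. Overview of the Main Tools

Tool	Purpose	Highlights
`cProfile`	Function-level time analysis	Standard library; per-function call statistics
`pyinstrument`	Sampling call-stack profiler (call-tree and flame-graph style reports)	Visual output, async support
`memory_profiler`	Line-by-line memory usage	Mark functions with the `@profile` decorator
`tracemalloc`	Memory allocation snapshots	Built into Python 3.4+
`objgraph`	Object reference graphs	Inspect reference cycles and unexpectedly retained objects

The examples below use FastAPI to show how these tools can be combined to investigate performance and memory problems.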

Code Examples

1️⃣ Profiling a synchronous route with `cProfile`

# app_sync.py
from fastapi import FastAPI
import time

app = FastAPI()

def heavy_computation(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

@app.get("/sync")
def sync_endpoint(q: int = 100_000):
    start = time.time()
    result = heavy_computation(q)
    duration = time.time() - start
    return {"result": result, "duration": duration}

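Run the profiler:

python -c "import cProfile, app_sync; cProfile.run('app_sync.sync_endpoint(1_000_000)', sort='cumtime')"

Running python -m cProfile -s cumtime app_sync.py on its own would only profile the import of the module, because no request handler is ever called. The command above calls the handler directly instead; sort='cumtime' sorts by cumulative time, so the most expensive functions come first.
The output should show heavy_computation taking most of the time; it can be optimized further by rewriting it with numpy or Cython.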
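If you prefer to capture the report programmatically (for example to log it), the same measurement can be sketched with `cProfile.Profile` and `pstats`. The helper name `profile_call` is an assumption made for this example; `heavy_computation` is copied from app_sync.py so the snippet runs without FastAPI.

```python
import cProfile
import io
import pstats


def heavy_computation(n: int) -> int:
    """Same CPU-bound loop as in app_sync.py."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top: int = 5):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the `top` most expensive entries
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(heavy_computation, 100_000)
print(report)
```

Keep in mind that cProfile only records the thread it was started in, which is one reason to profile the function call itself rather than the whole server process.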

2️⃣ Observing an asynchronous route with `pyinstrument`

# app_async.py
from fastapi import FastAPI
import asyncio

app = FastAPI()

async def async_task(delay: float):
    await asyncio.sleep(delay)
    return "done"

@app.get("/async")
async def async_endpoint(wait: float = 0.5):
    result = await async_task(wait)
    return {"msg": result}

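Run pyinstrument:

pyinstrument -m uvicorn app_async:app --host 0.0.0.0 --port 8000

Send a few requests to /async, then stop the server (Ctrl+C). pyinstrument prints a call tree similar to this simplified excerpt:

async_task               0.502s  (100%)
await asyncio.sleep      0.500s   (99%)

The report shows how time is split around each await. Note that await asyncio.sleep hands control back to the event loop and does not block anything; it is a blocking call such as time.sleep inside an async def route that stalls every other request.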
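To see why a blocking call inside a coroutine matters, here is a small standard-library sketch that needs neither FastAPI nor pyinstrument. The coroutine names `polite` and `rude` are made up for illustration. Five coroutines that await `asyncio.sleep` overlap, while five that call `time.sleep` run one after another because each one holds the event loop.

```python
import asyncio
import time


async def polite(delay: float) -> None:
    await asyncio.sleep(delay)       # yields to the event loop while waiting


async def rude(delay: float) -> None:
    time.sleep(delay)                # blocks the entire event loop


async def run_concurrently(task, n: int, delay: float) -> float:
    """Start n copies of `task` at once and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(task(delay) for _ in range(n)))
    return time.perf_counter() - start


polite_elapsed = asyncio.run(run_concurrently(polite, 5, 0.1))
rude_elapsed = asyncio.run(run_concurrently(rude, 5, 0.1))
print(f"asyncio.sleep: {polite_elapsed:.2f}s, time.sleep: {rude_elapsed:.2f}s")
```

The first figure should be close to a single delay and the second close to five delays, which is the pattern to look for in a profiler report when async routes are slow under load.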

3️⃣ Line-by-line memory monitoring with `memory_profiler`

# app_mem.py
from fastapi import FastAPI
from memory_profiler import profile
import pandas as pd

app = FastAPI()

@profile   # <-- prints per-line memory usage every time this function is called
def load_dataframe(rows: int):
    df = pd.DataFrame({"a": range(rows), "b": range(rows)})
    return df

@app.get("/mem")
def mem_endpoint(size: int = 1_000_000):
    df = load_dataframe(size)
    return {"rows": len(df)}

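How to run it:

uvicorn app_mem:app --port 8000
curl http://127.0.0.1:8000/mem

Because profile is imported explicitly from memory_profiler, the server can be started normally; each call to load_dataframe prints a report to the server's stdout. (Running python -m memory_profiler app_mem.py alone would only import the module and never call the function.)
The report lists the memory increment (MiB) for each line. If building the pd.DataFrame takes too much memory, consider reading the data in chunks or using Dask.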
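The benefit of chunked processing can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the article's DataFrame code: it uses plain lists and `tracemalloc.get_traced_memory()` to compare the peak memory of building a whole dataset at once against processing it in fixed-size chunks. The function names and the chunk size are hypothetical.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def sum_all_at_once(rows: int) -> int:
    data = [i * 2 for i in range(rows)]          # whole dataset in memory
    return sum(data)


def sum_in_chunks(rows: int, chunk_size: int = 10_000) -> int:
    total = 0
    for start in range(0, rows, chunk_size):
        stop = min(start + chunk_size, rows)
        chunk = [i * 2 for i in range(start, stop)]  # only one chunk alive
        total += sum(chunk)
    return total


full_total, full_peak = peak_bytes(sum_all_at_once, 200_000)
chunk_total, chunk_peak = peak_bytes(sum_in_chunks, 200_000)
print(f"all at once: {full_peak / 1e6:.1f} MB, chunked: {chunk_peak / 1e6:.1f} MB")
```

Both functions return the same total; only the peak differs, which is the trade-off chunked readers make as well.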

4️⃣ Capturing memory allocation snapshots with `tracemalloc`

# app_tracemalloc.py
import tracemalloc
from fastapi import FastAPI

app = FastAPI()
tracemalloc.start()

# Deliberate leak: a module-level list that only ever grows
_leak = []

@app.get("/leak")
def leak_endpoint():
    _leak.append("x" * 10_000)   # each request adds about 10 KB
    return {"len": len(_leak)}

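Inspecting a snapshot:

# run.py
import tracemalloc

import uvicorn
from app_tracemalloc import app

if __name__ == "__main__":
    try:
        # uvicorn.run() blocks until the server is stopped (e.g. Ctrl+C)
        uvicorn.run(app, host="0.0.0.0", port=8000)
    finally:
        # After the service has run for a while and been stopped,
        # take a snapshot of everything that is still allocated
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics("lineno")
        print("[ Top 10 allocation sites ]")
        for stat in top_stats[:10]:
            print(stat)

With tracemalloc you can see directly which line of code keeps allocating memory, which makes it possible to locate the leak.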

5️⃣ Drawing object reference graphs with `objgraph`

# leak_check.py
import objgraph
from fastapi import FastAPI

app = FastAPI()

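# Note: on_event is deprecated in recent FastAPI versions in favor of lifespan handlers
@app.on_event("startup")
def startup():
    # Generate a reference graph once at startup for later comparison
    objgraph.show_backrefs(
        objgraph.by_type('list')[0],
        filename='leak_start.png'
    )

@app.get("/check")
def check():
    # Suppose this list is the one suspected of leaking
    suspect = [i for i in range(1000)]
    objgraph.show_backrefs(suspect, filename='leak_now.png')
    return {"msg": "graph generated"}

The resulting leak_start.png and leak_now.png (rendering PNG files requires Graphviz to be installed) let developers compare the reference chains visually and find what is keeping an object from being released.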
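When Graphviz or objgraph is not available, a rough text-only version of the same investigation can be sketched with the standard `gc` module. This swaps in `gc.get_objects` and `gc.get_referrers` for objgraph's graphs; the `Session` class and the `_active_sessions` registry are hypothetical names used only for this example.

```python
import gc


class Session:
    """Hypothetical per-user object that should die when the user leaves."""


_active_sessions = {}   # hypothetical registry that is never cleaned up


def open_session(user_id: int) -> None:
    _active_sessions[user_id] = Session()


def live_instances(cls) -> list:
    """All objects of exactly this class that the GC currently tracks."""
    return [obj for obj in gc.get_objects() if type(obj) is cls]


def container_holders(obj) -> list:
    """Dicts, lists and sets that hold a direct reference to obj."""
    return [ref for ref in gc.get_referrers(obj)
            if isinstance(ref, (dict, list, set))]


for user_id in range(3):
    open_session(user_id)

survivors = live_instances(Session)
holders = container_holders(survivors[0])
held_by_registry = any(h is _active_sessions for h in holders)
print(len(survivors), held_by_registry)
```

Counting instances of a suspicious class at two points in time, then asking who still refers to one of them, answers the same question as the back-reference graph: which container is keeping the object alive.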

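Common Pitfalls and Best Practices

Pitfall	Explanation	Best practice
Enabling cProfile directly in production	Deterministic tracing adds overhead to every function call and slows the service	Use it only locally or in a test environment, or use `py-spy` for sampling-based profiling without code changes
Forgetting to stop `tracemalloc`	Tracing consumes extra memory in long-running processes	Call `tracemalloc.stop()` when it is no longer needed
Keeping large DataFrames in global variables	They stay referenced after the request ends, which behaves like a leak	Use dependency injection (`Depends`) or request-scoped variables
Blocking synchronous calls inside the async event loop	Other requests are stalled while the call runs	Use `run_in_threadpool` or rewrite the call as async
Excessive `print` or logging	I/O is costly, especially under high concurrency	Use structured logging and set an appropriate log level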
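The fix for the blocking-call pitfall can be sketched with the standard library alone. This example uses `asyncio.to_thread` (Python 3.9+) as a stand-in for `run_in_threadpool`, which plays a similar role in FastAPI; `blocking_io` and `handler` are hypothetical names.

```python
import asyncio
import time


def blocking_io(delay: float) -> str:
    """Stand-in for a blocking driver call or a CPU-light legacy function."""
    time.sleep(delay)
    return "done"


async def handler(delay: float) -> str:
    # Run the blocking function in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_io, delay)


async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(handler(0.1) for _ in range(5)))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The five calls overlap in worker threads, so the total time stays near one delay instead of five. This helps with blocking I/O; CPU-bound work is still limited by the GIL and is usually better moved to a separate process.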

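Additional Recommendations

Automated testing: use pytest-benchmark in CI to catch performance regressions.
Monitoring: deploy Prometheus + Grafana, collect metrics such as process_cpu_seconds_total and process_resident_memory_bytes, and set up alerts to detect anomalies early.
Containers: set a --memory limit in Docker so that a leaking container is OOM-killed instead of exhausting the host's memory, and watch the memory trend with docker stats.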

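Real-World Scenarios

Scenario 1: Product search API on an e-commerce platform at peak traffic

Problem: under high concurrency, the search route's response time rose from 120 ms to 800 ms.

Steps taken

Profiled with pyinstrument and found that SQLAlchemy ORM's session.query accounted for 70% of the time.
Switched to SQLAlchemy Core with prepared statements, bringing the time down to 200 ms.
For converting results from pydantic models, used orjson instead of jsonable_encoder, saving another 30 ms.

Scenario 2: Message push in a real-time chat service

Problem: after running for a long time, the service's memory grew from 200 MB to 2 GB and it was eventually killed by the OOM killer.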