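LangChain Course – LLM Evaluation

Topic: Custom Metrics and Test Data

Introduction

When building chatbots, document question answering, or automated workflows, we often take the output of a large language model (LLM) as is and neglect to evaluate its performance systematically. Without an objective scoring mechanism, it is hard to compare prompts, models, or fine-tuning strategies, and impossible to give a product a dependable quality guarantee.

In the LangChain ecosystem, Evaluation is a dedicated submodule that helps with managing test data, computing metrics, and reviewing results. With custom metrics and a test dataset, developers can:

Design scoring that starts from business requirements and matches real usage scenarios.
Automate large test runs and iterate quickly on prompts or model parameters.
Add quality thresholds to the CI/CD pipeline to prevent regressions.

This article uses LangChainJS as its example. It explains how to build test data, write custom metrics, and run an evaluation, and it shares common pitfalls and best practices, so that LLM evaluation moves from a matter of "feel" to a "scientific" workflow.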

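Core Concepts

1. The Basic Evaluation Flow

A LangChain evaluation run breaks down into roughly four steps:

Step	Description
Prepare test data	Define each question and its reference answer (ground truth), or several acceptable answers.
Call the LLM to generate a response	Use an `LLMChain` or another `Runnable`, driven by a `PromptTemplate`, to obtain the model output.
Compute metrics	Compare the model response with the reference answer using custom or built-in metrics (for example `ExactMatch`, `BLEU`, `Rouge`).
Aggregate and analyze	Produce reports, plot distributions, or write to a database to support later decisions.

The sections below cover each step in turn, with complete code examples.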
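
Before looking at each step in detail, the four steps can be sketched end to end without any library at all. This is a minimal sketch under stated assumptions: `fakeModel` is a hypothetical stand-in for a real LLM call, and `exactMatch` and `evaluate` are illustrative helper names rather than LangChain APIs.

```javascript
// Hypothetical stand-in for a real model call; swap in your own chain or client.
async function fakeModel(question) {
  return question.includes("Taipei 101") ? "508 meters" : "I don't know";
}

// Exact match against any accepted answer: 1 on a hit, 0 otherwise.
function exactMatch(prediction, answers) {
  return answers.some((a) => a.trim() === prediction.trim()) ? 1 : 0;
}

async function evaluate(testData, model) {
  const rows = [];
  for (const { question, answers } of testData) { // step 1: test data
    const prediction = await model(question); // step 2: call the model
    const score = exactMatch(prediction, answers); // step 3: compute a metric
    rows.push({ question, prediction, score });
  }
  // step 4: aggregate (an empty run reports a mean of 0)
  const mean = rows.length ? rows.reduce((s, r) => s + r.score, 0) / rows.length : 0;
  return { rows, mean };
}
```

Keeping the model behind a plain async function makes it easy to replace the stub with a real chain later without touching the scoring code.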

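2. Building Test Data

Test data has no mandatory format; anything that can be iterated in code will do. JSONL (one JSON object per line) and CSV are common choices because they integrate easily with external data pipelines.

// test_data.jsonl (sample questions are in Chinese: "How tall is Taipei 101?" and "Which method sorts a Python list directly?")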
{"question": "台北 101 的高度是多少？", "answers": ["508 公尺", "508m", "508 meters"]}
{"question": "Python 的 List 可以直接使用哪個方法排序？", "answers": ["sort()", "sorted()"]}

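In JavaScript, the file can be read with the built-in fs module, or the cases can be defined inline as an array:

// testData.js
export const testData = [
  {
    question: "台北 101 的高度是多少？", // "How tall is Taipei 101?"
    answers: ["508 公尺", "508m", "508 meters"],
  },
  {
    question: "Python 的 List 可以直接使用哪個方法排序？", // "Which method sorts a Python list directly?"
    answers: ["sort()", "sorted()"],
  },
  // more test cases...
];

Tip: when the test set is large, load it in batches so that a single read does not consume too much memory.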
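
One way to load in batches is an async generator that groups JSONL lines as they arrive. The sketch below is an illustration, not a LangChain feature: `batchJsonl` is a hypothetical helper, and it assumes each non-blank line is one valid JSON object.

```javascript
// Groups JSONL lines into arrays of parsed records, `batchSize` at a time.
// `lines` can be any iterable or async iterable of strings, so the whole
// file never has to sit in memory at once.
async function* batchJsonl(lines, batchSize = 100) {
  let batch = [];
  for await (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines
    batch.push(JSON.parse(trimmed));
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Usage with Node's streaming line reader (shown as comments to keep the
// helper itself free of imports):
//   import fs from "node:fs";
//   import readline from "node:readline";
//   const lines = readline.createInterface({
//     input: fs.createReadStream("test_data.jsonl"),
//     crlfDelay: Infinity,
//   });
//   for await (const batch of batchJsonl(lines, 50)) { /* evaluate batch */ }
```

Because the helper only depends on an iterable of strings, it can be tested with a plain array and used unchanged with a file stream.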

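3. Built-in Metrics vs. Custom Metrics

This course refers to a few commonly used built-in metrics (confirm the exact class names against the LangChain version you have installed):

ExactMatchMetric – 1 if the two strings are identical, otherwise 0.
RougeMetric – text overlap (ROUGE‑1, ROUGE‑L).
BLEUMetric – BLEU score computed from n‑gram overlap.

In practice, however, we often need metrics that sit closer to business requirements, for example:

Key-information hit rate: check whether the model response contains specific keywords or values.
Semantic consistency: use vector similarity to judge how close the response is in meaning to the reference answer.
Response length penalty: treat responses that are too long or too short as poor.

The three custom metrics below show how these can be implemented.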

3.1 Key-Information Hit Rate (KeywordHitMetric)

// metrics/keywordHitMetric.js
// Plain class: `langchain/evaluation` is not known to export a `BaseMetric`
// base class, so this metric does not extend one.

/**
 * Checks whether the model response contains every keyword.
 */
export class KeywordHitMetric {
  /**
   * @param {string[]} keywords - keywords that must all appear in the response
   */
  constructor(keywords) {
    this.keywords = keywords;
  }

  /**
   * Computes the score: 1 if every keyword is present, otherwise 0.
   * Matching is a case-insensitive substring test.
   * @param {string} prediction - the LLM response
   * @returns {Promise<number>}
   */
  async evaluate(prediction) {
    const lower = prediction.toLowerCase();
    const hitAll = this.keywords.every((kw) => lower.includes(kw.toLowerCase()));
    return hitAll ? 1 : 0;
  }
}
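
The class above is all-or-nothing: one missing keyword gives 0. Where partial credit is more useful, a fractional variant can report what share of the keywords was found and which ones were missed. This is a sketch; `keywordHitRate` is a hypothetical helper name, and treating an empty keyword list as a perfect score is an assumption.

```javascript
// Returns the fraction of keywords found in the prediction, plus the missed ones.
// Matching is a case-insensitive substring test, as in KeywordHitMetric.
function keywordHitRate(prediction, keywords) {
  if (keywords.length === 0) return { score: 1, missing: [] }; // nothing required
  const lower = prediction.toLowerCase();
  const missing = keywords.filter((kw) => !lower.includes(kw.toLowerCase()));
  const score = (keywords.length - missing.length) / keywords.length;
  return { score, missing };
}

// Example: one of two keywords is present.
const hit = keywordHitRate("Taipei 101 is 508 meters tall", ["meters", "height"]);
console.log(hit.score, hit.missing);
```

The `missing` list is convenient for a report column that explains why a case lost points.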

3.2 Semantic Similarity (SemanticSimilarityMetric)

This metric uses OpenAI embeddings (or any other embedding model) to compute cosine similarity.

// metrics/semanticSimilarityMetric.js
// Import path for older `langchain` releases; newer releases ship this class in "@langchain/openai".
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

/**
 * Measures semantic closeness as the cosine similarity of two embeddings.
 * Cosine similarity lies in [-1, 1]; the result is clamped to 0~1.
 */
export class SemanticSimilarityMetric {
  constructor() {
    this.embeddings = new OpenAIEmbeddings(); // reads OPENAI_API_KEY from the environment
  }

  async evaluate(prediction, groundTruth) {
    const [predVec, gtVec] = await Promise.all([
      this.embeddings.embedQuery(prediction),
      this.embeddings.embedQuery(groundTruth),
    ]);

    // cosine similarity = dot(a, b) / (|a| * |b|)
    const dot = predVec.reduce((sum, v, i) => sum + v * gtVec[i], 0);
    const normPred = Math.sqrt(predVec.reduce((s, v) => s + v * v, 0));
    const normGT = Math.sqrt(gtVec.reduce((s, v) => s + v * v, 0));
    if (normPred === 0 || normGT === 0) return 0; // zero vector: similarity is undefined
    const cosine = dot / (normPred * normGT);
    return Math.min(1, Math.max(0, cosine)); // clamp to 0~1
  }
}
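
The cosine calculation is easier to test when it is separated from the embedding call. The sketch below isolates it as a pure function; `cosineSimilarity` and `toUnitInterval` are illustrative names. Cosine similarity of arbitrary vectors ranges over [-1, 1], so a score meant to live in 0~1 needs an explicit mapping or clamp.

```javascript
// Cosine similarity of two equal-length numeric vectors, in [-1, 1].
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // undefined for zero vectors; treat as no similarity
  return dot / Math.sqrt(normA * normB);
}

// Linear map from [-1, 1] to [0, 1], an alternative to clamping negatives to 0.
function toUnitInterval(cosine) {
  return (cosine + 1) / 2;
}
```

Clamping discards the distinction between "unrelated" and "opposite", while the linear map keeps it; which one fits depends on how the score is combined with other metrics.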

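3.3 Response Length Penalty (LengthPenaltyMetric)

// metrics/lengthPenaltyMetric.js
// Plain class: no base class is extended (see the note in keywordHitMetric.js).

/**
 * Lowers the score linearly once the response exceeds the length limit.
 * The penalty is capped at 0.5, so the score never drops below 0.5.
 */
export class LengthPenaltyMetric {
  /**
   * @param {number} maxLen - maximum allowed length (characters here, as a stand-in for tokens)
   */
  constructor(maxLen = 150) {
    this.maxLen = maxLen;
  }

  async evaluate(prediction) {
    // Character length is a simple stand-in for the token count; swap in a tokenizer in practice.
    const len = prediction.length;
    if (len <= this.maxLen) return 1;
    // Every 10 characters over the limit costs 0.05, up to a maximum penalty of 0.5.
    const penalty = Math.min(0.5, ((len - this.maxLen) / 10) * 0.05);
    return Math.max(0, 1 - penalty);
  }
}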
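
The metric above only penalizes responses that are too long, and because the penalty is capped at 0.5 the score never falls below 0.5. For example, with a limit of 100 characters, a 150-character response loses (50 / 10) × 0.05 = 0.25 and scores 0.75. The sketch below extends the same rule to responses that are too short; `lengthScore` and its `minLen`, `slope`, and `floor` options are hypothetical names chosen for this illustration.

```javascript
// Two-sided length score: 1 inside [minLen, maxLen], dropping linearly outside it.
// slope = 0.005 per character equals the 0.05 per 10 characters used above;
// floor = 0.5 mirrors the capped penalty.
function lengthScore(text, { minLen = 20, maxLen = 150, slope = 0.005, floor = 0.5 } = {}) {
  const len = text.length; // character count as a stand-in for tokens
  let excess = 0;
  if (len > maxLen) excess = len - maxLen;
  else if (len < minLen) excess = minLen - len;
  return Math.max(floor, 1 - excess * slope);
}
```

Setting `floor` to 0 lets very long or very short responses score all the way down to zero if that better reflects the product requirement.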

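4. Putting the Evaluation Flow Together (Complete Example)

The following shows how to chain the test data, the LLM call, and the custom metrics into a single evaluation pipeline that ends with a CSV report.

// evaluation/runEvaluation.js
// Import paths below follow older `langchain` releases; adjust them to your installed version.
import { writeFile } from "node:fs/promises";
import { OpenAI } from "langchain/llms/openai";
import { PromptTemplate } from "langchain/prompts";
import { LLMChain } from "langchain/chains";
import { KeywordHitMetric } from "../metrics/keywordHitMetric.js";
import { SemanticSimilarityMetric } from "../metrics/semanticSimilarityMetric.js";
import { LengthPenaltyMetric } from "../metrics/lengthPenaltyMetric.js";
import { testData } from "../testData.js";

// 1️⃣  Initialize the LLM (temperature 0 keeps the output as repeatable as possible)
const llm = new OpenAI({ temperature: 0 });

// 2️⃣  Define the prompt (adjust as needed)
const prompt = new PromptTemplate({
  template: "請用中文回答以下問題：\n{question}", // "Answer the following question in Chinese:"
  inputVariables: ["question"],
});
const chain = new LLMChain({ llm, prompt });

// 3️⃣  Create the metric instances
// The demo keywords ("meters", "height") only suit the Taipei 101 question;
// in practice, store the keywords with each test case.
const keywordMetric = new KeywordHitMetric(["公尺", "高度"]);
const semanticMetric = new SemanticSimilarityMetric();
const lengthMetric = new LengthPenaltyMetric(100);

// Quote one CSV field, doubling any embedded double quotes.
const csvField = (value) => `"${String(value).replace(/"/g, '""')}"`;

// 4️⃣  Run the tests
async function run() {
  const results = [];

  for (const item of testData) {
    // `call` takes the prompt variables as an object; the completion comes back under `text`.
    const { text: prediction } = await chain.call({ question: item.question });

    // With several reference answers, take the best exact-match score (for demonstration)
    const exactMatches = item.answers.map((ans) =>
      prediction.trim() === ans.trim() ? 1 : 0
    );
    const exactScore = Math.max(...exactMatches);

    const keywordScore = await keywordMetric.evaluate(prediction);
    const semanticScore = await semanticMetric.evaluate(
      prediction,
      item.answers[0] // use the first answer as the reference
    );
    const lengthScore = await lengthMetric.evaluate(prediction);

    // Weighted average (adjust the weights as needed; they sum to 1)
    const finalScore =
      0.4 * exactScore + 0.2 * keywordScore + 0.3 * semanticScore + 0.1 * lengthScore;

    results.push({
      question: item.question,
      prediction,
      exactScore,
      keywordScore,
      semanticScore: semanticScore.toFixed(3),
      lengthScore: lengthScore.toFixed(2),
      finalScore: finalScore.toFixed(3),
    });
  }

  // 5️⃣  Write the CSV with Node's fs module (header row taken from the first result)
  const header = Object.keys(results[0] ?? {});
  const rows = [header, ...results.map((r) => header.map((key) => r[key]))];
  const csv = rows.map((row) => row.map(csvField).join(",")).join("\n") + "\n";
  await writeFile("./evaluation_report.csv", csv, "utf8");
  console.log("✅ Evaluation complete; report written to evaluation_report.csv");
}

run().catch(console.error);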

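Key points:

PromptTemplate is a reusable prompt, which keeps test runs consistent.
KeywordHitMetric, SemanticSimilarityMetric, and LengthPenaltyMetric each score a different aspect of the response.
finalScore combines the metrics with a weighted average, one of the most common ways to aggregate.
Finally, the results are written to CSV so they can be viewed in Excel, Google Sheets, or a CI report.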
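
The weighted average is worth checking by hand before trusting it. With the weights used in the example (0.4 exact, 0.2 keyword, 0.3 semantic, 0.1 length), a correct but paraphrased answer that misses exact match can score at most 0.2 + 0.3 + 0.1 = 0.6. The sketch below makes the aggregation reusable; `weightedScore` is a hypothetical helper, and dividing by the weight sum is an added assumption so that weights need not sum to exactly 1.

```javascript
// Weighted average of named scores. Every weighted metric must have a score.
function weightedScore(scores, weights) {
  let total = 0;
  let weightSum = 0;
  for (const [name, weight] of Object.entries(weights)) {
    if (!(name in scores)) throw new Error(`missing score for "${name}"`);
    total += weight * scores[name];
    weightSum += weight;
  }
  if (weightSum <= 0) throw new Error("weights must sum to a positive number");
  return total / weightSum; // normalizing keeps the result within the scores' range
}

// Worked example: no exact match, all keywords hit, semantic 0.9, length fine.
const weights = { exact: 0.4, keyword: 0.2, semantic: 0.3, length: 0.1 };
const example = weightedScore({ exact: 0, keyword: 1, semantic: 0.9, length: 1 }, weights);
console.log(example.toFixed(2)); // 0.4*0 + 0.2*1 + 0.3*0.9 + 0.1*1 = 0.57
```

If paraphrased answers should be able to pass a quality threshold, the exact-match weight has to be low enough for the remaining metrics to reach that threshold on their own.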

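Common Pitfalls and Best Practices

Pitfall	Why it happens	Suggested fix
Test data misaligned with real needs	Test questions are too easy or too far from real usage, so the scores are distorted.	When designing test data, invite business or product owners to review it so that question types, tone, and length match production.
Relying on a single metric	Looking only at Exact Match, for example, ignores answers that are semantically close but worded differently.	Combine several metrics (text similarity, keyword hits, length penalty), or weight them according to business needs.
Metric computation too costly	Semantic similarity needs embeddings; with thousands of test samples, the number of API requests balloons.	Use batch embedding, cache embeddings, or deploy an open-source embedding model locally (such as `sentence‑transformers`).
Ignoring randomness	LLM output is stochastic, so the same call can return different results.	Sample several times (for example 3) and average the scores, or set `temperature` to 0 during testing.
Uncalibrated metric weights	Arbitrary weights make the final score hard to interpret.	Run a sensitivity analysis first to see how different weights affect the final ranking, then tune toward business goals.
Reports lack visualization	Plain text or CSV makes it hard to spot problems quickly.	Use Plotly, Tableau, or `pandas`/`matplotlib` to produce distribution plots and box plots so the team can see model weaknesses at a glance.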
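
For the embedding-cost pitfall, caching is often the cheapest fix, because reference answers are embedded again on every run. The following is a minimal in-memory sketch; `withCache` is a hypothetical helper that wraps any async `embed(text)` function and is not part of LangChain.

```javascript
// Wraps an async `embed(text)` function with an in-memory cache keyed by text.
function withCache(embed) {
  const cache = new Map();
  return (text) => {
    if (!cache.has(text)) {
      // Store the promise itself so concurrent calls for the same text share one request.
      const pending = embed(text);
      pending.catch(() => cache.delete(text)); // do not keep failed requests
      cache.set(text, pending);
    }
    return cache.get(text);
  };
}

// Usage (assuming an embeddings object with an `embedQuery` method):
//   const cachedEmbed = withCache((text) => embeddings.embedQuery(text));
//   const vector = await cachedEmbed("508 公尺");
```

The cache lives only for the lifetime of the process; persisting it between CI runs would need a file or key-value store, which is outside this sketch.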

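Best practices:

Version the test data: manage test_data.jsonl in Git so that changes can be traced across model updates.
CI integration: add runEvaluation.js to GitHub Actions with a failure threshold (for example, finalScore must be > 0.75) so that PRs falling short are blocked automatically.
Layered metrics: filter first at coarse granularity (Exact Match), then analyze in depth at fine granularity (semantic similarity).
Explainability: add "failed cases" and "triggered keywords" columns to the report so developers can locate problems quickly.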
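
A CI quality gate can be as small as a function that turns the results into an exit code. The sketch below assumes the gate is applied to the mean finalScore across all cases, which is one possible reading of the threshold; `qualityGate` is a hypothetical helper, and `Number(...)` is used because the report rows store finalScore as a formatted string.

```javascript
// Returns a process exit code: 0 when the mean finalScore is above the threshold, 1 otherwise.
function qualityGate(results, threshold = 0.75) {
  if (results.length === 0) return 1; // an empty run should not pass silently
  const scores = results.map((r) => Number(r.finalScore));
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const below = scores.filter((s) => s < threshold).length;
  console.log(`mean finalScore = ${mean.toFixed(3)}; ${below} case(s) below ${threshold}`);
  return mean > threshold ? 0 : 1;
}

// In the evaluation script, after the results are collected:
//   process.exitCode = qualityGate(results);
```

A non-zero exit code is what makes a GitHub Actions step fail, so no extra CI configuration is needed beyond running the script.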

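Real-World Application Scenarios

Scenario	Why custom metrics are needed	Possible metric combination
Customer service chatbot	Responses must contain keywords such as "order number" or "solution".	`KeywordHitMetric` + `SemanticSimilarityMetric` + `LengthPenaltyMetric`
Legal document Q&A	Synonyms are acceptable, but wrong clause numbers are not.	`ExactMatchMetric` (clause numbers) + `BLEUMetric` (wording overlap; n-gram overlap does not credit synonyms, so `SemanticSimilarityMetric` may fit the wording check better)
Medical information summaries	Dangerous misinformation must be avoided, and length must stay within 200 tokens.	`SemanticSimilarityMetric` + `LengthPenaltyMetric` + safety check (custom detection of medical terms)
Educational tutoring system	Scoring must weigh both answer completeness and solution steps.	`KeywordHitMetric` (concept terms) + `ExactMatchMetric` (key steps) + `CustomScoring` (step order)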
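
To make the legal scenario concrete, a clause-number check can compare the set of cited articles rather than the full text, so that wording may vary while the citations must agree. This is only a sketch: the "Article N" citation format, the regular expression, and the helper names `articleNumbers` and `articleMatch` are all assumptions for illustration.

```javascript
// Collects the distinct article numbers cited in a text, e.g. "Article 12" -> "12".
function articleNumbers(text) {
  const found = new Set();
  for (const match of text.matchAll(/\bArticle\s+(\d+)/gi)) {
    found.add(match[1]);
  }
  return found;
}

// 1 when the prediction cites exactly the same articles as the reference, else 0.
// Two texts that cite no articles at all also count as a match.
function articleMatch(prediction, reference) {
  const pred = articleNumbers(prediction);
  const ref = articleNumbers(reference);
  if (pred.size !== ref.size) return 0;
  for (const n of ref) {
    if (!pred.has(n)) return 0;
  }
  return 1;
}
```

A real deployment would adapt the pattern to the citation style of the documents in question, such as section symbols or sub-clause numbering.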

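Case study: a financial institution deploying an LLM as an investment advisor used KeywordHitMetric to monitor the two keywords "investment risk" and "annualized return", while using SemanticSimilarityMetric to check that the meaning was correct. With automated evaluation in CI, it reduced the error rate of model responses from 12% to under 2%.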

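Summary

Custom metrics and test data are core tools for raising the quality of an LLM product; they move iteration from "feel" to "quantified".
LangChain provides an extensible Evaluation framework. Following the four steps "data → call → metrics → report" is enough to set up your own evaluation pipeline quickly.
In practice, combining multiple metrics, batch processing, and CI integration can lower cost noticeably, improve reliability, and give the team actionable feedback.

With the examples and best practices in this article, you should now be able to:

Design test data that matches business requirements.
Implement custom metrics such as keyword hits, semantic similarity, and length penalty.
Combine the metrics into a weighted score and generate reports automatically.
Bring the evaluation flow into CI/CD so that every model update meets the quality threshold.

Only continuous iteration and continuous evaluation let an LLM deliver its full value in real applications. Best of luck on your LangChain journey, with smooth evaluations and better models! 🚀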