Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness
https://doi.org/10.20736/0002002106
| Item type | Default item type (full) (1) |
|---|---|
| Publication date | 2025-06-06 |
| Title | |
| Title | Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness |
| Language | en |
| Creators | Ying-Chu Yu; Sieh-Chuen Huang; Hsuan-Lei Shao |
| Description | |
| Description type | Abstract |
| Description | This study investigates the legal reasoning abilities of Large Language Models (LLMs) in Taiwan’s Status Law (family and inheritance law) and evaluates the effects of Chain-of-Thought (CoT) prompting on answer quality. Six essay questions from past judicial and graduate law exams were decomposed into 68 sub-questions targeting issue spotting, statutory application, legal reasoning, and property calculation. Four LLMs (ChatGPT-4o, Gemini, Copilot, and Grok3) were evaluated using a two-stage framework: decomposed sub-question accuracy (Stage 1) and full-length essay response performance with and without CoT prompting (Stage 2), with human scoring conducted by a law professor and a student. Results show that CoT prompting consistently improves legal reasoning quality across models, notably enhancing issue coverage, statutory citation accuracy, and reasoning structure. Gemini achieved the most significant accuracy gains (from 83.2% to 94.5%, p < 0.05) and was selected for detailed qualitative analysis. Beyond model-specific findings, this study contributes to retrieval evaluation research by addressing statistical consistency challenges in human scoring, proposing a diagnostic evaluation method adaptable for multilingual and multimedia legal corpora, and suggesting extensions for evaluating enterprise-level legal information systems. These findings underscore the value of structured prompting strategies in supporting more interpretable, transferable, and scalable legal AI evaluation frameworks. |
| Language | en |
| Publisher | |
| Publisher | NII Institutional Repository |
| Language | en |
| Date | |
| Date | 2025-06-06 |
| Date type | Issued |
| Language | |
| Language | eng |
| Resource type | |
| Resource type identifier | http://purl.org/coar/resource_type/c_5794 |
| Resource type | conference paper |
| ID registration | |
| ID registration | 10.20736/0002002106 |
| ID registration type | JaLC |
| Related information | |
| Relation type | isReferencedBy |
| Identifier type | URI |
| Related identifier | https://research.nii.ac.jp/ntcir/evia2025/index.html |
| Language | en |
| Related title | EVIA 2025 |
| Start page | |
| Start page | none |
| Conference description | |
| Conference name | EVIA 2025 |
| Language | en |
| Conference sequence number | 11 |
| Organizer | National Institute of Informatics |
| Language | en |
| Start year | 2025 |
| Start month | 6 |
| Start day | 10 |
| End year | 2025 |
| End month | 6 |
| End day | 10 |
| Conference dates | June 10-13, 2025 |
| Language | en |
| Venue | National Institute of Informatics |
| Language | en |
| Country | JPN |