Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness

Ying-Chu Yu; Sieh-Chuen Huang; Hsuan-Lei Shao

https://doi.org/10.20736/0002002106

File

02-EVIA2025-EVIA-YuY.pdf (1.4 MB)

Item type

Default item type (full) (1)

Publication date

2025-06-06

Title

Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness

Creators

Ying-Chu Yu
Sieh-Chuen Huang
Hsuan-Lei Shao

Description

Description type

Abstract

Description

This study investigates the legal reasoning abilities of
Large Language Models (LLMs) in Taiwan’s Status Law (family
and inheritance law) and evaluates the effects of
Chain-of-Thought (CoT) prompting on answer quality. Six
essay questions from past judicial and graduate law exams
were decomposed into 68 sub-questions targeting issue
spotting, statutory application, legal reasoning, and
property calculation. Four LLMs (ChatGPT-4o, Gemini,
Copilot, and Grok3) were evaluated using a two-stage
framework: decomposed sub-question accuracy (Stage 1) and
full-length essay response performance with and without CoT
prompting (Stage 2), with human scoring conducted by a law
professor and a student.
Results show that CoT prompting consistently improves legal
reasoning quality across models, notably enhancing issue
coverage, statutory citation accuracy, and reasoning
structure. Gemini achieved the most significant accuracy
gains (from 83.2% to 94.5%, p < 0.05) and was selected for
detailed qualitative analysis. Beyond model-specific
findings, this study contributes to retrieval evaluation
research by addressing statistical consistency challenges
in human scoring, proposing a diagnostic evaluation method
adaptable for multilingual and multimedia legal corpora,
and suggesting extensions for evaluating enterprise-level
legal information systems. These findings underscore the
value of structured prompting strategies in supporting more
interpretable, transferable, and scalable legal AI
evaluation frameworks.

Publisher

NII Institutional Repository

Date

2025-06-06

Date type

Issued

Language

eng

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_5794

Resource type

conference paper

ID registration

10.20736/0002002106

ID registration type

JaLC

Versions

Ver.1

2025-06-04 10:47:48.651758
