Heaps’ Law in GPT-Neo Large Language Model Emulated Corpora

Uyen Lai; Gurjit Randhawa; Paul Sheridan

https://doi.org/10.20736/0002001352

File: 03-EVIA2023-EVIA-LaiU.pdf (997 KB)

Item type

Default item type (full) (1)

Publication date

2023-12-12

Title

Heaps’ Law in GPT-Neo Large Language Model Emulated Corpora

Creator

Uyen Lai
Gurjit Randhawa
Paul Sheridan

Subject (scheme: Other)

corpus profiling

Subject (scheme: Other)

generative large language models

Subject (scheme: Other)

word statistics

Description type

Abstract

Description

Heaps’ law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to text generated by large language models remains unexplored. This study addresses that gap, focusing on the emulation of corpora using the GPT-Neo suite of large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using GPT-Neo models of three different parameter sizes. Our emulation strategy used the initial five words of each PubMed abstract as a prompt and instructed the model to expand the content up to the original abstract’s length. Our findings indicate that the generated corpora adhere to Heaps’ law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps’ law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize increasing model size or refining the model architecture to curtail vocabulary repetition.
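As an illustration of the relation described in the abstract, Heaps’ law is commonly written as V(n) = K · n^β, where n is the number of tokens read so far, V(n) is the number of distinct words seen, and K and β are fitted constants. The Python sketch below is not the authors’ code. It assumes a simple lowercase alphanumeric tokenizer and an ordinary least-squares fit in log-log space, neither of which is specified on this page, and the function names (`five_word_prompt`, `vocabulary_growth`, `fit_heaps`) are hypothetical.

```python
import math
import re


def five_word_prompt(abstract):
    """Return the first five whitespace-separated words of an abstract.

    Mirrors the prompting strategy described in the abstract; the exact
    word-splitting rule used by the authors is an assumption here.
    """
    return " ".join(abstract.split()[:5])


def vocabulary_growth(documents):
    """Return (n, V(n)) pairs measured after each non-empty document.

    n is the running token count and V(n) the running count of distinct
    tokens. The tokenizer (lowercase alphanumeric runs) is an assumption.
    """
    seen = set()
    total = 0
    points = []
    for doc in documents:
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        if not tokens:
            continue
        total += len(tokens)
        seen.update(tokens)
        points.append((total, len(seen)))
    return points


def fit_heaps(points):
    """Fit log V = log K + beta * log n by least squares; return (K, beta)."""
    if len(points) < 2:
        raise ValueError("need at least two (n, V) points to fit")
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all points share the same n; cannot fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

One possible use is to call `fit_heaps(vocabulary_growth(corpus))` separately on the human-authored corpus and on each emulated corpus, then compare the fitted β values; a lower β on generated text would be consistent with the vocabulary repetition the abstract mentions.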

Publisher

NII Institutional Repository

Date

2023-12-12

Date type

Issued

Language

eng

Resource type identifier

http://purl.org/coar/resource_type/c_5794

Resource type

conference paper

ID registration

10.20736/0002001352

ID registration type

JaLC

Versions

Ver.1

2023-10-26 13:03:27.440207
