Is there a benchmark to measure real effective context length? Sure, gpt-4o has ... | Hacker News


jerjerjer 3 months ago | on: Claude 4

Is there a benchmark to measure real effective context length? Sure, GPT-4o has a context window of 128k tokens, but it loses a lot from the beginning and middle.

brookst 3 months ago

Here's an older study that includes Claude 3.5: https://www.databricks.com/blog/long-context-rag-capabilitie...?

evertedsphere 3 months ago

RULER: https://arxiv.org/abs/2404.06654

NoLiMa: https://arxiv.org/abs/2502.05167

bigmadshoe 3 months ago

Model vendors often publish "needle in a haystack" benchmarks that look very good, but my subjective experience with large contexts is always bad. Maybe we need better benchmarks.
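For anyone who wants to poke at this themselves, here is a minimal sketch of how a "needle in a haystack" sweep is usually set up: hide one fact at a chosen depth in filler text, ask for it back, and repeat over several lengths and depths. Everything named here (`build_prompt`, `run_sweep`, the filler and passphrase, and the `tail_only_model` stand-in) is hypothetical and for illustration only; `ask_model` is a placeholder for whatever client you actually use. It only checks literal retrieval, which may be part of why such scores look better than day-to-day use feels.

```python
from typing import Callable, Dict, Sequence, Tuple

# Illustrative constants, not taken from any published benchmark.
FILLER = "The sky was clear and the market was quiet that day. "
NEEDLE = "The secret passphrase is 'violet-kangaroo-42'. "
QUESTION = "What is the secret passphrase?"
ANSWER = "violet-kangaroo-42"


def build_prompt(total_sentences: int, depth: float) -> str:
    """Insert NEEDLE at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * total_sentences)
    sentences = [FILLER] * total_sentences
    sentences.insert(position, NEEDLE)
    return "".join(sentences) + "\n\n" + QUESTION


def run_sweep(
    ask_model: Callable[[str], str],
    lengths: Sequence[int],
    depths: Sequence[float],
) -> Dict[Tuple[int, float], bool]:
    """Return, per (length, depth) cell, whether the reply contains ANSWER."""
    results = {}
    for n in lengths:
        for d in depths:
            reply = ask_model(build_prompt(n, d))
            results[(n, d)] = ANSWER in reply
    return results


def tail_only_model(prompt: str, window: int = 2000) -> str:
    """Toy stand-in for a model: it only 'sees' the last `window` characters."""
    visible = prompt[-window:]
    return ANSWER if ANSWER in visible else "I don't know."
```

Calling `run_sweep(tail_only_model, [10, 200], [0.0, 0.5, 1.0])` gives a grid of pass/fail cells; with the toy model, short prompts pass at every depth while long prompts pass only when the needle sits near the end. Swapping `tail_only_model` for a real client call gives the usual heatmap of length versus depth.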
