Harvey Benchmark Finds Frontier AI Completes Less Than 10% of Complex Legal Tasks

A new benchmark from the legal AI company Harvey shows that today’s most advanced AI models can’t handle most complex legal work from start to finish. The Harvey Legal Agent Benchmark, or LAB, tested frontier models on a set of end-to-end legal tasks. The result: fewer than one in ten tasks were completed successfully.

What the LAB measures

The benchmark is designed to go beyond simple question-and-answer tests. Instead, it evaluates how well AI models perform multi-step legal assignments that mimic real lawyer workflows. These might include reviewing a contract, identifying risks, drafting clauses, and producing a final analysis — all in one continuous session. Harvey’s LAB forces the model to plan, reason, and execute without human hand-holding at each step.

That rigor explains the low score. Most AI benchmarks today measure isolated skills — a model might ace a multiple-choice bar exam question but fall apart when asked to actually produce a legal memo from scratch. Harvey’s test is closer to the messy, iterative work lawyers do every day.

Why the results matter

Law firms and corporate legal departments have been pouring money into AI tools, hoping to cut costs and speed up work. The LAB results suggest that frontier models, the class that includes names like GPT-4, Claude, and Gemini, aren't ready to take over complex assignments unsupervised. Companies that deploy these models for end-to-end legal work could face serious errors, missed deadlines, or compliance risks.

That doesn't mean AI is useless in law. It can still handle narrow tasks like summarizing documents or finding relevant case law. But the benchmark draws a hard line between assistive AI and autonomous AI. Right now, the gap is wide.

What the benchmark doesn't tell us

Harvey hasn't released a breakdown of which models scored where, or what specific tasks tripped them up. The company also hasn't said how it defined “completion”: whether partial success counted or only flawless outputs made the cut. Without that detail, it's hard to know exactly how far the models are from being useful in a real law office.

Still, the headline number is stark. If the best publicly available AI fails more than nine out of ten complex legal assignments, law firms have a clear baseline to measure progress against. Future versions of the same benchmark could show whether improvements are real or just marketing hype.

Harvey plans to update the LAB periodically as models improve. The next round of results will tell us whether AI is catching up to the demands of legal work — or staying stuck below that 10% line.
