TL;DR

Constrained generation using JSON schemas can enforce structured outputs but becomes costly and error-prone on complex, nested document schemas. Pulse's experiments show compute and accuracy trade-offs, and the team is exploring schema analysis, adaptive strategies, and compilation optimizations to mitigate the issues.

What happened

Pulse engineers tested schema-guided extraction across simple and highly complex document sets. Constrained decoding, which masks tokens that would violate a target schema, works efficiently for simple regular constraints but struggles with nested JSON-like structures that behave more like context-free grammars. Libraries such as Outlines and XGrammar offer optimizations: precomputing masks for context-independent tokens and reserving runtime checks for a smaller context-dependent set can speed decoding substantially. However, non-deterministic grammar elements force parsers to track multiple parallel stack states, risking exponential state growth and heavy compute.

The team also observed that strictly enforcing schema constraints during generation can degrade extraction quality. The mask is applied to the logits after the model has computed them, so the sampled distribution shifts away from what the model would have produced unconstrained, and generation may be forced onto low-probability tokens. To address these challenges, Pulse is prototyping schema analysis tools, hybrid generation strategies, shared grammar compilation, and confidence-aware routing for uncertain extractions.
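
The distribution shift described above can be illustrated with a few lines of Python. This is a minimal sketch, not Pulse's implementation: the token names, logit values, and the allowed set are all made up for illustration. It shows how masking after the fact renormalizes probability onto tokens the model originally considered unlikely.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    top = max(logits.values())
    exps = {tok: math.exp(val - top) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def masked_softmax(logits, allowed):
    """Drop tokens outside `allowed`, then renormalize over what is left."""
    kept = {tok: val for tok, val in logits.items() if tok in allowed}
    if not kept:
        raise ValueError("mask removed every token")
    return softmax(kept)

# Hypothetical next-token logits: the model wants to write free text,
# but the schema only permits the start of a JSON value at this position.
logits = {'"': 0.5, "{": -1.0, "The": 4.0, "N/A": 3.0, "null": -2.0}
allowed = {'"', "{", "null"}

free = softmax(logits)
constrained = masked_softmax(logits, allowed)

# Probability mass the model had placed on tokens the mask forbids.
removed_mass = sum(p for tok, p in free.items() if tok not in allowed)

for tok in sorted(allowed):
    print(f"{tok!r}: unconstrained={free[tok]:.3f} constrained={constrained[tok]:.3f}")
print(f"mass removed by the mask: {removed_mass:.3f}")
```

In this made-up case almost all of the model's probability mass sits on forbidden tokens, so the winning token under the mask is one the model itself rated as unlikely.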

Why it matters

  • Schema-guided decoding is widely supported by major LLM providers, so limits in scalability and accuracy affect many production extraction pipelines.
  • Complex real-world documents (invoices, contracts, medical records) commonly contain structures that cause parser state explosion and higher inference cost.
  • Strict format enforcement can reduce downstream cleaning but may actively harm the model’s ability to extract correct content.
  • Practical remedies (schema design, hybrid workflows, shared grammar fragments) could improve throughput and reduce human review but require tooling and research.

Key facts

  • Constrained decoding masks out tokens that would violate a schema at each generation step.
  • For regular constraints, a finite state machine (FSM) can serve a precomputed token mask in O(1) time per step; regular expressions can be compiled into such FSMs.
  • Nested JSON-like schemas behave more like context-free grammars, increasing parser complexity compared with regular languages.
  • XGrammar separates tokens into context-independent and context-dependent sets to precompute masks and reduce runtime checks.
  • Non-determinism in grammars can force tracking multiple parallel stack states; in theory, converting an n-state NFA to a DFA can produce up to 2^n states.
  • Pulse tested extraction on a corpus of about ten thousand complex documents and observed both compute and quality issues.
  • Masking logits after the model computes them can shift the output distribution and force low-probability tokens, harming accuracy.
  • Pulse is developing schema analysis, adaptive constraint strategies, grammar compilation reuse, and confidence-aware extraction routing as mitigations.
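
The FSM point in the list above can be made concrete with a toy example. The sketch below assumes a hand-written three-state machine for the regular expression -?[0-9]+ and a tiny made-up vocabulary; real libraries compile the machine from the pattern and work over the model's full tokenizer. The idea it shows is that the allowed tokens for every state are computed once, so each decoding step is a single table lookup.

```python
DIGITS = "0123456789"

# States of a hand-written DFA for the pattern -?[0-9]+
START, AFTER_SIGN, IN_DIGITS = 0, 1, 2
STATES = (START, AFTER_SIGN, IN_DIGITS)

def step_char(state, ch):
    """Advance the DFA by one character; None means the character is rejected."""
    if ch == "-" and state == START:
        return AFTER_SIGN
    if ch in DIGITS:
        return IN_DIGITS
    return None

def step_token(state, token):
    """Advance the DFA over every character of a (multi-character) token."""
    for ch in token:
        state = step_char(state, ch)
        if state is None:
            return None
    return state

# Hypothetical vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["-", "7", "42", "-3", "a", "5-", "0x"]

# Precompute, for every state, the tokens that keep the DFA alive
# together with the state each one leads to.
MASKS = {}
for state in STATES:
    MASKS[state] = {}
    for token in VOCAB:
        nxt = step_token(state, token)
        if nxt is not None:
            MASKS[state][token] = nxt

def allowed_tokens(state):
    """Per-step cost is one dictionary lookup, independent of vocabulary size."""
    return MASKS[state]

print(sorted(allowed_tokens(START)))
print(sorted(allowed_tokens(IN_DIGITS)))
```

The precomputation costs one pass over states times vocabulary, which is affordable only because a regular constraint has a fixed, finite set of states. A nested schema needs a stack, so the same table cannot be built for every possible configuration.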

What to watch next

  • Schema analysis tooling that estimates compilation cost, risk of state explosion, and potential quality degradation at design time.
  • Hybrid 'generate minimally constrained, then restructure' workflows (NL-to-Format) as an alternative to direct constrained decoding.
  • Development of shared grammar fragments and a schema registry to precompile and reuse common substructures like dates and currencies.
  • Confidence-aware monitoring that flags high-perplexity extractions for human review or alternate processing.
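
The confidence-aware monitoring idea in the list above could look roughly like the following. This is a hedged sketch rather than Pulse's design: the function names, the routing labels, and the threshold of 4.0 are assumptions chosen for illustration, and the per-token log-probabilities are assumed to be available from the inference API.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical cut-off; in practice it would be tuned on labeled extractions.
REVIEW_THRESHOLD = 4.0

def route(field_name, token_logprobs, threshold=REVIEW_THRESHOLD):
    """Send high-perplexity extractions to review, accept the rest."""
    ppl = perplexity(token_logprobs)
    destination = "human_review" if ppl > threshold else "auto_accept"
    return {"field": field_name, "perplexity": ppl, "route": destination}

# Made-up log-probabilities for two extracted fields.
confident = route("invoice_total", [math.log(0.9)] * 3)
uncertain = route("vendor_name", [math.log(0.1)] * 3)
print(confident)
print(uncertain)
```

Scoring each field separately, rather than the whole document, keeps one uncertain value from sending an otherwise clean extraction to review.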

Quick glossary

  • Constrained decoding: A generation technique that restricts a model’s allowed next tokens to those consistent with a target schema or grammar.
  • Finite State Machine (FSM): A computation model with a finite number of states used to represent regular languages and to compute allowed next tokens efficiently.
  • Context-free grammar (CFG): A grammar capable of describing nested, recursive structures such as nested JSON objects and arrays; requires more complex parsers than regular grammars.
  • Nondeterministic finite automaton (NFA): An automaton where multiple transitions may be possible from a state for a given input; converting an NFA to a deterministic automaton can cause exponential growth in states.
  • Perplexity: A measure of how uncertain a model is about the tokens it generates, computed as the exponential of the average negative log-probability per token; higher perplexity can indicate the model is being forced into low-probability choices.
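
The exponential growth mentioned in the NFA entry can be reproduced with the standard subset construction. The sketch below uses a textbook example chosen for illustration, not taken from the Pulse post: an NFA over {0, 1} that accepts strings whose k-th symbol from the end is 1. The NFA needs k + 1 states, while the equivalent DFA has to remember the last k symbols.

```python
def nth_from_end_nfa(k):
    """Transitions of an NFA (states 0..k) for 'k-th symbol from the end is 1'."""
    delta = {(0, "0"): {0}, (0, "1"): {0, 1}}  # state 0 guesses where the tail starts
    for i in range(1, k):
        for sym in "01":
            delta[(i, sym)] = {i + 1}          # count the remaining symbols
    return delta

def count_dfa_states(delta, start=0, alphabet="01"):
    """Subset construction: count the reachable sets of NFA states."""
    start_set = frozenset({start})
    seen = {start_set}
    frontier = [start_set]
    while frontier:
        current = frontier.pop()
        for sym in alphabet:
            nxt = frozenset(t for s in current for t in delta.get((s, sym), ()))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

for k in range(1, 7):
    print(f"k={k}: NFA states={k + 1}, DFA states={count_dfa_states(nth_from_end_nfa(k))}")
```

Each extra NFA state doubles the number of reachable DFA states here, which is the worst-case behavior the 2^n bound describes.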

Reader FAQ

Does constrained decoding always guarantee correct structured output?
No. It guarantees the format, not the content: under strict constraints the model can still be pushed into incorrect values or low-probability tokens.

Are there efficient implementations for constrained decoding?
Yes. For regular constraints an FSM approach is efficient, and tools like XGrammar optimize by precomputing masks for context-independent tokens. Complexity still rises for nested schemas.
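
The split between context-independent and context-dependent tokens can be sketched with a toy bracket language. This is only an illustration of the general idea and is not XGrammar's actual algorithm or API: the three-character alphabet, the vocabulary, and the function names are all assumptions. A token is classified once, ahead of time, by whether its validity can be decided without looking at the parser stack.

```python
# Toy language: balanced square brackets with 'x' as filler content.
ALPHABET = set("[]x")

def classify(token):
    """Decide offline whether a token needs the stack to be validated.

    Returns (kind, needed_depth):
      "reject"      - never valid, whatever the stack holds
      "accept"      - always valid, whatever the stack holds
      "check_stack" - valid only if at least needed_depth brackets are open
    """
    depth = 0
    lowest = 0
    for ch in token:
        if ch not in ALPHABET:
            return ("reject", 0)
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            lowest = min(lowest, depth)
    if lowest == 0:
        return ("accept", 0)
    return ("check_stack", -lowest)

# Hypothetical vocabulary, classified once before decoding starts.
VOCAB = ["x", "[", "[x]", "]", "]]", "x]", "?", "[]]"]
TABLE = {tok: classify(tok) for tok in VOCAB}

def allowed_tokens(stack_depth):
    """At runtime only the 'check_stack' tokens consult the parser state."""
    allowed = []
    for tok, (kind, needed) in TABLE.items():
        if kind == "accept" or (kind == "check_stack" and stack_depth >= needed):
            allowed.append(tok)
    return allowed

print(allowed_tokens(0))
print(allowed_tokens(2))
```

The saving comes from the proportions: if most of the vocabulary falls into the two context-independent groups, the per-step stack checks touch only a small remainder.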

Can schema design reduce computational costs?
Pulse is building schema analysis tools to predict and reduce compilation time and state explosion risk, but concrete effectiveness is under development.

Will constrained decoding stop working on real documents?
Not confirmed in the source.

Source: Computational Complexity of Schema-Guided Document Extraction, by Sid and Ritvik, January 12, 2026. Excerpt: "When we started building Pulse, we assumed that structured outputs had solved the document extraction problem. Define…"
