TL;DR

A developer explored compiling Unix find's expression language into a compact bytecode to reduce per-file work and resolve as much as possible ahead of time. The article outlines a simple bytecode design, a compiler that converts infix find expressions to postfix via the shunting-yard algorithm, and examples plus ideas for peephole optimization.

What happened

The author examined the Unix find utility's expression language and designed a small compiler that emits bytecode for evaluating expressions over file trees. They propose a single-bit register machine with five opcodes (halt, not, braf, brat, action) and show how short-circuiting logical operators map to conditional branches. The compiler first transforms the shell-tokenized infix expression into postfix using the shunting-yard algorithm, synthesizing implied -a and wrapping the expression with -print when needed. Compilation produces position-independent fragments which are concatenated without branch fix-ups; branches are implemented with relative targets so fragments can be joined directly. The article includes runnable examples of emitted bytecode for several find expressions, notes that real-world implementations examined use tree-walk interpreters instead, and suggests a set of peephole optimizations to simplify redundant instructions. An online Wasm demo and a peephole optimizer update are also mentioned.

Why it matters

  • Compiling expressions can reduce per-file overhead by resolving structure and targets ahead of traversal.
  • A bytecode approach enables position-independent fragments, simplifying code generation and concatenation.
  • Short-circuiting semantics map naturally to conditional branches, allowing efficient skipping of unnecessary actions.
  • Identifying simple peephole optimizations can further cut redundant instructions and improve runtime performance.

Key facts

  • find's syntax is tokenized by the shell; an implicit -print is applied when no side-effecting actions are present.
  • Binary operators -a (AND) and -o (OR) are short-circuiting and have different precedence; parentheses alter grouping.
  • Author's bytecode machine is a one-bit register design with five opcodes: halt, not, braf, brat, action.
  • braf and brat are forward conditional branches (branch-if-false, branch-if-true) implemented with relative addresses.
  • The compiler converts infix expressions to postfix using the shunting-yard algorithm, synthesizing implied -a tokens.
  • Compilation produces fragments on a stack; fragments are concatenated directly because branches are position-independent.
  • Real-world find implementations examined (GNU, BSD, BusyBox) use tree-walk interpreters rather than a bytecode compiler.
  • Several peephole optimizations are suggested, including removing double nots, collapsing redundant jumps, and dropping branches after always-true actions.
  • Some find implementations eagerly compile regular expressions even when short-circuited, producing errors for invalid patterns even if not evaluated.

What to watch next

  • Whether maintainers of mainstream find implementations adopt a bytecode compilation strategy instead of tree-walk interpreters (not confirmed in the source).
  • Results from applying the proposed peephole optimizer to measure actual runtime improvements (not confirmed in the source).
  • The author's online Wasm demo and runnable example provided for experimentation with emitted bytecode.

Quick glossary

  • bytecode: A compact, low-level representation of a program designed to be interpreted or executed by a virtual machine.
  • shunting-yard algorithm: A method for converting infix expressions (with operators and parentheses) into postfix (RPN) form for easier compilation.
  • short-circuit evaluation: An evaluation strategy where the second operand of a logical operator is not evaluated if the first operand determines the result.
  • peephole optimizer: A small-scale optimizer that scans short sequences of instructions to find and replace inefficient or redundant patterns.
  • opcode: An instruction code that specifies an operation for a virtual machine or processor to perform.

Reader FAQ

Do existing find implementations already use this bytecode approach?
The author examined GNU, BSD, and BusyBox find implementations and found they use tree-walk interpreters rather than the bytecode approach described.

Is a full set of dedicated opcodes implemented for every find action?
Not in the described prototype: the compiler uses a single action opcode as a stand-in; a full implementation would give each action its own opcode.

Can I try the compiler or demo?
Yes; the article mentions a runnable example and an online Wasm demo that includes a peephole optimizer update.

Does the article quantify performance gains from the compiled bytecode?
Not confirmed in the source.

Unix "find" expressions compiled to bytecode December 23, 2025 In preparation for a future project, I was thinking about at the unix find utility. It operates a file system hierarchies,…

Sources

Related posts

By

Leave a Reply

Your email address will not be published. Required fields are marked *