Optical Neural Networks - Where the Matmul Is the Hardware

Strip away the optics jargon and the cleanest mental model is this: an optical neural network offloads some linear algebra into a physical system where light propagation performs the transform. The interesting engineering question is not whether photons are fast. It is where the digital-analog boundaries, calibration loops, and programmability costs move.

Scope. This post focuses on current inference-oriented ONN hardware, especially coherent Mach-Zehnder-interferometer meshes and diffractive optical systems. Several performance figures below come from different papers and counting conventions, so any GPU comparison is directional rather than apples-to-apples.

00 / Mental Model

Where the linear map becomes physical

An ONN is not “PyTorch, but with lasers.” The more accurate picture is that some learned linear transforms are compiled into a physical scattering network. In a coherent MZI mesh, phase shifters and beam splitters implement a programmable optical transform. In a passive diffractive network, the geometry of stacked phase masks does the work. Either way, the weights are no longer fetched from HBM on every inference.

GPU forward pass

Load weights, run kernels, write activations

The dominant systems story is data movement: weights come out of HBM, are staged closer to arithmetic units, a fused kernel runs, then activations go back to memory. The bottleneck is often memory bandwidth and orchestration rather than the multiply-add itself.

# each layer: data movement + compute
y = relu(x @ W + b)  # W is a tensor loaded from memory

ONN forward pass

Encode, propagate, detect

The optical core applies a programmed physical transform to an encoded optical field. For coherent meshes that transform is typically unitary or built from unitary blocks plus amplitude control. For diffractive systems it is set by the masks or metasurfaces in the beam path.

# weights live in optical elements
y = detect(propagate(encode(x)))  # "loading weights" means setting phases or fabricating masks
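To make the encode, propagate, detect picture concrete, here is a minimal sketch in plain Python. It assumes amplitude encoding with zero phase, an ideal lossless 2x2 transform (a 50:50 beam splitter chosen purely for illustration), and direct intensity detection. The function bodies and the matrix `U` are illustrative assumptions, not the API of any real photonic toolchain.

```python
import math

def encode(x):
    """Map real inputs to complex field amplitudes (amplitude encoding, zero phase)."""
    return [complex(v, 0.0) for v in x]

def propagate(field, U):
    """Apply the programmed optical transform: a plain matrix-vector product."""
    return [sum(U[i][j] * field[j] for j in range(len(field))) for i in range(len(U))]

def detect(field):
    """Photodetectors measure intensity |E|^2, not the complex field itself."""
    return [abs(e) ** 2 for e in field]

# Hypothetical programmed transform: an ideal 50:50 beam splitter.
s = 1 / math.sqrt(2)
U = [[s, 1j * s],
     [1j * s, s]]

# Light enters port 0 only; the splitter divides the power evenly.
y = detect(propagate(encode([1.0, 0.0]), U))
print(y)
```

Note that `detect` discards phase and sign. Systems that need signed outputs typically use some form of interferometric or differential readout, which this sketch leaves out.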

 Key systems insight. For the optical linear stage, the model parameters are embodied in hardware state rather than streamed from HBM on each inference. That shifts the bottleneck toward modulators, photodetectors, ADC/DAC precision, calibration, and control electronics.

Where this fits in a modern ML stack

Stack layer	Status in an ONN deployment
Your inference harness	Mostly unchanged. You still batch, schedule, validate, and monitor requests in software.
Model parameter artifacts	Compiled to phase settings, mask states, or other hardware control values instead of only tensor checkpoints.
Linear layers	Candidate for optical execution.
Nonlinearities and control logic	Usually electronic today, although some recent chips integrate limited optical nonlinear functions.

01 / Execution Model

The fast path is optical. The tax is at the boundaries.

End-to-end ONN inference is best understood as a domain-crossing pipeline. Digital inputs have to be encoded into optical amplitude or phase, the optical network applies its linear transform, and the result is measured back into the electrical domain. The optical segment is the part people find exciting. The converters are where a lot of the practical pain lives.

Inference pipeline

Input data: Digital features, token embeddings, sensor values, or image patches.

→

DAC + modulator: Encode values into optical phase, amplitude, wavelength channels, or time bins.

→

Optical propagation: Beam splitters, phase shifters, waveguides, or diffractive masks apply the learned linear map.

→

Photodetector + ADC: Measure intensity or interference outputs and bring them back to digital logic.

→

Output logits: Now you can apply thresholding, softmax, routing, or the next hybrid stage.

Practical bottleneck. Recent integrated photonic accelerator work explicitly includes TX/RX electronics, DACs, driver circuits, photodetectors, TIAs, and ADCs as core system components. For current mixed-signal ONNs, converter precision and interface overhead often limit usable accuracy more than the optics themselves do.
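A rough way to see why converters eat the accuracy budget is to wrap an otherwise exact dot product in ideal uniform quantizers. The sketch below assumes idealized DACs and ADCs with a [-1, 1] range and no noise, which is more generous than real hardware. The names `quantize` and `dot_through_converters` are hypothetical.

```python
import random

def quantize(v, bits, lo=-1.0, hi=1.0):
    """Clip v to [lo, hi] and round to the nearest of 2**bits uniform levels."""
    steps = 2 ** bits - 1
    v = min(max(v, lo), hi)
    code = round((v - lo) / (hi - lo) * steps)
    return lo + code * (hi - lo) / steps

def dot_through_converters(x, w, bits):
    """Quantize inputs (DAC), compute an exact dot product, quantize the result (ADC)."""
    xq = [quantize(v, bits) for v in x]
    y = sum(a * b for a, b in zip(xq, w)) / len(x)  # scaled to stay inside [-1, 1]
    return quantize(y, bits)

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(64)]
w = [random.uniform(-1, 1) for _ in range(64)]
exact = sum(a * b for a, b in zip(x, w)) / len(x)

# Absolute error of the converter-wrapped result at several resolutions.
errors = {bits: abs(dot_through_converters(x, w, bits) - exact) for bits in (4, 8, 12)}
print(errors)
```

Even with a perfect linear stage in the middle, the worst-case error in this toy setup scales with the converter step size, so the resolution of the boundary electronics sets a floor that the optics cannot improve on.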

What the optical core computes

A structured linear transform

In an MZI mesh, the optical field evolution is a matrix transform built from interferometers and phase shifts. In the cleanest setting that is a unitary matrix. General dense real-valued layers usually need extra decomposition steps, diagonal scaling, or hybrid surrounding circuitry.
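One standard route from a general weight matrix to unitary hardware is the singular value decomposition, which is also how Shen et al. describe mapping matrices onto their processor. As a worked equation, assuming a square real matrix W and purely passive attenuation for the diagonal stage:

```latex
\begin{aligned}
W &= U \,\Sigma\, V^{\dagger},
  \qquad U^{\dagger}U = V^{\dagger}V = I,
  \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_N),\ \sigma_i \ge 0 \\
W &= \sigma_{\max}\; U \left(\frac{\Sigma}{\sigma_{\max}}\right) V^{\dagger},
  \qquad 0 \le \frac{\sigma_i}{\sigma_{\max}} \le 1
\end{aligned}
```

Read right to left, the input passes through one mesh programmed as V†, then a row of attenuators set to σ_i / σ_max, then a second mesh programmed as U. The global factor σ_max cannot come from passive optics, so in this sketch it is assumed to be applied in electronics or absorbed into a later scaling step.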

What still stays hard

Nonlinear activation layers

ReLU, softmax, masking, and many control-heavy operations are still typically executed in electronics. Some chips now integrate optical nonlinearities, but that is not yet the dominant deployment model. For many systems, every deep layer still pays an optical-electrical round-trip.

02 / Architectures

The backend choice changes everything

“Optical neural network” names a family of hardware strategies, not a single design. The two most useful buckets for a systems reader are programmable coherent meshes and diffractive free-space systems. There is also a middle zone of reconfigurable diffractive processors that trade away some efficiency or compactness to recover programmability.

Programmable coherent photonics

MZI meshes are the tunable matmul kernel

Shen et al. demonstrated a programmable nanophotonic processor with 56 Mach-Zehnder interferometers on a vowel-recognition task, and the Clements design formalized a compact rectangular mesh that can implement arbitrary unitary transforms across channels with better robustness to loss than the earlier triangular layout.

Why people like it

Runtime programmability

“Deploying weights” means programming phase shifters or related control elements. That makes MZI hardware closer to a tunable accelerator than a fab-time-fixed optical circuit.

The catch

Calibration and scale

Phase noise, crosstalk, drift, and control complexity grow with chip size. The physics is elegant. The control plane is where scaling gets difficult.
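As a toy illustration of what the control plane has to do, the sketch below calibrates a single interferometer with an unknown static phase offset by sweeping the applied phase and watching one detector. It assumes an ideal, noiseless MZI whose bar-port power follows sin^2 of half the total internal phase (one common convention). `bar_port_power`, `estimate_offset`, and the 0.7 rad offset are hypothetical.

```python
import math

def bar_port_power(applied_phase, hidden_offset):
    """Ideal MZI transfer under one convention: power = sin^2(total_phase / 2)."""
    return math.sin((applied_phase + hidden_offset) / 2) ** 2

def estimate_offset(measure, steps=3600):
    """Sweep the applied phase over [0, 2*pi) and find the setting with maximum power.
    At the maximum the total phase equals pi, so offset = pi - applied (mod 2*pi)."""
    best = max(range(steps), key=lambda k: measure(2 * math.pi * k / steps))
    applied = 2 * math.pi * best / steps
    return (math.pi - applied) % (2 * math.pi)

true_offset = 0.7  # hypothetical fabrication-induced phase error, in radians
estimated = estimate_offset(lambda p: bar_port_power(p, true_offset))
print(estimated)
```

A real chip has to repeat something like this for every phase shifter, in the presence of noise, thermal crosstalk between neighbors, and drift over time, which is why calibration effort grows quickly with mesh size.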

Passive free-space optics

D²NNs are optical inference burned into geometry

Lin et al. introduced diffractive deep neural networks as stacks of passive diffractive layers that collectively implement learned functions at the speed of light. Once those layers are physically made, the base design behaves much more like a write-once inference artifact than a reprogrammable chip.

 Important distinction. The classic D²NN story is passive and largely fixed after fabrication. Reprogrammable spatial-light-modulator or metasurface variants exist, but the base architecture is much closer to ROM than to hot-reloaded GPU weights.

Why it is attractive

Huge parallelism in free space

Light diffracts across many spatial degrees of freedom at once, so a single optical pass can process large fields in parallel.
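The parallelism claim can be sketched with a one-dimensional toy model: a phase mask followed by free-space propagation, where every output pixel sums a spherical wavelet from every input pixel. This is a simplified, unnormalized scalar model with made-up dimensions, not a calibrated diffraction solver, and all function names are hypothetical.

```python
import cmath
import math

def free_space_matrix(n, pitch, distance, wavelength):
    """Toy scalar kernel: H[i][j] = exp(i*k*r) / r between input pixel j and output pixel i."""
    k = 2 * math.pi / wavelength
    H = []
    for i in range(n):
        row = []
        for j in range(n):
            r = math.hypot(distance, (i - j) * pitch)
            row.append(cmath.exp(1j * k * r) / r)
        H.append(row)
    return H

def diffractive_layer(field, mask_phases, H):
    """Elementwise phase mask followed by dense free-space propagation."""
    masked = [e * cmath.exp(1j * p) for e, p in zip(field, mask_phases)]
    return [sum(H[i][j] * masked[j] for j in range(len(masked))) for i in range(len(H))]

H = free_space_matrix(n=8, pitch=1.0, distance=20.0, wavelength=0.5)
mask = [0.3 * j for j in range(8)]  # hypothetical "trained" phase profile

a = [1, 0, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 1, 0, 0, 0, 0]
out_a = diffractive_layer(a, mask, H)
out_b = diffractive_layer(b, mask, H)
out_sum = diffractive_layer([p + q for p, q in zip(a, b)], mask, H)
```

Two things fall out of the sketch. A single lit input pixel reaches every output pixel, so the layer acts as a dense matrix applied in one pass. And `out_sum` equals `out_a` plus `out_b` elementwise, which is the superposition property that makes a fixed mask stack a fixed linear operator on the field.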

Why it is limiting

One mask stack, one deployed function

If the weights are embodied in fabricated masks, model updates are no longer a software deployment problem. They become a hardware replacement problem.

Middle ground

Reconfigurable diffractive processors recover flexibility

This is the compromise architecture. Instead of a permanently fixed diffractive stack, the optical processor uses reconfigurable elements, such as digital-coding metasurfaces or other optoelectronic control planes, to support different models or tasks. Zhou et al. reported a reconfigurable diffractive processing unit with millions of neurons and adaptive training to compensate system errors.

✓ more flexible than passive D²NN
~ more control overhead
~ still hybrid, not purely passive

03 / Weights & Training

Weights are no longer just tensors

The most unintuitive part for systems engineers is that “model weights” can map to phase settings, mask geometries, or other physical control states. That changes deployment, versioning, evaluation, and reproducibility. It also changes how carefully you have to talk about training: offline simulation is common, but it is no longer the only story.

What a weight means in a coherent ONN

A compiled physical configuration

In an MZI-based design, a learned matrix is decomposed into interferometer parameters and phase values. In a diffractive design, the “weights” are the transmissive or phase profile of each layer. That means deployment artifacts often include both a model checkpoint and a hardware-programming representation.

# Conceptually, one MZI is a tunable 2x2 block
def mzi(theta, phi):
    # beam splitter + phase shift
    return U(theta, phi)

# Full optical layer = product of many such blocks
# "Loading weights" = setting phases and calibration state
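For readers who want the 2x2 block spelled out, here is one way to build it from ideal components in plain Python. The ordering used (external phase, splitter, internal phase, splitter) is one convention among several in the literature, and the helpers `matmul2`, `beam_splitter`, `phase`, and `is_unitary` are names introduced here for illustration.

```python
import cmath
import math

def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def beam_splitter():
    """Ideal lossless 50:50 splitter."""
    s = 1 / math.sqrt(2)
    return [[s, 1j * s], [1j * s, s]]

def phase(p):
    """Phase shifter acting on the top arm only."""
    return [[cmath.exp(1j * p), 0], [0, 1]]

def mzi(theta, phi):
    """External phase phi, splitter, internal phase theta, splitter (applied right to left)."""
    return matmul2(beam_splitter(),
                   matmul2(phase(theta), matmul2(beam_splitter(), phase(phi))))

def is_unitary(u, tol=1e-12):
    """Check that the columns of u are orthonormal."""
    for i in range(2):
        for j in range(2):
            inner = sum(u[k][i].conjugate() * u[k][j] for k in range(2))
            if abs(inner - (1 if i == j else 0)) > tol:
                return False
    return True

block = mzi(0.3, 1.1)
print(is_unitary(block))
```

In this convention the internal phase theta sets the power split between the two outputs (theta = 0 routes everything to the cross port, theta = pi keeps it in the bar port), while phi sets a relative input phase. A full mesh is a product of such blocks acting on neighboring pairs of waveguides.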

Common deployment workflow today

step 1

Train or co-train a differentiable optical model

The common path is still digital optimization in PyTorch, JAX, or custom simulators that model the optical layer and its hardware constraints.

step 2

Compile to hardware parameters

For coherent meshes this can involve decompositions such as Clements-style parameterization plus any diagonal scaling and hardware-aware clipping.

step 3

Program or fabricate the optical system

You send voltages to phase shifters, configure a reconfigurable optical front-end, or physically realize a fixed mask stack.
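The compile and program steps can be sized with a little arithmetic. A universal N-mode unitary mesh in the Reck or Clements layout uses N(N-1)/2 interferometers, and with two phases per interferometer plus N output phases that comes to N^2 programmable phases. The sketch below counts them and then maps a phase to a driver code. The linear phase-to-code map is an assumption for illustration only, and both function names are hypothetical.

```python
import math

def mesh_budget(n_modes):
    """Count interferometers and phases for a universal N-mode unitary mesh."""
    n_mzi = n_modes * (n_modes - 1) // 2
    n_phases = 2 * n_mzi + n_modes  # equals n_modes ** 2
    return {"mzis": n_mzi, "phases": n_phases}

def phase_to_dac_code(phase, bits):
    """Hypothetical linear map from a phase in radians to an integer driver code.
    Real drivers need a measured per-device calibration curve instead."""
    levels = 2 ** bits
    wrapped = phase % (2 * math.pi)
    return round(wrapped / (2 * math.pi) * levels) % levels

budget = mesh_budget(64)
code = phase_to_dac_code(math.pi, bits=8)
print(budget, code)
```

For a 64-mode mesh this gives 2016 interferometers and 4096 phases to set and keep calibrated. Thermo-optic phase shifters, for example, respond to heater power rather than linearly to voltage, so the linear map above is a placeholder for a lookup table built during calibration.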

 Correction to a common oversimplification. It is no longer accurate to say ONN training is always offline simulation followed by one-way programming. Pai et al. experimentally demonstrated in-situ backpropagation on a silicon photonic neural network, and Bandyopadhyay et al. showed forward-only in-situ training on a fully integrated chip. Offline compilation is common. It is not the only training mode anymore.

What this means for evaluation and reproducibility

Factor	GPU assumption	ONN reality
Model artifact	Tensor checkpoint	Tensor checkpoint plus compiled phase or mask state and calibration metadata
Numeric precision	Bit-defined formats like fp32, bf16, int8	Mixed-signal, often single-digit effective bits at system level
Runtime determinism	Close to bit-exact for fixed seeds and kernels	Affected by drift, noise, bias settings, device mismatch, and readout error
Calibration	Usually not part of model versioning	Operationally important and may change output quality over time
Eval thresholds	Exact-match and narrow tolerances are common	Statistical tolerances and repeated measurements are often more defensible
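One way to act on the table above is to version the hardware state alongside the checkpoint and to evaluate against a tolerance on repeated readouts rather than an exact match. The sketch below is a minimal illustration; the `OnnArtifact` fields, the `within_tolerance` helper, and the simulated noisy readout are all hypothetical and not drawn from any existing tool.

```python
import hashlib
import random
import statistics
from dataclasses import dataclass

@dataclass(frozen=True)
class OnnArtifact:
    """Hypothetical deployment record: model checkpoint plus hardware state."""
    checkpoint_sha256: str
    phase_settings: tuple
    calibration_id: str
    calibrated_at: str

def within_tolerance(measure, reference, tol, repeats=32):
    """Compare the mean of repeated noisy readouts against a reference value."""
    samples = [measure() for _ in range(repeats)]
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    return abs(mean - reference) <= tol, mean, spread

artifact = OnnArtifact(
    checkpoint_sha256=hashlib.sha256(b"hypothetical-checkpoint-bytes").hexdigest(),
    phase_settings=(0.12, 1.57, 3.01),
    calibration_id="cal-0001",
    calibrated_at="2024-01-01T00:00:00Z",
)

# Simulated readout: the true value 0.5 plus Gaussian readout noise.
rng = random.Random(0)
ok, mean, spread = within_tolerance(lambda: 0.5 + rng.gauss(0, 0.01), reference=0.5, tol=0.01)
print(ok, mean, spread)
```

Recording the calibration identifier next to the checkpoint hash means a regression can be traced to either a model change or a hardware-state change, and reporting the spread alongside the mean makes the noise floor part of the eval result.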

04 / Performance

The numbers are impressive, but the units need adult supervision

ONN papers often report outstanding latency and energy-efficiency numbers, but comparing them directly to GPUs is tricky because precision, sparsity assumptions, workload shape, and system boundary definitions vary. The right way to read the field is: optical linear algebra can be extraordinarily efficient, but end-to-end usefulness still depends on control electronics and deployment fit.

Taichi chiplet (Science 2024): 160 TOPS/W. Reported energy efficiency for a large-scale diffractive-interference hybrid photonic chiplet with millions-of-neurons capability.

PACE system: 2.38–4.21 TOPS/W. Reported system-level efficiency for a 64 × 64 integrated photonic accelerator, depending on whether laser power is included.

FICONN (Nature Photonics 2024): 410 ps. Demonstrated latency for a three-layer fully integrated coherent optical neural network.

PACE system: 7.61 bits. Average bit accuracy reported for a 64 × 64 integrated photonic accelerator system.

Read each number by its system boundary

This is the part that trips people up. ONN papers, integrated mixed-signal accelerator papers, and vendor GPU spec sheets are usually not measuring the same thing. So compare them as architectural signals, not as interchangeable benchmark rows.

System	Reported number	What to remember
Taichi chiplet	160 TOPS/W	Paper-reported efficiency for a specialized photonic chiplet architecture optimized around optical computing.
PACE 64 × 64 accelerator	2.38 TOPS/W including lasers, 4.21 TOPS/W excluding lasers	Integrated mixed-signal system number, not just an isolated optical core.
FICONN	410 ps latency	Latency number on a compact fully integrated coherent ONN, useful for understanding the speed floor of the optical path.
H100 SXM	Up to 1,979 TFLOPS BF16/FP16 with sparsity, 3.35 TB/s HBM, up to 700 W	Vendor peak spec for a general-purpose GPU. This is the machine you already know, but its system boundary is not like-for-like with the paper-reported numbers above.
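It can help to turn the TOPS/W figures in the table into energy per operation, since 1 TOPS/W is 10^12 operations per joule. The sketch below does only that unit conversion on the numbers quoted above. It assumes nothing about what counts as an "operation" in each paper, so the outputs inherit every caveat about precision and system boundary.

```python
def joules_per_op(tops_per_watt):
    """1 TOPS/W equals 1e12 operations per joule, so energy per op is the reciprocal."""
    return 1.0 / (tops_per_watt * 1e12)

reported = {
    "Taichi": 160.0,
    "PACE incl. lasers": 2.38,
    "PACE excl. lasers": 4.21,
}

# Energy per operation in femtojoules for each paper-reported figure.
femtojoules = {name: joules_per_op(v) * 1e15 for name, v in reported.items()}

# Vendor peak throughput divided by maximum power: a spec-sheet ratio, not a measurement.
h100_peak_tflops_per_watt = 1979 / 700

print(femtojoules, h100_peak_tflops_per_watt)
```

The Taichi figure works out to 6.25 fJ per operation and the PACE figures to roughly 420 fJ and 238 fJ. The H100 ratio of about 2.8 TFLOPS/W combines a sparse peak number with a maximum power rating, so it is an orientation point rather than a measured efficiency.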

Why people keep saying ONNs dodge the memory wall

# GPU intuition
flops = 2 * N * M
bytes = N * M * dtype_size  # weight traffic
arithmetic_intensity = flops / bytes

# ONN intuition for the optical linear map
# weights are embodied in phase settings / optical structure
# per-inference weight traffic is reduced or eliminated at the optical stage
# the bottleneck shifts to modulator rate, photodetector chain, ADC/DAC, and control I/O

The claim is directionally right for the optical linear layer, but it should be phrased carefully. End-to-end ONN systems still rely on electronics, control paths, and often repeated domain conversions.
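A quick roofline-style calculation shows why weight traffic matters so much on a GPU. The sketch below counts only weight bytes (activation traffic is ignored as a simplifying assumption) and uses the H100 peak figures quoted in the table above. The function name `matvec_intensity` is hypothetical.

```python
def matvec_intensity(n, m, dtype_size, batch=1):
    """FLOPs per byte of weight traffic for an (n x m) layer applied to `batch` inputs."""
    flops = 2 * n * m * batch
    weight_bytes = n * m * dtype_size
    return flops / weight_bytes

# Ridge point: FLOPs per byte needed to be compute-bound at the quoted peak specs.
ridge = 1979e12 / 3.35e12

single = matvec_intensity(4096, 4096, dtype_size=2)             # one input, 2-byte weights
batched = matvec_intensity(4096, 4096, dtype_size=2, batch=512)

print(single, batched, ridge)
```

With 2-byte weights, a single matrix-vector product performs about 1 FLOP per weight byte, far below the ridge point of roughly 590 FLOPs per byte, so it is heavily memory-bound unless inputs are batched. An optical linear stage sidesteps that specific ratio because the weights are not re-read per inference, although its own input and output converters then set the rate.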

05 / Tradeoffs

Strong in narrow places, weak in exactly the places LLMs care about

The honest assessment is that ONNs are compelling when you can exploit fast, energy-efficient linear transforms with limited need for online updates and with a tolerance for mixed-signal imperfection. They are much less compelling when you need deep stacks of nonlinear layers, exact reproducibility, or massive reprogrammable parameter counts.

Engineering verdict

Dimension	Verdict	Why it matters
Energy per linear op	✓ strong win	Optical propagation is passive or nearly passive compared with electronic MAC arrays.
Latency for linear inference	✓ win	The optical core is extremely fast; integrated systems have already demonstrated sub-nanosecond latency in small networks.
Precision and reproducibility	× loss	Noise, converter limits, drift, and calibration state make exact-match expectations harder to defend.
Programmability	~ depends	MZI meshes are reconfigurable; passive diffractive networks are much closer to hardware-fused models.
Nonlinear depth	× loss	Hybrid optical-electronic round-trips are still the default for many architectures.
Model size scaling	× loss	Even impressive photonic demonstrations remain far below the parameter counts and memory footprints of frontier LLMs.
Training	~ partial	In-situ training exists in research prototypes, but training is not yet the field’s easiest or most mature deployment story.
Inference-only, fixed-function tasks	✓ best fit	Classification, signal processing, and front-end sensing pipelines are where ONNs look most deployable today.

 Deployment answer today. ONNs look most plausible for latency-critical or energy-sensitive inference where the linear transform dominates and the model changes slowly. They are not a general drop-in replacement for LLM training, online fine-tuning, or giant dynamically updated models.

Selected primary sources

Research behind the claims in this explainer

Shen et al. (Nature Photonics, 2017) demonstrated a programmable nanophotonic processor with 56 MZIs for optical neural inference.

Clements et al. (Optica, 2016) described a compact universal interferometer mesh used heavily in programmable photonics.

Pai et al. (Science, 2023) experimentally realized in-situ backpropagation on a silicon photonic neural network.

Bandyopadhyay et al. (Nature Photonics, 2024) demonstrated a fully integrated coherent optical neural network with forward-only in-situ training and 410 ps latency.

Xu et al. (Science, 2024) reported the Taichi photonic chiplet with 160 TOPS/W and millions-of-neurons capability.

Zhou et al. (Nature Photonics, 2021) reported a reconfigurable diffractive processing unit with adaptive training and millions of neurons.

Lin et al. (Science, 2018) introduced diffractive deep neural networks as passive optical inference stacks.

Liu et al. (Nature Electronics, 2022) demonstrated a programmable diffractive deep neural network based on a digital-coding metasurface array.

The PACE paper, “An integrated large-scale photonic accelerator with ultralow latency,” reported a 64 × 64 photonic accelerator system with 7.61-bit average accuracy and explicit TX/RX electronics in the system architecture.

NVIDIA H100 official specifications are used only as a modern GPU orientation point for memory bandwidth and tensor-throughput scale.

Farouq Oguntoye