Optical Neural Networks - Where the Matmul Is the Hardware
Strip away the optics jargon and the cleanest mental model is this: an optical neural network offloads some linear algebra into a physical system where light propagation performs the transform. The interesting engineering question is not whether photons are fast. It is where the digital-analog boundaries, calibration loops, and programmability costs end up.
00 / Mental Model
Where the linear map becomes physical
An ONN is not “PyTorch, but with lasers.” The more accurate picture is that some learned linear transforms are compiled into a physical scattering network. In a coherent MZI mesh, phase shifters and beam splitters implement a programmable optical transform. In a passive diffractive network, the geometry of stacked phase masks does the work. Either way, the weights are no longer fetched from HBM on every inference.
GPU forward pass
Load weights, run kernels, write activations
The dominant systems story is data movement: weights come out of HBM, are staged closer to arithmetic units, a fused kernel runs, then activations go back to memory. The bottleneck is often memory bandwidth and orchestration more than multiply-add itself.
ONN forward pass
Encode, propagate, detect
The optical core applies a programmed physical transform to an encoded optical field. For coherent meshes that transform is typically unitary or built from unitary blocks plus amplitude control. For diffractive systems it is set by the masks or metasurfaces in the beam path.
Where this fits in a modern ML stack
| Stack layer | Status in an ONN deployment |
|---|---|
| Your inference harness | Mostly unchanged. You still batch, schedule, validate, and monitor requests in software. |
| Model parameter artifacts | Compiled to phase settings, mask states, or other hardware control values instead of only tensor checkpoints. |
| Linear layers | Candidate for optical execution. |
| Nonlinearities and control logic | Usually electronic today, although some recent chips integrate limited optical nonlinear functions. |
01 / Execution Model
The fast path is optical. The tax is at the boundaries.
End-to-end ONN inference is best understood as a domain-crossing pipeline. Digital inputs have to be encoded into optical amplitude or phase, the optical network applies its linear transform, and the result is measured back into the electrical domain. The optical segment is the part people find exciting. The converters are where a lot of the practical pain lives.
Inference pipeline
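The three stages can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not a description of any specific chip: `quantize` is a hypothetical uniform converter standing in for both the DAC and the ADC, `optical_matvec` treats the optical core as an ideal real-valued matrix multiply, and the 8-bit widths are illustrative. Real detectors measure intensity unless coherent detection is used, which this sketch ignores.

```python
import math

def quantize(x, bits, full_scale=1.0):
    """Uniform quantizer standing in for a DAC or ADC; clips to [-full_scale, full_scale]."""
    levels = 2 ** bits - 1
    x = max(-full_scale, min(full_scale, x))
    step = 2 * full_scale / levels
    return round((x + full_scale) / step) * step - full_scale

def optical_matvec(matrix, vec):
    """Stand-in for the optical core: an ideal, noise-free linear transform."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def onn_layer(matrix, x, dac_bits=8, adc_bits=8):
    encoded = [quantize(v, dac_bits) for v in x]            # digital -> optical
    propagated = optical_matvec(matrix, encoded)            # light does the matmul
    detected = [quantize(v, adc_bits) for v in propagated]  # optical -> digital
    return [max(0.0, v) for v in detected]                  # ReLU stays electronic

# Illustrative 2x2 rotation as the programmed transform (assumed values).
c = math.cos(math.pi / 4)
rotation = [[c, -c], [c, c]]
print(onn_layer(rotation, [0.5, -0.25]))
```

Every call to `onn_layer` crosses the domain boundary twice, which is why stacking many such layers multiplies converter cost even when the middle step is nearly free.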
What the optical core computes
A structured linear transform
In an MZI mesh, the optical field evolution is a matrix transform built from interferometers and phase shifts. In the cleanest, lossless setting that transform is a unitary matrix. A general dense real-valued layer is not unitary, so it usually needs an extra decomposition step, commonly a singular value decomposition into two unitary meshes around a diagonal scaling stage, or hybrid surrounding circuitry.
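One common way to handle a non-unitary layer is a singular value decomposition, M = U · Σ · V, where U and V are unitary and Σ is diagonal. The sketch below uses the closed-form SVD of a real 2×2 matrix so that it runs on the standard library alone. The matrix values are made up, the second singular value may come out negative when the determinant is negative, and a real compiler would use a general-purpose SVD routine for larger layers. Because passive optics can only attenuate, the sketch also pulls out a global `scale` factor (an assumption about where gain is restored, for example in electronics).

```python
import math

def rot(t):
    """2x2 rotation: a real special case of the unitary block one interferometer can realize."""
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def svd_2x2(m):
    """Return (phi, (s1, s2), theta) with m = rot(phi) @ diag(s1, s2) @ rot(theta)."""
    (a, b), (c, d) = m
    e, f, g, h = (a + d) / 2, (a - d) / 2, (c + b) / 2, (c - b) / 2
    q, r = math.hypot(e, h), math.hypot(f, g)
    a1, a2 = math.atan2(g, f), math.atan2(h, e)
    return (a2 + a1) / 2, (q + r, q - r), (a2 - a1) / 2

m = [[0.9, -0.4], [0.3, 0.7]]          # illustrative dense layer
phi, (s1, s2), theta = svd_2x2(m)

# Passive stage: normalize so no attenuator needs gain; restore `scale` elsewhere.
scale = max(abs(s1), abs(s2))
attenuations = [s1 / scale, s2 / scale]

rebuilt = matmul(rot(phi), matmul([[s1, 0.0], [0.0, s2]], rot(theta)))
print(rebuilt, attenuations, scale)
```

Read right to left, the physical pipeline is: first mesh (`rot(theta)`), per-channel attenuation, second mesh (`rot(phi)`), then one scalar rescale.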
What still stays hard
Nonlinear activation layers
ReLU, softmax, masking, and many control-heavy operations are still typically executed in electronics. Some chips now integrate optical nonlinearities, but that is not yet the dominant deployment model. For many systems, every deep layer still pays an optical-electrical round-trip.
02 / Architectures
The backend choice changes everything
“Optical neural network” names a family of hardware strategies, not a single design. The two most useful buckets for a systems reader are programmable coherent meshes and diffractive free-space systems. There is also a middle zone of reconfigurable diffractive processors that trade away some efficiency or compactness to recover programmability.
Programmable coherent photonics
MZI meshes are the tunable matmul kernel
Shen et al. demonstrated a programmable nanophotonic processor with 56 Mach-Zehnder interferometers and applied it to vowel recognition. The Clements design formalized a compact rectangular mesh that can implement arbitrary unitary transforms across its channels, with better robustness to loss than earlier layouts.
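To get a feel for how these meshes scale, here is a small counting helper. It assumes the standard result that an N-mode universal mesh needs N(N-1)/2 interferometers, that each interferometer carries two phase shifters, and that N extra output phases complete the unitary. The depth figures assume the rectangular (Clements) layout has depth N and the older triangular (Reck) layout has depth 2N-3. The function name and dictionary keys are made up for this sketch.

```python
def mesh_resources(n_modes):
    """Component counts for a universal N-mode interferometer mesh."""
    mzis = n_modes * (n_modes - 1) // 2
    return {
        "mzis": mzis,
        "phase_shifters": 2 * mzis + n_modes,   # equals n_modes ** 2
        "depth_rectangular": n_modes,           # Clements-style layout
        "depth_triangular": 2 * n_modes - 3,    # Reck-style layout
    }

for n in (4, 16, 64):
    print(n, mesh_resources(n))
```

The quadratic growth in phase shifters is the hardware-side echo of a dense layer's N² weights: every one of them is a control channel that has to be driven and calibrated.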
Why people like it
Runtime programmability
“Deploying weights” means programming phase shifters or related control elements. That makes MZI hardware closer to a tunable accelerator than a fab-time-fixed optical circuit.
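A minimal model of one programmable element makes this concrete. The sketch below builds a single Mach-Zehnder interferometer from two 50:50 splitters and two phase shifters, then shows how the internal phase steers power between the two outputs. The beam-splitter convention and the placement of the phase shifters are one common textbook choice and differ between papers; `mzi` and `output_powers` are names invented here.

```python
import cmath
import math

# 50:50 beam splitter in one common convention (conventions vary by paper).
BS = [[1 / math.sqrt(2), 1j / math.sqrt(2)],
      [1j / math.sqrt(2), 1 / math.sqrt(2)]]

def matmul2(a, b):
    """Multiply two 2x2 complex matrices."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)] for i in range(2)]

def phase(angle):
    """Phase shifter acting on the top arm only."""
    return [[cmath.exp(1j * angle), 0], [0, 1]]

def mzi(theta, phi):
    """Light sees: input phase phi, splitter, internal phase theta, splitter."""
    return matmul2(BS, matmul2(phase(theta), matmul2(BS, phase(phi))))

def output_powers(theta, phi, field=(1.0, 0.0)):
    t = mzi(theta, phi)
    out = [t[i][0] * field[0] + t[i][1] * field[1] for i in range(2)]
    return [abs(v) ** 2 for v in out]

# Sweep the internal phase: the split ratio is the "weight" being programmed.
for theta in (0.0, math.pi / 2, math.pi):
    print(round(theta, 3), output_powers(theta, phi=0.0))
```

In this convention the top-port power follows sin²(θ/2), so a weight update is literally a new pair of phase settings, and the total output power stays at 1 because the element is lossless in the model.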
The catch
Calibration and scale
Phase noise, crosstalk, drift, and control complexity grow with chip size. The physics is elegant. The control plane is where scaling gets difficult.
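A small Monte Carlo sketch shows why phase errors matter more as meshes grow. It is deliberately simplified: the mesh is a brick-wall arrangement of real 2×2 rotations rather than full complex interferometers, the only imperfection is independent Gaussian error on each angle, and `sigma`, the seeds, and the mesh sizes are arbitrary illustrative choices. Crosstalk, loss, and drift are not modeled.

```python
import math
import random

def random_mesh(n, rng):
    """One angle per 2x2 block, n layers, alternating offsets (brick-wall layout)."""
    return [[rng.uniform(0, 2 * math.pi) for _ in range(layer % 2, n - 1, 2)]
            for layer in range(n)]

def apply_mesh(angles, vec):
    """Propagate a real vector through the mesh of nearest-neighbor rotations."""
    out = list(vec)
    n = len(out)
    for layer, layer_angles in enumerate(angles):
        for k, i in enumerate(range(layer % 2, n - 1, 2)):
            c, s = math.cos(layer_angles[k]), math.sin(layer_angles[k])
            out[i], out[i + 1] = c * out[i] - s * out[i + 1], s * out[i] + c * out[i + 1]
    return out

def output_error(n, sigma, seed=0):
    """Distance between ideal and phase-perturbed outputs for a unit-norm input."""
    rng = random.Random(seed)
    ideal = random_mesh(n, rng)
    noisy = [[a + rng.gauss(0.0, sigma) for a in layer] for layer in ideal]
    x = [1.0 / math.sqrt(n)] * n
    y_ideal, y_noisy = apply_mesh(ideal, x), apply_mesh(noisy, x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_ideal, y_noisy)))

for n in (4, 16, 64):
    mean_err = sum(output_error(n, sigma=0.02, seed=s) for s in range(20)) / 20
    print(n, round(mean_err, 4))
```

Because light passes through roughly N layers in an N-mode mesh, small per-element errors have more chances to accumulate, so one would expect the averaged error to rise with mesh size at a fixed `sigma`. That is the modeling-level version of the calibration burden.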
Passive free-space optics
D²NNs are optical inference burned into geometry
Lin et al. introduced diffractive deep neural networks as stacks of passive diffractive layers that collectively implement learned functions at the speed of light. Once those layers are physically made, the base design behaves much more like a write-once inference artifact than a reprogrammable chip.
Why it is attractive
Huge parallelism in free space
Light diffracts across many spatial degrees of freedom at once, so a single optical pass can process large fields in parallel.
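A one-dimensional toy version of a single diffractive layer shows the structure: a phase mask multiplies the incoming field element by element, and free-space propagation then mixes every input point into every output point. The propagation kernel below is a simplified scalar spherical wave, exp(ikr)/r, without the obliquity and normalization factors a careful diffraction model would include, and the pitch, distance, and wavelength are illustrative values chosen for this sketch.

```python
import cmath
import math

def diffractive_layer(field, mask_phases, pitch, distance, wavelength):
    """Apply a phase mask, then propagate to the next plane (simplified 1D model)."""
    n = len(field)
    k = 2 * math.pi / wavelength
    modulated = [f * cmath.exp(1j * p) for f, p in zip(field, mask_phases)]
    out = []
    for j in range(n):                      # each output point...
        total = 0j
        for i in range(n):                  # ...collects light from every input point
            r = math.hypot((j - i) * pitch, distance)
            total += modulated[i] * cmath.exp(1j * k * r) / r
        out.append(total)
    return out

field = [1.0, 0.0, 0.5, 0.0]
mask = [0.0, 1.0, 2.0, 3.0]                 # the "weights" of this layer, in radians
out = diffractive_layer(field, mask, pitch=0.4e-3, distance=30e-3, wavelength=0.75e-3)
print([round(abs(v) ** 2, 3) for v in out])  # a detector would read intensities
```

The double loop is a dense matrix-vector product in disguise. In software it costs N² operations, while in the optical system it is simply what propagation does, which is where the parallelism argument comes from.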
Why it is limiting
One mask stack, one deployed function
If the weights are embodied in fabricated masks, model updates are no longer a software deployment problem. They become a hardware replacement problem.
Middle ground
Reconfigurable diffractive processors recover flexibility
This is the compromise architecture. Instead of a permanently fixed diffractive stack, the optical processor uses reconfigurable elements, such as digital-coding metasurfaces or other optoelectronic control planes, to support different models or tasks. Zhou et al. reported a reconfigurable diffractive processing unit with millions of neurons and adaptive training to compensate for system errors.
03 / Weights & Training
Weights are no longer just tensors
The most unintuitive part for systems engineers is that “model weights” can map to phase settings, mask geometries, or other physical control states. That changes deployment, versioning, evaluation, and reproducibility. It also changes how carefully you have to talk about training: offline simulation is common, but it is no longer the only story.
What a weight means in a coherent ONN
A compiled physical configuration
In an MZI-based design, a learned matrix is decomposed into interferometer parameters and phase values. In a diffractive design, the “weights” are the transmissive or phase profile of each layer. That means deployment artifacts often include both a model checkpoint and a hardware-programming representation.
Common deployment workflow today
Train or co-train a differentiable optical model
The common path is still digital optimization in PyTorch, JAX, or custom simulators that model the optical layer and its hardware constraints.
Compile to hardware parameters
For coherent meshes this can involve decompositions such as Clements-style parameterization plus any diagonal scaling and hardware-aware clipping.
Program or fabricate the optical system
You send voltages to phase shifters, configure a reconfigurable optical front-end, or physically realize a fixed mask stack.
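The last mile of that workflow, turning trained phases into values a control board can accept, might look like the sketch below. It assumes a hypothetical driver whose integer codes map linearly onto one full 2π phase cycle, and a 12-bit code width picked for illustration. Real phase shifters are generally nonlinear in drive voltage and need a per-device calibration table, which this sketch leaves out. The matrix decomposition itself is also not shown.

```python
import math

TWO_PI = 2 * math.pi

def phases_to_codes(phases, bits=12):
    """Wrap each phase into [0, 2*pi) and round to the nearest driver code."""
    levels = 2 ** bits
    return [round((p % TWO_PI) / TWO_PI * levels) % levels for p in phases]

def codes_to_phases(codes, bits=12):
    """Phase actually realized by each code under the linear-mapping assumption."""
    levels = 2 ** bits
    return [c * TWO_PI / levels for c in codes]

def circular_error(a, b):
    """Smallest angular distance between two phases."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)

trained = [-0.1, 0.0, 1.5707, 7.0]          # illustrative trained phases, in radians
codes = phases_to_codes(trained)
realized = codes_to_phases(codes)
worst = max(circular_error(t, r) for t, r in zip(trained, realized))
print(codes, worst)
```

The worst-case rounding error under this mapping is π / 2^bits, so the code width sets a floor on how faithfully the trained matrix can be reproduced, before noise or drift enter the picture.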
What this means for evaluation and reproducibility
| Factor | GPU assumption | ONN reality |
|---|---|---|
| Model artifact | Tensor checkpoint | Tensor checkpoint plus compiled phase or mask state and calibration metadata |
| Numeric precision | Bit-defined formats like fp32, bf16, int8 | Mixed-signal, often single-digit effective bits at system level |
| Runtime determinism | Close to bit-exact for fixed seeds and kernels | Affected by drift, noise, bias settings, device mismatch, and readout error |
| Calibration | Usually not part of model versioning | Operationally important and may change output quality over time |
| Eval thresholds | Exact-match and narrow tolerances are common | Statistical tolerances and repeated measurements are often more defensible |
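One way to put the last row into practice is to replace an exact-match assertion with a check on the mean of repeated readouts. The sketch below is an assumption about how such a harness could be written, not an established ONN testing API: `within_tolerance`, the repeat count, the z multiplier, and the absolute floor are all illustrative, and the "hardware" is a simulated noisy readout.

```python
import math
import random
import statistics

def within_tolerance(measure, reference, repeats=32, z=3.0, floor=1e-3):
    """Pass if the mean of repeated measurements is within z standard errors
    of the reference, or within a small absolute floor."""
    samples = [measure() for _ in range(repeats)]
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(repeats)
    return abs(mean - reference) <= max(z * sem, floor), mean, sem

# Simulated analog readout: true value 0.42 plus Gaussian noise (assumed model).
rng = random.Random(0)
ok, mean, sem = within_tolerance(lambda: 0.42 + rng.gauss(0.0, 0.01), reference=0.42)
print(ok, round(mean, 4), round(sem, 4))
```

A check like this will occasionally fail on healthy hardware by chance, so the thresholds belong in versioned eval config next to the calibration metadata, not hard-coded in tests.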
04 / Performance
The numbers are impressive, but the units need adult supervision
ONN papers often report outstanding latency and energy-efficiency numbers, but comparing them directly to GPUs is tricky because precision, sparsity assumptions, workload shape, and system boundary definitions vary. The right way to read the field is: optical linear algebra can be extraordinarily efficient, but end-to-end usefulness still depends on control electronics and deployment fit.
Read each number by its system boundary
This is the part that trips people up. ONN papers, integrated mixed-signal accelerator papers, and vendor GPU spec sheets are usually not measuring the same thing. So compare them as architectural signals, not as interchangeable benchmark rows.
| System | Reported number | What to remember |
|---|---|---|
| Taichi chiplet | 160 TOPS/W | Paper-reported efficiency for a specialized photonic chiplet architecture optimized around optical computing. |
| PACE 64 × 64 accelerator | 2.38 TOPS/W including lasers, 4.21 TOPS/W excluding lasers | Integrated mixed-signal system number, not just an isolated optical core. |
| FICONN | 410 ps latency | Latency number on a compact fully integrated coherent ONN, useful for understanding the speed floor of the optical path. |
| H100 SXM | Up to 1,979 TFLOPS BF16/FP16 with sparsity, 3.35 TB/s HBM, up to 700 W | Vendor peak spec for a general-purpose GPU. This is the machine you already know, but it is not a like-for-like paper boundary. |
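A quick unit conversion helps when reading the table. Since 1 TOPS/W is 10^12 operations per joule, energy per operation in picojoules is simply the reciprocal of the TOPS/W figure. The sketch below applies that to the reported numbers and to a naive ratio of the H100 peak figures. That last ratio is an assumption-laden upper bound (peak sparse throughput divided by maximum board power), and an "operation" is not the same thing across these rows, so treat the output as orientation only.

```python
def picojoules_per_op(tops_per_watt):
    """1 TOPS/W = 1e12 ops per joule, so energy per op in pJ is the reciprocal."""
    return 1.0 / tops_per_watt

reported_tops_per_watt = {
    "Taichi chiplet": 160.0,
    "PACE, including lasers": 2.38,
    "PACE, excluding lasers": 4.21,
    # Naive ratio of vendor peaks: 1,979 TFLOPS (with sparsity) over 700 W.
    "H100 SXM, peak spec ratio": 1979.0 / 700.0,
}

for name, value in reported_tops_per_watt.items():
    print(f"{name}: {value:.2f} TOPS/W -> {picojoules_per_op(value):.4f} pJ/op")
```

On these figures, 160 TOPS/W works out to 6.25 fJ per operation, while the full-system PACE number and the H100 peak ratio both land within a fraction of a picojoule. That is a reminder of how much the system boundary, rather than the optics alone, decides the headline efficiency.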
Why people keep saying ONNs dodge the memory wall
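The claim is directionally right for the optical linear layer: once the weights are programmed into the optical hardware, they are not streamed from memory on every inference. It still needs careful phrasing, because end-to-end ONN systems rely on electronics, control paths, and often repeated domain conversions.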
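A back-of-envelope roofline check shows the GPU side of the argument. The sketch uses the H100 SXM vendor peaks quoted in this article (1,979 TFLOPS with sparsity, 3.35 TB/s HBM) and assumes a batch-1 matrix-vector product with 2-byte weights, counting only weight traffic. The layer size of 8192 is an arbitrary example, and real kernels, caching, and batching change the picture considerably.

```python
PEAK_FLOPS = 1979e12        # vendor peak, BF16/FP16 with sparsity
HBM_BANDWIDTH = 3.35e12     # bytes per second

def matvec_intensity(n, bytes_per_weight=2):
    """FLOPs per byte for y = W x with an n x n weight matrix, weights only."""
    flops = 2 * n * n                     # one multiply and one add per weight
    bytes_moved = bytes_per_weight * n * n
    return flops / bytes_moved

def weight_stream_seconds(n, bytes_per_weight=2):
    """Time just to read the weights once from HBM at peak bandwidth."""
    return bytes_per_weight * n * n / HBM_BANDWIDTH

machine_balance = PEAK_FLOPS / HBM_BANDWIDTH   # FLOPs the chip can do per byte read
n = 8192
print("intensity (FLOP/byte):", matvec_intensity(n))
print("machine balance (FLOP/byte):", round(machine_balance, 1))
print("weight stream time (us):", round(weight_stream_seconds(n) * 1e6, 2))
```

With one FLOP per byte against a machine balance in the hundreds, a batch-1 matvec is bound by weight movement, not arithmetic. An optical core with stationary weights removes that specific term, but not the cost of getting activations in and out.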
05 / Tradeoffs
Strong in narrow places, weak in exactly the places LLMs care about
The honest assessment is that ONNs are compelling when you can exploit fast, energy-efficient linear transforms with limited need for online updates and with a tolerance for mixed-signal imperfection. They are much less compelling when you need deep stacks of nonlinear layers, exact reproducibility, or massive reprogrammable parameter counts.
Engineering verdict
| Dimension | Verdict | Why it matters |
|---|---|---|
| Energy per linear op | ✓ strong win | Propagation through the optical core is passive or nearly passive, whereas an electronic MAC array spends switching energy on every operation. |
| Latency for linear inference | ✓ win | The optical core is extremely fast; integrated systems have already demonstrated sub-nanosecond class latency in small networks. |
| Precision and reproducibility | × loss | Noise, converter limits, drift, and calibration state make exact-match expectations harder to defend. |
| Programmability | ~ depends | MZI meshes are reconfigurable; passive diffractive networks are much closer to hardware-fused models. |
| Nonlinear depth | × loss | Hybrid optical-electronic round-trips are still the default for many architectures. |
| Model size scaling | × loss | Even impressive photonic demonstrations remain far below the parameter counts and memory footprints of frontier LLMs. |
| Training | ~ partial | In-situ training exists in research prototypes, but it is far less mature than training offline and deploying the result. |
| Inference-only, fixed-function tasks | ✓ best fit | Classification, signal processing, and front-end sensing pipelines are where ONNs look most deployable today. |
Selected primary sources
Research behind the claims in this explainer
Shen et al. (Nature Photonics, 2017) demonstrated a programmable nanophotonic processor with 56 MZIs for optical neural inference.
Clements et al. (Optica, 2016) described a compact universal interferometer mesh used heavily in programmable photonics.
Pai et al. (Science, 2023) experimentally realized in-situ backpropagation on a silicon photonic neural network.
Bandyopadhyay et al. (Nature Photonics, 2024) demonstrated a fully integrated coherent optical neural network with forward-only in-situ training and 410 ps latency.
Xu et al. (Science, 2024) reported the Taichi photonic chiplet with 160 TOPS/W and millions-of-neurons capability.
Zhou et al. (Nature Photonics, 2021) reported a reconfigurable diffractive processing unit with adaptive training and millions of neurons.
Lin et al. (Science, 2018) introduced diffractive deep neural networks as passive optical inference stacks.
Liu et al. (Nature Electronics, 2022) demonstrated a programmable diffractive deep neural network based on a digital-coding metasurface array.
“An integrated large-scale photonic accelerator with ultralow latency” (the PACE paper) reported a 64 × 64 photonic accelerator system with 7.61-bit average accuracy and explicit TX/RX electronics in the system architecture.
NVIDIA H100 official specifications are used only as a modern GPU orientation point for memory bandwidth and tensor-throughput scale.
