m/z survey
Spectrum to Structure
How this talk is organized
Setup
The task, the metrics everyone quotes, and the two benchmarks that don't agree with each other.
Five Paradigms
17 papers grouped by how they actually generate structure — translation, diffusion, flow matching, sequence diffusion, and search.
The Fair Comparison
What happens to the leaderboard once you stop trusting self-reported numbers and reproduce them yourself.
Takeaways
What actually limits accuracy across every paradigm — and what it means for our own direction.
The task, and its vocabulary
- Top-k accuracy: exact structure match (by InChIKey) within the top k ranked candidates.
- Tanimoto similarity: fingerprint overlap between predicted and true structure, a softer, partial-credit score (0–1).
- MCES: maximum common edge subgraph distance, a graph-edit-distance-style score. Lower is better.
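A minimal sketch of the first two metrics, to make the definitions concrete. It assumes candidates are already available as InChIKey strings and fingerprints as sets of on-bit indices; the function names are illustrative, not taken from any benchmark's code. MCES is left out because it needs a graph-matching solver.

```python
def top_k_hit(ranked_inchikeys, true_inchikey, k):
    """True if the true InChIKey appears among the first k ranked candidates."""
    return true_inchikey in ranked_inchikeys[:k]


def top_k_accuracy(predictions, truths, k):
    """Fraction of spectra whose true structure is in the top k.

    predictions: one ranked list of InChIKey strings per spectrum.
    truths: the true InChIKey per spectrum, in the same order.
    """
    hits = sum(top_k_hit(p, t, k) for p, t in zip(predictions, truths))
    return hits / len(truths)


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)
```

Exact match is all-or-nothing, while Tanimoto gives partial credit: two fingerprints sharing two of four distinct on-bits score 0.5 even though the structures differ.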
Seventeen papers, five paradigms
As published: Top-1 accuracy
Independently reproduced
A batched-inference bug adds zero, not negative infinity, to padded attention positions. Because MassSpecGym batches multiple spectra of the same molecule together, shorter spectra can attend to the padding and leak signal from other spectra of that molecule — inflating fingerprint quality and downstream accuracy.
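The effect of the additive mask value can be shown on a single attention row. This is a toy illustration in plain Python, not the affected codebase: `masked_softmax` and the example scores are hypothetical, and a real model would apply the same idea to batched tensors.

```python
import math


def masked_softmax(scores, is_padding, pad_value):
    """Softmax over one attention row after adding pad_value at padded slots."""
    shifted = [s + (pad_value if pad else 0.0)
               for s, pad in zip(scores, is_padding)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Two real peaks followed by two padded positions (toy values).
scores = [2.0, 1.0, 0.5, 0.5]
is_padding = [False, False, True, True]

# Adding zero leaves padded scores unchanged, so they still receive weight.
buggy = masked_softmax(scores, is_padding, pad_value=0.0)

# Adding negative infinity drives their weight to exactly zero.
fixed = masked_softmax(scores, is_padding, pad_value=float("-inf"))

leaked = sum(w for w, pad in zip(buggy, is_padding) if pad)
```

Here `leaked` is the share of attention that lands on padding under the zero-valued mask. Note that this sketch assumes each row has at least one real position; a row that is entirely padding would need separate handling.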
What actually limits accuracy
Ranking, not generation, is the bottleneck
FOAM's search encounters the true structure 68.1% of the time but ranks it #1 only 11% of the time — and 95.2% of its runs find a decoy that outscores the true answer.
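The gap between "found" and "ranked first" can be measured with a small diagnostic like the one below. This is a sketch under assumptions, not FOAM's evaluation code: each run is taken to be a list of `(candidate_id, score)` pairs plus the true id, with higher scores ranking better.

```python
def ranking_diagnostics(runs):
    """Separate coverage (truth found anywhere) from top-1 (truth ranked first).

    runs: list of (scored_candidates, truth) pairs, where scored_candidates
    is a list of (candidate_id, score) and truth is the true candidate_id.
    """
    found = 0
    ranked_first = 0
    for scored_candidates, truth in runs:
        scores = dict(scored_candidates)
        if truth not in scores:
            continue  # a generation failure: the truth never entered the pool
        found += 1
        if scores[truth] == max(scores.values()):
            ranked_first += 1  # ties for the best score count as ranked first
    n = len(runs)
    return {"coverage": found / n, "top1": ranked_first / n}
```

A large difference between `coverage` and `top1` points at the scorer rather than the generator, which is the pattern the FOAM numbers show.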
The spectrum encoder loses more signal than the decoder
MSFlow reaches 86.55% Top-1 given the ground-truth molecular descriptor, but only 44.70% given its own encoder's prediction from the spectrum — the bottleneck is conditioning, not decoding.
A "state of the art" number can hide an implementation bug
MIST+MolForge's published 31.75% Top-1 drops to 10.73% once a batched-inference masking bug is fixed. Self-reported leaderboards should be treated skeptically until independently reproduced.
Almost every method assumes the formula is already known
The formula comes from SIRIUS, MIST-CF, or precursor mass plus isotope pattern, before generation even starts. It is a load-bearing external dependency that's easy to take for granted.
Every architecture degrades on larger, more complex molecules
DeniMS: 43.7% Top-1 at 6–10 heavy atoms falls to 7.9% at 26–29. DualLGD: ~20% for molecules with ≥46 atoms. No paradigm in this review has solved this.
Optimize the ranker, not just the generator
Every paradigm in this review (translation, diffusion, flow matching, sequence diffusion, search) pours its effort into generation. Almost none seriously optimizes selection.
Lesson 01 says that emphasis is usually misplaced: the answer is often already sitting in the candidate pool. It just isn't ranked first.
Our direction: decouple the two problems explicitly. A generator's only job is coverage — place the truth somewhere in a same-formula candidate pool. A separate, learned, fragmentation-grounded discriminator does the ranking.
This also defuses the two biggest risks in this literature: oracle over-optimization (Lesson 01) and encoder→conditioning bottlenecks (Lesson 02) become problems the discriminator can catch, not silent failure modes baked into one end-to-end model.
Generate for recall.
Rank with a discriminator.
Same-formula candidate pool → learned, fragmentation-grounded scorer → calibrated fusion. A wrong generator path becomes one bad candidate, not a corrupted final answer.
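A rough sketch of that pipeline shape follows. Everything in it is hypothetical: `Candidate`, `rank_candidates`, and the `discriminator` callable are illustrative names, and the fixed-weight sum is only a stand-in for the calibrated fusion step, whose actual form is not specified here.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    structure: str          # any structure identifier, e.g. a SMILES string
    formula: str            # molecular formula of the candidate
    generator_score: float  # score assigned by whichever generator proposed it


def rank_candidates(
    pool: List[Candidate],
    target_formula: str,
    discriminator: Callable[[str], float],
    weight: float = 0.5,
) -> List[Candidate]:
    """Filter the pool to the target formula, then rank by a fused score."""
    same_formula = [c for c in pool if c.formula == target_formula]

    def fused(c: Candidate) -> float:
        # Placeholder fusion: a fixed-weight blend of the two scores.
        return weight * discriminator(c.structure) + (1 - weight) * c.generator_score

    return sorted(same_formula, key=fused, reverse=True)
```

Because ranking is a separate step, a candidate the generator overrates can still be demoted by the discriminator, and candidates from several generators can share one pool.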