Compound Identification — Related Work
m/z survey

Spectrum to Structure

De novo molecular structure elucidation from tandem mass spectra — how seventeen papers read the peaks.
17 papers
5 paradigms
2021–2026
Roadmap

How this talk is organized

I

Setup

The task, the metrics everyone quotes, and the two benchmarks that don't agree with each other.

II

Five Paradigms

17 papers grouped by how they actually generate structure — translation, diffusion, flow matching, sequence diffusion, and search.

III

The Fair Comparison

What happens to the leaderboard once you stop trusting self-reported numbers and reproduce them yourself.

IV

Takeaways

What actually limits accuracy across every paradigm — and what it means for our own direction.

Setup

The task, and its vocabulary

The Task
MS2 spectrum + formula (usually assumed known) → molecular structure?
Why it's hard: many distinct molecules (isomers) share the same formula and produce similar fragmentation patterns. The mapping from spectrum to structure is not one-to-one, and the space of chemically valid graphs is combinatorially large.
The Vocabulary
Top-k accuracy: exact structure match (by InChIKey) within the top k ranked candidates.
Tanimoto similarity: fingerprint overlap between predicted and true structure; a softer, partial-credit score (0–1).
MCES: maximum common edge subgraph distance, a graph-edit-distance-style score. Lower is better.
NPLIB1 vs. MassSpecGym: the two recurring benchmarks. NPLIB1 (GNPS/CANOPUS) often has high train/test structural similarity; MassSpecGym enforces a strict split that keeps train and test molecules at an MCES distance of at least 10. Top-1 numbers across the two are therefore not directly comparable.
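The first two metrics in the vocabulary can be sketched in a few lines of Python. This is a minimal illustration, not code from any surveyed paper: the function names are hypothetical, the identifiers are placeholders rather than real InChIKeys, and fingerprints are assumed to be given as sets of on-bit indices (in practice they would come from a cheminformatics toolkit).

```python
from typing import Sequence, Set


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of spectra whose true InChIKey is among the first k candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(1 for candidates, key in zip(ranked, truth) if key in candidates[:k])
    return hits / len(truth)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by total distinct on-bits of two fingerprints."""
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty fingerprints count as identical
    return len(a & b) / len(union)


# Toy data: placeholder identifiers, not real InChIKeys.
ranked = [["KEY-A", "KEY-B", "KEY-C"], ["KEY-D", "KEY-E", "KEY-F"]]
truth = ["KEY-B", "KEY-F"]
top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
top3 = top_k_accuracy(ranked, truth, k=3)
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5})
```

On the toy data, Top-1 is 0.0, Top-2 is 0.5, and Top-3 is 1.0, while the Tanimoto score of the two toy fingerprints is 0.4: a prediction can earn partial credit on similarity even when it is not an exact match.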
Landscape

Seventeen papers, five paradigms

The Fair Comparison, Part 1

As published: Top-1 accuracy

Self-reported by each paper's own authors, using different pretraining corpora, candidate-pool sizes, and evaluation protocols. Not an apples-to-apples comparison.
NPLIB1
Top-1 accuracy, as reported
MSFlow: 44.70%
DualLGD: 34.37%
FRIGID: 25.03%
MBGen: 12.20%
FlowMS: 9.15%
MSAnchor (CANOPUS): 8.51%
DiffMS: 8.34%
MS-BART: 7.45%
MassSpecGym
Top-1 accuracy, as reported
MSFlow: 32.00%
MIST+MolForge: 30.97% *
DualLGD: 23.89%
FRIGID: 18.29%
MBGen: 7.58%
MolSpecFlow: 3.11%
MSAnchor: 2.68%
DiffMS: 2.30%
FOAM: 1.50%
MS-BART: 1.07%
* flagged — see next slide for why this number needs a correction.
The Fair Comparison, Part 2

Independently reproduced

Source: FRIGID paper, Table 1, in which every baseline was reimplemented and run under identical conditions.
NPLIB1
Top-1 accuracy, reproduced
FRIGID: 25.03%
FRIGID-base: 19.80%
DiffMS: 8.34%
MS-BART: 7.45%
MIST+MolForge: 2.24%
Spec2Mol: 0.00%
MassSpecGym
Top-1 accuracy, reproduced
FRIGID: 18.29%
FRIGID-base: 16.09%
MIST+MolForge: 10.73%
DiffMS: 2.30%
FOAM: 1.50%
MS-BART: 1.07%
Spec2Mol: 0.00%
Case study — MIST+MolForge, MassSpecGym Top-1

A batched-inference bug adds zero, not negative infinity, to padded attention positions. Because MassSpecGym batches multiple spectra of the same molecule together, shorter spectra can attend to the padding and leak signal from other spectra of that molecule — inflating fingerprint quality and downstream accuracy.

as published: 31.75%
corrected: 10.73%
— FRIGID, Appendix A.3
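The class of bug described in this case study can be illustrated with a generic additive attention mask. The sketch below is not the MIST+MolForge code; the helper names and the toy scores are assumptions chosen only to show why adding zero to padded positions lets them keep attention weight, while adding negative infinity removes them.

```python
import math
from typing import List

NEG_INF = float("-inf")


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; positions at -inf receive exactly zero weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores: List[float], is_padding: List[bool], mask_value: float) -> List[float]:
    """Add mask_value to the raw score of every padded position, then normalize."""
    masked = [s + mask_value if pad else s for s, pad in zip(scores, is_padding)]
    return softmax(masked)


# Toy example: two real peaks followed by two padded slots in a batch.
scores = [2.0, 1.0, 0.5, 0.5]
is_padding = [False, False, True, True]

buggy = attention_weights(scores, is_padding, mask_value=0.0)      # mask has no effect
fixed = attention_weights(scores, is_padding, mask_value=NEG_INF)  # padding is ignored
```

With the zero mask, the padded slots keep nonzero weight, so whatever content sits in those slots can influence the output. With the negative-infinity mask, their weight is exactly zero and the real positions share all of the attention.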
Synthesis

What actually limits accuracy

01

Ranking, not generation, is the bottleneck

FOAM's search encounters the true structure 68.1% of the time but ranks it #1 only 11% of the time — and 95.2% of its runs find a decoy that outscores the true answer.
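The gap between encountering the truth and ranking it first can be measured with a small diagnostic. The sketch below is an assumption about how such rates could be computed from per-run candidate scores; it is not FOAM's evaluation code, the function name is hypothetical, and the toy runs are made up.

```python
from typing import Dict, List


def ranking_diagnostics(runs: List[Dict[str, float]], truths: List[str]) -> Dict[str, float]:
    """Each run maps candidate identifier -> score, where higher is better."""
    found = top1 = outscored = 0
    for scores, truth in zip(runs, truths):
        if truth not in scores:
            continue  # the search never encountered the true structure
        found += 1
        best_decoy = max(
            (s for cand, s in scores.items() if cand != truth),
            default=float("-inf"),
        )
        if scores[truth] > best_decoy:
            top1 += 1
        else:
            outscored += 1  # ties are counted against the truth
    n = len(truths)
    return {
        "coverage": found / n if n else 0.0,
        "top1": top1 / n if n else 0.0,
        "outscored_given_found": outscored / found if found else 0.0,
    }


# Toy runs: the truth is found twice but ranked first only once.
runs = [{"A": 0.9, "B": 0.7}, {"C": 0.4, "D": 0.8}, {"E": 0.5}]
truths = ["A", "C", "Z"]
report = ranking_diagnostics(runs, truths)
```

Reporting coverage next to Top-1 separates generation failures (the truth is never proposed) from ranking failures (the truth is proposed but a decoy scores higher).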

02

The spectrum encoder loses more signal than the decoder

MSFlow reaches 86.55% Top-1 given the ground-truth molecular descriptor, but only 44.70% given its own encoder's prediction from the spectrum — the bottleneck is conditioning, not decoding.

03

A "state of the art" number can hide an implementation bug

MIST+MolForge's published 31.75% Top-1 drops to 10.73% once a batched-inference masking bug is fixed. Self-reported leaderboards should be treated skeptically until independently reproduced.

04

Almost every method assumes the formula is already known

The formula comes from SIRIUS, MIST-CF, or the precursor mass plus isotope pattern, before generation even starts. This is a load-bearing external dependency that's easy to take for granted.
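To make the formula dependency concrete, here is a minimal sketch of the simplest consistency check: comparing a candidate formula's monoisotopic mass against the precursor m/z. It assumes a singly protonated [M+H]+ ion, covers only C, H, N, and O, and ignores the isotope pattern entirely; the function names are hypothetical and this is not how SIRIUS or MIST-CF work internally.

```python
import re
from typing import Dict

# Monoisotopic masses in daltons (rounded); only a few elements for illustration.
MONOISOTOPIC = {"C": 12.0, "H": 1.007825032, "N": 14.003074004, "O": 15.994914620}
PROTON = 1.007276467  # mass added by protonation, assuming an [M+H]+ adduct


def parse_formula(formula: str) -> Dict[str, int]:
    """Parse a flat formula such as 'C6H12O6' into element counts."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


def ppm_error(formula: str, precursor_mz: float) -> float:
    """Signed mass error in parts per million against the theoretical [M+H]+ m/z."""
    theoretical = monoisotopic_mass(formula) + PROTON
    return (precursor_mz - theoretical) / theoretical * 1e6


mass = monoisotopic_mass("C6H12O6")          # neutral monoisotopic mass
error = ppm_error("C6H12O6", 181.0707)       # hypothetical observed precursor m/z
```

A check like this narrows the formula but says nothing about structure: every isomer of C6H12O6 passes it equally, which is why the formula is only the starting point for generation.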

05

Every architecture degrades on larger, more complex molecules

DeniMS: 43.7% Top-1 at 6–10 heavy atoms falls to 7.9% at 26–29. DualLGD: ~20% for molecules with ≥46 atoms. No paradigm in this review has solved this.
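A size breakdown like the one quoted above can be produced by bucketing per-molecule results by heavy-atom count. The sketch below is illustrative only: the function name is hypothetical, heavy-atom counts are assumed to be precomputed, and the toy results are invented rather than taken from DeniMS or DualLGD.

```python
from typing import Dict, Iterable, List, Tuple


def top1_by_size(
    results: Iterable[Tuple[int, bool]],
    bins: List[Tuple[int, int]],
) -> Dict[str, float]:
    """results holds (heavy_atom_count, top1_correct); bins are inclusive (low, high) ranges."""
    totals = {b: 0 for b in bins}
    hits = {b: 0 for b in bins}
    for n_atoms, correct in results:
        for low, high in bins:
            if low <= n_atoms <= high:
                totals[(low, high)] += 1
                hits[(low, high)] += int(correct)
                break  # each molecule is counted in at most one bin
    return {
        f"{low}-{high}": hits[(low, high)] / totals[(low, high)]
        for low, high in bins
        if totals[(low, high)]
    }


# Invented toy results: small molecules are solved more often than large ones.
results = [(7, True), (9, True), (8, False), (27, False), (28, False), (26, True)]
by_size = top1_by_size(results, bins=[(6, 10), (26, 29)])
```

Reporting accuracy per size bin rather than as a single average keeps a benchmark dominated by small molecules from hiding the degradation on large ones.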

What This Means For Us

Optimize the ranker, not just the generator

Every paradigm in this review — translation, diffusion, flow matching, retrieval, search — pours its effort into generation. Almost none seriously optimizes selection.

Lesson 01 says the opposite is usually true: the answer is often already sitting in the candidate pool. It just isn't ranked first.

Our direction: decouple the two problems explicitly. A generator's only job is coverage — place the truth somewhere in a same-formula candidate pool. A separate, learned, fragmentation-grounded discriminator does the ranking.

This also defuses the two biggest risks in this literature: oracle over-optimization (Lesson 01) and encoder→conditioning bottlenecks (Lesson 02) become problems the discriminator can catch, not silent failure modes baked into one end-to-end model.

Verifier-centric, coverage-first

Generate for recall.
Rank with a discriminator.

Same-formula candidate pool → learned, fragmentation-grounded scorer → calibrated fusion. A wrong generator path becomes one bad candidate, not a corrupted final answer.
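The pipeline above can be sketched as a filter, a scorer, and a fusion step. Everything here is a hypothetical illustration of the proposed direction: the `Candidate` fields, the stub discriminator, and the fixed fusion weight are assumptions, and the weighted sum stands in for a real calibration step, which would first put both scores on comparable scales.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    smiles: str
    formula: str
    generator_logprob: float  # how much the generator liked this candidate


def rank_candidates(
    pool: List[Candidate],
    target_formula: str,
    discriminator: Callable[[Candidate], float],
    weight: float = 0.5,
) -> List[Candidate]:
    """Keep same-formula candidates, then sort by a fused score (higher is better)."""
    same_formula = [c for c in pool if c.formula == target_formula]

    def fused(c: Candidate) -> float:
        # Placeholder fusion: a plain weighted sum, not a calibrated model.
        return weight * discriminator(c) + (1.0 - weight) * c.generator_logprob

    return sorted(same_formula, key=fused, reverse=True)


# Toy pool: two C2H6O isomers plus one candidate with the wrong formula.
pool = [
    Candidate("CCO", "C2H6O", -0.2),
    Candidate("COC", "C2H6O", -0.1),
    Candidate("CCN", "C2H7N", -0.05),
]
stub_scores = {"CCO": 0.9, "COC": 0.1}  # stand-in for a learned discriminator
ranked_pool = rank_candidates(pool, "C2H6O", lambda c: stub_scores[c.smiles])
ranked_smiles = [c.smiles for c in ranked_pool]
```

In the toy pool the generator slightly prefers `COC`, but the discriminator's score moves `CCO` to the top, and the wrong-formula candidate never reaches the ranker. That is the intended division of labor: the generator supplies coverage, and the discriminator decides the order.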
