Compound Identification — Related Work

m/z survey

Spectrum to Structure

De novo molecular structure elucidation from tandem mass spectra — how seventeen papers read the peaks.

17 papers

5 paradigms

2021–2026

Roadmap

How this talk is organized

I

Setup

The task, the metrics everyone quotes, and the two benchmarks that don't agree with each other.

II

Five Paradigms

17 papers grouped by how they actually generate structure — translation, diffusion, flow matching, sequence diffusion, and search.

III

The Fair Comparison

What happens to the leaderboard once you stop trusting self-reported numbers and reproduce them yourself.

IV

Takeaways

What actually limits accuracy across every paradigm — and what it means for our own direction.

Setup

The task, and its vocabulary

The Task

MS2 spectrum + formula (usually assumed known) → ? → molecular structure

Why it's hard: many distinct molecules (isomers) share the same formula and produce similar fragmentation patterns. The mapping from spectrum to structure is not one-to-one, and the space of chemically valid graphs is combinatorially large.

The Vocabulary

Top-k accuracy: exact structure match (by InChIKey) within the top k ranked candidates.
Tanimoto similarity: fingerprint overlap between predicted and true structure — a softer, partial-credit score (0–1).
MCES: maximum common edge subgraph distance — a graph-edit-distance-style score. Lower is better.

NPLIB1 vs. MassSpecGym: the two recurring benchmarks. NPLIB1 (GNPS/CANOPUS) often has high train/test structural similarity; MassSpecGym enforces a strict structure-disjoint split in which every test molecule is at least MCES distance 10 from any training molecule. Top-1 numbers across the two are therefore not directly comparable.
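As a rough illustration of the first two metrics in the vocabulary, the sketch below computes Top-k accuracy and Tanimoto similarity. It assumes candidates have already been reduced to comparable string keys (for example InChIKeys) and that fingerprints are given as sets of on-bit indices. The function names and toy data are illustrative and not taken from any of the reviewed papers.

```python
def top_k_hit(ranked_keys, true_key, k):
    """Return True if the true structure's key is among the first k ranked candidates."""
    return true_key in ranked_keys[:k]


def top_k_accuracy(predictions, k):
    """predictions: list of (ranked_keys, true_key) pairs, one per spectrum."""
    if not predictions:
        return 0.0
    hits = sum(top_k_hit(ranked, truth, k) for ranked, truth in predictions)
    return hits / len(predictions)


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty fingerprints count as identical
    return len(a & b) / len(union)


# Toy data: the keys are placeholders, not real InChIKeys.
preds = [
    (["KEY-A", "KEY-B", "KEY-C"], "KEY-A"),  # correct at rank 1
    (["KEY-D", "KEY-E", "KEY-F"], "KEY-E"),  # correct at rank 2
    (["KEY-G", "KEY-H", "KEY-I"], "KEY-Z"),  # never proposed
]
top1 = top_k_accuracy(preds, k=1)
top3 = top_k_accuracy(preds, k=3)
sim = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})
```

In this toy set one of three spectra is solved at rank 1 and two of three within the top 3, which is why Top-1 and Top-k can tell very different stories about the same model.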

Landscape

Seventeen papers, five paradigms

The Fair Comparison, Part 1

As published: Top-1 accuracy

Self-reported by each paper's own authors, using different pretraining corpora, candidate-pool sizes, and evaluation protocols. Not an apples-to-apples comparison.

NPLIB1
Top-1 accuracy, as reported

MSFlow: 44.70%
DualLGD: 34.37%
FRIGID: 25.03%
MBGen: 12.20%
FlowMS: 9.15%
MSAnchor (CANOPUS): 8.51%
DiffMS: 8.34%
MS-BART: 7.45%

MassSpecGym
Top-1 accuracy, as reported

MSFlow: 32.00%
MIST+MolForge: 30.97% *
DualLGD: 23.89%
FRIGID: 18.29%
MBGen: 7.58%
MolSpecFlow: 3.11%
MSAnchor: 2.68%
DiffMS: 2.30%
FOAM: 1.50%
MS-BART: 1.07%

* flagged — see next slide for why this number needs a correction.

The Fair Comparison, Part 2

Independently reproduced

Source: FRIGID paper, Table 1. Every baseline was reimplemented and run under identical conditions.

NPLIB1
Top-1 accuracy, reproduced

FRIGID: 25.03%
FRIGID-base: 19.80%
DiffMS: 8.34%
MS-BART: 7.45%
MIST+MolForge: 2.24%
Spec2Mol: 0.00%

MassSpecGym
Top-1 accuracy, reproduced

FRIGID: 18.29%
FRIGID-base: 16.09%
MIST+MolForge: 10.73%
DiffMS: 2.30%
FOAM: 1.50%
MS-BART: 1.07%
Spec2Mol: 0.00%

Case study — MIST+MolForge, MassSpecGym Top-1

A batched-inference bug adds zero, rather than negative infinity, to the attention logits at padded positions, so the padding is never actually masked out. Because MassSpecGym batches multiple spectra of the same molecule together, shorter spectra can attend to the padding and pick up signal leaked from other spectra of that molecule, inflating fingerprint quality and downstream accuracy.

As published: 31.75% → corrected: 10.73%

Source: FRIGID, Appendix A.3
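The mechanism behind this kind of bug can be shown with a toy softmax. The sketch below is not the MIST+MolForge code; it is a minimal pure-Python illustration, with made-up logits, of how an additive padding bias of 0 leaves attention mass on padded positions while a bias of negative infinity removes it.

```python
import math


def softmax(logits):
    """Numerically stable softmax; assumes at least one logit is finite."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(logits, is_padding, pad_bias):
    """Add pad_bias to the logits of padded positions, then apply softmax."""
    biased = [x + pad_bias if pad else x for x, pad in zip(logits, is_padding)]
    return softmax(biased)


# One query attending over 4 key positions; the last two are padding.
# The logit values are invented for illustration.
logits = [2.0, 1.0, 0.5, 0.5]
is_padding = [False, False, True, True]

buggy = attention_weights(logits, is_padding, pad_bias=0.0)        # padding not masked
fixed = attention_weights(logits, is_padding, pad_bias=-math.inf)  # padding masked out

# Attention mass that lands on padded positions in each case.
leaked_buggy = sum(w for w, pad in zip(buggy, is_padding) if pad)
leaked_fixed = sum(w for w, pad in zip(fixed, is_padding) if pad)
```

With a zero bias roughly a quarter of the attention mass in this toy case falls on padded slots. If those slots hold content from other sequences in the batch, that content flows into the output, which is the kind of leak the case study describes.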

Synthesis

What actually limits accuracy

01

Ranking, not generation, is the bottleneck

FOAM's search encounters the true structure 68.1% of the time but ranks it #1 only 11% of the time — and 95.2% of its runs find a decoy that outscores the true answer.
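The three quantities in this lesson (how often the search encounters the truth, how often the truth is ranked first, and how often a decoy outscores it) can be computed from per-run logs along the following lines. This is a hedged sketch: the log layout, field names, and toy scores are assumptions made for illustration, not FOAM's actual data format.

```python
def ranking_diagnostics(runs):
    """Summarize where a search-and-rank pipeline loses accuracy.

    Each run is a dict (hypothetical layout) with:
      'scores'      : candidate key -> score for every candidate the search visited
      'truth'       : key of the true structure
      'truth_score' : score the same scorer assigns to the true structure
    Higher scores are assumed to be better.
    """
    n = len(runs)
    if n == 0:
        return {"found": 0.0, "top1": 0.0, "decoy_outscores_truth": 0.0}
    found = top1 = decoy = 0
    for run in runs:
        scores, truth = run["scores"], run["truth"]
        others = [s for key, s in scores.items() if key != truth]
        if truth in scores:
            found += 1
            # Strict top-1: the truth must beat every other visited candidate.
            if all(s < scores[truth] for s in others):
                top1 += 1
        # A decoy "wins" if any visited non-truth candidate scores above the truth.
        if any(s > run["truth_score"] for s in others):
            decoy += 1
    return {"found": found / n, "top1": top1 / n, "decoy_outscores_truth": decoy / n}


# Toy logs: 'T' is the true structure, 'D*' are decoys.
runs = [
    {"scores": {"T": 0.9, "D1": 0.5}, "truth": "T", "truth_score": 0.9},
    {"scores": {"T": 0.6, "D1": 0.8}, "truth": "T", "truth_score": 0.6},
    {"scores": {"D1": 0.7, "D2": 0.4}, "truth": "T", "truth_score": 0.5},
    {"scores": {"D1": 0.3, "T": 0.7, "D2": 0.9}, "truth": "T", "truth_score": 0.7},
]
stats = ranking_diagnostics(runs)
```

A large gap between `found` and `top1` points at the scorer rather than the search, which is the pattern this lesson reports.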

02

The spectrum encoder loses more signal than the decoder

MSFlow reaches 86.55% Top-1 given the ground-truth molecular descriptor, but only 44.70% given its own encoder's prediction from the spectrum — the bottleneck is conditioning, not decoding.

03

A "state of the art" number can hide an implementation bug

MIST+MolForge's published 31.75% Top-1 drops to 10.73% once a batched-inference masking bug is fixed. Self-reported leaderboards should be treated skeptically until independently reproduced.

04

Almost every method assumes the formula is already known

The formula comes from SIRIUS, MIST-CF, or the precursor mass plus isotope pattern, before generation even starts. It is a load-bearing external dependency that is easy to take for granted.
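To make the precursor-mass part of that dependency concrete, the sketch below checks which candidate formulas are consistent with an observed [M+H]+ m/z within a ppm tolerance. It is a deliberately simplified illustration: it ignores isotope patterns and other adducts, handles only C, H, N, and O, and is not how SIRIUS or MIST-CF work internally. The helper names, the 5 ppm tolerance, and the example m/z are assumptions.

```python
import re

# Monoisotopic masses in Da for a few common elements (standard values, rounded).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON_MASS = 1.00727647  # Da, mass added by protonation ([M+H]+)


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into element counts (no brackets or charges)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Neutral monoisotopic mass of a formula built from the elements in MONOISOTOPIC."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


def ppm_error(observed_mz, formula):
    """Signed ppm error between an observed [M+H]+ m/z and the value expected for the formula."""
    expected = monoisotopic_mass(formula) + PROTON_MASS
    return (observed_mz - expected) / expected * 1e6


def consistent_formulas(observed_mz, formulas, tol_ppm=5.0):
    """Keep only the formulas whose expected [M+H]+ m/z lies within tol_ppm of the observation."""
    return [f for f in formulas if abs(ppm_error(observed_mz, f)) <= tol_ppm]


# Hypothetical observation close to the [M+H]+ of C6H12O6.
candidates = ["C6H12O6", "C7H16O5", "C5H12N2O5"]
kept = consistent_formulas(181.0707, candidates)
```

Even when the mass check leaves a single formula, every isomer of that formula passes it equally well, so the structure problem itself is untouched.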

05

Every architecture degrades on larger, more complex molecules

DeniMS: 43.7% Top-1 at 6–10 heavy atoms falls to 7.9% at 26–29. DualLGD: ~20% for molecules with ≥46 atoms. No paradigm in this review has solved this.

What This Means For Us

Optimize the ranker, not just the generator

◆ Every paradigm in this review (translation, diffusion, flow matching, retrieval, search) pours its effort into generation. Almost none seriously optimizes selection.

◆ Lesson 01 says the opposite is usually true: the answer is often already sitting in the candidate pool. It just isn't ranked first.

◆ Our direction: decouple the two problems explicitly. A generator's only job is coverage: place the truth somewhere in a same-formula candidate pool. A separate, learned, fragmentation-grounded discriminator does the ranking.

◆ This also defuses the two biggest risks in this literature: oracle over-optimization (Lesson 01) and encoder→conditioning bottlenecks (Lesson 02) become problems the discriminator can catch, not silent failure modes baked into one end-to-end model.

Verifier-centric, coverage-first

Generate for recall.
Rank with a discriminator.

Same-formula candidate pool → learned, fragmentation-grounded scorer → calibrated fusion. A wrong generator path becomes one bad candidate, not a corrupted final answer.
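The pipeline shape described here can be sketched in a few lines. Everything in the example is a placeholder: the candidate fields, the two scorers, and the fixed fusion weights stand in for a learned discriminator and calibrated fusion, neither of which is specified in this talk.

```python
def rank_candidates(candidates, target_formula, scorers, weights):
    """Filter to same-formula candidates, then rank by a weighted sum of scorer outputs.

    candidates : list of dicts with at least 'id' and 'formula' keys
    scorers    : list of callables mapping a candidate to a score in [0, 1]
    weights    : one non-negative weight per scorer (a stand-in for calibrated fusion)
    """
    pool = [c for c in candidates if c["formula"] == target_formula]

    def fused(candidate):
        return sum(w * score(candidate) for w, score in zip(weights, scorers))

    return sorted(pool, key=fused, reverse=True)


# Hypothetical candidates with precomputed scores, purely for illustration.
candidates = [
    {"id": "A", "formula": "C6H12O6", "gen_likelihood": 0.90, "frag_match": 0.20},
    {"id": "B", "formula": "C6H12O6", "gen_likelihood": 0.40, "frag_match": 0.90},
    {"id": "C", "formula": "C7H16O5", "gen_likelihood": 0.95, "frag_match": 0.95},
]
scorers = [
    lambda c: c["gen_likelihood"],  # how strongly the generator preferred the candidate
    lambda c: c["frag_match"],      # how well it explains the observed fragments
]
ranked = rank_candidates(candidates, "C6H12O6", scorers, weights=[0.3, 0.7])
ranked_ids = [c["id"] for c in ranked]
```

In this toy case the generator's favorite same-formula candidate ("A") ends up second because it explains the fragments poorly, and the wrong-formula candidate ("C") never enters the pool.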
