Research

Preprint
ML for Molecule Analysis

A task-specific transfer learning approach to enhancing small molecule retention time prediction with limited data

Yuhui Hong & Haixu Tang (2025).

bioRxiv, 2025.06.26.661631.

TSTL (Task-Specific Transfer Learning) is a training strategy for predicting retention times across LC systems when training data are limited. Evaluated on 6 benchmark datasets from different LC systems with 5 deep neural network architectures, TSTL increased the average R² from 0.587 to 0.825 and improved data efficiency.

Reliable ML in Microbiome

Confounder-free predictive models for microbiome-based host phenotype prediction

Mahsa Monshizadeh*, Yuhui Hong*, Yuzhen Ye (2025). * equal contribution

bioRxiv, 2025.01.29.635502.

Confounding factors such as medications can severely bias microbiome-based disease predictions, leading to spurious associations. This study develops confounder-free models that use adversarial optimization to remove these biases while preserving true phenotype-microbiome associations. Tested on type 2 diabetes data with metformin as the confounder, both FNN_CF and MicroKPNN_CF outperformed conventional approaches by identifying genuine disease markers.

2025
ML for Molecule Analysis

FIDDLE: a deep learning method for chemical formulas prediction from tandem mass spectra

Yuhui Hong, Sujun Li, Yuzhen Ye, & Haixu Tang (2025).

Nature Communications, 16(1), 11102.

FIDDLE (Formula IDentification by Deep LEarning) is a deep learning method for identifying chemical formulas from MS/MS data. It is trained on over 38,000 molecules and 1 million MS/MS spectra collected under varying conditions, including different collision energies and precursor types, on Quadrupole Time-of-Flight (QTOF) and Orbitrap instruments.

ML for Molecule Analysis, Survey

Machine learning in small-molecule mass spectrometry

Yuhui Hong, Yuzhen Ye, & Haixu Tang (2025).

Annual Review of Analytical Chemistry, 18.

Small-molecule mass spectrometry can only identify compounds already in reference libraries, leaving billions of molecules uncharacterized. Machine learning is changing this by: (1) predicting spectra and properties to expand virtual libraries, (2) automating spectral matching, and (3) enabling direct structure prediction from spectra. This review examines the deep learning methods driving this shift from library matching to de novo prediction, finally enabling identification of the metabolome's dark matter.

ML for Molecule Analysis, Patent

Method of predicting MS/MS spectra and properties of chemical compounds

Haixu Tang, Yuhui Hong, & Sujun Li (2025).

US Patent, US20250356958A1.

This patent covers methods and systems for predicting molecular properties from 3D molecular conformers. The method generates a 3D molecular input point set from compound information, convolves it through stacked layers to encode the chemical compound, and produces a report of predicted properties such as MS/MS spectra.

ML for Molecule Analysis

Koina: democratizing machine learning for proteomics research

Ludwig Lautenbacher, Kevin L Yang, Tobias Kockmann, Christian Panse, Wassim Gabriel, Dulguun Bold, Elias Kahl, Matthew Chambers, Brendan X MacLean, Kai Li, Fengchao Yu, Brian C Searle, Damien Beau Wilburn, Mohammad Reza Zare Shahneh, Yuhui Hong, Haixu Tang, Mingxun Wang, Ralf Gabriels, Robbin Bouwmeester, Robbe Devreese, Jesse Angelis, Eduard Sabidó, Tobias K Schmidt, Alexey I Nesvizhskii, Mathias Wilhelm (2025).

Nature Communications, 16(1), 9933.

Koina is a user-friendly platform that enables proteomics researchers to apply machine learning without coding expertise. It offers pre-configured workflows for common tasks such as prediction of tandem mass spectra, retention time, and collisional cross section, along with customizable options for advanced users.

2024
ML for Molecule Analysis

Enhanced structure-based prediction of chiral stationary phases for chromatographic enantioseparation from 3D molecular conformations

Yuhui Hong, Christopher J Welch, Patrick Piras, & Haixu Tang (2024).

Analytical Chemistry, 96(6), 2351–2359.

3DMolCSP combines a 3D molecular conformation representation algorithm with a dataset of over 300k enantioseparation records. This approach significantly improves enantioselectivity predictions, supporting more efficient and informed decisions in chiral chromatography.

Reliable ML in Microbiome

Multitask knowledge-primed neural network for predicting missing metadata and host phenotype based on human microbiome

Mahsa Monshizadeh*, Yuhui Hong*, Yuzhen Ye (2024). * equal contribution

Bioinformatics Advances, vbae203.

Metadata such as age and gender are often missing in microbiome studies but are crucial for accurate disease prediction. MicroKPNN-MT addresses this by either using available metadata as input or predicting it from microbiome profiles. Tested across 25 diseases, the model showed that incorporating real or predicted metadata improves both prediction accuracy and generalizability.

2023
ML for Molecule Analysis

3DMolMS: prediction of tandem mass spectra from three dimensional molecular conformations

Yuhui Hong, Sujun Li, Christopher J Welch, Shane Tichy, Yuzhen Ye, & Haixu Tang (2023).

Bioinformatics, btad354.

3DMolMS is a deep neural network model that predicts MS/MS spectra from 3D conformations. The learned molecular representation also enhances predictions of chemical properties, such as elution time and collisional cross section, aiding compound identification.

Tags: ML for Molecule Analysis, Reliable ML in Microbiome, Survey, Patent