Why Training Data Matters: Exploring Coverage Bias in Small Molecule Machine Learning
Machine learning is transforming analytical chemistry by enabling the prediction of small molecule properties, a capability that is crucial for drug development and other applications. However, reliable results depend on carefully selected training data, because biased data can mislead models. Here, we explain why it was important to prepare high-quality training datasets for the machine learning methods in SIRIUS, especially given that many widely used datasets fail to evenly represent the diversity of biomolecular structures.