The search for new medicines is one of humanity's toughest challenges, made harder by the sheer size of the chemical universe. Scientists estimate there could be between 10²³ and 10⁶⁰ possible drug-like molecules, a number so vast that testing even a fraction of them is impossible. To tackle this, researchers are turning to artificial intelligence. Molecular Large Language Models (Mol-LLMs) treat molecules as text-like strings and apply the same transformer technology behind popular tools like ChatGPT. Inspired by breakthroughs in natural language processing, many in the field have assumed that scaling up (bigger models trained on more data) will naturally lead to better results. But chemistry is not a natural language. While words follow flexible rules shaped by context, molecules obey strict physical and chemical laws, and their meaning is tied to chemical function rather than context.
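To make the "molecules as text" idea concrete, here is a small illustration (not code from our study): aspirin written as a SMILES string and split into atom-level tokens with a commonly used SMILES regular expression.

```python
import re

# Aspirin written as a SMILES string: the molecule becomes plain text.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

# Widely used atom-level SMILES pattern; it splits the string into
# chemically meaningful symbols rather than raw characters.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

tokens = SMILES_REGEX.findall(smiles)
print(tokens)
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']
```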
This raises an important question: Does scaling up really work the same way in drug discovery? We challenge this assumption with NovoMolGen, a family of open-source foundation models for chemistry, and present the largest systematic study of Mol-LLMs to date (>30,000 experiments), evaluating the effects of molecular representation, tokenization, model scaling, and dataset size on de novo molecular generation.
NovoMolGen achieves state-of-the-art performance in goal-directed molecular design, where the task is to generate new compounds with specific multi-property profiles under strict oracle budget constraints. On the Practical Molecular Optimization (PMO) benchmark, which captures a wide range of drug discovery challenges, NovoMolGen consistently outperforms strong baselines, including widely used frameworks like REINVENT and recent state-of-the-art methods like f-RAG. Importantly, these improvements are observed not just for the largest models but also for smaller variants, highlighting that the approach is broadly effective rather than dependent on scale alone.
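To give a sense of what "goal-directed generation under an oracle budget" involves, here is a minimal, simplified sketch of a generate-score-update loop. The callables `sample_fn`, `update_fn`, and `oracle` are placeholders for a Mol-LLM's sampling step, its fine-tuning step, and an expensive property scorer; this is an illustration, not the exact PMO or NovoMolGen implementation.

```python
def optimize(sample_fn, update_fn, oracle, budget=3000, batch_size=64, max_steps=1000):
    """Sketch of oracle-budgeted molecular optimization (illustrative only)."""
    scored = {}                                        # SMILES -> oracle score
    for _ in range(max_steps):
        if len(scored) >= budget:                      # hard cap on oracle calls
            break
        candidates = sample_fn(batch_size)             # generate new SMILES strings
        fresh = [s for s in candidates if s not in scored]
        fresh = fresh[: budget - len(scored)]          # never exceed the budget
        for smi in fresh:
            scored[smi] = oracle(smi)                  # each call consumes budget
        if not scored:
            continue
        best = sorted(scored, key=scored.get, reverse=True)[:batch_size]
        update_fn(best)                                # steer the model toward high scorers
    return max(scored, key=scored.get), scored
```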
In protein-ligand docking, NovoMolGen outperforms multiple existing models. Our models generate molecules that bind more strongly to protein targets and identify a higher proportion of candidates that are novel, drug-like, and synthesizable than previous methods. Across multiple targets (including the cancer target PARP1), our models deliver high-quality molecules under a strict oracle budget of 3,000 molecule evaluations per run. This demonstrates that NovoMolGen has learned a strong chemical prior, enabling it to explore chemical space effectively and propose candidates that meet multiple medicinal chemistry criteria simultaneously. The NovoMolGen models achieve hit ratios often twice as high as strong baselines. These gains hold even for the smaller variants: scaling up to 300M parameters yields only modest improvements, suggesting the 32M and 157M models already capture the essentials of designing ligands that bind strongly to multiple protein targets.
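As a rough illustration of the kind of multi-criteria "hit" filter implied here, the sketch below checks validity, novelty, and drug-likeness with RDKit and takes a docking score from an external oracle. The thresholds are placeholders rather than the benchmark's exact cutoffs, and a synthetic-accessibility check (e.g., SA score) would normally be included as well.

```python
from rdkit import Chem
from rdkit.Chem import QED

def is_hit(smiles, docking_score, train_set, qed_min=0.5, dock_max=-8.0):
    """Rough multi-criteria filter: valid, novel, drug-like, and strongly binding.

    `docking_score` comes from an external docking oracle (lower is better);
    the thresholds here are placeholders, not the benchmark's exact values.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                        # reject invalid SMILES
        return False
    canonical = Chem.MolToSmiles(mol)
    if canonical in train_set:             # novelty w.r.t. known molecules
        return False
    if QED.qed(mol) < qed_min:             # drug-likeness
        return False
    return docking_score <= dock_max       # binding strength
```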
In the world of AI, a common belief is that bigger is always better. The thinking goes that larger models trained on more data will inevitably lead to better performance. Our research shows this isn't necessarily true for chemistry. We found that a relatively small model can learn the fundamental rules of molecular structure very quickly, and making the model bigger or training it for longer yields surprisingly little benefit.
Performance on key design tasks plateaued remarkably early in the training process. This suggests that once a model has mastered the basic "grammar" of chemistry, simply showing it more examples doesn't teach it much more about designing effective molecules. This is a game-changing insight. It means that state-of-the-art results in molecular design don't require massive, resource-intensive models. This democratizes the field, allowing academic labs and smaller research groups to contribute to the cutting edge of drug discovery without needing a supercomputer.
Our investigation also revealed a critical disconnect: the metrics we use to grade these models during pretraining don't predict how well they'll perform on real-world tasks. A popular metric, Fréchet ChemNet Distance (FCD), measures how closely the distribution of generated molecules matches a reference dataset. But we found that a model's FCD score has very little correlation with its ability to design a novel molecule for a specific purpose. We were grading models on their ability to copy what's already known, not on their creativity or efficiency in solving a new problem. This finding is a call to action for the research community to develop new benchmarks that measure what truly matters: a model's ability to be an efficient and innovative partner in the scientific discovery process.
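For reference, FCD is the Fréchet distance between two Gaussians fitted to ChemNet activations of the generated and reference molecules. The snippet below shows only that final distance computation; the ChemNet featurization step is omitted.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    For FCD, the means and covariances are estimated from ChemNet activations
    of generated vs. reference molecules (featurization not shown here).
    """
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)       # matrix square root of the product
    if np.iscomplexobj(covmean):           # discard tiny imaginary numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```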
Our findings point to a clear new direction for the field. To build better models, we need to shift our focus from teaching them chemical syntax (what molecules look like) to teaching them functional semantics (what molecules do). The vast chemical databases used for training are great for learning the rules of assembly, but they lack information about biological function.
Future models should be trained with objectives that are directly related to a molecule's purpose, such as its interaction with a protein target or its desired physicochemical properties. On a more practical note, our study also confirmed that better ways of representing molecules, such as Byte Pair Encoding (BPE), which merges common chemical building blocks into single tokens instead of tokenizing one atom at a time, consistently improve efficiency and performance.
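As a minimal sketch of this idea, a BPE tokenizer can be trained directly on SMILES strings with the Hugging Face tokenizers library; the vocabulary size and corpus below are toy values, not our exact configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Train a byte-pair-encoding tokenizer on SMILES strings so that frequent
# substructures (e.g. aromatic rings or carboxyl groups) become single tokens.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=512, special_tokens=["[UNK]", "[BOS]", "[EOS]", "[PAD]"])

smiles_corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1O"]  # toy corpus
tokenizer.train_from_iterator(smiles_corpus, trainer=trainer)

print(tokenizer.encode("CC(=O)Oc1ccccc1C(=O)O").tokens)
```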
By providing a state-of-the-art foundation model, our work enables any lab, regardless of computational resources, to bypass the expensive pretraining phase and focus directly on the creative and scientific challenges of fine-tuning and application. We have made our pretrained models, datasets, and code fully open-source to empower other researchers and accelerate progress, whether finding a new drug or designing a novel material. We invite the community to build on this work and help shape the future of AI-driven science.
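If the released checkpoints follow the standard Hugging Face causal-LM format, sampling molecules from a pretrained model could look roughly like the sketch below; the repository ID is a placeholder, so check our release for the actual checkpoint names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID; see the project's release for the actual
# checkpoint names. Assumes the checkpoints use the standard Hugging Face
# causal-LM format with a defined BOS token.
MODEL_ID = "your-org/NovoMolGen-32M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Sample de novo molecules by letting the model generate SMILES from scratch.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=64, num_return_sequences=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```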
@article{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408},
}