Gen AI Based Molecular Property Prediction: A Silent Revolution in Drug Discovery

Gokul Rangarajan
Jul 1

Gen AI in Molecular Property Prediction Beyond ADMET: How Generative AI Is Transforming Workflows and Project Management

This blog is part of the “GenAI in Healthcare Report 2025” by Murali Sudram in collaboration with Pitchworks VC Studio. The report explores how generative AI is reshaping scientific research, clinical workflows, and drug discovery. Stay tuned for more in-depth explorations of real-world applications and enterprise adoption strategies. You can download our Gen AI in Healthcare report here: https://www.pitchworks.club/healthcaregenaireport If you work in manufacturing, you can download our Gen AI manufacturing report here: https://www.pitchworks.club/gen-ai-manufacturing-report-2025

If your interest is in clinical trials, we also have a report on Gen AI in healthcare, Clinical Trials 2025: https://pitchworks.club/clinicaltrailgenaihealthcarereport2025

Molecular Property Prediction (MPP) refers to the computational task of estimating physicochemical, biological, or pharmacological properties of molecules based on their structure. These properties include solubility, toxicity, lipophilicity, binding affinity, and blood-brain barrier permeability, and they go far beyond the classic ADMET parameters.

The Traditional Process: Time-Intensive and Costly

In conventional drug discovery, determining a molecule’s properties required extensive laboratory experimentation. This included:

Wet lab experiments for absorption and solubility.
Animal models or in vitro testing for toxicity and efficacy.
High-throughput screening for activity against targets.

Figure: Molecular property prediction in a traditional lab

This process could take weeks to months per molecule, cost thousands of dollars, and require interdisciplinary teams, typically involving computational chemists, medicinal chemists, biologists, and data analysts.

The average cost of developing a new drug is estimated at $2.8 billion, with a success rate of less than 10% (Fleming 2018, Sarkar 2023).

Traditional Tools for MPP (Pre-AI Era)

These tools relied on rule-based, statistical, and experimental methods:

| Tool / Platform | Purpose | Example Properties |
| --- | --- | --- |
| QSAR software (Quantitative Structure–Activity Relationship) | Statistical modeling based on molecular descriptors | Toxicity, Solubility, Bioavailability |
| ADMET Predictor | Predicts absorption, distribution, metabolism, excretion, and toxicity | ADMET |
| ChemDraw / MarvinSketch | Molecule drawing & basic property calculation | LogP, pKa, MW |
| TopKat (BIOVIA) | Predicts toxicity using empirical models | Mutagenicity, Carcinogenicity |
| MOE (Molecular Operating Environment) | Structure-based drug design & property estimation | Solubility, Permeability |
| Gaussian / ORCA / Spartan | Quantum mechanical calculations for small molecules | Electronic properties, Energies |
| GROMACS / AMBER | Molecular dynamics simulations | Stability, Binding free energy |

Enter Gen AI: Predicting the Future of Molecules

Thanks to the explosion of chemical data and deep learning innovations, MPP has evolved rapidly. Graph neural networks (GNNs), transformers, and multimodal AI models can now predict molecular properties with high accuracy, which can drastically reduce cost, time, and failure rates.

According to Zhao et al. (2024), GSL-MPP leverages a two-level graph representation, combining intra-molecular (atom-level) and inter-molecular (similarity-based) information to boost prediction accuracy.

🧠 Using graph structure learning, the model treats molecules not as isolated graphs but in the context of their similarity to other molecules, much as medicinal chemists reason in terms of chemical families.
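
To picture the inter-molecular level, imagine a graph whose nodes are molecules and whose edges connect similar ones. The sketch below is not the GSL-MPP implementation. It assumes toy fingerprints stored as sets of hypothetical substructure keys, and it uses Tanimoto similarity with a fixed threshold, both of which are illustrative choices rather than details from the paper.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_graph(fingerprints: dict, threshold: float = 0.5) -> dict:
    """Connect every pair of molecules whose similarity meets the threshold."""
    graph = {name: set() for name in fingerprints}
    for m1, m2 in combinations(fingerprints, 2):
        if tanimoto(fingerprints[m1], fingerprints[m2]) >= threshold:
            graph[m1].add(m2)
            graph[m2].add(m1)
    return graph


# Hypothetical fingerprints: each integer stands for one substructure key.
toy_fps = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {7, 8, 9},
}
graph = similarity_graph(toy_fps, threshold=0.5)
```

Here mol_a and mol_b share three of five distinct keys, so they become neighbors, while mol_c stays isolated. A learned model would refine this graph during training instead of fixing it up front.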

🤖 Modern Tools in MPP (AI/Software-Driven Era)

These tools use deep learning, graph neural networks, and self-supervised learning:

| Tool / Library | Description | Highlights |
| --- | --- | --- |
| ChemBERTa / Chemformer / ChemGPT | Pretrained transformer models on SMILES strings | Sequence-based property prediction |
| GROVER / Uni-Mol / MolCLR | Self-supervised graph neural networks for molecules | Graph-based with 2D/3D features |
| SCAGE (Nature 2025) | Self-conformation-aware graph transformer | Learns from 5M compounds with structure-function tasks |
| DGL-LifeSci | Deep Graph Library for life sciences | Custom GNNs for MPP |
| RDKit | Cheminformatics toolkit | Descriptor calculation, molecular similarity |
| DeepChem | ML for drug discovery & materials science | Benchmarks, datasets, ready-to-use models |
| Open Babel | File conversion and basic property prediction | Free and extensible |
| PaddleHelix | AI tools by Baidu for life sciences | Protein-drug interaction, MPP tasks |
| AutoGluon-Tabular | AutoML for molecular tabular data | Property prediction with little coding |

Core Innovations in MPP: From Single to Multimodal Approaches

As reviewed by Liyaqat et al. (2024), recent AI models in MPP fall under three broad strategies:

Single-Modality Models
- Rely on a single molecular representation (e.g., SMILES, graphs, or molecular images).
- Models: ChemBERTa, GIN, MPNN.
Multimodal Models
- Combine multiple forms of input (e.g., 2D graph + 3D conformation).
- Models: GROVER, MolAE, Uni-Mol.
Molecular Pretrained Models (MPMs)
- Use massive chemical datasets (10–20 million molecules).
- Pretraining tasks include fingerprint prediction, angle prediction, and masked node prediction.

For example, the SCAGE model (Qiao et al. 2025) was pretrained on 5 million drug-like compounds and integrates 2D and 3D structural information with functional group awareness. This led to notable improvements on 30 activity cliff benchmarks; activity cliffs are a notorious challenge in drug discovery.

Understanding Molecular Representations

To predict properties, a model must "understand" the molecule. Here are the formats used:

SMILES (1D) – simple string notation of molecules.
2D Graphs – atoms as nodes, bonds as edges.
3D Conformers – spatial geometry of molecules.
Images – pixel-based views for computer vision models.

🧪 Different formats capture different aspects of molecular behavior. For instance, 3D geometry is essential for predicting properties like binding affinity or toxicity, while SMILES strings are more scalable for pretraining language models.
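
As a small illustration of the first two formats, here is ethanol written by hand as a SMILES string and as a 2D graph. This is only a sketch: the graph is typed in manually rather than parsed, and the variable names are made up for this example. In practice a cheminformatics toolkit such as RDKit would do the conversion.

```python
# Ethanol in two of the formats listed above (hydrogens left implicit).
smiles = "CCO"  # 1D: a plain string

# 2D graph: atoms as nodes, bonds as edges (atom index pairs plus bond order).
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]


def adjacency(n_atoms, bond_list):
    """Build an adjacency list from an undirected bond list."""
    neighbors = {i: [] for i in range(n_atoms)}
    for i, j, _order in bond_list:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors


adj = adjacency(len(atoms), bonds)
# Heavy-atom degree of each atom, a typical node feature for graph models.
degrees = {i: len(nbrs) for i, nbrs in adj.items()}
```

The string is compact and easy to feed to a language model, while the graph makes connectivity explicit, which is what graph neural networks operate on.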

Challenges and Gaps

Despite progress, real-world deployment of MPP still faces issues:

⚠️ Overfitting to benchmarks like MoleculeNet (Deng et al. 2023) can lead to inflated claims.
⚠️ Activity cliffs — where small structural changes cause big functional changes — are still hard to predict.
📉 Low-data regimes suffer from poor generalization.

In fact, Deng et al. ran 62,820 experiments and found many models fail when tested on simple property descriptors, showing that scaling alone doesn’t guarantee learning.
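
The activity cliff problem noted above can be made concrete with a few lines of code. The sketch below flags pairs of molecules that look alike but behave very differently. The fingerprints, activity values, and both thresholds are hypothetical and chosen only for illustration; published benchmarks use their own definitions.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(molecules, sim_threshold=0.7, activity_gap=2.0):
    """Return pairs that are structurally similar but far apart in activity.

    `molecules` maps a name to (fingerprint set, activity on a log scale).
    """
    cliffs = []
    for (n1, (fp1, act1)), (n2, (fp2, act2)) in combinations(molecules.items(), 2):
        similar = tanimoto(fp1, fp2) >= sim_threshold
        different = abs(act1 - act2) >= activity_gap
        if similar and different:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical data: two close analogs with very different potency.
toy_molecules = {
    "analog_1": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.1),
    "analog_2": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.2),
    "unrelated": ({20, 21, 22}, 5.0),
}
cliffs = find_activity_cliffs(toy_molecules)
```

A model that relies mainly on structural similarity would predict nearly the same activity for analog_1 and analog_2, which is exactly why such pairs are hard.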

Popular Tools & Datasets

Libraries:

PaddleHelix (AI for drug discovery)
Uni-Mol (3D molecular representation)
ChemBERTa (language models for chemistry)

Datasets:

MoleculeNet (BBBP, ESOL, Tox21)
QM9, QM7 (quantum mechanical properties)
HIV, BACE, FreeSolv

See more on Papers With Code: Molecular Property Prediction

Gen AI Workflow in Molecular Property Prediction

Step 1: Molecule Input

⬇

Molecules are represented as:

- SMILES strings (1D)

- 2D molecular graphs

- 3D conformations

- Images (optional)

⬇

Step 2: Data Preprocessing

⬇

- Convert structures into model-readable formats

- Standardize molecules (e.g., remove salts, normalize tautomers)

- Generate molecular descriptors or fingerprints (optional)
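
Salt removal, one of the standardization steps above, can be sketched in plain Python. This is a simplified illustration that assumes salts and counter-ions appear as dot-separated fragments in the SMILES string and that the largest fragment is the one to keep. The atom counting is deliberately rough (explicit hydrogens are not treated specially), and the function names are made up for this example. Toolkits such as RDKit provide more careful standardization utilities.

```python
import re

# Bracket atoms count as one atom; two-letter halogens are matched before single letters.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment: str) -> int:
    """Rough count of atoms written in a SMILES fragment."""
    return len(ATOM_PATTERN.findall(fragment))


def strip_salts(smiles: str) -> str:
    """Keep only the largest dot-separated fragment of a SMILES string."""
    fragments = smiles.split(".")
    return max(fragments, key=heavy_atom_count)


# Sodium acetate: the acetate ion is kept, the sodium counter-ion is dropped.
parent = strip_salts("CC(=O)[O-].[Na+]")
```

Note that the kept fragment is still charged here; neutralizing it and normalizing tautomers are separate steps that this sketch does not attempt.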

⬇

Step 3: Molecular Representation Learning

⬇

Use Gen AI models such as:

- Graph Neural Networks (GNNs)

- Transformers (ChemBERTa, SCAGE, Chemformer)

- Contrastive learning models

Goal: Learn embeddings that capture molecular structure and behavior
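
To give a feel for how a GNN turns structure into an embedding, here is one round of message passing on ethanol, followed by a sum readout. It is a bare-bones sketch under stated assumptions: features are one-hot vectors over a made-up two-element vocabulary, and there are no learned weights or nonlinearities, which any real model would add.

```python
def message_passing_round(features, neighbors):
    """One round of sum aggregation: each atom adds its neighbors' features to its own."""
    updated = {}
    for atom, feat in features.items():
        aggregated = list(feat)
        for nbr in neighbors[atom]:
            aggregated = [a + b for a, b in zip(aggregated, features[nbr])]
        updated[atom] = aggregated
    return updated


def readout(features):
    """Sum all atom vectors into a single molecule-level vector."""
    dim = len(next(iter(features.values())))
    return [sum(vec[k] for vec in features.values()) for k in range(dim)]


# Ethanol (C-C-O) with one-hot features over the hypothetical vocabulary [C, O].
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1, 0], 1: [1, 0], 2: [0, 1]}

hidden = message_passing_round(features, neighbors)
embedding = readout(hidden)
```

After one round, the middle carbon already "knows" it sits next to both a carbon and an oxygen. Stacking more rounds lets information travel further across the molecule.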

⬇

Step 4: Pretraining (Optional but Powerful)

⬇

Train on large unlabeled datasets (e.g., 5M+ molecules)

Tasks include:

- Masked atom prediction

- 3D angle prediction

- Functional group prediction

- Fingerprint recovery
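
Masked atom prediction is the easiest of these tasks to show in code. The sketch below builds one training example from a SMILES string by hiding a random subset of atom tokens. The tokenizer is simplified, and the mask token and the 15% rate are assumptions for illustration rather than settings taken from any of the models named above.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
MASK = "[MASK]"


def tokenize(smiles: str) -> list:
    return TOKEN_PATTERN.findall(smiles)


def is_atom(token: str) -> bool:
    """True for atom tokens; bonds, branches, and ring digits are never masked."""
    return token.startswith("[") or token.isalpha()


def mask_atoms(tokens, mask_rate=0.15, rng=None):
    """Hide a random subset of atom tokens and return (inputs, labels)."""
    rng = rng or random.Random()
    atom_positions = [i for i, tok in enumerate(tokens) if is_atom(tok)]
    if not atom_positions:
        return list(tokens), {}
    n_mask = max(1, round(mask_rate * len(atom_positions)))
    chosen = set(rng.sample(atom_positions, n_mask))
    inputs = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in chosen}  # what the model must recover
    return inputs, labels


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
inputs, labels = mask_atoms(tokens, mask_rate=0.15, rng=random.Random(0))
```

The model sees `inputs` and is trained to predict the hidden atoms in `labels`, so no experimental property data is needed at this stage.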

⬇

Step 5: Fine-tuning on Specific Property Prediction Tasks

⬇

- Solubility

- Toxicity (Tox21, ToxCast)

- Blood-brain barrier penetration (BBBP)

- Lipophilicity

- Drug-likeness, etc.

⬇

Step 6: Evaluation and Validation

⬇

Use benchmark datasets (e.g., MoleculeNet)

Metrics: RMSE, ROC-AUC, Accuracy, MAE
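
For readers who want to see what these metrics compute, here is a minimal plain-Python sketch. The numbers are toy values, not results from any model, and in practice a library such as scikit-learn would normally be used. RMSE and MAE apply to regression tasks like solubility; accuracy and ROC-AUC apply to classification tasks like toxicity flags.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == l) for p, l in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy regression example.
reg_rmse = rmse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
reg_mae = mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])

# Toy classification example.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
clf_accuracy = accuracy(toy_labels, toy_scores)
clf_auc = roc_auc(toy_labels, toy_scores)
```

ROC-AUC does not depend on a threshold, which is one reason it is preferred for imbalanced toxicity datasets where accuracy alone can look deceptively good.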

⬇

Step 7: Interpretation & Deployment

⬇

- Identify substructures linked to activity (e.g., using attention or saliency maps)

- Integrate into drug screening pipelines

- Prioritize or eliminate candidate molecules

If model performance is weak →

Go back to Step 3 or 4:

- Try better representations

- Use more training data

- Apply domain-specific augmentation

Project Management Workflow in Molecular Property Prediction by Pitchworks & Kwapio

Molecular Property Prediction (MPP) is a complex, multi-disciplinary process that involves chemistry, data science, AI modeling, and regulatory insight. To manage such a high-stakes workflow, Pitchworks and its portfolio company Kwapio have co-developed a comprehensive end-to-end project management system tailored for MPP research and productization. This workflow enables scientific and technical teams to collaborate seamlessly across discovery, model building, validation, and deployment stages.

Figure: List of tasks for managing a Molecular Property Prediction project

The system offers integrated modules for project planning, task tracking, data management, model lifecycle tracking, and compliance documentation. It includes Kanban-style boards to handle different ticket types—such as dataset acquisition, preprocessing, pretraining, fine-tuning, and deployment—and assigns them to interdisciplinary team members across AI, chemistry, and operations. The platform supports real-time collaboration, where chemists can comment on model outputs, data scientists can update experiments, and leadership can monitor progress via dashboards.

A major feature is its smart documentation engine—every dataset, model version, experiment result, and validation metric is linked to a living document. This reduces duplication and improves traceability, which is critical in regulated environments like pharma and healthcare. The platform also includes automated reminders, experiment version control, and model governance templates, ensuring every stage from SMILES ingestion to prediction deployment is tracked and auditable.

Beyond technical execution, the system promotes transparency and speed by integrating commenting tools, meeting notepads, and milestone checklists, which are accessible to both internal R&D and external stakeholders (e.g., CROs, pharma partners). By aligning agile development principles with scientific rigor, Pitchworks and Kwapio have created a scalable framework to manage MPP workflows, cutting lead times by 30–50% and boosting reproducibility and innovation across their portfolio.

The Bottom Line

Molecular Property Prediction powered by Gen AI has the potential to shorten drug development timelines from years to months and to cut early-stage screening costs by 40–60%. By learning from millions of known molecules, these systems can identify promising candidates, eliminate poor ones, and reduce the reliance on expensive lab experiments.

"AI won't replace chemists—but chemists using AI will replace those who don’t."

Molecular Property Prediction is rapidly evolving from a chemistry lab challenge to a data-driven, AI-enabled pipeline. As the field advances beyond traditional QSAR and descriptor-based approaches, Gen AI models—leveraging graph structures, self-supervised learning, and multi-modal representations—are unlocking new levels of accuracy and scalability. However, these complex workflows also demand structured coordination, clear documentation, and robust collaboration.

This is where the synergy between AI innovation and smart project management, as seen in Pitchworks and Kwapio’s end-to-end workflow platform, becomes critical. By integrating scientific computation with agile management tools—such as task tracking, version control, and explainability dashboards—MPP workflows become faster, more reproducible, and enterprise-ready.

Ultimately, combining cutting-edge Gen AI with disciplined execution transforms MPP from experimental modeling into a scalable, reliable engine for next-gen drug discovery, formulation, and chemical innovation. The future of molecular science will not only be AI-first—it will be workflow-smart.