
Gen AI Based Molecular Property Prediction: A Silent Revolution in Drug Discovery

  • Writer: Gokul Rangarajan
  • Jul 1
  • 7 min read

Gen AI in Molecular Property Prediction Beyond ADMET: How Generative AI Is Transforming Workflows and Project Management


Molecular Property Prediction

This blog is part of the “GenAI in Healthcare Report 2025” by Murali Sudram in collaboration with Pitchworks VC Studio. The report explores how generative AI is reshaping scientific research, clinical workflows, and drug discovery. Stay tuned for more in-depth explorations of real-world applications and enterprise adoption strategies. You can download our Gen AI in Healthcare report here: https://www.pitchworks.club/healthcaregenaireport

If you are into manufacturing, you can download our Gen AI manufacturing report here: https://www.pitchworks.club/gen-ai-manufacturing-report-2025

If your interest is in clinical trials, we have a report on Gen AI in healthcare, Clinical Trials 2025: https://pitchworks.club/clinicaltrailgenaihealthcarereport2025




Molecular Property Prediction (MPP) refers to the computational task of estimating physicochemical, biological, or pharmacological properties of molecules based on their structure. These properties include solubility, toxicity, lipophilicity, binding affinity, and blood-brain barrier permeability, and they go far beyond the classic ADMET (absorption, distribution, metabolism, excretion, toxicity) parameters.



The Traditional Process: Time-Intensive and Costly

In conventional drug discovery, determining a molecule’s properties required extensive laboratory experimentation. This included:

  • Wet lab experiments for absorption and solubility.

  • Animal models or in vitro testing for toxicity and efficacy.

  • High-throughput screening for activity against targets.

    molecular property prediction in a traditional lab

This process could take weeks to months per molecule, cost thousands of dollars, and require interdisciplinary teams, typically involving computational chemists, medicinal chemists, biologists, and data analysts.

The average cost of developing a new drug is estimated at $2.8 billion, with a success rate of less than 10% (Fleming 2018, Sarkar 2023).

Traditional Tools for MPP (Pre-AI Era)

These tools relied on rule-based, statistical, and experimental methods:

| Tool / Platform | Purpose | Example Properties |
|---|---|---|
| QSAR software (Quantitative Structure–Activity Relationship) | Statistical modeling based on molecular descriptors | Toxicity, Solubility, Bioavailability |
| ADMET Predictor | Predicts absorption, distribution, metabolism, excretion, and toxicity | ADMET |
| ChemDraw / MarvinSketch | Molecule drawing & basic property calculation | LogP, pKa, MW |
| TopKat (BIOVIA) | Predicts toxicity using empirical models | Mutagenicity, Carcinogenicity |
| MOE (Molecular Operating Environment) | Structure-based drug design & property estimation | Solubility, Permeability |
| Gaussian / ORCA / Spartan | Quantum mechanical calculations for small molecules | Electronic properties, Energies |
| GROMACS / AMBER | Molecular dynamics simulations | Stability, Binding free energy |


Enter Gen AI: Predicting the Future of Molecules

Thanks to the explosion of chemical data and deep learning innovations, MPP has rapidly evolved. Graph neural networks (GNNs), transformers, and multimodal AI models are now able to predict molecular properties with high precision—drastically reducing cost, time, and failure rates.

According to Zhao et al. (2024), GSL-MPP leverages a two-level graph representation, combining intra-molecular (atom-level) and inter-molecular (similarity-based) information to boost prediction accuracy.

🧠 Using graph structure learning, the model embeds molecules not just as isolated graphs but also considers their similarity with other molecules—just like how medicinal chemists reason based on chemical families.
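To make the "inter-molecular" level a little more concrete, here is a minimal sketch of a molecule-to-molecule similarity graph. It is an illustration under stated assumptions, not the GSL-MPP method itself: GSL-MPP learns its graph structure, whereas this sketch simply links molecules whose fingerprints exceed a fixed Tanimoto threshold. The fingerprint bit sets, the molecule names, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_similarity_graph(fingerprints: dict, threshold: float = 0.5) -> dict:
    """Connect molecules whose fingerprint similarity meets the threshold.

    Returns an adjacency mapping: molecule name -> set of neighbouring names.
    """
    graph = {name: set() for name in fingerprints}
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        if tanimoto(fp_a, fp_b) >= threshold:
            graph[name_a].add(name_b)
            graph[name_b].add(name_a)
    return graph


# Toy fingerprints (hypothetical bit indices, not computed from real structures).
toy_fps = {
    "mol_A": frozenset({1, 2, 3, 4}),
    "mol_B": frozenset({1, 2, 3, 5}),
    "mol_C": frozenset({7, 8, 9}),
}
inter_graph = build_similarity_graph(toy_fps, threshold=0.5)
```

In this toy set, mol_A and mol_B share three of five distinct bits, so they become neighbours, while mol_C stays isolated. A model could then pass information along these edges in addition to the atom-level graph of each molecule.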


🤖 Modern Tools in MPP (AI/Software-Driven Era)

These tools use deep learning, graph neural networks, and self-supervised learning:

| Tool / Library | Description | Highlights |
|---|---|---|
| ChemBERTa / Chemformer / ChemGPT | Pretrained transformer models on SMILES strings | Sequence-based property prediction |
| GROVER / Uni-Mol / MolCLR | Self-supervised graph neural networks for molecules | Graph-based with 2D/3D features |
| SCAGE (Nature 2025) | Self-conformation-aware graph transformer | Learns from 5M compounds with structure-function tasks |
| DGL-LifeSci | Deep Graph Library for life sciences | Custom GNNs for MPP |
| RDKit | Cheminformatics toolkit | Descriptor calculation, molecular similarity |
| DeepChem | ML for drug discovery & materials science | Benchmarks, datasets, ready-to-use models |
| Open Babel | File conversion and basic property prediction | Free and extensible |
| PaddleHelix | AI tools by Baidu for life sciences | Protein-drug interaction, MPP tasks |
| AutoGluon-Tabular | AutoML for molecular tabular data | Property prediction with little coding |


Core Innovations in MPP: From Single to Multimodal Approaches

As reviewed by Liyaqat et al. (2024), recent AI models in MPP fall under three broad strategies:

  1. Single-Modality Models

    • Rely on a single molecular representation (e.g., SMILES, graphs, or molecular images).

    • Models: ChemBERTa, GIN, MPNN.

  2. Multimodal Models

    • Combine multiple forms of input (e.g., 2D graph + 3D conformation).

    • Models: GROVER, MolAE, Uni-Mol.

  3. Pretrained Molecular Language Models (MPMs)

    • Use massive chemical datasets (10–20 million molecules).

    • Pretraining tasks include fingerprint prediction, angle prediction, and masked node prediction.

For example, the SCAGE model (Qiao et al. 2025) was pretrained on 5 million drug-like compounds and integrates 2D and 3D structural information with functional group awareness. This led to notable improvements on 30 activity cliff benchmarks, a notorious challenge in drug discovery.

Understanding Molecular Representations

To predict properties, a model must "understand" the molecule. Here are the formats used:

  • SMILES (1D) – simple string notation of molecules.

  • 2D Graphs – atoms as nodes, bonds as edges.

  • 3D Conformers – spatial geometry of molecules.

  • Images – pixel-based views for computer vision models.

🧪 Different formats capture different aspects of molecular behavior. For instance, 3D geometry is essential for predicting properties like binding affinity or toxicity, while SMILES strings are more scalable for pretraining language models.
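As a rough illustration of the first two formats, the sketch below splits a SMILES string into tokens (the 1D view a language model would consume) and writes ethanol out by hand as a small graph (the 2D view). The token pattern is a simplified assumption that covers common organic SMILES rather than the full grammar, and the function names are hypothetical. In practice a cheminformatics toolkit would build the graph from the string.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration): bracket atoms,
# two-letter halogens, single-letter atoms, ring-closure labels, bonds, branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+\\/().:@~*$]"
)


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into tokens; reject anything the pattern misses."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


def adjacency(n_atoms: int, bonds: list) -> dict:
    """Turn an undirected bond list into an atom-index -> neighbours mapping."""
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return neighbours


# 1D view: ethanol as a SMILES string, split into tokens.
ethanol_tokens = tokenize_smiles("CCO")

# 2D view: the same molecule written by hand as nodes (atoms) and edges (bonds).
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
ethanol_graph = adjacency(len(ethanol_atoms), ethanol_bonds)
```

The string view is cheap to produce at scale, which is why it suits pretraining. The graph view makes connectivity explicit, which is what a GNN needs.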

Challenges and Gaps

Despite progress, real-world deployment of MPP still faces issues:

  • ⚠️ Overfitting to benchmarks like MoleculeNet (Deng et al. 2023) can lead to inflated claims.

  • ⚠️ Activity cliffs — where small structural changes cause big functional changes — are still hard to predict.

  • 📉 Low-data regimes suffer from poor generalization.

In fact, Deng et al. ran 62,820 experiments and found many models fail when tested on simple property descriptors, showing that scaling alone doesn’t guarantee learning.
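One way to see why activity cliffs are awkward for a model is to look for them directly in a dataset: pairs of molecules that look nearly identical yet differ sharply in measured activity. The sketch below is a simple illustration, not a method from the cited papers. The fingerprint bit sets, compound names, and activity values are made up, and the cut-offs (similarity of at least 0.8, activity gap of at least 2 log units) are assumptions that vary between studies.

```python
from itertools import combinations


def jaccard(bits_a: set, bits_b: set) -> float:
    """Fraction of fingerprint bits shared by two molecules."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def find_activity_cliffs(records, min_similarity=0.8, min_gap=2.0):
    """Return pairs that are structurally similar but far apart in activity.

    records: list of (name, fingerprint_bits, activity) tuples, where activity
    is on a log scale such as pIC50.
    """
    cliffs = []
    for (name_a, bits_a, act_a), (name_b, bits_b, act_b) in combinations(records, 2):
        similarity = jaccard(bits_a, bits_b)
        gap = abs(act_a - act_b)
        if similarity >= min_similarity and gap >= min_gap:
            cliffs.append((name_a, name_b, round(similarity, 3), round(gap, 2)))
    return cliffs


# Hypothetical records: (name, fingerprint bits, pIC50).
toy_records = [
    ("cmpd_1", {1, 2, 3, 4, 5}, 8.2),
    ("cmpd_2", {1, 2, 3, 4, 5, 6}, 5.9),
    ("cmpd_3", {10, 11, 12}, 6.0),
]
cliffs = find_activity_cliffs(toy_records)
```

Here cmpd_1 and cmpd_2 differ by a single bit but by more than two log units of activity, so they are flagged. A model that relies mainly on structural similarity would tend to predict nearly the same value for both.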

Popular Tools & Datasets

Libraries:

  • PaddleHelix (AI for drug discovery)

  • Uni-Mol (3D molecular representation)

  • ChemBERTa (language models for chemistry)

Datasets:

  • MoleculeNet (BBBP, ESOL, Tox21)

  • QM9, QM7 (quantum mechanical properties)

  • HIV, BACE, FreeSolv

See more on Papers With Code: Molecular Property Prediction

Gen AI Workflow in Molecular Property Prediction

Step 1: Molecule Input

Molecules are represented as:

- SMILES strings (1D)

- 2D molecular graphs

- 3D conformations

- Images (optional)


Step 2: Data Preprocessing

- Convert structures into model-readable formats

- Standardize molecules (e.g., remove salts, normalize tautomers)

- Generate molecular descriptors or fingerprints (optional)
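A rough sketch of the standardization step, using only string handling: keep the largest dot-separated fragment of each SMILES string (a crude way to drop counter-ions and salts) and remove exact duplicates. This is a deliberately simplified assumption. It does not canonicalize SMILES or normalize tautomers, and it does not handle ring closures written across a dot, so real pipelines typically rely on a cheminformatics toolkit such as RDKit for this step. The function names are hypothetical.

```python
import re

# Matches one atom token: a bracket atom, a two-letter halogen, or a
# single-letter organic-subset atom (aromatic atoms are lowercase).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def largest_fragment(smiles: str) -> str:
    """Keep the dot-separated component with the most atom tokens."""
    fragments = [f for f in smiles.split(".") if f]
    return max(fragments, key=lambda f: len(ATOM_TOKEN.findall(f)), default="")


def preprocess(smiles_list: list) -> list:
    """Strip whitespace, drop counter-ions, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for raw in smiles_list:
        parent = largest_fragment(raw.strip())
        # Duplicates are detected by exact string match only (no canonicalization).
        if parent and parent not in seen:
            seen.add(parent)
            cleaned.append(parent)
    return cleaned


raw_inputs = ["CC(=O)[O-].[Na+]", " CC(=O)[O-] ", "Cl.CN1CCCC1", ""]
cleaned = preprocess(raw_inputs)
```

With these inputs, the sodium and chloride fragments are dropped, the repeated acetate entry is collapsed, and the empty string is skipped.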


Step 3: Molecular Representation Learning

Use Gen AI models such as:

- Graph Neural Networks (GNNs)

- Transformers (ChemBERTa, SCAGE, Chemformer)

- Contrastive learning models


Goal: Learn embeddings that capture molecular structure and behavior


Step 4: Pretraining (Optional but Powerful)

Train on large unlabeled datasets (e.g., 5M+ molecules)

Tasks include:

- Masked atom prediction

- 3D angle prediction

- Functional group prediction

- Fingerprint recovery
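To give a feel for the first task in the list, here is a minimal sketch of how masked atom prediction examples could be built from a tokenized SMILES string: hide a fraction of the atom tokens and keep the originals as labels for the model to recover. The 15% default mask rate, the `[MASK]` symbol, and the function name are assumptions for illustration and are not taken from any specific model mentioned in this post.

```python
import random

MASK = "[MASK]"


def mask_atoms(tokens: list, mask_rate: float = 0.15, seed: int = 0):
    """Replace a random subset of atom tokens with MASK.

    Returns (masked_tokens, labels), where labels maps each masked position
    to the original token the model should predict.
    """
    rng = random.Random(seed)
    # Atom tokens start with a letter (C, Cl, c, ...) or a bracket ([Na+]).
    atom_positions = [
        i for i, tok in enumerate(tokens) if tok[0].isalpha() or tok.startswith("[")
    ]
    if not atom_positions:
        return list(tokens), {}
    n_mask = max(1, round(mask_rate * len(atom_positions)))
    chosen = set(rng.sample(atom_positions, n_mask))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(chosen)}
    return masked, labels


# Acetic acid, CC(=O)O, already split into tokens.
acetic_acid = ["C", "C", "(", "=", "O", ")", "O"]
masked_tokens, labels = mask_atoms(acetic_acid, mask_rate=0.5, seed=42)
```

Only atom tokens are eligible for masking here; bonds and brackets are left visible so the model has structural context to predict from. No labeled property data is needed, which is what makes this kind of pretraining possible on millions of molecules.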


Step 5: Fine-tuning on Specific Property Prediction Tasks

- Solubility

- Toxicity (Tox21, ToxCast)

- Blood-brain barrier penetration (BBBP)

- Lipophilicity

- Drug-likeness, etc.


Step 6: Evaluation and Validation

Use benchmark datasets (e.g., MoleculeNet)

Metrics: RMSE, ROC-AUC, Accuracy, MAE
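For readers who want to see what these metrics compute, here is a small plain-Python sketch. RMSE and MAE apply to regression tasks such as solubility, while accuracy and ROC-AUC apply to classification tasks such as toxicity. The ROC-AUC version below uses the pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half), which is fine for small examples. Library implementations are the better choice for real evaluation. The example values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error for regression targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels, scores, threshold=0.5):
    """Share of correct binary predictions after thresholding the scores."""
    hits = sum((s >= threshold) == bool(l) for l, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up regression example (e.g., predicted vs. measured log-solubility).
reg_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
reg_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Made-up classification example (e.g., toxic = 1, non-toxic = 0).
cls_labels = [0, 0, 1, 1]
cls_scores = [0.1, 0.4, 0.35, 0.8]
cls_auc = roc_auc(cls_labels, cls_scores)
cls_acc = accuracy(cls_labels, cls_scores)
```

ROC-AUC depends only on how the scores rank the molecules, whereas accuracy depends on the chosen threshold. That is one reason imbalanced benchmarks are usually reported with ROC-AUC.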


Step 7: Interpretation & Deployment

- Identify substructures linked to activity (e.g., using attention or saliency maps)

- Integrate into drug screening pipelines

- Prioritize or eliminate candidate molecules


If model performance is weak →

Go back to Step 3 or 4:

- Try better representations

- Use more training data

- Apply domain-specific augmentation


Project Management Workflow in Molecular Property Prediction by Pitchworks & Kwapio


Molecular Property Prediction (MPP) is a complex, multi-disciplinary process that involves chemistry, data science, AI modeling, and regulatory insight. To manage such a high-stakes workflow, Pitchworks and its portfolio company Kwapio have co-developed a comprehensive end-to-end project management system tailored for MPP research and productization. This workflow enables scientific and technical teams to collaborate seamlessly across discovery, model building, validation, and deployment stages.

List of tasks for managing a Molecular Property Prediction project

The system offers integrated modules for project planning, task tracking, data management, model lifecycle tracking, and compliance documentation. It includes Kanban-style boards to handle different ticket types—such as dataset acquisition, preprocessing, pretraining, fine-tuning, and deployment—and assigns them to interdisciplinary team members across AI, chemistry, and operations. The platform supports real-time collaboration, where chemists can comment on model outputs, data scientists can update experiments, and leadership can monitor progress via dashboards.

Kwapio Molecular Property Prediction project management

A major feature is its smart documentation engine—every dataset, model version, experiment result, and validation metric is linked to a living document. This reduces duplication and improves traceability, which is critical in regulated environments like pharma and healthcare. The platform also includes automated reminders, experiment version control, and model governance templates, ensuring every stage from SMILES ingestion to prediction deployment is tracked and auditable.

Beyond technical execution, the system promotes transparency and speed by integrating commenting tools, meeting notepads, and milestone checklists, which are accessible to both internal R&D and external stakeholders (e.g., CROs, pharma partners). By aligning agile development principles with scientific rigor, Pitchworks and Kwapio have created a scalable framework to manage MPP workflows, cutting lead times by 30–50% and boosting reproducibility and innovation across their portfolio.


The Bottom Line

Molecular Property Prediction powered by Gen AI has the potential to shorten drug development timelines from years to months and to cut early-stage screening costs by 40–60%. By learning from millions of known molecules, these systems can identify promising candidates, eliminate poor ones, and reduce the reliance on expensive lab experiments.

"AI won't replace chemists—but chemists using AI will replace those who don’t."

Molecular Property Prediction is rapidly evolving from a chemistry lab challenge to a data-driven, AI-enabled pipeline. As the field advances beyond traditional QSAR and descriptor-based approaches, Gen AI models—leveraging graph structures, self-supervised learning, and multi-modal representations—are unlocking new levels of accuracy and scalability. However, these complex workflows also demand structured coordination, clear documentation, and robust collaboration.

This is where the synergy between AI innovation and smart project management, as seen in Pitchworks and Kwapio’s end-to-end workflow platform, becomes critical. By integrating scientific computation with agile management tools—such as task tracking, version control, and explainability dashboards—MPP workflows become faster, more reproducible, and enterprise-ready.

Ultimately, combining cutting-edge Gen AI with disciplined execution transforms MPP from experimental modeling into a scalable, reliable engine for next-gen drug discovery, formulation, and chemical innovation. The future of molecular science will not only be AI-first—it will be workflow-smart.


