Gen AI Based Molecular Property Prediction: A Silent Revolution in Drug Discovery
- Gokul Rangarajan
- Jul 1
- 7 min read
Gen AI in Molecular Property Prediction Beyond ADMET: How Generative AI Is Transforming Workflows and Project Management

This blog is part of the “GenAI in Healthcare Report 2025” by Murali Sudram in collaboration with Pitchworks VC Studio. The report explores how generative AI is reshaping scientific research, clinical workflows, and drug discovery. Stay tuned for more in-depth explorations of real-world applications and enterprise adoption strategies. You can download our Gen AI in Healthcare report here: https://www.pitchworks.club/healthcaregenaireport
If you are into manufacturing, you can download our Gen AI manufacturing report here: https://www.pitchworks.club/gen-ai-manufacturing-report-2025
If your interest is in clinical trials, we have a report on Gen AI in healthcare, Clinical Trials 2025: https://pitchworks.club/clinicaltrailgenaihealthcarereport2025
Molecular Property Prediction (MPP) refers to the computational task of estimating physicochemical, biological, or pharmacological properties of molecules based on their structure. These properties range from solubility, toxicity, lipophilicity, and binding affinity to blood-brain barrier permeability, and go far beyond the classic ADMET parameters.
The Traditional Process: Time-Intensive and Costly
In conventional drug discovery, determining a molecule’s properties required extensive laboratory experimentation. This included:
Wet lab experiments for absorption and solubility.
Animal models or in vitro testing for toxicity and efficacy.
High-throughput screening for activity against targets.
Figure: Molecular property prediction in a traditional lab.
This process could take weeks to months per molecule, cost thousands of dollars, and require interdisciplinary teams, typically involving computational chemists, medicinal chemists, biologists, and data analysts.
The average cost of developing a new drug is estimated at $2.8 billion, with a success rate of less than 10% (Fleming 2018, Sarkar 2023).
Traditional Tools for MPP (Pre-AI Era)
These tools relied on rule-based, statistical, and experimental methods:
Tool / Platform | Purpose | Example Properties |
QSAR software (Quantitative Structure–Activity Relationship) | Statistical modeling based on molecular descriptors | Toxicity, Solubility, Bioavailability |
ADMET Predictor | Predicts absorption, distribution, metabolism, excretion, and toxicity | ADMET |
ChemDraw / MarvinSketch | Molecule drawing & basic property calculation | LogP, pKa, MW |
TopKat (BIOVIA) | Predicts toxicity using empirical models | Mutagenicity, Carcinogenicity |
MOE (Molecular Operating Environment) | Structure-based drug design & property estimation | Solubility, Permeability |
Gaussian / ORCA / Spartan | Quantum mechanical calculations for small molecules | Electronic properties, Energies |
GROMACS / AMBER | Molecular dynamics simulations | Stability, Binding free energy |
Enter Gen AI: Predicting the Future of Molecules
Thanks to the explosion of chemical data and deep learning innovations, MPP has rapidly evolved. Graph neural networks (GNNs), transformers, and multimodal AI models are now able to predict molecular properties with high precision—drastically reducing cost, time, and failure rates.
According to Zhao et al. (2024), GSL-MPP leverages a two-level graph representation, combining intra-molecular (atom-level) and inter-molecular (similarity-based) information to boost prediction accuracy.
🧠 Using graph structure learning, the model embeds molecules not just as isolated graphs but also considers their similarity with other molecules—just like how medicinal chemists reason based on chemical families.
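To make the inter-molecular idea concrete, here is a minimal sketch of a molecule-to-molecule similarity graph. It is not the GSL-MPP method itself (which learns its graph structure during training); it simply links molecules whose fingerprints exceed a fixed Tanimoto threshold. The fingerprints, the molecule names, and the 0.5 threshold are all illustrative assumptions.

```python
from itertools import combinations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_graph(
    fingerprints: dict[str, set[int]], threshold: float = 0.5
) -> dict[str, list[str]]:
    """Link every pair of molecules whose similarity meets the threshold."""
    neighbors: dict[str, list[str]] = {name: [] for name in fingerprints}
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        if tanimoto(fp_a, fp_b) >= threshold:
            neighbors[name_a].append(name_b)
            neighbors[name_b].append(name_a)
    return neighbors


# Toy fingerprints: hypothetical bit indices, not computed from real structures.
toy_fps = {
    "mol_A": {1, 2, 3, 4},
    "mol_B": {1, 2, 3, 5},  # shares 3 of 5 distinct bits with mol_A
    "mol_C": {7, 8, 9},     # unrelated to the other two
}
graph = similarity_graph(toy_fps, threshold=0.5)
print(graph)
```

In this toy set only mol_A and mol_B end up connected, so a downstream model could pool information across that pair while treating mol_C on its own.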
🤖 Modern Tools in MPP (AI/Software-Driven Era)
These tools use deep learning, graph neural networks, and self-supervised learning:
Tool / Library | Description | Highlights |
ChemBERTa / Chemformer / ChemGPT | Pretrained transformer models on SMILES strings | Sequence-based property prediction |
GROVER / Uni-Mol / MolCLR | Self-supervised graph neural networks for molecules | Graph-based with 2D/3D features |
SCAGE (Nature Communications, 2025) | Self-conformation-aware graph transformer | Learns from 5M compounds with structure-function tasks |
DGL-LifeSci | Deep Graph Library for life sciences | Custom GNNs for MPP |
RDKit | Cheminformatics toolkit | Descriptor calculation, molecular similarity |
DeepChem | ML for drug discovery & materials science | Benchmarks, datasets, ready-to-use models |
Open Babel | File conversion and basic property prediction | Free and extensible |
PaddleHelix | AI tools by Baidu for life sciences | Protein-drug interaction, MPP tasks |
AutoGluon-Tabular | AutoML for molecular tabular data | Property prediction with little coding |
Core Innovations in MPP: From Single to Multimodal Approaches
As reviewed by Liyaqat et al. (2024), recent AI models in MPP fall under three broad strategies:
Single-Modality Models
Rely on a single molecular representation (e.g., SMILES, graphs, or molecular images).
Models: ChemBERTa, GIN, MPNN.
Multimodal Models
Combine multiple forms of input (e.g., 2D graph + 3D conformation).
Models: GROVER, MolAE, Uni-Mol.
Molecular Pretrained Models (MPMs)
Use massive chemical datasets (10–20 million molecules).
Pretraining tasks include fingerprint prediction, angle prediction, and masked node prediction.
For example, the SCAGE model (Qiao et al. 2025) was pretrained on 5 million drug-like compounds and integrates 2D and 3D structural information with functional group awareness. This led to notable improvements on 30 activity cliff benchmarks, a notorious challenge in drug discovery.
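The masked node prediction task mentioned above can be sketched in a few lines. This is only an illustration of how training pairs might be built, not SCAGE's actual pretraining code: the `mask_atoms` helper, the `[MASK]` token, and the 15% mask rate are assumptions, and a real model would mask node features inside a molecular graph rather than a flat list of element symbols.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def mask_atoms(
    atoms: list[str], mask_rate: float = 0.15, seed: int = 0
) -> tuple[list[str], dict[int, str]]:
    """Hide a random subset of atom labels.

    Returns the masked sequence (model input) and a position -> label
    mapping (the targets the model would be trained to recover).
    """
    rng = random.Random(seed)
    # Mask at least one atom, but never more atoms than exist.
    n_mask = min(len(atoms), max(1, round(len(atoms) * mask_rate)))
    positions = rng.sample(range(len(atoms)), n_mask)
    masked = list(atoms)
    targets: dict[int, str] = {}
    for pos in positions:
        targets[pos] = atoms[pos]
        masked[pos] = MASK
    return masked, targets


# The 13 heavy atoms of aspirin (C9H8O4), listed in an arbitrary order.
aspirin_atoms = ["C", "C", "O", "O", "C", "C", "C", "C", "C", "C", "C", "O", "O"]
masked, targets = mask_atoms(aspirin_atoms, mask_rate=0.15)
print(masked)
print(targets)
```

Because no labels are required, the same recipe scales to the millions of unlabeled molecules used in pretraining.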
Understanding Molecular Representations
To predict properties, a model must "understand" the molecule. Here are the formats used:
SMILES (1D) – simple string notation of molecules.
2D Graphs – atoms as nodes, bonds as edges.
3D Conformers – spatial geometry of molecules.
Images – pixel-based views for computer vision models.
🧪 Different formats capture different aspects of molecular behavior. For instance, 3D geometry is essential for predicting properties like binding affinity or toxicity, while SMILES strings are more scalable for pretraining language models.
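As a rough illustration of the first two formats, the sketch below splits a SMILES string into tokens (the 1D view a language model would consume) and stores a hand-written 2D graph as an adjacency list. The regular expression is a simplified assumption that covers common organic-subset tokens, not a full SMILES grammar, and the ethanol graph is typed in by hand rather than parsed. In practice a toolkit such as RDKit would handle both steps.

```python
import re

# Simplified token pattern: bracket atoms, two-letter halogens, single-letter
# atoms (aliphatic and aromatic), bond and branch symbols, ring-closure digits.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[()=#\-+\\/:.~@*$]|%\d{2}|\d"
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def adjacency(n_atoms: int, bonds: list[tuple[int, int]]) -> dict[int, list[int]]:
    """Build an undirected adjacency list: atoms are nodes, bonds are edges."""
    adj: dict[int, list[int]] = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


# 1D view: acetyl chloride as a token sequence.
print(tokenize("CC(=O)Cl"))

# 2D view: ethanol (SMILES "CCO") with hand-written atoms and bonds.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
print(adjacency(len(ethanol_atoms), ethanol_bonds))
```

The token list feeds sequence models such as transformers, while the adjacency list is the kind of structure a graph neural network passes messages over.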
Challenges and Gaps
Despite progress, real-world deployment of MPP still faces issues:
⚠️ Overfitting to benchmarks like MoleculeNet (Deng et al. 2023) can lead to inflated claims.
⚠️ Activity cliffs — where small structural changes cause big functional changes — are still hard to predict.
📉 Low-data regimes suffer from poor generalization.
In fact, Deng et al. ran 62,820 experiments and found many models fail when tested on simple property descriptors, showing that scaling alone doesn’t guarantee learning.
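To show what an activity cliff looks like in data, here is a small sketch that flags pairs of compounds that are structurally similar yet differ sharply in potency. The compound names, fingerprint bits, pIC50 values, and both thresholds (Tanimoto of at least 0.7, potency gap of at least 2 log units) are illustrative assumptions, not values taken from the cited studies.

```python
from itertools import combinations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(
    compounds: dict[str, tuple[set[int], float]],
    min_similarity: float = 0.7,
    min_delta: float = 2.0,
) -> list[tuple[str, str, float, float]]:
    """Return (name_a, name_b, similarity, potency gap) for each cliff pair.

    `compounds` maps a name to (fingerprint bits, pIC50).
    """
    cliffs = []
    for (name_a, (fp_a, act_a)), (name_b, (fp_b, act_b)) in combinations(
        compounds.items(), 2
    ):
        sim = tanimoto(fp_a, fp_b)
        delta = abs(act_a - act_b)
        if sim >= min_similarity and delta >= min_delta:
            cliffs.append((name_a, name_b, round(sim, 2), round(delta, 2)))
    return cliffs


# Toy data: hypothetical fingerprints and potencies.
toy_compounds = {
    "cmpd_1": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.1),
    "cmpd_2": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.3),  # one bit swapped, far weaker
    "cmpd_3": ({1, 2, 3, 4, 5, 6, 7, 8, 10, 11}, 7.9),
    "cmpd_4": ({20, 21, 22}, 4.0),
}
cliffs = find_activity_cliffs(toy_compounds)
print(cliffs)
```

Only the cmpd_1 / cmpd_2 pair is flagged. A model that relies mostly on structural similarity would predict nearly the same potency for both, which is why such pairs are hard to get right.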
Popular Tools & Datasets
Libraries:
PaddleHelix (AI for drug discovery)
Uni-Mol (3D molecular representation)
ChemBERTa (language models for chemistry)
Datasets:
MoleculeNet (BBBP, ESOL, Tox21)
QM9, QM7 (quantum mechanical properties)
HIV, BACE, FreeSolv
See more on Papers With Code: Molecular Property Prediction
Gen AI Workflow in Molecular Property Prediction
Step 1: Molecule Input
⬇
Molecules are represented as:
- SMILES strings (1D)
- 2D molecular graphs
- 3D conformations
- Images (optional)
⬇
Step 2: Data Preprocessing
⬇
- Convert structures into model-readable formats
- Standardize molecules (e.g., remove salts, normalize tautomers)
- Generate molecular descriptors or fingerprints (optional)
⬇
Step 3: Molecular Representation Learning
⬇
Use Gen AI models such as:
- Graph Neural Networks (GNNs)
- Transformers (ChemBERTa, SCAGE, Chemformer)
- Contrastive learning models
Goal: Learn embeddings that capture molecular structure and behavior
⬇
Step 4: Pretraining (Optional but Powerful)
⬇
Train on large unlabeled datasets (e.g., 5M+ molecules)
Tasks include:
- Masked atom prediction
- 3D angle prediction
- Functional group prediction
- Fingerprint recovery
⬇
Step 5: Fine-tuning on Specific Property Prediction Tasks
⬇
- Solubility
- Toxicity (Tox21, ToxCast)
- Blood-brain barrier penetration (BBBP)
- Lipophilicity
- Drug-likeness, etc.
⬇
Step 6: Evaluation and Validation
⬇
Use benchmark datasets (e.g., MoleculeNet)
Metrics: RMSE, ROC-AUC, Accuracy, MAE
⬇
Step 7: Interpretation & Deployment
⬇
- Identify substructures linked to activity (e.g., using attention or saliency maps)
- Integrate into drug screening pipelines
- Prioritize or eliminate candidate molecules
If model performance is weak →
Go back to Step 3 or 4:
- Try better representations
- Use more training data
- Apply domain-specific augmentation
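The metrics named in Step 6 are simple to compute by hand, which helps when sanity-checking a library's output. The sketch below implements RMSE and MAE for regression tasks (such as solubility) and accuracy and ROC-AUC for classification tasks (such as BBBP). The toy numbers and the 0.5 decision threshold are assumptions; in practice a library such as scikit-learn would normally be used.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels: list[int], scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of molecules classified correctly at the given score threshold."""
    hits = sum((s >= threshold) == bool(label) for label, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy regression task (e.g., predicted vs. measured log-solubility).
y_true, y_pred = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))

# Toy classification task (e.g., permeable = 1, not permeable = 0).
labels, scores = [1, 0, 1, 0], [0.9, 0.4, 0.35, 0.1]
print(accuracy(labels, scores), roc_auc(labels, scores))
```

ROC-AUC depends only on how the scores rank molecules, while accuracy also depends on the chosen threshold. That is one reason benchmark classification results are usually reported as ROC-AUC.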
Project Management Workflow in Molecular Property Prediction by Pitchworks & Kwapio
Molecular Property Prediction (MPP) is a complex, multi-disciplinary process that involves chemistry, data science, AI modeling, and regulatory insight. To manage such a high-stakes workflow, Pitchworks and its portfolio company Kwapio have co-developed a comprehensive end-to-end project management system tailored for MPP research and productization. This workflow enables scientific and technical teams to collaborate seamlessly across discovery, model building, validation, and deployment stages.

The system offers integrated modules for project planning, task tracking, data management, model lifecycle tracking, and compliance documentation. It includes Kanban-style boards to handle different ticket types—such as dataset acquisition, preprocessing, pretraining, fine-tuning, and deployment—and assigns them to interdisciplinary team members across AI, chemistry, and operations. The platform supports real-time collaboration, where chemists can comment on model outputs, data scientists can update experiments, and leadership can monitor progress via dashboards.

A major feature is its smart documentation engine—every dataset, model version, experiment result, and validation metric is linked to a living document. This reduces duplication and improves traceability, which is critical in regulated environments like pharma and healthcare. The platform also includes automated reminders, experiment version control, and model governance templates, ensuring every stage from SMILES ingestion to prediction deployment is tracked and auditable.
Beyond technical execution, the system promotes transparency and speed by integrating commenting tools, meeting notepads, and milestone checklists, which are accessible to both internal R&D and external stakeholders (e.g., CROs, pharma partners). By aligning agile development principles with scientific rigor, Pitchworks and Kwapio have created a scalable framework to manage MPP workflows, cutting lead times by 30–50% and boosting reproducibility and innovation across their portfolio.
The Bottom Line
Molecular Property Prediction powered by Gen AI has the potential to shorten drug development timelines from years to months and to cut early-stage screening costs by 40–60%. By learning from millions of known molecules, these systems can identify promising candidates, eliminate poor ones, and reduce the reliance on expensive lab experiments.
"AI won't replace chemists—but chemists using AI will replace those who don’t."
Molecular Property Prediction is rapidly evolving from a chemistry lab challenge to a data-driven, AI-enabled pipeline. As the field advances beyond traditional QSAR and descriptor-based approaches, Gen AI models—leveraging graph structures, self-supervised learning, and multi-modal representations—are unlocking new levels of accuracy and scalability. However, these complex workflows also demand structured coordination, clear documentation, and robust collaboration.
This is where the synergy between AI innovation and smart project management, as seen in Pitchworks and Kwapio’s end-to-end workflow platform, becomes critical. By integrating scientific computation with agile management tools—such as task tracking, version control, and explainability dashboards—MPP workflows become faster, more reproducible, and enterprise-ready.
Ultimately, combining cutting-edge Gen AI with disciplined execution transforms MPP from experimental modeling into a scalable, reliable engine for next-gen drug discovery, formulation, and chemical innovation. The future of molecular science will not only be AI-first—it will be workflow-smart.