Guest Column | September 10, 2025

Building Robust AI Systems For Drug Discovery Requires Epistemic Humility

By Arvind Rao, University of Michigan-Ann Arbor


The pharmaceutical industry stands at a critical juncture with artificial intelligence. While 2024's Nobel Prizes in Physics and Chemistry went to AI pioneers — signaling the technology's transformative potential — the translation from proof-of-concept to routine deployment in drug discovery remains surprisingly elusive. According to recent industry reports, only 5%–25% of AI pilot projects in pharma successfully graduate to production systems.1,2 This stark reality demands a fundamental reassessment of how we approach AI implementation in mission-critical healthcare and pharma applications.

The Problem Of Uncertainty Quantification

Consider this troubling observation: three years after ChatGPT's launch, large language models (LLMs) still lack basic uncertainty quantification. These systems present every output with equal confidence, whether discussing well-established pharmacological mechanisms or venturing into uncharted molecular territory. In drug discovery, where a single overlooked interaction can mean the difference between therapeutic breakthrough and clinical failure, this absence of calibrated uncertainty represents an unacceptable risk.

The pharmaceutical industry operates in a world of risk-asymmetric decisions. A 99% accuracy rate — impressive in consumer applications — becomes problematic when that 1% error could trigger a $2 billion clinical trial failure, regulatory delays, or patient harm. Yet current AI systems provide no systematic way to identify when they're operating within versus beyond their competence boundaries.
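
One pragmatic stopgap is to approximate a competence signal with model ensembling: train many models and treat their disagreement on a new input as a proxy for how far it lies from the system's experience. Below is a minimal sketch of the idea in Python; the dataset, model choice, and alert threshold are illustrative placeholders, not a validated pipeline.

```python
# Minimal sketch: competence-boundary flagging via ensemble disagreement.
# The dataset, model, and 0.4 threshold are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_new = X[:5]  # stand-ins for new candidate compounds

# A random forest is itself an ensemble; per-tree votes expose disagreement.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
per_tree = np.stack([tree.predict(X_new) for tree in model.estimators_])
p_active = per_tree.mean(axis=0)   # ensemble vote share for "active"
spread = per_tree.std(axis=0)      # disagreement across trees

for p, s in zip(p_active, spread):
    flag = "BEYOND COMPETENCE: expert review" if s > 0.4 else "within competence"
    print(f"P(active)={p:.2f}  disagreement={s:.2f}  {flag}")
```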

Phenotype-Guided Discovery: A Blueprint For Human–AI Collaboration

Recent work in phenotype-guided drug discovery offers a compelling model for productive human–AI partnership. By combining high-throughput imaging with active learning algorithms,3 researchers have successfully identified therapeutic targets for diseases — rediscovering in weeks what sometimes took the field decades to establish through traditional methods.

The key insight isn't the AI's pattern recognition capability but rather the iterative collaboration framework: human experts provide initial training examples, AI models identify areas of highest uncertainty, and humans selectively annotate only the most ambiguous cases. This approach transforms the traditionally adversarial "AI versus human" narrative into a synergistic partnership where each party contributes their unique strengths.
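
As an illustration of that loop (a generic uncertainty-sampling sketch, not the published pipeline), the Python below starts from a handful of labels and repeatedly queries an "expert," here an oracle array, for the cases the model finds most ambiguous; the model, batch size, and round count are arbitrary.

```python
# Illustrative uncertainty-sampling loop: the model requests labels only
# for its most ambiguous cases, standing in for the human-AI iteration
# described above. All settings are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_oracle = make_classification(n_samples=2000, n_features=30, random_state=1)
rng = np.random.default_rng(1)
labeled = list(rng.choice(len(X), size=20, replace=False))
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for round_id in range(5):
    model.fit(X[labeled], y_oracle[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    # Ambiguity = proximity to the 0.5 decision boundary.
    most_ambiguous = np.argsort(np.abs(proba - 0.5))[:10]
    queried = [unlabeled[i] for i in most_ambiguous]
    labeled += queried  # a human expert would label these cases
    unlabeled = [i for i in unlabeled if i not in set(queried)]
    print(f"round {round_id}: {len(labeled)} labels, "
          f"overall accuracy {model.score(X, y_oracle):.3f}")
```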

The Multi-Agent Emergence Challenge

As pharmaceutical companies rush to deploy multi-agent AI workflows — connecting models from different vendors trained on disparate data sets — we face an underappreciated "emergence" problem. Just as aircraft instruments must be carefully calibrated to work in concert, AI systems developed in isolation can produce unpredictable behaviors when combined. The pharmaceutical industry's proprietary data silos exacerbate this challenge, creating scenarios where critical drug interaction predictions depend on the unvalidated interplay of black-box models.

Education As Infrastructure: Building AI Fluency Across The Organization

The technology isn't the bottleneck. It's the adoption process. Successful AI deployment in pharma requires more than hiring data scientists; it demands organization-wide AI literacy.

For Research Scientists

Understand not just what AI can do but also its limitations. Recognize when a model's recommendations stem from genuine pattern recognition versus spurious correlations in training data. Scientists need to appreciate that while an AI system might identify a pathway's relevance to disease in a week — something that took researchers years to establish — this speed comes with trade-offs in interpretability and the risk of rediscovering known biology rather than generating novel insights. The ability to critically evaluate whether AI has genuinely discovered new biological mechanisms or simply reformulated existing knowledge becomes paramount.

For Clinical Development Teams

Develop intuition for when AI predictions require human oversight, particularly in novel therapeutic areas where training data may be sparse or biased. This includes understanding how active learning systems work — similar to how Netflix learns viewing preferences from limited initial data — and recognizing that starting with a few labeled examples can evolve into robust predictive models through strategic human–AI collaboration.4 Clinical experts must also grasp that when AI systems use attention mechanisms to highlight cells at tumor borders as prognostically important, they're often confirming well-established pathology knowledge rather than making revolutionary discoveries. This underscores the importance of domain expertise in interpreting AI outputs.

For Leadership

Appreciate that AI readiness isn't a technology procurement decision but a fundamental organizational transformation requiring sustained investment in human capital. Leaders must understand that building an AI model can happen over a weekend, but scaling it to routine deployment requires extensive change management, cultural shifts, and buy-in from frontline workers who will ultimately use these tools. The IBM Watson debacle in healthcare5 — where initial promises of revolutionizing cancer treatment ended in widespread failure — serves as a cautionary tale of what happens when technology deployment outpaces organizational readiness and scientific validation.

Toward Epistemic Humility In Pharmaceutical AI

The path forward requires what philosophers call "epistemic humility": acknowledging the boundaries of our knowledge. Ironically, while humans can say "I don't know," current AI systems lack this fundamental capability. Future pharmaceutical AI systems must:

  1. Quantify and communicate uncertainty at every prediction level. Implementing something as simple as color-coding outputs (green for high confidence, yellow for moderate, red for low) would represent a significant advance over current practice; a minimal sketch follows this list.
  2. Maintain audit trails linking recommendations to source data. Follow Google's model of "datasheets for datasets" and "model cards"6,7 that provide transparency equivalent to ingredient labels on food products.
  3. Enable interrogation of their reasoning processes. This allows scientists to understand whether a drug recommendation stems from established pathways or novel pattern recognition.
  4. Acknowledge knowledge boundaries explicitly. This is particularly critical when as little as 5%–10% of future internet content may be human generated (or even human audited), with AI-generated content creating recursive feedback loops.
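
Point 1 need not wait for new theory. A minimal sketch of the traffic-light idea, assuming the model already exposes a calibrated confidence score in [0, 1] (the band thresholds below are arbitrary):

```python
# Traffic-light labeling of model outputs by calibrated confidence.
# Assumes a confidence score in [0, 1]; band thresholds are arbitrary.
def confidence_band(confidence: float) -> str:
    if confidence >= 0.90:
        return "GREEN: high confidence"
    if confidence >= 0.60:
        return "YELLOW: moderate confidence, review recommended"
    return "RED: low confidence, expert review required"

for score in (0.97, 0.72, 0.41):
    print(f"prediction confidence {score:.2f} -> {confidence_band(score)}")
```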

Practical Steps For Implementation

Organizations serious about AI in drug discovery should consider the following four practices.

Federated Learning Approaches

Enabling model training across proprietary data sets without data sharing preserves competitive advantage while advancing collective capability. This allows models to travel between organizations and learn iteratively8,9 while data remains securely siloed, addressing the fundamental tension between collaboration and competition in pharmaceutical research. Companies can share the learnings embedded in trained models without ever exposing their proprietary molecular libraries or clinical trial data.
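
For intuition, here is a toy sketch of the federated averaging idea that underlies such schemes: each site takes a training step on its private data and shares only model weights, which a coordinator averages. The synthetic data, logistic model, and round count are stand-ins; real deployments add secure aggregation, differential privacy, and substantial engineering.

```python
# Toy federated averaging: each site trains locally on private data and
# shares only model weights; raw data never leaves the site.
import numpy as np

def local_gradient_step(w, X, y, lr=0.1):
    """One logistic-regression gradient step on a site's private data."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return w - lr * X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
sites = [(rng.normal(size=(200, 10)), rng.integers(0, 2, 200))
         for _ in range(3)]  # three companies' private data sets
w_global = np.zeros(10)

for round_id in range(20):
    # Each site starts from the shared global weights and trains locally.
    local_weights = [local_gradient_step(w_global, X, y) for X, y in sites]
    # Only the weights travel; the coordinator averages them into a new model.
    w_global = np.mean(local_weights, axis=0)

print("global weight norm after 20 rounds:", np.linalg.norm(w_global).round(3))
```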

Synthetic Data Generation

Create shareable data sets that preserve the statistical properties of proprietary data while protecting intellectual property. Although generating synthetic data that mirrors the structure of proprietary data sets is technically feasible, pharmaceutical companies have been surprisingly slow to adopt the approach, missing opportunities to accelerate industrywide learning while maintaining competitive advantage. Synthetic data can facilitate the training of more robust models and enable academic collaborations that would otherwise be impossible due to data sensitivity.
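
One simple recipe, sketched below, is a Gaussian copula: convert each real column to normal scores, fit their correlations, sample new correlated scores, and map them back through each column's empirical quantiles. The "assay data" here is simulated, and production use would rely on dedicated synthesis tools plus formal privacy auditing.

```python
# Bare-bones Gaussian-copula synthesis: the synthetic table matches each
# column's distribution and the columns' correlations without copying any
# real record.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
real = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 4))  # simulated assay data

# 1) Convert each column to standard-normal scores via its ranks.
ranks = real.argsort(axis=0).argsort(axis=0)
z = norm.ppf((ranks + 0.5) / len(real))

# 2) Fit the correlation of those scores and sample new correlated scores.
cov = np.cov(z, rowvar=False)
z_new = rng.multivariate_normal(np.zeros(4), cov, size=1000)

# 3) Map sampled scores back through each column's empirical quantiles.
synthetic = np.column_stack(
    [np.quantile(real[:, j], norm.cdf(z_new[:, j])) for j in range(4)]
)
print("real correlations:\n", np.corrcoef(real, rowvar=False).round(2))
print("synthetic correlations:\n", np.corrcoef(synthetic, rowvar=False).round(2))
```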

Standardized Evaluation Frameworks

Adopt emerging standards like TRIPOD-AI and TRIPOD-LLM10,11 for systematic assessment of model performance that goes beyond simplistic accuracy metrics. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) is a reporting guideline that improves the clarity and completeness of studies developing, validating, or updating prediction models. TRIPOD-AI is its extension tailored to AI/ML-based prediction models, ensuring transparent reporting of model development, training, evaluation, and interpretability aspects. TRIPOD-LLM is a further extension focused on LLM-based prediction or decision-support tools, addressing issues like prompt design, fine-tuning, context windows, and evaluation. Together, these guidelines aim to standardize reporting so models can be critically appraised, reproduced, and trusted across healthcare and biomedical research.

These frameworks must account for the phenomenon of model drift, where the same AI system can produce markedly different predictions as new data streams in. Evaluation must also consider the multidimensional nature of bias, examining not just bias in the data but also in training procedures and model predictions.
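
Drift monitoring can start simply: freeze a reference window of prediction scores at validation time and periodically compare the live score distribution against it, for example with a two-sample Kolmogorov-Smirnov test. The sketch below uses simulated score streams, and the window sizes and alert threshold are arbitrary.

```python
# Simple drift monitor: compare current prediction scores against a frozen
# reference window with a two-sample KS test; alert if they diverge.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.beta(2, 5, size=1000)        # score distribution at validation
current = rng.beta(2, 5, size=1000) + 0.10   # live stream, subtly shifted

stat, p_value = ks_2samp(reference, current)
if p_value < 0.01:                           # arbitrary alert threshold
    print(f"Drift alert: KS={stat:.3f}, p={p_value:.1e}; revalidate the model")
else:
    print("Prediction distribution stable; no action needed")
```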

Active Learning Pipelines

Implement human-in-the-loop systems that strategically leverage expert knowledge where it matters most. Taking inspiration from recommendation engines, these systems should start with minimal training data and iteratively query human experts only for the most ambiguous cases. This dramatically reduces the annotation burden from reviewing thousands of instances to perhaps a few hundred carefully selected edge cases. This approach transforms scarce expert time from a bottleneck into a strategic resource,3,4 achieving convergence to robust models through intelligent sampling rather than brute-force labeling.
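
To make the budget arithmetic concrete, the sketch below scores a mock batch of 5,000 unlabeled predictions with three standard query strategies (least confidence, margin, entropy) and flags only 300 for expert review; the class probabilities and budget are invented for illustration.

```python
# Illustrative query selection under a fixed annotation budget: rank a batch
# of unlabeled predictions by ambiguity and keep only `budget` for review.
import numpy as np

rng = np.random.default_rng(0)
proba = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=5000)  # mock 3-class scores
budget = 300                                  # experts review 300, not 5,000

least_conf = 1.0 - proba.max(axis=1)          # low top-class probability
sorted_p = np.sort(proba, axis=1)
margin = sorted_p[:, -1] - sorted_p[:, -2]    # small margin = ambiguous
entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)

# Higher score = more ambiguous for least-confidence and entropy;
# lower score = more ambiguous for margin.
for name, score, high_is_ambiguous in [("least-confidence", least_conf, True),
                                       ("margin", margin, False),
                                       ("entropy", entropy, True)]:
    order = np.argsort(score)
    picks = order[::-1][:budget] if high_is_ambiguous else order[:budget]
    print(f"{name}: queue {len(picks)} of {len(proba)} cases for expert review")
```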

The 2050 Challenge And Beyond

The Nobel Turing Challenge12 posits that by 2050, AI scientists will autonomously make Nobel prize-worthy discoveries. Whether or not this proves achievable, the pharmaceutical industry must prepare for a future where AI serves as an increasingly capable research partner. Success won't come from wholesale automation but from thoughtful integration that amplifies human expertise while acknowledging AI's limitations.

The companies that thrive will be those that resist the allure of black-box solutions in favor of transparent, robust systems built on foundations of scientific rigor. They will invest not just in algorithms but in the human infrastructure necessary to deploy them responsibly. Most importantly, they will maintain the epistemic humility to recognize that in drug discovery — where we're ultimately accountable for human lives — the standard isn't just what AI can do, but what it should do.

The future of pharmaceutical AI lies not in replacing human judgment but in creating systems that enhance it — combining the pattern recognition power of machines with the contextual understanding, ethical reasoning, and epistemic humility that remain uniquely human. Only through this synthesis can we realize AI's transformative potential while maintaining the rigor that drug discovery demands.

References:

  1. MIT NANDA Report via https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf
  2. Bain AI Healthcare Adoption Index: 24% pharma AI projects to production (https://www.bain.com/insights/the-healthcare-ai-adoption-index/)
  3. Nahal et al., Journal of Cheminformatics, 2024: Active learning + generative AI (https://pubmed.ncbi.nlm.nih.gov/39654043/)
  4. He et al., arXiv, 2024: Collaborative Intelligence in Sequential Experiments: Human-in-the-loop framework for drug discovery https://arxiv.org/abs/2405.03942
  5. Ross & Swetlitz, STAT, 2017: https://www.statnews.com/2017/09/05/watson-ibm-cancer/
  6. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM 64, 12 (December 2021), 86–92. https://doi.org/10.1145/3458723
  7. Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). Association for Computing Machinery, New York, NY, USA, 220–229. https://doi.org/10.1145/3287560.3287596
  8. MELLODDY Federated Learning: Owkin blog, https://www.owkin.com/blogs-case-studies/federated-learning-in-healthcare-the-future-of-collaborative-clinical-and-biomedical-research
  9. Rieke et al., npj Digital Medicine, 2020: The future of digital health with federated learning, https://www.nature.com/articles/s41746-020-00323-1
  10. TRIPOD-AI: https://www.bmj.com/content/385/bmj-2023-078378
  11. TRIPOD-LLM: https://www.nature.com/articles/s41591-024-03425-5
  12. Kitano, H. Nobel Turing Challenge: creating the engine for scientific discovery. npj Syst Biol Appl 7, 29 (2021). https://doi.org/10.1038/s41540-021-00189-3

This article was co-prepared and co-edited with generative AI tools.

About The Author

Arvind Rao is a professor in the Department of Computational Medicine and Bioinformatics at the University of Michigan. His group uses image analysis and machine learning methods to link image-derived phenotypes with genetic data across biological scales (i.e., single cell, tissue, and radiology data). Such methods have found application in radiogenomics, drug repurposing based on phenotypic screens, and spatial profiling in tissue as well as in spatial transcriptomics. Rao received his PhD in electrical engineering and bioinformatics from the University of Michigan, specializing in transcriptional genomics. He was a Lane Postdoctoral Fellow at Carnegie Mellon University, specializing in bioimage informatics.