Guest Column | January 30, 2026

AI For NAM-Ready Drug Development: Turning Promise Into Trust

By Farhan Khodaee


Artificial intelligence (AI) methods have the potential to significantly improve drug development processes by shrinking costs, accelerating timelines, and reducing reliance on animal experiments. However, many uncertainties about AI approaches make them risky to implement in high-stakes applications in biomedicine. In this article, I argue that AI is naturally suited to new approach methods (NAMs) and that its value will be maximized when it is deployed as part of a rigorously validated, evidence-based strategy. I’ll close with a practical 5S framework for making AI models NAM-ready.

What Exactly Do We Mean By “AI?”

It seems like a trivial question, but when people talk about artificial intelligence, they could mean very different things. In recent years, with the rise of large-scale language models like ChatGPT and Claude, AI has gone mainstream. But transformer-based models trained on huge volumes of text are not the only type of artificial intelligence.

Traditionally, the term machine learning has referred to methods, including neural networks, that rely on statistical learning. You can think of them as a form of high-capacity regression over many data points with complex relationships, as opposed to the simpler regression methods used in classical statistics. Machine learning methods evolved over the past decade, but they were mostly limited in scale and application. It wasn’t until recently, specifically after the paper “Attention Is All You Need,” that a new paradigm emerged: models trained simply to predict the next token that could nonetheless perform many different tasks. This turned out to be remarkably useful for modeling text, which is the backbone of communication. By modeling language at scale, transformers redefined what many of us now mean by AI.

Today, the public often uses the term AI to refer broadly to large, highly sophisticated models. Yet not all AI systems are created equal. This distinction becomes especially important in drug discovery and development, where progress depends on answering precise scientific questions and where errors carry significant consequences. In this context, one could argue that general-purpose models are not always well suited: their lack of transparency and tendency to produce confident but incorrect outputs make them risky tools for high-stakes scientific decision-making.

In contrast, a different class of AI models is emerging in biomedicine: systems that leverage massive data sets while remaining tightly grounded in domain-specific predictions. Application-focused models such as AlphaFold and ESM have demonstrated that carefully designed architectures can deliver reliable and scientifically meaningful results. Still, as with many transformative technologies, adoption takes time, and trust must be earned through demonstrated reliability.

Therefore, in drug development, we are facing a fundamental dilemma. On one hand, the potential of AI to accelerate discovery and unlock new insights is undeniable. On the other hand, the possibility of hallucinations and unverified predictions poses a serious challenge to adoption. In an environment where errors are unacceptable, trust is not optional. Building transparent, interpretable, and dependable algorithms is essential, and the absence of these guarantees explains why the field remains cautious about integrating AI into development pipelines.

Opportunities And Risks Of Using AI In Drug Development

The promise of AI is simple and powerful: to leverage our accumulated knowledge to make better predictions about the future. Humans already do this instinctively. Consider drug safety, for example, an area I am particularly excited about. When medicinal chemists or toxicologists evaluate a new compound, they routinely draw on years of experience, mentally comparing it to compounds they have seen before to assess potential toxicity or side effects. This intuition is valuable but inherently limited. How many relevant data points can any one expert see over the course of their career?

AI offers a way to scale this kind of reasoning beyond human capabilities. By learning from decades of historical data, models can help formalize expert intuition and enable more systematic assessments of risk. Used rigorously, these tools can refine and narrow the experimental search space, prioritize safer candidates earlier, and reduce unnecessary or redundant experiments.

It’s easy to get excited about new applications where we see huge opportunities to build data-driven methods that can accelerate workflows, reduce costs, and increase throughput. However, it’s not all about speed and efficiency. In drug development, those gains only matter after a model has been placed in the correct decision context, rigorously validated, and shown to remain reliable when the data inevitably shift.

5S Framework For NAM-Ready AI Models

This tension between opportunity and risk motivates the need for a more disciplined approach to AI adoption. This is particularly true in the context of NAMs, where AI models are increasingly expected to support decisions traditionally reliant on animal studies or clinical evidence. In such settings, the bar is really high: “can AI be trusted and reliably deployed in real-world drug development pipelines?”

To address this, I propose a 5S framework for building NAM-ready AI models: systems designed to be rigorously evaluated, stress-tested, and ultimately trusted in drug development. While many elements of this framework are technical in nature, they build on principles the AI community has been developing for years: interpretability, robustness, reproducibility, and decision accountability. The difference is that in NAMs, these principles are not optional; they are prerequisites for application.

1. Scope: Define exactly what the model is for

The first step toward trustworthy AI deployment is clarity of purpose and context of use. A model cannot be validated, governed, or interpreted unless its role in the drug development process is explicitly defined. Scope is about preventing general-purpose thinking in a domain that demands precision. Key dimensions include (a minimal sketch of how they might be recorded follows the list):

  • Decision: Is the model ranking compounds, predicting toxicity risk, selecting assays, or inferring mechanism of action? Each decision implies different acceptable error profiles, validation strategies, and downstream consequences.
  • Operating range: NAMs often involve narrow biological contexts: specific cell types, organoids, donor populations, or endpoints. A model trained in one chemical or biological space could fail when applied outside it.
  • Error costs: Drug development is asymmetric; false negatives may discard promising therapies, while false positives may waste years of resources or introduce safety risks. A NAM-ready model must be scoped around the real decision environment.
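
To make this concrete, here is a minimal sketch of how a context of use might be recorded alongside a model. The field names and example values are illustrative assumptions, not a standard schema; the point is simply that the decision, operating range, and error costs are written down before the model is used.

```python
from dataclasses import dataclass

@dataclass
class ContextOfUse:
    """Hypothetical record pinning a model to one decision context."""
    decision: str           # what the model is asked to decide
    operating_range: dict   # the biological/chemical space it was trained on
    error_costs: dict       # relative cost of each error type
    out_of_scope_action: str = "defer to wet-lab validation"

# Example record for a hypothetical hepatotoxicity-ranking model.
cou = ContextOfUse(
    decision="rank candidate compounds by hepatotoxicity risk",
    operating_range={
        "chemical_space": "small molecules, MW 200-600",
        "biology": "primary human hepatocytes, 3 donor lots",
        "endpoint": "ATP-based viability at 24 h",
    },
    error_costs={"false_negative": 10.0, "false_positive": 1.0},
)
print(cou)
```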

2. Signal: Ensure inputs and labels are accurate and contextual

Even the most advanced model cannot overcome a poor signal. In drug development, the limiting factor is whether the data reflect true biology and are accurately recorded.

  • Data provenance and integrity: Biological data are highly sensitive to batch effects, lab protocols, and measurement drift. Without strict provenance tracking and leakage controls, models may simply learn experimental artifacts (a minimal sketch of one such control follows the list).
  • Endpoint quality: Many biological endpoints are ambiguous or poorly standardized. Weak labels reduce accuracy and erode interpretability. In high-stakes settings, label uncertainty must be treated explicitly and checked across multiple sources.
  • Assay and feature relevance: AI NAMs must predict outcomes that matter to the drug development decision, and their features must be biologically meaningful. For example, a toxicity model trained only on high-throughput in vitro assays learns to predict that specific assay, not necessarily human-relevant toxicity.
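
As one example of a leakage control, the sketch below holds out entire experimental batches during validation so a model cannot score well simply by memorizing batch artifacts. The data are synthetic, and the grouping variable is assumed to come from provenance metadata (batch, lab, donor, or chemical scaffold).

```python
# Minimal sketch of a leakage control: keep every experimental batch
# entirely in train or test, so batch artifacts cannot inflate performance.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))              # synthetic features
y = rng.integers(0, 2, size=300)            # synthetic binary labels
batches = rng.integers(0, 10, size=300)     # provenance: batch of origin per sample

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=batches):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"held-out batches AUROC: {auc:.2f}")
```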

3. Score: Evaluate performance fit to the decision

The performance of AI NAMs cannot be reduced to metrics such as accuracy or area under the receiver operating characteristic curve (AUROC). A model should be evaluated along multiple dimensions to determine whether it helps make better drug-development decisions.

  • External validation: Internal cross-validation is necessary, but NAM-ready models must prove generalization across independent labs, data sets, or experimental conditions.
  • Calibration: Drug development decisions are not a simple yes or no; scientists need probabilities they can act on. A predicted toxicity risk of 0.9, for example, is ambiguous on its own. A calibrated model means that among compounds predicted at 0.9, roughly 90% truly turn out to be toxic (see the sketch after this list).
  • Decision-weighted metrics: Performance metrics are not objective; the proper metric depends on the decision context and the error thresholds. In certain applications, cost or speed considerations may be prioritized even when performance metrics such as AUROC are low.
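
A simple way to check calibration is to bin held-out predictions and compare the predicted risk in each bin to the observed outcome rate. The sketch below does this with simulated predictions; in practice, the probabilities would come from a model evaluated on an external data set.

```python
# Minimal sketch of a calibration check on held-out predictions.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
predicted_risk = rng.uniform(0, 1, size=5000)
# Simulate outcomes from a slightly over-confident model.
true_toxic = (rng.uniform(size=5000) < 0.2 + 0.6 * predicted_risk).astype(int)

frac_toxic, mean_pred = calibration_curve(true_toxic, predicted_risk, n_bins=10)
for p, f in zip(mean_pred, frac_toxic):
    print(f"predicted ~{p:.2f} -> observed toxic fraction {f:.2f}")
# A well-calibrated model keeps these two columns close; large gaps mean
# a score of 0.9 cannot be read as a 90% risk.
```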

4. Stability: Reliability under real-world variation

AI NAMs operate across varying conditions, including donors, batches, labs, protocols, and evolving biological contexts, and they may be continuously updated. Stability is what determines whether a model survives contact with reality.

  • Cross-context robustness: Models must remain reliable across labs, donors, cell lines, and experimental setups. A model that collapses under minor perturbation cannot support regulatory-grade decisions.
  • Drift detection and triggers: Stability requires ongoing vigilance. Models must detect when incoming data diverge from training conditions and trigger re-validation (a minimal sketch follows the list).
  • Stress testing: NAM-ready AI should undergo adversarial-style perturbations, out-of-distribution evaluation, and robustness checks. These tests can reveal failure modes before deployment.
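
As a minimal illustration of a drift trigger, the sketch below compares each incoming feature's distribution to the training distribution with a two-sample Kolmogorov-Smirnov test and flags re-validation when the shift is significant. The data and threshold are illustrative; a real pipeline would use domain-appropriate tests and pre-registered thresholds.

```python
# Minimal sketch of a drift trigger: compare each incoming feature's
# distribution against the training distribution and flag re-validation
# when the shift is large.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
train_features = rng.normal(0.0, 1.0, size=(2000, 5))
incoming = rng.normal(0.0, 1.0, size=(400, 5))
incoming[:, 3] += 0.8   # simulate a shifted assay readout in one feature

for j in range(train_features.shape[1]):
    stat, p_value = ks_2samp(train_features[:, j], incoming[:, j])
    if p_value < 0.01:   # illustrative threshold, not a standard
        print(f"feature {j}: KS={stat:.2f}, p={p_value:.1e} -> trigger re-validation")
```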

5. Stewardship: Governance, auditability, and safe improvement

For AI NAMs to become a governed partner in drug development, there have to be mechanisms to control, monitor, and audit their use. Failure modes must be documented and addressed over time. A real strength of AI methods is that they can improve as they see more data, so mechanisms for safe, controlled updating must exist.

  • Versioning and reproducibility: AI NAMs must track data, code, and model versions rigorously. Reproducibility analysis is also essential for auditability and regulatory confidence.
  • Monitoring and rollback: Deployment requires continuous performance monitoring, documentation, and the ability to incorporate feedback or roll back when failures occur.
  • Human-in-the-loop escalation: AI NAMs must know when to abstain. Clear escalation rules are needed (see the sketch after these questions):
    • When does the model defer to wet lab validation?
    • When does uncertainty demand human review?
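
A minimal sketch of such an escalation rule is shown below: the model abstains when its input looks out of distribution and routes borderline predictions to human review. The thresholds and names are illustrative assumptions, not part of any published pipeline.

```python
# Minimal sketch of a human-in-the-loop escalation rule.
def route_prediction(toxic_probability: float, ood_score: float) -> str:
    """Decide whether to accept, escalate, or abstain from a prediction."""
    if ood_score > 0.5:                       # input outside the operating range
        return "abstain: outside operating range -> wet-lab validation"
    if 0.35 < toxic_probability < 0.65:       # prediction too uncertain to act on
        return "escalate: uncertain prediction -> human review"
    return "accept: log prediction and monitor outcome"

# Illustrative examples with hypothetical scores.
for prob, ood in [(0.92, 0.1), (0.48, 0.1), (0.92, 0.8)]:
    print(prob, ood, "->", route_prediction(prob, ood))
```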

Many research labs and companies are already investing heavily in methods to ensure AI models perform well. Looking ahead, as building AI models becomes increasingly easy, the real differentiator will be trustworthiness. For AI to be used reliably in improving patients’ lives, models must be designed, validated, and maintained in ways that enable confidence, auditability, and long-term robustness.

In conclusion, not all AI models are created equal. Even within a narrow domain like toxicity prediction, thousands of neural network models exist. What truly distinguishes them is not their architectural novelty but the rigor with which they are tested and validated for their specific context of use. Open-sourcing models is another powerful way to promote transparency and democratize evaluation within the scientific community. Ultimately, science advances through hypothesis generation and testing: the number of hypotheses is vast, but only those that are rigorously tested and consistently validated survive. AI models should be treated no differently. The next time you see an impressive AI model, treat it as a new hypothesis and ask the critical question: has this hypothesis been robustly tested?

About The Author

Farhan Khodaee is cofounder and CEO of Absentia Labs, a frontier AIxBio company building advanced platforms that turn complex biological data into reliable insights. Trained as a bioengineer, he earned his Ph.D. at MIT, where he developed machine learning models for large-scale transcriptomics data. Farhan’s career spans medical devices, biotech, and biopharma, including product management at Merck and venture creation at Flagship Pioneering. He founded Absentia Labs to make drug discovery predictable. He focuses on trustworthy AI for high-stakes drug development and NAMs, including their rigorous validation, interpretability, monitoring, and long-term performance, with the goal of accelerating safer decisions from discovery through development.