Guest Column | July 1, 2026

Wondering If Your Org Should Purchase R&D Datasets?

By Lilly Saiontz, Clarkston Consulting

The life sciences industry is moving quickly toward more AI-enabled, data-driven research and development, which is changing how organizations evaluate and invest in scientific data assets.1 As advances in artificial intelligence, precision medicine, and multimodal analytics accelerate, pharmaceutical and biotech companies are increasingly prioritizing access to large-scale, high-quality data sets that combine genomic, proteomic, clinical, and real-world evidence.

Increased participation in precompetitive consortia and collaborative research ecosystems is also reshaping how data is generated and shared across the industry. In this environment, competitive advantage is increasingly determined not only by access to differentiated data but also by an organization’s ability to rapidly integrate, govern, analyze, and apply these data sets to drive faster scientific insights, portfolio decisions, and therapeutic innovation.

Considerations For Investing In R&D Data

Computational biology continues to thrive in a world of Big Data, with researchers using in silico models to support new target identification and drug discovery.2 As companies evaluate opportunities to use AI in research and development,3 research organizations must take stock of the data available to them and how useful it is to the organization.

Because research is often time-intensive and manual and relies on unstructured data,4 organizations may need to identify and evaluate the data sources available within the research organization, which can shed light on gaps in research capabilities. As the industry trends toward an ecosystem of connected data sets, organizations are quickly developing, revising, and executing their data strategies.

Specific data types, such as genomics, proteomics, and other computational data, are increasingly valuable for target validation.5 For many organizations, the question isn’t simply whether they can access the data but also whether they can use it well.

When evaluating opportunities for purchasing research data, a company may be presented with two options: directly purchasing data or joining a precompetitive consortium.

What Are Precompetitive Consortia?

Precompetitive consortia are partnership programs that join multiple parties to address a common goal.6 In the pharmaceutical R&D industry, this can refer to a collaborative partnership that funds the generation of research data and promotes collaboration.

These programs apply varied access restrictions to their members and to the public. By joining these programs, partners may benefit from exclusive access to the generated data.7 These partnerships can carry a substantial up-front cost, but an organization that uses the data effectively could gain an advantage by better informing its portfolio decisions.

Newly available research provides the opportunity to find unique insights earlier than competitors, leading to novel therapeutics. By acquiring research data sets, companies can fill gaps in their computational biology capabilities, jumpstarting their research and development.

The Key Question: Should You Buy The Data At All?

Before entering a new contract, leaders should consider whether their organization is equipped to analyze large-scale data sets.

Though there are clear benefits to joining these consortia, the benefits are only realized with proper readiness and preparation. Consider storage and analytical platforms that could be leveraged for the data set and how this can impact the current research process. Additionally, there must be adequate internal expertise to properly analyze and interpret the data. Without the right people, technology, and processes in place, this investment may not provide the right returns.

Key Considerations Before Investing

Strategy And Differentiation

Prior to investing in a data set, the organization should conduct an inventory of data already available and any gaps preventing effective target validation. Additionally, leaders should develop processes that expand the availability of data already in use. Some questions to consider include:

  • What type of data may be missing, such as patient data, real-world data, or in vitro data? This can inform which partnerships to join.
  • Is the data set both unique and complementary to the research strategy? By reviewing available data, leaders can determine if there is a gap in a particular type of data set, including disease-enriched, population-enriched, breadth-based, or longitudinal data sets.
  • Are competitors seeking the same data set? If so, does the organization need to join to prevent falling behind or choose something else to have a competitive edge?
  • What is the novelty of this data set? Is there enough reward to warrant the risk of joining the partnership and sharing the cost of data generation?
  • Will access to peer organizations in consortia offer unique benefits to the organization?
  • Has sufficient due diligence on the research group been done to confirm the quality and usability of the study?

When selecting a data set, weigh it against the organization’s overall goals and strategy to evaluate whether the data actually meets the organization’s needs.

Scalability

Before investing in a data set, organizations should consider how it will be used both in the short term and the long term, particularly its potential to inform research priorities and related portfolio decisions over time. Although a new data type may address a near-term need, leaders should also evaluate how well it will remain relevant as research methods advance and new scientific breakthroughs emerge.

Organizations should also consider the full range of potential use cases across the enterprise, along with the likely end users, since these factors will inform governance and access requirements. These key data sets can provide value not only in the immediate future but also over time as the organization’s research focus, resources, and strategy evolve. As AI becomes even more ingrained in everyday tasks, its scalability must be a factor to consider when evaluating long-term impact.

Economics And ROI

Before investing in these data sets, leaders are likely most interested in the bottom line: How much does the data set cost, is it within our budget, and what is the return on investment?

The cost of acquiring a data set is typically high, but it can create significant value by enabling new target identification and informing portfolio decisions. Realizing that value depends on the organization’s ability to use the data effectively, especially if the data is available to competitors. In many cases, purchasing access may still be more cost-effective than attempting to replicate the study at scale in-house.

There are also additional costs associated with acquiring and analyzing the data sets that must be considered to confirm operational readiness.

Some research data sets must be stored and analyzed on a secure computational platform that incorporates the tools necessary to analyze the data. Some computational platforms and servers charge expensive data transfer, storage, and compute fees that can contribute significantly to the overall cost. For example, genetic data sets are extremely large,8 resulting in high storage costs. Some data sets have restrictions on data storage and access,9 limiting the available infrastructure options and further driving up cost.

Although there are high costs associated with standing up infrastructure and skills, these investments may be reused for future initiatives, increasing the return on investment.
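
As a rough illustration of how the cost components described above add up, the following Python sketch totals one-time and recurring costs over a planning horizon and computes a simple return on investment. It is a minimal sketch, not a pricing model: the class name, field names, and every dollar figure are hypothetical placeholders chosen for illustration, and real platform fees and staffing costs will differ.

```python
from dataclasses import dataclass


@dataclass
class DatasetCostAssumptions:
    """Hypothetical planning inputs; none of these figures come from a real quote."""
    access_fee: float            # one-time purchase or consortium membership fee
    size_tb: float               # data set size in terabytes
    storage_per_tb_month: float  # platform storage rate per terabyte per month
    transfer_per_tb: float       # one-time data transfer rate per terabyte
    compute_per_month: float     # expected monthly analysis compute spend
    staffing_per_year: float     # new hires or contractors needed for analysis


def total_cost(a: DatasetCostAssumptions, years: int) -> float:
    """Sum one-time, recurring platform, and staffing costs over the horizon."""
    months = years * 12
    one_time = a.access_fee + a.size_tb * a.transfer_per_tb
    recurring = months * (a.size_tb * a.storage_per_tb_month + a.compute_per_month)
    staffing = years * a.staffing_per_year
    return one_time + recurring + staffing


def simple_roi(expected_value: float, cost: float) -> float:
    """Return (expected value - cost) / cost."""
    return (expected_value - cost) / cost


# Placeholder scenario: all numbers are assumptions for illustration only.
scenario = DatasetCostAssumptions(
    access_fee=1_000_000,
    size_tb=500,
    storage_per_tb_month=20,
    transfer_per_tb=50,
    compute_per_month=10_000,
    staffing_per_year=300_000,
)
three_year_cost = total_cost(scenario, years=3)
print(f"Estimated three-year cost: ${three_year_cost:,.0f}")
```

In this placeholder scenario, platform and staffing costs exceed the access fee itself over three years, which is why those line items belong in the ROI discussion from the start.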

Operational Readiness

In addition to considering financial position, leaders must consider whether the group is ready to receive, access, analyze, interpret, and act on the data set.

Companies should confirm the researchers have the appropriate skillset to analyze and use the data. If the organization doesn’t have the skillset internally, the company must consider hiring new researchers or using contractors. Just as important, skilled researchers must have the bandwidth to perform the time-intensive work involved in data access, quality control and data cleansing, analysis, and interpretation. Leaders should protect the researchers’ time to allow quicker actionable insights from the data set and reinforce the high priority of the effort across the broader research organization.

Infrastructure readiness is equally important. These data sets can be much larger than any the organization currently uses, and existing platforms may not scale to handle them. Leaders should allow enough time in their planning to implement new platforms and integrate them with AI solutions and existing systems. They should also recognize that some consortia may restrict data storage locations, such as through country restrictions or privacy regulations.10

To prepare for operational readiness, organizations should establish and maintain a detailed project plan at least six months prior to data release. Early engagement with impacted functional areas is required for maximum preparation and usability.

Data Quality, Usability, And Governance

Data quality, usability, and governance must also be considered when selecting a data set to ensure maximum ROI. Different consortia and data types will have different considerations in terms of data quality, and leaders should consider any relevant processes, technology, guidance, and regulations to ensure compliant and appropriate usage of the data.

Before joining a research partnership, organizations should complete due diligence to ensure trust in the partner’s research method and resulting data quality. This is especially important when working with new technology or research methods, which can pose risks to receiving high-quality data.

Leaders should also consider the specific limitations associated with different data types. For example, cell cultures can’t entirely represent complex biological systems,11 so organizations should consider how this information will be used and applied to drive target discoveries. Further, patient data presents a different set of challenges, as it may contain missing or inconsistent data that makes cross-study comparison more difficult.12

In terms of usability, data often must be mapped to standardized ontologies to support comparative analysis across studies and sources; this requires clinical informatics expertise. Without that foundation, even high-quality data may be difficult to interpret or use consistently across research teams. Value can be created by comparing novel data sets with internally generated data, which creates a larger patient cohort that has never been studied. Clear data governance and ownership are key to enabling this process properly.
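
To make the mapping step concrete, the Python sketch below normalizes a free-text field against a lookup table of standardized codes and reports any terms that could not be mapped. It is a simplified sketch under stated assumptions: the function name, the `diagnosis` field, and the `EX:` codes are hypothetical placeholders rather than entries from a real ontology, and production mapping typically relies on curated vocabularies and expert review.

```python
def map_to_standard(records, term_map):
    """Attach a standardized code to each record and collect unmapped terms.

    records:  list of dicts with a free-text "diagnosis" field (hypothetical schema)
    term_map: dict from lowercase local term to standardized code (placeholder codes)
    """
    mapped = []
    unmapped = set()
    for rec in records:
        # Normalize whitespace and case before looking up the local term.
        key = rec["diagnosis"].strip().lower()
        code = term_map.get(key)
        if code is None:
            # Keep the original spelling so a reviewer can resolve it later.
            unmapped.add(rec["diagnosis"])
        mapped.append({**rec, "diagnosis_code": code})
    return mapped, sorted(unmapped)


# Placeholder data: terms and codes are illustrative only.
term_map = {"type 2 diabetes": "EX:0001", "hypertension": "EX:0002"}
records = [
    {"id": 1, "diagnosis": " Type 2 Diabetes "},
    {"id": 2, "diagnosis": "Hypertension"},
    {"id": 3, "diagnosis": "HTN"},
]
mapped, unmapped = map_to_standard(records, term_map)
print(unmapped)
```

Reporting unmapped terms separately, rather than silently dropping them, gives informatics staff a worklist to resolve before data sets are compared across studies.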

Governance and access requirements need to be established early, especially if partnerships restrict user access.13 Legal counsel should be engaged early to understand any restrictions and eliminate risk of noncompliance. Organizations should also define a governance model that clarifies data ownership, usage expectations, access controls, and tools needed to promote and support proper usage and traceability.

Finally, leaders should identify the intended end users of the data set as early as possible, as these users may need certain qualifications or formal approval before they can access the data.14 Defining key roles, such as data owners and data stewards, ensures proper data management.15

While the data owner serves as a decision maker for the data, including data storage location, a data steward enforces data governance policies to ensure compliance, enabling wider comprehension and use of the data set. By considering overall data quality, access, and applications, leaders can be confident in proper usage and interpretation of the data.

Final Thoughts

Acquiring novel data sets and participating in research partnerships can transform an organization’s research capabilities, but only if the organization is prepared to take on and leverage the data set. By considering the key factors noted above prior to investment, an organization can be informed and deliberate when deciding whether to invest in R&D data sets.

References

  1. Cuttsy+Cuttsy. (2025, April 28). AI and away we go: Key insights from Reuters Pharma 2025. https://www.cuttsyandcuttsy.com/latest/ai-and-away-we-go-key-insights-from-reuters-pharma-2025
  2. Chaudhari, J.K., Pant, S., Jha, R., Pathak, R.K., et al. (2024, January 27). Biological Big-Data Sources, Problems of Storage, Computational Issues, and Applications: A Comprehensive Review. Knowledge and Information Systems. https://link.springer.com/article/10.1007/s10115-023-02049-4
  3. Saiontz, L., & Watson, E. (2025, August 29). Opportunities for AI in Drug Product Development. Clarkston Consulting. https://clarkstonconsulting.com/insights/ai-in-drug-product-development/
  4. Froissart, J., & Chio, A. (2024, September 26). Developing a Dashboard Solution for Clinical Trial Data. Clarkston Consulting. https://clarkstonconsulting.com/insights/dashboard-solution-for-clinical-trial-data/
  5. Minikel, E. V., Painter, J. L., Dong, C. C., Gates, M. A., Akingbola, A., Barnett, A., ... MacArthur, D. G. (2024). Refining the impact of genetic evidence on clinical success. Nature, 629, 624–629. https://doi.org/10.1038/s41586-024-07316-0
  6. Barrett, J.S. (2023, October 3). The Precompetitive Space for Drug or Vaccine Development: What Does It Look Like Now and What Could It Look Like in the Future? The Journal of Pediatric Pharmacology and Therapeutics. https://pmc.ncbi.nlm.nih.gov/articles/PMC10731930/
  7. Nicol, D., Nielsen, J., & Archer, M. (2024). Data access arrangements in genomic research consortia. Scientific Reports, 14, 21685. https://doi.org/10.1038/s41598-024-72653-z
  8. Schmidt, B., & Hildebrandt, A. (2017, February 2). Next-Generation Sequencing: Big Data Meets High Performance Computing. Drug Discovery Today. https://pubmed.ncbi.nlm.nih.gov/28163155/
  9. Office of Research Services, University of Pennsylvania. (n.d.). NIH data management and access requirements for sharing genomic data. https://researchservices.upenn.edu/nih-data-management-and-access-requirements-for-sharing-genomic-data/
  10. Thaldar, D., Uberoi, D., Thorogood, A., et al. (2025). Communicating clearly about data sharing in genomics. Human Genomics, 19, 80. https://doi.org/10.1186/s40246-025-00784-z
  11. Urzì, O., Gasparro, R., Costanzo, E., De Luca, A., Giavaresi, G., Fontana, S., & Alessandro, R. (2023). Three-Dimensional Cell Cultures: The Bridge between In Vitro and In Vivo Models. International Journal of Molecular Sciences, 24(15), 12046. https://doi.org/10.3390/ijms241512046
  12. Haneuse, S., Arterburn, D., & Daniels, M. J. (2021). Assessing Missing Data Assumptions in EHR-Based Studies: A Complex and Underappreciated Task. JAMA Network Open, 4(2), e210184. https://doi.org/10.1001/jamanetworkopen.2021.0184
  13. All of Us Research Program. (n.d.). Researcher Workbench. All of Us Research Hub. https://www.researchallofus.org/data-tools/workbench/
  14. UK Biobank. (2026, May 3). Eligibility. https://www.ukbiobank.ac.uk/use-our-data/eligibility/
  15. Kneer, C., Erickson, S., & Lamont, B. (2026, February 6). Master Data Governance Strategy for an Emerging Biotech Company. Clarkston Consulting. https://clarkstonconsulting.com/insights/master-data-governance-strategy-case-study/

About The Author:

Lilly Saiontz is an associate consultant with expertise in strategy and data analytics at Clarkston Consulting. Her experience in the life sciences industry spans pharmaceutical research and development and clinical organizations. At Clarkston, Saiontz combines this knowledge with effective communication and collaboration to ensure project success and client satisfaction. She has used these skills across strategy, operations, and implementation projects in the life sciences vertical. Saiontz holds a B.S. in biomedical engineering from Georgia Institute of Technology.