Healthcare AI – The Importance of Real-World Data

Collin Labar

Collin has spent his career advising pharmaceutical and biotech companies on their medical, commercial, and market access strategy. After years as a consultant, he joined Latica (latica.ai) to lead partnerships, shaping data strategy for life sciences and AI companies.

The future of AI in healthcare relies on the technology’s ability to process vast amounts of real-world data (RWD), with the goal of improving patient management and outcomes.

Novel AI technologies are focusing on solutions that leverage the most valuable yet underutilized source of RWD: unstructured data – specifically, siloed, unstructured data sources integrated and mapped to clinical journeys. Because these technologies rely so heavily on RWD, AI companies need to train and validate their models on datasets that are large, linked, and longitudinal.

See my previous article, where I discuss the various types of RWD sources now available and how they can be leveraged for research across life sciences organizations.

In this blog, I will outline the importance of high-quality RWD for the development, validation and FDA approval, and adoption of AI technologies.

AI Model Development

High-quality RWD is an essential piece of AI training and development. To build clinically meaningful algorithms, companies need to access large datasets that capture the diversity and uniqueness of real-world patient populations.

To build robust models, siloed data sources – such as imaging, pathology, and clinical outcomes from notes and reports – must be linked and mapped to capture the full clinical patient journey. One timepoint of data is not enough; longitudinal data is crucial for AI models to learn how disease progresses over time. Without this, AI may perform well in theory but fall short in practice.
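
As a rough illustration of what that linking step can look like in practice, the sketch below joins hypothetical imaging, pathology, and clinical-note extracts on a shared patient identifier and orders the combined events by date to form a simple longitudinal journey per patient. Every table, column, and value here is an assumption for illustration, not a reference to any particular dataset or vendor.

```python
import pandas as pd

# Hypothetical siloed extracts; all tables, columns, and values are
# illustrative assumptions.
imaging = pd.DataFrame({
    "patient_id": [101, 101, 102],
    "event_date": pd.to_datetime(["2022-01-10", "2022-06-02", "2022-03-15"]),
    "source": "imaging",
    "finding": ["2.1 cm lung nodule", "nodule stable", "no acute findings"],
})
pathology = pd.DataFrame({
    "patient_id": [101],
    "event_date": pd.to_datetime(["2022-01-20"]),
    "source": "pathology",
    "finding": ["adenocarcinoma, biomarker-positive"],
})
notes = pd.DataFrame({
    "patient_id": [101, 102],
    "event_date": pd.to_datetime(["2022-02-01", "2022-04-01"]),
    "source": "clinical_note",
    "finding": ["started targeted therapy", "routine follow-up"],
})

# Link the silos on the shared patient identifier and order events in time,
# turning three disconnected sources into one longitudinal record per patient.
journey = (
    pd.concat([imaging, pathology, notes], ignore_index=True)
      .sort_values(["patient_id", "event_date"])
      .reset_index(drop=True)
)

print(journey)
```

Linking on a clean shared identifier like this is the simplest case; in practice, stitching sources from different institutions typically requires tokenized or probabilistic patient matching.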

Tempus AI is a great example of a company creating quality AI models by making strong RWD a core component of its development, an approach that has produced many successes over the years.

On the flip side, IBM did not prioritize large, diverse RWD in the development of a Watson Health AI offering, and instead used hypothetical patient cases provided by doctors from one specialty center. This resulted in a biased AI model that did not account for the clinical characteristics of the everyday patient in its recommendations [1].

Validation and FDA Approval

The path to an FDA-cleared product demands rigorous validation. The FDA expects evidence that an AI model is both accurate and generalizable when deployed in the real world. Specifically, the FDA favors AI models that have been validated on real-world data from US patients and providers. Some models even require validation on ‘healthy patients’ to demonstrate the technology’s specificity, and many small, training-focused datasets simply do not contain such patients.

The FDA, jointly with other regulatory agencies, has published 10 guiding principles on good machine learning practice that medical device companies should follow when developing machine learning models.

Linked datasets help establish correlations between diagnostic inputs and clinical outcomes, while longitudinal data allows for measurement of downstream impact—such as treatment changes or long-term outcomes. These insights not only strengthen regulatory submissions, but also increase confidence among providers and payers evaluating clinical and economic utility.
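
As a minimal sketch of what that evidence can look like, the example below scores a hypothetical held-out real-world cohort: sensitivity and specificity require both diseased and healthy patients, and a linked longitudinal outcome flag (here, an assumed ‘treatment change within 90 days’ field) gives a crude view of downstream impact. All column names, values, and the 0.5 threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical held-out real-world validation cohort; every column and value
# is an illustrative assumption (model_score from the trained model,
# has_disease and treatment_change_90d from linked clinical records).
cohort = pd.DataFrame({
    "patient_id":           [1, 2, 3, 4, 5, 6],
    "model_score":          [0.92, 0.15, 0.78, 0.05, 0.30, 0.64],
    "has_disease":          [1, 0, 1, 0, 1, 0],  # includes healthy patients
    "treatment_change_90d": [1, 0, 1, 0, 0, 0],  # outcome from linked, longitudinal data
})

threshold = 0.5  # assumed decision threshold
flagged = cohort["model_score"] >= threshold
diseased = cohort["has_disease"] == 1

# Sensitivity: share of diseased patients the model flags.
sensitivity = (flagged & diseased).sum() / diseased.sum()

# Specificity: share of healthy patients the model correctly clears; this
# cannot be measured without healthy patients in the validation cohort.
specificity = (~flagged & ~diseased).sum() / (~diseased).sum()

# Downstream impact: among flagged patients, how often did treatment change
# within 90 days according to the linked longitudinal record?
downstream = cohort.loc[flagged, "treatment_change_90d"].mean()

print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}  "
      f"treatment change after flag={downstream:.2f}")
```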

Adoption Among Community Physicians

Even the most well-validated AI solutions will fall short if they can’t be adopted by the everyday provider. Community physicians, who care for the majority of patients, are often cautious about new technology tools—especially if those tools don’t reflect the realities of their specific patient population.

AI models trained on real-world, community-representative data are more likely to deliver relevant, actionable insights. Demonstrating that an AI tool works across diverse patient types and care settings is critical to driving trust, integration into workflows, and ultimately, improved patient outcomes at scale.
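
One way to demonstrate that kind of generalizability is to report performance separately for each care setting rather than as a single pooled number. The sketch below groups a hypothetical validation set by site type and computes discrimination per group; the site labels, columns, and scores are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical validation results with a care-setting label for each patient;
# site labels, columns, and scores are illustrative assumptions.
results = pd.DataFrame({
    "site_type":   ["academic", "academic", "academic", "academic",
                    "community", "community", "community", "community"],
    "label":       [1, 0, 1, 0, 1, 0, 0, 1],
    "model_score": [0.9, 0.2, 0.7, 0.3, 0.6, 0.7, 0.1, 0.8],
})

# A single pooled metric can hide weak performance in community settings,
# so report discrimination separately for each care setting.
for site, group in results.groupby("site_type"):
    auc = roc_auc_score(group["label"], group["model_score"])
    print(f"{site}: AUC = {auc:.2f} (n = {len(group)})")
```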

In the IBM Watson case discussed earlier, the model’s lack of utility in broader patient populations led many major hospitals to cancel their engagements with Watson Health, and the downstream impact of that failure weighs on every new AI model coming to market.

Final Thoughts

The future of healthcare AI hinges on the quality of the data it’s built on. A common adage in data science is “garbage in, garbage out”. Large, linked, and longitudinal datasets aren’t just nice to have—they’re essential. By investing in access to robust real-world data, we can accelerate the development of meaningful AI tools, navigate regulatory pathways more efficiently, and build solutions that providers trust and patients benefit from.
