Researchers have traditionally employed histopathology techniques, which involve the microscopic examination of tissue, to gain insight into disease processes. This approach often leads to subjective interpretations and requires labor-intensive analysis. Computational pathology uses advanced algorithms and digital imaging to extract quantitative data from tissue samples, addressing some of these limitations. However, integrating these tools into clinical workflows remains a complex task. Recent advances in artificial intelligence (AI) foundation models are rapidly transforming the diagnostic landscape, particularly by integrating multiple data sources to enable faster and more accurate biomarker identification from complex imaging datasets.

Paul O’Reilly, PhD
Head of Innovation
Sonrai Analytics
In this Innovation Spotlight, Paul O’Reilly, head of innovation at Sonrai Analytics, describes the obstacles researchers encounter when modeling pathology images and demonstrates how a whole-slide embedding approach addresses these challenges. He also highlights the impact of integrating molecular, clinical, and imaging data in a single platform for biomarker identification and precision medicine, and reviews current achievements and future possibilities for digital pathology platforms.
What are the benefits and challenges of combining histopathology with molecular and clinical data?
Clinicians and researchers widely use histopathology imaging as a key data modality for diagnosing and stratifying patients. Information about the structure and morphology of cells complements molecular data, allowing for a more comprehensive understanding of what is happening within and between those cells. In addition, when combined with other clinical data, histopathology imaging has the potential to uncover prognostic and predictive biomarkers with better sensitivity and specificity than those obtained from any single modality in isolation.
Despite these benefits, there are challenges to consider when integrating images with molecular and clinical data. Because these images are typically extremely large and unstructured, researchers must process and structure the information before proper integration. In some cases, this may be as simple as providing a score that summarizes the image features, but certain situations can require more complex processing, such as using AI to extract numerical features representing the salient information contained within the slide image.
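To make this concrete, here is a minimal sketch (not a description of Sonrai's platform) of the simplest case: summarizing a slide as a single structured score, in this example the fraction of tiles that a hypothetical per-tile classifier flags as tumor-positive. The slide_positivity_score function and the simulated probabilities are assumptions for illustration only.

```python
# Minimal sketch: reduce an unstructured slide to a single structured score,
# here the fraction of tiles a per-tile classifier calls tumor-positive.
import numpy as np

def slide_positivity_score(tile_probs, threshold=0.5):
    """tile_probs: per-tile tumor probabilities from any upstream classifier."""
    tile_probs = np.asarray(tile_probs)
    return float((tile_probs >= threshold).mean())

# Example with simulated probabilities for 1,000 tiles.
rng = np.random.default_rng(0)
print(slide_positivity_score(rng.uniform(size=1000)))
```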
What is computational pathology, and how does it advance scientific research?
Using modern scanner technologies, we can now easily digitize glass slides, creating whole-slide images that represent each slide as a large multidimensional array of numbers. Given that a digitized image may be gigabytes in size, extracting meaningful information from it is an extremely computationally intensive task. The process of doing this is termed computational pathology.
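To give a sense of the scale involved, the sketch below uses the open-source openslide-python library to inspect a whole-slide image and estimate the size of its full-resolution pixel array; the file path is a placeholder and the memory estimate assumes uncompressed RGB data.

```python
# Sketch: inspecting a digitized whole-slide image with openslide-python.
# "slide.svs" is a placeholder path; any OpenSlide-supported format works.
import openslide

slide = openslide.OpenSlide("slide.svs")
width, height = slide.dimensions                 # full-resolution (level 0) size
print(f"Level 0 size: {width} x {height} pixels, {slide.level_count} pyramid levels")

# Uncompressed RGB at full resolution: 3 bytes per pixel.
print(f"~{width * height * 3 / 1e9:.1f} GB if held in memory uncompressed")

# Read a small region lazily rather than loading the whole array at once.
region = slide.read_region((0, 0), 0, (512, 512))  # returns a PIL RGBA image
```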
Over the years, researchers have significantly advanced computational pathology. Where we once used a range of traditional image processing techniques, many modern workflows now leverage the power of AI to extract features, create classifiers, segment regions of interest, and quantify cells, among other tasks. By doing so, we can simplify, structure, and quantify the information from pathology images in a way that is efficient and reproducible, which in turn improves the accuracy and consistency of analyses involving histopathology images.
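As a simple example of one such processing step, the sketch below segments tissue from background on a low-resolution slide thumbnail using classic Otsu thresholding from scikit-image; it is a generic illustration rather than a description of any specific workflow, and the synthetic thumbnail is a stand-in for real data.

```python
# Sketch: classic tissue-versus-background segmentation on a slide thumbnail.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def tissue_mask(thumbnail_rgb):
    """thumbnail_rgb: H x W x 3 array of a low-resolution slide thumbnail."""
    gray = rgb2gray(thumbnail_rgb)   # background (glass) is bright, tissue is darker
    return gray < threshold_otsu(gray)

# Synthetic thumbnail: white background with a darker square of "tissue".
thumb = np.ones((100, 100, 3))
thumb[30:70, 30:70] = 0.5
print(f"Tissue fraction: {tissue_mask(thumb).mean():.2f}")
```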
What challenges do scientists face when modeling pathology images, and in what ways has the whole-slide embedding approach addressed these issues?
Pathology images tend to be very large and unstructured, meaning that working closely with pathologists to read, understand, and score images is an important part of effective processing. Unfortunately, pathologists are often a scarce resource in many scientific and clinical organizations, making collaboration a costly, time-intensive, and difficult task.
AI approaches have attempted to mitigate these problems, but have often given rise to new issues, such as the need to source, manage, and annotate large numbers of images to provide a ground truth for the AI to base its learning on. Cutting-edge approaches have attempted to distill features extracted using foundation models down to so-called whole-slide embeddings or representations in order to take the focus away from data management and annotation tasks. Scientists have created foundation models without the need for pathologist annotations by using self-supervised learning methods, which can draw on much larger cohorts and utilize data across a wide range of indications and stains. Some models are trained on hundreds of thousands of whole-slide images.
Trained models encapsulate, and can handle, a massive amount of the variability in the information encoded within slide images. Being able to automatically distill the information in a slide image down to a relatively small vector of numbers allows this representation to be used for multiple tasks, including determining the status of the slide, predicting prognosis and patient response, and fully integrating with multiple other modalities.
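To illustrate how such a compact representation can be reused across tasks, the sketch below fits a simple classifier on slide-level embeddings; the cohort size, embedding dimensionality, labels, and random data are all placeholders standing in for real slide embeddings.

```python
# Sketch: reusing slide-level embeddings for a downstream prediction task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_slides, dim = 200, 768                        # hypothetical cohort and embedding size
embeddings = rng.normal(size=(n_slides, dim))   # stand-in for real slide embeddings
labels = rng.integers(0, 2, size=n_slides)      # e.g., responder versus non-responder

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, embeddings, labels, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.2f}")
```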
How has integrating molecular, clinical, and imaging data into a single platform affected biomarker identification and precision medicine?
Specialist teams within organizations typically generate molecular and imaging data, so these teams have historically managed raw and processed data in separate software systems. Cloud-based platforms for multimodal data management now enable organizations to store, manage, access, and analyze all program files within a single platform. This consolidated approach reduces costs, increases efficiency, and supports more effective collaboration across different groups. By bringing all data and analytical tools together in one place, organizations can perform multimodal data analysis and use advanced techniques such as AI to identify patterns that form the basis for prognostic and predictive biomarkers.
For example, treatments such as antibody-drug conjugates (ADCs) rely on multiple mechanisms to target cancer cells specifically, using an antibody to deliver a cytotoxic payload. Clinicians may use imaging, in the form of immunohistochemistry (IHC)-stained pathology images, to identify patients who may be suitable for ADC treatment, but this only considers half of the equation. Even if the ADC binds to the correct cells, the patient may not respond to the treatment, which is where another modality, such as RNA sequencing, is needed to determine whether there is a molecular component to that response. In this case, multimodal analysis is required to fully identify a biomarker or biomarkers that will predict response to the ADC.

Artificial intelligence-driven pathology platforms can accelerate biomarker identification and personalized medicine via data integration.
©iStock, Visual Generation
What are the key steps involved in leveraging foundation models to identify prognostic biomarkers, and how do these approaches improve accuracy and interpretability compared to traditional methods?
Pathology researchers have developed a large and growing list of foundation models, each with distinct strengths and weaknesses. Some researchers have trained these models only on hematoxylin and eosin (H&E)-stained images, while others include IHC-stained slides in their training cohorts. Some organizations license foundation models for commercial use, while others make them available for academic purposes only. Although the variety can seem intimidating, it enables researchers to run, track, and compare experiments across different approaches.
Researchers typically use the foundation model to extract representations from whole-slide images (WSIs). They break these extremely large images down into smaller tiles, usually excluding image regions without tissue for efficiency, and use the foundation model to obtain a feature vector for each tile. Each feature vector represents the content, including morphology, of its corresponding tile. A single WSI may yield thousands of such tiles. While some analyses work with individual tiles, researchers generally aggregate the tile-level vectors into a single, fixed-length vector that summarizes the feature content of the entire image.
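A minimal sketch of that tile-and-aggregate pipeline is shown below. The encode_tile function is a placeholder standing in for a pretrained foundation model's forward pass, the brightness-based tissue check is deliberately crude, and the tile size and embedding dimension are assumptions for illustration.

```python
# Sketch: tile a WSI, embed each tissue tile, and mean-pool into one slide vector.
import numpy as np
import openslide

TILE = 256      # tile size in pixels (placeholder)
EMB_DIM = 768   # embedding size typical of vision transformers (assumption)

def encode_tile(tile_rgb):
    # Placeholder: a real pipeline would run the tile through a foundation model.
    seed = abs(hash(tile_rgb.tobytes())) % (2**32)
    return np.random.default_rng(seed).normal(size=EMB_DIM)

def has_tissue(tile_rgb, brightness_cutoff=220):
    return tile_rgb.mean() < brightness_cutoff   # crude background filter

def slide_embedding(path):
    slide = openslide.OpenSlide(path)
    width, height = slide.dimensions
    vectors = []
    for y in range(0, height - TILE + 1, TILE):
        for x in range(0, width - TILE + 1, TILE):
            tile = np.array(slide.read_region((x, y), 0, (TILE, TILE)).convert("RGB"))
            if has_tissue(tile):
                vectors.append(encode_tile(tile))
    return np.mean(vectors, axis=0)              # single fixed-length slide vector
```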
Essentially, this approach allows researchers to compress all the information available within the image into a single vector. They can then use this vector alongside additional data, such as clinical information associated with the slide or RNA expression of different transcripts, to identify composite biomarkers of survival or response. Researchers can also utilize the slide-level representation to identify cases with similar morphologies and to facilitate discussions with domain experts, such as pathologists, thereby enhancing the interpretability of the biomarker.
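The sketch below illustrates both ideas with simulated placeholder data: concatenating slide embeddings with molecular and clinical features into one matrix for composite biomarker modeling, and retrieving the cases whose slide embeddings are most similar to a query case as an interpretability aid.

```python
# Sketch: combining slide embeddings with other modalities and retrieving
# morphologically similar cases. All data below are simulated placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
n = 150
slide_emb = rng.normal(size=(n, 768))   # slide-level embeddings (placeholder)
rna = rng.normal(size=(n, 50))          # e.g., expression of 50 transcripts
clinical = rng.normal(size=(n, 5))      # e.g., encoded age, stage, grade

# Composite feature matrix for downstream biomarker modeling.
features = np.concatenate([slide_emb, rna, clinical], axis=1)
print(features.shape)                   # (150, 823)

# Interpretability aid: find the five slides most similar to a query case.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(slide_emb)
distances, indices = nn.kneighbors(slide_emb[:1])
print("Cases with similar morphology:", indices[0])
```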
Where has the integration of imaging and multiomics data had the most promising impact, and what are the current and future prospects of digital pathology platforms in these research fields?
We are starting to see multimodal biomarkers developed using AI being introduced into the clinic, which is the ultimate goal of any biomarker and an exciting step forward for AI workflows. In the arena of ADCs, Roche recently obtained Food and Drug Administration (FDA) Breakthrough Device Designation for an AI companion diagnostic that automatically scores trophoblast cell surface antigen 2 (TROP2) expression, without the intervention of a pathologist, to identify patients suitable for treatment with datopotamab deruxtecan-dlnk, an ADC developed by Daiichi Sankyo and AstraZeneca. Separately, Artera has received FDA de novo approval for its ArteraAI Prostate product, a multimodal risk stratification tool for patients with prostate cancer.
What is perhaps most exciting is that these breakthroughs are only the tip of the iceberg, and numerous research projects are underway in academia and industry. As we look ahead, continued progress in integrating these two crucial data streams will open up new therapeutic possibilities, helping us develop novel biomarkers and gain deeper insights into disease mechanisms.
