Causal Modeling Helps Machine Learning Find Better Drug Targets

Shreya Patil

Data Scientist with expertise in developing predictive Machine Learning and Deep Learning solutions for healthcare and pharmacovigilance data.

In the continuously evolving field of drug discovery, identifying potential drug targets efficiently and effectively remains a challenge. Artificial intelligence and machine learning have enabled a data-driven approach to drug discovery. One promising direction is the integration of predictive modeling with causal inference, which marks a significant shift in computational methods for drug target identification.

Why Is Causal Inference Needed in Drug Target Prediction?

Classic predictive models are excellent at identifying statistical relationships in biomedical data. They can tell us things such as drug A will bind to protein B, or gene C is overexpressed in patients with disease D. But these models are typically unable to explain why these relationships exist, or whether altering a given component would affect another.

This reflects a fundamental distinction in data science: prediction is not causation. Predictive models can tell us what is likely to happen based on past trends, but they cannot tell us what causes those outcomes.

Consider this example. Coffee drinkers are more likely to be smokers than non-coffee drinkers, and smoking is a known cause of lung cancer. A predictive model that uses coffee consumption as a feature could therefore predict that coffee drinkers are more likely to develop lung cancer. Yet a public policy intervention to limit coffee drinking would not change lung cancer rates, assuming the intervention does not also alter the variables that actually cause lung cancer, such as smoking.
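The coffee example can be reproduced with a small simulation. The sketch below is illustrative only: the probabilities are made-up assumptions, and `simulate` and `cancer_rate` are hypothetical helper names. In this toy world smoking raises both the chance of drinking coffee and the chance of cancer, while coffee itself has no effect on cancer.

```python
import random

def simulate(n, rng, force_coffee=None):
    """Draw (coffee, cancer) pairs from a toy world where smoking is a confounder.

    All probabilities are invented for illustration. If force_coffee is set,
    coffee drinking is assigned by intervention and no longer depends on smoking.
    """
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.3
        if force_coffee is None:
            # Observational regime: smokers are more likely to drink coffee.
            coffee = rng.random() < (0.8 if smoker else 0.4)
        else:
            # Interventional regime: coffee is set from outside.
            coffee = force_coffee
        # Cancer depends on smoking only, never on coffee.
        cancer = rng.random() < (0.15 if smoker else 0.01)
        rows.append((coffee, cancer))
    return rows

def cancer_rate(rows, coffee_value):
    """Fraction of people with cancer among those with the given coffee status."""
    selected = [cancer for coffee, cancer in rows if coffee == coffee_value]
    return sum(selected) / len(selected)

rng = random.Random(0)

# What a predictive model sees: coffee drinkers have more cancer.
observed = simulate(100_000, rng)
observed_gap = cancer_rate(observed, True) - cancer_rate(observed, False)

# What an intervention does: forcing coffee on or off leaves cancer rates alike.
forced_on = simulate(100_000, rng, force_coffee=True)
forced_off = simulate(100_000, rng, force_coffee=False)
intervention_gap = cancer_rate(forced_on, True) - cancer_rate(forced_off, False)

print(f"observed gap:     {observed_gap:.3f}")
print(f"intervention gap: {intervention_gap:.3f}")
```

Under these assumed numbers the observed gap is clearly positive while the intervention gap stays close to zero, which is the difference between a predictive association and a causal effect.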

This is particularly important in drug discovery, where knowing the causal mechanisms is essential for designing successful therapeutics. When dealing with sensitive biomedical or clinical data, relying on black-box models to predict potential target molecules is neither practical nor ethical. A machine learning algorithm exploits every signal present in the training data to optimize its results, and its internal decision-making is often opaque and difficult to interpret. There is a substantial risk that it will rely on biases embedded in historical data. This presents a challenge for clinical applications, which need methods that are reliable, transparent, and interpretable as well as effective. A drug target that is not merely correlated with a disease but has a causal effect on its mechanisms is therefore far more likely to be therapeutically successful.

Causal approaches to drug target identification offer particular benefits in addressing common challenges in the drug discovery process:

Cold-Start

When introducing new drugs or targets with limited interaction data:

  • Causal models can better generalize from existing data
  • They focus on stable causal relationships rather than superficial patterns
  • This improves prediction accuracy for novel compounds

Data Imbalance

When working with raw data, positive drug-target interaction data is often sparse:

  • Causal methodologies have demonstrated robustness even with imbalanced datasets
  • By focusing on causal mechanisms rather than statistical patterns, these approaches maintain performance despite data limitations

Multi-Target Drugs

Understanding drugs that interact with multiple targets is increasingly important:

  • Causal models can help distinguish primary from secondary targets
  • They provide insight into the mechanisms behind off-target effects
  • This knowledge is crucial for minimizing adverse reactions and optimizing therapeutic outcomes

How Will the Integration of AI Improve the Drug Discovery Pipeline?

The gold standard for establishing causality in medicine is the randomized controlled trial (RCT), but RCTs are often impractical in early drug discovery. This is where machine learning approaches become useful: observational data can be used to approximate controlled experimental conditions. Combining such approaches to identify complex relationships between drugs and targets can improve predictive performance.
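One standard way to approximate a controlled comparison from observational data is to stratify on a measured confounder (backdoor adjustment). The sketch below is a minimal illustration, not a method described in this article: the counts are invented, the confounder is a hypothetical binary "severity" stratum, and the estimate is only valid if that stratum captures all confounding and every stratum contains both treated and untreated records.

```python
def expand(counts):
    """Turn {(treated, stratum): (n, n_positive)} into (treated, stratum, outcome) records."""
    records = []
    for (treated, stratum), (n, positives) in counts.items():
        records += [(treated, stratum, 1)] * positives
        records += [(treated, stratum, 0)] * (n - positives)
    return records

def mean_outcome(records, treated, stratum=None):
    """Average outcome for one treatment arm, optionally within one stratum."""
    ys = [y for t, z, y in records
          if t == treated and (stratum is None or z == stratum)]
    return sum(ys) / len(ys)

def naive_effect(records):
    """Raw difference between arms, ignoring the confounder."""
    return mean_outcome(records, 1) - mean_outcome(records, 0)

def adjusted_effect(records):
    """Stratum-weighted difference between arms (backdoor adjustment)."""
    total = len(records)
    effect = 0.0
    for z in sorted({z for _, z, _ in records}):
        weight = sum(1 for _, s, _ in records if s == z) / total
        effect += weight * (mean_outcome(records, 1, z) - mean_outcome(records, 0, z))
    return effect

# Invented counts: severe cases (stratum 1) are treated more often and respond less.
counts = {
    (1, 0): (20, 16),  # treated, mild:     16 of 20 respond
    (0, 0): (80, 56),  # untreated, mild:   56 of 80 respond
    (1, 1): (80, 40),  # treated, severe:   40 of 80 respond
    (0, 1): (20, 8),   # untreated, severe:  8 of 20 respond
}
records = expand(counts)

print(f"naive effect:    {naive_effect(records):+.2f}")
print(f"adjusted effect: {adjusted_effect(records):+.2f}")
```

With these counts the naive comparison makes the treatment look harmful, while the stratified estimate shows a benefit within each severity level. The reversal comes entirely from who received treatment, which is the kind of bias a randomized trial removes by design.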

Graph Based Predictive Models

Applying causal reasoning to computational models is not simple. It requires a different approach to both model structure and the use of data. The most promising advances have come from graph-based models that treat biological systems as networks of entities and relationships.

Graph-based models are especially well suited to capturing biological causality. Networks are a natural fit for the complicated interactions among genes, proteins, metabolites, and cells.

We can build graph-based models for drug target identification, with biological entities (drugs, proteins, diseases) as the nodes of the network and their interactions or relations as edges. This type of representation lends itself naturally to multi-source data integration, allowing us to combine data from various sources into a single model.


Figure: Drug-target knowledge graph in which nodes are biological entities (drugs, proteins) and edges represent the relations between them.
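A graph of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: the entity names (`drug:D1`, `protein:P1`, and so on), the relation labels, and the two "sources" are all hypothetical placeholders, and `build_graph` is not taken from any particular library. It shows how triples from different sources merge into one structure.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples from two separate data sources.
interaction_source = [
    ("drug:D1", "binds", "protein:P1"),
    ("protein:P1", "activates", "protein:P2"),
]
disease_source = [
    ("protein:P2", "associated_with", "disease:X"),
    ("drug:D1", "binds", "protein:P1"),  # duplicate of a fact in the first source
]

def build_graph(*sources):
    """Merge triples from any number of sources into one adjacency structure.

    Each node maps to a set of (relation, neighbor) pairs, so a fact reported
    by several sources is stored only once.
    """
    graph = defaultdict(set)
    for triples in sources:
        for head, relation, tail in triples:
            graph[head].add((relation, tail))
            graph[tail]  # touch the tail so it exists as a node with no edges yet
    return graph

def node_type(node):
    """Read the entity type from the 'type:identifier' naming convention used here."""
    return node.split(":", 1)[0]

graph = build_graph(interaction_source, disease_source)

for node in sorted(graph):
    print(node_type(node), node, sorted(graph[node]))
```

Storing edges as sets is a deliberate choice in this sketch: overlapping sources are common, and deduplication keeps the merged graph from counting the same relation twice.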

What makes these models so strong for causal inference is that they can represent not only direct but also indirect relationships. For example, a drug may never interact directly with a disease protein yet still influence it indirectly through an intermediate series of interactions. Graph models can represent these complex pathways and allow us to predict the interaction points that would have maximal therapeutic impact.
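Indirect relationships correspond to multi-hop paths in the graph. The sketch below enumerates such paths with a depth-limited search; the graph, node names, and `find_paths` helper are hypothetical. A path in the graph is only a candidate route of influence, not evidence that the influence is causal.

```python
def find_paths(graph, start, goal, max_hops=4):
    """Return all simple paths from start to goal that use at most max_hops edges."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted on this branch
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # avoid revisiting nodes (no cycles)
                stack.append((neighbor, path + [neighbor]))
    return paths

# Hypothetical directed interaction graph: D1 has no edge to the disease protein P4.
graph = {
    "drug:D1": ["protein:P1"],
    "protein:P1": ["protein:P2", "protein:P3"],
    "protein:P2": ["protein:P4"],
    "protein:P3": ["protein:P4"],
    "protein:P4": [],
}

paths = find_paths(graph, "drug:D1", "protein:P4")
for path in sorted(paths):
    print(" -> ".join(path))
```

Here the drug reaches the disease protein only through intermediates, along two distinct three-hop routes. The shared intermediate `protein:P1` sits on both, which is the kind of interaction point the paragraph above refers to.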

Standard graph neural networks (GNNs) provide a solid foundation, but incorporating causal mechanisms can significantly improve their utility for drug target discovery. For instance, integrating attention mechanisms can help the model distinguish between causal and non-causal connections in the network. During training, the goal is for the model to learn to assign higher attention weights to edges that represent causal relationships, effectively filtering out spurious correlations.
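To make the attention idea concrete, the sketch below computes attention weights over a node's neighbors with a softmax over dot-product scores and then aggregates the neighbor vectors. It is a simplified assumption, not a specific GNN architecture: the vectors are hand-picked rather than learned, and the names are hypothetical. A high weight shows which edge the model relies on; whether that edge is causal depends on how the model was trained.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_weights(node_vec, neighbor_vecs):
    """Softmax over dot-product scores between a node and each of its neighbors."""
    scores = {name: dot(node_vec, vec) for name, vec in neighbor_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - top) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

def aggregate(neighbor_vecs, weights):
    """Attention-weighted sum of neighbor vectors (one message-passing step)."""
    dim = len(next(iter(neighbor_vecs.values())))
    return [sum(weights[name] * vec[i] for name, vec in neighbor_vecs.items())
            for i in range(dim)]

# Hand-picked toy embeddings; in a real model these would be learned.
drug_vec = [1.0, 0.0]
neighbors = {
    "protein:P1": [2.0, 0.0],  # aligned with the drug embedding
    "protein:P2": [0.0, 2.0],  # orthogonal to the drug embedding
}

weights = attention_weights(drug_vec, neighbors)
updated = aggregate(neighbors, weights)

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Edges with small weights contribute little to the aggregated representation, which is the filtering effect described above.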

Causal Invariance for Robust Target Prediction

Causal invariance refers to relationships that remain unchanged across different environments or interventions. In drug discovery, these are the relationships that yield the same effect regardless of changes in the biological setting.

There are several technical approaches to applying causal invariance in practice. One is to create several perturbed copies of the biological graph at training time, with the non-causal variables perturbed differently in each copy. The model is then trained to make the same prediction on all of these copies, which forces it to rely on stable causal features rather than spurious correlations.
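The perturbed-copies idea can be expressed as a consistency penalty: the variance of a model's predictions for the same sample across copies. The sketch below is a simplified stand-in that uses feature pairs instead of a graph and a fixed linear model instead of a trained network; it also assumes we already know which feature is non-causal, which in practice is itself a modeling assumption. All names are hypothetical.

```python
import random
from statistics import pvariance

def perturbed_copies(samples, n_copies, seed=0):
    """Make copies of the data in which only the non-causal feature is resampled."""
    rng = random.Random(seed)
    return [[(causal, rng.gauss(0.0, 1.0)) for causal, _ in samples]
            for _ in range(n_copies)]

def predict(weights, sample):
    """Linear score from a (causal, spurious) feature pair."""
    w_causal, w_spurious = weights
    causal, spurious = sample
    return w_causal * causal + w_spurious * spurious

def invariance_penalty(weights, copies):
    """Mean variance of the prediction for each sample across perturbed copies."""
    n_samples = len(copies[0])
    per_sample = [pvariance([predict(weights, copy[i]) for copy in copies])
                  for i in range(n_samples)]
    return sum(per_sample) / n_samples

# Toy samples in which the spurious feature happens to mirror the causal one.
samples = [(0.2, 0.2), (0.9, 0.9), (0.5, 0.5)]
copies = perturbed_copies(samples, n_copies=5)

causal_model = (1.0, 0.0)    # uses only the causal feature
shortcut_model = (0.0, 1.0)  # uses only the spurious feature

print("causal model penalty:  ", invariance_penalty(causal_model, copies))
print("shortcut model penalty:", invariance_penalty(shortcut_model, copies))
```

In a training loop, a term like this would typically be added to the ordinary prediction loss with a weighting factor, so that a model leaning on the perturbed features pays a cost that a model using stable features does not.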

This approach addresses one of the most enduring challenges in computational target prediction: overfitting. A model that fits current drug-target interaction knowledge without correctly modeling causality is severely limited: it can reproduce patterns in historical data but generalizes poorly to new compounds or targets. By employing causal invariance principles, we can create models that capture key biological processes rather than memorizing the examples they were given.

Check this research paper for a detailed description.

Synthetic Data using Deep Learning

Synthetic data generation, particularly using generative adversarial networks (GANs) and other deep learning techniques, has become an invaluable tool for addressing data limitations in drug discovery.

Creating synthetic biological data requires careful consideration of complex dependencies and constraints. When developing generative models for this purpose, ensuring they capture the underlying causal structure of the biological system rather than just statistical distributions is very important. This involves incorporating causal domain knowledge directly into the model architecture and validation process.
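As an illustration of what "capturing the causal structure" means for generated data, the sketch below samples synthetic records from a small hand-written structural causal model. This is not a GAN; it is a much simpler stand-in that makes the causal ordering explicit. The chain dose → target inhibition → biomarker, the coefficients, and the noise levels are all invented assumptions.

```python
import random

def clip01(x):
    """Clamp a value to the [0, 1] range."""
    return min(1.0, max(0.0, x))

def sample_record(rng, dose=None):
    """Sample one synthetic record by following the assumed causal order.

    Assumed chain: dose -> inhibition -> biomarker. Passing a dose simulates
    an intervention that fixes the dose instead of sampling it.
    """
    if dose is None:
        dose = rng.random()
    # Each variable is a function of its causal parent plus independent noise.
    inhibition = clip01(0.8 * dose + rng.gauss(0.0, 0.05))
    biomarker = 1.0 - 0.6 * inhibition + rng.gauss(0.0, 0.05)
    return {"dose": dose, "inhibition": inhibition, "biomarker": biomarker}

def mean_biomarker(rng, dose, n=2000):
    """Average biomarker level when the dose is fixed by intervention."""
    return sum(sample_record(rng, dose)["biomarker"] for _ in range(n)) / n

rng = random.Random(42)

# Synthetic dataset that could augment a small real one.
synthetic = [sample_record(rng) for _ in range(1000)]

# Validation check: interventions on dose should move the biomarker as assumed.
low_dose = mean_biomarker(rng, dose=0.0)
high_dose = mean_biomarker(rng, dose=1.0)
print(f"biomarker at dose 0: {low_dose:.2f}")
print(f"biomarker at dose 1: {high_dose:.2f}")
```

The final check is the kind of validation the paragraph above describes: beyond matching marginal distributions, a generator should respond to simulated interventions the way domain knowledge says the biology does.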

The resulting synthetic data can augment real-world datasets in several ways. For rare diseases with limited patient data, synthetic samples can provide a more robust training set. For novel drug classes with few known interactions, synthetic examples can help models generalize better. And for underexplored biological pathways, synthetic data can help identify potential causal relationships which can be further validated experimentally.

However, it is important to acknowledge the limitations of synthetic data. We should never view synthetic data as a replacement for real experimental validation, but rather as a tool for generating hypotheses and extending the reach of our models into less-explored areas.

The Future of Causal Drug Target Identification

Graph neural network architectures and other deep learning methods will likely be at the center of drug target identification. Advances like graph transformers and physics-informed neural networks are particularly promising for representing complex biological systems more realistically. By building molecular physics and known biological constraints directly into model structures, we can obtain more realistic models of drug-target interactions.

Another promising frontier is the integration of multi-modal data sources such as genomics, metabolomics, imaging, and clinical data. These varied data sources can be integrated into a shared causal framework and may provide unprecedented insights into disease mechanisms and therapeutic opportunities. The technical challenge is significant, but the potential reward in terms of new target discovery makes this an exciting area of research.
