Machine Learning in drug development

Discovering new drugs is not an easy task. Developing a candidate substance takes large amounts of money and time and, what is worse, success is not guaranteed. Here are some figures to give you an idea:

  • The overall process to bring a new drug to market is estimated to cost more than $2 billion.
  • It can take 10-15 years.
  • Approximately 10% of new drugs reach the market [1].

On the other hand, computer processing power has grown rapidly in recent decades, in terms of both the volume of data processed and speed. Artificial Intelligence (AI) and Machine Learning (ML) algorithms are used in a wide variety of applications: internet searches, voice recognition software (like Siri), machine vision software in cameras, etc.

Can pharmaceutical research benefit from these technological advances? The answer is yes. In fact, ML can help with the three problems mentioned above: it can reduce both development time and cost, and increase the chances of selecting the most suitable compounds.

Some basic concepts about Machine Learning

ML can be described as “the ability to learn without being directly programmed”. Its purpose is to use inductive inference on training data sets to make predictions for new inputs (which are not part of the training data). To do this, the machine is trained on large quantities of data with a learning algorithm, which gives it the ability to learn how to perform the task.

As learning is based on data, the higher the quality and quantity of the data, the better the algorithms perform.

ML algorithms can be classified mainly as supervised, unsupervised and reinforcement learning.

Supervised learning methods are based on known relationships between input and output data, also called labeled data. That is, each input training sample is associated with the correct output. The goal of the algorithm is to learn the general rules that relate the inputs to the right outputs.

Supervised learning is mainly used for classification and regression. Classification predicts the discrete class to which an input belongs. Regression, on the other hand, predicts a continuous-valued output from an input.
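
The difference between the two tasks can be sketched in a few lines of Python. This is only an illustration: the descriptor values, activity labels and potency values are made up, and the nearest-neighbour rule and least-squares line are just two simple choices among many possible supervised methods.

```python
# Toy, made-up data: one numeric descriptor per compound (hypothetical values).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
is_active = [0, 0, 0, 1, 1, 1]              # discrete labels -> classification
potency = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]   # continuous labels -> regression


def predict_class(x, xs, labels):
    """Classify with a 1-nearest-neighbour rule: copy the label of the closest training sample."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return labels[nearest]


def fit_line(xs, ys):
    """Ordinary least squares for a single input; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# A new, unseen input: the classifier returns a class, the regressor a number.
new_input = 4.6
predicted_class = predict_class(new_input, descriptor, is_active)
slope, intercept = fit_line(descriptor, potency)
predicted_potency = slope * new_input + intercept
```

In both cases the model is fitted on labeled samples and then applied to an input that was not part of the training data.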

A related approach is semi-supervised learning. In these systems, the training datasets include a large amount of unlabeled data (the output is unknown for a given input) and a small amount of labeled data (as in supervised learning).

Unsupervised algorithms work with unlabeled data. They use iterative approaches to identify hidden patterns in the input data. They can also perform feature learning, a group of methods for finding features or representations in the input data. These algorithms can be used for clustering, dimensionality reduction or anomaly detection [2].
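
As an illustration of the iterative approach, here is a minimal sketch of k-means clustering on one-dimensional data. The measurements and the starting centers are invented for the example, and k-means is only one of many clustering algorithms.

```python
def assign(values, centers):
    """Give each value the index of its closest center."""
    return [min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values]


def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means: alternate between assigning points and updating centers."""
    centers = list(centers)
    for _ in range(iterations):
        labels = assign(values, centers)
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return centers, assign(values, centers)


# Hypothetical unlabeled measurements that happen to form two groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, labels = kmeans_1d(values, centers=[0.0, 6.0])
```

No labels are supplied at any point: the two groups are found only from the structure of the input data.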

Reinforcement learning techniques are based on the interaction of an agent with its environment to maximize its performance. The agent receives feedback in the form of rewards and penalties, and it learns by maximizing rewards and minimizing penalties.
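
A very small sketch of this idea is an epsilon-greedy agent choosing between two actions, one that always gives a penalty (-1) and one that always gives a reward (+1). The reward values, the number of steps and the exploration rate are assumptions chosen for the example; real environments are far richer than this two-action setting.

```python
import random


def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose actions pay fixed rewards."""
    rng = random.Random(seed)
    values = [0.0] * len(rewards)  # running estimate of each action's reward
    counts = [0] * len(rewards)    # how often each action was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(rewards))
        else:
            # Exploit: take the action with the best estimate so far.
            action = max(range(len(rewards)), key=lambda a: values[a])
        reward = rewards[action]   # feedback from the environment
        counts[action] += 1
        # Incremental mean of the rewards seen for this action.
        values[action] += (reward - values[action]) / counts[action]
        total += reward
    return values, counts, total


values, counts, total = run_bandit(rewards=[-1.0, 1.0])
```

After a few steps the agent's estimates favour the rewarded action, so it is chosen most of the time and the accumulated reward grows.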

In recent years, there has been increasing interest in an ML subfield called Deep Learning (DL), which can be applied to both labeled and unlabeled training data. Its structure is inspired by the way biological nervous systems communicate: through neurons.

Deep learning neural networks work in a similar way to simple neural networks, except that they are composed of multiple hidden layers (see Figure 1). This configuration enables more complex functions such as feature transformation and extraction [3].

Figure 1. Simple vs Deep Neural Network [3]
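
The structural difference between the two kinds of network can be sketched as a forward pass in plain Python. The layer sizes, the random weights and the tanh activation are assumptions made for the example, and the networks here are untrained.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights (one row per output neuron) and zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(sizes, seed=0):
    """Create one layer for each consecutive pair of sizes."""
    rng = random.Random(seed)
    return [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def forward(x, layers):
    """Pass an input vector through each layer: weighted sum, then tanh."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


simple_net = build_network([4, 8, 1])        # input, one hidden layer, output
deep_net = build_network([4, 8, 8, 8, 1])    # input, three hidden layers, output

sample = [0.2, -0.5, 1.0, 0.3]               # hypothetical input features
simple_out = forward(sample, simple_net)
deep_out = forward(sample, deep_net)
```

The same `forward` function serves both networks; the deep one simply applies more successive transformations to the input before producing its output.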

There are different types of DL architectures. They differ in the way each of them recognizes patterns and extracts features. Some of the most widely applied are convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative networks [4].

Drug development process

In Early Drug Discovery, the initial research step is target identification. Substances are proposed to inhibit or activate a protein involved in a certain disease. These substances are selected by screening chemical libraries, by computer simulation or by screening isolates from natural materials. Chemists play an important role in this process: they can help find the relationship between the physico-chemical properties of molecules (predictors) and biological activity.

Compounds that react with the target of interest are called “hits”. The next step is to find compounds which show attractive pharmaceutical properties, such as low toxicity or good aqueous solubility.

Among all the hits, the most promising candidates are identified as “leads”. During the process of lead generation, hit molecules are chemically modified to improve their activity and selectivity towards specific biological targets, while reducing toxicity and unwanted effects. Once lead compounds have been found, their chemical structures are used as a starting point for chemical modifications with the objective of discovering compounds with maximal therapeutic benefit and minimal potential for harm.

After the preliminary choice, the next step is to validate the effect of the chosen candidates in relevant ex vivo and in vivo models. This step is known as Preclinical Research and provides essential information (in terms of efficacy and safety) before a drug can be tested in humans.

Clinical research is carried out in humans and comprises several phases. The main aim of this stage is to determine the drug’s efficacy and possible adverse effects, as well as to establish the optimal dosage to enhance therapeutic effects.

After Clinical research, there are other stages related to Regulatory Reviews, Approvals and Post-marketing safety surveillance [5].

Machine Learning in drug development

ML techniques have been proposed in the literature for all stages of drug development. In summary, they can help both in designing a drug’s chemical structure and in investigating the effect of the proposed drug [9].

Let’s start with the drug structural design:

As mentioned before, the first step is the target identification, which requires the demonstration of a causal association between target and disease. ML can help with the primary source of target-disease association: literature. Some recent natural language processing (NLP) and ML approaches allow the identification of relevant data in papers to extract valuable information [6].

ML techniques can also be used to analyze huge datasets and make predictions about causality in seconds. For example, ML models have been proposed to classify proteins into drug targets and non-drug targets for breast, pancreatic and ovarian cancer.

ML tools can help estimate the probability that a drug can be made for a given target. Some studies take into account different physicochemical and geometrical features of different drugs against a target. For example, in the identification of hits and leads, compounds that have a similar chemical structure can be identified through deep neural networks (DNNs) [7].
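
To give an idea of what “similar chemical structure” means computationally, the sketch below ranks compounds by the Tanimoto coefficient between sets of structural features. Note that this is a much simpler, classical similarity measure used here for illustration, not the DNN approach described in [7]. The compound names and feature sets are invented.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared features / all features."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints: each number stands for a structural feature
# that is present in the molecule.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 8},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9, 12},
}

# Rank library compounds from most to least similar to the query.
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]),
                reverse=True)
```

Here `cmpd_C` shares four of five features with the query (0.8), `cmpd_A` three of five (0.6), and `cmpd_B` none (0.0), so they are ranked in that order.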

AI systems can reduce the number of synthesized compounds that must be tested in either in vitro or in vivo systems. These tools allow prioritizing molecules based on the ease of synthesis or the development of optimal synthetic routes.

ML is also useful in the chemical synthesis of molecules. To plan the routes, the molecule is divided into smaller fragments. After that, the reactions to convert the building blocks into the desired compounds are proposed. This procedure results in the sequence of reactions that will be carried out in the laboratory. DNNs can help both in predicting the most promising reactants and in proposing synthetic routes with a reasonable number of steps and reaction time.

With respect to the effect of a drug:

In clinical trials, biomarker discovery and predictive models of drug sensitivity help improve clinical success rates. Biomarkers can be predicted through ML approaches based on preclinical datasets [8].

An important factor for success is finding the right patients. ML algorithms can speed up the process of analyzing genetic information to identify suitable candidates for the trial. They can also reduce data errors [9].

Future prospects

Recent progress in ML techniques, together with the generation of large amounts of chemical and biological data, supports the use of AI tools. It cannot be denied that implementing these technologies brings great advantages. They can help identify drug targets, find the right molecules from data libraries, suggest chemical modifications, identify candidates for repurposing, etc.

However, ML methods also have some drawbacks. A typical concern about deep neural networks is their lack of interpretability: it is hard to know how the results have been obtained. Another issue is repeatability, since results can vary with the initial values or weights of the inputs [6]. Consistent data standards and data quality are also fundamental to the success of this type of process, and they are not always assured.
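
The repeatability issue can be illustrated with a short sketch. Initial weights are usually drawn at random, so two training runs start from different points unless the random seed is fixed. The function name and the weight range below are assumptions for the example, and fixing the seed is only a partial remedy, since other sources of variation may remain.

```python
import random


def initial_weights(n, seed):
    """Draw n starting weights from a seeded random generator."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


run_a = initial_weights(5, seed=1)
run_b = initial_weights(5, seed=2)        # different seed, different start
run_a_again = initial_weights(5, seed=1)  # same seed, identical start
```

Runs that start from different weights can end up with different trained models, which is why reporting the seed matters when results need to be reproduced.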

But the truth is that experts in this field forecast that AI and ML approaches will change the way new drugs are discovered. In the coming years we will probably see a revolutionary transformation in this sector.


[1] Wilson, J (2018). Machine Learning Could Make Drug Discovery Faster, Cheaper, Better, Elsevier. Link:

[2] Ge, Z., Song, Z., Ding, S. X., & Huang, B. (2017). Data mining and analytics in the process industry: The role of machine learning. IEEE Access, 5, 20590-20616.

[3] Jacobs, F. (2018). Safety through Machine Learning Applications: A Safety Case Analysis.

[4] Jing, Y., Bian, Y., Hu, Z., Wang, L., & Xie, X. Q. S. (2018). Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era. The AAPS Journal, 20(3), 58.

[5] Lansdowne, L.E. (2020). Exploring the Drug Development Process. Drug Discovery from Technology Networks. Link:

[6] Vamathevan, J., Clark, D., Czodrowski, P., Dunham, I., Ferran, E., Lee, G., … & Zhao, S. (2019). Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6), 463-477.

[7] Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., & Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug Discovery Today, 23(6), 1241-1250.

[8] Mak, K. K., & Pichika, M. R. (2019). Artificial intelligence in drug development: present status and future prospects. Drug Discovery Today, 24(3), 773-780.

[9] Addepto (2019) Artificial Intelligence in Drug Discovery with Machine Learning. Link:

Inmaculada García