WEBVTT

1
00:00:00.401 --> 00:00:02.010
(light electronic music)

2
00:00:02.010 --> 00:00:04.080
Diseases can arise from various factors,

3
00:00:04.080 --> 00:00:06.420
including genetic mutations, infections,

4
00:00:06.420 --> 00:00:10.080
environmental toxins, lifestyle choices, and aging.

5
00:00:10.080 --> 00:00:12.720
These factors can disrupt normal cellular functions,

6
00:00:12.720 --> 00:00:15.690
immune responses, and overall bodily homeostasis

7
00:00:15.690 --> 00:00:19.740
and physiological functions, ultimately leading to disease.

8
00:00:19.740 --> 00:00:22.050
Understanding the biological mechanisms of disease

9
00:00:22.050 --> 00:00:23.760
is central to drug discovery.

10
00:00:23.760 --> 00:00:25.680
It enables the design of new molecules,

11
00:00:25.680 --> 00:00:28.650
especially personalized and precision medicine.

12
00:00:28.650 --> 00:00:31.560
Studying the mechanism of diseases is a major challenge

13
00:00:31.560 --> 00:00:34.470
and is one of biology's most complex problems.

14
00:00:34.470 --> 00:00:36.900
To address this, Merck & Co., in collaboration

15
00:00:36.900 --> 00:00:39.570
with the BCG X AI Science Institute,

16
00:00:39.570 --> 00:00:42.030
developed a family of foundation models:

17
00:00:42.030 --> 00:00:45.330
TEDDY, Transformers for Enabling Drug Discovery.

18
00:00:45.330 --> 00:00:47.580
The TEDDY models are state-of-the-art AI models

19
00:00:47.580 --> 00:00:50.280
which have been trained on large amounts of biological data

20
00:00:50.280 --> 00:00:52.830
representing many different diseases.

21
00:00:52.830 --> 00:00:55.110
TEDDY consists of two foundation models.

22
00:00:55.110 --> 00:00:57.300
Each model has been trained on genomic data

23
00:00:57.300 --> 00:01:00.060
representing a broad range of genes in the human body.

24
00:01:00.060 --> 00:01:01.290
The two TEDDY models differ

25
00:01:01.290 --> 00:01:03.180
in how they represent the genomic data

26
00:01:03.180 --> 00:01:04.950
and how the models were trained.

27
00:01:04.950 --> 00:01:08.070
The first model, TEDDY-G, has been trained on genomic data

28
00:01:08.070 --> 00:01:09.210
which has been ranked.

29
00:01:09.210 --> 00:01:11.070
This allows the model to predict which genes

30
00:01:11.070 --> 00:01:13.290
are most likely to be involved in a certain disease

31
00:01:13.290 --> 00:01:15.720
based on their relationships with other genes.

32
00:01:15.720 --> 00:01:17.460
The second model, TEDDY-X,

33
00:01:17.460 --> 00:01:20.130
trained on genomic data, processes gene expression

34
00:01:20.130 --> 00:01:22.590
by grouping the values into discrete categories,

35
00:01:22.590 --> 00:01:25.740
and predicts which category a gene's expression falls into.

36
00:01:25.740 --> 00:01:27.900
This helps the model to learn factors responsible

37
00:01:27.900 --> 00:01:30.330
for variations in gene expression.

38
00:01:30.330 --> 00:01:31.770
The two TEDDY models have been trained

39
00:01:31.770 --> 00:01:33.510
on vast quantities of data

40
00:01:33.510 --> 00:01:37.470
containing 116 million cells from over 24,000 people,

41
00:01:37.470 --> 00:01:41.070
across 413 tissue types, 860 cell types,

42
00:01:41.070 --> 00:01:43.560
and 122 different diseases.

43
00:01:43.560 --> 00:01:44.940
The genomic data which was used

44
00:01:44.940 --> 00:01:46.980
to train the TEDDY models is annotated.

45
00:01:46.980 --> 00:01:49.470
This means that existing biology knowledge about genes

46
00:01:49.470 --> 00:01:51.810
has been incorporated into the models.

47
00:01:51.810 --> 00:01:54.120
During the training, the TEDDY models were therefore able

48
00:01:54.120 --> 00:01:56.490
to learn from the genomes and the annotations.

49
00:01:56.490 --> 00:01:58.740
As a result, the models have a deep understanding

50
00:01:58.740 --> 00:02:00.960
in genomics and disease biology.

51
00:02:00.960 --> 00:02:02.850
To find out how good the TEDDY models are,

52
00:02:02.850 --> 00:02:05.130
we assessed them on two critical tasks,

53
00:02:05.130 --> 00:02:07.080
firstly, predicting a disease in patients

54
00:02:07.080 --> 00:02:08.820
that the models had never seen before,

55
00:02:08.820 --> 00:02:12.060
and secondly, distinguishing healthy from diseased cells.

56
00:02:12.060 --> 00:02:14.220
The chart shows the result of the assessment.

57
00:02:14.220 --> 00:02:16.050
The performance of the TEDDY models held up

58
00:02:16.050 --> 00:02:19.080
when both model size and data volume were increased.

59
00:02:19.080 --> 00:02:21.780
The TEDDY models set a new benchmark for AI models trained

60
00:02:21.780 --> 00:02:22.890
with genomic data.

61
00:02:22.890 --> 00:02:25.320
They demonstrate comparable or better performance

62
00:02:25.320 --> 00:02:27.120
relative to other state-of-the-art models

63
00:02:27.120 --> 00:02:29.790
across a diverse spectrum of diseases.

64
00:02:29.790 --> 00:02:31.440
But why do the TEDDY models understand

65
00:02:31.440 --> 00:02:33.300
and predict diseases so well?

66
00:02:33.300 --> 00:02:35.250
We believe it is because the grouping of genes

67
00:02:35.250 --> 00:02:36.840
which we used when we trained TEDDY,

68
00:02:36.840 --> 00:02:39.420
together with the annotations we included in the training,

69
00:02:39.420 --> 00:02:42.750
are particularly good at capturing how diseases work.

70
00:02:42.750 --> 00:02:45.600
The connections between genes are called the interactome.

71
00:02:45.600 --> 00:02:47.220
During the training of the TEDDY models,

72
00:02:47.220 --> 00:02:49.950
the AI algorithm learns about the interactome,

73
00:02:49.950 --> 00:02:52.800
specifically, how groups of genes, so-called modules,

74
00:02:52.800 --> 00:02:55.380
align with known disease-associated genes.

75
00:02:55.380 --> 00:02:57.240
This allows the TEDDY models to rapidly learn

76
00:02:57.240 --> 00:02:59.220
about how diseases work.

77
00:02:59.220 --> 00:03:00.930
Looking forward, we are on a mission

78
00:03:00.930 --> 00:03:03.150
to accelerate the development of precision medicine

79
00:03:03.150 --> 00:03:05.040
and drug discovery.

80
00:03:05.040 --> 00:03:07.470
We plan to use the TEDDY models to identify new genes

81
00:03:07.470 --> 00:03:09.900
that are important in many different diseases,

82
00:03:09.900 --> 00:03:13.050
enabling biomedical researchers to identify new drug targets

83
00:03:13.050 --> 00:03:14.640
and precision biomarkers

84
00:03:14.640 --> 00:03:16.230
and hopefully leading to the discovery

85
00:03:16.230 --> 00:03:18.390
of many exciting new medicines.

86
00:03:18.390 --> 00:03:22.640
(light electronic music continues)