WEBVTT

1
00:00:04.800 --> 00:00:07.560
With GenAI, we understand that those models

2
00:00:07.560 --> 00:00:10.200
will have a strong command of language,

3
00:00:10.200 --> 00:00:12.680
but is that all they can do? Language?

4
00:00:12.680 --> 00:00:17.680
So far, the greatest success we've had is with language models.

5
00:00:18.080 --> 00:00:21.160
But if we ever want to get to general

6
00:00:21.160 --> 00:00:24.920
artificial intelligence, that might not be enough.

7
00:00:24.920 --> 00:00:28.000
In fact, we humans learn a lot from the environment,

8
00:00:28.000 --> 00:00:30.400
we learn a lot from what we see,

9
00:00:30.400 --> 00:00:33.440
we learn a lot from interacting with the environment.

10
00:00:33.440 --> 00:00:36.800
And there is a big open research question:

11
00:00:36.800 --> 00:00:41.800
Is language enough to build our internal model of the world?

12
00:00:43.720 --> 00:00:48.720
The model that allows us to predict how the world will respond to our actions,

13
00:00:49.960 --> 00:00:53.840
because that is the essence of an intelligent system,

14
00:00:53.840 --> 00:00:57.920
being able to predict what's going to happen when you act on the world.

15
00:00:57.920 --> 00:01:02.760
So there is a lot of research happening in this field, and we're seeing a lot

16
00:01:02.760 --> 00:01:07.480
of multi-modal models appearing, those that connect text and images,

17
00:01:07.480 --> 00:01:11.760
text and audio, text and video, and that allow us to go from text

18
00:01:11.760 --> 00:01:16.760
to generating images, or from images to describing them with language.

19
00:01:17.840 --> 00:01:22.840
What will be the impact of those multi-modal models on businesses or consumers?

20
00:01:24.320 --> 00:01:26.600
I think that impact will be huge.

21
00:01:26.600 --> 00:01:31.600
We have already seen how quickly applications like Midjourney spread

22
00:01:32.800 --> 00:01:36.760
around and how much people enjoy creating images.

23
00:01:36.760 --> 00:01:40.800
It really gives those who cannot draw the capability to create images,

24
00:01:40.800 --> 00:01:44.920
and the same for those who cannot design, and so on and so forth.

25
00:01:44.920 --> 00:01:48.080
But if you think about businesses, about creative industries,

26
00:01:48.080 --> 00:01:51.800
those technologies are clearly transformative for those industries.

27
00:01:51.800 --> 00:01:56.800
And it will be very soon when we get algorithms that allow us to generate

28
00:01:57.240 --> 00:02:02.240
videos, to generate audio, and that will change how the creative industry works.

29
00:02:02.600 --> 00:02:05.720
Leonid, we now understand multi-modal models.

30
00:02:05.720 --> 00:02:08.560
Now, I've heard you say sometimes that

31
00:02:08.560 --> 00:02:12.120
this is a first step towards embodiment of agents.

32
00:02:12.120 --> 00:02:13.320
Can you tell us more?

33
00:02:13.320 --> 00:02:16.640
So embodiment is actually a very, very interesting concept.

34
00:02:16.640 --> 00:02:21.640
It's really about a model having a 'body' or acting in the physical world.

35
00:02:22.680 --> 00:02:24.120
And the way it is connected

36
00:02:24.120 --> 00:02:29.040
to multi-modality is quite simple, because in order for an agent or model

37
00:02:29.040 --> 00:02:32.160
to operate, it needs to sense the outside world.

38
00:02:32.160 --> 00:02:35.120
Sensing usually happens through images,

39
00:02:35.120 --> 00:02:37.280
through video, or maybe through audio.

40
00:02:37.280 --> 00:02:39.680
And then multi-modality allows us to build

41
00:02:39.680 --> 00:02:44.680
a single model that processes all that information, and then allows

42
00:02:46.240 --> 00:02:50.440
the embodied agent to act on the outside physical world.

43
00:02:50.440 --> 00:02:55.440
Will those embodied models change the way we or machines learn things?

44
00:02:56.920 --> 00:03:00.000
We, as humans, have our very individual

45
00:03:00.000 --> 00:03:02.240
experience and individual learning patterns.

46
00:03:02.240 --> 00:03:05.480
In fact, if you, for example, tell me how to ski,

47
00:03:05.480 --> 00:03:09.320
I will not be able to ski because I don't have any experience in skiing.

48
00:03:09.320 --> 00:03:13.160
And just by listening to you, I will not gain that experience.

49
00:03:13.160 --> 00:03:16.600
So we literally need to live through our lives to gain experience.

50
00:03:16.600 --> 00:03:20.320
Now, if you think about agents, they also need to have experience,

51
00:03:20.320 --> 00:03:22.800
experiencing the physical world, getting feedback.

52
00:03:22.800 --> 00:03:25.520
But the difference is that all the agents,

53
00:03:25.520 --> 00:03:28.760
multiple agents, can be connected among themselves,

54
00:03:28.760 --> 00:03:33.000
and so they can instantly share this experience and their learning becomes

55
00:03:33.000 --> 00:03:36.120
much, much quicker and much, much more powerful.

56
00:03:36.120 --> 00:03:38.560
So imagine if humans were able,

57
00:03:38.560 --> 00:03:42.320
instead of spending years in college, to just instantly share their knowledge,

58
00:03:42.320 --> 00:03:45.360
and that's what brings the power to this

59
00:03:45.360 --> 00:03:47.520
collective intelligence or collective learning.