In the midst of the heated debate about AI sentience, sentient machines and artificial general intelligence, Yann LeCun, Chief AI Scientist at Meta, has published a blueprint for creating “autonomous machine intelligence.”
LeCun has compiled his ideas into a paper that draws inspiration from advances in machine learning, robotics, neuroscience and cognitive science. He lays out a roadmap for creating AI that can model and understand the world, reason, and plan to perform tasks on different time scales.
While the document is not a formal scholarly paper, it provides an interesting framework for thinking about the pieces needed to replicate animal and human intelligence. It also shows how the thinking of LeCun, an award-winning pioneer of deep learning, has changed and why he believes current approaches to AI will not take us to human-level AI.
A modular structure
One of the key elements of LeCun’s vision is a modular structure of different components, inspired by different parts of the brain. This is a break from the popular approach in deep learning, where a single model is trained end to end.
Central to the architecture is a world model that predicts the states of the world. World modeling has been discussed and attempted in various AI architectures, but those models are task-specific and cannot be adapted to other tasks. LeCun suggests that autonomous systems, like humans and animals, should have a single flexible world model.
“One hypothesis in this paper is that animals and humans have only one world model engine somewhere in their prefrontal cortex,” LeCun writes. “That world model engine is dynamically configurable for the task at hand. With a single, configurable world model engine, rather than a separate model for each situation, knowledge about how the world works can be shared between tasks. This can make it possible to reason by analogy, by applying the model configured for one situation to another situation.”
The world model is complemented by several other modules that help the agent understand the world and take actions relevant to its goals. The “perception” module plays the role of the sensory system of animals, collecting information from the world and estimating its current state with the help of the world model. In this regard, the world model performs two important tasks: first, it fills in information that is missing from the sensory input (e.g., hidden objects), and second, it predicts plausible future states of the world (e.g., where a flying ball will be at the next time step).
The “cost” module evaluates the agent’s “discomfort,” measured as energy. The agent must take actions that reduce its discomfort. Some costs are hardwired, or “intrinsic costs.” In humans and animals, for example, these costs would be hunger, thirst, pain and fear. Another submodule is the “trainable critic,” whose goal is to reduce the cost of achieving a particular goal, such as navigating to a location or building a tool.
The “short-term memory” module stores relevant information about the states of the world over time and the corresponding value of the intrinsic cost. Short-term memory plays an important role in the proper functioning of the world model and in making accurate predictions.
The “actor” module converts predictions into concrete actions. It gets its input from all other modules and controls the agent’s outward behavior.
Finally, a “configurator” module takes care of executive control, adapting all other modules, including the world model, to the specific task at hand. This is the key module that allows a single architecture to handle many different tasks. It adjusts the agent’s perception module, world model, cost function and actions based on the goal the agent wants to achieve. For example, if you are looking for a tool to drive in a nail, your perception module should be configured to look for items that are heavy and solid, your actor module should plan actions to pick up the makeshift hammer and use it to drive the nail, and your cost module must be able to calculate whether the object is easy to wield and close enough, or whether you should look for something else within reach.
Interestingly, LeCun considers two modes of operation in his proposed architecture, inspired by the dichotomy in Daniel Kahneman’s “Thinking, Fast and Slow.” The autonomous agent must have a “Mode 1” operating mode, a fast and reflexive behavior that directly links perceptions to actions, and a “Mode 2” operating mode, which is slower and more involved and uses the world model and other modules to reason and plan.
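The difference between the two modes can be illustrated with a toy example. The sketch below is not taken from LeCun’s paper: the one-dimensional world, the action set, the obstacle penalty and every function name are assumptions made for illustration, and brute-force search stands in for the gradient-based optimization the paper favors. Mode 1 maps the perceived state directly to an action, while Mode 2 rolls candidate action sequences through a world model and picks the one with the lowest total cost.

```python
from itertools import product

# Hypothetical moves on a 1-D line; 2 is a jump over one cell.
ACTIONS = (-1, 0, 1, 2)


def world_model(state, action):
    """Toy world model: predict the next position from the current one."""
    return state + action


def cost(state, goal, obstacle):
    """Toy cost ('energy'): distance to the goal plus a penalty on the obstacle."""
    penalty = 100 if state == obstacle else 0
    return abs(goal - state) + penalty


def mode1_act(state, goal):
    """Mode 1: reflexive policy that steps toward the goal with no lookahead."""
    if goal > state:
        return 1
    if goal < state:
        return -1
    return 0


def mode2_act(state, goal, obstacle, horizon=3):
    """Mode 2: simulate every action sequence and return the first action
    of the sequence with the lowest accumulated cost."""
    best_first, best_total = 0, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0
        for a in seq:
            s = world_model(s, a)
            total += cost(s, goal, obstacle)
        if total < best_total:
            best_first, best_total = seq[0], total
    return best_first


reflex = mode1_act(0, goal=3)                # 1: steps straight onto the obstacle
planned = mode2_act(0, goal=3, obstacle=1)   # 2: jumps over it
```

The reflexive policy is cheap but blind to the obstacle. The planner avoids it only because it can query the world model and the cost function before acting, which is the trade-off the two modes are meant to capture.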
While the architecture that LeCun proposes is interesting, implementing it poses some major challenges. Among them is training all the modules to perform their tasks. In his paper, LeCun makes ample use of the terms “differentiable,” “gradient-based,” and “optimization,” all of which indicate that he believes the architecture will be based on a set of deep learning models, as opposed to symbolic systems in which knowledge is embedded in advance by humans.
LeCun is an advocate of self-supervised learning, a concept he has been talking about for several years. One of the main bottlenecks of many deep learning applications is their need for human-annotated examples, which is why they are called supervised learning models. Data labeling does not scale, and it is slow and expensive.
On the other hand, unsupervised and self-supervised learning models learn by observing and analyzing data without the need for labels. Through self-supervision, human children acquire common sense knowledge of the world, including gravity, dimensionality and depth, object persistence, and even things like social relationships. Autonomous systems must also be able to learn independently.
In recent years, some major advances have been made in unsupervised and self-supervised learning, mainly in transformer models, the deep learning architecture used in large language models. Transformers learn the statistical relationships of words by masking parts of a known text and trying to predict the missing part.
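The idea of learning from masked text can be shown without a neural network. The following sketch is a deliberately simplified, count-based stand-in for what a transformer does, and the corpus and function names are invented for illustration. Each interior word of a sentence is treated as if it were masked, and the model records which word tends to fill the gap between a given left and right neighbor. No human labels are involved; the text supplies its own targets.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Hide each interior word in turn and count which word fills the gap
    between its left and right neighbors."""
    table = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(1, len(tokens) - 1):
            context = (tokens[i - 1], tokens[i + 1])
            table[context][tokens[i]] += 1
    return table


def fill_mask(table, left, right):
    """Return the most frequent word seen between `left` and `right`,
    or None if that context never occurred."""
    counts = table.get((left, right))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat lay on the mat",
]
table = train(corpus)
guess = fill_mask(table, "sat", "the")  # fills the gap in "sat [MASK] the"
```

A transformer replaces the lookup table with learned representations that generalize to unseen contexts, but the training signal is the same: predict the hidden part from the visible part.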
One of the most popular forms of self-supervised learning is “contrastive learning,” in which a model is taught to learn the latent features of images through masking, augmentation and exposure to different poses of the same object.
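To make the contrastive idea concrete, here is a minimal sketch of one widely used contrastive objective, the InfoNCE loss. The article does not name a specific loss, so the choice of InfoNCE, the temperature value and the two-dimensional example embeddings are all assumptions. The loss is low when an anchor embedding is closer to its “positive” (another view of the same object) than to the “negatives” (views of other objects).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-probability of picking the positive
    among the positive and all negatives, using a softmax over similarities."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
same_object = [0.9, 0.1]      # hypothetical embedding of another view
other_objects = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(anchor, same_object, other_objects)
loss_bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Training an encoder to minimize this quantity pulls views of the same object together in embedding space and pushes views of different objects apart.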
However, LeCun proposes a different type of self-supervised learning, which he describes as ‘energy-based models’. EBMs attempt to encode high-dimensional data such as images into low-dimensional embedding spaces that retain only the relevant features. By doing this, they can calculate whether two observations are related to each other or not.
In his paper, LeCun proposes the “Joint Embedding Predictive Architecture” (JEPA), a model that uses EBMs to capture dependencies between different observations.
“A major advantage of JEPA is that it can choose to ignore the details that are not easily predictable,” LeCun writes. In short, this means that instead of trying to predict the state of the world at the pixel level, JEPA predicts the latent, low-dimensional features relevant to the task at hand.
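A toy version of this idea can be written in a few lines. In the sketch below, the encoder and predictor are written by hand rather than learned, and the three-value observation format (position, velocity, texture) is invented for illustration, so it should be read as a picture of the data flow rather than as LeCun’s implementation. Two observations are encoded, the embedding of the second is predicted from the embedding of the first, and the energy is the prediction error measured in embedding space.

```python
def encode(observation):
    """Hand-written stand-in for a learned encoder: keep position and
    velocity, discard the unpredictable texture value."""
    position, velocity, _texture = observation
    return (position, velocity)


def predict_next_embedding(embedding):
    """Hand-written stand-in for a learned predictor: constant-velocity motion."""
    position, velocity = embedding
    return (position + velocity, velocity)


def energy(x, y):
    """Squared distance between the predicted and actual embedding of y.
    Low energy means y is a plausible continuation of x."""
    predicted = predict_next_embedding(encode(x))
    actual = encode(y)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


x = (0.0, 1.0, 0.37)            # (position, velocity, texture) at time t
y_plausible = (1.0, 1.0, 0.92)  # texture changed arbitrarily, motion is consistent
y_implausible = (5.0, 1.0, 0.92)

low = energy(x, y_plausible)     # 0.0
high = energy(x, y_implausible)  # 16.0
```

Because the texture value never reaches the embedding, the model is not penalized for failing to predict it. A pixel-level predictor would have to account for that value even though it carries no information about where the object is heading.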
In the paper, LeCun further discusses Hierarchical JEPA (H-JEPA), a plan to stack JEPA models on top of one another to accommodate reasoning and planning on different time scales.
“JEPA’s ability to learn abstractions suggests an extension of the architecture to handle predictions at multiple timescales and multiple levels of abstraction,” LeCun writes. “Intuitively, low-level representations contain a lot of detail about the input and can be used for short-term forecasting. But it can be difficult to make accurate long-term forecasts with the same level of detail. Conversely, a high-level abstract view can allow for long-term predictions, but at the cost of eliminating a lot of detail.”
The road to autonomous agents
In his paper, LeCun admits that many questions remain unanswered, including how to configure the models to learn the optimal latent features and what the precise architecture and function of the short-term memory module and its view of the world should be. LeCun also says that the configurator module remains a mystery and that more work is needed to make it function correctly.
But LeCun clearly states that current proposals for achieving human-level AI will not work. For example, one argument that has received a lot of attention in recent months is that “it’s all about scale.” Some scientists suggest that by scaling transformer models with more layers and parameters and training them on larger data sets, we will eventually achieve artificial general intelligence.
LeCun rejects this theory, arguing that LLMs and transformers work as long as they are trained on discrete values.
“This approach doesn’t work for high-dimensional continuous modalities, such as video. To represent such data, it is necessary to eliminate irrelevant information about the variable to be modeled through an encoder, as in the JEPA,” he writes.
Another theory is “reward is enough”, suggested by scientists at DeepMind. According to this theory, all you need to create artificial general intelligence is the right reward function and reinforcement learning algorithm.
But LeCun argues that while RL requires the agent to constantly interact with its environment, much of the learning that humans and animals do comes from pure observation.
LeCun also rejects the hybrid “neuro-symbolic” approach, stating that the model probably does not need explicit mechanisms for manipulating symbols, and describes reasoning as “energy minimization or constraint satisfaction by the actor using different search methods to find an appropriate combination of actions and latent variables.”
Much more needs to be done before LeCun’s blueprint becomes a reality. “It’s basically what I want to work on, and what I hope to inspire others to work on, over the next decade,” he wrote on Facebook after he published the paper.