GitHub link: https://github.com/facebookresearch/large_concept_model
Concepts are fundamental to human society: they are often abstract, and they let people discuss things that do not physically exist yet significantly impact their lives, such as freedom, justice, organizations, ideas, and religion.
The book Sapiens: A Brief History of Humankind by Yuval Noah Harari offers many fascinating discussions of these abstract concepts, which distinguish humans from other creatures.
Recently, Meta released a paper and model on Large Concept Models. As a researcher working on knowledge graph construction, with prior experience in conceptualization, I am particularly curious about how the paper defines concepts and how concepts can contribute to building large language models.
Without further ado, let’s dive into the details.
Large Concept Models
Motivations
Here are some key motivations from the paper, quoted directly:
Despite the undeniable success of LLMs and continued progress, all current LLMs miss a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction.

One may argue that LLMs are implicitly learning a hierarchical representation, but we stipulate that models with an explicit hierarchical architecture are better suited to create coherent long-form output.

To the best of our knowledge, this explicit hierarchical structure of information processing and generation, at an abstract level, independent of any instantiation in a particular language or modality, cannot be found in any of the current LLMs.
To be honest, these claims are not entirely convincing. Researchers in NLP have spent decades building hierarchical architectures for text understanding.
At the sentence level, there are structured representations such as constituency and dependency parse trees. Beyond the sentence level, we have discourse analysis, ranging from shallow approaches (e.g., PDTB-style discourse parsing) to deep approaches (e.g., Rhetorical Structure Theory, RST).
Interestingly, state-of-the-art performance on these tasks is often achieved by LLMs, suggesting that LLMs already excel at modeling such structures. However, it is also possible that the original benchmarks are too small to indicate performance reliably.
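To make the sentence-level case concrete, such hierarchical structure is directly recoverable with standard NLP tooling. Below is a minimal sketch using spaCy (assuming the en_core_web_sm model has been downloaded); the example sentence is mine:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Meta released a new paper about large concept models.")

# Each token points to its syntactic head, so the arcs form a dependency
# tree: an explicit hierarchy over the flat token sequence.
for token in doc:
    print(f"{token.text:<10} --{token.dep_:>10}--> {token.head.text}")
```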
Imagine a researcher giving a fifteen-minute talk. In such a situation, researchers do not usually prepare detailed speeches by writing out every single word they will pronounce. Instead, they outline a flow of higher-level ideas they want to communicate. Should they give the same talk multiple times, the actual words being spoken may differ, the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same. Similarly, when writing a research paper or essay on a specific topic, humans usually start by preparing an outline that structures the whole document into sections, which they then refine iteratively. Humans also detect and remember dependencies between the different parts of a longer document at an abstract level. If we expand on our previous research writing example, keeping track of dependencies means that we need to provide results for each of the experiments mentioned in the introduction. Finally, when processing and analyzing information, humans rarely consider every single word in a large document. Instead, we use a hierarchical approach: we remember which part of a long document we should search to find a specific piece of information.
The examples provided are more indicative of discourse structure than of concepts, at least not the explicit concepts that can be described by words.
In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities.
Personally, I disagree with treating a sentence as a concept. For instance, "I eat ten hamburgers" is not a concept, but "hunger" is. That said, let's move past the definitional debate, since the technical parts remain valid regardless of how we define concepts, or whether Marvin Minsky's K-lines theory is equivalent to RST.
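Operationally, "a concept corresponds to a sentence" simply means one SONAR embedding per sentence. Here is a hedged sketch using the text pipeline from the facebookresearch/SONAR repository; the pipeline class and model-card names below follow its README, so treat them as assumptions if the API has since changed:

```python
# pip install sonar-space  (see the facebookresearch/SONAR repo)
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

# A frozen encoder maps each sentence to one fixed-size vector,
# which the paper treats as one "concept".
t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

sentences = ["I eat ten hamburgers.", "Hunger is a feeling."]
embeddings = t2vec.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)  # expected: (2, 1024), i.e., two 1024-dim "concepts"
```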
Method
The method is straightforward:
The input is first segmented into sentences, and each one is encoded with SONAR to achieve a sequence of concepts, i.e., sentence embeddings. This sequence of concepts is then processed by a Large Concept Model (LCM) to generate at the output a new sequence of concepts. Finally, the generated concepts are decoded by SONAR into a sequence of subwords. The encoder and decoder are fixed and are not trained.
Essentially, this method uses a relatively small cross-modal sentence-embedding model (and its inverse decoder), frozen during both training and inference, as the tokenizer and detokenizer. The main body of the model is a transformer that operates over these sentence-level tokens.
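To make the data flow concrete, here is a minimal PyTorch sketch of that pipeline. It is illustrative only: TinyLCM is a generic causal transformer standing in for the actual LCM (the paper explores several variants, including diffusion-based ones), and sonar_encode / sonar_decode are placeholders for the frozen SONAR encoder and decoder:

```python
import torch
import torch.nn as nn

EMB_DIM = 1024  # SONAR sentence embeddings are 1024-dimensional

class TinyLCM(nn.Module):
    """Generic stand-in for the LCM: a causal transformer that reads a
    sequence of sentence embeddings and predicts the next embedding."""

    def __init__(self, dim: int = EMB_DIM, layers: int = 4, heads: int = 8):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (batch, n_sentences, EMB_DIM)
        n = concepts.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(n)
        return self.head(self.backbone(concepts, mask=mask))

# The full (hypothetical) pipeline, with SONAR frozen at both ends:
#   sentences = split_into_sentences(document)     # 1. segmentation
#   concepts  = sonar_encode(sentences)            # 2. (n, 1024) embeddings
#   nxt       = TinyLCM()(concepts.unsqueeze(0))   # 3. predict next concepts
#   output    = sonar_decode(nxt.squeeze(0))       # 4. back to subwords/text
```

At training time, only the LCM's parameters receive gradients; the paper's simplest variant, for instance, regresses the next embedding with an MSE objective.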
Further Comments
This approach is likely to work well for speech and text modalities because they share similar discourse structures; people speak and write in similar ways. Consequently, the sequence of sentence embeddings can capture meaningful information.
However, I am uncertain about its applicability to other modalities such as images. That said, it might work exceptionally well for video, where the LCM could effectively glue together a fixed video encoder and decoder to generate longer, more meaningful videos.
In this sense, the work is quite fascinating. Setting aside the definition of concepts, I think terms like basic semantic units, events, or situations describe more appropriately what the model is actually doing.