TL;DR:
6 technologies are in the works and will very likely mature within the next 3 years, further increasing the (justifiable?) hype around Generative AI and large language models (LLMs) specifically.
Once any of these technologies are integrated into a product, it will be perceived as a breakthrough.
Soon enough AI-empowered products will be accurate, informative, up-to-date, and efficient.
These technology breakthroughs consist of:
1. LLMs information grounding and referencing
2. Efficiently connecting LLMs to tools (i.e., databases, simulators, calculators & APIs)
3. LLMs context/input enlargement, to the point it becomes a non-issue for most applications
4. LLMs computing ecosystem maturing (leading to 1000x inference cost reduction)
5. LLMs efficient fine-tuning and alignment
6. LLMs reasoning
The list is ordered according to my expectations for how feasible it is for each of these technologies to mature and undergo a breakthrough in the next 1 to 3 years.
LLMs and ChatGPT
Basics:
Large language models (LLMs) are computer programs that use neural networks and deep learning techniques to understand and generate human-like language, as well as other forms of language such as programming code and genomic sequences. They are commonly based on the Transformer (wiki) neural network architecture and trained on massive amounts of data. Large language models have a wide range of applications, including natural language processing, text summarization, text generation, chatbots, virtual assistants, machine translation and code generators, and they have the potential to revolutionize the way we process and generate language across different fields.
Two of the most well-known examples of large language models are GPT-3 (Generative Pre-trained Transformer 3 – wiki) and ChatGPT, which were released by OpenAI in 2020 and 2022, respectively.
As we all know already, ChatGPT can process and generate human-like language, and it can assist with a wide range of tasks, from answering questions and generating creative content to summarizing content.
New Era:
ChatGPT amazed many, and was the fastest product ever to reach 100 million users. We believe that the two key factors behind ChatGPT's success are: (1) a simple and intuitive interface was introduced, enabling literally anyone to converse with a generative AI chatbot for free, and (2) the ChatGPT model was trained to best fit a conversational setup, and it is a descendant of GPT-3.5, the latest-and-greatest model family of 2022.
OpenAI made it so intuitive to try and explore ChatGPT that many people had the patience to get to know it and figure out the best ways to use it.
The figure below, created by HFS Research, illustrates ChatGPT's usage through the lens of the Dunning-Kruger Effect.
While ChatGPT and other LLMs have demonstrated impressive capabilities in language processing and generation, there are several meaningful limitations that should be taken into consideration. Before elaborating on these limitations, let’s first discuss a few fundamental terms.
Also, if you are interested in reading more about the LLM ecosystem, I suggest this informative piece by Patrik et al. (link).
Data, Information, Knowledge and Intelligence
Data, information, knowledge, and intelligence are related concepts, but they have distinct differences.
- Data refers to raw, unorganized facts and figures.
- Information is data that has been organized, processed, and given context.
- Knowledge is information that has been absorbed, understood, and can be applied to make decisions or take action. It is the result of training, experience, and education.
- Intelligence refers to the ability to acquire, understand, and apply knowledge. It is the cognitive ability to think, reason, conclude and learn. It includes abilities such as perception, memory, problem-solving, and decision-making.
In summary, data is the raw material, information is processed data, knowledge is the understanding and application of that information, and intelligence is the ability to use information and knowledge.
Let us dwell a little longer on the difference between intelligence and knowledge.
Knowledge is specific information that has been learned, retained, and understood. Knowledge is the understanding of a particular subject or field. It can be used to make decisions, solve problems, and take action by intelligent beings.
Intelligence, on the other hand, mainly refers to the cognitive ability to think, infer, reason, learn, relearn and adapt. Intelligence is the general ability to acquire and apply knowledge in order to make decisions and solve problems in a wide range of tasks and situations.
Is ChatGPT a knowledgeable system, an intelligent system, or both?
It is definitely knowledgeable in various fields, don’t you already agree? But is it intelligent?
In the next table, we analyze the capabilities of GPT and Alpha models and systems according to the definitions above:
LLMs and ChatGPT Limitations
a. Content hallucinations: Refers to two variants of a phenomenon where an AI model generates output that appears to be: (1) meaningful but is actually nonsensical, lacking coherence or just overly artistic, or (2) meaningful, coherent, and plausible, but actually incorrect; in many of these cases the LLM appears to be overconfident in its predictions and outputs.
b. Black-box behavior: Refers to the inability to fully understand or interpret how the model works and generates its output.
c. Requires massive training, and updating a pre-trained model is cumbersome and problematic: LLMs have numerous parameters that need to be adjusted, making it time-consuming and computationally expensive to train them, as well as fine-tune them for new tasks or data.
d. Query/context input size is too small for many applications: The limited input capacity of LLMs means they cannot process large amounts of text at once, making them less effective for tasks that require broad context.
e. Very basic reasoning, logic, and math capabilities: LLMs are primarily based on statistical patterns in text and may struggle to perform tasks that require formal logic or mathematical reasoning.
f. Not optimizing for a global cause (at least not in a reasonably adjustable, controllable and stable manner): LLMs may not consider the overall objective or purpose of the task, which can result in suboptimal or even undesirable outcomes for the user or the model’s creators.
g. Heavily dependent on prompt engineering: Prompt engineering is the process of designing and fine-tuning prompts or input formats to achieve desired outputs from LLMs. In some cases, LLMs are very sensitive to small changes in the prompts. That is, the generalizability of LLMs is questionable.
We will elaborate further on these limitations when diving into each breakthrough section.
For further reading about these limitations and related mitigation and requirements, I refer to Neta Barkay’s post (link).
According to the definitions and limitations mentioned above, ChatGPT's intelligence is questionable.
However, the following six anticipated breakthroughs will very likely move us into the era of unquestionable artificial intelligence.
1. LLMs information grounding and referencing
Grounding & referencing:
- What is grounding? The grounding method aims to reduce generative language models' hallucinations by grounding them in external knowledge sources such as articles, posts, knowledge graphs, semantic networks, or databases, which provide relevant context and background information. By incorporating external knowledge into the model's inference process, grounding enables the model to make more accurate predictions and generate more coherent and contextually relevant output.
- What is referencing? Referencing means training and/or instructing the language models to provide references to some or all of the text they output. In some cases, it might require additional supporting techniques such as grounding.
- Example? In Bing’s new search chat, Sydney, grounding and referencing play a big role. Take a look:
- How to achieve it? Simplistic implementations already exist. To further improve accuracy: (1) new training techniques will reward proper referencing, (2) new dedicated inference and prompt engineering methods and best practices will evolve, and (3) LLMs will get connected to external resources in various ways, which leads to the next section.
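To make the idea concrete, here is a minimal prompt-level sketch of grounding with referencing: retrieved sources are injected into the prompt and the model is instructed to cite them by number. The `llm_complete` call and the source snippets are hypothetical, for illustration only.

```python
# Minimal sketch: prompt-level grounding with referencing.
# `llm_complete` is a hypothetical wrapper around any LLM completion API.

def build_grounded_prompt(question: str, sources: list[dict]) -> str:
    """Inject retrieved sources into the prompt and require numbered citations."""
    source_block = "\n".join(
        f"[{i + 1}] ({src['title']}) {src['text']}" for i, src in enumerate(sources)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "After every claim, cite the supporting source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{source_block}\n\nQuestion: {question}\nAnswer:"
    )

sources = [
    {"title": "Release notes", "text": "Version 2.4 adds streaming responses."},
    {"title": "Docs", "text": "Streaming is enabled with the stream=true flag."},
]
prompt = build_grounded_prompt("How do I enable streaming in v2.4?", sources)
# answer = llm_complete(prompt)  # hypothetical LLM call
```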
2. Efficiently connecting LLMs to tools
LLMs connection to tools:
- Connecting, what? Students taking a math exam will do better when they can access a calculator; busy execs are better able to coordinate with one another when they can use a calendar. Due to their limitations, LLMs can also greatly benefit from being connected and chained to other tools such as a database, calculator, search engine, etc.
Jurassic-X (link) by AI21 is an example of a system that includes one or more language models and augments them with external knowledge sources, as well as symbolic reasoning experts that can handle tasks beyond the reach of common neural LLMs.
- Dev tools for LLM chaining: Let's take LangChain (link) as an example. LangChain is an open-source library that aims to assist in the development of applications that can benefit from combining and chaining LLMs with other sources of computation or knowledge. This library integrates and encapsulates many capabilities including, to name a few: prompt management, prompt optimization, a generic interface for all LLMs, a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common use cases.
- Training LLMs to use tools: Toolformer by Meta (link), for example, is a “model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction”. Adept developed the Action Transformer (link): “a foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal, and ACT-1 is our first step [and model] in this direction.”
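Below is a minimal, library-free sketch of the same idea: the model is prompted to emit a `CALL calculator: <expression>` line when it needs arithmetic, the program runs the tool, and the observation is fed back into the prompt. The `llm_complete` function and the calling convention are assumptions for illustration, not any specific product's API.

```python
# Sketch of an LLM-to-tool loop, in the spirit of Toolformer / LangChain agents.
# `llm_complete` is a hypothetical LLM API wrapper; the CALL convention is illustrative.
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_calc(expr: str) -> float:
    """Evaluate a simple arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def answer_with_tools(question: str, llm_complete, max_steps: int = 3) -> str:
    prompt = (
        "You may use a calculator by replying with one line exactly of the form "
        "'CALL calculator: <expression>'. Otherwise reply with the final answer.\n"
        f"Question: {question}\n"
    )
    reply = ""
    for _ in range(max_steps):
        reply = llm_complete(prompt)
        if reply.startswith("CALL calculator:"):
            result = safe_calc(reply.split(":", 1)[1].strip())
            prompt += f"{reply}\nObservation: {result}\n"  # feed the tool result back
        else:
            break
    return reply
```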
3. LLMs context/input enlargement
Context enlargement:
- What is context? Generally speaking, context refers to the surrounding information or background knowledge that influences the meaning and interpretation of a given text or input. Specifically, it is common to add context when querying an LLM in order to obtain meaningful and accurate responses.
- So, what’s the problem? The context size is limited; this is true at least for common LLMs as of February 2023.
As an example, let’s take code generation and code testing as a use case. Many codebases will not fit into the context window of common LLMs. This means that when queried, the LLM cannot easily consider the entire codebase. In many cases, even a few code files, or even one, won’t fit into the context.
- What next? (1) Context/input capacity is going to grow. (2) New techniques will enable efficient access to external resources by LLMs during their computation process. Eventually, LLM-based systems will reach enormous, practically infinite, context and memory size. (3) These improvements will be greatly capitalized on, for example: (3.a) with advanced retrieval-augmented generation (RAG; 2020 paper by Meta link, and 2023 paper by AI21 link), where relevant information and knowledge is injected into the context, e.g. for relevant grounding and referencing, and (3.b) in a conversation use case, LLMs could keep the entire conversation history, so they could relate to any information from the past, adapt according to explicit and implicit preferences, etc. See more on this in the conclusion section.
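As a rough illustration of the retrieval idea in (3.a) applied to the codebase example above, the sketch below ranks code chunks by naive keyword overlap with the query and packs only the top ones into a limited context budget. A real system would use embeddings and a proper tokenizer; the 4-characters-per-token estimate and the chunks are illustrative assumptions.

```python
# Sketch: fit only the most relevant code chunks into a limited context window.
# Keyword overlap stands in for embedding similarity; token counts are approximated.

def overlap_score(chunk: str, query: str) -> int:
    query_words = set(query.lower().split())
    return sum(1 for word in chunk.lower().split() if word in query_words)

def pack_context(chunks: list[str], query: str, max_tokens: int = 3000) -> str:
    ranked = sorted(chunks, key=lambda c: overlap_score(c, query), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        est_tokens = len(chunk) // 4  # crude ~4 chars/token estimate
        if used + est_tokens > max_tokens:
            break
        selected.append(chunk)
        used += est_tokens
    return "\n\n".join(selected)

codebase_chunks = [
    "def parse_config(path): ...",
    "class PaymentClient: ...",
    "def test_parse_config(): ...",
]
context = pack_context(codebase_chunks, "generate tests for parse_config")
# prompt = f"{context}\n\nWrite unit tests for parse_config."  # then send to the LLM
```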
4. LLMs computing ecosystem maturing (leading to 1000x inference cost reduction)
In March 2023, OpenAI introduced APIs for ChatGPT and Whisper (link): “it is priced at $0.002 per 1k tokens, which is 10x cheaper than our existing GPT-3.5 models”.
- LLMs training and inference come with a cost. For many applications, the LLM inference cost is so high that it doesn’t enable a viable business model. For example, if an AI-empowered dev tool sells for $10/month/dev, and the LLM usage cost per developer is on par with that, then we have a problem (see the back-of-the-envelope calculation at the end of this list).
- How to meaningfully improve inference efficiency:
#1: Hardware (~2x/year): Specialized processors and systems are likely to be developed either by current processor leaders such as NVIDIA and alike, or by the cloud providers. For example, GCP, AWS and Alibaba Cloud have their proprietary chips and inference instances available for developers, e.g. see AWS’ Inferentia and Inf1 (link).
#2: Architecture and primitives reimplementations or redesign (~5x/year): For example, it is rumored that ChatGPT Turbo (released February 2023), which is ~4x faster than the original ChatGPT (say, January 2023), isn’t a smaller or different model, but rather a different implementation of its computing primitives.
#3: Smaller models efficiently trained (~3x/year): LLMs are undertrained, e.g. see the DeepMind paper (link) and the Meta AI paper (LLaMA link). For example, in Figure 1 below, taken from the LLaMA paper, you can see that more training means better accuracy (DS people: yes, I can see it is the training loss and not the validation/test loss). This means that smaller networks might be trained to achieve the accuracy of today’s larger LLMs. There are various techniques to further train a smaller model, from the obvious method of feeding it more data, through distilling knowledge from a much bigger model, to using various multi-modal/pre/post/chaining processing techniques. For example, Amazon researchers published a new model that claims to put GPT-3.5 to shame while being 784x smaller. More on this in the “LLMs reasoning” section.
- Efficiency is more than reducing cost: Different primitive reimplementations and decoding techniques, such as speculative sampling (e.g. DeepMind paper link), can also greatly improve latency as well as throughput.
- Do we always want cheaper? Well, it depends. While inference cost for models with capabilities on par with GPT-3.5 will drop significantly (~1000x), there will always be demand for smarter models, even if they are much more expensive.
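As referenced above, here is the back-of-the-envelope unit-economics calculation for the $10/month/dev example. All the numbers (tokens per request, requests per day, and price per 1k tokens) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope LLM cost per developer per month (all numbers are assumptions).
price_per_1k_tokens = 0.02       # USD, e.g. a GPT-3-class completion price
tokens_per_request = 2_000       # prompt + completion
requests_per_dev_per_day = 50
working_days_per_month = 22

monthly_tokens = tokens_per_request * requests_per_dev_per_day * working_days_per_month
monthly_cost = monthly_tokens / 1_000 * price_per_1k_tokens
print(f"~${monthly_cost:.0f}/month/dev")  # ~$44 here, well above a $10/month price point
```

At the ChatGPT API price of $0.002 per 1k tokens mentioned above, the same assumed usage would cost roughly $4.4/month/dev, which is exactly why 10x to 1000x cost reductions change which products become viable.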
5. LLMs efficient fine-tuning and alignment
Alignment:
- What is AI alignment? As this term might be ambiguous, let’s use a few quotes:
Wiki (link): “AI alignment research aims to steer AI systems towards their designers’ intended goals and interests”
OpenAI: “OpenAI’s alignment research focuses on training AI systems to be helpful, truthful, and safe.” (from here). As we see it, there are two groups of alignment types: local and global:
- Local: Have LLMs follow instructions given by their user. E.g. see OpenAI’s post “Aligning Language Models to Follow Instructions” (link), which includes some explanation on how the InstructGPT models were trained to be “much better at following user intentions than GPT-3”. This method was critical in the creation of ChatGPT.
- Global: Have LLMs follow and consider: (1) general guidelines and rules, such as those related to safety, tone of voice, etc., and (2) a general objective for the overall interaction, provided by the user, the product owner, and/or the LLM creator. For example, we may assume that the overall objective of a K-12 education chatbot would be to have its users, the students, learn by interacting with it and not just obtain answers to quickly finish homework (see the prompt-level sketch after this list).
- Important topic under research: As this post is being written, the most common technique to improve LLM alignment is Reinforcement Learning (RL), including RL from human feedback (RLHF). However, it has been shown many times that the current implementation of these techniques is not strong enough to prevent LLM jail-breaking (e.g. see DAN). Various techniques are under research by a variety of research groups focused on this subject, such as EleutherAI (link), OpenAI, DeepMind, Anthropic (link), AI2 (wiki), MIT AI Alignment (link), and more.
- Why this matters? As AI becomes increasingly knowledgeable, intelligent, and integrated, it has the potential to significantly impact human society and the world as a whole. Without proper alignment, AI systems may exhibit unintended and even harmful behaviors that could have serious consequences for individuals, communities, and perhaps even the environment and entire countries. Ensuring that AI systems are aligned with local and global values, goals, and interests is therefore critical to making them beneficial and reasonably controllable. Additionally, alignment is necessary to foster trust in AI and to encourage its ethical and responsible development and deployment.
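Here is the prompt-level sketch referenced above: global guidelines live in a fixed system message set by the product owner, while the local instruction comes from the user. This is only a prompt-level approximation of alignment, not alignment training such as RLHF; the message format follows a typical chat-completion API, and `chat_complete` is a hypothetical wrapper.

```python
# Sketch: global guidelines as a fixed system message, local instruction from the user.
# A prompt-level approximation only, not alignment training (e.g. RLHF).
# `chat_complete` is a hypothetical chat-completion wrapper.

GLOBAL_GUIDELINES = (
    "You are a K-12 tutoring assistant. Overall objective: help the student learn. "
    "Never give the final answer directly; guide with hints and follow-up questions. "
    "Keep a friendly, safe, age-appropriate tone."
)

def tutor_reply(student_message: str, chat_complete) -> str:
    messages = [
        {"role": "system", "content": GLOBAL_GUIDELINES},  # global alignment
        {"role": "user", "content": student_message},       # local instruction
    ]
    return chat_complete(messages)
```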
Fine-tuning:
- Some people found that ChatGPT knows Elon Musk is the current CEO of Twitter (March 2023). ChatGPT was also found to be aware that Queen Elizabeth II has died and that the UK has a new reigning monarch. That’s surprising because ChatGPT sometimes says: “As of my knowledge cutoff date of September 2021”, and Musk, of course, wasn’t CEO of Twitter until 2022.
This illustrates the limitations of the current science of large-scale foundation models: it is hard, even impractical, to update a model’s facts exhaustively (e.g. all other CEOs and world leaders).
Here, by the way, is what I got from ChatGPT when asking about Twitter’s CEO:
- What is fine-tuning? Fine-tuning is a process in which a pre-trained language model (such as GPT) is further trained on a task-specific dataset to improve its performance on a particular task or domain. During fine-tuning, the model’s parameters are adjusted to better fit the target data, while aiming to retain the compatible, non-conflicting knowledge and generalization capabilities learned from the original pre-training. Fine-tuning has been shown to be useful for improving the performance of LLMs in various tasks and domains. However, fine-tuning efficiency remains a major challenge, for example on two aspects: (1) fixing or updating specific knowledge or information such as facts, as exemplified in the above tweet, and (2) compatibility with pre-training, such as forgetting information or knowledge from the pre-training data while fine-tuning, or, more generally, optimizing toward the fine-tuning metric at the expense of the pre-training metric (sometimes referred to as catastrophic forgetting, wiki link; an example in this post link). A minimal fine-tuning sketch appears after this list.
- How to meaningfully improve it? Towards the end of 2022 we saw early research showing promising results. For example, there is a neat paper called MEND: “Fast Model Editing at Scale” (link) by Mitchell et al. from Stanford University, which shows how specific kinds of surgical language model edits can be accomplished. MEND is a tool for making late-breaking changes to facts that don’t change too often.
- A temporary solution? Exists, and works ok~ish: including new information and knowledge as part of the context. See the example in the conclusion section.
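For concreteness, here is the minimal fine-tuning sketch referenced above, using Hugging Face Transformers to further train a small causal LM on a toy, task-specific dataset. It shows the mechanics only and does not solve the fact-editing or catastrophic-forgetting problems discussed above; the model name, data, and hyperparameters are illustrative.

```python
# Minimal supervised fine-tuning sketch with Hugging Face Transformers (illustrative values).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # a small model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny toy dataset with made-up facts; a real fine-tune would use a task-specific corpus.
texts = [
    "Q: Who is the CEO of Acme? A: Jane Doe.",
    "Q: What does Acme sell? A: Rocket-powered gadgets.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=64), batched=True
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # adjusts the model's parameters toward the new data
```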
6. LLMs reasoning
How to achieve LLMs reasoning: Nobody knows, I think. Here are a couple of notable directions under research:
#1: CoT:
• The increasingly popular Chain of Thought (CoT) method consists of prompting Large Language Models (LLMs) with step-by-step rationale to increase their performance on reasoning tasks.
• Recently a team from Amazon published a work (a paper by Zhang et al: link), where they train a multimodal (vision + language) model for CoT and achieve state-of-the-art results on various tasks.
• Last year, in Satya Nadella’s keynote, Sam Altman demoed an LLM-based system with CoT capabilities. Watch this video (14:14 – 16:34):
#2: Combining techniques:
• Combining all of the previously mentioned bleeding-edge technologies and techniques: CoT, grounding and referencing, connecting to external tools and resources, enlarging the context, enabling alignment on local and global objectives and being able to learn and adjust information and knowledge on the fly.
• Minerva (blog link) by Google Research “combines several techniques (although not all of those mentioned in this post), including few-shot prompting, chain of thought or scratchpad prompting, and majority voting, to achieve state-of-the-art performance on STEM reasoning tasks.”
- Majority voting: Minerva generates multiple solutions to each question and chooses the most common answer as the solution, improving performance significantly (source: the blog linked above). A minimal sketch of CoT with majority voting appears at the end of this section.
- Why this matters: Hu-ha. When LLM systems are able to reason, disruption will hit many markets and domains with extreme automation. AlphaGo-like moments will happen in countless use cases and industries. It will enable new applications and services that are currently out of reach.
For example, at qodo (formerly Codium) we believe that high Code Integrity (link) would become feasible for any developer and codebase!
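To close the section, here is the minimal sketch referenced above: chain-of-thought prompting combined with majority voting over several sampled solutions (sometimes called self-consistency), in the spirit of the Minerva recipe. The `llm_sample` function and the "Answer:" extraction convention are assumptions for illustration.

```python
# Sketch: chain-of-thought prompting + majority voting ("self-consistency").
# `llm_sample` is a hypothetical LLM call that samples one completion at temperature > 0.
import re
from collections import Counter

COT_PROMPT = (
    "Solve the problem step by step, then give the result on a final line "
    "formatted exactly as 'Answer: <number>'.\n\nProblem: {problem}\n"
)

def extract_answer(completion: str) -> str | None:
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

def solve_with_majority_vote(problem: str, llm_sample, n_samples: int = 8) -> str | None:
    answers = []
    for _ in range(n_samples):
        completion = llm_sample(COT_PROMPT.format(problem=problem))
        ans = extract_answer(completion)
        if ans is not None:
            answers.append(ans)
    # Majority vote: the most common extracted answer wins.
    return Counter(answers).most_common(1)[0][0] if answers else None
```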
Conclusions (& what does the future hold):
Generative AI models, particularly language models like LLMs, can produce nonsensical or incoherent outputs, have limited reasoning, and may not optimize for a global goal.
To overcome these limitations, various techniques and best practices have been developed. Prompt engineering is one of them and is often used, but it can introduce biases and limit generalizability.
Additionally, LLMs require massive training and updating them can be cumbersome and problematic.
Overall, LLMs have become highly knowledgeable in many fields, but their intelligence is questionable.
In this post we suggested a near-future technology roadmap that we believe will usher in the era of intelligent products at a large scale. It is reasonable to claim that in the near future, any software can meaningfully benefit from integrating with language-model-based tools.
One example of a very useful application that will greatly benefit from practically infinite LLM memory/context and the other breakthroughs mentioned above is a personal assistant focused on being your second brain and memory.
Various developers around the world are implementing this. Below is one of them, SelfGPT: upload whatever you want to remember into a private WhatsApp channel, and then easily query for anything saved in that channel.
For other applications and implications, we wrote a post almost a year ago (link); it seemed a bit ahead of its time back then (before ChatGPT), but it has turned out to be spot on.
Acknowledgments
Thank you to the wonderful contributors and reviewers: Gadi Zimerman@qodo, Tal Ridnik@qodo, Neta Barkay, Patrik Liu Tran@Validio, Jay Hack, Amit Mandelbaum, Adam Jafer@Voi, Sivan Metzger and Mehdi Ghissassi@Deepmind
Thanks to all the people behind the tweets, papers and screenshots we have used in this post.
At qodo (formerly Codium), we are redefining and automating Code Integrity, initially by building a dev tool that will enable busy devs to generate meaningful test suites interactively within their IDE.