In the nascent days of the Internet, few people could fully envision its world-altering impact. Think back to Jeff Bezos in the mid-90s, who, grasping the disruptive potential of the Internet for e-commerce, launched Amazon from a simple garage. Today, we know Amazon as a global titan. We're now witnessing a similar technological shift with generative AI, a technology seeing an adoption pace that exceeds anything previously recorded.
Take the case of ChatGPT, which attracted an impressive 100 million users within just two months. This rapid acceptance and broad appeal highlight the transformative power of generative AI. However, we're just at the dawn of comprehending how businesses can best harness this technology. Many organizations are on the starting blocks, deciphering how to weave generative AI into their operational fabric.
As such, the adoption of large language models (LLMs) by enterprises is heating up, with a surprising number of companies assembling open source technologies and finding early success working directly with the LLM APIs. While tools continue to emerge to help companies automate development tasks, the relative ease of building prototypes from open source components has produced positive results at very low cost and effort. In this early stage of exploration, certain components of an emerging solution architecture are becoming evident. This paper focuses on those components, outlining the most common patterns of adoption of LLMs within enterprises. As was the case with the dawn of the Internet, identifying and acting on these emerging trends early can be the key to success.
There are six primary patterns that this paper will discuss, ranging from simply calling public foundation model APIs all the way to building LLMs from scratch:
Using Base Foundation Model APIs
Extending Foundation Model Knowledge with Enterprise Data
Using Orchestration
Calling External Functions
Fine-Tuning a Base Foundation Model
Training a Custom LLM from Scratch
Market Caution on Using LLMs
LLMs are in their infancy in enterprise adoption. Many technology companies have begun incorporating LLMs into their software products as a new human language interface, or as “co-pilots” embedded in existing workflows to assist users, but most enterprises are still in the experimentation phase. While most companies recognize the opportunities for generative AI, adoption has been cautious. This caution is founded on two basic concerns: security and trustworthiness.
From a security perspective, companies don’t want their private data to be fed into external systems when they don’t know how that data is being protected or used behind the API. Companies need to know that their data is being protected at all steps within the process, in the same way as they do for all external APIs they use. In addition, they need assurances that the data (prompt inputs) will not be used to train the model. Fortunately, many of the vendors who are providing LLM APIs have built the kinds of security protections around them that enterprise users would expect, and most have also excluded API interactions from their training data. This has enabled more businesses to experiment with them.
Trustworthiness is perhaps a larger concern, given that LLM outputs have a reputation for including hallucinations, and the validity of the underlying information returned by an LLM is often not verifiable. Many of the techniques discussed below are intended to address this concern, but there is still a lot of trial and error work to do to get to the point where LLM outputs can be completely trusted for a specific use case.
Using Base Foundation Model APIs
Most companies that are using LLMs are at the point where they are using foundation model APIs directly. These APIs are easy to use, enabling full access to the capabilities of LLMs. Even better, using them doesn’t require any specific data science expertise. They are very good at answering general purpose questions, summarizing and distilling documents and other information, writing draft documents, and classifying information. The primary challenge with foundation LLMs is that they can be very unconstrained, and can sometimes respond in unpredictable ways. Most companies focus on improving the specificity, relevance, and accuracy of responses from the API using prompt templates and prompt engineering techniques.
Prompts, the inputs to LLM APIs, are the sole determinants of what the APIs will return. Because there is no inherent structure to prompts, the results can vary widely depending on the prompt used. Prompts are completely flexible, with the only constraint being on the size of the data input. For GPT-3.5, for example, the input is constrained to about twenty pages of text, while GPT-4 enables inputs up to twice that size. In general, the more context that is fed into the model through the prompt, the more accurate and precise the response tends to be. That said, the more data that is fed in, the more expensive the processing becomes, and the slower the response tends to be.
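To make this concrete, the sketch below shows a direct call to a foundation model API. It uses the pre-1.0 openai Python client that was current when this paper was written; the model name, system message, and prompt are purely illustrative.

```python
# Minimal sketch: calling a foundation model API directly with a single prompt.
# Assumes the pre-1.0 `openai` package and an OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a concise assistant for business users."},
        {"role": "user", "content": "Summarize the key risks of exporting encryption software."},
    ],
    temperature=0.2,   # lower temperature makes responses less unpredictable
    max_tokens=500,    # caps the response size (and the cost of the call)
)

print(response["choices"][0]["message"]["content"])
```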
Prompt Engineering
A lot of science has gone into figuring out the most effective ways to write prompts to make results more accurate. This new space, called prompt engineering, has produced a broad range of methodologies suited to different use cases. Companies are already seeking to hire prompt engineers, though this is likely not going to be the long-term approach. Techniques like few-shot prompting, which improve model precision by providing examples alongside the request, are much more effective than just the request alone, and end up being used even within more advanced approaches to generative AI. Few-shot prompting can be further improved by including the intermediate reasoning steps that lead to each example answer, a technique called chain-of-thought prompting. Interestingly, prompt engineering is increasingly being done programmatically as a pre-processing step from an initial prompt, sometimes even using the models themselves to generate effective prompts, so it is likely that prompt engineering won't be a discipline that most companies need to hire for in the future.
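As an illustration, here is what a few-shot prompt with chain-of-thought style reasoning might look like. The classification task and examples are hypothetical; the same structure can be sent to any chat-style LLM API.

```python
# Hypothetical few-shot prompt: each example includes the reasoning that leads to
# the answer (chain-of-thought), which tends to improve precision over the bare request.
few_shot_prompt = """You classify customer emails as BILLING, TECHNICAL, or OTHER.

Email: "I was charged twice for my subscription this month."
Reasoning: The email describes a duplicate charge, which is a payment issue.
Answer: BILLING

Email: "The app crashes every time I open the settings page."
Reasoning: The email describes a software failure, which is a product defect.
Answer: TECHNICAL

Email: "Can you reset my password? I keep getting a login error."
Reasoning:"""

# The string above would be sent as the user message in an API call like the one shown earlier.
```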
Prompt Templates
In addition to these techniques, often companies want to limit the scope of what can be requested from the model. For example, in an application focused on export regulations, you wouldn’t want people asking general questions about other topics. Prompt templates can help to constrain the prompts to a specific focused area, while also structuring the prompt using specific prompt engineering techniques. Typically, the prompt template will write the entire prompt, seeking only input parameters from the user. This enables standardized prompt optimization, while removing much of the variability from providing open prompt text. LangChain, which I will talk about later, includes a prompt template framework within the package, though these are relatively easy to build at the application layer around the LLM API calls.
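Below is a minimal sketch of a prompt template built at the application layer, in the spirit described above. The template text and parameters are illustrative; LangChain's PromptTemplate provides a similar abstraction if you prefer not to roll your own.

```python
# A hand-rolled prompt template: the application writes the entire prompt and only
# accepts a few constrained parameters from the user. Template text is illustrative.
EXPORT_TEMPLATE = (
    "You are an assistant that answers questions about export regulations only. "
    "If the question is not about export regulations, reply that you cannot help.\n\n"
    "Product category: {product_category}\n"
    "Destination country: {destination_country}\n"
    "Question: {question}\n"
)

def build_prompt(product_category: str, destination_country: str, question: str) -> str:
    """Fill the fixed template with the user-supplied parameters."""
    return EXPORT_TEMPLATE.format(
        product_category=product_category,
        destination_country=destination_country,
        question=question,
    )

prompt = build_prompt("encryption software", "Germany", "Do I need an export license?")
```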
Unfortunately, basic prompt engineering techniques will only get you so far. The LLMs on the market generally only have knowledge of data before their training cutoff (for GPT-4, this is September 2021 as of the publishing of this paper), so anything that has happened since then is not known to the base model. In addition, these models are trained on open data, and don’t have access to specific proprietary data, which is usually what a company or application wants to use to answer specific questions.
Extending Foundation Model Knowledge with Enterprise Data
The most common mechanism being used to incorporate private data into an LLM response is to feed specific knowledge into the prompt directly. Technically, an entire corpus of knowledge could be fed into the prompt with each call, but the hard limitations on input size, along with the cost and performance of large inputs, make this suboptimal. The application code can easily do lookups of external data to feed the prompt with more context. It is very common to do a SQL or Graph lookup to inform the LLM about specific details it wouldn’t otherwise know. However, when information is contained in more unstructured formats, like PDFs, web pages, or even videos, it is not practical to input all the data into the prompt, so most companies use vector similarity searches to constrain the data being sent into the model to only what is most relevant.
The way this is done is to use a vector embeddings model to convert private reference data into vector encodings. Typically, this data is stored in a vector database for easy access and real-time similarity searches. When a new prompt query is received at the application layer, it is also run through the embeddings model, and an approximate nearest neighbor search is done between the prompt request and the back-end knowledge in the vector database. This returns the most relevant information on the requested topic, which is then submitted to the LLM for inference, as context within the prompt.
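A condensed sketch of this retrieval pattern is shown below. It uses OpenAI's embeddings API and a brute-force in-memory similarity search for clarity; in practice the document vectors would live in a vector database, and the documents here are made up.

```python
# Sketch of retrieval-augmented prompting: embed reference documents, find the passages
# most similar to the question, and insert them into the prompt as context.
import numpy as np
import openai

def embed(texts):
    """Return one embedding vector per input text (OpenAI ada-002 embeddings)."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]], dtype="float32")

documents = [  # stand-ins for a private knowledge corpus
    "Encryption software exports to the EU generally fall under license exception ENC.",
    "Dual-use items may require an export license depending on the destination country.",
    "Our refund policy allows returns within 30 days of purchase.",
]
doc_vectors = embed(documents)

question = "Do I need a license to ship encryption software to Germany?"
q_vector = embed([question])[0]

# Cosine similarity between the question and each document, then keep the top 2.
scores = doc_vectors @ q_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
)
top_docs = [documents[i] for i in np.argsort(scores)[::-1][:2]]

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {question}"
)
# `prompt` is then submitted to the LLM for inference, as described above.
```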
These approaches feed specific knowledge needed to answer requests into the model, augmenting the base model, while reducing the hallucinations typical to LLMs. The data is protected, somewhat, since most LLM providers no longer use user prompts to train their models, and most of the hosting services put strong security protections around the model inputs.
Embeddings models are relatively abundant, though most companies just use OpenAI’s embeddings API, which was recently given a 75% price reduction. There are other companies, like Cohere, that specialize in embeddings models, and there are open source embeddings models available from Hugging Face. At this point, there is not much differentiation to be gained from different embeddings models, but over time specialized embeddings models could arise that will likely perform better for specific subject areas.
Vector Databases
Where there can be differentiation is in the vector database itself. The database performs the heavy lifting of executing the approximate nearest neighbor search. Since the corpus of knowledge that it needs to manage can get arbitrarily large, performance is critical, as is the ability to handle distributed processing.
There are a handful of specific vector databases on the market, with newer entrants like Pinecone and Weaviate getting a lot of buzz in the market. They are primarily popular because they are easy to access, have very good trials and entry-level packaging, and have pre-built connectors to many of the tools developers are using, but they are SMP databases that are unlikely to scale to the needs of many use cases.
Milvus has emerged as the leading open source vector database alternative, and it is capable of scaling to very large vector search sizes. Zilliz, the primary company behind Milvus, offers a managed, as-a-service version of it. Databricks also includes Photon, a native vectorized query engine that is the default for its SQL warehouses and is directly compatible with Apache Spark APIs. It has the advantage of being able to run existing SQL and DataFrame API calls directly, without modification.
Most relational and NoSQL databases are also adding vector plug-ins, though the performance of these is questionable at scale. As an example, many companies that use Postgres are using pgvector, a vector search plug-in, and Redis and Elastic have added vector data types and similarity search capabilities. Some of the specific vector search libraries that are being embedded by the database vendors are also available directly, like Vald, FAISS, and ScaNN, though these are only suitable for smaller data sets, and there are technical hurdles to using them.
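For smaller corpora, using one of these libraries directly can be enough. Here is a minimal sketch with FAISS; the vectors are random placeholders standing in for real embeddings.

```python
# Sketch of using a vector search library (FAISS) directly, without a vector database.
import numpy as np
import faiss

dim = 1536                                               # e.g. ada-002 embedding size
corpus = np.random.rand(10_000, dim).astype("float32")   # placeholder document vectors

index = faiss.IndexFlatIP(dim)   # exact inner-product search; swap in an IVF/HNSW index for larger sets
faiss.normalize_L2(corpus)       # normalize so inner product equals cosine similarity
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)   # ids of the 5 most similar documents
print(ids[0], scores[0])
```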
Model Memory
Another approach that is commonly used is to store request history (and sometimes responses and dialogs), and query it for input into the model at the next request. This context can often provide value for downstream requests. ChatGPT does this natively within a conversation thread, which is how it maintains context from one prompt to the next. When using the LLM APIs, this context needs to be maintained programmatically.
Though the easiest way to do this is to simply include the entire conversation thread in the next prompt, it is not the most efficient approach. Instead, storing the history in a vector database enables only the most relevant context to be retrieved to feed the next inference. Like the embeddings example above, the process involves doing a similarity search of the new input request against historical data, and inserting the most similar data into the prompt. This approach also enables context to be saved across multiple interactions happening at different times and locations, if desired.
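A compact sketch of this kind of programmatic memory is shown below. It reuses the embed() helper from the retrieval example above, and a plain in-memory list stands in for the vector database.

```python
# Sketch of model memory: store each exchange with its embedding, then pull only the
# most relevant history back into the next prompt.
import numpy as np

history, history_vectors = [], []

def remember(user_msg: str, assistant_msg: str):
    """Persist a completed exchange along with its embedding."""
    turn = f"User: {user_msg}\nAssistant: {assistant_msg}"
    history.append(turn)
    history_vectors.append(embed([turn])[0])

def relevant_history(new_msg: str, k: int = 3):
    """Return the k past exchanges most similar to the new request."""
    if not history:
        return []
    q = embed([new_msg])[0]
    vecs = np.array(history_vectors)
    scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [history[i] for i in np.argsort(scores)[::-1][:k]]

# Only the retrieved turns are prepended to the next prompt, instead of the whole thread.
context = "\n\n".join(relevant_history("What did we decide about the Germany shipment?"))
```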
Using Orchestration
As you can probably tell, there is a lot of complexity in the pre-processing that happens before calling out to an LLM for inference. We talked about prompt engineering techniques, prompt templates, embeddings, and model memory, but there are also things like guardrail filters on prompts, and chaining a series of LLM inferences together (to generate a better prompt, for example) that will increasingly be relevant. Today, this is code that must be written by a developer around the LLM API calls. However, multiple frameworks are arising to help orchestrate all of these pre-processing tasks.
The most common framework in the market has been LangChain, an open source framework that provides a harness for doing all of these pre-processing tasks before calling your favorite LLM model (it is model agnostic). It has capabilities for prompt templates, external context lookup (including embeddings), chaining multiple inference calls together, managing model memory, and calling out to external agents (which I will talk more about in the next section). LangChain is designed to be easy to use and is available in Python and JavaScript, so it is often used for prototyping, though developers will sometimes rewrite the code before moving into production. LangChain faces stiff competition from OpenAI, as the market leader continues to add orchestration features natively, though the fact that LangChain is model-agnostic may give it longer-term durability, as companies shy away from vendor lock-in to specific models.
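As a flavor of what this looks like in practice, here is a minimal LangChain chain: a prompt template wired to a model call. It is based on the LangChain interfaces that were current when this paper was written, and the template text is illustrative.

```python
# Minimal LangChain sketch: a prompt template chained to an LLM call.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer the following export-regulation question concisely:\n{question}",
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
answer = chain.run(question="Do I need a license to ship encryption software to Germany?")
print(answer)
```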
OpenAI has made a play to provide the orchestration framework natively around their APIs in order to create more value (and stickiness) with developers. They provide most of the same capabilities as LangChain, without the model flexibility. However, they are very advanced in their approach, particularly around function calling (see below), and they have the majority share of the LLM market today, so the native, lower-code approach with one-stop shopping has appeal to many developers.
AutoChain is a newer entrant to the orchestration landscape that extends LangChain with some nice features for testing and iterating on prompts. This makes it easier to see how effective prompts are and to update them iteratively. It also provides a testing framework to evaluate outputs under different scenarios, enabling the testing of more complex cases where a prompt structure change might have a negative impact on responses for other use cases, and it supports OpenAI function calling (see below). While AutoChain provides some very nice features for iterating and testing, I expect these types of capabilities to converge into LangChain and OpenAI very quickly.
LlamaIndex is another interesting open source orchestration player, though its focus is on providing a standardized interface to back-end data. It provides data connectors to ingest and convert existing structured and unstructured data sources; data structuring through indices, graphs, and vectors so that the data can be used easily within LLMs; and a simple interface for enhancing prompts with augmented information from the back end. LlamaIndex works with the other application frameworks, like LangChain, and with ChatGPT, so it is a nice, open accelerator.
There are other orchestration players emerging, as well, including things like Prophecy, Anarchy, Fixie.ai, Promptfoo, and extensions to Flask and Docker. Some of these focus on orchestrating the inputs to the model, while others provide frameworks for testing and providing DevOps at the application layer. Given the relatively consistent set of patterns to apply here, I expect further convergence, and that more players will emerge here quickly. Meanwhile, the cloud services providers, AI platforms, and LLM vendors will continue to evolve their own orchestration plays. I also expect pieces of this to commoditize with a focus on usability, though the data orchestration layer has an opportunity to remain differentiated for a long time.
Calling External Functions
One of the emerging enhancements to language model execution has been the incorporation of external function calls. These calls enable the model to reach out to external APIs to look up information or perform actions. This provides another mechanism for the model to be extended with additional context (e.g. looking up the current weather to answer a weather-related question), but it also gives LLMs the ability to complete complex tasks. For example, you could ask the model to place an order for the ingredients of a good chili recipe through Instacart. It would then both generate the recipe and structure the API call for the Instacart order.
The first implementation of this has been in LangChain, which provides a feature called agents to handle these external calls. The agent makes the decision of when to call a specified API, or a series of APIs, to complete a task. There aren’t a lot of examples of this being used in production, but the idea is relatively straightforward. The decisioning on calling APIs is also relatively deterministic, with a set of specific APIs being fed into the prompt, along with instructions (typically few-shot) on when and how to use them. The model produces the API call JSON structure, but does not call the API directly.
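The sketch below shows what such an agent looks like in LangChain, again using the interfaces current at the time of writing. The weather lookup is a hypothetical stub; note that it is the agent executor, not the model itself, that ends up invoking the Python function.

```python
# Sketch of a LangChain agent deciding when to call an external tool.
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, Tool, AgentType

def get_weather(city: str) -> str:
    """Stand-in for a real weather API call."""
    return f"It is 18°C and raining in {city}."

tools = [
    Tool(
        name="get_weather",
        func=get_weather,
        description="Look up the current weather for a city. Input should be a city name.",
    )
]

agent = initialize_agent(
    tools,
    OpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,   # print the model's reasoning and tool calls
)
agent.run("Should I bring an umbrella in Seattle today?")
```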
OpenAI just announced their own approach to this, introducing function calling in their latest release (for GPT-3.5 and GPT-4). This was a follow-up to the plug-ins capability they announced earlier on top of ChatGPT, now focused on their APIs directly. The approach is nearly identical to LangChain's, though they are focused on a broader set of structured programmatic outputs, including not only API calls, but also SQL code and other functions. Like LangChain, the APIs and their uses need to be defined as part of the inputs, and the model is fairly deterministic when deciding whether it needs to use a function versus just doing things the old LLM way. It also does not make the actual API calls; it just structures them so that the surrounding code can make the calls.
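Here is a condensed sketch of OpenAI's function calling, using the pre-1.0 openai client and the function-calling interface introduced in that release. The get_weather schema is illustrative; the key point is that the model returns a structured function_call for the surrounding code to execute.

```python
# Sketch of OpenAI function calling: the model decides whether a function is needed
# and returns a structured call; the application code actually executes it.
import json
import openai

functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Seattle"},
            },
            "required": ["city"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Should I bring an umbrella in Seattle today?"}],
    functions=functions,
    function_call="auto",   # let the model decide whether to use a function
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    # The surrounding code makes the real API call here, then feeds the result back
    # to the model in a follow-up message with role "function".
    print("Model wants to call:", message["function_call"]["name"], args)
else:
    print(message["content"])
```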
There is a set of LLM models (e.g. TALM and Toolformer) being trained to make the external call-out decisions less pre-defined and more based on actual reasoning, and there is even a benchmark for this, called API-Bank. Some of this work can be done by asking the model to break down the steps needed to complete a task, using the chain-of-thought prompt engineering technique discussed earlier. However, individual tasks still require a decision on how to execute, and the requests to external functions need to be structured in a semantically accurate way. I expect more work to arise in these areas, though it is unclear whether this will commoditize over time. Another possibility is that individual API publishers will begin to include metadata embeddings frameworks that enable an LLM to easily learn when and how to call them. This could even emerge as a standardized extension of the OpenAPI specification.
This concept of function calls really changes the game for LLMs. An entire wave of personal assistants is likely to emerge from this, and I expect most software vendors will embed these types of capabilities in their products to automate natural language commands against their products. I expect this to pick up most quickly in the consumer space, but B2B will be a fast follower.
Fine-Tuning a Base Foundation Model
Companies that can’t get enough specific value out of the base foundation models are sometimes choosing to fine-tune a foundation model to their specific needs. This requires a much higher degree of data science acumen, and a lot of traditional data engineering and model development work. While most companies are not at the point of doing this level of customization, I expect that most differentiation will come from this type of fine-tuning over time. I believe the popularity of fine-tuning activities will increase dramatically once the open source models get on par with GPT-3.5, which is now a reality since Meta announced the open source Llama 2 model.
The fine-tuning process typically will change the model’s underlying weights, to make it more suited to solving a specific type of problem. Because the existing weights are used as the foundation, not nearly as much data is needed to improve performance on a given task as was needed to generate the initial model. A typical scenario where fine-tuning is useful is when there is specific jargon that is used within a use case, and the model doesn’t understand it out of the box.
OpenAI does not enable fine-tuning on GPT-3.5 or GPT-4, though its older GPT-3 base models can be fine-tuned through its API. There is also a range of open source models that can be fine-tuned with relatively small sets of private data, including EleutherAI's GPT-J and GPT-Neo and Google's FLAN (Finetuned Language Net) models. However, Meta's recently announced Llama 2 is the most promising of the open source LLMs. Its performance is comparable to GPT-3.5 on many tasks, and it can take inputs of approximately 5 pages of text. Unlike the previously released Llama model, Llama 2 is available for commercial use.
Tools needed to fine-tune LLMs are provided by the Cloud Service Providers, and most AI tools vendors are adding LLM fine-tuning capabilities. Hugging Face provides these capabilities on top of a set of open source models. However, the technical challenges and AI knowledge required to make this work represent a significant step up from what is required for using the standard APIs. While any developer who knows Python or JavaScript can build applications around the APIs, fine-tuning an LLM requires a deeper understanding of machine learning techniques and practices. In addition, the technical challenges of deploying and working with the models, data, and GPUs are significantly higher. That is why most companies are not yet at the stage of doing this work. I expect that the tools for simplifying LLM fine-tuning will continue to evolve to make it easier for more developers to participate over time.
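To give a sense of the mechanics, below is a heavily condensed sketch of parameter-efficient fine-tuning with Hugging Face transformers and the peft (LoRA) library. The base model, dataset file, target modules, and hyperparameters are all illustrative assumptions; a real effort also needs GPU capacity, careful data preparation, and evaluation.

```python
# Condensed fine-tuning sketch using LoRA adapters rather than updating all weights.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base = "EleutherAI/gpt-neo-1.3B"                 # any open model you are licensed to tune
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA inserts small trainable adapter matrices into the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# Domain text containing the jargon the base model doesn't know, one example per line.
data = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("finetuned-model")   # saves only the small LoRA adapter weights
```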
Choosing an LLM Model
While OpenAI has dominated the landscape of initial use for LLMs, there are other choices, as well as multiple levels of LLMs available within each vendor. OpenAI’s core modern models are GPT-3.5 and GPT-4. GPT-3.5 is faster and much cheaper (50x) than GPT-4, but does not have the levels of accuracy that GPT-4 can provide. GPT-4 also has a larger context window, allowing up to around 40 pages of input context.
Anthropic has been an ongoing challenger to OpenAI. Formed by several engineers who left OpenAI, the company offers the Claude model, which provides a much bigger context window (3x) and accuracy on par with GPT-3, while focusing on being a safer model that produces less potentially harmful output. Claude 2 was recently announced, which improves performance and safety over its previous iteration. Anthropic provides a bit more customization capability than OpenAI, but is still trying to catch up from a market perspective. However, Claude has the advantage of being completely free to use in beta, which could provide a more attractive entry point for some companies.
Open source models have also been picking up steam, including Mosaic, Falcon, and Mistral, but none of them have come close to the level of accuracy of GPT-3.5 as of yet. Most of these are accessible through Hugging Face via APIs, making them relatively easy to consume. These are the primary models being used when companies want to fine-tune foundation models, since OpenAI does not provide fine-tuning access to GPT-3.5 or GPT-4.
The most prominent of the open source models has been Llama, which was developed by Meta. They previously open-sourced Llama 1 for research use, and recently announced Llama 2 as open source and available for commercial use. With accuracy levels on par with GPT-3.5, this could turn out to be a highly disruptive play against the other foundation model vendors. The cost to develop a general purpose foundation model using traditional approaches has been so high that only the largest platform players have been able to afford to build them. Opening up Llama will likely put enormous pressure on OpenAI, and Llama could become the dominant model. With a highly accurate open source model now available for commercial use, more companies are likely to shift into the mode of fine-tuning their own versions of the models.
Training a Custom LLM from Scratch
In very rare cases, companies are opting to build their own foundation models, rather than use existing models at the core. There are very few examples of this today, largely due to the amount of data needed to train a model, the expertise needed to execute on it, and the high cost associated with the GPU compute resources.
The primary reason companies would want to build their own foundation model is that they believe they have expertise and data that surpass what is available in the public market, and they want to apply that differentiation from the ground up. A good example of this is Inflection, which built a foundation model from scratch for its Pi virtual assistant. By building it from scratch, they could focus the context of the model on the tasks and capabilities they expect their users to need the most. Bloomberg also built a custom LLM, in order to leverage its deep domain expertise in Financial Services language. Great Britain has embarked on building its own LLM as well, partially as an economic play, but also to more deeply reflect the idioms and dialects of its population.
Training a foundation model from scratch requires a very large amount of data (and a lot of money) today, though the tools to do it are widely available on the Cloud Service Provider platforms, and the patterns and approaches are well documented. Approaches are being researched to reduce the data requirements and expense of creating these models, so adoption of completely custom models might increase over time if those efforts bear fruit.
Summary
We are still in the “Wild West” stages of LLM adoption. We are already starting to see amazing applications of the technology, primarily in the consumer space and from software companies extending their products with natural language functionality. However, we have only just begun to scratch the surface. Most companies realize that in order to remain competitive they need to start figuring out how generative AI is going to impact their business. This is starting with experimentation and simple implementations of LLMs, but it will quickly progress into tangible new capabilities for customers and employees. There are very few barriers to basic adoption, at this point, and the patterns are fairly well understood. It will be interesting to see how the market evolves over the coming year, and how far up the stack of adoption mainstream companies will ascend. In the meantime, if your company is not already experimenting with LLMs, the time is now to get started.