4) Assess cost, data ownership options and resources
The choice of LLM implementation approach affects both complexity and cost, including the costs associated with:
- Training
- Data collection, ingestion and cleansing
- Hiring data scientists
- Maintaining the model in production
The choice also greatly affects how much control a company has over its proprietary data. The main reason to use this data is that it can help a company differentiate its product and make it difficult for competitors to replicate, potentially yielding a competitive advantage. Proprietary data can also be crucial for addressing narrow, business-specific use cases.
There are also regulatory and ethical reasons for maintaining control. For example, depending on the data being stored and processed, regulators may require secure storage and auditability. In addition, uncontrolled language models may generate misleading or inaccurate advice. Control measures can help address these issues, for instance by preventing the spread of false information and potential harm to individuals seeking medical guidance.
Typically, there are three ways to implement an LLM: through an API, through platform as a service (PaaS) or by self-hosting. Each presents different considerations.
Off-the-shelf model via API
Using an API spares a company the complexity of maintaining both a sizable team of data scientists and the language model itself, which involves handling updates, bug fixes and improvements. Much of this maintenance burden shifts to the provider, allowing the company to focus on its core functionality. In addition, an API can enable on-demand access to the LLM, which is essential for applications that require immediate responses to user queries or interactions.
When a company uses an LLM API, it typically shares data with the API provider. It’s important to review and understand the provider’s data usage policies and terms of service to confirm they align with the company’s privacy and compliance requirements. Data ownership also depends on the provider’s terms and conditions. In many cases, companies retain ownership of their data but grant the provider certain usage rights for processing it. Companies should clarify data ownership in their provider contracts before investing.
PaaS
With PaaS, a provider offers access to its LLM as part of a broader platform, allowing customers to operate their LLMs without managing the underlying application infrastructure, middleware or hardware. Companies using this approach may incur higher model costs, both for purchasing the rights to build on top of the LLM with their own data and for enabling domain specificity and model customization during deployment. In return, PaaS lets companies control their data and reduces time to value and cost compared with the self-hosted approach. On the flip side, auditing the data and providing comprehensive explanations for results can be challenging, because PaaS providers don’t share the underlying data. In addition, PaaS can result in a greater total cost of ownership for the LLM, and can be more complex, than using an API.
Self-hosting an LLM
This is the most expensive approach because it means building the entire model from scratch and requires mature data processes to fully train, operationalize and deploy an LLM. Furthermore, upgrading the underlying model in a self-hosted implementation is more intensive than a typical software upgrade. On the other hand, because the company owns the LLM, this approach provides maximum control and the ability to customize extensively.