Unlock the untapped potential of your enterprise data by harnessing the power of ChatGPT API and semantic search. As businesses grapple with the challenge of implementing ChatGPT-like capabilities, text embeddings and vector databases emerge as the perfect solution. Delve into the world of OpenAI’s text-embedding-ada-002 model and discover how, when combined with a vector database, it can revolutionise your structured and unstructured data analysis.
In this article, we will demystify the concepts of semantic search and text embeddings, and guide you through the process of transforming your enterprise data into valuable, searchable resources that can be interacted with. Experience the synergy of ChatGPT and semantic search through real-world examples, and envision the endless possibilities they can bring to your business.
With all the hype around the consumer-facing ChatGPT product over recent months, another related OpenAI release has largely been flying under the radar. I suspect it has taken second place because it is not readily usable by end users the way ChatGPT is, and because it involves more technical concepts that need to be understood by the discerning reader before the problems it can solve become clear.
I am referring to OpenAI’s text-embedding-ada-002 model that can unlock incredible semantic search and analysis capabilities over your structured and unstructured enterprise data, when combined with a vector database and ChatGPT API.
Immediately, we encounter a few terms that may cause some to feel overwhelmed, but please bear with me, as they are actually straightforward to grasp, at least at a practical level: semantic search, text embeddings, and vector databases.
Semantic Search
You might be familiar with the saying, “Do what I meant, not what I said.” Keyword search is something we are all accustomed to; when typing one or two words into a search box, we expect to see results containing either or both of those words. If the search engine is a bit cleverer, it will also find results with synonyms of the words we used. However, traditional search engines typically do not find information based on meaning. Semantic search, on the other hand, goes beyond matching keywords and instead focuses on understanding the meaning and context behind the query. It helps find more relevant and accurate results by considering the user’s intent and the relationships between words, making search engines smarter and more efficient.
Text Embeddings
Semantic search operates using text embeddings: mathematical representations that capture the meaning and context of text. Text is converted into numerical form by mapping words and phrases to vectors in a high-dimensional space, where similar meanings are positioned closer together. This enables algorithms to identify patterns, relationships, and relevant content, even across texts that use different vocabulary.
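To make this concrete, here is a minimal sketch using the openai Python package (the pre-1.0 interface) with a few hypothetical sentences; two phrasings of the same idea score far higher than an unrelated one:

```python
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied via your own config

def embed(text: str) -> np.ndarray:
    # Request a 1,536-dimensional embedding from text-embedding-ada-002
    response = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Different vocabulary, similar meaning: expect a high similarity score
a = embed("The insurer reimburses losses caused by burst pipes.")
b = embed("Water damage from plumbing failures is covered by the policy.")
c = embed("Our cafeteria serves lunch between noon and 2pm.")

print(cosine_similarity(a, b))  # high: related meaning
print(cosine_similarity(a, c))  # low: unrelated meaning
```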
Vector Database
A vector database is a specialised storage system designed for managing high-dimensional data, such as text embeddings. By efficiently handling the querying and retrieval of similar embeddings, vector databases play a crucial role in enabling semantic search, making it easier for algorithms to find relevant content based on meaning and context.
Essentially, semantic search is the “what” and embeddings and vector databases are the “how.”
It’s important to note that in LLMs (Large Language Models) like ChatGPT, tokens are fundamental units of text, often parts of words, used for processing and generating language. With a limited token capacity for each query and response, the model can handle only a certain amount of data as context. To optimise its effectiveness, it’s crucial to identify the most relevant information to include with the query. This is where embeddings and vector databases are valuable, as they help extract specific, relevant details, ensuring clear and targeted interactions with the model. For instance, providing a 50-page document as context for your query would be impractical due to token limitations.
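To get a feel for the numbers, tokens can be counted with OpenAI’s tiktoken library. A brief sketch (the document file is hypothetical):

```python
import tiktoken

# Load the tokeniser used by the ChatGPT family of models
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

document = open("treaty_version_a.txt").read()  # hypothetical 50-page document
token_count = len(encoding.encode(document))

# gpt-3.5-turbo originally allowed roughly 4,096 tokens shared between prompt
# and reply, so a document of tens of thousands of tokens simply will not fit
print(f"Document is {token_count} tokens")
```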
Converting enterprise data and documents into embeddings and leveraging vector databases
To harness the power of semantic search and ChatGPT for your enterprise data, it is necessary to first convert your company documents and other data into text embeddings and store them in a vector database. This process may seem daunting, but let us break it down step by step.
- Slicing up documents: Segment your company documents into smaller, more manageable pieces, like sections, paragraphs, or sentences, depending on the granularity you require. This ensures the generated embeddings will be more focused and relevant.
- Converting to embeddings: Feed each sliced document piece into the text-embedding model, which will process the text and generate an embedding capturing the meaning and context of the content.
- Storing embeddings in a vector database: Store the converted document slices as embeddings in a vector database, a storage system specifically designed for managing high-dimensional data like embeddings. It allows for efficient querying and retrieval of similar embeddings, crucial for semantic search.
- Querying the vector database: When a user submits a question or instruction to the system agent, first generate an embedding for the query using the same text-embedding model. Then, search the vector database for the most similar embeddings that correspond to relevant information from your enterprise documents and data. This is where semantic search shines, as the system identifies relevant content based on meaning and context, rather than just matching keywords.
- Presenting results to the ChatGPT API: Retrieve the most relevant embeddings from the vector database, use those to look up the original text, and provide this limited and focused information to the ChatGPT API. With this grounding, ChatGPT can generate a response that is both accurate and relevant to the user’s query, leveraging the wealth of knowledge from your enterprise data. (A minimal code sketch of the whole flow follows this list.)
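Putting these steps together, here is a minimal end-to-end sketch. It uses the pre-1.0 openai Python interface and, to stay self-contained, a plain in-memory list in place of a real vector database; the file name and prompt wording are illustrative:

```python
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied via your own config

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return np.array(response["data"][0]["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Slice documents into paragraphs (naive split; production systems chunk more carefully)
chunks = [p for p in open("policy_docs.txt").read().split("\n\n") if p.strip()]

# 2 & 3. Convert each chunk to an embedding and "store" it (stand-in for a vector DB)
store = [(chunk, embed(chunk)) for chunk in chunks]

# 4. Embed the user's query and retrieve the most semantically similar chunks
query = "Can you help me understand the coverage details for water damage?"
q_vec = embed(query)
ranked = sorted(store, key=lambda item: cosine(item[1], q_vec), reverse=True)
context = "\n\n".join(chunk for chunk, _ in ranked[:3])

# 5. Present the retrieved text to the ChatGPT API as grounding context
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": f"Answer using only the following context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(completion["choices"][0]["message"]["content"])
```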
Let us look at some examples.
Example 1:
Imagine you have two versions of a reinsurance treaty, and you want to analyse their semantic differences to pinpoint the key variations between them. Each document is 30 to 40 pages long, making it too large to provide as context for queries. By harnessing the combined power of ChatGPT and semantic search, you can efficiently achieve this goal and receive a concise, generated response highlighting the main differences.
Let’s consider an example using two excerpts from these hypothetical treaties. Suppose the difference lies in the clauses governing the risk-sharing mechanism. These clauses appear on different pages in each document and use distinct vocabulary, yet they are the main focus of our analysis. We aim to extract a synthesised response that points out and emphasises these differences.
Version A – Clause 4.1:
“In the event of a loss, the Cedent shall retain the first £5,000,000 of each loss occurrence. The Reinsurer shall be liable for 90% of the loss excess of the Cedent’s retention, up to a maximum of £25,000,000 per loss occurrence.”
Version B – Clause 10.7:
“In the event of a loss, the Cedent shall retain the first £7,000,000 of each loss occurrence. The Reinsurer shall be liable for 80% of the loss excess of the Cedent’s retention, up to a maximum of £20,000,000 per loss occurrence.”
The two clauses, as well as all other clauses in each document, would first be converted into text embeddings using the text-embedding model. These embeddings, along with the relevant metadata, are then stored in a vector database, making them accessible for semantic search.
To compare the clauses from the first document with those in the second, you would process each clause of the first document and use its embedding to find the most similar clause embeddings from the second document. By doing so, you can identify the corresponding clauses in both documents that share a semantic similarity. Once you have the relevant clauses from each document, you can provide them to the ChatGPT API. The model will then process these clauses and generate a response highlighting the key differences between them, making it easier to understand the variations in the treaties. A potential response about our example clauses might resemble the following:
“The primary differences between the two versions of the reinsurance treaty lie in the risk-sharing mechanism specified in Clause 4.1 and 10.7 respectively. In Version A, the Cedent retains the first £5,000,000 of each loss occurrence, while in Version B, the Cedent’s retention is increased to £7,000,000. Additionally, the Reinsurer’s liability in Version A is set at 90% of the loss excess of the Cedent’s retention, up to a maximum of £25,000,000 per loss occurrence. In contrast, Version B reduces the Reinsurer’s liability to 80% of the loss excess of the Cedent’s retention, with a maximum limit of £20,000,000 per loss occurrence.”
This clear, succinct synthesised response highlights the key differences in the risk-sharing mechanism between the two reinsurance treaty versions, empowering stakeholders to make informed decisions during treaty negotiations.
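Under the hood, the clause-matching step might look like the following sketch, where each version’s clauses have already been embedded (dummy two-dimensional vectors keep it runnable; real ada-002 embeddings have 1,536 dimensions):

```python
import numpy as np

# Hypothetical (clause_text, embedding) pairs; in practice these come from the
# vector database populated earlier. Tiny dummy vectors keep the sketch runnable.
version_a = [("Clause 4.1: Cedent retains the first £5,000,000 ...", np.array([0.9, 0.1]))]
version_b = [
    ("Clause 10.7: Cedent retains the first £7,000,000 ...", np.array([0.8, 0.2])),
    ("Clause 2.3: Governing law and jurisdiction ...", np.array([0.1, 0.9])),
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pair each clause in Version A with its semantically closest clause in Version B
for text_a, vec_a in version_a:
    text_b, _ = max(version_b, key=lambda item: cosine(vec_a, item[1]))
    print(f"Matched pair:\n  A: {text_a}\n  B: {text_b}")
    # Each matched pair would then be sent to the ChatGPT API with an instruction
    # such as "Highlight the key differences between these two clauses."
```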
Example 2:
Now, let us consider an example from the insurance industry, specifically related to customer service and support for policyholders. A customer service representative receives a query from a policyholder.
Policyholder query: “Can you help me understand the coverage details for water damage in my home insurance policy?”
As before, an embedding is first generated for this query using the same text-embedding model originally used to embed the source documents, capturing the meaning and context of the question in the same way.
We then search the vector database, which contains embeddings from the company’s proprietary data sources such as policy documents, terms and conditions, and coverage details specific to the policyholder. The goal is to find the embeddings most semantically similar to the query. Here are some examples of relevant and non-relevant text found in the company’s data:
Relevant text examples:
“Home insurance policy includes coverage for water damage caused by burst pipes and accidental leakage.”
“Water damage resulting from faulty plumbing or appliance malfunctions is covered under the home insurance policy.”
“Policyholder’s home insurance provides protection against water damage due to sudden and accidental events.”
Non-relevant text examples:
“Vehicle insurance covers damages due to collision, theft, vandalism, and water damage due to floods.”
“Travel insurance provides coverage for medical expenses, trip cancellation, and lost luggage.”
“Life insurance pays out a lump sum to beneficiaries upon the policyholder’s death.”
The system identifies relevant content based on meaning and context, rather than just matching keywords. In this case, embeddings corresponding to the relevant text examples are considered semantically similar to the user query.
Once the most relevant embeddings are retrieved, they are used to look up the original text, which is then provided as context to the ChatGPT API. With this grounding, ChatGPT can generate a response that is both accurate and relevant to the user’s query.
ChatGPT response: “Your home insurance policy covers water damage resulting from sudden and accidental events, such as burst pipes, faulty plumbing, or appliance malfunctions. It is important to note that this coverage does not extend to damage caused by gradual wear and tear or lack of maintenance. To ensure your home is adequately protected, we recommend regular maintenance checks and prompt repairs.”
In this example, the combination of embeddings, semantic search, and the ChatGPT API allowed the system to identify relevant information from proprietary and contextually relevant data sources and generate a helpful, informative response for the customer service representative to share with the policyholder. This approach can significantly improve the efficiency and accuracy of information retrieval and response generation across various use cases in the insurance industry.
There is, of course, a plethora of other examples and use cases. Let your imagination run wild.
In conclusion, the combination of ChatGPT and text embeddings presents a powerful solution for businesses looking to leverage semantic search capabilities over their enterprise data. By converting their documents into embeddings and storing them in a vector database, businesses can achieve a more accurate and context-driven search experience without the need for fine-tuning. The resulting insights and enhanced understanding of data can lead to better decision-making, increased efficiency, and a competitive edge in the market.
If you are interested in unlocking the potential of ChatGPT and semantic search for your organisation, do not hesitate to get in touch with Inversion. Our team of experts, well-versed in building enterprise systems in the insurance and reinsurance industries, can help you explore and tailor solutions to your specific requirements. Let us guide you through the process and demonstrate the value of these cutting-edge technologies for your business. Contact us today and embark on your journey towards AI-driven success.
Feel free to book a session with me on my calendar link:
https://calendly.com/jacques-bosch/30min
Jacques Bosch
Technical Information
Below you can find more information about various tools and terms relating to these types of enterprise solutions.
More about embeddings
Vector embeddings are a crucial aspect of machine learning, used in numerous applications such as NLP, recommendation engines, and search algorithms. These embeddings represent real-world objects and concepts, such as images, text, and audio recordings, as lists of numbers. This numerical representation allows for the translation of semantic similarity into proximity within vector spaces, which can then be quantified and utilised for tasks like clustering, recommendation, and classification.
Creating vector embeddings can involve feature engineering or training models to convert objects into vectors. Deep neural networks are often employed to generate high-dimensional and dense embeddings for various data types. For instance, Word2Vec, GloVe, and BERT are used for text data, while convolutional neural networks (CNNs) are used for images. The use of vector embeddings enables a wide range of machine learning applications, including similarity search, de-duplication, recommendations, anomaly detection, and more. Moreover, popular ML models and methods, such as encoder-decoder architectures, rely on embeddings internally to produce accurate results in applications like machine translation and caption generation.
LangChain
LangChain is a powerful framework and library for developing applications driven by language models. Designed with enterprise environments in mind, LangChain excels at fetching relevant information from vector storage and semantic search, and synthesising pertinent responses. By offering modular abstractions for the components needed to work with language models, along with use-case-specific chains, LangChain enables the creation of applications that are data-aware and behave as agents, interacting seamlessly with their environment. Whether you are building chatbots, Generative Question-Answering systems, or summarisation tools, LangChain provides a flexible and adaptable solution that can be tailored to meet the needs of your specific use case.
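As a rough sketch, here is how a retrieval-augmented question-answering chain could be assembled with the 2023-era LangChain API (module paths and class names have changed in later releases; the document chunks are illustrative):

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Hypothetical pre-chunked enterprise documents
chunks = [
    "Home insurance policy includes coverage for water damage caused by burst pipes.",
    "Travel insurance provides coverage for medical expenses and trip cancellation.",
]

# Embed the chunks and index them (FAISS used here as a local vector store)
store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# Chain: embed the query, retrieve similar chunks, and answer grounded in them
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    chain_type="stuff",  # "stuff" the retrieved chunks directly into the prompt
    retriever=store.as_retriever(),
)
print(qa.run("Does my home insurance policy cover water damage?"))
```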
Databases that can be considered for semantic search include the following:
Pinecone
Pinecone is a fully managed vector database designed to make it easy to add vector search capabilities to production applications. It combines cutting-edge vector search libraries, advanced features such as filtering, and distributed infrastructure to provide high performance and reliability at any scale. This makes Pinecone an excellent choice for vector storage and semantic search tasks.
Pinecone is a cloud-native vector database, offering ultra-low query latency, live index updates, and the ability to combine vector search with metadata filters for more relevant and faster results. The platform is fully managed and easy to use, with an intuitive API or Python client, ensuring users do not need to maintain infrastructure or troubleshoot algorithms. Additionally, Pinecone offers enterprise-grade security and compliance, including SOC 2 Type II certification and GDPR readiness. Pinecone is probably the lowest-friction way to get started, and it takes a lot of pain out of the process, allowing for easy scaling.
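As a sketch with the 2023-era Pinecone Python client (the client interface has since been revised; index name, IDs, and metadata are illustrative):

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")

# One-off: create an index sized for ada-002's 1,536-dimensional embeddings
pinecone.create_index("enterprise-docs", dimension=1536, metric="cosine")
index = pinecone.Index("enterprise-docs")

clause_embedding = [0.0] * 1536  # placeholder; use a real ada-002 vector
index.upsert(vectors=[
    # (id, embedding, metadata carrying the original text for later lookup)
    ("clause-4-1", clause_embedding, {"text": "In the event of a loss ..."}),
])

# Query with the embedding of the user's question
query_embedding = [0.0] * 1536  # placeholder
results = index.query(vector=query_embedding, top_k=3, include_metadata=True)
for match in results["matches"]:
    print(match["score"], match["metadata"]["text"])
```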
Elasticsearch
Elasticsearch, a well-established search engine with a proven track record in enterprise solutions, is a superb choice for vector storage and semantic search. Known for its versatility across various applications, Elasticsearch is used by many reputable companies to enhance search results and decrease operational costs. With the introduction of the _knn_search endpoint in Elasticsearch v8.0, it enables efficient approximate nearest-neighbour search on indexed dense_vector fields, making it ideal for storing embedding vectors and running semantic searches. Developers can utilise the k-nearest neighbour (kNN) vector search API with a query_vector_builder to query data with semantic search, providing both the query text and the model used for vector embeddings. By choosing Elasticsearch, you are opting for a powerful and flexible search engine that excels in vector storage and semantic search capabilities.
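A sketch using the Elasticsearch 8.x Python client, assuming an index whose embedding field is mapped as dense_vector (index and field names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster

query_embedding = [0.0] * 1536  # placeholder; use a real ada-002 vector

# Approximate nearest-neighbour search over the indexed dense_vector field
response = es.search(
    index="enterprise-docs",
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 5,
        "num_candidates": 50,
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("text"))
```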
PostgreSQL with pgvector
PostgreSQL is a robust open-source object-relational database system, renowned for its reliability, performance, and extensive feature set. With over 35 years of active development, it has become the go-to choice for many businesses and developers. Among its numerous extensions is pgvector, an open-source vector similarity search extension designed specifically for vector storage and semantic search. With support for exact and approximate nearest-neighbour search, L2 distance, inner product, cosine distance, and compatibility with any language that has a PostgreSQL client, pgvector offers a comprehensive solution for managing and querying vector data. This could be a good choice if you want full control over the hosting environment and are planning to host and manage your own databases directly.
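A minimal sketch using psycopg2 together with the companion pgvector Python package, which registers the vector type with the driver (database, table, and column names are illustrative):

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=enterprise")  # assumption: local database
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""CREATE TABLE IF NOT EXISTS doc_chunks (
    id serial PRIMARY KEY, content text, embedding vector(1536))""")
register_vector(conn)  # lets psycopg2 pass numpy arrays as pgvector values

chunk_embedding = np.zeros(1536)  # placeholder; use a real ada-002 vector
cur.execute("INSERT INTO doc_chunks (content, embedding) VALUES (%s, %s)",
            ("In the event of a loss ...", chunk_embedding))
conn.commit()

# <=> is pgvector's cosine-distance operator; smallest distance = most similar
query_embedding = np.zeros(1536)  # placeholder
cur.execute("SELECT content FROM doc_chunks ORDER BY embedding <=> %s LIMIT 5",
            (query_embedding,))
print(cur.fetchall())
```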
Azure Cognitive Search
Azure Cognitive Search is an impressive service capable of ingesting and indexing a wide array of content, including PDFs, images, audio, and video. Text is extracted from all types of content and indexed to render it searchable. Additionally, it offers a semantic search feature as an add-on. It is worth noting that it initially conducts a standard, keyword-based search, and only then applies semantic search to refine the top 50 results. Consequently, if the keyword search happens not to surface the semantically optimal matches, the semantic refinement will not produce the best results.
Fine-Tuning
A common misconception is that fine-tuning ChatGPT models with enterprise data is essential for achieving a ChatGPT-like experience based on that data. In fact, fine-tuning can sometimes be an unsuitable approach: its primary purpose is to teach the model a new capability, rather than to supply information that can then be retrieved via chat without a high probability of hallucinations or incorrect responses. In many use cases, the default models are already capable of generating responses in practical structures and formats, making fine-tuning unnecessary, as long as the model is prompted together with the grounding data.