The truth about ChatGPT in the B2B context
Since its public announcement in November 2022, ChatGPT has rapidly gained several million users worldwide. ChatGPT's outstanding performance on a range of language-understanding tasks and in testing with human users set it apart from previous language models, sparking a kind of "arms race" among open-source efforts to match the performance of these generative models, which were trained using millions of dollars of compute resources. In contrast, its adoption in a B2B context feels much slower: many companies want to follow the hype around ChatGPT but lack a clear understanding of how it could benefit their business processes. This is where Niklas Frühauf, Senior Data Scientist, steps in to clarify the strengths, weaknesses and risks of Generative AI.
We have been approached by many customers who either consider ChatGPT to be the answer to life, the universe, and everything, or who are completely unaware of how ChatGPT works and, consequently, of its limitations. Based on our experience with ChatGPT from multiple internal as well as customer-facing projects, we aim to provide a starting point for companies that plan to adopt ChatGPT to optimize their business processes. Successful adoption requires a sound understanding of ChatGPT's limitations and risks, strengths and weaknesses, especially compared to more "traditional" machine learning approaches, which we will tackle in this blog post.
The power of ChatGPT
The most obvious strength of ChatGPT is its text generation ability: most answers are very human-like and informative. While not always fully correct, these answers usually point users at least in the right direction. In contrast to earlier chat "experiments" (read more here: In 2016, Microsoft's Racist Chatbot Revealed the Dangers of Online Conversation), OpenAI also managed to safeguard ChatGPT very well against toxic behaviour and personal attacks, making it suitable for conversations in a professional setting.
ChatGPT also showcases advanced text understanding over longer text passages and can follow slightly longer prompts without problems. This opens up a new field dubbed "prompt engineering": hand-crafting the text input that is fed to ChatGPT in order to address your current business problem without the need for expensive fine-tuning. As we will see later, this ability to perform "zero-shot" or "few-shot" inference sets it apart from most "traditional" ML approaches. It is also possible to combine ChatGPT with external tools such as Google Search or a calculator via custom "plugins", extending ChatGPT's capabilities into fields that (generative) language models previously could not cover.
Some limitations of ChatGPT
However, not all is well with ChatGPT. There are certain very important limitations that you should keep in mind when considering its adoption.
ChatGPT, like all other large language models, suffers from "hallucinations"
It will gladly return incorrect or even completely invented pieces of information, even if prompted not to do so. Famous examples include the generation of non-existent software package names, leading unsuspecting users to install compromised dependencies (read more here: Can you trust ChatGPT's package recommendations?). Another famous example was a user requesting papers for his upcoming research, only to find out that every given citation didn't exist (read more here: Hallucinations in ChatGPT: A Cautionary Tale for Biomedical Researchers). As a result, we advise always double-checking ChatGPT outputs, especially when reasoning about internal documents or language that relies heavily on customer-specific terms or abbreviations.
ChatGPT is (currently) trained on static data collected up to a specific cutoff date.
Information added to the Internet afterwards is not considered during training, making it more likely that ChatGPT will hallucinate about these events. It may be possible for vendors such as OpenAI to fine-tune ChatGPT with new knowledge at fixed time intervals, but this would be a costly effort. Thanks to the aforementioned plugin support, it may not even be required.
ChatGPT has a relatively short “context size”
Another drawback, especially with "older" variants such as ChatGPT-3.5, is its relatively short "context size" of 2048 tokens, roughly equal to 1-2 pages of DIN A4 text. Reasoning over longer documents such as contracts, books, newspaper articles or long prompts is thus not supported, with words beyond the context size simply being ignored. While newer models such as GPT-4 offer context lengths of up to 32,000 tokens and are better suited, they also come with a hefty price tag ($0.12 per 1k output tokens, compared to $0.002 for ChatGPT-3.5, a 60x increase) and may not be fully available publicly.
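As a quick sanity check before sending a document to the model, a rough character-based token estimate can flag prompts that will overflow the context window. This is a minimal sketch using the common "~4 characters per token" rule of thumb for English text; for billing-accurate counts you would use the model's actual tokenizer (e.g. OpenAI's tiktoken library).

```python
# Rough context-window check. The ~4 characters per token figure is a
# heuristic, not an exact tokenizer count.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate the number of tokens in `text`."""
    return max(1, round(len(text) / chars_per_token))

def fits_context(prompt: str, context_size: int = 2048,
                 reserved_for_answer: int = 512) -> bool:
    """Check whether a prompt still leaves room for the model's reply."""
    return estimate_tokens(prompt) + reserved_for_answer <= context_size

contract = "Lorem ipsum dolor sit amet. " * 1000  # a multi-page document
print(fits_context(contract))  # a long contract overflows the 2048-token window
```

Such a check is cheap to run on every request and avoids silently losing the beginning of a long document to truncation.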
ChatGPT will pick up (gender or racial) biases common to our everyday language use.
We previously highlighted ChatGPTs human-like behaviour as one of its main strengths. However, this also leads to a direct weakness of ChatGPT: Trained on huge amounts of human-generated material, ChatGPT will pick up (gender or racial) biases common to our everyday language use. A famous example for these biases is the implicit “doctors are male, nurses are female” bias (read more here: Hadas Kotek: Doctors can’t get pregnant and other gender biases in ChatGPT). This prohibits its usage for tasks such as applicant pre-screening or criminal prosecution.
ChatGPT won’t be feasible for scenarios operating on larger amounts of incoming samples, especially when you plan on self-hosting it to GDPR concerns.
Last but not least, we also need to take into account the slow and costly inference of models like ChatGPT. OpenAI most likely uses several hundred state-of-the-art GPUs to achieve roughly real-time inference. If you plan on deploying any of the open-source models such as Falcon, Pythia, etc., you will need to wait several minutes for a reply using a quantized model on CPU only, 2-3 minutes using a consumer-grade GPU, and around 20-30 seconds using the latest data center GPUs – all depending on prompt and response size.
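A quick back-of-envelope calculation makes the feasibility question concrete. The per-request latencies below are the rough figures from the paragraph above; the request volume is a made-up example, not a benchmark.

```python
# Back-of-envelope daily throughput for a single self-hosted model instance,
# processing requests one after another. Real numbers depend heavily on
# prompt/response length, batching and hardware.

SECONDS_PER_DAY = 24 * 60 * 60

latencies = {
    "quantized model, CPU only": 300,  # ~5 minutes per reply
    "consumer-grade GPU":        150,  # ~2.5 minutes per reply
    "data center GPU":            25,  # ~25 seconds per reply
}

for setup, seconds_per_request in latencies.items():
    requests_per_day = SECONDS_PER_DAY // seconds_per_request
    print(f"{setup}: ~{requests_per_day} requests/day")
```

Even the data-center GPU figure caps out at a few thousand sequential requests per day, so a team receiving, say, 10,000 customer e-mails daily would already need several instances running in parallel.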
And of course, there are some risks using ChatGPT
Having read up on the shortcomings of ChatGPT, it is also important to dive into the organisational risks associated with an adoption of ChatGPT in your business context.
ChatGPT comes with serious data privacy/data protection risks.
Firstly, companies operating in the EU are subject to the GDPR. As such, special care needs to be taken when operating on user-related data, and using ChatGPT on this data may be impossible. Currently, ChatGPT can be consumed directly from OpenAI (a US company) as well as via Microsoft Azure (also a US company), potentially requiring them to disclose data under the CLOUD Act. Even when using the Azure service based in West Europe, Azure will monitor all incoming ChatGPT requests as well as outputs (read more here: Data, privacy, and security for Azure OpenAI Service – Azure AI services). You can opt out, but only if you fulfil certain criteria: you need a managed Azure account, all users are registered, and ChatGPT output isn't fed back directly to your users. This obviously prohibits usage in most customer-facing operations.
ChatGPT usage currently comes with a hard vendor lock-in
Heavy usage of the currently available APIs may be costly, especially when using the GPT-4 variants or longer prompts for few-shot classification. With Aleph Alpha, a company based in Heidelberg, we may be looking at a potential European alternative in the near future.
ChatGPT also has an enormous ecological footprint.
It is estimated that training the older GPT-3 model cost around 5 million USD, plus running costs of 50–100k USD per day for inference. The carbon footprint of training has been estimated as roughly equivalent to the production of 320 Tesla cars. Deploying one such model for around-the-clock predictions will produce more emissions over a year than an average EU citizen (read more here: The Environmental Impact of ChatGPT: A Call for Sustainable Practices In AI Development | Earth.Org).
Differences to traditional Machine Learning Approaches
Based on our experience, ChatGPT is not the answer to everything, as some customers seem to believe. With all the drawbacks and risks highlighted above, we think companies should seriously consider "traditional" Machine Learning (ML) approaches as a viable alternative to ChatGPT. In a traditional ML setting, a "model" is trained on labelled ground-truth data. In the natural language context, this ground-truth data could, for example, be a pair of a customer e-mail and an associated label such as "order" or "service". Using hundreds of already available data points, we can fit a task-specific pipeline, such as a TF-IDF vectoriser turning the e-mail into a "Bag of Words" or keyword representation, followed by a simple Linear Support Vector Classifier (LinearSVC).
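Such a pipeline fits in a few lines of scikit-learn. The toy e-mails and labels below are invented for illustration; a real project would use hundreds of labelled historical e-mails.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data -- in practice, hundreds of labelled historical e-mails.
emails = [
    "I would like to order 5 units of product X, please send an invoice.",
    "Please add two more licenses to our existing order.",
    "My device stopped working after the last update, I need help.",
    "The product arrived damaged, how do I get it repaired?",
]
labels = ["order", "order", "service", "service"]

# TF-IDF keyword representation followed by a linear SVM classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
pipeline.fit(emails, labels)

# Classify a new, unseen e-mail.
print(pipeline.predict(["I want to place an order for product Y."]))
```

Training and inference for such a pipeline run in milliseconds on a laptop CPU, with no external API, which is exactly the cost/privacy profile the sections above found lacking in ChatGPT.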
Using ChatGPT, you would instead use either a "zero-shot" prompt ("Your task is to classify e-mails into two categories; category ORDER refers to a customer placing an order, and category SERVICE refers to a customer having issues with their product and requiring assistance. Which category do you assign to the following e-mail? …"), or a "few-shot" prompt that additionally contains a handful of examples for each category.
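Wrapped into code, such a zero-shot prompt is just string templating. The helper below is a hypothetical sketch of this step only; the actual API call that would send the prompt to a model is vendor-specific and deliberately left out.

```python
# Category names and descriptions, mirroring the zero-shot prompt above.
CATEGORIES = {
    "ORDER": "a customer placing an order",
    "SERVICE": "a customer having issues with their product and requiring assistance",
}

def build_zero_shot_prompt(email: str) -> str:
    """Assemble a zero-shot classification prompt for an incoming e-mail."""
    category_lines = "; ".join(
        f"category {name} refers to {description}"
        for name, description in CATEGORIES.items()
    )
    return (
        f"Your task is to classify e-mails into two categories; {category_lines}. "
        f"Which category do you assign to the following e-mail?\n\n{email}"
    )

prompt = build_zero_shot_prompt("My device stopped working, please help.")
print(prompt)
```

Note that the entire "training" effort here is writing this template, which is why zero-shot prompting is attractive when no labelled data exists yet.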
As shown by public research, trained traditional ML models will outperform or at least rival ChatGPT's accuracy in many natural language areas, with ChatGPT only having an advantage in around 22.5% of scenarios (read more here: ChatGPT Survey: Performance on NLP datasets). Note that this comparison was done only on benchmarks that fall into ChatGPT's "home turf": language understanding and reasoning. While possible with prompt engineering, ChatGPT is not made for forecasting or regression, tabular data or computer vision tasks.
However, obtaining this labelled data is in many cases one of the bigger obstacles for Data Scientists. An alternative here is to use ChatGPT as a "teacher" in a "student-teacher" setup: ChatGPT is tasked with labelling a handful of (existing or even completely made-up) customer requests, and a traditional ML pipeline is then trained on the generated data and labels, eliminating the need for ChatGPT down the line while keeping labelling costs low.
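The student-teacher workflow can be sketched as follows. `ask_chatgpt_for_label` is a hypothetical stand-in for a real API call to the teacher model; here it is stubbed with a trivial keyword rule so the sketch runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def ask_chatgpt_for_label(email: str) -> str:
    """Hypothetical teacher step: in a real setup this would send a
    zero-shot classification prompt to the ChatGPT API and parse the
    answer. Stubbed with a keyword rule to keep the sketch self-contained."""
    return "order" if "order" in email.lower() else "service"

# Unlabelled (or even generated) customer requests.
unlabelled_emails = [
    "I want to order three units of product X.",
    "Please send me an invoice for my order.",
    "The device broke down, I need a replacement.",
    "How do I reset my account password?",
]

# Teacher step: let the LLM produce the labels once.
teacher_labels = [ask_chatgpt_for_label(e) for e in unlabelled_emails]

# Student step: train a cheap traditional pipeline on the generated labels.
student = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
student.fit(unlabelled_emails, teacher_labels)

# The student now classifies new e-mails without any further LLM calls.
print(student.predict(["I would like to place an order."]))
```

The LLM cost is paid once, during labelling, while day-to-day inference runs on the cheap, self-hosted student model.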
Our conclusion: We see ChatGPT as a handy tool for Data Scientists
In general, ChatGPT is a great tool for generating training data and for establishing a quick baseline performance against which custom model implementations can be benchmarked. The hype around ChatGPT hasn't gone unnoticed by SAP either: they invested in Heidelberg startup Aleph Alpha, a German company developing language models said to rival the performance of OpenAI's models. This hopefully means that European companies will have an alternative to Azure/OpenAI, opening up usage with fewer data privacy concerns.
Do you also want to get started with ChatGPT? Then we can talk about your personal use case and find out together if and how Generative AI can help.