Using GenAI to kickstart your traditional Natural Language Processing Models
Traditionally, automating business processes with AI required training a custom machine learning model on a substantial amount of task-specific training data. In many cases, companies can use the advantages of Generative AI to tackle this challenge directly. But in cases with regulatory requirements such as GDPR, GenAI is often not an option. In this blog post, Niklas Frühauf, Senior Data Scientist at sovanta, shows how GenAI solutions such as ChatGPT or Aleph Alpha can still empower the development of task-specific traditional machine learning models – even when facing GDPR regulations – allowing companies to create a fast and affordable proof-of-concept (PoC).
In the past, most – if not all – AI approaches required training data specific to the task at hand. Depending on the challenge, this could range from a handful of samples (e.g. for a very simple classification of tickets into a few “queues”) to thousands of samples (e.g. for entity recognition in legal documents, or sample images for image segmentation) to “solve” the problem adequately. In some cases, we would get lucky and could use existing data (e.g. fetching tickets already assigned to queues manually by users in Jira). In most cases, however, creating such a dataset requires a handful of experts to manually annotate the collected data, creating a high up-front investment for AI proofs-of-concept. This is even more true in language processing, where documents in many different languages often need to be labeled, requiring translations and/or multilingual inputs – and widespread data quality issues complicate matters further.
Generate Training Data for a Proof-of-Concept
On top of that, many stakeholders want a quick and cheap proof-of-concept before committing large budgets to a fully integrated AI solution. In the past, this meant that many AI ideas died before the PoC phase, as customers struggled to secure funding for the initial creation of a ground-truth dataset for their business problem. Even with data readily available, first results are sometimes expected within an extremely short time frame, putting pressure on Data Scientists to rush a first working algorithm through development.
However, it is important to note that using GenAI in these use cases comes with drawbacks. Depending on the GenAI vendor, there are substantial privacy concerns that rule it out for GDPR-relevant use cases and many corporate settings (e.g. Azure OpenAI monitors all inputs and outputs by default). GenAI is also not feasible for extremely high workloads, due to processing time and the costs involved, or for workloads where availability is critical. Finally, it lacks explainability and is therefore poorly suited for use cases where model audits are required. While you may be able to prompt your GenAI solution to output its “chain-of-thought” reasoning, or look at the attention given to each token of your input, explainability is much more easily achieved with traditional models. Some examples include linear/logistic regression (where we can analyze the coefficients) and decision trees & forests (where we can visualize the decision paths or look at feature importances), but also SHAP values and LIME explanations, which are applicable to a wide variety of models.
Despite these limitations, GenAI can still help you implement and optimize your custom, task-specific machine learning model, as we describe in the following.
GenAI as “quick & dirty” Baseline at Project Start
For most language-based tasks, GenAI can be used to establish a quick baseline score. If you have a handful of ground-truth samples available, you can treat all of them as a “test” set and use zero-shot (or few-shot) prompts to gauge how well GenAI performs on them. Even in the total absence of data, you could prompt a more complex GenAI model (e.g. GPT-4 with an 8k context window or more) to generate some synthetic data, and then set a baseline score using a very simple traditional AI model or a less complex LLM. While this technically requires neither a Data Scientist (knowing how to prompt properly is enough) nor complex model pipelines and dependencies, we highly advise having a Data Scientist set up a proper model evaluation workflow with a useful metric in order to prevent data leakage, enabling you to compare the GenAI baseline against other models (including task-specific models) in the future.
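Such an evaluation workflow can stay very small. The sketch below treats all available labeled samples as a test set and scores any classifier behind a common interface; `genai_classify`, the sample tickets, and the queue names are invented stand-ins – in a real setup the function would send the text to your LLM and parse the response:

```python
# Hypothetical labeled samples acting as our "test" set
test_set = [
    ("The invoice from March is still unpaid", "billing"),
    ("I forgot my password and cannot log in", "it-support"),
    ("Please refund the double charge", "billing"),
]

def genai_classify(text: str) -> str:
    """Stand-in for a zero-/few-shot GenAI prompt call.
    A real implementation would send `text` to an LLM and
    parse the predicted queue from the response."""
    # Crude keyword heuristic so the sketch runs offline
    return "billing" if "invoice" in text.lower() or "refund" in text.lower() else "it-support"

def evaluate(classify, samples):
    """Score any classifier against the same held-out samples, so the
    GenAI baseline and later task-specific models stay comparable."""
    correct = sum(classify(text) == label for text, label in samples)
    return correct / len(samples)

print(f"GenAI baseline accuracy: {evaluate(genai_classify, test_set):.2f}")
```

Because `evaluate` only depends on the `classify` callable, the same harness later scores your task-specific model against the identical samples and metric.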
Even if GenAI is not an option due to the aforementioned privacy or security concerns, you may still be able to get a very quick baseline on a handful of manually anonymised documents – and if GenAI doesn’t handle the task well, that may be an early stopping signal for your project. On the other hand, if GenAI delivers outstanding performance, there is a high chance a task-specific model (without privacy issues) will also work for your use case. In any case, this allows you to quickly set realistic stakeholder expectations for further project stages.
How to: Training Data Generation, Enhancement & Preprocessing
When it comes to data generation (or enhancement), there is a wide range of possibilities. In the absence of previous data, GenAI can create training data from scratch, leveraging its zero-shot abilities.
It is easy to “ground” these prompts if you have a handful of samples available and simply want to extend the dataset. If privacy is an issue, you can manually anonymise these samples before including them in a few-shot prompt. To diversify the obtained results, you can also leverage a simple trick: send multiple few-shot generation requests, modifying the included samples each time.
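This sampling trick can be sketched in a few lines; the seed tickets and the prompt wording below are invented for illustration, and the returned string would be sent to whichever GenAI model you use:

```python
import random

# A handful of (manually anonymised) seed samples; invented for illustration
seed_samples = [
    "Customer asks for a refund of order #1234",
    "User cannot reset their password",
    "Invoice amount does not match the contract",
    "Login page shows an error after update",
    "Payment was charged twice this month",
]

def build_fewshot_prompt(samples, k=2, seed=None):
    """Build a generation prompt embedding a different random subset
    of the seed samples each time, to diversify the generated data."""
    rng = random.Random(seed)
    shots = "\n".join(f"- {s}" for s in rng.sample(samples, k))
    return (
        "Generate 5 new, realistic support tickets similar to these examples:\n"
        f"{shots}\n"
        "Return one ticket per line."
    )

# Each generation request gets a differently grounded prompt
prompts = [build_fewshot_prompt(seed_samples, seed=i) for i in range(3)]
```

Varying the embedded examples per request keeps the generated dataset from collapsing onto the style of one or two seed samples.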
In some recent local LLM runtimes such as llama.cpp, you can even force your LLM to output JSON following a specific grammar, making it much easier to reliably parse the outputs! This highlights another area where GenAI can support data enhancement: we can use it to “impute” missing values in text corpora, such as the correct start and end positions of our entities. The same idea supports data cleansing: assume you have output from an OCR service containing garbled words with incorrectly identified characters – GenAI can help you convert them to a correct representation. Obviously, using GenAI within a prediction pipeline requires you to consider the aforementioned data privacy concerns – though where that is permissible, GenAI will often be able to solve the issue at hand directly, without these “manual” correction steps.
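Imputing entity offsets, for instance, often needs only a small deterministic step once GenAI has extracted the entity strings themselves. A minimal sketch (the document and entities are invented) recovers the character spans a traditional NER model would be trained on:

```python
def impute_entity_spans(text: str, entities: list[str]) -> list[dict]:
    """Given a document and the entity strings a GenAI model extracted,
    recover the character start/end offsets needed for training a
    traditional NER model."""
    spans = []
    for entity in entities:
        start = text.find(entity)
        if start == -1:
            continue  # not found verbatim; would need fuzzy matching
        spans.append({"text": entity, "start": start, "end": start + len(entity)})
    return spans

doc = "The contract between ACME Corp and John Doe starts on 2024-01-01."
print(impute_entity_spans(doc, ["ACME Corp", "John Doe", "2024-01-01"]))
```

Note that `str.find` only locates the first verbatim occurrence; repeated or OCR-garbled entities would need the fuzzy matching hinted at in the comment.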
Do you have to do without the advantages of GenAI?
In those cases where sending productive data to GenAI is ruled out (be it due to cost, latency, availability or privacy concerns), a solution can be to train a task-specific traditional AI model on data generated (partly) by a GenAI model. Generating this data usually just requires a prompt describing the desired outputs; in some cases you may want to enrich it with a few existing (anonymised) examples from your own data source, as this few-shot prompting greatly increases data quality in our experience.
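Once such a synthetic dataset exists, training the task-specific model follows the usual path. A minimal sketch with scikit-learn, where the labeled tickets are invented stand-ins for real GenAI output:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in for GenAI-generated training data: (label, text) pairs
generated_data = [
    ("billing", "My invoice shows the wrong amount"),
    ("billing", "I was charged twice for one order"),
    ("billing", "Please send me a corrected invoice"),
    ("it-support", "The app crashes when I open settings"),
    ("it-support", "I cannot reset my password"),
    ("it-support", "Login fails with an unknown error"),
]
labels, texts = zip(*generated_data)

# A small, fully self-hosted model: at inference time, no productive
# data ever has to leave your own systems.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["The invoice total is incorrect"]))
```

Only the training data involves GenAI; the deployed model is a plain, auditable classifier that runs entirely on your own infrastructure.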
So there is no reason not to take advantage of GenAI: there are many workarounds, and creative prompting opens new doors. Our tip: just give it a try. And if you have any questions, we are always happy to help.