Mastering AI Output Evaluation – Insights on Challenges and Strategies in the Era of Generative AI

As artificial intelligence (AI), and particularly generative AI (gen AI), rapidly reshapes industries and decision-making processes, the challenge of evaluating AI-generated outputs has become increasingly critical. In this in-depth interview, Venkat Gopalakrishnan, a seasoned AI and cybersecurity expert, explores the multifaceted challenges of assessing AI-generated content, from the technical hurdles in measuring accuracy to the nuanced impact of human subjectivity. With a background that includes leading AI initiatives at major tech companies and holding patents in AI and cybersecurity, Gopalakrishnan dives into the limitations of traditional metrics, the role of human feedback, and the ethical considerations that must be at the forefront of AI deployment. As organizations increasingly rely on AI for critical tasks, his perspectives on developing robust evaluation frameworks and preparing for the future of AI assessment are both timely and essential. Whether you’re a tech leader, a data scientist, or simply interested in the future of AI, this interview offers crucial insights into ensuring the reliability and effectiveness of AI systems in an increasingly AI-driven world.

Venkat, what are some of the main challenges in evaluating the quality of AI-generated outputs?

Let’s split this into AI and gen AI, look at the question from a more technical angle, and then relate it to real-world and business angles.

AI is predominantly used to predict what happens next (basically, to predict the next data point “intelligently”, given the current set of historical data points). When you make a prediction, there is always an error component attached to it. The better your model, data, preprocessing, and understanding of the events around the data you are predicting, the better your prediction accuracy and the lower your prediction error. Converting that mathematical/statistical error component into product and business tradeoffs, understanding what it means to your product’s customers, and adjusting your business strategies accordingly is the most critical and challenging part.
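
As a minimal illustration of that error component, the sketch below computes two common error metrics on a toy demand forecast; the numbers and the tolerance threshold are hypothetical, and translating the resulting figures into inventory or pricing decisions is exactly the business step described above.

```python
# Minimal sketch: quantifying prediction error on a toy demand forecast.
# The data and the tolerance threshold are hypothetical, for illustration only.

actual    = [12, 15, 9, 20, 14]   # units actually sold each day
predicted = [10, 16, 11, 18, 15]  # what the model forecast for those days

n = len(actual)
mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
rmse = (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

print(f"MAE:  {mae:.2f} units")   # average miss per day
print(f"RMSE: {rmse:.2f} units")  # penalizes large misses more heavily

# The hard part is the business translation: an average miss of ~2 units may be
# fine for milk cartons but unacceptable for, say, surgical supplies.
if mae > 3:  # hypothetical tolerance set by the business, not by the model
    print("Forecast error exceeds the inventory buffer planned for.")
```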

Now, let’s talk about gen AI, which also predicts the next data point. The data points here are called tokens. A token can be a word, a set of words, a character, etc., depending on the large language model’s training method and data. This creates a “many-to-many” problem: LLMs can have multiple valid versions of both the questions and the answers. For example, for the question “how are you?”, there can be 1,000 possible right answers, and there are 1,000 ways to ask “how are you?” in the first place. How would you evaluate which one is better than which? And, in that process, how do you find out whether the right-hand side of the equation (the AI response) is a result of hallucination, where the LLM completely misses the context of the question, picks up a wrong thread/token/word, and goes on a ramble? Once that question is answered, how do we do it consistently, so we can template the evaluation strategies and reuse them across different scenarios? These are just some of the questions that we ask ourselves while constructing evaluation strategies.
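
To make the “many-to-many” problem concrete, here is a small sketch that scores one candidate reply against several equally valid reference answers and flags a likely off-topic or hallucinated reply when even the best match is weak. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, both illustrative choices rather than anything prescribed in the interview; the 0.4 threshold is likewise hypothetical.

```python
# Minimal sketch: score one candidate against many valid references by
# embedding similarity, and flag likely off-topic replies.
# Assumes `pip install sentence-transformers`; model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question   = "How are you?"
references = ["I am fine.", "Not bad, can't complain.", "Life is good."]
candidate  = "Getting better, thanks for asking."

ref_emb  = model.encode(references, convert_to_tensor=True)
cand_emb = model.encode(candidate, convert_to_tensor=True)

# Best match against ANY acceptable reference, since many answers are valid.
best_score = util.cos_sim(cand_emb, ref_emb).max().item()
print(f"Best semantic similarity: {best_score:.2f}")

# Hypothetical threshold; in practice it is tuned per use case.
if best_score < 0.4:
    print("Low similarity to every reference - possible off-topic or hallucinated reply.")
```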

How does the subjectivity of human perception impact the assessment of AI-generated content?

It is safe to assume that when we say AI these days, we are talking about large language models and gen AI, so let’s dive in there.

In human terms, evaluating response accuracy is just one piece of the puzzle. We evaluate the factuality of the response, we detect the emotions attached to it, and we notice whether it is creative, whether it reflects a particular culture, or whether it carries a certain tone. If we know the person, we can even understand the response at a very personalized level.

Take the previous example question: “how are you?”. “I am fine”, “Not bad, can’t complain”, “Life is good”, and “Getting better” are all valid answers to that question, but a human can detect much more than just the “accuracy” of each of these responses.

In short, humans evaluate the response very subjectively, and this introduces a complex variable to an already complex question.  

Can you give an example of how AI is being used to automatically summarize information, and what complications arise in evaluating these summaries?

Let’s consider a simple accuracy-related scenario. I say simple because detecting accuracy is often more straightforward compared to evaluating other factors that require subjective judgment.

Imagine you’re a small business or retail shop owner, with all your daily sales data stored in a database. At the end of the day, you want to check how many cartons of milk you sold, so you can adjust your inventory. You ask a gen AI, ‘How many cartons of milk did I sell today?’ The AI needs to summarize your sales data and respond with something like, ‘You sold 2 cartons at 10 AM, 3 at 12 PM, and 10 at 5 PM, for a total of 15 cartons today. You should restock 15 cartons for tomorrow to meet demand.’

Given the complexity of how gen AI processes data, including your query and the sales information, a wrong answer could lead to overstocking milk or not stocking enough. This is a simple retail example, but imagine using AI to summarize a patient’s medical report or a legal document. How do we ensure the response is accurate, especially when errors can directly impact human lives? These are the challenges we face when evaluating AI-generated responses in critical domains.
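
One pragmatic check for the milk-carton scenario above is to verify every number in the generated summary against the system of record before acting on it. The sketch below is illustrative only: the table schema, sample data, and the naive regex extraction are assumptions, not part of any particular product.

```python
# Sketch: verify a gen AI sales summary against the database it summarized.
# Table name, schema, and sample data are hypothetical.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER, sold_at TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("milk", 2, "10:00"), ("milk", 3, "12:00"), ("milk", 10, "17:00")],
)

ai_summary = ("You sold 2 cartons at 10 AM, 3 at 12 PM, and 10 at 5 PM, "
              "for a total of 15 cartons today.")

# Ground truth straight from the system of record.
(actual_total,) = conn.execute(
    "SELECT SUM(qty) FROM sales WHERE item = 'milk'"
).fetchone()

# Naive extraction of the claimed total; a real system would be more robust.
match = re.search(r"total of (\d+)", ai_summary)
claimed_total = int(match.group(1)) if match else None

if claimed_total == actual_total:
    print(f"Summary verified: {actual_total} cartons.")
else:
    print(f"Mismatch: AI claimed {claimed_total}, database says {actual_total}.")
```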

What are some traditional metrics used in natural language processing, and why might they fall short when evaluating generative AI outputs?

Traditional metrics do have their limitations, but they’re still really useful and, honestly, they’re pretty much all we have right now. When it comes to figuring out how accurate a response is, methods like BLEU, ROUGE, and various semantic scoring techniques are still commonly used. Sometimes, we use NLP to preprocess both the input and the output, which helps in comparing them. For example, we can summarize two documents and then see how similar they are.
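
As a rough sketch of those reference-based metrics, the example below scores a generated summary against a reference answer with BLEU and ROUGE-L. It assumes the nltk and rouge-score packages, which are common but illustrative choices; note how both metrics reward surface overlap, which is part of the limitation discussed here.

```python
# Sketch: classic reference-based scoring of a generated summary.
# Assumes `pip install nltk rouge-score`; library choices are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "You sold 15 cartons of milk today and should restock 15 for tomorrow."
candidate = "Total milk sold today was 15 cartons; restock 15 cartons tomorrow."

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L uses the longest common subsequence, a common choice for summaries.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU:    {bleu:.2f}")
print(f"ROUGE-L: {rouge_l:.2f}")
# A faithful paraphrase can still score low, since both reward surface overlap.
```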

We can also use gen AI to evaluate other gen AI outputs, but we have to keep in mind that running these models isn’t cheap. So, while traditional methods are helpful, they don’t cover everything, especially since the evaluation process can be quite subjective. There’s no silver bullet solution yet, but the field is evolving quickly, and hopefully, we’ll have more tools and methods to tackle these issues soon.
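
The “gen AI evaluating gen AI” idea can be sketched as an LLM-as-judge call. The example below assumes the OpenAI Python SDK and a particular model name purely for illustration; any LLM endpoint would do, and the rubric in the prompt is a placeholder rather than a recommended standard.

```python
# Sketch of an "LLM as judge" evaluation. The SDK, model name, and rubric
# are assumptions chosen for illustration; any LLM backend would do.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "You are an evaluation assistant. Rate the ANSWER to the QUESTION on a "
        "1-5 scale for each of: accuracy, relevance, tone. Then state in one "
        "sentence whether the answer appears to hallucinate.\n\n"
        f"QUESTION: {question}\n"
        f"ANSWER: {answer}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return response.choices[0].message.content

print(judge("How many cartons of milk did I sell today?",
            "You sold 15 cartons today; restock 15 for tomorrow."))
```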

How are organizations approaching the evaluation of AI-generated content? Are there any frameworks or strategies that seem to be effective, and what role does human feedback play in assessing and improving AI outputs?

You chose the right words – frameworks and strategies. I would combine them and call it a “framework of strategies”. But, at a very high level, given the “many-to-many” complexity, I personally try to categorize the evaluation into a few major categories and then find methodologies (AI, non-AI, gen AI) to evaluate the gen AI response.

The major categories for evaluation are accuracy, privacy, sensitivity, sentiment, bias, trust, etc. The above-mentioned strategies like BLEU, ROUGE, and semantic similarity would go under the accuracy category.
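
To show how such categories might be wired together, here is a minimal sketch of a registry that maps each evaluation category to a pluggable scorer. The category names follow the interview; the scorer functions themselves are toy placeholders where BLEU/ROUGE, semantic similarity, or privacy classifiers would plug in.

```python
# Sketch: a registry that maps evaluation categories to pluggable scorers.
# Category names follow the interview; the scorers here are toy placeholders.
from typing import Callable, Dict

Scorer = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]

def accuracy_scorer(prompt: str, response: str) -> float:
    return 1.0 if "15" in response else 0.0           # placeholder check

def sensitivity_scorer(prompt: str, response: str) -> float:
    banned = {"ssn", "password"}                      # placeholder word list
    return 0.0 if any(w in response.lower() for w in banned) else 1.0

EVALUATORS: Dict[str, Scorer] = {
    "accuracy":    accuracy_scorer,     # BLEU/ROUGE/semantic scorers plug in here
    "sensitivity": sensitivity_scorer,  # privacy, bias, trust scorers plug in likewise
}

def evaluate(prompt: str, response: str) -> Dict[str, float]:
    return {name: scorer(prompt, response) for name, scorer in EVALUATORS.items()}

print(evaluate("How many cartons of milk did I sell today?",
               "You sold 15 cartons today."))
```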

Humans play a crucial role in any AI process, including with generative AI. Whether it’s evaluating responses, developing new strategies, or gathering customer feedback, human input is invaluable. We can now have customers as part of the evaluation strategy too; at the end of the day, they are the best judges of this. Keeping customers engaged and involved throughout their journey is essential. It’s our responsibility to work with users and incorporate their feedback into our product (and testing strategies) to enhance and personalize their experience.

What steps can organizations take to establish a robust evaluation process for their AI outputs? Looking ahead, how do you think the methods for evaluating AI-generated content might evolve as the technology continues to advance?

When building evaluation strategies, I personally take several steps to ensure that the evaluation process remains robust, agile, and versatile. First, I focus on building a scalable testing framework that can accommodate multiple testing strategies. There’s no one-size-fits-all approach, so it’s crucial to use a combination of strategies and metrics tailored to your business and customer needs. Developing a transparent, scalable, and iterative evaluation environment is key, allowing for continuous human feedback and refinement of strategies.

Before presenting responses to real-world users, it’s paramount to implement guardrails that address empathy, sensitivity, and trust. These metrics help mitigate bias, ethical concerns, and privacy issues.
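
As one illustration of such guardrails, the sketch below runs a couple of lightweight checks (a PII-style pattern match and a toy blocklist) before a response is released to users. Every pattern here is a placeholder; production systems rely on far richer classifiers for privacy, bias, and sensitivity.

```python
# Sketch: lightweight pre-release guardrails run before a response is shown.
# The regexes and blocklist are placeholders; production systems use
# dedicated classifiers for PII, bias, and safety.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like pattern
    re.compile(r"\b\d{16}\b"),              # bare 16-digit card-like number
]
BLOCKLIST = {"idiot", "stupid"}             # toy sensitivity list

def passes_guardrails(response: str) -> bool:
    if any(p.search(response) for p in PII_PATTERNS):
        return False                        # possible privacy leak
    if any(word in response.lower() for word in BLOCKLIST):
        return False                        # possible tone/sensitivity issue
    return True

reply = "You sold 15 cartons today; restock 15 for tomorrow."
print("Safe to show user:", passes_guardrails(reply))
```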

Looking ahead, from a business perspective, I believe AI evaluation will become a mainstream business in its own right. This will lead to more advanced metrics that better understand response context and incorporate the necessary human insights. Ethical and fairness metrics are likely to become central and mandatory (if they aren’t already), and we can expect increased automation in real-time evaluations, though the pace will be influenced by factors like cost and hardware availability.

Are there any ethical considerations that need to be taken into account when deploying and evaluating generative AI systems? And, for companies just starting to use generative AI, what advice would you give them about evaluating and improving their AI outputs?

Ethical considerations are a must, for various reasons ranging from legal requirements to fundamental human-centric principles. I personally believe that everyone in the industry already views ethics as a mandate, not a choice (while building AI, or for that matter, any product). That’s one reason I talked about the guardrail metrics earlier. Evaluating your AI for bias (racial, gender, etc.), privacy and data security (protecting customer information), and trust (trust in privacy and in accountability) is something you should do before you take your product to public preview.

For companies starting with gen AI, my suggestion is to build your evaluation strategies around not just your product but also your customers. Understand that you will need not one metric but a multi-metric approach while building a gen AI evaluation system. Engage with customers as much as possible and collect feedback, since it helps with continuous learning. Invest in accountability, and also focus on simplicity. At the end of the day, AI should be explainable and transparent.
