
Building Confidence in LLM Evaluation: My Experience Testing DeepEval on an Open Dataset

deepeval helped me uncover the real source of Beyoncé's depression

Serj Smorodinsky
Published in Towards AI
14 min read · Oct 26, 2024


Long Story Short

If you’re looking to catch LLM hallucinations — output that introduces information not present in the input — you can use DeepEval’s Faithfulness metric. I tested it on 100 samples from the SQuAD2 dataset and achieved 100% accuracy. This served as a crucial validation step, like a litmus test, before diving into more detailed analysis.
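The validation step above can be sketched in a few lines. This is a hypothetical illustration, not DeepEval's actual code: `toy_judge` stands in for the Faithfulness metric's verdict, and the samples mimic SQuAD2-style (context, answer, label) triples, where the label marks whether the answer is actually supported by the context.

```python
# Hypothetical sketch of the "litmus test": score a faithfulness judge
# against SQuAD2-style labeled samples and report its accuracy.

def accuracy(samples, judge):
    """samples: list of (context, answer, is_faithful) triples.
    judge: callable(context, answer) -> bool verdict."""
    correct = sum(judge(ctx, ans) == label for ctx, ans, label in samples)
    return correct / len(samples)

# Toy judge (assumption): call an answer faithful iff it appears
# verbatim in the context. DeepEval instead asks an LLM to decide.
def toy_judge(ctx, ans):
    return ans in ctx

samples = [
    ("Beyoncé rose to fame in the late 1990s.", "the late 1990s", True),
    ("Beyoncé rose to fame in the late 1990s.", "her divorce", False),
]
print(accuracy(samples, toy_judge))  # 1.0 on this toy set
```

An evaluator that cannot reach near-perfect accuracy on samples with known labels is not worth trusting on unlabeled production data, which is why I ran this check before any deeper analysis.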

I wanted to find out how accurate framework-based LLM evaluation actually is; in other words, I wanted to evaluate the LLM evaluation itself.

Join me through this journey to find out the ins and outs of LLM evaluation.

Down the rabbit hole!

Content

  1. The hardships and metrics of evaluating text
  2. Why use deepeval
  3. What is deepeval
  4. How to use deepeval for detecting hallucinations
  5. Evaluating deepeval (rabbit hole)
  6. How to run the evaluation on SQuAD2
  7. SQuAD2 evaluation results analysis
  8. Which metric to use?
  9. Summary
