
Building Confidence in LLM Evaluation: My Experience Testing DeepEval on an Open Dataset

`deepeval` helped me uncover the real source of Beyoncé’s depression

Serj Smorodinsky

Published in Towards AI

14 min read · Oct 26, 2024

Long Story Short

If you’re looking to catch LLM hallucinations (output that introduces information not present in the input), you can use DeepEval’s Faithfulness metric. I tested it on 100 samples from the SQuAD2 dataset, and it classified all of them correctly (100% accuracy). This served as a crucial validation step, a litmus test, before diving into more detailed analysis.

I wanted to find out how accurate a framework’s LLM evaluation really is. In other words, I wanted to evaluate LLM evaluation.

Join me on this journey through the ins and outs of LLM evaluation.
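To make the idea of "evaluating the evaluator" concrete, here is a minimal sketch of how a judge's accuracy could be measured against labeled samples. It is not the code used in this article: `toy_judge` is a hypothetical word-overlap stand-in for a real faithfulness metric, `judge_accuracy` is an illustrative helper name, and the samples are hand-written toy data, not SQuAD2 records.

```python
from typing import Callable, List, Tuple

# (context, model output, ground-truth label: is the output faithful?)
Sample = Tuple[str, str, bool]


def toy_judge(context: str, output: str) -> bool:
    """Hypothetical stand-in for a real faithfulness metric.

    Treats an output as faithful only if every word in it also
    appears in the context. A real judge would be far more nuanced.
    """
    context_words = set(context.lower().split())
    return all(word in context_words for word in output.lower().split())


def judge_accuracy(samples: List[Sample], judge: Callable[[str, str], bool]) -> float:
    """Fraction of samples where the judge's verdict matches the label."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    correct = sum(judge(context, output) == label for context, output, label in samples)
    return correct / len(samples)


# Toy, hand-written samples (assumption: NOT taken from SQuAD2).
samples: List[Sample] = [
    ("the cat sat on the mat", "the cat sat", True),
    ("the cat sat on the mat", "the dog sat", False),
    ("paris is the capital of france", "paris is the capital", True),
    ("paris is the capital of france", "berlin is the capital", False),
]

accuracy = judge_accuracy(samples, toy_judge)
print(f"judge accuracy: {accuracy:.0%}")
```

In practice the `judge` callable would wrap a real metric, such as DeepEval's Faithfulness metric with a pass/fail threshold, and the labels would come from a dataset where it is known whether the context supports the answer.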

Content

Hardship and metrics of evaluating text
Why use deepeval
What is deepeval
How to use deepeval for detecting hallucinations?
Evaluating deepeval (Rabbit hole)
How to run evaluation on SQUAD2
SQUAD2 evaluation results analysis
Which metric to use?
Summary
