Building Confidence in LLM Evaluation: My Experience Testing DeepEval on an Open Dataset
deepeval helped me uncover the real source of Beyonce’s depression
Long Story Short
If you’re looking to catch LLM hallucinations — output that introduces information not present in the input — you can use DeepEval’s Faithfulness metric. I tested it on 100 samples from the SQuAD2 dataset and achieved 100% accuracy. This served as a crucial validation step, like a litmus test, before diving into more detailed analysis.
I wanted to find out how accurate a framework's LLM evaluation really is. In other words, I wanted to evaluate the LLM evaluation itself.
Join me on this journey to learn the ins and outs of LLM evaluation.

Content
- Hardship and metrics of evaluating text
- Why use deepeval
- What is deepeval
- How to use deepeval for detecting hallucinations?
- Evaluating deepeval (Rabbit hole)
- How to run evaluation on SQUAD2
- SQUAD2 evaluation results analysis
- Which metric to use?
- Summary