Evals & Human Feedback

Evals in RAG-in-a-Box let you measure, compare, and improve how different models perform on your own data — before you deploy them.

They’re built for experimentation and collaboration, helping you find the best model and configuration through both automated testing and community feedback.

🎯 What Are Evals?

Evals (short for evaluations) are structured experiments that allow you to compare and benchmark model performance on your data.

You can use evals to:

  • Compare open-source and hosted models side-by-side

  • Measure how prompt or retrieval changes affect results

  • Validate whether model outputs are accurate and grounded in your data

  • Collect structured human feedback from reviewers and contributors

Evals give you the confidence to pick the right model setup — backed by real results and metrics.
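To make this concrete, the sketch below models an eval definition as a plain Python data structure: a task, the models to compare, a dataset, and the metrics to collect. The field names and example values are illustrative assumptions, not the RAG-in-a-Box schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalDefinition:
    """Illustrative shape of an eval: a task, the models to compare,
    the dataset to run against, and the metrics to collect."""
    task: str                      # what you want to test, in plain language
    models: list[str]              # model identifiers to compare side-by-side
    dataset: str                   # dataset or document collection to run against
    metrics: list[str] = field(default_factory=lambda: ["relevance", "accuracy"])

# Example: comparing two models on a support-docs Q&A task
faq_eval = EvalDefinition(
    task="Answer FAQs from our support docs",
    models=["gpt-4o", "llama-3-70b"],
    dataset="support-docs-v2",
)
```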


Community Evals and Voting

RAG-in-a-Box extends traditional model evaluation with a community-driven feedback loop.

Through Community Evals, AI builders can publish evaluation tasks that others can participate in. Contributors log in via vote.joinable.ai to:

  • Review model responses side-by-side

  • Vote on which output is better or more accurate

  • Leave qualitative feedback or comments

  • Help surface model strengths and weaknesses

This makes evaluation social, scalable, and transparent — enabling more robust and unbiased performance insights.

All contributor actions are tracked and versioned, making it easy to reproduce evaluations or identify trends over time.
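As a rough illustration of what a tracked, versioned contributor action could look like, the sketch below models a single vote as a record: which eval, which pair of outputs was compared, the preference, an optional comment, and a timestamp. The field names are assumptions for illustration, not the actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class VoteRecord:
    """Illustrative shape of one tracked contributor action."""
    eval_id: str
    contributor_id: str
    output_a: str          # identifier of the first model output shown
    output_b: str          # identifier of the second model output shown
    preferred: str         # "a" or "b"
    comment: str | None    # optional qualitative feedback
    created_at: datetime   # timestamp used for versioning and trend analysis

vote = VoteRecord(
    eval_id="faq-eval-01",
    contributor_id="reviewer-42",
    output_a="gpt-4o:resp-17",
    output_b="llama-3-70b:resp-17",
    preferred="a",
    comment="More grounded in the source doc",
    created_at=datetime.now(timezone.utc),
)
```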


How Evals Work in RAG-in-a-Box

  1. Define your evaluation task. Describe what you want to test — for example, “Summarize customer feedback reports” or “Answer FAQs from our support docs.”

  2. Upload your dataset. Choose which dataset or document collection you want the eval to run against. Use curated samples or real production data.

  3. Select your models. Your data will be evaluated against a range of LLMs. See the full list here.

  4. Launch the eval task. RAG-in-a-Box executes the task across all selected models, stores the outputs, and organizes them for review.

  5. Invite contributors via the Community Voting Dashboard. Open your eval to collaborators — teammates, experts, or community contributors — to review and score results. Contributors can access the Community Voting Dashboard at vote.joinable.ai to vote, compare, and rate model outputs directly. This creates a crowdsourced signal that helps identify which model performs best in real-world contexts.

  6. Analyze and iterate. Review metrics, human scores, and voting outcomes to refine prompts, retrieval logic, or model selection before deployment. A sketch of this end-to-end flow appears below.
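Here is a minimal script-style sketch that mirrors the six steps above. The generate helper and the example model and dataset names are placeholders, not the RAG-in-a-Box API.

```python
# A minimal sketch mirroring the six steps above. Function names and
# return values are placeholders for illustration, not the RAG-in-a-Box API.

def generate(model: str, task: str, dataset: str) -> str:
    """Placeholder for running one model on the eval task over the dataset."""
    return f"<{model} output for {task!r} on {dataset}>"

# 1. Define the evaluation task
task = "Answer FAQs from our support docs"

# 2. Choose the dataset or document collection to run against
dataset = "support-docs-v2"

# 3. Select the models to compare side-by-side
models = ["gpt-4o", "llama-3-70b"]

# 4. Launch the eval: collect and organize one output per model
outputs = {model: generate(model, task, dataset) for model in models}

# 5. Invite contributors to review and vote on the collected outputs
print("Review and vote at vote.joinable.ai")

# 6. Analyze and iterate: inspect outputs, scores, and votes before deploying
for model, output in outputs.items():
    print(f"{model}: {output}")
```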

What You Can Measure

Each eval in RAG-in-a-Box can include both automated metrics and community scoring.

  • Relevance: How closely the answer aligns with the source data

  • Accuracy: Factual correctness and grounding

  • Completeness: Coverage of the full query or problem

  • Clarity: Coherence, readability, and structure

  • Preference: Community-voted best output between models

By combining quantitative metrics with human feedback, RAG-in-a-Box provides a 360° view of model performance.
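As an illustration of how automated metrics and community preference could be blended into a single comparison, the snippet below computes a simple weighted score per model. The example numbers, the 70/30 weighting, and the formula itself are assumptions for illustration, not RAG-in-a-Box's actual scoring.

```python
# Illustrative only: blend automated metric scores (0-1) with each model's
# share of community preference votes. Weights and values are arbitrary.

metric_scores = {
    "gpt-4o":      {"relevance": 0.91, "accuracy": 0.88, "completeness": 0.84, "clarity": 0.90},
    "llama-3-70b": {"relevance": 0.87, "accuracy": 0.90, "completeness": 0.80, "clarity": 0.85},
}
preference_votes = {"gpt-4o": 64, "llama-3-70b": 36}   # community votes per model

total_votes = sum(preference_votes.values())

def combined_score(model: str, metric_weight: float = 0.7) -> float:
    """Weighted blend of the average automated metric and the vote share."""
    avg_metric = sum(metric_scores[model].values()) / len(metric_scores[model])
    vote_share = preference_votes[model] / total_votes
    return metric_weight * avg_metric + (1 - metric_weight) * vote_share

for model in metric_scores:
    print(f"{model}: {combined_score(model):.3f}")
```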


🔄 From Evals to Production

Evals are not the end — they’re part of an iterative loop in RAG-in-a-Box:

  1. Prototype your app on your data

  2. Run evals to benchmark performance

  3. Invite the community to review and vote

  4. Refine and retrain based on insights

  5. Deploy your best-performing model confidently

This loop helps ensure that every deployment is data-driven, community-validated, and continuously improving.
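One lightweight way to close this loop in practice is to gate deployment on eval results. The thresholds and helper below are hypothetical, shown only to illustrate the idea.

```python
# Hypothetical gate: only promote a model when it clears minimum eval scores.
THRESHOLDS = {"accuracy": 0.85, "relevance": 0.85}

def ready_to_deploy(scores: dict[str, float]) -> bool:
    """True only if every gated metric meets or beats its threshold."""
    return all(scores.get(metric, 0.0) >= minimum for metric, minimum in THRESHOLDS.items())

candidate = {"accuracy": 0.88, "relevance": 0.91, "clarity": 0.90}
print("Deploy" if ready_to_deploy(candidate) else "Keep iterating")
```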
