Evals & Human Feedback
Evals in RAG-in-a-Box let you measure, compare, and improve how different models perform on your own data — before you deploy them.
They’re built for experimentation and collaboration, helping you find the best model and configuration through both automated testing and community feedback.
🎯 What Are Evals?
Evals (short for evaluations) are structured experiments that allow you to compare and benchmark model performance on your data.
You can use evals to:
Compare open-source and hosted models side-by-side
Measure how prompt or retrieval changes affect results
Validate whether model outputs are accurate and grounded in your data
Collect structured human feedback from reviewers and contributors
Evals give you the confidence to pick the right model setup — backed by real results and metrics.
Community Evals and Voting
RAG-in-a-Box extends traditional model evaluation with a community-driven feedback loop.
Through Community Evals, AI builders can publish evaluation tasks that others can participate in. Contributors log in via vote.joinable.ai to:
Review model responses side-by-side
Vote on which output is better or more accurate
Leave qualitative feedback or comments
Help surface model strengths and weaknesses
This makes evaluation social, scalable, and transparent — enabling more robust and unbiased performance insights.
All contributor actions are tracked and versioned, making it easy to reproduce evaluations or identify trends over time.
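To make the tracked, versioned voting concrete, here is a minimal sketch of how pairwise preference votes could be aggregated into a per-model win rate. The `Vote` record and its field names (`contributor_id`, `model_a`, `model_b`, `winner`) are hypothetical illustrations for this page, not the product's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Vote:
    # Hypothetical vote record; field names are illustrative, not the product schema.
    contributor_id: str
    model_a: str
    model_b: str
    winner: str          # must equal model_a or model_b
    comment: str = ""
    timestamp: str = ""  # e.g. ISO 8601, useful for reproducing an eval later

def win_rates(votes: list[Vote]) -> dict[str, float]:
    """Aggregate pairwise preference votes into a win rate per model."""
    wins: dict[str, int] = defaultdict(int)
    appearances: dict[str, int] = defaultdict(int)
    for v in votes:
        appearances[v.model_a] += 1
        appearances[v.model_b] += 1
        wins[v.winner] += 1
    return {m: wins[m] / appearances[m] for m in appearances}

votes = [
    Vote("alice", "model-x", "model-y", winner="model-x"),
    Vote("bob",   "model-x", "model-y", winner="model-y"),
    Vote("carol", "model-x", "model-y", winner="model-x"),
]
print(win_rates(votes))  # {'model-x': 0.666..., 'model-y': 0.333...}
```

Because each vote is a small, timestamped record, the same set of votes can be replayed later to reproduce an evaluation or to track how preferences shift over time.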
How Evals Work in RAG-in-a-Box
1. Define your evaluation task. Describe what you want to test, for example "Summarize customer feedback reports" or "Answer FAQs from our support docs."

2. Upload your dataset. Choose which dataset or document collection the eval should run against, using curated samples or real production data.

3. Select models. Your data is evaluated against a range of LLMs. See the full list here.

4. Launch the eval task. RAG-in-a-Box executes the task across all selected models, stores the outputs, and organizes them for review.

5. Invite contributors via the Community Voting Dashboard. Open your eval to collaborators, whether teammates, domain experts, or community contributors, to review and score results. Contributors can access the Community Voting Dashboard at vote.joinable.ai to vote on, compare, and rate model outputs directly. This creates a crowdsourced signal that helps identify which model performs best in real-world contexts.

6. Analyze and iterate. Review automated metrics, human scores, and voting outcomes to refine prompts, retrieval logic, or model selection before deployment.
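These steps are handled through the RAG-in-a-Box UI, so the snippet below is only a conceptual sketch of what an eval task definition captures. The dictionary keys and the `launch_eval` helper are hypothetical and not a published API.

```python
# Conceptual sketch of an eval task definition.
# All keys and the launch_eval() helper are hypothetical illustrations,
# not the RAG-in-a-Box API.
eval_task = {
    "name": "support-faq-answers",
    "description": "Answer FAQs from our support docs",
    "dataset": "support-docs-v3",          # curated sample or production collection
    "models": ["model-a", "model-b"],      # models to compare side-by-side
    "metrics": ["relevance", "accuracy", "completeness", "clarity"],
    "community_voting": True,              # open results on the Community Voting Dashboard
}

def launch_eval(task: dict) -> None:
    """Placeholder: in the product, launching runs the task across all selected
    models, stores their outputs, and organizes them for review."""
    print(f"Launching eval '{task['name']}' on {len(task['models'])} models...")

launch_eval(eval_task)
```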


What You Can Measure
Each eval in RAG-in-a-Box can include both automated metrics and community scoring.
Relevance: How closely the answer aligns with the source data
Accuracy: Factual correctness and grounding
Completeness: Coverage of the full query or problem
Clarity: Coherence, readability, and structure
Preference: Community-voted best output between models
By combining quantitative metrics with human feedback, RAG-in-a-Box provides a 360° view of model performance.
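As a minimal sketch of how the two signals can be read together: average the automated metric scores for a model, then blend in its community preference win rate with a chosen weight. The function name, the 0-to-1 score scale, and the 0.4 weighting below are assumptions for illustration only, not product defaults.

```python
def combined_score(metric_scores: dict[str, float],
                   preference_win_rate: float,
                   preference_weight: float = 0.4) -> float:
    """Blend automated metrics (assumed 0-1 scale) with community preference.
    The 0.4 weighting is an arbitrary illustration, not a product default."""
    automated = sum(metric_scores.values()) / len(metric_scores)
    return (1 - preference_weight) * automated + preference_weight * preference_win_rate

scores = {"relevance": 0.82, "accuracy": 0.77, "completeness": 0.70, "clarity": 0.88}
print(round(combined_score(scores, preference_win_rate=0.66), 2))  # ~0.74
```

Ranking models by a blended score like this keeps automated checks and human judgment on equal footing, rather than letting either one dominate the decision.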
🔄 From Evals to Production
Evals are not the end — they’re part of an iterative loop in RAG-in-a-Box:
Prototype your app on your data
Run evals to benchmark performance
Invite the community to review and vote
Refine and retrain based on insights
Deploy your best-performing model confidently
This loop helps ensure that every deployment is data-driven, community-validated, and continuously improving.