Open-source web search evals you can run in minutes

From: You.com Product Team
To: tjphuhs@gmail.com
Account: tjphuhs@gmail.com
Date: 4/23/2026, 11:58:50 AM

Run standardized evals (SimpleQA, FRAMES, BrowseComp,
DeepSearchQA) in your own environment, with no custom infra required.

Hi there,

When building AI systems that rely on web search, search quality
determines model quality.

You shouldn’t have to build eval infrastructure from scratch,
rely on results that can’t be reproduced, or spend weeks
debugging benchmarks instead of building your product.

So we built that infrastructure for you.

*************************************************
The You.com Web Search Eval Harness (open source)
*************************************************

An evaluation framework that lets you benchmark any web search
provider using industry-standard datasets consistently and in
your own environment.

************
What you get
************

Run four widely used benchmarks out of the box:

* SimpleQA → factual accuracy on direct questions
* FRAMES → multi-hop structured reasoning
* BrowseComp → multi-source navigation and synthesis
* DeepSearchQA → deep research-style completeness tasks

Together, they cover everything from factual retrieval to
production-grade research workflows.

********************
What it does for you
********************

Instead of building eval infrastructure from scratch, the
harness handles:

* dataset loading
* query execution across providers
* answer synthesis
* standardized scoring (based on original benchmark methods)

Same inputs → consistent outputs → comparable results across
providers.
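
To make the four stages above concrete, here is a minimal Python sketch of how such a pipeline could fit together. It is not the harness's actual code. Every name in it (Example, load_dataset, stub_search, synthesize, grade, evaluate) is hypothetical, the dataset and the search provider are in-memory stand-ins, and grading is a simple substring check rather than the model-based grading the harness relies on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    question: str
    expected: str


def load_dataset() -> list[Example]:
    # Hypothetical in-memory stand-in for a benchmark such as SimpleQA.
    return [
        Example("What is the capital of France?", "Paris"),
        Example("How many legs does a spider have?", "8"),
    ]


def stub_search(query: str) -> list[str]:
    # Hypothetical provider: returns canned snippets instead of calling an API.
    corpus = {
        "capital of france": "Paris is the capital of France.",
        "spider": "Spiders have 8 legs.",
    }
    return [text for key, text in corpus.items() if key in query.lower()]


def synthesize(question: str, snippets: list[str]) -> str:
    # Toy synthesis step: concatenate snippets (a real system would call a model).
    return " ".join(snippets)


def grade(answer: str, expected: str) -> bool:
    # Toy grader: case-insensitive substring match (an assumption for this sketch).
    return expected.lower() in answer.lower()


def evaluate(search: Callable[[str], list[str]], examples: list[Example]) -> float:
    # Same inputs for every provider, so the resulting accuracies are comparable.
    correct = sum(
        grade(synthesize(ex.question, search(ex.question)), ex.expected)
        for ex in examples
    )
    return correct / len(examples)


if __name__ == "__main__":
    print(evaluate(stub_search, load_dataset()))
```

Because every provider is just a function from a query to a list of snippets, swapping providers only changes the first argument to evaluate.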

************
How it works
************

Clone → add API keys → run:

python src/evals/eval_runner.py --samplers you_search_with_livecrawl --datasets simpleqa

Or run everything by default. Results are saved incrementally so
long runs are safe and resumable.
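
For readers curious what "saved incrementally" and "resumable" can mean in practice, here is one common pattern, sketched in Python. This is an assumption about the general technique, not a description of how the harness stores results: the JSON Lines file, the record fields, and the function names (load_done, run_resumable) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_done(path: Path) -> set[str]:
    # Collect the ids of tasks already written by an earlier (possibly interrupted) run.
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_resumable(tasks: dict[str, str], path: Path,
                  answer_fn: Callable[[str], str]) -> int:
    # Append one JSON line per finished task; skip tasks that are already on disk.
    done = load_done(path)
    executed = 0
    with path.open("a") as f:
        for task_id, question in tasks.items():
            if task_id in done:
                continue
            record = {"id": task_id, "answer": answer_fn(question)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push each result out promptly so a crash loses little work
            executed += 1
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp) / "results.jsonl"
        tasks = {"q1": "first question", "q2": "second question"}
        print(run_resumable(tasks, results, str.upper))  # first run does all tasks
        print(run_resumable(tasks, results, str.upper))  # rerun finds nothing to do
```

Appending one record per task means an interrupted run only has to redo the task that was in flight when it stopped.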

*****************
What you get back
*****************

A single file that includes:

* accuracy per benchmark
* latency per provider
* side-by-side comparisons across search systems
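
As an illustration of how numbers like these can be derived from per-question records, here is a short Python sketch. The record layout (provider, benchmark, correct, latency_s), the provider names, and the summarize function are hypothetical and chosen only for this example. The email does not specify the format of the results file.

```python
from collections import defaultdict


def summarize(records: list[dict]) -> tuple[dict, dict]:
    # Group correctness flags by (provider, benchmark) and latencies by provider.
    correct_flags = defaultdict(list)
    latencies = defaultdict(list)
    for r in records:
        correct_flags[(r["provider"], r["benchmark"])].append(r["correct"])
        latencies[r["provider"]].append(r["latency_s"])

    # Accuracy is the fraction of correct answers; latency is a plain mean.
    accuracy = {key: sum(flags) / len(flags) for key, flags in correct_flags.items()}
    mean_latency = {key: sum(vals) / len(vals) for key, vals in latencies.items()}
    return accuracy, mean_latency


if __name__ == "__main__":
    # Made-up records for two hypothetical providers.
    demo = [
        {"provider": "provider_a", "benchmark": "simpleqa", "correct": True, "latency_s": 1.0},
        {"provider": "provider_a", "benchmark": "simpleqa", "correct": False, "latency_s": 3.0},
        {"provider": "provider_b", "benchmark": "simpleqa", "correct": True, "latency_s": 0.5},
    ]
    accuracy, mean_latency = summarize(demo)
    for (provider, benchmark), acc in sorted(accuracy.items()):
        print(provider, benchmark, acc, mean_latency[provider])
```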

************************
Why we open-sourced this
************************

We’re trying to make it easier to:

* evaluate search providers yourself
* trust your own results (not snapshots or marketing claims)
* focus engineering time on building real AI workflows, not eval
infrastructure

********************************************
Get Started with the Web Search Eval Harness
********************************************

Repo: https://clicks.you.com/f/a/_iffcyi0pftCMdIvio3oOg~~/AAQRxRA~/XJ5TR_jpItAjWmAmb7nPgh4v-W5tJ_rjDZ-Hd9_X1DL4ZcKkY79ha9dvLQj6oFYhZBKCIo9e1MAwxtcFsxq2yLjiyJr5dmHwg24a8QOmp0171alqZrrhK4knbmCJR9wZqoy9wujzQS791LWoEMEZTA~~

You'll need:

* You.com API key 
( https://clicks.you.com/f/a/jAyoQbXm0oSnaOcWS7s1hQ~~/AAQRxRA~/6HSi8uKwRSCtaX5P2Ffdf2HO9N980F-a1H6OND8MUoBINX76-aEkl2eqneCpL6Mlfw8g0nhA-WWs2I0dxsmnDWIyBItgw9aptbYzEhAfd990OvbThRghCEjOoFC21FVE2dgrNJdP2tLz3w_0fBg22zI-afxFMCI88wFgS5ieDv1nzYAN6WKTRL77jVa4Lw2qNYRu0WA17o1lEeiXP8RxYTg3bSvj4sV0V6dQSASalFN0KGvgQJEOtxRw1J251DB-OJmw9exNbxe0nvIaAXey_qui7XrS64vbixHmkUFIVaN8tPXEsySDozvM4PhpLaP1 )

* OpenAI API key (for grading and synthesis models)

Read the article
( https://clicks.you.com/f/a/U50dtOpq7oEOaC87xK_YzQ~~/AAQRxRA~/ft6Rxy7M8UPYx2KSHMYp4Ph9bFrhxq3Wh_kWVtq75ILrB18HRI7ZBEOA4OG-9DwN1iz9Ql7_pP_rNaCPNKxlE_Qk3MnLUL8ss_rXhQx4DugI7S00cUhNBdR2JsdJgpJYle3wpxwu0zQLGcCebnjBDI3AkgkWDCCRrS45UIh0easQylinHRCFCNvjCL7jrtgQnSn7HoyROO4GUxbRMVL0COi7KxDJEjhSBSiwNoKJTpVBvRhDSK82wSkp0G5K5NiGBGKuOx6MV8swh6TnkBu1lwWf2HdL-an7yBh4K3RAgWgBFZcOXOamnqPZ5WFLyIPlboxwm5Qw_6WtGaAUE0CvIRHQwOFQzOkorKQ4pYr4xAA~ )
- The You.com Team


You.com
228 Hamilton Ave, Fl 3
Palo Alto, CA 94301