Eval

How well is Ember doing? Daily thumbs from Telegram, rubric scores from the LLM judge, and backtest performance against curated past events.

Thumbs from Telegram

No ratings yet. Once the eval harness is wired through Hermes, daily thumbs land here.

LLM-as-judge, 5 dimensions

2.3 / 5

Curated historical events

Curate ~10–15 significant past events to measure whether Ember would have caught them.
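One way to score such a backtest is a simple catch rate: the share of curated events that a replayed run flagged. The sketch below is illustrative only; `PastEvent`, its fields, and `catch_rate` are hypothetical names, not part of Ember.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PastEvent:
    """A curated historical event (hypothetical structure)."""
    name: str     # human-readable label for the event
    caught: bool  # True if a replayed run flagged the event


def catch_rate(events: List[PastEvent]) -> Optional[float]:
    """Return the fraction of curated events that were caught.

    Returns None when no events have been curated yet, so an empty
    backtest set is not reported as a 0% catch rate.
    """
    if not events:
        return None
    return sum(e.caught for e in events) / len(events)
```

Returning `None` for an empty set keeps "nothing curated yet" distinct from "nothing caught".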

Daily positive % from Telegram thumbs

Ratings will accumulate here as Jim rates daily briefings via Telegram.
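As a rough sketch of how the daily positive percentage could be derived, assume each rating arrives as a `(day, thumbs_up)` pair. That input shape and the function name `daily_positive_pct` are assumptions for illustration, not Ember's actual schema.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def daily_positive_pct(ratings: Iterable[Tuple[date, bool]]) -> Dict[date, float]:
    """Map each day to the percentage of its ratings that were thumbs-up.

    Days with no ratings do not appear in the result at all.
    """
    up = defaultdict(int)
    total = defaultdict(int)
    for day, thumbs_up in ratings:
        total[day] += 1
        up[day] += int(thumbs_up)  # count only positive ratings
    return {day: 100.0 * up[day] / total[day] for day in sorted(total)}
```

Days without any rating are omitted rather than shown as 0%, so a missed day does not look like a bad one.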

Mean LLM-judge score across 5 dimensions
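A minimal sketch of the headline number, assuming the judge returns one numeric score per rubric dimension and the dimensions are weighted equally. The dimension names are not given here, so the function takes any five labelled scores; `mean_judge_score` is a hypothetical name.

```python
from statistics import fmean
from typing import Mapping


def mean_judge_score(scores: Mapping[str, float]) -> float:
    """Return the unweighted mean of five rubric scores, rounded to one decimal.

    Raises ValueError if the mapping does not contain exactly five dimensions.
    """
    if len(scores) != 5:
        raise ValueError(f"expected 5 dimensions, got {len(scores)}")
    return round(fmean(scores.values()), 1)
```

If some dimensions matter more than others, a weighted mean would be the natural replacement; nothing here indicates which weighting, if any, Ember uses.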