Eval
How well is Ember doing? Daily thumbs from Telegram, rubric scores from the LLM judge, and backtest performance against curated past events.
Daily ratings
Thumbs from Telegram
No ratings yet. Once the eval harness is wired through Hermes, daily thumbs land here.
Rubric score
LLM-as-judge, 5 dimensions
2.3 / 5
Backtest
Curated historical events
Curate ~10–15 significant past events to measure whether Ember would have caught them.
Rating trend (last 30 days)
Daily positive % from Telegram thumbs
Ratings will accumulate here as Jim rates daily briefings via Telegram.
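One way the daily positive % could be computed from raw thumbs, as a minimal sketch. The function name `daily_positive_pct`, the `(day, is_thumbs_up)` input shape, and the sample ratings are all assumptions for illustration; nothing here reflects how the eval harness actually stores Telegram ratings.

```python
from collections import defaultdict
from datetime import date


def daily_positive_pct(ratings):
    """Map each day to the percentage of thumbs-up ratings on that day.

    `ratings` is a hypothetical iterable of (day, is_thumbs_up) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # day -> [thumbs up, total ratings]
    for day, is_up in ratings:
        if is_up:
            counts[day][0] += 1
        counts[day][1] += 1
    # Days with no ratings never appear in `counts`, so total is always > 0.
    return {day: 100.0 * up / total for day, (up, total) in sorted(counts.items())}


# Made-up sample data, not real ratings.
sample = [
    (date(2024, 1, 1), True),
    (date(2024, 1, 1), False),
    (date(2024, 1, 2), True),
]
trend = daily_positive_pct(sample)
```

Days without any rating are omitted rather than reported as 0%, so a gap in the trend is not mistaken for a bad day.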
Rubric breakdown
Mean LLM-judge score across 5 dimensions
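A minimal sketch of how a mean across 5 dimensions might be derived, assuming the headline score is the unweighted mean of per-dimension means. The dimension names (`dim_1` to `dim_5`), the function name `rubric_score`, and the sample scores are placeholders; the actual rubric dimensions and aggregation are not stated here.

```python
from statistics import mean


def rubric_score(dimension_scores):
    """Return (per-dimension means, overall mean rounded to one decimal).

    `dimension_scores` is a hypothetical mapping of dimension name to a
    list of 1-5 judge scores.
    """
    per_dimension = {name: mean(vals) for name, vals in dimension_scores.items()}
    overall = round(mean(list(per_dimension.values())), 1)
    return per_dimension, overall


# Made-up judge scores for five placeholder dimensions.
sample_scores = {
    "dim_1": [4, 4],
    "dim_2": [3, 3],
    "dim_3": [2, 2],
    "dim_4": [5, 5],
    "dim_5": [1, 1],
}
per_dimension, overall = rubric_score(sample_scores)
```

Averaging per-dimension means first keeps each dimension equally weighted even if some dimensions were scored on more briefings than others.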