Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Author: GodelNumbering
Source: Hacker News
Scored 65.2% vs. Google's official 47.8%, and vs. the existing top closed-source model, Junie CLI, at 64.3%.

Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point.
2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).
3. The full TerminalBench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and unfortunately the maintainers have not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...

It is astounding how much the harness matters, based on this and other experiments I...