Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line InterfacesAI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end,arxiv.orgTask FormulationTerminal-Bench의 테스크는 에이전트가 현실적..