Posts tagged 'benchmark'

benchmark 2

[Paper Review] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces (arxiv.org)
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end,

Task Formulation: Terminal-Bench tasks have the agent, in realistic..

Uncategorized 2026.03.18

[Paper Review] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models (arxiv.org)
Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure

ArtificialAnalysis/AA-Omniscience-Public · Dat..

Papers/LLMs 2026.02.20

khseon7's Blog

A blog for organizing assorted notes on artificial intelligence

