khseon7's Blog

Tag: benchmark (2 posts)

[Paper Review] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces (arxiv.org). AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, ... Task Formulation: A Terminal-Bench task is one in which the agent, in a realistic..

Uncategorized 2026.03.18

[Paper Review] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models (arxiv.org). Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure ... ArtificialAnalysis/AA-Omniscience-Public · Dat..

Papers/LLMs 2026.02.20
A blog for organizing notes on all sorts of topics related to artificial intelligence