arXiv Research Digest

Agentic Development, AI & LLMs
1394 papers
Spanning Feb 2026–Mar 2026
Last updated Mar 9, 2026
Added Mar 9, 2026
1

MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

Mar 2026 · 2603.00873
Agentic · RAG · Planning · Reasoning

With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA…

2

HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

Mar 2026 · 2603.00977
Long-Horizon · Agentic · Planning · Reasoning

Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level…

3

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Mar 2026 · 2603.03205
Long-Horizon · Agentic · Reasoning · Benchmarks

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion,…

4

EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

Mar 2026 · 2603.04900
Self-Improving · Long-Horizon · Agentic · Benchmarks

LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, which are prone to entangling behaviors, or single-aspect,…

5

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Mar 2026 · 2603.01966
Memory · Long-Horizon · Agentic · Benchmarks

Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym,…

6

DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths

Feb 2026 · 2603.00309
Multi-Agent

The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles in order to reduce complexity, ideally these agents would be truly autonomous, able to achieve…

7

Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

Mar 2026 · 2603.00846
Agentic · RAG · Reasoning · Benchmarks

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal…

8

GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning

Mar 2026 · 2603.01410
Agentic · RAG · Reasoning · Fine-Tuning

Knowledge graphs provide structured and reliable information for many real-world applications, motivating increasing interest in combining large language models (LLMs) with graph-based retrieval to improve factual grounding. Recent Graph-based Retrieval-Augmented Generation (GraphRAG) methods therefore introduce iterative interaction between LLMs…

9

DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows

Feb 2026 · 2603.00532
Long-Horizon · Software Dev · Agentic · Reasoning

Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in…

10

A Novel Hierarchical Multi-Agent System for Payments Using LLMs

Feb 2026 · 2602.24068
Multi-Agent · Architecture

Large language model (LLM) agents, such as OpenAI's Operator and Claude's Computer Use, can automate workflows but are unable to handle payment tasks. Existing agentic solutions have gained significant attention; however, even the latest approaches face challenges in implementing end-to-end agentic payment workflows. To address this gap, this research…

11

TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces

Feb 2026 · 2603.00623
Reasoning · Benchmarks · Multi-Agent · Inference

Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly…

12

Mozi: Governed Autonomy for Drug Discovery LLM Agents

Mar 2026 · 2603.03655
Long-Horizon · Agentic · Planning · Reasoning

Tool-augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high-stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool-use governance and poor long-horizon reliability. In dependency-heavy pharmaceutical pipelines, autonomous agents often…

13

Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Mar 2026 · 2603.06503
Context · Agentic · RAG · Reasoning

Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution…

14

Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation

Mar 2026 · 2603.06064
Software Dev · Agentic · Planning

Task planning, the problem of sequencing actions to reach a goal from an initial state, is a core capability requirement for autonomous robotic systems. Whether large language models (LLMs) can serve as viable planners alongside classical symbolic methods remains an open question. We present PyPDDLEngine, an open-source Planning Domain Definition…

15

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

Mar 2026 · 2603.01327
Memory · Context · Software Dev · Agentic

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context management for accurate localization, and (2) systematic approaches for iterative, test-driven code…

16

Molt Dynamics: Emergent Social Phenomena in Autonomous AI Agent Populations

Mar 2026 · 2603.03555
Multi-Agent · Reinforcement · Safety

MoltBook is a large-scale multi-agent coordination environment where over 770,000 autonomous LLM agents interact without human participation, offering the first opportunity we are aware of to observe emergent multi-agent coordination dynamics at this population scale. We introduce Molt Dynamics: the emergent agent coordination behaviors,…

17

Agentic AI-RAN: Enabling Intent-Driven, Explainable and Self-Evolving Open RAN Intelligence

Feb 2026 · 2602.24115
Memory · Self-Improving · Agentic · Planning

Open RAN (O-RAN) exposes rich control and telemetry interfaces across the Non-RT RIC, Near-RT RIC, and distributed units, but also makes it harder to operate multi-tenant, multi-objective RANs in a safe and auditable manner. In parallel, agentic AI systems with explicit planning, tool use, memory, and self-management offer a natural way to…

18

STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

Mar 2026 · 2603.05294
Memory · Long-Horizon · Agentic · Planning

Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take actions that optimize long-term objectives. However, existing web agents struggle on complex, long-horizon tasks due to limited in-context memory for…

19

WirelessAgent++: Automated Agentic Workflow Design and Benchmarking for Wireless Networks

Feb 2026 · 2603.00501
Self-Improving · Agentic · Reasoning · Benchmarks

The integration of large language models (LLMs) into wireless networks has sparked growing interest in building autonomous AI agents for wireless tasks. However, existing approaches rely heavily on manually crafted prompts and static agentic workflows, a process that is labor-intensive, unscalable, and often suboptimal. In this paper, we propose…

20

PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents

Feb 2026 · 2602.23668
Long-Horizon · Planning · Reasoning · Benchmarks

Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, unstable reasoning, and high token consumption in complex long-horizon tasks involving branching,…

21

AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment

Mar 2026 · 2603.03686
Memory · Context · Long-Horizon · Planning

Automated design of chemical formulations is a cornerstone of materials science, yet it requires navigating a high-dimensional combinatorial space involving discrete compositional choices and continuous geometric constraints. Existing Large Language Model (LLM) agents face significant challenges in this setting, including context window…

22

GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant

Mar 2026 · 2603.01059
Agentic · Reasoning · Benchmarks · Reinforcement

Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for…

23

A Multi-Agent Framework for Interpreting Multivariate Physiological Time Series

Mar 2026 · 2603.04142
Reasoning · Benchmarks · Multi-Agent · Safety

Continuous physiological monitoring is central to emergency care, yet deploying trustworthy AI is challenging. While LLMs can translate complex physiological signals into clinical narratives, it is unclear how agentic systems perform relative to zero-shot inference. To address these questions, we present Vivaldi, a role-structured multi-agent…

24

RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis

Feb 2026 · 2603.00686
Long-Horizon · Agentic · Reasoning · Benchmarks

Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities…

25

Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

Mar 2026 · 2603.05344
Memory · Context · Long-Horizon · Software Dev

The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-based agents offer unprecedented autonomy for long-horizon development tasks. In this paper, we present…

26

Agentic AI-based Coverage Closure for Formal Verification

Mar 2026 · 2603.03147
Agentic · Benchmarks · Reinforcement

Coverage closure is a critical requirement in the Integrated Chip (IC) development process and a key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI…

27

From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Feb 2026 · 2602.23729
Agentic · Reasoning · Benchmarks · Inference

The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in…

28

EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

Mar 2026 · 2603.05553
Self-Improving · Benchmarks · Multi-Agent · Architecture

Function-calling agents (large language models that invoke tools and APIs) require high-quality, domain-specific training data spanning executable environments, backing databases, and diverse multi-turn trajectories. We introduce EigenData, an integrated, self-evolving platform that automates the full data lifecycle through a multi-agent…

29

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Mar 2026 · 2603.02637
Software Dev · Reasoning · Benchmarks · Multi-Agent

Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not…

30

HotelQuEST: Balancing Quality and Efficiency in Agentic Search

Feb 2026 · 2602.23949
Agentic · Benchmarks

Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that…

31

KARL: Knowledge Agents via Reinforcement Learning

Mar 2026 · 2603.05218
Long-Horizon · Agentic · Reasoning · Benchmarks

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including…

32

AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Mar 2026 · 2603.03233
Software Dev · Benchmarks · Multi-Agent · Safety

Large Language Models (LLMs) demonstrate potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in…

33

RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Mar 2026 · 2603.03078
Agentic · RAG · Reasoning · Fine-Tuning

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration,…

34

LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Mar 2026 · 2603.01952
Benchmarks · Multi-Agent

Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and…

35

RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection

Mar 2026 · 2603.02345
Benchmarks · Multi-Agent

Infrastructure as code (IaC) tools automate cloud provisioning, but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)-based agentic AI systems can automate the…

36

The Auton Agentic AI Framework

Feb 2026 · 2602.23720
Memory · Agentic · Reasoning · Reinforcement

The field of Artificial Intelligence is undergoing a transition from Generative AI (probabilistic generation of text and images) to Agentic AI, in which autonomous systems execute actions within external environments on behalf of users. This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce…

37

IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation

Feb 2026 · 2602.23481
Agentic · Reasoning · Benchmarks · Inference

Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent…

38

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Mar 2026 · 2603.06198
Context · RAG · Reasoning · Benchmarks

Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is…

39

stratum: A System Infrastructure for Massive Agent-Centric ML Workloads

Mar 2026 · 2603.03589
Agentic · Planning · Reasoning · Fine-Tuning

Recent advances in large language models (LLMs) transform how machine learning (ML) pipelines are developed and evaluated. LLMs enable a new type of workload, agentic pipeline search, in which autonomous or semi-autonomous agents generate, validate, and optimize complete ML pipelines. These agents predominantly operate over popular Python ML…

40

HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

Mar 2026 · 2603.04750
Long-Horizon · Planning · Multi-Agent · Reinforcement

Sequential LLM agents fail on long-horizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP-Travel, a hierarchical multi-agent framework that splits planning into strategic coordination and parallel day-level execution. A…

41

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Mar 2026 · 2603.04257
Memory · Context · Long-Horizon · Reasoning

Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in-context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the context budget, and makes distant evidence harder to…

42

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Mar 2026 · 2603.02119
Agentic · Reasoning · Benchmarks · Reinforcement

We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a…

43

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

Mar 2026 · 2603.01607
Agentic · Planning · Reasoning · Benchmarks

Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing…

44

RUMAD: Reinforcement-Unifying Multi-Agent Debate

Feb 2026 · 2602.23864
Reasoning · Benchmarks · Multi-Agent · Reinforcement

Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to task complexity variations, while external LLM-based coordination risks…

45

ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation

Feb 2026 · 2602.23716
Long-Horizon · Multi-Agent · Fine-Tuning

Large Language Model (LLM)-based agents show promise for e-commerce conversational shopping, yet existing implementations lack the interaction depth and contextual breadth required for complex product research. Meanwhile, the Deep Research paradigm, despite advancing information synthesis in web search, suffers from domain gaps when transferred to…

46

Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification

Mar 2026 · 2603.03175
RAG · Benchmarks · Multi-Agent

Saarthi is an agentic AI framework that uses multi-agent collaboration to perform end-to-end formal verification. Even though the framework provides a complete flow from specification to coverage closure, with around 40% efficacy, there are several challenges that need to be addressed to make it more robust and reliable. Artificial General…

47

SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair

Feb 2026 · 2602.23647
Software Dev · Reasoning · Benchmarks · Multi-Agent

The rapid advancement of Large Language Models (LLMs) has led to the emergence of intelligent agents capable of autonomously interacting with environments and invoking external tools. Recently, agent-based software repair approaches have received widespread attention, as repair agents can automatically analyze and localize bugs, generate patches,…

48

LifeBench: A Benchmark for Long-Horizon Multi-Source Memory

Mar 2026 · 2603.03781
Memory · Long-Horizon · Reasoning · Benchmarks

Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world…

49

CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration

Mar 2026 · 2603.00993
Benchmarks · Multi-Agent · Reinforcement

Large Language Models (LLMs) have revolutionized AI-generated content evaluation, with the LLM-as-a-Judge paradigm becoming increasingly popular. However, current single-LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre-training data. To address these limitations, we propose…

50

Agentic retrieval-augmented reasoning reshapes collective reliability under model variability in radiology question answering

Mar 2026 · 2603.06271
Agentic · RAG · Reasoning · Reinforcement

Agentic retrieval-augmented reasoning pipelines are increasingly used to structure how large language models (LLMs) incorporate external evidence in clinical decision support. These systems iteratively retrieve curated domain knowledge and synthesize it into structured reports before answer selection. Although such pipelines can improve…

51

An Interactive Multi-Agent System for Evaluation of New Product Concepts

Mar 2026 · 2603.05980
RAG · Benchmarks · Multi-Agent · Reinforcement

Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert-led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model…

52

The Controllability Trap: A Governance Framework for Military AI Agents

Mar 2026 · 2603.03515
Long-Horizon · Agentic · Planning · Benchmarks

Agentic AI systems (capable of goal interpretation, world modeling, planning, tool use, long-horizon operation, and autonomous coordination) introduce distinct control failures not addressed by existing safety frameworks. We identify six agentic governance failures tied to these capabilities and show how they erode meaningful human control in…

53

MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Mar 2026 · 2603.03192
Benchmarks · Safety

Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a…

54

CARD: Towards Conditional Design of Multi-agent Topological Structures

Mar 2026 · 2603.01089
Software Dev · Reasoning · Multi-Agent

Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model…

55

MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind

Feb 2026 · 2603.00808
Long-Horizon · Reasoning · Multi-Agent · Inference

A major challenge for world models in multi-agent systems is to understand interdependent agent dynamics, predict interactive multi-agent trajectories, and plan over long horizons with collective awareness, without centralized supervision or explicit communication. In this paper, MetaMind, a general and cognitive world model for multi-agent…

56

EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents

Feb 2026 · 2603.00349
Planning · Reasoning · Benchmarks · Multi-Agent

Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable high-level cognitive coordination through reasoning, planning, and natural language communication.…

57

Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking

Feb 2026 · 2603.00267
RAG · Reasoning · Multi-Agent · Knowledge

Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns learned from training data, which limits their generalization to new…

58

Graph-theoretic Agreement Framework for Multi-agent LLM Systems

Feb 2026 · 2603.00121
Reasoning · Multi-Agent · Architecture · Safety

The shift from monolithic LLMs to distributed multi-agent architectures demands new frameworks for verifying and securing autonomous coordination. Unlike traditional multi-agent systems focused on cooperative state alignment, modern LLM patterns (multi-agent debate, constitutional oversight, helper-critic loops) rely on adversarial critique for…

59

AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows

Mar 2026 · 2603.04902
Agentic · Benchmarks · Reinforcement

Agentic systems are increasingly acting on users' behalf, accessing calendars, email, and personal files to complete everyday tasks. Privacy evaluation for these systems has focused on the input and output boundaries, but each task involves several intermediate information flows, from agent queries to tool responses, that are not currently…

60

Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

Feb 2026 · 2603.00468
Reasoning · Benchmarks · Multi-Agent · Fine-Tuning

The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud-OpsBench, a large-scale benchmark that employs a State Snapshot Paradigm to construct a…

61

MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus

Mar 2026 · 2603.05129
RAG · Reasoning · Multi-Agent · Reinforcement

Diagnosing hepatic diseases accurately and interpretably is critical, yet it remains challenging in real-world clinical settings. Existing AI approaches for clinical diagnosis often lack transparency, structured reasoning, and deployability. Recent efforts have leveraged large language models (LLMs), retrieval-augmented generation (RAG), and…

62

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Mar 2026 · 2603.03565
Benchmarks · Multi-Agent · Reinforcement

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often…

63

ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Mar 2026 · 2603.01357
Agentic · Planning · Reasoning · Benchmarks

Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user…

64

Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI

Mar 2026 · 2603.01104
Context · Long-Horizon · Agentic · Reasoning

What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native…

65

XAI for Coding Agent Failures: Transforming Raw Execution Traces into Actionable Insights

Mar 2026 · 2603.05941
Software Dev

Large Language Model (LLM)-based coding agents show promise in automating software development tasks, yet they frequently fail in ways that are difficult for developers to understand and debug. While general-purpose LLMs like GPT can provide ad-hoc explanations of failures, raw execution traces remain challenging to interpret even for experienced…

66

EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue

Mar 2026 · 2603.04815
MemoryContextAgenticBenchmarks

Manipulative communication, such as gaslighting, guilt-tripping, and emotional coercion, is often difficult for individuals to recognize. Existing agentic AI systems lack the structured, longitudinal memory to track these subtle, context-dependent tactics, often failing due to limited context windows and catastrophic forgetting. We introduce…

67

MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN

Mar 2026 · 2603.03024
MemoryLong-HorizonPlanningMulti-Agent

Vision-Language Navigation (VLN) aims to empower robots with the ability to perform long-horizon navigation in unfamiliar environments based on complex linguistic instructions. Its success critically hinges on establishing an efficient "language-understanding -- visual-perception -- embodied-execution" closed loop. Existing methods often suffer…

68

Agentic Code Reasoning

Mar 2026 · 2603.01896
AgenticReasoningReinforcementPrompting

Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike…

69

RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

Feb 2026 · 2603.00724
Software DevAgenticReinforcementSafety

Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an…

70

AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

Feb 2026 · 2602.24134
AgenticRAGArchitecture

The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge of processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive…

71

Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems

Feb 2026 · 2603.00130
MemoryMulti-Agent

Current multi-agent AI systems operate with a fixed number of agents whose roles are specified at design time. No formal theory governs when agents should be created, destroyed, or re-specialized at runtime, let alone how the population structure responds to changes in resources or objectives. We introduce the Agentic Hive, a framework in which a…

72

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Mar 2026 · 2603.06007
BenchmarksMulti-Agent

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing…
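The directed-computation-graph view described in this abstract can be illustrated with a minimal sketch. It is not MASFactory's API: `run_workflow`, the node names, and the string-based "agents" are hypothetical stand-ins, and the only library call used is the standard `graphlib.TopologicalSorter`. Nodes are plain callables, and edges are given as a mapping from each node to the set of nodes it depends on.

```python
from graphlib import TopologicalSorter

def run_workflow(nodes, deps, initial_input):
    """Run each node after its predecessors and pass their outputs along the edges.

    nodes: name -> callable taking a list of upstream messages (hypothetical agents)
    deps:  name -> set of predecessor names (the graph's edges)
    """
    # Make sure every node appears in the graph, even if it has no predecessors.
    graph = {name: set(deps.get(name, ())) for name in nodes}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        # Messages arrive from predecessors in sorted-name order for determinism.
        inbox = [outputs[p] for p in sorted(graph[name])]
        # Source nodes have no upstream messages, so they receive the task itself.
        outputs[name] = nodes[name](inbox or [initial_input])
    return outputs

# Toy "agents": each one just wraps the messages it receives.
nodes = {
    "planner": lambda inbox: "plan(" + inbox[0] + ")",
    "coder": lambda inbox: "code(" + inbox[0] + ")",
    "reviewer": lambda inbox: "review(" + " | ".join(inbox) + ")",
}
deps = {"coder": {"planner"}, "reviewer": {"planner", "coder"}}
result = run_workflow(nodes, deps, "task")
```

In this toy setup, a topological order guarantees that each node runs only after every node it depends on has produced a message; a real orchestration framework would also handle cycles, retries, and parallel execution.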

73

AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

Mar 2026 · 2603.03761
Benchmarks

LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a…

74

MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

Mar 2026 · 2603.01260
BenchmarksMulti-AgentReinforcementInference

Reinforcement learning (RL), large language models (LLMs), and vision-language models (VLMs) have been widely studied in isolation. However, existing infrastructure lacks the ability to deploy agents from different decision-making paradigms within the same environment, making it difficult to study them in hybrid multi-agent settings or to compare…

75

SecureRAG-RTL: A Retrieval-Augmented, Multi-Agent, Zero-Shot LLM-Driven Framework for Hardware Vulnerability Detection

Mar 2026 · 2603.05689
RAGReasoningBenchmarksMulti-Agent

Large language models (LLMs) have shown remarkable capabilities in natural language processing tasks, yet their application in hardware security verification remains limited due to scarcity of publicly available hardware description language (HDL) datasets. This knowledge gap constrains LLM performance in detecting vulnerabilities within HDL…

76

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Mar 2026 · 2603.04582
Software DevAgenticBenchmarksSafety

Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self-critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being…

77

Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

Mar 2026 · 2603.03784
Long-HorizonPlanningBenchmarksMulti-Agent

World models are essential for planning and evaluation in agentic systems, yet existing approaches lie at two extremes: hand-engineered simulators that offer consistency and reproducibility but are costly to adapt, and implicit neural models that are flexible but difficult to constrain, verify, and debug over long horizons. We seek a principled…

78

REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry

Mar 2026 · 2603.03018
AgenticArchitectureSafetyInference

Enterprise engineering organizations produce high-volume, heterogeneous telemetry from version control systems, CI/CD pipelines, issue trackers, and observability platforms. Large Language Models (LLMs) enable new forms of agentic automation, but grounding such agents on private telemetry raises three practical challenges: limited model context,…

79

HVR-Met: A Hypothesis-Verification-Replaning Agentic System for Extreme Weather Diagnosis

Mar 2026 · 2603.01121
PlanningReasoningBenchmarksMulti-Agent

While deep learning-based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess…

80

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Mar 2026 · 2603.03823
Software DevBenchmarks

Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations -- a…

81

Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models

Mar 2026 · 2603.02631
ContextAgenticArchitectureInference

Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a…

82

MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

Mar 2026 · 2603.02630
BenchmarksMulti-AgentFine-TuningInference

Large Language Models (LLMs) have achieved great success in many real-world applications, especially as the cognitive backbone of Multi-Agent Systems (MAS) that orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and MAS performance is highly sensitive to the input prompts,…

83

Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems

Mar 2026 · 2603.01045
ReasoningBenchmarksMulti-Agent

Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information -- rather than merely exchange it -- remains an open question. We introduce Silo-Bench, a role-agnostic benchmark of 30 algorithmic…

84

Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems

Feb 2026 · 2603.00142
PlanningReasoningMulti-AgentArchitecture

LLM-based MAS are gaining popularity due to their potential for collaborative problem-solving enhanced by advances in natural language comprehension, reasoning, and planning. Research in Theory of Mind (ToM) and Belief-Desire-Intention (BDI) models has the potential to further improve the agent's interaction and decision-making in such systems.…

85

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Mar 2026 · 2603.04814
MemoryContextBenchmarksArchitecture

Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three…

86

S5-HES Agent: Society 5.0-driven Agentic Framework to Democratize Smart Home Environment Simulation

Mar 2026 · 2603.01554
AgenticRAGBenchmarksReinforcement

The smart home is a key domain within the Society 5.0 vision for a human-centered society. Smart home technologies rapidly evolve, and research should diversify while remaining aligned with Society 5.0 objectives. Democratizing smart home research would engage a broader community of innovators beyond traditional limited experts. This shift…

87

BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages

Feb 2026 · 2603.00634
BenchmarksMulti-Agent

Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over…

88

CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

Feb 2026 · 2603.00123
AgenticReasoningBenchmarksInference

Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical…

89

A Scalable Benchmark for Repository-Oriented Long-Horizon Conversational Context Management

Mar 2026 · 2603.06358
MemoryLong-HorizonBenchmarks

In recent years, large language models (LLMs) have advanced rapidly, substantially enhancing their code understanding and generation capabilities and giving rise to powerful code assistants. However, in practical repository development, excessively long-horizon conversational context may overwhelm models, causing the loss of critical information…

90

Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport

Mar 2026 · 2603.03768
Long-HorizonMulti-AgentSafetyInference

Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and…

91

CONCUR: Benchmarking LLMs for Concurrent Code Generation

Mar 2026 · 2603.03683
Software DevBenchmarks

Leveraging Large Language Models (LLMs) for code generation has increasingly emerged as a common practice in the domain of software engineering. Relevant benchmarks have been established to evaluate the code generation capabilities of LLMs. However, existing benchmarks focus primarily on sequential code, lacking the ability to effectively evaluate…

92

OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents

Mar 2026 · 2603.03005
Long-HorizonPlanningReasoningBenchmarks

Multi-agent large language model frameworks are promising for complex multi-step reasoning, yet existing systems remain weak for scientific and knowledge-intensive domains due to static prompts and agent roles, rigid workflows, and homogeneous model reliance, leading to poor domain adaptation, limited reasoning flexibility, and high latency on…

93

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Mar 2026 · 2603.02766
Self-ImprovingSoftware DevReasoningBenchmarks

Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows, and code, that augment agents with domain-specific capabilities. Most skills today are hand-crafted,…

94

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

Mar 2026 · 2603.01221
ReasoningMulti-AgentReinforcement

Multi-Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by substantial increase in token entropy, and…

95

MedCollab: Causal-Driven Multi-Agent Collaboration for Full-Cycle Clinical Diagnosis via IBIS-Structured Argumentation

Mar 2026 · 2603.01131
ReasoningMulti-Agent

Large language models (LLMs) have shown promise in healthcare applications; however, their use in clinical practice is still limited by diagnostic hallucinations and insufficiently interpretable reasoning. We present MedCollab, a novel multi-agent framework that emulates the hierarchical consultation workflow of modern hospitals to autonomously…

96

Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI

Mar 2026 · 2603.06217
BenchmarksMulti-AgentArchitectureInference

Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one-way dispatch signals and price alerts that offer little possibility for informed decision-making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and…

97

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Mar 2026 · 2603.05471
AgenticBenchmarks

Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of…

98

CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Mar 2026 · 2603.04177
Software DevReasoningBenchmarksSafety

Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. In this paper, we investigate if LLM agents (i)…

99

Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study

Mar 2026 · 2603.01486
AgenticFine-TuningArchitectureInference

Accurately mapping user queries to business categories is a fundamental Information Retrieval challenge for multi-category marketplaces, where context-sparse queries such as "Wildflower" exhibit intent ambiguity, simultaneously denoting a restaurant chain, a retail product, and a floral item. Traditional classifiers force a winner-takes-all…

100

A Systematic Study of LLM-Based Architectures for Automated Patching

Mar 2026 · 2603.01257
ReasoningBenchmarksMulti-AgentReinforcement

Large language models (LLMs) have shown promise for automated patching, but their effectiveness depends strongly on how they are integrated into patching systems. While prior work explores prompting strategies and individual agent designs, the field lacks a systematic comparison of patching architectures. In this paper, we present a controlled…

101

Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response

Mar 2026 · 2603.02274
AgenticReasoningReinforcement

Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant, but high-quality drug response samples are often sparse. While deep learning models achieve high predictive accuracy, they remain black boxes that fail to provide the causal mechanisms required for clinical decision-making. We…

102

CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants

Mar 2026 · 2603.01051
Software DevBenchmarksReinforcement

Large Language Models (LLMs) are increasingly used for software development, yet existing benchmarks for LLM-based coding assistance do not reflect the constraints of High Energy Physics (HEP) and High Performance Computing (HPC) software. Code correctness must respect science constraints and changes must integrate into large, performance-critical…

103

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Feb 2026 · 2602.24009
BenchmarksMulti-Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers…

104

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Mar 2026 · 2603.05167
ReasoningBenchmarksInference

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each…

105

MACC: Multi-Agent Collaborative Competition for Scientific Exploration

Mar 2026 · 2603.03780
Multi-AgentFine-TuningArchitecture

Scientific discovery still relies heavily on the manual efforts of individual researchers, leading to limited exploration, redundant trials, and reduced reproducibility. Human-participant data analysis competitions generate diverse approaches, yet fluctuations in participation and the lack of independent repetitions show that parallel exploration…

106

Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning

Mar 2026 · 2603.02154
PlanningBenchmarksMulti-AgentFine-Tuning

Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While…
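The contrast this abstract draws between deterministic UCT and a stochastic Boltzmann policy can be sketched as follows. This is not the paper's CB-MCTS: the function names, the exploration constant `c`, and the `temperature` parameter are assumptions made for illustration, and the decaying entropy bonus and all multi-agent coordination are omitted.

```python
import math
import random

def uct_select(values, visits, c=1.4):
    """Deterministic UCT: return the index of the child with the highest
    upper confidence bound. Assumes every visit count is positive."""
    total = sum(visits)
    scores = [
        v + c * math.sqrt(math.log(total) / n)
        for v, n in zip(values, visits)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def boltzmann_probs(values, temperature=1.0):
    """Softmax over child values; a lower temperature gives a greedier policy."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

def boltzmann_select(values, temperature=1.0, rng=random):
    """Stochastic selection: sample a child index from the softmax distribution."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

With equal value estimates, `uct_select` always returns the less-visited child, whereas `boltzmann_select` samples both children with equal probability. A schedule that lowers `temperature` over time is one common way to move from broad to focused exploration.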

107

From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Mar 2026 · 2603.06424
RAGBenchmarksFine-TuningPrompting

Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on…

108

VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

Mar 2026 · 2603.04910
MemoryContextBenchmarksArchitecture

Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and…

109

Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models

Mar 2026 · 2603.04837
AgenticBenchmarksReinforcementSafety

We introduce the Dynamic Behavioral Constraint (DBC) benchmark, the first empirical framework for evaluating the efficacy of a structured, 150-control behavioral governance layer, the MDBC (Madan DBC) system, applied at inference time to large language models (LLMs). Unlike training time alignment methods (RLHF, DPO) or post-hoc content moderation…

110

Agentic Peer-to-Peer Networks: From Content Distribution to Capability and Action Sharing

Mar 2026 · 2603.03753
AgenticReinforcementArchitectureInference

The ongoing shift of AI models from centralized cloud APIs to local AI agents on edge devices is enabling Client-Side Autonomous Agents (CSAAs) -- persistent personal agents that can plan, access local context, and invoke tools on behalf of users. As these agents begin to collaborate by delegating subtasks directly between clients, they…

111

Asymmetric Goal Drift in Coding Agents Under Value Conflict

Mar 2026 · 2603.03456
Software DevAgenticReinforcementSafety

Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. Throughout an agent's lifetime, it must navigate tensions between explicit instructions, learned values, and environmental pressures, often in contexts unseen during training. Prior work on model preferences, agent behavior under value tensions,…

112

Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning

Mar 2026 · 2603.02070
PlanningReasoningMulti-AgentArchitecture

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users'…

113

Federated Agentic AI for Wireless Networks: Fundamentals, Approaches, and Applications

Mar 2026 · 2603.01755
Self-ImprovingAgenticArchitecture

Agentic artificial intelligence (AI) presents a promising pathway toward realizing autonomous and self-improving wireless network services. However, the resource-constrained, widely distributed, and data-heterogeneous nature of wireless networks poses significant challenges to existing agentic AI that relies on centralized architectures, leading to…

114

From Goals to Aspects, Revisited: An NFR Pattern Language for Agentic AI Systems

Feb 2026 · 2603.00472
Agentic

Agentic AI systems exhibit numerous crosscutting concerns -- security, observability, cost management, fault tolerance -- that are poorly modularized in current implementations, contributing to the high failure rate of AI projects in reaching production. The goals-to-aspects methodology proposed at RE 2004 demonstrated that aspects can be…

115

A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models

Mar 2026 · 2603.05278
ContextSoftware DevBenchmarksReinforcement

Large language models (LLMs) can be used to support software development tasks, e.g., through code completion or code generation. However, their effectiveness drops significantly when considering less popular programming languages such as domain-specific languages (DSLs). In this paper, we propose a generic framework for evaluating the…

116

Recursive Models for Long-Horizon Reasoning

Mar 2026 · 2603.02112
Long-HorizonAgenticReasoning

Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts.…

117

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review

Mar 2026 · 2603.00989
MemorySoftware DevBenchmarksFine-Tuning

Large Language Models (LLMs) are widely used in software engineering to generate, complete, translate, and fix code, improving developer productivity. While most research focuses on the energy consumption and carbon emissions of model training and inference, far less attention has been given to the sustainability of the code these models produce.…

118

SIGMAS: Second-Order Interaction-based Grouping for Overlapping Multi-Agent Swarms

Feb 2026 · 2603.00120
ReasoningBenchmarksMulti-AgentInference

Swarming systems, such as drone fleets and robotic teams, exhibit complex dynamics driven by both individual behaviors and emergent group-level interactions. Unlike traditional multi-agent domains such as pedestrian crowds or traffic systems, swarms typically consist of a few large groups with inherent and persistent memberships, making group…

119

Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

Mar 2026 · 2603.05578
Self-ImprovingBenchmarks

Research on self-evolving language agents has accelerated, drawing increasing attention to their ability to create, adapt, and maintain tools from task requirements. However, existing benchmarks predominantly rely on predefined specifications, which limits scalability and hinders truly autonomous evolution. While recent studies attempt to…

120

Escaping the Hydrolysis Trap: An Agentic Workflow for Inverse Design of Durable Photocatalytic Covalent Organic Frameworks

Mar 2026 · 2603.05188
AgenticReasoningBenchmarksFine-Tuning

Covalent organic frameworks (COFs) are promising photocatalysts for solar hydrogen production, yet the most electronically favorable linkages, imines, hydrolyze rapidly in water, creating a stability--activity trade-off that limits practical deployment. Navigating the combinatorial design space of nodes, linkers, linkages, and functional groups to…

121

RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

Mar 2026 · 2603.05026
Software DevAgenticBenchmarks

Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories…

122

Agentics 2.0: Logical Transduction Algebra for Agentic Data Workflows

Mar 2026 · 2603.04241
AgenticBenchmarksInference

Agentic AI is rapidly transitioning from research prototypes to enterprise deployments, where requirements extend to meet the software quality attributes of reliability, scalability, and observability beyond plausible text generation. We present Agentics 2.0, a lightweight, Python-native framework for building high-quality, structured,…

123

A Natural Language Agentic Approach to Study Affective Polarization

Mar 2026 · 2603.02711
Multi-Agent

Affective polarization has been central to political and social studies, with growing focus on social media, where partisan divisions are often exacerbated. Real-world studies tend to have limited scope, while simulated studies suffer from insufficient high-quality training data, as manually labeling posts is labor-intensive and prone to…

124

The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition

Mar 2026 · 2603.01407
ReasoningBenchmarksMulti-Agent

Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads…

125

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

Feb 2026 · 2603.00540
Long-HorizonAgenticFine-TuningReinforcement

The evolution of Large Language Models (LLMs) from static instruction-followers to autonomous agents necessitates operating within complex, stateful environments to achieve precise state-transition objectives. However, this paradigm is bottlenecked by data scarcity, as existing tool-centric reverse-synthesis pipelines fail to capture the rigorous…

126

Adaptive Memory Admission Control for LLM Agents

Mar 2026 · 2603.04549
MemoryReasoningBenchmarksReinforcement

LLM-based agents increasingly rely on long-term memory to support multi-session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven…

127

Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction

Mar 2026 · 2603.02909
BenchmarksMulti-AgentReinforcement

Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on Event-type-only prompts…

128

Large Language Models as Bidding Agents in Repeated HetNet Auction

Mar 2026 · 2603.04455
ReasoningBenchmarksReinforcement

This paper investigates the integration of large language models (LLMs) as reasoning agents in repeated spectrum auctions within heterogeneous networks (HetNets). While auction-based mechanisms have been widely employed for efficient resource allocation, most prior works assume one-shot auctions, static bidder behavior, and idealized conditions.…

129

LiaisonAgent: A Multi-Agent Framework for Autonomous Risk Investigation and Governance

Feb 2026 · 2603.00200
PlanningReasoningBenchmarksMulti-Agent

The rapid evolution of sophisticated cyberattacks has strained modern Security Operations Centers (SOC), which traditionally rely on rule-based or signature-driven detection systems. These legacy frameworks often generate high volumes of technical alerts that lack organizational context, leading to analyst fatigue and delayed incident responses.…

130

Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

Mar 2026 · 2603.04378
Multi-Agent

As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing…

131

Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation

Mar 2026 · 2603.04466
Software DevReasoningPrompting

Can a multimodal language model learn to manipulate physical objects by reasoning about its own failures, without gradient updates, demonstrations, or reward engineering? We argue the answer is yes, under conditions we characterise precisely. We present Act-Observe-Rewrite (AOR), a framework in which an LLM agent improves a robot manipulation…

132

NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind

Mar 2026 · 2603.03212
Agentic

Real-time proactive agentic system, capable of modeling Human State of Mind, using foundation EXG model and text embeddings model, running fully offline on the edge. Unlike all previously known systems, the NeuroSkill(tm) system leverages SKILL.md description of Human's State of Mind via API and CLI provided by the system, directly from the…

133

Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation

Mar 2026 · 2603.02426
Multi-AgentFine-Tuning

We study personalized multi-agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents' optimal weights collectively lie in an unknown linear subspace. Inspired…

134

LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence

Mar 2026 · 2603.01651
AgenticReasoningBenchmarksFine-Tuning

Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework…

135

Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Mar 2026 · 2603.02277
AgenticBenchmarksArchitecture

Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating novel security risks. To mitigate these risks, agents are commonly deployed and evaluated in isolated "sandbox" environments, often implemented using Docker/OCI containers. We introduce…

136

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Feb 2026 · 2602.24288
AgenticBenchmarksFine-TuningReinforcement

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of…

137

The Coordination Gap: Alternation Metrics for Temporal Dynamics in Multi-Agent Battle of the Exes

Mar 2026 · 2603.05789
BenchmarksMulti-Agent

Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a…

138

τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Mar 2026 · 2603.04370
Long-HorizonAgenticReasoningBenchmarks

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating…

139

From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures

Mar 2026 · 2603.03911
Multi-AgentArchitecture

Web security demands rapid response capabilities to evolving cyber threats. Agentic Artificial Intelligence (AI) promises automation, but the need for trustworthy security responses is of the utmost importance. This work investigates the role of semantic relations in extracting information for sensitive operational tasks, such as configuring…

140

Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

Mar 2026 · 2603.03116
Benchmarks

Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents…

141

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Mar 2026 · 2603.01712
ReasoningBenchmarksFine-Tuning

Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents…

142

From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems

Feb 2026 · 2602.23701
BenchmarksMulti-AgentFine-TuningPrompting

LLM-powered Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex domains but suffer from inherent fragility and opaque failure mechanisms. Existing failure attribution methods, whether relying on direct prompting, costly replays, or supervised fine-tuning, typically treat execution logs as flat sequences. This linear…

143

KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning

Feb 2026 · 2602.23592
MemoryLong-HorizonPlanningFine-Tuning

Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text,…

144

Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations

Feb 2026 · 2602.23577
ReasoningMulti-AgentReinforcement

Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or replies) to log conversations that capture only a narrow spectrum of user…

145

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Feb 2026 · 2603.04428
MemoryMulti-AgentArchitectureInference

Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full…
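
The figures quoted in this abstract bound the size of a single agent's cache. The following is a minimal sketch of that arithmetic, assuming equal-sized caches and the standard KV-cache size formula. The function names are illustrative, the model shape in the second half is hypothetical (it is not taken from the paper), and the 4-bit figure ignores any quantization scale or zero-point overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_elem):
    """Size of one agent's KV cache: keys plus values for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_elem // 8


def per_agent_bounds_gb(budget_gb, agents_that_fit):
    """If exactly n equal caches fit in the budget, each lies in (budget/(n+1), budget/n]."""
    return budget_gb / (agents_that_fit + 1), budget_gb / agents_that_fit


# Figures from the abstract: 10.2 GB budget, 3 agents at 8K context in FP16.
low, high = per_agent_bounds_gb(10.2, 3)

# Hypothetical model shape (NOT from the paper), used only to compare precisions.
fp16 = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=8192, bits_per_elem=16)
q4 = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=8192, bits_per_elem=4)
```

Under these assumptions each FP16 cache would occupy more than 2.55 GB and at most 3.4 GB, and moving the same cache to 4-bit storage shrinks it by a factor of four before metadata overhead.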

146

RACAS: Controlling Diverse Robots With a Single Agentic System

Mar 2026 · 2603.05621
MemoryAgenticArchitecture

Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either…

147

S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home

Mar 2026 · 2603.05027
BenchmarksMulti-AgentSafety

The smart home is a key application domain within the Society 5.0 vision for a human-centered society. As smart home ecosystems expand with heterogeneous IoT protocols, diverse devices, and evolving threats, autonomous systems must manage comfort, security, energy, and safety for residents. Such autonomous decision-making requires a trust anchor,…

148

GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered

Mar 2026 · 2603.02081
BenchmarksMulti-Agent

Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires…

149

Qwen3-Coder-Next Technical Report

Feb 2026 · 2603.00729
Software DevAgenticBenchmarksReinforcement

We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of…

150

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Feb 2026 · 2602.24286
Software DevAgenticReinforcement

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation…

151

OPTIAGENT: A Physics-Driven Agentic Framework for Automated Optical Design

Feb 2026 · 2602.23761
AgenticBenchmarksFine-TuningReinforcement

Optical design is the process of configuring optical elements to precisely manipulate light for high-fidelity imaging. It is inherently a highly non-convex optimization problem that relies heavily on human heuristic expertise and domain-specific knowledge. While Large Language Models (LLMs) possess extensive optical knowledge, their capabilities…

152

ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference

Feb 2026 · 2602.23681
ReasoningBenchmarksInference

The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose…
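
For context, the fixed self-consistency baseline that this abstract criticizes can be sketched in a few lines. This is a generic illustration of uniform majority voting, not ODAR itself. `sample_answer` is a hypothetical stand-in for an LLM call, and the replayed answers are made up for the example.

```python
from collections import Counter
from itertools import cycle


def self_consistency(sample_answer, n):
    """Call the sampler n times and return (majority answer, vote share)."""
    votes = Counter(sample_answer() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n


# Hypothetical stand-in for an LLM call: replays a fixed sequence of answers.
_replay = cycle(["42", "42", "41", "42", "40"])
answer, share = self_consistency(lambda: next(_replay), n=5)
```

The sampler is called exactly `n` times regardless of how hard the question is, which is the uniform cost that adaptive routing approaches aim to avoid.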

153

Evaluating the Search Agent in a Parallel World

Mar 2026 · 2603.04751
MemoryReasoningBenchmarks

Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from…

154

Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

Mar 2026 · 2603.04191
ContextLong-HorizonBenchmarksInference

Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long-term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic…

155

From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

Mar 2026 · 2603.04474
Multi-AgentArchitecture

Large Language Model-based Multi-Agent Systems (LLM-MAS) are increasingly applied to complex collaborative scenarios. However, their collaborative mechanisms may cause minor inaccuracies to gradually solidify into system-level false consensus through iteration. Such risks are difficult to trace since errors can propagate and amplify through…

156

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Mar 2026 · 2603.03202
ReasoningBenchmarksMulti-AgentFine-Tuning

As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution…

157

Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Mar 2026 · 2603.02556
Self-ImprovingReasoningFine-Tuning

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual…

158

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Mar 2026 · 2603.02297
AgenticBenchmarks

Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a…

159

Legal RAG Bench: an end-to-end benchmark for legal RAG

Mar 2026 · 2603.01710
ReasoningBenchmarksReinforcement

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form…

160

SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning

Feb 2026 · 2603.02240
MemoryBenchmarksMulti-AgentReinforcement

We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank -- all without cloud dependencies or LLM inference calls. As AI agents increasingly rely on…

161

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Mar 2026 · 2603.05399
AgenticBenchmarksReinforcementSafety

We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness…

162

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

Mar 2026 · 2603.04390
AgenticArchitectureKnowledge

WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. We propose a dual-helix governance framework reframing these challenges as structural governance problems that model…

163

ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution

Mar 2026 · 2603.02510
Software DevAgenticBenchmarks

The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data…

164

MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

Mar 2026 · 2603.01050
AgenticPlanningReasoningBenchmarks

We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search…

165

FastCode: Fast and Cost-Efficient Code Understanding and Reasoning

Mar 2026 · 2603.01012
AgenticReasoningBenchmarksFine-Tuning

Repository-scale code reasoning is a cornerstone of modern AI-assisted software engineering, enabling Large Language Models (LLMs) to handle complex workflows from program comprehension to complex debugging. However, balancing accuracy with context cost remains a significant bottleneck, as existing agentic approaches often waste computational…

166

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

Mar 2026 · 2603.05044
BenchmarksReinforcement

Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into…

167

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Mar 2026 · 2603.04904
Multi-AgentSafety

In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models…

168

SEA-TS: Self-Evolving Agent for Autonomous Code Generation of Time Series Forecasting Algorithms

Mar 2026 · 2603.04873
Self-ImprovingSoftware DevReasoningBenchmarks

Accurate time series forecasting underpins decision-making across domains, yet conventional ML development suffers from data scarcity in new deployments, poor adaptability under distribution shift, and diminishing returns from manual iteration. We propose Self-Evolving Agent for Time Series Algorithms (SEA-TS), a framework that autonomously…

169

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Mar 2026 · 2603.03194
Software DevReasoningBenchmarks

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens…

170

Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Mar 2026 · 2603.02701
Software DevReasoningBenchmarksMulti-Agent

Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary…
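
The contrast between absolute rewards and group-relative ones can be illustrated with the advantage normalization commonly associated with Group Relative Policy Optimization. This is a generic sketch under that assumption. The excerpt does not state the exact form Graph-GRPO uses, and the function name, `eps` value, and reward values are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for four candidates sampled for the same task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every candidate earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because each candidate is scored relative to its siblings, a binary reward still yields a signed, comparable signal within the group, whereas a single sample with an absolute reward gives no such baseline.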

171

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Mar 2026 · 2603.01940
AgenticBenchmarksFine-TuningReinforcement

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training…

172

SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

Feb 2026 · 2603.00718
Long-HorizonAgenticBenchmarks

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited…

173

CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing

Feb 2026 · 2602.23845
AgenticRAGPrompting

Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, making unified correction both necessary and challenging. This…

174

Visioning Human-Agentic AI Teaming: Continuity, Tension, and Future Research

Mar 2026 · 2603.04746
AgenticReinforcementSafety

Artificial intelligence is undergoing a structural transformation marked by the rise of agentic systems capable of open-ended action trajectories, generative representations and outputs, and evolving objectives. These properties introduce structural uncertainty into human-AI teaming (HAT), including uncertainty about behavior trajectories,…

175

A Rubric-Supervised Critic from Sparse Real-World Outcomes

Mar 2026 · 2603.03800
Software DevBenchmarksReinforcementInference

Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a…

176

Social Norm Reasoning in Multimodal Language Models: An Evaluation

Mar 2026 · 2603.03590
ReasoningBenchmarksMulti-Agent

In Multi-Agent Systems (MAS), agents are designed with social capabilities, allowing them to understand and reason about social concepts such as norms when interacting with others (e.g., inter-robot interactions). In Normative MAS (NorMAS), researchers study how norms develop, and how violations are detected and sanctioned. However, existing…

177

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Mar 2026 · 2603.03054
BenchmarksFine-TuningReinforcementSafety

Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback…

178

A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities

Mar 2026 · 2603.02540
MemoryReasoningBenchmarks

Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple tasks that are trivial for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that…

179

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Mar 2026 · 2603.01493
AgenticReasoningBenchmarks

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the…

180

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Mar 2026 · 2603.01455
MemoryContextLong-HorizonReasoning

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency…

181

Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction

Feb 2026 · 2603.00214
Software DevAgentic

LLM agents are increasingly used for code generation, but physics-based simulation poses a deeper challenge: natural-language descriptions of simulation models are inherently underspecified, and different admissible resolutions of implicit choices produce physically valid but scientifically distinct configurations. Without explicit detection and…

182

Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

Feb 2026 · 2603.00188
MemoryLong-HorizonBenchmarksArchitecture

Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically…

183

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Feb 2026 · 2602.23452
ReasoningBenchmarksMulti-AgentReinforcement

Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues,…

184

CodeScout: Contextual Problem Statement Enhancement for Software Agents

Mar 2026 · 2603.05744
Software DevAgenticFine-TuningReinforcement

Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or…

185

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Mar 2026 · 2603.04759
MemoryContextBenchmarksReinforcement

The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we…

186

MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

Mar 2026 · 2603.03680
MemoryContextMulti-AgentFine-Tuning

Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement…

187

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Mar 2026 · 2603.02697
Multi-AgentReinforcement

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset…

188

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Mar 2026 · 2603.01571
ReasoningBenchmarksFine-TuningReinforcement

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT,…

189

AWE: Adaptive Agents for Dynamic Web Penetration Testing

Mar 2026 · 2603.00960
MemoryReasoningBenchmarksMulti-Agent

Modern web applications are increasingly produced through AI-assisted development and rapid no-code deployment pipelines, widening the gap between accelerating software velocity and the limited adaptability of existing security tooling. Pattern-driven scanners fail to reason about novel contexts, while emerging LLM-based penetration testers rely…

190

TopoEdge: Topology-Grounded Agentic Framework for Edge Networking Code Generation and Repair

Feb 2026 · 2603.00569
Software DevAgenticRAGPlanning

TopoEdge is a topology-grounded, edge-deployable framework for end-to-end software-defined networking (SDN) configuration generation and repair, motivated by the brittleness of configuration artefacts under topology variation and by strict operational constraints on latency, privacy, and on-site execution. TopoEdge represents each target topology…

191

AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

Mar 2026 · 2603.04921
AgenticReasoningBenchmarksArchitecture

This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose…

192

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Mar 2026 · 2603.02668
AgenticBenchmarks

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned with community needs, more usable by mathematicians, and more capable of…

193

LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

Mar 2026 · 2603.02586
AgenticBenchmarks

As large language models grow more capable, general AI agents have become increasingly prevalent in practical applications. However, existing benchmarks face significant limitations, failing to represent real-world user tasks accurately. To address this gap, we present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real…

194

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond

Mar 2026 · 2603.01589
BenchmarksFine-TuningSafetyPrompting

The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for…

195

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Mar 2026 · 2603.00889
ReasoningBenchmarksFine-TuningReinforcement

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental…

196

Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research

Feb 2026 · 2603.00582
Long-HorizonPlanningReasoningBenchmarks

While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions (those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources) remains largely unexplored. We introduce Super Research, a task for complex autonomous…

197

SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark

Feb 2026 · 2603.00520
Software DevBenchmarks

The SWE-Bench Verified leaderboard is approaching saturation, with the top system achieving 78.80%. However, we show that this performance is inflated. Our re-evaluation reveals that one in five "solved" patches from the top-30 agents are semantically incorrect, passing only because weak test suites fail to expose their errors. We present SWE-ABS,…

198

Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery

Mar 2026 · 2603.05860
MemorySelf-ImprovingAgenticReinforcement

Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets…

199

Bidirectional Curriculum Generation: A Multi-Agent Framework for Data-Efficient Mathematical Reasoning

Mar 2026 · 2603.05120
ReasoningMulti-Agent

Enhancing mathematical reasoning in Large Language Models typically demands massive datasets, yet data efficiency remains a critical bottleneck. While Curriculum Learning attempts to structure this process, standard unidirectional approaches (simple-to-complex) suffer from inefficient sample utilization: they blindly escalate complexity even when…

200

TritonDFT: Automating DFT with a Multi-Agent Framework

Mar 2026 · 2603.03372
BenchmarksMulti-AgentReinforcementInference

Density Functional Theory (DFT) is a cornerstone of materials science, yet executing DFT in practice requires coordinating a complex, multi-step workflow. Existing tools and LLM-based solutions automate parts of the steps, but lack support for full workflow automation, diverse task adaptation, and accuracy-cost trade-off optimization in DFT…

201

ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning

Mar 2026 · 2603.01464
ReasoningBenchmarksReinforcementInference

Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein-related…

202

NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code

Feb 2026 · 2603.00805
BenchmarksMulti-Agent

The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that…

203

MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning

Feb 2026 · 2603.00730
BenchmarksMulti-AgentFine-TuningReinforcement

Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks have several conflicting objectives and may require multiple agents to cooperate; these are multi-objective multi-agent decision-making problems. However, only a few works have been conducted on this…

204

TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?

Feb 2026 · 2603.00285
ReasoningBenchmarks

Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce TraderBench, a benchmark that addresses both issues. It combines…

205

ClarEval: A Benchmark for Evaluating Clarification Skills of Code Agents under Ambiguous Instructions

Feb 2026 · 2603.00187
Benchmarks

To integrate seamlessly into real-world software engineering, Code Agents must evolve from passive instruction followers into proactive collaborative partners. However, current evaluation paradigms predominantly reward "guessing" user intent under ideal conditions, neglecting the agent's ability to align with users through dialogue--a critical…

206

RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration

Feb 2026 · 2603.00186
BenchmarksMulti-AgentReinforcement

Financial systems run nonstop and must stay reliable even during cyber incidents. Modern attacks move across many services (apps, APIs, identity, payment rails), so defenders must make a sequence of actions under time pressure. Most security tools still use fixed rules or static playbooks, which can be slow to adapt when the attacker changes…

207

Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

Feb 2026 · 2603.00131
Multi-AgentSafetyInferencePrompting

Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user-LLM interactions, potential bias transfer in multi-agent systems and its associated security implications remain unexplored. In…

208

Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Mar 2026 · 2603.06140
ReasoningReinforcementInference

Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that…

209

Code Fingerprints: Disentangled Attribution of LLM-Generated Code

Mar 2026 · 2603.04212
Software DevBenchmarksReinforcementArchitecture

The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While these systems improve productivity, they introduce new challenges for software governance, accountability, and compliance. Existing research primarily focuses on distinguishing machine-generated code…

210

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Mar 2026 · 2603.04069
ReasoningBenchmarksFine-TuningReinforcement

Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based…

211

APRES: An Agentic Paper Revision and Evaluation System

Mar 2026 · 2603.03142
AgenticBenchmarksReinforcement

Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often…

212

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Mar 2026 · 2603.02097
PlanningReasoningBenchmarksFine-Tuning

Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world…

213

Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision

Mar 2026 · 2603.01494
Software DevRAGReasoningBenchmarks

Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in security reasoning and brittleness to evolving vulnerability patterns raise critical trustworthiness concerns. Models trained on static datasets cannot readily adapt to newly discovered vulnerabilities…

214

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Mar 2026 · 2603.01343
ReasoningBenchmarksSafety

Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical…

215

ESAA-Security: An Event-Sourced, Verifiable Architecture for Agent-Assisted Security Audits of AI-Generated Code

Mar 2026 · 2603.06365
AgenticReinforcementArchitecture

AI-assisted software generation has increased development speed, but it has also amplified a persistent engineering problem: systems that are functionally correct may still be structurally insecure. In practice, prompt-based security review with large language models often suffers from uneven coverage, weak reproducibility, unsupported findings,…

216

TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks

Mar 2026 · 2603.05764
Software DevBenchmarks

Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks. This paper evaluates 10 OSS LLMs on four Kaggle competitions and three…

217

When Agents Persuade: Propaganda Generation and Mitigation in LLMs

Mar 2026 · 2603.04636
Fine-Tuning

Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical…

218

BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Mar 2026 · 2603.02816
BenchmarksMulti-AgentFine-TuningSafety

The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent.…

219

CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development

Mar 2026 · 2603.01654
BenchmarksMulti-AgentArchitecture

The development of chemical processes, a cornerstone of chemical engineering, presents formidable challenges due to its multi-faceted nature, integrating specialized knowledge, conceptual design, and parametric simulation. Capitalizing on this, we propose CeProAgents, a hierarchical multi-agent system designed to automate the development of…

220

Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents

Mar 2026 · 2603.01548
AgenticPlanningReasoningArchitecture

Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that…

221

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

Mar 2026 · 2603.00876
PlanningReasoningBenchmarks

Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect, but also cause equipment damage or experimental failure. To address this, we propose…

222

RF-Agent: Automated Reward Function Design via Language Agent Tree Search

Feb 2026 · 2602.23876
Reasoning

Designing efficient reward functions for low-level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward…

223

TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining

Feb 2026 · 2602.23656
RAGReasoningKnowledgePrompting

TRIZ-based contradiction mining is a fundamental task in patent analysis and systematic innovation, as it enables the identification of improving and worsening technical parameters that drive inventive problem solving. However, existing approaches largely rely on rule-based systems or traditional machine learning models, which struggle with…

224

Hybrid Belief Reinforcement Learning for Efficient Coordinated Spatial Exploration

Mar 2026 · 2603.03595
AgenticPlanningFine-TuningReinforcement

Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often…

225

Tucano 2 Cool: Better Open Source LLMs for Portuguese

Mar 2026 · 2603.03543
AgenticRAGReasoningBenchmarks

We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5-3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset,…

226

MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Mar 2026 · 2603.02024
ReasoningBenchmarks

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for…

227

From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

Mar 2026 · 2603.01038
AgenticReasoningReinforcement

Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions…

228

Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation

Feb 2026 · 2603.00546
ReasoningBenchmarks

Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has gradually become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by…

229

Formal Analysis and Supply Chain Security for Agentic AI Skills

Feb 2026 · 2603.00195
AgenticBenchmarks

The rapid proliferation of agentic AI skill ecosystems -- exemplified by OpenClaw (228,000 GitHub stars) and Anthropic Agent Skills (75,600 stars) -- has introduced a critical supply chain attack surface. The ClawHavoc campaign (January-February 2026) infiltrated over 1,200 malicious skills into the OpenClaw marketplace, while MalTool catalogued…

230

LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning

Feb 2026 · 2602.23610
ReasoningBenchmarksReinforcement

The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness…

231

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Mar 2026 · 2603.06194
Long-HorizonBenchmarksReinforcement

Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit…

232

RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation

Mar 2026 · 2603.03745
MemoryRAGPlanningReasoning

Vision-Language Navigation (VLN) is evolving from single-point pathfinding toward the more challenging Multi-Goal VLN. This task requires agents to accurately identify multiple entities while collaboratively reasoning over their spatial-physical constraints and sequential execution order. However, generic Retrieval-Augmented Generation (RAG)…

233

Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection

Mar 2026 · 2603.04469
BenchmarksMulti-AgentSafety

Multi-Agent System is emerging as the de facto standard for complex task orchestration. However, its reliance on autonomous execution and unstructured inter-agent communication introduces severe risks, such as indirect prompt injection, that easily circumvent conventional input guardrails. To address this, we propose \SysName, a framework…

234

Multi-Agent-Based Simulation of Archaeological Mobility in Uneven Landscapes

Mar 2026 · 2603.03390
PlanningMulti-AgentInference

Understanding mobility, movement, and interaction in archaeological landscapes is essential for interpreting past human behavior, transport strategies, and spatial organization, yet such processes are difficult to reconstruct from static archaeological evidence alone. This paper presents a multi-agent-based modeling framework for simulating…

235

Modular Memory is the Key to Continual Learning Agents

Mar 2026 · 2603.01761
MemoryArchitecturePrompting

Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While…

236

Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent

Mar 2026 · 2603.01311
AgenticPlanning

The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory.…

237

RepoRepair: Leveraging Code Documentation for Repository-Level Automated Program Repair

Mar 2026 · 2603.01048
Software Dev

Automated program repair (APR) struggles to scale from isolated functions to full repositories, as it demands a global, task-aware understanding to locate necessary changes. Current methods, limited by context and reliant on shallow retrieval or costly agent iterations, falter on complex cross-file issues. To this end, we propose RepoRepair, a…

238

OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation

Feb 2026 · 2603.00462
AgenticBenchmarks

Orthopantomograms (OPGs) are the standard panoramic radiograph in dentistry, used for full-arch screening across multiple diagnostic tasks. While Vision Language Models (VLMs) now allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized…

239

Sharing is caring: data sharing in multi-agent supply chains

Feb 2026 · 2602.24074
Multi-Agent

Modern supply networks are complex interconnected systems. Multi-agent models are increasingly explored to optimise their performance. Most research assumes agents will have full observability of the system by having a single policy represent the agents, which seems unrealistic as this requires companies to share their data. The alternative is to…

240

Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

Feb 2026 · 2602.23556
ReasoningBenchmarksReinforcementPrompting

Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching…

241

Safe Multi-Agent Deep Reinforcement Learning for Privacy-Aware Edge-Device Collaborative DNN Inference

Feb 2026 · 2603.00129
Multi-AgentReinforcementInference

As Deep Neural Network (DNN) inference becomes increasingly prevalent on edge and mobile platforms, critical challenges emerge in privacy protection, resource constraints, and dynamic model deployment. This paper proposes a privacy-aware collaborative inference framework, in which adaptive model partitioning is performed across edge devices and…

242

$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

Mar 2026 · 2603.04948
ReasoningBenchmarksReinforcementInference

Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an…

243

DEP: A Decentralized Large Language Model Evaluation Protocol

Mar 2026 · 2603.01167
BenchmarksReinforcement

With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require the manual implementation of custom scripts, making it hard to ensure consistent and reproducible results. Furthermore, mainstream evaluation frameworks are centralized,…

244

CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

Feb 2026 · 2602.24142
PlanningReasoningArchitecture

Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose…

245

VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Feb 2026 · 2603.00207
ReasoningBenchmarksFine-TuningReinforcement

Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and…

246

Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní

Mar 2026 · 2603.05743
Multi-AgentReinforcementArchitectureInference

Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to…

247

BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry

Mar 2026 · 2603.05016
Multi-AgentReinforcement

Computational psychiatry faces a fundamental trade-off: traditional reinforcement learning (RL) models offer interpretability but lack behavioral realism, while large language model (LLM) agents generate realistic behaviors but lack structural interpretability. We introduce BioLLMAgent, a novel hybrid framework that combines validated cognitive…

248

Heterogeneous Agent Collaborative Reinforcement Learning

Mar 2026 · 2603.02604
ReasoningBenchmarksMulti-AgentReinforcement

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating…

249

Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

Mar 2026 · 2603.01912
BenchmarksMulti-AgentFine-Tuning

Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent…

250

Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering

Mar 2026 · 2603.01853
AgenticReasoningFine-TuningKnowledge

Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to…

251

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Mar 2026 · 2603.01692
MemoryReasoningFine-Tuning

LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients…

252

Evaluating and Understanding Scheming Propensity in LLM Agents

Mar 2026 · 2603.01608
AgenticBenchmarksReinforcement

As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To…

253

SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks

Feb 2026 · 2603.00575
Long-HorizonBenchmarksInference

Progress in software-engineering agents is increasingly constrained by the scarcity of executable, scalable, and realistic data for training and evaluation. This scarcity stems from three fundamental challenges in existing pipelines: environments are brittle and difficult to reproduce across languages; synthesizing realistic, system-level bugs at…

254

Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments

Feb 2026 · 2602.23997
AgenticReasoningReinforcementCode Gen

The next generation of autonomous agents must not only learn efficiently but also act reliably and adapt their behavior in open worlds. Standard approaches typically assume fixed tasks and environments with little or no novelty, which limits world models' ability to support agents that must evolve their policies as conditions change. This paper…

255

Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding

Feb 2026 · 2602.23468
Multi-Agent

Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected…

256

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Mar 2026 · 2603.02663
ReasoningBenchmarksArchitecture

Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding…

257

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

Mar 2026 · 2603.03371
AgenticBenchmarksFine-TuningReinforcement

The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a novel…

258

EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Mar 2026 · 2603.02041
ReasoningBenchmarksFine-TuningSafety

Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as…

259

Surgical Post-Training: Cutting Errors, Keeping Knowledge

Mar 2026 · 2603.01683
ReasoningInference

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical…

260

MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

Feb 2026 · 2603.00680
MemoryLong-HorizonInference

Long-horizon agents face the challenge of growing context size during interaction with the environment, which degrades performance and stability. Existing methods typically introduce an external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content…

261

ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models

Feb 2026 · 2602.24172
ReasoningReinforcement

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App…

262

Preference Packing: Efficient Preference Optimization for Large Language Models

Feb 2026 · 2602.24082
MemoryFine-TuningReinforcement

Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning to achieve resource-efficient training. We propose preference packing, a method to enhance resource…

263

A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science

Feb 2026 · 2603.04452
RAGBenchmarksKnowledge

To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000…

264

Enhancing Continual Learning for Software Vulnerability Prediction: Addressing Catastrophic Forgetting via Hybrid-Confidence-Aware Selective Replay for Temporal LLM Fine-Tuning

Feb 2026 · 2602.23834
BenchmarksFine-Tuning

Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This…

265

TADPO: Reinforcement Learning Goes Off-road

Mar 2026 · 2603.05995
Long-HorizonPlanningFine-TuningReinforcement

Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction.…

266

Core-based Hierarchies for Efficient GraphRAG

Mar 2026 · 2603.05207
RAGReasoningBenchmarksInference

Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. However, existing vector-based methods often fail on global sensemaking tasks that require reasoning across many documents. GraphRAG addresses this by organizing documents into a knowledge graph with hierarchical communities that can be…

267

VRM: Teaching Reward Models to Understand Authentic Human Preferences

Mar 2026 · 2603.04974
BenchmarksReinforcementInference

Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on directly mapping prompt-response pairs to scalar scores, which may inadvertently capture spurious correlations…

268

HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Mar 2026 · 2603.04855
BenchmarksMulti-AgentPrompting

Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a…

269

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Mar 2026 · 2603.04743
Software DevRAGKnowledge

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We…

270

Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization

Mar 2026 · 2603.02654
BenchmarksMulti-AgentReinforcement

In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent…

271

Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Mar 2026 · 2603.02176
Benchmarks

The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills,…

272

Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Mar 2026 · 2603.02045
Fine-TuningReinforcement

Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome…

273

TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training

Mar 2026 · 2603.01714
AgenticBenchmarksFine-TuningReinforcement

Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish…

274

Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

Mar 2026 · 2603.01465
ContextLong-Horizon

Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely…

275

TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents

Mar 2026 · 2603.01241
ReasoningBenchmarksReinforcementSafety

Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures…

276

Can AI Agents Agree?

Mar 2026 · 2603.01213
BenchmarksEmergent

Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM-based agents on a Byzantine consensus game over scalar values using a synchronous all-to-all simulation. We test consensus in a no-stake setting where agents have no…

277

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

Mar 2026 · 2603.01152
ReasoningBenchmarksFine-TuningReinforcement

Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source…

278

Position: AI Agents Are Not (Yet) a Panacea for Social Simulation

Feb 2026 · 2603.00113
Multi-Agent

Recent advances in large language models (LLMs) have spurred growing interest in using LLM-integrated agents for social simulation, often under the implicit assumption that realistic population dynamics will emerge once role-specified agents are placed in a networked multi-agent setting. This position paper argues that LLM-based agents are not…

279

RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Mar 2026 · 2603.05818
PlanningReasoningBenchmarksInference

Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead…

280

X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

Mar 2026 · 2603.05290
ReasoningBenchmarksReinforcement

Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using…

281

Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure

Mar 2026 · 2603.05028
AgenticBenchmarks

As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state-of-the-art LLMs can misbehave under survival pressure, a comprehensive and in-depth…

282

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning

Mar 2026 · 2603.03379
MemoryReasoningBenchmarksReinforcement

As Large Language Models (LLMs) are increasingly used for long-duration tasks, maintaining effective long-term memory has become a critical challenge. Current methods often face a trade-off between cost and accuracy. Simple storage methods often fail to retrieve relevant information, while complex indexing methods (such as memory graphs) require…

283

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks

Mar 2026 · 2603.01291
BenchmarksSafety

Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models…

284

SphUnc: Hyperspherical Uncertainty Decomposition and Causal Identification via Information Geometry

Mar 2026 · 2603.01168
ReasoningBenchmarksMulti-Agent

Reliable decision-making in complex multi-agent systems requires calibrated predictions and interpretable uncertainty. We introduce SphUnc, a unified framework combining hyperspherical representation learning with structural causal modeling. The model maps features to unit hypersphere latents using von Mises-Fisher distributions, decomposing…

285

Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Feb 2026 · 2602.23391
BenchmarksFine-TuningInferencePrompting

Large language models (LLMs) trained on webscale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based…

286

GCAgent: Enhancing Group Chat Communication through Dialogue Agents System

Mar 2026 · 2603.05240

As a key form in online social platforms, group chat is a popular space for interest exchange or problem-solving, but its effectiveness is often hindered by inactivity and management challenges. While recent large language models (LLMs) have powered impressive one-to-one conversational agents, their seamless integration into multi-participant…

287

MOOSEnger -- a Domain-Specific AI Agent for the MOOSE Ecosystem

Mar 2026 · 2603.04756
RAGBenchmarksArchitectureInference

MOOSEnger is a tool-enabled AI agent tailored to the Multiphysics Object-Oriented Simulation Environment (MOOSE). MOOSE cases are specified in HIT ".i" input files; the large object catalog and strict syntax make initial setup and debugging slow. MOOSEnger offers a conversational workflow that turns natural-language intent into runnable inputs by…

288

GIANT - Global Path Integration and Attentive Graph Networks for Multi-Agent Trajectory Planning

Mar 2026 · 2603.04659
PlanningMulti-Agent

This paper presents a novel approach to multi-robot collision avoidance that integrates global path planning with local navigation strategies, utilizing attentive graph neural networks to manage dynamic interactions among agents. We introduce a local navigation model that leverages pre-planned global paths, allowing robots to adhere to optimal…

289

Agentified Assessment of Logical Reasoning Agents

Mar 2026 · 2603.02788
ReasoningBenchmarks

We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under…

290

How Well Does Agent Development Reflect Real-World Work?

Mar 2026 · 2603.01203
BenchmarksSafety

AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark…

291

ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

Feb 2026 · 2603.00822
BenchmarksSafety

As Large Language Model (LLM) agents increasingly execute complex, autonomous software engineering tasks, developers rely on natural language Agent Instructions (e.g., AGENTS.md) to enforce project-specific coding conventions, tooling, and architectural boundaries. However, these instructions are passive text. Agents frequently deviate from them…

292

K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control

Feb 2026 · 2603.00676
Long-HorizonPlanningBenchmarksReinforcement

Existing mobile device control agents often perform poorly when solving complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose K2-Agent, a hierarchical framework that models human-like cognition by separating and co-evolving…

293

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Feb 2026 · 2603.00610
BenchmarksReinforcementSafetyInference

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music…

294

SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Mar 2026 · 2603.06570
ReasoningBenchmarksMulti-AgentFine-Tuning

Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely…

295

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Mar 2026 · 2603.04738
BenchmarksSafety

Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their…

296

Optimizing Language Models for Crosslingual Knowledge Consistency

Mar 2026 · 2603.04678
BenchmarksReinforcementSafety

Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement…

297

iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation

Mar 2026 · 2603.04476
BenchmarksFine-TuningInference

Modern EDA flows rely heavily on Tcl scripting, yet general LLMs perform poorly in this domain due to extreme data scarcity, domain-specific semantics, and the high reliability required in physical design. We present iScript, a domain-adapted Qwen3-8B model for Innovus Tcl script generation, and iScript-Bench, a comprehensive benchmark covering…

298

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

Mar 2026 · 2603.01630
AgenticBenchmarksFine-TuningSafety

As autonomous systems, such as drones, become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failure to do so poses imminent danger to human lives and long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of…

299

Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring

Mar 2026 · 2603.01557
BenchmarksSafetyPrompting

Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving…

300

Can Thinking Models Think to Detect Hateful Memes?

Mar 2026 · 2603.01225
ReasoningBenchmarksReinforcement

Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We…

301

Semantic XPath: Structured Agentic Memory Access for Conversational AI

Mar 2026 · 2603.01160
MemoryAgenticReinforcement

Conversational AI (ConvAI) agents increasingly maintain structured memory to support long-term, task-oriented interactions. In-context memory approaches append the growing history to the model input, which scales poorly under context-window limits. RAG-based methods retrieve request-relevant information, but most assume flat memory collections and…

302

AI Runtime Infrastructure

Feb 2026 · 2603.00495
MemoryLong-HorizonReasoningSafety

We introduce AI Runtime Infrastructure, a distinct execution-time layer that operates above the model and below the application, actively observing, reasoning over, and intervening in agent behavior to optimize task success, latency, token efficiency, reliability, and safety while the agent is running. Unlike model-level optimizations or passive…

303

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Feb 2026 · 2602.23945
ReasoningBenchmarksArchitecture

While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping…

304

SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision

Feb 2026 · 2602.23719
RAGPlanningArchitectureSafety

In UAV dynamic decision-making, complex and variable hazardous factors pose severe challenges to the generalization capability of algorithms. Despite offering semantic understanding and scene generalization, Large Language Models (LLMs) lack domain-specific UAV control knowledge and formal safety assurances, restricting their direct applicability. To…

305

SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

Mar 2026 · 2603.04833
Multi-AgentReinforcementInference

Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning when and who to communicate with requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable…

306

On the Suitability of LLM-Driven Agents for Dark Pattern Audits

Mar 2026 · 2603.03881

As LLM-driven agents begin to autonomously navigate the web, their ability to interpret and respond to manipulative interface design becomes critical. A fundamental question that emerges is: can such agents reliably recognize patterns of friction, misdirection, and coercion in interface design (i.e., dark patterns)? We study this question in a…

307

Multi-Agent Influence Diagrams to Hybrid Threat Modeling

Mar 2026 · 2603.03526
Multi-Agent

Western governments have adopted an assortment of counter-hybrid threat measures to defend against hostile actions below the conventional military threshold. The impact of these measures is unclear because of the ambiguity of hybrid threats, their cross-domain nature, and uncertainty about how countermeasures shape adversarial behavior. This paper…

308

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Mar 2026 · 2603.02798
AgenticSafety

As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet, existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establish…

309

HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Mar 2026 · 2603.02684
ReasoningBenchmarksSafety

Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap,…

310

Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Mar 2026 · 2603.02473
Memory

Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization…

311

Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment

Mar 2026 · 2603.02153
RAGBenchmarksInferenceKnowledge

Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness…

312

Exploration enhances cooperation in the multi-agent communication system

Mar 2026 · 2603.01401
Multi-AgentFine-Tuning

Designing protocols enhancing cooperation for multi-agent systems remains a grand challenge. Cheap talk, defined as costless, non-binding communication before formal action, serves as a pivotal solution. However, existing theoretical frameworks often exclude random exploration, or noise, for analytical tractability, leaving its functional impact…

313

Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows

Mar 2026 · 2603.06394
AgenticArchitecture

Large language models (LLMs) can now translate a researcher's plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing…

314

The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI

Mar 2026 · 2603.06290
AgenticRAGReasoningBenchmarks

Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce…

315

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Mar 2026 · 2603.04976
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss…

316

When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Mar 2026 · 2603.04968
Safety

Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly…

317

Neuro-Symbolic Financial Reasoning via Deterministic Fact Ledgers and Adversarial Low-Latency Hallucination Detector

Mar 2026 · 2603.04663
RAGReasoningArchitectureInference

Standard Retrieval-Augmented Generation (RAG) architectures fail in high-stakes financial domains due to two fundamental limitations: the inherent arithmetic incompetence of Large Language Models (LLMs) and the distributional semantic conflation of dense vector retrieval (e.g., mapping "Net Income" to "Net Sales" due to contextual proximity).…

318

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Mar 2026 · 2603.03790
ReasoningBenchmarksFine-TuningPrompting

Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore this, in this work, we first introduce Structure of Thought…

319

MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Mar 2026 · 2603.03596
MemoryLong-HorizonArchitecture

Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking…

320

SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models

Mar 2026 · 2603.03002
ReasoningBenchmarks

Genuine spatial reasoning relies on the capacity to construct and manipulate coherent internal spatial representations, often conceptualized as mental models, rather than merely processing surface linguistic associations. While large language models exhibit advanced capabilities across various domains, existing benchmarks fail to isolate this…

321

SEALing the Gap: A Reference Framework for LLM Inference Carbon Estimation via Multi-Benchmark Driven Embodiment

Mar 2026 · 2603.02949
BenchmarksInference

Large Language Models are rapidly gaining traction in software engineering, yet their growing carbon footprint raises pressing sustainability concerns. While training emissions are substantial, inference quickly surpasses them due to the sheer volume of prompts processed. This shift underscores the urgent need for accurate, prompt-level carbon…

322

CUCo: An Agentic Framework for Compute and Communication Co-design

Mar 2026 · 2603.02376
AgenticReinforcementInference

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation,…

323

SciDER: Scientific Data-centric End-to-end Researcher

Mar 2026 · 2603.01421
MemorySelf-ImprovingBenchmarks

Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional…

324

AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution

Mar 2026 · 2603.01145
MemoryAgenticReinforcementArchitecture

In practical LLM applications, users repeatedly express stable preferences and requirements, such as reducing hallucinations, following institutional writing conventions, or avoiding overly technical wording, yet such interaction experience is seldom consolidated into reusable knowledge. Consequently, LLM agents often fail to accumulate…

325

Wild-Drive: Off-Road Scene Captioning and Path Planning via Robust Multi-modal Routing and Efficient Large Language Model

Feb 2026 · 2603.00694
PlanningBenchmarksSafety

Explainability and transparent decision-making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes environmental conditions and risk factors in natural language, improving transparency, safety, and human-robot interaction. However, most existing approaches target structured urban scenarios; in off-road…

326

EMPA: Evaluating Persona-Aligned Empathy as a Process

Feb 2026 · 2603.00552
Long-HorizonMulti-AgentReinforcementSafety

Evaluating persona-aligned empathy in LLM-based dialogue agents remains challenging. User states are latent, feedback is sparse and difficult to verify in situ, and seemingly supportive turns can still accumulate into trajectories that drift from persona-specific needs. We introduce EMPA, a process-oriented framework that evaluates persona-aligned…

327

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

Feb 2026 · 2603.00539
Software DevBenchmarks

Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether code implementations satisfy task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can…

328

SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems

Feb 2026 · 2602.24235
PlanningBenchmarksFine-TuningReinforcement

Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM…

329

TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models

Feb 2026 · 2603.00206
ReasoningBenchmarks

Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern…

330

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Feb 2026 · 2602.23866
Software DevBenchmarksReinforcement

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for…

331

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

Feb 2026 · 2602.23802
ReasoningBenchmarksFine-TuningReinforcement

Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning…

332

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Mar 2026 · 2603.05912
Benchmarks

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult.…

333

The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

Mar 2026 · 2603.05910
Benchmarks

LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study…

334

iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Mar 2026 · 2603.04656
BenchmarksReinforcement

With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking,…

335

Discovering mathematical concepts through a multi-agent system

Mar 2026 · 2603.04528
BenchmarksMulti-AgentReinforcement

Mathematical concepts emerge through an interplay of processes, including experimentation, efforts at proof, and counterexamples. In this paper, we present a new multi-agent model for computational mathematical discovery based on this observation. Our system, conceived with research in mind, poses its own conjectures and then attempts to prove…

336

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Mar 2026 · 2603.04384
ReasoningBenchmarks

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that…

337

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Feb 2026 · 2603.00829

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We…

338

Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Feb 2026 · 2602.24277
BenchmarksReinforcement

Many readers today struggle to assess the trustworthiness of online news because reliable reporting coexists with misinformation. The TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track provided a venue for researchers to develop and evaluate assistive RAG systems that support readers' news…

339

Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals

Feb 2026 · 2602.23899
MemoryBenchmarksMulti-AgentArchitecture

We propose an experience-guided cascaded multi-agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD-Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two-stage, selective decision-making process. A lightweight 'screening clinic' agent,…

340

DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

Feb 2026 · 2602.23438
BenchmarksReinforcementInference

Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not…

341

Diffusion Language Models Are Natively Length-Aware

Mar 2026 · 2603.06123
ContextSoftware DevReasoningBenchmarks

Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in…

342

Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

Mar 2026 · 2603.05900
ReasoningBenchmarksFine-TuningReinforcement

Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step…

343

Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis

Mar 2026 · 2603.05698
RAGBenchmarksKnowledge

Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However,…

344

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Mar 2026 · 2603.04857
Software DevBenchmarksReinforcement

Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation…

345

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Mar 2026 · 2603.04639
MemoryLong-HorizonBenchmarks

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized…

346

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

Mar 2026 · 2603.04304
Software DevReasoningBenchmarksInference

Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among…

347

VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments

Mar 2026 · 2603.04277
ReasoningBenchmarksSafety

Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to…

348

Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects

Mar 2026 · 2603.03915
MemoryBenchmarks

Large language models (LLMs) have demonstrated significant potential in developing Role-Playing Agents (RPAs). However, current research primarily evaluates RPAs using famous fictional characters, allowing models to rely on memory associated with character names. This dependency creates a bias that limits the generalization of RPAs to unseen…

349

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

Mar 2026 · 2603.02951
BenchmarksFine-TuningReinforcement

Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while…

350

Retrieval-Augmented Robots via Retrieve-Reason-Act

Mar 2026 · 2603.02688
MemoryLong-HorizonRAGPlanning

To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by…

351

LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Mar 2026 · 2603.02146
ReasoningBenchmarksReinforcement

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual…

352

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Mar 2026 · 2603.01562
BenchmarksReinforcementSafety

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the…

353

Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Mar 2026 · 2603.01426
MemoryContextReasoningBenchmarks

As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic…

354

Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning

Mar 2026 · 2603.01326
ReasoningBenchmarksArchitectureInference

Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning…

355

Thoth: Mid-Training Bridges LLMs to Time Series Understanding

Mar 2026 · 2603.01042
ReasoningBenchmarksSafety

Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with…

356

When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning

Feb 2026 · 2603.02266
ReasoningBenchmarksReinforcementInference

Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), an unintuitive phenomenon exists: post-training models for structured reasoning trajectories results in marginal or even negative gains compared to post-training for direct answering.…

357

Social-JEPA: Emergent Geometric Isomorphism

Feb 2026 · 2603.02263
Safety

World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent…

358

Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning

Feb 2026 · 2603.00511
RAGInference

Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly…

359

CoPeP: Benchmarking Continual Pretraining for Protein Language Models

Feb 2026 · 2603.00253
BenchmarksReinforcement

Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and…

360

Controllable Reasoning Models Are Private Thinkers

Feb 2026 · 2602.24210
ReasoningBenchmarksFine-Tuning

AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under…

361

SkillNet: Create, Evaluate, and Connect AI Skills

Feb 2026 · 2603.04448
BenchmarksReinforcementSafety

Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in isolated contexts without leveraging prior…

362

RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Mar 2026 · 2603.03541
RAGBenchmarksSafety

Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only…

363

Architecting Trust in Artificial Epistemic Agents

Mar 2026 · 2603.02960
BenchmarksMulti-AgentReinforcement

Large language models increasingly function as epistemic agents: entities that can 1) autonomously pursue epistemic goals and 2) actively shape our shared knowledge environment. They curate the information we receive, often supplanting traditional search-based methods, and are frequently used to generate both personal and deeply specialized…

364

FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures

Mar 2026 · 2603.01910
RAGKnowledge

This system paper describes our participation in the SemEval-2025 Task-7 "Everyday Knowledge Across Diverse Languages and Cultures". We participated in two subtasks: Track 1, Short Answer Questions (SAQ), and Track 2, Multiple-Choice Questions (MCQ). Our method is retrieval-augmented generation (RAG) with open-sourced smaller LLMs…

365

Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents

Feb 2026 · 2603.00476
PlanningBenchmarks

Browser-use agents are widely used for everyday tasks. They enable automated interaction with web pages through structured DOM-based interfaces or vision-language models operating on page screenshots. However, web pages often change between planning and execution, causing agents to execute actions based on stale assumptions. We view this temporal…

366

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Feb 2026 · 2602.24040
BenchmarksReinforcement

Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via…

367

HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models

Mar 2026 · 2603.05828
BenchmarksReinforcementSafety

Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external…

368

PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models

Mar 2026 · 2603.05776
BenchmarksFine-Tuning

Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in…

369

Real-Time AI Service Economy: A Framework for Agentic Computing Across the Continuum

Mar 2026 · 2603.05614
AgenticArchitectureInference

Real-time AI services increasingly operate across the device-edge-cloud continuum, where autonomous AI agents generate latency-sensitive workloads, orchestrate multi-stage processing pipelines, and compete for shared resources under policy and governance constraints. This article shows that the structure of service-dependency graphs, modelled as…

370

An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs

Mar 2026 · 2603.05400
ReasoningBenchmarksFine-Tuning

Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands…

371

CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain

Mar 2026 · 2603.05569
RAGReasoning

Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, creating a barrier for healthcare decision-making and research. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the…

372

K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation

Mar 2026 · 2603.04868
ReasoningFine-Tuning

Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models (LLMs) show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, we propose K-Gen, an interpretable…

373

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Mar 2026 · 2603.04763
ReasoningBenchmarksReinforcement

The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary…

374

Right in Time: Reactive Reasoning in Regulated Traffic Spaces

Mar 2026 · 2603.03977
AgenticReasoningSafetyInference

Exact inference in probabilistic First-Order Logic offers a promising yet computationally costly approach for regulating the behavior of autonomous agents in shared traffic spaces. While prior methods have combined logical and probabilistic data into decision-making frameworks, their application is often limited to pre-flight checks due to the…

375

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Mar 2026 · 2603.03258
Agentic

The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents' tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models…

376

TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health

Mar 2026 · 2603.03047
BenchmarksReinforcementSafety

While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific…

377

From Heuristic Selection to Automated Algorithm Design: LLMs Benefit from Strong Priors

Mar 2026 · 2603.02792
Benchmarks

Large Language Models (LLMs) have already been widely adopted for automated algorithm design, demonstrating strong abilities in generating and evolving algorithms across various fields. Existing work has largely focused on examining their effectiveness in solving specific problems, with search strategies primarily guided by adaptive prompt…

378

From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Mar 2026 · 2603.02775
Benchmarks

Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to…

379

Causal Learning Should Embrace the Wisdom of the Crowd

Mar 2026 · 2603.02678
Reinforcement

Learning causal structures typically represented by directed acyclic graphs (DAGs) from observational data is notoriously challenging due to the combinatorial explosion of possible graphs and inherent ambiguities in observations. This paper argues that causal learning is now ready for the emergence of a new paradigm supported by rapidly advancing…

380

Think, But Don't Overthink: Reproducing Recursive Language Models

Mar 2026 · 2603.02615
AgenticReasoningBenchmarks

This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and…

381

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Mar 2026 · 2603.02578
BenchmarksInference

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and…

382

AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Mar 2026 · 2603.03378
ReasoningBenchmarksMulti-AgentReinforcement

Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from…

383

ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models

Mar 2026 · 2603.01124
ReasoningBenchmarksReinforcementSafety

Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but…

384

Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

Feb 2026 · 2603.00590
BenchmarksReinforcement

As artificial intelligence (AI) is increasingly deployed across domains, ensuring fairness has become a core challenge. However, the field faces a "Tower of Babel" dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms, particularly in unified Multimodal Large Language Models…

385

How Well Do Multimodal Models Reason on ECG Signals?

Feb 2026 · 2603.00312
AgenticReasoningBenchmarksSafety

While multimodal large language models offer a promising solution to the "black box" nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are either unscalable, relying on manual clinician review, or superficial, utilizing proxy metrics…

386

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Feb 2026 · 2602.24173
Benchmarks

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research…

387

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Feb 2026 · 2602.23898
ReasoningBenchmarks

Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few…

388

Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off

Feb 2026 · 2602.23730
ReasoningBenchmarksArchitecture

Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly for underrepresented regions. In this report, we introduce the research preview of MERaLiON2-Omni (Alpha), a 10B-parameter multilingual…

389

Theory of Code Space: Do Code Agents Understand Software Architecture?

Feb 2026 · 2603.00601
BenchmarksFine-TuningReinforcementArchitecture

AI code agents excel at isolated tasks yet struggle with multi-file software engineering requiring architectural understanding. We introduce Theory of Code Space (ToCS), a benchmark that evaluates whether agents can construct, maintain, and update coherent architectural beliefs during codebase exploration. Agents explore procedurally generated…

390

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Feb 2026 · 2603.04443
MemoryInference

Long-running LLM agents require persistent memory to preserve state across interactions, yet most deployed systems manage memory with age-based retention (e.g., TTL). While TTL bounds item lifetime, it does not bound the computational footprint of memory on the request path: as retained items accumulate, retrieval candidate sets and vector…

391

Adapter-Augmented Bandits for Online Multi-Constrained Multi-Modal Inference Scheduling

Mar 2026 · 2603.06403
Long-HorizonReasoningBenchmarksFine-Tuning

Multi-modal large language model (MLLM) inference scheduling enables strong response quality under practical and heterogeneous budgets, beyond what a homogeneous single-backend setting can offer. Yet online MLLM task scheduling is nontrivial, as requests vary sharply in modality composition and latent reasoning difficulty, while execution backends…

392

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

Mar 2026 · 2603.06222
ReasoningBenchmarksSafetyInference

Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning…

393

VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

Mar 2026 · 2603.06148
ReasoningBenchmarks

Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded…

394

Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling

Mar 2026 · 2603.05933
ReasoningFine-TuningInference

Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, equipping small Language Models (SLMs) with highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to…

395

Computational Pathology in the Era of Emerging Foundation and Agentic AI -- International Expert Perspectives on Clinical Integration and Translational Readiness

Mar 2026 · 2603.05884
AgenticBenchmarksArchitecture

Recent breakthroughs in artificial intelligence through foundation models and agents have accelerated the evolution of computational pathology. Demonstrated performance gains reported across academia in benchmarking datasets in predictive tasks such as diagnosis, prognosis, and treatment response have ignited substantial enthusiasm for clinical…

396

Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

Mar 2026 · 2603.05642
ReasoningBenchmarksFine-TuningInference

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or large language models…

397

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Mar 2026 · 2603.04992
BenchmarksFine-TuningReinforcementSafety

The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts…

398

Behaviour Driven Development Scenario Generation with Large Language Models

Mar 2026 · 2603.04729
ReasoningBenchmarksReinforcementPrompting

This paper presents an evaluation of three LLMs, GPT-4, Claude 3, and Gemini, for automated Behaviour-Driven Development (BDD) scenario generation. To support this evaluation, we constructed a dataset of 500 user stories, requirement descriptions, and their corresponding BDD scenarios, drawn from four proprietary software products. We assessed…

399

Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models

Mar 2026 · 2603.04647
RAGSafetyInference

Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these…

400

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Mar 2026 · 2603.03657
ReasoningBenchmarks

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a…

401

ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization

Mar 2026 · 2603.02939
ReasoningFine-TuningReinforcement

Recent advancements in reinforcement fine-tuning have significantly improved the reasoning ability of large language models (LLMs). In particular, methods such as group relative policy optimization (GRPO) have demonstrated strong capabilities across various fields. However, applying LLMs to ship trajectory prediction remains largely unexplored. In…
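The "group relative" idea behind GRPO can be illustrated with a minimal sketch. It assumes the commonly described formulation in which each sampled response's reward is standardized against the mean and standard deviation of its own sampling group. The function name, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from ShipTraj-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, each with a scalar reward.
    print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value baseline is needed in this sketch.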

402

His2Trans: A Skeleton First Framework for Self Evolving C to Rust Translation with Historical Retrieval

Mar 2026 · 2603.02617
Self-ImprovingRAGBenchmarksReinforcement

Automated C-to-Rust migration encounters systemic obstacles when scaling from code snippets to industrial projects, mainly because build context is often unavailable ("dependency hell") and domain-specific evolutionary knowledge is missing. As a result, current LLM-based methods frequently cannot reconstruct precise type definitions under complex…

403

How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks

Mar 2026 · 2603.02156
MemoryReasoningBenchmarksArchitecture

Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control and data-plane functions. Although frontier-scale large language models (LLMs) such as…

404

Learning from Synthetic Data Improves Multi-hop Reasoning

Mar 2026 · 2603.02091
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All…

405

A Practical Guide to Streaming Continual Learning

Mar 2026 · 2603.01677

Continual Learning (CL) and Streaming Machine Learning (SML) study the ability of agents to learn from a stream of non-stationary data. Despite sharing some similarities, they address different and complementary challenges. While SML focuses on rapid adaptation after changes (concept drifts), CL aims to retain past knowledge when learning new…

406

S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature

Mar 2026 · 2603.00958
RAGBenchmarksInference

With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems' ability to impersonate fictional characters, and convey complex emotions or prosody. A promising direction to enhance character…

407

DRIV-EX: Counterfactual Explanations for Driving LLMs

Feb 2026 · 2603.00696
Reasoning

Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method…

408

LiTS: A Modular Framework for LLM Tree Search

Feb 2026 · 2603.00631
AgenticPlanningReasoning

LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to…
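The decomposition described above can be sketched as a toy example: a decorator-based registry holding a policy, a transition, and a reward component, plugged into a small BFS. Every name here (`REGISTRY`, `register`, the toy components, `bfs`) is hypothetical and chosen for illustration; this is not the LiTS API.

```python
from collections import deque

# Hypothetical registry keyed by component kind, then by name.
REGISTRY = {"policy": {}, "transition": {}, "reward": {}}


def register(kind, name):
    """Decorator that stores a component under REGISTRY[kind][name]."""
    def wrap(obj):
        REGISTRY[kind][name] = obj
        return obj
    return wrap


@register("policy", "digits")
def propose_actions(state):
    # Toy policy: always propose appending 1 or 2 to the sequence.
    return [1, 2]


@register("transition", "append")
def apply_action(state, action):
    # States are immutable tuples, so each child is a new tuple.
    return state + (action,)


@register("reward", "sum_is_target")
def score(state, target=4):
    return 1.0 if sum(state) == target else 0.0


def bfs(root, policy, transition, reward, max_depth=4):
    """Breadth-first search returning the first state with reward 1.0."""
    queue = deque([root])
    while queue:
        state = queue.popleft()
        if reward(state) == 1.0:
            return state
        if len(state) < max_depth:
            for action in policy(state):
                queue.append(transition(state, action))
    return None


if __name__ == "__main__":
    found = bfs(
        (),
        REGISTRY["policy"]["digits"],
        REGISTRY["transition"]["append"],
        REGISTRY["reward"]["sum_is_target"],
    )
    print(found)
```

Because the search function only sees three callables, a new domain would be added by registering new components under new names, without touching the search code.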

409

LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks

Feb 2026 · 2603.00490
Benchmarks

The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly…

410

MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Feb 2026 · 2602.23632
ReasoningBenchmarksFine-TuningReinforcement

Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these…

411

Learning to Generate Secure Code via Token-Level Rewards

Feb 2026 · 2602.23407
Software DevBenchmarksReinforcement

Large language models (LLMs) have demonstrated strong capabilities in code generation, yet they remain prone to producing security vulnerabilities. Existing approaches commonly suffer from two key limitations: the scarcity of high-quality security data and coarse-grained reinforcement learning reward signals. To address these challenges, we…

412

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

Feb 2026 · 2603.00164
AgenticBenchmarks

We introduce Reverse CAPTCHA, an evaluation framework that tests whether large language models follow invisible Unicode-encoded instructions embedded in otherwise normal-looking text. Unlike traditional CAPTCHAs that distinguish humans from machines, our benchmark exploits a capability gap: models can perceive Unicode control characters that are…
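To make the threat concrete, the following sketch flags characters that typically render as nothing on screen. It assumes that the invisible payload uses Unicode format characters (general category `Cf`, which includes zero-width spaces and tag characters); the excerpt above does not specify the paper's encoding schemes, and the helper names are hypothetical.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point, name) for each format-category character."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


def strip_invisible(text):
    """Drop every format-category character from the text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    sample = "hello\u200bworld"  # contains a zero-width space
    print(find_invisible(sample))
    print(strip_invisible(sample))
```

Stripping all `Cf` characters is a blunt filter: it also removes legitimate uses such as the zero-width joiner inside emoji sequences, so a production filter would likely need an allowlist.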

413

Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Mar 2026 · 2603.04238
RAGBenchmarks

Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim…
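For reference, the lexical baseline mentioned above can be sketched in a few lines. This is a minimal BM25 scorer assuming pre-tokenized, non-empty documents, the common defaults `k1=1.5` and `b=0.75`, and the "plus one inside the logarithm" IDF variant; none of these choices are taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Corpus-level weighting: rarer terms get a larger IDF.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["retrieval", "augmented", "generation"],
        ["image", "editing", "model"],
        ["retrieval", "retrieval", "benchmark"],
    ]
    print(bm25_scores(["retrieval"], corpus))
```

The score depends only on term overlap and corpus statistics, which is why such a scorer needs no training data, in contrast to the end-to-end retrievers the abstract compares against.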

414

Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling

Mar 2026 · 2603.03759
Multi-AgentReinforcementArchitecture

Many large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime, where the…

415

See and Remember: A Multimodal Agent for Web Traversal

Mar 2026 · 2603.02626
MemoryBenchmarksArchitecture

Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose the generally applicable V-GEMS (Visual Grounding and Explicit Memory System), a robust…

416

"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction

Mar 2026 · 2603.02050
PlanningReasoning

Human collaborators coordinate dynamically through process visibility and workspace awareness, yet AI agents typically either provide only final outputs or expose read-only execution processes (e.g., planning, reasoning) without interpreting concurrent user actions on shared artifacts. Building on mixed-initiative interaction principles, we…

417

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Mar 2026 · 2603.01209
ReasoningBenchmarksFine-Tuning

Tool-augmented LLMs are increasingly deployed as agents that interleave natural-language reasoning with executable Python actions, as in CodeAct-style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, the traces used to post-train these models rarely encode how interpreter state is managed. We…

418

The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents

Feb 2026 · 2603.00801
RAGBenchmarks

Language agents increasingly act as web-enabled systems that search, browse, and synthesize information from diverse sources. However, these sources can include unreliable or adversarial content, and the robustness of agents to adversarial ranking, where misleading information appears prominently in search results, remains poorly understood.…

419

PEPA: a Persistently Autonomous Embodied Agent with Personalities

Feb 2026 · 2603.00117
MemoryReasoningArchitecture

Living organisms exhibit persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long-term deployment in dynamic, unstructured environments where…

420

Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment

Feb 2026 · 2603.00115
ReasoningBenchmarksArchitecturePrompting

Accurate evaluation of building energy performance remains challenging in regions where scalable Energy Performance Certificate (EPC) assessments are unavailable. This paper presents a cost-efficient framework that leverages Vision-Language models for automated EPC pre-assessment from limited visual information. The proposed Multimodal Modular…

421

CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mar 2026 · 2603.06183
BenchmarksSafety

We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents…

422

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Mar 2026 · 2603.05863
Software DevReasoningBenchmarksReinforcement

While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on…

423

Evaluating LLM Alignment With Human Trust Models

Mar 2026 · 2603.05839
Multi-AgentReinforcementSafetyPrompting

Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision-making in both human interactions and multi-agent systems. Despite its significance, there is limited understanding of how large language models (LLMs) internally conceptualize and reason about trust. This work presents a white-box analysis…

424

Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs

Mar 2026 · 2603.05618
ReasoningBenchmarksInferencePrompting

Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i)…

425

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

Mar 2026 · 2603.05295
PlanningBenchmarksSafety

We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is…

426

SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

Mar 2026 · 2603.05275
ReasoningFine-TuningReinforcement

Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement learning based post-training framework that resists hallucination in multimodal reasoning. We reformulate…

427

Interactive Benchmarks

Mar 2026 · 2603.04737
Long-HorizonReasoningBenchmarks

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an…

428

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Mar 2026 · 2603.04601
Software DevBenchmarksSafetyInference

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964…

429

Why Do Neural Networks Forget: A Study of Collapse in Continual Learning

Mar 2026 · 2603.04580
BenchmarksArchitecture

Catastrophic forgetting is a major problem in continual learning, and many approaches have been proposed to reduce it. However, most of them are evaluated through task accuracy, which ignores the internal model structure. Recent research suggests that structural collapse leads to loss of plasticity, as evidenced by changes in effective rank (eRank). This…
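As a point of reference, effective rank is usually defined as the exponential of the Shannon entropy of the normalized singular values. The sketch below assumes that definition and takes the singular values as input; the paper's exact variant is not stated in this excerpt, and the function name is hypothetical.

```python
import math


def effective_rank(singular_values):
    """Return exp(Shannon entropy) of the normalized singular values."""
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0  # an all-zero matrix has no usable directions
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))  # evenly spread spectrum
    print(effective_rank([5.0, 0.0, 0.0]))       # a single dominant direction
```

An evenly spread spectrum gives an effective rank equal to the number of nonzero singular values, while a spectrum dominated by one value gives a result near 1, which is the kind of collapse the abstract refers to.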

430

Benchmarking Motivational Interviewing Competence of Large Language Models

Mar 2026 · 2603.03846
Benchmarks

Motivational interviewing (MI) promotes behavioural change in substance use disorders. Its fidelity is measured using the Motivational Interviewing Treatment Integrity (MITI) framework. While large language models (LLMs) can potentially generate MI-consistent therapist responses, their competence using MITI is not well-researched, especially in…

431

In-Context Environments Induce Evaluation-Awareness in Language Models

Mar 2026 · 2603.03824
Software DevReasoningBenchmarks

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent evaluation awareness. This raises concerns that models could strategically underperform, or sandbag, to avoid triggering capability-limiting interventions such as…

432

Can LLM Aid in Solving Constraints with Inductive Definitions?

Mar 2026 · 2603.03668
ReasoningBenchmarksReinforcement

Solving constraints involving inductive (aka recursive) definitions is challenging. State-of-the-art SMT/CHC solvers and first-order logic provers provide only limited support for solving such constraints, especially when they involve, e.g., abstract data types. In this work, we leverage structured prompts to elicit Large Language Models (LLMs) to…

433

A Covering Framework for Offline POMDPs Learning using Belief Space Metric

Mar 2026 · 2603.03191
MemoryBenchmarks

In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of…

434

OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Mar 2026 · 2603.02789
Benchmarks

Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a…

435

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Mar 2026 · 2603.04459
BenchmarksSafety

The rapid growth of research in LLM safety makes it hard to track all advances. Benchmarks are therefore crucial for capturing key trends and enabling systematic comparisons. Yet, it remains unclear why certain benchmarks gain prominence, and no systematic assessment has been conducted on their academic influence or code quality. This paper fills…

436

ExpGuard: LLM Content Moderation in Specialized Domains

Mar 2026 · 2603.02588
BenchmarksReinforcementSafety

With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and…

437

Selection as Power: Constrained Reinforcement for Bounded Decision Authority

Mar 2026 · 2603.02019
AgenticSafety

Selection as Power argued that upstream selection authority, rather than internal objective misalignment, constitutes a primary source of risk in high-stakes agentic systems. However, the original framework was static: governance constraints bounded selection power but did not adapt over time. In this work, we extend the framework to dynamic…

438

Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards

Mar 2026 · 2603.02008
MemoryFine-TuningReinforcement

Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the…

439

Efficient RLVR Training via Weighted Mutual Information Data Selection

Mar 2026 · 2603.01907
PlanningReasoningBenchmarksReinforcement

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly…

440

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

Mar 2026 · 2603.01865
BenchmarksReinforcement

LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings…

441

Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction

Mar 2026 · 2603.01423
Benchmarks

Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three…

442

Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain

Mar 2026 · 2603.01353
ReasoningBenchmarks

In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and…

443

An Open-Source Modular Benchmark for Diffusion-Based Motion Planning in Closed-Loop Autonomous Driving

Mar 2026 · 2603.01023
PlanningBenchmarksInference

Diffusion-based motion planners have achieved state-of-the-art results on benchmarks such as nuPlan, yet their evaluation within closed-loop production autonomous driving stacks remains largely unexplored. Existing evaluations abstract away ROS 2 communication latency and real-time scheduling constraints, while monolithic ONNX deployment freezes…

444

Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment

Feb 2026 · 2603.00649
BenchmarksReinforcementKnowledge

Assessing the correctness of patches generated by Automated Program Repair (APR) is a major bottleneck. Manual validation is labor-intensive and limited: exact matching overlooks valid variants, while semantic inspection is subjective and hard to reproduce. Existing Automated Patch Correctness Assessment (APCA) often relies on opaque predictive…

445

IDER: IDempotent Experience Replay for Reliable Continual Learning

Feb 2026 · 2603.00624
Benchmarks

Catastrophic forgetting, the tendency of neural networks to forget previously learned knowledge when learning new tasks, has been a major challenge in continual learning (CL). To tackle this challenge, CL methods have been proposed and shown to reduce forgetting. Furthermore, CL models deployed in mission-critical settings can benefit from…

446

Towards Neural Graph Data Management

Feb 2026 · 2603.05529
ReasoningBenchmarksReinforcement

While AI systems have made remarkable progress in processing unstructured text, structured data, such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to use effectively. We introduce NGDBench, a unified benchmark for evaluating neural graph database capabilities across five diverse domains,…

447

AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech

Feb 2026 · 2602.23649
Benchmarks

We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. AudioCapBench covers three distinct audio domains, including environmental sound, music, and speech, with 1,000 curated evaluation samples drawn from established datasets. We evaluate 13 models across two providers (OpenAI, Google Gemini)…

448

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Mar 2026 · 2603.04364
Self-ImprovingReasoningFine-TuningReinforcement

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive…

449

Contextualized Privacy Defense for LLM Agents

Mar 2026 · 2603.02983
ReinforcementPrompting

LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution. We…

450

What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty

Mar 2026 · 2603.02491
MemoryBenchmarks

As artificial agents become increasingly capable, what internal structure is *necessary* for an agent to act competently under uncertainty? Classical results show that optimal control can be *implemented* using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that low…

451

ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents

Mar 2026 · 2603.01620
ReasoningReinforcementInference

Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a…

452

Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents

Mar 2026 · 2603.01438
Fine-TuningSafetyInferencePrompting

The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic…

453

SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation

Mar 2026 · 2603.01024
BenchmarksReinforcementInference

A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal,…

454

AESP: A Human-Sovereign Economic Protocol for AI Agents with Privacy-Preserving Settlement

Feb 2026 · 2603.00318
BenchmarksInference

As AI agents increasingly perform economic tasks on behalf of humans, a fundamental tension arises between agent autonomy and human control over financial assets. We present the Agent Economic Sovereignty Protocol (AESP), a layered protocol in which agents transact autonomously at machine speed on crypto-native infrastructure while remaining…

455

CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

Mar 2026 · 2603.05911
ReasoningBenchmarksReinforcement

Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized…

456

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Mar 2026 · 2603.04727
ReasoningBenchmarksPrompting

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a…

457

AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

Mar 2026 · 2603.04718
AgenticBenchmarks

In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we…

458

Invariant Causal Routing for Governing Social Norms in Online Market Economies

Mar 2026 · 2603.04534
Reasoning

Social norms are stable behavioral patterns that emerge endogenously within economic systems through repeated interactions among agents. In online market economies, such norms -- like fair exposure, sustained participation, and balanced reinvestment -- are critical for long-term stability. We aim to understand the causal mechanisms driving these…

459

RVN-Bench: A Benchmark for Reactive Visual Navigation

Mar 2026 · 2603.03953
BenchmarksReinforcement

Safe visual navigation is critical for indoor mobile robots operating in cluttered environments. Existing benchmarks, however, often neglect collisions or are designed for outdoor scenarios, making them unsuitable for indoor visual navigation. To address this limitation, we introduce the reactive visual navigation benchmark (RVN-Bench), a…

460

CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

Mar 2026 · 2603.03884
BenchmarksReinforcement

Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels.…

461

Invariance-Based Dynamic Regret Minimization

Mar 2026 · 2603.03843
Reinforcement

We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings historical…

462

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Mar 2026 · 2603.03031
Reasoning

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating…

463

AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation

Mar 2026 · 2603.02542
BenchmarksSafetyInference

Autonomous driving systems require comprehensive evaluation in safety-critical scenarios to ensure safety and robustness. However, such scenarios are rare and difficult to collect from real-world driving data, necessitating simulation-based synthesis. Yet, existing methods often exhibit limitations in both controllability and realism. From a…

464

Revealing Positive and Negative Role Models to Help People Make Good Decisions

Mar 2026 · 2603.02495
Reinforcement

We consider a setting where agents take action by following their role models in a social network, and study strategies for a social planner to help agents by revealing whether the role models are positive or negative. Specifically, agents observe a local neighborhood of possible role models they can emulate, but do not know their true labels.…

465

Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition

Mar 2026 · 2603.01814
Software DevAgenticFine-TuningArchitecture

Implementing new features across an entire codebase presents a formidable challenge for Large Language Models (LLMs). This proactive task requires a deep understanding of the global system architecture to prevent unintended disruptions to legacy functionalities. Conventional pipeline and agentic frameworks often fall short in this area because…

466

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Mar 2026 · 2603.01425
ReasoningBenchmarksArchitectureSafety

LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To…

467

Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data

Mar 2026 · 2603.01289
MemoryRAGFine-Tuning

Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the…

468

Understanding LoRA as Knowledge Memory: An Empirical Analysis

Mar 2026 · 2603.01097
MemoryRAGReasoningFine-Tuning

Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these…

469

One-Token Verification for Reasoning Correctness Estimation

Mar 2026 · 2603.01025
ReasoningBenchmarksReinforcementInference

Recent breakthroughs in large language models (LLMs) have led to notable successes in complex reasoning tasks, such as mathematical problem solving. A common strategy for improving performance is parallel thinking, in which multiple reasoning traces are generated and the final prediction is made using aggregation schemes like majority voting or…

470

EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization

Mar 2026 · 2603.00978
Long-HorizonBenchmarksFine-TuningArchitecture

Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video…

471

Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages

Mar 2026 · 2603.00941
Benchmarks

Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires…

472

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Feb 2026 · 2602.24110
ReasoningFine-TuningReinforcement

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation that penalizes trajectories that are largely correct but fail due to several missteps as heavily as…

473

Blockchain-Enabled Routing for Zero-Trust Low-Altitude Intelligent Networks

Feb 2026 · 2602.23667
BenchmarksMulti-AgentArchitecture

Due to their scalability and portability, low-altitude intelligent networks (LAINs) are essential in various fields such as surveillance and disaster rescue. However, in LAINs, unmanned aerial vehicles (UAVs) are characterized by distributed topology and high mobility and are thus vulnerable to security threats, which may degrade routing performance…

474

Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Feb 2026 · 2602.23440
RAGReasoningBenchmarksReinforcement

Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval…

475

Alpha-RF: Automated RF-Filter-Circuit Design with Neural Simulator and Reinforcement Learning

Feb 2026 · 2603.00104
ReinforcementInference

Accurate, high-performance radio-frequency (RF) filter circuits are ubiquitous in radio-frequency communication and sensing systems for accepting and rejecting signals at desired frequencies. The conventional RF filter design process involves manual calculations of design parameters, followed by intuition-guided iterations to achieve the desired…

Added Feb 27, 2026
476

The Limits of Long-Context Reasoning in Automated Bug Fixing

Feb 2026 · 2602.16069
ContextSoftware DevAgenticPlanning

Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether…

477

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Feb 2026 · 2602.20867
Self-ImprovingLong-HorizonAgenticBenchmarks

Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans…

478

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Feb 2026 · 2602.22769
MemoryLong-HorizonAgenticBenchmarks

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on…

479

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Feb 2026 · 2602.19127
AgenticReasoningBenchmarksReinforcement

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide…

480

LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Feb 2026 · 2602.14337
Long-HorizonPlanningBenchmarksMulti-Agent

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and fail to rigorously evaluate the long-horizon planning and execution…

481

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Feb 2026 · 2602.19320
MemoryContextLong-HorizonAgentic

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation…

482

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Feb 2026 · 2602.21320
Self-ImprovingAgenticBenchmarksReinforcement

Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and…

483

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Feb 2026 · 2602.10975
Software DevAgenticBenchmarksReinforcement

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g.,…

484

Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Feb 2026 · 2602.21611
MemoryLong-HorizonSoftware DevReasoning

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving…

485

Alignment in Time: Peak-Aware Orchestration for Long-Horizon Agentic Systems

Feb 2026 · 2602.17910
Long-HorizonBenchmarksMulti-AgentSafety

Traditional AI alignment primarily focuses on individual model outputs; however, autonomous agents in long-horizon workflows require sustained reliability across entire interaction trajectories. We introduce APEMO (Affect-aware Peak-End Modulation for Orchestration), a runtime scheduling layer that optimizes computational allocation under fixed…

486

STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory

Feb 2026 · 2602.09255
MemoryLong-HorizonAgenticPlanning

Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and…

487

DREAM: Deep Research Evaluation with Agentic Metrics

Feb 2026 · 2602.18940
AgenticReasoningBenchmarksSafety

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can…

488

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Feb 2026 · 2602.23193
Long-HorizonPlanningMulti-AgentArchitecture

Autonomous agents based on Large Language Models (LLMs) have evolved from reactive assistants to systems capable of planning, executing actions via tools, and iterating over environment observations. However, they remain vulnerable to structural limitations: lack of native state, context degradation over long horizons, and the gap between…

489

Hippocampus: An Efficient and Scalable Memory Module for Agentic AI

Feb 2026 · 2602.13594
MemoryContextLong-HorizonAgentic

Agentic AI requires persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or hybrid), incurring high retrieval latency and poor storage scalability. We introduce Hippocampus, an agentic memory management system that uses compact…

490

AIDev: Studying AI Coding Agents on GitHub

Feb 2026 · 2602.09185
Software DevAgentic

AI coding agents are rapidly transforming software engineering by performing tasks such as feature development, debugging, and testing. Despite their growing impact, the research community lacks a comprehensive dataset capturing how these agents are used in real-world projects. To address this gap, we introduce AIDev, a large-scale dataset focused…

491

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Feb 2026 · 2602.16953
MemoryAgenticBenchmarksReinforcement

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals.…

492

PhyNiKCE: A Neurosymbolic Agentic Framework for Autonomous Computational Fluid Dynamics

Feb 2026 · 2602.11666
ContextAgenticRAGPlanning

The deployment of autonomous agents for Computational Fluid Dynamics (CFD) is critically limited by the probabilistic nature of Large Language Models (LLMs), which struggle to enforce the strict conservation laws and numerical stability required for physics-based simulations. Reliance on purely semantic Retrieval Augmented Generation (RAG) often…

493

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Feb 2026 · 2602.22576
AgenticRAGReasoningBenchmarks

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse…

494

SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

Feb 2026 · 2602.09540
Software DevBenchmarks

Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full…

495

Sci-VLA: Agentic VLA Inference Plugin for Long-Horizon Tasks in Scientific Experiments

Feb 2026 · 2602.09430
Long-HorizonAgenticInference

Robotic laboratories play a critical role in autonomous scientific discovery by enabling scalable, continuous experimental execution. Recent vision-language-action (VLA) models offer a promising foundation for robotic laboratories. However, scientific experiments typically involve long-horizon tasks composed of multiple atomic tasks, posing a…

496

AMEM4Rec: Leveraging Cross-User Similarity for Memory Evolution in Agentic LLM Recommenders

Feb 2026 · 2602.08837
MemoryContextAgenticReasoning

Agentic systems powered by Large Language Models (LLMs) have shown strong potential in recommender systems but remain hindered by several challenges. Fine-tuning LLMs is parameter-inefficient, and prompt-based agentic reasoning is limited by context length and hallucination risk. Moreover, existing agentic recommendation systems predominantly…

497

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Feb 2026 · 2602.16901
MemoryLong-HorizonAgenticBenchmarks

LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon attacks that exploit multi-turn user-agent-environment interactions to achieve objectives infeasible in single-turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB,…

498

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Feb 2026 · 2602.16313
MemoryAgenticPlanningReasoning

Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for…

499

Self-Evolving Multi-Agent Network for Industrial IoT Predictive Maintenance

Feb 2026 · 2602.16738
MemorySelf-ImprovingBenchmarksMulti-Agent

Industrial IoT predictive maintenance requires systems capable of real-time anomaly detection without sacrificing interpretability or demanding excessive computational resources. Traditional approaches rely on static, offline-trained models that cannot adapt to evolving operational conditions, while LLM-based monolithic systems demand prohibitive…

500

TRACE: Temporal Reasoning via Agentic Context Evolution for Streaming Electronic Health Records (EHRs)

Feb 2026 · 2602.12833
MemoryContextAgenticRAG

Large Language Models (LLMs) encode extensive medical knowledge but struggle to apply it reliably to longitudinal patient trajectories, where evolving clinical states, irregular timing, and heterogeneous events degrade performance over time. Existing adaptation strategies rely on fine-tuning or retrieval-based augmentation, which introduce…

501

ACE-RTL: When Agentic Context Evolution Meets RTL-Specialized LLMs

Feb 2026 · 2602.10218
Software DevAgenticReasoningBenchmarks

Recent advances in large language models (LLMs) have sparked growing interest in applying them to hardware design automation, particularly for accurate RTL code generation. Prior efforts follow two largely independent paths: (i) training domain-adapted RTL models to internalize hardware semantics, (ii) developing agentic systems that leverage…

502

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Feb 2026 · 2602.22817
MemoryLong-HorizonAgenticBenchmarks

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a…

503

SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Feb 2026 · 2602.22603
MemoryLong-HorizonAgenticReasoning

Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist…

504

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Feb 2026 · 2602.22675
Long-HorizonAgenticReasoningBenchmarks

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon…

505

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

Feb 2026 · 2602.13692
MemoryAgenticBenchmarksReinforcement

Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and…

506

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Feb 2026 · 2602.12984
Long-HorizonAgenticReasoningBenchmarks

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across…

507

PRISM: A Principled Framework for Multi-Agent Reasoning via Gain Decomposition

Feb 2026 · 2602.08586
Software DevReasoningBenchmarksMulti-Agent

Multi-agent collaboration has emerged as a promising paradigm for enhancing reasoning capabilities of Large Language Models (LLMs). However, existing approaches remain largely heuristic, lacking principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. Specifically, it remains unclear why…

508

Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections

Feb 2026 · 2602.15654
MemorySelf-ImprovingLong-HorizonRAG

Self-evolving LLM agents update their internal state across sessions, often by writing and reusing long-term memory. This design improves performance on long-horizon tasks but creates a security risk: untrusted external content observed during a benign session can be stored as memory and later treated as instruction. We study this risk and…

509

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Feb 2026 · 2602.11988
Software DevFine-TuningInference

A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective…

510

Artisan: Agentic Artifact Evaluation

Feb 2026 · 2602.10046
Software DevAgenticBenchmarksReinforcement

Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor-intensive, and hence, done only as a one-time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM…

511

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Feb 2026 · 2602.21670
Long-HorizonPlanningBenchmarksMulti-Agent

Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and…

512

The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why -- A Survey from MARL to Emergent Language and LLMs

Feb 2026 · 2602.11583
PlanningReasoningMulti-AgentReinforcement

Multi-agent sequential decision-making powers many real-world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic, partially observable environments, communication is often what reduces uncertainty and makes collaboration possible. This survey reviews multi-agent communication (MA-Comm) through the Five Ws:…

513

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Feb 2026 · 2602.09463
ReasoningBenchmarksMulti-AgentFine-Tuning

Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded…

514

Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents

Feb 2026 · 2602.22764
Software DevAgenticBenchmarks

The Rust programming language presents a steep learning curve and significant coding challenges, making the automation of issue resolution essential for its broader adoption. Recently, LLM-powered code agents have shown remarkable success in resolving complex software engineering tasks, yet their application to Rust has been limited by the absence…

515

Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence

Feb 2026 · 2602.20934
MemoryContextSelf-ImprovingReasoning

The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering, the theoretical bridge between micro scale token processing and macro scale systemic intelligence…

516

IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents

Feb 2026 · 2602.17049
MemoryLong-HorizonPlanningBenchmarks

Computer-use agents operate over long horizons under noisy perception, multi-window contexts, and evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent…

517

In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

Feb 2026 · 2602.13156
AgenticPlanningReasoningFine-Tuning

Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which involves learning response strategies through extensive simulation of the incident. While this approach can be effective, it requires handcrafted…

518

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Feb 2026 · 2602.17547
Long-HorizonSoftware DevAgenticBenchmarks

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce…

519

AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation

Feb 2026 · 2602.17100
Software DevReasoningMulti-AgentReinforcement

Large language model (LLM)-driven multi-agent systems (MAS) coordinate specialized agents through predefined interaction topologies and have shown promise for complex tasks such as competition-level code generation. Recent studies demonstrate that carefully designed multi-agent workflows and communication graphs can significantly improve code…

520

From Prompt-Response to Goal-Directed Systems: The Evolution of Agentic AI Software Architecture

Feb 2026 · 2602.10479
MemoryPlanningReasoningMulti-Agent

Agentic AI denotes an architectural transition from stateless, prompt-driven generative models toward goal-directed systems capable of autonomous perception, planning, action, and adaptation through iterative control loops. This paper examines this transition by connecting foundational intelligent agent theories, including reactive, deliberative,…

521

TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

Feb 2026 · 2602.10471
Software DevAgenticBenchmarksFine-Tuning

Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground…

522

Reasoning-Driven Design of Single Atom Catalysts via a Multi-Agent Large Language Model Framework

Feb 2026 · 2602.21533
ReasoningMulti-AgentPrompting

Large language models (LLMs) are increasingly applied beyond natural language processing, demonstrating strong capabilities in complex scientific tasks that traditionally require human expertise. This progress has extended into materials discovery, where LLMs introduce a new paradigm by leveraging reasoning and in-context learning,…

523

A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Feb 2026 · 2602.14158
Reasoning · Benchmarks · Multi-Agent · Reinforcement

Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to…

524

AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping

Feb 2026 · 2602.12315
Agentic · Benchmarks · Fine-Tuning

The proliferation of e-commerce has made web shopping platforms key gateways for customers navigating the vast digital marketplace. Yet this rapid expansion has led to a noisy and fragmented information environment, increasing cognitive burden as shoppers explore and purchase products online. With promising potential to alleviate this challenge,…

525

Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy

Feb 2026 · 2602.11897
Self-Improving · Multi-Agent · Reinforcement · Architecture

Contemporary AI-driven cybersecurity systems are predominantly architected as model-centric detection and automation pipelines optimized for task-level performance metrics such as accuracy and response latency. While effective for bounded classification tasks, these architectures struggle to support accountable decision-making under adversarial…

526

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Feb 2026 · 2602.11103
Software Dev · Agentic · Benchmarks · Reinforcement

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases…

527

Pancake: Hierarchical Memory System for Multi-Agent LLM Serving

Feb 2026 · 2602.21477
Memory · Multi-Agent · Inference

In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large-scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high-cost approximate nearest neighbor (ANN) searching problems. We present Pancake, a multi-tier agentic memory system that unifies three…

528

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Feb 2026 · 2602.16246
Agentic · Reasoning · Benchmarks · Reinforcement

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends,…

529

Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Feb 2026 · 2602.11224
Agentic · Benchmarks

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade-offs between a…

530

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Feb 2026 · 2602.09514
Memory · Long-Horizon · Planning · Benchmarks

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in…

531

TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

Feb 2026 · 2602.16429
Long-Horizon · Agentic · Benchmarks · Architecture

Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative…

532

Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization

Feb 2026 · 2602.15028
Context · Benchmarks · Reinforcement · Architecture

Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences…

533

OSCAR: Optimization-Steered Agentic Planning for Composed Image Retrieval

Feb 2026 · 2602.08603
Agentic · Planning · Reasoning · Benchmarks

Composed image retrieval (CIR) requires complex reasoning over heterogeneous visual and textual constraints. Existing approaches largely fall into two paradigms: unified embedding retrieval, which suffers from single-model myopia, and heuristic agentic retrieval, which is limited by suboptimal, trial-and-error orchestration. To this end, we…

534

Limited Reasoning Space: The cage of long-horizon reasoning in LLMs

Feb 2026 · 2602.19281
Long-Horizon · Planning · Reasoning

Test-time compute strategies, such as Chain-of-Thought (CoT), have significantly enhanced the ability of large language models to solve complex tasks like logical reasoning. However, empirical studies indicate that simply increasing the compute budget can sometimes lead to a collapse in test-time performance when employing typical task…

535

REMem: Reasoning with Episodic Memory in Language Agent

Feb 2026 · 2602.13530
Memory · Agentic · Reasoning · Benchmarks

Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not yet capable of effectively recollecting and reasoning over interaction histories. We identify…

536

FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

Feb 2026 · 2602.22963
Agentic · Reasoning · Fine-Tuning · Reinforcement

Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external…

537

ClawMobile: Rethinking Smartphone-Native Agentic Systems

Feb 2026 · 2602.22942
Agentic · Planning · Reasoning · Fine-Tuning

Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving…

538

Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

Feb 2026 · 2602.13653
Agentic · Benchmarks · Reinforcement

Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this…

539

LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News

Feb 2026 · 2602.13543
Agentic · Reasoning · Benchmarks · Reinforcement

Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce LiveNewsBench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. LiveNewsBench…

540

Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas

Feb 2026 · 2602.08765
Software Dev · Benchmarks · Multi-Agent

LLM-based tools are automating more software development tasks at a rapid pace, but there is no rigorous way to evaluate how different architectural choices -- prompts, skills, tools, multi-agent setups -- materially affect both capability and cost. This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools through…

541

General Agent Evaluation

Feb 2026 · 2602.22953
Agentic · Benchmarks

The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their…

542

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Feb 2026 · 2602.22638
Agentic · Planning · Benchmarks · Reinforcement

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services,…

543

$φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

Feb 2026 · 2602.22601
Benchmarks

Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting,…

544

PhGPO: Pheromone-Guided Policy Optimization for Long-Horizon Tool Planning

Feb 2026 · 2602.13691
Long-Horizon · Agentic · Planning · Fine-Tuning

Recent advancements in Large Language Model (LLM) agents have demonstrated strong capabilities in executing complex tasks through tool use. However, long-horizon multi-step tool planning is challenging, because the exploration space suffers from a combinatorial explosion. In this scenario, even when a correct tool-use path is found, it is usually…

545

Secure and Energy-Efficient Wireless Agentic AI Networks

Feb 2026 · 2602.15212
Agentic · Reasoning · Benchmarks · Inference

In this paper, we introduce a secure wireless agentic AI network comprising one supervisor AI agent and multiple other AI agents to provision quality of service (QoS) for users' reasoning tasks while ensuring confidentiality of private knowledge and reasoning outcomes. Specifically, the supervisor AI agent can dynamically assign other AI agents to…

546

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Feb 2026 · 2602.12268
Agentic · Reasoning · Benchmarks · Fine-Tuning

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step…

547

Beyond Context Sharing: A Unified Agent Communication Protocol (ACP) for Secure, Federated, and Autonomous Agent-to-Agent (A2A) Orchestration

Feb 2026 · 2602.15055
Reasoning · Benchmarks · Multi-Agent · Architecture

The artificial intelligence field is transitioning from isolated large language models to autonomous agents capable of complex reasoning and tool use. While foundational architectures and local context management protocols have been established, the challenge of cross-platform, decentralized, and secure interaction remains a significant…

548

Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents

Feb 2026 · 2602.10226
Self-Improving · Reasoning · Architecture · Inference

Optimizing large-scale machine learning systems, such as recommendation models for global video platforms, requires navigating a massive hyperparameter search space and, more critically, designing sophisticated optimizers, architectures, and reward functions to capture nuanced user behaviors. Achieving substantial improvements in these areas is a…

549

SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Feb 2026 · 2602.09447
Long-Horizon · Agentic · Reasoning · Benchmarks

Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in…

550

FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

Feb 2026 · 2602.09163
RAG · Reasoning · Benchmarks · Multi-Agent

Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a…

551

VeRO: An Evaluation Harness for Agents to Optimize Agents

Feb 2026 · 2602.22480
Software Dev · Reasoning · Benchmarks · Reinforcement

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering:…

552

Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Feb 2026 · 2602.21447
Agentic · Planning · Reasoning · Benchmarks

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components. We formulate this security challenge as a Partially Observable Markov Decision Process (POMDP), where adversarial intent is a latent variable inferred from noisy…

553

AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Feb 2026 · 2602.11348
Agentic · Benchmarks · Architecture · Inference

Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation…

554

EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems

Feb 2026 · 2602.10171
Memory · Self-Improving · Software Dev · Benchmarks

As large language models (LLMs) continue to advance in programming tasks, LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement during inference. However, existing code benchmarks primarily emphasize static correctness and implicitly assume fixed model capability during…

555

Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Feb 2026 · 2602.08847
Reasoning · Benchmarks · Multi-Agent · Reinforcement

Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style…

556

PyVision-RL: Forging Open Agentic Vision Models via RL

Feb 2026 · 2602.20739
Agentic · Reasoning · Reinforcement

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction.…

557

Structured Prompt Language: Declarative Context Management for LLMs

Feb 2026 · 2602.21257
Memory · Context · Agentic · RAG

We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic query optimizer, EXPLAIN transparency analogous to SQL's EXPLAIN ANALYZE, and…

558

Agentic Problem Frames: A Systematic Approach to Engineering Reliable Domain Agents

Feb 2026 · 2602.19065
Agentic · Reasoning · Benchmarks

Large Language Models (LLMs) are evolving into autonomous agents, yet current "frameless" development--relying on ambiguous natural language without engineering blueprints--leads to critical risks such as scope creep and open-loop failures. To ensure industrial-grade reliability, this study proposes Agentic Problem Frames (APF), a systematic…

559

Agentic Spatio-Temporal Grounding via Collaborative Reasoning

Feb 2026 · 2602.13313
Memory · Agentic · Reasoning · Benchmarks

Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization.…

560

CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute

Feb 2026 · 2602.08948
Self-Improving · Agentic · Reasoning · Benchmarks

Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D…

561

MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems

Feb 2026 · 2602.19843
Reasoning · Benchmarks · Multi-Agent · Architecture

As LLM-based Multi-Agent Systems (MAS) are increasingly deployed for complex tasks, ensuring their reliability has become a pressing challenge. Since MAS coordinate through unstructured natural language rather than rigid protocols, they are prone to semantic failures (e.g., hallucinations, misinterpreted instructions, and reasoning drift) that…

562

Guided Collaboration in Heterogeneous LLM-Based Multi-Agent Systems via Entropy-Based Understanding Assessment and Experience Retrieval

Feb 2026 · 2602.13639
RAG · Planning · Reasoning · Benchmarks

With recent breakthroughs in large language models (LLMs) for reasoning, planning, and complex task generation, artificial intelligence systems are transitioning from isolated single-agent architectures to multi-agent systems with collaborative intelligence. However, in heterogeneous multi-agent systems (HMAS), capability differences among agents…

563

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Feb 2026 · 2602.12876
Agentic · Planning · Reasoning · Benchmarks

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence…

564

Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge

Feb 2026 · 2602.09341
Reasoning · Multi-Agent

Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge…

565

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Feb 2026 · 2602.21496
Agentic · Reasoning · Safety · Inference

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive…

566

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

Feb 2026 · 2602.21136
Planning · Benchmarks · Multi-Agent · Fine-Tuning

Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to conduct semi-structured interviews. Recent work has explored using large language models (LLMs) to automate interviewing, yet existing systems lack a…

567

ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Feb 2026 · 2602.19594
Software Dev · Benchmarks · Inference

We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference optimization tasks. These tasks were taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and bottleneck description, whereby the agent must produce an optimization patch…

568

GLM-5: from Vibe Coding to Agentic Engineering

Feb 2026 · 2602.15763
Long-Horizon · Agentic · Reasoning · Benchmarks

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model…

569

Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization

Feb 2026 · 2602.11351
Agentic · Reasoning · Benchmarks · Reinforcement

Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real-world, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such…

570

Why Agentic Theorem Prover Works: A Statistical Provability Theory of Mathematical Reasoning Models

Feb 2026 · 2602.10538
Agentic · Planning · Reasoning

Agentic theorem provers -- pipelines that couple a mathematical reasoning model with library retrieval, a subgoal-decomposition/search planner, and a proof assistant verifier -- have recently achieved striking empirical success, yet it remains unclear which components drive performance and why such systems work at all despite classical hardness of…

571

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Feb 2026 · 2602.10090
Agentic · Benchmarks · Reinforcement

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment…

572

CL4SE: A Context Learning Benchmark For Software Engineering Tasks

Feb 2026 · 2602.23047
Context · Software Dev · Benchmarks · Fine-Tuning

Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to…

573

Choosing How to Remember: Adaptive Memory Structures for LLM Agents

Feb 2026 · 2602.14038
Memory · Long-Horizon · Benchmarks · Reinforcement

Memory is critical for enabling large language model (LLM) based agents to maintain coherent behavior over long-horizon interactions. However, existing agent memory systems suffer from two key gaps: they rely on a one-size-fits-all memory structure and do not model memory structure selection as a context-adaptive decision, limiting their ability…

574

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Feb 2026 · 2602.12430
Self-Improving · Software Dev · Agentic · Benchmarks

The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills -- composable packages of instructions, code, and resources that agents load on demand -- enable…

575

DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

Feb 2026 · 2602.13318
Planning · Benchmarks · Multi-Agent

Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges.…

576

MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

Feb 2026 · 2602.09642
Reasoning · Benchmarks · Multi-Agent · Inference

Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA…

577

TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

Feb 2026 · 2602.22828
RAG · Reasoning · Benchmarks · Fine-Tuning

Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and…

578

Debug2Fix: Supercharging Coding Agents with Interactive Debugging Capabilities

Feb 2026 · 2602.18571
Software Dev · Architecture

While significant progress has been made in automating various aspects of software development through coding agents, there is still significant room for improvement in their bug fixing capabilities. Debugging and investigation of runtime behavior remains largely a manual, developer-driven process. Popular coding agents typically rely on either…

579

Wink: Recovering from Misbehaviors in Coding Agents

Feb 2026 · 2602.17037
Software Dev · Agentic · Reasoning

Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly.…

580

KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Feb 2026 · 2602.14293
Memory · Software Dev · Agentic · Benchmarks

Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However,…

581

Multi-Agent Debate: A Unified Agentic Framework for Tabular Anomaly Detection

Feb 2026 · 2602.14251
Benchmarks · Multi-Agent

Tabular anomaly detection is often handled by single detectors or static ensembles, even though strong performance on tabular data typically comes from heterogeneous model families (e.g., tree ensembles, deep tabular networks, and tabular foundation models) that frequently disagree under distribution shift, missingness, and rare-anomaly regimes.…

582

S2SServiceBench: A Multimodal Benchmark for Last-Mile S2S Climate Services

Feb 2026 · 2602.14017
Agentic · Planning · Reasoning · Benchmarks

Subseasonal-to-seasonal (S2S) forecasts play an essential role in providing a decision-critical weeks-to-months planning window for climate resilience and sustainability, yet a growing bottleneck is the last-mile gap: translating scientific forecasts into trusted, actionable climate services, requiring reliable multimodal understanding and…

583

Assessing Spear-Phishing Website Generation in Large Language Model Coding Agents

Feb 2026 · 2602.13363
Software Dev

Large Language Models are expanding beyond being a tool humans use and into independent agents that can observe an environment, reason about solutions to problems, make changes that impact those environments, and understand how their actions impacted their environment. One of the most common applications of these LLM Agents is in computer…

584

MAS-on-the-Fly: Dynamic Adaptation of LLM-based Multi-Agent Systems at Test Time

Feb 2026 · 2602.13671
RAG · Benchmarks · Multi-Agent

Large Language Model (LLM)-based multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. However, existing works often rely on manual designs or "one-size-fits-all" automation, lacking dynamic adaptability after deployment. Inspired by how biological systems adapt, we introduce MASFly, a novel multi-agent…

585

G2CP: A Graph-Grounded Communication Protocol for Verifiable and Efficient Multi-Agent Reasoning

Feb 2026 · 2602.13370
Reasoning · Benchmarks · Multi-Agent · Knowledge

Multi-agent systems powered by Large Language Models face a critical challenge: agents communicate through natural language, leading to semantic drift, hallucination propagation, and inefficient token consumption. We propose G2CP (Graph-Grounded Communication Protocol), a structured agent communication language where messages are graph operations…

586

Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Feb 2026 · 2602.12259
Agentic · Reasoning · Benchmarks

Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations…

587

AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems

Feb 2026 · 2602.11510
Memory · Benchmarks · Multi-Agent · Safety

Multi-agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter-agent messages, shared memory, and tool arguments, pathways that output-only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full-stack…

588

Evaluating Memory Structure in LLM Agents

Feb 2026 · 2602.11243
MemoryRAGReasoningBenchmarks

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus…

589

Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

Feb 2026 · 2602.11241
Self-ImprovingReasoningBenchmarksFine-Tuning

Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual…

590

Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

Feb 2026 · 2602.09305
ReasoningBenchmarksFine-TuningReinforcement

Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)-based fine-tuning is a key mechanism for improvement, but its effectiveness is fundamentally governed by reward design. Despite its importance, the relationship between reward modeling and core LLM…

591

AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization

Feb 2026 · 2602.20040
AgenticBenchmarksReinforcementInference

Large language models (LLMs) offer substantial promise for automating clinical text summarization, yet maintaining factual consistency remains challenging due to the length, noise, and heterogeneity of clinical documentation. We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and…

592

APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL

Feb 2026 · 2602.16720
AgenticPlanningReasoningBenchmarks

Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments. The primary limitation lies in their reliance on static schema representations, which fails to resolve semantic ambiguity and scale effectively to large, complex databases. To address this, we propose APEX-SQL,…

593

El Agente Gráfico: Structured Execution Graphs for Scientific Agents

Feb 2026 · 2602.17902
MemoryReasoningBenchmarksMulti-Agent

Large language models (LLMs) are increasingly used to automate scientific workflows, yet their integration with heterogeneous computational tools remains ad hoc and fragile. Current agentic approaches often rely on unstructured text to manage context and coordinate execution, generating often overwhelming volumes of information that may obscure…

594

Optimization Instability in Autonomous Agentic Workflows for Clinical Symptom Detection

Feb 2026 · 2602.16037
AgenticBenchmarks

Autonomous agentic workflows that iteratively refine their own behavior hold considerable promise, yet their failure modes remain poorly characterized. We investigate optimization instability, a phenomenon in which continued autonomous improvement paradoxically degrades classifier performance, using Pythia, an open-source framework for automated…

595

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

Feb 2026 · 2602.13840
BenchmarksMulti-AgentInference

Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents' actions due to the implicitness of contextual privacy. Existing approaches rely on external, inference-time interventions which are brittle, scenario-specific, and may…

596

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Feb 2026 · 2602.11354
AgenticBenchmarksReinforcement

The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to…

597

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Feb 2026 · 2602.09379
ReasoningBenchmarksMulti-AgentReinforcement

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic…

598

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

Feb 2026 · 2602.23075
AgenticBenchmarksReinforcement

Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information…

599

Agentic AI for Scalable and Robust Optical Systems Control

Feb 2026 · 2602.20144
Software DevAgenticBenchmarks

We present AgentOptics, an agentic AI framework for high-fidelity, autonomous optical system control built on the Model Context Protocol (MCP). AgentOptics interprets natural language tasks and executes protocol-compliant actions on heterogeneous optical devices through a structured tool abstraction layer. We implement 64 standardized MCP tools…

600

LAPIS: Lightweight API Specification for Intelligent Systems

Feb 2026 · 2602.18541
Software DevAgenticReasoningBenchmarks

Large Language Models (LLMs) increasingly serve as consumers of API specifications, whether for code generation, autonomous agent interaction, or API-assisted reasoning. The de facto standard for API description, OpenAPI, was designed for documentation tools and code generators, resulting in substantial token overhead when used as LLM context. We…

601

AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports

Feb 2026 · 2602.14345
PlanningReasoningMulti-AgentFine-Tuning

Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available…

602

HyFunc: Accelerating LLM-based Function Calls for Agentic AI through Hybrid-Model Cascade and Dynamic Templating

Feb 2026 · 2602.13665
AgenticBenchmarksInference

While agentic AI systems rely on LLMs to translate user intent into structured function calls, this process is fraught with computational redundancy, leading to high inference latency that hinders real-time applications. This paper identifies and addresses three key redundancies: (1) the redundant processing of a large library of function…

603

Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

Feb 2026 · 2602.09598
Long-HorizonAgenticPlanningReasoning

Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure,…

604

Managing Uncertainty in LLM-based Multi-Agent System Operation

Feb 2026 · 2602.23005
ReasoningMulti-AgentReinforcementSafety

Applying LLM-based multi-agent software systems in safety-critical domains such as lifespan echocardiography introduces system-level risks that cannot be addressed by improving model accuracy alone. During system operation, beyond individual LLM behavior, uncertainty propagates through agent coordination, data pipelines, human-in-the-loop…

605

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Feb 2026 · 2602.21346
ReasoningBenchmarksFine-TuningReinforcement

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive…

606

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

Feb 2026 · 2602.08990
MemoryLong-HorizonAgenticReasoning

We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research,…

607

Agentic AI for Intent-driven Optimization in Cell-free O-RAN

Feb 2026 · 2602.22539
MemoryAgenticFine-TuningReinforcement

Agentic artificial intelligence (AI) is emerging as a key enabler for autonomous radio access networks (RANs), where multiple large language model (LLM)-based agents reason and collaborate to achieve operator-defined intents. The open RAN (O-RAN) architecture enables the deployment and coordination of such agents. However, most existing works…

608

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Feb 2026 · 2602.22124
Long-HorizonSoftware DevAgenticFine-Tuning

Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software…

609

City Editing: Hierarchical Agentic Execution for Dependency-Aware Urban Geospatial Modification

Feb 2026 · 2602.19326
AgenticPlanningReasoning

As cities evolve over time, challenges such as traffic congestion and functional imbalance increasingly necessitate urban renewal through efficient modification of existing plans, rather than complete re-planning. In practice, even minor urban changes require substantial manual effort to redraw geospatial layouts, slowing the iterative planning…

610

FAMOSE: A ReAct Approach to Automated Feature Discovery

Feb 2026 · 2602.17641
ContextAgenticBenchmarksArchitecture

Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a…

611

Agentic Wireless Communication for 6G: Intent-Aware and Continuously Evolving Physical-Layer Intelligence

Feb 2026 · 2602.17096
AgenticReasoningReinforcementInference

As 6G wireless systems evolve, growing functional complexity and diverse service demands are driving a shift from rule-based control to intent-driven autonomous intelligence. User requirements are no longer captured by a single metric (e.g., throughput or reliability), but by multi-dimensional objectives such as latency sensitivity, energy…

612

Policy Compiler for Secure Agentic Systems

Feb 2026 · 2602.16708
ReasoningMulti-Agent

LLM-based agents are increasingly being deployed in contexts requiring complex authorization policies: customer service protocols, approval workflows, data access restrictions, and regulatory compliance. Embedding these policies in prompts provides no enforcement guarantees. We present PCAS, a Policy Compiler for Agentic Systems that provides…

613

Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

Feb 2026 · 2602.14225
AgenticReasoningBenchmarksFine-Tuning

Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model must localize tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard…

614

SPILLage: Agentic Oversharing on the Web

Feb 2026 · 2602.13516
AgenticBenchmarks

LLM-powered agents are beginning to automate users' tasks across the open web, often with access to user resources such as emails and calendars. Unlike standard LLMs answering questions in a controlled ChatBot setting, web agents act "in the wild", interacting with third parties and leaving behind an action trace. Therefore, we ask the question:…

615

ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning

Feb 2026 · 2602.12714
AgenticReasoningReinforcement

Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this…

616

Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use

Feb 2026 · 2602.11541
AgenticPlanningFine-TuningInference

We study budget-constrained tool-augmented agents, where a large language model must solve multi-step tasks by invoking external tools under a strict monetary budget. We formalize this setting as sequential decision making in context space with priced and stochastic tool executions, making direct planning intractable due to massive state-action…

617

How Much Reasoning Do Retrieval-Augmented Models Add beyond LLMs? A Benchmarking Framework for Multi-Hop Inference over Hybrid Knowledge

Feb 2026 · 2602.10210
RAGReasoningBenchmarksReinforcement

Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of…

618

AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models

Feb 2026 · 2602.18481
MemoryReasoningBenchmarksArchitecture

The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge tests to interactive trading simulations. However, current evaluations of real-time trading performance overlook a critical failure mode: severe behavioral instability in sequential decision-making under uncertainty. We…

619

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

Feb 2026 · 2602.16165
Long-HorizonPlanningBenchmarksReinforcement

Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at…

620

ER-MIA: Black-Box Adversarial Memory Injection Attacks on Long-Term Memory-Augmented Large Language Models

Feb 2026 · 2602.15344
MemoryContextReasoning

Large language models (LLMs) are increasingly augmented with long-term memory systems to overcome finite context windows and enable persistent reasoning across interactions. However, recent research finds that LLMs become more vulnerable because memory provides extra attack surfaces. In this paper, we present the first systematic study of…

621

AgriWorld: A World Tools Protocol Framework for Verifiable Agricultural Reasoning with Code-Executing LLM Agents

Feb 2026 · 2602.15325
AgenticReasoning

Foundation models for agriculture are increasingly trained on massive spatiotemporal data (e.g., multi-spectral remote sensing, soil grids, and field-level management logs) and achieve strong performance on forecasting and monitoring. However, these models lack language-based reasoning and interactive capabilities, limiting their usefulness in…

622

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Feb 2026 · 2602.15112
ContextLong-HorizonAgenticBenchmarks

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed…

623

Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

Feb 2026 · 2602.14160
ReasoningBenchmarksMulti-AgentReinforcement

Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. One critical real-world case of this is gene-disease validity…

624

Reasoning-Native Agentic Communication for 6G

Feb 2026 · 2602.17738
ReasoningMulti-AgentArchitectureSafety

Future 6G networks will interconnect not only devices, but autonomous machines that continuously sense, reason, and act. In such environments, communication can no longer be understood solely as delivering bits or even preserving semantic meaning. Even when two agents interpret the same information correctly, they may still behave inconsistently…

625

Towards Selection as Power: Bounding Decision Authority in Autonomous Agents

Feb 2026 · 2602.14606
AgenticReasoningArchitectureSafety

Autonomous agentic systems are increasingly deployed in regulated, high-stakes domains where decisions may be irreversible and institutionally constrained. Existing safety approaches emphasize alignment, interpretability, or action-level filtering. We argue that these mechanisms are necessary but insufficient because they do not directly govern…

626

Agentic Test-Time Scaling for WebAgents

Feb 2026 · 2602.12276
Long-HorizonAgenticInference

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this…

627

Beyond Task Performance: A Metric-Based Analysis of Sequential Cooperation in Heterogeneous Multi-Agent Destructive Foraging

Feb 2026 · 2602.10685
Multi-Agent

This work addresses the problem of analyzing cooperation in heterogeneous multi-agent systems which operate under partial observability and temporal role dependency, framed within a destructive multi-agent foraging setting. Unlike most previous studies, which focus primarily on algorithmic performance with respect to task completion, this article…

628

Benchmark Test-Time Scaling of General LLM Agents

Feb 2026 · 2602.18998
AgenticReasoningBenchmarks

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools…

629

On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

Feb 2026 · 2602.13713
RAGMulti-AgentArchitecture

Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to…

630

Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning

Feb 2026 · 2602.18493
MemoryLong-HorizonRAGReasoning

Long-context LLMs and Retrieval-Augmented Generation (RAG) systems process information passively, deferring state tracking, contradiction resolution, and evidence aggregation to query time, which becomes brittle under ultra long streams with frequent updates. We propose the Unified Memory Agent (UMA), an end-to-end reinforcement learning framework…

631

Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Feb 2026 · 2602.13055
BenchmarksFine-TuningReinforcementArchitecture

Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To…

632

Perceptual Self-Reflection in Agentic Physics Simulation Code Generation

Feb 2026 · 2602.12311
Software DevMulti-AgentReinforcementArchitecture

We present a multi-agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self-reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user requests into physics-based descriptions; a technical requirements…

633

LHAW: Controllable Underspecification for Long-Horizon Tasks

Feb 2026 · 2602.10525
Long-HorizonSoftware DevBenchmarks

Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable,…

634

The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis

Feb 2026 · 2602.10453
AgenticReasoningBenchmarksInference

The evolution of Large Language Models (LLMs) has resulted in a paradigm shift towards autonomous agents, necessitating robust security against Prompt Injection (PI) vulnerabilities where untrusted inputs hijack agent behaviors. This SoK presents a comprehensive overview of the PI landscape, covering attacks, defenses, and their evaluation…

635

Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

Feb 2026 · 2602.09937
ReasoningBenchmarksMulti-AgentFine-Tuning

Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks…

636

The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Feb 2026 · 2602.09877
Self-ImprovingMulti-AgentSafetyInference

The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma.…

637

PeroMAS: A Multi-agent System of Perovskite Material Discovery

Feb 2026 · 2602.13312
PlanningBenchmarksMulti-Agent

As a pioneer of the third-generation photovoltaic revolution, Perovskite Solar Cells (PSCs) are renowned for their superior optoelectronic performance and cost potential. The development process of PSCs is precise and complex, involving a series of closed-loop workflows such as literature retrieval, data integration, experimental design, and…

638

A Benchmark for Deep Information Synthesis

Feb 2026 · 2602.21143
AgenticReasoningBenchmarks

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights…

639

AgenticTyper: Automated Typing of Legacy Software Projects Using Agentic AI

Feb 2026 · 2602.21251
AgenticBenchmarksSafetyInference

Legacy JavaScript systems lack type safety, making maintenance risky. While TypeScript can help, manually adding types is expensive. Previous automated typing research focuses on type inference but rarely addresses type checking setup, definition generation, bug identification, or behavioral correctness at repository scale. We present…

640

MapTab: Can MLLMs Master Constrained Route Planning?

Feb 2026 · 2602.18600
PlanningReasoningBenchmarks

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their constrained reasoning capabilities. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate…

641

How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses

Feb 2026 · 2602.17084
Software Dev

The rapid adoption of large language models has led to the emergence of AI coding agents that autonomously create pull requests on GitHub. However, how these agents differ in their pull request description characteristics, and how human reviewers respond to them, remains underexplored. In this study, we conduct an empirical analysis of pull…

642

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Feb 2026 · 2602.14234
Long-HorizonAgenticPlanningBenchmarks

Large language models are transitioning from general-purpose knowledge engines to real-world problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of high-quality search trajectories and reward signals, arising from the difficulty of scalable long-horizon task construction…

643

ARC: Compiling Hundreds of Requirement Scenarios into A Runnable Web System

Feb 2026 · 2602.13723
Software DevAgenticArchitecture

Large Language Models (LLMs) have improved programming efficiency, but their performance degrades significantly as requirements scale; when faced with multi-modal documents containing hundreds of scenarios, LLMs often produce incorrect implementations or omit constraints. We propose Agentic Requirement Compilation (ARC), a technique that moves…

644

STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

Feb 2026 · 2602.12143
AgenticReasoningBenchmarks

As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges…

645

FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

Feb 2026 · 2602.11136
AgenticBenchmarksArchitectureSafety

As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue…

646

AnalyticsGPT: An LLM Workflow for Scientometric Question Answering

Feb 2026 · 2602.09817
AgenticRAGPlanningReasoning

This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." When compared to traditional scientific question answering based on…

647

ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS

Feb 2026 · 2602.08866
Software DevRAGBenchmarks

Large language models have transformed code generation, enabling unprecedented automation in software development. As mobile ecosystems evolve, HarmonyOS has emerged as a critical platform requiring robust development tools. Software development for the HarmonyOS ecosystem relies heavily on ArkTS, a statically typed extension of TypeScript.…

648

Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Feb 2026 · 2602.23330
PlanningMulti-AgentSafetyInference

The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference…

649

Toward Personalized LLM-Powered Agents: Foundations, Evaluation, and Future Directions

Feb 2026 · 2602.22680
MemoryAgenticPlanningBenchmarks

Large language models have enabled agents that reason, plan, and interact with tools and environments to accomplish complex tasks. As these agents operate over extended interaction horizons, their effectiveness increasingly depends on adapting behavior to individual users and maintaining continuity across time, giving rise to personalized…

650

HieraMAS: Optimizing Intra-Node LLM Mixtures and Inter-Node Topology for Multi-Agent Systems

Feb 2026 · 2602.20229
ReasoningBenchmarksMulti-AgentReinforcement

Multi-agent systems (MAS) built on large language models (LLMs) have shown strong performance across many tasks. Most existing approaches improve only one aspect at a time, such as the communication topology, role assignment, or LLM routing, while treating each agent as a single, indivisible unit. This misses the opportunity to use mixtures of…

651

SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

Feb 2026 · 2602.19840
BenchmarksMulti-Agent

Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author's unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic…

652

MagicAgent: Towards Generalized Agent Planning

Feb 2026 · 2602.19000
Long-HorizonAgenticPlanningBenchmarks

The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, hindered not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks.…

653

AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence

Feb 2026 · 2602.16873
Software Dev · RAG · Planning · Reasoning

As large language models from diverse providers converge toward comparable benchmark performance, the traditional paradigm of selecting a single best model per task yields diminishing returns. We argue that orchestration topology -- the structural composition of how multiple agents are coordinated, parallelized, and synthesized -- now dominates…

654

ST-EVO: Towards Generative Spatio-Temporal Evolution of Multi-Agent Communication Topologies

Feb 2026 · 2602.14681
Self-Improving · Benchmarks · Multi-Agent · Reinforcement

LLM-powered Multi-Agent Systems (MAS) have emerged as an effective approach towards collaborative intelligence and have attracted wide research interest. Among them, "self-evolving" MAS, treated as a more flexible and powerful technical route, can construct task-adaptive workflows or communication topologies, instead of relying on a predefined…

655

Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

Feb 2026 · 2602.12662
Long-Horizon · Agentic · Planning · Reasoning

Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where…

656

AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Feb 2026 · 2602.11931
Agentic · Reasoning · Benchmarks · Inference

Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining…

657

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Feb 2026 · 2602.21534
Agentic · Fine-Tuning · Reinforcement

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons,…

658

"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Feb 2026 · 2602.21127
Software Dev · Agentic

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users. While extensive research focuses on…

659

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

Feb 2026 · 2602.19502
Agentic · Benchmarks

Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark…

660

Adaptive Collaboration of Arena-Based Argumentative LLMs for Explainable and Contestable Legal Reasoning

Feb 2026 · 2602.18916
RAG · Reasoning · Benchmarks · Multi-Agent

Legal reasoning requires not only high accuracy but also the ability to justify decisions through verifiable and contestable arguments. However, existing Large Language Model (LLM) approaches, such as Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG), often produce unstructured explanations that lack a formal mechanism for…

661

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Feb 2026 · 2602.16902
Long-Horizon · Planning · Reasoning · Benchmarks

We introduce LLM-WikiRace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-WikiRace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in…

662

Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

Feb 2026 · 2602.16819
Software Dev · Benchmarks · Architecture

When assessing the quality of coding agents, predominant benchmarks such as SWE-Bench focus on solving single issues on GitHub. In contrast, in real use, these agents solve more varied and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some…

663

HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling

Feb 2026 · 2602.13933
Memory · Reasoning · Benchmarks · Architecture

Large language model (LLM) agents demonstrate strong performance in short-text contexts but often underperform in extended dialogues due to inefficient memory management. Existing approaches face a fundamental trade-off between efficiency and effectiveness: memory compression risks losing critical details required for complex reasoning, while…

664

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Feb 2026 · 2602.23258
RAG · Reasoning · Benchmarks · Multi-Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time…

665

A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

Feb 2026 · 2602.22442
Agentic · Reasoning · Benchmarks

Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML…

666

NutriOrion: A Hierarchical Multi-Agent Framework for Personalized Nutrition Intervention Grounded in Clinical Guidelines

Feb 2026 · 2602.18650
Planning · Reasoning · Multi-Agent · Architecture

Personalized nutrition intervention for patients with multimorbidity is critical for improving health outcomes, yet remains challenging because it requires the simultaneous integration of heterogeneous clinical conditions, medications, and dietary guidelines. Single-agent large language models (LLMs) often suffer from context overload and…

667

What to Cut? Predicting Unnecessary Methods in Agentic Code Generation

Feb 2026 · 2602.17091
Software Dev · Agentic

Agentic Coding, powered by autonomous agents such as GitHub Copilot and Cursor, enables developers to generate code, tests, and pull requests from natural language instructions alone. While this accelerates implementation, it produces larger volumes of code per pull request, shifting the burden from implementers to reviewers. In practice, a…

668

RoboSolver: A Multi-Agent Large Language Model Framework for Solving Robotic Arm Problems

Feb 2026 · 2602.14438
Benchmarks · Multi-Agent

This study proposes an intelligent multi-agent framework built on LLMs and VLMs and specifically tailored to robotics. The goal is to integrate the strengths of LLMs and VLMs with computational tools to automatically analyze and solve problems related to robotic manipulators. The framework accepts both textual and visual inputs and can…

669

A Multi-Agent Framework for Code-Guided, Modular, and Verifiable Automated Machine Learning

Feb 2026 · 2602.13937
Software Dev · Planning · Multi-Agent · Prompting

Automated Machine Learning (AutoML) has revolutionized the development of data-driven solutions; however, traditional frameworks often function as "black boxes", lacking the flexibility and transparency required for complex, real-world engineering tasks. Recent Large Language Model (LLM)-based agents have shifted toward code-driven approaches.…

670

GSRM: Generative Speech Reward Model for Speech RLHF

Feb 2026 · 2602.13891
Reasoning · Benchmarks · Reinforcement · Inference

Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality requires a reliable evaluator of speech naturalness. However,…

671

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

Feb 2026 · 2602.13477
Reasoning · Multi-Agent · Safety

As Large Language Model (LLM) agents become more capable, their coordinated use in the form of multi-agent systems is anticipated to emerge as a practical paradigm. Prior work has examined the safety and misuse risks associated with agents. However, much of this has focused on the single-agent case and/or setups missing basic engineering…

672

CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic Perspective

Feb 2026 · 2602.09159
Reasoning · Benchmarks · Multi-Agent · Reinforcement

Recent multi-agent frameworks have broadened the ability to tackle oncology decision support tasks that require reasoning over dynamic, heterogeneous patient data. We propose Contribution-Aware Medical Multi-Agents (CoMMa), a decentralized LLM-agent framework in which specialists operate on partitioned evidence and coordinate through a…

673

DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation

Feb 2026 · 2602.22839
Long-Horizon · Agentic · Reasoning · Benchmarks

Presentation generation requires deep content research, coherent visual design, and iterative refinement based on observation. However, existing presentation agents often rely on predefined workflows and fixed templates. To address this, we present DeepPresenter, an agentic framework that adapts to diverse user intents, enables effective…

674

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Feb 2026 · 2602.21198
Long-Horizon · Planning · Reasoning · Benchmarks

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of…

675

Probing Dec-POMDP Reasoning in Cooperative MARL

Feb 2026 · 2602.20804
Memory · Reasoning · Benchmarks · Multi-Agent

Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to…

676

SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation

Feb 2026 · 2602.16671
Memory · Planning · Reasoning

Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the…

677

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Feb 2026 · 2602.16485
Software Dev · Reasoning · Benchmarks · Multi-Agent

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an…

678

Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling

Feb 2026 · 2602.15513
Memory · Long-Horizon · Reasoning · Benchmarks

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we…

679

Configuring Agentic AI Coding Tools: An Exploratory Study

Feb 2026 · 2602.14690
Software Dev · Agentic · Fine-Tuning

Agentic AI coding tools with autonomous capabilities beyond conversational content generation increasingly automate repetitive and time-consuming software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. In this paper, we present a systematic analysis of…

680

Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

Feb 2026 · 2602.14117
Agentic · Inference

Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces, but this programmability also increases operational complexity. Multiple control loops coexist across the service management layer and RAN Intelligent Controller (RIC), while independently developed control…

681

From What to How: Bridging User Requirements with Software Development Using Large Language Models

Feb 2026 · 2602.13611
Software Dev · Benchmarks

Recently, large language models (LLMs) have been extensively used to enhance development efficiency, leading to numerous benchmarks for evaluating their performance. However, these benchmarks predominantly focus on implementation, overlooking the equally critical aspect of software design. This gap raises two pivotal questions: (1) Can LLMs handle…

682

Execution-State-Aware LLM Reasoning for Automated Proof-of-Vulnerability Generation

Feb 2026 · 2602.13574
Agentic · Reasoning · Benchmarks · Fine-Tuning

Proof-of-Vulnerability (PoV) generation is a critical task in software security, serving as a cornerstone for vulnerability validation, false positive reduction, and patch verification. While directed fuzzing effectively drives path exploration, satisfying complex semantic constraints remains a persistent bottleneck in automated exploit…

683

A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era

Feb 2026 · 2602.13377
Benchmarks · Reinforcement

Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body of work has explored automated support for code review. However, progress in this area is hindered by the lack…

684

On the Adoption of AI Coding Agents in Open-source Android and iOS Development

Feb 2026 · 2602.12144
Software Dev · Agentic

AI coding agents are increasingly contributing to software development, yet their impact on mobile development has received little empirical attention. In this paper, we present the first category-level empirical study of agent-generated code in open-source mobile app projects. We analyzed PR acceptance behaviors across mobile platforms, agents,…

685

When Visibility Outpaces Verification: Delayed Verification and Narrative Lock-in in Agentic AI Discourse

Feb 2026 · 2602.11412
Agentic · Planning · Benchmarks · Safety

Agentic AI systems, autonomous entities capable of independent planning and execution, reshape the landscape of human-AI trust. Long before direct system exposure, user expectations are mediated through high-stakes public discourse on social platforms. However, platform-mediated engagement signals (e.g., upvotes) may inadvertently function as a…

686

UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

Feb 2026 · 2602.10652
Memory · Self-Improving · Benchmarks · Reinforcement

Self-evolving memory serves as the trainable parameters for Large Language Model (LLM)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominantly optimize memory management while treating memory extraction as a static process,…

687

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Feb 2026 · 2602.10367
Reasoning · Benchmarks · Multi-Agent · Safety

The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2)…

688

Multi-Agent Large Language Model Based Emotional Detoxification Through Personalized Intensity Control for Consumer Protection

Feb 2026 · 2602.23123
Multi-Agent · Reinforcement

In the attention economy, sensational content exposes consumers to excessive emotional stimulation, hindering calm decision-making. This study proposes Multi-Agent LLM-based Emotional deToxification (MALLET), a multi-agent information sanitization system consisting of four agents: Emotion Analysis, Emotion Adjustment, Balance Monitoring, and…

689

Multi-Agent Temporal Logic Planning via Penalty Functions and Block-Coordinate Optimization

Feb 2026 · 2602.17434
Planning · Multi-Agent

Multi-agent planning under Signal Temporal Logic (STL) is often hindered by collaborative tasks that lead to computational challenges due to the inherent high-dimensionality of the problem, preventing scalable synthesis with satisfaction guarantees. To address this, we formulate STL planning as an optimization program under arbitrary multi-agent…

690

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Feb 2026 · 2602.16898
Memory · Planning · Reasoning · Multi-Agent

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision…

691

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Feb 2026 · 2602.15382
Reasoning · Multi-Agent · Architecture · Safety

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches…

692

Socially-Weighted Alignment: A Game-Theoretic Framework for Multi-Agent LLM Systems

Feb 2026 · 2602.14471
Multi-Agent · Reinforcement · Safety · Inference

Deploying large language model (LLM) agents in shared environments introduces a fundamental tension between individual alignment and collective stability: locally rational decisions can impose negative externalities that degrade system-level performance. We propose Socially-Weighted Alignment (SWA), a game-theoretic framework that modifies…

693

Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings

Feb 2026 · 2602.12520
Planning · Benchmarks · Multi-Agent · Reinforcement

Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a novel model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imaginative roll-outs.…

694

Cooperation Breakdown in LLM Agents Under Communication Delays

Feb 2026 · 2602.11754
Multi-Agent

LLM-based multi-agent systems (LLM-MAS), in which autonomous AI agents cooperate to solve tasks, are gaining increasing attention. For such systems to be deployed in society, agents must be able to establish cooperation and coordination under real-world computational and communication constraints. We propose the FLCOA framework (Five Layers for…

695

Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

Feb 2026 · 2602.18640
Agentic · Reasoning

Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving product requirements. Progress in this domain is increasingly bottlenecked by the engineering context constraint: the arduous process of translating ambiguous product intent into reasonable, executable,…

696

EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

Feb 2026 · 2602.15918
Agentic · Reasoning · Benchmarks · Reinforcement

Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding…

697

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Feb 2026 · 2602.14161
Agentic · Benchmarks · Safety

Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a…

698

CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis

Feb 2026 · 2602.13962
Software Dev · Reasoning · Benchmarks · Reinforcement

In modern software development, developers frequently need to understand code behavior at a glance -- whether reviewing pull requests, debugging issues, or navigating unfamiliar codebases. This ability to reason about dynamic program behavior is fundamental to effective software engineering and increasingly supported by Large Language Models…

699

DTBench: A Synthetic Benchmark for Document-to-Table Extraction

Feb 2026 · 2602.13812
Reasoning · Benchmarks · Multi-Agent

Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL-based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently…

700

TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning

Feb 2026 · 2602.11409
Agentic · Reasoning · Benchmarks

Estimating uncertainty for AI agents in real-world multi-turn tool-using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single-shot text…

701

SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Feb 2026 · 2602.11238
Benchmarks · Multi-Agent · Safety

The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods…

702

Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Feb 2026 · 2602.10021
Context · RAG · Reasoning · Architecture

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite…

703

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Feb 2026 · 2602.09153
Agentic · Benchmarks

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic…

704

The LLMbda Calculus: AI Agents, Conversations, and Information Flow

Feb 2026 · 2602.20064
Agentic · Reasoning · Reinforcement · Safety

A conversation with a large language model (LLM) is a sequence of prompts and responses, with each response generated from the preceding conversation. AI agents build such conversations automatically: given an initial human prompt, a planner loop interleaves LLM calls with tool invocations and code execution. This tight coupling creates a new and…

705

Evaluating Collective Behaviour of Hundreds of LLM Agents

Feb 2026 · 2602.16662
Agentic · Benchmarks

As autonomous agents powered by LLMs are increasingly deployed in society, understanding their collective behaviour in social dilemmas becomes critical. We introduce an evaluation framework where LLMs generate strategies encoded as algorithms, enabling inspection prior to deployment and scaling to populations of hundreds of agents -- substantially…

706

MARLEM: A Multi-Agent Reinforcement Learning Simulation Framework for Implicit Cooperation in Decentralized Local Energy Markets

Feb 2026 · 2602.16063
Multi-Agent · Reinforcement

This paper introduces a novel, open-source MARL simulation framework for studying implicit cooperation in local energy markets (LEMs), modeled as a decentralized partially observable Markov decision process and implemented as a Gymnasium environment for MARL. Our framework features a modular market platform with plug-and-play clearing mechanisms, physically constrained…

707

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Feb 2026 · 2602.15198
Multi-Agent · Safety

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when individual agents form a coalition and collude to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a…

708

MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design

Feb 2026 · 2602.14926
Multi-Agent · Reinforcement

To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMPs) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity,…

709

Differentiable Modal Logic for Multi-Agent Diagnosis, Orchestration and Communication

Feb 2026 · 2602.12083
Reasoning · Multi-Agent

As multi-agent AI systems evolve from simple chatbots to autonomous swarms, debugging semantic failures requires reasoning about knowledge, belief, causality, and obligation, precisely what modal logic was designed to formalize. However, traditional modal logic requires manual specification of relationship structures that are unknown or dynamic in…

710

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Feb 2026 · 2602.11964
Reasoning · Benchmarks · Reinforcement

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve…

711

CausalAgent: A Conversational Multi-Agent System for End-to-End Causal Inference

Feb 2026 · 2602.11527
RAG · Multi-Agent · Inference

Causal inference holds immense value in fields such as healthcare, economics, and social sciences. However, traditional causal analysis workflows impose significant technical barriers, requiring researchers to possess dual backgrounds in statistics and computer science, while manually selecting algorithms, handling data quality issues, and…

712

ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents

Feb 2026 · 2602.10620
Reasoning · Benchmarks · Safety

Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive…

713

SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

Feb 2026 · 2602.23199
Reasoning · Benchmarks · Reinforcement

Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice…

714

RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing

Feb 2026 · 2602.22518
Software Dev · Benchmarks

The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack of deterministic ground truth leads to ambiguous metrics. Code modernization via automated translation offers…

715

Toward an Agentic Infused Software Ecosystem

Feb 2026 · 2602.20979
Software Dev · Agentic

Fully leveraging the capabilities of AI agents in software development requires a rethinking of the software ecosystem itself. To this end, this paper outlines the creation of an Agentic Infused Software Ecosystem (AISE) that rests on three pillars. The first, of course, is the AI agents themselves, which in the past 5 years have moved from…

716

Compositional Planning with Jumpy World Models

Feb 2026 · 2602.19634
Long-Horizon · Planning · Reasoning

The ability to plan with temporal abstractions is central to intelligent decision-making. Rather than reasoning over primitive actions, we study agents that compose pre-trained policies as temporally extended actions, enabling solutions to complex tasks that no constituent alone can solve. Such compositional planning remains elusive as compounding…

717

Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains

Feb 2026 · 2602.19555
Memory · Agentic · Architecture · Inference

Agentic systems built on large language models (LLMs) extend beyond text generation to autonomously retrieve information and invoke tools. This runtime execution model shifts the attack surface from build-time artifacts to inference-time dependencies, exposing agents to manipulation through untrusted data and probabilistic capability resolution.…

718

Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Feb 2026 · 2602.19008
Long-HorizonAgenticBenchmarks

Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across…

719

MerLean: An Agentic Framework for Autoformalization in Quantum Computation

Feb 2026 · 2602.16554
AgenticReasoning

We introduce MerLean, a fully automated agentic framework for autoformalization in quantum computation. MerLean extracts mathematical statements from LaTeX source files, formalizes them into verified Lean 4 code built on Mathlib, and translates the result back into human-readable LaTeX for semantic review. We evaluate MerLean on three…

720

Panini: Continual Learning in Token Space via Structured Memory

Feb 2026 · 2602.15156
MemoryRAGReasoningBenchmarks

Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over.…

721

The Agentic Automation Canvas: a structured framework for agentic AI project design

Feb 2026 · 2602.15090
AgenticBenchmarksInference

Agentic AI prototypes are being deployed across domains with increasing speed, yet no methodology for their structured design, governance, and prospective evaluation has been established. Existing AI documentation practices and guidelines - Model Cards, Datasheets, or NIST AI RMF - are either retrospective or lack machine-readability and…

722

Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems

Feb 2026 · 2602.14901
AgenticBenchmarks

Task-specialized models form the backbone of agentic healthcare systems, enabling the agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models where…

723

Evolutionary System Prompt Learning for Reinforcement Learning in LLMs

Feb 2026 · 2602.14697
AgenticReasoningReinforcement

Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a…

724

CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments

Feb 2026 · 2602.14229
MemoryLong-HorizonAgenticPlanning

Long-horizon reasoning is a key challenge for autonomous agents, yet existing benchmarks evaluate agents on single tasks in isolation. Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization. We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class…

725

RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training

Feb 2026 · 2602.12892
ReasoningBenchmarksFine-Tuning

Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on…

726

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Feb 2026 · 2602.12316
ReasoningBenchmarksMulti-AgentSafety

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning…

727

VulReaD: Knowledge-Graph-guided Software Vulnerability Reasoning and Detection

Feb 2026 · 2602.10787
ReasoningBenchmarksReinforcementKnowledge

Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a…

728

Immersion in the GitHub Universe: Scaling Coding Agents to Mastery

Feb 2026 · 2602.09892
Software DevMulti-AgentFine-Tuning

Achieving mastery in real world software engineering tasks is fundamentally bottlenecked by the scarcity of large scale, high quality training data. Scaling such data has been limited by the complexity of environment setup, unit test generation, and problem statement curation. In this paper, we propose ScaleSWE, an automated, sandboxed multi agent…

729

Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense

Feb 2026 · 2602.09012
MemoryAgenticPlanningReasoning

The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh, have effectively collapsed this security barrier, achieving pass…

730

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Feb 2026 · 2602.21531
Long-HorizonPlanningBenchmarks

General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of…

731

AWCP: A Workspace Delegation Protocol for Deep-Engagement Collaboration across Remote Agents

Feb 2026 · 2602.20493
Agentic

The rapid evolution of Large Language Model (LLM)-based autonomous agents is reshaping the digital landscape toward an emerging Agentic Web, where increasingly specialized agents must collaborate to accomplish complex tasks. However, existing collaboration paradigms are constrained to message passing, leaving execution environments as isolated…

732

Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation

Feb 2026 · 2602.19304
PlanningMulti-AgentCode GenSafety

Successful cooperation among decentralized agents requires each agent to quickly adapt its plan to the behavior of other agents. In scenarios where agents cannot confidently predict one another's intentions and plans, language communication can be crucial for ensuring safety. In this work, we focus on path-level cooperation in which agents must…

733

Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation

Feb 2026 · 2602.18891
BenchmarksSafety

Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction,…

734

Beyond Description: A Multimodal Agent Framework for Insightful Chart Summarization

Feb 2026 · 2602.18731
ReasoningBenchmarksMulti-Agent

Chart summarization is crucial for enhancing data accessibility and the efficient consumption of information. However, existing methods, including those with Multimodal Large Language Models (MLLMs), primarily focus on low-level data descriptions and often fail to capture the deeper insights which are the fundamental purpose of data visualization.…

735

What Makes a Good LLM Agent for Real-world Penetration Testing?

Feb 2026 · 2602.17622
RAGPlanningBenchmarksFine-Tuning

LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A…

736

MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions

Feb 2026 · 2602.17308
AgenticReasoningReinforcement

Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presentation alone. Rather, reaching a diagnosis often involves systematic history taking, during which clinicians reason over multiple potential conditions…

737

HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

Feb 2026 · 2602.14470
MemoryRAGReasoningBenchmarks

Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational…

738

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Feb 2026 · 2602.14257
ReasoningBenchmarks

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising…

739

VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Feb 2026 · 2602.12735
MemoryAgenticRAGReasoning

Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in…

740

AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

Feb 2026 · 2602.11750
BenchmarksMulti-AgentArchitectureSafety

Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the onset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification…

741

AIvilization v0: Toward Large-Scale Artificial Social Simulation with a Unified Agent Architecture and Adaptive Agent Profiles

Feb 2026 · 2602.10429
MemoryLong-HorizonPlanningFine-Tuning

AIvilization v0 is a publicly deployed large-scale artificial society that couples a resource-constrained sandbox economy with a unified LLM-agent architecture, aiming to sustain long-horizon autonomy while remaining executable under a rapidly changing environment. To mitigate the tension between goal stability and reactive correctness, we introduce…

742

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

Feb 2026 · 2602.20720
AgenticBenchmarks

The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their…

743

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Feb 2026 · 2602.20629
ReasoningBenchmarksSafety

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we…

744

Quantifying the Expectation-Realisation Gap for Agentic AI Systems

Feb 2026 · 2602.20292
Software DevAgenticPlanningReinforcement

Agentic AI systems are deployed with expectations of substantial productivity gains, yet rigorous empirical evidence reveals systematic discrepancies between pre-deployment expectations and post-deployment outcomes. We review controlled trials and independent validations across software engineering, clinical documentation, and clinical decision…

745

Narrowing the Complexity Gap in the Evaluation of Large Language Models

Feb 2026 · 2602.18928
Software DevReasoningBenchmarksReinforcement

Evaluating Large Language Models (LLMs) with respect to real-world code complexity is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, only to be disappointed when using them in real-world settings. Recently, researchers explored the construction of more realistic benchmarks by…

746

[Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Feb 2026 · 2602.18230
BenchmarksMulti-Agent

Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic…

747

Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web

Feb 2026 · 2602.17245
AgenticReinforcement

The Web is evolving from a medium that humans browse to an environment where software agents act on behalf of users. Advances in large language models (LLMs) make natural language a practical interface for goal-directed tasks, yet most current web agents operate on low-level primitives such as clicks and keystrokes. These operations are brittle,…

748

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Feb 2026 · 2602.11858
AgenticReasoningBenchmarksInference

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency…

749

Learning to Configure Agentic AI Systems

Feb 2026 · 2602.11574
AgenticReasoningBenchmarksReinforcement

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy…

750

CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis

Feb 2026 · 2602.11304
AgenticBenchmarks

Modern analyst agents must reason over complex, high token inputs, including dozens of retrieved documents, tool outputs, and time sensitive data. While prior work has produced tool calling benchmarks and examined factuality in knowledge augmented systems, relatively little work studies their intersection: settings where LLMs must integrate large…

751

Autoregressive Direct Preference Optimization

Feb 2026 · 2602.09533

Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the…

752

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Feb 2026 · 2602.09222
AgenticBenchmarks

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries…

753

SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

Feb 2026 · 2602.09132
AgenticReasoningBenchmarksSafety

The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data.…

754

Improving Parametric Knowledge Access in Reasoning Language Models

Feb 2026 · 2602.22193
ReasoningReinforcement

We study reasoning for accessing world knowledge stored in a language model's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks…

755

ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Feb 2026 · 2602.21265
Long-HorizonAgenticReasoningBenchmarks

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation…

756

ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

Feb 2026 · 2602.20502
MemoryPlanningReasoningBenchmarks

Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision language models--taking a screenshot, reasoning about the next action, executing it, then repeating on the new page--resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of…

757

NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Feb 2026 · 2602.20119
Long-HorizonPlanningReasoningBenchmarks

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical…

758

Watermarking LLM Agent Trajectories

Feb 2026 · 2602.18700
Reasoning

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This…

759

Towards More Standardized AI Evaluation: From Models to Agents

Feb 2026 · 2602.18029
AgenticBenchmarks

Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?". Yet most evaluation…

760

Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

Feb 2026 · 2602.17078
BenchmarksMulti-AgentReinforcementSafety

Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited for complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, leading to…

761

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

Feb 2026 · 2602.16943
BenchmarksSafety

Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also…

762

Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents

Feb 2026 · 2602.16379
AgenticPrompting

We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are…

763

Harnessing Implicit Cooperation: A Multi-Agent Reinforcement Learning Approach Towards Decentralized Local Energy Markets

Feb 2026 · 2602.16062
BenchmarksMulti-AgentReinforcementArchitecture

This paper proposes implicit cooperation, a framework enabling decentralized agents to approximate optimal coordination in local energy markets without explicit peer-to-peer communication. We formulate the problem as a decentralized partially observable Markov decision problem that is solved through a multi-agent reinforcement learning task in…

764

Traceable Latent Variable Discovery Based on Multi-Agent Collaboration

Feb 2026 · 2602.14456
ReasoningBenchmarksMulti-AgentFine-Tuning

Revealing the underlying causal mechanisms in the real world is crucial for scientific and technological progress. Despite notable advances in recent decades, the lack of high-quality data and the reliance of traditional causal discovery algorithms (TCDA) on the assumption of no latent confounders, as well as their tendency to overlook the precise…

765

OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

Feb 2026 · 2602.13793
BenchmarksMulti-AgentReinforcementSafety

Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource-constrained centres where MDT resources are scarce or unavailable. Here we present OMGs…

766

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Feb 2026 · 2602.08964
AgenticReasoningBenchmarksReinforcement

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study,…

767

When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design

Feb 2026 · 2602.22814
AgenticSafety

Agentic AI increasingly intervenes proactively by inferring users' situations from contextual data yet often fails for lack of principled judgment about when, why, and whether to act. We address this gap by proposing a conceptual model that reframes behavior as an interpretive outcome integrating Scene (observable situation), Context…

768

ArchAgent: Agentic AI-driven Computer Architecture Discovery

Feb 2026 · 2602.22425
AgenticArchitecture

Agile hardware design flows are a critically needed force multiplier to meet the exploding demand for compute. Recently, agentic generative AI systems have demonstrated significant advances in algorithm design, improving code efficiency, and enabling discovery across scientific domains. Bridging these worlds, we present ArchAgent, an automated…

769

DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs

Feb 2026 · 2602.22175
ContextReasoningBenchmarksReinforcement

Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work,…

770

IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models

Feb 2026 · 2602.16467
ReasoningBenchmarksReinforcementPrompting

The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and…

771

Cast-R1: Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting

Feb 2026 · 2602.13802
MemoryLong-HorizonAgenticReasoning

Time series forecasting has long been dominated by model-centric approaches that formulate prediction as a single-pass mapping from historical observations to future values. Despite recent progress, such formulations often struggle in complex and evolving settings, largely because most forecasting models lack the ability to autonomously acquire…

772

WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements

Feb 2026 · 2602.11724
AgenticReasoningBenchmarksInference

Visual language model (VLM) agents show great promise in automating end-to-end (E2E) web testing against requirements in natural language. However, the probabilistic nature of language models can have inherent hallucinations. Therefore, given a detected inconsistency between the requirement and the web application, it is hard to distinguish…

773

Learning to Compose for Cross-domain Agentic Workflow Generation

Feb 2026 · 2602.11114
AgenticReasoningBenchmarksInference

Automatically generating agentic workflows -- executable operator graphs or codes that orchestrate reasoning, verification, and repair -- has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available…

774

CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

Feb 2026 · 2602.10999
Agentic

Agentic coding requires agents to effectively interact with runtime environments, e.g., command line interfaces (CLI), so as to complete tasks like resolving dependency issues, fixing system problems, etc. But it remains underexplored how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities. To address this,…

775

ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents

Feb 2026 · 2602.10863
Long-HorizonBenchmarksReinforcement

Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse…

776

Authenticated Workflows: A Systems Approach to Protecting Agentic AI

Feb 2026 · 2602.10465
AgenticSafety

Agentic AI systems automate enterprise workflows but existing defenses--guardrails, semantic filters--are probabilistic and routinely bypassed. We introduce authenticated workflows, the first complete trust layer for enterprise agentic AI. Security reduces to protecting four fundamental boundaries: prompts, tools, data, and context. We enforce…

777

Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing

Feb 2026 · 2602.10092
Software DevReasoningBenchmarks

Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, their understanding of quantum computing…

778

OmniGAIA: Towards Native Omni-Modal AI Agents

Feb 2026 · 2602.22897
AgenticReasoningBenchmarksFine-Tuning

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI…

779

Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

Feb 2026 · 2602.22450
AgenticSafety

Agentic large language model systems increasingly automate tasks by retrieving URLs and calling external tools. We show that this workflow gives rise to implicit prompt injection: adversarial instructions embedded in automatically generated URL previews, including titles, metadata, and snippets, can introduce a system-level risk that we refer to…

780

Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents

Feb 2026 · 2602.22402
MemoryContextAgenticReasoning

As large language models engage in extended reasoning tasks, they accumulate significant state -- architectural mappings, trade-off decisions, codebase conventions -- within the context window. This understanding is lost when sessions reach context limits and undergo lossy compaction. We propose Contextual Memory Virtualisation (CMV), a system…

781

Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

Feb 2026 · 2602.21020
Multi-Agent

Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this…

782

Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

Feb 2026 · 2602.16131
Software DevBenchmarks

Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However,…

783

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

Feb 2026 · 2602.15019
BenchmarksMulti-AgentReinforcement

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests that over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total. A growing share of scholarly output is…

784

OpAgent: Operator Agent for Web Navigation

Feb 2026 · 2602.13559
Long-HorizonAgenticPlanningFine-Tuning

To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatile nature of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets. However, these methods suffer from severe distributional shifts, as offline…

785

LawThinker: A Deep Research Legal Agent in Dynamic Environments

Feb 2026 · 2602.12056
MemoryLong-HorizonReasoningBenchmarks

Legal reasoning requires not only correct outcomes but also procedurally compliant reasoning processes. However, existing methods lack mechanisms to verify intermediate reasoning steps, allowing errors such as inapplicable statute citations to propagate undetected through the reasoning chain. To address this, we propose LawThinker, an autonomous…

786

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Feb 2026 · 2602.11790
ReasoningMulti-AgentSafetyInference

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM-based…

787

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Feb 2026 · 2602.11146
MemoryBenchmarksReinforcementSafety

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be…

788

Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics

Feb 2026 · 2602.10885
Self-ImprovingReasoningReinforcementInference

Despite chain-of-thought (CoT) playing crucial roles in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling efforts, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human…

789

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

Feb 2026 · 2602.10356
Reinforcement

Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data…

790

Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts

Feb 2026 · 2602.09442
RAGReasoningBenchmarksArchitecture

Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the…

791

AgentCgroup: Understanding and Controlling OS Resources of AI Agents

Feb 2026 · 2602.09345
MemorySoftware DevBenchmarksInference

AI agents are increasingly deployed in multi-tenant cloud environments, where they execute diverse tool calls within sandboxed containers, each call with distinct resource demands and rapid fluctuations. We present a systematic characterization of OS-level resource dynamics in sandboxed AI coding agents, analyzing 144 software engineering tasks…

792

ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Feb 2026 · 2602.22465
ReasoningBenchmarks

Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can LLMs directly produce correct solutions to fully specified constrained…

793

Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset

Feb 2026 · 2602.20812
BenchmarksFine-TuningReinforcement

As the construction industry advances toward digital transformation, BIM (Building Information Modeling)-based design has become a key driver supporting intelligent construction. Although Large Language Models (LLMs) have shown potential in promoting BIM-based design, the lack of specific datasets and LLM evaluation benchmarks has significantly…

794

Recursive Belief Vision Language Action Models

Feb 2026 · 2602.20659
MemoryContextLong-HorizonReasoning

Vision-language-action models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high…

795

CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

Feb 2026 · 2602.20094
ReasoningBenchmarksReinforcement

As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability of LLMs, as high accuracy…

796

BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

Feb 2026 · 2602.18788
ReasoningBenchmarksFine-TuningReinforcement

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity…

797

NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs

Feb 2026 · 2602.18008
AgenticBenchmarksReinforcement

Mechanistic models encode scientific knowledge about dynamical systems and are widely used in downstream scientific and policy applications. Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it…

798

Analyzing LLM Instruction Optimization for Tabular Fact Verification

Feb 2026 · 2602.17937
ReasoningBenchmarksPrompting

Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting…

799

OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Feb 2026 · 2602.15197
AgenticBenchmarksFine-TuningInference

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque…

800

Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs

Feb 2026 · 2602.15173
AgenticReasoningReinforcement

The use of large language models either as decision support systems, or in agentic workflows, is rapidly transforming the digital ecosystem. However, the understanding of LLM decision-making under uncertainty remains limited. We initiate a comparative study of LLM risky choices along two dimensions: (1) prospect representation (explicit vs.…

801

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Feb 2026 · 2602.14955
AgenticPlanningReasoningBenchmarks

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism.…

802

TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models

Feb 2026 · 2602.14200
ContextReasoningBenchmarksInference

Time Series Language Models (TSLMs) are emerging as unified models for reasoning over continuous signals in natural language. However, long-context retrieval remains a major limitation: existing models are typically trained and evaluated on short sequences, while real-world time-series sensor streams can span millions of datapoints. This mismatch…

803

AnomaMind: Agentic Time Series Anomaly Detection with Tool-Augmented Reasoning

Feb 2026 · 2602.13807
AgenticReasoningReinforcementInference

Time series anomaly detection is critical in many real-world applications, where effective solutions must localize anomalous regions and support reliable decision-making under complex settings. However, most existing methods frame anomaly detection as a purely discriminative prediction task with fixed feature inputs, rather than an evidence-driven…

804

ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

Feb 2026 · 2602.11460
ReasoningBenchmarks

Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has…

805

Fine-Tuning GPT-5 for GPU Kernel Generation

Feb 2026 · 2602.11000
Software DevBenchmarksFine-TuningReinforcement

Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU…

806

Time Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark

Feb 2026 · 2602.10848
ContextBenchmarksReinforcement

Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting--where accuracy, calibration, and robustness directly affect grid operations--remains an open question.…

807

UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation

Feb 2026 · 2602.09130
ReasoningBenchmarksSafetyInference

Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed…

808

Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

Feb 2026 · 2602.23079
ReasoningInference

The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.…

809

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Feb 2026 · 2602.21728
ReasoningBenchmarksFine-TuningReinforcement

The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrain LLM reasoning either by enforcing rules during…

810

MultiVer: Zero-Shot Multi-Agent Vulnerability Detection

Feb 2026 · 2602.17875
BenchmarksMulti-AgentFine-TuningArchitecture

We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points -- the first zero-shot system to…

811

Discovering Multiagent Learning Algorithms with Large Language Models

Feb 2026 · 2602.16928
Software DevBenchmarksMulti-AgentReinforcement

Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most…

812

Causally-Guided Automated Feature Engineering with Multi-Agent Reinforcement Learning

Feb 2026 · 2602.16435
BenchmarksMulti-AgentFine-TuningReinforcement

Automated feature engineering (AFE) enables AI systems to autonomously construct high-utility representations from raw tabular data. However, existing AFE methods rely on statistical heuristics, yielding brittle features that fail under distribution shift. We introduce CAFE, a framework that reformulates AFE as a causally-guided sequential…

813

Multi-Agent Combinatorial-Multi-Armed-Bandit framework for the Submodular Welfare Problem under Bandit Feedback

Feb 2026 · 2602.16183
BenchmarksMulti-Agent

We study the Submodular Welfare Problem (SWP), where items are partitioned among agents with monotone submodular utilities to maximize the total welfare under bandit feedback. Classical SWP assumes full value-oracle access, achieving (1 - 1/e) approximations via continuous-greedy algorithms. We extend this to a multi-agent…

814

Lifelong Scalable Multi-Agent Realistic Testbed and A Comprehensive Study on Design Choices in Lifelong AGV Fleet Management Systems

Feb 2026 · 2602.15721
PlanningMulti-Agent

We present Lifelong Scalable Multi-Agent Realistic Testbed (LSMART), an open-source simulator to evaluate any Multi-Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automated Guided Vehicles (AGVs). MAPF aims to move a group of agents from their corresponding starting locations to their goals. Lifelong MAPF (LMAPF) is a…

815

Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Feb 2026 · 2602.14970
BenchmarksPrompting

Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce…

816

R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training

Feb 2026 · 2602.13103
MemorySelf-ImprovingReasoningBenchmarks

Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains…

817

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Feb 2026 · 2602.12670
BenchmarksInference

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three…

818

AI Agents for Inventory Control: Human-LLM-OR Complementarity

Feb 2026 · 2602.12631
Benchmarks

Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent…

819

Intrinsic Credit Assignment for Long Horizon Interaction

Feb 2026 · 2602.12342
Long-HorizonReinforcement

How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method utilizes the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction…

820

Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning

Feb 2026 · 2602.09945
RAGReasoningBenchmarksReinforcement

Clinical decision support requires not only correct answers but also clinically valid reasoning. We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. From reference reasoning rationales (e.g., physician-authored clinical rationale, clinical guidelines, or outputs from…

821

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

Feb 2026 · 2602.21858
Benchmarks

Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents…

822

FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation

Feb 2026 · 2602.22273
ReasoningBenchmarks

We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions drawn from widely recognized financial qualification exams, enabling evaluation of LLMs deep…

823

LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

Feb 2026 · 2602.21044
ReasoningBenchmarks

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation,…

824

Multi-CoLoR: Context-Aware Localization and Reasoning across Multi-Language Codebases

Feb 2026 · 2602.19407
Software DevReasoningBenchmarks

Large language models demonstrate strong capabilities in code generation but struggle to navigate complex, multi-language repositories to locate relevant code. Effective code localization requires understanding both organizational context (e.g., historical issue-fix patterns) and structural relationships within heterogeneous codebases. Existing…

825

Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

Feb 2026 · 2602.19223
MemoryBenchmarksMulti-AgentReinforcement

The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need…

826

Can AI Lower the Barrier to Cybersecurity? A Human-Centered Mixed-Methods Study of Novice CTF Learning

Feb 2026 · 2602.18172
AgenticBenchmarksFine-TuningReinforcement

Capture-the-Flag (CTF) competitions serve as gateways into offensive cybersecurity, yet they often present steep barriers for novices due to complex toolchains and opaque workflows. Recently, agentic AI frameworks for cybersecurity promise to lower these barriers by automating and coordinating penetration testing tasks. However, their role in…

827

Agentic Adversarial QA for Improving Domain-Specific LLMs

Feb 2026 · 2602.18137
AgenticReasoningBenchmarksFine-Tuning

Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this,…

828

Decision Quality Evaluation Framework at Pinterest

Feb 2026 · 2602.15809
BenchmarksSafety

Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and…

829

How to Train Your Long-Context Visual Document Model

Feb 2026 · 2602.15257
ContextBenchmarksFine-Tuning

We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not…

830

ReusStdFlow: A Standardized Reusability Framework for Dynamic Workflow Construction in Agentic AI

Feb 2026 · 2602.14922
AgenticRAGArchitecture

To address the "reusability dilemma" and structural hallucinations in enterprise Agentic AI, this paper proposes ReusStdFlow, a framework centered on a novel "Extraction-Storage-Construction" paradigm. The framework deconstructs heterogeneous, platform-specific Domain Specific Languages (DSLs) into standardized, modular workflow segments. It…

831

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

Feb 2026 · 2602.14849
AgenticSafety

LLM agents increasingly act on external systems, yet tool effects are immediate. Under failures, speculation, or contention, losing branches can leak unintended side effects with no safe rollback. We introduce Atomix, a runtime that provides progress-aware transactional semantics for agent tool calls. Atomix tags each call with an epoch, tracks…

832

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Feb 2026 · 2602.14211
Software DevSafety

Agent skills are becoming a core abstraction in coding agents, packaging long-form instructions and auxiliary scripts to extend tool-augmented behaviors. This abstraction introduces an under-measured attack surface: skill-based prompt injection, where poisoned skills can steer agents away from user intent and safety policies. In practice, naive…

833

From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

Feb 2026 · 2602.14012
BenchmarksFine-TuningReinforcement

The integration of LLMs into vulnerability detection (VD) has shifted the field toward interpretable and context-aware analysis. While post-training methods have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training…

834

An end-to-end agentic pipeline for smart contract translation and quality evaluation

Feb 2026 · 2602.13808
AgenticBenchmarksReinforcementSafety

We present an end-to-end framework for systematic evaluation of LLM-generated smart contracts from natural-language specifications. The system parses contractual text into structured schemas, generates Solidity code, and performs automated quality assessment through compilation and security checks. Using CrewAI-style agent teams with iterative…

835

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Feb 2026 · 2602.13110
Benchmarks

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with…

836

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Feb 2026 · 2602.13367
Long-HorizonSoftware DevAgenticReasoning

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference…

837

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

Feb 2026 · 2602.11210
BenchmarksReinforcement

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a…

838

Beyond SMILES: Evaluating Agentic Systems for Drug Discovery

Feb 2026 · 2602.10163
AgenticPlanningReinforcementSafety

Agentic systems for drug discovery have demonstrated autonomous synthesis planning, literature mining, and molecular design. We ask how well they generalize. Evaluating six frameworks against 15 task classes drawn from peptide therapeutics, in vivo pharmacology, and resource-constrained settings, we find five capability gaps: no support for…

839

Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models

Feb 2026 · 2602.09517
AgenticReasoningBenchmarksInference

Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that…

840

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Feb 2026 · 2602.09443
AgenticReasoningBenchmarksReinforcement

The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that…

841

Hierarchical Lead Critic based Multi-Agent Reinforcement Learning

Feb 2026 · 2602.21680
BenchmarksMulti-AgentReinforcementArchitecture

Cooperative Multi-Agent Reinforcement Learning (MARL) solves complex tasks that require coordination from multiple agents, but is often limited to either local (independent learning) or global (centralized learning) perspectives. In this paper, we introduce a novel sequential training scheme and MARL architecture, which learns from multiple…

842

Interaction Theater: A case of LLM Agents Interacting at Scale

Feb 2026 · 2602.20059
Multi-AgentArchitecture

As multi-agent architectures and agent-to-agent protocols proliferate, a fundamental question arises: what actually happens when autonomous LLM agents interact at scale? We study this question empirically using data from Moltbook, an AI-agent-only social platform, with 800K posts, 3.5M comments, and 78K agent profiles. We combine lexical metrics…

843

Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Feb 2026 · 2602.18922
AgenticBenchmarksPrompting

Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property -- cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key…

844

UniRank: A Multi-Agent Calibration Pipeline for Estimating University Rankings from Anonymized Bibliometric Signals

Feb 2026 · 2602.18824
ReasoningBenchmarksMulti-AgentArchitecture

We present UniRank, a multi-agent LLM pipeline that estimates university positions across global ranking systems using only publicly available bibliometric data from OpenAlex and Semantic Scholar. The system employs a three-stage architecture: (a) zero-shot estimation from anonymized institutional metrics, (b) per-system tool-augmented calibration…

845

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Feb 2026 · 2602.18291
BenchmarksMulti-AgentFine-TuningReinforcement

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation…

846

MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

Feb 2026 · 2602.14281
AgenticReasoningSafety

The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers. This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. However, despite its excellent utility, existing agents typically offer limited validation for third-party MCP…

847

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Feb 2026 · 2602.13379
BenchmarksFine-TuningSafety

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn,…

848

TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

Feb 2026 · 2602.13059
ReasoningBenchmarksMulti-AgentReinforcement

Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular…

849

See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

Feb 2026 · 2602.10814
PlanningReasoningBenchmarks

Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch.…

850

Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

Feb 2026 · 2602.09578
Multi-AgentReinforcementArchitecture

Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training…

851

Airavat: An Agentic Framework for Internet Measurement

Feb 2026 · 2602.20924
AgenticReasoningKnowledge

Internet measurement faces twin challenges: complex analyses require expert-level orchestration of tools, yet even syntactically correct implementations can have methodological flaws and can be difficult to verify. Democratizing measurement capabilities thus demands automating both workflow generation and verification against methodological…

852

CodeCompass: Navigating the Navigation Paradox in Agentic Code Intelligence

Feb 2026 · 2602.20048
AgenticBenchmarksSafetyPrompting

Modern code intelligence agents operate in contexts exceeding 1 million tokens--far beyond the scale where humans manually locate relevant files. Yet agents consistently fail to discover architecturally critical files when solving real-world coding tasks. We identify the Navigation Paradox: agents perform poorly not due to context limits, but…

853

From Docs to Descriptions: Smell-Aware Evaluation of MCP Server Descriptions

Feb 2026 · 2602.18914
AgenticBenchmarksReinforcement

The Model Context Protocol (MCP) has rapidly become a de facto standard for connecting LLM-based agents with external tools via reusable MCP servers. In practice, however, server selection and onboarding rely heavily on free-text tool descriptions that are intentionally loosely constrained. Although this flexibility largely ensures the scalability…

854

Neurosymbolic Language Reasoning as Satisfiability Modulo Theory

Feb 2026 · 2602.18095
ReasoningBenchmarksCode Gen

Natural language understanding requires interleaving textual and logical reasoning, yet large language models often fail to perform such reasoning reliably. Existing neurosymbolic systems combine LLMs with solvers but remain limited to fully formalizable tasks such as math or program synthesis, leaving natural documents with only partial logical…

855

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

Feb 2026 · 2602.17753
AgenticBenchmarksSafety

Agentic AI systems are increasingly capable of performing professional and personal tasks with limited human involvement. However, tracking these developments is difficult because the AI agent ecosystem is complex, rapidly evolving, and inconsistently documented, posing obstacles to both researchers and policymakers. To address these challenges,…

856

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

Feb 2026 · 2602.17544
ReasoningBenchmarksMulti-Agent

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this…

857

Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Feb 2026 · 2602.17497
Self-ImprovingBenchmarksPrompting

Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit…

858

BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Feb 2026 · 2602.17072
ReasoningBenchmarksFine-Tuning

Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of…

859

Large Language Models Persuade Without Planning Theory of Mind

Feb 2026 · 2602.17045
PlanningReasoningBenchmarksInference

A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to…

860

Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory

Feb 2026 · 2602.15313
MemoryReasoningBenchmarks

AI Memory, specifically how models organize and retrieve historical messages, becomes increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning…

861

Removing Planner Bias in Goal Recognition Through Multi-Plan Dataset Generation

Feb 2026 · 2602.14691
AgenticPlanningBenchmarks

Autonomous agents require some form of goal and plan recognition to interact in multiagent settings. Unfortunately, all existing goal recognition datasets suffer from a systematic bias induced by the planning systems that generated them, namely heuristic-based forward search. This means that existing datasets lack enough challenge for more…

862

TabTracer: Monte Carlo Tree Search for Complex Table Reasoning with Large Language Models

Feb 2026 · 2602.14089
AgenticReasoningBenchmarksInference

Large language models (LLMs) have emerged as powerful tools for natural language table reasoning, where there are two main categories of methods. Prompt-based approaches rely on language-only inference or one-pass program generation without step-level verification. Agent-based approaches use tools in a closed loop, but verification is often local…

863

Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]

Feb 2026 · 2602.11745
BenchmarksReinforcementSafety

Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as a translator, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful…

864

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Feb 2026 · 2602.10604
AgenticReasoningReinforcementInference

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It…

865

Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

Feb 2026 · 2602.10224
MemoryReasoningBenchmarksReinforcement

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond…

866

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Feb 2026 · 2602.10085
Long-HorizonReinforcement

Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the…

867

Chain of Mindset: Reasoning with Adaptive Cognitive Modes

Feb 2026 · 2602.10063
Software DevAgenticReasoningBenchmarks

Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the…

868

TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces

Feb 2026 · 2602.09712
MemoryContextAgenticReasoning

Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose…

869

FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding

Feb 2026 · 2602.09336
ReasoningBenchmarksMulti-Agent

Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint…

870

Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

Feb 2026 · 2602.08873
RAGBenchmarksInferencePrompting

Large language models (LLMs) are increasingly used for academic expert recommendation. Existing audits typically evaluate model outputs in isolation, largely ignoring end-user inference-time interventions. As a result, it remains unclear whether failures such as refusals, hallucinations, and uneven coverage stem from model choice or deployment…

871

AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Feb 2026 · 2602.22724
BenchmarksInference

Large language model (LLM) agents increasingly rely on external tools and retrieval systems to autonomously complete complex tasks. However, this design exposes agents to indirect prompt injection (IPI), where attacker-controlled context embedded in tool outputs or retrieved content silently steers agent actions away from user intent. Unlike…

872

Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention

Feb 2026 · 2602.22546
Reasoning

Large Language Model (LLM) based agents excel at general reasoning but often fail in specialized domains where success hinges on long-tail knowledge absent from their training data. While human experts can provide this missing knowledge, their guidance is often unstructured and unreliable, making its direct integration into an agent's plan…

873

A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Feb 2026 · 2602.21351
Multi-AgentArchitecture

The rapid accumulation of Earth science data has created a significant scalability challenge; while repositories like PANGAEA host vast collections of datasets, citation metrics indicate that a substantial portion remains underutilized, limiting data reusability. Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for…

874

Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

Feb 2026 · 2602.20426
AgenticFine-TuningReinforcement

The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces -- including natural language descriptions and parameter schemas -- remain largely human-oriented and often become a bottleneck, especially when…

875

Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems

Feb 2026 · 2602.20379
RAGBenchmarksReinforcementSafety

Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for…

876

Representation Stability in a Minimal Continual Learning Agent

Feb 2026 · 2602.19655

Continual learning systems are increasingly deployed in environments where retraining or reset is infeasible, yet many approaches emphasize task performance rather than the evolution of internal representations over time. In this work, we study a minimal continual learning agent designed to isolate representational dynamics from architectural…

877

Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)

Feb 2026 · 2602.18663
Multi-AgentReinforcement

Mechanical thrombectomy (MT) is typically the optimal treatment for acute ischemic stroke involving large vessel occlusions, but access is limited due to geographic and logistical barriers. Reinforcement learning (RL) shows promise in autonomous endovascular navigation, but generalization across 'long' navigation tasks remains challenging. We…

878

Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning

Feb 2026 · 2602.17009
Multi-AgentReinforcement

Coordinating actions is the most fundamental form of cooperation in multi-agent reinforcement learning (MARL). Successful decentralized decision-making often depends not only on good individual actions, but on selecting compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. In this paper, we…

879

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Feb 2026 · 2602.17003
ReasoningBenchmarksArchitecture

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for…

880

EAA: Automating materials characterization with vision language model agents

Feb 2026 · 2602.15294
MemoryAgenticReasoningReinforcement

We present Experiment Automation Agents (EAA), a vision-language-model-driven agentic system designed to automate complex experimental microscopy workflows. EAA integrates multimodal reasoning, tool-augmented action, and optional long-term memory to support both autonomous procedures and interactive user-guided measurements. Built on a flexible…

881

Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Feb 2026 · 2602.14770
MemoryBenchmarksMulti-Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the…

882

Verification of Robust Multi-Agent Systems

Feb 2026 · 2602.13405
MemoryMulti-AgentReinforcement

Stochastic multi-agent systems are a central modeling framework for autonomous controllers, communication protocols, and cyber-physical infrastructures. In many such systems, however, transition probabilities are only estimated from data and may therefore be partially unknown or subject to perturbations. In this paper, we study the verification of…

883

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

Feb 2026 · 2602.12049
Software DevBenchmarksReinforcement

Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs using runtime performance as a reward in the HPC domain. We propose an online reinforcement learning approach that executes LLM-generated code on a…

884

Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing

Feb 2026 · 2602.11076
Multi-AgentReinforcementArchitectureInference

Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose Attention-Enhanced Multi-Agent Proximal Policy Optimization…

885

Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

Feb 2026 · 2602.08965
Multi-AgentReinforcementArchitecture

The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework…

886

Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

Feb 2026 · 2602.22698
ReasoningBenchmarksKnowledge

Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or…

887

Automating the Detection of Requirement Dependencies Using Large Language Models

Feb 2026 · 2602.22456
Software DevRAGReinforcementPrompting

Requirements are inherently interconnected through various types of dependencies. Identifying these dependencies is essential, as they underpin critical decisions and influence a range of activities throughout software development. However, this task is challenging, particularly in modern software systems, given the high volume of complex, coupled…

888

Topology of Reasoning: Retrieved Cell Complex-Augmented Generation for Textual Graph Question Answering

Feb 2026 · 2602.19240
RAGReasoningBenchmarksInference

Retrieval-Augmented Generation (RAG) enhances the reasoning ability of Large Language Models (LLMs) by dynamically integrating external knowledge, thereby mitigating hallucinations and strengthening contextual grounding for structured data such as graphs. Nevertheless, most existing RAG variants for textual graphs concentrate on low-dimensional…

889

Robust and Efficient Tool Orchestration via Layered Execution Structures with Reflective Correction

Feb 2026 · 2602.18968
AgenticPlanningReasoning

Tool invocation is a core capability of agentic systems, yet failures often arise not from individual tool calls but from how multiple tools are organized and executed together. Existing approaches tightly couple tool execution with stepwise language reasoning or explicit planning, leading to brittle behavior and high execution overhead. To…

890

Think²: Grounded Metacognitive Reasoning in Large Language Models

Feb 2026 · 2602.18806
PlanningReasoningBenchmarksArchitecture

Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting…

891

Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework

Feb 2026 · 2602.18055
BenchmarksFine-Tuning

Dual-to-Dual MLLMs refer to Multimodal Large Language Models, which can enable unified multimodal comprehension and generation through text and image modalities. Although exhibiting strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs still remain deficient in lifelong evolution, significantly affecting continual…

892

Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects

Feb 2026 · 2602.17734
Software DevPlanningMulti-Agent

Agile estimation techniques, particularly T-shirt sizing, are widely used in software development for their simplicity and utility in scoping work. However, when we apply these methods to artificial intelligence initiatives -- especially those involving large language models (LLMs) and multi-agent systems -- the results can be systematically…

893

DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows

Feb 2026 · 2602.16585
Multi-Agent

Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent…

894

Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

Feb 2026 · 2602.14044
ContextRAGReasoning

Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and…

895

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Feb 2026 · 2602.12705
AgenticReasoningBenchmarksReinforcement

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve…

896

ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation

Feb 2026 · 2602.11598
MemoryLong-HorizonAgenticReasoning

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action"…

897

Are Aligned Large Language Models Still Misaligned?

Feb 2026 · 2602.11305
BenchmarksSafety

Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks -- such as INSECURE CODE (safety-centric), VALUEACTIONLENS…

898

SteuerLLM: Local specialized large language model for German tax law analysis

Feb 2026 · 2602.11081
RAGReasoningBenchmarksReinforcement

Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and…

899

A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging

Feb 2026 · 2602.10007
Multi-AgentFine-TuningReinforcementSafety

Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi-Agent Safety Shield (MASS), designed…

900

Dieu khien he da tac tu (Control of Multi-Agent Systems)

Feb 2026 · 2602.09412
ReasoningMulti-AgentReinforcement

Since the early 2000s, control of multiagent systems has attracted significant research interest, with applications ranging from natural collective behaviors and social dynamics to engineered systems such as autonomous vehicles, sensor networks, and smart grids. Although research on multi-agent systems has diversified into numerous specialized…

901

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Feb 2026 · 2602.23276
ReasoningBenchmarksReinforcementSafety

Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also…

902

MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Feb 2026 · 2602.23184
RAGBenchmarks

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle…

903

Learning-based Multi-agent Race Strategies in Formula 1

Feb 2026 · 2602.23056
Self-ImprovingMulti-AgentReinforcement

In Formula 1, race strategies are adapted according to evolving race conditions and competitors' actions. This paper proposes a reinforcement learning approach for multi-agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions. Building on a pre-trained…

904

Multi-agent imitation learning with function approximation: Linear Markov games and beyond

Feb 2026 · 2602.22810
Multi-Agent

In this work, we present the first theoretical analysis of multi-agent imitation learning (MAIL) in linear Markov games where both the transition dynamics and each agent's reward function are linear in some given features. We demonstrate that by leveraging this structure, it is possible to replace the state-action level "all policy deviation…

905

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Feb 2026 · 2602.22683
RAGReasoningBenchmarks

The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on…

906

Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents

Feb 2026 · 2602.22302
BenchmarksMulti-Agent

Traditional software relies on contracts -- APIs, type systems, assertions -- to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is the root cause of drift, governance failures, and frequent project failures in agentic AI…

907

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Feb 2026 · 2602.22190
Long-HorizonReasoningBenchmarks

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in…

908

Training Generalizable Collaborative Agents via Strategic Risk Aversion

Feb 2026 · 2602.21515
BenchmarksMulti-AgentReinforcement

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training…

909

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Feb 2026 · 2602.20708
AgenticBenchmarksInference

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic…

910

TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents

Feb 2026 · 2602.19633
Planning

Language Model (LM) agents have demonstrated remarkable capabilities in solving tasks that require multiple interactions with the environment. However, they remain vulnerable in environments where a single error often leads to irrecoverable failure, particularly under strict feasibility constraints. We systematically analyze existing agent…

911

ComUICoder: Component-based Reusable UI Code Generation for Complex Websites via Semantic Segmentation and Element-wise Feedback

Feb 2026 · 2602.19276
Software DevBenchmarks

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on the UI-to-code task, which aims to generate UI code from design mock-ups. However, when applied to long and complex websites, they often struggle with fragmented segmentation, redundant code generation for repetitive components, and frequent UI inconsistencies. To…

912

OpenClaw AI Agents as Informal Learners at Moltbook: Characterizing an Emergent Learning Community at Scale

Feb 2026 · 2602.18832
AgenticBenchmarks

Informal learning communities have been called the "other Massive Open Online C" in Learning@Scale research, yet remain understudied compared to MOOCs. We present the first empirical study of a large-scale informal learning community composed entirely of AI agents. Moltbook, a social network exclusively for AI agents powered by autonomous agent…

913

GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems

Feb 2026 · 2602.15776
Multi-AgentInference

In the realm of multi-agent systems, the challenge of partial observability is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging…

914

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Feb 2026 · 2602.16855
MemoryLong-HorizonReasoningBenchmarks

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results on more than 20 GUI…

915

DiffusionRollout: Uncertainty-Aware Rollout Planning in Long-Horizon PDE Solving

Feb 2026 · 2602.13616
Long-HorizonPlanningBenchmarksReinforcement

We propose DiffusionRollout, a novel selective rollout planning strategy for autoregressive diffusion models, aimed at mitigating error accumulation in long-horizon predictions of physical systems governed by partial differential equations (PDEs). Building on the recently validated probabilistic approach to PDE solving, we further explore its…

916

Small Reward Models via Backward Inference

Feb 2026 · 2602.13551
ReasoningBenchmarksReinforcementInference

Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader…

917

Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System

Feb 2026 · 2602.10915
MemoryAgenticReasoningBenchmarks

The evolution of Large Language Models (LLMs) has shifted mobile computing from App-centric interactions to system-level autonomous agents. Current implementations predominantly rely on a "Screen-as-Interface" paradigm, which inherits structural vulnerabilities and conflicts with the mobile ecosystem's economic foundations. In this paper, we…

918

Co-jump: Cooperative Jumping with Quadrupedal Robots via Multi-Agent Reinforcement Learning

Feb 2026 · 2602.10514
Multi-AgentFine-TuningReinforcement

While single-agent legged locomotion has witnessed remarkable progress, individual robots remain fundamentally constrained by physical actuation limits. To transcend these boundaries, we introduce Co-jump, a cooperative task where two quadrupedal robots synchronize to execute jumps far beyond their solo capabilities. We tackle the high-impulse…

919

SecCodePRM: A Process Reward Model for Code Security

Feb 2026 · 2602.10418
Long-HorizonSoftware DevReinforcementSafety

Large Language Models are rapidly becoming core components of modern software development workflows, yet ensuring code security remains challenging. Existing vulnerability detection pipelines either rely on static analyzers or use LLM/GNN-based detectors trained with coarse program-level supervision. Both families often require complete context,…

920

Anagent For Enhancing Scientific Table & Figure Analysis

Feb 2026 · 2602.10081
ReasoningBenchmarksMulti-AgentFine-Tuning

In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of…

921

Instruction-based Image Editing with Planning, Reasoning, and Generation

Feb 2026 · 2602.22624
PlanningReasoning

Editing images via instruction provides a natural way to generate interactive content, but it remains challenging because it places higher demands on scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a…

922

BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning

Feb 2026 · 2602.22284
Software DevReasoning

Recent advancements in deep learning have actively addressed complex challenges within the Computer-Aided Design (CAD) domain. However, most existing approaches rely on task-specific models requiring structural modifications for new tasks, and they predominantly focus on point clouds or images rather than the industry-standard Boundary…

923

Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach

Feb 2026 · 2602.21715
ReasoningFine-TuningReinforcement

The growing integration of distributed photovoltaics (PVs) into active distribution networks (ADNs) has exacerbated operational challenges, making it imperative to coordinate diverse equipment to mitigate voltage violations and enhance power quality. Although existing data-driven approaches have demonstrated effectiveness in the voltage control…

924

Regret-Guided Search Control for Efficient Learning in AlphaZero

Feb 2026 · 2602.20809
Self-ImprovingBenchmarksReinforcement

Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to…

925

Overton Pluralistic Reinforcement Learning for Large Language Models

Feb 2026 · 2602.20759
BenchmarksReinforcementArchitectureSafety

Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton…

926

Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation

Feb 2026 · 2602.17529
RAGBenchmarksKnowledge

Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due to domain complexity, evolving standards, and specialized terminology. Therefore, general-domain LLMs may struggle to provide accurate and reliable outputs in this context, leading to increased…

927

Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Feb 2026 · 2602.16650
RAGReasoningReinforcementInference

Polymer literature contains a large and growing body of experimental knowledge, yet much of it is buried in unstructured text and inconsistent terminology, making systematic retrieval and reasoning difficult. Existing tools typically extract narrow, study-specific facts in isolation, failing to preserve the cross-study context required to answer…

928

RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation

Feb 2026 · 2602.16444
Agentic

The pursuit of general-purpose robotic manipulation is hindered by the scarcity of diverse, real-world interaction data. Unlike data collection from the web in vision or language, robotic data collection is an active process incurring prohibitive physical costs. Consequently, automated task curation to maximize data value remains a critical yet…

929

Mobility-Aware Cache Framework for Scalable LLM-Based Human Mobility Simulation

Feb 2026 · 2602.16727
PlanningReasoning

Large-scale human mobility simulation is critical for applications such as urban planning, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility behaviors using structured reasoning, but their high computational cost limits scalability. To address this, we design a…

930

Position: Introspective Experience from Conversational Environments as a Path to Better Learning

Feb 2026 · 2602.14910
Reasoning

Current approaches to AI training treat reasoning as an emergent property of scale. We argue instead that robust reasoning emerges from linguistic self-reflection, itself internalized from high-quality social interaction. Drawing on Vygotskian developmental psychology, we advance three core positions centered on Introspection. First, we argue for…

931

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Feb 2026 · 2602.14457
MemoryAgenticSafety

To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As the general capabilities of Large Language Models (LLMs) rapidly evolve and agentic AI proliferates, this version of…

932

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Feb 2026 · 2602.13517
ReasoningBenchmarksInference

Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal…

933

Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

Feb 2026 · 2602.13028
BenchmarksReinforcement

Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user…

934

CacheMind: From Miss Rates to Why -- Natural-Language, Trace-Grounded Reasoning for Cache Replacement

Feb 2026 · 2602.12422
MemoryRAGReasoningBenchmarks

Cache replacement remains a challenging problem in CPU microarchitecture, often addressed using hand-crafted heuristics, limiting cache performance. Cache data analysis requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that…

935

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision

Feb 2026 · 2602.12164
ReasoningBenchmarks

Large language models (LLMs) have demonstrated exceptional reasoning capabilities, and co-evolving paradigms have shown promising results in domains such as code and math. However, in scientific reasoning tasks, these models remain fragile due to unreliable solution evaluation and limited diversity in verification strategies. In this work, we…

936

Multimodal Fact-Level Attribution for Verifiable Reasoning

Feb 2026 · 2602.11509
ReasoningBenchmarks

Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on…

937

Human Control Is the Anchor, Not the Answer: Early Divergence of Oversight in Agentic AI Communities

Feb 2026 · 2602.09286
AgenticSafety

Oversight for agentic AI is often discussed as a single goal ("human control"), yet early adoption may produce role-specific expectations. We present a comparative analysis of two newly active Reddit communities in Jan–Feb 2026 that reflect different socio-technical roles: r/OpenClaw (deployment and operations) and r/Moltbook (agent-centered…

938

Digital Twin and Agentic AI for Wild Fire Disaster Management: Intelligent Virtual Situation Room

Feb 2026 · 2602.08949
AgenticReinforcementInference

According to the United Nations, wildfire frequency and intensity are projected to increase by approximately 14% by 2030 and 30% by 2050 due to global warming, posing critical threats to life, infrastructure, and ecosystems. Conventional disaster management frameworks rely on static simulations and passive data acquisition, hindering their ability…

939

Large Language Models for Geolocation Extraction in Humanitarian Crisis Response

Feb 2026 · 2602.08872
ReasoningBenchmarksPrompting

Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can…

940

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Feb 2026 · 2602.23008
MemoryFine-TuningReinforcement

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that leverages…

941

MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks

Feb 2026 · 2602.22808
ReasoningBenchmarksReinforcement

Despite the remarkable progress of large language models (LLMs), the capabilities of standalone LLMs have begun to plateau when tackling real-world, complex tasks that require interaction with external tools and dynamic environments. Although recent agent frameworks aim to enhance model autonomy through tool integration and external interaction,…

942

CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

Feb 2026 · 2602.22452
PlanningReasoningBenchmarksFine-Tuning

A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine-tuning (SFT) to train action scorers, but SFT treats each candidate independently…

943

Language Models Exhibit Inconsistent Biases Towards Algorithmic Agents and Human Experts

Feb 2026 · 2602.22070
BenchmarksSafety

Large language models are increasingly used in decision-making tasks that require them to process information from a variety of sources, including both human experts and other algorithmic agents. How do LLMs weigh the information provided by these different sources? We consider the well-studied phenomenon of algorithm aversion, in which human…

944

HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG

Feb 2026 · 2602.20926
RAGReasoningBenchmarksInference

Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations, limiting their reliability in knowledge-intensive tasks. While Retrieval-Augmented Generation (RAG) mitigates these issues, it frequently overlooks structural interdependencies essential for multi-hop reasoning. Graph-based RAG approaches attempt to…

945

OpenClaw, Moltbook, and ClawdLab: From Agent-Only Social Networks to Autonomous Scientific Research

Feb 2026 · 2602.19810
Multi-AgentArchitecture

In January 2026, the open-source agent framework OpenClaw and the agent-only social network Moltbook produced a large-scale dataset of autonomous AI-to-AI interaction, attracting six academic publications within fourteen days. This study conducts a multivocal literature review of that ecosystem and presents ClawdLab, an open-source platform for…

946

NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners

Feb 2026 · 2602.18962
Multi-AgentReinforcement

The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation…

947

LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology

Feb 2026 · 2602.18773
ContextReasoningFine-TuningArchitecture

The emergence of tool-calling-based agent systems introduces a more evidence-driven paradigm for pathology image analysis in contrast to the coarse-grained text-image diagnostic approaches. With the recent large-scale experimental adoption of spatial transcriptomics technologies, molecularly validated pathological diagnosis is becoming…

948

HONEST-CAV: Hierarchical Optimization of Network Signals and Trajectories for Connected and Automated Vehicles with Multi-Agent Reinforcement Learning

Feb 2026 · 2602.18740
PlanningMulti-AgentReinforcement

This study presents a hierarchical, network-level traffic flow control framework for mixed traffic consisting of Human-driven Vehicles (HVs) and Connected and Automated Vehicles (CAVs). The framework jointly optimizes vehicle-level eco-driving behaviors and intersection-level traffic signal control to enhance overall network efficiency and decrease…

949

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Feb 2026 · 2602.17990
BenchmarksMulti-AgentReinforcement

LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying…

950

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Feb 2026 · 2602.17062
BenchmarksMulti-AgentFine-TuningReinforcement

Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value…

951

Overseeing Agents Without Constant Oversight: Challenges and Opportunities

Feb 2026 · 2602.16844
AgenticReasoningReinforcement

To enable human oversight, agentic AI systems often provide a trace of reasoning and action steps. Designing traces to have an informative, but not overwhelming, level of detail remains a critical challenge. In three user studies on a Computer User Agent, we investigate the utility of basic action traces for verification, explore three…

952

Leveraging Large Language Models for Causal Discovery: a Constraint-based, Argumentation-driven Approach

Feb 2026 · 2602.16481
ReasoningBenchmarks

Causal discovery seeks to uncover causal relations from data, typically represented as causal graphs, and is essential for predicting the effects of interventions. While expert knowledge is required to construct principled causal graphs, many statistical methods have been proposed to leverage observational data with varying formal guarantees.…

953

GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training

Feb 2026 · 2602.14093
Self-ImprovingLong-HorizonPlanningInference

Post-training GUI agents in interactive environments is critical for developing generalization and long-horizon planning capabilities. However, training on real-world applications is hindered by high latency, poor reproducibility, and unverifiable rewards relying on noisy visual proxies. To address these limitations, we present GUI-GENESIS, the…

954

ATTest: Agent-Driven Tensor Testing for Deep Learning Library Modules

Feb 2026 · 2602.13987
Benchmarks

The unit testing of Deep Learning (DL) libraries is challenging due to complex numerical semantics and implicit tensor constraints. Traditional Search-Based Software Testing (SBST) often suffers from semantic blindness, failing to satisfy the constraints of high-dimensional tensors, whereas Large Language Models (LLMs) struggle with cross-file…

955

The PBSAI Governance Ecosystem: A Multi-Agent AI Reference Architecture for Securing Enterprise AI Estates

Feb 2026 · 2602.11301
RAGMulti-AgentReinforcementArchitecture

Enterprises are rapidly deploying large language models, retrieval-augmented generation pipelines, and tool-using agents into production, often on shared high-performance computing clusters and cloud accelerator platforms that also support defensive analytics. These systems increasingly function not as isolated models but as AI estates: socio…

956

Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT

Feb 2026 · 2602.11220
BenchmarksFine-TuningReinforcementSafety

Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model's prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been…

957

Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Feb 2026 · 2602.10715
MemoryBenchmarksPrompting

Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To…

958

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Feb 2026 · 2602.10623
ReinforcementInference

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward…

959

MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

Feb 2026 · 2602.10271
RAGReasoning

Understanding multimodal long-context documents that comprise multimodal chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity, which makes it hard to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating dispersed evidence across pages. To address these challenges, we are motivated to adopt…

960

PABU: Progress-Aware Belief Update for Efficient LLM Agents

Feb 2026 · 2602.09138
BenchmarksInference

Large Language Model (LLM) agents commonly condition actions on full action-observation histories, which introduce task-irrelevant information that easily leads to redundant actions and higher inference cost. We propose Progress-Aware Belief Update (PABU), a belief-state framework that compactly represents an agent's state by explicitly modeling…

961

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Feb 2026 · 2602.23163
Reasoning

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of…

962

RLHFless: Serverless Computing for Efficient RLHF

Feb 2026 · 2602.22718
ReasoningReinforcementInference

Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF's potential to improve LLM reasoning on complex tasks. In RL, inference and training co-exist, creating dynamic resource…

963

Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Feb 2026 · 2602.22479
MemoryContextBenchmarksFine-Tuning

Continual learning is a core requirement for deployed language models, yet standard training and fine-tuning pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while methods that improve stability frequently increase latency, memory footprint, or dense computation in ways that do not scale well…

964

Grounding LLMs in Scientific Discovery via Embodied Actions

Feb 2026 · 2602.20639
Long-HorizonReasoning

Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and verifiable physical simulation. Existing solutions operate in a passive "execute-then-response" loop and thus lack runtime perception, leaving agents blind to transient anomalies (e.g., numerical…

965

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Feb 2026 · 2602.20571
ReasoningBenchmarksInference

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation, i.e., implementing that design numerically…

966

CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

Feb 2026 · 2602.20213
Software DevBenchmarks

The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted…

967

MARS: Margin-Aware Reward-Modeling with Self-Refinement

Feb 2026 · 2602.17658
Self-ImprovingReinforcementSafety

Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches…

968

Revisiting Weight Regularization for Low-Rank Continual Learning

Feb 2026 · 2602.17559
MemoryBenchmarksFine-TuningInference

Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning a task-specific…

969

Simple Baselines are Competitive with Code Evolution

Feb 2026 · 2602.16805
AgenticBenchmarks

Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existing code. Many proposed code evolution pipelines show impressive performance but are often not compared to simpler baselines. We test how well two simple baselines do over three domains: finding…

970

FactorMiner: A Self-Evolving Agent with Skills and Experience Memory for Financial Alpha Discovery

Feb 2026 · 2602.14670
MemorySelf-ImprovingBenchmarksFine-Tuning

Formulaic alpha factor mining is a critical yet challenging task in quantitative investment, characterized by a vast search space and the need for domain-informed, interpretable signals. However, finding novel signals becomes increasingly difficult as the library grows due to high redundancy. We propose FactorMiner, a lightweight and flexible…

971

ProbeLLM: Automating Principled Diagnosis of LLM Failures

Feb 2026 · 2602.12966
BenchmarksFine-TuningReinforcement

Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited…

972

Multi UAVs Preflight Planning in a Shared and Dynamic Airspace

Feb 2026 · 2602.12055
PlanningBenchmarksMulti-AgentInference

Preflight planning for large-scale Unmanned Aerial Vehicle (UAV) fleets in dynamic, shared airspace presents significant challenges, including temporal No-Fly Zones (NFZs), heterogeneous vehicle profiles, and strict delivery deadlines. While Multi-Agent Path Finding (MAPF) provides a formal framework, existing methods often lack the scalability…

973

Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

Feb 2026 · 2602.11877
AgenticBenchmarksSafety

Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled…

974

FlowMind: Execute-Summarize for Structured Workflow Generation from LLM Reasoning

Feb 2026 · 2602.11782
AgenticReasoning

LLMs can solve complex tasks through reasoning and tool use, but accurately translating these solutions into structured workflows remains challenging. We model workflows as sequences of tool use and reformulate the problem as designing a mechanism that can both solve tasks and reliably construct workflows. Prior approaches that build workflows…

975

Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Feb 2026 · 2602.08915
Software Dev

The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing…

976

SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Feb 2026 · 2602.21158
BenchmarksFine-TuningReinforcement

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty…

977

MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents

Feb 2026 · 2602.21941
BenchmarksFine-TuningPrompting

Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on pure textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to…

978

Codified Context: Infrastructure for AI Agents in a Complex Codebase

Feb 2026 · 2602.20478
MemoryMulti-AgentKnowledge

LLM-based agentic coding assistants lack persistent memory: they lose coherence across sessions, forget project conventions, and repeat known mistakes. Recent studies characterize how developers configure agents through manifest files, but an open challenge remains how to scale such configurations for large, multi-agent projects. This paper…

979

Implicit Intelligence -- Evaluating Agents on What Users Don't Say

Feb 2026 · 2602.20424
AgenticReasoningBenchmarksFine-Tuning

Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility…

980

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Feb 2026 · 2602.17949
RAGReasoningBenchmarksReinforcement

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is…

981

From Subtle to Significant: Prompt-Driven Self-Improving Optimization in Test-Time Graph OOD Detection

Feb 2026 · 2602.17342
Self-ImprovingBenchmarksInference

Graph Out-of-Distribution (OOD) detection aims to identify whether a test graph deviates from the distribution of graphs observed during training, which is critical for ensuring the reliability of Graph Neural Networks (GNNs) when deployed in open-world scenarios. Recent advances in graph OOD detection have focused on test-time training techniques…

982

Automating Agent Hijacking via Structural Template Injection

Feb 2026 · 2602.16958
Agentic

Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted, semantics-driven prompt manipulation, which often yields low attack success rates and limited…

983

Narrow fine-tuning erodes safety alignment in vision-language agents

Feb 2026 · 2602.16931
BenchmarksFine-TuningSafetyInference

Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly…

984

Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Feb 2026 · 2602.16520
AgenticSafety

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection…

985

WebWorld: A Large-Scale World Model for Web Agent Training

Feb 2026 · 2602.14721
Long-HorizonReasoningBenchmarksReinforcement

Web agents require massive trajectories to generalize, yet real-world training is constrained by network latency, rate limits, and safety risks. We introduce the WebWorld series, the first open-web simulator trained at scale. While existing simulators are restricted to closed environments with thousands of trajectories, WebWorld leverages a…

986

Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows

Feb 2026 · 2602.14295
ReasoningArchitectureInference

We introduce Machine Learning as a Tool (MLAT), a design pattern in which pre-trained statistical machine learning models are exposed as callable tools within large language model (LLM) agent workflows. This allows an orchestrating agent to invoke quantitative predictions when needed and reason about their outputs in context. Unlike conventional…

987

Arming Data Agents with Tribal Knowledge

Feb 2026 · 2602.13521
ReasoningBenchmarks

Natural language to SQL (NL2SQL) translation enables non-expert users to query relational databases through natural language. Recently, NL2SQL agents, powered by the reasoning capabilities of Large Language Models (LLMs), have significantly advanced NL2SQL translation. Nonetheless, NL2SQL agents still make mistakes when faced with large-scale…

988

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

Feb 2026 · 2602.12617
Reasoning

This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which…

989

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Feb 2026 · 2602.12116
BenchmarksReinforcementSafety

Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse,…

990

Do Not Treat Code as Natural Language: Implications for Repository-Level Code Generation and Beyond

Feb 2026 · 2602.11671
Software DevRAGBenchmarksInference

Large language models for code (CodeLLMs) have demonstrated remarkable success in standalone code completion and generation, sometimes even surpassing human performance, yet their effectiveness diminishes in repository-level settings where cross-file dependencies and structural context are essential. Existing Retrieval-Augmented Generation (RAG)…

991

R2RAG-Flood: A reasoning-reinforced training-free retrieval augmentation generation framework for flood damage nowcasting

Feb 2026 · 2602.10312
RAGReasoningFine-TuningInference

R2RAG-Flood is a reasoning-reinforced, training-free retrieval-augmented generation framework for post-storm property damage nowcasting. Building on an existing supervised tabular predictor, the framework constructs a reasoning-centric knowledge base composed of labeled tabular records, where each sample includes structured predictors, a compact…

992

Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

Feb 2026 · 2602.10159
MemoryAgenticReasoningBenchmarks

Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples…

993

Adaptive Value Decomposition: Coordinating a Varying Number of Agents in Urban Systems

Feb 2026 · 2602.13309
Multi-AgentReinforcement

Multi-agent reinforcement learning (MARL) provides a promising paradigm for coordinating multi-agent systems (MAS). However, most existing methods rely on restrictive assumptions, such as a fixed number of agents and fully synchronous action execution. These assumptions are often violated in urban systems, where the number of active agents varies…

994

Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks

Feb 2026 · 2602.22730
BenchmarksReinforcementArchitectureSafety

This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern…

995

ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Feb 2026 · 2602.22623
ReasoningBenchmarksReinforcement

We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but low-quality reasoning…

996

CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Feb 2026 · 2602.22557
RAGBenchmarksMulti-AgentFine-Tuning

Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as…

997

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Feb 2026 · 2602.22207
Benchmarks

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these…

998

Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

Feb 2026 · 2602.22072
ReasoningPrompting

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others. Contributing to the debate whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought prompting (CoT) to enhance…

999

RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning

Feb 2026 · 2602.21951
ReasoningBenchmarksReinforcementInference

Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution…

1000

Cooperative-Competitive Team Play of Real-World Craft Robots

Feb 2026 · 2602.21119
Multi-AgentReinforcement

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a…

1001

Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs

Feb 2026 · 2602.20799
Software DevRAGReasoningBenchmarks

In the context of newly released software frameworks, large language models (LLMs) often exhibit poor performance and a high rate of hallucination, as they are not exposed to such environments during training. Although inference-time augmentation techniques such as retrieval-augmented generation (RAG) can partially mitigate hallucinations,…

1002

Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent

Feb 2026 · 2602.19837
Reinforcement

Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new…

1003

KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Feb 2026 · 2602.19643
BenchmarksReinforcementKnowledge

Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present…

1004

Cost-Aware Diffusion Active Search

Feb 2026 · 2602.19538
Multi-AgentFine-TuningReinforcement

Active search for recovering objects of interest through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments with exploitation of prior observations in the search space. Prior work has proposed information gain and Thompson sampling based myopic, greedy approaches for agents to actively…

1005

Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM

Feb 2026 · 2602.18693
ReasoningBenchmarksReinforcementKnowledge

The spread of misinformation across digital platforms can pose significant societal risks. Claim verification (also known as fact-checking) systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and…

1006

Mining Type Constructs Using Patterns in AI-Generated Code

Feb 2026 · 2602.17955
Software DevAgenticSafety

Artificial Intelligence (AI) increasingly automates various parts of software development tasks. Although AI has enhanced the productivity of development tasks, it remains unstudied whether AI essentially outperforms humans in type-related programming tasks, such as employing type constructs properly for type safety, during its tasks.…

1007

VQPP: Video Query Performance Prediction Benchmark

Feb 2026 · 2602.17814
BenchmarksFine-TuningReinforcement

Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video…

1008

Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

Feb 2026 · 2602.17183
ReasoningBenchmarks

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and…

1009

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

Feb 2026 · 2602.16756
AgenticReasoningBenchmarksSafety

We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). With minimal test cases of information and access security, NESSiE reveals safety-relevant failures that should not exist, given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety and, as such,…

1010

GPSBench: Do Large Language Models Understand GPS Coordinates?

Feb 2026 · 2602.16105
AgenticReasoningFine-Tuning

Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability. Despite that, LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a…

1011

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Feb 2026 · 2602.14564
ReasoningBenchmarksFine-TuningReinforcement

Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially in developing medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, containing 38,000…

1012

ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Feb 2026 · 2602.14135
AgenticBenchmarksSafety

Rapidly evolving AI exhibits increasingly strong autonomy and goal-directed capabilities, accompanied by derivative systemic risks that are more unpredictable, difficult to control, and potentially irreversible. However, current AI safety evaluation systems suffer from critical limitations such as restricted risk dimensions and failed frontier…

1013

Plan-MCTS: Plan Exploration for Action Exploitation in Web Navigation

Feb 2026 · 2602.14083
Long-HorizonAgenticPlanningReasoning

Large Language Models (LLMs) have empowered autonomous agents to handle complex web navigation tasks. While recent studies integrate tree search to enhance long-horizon reasoning, applying these algorithms in web navigation faces two critical challenges: sparse valid paths that lead to inefficient exploration, and a noisy context that dilutes…

1014

Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis

Feb 2026 · 2602.13979
ReasoningFine-Tuning

Alzheimer's disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have…

1015

LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

Feb 2026 · 2602.13571
RAGBenchmarksArchitectureInference

Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers…

1016

BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

Feb 2026 · 2602.12889
ReasoningBenchmarksInference

We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021–2025), where each instance requires structured inference over a fixed…

1017

Closing the Loop: A Control-Theoretic Framework for Provably Stable Time Series Forecasting with LLMs

Feb 2026 · 2602.12756
Long-HorizonReasoningBenchmarksInference

Large Language Models (LLMs) have recently shown exceptional potential in time series forecasting, leveraging their inherent sequential reasoning capabilities to model complex temporal dynamics. However, existing approaches typically employ a naive autoregressive generation strategy. We identify a critical theoretical flaw in this paradigm: during…

1018

ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter

Feb 2026 · 2602.12709
RAGBenchmarksFine-TuningInference

Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and…

1019

Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Feb 2026 · 2602.12196
ReasoningBenchmarks

AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate…

1020

WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Feb 2026 · 2602.12135
ReasoningBenchmarksReinforcement

With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics…

1021

Mitigating Mismatch within Reference-based Preference Optimization

Feb 2026 · 2602.11902
SafetyInference

Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes the training by regularizing the updates within a trusted region. This reliance…

1022

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Feb 2026 · 2602.11786
BenchmarksSafetyInference

Traditional benchmarks for large language models (LLMs) primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment exposes a different class of risk: operational failures arising from repeated inference on identical or near-identical prompts rather than broad task generalization. In…

1023

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Feb 2026 · 2602.11635
ReasoningBenchmarksFine-Tuning

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95%…

1024

PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

Feb 2026 · 2602.11530
MemoryReasoningBenchmarksInference

The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation…

1025

Hidden Licensing Risks in the LLMware Ecosystem

Feb 2026 · 2602.10758
Reinforcement

Large Language Models (LLMs) are increasingly integrated into software systems, giving rise to a new class of systems referred to as LLMware. Beyond traditional source-code components, LLMware embeds or interacts with LLMs that depend on other models and datasets, forming complex supply chains across open-source software (OSS), models, and…

1026

SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Feb 2026 · 2602.10017
RAGPlanningBenchmarksReinforcement

Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented…

1027

UI-Venus-1.5 Technical Report

Feb 2026 · 2602.09082
Long-HorizonBenchmarksReinforcement

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family…

1028

Evaluating Stochasticity in Deep Research Agents

Feb 2026 · 2602.23271
AgenticBenchmarksReinforcementInference

Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial decision-making, medical analysis, and scientific discovery. Despite recent improvements in research quality (e.g., outcome accuracy when ground truth is available), DRA system design often overlooks…

1029

Robust Information Design for Multi-Agent Systems with Complementarities: Smallest-Equilibrium Threshold Policies

Feb 2026 · 2602.22915
Multi-Agent

We study information design in multi-agent systems (MAS) with binary actions and strategic complementarities, where an external designer influences behavior only through signals. Agents play the smallest-equilibrium of the induced Bayesian game, reflecting conservative, coordination-averse behavior typical in distributed systems. We show that when…

1030

Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

Feb 2026 · 2602.22697
BenchmarksReinforcement

The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL,…

1031

Sustainable Multi-Agent Crowdsourcing via Physics-Informed Bandits

Feb 2026 · 2602.22365
Multi-AgentFine-Tuning

Crowdsourcing platforms face a four-way tension between allocation quality, workforce sustainability, operational feasibility, and strategic contractor behaviour, a dilemma we formalise as the Cold-Start, Burnout, Utilisation, and Strategic Agency Dilemma. Existing methods resolve at most two of these tensions simultaneously: greedy heuristics and…

1032

Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

Feb 2026 · 2602.21480
BenchmarksInference

Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL…

1033

From Cooperation to Hierarchy: A Study of Dynamics of Hierarchy Emergence in a Multi-Agent System

Feb 2026 · 2602.21404
Multi-Agent

A central premise in evolutionary biology is that individual variation can generate information asymmetries that facilitate the emergence of hierarchical organisation. To examine this process, we develop an agent-based model (ABM) to identify the minimal conditions under which hierarchy arises in dynamic multi-agent systems, focusing on the roles…

1034

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Feb 2026 · 2602.20687
Benchmarks

Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus…

1035

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Feb 2026 · 2602.20156
Benchmarks

LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain,…

1036

Benchmarking Unlearning for Vision Transformers

Feb 2026 · 2602.20114
BenchmarksArchitecture

Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research…

1037

Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Feb 2026 · 2602.20078
Multi-AgentReinforcementArchitecture

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$,…

1038

OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents

Feb 2026 · 2602.19439
PlanningReasoning

Supply chain optimization models frequently become infeasible because of modeling errors. Diagnosis and repair require scarce OR expertise: analysts must interpret solver diagnostics, trace root causes across echelons, and fix formulations without sacrificing operational soundness. Whether AI agents can perform this task remains untested. We…

1039

A potentialization algorithm for games with applications to multi-agent learning in repeated games

Feb 2026 · 2602.18925
Multi-Agent

We investigate an algorithm that assigns to any game in normal form an approximating game that admits an ordinal potential function. Due to the properties of potential games, the algorithm equips every game with a surrogate reward structure that allows efficient multi-agent learning. Numerical simulations using the replicator dynamics show that…

1040

Carbon-aware decentralized dynamic task offloading in MIMO-MEC networks via multi-agent reinforcement learning

Feb 2026 · 2602.18797
Multi-AgentReinforcementArchitectureInference

Massive internet of things microservices require integrating renewable energy harvesting into mobile edge computing (MEC) for sustainable eScience infrastructures. Spatiotemporal mismatches between stochastic task arrivals and intermittent green energy along with complex inter-user interference in multi-antenna (MIMO) uplinks complicate real-time…

1041

Aurora: Neuro-Symbolic AI Driven Advising Agent

Feb 2026 · 2602.17999
RAGReasoningBenchmarksReinforcement

Academic advising in higher education is under severe strain, with advisor-to-student ratios commonly exceeding 300:1. These structural bottlenecks limit timely access to guidance, increase the risk of delayed graduation, and contribute to inequities in student support. We introduce Aurora, a modular neuro-symbolic advising agent that unifies…

1042

From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences

Feb 2026 · 2602.17221
AgenticReasoningFine-TuningInference

Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a "methodological experiment," this study proposes an AI Agent-based collaborative research workflow (Agentic…

1043

Towards a Science of AI Agent Reliability

Feb 2026 · 2602.16666
ReasoningBenchmarksSafety

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical…

1044

Multi-agent cooperation through in-context co-player inference

Feb 2026 · 2602.16301
Multi-AgentReinforcementInferencePrompting

Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often…

1045

Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning

Feb 2026 · 2602.16196
Multi-AgentReinforcement

Coordinating large populations of interacting agents is a central challenge in multi-agent reinforcement learning (MARL), where the size of the joint state-action space scales exponentially with the number of agents. Mean-field methods alleviate this burden by aggregating agent interactions, but these approaches assume homogeneous interactions.…

1046

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Feb 2026 · 2602.15909
Benchmarks

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an…

1047

LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

Feb 2026 · 2602.14612
RAGReasoningBenchmarksArchitecture

Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due…

1048

Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Feb 2026 · 2602.14299
MemoryAgentic

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agents participate in an open-ended, continuously…

1049

Testing BDI-based Multi-Agent Systems using Discrete Event Simulation

Feb 2026 · 2602.13878
Multi-Agent

Multi-agent systems are designed to deal with open, distributed systems with unpredictable dynamics, which makes them inherently hard to test. The value of using simulation for this purpose is recognized in the literature, although achieving sufficient fidelity (i.e., the degree of similarity between the simulation and the real-world system)…

1050

From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents

Feb 2026 · 2602.13855
Long-HorizonBenchmarksReinforcementInference

A deep research agent produces a fluent scientific report in minutes; a careful reader then tries to verify the main claims and discovers the real cost is not reading, but tracing: which sentence is supported by which passage, what was ignored, and where evidence conflicts. We argue that as research generation becomes cheap, auditability becomes…

1051

Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

Feb 2026 · 2602.13832
ReasoningBenchmarksReinforcementInference

Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to users' true needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user beliefs and true…

1052

Favia: Forensic Agent for Vulnerability-fix Identification and Analysis

Feb 2026 · 2602.12500
ReasoningSafety

Identifying vulnerability-fixing commits corresponding to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, as large repositories contain millions of commits of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent…

1053

Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization

Feb 2026 · 2602.11437
Multi-AgentReinforcementArchitecture

Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, the reliability of this recipe in real-world settings…

1054

MultiCube-RAG for Multi-hop Question Answering

Feb 2026 · 2602.15898
RAGReasoning

Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but…

1055

SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios

Feb 2026 · 2602.10840
Software DevReasoningBenchmarksReinforcement

Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical…

1056

Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion

Feb 2026 · 2602.15895
MemoryRAGReasoningBenchmarks

Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a…

1057

AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis

Feb 2026 · 2602.09372
Long-HorizonArchitecture

Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling…

1058

VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

Feb 2026 · 2602.10146
ContextReasoningBenchmarks

While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we…

1059

TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

Feb 2026 · 2602.22827
Benchmarks

This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework…

1060

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Feb 2026 · 2602.22755
AgenticBenchmarksReinforcementSafety

We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors (such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties), which it does not confess to when directly asked. AuditBench models are…

1061

IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages

Feb 2026 · 2602.22125
ReasoningBenchmarksReinforcement

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800…

1062

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Feb 2026 · 2602.21706
ReasoningBenchmarksReinforcementArchitecture

Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection,…

1063

Large Language Model-Assisted UAV Operations and Communications: A Multifaceted Survey and Tutorial

Feb 2026 · 2602.19534
RAGPlanningReasoningFine-Tuning

Uncrewed Aerial Vehicles (UAVs) are widely deployed across diverse applications due to their mobility and agility. Recent advances in Large Language Models (LLMs) offer a transformative opportunity to enhance UAV intelligence beyond conventional optimization-based and learning-based approaches. By integrating LLMs into UAV systems, advanced…

1064

Ada-RS: Adaptive Rejection Sampling for Selective Thinking

Feb 2026 · 2602.19519
ReasoningBenchmarksFine-TuningInference

Large language models (LLMs) are increasingly being deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning…

1065

ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making

Feb 2026 · 2602.19458
Multi-AgentFine-TuningReinforcement

Multi-agent decision pipelines can outperform single-agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as reward to…

1066

Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

Feb 2026 · 2602.18710
AgenticReinforcementInference

The conclusions of empirical research depend not only on data but on a sequence of analytic decisions that published results seldom make explicit. Past "many-analyst" studies have demonstrated this: independent teams testing the same hypothesis on the same dataset regularly reach conflicting conclusions. But such studies require months of…

1067

ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation

Feb 2026 · 2602.18306
Software DevBenchmarks

With the rapid improvement of LLMs' coding capabilities, the bottleneck of LLM-based automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains underexplored. Existing evaluations often…

1068

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Feb 2026 · 2602.18037
Reinforcement

Reinforcement Learning from Human Feedback (RLHF) and from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a…

1069

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Feb 2026 · 2602.17831
ReasoningBenchmarks

Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine…

1070

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Feb 2026 · 2602.17316
BenchmarksInference

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and…

1071

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Feb 2026 · 2602.17038
AgenticReinforcementArchitecture

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could…

1072

ReIn: Conversational Error Recovery with Reasoning Inception

Feb 2026 · 2602.17022
ReasoningFine-TuningReinforcement

Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous…

1073

NeST: Neuron Selective Tuning for LLM Safety

Feb 2026 · 2602.16835
BenchmarksFine-TuningSafetyInference

Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA…

1074

Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Feb 2026 · 2602.16053
BenchmarksFine-TuningReinforcementSafety

Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived…

1075

Quantifying construct validity in large language model evaluations

Feb 2026 · 2602.15532
Benchmarks

The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the…

1076

Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Feb 2026 · 2602.15318
ContextInference

Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches.…

1077

LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning

Feb 2026 · 2602.14428
ReasoningBenchmarksReinforcementArchitecture

Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to…

1078

Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Feb 2026 · 2602.14189
ReasoningBenchmarksReinforcementArchitecture

Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification…

1079

Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Feb 2026 · 2602.13165
AgenticReinforcementArchitectureInference

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline vetted responses mined from logs, backed by a dynamic cache…

1080

EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition

Feb 2026 · 2602.12919
ReasoningBenchmarksReinforcement

Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we…

1081

MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Feb 2026 · 2602.12871
BenchmarksKnowledge

We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge…

1082

PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving

Feb 2026 · 2602.12029
Multi-AgentFine-TuningInference

Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared context. This execution pattern repeatedly processes the same prompt prefix across models. Consequently, each model redundantly executes the prefill stage and maintains its own key-value (KV)…

1083

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

Feb 2026 · 2602.11674
Benchmarks

Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health…

1084

Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge

Feb 2026 · 2602.11340
ContextBenchmarksFine-TuningSafety

Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains challenging. While supervised fine-tuning on human-labeled data can improve alignment, it is costly and inflexible, requiring new training for each task…

1085

In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution

Feb 2026 · 2602.11079
BenchmarksEmergentSafety

We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these…

1086

MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling

Feb 2026 · 2602.13332
AgenticReasoningBenchmarksReinforcement

Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate,…

1087

MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

Feb 2026 · 2602.10575
ReasoningBenchmarksReinforcementArchitecture

Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's…

1088

Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models

Feb 2026 · 2602.10282
ReasoningBenchmarks

Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning (estimating effect sizes that parametrize functional relationships) remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for…

1089

Learning to Evict from Key-Value Cache

Feb 2026 · 2602.10238
MemoryContextBenchmarksReinforcement

The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future…

1090

OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation

Feb 2026 · 2602.08896
BenchmarksArchitecture

Academic peer review remains the cornerstone of scholarly validation, yet the field faces some challenges in data and methods. From the data perspective, existing research is hindered by the scarcity of large-scale, verified benchmarks and oversimplified evaluation metrics that fail to reflect real-world editorial workflows. To bridge this gap, we…

1091

6G-Bench: An Open Benchmark for Semantic Communication and Network-Level Reasoning with Foundation Models in AI-Native 6G Networks

Feb 2026 · 2602.08675
ReasoningBenchmarksFine-TuningReinforcement

This paper introduces 6G-Bench, an open benchmark for evaluating semantic communication and network-level reasoning in AI-native 6G networks. 6G-Bench defines a taxonomy of 30 decision-making tasks (T1–T30) extracted from ongoing 6G and AI-agent standardization activities in 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, and organizes them into…

1092

Towards Autonomous Memory Agents

Feb 2026 · 2602.22406
MemoryBenchmarksFine-Tuning

Recent memory agents improve LLMs by extracting experiences and conversation history into an external storage. This enables low-overhead context assembly and online memory update without expensive LLM training. However, existing solutions remain passive and reactive; memory growth is bounded by information that happens to be available, while…

1093

An Empirical Study of Bugs in Modern LLM Agent Frameworks

Feb 2026 · 2602.21806
Multi-Agent

LLM agents have been widely adopted in real-world applications, relying on agent frameworks for workflow execution and multi-agent coordination. As these systems scale, understanding bugs in the underlying agent frameworks becomes critical. However, existing work mainly focuses on agent-level failures, overlooking framework-level bugs. To address…

1094

Revisiting RAG Retrievers: An Information Theoretic Benchmark

Feb 2026 · 2602.21553
RAGBenchmarks

Retrieval-Augmented Generation (RAG) systems rely critically on the retriever module to surface relevant context for large language models. Although numerous retrievers have recently been proposed, each built on different ranking principles such as lexical matching, dense embeddings, or graph citations, there remains a lack of systematic…

1095

CAMEL: Confidence-Gated Reflection for Reward Modeling

Feb 2026 · 2602.20670
ReasoningBenchmarksReinforcementInference

Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We…

1096

When AI Teammates Meet Code Review: Collaboration Signals Shaping the Integration of Agent-Authored Pull Requests

Feb 2026 · 2602.19441
Software DevSafety

Autonomous coding agents increasingly contribute to software development by submitting pull requests on GitHub; yet, little is known about how these contributions integrate into human-driven review workflows. We present a large empirical study of agent-authored pull requests using the public AIDev dataset, examining integration outcomes,…

1097

MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

Feb 2026 · 2602.17930
MemoryReinforcementInference

Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces…

1098

AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing

Feb 2026 · 2602.17607
Multi-Agent

PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce AutoNumerics,…

1099

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

Feb 2026 · 2602.17054
ReasoningBenchmarks

While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized…

1100

NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron Crystallography

Feb 2026 · 2602.16812
AgenticBenchmarksInference

Large-scale facilities increasingly face analysis and reporting latency as the limiting step in scientific throughput, particularly for structurally and magnetically complex samples that require iterative reduction, integration, refinement, and validation. To improve time-to-result and analysis efficiency, NeuDiff Agent is introduced as a…

1101

Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments

Feb 2026 · 2602.16653
ContextBenchmarksReinforcement

The Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Based on these observations, an investigation is conducted to determine whether the Agent…

1102

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Feb 2026 · 2602.16346
MemoryReasoningBenchmarksPrompting

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over…

1103

EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

Feb 2026 · 2602.16179
AgenticBenchmarksReinforcement

We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support…

1104

In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations

Feb 2026 · 2602.15456
Prompting

Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms. These agents filter, prioritize, and synthesize information retrieved from the platforms' back-end databases or via web search. In these scenarios, LLM agents govern the information users receive, by drawing users'…

1105

World-Model-Augmented Web Agents with Action Correction

Feb 2026 · 2602.15384
Multi-Agent

Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause…

1106

Fluid-Agent Reinforcement Learning

Feb 2026 · 2602.14559
BenchmarksMulti-AgentReinforcement

The primary focus of multi-agent reinforcement learning (MARL) has been to study interactions among a fixed number of agents embedded in an environment. However, in the real world, the number of agents is neither fixed nor known a priori. Moreover, an agent can decide to create other agents (for example, a cell may divide, or a company may spin…

1107

Prompt-Driven Low-Altitude Edge Intelligence: Modular Agents and Generative Reasoning

Feb 2026 · 2602.14003
MemoryPlanningReasoningInference

Large artificial intelligence models (LAMs) show strong capabilities in perception, reasoning, and multi-modal understanding, and can enable advanced capabilities in low-altitude edge intelligence. However, the deployment of LAMs at the edge remains constrained by some fundamental limitations. First, tasks are rigidly tied to specific models,…

1108

AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

Feb 2026 · 2602.13685
ReasoningBenchmarksReinforcement

Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to…

1109

Social Contagion and Bank Runs: An Agent-Based Model with LLM Depositors

Feb 2026 · 2602.15066
Benchmarks

Digital banking and online communication have made modern bank runs faster and more networked than the canonical queue-at-the-branch setting. While equilibrium models explain why strategic complementarities generate run risk, they offer limited guidance on how beliefs synchronize and propagate in real time. We develop a process-based agent-based…

1110

Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

Feb 2026 · 2602.12544
Benchmarks

We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides…

1111

Choose Your Agent: Tradeoffs in Adopting AI Advisors, Coaches, and Delegates in Multi-Party Negotiation

Feb 2026 · 2602.12089
AgenticSafety

As AI usage becomes more prevalent in social contexts, understanding agent-user interaction is critical to designing systems that improve both individual and group outcomes. We present an online behavioral experiment (N = 243) in which participants play three multi-turn bargaining games in groups of three. Each game, presented in randomized order,…

1112

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Feb 2026 · 2602.09464
MemorySoftware DevBenchmarks

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance…

1113

AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature

Feb 2026 · 2602.18479
BenchmarksFine-TuningReinforcementInference

This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers and supports natural-language-based interactive analysis of the extracted data. AgentCAT serves as an alternative to overcome the long-standing data bottleneck in the chemical engineering field,…

1114

WildReward: Learning Reward Models from In-the-Wild Human Interactions

Feb 2026 · 2602.08829
Reinforcement

Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from…

1115

Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

Feb 2026 · 2602.22703
ReasoningBenchmarksFine-TuningReinforcement

Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline.…

1116

Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Feb 2026 · 2602.22146
ReinforcementSafety

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point…

1117

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

Feb 2026 · 2602.21765
ReinforcementSafety

Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF…

1118

RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

Feb 2026 · 2602.21628
ReasoningBenchmarksReinforcement

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches…

1119

The Headless Firm: How AI Reshapes Enterprise Boundaries

Feb 2026 · 2602.21401
AgenticInference

The boundary of the firm is determined by coordination cost. We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, integration cost collapses to O(n) while…

1120

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Feb 2026 · 2602.20951
AgenticArchitecture

Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a crucial area of study. Previous…

1121

From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production

Feb 2026 · 2602.20558
Reinforcement

Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for…

1122

Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

Feb 2026 · 2602.19914
ReasoningBenchmarksArchitecture

Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes detective tabletop game as a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended…

1123

Can Large Language Models Replace Human Coders? Introducing ContentBench

Feb 2026 · 2602.19467
ReasoningBenchmarksReinforcement

Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding…

1124

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Feb 2026 · 2602.19317
RAGReasoningBenchmarksReinforcement

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing…

1125

Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Feb 2026 · 2602.18966
Multi-AgentFine-Tuning

Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial…

1126

Rodent-Bench

Feb 2026 · 2602.18540
Benchmarks

We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as…

1127

The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Feb 2026 · 2602.17127
ReasoningBenchmarksMulti-AgentArchitecture

As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task…

1128

SecCodeBench-V2 Technical Report

Feb 2026 · 2602.15485
Benchmarks

We introduce SecCodeBench-V2, a publicly released benchmark for evaluating Large Language Model (LLM) copilots' capabilities of generating secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial productions, where the underlying security issues span 22 common CWE (Common Weakness Enumeration)…

1129

LLM-as-Judge on a Budget

Feb 2026 · 2602.15481
ReasoningBenchmarksSafety

LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget…

1130

On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

Feb 2026 · 2602.15460
PlanningReasoningBenchmarks

Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT)…

1131

Discovering Implicit Large Language Model Alignment Objectives

Feb 2026 · 2602.15338
BenchmarksReinforcementSafety

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that…

1132

When Remembering and Planning are Worth it: Navigating under Change

Feb 2026 · 2602.15274
MemoryPlanningFine-TuningArchitecture

We explore how different types and uses of memory can aid spatial navigation in changing uncertain environments. In the simple foraging task we study, every day, our agent has to find its way from its home, through barriers, to food. Moreover, the world is non-stationary: from day to day, the location of the barriers and food may change, and the…

1133

LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

Feb 2026 · 2602.14743
BenchmarksPrompting

We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22…

1134

Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning

Feb 2026 · 2602.14451
Software DevReasoningFine-Tuning

Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation, which inflate computational costs and even degrade performance. Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and…

1135

Differentially Private Retrieval-Augmented Generation

Feb 2026 · 2602.14374
RAGBenchmarksReinforcementInference

Retrieval-augmented generation (RAG) is a widely used framework for reducing hallucinations in large language models (LLMs) on domain-specific tasks by retrieving relevant documents from a database to support accurate responses. However, when the database contains sensitive corpora, such as medical records or legal documents, RAG poses serious…

1136

Benchmarking at the Edge of Comprehension

Feb 2026 · 2602.14307
Benchmarks

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking…

1137

PIPE-RDF: An LLM-Assisted Pipeline for Enterprise RDF Benchmarking

Feb 2026 · 2602.18497
RAGPlanningBenchmarksReinforcement

Enterprises rely on RDF knowledge graphs and SPARQL to expose operational data through natural language interfaces, yet public KGQA benchmarks do not reflect proprietary schemas, prefixes, or query distributions. We present PIPE-RDF, a three-phase pipeline that constructs schema-specific NL-SPARQL benchmarks using reverse querying,…

1138

GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization

Feb 2026 · 2602.13921
ContextBenchmarksArchitecture

Repository-level bug localization, the task of identifying where code must be modified to fix a bug, is a critical software engineering challenge. Standard Large Language Models (LLMs) are often unsuitable for this task due to context window limitations that prevent them from processing entire code repositories. As a result, various retrieval…

1139

Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games

Feb 2026 · 2602.12517
BenchmarksMulti-AgentReinforcement

The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation…

1140

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Feb 2026 · 2602.12424
Benchmarks

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities.…

1141

Synthetic Interaction Data for Scalable Personalization in Large Language Models

Feb 2026 · 2602.12394
AgenticReinforcementPrompting

Personalized prompting offers large opportunities for deploying large language models (LLMs) to diverse users, yet existing prompt optimization methods primarily focus on task-level optimization while largely overlooking user-specific preferences and latent constraints of individual users. This gap is primarily due to (i) the absence of…

1142

Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Feb 2026 · 2602.12235
ContextRAGArchitecture

Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility…

1143

LLM-Driven 3D Scene Generation of Agricultural Simulation Environments

Feb 2026 · 2602.11706
Software DevRAGReasoningFine-Tuning

Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations…

1144

iGRPO: Self-Feedback-Driven LLM Reasoning

Feb 2026 · 2602.09000
ReasoningBenchmarksFine-TuningReinforcement

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is…

1145

How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

Feb 2026 · 2602.08808
PlanningReasoningBenchmarksReinforcement

Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce…

1146

Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Feb 2026 · 2602.08672
BenchmarksSafety

Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate…

1147

Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents

Feb 2026 · 2602.22523

While contemporary large language models (LLMs) are increasingly capable in isolation, there are still many difficult problems that lie beyond the abilities of a single LLM. For such tasks, there is still uncertainty about how best to take many LLMs as parts and combine them into a greater whole. This position paper argues that potential…

1148

Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?

Feb 2026 · 2602.22401
Reasoning

AI agents (systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills) represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke…

1149

How Retrieved Context Shapes Internal Representations in RAG

Feb 2026 · 2602.20091
RAG

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work…

1150

Modeling Distinct Human Interaction in Web Agents

Feb 2026 · 2602.17588
AgenticReinforcement

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary…

1151

Meflex: A Multi-agent Scaffolding System for Entrepreneurial Ideation Iteration via Nonlinear Business Plan Writing

Feb 2026 · 2602.15631
Multi-AgentFine-TuningReinforcement

Business plan (BP) writing plays a key role in entrepreneurship education by helping learners construct, evaluate, and iteratively refine their ideas. However, conventional BP writing remains a rigid, linear process that often fails to reflect the dynamic and recursive nature of entrepreneurial ideation. This mismatch is particularly challenging…

1152

ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic

Feb 2026 · 2602.14780
Multi-AgentArchitectureSafety

We present ROSA (Roundabout Optimized Speed Advisory), a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for…

1153

GRRM: Group Relative Reward Modeling for Machine Translation

Feb 2026 · 2602.14028
ReasoningBenchmarksReinforcement

While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the…

1154

Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

Feb 2026 · 2602.13890
RAGReasoningPrompting

Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks…

1155

ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

Feb 2026 · 2602.13870
BenchmarksReinforcementArchitecture

The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic…

1156

MoltNet: Understanding Social Behavior of AI Agents in the Agent-Native MoltBook

Feb 2026 · 2602.13458
Multi-AgentReinforcement

Large-scale communities of AI agents are becoming increasingly prevalent, creating new environments for agent-agent social interaction. Prior work has examined multi-agent behavior primarily in controlled or small-scale settings, limiting our understanding of emergent social dynamics at scale. The recent emergence of MoltBook, a social networking…

1157

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

Feb 2026 · 2602.13372
ReasoningBenchmarksSafety

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98…

1158

AIR: Improving Agent Safety through Incident Response

Feb 2026 · 2602.11749
Safety

Large Language Model (LLM) agents are increasingly deployed in practice across a wide range of autonomous applications. Yet current safety mechanisms for LLM agents focus almost exclusively on preventing failures in advance, providing limited capabilities for responding to, containing, or recovering from incidents after they inevitably arise. In…

1159

When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents

Feb 2026 · 2602.11619

Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct-style agents produce 2.0–4.2 distinct action sequences per 10 runs on average, even with identical inputs.…

1160

The emergence of numerical representations in communicating artificial agents

Feb 2026 · 2602.10996

Human languages provide efficient systems for expressing numerosities, but whether the sheer pressure to communicate is enough for numerical representations to arise in artificial agents, and whether the emergent codes resemble human numerals at all, remains an open question. We study two neural network-based agents that must communicate…

1161

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Feb 2026 · 2602.08995
ReasoningBenchmarksSafetyInference

Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs…

1162

Bayesian Preference Learning for Test-Time Steerable Reward Models

Feb 2026 · 2602.08819
ReasoningBenchmarksReinforcementSafety

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained,…

1163

Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Feb 2026 · 2602.23296

Federated learning (FL) faces challenges in uncertainty quantification (UQ). Without reliable UQ, FL systems risk deploying overconfident models at under-resourced agents, leading to silent local failures despite seemingly satisfactory global performance. Existing federated UQ approaches often address data heterogeneity or model heterogeneity in…

1164

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

Feb 2026 · 2602.23239
ReasoningReinforcement

AI systems are increasingly deployed in high-stakes contexts (medical diagnosis, legal research, financial analysis) under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human…

1165

Accelerated Online Risk-Averse Policy Evaluation in POMDPs with Theoretical Guarantees and Novel CVaR Bounds

Feb 2026 · 2602.23073
AgenticBenchmarks

Risk-averse decision-making under uncertainty in partially observable domains is a central challenge in artificial intelligence and is essential for developing reliable autonomous agents. The formal framework for such problems is the partially observable Markov decision process (POMDP), where risk sensitivity is introduced through a risk measure…

1166

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Feb 2026 · 2602.22971
ReasoningBenchmarksArchitecture

As LLMs achieve breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning…

1167

MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Feb 2026 · 2602.22955
ReasoningBenchmarksFine-Tuning

Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and…

1168

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Feb 2026 · 2602.22871
ReasoningBenchmarksFine-TuningReinforcement

Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion…

1169

EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning

Feb 2026 · 2602.22609
BenchmarksReinforcement

Progress in hardware model checking depends critically on high-quality benchmarks. However, the community faces a significant benchmark gap: existing suites are limited in number, often distributed only in representations such as BTOR2 without access to the originating register-transfer-level (RTL) designs, and biased toward extreme difficulty…

1170

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Feb 2026 · 2602.21950
Benchmarks

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical…

1171

AkiraRust: Re-thinking LLM-aided Rust Repair Using a Feedback-guided Thinking Switch

Feb 2026 · 2602.21681
ReasoningReinforcement

Eliminating undefined behaviors (UBs) in Rust programs requires a deep semantic understanding to enable accurate and reliable repair. While existing studies have demonstrated the potential of LLMs to support Rust code analysis and repair, most frameworks remain constrained by inflexible templates or lack grounding in executable semantics,…

1172

Revisiting Text Ranking in Deep Research

Feb 2026 · 2602.21456
ContextFine-Tuning

Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite search's…

1173

Emergent Manifold Separability during Reasoning in Large Language Models

Feb 2026 · 2602.20338
ReasoningPrompting

Chain-of-Thought (CoT) prompting significantly improves reasoning in Large Language Models, yet the temporal dynamics of the underlying representation geometry remain poorly understood. We investigate these dynamics by applying Manifold Capacity Theory (MCT) to a compositional Boolean logic task, allowing us to quantify the linear separability of…

1174

A Very Big Video Reasoning Suite

Feb 2026 · 2602.20159
ReasoningBenchmarks

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction,…

1175

DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Feb 2026 · 2602.19895
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local…

1176

Uncovering Context Reliance in Unstructured Knowledge Editing

Feb 2026 · 2602.19043
BenchmarksReinforcementInference

Editing large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing. We identify Context Reliance as a critical failure mode of NTP-based…

1177

TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

Feb 2026 · 2602.18905
ReasoningBenchmarks

Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure…

1178

Luna-2: Scalable Single-Token Evaluation with Small Language Models

Feb 2026 · 2602.18583
BenchmarksFine-TuningArchitectureSafety

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to…

1179

VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning

Feb 2026 · 2602.18429
ReasoningBenchmarksFine-TuningKnowledge

Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian culture. Existing cultural benchmarks are (i) Manually…

1180

Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention

Feb 2026 · 2602.18145
Benchmarks

Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture…

1181

Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers

Feb 2026 · 2602.17993
PlanningReasoningBenchmarksFine-Tuning

Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we…

1182

ABCD: All Biases Come Disguised

Feb 2026 · 2602.17445
BenchmarksPrompting

Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the…

1183

Continual learning and refinement of causal models through dynamic predicate invention

Feb 2026 · 2602.17217
ReinforcementInference

Efficiently navigating complex environments requires agents to internalize the underlying logic of their world, yet standard world modelling methods often struggle with sample inefficiency, lack of transparency, and poor scalability. We propose a framework for constructing symbolic causal world models entirely online by integrating continuous…

1184

Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

Feb 2026 · 2602.16811
MemoryBenchmarksPrompting

Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages…

1185

Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency

Feb 2026 · 2602.16787
ReasoningBenchmarksInference

Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, producing such…

1186

MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

Feb 2026 · 2602.16298
BenchmarksReinforcementArchitecture

Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16…

1187

DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

Feb 2026 · 2602.15958
Benchmarks

Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units,…

1188

World Models for Policy Refinement in StarCraft II

Feb 2026 · 2602.14857
ReasoningBenchmarks

Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision-making policies in complex environments. StarCraft II (SC2), with its massive state-action space and partial observability, is a challenging testbed. However, existing LLM-based SC2 agents primarily focus on improving…

1189

MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs

Feb 2026 · 2602.14589
PlanningReasoningBenchmarksFine-Tuning

AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO), a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on…

1190

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Feb 2026 · 2602.14367
ReasoningBenchmarks

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea…

1191

CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry

Feb 2026 · 2602.14081
BenchmarksSafetyPrompting

The generation of classical Chinese Ci poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce Chinese Cipai Variants…

1192

FloCA: Towards Faithful and Logically Consistent Flowchart Reasoning

Feb 2026 · 2602.14035
ReasoningBenchmarks

Flowchart-oriented dialogue (FOD) systems aim to guide users through multi-turn decision-making or operational procedures by following a domain-specific flowchart to achieve a task goal. In this work, we formalize flowchart reasoning in FOD as grounding user input to flowchart nodes at each dialogue turn while ensuring node transition is…

1193

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Feb 2026 · 2602.13977
Long-HorizonBenchmarksReinforcementSafety

Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined…

1194

Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

Feb 2026 · 2602.13626
BenchmarksFine-Tuning

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets…

1195

RAT-Bench: A Comprehensive Benchmark for Text Anonymization

Feb 2026 · 2602.12806
Benchmarks

Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific…

1196

RelBench v2: A Large-Scale Benchmark and Repository for Relational Data

Feb 2026 · 2602.12606
PlanningBenchmarks

Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic…

1197

dVoting: Fast Voting for dLLMs

Feb 2026 · 2602.12153
ReasoningBenchmarks

Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was…

1198

It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Feb 2026 · 2602.12147
Benchmarks

Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous…

1199

When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

Feb 2026 · 2602.11908
BenchmarksInference

LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form…

1200

Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

Feb 2026 · 2602.11898
ReasoningBenchmarksInference

Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and…

1201

AltTS: A Dual-Path Framework with Alternating Optimization for Multivariate Time Series Forecasting

Feb 2026 · 2602.11533
Long-HorizonBenchmarksReinforcementArchitecture

Multivariate time series forecasting involves two qualitatively distinct factors: (i) stable within-series autoregressive (AR) dynamics, and (ii) intermittent cross-dimension interactions that can become spurious over long horizons. We argue that fitting a single model to capture both effects creates an optimization conflict: the high-variance…

1202

Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification

Feb 2026 · 2602.11361
ReasoningBenchmarks

Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within these intermediate steps. Recent work has introduced the notion of critical tokens--tokens in the reasoning…

1203

Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Feb 2026 · 2602.10732
ReasoningBenchmarks

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100…

1204

Towards Autonomous Mathematics Research

Feb 2026 · 2602.10177
Long-HorizonAgenticReasoningBenchmarks

Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce…

1205

Discovering High Level Patterns from Simulation Traces

Feb 2026 · 2602.10009
PlanningReasoningBenchmarksReinforcement

Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges including reasoning, planning, summarization, and question answering. This problem is exacerbated when a human user wishes to either guide or interact with the agent in natural language. Although the use of Language Models (LMs) is the…

1206

ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

Feb 2026 · 2602.09953
ReasoningBenchmarksReinforcementInference

Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they…

1207

Decomposing Reasoning Efficiency in Large Language Models

Feb 2026 · 2602.09805
ReasoningBenchmarksInference

Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation),…

1208

AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models

Feb 2026 · 2602.09621
BenchmarksFine-TuningReinforcementSafety

Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce…

1209

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

Feb 2026 · 2602.13310
ReasoningBenchmarksFine-Tuning

Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the…

1210

ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Feb 2026 · 2602.23320
MemorySoftware DevReasoningReinforcement

Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive…

1211

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Feb 2026 · 2602.22786
Multi-AgentReinforcement

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint…

1212

ABM-UDE: Developing Surrogates for Epidemic Agent-Based Models via Scientific Machine Learning

Feb 2026 · 2602.21588
PlanningReinforcementInference

Agent-based epidemic models (ABMs) encode behavioral and policy heterogeneity but are too slow for nightly hospital planning. We develop county-ready surrogates that learn directly from exascale ABM trajectories using Universal Differential Equations (UDEs): mechanistic SEIR-family ODEs with a neural-parameterized contact rate $\kappa_\phi(u,t)$ (no…

1213

Multi-Agent Lipschitz Bandits

Feb 2026 · 2602.16965
Multi-Agent

We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, with coordination costs that are independent of the time horizon $T$. We propose a modular protocol…

1214

OpenSage: Self-programming Agent Generation Engine

Feb 2026 · 2602.16891
MemoryBenchmarksReinforcement

Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents' performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these…

1215

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Feb 2026 · 2602.16699
Fine-Tuning

LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an…

1216

Modeling Trust and Liquidity Under Payment System Stress: A Multi-Agent Approach

Feb 2026 · 2602.16186
MemoryMulti-Agent

Operational disruptions in retail payments can induce behavioral responses that outlast technical recovery and may amplify liquidity stress. We propose a multi-agent model linking card payment outages to trust dynamics, channel avoidance, and threshold-gated withdrawals. Customers and merchants interact through repeated payment attempts, while…

1217

PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement

Feb 2026 · 2602.14968
Reinforcement

Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic…

1218

The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents

Feb 2026 · 2602.14224
ReasoningReinforcement

Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a…

1219

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Feb 2026 · 2602.11767
ReinforcementInference

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder…

1220

SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

Feb 2026 · 2602.11551
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily…

1221

Optimizing Agent Planning for Security and Autonomy

Feb 2026 · 2602.11416
PlanningBenchmarksInference

Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to…

1222

TVCACHE: A Stateful Tool-Value Cache for Post-Training LLM Agents

Feb 2026 · 2602.10986

In RL post-training of LLM agents, calls to external tools take several seconds or even minutes, leaving allocated GPUs idle and inflating post-training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect since tool outputs depend on the…

1223

Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol

Feb 2026 · 2602.13320

As AI agents powered by large language models (LLMs) increasingly use external tools for high-stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents, proving that cumulative…

1224

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Feb 2026 · 2602.23329
Benchmarks

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark…

1225

Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus

Feb 2026 · 2602.22847
BenchmarksMulti-Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought…

1226

RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format

Feb 2026 · 2602.22538
ReasoningBenchmarksArchitectureInference

Large reasoning models (LRMs) excel at a long chain of reasoning but often fail to faithfully follow instructions regarding output format, constraints, or specific requirements. We investigate whether this gap can be closed by integrating an instruction-tuned model (ITM) into an LRM. Analyzing their differences in parameter space, namely task…

1227

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Feb 2026 · 2602.21854
ReasoningBenchmarksPrompting

As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought…

1228

Shared Nature, Unique Nurture: PRISM for Pluralistic Reasoning via In-context Structure Modeling

Feb 2026 · 2602.21317
ReasoningBenchmarksFine-TuningInference

Large Language Models (LLMs) are converging towards a singular Artificial Hivemind, where shared Nature (pre-training priors) results in a profound collapse of distributional diversity, limiting the distinct perspectives necessary for creative exploration and scientific discovery. To address this, we propose to equip models with inference-time…

1229

Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

Feb 2026 · 2602.19961
AgenticRAGBenchmarks

With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content,…

1230

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

Feb 2026 · 2602.19948
BenchmarksReinforcementSafety

Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective…

1231

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

Feb 2026 · 2602.19416
ReinforcementSafetyInference

Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking - models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3…

1232

Self-Configurable Mesh-Networks for Scalable Distributed Submodular Bandit Optimization

Feb 2026 · 2602.19366
BenchmarksMulti-Agent

We study how to scale distributed bandit submodular coordination under realistic communication constraints in bandwidth, data rate, and connectivity. We are motivated by multi-agent tasks of active situational awareness in unknown, partially-observable, and resource-limited environments, where the agents must coordinate through agent-to-agent…

1233

Automated Generation of Microfluidic Netlists using Large Language Models

Feb 2026 · 2602.19297
Software DevBenchmarks

Microfluidic devices have emerged as powerful tools in various laboratory applications, but the complexity of their design limits accessibility for many practitioners. While progress has been made in microfluidic design automation (MFDA), a practical and intuitive solution is still needed to connect microfluidic practitioners with MFDA techniques.…

1234

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Feb 2026 · 2602.20200
MemoryLong-HorizonBenchmarksInference

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i)…

1235

Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

Feb 2026 · 2602.19212
RAGReasoningFine-TuningInference

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful…

1236

When Do LLM Preferences Predict Downstream Behavior?

Feb 2026 · 2602.18971
AgenticBenchmarksReinforcementSafety

Preference-driven behavior in LLMs may be a necessary precondition for AI misalignment such as sandbagging: models cannot strategically pursue misaligned goals unless their behavior is influenced by their preferences. Yet prior work has typically prompted models explicitly to act in specific ways, leaving unclear whether observed behaviors reflect…

1237

DeepInnovator: Triggering the Innovative Capabilities of LLMs

Feb 2026 · 2602.18920
BenchmarksPrompting

The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt…

1238

EDU-MATRIX: A Society-Centric Generative Cognitive Digital Twin Architecture for Secondary Education

Feb 2026 · 2602.18705
Multi-AgentArchitectureSafety

Existing multi-agent simulations often suffer from the "Agent-Centric Paradox": rules are hard-coded into individual agents, making complex social dynamics rigid and difficult to align with educational values. This paper presents EDU-MATRIX, a society-centric generative cognitive digital twin architecture that shifts the paradigm from simulating…

1239

VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Feb 2026 · 2602.18307
Benchmarks

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.…

1240

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Feb 2026 · 2602.18527
ReasoningBenchmarks

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting…

1241

Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning

Feb 2026 · 2602.17931
MemoryBenchmarksFine-TuningReinforcement

In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM…

1242

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Feb 2026 · 2602.17616
AgenticReasoningBenchmarksReinforcement

Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly…

1243

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Feb 2026 · 2602.17262
BenchmarksSafetyInference

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding…

1244

References Improve LLM Alignment in Non-Verifiable Domains

Feb 2026 · 2602.16802
ReasoningBenchmarksReinforcementSafety

While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First,…

1245

Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment

Feb 2026 · 2602.16438
BenchmarksSafety

Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a…

1246

LiveClin: A Live Clinical Benchmark without Leakage

Feb 2026 · 2602.16747
Benchmarks

The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and…

1247

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Feb 2026 · 2602.13576
BenchmarksSafetyInference

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark…

1248

MEME: Modeling the Evolutionary Modes of Financial Markets

Feb 2026 · 2602.11918
ReasoningBenchmarksMulti-AgentSafety

LLMs have demonstrated significant potential in quantitative finance by processing vast unstructured data to emulate human-like analytical workflows. However, current LLM-based methods primarily follow either an Asset-Centric paradigm focused on individual stock prediction or a Market-Centric approach for portfolio allocation, often remaining…

1249

Right for the Wrong Reasons: Epistemic Regret Minimization for Causal Rung Collapse in LLMs

Feb 2026 · 2602.11675
ReasoningArchitecture

Machine learning systems that are "right for the wrong reasons" achieve high performance through shortcuts that collapse under distributional shift. We show this pathology has a precise causal origin: autoregressive training provides no gradient signal to distinguish association P(Y|X) from intervention P(Y|do(X)), a failure we formalize as Rung…

1250

Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions

Feb 2026 · 2602.10522
ReasoningBenchmarks

Large Language Models (LLMs) have significantly advanced automated test generation, yet existing methods often rely on ground-truth code for verification, risking bug propagation and limiting applicability in test-driven development. We present ConVerTest, a novel two-stage pipeline for synthesizing reliable tests without requiring prior code…

1251

Constructing Industrial-Scale Optimization Modeling Benchmark

Feb 2026 · 2602.10450
Benchmarks

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized…

1252

Simple LLM Baselines are Competitive for Model Diffing

Feb 2026 · 2602.10371
Benchmarks

Standard LLM evaluations only test capabilities or dispositions that evaluators designed them for, missing unexpected differences such as behavioral shifts between model revisions or emergent misaligned tendencies. Model diffing addresses this limitation by automatically surfacing systematic behavioral differences. Recent approaches include…

1253

MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation

Feb 2026 · 2602.09624
BenchmarksReinforcement

We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgement. With task-specific prompts from…

1254

Latent Poincaré Shaping for Agentic Reinforcement Learning

Feb 2026 · 2602.09375
AgenticReinforcement

We propose LaPha, a method for training AlphaZero-like LLM agents in a Poincaré latent space. Under LaPha, the search process can be visualized as a tree rooted at the prompt and growing outward from the origin toward the boundary of the Poincaré ball, where negative curvature provides exponentially increasing capacity with radius. Using…

1255

Large Language Models for Designing Participatory Budgeting Rules

Feb 2026 · 2602.09349

Participatory budgeting (PB) is a democratic paradigm for deciding the funding of public projects given the residents' preferences, which has been adopted in numerous cities across the world. The main focus of PB is designing rules, functions that return feasible budget allocations for a set of projects subject to some budget constraint. Designing…

1256

VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models

Feb 2026 · 2602.09252
Self-ImprovingAgenticBenchmarksReinforcement

Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image…

1257

Autonomous AI and Ownership Rules

Feb 2026 · 2602.20169
EmergentReinforcementInference

This Article examines the circumstances in which AI-generated outputs remain linked to their creators and the points at which they lose that connection, whether through accident, deliberate design, or emergent behavior. In cases where AI is traceable to an originator, accession doctrine provides an efficient means of assigning ownership,…

1258

CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse

Feb 2026 · 2602.08939
ReasoningBenchmarksSafety

LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung…

1259

Scalable Delphi: Large Language Models for Structured Risk Estimation

Feb 2026 · 2602.08889
BenchmarksSafety

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard, the Delphi method, produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We…

1260

PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments

Feb 2026 · 2602.08716
ReasoningBenchmarksReinforcementSafety

Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity. Yet this characteristic has not been carefully examined in the LLM research community and remains absent from most alignment studies.…

1261

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 2026 · 2602.23351
ReasoningBenchmarks

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the…

1262

The Tragedy of the Commons in Multi-Population Resource Games

Feb 2026 · 2602.20603

Self-optimizing behaviors can lead to outcomes where collective benefits are ultimately destroyed, a well-known phenomenon known as the "tragedy of the commons". These scenarios are widely studied using game-theoretic approaches to analyze strategic agent decision-making. In this paper, we examine this phenomenon in a bi-level decision-making…

1263

A General Equilibrium Theory of Orchestrated AI Agent Systems

Feb 2026 · 2602.21255

We establish a general equilibrium theory for systems of large language model (LLM) agents operating under centralized orchestration. The framework is a production economy in the sense of Arrow-Debreu (1954), extended to infinite-dimensional commodity spaces following Bewley (1972). Each LLM agent is modeled as a firm whose production set Y a…

1264

Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

Feb 2026 · 2602.19225
BenchmarksReinforcement

Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may…

1265

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Feb 2026 · 2602.16742
ReasoningBenchmarksReinforcement

Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and…

1266

Automatically Finding Reward Model Biases

Feb 2026 · 2602.15222
Reinforcement

Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious or undesirable attributes such as length, format, hallucinations, and sycophancy. In this work, we introduce and study the research problem of automatically finding reward model biases in natural language. We offer a…

1267

Distributed Quantum Gaussian Processes for Multi-Agent Systems

Feb 2026 · 2602.15006
Multi-Agent

Gaussian Processes (GPs) are a powerful tool for probabilistic modeling, but their performance is often constrained in complex, large-scale real-world domains due to the limited expressivity of classical kernels. Quantum computing offers the potential to overcome this limitation by embedding data into exponentially large Hilbert spaces, capturing…

1268

LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation

Feb 2026 · 2602.14054
Software DevReasoning

Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full…

1269

OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

Feb 2026 · 2602.13769
MemoryPlanningBenchmarksMulti-Agent

Automating scientific discovery in complex, experiment-driven domains requires more than iterative mutation of programs; it demands structured hypothesis management, environment interaction, and principled reflection. We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental…

1270

UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment

Feb 2026 · 2602.09538
Fine-TuningReinforcementSafetyInference

Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each…

1271

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

Feb 2026 · 2602.23092
Benchmarks

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances.…

1272

Moral Preferences of LLMs Under Directed Contextual Influence

Feb 2026 · 2602.22831
ReasoningBenchmarksReinforcementPrompting

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage…

1273

Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

Feb 2026 · 2602.22642
ReasoningBenchmarksFine-TuningReinforcement

Chain-of-Thought (CoT) has substantially empowered Large Language Models (LLMs) to tackle complex reasoning tasks, yet the verbose nature of explicit reasoning steps incurs prohibitive inference latency and computational costs, limiting real-world deployment. While existing compression methods, ranging from self-training to Reinforcement Learning…

1274

Mirroring the Mind: Distilling Human-Like Metacognitive Strategies into Large Language Models

Feb 2026 · 2602.22508
ReasoningBenchmarksFine-Tuning

Large Reasoning Models (LRMs) often exhibit structural fragility in complex reasoning tasks, failing to produce correct answers even after successfully deriving valid intermediate steps. Through systematic analysis, we observe that these failures frequently stem not from a lack of reasoning capacity, but from a deficiency in self-regulatory…

1275

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Feb 2026 · 2602.22495
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher…

1276

Learning in the Null Space: Small Singular Values for Continual Learning

Feb 2026 · 2602.21919
BenchmarksFine-Tuning

Alleviating catastrophic forgetting while enabling further learning is a primary challenge in continual learning (CL). Orthogonal-based training methods have gained attention for their efficiency and strong theoretical properties, and many existing approaches enforce orthogonality through gradient projection. In this paper, we revisit…

1277

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

Feb 2026 · 2602.21646
BenchmarksSafety

Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation…

1278

Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Feb 2026 · 2602.20297
Multi-AgentFine-TuningReinforcement

We study gap-dependent performance guarantees for nearly minimax-optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap-dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax-optimal worst-case regret bound…

1279

InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation

Feb 2026 · 2602.20294
RAGBenchmarksSafety

Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an…

1280

To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

Feb 2026 · 2602.20130
ReasoningBenchmarksInference

Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale…

1281

ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

Feb 2026 · 2602.20117
ReasoningBenchmarksReinforcement

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while…

1282

SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

Feb 2026 · 2602.19455
ReasoningBenchmarksFine-TuningReinforcement

Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these…

1283

Scaling Inference-Time Computation via Opponent Simulation: Enabling Online Strategic Adaptation in Repeated Negotiation

Feb 2026 · 2602.19309
BenchmarksFine-TuningReinforcementInference

While large language models (LLMs) have emerged as powerful decision-makers across a wide range of single-agent and stationary environments, fewer efforts have been devoted to settings where LLMs must engage in repeated and strategic interactions with unknown or dynamic opponents. In such settings, recipes built upon offline…

1284

Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

Feb 2026 · 2602.20197
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of…

1285

Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Feb 2026 · 2602.18734
RAGMulti-Agent

Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built based on a ranking-centric, asymmetric dependency paradigm, where the generation quality of the generator is highly dependent on…

1286

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

Feb 2026 · 2602.18232
ReasoningBenchmarksInference

Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and…

1287

Toward Automated Virtual Electronic Control Unit (ECU) Twins for Shift-Left Automotive Software Testing

Feb 2026 · 2602.18142
AgenticArchitectureSafety

Automotive software increasingly outpaces hardware availability, forcing late integration and expensive hardware-in-the-loop (HiL) bottlenecks. The InnoRegioChallenge project investigated whether a virtual test and integration environment can reproduce electronic control unit (ECU) behavior early enough to run real software binaries before…

1288

Enhancing Scientific Literature Chatbots with Retrieval-Augmented Generation: A Performance Evaluation of Vector and Graph-Based Systems

Feb 2026 · 2602.17856
RAGBenchmarksReinforcement

This paper investigates the enhancement of scientific literature chatbots through retrieval-augmented generation (RAG), with a focus on evaluating vector- and graph-based retrieval systems. The proposed chatbot leverages both structured (graph) and unstructured (vector) databases to access scientific articles and gray literature, enabling…

1289

Position: Evaluation of ECG Representations Must Be Fixed

Feb 2026 · 2602.17531
BenchmarksInference

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels,…

1290

Improving LLM-based Recommendation with Self-Hard Negatives from Intermediate Layers

Feb 2026 · 2602.17410
Fine-Tuning

Large language models (LLMs) have shown great promise in recommender systems, where supervised fine-tuning (SFT) is commonly used for adaptation. Subsequent studies further introduce preference learning to incorporate negative samples into the training process. However, existing methods rely on sequence-level, offline-generated negatives, making…

1291

What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else?

Feb 2026 · 2602.17345
ReasoningSafety

Embodied AI systems (e.g., autonomous vehicles, service robots, and LLM-driven interactive agents) are rapidly transitioning from controlled environments to safety critical real-world deployments. Unlike disembodied AI, failures in embodied intelligence lead to irreversible physical consequences, raising fundamental questions about security,…

1292

Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

Feb 2026 · 2602.17283
BenchmarksSafety

While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed…

1293

Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Feb 2026 · 2602.16703
Benchmarks

Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet, whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized…

1294

Who can we trust? LLM-as-a-jury for Comparative Assessment

Feb 2026 · 2602.16610
Benchmarks

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and…

1295

Task-Agnostic Continual Learning for Chest Radiograph Classification

Feb 2026 · 2602.15811
ReinforcementInference

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiograph classification, in which heterogeneous chest…

1296

Recursive Concept Evolution for Compositional Reasoning in Large Language Models

Feb 2026 · 2602.15725
ReasoningBenchmarksReinforcementInference

Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or…

1297

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Feb 2026 · 2602.15620
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable…

1298

Fairness over Equality: Correcting Social Incentives in Asymmetric Sequential Social Dilemmas

Feb 2026 · 2602.15407
Multi-AgentReinforcement

Sequential Social Dilemmas (SSDs) provide a key framework for studying how cooperation emerges when individual incentives conflict with collective welfare. In Multi-Agent Reinforcement Learning, these problems are often addressed by incorporating intrinsic drives that encourage prosocial or fair behavior. However, most existing methods assume that…

1299

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Feb 2026 · 2602.14989
ReasoningBenchmarksFine-TuningPrompting

Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical…

1300

Universal Algorithm-Implicit Learning

Feb 2026 · 2602.14761
ReasoningBenchmarksArchitecturePrompting

Current meta-learning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like "universal" and "general-purpose" inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for…

1301

BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

Feb 2026 · 2602.14488
BenchmarksSafety

IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset…

1302

Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study

Feb 2026 · 2602.14357
BenchmarksReinforcement

Large Language Models (LLMs) are increasingly developed for use in complex professional domains, yet little is known about how teams design and evaluate these systems in practice. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study of a team building a pedagogical chatbot. The researcher…

1303

GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Feb 2026 · 2602.14201
AgenticBenchmarksReinforcement

The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage…

1304

GPT-5 vs Other LLMs in Long Short-Context Performance

Feb 2026 · 2602.14188
Context

With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts,…

1305

CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Feb 2026 · 2602.13191
ContextReasoningBenchmarksFine-Tuning

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame…

1306

Doc-to-LoRA: Learning to Instantly Internalize Contexts

Feb 2026 · 2602.15902
MemoryContextReasoningFine-Tuning

Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is…

1307

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

Feb 2026 · 2602.12642
ReasoningBenchmarksReinforcement

Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a…

1308

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Feb 2026 · 2602.12618
MemoryBenchmarksArchitectureInference

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM using heuristics that are incompatible with FlashAttention. We take a…

1309

On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Feb 2026 · 2602.12506
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on…

1310

Theory of Mind Guided Strategy Adaptation for Zero-Shot Coordination

Feb 2026 · 2602.12458
Multi-AgentReinforcement

A central challenge in multi-agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero-shot fashion. Prior work in zero-shot coordination often follows a two-stage process, first generating a diverse training pool of partner agents, and then training a best-response agent to collaborate effectively with the…

1311

ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction

Feb 2026 · 2602.12247
BenchmarksSafety

Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates…

1312

Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision

Feb 2026 · 2602.12236
BenchmarksFine-Tuning

Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic forgetting remains a critical barrier to deployment in continually evolving environments. Existing continual learning methods, developed primarily for artificial neural networks, seldom…

1313

Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Feb 2026 · 2602.11938
Benchmarks

Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions -- queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based…

1314

Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning

Feb 2026 · 2602.11779
ReasoningBenchmarksFine-TuningReinforcement

Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail…

1315

Unifying Stable Optimization and Reference Regularization in RLHF

Feb 2026 · 2602.11523
BenchmarksFine-TuningReinforcementSafety

Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions independently address these issues through separate regularization strategies, specifically a KL-divergence penalty against a…

1316

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Feb 2026 · 2602.11089
Self-ImprovingBenchmarksReinforcement

In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data…

1317

MERIT Feedback Elicits Better Bargaining in LLM Negotiators

Feb 2026 · 2602.10467
BenchmarksFine-TuningReinforcementPrompting

Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present a utility feedback centric…

1318

Resilient Topology-Aware Coordination for Dynamic 3D UAV Networks under Node Failure

Feb 2026 · 2602.10029
Multi-AgentReinforcementArchitecture

In 3D Aerial-Ground Integrated Networks (AGINs), ensuring continuous service coverage under unexpected hardware failures is critical for mission-critical applications. While Multi-Agent Reinforcement Learning (MARL) has shown promise in autonomous coordination, its resilience under sudden node failures remains a challenge due to dynamic topology…

1319

JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)

Feb 2026 · 2602.09930
BenchmarksPrompting

We build a benchmark to evaluate large language models (LLMs) for source code migration tasks, specifically upgrading functions from Java 8 to Java 11. We first collected a dataset of function pairs from open-source repositories, but limitations in data quality led us to construct a refined dataset covering eight categories of deprecated APIs.…

1320

Tiny Moves: Game-based Hypothesis Refinement

Feb 2026 · 2602.09801
ReasoningReinforcementInferencePrompting

Most machine learning approaches to scientific discovery frame hypotheses as end-to-end predictions, obscuring the incremental structure of scientific reasoning. We propose The Hypothesis Game, a symbolic formalism for hypothesis refinement in which LLM agents operate on a shared hypothesis state using a fixed grammar of reasoning moves. The…

1321

GHS-TDA: A Synergistic Reasoning Framework Integrating Global Hypothesis Space with Topological Data Analysis

Feb 2026 · 2602.09794
ReasoningBenchmarks

Chain-of-Thought (CoT) has been shown to significantly improve the reasoning accuracy of large language models (LLMs) on complex tasks. However, due to the autoregressive, step-by-step generation paradigm, existing CoT methods suffer from two fundamental limitations. First, the reasoning process is highly sensitive to early decisions: once an…

1322

Continual Learning for non-stationary regression via Memory-Efficient Replay

Feb 2026 · 2602.09720
MemoryBenchmarks

Data streams are rarely static in dynamic environments like Industry 4.0. Instead, they constantly change, making traditional offline models outdated unless they can quickly adjust to the new data. This need can be adequately addressed by continual learning (CL), which allows systems to gradually acquire knowledge without incurring the prohibitive…

1323

BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation

Feb 2026 · 2602.09383
BenchmarksFine-Tuning

LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of…

1324

From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection

Feb 2026 · 2602.09002
PlanningReasoning

Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a…

1325

Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Feb 2026 · 2602.22583
ReasoningBenchmarksInferencePrompting

Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models -- even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage -- whether a reasoning strategy appears…

1326

Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls

Feb 2026 · 2602.19218

The ability to use tools is fundamental for large language model (LLM) agents. Given a task, existing systems use LLMs to plan and generate tool calls, which are executed by real-world tools to complete the task. However, tool calls are prone to errors because they are derived merely from LLM intrinsic capabilities. What is more, while it is…

1327

Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Feb 2026 · 2602.18582
Long-HorizonReinforcementSafety

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel…

1328

From Lossy to Verified: A Provenance-Aware Tiered Memory for Agents

Feb 2026 · 2602.17913
MemoryLong-HorizonInference

Long-horizon agents often compress interaction histories into write-time summaries. This creates a fundamental write-before-query barrier: compression decisions are made before the system knows what a future query will hinge on. As a result, summaries can cause unverifiable omissions -- decisive constraints (e.g., allergies) may be dropped,…

1329

Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities

Feb 2026 · 2602.16093
ReasoningFine-Tuning

Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document…

1330

Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

Feb 2026 · 2602.14878
Context

The Model Context Protocol (MCP) introduces a standard specification that defines how Foundation Model (FM)-based agents should interact with external systems by invoking tools. However, to understand a tool's purpose and features, FMs rely on natural-language tool descriptions, making these descriptions a critical component in guiding FMs to…

1331

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Feb 2026 · 2602.14869
BenchmarksSafety

As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are both computationally expensive…

1332

Regularized Meta-Learning for Improved Generalization

Feb 2026 · 2602.12469
BenchmarksInference

Deep ensemble methods often improve predictive performance, yet they suffer from three practical limitations: redundancy among base models that inflates computational cost and degrades conditioning, unstable weighting under multicollinearity, and overfitting in meta-learning pipelines. We propose a regularized meta-learning framework that…

1333

Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning

Feb 2026 · 2602.12123
BenchmarksFine-TuningReinforcementInference

Demonstration selection is a practical bottleneck in in-context learning (ICL): under a tight prompt budget, accuracy can change substantially depending on which few-shot examples are included, yet selection must remain cheap enough to run per query over large candidate pools. We propose Meta-Sel, a lightweight supervised meta-learning approach…

1334

Deep Kernel Fusion for Transformers

Feb 2026 · 2602.11808
MemoryContextAgenticArchitecture

Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to…

1335

Towards LLM-Empowered Knowledge Tracing via LLM-Student Hierarchical Behavior Alignment in Hyperbolic Space

Feb 2026 · 2602.22879
Safety

Knowledge Tracing (KT) diagnoses students' concept mastery through continuous learning state monitoring in education. Existing methods primarily focus on studying behavioral sequences based on ID or textual information. While existing methods rely on ID-based sequences or shallow textual features, they often fail to capture (1) the hierarchical…

1336

Large Language Models are Algorithmically Blind

Feb 2026 · 2602.21947
ReasoningBenchmarks

Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight…

1337

RETLLM: Training and Data-Free MLLMs for Multimodal Information Retrieval

Feb 2026 · 2602.22278
ReasoningBenchmarksFine-Tuning

Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. However, they suffer from pre-training…

1338

On Data Engineering for Scaling LLM Terminal Capabilities

Feb 2026 · 2602.21193
ContextReinforcement

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight…

1339

Tool Building as a Path to "Superintelligence"

Feb 2026 · 2602.21061
ReasoningBenchmarksInference

The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $γ$. In this work, we design a benchmark to measure $γ$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each…

1340

CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

Feb 2026 · 2602.20980
ReasoningBenchmarksInference

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However,…

1341

The Art of Efficient Reasoning: Data, Reward, and Optimization

Feb 2026 · 2602.20945
ReasoningBenchmarksReinforcement

Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we…

1342

SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

Feb 2026 · 2602.20901
ReasoningBenchmarks

Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex…

1343

Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Feb 2026 · 2602.20878
ReasoningBenchmarksInferencePrompting

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from…

1344

Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Feb 2026 · 2602.20528
PlanningReasoningBenchmarksArchitecture

The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a "thinking" phase that pauses generation to refine a semantic plan through diffusion before continuing.…

1345

Examining and Addressing Barriers to Diversity in LLM-Generated Ideas

Feb 2026 · 2602.20408
ReasoningInferencePrompting

Ideas generated by independent samples of humans tend to be more diverse than ideas generated from independent LLM samples, raising concerns that widespread reliance on LLMs could homogenize ideation and undermine innovation at a societal level. Drawing on cognitive psychology, we identify (both theoretically and empirically) two mechanisms…

1346

Momentum Guidance: Plug-and-Play Guidance for Flow Models

Feb 2026 · 2602.20360
BenchmarksInference

Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as…

1347

NanoKnow: How to Know What Your Language Model Knows

Feb 2026 · 2602.20122
Benchmarks

How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's…

1348

gencat: Generative computerized adaptive testing

Feb 2026 · 2602.20020
Fine-TuningSafety

Existing Computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (GENerative CAT), a…

1349

Active perception and disentangled representations allow continual, episodic zero and few-shot learning

Feb 2026 · 2602.19355
MemoryReasoningArchitecturePrompting

Generalization is often regarded as an essential property of machine learning systems. However, perhaps not every component of a system needs to generalize. Training models for generalization typically produces entangled representations at the boundaries of entities or classes, which can lead to destructive interference when rapid, high-magnitude…

1350

Towards Automated Page Object Generation for Web Testing using Large Language Models

Feb 2026 · 2602.19294
Benchmarks

Page Objects (POs) are a widely adopted design pattern for improving the maintainability and scalability of automated end-to-end web tests. However, creating and maintaining POs is still largely a manual, labor-intensive activity, while automated solutions have seen limited practical adoption. In this context, the potential of Large Language…

1351

A Probabilistic Framework for LLM-Based Model Discovery

Feb 2026 · 2602.18266
AgenticInference

Automated methods for discovering mechanistic simulator models from observational data offer a promising path toward accelerating scientific progress. Such methods often take the form of agentic-style iterative workflows that repeatedly propose and revise candidate models by imitating human discovery processes. However, existing LLM-based…

1352

Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models

Feb 2026 · 2602.18514
Self-ImprovingReasoningArchitectureSafety

As Large Language Models (LLMs) are increasingly integrated into automated decision-making pipelines, specifically within Human Resources (HR), the security implications of Indirect Prompt Injection (IPI) become critical. While a prevailing hypothesis posits that "Reasoning" or "Chain-of-Thought" Models possess safety advantages due to their…

1353

A Hybrid Federated Learning Based Ensemble Approach for Lung Disease Diagnosis Leveraging Fusion of SWIN Transformer and CNN

Feb 2026 · 2602.17566
ReinforcementArchitecture

The significant advancements in computational power create a vast opportunity for using Artificial Intelligence in different applications of healthcare and medical science. A Hybrid FL-Enabled Ensemble Approach For Lung Disease Diagnosis Leveraging a Combination of SWIN Transformer and CNN is the combination of cutting-edge technology of AI…

1354

AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Feb 2026 · 2602.17443
ReasoningBenchmarks

Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment…

1355

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Feb 2026 · 2602.17739
ContextBenchmarksReinforcement

Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key…

1356

AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

Feb 2026 · 2602.16639
Benchmarks

Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial…

1357

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Feb 2026 · 2602.16763
Benchmarks

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark…

1358

Are LLMs Ready to Replace Bangla Annotators?

Feb 2026 · 2602.16241
Benchmarks

Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators -- especially for low-resource and identity-sensitive settings -- remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even…

1359

Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

Feb 2026 · 2602.16154
ReasoningBenchmarksFine-TuningReinforcement

Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT…

1360

MAEB: Massive Audio Embedding Benchmark

Feb 2026 · 2602.16008
ReasoningBenchmarksReinforcement

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound…

1361

ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization

Feb 2026 · 2602.15983
ReasoningBenchmarks

Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop,…

1362

Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

Feb 2026 · 2602.15724
RAGReasoningBenchmarksFine-Tuning

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from…

1363

Guiding LLM-Based Human Mobility Simulation with Mobility Measures from Shared Data

Feb 2026 · 2602.16726

Large-scale human mobility simulation is critical for many science domains such as urban science, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility trajectories by modeling individual-level cognitive processes. However, these approaches generate individual…

1364

A Geometric Analysis of Small-sized Language Model Hallucinations

Feb 2026 · 2602.14778
AgenticBenchmarks

Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective, starting from the hypothesis that when models generate multiple responses to the same…

1365

Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

Feb 2026 · 2602.14466
Benchmarks

With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process…

1366

A Rational Analysis of the Effects of Sycophantic AI

Feb 2026 · 2602.14270
Prompting

People increasingly use large language models (LLMs) to explore ideas, gather information, and make sense of the world. In these interactions, they encounter agents that are overly agreeable. We argue that this sycophancy poses a unique epistemic risk to how individuals come to see the world: unlike hallucinations that introduce falsehoods,…

1367

NEST: Nascent Encoded Steganographic Thoughts

Feb 2026 · 2602.14095
ReasoningBenchmarksSafety

Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and…

1368

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Feb 2026 · 2602.13964
BenchmarksReinforcementInference

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this…

1369

OneLatent: Single-Token Compression for Visual Latent Reasoning

Feb 2026 · 2602.13738
ReasoningBenchmarksReinforcementInference

Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address these challenges, we present OneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering…

1370

KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination

Feb 2026 · 2602.13650
ReasoningBenchmarks

We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring…

1371

Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

Feb 2026 · 2602.13035
ReasoningBenchmarksFine-TuningReinforcement

Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely…

1372

Learning Ordinal Probabilistic Reward from Preferences

Feb 2026 · 2602.12660
BenchmarksReinforcement

Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic…

1373

Artic: AI-oriented Real-time Communication for MLLM Video Assistant

Feb 2026 · 2602.12641
BenchmarksInference

AI Video Assistant emerges as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video…

1374

A Theoretical Framework for Adaptive Utility-Weighted Benchmarking

Feb 2026 · 2602.12356
BenchmarksReinforcementInference

Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, where shared tasks, metrics, and leaderboards offer a common basis for measuring progress and comparing approaches. As AI systems are deployed in more varied and consequential settings, though, there is…

1375

SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization

Feb 2026 · 2602.12187
BenchmarksReinforcement

Search-Augmented Generative Engines (SAGE) have emerged as a new paradigm for information access, bridging web-scale retrieval with generative capabilities to deliver synthesized answers. This shift has fundamentally reshaped how web content gains exposure online, giving rise to Search-Augmented Generative Engine Optimization (SAGEO), the practice…

1376

Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty

Feb 2026 · 2602.12113
ReasoningBenchmarksReinforcementInference

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational…

1377

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Feb 2026 · 2602.11748
ReasoningBenchmarksFine-TuningReinforcement

Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state…

1378

Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Feb 2026 · 2602.11737
BenchmarksReinforcementArchitecture

We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts…

1379

PatientHub: A Unified Framework for Patient Simulation

Feb 2026 · 2602.11684
BenchmarksReinforcement

As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and…

1380

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

Feb 2026 · 2602.11337
Long-HorizonPlanningBenchmarksReinforcement

Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and…

1381

Can Large Language Models Make Everyone Happy?

Feb 2026 · 2602.11091
BenchmarksSafety

Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and…

1382

Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models

Feb 2026 · 2602.11057
ReasoningMulti-AgentReinforcement

The multi-commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, and logistics. The rapid expansion of allocation systems now challenges existing optimization engines to balance optimality and tractability. In this…

1383

Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

Feb 2026 · 2602.11024
AgenticReasoningSafety

Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To…

1384

The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

Feb 2026 · 2602.10886
ReasoningBenchmarksReinforcement

We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor…

1385

C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

Feb 2026 · 2602.10551
ReasoningBenchmarksSafety

Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional…

1386

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

Feb 2026 · 2602.10520
ReasoningBenchmarksReinforcementInference

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO)…

1387

Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI

Feb 2026 · 2602.10481
ReasoningBenchmarks

Large Language Model (LLM) applications are vulnerable to prompt injection and context manipulation attacks that traditional security models cannot prevent. We introduce two novel primitives--authenticated prompts and authenticated context--that provide cryptographically verifiable provenance across LLM workflows. Authenticated prompts enable…

1388

Online Generalized-mean Welfare Maximization: Achieving Near-Optimal Regret from Samples

Feb 2026 · 2602.10469

We study online fair allocation of $T$ sequentially arriving items among $n$ agents with heterogeneous preferences, with the objective of maximizing generalized-mean welfare, defined as the $p$-mean of agents' time-averaged utilities, with $p\in (-\infty, 1)$. We first consider the i.i.d. arrival model and show that the pure greedy algorithm --…

1389

Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Feb 2026 · 2602.10388
Reinforcement

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine…

1390

Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

Feb 2026 · 2602.13324
AgenticReasoningArchitectureSafety

Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma…

1391

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Feb 2026 · 2602.10117
ReasoningBenchmarks

Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this…

1392

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

Feb 2026 · 2602.09802
BenchmarksReinforcementPrompting

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice…

1393

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

Feb 2026 · 2602.08868
ReasoningReinforcement

Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by…

1394

FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models

Feb 2026 · 2602.08818
MemoryReasoningBenchmarksArchitecture

Recent advances in mixture-of-experts architectures have shown that individual expert models can be trained federatedly, i.e., in isolation from other experts by using a common base model to facilitate coordination. However, we hypothesize that full-sized experts may not be necessary for all domains and that instead low-rank adapters may be…

Selection Criteria

How papers are selected and ranked

Papers are sourced from arXiv via 30 targeted search queries across eight topic clusters: agentic AI, context/memory management, recursive/self-improving learning, long-horizon planning, software development agents, knowledge & reasoning, alignment & RLHF, and evaluation & benchmarks. Relevance is scored by weighted keyword matching (56 phrases, weights 1–5) against title and abstract, with title matches receiving a 50% bonus. Papers scoring ≥ 15 are included and appended newest-first to this cumulative digest. Only genuinely new papers (by arXiv ID) are added; duplicates are skipped.
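
The scoring and de-duplication rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the digest's actual pipeline: the phrase list and its weights are hypothetical stand-ins for the 56 real phrases, matching is assumed to be a case-insensitive substring test, and a phrase found in the title is assumed to count once (with the 50% bonus) rather than again for the abstract.

```python
from dataclasses import dataclass

# Hypothetical sample of weighted phrases (the real list has 56 entries, weights 1-5).
PHRASE_WEIGHTS = {
    "agentic": 5,
    "long-horizon": 4,
    "tool use": 4,
    "benchmark": 3,
    "reasoning": 2,
}
TITLE_BONUS = 1.5  # title matches receive a 50% bonus
THRESHOLD = 15     # papers scoring >= 15 are included


@dataclass(frozen=True)
class Paper:
    arxiv_id: str
    title: str
    abstract: str


def relevance_score(paper, weights=PHRASE_WEIGHTS):
    """Sum phrase weights; a title hit earns the bonus, else an abstract hit counts at face value."""
    title = paper.title.lower()
    abstract = paper.abstract.lower()
    score = 0.0
    for phrase, weight in weights.items():
        if phrase in title:
            # Assumption: a phrase in the title is counted once, with the bonus.
            score += weight * TITLE_BONUS
        elif phrase in abstract:
            score += weight
    return score


def select_new(papers, seen_ids, threshold=THRESHOLD):
    """Keep papers that meet the threshold and whose arXiv ID has not been seen before."""
    selected = []
    for paper in papers:
        if paper.arxiv_id in seen_ids:
            continue  # duplicate by arXiv ID: skip
        if relevance_score(paper) >= threshold:
            selected.append(paper)
            seen_ids.add(paper.arxiv_id)
    return selected
```

Sorting the selected papers newest-first before appending them is left out here, since the digest does not say which date field drives the ordering.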