Step-DeepResearch Technical Report
24 Dec 2025

StepFun's Step-DeepResearch introduces a cost-effective, end-to-end Deep Research agent based on a 32B-parameter model, demonstrating expert-level capabilities on complex, open-ended research tasks. It scores 61.42 on the ResearchRubrics benchmark, placing second overall and outperforming OpenAI DeepResearch while being more than ten times cheaper than leading commercial systems. The report also introduces ADR-Bench, a new benchmark for evaluating real-world Chinese research tasks.

NVIDIA Nemotron 3: Efficient and Open Intelligence
24 Dec 2025

NVIDIA introduces Nemotron 3, a family of hybrid Mamba-Transformer Mixture-of-Experts LLMs, achieving up to 3.3x higher inference throughput and a 1-million-token context length with state-of-the-art accuracy for agentic AI. The project openly releases model weights, training software, recipes, and over 10 trillion tokens of training data.

Attention Is Not What You Need
22 Dec 2025

A Causal Grassmann architecture replaces self-attention with geometrically structured Grassmann flows for sequence modeling, achieving competitive perplexity on WikiText-2 and slightly higher accuracy on SNLI, with theoretically linear computational complexity in sequence length.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
22 Dec 2025

Researchers from Nanyang Technological University and SenseTime Research introduce the "Prism Hypothesis" and Unified Autoencoding (UAE), a model that integrates high-level semantic features and low-level pixel details into a single latent space. UAE achieves state-of-the-art visual reconstruction with rFID of 0.19 on ImageNet-1K and strong generative capabilities, while preserving semantic understanding with 83.0% linear probing accuracy.
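
The unified objective the hypothesis implies can be pictured as one encoder whose latent feeds both a pixel decoder and a semantic readout. The sketch below is a toy illustration under assumed components (a frozen teacher supplying semantic targets, equal loss weighting), not UAE's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy unified autoencoder: one latent space trained for both pixel
# reconstruction and semantic alignment (all design choices assumed).
class UnifiedAE(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 8, 8), nn.GELU(),
                                     nn.Conv2d(dim, dim, 1))
        self.decoder = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU(),
                                     nn.ConvTranspose2d(dim, 3, 8, 8))
        self.sem_head = nn.Conv2d(dim, dim, 1)  # maps latent to teacher space

    def forward(self, x, teacher_feat):
        z = self.encoder(x)                     # a single latent for both roles
        pixel_loss = F.mse_loss(self.decoder(z), x)
        sem = F.normalize(self.sem_head(z).flatten(2), dim=1)
        tgt = F.normalize(teacher_feat.flatten(2), dim=1)
        semantic_loss = 1 - (sem * tgt).sum(1).mean()  # cosine alignment
        return pixel_loss + semantic_loss

model = UnifiedAE()
x = torch.randn(2, 3, 64, 64)
teacher = torch.randn(2, 256, 8, 8)   # stand-in for frozen teacher features
model(x, teacher).backward()
```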

Latent Implicit Visual Reasoning
24 Dec 2025

Latent Implicit Visual Reasoning (LIVR) enhances Large Multimodal Models (LMMs) by enabling them to implicitly learn and use visual abstractions through dedicated latent tokens and a visual bottleneck mechanism. This approach consistently outperforms direct supervised fine-tuning by average margins of 3.43% to 6.24% and surpasses methods relying on explicit visual supervision across perception-heavy tasks, eliminating the need for costly intermediate annotations.
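
One way to picture the visual bottleneck is as an attention mask in which answer tokens can never read image tokens directly and must route through a handful of learnable latent tokens. The token layout and masking rules below are illustrative assumptions; the summary does not detail LIVR's exact mechanism:

```python
import torch

n_img, n_lat, n_ans = 6, 2, 4                  # toy sequence sizes
n = n_img + n_lat + n_ans
img = slice(0, n_img)
lat = slice(n_img, n_img + n_lat)
ans = slice(n_img + n_lat, n)

mask = torch.zeros(n, n, dtype=torch.bool)     # True = attention allowed
mask[img, img] = True                          # image tokens see each other
mask[lat, img] = True                          # latent tokens read the image
mask[lat, lat] = True                          # and coordinate among themselves
mask[ans, lat] = True                          # answers read ONLY the latents,
mask[ans, ans] = torch.tril(                   # plus earlier answer tokens;
    torch.ones(n_ans, n_ans, dtype=torch.bool))  # never the raw image tokens
print(mask.int())
```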

Streaming Video Instruction Tuning

This research introduces Streamo, an end-to-end real-time streaming video Large Language Model designed to function as a general-purpose interactive assistant, enabling precise frame-level decision-making and response timing. It achieves state-of-the-art performance on online video understanding tasks while also enhancing capabilities on traditional offline benchmarks.
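
Frame-level response timing can be sketched as a control loop that emits a wait-or-respond decision at every incoming frame. The `decide`/`generate` interface and the WAIT/RESPOND actions below are hypothetical stand-ins, not Streamo's actual API:

```python
def streaming_loop(model, video_stream, user_queries):
    history = []                                  # running multimodal context
    for t, frame in enumerate(video_stream):
        history.append(("frame", t, frame))
        if t in user_queries:
            history.append(("query", t, user_queries[t]))
        if model.decide(history) == "RESPOND":    # per-frame timing decision
            yield t, model.generate(history)      # reply at this exact frame

class DummyModel:                                 # stand-in for the video LLM
    def decide(self, history):
        return "RESPOND" if history[-1][0] == "query" else "WAIT"
    def generate(self, history):
        return f"answer at frame {history[-1][1]}"

for t, reply in streaming_loop(DummyModel(), range(5), {2: "what happened?"}):
    print(t, reply)
```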

How Much 3D Do Video Foundation Models Encode?

A model-agnostic framework quantitatively reveals that Video Foundation Models, trained exclusively on 2D video data, develop a robust understanding of 3D objects, scenes, and ego-motion. These models demonstrate 3D awareness competitive with or superior to specialized 3D reconstruction methods, particularly in generalizing to new scenes, and their features improve feedforward 3D reconstruction while requiring less supervised data.
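
The probing idea behind such a framework can be reduced to: freeze the video model, extract patch features, and fit a lightweight head for a 3D target such as per-patch depth, taking probe error as a measure of how much 3D the features encode. Shapes and targets below are illustrative assumptions, not the paper's exact protocol:

```python
import torch
import torch.nn as nn

feats = torch.randn(100, 16, 384)   # [clips, patches, dim] from a frozen VFM
depth = torch.rand(100, 16)         # pseudo ground-truth depth per patch

probe = nn.Linear(384, 1)           # only the probe is trained
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for step in range(200):
    pred = probe(feats).squeeze(-1)
    loss = nn.functional.mse_loss(pred, depth)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"probe MSE: {loss.item():.4f}")  # lower = more linearly decodable 3D
```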

SpatialTree: How Spatial Abilities Branch Out in MLLMs

The paper introduces "SpatialTree," a cognitive-science-inspired hierarchical framework for spatial intelligence in multimodal large language models (MLLMs), and "SpatialTree-Bench," a corresponding benchmark with 27 sub-abilities. It reveals distinct transfer dynamics among spatial skills and proposes an "auto-think strategy" for reinforcement learning that consistently improves spatial performance across all hierarchical levels.

KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning
22 Dec 2025

KerJEPA introduces a generalized framework for Euclidean self-supervised learning, building on LeJEPA by employing a range of kernel discrepancies for regularization. It shows that LeJEPA's slicing implicitly yields heavy-tailed, dimension-dependent kernels, and empirically demonstrates that analytically derived discrepancies improve training stability and convergence, with the kernel Stein discrepancy (KSD) effectively leveraging non-Gaussian priors and an inverse multiquadric (IMQ) kernel achieving 91.90% accuracy on Imagenette.
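
As a concrete instance of a kernel discrepancy regularizer, the sketch below penalizes an MMD between embeddings and an isotropic Gaussian prior under an IMQ kernel. The kernel parameters, the plug-in estimator, and the single prior draw are assumptions, not KerJEPA's exact objective:

```python
import torch

def imq_kernel(x, y, c=1.0, beta=0.5):
    # inverse multiquadric: k(x, y) = (c + ||x - y||^2)^(-beta)
    d2 = torch.cdist(x, y).pow(2)
    return (c + d2).pow(-beta)

def mmd_imq(emb, c=1.0, beta=0.5):
    # plug-in MMD^2 between embeddings and an isotropic Gaussian prior,
    # dropping diagonal terms of the within-sample kernel matrices
    prior = torch.randn_like(emb)
    k_xx = imq_kernel(emb, emb, c, beta)
    k_yy = imq_kernel(prior, prior, c, beta)
    k_xy = imq_kernel(emb, prior, c, beta)
    off = ~torch.eye(emb.shape[0], dtype=torch.bool)
    return k_xx[off].mean() + k_yy[off].mean() - 2 * k_xy.mean()

emb = torch.randn(256, 64, requires_grad=True)
reg = mmd_imq(emb)   # added to the JEPA prediction loss as a regularizer
reg.backward()
```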

Block-Recurrent Dynamics in Vision Transformers
23 Dec 2025

This research introduces and validates the Block-Recurrent Hypothesis (BRH), demonstrating that Vision Transformer computations can be accurately approximated by a small number of distinct blocks applied recurrently. A 3-block recurrent model achieved 98% of the original DINOv2's performance on ImageNet-1k, suggesting an underlying simplicity bias in these models.
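
The hypothesis can be expressed directly in code: instead of N distinct transformer blocks, cycle a small set of shared blocks to the same effective depth. Dimensions and depth below are illustrative, and the distillation used to fit the recurrent blocks to DINOv2 is omitted:

```python
import torch
import torch.nn as nn

class RecurrentViTTrunk(nn.Module):
    def __init__(self, dim=384, n_blocks=3, n_steps=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
            for _ in range(n_blocks))
        self.n_steps = n_steps                 # total effective depth

    def forward(self, tokens):
        for step in range(self.n_steps):       # 3 distinct blocks, cycled
            tokens = self.blocks[step % len(self.blocks)](tokens)
        return tokens

trunk = RecurrentViTTrunk()
out = trunk(torch.randn(1, 197, 384))          # 196 patches + CLS, ViT-S width
print(out.shape)
```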

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
23 Dec 2025

A framework developed by researchers at The University of Hong Kong and Tencent PCG enhances Vision-Language Models' dynamic spatial reasoning capabilities through a new automated data generation pipeline and a Geometry Selection Module. This approach achieves 58.9% average accuracy on the DSR-Bench benchmark while preserving general video understanding performance.

Toward Training Superintelligent Software Agents through Self-Play SWE-RL
21 Dec 2025

Meta FAIR researchers developed Self-play SWE-RL (SSR), a training paradigm for software agents that autonomously generates learning experiences from real-world codebases through a self-play loop. This approach enabled agents to achieve consistent self-improvement and outperform human-data baselines by +10.4 points on SWE-bench Verified and +7.8 points on SWE-Bench Pro, without relying on human-curated issue descriptions or pre-existing tests.
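
The self-play loop can be miniaturized to its skeleton: a setter role perturbs working code to manufacture a repair task, a solver role attempts a fix, and execution supplies the reward. Everything below is a toy stand-in (string edits in place of LLM rollouts, a one-line checker in place of generated tests):

```python
def propose_bug(code):
    # setter role: perturb working code to create a repair task
    return code.replace("+", "-", 1)

def solve(broken_code):
    # solver role: in SSR this is an LLM rollout; here a stub "repair"
    return broken_code.replace("-", "+", 1)

def checker(code):
    env = {}
    exec(code, env)                     # execution-based reward signal
    return env["add"](2, 3) == 5

clean = "def add(a, b):\n    return a + b\n"
fixed = solve(propose_bug(clean))
reward = 1.0 if checker(fixed) else 0.0
print(f"reward: {reward}")              # feeds the policy update in the RL loop
```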

USE: A Unified Model for Universal Sound Separation and Extraction
24 Dec 2025

The USE model from Shanghai Jiao Tong University integrates Universal Sound Separation (SS) and Target Sound Extraction (TSE) into a single framework by semantically aligning internal sound attractors with external multi-modal clue embeddings. This approach robustly handles an unknown number of diverse sound sources and variable clue availability, achieving significant SNRi improvements of up to 35.4% over baselines in TSE and strong performance in autonomous SS while maintaining real-time inference speed.

SemanticGen: Video Generation in Semantic Space

SemanticGen introduces a two-stage diffusion framework for video generation that first establishes global planning in a compact semantic space and then refines details in the VAE latent space. This approach achieves faster training convergence and generates high-quality, long-form videos with superior temporal consistency compared to direct VAE-latent modeling.
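
The two-stage sampler can be sketched as diffusion in a compact semantic space followed by conditioned diffusion in the VAE latent space. Module names, shapes, and the toy denoiser below are assumptions, not SemanticGen's actual samplers:

```python
import torch

class ToyDenoiser:                     # stand-in for a learned diffusion model
    timesteps = range(4)
    def denoise(self, x, t, cond):
        return 0.9 * x                 # placeholder for a real denoising step

def generate_video(prompt_emb, sem_model, lat_model, vae_decode,
                   n_frames=16, sem_dim=64, lat_shape=(4, 32, 32)):
    # Stage 1: plan the whole clip in a compact semantic space, where
    # long-range temporal structure is cheap to model.
    sem = torch.randn(n_frames, sem_dim)
    for t in sem_model.timesteps:
        sem = sem_model.denoise(sem, t, cond=prompt_emb)
    # Stage 2: refine pixel-level detail in the VAE latent space,
    # conditioned on the per-frame semantic plan.
    lat = torch.randn(n_frames, *lat_shape)
    for t in lat_model.timesteps:
        lat = lat_model.denoise(lat, t, cond=sem)
    return vae_decode(lat)

video = generate_video(torch.randn(8), ToyDenoiser(), ToyDenoiser(),
                       vae_decode=lambda z: z)   # identity decoder stub
print(video.shape)                               # torch.Size([16, 4, 32, 32])
```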

AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

The AMoE framework introduces an efficient agglomerative Vision Foundation Model, achieving state-of-the-art global and dense visual representations through multi-teacher distillation and a Mixture-of-Experts architecture. It demonstrates superior performance with 4.7 times fewer training tokens compared to prior methods like RADIOv2.5.

Generalization of Diffusion Models Arises with a Balanced Representation Space

Researchers from the University of Michigan and Georgia Institute of Technology established a mathematical framework explaining how diffusion models either memorize data with "spiky" internal representations or generalize to new data via "balanced" representations. This framework enabled the development of a prompt-free memorization detection method and a technique for interpretable image editing through representation steering.
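
A prompt-free detector in this spirit only needs a statistic separating "spiky" from "balanced" representations. The sketch below uses an effective-rank-based score as an assumed stand-in for the paper's actual measure:

```python
import torch

def spikiness(rep):
    # rep: [tokens, dim] internal representation for one generated sample
    s = torch.linalg.svdvals(rep)              # spectrum of the representation
    p = s / s.sum()
    eff_rank = torch.exp(-(p * torch.log(p + 1e-12)).sum())
    return 1.0 - eff_rank.item() / min(rep.shape)  # ~1 spiky, ~0 balanced

balanced = torch.randn(64, 32)                         # full-spectrum features
spiky = torch.outer(torch.randn(64), torch.randn(32))  # rank-1 features
print(f"balanced: {spikiness(balanced):.2f}  spiky: {spikiness(spiky):.2f}")
```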

MemR^3: Memory Retrieval via Reflective Reasoning for LLM Agents
23 Dec 2025

MemR^3 introduces a system for LLM agents that uses reflective reasoning and an explicit evidence-gap tracker to enable closed-loop control over memory retrieval. This approach consistently enhanced answer quality, achieving up to a 7.29% improvement in LLM-as-a-Judge scores on the LoCoMo benchmark compared to traditional RAG.
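
The closed loop the summary describes can be skeletonized as retrieve, reflect on remaining evidence gaps, and retrieve again until the gap tracker is empty. The interfaces and toy memory below are assumptions, not MemR^3's actual prompts or tracker:

```python
def answer_with_memory(question, retrieve, reflect, generate, max_rounds=3):
    evidence, gaps = [], [question]        # gap tracker seeds on the question
    for _ in range(max_rounds):
        for gap in gaps:
            evidence.extend(retrieve(gap)) # targeted retrieval per open gap
        gaps = reflect(question, evidence) # reflection: what is still missing?
        if not gaps:                       # loop closes when no gaps remain
            break
    return generate(question, evidence)

memory = {"trip": "Alice visited Kyoto in May.", "job": "Alice is a chemist."}
retrieve = lambda q: [v for k, v in memory.items() if k in q.lower()]
reflect = lambda q, ev: [] if ev else ["trip"]
generate = lambda q, ev: " ".join(ev) or "unknown"
print(answer_with_memory("Where did Alice go on her trip?",
                         retrieve, reflect, generate))
```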

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
22 Dec 2025

Researchers from CASIA, UCAS, and Tencent AI Lab developed a framework to decompose large language model (LLM) policies into internal layer and modular policies, using entropy analysis to reveal distinct, architecture-specific reasoning patterns. Their Bottom-up Policy Optimization (BuPO) method, which optimizes internal layers before the full policy, consistently improved complex reasoning performance on benchmarks like MATH and AIME.
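
One concrete reading of an "internal policy" is a logit-lens-style decoding: project an intermediate hidden state through the unembedding and treat the softmax as that layer's policy, whose entropy can then be profiled layer by layer. This decoding choice is an assumption; the summary does not specify BuPO's extraction:

```python
import torch
import torch.nn.functional as F

def internal_policy_entropies(layer_states, unembed):
    # layer_states: per-layer hidden vectors [d]; unembed: shared readout
    entropies = []
    for h in layer_states:
        logp = F.log_softmax(unembed(h), dim=-1)  # decode layer h as a policy
        entropies.append(-(logp.exp() * logp).sum().item())
    return entropies                              # per-layer entropy profile

d, vocab, n_layers = 64, 1000, 6
unembed = torch.nn.Linear(d, vocab, bias=False)   # logit-lens-style readout
layers = [torch.randn(d) * (i + 1) for i in range(n_layers)]  # toy activations
for i, H in enumerate(internal_policy_entropies(layers, unembed)):
    print(f"layer {i}: policy entropy {H:.2f}")   # BuPO tunes low layers first
```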

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
24 Dec 2025

A novel hierarchical reinforcement learning paradigm, termed 'internal RL,' leverages emergent temporal abstractions within pretrained autoregressive models to efficiently solve complex, sparse-reward tasks. This approach enables a metacontroller to discover and operate on abstract actions, substantially outperforming traditional reinforcement learning and prior HRL methods on challenging grid-world and continuous-control environments.

MemEvolve: Meta-Evolution of Agent Memory Systems
21 Dec 2025

The OPPO AI Agent Team and the LV-NUS Lab developed MemEvolve, a meta-evolutionary framework that lets large language model agents adaptively refine their own memory architectures. This approach enables agents to consistently improve task performance by up to 17.06% and to generalize across tasks, LLMs, and agent frameworks.
