Step-DeepResearch Technical Report
24 Dec 2025

StepFun's Step-DeepResearch introduces a cost-effective, end-to-end Deep Research agent based on a 32B-parameter model, demonstrating expert-level capabilities on complex, open-ended research tasks. It scores 61.42 on the ResearchRubrics benchmark, placing second overall and outperforming OpenAI DeepResearch while being more than ten times cheaper than leading commercial systems. The report also introduces ADR-Bench, a new benchmark for evaluating real-world Chinese research tasks.

NVIDIA Nemotron 3: Efficient and Open Intelligence
24 Dec 2025

NVIDIA introduces Nemotron 3, a family of hybrid Mamba-Transformer Mixture-of-Experts LLMs, achieving up to 3.3x higher inference throughput and a 1-million-token context length with state-of-the-art accuracy for agentic AI. The project openly releases model weights, training software, recipes, and over 10 trillion tokens of training data.

Attention Is Not What You Need
22 Dec 2025

A Causal Grassmann architecture replaces self-attention with geometrically structured Grassmann flows for sequence modeling, achieving competitive perplexity on WikiText-2 and slightly higher accuracy on SNLI, with theoretically linear computational complexity in sequence length.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
22 Dec 2025

Researchers from Nanyang Technological University and SenseTime Research introduce the "Prism Hypothesis" and Unified Autoencoding (UAE), a model that integrates high-level semantic features and low-level pixel details into a single latent space. UAE achieves state-of-the-art visual reconstruction with rFID of 0.19 on ImageNet-1K and strong generative capabilities, while preserving semantic understanding with 83.0% linear probing accuracy.
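
The unified objective the hypothesis implies can be pictured as one encoder whose latent feeds both a pixel decoder and a semantic readout. The sketch below is a toy illustration under assumed components (a frozen teacher supplying semantic targets, equal loss weighting), not UAE's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy unified autoencoder: one latent space trained for both pixel
# reconstruction and semantic alignment (all design choices assumed).
class UnifiedAE(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 8, 8), nn.GELU(),
                                     nn.Conv2d(dim, dim, 1))
        self.decoder = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU(),
                                     nn.ConvTranspose2d(dim, 3, 8, 8))
        self.sem_head = nn.Conv2d(dim, dim, 1)  # maps latent to teacher space

    def forward(self, x, teacher_feat):
        z = self.encoder(x)                     # a single latent for both roles
        pixel_loss = F.mse_loss(self.decoder(z), x)
        sem = F.normalize(self.sem_head(z).flatten(2), dim=1)
        tgt = F.normalize(teacher_feat.flatten(2), dim=1)
        semantic_loss = 1 - (sem * tgt).sum(1).mean()  # cosine alignment
        return pixel_loss + semantic_loss

model = UnifiedAE()
x = torch.randn(2, 3, 64, 64)
teacher = torch.randn(2, 256, 8, 8)   # stand-in for frozen teacher features
model(x, teacher).backward()
```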

Latent Implicit Visual Reasoning
24 Dec 2025

Latent Implicit Visual Reasoning (LIVR) enhances Large Multimodal Models (LMMs) by enabling them to implicitly learn and use visual abstractions through dedicated latent tokens and a visual bottleneck mechanism. This approach consistently outperforms direct supervised fine-tuning by average margins of 3.43% to 6.24% and surpasses methods relying on explicit visual supervision across perception-heavy tasks, eliminating the need for costly intermediate annotations.
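
One way to picture the visual bottleneck is as an attention mask in which answer tokens can never read image tokens directly and must route through a handful of learnable latent tokens. The token layout and masking rules below are illustrative assumptions; the summary does not detail LIVR's exact mechanism:

```python
import torch

n_img, n_lat, n_ans = 6, 2, 4                  # toy sequence sizes
n = n_img + n_lat + n_ans
img = slice(0, n_img)
lat = slice(n_img, n_img + n_lat)
ans = slice(n_img + n_lat, n)

mask = torch.zeros(n, n, dtype=torch.bool)     # True = attention allowed
mask[img, img] = True                          # image tokens see each other
mask[lat, img] = True                          # latent tokens read the image
mask[lat, lat] = True                          # and coordinate among themselves
mask[ans, lat] = True                          # answers read ONLY the latents,
mask[ans, ans] = torch.tril(                   # plus earlier answer tokens;
    torch.ones(n_ans, n_ans, dtype=torch.bool))  # never the raw image tokens
print(mask.int())
```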

Streaming Video Instruction Tuning

This research introduces Streamo, an end-to-end real-time streaming video Large Language Model designed to function as a general-purpose interactive assistant, enabling precise frame-level decision-making and response timing. It achieves state-of-the-art performance on online video understanding tasks while also enhancing capabilities on traditional offline benchmarks.
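
Frame-level response timing can be sketched as a control loop that emits a wait-or-respond decision at every incoming frame. The `decide`/`generate` interface and the WAIT/RESPOND actions below are hypothetical stand-ins, not Streamo's actual API:

```python
def streaming_loop(model, video_stream, user_queries):
    history = []                                  # running multimodal context
    for t, frame in enumerate(video_stream):
        history.append(("frame", t, frame))
        if t in user_queries:
            history.append(("query", t, user_queries[t]))
        if model.decide(history) == "RESPOND":    # per-frame timing decision
            yield t, model.generate(history)      # reply at this exact frame

class DummyModel:                                 # stand-in for the video LLM
    def decide(self, history):
        return "RESPOND" if history[-1][0] == "query" else "WAIT"
    def generate(self, history):
        return f"answer at frame {history[-1][1]}"

for t, reply in streaming_loop(DummyModel(), range(5), {2: "what happened?"}):
    print(t, reply)
```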

How Much 3D Do Video Foundation Models Encode?

A model-agnostic framework quantitatively reveals that Video Foundation Models, trained exclusively on 2D video data, develop a robust understanding of 3D objects, scenes, and ego-motion. These models demonstrate 3D awareness competitive with or superior to specialized 3D reconstruction methods, particularly in generalizing to new scenes, and their features improve feedforward 3D reconstruction while requiring less supervised data.
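
The probing idea behind such a framework can be reduced to: freeze the video model, extract patch features, and fit a lightweight head for a 3D target such as per-patch depth, taking probe error as a measure of how much 3D the features encode. Shapes and targets below are illustrative assumptions, not the paper's exact protocol:

```python
import torch
import torch.nn as nn

feats = torch.randn(100, 16, 384)   # [clips, patches, dim] from a frozen VFM
depth = torch.rand(100, 16)         # pseudo ground-truth depth per patch

probe = nn.Linear(384, 1)           # only the probe is trained
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for step in range(200):
    pred = probe(feats).squeeze(-1)
    loss = nn.functional.mse_loss(pred, depth)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"probe MSE: {loss.item():.4f}")  # lower = more linearly decodable 3D
```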

SpatialTree: How Spatial Abilities Branch Out in MLLMs

The paper introduces "SpatialTree," a cognitive-science-inspired hierarchical framework for spatial intelligence in multimodal large language models (MLLMs), and "SpatialTree-Bench," a corresponding benchmark with 27 sub-abilities. It reveals distinct transfer dynamics among spatial skills and proposes an "auto-think strategy" for reinforcement learning that consistently improves spatial performance across all hierarchical levels.

KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning
22 Dec 2025

KerJEPA introduces a generalized framework for Euclidean self-supervised learning, building on LeJEPA by employing a range of kernel discrepancies for regularization. It shows that LeJEPA's slicing implicitly yields heavy-tailed, dimension-dependent kernels, and empirically demonstrates that analytically derived discrepancies improve training stability and convergence, with the kernel Stein discrepancy (KSD) effectively leveraging non-Gaussian priors and an inverse multiquadric (IMQ) kernel achieving 91.90% accuracy on Imagenette.
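
As a concrete instance of a kernel discrepancy regularizer, the sketch below penalizes an MMD between embeddings and an isotropic Gaussian prior under an IMQ kernel. The kernel parameters, the plug-in estimator, and the single prior draw are assumptions, not KerJEPA's exact objective:

```python
import torch

def imq_kernel(x, y, c=1.0, beta=0.5):
    # inverse multiquadric: k(x, y) = (c + ||x - y||^2)^(-beta)
    d2 = torch.cdist(x, y).pow(2)
    return (c + d2).pow(-beta)

def mmd_imq(emb, c=1.0, beta=0.5):
    # plug-in MMD^2 between embeddings and an isotropic Gaussian prior,
    # dropping diagonal terms of the within-sample kernel matrices
    prior = torch.randn_like(emb)
    k_xx = imq_kernel(emb, emb, c, beta)
    k_yy = imq_kernel(prior, prior, c, beta)
    k_xy = imq_kernel(emb, prior, c, beta)
    off = ~torch.eye(emb.shape[0], dtype=torch.bool)
    return k_xx[off].mean() + k_yy[off].mean() - 2 * k_xy.mean()

emb = torch.randn(256, 64, requires_grad=True)
reg = mmd_imq(emb)   # added to the JEPA prediction loss as a regularizer
reg.backward()
```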

Block-Recurrent Dynamics in Vision Transformers
23 Dec 2025

This research introduces and validates the Block-Recurrent Hypothesis (BRH), demonstrating that Vision Transformer computations can be accurately approximated by a small number of distinct blocks applied recurrently. A 3-block recurrent model achieved 98% of the original DINOv2's performance on ImageNet-1k, suggesting an underlying simplicity bias in these models.
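
The hypothesis can be expressed directly in code: instead of N distinct transformer blocks, cycle a small set of shared blocks to the same effective depth. Dimensions and depth below are illustrative, and the distillation used to fit the recurrent blocks to DINOv2 is omitted:

```python
import torch
import torch.nn as nn

class RecurrentViTTrunk(nn.Module):
    def __init__(self, dim=384, n_blocks=3, n_steps=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
            for _ in range(n_blocks))
        self.n_steps = n_steps                 # total effective depth

    def forward(self, tokens):
        for step in range(self.n_steps):       # 3 distinct blocks, cycled
            tokens = self.blocks[step % len(self.blocks)](tokens)
        return tokens

trunk = RecurrentViTTrunk()
out = trunk(torch.randn(1, 197, 384))          # 196 patches + CLS, ViT-S width
print(out.shape)
```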

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
23 Dec 2025

A framework developed by researchers at The University of Hong Kong and Tencent PCG enhances Vision-Language Models' dynamic spatial reasoning capabilities through a new automated data generation pipeline and a Geometry Selection Module. This approach achieves 58.9% average accuracy on the DSR-Bench benchmark while preserving general video understanding performance.

Toward Training Superintelligent Software Agents through Self-Play SWE-RL
21 Dec 2025

Meta FAIR researchers developed Self-play SWE-RL (SSR), a training paradigm for software agents that autonomously generates learning experiences from real-world codebases through a self-play loop. This approach enabled agents to achieve consistent self-improvement and outperform human-data baselines by +10.4 points on SWE-bench Verified and +7.8 points on SWE-Bench Pro, without relying on human-curated issue descriptions or pre-existing tests.
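
The self-play loop can be miniaturized to its skeleton: a setter role perturbs working code to manufacture a repair task, a solver role attempts a fix, and execution supplies the reward. Everything below is a toy stand-in (string edits in place of LLM rollouts, a one-line checker in place of generated tests):

```python
def propose_bug(code):
    # setter role: perturb working code to create a repair task
    return code.replace("+", "-", 1)

def solve(broken_code):
    # solver role: in SSR this is an LLM rollout; here a stub "repair"
    return broken_code.replace("-", "+", 1)

def checker(code):
    env = {}
    exec(code, env)                     # execution-based reward signal
    return env["add"](2, 3) == 5

clean = "def add(a, b):\n    return a + b\n"
fixed = solve(propose_bug(clean))
reward = 1.0 if checker(fixed) else 0.0
print(f"reward: {reward}")              # feeds the policy update in the RL loop
```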

USE: A Unified Model for Universal Sound Separation and Extraction
24 Dec 2025

The USE model from Shanghai Jiao Tong University integrates Universal Sound Separation (SS) and Target Sound Extraction (TSE) into a single framework by semantically aligning internal sound attractors with external multi-modal clue embeddings. This approach robustly handles an unknown number of diverse sound sources and variable clue availability, achieving significant SNRi improvements of up to 35.4% over baselines in TSE and strong performance in autonomous SS while maintaining real-time inference speed.

SemanticGen: Video Generation in Semantic Space

SemanticGen introduces a two-stage diffusion framework for video generation that first establishes global planning in a compact semantic space and then refines details in the VAE latent space. This approach achieves faster training convergence and generates high-quality, long-form videos with superior temporal consistency compared to direct VAE-latent modeling.
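
The two-stage sampler can be sketched as diffusion in a compact semantic space followed by conditioned diffusion in the VAE latent space. Module names, shapes, and the toy denoiser below are assumptions, not SemanticGen's actual samplers:

```python
import torch

class ToyDenoiser:                     # stand-in for a learned diffusion model
    timesteps = range(4)
    def denoise(self, x, t, cond):
        return 0.9 * x                 # placeholder for a real denoising step

def generate_video(prompt_emb, sem_model, lat_model, vae_decode,
                   n_frames=16, sem_dim=64, lat_shape=(4, 32, 32)):
    # Stage 1: plan the whole clip in a compact semantic space, where
    # long-range temporal structure is cheap to model.
    sem = torch.randn(n_frames, sem_dim)
    for t in sem_model.timesteps:
        sem = sem_model.denoise(sem, t, cond=prompt_emb)
    # Stage 2: refine pixel-level detail in the VAE latent space,
    # conditioned on the per-frame semantic plan.
    lat = torch.randn(n_frames, *lat_shape)
    for t in lat_model.timesteps:
        lat = lat_model.denoise(lat, t, cond=sem)
    return vae_decode(lat)

video = generate_video(torch.randn(8), ToyDenoiser(), ToyDenoiser(),
                       vae_decode=lambda z: z)   # identity decoder stub
print(video.shape)                               # torch.Size([16, 4, 32, 32])
```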

AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

The AMoE framework introduces an efficient agglomerative Vision Foundation Model, achieving state-of-the-art global and dense visual representations through multi-teacher distillation and a Mixture-of-Experts architecture. It demonstrates superior performance with 4.7 times fewer training tokens compared to prior methods like RADIOv2.5.

Generalization of Diffusion Models Arises with a Balanced Representation Space

Researchers from the University of Michigan and Georgia Institute of Technology established a mathematical framework explaining how diffusion models either memorize data with "spiky" internal representations or generalize to new data via "balanced" representations. This framework enabled the development of a prompt-free memorization detection method and a technique for interpretable image editing through representation steering.
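
A prompt-free detector in this spirit only needs a statistic separating "spiky" from "balanced" representations. The sketch below uses an effective-rank-based score as an assumed stand-in for the paper's actual measure:

```python
import torch

def spikiness(rep):
    # rep: [tokens, dim] internal representation for one generated sample
    s = torch.linalg.svdvals(rep)              # spectrum of the representation
    p = s / s.sum()
    eff_rank = torch.exp(-(p * torch.log(p + 1e-12)).sum())
    return 1.0 - eff_rank.item() / min(rep.shape)  # ~1 spiky, ~0 balanced

balanced = torch.randn(64, 32)                         # full-spectrum features
spiky = torch.outer(torch.randn(64), torch.randn(32))  # rank-1 features
print(f"balanced: {spikiness(balanced):.2f}  spiky: {spikiness(spiky):.2f}")
```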

MemR^3: Memory Retrieval via Reflective Reasoning for LLM Agents
23 Dec 2025

MemR^3 introduces a system for LLM agents that uses reflective reasoning and an explicit evidence-gap tracker to enable closed-loop control over memory retrieval. This approach consistently enhanced answer quality, achieving up to a 7.29% improvement in LLM-as-a-Judge scores on the LoCoMo benchmark compared to traditional RAG.
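
The closed loop the summary describes can be skeletonized as retrieve, reflect on remaining evidence gaps, and retrieve again until the gap tracker is empty. The interfaces and toy memory below are assumptions, not MemR^3's actual prompts or tracker:

```python
def answer_with_memory(question, retrieve, reflect, generate, max_rounds=3):
    evidence, gaps = [], [question]        # gap tracker seeds on the question
    for _ in range(max_rounds):
        for gap in gaps:
            evidence.extend(retrieve(gap)) # targeted retrieval per open gap
        gaps = reflect(question, evidence) # reflection: what is still missing?
        if not gaps:                       # loop closes when no gaps remain
            break
    return generate(question, evidence)

memory = {"trip": "Alice visited Kyoto in May.", "job": "Alice is a chemist."}
retrieve = lambda q: [v for k, v in memory.items() if k in q.lower()]
reflect = lambda q, ev: [] if ev else ["trip"]
generate = lambda q, ev: " ".join(ev) or "unknown"
print(answer_with_memory("Where did Alice go on her trip?",
                         retrieve, reflect, generate))
```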

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
22 Dec 2025

Researchers from CASIA, UCAS, and Tencent AI Lab developed a framework to decompose large language model (LLM) policies into internal layer and modular policies, using entropy analysis to reveal distinct, architecture-specific reasoning patterns. Their Bottom-up Policy Optimization (BuPO) method, which optimizes internal layers before the full policy, consistently improved complex reasoning performance on benchmarks like MATH and AIME.
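
One concrete reading of an "internal policy" is a logit-lens-style decoding: project an intermediate hidden state through the unembedding and treat the softmax as that layer's policy, whose entropy can then be profiled layer by layer. This decoding choice is an assumption; the summary does not specify BuPO's extraction:

```python
import torch
import torch.nn.functional as F

def internal_policy_entropies(layer_states, unembed):
    # layer_states: per-layer hidden vectors [d]; unembed: shared readout
    entropies = []
    for h in layer_states:
        logp = F.log_softmax(unembed(h), dim=-1)  # decode layer h as a policy
        entropies.append(-(logp.exp() * logp).sum().item())
    return entropies                              # per-layer entropy profile

d, vocab, n_layers = 64, 1000, 6
unembed = torch.nn.Linear(d, vocab, bias=False)   # logit-lens-style readout
layers = [torch.randn(d) * (i + 1) for i in range(n_layers)]  # toy activations
for i, H in enumerate(internal_policy_entropies(layers, unembed)):
    print(f"layer {i}: policy entropy {H:.2f}")   # BuPO tunes low layers first
```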

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
24 Dec 2025

A novel hierarchical reinforcement learning paradigm, termed 'internal RL,' leverages emergent temporal abstractions within pretrained autoregressive models to efficiently solve complex, sparse-reward tasks. This approach enables a metacontroller to discover and operate on abstract actions, substantially outperforming traditional reinforcement learning and prior HRL methods on challenging grid-world and continuous-control environments.

MemEvolve: Meta-Evolution of Agent Memory Systems
21 Dec 2025

The OPPO AI Agent Team and the LV-NUS Lab developed MemEvolve, a meta-evolutionary framework that lets large language model agents adaptively refine their own memory architectures. This approach enables agents to consistently improve task performance by up to 17.06% and to generalize across tasks, LLMs, and agent frameworks.
