📅 2026-04-11

Daily Picks · arXiv · Hacker News · GitHub Trending

Computer Vision: 5 · Archive →

📄 arXiv Papers
Computer Vision 2604.08213
Relevance 75/100

EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Tianze Zhou, Haohua Chen et al. (9 authors)

Core contribution: Proposes EditCaption, a scalable two-stage post-training pipeline that substantially improves the accuracy and human alignment of vision-language models for image-editing instruction synthesis.
Method: 1. Stage 1 builds a 100K supervised fine-tuning dataset via GLM automatic annotation, EditScore-based filtering, and human refinement, addressing spatial, directional, and attribute-level accuracy; 2. Stage 2 collects 10K human preference pairs targeting three systematic failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone.
Key findings: 1. Fine-tuned Qwen3-VL models outperform open-source baselines on Eval-400, ByteMorph-Bench, and HQ-Edit; the 235B model scores 4.712 on Eval-400 (above Gemini-3-Pro, GPT-4.1, and others); 2. Human evaluation shows the critical error rate falling from 47.75% to 23% and correctness rising from 41.75% to 66%.
View original abstract

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
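The Stage 2 alignment objective can be sketched in a few lines. This is a minimal, self-contained illustration of the standard DPO loss on a single preference pair, not the paper's code; the log-probability values below are toy numbers:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed token log-probability of the chosen or
    rejected instruction under the policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen instruction than the reference model does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log sigmoid(beta * margin): small when the policy already
    # prefers the human-chosen instruction, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy assigns a higher relative likelihood to the
# chosen instruction than the reference does, so the loss is below log(2).
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0, beta=0.1)
```

At initialization (policy equals reference) the margin is zero and the loss is exactly log 2; preference training pushes it down from there.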

Computer Vision 2604.08042
Relevance 75/100

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi

Core contribution: Proposes 3DrawAgent, a training-free 3D sketch generation framework in which large language models (LLMs) sequentially draw 3D Bezier curves under geometric feedback, with a contrastive-experience optimization strategy that strengthens the model's spatial awareness.
Method: Adapts the Group Reward Policy Optimization (GRPO) paradigm into a relative experience optimization strategy: pairwise comparisons among generated sketches (scored by CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment) iteratively refine the model's 3D drawing priors. The entire process requires no parameter updates, letting the model self-improve its 3D spatial understanding and drawing quality.
Key findings: Experiments show that 3DrawAgent generates complex, coherent 3D Bezier sketches from diverse textual prompts, exhibits emergent geometric reasoning, and generalizes to novel shapes, establishing a new paradigm for training-free 3D sketch intelligence.
View original abstract

Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
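The pair-construction step of relative experience optimization can be sketched as follows: a group of generated sketches is ranked by a combined CLIP + LLM reward and turned into ordered (better, worse) pairs. The `Sketch` record, reward weights, and scores below are hypothetical stand-ins, not the authors' implementation:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Sketch:
    curves: list          # 3D Bezier control points drawn so far
    clip_reward: float    # CLIP-based perceptual reward vs. the prompt
    llm_score: float      # LLM-based fine-grained qualitative score

def build_experience_pairs(sketches, w_clip=0.5, w_llm=0.5):
    """Turn a group of generated sketches into (better, worse) pairs.

    No ground truth is needed: each pair is ordered by a combined
    CLIP + LLM reward, and the pairs are later fed back to the LLM
    as in-context 'experience' (no parameter updates).
    """
    def reward(s):
        return w_clip * s.clip_reward + w_llm * s.llm_score
    pairs = []
    for a, b in combinations(sketches, 2):
        if reward(a) == reward(b):
            continue  # ties carry no preference signal
        better, worse = (a, b) if reward(a) > reward(b) else (b, a)
        pairs.append((better, worse))
    return pairs

# Toy group of three sketches with hypothetical reward scores.
group = [Sketch([], 0.8, 0.6), Sketch([], 0.5, 0.8), Sketch([], 0.3, 0.2)]
pairs = build_experience_pairs(group)
```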

Computer Vision 2604.07966
Relevance 75/100

Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang et al. (7 authors)

Core contribution: Proposes the LiVER framework, which achieves highly controllable video generation by disentangling 3D scene properties (layout, lighting, camera trajectory), plus a scene agent that automatically translates user instructions into 3D control signals.
Method: 1. Builds a large-scale dataset with dense annotations (object layout, lighting, camera parameters); 2. Disentangles scene properties by rendering control signals from a unified 3D representation; 3. Designs a lightweight conditioning module and a progressive training strategy to integrate the control signals into a foundational video diffusion model.
Key findings: LiVER achieves state-of-the-art photorealism and temporal consistency while supporting precise, disentangled control over scene factors, setting a new standard for controllable video generation.
View original abstract

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
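The scene agent's interface (free-form instruction in, disentangled 3D control signals out) can be illustrated with a toy rule-based stand-in. The `SceneControl` fields and the mapping rules are assumptions for illustration only, not the paper's design:

```python
import math
from dataclasses import dataclass, field

@dataclass
class SceneControl:
    """Disentangled 3D scene properties conditioning the video model."""
    layout: list = field(default_factory=list)       # (object, position) tuples
    lighting: dict = field(default_factory=dict)     # e.g. direction, intensity
    camera_path: list = field(default_factory=list)  # per-frame camera positions

def toy_scene_agent(instruction: str) -> SceneControl:
    """Map a high-level instruction to explicit control signals.

    A rule-based stand-in for the paper's scene agent, just to show
    the interface: free-form text in, disentangled 3D controls out,
    each property editable independently of the others.
    """
    ctrl = SceneControl()
    text = instruction.lower()
    if "sunset" in text:
        ctrl.lighting = {"direction": (-1.0, -0.2, 0.0), "intensity": 0.4,
                         "color_temp_k": 3000}
    if "orbit" in text:
        # 8 keyframes circling the scene origin at fixed height
        ctrl.camera_path = [(math.cos(t), 0.5, math.sin(t))
                            for t in [i * math.pi / 4 for i in range(8)]]
    return ctrl

ctrl = toy_scene_agent("orbit the table at sunset")
```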

Computer Vision 2604.07522
Relevance 75/100

Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)

Yuhang He

Core contribution: Proposes XShapeEnc, a training-free, general-purpose encoding strategy that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation with five favorable properties.
Method: The method decomposes a 2D geometric shape into its normalized geometry and a pose vector, encodes shape geometry and pose independently or jointly using orthogonal Zernike bases, and introduces high-frequency content via a frequency-propagation operation.
Key findings: Extensive analysis and experiments demonstrate XShapeEnc's theoretical validity, efficiency, discriminability, and applicability across a wide range of shape-aware tasks.
View original abstract

Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.
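The orthogonal Zernike basis at the core of XShapeEnc has a closed-form radial part; a minimal sketch of the standard formula (not the paper's code) is:

```python
from math import factorial

def zernike_radial(n: int, m: int, rho: float) -> float:
    """Radial part R_n^m(rho) of the Zernike polynomial on the unit disk.

    The Zernike basis is orthogonal over the unit disk, which is why
    XShapeEnc normalizes shape geometry (and the harmonic pose field)
    into the disk before encoding. Standard textbook formula.
    """
    m = abs(m)
    if (n - m) % 2 != 0:
        return 0.0  # R_n^m vanishes when n - |m| is odd
    total = 0.0
    for k in range((n - m) // 2 + 1):
        coeff = ((-1) ** k * factorial(n - k)
                 / (factorial(k)
                    * factorial((n + m) // 2 - k)
                    * factorial((n - m) // 2 - k)))
        total += coeff * rho ** (n - 2 * k)
    return total

# Sanity checks against closed forms: R_0^0 = 1, R_2^0 = 2*rho^2 - 1
assert zernike_radial(0, 0, 0.5) == 1.0
assert abs(zernike_radial(2, 0, 0.5) - (2 * 0.25 - 1)) < 1e-12
```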

Computer Vision 2604.06966
Relevance 75/100

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu et al. (9 authors)

Core contribution: Proposes a stabilized reinforcement learning algorithm for hybrid autoregressive-diffusion frameworks that tackles gradient noise through multi-trajectory expectation and uncertainty-aware optimization, markedly improving generation quality and training stability.
Method: 1. Introduces a multi-trajectory expectation (MTE) mechanism that reduces noise by averaging gradients over multiple diffusion trajectories; 2. Uses multi-trajectory variance to identify the top-k% uncertain tokens for localized optimization; 3. Applies a consistency-aware token selection strategy to filter out AR tokens inconsistent with the final generated content.
Key findings: Experiments show: 1. Visual quality and spatial structure understanding significantly exceed the baseline GRPO across multiple benchmarks; 2. Training stability improves, avoiding early performance saturation; 3. Code is open-sourced.
View original abstract

Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.
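The two statistics MTE relies on, the per-token mean and variance across diffusion trajectories, can be sketched in pure Python. Shapes and numbers below are toy values, not from the paper:

```python
import statistics

def multi_trajectory_stats(traj_logps):
    """Per-token mean and variance of log-probs across diffusion trajectories.

    traj_logps: list of K trajectories, each a list of per-token
    log-probability estimates from the diffusion head. Averaging across
    trajectories (multi-trajectory expectation) suppresses the noise a
    single trajectory injects into the policy gradient.
    """
    n_tokens = len(traj_logps[0])
    means = [statistics.fmean(t[i] for t in traj_logps) for i in range(n_tokens)]
    variances = [statistics.pvariance([t[i] for t in traj_logps])
                 for i in range(n_tokens)]
    return means, variances

def topk_uncertain_tokens(variances, k_percent):
    """Indices of the top-k% highest-variance (most uncertain) tokens.

    Restricting the multi-trajectory update to these tokens avoids the
    over-smoothing that averaging every token would cause.
    """
    k = max(1, round(len(variances) * k_percent / 100))
    return sorted(range(len(variances)), key=lambda i: -variances[i])[:k]

# Toy example: 3 trajectories over 4 tokens; token 2 is the noisiest.
trajs = [[-1.0, -2.0, -0.5, -3.0],
         [-1.1, -2.0, -2.5, -3.1],
         [-0.9, -2.0, -1.5, -2.9]]
means, variances = multi_trajectory_stats(trajs)
selected = topk_uncertain_tokens(variances, k_percent=25)
```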

🔥 Hacker News
HN ▲ 129  💬 39
Recommendation 90/100

Bevy game development tutorials and in-depth resources

by GenericCanadian

Development tutorials and in-depth resources for the Bevy game engine, offering practical guides and techniques for learning game development.
HN ▲ 125  💬 39
Recommendation 85/100

How We Broke Top AI Agent Benchmarks: And What Comes Next

by Anon84

The post describes how the team broke top AI agent benchmarks and looks at what comes next.
HN ▲ 143  💬 47
Recommendation 85/100

Surelock: Deadlock-Free Mutexes for Rust

by codetheweb

The post introduces Surelock, a deadlock-free mutex design for Rust aimed at eliminating deadlocks in concurrent programming.
HN ▲ 524  💬 132
Recommendation 85/100

Starfling: A one-tap endless orbital slingshot game in a single HTML file

by iceberger2001

A one-tap endless orbital slingshot game implemented in a single HTML file; simple and easy to pick up.
HN ▲ 487  💬 368
Recommendation 85/100

AI assistance when contributing to the Linux kernel

by hmokiguess

The post discusses guidelines and caveats for using AI assistance when contributing code to the Linux kernel.
HN ▲ 76  💬 81
Recommendation 85/100

Launch HN: Twill.ai (YC S25) – Delegate to cloud agents, get back PRs

by danoandco

Twill.ai is a cloud agent platform: it receives tasks via Slack, GitHub, and other channels, runs AI coding tools in isolated sandboxes, and returns PRs or solutions while keeping the user in control at key checkpoints.
HN ▲ 9  💬 2
Recommendation 85/100

Can It Resolve Doom? Game Engine in 2k DNS Records

by choult

The post describes a creative hack: a game engine built from 2K DNS records that successfully runs the classic game Doom, showcasing an unconventional and playful use of the DNS protocol.
HN ▲ 66  💬 34
Recommendation 85/100

Show HN: Eve – Managed OpenClaw for work

by zachdive

Eve is an AI agent that runs in an isolated Linux sandbox with a filesystem, headless browser, and code execution, connects to thousands of services, and completes tasks on request; it is positioned as an efficient work assistant rather than a personal assistant.
🐙 GitHub Trending
Python ⭐ 9,831  +1136 today
Recommendation 85/100

OpenBMB/VoxCPM

VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

VoxCPM2 is a tokenizer-free multilingual speech generation tool supporting creative voice design and lifelike voice cloning; notable for its tokenizer-free design and multilingual capability.
Python ⭐ 58,517  +6437 today
Recommendation 70/100

NousResearch/hermes-agent

The agent that grows with you

Hermes-Agent is an agent project that grows with your needs; worth watching for its flexibility and adaptability across varied development needs.
Python ⭐ 16,706  +836 today
Recommendation 70/100

HKUDS/DeepTutor

DeepTutor: Agent-Native Personalized Learning Assistant

DeepTutor is an agent-native personalized learning assistant that uses deep learning to deliver tailored tutoring; notable for its innovative combination of AI and education.
TypeScript ⭐ 7,791  +1950 today
Recommendation 65/100

multica-ai/multica

The open-source managed agents platform. Turn coding agents into real teammates — assign tasks, track progress, compound skills.

Multica is an open-source managed agents platform that turns coding agents into real teammates, with task assignment, progress tracking, and skill compounding; notable for bringing AI collaboration into the development workflow.
TypeScript ⭐ 16,434  +1339 today
Recommendation 60/100

coleam00/Archon

The first open-source harness builder for AI coding. Make AI coding deterministic and repeatable.

Archon is the first open-source harness builder for AI coding, aiming to make AI coding deterministic and repeatable; notable for tackling the reliability of AI-generated code.
⭐ 22,214  +1978 today
Recommendation 60/100

alexpate/awesome-design-systems

💅🏻 ⚒ A collection of awesome design systems

The project curates a collection of excellent **design system** resources, giving designers and developers rich reference cases and tools; worth watching because it helps teams efficiently build a consistent product design language.
Python ⭐ 102,014  +3069 today
Recommendation 50/100

microsoft/markitdown

Python tool for converting files and office documents to Markdown.

Microsoft's markitdown is a Python tool for converting files and office documents to Markdown; worth watching because it simplifies document format conversion.
Java ⭐ 15,568  +777 today
Recommendation 50/100

opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

OpenDataloader-PDF is an open-source, Java-based PDF parsing tool focused on converting PDF documents into AI-ready data and automating PDF accessibility, notable for simplifying the processing of unstructured PDF data.