📄 arXiv Papers
From Static to Interactive: Adapting Visual In-Context Learners for User-Driven Tasks
Carlos Schmidt, Simon Reiß
Core contribution: Proposes a method to transform static visual in-context learning models (such as DeLVM) into user-driven interactive systems, allowing model predictions to be steered dynamically through natural visual cues such as scribbles, clicks, or drawn boxes.
Method: User interactions are encoded directly into the example input-output pairs, keeping the core philosophy of visual in-context learning intact and handling unseen interaction types without fine-tuning. The resulting Interactive DeLVM framework lets users adjust model behavior in real time through intuitive operations such as marking the target region.
Key findings: Experiments show that existing visual in-context learning models fail to effectively use interaction cues (often ignoring user guidance entirely), while this method significantly outperforms static models on interactive segmentation (+7.95% IoU), directed super-resolution (+2.46 PSNR), and interactive object removal (-3.14% LPIPS), bridging the gap between static task adaptation and user-centric interactivity.
Original abstract:
Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
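To make the encoding idea concrete, here is a minimal sketch of burning a user scribble into an example input image so the in-context prompt itself carries the interaction cue. This is an illustration of the concept only, not the authors' implementation; DeLVM's actual prompt format is not reproduced, and `overlay_scribble` and the toy arrays are made up for this example.

```python
# Sketch: encode a user interaction (a scribble mask) directly into the
# example input image of an in-context pair. Hypothetical helper, not DeLVM code.
import numpy as np

def overlay_scribble(image: np.ndarray, scribble: np.ndarray,
                     color=(255, 0, 0)) -> np.ndarray:
    """Paint a binary scribble mask onto an RGB image as a visual cue."""
    out = image.copy()
    out[scribble.astype(bool)] = color
    return out

# Toy example pair: the "input" carries the user cue; its paired "output"
# would be the desired result (e.g., a mask of the scribbled object).
h, w = 64, 64
image = np.full((h, w, 3), 128, dtype=np.uint8)
scribble = np.zeros((h, w), dtype=np.uint8)
scribble[30:34, 10:50] = 1  # a horizontal stroke over the target object
prompt_input = overlay_scribble(image, scribble)
# prompt_input and its paired output are then concatenated with the query
# image to form the visual in-context prompt.
```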
PhyEdit: Physically Grounded Image Editing for Real-World Object Manipulation
Ruihang Xu, Dewei Zhou, Xiaolong Shen, Fan Ma, Yi Yang
Core contribution: Proposes PhyEdit, a framework that uses explicit geometric simulation for 3D-aware image editing, significantly improving the physical accuracy and consistency of object manipulation.
Method: 1. Combines a plug-and-play 3D geometric prior with joint 2D-3D supervision
2. Uses explicit geometric simulation to generate 3D-aware visual guidance
3. Builds RealManip-10K, a real-world dataset with paired images and depth annotations
4. Designs ManipEval, a multi-dimensional benchmark for testing geometric consistency
Key findings: 1. Outperforms existing methods, including closed-source models, in 3D geometric accuracy and manipulation consistency
2. The proposed RealManip-10K dataset and ManipEval benchmark effectively support method validation
3. Physics-simulation guidance markedly improves the spatial accuracy of object scaling and positioning
Original abstract:
Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.
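Since the abstract traces scaling errors to missing perspective projection, a short worked example of the underlying pinhole geometry may help. This is standard textbook math, not PhyEdit's simulation pipeline, and the focal length and depths below are invented values.

```python
# Pinhole projection: an object's on-screen height scales as f * H / Z,
# so doubling its depth halves its apparent size. A purely 2D editor with
# no notion of Z cannot recover this relationship.
def apparent_height_px(real_height_m: float, depth_m: float,
                       focal_px: float) -> float:
    """Image-plane height of an object under pinhole projection."""
    return focal_px * real_height_m / depth_m

f = 1000.0   # assumed focal length in pixels
h_obj = 1.5  # a 1.5 m tall object

print(apparent_height_px(h_obj, depth_m=3.0, focal_px=f))  # 500.0 px
print(apparent_height_px(h_obj, depth_m=6.0, focal_px=f))  # 250.0 px
```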
When Numbers Speak: Aligning Textual Numbers with Visual Instances in Text-to-Video Diffusion Models
Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen et al. (7 authors)
Core contribution: Proposes NUMINA, a training-free identify-then-guide framework that improves the alignment between numbers in a prompt and visual instances in text-to-video generation.
Method: 1. Identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads, deriving a countable latent layout; 2. Conservatively refines this layout and modulates cross-attention to guide video regeneration; 3. Structural guidance complements seed search and prompt enhancement.
Key findings: 1. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models respectively; 2. CLIP alignment is improved while temporal consistency is maintained; 3. The code is open-sourced.
Original abstract:
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
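As a rough illustration of what a "countable latent layout" enables, the sketch below binarizes an object-token attention map and counts connected components. NUMINA's actual head selection and layout refinement are not reproduced here; the attention map is synthetic and `count_instances` is a hypothetical helper.

```python
# Sketch: count object instances in a (synthetic) cross-attention map by
# thresholding it and labeling connected blobs. Not NUMINA's exact procedure.
import numpy as np
from scipy import ndimage

def count_instances(attn_map: np.ndarray, thresh: float = 0.5) -> int:
    """Binarize an attention map and count connected components."""
    norm = (attn_map - attn_map.min()) / (np.ptp(attn_map) + 1e-8)
    _, num_blobs = ndimage.label(norm > thresh)
    return num_blobs

# Toy attention map with three bright spots standing in for three objects.
yy, xx = np.mgrid[0:32, 0:32]
attn = np.zeros((32, 32))
for cy, cx in [(8, 8), (8, 24), (24, 16)]:
    attn += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 8.0)

print(count_instances(attn))  # -> 3; compare against the number in the prompt
```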
OpenVLThinkerV2: A General-Purpose Multimodal Reasoning Model for Multi-Domain Visual Tasks
Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng et al. (6 authors)
Core contribution: Proposes the Gaussian GRPO (G$^2$RPO) reinforcement learning objective, which addresses the extreme variance in reward topologies and the perception-reasoning balancing problem in multimodal generalist models, and releases OpenVLThinkerV2, a high-performance open-source multimodal model.
Method: 1. Designs the G$^2$RPO objective, which uses non-linear distributional matching to force the advantage distribution to converge to a standard normal distribution, achieving gradient equity across tasks;
2. Introduces a response-length shaping mechanism that dynamically regulates reasoning-chain length and visual grounding;
3. Applies an entropy shaping mechanism that bounds the model's exploration range, preventing both entropy collapse and entropy explosion.
Key findings: Across 18 diverse benchmarks, OpenVLThinkerV2 outperforms strong open-source models and leading proprietary frontier models, demonstrating its robustness and generality.
Original abstract:
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
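A minimal sketch of the distributional-matching idea stated in the abstract, assuming a rank-based mapping onto $\mathcal{N}(0,1)$ (the paper may use a different transform): rewards within a group are converted to advantages via their empirical quantiles rather than linear mean/std scaling, which bounds the influence of heavy-tail outliers and yields symmetric positive/negative updates.

```python
# Sketch: rank-based Gaussianization of group rewards, one plausible way to
# force advantages toward N(0,1). Illustrative only, not the authors' code.
import numpy as np
from scipy.stats import norm, rankdata

def gaussianized_advantages(rewards: np.ndarray) -> np.ndarray:
    """Map rewards to N(0,1) advantages via their empirical ranks."""
    n = len(rewards)
    ranks = rankdata(rewards, method="average")  # ties get averaged ranks
    quantiles = (ranks - 0.5) / n                # strictly inside (0, 1)
    return norm.ppf(quantiles)                   # inverse standard-normal CDF

# A heavy-tailed reward group: linear scaling lets the 5.0 outlier dominate,
# while rank-Gaussianization keeps every advantage bounded and symmetric.
rewards = np.array([0.1, 0.2, 0.15, 0.9, 5.0])
linear = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(linear.round(2))                            # [-0.62 -0.57 -0.59 -0.2   1.98]
print(gaussianized_advantages(rewards).round(2))  # [-1.28  0.   -0.52  0.52  1.28]
```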
FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On
Johanna Karras, Yuanhao Wang, Yingwei Li, Ira Kemelmacher-Shlizerman
Core contribution: Introduces FIT, the first large-scale virtual try-on dataset with precise body and garment measurements, along with a baseline model, advancing research on fit-aware virtual try-on.
Method: 1. Procedurally generates 3D garments and drapes them via physics simulation to capture realistic fit; 2. Uses a novel re-texturing framework to transform synthetic renderings into photorealistic images; 3. Adds person-identity preservation to the re-texturing model to generate paired person images for supervised training.
Key findings: The FIT dataset and baseline model set a new state of the art for fit-aware virtual try-on and provide a robust benchmark for future research. Experiments show the method effectively handles try-on across different fit conditions, such as oversized or undersized garments.
Original abstract:
Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit -- for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set the new state-of-the-art for fit-aware virtual try-on, as well as offer a robust benchmark for future research. We will make all data and code publicly available on our project page: https://johannakarras.github.io/FIT.
🔥 Hacker News
by kashifr
This post describes using the TRL library to speed up distillation training of 100B+ parameter models by 40x, showcasing a breakthrough in efficient model compression.
by GenericCanadian
This post shares development tutorials and in-depth resources for the Bevy game engine, well suited for game developers who want to learn it.
by halfwhey
Claudraband is an enhancement tool for Claude Code that provides a controllable terminal environment via tmux or xterm.js, supporting resumable non-interactive workflows, e.g., letting the current session query past decision records.
by Anon84
This post explores how to exploit flaws in the best-known AI agent benchmark and discusses the importance of keeping benchmarks trustworthy.
by creaktive
Running the classic game Doom through the curl command-line tool, a playful demonstration of technical creativity.
by bratao
Claude Opus 4.6's accuracy on the BridgeBench hallucination test dropped from 83% to 68%, raising concerns about a performance regression.
by maxloh
This post discusses how to get most of Rust's benefits through high-level abstractions while avoiding overly complex low-level details, balancing efficiency with ease of use.
by iceberger2001
A click-to-play endless orbital slingshot game implemented in a single HTML file, simple and easy to pick up.
🐙 GitHub Trending
OpenBMB / VoxCPM
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
VoxCPM is a tokenizer-free TTS project supporting multilingual speech generation, creative voice design, and true-to-life voice cloning; it stands out for its innovative approach to high-quality, diverse speech synthesis.
NousResearch / hermes-agent
The agent that grows with you
Hermes-Agent is an agent project that grows with the user's needs; worth watching because it combines flexibility with extensibility, making it well suited for developers building personalized AI assistants.
multica-ai / multica
The open-source managed agents platform. Turn coding agents into real teammates — assign tasks, track progress, compound skills.
Multica is an open-source managed-agents platform that turns coding agents into real teammates, with task assignment, progress tracking, and compounding skills; worth watching for the team-collaboration efficiency it promises.
coleam00 / Archon
The first open-source harness builder for AI coding. Make AI coding deterministic and repeatable.
Archon is the first open-source harness builder for AI coding, aiming to make AI coding deterministic and repeatable; worth watching for its novel take on the reliability of AI-generated code.
rustfs / rustfs
🚀2.3x faster than MinIO for 4KB object payloads. RustFS is an open-source, S3-compatible high-performance object storage system supporting migration and coexistence with other S3-compatible platforms such as MinIO and Ceph.
RustFS is an open-source, high-performance, S3-compatible object storage system that is 2.3x faster than MinIO for 4KB object payloads, and supports migration from and coexistence with other S3-compatible platforms such as MinIO and Ceph.
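Because RustFS is S3-compatible, a stock S3 client should work against it. The sketch below uses boto3; the endpoint, bucket name, and credentials are placeholder assumptions, not values documented by the RustFS project.

```python
# Talk to a (hypothetical) local RustFS instance through the standard S3 API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # assumed RustFS endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello rustfs")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```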
snarktank / ralph
Ralph is an autonomous AI agent loop that runs repeatedly until all PRD items are complete.
Ralph is an autonomous AI agent loop that runs repeatedly until every item in the product requirements document (PRD) is complete; worth watching because it automates development tasks all the way to the goal.
microsoft / markitdown
Python tool for converting files and office documents to Markdown.
Microsoft's markitdown is a Python tool for converting files and office documents to Markdown; worth watching because it streamlines document format conversion.
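A minimal usage sketch following the pattern shown in the project's README; the input filename is a placeholder.

```python
# Convert an office document to Markdown (pip install markitdown).
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.docx")  # placeholder path
print(result.text_content)          # the document rendered as Markdown
```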
shanraisshan / claude-code-best-practice
practice made claude perfect
This project collects best practices for programming with Claude; worth watching because it helps developers use Claude more effectively when writing code.