📅 2026-04-07

Daily Picks · arXiv · Hacker News · GitHub Trending

Computer Vision: 5 papers

📄 arXiv Papers
Computer Vision 2604.04487
Relevance 85/100

Training-Free Image Editing with Visual Context Integration and Concept Alignment


Rui Song, Guo-Hua Wang, Qing-Guo Chen, Weihua Luo, Tongda Xu, et al. (9 authors)

Core contribution: Proposes VicoEdit, a method that injects visual context into a pretrained text-prompted editing model without any training or diffusion inversion, markedly improving editing consistency and flexibility.
Method: 1. Directly transforms the source image into the target image based on the visual context, avoiding the trajectory deviation introduced by conventional diffusion inversion; 2. Designs a posterior sampling strategy guided by concept alignment to enhance editing consistency; 3. Built entirely on pretrained models, with no extra data collection or fine-tuning.
Key finding: Experiments show that this training-free method outperforms current state-of-the-art training-based models in editing performance, standing out especially in contextual consistency and editing flexibility.
Original abstract:

In image editing, it is essential to incorporate a context image to convey the user's precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur data collection effort and training cost. On the other hand, the training-free alternatives are typically established on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method to inject the visual context into the pretrained text-prompted editing model. More specifically, VicoEdit directly transforms the source image into the target one based on the visual context, thereby eliminating the need for inversion that can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to enhance the editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than the state-of-the-art training-based models.
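As a rough illustration of the concept-alignment-guided posterior sampling described above, here is a minimal toy sketch. The 1-D vectors standing in for images, the `denoise_step` stand-in, and the guidance rule are all assumptions made for illustration, not the authors' algorithm.

```python
# Toy sketch (assumed, not VicoEdit's actual code): transform the source
# directly toward the visual context, nudging each step toward higher
# alignment with a concept vector instead of running diffusion inversion.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def denoise_step(x, target, alpha=0.5):
    """Stand-in for one step of the pretrained editor: move x toward target."""
    return [a + alpha * (b - a) for a, b in zip(x, target)]

def posterior_sample(source, visual_context, concept, steps=10, guide=0.1):
    """Edit source toward the visual context directly (no inversion),
    with a small concept-alignment pull at every step."""
    x = list(source)
    for _ in range(steps):
        x = denoise_step(x, visual_context)
        sim = cosine(x, concept)            # how aligned are we already?
        x = [a + guide * (1.0 - sim) * c    # weaker pull as alignment grows
             for a, c in zip(x, concept)]
    return x

edited = posterior_sample([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

The key property mirrored here is that no inverted noise trajectory is ever computed; the edit is a direct path from source to target.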

Computer Vision 2604.04018
Relevance 85/100

1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation


Haoyu Li, Tingyan Wen, Lin Qi, Zhe Wu, Yihuang Chen, et al. (9 authors)

Core contribution: Proposes 1.x-Distill, the first fractional-step distillation framework, breaking the traditional integer-step constraint to enable efficient 1.x-step generation while preserving quality and diversity at extremely low step counts.
Method: 1. Analyzes the role of teacher CFG in DMD and modifies it to suppress mode collapse; 2. Proposes a two-stage distillation strategy: first learn coarse structure via diversity-preserving distribution matching, then refine details with inference-consistent adversarial distillation; 3. Designs a lightweight compensation module that integrates block-level caching into the distillation pipeline.
Key finding: On SD3-Medium and SD3.5-Large, 1.x-Distill surpasses existing few-step methods at 1.67/1.74 effective NFEs with better quality and diversity, achieving up to a 33x speedup over the original 28x2 NFE sampling.
Original abstract:

Diffusion models produce high-quality text-to-image results, but their iterative denoising is computationally expensive. Distribution Matching Distillation (DMD) emerges as a promising path to few-step distillation, but suffers from diversity collapse and fidelity degradation when reduced to two steps or fewer. We present 1.x-Distill, the first fractional-step distillation framework that breaks the integer-step constraint of prior few-step methods and establishes 1.x-step generation as a practical regime for distilled diffusion models. Specifically, we first analyze the overlooked role of teacher CFG in DMD and introduce a simple yet effective modification to suppress mode collapse. Then, to improve performance under extreme steps, we introduce Stagewise Focused Distillation, a two-stage strategy that learns coarse structure through diversity-preserving distribution matching and refines details with inference-consistent adversarial distillation. Furthermore, we design a lightweight compensation module for Distill-Cache co-Training, which naturally incorporates block-level caching into our distillation pipeline. Experiments on SD3-Medium and SD3.5-Large show that 1.x-Distill surpasses prior few-step methods, achieving better quality and diversity at 1.67 and 1.74 effective NFEs, respectively, with up to 33x speedup over original 28x2 NFE sampling.
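The "fractional effective NFEs" headline can be demystified with a back-of-the-envelope accounting: if later sampling steps reuse cached transformer blocks, each later step costs only a fraction of a full forward pass. The cost model below, and the 24-block/8-cached split, are purely illustrative assumptions; the paper's actual caching scheme is not described in this digest.

```python
# Hedged sketch (assumed accounting, not the paper's definition) of how
# block-level caching can yield a fractional effective-NFE count.

def effective_nfes(steps, blocks, cached_blocks_after_first):
    """First step costs one full NFE; each later step recomputes only the
    non-cached blocks, so it costs a fraction of an NFE."""
    frac = (blocks - cached_blocks_after_first) / blocks
    return 1.0 + (steps - 1) * frac

# e.g. a 2-step sampler over a 24-block transformer that caches 8 blocks
# on the second step costs 1 + 16/24 ≈ 1.67 effective NFEs
cost = effective_nfes(steps=2, blocks=24, cached_blocks_after_first=8)
```

Under this toy accounting, a 1.x-step sampler sits between a 1-step and a 2-step distilled model in compute cost.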

Computer Vision 2604.03611
Relevance 85/100

PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation


Yuyang Sha, Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li, et al. (7 authors)

Core contribution: Introduces PortraitCraft, a unified benchmark for portrait composition understanding and generation, filling the gap in structured portrait composition analysis and controllable portrait generation.
Method: PortraitCraft is built on roughly 50,000 curated real portrait images with multi-level structured annotations (global composition scores, labels for 13 composition attributes, explanation texts, visual question answering pairs, and generation-oriented textual descriptions). On top of the dataset, the authors define two complementary benchmark tasks, composition understanding and composition-aware generation, evaluated within a unified framework.
Key finding: Experiments show that PortraitCraft provides a comprehensive benchmark for fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation, validated through standardized evaluation protocols and baseline results with representative multimodal models.
Original abstract:

Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
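To make the annotation layers concrete, here is a hypothetical sketch of what a single PortraitCraft record could look like, using only the field categories named in the abstract. Every field name and value here is an invented assumption, not the released schema.

```python
# Hypothetical record shape (assumed field names, not the actual dataset
# schema) covering the supervision levels listed in the abstract: global
# score, attribute labels, explanations, VQA pairs, generation caption.

record = {
    "image_id": "portrait_000001",
    "global_composition_score": 7.5,   # overall composition quality
    "attributes": {                    # the real set has 13 attributes
        "rule_of_thirds": 1,
        "headroom": 0,
    },
    "explanations": {
        "rule_of_thirds": "Subject's eyes sit near the upper-third line.",
    },
    "vqa": [
        {"q": "Is the subject centered?", "a": "No, offset to the left."},
    ],
    "generation_caption": "A half-body portrait following the rule of thirds.",
}

def is_valid(rec):
    """Minimal structural check over the assumed schema."""
    return (
        isinstance(rec["global_composition_score"], (int, float))
        and all(v in (0, 1) for v in rec["attributes"].values())
        and all({"q", "a"} <= set(pair) for pair in rec["vqa"])
    )
```

A schema like this would let the same record serve both benchmark tasks: the score/attributes/VQA fields for understanding, and the caption for composition-aware generation.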

Computer Vision 2604.04924
Relevance 75/100

Your Pre-trained Diffusion Model Secretly Knows Restoration


Sudarshan Rajagopalan, Vishal M. Patel

Core contribution: Reveals that pretrained diffusion models inherently possess restoration ability, which can be unlocked by directly learning prompt embeddings at the output of the text encoder, with no fine-tuning or extra control modules.
Method: 1. Shows that the restoration behavior of a pretrained diffusion model is unlocked by learning prompt embeddings at the text encoder output, not by text prompts or text-token embedding optimization. 2. Trains the prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. 3. Applies the lightweight learned prompts to the pretrained WAN video model and FLUX image model, converting them into high-performing restoration models.
Key finding: 1. Pretrained diffusion models inherently possess restoration ability, but it is largely inaccessible through text prompts. 2. The proposed method achieves competitive performance and generalization across diverse degradations while avoiding fine-tuning and dedicated control modules.
Original abstract:

Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or Control-Net style modules to leverage the pre-trained diffusion model's priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we introduce our lightweight learned prompts on the pre-trained WAN video model and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.
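The core recipe, freeze the model and optimize only a prompt embedding, can be sketched with a toy 1-D stand-in. The additive "model" and squared-error objective below are illustrative assumptions; the actual method optimizes embeddings at the text encoder output inside a diffusion bridge formulation.

```python
# Toy sketch (assumed, not the paper's code): the pretrained model is
# frozen and only a prompt vector is learned to steer its output toward
# the clean image.

def frozen_model(x, prompt):
    """Stand-in for the frozen diffusion model: output depends on the
    degraded input x and an additive learned prompt."""
    return [xi + p for xi, p in zip(x, prompt)]

def learn_prompt(degraded, clean, lr=0.5, iters=50):
    """Gradient descent on the prompt only; model weights never change."""
    prompt = [0.0] * len(degraded)
    for _ in range(iters):
        out = frozen_model(degraded, prompt)
        # gradient of the squared error (out - clean)^2 w.r.t. the prompt
        grad = [2 * (o - c) for o, c in zip(out, clean)]
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

# degraded image is clean minus a constant offset; the prompt learns it
prompt = learn_prompt(degraded=[0.2, 0.4, 0.6], clean=[0.5, 0.7, 0.9])
```

The design point mirrored here is parameter efficiency: only the prompt is trainable, so the pretrained priors stay intact.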

Computer Vision 2604.04859
Relevance 75/100

Unified Vector Floorplan Generation via Markup Representation


Kaede Shiohara, Toshihiko Yamasaki

Core contribution: Proposes Floorplan Markup Language (FML), a general representation that casts floorplan generation as a sequence prediction task, and develops FMLM, a transformer-based generative model supporting high-fidelity floorplan generation under diverse conditions.
Method: 1. Designs FML, a structured markup grammar that uniformly encodes floorplan information; 2. Recasts the generation task as next-token prediction; 3. Builds the FMLM model on a transformer architecture; 4. Supports multiple input conditions such as site boundaries, room adjacency graphs, and partial layouts.
Key finding: Experiments on the RPLAN dataset show that FMLM, as a single model, outperforms all existing task-specific state-of-the-art methods while generating functional, high-quality floorplans.
Original abstract:

Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, FMLM, capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.
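To illustrate how a markup representation turns floorplan generation into next-token prediction, here is a hypothetical FML-like encoding and tokenizer. The tag names, grammar, and coordinate format are invented for illustration and are not the paper's actual FML specification.

```python
import re

# Invented markup in the spirit of FML (assumed syntax, not the real
# grammar): boundary and rooms encoded as tagged polygon vertex lists.
floorplan_markup = (
    "<floorplan>"
    "<boundary>0,0 10,0 10,8 0,8</boundary>"
    "<room type='living'>0,0 6,0 6,8 0,8</room>"
    "<room type='bedroom'>6,0 10,0 10,8 6,8</room>"
    "</floorplan>"
)

def tokenize(markup):
    """Split the markup into the flat token stream a transformer would
    emit one token at a time (next-token prediction)."""
    return re.findall(r"<[^>]+>|[^<\s]+", markup)

tokens = tokenize(floorplan_markup)
```

Once floorplans are flat token sequences like this, conditioning on a boundary, an adjacency graph, or a partial layout reduces to prefixing the sequence, which is why a single model can cover all the tasks.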

🔥 Hacker News
HN ▲ 93  💬 8
Recommendation 90/100

Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon

by MediaSquirrel

A developer showcases a multimodal fine-tuning tool for Gemma 4, optimized for Apple Silicon, that supports streaming audio data from the cloud and integrates Whisper with the Gemma model.
HN ▲ 445  💬 310
Recommendation 85/100

System Card: Claude Mythos Preview [pdf]

by be7a

The post presents the system card PDF for Claude Mythos Preview, with links to related discussion, focusing on an assessment of its cybersecurity capabilities.
HN ▲ 360  💬 105
Recommendation 85/100

GLM-5.1: Towards Long-Horizon Tasks

by zixuanlimit

GLM-5.1 is a model upgrade aimed at long-horizon tasks, focusing on improved handling of complex tasks with long-term dependencies.
HN ▲ 25  💬 6
Recommendation 85/100

A whole boss fight in 256 bytes

by HellMood

The post shows a complete boss fight implemented in just 256 bytes of code, demonstrating extreme size-coding and creative programming.
HN ▲ 133  💬 42
Recommendation 85/100

Google open-sources experimental agent orchestration testbed Scion

by timbilt

Google open-sources Scion, an experimental agent orchestration testbed that provides a modular toolkit for building and testing multi-agent systems.
HN ▲ 101  💬 36
Recommendation 85/100

AI helps add 10k more photos to OldNYC

by evakhoury

AI helped add more than 10,000 historical photos to the OldNYC project, showcasing views of old New York.
HN ▲ 50  💬 8
Recommendation 85/100

Emotion Concepts and Their Function in a Large Language Model

by Anon84

The post explores emotion concepts in a large language model and how they function, analyzing how emotion representations shape the model's outputs and behavior.
HN ▲ 311  💬 155
Recommendation 85/100

Launch HN: Freestyle – Sandboxes for Coding Agents

by benswerd

Freestyle is a cloud sandbox environment built for AI coding agents, giving them access to full computer capabilities for development rather than limiting them to simple scripts or server apps.
🐙 GitHub Trending
Kotlin ⭐ 18,711  +899 today
Recommendation 85/100

google-ai-edge/gallery

A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.

The project showcases Google's on-device ML and generative AI use cases and lets users try the models locally; worth watching for its hands-on demos of cutting-edge AI running on-device.
C++ ⭐ 2,503  +522 today
Recommendation 85/100

google-ai-edge/LiteRT-LM

LiteRT-LM is a lightweight real-time language-model inference framework from Google AI Edge, optimized for edge devices; worth watching for its efficiency and low latency.
Python ⭐ 12,069  +339 today
Recommendation 70/100

HKUDS/DeepTutor

DeepTutor: Agent-Native Personalized Learning Assistant

DeepTutor is a deep-learning-based personalized learning assistant that provides customized support tailored to each student's needs; worth watching for its innovative use of AI for intelligent tutoring in education.
TypeScript ⭐ 19,500  +859 today
Recommendation 60/100

tobi/qmd

Mini CLI search engine for your docs, knowledge bases, meeting notes, whatever. Tracking current SOTA approaches while being all local.

qmd is a local mini CLI search engine for quickly searching documents, knowledge bases, meeting notes, and more; worth watching because it tracks current state-of-the-art approaches while running fully offline.
Python ⭐ 9,995  +656 today
Recommendation 50/100

elebumm/RedditVideoMakerBot

Create Reddit Videos with just ✨ one command ✨

An automation tool written in Python that generates Reddit videos from a single command; worth watching for making social-media content production fast and convenient.
TypeScript ⭐ 24,432  +1174 today
Recommendation 40/100

abhigyanpatwari/GitNexus

GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop in a GitHub repo or ZIP file, and get an interactive knowledge graph with a built-in Graph RAG Agent. Perfect for code exploration.

GitNexus is a client-side knowledge-graph generation tool that runs entirely in the browser: with no server required, it builds an interactive knowledge graph from a GitHub repo or ZIP file, includes a built-in Graph RAG agent, and is well suited to code exploration.
Python ⭐ 7,901  +663 today
Recommendation 30/100

NVIDIA/personaplex

PersonaPlex code.

NVIDIA/personaplex is a Python project for building and managing personalized AI personas; worth watching because it comes from NVIDIA and likely involves advanced AI techniques and personalized interaction.
Python ⭐ 3,828  +213 today
Recommendation 30/100

TheCraigHewitt/seomachine

A specialized Claude Code workspace for creating long-form, SEO-optimized blog content for any business. This system helps you research, write, analyze, and optimize content that ranks well and serves your target audience.

A Python-based SEO content generation tool focused on helping businesses create long-form, SEO-optimized blog content; worth watching for automating content research, writing, and optimization to improve search rankings.