📅 2026-04-13

Daily Picks · arXiv · Hacker News · GitHub Trending

Computer Vision: 5 Archive →

📄 arXiv Papers
Computer Vision 2604.09213
Relevance 85/100

SHIFT: Steering Hidden Intermediates in Flow Transformers

Nina Konovalova, Andrey Kuznetsov, Aibek Alanov

Core contribution: Proposes SHIFT, a framework that dynamically manipulates intermediate activations in DiT diffusion models to achieve concept removal and generation control without retraining.
Method: Inspired by activation steering in large language models, SHIFT learns steering vectors that are dynamically applied at selected layers and timesteps to suppress unwanted visual concepts or adjust generation style. The intervention is lightweight, requiring no changes to model parameters and no time-consuming fine-tuning.
Key findings: Experiments show SHIFT effectively removes specific concepts (e.g., objects, styles) from DiT-generated images while preserving the remaining content and image quality; the same mechanism can also add target objects or shift style, and generalizes across diverse prompts.
Original abstract:

Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired \emph{style domain} or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.
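The activation-steering idea at the heart of SHIFT can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the function, the toy shapes, and the scalar strength `alpha` are all assumptions here; the real method learns the steering vectors and schedules them over layers and timesteps.

```python
import numpy as np

def steer_activation(h, v, alpha, layer, t, target_layers, target_timesteps):
    """Hypothetical simplified steering step: shift an intermediate
    activation h along a steering vector v, but only at the selected
    layers and timesteps (a negative alpha suppresses the concept)."""
    if layer in target_layers and t in target_timesteps:
        return h + alpha * v
    return h

# Toy activation and a unit "concept direction"
h = np.ones(4)
v = np.array([1.0, 0.0, 0.0, 0.0])

# At a targeted (layer, timestep): the concept component shrinks
steered = steer_activation(h, v, alpha=-0.8, layer=3, t=10,
                           target_layers={3}, target_timesteps={10})

# Everywhere else the activation passes through unchanged
untouched = steer_activation(h, v, alpha=-0.8, layer=1, t=10,
                             target_layers={3}, target_timesteps={10})
```

The appeal of this family of methods is that the base model's weights are never touched; only a small additive edit is applied on the forward pass.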

Computer Vision 2604.09168
Relevance 85/100

ELT: Elastic Looped Transformers for Visual Generation

Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul et al. (6 authors)

Core contribution: Proposes the parameter-efficient Elastic Looped Transformer (ELT), which combines a weight-shared looped architecture with a self-distillation training method to drastically cut parameter count while maintaining high-quality visual generation.
Method: 1. Replaces the conventional stack of unique transformer layers with looped, weight-shared blocks, greatly reducing parameters. 2. Proposes Intra-Loop Self-Distillation (ILSD), a teacher-student setup that enforces consistency across model depths within a single training step. 3. Supports dynamic trade-offs between compute cost and generation quality, yielding a family of elastic models from one training run.
Key findings: 1. With a 4× reduction in parameters under iso-compute settings, reaches FID 2.0 on class-conditional ImageNet 256×256. 2. Achieves FVD 72.8 on UCF-101 video generation. 3. First to enable Any-Time inference from a single model, with inference-time compute adjustable on demand.
Original abstract:

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.
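The looped forward pass and the ILSD objective described above can be sketched as follows. This is a hedged toy version: the single `tanh` block, the fixed weight matrix, and the MSE distillation loss are placeholder assumptions standing in for the paper's transformer block and actual training objective.

```python
import numpy as np

def looped_forward(x, W, n_loops):
    """Apply one weight-shared block n_loops times (stands in for
    iterating a looped transformer block)."""
    for _ in range(n_loops):
        x = np.tanh(W @ x)
    return x

def ilsd_loss(x, W, max_loops, student_loops):
    """Intra-Loop Self-Distillation (toy): the student (fewer loops)
    is pulled toward the teacher (maximum training loops)."""
    teacher = looped_forward(x, W, max_loops)
    student = looped_forward(x, W, student_loops)
    return float(np.mean((student - teacher) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) / np.sqrt(8)
x = rng.normal(size=8)

full = ilsd_loss(x, W, max_loops=6, student_loops=6)     # teacher == student
partial = ilsd_loss(x, W, max_loops=6, student_loops=2)  # distillation target
```

Because every loop count shares the same weights, a model trained this way can stop after any number of loops at inference, which is what makes the Any-Time compute/quality trade-off possible.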

Computer Vision 2604.08836
Relevance 85/100

CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation

Sanyam Jain, Pragya Kandari, Manit Singhal, He Zhang, Soo Ye Kim

Core contribution: Proposes CatalogStitch, a set of automated dimension-adaptation and occlusion-restoration techniques that significantly reduce the manual intervention generative object compositing requires for catalog image generation.
Method: 1. A dimension-aware mask computation algorithm that automatically adapts the target region to products of different dimensions, with no manual mask edits. 2. An occlusion-aware hybrid restoration method that guarantees pixel-perfect preservation of occluding elements. 3. CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios.
Key findings: Experiments on three state-of-the-art compositing models (ObjectStitch, OmniPaint, InsertAnything) show the techniques consistently improve results across diverse catalog scenarios, turning generative compositing into a practical production tool.
Original abstract:

Generative object compositing methods have shown remarkable ability to seamlessly insert objects into scenes. However, when applied to real-world catalog image generation, these methods require tedious manual intervention: users must carefully adjust masks when product dimensions differ, and painstakingly restore occluded elements post-generation. We present CatalogStitch, a set of model-agnostic techniques that automate these corrections, enabling user-friendly content creation. Our dimension-aware mask computation algorithm automatically adapts the target region to accommodate products with different dimensions; users simply provide a product image and background, without manual mask adjustments. Our occlusion-aware hybrid restoration method guarantees pixel-perfect preservation of occluding elements, eliminating post-editing workflows. We additionally introduce CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios, together with supplementary PDF and HTML viewers. We evaluate our techniques with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), demonstrating consistent improvements across diverse catalog scenarios. By reducing manual intervention and automating tedious corrections, our approach transforms generative compositing into a practical, human-friendly tool for production catalog workflows.
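The dimension-aware mask idea can be illustrated with simple box arithmetic. This is a hypothetical sketch, not the paper's algorithm: it shows only one plausible policy, shrinking a user-provided target box to match the product's aspect ratio while keeping the box centered.

```python
def dimension_aware_box(box, product_w, product_h):
    """Fit the largest box with the product's aspect ratio inside the
    given target box (x, y, w, h), keeping the same center."""
    x, y, w, h = box
    target_ar = product_w / product_h
    if w / h > target_ar:                 # target box too wide: shrink width
        new_w, new_h = h * target_ar, h
    else:                                 # target box too tall: shrink height
        new_w, new_h = w, w / target_ar
    cx, cy = x + w / 2, y + h / 2
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)

# A wide 200x100 target region, but a tall 1:2 product:
# the box is narrowed to 50x100 and re-centered
adjusted = dimension_aware_box((0, 0, 200, 100), product_w=50, product_h=100)
```

The point of automating this step is that the user only supplies a product image and a background; the mask geometry follows from the product's dimensions rather than from manual edits.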

Computer Vision 2604.09531
Relevance 75/100

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu et al. (6 authors)

Core contribution: Proposes VisionFoundry, a task-targeted synthetic data generation framework that significantly improves vision-language models on visual perception tasks such as spatial understanding and viewpoint recognition.
Method: 1. A task-aware synthetic data pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts. 2. Images are synthesized with T2I models and checked for consistency by a proprietary VLM, with no reference images or human annotation anywhere in the loop. 3. Builds VisionFoundry-10K, a dataset of 10k image-question-answer triples spanning 10 tasks.
Key findings: 1. Models trained on VisionFoundry-10K show substantial gains on perception benchmarks: +7% accuracy on MMVP and +10% on CV-Bench-3D. 2. Broader capabilities are preserved, with favorable scaling as data size grows. 3. The results suggest task-targeted synthetic supervision is an effective way past VLM perception bottlenecks.
Original abstract:

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
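The pipeline's control flow (LLM proposes a QA pair plus a T2I prompt, T2I renders, a VLM verifies, only consistent triples are kept) can be sketched with stub callables. Everything below is a placeholder: the three injected functions stand in for real LLM, T2I, and VLM calls, whose APIs the paper does not specify.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    image: str      # placeholder for pixel data
    question: str
    answer: str

def build_dataset(task_name, n, propose, render, verify):
    """Generic pipeline skeleton: only the task name is given;
    propose/render/verify are the injected model calls."""
    kept = []
    while len(kept) < n:
        question, answer, t2i_prompt = propose(task_name)  # LLM step
        image = render(t2i_prompt)                         # T2I step
        if verify(image, question, answer):                # VLM consistency check
            kept.append(Triple(image, question, answer))
    return kept

# Toy stubs standing in for the LLM / T2I / VLM calls
propose = lambda task: (f"Which object is closer? ({task})", "the cup",
                        f"a photo illustrating {task}")
render = lambda prompt: f"<image:{prompt}>"
verify = lambda img, q, a: True

data = build_dataset("Depth Order", 3, propose, render, verify)
```

The verification gate is what makes the dataset self-cleaning: inconsistent generations are simply discarded and regenerated, so no human annotation is needed.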

Computer Vision 2604.09405
Relevance 75/100

EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure

Junyeong Ahn, Seojin Yoon, Sungyong Baik

Core contribution: Proposes EGLOCE, a training-free, inference-time concept erasure method that optimizes the latent under a dual-energy guidance framework for efficient, controllable removal of specific concepts.
Method: 1. A dual-objective framework: a repulsion energy steers the latent away from target concepts via gradient descent, while a retention energy preserves semantic alignment with the original prompt. 2. Operates entirely at inference time, with no changes to model weights. 3. Combines with energy-guided sampling for plug-and-play integration.
Key findings: 1. Significantly improves concept erasure over multiple baselines while maintaining image quality and prompt alignment. 2. Remains robust even under adversarial attacks. 3. The first to establish a dual-energy-guided sampling paradigm for safe, controllable image generation.
Original abstract:

As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts (mostly explicit content and many copyrighted characters or styles) has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degradation of unrelated concept fidelity, or depend on indirect inference-time adjustments that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by re-directing noisy latents during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. In contrast to previous approaches that either require error-prone modification of model weights or provide only weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.
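The dual-energy update described above is plain gradient descent on a combined objective. A toy sketch under stated assumptions: the quadratic repulsion energy 0.5*(z·c)^2 and retention energy 0.5*||z − p||^2 are stand-ins of my choosing; the paper defines its energies over diffusion latents and model outputs.

```python
import numpy as np

def dual_energy_step(z, c, p, lr=0.2, w_rep=1.0, w_ret=0.5):
    """One latent update: descend the repulsion energy 0.5*(z.c)^2
    (move away from the unit concept direction c) plus the retention
    energy 0.5*||z - p||^2 (stay close to the prompt latent p)."""
    grad_rep = np.dot(z, c) * c   # gradient of 0.5*(z.c)^2
    grad_ret = z - p              # gradient of 0.5*||z - p||^2
    return z - lr * (w_rep * grad_rep + w_ret * grad_ret)

c = np.array([1.0, 0.0])          # unwanted-concept direction
p = np.array([0.0, 1.0])          # prompt-aligned latent
z = np.array([1.0, 1.0])          # noisy latent leaning into the concept

for _ in range(50):
    z = dual_energy_step(z, c, p)
```

In this toy setup the two gradients act on orthogonal components, so the concept component decays geometrically while the prompt-aligned component is left intact; in the real method both energies act on the same high-dimensional latent and must be balanced.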

🔥 Hacker News
HN ▲ 42  💬 6
Recommendation 85/100

(AMD) Build AI Agents That Run Locally

by galaxyLogic

AMD releases Gaia, a tool for running AI agents locally, letting developers build AI applications with no cloud dependency.
HN ▲ 390  💬 129
Recommendation 85/100

Servo is now available on crates.io

by ffin

The Servo browser engine's first release, 0.1.0, is now published on crates.io, a new milestone for the modern Rust-based rendering engine.
HN ▲ 10  💬 4
Recommendation 85/100

N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

by mufeedvh

The post introduces N-Day-Bench, a benchmark testing whether frontier LLMs can find known security vulnerabilities in real codebases, with monthly-refreshed test cases to avoid data contamination.
HN ▲ 10  💬 1
Recommendation 85/100

What We Learned Building a Rust Runtime for TypeScript

by vinhnx

The post shares the team's experience and key lessons from building a Rust runtime for TypeScript.
HN ▲ 7  💬 2
Recommendation 85/100

Show HN: Continual Learning with .md

by wenhan_zhou

The post presents a no-code approach to continual learning for LLMs based on Markdown files, organized as a semantic file system for easy retrieval; the author claims it beats existing methods and is looking for feedback.
HN ▲ 234  💬 101
Recommendation 85/100

I ran Gemma 4 as a local model in Codex CLI

by dvaughan

The author ran Gemma 4 as a local model in Codex CLI and shares the setup and experience.
HN ▲ 573  💬 137
Recommendation 85/100

Exploiting the most prominent AI agent benchmarks

by Anon84

The post examines how the most prominent AI agent benchmarks can be exploited, exposing credibility problems in current agent benchmarking.
HN ▲ 117  💬 30
Recommendation 85/100

Doom, Played over Curl

by creaktive

Doom, the classic shooter, played over the curl command-line tool: a creative mash-up of network protocols and retro gaming.
🐙 GitHub Trending
Python ⭐ 76,641  +11297 today
Recommendation 70/100

NousResearch/hermes-agent

The agent that grows with you

Hermes-Agent is an agent project that grows with its user; worth watching for the flexibility and adaptability to meet evolving development needs.
TypeScript ⭐ 11,022  +1724 today
Recommendation 65/100

multica-ai/multica

The open-source managed agents platform. Turn coding agents into real teammates — assign tasks, track progress, compound skills.

Multica is an open-source managed-agents platform that turns coding agents into real teammates, with task assignment, progress tracking, and skill compounding; worth watching for the boost to team collaboration.
TypeScript ⭐ 17,581  +679 today
Recommendation 60/100

coleam00/Archon

The first open-source harness builder for AI coding. Make AI coding deterministic and repeatable.

Archon is the first open-source harness builder for AI coding, aiming to make AI coding deterministic and repeatable; worth watching for tackling the unpredictability of AI-generated code.
TypeScript ⭐ 16,488  +683 today
Recommendation 60/100

snarktank/ralph

Ralph is an autonomous AI agent loop that runs repeatedly until all PRD items are complete.

Ralph is an autonomous AI agent loop that keeps running until every PRD (product requirements document) item is complete; worth watching for automating routine development work.
Jupyter Notebook ⭐ 39,444  +1127 today
Recommendation 60/100

anthropics/claude-cookbooks

A collection of notebooks/recipes showcasing some fun and effective ways of using Claude.

A collection of Claude usage examples from Anthropic: Jupyter notebooks demonstrating fun and effective applications; worth watching because it helps developers quickly pick up practical Claude techniques.
TypeScript ⭐ 16,243  +652 today
Recommendation 60/100

jamiepine/voicebox

The open-source voice synthesis studio

Voicebox is an open-source voice synthesis studio; worth watching for its flexible tools for creating and customizing high-quality synthesized speech.
Python ⭐ 106,836  +2811 today
Recommendation 50/100

microsoft/markitdown

Python tool for converting files and office documents to Markdown.

Microsoft's markitdown is a Python tool that converts files and office documents to Markdown; worth watching because it streamlines document format conversion.
JavaScript ⭐ 52,052  +630 today
Recommendation 50/100

gsd-build/get-shit-done

A light-weight and powerful meta-prompting, context engineering and spec-driven development system for Claude Code by TÂCHES.

get-shit-done is a lightweight yet powerful meta-prompting and context-engineering system for Claude Code; worth watching because its spec-driven development approach boosts efficiency.