📅 2026-04-08

Daily Picks · arXiv · Hacker News · GitHub Trending

Computer Vision: 5 · Archive →

📄 arXiv Papers
Computer Vision 2604.06079
Relevance 85/100

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning


Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin et al. (9 authors)

Core contribution: Proposes a closed-loop framework comprising the high-quality SciTikZ-230K dataset and the multifaceted SciTikZ-Bench benchmark, and introduces a Dual Self-Consistency Reinforcement Learning optimization paradigm that substantially improves scientific graphics program synthesis.
Method: 1. Builds the large-scale, high-quality SciTikZ-230K dataset via an Execution-Centric Data Engine, covering 11 scientific disciplines. 2. Designs the SciTikZ-Bench benchmark to evaluate both visual fidelity and structural logic. 3. Proposes a Dual Self-Consistency Reinforcement Learning optimization paradigm that uses Round-Trip Verification to penalize degenerate code and boost self-consistency.
Key findings: The trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants such as Gemini-2.5-Pro and massive models such as Qwen3-VL-235B-A22B-Instruct.
Original abstract

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.
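The Round-Trip Verification idea above (re-render the generated program and penalize code that fails to execute or collapses to a trivial output) can be sketched as a reward function. The helpers `compile_fn`, `similarity_fn`, the `min_chars` cutoff, and the toy stand-ins below are hypothetical illustrations, not the authors' implementation:

```python
def round_trip_reward(code, target, compile_fn, similarity_fn, min_chars=20):
    """Hypothetical round-trip verification reward: penalize degenerate or
    non-compiling programs, otherwise score visual agreement in [0, 1]."""
    if len(code.strip()) < min_chars:       # degenerate: trivially short program
        return -1.0
    rendered = compile_fn(code)             # None signals a compile failure
    if rendered is None:
        return -1.0
    return similarity_fn(rendered, target)  # higher = more self-consistent

# Toy stand-ins: "rendering" splits into tokens, similarity is Jaccard overlap.
def toy_compile(code):
    return None if "error" in code else set(code.split())

def toy_similarity(a, b):
    return len(a & b) / len(a | b)
```

In the paper's setting, `compile_fn` would run the TikZ toolchain and `similarity_fn` would compare the rendered image against the input; the toy stand-ins only exercise the control flow.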

Computer Vision 2604.06074
Relevance 85/100

Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors


Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou

Core contribution: Proposes Graph-PiT, a framework that explicitly models the structural dependencies among visual components via a graph prior, markedly improving the structural coherence and controllability of multi-part image synthesis.
Method: 1. Represents visual parts as nodes and their spatial-semantic relationships as edges to build a graph structure; 2. Uses a Hierarchical Graph Neural Network (HGNN) module for bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes; 3. Introduces a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings.
Key findings: 1. In quantitative experiments on synthetic domains (characters, products, indoor layouts, and jigsaws), Graph-PiT clearly improves structural coherence over vanilla PiT; 2. Qualitative experiments show the method transfers to real web images; 3. Ablations confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints.
Original abstract

Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis. The code is available at https://github.com/wolf-bailang/Graph-PiT.
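The graph Laplacian smoothness loss mentioned above has a standard closed form: with symmetric adjacency A and node embeddings X, tr(X^T L X) with L = D - A equals (1/2) * sum_ij A_ij * ||x_i - x_j||^2, so embeddings of adjacent parts are pulled together. A minimal NumPy sketch of that textbook formulation (not the authors' code; the loss weight and graph construction are their own):

```python
import numpy as np

def laplacian_smoothness_loss(X, A):
    """tr(X^T L X) with L = D - A, where D is the degree matrix.
    Equals 0.5 * sum_ij A_ij * ||x_i - x_j||^2.
    X: (n, d) node embeddings, A: (n, n) symmetric adjacency."""
    L = np.diag(A.sum(axis=1)) - A   # combinatorial graph Laplacian
    return float(np.trace(X.T @ L @ X))
```

Two adjacent nodes with identical embeddings contribute zero, so minimizing this term smooths embeddings along the graph's edges.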

Computer Vision 2604.05853
Relevance 85/100

Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models


Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou et al. (8 authors)

Core contribution: Identifies and formalizes a new inscriptive jailbreak attack that bypasses existing safety mechanisms by generating visually benign images containing harmful text, and develops the Etch attack framework to carry it out efficiently.
Method: 1. Decomposes the adversarial prompt into three orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding; 2. Iteratively optimizes each layer's sub-problem in a zero-order loop; 3. Uses a vision-language model to critique each generated image, localize failures to specific layers, and apply targeted revisions.
Key findings: 1. Across 7 models and 2 benchmarks, Etch achieves an average attack success rate of 65.57% (peaking at 91%), significantly outperforming baseline methods; 2. Reveals a critical blind spot in current T2I safety alignment around typography-aware defenses.
Original abstract

Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.

Computer Vision 2604.05761
Relevance 85/100

Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision


Amadou S. Sangare, Adrien Maglo, Mohamed Chaouch, Bertrand Luvison

Core contribution: Proposes $x_0$-supervision (direct supervision on the clean target image), or an equivalent re-weighting of the diffusion loss, which markedly accelerates training convergence of controllable diffusion models while also improving the visual quality and conditioning accuracy of generated images.
Method: 1. Analyzes the denoising dynamics of controllable diffusion models and shows that the conventional training objective converges slowly; 2. Proposes $x_0$-supervision, which supervises the clean target image directly; 3. Derives an equivalent re-weighting of the diffusion loss that improves training efficiency.
Key findings: 1. Experiments across multiple control settings show up to 2x faster convergence under the new mAUCC metric; 2. Visual quality and conditioning accuracy both improve; 3. The code is open-source.
Original abstract

Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision
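The claimed equivalence between $x_0$-supervision and a re-weighted diffusion loss can be checked directly on the standard DDPM parameterization $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\epsilon$: recovering $x_0$ from the predicted noise scales the $\epsilon$-error by $(1-\bar\alpha)/\bar\alpha$. A sketch under that standard parameterization (the paper addresses diffusion/flow models generally, so its exact weighting may differ):

```python
import numpy as np

def x0_supervision_loss(x0, eps, eps_pred, a_bar):
    """MSE on the clean image recovered from the predicted noise."""
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps      # forward process
    x0_hat = (x_t - np.sqrt(1 - a_bar) * eps_pred) / np.sqrt(a_bar)
    return np.mean((x0 - x0_hat) ** 2)

def reweighted_eps_loss(eps, eps_pred, a_bar):
    """Standard eps-MSE re-weighted by (1 - a_bar) / a_bar."""
    return (1 - a_bar) / a_bar * np.mean((eps - eps_pred) ** 2)
```

Since $\hat{x}_0 - x_0 = \sqrt{(1-\bar\alpha)/\bar\alpha}\,(\epsilon - \epsilon_\theta)$, the two losses agree exactly; at low $\bar\alpha$ (high noise) the weight is large, which is one way to read the faster convergence.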

Computer Vision 2604.05906
Relevance 75/100

Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation


Jungwon Park, Jungmin Ko, Dongnam Byun, Wonjong Rhee

Core contribution: Improves the visual interpretability of text-to-image generative models by selectively aggregating the attention heads most relevant to a target concept.
Method: Analyzes the distinct characteristics of individual attention heads and proposes a selective aggregation scheme that combines only the attention maps from the heads most relevant to the target concept, scoring each head's relevance and aggregating the most representative maps.
Key findings: Compared with the diffusion-based segmentation method DAAM, selective aggregation achieves higher mean IoU scores. The most relevant heads capture concept-specific features more accurately, and selective aggregation helps diagnose prompt misinterpretations. These findings suggest attention head selection as a promising direction for improving the interpretability and controllability of T2I generation.
Original abstract

Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.
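Selective aggregation boils down to scoring each head's relevance to the target concept and averaging only the top-k maps. The IoU-based scoring below is one plausible relevance measure for illustration, not necessarily the one used in the paper:

```python
import numpy as np

def head_relevance(head_maps, ref_mask, thresh=0.5):
    """Score each head by IoU between its thresholded attention map and a
    boolean reference mask for the concept. head_maps: (H, h, w)."""
    binary = head_maps >= thresh * head_maps.max(axis=(1, 2), keepdims=True)
    inter = (binary & ref_mask).sum(axis=(1, 2))
    union = (binary | ref_mask).sum(axis=(1, 2))
    return inter / np.maximum(union, 1)     # avoid division by zero

def selective_aggregate(head_maps, scores, k=2):
    """Average only the k attention maps whose heads score highest."""
    top = np.argsort(scores)[-k:]
    return head_maps[top].mean(axis=0)
```

Heads whose thresholded maps overlap the concept's reference region score higher and dominate the aggregate, while irrelevant heads are excluded instead of diluting it.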

🔥 Hacker News
HN ▲ 241  💬 46
Recommendation 90/100

MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU

by chrsw

This post introduces MegaTrain, a technique that enables full-precision training of LLMs with 100B+ parameters on a single GPU, breaking through a hardware bottleneck.
HN ▲ 132  💬 80
Recommendation 90/100

Xilem – An experimental Rust native UI framework

by Levitating

Xilem is an experimental Rust-native UI framework from the linebender team, exploring efficient and flexible approaches to building user interfaces.
HN ▲ 1057  💬 196
Recommendation 85/100

I ported Mac OS X to the Nintendo Wii

by blkhp19

A developer successfully ported Mac OS X to the Nintendo Wii, a striking feat of cross-platform porting.
HN ▲ 221  💬 262
Recommendation 85/100

Muse Spark: Scaling towards personal superintelligence

by chabons

Meta launches the Muse Spark project, aiming to scale toward personal superintelligence with AI.
HN ▲ 323  💬 363
Recommendation 85/100

ML promises to be profoundly weird

by pabs3

This post explores the strange future of machine learning, arguing it may bring unpredictable falsehoods and bizarre phenomena.
HN ▲ 32  💬 32
Recommendation 85/100

Show HN: TUI-use: Let AI agents control interactive terminal programs

by dreamsome

This Show HN presents TUI-use, a tool that lets AI agents control interactive terminal programs; the highlight is automating terminal UIs through AI.
HN ▲ 133  💬 62
Recommendation 85/100

Claude Managed Agents

by adocomplete

Claude launches a managed-agents service that helps businesses automate tasks and makes AI assistants more productive.
HN ▲ 65  💬 28
Recommendation 85/100

The AI Great Leap Forward

by jodah

The article examines the rapid advance of AI and its far-reaching social and economic impact, offering reflections and warnings about where AI is headed.
🐙 GitHub Trending
Kotlin ⭐ 19,468  +853 today
Recommendation 85/100

google-ai-edge/gallery

A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.

The project showcases on-device ML and generative-AI use cases that run locally, letting users try the models directly; worth watching as a look at Google's frontier work on edge AI.
C++ ⭐ 2,959  +500 today
Recommendation 85/100

google-ai-edge/LiteRT-LM

LiteRT-LM is Google AI Edge's lightweight framework for on-device language-model inference, optimized for edge devices and notable for its efficiency and low latency.
Python ⭐ 4,060  +67 today
Recommendation 60/100

newton-physics/newton

An open-source, GPU-accelerated physics simulation engine built upon NVIDIA Warp, specifically targeting roboticists and simulation researchers.

Newton is an open-source, GPU-accelerated physics simulation engine built on NVIDIA Warp, designed for robotics and simulation research; notable for its performance and targeted optimizations.
Python ⭐ 10,445  +572 today
Recommendation 50/100

elebumm/RedditVideoMakerBot

Create Reddit videos with just ✨ one command ✨

A Python automation tool that generates Reddit videos from a single command, well suited to content creators producing social-media videos quickly.
TypeScript ⭐ 25,255  +981 today
Recommendation 40/100

abhigyanpatwari/GitNexus

GitNexus: The Zero-Server Code Intelligence Engine - a client-side knowledge graph creator that runs entirely in your browser. Drop in a GitHub repo or ZIP file, and get an interactive knowledge graph with a built-in Graph RAG agent. Perfect for code exploration.

GitNexus is a client-side knowledge-graph generator that runs entirely in the browser: drop in a GitHub repo or ZIP file to get an interactive knowledge graph with a built-in Graph RAG agent, well suited to code exploration.
Python ⭐ 50,647  +123 today
Recommendation 40/100

virattt/ai-hedge-fund

An AI Hedge Fund Team

A Python implementation of an AI hedge-fund team that uses AI for quantitative trading and investment decisions; notable for pairing frontier AI with hands-on finance.
Python ⭐ 4,527  +645 today
Recommendation 30/100

TheCraigHewitt/seomachine

A specialized Claude Code workspace for creating long-form, SEO-optimized blog content for any business. This system helps you research, write, analyze, and optimize content that ranks well and serves your target audience.

A Claude-based SEO tool focused on helping businesses generate high-quality, SEO-friendly long-form blog content that improves search rankings and serves the target audience.
Python ⭐ 8,383  +589 today
Recommendation 30/100

NVIDIA/personaplex

PersonaPlex code.

NVIDIA/personaplex is a Python project from NVIDIA for building and managing personalized AI personas; notable for its work on personalized AI interaction.