📅 2026-04-06

Daily Picks · arXiv · Hacker News · GitHub Trending

Computer Vision: 5 papers

📄 arXiv Papers
Computer Vision 2604.02817
Relevance 65/100

MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu et al. (6 authors)

Core contribution: Proposes MMPhysVideo, the first framework to scale physical plausibility in video generation via joint multimodal modeling, together with MMPhysPipe, an annotation pipeline for constructing physics-rich multimodal datasets.
Method: 1. Recasts perceptual cues (semantics, geometry, spatio-temporal trajectories) into a unified pseudo-RGB format, letting video diffusion models learn complex physical dynamics directly. 2. Designs a Bidirectionally Controlled Teacher (BCT) architecture whose parallel branches decouple RGB and perception processing, using zero-initialized control links to gradually learn pixel-wise consistency. 3. Distills the teacher's physical prior into a single-stream student model via representation alignment for inference efficiency. 4. Builds a VLM-guided automatic annotation pipeline based on a chain-of-visual-evidence rule to precisely extract multi-granular physical information.
Key findings: 1. Significantly improves physical plausibility and visual quality across multiple benchmarks at no extra inference cost. 2. Achieves state-of-the-art performance compared with existing methods. 3. MMPhysPipe effectively constructs physics-rich multimodal datasets.
Original abstract

Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.
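The "zero-initialized control links" in the abstract follow a common ControlNet-style trick: the control branch feeds the main branch through a projection whose weights start at zero, so at initialization the fused model behaves exactly like the base RGB branch and only gradually learns to use the perception signal. A minimal sketch in plain Python (function names are illustrative, not from the paper):

```python
def zero_init_link(control_feat, weight=None):
    """Project control-branch features through a linear link.

    The weight matrix defaults to all zeros, so at the start of
    training the link contributes nothing to the fused output.
    """
    n = len(control_feat)
    if weight is None:
        weight = [[0.0] * n for _ in range(n)]  # zero-initialized
    return [sum(w * c for w, c in zip(row, control_feat)) for row in weight]

def fuse(rgb_feat, control_feat, weight=None):
    """Add the (initially zero) control signal onto the RGB branch."""
    link = zero_init_link(control_feat, weight)
    return [r + l for r, l in zip(rgb_feat, link)]

# At initialization the fused output equals the RGB branch exactly.
print(fuse([1.0, 2.0], [5.0, 7.0]))  # → [1.0, 2.0]
```

As the link weights move away from zero during training, the perception branch starts to steer the RGB features without ever having destabilized the pretrained model at step zero.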

Computer Vision 2604.02799
Relevance 65/100

UNICA: A Unified Neural Framework for Controllable 3D Avatars

Jiahe Zhu, Xinyao Wang, Yiyu Zhuang, Yanwen Wang, Jing Tian et al. (7 authors)

Core contribution: Proposes the first neural framework to unify motion planning, rigging, physical simulation, and rendering, generating controllable 3D avatars with a single model.
Method: 1. An action-conditioned diffusion model operating on 2D position maps generates the next frame of geometry. 2. A point transformer maps the geometry to a 3D Gaussian Splatting representation. 3. Captures hair and clothing dynamics without manual physical simulation. 4. Supports extra-long autoregressive generation under game-style keyboard control.
Key findings: 1. Achieves end-to-end high-fidelity free-view rendering. 2. Automatically handles dynamic effects that conventional pipelines must solve in separate stages. 3. Supports long continuous motion generation while maintaining output quality.
Original abstract

Controllable 3D human avatars have found widespread applications in 3D games, the metaverse, and AR/VR scenarios. The conventional approach to creating such a 3D avatar requires a lengthy, intricate pipeline encompassing appearance modeling, motion planning, rigging, and physical simulation. In this paper, we introduce UNICA (UNIfied neural Controllable Avatar), a skeleton-free generative model that unifies all avatar control components into a single neural framework. Given keyboard inputs akin to video game controls, UNICA generates the next frame of a 3D avatar's geometry through an action-conditioned diffusion model operating on 2D position maps. A point transformer then maps the resulting geometry to 3D Gaussian Splatting for high-fidelity free-view rendering. Our approach naturally captures hair and loose clothing dynamics without manually designed physical simulation, and supports extra-long autoregressive generation. To the best of our knowledge, UNICA is the first model to unify the workflow of "motion planning, rigging, physical simulation, and rendering". Code is released at https://github.com/zjh21/UNICA.
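The keyboard-driven generation described above is a plain autoregressive loop: at each step the model takes the previous frame's 2D position map plus an action and predicts the next frame. A schematic sketch (the `step_fn` interface and the dummy model are assumptions for illustration, not UNICA's API):

```python
def autoregressive_rollout(initial_map, actions, step_fn):
    """Roll out frames one at a time, conditioning each step on the
    previous frame's state and the current keyboard action."""
    frames = [initial_map]
    for action in actions:
        frames.append(step_fn(frames[-1], action))
    return frames

# Dummy step function standing in for the action-conditioned diffusion
# model; here "state" is just a 2D position instead of a position map.
moves = {"w": (0, 1), "s": (0, -1), "a": (-1, 0), "d": (1, 0)}
def dummy_step(pos, action):
    dx, dy = moves[action]
    return (pos[0] + dx, pos[1] + dy)

print(autoregressive_rollout((0, 0), ["w", "w", "d"], dummy_step))
# → [(0, 0), (0, 1), (0, 2), (1, 2)]
```

The "extra-long" claim hinges on this loop never needing a fixed horizon: each step only conditions on the previous frame and the live control input.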

Computer Vision 2604.02787
Relevance 65/100

LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers

Shreshth Saini, Hakan Gedik, Neil Birkbeck, Yilin Wang, Balu Adsumilli et al. (6 authors)

Core contribution: Proposes LumaFlux, the first physically and perceptually guided diffusion transformer (DiT) for standard dynamic range (SDR) to high dynamic range (HDR) reconstruction, with novel module designs that improve the quality and stability of HDR reconstruction.
Method: LumaFlux introduces three key modules: (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning on vision-encoder features; (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. A lightweight rational-quadratic spline decoder additionally reconstructs smooth, interpretable tone fields.
Key findings: LumaFlux outperforms existing state-of-the-art methods across benchmarks, achieving superior luminance reconstruction and perceptual color fidelity with only a small number of additional parameters. The work also builds the first large-scale SDR-HDR training corpus and a new evaluation benchmark, providing reliable data support for follow-up research.
Original abstract

The rapid adoption of HDR-capable devices has created a pressing need to convert the 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR). Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaFlux, a first physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction by adapting a large pretrained DiT. Our LumaFlux introduces (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning from vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. Finally, a lightweight Rational-Quadratic Spline decoder reconstructs smooth, interpretable tone fields for highlight and exposure expansion, enhancing the output of the VAE decoder to generate HDR. To enable robust HDR learning, we curate the first large-scale SDR-HDR training corpus. For fair and reproducible comparison, we further establish a new evaluation benchmark, comprising HDR references and corresponding expert-graded SDR versions. Across benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving superior luminance reconstruction and perceptual color fidelity with minimal additional parameters.
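The FiLM conditioning used by the PCM layer is a simple feature-wise affine transform: a conditioning network predicts a per-channel scale (gamma) and shift (beta) that modulate the target features. A minimal channel-wise sketch in plain Python (the values are made up; in LumaFlux the conditioning signal would come from the vision encoder):

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel
    using parameters predicted from a conditioning signal."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

# gamma = 1, beta = 0 is the identity; other values reweight channels.
print(film([0.5, -1.0, 2.0], gamma=[1.0, 0.0, 2.0], beta=[0.0, 0.3, -1.0]))
# → [0.5, 0.3, 3.0]
```

Because FiLM only rescales and shifts existing channels, it can steer chroma and texture statistics without adding cross-channel mixing, which is why it is a popular lightweight conditioning mechanism.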

Computer Vision 2604.02753
Relevance 65/100

DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan et al. (15 authors)

Core contribution: Proposes DeCo-DETR, a framework that decouples semantic cognition from object detection, significantly improving the inference efficiency and performance of open-vocabulary object detection.
Method: 1. Uses pre-trained LVLMs to generate region-level descriptions and build a hierarchical semantic prototype space, avoiding reliance on a text encoder at inference time. 2. Adopts a decoupled training strategy that separates semantic alignment and object detection into parallel optimization tasks. 3. Aligns semantic representations via CLIP for efficient, reusable vision-language association.
Key findings: On standard OVOD benchmarks, DeCo-DETR significantly improves inference efficiency while maintaining competitive zero-shot detection performance, validating the decoupled-cognition strategy for scalable open-vocabulary detection systems.
Original abstract

Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
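The efficiency argument above boils down to caching: class embeddings are computed once, offline, and stored as prototypes, so at inference each region embedding is matched against the cache by cosine similarity with no text encoder in the loop. A toy sketch of that matching step (embeddings and class names are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_emb, prototypes):
    """Match a detected region against cached class prototypes
    (precomputed offline, e.g. from CLIP-aligned region descriptions)."""
    return max(prototypes, key=lambda name: cosine(region_emb, prototypes[name]))

prototypes = {"cat": [0.9, 0.1], "bicycle": [0.1, 0.9]}
print(classify_region([0.8, 0.2], prototypes))  # → cat
```

The open-vocabulary property is preserved because extending the detector to new classes only means appending new prototype vectors to the cache, not retraining or re-running a text encoder per image.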

Computer Vision 2604.02736
Relevance 65/100

THOM: Generating Physically Plausible Hand-Object Meshes From Text

Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim

Core contribution: Proposes THOM, a training-free framework that generates highly realistic, physically plausible 3D hand-object interaction meshes without requiring a template object mesh.
Method: THOM uses a two-stage pipeline: it first generates hand and object Gaussian representations, then performs physics-based interaction optimization. A novel mesh extraction method and vertex-to-Gaussian mapping enable topology-aware regularization. Vision-language model (VLM) guided translation refinement and contact-aware optimization further improve the physical plausibility of the interactions.
Key findings: Experiments show that THOM outperforms existing state-of-the-art methods in text alignment, visual realism, and interaction plausibility.
Original abstract

The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. Nevertheless, the ill-posed problem of mesh extraction from text-generated Gaussians, and physics-based optimization on the erroneous meshes pose challenges. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions by VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.
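Contact-aware optimization of the kind the abstract mentions typically combines a penetration penalty (hand vertices with negative signed distance to the object surface) with an attraction term pulling candidate contact vertices within a small margin of the surface. A schematic objective, not THOM's actual loss; the margin value is made up:

```python
def contact_loss(signed_dists, margin=0.005):
    """Toy contact objective over per-vertex signed distances to the
    object surface (negative = penetration, units in meters).

    - penetration: squared penalty on any vertex inside the object
    - attraction: linear pull on vertices farther than `margin` away
    """
    penetration = sum(min(d, 0.0) ** 2 for d in signed_dists)
    attraction = sum(max(d - margin, 0.0) for d in signed_dists)
    return penetration + attraction

# One penetrating vertex, one in contact, one slightly too far away.
print(round(contact_loss([-0.01, 0.002, 0.02]), 6))  # → 0.0151
```

Minimizing such a loss over the hand pose simultaneously pushes fingertips out of the object and pulls them onto its surface, which is the essence of "contact-aware" refinement.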

🔥 Hacker News
HN ▲ 830  💬 126
Recommendation 90/100

Show HN: I built a tiny LLM to demystify how language models work

by armanified

The author built a small language model of roughly 9M parameters with a basic Transformer architecture and 60k synthetic conversation examples, in just 130 lines of PyTorch; it trains in about 5 minutes on a free Colab T4. The model comes with some amusing answers (it thinks the meaning of life is food), and users are encouraged to customize its persona.
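As a sanity check on a figure like ~9M parameters, a decoder-only Transformer's count is dominated by the embedding table plus, per layer, four attention projections and two MLP projections. A rough estimator (the hyperparameters below are illustrative guesses, not the post's actual config; biases and layer norms are ignored):

```python
def transformer_params(vocab_size, d_model, n_layers, d_ff):
    """Rough decoder-only Transformer parameter count:
    token embeddings + per-layer attention (Q, K, V, output
    projections) and MLP (up/down projections)."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model
    mlp = 2 * d_model * d_ff
    return embedding + n_layers * (attention + mlp)

# An illustrative small config lands in the single-digit millions.
print(transformer_params(vocab_size=8192, d_model=256, n_layers=6, d_ff=1024))
# → 6815744
```

Playing with `d_model` and `n_layers` in this estimator makes it easy to see why hobby-scale models cluster in the 5-15M range on free GPUs.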
HN ▲ 165  💬 95
Recommendation 85/100

Launch HN: Freestyle – Sandboxes for Coding Agents

by benswerd

Freestyle is a cloud platform designed for AI coding agents, providing sandboxed environments that let agents use the full capabilities of a computer rather than just simple scripts or server apps.
HN ▲ 7  💬 2
Recommendation 85/100

Show HN: Hippo, biologically inspired memory for AI agents

by kitfunso

The post introduces Hippo, a biologically inspired memory system designed for AI agents, which aims to mimic biological memory mechanisms to improve agents' learning and recall.
HN ▲ 12  💬 2
Recommendation 85/100

Show HN: TTF-DOOM – A raycaster running inside TrueType font hinting

by 4RH1T3CT0R

The post showcases a raycasting engine running inside the TrueType font hinting virtual machine, using font outlines and bytecode to achieve Wolfenstein-style 3D visuals.
HN ▲ 40  💬 5
Recommendation 85/100

Reducto releases Deep Extract

by raunakchowdhuri

Reducto releases Deep Extract, a new deep-extraction agent for handling complex data extraction tasks efficiently.
HN ▲ 257  💬 29
Recommendation 85/100

Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

by karimf

A demonstration of real-time AI (audio/video in, voice out) running Gemma E2B on an M3 Pro chip.
HN ▲ 3  💬 0
Recommendation 85/100

Show HN: Meta-agent: self-improving agent harnesses from live traces

by essamsleiman

Meta-agent is an open-source library that continuously and automatically improves AI agents from live production traces; in one test it lifted accuracy from 67% to 87%.
HN ▲ 141  💬 20
Recommendation 85/100

Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud

by ikessler

Gemma Gem is a Chrome extension that embeds Google's Gemma 4 (2B) model in the browser, with no API keys and no cloud. It can interact with web pages directly (reading content, clicking elements, etc.), ships with a small chat interface, and supports chain-of-thought reasoning; it handles simple page actions well but can be unstable on multi-step tasks.
🐙 GitHub Trending
C++ ⭐ 101,987  +318 today
Recommendation 90/100

ggml-org/llama.cpp

LLM inference in C/C++

llama.cpp is an efficient, lightweight LLM inference framework implemented in C/C++. Worth watching because it runs efficiently on resource-constrained devices, enabling local LLM deployment without heavy framework dependencies.
Kotlin ⭐ 17,813  +1109 today
Recommendation 85/100

google-ai-edge/gallery

A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.

The project showcases Google's on-device machine learning and generative AI use cases, letting users try models locally; worth watching for the hands-on experience it offers with cutting-edge AI in practice.
Rust ⭐ 38,048  +1514 today
Recommendation 85/100

block/goose

An open source, extensible AI agent that goes beyond code suggestions: install, execute, edit, and test with any LLM.

Goose is an open-source, extensible AI agent that goes beyond code suggestions: it can install, execute, edit, and test with any LLM. Worth watching because it raises AI-assisted development to a more fully automated level.
C++ ⭐ 1,973  +487 today
Recommendation 85/100

google-ai-edge/LiteRT-LM

LiteRT-LM is a lightweight language-model inference framework from Google AI Edge, focused on running large language models efficiently on edge devices. Worth watching because it brings high-performance model deployment to resource-constrained devices and advances AI in edge-computing scenarios.
Go ⭐ 167,647  +263 today
Recommendation 85/100

ollama/ollama

Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.

Ollama is a tool for quickly deploying and using open models such as Kimi-K2.5, GLM-5, MiniMax, and DeepSeek; worth watching because it simplifies running these advanced models locally.
Python ⭐ 27,957  +1721 today
Recommendation 70/100

NousResearch/hermes-agent

The agent that grows with you

Hermes-Agent is an agent that grows with the user's needs; worth watching for the flexibility and adaptability to serve evolving AI use cases.
TypeScript ⭐ 18,679  +526 today
Recommendation 60/100

tobi/qmd

Mini CLI search engine for your docs, knowledge bases, meeting notes, whatever. Tracks current SOTA approaches while being all local.

qmd is a local mini CLI search engine for docs, knowledge bases, and meeting notes; it tracks current state-of-the-art approaches while running entirely locally, which keeps it both efficient and private.
TypeScript ⭐ 23,883  +1823 today
Recommendation 45/100

siddharthvaddem/openscreen

Create stunning demos for free. Open-source, no subscriptions, no watermarks, and free for commercial use. An alternative to Screen Studio.

OpenScreen is an open-source TypeScript project for creating high-quality demo videos for free, with no subscriptions, no watermarks, and free commercial use; an alternative to Screen Studio.