Vision-Language Multimodal Learning: The CLIP Model and Image-Text Alignment

📂 Stage: Stage 4, New Paradigms in Vision (Transformers)
🔗 Related chapters: MAE (Masked Autoencoders) · Lightweight Models

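1. The Core Idea of CLIP

CLIP = Contrastive Language-Image Pre-training

Key innovation: learn visual representations using natural language as the supervision signal.

Steps:
1. Image encoder: extracts image features
2. Text encoder: extracts text features
3. Contrastive learning: matching image-text pairs are pulled together in the embedding space; mismatched pairs are pushed apart
4. Zero-shot classification: classify new categories without task-specific training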
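
The four steps above can be sketched with toy encoders. This is only an illustration of the data flow: `ToyDualEncoder`, its layer sizes, and the random inputs are hypothetical stand-ins, not CLIP's real architecture (which uses a ViT or ResNet image tower and a Transformer text tower).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDualEncoder(nn.Module):
    """Hypothetical stand-in for CLIP's two towers; not the real architecture."""

    def __init__(self, image_dim=3 * 8 * 8, vocab_size=100, embed_dim=16):
        super().__init__()
        # Step 1: image encoder (a single linear layer instead of a ViT/ResNet).
        self.image_encoder = nn.Linear(image_dim, embed_dim)
        # Step 2: text encoder (mean of token embeddings instead of a Transformer).
        self.text_encoder = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")

    def forward(self, images, token_ids):
        image_features = F.normalize(self.image_encoder(images.flatten(1)), dim=-1)
        text_features = F.normalize(self.text_encoder(token_ids), dim=-1)
        # Step 3: cosine similarity between every image and every text.
        return image_features @ text_features.T


torch.manual_seed(0)
model = ToyDualEncoder()
images = torch.randn(4, 3, 8, 8)           # 4 fake images
token_ids = torch.randint(0, 100, (4, 6))  # 4 fake captions, 6 tokens each

similarity = model(images, token_ids)      # shape (4, 4)
# Step 4: zero-shot style prediction picks the best-matching text per image.
predicted_text = similarity.argmax(dim=-1)
print(similarity.shape, predicted_text.shape)
```

With untrained weights the predictions are arbitrary; contrastive training is what moves the diagonal of `similarity` (the matching pairs) above the off-diagonal entries.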

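2. How CLIP Works

"""
CLIP loss function:

Given a batch of N image-text pairs:
- Positive pairs: image i and text i should have high similarity
- Negative pairs: image i and text j (j≠i) should have low similarity

The model is trained with a symmetric contrastive loss.
"""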

import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, image_features, text_features):
        # L2-normalize so the dot product below is a cosine similarity
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Similarity matrix of shape (N, N), scaled by the temperature
        logits = image_features @ text_features.T / self.temperature

        # The matching text for image i sits at column i; build the labels on
        # the same device as the logits so the loss also works on GPU
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_i = F.cross_entropy(logits, labels)    # image -> text
        loss_t = F.cross_entropy(logits.T, labels)  # text -> image

        return (loss_i + loss_t) / 2
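
As a quick sanity check of the loss described above, the sketch below compares a batch where every text embedding matches its image against the same batch with the captions shifted by one position. The function `clip_contrastive_loss` is a hypothetical functional restatement written for this example, and the random features are placeholders for real encoder outputs.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


torch.manual_seed(0)
features = torch.randn(8, 32)

# Perfectly aligned pairs: text i has the same embedding as image i.
aligned_loss = clip_contrastive_loss(features, features)
# Misaligned pairs: captions shuffled by rolling the batch by one position.
shuffled_loss = clip_contrastive_loss(features, features.roll(1, dims=0))

print(f"aligned: {aligned_loss.item():.4f}, shuffled: {shuffled_loss.item():.4f}")
```

The aligned batch should give a loss close to zero and the shuffled batch a much larger one, which is the behavior the training objective relies on.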

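3. Using CLIP

import torch
import clip
from PIL import Image

# Load a pretrained CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Zero-shot image classification
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Scale by the learned temperature before the softmax
    logits_per_image = model.logit_scale.exp() * image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability for each candidate label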
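
Cosine similarities between normalized embeddings lie in [-1, 1], so a softmax applied to them directly is nearly uniform. The sketch below illustrates why a scale factor is applied before the softmax. The similarity values are made up, and the scale of 100 is an assumption chosen for illustration rather than a value read from a model.

```python
import torch

# Hypothetical cosine similarities between one image and three text prompts.
cosine_sim = torch.tensor([[0.30, 0.25, 0.20]])

# Softmax on the raw cosine similarities: the three classes look almost equal.
probs_unscaled = cosine_sim.softmax(dim=-1)

# Softmax after multiplying by an assumed scale factor of 100:
# the small gaps between similarities become a confident prediction.
logit_scale = 100.0
probs_scaled = (logit_scale * cosine_sim).softmax(dim=-1)

print("unscaled:", probs_unscaled)
print("scaled:  ", probs_scaled)
```

Both versions rank the classes identically; the scale only changes how peaked the resulting probabilities are.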

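4. Applications of CLIP

Zero-shot classification:
  - Classify new categories without task-specific training
  - Only a text description of each class is needed

Image-text retrieval:
  - Search images with a text query
  - Search text with an image query

Image generation:
  - Combined with diffusion models
  - Text-to-image generation (DALL-E)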
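
The image-text retrieval use case above reduces to a nearest-neighbor search in the shared embedding space. The following is a minimal sketch under stated assumptions: `retrieve_top_k` is a hypothetical helper, and the gallery and query are random tensors standing in for embeddings that would normally come from the image and text encoders.

```python
import torch
import torch.nn.functional as F


def retrieve_top_k(query_embedding, gallery_embeddings, k=3):
    """Return indices and cosine scores of the k gallery items closest to the query."""
    query = F.normalize(query_embedding, dim=-1)
    gallery = F.normalize(gallery_embeddings, dim=-1)
    scores = gallery @ query  # one cosine similarity per gallery item
    top = scores.topk(k)
    return top.indices.tolist(), top.values.tolist()


torch.manual_seed(0)
# Hypothetical precomputed image embeddings (in practice: image encoder outputs).
image_gallery = torch.randn(100, 64)
# Hypothetical text query embedding, built close to image 42 so it should rank first.
text_query = image_gallery[42] + 0.1 * torch.randn(64)

indices, scores = retrieve_top_k(text_query, image_gallery, k=3)
print(indices, scores)
```

Searching text with an image works the same way with the roles swapped: the image embedding becomes the query and the text embeddings form the gallery.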

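5. Summary

Why CLIP matters:

1. Zero-shot learning: classifies without task-specific training
2. Multimodal understanding: connects vision and language
3. Extensibility: works with arbitrary text descriptions

Applications in 2026:
- Image-text retrieval: Pinterest, Google Images
- Content moderation: automatically flagging inappropriate content
- Image generation: DALL-E, Midjourney

💡 Remember: CLIP is a milestone in multimodal learning. It showed that natural language supervision can produce general-purpose visual representations.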
