Quick Tip: Building Video Vector Embeddings with a Python Notebook and OpenAI CLIP


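Abstract

As artificial intelligence continues to influence many kinds of data processing, vector embeddings have become a powerful tool for video analysis as well. This article looks at some of AI's capabilities for analysing video data. We'll explore how vector embeddings created with Python and OpenAI CLIP can be used to interpret and analyse video content.

The notebook file used in this article is available on GitHub.

Introduction

This article discusses why vector embeddings matter for video analysis and provides a step-by-step guide to building them, using a simple example.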

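Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Import the notebook

We'll download the notebook from GitHub.

From the left navigation pane in the SingleStore cloud portal, we'll select Develop > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.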

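Run the notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.

We'll first download an example video from GitHub and then play the short video directly in the notebook. The example video is 142 seconds long.

Next, we'll install some libraries, including OpenAI CLIP.

Contrastive Language-Image Pre-training (CLIP) is a model from OpenAI that understands images and text by mapping both into a shared embedding space. We'll load it as follows: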

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)

We'll break the video up into individual image frames, sampling roughly one frame per second, as follows:

import cv2

def extract_frames(video_path):
    # Sample roughly one frame per second of video
    frames = []
    cap = cv2.VideoCapture(video_path)
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    total_seconds = total_frames / frame_rate
    target_frame_count = int(total_seconds)
    target_frame_index = 0
    for i in range(target_frame_count):
        # Seek to the next sampled frame and decode it
        cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame_index)
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        target_frame_index += int(frame_rate)
    cap.release()
    return frames
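
The sampling arithmetic in extract_frames can be checked without opening a video. The sketch below is a minimal illustration and not part of the original notebook: sample_frame_indices is a hypothetical helper, and the frame rate and frame count passed to it are assumed values, not properties of the example video.

```python
def sample_frame_indices(frame_rate, total_frames):
    """Return the frame positions a one-frame-per-second sampler would seek to.

    Hypothetical helper that mirrors the arithmetic in extract_frames.
    """
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    total_seconds = total_frames / frame_rate
    target_frame_count = int(total_seconds)  # one sample per whole second
    step = int(frame_rate)                   # fractional frame rates are truncated
    return [i * step for i in range(target_frame_count)]


# Assumed values: a 142-second clip at 25 fps would contain 3550 frames
indices = sample_frame_indices(25.0, 142 * 25)
print(len(indices), indices[:3], indices[-1])
```

With a fractional rate such as 29.97 fps the step is truncated to 29, so each sampled position falls slightly before its whole-second mark. For a short clip this drift is small, but it grows with the length of the video.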

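Next, we'll condense each frame into an embedding, a compact numeric summary of what the picture shows:

def generate_embedding(frame):
    # PILImage is assumed to be PIL's Image module (from PIL import Image as PILImage)
    frame_tensor = preprocess(PILImage.fromarray(frame)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(frame_tensor).cpu().numpy()
    return embedding[0]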

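We'll now extract the visual information from the video and summarize it in a structured format for further analysis. The function returns a DataFrame, which the later cells refer to as df:

def store_frame_embedding_and_image(video_path):
    frames = extract_frames(video_path)
    # The original call also passed a bar_format string; it is cut off in the source and omitted here
    data = [
        (i+1, generate_embedding(frame), frame)
        for i, frame in enumerate(tqdm(
            frames,
            desc = "Processing frames"
        ))
    ]
    # Column names are taken from how df is used in the cells that follow
    return pd.DataFrame(data, columns = ["frame_number", "embedding_data", "frame_data"])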



Let's check the size characteristics of the data stored in the DataFrame:

embedding_lengths = df["embedding_data"].str.len()
frame_lengths = df["frame_data"].str.len()

# calculate min and max lengths for embeddings and frames
min_embedding_length, max_embedding_length = embedding_lengths.min(), embedding_lengths.max()
min_frame_length, max_frame_length = frame_lengths.min(), frame_lengths.max()

# print results
print(f"min length of embedding vectors: {min_embedding_length}")
print(f"max length of embedding vectors: {max_embedding_length}")
print(f"min length of frame data vectors: {min_frame_length}")
print(f"max length of frame data vectors: {max_frame_length}")




Example output:


min length of embedding vectors: 512
max length of embedding vectors: 512
min length of frame data vectors: 1080
max length of frame data vectors: 1080




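Now, let's quantify how similar a query embedding is to the embedding of each frame in the DataFrame, which gives us a similarity score between the query and every frame: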


def calculate_similarity(query_embedding, df):
    # convert the query embedding to a tensor
    query_tensor = torch.tensor(query_embedding, dtype = torch.float32).to(device)

    # convert the list of embeddings to a numpy array
    embeddings_np = np.array(df["embedding_data"].tolist())

    # create a tensor from the numpy array
    embeddings_tensor = torch.tensor(embeddings_np, dtype = torch.float32).to(device)

    # compute similarities using matrix multiplication
    similarities = torch.mm(embeddings_tensor, query_tensor.unsqueeze(1)).squeeze().tolist()
    return similarities
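
Because calculate_similarity multiplies the embeddings directly, its scores are raw dot products and not cosine similarities, which is why the example outputs in this article exceed 1. The sketch below is a minimal, pure-Python illustration of the difference. dot_scores and cosine_scores are hypothetical helpers, and the 2-dimensional vectors are toy values standing in for CLIP embeddings.

```python
import math


def dot_scores(query, embeddings):
    """Raw dot product of the query with each embedding."""
    return [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]


def cosine_scores(query, embeddings):
    """Dot product after scaling the query and each embedding to unit length."""
    def unit(vec):
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec]

    return dot_scores(unit(query), [unit(emb) for emb in embeddings])


# Toy vectors: the third embedding points the same way as the first but is 10x longer
query = [3.0, 4.0]
embeddings = [[3.0, 4.0], [1.0, 0.0], [30.0, 40.0]]
print(dot_scores(query, embeddings))
print(cosine_scores(query, embeddings))
```

The dot product rewards longer vectors, while the cosine score depends only on direction. Which one ranks frames better is a design choice; the article's code and its SingleStore index both use the dot product.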




Now, we'll encode the meaning of a text query as an embedding in the same numeric form:


def encode_text_query(query):
    # tokenize the query text
    tokens = clip.tokenize([query]).to(device)

    # compute text features using the pretrained model
    with torch.no_grad():
        text_features = model.encode_text(tokens)

    # convert the tensor to a numpy array and return it
    return text_features.cpu().numpy().flatten()




When prompted, we'll enter the query string "ultra-fast ingestion":


query = input("enter your query: ")
text_query_embedding = encode_text_query(query)
text_similarities = calculate_similarity(text_query_embedding, df)
df["text_similarity"] = text_similarities




We'll now get the top 5 best text matches:


# Retrieve the top 5 text matches based on similarity
top_text_matches = df.nlargest(5, "text_similarity")

print("top 5 best matches:")
print(top_text_matches[["frame_number", "text_similarity"]].to_string(index = False))




Example output:


top 5 best matches:
 frame_number  text_similarity
           40        36.456184
           39        36.081161
           43        33.295975
           42        32.423229
           45        31.931164
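
The top-5 selection does not depend on pandas. As a minimal sketch, the same ranking can be expressed with the standard library's heapq module. top_k_frames is a hypothetical helper and the scores are made-up values.

```python
import heapq


def top_k_frames(similarities, k=5):
    """Return (frame_number, score) pairs for the k highest scores, best first.

    Frame numbers are 1-based, matching the frame_number column.
    """
    numbered = enumerate(similarities, start=1)
    return heapq.nlargest(k, numbered, key=lambda pair: pair[1])


# Made-up scores for six frames
scores = [12.5, 30.1, 8.0, 36.4, 33.2, 30.9]
for frame_number, score in top_k_frames(scores, k=3):
    print(frame_number, score)
```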




We can also plot the frames:


def plot_frames(frames, frame_numbers):
    num_frames = len(frames)
    fig, axes = plt.subplots(1, num_frames, figsize = (15, 5))

    for ax, frame_data, frame_number in zip(axes, frames, frame_numbers):
        ax.imshow(frame_data)
        ax.set_title(f"frame {frame_number}")
        ax.axis("off")

    plt.tight_layout()
    plt.show()

# collect frame data and numbers for the top text matches
top_text_matches_indices = top_text_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_text_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_text_matches_indices]

# plot the frames
plot_frames(frames, frame_numbers)




Now, we'll encode an image query as an embedding in the same way:


def encode_image_query(image):
    # preprocess the image and add batch dimension
    image_tensor = preprocess(image).unsqueeze(0).to(device)

    # extract features using the model
    with torch.no_grad():
        image_features = model.encode_image(image_tensor)

    # convert features to numpy array and flatten
    return image_features.cpu().numpy().flatten()




We'll also download an example image to use as the query:


image_url = "https://github.com/veryfatboy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"

response = requests.get(image_url)

if response.status_code == 200:
    # Image is assumed to be IPython.display.Image; PILImage and BytesIO are assumed to be PIL.Image and io.BytesIO
    display(Image(url = image_url))
    image_file = PILImage.open(BytesIO(response.content))

    image_query_embedding = encode_image_query(image_file)
    image_similarities = calculate_similarity(image_query_embedding, df)
    df["image_similarity"] = image_similarities
else:
    print("failed to download the image, status code:", response.status_code)




We'll now get the top 5 best image matches:


top_image_matches = df.nlargest(5, "image_similarity")

print("top 5 best matches:")
print(top_image_matches[["frame_number", "image_similarity"]].to_string(index = False))




Example output:


top 5 best matches:
 frame_number  image_similarity
            7         57.674603
            9         43.669739
            6         42.573799
           15         40.296551
           93         40.201733




We can also plot the frames:


# collect frame data and numbers for the top image matches
top_image_matches_indices = top_image_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_image_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_image_matches_indices]

# plot the frames
plot_frames(frames, frame_numbers)




Now let's combine the text and image queries using element-wise averaging:


combined_query_embedding = (text_query_embedding + image_query_embedding) / 2
combined_similarities = calculate_similarity(combined_query_embedding, df)
df["combined_similarity"] = combined_similarities
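
Since the similarity here is a dot product, which is linear, scoring a frame against the averaged query gives the same number as averaging that frame's text score and image score. The sketch below checks this on toy 3-dimensional vectors. The helper names and values are illustrative assumptions and do not come from the notebook.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def average(a, b):
    """Element-wise mean of two equal-length vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# Toy vectors standing in for 512-dimensional embeddings
text_query = [1.0, 2.0, 0.0]
image_query = [3.0, 0.0, 4.0]
frame = [2.0, 1.0, 1.0]

combined = dot(average(text_query, image_query), frame)
mean_of_scores = (dot(text_query, frame) + dot(image_query, frame)) / 2
print(combined, mean_of_scores)
```

In practice this means the combined ranking favours frames that score reasonably well on both queries, and a frame with one very high score can still rank first.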




We'll now get the top 5 best combined matches:


top_combined_matches = df.nlargest(5, "combined_similarity")

print("top 5 best matches:")
print(top_combined_matches[["frame_number", "combined_similarity"]].to_string(index = False))




Example output:


top 5 best matches:
 frame_number  combined_similarity
            7            36.337120
            5            32.869991
            6            32.559093
           93            32.205418
           94            31.881357




We can also plot the frames:


# collect frame data and numbers for the top combined matches
top_combined_matches_indices = top_combined_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_combined_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_combined_matches_indices]

# plot the frames
plot_frames(frames, frame_numbers)




Next, we'll store the data in SingleStore. First, we'll prepare the data:


frames_df = df.copy()
frames_df.drop(
    columns = ["text_similarity", "image_similarity", "combined_similarity"],
    inplace = True
)

query_string = combined_query_embedding.copy()




We also need to perform some data cleanup:


def process_data(arr):
    return np.array2string(arr, separator = ",").replace("\n", "")

frames_df["embedding_data"] = frames_df["embedding_data"].apply(process_data)
frames_df["frame_data"] = frames_df["frame_data"].apply(process_data)
query_string = process_data(query_string)
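
process_data turns each array into a bracketed, comma-separated string. One caveat worth knowing: unless NumPy's print options have been changed, np.array2string abbreviates arrays with more than 1000 elements using "...", so a 512-element embedding is written in full while a full video frame is likely stored in shortened form. As an alternative sketch, a vector can be serialized in full with the standard library's json module. to_vector_literal is a hypothetical helper and is not part of the notebook.

```python
import json


def to_vector_literal(values):
    """Serialize a sequence of numbers as a compact JSON array string.

    float() converts NumPy scalar types to plain Python floats, which json can encode.
    """
    return json.dumps([float(v) for v in values], separators=(",", ":"))


literal = to_vector_literal([0.5, -1.0, 2.0])
print(literal)
print(json.loads(literal))
```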




We'll check whether we are running on the Free Shared Tier, and create the database only if we are not:


shared_tier_check = %sql SHOW VARIABLES LIKE "is_shared_tier"
if not shared_tier_check or shared_tier_check[0][1] == "OFF":
    %sql DROP DATABASE IF EXISTS video_db;
    %sql CREATE DATABASE IF NOT EXISTS video_db;




Then we'll connect to the database:


from sqlalchemy import *

db_connection = create_engine(connection_url)




We'll make sure that a table is available to store the data:


DROP TABLE IF EXISTS frames;

CREATE TABLE IF NOT EXISTS frames (
    frame_number INT(10) UNSIGNED NOT NULL,
    embedding_data VECTOR(512) NOT NULL,
    frame_data TEXT,
    KEY(frame_number)
);




Then we'll write the DataFrame to SingleStore:


frames_df.to_sql(
    "frames",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)




We can read some of the data back from SingleStore:


SELECT frame_number,
    SUBSTRING(embedding_data, 1, 50) AS embedding_data,
    SUBSTRING(frame_data, 1, 50) AS frame_data
FROM frames
LIMIT 1;




We can also create an ANN index:


ALTER TABLE frames ADD VECTOR INDEX (embedding_data)
     INDEX_OPTIONS '{
          "index_type":"AUTO",
          "metric_type":"DOT_PRODUCT"
     }';




First, let's run the query without using the ANN index:


SELECT frame_number,
    embedding_data <*> :query_string AS similarity
FROM frames
ORDER BY similarity USE INDEX () DESC
LIMIT 5;




Example output:


frame_number         similarity
           7 36.337120056152344
           5   32.8699951171875
           6   32.5590934753418
          93  32.20541763305664
          94 31.881359100341797




Now, we'll run the query using the ANN index:


SELECT frame_number,
    embedding_data <*> :query_string AS similarity
FROM frames
ORDER BY similarity DESC
LIMIT 5;




Example output:


frame_number         similarity
           7 36.337120056152344
           5   32.8699951171875
           6   32.5590934753418
          93  32.20541763305664
          94 31.881359100341797




We can also use Python as an alternative:


sql_query = """
SELECT frame_number, embedding_data, frame_data
FROM frames
ORDER BY embedding_data <*> %s DESC
LIMIT 5;
"""

new_frames_df = pd.read_sql(
    sql_query,
    con = db_connection,
    params = (query_string,)
)

new_frames_df.head()




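Because we are only storing a small amount of data (142 rows), the results are the same whether or not we use the ANN index. The results of querying the database also agree with our earlier combined query results.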
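
On larger tables an approximate index may miss some of the exact top results, so it can be useful to measure the overlap between the two result sets. The sketch below is one simple way to do that. recall_at_k is a hypothetical helper, and the frame numbers are copied from the two example outputs above.

```python
def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-k results that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)


# Frame numbers returned without and with the ANN index
exact = [7, 5, 6, 93, 94]
ann = [7, 5, 6, 93, 94]
print(recall_at_k(exact, ann))
```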


  
  
Summary


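In this article, we applied vector embeddings to video analysis using Python and OpenAI's CLIP model. We saw how to extract frames from a video, generate an embedding for each frame, and use those embeddings to run similarity searches based on text and image queries. This lets us retrieve relevant video segments, which makes the approach a useful tool for video content analysis.

Today, many modern LLMs offer multimodal capabilities with broad support for audio, images and video. However, the example in this article shows that some of the same capabilities can be achieved with free software.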
