Duplicate images in the data

Tyanakai

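Thank you for hosting this competition.

I found some images that are probably the same, so I am sharing them here.
As far as I can tell, they were photographed differently, but the underlying image appears to be the same. Alternatively, some of them may be hand-made copies of the same work.

I have not checked the data exhaustively, but the list below is everything I found.
There do not seem to be many duplicates of this kind.

I hope this helps with things like preventing overfitting.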

import os
import numpy as np

from matplotlib import pyplot as plt

from config import Folder

folder = Folder()
input_folder = folder.input_folder

# Paths to the competition's .npz archives.
train_img_path = os.path.join(input_folder, "christ-train-imgs.npz")
train_label_path = os.path.join(input_folder, "christ-train-labels.npz")
test_img_path = os.path.join(input_folder, "christ-test-imgs.npz")

# Each archive is read through NumPy's default key "arr_0".
train_imgs = np.load(train_img_path)["arr_0"]
test_imgs = np.load(test_img_path)["arr_0"]
train_labels = np.load(train_label_path)["arr_0"]

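# Groups of indices whose images appear to come from the same original.
same_image = [(50, 1129), (506, 598), (138, 793), (9, 66), (250, 1011),
              (238, 341, 415), (75, 772), (71, 535), (333, 513),
              (163, 443), (894, 898), (414, 893), (661, 880), (121, 1067),
              (321, 825), (424, 953), (30, 773), (122, 545)]

# A pair that looks very similar but is not identical.
similar_image = [921, 960]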

num_train_img = train_imgs.shape[0]

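def plot_img(plt_pos, idx, num_train_img):
    """Draw image idx in subplot plt_pos (0-based) of a 1x3 grid.

    Indices from num_train_img upward refer to test images.
    """
    plt.subplot(1, 3, plt_pos + 1)
    if idx >= num_train_img:
        # idx == num_train_img is the first test image, hence >= rather than >.
        plt.imshow(test_imgs[idx - num_train_img])
        plt.xlabel(f"test_img {idx - num_train_img}")
    else:
        plt.imshow(train_imgs[idx])
        plt.xlabel(f"train_img {idx}")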

for tpl in same_image:
    # One figure per group; a group holds at most three images.
    plt.figure(figsize=(10, 15))
    for i, idx in enumerate(tpl):
        plot_img(i, idx, num_train_img)
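
For the overfitting point, one option is to keep every copy of an image on the same side of a train/validation split. The following is a minimal sketch and not part of the original analysis: `build_groups` is a hypothetical helper that merges tuples like those in `same_image` into group ids with a small union-find, and the toy input is made up for illustration.

```python
# Sketch only: turn duplicate tuples into group ids so that a group-aware
# splitter can keep every copy of an image in the same fold.

def build_groups(num_images, duplicate_tuples):
    """Return a list of group ids, one per image index.

    Images listed in the same tuple (or linked through a shared index)
    receive the same group id; every other image forms its own group.
    """
    parent = list(range(num_images))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for tpl in duplicate_tuples:
        root = find(tpl[0])
        for idx in tpl[1:]:
            parent[find(idx)] = root

    return [find(i) for i in range(num_images)]


# Hypothetical toy input: 6 images, where 0/3 and 1/4/5 are duplicates.
groups = build_groups(6, [(0, 3), (1, 4, 5)])
print(groups)
```

With the real data this might be called as `build_groups(num_train_img + test_imgs.shape[0], same_image)`, with the first `num_train_img` entries passed as the `groups` argument of a group-aware splitter such as scikit-learn's `GroupKFold`.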

The two images below are very similar but not identical.

plt.figure(figsize=(10, 15))
for i, idx in enumerate(similar_image):
    plot_img(i, idx, num_train_img)
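
Since the list was not compiled exhaustively, a rough automated pass could surface further candidates for visual checking. The sketch below is not how the pairs above were found. It uses a simple average hash, and the names `average_hash`, `candidate_pairs`, `hash_size`, and `max_distance` are all assumptions made for illustration. It also assumes each image is an `(H, W)` or `(H, W, C)` array with at least `hash_size` pixels per side. Because the duplicates here were photographed differently, a hash like this may miss some of them, and every hit still needs to be confirmed by eye.

```python
import numpy as np


def average_hash(img, hash_size=8):
    """Return a flat boolean array of hash_size * hash_size bits."""
    gray = img.astype(np.float64)
    if gray.ndim == 3:
        gray = gray.mean(axis=2)  # collapse color channels
    h, w = gray.shape
    # Crop so that height and width divide evenly into hash_size blocks.
    bh, bw = h // hash_size, w // hash_size
    gray = gray[: bh * hash_size, : bw * hash_size]
    # Block-average down to a hash_size x hash_size thumbnail.
    small = gray.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # One bit per block: brighter than the thumbnail's mean or not.
    return (small > small.mean()).ravel()


def candidate_pairs(imgs, max_distance=5):
    """Return index pairs whose hashes differ in at most max_distance bits."""
    hashes = np.stack([average_hash(img) for img in imgs])
    pairs = []
    for i in range(len(hashes)):
        # Hamming distance from image i to every later image.
        dist = np.count_nonzero(hashes[i + 1:] != hashes[i], axis=1)
        for j in np.flatnonzero(dist <= max_distance):
            pairs.append((i, i + 1 + int(j)))
    return pairs


# Made-up demo data: four random images, where image 3 is a uniformly
# brightened copy of image 0.
rng = np.random.default_rng(0)
demo = rng.integers(0, 200, size=(4, 32, 32, 3)).astype(np.uint8)
demo[3] = demo[0] + 10
pairs = candidate_pairs(demo)
print(pairs)
```

On the real data, something like `candidate_pairs(np.concatenate([train_imgs, test_imgs]))` would return indices in the same combined numbering that `plot_img` expects, provided both arrays share the same image shape.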
