Fine-tuning questions and the dataset splits method?
HongdaChen opened this issue · 0 comments
Scripts
With the help of ChatGPT, I wrote the following script, which prints the size of the intersection of several *.txt files:
def read_file(filename):
    with open(filename, 'r') as file:
        lines = file.readlines()
    numbers = [line.strip() for line in lines]
    result = set(numbers)
    print(f"{filename} has {len(result)} images")
    return result


def find_intersection(files):
    if len(files) < 2:
        raise ValueError("At least two files are required for finding the intersection.")
    sets = [read_file(file) for file in files]
    intersection = set.intersection(*sets)
    print(f"intersection num of {files} is {len(intersection)}")
    return intersection


# Example usage
# files = ['t1_train.txt', 't2_train.txt', 't3_train.txt', 't4_train.txt']  # replace with the actual paths to your files
files = ['t2_ft.txt', 't3_ft.txt']
# files = ['t2_train.txt', 't2_ft.txt']
intersection = find_intersection(files)
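
Rather than editing `files` and re-running the script for each combination, it may be handier to report every pair in one pass. The following is only a minimal sketch: `load_ids` and `pairwise_intersections` are hypothetical helper names, not part of the OWOD repository, and the demo data is made up.

```python
from itertools import combinations


def load_ids(path):
    """Read one image ID per line, skipping blank lines."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def pairwise_intersections(named_sets):
    """Return {(name_a, name_b): overlap size} for every unordered pair."""
    return {
        (a, b): len(named_sets[a] & named_sets[b])
        for a, b in combinations(sorted(named_sets), 2)
    }


if __name__ == "__main__":
    # Toy data standing in for the real split files (hypothetical IDs).
    demo = {
        "a.txt": {"1", "2", "3"},
        "b.txt": {"3", "4"},
        "c.txt": {"5"},
    }
    for pair, size in pairwise_intersections(demo).items():
        print(pair, size)
```

For the real files, the dictionary could be built with something like `{name: load_ids(name) for name in files}`.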
Find the dataset split method under the hood
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t1_train.txt has 16551 images
t2_train.txt has 45520 images
t3_train.txt has 39402 images
t4_train.txt has 40260 images
intersection num of ['t1_train.txt', 't2_train.txt', 't3_train.txt', 't4_train.txt'] is 0
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t1_train.txt has 16551 images
t2_train.txt has 45520 images
t2_ft.txt has 1743 images
intersection num of ['t1_train.txt', 't2_train.txt', 't2_ft.txt'] is 0
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t2_train.txt has 45520 images
t2_ft.txt has 1743 images
intersection num of ['t2_train.txt', 't2_ft.txt'] is 1330
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t1_train.txt has 16551 images
t2_ft.txt has 1743 images
intersection num of ['t1_train.txt', 't2_ft.txt'] is 413
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t2_train.txt has 45520 images
t3_ft.txt has 2361 images
intersection num of ['t2_train.txt', 't3_ft.txt'] is 1402
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t1_train.txt has 16551 images
t3_ft.txt has 2361 images
intersection num of ['t1_train.txt', 't3_ft.txt'] is 374
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t3_train.txt has 39402 images
t3_ft.txt has 2361 images
intersection num of ['t3_train.txt', 't3_ft.txt'] is 938
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets# python find_intersection.py
t2_ft.txt has 1743 images
t3_ft.txt has 2361 images
intersection num of ['t2_ft.txt', 't3_ft.txt'] is 107
root@46a2a355a17d:/owod_master/datasets/OWOD_imagesets#
- Why does t2_ft.txt not contain $N_{ex} \times 20 = 1000$ images from t1_train.txt, where $N_{ex} = 50$ as suggested in the paper? Only 413 images of t2_ft.txt appear in t1_train.txt.
- Why does t3_ft.txt contain fewer images than $|\text{t1\_train} \cap \text{t3\_ft}| + |\text{t2\_train} \cap \text{t3\_ft}| + |\text{t3\_train} \cap \text{t3\_ft}| = 374 + 1402 + 938 = 2714$? t3_ft.txt has only 2361 images.
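
On the t3_ft.txt count: the sum of the three intersection sizes can exceed the size of t3_ft.txt only if some fine-tuning images appear in more than one training split. A rough sketch for checking this is below. The function names are hypothetical, the inputs are plain sets of image IDs, and this says nothing about how the authors actually built the splits.

```python
from collections import Counter


def membership_counts(ft_ids, train_sets):
    """Count, for each fine-tuning ID, how many training splits contain it."""
    counts = Counter()
    for ids in train_sets.values():
        counts.update(ft_ids & ids)
    return counts


def summarize(ft_ids, train_sets):
    """Break the fine-tuning set down by how many training splits cover each ID."""
    counts = membership_counts(ft_ids, train_sets)
    return {
        "sum_of_intersections": sum(counts.values()),
        "covered_by_any_split": len(counts),
        "in_multiple_splits": sum(1 for c in counts.values() if c > 1),
        "in_no_split": len(ft_ids) - len(counts),
    }
```

If `in_multiple_splits` is nonzero for t3_ft.txt, the training splits overlap on those images, which would account for a sum larger than the file itself.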