Hi,
Thanks for sharing this implementation. I have one question: when flattening each HxW patch to D dimensions, you use an FC layer to map it to a fixed dimension. But in the original paper, they train at 224x224 and test at 384x384, which cannot be achieved if the flatten layer is fixed. Also, in another repo you shared (https://github.com/rwightman/pytorch-image-models/blob/6f43aeb2526f9fc35cde5262df939ea23a18006c/timm/models/vision_transformer.py#L146), they use a 1D conv to avoid the resolution mismatch problem. Do you know which one is correct? Thanks!
@yueruchen Hi Yifan! As long as both of your image sizes are divisible by the patch size and you instantiate ViT with image_size set to the largest image size you will be using (384 in your case), it should work fine for the smaller images you pass in.
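To make the arithmetic behind this concrete, here is a minimal sketch in plain Python. It assumes a patch size of 16 and 3 input channels, neither of which is stated in this thread, and `patch_stats` is a hypothetical helper rather than part of either repo. It only illustrates that the input width of the per-patch FC layer depends on the patch size, while the image size changes how many patches (and therefore how many position embeddings) are needed.

```python
def patch_stats(image_size: int, patch_size: int, channels: int = 3):
    """Return (num_patches, patch_dim) for a square image.

    Hypothetical helper for illustration only.
    patch_dim is the length of one flattened patch, i.e. the input
    width of the FC layer that maps a patch to D dimensions.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = channels * patch_size ** 2
    return num_patches, patch_dim


# Assumed patch size of 16; the two resolutions are the ones from the question.
for size in (224, 384):
    num_patches, patch_dim = patch_stats(size, patch_size=16)
    print(f"{size}x{size}: {num_patches} patches, each of length {patch_dim}")
```

Under these assumptions, both resolutions produce flattened patches of the same length, so the same FC weights apply to either one. Only the sequence length differs, which is why sizing the model for the largest resolution leaves room for the smaller one.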