Recurrent Transformer Networks (RTNs)

Seungryong Kim¹

Stephen Lin²

Sangryul Jeon³

Dongbo Min⁴

Kwanghoon Sohn³

Korea University¹, MSRA², Yonsei University³, Ewha Womans University⁴

[NeurIPS'18 paper]

[NeurIPS'18 slide]

[Code]

Intuition of RTNs: (a) methods for geometric invariance in the feature extraction step, e.g., STN-based methods [5, 19], (b) methods for geometric invariance in the regularization step, e.g., geometric matching-based methods [30, 31], and (c) RTNs, which combine the advantages of STN-based methods and geometric matching techniques by recursively estimating geometric transformation residuals from geometry-aligned feature activations.

We present recurrent transformer networks (RTNs) for obtaining dense correspondences between semantically similar images. Our networks accomplish this through an iterative process of estimating spatial transformations between the input images and using these transformations to generate aligned convolutional activations. By directly estimating the transformations between an image pair, rather than employing spatial transformer networks to independently normalize each individual image, we show that greater accuracy can be achieved. This process is conducted in a recursive manner to refine both the transformation estimates and the feature representations. In addition, a technique is presented for weakly-supervised training of RTNs that is based on a proposed classification loss. With RTNs, state-of-the-art performance is attained on several benchmarks for semantic correspondence.
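The iterative idea above (estimate a transformation residual from features aligned by the current estimate, add it to the estimate, repeat) can be illustrated with a deliberately simplified toy. This is a sketch under strong assumptions, not the RTN architecture: it works on 1D signals, the "transformation" is a single integer shift, and the residual is found by brute-force search in a small window rather than predicted by a learned network. All function names (`shift_cost`, `estimate_residual`, `recurrent_alignment`) and the parameters `iterations` and `radius` are hypothetical.

```python
import math


def shift_cost(source, target, shift):
    """Mean squared difference between source[i] and target[i + shift]."""
    total, count = 0.0, 0
    for i, value in enumerate(source):
        j = i + shift
        if 0 <= j < len(target):  # skip samples that fall outside the target
            total += (value - target[j]) ** 2
            count += 1
    return total / count if count else float("inf")


def estimate_residual(source, target, current_shift, radius):
    """Search a small window around the current estimate for the best residual.

    Stand-in for a learned geometry estimator: it only looks at the target
    as aligned by the current shift, within +/- radius samples.
    """
    candidates = range(-radius, radius + 1)
    return min(
        candidates,
        key=lambda r: shift_cost(source, target, current_shift + r),
    )


def recurrent_alignment(source, target, iterations=4, radius=2):
    """Accumulate residual estimates over several iterations."""
    shift = 0  # start from the identity transformation
    history = []
    for _ in range(iterations):
        shift += estimate_residual(source, target, shift, radius)
        history.append(shift)
    return shift, history


def bump(x):
    """Smooth 1D 'feature' centered at x = 20."""
    return math.exp(-((x - 20) / 6.0) ** 2)


# The target is the source signal displaced by 5 samples, which is larger
# than the search radius, so a single estimation step cannot recover it.
source = [bump(i) for i in range(64)]
target = [bump(i - 5) for i in range(64)]

final_shift, history = recurrent_alignment(source, target)
print(final_shift, history)
```

The point of the toy is the design choice rather than the numbers: each step only needs to resolve a small residual because the search is centered on the current estimate, so a displacement larger than the window is recovered over several iterations.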

Results

Qualitative results on the PF-PASCAL benchmark [8]: (a) source image, (b) target image, (c) CAT-FCSS w/SF [19], (d) SCNet [9], (e) GMat. w/Inl. [31], and (f) RTNs. The source images are warped to the target images using correspondences.
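Warping a source image with dense correspondences, as in the qualitative results above, can be sketched as follows. This is a minimal illustration assuming integer correspondences and nearest-neighbor lookup with clamping at the image border; the function name `warp_with_correspondences` and the data layout (nested lists, one `(row, col)` source coordinate per target pixel) are hypothetical and not taken from the released code.

```python
def warp_with_correspondences(source, corr):
    """Build a target-sized image by reading each pixel from its source match.

    source: 2D list of pixel values.
    corr:   2D list where corr[y][x] = (sy, sx) is the source coordinate
            matched to target pixel (y, x).
    """
    src_h, src_w = len(source), len(source[0])
    warped = []
    for row in corr:
        out_row = []
        for sy, sx in row:
            # Clamp matches that fall outside the source image.
            sy = min(max(sy, 0), src_h - 1)
            sx = min(max(sx, 0), src_w - 1)
            out_row.append(source[sy][sx])
        warped.append(out_row)
    return warped


source_image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# Every target pixel (y, x) is matched to the source pixel one column to the right.
correspondences = [[(y, x + 1) for x in range(3)] for y in range(3)]
warped_image = warp_with_correspondences(source_image, correspondences)
```

A practical implementation would typically use sub-pixel correspondences with bilinear interpolation instead of the integer lookup shown here.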