
# SPT: Spatial Pyramid Transformer for Image Captioning

[Paper] | TCSVT 2023

This repository contains the code implementation of the paper "SPT: Spatial Pyramid Transformer for Image Captioning". The checkpoints and features will be released soon.

## Overview

Canonical approaches to image captioning tend to rely on vision transformers for sentence generation. These methods typically treat the visual representation of an image as a sequential modeling problem (i.e., by flattening image patches) and achieve impressive performance. However, the spatial semantics lost when image grid features are flattened have received little attention to date. Moreover, current transformer models tend to maintain a full-length patch sequence throughout training and inference, which lacks hierarchical representation and makes it difficult to generate sentences at multiple levels of granularity. To this end, we propose the Spatial Pyramid Transformer (SPT), which progressively pools vision patches to shrink the sequence length, enabling caption generation with varying granularity over image grids.


Figure 1. Overview of the Spatial Pyramid Transformer (SPT) for Image Captioning.
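
To make the progressive pooling idea concrete, below is a minimal PyTorch sketch of one encoder stage that attends over flattened grid features and then applies 2x2 spatial pooling to shrink the token sequence. The module name, layer configuration, and choice of average pooling are illustrative assumptions for exposition, not the released SPT implementation.

```python
import torch
import torch.nn as nn


class PatchPoolingStage(nn.Module):
    """One encoder stage: self-attention over patch tokens, then 2x2 spatial pooling.

    A sketch of the pyramid idea only; the actual SPT layer design may differ.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x, grid):
        # x: (batch, grid * grid, dim) -- flattened grid features
        x = self.encoder(x)
        b, n, d = x.shape
        # Restore the 2D layout so pooling respects spatial neighborhoods.
        x = x.transpose(1, 2).reshape(b, d, grid, grid)
        x = self.pool(x)                      # halves each spatial side
        grid = x.shape[-1]
        x = x.flatten(2).transpose(1, 2)      # back to (batch, tokens, dim)
        return x, grid


if __name__ == "__main__":
    feats, grid = torch.randn(2, 14 * 14, 512), 14  # e.g., a 14x14 patch grid
    for stage in (PatchPoolingStage(512), PatchPoolingStage(512)):
        feats, grid = stage(feats, grid)
    print(feats.shape, grid)  # torch.Size([2, 9, 512]) 3 -- 14x14 -> 7x7 -> 3x3
```

Stacking such stages yields a pyramid of grid features at decreasing resolution, so the captioning decoder can attend to coarse or fine visual tokens as needed.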

## Dataset and Training Details

> **Note**
> For data preparation, feature download, and training details, please refer to this Repo.