PCLE Implementation

Prompted Contrastive Learning for Skeleton-based Action Recognition

Taiyi Su, Zheng Liu, Hanli Wang

Overview:

Large pre-trained vision-language models have shown great potential across various visual understanding tasks. However, in skeleton-based action recognition, it is challenging to design suitable prompts that guide the model to understand actions depicted by skeletal structures. Moreover, the modality gap between skeleton data and text descriptions hinders effective cross-modality contrastive learning. In this work, we propose a prompted contrastive learning framework for skeleton-based action recognition. Specifically, to automatically generate natural language descriptions of actions, we introduce prompted contrast with knowledge engine (PCKE), which utilizes a pre-trained large language model as the knowledge engine to generate text prompts as the input of the text encoder. Moreover, as the prompt words generated by the language model may not always be optimal, we explore the potential of the model to autonomously learn prompts. To this end, we present prompted contrast with learning engine (PCLE), a straightforward method that replaces the prompts' context words with learnable vectors. After separately encoding skeleton data and text prompts with the skeleton and text encoders, a skeleton-language interaction process is implemented to optimize feature alignment and reduce the modality gap. The interaction module contains two components: a skeleton-language bridger that connects the representation spaces of the skeleton and text encoders, and cross-modality attention that fuses information from both modalities, facilitating cross-modal knowledge transfer and refining feature alignment.
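The PCLE idea of replacing hand-written prompt context words with learnable vectors can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation; the context length, embedding dimension, and the use of random tensors in place of frozen class-name token embeddings are all assumptions.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Sketch of a PCLE-style learnable prompt: shared learnable context
    vectors followed by fixed per-class name embeddings (illustrative
    shapes; real class-name embeddings would come from the text tokenizer)."""
    def __init__(self, num_classes, ctx_len=8, name_len=4, embed_dim=512):
        super().__init__()
        # learnable context vectors replacing the hand-written prompt words
        self.ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        # stand-in for frozen class-name token embeddings (kept fixed)
        self.register_buffer("name_emb", torch.randn(num_classes, name_len, embed_dim))

    def forward(self):
        # one token sequence per class for the text encoder:
        # (num_classes, ctx_len + name_len, embed_dim)
        ctx = self.ctx.unsqueeze(0).expand(self.name_emb.size(0), -1, -1)
        return torch.cat([ctx, self.name_emb], dim=1)

prompts = LearnablePrompt(num_classes=60)()
print(prompts.shape)  # torch.Size([60, 12, 512])
```

Because only `self.ctx` carries gradients, training the downstream contrastive objective optimizes the prompt context while leaving the class names fixed.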

Method:

The pipeline of the proposed prompted contrastive skeleton-based action recognition framework is shown in Fig. 1, including two alternative implementations: PCKE and PCLE. In PCKE, the text encoder takes fixed synonymous descriptions generated for each action category as prompts, while in PCLE the prompt context is replaced with learnable vectors that can be optimized during training. After the skeleton sequence and the corresponding text prompt are encoded by the skeleton encoder and the text encoder, a skeleton-language interaction module is introduced to improve feature alignment between the two modalities. This interaction process contains two key components. First, a skeleton-language bridger extracts language-aware skeleton representations through cross-attention. Then, a cross-modality attention module further fuses skeleton and text information to enable deeper knowledge transfer across modalities. The whole framework is trained with classification supervision, skeleton-text contrastive supervision, and a ranking objective that encourages better alignment of skeleton and text features both before and after the interaction process.
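The two interaction components described above can be sketched with standard attention layers. The module names, dimensions, and residual/normalization placement below are illustrative assumptions, not the authors' code: the bridger lets skeleton tokens attend to text tokens, and a second cross-modality attention pass fuses the two modalities.

```python
import torch
import torch.nn as nn

class SkeletonLanguageInteraction(nn.Module):
    """Hedged sketch of the skeleton-language interaction process
    (hypothetical shapes and layer choices)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # bridger: skeleton queries attend to text keys/values,
        # yielding language-aware skeleton representations
        self.bridger = nn.MultiheadAttention(dim, heads, batch_first=True)
        # cross-modality attention: text queries attend to the bridged
        # skeleton tokens to fuse information from both modalities
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skel, text):
        # skel: (B, Ns, D) skeleton tokens; text: (B, Nt, D) prompt tokens
        bridged, _ = self.bridger(query=skel, key=text, value=text)
        skel = self.norm(skel + bridged)  # language-aware skeleton features
        fused, _ = self.fuse(query=text, key=skel, value=skel)
        return skel, fused

skel = torch.randn(2, 25, 256)   # e.g. 25 joints per frame-pooled sequence
text = torch.randn(2, 12, 256)   # encoded prompt tokens
s, f = SkeletonLanguageInteraction()(skel, text)
print(s.shape, f.shape)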

Fig. 1. Overview of the proposed Prompted Contrastive Learning framework.
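The skeleton-text contrastive supervision mentioned above is commonly realized as a symmetric InfoNCE-style loss over paired skeleton and text features. The sketch below assumes that form and a temperature of 0.07; the paper's exact objective (and its additional ranking term) may differ.

```python
import torch
import torch.nn.functional as F

def skeleton_text_contrastive(skel_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched skeleton-text pairs
    (illustrative; positives sit on the diagonal of the similarity matrix)."""
    s = F.normalize(skel_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = s @ t.t() / temperature      # (B, B) cosine similarities
    labels = torch.arange(s.size(0))      # i-th skeleton matches i-th text
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

loss = skeleton_text_contrastive(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```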

Result:

The comparison results of the proposed method and state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets are shown in Tables 1-3. Overall, the proposed model achieves state-of-the-art performance on all three benchmarks.

Table 1. Comparison with State-of-the-Art methods on NTU RGB+D.

Table 2. Comparison with State-of-the-Art methods on NTU RGB+D 120.

Table 3. Comparison with State-of-the-Art methods on NW-UCLA.

Figure 2 illustrates a qualitative t-SNE visualization comparing the GAP model with the proposed PCLE model. The skeleton embeddings are shown as dots, while the text embeddings are represented as diamonds. The samples belonging to the same class are indicated by the same color, and the misclassified samples are illustrated in gray. Regarding the GAP model, which lacks skeleton-language interaction, the clusters formed by each class lack distinctiveness. On the contrary, with the incorporation of skeleton-language interaction in the proposed PCLE model, the separation between classes becomes well-defined and distinct.

Fig. 2. Qualitative t-SNE illustration of the distribution of semantic embeddings on 9 classes in the NTU RGB+D dataset. (a) The GAP model and (b) the proposed PCLE model.

Source Code:

PCLE1.0.zip

Citation:

Please cite the following paper if you find this work useful:

Taiyi Su, Zheng Liu, Hanli Wang, Prompted Contrastive Learning for Skeleton-based Action Recognition, IEEE Transactions on Circuits and Systems for Video Technology, accepted, 2026.


Multimedia and Intelligent Computing Lab (MIC)