Multimedia and Intelligent Computing Laboratory

Information Coding

Information coding is a key technology in multimedia computing and communication. Its purpose is to remove redundant information from visual signals so that data can be stored and transmitted more efficiently. For intelligent applications, the goal of information coding is no longer limited to saving storage space and transmission bandwidth or providing users with high-definition visual services. It also aims to provide efficient visual data representations for intelligent visual analysis and processing.

Natural Language Moment Retrieval

Natural language moment retrieval is a task that aims to associate specific activity scenes in a video with given text descriptions. The challenge lies not only in accurately comprehending the temporal cross-modal context, but also in selecting video moments efficiently. Research on natural language moment retrieval has drawn great interest from the computer vision, natural language processing, and multimedia analysis communities, and has great potential in Internet video auditing, automated security monitoring, autonomous driving assistance, and other scenarios.

Cross-media Retrieval

Cross-media retrieval is a task that aims to retrieve visual information from a given textual description, or vice versa. The challenge behind this task is not just to understand the semantic information of each medium but, more importantly, to design effective methods that bridge the heterogeneity gap between different media. Cross-media retrieval has gained increasing attention in computer vision (CV), natural language processing (NLP), and their intersection. It also underpins many potential applications, ranging from recommender systems to search engines.
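One common way to compare items across media is to embed text and images into a shared vector space and rank candidates by similarity. The sketch below assumes such embeddings already exist; the three-dimensional vectors, the image names, and the helper functions are hypothetical and chosen only for illustration, and the ranking step is not meant to represent the lab's own method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings; in practice a text encoder and an image encoder
# trained to share one space would produce these.
text_query = [0.9, 0.1, 0.0]
image_gallery = {
    "img_dog": [0.8, 0.2, 0.1],
    "img_car": [0.0, 0.1, 0.9],
    "img_cat": [0.5, 0.5, 0.0],
}

ranking = rank_gallery(text_query, image_gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Swapping the roles of query and gallery gives the image-to-text direction with the same code.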

Image Captioning

Image captioning, or automatic image description, aims to generate natural language that describes the content of an image. It is a key problem at the intersection of computer vision and natural language processing and an important part of artificial intelligence and human-computer interaction technology, with broad application prospects in objective report generation, multimedia information processing, scene patrolling and monitoring, intelligent human-computer interaction, and other fields. In this task, the machine needs to understand the visual semantics of the image. The difficulty lies in understanding those semantics precisely, mining the objects, attributes, and relations in the image, and converting them into fluent language.

Video Captioning

Video captioning aims to generate a single sentence that describes a given video. The generated sentence is expected to depict the visual content accurately and comprehensively. In detail, the extracted visual and textual representations first interact with each other to establish a mapping or alignment, and then a sequence of words (which makes up the sentence) is inferred under the guidance of visual cues. The core of the task is therefore learning powerful representations and shared semantics across the two modalities, which are also the main challenges and research directions. More concrete problems include, but are not limited to, the long-tailed distribution of vocabulary, the limited scale of annotated datasets, inadequate exploitation of modality information, bias introduced by transferring pretrained extractors, and the accumulation of misalignment or generation errors. Applications include human-computer interaction, video surveillance, aids for visually impaired people, and vision-and-language navigation.

Video Paragraph Captioning

Video paragraph captioning aims to generate a detailed, rich paragraph that describes sampled video frames. It is a challenging task that attracts much interest in the vision-language area. Compared with video captioning, which generates a simplified sentence about the video content, video paragraph captioning methods need to understand complex information across multiple events and establish both local and long-term associations among the extracted visual features in order to produce elaborate sentences. Video paragraph captioning has great potential in applications such as human-machine communication, automatic news subtitling, and visual assistance for blind people.

Visual Storytelling

Visual storytelling is the task of generating a paragraph that describes an ordered photo stream. It goes one step beyond describing single images by producing a coherent paragraph for the whole stream. Visual storytelling can benefit a wide range of applications, such as image retrieval, image subtitling, and navigation for blind people. As a way to investigate machines' capabilities in understanding more complicated visual scenarios and composing more structured expressions, visual storytelling has attracted growing attention in the field of vision and language. In contrast to image captioning, where a description is generated for an individual image, visual storytelling is more complicated and challenging because it requires not only recognizing the various objects and relationships within images but also learning the dependencies between images. Furthermore, open-domain photo albums cover a broad range of topics, which results in highly variable vocabularies and expression styles for describing album content. How to generate accurate and descriptive story-like descriptions for the sequential images in a photo album therefore remains an open research question.

Action Detection

Temporal action detection is a well-known task in computer vision that aims to find the precise temporal boundaries of actions occurring in untrimmed videos. It aligns well with real-world settings, because every minute of a video may contain multiple actions to be detected and labelled. Temporal action detection is at the core of several downstream tasks, such as video classification, video captioning, and video editing.
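How precise a predicted boundary is can be quantified with temporal intersection over union (tIoU), a measure commonly used to evaluate this task. The sketch below is a minimal illustration under that assumption; the function name and the example segments (in seconds) are hypothetical, and this page does not state which metric the lab uses.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth action segments.
overlap_score = temporal_iou((2.0, 8.0), (4.0, 10.0))
disjoint_score = temporal_iou((0.0, 1.0), (2.0, 3.0))
print(overlap_score, disjoint_score)
```

A prediction is typically counted as correct when its tIoU with a ground-truth segment exceeds a chosen threshold.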

Visual Question Answering

Visual question answering is a task in which a model answers human questions about a given image. The challenge lies not only in adequately understanding images and text, but also in reasoning effectively over the resulting multimodal information to answer complex questions. Visual question answering research has gained extensive attention in computer vision, natural language processing, and multimedia analysis, and has great potential in assistance for visually impaired people, intelligent education, online shopping guidance, autonomous driving assistance, and other scenarios.

Visual Dialog

The visual dialog task requires an AI agent to interact with humans in multi-round dialogs grounded in a visual environment. The core challenges of this task are visual co-reference resolution and effective inference over multimodal information. As a step towards conversational visual AI, visual dialog can help visually impaired users understand their surroundings or social media content, help analysts make decisions based on large quantities of surveillance data, and let users interact naturally with an AI assistant.

Visual Commonsense Reasoning

Given an image and a question, visual commonsense reasoning (VCR) is a task that consists of two four-way multiple-choice subtasks. The model must not only answer the question but also provide a rationale that justifies its choice. In addition, the choices in VCR are expressed with more complex visual and linguistic expressions rather than simple short phrases. As a cognition-level reasoning task, VCR has gained increasing attention from both academia and industry, and it can be applied in many fields, such as human-computer interaction, preschool education, and assistance for visually impaired people.

Image Restoration

Image restoration is the process of recovering an image from a degraded version. Degradations come from the image-capture process (e.g., noise, lens blur), post-processing (e.g., JPEG compression), or photography in non-ideal conditions (e.g., haze, motion blur). It is a fundamental problem in image processing, but a highly ill-posed one because infinitely many feasible solutions exist. Image degradations not only impair visibility for human viewers but also degrade computer vision applications such as autonomous driving, drone flying, and surveillance systems. It is therefore crucial for image restoration to remove unwanted information while carefully preserving the desired content according to specific needs.

Image Enhancement

The principal objective of image enhancement is to process a given image so that the result is more suitable than the original for a specific application, such as improving visual quality for human viewers or improving visual understanding for downstream high-level computer vision tasks. Image enhancement typically improves image quality through post-processing techniques such as contrast enhancement, color reproduction, and sharpening. Experienced photographers can freely produce visually pleasing images with professional image-editing software (e.g., Adobe Photoshop and Lightroom). The general public wants such images too but lacks professional image-editing skills. This gap highlights the importance of user-oriented automatic image enhancement methods that let the general public produce the high-quality images they want. Furthermore, automatic image enhancement is already a built-in technology in displays, cameras, scanners, and photography applications, where it provides users with better-customized services.
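As a small illustration of contrast enhancement, the sketch below applies linear contrast stretching, a classical technique that rescales intensities to fill the full output range. It is only one simple example of the post-processing mentioned above, not the lab's method; the function name and the flat list of grayscale values are hypothetical.

```python
def stretch_contrast(pixels, out_min=0, out_max=255):
    """Linearly rescale grayscale values so they span [out_min, out_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # a flat image has no contrast to stretch
    scale = (out_max - out_min) / (hi - lo)
    return [round(out_min + (p - lo) * scale) for p in pixels]


# Hypothetical low-contrast image row: values occupy only 100..160.
low_contrast = [100, 120, 140, 160]
stretched = stretch_contrast(low_contrast)
print(stretched)
```

The same mapping applies per channel for color images, although doing so independently can shift colors.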

Image Quality Assessment

Image quality assessment (IQA) is a fundamental problem in both human and computational vision, and is critical in various real-world applications, such as image compression, image enhancement, and image restoration. It has evolved rapidly over the past two decades and has gained increasing attention from both academia and industry because of its wide range of applications. IQA can be divided into subjective IQA and objective IQA according to who or what performs the assessment. The goal of subjective IQA is to collect reliable mean opinion scores (MOS) from human subjects on the perceived quality of test images, which is the most straightforward and reliable method. Objective IQA aims to develop computational algorithms that automatically provide quality predictions consistent with human data. Applying IQA techniques in the field of multimedia is highly nontrivial, and IQA continues to play a fundamental role in the development of signal and image processing algorithms.
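The relationship between the two branches can be sketched in a few lines: human ratings are averaged into MOS values, and an objective metric is then judged by how well its predictions agree with them. The example below assumes the Spearman rank correlation coefficient (SRCC), a commonly used agreement measure, as the consistency criterion. All ratings, predicted scores, and function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def srcc(predicted, mos):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(mos))


# Hypothetical subjective study: four images, four raters, 1-5 scale.
ratings = [[5, 4, 5, 4], [2, 3, 2, 3], [4, 3, 4, 3], [1, 2, 1, 2]]
mos = [mean(r) for r in ratings]

# Hypothetical outputs of an objective IQA model for the same images.
predicted = [0.91, 0.40, 0.35, 0.10]

agreement = srcc(predicted, mos)
print(mos, agreement)
```

An SRCC close to 1 means the objective scores order the images almost exactly as the human subjects did.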