A roundup of multimodal papers from ACL, EMNLP, and AAAI.
## EMNLP21
### Multimodal
1、Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval
2、Multi-Modal Open-Domain Dialogue
https://arxiv.org/abs/2010.01082
3、Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos
https://arxiv.org/abs/2109.06398
4、Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding
https://arxiv.org/abs/2109.06400
5、R^3Net: Relation-embedded Representation Reconstruction Network for Change Captioning
https://arxiv.org/abs/2110.10328
6、Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion
7、CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
https://arxiv.org/abs/2109.00181
8、LayoutReader: Pre-training of Text and Layout for Reading Order Detection
https://arxiv.org/abs/2108.11591
9、On Pursuit of Designing Multi-modal Transformer for Video Grounding
https://arxiv.org/abs/2109.06085
10、Improving Multimodal Fusion via Mutual Dependency Maximisation
https://arxiv.org/abs/2109.00922
11、Relation-aware Video Reading Comprehension for Temporal Language Grounding
https://arxiv.org/abs/2110.05717
12、Multimodal Phased Transformer for Sentiment Analysis
13、Scalable Font Reconstruction with Dual Latent Manifolds
https://arxiv.org/abs/2109.06627
14、Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering
15、COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images
https://arxiv.org/abs/2109.10613
16、Joint Multi-modal Aspect-Sentiment Analysis with Auxiliary Cross-modal Relation Detection
17、Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
https://arxiv.org/abs/2109.06860
18、Visually Grounded Reasoning across Languages and Cultures
https://arxiv.org/abs/2109.13238
19、Region under Discussion for visual dialog
https://githubmemory.com/repo/mmazuecos/Region-under-discussion-for-visual-dialog
20、Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
https://arxiv.org/abs/2109.02401
21、Natural Language Video Localization with Learnable Moment Proposals
https://arxiv.org/abs/2109.10678
22、Point-of-Interest Type Prediction using Text and Images
https://arxiv.org/abs/2109.00602
23、Journalistic Guidelines Aware News Image Captioning
https://arxiv.org/abs/2109.02865
24、Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
https://arxiv.org/abs/2109.04448
25、Visual News: Benchmark and Challenges in News Image Captioning
https://underline.io/lecture/37789-visual-news-benchmark-and-challenges-in-news-image-captioning
26、HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints
https://arxiv.org/abs/2109.04443
27、WhyAct: Identifying Action Reasons in Lifestyle Vlogs
https://arxiv.org/abs/2109.02747
28、Hitting your MARQ: Multimodal ARgument Quality Assessment in Long Debate Video
29、Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions
30、CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization
31、Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
https://arxiv.org/abs/2109.04014
32、Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
33、Integrating Visuospatial, Linguistic, and Commonsense Structure into Story Visualization
https://arxiv.org/abs/2110.10834
34、VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
https://arxiv.org/abs/2109.14084
35、StreamHover: Livestream Transcript Summarization and Annotation
https://arxiv.org/abs/2109.05160
36、Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries
37、NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
https://arxiv.org/abs/2104.0589
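Several of the EMNLP entries above (most directly entry 34, VideoCLIP, and the retrieval work in entry 1) build on contrastive cross-modal pre-training. As a quick orientation, here is a minimal NumPy sketch of the symmetric InfoNCE objective that underlies this family of methods; the function name, shapes, and temperature value are illustrative assumptions, not taken from any paper's released code.

```python
# Minimal sketch of the symmetric InfoNCE objective behind CLIP-style
# contrastive cross-modal pre-training (cf. entry 34, VideoCLIP).
# Function name and shapes are illustrative, not from any paper's code.
import numpy as np

def info_nce_loss(video_emb: np.ndarray, text_emb: np.ndarray,
                  temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over cosine-similarity logits.

    video_emb, text_emb: (batch, dim) L2-normalized embeddings; row i
    of each matrix is a matched video-text pair.
    """
    # Scaled cosine similarities between every video and every text: (batch, batch)
    logits = video_emb @ text_emb.T / temperature
    labels = np.arange(logits.shape[0])  # positives sit on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_prob[labels, labels].mean())

    # Average the video->text and text->video retrieval directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls matched video-text pairs together and pushes apart every mismatched pair in the batch, which is what makes zero-shot retrieval possible after pre-training.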
## ACL21 Multimodal
Full list: Annual Meeting of the Association for Computational Linguistics (2021) - ACL Anthology
- PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling
- Control Image Captioning Spatially and Temporally
- Hierarchical Context-aware Network for Dense Video Event Captioning
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
- A Large-Scale Chinese Multimodal NER Dataset with Speech Clues
- MultiMET: A Multimodal Dataset for Metaphor Understanding
- HateCheck: Functional Tests for Hate Speech Detection Models
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models
- CTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
- VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
- Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks
- Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation
- Beyond Sentence-Level End-to-End Speech Translation: Context Helps
- Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?
- KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation
- Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task
- Multilingual Speech Translation from Efficient Finetuning of Pretrained Models
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
- Self-Supervised Multimodal Opinion Summarization
- PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
- QASR: QCRI Aljazeera Speech Resource - A Large Scale Annotated Arabic Speech Corpus
## AAAI21
1. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
2. VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
3. RpBERT: A Text-Image Relation Propagation-Based BERT Model for Multimodal NER
4. Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding
5. Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling
6. Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance
7. Non-Autoregressive Coarse-to-Fine Video Captioning
8. Augmented Partial Mutual Learning with Frame Masking for Video Captioning
9. Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval
10. Dense Events Grounding in Video
11. Boundary Proposal Network for Two-Stage Natural Language Video Localization
12. Audio-Oriented Multimodal Machine Comprehension via Dynamic Inter- and Intra-modality Attention
13. MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records
14. VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization
15. FL-MSRE: A Few-Shot Learning based Approach to Multimodal Social Relation Extraction
16. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis
17. Multi-modal Multi-label Emotion Recognition with Heterogeneous Hierarchical Message Passing
18. Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multimodal Data
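Many of the sentiment and emotion entries across these lists (e.g., CTFN at ACL21, items 16 and 17 at AAAI21) revolve around fusing text, audio, and visual features. Below is a self-contained NumPy sketch of a generic gated late-fusion step, one common pattern in this line of work; all weight names and dimensions are made up for illustration and do not reproduce any specific paper's model.

```python
# Illustrative sketch (not from any listed paper) of gated late fusion:
# each modality is projected to a shared space and a learned gate
# weights its contribution to the fused representation.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text: np.ndarray, audio: np.ndarray, visual: np.ndarray,
                 weights: dict) -> np.ndarray:
    """Fuse per-modality feature vectors (each of shape (dim,)) into one vector."""
    # Project each modality into a shared space.
    h = {m: np.tanh(weights[f"W_{m}"] @ x)
         for m, x in (("t", text), ("a", audio), ("v", visual))}
    concat = np.concatenate(list(h.values()))
    # One gate value per modality, conditioned on all three projections.
    z = sigmoid(weights["W_gate"] @ concat)
    # Gate-weighted sum of the projected modalities.
    return z[0] * h["t"] + z[1] * h["a"] + z[2] * h["v"]

# Hypothetical dimensions: 16-d inputs, 8-d shared space.
dim, shared = 16, 8
weights = {f"W_{m}": rng.normal(size=(shared, dim)) for m in "tav"}
weights["W_gate"] = rng.normal(size=(3, 3 * shared))
fused = gated_fusion(rng.normal(size=dim), rng.normal(size=dim),
                     rng.normal(size=dim), weights)
print(fused.shape)  # (8,)
```

The gate lets the model down-weight an uninformative modality per example, which is one reason this pattern recurs in multimodal sentiment work where audio or visual signals are often noisy.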