Congratulations to Dr. Adriana Kovashka for her recent NSF CAREER award, for Natural Narratives and Multimodal Context as Weak Supervision for Learning Object Categories.
This project develops a framework to train computer vision models for detection of objects from weak, naturally-occurring supervision of language (text or speech) and additional multimodal signals. It considers dynamic settings, where humans interact with their visual environment and refer to the encountered objects, e.g., ?Carefully put the tomato plants in the ground? and ?Please put the phone down and come set the table,? and captions written for a human audience to complement an image, e.g., news article captions. The challenge of using such language-based supervision for training detection systems is that along with useful signal, the speech contains many irrelevant tokens. The project will benefit society by exploring novel avenues for overcoming this challenge and reducing the need for expensive and potentially unnatural crowdsourced labels for training. It has the potential to make object detection systems more scalable and thus more usable by a broad user base in a variety of settings. The resources and tools developed would allow natural, lightweight learning in different environments, e.g., different languages or types of imagery where the well-known object categories are not useful or where there is a shift in both the pixels as well as the way in which humans refer to objects (different cultures, medicine, art). This project opens possibilities for learning in vivo rather than in vitro; while the focus here is on object categories, multimodal weak supervision is useful for a larger variety of tasks. Research and education are integrated through local community outreach and research mentoring for students from lesser-known universities, new programs for student training including honing graduate students' writing skills, and development of interactive educational modules and demos based on research findings.
This project creatively connects two domains, vision-and-language, and object detection, and pioneers training of object detection models with weak language supervision and a large vocabulary of potential classes. The impact of noise in the language channel will be mitigated through three complementary techniques that model visual concreteness of words, to what extent the text refers to the visual environment it appears with, and whether the weakly-supervised models that are learned are logically consistent. Two complementary word-region association mechanisms will be used (metric learning and cross-modal transformers), whose application is novel for weakly-supervised detection. Importantly, to make detection feasible, not only the semantics of image-text pairs, but their discourse relationship, will be captured. To facilitate and disambiguate the association of words to a physical environment, the latter will be represented through additional modalities, namely sound, motion, depth and touch, which are either present in the data or estimated. This project advances knowledge of how multimodal cues contextualize the relation between image and text; no prior work has modeled image-text relationships along multiple channels (sound, depth, touch, motion). Finally, to connect the appearance of objects to the purpose and use of these objects, relationships between objects, properties and actions will be semantically organized in a graph, and grammars to represent activities involving objects will be extracted, still maintaining the weakly-supervised setting.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.