Afra Alishahi: Decoding what deep, grounded neural models learn about language

Abstract: Humans learn to understand speech from weak and noisy supervision: they extract structure and meaning from speech simply by being exposed to utterances situated and grounded in their daily sensory experience. Emulating this remarkable skill has been the goal of numerous studies; however, researchers have often used severely simplified settings where either the language input or the extralinguistic sensory input, or both, are small-scale and symbolically represented. Recently, deep neural network models have been used successfully for visually grounded language understanding, where representations of images are mapped to those of their written or spoken descriptions. Despite their high performance, these architectures come at a cost: we know little about the type of linguistic knowledge these models capture from the input signal in order to perform their target task. I present a series of studies on modelling visually grounded language learning and analysing the emergent linguistic representations in these models. Using variations of recurrent neural networks to model the temporal nature of spoken language, we examine how form- and meaning-based linguistic knowledge emerges from the input signal.
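
One common way to realize the mapping described above, sketched here as an illustration rather than the speaker's exact architecture, is a recurrent speech encoder trained against precomputed image features with a margin-based contrastive loss. The PyTorch sketch below uses a GRU encoder over MFCC frames; the layer sizes, feature dimensions, and margin value are illustrative assumptions.

```python
# Minimal sketch (assumptions: GRU speech encoder, precomputed CNN image
# features, margin-based contrastive loss) of a visually grounded speech model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechEncoder(nn.Module):
    def __init__(self, n_mfcc=13, hidden=512, embed=512):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, embed)

    def forward(self, mfcc):                     # mfcc: (batch, frames, n_mfcc)
        _, h = self.rnn(mfcc)                    # h: (layers, batch, hidden)
        return F.normalize(self.proj(h[-1]), dim=-1)


class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=2048, embed=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed)   # maps precomputed CNN features

    def forward(self, feats):                    # feats: (batch, feat_dim)
        return F.normalize(self.proj(feats), dim=-1)


def contrastive_loss(speech_emb, image_emb, margin=0.2):
    """Hinge loss pushing matched speech/image pairs above mismatched ones."""
    sims = speech_emb @ image_emb.t()            # (batch, batch) cosine similarities
    pos = sims.diag().unsqueeze(1)               # similarities of the true pairs
    cost_s = F.relu(margin + sims - pos)         # speech as anchor vs. wrong images
    cost_i = F.relu(margin + sims - pos.t())     # image as anchor vs. wrong speech
    eye = torch.eye(sims.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(eye, 0.0)        # ignore the diagonal (true pairs)
    cost_i = cost_i.masked_fill(eye, 0.0)
    return (cost_s + cost_i).mean()


# Toy usage with random tensors standing in for real data.
speech = SpeechEncoder()(torch.randn(8, 200, 13))   # 8 utterances, 200 MFCC frames
images = ImageEncoder()(torch.randn(8, 2048))       # 8 matching image feature vectors
loss = contrastive_loss(speech, images)
loss.backward()
```

With a model of this kind, the hidden states of the recurrent layers can then be probed to ask which layers encode form-related information and which encode meaning, which is the kind of analysis the abstract refers to.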