Street-level imagery has emerged as a valuable tool for observing large-scale urban spaces with unprecedented detail. However, previous studies have been limited to analyzing individual street-level images. This approach falls short in representing the characteristics of a spatial unit, such as a street or grid, which may contain a varying number of street-level images ranging from a few to several hundred. As a result, a more comprehensive and representative approach is required to capture the complexity and diversity of urban environments at different spatial scales. To address this issue, this study proposes a deep learning-based module called Vision-LSTM, which obtains a vector representation of a spatial unit from the varying number of street-level images it contains. The effectiveness of the module is validated through experiments on recognizing urban villages, achieving reliable recognition results (overall accuracy: 91.6%) through multimodal learning that combines street-level imagery with remote sensing imagery and social sensing data. Compared with existing image fusion methods, Vision-LSTM is substantially more effective at capturing associations between street-level images. The proposed module can provide a more comprehensive understanding of urban spaces, enhancing the research value of street-level imagery and facilitating multimodal learning-based urban research.
@article{huang2023,
title = {Comprehensive Urban Space Representation with Varying Numbers of Street-Level Images},
author = {Huang, Yingjing and Zhang, Fan and Gao, Yong and Tu, Wei and Duarte, Fabio and Ratti, Carlo and Guo, Diansheng and Liu, Yu},
year = {2023},
month = dec,
journal = {Computers, Environment and Urban Systems},
volume = {106},
pages = {102043},
issn = {0198-9715}
}
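The abstract describes Vision-LSTM only at a high level. The sketch below illustrates one plausible way such a module could aggregate a variable number of per-image feature vectors per spatial unit: the sequence is packed, passed through an LSTM, and the final hidden state is taken as the unit-level representation. The class name, feature dimensions, hidden size, and use of the last hidden state are assumptions for illustration, not the authors' released implementation.

# Minimal sketch (assumed design, not the paper's code): an LSTM that turns a
# variable-length set of street-level image features into one fixed-size vector
# per spatial unit. Dimensions and naming are hypothetical.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class VisionLSTMSketch(nn.Module):
    def __init__(self, img_feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # One per-image feature vector is consumed per time step.
        self.lstm = nn.LSTM(img_feat_dim, hidden_dim, batch_first=True)

    def forward(self, img_feats: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, max_num_images, img_feat_dim), zero-padded.
        # lengths:   (batch,) actual number of street-level images per spatial unit.
        packed = pack_padded_sequence(
            img_feats, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # Final hidden state serves as the unit-level representation.
        return h_n[-1]  # (batch, hidden_dim)


# Example: two spatial units containing 5 and 3 street-level images, respectively.
feats = torch.randn(2, 5, 512)
lengths = torch.tensor([5, 3])
unit_vec = VisionLSTMSketch()(feats, lengths)
print(unit_vec.shape)  # torch.Size([2, 256])

Under these assumptions, the resulting unit-level vector could then be concatenated with remote sensing and social sensing features for the kind of multimodal classification the abstract reports; that fusion step is not shown here.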