Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction

Junbo Zhang, Yu Zheng, Dekang Qi (Microsoft Research) 2017

Keras implementation :


  • Forecasting the flow of crowds
  • In this paper, we predict two types of crowd flows : inflow and outflow

  • Inflow and outflow of crowds are affected by the following
    • Spatial dependencies
    • Temporal dependencies
    • External influences, such as weather and events
  • Contributions
    • ST-ResNet employs convolution-based residual networks to model nearby and distant spatial dependencies between any two regions
    • three categories of temporal properties : temporal closeness, period, and trend. ST-ResNet uses three residual networks to model these, respectively
    • ST-ResNet dynamically aggregates the output of the three aforementioned networks.

Formulation of Crowd Flows Problem

  • Region : we partition a city into an I*J grid map
  • Inflow/outflow : Let P be a collection of trajectories at the t-th time interval. For a grid (i, j) that lies at the i-th row and j-th column, the inflow and outflow of the crowds at the time interval t are defined respectively as

    x_t^{in, i, j} = Σ_{Tr ∈ P} |{k > 1 : g_{k-1} ∉ (i, j) ∧ g_k ∈ (i, j)}|
    x_t^{out, i, j} = Σ_{Tr ∈ P} |{k ≥ 1 : g_k ∈ (i, j) ∧ g_{k+1} ∉ (i, j)}|

    where
    • Tr : g_1 → g_2 → … → g_{|Tr|} is a trajectory in P
    • g_k is the geospatial coordinate
    • g_k ∈ (i, j) means the point g_k lies within grid (i, j), and vice versa
    • | · | denotes the cardinality of a set
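The inflow/outflow definitions can be sketched in plain Python (a hypothetical helper, not the paper's code), where each trajectory is a list of (row, col) grid cells already mapped from geospatial coordinates:

```python
# Hypothetical sketch: count inflow/outflow for one grid cell (i, j)
# from trajectories observed during one time interval.
def inflow_outflow(trajectories, cell):
    inflow = outflow = 0
    for tr in trajectories:
        for k in range(1, len(tr)):
            prev, cur = tr[k - 1], tr[k]
            if prev != cell and cur == cell:   # crowd entering (i, j)
                inflow += 1
            if prev == cell and cur != cell:   # crowd leaving (i, j)
                outflow += 1
    return inflow, outflow
```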

Deep Spatio-Temporal Residual Networks

  • comprised of four major components modeling temporal closeness, period, trend, and external influence, respectively.

  • First, we turn inflow and outflow throughout a city at each time interval into a 2-channel image-like matrix.
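A minimal numpy illustration of this 2-channel representation (toy numbers, not real data):

```python
import numpy as np

# Toy inflow/outflow grids for a 2x2 city at one time interval.
inflow = np.array([[3, 0],
                   [1, 2]])
outflow = np.array([[1, 1],
                    [0, 4]])

# Stack into a 2-channel image-like tensor of shape (2, I, J).
x_t = np.stack([inflow, outflow])
```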

  • Then, we divide the time axis into three fragments, denoting recent time, near history, and distant history. The 2-channel flow matrices of intervals in each time fragment are then fed into the first three components separately to model the aforementioned three temporal properties: closeness, period, and trend
    • the three components share the same network structure (a sequence of Residual Units)
    • the outputs of the three components are fused based on parametric matrices, which assign different weights to the results of different components in different regions
  • In the external component, we manually extract some features from external datasets, such as weather conditions and events, and feed them into a two-layer fully-connected neural network

  • The main output X_Res and the external output X_Ext are integrated together. Then, the final prediction X̂_t is mapped into [-1, 1] using the Tanh function.

Structures of the First Three Components

  • Do not use subsampling, but only convolutions
  • closeness component
    • input : the recent-time sequence [X_{t-l_c}, …, X_{t-1}] ; concatenate them along the first (channel) axis into X_c^{(0)}
    • X_c^{(0)} is followed by conv1
    • Residual Unit : stack L residual units to capture very large citywide dependencies
    • a Residual Unit is a combination of “ReLU + Convolution”, and Batch Normalization is added before ReLU
    • on top of the L-th residual unit, we append a convolutional layer conv2
    • output of the closeness component is X_c^{(L+2)}
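The Residual Unit described above (BN → ReLU → Conv, repeated, plus the identity shortcut) can be sketched in numpy. This is an illustrative stand-in for the paper's Keras layers: batch normalization is reduced to per-channel standardization, and the convolution loops are unoptimized for clarity.

```python
import numpy as np

def conv2d_same(x, w):
    """'same'-padded 2-D convolution; x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for r in range(h):
            for c in range(wd):
                out[o, r, c] = np.sum(xp[:, r:r + k, c:c + k] * w[o])
    return out

def bn_relu_conv(x, w):
    # Batch Normalization (simplified to per-channel standardization),
    # then ReLU, then convolution.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    x = (x - mu) / (sd + 1e-5)
    return conv2d_same(np.maximum(x, 0.0), w)

def residual_unit(x, w1, w2):
    # Two BN + ReLU + Conv blocks plus the identity shortcut.
    return x + bn_relu_conv(bn_relu_conv(x, w1), w2)
```

Because there is no subsampling, the output keeps the input's I x J spatial shape, which is what lets many units be stacked to widen the receptive field citywide.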

  • period component
    • assume that there are l_p time intervals from the period fragment and the period is p
    • input : [X_{t-l_p·p}, X_{t-(l_p-1)·p}, …, X_{t-p}]
    • output : X_p^{(L+2)}
    • in implementation, p is equal to one day (daily periodicity)
  • trend component
    • l_q is the length of the trend-dependent sequence and q is the trend span
    • input : [X_{t-l_q·q}, X_{t-(l_q-1)·q}, …, X_{t-q}]
    • output : X_q^{(L+2)}
    • in implementation, q is equal to one week (weekly trend)
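To make the three fragments concrete, here is a sketch (an assumed helper, not from the paper) of which past intervals each component consumes, assuming half-hour intervals so that one day is p = 48 and one week is q = 336:

```python
# Past time-interval indices fed to the closeness, period, and trend
# components, given current interval t, sequence lengths l_c/l_p/l_q,
# period p, and trend span q (all in intervals).
def fragment_indices(t, l_c, l_p, l_q, p, q):
    closeness = [t - i for i in range(1, l_c + 1)]      # recent time
    period    = [t - i * p for i in range(1, l_p + 1)]  # near history
    trend     = [t - i * q for i in range(1, l_q + 1)]  # distant history
    return closeness, period, trend
```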

The Structure of the External Component

  • mainly consider weather, holiday events, and metadata (DayOfWeek, Weekday/Weekend)

  • stack two fully-connected layers upon the external feature vector E_t

    • first layer : embedding layer
    • second layer : maps low to high dimensions so that the output has the same shape as X_t
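A numpy sketch of the two fully-connected layers; the feature dimension, embedding size, and flow-map shape below are illustrative assumptions:

```python
import numpy as np

# Hypothetical shapes: 20 external features, 10-d embedding, 2 x 8 x 8 flow map.
def external_component(e_t, W1, b1, W2, b2, out_shape=(2, 8, 8)):
    h = np.maximum(W1 @ e_t + b1, 0.0)   # first layer: embedding (+ ReLU)
    out = W2 @ h + b2                    # second layer: low -> high dimensions
    return out.reshape(out_shape)        # same shape as the flow map X_t

rng = np.random.default_rng(0)
e_t = rng.random(20)
W1, b1 = rng.random((10, 20)), np.zeros(10)
W2, b2 = rng.random((128, 10)), np.zeros(128)   # 128 = 2 * 8 * 8
x_ext = external_component(e_t, W1, b1, W2, b2)
```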


  • flows of two regions are all affected by closeness, period, and trend, but the degrees of influence may be very different ; hence, parametric-matrix-based fusion :

    X_Res = W_c ∘ X_c^{(L+2)} + W_p ∘ X_p^{(L+2)} + W_q ∘ X_q^{(L+2)}

  • ∘ is the Hadamard product (i.e., element-wise multiplication)
  • W_c, W_p, W_q are learnable parameters

  • fusing the external component : X̂_t = tanh(X_Res + X_Ext)
  • objective : minimize the mean squared error between the predicted flow matrix and the true flow matrix
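The whole fusion step can be put together in numpy with toy shapes and weights (Xc, Xp, Xq stand for the three components' outputs; all values here are illustrative):

```python
import numpy as np

I, J = 2, 3
Xc, Xp, Xq = np.ones((I, J)), 2 * np.ones((I, J)), 3 * np.ones((I, J))

# Learnable parametric matrices (toy values): one weight per region per component.
Wc, Wp, Wq = np.full((I, J), 0.5), np.full((I, J), 0.3), np.full((I, J), 0.2)

# Parametric-matrix-based fusion: Hadamard products, summed.
X_res = Wc * Xc + Wp * Xp + Wq * Xq

# Fuse the external output and squash to [-1, 1] with Tanh.
X_ext = np.zeros((I, J))
X_hat = np.tanh(X_res + X_ext)

# Training objective: mean squared error against the true flow matrix.
X_true = np.full((I, J), np.tanh(1.7))
mse = np.mean((X_true - X_hat) ** 2)
```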


  • Datasets

  • Baselines
    • HA : historical average (e.g., flow at the same time in previous weeks)
    • ST-ANN : first extracts spatial (nearby 8 regions’ values) and temporal (8 previous time intervals) features, then feeds them into an artificial neural network
    • DeepST : (Zhang et al. 2016)
  • Preprocessing
    • min-max normalization : [-1, 1] (tanh)
    • one-hot encoding for external data
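The min-max step can be sketched as follows (a hypothetical helper; lo/hi are the extremes observed in the training data):

```python
# Scale raw flow counts into [-1, 1] so targets match the Tanh output range,
# and invert the scaling to recover real flow values at evaluation time.
def minmax_scale(x, lo, hi):
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def minmax_inverse(y, lo, hi):
    return (y + 1.0) / 2.0 * (hi - lo) + lo
```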
  • Result