In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors.
As notation, we consider a tensor $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. A pooling layer outputs a tensor $y \in \mathbb{R}^{H' \times W' \times C'}$.
We define two variables, the "filter size" (aka "kernel size") $f$ and the "stride" $s$. Sometimes, it is necessary to use a different filter size and stride for the horizontal and vertical directions. In such cases, we define the 4 variables $(f_H, s_H, f_W, s_W)$.
The receptive field of an entry in the output tensor $y$ is the set of all entries in $x$ that can affect that entry.
Define
$$y_{0,0,c} = \max\big( x_{0:f,\ 0:f,\ c} \big),$$
where $0\!:\!f$ means the range $0, 1, \dots, f-1$. Note that we need to avoid the off-by-one error. The next input is
$$y_{0,1,c} = \max\big( x_{0:f,\ s:s+f,\ c} \big),$$
and so on. The receptive field of $y_{0,1,c}$ is $x_{0:f,\ s:s+f,\ c}$, so in general,
$$y_{i,j,c} = \max\big( x_{is:is+f,\ js:js+f,\ c} \big).$$
If the horizontal and vertical filter sizes and strides differ, then in general,
$$y_{i,j,c} = \max\big( x_{i s_H : i s_H + f_H,\ j s_W : j s_W + f_W,\ c} \big).$$
More succinctly, we can write $y = \mathrm{MaxPool}(x; f, s)$. If $H$ is not expressible as $f + s\,n$ for some nonnegative integer $n$ (and similarly for $W$), then for computing the entries of the output tensor on the boundaries, max pooling would attempt to take as inputs variables off the tensor. In this case, how those non-existent variables are handled depends on the padding conditions.
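A minimal NumPy sketch of this computation. The function name and the "valid"-boundary assumption (that $H - f$ and $W - f$ are divisible by $s$, so no padding is needed) are illustrative choices, not from the source:

```python
import numpy as np

def max_pool(x, f, s):
    """Max pooling of x with shape (H, W, C), filter size f, stride s.

    Assumes 'valid' boundaries: (H - f) and (W - f) are divisible by s.
    """
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # Receptive field of y[i, j, c] is x[i*s : i*s+f, j*s : j*s+f, c].
            y[i, j, :] = x[i*s:i*s+f, j*s:j*s+f, :].max(axis=(0, 1))
    return y

x = np.random.rand(6, 6, 3)
y = max_pool(x, f=2, s=2)   # output shape (3, 3, 3)
```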
Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor has shape $1 \times 1 \times C$ and the receptive field of $y_{c}$ is all of $x_{:,:,c}$. That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head.
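In NumPy terms (continuing the sketch above, with `x` of shape `(H, W, C)`), this reduces to a one-liner:

```python
y = x.max(axis=(0, 1))   # shape (C,): one maximum per entire channel
```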
Mixed Pooling is a linear sum of max pooling and average pooling:
$$\mathrm{MixedPool}(x) = \lambda\, \mathrm{MaxPool}(x) + (1-\lambda)\, \mathrm{AvgPool}(x),$$
where $\lambda \in [0, 1]$ is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.
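A sketch of the fixed-hyperparameter case, reusing `max_pool` from the max-pooling example above; the `avg_pool` helper and the name `lam` are illustrative:

```python
import numpy as np

def avg_pool(x, f, s):
    # Same windowing as max_pool, but averaging each receptive field.
    H, W, C = x.shape
    H_out, W_out = (H - f) // s + 1, (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            y[i, j, :] = x[i*s:i*s+f, j*s:j*s+f, :].mean(axis=(0, 1))
    return y

def mixed_pool(x, f, s, lam=0.5):
    # lam = 1 recovers max pooling, lam = 0 recovers average pooling.
    return lam * max_pool(x, f, s) + (1 - lam) * avg_pool(x, f, s)
```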
Lp Pooling is like average pooling, but uses the Lp norm average instead of the plain average:
$$y = \left( \frac{1}{N} \sum_{x' \in R} |x'|^{p} \right)^{1/p},$$
where $N$ is the size of the receptive field $R$, and $p \geq 1$ is a hyperparameter. If all activations are non-negative, then average pooling is the case of $p = 1$, and max pooling is the case of $p \to \infty$. Square-root pooling is the case of $p = 2$.
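A sketch following the same windowing as above, assuming non-negative activations (e.g. after a ReLU):

```python
import numpy as np

def lp_pool(x, f, s, p=2.0):
    # Lp-norm average over each receptive field; p = 1 gives average pooling,
    # large p approaches max pooling (for non-negative activations).
    H, W, C = x.shape
    H_out, W_out = (H - f) // s + 1, (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = np.abs(x[i*s:i*s+f, j*s:j*s+f, :])
            y[i, j, :] = np.mean(patch ** p, axis=(0, 1)) ** (1.0 / p)
    return y
```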
Stochastic pooling samples a random activation $x'$ from the receptive field with probability $\frac{x'}{\sum_{x''} x''}$, i.e. proportional to its magnitude. In expectation, it yields an activation-weighted average of the receptive field, which biases the output toward larger activations compared to plain average pooling.
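A per-window sketch (it could be dropped into the loop of the `max_pool` example above); assumes non-negative activations so the sampling probabilities are well defined:

```python
import numpy as np

def stochastic_pool_window(patch, rng=None):
    # patch: non-negative activations of one receptive field (any shape).
    rng = np.random.default_rng() if rng is None else rng
    flat = patch.ravel()
    probs = flat / flat.sum()          # P(pick x') = x' / sum of activations
    return rng.choice(flat, p=probs)   # sample one activation from the window
```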
Softmax pooling is like max pooling, but uses the softmax function, i.e.
$$y = \frac{\sum_{x' \in R} e^{\beta x'}\, x'}{\sum_{x' \in R} e^{\beta x'}},$$
where $\beta > 0$. Average pooling is the case of $\beta \to 0$, and max pooling is the case of $\beta \to \infty$.
Local Importance-based Pooling generalizes softmax pooling by
$$y = \frac{\sum_{x' \in R} e^{g(x')}\, x'}{\sum_{x' \in R} e^{g(x')}},$$
where $g$ is a learnable function.
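A per-window sketch of softmax pooling with temperature `beta`; replacing `beta * flat` with the output of a learned function of the activations would give the Local Importance-based Pooling variant (the function name is illustrative):

```python
import numpy as np

def softmax_pool_window(patch, beta=1.0):
    # Softmax-weighted average of one receptive field.
    # beta -> 0 approaches average pooling; beta -> infinity approaches max pooling.
    flat = patch.ravel()
    w = np.exp(beta * (flat - flat.max()))   # subtract the max for numerical stability
    return (w * flat).sum() / w.sum()
```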
Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling used in R-CNNs for object detection.
Covariance pooling computes the covariance matrix of the vectors $\{ x_{i,j,:} : 1 \le i \le H,\ 1 \le j \le W \}$, which is then flattened to a $C^2$-dimensional vector $y$. Global covariance pooling is used similarly to global max pooling. As average pooling computes the average, which is a first-degree statistic, and covariance is a second-degree statistic, covariance pooling is also called "second-order pooling". It can be generalized to higher-order poolings.
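A sketch of the global variant: each spatial position is treated as a $C$-dimensional vector, and the $C \times C$ covariance across positions is flattened:

```python
import numpy as np

def global_covariance_pool(x):
    # x: (H, W, C) feature map.
    H, W, C = x.shape
    vecs = x.reshape(H * W, C)          # one C-dimensional vector per spatial position
    cov = np.cov(vecs, rowvar=False)    # (C, C) covariance matrix across positions
    return cov.ravel()                  # flattened to a C^2-dimensional vector
```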
Blur Pooling means applying a blurring method before downsampling. For example, Rect-2 blur pooling means taking an average pooling with a $2 \times 2$ kernel at stride 1, then taking every second pixel (identity with stride 2).
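A minimal sketch of the Rect-2 case described above, assuming no padding (so the blurred map has shape $(H-1) \times (W-1)$ before subsampling):

```python
import numpy as np

def rect2_blur_pool(x):
    # x: (H, W, C). Step 1: 2x2 average ("Rect-2" blur) at stride 1.
    blurred = (x[:-1, :-1] + x[:-1, 1:] + x[1:, :-1] + x[1:, 1:]) / 4.0
    # Step 2: subsample every second pixel (identity with stride 2).
    return blurred[::2, ::2]
```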
BERT-like pooling uses a dummy [CLS] token ("classification"). For classification, the output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution, which is the network's predicted class distribution. This is the approach used by the original ViT and the Masked Autoencoder.
Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good.
Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors $x_1, \dots, x_n$, which might be thought of as the output vectors of a layer of a ViT. It then applies a feedforward layer on each vector, resulting in a matrix $V$. This is then sent to a multiheaded attention, resulting in $\mathrm{MultiheadedAttention}(Q, V, V)$, where $Q$ is a matrix of trainable parameters. This was first proposed in the Set Transformer architecture.
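A hedged PyTorch sketch of this idea. The class name, the use of a single learnable query vector, and the placement of the per-token feedforward layer are illustrative assumptions rather than the exact published architecture:

```python
import torch
import torch.nn as nn

class MAPHead(nn.Module):
    """Sketch of multihead attention pooling: trainable queries Q attend
    over the (feedforward-transformed) tokens V to produce pooled vectors."""
    def __init__(self, dim, num_heads=8, num_queries=1):
        super().__init__()
        # dim must be divisible by num_heads.
        self.query = nn.Parameter(torch.randn(num_queries, dim))   # trainable Q
        self.ff = nn.Linear(dim, dim)                              # per-token feedforward
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, n, dim)
        v = self.ff(tokens)               # V = feedforward applied to each vector
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, v, v)    # MultiheadedAttention(Q, V, V)
        return pooled.squeeze(1)          # (batch, dim) when num_queries == 1

head = MAPHead(dim=64)
out = head(torch.randn(2, 16, 64))        # pooled representation per sequence
```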
Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling.
Local pooling layers coarsen the graph via downsampling. We present here several learnable local pooling strategies that have been proposed. In each case, the input graph is represented by a matrix $\mathbf{X}$ of node features and the graph adjacency matrix $\mathbf{A}$. The output is the new matrix $\mathbf{X}'$ of node features and the new graph adjacency matrix $\mathbf{A}'$.
We first compute a projection score for each node,
$$\mathbf{y} = \frac{\mathbf{X}\mathbf{p}}{\lVert \mathbf{p} \rVert},$$
where $\mathbf{p}$ is a learnable projection vector. The projection vector $\mathbf{p}$ computes a scalar projection value for each graph node.
The top-k pooling layer can then be formalised as follows:
$$\mathbf{X}' = (\mathbf{X} \odot \mathrm{sigmoid}(\mathbf{y}))_{\mathbf{i}}, \qquad \mathbf{A}' = \mathbf{A}_{\mathbf{i}, \mathbf{i}},$$
where $\mathbf{i} = \mathrm{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest projection scores, $\odot$ denotes element-wise matrix multiplication, and $\mathrm{sigmoid}(\cdot)$ is the sigmoid function. In other words, the nodes with the top-k highest projection scores are retained in the new adjacency matrix $\mathbf{A}'$. The $\mathrm{sigmoid}(\cdot)$ operation makes the projection vector $\mathbf{p}$ trainable by backpropagation, which would otherwise produce discrete outputs. A NumPy sketch is given after the self-attention pooling description below, alongside one for self-attention pooling.
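A NumPy sketch of top-k pooling as formalised above (the function and argument names are illustrative):

```python
import numpy as np

def topk_pool(X, A, p, k):
    # X: (n, d) node features, A: (n, n) adjacency, p: (d,) learnable projection vector.
    y = X @ p / np.linalg.norm(p)             # projection score per node
    idx = np.argsort(y)[-k:]                  # indices of the k highest-scoring nodes
    gate = 1.0 / (1.0 + np.exp(-y[idx]))      # sigmoid gating keeps p trainable
    X_new = X[idx] * gate[:, None]            # element-wise gating of the kept rows
    A_new = A[np.ix_(idx, idx)]               # adjacency induced on the kept nodes
    return X_new, A_new
```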
The self-attention pooling layer can then be formalised as follows:
$$\mathbf{y} = \mathrm{GNN}(\mathbf{X}, \mathbf{A}), \qquad \mathbf{X}' = (\mathbf{X} \odot \mathbf{y})_{\mathbf{i}}, \qquad \mathbf{A}' = \mathbf{A}_{\mathbf{i}, \mathbf{i}},$$
where $\mathbf{i} = \mathrm{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest attention scores, $\odot$ denotes element-wise matrix multiplication, and $\mathrm{GNN}$ is a generic permutation-equivariant GNN layer producing one score per node.
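A sketch of this formalisation. The choice of a minimal GCN-style mean aggregation with a linear scoring weight `W` is an assumption for illustration; any permutation-equivariant GNN layer could produce the scores:

```python
import numpy as np

def self_attention_pool(X, A, W, k):
    # X: (n, d) node features, A: (n, n) adjacency, W: (d, 1) learnable weights.
    # Scores come from a minimal GCN-style layer, so they depend on both
    # the node features X and the graph topology A.
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    y = ((A_hat / deg) @ X @ W).ravel()                 # one attention score per node
    idx = np.argsort(y)[-k:]                            # keep the top-k scoring nodes
    X_new = X[idx] * y[idx, None]                       # gate kept features by their scores
    A_new = A[np.ix_(idx, idx)]
    return X_new, A_new
```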
The self-attention pooling layer can be seen as an extension of the top-k pooling layer. Unlike top-k pooling, the self-attention scores computed in self-attention pooling account for both the graph features and the graph topology.
During the 1970s, to explain the effects of depth perception, some researchers, such as Julesz and Chang (1976), proposed that the visual system implements a disparity-selective mechanism by global pooling, where the outputs from matching pairs of retinal regions in the two eyes are pooled in higher-order cells.
In artificial neural networks, max pooling was used in 1990 for speech processing (1-dimensional convolution), and it was first used for image processing in the Cresceptron of 1992.