VGG-16 is a network that achieved 92.7% top-5 accuracy on ImageNet classification in 2014. It has the following layer structure:
As you can see, VGG follows a traditional pyramid architecture, which is a sequence of convolution and pooling layers.
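To make the pyramid shape concrete, here is a minimal sketch that walks through the layer layout and counts trainable parameters. It assumes the standard published VGG-16 configuration (3x3 convolutions with padding that preserves size, 2x2 max pooling, a 224x224 RGB input and three dense layers); the names `VGG16_CONV`, `VGG16_FC` and `vgg16_summary` are made up for this illustration.

```python
# Assumed VGG-16 layout: numbers are output channels of 3x3 convolutions,
# "M" marks a 2x2 max-pooling layer.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]  # dense layers; 1000 ImageNet classes


def vgg16_summary(in_channels=3, size=224):
    """Return (final feature-map size, final channels, trainable parameters)."""
    params = 0
    channels = in_channels
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # pooling halves height and width
        else:
            # one 3x3 kernel per (input, output) channel pair, plus one bias per output
            params += 3 * 3 * channels * layer + layer
            channels = layer
    features = channels * size * size  # flattened input to the first dense layer
    for units in VGG16_FC:
        params += features * units + units  # weights plus biases
        features = units
    return size, channels, params


size, channels, params = vgg16_summary()
print(f"final feature map: {channels} x {size} x {size}")
print(f"trainable parameters: {params:,}")
```

Under these assumptions the spatial size shrinks from 224 to 7 while the number of channels grows from 3 to 512, and most of the parameters sit in the dense layers rather than in the convolutions.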
ResNet is a family of models proposed by Microsoft Research in 2015. The main idea of ResNet is to use residual blocks:
The reason for using the identity pass-through is to have our layer predict the difference between the result of a previous layer and the output of the residual block, hence the name residual. Those blocks are much easier to train, and one can construct networks with hundreds of layers out of them (the most common variants are ResNet-50, ResNet-101 and ResNet-152, named after their layer counts).
You can also think of this network as being able to adjust its complexity to the dataset. Initially, when you start training the network, the weight values are small, and most of the signal goes through the identity pass-through connections. As training progresses and the weights become larger, the significance of the network parameters grows, and the network adjusts to provide the expressive power required to correctly classify the training images.
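The following is a minimal sketch of this idea, not an actual ResNet layer: real residual blocks use convolutions and batch normalization and apply an activation after the addition, while here the learned part F is simplified to two small dense layers. The helper names (`dense`, `relu`, `residual_block`, `constant_matrix`) are hypothetical.

```python
def dense(x, weights):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Compute y = x + F(x), where F is dense -> ReLU -> dense."""
    fx = dense(relu(dense(x, w1)), w2)
    return [xi + fi for xi, fi in zip(x, fx)]


def constant_matrix(n, value):
    """Build an n x n matrix filled with a single value."""
    return [[value] * n for _ in range(n)]


x = [1.0, -2.0, 3.0]

# Small weights (as at the start of training): the block is close to identity.
y_small = residual_block(x, constant_matrix(3, 0.01), constant_matrix(3, 0.01))

# Larger weights: the learned residual F(x) now changes the signal noticeably.
y_large = residual_block(x, constant_matrix(3, 1.0), constant_matrix(3, 1.0))

print("input:        ", x)
print("small weights:", y_small)
print("large weights:", y_large)
```

With small weights the output stays within a tiny distance of the input, so the block initially behaves almost like a wire; as the weights grow, the residual term takes over more of the computation.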
Google's Inception architecture takes this idea one step further, and builds each network layer as a combination of several different paths:
Here, we need to emphasize the role of 1x1 convolutions, because at first they do not seem to make sense. Why would we need to run over the image with a 1x1 filter? However, remember that a convolution filter also works across several depth channels (originally RGB colors, in subsequent layers the channels produced by different filters), and a 1x1 convolution is used to mix those input channels together using different trainable weights. When it produces fewer output channels than it receives, it can also be viewed as downsampling (pooling) over the channel dimension.
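A small sketch of what a 1x1 convolution computes may help. It is written in plain Python with nested lists rather than with any deep learning library, omits the bias term, and uses a made-up function name `conv1x1`. Every pixel is processed independently: its channel vector is multiplied by the same weight matrix.

```python
def conv1x1(image, weights):
    """Apply a 1x1 convolution without bias.

    image:   H x W x C_in nested lists
    weights: C_out x C_in matrix, shared by every pixel
    returns: H x W x C_out nested lists
    """
    return [[[sum(w * c for w, c in zip(row, pixel)) for row in weights]
             for pixel in line]
            for line in image]


# A 2x2 image with 3 channels per pixel
image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
         [[0.0, 1.0, 0.0], [2.0, 2.0, 2.0]]]

# Two output channels, so the channel dimension shrinks from 3 to 2
weights = [[1.0, 0.0, 0.0],   # output channel 0 copies input channel 0
           [0.0, 1.0, 1.0]]   # output channel 1 sums input channels 1 and 2

out = conv1x1(image, weights)
print(out)  # [[[1.0, 5.0], [4.0, 11.0]], [[0.0, 1.0], [2.0, 4.0]]]
```

The height and width are unchanged; only the number of channels differs, which is why this operation is a cheap way to reduce depth before more expensive 3x3 or 5x5 convolutions.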
Here is a good blog post on the subject, and original paper.
MobileNet is a family of models with reduced size, suitable for mobile devices. Use them if you are short on resources and can sacrifice a little bit of accuracy. The main idea behind them is the so-called depthwise separable convolution, which represents a convolution filter as a composition of a spatial convolution applied to each channel separately and a 1x1 convolution over the depth channels. This significantly reduces the number of parameters, making the network smaller in size and also easier to train with less data.
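The saving can be checked with simple arithmetic. The sketch below compares weight counts for a single layer, ignoring biases; the layer sizes (a 3x3 kernel, 128 input channels, 256 output channels) are chosen only for illustration, and the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in an ordinary k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1x1 convolution."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)

print(f"standard convolution:  {standard:,} weights")
print(f"depthwise separable:   {separable:,} weights")
print(f"reduction factor:      {standard / separable:.1f}x")
```

For this example layer the separable version needs roughly 8.7 times fewer weights, and the gap widens as the number of output channels or the kernel size grows.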
Here is a good blog post on MobileNet.