Lossy image autoencoders with convolution and deconvolution networks in TensorFlow

Autoencoders are a very interesting deep learning application because they allow a consistent dimensionality reduction of an entire dataset with a controllable loss level. The Jupyter notebook for this small project is available in the GitHub repository: https://github.com/giuseppebonaccorso/lossy_image_autoencoder.

The structure of a generic autoencoder is represented in the following figure:

The encoder is a function that processes an input matrix (image) and outputs a fixed-length code:

In this model, the encoding function is implemented using a convolutional layer followed by flattening and dense layers. The code is then fed into the decoder, which reconstructs a lossy version of the original image:
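
The formulas that accompanied this description are not reproduced here. One common way to write the two mappings is sketched below; the symbols (X for an image, z for its code, n for the code length, e and d for encoder and decoder) are chosen for illustration and are not taken from the original post:

```latex
\begin{align}
z &= e(X), & X &\in \mathbb{R}^{h \times w \times 3},\; z \in \mathbb{R}^{n},\; n \ll 3hw \\
\tilde{X} &= d(z) = d(e(X)), & \tilde{X} &\in \mathbb{R}^{h \times w \times 3}
\end{align}
```

Here the reconstruction is lossy because n is much smaller than the number of values in X, so the code cannot retain every detail of the input.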

The decoder is implemented using a deconvolutional (transposed convolution) layer with 3 filters (one per channel). The model is trained by minimizing the L2 loss:
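
The model code uses tf.nn.l2_loss, which returns half the sum of the squared elements of its argument (no square root and no averaging). A minimal NumPy sketch of the same quantity follows; the function name l2_reconstruction_loss is chosen here for illustration:

```python
import numpy as np

def l2_reconstruction_loss(original, reconstructed):
    """Half the sum of squared differences, i.e. sum((x - x_hat) ** 2) / 2."""
    diff = np.asarray(original, dtype=np.float64) - np.asarray(reconstructed, dtype=np.float64)
    return 0.5 * np.sum(diff ** 2)

# Toy batch: 2 "images" of 2 x 2 pixels with 3 channels, values in [0, 1]
x = np.zeros((2, 2, 2, 3))
x_hat = np.full((2, 2, 2, 3), 0.5)

# 24 elements, each with squared difference 0.25: 24 * 0.25 / 2 = 3.0
print(l2_reconstruction_loss(x, x_hat))
```

Because the loss is a sum rather than a mean, its magnitude grows with the batch size and the image size, which should be kept in mind when comparing values across configurations.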

For the experiment, I’ve used the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), loaded through the Keras wrapper and restricted to the training samples (50000 RGB images of 32 x 32 pixels):

from keras.datasets import cifar10

(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()

The model is implemented using TensorFlow with the RMSProp optimizer:

import tensorflow as tf

width = 32
height = 32
batch_size = 10
nb_epochs = 15
code_length = 128

graph = tf.Graph()

with graph.as_default():
    # Global step
    global_step = tf.Variable(0, trainable=False)
    # Input batch
    input_images = tf.placeholder(tf.float32, shape=(batch_size, height, width, 3))

    # Convolutional layer 1
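    conv1 = tf.layers.conv2d(inputs=input_images,
                             filters=32,  # assumed value; elided in the source
                             kernel_size=(3, 3),
                             activation=tf.nn.tanh)  # assumed activation; elided in the source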

    # Convolutional output (flattened)
    conv_output = tf.contrib.layers.flatten(conv1)

    # Code layer
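    code_layer = tf.layers.dense(inputs=conv_output,
                                 units=code_length,
                                 activation=tf.nn.tanh)  # assumed activation; elided in the source

    # Code output layer
    code_output = tf.layers.dense(inputs=code_layer,
                                  units=(height - 2) * (width - 2) * 3,
                                  activation=tf.nn.tanh)  # assumed activation; elided in the source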

    # Deconvolution input
    deconv_input = tf.reshape(code_output, (batch_size, height - 2, width - 2, 3))

    # Deconvolution layer 1
    deconv1 = tf.layers.conv2d_transpose(inputs=deconv_input,
                                         filters=3,  # one filter per RGB channel
                                         kernel_size=(3, 3),
                                         activation=tf.sigmoid)  # assumed; keeps outputs in [0, 1]

    # Output batch
    output_images = tf.cast(tf.reshape(deconv1, 
                                       (batch_size, height, width, 3)) * 255.0, tf.uint8)

    # Reconstruction L2 loss
    loss = tf.nn.l2_loss(input_images - deconv1)

    # Training operations
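    learning_rate = tf.train.exponential_decay(learning_rate=0.0005,
                                               global_step=global_step,
                                               decay_steps=int(X_train.shape[0] / (2 * batch_size)),
                                               decay_rate=0.95)  # assumed value; elided in the source

    trainer = tf.train.RMSPropOptimizer(learning_rate)
    # Passing global_step lets the optimizer increment it, so the decay takes effect
    training_step = trainer.minimize(loss, global_step=global_step)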
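
The `height - 2` and `width - 2` terms in the model follow from the default settings of tf.layers.conv2d and tf.layers.conv2d_transpose ('valid' padding, stride 1): a 3 x 3 kernel removes two pixels per spatial dimension, and the transposed convolution adds them back. A small sketch of this arithmetic, together with the size of the code relative to the image, is shown below; the helper names are illustrative and not part of the original code:

```python
def conv_output_size(size, kernel, stride=1):
    """Spatial output size of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1

def conv_transpose_output_size(size, kernel, stride=1):
    """Spatial output size of a transposed convolution with 'valid' padding
    (holds when kernel >= stride)."""
    return (size - 1) * stride + kernel

height = width = 32
kernel = 3
code_length = 128

encoded = conv_output_size(height, kernel)              # 32 -> 30
decoded = conv_transpose_output_size(encoded, kernel)   # 30 -> 32
print(encoded, decoded)

# Storage comparison: uint8 image versus float32 code
image_bytes = height * width * 3   # 3072
code_bytes = code_length * 4       # 512
print(image_bytes / code_bytes)    # 6.0
```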

The code length is 128 floating-point values (512 bytes as float32, 1024 bytes as float64), which is considerably smaller than the original image (32 x 32 x 3 = 3072 bytes), so the reconstruction error will be medium-high (it’s useful to test different values to find the best trade-off). Moreover, it’s also possible to add an L1 regularization to the code in order to increase sparsity. The training code is shown in the following snippet:

import numpy as np

def create_batch(t, gray=False):
    X = np.zeros((batch_size, height, width, 3 if not gray else 1), dtype=np.float32)
    for k, image in enumerate(X_train[t:t+batch_size]):
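        if gray:
            # rgb2gray is not defined in this snippet; it is assumed to return
            # an array of shape (height, width, 1)
            X[k, :, :, :] = rgb2gray(image)
        else:
            X[k, :, :, :] = image / 255.0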
    return X

def train():
    for e in range(nb_epochs):
        total_loss = 0.0

        for t in range(0, X_train.shape[0], batch_size):
            feed_dict = {
                input_images: create_batch(t)
            }

            # session is assumed to be a tf.Session(graph=graph) with initialized variables
            _, v_loss = session.run([training_step, loss], feed_dict=feed_dict)
            total_loss += v_loss

        print('Epoch {} - Total loss: {}'.format(e+1, total_loss))

After 15 epochs (in a production implementation it’s useful to increase this value until the loss function stops decreasing), the reconstruction of some random images is shown in the following figure (first row, original images, second row, reconstructed ones):
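
Besides visual inspection, reconstruction quality can also be quantified. The sketch below computes the mean squared error and the peak signal-to-noise ratio (PSNR) between two uint8 images; it is not part of the original notebook, and the synthetic "reconstruction" is random noise added to a random image purely to make the example self-contained:

```python
import numpy as np

def mse(original, reconstructed):
    """Mean squared error between two images of the same shape."""
    o = np.asarray(original, dtype=np.float64)
    r = np.asarray(reconstructed, dtype=np.float64)
    return np.mean((o - r) ** 2)

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    error = mse(original, reconstructed)
    if error == 0:
        return float('inf')
    return 10.0 * np.log10(max_value ** 2 / error)

# Synthetic stand-ins for an original image and its reconstruction
rng = np.random.RandomState(1000)
original = rng.randint(0, 256, size=(32, 32, 3)).astype(np.uint8)
noise = rng.randint(-10, 11, size=original.shape)
reconstructed = np.clip(original.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(mse(original, reconstructed), psnr(original, reconstructed))
```

With real data, `original` would be a sample from X_train and `reconstructed` the corresponding entry of output_images; higher PSNR means a reconstruction closer to the input.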

As expected, the quality is not very high, but the “semantics” of each image are largely preserved. Possible improvements include:

  • Adding a flag (using a placeholder) to use the model for both training and prediction. In the former mode, the input is an image batch, while in the latter it is a code batch
  • Using L1 (and/or L2) code regularization
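
The regularization idea in the last item can be sketched in NumPy as follows. The function name and the penalty weight alpha are illustrative assumptions; in the TensorFlow graph, the equivalent term would be built from code_layer and added to loss before calling the optimizer:

```python
import numpy as np

def regularized_loss(original, reconstructed, code, alpha=0.001):
    """L2 reconstruction loss plus an L1 penalty on the code.

    alpha is a hypothetical weight that has to be tuned for the dataset:
    larger values push more code components towards zero.
    """
    reconstruction = 0.5 * np.sum((original - reconstructed) ** 2)
    sparsity = alpha * np.sum(np.abs(code))
    return reconstruction + sparsity

# Toy example: one 4 x 4 RGB "image" and a code of length 4
x = np.ones((1, 4, 4, 3))
x_hat = np.zeros((1, 4, 4, 3))
code = np.array([[0.5, -0.5, 0.0, 2.0]])

print(regularized_loss(x, x_hat, code, alpha=0.1))
```

The L1 term penalizes the absolute value of every code component, so the optimizer is encouraged to keep only the components that noticeably reduce the reconstruction error.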
