2018-08-13

PCA 与 autoencoder

一、PCA的基本原理

PCA是最简单的一种降维手段，目标就是找到一个最好的投影方向，然后把所有的d维的点都投影到这个方向上，如果投影这一次不够，那么就再找出一个次好的方向，这个次好的方向的搜索空间是第一个主方向的正交子空间。以此类推，每次能够捕获的信息都递减，当找到了d个新的方向后，此时转换后的样本点和新的样本点所拥有的信息完全一致，只不过是换了一种表示，本质上是做了一个可逆的线性变换。通常来说我们只需要找到d’个主方向，每个方向用一个d维向量表示，这d’个方向又被称为“主成分”。那么如何找到主成分呢，通常有两种推导方法，下面进行分别列出。

1. 基于最大方差

为什么要最大化方差呢？因为从信号处理的角度考虑，通常认为信号具有较大方差，噪声具有较小方差，信号与噪声之比称为信噪比，信噪比越大认为信号的质量越好。

对协方差矩阵进行特征分解后可以得到d个主成分，取前d’个主成分对应的特征向量作为投影方向。此时的信息保留比为：
$$\eta = \sqrt\frac {\sum_{i=1}^{d’}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$$

2. 基于最小重构（投影）误差

3. 从重构的角度解释PCA

首先解释一下下面的图，图片来自于李宏毅老师的课件。图中标的是X矩阵中的一个列向量代表一个样本点，但是通常在编程中一个行向量是样本点，这里我也不对原图进行改动了，下面都基于假设，X矩阵中一个行向量代表一个样本点。
x降维之后得到的新的低维向量是u(前面推导中记作x波浪号，这里使用u表示)。其中u每一维的含义就是对应的主成分的重构权重。即下图的u矩阵中的行向量中的各个元素。U矩阵的行向量就是投影后的样本。C矩阵中的每个行向量都是一个主成分。

真正在做的时候，使用SVD得到如上结果，即$X=U\Sigma C$，其中的$$$U’=U\Sigma$$$就是投影得到的特征向量，记作$$$X=U’C$$$。因为K时是小于N的，因此信息有缺失，等式两边不是完全相等的。做变换：
$$ X\approx U’C $$
$$XC^T=U’CC^T$$
$$XC^T=U’$$
$$XC^TC=U’C\approx X$$
损失时出现在$$$C^T$$$这里。最后一个式子实际上就是上图表示的autoencoder的原理。

4. 代码展示

基于tensorflow实现的PCA。

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

# from https://github.com/eliorc/Medium/blob/master/PCA-tSNE-AE.ipynb
# reference http://projector.tensorflow.org/


class TF_PCA:
    def __init__(self, dtype=tf.float32, n_dimensions=None, keep_info=None):
        self.dtype = dtype
        self.n_dimensions = n_dimensions
        self.keep_info = keep_info
        self.graph = None
        self.X = None
        self.u = None
        self.singular_values = None
        self.sigma = None

    def fit(self, data):
        self.graph = tf.Graph()
        with self.graph.as_default():
            self.X = tf.placeholder(self.dtype, shape=(None, data.shape[1]))

            # Perform SVD
            singular_values, u, components = tf.svd(self.X)

            # Create sigma matrix
            sigma = tf.diag(singular_values)

        with tf.Session(graph=self.graph) as session:
            self.u, self.singular_values, self.sigma, self.components = session.run(
                                                                [u, singular_values, sigma, components],
                                                                feed_dict={self.X: data})

    def transform(self, data):
        if self.keep_info:
            # Normalize singular values
            normalized_singular_values = self.singular_values / sum(self.singular_values)

            # Create the aggregated ladder of kept information per dimension
            ladder = np.cumsum(normalized_singular_values)

            # Get the first index which is above the given information threshold
            index = next(idx for idx, value in enumerate(ladder) if value >= keep_info) + 1
            self.n_dimensions = index

        print("cache information", sum(self.singular_values[:self.n_dimensions])/sum(self.singular_values))
        with self.graph.as_default():
            most_components = tf.slice(self.components, (0,0), (self.components.shape[0], self.n_dimensions))
            new_data = tf.placeholder(dtype=self.dtype, shape=(None, data.shape[1]))
            pca = tf.matmul(new_data, most_components)

        with tf.Session(graph=self.graph) as session:
            return session.run(pca, feed_dict={new_data: data})
    
    def reconstruct(self, data):
        with self.graph.as_default():
            weight = tf.placeholder(self.dtype, shape=(data.shape[0], self.n_dimensions))
            components = tf.slice(tf.transpose(self.components), 
                                  [0, 0], 
                                  [self.n_dimensions, self.components.shape[1]])
            re_vector = tf.matmul(weight, components)
            
        weight_value = self.transform(data)
        with tf.Session(graph=self.graph) as session:
            return session.run(re_vector, feed_dict={weight: weight_value})

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/media/tao/文档/学习/电子书/tensorflow/TensorFlow实战Google深度学习框架/1.0.0/Chapter05/datasets/MNIST_data", one_hot=False)




scaler = preprocessing.StandardScaler()
scaler.fit(mnist.train.images)
data = mnist.train.images - scaler.mean_


fig = plt.figure()
ax = fig.add_subplot(131)
ax.imshow(mnist.train.images[0].reshape((28,28)))
ax.set_title("before scale")
ax = fig.add_subplot(132)
ax.imshow(data[0].reshape((28,28)))
ax.set_title("after scale")
ax = fig.add_subplot(133)
ax.imshow(scaler.mean_.reshape((28,28)))
ax.set_title("mean image")
fig.show()

png

pca = TF_PCA(n_dimensions=14*14)
pca.fit(data)

a = pca.reconstruct(data[0].reshape((1,784)))
plt.imshow((a+scaler.mean_).reshape((28,28)))

cache information 0.7495891560740073

png

1 2	reduced_mnist = pca.transform(data) plt.scatter(reduced_mnist[:,0], reduced_mnist[:,1],c=mnist.train.labels)

cache information 0.7495891560740073

png

fig = plt.figure()
fig.add_subplot(331).imshow(pca.components[:,0].reshape((28,28)))
fig.add_subplot(332).imshow(pca.components[:,1].reshape((28,28)))
fig.add_subplot(333).imshow(pca.components[:,2].reshape((28,28)))
fig.add_subplot(334).imshow(pca.components[:,3].reshape((28,28)))
fig.add_subplot(335).imshow(pca.components[:,4].reshape((28,28)))
fig.add_subplot(336).imshow(pca.components[:,5].reshape((28,28)))
fig.add_subplot(337).imshow(pca.components[:,6].reshape((28,28)))
fig.add_subplot(338).imshow(pca.components[:,7].reshape((28,28)))
fig.add_subplot(339).imshow(pca.components[:,8].reshape((28,28)))
plt.show()

png

代码的实现基本重现了上面的原理，transform和reconstruct函数分别就是右乘投影矩阵或者投影矩阵的转置。其实就是autoencoder做的事情。但是注意一点，不同的是如果真的做一个autoencoder这么训练的话是没有PCA的效果好的，因为这样训练出来的autoencoder的encoder和decoder之间没有关系，但是实际上对于PCA二者参数是一样的，另外PCA的线性变换矩阵还要满足正交的要求。因此实际做出来的autocoder没有PCA效果好，但是好在autoencoder可以往deep方向加强。在网上都很多多层去噪自编码器的keras或者tensorflow实现，这里就不贴出来了。下面这个图片是我自己做的实验，使用降噪自编码器重构图片之后，得到的效果。

另外PCA的最大的缺点是只能进行线性的降维，处理不了流行问题。因此有很多非线性降维的手段，这个另作总结。对于PCA也可以利用kernel trick变形为KPCA解决非线性问题。