python - Dimensionality reduction being way too slow using PCA and a small dataset


I have the following data set stored using numpy: https://www.dropbox.com/sh/ppseiv9skqlhljr/aacqewzh11oszl5-z_nhqre3a?dl=0

There are separate numpy files for the training and development partitions of the data set. The data has the following shape:

[50,1,396] 

I am using PCAFast from the mlpy library to perform dimensionality reduction. The whole process is slow and I cannot find out why. Before I perform PCA, I convert the dataset to the following shape:

[50,396] 

So the shape of the dataset is not the cause of the problem.
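As a side note, the conversion from [50,1,396] to [50,396] does not need a Python loop. A minimal sketch, assuming the middle axis always has length 1; the random array is only a stand-in for the data loaded from the .npy file:

```python
import numpy as np

# Stand-in for np.load(...) on the training file: random values,
# same shape as the data described in the question.
features = np.random.RandomState(0).rand(50, 1, 396)

# Drop the singleton middle axis: [50, 1, 396] -> [50, 396].
x = features[:, 0, :]

print(x.shape)  # (50, 396)
```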

The code I use is the following:

    import os
    import numpy as np
    import sys
    import csv
    import mlpy

    inputfiletrain = ''
    outputfiletrain = ''
    inputfiledev = ''
    outputfiledev = ''


    def parsecommandlineargs():
        global inputfiletrain
        global outputfiletrain
        global inputfiledev
        global outputfiledev

        # Arguments are passed as "name value" pairs on the command line.
        for i in range(0, len(sys.argv)):

            if sys.argv[i] == 'inputfiletrain':
                inputfiletrain = sys.argv[i + 1]
                print
                print "------*****using directory :*****------"
                print 'inputfiletrain=' + inputfiletrain
                print "------**********************------"
                print

            if sys.argv[i] == 'outputfiletrain':
                outputfiletrain = sys.argv[i + 1]
                print
                print "------*****using directory :*****------"
                print 'outputfiletrain=' + outputfiletrain
                print "------**********************------"
                print

            if sys.argv[i] == 'inputfiledev':
                inputfiledev = sys.argv[i + 1]
                print
                print "------*****using directory :*****------"
                print 'inputfiledev=' + inputfiledev
                print "------**********************------"
                print

            if sys.argv[i] == 'outputfiledev':
                outputfiledev = sys.argv[i + 1]
                print
                print "------*****using outputfeatures filename :*****------"
                print 'outputfiledev=' + outputfiledev
                print "------**********************------"
                print


    def pcadimred(features, ndims):
        # Flatten [n, 1, d] into a 2-D matrix [n, d].
        x = np.empty([features.shape[0], features.shape[2]])
        print features.shape[2]
        print x.shape

        for i, f in enumerate(features):
            #np.append(x,f[0],axis=0)
            x[i] = f[0]
        #np.vstack(x)

        print x
        print "pcastarting"
        #pca = mlpy.PCA(method='cov')
        pca = mlpy.PCAFast(k=ndims, eps=0.1)
        pca.learn(x)
        coeff = pca.coeff()
        coeff = coeff[:, 0:ndims]

        print "pcaending"
        # Project every sample onto the learned components.
        featuresnew = []
        for f in x:
            ft = f.copy()
            #ft = pca.transform(ft, k=ndims)
            ft = np.dot(f, coeff)
            featuresnew.append(ft)

        # Restore the original [n, 1, k] layout.
        thodwrisformat = np.empty((len(featuresnew), 1, coeff.shape[1]))
        for i, f in enumerate(featuresnew):
            thodwrisformat[i][0] = f

        return (thodwrisformat, coeff)


    def pcadevelopmentset(features, ndims, coeff):
        # Apply the projection learned on the training set.
        featuresnew = []
        for f in features:
            ft = f.copy()
            #ft = pca.transform(ft, k=ndims)
            ft = np.dot(f, coeff)
            featuresnew.append(ft)
        return featuresnew


    parsecommandlineargs()
    print inputfiledev
    featuresdev = np.load(inputfiledev)
    featurestrain = np.load(inputfiletrain)

    # pcadimred returns (projected features, coefficient matrix).
    pcatrain = pcadimred(featurestrain, 68)
    featurestrain = pcatrain[0]
    coeff = pcatrain[1]
    featuresdev = pcadevelopmentset(featuresdev, 68, coeff)

    np.save(outputfiledev, featuresdev)
    np.save(outputfiletrain, featurestrain)
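The row-by-row projection in the loops above can also be written as a single matrix product. This is only a sketch: `x` and `coeff` are filled with random stand-in values of the shapes used in the question, not with real features or mlpy output.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(50, 396)      # stand-in for the flattened training features
coeff = rng.rand(396, 68)  # stand-in for the coefficient matrix from the PCA step

# Row-by-row projection, as in the loop in the question.
looped = np.array([np.dot(f, coeff) for f in x])

# The same projection as one matrix product.
projected = np.dot(x, coeff)

print(projected.shape)                 # (50, 68)
print(np.allclose(looped, projected))  # True
```

On a data set this small the loop is unlikely to be the bottleneck, so this is mainly a simplification.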

I am using this code under Ubuntu Linux and Python 2.7. To install mlpy, one has to use the following commands:

    wget http://sourceforge.net/projects/mlpy/files/mlpy%203.5.0/mlpy-3.5.0.tar.gz
    tar xvf mlpy-3.5.0.tar.gz
    cd mlpy-3.5.0
    sudo python setup.py install

Finally, to run the code, assuming the script is stored as pca.py and is in the same folder as the feature_vectors directory that contains the partitions of the dataset, one must use the following command:

python pca.py inputfiletrain feature_vectors/train/featuresshape.npy outputfiletrain feature_vectors/train/featuresshapepca.npy inputfiledev feature_vectors/development/featuresshape.npy outputfiledev feature_vectors/development/featuresshapepca.npy  

I need some ideas on why PCA is so slow on this dataset...

Regarding the discussion:

  • If you measure speed per batch: the process is slower because of the higher dimensionality, i.e. a data shape of 396.
  • If you measure speed per epoch: the process is slower because there is more data, i.e. 50x396 = 19800 values vs. the 100x100 random example.
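
The two sizes mentioned above can be compared directly. The following is a rough sketch that uses NumPy's SVD as a stand-in for PCA (it is not mlpy's PCAFast, which works iteratively) on random data of both shapes. The helper names `pca_svd` and `time_pca` are hypothetical, and the timings will differ from machine to machine.

```python
import time
import numpy as np


def pca_svd(x, ndims):
    """Project x (n_samples, n_features) onto at most ndims principal directions."""
    centered = x - x.mean(axis=0)
    # Economy-size SVD: the rows of vt are the principal directions,
    # and there are only min(n_samples, n_features) of them.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coeff = vt[:ndims].T
    return np.dot(centered, coeff), coeff


def time_pca(shape, ndims, seed=0):
    """Run pca_svd on random data of the given shape; return (output shape, seconds)."""
    x = np.random.RandomState(seed).rand(*shape)
    start = time.time()
    projected, _ = pca_svd(x, ndims)
    return projected.shape, time.time() - start


for shape in [(50, 396), (100, 100)]:
    out_shape, seconds = time_pca(shape, 68)
    print("%s: %d values -> %s in %.4f s"
          % (shape, shape[0] * shape[1], out_shape, seconds))
```

One thing this makes visible: 50 centered samples span at most 49 directions, so the SVD returns only 50 columns for the 50x396 case even though 68 were requested. Whether that is related to the slowness of PCAFast with k=68 is only a guess and is not confirmed here.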
