I have the following data set stored using numpy: https://www.dropbox.com/sh/ppseiv9skqlhljr/aacqewzh11oszl5-z_nhqre3a?dl=0
There are different numpy files for the training and development partitions of the data set. The data has the following shape:
[50,1,396]
I am using PCAFast from the mlpy library in order to perform dimensionality reduction. The whole process is slow and I cannot find out why. Before I perform PCA I convert the data set to the following shape:
[50,396]
so the shape of the data set should not be the cause of the problem.
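For reference, the conversion described above can be sketched with plain NumPy indexing. This is only an illustration: the random array is an assumed stand-in for a partition loaded with np.load, and the variable names are hypothetical.

```python
import numpy as np

# Assumed stand-in for one loaded partition with shape (50, 1, 396)
features = np.random.rand(50, 1, 396)

# Drop the singleton middle axis: (50, 1, 396) -> (50, 396)
x = features[:, 0, :]
```

This gives the same matrix as copying f[0] row by row, without the Python loop.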
The code I use is the following:
    import os
    import numpy as np
    import sys
    import csv
    import mlpy

    inputfiletrain = ''
    outputfiletrain = ''
    inputfiledev = ''
    outputfiledev = ''

    def parsecommandlineargs():
        # Each option is passed as a bare keyword followed by its value
        global inputfiletrain
        global outputfiletrain
        global inputfiledev
        global outputfiledev
        for i in range(0, len(sys.argv)):
            if sys.argv[i] == 'inputfiletrain':
                inputfiletrain = sys.argv[i + 1]
                print
                print "------*****using directory :*****------"
                print 'inputfiletrain=' + inputfiletrain
                print "------**********************------"
                print
            if sys.argv[i] == 'outputfiletrain':
                outputfiletrain = sys.argv[i + 1]
                print
                print "------*****using directory :*****------"
                print 'outputfiletrain=' + outputfiletrain
                print "------**********************------"
                print
            if sys.argv[i] == 'inputfiledev':
                inputfiledev = sys.argv[i + 1]
                print
                print "------*****using directory :*****------"
                print 'inputfiledev=' + inputfiledev
                print "------**********************------"
                print
            if sys.argv[i] == 'outputfiledev':
                outputfiledev = sys.argv[i + 1]
                print
                print "------*****using outputfeatures filename :*****------"
                print 'outputfiledev=' + outputfiledev
                print "------**********************------"
                print

    def pcadimred(features, ndims):
        # Copy the (n, 1, d) input into a 2-D (n, d) matrix for PCA
        x = np.empty([features.shape[0], features.shape[2]])
        print features.shape[2]
        print x.shape
        for i, f in enumerate(features):
            #np.append(x,f[0],axis=0)
            x[i] = f[0]
        #np.vstack(x)
        print x
        print "pcastarting"
        #pca = mlpy.PCA(method='cov')
        pca = mlpy.PCAFast(k=ndims, eps=0.1)
        pca.learn(x)
        coeff = pca.coeff()
        coeff = coeff[:, 0:ndims]
        print "pcaending"
        featuresnew = []
        for f in x:
            ft = f.copy()
            # ft = pca.transform(ft, k=ndims)
            ft = np.dot(f, coeff)
            featuresnew.append(ft)
        # Put the projected vectors back into the (n, 1, k) layout
        thodwrisformat = np.empty((len(featuresnew), 1, coeff.shape[1]))
        for i, f in enumerate(featuresnew):
            thodwrisformat[i][0] = f
        return (thodwrisformat, coeff)

    def pcadevelopmentset(features, ndims, coeff):
        # Project the development set with the coefficients learned on training data
        featuresnew = []
        for f in features:
            ft = f.copy()
            # ft = pca.transform(ft, k=ndims)
            ft = np.dot(f, coeff)
            featuresnew.append(ft)
        return featuresnew

    parsecommandlineargs()
    print inputfiledev
    featuresdev = np.load(inputfiledev)
    featurestrain = np.load(inputfiletrain)
    featurestrain, coeff = pcadimred(featurestrain, 68)
    featuresdev = pcadevelopmentset(featuresdev, 68, coeff)
    np.save(outputfiledev, featuresdev)
    np.save(outputfiletrain, featurestrain)

I am using this code under Ubuntu Linux and Python 2.7. To install mlpy one has to use the following commands:
    wget http://sourceforge.net/projects/mlpy/files/mlpy%203.5.0/mlpy-3.5.0.tar.gz
    tar xvf mlpy-3.5.0.tar.gz
    cd mlpy-3.5.0
    sudo python setup.py install

Finally, to run the code, assuming the script is stored as pca.py and is in the same folder where the directory feature_vectors containing the partitions of the data set resides, one must use the following command:
    python pca.py inputfiletrain feature_vectors/train/featuresshape.npy outputfiletrain feature_vectors/train/featuresshapepca.npy inputfiledev feature_vectors/development/featuresshape.npy outputfiledev feature_vectors/development/featuresshapepca.npy

I need some ideas as to why PCA is so slow on this data set...
Regarding the discussion:
- If speed is measured per batch: the process is slower because of the higher dimensionality, i.e. a feature dimension of 396.
- If speed is measured per epoch: the process is slower because there is more data, i.e. 50x396 = 19800 values vs. 100x100 = 10000 values in the random example.
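To get a rough sense of whether data size alone explains the slowness, one option is to time a plain SVD-based PCA on a matrix of the same shape. The following is a minimal sketch under stated assumptions: it uses NumPy's SVD rather than mlpy's PCAFast, so it says nothing definitive about mlpy's own running time; the random matrix stands in for the real features; and the helper name pca_svd is hypothetical.

```python
import time
import numpy as np

def pca_svd(x, ndims):
    """Project x (n_samples, n_features) onto its leading principal directions."""
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # No more directions exist than min(n_samples, n_features)
    k = min(ndims, vt.shape[0])
    coeff = vt[:k].T  # shape (n_features, k)
    return centered.dot(coeff), coeff, mean

# Assumed stand-in for the (50, 396) training matrix
rng = np.random.RandomState(0)
x = rng.rand(50, 396)

start = time.time()
projected, coeff, mean = pca_svd(x, 68)
elapsed = time.time() - start

# Rank of the centered data bounds the number of informative components
rank = np.linalg.matrix_rank(x - x.mean(axis=0))
print("elapsed seconds: %.4f, rank: %d, components: %d"
      % (elapsed, rank, coeff.shape[1]))
```

One thing this makes visible: after centering, 50 samples span at most 49 directions with nonzero variance, so fewer than the 68 requested components can carry information. Whether that affects how long PCAFast's iteration takes is an assumption worth testing, not a confirmed cause.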