python - Incremental PCA on big data


I just tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like PCA and RandomizedPCA did before. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an HDF5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the HDF5 format the problem?

    from sklearn.decomposition import IncrementalPCA
    import h5py

    db = h5py.File("db.h5", "r")
    data = db["data"]
    IncrementalPCA(n_components=10, batch_size=1).fit(data)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
        X = check_array(X, dtype=np.float)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
        array = np.atleast_2d(array)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
        ary = asanyarray(ary)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
        return array(a, dtype, copy=False, order=order, subok=True)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
        arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
    MemoryError

Thanks for your help.

Your program is probably failing when it tries to load the entire dataset into RAM. 32 bits per float32 × 1,000,000 × 1000 is 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size alone:

    >>> import numpy as np
    >>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
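The 3.7 GiB figure can also be checked without allocating anything. Below is a minimal sketch in plain Python; `array_size_gib` is a name made up for this illustration, and the default of 4 bytes per value is the size of one float32. The traceback in the question shows `check_array(X, dtype=np.float)`, so a float64 copy (8 bytes per value) may be requested as well, which would double the requirement.

```python
def array_size_gib(rows, cols, bytes_per_value=4):
    """Size in GiB of a dense rows x cols array (float32 = 4 bytes per value)."""
    return rows * cols * bytes_per_value / 2.0**30

# The dataset from the question, stored as float32.
size = array_size_gib(1000000, 1000)
print(round(size, 2))  # 3.73

# The same shape converted to float64 (8 bytes per value).
size64 = array_size_gib(1000000, 1000, bytes_per_value=8)
print(round(size64, 2))  # 7.45
```

If either number is close to or above the RAM you have, processing in chunks is the way to go.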

With h5py datasets we should just avoid passing the entire dataset to our methods, and pass slices of the dataset instead, one at a time.

As I don't have your data, let me start by creating a random dataset of the same size:

    import h5py
    import numpy as np

    h5 = h5py.File('rand-1mx1k.h5', 'w')
    h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
    # fill the dataset 1000 rows at a time so it never has to fit in RAM
    for i in range(1000):
        h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
    h5.close()

It creates a nice 3.8 GiB file.

Now, if we are on Linux, we can limit how much memory is available to our program:

    $ bash
    $ ulimit -m $((1024*1024*2))
    $ ulimit -m
    2097152

Now if we try to run your code, we'll get the MemoryError. (Press Ctrl-D to quit the new bash session and reset the limit later.)

Now let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.

    import h5py
    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    h5 = h5py.File('rand-1mx1k.h5', 'r')
    data = h5['data']  # it's ok, the dataset is not fetched to memory yet

    n = data.shape[0]  # how many rows we have in the dataset
    chunk_size = 1000  # how many rows we feed to IPCA at a time, a divisor of n
    ipca = IncrementalPCA(n_components=10, batch_size=16)

    for i in range(0, n//chunk_size):
        # slicing the h5py dataset reads only these rows into memory
        ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
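The loop above assumes that chunk_size divides n exactly; otherwise the last few rows would be skipped. As a rough sketch of one way around that, the helper below yields row ranges that cover the whole dataset, with a shorter final range if needed. The name `chunk_bounds` is made up for this example and is not part of h5py or scikit-learn.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row ranges covering 0..n_rows; the last may be shorter."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Small demonstration: 10 rows in chunks of 4 leaves a final chunk of 2 rows.
bounds = list(chunk_bounds(10, 4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

In the loop above this would read `for start, stop in chunk_bounds(n, chunk_size): ipca.partial_fit(data[start:stop])`. One caveat to check against your scikit-learn version: a final chunk with fewer rows than n_components may be rejected by partial_fit, so it can be safer to pick a chunk_size that avoids a tiny remainder.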

