I tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like PCA and RandomizedPCA did before. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an HDF5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the HDF5 format the problem?
    from sklearn.decomposition import IncrementalPCA
    import h5py

    db = h5py.File("db.h5","r")
    data = db["data"]
    IncrementalPCA(n_components=10, batch_size=1).fit(data)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
        X = check_array(X, dtype=np.float)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
        array = np.atleast_2d(array)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
        ary = asanyarray(ary)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
        return array(a, dtype, copy=False, order=order, subok=True)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
      File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
        arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
    MemoryError

Thanks for any help.
Your program is probably failing while trying to load the entire dataset into RAM. 32 bits per float32 × 1,000,000 × 1,000 is 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size alone:
    >>> import numpy as np
    >>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
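The size estimate above can also be reproduced without allocating anything. This is only a minimal sketch; `array_size_gib` is a helper name made up here for illustration. It also shows the float64 case, since the traceback in the question has `check_array` asking for `dtype=np.float`, which would double the footprint if the conversion ever got that far.

```python
import numpy as np


def array_size_gib(shape, dtype):
    """Return the size of a dense array in GiB without allocating it.

    Hypothetical helper, not part of NumPy or scikit-learn.
    """
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * np.dtype(dtype).itemsize / 2**30


size_f32 = array_size_gib((1000000, 1000), np.float32)
size_f64 = array_size_gib((1000000, 1000), np.float64)
print(f"float32: {size_f32:.2f} GiB")  # float32: 3.73 GiB
print(f"float64: {size_f64:.2f} GiB")  # float64: 7.45 GiB
```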
With h5py datasets we should simply avoid passing the entire dataset to our methods, and pass slices of the dataset instead, one at a time.
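To make the difference between the dataset handle and a slice concrete, here is a small sketch on a throwaway file. The file name `demo.h5` and the tiny 10 × 2 dataset are assumptions chosen only for the illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file; the path and dataset name are illustrative.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(20, dtype=np.float32).reshape(10, 2))

with h5py.File(path, "r") as f:
    dset = f["data"]   # an h5py.Dataset handle; no rows are read here
    chunk = dset[2:5]  # reads only rows 2, 3 and 4 into a NumPy array
    is_dataset_handle = isinstance(dset, h5py.Dataset)

print(is_dataset_handle, type(chunk).__name__, chunk.shape)
```

Passing `dset` itself to a function that calls `np.asarray` on it (as `check_array` does in the traceback) materializes the whole dataset, whereas `dset[2:5]` only ever costs the memory of those rows.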
As I don't have your data, let me start by creating a random dataset of the same size:
    import h5py
    import numpy as np

    h5 = h5py.File('rand-1mx1k.h5', 'w')
    h5.create_dataset('data', shape=(1000000,1000), dtype=np.float32)
    # Fill the dataset 1000 rows at a time, so only one block is in memory at once
    for i in range(1000):
        h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
    h5.close()

It creates a nice 3.8 GiB file.
Now, if we are on Linux, we can limit how much memory is available to our program:
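    $ bash
    $ ulimit -m $((1024*1024*2))
    $ ulimit -m
    2097152

Now if we try to run your code, we'll get the MemoryError. (Press Ctrl-D to quit the new bash session and reset the limit later.) On kernels that do not enforce the resident-set limit set by -m, ulimit -v, which caps virtual memory, is the usual alternative.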
Let's try to solve the problem. We'll create an IncrementalPCA object, and call its .partial_fit() method many times, providing a different slice of the dataset each time.
    import h5py
    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    h5 = h5py.File('rand-1mx1k.h5', 'r')
    data = h5['data']  # it's ok, the dataset is not fetched to memory yet

    n = data.shape[0]  # how many rows we have in the dataset
    chunk_size = 1000  # how many rows we feed to IPCA at a time, the divisor of n
    # batch_size is only used by fit(); partial_fit() takes each chunk as given
    ipca = IncrementalPCA(n_components=10, batch_size=16)

    for i in range(0, n//chunk_size):
        ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
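The loop above assumes that chunk_size divides n. A possible variation, sketched here under a few assumptions, uses `sklearn.utils.gen_batches` to produce the slices, which also handles a leftover final chunk, and then projects the data chunk by chunk as well. The helper names `fit_ipca_in_chunks` and `transform_in_chunks` are made up for this example, and a small in-memory NumPy array stands in for the h5py dataset, which accepts the same slice objects.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.utils import gen_batches


def fit_ipca_in_chunks(data, n_components, chunk_size):
    """Fit IncrementalPCA by feeding row slices of `data` to partial_fit.

    Hypothetical helper; `data` only needs `.shape` and slice indexing.
    """
    n_rows = data.shape[0]
    ipca = IncrementalPCA(n_components=n_components)
    # min_batch_size merges a too-small remainder into the previous chunk,
    # because partial_fit needs at least n_components rows per call.
    for rows in gen_batches(n_rows, chunk_size, min_batch_size=n_components):
        ipca.partial_fit(data[rows])
    return ipca


def transform_in_chunks(ipca, data, chunk_size):
    """Project `data` chunk by chunk; only the reduced output is kept whole."""
    n_rows = data.shape[0]
    parts = [ipca.transform(data[rows]) for rows in gen_batches(n_rows, chunk_size)]
    return np.vstack(parts)


# Stand-in for the HDF5 dataset: 1050 rows, so 100 does not divide n.
rng = np.random.default_rng(0)
demo = rng.random((1050, 20), dtype=np.float32)

ipca = fit_ipca_in_chunks(demo, n_components=5, chunk_size=100)
reduced = transform_in_chunks(ipca, demo, chunk_size=100)
print(reduced.shape, ipca.n_samples_seen_)
```

With the real file, `demo` would be replaced by `h5['data']`. The reduced output for 1,000,000 rows and 10 components is small enough to hold in memory, so only the input needs to be streamed.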