I have read several times that turning on compression in HDF5 can lead to better read/write performance.
I wonder what the ideal settings are to achieve good read/write performance with:
data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)
I'm already using the fixed format (i.e. h5py) because it's faster than table. I have strong processors and do not care much about disk space.
I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.
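There are a couple of possible compression filters that you could use. Since HDF5 version 1.8.11 you can also register 3rd party compression filters.
Regarding performance:
It probably depends on your access pattern, because you want to define the chunk dimensions so that they align with how you read the data; otherwise performance will suffer a lot. For example, if you know that you usually read a single row across all 9000 columns, you should define the chunk shape accordingly, (1, 9000); for a single column across all 2500 rows it would be (2500, 1). See here, here and here for more information.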
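To make the chunk-shape argument concrete, here is a rough sketch that only counts how many chunks a read has to touch. It does not call HDF5 at all; `chunks_touched` is a made-up helper, and it assumes the selection starts on a chunk boundary and that the data is laid out as 2500 rows x 9000 columns.

```python
import math

def chunks_touched(selection_shape, chunk_shape):
    """Number of chunks overlapped by a selection that starts on a chunk boundary."""
    return math.prod(math.ceil(s / c) for s, c in zip(selection_shape, chunk_shape))

one_row = (1, 9000)      # one row, all columns
one_column = (2500, 1)   # one column, all rows

for chunk_shape in [(1, 9000), (2500, 1)]:
    print(
        chunk_shape,
        "| one row touches", chunks_touched(one_row, chunk_shape), "chunk(s)",
        "| one column touches", chunks_touched(one_column, chunk_shape), "chunk(s)",
    )
```

Since a chunk is read (and decompressed) as a whole, a selection that cuts across the chunks ends up pulling in far more data than it needs.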
However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use read_table and an iterator (see here) or do the partial IO yourself (see here), and thus doesn't really benefit that much from a well-chosen chunk size.
Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it with the CPU is probably faster than loading the uncompressed data from disk.
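A back-of-the-envelope model of that trade-off, as a sketch only: the disk bandwidth, compression ratio and decompression throughput below are invented numbers, not measurements, and the model assumes reading and decompressing do not overlap. `load_seconds` is a hypothetical helper, and the size estimate treats the whole frame as float64.

```python
def load_seconds(raw_mb, ratio, disk_mb_per_s, decompress_mb_per_s=None):
    """Estimated time to get raw_mb of data into memory.

    ratio is uncompressed size / on-disk size (1.0 means no compression).
    """
    read_s = (raw_mb / ratio) / disk_mb_per_s
    decompress_s = 0.0 if decompress_mb_per_s is None else raw_mb / decompress_mb_per_s
    return read_s + decompress_s

# 2500 rows x 9000 columns of float64, 8 bytes each, in MB (10**6 bytes)
raw_mb = 2500 * 9000 * 8 / 1e6

# Assumed numbers: 150 MB/s disk, 2x compression, 1000 MB/s decompression
plain = load_seconds(raw_mb, ratio=1.0, disk_mb_per_s=150)
packed = load_seconds(raw_mb, ratio=2.0, disk_mb_per_s=150, decompress_mb_per_s=1000)

print(f"uncompressed: {plain:.2f} s, compressed: {packed:.2f} s")
```

With a fast SSD or a slow codec the comparison can flip, so plug in your own figures.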
Regarding your original question:
I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports several different compression filters:
- blosclz: the internal default compressor, heavily based on FastLZ.
- lz4: a compact, very popular and fast compressor.
- lz4hc: a tweaked version of lz4 that produces better compression ratios at the expense of speed.
- snappy: a popular compressor used in many places.
- zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.
These have different strengths, and the best thing is to try them, benchmark them with your own data and see which works best.
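A minimal benchmarking sketch along those lines. To keep it free of third-party packages it only exercises zlib from the standard library at three levels; the Blosc codecs would have to be added to the `codecs` dict as further compress/decompress pairs. `make_payload` and `benchmark` are made-up helpers, and the synthetic payload (rounded Gaussian float64 values) is only a stand-in for your real data.

```python
import random
import time
import zlib
from array import array

def make_payload(n_values=200_000, seed=0):
    """Synthetic float64 buffer; replace with bytes taken from your own data."""
    rng = random.Random(seed)
    values = (round(rng.gauss(0.0, 1.0), 2) for _ in range(n_values))
    return array("d", values).tobytes()

def benchmark(payload, codecs):
    """Time one compress/decompress round trip per codec and report the ratio."""
    results = []
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(payload)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        if restored != payload:
            raise ValueError(f"{name} did not round-trip the payload")
        results.append({
            "name": name,
            "ratio": len(payload) / len(packed),
            "compress_s": t1 - t0,
            "decompress_s": t2 - t1,
        })
    return results

codecs = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-6": (lambda b: zlib.compress(b, 6), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
}

results = benchmark(make_payload(), codecs)
for row in results:
    print(f"{row['name']}: ratio {row['ratio']:.2f}, "
          f"compress {row['compress_s']:.3f} s, decompress {row['decompress_s']:.3f} s")
```

A single round trip is noisy; for real decisions repeat each measurement a few times and time the actual to_hdf / read_hdf calls on your files as well.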