What is the recommended compression for HDF5 for fast read/write performance (in Python/pandas)?


I have read several times that turning on compression in HDF5 can lead to better read/write performance.

I wonder what the ideal settings are to achieve good read/write performance with:

 data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...) 

I'm already using fixed format (i.e. h5py), as it's faster than table. I have strong processors and do not care much about disk space.

I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.

There are a couple of possible compression filters that you could use. Since HDF5 version 1.8.11 you can also register 3rd-party compression filters.

Regarding performance:

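It depends on your access pattern, because you probably want to define proper dimensions for your chunks so that they align well with your access pattern; otherwise your performance will suffer a lot. For example, with 2500 rows x 9000 columns, if you know that you usually read one full row at a time, a chunk shape of (1, 9000) fits; if you usually read one full column at a time, (2500, 1) would be the matching shape. See here, here and here for more information.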
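To make the alignment point concrete, here is a small sketch that counts how many chunks a rectangular read has to touch, assuming a 2500 x 9000 dataset like the one in the question. `chunks_touched` is a made-up helper for illustration only; it is not part of HDF5, h5py or pandas.

```python
def chunks_touched(chunk_shape, row_range, col_range):
    """Count the chunks overlapped by a rectangular selection.

    row_range and col_range are half-open (start, stop) index ranges.
    """
    chunk_rows, chunk_cols = chunk_shape
    r0, r1 = row_range
    c0, c1 = col_range
    # Number of chunks between the first and last index hit along each axis.
    row_chunks = (r1 - 1) // chunk_rows - r0 // chunk_rows + 1
    col_chunks = (c1 - 1) // chunk_cols - c0 // chunk_cols + 1
    return row_chunks * col_chunks


# Two access patterns on a 2500 x 9000 dataset.
one_row = ((0, 1), (0, 9000))      # a single row, all columns
one_col = ((0, 2500), (0, 1))      # all rows, a single column

print(chunks_touched((1, 9000), *one_row))   # 1
print(chunks_touched((1, 9000), *one_col))   # 2500
print(chunks_touched((2500, 1), *one_col))   # 1
print(chunks_touched((2500, 1), *one_row))   # 9000
```

Every touched chunk has to be read and decompressed as a whole, so the mismatched cases do far more work than the matched ones for the same amount of requested data.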

However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use `read_hdf` with an iterator (see here) or do the partial IO yourself (see here), and thus doesn't really benefit that much from defining a good chunk size.
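For reference, a minimal sketch of what partial IO and chunked reading look like from pandas. It assumes PyTables is installed and that the data was written with `format='table'`, since as far as I know `chunksize` is not supported on fixed-format stores. The frame shape, key and file name are placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the real data (hypothetical shape and column names).
df = pd.DataFrame(np.random.rand(1000, 20), columns=[f"c{i}" for i in range(20)])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "partial.h5")
    df.to_hdf(path, key="data", mode="w", format="table")

    # Partial IO: read only rows 100..199 instead of the whole frame.
    part = pd.read_hdf(path, key="data", start=100, stop=200)

    # Iterate in chunks of 250 rows instead of loading everything at once.
    with pd.HDFStore(path, mode="r") as store:
        chunk_lengths = [len(chunk) for chunk in store.select("data", chunksize=250)]

print(part.shape)       # (100, 20)
print(chunk_lengths)    # [250, 250, 250, 250]
```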

Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.

Regarding your original question:

I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various different compression filters:

  • blosclz: the internal default compressor, heavily based on FastLZ.
  • lz4: a compact, very popular and fast compressor.
  • lz4hc: a tweaked version of lz4 that produces better compression ratios at the expense of speed.
  • snappy: a popular compressor used in many places.
  • zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.

These have different strengths, and the best thing is to try and benchmark them with your own data and see which works best.
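A rough sketch of such a benchmark, assuming pandas with PyTables installed. `benchmark_complibs` is a made-up helper, and the frame below is random placeholder data, which compresses poorly, so substitute your real DataFrame. Which `blosc:*` names are accepted depends on how your PyTables was built; snappy is left out here because not every build ships it.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark_complibs(df, complibs, complevel=5):
    """Time one fixed-format write and read of `df` per compression library."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bench.h5")
        for complib in complibs:
            # complib=None means no compression, so no level is passed either.
            level = complevel if complib else None

            start = time.perf_counter()
            df.to_hdf(path, key="data", mode="w", format="fixed",
                      complib=complib, complevel=level)
            write_s = time.perf_counter() - start

            start = time.perf_counter()
            pd.read_hdf(path, key="data")
            read_s = time.perf_counter() - start

            results[complib or "none"] = {
                "write_s": write_s,
                "read_s": read_s,
                "size_bytes": os.path.getsize(path),
            }
    return results


# Placeholder data; replace with a frame that looks like your real files.
df = pd.DataFrame(np.random.rand(500, 200), columns=[f"c{i}" for i in range(200)])
complibs = [None, "blosc:blosclz", "blosc:lz4", "blosc:lz4hc", "blosc:zlib"]
results = benchmark_complibs(df, complibs)

for name, r in results.items():
    print(f"{name:15s} write {r['write_s']:.3f}s  read {r['read_s']:.3f}s  "
          f"{r['size_bytes'] / 1e6:.2f} MB")
```

A single run like this is noisy (OS file cache, first-call overhead), so repeat each measurement a few times and also vary `complevel` before drawing conclusions.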

