Cassandra schema for doing hourly queries


I want to store data in Cassandra from many sources, and run a job once an hour to process the data points for a specific hour. What is the best schema to use for this?

To avoid having a hot spot, I cannot put all the data for an hour in a single partition, so the data for each hour needs to be spread across many partitions.

So I see two ways to allow queries by hour:

  1. Create a new table each hour, and do a select * on the table with no WHERE clause to read the hour. I think this would be efficient for reads and writes, but it is annoying to have to manage so many tables.

  2. Create a new table each week, have a column for the hour number within the week (i.e. 1 to 168), and create a secondary index on it. Then I can do a select * where hour=x. This appears to work, but I'm worried it won't scale if there are a lot of rows.

Does anyone know which approach would scale better? Is there a better way to do this?

Thanks.

In these situations you can use "buckets". A bucket is a way of splitting a partition into several separate pieces. For example, imagine a schema as follows:

    CREATE KEYSPACE timeseries
      WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

    USE timeseries;

    CREATE TABLE hourly (
      source_id text,
      hour text,
      date timestamp,
      data text,
      bucket int,
      PRIMARY KEY ((hour, bucket), date)
    );

You then use the bucket to divide each hour into, for example, 10 partitions, using a hash function of a known identifier (e.g. source_id).
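As a rough illustration of the write side, here is a minimal Python sketch. It assumes 10 buckets and uses `zlib.crc32` as the hash; that choice, the helper names (`bucket_for`, `hour_key`, `insert_params`) and the `%s` placeholder style are my assumptions, not part of the schema itself.

```python
import zlib
from datetime import datetime, timezone

NUM_BUCKETS = 10  # assumed bucket count; tune to your write volume


def bucket_for(source_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a source id to a stable bucket in [0, num_buckets)."""
    # crc32 is deterministic across processes and machines, unlike
    # Python's built-in hash() for strings, which is salted per process.
    return zlib.crc32(source_id.encode("utf-8")) % num_buckets


def hour_key(ts: datetime) -> str:
    """Truncate a timestamp to the hour, e.g. '2015-07-20 23:00'."""
    return ts.strftime("%Y-%m-%d %H:00")


def insert_params(source_id: str, ts: datetime, data: str) -> tuple:
    """Values for one row of the hourly table, in column order."""
    return (source_id, hour_key(ts), ts, data, bucket_for(source_id))


# Hypothetical statement text; placeholder style depends on your driver.
INSERT_CQL = (
    "INSERT INTO hourly (source_id, hour, date, data, bucket) "
    "VALUES (%s, %s, %s, %s, %s)"
)

row = insert_params(
    "sensor-42", datetime(2015, 7, 20, 23, 15, 42, tzinfo=timezone.utc), "payload"
)
```

Because the bucket depends only on source_id, all rows from one source within an hour land in the same partition, while different sources spread across the buckets.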

At query time, you need to specify the hour and the buckets:

    SELECT * FROM hourly WHERE hour = '2015-07-20 23:00' AND bucket IN (0,1,2,3,4,5,6,7,8,9);
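If the number of buckets is configurable, the statement can be generated rather than typed by hand. The sketch below only builds the query text and its parameters; `hourly_query` is a hypothetical helper, and the `%s` placeholders assume a driver that accepts that style for non-prepared statements.

```python
def hourly_query(hour: str, num_buckets: int = 10):
    """Build a parameterized query that reads every bucket of one hour."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    # One placeholder per bucket value in the IN list.
    placeholders = ", ".join(["%s"] * num_buckets)
    cql = f"SELECT * FROM hourly WHERE hour = %s AND bucket IN ({placeholders})"
    params = (hour, *range(num_buckets))
    return cql, params


cql, params = hourly_query("2015-07-20 23:00")
```

An alternative worth considering is to issue one query per bucket, possibly in parallel, instead of a single IN query, so that no single coordinator has to gather all partitions of the hour.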

The hash function is important because you want to distribute the data evenly amongst the different partitions, even if the identifier being hashed is not uniformly distributed. At the same time, you don't want too complex a function.

This jsfiddle gives an example of a hashing function that is simple, distributes the data uniformly, and can be reproduced in any language: http://jsfiddle.net/joscas/yfp72fq5/
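To get a feel for what "simple and uniform" means, here is a small Python sketch. The hash is a djb2-style string hash that I picked for illustration; it is not the function from the jsfiddle, and the `source-N` ids are made up. It counts how many ids fall into each of 10 buckets so you can eyeball the spread.

```python
from collections import Counter


def simple_hash(text: str) -> int:
    """djb2-style string hash, kept within 32 bits so it ports easily."""
    h = 5381
    for ch in text:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_counts(ids, num_buckets: int = 10) -> Counter:
    """Count how many ids land in each bucket."""
    return Counter(simple_hash(i) % num_buckets for i in ids)


# Hypothetical ids; substitute a sample of your real source ids.
counts = bucket_counts(f"source-{n}" for n in range(10_000))
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

Running the same check against a sample of real identifiers is a cheap way to confirm that no bucket ends up much larger than the others before committing to a hash.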

Otherwise, instead of a hash function, a modulo of the id or a modulo of the epoch time could suffice. If you use a modulo of the id, you have to check that the numbers don't end in a pattern that makes the result non-uniform. On the other hand, if you take the modulo of a timestamp, you will be writing to one bucket for a given amount of time and creating hot spots, especially if the number of buckets is small.
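Both pitfalls are easy to reproduce. In this Python sketch the ids and the timestamp are invented for the demonstration: ids that are all multiples of 10 collapse into a single bucket under modulo 10, and a burst of writes within the same epoch second all hit the same bucket.

```python
from collections import Counter

NUM_BUCKETS = 10  # assumed bucket count


def bucket_by_id(numeric_id: int) -> int:
    """Bucket from a plain modulo of a numeric id."""
    return numeric_id % NUM_BUCKETS


def bucket_by_epoch(epoch_seconds: int) -> int:
    """Bucket from a plain modulo of the write time in epoch seconds."""
    return epoch_seconds % NUM_BUCKETS


# Hypothetical ids that all end in 0: every one maps to bucket 0.
patterned_ids = [n * 10 for n in range(1, 101)]
id_counts = Counter(bucket_by_id(i) for i in patterned_ids)

# Hypothetical burst of 500 writes during one second: one bucket takes them all.
burst = [1437433200] * 500
time_counts = Counter(bucket_by_epoch(t) for t in burst)

print(id_counts)    # a single key holds all 100 ids
print(time_counts)  # a single key holds all 500 writes
```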

