sorting - Most efficient data structure for finding most frequent items -

i want extract frequent words google n-grams dataset 20 gb in uncompressed form. don't want whole data set resorted, frequent 5000 of them. if write

take 5000 $ sortby (flip $ comparing snd) dataset -- dataset :: io [(word::string, frequency::int)]

it's going endless waiting. should instead?

i know there data.array.marray package available in-place array computation, cannot see function items modification on documentation page. there data.hashtable.io, it's unordered data structure.

i'd use simple data.intmap.strict (with convenient lookuple function), don't think efficient because produces new map on each alteration. st monad improve that?

upd: i've posted final version of program on corereview.sx.

how

using splitat divide data set first 5000 items , rest.
sort first 5000 items frequency (ascending)
go through rest
- if item has greater frequency lowest freq in sorted items
- drop lowest frequency item sorted items
- insert new item in proper place in sorted items

the process becomes linear, though coefficient improved if use data structure sorted 5000 elements has sublinear min-delete , insertion.

for example, using data.heap heap package:

import data.list (foldl') import data.maybe (fromjust) import data.heap hiding (splitat)  mostfreq :: int -> [(string, int)] -> [(string, int)] mostfreq n dataset = final       -- change our pairs (string,int) (int,string)     pairs = map swap dataset     -- first `n` pairs in 1 list, , rest of pairs in     (first, rest) = splitat n pairs     -- put first `n` pairs minheap     start = fromlist first :: minheap (int, string)     -- run through rest of pairs     stop = foldl' step start rest     -- modifying heap replace least frequent pair     -- new pair if new pair more frequent     step heap pair = if viewhead heap < pair                        insert pair (fromjust $ viewtail heap)                        else heap     -- turn our heap of (int, string) pairs list of (string,int) pairs     final = map swap (tolist stop)     swap ~(a,b) = (b,a)

WIKI

Search This Blog

sorting - Most efficient data structure for finding most frequent items -

Comments

Post a Comment