i want extract frequent words google n-grams dataset 20 gb in uncompressed form. don't want whole data set resorted, frequent 5000 of them. if write
take 5000 $ sortby (flip $ comparing snd) dataset -- dataset :: io [(word::string, frequency::int)] it's going endless waiting. should instead?
i know there data.array.marray package available in-place array computation, cannot see function items modification on documentation page. there data.hashtable.io, it's unordered data structure.
i'd use simple data.intmap.strict (with convenient lookuple function), don't think efficient because produces new map on each alteration. st monad improve that?
upd: i've posted final version of program on corereview.sx.
how
- using
splitatdivide data set first 5000 items , rest. - sort first 5000 items frequency (ascending)
- go through rest
- if item has greater frequency lowest freq in sorted items
- drop lowest frequency item sorted items
- insert new item in proper place in sorted items
the process becomes linear, though coefficient improved if use data structure sorted 5000 elements has sublinear min-delete , insertion.
for example, using data.heap heap package:
import data.list (foldl') import data.maybe (fromjust) import data.heap hiding (splitat) mostfreq :: int -> [(string, int)] -> [(string, int)] mostfreq n dataset = final -- change our pairs (string,int) (int,string) pairs = map swap dataset -- first `n` pairs in 1 list, , rest of pairs in (first, rest) = splitat n pairs -- put first `n` pairs minheap start = fromlist first :: minheap (int, string) -- run through rest of pairs stop = foldl' step start rest -- modifying heap replace least frequent pair -- new pair if new pair more frequent step heap pair = if viewhead heap < pair insert pair (fromjust $ viewtail heap) else heap -- turn our heap of (int, string) pairs list of (string,int) pairs final = map swap (tolist stop) swap ~(a,b) = (b,a)
Comments
Post a Comment