problem: have large csv several columns , tens of millions of rows. 1 of these columns have query variable stores plaintext. need find common queries across data, because these user generated queries, need have queries
help installing [x product]
and
[x product] installing
contribute same query summary/reporting sake.
i have pruned data of common words , other json wrapping, etc.
you can assume values of each row contains important words query user made, , need retrieve common queries made.
i understand it's impossible provide code high level question this, appreciate being pointed in right direction - if keywords search in google begin search/understanding of problem space, or research paper or blog post addresses problem have.
you have interesting problem on hands. i'll throw out (hopefully) useful tools throw @ it.
the first thing want apply levenshtein distance algorithm correct possible misspellings in queries. part of data clean process you'll want before main algorithm it's thing.
the second thing you'll want apply stemming algorithm each query. stemming involves returning root of word. examples:
jumping -> jump jumps -> jump jumper -> jump this way normalizing of important keywords in queries.
now may go "more this" approach each query find other queries similar it, batch them together. how "more this" work?
it involves 3 components: tf, idf , field length. tf term frequency - how each term in current query (which boosts score). idf inverse document frequency - how term in of queries (which sinks score). , field length how long query field (the shorter more boost score). let me expand bit:
you want compare help instaling [product x] other records.
the first order of business correct misspellings across of queries:
help installing [product x] next, stem queries:
help install [product x] next, choose 1 query , begin comparing other queries (or @ least of other queries have not been matched similar other queries you've processed). we'll start query above.
let's create term vector
help (1) install (1) [x] (1) each of these terms appear once. that's term frequency current query. let's compare idf of each of these terms. turns out help comes 15,000 times across of queries, install comes 2,000 times, , [product x] comes 500 times. means help least relevant because appears often, , [product x] relevant because appears rarely. it?
and field length works finding score this: longer query, lower it's score. why? because if query of 20 characters matches terms, far more exact duplicate query of 1,000 characters user rambling , talking many different topics. see?
now can learn more tf/idf figure out implementation.
but have news you. of work has been done in lucene library. using lucene can index each of queries document. lucene apply stemming automatically when indexing it. furthermore, lucene has "more this" algorithm uses tf/idf you. and, finally, lucene can apply fuzzy matching using levenshtein distance calculator each of queries, too. awesome!! if find working lucene little close "bare metal", can use elasticsearch advanced high-level wrapper around lucene. kicks butt.
please note, not expert in these topics. however, hope gives ideas. cheers!
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ https://en.wikipedia.org/wiki/levenshtein_distance
Comments
Post a Comment