I am working with large text data with millions of lines in it. As a basic step of text analytics, I need to split the text into individual words and store the number of words in each line.
1) Is line.split() the most efficient way to split the text into words? (I'm not bothered about punctuation.)
2) What is the most efficient way to store the word counts? Through arrays, lists, or tuples? Which one is faster?
Sorry if this seems basic. I'm just getting started.
Have a look at NLTK for Python.
It handles operations like tokenization (splitting text into words, including punctuation and other non-trivial cases) efficiently for large files, and provides cool features such as dispersion plots (showing where words occur in a text) and word counts.
An example of the latter (taken from this NLTK cheatsheet):
>>> import nltk
>>> from nltk.book import *                # loads text1, text2, ... (needs the NLTK book data)
>>> len(text1)                             # number of words
>>> text1.count("heaven")                  # how many times does the word occur?
>>> fd = nltk.FreqDist(text1)              # information about word frequency
>>> fd["the"]                              # how many occurrences of the word 'the'
>>> fd.plot(50, cumulative=False)          # generate a chart of the 50 most frequent words

About the second part of your question: it depends on how you want to use these numbers later. If you're only interested in the raw numbers, a list is fine:
word_count = [len(text1), len(text2), len(text3), ...]
# average number of words per text
print(sum(word_count) / len(word_count))

If you want to store which text has how many words/tokens and want to access them by name, maybe you're better off with a dictionary:
word_count = {'first text': len(text1), 'second text': len(text2), ...}
# how many words are in the first text?
print(word_count['first text'])

When you are storing word counts as simple numbers, speed isn't really a matter of which data structure you're using; either a dict or a list is fine.
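To tie this back to the first question, here is a minimal sketch using only the standard library, assuming plain whitespace splitting is good enough (the question says punctuation doesn't matter). The function name count_words_per_line and the sample lines are made up for illustration; collections.Counter stands in for the frequency lookup shown with NLTK above.

```python
from collections import Counter


def count_words_per_line(lines):
    """Return (list of per-line word counts, Counter of word frequencies)."""
    counts = []
    freq = Counter()
    for line in lines:
        words = line.split()       # no argument: split on any run of whitespace
        counts.append(len(words))  # one integer per line, in input order
        freq.update(words)         # accumulate overall word frequencies
    return counts, freq


# Hypothetical sample input; any iterable of strings works.
sample = ["the quick brown fox", "", "the lazy dog sleeps"]
counts, freq = count_words_per_line(sample)
print(counts)       # [4, 0, 4]
print(freq["the"])  # 2
```

For a real file you could pass the open file object itself instead of a list, since iterating over it yields one line at a time and avoids loading millions of lines into memory at once.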