memory - Text analytics in Python


I am working with large text data with millions of lines in it. As a basic step of text analytics, I need to split the text into individual words and store the number of words in each line.

1) Is line.split() an efficient way to split the text into words? (I am not bothered about punctuation.)

2) What is an efficient way to store the word count? Through arrays/lists/tuples? Which one is faster?

Sorry if this seems basic. I am just getting started.

Have a look at NLTK for Python.

It handles operations such as tokenization (splitting text into words, including punctuation and other non-trivial cases) efficiently for large files, and it provides cool features such as dispersion plots (where words occur in the text) and word counts.

An example of the latter (taken from this NLTK cheatsheet):

    >>> import nltk
    >>> from nltk.book import text1       # sample text; requires the NLTK book data
    >>> len(text1)                        # number of words
    >>> text1.count("heaven")             # how many times does a word occur?
    >>> fd = nltk.FreqDist(text1)         # information about word frequency
    >>> fd["the"]                         # how many occurrences of the word 'the'?
    >>> fd.plot(50, cumulative=False)     # generate a chart of the 50 most frequent words
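If installing NLTK is not an option, the same kind of frequency count can be approximated with the standard library. The following is only a minimal sketch that swaps in collections.Counter for FreqDist and plain str.split() for NLTK's tokenizer; the sample string is made up for illustration.

```python
from collections import Counter

# Hypothetical sample text; in practice this would come from your own data.
sample = "the quick brown fox jumps over the lazy dog and the cat"

# Whitespace tokenization, as in the question (punctuation is not handled).
tokens = sample.split()

freq = Counter(tokens)            # maps each word to its number of occurrences
total_words = len(tokens)         # number of words
the_count = freq["the"]           # how many times does "the" occur?
most_frequent = freq.most_common(1)  # the single most frequent word with its count
```

Counter has no plotting helpers, so this covers only the counting part of the NLTK example.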

About the second part of your question: it depends on how you want to use these numbers further. If you're only interested in the raw numbers, a list is fine:

    word_count = [len(text1), len(text2), len(text3), ...]
    # how many words per text on average?
    print(sum(word_count) / len(word_count))
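Applied to the per-line counts from the question, the list approach might look roughly like the sketch below. The helper name count_words_per_line and the sample data are hypothetical, and io.StringIO only stands in for a real file object so the example is self-contained.

```python
import io

def count_words_per_line(lines):
    """Return a list with the number of whitespace-separated words in each line."""
    return [len(line.split()) for line in lines]

# io.StringIO stands in for a real file opened with open(path). Iterating
# over a file object yields one line at a time, so the full text is never
# held in memory at once; only the integer counts are kept.
fake_file = io.StringIO("one two three\n\nfour five\n")
word_counts = count_words_per_line(fake_file)

# Average number of words per line (an empty line counts as 0 words).
average = sum(word_counts) / len(word_counts)
```

With no arguments, split() splits on any run of whitespace and ignores leading and trailing whitespace, which is why the empty line yields a count of 0 rather than 1.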

If you want to store which text has how many words/tokens and access the counts by name, you may be better off with a dictionary:

    word_count = {'first text': len(text1), 'second text': len(text2), ...}
    # how many words are in the first text?
    print(word_count['first text'])

When you are storing word counts as simple numbers, speed isn't really a matter of which data structure you're using; either a dict or a list is fine.
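If memory rather than speed turns out to be the concern with millions of lines, the standard-library array module (one reading of the "arrays" mentioned in the question) stores plain integers more compactly than a list of Python int objects. This is a sketch under that assumption, with made-up sample lines; whether the saving matters depends on your data.

```python
from array import array

# Hypothetical sample lines; in practice these would be read from the file.
lines = ["one two three", "", "four five"]

# A regular list of per-line word counts.
counts_list = [len(line.split()) for line in lines]

# The same counts in a typed array of unsigned ints ("I"). Each element is a
# raw C integer of counts_array.itemsize bytes, not a full Python object.
counts_array = array("I", counts_list)

# A typed array supports the same basic operations as the list version.
total = sum(counts_array)
first = counts_array[0]
```

The array is indexed and iterated like a list, so code that computes sums or averages over the counts does not need to change.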

