I extracted tweets from Twitter using the twitteR package and saved them to a text file.
I have carried out the following on the corpus:

    xx <- tm_map(xx, removeNumbers, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, stripWhitespace, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, removePunctuation, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, strip_retweets, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, removeWords, stopwords("english"), lazy=TRUE, 'mc.cores=1')

(I am using mc.cores=1 and lazy=TRUE because otherwise R on Mac runs into errors.)

    tdm <- TermDocumentMatrix(xx)

But this term document matrix has a lot of strange symbols, meaningless words, and the like. If a tweet is

    RT @foxtel: One man stands between us and annihilation: @ianziering. Sharknado‚Äã 3: Oh Hell No! - July 23 on Foxtel @syfyau

After cleaning the tweet, I want only proper, complete English words left, i.e. a sentence/phrase void of everything else (user names, shortened words, URLs).

Example:

    One man stands between us and annihilation oh hell no on

(Note: the transformation commands in the tm package are able to remove stop words, punctuation, and whitespace, and to convert text to lowercase.)
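
The strange symbols mentioned above, such as the `‚Äã` in the sample tweet, look like non-ASCII characters (possibly a zero-width space) that were decoded with the wrong encoding. A minimal sketch of one way to drop them before building the matrix, assuming the text is valid UTF-8; the helper name `strip_non_ascii` is hypothetical:

```r
# Hypothetical helper: drop every character that has no ASCII equivalent.
# Assumes the input strings are UTF-8 encoded.
strip_non_ascii <- function(x) {
  # sub = "" replaces each non-convertible byte with nothing
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# "\u200b" is a zero-width space, written as an escape so the source stays ASCII.
cleaned <- strip_non_ascii("Sharknado\u200b 3: Oh Hell No!")
print(cleaned)
```

This also removes legitimate accented letters, so it only fits the case described here, where nothing but plain English words should remain.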
Using gsub and the stringr package
I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.

    clean_tweet = gsub("&amp", "", unclean_tweet)                               # HTML-escaped ampersands
    clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)          # retweet / via headers
    clean_tweet = gsub("@\\w+", "", clean_tweet)                                # screen names
    clean_tweet = gsub("[[:punct:]]", "", clean_tweet)
    clean_tweet = gsub("[[:digit:]]", "", clean_tweet)
    clean_tweet = gsub("http\\w+", "", clean_tweet)                             # URLs (punctuation already stripped)
    clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)                           # collapse runs of spaces/tabs to one space
    clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)                          # trim both ends

Ref: (Hicks, 2014)

After the above, I did the following.

    # Get rid of unnecessary spaces
    clean_tweet <- str_replace_all(clean_tweet, "  ", " ")
    # Get rid of URLs
    clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*", "")
    # Take out the retweet header, there is only one
    clean_tweet <- str_replace(clean_tweet, "RT @[a-z,A-Z]*: ", "")
    # Get rid of hashtags
    clean_tweet <- str_replace_all(clean_tweet, "#[a-z,A-Z]*", "")
    # Get rid of references to other screen names
    clean_tweet <- str_replace_all(clean_tweet, "@[a-z,A-Z]*", "")

Ref: (Stanton 2013)
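
The two passes above can be folded into a single function. The following is a minimal sketch in base R, not the code from either reference: the function name `clean_tweet_text` and the sample tweet are illustrative assumptions. It removes URLs before punctuation so the URL pattern can stay simple, and it replaces each removed piece with a space so neighbouring words never get glued together.

```r
# Hypothetical helper combining the cleaning steps; base R only.
clean_tweet_text <- function(x) {
  x <- gsub("&amp;?", " ", x)                      # HTML-escaped ampersands
  x <- gsub("\\b(RT|via)\\b(\\W*@\\w+)+", " ", x)  # retweet / via headers
  x <- gsub("https?://\\S+", " ", x)               # URLs, while "://" is still intact
  x <- gsub("[@#]\\w+", " ", x)                    # screen names and hashtags
  x <- gsub("[[:punct:]]", " ", x)                 # punctuation
  x <- gsub("[[:digit:]]", " ", x)                 # numbers
  x <- gsub("\\s+", " ", x)                        # collapse whitespace runs
  x <- gsub("^\\s+|\\s+$", "", x)                  # trim both ends
  tolower(x)
}

# Illustrative input modelled on the sample tweet (the URL is made up).
tweet <- paste("RT @foxtel: One man stands between us and annihilation:",
               "@ianziering. Sharknado 3: Oh Hell No! - July 23 on Foxtel",
               "@syfyau http://t.co/abc123")
print(clean_tweet_text(tweet))
```

Because `gsub` is vectorised, the same function can be applied to a character vector of tweets without collapsing them into one string first.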
Before doing any of the above, I collapsed the whole set of tweets into a single long character string using the following.
paste(mytweets, collapse=" ")
This cleaning process has worked quite well for me, as opposed to the tm_map transforms.
All I am left with now is a set of proper words and a few improper words. Now I have to figure out how to remove the non-proper English words. For that I have to subtract my set of words from a dictionary of words.
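
The dictionary step could be sketched as below. This is an assumption about how the comparison might be done, not a tested solution from the post: the helper name `keep_dictionary_words` is hypothetical and the dictionary here is a toy vector. A real word list would have to be loaded from somewhere, for example with `readLines()` on a word file such as `/usr/share/dict/words`, which exists on many Unix-like systems but is not guaranteed to be present.

```r
# Hypothetical helper: keep only the tokens that appear in a dictionary.
keep_dictionary_words <- function(text, dictionary) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]  # split on whitespace
  tokens <- tokens[nzchar(tokens)]                # drop empty tokens
  kept <- tokens[tokens %in% tolower(dictionary)] # set membership test
  paste(kept, collapse = " ")
}

# Toy dictionary for illustration only.
dictionary <- c("one", "man", "stands", "between", "us", "and",
                "annihilation", "oh", "hell", "no", "on")

cleaned <- "one man stands between us and annihilation sharknado oh hell no july on foxtel"
print(keep_dictionary_words(cleaned, dictionary))
```

With this toy dictionary the result is the target phrase from the example: "one man stands between us and annihilation oh hell no on". The quality of the output depends entirely on the word list used.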