How do I clean Twitter data in R?


I extracted tweets from Twitter using the twitteR package and saved them to a text file.

I have carried out the following on the corpus:

xx <- tm_map(xx, removeNumbers, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, stripWhitespace, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, removePunctuation, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, strip_retweets, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, removeWords, stopwords("english"), lazy = TRUE, mc.cores = 1)

(I am using mc.cores = 1 and lazy = TRUE because otherwise R on my Mac runs into errors.)

tdm <- TermDocumentMatrix(xx)

But this term-document matrix contains a lot of strange symbols, meaningless words, and the like. For example, if a tweet is

 rt @foxtel: 1 man stands between , annihilation: @ianziering.  sharknado‚Äã 3: oh hell no! - july 23 on foxtel @syfyau 

After cleaning the tweet, I want only proper, complete English words left, i.e. a sentence or phrase void of everything else (user names, shortened words, URLs).

Example:

one man stands between , annihilation oh hell no on  

(Note: the transformation commands in the tm package are only able to remove stop words, punctuation, and whitespace, and to convert text to lowercase.)
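Symbols like the ones after "sharknado" in the sample tweet often come from non-ASCII characters or an encoding mismatch when the file is read. As a minimal sketch, assuming only plain ASCII text is wanted and the input is UTF-8, base R's iconv can drop everything else. The sample string below is made up for illustration.

```r
# Made-up sample with a zero-width space (U+200B) after "sharknado".
tweet <- "sharknado\u200b 3: oh hell no!"

# Convert to ASCII; sub = "" drops every byte that has no ASCII equivalent.
ascii_only <- iconv(tweet, from = "UTF-8", to = "ASCII", sub = "")
ascii_only
```

This also strips legitimate accented letters, so it only fits if English-only output is acceptable.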

Using gsub and the stringr package

I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.

clean_tweet = gsub("&amp", "", unclean_tweet)                        # HTML-escaped ampersands
clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)    # retweet / via headers
clean_tweet = gsub("@\\w+", "", clean_tweet)                          # screen names
clean_tweet = gsub("[[:punct:]]", "", clean_tweet)                    # punctuation
clean_tweet = gsub("[[:digit:]]", "", clean_tweet)                    # numbers
clean_tweet = gsub("http\\w+", "", clean_tweet)                       # URLs (punctuation is already gone)
clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)                     # collapse runs of spaces and tabs
clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)                    # trim leading and trailing whitespace

Ref: (Hicks, 2014). After the above, I did the following:

library(stringr)

# Get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet, "  ", " ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*", "")
# Take out the retweet header; there is only one
clean_tweet <- str_replace(clean_tweet, "RT @[a-z,A-Z]*: ", "")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet, "#[a-z,A-Z]*", "")
# Get rid of references to other screen names
clean_tweet <- str_replace_all(clean_tweet, "@[a-z,A-Z]*", "")

Ref: (Stanton, 2013)

Before doing any of the above, I collapsed the whole set of tweets into a single long character string using the following:

paste(mytweets, collapse=" ")

This cleaning process has worked quite well for me, as opposed to the tm_map transforms.
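For reference, the gsub steps described above can be sketched as one vectorized helper, so it could also be applied to a character vector of tweets without collapsing them first. This is a variation rather than the exact code above: the function name clean_tweet_text and the sample tweet are hypothetical, URLs are removed before punctuation, and hashtags are dropped together with screen names.

```r
# Hypothetical helper (not from the cited references); works on a character vector.
clean_tweet_text <- function(x) {
  x <- gsub("&amp;?", "", x)                                  # HTML-escaped ampersands
  x <- gsub("\\b(RT|via)((?:\\b\\W*@\\w+)+)", "", x,
            ignore.case = TRUE, perl = TRUE)                  # retweet / via headers
  x <- gsub("http\\S+", "", x)                                # URLs, while they are still intact
  x <- gsub("[@#]\\w+", "", x)                                # screen names and hashtags
  x <- gsub("[[:punct:]]", "", x)                             # punctuation
  x <- gsub("[[:digit:]]", "", x)                             # numbers
  x <- gsub("[ \t]{2,}", " ", x)                              # collapse runs of spaces and tabs
  gsub("^\\s+|\\s+$", "", x)                                  # trim both ends
}

# Made-up sample tweet for illustration.
sample_tweet <- "RT @foxtel: 1 man stands: @ianziering. Oh hell no! - July 23 http://t.co/abc123 #sharknado"
cleaned <- clean_tweet_text(sample_tweet)
cleaned
```

Removing URLs first means the pattern can match the whole link instead of relying on what is left after punctuation has been stripped.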

All I am left with now is a set of proper words and a few improper words. Now I have to figure out how to remove the non-proper English words. I may have to subtract my set of words from a dictionary of words.
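One way the dictionary idea might look, as a rough sketch in base R: split the cleaned text into words and keep only those found in a reference word list. The helper name keep_dictionary_words and the tiny toy_dictionary are hypothetical; in practice the list would have to come from a real source, for example a system word file read with readLines, which is an assumption and not something shown here.

```r
# Hypothetical helper: keep only words that appear in a reference word list.
keep_dictionary_words <- function(text, dictionary) {
  words <- strsplit(tolower(text), "\\s+")[[1]]   # split on whitespace
  words <- words[nzchar(words)]                    # drop empty strings
  paste(words[words %in% tolower(dictionary)], collapse = " ")
}

# Toy word list for illustration only; a real dictionary would be much larger.
toy_dictionary <- c("one", "man", "stands", "oh", "hell", "no", "on")

cleaned_text <- "one man stands sharknado oh hell no on foxtel"
english_only <- keep_dictionary_words(cleaned_text, toy_dictionary)
english_only
```

Exact matching like this would also discard inflected forms and names that are missing from the list, so the choice of dictionary matters.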

