I have used removePunctuation from the "tm" package in R when building a term-document matrix. For some reason I'm still left with strange characters in the plot of letters versus proportion in the corpus I've analyzed.
Below is the code I used to clean the corpus:
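    library(tm)
    library(dplyr)
    library(stringr)
    library(ggplot2)
    library(qdap)  # assumed source of dist_tab()

    # toSpace is not defined in this snippet; assumed to be the usual helper
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

    # Clean the corpus
    docs <- tm_map(docs, toSpace, "/|@|\\|")
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, stripWhitespace)

    dtm <- DocumentTermMatrix(docs)
    freq <- colSums(as.matrix(dtm))

    # Keep only terms shorter than 20 characters
    words <- dtm %>%
      as.matrix %>%
      colnames %>%
      (function(x) x[nchar(x) < 20])

    # Split the terms into single characters and plot each character's share
    words %>%
      str_split("") %>%
      sapply(function(x) x[-1]) %>%  # drops the first element of each split
      unlist %>%
      dist_tab %>%
      mutate(letter = factor(toupper(interval),
                             levels = toupper(interval[order(freq)]))) %>%
      ggplot(aes(letter, weight = percent)) +
      geom_bar() +
      coord_flip() +
      ylab("proportion") +
      scale_y_continuous(breaks = seq(0, 12, 2),
                         labels = function(x) paste0(x, "%"),
                         expand = c(0, 0), limits = c(0, 12))

I'm left with the following plot: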

I'm trying to figure out what went wrong here.