I am building a text corpus for an NLP project in Python. I've seen this text format in the LSHTC4 Kaggle challenge:

    5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1

The first number corresponds to the label.
Each set of numbers separated by ':' corresponds to a (feature, value) pair of a vector: the first number is the feature's id and the second number is its frequency (for example, the feature with id 18 appears 2 times in this instance).
I don't know whether this is a common way to pre-process text data into a numeric vector. I can't find the pre-processing procedure in the challenge; the data is already pre-processed.
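For what it's worth, this layout closely resembles the sparse "label index:value" format used by tools such as libsvm and svmlight, and it can be produced from raw text with a plain bag-of-words count. The challenge's actual pipeline is not described here, so the following is only a minimal sketch under assumptions: the function names `build_vocabulary` and `to_sparse_line` are hypothetical, tokens are lowercased and split on whitespace, and feature ids are assigned in order of first appearance.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for text in documents:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse_line(label, text, vocab):
    """Format one document as 'label id:count id:count ...', sorted by id."""
    # Count how often each known feature id occurs in this document.
    counts = Counter(vocab[t] for t in text.lower().split() if t in vocab)
    pairs = " ".join(f"{fid}:{n}" for fid, n in sorted(counts.items()))
    return f"{label} {pairs}"


# Toy data; the labels 5 and 7 are made up for illustration.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
lines = [to_sparse_line(label, text, vocab) for label, text in zip([5, 7], docs)]
print("\n".join(lines))
```

Only features that actually occur in a document are written out, which is why each line stays short even when the vocabulary is very large.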
No package is necessary in R (nor in Python, if I'm not mistaken). First split the string (and remove the initial 5). I'm guessing you want the result as numbers, not strings:

    x <- "5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1"
    # split on spaces and colons, drop the leading label, convert to integers
    y <- as.integer(unlist(strsplit(x, split = " |:"))[-1])
    feature <- y[seq(1, length(y), by = 2)]
    # [1]     0     8    18    54   442  3784  5640 43501
    value <- y[seq(2, length(y), by = 2)]
    # [1] 10  1  2  1  2  1  1  1

If you want them side by side:

    cbind(feature, value)
    #      feature value
    # [1,]       0    10
    # [2,]       8     1
    # [3,]      18     2
    # [4,]      54     1
    # [5,]     442     2
    # [6,]    3784     1
    # [7,]    5640     1
    # [8,]   43501     1

If you want to assign them to a data.table for analysis:

    library(data.table)
    dt <- data.table(feature = feature, value = value)
    dt
    #    feature value
    # 1:       0    10
    # 2:       8     1
    # 3:      18     2
    # 4:      54     1
    # 5:     442     2
    # 6:    3784     1
    # 7:    5640     1
    # 8:   43501     1

etc.
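
Since the question mentions Python, the same split can be sketched there with only built-in string methods. This is an illustration rather than a reference implementation: the function name `parse_sparse_line` is hypothetical, and it assumes a single integer label followed by whitespace-separated `id:count` pairs, as in the sample line.

```python
def parse_sparse_line(line):
    """Split 'label id:count ...' into a label and parallel id/count lists."""
    # The first whitespace-separated token is the label; the rest are pairs.
    label, *pairs = line.split()
    features, values = [], []
    for pair in pairs:
        fid, count = pair.split(":")
        features.append(int(fid))
        values.append(int(count))
    return int(label), features, values


x = "5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1"
label, feature, value = parse_sparse_line(x)
print(label)
print(list(zip(feature, value)))
```

Unlike the R snippet, this keeps the label instead of discarding it, which is usually what a later classification step needs.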