nlp - Text preprocessing in Python


I am building a text corpus for an NLP project in Python. I've seen this text format in the LSHTC4 Kaggle challenge:

5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1  

The first number corresponds to the label.

Each pair of numbers separated by ':' corresponds to a (feature, value) pair of the vector: the first number is the feature's ID and the second is its frequency (for example, the feature with ID 18 appears 2 times in the instance).

I don't know whether this is a common way to pre-process text data into a numeric vector. I can't find the pre-processing procedure in the challenge; the data is already pre-processed.
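For context, the layout resembles the sparse "label id:count" format commonly associated with LibSVM/SVMlight, where each instance lists only its non-zero features. The challenge's actual pipeline is not documented here, so the following is only a minimal sketch of how such a line could be produced from tokens with a bag-of-words count. The `encode` helper and the incremental `vocab` ID scheme are hypothetical, not the challenge's method.

```python
from collections import Counter

def encode(tokens, label, vocab):
    """Build 'label id:count id:count ...' from a list of tokens.

    vocab maps token -> integer id and is extended in place
    (hypothetical scheme: ids are assigned in order of first appearance).
    """
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    # Count how often each feature id occurs in this instance.
    counts = Counter(vocab[tok] for tok in tokens)
    # Emit only non-zero features, sorted by feature id.
    pairs = " ".join(f"{fid}:{n}" for fid, n in sorted(counts.items()))
    return f"{label} {pairs}"

vocab = {}
line = encode("the cat sat on the mat".split(), label=5, vocab=vocab)
print(line)
```

Real pipelines usually add steps such as lowercasing, stop-word removal, or stemming before counting; those are left out here.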

No package is necessary in R (nor in Python, if I'm not mistaken). First split the string (and remove the initial 5). I'm guessing you want the result as numbers, not strings:

    x <- "5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1"
    # split on spaces and colons, then drop the leading label
    y <- as.integer(unlist(strsplit(x, split=" |:"))[-1])
    feature <- y[seq(1, length(y), by=2)]
    # [1]     0     8    18    54   442  3784  5640 43501
    value <- y[seq(2, length(y), by=2)]
    # [1] 10  1  2  1  2  1  1  1

If you want them side by side:

    cbind(feature, value)
         feature value
    [1,]       0    10
    [2,]       8     1
    [3,]      18     2
    [4,]      54     1
    [5,]     442     2
    [6,]    3784     1
    [7,]    5640     1
    [8,]   43501     1

If you want to assign them to a data.table for analysis:

    library(data.table)
    dt <- data.table(feature=feature, value=value)

    > dt
       feature value
    1:       0    10
    2:       8     1
    3:      18     2
    4:      54     1
    5:     442     2
    6:    3784     1
    7:    5640     1
    8:   43501     1

etc.
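Since the question is about Python, the same steps can be sketched with the standard library only. This mirrors the R approach (split on spaces and colons, drop the label, take alternating elements); the variable names are carried over from the R code and `rows` is just an illustrative name for the paired result.

```python
import re

x = "5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1"

# Split on spaces and colons, convert to integers,
# and peel off the leading label.
label, *y = (int(tok) for tok in re.split(r"[ :]", x))

# Even positions hold feature ids, odd positions hold their counts.
feature = y[0::2]
value = y[1::2]

# Side-by-side (feature, value) pairs, analogous to cbind().
rows = list(zip(feature, value))

print(label)
print(rows)
```

From here, `dict(rows)` gives a feature-to-count mapping for one instance, which is usually convenient for further analysis.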

