python - Pandas + Patsy + Statsmodels Linear Reg issue passing in categorical variable (duplicate rows)
[Preface: I realize I should have used a classification model (maybe a decision tree), but instead I ended up using a linear regression model.]
I had a pandas dataframe such as this:

and I want to predict the audience score using genre, year, and tomato-meter score. As constructed, the genres for each movie came in a list, so I felt the need to isolate each genre and pass each one to the model as a separate variable.
After doing so, my modified dataframe looks like this, with duplicate rows for each movie and each genre element of the movie isolated (just one movie pulled from the dataframe to show):

Now, my question is: can I pass this second dataframe to patsy and a statsmodels linear regression, or will the row duplication introduce bias into the model?
y1, x1 = dmatrices('Q("audience score") ~ year + Q("tomato-meter") + genre', data=df2, return_type='dataframe')
In summary, I am looking for a way for patsy and the model to recognize and treat each genre as a separate variable. I want to make sure I'm not fudging the numbers or the model by passing in the dataframe in this format (as not every movie has the same number of genres).
I see two problems with this approach:
Parameter estimates:
If there are different numbers of repeated observations, then observations with multiple categories get a larger weight than observations with a single category. This could be corrected by using weights in the linear model: use WLS with weights equal to the inverse of the number of repetitions (or the square root of it?). Weights are not available for other models such as Poisson, Logit, or GLM-Binomial. This will not make a large difference to the parameter estimates if there is no "pattern", i.e. if the underlying parameters are not systematically different across movies with different numbers of categories.
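The weighting idea above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the toy frame `df2`, its column names (`movie`, `tomato_meter`, `audience_score`) and all values are made up. It assumes the statsmodels convention that WLS weights enter the squared-error objective linearly, so a weight of 1/n per row gives each movie a total weight of one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: one row per (movie, genre) pair, as in the duplicated frame.
df2 = pd.DataFrame({
    "movie": ["A", "A", "A", "B", "C", "C", "D", "E", "E", "F"],
    "genre": ["Action", "Comedy", "Drama", "Drama", "Action",
              "Comedy", "Comedy", "Drama", "Action", "Action"],
    "year": [2001, 2001, 2001, 1995, 2010, 2010, 1988, 2015, 2015, 2003],
    "tomato_meter": [85, 85, 85, 60, 40, 40, 72, 91, 91, 55],
    "audience_score": [80, 80, 80, 66, 51, 51, 70, 88, 88, 49],
})

# Number of rows each movie contributes, broadcast back to every row.
n_rep = df2.groupby("movie")["genre"].transform("count")

# Weight each row by the inverse of its movie's repetition count,
# so that every movie carries the same total weight.
df2["w"] = 1.0 / n_rep

res_wls = smf.wls(
    "audience_score ~ year + tomato_meter + genre",
    data=df2,
    weights=df2["w"],
).fit()
print(res_wls.params)
```

The weights only rebalance the parameter estimates. The standard errors still treat every row as an independent observation.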
Inference, standard errors of the parameter estimates:
All basic models such as OLS, Poisson, and so on assume that each row is an independent observation. The total number of rows will be larger than the number of actual observations, so the estimated standard errors of the parameters will be underestimated. (We could use cluster-robust standard errors, but I never checked how well they work with duplicate observations, i.e. when the response is identical across several observations.)
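As a rough sketch of the cluster-robust option mentioned in the parenthesis, statsmodels accepts `cov_type="cluster"` with a `groups` array in `fit`. The data below are simulated and the column names are hypothetical. As the answer notes, how well this behaves when the response is exactly duplicated within a cluster has not been checked here either.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the duplicated frame: each movie appears once per genre.
rng = np.random.default_rng(0)
genres = ["Action", "Comedy", "Drama"]
rows = []
for movie_id in range(60):
    year = int(rng.integers(1980, 2020))
    tomato = float(rng.uniform(10, 100))
    score = float(np.clip(0.6 * tomato + 20 + rng.normal(0, 8), 0, 100))
    k = int(rng.integers(1, 4))  # each movie gets 1 to 3 genres
    for g in rng.choice(genres, size=k, replace=False):
        rows.append({"movie": movie_id, "genre": str(g), "year": year,
                     "tomato_meter": tomato, "audience_score": score})
df2 = pd.DataFrame(rows)

formula = "audience_score ~ year + tomato_meter + genre"

# Default (nonrobust) covariance: treats every row as independent.
res_plain = smf.ols(formula, data=df2).fit()

# Cluster-robust covariance: rows of the same movie form one cluster.
res_cluster = smf.ols(formula, data=df2).fit(
    cov_type="cluster",
    cov_kwds={"groups": df2["movie"].to_numpy()},
)

# Point estimates are identical; only the standard errors differ.
print(pd.DataFrame({"plain": res_plain.bse, "cluster": res_cluster.bse}))
```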
Alternative
As an alternative to repeating observations, we can encode the categories as non-exclusive dummy variables. For example, if we have three levels of the categorical variable (the movie categories in this case), then we add a 1 in each corresponding column if the observation is "in" that category.
Patsy doesn't have premade support for this, so the design matrix for the movie categories would need to be built by hand or as the sum of the individual dummy matrices (without dropping a reference category).
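One way to build such a non-exclusive dummy matrix by hand is with pandas string methods. This is only a sketch: the frame `df`, its `genres` list column and the values are hypothetical, and it assumes no genre name contains the `|` separator.

```python
import pandas as pd

# Hypothetical original frame: one row per movie, genres stored as a list.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Comedy"], ["Drama"], ["Action", "Drama", "Comedy"]],
    "year": [2001, 1995, 2010],
    "tomato_meter": [85, 60, 40],
})

# Join each list into one delimited string, then expand into 0/1 columns.
# A movie gets a 1 in every genre column it belongs to, so rows are not
# duplicated and no reference category is dropped.
genre_dummies = df["genres"].str.join("|").str.get_dummies(sep="|")

# Design matrix: numeric regressors plus the non-exclusive genre indicators.
X = pd.concat([df[["year", "tomato_meter"]], genre_dummies], axis=1)
print(X)
```

The resulting `X` can be passed to a statsmodels model directly (with a constant added if wanted) instead of going through a patsy formula for the genre term.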
Alternative model
This is not directly related to the issue of multiple categories in the explanatory variables.
The response variable, movie ratings, is bounded between 0 and 100. A linear model will work well as a local approximation, but it will not take into account that observed ratings are in a limited range and will not enforce that range for predictions.
Poisson regression could be used to take the non-negativity into account, but it wouldn't use the upper bound. Two alternatives that would be more appropriate are GLM with the Binomial family and the total count for each observation set to 100 (the maximum possible rating), or a binary model, e.g. Logit or Probit, after rescaling the ratings to be between 0 and 1. The latter corresponds to estimating a model for proportions, which can be estimated with the statsmodels binary response models. To have inference that is correct even if the data are not binary, we can use robust standard errors. For example:
result = sm.Logit(y_proportion, x).fit(cov_type='HC0')