python - Evaluate slope and error for a specific category in a statsmodels OLS fit
I have a dataframe `df` with the following fields: `weight`, `length`, and `animal`. The first two are continuous variables, while `animal` is a categorical variable with the values `cat`, `dog`, and `snake`.

I'd like to estimate the relationship between weight and length, but this estimate needs to be conditioned on the type of animal, so I interact the `length` variable with the `animal` categorical variable.
```python
model = ols(formula='weight ~ length * animal', data=df)
results = model.fit()
```
How can I programmatically extract the slope of the relationship between weight and length for, e.g., snakes? I understand how to do this manually: add the `length` coefficient to the `animal[T.snake]:length` coefficient. But this is cumbersome and manual, and requires me to handle the base case specially, so I'd like to extract this information automatically.

Furthermore, I'd like to estimate the error on this slope. I believe I understand how to calculate it by combining the standard errors and covariances (more precisely, by performing the calculation described here). But this is even more cumbersome than the above, and I'm similarly wondering if there's a shortcut for extracting this information.
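The error combination I have in mind can be written compactly as a quadratic form: for a contrast vector `c` that picks out the relevant coefficients, Var(c'b) = c' Cov(b) c. A minimal self-contained sketch with a made-up 2x2 covariance matrix (the numbers are hypothetical, not taken from any fit):

```python
import numpy as np

# Hypothetical covariance matrix of two coefficients (b1, b2);
# these numbers are made up for illustration only.
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])

# Contrast vector selecting the sum b1 + b2
# (e.g. length plus animal[T.snake]:length).
c = np.array([1.0, 1.0])

# Var(b1 + b2) = Var(b1) + Var(b2) + 2*Cov(b1, b2) = c' Cov c
var_sum = c @ cov @ c
se_sum = np.sqrt(var_sum)
print(var_sum, se_sum)  # 0.15 and its square root
```

With a fitted statsmodels results object, `cov` would come from `results.cov_params()` restricted to the relevant parameter names.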
My manual method for calculating these follows.

Edit (06/22/2015): There seems to be an error in my original code below for calculating the errors. The standard errors calculated in user333700's answer are different from the ones I calculate, but I haven't invested the time in figuring out why.
```python
def get_contained_animal(animals, p):
    # Relies on parameters being of the form animal[T.snake]:length.
    for a in animals:
        if a in p:
            return a
    return None

animals = ['cat', 'dog', 'snake']
slopes = {}
errors = {}
for animal in animals:
    slope = 0.
    params = []
    # If this parameter relates to the length variable and
    # to the animal in question, add it to the slope.
    for param, val in results.params.items():
        ac = get_contained_animal(animals, param)
        if param == 'length' or ('length' in param and (ac is None or ac == animal)):
            params.append(param)
            slope += val

    # Calculate the overall error by adding variances and covariances.
    tot_err = 0.
    for i, p1 in enumerate(params):
        tot_err += results.bse[p1] * results.bse[p1]
        # Start at i + 1 so the variance of p1 is not double-counted.
        for p2 in params[i + 1:]:
            # Add the covariance of this pair of parameters.
            tot_err += 2 * results.cov_params()[p1][p2]

    slopes[animal] = slope
    errors[animal] = tot_err ** 0.5
```
This code might seem like overkill, but in my real-world use case I have a continuous variable interacting with two separate categorical variables, each with a large number of categories (along with other terms in the model that I need to ignore for these purposes).
Very brief background:

The general question is: how does the prediction change if we change one of the explanatory variables, holding the other explanatory variables fixed or averaging over them?

In nonlinear discrete models, there is a special `margins` method that calculates this, although it is not implemented for changes in categorical variables.

In the linear model, the prediction and the change in prediction are linear functions of the estimated parameters, so we can (mis)use `t_test` to calculate the effect, its standard error, and its confidence interval for us.

(Aside: There are more helper methods in the works for statsmodels to make prediction and margin calculations like this easier, which should become available later in the year.)
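The linearity point can be illustrated with a pure-NumPy sketch on hypothetical data (ordinary least squares computed by hand, not the animal example): the change in prediction between two design rows `x1` and `x2` is `(x2 - x1) @ beta_hat`, which is exactly the kind of linear combination that `t_test` evaluates.

```python
import numpy as np

# Hypothetical data; a least-squares fit by hand to show the linearity.
rng = np.random.RandomState(0)
X = np.column_stack([np.ones(30), rng.randn(30), rng.randn(30)])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.randn(30)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Two explanatory-variable settings differing only in the second regressor,
# analogous to length = 1 versus length = 2.
x1 = np.array([1.0, 1.0, 0.0])
x2 = np.array([1.0, 2.0, 0.0])

# Change in prediction == the contrast (x2 - x1) applied to the parameters.
effect = (x2 - x1) @ beta_hat
print(effect)
```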
As a brief explanation of the following code:

- I make up a similar example.
- I define the explanatory variables for length = 1 and length = 2, for each animal type.
- Then, I calculate the difference between these explanatory variables.
- This defines a linear combination, or contrast, of the parameters, which can be used in `t_test`.

Finally, I compare with the result of `predict` to check that I didn't make any obvious mistakes. (I assume this is correct, but I wrote it pretty fast.)
```python
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS

np.random.seed(2)
nobs = 20

animal_names = np.array(['cat', 'dog', 'snake'])
# Note: np.random.random_integers is removed in recent NumPy;
# np.random.randint(0, 3, size=nobs) is the modern equivalent.
animal_idx = np.random.random_integers(0, 2, size=nobs)
animal = animal_names[animal_idx]
length = np.random.randn(nobs) + animal_idx
weight = np.random.randn(nobs) + animal_idx + length

data = pd.DataFrame(dict(length=length, weight=weight, animal=animal))

res = OLS.from_formula('weight ~ length * animal', data=data).fit()
print(res.summary())

data_predict1 = pd.DataFrame(dict(length=np.ones(3), weight=np.ones(3),
                                  animal=animal_names))
data_predict2 = pd.DataFrame(dict(length=2 * np.ones(3), weight=np.ones(3),
                                  animal=animal_names))

import patsy
x1 = patsy.dmatrix('length * animal', data_predict1)
x2 = patsy.dmatrix('length * animal', data_predict2)

tt = res.t_test(x2 - x1)
print(tt.summary(xname=animal_names.tolist()))
```
The result of the last print is:
```
                             Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
cat            1.0980      0.280      3.926      0.002         0.498     1.698
dog            0.9664      0.860      1.124      0.280        -0.878     2.811
snake          1.5930      0.428      3.720      0.002         0.675     2.511
==============================================================================
```
We can verify these results using `predict` by comparing the difference in predicted weight when the length for a given animal type increases from 1 to 2:
```python
>>> [res.predict({'length': 2, 'animal': [an]}) -
...  res.predict({'length': 1, 'animal': [an]}) for an in animal_names]
[array([ 1.09801656]), array([ 0.96641455]), array([ 1.59301594])]
>>> tt.effect
array([ 1.09801656,  0.96641455,  1.59301594])
```
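For completeness, the "std err" column above can also be reproduced by hand: given the parameter covariance matrix (what `results.cov_params()` returns), the standard error of a contrast row `r` is `sqrt(r @ Cov(b) @ r)`. A pure-NumPy sketch on made-up data (an OLS fit by hand, not the animal example above):

```python
import numpy as np

# Hypothetical OLS fit computed by hand; data and names are illustrative.
rng = np.random.RandomState(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.randn(n), rng.randn(n)])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.randn(n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - k)            # unbiased residual variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # analogue of results.cov_params()

r = np.array([0.0, 1.0, 0.0])               # contrast: one-unit change in x1
effect = r @ beta                           # analogue of tt.effect
se = np.sqrt(r @ cov_beta @ r)              # analogue of the "std err" column
print(effect, se)
```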
Note: I initially forgot to add a seed for the random numbers, so the numbers in an earlier version of this answer cannot be replicated.