
skkirkham
Formatting results from lm and glm functions in R
Hi all
While doing some development on a GLM model for a client I realised that the R naming convention for model coefficients makes it very difficult to split a coefficient name back into the original attribute name and value, unless you are very strict and careful about how you pass column names and values into R. In practice that could mean a lot of preprocessing beforehand. Not ideal. So I've done a bit of investigation in R and come up with a way of converting the coefficient names back into their constituent parts.
As an example, consider some demographic data on customers where one categorical attribute is GENDER, taking the values Female, Male or Unknown. Within R the lm or glm calls produce a scorecard with one coefficient for each continuous attribute and one for each non-base value of a categorical attribute (known as a factor in R). So for the attribute GENDER the coefficients generated are called GENDERMale and GENDERUnknown, namely a concatenation of the attribute name and value. The coefficients themselves are the scores relative to the base case, which in this example is Female (there is no GENDERFemale coefficient; its score is implicitly zero). Note that by default R orders the levels of a factor alphabetically and, with the default contrasts, uses the first level as the base case, unless you change the order yourself, for example with relevel. The scores generated for the other values represent the effect of that value compared to the base value.
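To see the naming convention for yourself, here is a minimal sketch on made-up data. The data frame toy and the column SPEND are invented purely for illustration; only the GENDER attribute and its three values come from the example above.

```r
# Toy data, made up for illustration: a numeric outcome and a three-level factor
set.seed(1)
toy <- data.frame(
  SPEND  = rnorm(30),
  GENDER = factor(rep(c("Female", "Male", "Unknown"), times = 10))
)

fit <- lm(SPEND ~ GENDER, data = toy)

# Coefficient names are the attribute name and the level pasted together;
# the base level (Female) gets no coefficient of its own
names(coef(fit))

# The levels R recorded for each factor in the model, base level first
fit$xlevels

# relevel() changes the base level, and with it the set of coefficient names
toy$GENDER <- relevel(toy$GENDER, ref = "Unknown")
fit2 <- lm(SPEND ~ GENDER, data = toy)
names(coef(fit2))
```

The first call lists (Intercept), GENDERMale and GENDERUnknown; after the relevel the names become (Intercept), GENDERFemale and GENDERMale. The xlevels component of the fitted model is what makes it possible to split the names again reliably.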
If you want to use Kognitio to score this model against a large data set without having to pass all your data through R, you need to think about how you output this information for later use. It is difficult to turn a coefficient name such as GENDERMale back into an SQL snippet of the form case when GENDER='Male' then Score else 0 end, because nothing in the name marks where the attribute ends and the value begins.
It therefore makes sense to output the original attribute, the value and the score as separate columns rather than the concatenated coefficient name. The R function kog_coeffs, listed below, does this.
To use the function, copy it into the top of your R code and then call it as shown in the final listing.
Code:
case when GENDER = 'Male' then Score else 0 end
Code:
kog_coeffs <- function(mfit) {
  # Count the output rows: one per level for a factor, one per continuous attribute
  N <- 0
  for (x in labels(mfit$terms)) {
    N <- if (attr(mfit$terms, "dataClasses")[x] == "factor") N + length(unlist(mfit$xlevels[x])) else N + 1
  }
  #
  # Set up new data frame to hold results
  # Is_Factor is flag for factor columns
  # Is_Base indicates if this is the default value
  my_coef <- data.frame(matrix(0, ncol = 5, nrow = N))
  colnames(my_coef) <- c("Rcoef", "Is_Factor", "Is_Base", "Att", "Level")
  # initialise index
  i <- 1
  for (x in labels(mfit$terms)) {
    if (attr(mfit$terms, "dataClasses")[x] == "factor") {
      # for a factor, loop through its levels building R coef names and flags
      # First entry in xlevels is always the default even if you relevel
      j <- 1
      for (y in as.vector(unlist(mfit$xlevels[x]))) {
        my_coef$Rcoef[i] <- paste(x, y, sep = "")
        my_coef$Is_Factor[i] <- 1
        my_coef$Is_Base[i] <- if (j == 1) 1 else 0
        my_coef$Att[i] <- x
        my_coef$Level[i] <- y
        j <- j + 1
        i <- i + 1
      }
    } else {
      my_coef$Rcoef[i] <- x
      my_coef$Is_Factor[i] <- 0
      my_coef$Is_Base[i] <- 1
      my_coef$Att[i] <- x
      i <- i + 1
    }
  }
  # merge the split names and flags with summary coeffs info
  # and set base values to zero.
  #
  r_coef <- data.frame(summary(mfit)$coefficients)
  # for lm the last two columns hold the t value and its probability
  colnames(r_coef) <- c("Score", "StdError", "ZValue", "ProbZ")
  merge1 <- merge(my_coef, r_coef, by.x = "Rcoef", by.y = "row.names", all.x = TRUE, all.y = TRUE)
  # Set Is_Base and Att value for the intercept coefficient if it exists
  # (indexing the column is a no-op when the model has no intercept row)
  is_int <- merge1$Rcoef == "(Intercept)"
  merge1$Is_Base[is_int] <- 1
  merge1$Att[is_int] <- "Intercept"
  merge1[is.na(merge1)] <- 0
  return(merge1)
}
Code:
# load data and run your model assigning it to my_model
# As example using the logistic regression code from here:
# http://data.princeton.edu/R/glms.html (thanks Germán)
# stringsAsFactors=TRUE is needed from R 4.0.0 onwards, where text columns
# are otherwise read as character and kog_coeffs would not see them as factors
cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header=TRUE, stringsAsFactors=TRUE)
my_model <- glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial, data = cuse)
#
# Call the kog_coeffs function to see output
#
kog_coeffs(my_model)
#
# or assign it to my_output
#
my_output <- kog_coeffs(my_model)
my_output
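Once the output is split into attribute, level and score, building the SQL is mostly string pasting. The following is only a sketch of one way to do it, not a tested Kognitio script: the data frame coefs mimics the shape of the kog_coeffs output with made-up scores, and the helper sql_term, the INCOME attribute, the CUSTOMERS table and the LINEAR_SCORE column are all hypothetical names.

```r
# Hypothetical rows in the shape kog_coeffs() returns; the scores are made up
coefs <- data.frame(
  Rcoef     = c("(Intercept)", "GENDERFemale", "GENDERMale", "GENDERUnknown", "INCOME"),
  Is_Factor = c(0, 1, 1, 1, 0),
  Is_Base   = c(1, 1, 0, 0, 1),
  Att       = c("Intercept", "GENDER", "GENDER", "GENDER", "INCOME"),
  Level     = c("0", "Female", "Male", "Unknown", "0"),
  Score     = c(-0.5, 0, 0.25, -0.1, 0.002),
  stringsAsFactors = FALSE
)

# Turn one row into one SQL term (hypothetical helper)
sql_term <- function(att, level, score, is_factor) {
  s <- format(score, digits = 15, scientific = FALSE)
  if (att == "Intercept") {
    s
  } else if (is_factor == 1) {
    # double any single quotes so the level is a valid SQL string literal
    lit <- gsub("'", "''", level, fixed = TRUE)
    paste0("case when ", att, " = '", lit, "' then ", s, " else 0 end")
  } else {
    # continuous attribute: multiply the column by its score
    paste0(att, " * (", s, ")")
  }
}

# Base levels of factors score zero, so they need no term
keep <- !(coefs$Is_Factor == 1 & coefs$Is_Base == 1)
sql_terms <- mapply(sql_term, coefs$Att[keep], coefs$Level[keep],
                    coefs$Score[keep], coefs$Is_Factor[keep], USE.NAMES = FALSE)

# Sum the terms into one expression; CUSTOMERS is a placeholder table name
cat("select\n  ", paste(sql_terms, collapse = "\n  + "),
    "\n  as LINEAR_SCORE\nfrom CUSTOMERS\n", sep = "")
```

For a glm the summed expression is the linear predictor on the link scale, so for a logistic model you would still need to apply the inverse logit in SQL to get a probability.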