Lasso regression is a shrinkage and variable selection method for linear regression models. Its goal is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that shrinks the regression coefficients of some variables toward zero. Variables whose coefficient is exactly zero after shrinkage are excluded from the model; variables with non-zero coefficients are the ones most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.
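The constraint described above can be written out as a small formula. This is the standard textbook form of the lasso, not something taken from the SAS output; the symbols are my own notation: y_i is the response for observation i, x_ij is the value of standardized predictor j, p is the number of predictors (six in this analysis), and t >= 0 is the bound that controls how much shrinkage is applied.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  \quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

As t gets smaller, more coefficients are forced to exactly zero, which is what turns shrinkage into variable selection.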
I used the Gapminder dataset. All predictor variables (incomeperperson, alcconsumption, co2emissions, oilperperson, suicideper100th, employrate) were quantitative, and the response variable (lifeexpectancy) was also quantitative. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
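The program further down does not include an explicit standardization step, so here is a minimal sketch of one way it could be done with PROC STANDARD. It assumes the predictors live in a dataset named new, as in the program below; the output dataset name new_std is hypothetical and is not used anywhere else in this post.

```sas
* Sketch only: standardize the six predictors to mean 0 and standard deviation 1;
* new_std is a hypothetical output dataset name;
proc standard data=new out=new_std mean=0 std=1;
  var incomeperperson alcconsumption co2emissions
      oilperperson suicideper100th employrate;
run;
```

If this step were used, the later steps would need to read new_std instead of new.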
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
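To make the selection criterion concrete, the cross-validation error at a given step s can be sketched as follows. The notation is my own: K = 10 is the number of folds, F_k is the set of training observations in fold k, n_k is its size, and the prediction for observation i comes from the step-s model fitted with fold k left out. The exact normalization that SAS reports may differ (for example, a sum of squared errors rather than a mean), so treat this as an illustration of the idea.

```latex
\mathrm{CV}(s)
  = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k}
    \sum_{i \in F_k} \Bigl( y_i - \hat{y}_i^{(-k,\,s)} \Bigr)^{2},
  \qquad K = 10
```

The step with the smallest value of this quantity identifies the chosen subset of predictors.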
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;
DATA new; set mydata.gapminder;
keep COUNTRY incomeperperson alcconsumption lifeexpectancy co2emissions oilperperson suicideper100th employrate;
proc sort; by COUNTRY; /*sort the data by country */
* delete observations with missing data;
*if cmiss(of _all_) then delete;
ods graphics on;
* Split data randomly into test and training data;
proc surveyselect data=new out=traintest seed = 123
samprate=0.7 method=srs outall;
* lasso multiple regression with lars algorithm k=10 fold validation;
proc glmselect data=traintest plots=all seed=123;
partition ROLE=selected(train='1' test='0');
model lifeexpectancy = incomeperperson alcconsumption co2emissions oilperperson employrate suicideper100th / selection=lar(choose=cv stop=none) cvmethod=random(10);
run;