What is Partial Least Squares Regression?


Understanding Partial Least Squares Regression (PLSR) in Machine Learning

Partial least squares regression (PLSR) is a statistical technique commonly used in machine learning. It is a linear model that predicts output variables (Y) from input variables (X) by modeling the covariance between the two sets of variables.

PLSR is similar to multiple linear regression (MLR) but is specifically designed for situations in which the number of predictor variables is large relative to the number of observations. It is well suited to high-dimensional data, where traditional regression methods fail to give accurate results because of multicollinearity or overfitting.

PLSR is an efficient approach for data reduction and building predictive models from complex and noisy data sets. In this article, we will discuss PLSR in detail, exploring its purpose, advantages, and limitations. We will also walk through a step-by-step guide to implementing PLSR in Python and R.

What is PLSR and How Does It Work?

PLSR is a supervised learning algorithm that aims to predict one or more output variables (Y) based on a set of predictor variables (X). The technique analyzes the covariance or correlation between the input and output variables and constructs a set of new latent variables called components. These new components are derived from linear combinations of the original predictor variables and are selected to explain the maximum covariance between the predictor variables and the response variables.

PLSR tries to find the linear combinations of predictor variables that best predict the response variables. It generates a set of orthogonal components, where each component is a linear combination of the original variables. Each component captures as much of the covariance between the inputs and outputs as possible that is not already explained by the previous components. Thus, the first component explains the largest share of that covariance, the second explains the next largest share, and so on.
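
To make this concrete, here is a minimal sketch of how such components can be extracted, assuming a single response variable, mean-centered data, and a NIPALS-style deflation loop. The function pls1_components is written here purely for illustration in plain NumPy and is not the routine used by any particular library; real implementations add scaling and numerical safeguards.

            import numpy as np

            def pls1_components(X, y, n_components):
                """Illustrative PLS1 sketch: single response, NIPALS-style deflation."""
                X = X - X.mean(axis=0)             # center the predictors
                y = y - y.mean()                   # center the response
                scores, weights = [], []
                for _ in range(n_components):
                    w = X.T @ y                    # direction of maximum covariance with y
                    w /= np.linalg.norm(w)
                    t = X @ w                      # scores of this component
                    p = X.T @ t / (t @ t)          # X-loadings
                    X = X - np.outer(t, p)         # deflate X: remove what this component explains
                    y = y - t * (y @ t) / (t @ t)  # deflate y likewise
                    scores.append(t)
                    weights.append(w)
                return np.column_stack(scores), np.column_stack(weights)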

The PLSR model consists of two parts: the predictor model and the response model. The predictor model relates the predictor variables to the components, while the response model relates the components to the response variables. A new observation is predicted by computing its components based on the predictor model and estimating the response variables based on the response model.
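
The sketch below illustrates this two-part structure with plain NumPy. It assumes a model that was fitted on centered (but not scaled) data, with hypothetical arrays x_mean and y_mean for the training means, W and P for the X-weights and X-loadings, and Q for the Y-loadings; the function predict_from_pls is invented for this example.

            import numpy as np

            def predict_from_pls(x_new, x_mean, y_mean, W, P, Q):
                """Predict one observation with a fitted (centered, unscaled) PLS model.
                W, P: X-weights and X-loadings (n_features x n_components);
                Q: Y-loadings (n_targets x n_components)."""
                R = W @ np.linalg.inv(P.T @ W)  # rotation mapping centered X straight to scores
                t = (x_new - x_mean) @ R        # predictor model: observation -> component scores
                return y_mean + t @ Q.T         # response model: scores -> predicted response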

Advantages and Limitations of PLSR

PLSR has several advantages over traditional regression techniques, including:

  • It is effective for high-dimensional data where the number of variables is much larger than the number of observations.
  • It reduces the dimensionality of the data, leading to improved model accuracy and interpretability.
  • It handles multicollinearity better than other regression techniques (illustrated in the sketch after this list).
  • NIPALS-based implementations can tolerate some missing data, making the method useful in practical scenarios.
  • It is efficient and robust in dealing with noisy data sets.
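
As a quick illustration of the multicollinearity point above, the following sketch builds a small synthetic data set (all values are made up for this example) in which 40 highly correlated predictors are driven by only 5 hidden factors, and then compares ordinary least squares with a three-component PLSR model. The exact scores will vary, but PLSR typically generalizes much better in this setting.

            import numpy as np
            from sklearn.cross_decomposition import PLSRegression
            from sklearn.linear_model import LinearRegression
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import r2_score

            rng = np.random.default_rng(0)
            n, p = 60, 40                                    # few observations, many predictors
            factors = rng.normal(size=(n, 5))                # 5 hidden driving factors
            X = factors @ rng.normal(size=(5, p)) + 0.01 * rng.normal(size=(n, p))
            y = factors[:, 0] - 2 * factors[:, 1] + 0.1 * rng.normal(size=n)

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
            ols = LinearRegression().fit(X_train, y_train)
            pls = PLSRegression(n_components=3).fit(X_train, y_train)
            print("OLS  test R^2:", r2_score(y_test, ols.predict(X_test)))
            print("PLSR test R^2:", r2_score(y_test, pls.predict(X_test)))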

However, like any other technique, PLSR has certain limitations that should be considered. These include:

  • PLSR still needs enough training samples to estimate its components reliably.
  • It may overfit the data if the number of components is too large, leading to poor generalization performance (see the cross-validation sketch after this list for one way to choose the component count).
  • It may not be suitable for nonlinear relationships between the input and output variables.
  • It may produce unstable predictions for noisy data sets.
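
Because the number of components is the main tuning parameter, it is usually chosen by cross-validation rather than fixed in advance. The sketch below assumes a feature matrix X and a target vector y are already loaded; the helper choose_n_components is written here for illustration on top of scikit-learn's cross_val_score.

            from sklearn.cross_decomposition import PLSRegression
            from sklearn.model_selection import cross_val_score

            def choose_n_components(X, y, max_components=10, cv=5):
                """Return the component count with the best cross-validated R^2."""
                scores = {}
                for k in range(1, min(max_components, X.shape[1]) + 1):
                    pls = PLSRegression(n_components=k)
                    scores[k] = cross_val_score(pls, X, y, cv=cv, scoring="r2").mean()
                return max(scores, key=scores.get), scores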

Implementing PLSR in Python and R

In this section, we will walk through the steps to implement PLSR in Python and R. For this demonstration, we will be using the Boston Housing data set, which contains information about housing values in Boston suburbs.

Python Implementation

First, we will start by importing the necessary libraries:

            import numpy as np
            import pandas as pd
            from sklearn.cross_decomposition import PLSRegression
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import r2_score
        

Next, let's load the Boston Housing data set (assumed here to be saved locally as boston_housing.csv, with the median home value in a MEDV column) and split it into training and test sets:

            data = pd.read_csv('boston_housing.csv')
            X = data.drop('MEDV', axis=1)
            y = data['MEDV']
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        

Now, we will apply PLSR to the training data:

            pls = PLSRegression(n_components=3)
            X_train_pls, Y_train_pls = pls.fit_transform(X_train, y_train)
        

Here, we have defined the PLSR object with three components, fitted it to the training data, and obtained the training scores in the new PLSR space. Next, let's project the test data into the same space:

            X_test_pls = pls.transform(X_test)
        

Finally, we will evaluate the performance of the PLSR model using the R-squared metric:

            print(r2_score(y_test, pls.predict(X_test)))
        

The complete Python code for performing PLSR on the Boston Housing data set is shown below:

            # Importing the libraries
            import numpy as np
            import pandas as pd
            from sklearn.cross_decomposition import PLSRegression
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import r2_score

            # Loading the data
            data = pd.read_csv('boston_housing.csv')
            X = data.drop('MEDV', axis=1)
            y = data['MEDV']

            # Splitting the data into training and test sets
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

            # Fitting PLSR to the training data
            pls = PLSRegression(n_components=3)
            X_train_pls, Y_train_pls = pls.fit_transform(X_train, y_train)

            # Transforming the test data into the PLSR space
            X_test_pls = pls.transform(X_test)

            # Evaluating the performance of PLSR on the test data
            print(r2_score(y_test, pls.predict(X_test)))
        

R Implementation

Now, let's perform a similar analysis in R:

            # Loading the libraries
            library(pls)
            library(caret)

            # Loading the Boston Housing data set
            data(BostonHousing, package = 'mlbench')
            data <- BostonHousing  # work with the data frame under the name 'data'

            # Splitting the data into training and test sets
            set.seed(123)
            train.index <- createDataPartition(y=data$medv,p=0.7,list=FALSE)

            X_train <- data[train.index,-14]
            X_test <- data[-train.index,-14]
            nrow(X_train) / nrow(data)  # Proportion of rows used for training: ~0.7

            y_train <- data[train.index,14]
            y_test <- data[-train.index,14]

            # Fitting PLSR to the training data
            fit <- plsr(medv ~ ., ncomp = 3, data = data.frame(X_train, medv = y_train))

            # Transforming the test data into the PLSR space
            X_test_pls <- predict(fit,newdata=X_test,type='scores')

            # Evaluating the performance of PLSR on the test data
            RMSE(drop(predict(fit, newdata = X_test, ncomp = 3)), y_test)
        

Here, we have loaded the necessary libraries and imported the Boston Housing data set. We have then split the data into training and test sets using the caret package. Next, we have fitted PLSR to the training data and transformed the test data into the PLSR space. Finally, we have evaluated the performance of PLSR using RMSE as the evaluation metric.

Conclusion

Partial least squares regression (PLSR) is a useful statistical technique that helps researchers and data scientists make predictions from data sets with high-dimensional input variables. It is an efficient algorithm for reducing the number of variables while preserving the most important relationships between the inputs and outputs. Overall, the strength of PLSR lies in its ability to handle multicollinearity and to remain effective when the predictors are numerous, correlated, or noisy.
