How to split data based on a column value in sklearn

Written by Aionlinecourse

You can use the train_test_split function from scikit-learn's model_selection module to split a dataset into a training set and a test set based on a specified split ratio. For example, you can use the following code to split the data into a training set that contains 75% of the data and a test set that contains 25% of the data:
from sklearn.model_selection import train_test_split

# Split the data into a training set (75%) and a test set (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Here, X and y are the feature matrix and the target vector, respectively. The test_size parameter specifies the proportion of the data that should be allocated to the test set.
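The snippet above assumes X and y already exist. A minimal self-contained sketch, using made-up toy arrays (an assumption for illustration, not data from this article), shows the resulting sizes. Passing random_state is optional; it only fixes the shuffle so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 3 features, binary target
X = np.arange(60).reshape(20, 3)
y = np.array([0, 1] * 10)

# random_state fixes the shuffle so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 25% of 20 rows go to the test set, the remaining 15 to the training set
print(X_train.shape, X_test.shape)
```

With 20 rows and test_size=0.25, the test set holds 5 rows and the training set 15.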

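If you want the split to depend on the values of a specific column, using that column as the target vector is not enough on its own: train_test_split shuffles the rows at random, and y is simply split alongside X. To keep the distribution of the column's values similar in both sets, also pass the column to the stratify parameter. For example:
# Extract the 'age' column as the target vector
y = df['age']

# Split the data into a training set (75%) and a test set (25%),
# keeping the proportion of each 'age' value similar in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)
This produces a training set and a test set in which each value of the 'age' column appears in roughly the same proportion. Stratification requires every distinct value to occur at least twice, so a continuous column usually needs to be binned into categories first.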
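Another common reading of "split based on a column value" is that all rows sharing a value should end up on the same side of the split. One way to sketch this is with GroupShuffleSplit from sklearn.model_selection. The DataFrame and its column names (customer_id, feature, target) below are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: two rows per customer
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Here test_size is a fraction of the groups (customers), not of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["customer_id"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Check that no customer_id appears in both sets
overlap = set(train_df["customer_id"]) & set(test_df["customer_id"])
print(len(train_df), len(test_df), overlap)
```

If the rule is fixed in advance rather than random (for example, one particular value always goes to the test set), plain pandas filtering with a boolean mask on the column does the job without scikit-learn.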