How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

Written by - Aionlinecourse

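This article shows how to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline. The difficulty is that SimpleImputer returns a 2D array, while CountVectorizer expects a 1D iterable of strings.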

Solution 1:

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.
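To see why the reshape is needed, here is a minimal sketch that inspects the shape of the imputer's output on the same toy data. The variable names `imputed` and `flat` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"text": ["abc def", "abc ghi", np.nan]})

# SimpleImputer works on 2D input and returns a 2D array,
# even when there is only one column.
imputed = SimpleImputer(strategy="constant").fit_transform(df[["text"]])
print(imputed.shape)  # (3, 1)

# CountVectorizer iterates over documents, so it needs one string per item.
flat = imputed.ravel()
print(flat.shape)  # (3,)
```

With string data and no explicit `fill_value`, the constant strategy fills the missing entry with the string "missing_value".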

Here's the complete code:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# Replace missing text with a constant placeholder (the output is 2D)
imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()

# CREATE TRANSFORMER: flatten the 2D imputer output to 1D.
# np.ravel is used instead of np.reshape(..., newshape=-1) because the
# `newshape` keyword is deprecated in recent NumPy releases.
one_dim = FunctionTransformer(np.ravel)

# INCLUDE TRANSFORMER IN PIPELINE
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!


Solution 2:

Another solution is to subclass SimpleImputer and override its transform() method so that it returns a 1D array:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Flatten the (n_samples, 1) output to 1D for CountVectorizer
        return super().transform(X).flatten()


df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()
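One caveat with flattening inside transform() is that it only makes sense for a single column: with two or more columns the values would be silently interleaved into one long array. A minimal sketch of a guarded variant follows. The class name `SingleColumnImputer` and the error message are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


class SingleColumnImputer(SimpleImputer):
    """Hypothetical variant that refuses to flatten multi-column output."""

    def transform(self, X):
        out = super().transform(X)
        if out.shape[1] != 1:
            raise ValueError(
                f"Expected a single column, got {out.shape[1]} columns."
            )
        return out.ravel()


df = pd.DataFrame({"text": ["abc def", "abc ghi", np.nan]})
one_d = SingleColumnImputer(strategy="constant").fit_transform(df[["text"]])
print(one_d.shape)  # (3,)

# Two columns trigger the guard instead of producing a mixed-up 1D array.
two_cols = pd.DataFrame({"a": ["x", np.nan], "b": ["y", "z"]})
try:
    SingleColumnImputer(strategy="constant").fit_transform(two_cols)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```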

Solution 3:

I use this one-dimensional wrapper for scikit-learn transformers when I have one-dimensional data. I think this wrapper can be used to wrap SimpleImputer for one-dimensional data (a pandas Series with string values) in your case.

class OneDWrapper:
    """One dimensional wrapper for sklearn Transformers"""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        return self.transformer.transform(
            np.array(X).reshape(-1, 1)).ravel()

    def inverse_transform(self, X, y=None):
        return self.transformer.inverse_transform(
            np.expand_dims(X, axis=1)).ravel()

Now, you don't need an additional step in the pipeline.

one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
pipe = make_pipeline(one_d_imputer, vect)
pipe.fit_transform(df['text']).toarray() 
# note we are feeding a pd.Series here!
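If the wrapper needs to take part in parameter inspection or cloning (for example inside a grid search), one possible refinement is to inherit from scikit-learn's base classes. The sketch below is an assumption layered on top of the answer above: the class name `CloneableOneDWrapper` and the fitted attribute `transformer_` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class CloneableOneDWrapper(TransformerMixin, BaseEstimator):
    """Hypothetical 1D wrapper that supports get_params() and clone()."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Fit a copy so the estimator passed to __init__ stays unfitted.
        self.transformer_ = clone(self.transformer)
        self.transformer_.fit(np.asarray(X).reshape(-1, 1))
        return self

    def transform(self, X):
        # Reshape 1D input to a single column, transform, then flatten again.
        out = self.transformer_.transform(np.asarray(X).reshape(-1, 1))
        return out.ravel()


df = pd.DataFrame({"text": ["abc def", "abc ghi", np.nan]})
pipe = make_pipeline(
    CloneableOneDWrapper(SimpleImputer(strategy="constant")),
    CountVectorizer(),
)
counts = pipe.fit_transform(df["text"]).toarray()  # a pd.Series goes in
params = pipe.get_params()
pipe_copy = clone(pipe)
print(counts.shape)
```

Because the wrapped estimator is an ordinary constructor parameter, its settings are reachable through nested names such as `cloneableonedwrapper__transformer__strategy`.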


Thank you for reading the article.
