fit() vs transform() vs fit_transform() in Python scikit-learn

What’s the difference between the fit, transform and fit_transform methods in sklearn?

Giorgos Myrianthous
Published in Geek Culture
4 min read · Mar 14, 2021


scikit-learn (commonly referred to as sklearn) is one of the most widely used machine learning libraries in Python. It comes with a comprehensive set of tools and ready-to-train models, from pre-processing utilities to model training and model evaluation utilities.

Transformers are among the most fundamental object types in sklearn. They implement three specific methods, namely fit(), transform() and fit_transform(), which are conventions of the scikit-learn API. In this article, we are going to explore how each of these works and when to use one over the other.

Note that we are going to explore these methods using specific examples, but the concepts explained here apply to most (if not all) transformers that implement them.


Before explaining the intuition behind fit(), transform() and fit_transform(), it is important to first understand what a transformer is in the scikit-learn API.

What are transformers in scikit-learn

In scikit-learn, a transformer is an estimator that implements the transform() and/or fit_transform() methods. base.TransformerMixin is a mixin class that supplies a default fit_transform() implementation (it calls fit() and then transform()), which gives sklearn transformers a consistent interface.
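
To make the role of the mixin concrete, here is a minimal sketch of a custom transformer. The class name MeanCenterer and its behaviour are hypothetical and chosen purely for illustration; only BaseEstimator and TransformerMixin come from sklearn. The class defines fit() and transform() itself and picks up fit_transform() from the mixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts the per-column mean."""

    def fit(self, X, y=None):
        # Learn the column means from the data passed to fit().
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows method chaining

    def transform(self, X):
        # Apply the means learned in fit() to any compatible data.
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[1.0, 10.0], [3.0, 30.0]])

# fit_transform() is not defined above; it is inherited from TransformerMixin.
centered = MeanCenterer().fit_transform(X)
print(centered)
```

Because the mixin provides fit_transform(), a transformer written this way only has to describe what it learns (fit) and how it applies it (transform).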

What does fit() do in transformers

In scikit-learn transformers, the fit() method fits the transformer to the input data: it performs whatever computations that specific transformer requires and stores the results on the transformer object. It does not modify the data itself.

As an example, let’s assume that we need to scale our features using the preprocessing transformer called MinMaxScaler. We first create some sample data and split it into training and testing sets. Then we instantiate a MinMaxScaler and fit it on the training data in order to compute the per-feature minimum and maximum that will later be used for scaling.
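
The steps just described can be sketched as follows. The sample data, the 70/30 split and the random_state value are assumptions made for illustration and are not taken from the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative sample data: 10 samples, 2 features (values are made up).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = MinMaxScaler()

# fit() only learns the per-feature minimum and maximum of the training set;
# X_train itself is left unchanged.
scaler.fit(X_train)

print(scaler.data_min_)  # per-feature minimum seen during fit()
print(scaler.data_max_)  # per-feature maximum seen during fit()
```

After fit(), the learned statistics live on the scaler as the attributes data_min_ and data_max_. Fitting on the training set alone keeps information from the test set out of the scaling parameters.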
