How to Fill In Missing Data Using Python pandas

Data cleaning undoubtedly takes a ton of time in data science, and missing data is one of the challenges you'll face often. Pandas is a valuable Python data manipulation tool that helps you fix missing values in your dataset, among other things.

You can handle missing values either by dropping them or by filling them with other values. In this article, we'll explore the different ways to fill in missing data using pandas.

Set Up Pandas and Prepare the Dataset

Before we start, make sure you install pandas into your Python virtual environment using pip via your terminal:

        pip install pandas

You can follow along with any dataset, such as an Excel file loaded with pandas.

But we'll use the following mock data throughout this article. It's a DataFrame containing some missing, or null, values (NaN).

        
        import pandas
        import numpy

        # Mock data: two numeric columns (A, B) and two string columns (C, D), each with gaps
        df = pandas.DataFrame({
            'A': [0, 3, numpy.nan, 10, 3, numpy.nan],
            'B': [numpy.nan, numpy.nan, 7.13, 13.82, 7, 7],
            'C': [numpy.nan, "Pandas", numpy.nan, "Pandas", "Python", "JavaScript"],
            'D': ["Sound", numpy.nan, numpy.nan, "Music", "Songs", numpy.nan]
        })

        print(df)

The printed DataFrame has six rows and four columns, with NaN marking every missing value.
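
Before filling anything, it can help to count how many values are missing in each column. The sketch below rebuilds the same mock DataFrame and uses isna().sum() for the count; the variable names missing_per_column and total_missing are only illustrative.

```python
import numpy
import pandas

df = pandas.DataFrame({
    'A': [0, 3, numpy.nan, 10, 3, numpy.nan],
    'B': [numpy.nan, numpy.nan, 7.13, 13.82, 7, 7],
    'C': [numpy.nan, "Pandas", numpy.nan, "Pandas", "Python", "JavaScript"],
    'D': ["Sound", numpy.nan, numpy.nan, "Music", "Songs", numpy.nan]
})

# isna() marks each missing cell as True; sum() then counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)
```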

Now, check out how you can fill in these missing values using the various available methods in pandas.

1. Use the fillna() Method

The fillna() method goes through your dataset and fills every missing (NaN) cell with a value you specify. This could be the mean, median, mode, or any other value.

This pandas operation accepts some optional arguments; take note of the following:

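value: The value you want to insert into the missing cells. It can be a single value, or a dictionary or Series that maps column names to fill values.
method: Lets you fill in missing values forward (ffill) or backward (bfill). This argument is deprecated as of pandas 2.1; the dedicated ffill() and bfill() methods do the same job.
inplace: Accepts a Boolean. If True, it modifies the DataFrame in place. Otherwise, fillna() returns a new DataFrame and leaves the original unchanged.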
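
As a minimal sketch of how value and inplace interact, the snippet below fills a small, made-up DataFrame twice. The prices data and the variable names are assumptions for illustration only.

```python
import numpy
import pandas

# Hypothetical single-column data with one missing entry
prices = pandas.DataFrame({"price": [4.5, numpy.nan, 6.0]})

# Without inplace, fillna() returns a new DataFrame and leaves the original untouched
filled_copy = prices.fillna(value=0)
missing_in_original = int(prices["price"].isna().sum())

# With inplace=True, the original DataFrame itself is modified
prices.fillna(value=0, inplace=True)
missing_after_inplace = int(prices["price"].isna().sum())

print(filled_copy)
print(missing_in_original, missing_after_inplace)
```

Assigning the returned copy to a new name keeps the raw data available for comparison, which is often easier to debug than modifying it in place.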

Let's see the techniques for filling in missing data with the fillna() method.

Fill Missing Values With Mean, Median, or Mode

This method involves replacing missing values with computed averages. Filling missing data with a mean or median value is applicable when the columns involved have integer or float data types.

You can also fill in missing data with the mode, which is the most frequently occurring value. This also works for integers and floats, but it's handier when the columns in question contain strings.

Here's how to insert the mean and median into the missing rows in the DataFrame:

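        # To insert the mean value of each numeric column into its missing rows:
        df.fillna(df.mean(numeric_only=True).round(1), inplace=True)

        # For the median, run this line instead of the one above. Once the mean
        # has been filled in, there are no numeric gaps left for the median to fill:
        df.fillna(df.median(numeric_only=True).round(1), inplace=True)

        print(df)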

Setting the numeric_only argument to True ensures that the mean and median are computed only for columns containing integers and floats.
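
To see what these calls actually pass to fillna(), you can print the per-column statistics on their own. The sketch below rebuilds the mock DataFrame; column_means and column_medians are illustrative names.

```python
import numpy
import pandas

df = pandas.DataFrame({
    'A': [0, 3, numpy.nan, 10, 3, numpy.nan],
    'B': [numpy.nan, numpy.nan, 7.13, 13.82, 7, 7],
    'C': [numpy.nan, "Pandas", numpy.nan, "Pandas", "Python", "JavaScript"],
    'D': ["Sound", numpy.nan, numpy.nan, "Music", "Songs", numpy.nan]
})

# Each result is a Series indexed by column name; the string columns C and D are left out
column_means = df.mean(numeric_only=True)
column_medians = df.median(numeric_only=True)

print(column_means.round(1))
print(column_medians.round(1))
```

Because the result is indexed by column name, fillna() matches each statistic to its own column and ignores columns that have no entry.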

Since you can't calculate numeric averages on string columns, you want to get the modal value for them instead. However, we'll use a slightly different approach for the modal value:

        
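        # Select the string (object dtype) columns:
        string_columns = df.select_dtypes(include=['object']).columns

        # Fill each one with its own mode; iloc[0] takes the first mode when several values tie:
        df[string_columns] = df[string_columns].fillna(df[string_columns].mode().iloc[0])

        print(df)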

The above code selects only the string columns from the DataFrame and fills the NaN values in each with that column's modal value.
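
One detail worth checking is what mode() returns when several values tie. The sketch below uses only the two string columns from the mock data; the names modes, first_modes, and filled are illustrative.

```python
import numpy
import pandas

df = pandas.DataFrame({
    'C': [numpy.nan, "Pandas", numpy.nan, "Pandas", "Python", "JavaScript"],
    'D': ["Sound", numpy.nan, numpy.nan, "Music", "Songs", numpy.nan]
})

# mode() returns one row per tied value; columns with fewer modes are padded with NaN
modes = df.mode()
print(modes)

# The first row holds the first mode of each column (ties come back in sorted order)
first_modes = modes.iloc[0]
filled = df.fillna(first_modes)
print(filled)
```

In column D every value appears once, so all three are modes and the fill value is simply the first in sorted order. In a case like that, the mode may not be a meaningful replacement.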

You can also insert the mode into a specific column instead, say, column C:

        # Assign the result back; an inplace fill on df['C'] may act on a temporary copy
        df['C'] = df['C'].fillna(df['C'].mode()[0])

If you want to be column-specific while inserting the mean, median, or mode:

        df.fillna({"A": df['A'].mean(),
                   "B": df['B'].median(),
                   "C": df['C'].mode()[0]},
                  inplace=True)
        print(df)

Fill Null Rows With Values Using ffill

Forward-filling replaces each missing value with the nearest valid value above it in the same column. Older pandas code passes method='ffill' to fillna() for this, but that argument is deprecated as of pandas 2.1 in favor of the dedicated ffill() method.

Here's how to forward-fill the DataFrame:

        df.ffill(inplace=True)
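
As a small illustration of forward-filling, the sketch below applies DataFrame.ffill() to the two numeric columns of the mock data. The names forward_filled and remaining are illustrative.

```python
import numpy
import pandas

df = pandas.DataFrame({
    'A': [0, 3, numpy.nan, 10, 3, numpy.nan],
    'B': [numpy.nan, numpy.nan, 7.13, 13.82, 7, 7]
})

# Each NaN takes the last valid value above it in the same column
forward_filled = df.ffill()
print(forward_filled)

# NaN at the top of a column has nothing above it, so it stays missing
remaining = forward_filled.isna().sum()
print(remaining)
```

Column A ends up fully filled, while the two leading gaps in column B remain because no earlier value exists to carry forward.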

Fill Missing Rows With Values Using bfill

Backward-filling works in the opposite direction: it fills each missing value in the DataFrame with the nearest valid value below it. Use the bfill() method for this, which supersedes the deprecated fillna(method='bfill') form.

Here's how to backward-fill the DataFrame:

        df.bfill(inplace=True)

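You might want to combine ffill and bfill to fill missing data in both directions. Forward-filling alone leaves any NaN at the top of a column untouched, and backward-filling alone leaves any NaN at the bottom, so chaining the two prevents partial filling.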
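
One way to combine the two directions, sketched here on the numeric columns of the mock data, is to chain ffill() and bfill(). The name filled_both is illustrative.

```python
import numpy
import pandas

df = pandas.DataFrame({
    'A': [0, 3, numpy.nan, 10, 3, numpy.nan],
    'B': [numpy.nan, numpy.nan, 7.13, 13.82, 7, 7]
})

# Forward-fill first, then backward-fill whatever is still missing at the top
filled_both = df.ffill().bfill()
print(filled_both)
```

The order matters for interior gaps: whichever fill runs first decides whether they take the value above or the value below.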

2. The replace() Method

This method is handy for replacing values other than empty cells, as it's not limited to NaN values. It alters any specified value within the DataFrame.

However, like the fillna() method, you can use replace() to replace the NaN values in a specific column with the mean, median, mode, or any other value. It also accepts the inplace keyword argument.

See how this works by replacing the null rows in a named column with its mean, median, or mode:

        import pandas
        import numpy

        # Assign each result back to its column; an inplace replace on df['A']
        # may act on a temporary copy in recent pandas versions.

        # Replace the null values in column A with its mean:
        df['A'] = df['A'].replace(numpy.nan, df['A'].mean())

        # Replace the null values in column B with its median:
        df['B'] = df['B'].replace(numpy.nan, df['B'].median())

        # Use the modal value for column C:
        df['C'] = df['C'].replace(numpy.nan, df['C'].mode()[0])
        print(df)
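
Because replace() isn't limited to NaN, it also suits datasets that mark missing entries with a placeholder. The sketch below assumes a made-up readings DataFrame in which -999.0 stands for "no reading"; the data, the placeholder, and the variable names are all hypothetical.

```python
import pandas

# Hypothetical sensor data where -999.0 is a placeholder for a missing reading
readings = pandas.DataFrame({"temp": [21.5, -999.0, 23.0, -999.0, 22.5]})

# Compute the mean from the valid readings only, so the placeholder doesn't skew it
valid_mean = readings.loc[readings["temp"] != -999.0, "temp"].mean()

# Swap every placeholder for that mean
readings["temp"] = readings["temp"].replace(-999.0, valid_mean)
print(readings)
```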

3. Fill Missing Data With interpolate()

The interpolate() function uses existing values in the DataFrame to estimate the missing values. Setting the inplace keyword to True modifies the DataFrame in place.

However, this method only applies to numeric columns, as it uses mathematical estimation to fill missing values.

Run the following code to see how this works:

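        # interpolate() can't estimate strings, so work on the numeric columns only:
        numeric_columns = df.select_dtypes(include='number').columns

        # Interpolate backward along each column:
        df[numeric_columns] = df[numeric_columns].interpolate(method='linear', limit_direction='backward')

        # Interpolate forward along each column:
        df[numeric_columns] = df[numeric_columns].interpolate(method='linear', limit_direction='forward')

The above code selects the numeric columns explicitly. Recent pandas versions (2.1 and later) warn when interpolate() is called on string columns instead of skipping them silently.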
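
To make the estimation concrete, here is a small worked sketch of linear interpolation on a made-up Series. The values and the names forward and both are illustrative; limit_direction='both' is an option of interpolate() that isn't used above.

```python
import numpy
import pandas

# Hypothetical values with gaps at the start, in the middle, and at the end
series = pandas.Series([numpy.nan, 2.0, numpy.nan, numpy.nan, 8.0, numpy.nan])

# The default direction is forward: the gaps between 2.0 and 8.0 are spaced evenly,
# but the leading NaN has no earlier value to work from and stays missing
forward = series.interpolate(method='linear')
print(forward)

# limit_direction='both' also fills the gap at the start of the Series
both = series.interpolate(method='linear', limit_direction='both')
print(both)
```

Linear interpolation assumes the values change steadily between known points, so it fits ordered data such as time series better than unordered records.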

Deal With Missing Rows Carefully

While we've only considered filling missing data with computed values like the mean, median, and mode, or with neighboring and interpolated values, other techniques exist for fixing missing values. Data scientists, for instance, sometimes remove the rows that contain missing values, depending on the case.

It's essential to think critically about your strategy before using it. Otherwise, you might get undesirable analysis or prediction results. Some initial data visualization and exploratory analysis might also help.