NumPy - Handling Missing Data



Handling Missing Data in Arrays

Handling missing data is a common challenge in data analysis and processing. Missing data in arrays can arise due to various reasons, such as incomplete data collection, errors during data entry, or intentional omission.

In NumPy, dealing with missing values means first identifying them and then removing, replacing, or interpolating them so that later calculations are not distorted.

Identifying Missing Data

The first step in handling missing data is to identify it. In NumPy, missing values are usually represented as np.nan, a special floating-point value, so arrays that contain them have a floating-point dtype. The np.isnan() function returns a boolean array that is True wherever a value is NaN.

Example

In the following example, we create an array with missing values represented by np.nan. We then use the np.isnan() function to create a mask that identifies these missing values −

import numpy as np

# Creating an array with missing values
arr = np.array([1, 2, np.nan, 4, np.nan, 6])

# Checking for missing values
is_nan = np.isnan(arr)

print("Array with Missing Values:\n", arr)
print("Missing Value Mask:\n", is_nan)

Following is the output obtained −

Array with Missing Values:
[ 1.  2. nan  4. nan  6.]
Missing Value Mask:
[False False  True False  True False]
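Building on the mask above, the following sketch shows why np.isnan() is needed rather than an equality test, and how the mask can be used to count and locate missing values. The variable names (eq_mask, nan_count, nan_positions) are illustrative choices, not part of the original example.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4, np.nan, 6])

# NaN never compares equal to anything, including itself,
# so an equality test finds no missing values
eq_mask = arr == np.nan

# np.isnan() gives the correct mask
nan_mask = np.isnan(arr)

# Number of missing values and their positions in the array
nan_count = np.count_nonzero(nan_mask)
nan_positions = np.flatnonzero(nan_mask)

print("Equality test finds any NaN:", eq_mask.any())
print("Number of missing values:", nan_count)
print("Positions of missing values:", nan_positions)
```

Running this sketch reports that the equality test finds nothing, that two values are missing, and that they sit at positions 2 and 4.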

Removing Missing Data

Removing missing data means dropping the entries of your dataset where values are missing.

In NumPy, you can use boolean indexing to exclude NaN values from an array: create a mask that identifies the missing values, invert it, and use the result to select only the valid entries.

Example

In this example, we start with an array that contains missing values represented by "np.nan". We then remove them with boolean indexing, using the inverted np.isnan() mask to keep only the entries that are not NaN −

import numpy as np

# Creating an array with missing values
arr = np.array([1, 2, np.nan, 4, np.nan, 6])

# Removing missing values
cleaned_arr = arr[~np.isnan(arr)]

print("Original Array:\n", arr)
print("Array with Missing Values Removed:\n", cleaned_arr)

This will produce the following result −

Original Array:
[ 1.  2. nan  4. nan  6.]
Array with Missing Values Removed:
[1. 2. 4. 6.]
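The same idea extends to two-dimensional data, where you usually want to drop whole rows rather than single elements (indexing a 2-D array with an element-wise mask would flatten the result to 1-D). The sketch below assumes a small hypothetical array in which each row is one record.

```python
import numpy as np

# Hypothetical 2-D data: each row is one record
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 11.0, 12.0],
])

# True for every row that contains at least one NaN
rows_with_nan = np.isnan(data).any(axis=1)

# Keep only the rows without missing values
complete_rows = data[~rows_with_nan]

print("Rows with missing values:", rows_with_nan)
print("Complete rows:\n", complete_rows)
```

Only the first and third rows survive. Using any(axis=0) and indexing with data[:, ~mask] would drop incomplete columns instead.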

Replacing Missing Data

Replacing missing data means filling in the gaps where data is missing with a substitute value. In NumPy, you can use the np.nan_to_num() function to replace NaN values with a specific number, such as zero or the mean of the other values. Following is the syntax −

numpy.nan_to_num(x, copy=True, nan=0.0, posinf=None, neginf=None)

Where,

  • x: The input array containing NaN values, infinities, or other numerical values.
  • copy: A boolean indicating whether to make a copy of the array (True by default). If False, the operation may be performed in place.
  • nan: The value to replace NaN values with. The default is 0.0.
  • posinf: The value to replace positive infinity (inf) with. If not specified, it defaults to the largest finite value representable by the array's data type.
  • neginf: The value to replace negative infinity (-inf) with. If not specified, it defaults to the most negative finite value representable by the array's data type.
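To make the parameters above concrete, here is a minimal sketch that compares the default replacements with explicitly chosen ones. The replacement values (-1.0, 999.0, -999.0) are arbitrary and chosen only for illustration; the nan, posinf, and neginf keywords assume NumPy 1.17 or later.

```python
import numpy as np

arr = np.array([1.0, np.nan, np.inf, -np.inf])

# Defaults: NaN -> 0.0, inf -> largest finite float64,
# -inf -> most negative finite float64
default_fill = np.nan_to_num(arr)

# Explicit replacements for all three special values (arbitrary choices)
custom_fill = np.nan_to_num(arr, nan=-1.0, posinf=999.0, neginf=-999.0)

print("Default replacement:", default_fill)
print("Custom replacement:", custom_fill)
```

With the defaults, the infinities become the float64 limits reported by np.finfo(np.float64).max and np.finfo(np.float64).min.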

Example

In the example below, we create an array that contains missing values represented by "np.nan". We then replace these missing values with zero using the np.nan_to_num() function, which fills np.nan entries with the specified value −

import numpy as np

# Creating an array with missing values
arr = np.array([1, 2, np.nan, 4, np.nan, 6])

# Replacing missing values with zero
filled_arr = np.nan_to_num(arr, nan=0)

print("Original Array:\n", arr)
print("Array with Missing Values Replaced:\n", filled_arr)

Following is the output of the above code −

Original Array:
[ 1.  2. nan  4. nan  6.]
Array with Missing Values Replaced:
[1. 2. 0. 4. 0. 6.]
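Zero is not always a sensible substitute. As a sketch of the mean-based replacement mentioned earlier, the code below computes the mean of the non-missing values with np.nanmean() and passes it as the nan argument. The variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4, np.nan, 6])

# Mean of the non-missing values; np.nanmean() ignores NaN entries
mean_value = np.nanmean(arr)

# Replace each NaN with that mean
filled_with_mean = np.nan_to_num(arr, nan=mean_value)

print("Mean of non-missing values:", mean_value)
print("Array with Missing Values Replaced by the Mean:\n", filled_with_mean)
```

Here the mean of 1, 2, 4, and 6 is 3.25, so both missing entries become 3.25.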

Interpolating Missing Data

Interpolating missing data involves estimating and filling in missing values within a dataset based on the surrounding data.

Instead of replacing missing values with a constant like the mean, interpolation predicts what the missing value should be by analyzing the trend or pattern in the data.

For example, if a single value is missing midway between "4" and "8", linear interpolation estimates it as "6".
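The arithmetic behind that estimate can be sketched directly. The code below assumes the known values 4 and 8 sit at positions 0 and 2, with the missing value at position 1; these positions are an assumption made for illustration.

```python
# Known neighbours of the missing value: (position, value)
x0, y0 = 0, 4.0
x1, y1 = 2, 8.0

# Position of the missing value (assumed to be midway)
x = 1

# Linear interpolation: move along the straight line from (x0, y0) to (x1, y1)
estimate = y0 + (x - x0) * (y1 - y0) / (x1 - x0)

print("Estimated value:", estimate)
```

The estimate is 6.0, halfway between the two neighbours because the missing position is halfway between them.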

Example

In the following example, we handle an array with missing values (np.nan) by applying linear interpolation using "interp1d" from SciPy. We build the interpolation function from the positions and values of the non-missing entries, then evaluate it at every index to obtain a complete array −

import numpy as np
from scipy.interpolate import interp1d

# Creating an array with missing values
arr = np.array([1, 2, np.nan, 4, np.nan, 6])

# Creating an index array
indices = np.arange(len(arr))

# Creating a mask for non-missing values
mask = ~np.isnan(arr)

# Performing linear interpolation
interp_func = interp1d(indices[mask], arr[mask], kind='linear', fill_value='extrapolate')
filled_arr = interp_func(indices)

print("Original Array:\n", arr)
print("Array with Interpolated Missing Values:\n", filled_arr)

The output obtained is as shown below −

Original Array:
 [ 1.  2. nan  4. nan  6.]
Array with Interpolated Missing Values:
 [1. 2. 3. 4. 5. 6.]
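If SciPy is not available, NumPy's own np.interp() function can fill gaps in a similar way. The sketch below is an alternative to the interp1d approach, not a drop-in replacement: by default np.interp() does not extrapolate, so missing values before the first or after the last known value are filled with that nearest known value. The input array is a hypothetical one with gaps at both ends.

```python
import numpy as np

# Hypothetical array with missing values at both ends and in the middle
arr = np.array([np.nan, 2.0, np.nan, 4.0, np.nan])

indices = np.arange(len(arr))
mask = ~np.isnan(arr)

# Interior gaps are interpolated linearly; gaps outside the known
# range take the first or last known value
filled_arr = np.interp(indices, indices[mask], arr[mask])

print("Array with Interpolated Missing Values:\n", filled_arr)
```

The interior gap becomes 3.0, while the leading and trailing gaps become 2.0 and 4.0. Whether holding the edge values constant or extrapolating the trend is more appropriate depends on the data.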