SciPy stats.trim1() function in Python
The SciPy stats.trim1() function in Python is used for trimming outliers from a dataset. Specifically, it trims a proportion of data points from one end of the given array or sequence: either the lowest values or the highest values.
SciPy stands for Scientific Python. It provides utility functions for optimization, statistics and signal processing. Like NumPy, SciPy is open source, so we can use it freely. SciPy's original authors include Travis Oliphant, who also created NumPy.
Slice off a proportion from ONE end of the passed array distribution. If proportiontocut = 0.1, slices off ‘leftmost’ or ‘rightmost’ 10% of scores. The lowest or highest values are trimmed (depending on the tail).
The scipy.stats module contains a large number of summary and frequency statistics, probability distributions, correlation functions, statistical tests, kernel density estimation, quasi-Monte Carlo functionality, and so on.
Syntax :
scipy.stats.trim1(a, proportiontocut, tail='right', axis=0)
Parameters :
- a : The input array or sequence from which you want to trim outliers.
- proportiontocut : The fraction of data points to trim from one end of the data. For instance, proportiontocut = 0.1 trims 10% of the data points from the end selected by tail. The number of points removed is rounded down to a whole number.
- tail : Specifies which end of the data to trim. You can set it to ‘left’ (lowest values) or ‘right’ (highest values); the default is ‘right’. There is no ‘both’ option.
- axis : The axis along which to trim (the default is 0). If None, the whole array is flattened first.
Returns : An array containing the values that remain after trimming. The order of the returned values is not guaranteed.
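As a small illustration of the parameters, the sketch below trims a two-column array along axis 0. The array scores is made up for this example, and axis is taken from SciPy's documented signature for trim1. Because the order of the returned values is not guaranteed, the result is sorted before it is inspected.

```python
import numpy as np
from scipy import stats

# Hypothetical 5x2 array: each column is treated as a separate sample.
scores = np.array([[3, 9],
                   [1, 7],
                   [2, 8],
                   [5, 6],
                   [4, 10]])

# With axis=0, int(0.2 * 5) = 1 value is cut from the top of each column.
trimmed = stats.trim1(scores, proportiontocut=0.2, tail='right', axis=0)

# The order of the kept values is not guaranteed, so sort before reading.
trimmed_sorted = np.sort(trimmed, axis=0)

print(trimmed.shape)
print(trimmed_sorted)
```

Each column loses its single largest value (5 in the first column, 10 in the second), leaving four rows.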
Example :
import numpy as np
from scipy import stats
# Draw 100 samples from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=100)
# Trim the highest 10% of the values
trimmed_data_right = stats.trim1(data, proportiontocut=0.1, tail='right')
# Trim the lowest 10% of the values
trimmed_data_left = stats.trim1(data, proportiontocut=0.1, tail='left')
print("Original data length:", len(data))
print("Trimmed data length (right):", len(trimmed_data_right))
print("Trimmed data length (left):", len(trimmed_data_left))
Output :
Original data length: 100
Trimmed data length (right): 90
Trimmed data length (left): 90
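The lengths alone do not show which values were removed. A minimal sketch on a small, made-up array (values, holding the integers 0 to 9 in shuffled order) makes this visible. The results are sorted only for display, since trim1 does not promise any particular order.

```python
import numpy as np
from scipy import stats

# Hypothetical data: the integers 0..9 in shuffled order.
values = np.array([7, 0, 3, 9, 1, 5, 8, 2, 6, 4])

# int(0.2 * 10) = 2 values are removed from the chosen tail.
right = stats.trim1(values, proportiontocut=0.2, tail='right')
left = stats.trim1(values, proportiontocut=0.2, tail='left')

# Sort only to make the kept values easy to read.
kept_right = sorted(right.tolist())
kept_left = sorted(left.tolist())

print(kept_right)  # the two highest values (8, 9) are gone
print(kept_left)   # the two lowest values (0, 1) are gone
```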
Key Notes :
- If you want to trim both ends of the dataset, you should use stats.trimboth instead of stats.trim1.
- stats.trim1 works by trimming data only from the side specified by the tail parameter.
To trim data from both ends :
import numpy as np
from scipy import stats
# Draw 100 samples from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=100)
# Trim 10% of the values from each end (lowest and highest)
trimmed_data_both = stats.trimboth(data, proportiontocut=0.1)
print("Original data length:", len(data))
print("Trimmed data length (both):", len(trimmed_data_both))
Breakdown of the code above :
- The original dataset contains 100 elements.
- With proportiontocut = 0.1, the function trims 10% of the data (10 elements) from each end, so 20 elements are removed in total.
Output :
Original data length: 100
Trimmed data length (both): 80
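This logic holds for other dataset sizes as well. When proportiontocut × Original length is a whole number, the trimmed data length is:
Trimmed data length = Original length × (1 − 2 × proportiontocut)
Otherwise, the number of elements cut from each end is rounded down, so the trimmed data length is Original length − 2 × int(proportiontocut × Original length).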
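The length rule can be checked with a short sketch. The formula above is exact when proportiontocut × Original length is a whole number; for other sizes, SciPy rounds the number of elements cut from each end down. The helper expected_trimboth_length is a hypothetical name written for this illustration and is not part of SciPy.

```python
import numpy as np
from scipy import stats

def expected_trimboth_length(n, proportiontocut):
    # Hypothetical helper (not a SciPy function): the cut per end is
    # rounded down to a whole number of elements.
    return n - 2 * int(proportiontocut * n)

lengths = {}
for n in (100, 25, 7):
    data = np.arange(n)
    lengths[n] = len(stats.trimboth(data, proportiontocut=0.1))
    print(n, lengths[n], expected_trimboth_length(n, 0.1))
```

For 25 elements, only 2 elements are cut from each end (2.5 rounded down), so 21 remain rather than 20. For 7 elements nothing is cut at all.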