Pandas apply with np.std as the function parameter gives different output




I am using sklearn.preprocessing.StandardScaler to re-scale my data. I want to use np.std to do the same thing as StandardScaler.

However, I found something interesting: with no additional parameters passed, pandas.apply(func = np.std) returns either the sample std or the population std, depending on how it is called. (See 2. Questions)

I know there is a parameter called ddof which controls the divisor when calculating the variance. Without changing the default parameter ddof = 0, how could I get different outputs?
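To make the role of ddof concrete, here is a minimal sketch on a small made-up array (the values are illustrative and are not the iris data). It compares np.std with ddof = 0 and ddof = 1 against the two formulas written out by hand.

```python
import numpy as np

# Small made-up sample, only for illustration (not the iris data).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(x)             # default ddof=0: divide by N
sample_std = np.std(x, ddof=1)  # ddof=1: divide by N - 1

# The same two quantities written out by hand.
n = x.size
sq_dev = np.sum((x - x.mean()) ** 2)
manual_pop = np.sqrt(sq_dev / n)
manual_sample = np.sqrt(sq_dev / (n - 1))

print("population std:", pop_std, "sample std:", sample_std)
```

The sample std is always slightly larger than the population std, because the same sum of squared deviations is divided by N - 1 instead of N.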

1. Dataset

First, I choose the iris dataset as an example. I scale one column of the data (column index 1) as follows.

from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X_train = iris.data[:, [1]]  # X_train is column index 1 of the iris data
sc = StandardScaler()  # use StandardScaler to scale it
sc.fit(X_train)  # fit the scaler so that sc.scale_ is available

2. Questions: with no change to the default ddof = 0, I get different outputs from np.std!

import pandas as pd
import sklearn  # needed for sklearn.__version__ below
import sys
print("The mean and std(sample std) of X_train is :")
print(pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0),"\n")

print("The std(population std) of X_train is :")
print(pd.DataFrame(X_train).apply(np.std,axis = 0),"\n") 

print("The std(population std) of X_train is :","{0:.6f}".format(sc.scale_[0]),'\n') 

print("Python version:",sys.version,
      "\npandas version:",pd.__version__,
      "\nsklearn version:",sklearn.__version__)


The mean and std(sample std) of X_train is :
mean  3.057333
std   0.435866 

The std(population std) of X_train is :
0    0.434411
dtype: float64 

The std(population std) of X_train is : 0.434411 

Python version: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] 
pandas version: 0.23.4 
sklearn version: 0.20.1

From the above results, pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0) gives the sample std 0.435866, while pd.DataFrame(X_train).apply(np.std,axis = 0) gives the population std 0.434411.
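As a sanity check that the two printed values differ only in the divisor, here is a small sketch using nothing but the numbers from the output above and the fact that the iris data has 150 rows. If only ddof differs, the sample std should equal the population std times sqrt(n / (n - 1)).

```python
import math

# Values copied from the printed output above.
sample_std = 0.435866  # from apply([np.mean, np.std])
pop_std = 0.434411     # from apply(np.std) and StandardScaler.scale_
n = 150                # number of rows in the iris data

# If the two differ only in ddof, then
# sample_std == pop_std * sqrt(n / (n - 1)).
predicted_sample_std = pop_std * math.sqrt(n / (n - 1))
difference = abs(predicted_sample_std - sample_std)

print(predicted_sample_std, difference)
```

The predicted value agrees with the printed sample std up to the rounding of the six displayed decimals, which suggests the list form is computing the std with ddof = 1.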

3. My questions

  1. Why does pandas.apply give different results?

  2. How can I pass an additional parameter to np.std so that it gives the population std?

pd.DataFrame(X_train).apply(np.std,ddof = 1) works when passing a single function. But I am wondering how to do the same when passing a list of functions, i.e. pd.DataFrame(X_train).apply([np.mean,np.std],**args).
