Calculating standardized values of one or more columns is an important step in many machine learning analyses. For example, when we use dimensionality reduction techniques such as Principal Component Analysis (PCA), we generally standardize all variables.
To standardize a variable, subtract the variable's mean from each value and divide the result by the variable's standard deviation. This rescales the variable to have zero mean and unit variance; it does not change the shape of its distribution.
Standardizing a variable in Python
The standardized value of a variable is also called a z-score. It is the number of standard deviations by which a value deviates from the mean of the variable. If the original value is above the mean, the standardized value or z-score is positive. If the original value is below the mean, the z-score is negative.
In this article we will see three ways to calculate standardized scores for multiple variables in a Pandas data frame.
- First, we will use Pandas functions to manually calculate standardized scores for all columns at once.
- Then we will use NumPy to calculate the standardized values.
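As a minimal sketch of the z-score definition, the example below standardizes a small list of made-up numbers using only the Python standard library. The values are hypothetical and are not taken from the penguins data; the population standard deviation (pstdev) is used here purely for convenience.

```python
from statistics import mean, pstdev

# Hypothetical toy values, not taken from the penguins data
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)        # arithmetic mean: 5
sigma = pstdev(values)   # population standard deviation: 2.0

# z-score: how many standard deviations each value lies from the mean
z_scores = [(v - mu) / sigma for v in values]
print(z_scores)
```

Values below the mean (such as 2) get negative z-scores, values above the mean (such as 9) get positive ones, and a value equal to the mean gets zero.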
- Finally, we will use the scikit-learn module to calculate standardized scores or z-scores for all columns of the data frame.
Let us import the packages needed to calculate the standardized scores and visualize them in Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
We will use the Palmer Penguins dataset available in Seaborn’s built-in datasets and remove the rows with missing data to keep things simple.
# load data from Seaborn
penguins = sns.load_dataset("penguins")
# remove rows with missing values
penguins = penguins.dropna()
Because we are only interested in numerical variables, we select the columns that are numeric.
data = penguins.select_dtypes(float)
data.head()
   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0            39.1           18.7              181.0       3750.0
1            39.5           17.4              186.0       3800.0
2            40.3           18.0              195.0       3250.0
4            36.7           19.3              193.0       3450.0
5            39.3           20.6              190.0       3650.0
We see that each column has a completely different range. We can quickly check the mean of each variable and see how much they differ from each other.
df = data.mean().reset_index(name="avg")
df
               index          avg
0     bill_length_mm    43.992793
1      bill_depth_mm    17.164865
2  flipper_length_mm   200.966967
3        body_mass_g  4207.057057
We can also use density plots to see how different their distributions are. Using the raw data as is can bias most machine learning techniques.
Multiple density plot of raw data
Standardizing multiple variables with Pandas
We can standardize all numeric variables in the data frame using Pandas' vectorized functions. Here we calculate the column means with mean() and the column standard deviations with std() for all columns/variables in the data frame. We can then subtract each column's mean and divide by its standard deviation to calculate standardized values for all columns at once.
data_z = (data-data.mean())/(data.std())
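The one-liner above works because Pandas aligns the Series returned by mean() and std() with the column labels of the data frame. The following sketch illustrates this on a tiny, hypothetical frame (the column names "a" and "b" are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame; column names are made up for illustration
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

col_means = toy.mean()   # Series indexed by column name
col_stds = toy.std()     # sample standard deviation (ddof=1) per column

# Subtraction and division align the Series with the frame's columns,
# so every column is centered and scaled by its own statistics.
toy_z = (toy - col_means) / col_stds
print(toy_z)
```

Each column of toy_z is scaled independently, so the very different ranges of "a" and "b" no longer matter.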
Our standardized values should have a mean of zero and unit variance for all columns. We can verify this by making a density plot as shown below.
sns.kdeplot(data=data_z)
Density plot of variables standardized with Pandas
Let’s also check the arithmetic mean and the standard deviation of each variable.
data_z.mean()
bill_length_mm -2.379811e-15
bill_depth_mm -1.678004e-15
flipper_length_mm 2.110424e-16
body_mass_g 1.733682e-17
dtype: float64
Let’s check the standard deviations.
data_z.std()
bill_length_mm       1.0
bill_depth_mm        1.0
flipper_length_mm    1.0
body_mass_g          1.0
dtype: float64
How to calculate standardized values or z-scores with NumPy?
We can also use NumPy and calculate standardized scores for multiple columns using vectorized operations. First, convert the Pandas data frame to a NumPy array using the to_numpy() function available in Pandas.
data_mat = data.to_numpy()
We can use NumPy's mean() and std() functions to calculate the means and standard deviations and use them to calculate the standardized values. Note that we have specified axis=0 so that mean() and std() are calculated per column.
data_z_np = (data_mat - np.mean(data_mat, axis=0)) / np.std(data_mat, axis=0)
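To make the role of the axis argument concrete, here is a small sketch on a hypothetical 3x2 matrix (the numbers are made up). With axis=0 NumPy reduces over rows and returns one statistic per column, which is what we need when each column is a variable.

```python
import numpy as np

# Hypothetical 3x2 matrix: rows are observations, columns are variables
mat = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

col_means = np.mean(mat, axis=0)   # one mean per column
row_means = np.mean(mat, axis=1)   # one mean per row (not what we want here)

# np.std uses the population formula (ddof=0) by default
mat_z = (mat - col_means) / np.std(mat, axis=0)
print(col_means, row_means)
```

Broadcasting stretches the per-column means and standard deviations across all rows, so no explicit loop is needed.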
With NumPy we get our standardized scores as a NumPy array. Let us convert the NumPy array to a Pandas data frame using DataFrame().
data_z_np_df = pd.DataFrame(data_z_np,
index=data.index,
columns=data.columns)
Here is our new standardized data; we can check its mean and standard deviation as shown above.
data_z_np_df.head()
   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0       -0.896042       0.780732          -1.426752    -0.568475
1       -0.822788       0.119584          -1.069474    -0.506286
2       -0.676280       0.424729          -0.426373    -1.190361
4       -1.335566       1.085877          -0.569284    -0.941606
5       -0.859415       1.747026          -0.783651    -0.692852
How to standardize multiple variables with scikit-learn?
We can standardize one or more variables using scikit-learn's preprocessing module. We use StandardScaler from sklearn.preprocessing to standardize the variables.
from sklearn.preprocessing import StandardScaler
We follow the typical scikit-learn approach: first create an instance of StandardScaler(), then fit it to the data and transform the data to calculate standardized scores for all variables.
nrmlzd = StandardScaler()
data_std = nrmlzd.fit_transform(data)
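As a rough illustration of what fit_transform() does, the sketch below fits a StandardScaler on a hypothetical toy matrix and compares the result with a manual NumPy computation. The matrix and the variable names (X, scaler, manual) are made up for this example; mean_ and scale_ are the attributes in which a fitted StandardScaler stores the statistics it used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: rows are observations, columns are variables
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# After fitting, the scaler stores the per-column statistics it used
print(scaler.mean_)    # column means
print(scaler.scale_)   # column standard deviations (population, ddof=0)

# The same result computed by hand with NumPy
manual = (X - X.mean(axis=0)) / X.std(axis=0)
same = np.allclose(X_std, manual)
print(same)
```

Keeping the fitted scaler around is useful when the same centering and scaling has to be applied to new data later with transform().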
scikit-learn also returns the results as a NumPy array, and we can create a Pandas data frame as before.
data_std = pd.DataFrame(data_std,
index=data.index,
columns=data.columns)
data_std
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 -0.896042 0.780732 -1.426752 -0.568475
1 -0.822788 0.119584 -1.069474 -0.506286
2 -0.676280 0.424729 -0.426373 -1.190361
4 -1.335566 1.085877 -0.569284 -0.941606
5 -0.859415 1.747026 -0.783651 -0.692852
Let’s check the average and the standard deviation of the standardized scores.
data_std.mean()
bill_length_mm       1.026873e-16
bill_depth_mm        3.267323e-16
flipper_length_mm    5.697811e-16
body_mass_g          2.360474e-16
dtype: float64
data_std.std()
bill_length_mm       1.001505
bill_depth_mm        1.001505
flipper_length_mm    1.001505
body_mass_g          1.001505
dtype: float64
You may notice that the standardized scores calculated by Pandas differ slightly from those calculated by NumPy and scikit-learn. This is because Pandas' std() computes the sample standard deviation (dividing by n - 1) by default, whereas NumPy's std() and scikit-learn's StandardScaler use the population standard deviation (dividing by n).
The scores do not differ much, though: the difference shows up only in the third decimal place. Here is the density plot of the scores standardized with scikit-learn; we can verify that the variables are centered at zero and that the plot looks just like the one computed with Pandas.
sns.kdeplot(data=data_std)
Density plot of standardized variables: sklearn StandardScaler
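The small gap between the Pandas scores and the NumPy/scikit-learn scores can be reproduced on a toy example. The sketch below uses a hypothetical five-value Series and relies on the ddof argument, which both Pandas' std() and NumPy's std() accept, to switch between the sample and the population formula.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(s)

sample_std = s.std()                     # Pandas default: ddof=1
population_std = np.std(s.to_numpy())    # NumPy default: ddof=0

# The two standard deviations differ by a factor of sqrt(n / (n - 1))
ratio = sample_std / population_std
expected = np.sqrt(n / (n - 1))
print(ratio, expected)

# Passing ddof=0 to Pandas reproduces the NumPy-style z-scores
z_pandas_ddof0 = (s - s.mean()) / s.std(ddof=0)
z_numpy = (s.to_numpy() - s.mean()) / population_std
print(np.allclose(z_pandas_ddof0, z_numpy))
```

The factor sqrt(n / (n - 1)) shrinks toward 1 as the number of rows grows, which is why the difference is so small on a data set with a few hundred rows.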
Wondering how much of a difference it makes whether or not you standardize the variables in your analysis? Read here about the importance of data standardization when performing PCA.