Industrial Training




Machine Learning - Preparing Data


Introduction

Machine learning algorithms depend entirely on data, because data is what makes model training possible. However, if we cannot make sense of that data before feeding it to ML algorithms, the resulting model will be useless. In simple words, we always need to feed the right data, i.e. data in the correct scale and format and containing meaningful features, for the problem we want the machine to solve.


This makes data preparation the most important step in the ML process. Data preparation may be defined as the procedure that makes our dataset more appropriate for the ML process.


Why Data Pre-processing?

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data pre-processing converts the selected data into a form that we can work with or feed to ML algorithms. We always need to pre-process our data so that it meets the expectations of the machine learning algorithm.


Scaling

Most likely our dataset comprises attributes with varying scales. Many ML algorithms perform poorly on such data, so it requires rescaling. Data rescaling makes sure that all attributes are on the same scale. Generally, attributes are rescaled into the range 0 to 1. Algorithms that are optimized with gradient descent, as well as distance-based algorithms such as k-Nearest Neighbors, benefit from scaled data. We can rescale the data with the help of the MinMaxScaler class of the scikit-learn Python library.
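
To make the idea of rescaling concrete, the following is a minimal sketch of the usual min-max formula, x' = (x - min) / (max - min), applied column by column with NumPy. The helper name min_max_rescale and the sample values are made up for illustration; they are not part of scikit-learn or of the dataset used in this chapter.

```python
import numpy as np

def min_max_rescale(data, feature_range=(0.0, 1.0)):
    """Rescale each column of `data` into `feature_range` (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    low, high = feature_range
    col_min = data.min(axis=0)   # smallest value in each column
    col_max = data.max(axis=0)   # largest value in each column
    span = col_max - col_min
    # Guard against constant columns to avoid dividing by zero
    span[span == 0] = 1.0
    return low + (data - col_min) / span * (high - low)

# Made-up values: column 0 is on a small scale, column 1 on a large one
sample = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])
rescaled = min_max_rescale(sample)
print(rescaled)
```

After rescaling, both columns run from 0 to 1 even though their original scales differ by two orders of magnitude.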

Example

In this example we will rescale the data of the Pima Indians Diabetes dataset, which we used earlier. First, the CSV data will be loaded (as done in the previous chapters), and then, with the help of the MinMaxScaler class, it will be rescaled into the range 0 to 1.

The first few lines of the following script are the same as those we wrote in previous chapters while loading CSV data.

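from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing

# Location of the CSV file; adjust to where the dataset is stored
path = r'C:\pima-indians-diabetes.csv'
# Column names: eight attributes followed by the class label
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
# Convert the DataFrame to a NumPy array for scikit-learn
array = dataframe.values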

Now, we can use the MinMaxScaler class to rescale the data into the range 0 to 1.

data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

We can also summarize the data for output as we choose. Here, we set the precision to 1 and show the first 10 rows in the output.

set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])

Output

Scaled data:
[[0.4 0.7 0.6 0.4 0.  0.5 0.2 0.5 1. ]
 [0.1 0.4 0.5 0.3 0.  0.4 0.1 0.2 0. ]
 [0.5 0.9 0.5 0.  0.  0.3 0.3 0.2 1. ]
 [0.1 0.4 0.5 0.2 0.1 0.4 0.  0.  0. ]
 [0.  0.7 0.3 0.4 0.2 0.6 0.9 0.2 1. ]
 [0.3 0.6 0.6 0.  0.  0.4 0.1 0.2 0. ]
 [0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
 [0.6 0.6 0.  0.  0.  0.5 0.  0.1 0. ]
 [0.1 1.  0.6 0.5 0.6 0.5 0.  0.5 1. ]
 [0.5 0.6 0.8 0.  0.  0.  0.1 0.6 1. ]]

As the output shows, all values have been rescaled into the range 0 to 1.
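
The script above rescales every column of the array, including the class label. A possible variation, sketched below, rescales only the attribute columns and leaves the label untouched. The small array here is a made-up stand-in for the loaded CSV data, and the variable names X, y, X_rescaled and X_restored are chosen only for this illustration. The data_min_ and data_max_ attributes and the inverse_transform method are part of scikit-learn's MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for `array`: two attribute columns plus a 0/1 class column
array = np.array([[6.0, 148.0, 1.0],
                  [1.0,  85.0, 0.0],
                  [8.0, 183.0, 1.0]])

X = array[:, :-1]   # attribute columns to be rescaled
y = array[:, -1]    # class labels, left as they are

scaler = MinMaxScaler(feature_range=(0, 1))
X_rescaled = scaler.fit_transform(X)

# Per-column minimum and maximum learned during fitting
print("Column minimums:", scaler.data_min_)
print("Column maximums:", scaler.data_max_)

# The fitted scaler can map rescaled values back to the original scale
X_restored = scaler.inverse_transform(X_rescaled)
```

Keeping the fitted scaler object around is useful because the same minimums and maximums can later be applied to new data with scaler.transform.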


