A Step-by-Step Introduction to Machine Learning in Python
The “Hello, World” of machine learning is the Iris dataset (also known as the Iris flower dataset). The Iris dataset is part of the UCI Machine Learning Repository, and originally comes from R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems.
Prepare environment
We are going to use:
- matplotlib: For creating graphs
- numpy: For quick and easy operations on arrays of data
- pandas: For quick and easy operations on tabular data
- scikit-learn (in Python, called sklearn): For machine learning algorithms and the Iris dataset
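The imports for these four packages might look like the following. The aliases plt, np, and pd are common conventions rather than anything this tutorial requires, and the sketch assumes all four packages are already installed.

```python
# Conventional import aliases; nothing in the tutorial mandates these names.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Print the installed versions as a quick sanity check of the environment.
for module in (matplotlib, np, pd, sklearn):
    print(module.__name__, module.__version__)
```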
Load the data
Luckily for us, the Iris dataset is so popular that it is included in the scikit-learn package. All we need to do is call datasets.load_iris().
What we get back is called a Bunch. This is just scikit-learn terminology for some data and other information bundled together. It behaves much like a standard Python dictionary. We can see it has the following keys:
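A minimal sketch of loading the dataset and listing its keys is shown here. The variable name iris_dataset matches the name used in the rest of this tutorial; the exact set of keys depends on the installed scikit-learn version, so no output is shown.

```python
from sklearn import datasets

# Load the bundled Iris dataset; no download is needed.
iris_dataset = datasets.load_iris()

# The returned object is a Bunch, which supports dictionary-style access.
print(type(iris_dataset).__name__)
print(list(iris_dataset.keys()))

# Keys can also be read as attributes: both lines refer to the same array.
print(iris_dataset['data'] is iris_dataset.data)
```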
The data key contains a 2D numpy array with 4 columns and 150 rows:
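One way to confirm the dimensions is to inspect the array's shape attribute. This is a small sketch; the local name data is only for illustration.

```python
from sklearn import datasets

iris_dataset = datasets.load_iris()
data = iris_dataset['data']

# shape is (rows, columns): 150 observations, 4 measurements each.
print(data.shape)  # (150, 4)

# Peek at the first five observations.
print(data[:5])
```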
Each column is a feature, and each row is an observation. Each column (feature) in this data can be identified with iris_dataset['feature_names']:
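The feature names can be printed directly. As an optional extra, the sketch below also wraps the array in a pandas DataFrame so each column carries its label; the name iris_df is hypothetical and not part of the original tutorial.

```python
import pandas as pd
from sklearn import datasets

iris_dataset = datasets.load_iris()

print(iris_dataset['feature_names'])
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# Optional: label the columns by building a DataFrame from the raw array.
iris_df = pd.DataFrame(iris_dataset['data'],
                       columns=iris_dataset['feature_names'])
print(iris_df.head())
```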
This is all of the input data for our machine learning algorithm.
But what are we trying to predict? We are trying to predict the type of Iris from this data. The true values for each of these data points are contained in iris_dataset['target']:
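A short sketch of inspecting the target array follows. The use of np.unique to list the distinct labels is an addition for illustration.

```python
import numpy as np
from sklearn import datasets

iris_dataset = datasets.load_iris()
target = iris_dataset['target']

# One label per observation.
print(target.shape)       # (150,)

# The distinct label values that appear in the array.
print(np.unique(target))  # [0 1 2]

# The full array of labels.
print(target)
```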
There is one value in this array for each row in data. But wait, these are numbers; I was expecting names of Iris plants! Because we need to use numeric data, the type has been encoded as an integer (just like an enum). The mapping between integer and Iris type is contained in iris_dataset['target_names']:
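The mapping can be printed as-is, or enumerated so that each integer appears next to its name, as in this brief sketch.

```python
from sklearn import datasets

iris_dataset = datasets.load_iris()

print(iris_dataset['target_names'])  # ['setosa' 'versicolor' 'virginica']

# The position of each name in the array is the integer used in 'target'.
for code, name in enumerate(iris_dataset['target_names']):
    print(code, name)
```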
So 0 means that row in the data is a setosa Iris, 1 means it’s a versicolor Iris, and 2 means it’s a virginica Iris.
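To turn the integer labels back into names, one option is to index target_names with the target array. This relies on numpy integer-array indexing; the variable name decoded is hypothetical.

```python
import numpy as np
from sklearn import datasets

iris_dataset = datasets.load_iris()

# Index the names array with the integer labels to decode every row at once.
decoded = iris_dataset['target_names'][iris_dataset['target']]
print(decoded[:3])

# Count how many observations belong to each type of Iris.
names, counts = np.unique(decoded, return_counts=True)
print(dict(zip(names.tolist(), counts.tolist())))
# {'setosa': 50, 'versicolor': 50, 'virginica': 50}
```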
Split The Data Into Training And Test Groups
An important part of machine learning is to validate your algorithm after it has been trained. A common way to do this is to split the data into two groups, a training set and a test set. We will use the training set to teach our machine learning algorithm. We will hold back the data in the test set, and only use it to validate our algorithm after it has been trained.
An easy way to split the data is to use scikit-learn’s train_test_split() function. By default, this function shuffles the observations and splits them into training and test groups according to the ratio given by test_size. We will use 80% of the data for training, and 20% of the data for testing (hence test_size=0.2):
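A sketch of the split is shown below. The variable names X_train, X_test, y_train, and y_test are conventional choices, and random_state=0 is an assumption added here so that the shuffle is repeatable; neither comes from the original text.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_dataset = datasets.load_iris()

# Hold back 20% of the observations for testing.
# random_state is fixed only so repeated runs give the same split.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'],
    iris_dataset['target'],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)
```

With 150 observations, a test_size of 0.2 leaves 120 rows for training and 30 for testing.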
Select A Model
We will use logistic regression for this machine learning example. It is beyond the scope of this tutorial to fully explain how logistic regression works