Building a classifier using Python and scikit-learn

Scikit-learn is an easy-to-use machine learning library for Python. In this article, we will build a basic classifier: you feed it data, and it predicts which class each item belongs to. Here we will use data on cars and classify them as sedans, pickup trucks, or minivans.

Prerequisites

Before we begin, make sure that you have pip and Python installed. If you do not, check out the article on Python basics, which starts off by explaining how to install pip and Python on various platforms and then covers other basics such as loops and if/else statements.

After you have pip and Python installed, install the scikit-learn library by running:
pip install scikit-learn
– or –
pip3 install scikit-learn

Which command you need depends on whether pip is available as pip or pip3 on your system. Note that the package name is scikit-learn; the old sklearn package name on PyPI is deprecated. For the rest of this article, we will assume you are running Python 3.
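To confirm that the installation worked, a quick check like the following can help. It is a minimal sketch that only assumes the library can be imported under the name sklearn, which is the import name used later in this article.

```python
# The library is imported under the name "sklearn".
import sklearn

# Printing the installed version confirms that the import succeeded.
print(sklearn.__version__)
```

If this prints a version number rather than an ImportError, the library is ready to use.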

Creating a dataset

Before we start, we need to create a basic dataset for training our model. We will feed in the first part of the dataset to train the model, then use the second part to validate that the model gives accurate predictions. If the predictions are not accurate, that may indicate that the attributes do not correlate with the class, or that there is not enough training data.

In the table below, we have our dataset:

Model	Horsepower	Drivetrain	Seating Capacity	Weight	Class
F-150	290	RWD	3	4069	PickupTruck
Silverado	285	RWD	3	4515	PickupTruck
Titan	390	RWD	3	5157	PickupTruck
Pacifica	287	FWD	7	4330	MiniVan
Sedona	276	FWD	7	4410	MiniVan
Sienna	296	FWD	7	4430	MiniVan
Impala	196	FWD	5	3662	Sedan
Charger	292	RWD	5	3934	Sedan
Taurus	288	FWD	5	3917	Sedan

Dataset

You can see we have the name of the car, various attributes of the car, and finally its class. In our scikit-learn program, we are going to use this data to train our model, and then use similar data to verify that the model is working properly. The input to the model will be the attributes of the car, and the output will be its class.

Prepping the dataset

If you look at our data from the previous section, you will notice that some columns contain words. Scikit-learn does not work with words; it expects numbers, arranged as vectors. To convert our dataset into vectors, we need to encode those words as numbers. In the Drivetrain column, only two values appear in our data: FWD and RWD. We will convert those to 0 for FWD and 1 for RWD.

Next, we don’t really care about the name of the car, so we can simply drop that column before feeding the data to our model. Finally, the Class column has three possible values. We will change those to 1, 2, and 3 for PickupTruck, MiniVan, and Sedan respectively.

Here is our updated dataset:

Horsepower	Drivetrain	Seating Capacity	Weight	Class
290	1	3	4069	1
285	1	3	4515	1
390	1	3	5157	1
287	0	7	4330	2
276	0	7	4410	2
296	0	7	4430	2
196	0	5	3662	3
292	1	5	3934	3
288	0	5	3917	3

Vector dataset

Notice it is now all numbers, which is what scikit-learn expects.
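The encoding described above can also be done in code rather than by hand. The following is a minimal sketch in plain Python; the names raw_rows, DRIVETRAIN_CODES, and CLASS_CODES are made up for this illustration and are not part of scikit-learn.

```python
# Raw rows copied from the first table:
# (model, horsepower, drivetrain, seating capacity, weight, class)
raw_rows = [
    ("F-150", 290, "RWD", 3, 4069, "PickupTruck"),
    ("Silverado", 285, "RWD", 3, 4515, "PickupTruck"),
    ("Titan", 390, "RWD", 3, 5157, "PickupTruck"),
    ("Pacifica", 287, "FWD", 7, 4330, "MiniVan"),
    ("Sedona", 276, "FWD", 7, 4410, "MiniVan"),
    ("Sienna", 296, "FWD", 7, 4430, "MiniVan"),
    ("Impala", 196, "FWD", 5, 3662, "Sedan"),
    ("Charger", 292, "RWD", 5, 3934, "Sedan"),
    ("Taurus", 288, "FWD", 5, 3917, "Sedan"),
]

# Hypothetical lookup tables implementing the encoding from the text.
DRIVETRAIN_CODES = {"FWD": 0, "RWD": 1}
CLASS_CODES = {"PickupTruck": 1, "MiniVan": 2, "Sedan": 3}

# Drop the model name and replace each word with its numeric code.
encoded_rows = [
    [hp, DRIVETRAIN_CODES[drive], seats, weight, CLASS_CODES[car_class]]
    for _model, hp, drive, seats, weight, car_class in raw_rows
]

for row in encoded_rows:
    print(row)
```

Using dictionaries for the codes means an unexpected value, such as a drivetrain of AWD, raises a KeyError instead of silently producing a wrong number.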

In the last step, we need to restructure the data into two arrays. The first is a two-dimensional array containing all of the vehicle attributes. The second is a one-dimensional array containing all of the vehicle classifications.

Here is our data restructured and ready to be consumed by our Python script. If you compare this to the table above, you will notice that each block contains the horsepower, the 0 or 1 showing whether the vehicle is front-wheel drive or rear-wheel drive, the number of seats, and the weight of the vehicle.

[[290,1,3,4069],[285,1,3,4515],[390,1,3,5157],[287,0,7,4330],[276,0,7,4410],[296,0,7,4430],[196,0,5,3662],[292,1,5,3934],[288,0,5,3917]]

Each entry in this second array corresponds to one block of numbers above. The first three blocks are all trucks, so we have three 1’s. The next three are vans, so they are all 2’s, and the last three are sedans, so they are all 3’s.

[1,1,1,2,2,2,3,3,3]
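Splitting the numeric table into these two arrays takes one line per array. This sketch assumes the table is held in a list of lists called rows (a name chosen for this example), with the class in the last position of each row.

```python
# The numeric table from the previous section, one inner list per vehicle:
# [horsepower, drivetrain, seating capacity, weight, class]
rows = [
    [290, 1, 3, 4069, 1],
    [285, 1, 3, 4515, 1],
    [390, 1, 3, 5157, 1],
    [287, 0, 7, 4330, 2],
    [276, 0, 7, 4410, 2],
    [296, 0, 7, 4430, 2],
    [196, 0, 5, 3662, 3],
    [292, 1, 5, 3934, 3],
    [288, 0, 5, 3917, 3],
]

# Everything except the last column is a feature ...
features = [row[:-1] for row in rows]

# ... and the last column is the class label.
labels = [row[-1] for row in rows]

print(features)
print(labels)
```

Because both lists are built from the same rows in the same order, the label at each position always matches the feature block at that position.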

Creating the scikit-learn model

In this section we will finally create our model in Python. Below is a commented Python script that describes what each section does. Feel free to copy, paste, and run it.

# Import the decision tree module from scikit-learn
from sklearn import tree

# Create an array with the vehicle features/attributes
# First attribute = horsepower
# Second attribute = drivetrain: FWD (0) or RWD (1)
# Third attribute = number of seats
# Fourth attribute = weight
features = [[290,1,3,4069],[285,1,3,4515],[390,1,3,5157],[287,0,7,4330],[276,0,7,4410],[296,0,7,4430],[196,0,5,3662],[292,1,5,3934],[288,0,5,3917]]

# 1 = Truck
# 2 = Van
# 3 = Sedan
labels = [1,1,1,2,2,2,3,3,3]

# Create the classifier object
clf = tree.DecisionTreeClassifier()

# Feed in the training data
clf.fit(features, labels)

# Predict a vehicle's classification based on its attributes.
# These values are for a Dodge Ram. If it works, it should print [1], indicating a truck.
print(clf.predict([[305,1,3,4548]]))

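If you run the program above, you will notice it prints [1], indicating a truck. If you change the attributes that you feed in for the prediction, you might get different vehicles. For example, if you change the number of seats from 3 to 5, you may get a 3 for a sedan, because all of the sedans have 5 seats. If you change the number of seats to 7, you will usually get a 2 for a van, because all of the vans have 7 seats. These results can vary from run to run. Seating capacity is not the only attribute that separates the classes in this small dataset (every sedan also weighs less than every other vehicle, for example), and when several splits are equally good, the decision tree picks among them at random unless you pass a fixed random_state to DecisionTreeClassifier.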
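To see which attributes the trained tree actually relies on, you can print its rules. The sketch below uses export_text from sklearn.tree and fixes random_state so that repeated runs build the same tree; the value 0 and the feature names passed in are choices made for this example. It then repeats the Dodge Ram prediction with different seat counts.

```python
from sklearn import tree

features = [[290,1,3,4069],[285,1,3,4515],[390,1,3,5157],
            [287,0,7,4330],[276,0,7,4410],[296,0,7,4430],
            [196,0,5,3662],[292,1,5,3934],[288,0,5,3917]]
labels = [1,1,1,2,2,2,3,3,3]

# A fixed random_state makes the choice between equally good splits repeatable.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# Print the learned rules as indented text, using readable column names.
feature_names = ["horsepower", "drivetrain", "seats", "weight"]
print(tree.export_text(clf, feature_names=feature_names))

# Repeat the Dodge Ram prediction with different seat counts.
for seats in (3, 5, 7):
    prediction = clf.predict([[305, 1, seats, 4548]])[0]
    print(seats, "seats ->", prediction)
```

If the printed rules mention weight or drivetrain rather than only seats, that explains why changing the seat count alone does not always change the predicted class.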

The more interesting results come when you feed in values the model has not seen before. For example, if you change the number of seats to 2 instead of 3, it will still return a truck. The model has not memorized the training rows; it has learned threshold rules, and a two-seat vehicle with these attributes falls on the same side of every threshold as the three-seat trucks.
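The validation step mentioned in the "Creating a dataset" section can be sketched as follows. This example holds back the last vehicle of each class (Titan, Sienna, Taurus) as a test set and trains on the other six. That particular split is an assumption made for illustration, and with so few rows the resulting accuracy says little about how the model would behave on real data.

```python
from sklearn import tree

# Training set: the first two vehicles of each class.
train_features = [[290,1,3,4069],[285,1,3,4515],
                  [287,0,7,4330],[276,0,7,4410],
                  [196,0,5,3662],[292,1,5,3934]]
train_labels = [1,1,2,2,3,3]

# Held-out test set: the last vehicle of each class (Titan, Sienna, Taurus).
test_features = [[390,1,3,5157],[296,0,7,4430],[288,0,5,3917]]
test_labels = [1,2,3]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_features, train_labels)

# Compare the predictions for the held-out rows with their known classes.
predictions = clf.predict(test_features)
print("predicted:", list(predictions))
print("expected: ", test_labels)

# score() returns the fraction of test rows classified correctly.
accuracy = clf.score(test_features, test_labels)
print("accuracy:", accuracy)
```

An accuracy well below 1.0 here would be a hint that the model needs more training data or more informative attributes.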