Tutorial: Machine Learning Data Set Preparation, Part 2

To see the entire Machine Learning Tutorial, go here.

Let’s start with the data.

Data just means “that which is given” in Latin. If I handed you a muffin, that’s data. But that’s not the data we’re going to think about. Data in this case typically means a bunch of information, and we need to extract that information from a source, put it into a format a machine can read, and then parse it in order to glean something that isn’t immediately obvious on first examination.

What we’re given can be a picture, or a string of letters. The task of machine learning is to infer, from what is given, those things that are not given. In some cases, humans will assist in a training role. That is, a person will be asked to identify what objects are in a picture (is it a fire hydrant? is it a car?) or what a sentence means. In other cases, there will be no training, and the machine will learn from just the data.

This concept carries over fairly well to both visual and linguistic cases, but the linguistic case is a bit easier to start with.

So let’s consider some really boring data.

NAME	COUNTRY	OWN A CAR	LIKES ICE-CREAM
Eliza Santiago	Guatemala	Yes	No
Fred Winchester	Canada	No	Yes
Marvin Ngoma	Ghana	Yes	No
Xiong Mao	USA	Yes	???
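As a rough sketch of what “a format a machine can read” might look like for the table above, here is one way to parse it in Python. The field names (owns_car, likes_ice_cream) and the choice to treat “???” as a missing value are my own assumptions for illustration, not part of the data itself.

```python
import csv
import io

# The table above, as tab-separated text.
RAW = """NAME\tCOUNTRY\tOWN A CAR\tLIKES ICE-CREAM
Eliza Santiago\tGuatemala\tYes\tNo
Fred Winchester\tCanada\tNo\tYes
Marvin Ngoma\tGhana\tYes\tNo
Xiong Mao\tUSA\tYes\t???
"""


def parse_answer(text):
    """Map 'Yes'/'No' to True/False; anything else (such as '???') becomes None."""
    return {"Yes": True, "No": False}.get(text.strip())


def load_rows(raw):
    """Turn the tab-separated text into a list of dictionaries, one per person."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
    rows = []
    for record in reader:
        rows.append({
            "name": record["NAME"],
            "country": record["COUNTRY"],
            "owns_car": parse_answer(record["OWN A CAR"]),
            "likes_ice_cream": parse_answer(record["LIKES ICE-CREAM"]),
        })
    return rows


rows = load_rows(RAW)

# Rows where the answer is given, and rows where it must be inferred.
known = [r for r in rows if r["likes_ice_cream"] is not None]
unknown = [r for r in rows if r["likes_ice_cream"] is None]

print(len(known), "rows with a known answer")
print("To be inferred:", [r["name"] for r in unknown])
```

Splitting the rows this way mirrors the distinction between what is given and what is not: the known rows are what we would learn from, and the unknown row is what we would like to infer.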

We have some made-up people: Fred Winchester, Eliza Santiago, Marvin Ngoma, and Xiong Mao. And we have their phone numbers (reduced in the table to just a country), whether they like ice cream, and whether they own a car. From this, we must ask which part of the data we actually need and which is extraneous. Phone numbers are assigned in a largely arbitrary (but not necessarily random) fashion.

For the sake of reducing details (and thereby lessening the possibility of overfitting), we ought to either discard the phone number entirely, or at least reduce it to just the country code (say, +44 for the UK) and area code prefix. Let’s assume we had their name, their number, and the answer to at least one of the questions about their ice cream preferences and whether they drive.
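To make the reduction concrete, here is a minimal sketch in Python. It assumes the numbers are written with spaces separating the country code, the area code, and the rest, which real-world data often will not be. The sample numbers are made up for illustration.

```python
def reduce_phone(number):
    """Keep only the first two space-separated parts of a phone number.

    Assumes the format '+<country code> <area code> <rest...>'.
    A number with fewer than two parts is returned unchanged.
    """
    parts = number.split()
    return " ".join(parts[:2])


# Made-up example numbers, not taken from the table.
samples = ["+44 20 7946 0958", "+1 212 555 0100"]

for number in samples:
    print(number, "->", reduce_phone(number))
```

The point is only that the trailing digits, which carry little information about the person, are thrown away before the machine ever sees them.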

So the easy question is, provided with this information, do we have sufficient data to know Xiong Mao’s ice cream preference?

The harder question is why this information is or is not sufficient. In subsequent installments of this tutorial I will go into the arguments for and against whether we can make a deductive claim from incomplete information (this being one of the core premises and promises of machine learning).