Tutorial: Machine Learning Data Set Preparation, Part 1

To see the entire Machine Learning Tutorial, go here.

In this multi-part tutorial, I shall go over the basics of taking “live” human data and putting into a suitable format to feed into one’s machine learning platform of choice. This series will go from general to specific and offer insight on methodology before going into gathering data, putting this data into a machine-readable format, and then feeding this into machine learning platforms such as TensorFlow and Weka.

To start, I have to think of the data set in at least two lights.

One is theme-oriented: the data has to have a thematic character that is neither too broad as to capture spurious correlations, and not so narrow that it confirms the obvious. This is the story told by the data.

The other is feature-oriented. A good data set needs to have its raw data converted into a workable ontology — which is just a fancy word to refer to the objects and the landscape they inhabit. These are the nouns of story told by the data.

These are methodological, but reflect a deeper concern: doxology. What human belief do these reinforce? Will these reinforce a prevailing status quo, something people take as a given, or will the data be such that new insights can be gleamed? This goes beyond merely trying to avoid overfitting (when data is so finely grained that it has too many details and thereby has too much extraneous information to report to be of much use). This means making sure that if your data set is going to be free from racial bias, care has been taken to remove proxies for racial demographics, such as zip code.