ML isn’t perfect

There are many caveats to ML. Many are specific to the particular model being implemented, but some assumptions are universal to any ML model:

  • The data is, for the most part, preprocessed and cleaned using the methods outlined in earlier chapters. Almost no ML model will tolerate dirty or incomplete data with missing values or unencoded categorical values.
  • Each row of a cleaned dataset represents a single observation of the environment we are trying to model.
  • The data as a whole should be representative of the task we are solving. This might sound obvious, but in many cases people train an ML model on data that is close to, but not exactly, the task at hand. This is often seen in criminal justice, where people might use arrest data to train a model to predict criminality; but, of course, an arrest is not the same as a conviction.
  • If our goal is to find relationships between variables, then there is an assumption that there is some kind of relationship between these variables. Again, this seems obvious, but if a human putting the data together is biased and “believes” there is a relationship between the data, then they might incorrectly judge an ML model to be more powerful than it actually is.

This last assumption is particularly important. Most ML models take it for granted: given data, they will find some relationship, and they have no way of communicating that there may be no real relationship at all.
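This blind spot is easy to demonstrate. In the sketch below (the data and seed are invented for illustration), ordinary least squares is fit to pure noise: it still returns coefficients and a positive training "fit" without any complaint that the target is unrelated to the features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # features: pure noise
y = rng.normal(size=100)        # target: generated independently of X

# OLS happily returns coefficients either way
X1 = np.column_stack([np.ones(100), X])          # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred = X1 @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(coef)  # nonzero coefficients despite no real relationship
print(r2)    # a small but positive R^2 on the training data
```

Nothing in the output flags that the "relationship" is an artifact of fitting noise; it is up to the human to ask whether a relationship plausibly exists before trusting the numbers.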

  • ML models are generally considered semi-automatic, which means that intelligent decisions by humans are still needed.
  • The machine is very smart but has a hard time putting things into context. The output of most models is a series of numbers and metrics attempting to quantify how well the model did. It is up to a human to put these metrics into perspective and communicate the results to an audience.
  • Most ML models are sensitive to noisy data. This means that the models get confused when you include data that doesn’t make sense. For example, if you are attempting to find relationships between economic data around the world and one of your columns relates to puppy adoption rates in the capital city, that information is likely not relevant and will confuse the model.
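The puppy-adoption example above can be made concrete. In this minimal sketch (synthetic data; the variable names are illustrative), a least-squares model is fit with and without an irrelevant noise column, and the model still assigns the irrelevant column a nonzero weight rather than recognizing it as meaningless:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
gdp = rng.normal(size=n)                           # relevant economic feature
target = 2.0 * gdp + rng.normal(scale=0.5, size=n) # outcome driven only by gdp
puppies = rng.normal(size=n)                       # irrelevant "puppy adoption" column

def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

coef_clean = ols_fit(gdp[:, None], target)
coef_noisy = ols_fit(np.column_stack([gdp, puppies]), target)
print(coef_clean)  # [intercept, gdp weight]
print(coef_noisy)  # [intercept, gdp weight, small but nonzero puppy weight]
```

The model cannot know that puppy adoptions are irrelevant; it simply fits whatever columns it is given, which is why removing nonsensical features is a human's job.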

These assumptions will come up again and again when dealing with ML. They are critically important, yet novice data scientists often ignore them.
