If correlation doesn’t imply causation, then what does? – Communicating Data

If correlation doesn’t imply causation, then what does? As a data scientist, it is often frustrating to work with correlation without being able to draw conclusive causal claims. The best way to confidently establish causality is usually through randomized experiments, such as the ones we saw in Chapter 8, Advanced Statistics. One would have […]

ML isn’t perfect – How to Tell if Your Toaster is Learning – Machine Learning Essentials

ML isn’t perfect There are many caveats to ML. Many are specific to the particular model being implemented, but some assumptions are universal for any ML model: This assumption is particularly important. Many ML models take this assumption very seriously; these models are not able to communicate that there might not be a relationship at all. These assumptions […]

UL

UL The second type of ML on our list does not deal with making predictions; it has a much more open-ended objective. UL takes in a set of predictors and uses the relationships between them to accomplish tasks such as the following: Both of these are examples of UL because they do not […]
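The excerpt's task list is cut off, but clustering is one of the most common UL tasks, and a minimal sketch can show what "no predictions, just relationships between predictors" means in practice. The data and cluster count below are invented purely for illustration; scikit-learn's KMeans stands in for whatever tasks the chapter actually lists.

```python
# A minimal sketch of one common UL task -- clustering -- with no
# labels supplied: the model finds structure among the predictors.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points; no response variable anywhere
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.2], [7.9, 8.1], [8.1, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Points in the same group end up sharing a cluster label
print(labels)
```

Note that the model never sees a "correct answer" for any point; the output is a grouping, not a prediction of a known response.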

ML paradigms – pros and cons

ML paradigms – pros and cons As we now know, ML can be broadly classified into three categories, each with its own set of advantages and disadvantages. SML This method leverages the relationships between input predictors and the output response variable to predict future data observations. Its advantages are as follows: Let’s see […]

Adding more predictors

Adding more predictors Of course, temperature is not the only thing that will help us predict the number of bikes. Adding more predictors to the model is as simple as telling the linear regression model in scikit-learn about them! Before we do, we should look at the data dictionary provided to us to make more […]
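The excerpt says adding predictors is "as simple as telling the linear regression model in scikit-learn about them." A hedged sketch of what that looks like: the bike-sharing column names (`temp`, `humidity`) and the synthetic data below are assumptions for illustration, since the real data dictionary is referenced but not shown.

```python
# Sketch: fitting a scikit-learn linear regression on more than one
# predictor at once. Data is synthetic, invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(0, 35, n)        # degrees Celsius (assumed column)
humidity = rng.uniform(20, 90, n)   # percent (assumed column)
# Synthetic bike counts: rise with temperature, fall with humidity
y = 40 + 10 * temp - 2 * humidity + rng.normal(0, 5, n)

# "Telling the model about them" = stacking the columns into X
X = np.column_stack([temp, humidity])
model = LinearRegression().fit(X, y)

print(model.coef_)  # one fitted coefficient per predictor
```

The only change from a single-predictor fit is the shape of `X`; the API call is identical.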

Regression metrics 2

Even better! At first, this seems like a major triumph, but there is actually a hidden danger here. Note that we are training the line to fit X and y and then asking it to predict X again! This is actually a huge mistake in ML because it can lead to overfitting, which means that […]
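The hidden danger the excerpt names is scoring a model on the same data it was fit to. A standard remedy, sketched here on invented synthetic data, is to hold out a test set with scikit-learn's `train_test_split` and report the score on data the model never saw.

```python
# Sketch of the fix for "predicting X again": evaluate on held-out
# data rather than the training data. Data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (150, 1))
y = 2 * X.ravel() + rng.normal(0, 1, 150)

# Keep 30% of the rows aside; the model never trains on them
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)
train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))
print(train_score, test_score)  # the honest estimate is the test score
```

An overfit model shows a large gap between the two scores; the test score is the one to trust.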

Performing naïve Bayes classification – Predictions Don’t Grow on Trees, or Do They?

Performing naïve Bayes classification Let’s get right into it! We begin with naïve Bayes classification. This ML model relies heavily on results from previous chapters, specifically on Bayes’ theorem: Let’s look a little closer at the specific features of this formula: Naïve Bayes classification is a classification model and therefore a supervised model. Given this, […]

Classification metrics 3

Note that each row represents one of the three documents (sentences), each column represents one of the words present in the documents, and each cell contains the number of times each word appears in each document. We can then use the count vectorizer to transform new incoming test documents to conform with our training set […]
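The workflow the excerpt describes (rows are documents, columns are words, cells are counts, then transforming new documents to conform to the training vocabulary) maps directly onto scikit-learn's `CountVectorizer`. The sentences below are invented stand-ins for the three documents discussed.

```python
# Sketch of the count-vectorizer workflow: fit the vocabulary on the
# training documents, then transform new documents against it.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog sat", "the cat ran"]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # rows = docs, cols = words

# A new document conforms to the same columns; unseen words are dropped
X_test = vectorizer.transform(["the cat flew"])
print(X_train.shape, X_test.shape)  # → (3, 5) (1, 5)
```

Note that "flew" simply vanishes from the test row: the column space is fixed at fit time, which is exactly what keeps train and test matrices compatible.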

Understanding decision trees

Understanding decision trees Decision trees are supervised models that can either perform regression or classification. They are a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (for classification) or a value (for regression). One […]
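A small, hedged sketch of the flowchart structure just described: each internal node tests an attribute, each branch is a test outcome, and each leaf holds a class label. The single-feature "temperature" dataset below is invented for illustration.

```python
# Toy decision tree classifier: one attribute test per internal node,
# class labels at the leaves. Data is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Feature: [temperature in C]; label: 1 = warm day, 0 = cold day
X = [[5], [10], [12], [25], [28], [30]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# New observations follow the learned flowchart down to a leaf
print(tree.predict([[8], [27]]))  # → [0 1]
```

With one feature and perfectly separated classes, the fitted tree is a single threshold test; real trees stack many such tests, one per internal node.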

Dummy variables

Dummy variables Dummy variables are used when we want to convert a categorical feature into a quantitative one. Remember that we have two types of categorical features: nominal and ordinal. Ordinal features have a natural order among them, while nominal features do not. Encoding qualitative (nominal) data using separate columns is called making dummy variables, […]
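"Encoding nominal data using separate columns" is exactly what pandas' `get_dummies` does: one new binary column per category. The `color` column below is an invented example, since the excerpt's own feature is not shown.

```python
# Sketch of making dummy variables from a nominal feature with pandas.
# The color column is invented for illustration.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One binary indicator column per distinct category
dummies = pd.get_dummies(df["color"], prefix="color")
print(sorted(dummies.columns))  # → ['color_blue', 'color_green', 'color_red']
```

Each row now has exactly one "hot" column, so the nominal values become numbers without imposing an order that, for nominal data, does not exist.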