Our data can be seen in Figure 10.8:

Figure 10.8 – The first five rows (the head) of our bike-share data

We can see that every row represents a single hour of bike usage. In this case, we are interested in predicting the count value, which represents the total number of bikes rented in the […]
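As a minimal sketch of what that head looks like, we can build a tiny stand-in frame; the column names (datetime, temp, count) are assumptions based on the text, and the values are made up:

```python
import pandas as pd

# Tiny stand-in for the bike-share data; column names are assumptions
# based on the text, and the values are made up for illustration.
bikes = pd.DataFrame({
    "datetime": pd.date_range("2011-01-01", periods=5, freq="h"),
    "temp": [9.84, 9.02, 9.02, 9.84, 9.84],
    "count": [16, 40, 32, 13, 1],
})

# Each row is one hour of bike usage; "count" is the response we want
# to predict.
print(bikes.head())
```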
ML paradigms – pros and cons – How to Tell if Your Toaster is Learning – Machine Learning Essentials
As we now know, ML can be broadly classified into three categories, each with its own set of advantages and disadvantages.

SML

This method leverages the relationships between input predictors and the output response variable to predict future data observations. Its advantages are as follows: Let's see […]
Correlation versus causation
In the context of linear regression, coefficients represent the strength and direction of the relationship between the predictor variables and the response variable. However, this statistical relationship should not be confused with causation. The coefficient B1, with a value of 9.17 in our previous code snippet, indicates the average change in the […]
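How such a coefficient is obtained can be sketched with synthetic data; all numbers below are made up, and only the 9.17 slope is borrowed from the text as the target of the simulation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate temperatures and rental counts whose true slope is 9.17,
# mimicking the coefficient reported in the text (data is synthetic).
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=(500, 1))
count = 9.17 * temp.ravel() + 20 + rng.normal(0, 10, size=500)

model = LinearRegression().fit(temp, count)

# The fitted coefficient estimates the *average* change in rentals per
# one-degree increase in temperature -- an association, not causation:
# hot days may simply coincide with holidays, longer daylight, and so on.
print(model.coef_[0])
```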
Adding more predictors
Of course, temperature is not the only thing that will help us predict the number of bikes. Adding more predictors to the model is as simple as telling the linear regression model in scikit-learn about them! Before we do, we should look at the data dictionary provided to us to make more […]
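A minimal sketch of a multi-predictor fit, assuming hypothetical columns humidity and windspeed alongside temp (the real data dictionary may name its columns differently):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative rows; the column names are assumptions and the values
# are made up for the sketch.
bikes = pd.DataFrame({
    "temp":      [9.8, 14.0, 20.5, 28.7, 31.1, 17.2],
    "humidity":  [81, 60, 55, 40, 35, 70],
    "windspeed": [0.0, 6.0, 12.0, 8.0, 15.0, 9.0],
    "count":     [16, 94, 180, 260, 240, 120],
})

# "Telling the model about them" is just a longer feature_cols list:
feature_cols = ["temp", "humidity", "windspeed"]
X = bikes[feature_cols]
y = bikes["count"]

model = LinearRegression().fit(X, y)
print(dict(zip(feature_cols, model.coef_)))  # one coefficient per predictor
```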
Regression metrics 3
All of this might sound complicated, but luckily, the scikit-learn package has a built-in method to do this, as shown:

from sklearn.model_selection import train_test_split
# function that splits data into training and testing sets

# setting our overall data X, and y
feature_cols = ['temp']
X = bikes[feature_cols]
y = bikes['count']

# Note that in this example, we are attempting to […]
Regression metrics 2
Even better! At first, this seems like a major triumph, but there is actually a hidden danger here. Note that we are training the line to fit X and y and then asking it to predict X again! This is actually a huge mistake in ML because it can lead to overfitting, which means that […]
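The usual guard against this mistake is to hold out a test set and compare errors on seen versus unseen data; a sketch with synthetic data (all numbers made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic temperature/count data, made up for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(0, 40, size=(300, 1))
y = 9 * X.ravel() + rng.normal(0, 30, size=300)

# Fit on one portion, evaluate on rows the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LinearRegression().fit(X_train, y_train)

rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
# A test RMSE far above the training RMSE is the signature of overfitting;
# this model is simple, so the two stay close.
print(rmse_train, rmse_test)
```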
Regression metrics
There are usually three main metrics when using regression ML models. They are as follows:

• Mean Absolute Error (MAE): This is the average of the absolute errors between the predicted values and the actual values. It's calculated by taking the sum of the absolute values of the errors (the differences between the […]
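These metrics can be computed directly with sklearn.metrics; a small worked example with made-up numbers:

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values for illustration.
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 18, 33, 37])  # errors: 2, -2, 3, -3

mae = metrics.mean_absolute_error(y_true, y_pred)  # mean of |errors|
mse = metrics.mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = mse ** 0.5                                  # back in original units

print(mae, mse, rmse)  # 2.5 6.5 2.549...
```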
Performing naïve Bayes classification – Predictions Don’t Grow on Trees, or Do They?
Let's get right into it! Let's begin with naïve Bayes classification. This ML model relies heavily on results from previous chapters, specifically Bayes' theorem:

P(H | D) = P(D | H) · P(H) / P(D)

Let's look a little closer at the specific features of this formula: Naïve Bayes classification is a classification model, and therefore a supervised model. Given this, […]
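A worked instance of the theorem with made-up spam-filter probabilities:

```python
# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
# Toy spam-filter numbers; every probability here is made up.
p_spam = 0.3             # prior P(spam), our hypothesis H
p_free_given_spam = 0.6  # P("free" appears | spam), our data D given H
p_free_given_ham = 0.05  # P("free" appears | ham)

# The law of total probability gives P(D), the chance of seeing "free":
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior: probability the email is spam, given that it contains "free".
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.837
```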
Classification metrics 4
We will use sklearn's built-in accuracy and confusion matrix to look at how well our naïve Bayes models are performing:

# compare predictions to true labels
from sklearn import metrics
print(metrics.accuracy_score(y_test, preds))
print(metrics.confusion_matrix(y_test, preds))

The output is as follows:

accuracy == 0.988513998564
confusion matrix ==
[[1203    5]
 [  11  174]]

First off, our accuracy is great! Compared to our null […]
Classification metrics 3
Note that each row represents one of the three documents (sentences), each column represents one of the words present in the documents, and each cell contains the number of times each word appears in each document. We can then use the count vectorizer to transform new incoming test documents to conform with our training set […]