Machine learning has played a key role in helping Google block more than 100 million spam emails per day. The technologies that allow Google’s powerful spam filters to keep your Gmail clean also help block more than 99.9% of phishing and malware attacks, protecting us all from costly cybercrime. But how have filters gotten so good at catching threats they’ve never seen before?
Introducing train-test splits
This is where train-test splits come in. A train-test split is a tool used to evaluate a machine learning model’s ability to predict a certain outcome accurately when exposed to real-world data it’s never seen before.
In a train-test split, you split an original dataset into two subsets—a training dataset and a testing dataset. Depending on the nature and complexity of the data, you may want to introduce a third validation dataset (train, validate, test). However, for brevity, we’ll introduce you to the train-test split without the validation process.
Why split data into training and test sets?
In the train-test split, you use the training data subset to train the algorithm and the test data for an unbiased evaluation of how the model would perform in a real-world scenario.
Why not use the same data to test and train the model? Because the model has already seen those examples, its test score would be artificially high and would tell you little about how it handles new data. To see how well the model performs with data it’s never seen, you must evaluate it on data it’s never seen.
In other words, the test data acts like tomorrow’s newest phishing emails.
How to quickly split the datasets
You could set up the train-test split by coding everything from scratch, but there’s a faster way: the free Python library scikit-learn. It provides a set of tools that can shortcut many of the machine learning tasks involved in a train-test split procedure.
For starters, the scikit-learn library helps split the dataset into training and test subsets. To do this, you’ll first need to choose the test and train sizes. There are no standard rules for an optimal split percentage, since this varies with your project’s objectives and the quality of your dataset, but common training/test splits are 70/30, 80/20, and 50/50.
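As a minimal sketch, an 80/20 split might look like the following. The toy arrays X and y are made up for illustration, and random_state is fixed only so the shuffle is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 4 features and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# test_size=0.2 holds out 20% of the rows for testing (an 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```

Changing test_size to 0.3 or 0.5 would give the 70/30 and 50/50 splits mentioned above.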
If a dataset is very large (roughly 500,000 to 1,000,000 samples and upwards), it isn’t unheard of to use even a 98/2 split. Assuming the test set’s distribution is similar to that of the training set, even 2% of such a large dataset is still a big enough test set to give you confidence in the results.
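To make the arithmetic concrete, here is a small sketch that applies a 98/2 split to a stand-in for a million-row dataset. The array of row indices is hypothetical; a real dataset would take its place.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for a large dataset: one million row indices (hypothetical).
n_samples = 1_000_000
rows = np.arange(n_samples)

# A 98/2 split still leaves 20,000 rows for testing.
train_rows, test_rows = train_test_split(rows, test_size=0.02, random_state=0)

print(len(train_rows), len(test_rows))  # 980000 20000
```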
The only time you shouldn’t split for the test-train procedure is when the dataset is simply too small (for example, under 100 samples).
Choosing a machine learning algorithm
Like any other mathematical algorithm, a machine learning algorithm builds on the base knowledge you feed it (data) and translates that into new findings (predictions).
In order to move forward with splitting your dataset, you'll also need to choose a supervised machine learning algorithm. All algorithms have advantages and disadvantages depending on the dataset, so the algorithm you choose will depend on the nature of your sample dataset. You have many options to choose from, including Naive Bayes, basic logistic regression, random forest algorithm, and many more. Keep in mind that if the model's performance isn't satisfactory, you can swap algorithms to see which has the best model performance.
Here are the basics of the major machine learning algorithms:
Logistic Regression is one of the simplest algorithms to implement and interpret, but because it’s so simple, it can be outperformed by more flexible algorithms on complex data.
SVM (Support Vector Machine) works well for problems that are naturally framed as binary classification, but it becomes very slow on larger datasets.
Naive Bayes, on the other hand, is very fast and scalable, but it assumes the features are independent of one another, and it won’t work well if your training data isn’t responsibly sourced and accurate.
Decision Tree is easy to visualize and explain to non-technical teams, but it’s very sensitive to changes in the data, so take care that the data you feed it is correct.
k-NN (K Nearest Neighbors) makes no assumptions about the data you feed it and needs no separate training step, but it doesn’t handle missing values well and slows down on larger datasets.
Pro Tip: Scikit-learn contains all of these algorithms, which can greatly simplify this step for you.
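As a rough illustration of how easy swapping is, the sketch below trains each of the five algorithms above on the same split. The data is synthetic (generated with make_classification), the settings are mostly library defaults, and the scores say nothing about how these algorithms would rank on your own data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, generated only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}

# Every estimator shares the same fit/score interface, so swapping is one line.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.2f}")
```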
Basics of a test-train split
Without going over the intricacies of every possible scenario, the basic steps for a test-train split are as follows:
Define the problem you’re addressing and your target goal using a problem statement
Load your data
Split the dataset into a train subset and a test subset using train_test_split
Prepare data (address missing values, ensure representative accuracy, etc.)
Choose a supervised machine learning algorithm and train it on the train subset
Make predictions and compare them to the actual values
Evaluate model on training dataset
Evaluate model on testing dataset
You’d usually aim for a certain percentage in accuracy to determine success. What’s considered a good percentage varies from problem to problem.
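The steps above can be sketched end to end as follows. This is one possible arrangement, not the only one: the breast cancer dataset bundled with scikit-learn stands in for your own data, feature scaling stands in for data preparation, and logistic regression is an arbitrary choice of algorithm.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Problem statement: classify tumors as malignant or benign. Load the data.
X, y = load_breast_cancer(return_X_y=True)

# Split into train and test subsets (80/20 here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Prepare the data (scaling) and choose an algorithm. Fitting the pipeline
# learns the scaling from the training subset only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Compare predictions with the actual values.
y_pred = model.predict(X_test)
print("predicted:", y_pred[:5], "actual:", y_test[:5])

# Evaluate on the training and testing datasets.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, y_pred)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Wrapping the scaler and the model in one pipeline is a design choice that keeps information from the test subset out of the preparation step.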
When to use the train-test split
You generally want to use the train-test split when your available dataset is of considerable size. Otherwise, there’ll be insufficient data in the training subset and the model won’t be effective, leaving the test of little value for understanding the model’s performance capabilities.
There are no standard conventions for what's considered an adequate size because this is also specific to each unique dataset and problem. However, if your sample size is too small, you might want to consider k-fold cross-validation to evaluate the model's performance more effectively.
There are a variety of other cross-validation techniques, but k-fold cross-validation is one of the more straightforward validation methods. With k-fold cross-validation, the dataset is split into equally sized sets (called folds). K denotes the number of folds, and the number of folds varies from test to test. One of the folds is then designated the validation fold, which acts similarly to the test set, and the rest of the folds are training folds. In other words, there’ll always be k - 1 training folds and 1 validation fold. You’ll repeat the train-then-validate process until you’ve used each fold as the validation set.
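Here is a minimal sketch of k-fold cross-validation with k = 5, using scikit-learn’s KFold and cross_val_score. The 100-sample dataset is synthetic and the choice of logistic regression is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration: 100 samples.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# k = 5: each round trains on 4 folds (80 rows) and validates on 1 (20 rows).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [
    (len(train_idx), len(val_idx)) for train_idx, val_idx in kfold.split(X)
]
print(fold_sizes)

# cross_val_score repeats the train-then-validate process once per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores, scores.mean())
```

Averaging the five scores gives a single estimate that depends less on which rows happened to land in any one validation fold.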
Avoiding overfitting and underfitting
In an ideal world, everything works smoothly, and the model works perfectly on real-world data. Of course, this isn’t always the case. Let’s talk about two common problems that can cause poor performance in a train-test split: overfitting and underfitting.
Overfitting happens when a model performs well on a specific dataset it’s been trained on, but is bad at generalization (a model's ability to adapt to data it's never seen before). On the other hand, underfitting happens when a model performs badly on the training data and is also bad at generalization.
How to fix overfitting
There isn’t a clear-cut way to detect overfitting while you build the model. However, when a model scores highly on the training data and considerably lower on the test data (roughly 95%+ versus under 75%, for example), this is a sign of overfitting.
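One way to see this drop-off is to score the same model on both subsets. The sketch below uses noisy synthetic data and an unrestricted decision tree, a combination chosen here because it tends to memorize the training rows; the exact numbers will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y=0.2 assigns random labels to about 20% of rows.
X, y = make_classification(
    n_samples=400, n_features=20, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unrestricted tree can memorize the training rows, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
gap = train_acc - test_acc
print(f"train: {train_acc:.2f}  test: {test_acc:.2f}  gap: {gap:.2f}")
```

A large gap between the two scores is the warning sign; a small one suggests the model generalizes about as well as it fits.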
Overfitting can happen when the training data isn’t randomized or representative. Say you’re trying to predict the likelihood of a certain disease for patients in a particular state. If your training dataset consists of only women, the model will have a bias and might make erroneous assumptions. The model will likely perform poorly in a real-world scenario where there’s an equal number of male and female patients.
So before using the train-test split, make sure your data is balanced.
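One related safeguard in scikit-learn is the stratify parameter of train_test_split, which keeps class proportions the same in both subsets. It does not fix a dataset that is unrepresentative to begin with; it only stops the split from making things worse. The labels below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 of class 0 and 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()
print(train_counts, test_counts)  # [72, 8] [18, 2]
```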
Another way to guard against overfitting is cross-validation, which gives a more reliable picture of how well the model generalizes. As mentioned before, there are various cross-validation methods. A factor to consider when choosing one validation method over another is the size of your dataset. For example, with very small sample sizes, you’d likely want to use Leave-One-Out Cross-Validation (LOOCV). LOOCV is a version of k-fold cross-validation, which we talked about earlier, in which k equals the number of samples.
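As a small sketch of LOOCV, scikit-learn’s LeaveOneOut can be passed to cross_val_score in place of a fold count. The 30-sample dataset is synthetic and k-NN is an arbitrary choice of model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# A deliberately small synthetic dataset: 30 samples.
X, y = make_classification(n_samples=30, n_features=5, random_state=0)

# Each round trains on 29 samples and validates on the 1 left out.
loo = LeaveOneOut()
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=loo)

# One score per sample: 1.0 if the held-out sample was predicted correctly.
print(len(scores), scores.mean())
```

Because the model is refit once per sample, this approach gets expensive quickly, which is part of why it suits very small datasets.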
How to fix underfitting
While underfitting is considerably less common than overfitting, it’s worthwhile to note in case you run into the problem while conducting your first train-test split.
Underfitting usually happens when the model fails to capture the relationship between the data and the outcome, i.e. it’s too simple to establish a stable learning pattern, so it performs poorly even with training data.
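The easiest first step to overcome underfitting? Add more data, ideally with more informative features. This gives the model more patterns to learn from, and will likely make your tests more comprehensive anyway. If the model is still too simple for the problem, switching to a more flexible algorithm can also help.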
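To show what underfitting from oversimplification can look like, the sketch below fits a straight line to synthetic data that follows a curve, then fits a slightly more flexible model to the same data. This is a regression example with made-up data, and the squared-term fix only works because the data was generated that way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data: y depends on x squared, plus a little noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# A straight line is too simple for this pattern, so it underfits:
# the score is poor even on the data it was trained on.
linear = LinearRegression().fit(X, y)
linear_r2 = linear.score(X, y)

# Adding a squared term gives the model enough flexibility.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quadratic_r2 = quadratic.fit(X, y).score(X, y)

print(f"linear R^2: {linear_r2:.2f}, quadratic R^2: {quadratic_r2:.2f}")
```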
In simple terms, the train-test split is a way to evaluate how a machine learning model will perform with real-world data it has never seen. You don’t need to know how to code the entire procedure from scratch because the scikit-learn library simplifies many of the tasks associated with this procedure.
The train-test split can be used for a multitude of classification and regression problems ranging from serious problems like preventing cybercrime to simply tracking products that are missing from your fridge.