Machine Learning
Classification Using Machine Learning
Once our data has undergone some cleaning, feature engineering, and feature selection, we can start applying machine learning algorithms. As we saw in the previous chapter, the three main types of machine learning are supervised, unsupervised, and reinforcement learning. Classification falls under supervised learning, since our data contains targets, or labels. For instance, we'll start by examining a dataset of credit card payment defaults. Each data point in this dataset has a label that indicates whether or not a credit card payment was missed.
In this chapter, we will use the sklearn and statsmodels packages to learn the fundamentals of machine learning classification. The following subjects will be covered in this chapter:
Binary and multi-class machine learning classification algorithms
Feature selection using machine learning classification methods
Let's start by going over a few fundamental machine learning classification techniques.
Algorithms for machine learning classification
Numerous machine learning algorithms exist, and new ones are always being developed. During a training phase, machine learning algorithms use input data to learn (this is also called fitting or training). Then, during a process known as "inference," we make predictions using the statistical patterns learned from the data. Here, we'll go over a few fundamental and straightforward classification algorithms:
Logistic regression
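The code below loads the credit card default data and generates an EDA report:
import pandas as pd
from pandas_profiling import ProfileReport

# Skip the first row of the spreadsheet so that the next row is used as the header,
# and use the first column as the index.
df = pd.read_excel('data/default of credit card clients.xls',
                   skiprows=1,
                   index_col=0)
# interactions=None leaves out the slow interactions section of the report.
report = ProfileReport(df, interactions=None)
report.to_file('cc_defaults.html')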
As in the last chapter, if your machine is not very powerful or you want to speed up the code, you can sample down the data (df = df.sample(10000, random_state=42)). We set interactions=None in ProfileReport because that portion of the EDA report takes a long time to run.
Classification algorithms are trained on labeled data. In other words, we have our features (inputs) and a target or label (output). The target should be a class, either binary (1 or 0) or multiclass (0 through the number of classes minus one). The numbers 0 and 1 in the target (plus additional numbers for multiclass classification) represent our various classes. Examples of binary classification include whether a payment defaults, whether a loan is approved, whether an individual clicks on an online advertisement, and whether a person has an illness.
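To make the idea of labeled data concrete, here is a minimal sketch with a made-up table. The column names (bill_amount, months_late, defaulted) and the values are hypothetical and are not taken from the credit card dataset; they only illustrate how features and a binary or multiclass target are laid out.

```python
import pandas as pd

# Hypothetical toy data: two feature columns and one binary target column.
toy = pd.DataFrame({
    "bill_amount": [1200, 300, 4500, 800],
    "months_late": [0, 2, 3, 0],
    "defaulted": [0, 1, 1, 0],  # binary target: 1 = default, 0 = no default
})

features = toy.drop("defaulted", axis=1)  # inputs
target = toy["defaulted"]                 # labels (outputs)

# The distinct classes present in the binary target.
binary_classes = sorted(target.unique())

# A multiclass target with three classes is encoded as 0, 1, and 2.
multiclass_target = pd.Series([0, 2, 1, 2])
n_classes = multiclass_target.nunique()
```

With three classes, the largest label is 2, that is, the number of classes minus one.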
Binary classification using logistic regression
Logistic regression has been in use since 1958. However, don't let its age deceive you; simpler algorithms can sometimes perform better than more sophisticated ones, such as neural networks. Although it can also be used for multi-class classification, logistic regression is mainly used for binary classification. It belongs to a class of models known as generalized linear models (GLMs), which also includes linear regression; we will discuss linear regression in Chapter 13, Machine Learning with Regression. In some Python libraries, such as statsmodels, logistic regression can be implemented with a GLM class.
We will learn logistic regression by using it with a credit card defaults dataset (from https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset, also available in the GitHub repository for the book). The Kaggle data page for the dataset describes the different columns. There are features for 6 months of data, from April to September 2005. The PAY columns (like PAY_0) contain data on whether the payment was late for that month. The PAY_0 column is for the payment in September 2005, the latest month in the dataset. Other columns, such as BILL_AMT and PAY_AMT, contain the amount of the bill and the payment for each of the 6 months. The remaining columns should be self-explanatory from their titles. We start by loading the data and running pandas profiling for EDA, using the ProfileReport code shown earlier in this section.
Citation Youtube Programming with Mosh
The "default payment next month" column is the target variable we will predict; all other columns are features. Looking at the correlations (for example, with df.corr().loc['default payment next month']), we can see that some features are related to the target: a few features have a Pearson correlation between 0.1 and 0.3, and the PAY features have a phik correlation of about 0.5 with the target. We won't do any data cleaning, feature engineering, or feature selection just yet. df.info() shows us that no values are missing.
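The correlation check mentioned above can be sketched on a small synthetic table. The column names (months_late, noise, default) and the way the data is generated are assumptions made for this example; on the real data, the same pattern applies with 'default payment next month' as the target column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

late = rng.integers(0, 4, size=n)  # hypothetical "months late" feature (0 to 3)
noise = rng.normal(size=n)         # feature unrelated to the target

# The target is more likely to be 1 when months_late is larger.
prob = 0.1 + 0.2 * late
target = (rng.random(n) < prob).astype(int)

toy = pd.DataFrame({"months_late": late, "noise": noise, "default": target})

# Pearson correlation of every feature with the target column.
corr_with_target = toy.corr().loc["default"].drop("default")

# Rank features by the strength of the correlation, ignoring its sign.
ranked = corr_with_target.abs().sort_values(ascending=False)
```

Here ranked lists months_late first, since it was constructed to drive the target, while noise stays close to zero.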
Next, we separate the features from the target:
train_features = df.drop('default payment next month', axis=1)
train_targets = df['default payment next month']
The first line drops the target column and keeps all other columns as features, while the second line keeps only the target column in the train_targets variable. First, let's apply the sklearn logistic regression implementation:
from sklearn.linear_model import LogisticRegression
lr_sklearn = LogisticRegression(random_state=42)
lr_sklearn.fit(train_features, train_targets)
Every sklearn model works in a similar way. The model class (LogisticRegression in this case) is first imported, and then it is instantiated (see the second line of code). We can pass arguments to the class when we create the model object. For many models, the random_state argument sets the random seed for random processes. If the algorithm contains any random processes, this makes our results repeatable. There are other arguments we could set, but we haven't done so yet. After our model object has been initialized, we can use the fit() method to train it on our data. This is where "machine learning" comes into play: the model (machine) is learning from the data we provide it.
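The import, instantiate, and fit pattern described above can be tried end to end on synthetic data. This is a minimal sketch: make_classification generates an artificial dataset that stands in for the credit card data, and the accuracy is measured on the same data the model was trained on, so it is an optimistic estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (a stand-in, not the credit card dataset).
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Step 1: instantiate the model, fixing the seed for any random processes.
model = LogisticRegression(random_state=42)

# Step 2: train (fit) the model on features and targets.
model.fit(X, y)

# Step 3: inference on a few rows.
predictions = model.predict(X[:5])          # predicted class labels (0 or 1)
probabilities = model.predict_proba(X[:5])  # one probability column per class

# Accuracy on the training data itself.
train_accuracy = model.score(X, y)

# Fitting a second model with the same settings on the same data
# gives the same coefficients.
model_again = LogisticRegression(random_state=42).fit(X, y)
same_coefs = (model.coef_ == model_again.coef_).all()
```

Other sklearn classifiers expose the same fit(), predict(), and score() methods, so swapping in a different model usually means changing only the import and the instantiation line.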
Citation
BIO About the Author: Joseph P Fanning
Joe audited classes at Princeton and studied CS at Harvard. He owns https://www.Joepfanning.com and blogs a lot about computational physics. He's currently working on completing his M.S. in computational physics and computer science.
Phone - 201 334 8743
Email - Joe's App email
Suffolk County LI New York 11772 Bergen County NJ Programmer