How to Build and Train Linear and Logistic Regression ML Models in Python

Linear regression and logistic regression are two of the most popular machine learning models today.

In the last article, you learned about the history and theory behind a linear regression machine learning algorithm.

This tutorial will teach you how to create, train, and test your first linear regression machine learning model in Python using the scikit-learn library.

Section 1: Linear Regression

The Data Set We Will Be Using in This Tutorial

Since we are just starting to learn about linear regression in machine learning, we will work with artificially-created data sets in this tutorial. This allows you to focus on learning the machine learning concepts and avoid spending unnecessary time on cleaning or manipulating data.

More specifically, we will be working with a data set of housing data and attempting to predict housing prices. Before we build the model, we first need to import the required libraries.

The Libraries We Will Be Using in This Tutorial

The first library we need to import is pandas, which is a portmanteau of "panel data" and is the most popular Python library for working with tabular data.

It is convention to import pandas under the alias pd. You can import pandas with the following statement:

 import pandas as pd 

Next, we need to import NumPy, which is a popular library for numerical computing. NumPy is known for its NumPy array data structure as well as its useful methods for reshaping, modifying, and appending arrays.

It is convention to import NumPy under the alias np. You can import NumPy with the following statement:

 import numpy as np 

Next, we need to import matplotlib, which is Python's most popular data visualization library.

matplotlib is typically imported under the alias plt. You can import matplotlib with the following statement:

 import matplotlib.pyplot as plt
 %matplotlib inline

The %matplotlib inline statement will cause our matplotlib visualizations to embed directly in our Jupyter Notebook, which makes them easier to access and interpret.

Finally, you will want to import seaborn, which is another Python data visualization library that makes it easier to create beautiful visualizations using matplotlib.

You can import seaborn with the following statement:

 import seaborn as sns 

To summarize, here are all of the imports required in this tutorial:

 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 %matplotlib inline
 import seaborn as sns

In future articles I will specify which imports are necessary, but I will not explain each import in detail like I did here.

Importing the Data Set

As mentioned, we will be using a data set of housing information.

The data set has been uploaded to my website as a .csv file at the following URL:

//nickmccullum.com/files/Housing_Data.csv 

To import the data set into your Jupyter Notebook, the first thing you should do is download the file by copying and pasting this URL into your browser. Then, move the file into the same directory as your Jupyter Notebook.

Once this is done, the following Python statement will import the housing data set into your Jupyter Notebook:

 raw_data = pd.read_csv('Housing_Data.csv') 

This data set has a number of features, including:

  • The average income in the area of the house
  • The average number of total rooms in the area
  • The price that the house sold for
  • The address of the house

This data is randomly generated, so you will see a few nuances that might not normally make sense (such as a large number of decimal places after a number that should be an integer).

Understanding the Data Set

Now that the data set has been imported under the raw_data variable, you can use the info method to get some high-level information about it. Specifically, running raw_data.info() gives:

 RangeIndex: 5000 entries, 0 to 4999
 Data columns (total 7 columns):
 Avg. Area Income                5000 non-null float64
 Avg. Area House Age             5000 non-null float64
 Avg. Area Number of Rooms       5000 non-null float64
 Avg. Area Number of Bedrooms    5000 non-null float64
 Area Population                 5000 non-null float64
 Price                           5000 non-null float64
 Address                         5000 non-null object
 dtypes: float64(6), object(1)
 memory usage: 273.6+ KB
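
If you would also like a quick preview of the raw values themselves, the pandas head method prints the first few rows. This is a small optional sketch and is not required for the rest of the tutorial:

 # Preview the first five rows of the housing data set
 raw_data.head()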

Another useful way to learn about this data set is by generating a pairplot. You can use the seaborn method pairplot for this and pass in the entire DataFrame as a parameter. Here is the full statement:

 sns.pairplot(raw_data) 

The output of this statement is below:

A seaborn pairplot

Next, let's begin building our linear regression model.

Building a Machine Learning Linear Regression Model

The first thing we need to do is divide our data into an x-array (which contains the data we will use to make predictions) and a y-array (which contains the data we are trying to predict).

First, we should decide which columns to include. You can generate a list of the DataFrame's columns using raw_data.columns, which outputs:

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'], dtype="object") 

We will be using all of these variables in the x-array except for Price (since that is the variable we are trying to predict) and Address (since it only contains text).

Let's create our x-array and assign it to a variable called x.

 x = raw_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']] 

Next, let's create our y-array and assign it to a variable called y.

 y = raw_data['Price'] 

We have successfully divided our data set into an x-array (which holds the input values of our model) and a y-array (which holds the output values of our model). We will learn how to further split our data set into training data and test data in the next section.

Splitting our Data Set into Training Data and Test Data

scikit-learn makes it very easy to divide our data set into training data and test data. To do this, we need to import the train_test_split function from the model_selection module of scikit-learn.

Here is the full code to do this:

 from sklearn.model_selection import train_test_split 

The train_test_split function accepts three arguments:

  • Our x-array
  • Our y-array
  • The desired size of our test data

With these parameters, the train_test_split function will split our data for us! Here is the code to do this if we want our test data to be 30% of the entire data set:

 x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3) 

Let's unpack what is happening here.

The train_test_split function returns a Python list of length 4, where the items in the list are x_train, x_test, y_train, and y_test, respectively. We then use list unpacking to assign the proper values to the correct variable names.
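
If you want to verify that the unpacking worked as expected, a quick way is to print the shape of each array. This is an optional sketch; the exact numbers assume the 5,000-row housing data set and the 30% test split used above:

 # Confirm the sizes of the training and test arrays
 print(x_train.shape, x_test.shape)   # expect roughly (3500, 5) (1500, 5)
 print(y_train.shape, y_test.shape)   # expect roughly (3500,) (1500,)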

Now that we have properly divided our data set, it is time to build and train our linear regression machine learning model.

Building and Training the Model

The first thing we need to do is import the LinearRegression estimator from scikit-learn. Here is the Python statement for this:

 from sklearn.linear_model import LinearRegression 

Next, we need to create an instance of the LinearRegression Python object. We will assign this to a variable called model. Here is the code for this:

 model = LinearRegression() 

We can use scikit-learn's fit method to train this model on our training data.

 model.fit(x_train, y_train) 

Our model has now been trained. You can examine each of the model’s coefficients using the following statement:

 print(model.coef_) 

This prints:

[2.16176350e+01 1.65221120e+05 1.21405377e+05 1.31871878e+03 1.52251955e+01] 

Similarly, here is how you can see the intercept of the regression equation:

 print(model.intercept_) 

This prints:

-2641372.6673013503 

A nicer way to view the coefficients is by placing them in a DataFrame. This can be done with the following statement:

 pd.DataFrame(model.coef_, x.columns, columns = ['Coeff']) 

The output in this case is much easier to interpret:

A coefficient DataFrame in a Jupyter Notebook

Let’s take a moment to understand what these coefficients mean. Let’s look at the Area Population variable specifically, which has a coefficient of approximately 15.

What this means is that if you hold all other variables constant, then a one-unit increase in Area Population will result in a 15-unit increase in the predicted variable - in this case, Price.

Said differently, a large coefficient on a specific variable means that that variable has a large impact on the value of the variable you’re trying to predict. Similarly, a small coefficient means a small impact.
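
To make the interpretation concrete, here is a small optional sketch: it copies a single row of the test data, increases Area Population by one, and shows that the predicted Price rises by roughly the coefficient value (about 15). The variable names here are my own:

 row = x_test.iloc[[0]].copy()          # one observation, kept as a DataFrame
 bumped = row.copy()
 bumped['Area Population'] += 1         # increase Area Population by one unit
 print(model.predict(bumped) - model.predict(row))   # approximately [15.2]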

Now that we’ve generated our first machine learning linear regression model, it’s time to use the model to make predictions from our test data set.

Making Predictions From Our Model

scikit-learn makes it very easy to make predictions from a machine learning model. You simply need to call the predict method on the model variable that we created earlier.

Since the predict method is designed to make predictions, it only accepts an x-array parameter. It will generate the y values for you!

Here is the code you’ll need to generate predictions from our model using the predict method:

 predictions = model.predict(x_test) 

The predictions variable holds the predicted values for the observations stored in x_test. Since we used the train_test_split method to store the real values in y_test, what we want to do next is compare the values of the predictions array with the values of y_test.

An easy way to do this is to plot the two arrays using a scatterplot. It’s easy to build matplotlib scatterplots using the plt.scatter method. Here’s the code for this:

 plt.scatter(y_test, predictions) 

Here’s the scatterplot that this code generates:

A scatterplot of predicted values against actual values from a machine learning linear regression model

As you can see, our predicted values are very close to the actual values for the observations in the data set. A perfectly straight diagonal line in this scatterplot would indicate that our model perfectly predicted the y-array values.
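
If you would like to see how close the points are to that diagonal, you can overlay a reference line on the same scatterplot. This is an optional sketch using matplotlib directly:

 plt.scatter(y_test, predictions)
 # Reference line y = x: points on this line would be perfect predictions
 line = np.linspace(y_test.min(), y_test.max(), 100)
 plt.plot(line, line)
 plt.xlabel('Actual Price')
 plt.ylabel('Predicted Price')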

Another way to visually assess the performance of our model is to plot its residuals, which are the difference between the actual y-array values and the predicted y-array values.

An easy way to do this is with the following statement:

 plt.hist(y_test - predictions) 

Here is the visualization that this code generates:

A histogram of residuals from our linear regression machine learning model

This is a histogram of the residuals from our machine learning model.

You may notice that the residuals from our machine learning model appear to be normally distributed. This is a very good sign!

It indicates that we have selected an appropriate model type (in this case, linear regression) to make predictions from our data set. We will learn more about how to make sure you’re using the right model later in this course.

Testing the Performance of our Model

We learned near the beginning of this course that there are three main performance metrics used for regression machine learning models:

  • Mean absolute error
  • Mean squared error
  • Root mean squared error

We will now see how to calculate each of these metrics for the model we’ve built in this tutorial. Before proceeding, run the following import statement within your Jupyter Notebook:

 from sklearn import metrics 

Mean Absolute Error (MAE)

You can calculate mean absolute error in Python with the following statement:

 metrics.mean_absolute_error(y_test, predictions) 

Mean Squared Error (MSE)

Similarly, you can calculate mean squared error in Python with the following statement:

 metrics.mean_squared_error(y_test, predictions) 

Root Mean Squared Error (RMSE)

Unlike mean absolute error and mean squared error, scikit-learn does not actually have a built-in method for calculating root mean squared error.

Fortunately, it really doesn’t need to. Since root mean squared error is just the square root of mean squared error, you can use NumPy’s sqrt method to easily calculate it:

 np.sqrt(metrics.mean_squared_error(y_test, predictions)) 
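
If you would like to see all three metrics side by side, you can print them together. This is just a convenience sketch; the exact values depend on your random train/test split:

 # Print all three regression performance metrics at once
 print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
 print('MSE: ', metrics.mean_squared_error(y_test, predictions))
 print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))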

The Complete Code For This Tutorial

Here is the entire code for this Python linear regression machine learning tutorial. You can also view it in this GitHub repository.

 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 import seaborn as sns
 %matplotlib inline

 raw_data = pd.read_csv('Housing_Data.csv')

 x = raw_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
 y = raw_data['Price']

 from sklearn.model_selection import train_test_split
 x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

 from sklearn.linear_model import LinearRegression
 model = LinearRegression()
 model.fit(x_train, y_train)

 print(model.coef_)
 print(model.intercept_)
 pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])

 predictions = model.predict(x_test)
 # plt.scatter(y_test, predictions)
 plt.hist(y_test - predictions)

 from sklearn import metrics
 metrics.mean_absolute_error(y_test, predictions)
 metrics.mean_squared_error(y_test, predictions)
 np.sqrt(metrics.mean_squared_error(y_test, predictions))

Section 2: Logistic Regression

Note - if you have been coding along with this tutorial so far and built your linear regression model already, you'll want to open a new Jupyter Notebook (with no code in it) before proceeding.

The Data Set We Will Be Using in This Tutorial

The Titanic data set is a very famous data set that contains characteristics about the passengers on the Titanic. It is often used as an introductory data set for logistic regression problems.

In this tutorial, we will be using the Titanic data set combined with a Python logistic regression model to predict whether or not a passenger survived the Titanic crash.

The original Titanic data set is publicly available on Kaggle.com, which is a website that hosts data sets and data science competitions.

To make things easier for you as a student in this course, we will be using a semi-cleaned version of the Titanic data set, which will save you time on data cleaning and manipulation.

The cleaned Titanic data set has actually already been made available for you. You can download the data file by clicking the links below:

  • Titanic data

Once this file has been downloaded, open a Jupyter Notebook in the same working directory and we can begin building our logistic regression model.

The Imports We Will Be Using in This Tutorial

As before, we will be using multiple open-source software libraries in this tutorial. Here are the imports you will need to run to follow along as I code through our Python logistic regression model:

 import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns 

Next, we will need to import the Titanic data set into our Python script.

Importing the Data Set into our Python Script

We will be using pandas’ read_csv method to import our csv file into a pandas DataFrame called titanic_data.

Here is the code to do this:

 titanic_data = pd.read_csv('titanic_train.csv') 

Next, let’s investigate what data is actually included in the Titanic data set. There are two main methods to do this (using the titanic_data DataFrame specifically):

  • The titanic_data.head(5) method will print the first 5 rows of the DataFrame. You can substitute 5 with whichever number you’d like.
  • You can also print titanic_data.columns, which will show you the column names.

Running the second command (titanic_data.columns) generates the following output:

 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

These are the names of the columns in the DataFrame. Here are brief explanations of each data point:

  • PassengerId: a numerical identifier for every passenger on the Titanic.
  • Survived: a binary identifier that indicates whether or not the passenger survived the Titanic crash. This variable will hold a value of 1 if they survived and 0 if they did not.
  • Pclass: the passenger class of the passenger in question. This can hold a value of 1, 2, or 3, depending on where the passenger was located in the ship.
  • Name: the passenger’s name.
  • Sex: male or female.
  • Age: the age (in years) of the passenger.
  • SibSp: the number of siblings and spouses aboard the ship.
  • Parch: the number of parents and children aboard the ship.
  • Ticket: the passenger’s ticket number.
  • Fare: how much the passenger paid for their ticket on the Titanic.
  • Cabin: the passenger’s cabin number.
  • Embarked: the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

Next up, we will learn more about our data set by using some basic exploratory data analysis techniques.

Learning About Our Data Set With Exploratory Data Analysis

The Prevalence of Each Classification Category

When using machine learning techniques to model classification problems, it is always a good idea to have a sense of the ratio between categories. For this specific problem, it’s useful to see how many survivors vs. non-survivors exist in our training data.

An easy way to visualize this is using the seaborn countplot. In this example, you could create the appropriate seaborn plot with the following Python code:

 sns.countplot(x='Survived', data=titanic_data) 

This generates the following plot:

A seaborn countplot

As you can see, we have many more incidences of non-survivors than we do of survivors.
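
If you prefer exact counts over a plot, pandas can tally the two categories directly. This is an optional sketch:

 # Count the number of non-survivors (0) and survivors (1)
 titanic_data['Survived'].value_counts()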

Survival Rates Between Genders

It is also useful to compare survival rates relative to some other data feature. For example, we can compare survival rates between the Male and Female values for Sex using the following Python code:

 sns.countplot(x='Survived', hue="Sex", data=titanic_data) 

This generates the following plot:

A seaborn countplot with a Sex hue

As you can see, passengers with a Sex of Male were much more likely to be non-survivors than passengers with a Sex of Female.

Survival Rates Between Passenger Classes

We can perform a similar analysis using the Pclass variable to see which passenger class was the most (and least) likely to have passengers that were survivors.

Here is the code to do this:

 sns.countplot(x='Survived', hue="Pclass", data=titanic_data) 

This generates the following plot:

A seaborn countplot with a Pclass hue

The most noticeable observation from this plot is that passengers with a Pclass value of 3 - which indicates the third class, the cheapest and least luxurious - were much more likely to die when the Titanic crashed.

The Age Distribution of Titanic Passengers

One other useful analysis we could perform is investigating the age distribution of Titanic passengers. A histogram is an excellent tool for this.

You can generate a histogram of the Age variable with the following code:

 plt.hist(titanic_data['Age'].dropna()) 

Note that the dropna() method is necessary since the data set contains several null values.

Here is the histogram that this code generates:

A histogram of the Age variable from the Titanic data set

As you can see, there is a concentration of Titanic passengers with an Age value between 20 and 40.
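
To put numbers behind the histogram, you can also print summary statistics for the Age column. This is an optional sketch; note that describe skips the missing values we deal with later:

 # Summary statistics (count, mean, quartiles, etc.) for the Age column
 titanic_data['Age'].describe()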

The Ticket Price Distribution of Titanic Passengers

The last exploratory data analysis technique that we will use is investigating the distribution of fare prices within the Titanic data set.

You can do this with the following code:

 plt.hist(titanic_data['Fare']) 

This generates the following plot:

A histogram of fare variables from the titanic data set

As you can see, there are three distinct groups of Fare prices within the Titanic data set. This makes sense because there are also three unique values for the Pclass variable. The different Fare groups correspond to the different Pclass categories.

Since the Titanic data set is a real-world data set, it contains some missing data. We will learn how to deal with missing data in the next section.

Removing Null Data From Our Data Set

To start, let’s examine where our data set contains missing data. To do this, run the following command:

 titanic_data.isnull() 

This will generate a DataFrame of boolean values where the cell contains True if it is a null value and False otherwise. Here is an image of what this looks like:

A DataFrame of boolean values indicating where null data exists

A far more useful way to assess missing data in this data set is by creating a quick visualization. To do this, we can use the seaborn visualization library. Here is a quick command that you can use to create a heatmap using the seaborn library:

 sns.heatmap(titanic_data.isnull(), cbar=False) 

Here is the visualization that this generates:

A seaborn heatmap showing where null data exists in the Titanic data set

In this visualization, the white lines indicate missing values in the dataset. You can see that the Age and Cabin columns contain the majority of the missing data in the Titanic data set.

The Age column in particular contains a small enough amount of missing data that we can fill it in using some form of mathematics. On the other hand, the Cabin column is missing enough data that we should probably remove it from our model entirely.

The process of filling in missing data with average data from the rest of the data set is called imputation. We will now use imputation to fill in the missing data from the Age column.

The most basic form of imputation would be to fill in the missing Age data with the average Age value across the entire data set. However, there are better methods.

We will fill in the missing Age values with the average Age value for the specific Pclass passenger class that the passenger belongs to. To understand why this is useful, consider the following boxplot:

 sns.boxplot(x='Pclass', y='Age', data=titanic_data)
A boxplot of age values stratified by passenger classes

As you can see, the passengers with a Pclass value of 1 (the most expensive passenger class) tend to be the oldest while the passengers with a Pclass value of 3 (the cheapest) tend to be the youngest. This is very logical, so we will use the average Age value within each Pclass group to impute the missing data in our Age column.

The easiest way to perform imputation on a data set like the Titanic data set is by building a custom function. To start, we will need to determine the mean Age value for each Pclass value.

 #Pclass value 1
 titanic_data[titanic_data['Pclass'] == 1]['Age'].mean()

 #Pclass value 2
 titanic_data[titanic_data['Pclass'] == 2]['Age'].mean()

 #Pclass value 3
 titanic_data[titanic_data['Pclass'] == 3]['Age'].mean()

Here is the final function that we will use to impute our missing Age variables:

 def impute_missing_age(columns):
     age = columns[0]
     passenger_class = columns[1]

     if pd.isnull(age):
         if(passenger_class == 1):
             return titanic_data[titanic_data['Pclass'] == 1]['Age'].mean()
         elif(passenger_class == 2):
             return titanic_data[titanic_data['Pclass'] == 2]['Age'].mean()
         elif(passenger_class == 3):
             return titanic_data[titanic_data['Pclass'] == 3]['Age'].mean()
     else:
         return age

Now that this imputation function is complete, we need to apply it to every row in the titanic_data DataFrame. The pandas apply method is an excellent tool for this:

 titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(impute_missing_age, axis = 1) 

Now that we have performed imputation on every row to deal with our missing Age data, let’s re-generate our missing-data heatmap:

 sns.heatmap(titanic_data.isnull(), cbar=False) 

You will notice there is no longer any missing data in the Age column of our pandas DataFrame!
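
If you want a numerical confirmation rather than a heatmap, the following optional check counts the remaining null values in the Age column (it should print 0):

 # Count remaining nulls in the Age column after imputation
 titanic_data['Age'].isnull().sum()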

You might be wondering why we spent so much time dealing with missing data in the Age column specifically. It is because, given the impact of age on survival in most disasters and diseases, Age is a variable that is likely to have high predictive value within our data set.

Now that we have an understanding of the structure of this data set and have removed its missing data, let’s begin building our logistic regression machine learning model.

Building a Logistic Regression Model

It is now time to build our logistic regression model.

Removing Columns With Too Much Missing Data

First, let’s remove the Cabin column. As we mentioned, the high prevalence of missing data in this column means that it is unwise to impute the missing data, so we will remove it entirely with the following code:

 titanic_data.drop('Cabin', axis=1, inplace = True) 

Next, let’s remove any additional columns that contain missing data with the pandas dropna() method:

 titanic_data.dropna(inplace = True) 

Handling Categorical Data With Dummy Variables

The next task we need to handle is dealing with categorical features. Namely, we need to find a way to numerically work with observations that are not naturally numerical.

A great example of this is the Sex column, which has two values: Male and Female. Similarly, the Embarked column contains a single letter which indicates which city the passenger departed from.

To solve this problem, we will create dummy variables. These assign a numerical value to each category of a non-numerical feature.

Fortunately, pandas has a built-in method called get_dummies() that makes it easy to create dummy variables. The get_dummies method does have one issue - it will create a new column for each value in the DataFrame column.

Let’s consider an example to help understand this better. If we call the get_dummies() method on the Sex column, we get the following output:

 pd.get_dummies(titanic_data['Sex']) 
An example of the pandas get_dummies method

As you can see, this creates two new columns: female and male. These columns will both be perfect predictors of each other, since a value of 0 in the female column indicates a value of 1 in the male column, and vice versa.

This is called multicollinearity and it significantly reduces the predictive power of your algorithm. To remove this, we can add the argument drop_first = True to the get_dummies method like this:

 pd.get_dummies(titanic_data['Sex'], drop_first = True) 

Now, let’s create dummy variable columns for our Sex and Embarked columns, and assign them to variables called sex_data and embarked_data.

 sex_data = pd.get_dummies(titanic_data['Sex'], drop_first = True)
 embarked_data = pd.get_dummies(titanic_data['Embarked'], drop_first = True)

There is one important thing to note about the embarked_data variable defined above. It has two columns, Q and S, but since drop_first = True already removed one column (the C column), neither of the remaining columns is a perfect predictor of the other, so multicollinearity does not exist in the new, modified data set.
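
You can confirm this by previewing the first few rows of the dummy DataFrame. This is an optional check:

 # embarked_data should contain only the Q and S columns after drop_first
 embarked_data.head()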

Adding Dummy Variables to the pandas DataFrame

Next, we need to add our sex_data and embarked_data columns to the DataFrame.

You can concatenate these data columns into the existing pandas DataFrame with the following code:

 titanic_data = pd.concat([titanic_data, sex_data, embarked_data], axis = 1) 

Now if you run the command print(titanic_data.columns), your Jupyter Notebook will generate the following output:

 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked', 'male', 'Q', 'S'], dtype="object") 

The existence of the male, Q, and S columns shows that our data was concatenated successfully.

Removing Unnecessary Columns From The Data Set

This means that we can now drop the original Sex and Embarked columns from the DataFrame. There are also other columns (like Name, PassengerId, and Ticket) that are not predictive of Titanic crash survival rates, so we will remove those as well. The following code handles this for us:

 titanic_data.drop(['Name', 'PassengerId', 'Ticket', 'Sex', 'Embarked'], axis = 1, inplace = True)

If you print titanic_data.columns now, your Jupyter Notebook will generate the following output:

 Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'male', 'Q', 'S'], dtype='object')

The DataFrame now has the following appearance:

The final DataFrame for our logistic regression model

As you can see, every field in this data set is now numeric, which makes it an excellent candidate for a logistic regression machine learning algorithm.
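
If you want to verify that no text columns remain, an optional check is to print the data types of the DataFrame:

 # No object (text) columns should remain at this point
 titanic_data.dtypes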

Creating Training Data and Test Data

Next, it’s time to split our titanic_data into training data and test data. As before, we will use built-in functionality from scikit-learn to do this.

First, we need to divide our data into x values (the data we will be using to make predictions) and y values (the data we are attempting to predict). The following code handles this:

 y_data = titanic_data['Survived'] x_data = titanic_data.drop('Survived', axis = 1) 

Next, we need to import the train_test_split function from scikit-learn. The following code executes this import:

 from sklearn.model_selection import train_test_split 

Lastly, we can use the train_test_split function combined with list unpacking to generate our training data and test data:

 x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x_data, y_data, test_size = 0.3) 

Note that in this case, the test data is 30% of the original data set as specified with the parameter test_size = 0.3.

We have now created our training data and test data for our logistic regression model. We will train our model in the next section of this tutorial.

Training the Logistic Regression Model

To train our model, we will first need to import the appropriate model from scikit-learn with the following command:

 from sklearn.linear_model import LogisticRegression 

Next, we need to create our model by instantiating an instance of the LogisticRegression object:

 model = LogisticRegression() 

To train the model, we need to call the fit method on the LogisticRegression object we just created and pass in our x_training_data and y_training_data variables, like this:

 model.fit(x_training_data, y_training_data) 

Our model has now been trained. We will begin making predictions using this model in the next section of this tutorial.

Making Predictions With Our Logistic Regression Model

Let’s make a set of predictions on our test data using the logistic regression model we just created. We will store these predictions in a variable called predictions:

 predictions = model.predict(x_test_data) 

Our predictions have been made. Let’s examine the accuracy of our model next.
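
Before looking at the full report in the next section, a quick optional check of raw accuracy is scikit-learn's score method, which returns the fraction of correct predictions on the test data:

 # Mean accuracy of the classifier on the test set
 model.score(x_test_data, y_test_data)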

Measuring the Performance of a Logistic Regression Machine Learning Model

scikit-learn has an excellent built-in function called classification_report that makes it easy to measure the performance of a classification machine learning model. We will use this function to measure the performance of the model that we just created.

First, let’s import the function:

 from sklearn.metrics import classification_report 

Next, let’s use it to calculate the performance metrics for our logistic regression machine learning model:

 print(classification_report(y_test_data, predictions))

Here is the output of this command:

               precision    recall  f1-score   support

            0       0.83      0.87      0.85       169
            1       0.75      0.68      0.72        98

     accuracy                           0.80       267
    macro avg       0.79      0.78      0.78       267
 weighted avg       0.80      0.80      0.80       267

If you’re interested in seeing the raw confusion matrix and calculating the performance metrics manually, you can do this with the following code:

 from sklearn.metrics import confusion_matrix
 print(confusion_matrix(y_test_data, predictions))

This generates the following output:

 [[145  22]
  [ 30  70]]
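
To illustrate the manual calculation mentioned above, here is one way to derive accuracy, precision, and recall for the "survived" class from the confusion matrix. This is a sketch with my own variable names, and the exact numbers depend on your random train/test split:

 # Unpack the 2x2 confusion matrix into its four cells
 tn, fp, fn, tp = confusion_matrix(y_test_data, predictions).ravel()
 accuracy  = (tp + tn) / (tp + tn + fp + fn)
 precision = tp / (tp + fp)
 recall    = tp / (tp + fn)
 print(accuracy, precision, recall)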

The Full Code for This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 %matplotlib inline
 import seaborn as sns

 #Import the data set
 titanic_data = pd.read_csv('titanic_train.csv')

 #Exploratory data analysis
 sns.heatmap(titanic_data.isnull(), cbar=False)
 sns.countplot(x='Survived', data=titanic_data)
 sns.countplot(x='Survived', hue="Sex", data=titanic_data)
 sns.countplot(x='Survived', hue="Pclass", data=titanic_data)
 plt.hist(titanic_data['Age'].dropna())
 plt.hist(titanic_data['Fare'])
 sns.boxplot(x='Pclass', y='Age', data=titanic_data)

 #Imputation function
 def impute_missing_age(columns):
     age = columns[0]
     passenger_class = columns[1]

     if pd.isnull(age):
         if(passenger_class == 1):
             return titanic_data[titanic_data['Pclass'] == 1]['Age'].mean()
         elif(passenger_class == 2):
             return titanic_data[titanic_data['Pclass'] == 2]['Age'].mean()
         elif(passenger_class == 3):
             return titanic_data[titanic_data['Pclass'] == 3]['Age'].mean()
     else:
         return age

 #Impute the missing Age data
 titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(impute_missing_age, axis = 1)

 #Reinvestigate missing data
 sns.heatmap(titanic_data.isnull(), cbar=False)

 #Drop null data
 titanic_data.drop('Cabin', axis=1, inplace = True)
 titanic_data.dropna(inplace = True)

 #Create dummy variables for Sex and Embarked columns
 sex_data = pd.get_dummies(titanic_data['Sex'], drop_first = True)
 embarked_data = pd.get_dummies(titanic_data['Embarked'], drop_first = True)

 #Add dummy variables to the DataFrame and drop non-numeric data
 titanic_data = pd.concat([titanic_data, sex_data, embarked_data], axis = 1)
 titanic_data.drop(['Name', 'PassengerId', 'Ticket', 'Sex', 'Embarked'], axis = 1, inplace = True)

 #Print the finalized data set
 titanic_data.head()

 #Split the data set into x and y data
 y_data = titanic_data['Survived']
 x_data = titanic_data.drop('Survived', axis = 1)

 #Split the data set into training data and test data
 from sklearn.model_selection import train_test_split
 x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x_data, y_data, test_size = 0.3)

 #Create the model
 from sklearn.linear_model import LogisticRegression
 model = LogisticRegression()

 #Train the model and create predictions
 model.fit(x_training_data, y_training_data)
 predictions = model.predict(x_test_data)

 #Calculate performance metrics
 from sklearn.metrics import classification_report
 print(classification_report(y_test_data, predictions))

 #Generate a confusion matrix
 from sklearn.metrics import confusion_matrix
 print(confusion_matrix(y_test_data, predictions))

Final Thoughts

In this tutorial, you learned how to build linear regression and logistic regression machine learning models in Python.

If you're interested in learning more about building, training, and deploying cutting-edge machine learning models, my eBook Pragmatic Machine Learning will teach you how to build 9 different machine learning models using real-world projects.

You can deploy the code from the eBook to your GitHub or personal portfolio to show to prospective employers. The book launches on August 3rd – preorder it for 50% off now!

Here is a brief summary of what you learned in this article:

  • How to import the libraries required to build a linear regression machine learning algorithm
  • How to split a data set into training data and test data using scikit-learn
  • How to use scikit-learn to train a linear regression model and make predictions using that model
  • How to calculate linear regression performance metrics using scikit-learn
  • Why the Titanic data set is often used for learning machine learning classification techniques
  • How to perform exploratory data analysis when working with a data set for classification machine learning problems
  • How to handle missing data in a pandas DataFrame
  • What imputation means and how you can use it to fill in missing data
  • How to create dummy variables for categorical data in machine learning data sets
  • How to train a logistic regression machine learning model in Python
  • How to make predictions using a logistic regression model in Python
  • How to use scikit-learn's classification_report to quickly calculate performance metrics for machine learning classification problems