Predicting Breast Cancer with Machine Learning | by Dennis Niggl | Oct, 2021

“There can be life after breast cancer. The prerequisite is early detection.”

Ann Jillian

What is Breast Cancer?

Normally, cells in the body divide and reproduce when new cells are needed. Sometimes, cells in a part of the body grow and divide out of control, which creates a mass of tissue called a tumor. If the cells that are growing out of control are normal cells, the tumor is called benign (not cancerous). If the cells that are growing out of control are abnormal and don’t function like the body’s normal cells, the tumor is called malignant (cancerous).

Breast cancer originates in the breast tissue and can invade and grow into the tissues surrounding the breast. It can also travel to other parts of the body and form new tumors.

In the US, breast cancer is the second-leading cause of cancer death in women after lung cancer, and it’s the leading cause of cancer death among women ages 35 to 54. Only 5 to 10% of breast cancers occur in women with a clearly defined genetic predisposition for the disease.

Detecting Breast Cancer

One of the best ways to detect breast cancer is a surgical breast biopsy, a procedure in which the doctor removes cells or a small piece of tissue from part of the breast and examines it for signs of cancer. Early detection of breast cancer is critical: it allows clinical treatment to begin sooner, which can greatly improve the prognosis and chances of survival.

Treatment for Breast Cancer

Once breast cancer is detected, it can be treated in different ways depending on the type of breast cancer and how far it has spread. People with breast cancer are often treated with one or more of the following.

  • Surgery. An operation where doctors cut out cancer tissue.
  • Chemotherapy. Using special medicines to shrink or kill the cancer cells. The drugs can be pills you take or medicines administered in your veins.
  • Hormonal therapy. Blocks cancer cells from getting the hormones they need to grow.
  • Biological therapy. Works with your body’s immune system to help it fight cancer cells or to control side effects from other cancer treatments.
  • Radiation therapy. Using high-energy rays (similar to X-rays) to kill the cancer cells.

Roadmap

The remainder of this article will present a program that will attempt to accurately predict whether a breast cancer is benign or malignant by analyzing tissue data. The following steps will be performed using machine learning and Python.

1. Import the required software libraries.

2. Access and import the dataset into a dataframe.

3. Data Analysis and Exploration.

4. Split the data into test and training data sets.

5. Train the model on the training data.

6. Make predictions on the test data.

7. Evaluate the model’s performance.

8. Draw conclusions from evaluations.

The Program

The objectives of this program include:

1. To use different machine learning models to find the model which provides the highest accuracy.

2. To determine which features are the most important in predicting if a breast cancer is benign or malignant.

Import the required software libraries
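The original import cell did not survive in this copy of the article. The following is a minimal sketch of the imports the later steps would plausibly need, assuming pandas and scikit-learn are installed. The exact set the author used is not shown, and xgboost is treated as an optional extra here.

```python
# Assumed imports for the steps in this article; not the author's original cell.
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

try:
    # Extreme Gradient Boosting lives in the separate xgboost package.
    from xgboost import XGBClassifier
    HAVE_XGBOOST = True
except ImportError:
    HAVE_XGBOOST = False
```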

Access and Import the Data Set
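The loading code is also missing from this copy. As a hedged sketch: the Kaggle CSV would normally be read with pandas (the file name "data.csv" below is an assumption). To keep the example runnable without the download, it uses the copy of the Breast Cancer Wisconsin (Diagnostic) data that ships with scikit-learn. That copy has no id or Unnamed: 32 column and uses slightly different column names, so it is a stand-in rather than the author's exact dataframe.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# With the Kaggle download, loading would look like this
# (the file name "data.csv" is an assumption):
# df = pd.read_csv("data.csv")

# Runnable stand-in: scikit-learn bundles the same 569 records.
bunch = load_breast_cancer(as_frame=True)
df = bunch.data.copy()

# scikit-learn encodes the target as 0 = malignant, 1 = benign;
# map it back to the M/B labels used in the Kaggle file.
df.insert(0, "diagnosis", bunch.target.map({0: "M", 1: "B"}))

print(df.shape)
print(df.head())
```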

Data Analysis and Exploration

We obtained the Breast Cancer Wisconsin (Diagnostic) dataset from Kaggle. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Feature variables include:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3. Columns 3–32: the 30 real-valued features described below

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter² / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

There are 569 patient records and 33 columns. The data set contains integer, object and float values. The diagnosis variable is the target variable, with values of M (malignant) and B (benign).

Unnamed: 32 is the only column with null or missing values, so we will remove it. We will also remove the id column because it has no relationship to the target variable.
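A minimal sketch of this cleanup step, using a tiny hypothetical dataframe that mimics the layout of the Kaggle CSV (the three rows and their values are made up for illustration; only the column names come from the article):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame that mimics the layout of the Kaggle CSV.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Count missing values per column; only "Unnamed: 32" has any.
missing = df.isnull().sum()
print(missing)

# Drop the empty column and the identifier column.
df = df.drop(columns=["Unnamed: 32", "id"])
print(df.columns.tolist())
```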

Show counts for the number of malignant and benign cases.

There are 357 benign cases and 212 malignant cases.

(M = malignant, B = benign)
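The counts can be reproduced with value_counts. This sketch assumes the scikit-learn copy of the dataset as a stand-in for the Kaggle file and maps its numeric target back to the M/B labels:

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)

# scikit-learn uses 0 = malignant, 1 = benign; relabel to match the article.
diagnosis = bunch.target.map({0: "M", 1: "B"})

counts = diagnosis.value_counts()
print(counts)
```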

Visualize the distributions for each feature.

Feature Selection

Remove the features that have a low correlation (below 0.5) with the target variable, in order to reduce model complexity and potentially improve accuracy.

Visualize the features and target variable using the correlation matrix. We will use the Pearson correlation method.
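A hedged sketch of correlation-based feature selection. It assumes the threshold applies to the absolute Pearson correlation and encodes malignant as 1, neither of which the article states explicitly. It again uses the scikit-learn copy of the data, so the column names differ from the Kaggle file.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)
features = bunch.data

# Encode the target as 1 = malignant, 0 = benign
# (scikit-learn's own encoding is the reverse).
target = (bunch.target == 0).astype(int)

# Pearson correlation of every feature with the target
# (Pearson is the default method for corrwith).
correlations = features.corrwith(target)

# Keep features whose absolute correlation is at least 0.5.
selected = correlations[correlations.abs() >= 0.5].index.tolist()
X = features[selected]

print(correlations.sort_values(ascending=False).round(2))
print(len(selected), "features kept out of", features.shape[1])
```

Calling X.corr() on the reduced dataframe gives the correlation matrix that a heatmap would display.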

Split Data into Test and Training Data Sets

First we need to divide our data into x values (the data we will use to make predictions) and y values (the data we are attempting to predict).

Use the train_test_split function to generate training data and test data. The test data set is 30% of the original data set.
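A minimal sketch of the split, assuming the scikit-learn copy of the data with all 30 features. The random_state value is an arbitrary choice for reproducibility and is not taken from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bunch = load_breast_cancer(as_frame=True)
X = bunch.data    # x values: the features used to make predictions
y = bunch.target  # y values: the diagnosis (0 = malignant, 1 = benign here)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

With 569 records, a 30% test share is rounded up to 171 test rows, leaving 398 for training.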

Normalize the feature data set.
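The article does not say which scaler was used. The sketch below assumes StandardScaler (zero mean, unit variance); MinMaxScaler would be an equally plausible reading of "normalize". The scaler is fitted on the training data only, so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Learn the mean and standard deviation from the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(2)[:5])
print(X_train_scaled.std(axis=0).round(2)[:5])
```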

We will now train models, make predictions and evaluate the performance of five different machine learning models.
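One way to sketch the train, predict and evaluate cycle is a loop over a dictionary of models. This is an illustration under assumptions, not the author's code: it uses the scikit-learn copy of the data with all 30 features, default hyperparameters, and an arbitrary random_state, so its scores will not match the figures reported in the sections that follow. XGBoost is included only if the xgboost package is installed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
}
try:
    # Optional: only available when the xgboost package is installed.
    from xgboost import XGBClassifier
    models["Extreme Gradient Boosting"] = XGBClassifier()
except ImportError:
    pass

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # train on the training data
    predictions = model.predict(X_test)    # predict on the test data
    scores[name] = accuracy_score(y_test, predictions)
    print(f"{name}: {scores[name]:.4f}")
```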

Logistic Regression Model

Accuracy score: 0.9649122807017544

The logistic regression model correctly predicted 96% of breast cancers to be benign or malignant.

  • 96% correctly predicted breast cancer to be benign.
  • 97% correctly predicted breast cancer to be malignant.
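Per-class figures like the two above are typically read from a classification report or a confusion matrix; whether the author reported precision or recall is not stated, so that is left open here. A minimal sketch for the logistic regression model, assuming the scikit-learn copy of the data with all 30 features and an arbitrary random_state (its numbers will differ from the ones reported above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall accuracy: share of test cases classified correctly.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
# (label 0 = malignant, label 1 = benign in this copy of the data).
matrix = confusion_matrix(y_test, y_pred)

# Precision, recall and F1 for each class.
report = classification_report(
    y_test, y_pred, target_names=["malignant", "benign"]
)

print(accuracy)
print(matrix)
print(report)
```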

Random Forest Model

Accuracy score: 0.9649122807017544

The random forest model correctly predicted 96% of breast cancers to be benign or malignant.

  • 96% correctly predicted breast cancer to be benign.
  • 97% correctly predicted breast cancer to be malignant.

K Nearest Neighbors

Accuracy score: 0.9415204678362573

The k nearest neighbors model correctly predicted 94% of breast cancers to be benign or malignant.

  • 95% correctly predicted breast cancer to be benign.
  • 93% correctly predicted breast cancer to be malignant.

Support Vector Machine

Accuracy score: 0.9415204678362573

The support vector machine model correctly predicted 94% of breast cancers to be benign or malignant.

  • 94% correctly predicted breast cancer to be benign.
  • 95% correctly predicted breast cancer to be malignant.

Extreme Gradient Boosting

Accuracy score: 0.9532163742690059

The extreme gradient boost model correctly predicted 95% of breast cancers to be benign or malignant.

  • 95% correctly predicted breast cancer to be benign.
  • 97% correctly predicted breast cancer to be malignant.

Show Accuracy Score by Model
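The comparison chart is missing from this copy, but the reported scores can be tabulated directly. The sketch below uses only the accuracy values printed in this article. The test-set size of 171 is an assumption (30% of 569 records, rounded up), under which each score corresponds to a whole number of correct predictions.

```python
# Accuracy scores as reported in this article.
reported = {
    "Logistic Regression": 0.9649122807017544,
    "Random Forest": 0.9649122807017544,
    "K Nearest Neighbors": 0.9415204678362573,
    "Support Vector Machine": 0.9415204678362573,
    "Extreme Gradient Boosting": 0.9532163742690059,
}

# Assumption: 30% of 569 records, rounded up, gives 171 test samples.
n_test = 171

# List the models from highest to lowest accuracy, with the implied
# number of correct predictions out of the assumed test-set size.
ranking = sorted(reported.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    correct = round(score * n_test)
    print(f"{name:<26} {score:.2%}  ({correct}/{n_test} correct)")
```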

Conclusions: All of the models tested showed high accuracy scores ranging from 94.15% to 96.49%. The logistic regression and random forest models had the highest accuracy scores. These models correctly predicted over 96% of breast cancers to be benign or malignant.

The models' accuracy scores may be improved by using a larger dataset and by tuning some of the models' hyperparameters.
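As an illustration of what hyperparameter tuning could look like, here is a hedged sketch using scikit-learn's GridSearchCV on the logistic regression model. The grid of C values, the 5-fold cross-validation, and the random_state are all arbitrary choices for the example, not settings from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling is part of the pipeline so it is refitted inside each CV fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate values for the regularization strength C (arbitrary grid).
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(search.best_score_)
print(test_accuracy)
```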

Thanks for reading my article! If you have any questions or comments please let me know.
