Identifying Breast Cancer Cell Traits for Malignant vs Benign Cells.
Abstract
Breast cancer has the second-highest mortality rate among cancers, and it is the most common type of cancer found in women. It is usually first detected through abnormal lumps in breast tissue or through calcium deposits that show up on a breast X-ray. To help prevent breast cancer from spreading to the rest of the body, it’s important to have defining characteristics that distinguish benign cells from malignant cells.
Data Visualization
Plotting mean cell area against mean cell radius shows a clear separation between benign and malignant cells. In the plot produced by the code below, malignant cells cluster at higher values of both radius and area than benign cells do.
library(ggplot2)
# Assumes `data` is the already-loaded data frame with a diagnosis ("M"/"B") column.
ggplot(data, aes(radius_mean, area_mean, color = diagnosis)) +
  geom_point() +
  ggtitle("Cell Radius vs Cell Area Mean") +
  labs(x = "Cell Radius Mean", y = "Cell Area Mean")
Mean area (malignant): 978.3764
Mean area (benign): 462.7902
The graph above tells us that malignant cells are larger than benign cells. The mean area is 978.3764 for malignant cells and 462.7902 for benign cells, so malignant cells are roughly twice as large on average. That's a substantial difference.
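The two group means quoted above can be computed in a single call. The following is a minimal sketch, assuming the data frame has the same `diagnosis` and `area_mean` columns used in the plot. `group_means` is a hypothetical helper name, and the `toy` data frame is made up purely so the snippet runs on its own; it is not the real dataset.

```r
# Hypothetical helper: mean of one numeric column per diagnosis group.
# Assumes `df` has a `diagnosis` column coded "M"/"B".
group_means <- function(df, column) {
  tapply(df[[column]], df$diagnosis, mean)
}

# Made-up values, only to demonstrate the call (NOT the real dataset).
toy <- data.frame(
  diagnosis = c("M", "M", "B", "B"),
  area_mean = c(1000, 900, 500, 400)
)

res <- group_means(toy, "area_mean")
res
```

With the real data loaded as `data`, the equivalent call would be `group_means(data, "area_mean")`.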
— — — — —
Let's take a look at another graph with two new factors, concavity and compactness:
ggplot(data, aes(concavity_mean, compactness_mean, color = diagnosis)) +
  geom_point() +
  ggtitle("Cell Concavity vs Cell Compactness Mean") +
  labs(x = "Cell Concavity Mean", y = "Cell Compactness Mean")
Malignant cells mostly have compactness and concavity values of 0.12 or greater, while benign cells mainly fall between 0 and 0.1 on both measures. A cell with concavity and compactness of 0.12 or higher therefore has a higher probability of being malignant.
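The 0.12 cutoff is read off the scatter plot by eye, so it is worth checking how many cells in each group actually clear it. The sketch below assumes the same `concavity_mean`, `compactness_mean`, and `diagnosis` columns. `share_above_cutoff` is a hypothetical helper, and the `toy` values are invented for illustration only.

```r
# Hypothetical helper: share of cells per diagnosis whose concavity AND
# compactness are both at or above the cutoff.
share_above_cutoff <- function(df, cutoff = 0.12) {
  above <- df$concavity_mean >= cutoff & df$compactness_mean >= cutoff
  tapply(above, df$diagnosis, mean)
}

# Made-up values, only to demonstrate the call (NOT the real dataset).
toy <- data.frame(
  diagnosis        = c("M", "M", "M", "B", "B", "B"),
  concavity_mean   = c(0.20, 0.15, 0.08, 0.03, 0.05, 0.13),
  compactness_mean = c(0.18, 0.14, 0.11, 0.06, 0.09, 0.10)
)

shares <- share_above_cutoff(toy)
shares
```

A large gap between the two shares on the real data would support using 0.12 as a rough threshold. A small gap would suggest the visual impression overstates the separation.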
Taking the data on cell radius and area together with compactness and concavity, we can infer that under a microscope the malignant cells would look large in area, thin, and concave.
So far we have found three factors (area, compactness, and concavity) that can serve as distinct indicators between malignant and benign cells. We can look for another usable factor through a hypothesis test.
Hypothesis Test
Let’s explore the factor of texture. Can this factor serve as a determining factor in identifying which cells are malignant vs benign?
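My hypothesis is that texture is a significant determining factor between benign and malignant cells: malignant cells have a higher mean texture value than benign cells.
Null hypothesis: texture is NOT a significant determining factor between benign and malignant cells; the mean texture of malignant cells is no higher than that of benign cells.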
Benign:
Mean texture: 17.91476
Standard deviation: 3.995125
Sample size: 357
Malignant:
Mean texture: 21.60491
Standard deviation: 3.77947
Sample size: 212
p-value: 0
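Since the p-value of 0 is less than 0.05, the null hypothesis can be rejected. The reported 0 is a rounding artifact: the true p-value is extremely small but not exactly zero. There is a significant difference between the texture of a benign cell and a malignant cell. As a result, texture can be used as a determining factor when identifying whether a cell is benign or malignant. The values above were computed with the following R code: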
# Mean, standard deviation, and sample size of texture_mean for each diagnosis
textureM.mean <- mean(data[data$diagnosis == "M", ]$texture_mean)
textureB.mean <- mean(data[data$diagnosis == "B", ]$texture_mean)
textureM.sd <- sd(data[data$diagnosis == "M", ]$texture_mean)
textureB.sd <- sd(data[data$diagnosis == "B", ]$texture_mean)
textureM.le <- length(data[data$diagnosis == "M", ]$texture_mean)
textureB.le <- length(data[data$diagnosis == "B", ]$texture_mean)

textureM.mean
textureM.le
textureM.sd
textureB.le
textureB.sd
textureB.mean

# Two-sample z-score for the difference in means (malignant minus benign)
zscore <- (textureM.mean - textureB.mean) / sqrt(textureM.sd^2 / textureM.le + textureB.sd^2 / textureB.le)
zscore

# One-sided p-value: probability of a z-score at least this large under the null
pvalue <- 1 - pnorm(zscore)
pvalue
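The p-value prints as exactly 0 because `1 - pnorm(zscore)` loses the tiny tail probability to floating-point rounding. The sketch below recomputes the test from the summary statistics reported in this article, so it needs no data file, and uses the `lower.tail = FALSE` argument of `pnorm()` to ask for the upper tail directly. The variable names are illustrative only.

```r
# Summary statistics as reported in the article (texture_mean by diagnosis).
m_mean <- 21.60491   # malignant mean
m_sd   <- 3.77947    # malignant standard deviation
m_n    <- 212        # malignant sample size
b_mean <- 17.91476   # benign mean
b_sd   <- 3.995125   # benign standard deviation
b_n    <- 357        # benign sample size

# Same two-sample z-score as in the code above.
z <- (m_mean - b_mean) / sqrt(m_sd^2 / m_n + b_sd^2 / b_n)

# pnorm(z) is indistinguishable from 1 in double precision for a z-score
# this large, so subtracting it from 1 gives exactly 0.
p_naive <- 1 - pnorm(z)

# Requesting the upper tail directly keeps the very small value.
p_upper <- pnorm(z, lower.tail = FALSE)

z        # about 11.02
p_naive  # 0
p_upper  # very small, but greater than 0
```

The conclusion does not change, since either value is far below 0.05, but reporting the upper-tail value avoids implying that the probability is literally zero.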
Conclusion
Through the data visualizations and hypothesis test above, we have identified clear factors that can act as indicators for telling a benign cell from a malignant one. This study is limited in that only 4 factors were explored, which could introduce some bias, since the dataset has almost 30 other factors that haven’t been explored yet.
However, here are some of the takeaways from the data analysis:
- malignant cells are larger in size compared to benign cells
- malignant cells are more compact and concave compared to benign cells
- visually, we can infer that malignant cells are thin, concave, and larger in size
- our hypothesis test tells us that there is a significant difference in texture between benign and malignant cells.
All in all, we found 4 major identifying factors (concavity, area, compactness, and texture), along with approximate value ranges for each, that can help distinguish malignant cells from benign ones.
These findings can be built upon to create regression models and AI models that accurately predict whether cells are malignant. Though the raw dataset has around 32 columns, a regression model can be built from the 4 factors identified in this study: given the area, concavity, compactness, and texture values, the model can predict whether the cell is malignant or not.
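As a rough illustration of what such a model could look like, here is a minimal logistic regression sketch using `glm()`. Logistic regression is only one possible choice of "regression model", and the 0.5 decision cutoff is an assumption. The data frame `sim` is simulated so the snippet can run on its own: its column names match this article, but its values are invented and are not meant to be realistic.

```r
# Simulated stand-in data (NOT the real dataset); values are invented.
set.seed(1)
n <- 200
is_malignant <- rbinom(n, 1, 0.4)
sim <- data.frame(
  is_malignant     = is_malignant,
  diagnosis        = ifelse(is_malignant == 1, "M", "B"),
  area_mean        = rnorm(n, mean = 460 + 500 * is_malignant, sd = 250),
  concavity_mean   = rnorm(n, mean = 0.05 + 0.10 * is_malignant, sd = 0.08),
  compactness_mean = rnorm(n, mean = 0.08 + 0.06 * is_malignant, sd = 0.06),
  texture_mean     = rnorm(n, mean = 18 + 3.7 * is_malignant, sd = 4)
)

# Logistic regression on the four factors discussed in the article.
fit <- glm(
  is_malignant ~ area_mean + concavity_mean + compactness_mean + texture_mean,
  data = sim,
  family = binomial
)

# Predicted probability of malignancy, turned into a label at 0.5.
prob <- predict(fit, type = "response")
predicted <- ifelse(prob > 0.5, "M", "B")

# Share of rows classified correctly on the data the model was fit to.
accuracy <- mean(predicted == sim$diagnosis)
accuracy
```

On the real data, the same formula could be fit with a 0/1 malignancy column derived from `diagnosis`. Accuracy measured on the training rows is optimistic, so a held-out test split would give a fairer estimate.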
There are many possible use cases, all with the goal of identifying malignant cells so that treatment can be prioritized. Hopefully, this will help prevent cancer from spreading throughout the body.