Final Tutorial: Cervical Cancer on an Individual & Statewide Level

Analysis by Sarah Chiariello

Identifying the Datasets

Cervical Cancer Risk Classification:

United States Department of Agriculture Economic Research Service: https://data.ers.usda.gov/reports.aspx?ID=17826

CDC U.S. Cancer Cases: https://gis.cdc.gov/Cancer/USCS/DataViz.html

U.S. News Health Care Rankings: https://www.usnews.com/news/best-states/rankings/health-care

Project Description & Plan

Forty years ago, cervical cancer was one of the leading causes of cancer death for women in the United States. Because it is one of the most preventable types of cancer, this statistic left much to be desired from the healthcare infrastructure of the time. Thanks to early detection of abnormal cells within the cervix and more regularly scheduled Pap test screenings, the United States was able to bring down the mortality rate from cervical cancer. Sadly, the same cannot be said for developing healthcare systems around the world. Many women worldwide do not have the access to healthcare that would enable early detection through screening, in their youth, for the human papillomavirus (HPV) types that can increase their risk of cervical cancer in the future. Women with abnormal cells at a young age who do not get regular examinations are at a higher risk of localized cancer, which can lead to invasive cancer by the age of 50. While cervical cancer rates have declined in the US, death rates for African American women are twice as high as those for Caucasian women, and rates of invasive cervical cancer in Hispanic women are more than twice those of Caucasian women. These racial disparities may result from less advanced healthcare systems worldwide, socioeconomic patterns, and low screening rates due to high poverty levels. They may also result from a lack of access to transportation, health insurance, or language translators.

Cervical cancer involves multiple risk factors and must be diagnosed by a healthcare professional after a biopsy or other type of examination. Increased sexual activity can raise the risk of contracting HPV, the main risk factor for cervical cancer. HPV is a sexually transmitted infection which can result in abnormal cell growth within the cervix, something a Pap test regularly screens for. Family history of cervical cancer can also be a risk factor, along with the use of oral hormonal contraception pills. Other risk factors include, but are not limited to, smoking, past STDs, number of children, and history of diagnosed cervical cancer.

The purpose of this project is to examine data regarding individual cervical cancer risk factors and, hopefully, draw meaningful insights about associations between certain risk factors and poverty levels, healthcare structures, or current racial disparities. Individual-level analyses will be evaluated alongside state-level analyses to build a fuller picture of cervical cancer among women in America.

With this project, I hope to analyze cervical cancer risk factors on an individual level.

  • Are there any factors that seem to lead to malignancies at a higher rate?
  • How do the number of sexual partners and the age of first sexual intercourse relate to the likelihood of getting cancer through the spread of HPV, and how much weight does each carry?
  • Hormonal contraceptives or smoking are believed to increase risk for cervical cancer if used for over a certain number of years. Is this true?

I hypothesize that I will be able to find trends in the following:

  • Higher number of sexual partners, increased risk of cervical cancer
  • Those who take hormonal contraceptives for longer, increased risk of cervical cancer
  • Those who have contracted STDs, specifically HPV, increased risk of cervical cancer

I also hope to relate cervical cancer data per state to other external factors that may be an indirect cause of increased cancer rates. I hope to create insights regarding the following:

  • The better the healthcare infrastructure in a state, the lower the incidence of cervical cancer
  • Higher rates of poverty in a state may lead to inaccessible healthcare and an inability to go for annual Pap smear office visits
  • How could access to a clinic, accessibility of screening tests, regularity of visits to a clinic in underdeveloped neighborhoods, transportation time to the nearest clinic, neighborhood poverty levels, ability to make a follow-up appointment, etc. lead to increased chances of developing cervical cancer or an inability to treat it?

Initial Data Extraction, Transform, Load & Tidy: Cervical Cancer Risk Classification

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

df = pd.read_csv(r"..\schiariello.github.io\project_data\cervical_raw.csv")
In [3]:
# Create values of NaN where there are currently ?, in accordance with best practices for tidy data.

display(df.replace("?", np.nan))
df.replace("?", np.nan,inplace=True)
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD ... STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN (Cervical Intraepithelial Neoplasia) Dx:HPV Dx Hinselmann Schiller Citology Biopsy
0 18 4 15 1 0 0 0 0 0 0 ... NaN NaN 0 0 0 0 0 0 0 0
1 15 1 14 1 0 0 0 0 0 0 ... NaN NaN 0 0 0 0 0 0 0 0
2 34 1 NaN 1 0 0 0 0 0 0 ... NaN NaN 0 0 0 0 0 0 0 0
3 52 5 16 4 1 37 37 1 3 0 ... NaN NaN 1 0 1 0 0 0 0 0
4 46 3 21 4 0 0 0 1 15 0 ... NaN NaN 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
853 34 3 18 0 0 0 0 0 0 0 ... NaN NaN 0 0 0 0 0 0 0 0
854 32 2 19 1 0 0 0 1 8 0 ... NaN NaN 0 0 0 0 0 0 0 0
855 25 2 17 0 0 0 0 1 0.08 0 ... NaN NaN 0 0 0 0 0 0 1 0
856 33 2 24 2 0 0 0 1 0.08 0 ... NaN NaN 0 0 0 0 0 0 0 0
857 29 2 20 1 0 0 0 1 0.5 0 ... NaN NaN 0 0 0 0 0 0 0 0

858 rows × 36 columns
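As a side note, the same "?"-to-NaN conversion can also happen while the file is being parsed. The following is a minimal sketch under stated assumptions: the three-row sample and the names sample_csv and sample_df are made up for illustration and are not the project data.

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the "?" placeholders in the raw file.
sample_csv = io.StringIO(
    "Age,Number of sexual partners,First sexual intercourse\n"
    "18,4,15\n"
    "34,1,?\n"
    "52,?,16\n"
)

# na_values tells read_csv to treat "?" as missing while parsing,
# so the affected columns come back as float64 rather than object.
sample_df = pd.read_csv(sample_csv, na_values="?")
print(sample_df.dtypes)
print(sample_df.isnull().sum())
```

Parsing this way would also make a later pd.to_numeric step unnecessary for columns whose only non-numeric entries were "?".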

In [4]:
# Drop the columns I do not need for the purposes of this evaluation

df.drop(columns=df.columns[13:28], inplace=True)

# Do not run this block more than once: it drops whatever sits at column positions 13-27,
# so a second run would delete columns that should be kept
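One possible way to make a cell like this safe to rerun is to drop by column name rather than by position. This is only a sketch on a toy frame: the column names "b" and "c" and the variable names toy and cols_to_drop are placeholders, not the dataset's real headers.

```python
import pandas as pd

# Toy frame; the column names "b" and "c" are placeholders, not the real headers.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

cols_to_drop = ["b", "c"]  # names captured once, before any dropping

# errors="ignore" skips labels that are already gone, so rerunning is harmless.
toy = toy.drop(columns=cols_to_drop, errors="ignore")
toy = toy.drop(columns=cols_to_drop, errors="ignore")  # second run: no change
print(list(toy.columns))
```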
In [5]:
# List the variable types for each column of the data, to ensure they are as we intend to use them later.
    # float64, int64 = quantitative
    # object = categorical

# Use df.dtypes first to see how pandas is characterizing the variables

# This block converts the columns below from object (categorical) to numeric (quantitative), as I deemed necessary.

numeric_cols = ['Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies',
                'Smokes', 'Smokes (years)', 'Smokes (packs/year)',
                'Hormonal Contraceptives', 'Hormonal Contraceptives (years)',
                'IUD', 'IUD (years)', 'STDs', 'STDs (number)']
# errors="coerce" turns any value that cannot be parsed as a number into NaN
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

df.dtypes
Out[5]:
Age                                              int64
Number of sexual partners                      float64
First sexual intercourse                       float64
Num of pregnancies                             float64
Smokes                                         float64
Smokes (years)                                 float64
Smokes (packs/year)                            float64
Hormonal Contraceptives                        float64
Hormonal Contraceptives (years)                float64
IUD                                            float64
IUD (years)                                    float64
STDs                                           float64
STDs (number)                                  float64
Dx:Cancer                                        int64
Dx:CIN (Cervical Intraepithelial Neoplasia)      int64
Dx:HPV                                           int64
Dx                                               int64
Hinselmann                                       int64
Schiller                                         int64
Citology                                         int64
Biopsy                                           int64
dtype: object

Note for tidying this data:

By running the .dtypes command, we can see how pandas is categorizing each variable. I want each variable to be treated as either quantitative or categorical. For the purposes of this project, I have chosen to treat the binary entries in columns such as Smokes, Hormonal Contraceptives, IUD, and STDs as quantitative instead of categorical. I have chosen this method so I can later perform analyses using the number or proportion of ones.
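To illustrate why a numeric 0/1 column is convenient, here is a small sketch on made-up answers (the values are not from the dataset). It shows that sum() counts the ones and mean() gives their proportion among the non-missing responses.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 answers with one missing response.
smokes = pd.Series([1, 0, 0, np.nan, 1, 0], name="Smokes")

n_yes = smokes.sum()         # NaN is skipped, so this counts the ones
n_answered = smokes.count()  # number of non-missing responses
share_yes = smokes.mean()    # ones divided by non-missing responses
print(n_yes, n_answered, share_yes)
```

Note that the denominator is the number of participants who answered, not the total number of rows, which matters for columns with many missing values.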

Additional Data ETL

In addition to the above dataframe showing various risks of cervical cancer in women, I will read in other datasets to assist with an insightful analysis.

Explanation of the dataset: CDC U.S. Cancer Cases

The following dataframe uses CDC data on cervical cancer cases in the U.S. by state. Nationally, between 2013 and 2017, 64,167 new cases of cervical cancer among women were reported to the CDC. During this time frame, 20,902 women died of cervical cancer. Nationally, this represents 8 new cervical cancer cases reported and 2 resulting deaths per 100,000 women.

The following data reflects the age adjusted rate per 100,000 women of new cervical cancer cases in each state, in addition to the total population of the state and the actual number of women reporting new cases of cervical cancer in that state. This is a direct application of the initial dataframe regarding cervical cancer risk factors. I will read in the data and create a visual dataframe.

In [6]:
cancer_df = pd.read_csv(r"..\schiariello.github.io\project_data\cancer_per_state.csv") # read in data
cancer_df = cancer_df.drop(columns=['CancerType', 'Sex', 'Year', 'Race']) # drop the unnecessary columns

cancer_df.rename(columns={'Area': 'State'}, inplace=True) # rename 'Area' to 'State' to match the other datasets
cancer_df.head()
Out[6]:
State AgeAdjustedRate CaseCount Population
0 Alabama 9.4 1196 12508432
1 Alaska 7.2 120 1760225
2 Arizona 6.5 1118 17196077
3 Arkansas 9.5 729 7580171
4 California 7.2 7280 97858943
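To see how the age-adjusted rate differs from a simple crude rate, the five rows displayed above can be used in a short sketch. This assumes that the Population column is the population at risk summed over the reporting period, which the source does not state explicitly; the names head_df and CrudeRate are introduced here for illustration.

```python
import pandas as pd

# The five rows displayed in Out[6].
head_df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "AgeAdjustedRate": [9.4, 7.2, 6.5, 9.5, 7.2],
    "CaseCount": [1196, 120, 1118, 729, 7280],
    "Population": [12508432, 1760225, 17196077, 7580171, 97858943],
})

# Crude rate: cases per 100,000 of the population at risk, with no age weighting.
# Assumption: Population is the denominator the CDC counts refer to.
head_df["CrudeRate"] = head_df["CaseCount"] / head_df["Population"] * 100_000
print(head_df[["State", "AgeAdjustedRate", "CrudeRate"]].round(2))
```

The crude and age-adjusted figures land close to each other but are not identical, because age adjustment reweights each state's cases to a standard age distribution so that states with older or younger populations can be compared fairly.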

Explanation of the dataset: U.S. News Health Care Rankings

The following was taken from a table published by U.S. News, "measuring how well states are meeting citizens' health care needs." This evaluation of health care was based on three factors used to determine the rank of each state's health care system and, thus, the citizens' overall quality of health. The ranking is somewhat vague and has its limitations, so it should be taken with a grain of salt in further applications. However, it is a good benchmark for the purposes of this project. Note that the population behind this data is not only women but the entire population of the state.

The factors shown in the following data are: access to care (HC_Access), quality of care (HC_Quality), and overall health of the population (PH = public health). As stated in the article, they accounted for the percentage of adults without health insurance, the percentage of those who have not had a routine checkup in the last year, and the share of those who went without medical attention due to high costs. Those who ran the study also took into account general measures corresponding with good physical and mental health, infant mortality, and overall mortality rates. More can be read on the site, linked at the top of the page.

This data is useful to the project as I will be comparing state health care quality, access, and perception of public health to new cervical cancer rates in the state, poverty of the state, and generalized cervical cancer risk factors. Here, I will read in the dataframe.

In [7]:
ranking_df = pd.read_csv(r"..\schiariello.github.io\project_data\HC_Rank.csv")
ranking_df.head()
Out[7]:
Rank State HC_Access HC_Quality PH
0 46 Alabama 37 39 47
1 25 Alaska 49 2 38
2 23 Arizona 42 11 18
3 49 Arkansas 45 49 49
4 7 California 23 10 1

Explanation of the dataset: U.S. Department of Agriculture Economic Research Service

The following dataframes were created as a reflection of poverty by state in America. The "percent" dataframe shows the percentage of the total population living in poverty along with the percentage of children under the age of 18 living in poverty. Similarly, the "number" dataframe shows the actual number of total people living in poverty per state, along with children under the age of 18 living in poverty per state. Lower and upper bounds are also included in case we wish to use them at a later point.

This data represents one of the many external factors which may lead to someone contracting HPV or developing cervical cancer. Inability to pay for proper medical care, insurance, or annual routine exams could delay the recognition of HPV or cervical cancer in a patient. Poverty and socioeconomic class, along with access to health care, the number of primary care clinics in an area, and the availability of screening tests or appropriate appointment times, can be factors that indirectly lead to a woman unknowingly developing cervical cancer. Looking at poverty percentages and numbers is just one application of a possible external "risk" of cervical cancer. Here, I create two separate dataframes from one file. I then merge all three of the additional dataframes into one full table.

In [8]:
poverty2 = pd.read_csv(r'..\schiariello.github.io\project_data\PovertyReport2.csv') #read in data
poverty2 = poverty2.drop(columns=['Textbox98', 'Textbox99']) #drop unnecessary columns

percent = pd.DataFrame(poverty2.loc[0:51]) #take the top half of the data which is the percentage data, create DF
number = pd.DataFrame(poverty2.loc[54:]) #take the bottom half of the data which is the number data, create separate DF

percent.columns = ['State','Total','min_total','max_total','under18','min_under18','max_under18']
number.columns = ['State','Total','min_total','max_total','under18','min_under18','max_under18']
#standardize columns to be meaningful variable names ^^

percent.drop(percent.index[27], inplace=True) #drop the row that has 'National' data
percent.reset_index(drop=True,inplace=True) #reset the index so it is contiguous again after the dropped row
#percent

number.drop(number.index[27], inplace=True) #same as above^^
number.reset_index(drop=True,inplace=True)
#number.head()

number_percent_pov = percent.merge(number,on=['State'],how='inner') #create a full dataframe for poverty
number_percent_poverty = number_percent_pov[['State','Total_x','Total_y']] #only take the columns I wish to use for now
#number_percent_poverty
number_percent_rank = number_percent_poverty.merge(ranking_df,on=['State'],how='outer') #merge with HC rank DF, outer will include DC
full_additional = number_percent_rank.merge(cancer_df,on=['State'],how='outer') #merge the poverty and rank DF with the cancer rates DF
full_additional.columns = ['State','Percent_poverty','Pop_poverty','Rank','HC_Access','HC_Quality','P_Health','Cancer_Rate','Cancer_Count','Population']
# set column titles to make sense and be informative to a reader
full_additional.head(10) #show the first ten states with the variables I thought would be interesting to compare
Out[8]:
State Percent_poverty Pop_poverty Rank HC_Access HC_Quality P_Health Cancer_Rate Cancer_Count Population
0 Alabama 16.8 801,758 46.0 37.0 39.0 47.0 9.4 1196 12508432
1 Alaska 11.1 80,224 25.0 49.0 2.0 38.0 7.2 120 1760225
2 Arizona 14.1 990,291 23.0 42.0 11.0 18.0 6.5 1118 17196077
3 Arkansas 16.8 492,306 49.0 45.0 49.0 49.0 9.5 729 7580171
4 California 12.8 4,972,955 7.0 23.0 10.0 1.0 7.2 7280 97858943
5 Colorado 9.7 540,579 12.0 29.0 7.0 8.0 6.2 868 13547027
6 Connecticut 10.3 358,519 3.0 1.0 14.0 6.0 6.1 591 9185764
7 Delaware 12.2 114,691 15.0 18.0 6.0 28.0 7.8 199 2427686
8 District of Columbia 16.1 107,806 NaN NaN NaN NaN 8.8 155 1771124
9 Florida 13.7 2,854,438 29.0 46.0 24.0 15.0 8.9 5007 51742206
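The NaN values in the District of Columbia row come from the outer merge: that row has no match in the ranking table. A hedged sketch of how unmatched rows could be located is shown here, using small stand-in tables (poverty_small and rank_small are hypothetical names; the values are copied from the output above).

```python
import pandas as pd

# Small stand-ins for the poverty and ranking tables (values copied from Out[8]).
poverty_small = pd.DataFrame({
    "State": ["Delaware", "District of Columbia", "Florida"],
    "Percent_poverty": [12.2, 16.1, 13.7],
})
rank_small = pd.DataFrame({
    "State": ["Delaware", "Florida"],
    "Rank": [15, 29],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
checked = poverty_small.merge(rank_small, on="State", how="outer", indicator=True)
unmatched = checked.loc[checked["_merge"] != "both", "State"].tolist()
print(unmatched)
```

The same check would also reveal spelling mismatches between state names across the source files, which would otherwise show up only as unexplained NaN values. It also explains why the rank columns become float64 after the merge: integer columns cannot hold NaN.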
In [9]:
# check and fix the dtypes
full_additional['Percent_poverty'] = pd.to_numeric(full_additional['Percent_poverty'], errors="coerce")
#full_additional['Pop_poverty'] = full_additional['Pop_poverty'].str.extract('(\d+)',expand=False).astype(float)
full_additional['Pop_poverty'] = full_additional['Pop_poverty'].replace(',','', regex=True)
full_additional['Pop_poverty'] = pd.to_numeric(full_additional['Pop_poverty'], errors="coerce")
full_additional.dtypes
Out[9]:
State               object
Percent_poverty    float64
Pop_poverty          int64
Rank               float64
HC_Access          float64
HC_Quality         float64
P_Health           float64
Cancer_Rate        float64
Cancer_Count         int64
Population           int64
dtype: object
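The comma cleanup above could also be handled in other ways. The sketch below shows two options on a hypothetical two-row extract (the names raw, parsed, as_text, and cleaned are introduced only for this illustration): letting read_csv strip thousands separators, or removing literal commas from an already-loaded text column.

```python
import io
import pandas as pd

# Hypothetical two-row extract; quoted so the commas stay inside one field.
raw = io.StringIO('State,Pop_poverty\nAlabama,"801,758"\nAlaska,"80,224"\n')

# Option 1: let the parser strip thousands separators while reading.
parsed = pd.read_csv(raw, thousands=",")

# Option 2: clean an already-loaded string column, similar to the cell above.
as_text = pd.Series(["801,758", "80,224"])
cleaned = pd.to_numeric(as_text.str.replace(",", "", regex=False), errors="coerce")
print(parsed["Pop_poverty"].tolist(), cleaned.tolist())
```

Using regex=False makes it explicit that the comma is a literal character rather than a pattern.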

Exploratory Analysis of Initial Dataset: Cervical Cancer on an Individual Level

Now that I have tidy data, I want to be able to create meaningful visuals of the data. I can see that 'Age' is the first column of the dataset. Research suggests that age may be an indirect risk factor for cervical cancer. With age may come more sexual partners and, thus, a greater chance of contracting HPV, which is a known precursor to cervical cancer. Also with age, women may have children, which could increase their risk of getting cervical cancer. Furthermore, habits like smoking are usually age-dependent, and so is the use of birth control. Here, I will create a visualization of the age distribution of those surveyed in order to make meaningful insights about the risk factors later on in the project.

In [10]:
(df['Age']).plot.hist(bins=20, title='Age Distribution of Risk Data Participants') 
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1af3f6c18b0>

Question 1: How does the age of first sexual intercourse affect the possibility of getting cancer due to the spread of HPV?

My hypothesis is that those who experienced their first sexual encounter at an earlier age may come into contact with HPV sooner. Women are encouraged to get screened for HPV regularly to find any abnormalities early on. Those who became sexually active sooner may be tested earlier, revealing a positive diagnosis for HPV. Here, I will graph the age distribution of when women said they had their first sexual encounter. Then, I will check the HPV diagnosis column and pull out who had answered that they had received that diagnosis and at what age they had become sexually active.

In [12]:
(df['First sexual intercourse']).plot.hist(bins=22, title='Age Distribution of First Sexual Intercourse of Risk Data')
print(df['First sexual intercourse'].mean(),'is the average age that women in the dataset were first sexually active')
16.995299647473562 is the average age that women in the dataset were first sexually active
In [11]:
f = df.loc[df['Dx:HPV'] == 1] #filter for those who were diagnosed with HPV
f['First sexual intercourse'].value_counts().sort_index().plot.pie(autopct='%1.f%%',
                                                      title='Age of First Sexual Encounter for Those Diagnosed with HPV')
print(f['First sexual intercourse'].mean(),'is the average age at which the women diagnosed with HPV were first sexually active')
17.833333333333332 is the average age at which the women diagnosed with HPV were first sexually active

The first histogram, showing the distribution of the age at which women first had sexual intercourse, has two major peaks at around ages 15-16 and 18-19. These two spikes sit on either side of the average age at which a woman surveyed had her first sexual experience: about 17. In the pie chart that follows, I have graphed the ages of first sexual intercourse for those who had been diagnosed with HPV. The average age of first sexual intercourse among the women diagnosed with HPV was 17.8, almost a year older than the average of the whole dataset. This leads me to reject my hypothesis that women who were sexually active at a younger age would make up more of the HPV-diagnosed subset of participants. As 28% of the women diagnosed with HPV were first sexually active at 18, it seems appropriate to encourage young sexually active women to also be screened for HPV around that age. However, the American Cancer Society states that women should be screened starting at 25 years old.
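When comparing group averages like these, it can help to look at the group sizes at the same time, since a mean based on very few participants is less stable. The following is a sketch on made-up values (not the dataset) showing one way to get both numbers in a single table; toy and summary are illustrative names.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Dx:HPV": [0, 0, 0, 0, 1, 1],
    "First sexual intercourse": [15, 16, 17, 18, 17, 19],
})

# Mean age per group alongside the number of non-missing answers in that group.
summary = toy.groupby("Dx:HPV")["First sexual intercourse"].agg(["mean", "count"])
print(summary)
```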

Question 2: How does age relate to the number of sexual partners and the spread of HPV?

The next visualization I am interested in is the age vs the number of sexual partners. My immediate hypothesis would be that with age, women accumulate more sexual partners. As we know, as the number of partners increases, so does the risk of contracting a sexually transmitted disease. A woman with more sexual partners may have a higher chance of getting HPV and thus, have a higher likelihood of getting cervical cancer due to abnormal cells in the cervix. To view the relationship between age and number of sexual partners, I can graph a scatter plot.

In [13]:
df.plot.scatter(x='Age',y='Number of sexual partners',
                alpha=.3, figsize=(12,5), title='Relationship Between Age & Number of Sexual Partners')
print('The average number of sexual partners of women in this dataset is:',df['Number of sexual partners'].mean())
The average number of sexual partners of women in this dataset is: 2.527644230769231
In [14]:
n = df.loc[df['Dx:HPV'] == 1] #filter to look at those who were diagnosed with HPV
s = df.loc[df['STDs'] == 1] #filter to look at those who reported having had an STD

# Output: Number of partners ; How many women were diagnosed with HPV who had that quantity of sexual partners
#n['Number of sexual partners'].value_counts().sort_index()  
#n['Number of sexual partners'].mean()

# Output: Number of partners ; How many women were diagnosed with an STD who had that quantity of sexual partners
#s['Number of sexual partners'].value_counts().sort_index()
#s['Number of sexual partners'].mean()

So, this is interesting. It does not reflect what I anticipated. As I mentioned, I believed the graph would have a slight upward trend, with women having a greater number of sexual partners as they age. As shown above, most of the data hovers between 1 and 5 sexual partners regardless of age. The data includes only a small sample of older women, so it may not be an accurate representation of that whole generation. Furthermore, one explanation could be a change in societal standards and social perceptions: attitudes towards sex and women have changed drastically in recent years, and perhaps the trend is shifting toward younger women having more sexual partners. Research has also shown that women are getting married later and later, so the argument may be made that older women today married sooner and therefore had fewer sexual partners over time, whereas women now may spend more of their youth single, resulting in more sexual partners. For example, the data point representing the 70-year-old woman shows that she had one sexual partner throughout her life, perhaps her husband. While this visualization does not capture all of these points, it raises some questions and associated conclusions, which is great for now.

Furthermore, it does not seem that those with more partners have a higher rate of getting HPV or, more generally, STDs. The vast majority of those who contracted HPV or an STD still had few sexual partners. The average number of sexual partners differs by only approximately 0.3 between all participants and those who contracted either HPV or an STD.
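The visual impression from the scatter plot, that partner counts do not clearly rise with age, could also be checked numerically with a rank correlation. The sketch below uses made-up values and computes Spearman's rho as the Pearson correlation of the ranks; the names toy, pairs, and rho are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages and partner counts, including one missing answer.
toy = pd.DataFrame({
    "Age": [18, 22, 25, 31, 40, 52],
    "Number of sexual partners": [1, 3, 2, np.nan, 2, 4],
})

pairs = toy.dropna()  # keep only rows where both values are present
# Spearman's rho equals the Pearson correlation of the ranks (ties get average ranks).
rho = pairs["Age"].rank().corr(pairs["Number of sexual partners"].rank())
print(round(rho, 3))
```

A value near 0 would support the reading that there is little monotonic relationship, while a value near 1 would indicate that partner counts do tend to rise with age. A rank-based measure is used here because partner counts are skewed and contain a few large values.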

Question 3: Hormonal contraceptives and smoking are believed to increase risk for cervical cancer. Is this true?

Another relationship I may want to look at is that of age with either use of hormonal contraceptives or smoking. In order to determine whether these are good risk factors to evaluate for this study, there needs to be some population of the whole participant group that uses contraceptives or smokes. Both factors are said to increase chances of contracting cervical cancer. By creating a visualization of these variables, I can see whether they have a substantive population within the dataset to analyze the risk.

I suspect that more women will have used or are currently using hormonal contraceptives than those who smoke.

In [97]:
#pd.unique(df['STDs (number)'])
fig, ax = plt.subplots(1, 2, figsize=(20,5))
fig1 = df.groupby("Hormonal Contraceptives").Age.plot.hist(ax=ax[0], alpha=.5, legend=True,
                                                           bins=20,title='Age vs Hormonal Contraceptive Use')
fig2 = df.groupby("Smokes").Age.plot.hist(ax=ax[1], alpha=.5,legend=True, bins=20,title='Age vs Smokes')
# density=True,
ax[0].legend(['Does not use HC','Uses HC'])
ax[1].legend(['Does not smoke','Smokes'])
Out[97]:
<matplotlib.legend.Legend at 0x216a71fecd0>

The data for the first graph on the left is very interesting. After a quick check, 481 participants have used hormonal birth control and 269 have not. Despite its relatively high prevalence among women, hormonal birth control such as the pill may still have unintended side effects or risks. As of now, doctors recognize an increase in blood clotting and/or depression as two of the most concerning health risks of taking hormonal contraceptives. Research states that taking contraceptive pills for over 5 years increases the risk of cervical cancer, and that the more years a woman takes the pill, the higher her risk becomes. Here, the graph shows that more than half of the population of this study, who are mostly younger women in their 20s and 30s, are using hormonal contraceptives. This could be a major risk factor down the line for developing cervical cancer.

Smoking is another risk factor that increases a woman's likelihood of being diagnosed with cervical cancer. Tobacco by-products have been found in the mucus of the cervix in past studies. Tobacco also damages the DNA of the cervical cells and lessens their immune response capabilities, making them more susceptible to HPV. Looking at the graph above on the right, I expected more of the population of the study to be smokers. However, this non-normalized view shows that smokers are a clear minority of the group. Still, although the prevalence of smoking has severely decreased over time, this shows there are still women who smoke and who may be increasing their risk of getting cancer.
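Because the histograms above show raw counts, age groups with more participants dominate the picture. A normalized view could be sketched as follows, on made-up values; the age band edges and the names toy and shares are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up ages and smoking answers for illustration.
toy = pd.DataFrame({
    "Age": [19, 23, 27, 34, 38, 45, 51, 29],
    "Smokes": [0, 0, 1, 0, 1, 1, 0, 0],
})

# Bin ages into bands; the band edges are arbitrary choices.
toy["Age band"] = pd.cut(toy["Age"], bins=[0, 30, 40, 100],
                         labels=["<=30", "31-40", ">40"])

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(toy["Age band"], toy["Smokes"], normalize="index")
print(shares)
```

Each row then reads as the share of non-smokers and smokers within that age band, which makes bands of different sizes directly comparable.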

Question 4: Are there any factors that seem to lead to malignancies at a higher rate?

This dataset uses 4 target variables as possible measures of testing for malignancies: Hinselmann, Schiller, Citology, and Biopsy. Hinselmann and Schiller are two testing methods of staining and identifying irregular cellular appearance under a microscope. Cytology (spelled Citology in the dataset) and biopsy are two other examination methods for screening for cancerous cells, requiring stained smears and surgical removal of tissue, respectively. If a woman had one of these tests performed and the sample showed as malignant, it was marked as such in the data. Note that a malignant sample in one target variable does not mean a participant shows a malignancy using a different test, nor does it necessarily mean she has cervical cancer.

In [98]:
df[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV',
   'Hinselmann','Schiller','Citology','Biopsy']].sum()
Out[98]:
Smokes                     123.0
Hormonal Contraceptives    481.0
IUD                         83.0
STDs                        79.0
Dx:Cancer                   18.0
Dx:HPV                      18.0
Hinselmann                  35.0
Schiller                    74.0
Citology                    44.0
Biopsy                      55.0
dtype: float64

As previously noted, I classified the binary variables as quantitative so I could use the sum() and mean() functions. The sum of each column tells us how many participants in the study smoke (123), use or have used hormonal contraceptives (481), have or have had an IUD (83), have had STDs (79), have a cancer diagnosis (18), and have an HPV diagnosis (18). It also tells us that 35 participants showed malignancies via the Hinselmann test, 74 with Schiller, 44 from cytology, and 55 from a biopsy. Looking at these counts, I find some of them hard to believe. HPV is the most commonly transmitted STI, with approximately 80% of the sexually active population infected (perhaps with no symptoms). It is a red flag for the validity of the data or the survey questions that only 18 of 858 participants answered that they had been diagnosed with HPV. Researchers believe that women tend to under-report, whereas men over-report, their sexual history in surveys. This may have been the case in this dataset.

The mean of the risk factors can tell us the proportion of the ones. By using the mean function, I can tell the proportion of the risk factors among the participants who showed malignancy in each test. The proportion may have insights about which risk factor has more relevance to showing cellular signs of cancer. My hypothesis is that STDs and HPV diagnosis will lead to later malignancies in the four tests.

In [15]:
Hins1 = pd.DataFrame(df.loc[df['Hinselmann']==1])
Schil1 = pd.DataFrame(df.loc[df['Schiller']==1])
Cit1 = pd.DataFrame(df.loc[df['Citology']==1])
Bio1 = pd.DataFrame(df.loc[df['Biopsy']==1])
# Filter each test for those whose samples showed malignancies

fig, ax = plt.subplots(1, 4, figsize=(20,5))
Hins1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[0],
                                                                                             title='Hinselmann Test')
Schil1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[1],
                                                                                              title='Schiller Test')
Cit1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[2],
                                                                                            title='Citology Test')
Bio1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[3],
                                                                                            title='Biopsy Test')
fig.suptitle('Risk Factor Association to Malignant Samples from Four Different Tests')
Out[15]:
Text(0.5, 0.98, 'Risk Factor Association to Malignant Samples from Four Different Tests')

My hypothesis was not accurate at all for this question. Hormonal contraceptives have the highest proportion among those who showed malignancies. However, this may simply reflect the much larger number of participants who have used hormonal contraceptives. Out of 858 participants, 481 use hormonal contraceptives, whereas only 18 reported an HPV diagnosis and a mere 35 showed malignancies via Hinselmann. I've decided this may be inconclusive due to missing data or survey bias where women misreported due to embarrassment, etc.

Following contraceptives, smoking has the highest proportion in the Hinselmann test, smoking and STDs in the Schiller test, and STDs in the citology and biopsy examinations. I am surprised that HPV diagnosis did not have a higher association.
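One way to account for the base-rate issue described above is to compare each factor's proportion among the malignant samples with its proportion among all participants. This is only a sketch on made-up 0/1 data, not a result from the dataset; the names toy, factors, overall, positive, and ratio are introduced for illustration.

```python
import pandas as pd

# Made-up 0/1 data: ten participants, three with a positive biopsy.
toy = pd.DataFrame({
    "Hormonal Contraceptives": [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    "STDs":                    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "Biopsy":                  [0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
})
factors = ["Hormonal Contraceptives", "STDs"]

overall = toy[factors].mean()                           # base rate among everyone
positive = toy.loc[toy["Biopsy"] == 1, factors].mean()  # rate among positive samples
ratio = positive / overall                              # above 1 means over-represented
print(pd.DataFrame({"overall": overall, "positive": positive, "ratio": ratio}))
```

In this toy example the common factor has the higher raw proportion among the positive samples, but the rarer factor is more strongly over-represented relative to its base rate. A ratio like this is descriptive only and does not establish that a factor causes malignancy.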

Discussion of Missing Data

Next, I would like to address the elephant in the dataset: the missing values. Most of the data is binary, so participants had to choose between yes and no. Because this data is about health, lifestyle choices, sexual experiences, and other sensitive and stigmatized information, I believe the data is missing not at random (MNAR). Most of the variables are self-reported rather than something observed about the participant. Every participant knew their age, for example, so that column has no missing values. This suggests that the data filled in with "??" was something that the participant did not want to disclose (like number of STDs), or simply may not have known (like number of sexual partners). (How did 117 people not know whether they had had an IUD when it is an insertion done by a doctor?) Having a full data table with no missing values may have significantly changed some results or led to more questions. With more time, I would have liked to delve further into how to address the missing data using different types of imputation and to determine which one introduces the least error. This may have given more proportionally relevant conclusions. Below, I count the number of missing values within each column.

In [100]:
df.isnull().sum()
Out[100]:
Age                                              0
Number of sexual partners                       26
First sexual intercourse                         7
Num of pregnancies                              56
Smokes                                          13
Smokes (years)                                  13
Smokes (packs/year)                             13
Hormonal Contraceptives                        108
Hormonal Contraceptives (years)                108
IUD                                            117
IUD (years)                                    117
STDs                                           105
STDs (number)                                  105
Dx:Cancer                                        0
Dx:CIN (Cervical Intraepithelial Neoplasia)      0
Dx:HPV                                           0
Dx                                               0
Hinselmann                                       0
Schiller                                         0
Citology                                         0
Biopsy                                           0
dtype: int64
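
As a starting point for the imputation idea mentioned above, here is a minimal sketch of reporting the share of missing values per column and filling gaps with the median (numeric columns) or the mode (binary columns). It runs on a small made-up frame, not the study data, and the choice of median/mode is an assumption for illustration. If the data really is missing not at random, simple fills like these can bias the results.

```python
import numpy as np
import pandas as pd

# Made-up illustrative records, NOT the study data.
toy = pd.DataFrame({
    "Age": [18, 25, 34, 41, 29],
    "Number of sexual partners": [2.0, np.nan, 3.0, 1.0, np.nan],
    "IUD": [0.0, 1.0, np.nan, 0.0, 0.0],
})

# Fraction of rows missing in each column (complements the raw counts)
missing_share = toy.isnull().mean()

imputed = toy.copy()
# Numeric count: fill with the column median
partners = "Number of sexual partners"
imputed[partners] = imputed[partners].fillna(imputed[partners].median())
# Binary indicator: fill with the most frequent value
imputed["IUD"] = imputed["IUD"].fillna(imputed["IUD"].mode()[0])

print(missing_share)
print(imputed)
```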

Exploratory Analysis of Additional Dataset: Cervical Cancer with a Larger Scope

Now that we have identified individualized risk factors that may lead to cervical cancer, we can visualize some proposed external relationships. I want to include the macro level infrastructure that could influence cervical cancer rates on a larger scale. I have chosen poverty data and healthcare data to see how they may be associated with cervical cancer rates within states.

First, I want to get a better idea of where the states fall in this health care ranking system. Based on the table, Mississippi has the worst overall ranking of the 50 states and is ranked just as poorly in each individual subcategory. I started off by creating a graph of all 50 states with their ranks, but it was overcrowded and overwhelming. Thus, I have taken the 5 states with the best-ranked health care systems and the 5 with the lowest ranks to graph them and view the differences.

In [101]:
full_by_rank = full_additional.set_index('Rank')
full_first = full_by_rank.loc[[1,2,3,4,5]]
full_last = full_by_rank.loc[[46,47,48,49,50]]
rank_min_max = pd.concat([full_first,full_last])
rank_min_max = rank_min_max.reset_index()
#rank_min_max
In [102]:
rank_min_max.plot.barh(x='State',y=['Rank','HC_Access','HC_Quality','P_Health'], figsize=(10,9),
                       title='Overall Health Care Rank from Worst (Top) to Best (Bottom)')
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x216a74928e0>

The longer the bar, the worse the health care ranking. So the uppermost 5 states, Mississippi, Arkansas, West Virginia, Oklahoma, and Alabama, have the worst-ranked health care in the nation. This graph also shows the subcategories making up that ranking: access, quality, and public health. Conversely, the bottommost state, Hawaii, is ranked as having the best health care in the nation. I am most interested in looking at the access subcategory along with the rank. Access to health care is a huge issue nationwide and in developing nations. Poor access to health care can hinder diagnosis and treatment of cervical cancer, in particular.

Question 5: With better healthcare infrastructure in a state, will there be lower incidence of cervical cancer?

I want to look at how the health care ranking data relates to the cancer rates data. My original hypothesis is that there is some correlation between the two sets. I think that with a better ranking and a more accessible health care system, cervical cancer rates will be lower. States with worse health care infrastructure in place will have a higher incidence of cervical cancer. Perhaps women in those states face a lack of quality health care, lack of quality physicians, lack of educational health material, lack of education, increase in smoking, higher rates of STDs (perhaps resulting from the lack of education), etc. If there is no correlation, I would be surprised.

In [103]:
fig, ax = plt.subplots(1, 2, figsize=(15,5))
#fig1= full_additional.plot.scatter(ax=ax[0],x='Rank',y='Cancer_Rate',title='Health Care Rank vs Cancer Rate')
fig1 = sns.regplot(data=full_additional, x='Rank', y="Cancer_Rate",ax=ax[0], fit_reg=True, marker="o")
# add regression line using seaborn
fig2 = sns.regplot(data=full_additional, x='HC_Access', y="Cancer_Rate",ax=ax[1], fit_reg=True, marker="o")

My hypothesis appears to be supported. States with the worst rank and the least accessible health care also seem to have the highest cancer rates. The upward-sloping regression lines show the positive correlation.
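
A regression line shows the direction of a relationship but not its strength, so a correlation coefficient is a useful companion. The following is a minimal sketch on made-up values (not the state data) that computes Pearson's r and, because `Rank` is ordinal, a Spearman coefficient obtained as Pearson's r on the ranked values. The `toy` frame and variable names are hypothetical.

```python
import pandas as pd

# Made-up illustrative values, NOT the state data.
toy = pd.DataFrame({
    "Rank":        [1, 10, 20, 30, 40, 50],
    "Cancer_Rate": [6.1, 6.8, 7.4, 7.2, 8.5, 9.0],
})

# Pearson correlation: strength of the linear relationship
pearson_r = toy["Rank"].corr(toy["Cancer_Rate"])

# Spearman correlation: Pearson correlation of the ranked values,
# which only assumes a monotonic relationship
spearman_r = toy["Rank"].rank().corr(toy["Cancer_Rate"].rank())

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```

Applied to `full_additional`, the same two calls on `Rank` (or `HC_Access`) and `Cancer_Rate` would put a number on what the regression plots suggest.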

Question 6: Does a greater population inevitably mean a higher prevalence of cervical cancer among those who live there?

Next, I want to work a little bit with the population data at the end of the dataframe. My intuition tells me that a greater population comes with a greater number of cervical cancer cases. The more people there are in a state, the more people are diagnosed with cervical cancer. Are there any outliers?

In [104]:
full_additional.plot.scatter(x='Population',
                             y='Cancer_Count', alpha=.3, s=32,
                             title='Overall State Population vs Number of People in the State with Cancer')
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x216a8cc4e20>

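So, that appears to be sound logic. This graph shows a high correlation between cancer count and population, which confirms my initial idea. Note that this is a relationship between raw counts and population size, which is why the other questions use cancer rates rather than counts to compare states. There do not seem to be any outlying states to look into further, either.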
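
Since raw counts grow with population almost by definition, a per-capita rate makes states of different sizes comparable. The following is a minimal sketch computing a crude rate per 100,000 people on made-up values (not the state data). The column name `crude_rate_per_100k` is hypothetical, and the `Cancer_Rate` column used elsewhere in this analysis may be computed differently by its source (for example, age-adjusted).

```python
import pandas as pd

# Made-up illustrative values, NOT the state data.
toy = pd.DataFrame({
    "State":        ["A", "B", "C"],
    "Population":   [1_000_000, 5_000_000, 20_000_000],
    "Cancer_Count": [80, 350, 1500],
})

# Crude rate: cases per 100,000 residents
toy["crude_rate_per_100k"] = toy["Cancer_Count"] / toy["Population"] * 100_000

# The state with the most cases is not necessarily the one with the highest rate
most_cases = toy.loc[toy["Cancer_Count"].idxmax(), "State"]
highest_rate = toy.loc[toy["crude_rate_per_100k"].idxmax(), "State"]
print(toy)
print(most_cases, highest_rate)
```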

Question 7: Do states with higher rates of poverty have less accessible healthcare and does this affect cervical cancer rates?

I would like to incorporate the poverty data to test my theory that low-ranked health care, inaccessible health care, and higher cancer rates may be due to a poverty factor. My hypothesis is that poorer states may have worse-ranked health care. Those living in poorer states may find health care inaccessible due to costs. Maybe they do not go for a checkup or annual exam because they can't afford it. We previously spotted the correlation between higher cancer rates and inaccessible health care infrastructure. Perhaps this entire chain of events is due to the poverty levels of those in that state.

In [105]:
full_additional.plot.scatter(x='Percent_poverty',
                             y='Cancer_Rate', c='HC_Access',cmap='cividis',alpha=.5, figsize=(10,6), s=40,
                            title='Percentage in poverty vs Rates of new cancer cases with Access to HC per state colored')
#fig2= full_additional.plot.scatter(ax=ax[1], x='Percent_poverty',
#                             y='HC_Access', alpha=.5, title='Percentage of those in poverty vs Access to health care')
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x216a77f2eb0>

On the X axis of the graph above, we have 'percentage of people in poverty'. The hypothesis is that cancer rate is dependent on that percentage. This scatter plot shows a clear positive correlation between the two variables. By adding in the access to health care variable to color the points, we can see how the ranking for health care access plays into this correlation. As seen above, the least accessible states (yellow) seem to be concentrated in the upper right corner of the graph, corresponding to higher rates of cancer and higher poverty levels. The lower left side of the graph shows a greater concentration of dark dots (corresponding to the best accessibility), with slightly lower poverty and slightly lower cancer rates. This is a pretty rough approximation and conclusion from this graph, but it does show some semblance of correlation between the three factors. Very loosely, less access to health care relates to higher poverty percentages and greater cancer rates.
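
To check the three-way relationship less loosely than by eye, the pairwise correlations between the three variables can be tabulated. The following is a minimal sketch on made-up values (not the state data); the `toy` frame is hypothetical, and a correlation matrix still says nothing about causation or about which variable drives the others.

```python
import pandas as pd

# Made-up illustrative values, NOT the state data.
toy = pd.DataFrame({
    "Percent_poverty": [9.0, 11.5, 13.0, 15.5, 18.0, 19.5],
    "HC_Access":       [3, 12, 25, 31, 44, 49],
    "Cancer_Rate":     [6.3, 6.9, 7.1, 7.8, 8.4, 8.9],
})

# Pairwise Pearson correlations between all three columns
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

On the real data, this would be `full_additional[['Percent_poverty', 'HC_Access', 'Cancer_Rate']].corr()`.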

Conclusions

By looking at the individual level, I was able to identify possible risk factors and how they were associated with multiple tests for sample malignancies. Due to missing data, it was difficult to draw firm conclusions about these factors. There seems to be little correlation within the risk dataset between number of sexual partners or age at first sexual intercourse and the spread of STDs and HPV. However, women should be encouraged to see their doctor and get screened for cancer when they become sexually active, or around the age of 17. There were issues when determining the proportions of the malignancy samples from each test and their associated risk factors due to inconsistencies in the participant data.

By visualizing some aspects of the statewide data, we were able to see that there may be an indirect effect of poverty on cervical cancer rates through access to health care. This is great news (in the context of this project) as it aligns with some of my initial questions and hypotheses. We may be able to draw some conclusions internationally. While studying the health care system in Peru, I saw an immediate lack of access due to transportation to the closest health clinic. By boat, it took over an hour to get from one of the communities to the closest health care technician (not a doctor). Getting results back from a screening test, communicating the results to the patient, and scheduling a follow-up visit were all rare occurrences. After looking at the data on both an individual and nationwide level, what can this mean for cervical cancer in Peruvian women living on the Amazon?

Special thanks to Dr. Nick Mattei, Eli Mendels & Sri Korrapati