Cervical Cancer Risk Classification:
Retrieved from: https://www.kaggle.com/loveall/cervical-cancer-risk-classification
More Information found at: https://ieeexplore.ieee.org/document/8070120
United States Department of Agriculture Economic Research Service: https://data.ers.usda.gov/reports.aspx?ID=17826
CDC U.S. Cancer Cases: https://gis.cdc.gov/Cancer/USCS/DataViz.html
U.S. News Health Care Rankings: https://www.usnews.com/news/best-states/rankings/health-care
Forty years ago, cervical cancer was one of the leading causes of cancer death for women in the United States. Because it is one of the most preventable types of cancer, this statistic left much to be desired from the healthcare infrastructure of the time. Thanks to early detection of abnormal cells within the cervix and more regularly scheduled Pap test screenings, the United States was able to bring down the mortality rate from cervical cancer. Sadly, this cannot be said for developing healthcare systems around the world. Many women worldwide do not have the access to healthcare that would enable early detection through screening in their youth for the human papillomavirus (HPV) types that increase their risk of cervical cancer in the future. Women with abnormal cells at a young age who do not get regular examinations are at a higher risk of localized cancer, which can lead to invasive cancer by the age of 50. While cervical cancer rates have declined in the US, death rates for African American women are twice as high as those for Caucasian women, while rates of invasive cervical cancer in Hispanic women are more than twice those of Caucasian women. These racial disparities may result from less advanced healthcare systems worldwide, socioeconomic patterns, and low screening rates due to high poverty levels. They may also result from a lack of access to transportation, health insurance, or language translators.
Cervical cancer involves multiple risk factors and must be diagnosed by a healthcare professional after a biopsy or other type of examination. Increased sexual activity raises the risk of contracting HPV, the main risk factor for cervical cancer. HPV is a sexually transmitted infection that can cause abnormal cell growth within the cervix, which a Pap test screens for. Family history of cervical cancer can also be a risk factor, along with the use of oral hormonal contraceptive pills. Other risk factors include, but are not limited to, smoking, past STDs, number of children, and history of diagnosed cervical cancer.
The purpose of this project is to examine data regarding individual cervical cancer risk factors and, hopefully, draw meaningful insights about associations between certain risk factors and poverty levels, healthcare structures, or current racial disparities. Individual-level analyses will be evaluated alongside state-level analyses to build a fuller picture of cervical cancer in women in America.
With this project, I hope to analyze cervical cancer risk factors on an individual level.
I hypothesize that I will be able to find trends in the following:
I also hope to relate cervical cancer data per state to other external factors that may be an indirect cause of increased cancer rates. I hope to create insights regarding the following:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
df = pd.read_csv(r"..\schiariello.github.io\project_data\cervical_raw.csv")
# Create values of NaN where there are currently ?, in accordance with best practices for tidy data.
display(df.replace("?", np.nan))
df.replace("?", np.nan,inplace=True)
# Drop the columns I do not need for the purposes of this evaluation
df.drop(columns=df.columns[13:28], inplace=True)
# Do not run this block more than once as it will delete the 13th-27th columns
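The warning above comes from dropping columns by position: a second run removes whatever now sits in those positions. A minimal sketch of an alternative, assuming the unwanted columns can be listed by name (the toy frame and the names `extra_1` and `extra_2` are hypothetical placeholders), is to drop by label with `errors="ignore"` so that re-running the cell is harmless:

```python
import pandas as pd

# Hypothetical frame; the column names are placeholders, not the real dataset.
toy = pd.DataFrame({"Age": [25, 31], "extra_1": [0, 1], "extra_2": [1, 0]})
unwanted = ["extra_1", "extra_2"]

# Dropping by label: labels that are already gone are skipped silently.
toy = toy.drop(columns=unwanted, errors="ignore")
toy = toy.drop(columns=unwanted, errors="ignore")  # second run changes nothing
print(list(toy.columns))
```

The trade-off is that the names must be typed out once, but the cell no longer depends on how many times it has been executed.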
# List the variable types for each column of the data, to ensure they are as we intend to use them later.
# float64, int64 = quantitative
# object = categorical
# Use df.dtypes first to see how Pandas is characterizing the variables
# This block will convert from categorical to quantitative and vice-versa as deemed necessary by me.
df['Number of sexual partners'] = pd.to_numeric(df['Number of sexual partners'], errors="coerce")
df['First sexual intercourse'] = pd.to_numeric(df['First sexual intercourse'], errors="coerce")
df['Num of pregnancies'] = pd.to_numeric(df['Num of pregnancies'], errors="coerce")
df['Smokes'] = pd.to_numeric(df['Smokes'], errors="coerce")
df['Smokes (years)'] = pd.to_numeric(df['Smokes (years)'], errors="coerce")
df['Smokes (packs/year)'] = pd.to_numeric(df['Smokes (packs/year)'], errors="coerce")
df['Hormonal Contraceptives'] = pd.to_numeric(df['Hormonal Contraceptives'], errors="coerce")
df['Hormonal Contraceptives (years)'] = pd.to_numeric(df['Hormonal Contraceptives (years)'], errors="coerce")
df['IUD'] = pd.to_numeric(df['IUD'], errors="coerce")
df['IUD (years)'] = pd.to_numeric(df['IUD (years)'], errors="coerce")
df['STDs'] = pd.to_numeric(df['STDs'], errors="coerce")
df['STDs (number)'] = pd.to_numeric(df['STDs (number)'], errors="coerce")
df.dtypes
By running the .dtypes command, we can see how pandas is categorizing each variable. There are specific ways in which I want to categorize the data: either quantitative or categorical. For the purposes of this project, I have chosen to characterize the binary entries in the columns like: Smokes, Hormonal Contraceptives, IUD, and STDs as quantitative instead of categorical. I have chosen this method so I can perform analyses on the data later using the number or proportion of ones.
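To illustrate why storing the binary answers as numbers is convenient, here is a small sketch on made-up values (the toy column below is hypothetical and only borrows the name `Smokes`). With 0/1 coding, `sum()` counts the ones and `mean()` gives their proportion among the rows that have an answer, since pandas skips NaN by default:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a binary risk-factor column.
toy = pd.DataFrame({"Smokes": [1, 0, 0, 1, np.nan, 0]})

n_smokers = toy["Smokes"].sum()        # number of ones; NaN is skipped
n_answered = toy["Smokes"].count()     # rows with a recorded answer
share_smokers = toy["Smokes"].mean()   # ones divided by non-missing rows

print(n_smokers, n_answered, share_smokers)
```

Note that the proportion is taken over non-missing rows only, so columns with many missing answers are effectively summarized over a smaller group.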
In addition to the above dataframe showing various risks of cervical cancer in women, I will read in other datasets to assist with an insightful analysis.
The following dataframe refers to the CDC data regarding Cervical Cancer Cases in the U.S. by state. Nationally between the years 2013 and 2017, there were 64,167 new cases of cervical cancer reported to the CDC among women. During this time frame, 20,902 women died of cervical cancer. Nationally, this represents 8 new cervical cancer cases reported and 2 resulting deaths per 100,000 women.
The following data reflects the age adjusted rate per 100,000 women of new cervical cancer cases in each state, in addition to the total population of the state and the actual number of women reporting new cases of cervical cancer in that state. This is a direct application of the initial dataframe regarding cervical cancer risk factors. I will read in the data and create a visual dataframe.
cancer_df = pd.read_csv(r"..\schiariello.github.io\project_data\cancer_per_state.csv") #read in data
cancer_df = cancer_df.drop(['CancerType','Sex','Year','Race'], axis=1) #drop the unnecessary columns
cancer_df.rename({'Area':'State'}, axis=1, inplace=True) #standardize the column name of 'State', switch 'Area' to be standard
cancer_df.head()
The following was taken from a table shown in the U.S. News, "measuring how well states are meeting citizens' health care needs." This evaluation of health care was based on three factors in order to determine the rank of each state's health care system, and thus, the citizens' overall quality of health. The ranking methodology is vague and has its limitations, so it should be taken with a grain of salt for further applications. However, it is a good benchmark for the purposes of this project. It should be noted that this data covers the entire population of the state, not only women.
The factors shown in the following data are as follows: access to care (HC_Access), quality of care (HC_Quality), and overall health of the population (PH = public health). As stated in the article, they accounted for the percentage of adults without health insurance, the percentage of those who have not had a routine checkup in the last year, and the share of people who went without medical attention due to high costs. Those who ran the study also took into account general measures of good physical and mental health, infant mortality, and overall mortality rates. More can be read on the site, linked at the top of the page.
This data is useful to the project as I will be comparing state health care quality, access, and perception of public health to new cervical cancer rates in the state, poverty of the state, and generalized cervical cancer risk factors. Here, I will read in the dataframe.
ranking_df = pd.read_csv(r"..\schiariello.github.io\project_data\HC_Rank.csv")
ranking_df.head()
The following dataframes were created as a reflection of poverty by state in America. The "percent" dataframe shows the percentage of the total population living in poverty along with the percentage of children under the age of 18 living in poverty. Similarly, the "number" dataframe shows the actual number of total people living in poverty per state, along with children under the age of 18 living in poverty per state. Lower and upper bounds are also included in case we wish to use them at a later point.
This data represents one of the many external factors that may raise someone's risk of contracting HPV or developing cervical cancer. Inability to pay for proper medical care, insurance, or annual routine exams could delay the recognition of HPV or cervical cancer in a patient. Poverty and socioeconomic class, along with access to health care, the number of primary care clinics in an area, and the availability of screening tests or appropriate appointment times, can be factors that indirectly lead to a woman developing cervical cancer without knowing it. Looking at poverty percentages and numbers is just one application of a possible external "risk" of cervical cancer. Here, I create two separate dataframes from one file. I then merge all three of the additional dataframes into one full table.
poverty2 = pd.read_csv(r'..\schiariello.github.io\project_data\PovertyReport2.csv') #read in data
poverty2 = poverty2.drop('Textbox98',axis=1)
poverty2 = poverty2.drop('Textbox99',axis=1) #drop unnecessary columns
percent = pd.DataFrame(poverty2.loc[0:51]) #take the top half of the data which is the percentage data, create DF
number = pd.DataFrame(poverty2.loc[54:]) #take the bottom half of the data which is the number data, create separate DF
percent.columns = ['State','Total','min_total','max_total','under18','min_under18','max_under18']
number.columns = ['State','Total','min_total','max_total','under18','min_under18','max_under18']
#standardize columns to be meaningful variable names ^^
percent.drop(percent.index[27], inplace=True) #drop the row that has 'National' data
percent.reset_index(drop=True,inplace=True) #reset the index so it includes the 27th entry
#percent
number.drop(number.index[27], inplace=True) #same as above^^
number.reset_index(drop=True,inplace=True)
#number.head()
number_percent_pov = percent.merge(number,on=['State'],how='inner') #create a full dataframe for poverty
number_percent_poverty = number_percent_pov[['State','Total_x','Total_y']] #only take the columns I wish to use for now
#number_percent_poverty
number_percent_rank = number_percent_poverty.merge(ranking_df,on=['State'],how='outer') #merge with HC rank DF, outer will include DC
full_additional = number_percent_rank.merge(cancer_df,on=['State'],how='outer') #merge the poverty and rank DF with the cancer rates DF
full_additional.columns = ['State','Percent_poverty','Pop_poverty','Rank','HC_Access','HC_Quality','P_Health','Cancer_Rate','Cancer_Count','Population']
# set column titles to make sense and be informative to a reader
full_additional.head(10) #show all the state's data on what I thought would be some interesting qualitative data
# check and fix the dtypes
full_additional['Percent_poverty'] = pd.to_numeric(full_additional['Percent_poverty'], errors="coerce")
#full_additional['Pop_poverty'] = full_additional['Pop_poverty'].str.extract('(\d+)',expand=False).astype(float)
full_additional['Pop_poverty'] = full_additional['Pop_poverty'].replace(',','', regex=True)
full_additional['Pop_poverty'] = pd.to_numeric(full_additional['Pop_poverty'], errors="coerce")
full_additional.dtypes
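Because the tables above are joined on the text column `State`, a spelling difference between files would silently produce rows with missing values. A minimal sketch of one way to check for that, assuming miniature stand-in tables (the state labels and numbers below are hypothetical), uses the `indicator=True` option of `merge` to flag rows found in only one table:

```python
import pandas as pd

# Hypothetical miniature tables; labels and values are illustrative only.
poverty = pd.DataFrame({"State": ["State A", "State B", "State C"],
                        "Percent_poverty": [16.0, 11.0, 15.0]})
rank = pd.DataFrame({"State": ["State A", "State B"],
                     "Rank": [46, 25]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
checked = poverty.merge(rank, on="State", how="outer", indicator=True)
unmatched = checked.loc[checked["_merge"] != "both", ["State", "_merge"]]
print(unmatched)
```

Any row listed in `unmatched` is present in one file but not the other, which is expected for an entry like DC in a 50-state ranking but would be worth investigating otherwise.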
Now that I have tidy data, I want to be able to create meaningful visuals of the data. I can see that 'Age' is the first column of the dataset. Research suggests that age may be an indirect risk factor for cervical cancer. With age may come more sexual partners, and thus, a greater chance of contracting HPV, which is a known precursor to cervical cancer. Also with age, women may have children, which could increase their risk of getting cervical cancer. Furthermore, habits like smoking are usually age-dependent, and so is the use of birth control. Here, I will create a visualization of the age distribution of those surveyed in the data in order to make meaningful insights about the risk factors later on in the project.
(df['Age']).plot.hist(bins=20, title='Age Distribution of Risk Data Participants')
My hypothesis is that those who experienced their first sexual encounter at an earlier age may come into contact with HPV sooner. Women are encouraged to get screened for HPV regularly to find any abnormalities early on. Those who became sexually active sooner may be tested earlier, revealing a positive diagnosis for HPV. Here, I will graph the age distribution of when women said they had their first sexual encounter. Then, I will check the HPV diagnosis column and pull out who had answered that they had received that diagnosis and at what age they had become sexually active.
(df['First sexual intercourse']).plot.hist(bins=22, title='Age Distribution of First Sexual Intercourse of Risk Data')
print(df['First sexual intercourse'].mean(),'is the average age that women in the dataset were first sexually active')
f = df.loc[df['Dx:HPV'] == 1] #filter for those who were diagnosed with HPV
f['First sexual intercourse'].value_counts().sort_index().plot.pie(autopct='%1.f%%',
title='Age of First Sexual Encounter for Those Diagnosed with HPV')
print(f['First sexual intercourse'].mean(),'is the average age at which the women diagnosed with HPV were first sexually active')
The first histogram, showing the distribution of the age at which women had their first sexual experience, has two major peaks at around ages 15-16 and 18-19. The average age at which a woman surveyed had her first sexual experience, 17, falls between the two peaks. The pie chart that follows shows the ages of first sexual intercourse for those who had been diagnosed with HPV. The average age of first sexual intercourse among the women diagnosed with HPV was almost a year older than the average of the whole dataset: 18. This leads me to reject my hypothesis that women who were sexually active at a younger age would make up more of the HPV-diagnosed subset of participants. As 28% of the women diagnosed with HPV were first sexually active at 18, it seems appropriate to encourage young sexually active women to also be screened for HPV around that age. However, the American Cancer Society states that women should be screened starting at 25 years old.
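A difference of about one year between two averages can arise by chance when one of the groups is small. As a rough sketch of how that could be checked, the block below runs a permutation test on synthetic ages (generated with NumPy, not taken from the dataset; the group sizes and spread are assumptions). It compares a small subgroup against the rest rather than against the whole sample, which is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ages, NOT the real dataset: a large group and a small subgroup.
everyone_else = rng.normal(loc=17.0, scale=2.8, size=800)
subgroup = rng.normal(loc=18.0, scale=2.8, size=18)

observed = subgroup.mean() - everyone_else.mean()
pooled = np.concatenate([everyone_else, subgroup])

n_perm = 5000
diffs = np.empty(n_perm)
for i in range(n_perm):
    # Shuffle the group labels by shuffling the pooled ages.
    shuffled = rng.permutation(pooled)
    diffs[i] = shuffled[:subgroup.size].mean() - shuffled[subgroup.size:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(round(observed, 2), p_value)
```

A large p-value would suggest the gap is compatible with random variation, in which case the data would neither support nor refute the hypothesis.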
The next visualization I am interested in is the age vs the number of sexual partners. My immediate hypothesis would be that with age, women accumulate more sexual partners. As we know, as the number of partners increases, so does the risk of contracting a sexually transmitted disease. A woman with more sexual partners may have a higher chance of getting HPV and thus, have a higher likelihood of getting cervical cancer due to abnormal cells in the cervix. To view the relationship between age and number of sexual partners, I can graph a scatter plot.
df.plot.scatter(x='Age',y='Number of sexual partners',
alpha=.3, figsize=(12,5), title='Relationship Between Age & Number of Sexual Partners')
print('The average number of sexual partners of women in this dataset is:',df['Number of sexual partners'].mean())
n = df.loc[df['Dx:HPV'] == 1] #filter to look at those who were diagnosed with HPV
s = df.loc[df['STDs'] == 1] #filter to look at those who reported having an STD
# Output: Number of partners ; How many women were diagnosed with HPV who had that quantity of sexual partners
#n['Number of sexual partners'].value_counts().sort_index()
#n['Number of sexual partners'].mean()
# Output: Number of partners ; How many women were diagnosed with an STD who had that quantity of sexual partners
#s['Number of sexual partners'].value_counts().sort_index()
#s['Number of sexual partners'].mean()
So, this is interesting. It does not reflect what I anticipated. As I mentioned, I believed the graph would have a slight upward trend, with women having a greater number of sexual partners as they age. As shown above, most of the data hovers between 1 and 5 sexual partners no matter the age. The data includes only a small sample of older women, so it may not be an accurate representation of the whole generation. Furthermore, one explanation could be a change in societal standards and social perceptions. Attitudes towards sex and women have changed drastically in the last few years, and perhaps the trend is shifting toward younger women having more sexual partners. That said, research has shown that women are getting married later and later, so the argument may be made that older women today married sooner and so had fewer sexual partners over time. Women now may be spending more of their youth single, resulting in more sexual partners with time. For example, the data point representing the 70-year-old woman shows that she had one sexual partner throughout her life, perhaps her husband. While this visualization does not capture all of these points, it raises some questions and tentative conclusions, which is great for now.
Furthermore, it does not seem that those with more partners have a higher rate of getting HPV, or STDs more generally. The vast majority of those who contracted HPV or an STD still had few sexual partners. The difference in the average number of sexual partners between all participants and those who contracted either HPV or an STD is approximately 0.3.
Another relationship I may want to look at is that of age with either use of hormonal contraceptives or smoking. In order to determine whether these are good risk factors to evaluate for this study, there needs to be some population of the whole participant group that uses contraceptives or smokes. Both factors are said to increase chances of contracting cervical cancer. By creating a visualization of these variables, I can see whether they have a substantive population within the dataset to analyze the risk.
I suspect that more women will have used or are currently using hormonal contraceptives than those who smoke.
#pd.unique(df['STDs (number)'])
fig, ax = plt.subplots(1, 2, figsize=(20,5))
fig1 = df.groupby("Hormonal Contraceptives").Age.plot.hist(ax=ax[0], alpha=.5, legend=True,
bins=20,title='Age vs Hormonal Contraceptive Use')
fig2 = df.groupby("Smokes").Age.plot.hist(ax=ax[1], alpha=.5,legend=True, bins=20,title='Age vs Smokes')
# density=True,
ax[0].legend(['Does not use HC','Uses HC'])
ax[1].legend(['Does not smoke','Smokes'])
The data for the first graph on the left is very interesting. After a quick check, 481 participants have used hormonal birth control and 269 have not. Even with a relatively high prevalence among women, hormonal birth control such as the pill may still have unintended side effects or risks. As of now, doctors recognize an increase in blood clotting and/or depression as two of the most concerning health risks of taking hormonal contraceptives. Research states that taking contraceptive pills for over 5 years increases the risk of cervical cancer. Furthermore, the more years a woman takes the pill, the higher her risk of cervical cancer becomes. Here, the graph shows that more than half of the population of this study, who are mostly younger women in their 20s and 30s, are using hormonal contraceptives. This could be a major risk factor down the line for developing cervical cancer.
Smoking is another risk factor that increases a woman's likelihood of being diagnosed with cervical cancer. Tobacco by-products have been found in the mucus of the cervix in past studies. Tobacco also damages the DNA of the cervical cells and lessens their immune response capabilities, making them more susceptible to HPV. Looking at the graph above on the right, I expected more of the study population to be smokers. However, this non-normalized view gives an accurate representation of how small the minority of smokers in the group is. Still, although the prevalence of smoking has decreased severely over time, this shows there are still women who smoke and who may be increasing their risk of getting cancer.
This dataset uses 4 target variables as possible measures of testing for malignancies: Hinselmann, Schiller, Citology, and Biopsy. Hinselmann and Schiller are two testing methods of staining and identifying irregular cellular appearance under a microscope. Citology and biopsy are two other examination methods for screening for cancerous cells requiring stained smears and surgical removal respectively. If a woman had these tests performed and it showed as a malignant sample, it was marked as so in the data. It should be noted that just because a participant may have had a malignant sample in one target variable does not mean she has a malignancy using a different test, nor does it necessarily mean she has cervical cancer.
df[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV',
'Hinselmann','Schiller','Citology','Biopsy']].sum()
As previously noted, I classified the binary variables as quantitative so I could use the sum() and mean() functions. The sum of the columns can tell us how many participants in the study smoke (123), use/have used hormonal contraceptives (481), have/have had an IUD (83), STDs (79), a cancer diagnosis (18), and an HPV diagnosis (18). This also tells us that 35 participants showed malignancies via the Hinselmann test, 74 with Schiller, 44 from citology, and 55 from a biopsy. Looking at how few people gave these answers in their survey, I find the numbers hard to believe. HPV is the most commonly transmitted STI, with approximately 80% of the sexually active population infected (perhaps with no symptoms). It is a red flag with respect to the validity of the data or the survey questions if only 18 of 859 participants answered that they had been diagnosed with HPV. Researchers believe that women tend to under-report, whereas men over-report, their sexual history within surveys. This may have been the case in this dataset.
The mean of the risk factors can tell us the proportion of the ones. By using the mean function, I can tell the proportion of the risk factors among the participants who showed malignancy in each test. The proportion may have insights about which risk factor has more relevance to showing cellular signs of cancer. My hypothesis is that STDs and HPV diagnosis will lead to later malignancies in the four tests.
Hins1 = pd.DataFrame(df.loc[df['Hinselmann']==1])
Schil1 = pd.DataFrame(df.loc[df['Schiller']==1])
Cit1 = pd.DataFrame(df.loc[df['Citology']==1])
Bio1 = pd.DataFrame(df.loc[df['Biopsy']==1])
# Filter each test for those whose samples showed malignancies
fig, ax = plt.subplots(1, 4, figsize=(20,5))
Hins1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[0],
title='Hinselmann Test')
Schil1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[1],
title='Schiller Test')
Cit1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[2],
title='Citology Test')
Bio1[['Smokes','Hormonal Contraceptives','IUD','STDs','Dx:Cancer','Dx:HPV']].mean().plot.bar(ax=ax[3],
title='Biopsy Test')
fig.suptitle('Risk Factor Association to Malignant Samples from Four Different Tests')
My hypothesis was not accurate at all for this question. Hormonal contraceptives seem to have the highest proportion among those who showed malignancies. However, this may be due to the sheer number of participants who have used hormonal contraceptives. Out of ~860 participants, 481 use hormonal contraceptives, whereas the number of those reporting an HPV diagnosis was 18 and the number of malignancies via Hinselmann was a mere 35. I've decided this may be inconclusive due to missing data or survey bias where women misreported due to embarrassment, etc.
Following contraceptives, smoking has the highest proportion in the Hinselmann test, smoking and STDs in the Schiller test, and STDs in the citology and biopsy examinations. I am surprised that HPV diagnosis did not have a higher association.
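One way to separate "common among positive samples" from "common among everyone" is to divide the proportion of a risk factor within the positive-test group by its proportion in the full sample. The sketch below does this on a hypothetical 10-row table (the values are invented for illustration; only the column names mirror the dataset). A ratio above 1 means the factor is over-represented among positive samples:

```python
import pandas as pd

# Hypothetical 0/1 data for 10 participants; illustrative values only.
toy = pd.DataFrame({
    "Hormonal Contraceptives": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "STDs":                    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "Biopsy":                  [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
factors = ["Hormonal Contraceptives", "STDs"]

overall = toy[factors].mean()                            # base rate of each factor
positive = toy.loc[toy["Biopsy"] == 1, factors].mean()   # rate among positive biopsies
ratio = positive / overall                               # above 1: over-represented
print(ratio)
```

In this toy table both factors appear in two of the three positive rows, yet the rarer factor ends up with the larger ratio, which is the base-rate effect described above.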
Next, I would like to address the elephant in the dataset: the missing values. Because most of the data is in binary terms, participants have to choose between yes and no. Because this data is about health, lifestyle choices, sexual experiences, and other sensitive and stigmatized information, I believe the data is missing not at random. Most of the variables are self-reported rather than something observed about the participant. None of the participants left their age blank, for example. This suggests that the entries that had been filled in with "?" were really just something that the participant did not want to disclose (like number of STDs), or simply may not have known (like number of sexual partners). (How did 117 people not know whether they had had an IUD or not when it is an insertion done by a doctor?) Having a full data table with no missing values may have significantly changed some results or led to more questions. With more time, I would have liked to delve more into how to address the missing data issue using types of imputation and figuring out where the error would be the least. This may have given more proportionally relevant conclusions. Below, I count the number of missing values within each column.
df.isnull().sum()
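As a starting point for the imputation idea mentioned above, here is a minimal sketch on a hypothetical two-column table (values invented for illustration). It reports the share of missing entries per column, then fills a binary column with its most common answer and a count column with its median. These are deliberately simple choices, and if answers are missing not at random they can bias the results:

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns; values are illustrative only.
toy = pd.DataFrame({
    "IUD": [0, 1, np.nan, 0, np.nan, 0],
    "Num of pregnancies": [1, 2, np.nan, 4, 2, 3],
})

missing_share = toy.isnull().mean()   # fraction of missing entries per column

filled = toy.copy()
# Binary column: fill with the most common answer (the mode).
filled["IUD"] = filled["IUD"].fillna(filled["IUD"].mode().iloc[0])
# Count column: fill with the median of the observed values.
filled["Num of pregnancies"] = filled["Num of pregnancies"].fillna(
    filled["Num of pregnancies"].median())

print(missing_share)
print(filled)
```

Comparing results before and after filling gives a sense of how sensitive a conclusion is to the missing entries.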
Now that we have identified individualized risk factors that may lead to cervical cancer, we can visualize some proposed external relationships. I want to include the macro level infrastructure that could influence cervical cancer rates on a larger scale. I have chosen poverty data and healthcare data to see how they may be associated with cervical cancer rates within states.
First, I want to get a better idea of where the states fall in this health care ranking system. I can see from the table that Mississippi has the worst overall ranking of the 50 states and ranks similarly poorly in each individual subcategory. I started off by creating a graph of all 50 states with their ranks, but it was overwhelming and hard to read. Thus, I have taken the 5 states with the best-ranked health care systems and the 5 with the lowest ranks to graph them and view the differences.
full_by_rank = full_additional.set_index('Rank')
full_first = full_by_rank.loc[[1,2,3,4,5]]
full_last = full_by_rank.loc[[46,47,48,49,50]]
rank_min_max = pd.concat([full_first,full_last])
rank_min_max = rank_min_max.reset_index()
#rank_min_max
rank_min_max.plot.barh(x='State',y=['Rank','HC_Access','HC_Quality','P_Health'], figsize=(10,9),
title='Overall Health Care Rank from Worst (Top) to Best (Bottom)')
The longer the bar, the worse the health care ranking. So the uppermost 5 states, Mississippi, Arkansas, West Virginia, Oklahoma, and Alabama, have the worst-ranked health care in the nation. This graph also shows the subcategories making up that ranking: access, quality, and public health. Conversely, the bottommost state, Hawaii, is ranked as having the best health care in the nation. I am most interested in looking at the access subcategory along with the rank. Access to health care is a huge issue nationwide and in developing nations. Poor access to health care can hinder diagnoses and treatment for cervical cancer, in particular.
I want to look at how the health care ranking data relates to the cancer rates data. My original hypothesis is that there has to be some sort of correlation between the two sets. I think that with a better ranking and a more accessible health care system, cervical cancer rates will be lower. States with worse health care infrastructure in place will have a higher incidence of cervical cancer. Perhaps women in those states face a lack of quality health care, lack of quality physicians, lack of educational health material, lack of education, increase in smoking, higher rates of STDs (perhaps resulting from the lack of education), etc. If there is no correlation, I would be surprised.
fig, ax = plt.subplots(1, 2, figsize=(15,5))
#fig1= full_additional.plot.scatter(ax=ax[0],x='Rank',y='Cancer_Rate',title='Health Care Rank vs Cancer Rate')
fig1 = sns.regplot(data=full_additional, x='Rank', y="Cancer_Rate",ax=ax[0], fit_reg=True, marker="o")
# add regression line using seaborn
fig2 = sns.regplot(data=full_additional, x='HC_Access', y="Cancer_Rate",ax=ax[1], fit_reg=True, marker="o")
My hypothesis was correct. States with the worst rank and the least accessibility to health care also seem to have the highest cancer rates. The regression lines show the correlation.
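A regression line shows direction but not strength, so a correlation coefficient is a useful companion. The sketch below, on hypothetical state-level numbers (invented for illustration; only the column names mirror the table above), computes the Pearson coefficient with `Series.corr` and a rank-based (Spearman) coefficient as the Pearson correlation of the ranked values, which suits an ordinal variable like a ranking:

```python
import pandas as pd

# Hypothetical state-level values; illustrative only, not the real table.
toy = pd.DataFrame({
    "Rank":        [1, 10, 20, 30, 40, 50],
    "Cancer_Rate": [6.1, 7.0, 6.8, 7.9, 8.4, 9.2],
})

# Pearson: strength of the linear association, between -1 and 1.
pearson = toy["Rank"].corr(toy["Cancer_Rate"])

# Spearman: Pearson correlation computed on the ranks of each column.
spearman = toy["Rank"].rank().corr(toy["Cancer_Rate"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Values near 1 would support the reading of the plots, while values near 0 would suggest the fitted line is mostly noise.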
Next, I want to work a little bit with the population data at the end of the dataframe. My intuition tells me that, following logic, with a greater population, comes a greater incidence of cervical cancer. The more people are in a state, the more people are diagnosed with cervical cancer. Are there any outliers?
full_additional.plot.scatter(x='Population',
y='Cancer_Count', alpha=.3, s=32,
title='Overall State Population vs Number of People in the State with Cancer')
So, that appears to be sound logic. The graph shows a high correlation between cancer count and population, which confirms my initial idea. There do not seem to be any outlying states to look into further, either.
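Since raw counts mostly track population size, outliers are easier to spot after dividing the count by the population. A minimal sketch, assuming hypothetical counts and populations (the states "A" to "D" and their numbers are invented), computes a crude rate per 100,000 residents. This crude rate is not age-adjusted, so it will not match the age-adjusted rate column exactly:

```python
import pandas as pd

# Hypothetical counts and populations; illustrative only.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Cancer_Count": [400, 90, 1500, 260],
    "Population": [5_000_000, 1_000_000, 20_000_000, 2_000_000],
})

# Crude (not age-adjusted) rate per 100,000 residents.
toy["Crude_rate"] = toy["Cancer_Count"] / toy["Population"] * 100_000

# States at the top of this list stand out regardless of their size.
print(toy.sort_values("Crude_rate", ascending=False))
```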
I would like to incorporate the poverty data to test my theory that low ranked health care/inaccessible health care/higher cancer rates may be due to a poverty factor. My hypothesis is that poorer states may have worse ranked health care. Those living in poorer states may find health care inaccessible due to costs. Maybe they do not go for a check up or annual exam since they can't afford it. We previously spotted the correlation between higher cancer rates and inaccessible health care infrastructure. Perhaps this entire chain of events is due to the poverty levels of those in that state.
full_additional.plot.scatter(x='Percent_poverty',
y='Cancer_Rate', c='HC_Access',cmap='cividis',alpha=.5, figsize=(10,6), s=40,
title='Percentage in poverty vs Rates of new cancer cases with Access to HC per state colored')
#fig2= full_additional.plot.scatter(ax=ax[1], x='Percent_poverty',
# y='HC_Access', alpha=.5, title='Percentage of those in poverty vs Access to health care')
On the X axis of the graph above, we have 'percentage of people in poverty'. The hypothesis is that cancer rate is dependent on that percentage. This scatter plot shows a clear positive correlation between the two variables. By adding the access to health care variable to color the points, we can see how the ranking for health care access plays into this correlation. As seen above, the least accessible states (yellow) seem to be concentrated in the upper right corner of the graph, corresponding to higher rates of cancer and higher poverty levels. The lower left side of the graph shows a greater concentration of dark dots (corresponding to the best accessibility), with slightly lower poverty and slightly lower cancer rates. This is a pretty rough approximation and conclusion from this graph, but it does show some semblance of correlation between the three factors. Very loosely, less access to health care relates to higher poverty percentages and greater cancer rates.
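To put numbers on the three-way relationship read off the colored scatter plot, a pairwise correlation matrix is one option. The sketch below uses `DataFrame.corr()` on hypothetical values (invented for illustration; only the column names mirror the merged table):

```python
import pandas as pd

# Hypothetical state-level values; illustrative only.
toy = pd.DataFrame({
    "Percent_poverty": [9.5, 11.0, 12.5, 14.0, 16.5, 19.0],
    "HC_Access":       [5, 12, 20, 31, 40, 48],
    "Cancer_Rate":     [6.3, 6.9, 7.4, 7.2, 8.6, 9.1],
})

# Pairwise Pearson correlations between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

Each off-diagonal entry summarizes one pair of variables. Correlations alone cannot show that poverty acts through access to care, only that the three measures move together.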
By looking on an individual level, I was able to identify possible risk factors and how they were associated to multiple tests for sample malignancies. Due to missing data, conclusions about these factors were difficult to make for certain. There seems to be little correlation within the risk dataset when looking at number of sexual partners and first incidence of sexual intercourse as they relate to spreading of STDs and HPV. However, women should be encouraged to see their doctor and get screened for cancer when they become sexually active, or around the age of 17. There were issues when determining the proportions of the malignancy samples from each test and their associated risk factors due to inconsistencies in the participant data.
By visualizing some aspects of the state-wide data, we were already able to see that there may be an indirect effect of poverty on cervical cancer rates through access to health care. This is great news (in the context of this project) as it aligns with some of my initial questions and hypotheses. We may be able to draw some conclusions internationally. While studying the health care system in Peru, I saw an immediate lack of access due to transportation to the closest health clinic. By boat, it took over an hour to get from one of the communities to the closest health care technician (not a doctor). Getting results back from a screening test, communicating the results to the patient, and scheduling a follow-up visit were all rare occurrences. After looking at the data on both an individual and nation-wide level, what can this mean for cervical cancer in Peruvian women living on the Amazon?