This project explores survival outcomes on the Titanic using the
titanic_train
dataset from the titanic R package. The
Titanic was a British ship that sank in the North Atlantic Ocean on 15
April 1912 after hitting an iceberg during its first voyage, which was
one of the most infamous maritime disasters in history.
The dataset includes information about 891 passengers, such as their age, gender, ticket class, fare paid, and whether they survived. By conducting exploratory data analysis (EDA), the project uncovers meaningful patterns in passenger demographics and visualizes survival distributions across key variables. This includes the impact of traveling alone versus with family, the effect of ticket fare as a proxy for class, and differences in survival across age groups and genders. Visual tools such as bar charts, density plots, ridge plots, box plots, and scatter plots are used to support the analysis and insights.
library(titanic)
library(dplyr)
library(ggplot2)
library(ggridges)
First, we aim to analyse the composition of the Titanic’s passengers and their survival rates. There is a well-known notion of “women and children first,” suggesting that gender may have played a significant role in survival outcomes.
data("titanic_train")
glimpse(titanic_train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Female percentage
total_passengers <- nrow(titanic_train)
total_females <- sum(titanic_train$Sex == "female", na.rm = TRUE)
female_pct <- (total_females/total_passengers)*100
cat("Percentage of passengers who were female:", round(female_pct, 2), "%\n")
## Percentage of passengers who were female: 35.24 %
# Survival percentage
total_survivors <- sum(titanic_train$Survived == 1, na.rm = TRUE)
survival_pct <- (total_survivors/total_passengers)*100
cat("Percentage of passengers who survived:", round(survival_pct, 2), "%\n")
## Percentage of passengers who survived: 38.38 %
# Percentage of female passengers who survived
total_survived_fem <- sum(titanic_train$Survived == 1 & titanic_train$Sex == "female", na.rm = TRUE)
survived_fem_pct <- (total_survived_fem/total_females)*100
cat("Percentage of female passengers who survived:", round(survived_fem_pct, 2), "%\n")
## Percentage of female passengers who survived: 74.2 %
The dataset contains 891 rows and 12 variables, including passenger demographics, ticket and cabin details, and survival status. Approximately 35.24% of passengers were female, and 38.38% survived. Among female passengers, the survival rate was notably higher at 74.2%, highlighting the strong influence of gender on survival outcomes. This aligns with historical accounts of the “women and children first” policy during evacuation.
ggplot(titanic_train, aes(x = Sex, fill = factor(Survived))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_discrete(labels = c("Did Not Survive", "Survived")) +
labs(
title = "Survival Rates by Gender",
x = "Gender",
y = "Percentage",
fill = "Survival Status"
)
The bar chart and calculated statistics reveal a significant difference
in survival rates based on gender. A much higher percentage of female
passengers survived compared to male passengers, as evident from the
dominant blue segment in the female bar, representing survivors. In
contrast, the male bar shows a larger red segment, indicating that the
majority of male passengers did not survive. These findings align with
the “women and children first” policy that prioritised women during the
evacuation. The statistics further support this observation, with
approximately 74% of female passengers surviving, compared to a much
lower survival rate of around 19% for males. This disparity highlights
the influence of social norms and emergency protocols on survival
outcomes during the Titanic disaster.
Age is another critical factor to consider, as we often wonder whether children and younger adults had higher survival rates. To explore this, we create a density plot of passenger ages to distinguish between survival statuses.
ggplot(titanic_train, aes(x = Age, fill = factor(Survived))) +
geom_density(adjust = 1.1, alpha = 0.6) +
scale_fill_discrete(labels = c("Did Not Survive", "Survived")) +
labs(
title = "Density Plot of Passenger Ages by Survival Status",
x = "Age",
y = "Density",
fill = "Survival Status"
)
For children under 10 years old, there is a local peak in the density of
survivors, indicating a high proportion of children within this age
group among survivors. This aligns with the “women and children first”
policy implemented during the evacuation. In contrast, the density of
non-survivors is lower for young passengers, suggesting that children
were less represented among those who did not survive.
For teenagers and young adults, specifically those between 15 and 30 years old, the densities of survivors and non-survivors overlap. The non-survivor density exhibits a global peak around this age group, particularly concentrated at approximately 24 years old. This indicates that individuals in their mid-twenties were highly represented among the non-survivors. While both groups include a significant proportion of passengers from this age range, the proportion of this age group is higher among non-survivors than among survivors. (It is important to note that density reflects the distribution within each group, not absolute numbers. This peak signifies a higher proportion of non-survivors in this age group relative to other ages within the non-survivor category. Additionally, the area under each curve always accumulates to 100%, representing the proportions within each group.)
For middle-aged and older adults, aged 30 and above, both densities gradually decrease, reflecting the lower number of passengers in this age range on board. The densities overlap and fluctuate slightly but remain relatively stable across this range. Notably, for passengers aged around 58 and older, the non-survivor density becomes more pronounced, suggesting that this group was more heavily represented among non-survivors. This pattern may reflect the physical challenges older individuals faced during the evacuation or the prioritisation of younger passengers and families. Overall, the density plot suggests that younger passengers, particularly children, were highly represented among survivors, while older passengers comprised a higher percentage of non-survivors.
Since gender may also play a significant role in passenger survival, we aim to explore whether age distributions differ between male and female passengers and between survivors and non-survivors. To visualise this, we create a ridge plot comparing the age distributions across these groups.
ggplot(titanic_train, aes(x = Age, y = interaction(Sex, factor(Survived)), fill = factor(Survived))) +
geom_density_ridges(alpha = 0.7) +
scale_y_discrete(
labels = c(
"female.0" = "Female (Did Not Survive)",
"female.1" = "Female (Survived)",
"male.0" = "Male (Did Not Survive)",
"male.1" = "Male (Survived)")) +
scale_fill_discrete(labels = c("Did Not Survive", "Survived")) +
labs(
title = "Ridge Plot of Age Distributions by Gender and Survival Status",
x = "Age",
y = "Gender & Survival Status",
fill = "Survival Status"
)
## Picking joint bandwidth of 4.44
The ridge plot reveals distinct differences in the age distributions of
passengers based on gender and survival status. Among female survivors,
there is a peak for younger passengers, particularly children under the
age of 10. This pattern suggests that young females were prioritised
during the evacuation, consistent with the “women and children first”
policy. An even higher peak is observed for male survivors in this age
group, indicating that young boys also benefited significantly from the
prioritisation of children during the evacuation process. However, when
examining children who did not survive, a relatively higher proportion
is seen among female non-survivors compared to male non-survivors,
suggesting that children represented a substantial proportion of female
non-survivors.
Across all four distributions, the most represented age group (the peaks) is around 25 years old, with the highest peak and concentration observed among male non-survivors. This sharp peak for young and middle-aged males highlights that they were overrepresented among non-survivors, likely reflecting the lower priority given to adult males during the evacuation. In contrast, female survivors show relatively higher densities in this age range (20–40 years). Female non-survivors also exhibit noticeable densities in this range, particularly among older adults, indicating that older women made up a considerable proportion of female non-survivors.
Older adults, particularly those aged 50 and above, exhibit similar densities among survivors for both genders, with a slight peak for male survivors around the age of 80, which is somewhat surprising. For non-survivors in this age group, male densities are noticeably higher, suggesting that older individuals faced significant challenges during evacuation. These challenges may have included physical limitations or reduced prioritisation compared to younger passengers and families.
Overall, the plot underscores the combined influence of gender and age on survival outcomes. Among females, there are significant proportions represented across almost all age groups for both survivor and non-survivor densities. Among males, there is a highly concentrated group of young adult non-survivors, while male children have a notably high proportion among survivors. These patterns reflect the societal norms and evacuation protocols during the Titanic disaster.
The survival rates of passengers may also be influenced by whether
they had family members on board. This is because the presence of family
members could have provided support, assistance, and coordination during
the chaotic evacuation process. To investigate this, we create a binary
variable, Family
, indicating whether a passenger had at
least one sibling or parent on the ship.
titanic_train$Family <- ifelse(titanic_train$SibSp > 0 | titanic_train$Parch > 0, "Yes", "No")
ggplot(titanic_train, aes(x = Family, fill = factor(Survived))) +
geom_bar(position = "dodge") +
scale_fill_discrete(labels = c("Did Not Survive", "Survived")) +
labs(
title = "Survival Status Based on Family Presence",
x = "Family Onboard",
y = "Count of Passengers",
fill = "Survival Status"
)
The bar chart illustrates the survival status of passengers based on whether they had family members onboard the Titanic. Passengers without family onboard overwhelmingly constitute a significantly larger group of non-survivors. This suggests that traveling alone significantly reduced the likelihood of survival, possibly due to the lack of support during the chaotic evacuation process. At the same time, this group also represents the largest overall category, indicating that the majority of passengers traveled alone.
In contrast, passengers who had family onboard exhibited a more balanced distribution between survivors and non-survivors. This suggests that the presence of family members may have provided crucial assistance and coordination, improving the chances of survival. However, it also highlights that survival was not guaranteed, even with familial support, emphasizing the unpredictable and challenging circumstances during the disaster.
Ticket fare is often considered a proxy for wealth and social class, as higher fares were typically associated with first-class passengers who had better accommodations and greater access to lifeboats. Thus, we aim to determine whether paying more for a ticket correlates with higher survival odds.
ggplot(titanic_train, aes(x = Fare, y = factor(Survived), fill = factor(Survived))) +
geom_boxplot(alpha = 0.7) +
scale_fill_discrete(labels = c("Did Not Survive", "Survived")) +
labs(
title = "Box Plot of Ticket Fare by Survival Status",
x = "Ticket Fare",
y = "Survival Status",
fill = "Survival Status"
)
The box plot reveals a clear relationship between ticket fare and
survival status among Titanic passengers. Both box plots are positively
skewed, as indicated by the median being closer to the first quartile.
The central value for ticket fares among survivors is moderately higher
compared to non-survivors, and their fares (even the middle range)
exhibit a wider range, extending to significantly higher prices. This
includes several outliers who paid exceptionally high ticket prices.
These findings suggest that passengers in first-class accommodations,
typically associated with higher fares, had better access to lifeboats
and, consequently, a greater likelihood of survival.
In contrast, non-survivors show a lower median fare and a narrower range of ticket prices, but with many outliers in the higher fare range. The middle 50% of non-survivors paid fares within the range of approximately 10 to 25, suggesting that the majority of passengers who did not survive were in lower ticket classes. This highlights that passengers paying less for their fares were less likely to survive, likely due to limited access to lifeboats and less advantageous accommodations.
A particularly striking observation is the presence of outliers among survivors with fares exceeding 500. These high-paying passengers likely benefited from their wealth and social status, securing priority during the evacuation process. The disparity in fare distributions between survivors and non-survivors underscores the class-based inequalities in survival chances on the Titanic.
Overall, the box plot highlights the significant role of wealth and social class in shaping survival outcomes. Passengers who survived the tragedy paid higher fares overall, reflecting the unequal distribution of lifeboat access and evacuation opportunities. Conversely, passengers who did not survive paid lower fares, likely opting for lower-class accommodations, and faced greater challenges, emphasizing the stark class divisions on board the Titanic.
Age and ticket fare may be correlated, as different age groups might have had varying financial means and travel arrangements. Younger passengers were often traveling with families, which may have influenced their class and ticket prices. Conversely, older passengers, particularly those traveling alone, might have been more likely to afford higher fares and first-class accommodations. This section investigates the relationship between age and fare.
ggplot(titanic_train, aes(x = Age, y = Fare)) +
geom_point(alpha = 0.3, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "#00BFC4") +
labs(
title = "Scatter Plot of Age vs. Fare",
x = "Age",
y = "Ticket Fare"
)
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot of age versus ticket fare reveals several patterns in
the relationship between these two variables. Young children, as well as
young and middle-aged passengers, particularly those aged below 40, are
predominantly clustered in the lower fare range, with most fares below
50. This pattern suggests that younger passengers were more likely to
travel in lower ticket classes, possibly due to traveling with families
or in groups, which may have reduced individual ticket costs. However,
there are also many individuals in this age group who paid significantly
higher ticket prices.
In contrast, older passengers, especially those aged 40 and above, exhibit a more evenly distributed range of ticket fares. While many older individuals also paid lower fares, there are many others who paid significantly higher amounts. These high-fare outliers indicate that some older passengers had the financial means to afford first-class accommodations, which offered greater comfort and increased access to lifeboats.
The trend line on the plot shows a slight positive slope, indicating a weak positive correlation between age and fare. This suggests that older passengers were somewhat more likely to pay higher fares on average, though the relationship is not strong. Additionally, several high-fare outliers are scattered across various age groups, representing passengers who paid premium prices, likely for first-class accommodations.
ggplot(titanic_train, aes(x = Age, y = Fare, color = factor(Survived), shape = factor(Survived))) +
geom_point(alpha = 0.3, size = 2) +
geom_smooth(method = "lm", se = FALSE, aes(linetype = factor(Survived))) +
scale_color_discrete(labels = c("Did Not Survive", "Survived")) +
scale_shape_manual(values = c(16, 17), labels = c("Did Not Survive", "Survived")) +
scale_linetype_discrete(labels = c("Did Not Survive", "Survived")) +
labs(
title = "Scatter Plot of Age vs. Fare by Survival Status",
x = "Age",
y = "Ticket Fare",
color = "Survival Status",
shape = "Survival Status",
linetype = "Survival Status"
)
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot of age versus fare, differentiated by survival status,
reveals a few trends when considering age and fare together. Survivors,
represented by teal triangles, show an overall broader spread across
fare ranges, with a noticeable presence in higher fare ranges across
most age groups. This suggests that paying higher fares, often
associated with first-class accommodations, increased survival chances
regardless of age.
Non-survivors, represented by red circles, show a more concentrated pattern in the lower fare range across nearly all age groups. This clustering indicates that passengers in lower ticket classes, regardless of their age, were more likely to face challenges during evacuation. The more limited dispersion of non-survivors into higher fare ranges highlights the critical role of ticket class in influencing survival outcomes.
The trend lines further reinforce this relationship. The teal trend line for survivors indicates a slight positive correlation between age and fare, showing that older survivors tended to pay higher fares. In contrast, the red trend line for non-survivors is relatively flat, suggesting a weaker relationship between age and fare for those who did not survive.
In summary, passengers who survived were more likely to have paid higher fares across age groups, reflecting the advantages conferred by wealth and higher social class. Non-survivors, on the other hand, were predominantly concentrated in the lower fare ranges regardless of age, underscoring the disadvantage faced by lower-class passengers in terms of survival.
In this section, we analyse whether family size impacts the survival
rate of a passenger, specifically exploring whether having a larger
family improved survival chances on the Titanic. To investigate this, we
created a new variable, FamilySize
, by summing
SibSp
(siblings/spouses) and Parch
(parents/children). This represents the total family members each
passenger had onboard. Summary statistics for FamilySize
reveal a skewed distribution, with most passengers having no or very few
family members onboard.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.9046 1.0000 10.0000
Next, we created a new variable, FamilyCategory
, to
classify passengers by family size: “Alone” (family size 0), “Small”
(1–2), and “Large” (>2). The majority of passengers (537) traveled
alone, 263 belonged to small families, and only 91 were part of large
families. This highlights that traveling alone was the most common
family configuration on the Titanic, with few passengers in larger
family groups.
The bar chart above analyzes survival rates based on family
configuration, highlighting how family size influenced survival
outcomes.
Passengers traveling alone had the lowest survival rate, with approximately 30% surviving. This suggests a significant disadvantage for individuals without family support during the chaotic evacuation. In contrast, passengers in small families had a notably higher survival rate, about 26% greater than those traveling alone, likely benefiting from the support and coordination of a small group. Large families experienced a lower survival rate compared to small families, with a difference of about 22%, though their survival rate was slightly higher than those traveling alone. This may reflect the challenges large families faced in staying together and securing lifeboat access for everyone.
Overall, passengers in small families had the best survival chances. While traveling alone posed the greatest challenges, survival rates for those in large families were only slightly better, suggesting that larger family sizes offered diminishing returns in survival advantages.
Building on the previous analysis of family configuration and its impact on survival outcomes, we now examine how gender and passenger class influenced survival rates. The grouped bar chart highlights the interplay between these factors, providing deeper insights into survival patterns on the Titanic.
Among first-class passengers, females had a notably high survival rate compared to males, with the majority of women surviving, as shown by the prominent teal bars. In contrast, most first-class men did not survive. This trend reflects the prioritisation of women during evacuation. In second class, female passengers again show higher survival rates than males, though overall survival rates are lower than in first class. Most second-class men did not survive, underscoring the consistent lower prioritisation of men. In third class, stark disparities emerge: while half the women survived, the survival rate for men decreases significantly, reflecting structural disadvantages like limited lifeboat access and poorer evacuation conditions.
Overall, women consistently had better chances of survival than men, and first-class passengers fared significantly better than those in third class.
Expanding on the earlier analyses, this visualization examines the
distribution of age among survivors and non-survivors, segmented by
passenger class.
In first class, the survival curve peaks around the 20–40 age range, indicating a higher concentration of this demographic among survivors. Non-survivors exhibit a more even distribution with a slight skew toward older passengers, suggesting a higher proportion of older individuals among those who did not survive. Second-class passengers show a different pattern, with survivors concentrated in the 20–40 age range and a secondary peak among children under 10. Non-survivors similarly peak in the 20–40 range, with higher density as well as among older passengers above 50. In third class, similar patterns emerge as in the second class.
Overall, first-class age distributions are more balanced between survivors and non-survivors, while second and third classes show sharper disparities. However, the second and third classes highlight the importance of prioritizing children, as they are predominantly represented among survivors.
To summarize, the analysis reveals that survival on the Titanic was influenced by family size, gender, age, and socio-economic class. Passengers in small families had the highest survival rates, benefiting from mutual support, while those traveling alone faced the greatest disadvantages. Women consistently had better survival outcomes than men across all classes, reflecting prioritisation during evacuation, and first-class passengers fared significantly better than those in second or third class. Age also played a role, with children under 10 being consistently prioritised across all classes. These findings highlight how a combination of familial support, demographic factors, and socio-economic status shaped survival outcomes during the Titanic disaster.