3.1.3 Formatting dates and making sure loaded data has only Manhattan Borough entries
Here, we created a robust set of parse orders for mixed formats in both our datasets. We created cleaned date columns while keep original columns untouched. Also, we added new columns with month-year (Month name + Year) extracted which will be used for alluvial stacks later on
The date columns have been parsed properly and as we can see the both raw data we downloaded, only had data from the Manhattan borough since before and after filtering the number of rows remain same for both datasets i.e. 589005 rows in Housing Violations dataset and 425248 rows in 311 requests dataset.
3.1.4 Housing violation decription column data cleaning
[1] "(A) § HMC:FILE ANNUAL BEDBUG REPORT IN ACCORDANCE WITH HPD RULE AS DESCRIBED ON THE BACK OF THIS NOTICE OF VIOLATION OR AS DESCRIBED ON HPDâ\u0080\u0099S WEBSITE, WWW.NYC.GOV\\HPD, SEARCH BED BUGS."
[2] "§ 27-2005 ADM CODE PAINT METAL IN ACCORDANCE WITH DEPT. REGULATION RADIATOR AT THE SOUTH WALL IN THE KITCHEN LOCATED AT APT 9C, 9th STORY, 3rd APARTMENT FROM EAST AT SOUTH"
[3] "§ 27-2026 ADM CODE REPAIR THE LEAKY AND/OR DEFECTIVE FAUCETS AT SINK IN THE KITCHENETTE LOCATED AT B-ROOM 3N, 3rd STORY, 1st B-ROOM FROM NORTH AT EAST"
[4] "§ 27-2005, 27-2007, 27-2041.1 HMC: REPLACE OR REPAIR THE SELF-CLOSING DOORS THAT IS MISSING OR DEFECTIVE AT BULKHEAD"
[5] "§ 27-2026 ADM CODE REPAIR THE LEAKY AND/OR DEFECTIVE FAUCETS AT WASHBASIN IN THE BATHROOM LOCATED AT APT 20B, 2nd STORY, 1st APARTMENT FROM NORTH AT EAST"
[6] "§ 27-2046.1 HMC: REPAIR OR REPLACE THE CARBON MONOXIDE DETECTING DEVICE(S). MISSING IN THE ENTIRE APARTMENT LOCATED AT APT 7, 3rd STORY, 1st APARTMENT FROM EAST AT SOUTH"
[7] "§ 27-2045 ADM CODE REPAIR OR REPLACE THE SMOKE DETECTOR MISSING LOCATED AT APT 2A, 2nd STORY, 1st APARTMENT FROM WEST AT NORTH"
[8] "§ 27-2046.1 HMC: REPAIR OR REPLACE THE CARBON MONOXIDE DETECTING DEVICE(S). MISSING IN THE ENTIRE APARTMENT LOCATED AT APT 5C, 5th STORY, 1st APARTMENT FROM WEST AT NORTH"
[9] "§ 27-2045 ADM CODE REPAIR OR REPLACE THE SMOKE DETECTOR MISSING IN THE ENTIRE APARTMENT LOCATED AT APT 5C, 5th STORY, 1st APARTMENT FROM WEST AT NORTH"
[10] "§ 27-2005 ADM CODE REPAIR THE BROKEN OR DEFECTIVE PLASTERED SURFACES AND PAINT IN A UNIFORM COLOR EAST WALL APPROX 4SQ FT AND NORTH WALL APPROX 4 SQ FT IN THE BATHROOM LOCATED AT APT 1A, 1st STORY, 2nd APARTMENT FROM SOUTH AT WEST"
ViolationCategory n
1 PAINT/PLASTER 127336
2 PLUMBING 93154
3 DOOR/WINDOW/LOCK 91860
4 OTHER 53533
5 PEST/SANITATION 50753
6 SMOKE/CO DETECTOR 40227
7 HEAT/HOT WATER 39990
8 REGISTRATION/ADMIN 35133
9 WATER LEAK 23370
10 FLOOR/CEILING 19633
11 FIRE SAFETY 7121
12 GAS/APPLIANCES 4004
13 ELECTRICAL 1607
14 BUILDING SYSTEMS 679
15 BUILDING MANAGEMENT 376
16 MOLD 118
17 ELEVATOR 81
18 VENTILATION 30
Code
# ========== 311: CATEGORY MAPPING (parallel to housing) ==========# Compare with 311 Complaint Typessr311 |>count(Complaint.Type, sort =TRUE) |>print()
Complaint.Type n
1 HEAT/HOT WATER 229222
2 PLUMBING 50637
3 PAINT/PLASTER 44624
4 WATER LEAK 30190
5 General Construction/Plumbing 25207
6 Maintenance or Facility 20202
7 Elevator 16588
8 Plumbing 2197
9 ELEVATOR 1302
10 Non-Residential Heat 1094
11 School Maintenance 1064
12 OUTSIDE BUILDING 857
13 Water Leak 556
14 Heat/Hot Water 487
15 Paint/Plaster 404
16 Mold 402
17 Building Drinking Water Tank 83
18 Sewer Maintenance 68
19 Water Maintenance 31
20 Building Condition 19
21 Unstable Building 14
Category n
1 HEAT/HOT WATER 230803
2 PLUMBING 78041
3 PAINT/PLASTER 45028
4 WATER LEAK 30746
5 ELEVATOR 17890
6 OTHER 17784
7 PEST/SANITATION 4554
8 MOLD 402
Here, manual categorization logic we applied successfully maps unstructured text descriptions (NOVDescriptions) in Housing Violations dataset into unified high-level categories which enables a direct one-to-one comparison between both. We also normalized the categories in 311 Complaints dataset since there were multiple for the same category, ege. there were 2 types “HEAT/HOT WATER” and “Heat/Hot Water” which were merged into 1.
There is a substantial disparity in volume for the “HEAT/HOT WATER” category in violations vs complaints, where the number of 311 Service Requests is significantly higher than the count of official Housing violations issued for the same. Hence, we can conclude that while tenant complaints regarding heating are very very frequent, quite a smaller proportion result in confirmed enforcement actions by management. In other categories, the difference is relatively small. In case of Paint/Plaster and Pests, there is much higher number of building violation entries which shows that probably regular building inspections is conducted for those and hence less direct 311 complaints. Out of all top 7 identified categories overlapping in both the datasets, Heat/Hot Water is the highest and Mold is the lowest complaint type.
Across quarters, HEAT/HOT WATER 311 complaints and housing violations both show a clear seasonal pattern. They spike in Q1, drop sharply in Q2 and Q3 for all 3 years, and rise again Q4 matching winter heating needs. For PLUMBING, PAINT/PLASTER, PEST/SANITATION, WATER LEAK, ELEVATOR, and MOLD, housing violations counts change over the quarters (most show a local peak around Q3), but patterns are less sharp and not perfectly identical so they don’t show a single clear seasonal shape like HEAT/HOT WATER does. Pest and sanitation 311 complaints peak in Q3 in all 3 years, which suggests these problems are most common in late summer, which is also consistent with warmer weather since heat and humidity make it easier for rodents, roaches, and insects to spread, and garbage issues can get worse when it’s so hot. But it’s not as strong or as perfectly consistent across all years. For the other categories for 311 complaints, the quarter‑to‑quarter changes within each panel are relatively small and do not follow a clear, repeated seasonal shape across years.
3.4 Housing Violation Monthly Transition Alluvial for 2024 (Sample)
We can see how 2024 housing violations (we have taken a small sample slice of 200 here to visualize better) move from inspection month to NOV (notification of violation) month to when current status takes effect, with ribbon colors showing violation Class (A least serious, C most serious, I immediately hazardous). Mostly, flows stay within nearby months across 3 stages showing many violations progress through the process within the same part of the year. Thicker orange and blue ribbons show that Class B and C violations make up most of the pipeline, while Class A is thinner and Class I appears as a smaller set of immediately hazardous cases. There are also some blank bands in NOV step which occur when inspection and status dates are recorded but NOVIssuedDate or Class code is missing, so those months or classes appear as NA.
3.5 Facet Comparative Panel: Housing vs 311
Code
# ----------------------------# 1) STANDARDIZE ZIPCODE NAME# ----------------------------housing2 <- housing |>rename(ZipCode = Postcode) |># Convert housing block to character type immediatelymutate(Block =as.character(Block)) sr311_2 <- sr311 |>rename(ZipCode = Incident.Zip) |>mutate(# Extract 5 digits starting from the 2nd character (position 2) and removing start zeroesBlock =as.character(as.numeric(str_sub(BBL, start =2, end =6))) )# ----------------------------# 2) CREATE TOP SETS PER VARIABLE (20 each)# ----------------------------# Top 10 ZIPshousing_zip <- housing2 |>filter(!is.na(ZipCode)) |>count(ZipCode) |>slice_max(n, n =10)sr311_zip <- sr311_2 |>filter(!is.na(ZipCode)) |>count(ZipCode) |>slice_max(n, n =10)# Top 10 Streetshousing_st <- housing2 |>filter(!is.na(StreetName)) |>count(StreetName) |>slice_max(n, n =10)sr311_st <- sr311_2 |>filter(!is.na(Street.Name)) |>count(Street.Name) |>slice_max(n, n =10)# Top Categories (Housing only but we compare counts against 311 Categorieshousing_cat <- housing2 |>filter(!is.na(ViolationCategory)) |>count(ViolationCategory) |>slice_max(n, n =10) |>rename(Category = ViolationCategory)sr311_cat <- sr311_2 |>filter(!is.na(Category)) |>count(Category) |>slice_max(n, n =10) # Top 10 Blockshousing_block<- housing2 |>filter(!is.na(Block)) |>count(Block) |>slice_max(n, n =10)sr311_block <- sr311_2 |>filter(!is.na(Block)) |>count(Block) |>slice_max(n, n =10)# ----------------------------# 3) COMBINE PANELS (FIXED WITH as.character)# ----------------------------zip_panel <-bind_rows( housing_zip |>mutate(Value =as.character(ZipCode), Source ="Housing", Facet ="ZipCode"), sr311_zip |>mutate(Value =as.character(ZipCode), Source ="311", Facet ="ZipCode"))category_panel <-bind_rows( housing_cat |>mutate(Value =as.character(Category), Source ="Housing", Facet ="Category"), sr311_cat |>mutate(Value =as.character(Category), Source ="311", Facet ="Category"))street_panel <-bind_rows( housing_st |>mutate(Value =as.character(StreetName), Source ="Housing", Facet ="StreetName"), sr311_st |>mutate(Value =as.character(Street.Name), Source ="311", Facet ="StreetName"))block_panel <-bind_rows( housing_block |>mutate(Value =as.character(Block), Source ="Housing", Facet ="Block"), sr311_block |>mutate(Value =as.character(Block), Source ="311", Facet ="Block"))# ----------------------------# 4) FINAL COMBINED PANEL# ----------------------------panel_all <-bind_rows(zip_panel, street_panel, category_panel, block_panel) |>group_by(Facet) |>mutate(Value =fct_reorder(Value, n)) |>ungroup()# ----------------------------# 4) FINAL FACETED PLOT# ----------------------------ggplot(panel_all, aes(x = Value, y = n, fill = Source)) +geom_col(position ="dodge") +coord_flip() +facet_wrap(~ Facet, scales ="free", ncol =2) +scale_y_continuous(labels = scales::comma) +scale_fill_manual(values =c("Housing"="purple", "311"="pink")) +labs(title ="Top Manhattan Characteristics Across\nHousing Violations and 311 Complaints",subtitle ="Blocks, Violation/Complaint Categories,\nStreet Names, and ZIP Codes",x ="",y ="Count",fill ="Source" ) +theme_minimal(base_size =14) +theme(strip.text =element_text(face ="bold", size =9),plot.title =element_text(face ="bold", size =13),plot.subtitle =element_text(size =10),legend.position ="top",axis.text.y =element_text(size =6),axis.text.x =element_text(size =6) )
Here, we compared the top 10 blocks, housing/violation categories, streets, and ZIP codes in Manhattan across both are datasets and we can conclude that mostly same areas and same problem types show up in both datasets at the top, like blocks 2136 and 2170 have highest violations/complaints, streets like Broadway and St Nicholas Avenue also have highest records in both, ZIP codes 10031–10033, 10027 all have very high counts in both datasets, and categories like HEAT/HOT WATER, PLUMBING, and PAINT/PLASTER dominate for both complaints and violations, indicating that both our datasets focus on similar set of buildings, locations, categories, and issues.
The spikes here kind of look like a pattern, but the size and shape suggest they are just random glitches and not exactly a real trend. On normal days between 2022-2025, it is mostly noisy and a flat band around a few hundred inspections but on very few days it jumps over 5,000. This is extremely high & not very realistic just 1 day and does not spread over nearby days so it’s not a trend but these spikes are outliers.
It is almost a solid pink block with very tiny variation in consecutive days and that’s why it’s very hard to see any clear pattern. Hence, this plot does no give us much information. TO further deep dive into finding a trend in the time series, we plotted combined monthly series for both the dataset.
Here, both 311 complaints and housing violations in Manhattan during 2022-2025 have a fairly strong seasonal pattern, but there are also some exceptions. In most years the red complaint line is highest in late fall & winter, & lowest in late spring & summer which basically means people call 311 line more about housing problems during colder months when heat and building conditions matter more. Now blue violations line also have major winter peaks overall but it includes few extra spikes in other months (like June & August 2025), so its pattern is winter‑dominated but not specific to winter‑only. Also, some of those spikes might be anomalies and outlier inspections instead of a reasonable trend.
3.6.4 Time series Variation by Month and Year for Housing
Code
set.seed(2025)# Load datahousing_2 <-as.data.table(housing)# Convert date and extract time componentshousing_2[, InspectionDate :=as.Date(InspectionDate)]housing_2[, YearMonth :=format(InspectionDate, "%Y-%m")]housing_2[, Month :=month(InspectionDate)]housing_2[, Year :=year(InspectionDate)]# Count violations by month and yearviol_by_month <- housing_2[, .N, by = .(Year, Month)][order(Year, Month)]# Create the plotggplot(viol_by_month, aes(x = Month, y = N, color =factor(Year), group = Year)) +geom_line(linewidth =1.2) +geom_point(size =2.5) +scale_x_continuous(breaks =1:12, labels =c("Jan", "Feb", "Mar", "Apr", "May", "Jun","Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +scale_y_continuous(labels = scales::comma) +labs(title ="Housing Violations by Month and Year",subtitle ="Manhattan, 2022-2025",x ="Month",y ="Number of Violations",color ="Year" ) +theme_minimal(base_size =16) +theme(plot.title =element_text(face ="bold"),legend.position ="right" )
It shows that there is no stable seasonal pattern across years. Some months are high in one year and low in another like February is very unusually high only in 2022, while peaks in other years appear in different months like April 2024 or June 2025. Hence, we can conclude that monthly counts fluctuate randomly rather than following a consistent winter–summer or any other repeating trend.
# Check if they're all inspected in Feb or entered in Febsummary(feb_2022$InspectionDate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2022-02-01" "2022-02-15" "2022-02-15" "2022-02-15" "2022-02-16" "2022-02-28"
Code
summary(feb_2022$ApprovedDate)
Length Class Mode
28414 character character
Upon further review we can see that Median/Mean inspection date is February 15, 2022 & most inspections are clustered around Feb 15-16 only. This does not seem like a data error but is simply appearing to be a mass inspection event in mid-February 2022. It is not a trend.
3.6.6 Smoothed Temporal View: Housing by Quarter
We tried to find any overall trends when we smooth out the monthly fluctuations
Code
# Aggregate by quarter for smoother trendhousing_2[, Quarter :=quarter(InspectionDate)]housing_2[, YearQuarter :=paste0(Year, "-Q", Quarter)]viol_by_quarter <- housing_2[, .N, by = .(Year, Quarter, YearQuarter)][order(Year, Quarter)]# Create quarter labelsviol_by_quarter[, QuarterLabel :=paste0(Year, "\nQ", Quarter)]# Plotggplot(viol_by_quarter, aes(x =factor(YearQuarter, levels = YearQuarter), y = N, group =1)) +geom_line(color ="steelblue", linewidth =1.2) +geom_point(color ="steelblue", size =3) +scale_y_continuous(labels = scales::comma) +labs(title ="Housing Violations by Quarter",subtitle ="Manhattan, 2022-2025 (smoothed trend)",x ="Quarter",y ="Number of Violations" ) +theme_minimal(base_size =16) +theme(plot.title =element_text(face ="bold"),axis.text.x =element_text(angle =45, hjust =1) )
It shows that housing violations in Manhattan from 2022-2025 move up and down between quarters but stay in a fairly similar range (30,000–45,000), so there is no strong upward or downward trend. Q4 20205 seems like an anomaly but it is not because it does not include data from mid November - December 2025 data are not fully included, hence it is not any abnormal evidence of a real drop.
3.7 Spatial sample plots
3.7.1 Housing Violations
Code
housing |>filter(!is.na(Longitude) &!is.na(Latitude)) |>slice_sample(n =5000) |>ggplot(aes(x = Longitude, y = Latitude)) +geom_point(alpha =0.4, size =0.6) +theme_minimal() +labs(title ="Housing Violations Locations (sample 5k)", x ="Longitude", y ="Latitude")
3.7.2 311 Complaints
Code
sr311 |>filter(!is.na(Longitude) &!is.na(Latitude)) |>slice_sample(n =5000) |>ggplot(aes(x = Longitude, y = Latitude)) +geom_point(alpha =0.4, size =0.6, color ="orange") +theme_minimal() +labs(title ="311 Complaints Locations (sample 5k)", x ="Longitude", y ="Latitude")
Both the above location plots show that overall the housing violations and 311 complaints are spread across the almost full north–south span of Manhattan with very similar clusters in same bands of longitude and latitude but 311 plot is a bit more denser yet still both datasets show similar island shape and highlight the same high‑activity corridors. This basically means that hot spots in the plot for both the datasets mostly occur in same parts of Manhattan.
3.8 Parallel Coordinates plot
Code
housing_dt <- housing_2sr311_dt <-as.data.table(sr311)housing_dt[, BBL :=as.character(BBL)]sr311_dt[, BBL :=as.character(BBL)]viol_severity <- housing_dt[!is.na(BBL) & BBL !="", .(ViolationCount = .N,ClassC =sum(Class =="C"),RentImp =sum(RentImpairing =="Y") ), by = BBL]viol_severity[, Pct_ClassC := ClassC / ViolationCount]viol_severity[, Pct_RentImp := RentImp / ViolationCount]complaints_by_building <- sr311_dt[!is.na(BBL) & BBL !="", .(ComplaintCount = .N), by = BBL]bldg_pcp <-merge( viol_severity, complaints_by_building,by ="BBL",all.x =TRUE)bldg_pcp[is.na(ComplaintCount), ComplaintCount :=0]bldg_pcp_filt <- bldg_pcp[ ViolationCount >=20| ComplaintCount >=50]pcp_df <-as.data.frame(bldg_pcp_filt[, .(ViolationCount =as.numeric(ViolationCount),ComplaintCount =as.numeric(ComplaintCount),Pct_ClassC =as.numeric(Pct_ClassC),Pct_RentImp =as.numeric(Pct_RentImp))])set.seed(2025)# Sample 300 buildingspcp_sample <- pcp_df[sample(nrow(pcp_df), 300), ]# Create severity groups based on Class C sharepcp_sample$SeverityGroup <-cut( pcp_sample$Pct_ClassC,breaks =c(-Inf, 0.2, 0.5, Inf),labels =c("Low C share", "Medium C share", "High C share"))ggparcoord(data = pcp_sample,columns =1:4,groupColumn ="SeverityGroup",scale ="uniminmax",alphaLines =0.25, # lighter, smoothershowPoints =FALSE) +scale_color_brewer(palette ="Set1") +theme_minimal(base_size =11) +# smaller base fonttheme(plot.title =element_text(face ="bold", size =16),axis.title.x =element_text(size =11),axis.title.y =element_text(size =11),axis.text.x =element_text(size =9), # reduce x‑axis text sizeaxis.text.y =element_text(size =9),legend.position ="bottom", # legend at bottomlegend.title =element_text(size =10),legend.text =element_text(size =9) ) +labs(title ="Parallel Coordinates: Sample of Buildings",x ="Variables",y ="Scaled value",color ="Share of Class C" )
This plot compares each sampled building violation across 4 variables & groups them by how many Class C violations they have. The only clear pattern we can see here is that buildings with higher share of Class C violations have higher share of rent‑impairing violations but for other variables (total violations, 311 complaints), the 3 groups overlap alot, so there is no simple correlation or strong extra pattern visible.
3.9 Top Violation Classes
Code
# === Violation Class Distribution ===# Class definitions: A=Non-Hazardous, B=Hazardous, C=Immediately Hazardous, I=Failure to Registerclass_data <- housing |>filter(!is.na(Class)) |>count(Class, name ="N") |>mutate(ClassLabel =case_when( Class =="A"~"Class A: Non-Hazardous", Class =="B"~"Class B: Hazardous", Class =="C"~"Class C: Immediately Hazardous", Class =="I"~"Class I: Failure to Register",TRUE~ Class ) ) |>arrange(desc(N))ggplot(class_data, aes(x =reorder(ClassLabel, N), y = N, fill = Class)) +geom_col(fill ="steelblue") +geom_text(aes(label = scales::comma(N)), hjust =-0.1, size =3) +coord_flip() +scale_y_continuous(labels = scales::comma, expand =expansion(mult =c(0, 0.15))) +labs(title ="Housing Violations by Severity Class",subtitle ="Manhattan, 2022-2025",x ="Violation Class",y ="Number of Violations" ) +theme_minimal(base_size =16) +theme(plot.title =element_text(face ="bold", size =12),legend.position ="none" )
Majority of violations (~238k) are classified as hazardous (Class B) followed by immediately hazardous (Class C) which is ~185k violations, indicating very serious safety and habitability concerns in Manhattan. Class B violations, which require correction within 30 days, dominate the landscape. This high proportion of serious violations highlights how important it is to thouroughly inspect a building before renting.
Here, Non-Rent-Impairing is 542,652 (92.1%) and has the a very big majority while rent-Impairing is only 46,353 (7.9%) which means serious habitability issues. This means only 7.9% directly affect habitability and could legally justify rent withholding but though most violations are non-rent-impairing only, rent-impairing violations represent conditions severe enough to make apartments legally uninhabitable like lack of basics amenities like heat, hot water, or having structural hazards.