3  Analysis & Results

Author

Shreya Shetty (svs2148) & Shruti Shetty (ss7592)

3.1 Import Libraries, Dataset Setup and Cleaning

3.1.1 Loading Libraries

Code
# Libraries
library(dplyr)
library(lubridate)
library(janitor)
library(ggplot2)
library(ggalluvial)
library(tidyr)
library(scales)
library(data.table)
library(readr)
library(tidyverse)
library(forcats)
library(stringr)
library(redav)
library(GGally)

3.1.2 Loading Housing Violations and 311 Complaints Datasets

Code
# ========== HOUSING VIOLATIONS ==========
housing_raw <- read.csv("/Users/shreyashetty/Documents/Fall 2025 Courses/EDAV/Final_Project/edav_project/datasets/Housing_Violations_2022_onwards.csv")

# ========== 311 HOUSING COMPLAINTS ==========
sr311_raw   <- read.csv("/Users/shreyashetty/Documents/Fall 2025 Courses/EDAV/Final_Project/edav_project/datasets/311_Housing_Complaints_2022_onwards.csv")

3.1.3 Formatting dates and making sure loaded data has only Manhattan Borough entries

Here, we created a robust set of parse orders for mixed formats in both our datasets. We created cleaned date columns while keep original columns untouched. Also, we added new columns with month-year (Month name + Year) extracted which will be used for alluvial stacks later on

Code
# VALID DATE ORDERS

date_orders <- c("mdY HMS", "mdY HM", "mdY", "Ymd HMS", "Ymd")

housing <- housing_raw |>
  mutate(
    borough_std = toupper(Borough),
    # PARSE master versions — keep duplicates untouched
    inspectiondate_clean = parse_date_time(InspectionDate, orders = date_orders),
    approveddate_clean   = parse_date_time(ApprovedDate, orders = date_orders),
    certifieddate_clean  = parse_date_time(CertifiedDate, orders = date_orders),
    novissued_clean      = parse_date_time(NOVIssuedDate, orders = date_orders),
    statusdate_clean     = parse_date_time(CurrentStatusDate, orders = date_orders),
    originalcertifybydate_clean = parse_date_time(OriginalCertifyByDate, orders = date_orders),
    originalcorrectbydate_clean = parse_date_time(OriginalCorrectByDate, orders = date_orders),
    newcertifybydate_clean = parse_date_time(NewCertifyByDate, orders = date_orders),
    newcorrectbydate_cean = parse_date_time(NewCorrectByDate, orders = date_orders),
  ) |>
  filter(borough_std == "MANHATTAN")

sr311 <- sr311_raw |>
  mutate(
    borough_std = toupper(Borough),
    created_clean = parse_date_time(Created.Date, orders = date_orders),
    closed_clean  = parse_date_time(Closed.Date, orders = date_orders)
  ) |>
  filter(borough_std == "MANHATTAN")


# adding month-year (Month name + Year) columns used for alluvial stacks

housing <- housing |>
  mutate(
    inspection_month = ifelse(!is.na(inspectiondate_clean),
                              format(floor_date(inspectiondate_clean, "month"), "%B %Y"),
                              NA_character_),
    approved_month = ifelse(!is.na(approveddate_clean), 
                            format(floor_date(approveddate_clean,"month"), "%B %Y"),
                            NA_character_),
    nov_month = ifelse(!is.na(novissued_clean),
                       format(floor_date(novissued_clean, "month"), "%B %Y"),
                       NA_character_),
    status_month = ifelse(!is.na(statusdate_clean),
                          format(floor_date(statusdate_clean, "month"), "%B %Y"),
                          NA_character_)
  )

Housing Violations updated rows check:

Code
# Hosuing violations rows
nrow(housing)
[1] 589005
Code
sr311 <- sr311 |>
  mutate(
    created_month = ifelse(!is.na(created_clean),
                           format(floor_date(created_clean, "month"), "%B %Y"),
                           NA_character_),
    closed_month = ifelse(!is.na(closed_clean),
                           format(floor_date(closed_clean, "month"), "%B %Y"),
                           NA_character_)
)

311 Request updated rows check:

Code
# 311 requests rows updated
nrow(sr311)
[1] 425248

The date columns have been parsed properly and as we can see the both raw data we downloaded, only had data from the Manhattan borough since before and after filtering the number of rows remain same for both datasets i.e. 589005 rows in Housing Violations dataset and 425248 rows in 311 requests dataset.

3.1.4 Housing violation decription column data cleaning

Code
## Housing violation description column data cleaning

# First, let's inspect NOVDescription
head(housing$NOVDescription, 10)
 [1] "(A) § HMC:FILE ANNUAL BEDBUG REPORT IN ACCORDANCE WITH HPD RULE AS DESCRIBED ON THE BACK OF THIS NOTICE OF VIOLATION OR AS DESCRIBED ON HPDâ\u0080\u0099S WEBSITE, WWW.NYC.GOV\\HPD, SEARCH BED BUGS."                                 
 [2] "§ 27-2005 ADM CODE PAINT METAL IN ACCORDANCE WITH DEPT. REGULATION RADIATOR AT THE SOUTH WALL IN THE KITCHEN LOCATED AT APT 9C, 9th STORY, 3rd APARTMENT FROM EAST AT SOUTH"                                                           
 [3] "§ 27-2026 ADM CODE REPAIR THE LEAKY AND/OR DEFECTIVE FAUCETS AT SINK IN THE KITCHENETTE LOCATED AT B-ROOM 3N, 3rd STORY, 1st B-ROOM FROM NORTH AT EAST"                                                                                
 [4] "§ 27-2005, 27-2007, 27-2041.1 HMC: REPLACE OR REPAIR THE SELF-CLOSING DOORS THAT IS MISSING OR DEFECTIVE AT BULKHEAD"                                                                                                                  
 [5] "§ 27-2026 ADM CODE REPAIR THE LEAKY AND/OR DEFECTIVE FAUCETS AT WASHBASIN IN THE BATHROOM LOCATED AT APT 20B, 2nd STORY, 1st APARTMENT FROM NORTH AT EAST"                                                                             
 [6] "§ 27-2046.1 HMC: REPAIR OR REPLACE THE CARBON MONOXIDE DETECTING DEVICE(S). MISSING IN THE ENTIRE APARTMENT LOCATED AT APT 7, 3rd STORY, 1st APARTMENT FROM EAST AT SOUTH"                                                             
 [7] "§ 27-2045 ADM CODE REPAIR OR REPLACE THE SMOKE DETECTOR MISSING LOCATED AT APT 2A, 2nd STORY, 1st APARTMENT FROM WEST AT NORTH"                                                                                                        
 [8] "§ 27-2046.1 HMC: REPAIR OR REPLACE THE CARBON MONOXIDE DETECTING DEVICE(S). MISSING IN THE ENTIRE APARTMENT LOCATED AT APT 5C, 5th STORY, 1st APARTMENT FROM WEST AT NORTH"                                                            
 [9] "§ 27-2045 ADM CODE REPAIR OR REPLACE THE SMOKE DETECTOR MISSING IN THE ENTIRE APARTMENT LOCATED AT APT 5C, 5th STORY, 1st APARTMENT FROM WEST AT NORTH"                                                                                
[10] "§ 27-2005 ADM CODE REPAIR THE BROKEN OR DEFECTIVE PLASTERED SURFACES AND PAINT IN A UNIFORM COLOR EAST WALL APPROX 4SQ FT AND NORTH WALL APPROX 4 SQ FT IN THE BATHROOM LOCATED AT APT 1A, 1st STORY, 2nd APARTMENT FROM SOUTH AT WEST"
Code
# ========== REFINED CATEGORIZATION ==========
housing <- housing |>
  mutate(
    ViolationCategory = case_when(

      # 1 — Registration (Class I)
      Class == "I" |
        str_detect(NOVDescription, regex("registration|register", ignore_case = TRUE)) ~
        "REGISTRATION/ADMIN",

      # 2 — Smoke / CO
      str_detect(NOVDescription, regex("smoke detector|carbon monoxide|co detect", ignore_case = TRUE)) ~
        "SMOKE/CO DETECTOR",

      # 3 — Heat / Hot Water
      str_detect(NOVDescription, regex("heat|heating|hot water|radiator|boiler|steam|too cold", ignore_case = TRUE)) ~
        "HEAT/HOT WATER",

      # 4 — Plumbing
      str_detect(NOVDescription, regex("plumb|pipe|faucet|drain|sewer|bathroom|sink|toilet|water supply", ignore_case = TRUE)) ~
        "PLUMBING",

      # 5 — Water leak
      str_detect(NOVDescription, regex("leak|leaking|leakage|water drip", ignore_case = TRUE)) ~
        "WATER LEAK",

      # 6 — Paint / Plaster
      str_detect(NOVDescription, regex("paint|plaster|peel|wall|ceiling", ignore_case = TRUE)) ~
        "PAINT/PLASTER",

      # 7 — Door / Window / Lock
      str_detect(NOVDescription, regex("door|window|lock|self-closing|entrance", ignore_case = TRUE)) ~
        "DOOR/WINDOW/LOCK",

      # 8 — Pest / Sanitation
      str_detect(NOVDescription, regex("rodent|pest|roach|mice|rat|garbage|infest|sanitation", ignore_case = TRUE)) ~
        "PEST/SANITATION",

      # 9 — Floor / Ceiling
      str_detect(NOVDescription, regex("floor|tile|carpet|wood floor|ceramic", ignore_case = TRUE)) ~
        "FLOOR/CEILING",

      # 10 — Elevator
      str_detect(NOVDescription, regex("elevator|lift", ignore_case = TRUE)) ~
        "ELEVATOR",

      # 11 — Mold
      str_detect(NOVDescription, regex("mold|mildew|moisture", ignore_case = TRUE)) ~
        "MOLD",

      # 12 — Electrical
      str_detect(NOVDescription, regex("electric|wiring|outlet|light|circuit", ignore_case = TRUE))  ~ "ELECTRICAL",

      # 13 — Gas / Appliances
      str_detect(NOVDescription, regex("gas|appliance|stove", ignore_case = TRUE)) ~
        "GAS/APPLIANCES",

      # 14 — Ventilation
      str_detect(NOVDescription, regex("ventilat|airflow|exhaust fan", ignore_case = TRUE)) ~
        "VENTILATION",

      # 15— Fire Safety
      str_detect(NOVDescription, regex("fire|sprinkler|fire escape|extinguish", ignore_case = TRUE)) ~
        "FIRE SAFETY",

      # 16— Intercom / Bell
      str_detect(NOVDescription, regex("bell|buzzer|intercom", ignore_case = TRUE)) ~
        "BUILDING SYSTEMS",

      # 17 — Building Management
      str_detect(NOVDescription, regex("janitor|superintendent|building service|super", ignore_case = TRUE)) ~
        "BUILDING MANAGEMENT",

      # Default
      TRUE ~ "OTHER"
    )
  )

# Check distribution of newly created categories
housing |>
  count(ViolationCategory, sort = TRUE) |>
  print()
     ViolationCategory      n
1        PAINT/PLASTER 127336
2             PLUMBING  93154
3     DOOR/WINDOW/LOCK  91860
4                OTHER  53533
5      PEST/SANITATION  50753
6    SMOKE/CO DETECTOR  40227
7       HEAT/HOT WATER  39990
8   REGISTRATION/ADMIN  35133
9           WATER LEAK  23370
10       FLOOR/CEILING  19633
11         FIRE SAFETY   7121
12      GAS/APPLIANCES   4004
13          ELECTRICAL   1607
14    BUILDING SYSTEMS    679
15 BUILDING MANAGEMENT    376
16                MOLD    118
17            ELEVATOR     81
18         VENTILATION     30
Code
# ========== 311: CATEGORY MAPPING (parallel to housing) ==========

# Compare with 311 Complaint Types
sr311 |>
  count(Complaint.Type, sort = TRUE) |>
  print()
                  Complaint.Type      n
1                 HEAT/HOT WATER 229222
2                       PLUMBING  50637
3                  PAINT/PLASTER  44624
4                     WATER LEAK  30190
5  General Construction/Plumbing  25207
6        Maintenance or Facility  20202
7                       Elevator  16588
8                       Plumbing   2197
9                       ELEVATOR   1302
10          Non-Residential Heat   1094
11            School Maintenance   1064
12              OUTSIDE BUILDING    857
13                    Water Leak    556
14                Heat/Hot Water    487
15                 Paint/Plaster    404
16                          Mold    402
17  Building Drinking Water Tank     83
18             Sewer Maintenance     68
19             Water Maintenance     31
20            Building Condition     19
21             Unstable Building     14
Code
# Mapped categories
sr311 <- sr311 |>
  mutate(
    Category = case_when(
      # Heat / Hot Water
      str_detect(`Complaint.Type`, regex("HEAT|HOT WATER", ignore_case = TRUE)) ~ "HEAT/HOT WATER",
      
      # Plumbing
      str_detect(`Complaint.Type`, regex("PLUMBING", ignore_case = TRUE)) ~ "PLUMBING",
      
      # Water Leak
      str_detect(`Complaint.Type`, regex("WATER LEAK", ignore_case = TRUE)) ~ "WATER LEAK",
      
      # Elevator
      str_detect(`Complaint.Type`, regex("ELEVATOR", ignore_case = TRUE)) ~ "ELEVATOR",
      
      # Mold
      str_detect(`Complaint.Type`, regex("MOLD", ignore_case = TRUE)) ~ "MOLD",
      
      # PAINT / PLASTER — newly added
      str_detect(`Complaint.Type`, regex("PAINT|PLASTER", ignore_case = TRUE)) ~ "PAINT/PLASTER",
      
      # Pest / sanitation based on Complaint.Type AND Descriptor
      (
        str_detect(`Complaint.Type`, regex("MAINTENANCE|FACILITY", ignore_case = TRUE)) &
        str_detect(Descriptor, regex("rodent|rodents|mice|rats|mouse|rat|insect|pest", ignore_case = TRUE))
      ) ~ "PEST/SANITATION",
      
      # Fallback
      TRUE ~ "OTHER"
    )
  )

# New mapped categories
sr311 |>
  count(Category, sort = TRUE) |>
  print()
         Category      n
1  HEAT/HOT WATER 230803
2        PLUMBING  78041
3   PAINT/PLASTER  45028
4      WATER LEAK  30746
5        ELEVATOR  17890
6           OTHER  17784
7 PEST/SANITATION   4554
8            MOLD    402

Here, manual categorization logic we applied successfully maps unstructured text descriptions (NOVDescriptions) in Housing Violations dataset into unified high-level categories which enables a direct one-to-one comparison between both. We also normalized the categories in 311 Complaints dataset since there were multiple for the same category, ege. there were 2 types “HEAT/HOT WATER” and “Heat/Hot Water” which were merged into 1.

3.1.5 Save cleaned CSV copies for reproducibility

Code
write.csv(housing, "datasets/housing_manhattan_3years_clean.csv", row.names = FALSE)
write.csv(sr311,   "datasets/sr311_manhattan_3years_clean.csv", row.names = FALSE)

3.2 CATEGORY COMPARISON: Housing vs 311

Code
comparison_categories <- c(
  "HEAT/HOT WATER", "PLUMBING", "PAINT/PLASTER",
  "WATER LEAK", "ELEVATOR", "MOLD", "PEST/SANITATION"
)

# Violations count
viol_comparison <- housing |>
  filter(ViolationCategory %in% comparison_categories) |>
    group_by(ViolationCategory) |>
      summarise(Count = n(), .groups = "drop") |>
        mutate(
          Category = ViolationCategory,
          Source = "Housing Violations"
        ) |>
          select(Category, Count, Source)

# 311 count
complaints_comparison <- sr311 |>
  filter(Category %in% comparison_categories) |>
    group_by(Category) |>
      summarise(Count = n(), .groups = "drop") |>
        mutate(Source = "311 Complaints")

# Combine both
combined <- bind_rows(viol_comparison, complaints_comparison)

# Plot
ggplot(combined, aes(x = reorder(Category, -Count), y = Count, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = scales::comma(Count)),
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3) +
  scale_fill_manual(values = c(
    "Housing Violations" = "purple",
    "311 Complaints"     = "pink"
  )) +
  scale_y_continuous(labels = scales::comma,
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Violations vs 311 Complaints by Category",
    subtitle = "Manhattan (Last 3 yrs) Comparable Categories Only",
    x = "Category",
    y = "Count",
    fill = "Source"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold", size = 17),
    plot.subtitle = element_text(size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

There is a substantial disparity in volume for the “HEAT/HOT WATER” category in violations vs complaints, where the number of 311 Service Requests is significantly higher than the count of official Housing violations issued for the same. Hence, we can conclude that while tenant complaints regarding heating are very very frequent, quite a smaller proportion result in confirmed enforcement actions by management. In other categories, the difference is relatively small. In case of Paint/Plaster and Pests, there is much higher number of building violation entries which shows that probably regular building inspections is conducted for those and hence less direct 311 complaints. Out of all top 7 identified categories overlapping in both the datasets, Heat/Hot Water is the highest and Mold is the lowest complaint type.

3.3 3-year per-quarter faceted plots for Housing Violations & 311 Complaints

Code
# ----- HOUSING: QUARTERLY CATEGORY TRENDS (3-LINE-QTRS) -----

housing_quarter_lines <- housing |>
  filter(!is.na(inspectiondate_clean)) |>
    mutate(
      year = year(inspectiondate_clean),
      quarter = paste0("Q", quarter(inspectiondate_clean))
    ) |>
      count(year, quarter, ViolationCategory) |>
        mutate(
          quarter = factor(quarter, levels = c("Q1","Q2","Q3","Q4"))
        )

focus_cats <- c(
  "HEAT/HOT WATER",
  "PLUMBING",
  "PAINT/PLASTER",
  "WATER LEAK",
  "ELEVATOR",
  "MOLD",
  "PEST/SANITATION"
)

housing_quarter_lines_focus <- housing_quarter_lines |>
  dplyr::filter(ViolationCategory %in% focus_cats)

ggplot(housing_quarter_lines_focus,
       aes(x = quarter, y = n, color = factor(year), group = year)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.2) +
  scale_y_continuous(labels = scales::comma) +
  scale_color_manual(values = c("2023" = "#E63946",
                                "2024" = "#457B9D",
                                "2025" = "#2A9D8F"),
                     name = "Year") +
  facet_wrap(~ ViolationCategory,
             scales = "free_y",
             ncol = 2) +    
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 17),
    axis.text.x = element_text(angle = 0),
    legend.position = "top"
  ) +
  labs(
    title = "Quarterly Housing Violations by Category",
    subtitle = "Manhattan (2022–25) • Selected Categories",
    x = "Quarter",
    y = "Number of Violations"
  )

Code
# ----- 311: QUARTERLY CATEGORY TRENDS (3-LINE-QTRS) -----

sr311_quarter_lines <- sr311 |>
  filter(!is.na(created_clean)) |>
  mutate(
    year    = year(created_clean),
    quarter = paste0("Q", quarter(created_clean))
  ) |>
  count(year, quarter, Category) |>
  mutate(
    quarter = factor(quarter, levels = c("Q1","Q2","Q3","Q4"))
  )

sr311_quarter_lines_focus <- sr311_quarter_lines |>
  filter(Category %in% focus_cats)

ggplot(sr311_quarter_lines_focus,
       aes(x = quarter, y = n, color = factor(year), group = year)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.2) +
  scale_y_continuous(labels = scales::comma) +
  scale_color_manual(values = c("2023" = "#E63946",
                                "2024" = "#457B9D",
                                "2025" = "#2A9D8F"),
                     name = "Year") +
  facet_wrap(~ Category, scales = "free_y", ncol = 2) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title  = element_text(face = "bold", size = 17),
    axis.text.x = element_text(angle = 0),
    legend.position = "top"
  ) +
  labs(
    title    = "Quarterly 311 Complaints by Category",
    subtitle = "Manhattan (2022-25) • Selected Categories",
    x = "Quarter",
    y = "Number of Complaints"
  )

Across quarters, HEAT/HOT WATER 311 complaints and housing violations both show a clear seasonal pattern. They spike in Q1, drop sharply in Q2 and Q3 for all 3 years, and rise again Q4 matching winter heating needs. For PLUMBING, PAINT/PLASTER, PEST/SANITATION, WATER LEAK, ELEVATOR, and MOLD, housing violations counts change over the quarters (most show a local peak around Q3), but patterns are less sharp and not perfectly identical so they don’t show a single clear seasonal shape like HEAT/HOT WATER does. Pest and sanitation 311 complaints peak in Q3 in all 3 years, which suggests these problems are most common in late summer, which is also consistent with warmer weather since heat and humidity make it easier for rodents, roaches, and insects to spread, and garbage issues can get worse when it’s so hot. But it’s not as strong or as perfectly consistent across all years. For the other categories for 311 complaints, the quarter‑to‑quarter changes within each panel are relatively small and do not follow a clear, repeated seasonal shape across years.

3.4 Housing Violation Monthly Transition Alluvial for 2024 (Sample)

Code
# ---- Housing Violation 2024 Monthly Transition Alluvial ----

set.seed(200)
month_levels <- c("Jan","Feb","Mar","Apr","May","Jun",
                  "Jul","Aug","Sep","Oct","Nov","Dec")

viol_flow_2024 <- housing |>
  select(ViolationID, Class,
         inspectiondate_clean, novissued_clean, statusdate_clean) |>
  mutate(
    insp_m = floor_date(inspectiondate_clean, "month"),
    nov_m  = floor_date(novissued_clean,       "month"),
    stat_m = floor_date(statusdate_clean,      "month")
  ) |>
  filter(!is.na(insp_m),
         year(insp_m) == 2024,
         (!is.na(nov_m) | !is.na(stat_m))) |>
  slice_sample(n = 200) |>
  mutate(
    insp_m_lab = factor(format(insp_m, "%b"),  levels = month_levels),
    nov_m_lab  = factor(format(nov_m,  "%b"),  levels = month_levels),
    stat_m_lab = factor(format(stat_m, "%b"),  levels = month_levels)
  ) |>
  group_by(insp_m_lab, nov_m_lab, stat_m_lab, Class) |>
  summarise(n = n(), .groups = "drop")


# plot
ggplot(viol_flow_2024,
       aes(axis1 = insp_m_lab, axis2 = nov_m_lab, axis3 = stat_m_lab, y = n)) +
  geom_alluvium(aes(fill = Class),
                width = 0.3, alpha = 0.9) +   # softer, smoother ribbons
  geom_stratum(aes(fill = Class),
               width = 0.3, alpha = 0.2) +
  geom_text(stat = "stratum",
            aes(label = after_stat(stratum)),
            size = 3, color = "black") +
  scale_fill_brewer(type=qual, palette = "Set2") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "2024 Housing Violations Flow: Inspection → NOV → Status (Sample)",
    x = "Stage (Month in 2024)",
    y = "Number of violations"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title   = element_text(face = "bold", size = 11),
    legend.title = element_text(face = "bold")
  )

We can see how 2024 housing violations (we have taken a small sample slice of 200 here to visualize better) move from inspection month to NOV (notification of violation) month to when current status takes effect, with ribbon colors showing violation Class (A least serious, C most serious, I immediately hazardous). Mostly, flows stay within nearby months across 3 stages showing many violations progress through the process within the same part of the year. Thicker orange and blue ribbons show that Class B and C violations make up most of the pipeline, while Class A is thinner and Class I appears as a smaller set of immediately hazardous cases. There are also some blank bands in NOV step which occur when inspection and status dates are recorded but NOVIssuedDate or Class code is missing, so those months or classes appear as NA.

3.5 Facet Comparative Panel: Housing vs 311

Code
# ----------------------------
# 1) STANDARDIZE ZIPCODE NAME
# ----------------------------

housing2 <- housing |>
  rename(ZipCode = Postcode) |>
  # Convert housing block to character type immediately
  mutate(Block = as.character(Block)) 

sr311_2 <- sr311 |>
  rename(ZipCode = Incident.Zip) |>
  mutate(
    # Extract 5 digits starting from the 2nd character (position 2) and removing start zeroes
    Block = as.character(as.numeric(str_sub(BBL, start = 2, end = 6)))
  )


# ----------------------------
# 2) CREATE TOP SETS PER VARIABLE (20 each)
# ----------------------------

# Top 10 ZIPs
housing_zip <- housing2 |> filter(!is.na(ZipCode)) |> count(ZipCode) |> slice_max(n, n = 10)
sr311_zip   <- sr311_2 |> filter(!is.na(ZipCode)) |> count(ZipCode) |> slice_max(n, n = 10)

# Top 10 Streets
housing_st <- housing2 |> filter(!is.na(StreetName)) |> count(StreetName) |> slice_max(n, n = 10)
sr311_st   <- sr311_2 |> filter(!is.na(Street.Name)) |> count(Street.Name) |> slice_max(n, n = 10)

# Top Categories (Housing only but we compare counts against 311 Categories
housing_cat <- housing2 |> filter(!is.na(ViolationCategory)) |> count(ViolationCategory) |> slice_max(n, n = 10) |> rename(Category = ViolationCategory)
sr311_cat   <- sr311_2 |> filter(!is.na(Category)) |> count(Category) |> slice_max(n, n = 10) 

# Top 10 Blocks
housing_block<- housing2 |> filter(!is.na(Block)) |> count(Block) |> slice_max(n, n = 10)
sr311_block  <- sr311_2 |> filter(!is.na(Block)) |> count(Block) |> slice_max(n, n = 10)


# ----------------------------
# 3) COMBINE PANELS  (FIXED WITH as.character)
# ----------------------------

zip_panel <- bind_rows(
  housing_zip |> mutate(Value = as.character(ZipCode), Source = "Housing", Facet = "ZipCode"),
  sr311_zip   |> mutate(Value = as.character(ZipCode), Source = "311",     Facet = "ZipCode")
)


category_panel <- bind_rows(
  housing_cat |> mutate(Value = as.character(Category), Source = "Housing", Facet = "Category"),
  sr311_cat   |> mutate(Value = as.character(Category), Source = "311",     Facet = "Category")
)

street_panel <- bind_rows(
  housing_st |> mutate(Value = as.character(StreetName), Source = "Housing", Facet = "StreetName"),
  sr311_st   |> mutate(Value = as.character(Street.Name), Source = "311",     Facet = "StreetName")
)


block_panel <- bind_rows(
  housing_block |> mutate(Value = as.character(Block), Source = "Housing", Facet = "Block"),
  sr311_block  |> mutate(Value = as.character(Block), Source = "311",     Facet = "Block")
)

# ----------------------------
# 4) FINAL COMBINED PANEL
# ----------------------------

panel_all <- bind_rows(zip_panel, street_panel, category_panel, block_panel) |>
  group_by(Facet) |>
  mutate(Value = fct_reorder(Value, n)) |>
  ungroup()# ----------------------------
# 4) FINAL FACETED PLOT
# ----------------------------

ggplot(panel_all, aes(x = Value, y = n, fill = Source)) +
  geom_col(position = "dodge") +
  coord_flip() +
  facet_wrap(~ Facet, scales = "free", ncol = 2) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = c("Housing" = "purple", "311" = "pink")) +
  labs(
    title = "Top Manhattan Characteristics Across\nHousing Violations and 311 Complaints",
    subtitle = "Blocks, Violation/Complaint Categories,\nStreet Names, and ZIP Codes",
    x = "",
    y = "Count",
    fill = "Source"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.text = element_text(face = "bold", size = 9),
    plot.title = element_text(face = "bold", size = 13),
    plot.subtitle = element_text(size = 10),
    legend.position = "top",
    axis.text.y = element_text(size = 6),
    axis.text.x = element_text(size = 6)
  )

Here, we compared the top 10 blocks, housing/violation categories, streets, and ZIP codes in Manhattan across both are datasets and we can conclude that mostly same areas and same problem types show up in both datasets at the top, like blocks 2136 and 2170 have highest violations/complaints, streets like Broadway and St Nicholas Avenue also have highest records in both, ZIP codes 10031–10033, 10027 all have very high counts in both datasets, and categories like HEAT/HOT WATER, PLUMBING, and PAINT/PLASTER dominate for both complaints and violations, indicating that both our datasets focus on similar set of buildings, locations, categories, and issues.

3.6 Time Series

3.6.1 Housing Daily

Code
housing |>
  filter(!is.na(inspectiondate_clean)) |>
  count(inspectiondate_clean) |>
  ggplot(aes(x = inspectiondate_clean, y = n)) +
  geom_line(color = "purple") +
  labs(title = "Housing Violations Timeline (daily, Manhattan, 2022-2025)",
  x = "Inspection date", y = "Count") +
  theme_minimal()

The spikes here kind of look like a pattern, but the size and shape suggest they are just random glitches and not exactly a real trend. On normal days between 2022-2025, it is mostly noisy and a flat band around a few hundred inspections but on very few days it jumps over 5,000. This is extremely high & not very realistic just 1 day and does not spread over nearby days so it’s not a trend but these spikes are outliers.

3.6.2 311 Daily

Code
sr311 |>
  filter(!is.na(created_clean)) |>
  count(created_clean) |>
  ggplot(aes(x = created_clean, y = n)) +
  geom_line(color = "pink") +
  labs(title = "311 Requests Timeline (daily, Manhattan, 2022-2025)",
  x = "Created date", y = "Count") +
  theme_minimal()

It is almost a solid pink block with very tiny variation in consecutive days and that’s why it’s very hard to see any clear pattern. Hence, this plot does no give us much information. TO further deep dive into finding a trend in the time series, we plotted combined monthly series for both the dataset.

3.6.3 Monthly combined of Housing and 311

Code
housing_month <- housing |>
  filter(!is.na(inspectiondate_clean)) |>
  mutate(month = floor_date(inspectiondate_clean, "month")) |>
  count(month, name = "HousingViolationCount")

sr311_month <- sr311 |>
  filter(!is.na(created_clean)) |>
  mutate(month = floor_date(created_clean, "month")) |>
  count(month, name = "311ComplaintCount")

monthly_combined <- full_join(housing_month, sr311_month, by = "month") |>
  replace_na(list(HousingCount = 0, ComplaintCount = 0))

monthly_combined |>
  pivot_longer(cols = c("HousingViolationCount", "311ComplaintCount"), names_to = "Source", values_to = "Count") |>
  ggplot(aes(x = month, y = Count, color = Source)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.5) +
  scale_color_manual(values = c("311ComplaintCount" = "pink", "HousingViolationCount" = "purple"))+
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = , size = 5)) +
  labs(title = "Monthly: Housing Violations vs 311 Complaints (Manhattan)",
  x = "Month", y = "Count")

Here, both 311 complaints and housing violations in Manhattan during 2022-2025 have a fairly strong seasonal pattern, but there are also some exceptions. In most years the red complaint line is highest in late fall & winter, & lowest in late spring & summer which basically means people call 311 line more about housing problems during colder months when heat and building conditions matter more. Now blue violations line also have major winter peaks overall but it includes few extra spikes in other months (like June & August 2025), so its pattern is winter‑dominated but not specific to winter‑only. Also, some of those spikes might be anomalies and outlier inspections instead of a reasonable trend.

3.6.4 Time series Variation by Month and Year for Housing

Code
set.seed(2025)

# Load data
housing_2 <- as.data.table(housing)

# Convert date and extract time components
housing_2[, InspectionDate := as.Date(InspectionDate)]
housing_2[, YearMonth := format(InspectionDate, "%Y-%m")]
housing_2[, Month := month(InspectionDate)]
housing_2[, Year := year(InspectionDate)]

# Count violations by month and year
viol_by_month <- housing_2[, .N, by = .(Year, Month)][order(Year, Month)]

# Create the plot
ggplot(viol_by_month, aes(x = Month, y = N, color = factor(Year), group = Year)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.5) +
  scale_x_continuous(breaks = 1:12, 
                     labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Housing Violations by Month and Year",
    subtitle = "Manhattan, 2022-2025",
    x = "Month",
    y = "Number of Violations",
    color = "Year"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "right"
  )

It shows that there is no stable seasonal pattern across years. Some months are high in one year and low in another like February is very unusually high only in 2022, while peaks in other years appear in different months like April 2024 or June 2025. Hence, we can conclude that monthly counts fluctuate randomly rather than following a consistent winter–summer or any other repeating trend.

3.6.5 Investigating February 2022 peak

Code
# Investigate February 2022
feb_2022 <- housing_2[Year == 2022 & Month == 2]
cat("February 2022 violations:", nrow(feb_2022), "\n")
February 2022 violations: 28414 
Code
# Check if they're all inspected in Feb or entered in Feb
summary(feb_2022$InspectionDate)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2022-02-01" "2022-02-15" "2022-02-15" "2022-02-15" "2022-02-16" "2022-02-28" 
Code
summary(feb_2022$ApprovedDate)
   Length     Class      Mode 
    28414 character character 

Upon further review we can see that Median/Mean inspection date is February 15, 2022 & most inspections are clustered around Feb 15-16 only. This does not seem like a data error but is simply appearing to be a mass inspection event in mid-February 2022. It is not a trend.

3.6.6 Smoothed Temporal View: Housing by Quarter

We tried to find any overall trends when we smooth out the monthly fluctuations

Code
# Aggregate by quarter for smoother trend
housing_2[, Quarter := quarter(InspectionDate)]
housing_2[, YearQuarter := paste0(Year, "-Q", Quarter)]

viol_by_quarter <- housing_2[, .N, by = .(Year, Quarter, YearQuarter)][order(Year, Quarter)]

# Create quarter labels
viol_by_quarter[, QuarterLabel := paste0(Year, "\nQ", Quarter)]

# Plot
ggplot(viol_by_quarter, aes(x = factor(YearQuarter, levels = YearQuarter), 
                              y = N, group = 1)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_point(color = "steelblue", size = 3) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Housing Violations by Quarter",
    subtitle = "Manhattan, 2022-2025 (smoothed trend)",
    x = "Quarter",
    y = "Number of Violations"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

It shows that housing violations in Manhattan from 2022-2025 move up and down between quarters but stay in a fairly similar range (30,000–45,000), so there is no strong upward or downward trend. Q4 20205 seems like an anomaly but it is not because it does not include data from mid November - December 2025 data are not fully included, hence it is not any abnormal evidence of a real drop.

3.7 Spatial sample plots

3.7.1 Housing Violations

Code
housing |>
  filter(!is.na(Longitude) & !is.na(Latitude)) |>
  slice_sample(n = 5000) |>
  ggplot(aes(x = Longitude, y = Latitude)) +
  geom_point(alpha = 0.4, size = 0.6) +
  theme_minimal() +
  labs(title = "Housing Violations Locations (sample 5k)", x = "Longitude", y = "Latitude")

3.7.2 311 Complaints

Code
sr311 |>
  filter(!is.na(Longitude) & !is.na(Latitude)) |>
  slice_sample(n = 5000) |>
  ggplot(aes(x = Longitude, y = Latitude)) +
  geom_point(alpha = 0.4, size = 0.6, color = "orange") +
  theme_minimal() +
  labs(title = "311 Complaints Locations (sample 5k)", x = "Longitude", y = "Latitude")

Both the above location plots show that overall the housing violations and 311 complaints are spread across the almost full north–south span of Manhattan with very similar clusters in same bands of longitude and latitude but 311 plot is a bit more denser yet still both datasets show similar island shape and highlight the same high‑activity corridors. This basically means that hot spots in the plot for both the datasets mostly occur in same parts of Manhattan.

3.8 Parallel Coordinates plot

Code
housing_dt <- housing_2
sr311_dt   <- as.data.table(sr311)

housing_dt[, BBL := as.character(BBL)]
sr311_dt[,  BBL := as.character(BBL)]

viol_severity <- housing_dt[
  !is.na(BBL) & BBL != "",
  .(
    ViolationCount = .N,
    ClassC         = sum(Class == "C"),
    RentImp        = sum(RentImpairing == "Y")
  ),
  by = BBL
]

viol_severity[, Pct_ClassC  := ClassC  / ViolationCount]
viol_severity[, Pct_RentImp := RentImp / ViolationCount]

complaints_by_building <- sr311_dt[
  !is.na(BBL) & BBL != "",
  .(ComplaintCount = .N),
  by = BBL
]

bldg_pcp <- merge(
  viol_severity,
  complaints_by_building,
  by = "BBL",
  all.x = TRUE
)
bldg_pcp[is.na(ComplaintCount), ComplaintCount := 0]

bldg_pcp_filt <- bldg_pcp[
  ViolationCount >= 20 | ComplaintCount >= 50
]

pcp_df <- as.data.frame(bldg_pcp_filt[, .(
  ViolationCount = as.numeric(ViolationCount),
  ComplaintCount = as.numeric(ComplaintCount),
  Pct_ClassC     = as.numeric(Pct_ClassC),
  Pct_RentImp    = as.numeric(Pct_RentImp)
)])

set.seed(2025)

# Sample 300 buildings
pcp_sample <- pcp_df[sample(nrow(pcp_df), 300), ]

# Create severity groups based on Class C share
pcp_sample$SeverityGroup <- cut(
  pcp_sample$Pct_ClassC,
  breaks = c(-Inf, 0.2, 0.5, Inf),
  labels = c("Low C share", "Medium C share", "High C share")
)


ggparcoord(
  data        = pcp_sample,
  columns     = 1:4,
  groupColumn = "SeverityGroup",
  scale       = "uniminmax",
  alphaLines  = 0.25,          # lighter, smoother
  showPoints  = FALSE
) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 11) +   # smaller base font
  theme(
    plot.title   = element_text(face = "bold", size = 16),
    axis.title.x = element_text(size = 11),
    axis.title.y = element_text(size = 11),
    axis.text.x  = element_text(size = 9),   # reduce x‑axis text size
    axis.text.y  = element_text(size = 9),
    legend.position = "bottom",              # legend at bottom
    legend.title    = element_text(size = 10),
    legend.text     = element_text(size = 9)
  ) +
  labs(
    title = "Parallel Coordinates: Sample of Buildings",
    x = "Variables",
    y = "Scaled value",
    color = "Share of Class C"
  )

This plot compares each sampled building violation across 4 variables & groups them by how many Class C violations they have. The only clear pattern we can see here is that buildings with higher share of Class C violations have higher share of rent‑impairing violations but for other variables (total violations, 311 complaints), the 3 groups overlap alot, so there is no simple correlation or strong extra pattern visible.

3.9 Top Violation Classes

Code
# ===  Violation Class Distribution ===
# Class definitions: A=Non-Hazardous, B=Hazardous, C=Immediately Hazardous, I=Failure to Register

class_data <- housing |>
  filter(!is.na(Class)) |>
  count(Class, name = "N") |>
  mutate(
    ClassLabel = case_when(
      Class == "A" ~ "Class A: Non-Hazardous",
      Class == "B" ~ "Class B: Hazardous",
      Class == "C" ~ "Class C: Immediately Hazardous",
      Class == "I" ~ "Class I: Failure to Register",
      TRUE ~ Class
    )
  ) |>
  arrange(desc(N))

ggplot(class_data, aes(x = reorder(ClassLabel, N), y = N, fill = Class)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = scales::comma(N)), hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Housing Violations by Severity Class",
    subtitle = "Manhattan, 2022-2025",
    x = "Violation Class",
    y = "Number of Violations"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold", size =12),
    
    legend.position = "none"
  )

Majority of violations (~238k) are classified as hazardous (Class B) followed by immediately hazardous (Class C) which is ~185k violations, indicating very serious safety and habitability concerns in Manhattan. Class B violations, which require correction within 30 days, dominate the landscape. This high proportion of serious violations highlights how important it is to thouroughly inspect a building before renting.

3.10 Rent-Impairing Violation

Code
# === Rent-Impairing Violations ===

rent_data <- housing |>
  filter(!is.na(RentImpairing)) |>
  count(RentImpairing, name = "N") |>
  mutate(
    Label = ifelse(
      RentImpairing == "Y",
      "Rent-Impairing\n(affects habitability)",
      "Non-Rent-Impairing"
    )
  )

ggplot(rent_data, aes(x = Label, y = N, fill = RentImpairing)) +
  geom_col() +
  geom_text(
    aes(label = paste0(scales::comma(N), "\n(", round(N / sum(N) * 100, 1), "%)")),
    vjust = -0.5,
    size = 5
  ) +
  geom_col(fill = "steelblue") +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.5))) +
  labs(
    title = "Rent-Impairing vs Non-Rent-Impairing Violations",
    subtitle = "Manhattan, 2022-2025",
    x = "",
    y = "Number of Violations"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "none"
  )

Here, Non-Rent-Impairing is 542,652 (92.1%) and has the a very big majority while rent-Impairing is only 46,353 (7.9%) which means serious habitability issues. This means only 7.9% directly affect habitability and could legally justify rent withholding but though most violations are non-rent-impairing only, rent-impairing violations represent conditions severe enough to make apartments legally uninhabitable like lack of basics amenities like heat, hot water, or having structural hazards.

3.11 Save cleaned RDS as backups

Code
saveRDS(housing, file = "housing_manhattan_3years_clean.rds")
saveRDS(sr311,   file = "sr311_manhattan_3years_clean.rds")