2 Data

There are 2 dataset links that we will utilize from https://opendata.cityofnewyork.us. It is public data published by New York City agencies and other partners. As per our discussions during proposal approval, we have limited to 2 datasets (earlier we targeted 4) and downloaded filtered data for past 3 years and for Borough “Manhattan”.

The 2 datsets we are making use of from NYC Open Data are

Housing Maintenance Code Violations: Link: Housing Maintenance Code Violations
311 Service Requests from 2010 to Present: Link: 311 Service Requests from 2010 to Present

Both the datasets are CSV files in tabular format and they are updated daily. The HPD dataset is published by the Department of Housing Preservation and Development and contains one row per housing code violation, with about 10.4 million rows and 41 columns. It includes things like identifiers (ViolationID, BuildingID, BBL), address and borough, violation class (A/B/C/I), detailed text descriptions, multiple status dates, and basic geography. On the other hand, the 311 dataset is maintained by NYC’s 311 system and has around 41.5 million rows. It has one row per service request and the key fields include a unique request ID, the responding agency, complaint type and descriptor, creation and closure times, and several location fields such as address, ZIP, BBL, and latitude/longitude. We observed that in the 311 documentation that the “Agency Name” column is currently unreliable, so have informed us to use the “Agency” code instead.

Please note, that the entire data was too huge so we did not make use of it entirely and instead filtered it for analysis. For HPD housing violations, we restrict it to Manhattan from the year 2022 onwards and so the dataset has 589,005 rows and 41 columns. Same goes with 311 requests dataset as we filter it for Manhattan and for the same time period. It now has, 425,248 rows and 42 columns. To make the two of them comparable, we created new categorical variables: in the violations data, we used regular expressions on the free text NOVDescription field to assign each violation to a broad issue type (such as HEAT/HOT WATER, PLUMBING, PAINT/PLASTER, WATER LEAK, DOOR/WINDOW/LOCK, or PEST/SANITATION). And so in 311 data we mapped Complaint Type (and, for pest issues, Complaint Type together with Descriptor) into the same set of issue types.

2.1 Key distributions in Dataset

Additionally, we looked at some of the key distributions in the data to have a better understanding (see below). In HPD violations data, the most common CurrentStatus values are “VIOLATION DISMISSED”, “VIOLATION CLOSED”, and “NOV SENT OUT”, with a long list of more specific statuses such as “FIRST NO ACCESS TO RE-INSPECT VIOLATION” and “NOT COMPLIED WITH.” In the 311 data, housing related complaints are mostly dominated by a few Complaint Type categories like HEAT/HOT WATER, PLUMBING, PAINT/PLASTER, and WATER LEAK.

Code

# Libraries
library(dplyr)
library(lubridate)
library(janitor)
library(ggplot2)
library(ggalluvial)
library(tidyr)
library(scales)
library(data.table)
library(readr)
library(tidyverse)
library(forcats)
library(stringr)
library(redav)

# ========== HOUSING VIOLATIONS ==========
housing_raw <- read.csv("/Users/shreyashetty/Documents/Fall 2025 Courses/EDAV/Final_Project/edav_project/datasets/Housing_Violations_2022_onwards.csv")

# ========== 311 HOUSING COMPLAINTS ==========
sr311_raw   <- read.csv("/Users/shreyashetty/Documents/Fall 2025 Courses/EDAV/Final_Project/edav_project/datasets/311_Housing_Complaints_2022_onwards.csv")


# ========== KEY DISTRIBUTIONS ==========

# Use tidyverse counting (no data.table .N)

housing_copy <- housing_raw
sr311_copy <- sr311_raw

# Housing Violations - Top 10 Current Status (showing top 10 since there are many unique values)

housing_copy |>
  filter(!is.na(CurrentStatus)) |>
    count(CurrentStatus, sort = TRUE) |>
      slice_head(n = 10) |>
        print()

                              CurrentStatus      n
1                       VIOLATION DISMISSED 197906
2                          VIOLATION CLOSED 180829
3                              NOV SENT OUT 112090
4  FIRST NO ACCESS TO RE- INSPECT VIOLATION  18629
5                         NOT COMPLIED WITH  17976
6                         INFO NOV SENT OUT  12927
7         NOTICE OF ISSUANCE SENT TO TENANT  11160
8             VIOLATION WILL BE REINSPECTED  10263
9                              CIV14 MAILED   7244
10                     DEFECT LETTER ISSUED   5934

Code

# 311 Complaints - Top 10 Complaint Types (showing top 10)

sr311_copy |>
  filter(!is.na(Complaint.Type)) |>
    count(Complaint.Type, sort = TRUE) |>
      slice_head(n = 10) |>
        print()

                  Complaint.Type      n
1                 HEAT/HOT WATER 229222
2                       PLUMBING  50637
3                  PAINT/PLASTER  44624
4                     WATER LEAK  30190
5  General Construction/Plumbing  25207
6        Maintenance or Facility  20202
7                       Elevator  16588
8                       Plumbing   2197
9                       ELEVATOR   1302
10          Non-Residential Heat   1094

Code

# 311 Complaints - Status distribution

sr311_copy |>
  filter(!is.na(Status)) |>
    count(Status, sort = TRUE) |>
      print()

       Status      n
1      Closed 419047
2        Open   5152
3 In Progress    922
4    Assigned    124
5 Unspecified      3

Code

# 311 Complaints - Top Agencies

sr311_copy |>
  filter(!is.na(Agency)) |>
    count(Agency, sort = TRUE) |>
      slice_head(n = 10) |>
        print()

  Agency      n
1    HPD 358603
2    DOB  43687
3    DPR  20202
4  DOHMH   1579
5    DOE   1064
6    DEP     99
7    DOT     14

2.2 Missing Value Analysis

2.2.1 Missing Data summary:

Code

# === HOUSING VIOLATIONS ===

# Calculate missing counts and percentages
missing_housing <- housing_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(everything(), names_to = "column", values_to = "missing_count") |>
      mutate(total = nrow(housing_raw), missing_pct = round(100 * missing_count / total, 2)) |>
        arrange(desc(missing_pct))

# Columns with missing values more than 0%
print(filter(missing_housing, missing_pct > 0))

# A tibble: 10 × 4
   column          missing_count  total missing_pct
   <chr>                   <int>  <int>       <dbl>
 1 Story                   77094 589005       13.1 
 2 NOVID                   34593 589005        5.87
 3 BIN                       305 589005        0.05
 4 BBL                       305 589005        0.05
 5 Postcode                  147 589005        0.02
 6 Latitude                   77 589005        0.01
 7 Longitude                  77 589005        0.01
 8 CommunityBoard             77 589005        0.01
 9 CouncilDistrict            77 589005        0.01
10 CensusTract                77 589005        0.01

Code

# === 311 COMPLAINTS ===

# Calculate missing counts and percentages
missing_311 <- sr311_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(everything(), names_to = "column", values_to = "missing_count") |>
      mutate(total = nrow(sr311_raw), missing_pct = round(100 * missing_count / total, 2)) |>
        arrange(desc(missing_pct))

# Columns with >0% missing
print(filter(missing_311, missing_pct > 0))

# A tibble: 9 × 4
  column                   missing_count  total missing_pct
  <chr>                            <int>  <int>       <dbl>
1 Due.Date                        425248 425248      100   
2 Vehicle.Type                    425248 425248      100   
3 Taxi.Company.Borough            425248 425248      100   
4 Taxi.Pick.Up.Location           425248 425248      100   
5 Bridge.Highway.Name             425248 425248      100   
6 Bridge.Highway.Direction        425248 425248      100   
7 Road.Ramp                       425248 425248      100   
8 Bridge.Highway.Segment          425248 425248      100   
9 BBL                               2302 425248        0.54

Code

# Saving results for visualization
write.csv(missing_housing, "datasets/missing_values_violations.csv", row.names = FALSE)
write.csv(missing_311, "datasets/missing_values_311.csv", row.names = FALSE)

2.2.2 Missing Data Visualizations

Code

# --------- Visualizations: Missingness bar charts (tidy) ---------

# Load the saved summaries (if needed) or use objects directly

missing_viol_plot <- missing_housing |> filter(missing_pct > 0)
missing_311_plot  <- missing_311 |> filter(missing_pct > 0)

# ========== GRAPH 1: BAR CHART - HOUSING VIOLATIONS ==========

p1 <- ggplot(missing_viol_plot, aes(x = reorder(column, missing_pct), y = missing_pct)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
  title = "Missing Values in Housing Violations Dataset",
  subtitle = "Manhattan, 2022-2025",
  x = "Column Name",
  y = "Percentage Missing (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
  plot.title = element_text(face = "bold", size = 16),
  axis.text.y = element_text(size = 11)
  )

print(p1)

Code

# ========== GRAPH 2: BAR CHART - 311 COMPLAINTS ==========

p2 <- ggplot(missing_311_plot, aes(x = reorder(column, missing_pct), y = missing_pct)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  labs(
  title = "Missing Values in 311 Housing Complaints Dataset",
  subtitle = "Manhattan, 2022-2025",
  x = "Column Name",
  y = "Percentage Missing (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
  plot.title = element_text(face = "bold", size = 16),
  axis.text.y = element_text(size = 11)
  )

print(p2)

2.2.3 MISSING DATA SUMMARY TABLES

2.2.3.1 Housing Violations – Missingness Summary Table

Code

missing_summary_housing <- tibble(
  Metric = c(
  "Total Columns",
  "Columns > 0% Missing",
  "Columns > 50% Missing",
  "Columns == 100% Missing",
  "Columns == 0% Missing"
  ),
  Value = c(
    nrow(missing_housing),
    nrow(filter(missing_housing, missing_pct > 0)),
    nrow(filter(missing_housing, missing_pct > 50)),
    nrow(filter(missing_housing, missing_pct == 100)),
    nrow(filter(missing_housing, missing_pct == 0))
  )
)

missing_summary_housing

# A tibble: 5 × 2
  Metric                  Value
  <chr>                   <int>
1 Total Columns              41
2 Columns > 0% Missing       10
3 Columns > 50% Missing       0
4 Columns == 100% Missing     0
5 Columns == 0% Missing      31

2.2.3.2 311 Complaints – Missingness Summary Table

Code

missing_summary_311 <- tibble(
  Metric = c(
  "Total Columns",
  "Columns > 0% Missing",
  "Columns > 50% Missing",
  "Columns == 100% Missing",
  "Columns == 0% Missing"
  ),
  Value = c(
    nrow(missing_311),
    nrow(filter(missing_311, missing_pct > 0)),
    nrow(filter(missing_311, missing_pct > 50)),
    nrow(filter(missing_311, missing_pct == 100)),
    nrow(filter(missing_311, missing_pct == 0))
  )
)

missing_summary_311

# A tibble: 5 × 2
  Metric                  Value
  <chr>                   <int>
1 Total Columns              42
2 Columns > 0% Missing        9
3 Columns > 50% Missing       8
4 Columns == 100% Missing     8
5 Columns == 0% Missing      33

2.2.4 MISSING VALUE HEATMAPS

Code

# Simple aggregated missing plot
plot_missing(housing_raw, percent = TRUE, num_char = 3, max_cols = 10)

Code

plot_missing(sr311_raw, percent = TRUE, num_char = 3, max_cols = 10)

2.2.5 Missing Value Analysis

Housing violations dataset is mostly complete since 31/40 columns contain 0 missing values. The remaining 10 columns have less than 50% missing data which in turn means that even incomplete variables retain significant analytical value. Story and NOVID are the columns with highest missing data (13% and 5.8% respectively) andHence, this dataset is consistent and reliable for majority of fields required for analysis.

311 service requests dataset has polarized missingness pattern with 33/42 columns fully complete but except 1 (BBL) , the remaining 8 (Due.Date, Vehicle.Type, Taxi.Company.Borough, Taxi.Pick.Up.Location, Bridge.Highway.Name, Bridge.Highway.Direction, Road.Ramp, Bridge.Highway.Segment) are entirely empty. We haven’t used these missing data columns in our analysis. These missing data columns suggest that though most of the complaint data is recorded properly some details (mostly not very improtant) are poorly recorded.

Commentary: The sharp contrast between both our datasets highlights that the data collection challenges were perhaps diffferent because housing violations data is robust and usable across all dimensions but 311 data contains gaps. The 8 blank columns in the 311 dataset are mostly excluded excluded from our analysis to prevent errors. We have used housing dataset with minimal imputation since it’s properly recorded. Overall, there is no specific pattern in the missing values, the missing values in both the datasets is quite random.