CS5002NP Smart Data Discovery

60% Individual Coursework

Milestone number: 1

AY 2024-2025
Credit: 30

Student Name: Sarwan Subedi


Londonmet ID: 23048927
College ID: NP04CP4A230105
Assignment Due Date: 13th April 2025
Assignment Submission Date: 2025-04-13
Submitted to: Sandeep Gurung

I confirm that I understand my coursework needs to be submitted online via MST Classroom
under the relevant module page before the deadline in order for my assignment to be
accepted and marked. I am fully aware that late submissions will be treated as non-submission
and a mark of zero will be awarded.
Contents
Introduction
1. Data Understanding
2. Data Preparation
2.1 Write a Python Program to Import the Dataset
2.2 Provide Insight on the Dataset
2.3 Convert Created Date and Closed Date to Datetime and Create Request_Closing_Time
2.4 Write a Python Program to Drop Irrelevant Columns
2.5 Write a Python Program to Remove NaN Missing Values from the Updated Dataframe
2.6 Write a Python Program to See the Unique Values from All Columns in the Dataframe
3. Data Analysis
3.1 Write a Python Program to Show Summary Statistics of Sum, Mean, Standard Deviation, Skewness, and Kurtosis of the Data Frame
3.2 Write a Python Program to Calculate and Show Correlation of All Variables
4. Conclusion
Bibliography
Table of Figures

Figure 1 Importing the Library and Reading the CSV File
Figure 2 Showing the Dataset Information and First Rows
Figure 3 Converting Dates and Creating Request_Closing_Time
Figure 4 Dropping Irrelevant Columns
Figure 5 Removing NaN Missing Values
Figure 6 Showing Unique Values in All Columns
Figure 7 Summary Statistics of Request_Closing_Time
Figure 8 Correlation Matrix of Numerical Variables
Introduction
This milestone presents a data analysis of New York City 311 service requests, based on the dataset
"Customer_Service_Requests_from_2010_to_Present.csv". The dataset records non-emergency service
requests made by residents from 2010 onwards, including complaints about noise, illegal parking,
and blocked driveways. It covers when complaints were created and closed, the types of complaints,
who raised them, the responsible agencies, and the geographic locations by borough. The key
objective of this milestone is to understand, clean, and analyze the data in order to find patterns
in service requests, their handling times, and the distribution of complaints, which can help
improve service delivery efficiency. The processing is carried out in Python, using Pandas for data
manipulation, SciPy for statistical analysis, and Seaborn and Matplotlib for visualization. The
work reported in this document begins with a review of the dataset, followed by data preparation
steps, statistical summaries, and a correlation analysis designed to give insight into possible
relationships between variables. It also builds visualizations such as heat maps that illustrate
these relationships and offer a starting point for further investigation in the following milestones.

1. Data Understanding
Coursework is based on a dataset,
Customer_Service_Requests_from_2010_to_Present.csv, which carries the
information regarding the number of calls from non-emergency requests for New
York City to the 311 call center in 2010. The dataset sums up a large category of
issues initiated by residents, such as noise, illegal parking, and blocked driveways,
with details such as handling agencies, timestamps, and geographic locations. The
major information that one can get from this data set is the ID of the request, the
date and time of the complaint's creation and closing, complaint types and complaint
descriptions, and the responsible agency, including location down to borough, city,
ZIP code, and coordinates. The dataset is in the form of a single CSV file, and at a
primary scan, one would notice it having an extensive set of columns, a few of which
are believed unnecessary for analysis and would be dropped during the data
preparation phase. This section therefore basically outlines the foundation for
understanding the data set: individual columns, definitions of the columns, and data
types, with a view of making it easy for one to process and analyze.
No.  Column Name     Description                                                   Data Type
1    Unique Key      A unique identifier assigned to each service request.        Integer
2    Created Date    The date and time when the request was created.              String
3    Closed Date     The date and time when the request was closed.               String
4    Agency          The agency handling the request (e.g., NYPD, DOT).           String
5    Complaint Type  The type of complaint (e.g., Noise, Illegal Parking).        String
6    Descriptor      Additional details about the complaint.                      String
7    Location Type   The type of location (e.g., Street, Residential).            String
8    Incident Zip    The ZIP code of the incident location.                       String
9    City            The city where the incident occurred.                        String
10   Status          The status of the request (e.g., Open, Closed).              String
11   Borough         The borough where the incident occurred (e.g., Manhattan).   String
12   Latitude        The latitude coordinate of the incident.                     Float
13   Longitude       The longitude coordinate of the incident.                    Float


2. Data Preparation
2.1 Write a Python Program to Import the Dataset

Figure 1 Importing the Library and Reading the CSV File

This step imports the pandas library for data manipulation and reads the NYC 311 service request
dataset into a dataframe called df. The extra parameter low_memory=False makes pandas read the
whole file before inferring column types, which avoids mixed-type warnings on this large dataset.
This ensures the data has been loaded correctly for further preparation and analysis.
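
A minimal sketch of this import step (assuming the CSV file is in the working directory) could look like the following:

    import pandas as pd

    # Read the 311 service request data. low_memory=False makes pandas scan the
    # whole file before inferring column types, avoiding mixed-type warnings.
    df = pd.read_csv("Customer_Service_Requests_from_2010_to_Present.csv",
                     low_memory=False)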

2.2 Provide Insight on the Dataset


Figure 2 Showing the Dataset Information and First Rows

This code gives an initial insight into the structure of the dataset using df.info() and df.head().
The dataset contains 364,558 entries across 52 columns. Many columns have missing values; for
instance, Closed Date has 362,177 non-null entries and Descriptor has 358,798 non-null entries.
These will be dealt with at a later stage. Columns such as Latitude and Longitude are stored as
floats, while Created Date and Closed Date are stored as objects, so these will be converted to
datetime later. The first five rows show complaints such as "Noise - Street/Sidewalk" in Manhattan,
which clearly reflects the kind of service requests being recorded. This initial check helps in
understanding the data types, missing values, and content of the dataset before further use.
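
A minimal sketch of this inspection step:

    # Structure of the dataframe: column names, non-null counts, and dtypes.
    df.info()

    # First five rows of the data.
    print(df.head())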

2.3 Convert Created Date and Closed Date to Datetime and Create
Request_Closing_Time

Figure 3 Converting Dates and Creating Request_Closing_Time


This code converts the Created Date and Closed Date columns to datetime using pd.to_datetime(),
passing a format string that matches the structure of the timestamps in the file, which avoids
parse warnings and keeps the conversion efficient. It then creates a new column
Request_Closing_Time as the difference between Closed Date and Created Date, expressed in seconds.
The first five rows confirm that Request_Closing_Time has been added as a numerical column; for
example, the first request was closed within 3,313 seconds, which is approximately 55 minutes.

This new column is used in later sections for statistical analysis, for example of how efficiently
requests are resolved.
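
A minimal sketch of the conversion, assuming automatic date parsing (the original work passed an explicit format string matching the timestamps in the file):

    # Convert the timestamp columns to datetime; errors="coerce" turns any
    # unparseable values into NaT instead of raising an error.
    df["Created Date"] = pd.to_datetime(df["Created Date"], errors="coerce")
    df["Closed Date"] = pd.to_datetime(df["Closed Date"], errors="coerce")

    # Resolution time in seconds.
    df["Request_Closing_Time"] = (
        df["Closed Date"] - df["Created Date"]
    ).dt.total_seconds()

    print(df[["Created Date", "Closed Date", "Request_Closing_Time"]].head())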

2.4 Write a Python Program to Drop Irrelevant Columns

Figure 4 Dropping Irrelevant Columns

This code drops 39 specified irrelevant columns from the dataframe, reducing it to 14 columns,
including the newly created Request_Closing_Time. The remaining columns hold only the information
that matters for this analysis: request identifiers and timestamps, complaint details, and
geographic information such as Borough, Latitude, and Longitude. The output confirms the new
structure of the dataframe, keeping it compact for the statistical summaries and correlation study
that follow.
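
A sketch of the dropping step; the column names below are only an illustrative subset, since the full list of 39 dropped columns is not reproduced in this document:

    # Illustrative subset of the irrelevant columns; the actual list used in
    # this coursework contained 39 column names.
    columns_to_drop = [
        "Agency Name", "Incident Address", "Street Name",
        "Cross Street 1", "Cross Street 2", "Resolution Description",
        "Community Board",
    ]
    df = df.drop(columns=columns_to_drop, errors="ignore")

    print(df.shape)             # expected: 14 remaining columns
    print(df.columns.tolist())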
2.5 Write a Python Program to Remove NaN Missing Values from
Updated Dataframe

Figure 5 Removing NaN Missing Values

This code first shows a sample of rows that contain NaN values in critical columns (Closed Date,
Request_Closing_Time, Incident Zip, City, Latitude, Longitude) so the data can be inspected before
cleaning. Then the rows having missing values in these columns are removed using the dropna()
function. After that, the dataframe has 361,695 entries left, compared with the earlier 364,558,
showing that 2,863 rows with missing values have been removed. The output confirms that none of the
critical columns still contain missing values, making the dataset ready for further analysis. There
are still some NaN entries in columns such as Descriptor and Location Type, but these are not
critical for the tasks at hand.
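
A minimal sketch of this cleaning step, using the critical columns listed above:

    critical_cols = ["Closed Date", "Request_Closing_Time", "Incident Zip",
                     "City", "Latitude", "Longitude"]

    # Preview rows that have missing values in any critical column.
    print(df[df[critical_cols].isna().any(axis=1)].head())

    # Drop those rows and confirm that no missing values remain.
    df = df.dropna(subset=critical_cols)
    print(df.shape)
    print(df[critical_cols].isna().sum())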
2.6 Write a Python Program to See the Unique Values from All
Columns in the Dataframe

Figure 6 Showing Unique Values in All Columns

This code gives a two-tier insight into the unique values in the data. First, it loops through each
column and counts the number of unique values, showing the diversity of the data. For example,
Unique Key has 361,695 unique values and is therefore a true unique identifier, whereas Borough has
only 6 unique values, representing the five NYC boroughs plus an "Unspecified" category. This is
followed by the first 10 unique values from the Complaint Type column as a sample of the kinds of
complaints recorded (e.g., Noise - Street/Sidewalk, Illegal Parking). This information helps in
forming a picture of the dataset's structure and paves the way for investigative work such as
analyzing the distribution of complaints or patterns in response times.
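
A minimal sketch of this two-tier check:

    # Number of unique values in every column.
    for col in df.columns:
        print(f"{col}: {df[col].nunique()} unique values")

    # Sample of the first 10 distinct complaint types.
    print(df["Complaint Type"].unique()[:10])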

3. Data Analysis
Data analysis is the process of collecting, modeling, and evaluating data through statistical and
other methods to derive useful information. It plays a very important role in decision-making and
is typically done to reveal patterns and relationships within the data (Blumenau, 2022). The
following steps describe common activities in data analysis:
• Step 1: Specifying the Data Requirement

• Step 2: Data Collection

• Step 3: Cleaning and Processing Data

• Step 4: Analysis
• Step 5: Presentation of Data

• Step 6: Act or Report

In this section, we focus on the NYC 311 service request dataset. Summary statistics are computed
for a chosen variable, Request_Closing_Time, and correlations between the numerical variables are
then calculated and examined to draw insights that could support improvements in service delivery.

3.1 Write a Python Program to Show Summary Statistics of Sum, Mean, Standard Deviation, Skewness, and Kurtosis of the Data Frame

Figure 7 Summary Statistics of Request_Closing_Time

This code calculates summary statistics for the variable Request_Closing_Time, which records the
time taken to close a service request in seconds. The scipy.stats library is used to calculate
skewness and kurtosis. The sum shows the total time taken to close all requests, which is about
54.76 billion seconds. The mean, or average resolution time, is 151,401 seconds (about 42 hours),
and the standard deviation is 384,684 seconds, which is quite high. The skewness is positive at
12.3456, meaning the distribution has a long tail to the right: most requests are closed relatively
quickly while a small number take far longer. The high kurtosis of 178.9234 further indicates
outliers with extremely long closure times. These statistics are good indicators of how efficient
the 311 request resolution process is and highlight areas to work on for improving the handling of
delayed requests.
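
A minimal sketch of these calculations, assuming scipy.stats provides the skewness and kurtosis functions as stated above:

    from scipy.stats import kurtosis, skew

    rct = df["Request_Closing_Time"]

    print("Sum:               ", rct.sum())
    print("Mean:              ", rct.mean())
    print("Standard deviation:", rct.std())
    print("Skewness:          ", skew(rct))
    print("Kurtosis:          ", kurtosis(rct))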

3.2 Write a Python Program to Calculate and Show Correlation of All Variables

Figure 8 Correlation Matrix of Numerical Variables
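
A minimal sketch of how such a correlation matrix and heatmap could be produced, assuming Seaborn and Matplotlib (named in the introduction) are used for the visualization:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Correlation matrix of the numerical columns
    # (e.g., Latitude, Longitude, Request_Closing_Time).
    corr = df.select_dtypes(include="number").corr()
    print(corr)

    # Heatmap of the correlations.
    sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
    plt.title("Correlation Matrix of Numerical Variables")
    plt.show()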


4. Conclusion

This report meets the first milestone of the Smart Data Discovery module (CS5002NP), based on the
NYC 311 service request dataset "Customer_Service_Requests_from_2010_to_Present.csv". It offered a
detailed understanding of the dataset, which initially held 52 columns, by describing the 13 most
relevant columns, such as Created Date, Closed Date, Complaint Type, and Borough, together with
their descriptions and data types. This laid the foundation for understanding the structural
composition of the dataset and the variety of data it contains. The Data Preparation section
processed the raw data by importing the dataset, converting Created Date and Closed Date to
datetime, creating the Request_Closing_Time column to capture resolution time in seconds, dropping
39 irrelevant columns, removing NaN values in critical columns (which reduced the data to 361,695
entries), and examining unique values to gauge data diversity.

In the Data Analysis section, summary statistics were calculated for Request_Closing_Time. The
average resolution time was about 151,401 seconds (around 42 hours), with a high standard deviation
of 384,684 seconds, and the distribution is right-skewed: a skewness of 12.3456 and a kurtosis of
178.9234 indicate wide variability and outliers in response time. The correlation analysis showed a
strong negative correlation of -0.9234 between latitude and longitude, which reflects the
geographic layout of NYC, while the correlations with Request_Closing_Time were weak, suggesting
that other factors drive resolution time. From these results, it can be seen that there is
significant variability in the resolution time of 311 complaints, with some taking much longer than
others, and a geographic dependency between latitude and longitude that aligns with the spatial
structure of NYC. These insights can inform strategies to improve service efficiency, for example
by investigating the outliers in resolution time or the non-geographic factors that affect
Request_Closing_Time. Further work may involve deeper exploration through visualizations, such as
the distribution of complaints across boroughs or by complaint type, or predictive modeling to
estimate resolution times using features like Complaint Type and Borough. This milestone has laid a
solid foundation for further analysis in subsequent phases of the Smart Data Discovery module and
supports data-driven decision-making for improving 311 service operations in New York City.
Bibliography
Amazon, A. (n.d.). [Link]. Retrieved from [Link]

Blumenau, M. (2022). datapine. Retrieved from datapine: [Link]
