Unit 4_Working With Graphs _python
import pandas as pd
Handle missing values by imputing them or dropping rows/columns with missing data.
Remove duplicates.
Correct data errors and inconsistencies.
# Remove duplicates
df = df.drop_duplicates()
6. Data Visualization:
Use libraries like Matplotlib or Seaborn to visualize the data, detect patterns, and gain insights.
import matplotlib.pyplot as plt
# Create a histogram
plt.hist(df['numeric_column'])
plt.xlabel('Numeric Column')
plt.ylabel('Frequency')
plt.show()
VISHNU PRIYA P M | PYTHON | V BCA 6
7. Data Export:
1. Concatenation:
Concatenation is used to combine data frames either row-wise or column-wise.
a. Row-wise concatenation:
import pandas as pd
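The concatenation code itself is not shown in these notes, so here is a minimal sketch of both directions using pd.concat(). The frames df1, df2 and scores, and their contents, are made up for illustration.

```python
import pandas as pd

# Two small illustrative frames with the same columns (hypothetical data)
df1 = pd.DataFrame({"id": [1, 2], "name": ["Asha", "Ben"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Chen", "Dev"]})

# Row-wise: stack df2 under df1 (axis=0 is the default);
# ignore_index=True renumbers the rows 0..n-1
rows = pd.concat([df1, df2], ignore_index=True)

# Column-wise: place frames side by side, aligned on the row index
scores = pd.DataFrame({"score": [88, 92]})
cols = pd.concat([df1, scores], axis=1)

print(rows)
print(cols)
```

Row-wise concatenation gives a frame with four rows and the original two columns. Column-wise concatenation keeps two rows and adds the score column.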
2. Appending:
result = pd.concat([df1, df2])  # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead
1. Inner Join:
result = pd.merge(df1, df2, on='key_column', how='inner')
2. Left Join:
result = pd.merge(df1, df2, on='key_column', how='left')
3. Right Join:
result = pd.merge(df1, df2, on='key_column', how='right')
4. Outer Join:
result = pd.merge(df1, df2, on='key_column', how='outer')
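To see how the four join types differ, the sketch below runs each one on the same pair of frames. The employees and salaries frames and their values are hypothetical; only pd.merge() and its how parameter come from the notes above.

```python
import pandas as pd

# Hypothetical data: keys 2 and 3 appear in both frames,
# key 1 only on the left, key 4 only on the right
employees = pd.DataFrame({"key_column": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
salaries = pd.DataFrame({"key_column": [2, 3, 4], "salary": [50000, 60000, 70000]})

# Run every join type and keep the results for comparison
joined = {}
for how in ["inner", "left", "right", "outer"]:
    joined[how] = pd.merge(employees, salaries, on="key_column", how=how)
    print(how, "->", joined[how]["key_column"].tolist())
```

The inner join keeps only keys 2 and 3. The left join keeps every key from employees (1, 2, 3) and the right join every key from salaries (2, 3, 4). The outer join keeps all four keys and fills the missing name or salary with NaN.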
Data transformation is the process of converting raw data into a format that is more suitable for
analysis, modeling, or machine learning. It is an essential step in any data science project, and
Python is a popular programming language for data transformation.
There are many different types of data transformation, but some common examples include:
Cleaning and preprocessing: This involves removing errors and inconsistencies from the data,
as well as converting the data to a consistent format.
Feature engineering: This involves creating new features from the existing data, or
transforming existing features in a way that is more informative for the task at hand.
Encoding categorical data: Categorical data, such as text or labels, needs to be converted to
numerical data before it can be used by many machine learning algorithms.
Scaling and normalization: This involves transforming the data so that all features are on a
similar scale, which can improve the performance of machine learning algorithms.
There are a number of different Python libraries that can be used for data transformation, but
the most popular one is Pandas. Pandas is a powerful library for data manipulation and
analysis, and it provides a wide range of functions for data transformation.
import pandas as pd
Use drop_duplicates() to remove duplicate rows from your DataFrame.
df = df.drop_duplicates()  # returns a new DataFrame, so assign the result
2. Data Filtering:
Filtering allows you to select a subset of data based on certain conditions.
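A minimal sketch of condition-based filtering follows. The DataFrame and its columns (name, age, city) are invented for illustration.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dev"],
    "age": [23, 35, 41, 29],
    "city": ["Pune", "Delhi", "Pune", "Kochi"],
})

# Single condition: a boolean mask selects the rows where it is True
over_30 = df[df["age"] > 30]

# Combined conditions: wrap each one in parentheses and join with & (and) or | (or)
pune_over_30 = df[(df["age"] > 30) & (df["city"] == "Pune")]

# query() expresses the same filter as a string
same_result = df.query("age > 30 and city == 'Pune'")

print(over_30)
print(pune_over_30)
```

Filtering returns a new DataFrame and keeps the original row index. Call reset_index(drop=True) if you want the rows renumbered.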
df['scaled_column'] = scaler.fit_transform(df[['numeric_column']])
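The line above relies on a scaler object whose definition is not shown in these notes (with scikit-learn it would typically be a StandardScaler or MinMaxScaler; which one was intended is an assumption). As a library-free sketch of the same idea, both common scalings can be written with plain pandas arithmetic. The column values are made up.

```python
import pandas as pd

# Hypothetical numeric column
df = pd.DataFrame({"numeric_column": [10.0, 20.0, 30.0, 40.0, 50.0]})
col = df["numeric_column"]

# Standardization: subtract the mean and divide by the
# population standard deviation (ddof=0), giving mean 0 and std 1
df["standardized"] = (col - col.mean()) / col.std(ddof=0)

# Min-max normalization: rescale the values to the range [0, 1]
df["min_max"] = (col - col.min()) / (col.max() - col.min())

print(df)
```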
5. One-Hot Encoding:
Convert categorical variables into a numerical format using one-hot encoding.
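One way to do this in pandas is pd.get_dummies(). The sketch below assumes a hypothetical color column.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies() creates one indicator column per category value;
# dtype=int yields 0/1 instead of True/False
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new color_blue, color_green and color_red columns, marking which category that row belonged to.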
6. Reshaping Data:
Reshaping data includes tasks like pivoting, melting, or stacking/unstacking for better analysis.
# Pivot a DataFrame
pivoted_df = df.pivot(index='row_column', columns='column_column', values='value_column')
# Melt a DataFrame
melted_df = pd.melt(df, id_vars=['id_column'], value_vars=['var1', 'var2'], var_name='variable',
value_name='value')
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
Raw data that is fed to a system is usually generated from surveys and extraction of data from
real-time actions on the web. This may give rise to variations in the data and there exists a
chance of measurement error while recording the data.
An outlier is a data point, or a set of data points, that lies away from the rest of the values in
the dataset. That is, it appears away from the overall distribution of data values in the
dataset.
Outliers are meaningful mainly for continuous (numeric) values. Thus, the detection and removal
techniques described here apply to numeric variables, such as those used in regression.
Basically, outliers diverge from the overall, well-structured distribution of the data elements.
They can be considered abnormal values that appear away from the rest of the data.
Why is it necessary to remove outliers from the data?
As discussed above, outliers are data points that lie away from the usual distribution of the
data, and they distort the overall data distribution.
Common techniques for detecting outliers:
• Z-score
• Scatter Plots
• Interquartile range (IQR)
2. Z-Score:
The Z-score measures how far a data point is from the mean in terms of standard deviations. You
can use the Z-score to detect outliers. Typically, data points whose absolute Z-score exceeds a
threshold (e.g., 2 or 3) are considered outliers.
from scipy import stats
z_scores = stats.zscore(df['column_name'])
outliers = df[abs(z_scores) > 2]
filtered_data = df[abs(z_scores) <= 2]
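As a worked sketch of the Z-score rule, the example below computes the scores with plain pandas instead of scipy, on a made-up column containing one obviously extreme value. The data and the threshold of 2 are illustrative choices.

```python
import pandas as pd

# Hypothetical data: six similar values and one extreme value
df = pd.DataFrame({"column_name": [10, 12, 11, 13, 12, 11, 95]})
col = df["column_name"]

# Z-score = (value - mean) / standard deviation;
# ddof=0 (population std) matches the default of scipy.stats.zscore
z_scores = (col - col.mean()) / col.std(ddof=0)

# Rows beyond the threshold are flagged; the rest are kept
outliers = df[z_scores.abs() > 2]
filtered_data = df[z_scores.abs() <= 2]

print(outliers)
```

Only the value 95 is flagged here. Note that an extreme value also inflates the mean and standard deviation used to score it, so a threshold of 3 would miss it in a sample this small.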
3. IQR (Interquartile Range) Method:
The IQR method involves calculating the IQR (the difference between the 75th percentile and
the 25th percentile) and identifying outliers as values outside a specified range.
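Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
# Conventional fences at 1.5 * IQR beyond the quartiles (the multiplier is a common choice, not a fixed rule)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
filtered_data = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]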
1. String Concatenation:
You can concatenate strings using the + operator or by using the str.join() method.
str1 = "Hello"
str2 = "World"
result = str1 + ", " + str2 # Using the + operator
words = ["Hello", "World"]
result = ", ".join(words) # Using join
2. String Slicing:
String slicing allows you to extract substrings from a string based on their positions.
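A short sketch of the slice syntax text[start:stop:step], using an arbitrary example string:

```python
text = "Python Programming"

first_word = text[0:6]      # positions 0..5 (stop is exclusive) -> "Python"
last_word = text[7:]        # from position 7 to the end -> "Programming"
last_three = text[-3:]      # negative indices count from the end -> "ing"
reversed_text = text[::-1]  # a step of -1 walks the string backwards

print(first_word, last_word, last_three, reversed_text)
```

Slicing never modifies the original string. Each slice is a new string.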
3. String Searching:
You can search for substrings within a string using methods like str.find(), str.index(), or regular
expressions with the re module.
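The three approaches behave differently when the substring is missing, which the sketch below illustrates on an arbitrary sentence.

```python
import re

text = "Data cleaning comes before data analysis."

# find() returns the index of the first match, or -1 if there is none
pos = text.find("data")        # case-sensitive, so "Data" at index 0 is skipped
missing = text.find("pandas")  # not present

# index() works like find() but raises ValueError when nothing is found
index_failed = False
try:
    text.index("pandas")
except ValueError:
    index_failed = True

# Regular expressions allow pattern-based and case-insensitive searches
match = re.search(r"data", text, flags=re.IGNORECASE)
all_matches = re.findall(r"data", text, flags=re.IGNORECASE)

print(pos, missing, index_failed, match.start(), all_matches)
```

Use find() when a missing substring is a normal case, and index() when it should be treated as an error.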
4. String Replacement:
Replace specific substrings within a string using the str.replace() method.
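A minimal sketch of str.replace(), including its optional count argument, on an arbitrary example sentence:

```python
text = "I like Java. Java is popular."

# Replace every occurrence
replaced_all = text.replace("Java", "Python")

# The optional third argument limits how many occurrences are replaced
replaced_first = text.replace("Java", "Python", 1)

print(replaced_all)
print(replaced_first)
```

Strings are immutable, so replace() returns a new string and leaves text unchanged.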
5. String Formatting:
Format strings using f-strings or the str.format() method.
name = "Alice"
age = 30
formatted_str = f"My name is {name} and I am {age} years old."
name = "Bob"
age = 25
formatted_str = "My name is {} and I am {} years old.".format(name, age)
6. String Splitting:
Split a string into a list of substrings using the str.split() method.
text = "Python,Java,C++,JavaScript"
languages = text.split(",") # Split by comma
import pandas as pd
series = pd.Series(['alice@example.com', 'bob@example.com', 'carol@example.com'])
first_names = series.str.extract(r'(?P<first_name>\w+)@example.com')
print(first_names)
Output (a one-column DataFrame named after the capture group):
  first_name
0      alice
1        bob
2      carol
PLOTTING AND VISUALIZATION
Plotting and visualization are crucial for understanding and communicating data. In Python, there
are several libraries for creating plots and visualizations, with Matplotlib, Seaborn, and Plotly
being some of the most popular ones. Here's an overview of how to create plots and
visualizations in Python:
Matplotlib
Matplotlib is an easy-to-use, low-level data visualization library that is built on NumPy arrays. It
consists of various plots like scatter plot, line plot, histogram, etc. Matplotlib provides a lot of
flexibility.
pip install matplotlib
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()
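Seaborn:
Seaborn is a statistical visualization library built on top of Matplotlib.
Installation:
pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt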
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
sns.scatterplot(x=x, y=y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()
Plotly:
Plotly is a powerful library for creating interactive and web-based visualizations. It is often used
for creating dashboards and web applications.
Installation:
pip install plotly
import plotly.express as px
Bokeh: Bokeh is another library for interactive web-based visualizations and is well-suited for
creating interactive dashboards.
Altair: Altair is a declarative statistical visualization library for Python, making it easy to create
complex visualizations with concise code.
ggplot (ggpy): ggplot is a Python implementation of the popular ggplot2 package from R, which
uses a grammar of graphics to create plots.
Matplotlib is a popular Python library for creating static, animated, and interactive
visualizations. When creating plots with Matplotlib, you can customize the colors, markers,
and line styles of your plots. Here's a primer on how to do that:
Colors:
Matplotlib allows you to specify colors for lines, markers, and other plot elements in several
ways:
Named Colors: You can use named colors like 'red', 'blue', 'green', etc.
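Besides named colors, Matplotlib also accepts hex strings and RGB tuples with components between 0 and 1. The sketch below draws one line with each form. The data and the specific colors are illustrative, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
fig, ax = plt.subplots()

# The same color parameter accepts several notations
(named_line,) = ax.plot(x, [1, 2, 3, 4, 5], color="red", label="named: 'red'")
(hex_line,) = ax.plot(x, [2, 3, 4, 5, 6], color="#1f77b4", label="hex: '#1f77b4'")
(rgb_line,) = ax.plot(x, [3, 4, 5, 6, 7], color=(0.2, 0.6, 0.2), label="RGB tuple")

ax.legend()
```

In an interactive session, call plt.show() afterwards to display the figure.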
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Assumed plot call with illustrative style values
plt.plot(x, y, color='red', marker='o', linestyle='--', label='Data')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Custom Line Style')
plt.legend()
plt.show()
In the above example:
color: Sets the line color.
marker: Specifies the marker style (e.g., 'o' for circles, 's' for squares).
linestyle: Sets the line style ('--' for dashed, ':' for dotted, etc.).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Assumed plot and tick calls with illustrative values
plt.plot(x, y, label='Data')
plt.xticks([1, 2, 3, 4, 5], ['A', 'B', 'C', 'D', 'E'])
plt.yticks([2, 4, 6, 8, 10])
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Custom Ticks and Labels')
plt.legend()
plt.show()
In this example:
xticks() and yticks() specify the locations and labels for the ticks on the x and y axes,
respectively.
Legends:
To add legends to your plot, you can use the legend() function. You should also label your plotted
lines or data points using the label parameter when creating the plot.
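x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [1, 3, 5, 7, 9]
# Assumed plot calls with illustrative labels
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Legend Example')
plt.legend()
plt.show()
In this example, the label passed to each plt.plot() call is the text that plt.legend() displays
for that line.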
Annotations and drawing on a subplot in Matplotlib allow you to add textual or graphical
elements to your plots to provide additional information or highlight specific points of interest.
Here's how you can add annotations and draw on a subplot:
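x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Assumed plot and text calls with illustrative values
plt.plot(x, y, label='Data')
plt.text(3, 6, 'Important Point', fontsize=12, color='red')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Text Annotation Example')
plt.legend()
plt.show()
In this example, plt.text(x, y, text, fontsize, color) is used to add text at the given data
coordinates.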
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Assumed plot and annotate calls with illustrative values
plt.plot(x, y, label='Data')
plt.annotate('Important Point', xy=(3, 6), xytext=(1.5, 8), fontsize=12,
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Arrow Annotation Example')
plt.legend()
plt.show()
In this example, plt.annotate(text, xy, xytext, fontsize, arrowprops) is used to add an arrow with
text annotation. xy specifies the point being pointed to, and xytext specifies the location of the
text.
Drawing Shapes:
You can draw various shapes, lines, and polygons on a subplot using Matplotlib's plotting
functions. For example, to draw a rectangle:
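import matplotlib.pyplot as plt
import matplotlib.patches as patches
# Create a subplot
fig, ax = plt.subplots()
# Add a rectangle: lower-left corner at (1, 2), width 2, height 4
rectangle = patches.Rectangle((1, 2), 2, 4, linewidth=2, edgecolor='red', facecolor='none')
ax.add_patch(rectangle)
# Set the axis limits so the whole rectangle is visible
ax.set_xlim(0, 5)
ax.set_ylim(0, 8)
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Drawing Shapes Example')
plt.show()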
In this example, we create a subplot, create a rectangle using patches.Rectangle(), and then add it
to the plot with ax.add_patch().
SAVING PLOTS TO FILE
You can save plots created with Matplotlib to various file formats such as PNG, PDF, SVG, and
more using the savefig() function. Here's how to save a plot to a file:
# Save the plot to a file (e.g., PNG)
plt.savefig('saved_plot.png', dpi=300)
In this example, plt.savefig('saved_plot.png', dpi=300) saves the current plot to a PNG file
named "saved_plot.png" with a resolution of 300 dots per inch (dpi). You can specify the file
format by changing the file extension (e.g., ".pdf" for PDF, ".svg" for SVG).
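Commonly used savefig() parameters:
fname: The file name and path where the plot will be saved.
dpi: The resolution in dots per inch (by default the figure's own dpi, which is 100 unless changed).
format: The file format (e.g., 'png', 'pdf', 'svg'). If omitted, it is inferred from the file extension.
bbox_inches: Specifies which part of the plot to save. Use 'tight' to trim extra whitespace around the plot; the default saves the full figure area.
transparent: If True, the plot will have a transparent background.
orientation: 'portrait' or 'landscape'. In current Matplotlib versions this is supported only by the PostScript backend.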
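Putting these parameters together, the sketch below builds a small figure and saves it as both PNG and PDF. The data, the file names and the temporary output directory are illustrative, and the Agg backend is selected only so the code runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_title("Line Plot")

# Write into a temporary directory so the example leaves the working folder clean
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "saved_plot.png")
pdf_path = os.path.join(out_dir, "saved_plot.pdf")

# High-resolution PNG with trimmed margins and a transparent background
fig.savefig(png_path, dpi=300, bbox_inches="tight", transparent=True)

# The format is inferred from the ".pdf" extension
fig.savefig(pdf_path)

plt.close(fig)  # release the figure's memory once it has been saved
```

When working with pyplot interactively, call savefig() before plt.show(). Once the window is closed, the current figure may be empty and the saved file blank.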
Pandas plotting is built on Matplotlib, so Matplotlib must be installed. No separate plotting
module needs to be imported: every Series and DataFrame has a plot() method that you can call
directly. The plot() method takes a number of keyword arguments, which can be used to
control the appearance of the plot.
Here is a simple example of how to use the Pandas plot() function to create a line chart:
import pandas as pd
import matplotlib.pyplot as plt
# Create a Series
series = pd.Series([2, 4, 6, 8, 10])
# Create a line chart of the Series
series.plot()
plt.show()
Customization:
You can further customize your plots by using Matplotlib functions after calling .plot(). Additionally,
you can create subplots with plt.subplots() (or pass subplots=True to .plot()) to have more
control over the layout of multiple plots.
plt.tight_layout()
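As a sketch of this kind of customization, the example below draws two pandas plots on axes created with plt.subplots() and then adjusts them with Matplotlib calls. The sales and costs columns are made-up data, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"sales": [10, 15, 12, 18], "costs": [7, 9, 8, 11]})

# One row, two columns of axes
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Passing ax= tells pandas which subplot to draw on
df["sales"].plot(ax=axes[0], kind="line", marker="o", title="Sales")
df["costs"].plot(ax=axes[1], kind="bar", color="orange", title="Costs")

# Regular Matplotlib calls still work on the same axes
axes[0].set_ylabel("Amount")
axes[1].set_ylabel("Amount")

# Adjust spacing so titles and labels do not overlap
plt.tight_layout()
```

In an interactive session, finish with plt.show() to display the figure.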