
DATASCIENCE_INTERNSHIP[1]

Data science is an interdisciplinary field that utilizes scientific methods to extract insights from data, combining statistics, computer science, and domain expertise. Key components include data collection, cleaning, and exploratory analysis, with tools like Python, Pandas, and Matplotlib being essential. The field is crucial for enhancing decision-making, predictive analytics, and operational efficiency, and it continues to evolve with advancements in AI and big data.

Uploaded by

mdnishadh001

*DATA SCIENCE

* What is DATA SCIENCE


* Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms, and systems to extract
knowledge and insights from structured and unstructured
data.
* It combines aspects of statistics, computer science, and
domain expertise to analyze and interpret complex data.
*Key components and processes involved in data science

*Data Collection
*Data Cleaning and Preprocessing
*Exploratory Data Analysis (EDA)
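The three steps above can be sketched in a few lines of Pandas. This is only an illustration: the records, column names, and cleaning rules below are made up for the example and are not taken from any real project.

```python
import numpy as np
import pandas as pd

# Data collection: hypothetical records typed in by hand for illustration.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Chennai", "Delhi", "Delhi", None, "Delhi"],
})

# Data cleaning and preprocessing: drop rows with missing values,
# then drop exact duplicate rows.
clean = raw.dropna().drop_duplicates()

# Exploratory data analysis: summary statistics and category counts.
print(clean["age"].describe())
print(clean["city"].value_counts())
```

In practice the collection step would read from a file, database, or API rather than a hand-written dictionary.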
*Why is Data Science Important?
* Enhances decision-making
* Predictive analytics for future trends
* Personalization and improved customer experience
* Operational efficiency
* Image/Graphic: Infographic with data science applications in
different industries (e.g., healthcare, finance, retail)
*Key Tools and Technologies

* Programming Languages: Python, R


* Libraries: Pandas, NumPy, Scikit-learn, TensorFlow
* Tools: Jupyter Notebook, Tableau, PowerBI
* Image/Graphic: Logos or icons of the mentioned tools
*NumPy Array Indexing
* import numpy as np
* a = np.array([1, 2, 3, 4])
* print(a[0] + a[2])

* Output : 4

* import numpy as np
* a = np.array([[1,2,3,4,5], [6,7,8,9,10]])
* print( a[1, 4] + a[0,2])

* Output : 13
* import numpy as np
* a = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9],
[10, 11, 12]]])
* print(a[0, 1, 2])

* Output : 6
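To see why `a[0, 1, 2]` gives 6, it can help to peel off one dimension at a time. The intermediate names `block`, `row`, and `value` are introduced here only for illustration.

```python
import numpy as np

a = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

block = a[0]    # first 2-D block: [[1 2 3], [4 5 6]]
row = block[1]  # second row of that block: [4 5 6]
value = row[2]  # third element of that row: 6

print(value == a[0, 1, 2])  # True
print(a[-1, -1, -1])        # negative indices count from the end: 12
```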
*Array Slicing
* import numpy as np
* a= np.array([1, 2, 3, 4, 5, 6, 7])
* print(a[4:])

* Output : [5 6 7]

* import numpy as np
* a= np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
* print(a[1, 1:4])

* Output : [7 8 9]

* import numpy as np
* arr = np.array([1, 2, 3, 4, 5, 6, 7]) #step slicing
* print(arr[::2])

* Output : [1 3 5 7]
* import numpy as np
* a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
* print(a[0:2, 1:4]) # first slice selects rows 0 and 1; second slice selects columns 1 to 3

* Output : [[2 3 4]
* [7 8 9]]
*Copy and view:
* import numpy as np
* a = np.array([1, 2, 3, 4, 5])
* x = a.copy()
* a[0] =0
* print(a)
* print(x)
* Output : [0 2 3 4 5]
* [1 2 3 4 5]
* ------------------------------------------------------------------------------------------------------------
* import numpy as np
* a = np.array([1, 2, 3, 4, 5])
* x = a.view()
* a[0] = 0
* print(a)
* print(x)
* Output : [0 2 3 4 5]
* [0 2 3 4 5]
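One way to check whether an array is a copy or a view is the `base` attribute: it is `None` for an array that owns its data and points at the original array for a view. The sketch below also shows that basic slicing returns a view, so writing through the slice changes the original.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
c = a.copy()  # owns its own data
v = a.view()  # shares a's data
s = a[1:4]    # basic slicing also returns a view

print(c.base is None)  # True
print(v.base is a)     # True
print(s.base is a)     # True

s[0] = 99  # writing through the slice changes a as well
print(a)   # [ 1 99  3  4  5]
```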
*Array Shape:
* import numpy as np
* a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
* print(a.shape)

* Output : (2, 4)
*Array Reshape: 1D to 2D and 1D to 3D
* import numpy as np
* a= np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
* b= a.reshape(4, 3)
* print(b)

* Output : [[ 1  2  3]
*  [ 4  5  6]
*  [ 7  8  9]
*  [10 11 12]]
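The 1D to 3D case works the same way, as long as the product of the new dimensions equals the number of elements. The shape `(2, 3, 2)` below is just one possible choice for 12 elements.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

# 2 blocks, each with 3 rows and 2 columns (2 * 3 * 2 = 12 elements).
b = a.reshape(2, 3, 2)

print(b.shape)  # (2, 3, 2)
print(b[1, 0])  # first row of the second block: [7 8]
```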
*For Loop and Array iterating:
* import numpy as np
* a = np.array([[1, 2, 3], [4, 5, 6]])
* for x in a:
*     for y in x:
*         print(y)
* Output : 1
* 2
* 3
* 4
* 5
* 6
* ------------------------------------------------------------------------------------------------------------
* import numpy as np
* a= np.array([[1,2,3],[4,5,6]])
* for x in np.nditer(a):
*     print(x)
* Output : 1
* 2
* 3
* 4
* 5
* 6
*Array Concatenation
* import numpy as np
* a1 = np.array([[1, 2], [3, 4]])
* a2 = np.array([[5, 6], [7, 8]])
* a = np.concatenate((a1, a2), axis=1) # axis=1 joins the arrays side by side (along columns)
* print(a)

* Output : [[1 2 5 6]
* [3 4 7 8]]
* -----------------------------------------------------------------------------------------------------------
* import numpy as np
* a1 = np.array([[1, 2], [3, 4]])
* a2 = np.array([[5, 6], [7, 8]])
* a = np.concatenate((a1, a2))
* print(a)

* Output : [[1 2]
* [3 4]
* [5 6]
* [7 8]]
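Comparing the resulting shapes makes the effect of `axis` easier to see. The names `rows` and `cols` are illustrative only.

```python
import numpy as np

a1 = np.array([[1, 2], [3, 4]])
a2 = np.array([[5, 6], [7, 8]])

rows = np.concatenate((a1, a2), axis=0)  # the default: stack a2 below a1
cols = np.concatenate((a1, a2), axis=1)  # place a2 to the right of a1

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
```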
*Array Sort:
* import numpy as np
* a = np.array([[13, 9, 4], [15, 12, 1]])
* print(np.sort(a))

* Output : [[ 4  9 13]
*  [ 1 12 15]]
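By default `np.sort` sorts along the last axis, which for a 2-D array means each row is sorted on its own. A short sketch of the difference when `axis=0` is passed instead:

```python
import numpy as np

a = np.array([[13, 9, 4], [15, 12, 1]])

by_row = np.sort(a)          # default axis=-1: sort within each row
by_col = np.sort(a, axis=0)  # sort within each column

print(by_row.tolist())  # [[4, 9, 13], [1, 12, 15]]
print(by_col.tolist())  # [[13, 9, 1], [15, 12, 4]]
```

`np.sort` returns a sorted copy, so `a` itself is left unchanged.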
*Pandas
➢ Pandas is a Python library used for working with data sets.
➢ It has functions for analyzing, cleaning, exploring, and manipulating data.
➢ The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
➢ Pandas allows us to analyze big data and make conclusions based on statistical theories.
➢ Pandas can clean messy data sets, and make them readable and relevant

*Pandas installation:
* In command prompt type pip install pandas

* import pandas

* import pandas as pd # the "as" keyword imports pandas under the alias pd

*Pandas DataFrames:
* import pandas as pd
* data = {
* "name": ["Arun", "praveen", "Raja"],
* 'Marks': [100, 70, 82]
*}
* new = pd.DataFrame(data)
* print(new)

* Output :
*       name  Marks
* 0     Arun    100
* 1  praveen     70
* 2     Raja     82
*Index
* import pandas as pd
* data = {
* "name": ["Arun", "praveen", "Raja"],
* 'Marks': [100, 70, 82]
*}
* new = pd.DataFrame(data, index=["one", "two", "three"])
* print(new)

* Output :
*           name  Marks
* one       Arun    100
* two    praveen     70
* three     Raja     82
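Once rows have labels, they can be looked up by label with `loc` or by position with `iloc`. A brief sketch using the same data:

```python
import pandas as pd

data = {
    "name": ["Arun", "praveen", "Raja"],
    "Marks": [100, 70, 82],
}
new = pd.DataFrame(data, index=["one", "two", "three"])

print(new.loc["two"])           # the whole row labelled "two"
print(new.loc["two", "Marks"])  # a single cell: 70
print(new.iloc[0]["name"])      # first row by position: Arun
```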
*Duplicate values
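* import pandas as pd
* data = {'A': [1, 2, 3, 5],
*         'B': ['apple', 'banana', 'apple', 'apple']}
* df = pd.DataFrame(data)
* print(df)
* # keep=False marks every row whose 'B' value occurs more than once
* datas = df[df.duplicated(subset='B', keep=False)]
* # count how many times each duplicated value appears
* counts = datas.value_counts(subset='B')
* print(counts)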
*Remove Duplicates

* import pandas as pd

* # Sample DataFrame
* data = {'A': [1, 2, 3, 5],
* 'B': ['apple', 'banana', 'apple','apple']}

* data = pd.DataFrame(data)

* df = data.drop_duplicates(subset='B')
* print(df)
*Test Case
* Question:
* +----+-------------------+
* | id | email             |
* +----+-------------------+
* | 1  | [email protected] |
* | 2  | [email protected] |
* | 3  | [email protected] |
* +----+-------------------+
* id is the primary key (column with unique values) for this table.
* Each row of this table contains an email. The emails will not contain uppercase letters.
* Output: [email protected] is repeated two times.
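A possible approach to this test case, sketched with Pandas. The addresses in the table are not recoverable from the slide, so the three emails below are hypothetical stand-ins, and the table name `person` is also an assumption.

```python
import pandas as pd

# Hypothetical addresses; the real ones are not recoverable from the table.
person = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "a@example.com"],
})

# Count each address, then keep only those that appear more than once.
counts = person["email"].value_counts()
repeated = counts[counts > 1]

for email, n in repeated.items():
    print(f"{email} is repeated {n} times.")
```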
*Pandas Series:
* import pandas as pd
* a = [1, 2, 8]
* new = pd.Series(a, index = ["x", "y", "z"])
* print(new)

* Output : x 1
* y 2
* z 8
* dtype: int64
* ------------------------------------------------------------------------------------------------------------
* import pandas as pd
* names = {"name1": "Arun", "name2": "Raja", "name3": "Praveen"}
* new = pd.Series(names)
* print(new)

* Output : name1 Arun


* name2 Raja
* name3 Praveen
* dtype: object
* import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
*Read csv file
* import pandas as pd
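df = pd.read_csv('data.csv')

print(df)
df.head()           # first five rows
df.info()           # column types and non-null counts
df.isnull()         # True wherever a value is missing
df.isnull().sum()   # number of missing values per column
df = df.drop(['column name'], axis=1)   # drop a column by name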
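The same calls can be tried without a file on disk by reading CSV text from memory. The CSV content below is hypothetical and only stands in for `data.csv`; the column names are borrowed from the earlier DataFrame example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,
45,109,175,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))

missing = df.isnull().sum()         # missing values per column
df = df.drop(["Maxpulse"], axis=1)  # drop one column by name

print(missing["Calories"])  # 1
print(list(df.columns))     # ['Duration', 'Pulse', 'Calories']
```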
*Matplotlib:
* Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension, NumPy.

* Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
* In the command prompt, type: pip install matplotlib

* import matplotlib
*Pyplot

* import matplotlib.pyplot as plt


import numpy as np

xpoints = np.array([0, 6])


ypoints = np.array([0, 250])

plt.plot(xpoints, ypoints)
plt.show()
*Plotting x and y points

* import matplotlib.pyplot as plt


import numpy as np

xpoints = np.array([1, 8])


ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints)
plt.show()
*Markers

* import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, marker = 'o')


plt.show()
*Display Multiple Plots
* import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.show()
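The same two-panel figure can also be built with `plt.subplots`, which returns the figure and its axes as objects. This is a sketch under two assumptions: the non-interactive "Agg" backend is selected so that no window is needed, and the figure is written to an in-memory PNG instead of being shown. The panel titles are made up for the example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
y2 = np.array([10, 20, 30, 40])

# One row, two columns of axes, like subplot(1, 2, 1) and subplot(1, 2, 2).
fig, (left, right) = plt.subplots(1, 2)
left.plot(x, y1, marker="o")
left.set_title("Plot 1")
right.plot(x, y2, marker="o")
right.set_title("Plot 2")

# Save to an in-memory PNG instead of calling plt.show().
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
print(len(fig.axes))  # 2
```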
*Machine Learning
* Machine learning is a branch of AI that focuses
on the development of algorithms that can
learn from and make predictions or decisions
based on data.
* ML algorithms learn patterns and relationships
from labeled or unlabeled data to perform
specific tasks without being explicitly
programmed.
*Types of Algorithms:

*Supervised Learning
*Unsupervised Learning
*Reinforcement Learning
*Supervised Learning
Supervised learning is a type of machine learning algorithm where
the system is trained on labeled data to make predictions or decisions based
on the input data. In supervised learning, the algorithm learns from a set of
input data (often called "training data") that is already labeled with the
correct output or target variable. The algorithm tries to learn a mapping
function that can predict the correct output for any new input data it
encounters.
Supervised learning can be divided into two main categories:
1. Regression
2. Classification
* REGRESSION
* In regression, the output variable is continuous, and the algorithm tries to find a
relationship between the input variables and the output variable. For example,
given data on the size of a house, the number of bedrooms, and the location, a
regression model can predict the price of the house.
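A minimal regression sketch, assuming a single input (house size) and a straight-line relationship. The sizes and prices are invented for illustration, and `np.polyfit` is used here as a simple least-squares fit, not as the method any particular project relies on.

```python
import numpy as np

# Hypothetical training data: house size in square metres, price in thousands.
size = np.array([50.0, 80.0, 100.0, 120.0])
price = np.array([110.0, 170.0, 210.0, 250.0])

# Fit price ≈ slope * size + intercept by least squares.
slope, intercept = np.polyfit(size, price, deg=1)

# Predict the price of a new, unseen house of 90 square metres.
predicted = slope * 90.0 + intercept
print(round(float(predicted), 2))
```

A real model would use more inputs, such as the number of bedrooms and the location mentioned above.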
* CLASSIFICATION
* In classification, the output variable is categorical, and the algorithm tries to
classify the input data into different categories. For example, given a dataset of
images of cats and dogs, a classification model can predict whether a new
image is of a cat or a dog.
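A minimal classification sketch. Instead of images, it assumes two made-up numeric features per animal, and it uses a nearest-centroid rule (assign the class whose average example is closest), which is one of the simplest classifiers and is chosen here only for brevity.

```python
import numpy as np

# Hypothetical labelled data: [weight in kg, height in cm]; 0 = cat, 1 = dog.
X = np.array([[4.0, 25.0], [5.0, 23.0], [20.0, 55.0], [30.0, 60.0]])
y = np.array([0, 0, 1, 1])
labels = ["cat", "dog"]

# Training: compute the mean feature vector (centroid) of each class.
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

def predict(sample):
    """Return the label of the class whose centroid is closest to the sample."""
    distances = np.linalg.norm(centroids - sample, axis=1)
    return labels[int(np.argmin(distances))]

print(predict(np.array([6.0, 27.0])))   # cat
print(predict(np.array([25.0, 58.0])))  # dog
```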
*Unsupervised Learning
* Unsupervised learning is a type of machine learning algorithm
where the system learns to identify patterns and relationships in the
input data without any explicit supervision or labels. In unsupervised
learning, the input data is not labeled, and the algorithm tries to find
a hidden structure or clustering in the data.
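As an illustration of finding clusters without labels, here is a small k-means sketch on one-dimensional data. The data points, the choice of two clusters, and the starting centers are all assumptions made for the example.

```python
import numpy as np

# Hypothetical unlabelled 1-D measurements forming two loose groups.
points = np.array([1.0, 1.5, 2.0, 9.0, 10.0, 11.0])
centers = np.array([points.min(), points.max()])  # simple deterministic start

for _ in range(10):
    # Assignment step: each point joins its nearest center.
    distances = np.abs(points[:, None] - centers[None, :])
    assignment = np.argmin(distances, axis=1)
    # Update step: each center moves to the mean of its assigned points.
    centers = np.array([points[assignment == k].mean() for k in range(2)])

print(assignment.tolist())  # [0, 0, 0, 1, 1, 1]
print(centers.tolist())     # [1.5, 10.0]
```

No labels were supplied; the grouping comes only from how close the points are to each other.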
*Reinforcement learning
*Model Evaluation

*Accuracy, Precision, Recall, F1 Score


*Confusion Matrix
*Cross-validation
*Accuracy:
* Accuracy is the ratio of correctly predicted instances to the total instances.
* $\text{Accuracy} = \dfrac{\text{Number of correct predictions}}{\text{Total number of predictions}}$
* Accuracy can be misleading in imbalanced
datasets.
* High accuracy doesn't always mean good
performance (e.g., in cases where one class
dominates).
* Does not account for the costs of different types
of errors (false positives vs. false negatives).
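The points above can be demonstrated with a made-up imbalanced dataset: a "model" that always predicts the majority class scores high accuracy while finding none of the positive cases. The 95/5 split is an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
positives_found = int(np.sum((y_true == 1) & (y_pred == 1)))

print(accuracy)         # 0.95
print(positives_found)  # 0
```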
*Precision
* Precision is the ratio of correctly predicted positive observations to the total predicted positives.
* $\text{Precision} = \dfrac{TP}{TP + FP}$, where TP is the number of true positives and FP is the number of false positives.

* Indicates how many of the predicted positive instances are actually positive.
* Crucial in situations where the cost of false positives is high (e.g., medical diagnosis, spam detection).
* Helps in assessing the relevance of positive predictions in information retrieval systems.
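A short sketch of computing precision by counting true and false positives. The labels are invented spam-filter results. Recall and F1 score, which are listed under Model Evaluation but not defined on these slides, are included using their standard definitions: recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall.

```python
import numpy as np

# Hypothetical spam-filter results: 1 = spam, 0 = not spam.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam correctly flagged
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # good mail wrongly flagged
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)  # 3 1 1
print(precision)   # 0.75
```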
*Confusion Matrix
* A Confusion matrix is an N x N matrix used for evaluating the
performance of a classification model, where N is the total number
of target classes. The matrix compares the actual target values with
those predicted by the machine learning model.
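A confusion matrix can be built by counting (actual, predicted) pairs. In this sketch the labels are hypothetical, N is 3, and rows are taken to be actual classes while columns are predicted classes; some references use the opposite orientation.

```python
import numpy as np

# Hypothetical labels for N = 3 classes, encoded as 0, 1, 2.
actual = np.array([0, 0, 1, 1, 2, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 0, 2])

n_classes = 3
matrix = np.zeros((n_classes, n_classes), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1  # rows: actual class, columns: predicted class

print(matrix.tolist())  # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]

# Correct predictions lie on the diagonal.
correct = int(np.trace(matrix))
print(correct, "of", len(actual))  # 5 of 7
```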
*Conclusion
* Data science is a powerful tool that transforms data into
actionable insights, driving better decision-making and efficiency.
* The data science process involves several key steps from problem
definition to deployment.
* A variety of tools and technologies support data science activities,
making it a versatile and dynamic field.
* The future of data science looks promising with advancements in
AI, big data, and automation, along with an increased focus on
ethics and interdisciplinary collaboration.
