DATASCIENCE_INTERNSHIP[1]
*Data Collection
*Data Cleaning and Preprocessing
*Exploratory Data Analysis (EDA)
*Why is Data Science Important?
* Enhances decision-making
* Predictive analytics for future trends
* Personalization and improved customer experience
* Operational efficiency
* Image/Graphic: Infographic with data science applications in
different industries (e.g., healthcare, finance, retail)
*Key Tools and Technologies
* Output : 4
* import numpy as np
* a = np.array([[1,2,3,4,5], [6,7,8,9,10]])
* print( a[1, 4] + a[0,2])
* Output : 13
* import numpy as np
* a = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
* print(a[0, 1, 2])
* Output : 6
*Array Slicing
* import numpy as np
* a= np.array([1, 2, 3, 4, 5, 6, 7])
* print(a[4:])
* Output : [5 6 7]
* import numpy as np
* a= np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
* print(a[1, 1:4])
* Output : [7 8 9]
* import numpy as np
* arr = np.array([1, 2, 3, 4, 5, 6, 7]) #step slicing
* print(arr[::2])
* Output : [1 3 5 7]
* import numpy as np
* a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
* print(a[0:2, 1:4]) # first slice selects rows 0 and 1; second slice selects columns 1 to 3
* Output : [[2 3 4]
* [7 8 9]]
*Copy and view:
* import numpy as np
* a = np.array([1, 2, 3, 4, 5])
* x = a.copy()
* a[0] =0
* print(a)
* print(x)
* Output : [0 2 3 4 5]
* [1 2 3 4 5]
* ------------------------------------------------------------------------------------------------------------
* import numpy as np
* a = np.array([1, 2, 3, 4, 5])
* x = a.view()
* a[0] = 0
* print(a)
* print(x)
* Output : [0 2 3 4 5]
* [0 2 3 4 5]
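* One way to check whether an array is a copy or a view is its base attribute. The sketch below reuses the array from the examples above; the variable name y and the value 99 are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
x = a.copy()   # owns its own data
y = a.view()   # shares data with a

# .base is None when an array owns its data; for a view it is the source array
print(x.base is None)   # True
print(y.base is a)      # True

# Writing through the view changes the original; the copy is unaffected
y[1] = 99
print(a)   # [ 1 99  3  4  5]
print(x)   # [1 2 3 4 5]
```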
*Array Shape:
* import numpy as np
* a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
* print(a.shape)
* Output : (2, 4)
*Array Reshape: 1D to 2D and 1D to 3D
* import numpy as np
* a= np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
* b= a.reshape(4, 3)
* print(b)
* Output : [[1 2 3]
* [4 5 6]
* [7 8 9]
* [10 11 12]]
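* The same 12-element array can also be reshaped from 1D to 3D. A minimal sketch, assuming a target shape of (2, 3, 2); any shape whose dimensions multiply to 12 would work.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
c = a.reshape(2, 3, 2)   # 2 blocks, each with 3 rows and 2 columns
print(c.shape)           # (2, 3, 2)
print(c[1, 2, 0])        # 11
```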
*For Loop and Array iterating:
* import numpy as np
* a = np.array([[1, 2, 3], [4, 5, 6]])
* for x in a:
*     for y in x:
*         print(y)
* Output : 1
* 2
* 3
* 4
* 5
* 6
* ------------------------------------------------------------------------------------------------------------
* import numpy as np
* a= np.array([[1,2,3],[4,5,6]])
* for x in np.nditer(a):
*     print(x)
* Output : 1
* 2
* 3
* 4
* 5
* 6
*Array Concatenation
* import numpy as np
* a1 = np.array([[1, 2], [3, 4]])
* a2 = np.array([[5, 6], [7, 8]])
* a = np.concatenate((a1, a2), axis=1) # axis=1 joins the arrays side by side (along columns); the default axis=0 stacks them as rows
* print(a)
* Output : [[1 2 5 6]
* [3 4 7 8]]
* -----------------------------------------------------------------------------------------------------------
* import numpy as np
* a1 = np.array([[1, 2], [3, 4]])
* a2 = np.array([[5, 6], [7, 8]])
* a = np.concatenate((a1, a2))
* print(a)
* Output : [[1 2]
* [3 4]
* [5 6]
* [7 8]]
*Array Sort:
* import numpy as np
* a = np.array([[13, 9, 4], [15, 12, 1]])
* print(np.sort(a)) # sorts each row independently (last axis)
* Output : [[ 4  9 13]
* [ 1 12 15]]
*Pandas installation:
* In the command prompt, type: pip install pandas
* import pandas # plain import
* import pandas as pd # same import with the conventional alias pd
*Pandas DataFrames:
* data = {
*     "name": ["Arun", "Praveen", "Raja"],
*     "Marks": [100, 70, 82]
* }
* new = pd.DataFrame(data)
* print(new)
*Remove Duplicates
* import pandas as pd
* # Sample DataFrame
* data = {'A': [1, 2, 3, 5],
*         'B': ['apple', 'banana', 'apple', 'apple']}
* data = pd.DataFrame(data)
* # Count the values in column 'B' that occur more than once
* datas = data[data.duplicated(subset='B', keep=False)]
* print(datas.value_counts(subset='B'))
* # Keep only the first row for each value in column 'B'
* df = data.drop_duplicates(subset='B')
* print(df)
*Test Case
* Question:
* +----+-------------------+
* | id | email             |
* +----+-------------------+
* | 1  | [email protected] |
* | 2  | [email protected] |
* | 3  | [email protected] |
* +----+-------------------+
* id is the primary key (column with unique values) for this table.
* Each row of this table contains an email. The emails will not contain uppercase letters.
* Output:
* [email protected] is repeated two times.
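* One possible way to approach this test case with pandas is sketched below. The addresses in the table are obscured, so the ones used here are hypothetical placeholders, and the DataFrame name person is an assumption.

```python
import pandas as pd

# Hypothetical addresses: the real ones are obscured in the table above
person = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "a@example.com"],
})

# Count how often each email occurs and keep those seen more than once
counts = person["email"].value_counts()
repeated = counts[counts > 1]

for email, n in repeated.items():
    print(f"{email} is repeated {n} times.")
```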
*Pandas Series:
* import pandas as pd
* a = [1, 2, 8]
* new = pd.Series(a, index = ["x", "y", "z"])
* print(new)
* Output : x 1
* y 2
* z 8
* dtype: int64
* ------------------------------------------------------------------------------------------------------------
* import pandas as pd
* names = {"name1": "Arun", "name2": "Raja", "name3": "Praveen"}
* new = pd.Series(names)
* print(new)
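* print(df)
* df.head() # first five rows
* df.info() # column types and non-null counts
* df.isnull() # True where a value is missing
* df.isnull().sum() # number of missing values per column
* data = data.drop(['column name'], axis=1) # drop a column by name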
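* The inspection calls above can be tried on a small DataFrame. This is a sketch with made-up data: the missing mark is introduced on purpose to show what isnull() reports.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing mark (np.nan)
df = pd.DataFrame({"name": ["Arun", "Praveen", "Raja"],
                   "Marks": [100, np.nan, 82]})

print(df.head())              # first rows of the table
missing = df.isnull().sum()   # missing values per column
print(missing)

df = df.drop(["name"], axis=1)   # axis=1 drops a column, not a row
print(df.columns.tolist())       # ['Marks']
```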
*Matplotlib:
* Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension, NumPy.
* import matplotlib
*Pyplot
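* import matplotlib.pyplot as plt
* plt.plot(xpoints, ypoints) # xpoints and ypoints are equal-length sequences of coordinates
* plt.show()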
*Plotting x and y points
* plt.plot(xpoints, ypoints) # draws a line through the (x, y) points in order
* plt.show()
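* A complete version of the plotting snippet might look like the sketch below. The coordinate values are hypothetical; only plt.plot and plt.show come from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample points (not from the original slides)
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

# plot() connects the points in order and returns the line objects it drew
lines = plt.plot(xpoints, ypoints)
plt.show()
```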
*Markers
*Supervised Learning
*Unsupervised Learning
*Reinforcement Learning
*Supervised Learning
Supervised learning is a type of machine learning algorithm where
the system is trained on labeled data to make predictions or decisions based
on the input data. In supervised learning, the algorithm learns from a set of
input data (often called "training data") that is already labeled with the
correct output or target variable. The algorithm tries to learn a mapping
function that can predict the correct output for any new input data it
encounters.
Supervised learning can be divided into two main categories:
1. Regression
2. Classification
* REGRESSION
* In regression, the output variable is continuous, and the algorithm tries to find a
relationship between the input variables and the output variable. For example,
given data on the size of a house, the number of bedrooms, and the location, a
regression model can predict the price of the house.
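* A minimal regression sketch using only NumPy is shown below. It assumes a single input (house size) rather than the three mentioned above, and the sizes and prices are made up so that they lie exactly on a line.

```python
import numpy as np

# Hypothetical training data: house size and price, made up for illustration
size = np.array([50.0, 80.0, 100.0, 120.0])
price = 2.0 * size + 10.0   # labels lie exactly on the line price = 2 * size + 10

# Fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(size, price, 1)

# Predict the continuous output for a new, unseen input
new_size = 90.0
predicted = slope * new_size + intercept
print(round(float(predicted), 2))
```

* Real data would not fit a line exactly; the fitted slope and intercept would then be the least-squares estimates.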
* CLASSIFICATION
* In classification, the output variable is categorical, and the algorithm tries to
classify the input data into different categories. For example, given a dataset of
images of cats and dogs, a classification model can predict whether a new
image is of a cat or a dog.
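* Classification can be sketched with a very simple nearest-centroid rule. This is an illustrative stand-in, not a method named in the slides: real images are replaced by two hypothetical numeric features per sample.

```python
import numpy as np

# Hypothetical 2-feature training data with labels (0 = cat, 1 = dog)
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [3.0, 3.2], [3.2, 2.9], [2.9, 3.1]])
y = np.array([0, 0, 0, 1, 1, 1])
labels = np.array(["cat", "dog"])

# "Training": compute the mean feature vector (centroid) of each class
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

def predict(point):
    """Return the label of the class whose centroid is closest to the point."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return labels[np.argmin(distances)]

print(predict(np.array([1.0, 1.0])))   # cat
print(predict(np.array([3.1, 3.0])))   # dog
```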
*Unsupervised Learning
* Unsupervised learning is a type of machine learning algorithm
where the system learns to identify patterns and relationships in the
input data without any explicit supervision or labels. In unsupervised
learning, the input data is not labeled, and the algorithm tries to find
a hidden structure or clustering in the data.
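* Clustering can be illustrated with a small k-means loop. This is a sketch under assumptions: the points are hypothetical, the number of clusters is fixed at 2, and the starting centres are picked by hand.

```python
import numpy as np

# Hypothetical unlabeled points forming two visible groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Start from two of the data points as initial cluster centres
centers = X[[0, 3]].copy()

for _ in range(10):
    # Assign each point to its nearest centre
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    # Move each centre to the mean of the points assigned to it
    centers = np.array([X[assignment == k].mean(axis=0) for k in range(2)])

print(assignment)   # [0 0 0 1 1 1]
```

* No labels are used anywhere: the grouping comes only from the distances between points.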
*Reinforcement learning
*Model Evaluation
* Precision = True positives (TP) / (True positives (TP) + False positives (FP))
• Indicates how many of the predicted positive instances are actually positive.
• Crucial in situations where the cost of false positives is high (e.g., medical
diagnosis, spam detection).
• Helps in assessing the relevance of positive predictions in information retrieval
systems.
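* Precision can be computed directly from actual and predicted labels. The label lists below are hypothetical and chosen only to give a small worked example.

```python
# Hypothetical labels: 1 = positive, 0 = negative
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

# True positives: predicted positive and actually positive
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
# False positives: predicted positive but actually negative
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

precision = tp / (tp + fp)
print(tp, fp, precision)   # 3 2 0.6
```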
*Confusion Matrix
* A Confusion matrix is an N x N matrix used for evaluating the
performance of a classification model, where N is the total number
of target classes. The matrix compares the actual target values with
those predicted by the machine learning model.
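* A confusion matrix can be built by counting (actual, predicted) pairs. The sketch below assumes N = 3 classes encoded as 0, 1, 2 and uses hypothetical labels; putting actual classes on the rows is a common convention, not something the slides specify.

```python
import numpy as np

# Hypothetical actual and predicted class indices for N = 3 classes
actual    = np.array([0, 1, 2, 2, 1, 0, 2, 1])
predicted = np.array([0, 2, 2, 2, 1, 0, 1, 1])

n_classes = 3
matrix = np.zeros((n_classes, n_classes), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1   # rows: actual class, columns: predicted class

print(matrix)
```

* Correct predictions fall on the diagonal; every off-diagonal cell counts one kind of mistake.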
*Conclusion
* Data science is a powerful tool that transforms data into
actionable insights, driving better decision-making and efficiency.
* The data science process involves several key steps from problem
definition to deployment.
* A variety of tools and technologies support data science activities,
making it a versatile and dynamic field.
* The future of data science looks promising with advancements in
AI, big data, and automation, along with an increased focus on
ethics and interdisciplinary collaboration.