0% found this document useful (0 votes)

23 views71 pages

Num Py

NumPy Crash Course for Data Science & Machine Learning

Uploaded by

n2diyana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views71 pages

Num Py

NumPy Crash Course for Data Science & Machine Learning

Uploaded by

n2diyana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 71

NumPy Crash Course for Data Science &

Machine Learning

Table of Contents
Introduction
Why NumPy Over Python Lists
NumPy's Internal Architecture
Performance Comparison
Part 1 - Foundation
Creating Arrays
Array Attributes
Part 2 - Working with Data
Indexing & Slicing
Array Operations
Broadcasting
Part 3 - Reshaping & Organizing Data
Part 4 - Data Cleaning & Generation
Part 5 - Math, Stats & Linear Algebra

Introduction
NumPy (Numerical Python) is the cornerstone of scientific computing in Python and the backbone of
the entire PyData ecosystem. It provides a powerful, efficient N-dimensional array object called
ndarray , along with a comprehensive collection of mathematical functions to operate on these arrays.

What Makes NumPy Essential?

NumPy is not just another Python library—it's the foundation upon which the entire data science and
machine learning ecosystem is built. Libraries like Pandas, Scikit-learn, TensorFlow, PyTorch,
Matplotlib, and SciPy all depend on NumPy arrays as their fundamental data structure.

Key Features
High-performance N-dimensional arrays with optimized C/C++ implementations
Broadcasting capabilities for operations between arrays of different shapes
Linear algebra functions including matrix operations, decompositions, and eigenvalues
Fourier transforms and random number generation
Integration tools for C/C++ and Fortran code
Memory-efficient operations with sophisticated memory management

Why NumPy Over Python Lists?

Understanding why NumPy arrays outperform Python lists requires diving into their fundamental
differences in design and implementation.

1. Memory Layout & Data Types

Python Lists (Heterogeneous & Scattered)

Python List: [1, 2.5, "hello", True]

Memory Layout:
┌─────────┬─────────┬─────────┬─────────┐
│ ptr→int │ ptr→flt │ ptr→str │ ptr→bln │ ← List of pointers
└─────────┴─────────┴─────────┴─────────┘
│ │ │ │
▼ ▼ ▼ ▼
[int obj] [float obj] [str obj] [bool obj] ← Objects scattered in memory

NumPy Arrays (Homogeneous & Contiguous)

NumPy Array: [1, 2, 3, 4]

Memory Layout:
┌───┬───┬───┬───┐
│ 1 │ 2 │ 3 │ 4 │ ← Raw data stored contiguously
└───┴───┴───┴───┘

2. Performance Advantages
Aspect Python Lists NumPy Arrays

Memory Usage ~28 bytes per integer ~8 bytes per integer

Cache Efficiency Poor (scattered memory) Excellent (contiguous memory)

Aspect Python Lists NumPy Arrays

Type Checking Runtime for each operation None (homogeneous types)

Loops Python interpreter overhead Optimized C loops

Vectorization Manual loops required Built-in vectorized operations

3. Operation Efficiency
Python Lists - Element-by-Element Processing

# Python approach - slow due to interpreter overhead

result = []
for i in range(len(list1)):
result.append(list1[i] + list2[i]) # Each operation goes through Python interpreter

NumPy Arrays - Vectorized Operations

# NumPy approach - fast C-level operations

result = array1 + array2 # Single optimized C operation on entire arrays

NumPy's Internal Architecture

1. The ndarray Object Structure

NumPy's core is the ndarray (N-dimensional array) object, which consists of:
ndarray object:
├── Data Buffer (Raw memory block)
│ └── Contiguous memory storing actual values
├── Data Type (dtype)
│ └── Describes how to interpret each byte
├── Shape
│ └── Tuple describing array dimensions
├── Strides
│ └── Bytes to skip to reach next element in each dimension
└── Metadata
├── Number of dimensions (ndim)
├── Total number of elements (size)
└── Memory layout flags

2. Memory Efficiency Through Homogeneity

import sys
import numpy as np

# Memory comparison - demonstrating overhead differences

python_list = [1, 2, 3, 4, 5] * 1000
numpy_array = np.array(python_list)

# Python lists store pointers to Python objects (with metadata)

# NumPy arrays store raw data values directly
print(f"Python list memory: {sys.getsizeof(python_list)} bytes")
print(f"NumPy array memory: {numpy_array.nbytes} bytes")
print(f"NumPy advantage: Eliminates {sys.getsizeof(python_list) - numpy_array.nbytes:,} bytes of

3. Vectorization Engine
NumPy's vectorization works through:

Universal Functions (ufuncs): Optimized C functions that operate element-wise

Broadcasting: Intelligent shape manipulation for operations between different-sized arrays
SIMD Instructions: Single Instruction, Multiple Data operations when available
4. Advantages Summary
Feature Impact

Contiguous Memory Better cache performance, faster access patterns

Homogeneous Types No type checking overhead, predictable memory usage

C-level Implementation Eliminates Python interpreter overhead

Vectorized Operations Processes entire arrays in single optimized operations

Broadcasting Enables efficient operations between different-shaped arrays

Memory Views Share data between arrays without copying

Performance Comparison
Let's demonstrate the dramatic performance differences between Python lists and NumPy arrays
through practical examples:
Speed Benchmark

import numpy as np
import time

# Speed comparison: NumPy vs Python lists

size = 1_000_000

# Python lists approach (slower)

python_list = list(range(size)) # Create a regular Python list
start_time = time.time()
result_list = [x * 2 for x in python_list] # Loop through each element individually
python_time = time.time() - start_time

# NumPy arrays approach (faster)

numpy_array = np.arange(size) # Create NumPy array with same data
start_time = time.time()
result_numpy = numpy_array * 2 # Vectorized operation - no explicit loop needed!
numpy_time = time.time() - start_time

print(f"Python list time: {python_time:.4f} seconds")

print(f"NumPy array time: {numpy_time:.4f} seconds")
print(f"NumPy is {python_time/numpy_time:.1f}x faster!")

Output:

Python list time: 0.1250 seconds

NumPy array time: 0.0031 seconds
NumPy is 40.3x faster!
Memory Usage Comparison

import sys

# Memory efficiency comparison

size = 10000
python_list = list(range(size))
numpy_array = np.arange(size)

print(f"Python list: {sys.getsizeof(python_list):,} bytes")

print(f"NumPy array: {numpy_array.nbytes:,} bytes")
print(f"Additional overhead: NumPy eliminates pointer overhead and object headers")

Output:

Python list: 80,056 bytes

NumPy array: 80,000 bytes
Memory efficiency: 1.0x less memory (plus additional overhead benefits)
Complex Operations Performance

# Matrix multiplication comparison

size = 500

# Python lists (using nested loops)

list_a = [[i + j for j in range(size)] for i in range(size)]
list_b = [[i + j for j in range(size)] for i in range(size)]

start_time = time.time()
# Manual matrix multiplication (simplified for demonstration)
result_list = [[sum(a * b for a, b in zip(row_a, col_b))
for col_b in zip(*list_b)] for row_a in list_a]
python_time = time.time() - start_time

# NumPy arrays
array_a = np.random.randint(0, 10, (size, size))
array_b = np.random.randint(0, 10, (size, size))

start_time = time.time()
result_numpy = np.dot(array_a, array_b) # Optimized BLAS operations
numpy_time = time.time() - start_time

print(f"Python matrix multiplication: {python_time:.4f} seconds")

print(f"NumPy matrix multiplication: {numpy_time:.4f} seconds")
print(f"NumPy is {python_time/numpy_time:.0f}x faster for matrix operations!")
Part 1 – Foundation

1. Creating Arrays
From Python Lists & Tuples

# From lists
arr_1d = np.array([1, 2, 3, 4, 5]) # 1-dimensional array
arr_2d = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array (like a matrix)
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) # 3D array (like a cube)

print("1D array:", arr_1d)

print("2D array:\n", arr_2d)
print("3D array:\n", arr_3d)

Output:

1D array: [1 2 3 4 5]
2D array:
[[1 2 3]
[4 5 6]]
3D array:
[[[1 2]
[3 4]]

[[5 6]
[7 8]]]

Pre-filled Arrays

# Zeros, ones, and filled arrays

zeros_arr = np.zeros((3, 4)) # Creates 3x4 array filled with zeros
ones_arr = np.ones((2, 3, 4)) # Creates 2x3x4 array filled with ones
full_arr = np.full((3, 3), 7) # Creates 3x3 array filled with the value 7
identity = np.eye(4) # Creates 4x4 identity matrix (1s on diagonal, 0s elsewher

print("Zeros array:\n", zeros_arr)

print("Identity matrix:\n", identity)

Output:
Zeros array:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
Identity matrix:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

Ranges and Sequences

# arange: similar to Python's range() but creates NumPy array

arr_range = np.arange(0, 10, 2) # Start=0, Stop=10, Step=2 → [0, 2, 4, 6, 8]

# linspace: creates evenly spaced numbers between start and stop

arr_linspace = np.linspace(0, 1, 5) # 5 evenly spaced numbers from 0 to 1

print("arange:", arr_range)
print("linspace:", arr_linspace)

Output:

arange: [0 2 4 6 8]
linspace: [0. 0.25 0.5 0.75 1. ]

Random Arrays

# Set seed for reproducibility (same random numbers each time)

np.random.seed(42)

# Random arrays
rand_uniform = np.random.rand(3, 3) # Uniform distribution between 0 and 1
rand_normal = np.random.randn(3, 3) # Standard normal distribution (mean=0, std=1)
rand_int = np.random.randint(1, 10, (3, 3)) # Random integers between 1 and 9 (10 excluded)

print("Random uniform:\n", rand_uniform)

print("Random integers:\n", rand_int)

Output:
Random uniform:
[[0.374 0.951 0.732]
[0.598 0.156 0.155]
[0.058 0.866 0.601]]
Random integers:
[[7 4 8]
[5 2 2]
[1 9 4]]

Copy vs View

# Original array
original = np.array([1, 2, 3, 4, 5])

# View (shares memory with original - changes affect both)

view = original[1:4] # Gets elements at indices 1, 2, 3
view[0] = 999 # This changes original[1] to 999!
print("Original after view modification:", original) # [1, 999, 3, 4, 5]

# Copy (independent - changes don't affect original)

original = np.array([1, 2, 3, 4, 5]) # Reset original array
copy = original[1:4].copy() # .copy() creates independent copy
copy[0] = 999 # This only changes the copy
print("Original after copy modification:", original) # [1, 2, 3, 4, 5] - unchanged!

Output:

Original after view modification: [ 1 999 3 4 5]

Original after copy modification: [1 2 3 4 5]

Why this matters: Views save memory but can cause unexpected changes. Always use .copy()
when you want to modify data independently.

Note: Always use .copy() when you want to modify a slice without affecting the original array.
2. Array Attributes

# Create a sample array

arr = np.random.randint(1, 10, (3, 4, 5)) # 3D array: 3 "layers", 4 rows, 5 columns

print("Array shape:", arr.shape) # (3, 4, 5) - dimensions of the array

print("Number of dimensions:", arr.ndim) # 3 - how many dimensions (1D, 2D, 3D, etc.)
print("Data type:", arr.dtype) # int32 or int64 - type of data stored
print("Total elements:", arr.size) # 60 - total number of elements (3×4×5)
print("Item size (bytes):", arr.itemsize) # 4 or 8 - memory used per element

# Change data type to save memory or increase precision

arr_float = arr.astype(np.float32) # Convert integers to 32-bit floats
print("New dtype:", arr_float.dtype) # float32

Output:

Array shape: (3, 4, 5)

Number of dimensions: 3
Data type: int64
Total elements: 60
Item size (bytes): 8
New dtype: float32

Key insight: Understanding array attributes helps optimize memory usage and performance.

Exercises - Foundation
Exercise 1: Create a 5x5 matrix where the diagonal elements are 1, elements above the diagonal are
2, and elements below the diagonal are 0.

Exercise 2: Generate a 1D array of 20 evenly spaced numbers between -π and π, then reshape it into
a 4x5 matrix.
Part 2 – Working with Data

3. Indexing & Slicing

1D Indexing & Slicing

arr = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Basic indexing (like Python lists)

print("Element at index 2:", arr[2]) # 30 - third element (0-indexed)
print("Last element:", arr[-1]) # 90 - negative indexing from end

# Slicing: [start:stop:step]
print("Slice [2:6]:", arr[2:6]) # [30, 40, 50, 60] - elements 2,3,4,5
print("Every 2nd element:", arr[::2]) # [10, 30, 50, 70, 90] - start:end:step=2

Output:

Element at index 2: 30
Last element: 90
Slice [2:6]: [30 40 50 60]
Every 2nd element: [10 30 50 70 90]

2D Indexing

arr_2d = np.array([[1, 2, 3, 4],

[5, 6, 7, 8],
[9, 10, 11, 12]])

# 2D indexing: [row, column]

print("Element [1, 2]:", arr_2d[1, 2]) # 7 - row 1, column 2 (0-indexed)
print("First row:", arr_2d[0, :]) # [1, 2, 3, 4] - row 0, all columns
print("Last column:", arr_2d[:, -1]) # [4, 8, 12] - all rows, last column
print("Subarray:\n", arr_2d[1:, 1:3]) # Rows 1-end, columns 1-2

Output:
Element [1, 2]: 7
First row: [1 2 3 4]
Last column: [ 4 8 12]
Subarray:
[[ 6 7]
[10 11]]

Key insight: The colon : means "all" - use it to select entire rows or columns.

Boolean Indexing

data = np.array([1, 5, 3, 8, 2, 9, 4])

# Boolean mask - creates True/False array

mask = data > 4 # Creates boolean array: [False, True, False, True, False, True, False]
print("Mask:", mask)
print("Values > 4:", data[mask]) # [5, 8, 9] - only True positions are selected

# Direct boolean indexing (more common)

print("Values between 3 and 7:", data[(data >= 3) & (data <= 7)]) # [5, 3, 4]
# Note: Use & for AND, | for OR in NumPy (not 'and'/'or')

Output:

Mask: [False True False True False True False]

Values > 4: [5 8 9]
Values between 3 and 7: [5 3 4]

Important: Use & and | for logical operations with NumPy arrays, not and / or .
Fancy Indexing

arr = np.array([10, 20, 30, 40, 50])

indices = [0, 2, 4] # Which positions we want

# Fancy indexing - use list/array of indices to select multiple elements

print("Fancy indexing:", arr[indices]) # [10, 30, 50] - elements at positions 0, 2, 4

# 2D fancy indexing - select specific elements from 2D array

arr_2d = np.arange(12).reshape(3, 4) # Creates [[0,1,2,3], [4,5,6,7], [8,9,10,11]]
print("Original:\n", arr_2d)
print("Fancy 2D:\n", arr_2d[[0, 2], [1, 3]]) # Elements at [0,1] and [2,3] positions

Output:

Fancy indexing: [10 30 50]

Original:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Fancy 2D: [1 11]

Explanation: arr_2d[[0, 2], [1, 3]] selects arr_2d[0,1] and arr_2d[2,3].

np.where()

arr = np.array([1, 5, 3, 8, 2, 9, 4])

# np.where() has two main uses:

# 1. Find indices where condition is True

indices = np.where(arr > 4) # Returns tuple of arrays
print("Indices where arr > 4:", indices[0]) # [1, 3, 5] - positions of values > 4

# 2. Conditional replacement (like ternary operator: condition ? value_if_true : value_if_false)

result = np.where(arr > 4, arr, 0) # If arr[i] > 4, keep arr[i], else replace with 0
print("Replace values ≤ 4 with 0:", result) # [0, 5, 0, 8, 0, 9, 0]

Output:
Indices where arr > 4: [1 3 5]
Replace values ≤ 4 with 0: [0 5 0 8 0 9 0]

Think of np.where() as: "Where condition is True, use first value, otherwise use second value."

4. Array Operations
Element-wise Arithmetic

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise operations (applied to corresponding elements)

print("Addition:", a + b) # [11, 22, 33, 44] - adds corresponding elements
print("Multiplication:", a * b) # [10, 40, 90, 160] - multiplies corresponding elements
print("Division:", b / a) # [10.0, 10.0, 10.0, 10.0] - divides corresponding elements
print("Power:", a ** 2) # [1, 4, 9, 16] - squares each element

Output:

Addition: [11 22 33 44]

Multiplication: [ 10 40 90 160]
Division: [10. 10. 10. 10.]
Power: [ 1 4 9 16]

Key concept: Operations are applied element-wise, not like matrix multiplication.

Universal Functions

arr = np.array([1, 4, 9, 16, 25])

# Universal functions (ufuncs) - work on entire arrays

print("Square root:", np.sqrt(arr)) # [1.0, 2.0, 3.0, 4.0, 5.0]
print("Exponential:", np.exp([0, 1, 2])) # [1.0, 2.718..., 7.389...] - e^x
print("Logarithm:", np.log(arr)) # [0.0, 1.386..., 2.197..., ...] - natural log
print("Sin values:", np.sin([0, np.pi/2, np.pi])) # [0.0, 1.0, 0.0] - sine function

Output:
Square root: [1. 2. 3. 4. 5.]
Exponential: [1. 2.71828183 7.3890561 ]
Logarithm: [0. 1.38629436 2.19722458 2.77258872 3.21887582]
Sin values: [0.0000000e+00 1.0000000e+00 1.2246468e-16]

Note: These functions are much faster than Python's math module because they're vectorized.

Aggregations

data = np.array([[1, 2, 3],

[4, 5, 6],
[7, 8, 9]])

# Aggregation functions - reduce array to single values or along axes

print("Sum of all elements:", np.sum(data)) # 45 - adds all numbers together
print("Sum along axis 0:", np.sum(data, axis=0)) # [12, 15, 18] - sum each column
print("Sum along axis 1:", np.sum(data, axis=1)) # [6, 15, 24] - sum each row

print("Mean:", np.mean(data)) # 5.0 - average of all elements

print("Standard deviation:", np.std(data)) # 2.58... - measure of spread
print("Min value:", np.min(data)) # 1 - smallest value
print("Max value:", np.max(data)) # 9 - largest value
print("Min index:", np.argmin(data)) # 0 - position of min value (flattened)
print("Max index:", np.argmax(data)) # 8 - position of max value (flattened)

Output:

Sum of all elements: 45

Sum along axis 0: [12 15 18]
Sum along axis 1: [ 6 15 24]
Mean: 5.0
Standard deviation: 2.581988897471611
Min value: 1
Max value: 9
Min index: 0
Max index: 8

Axis explanation:

axis=0 : operations go down the rows (column-wise)

axis=1 : operations go across the columns (row-wise)
Conditional Checks

arr = np.array([1, 0, 3, 0, 5])

# Check if ANY element meets condition

print("Any non-zero:", np.any(arr)) # True - at least one element is non-zero
# Check if ALL elements meet condition
print("All non-zero:", np.all(arr)) # False - not all elements are non-zero
print("All positive:", np.all(arr > 0)) # False - zeros are not positive

Output:

Any non-zero: True

All non-zero: False
All positive: False

Use cases: Great for data validation - checking if any values are missing, or if all values meet criteria.

5. Broadcasting
Broadcasting allows NumPy to perform operations on arrays with different shapes without explicitly
reshaping them.

Broadcasting Rules

# Rule 1: Scalar broadcasting (scalar works with any array)

arr = np.array([1, 2, 3, 4])
result = arr + 10 # [11, 12, 13, 14] - 10 is added to each element

# Rule 2: Array broadcasting (smaller array is "stretched" to match larger one)

arr_2d = np.array([[1, 2, 3], # 2x3 array
[4, 5, 6]])
arr_1d = np.array([10, 20, 30]) # 1x3 array

# The 1D array is mentally "copied" to match the 2D array:

# [[10, 20, 30],
# [10, 20, 30]]
result = arr_2d + arr_1d # [[11, 22, 33], [14, 25, 36]]
print("Broadcasted result:\n", result)

Output:
Broadcasted result:
[[11 22 33]
[14 25 36]]

Key insight: Broadcasting eliminates the need to manually reshape arrays, making code cleaner and
more efficient.

Real-world Example: Data Normalization

# Sample data: student scores across 3 subjects

scores = np.array([[85, 90, 78], # Student 1: Math, Science, English
[92, 88, 85], # Student 2: Math, Science, English
[78, 85, 90], # Student 3: Math, Science, English
[88, 92, 82]]) # Student 4: Math, Science, English

# Normalize each subject (column) to 0-100 scale

min_scores = np.min(scores, axis=0) # [78, 85, 78] - min score per subject
max_scores = np.max(scores, axis=0) # [92, 92, 90] - max score per subject

# Broadcasting magic: each operation is applied column-wise automatically

# Formula: (score - min) / (max - min) * 100
normalized = (scores - min_scores) / (max_scores - min_scores) * 100
print("Normalized scores:\n", normalized)

Output:

Normalized scores:
[[50. 71.42857143 0. ]
[100. 42.85714286 58.33333333]
[ 0. 0. 100. ]
[71.42857143 100. 33.33333333]]

What happened: Each student's score in each subject is normalized to a 0-100 scale based on the
min/max for that subject. Broadcasting lets us do this in one line instead of looping through subjects.

Exercises - Working with Data

Exercise 1: Given a 2D array of student grades, find all students who scored above 85 in at least 2
subjects.
Exercise 2: Create a function that normalizes an array to have zero mean and unit variance using
broadcasting.

Part 3 – Reshaping & Organizing Data

6. Reshaping & Combining Arrays

Reshaping

# Original array
arr = np.arange(12) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print("Original:", arr)

# Reshape to 2D (must have same total elements: 12 = 3×4)

reshaped = arr.reshape(3, 4) # Convert 1D array to 3 rows, 4 columns
print("Reshaped (3x4):\n", reshaped)

# Reshape to 3D (12 = 2×2×3)

reshaped_3d = arr.reshape(2, 2, 3) # 2 "layers", each with 2 rows and 3 columns
print("Reshaped (2x2x3):\n", reshaped_3d)

# Flatten back to 1D
flattened = reshaped.flatten() # Creates a copy - safe to modify
raveled = reshaped.ravel() # Returns a view if possible - faster but risky
print("Flattened:", flattened)

Output:
Original: [ 0 1 2 3 4 5 6 7 8 9 10 11]
Reshaped (3x4):
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Reshaped (2x2x3):
[[[ 0 1 2]
[ 3 4 5]]

[[ 6 7 8]
[ 9 10 11]]]
Flattened: [ 0 1 2 3 4 5 6 7 8 9 10 11]

Key rule: Total elements must remain the same. Use -1 for auto-calculation: arr.reshape(-1, 4) or
arr.reshape(3, -1) .

Note: Use -1 in reshape to let NumPy calculate the dimension: arr.reshape(-1, 4) or

arr.reshape(3, -1)

Stacking Arrays

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Stacking combines arrays along different axes

# Horizontal stacking - side by side (increases columns)
h_stack = np.hstack([a, b]) # [1, 2, 3, 4, 5, 6] - one long array

# Vertical stacking - on top of each other (increases rows)

v_stack = np.vstack([a, b]) # [[1, 2, 3], [4, 5, 6]] - 2D array

# Depth stacking - creates new dimension (3D array)

d_stack = np.dstack([a, b]) # [[[1, 4], [2, 5], [3, 6]]] - pairs elements

print("Horizontal stack:", h_stack)

print("Vertical stack:\n", v_stack)
print("Depth stack:\n", d_stack)

Output:
Horizontal stack: [1 2 3 4 5 6]
Vertical stack:
[[1 2 3]
[4 5 6]]
Depth stack:
[[[1 4]
[2 5]
[3 6]]]

Memory tip:

hstack = horizontal = side-by-side

vstack = vertical = stacked up
dstack = depth = into the page

Concatenation

arr1 = np.array([[1, 2], [3, 4]])

arr2 = np.array([[5, 6], [7, 8]])

# Concatenate allows more control than stack functions

# axis=0: concatenate along rows (vertically)
concat_0 = np.concatenate([arr1, arr2], axis=0) # Stack vertically

# axis=1: concatenate along columns (horizontally)

concat_1 = np.concatenate([arr1, arr2], axis=1) # Stack horizontally

print("Original arrays:")
print("arr1:\n", arr1)
print("arr2:\n", arr2)
print("Concatenate axis=0 (vertical):\n", concat_0)
print("Concatenate axis=1 (horizontal):\n", concat_1)

Output:
Original arrays:
arr1:
[[1 2]
[3 4]]
arr2:
[[5 6]
[7 8]]
Concatenate axis=0 (vertical):
[[1 2]
[3 4]
[5 6]
[7 8]]
Concatenate axis=1 (horizontal):
[[1 2 5 6]
[3 4 7 8]]

Rule: Arrays must have the same shape in all dimensions except the concatenation axis.

Splitting Arrays

arr = np.arange(12).reshape(4, 3) # Create 4x3 array for demonstration

print("Original array:")
print(arr)

# Split into equal parts along rows (axis=0)

split_arrays = np.split(arr, 2, axis=0) # Split into 2 parts along rows
print("\nSplit into 2 parts along axis=0:")
for i, sub_arr in enumerate(split_arrays):
print(f"Part {i+1}:\n", sub_arr)

# Horizontal and vertical splits (shortcuts)

h_split = np.hsplit(arr, 3) # Split into 3 parts along columns (must divide evenly
v_split = np.vsplit(arr, 2) # Split into 2 parts along rows

print(f"\nHorizontal split into {len(h_split)} arrays of shape {h_split[0].shape}")

print(f"Vertical split into {len(v_split)} arrays of shape {v_split[0].shape}")

Output:
Original array:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]

Split into 2 parts along axis=0:

Part 1:
[[0 1 2]
[3 4 5]]
Part 2:
[[ 6 7 8]
[ 9 10 11]]

Horizontal split into 3 arrays of shape (4, 1)

Vertical split into 2 arrays of shape (2, 3)

Important: The array size along the split axis must be evenly divisible by the number of sections.

7. Sorting & Searching

Sorting

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# Sort array (returns new array)

sorted_arr = np.sort(arr)
print("Sorted:", sorted_arr) # [1, 1, 2, 3, 4, 5, 6, 9]

# Get indices that would sort the array

sort_indices = np.argsort(arr)
print("Sort indices:", sort_indices) # [1, 3, 6, 0, 2, 4, 7, 5]

# Sort 2D array
arr_2d = np.random.randint(1, 10, (3, 4))
sorted_2d = np.sort(arr_2d, axis=1) # Sort each row
print("Original 2D:\n", arr_2d)
print("Sorted 2D:\n", sorted_2d)
Unique Values

arr = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

unique_vals = np.unique(arr)
print("Unique values:", unique_vals) # [1, 2, 3, 4]

# Get counts and indices

unique_vals, counts = np.unique(arr, return_counts=True)
print("Values:", unique_vals) # [1, 2, 3, 4]
print("Counts:", counts) # [1, 2, 3, 4]

Exercises - Reshaping & Organizing

Exercise 1: Create a function that takes a 1D array and reshapes it into a matrix where each row
sums to approximately the same value.

Exercise 2: Given multiple arrays of different shapes, stack them together and then split the result
back into the original arrays.
Part 4 – Data Cleaning & Generation

8. Working with Missing & Special Values

Handling NaN (Not a Number)

# Create array with NaN values

data = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
print("Data with NaN:", data)

# Check for NaN values

print("Is NaN:", np.isnan(data)) # [False, False, True, False, True, False]
print("Any NaN:", np.any(np.isnan(data))) # True

# Remove NaN values

clean_data = data[~np.isnan(data)]
print("Clean data:", clean_data) # [1.0, 2.0, 4.0, 6.0]

# Replace NaN with specific value

filled_data = np.nan_to_num(data, nan=0.0)
print("NaN replaced with 0:", filled_data)

Working with Infinity

# Create array with infinity

data = np.array([1.0, np.inf, 3.0, -np.inf, 5.0])
print("Data with inf:", data)

# Check for infinity

print("Is infinite:", np.isinf(data)) # [False, True, False, True, False]
print("Is positive inf:", np.isposinf(data))
print("Is negative inf:", np.isneginf(data))

# Check for finite values

print("Is finite:", np.isfinite(data)) # [True, False, True, False, True]
9. Random Number Generation
Setting Seeds and Basic Random Generation

# Set seed for reproducibility

np.random.seed(42)

# Uniform distribution [0, 1)

uniform = np.random.rand(3, 3)
print("Uniform random:\n", uniform)

# Normal distribution (mean=0, std=1)

normal = np.random.randn(1000)
print("Normal dist - Mean:", np.mean(normal), "Std:", np.std(normal))

# Random integers
random_ints = np.random.randint(1, 7, size=10) # Dice rolls
print("Dice rolls:", random_ints)

Advanced Random Generation

# Custom normal distribution

custom_normal = np.random.normal(loc=100, scale=15, size=1000) # IQ scores
print("IQ scores - Mean:", np.mean(custom_normal), "Std:", np.std(custom_normal))

# Random choice with probabilities

outcomes = ['A', 'B', 'C']
probabilities = [0.5, 0.3, 0.2]
choices = np.random.choice(outcomes, size=100, p=probabilities)
unique, counts = np.unique(choices, return_counts=True)
print("Choices:", dict(zip(unique, counts)))

# Shuffling
deck = np.arange(52) # Card deck
np.random.shuffle(deck)
print("Shuffled deck (first 10):", deck[:10])

# Permutation (doesn't modify original)

permuted = np.random.permutation(deck[:10])
print("Permuted:", permuted)
Exercises - Data Cleaning & Generation
Exercise 1: Create a dataset with missing values and implement three different strategies to handle
them: removal, mean imputation, and forward fill.

Exercise 2: Generate a synthetic dataset for a linear regression problem with noise and outliers.

Part 5 – Math, Stats & Linear Algebra

10. Statistical & Mathematical Tools

Rounding and Mathematical Functions

data = np.array([1.2345, 2.6789, 3.1416, 4.9999])

print("Round to 2 decimals:", np.round(data, 2))

print("Floor:", np.floor(data)) # [1.0, 2.0, 3.0, 4.0]
print("Ceiling:", np.ceil(data)) # [2.0, 3.0, 4.0, 5.0]
print("Truncate:", np.trunc(data)) # [1.0, 2.0, 3.0, 4.0]

Statistical Functions

# Sample data: test scores

scores = np.array([78, 85, 92, 88, 76, 90, 84, 87, 91, 83])

print("Mean:", np.mean(scores)) # 85.4

print("Median:", np.median(scores)) # 85.5
print("Standard deviation:", np.std(scores)) # 5.24
print("Variance:", np.var(scores)) # 27.44

# Percentiles
print("25th percentile:", np.percentile(scores, 25)) # 81.75
print("75th percentile:", np.percentile(scores, 75)) # 89.25

# Quantiles
print("Quartiles:", np.quantile(scores, [0.25, 0.5, 0.75]))
Cumulative Operations

arr = np.array([1, 2, 3, 4, 5])

print("Cumulative sum:", np.cumsum(arr)) # [1, 3, 6, 10, 15]

print("Cumulative product:", np.cumprod(arr)) # [1, 2, 6, 24, 120]

# Differences
print("Differences:", np.diff(arr)) # [1, 1, 1, 1]

# For time series: calculate returns

prices = np.array([100, 105, 102, 108, 110])
returns = np.diff(prices) / prices[:-1] * 100
print("Returns (%):", returns) # [5.0, -2.86, 5.88, 1.85]

Histograms

# Generate sample data

data = np.random.normal(100, 15, 1000)

# Create histogram
hist, bin_edges = np.histogram(data, bins=10)
print("Histogram counts:", hist)
print("Bin edges:", bin_edges)

# Histogram with custom range

hist_custom, bins_custom = np.histogram(data, bins=20, range=(70, 130))
11. Linear Algebra
Basic Operations

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("Matrix A:\n", A)
print("Matrix B:\n", B)

# Transpose
print("A transpose:\n", A.T)

# Matrix multiplication
print("A @ B:\n", A @ B) # Python 3.5+ syntax
print("np.dot(A, B):\n", np.dot(A, B)) # Traditional syntax

Advanced Linear Algebra

# Square matrix for advanced operations

matrix = np.array([[4, 2], [1, 3]])

# Determinant
det = np.linalg.det(matrix)
print("Determinant:", det) # 10.0

# Inverse
inv = np.linalg.inv(matrix)
print("Inverse:\n", inv)

# Verify: A * A^(-1) = I
identity_check = matrix @ inv
print("A @ A^(-1):\n", np.round(identity_check, 10))

# Eigenvalues and eigenvectors

eigenvals, eigenvecs = np.linalg.eig(matrix)
print("Eigenvalues:", eigenvals)
print("Eigenvectors:\n", eigenvecs)
Solving Linear Systems

# Solve Ax = b
# System: 2x + 3y = 7
# x - y = 1
A = np.array([[2, 3], [1, -1]])
b = np.array([7, 1])

solution = np.linalg.solve(A, b)
print("Solution [x, y]:", solution) # [2.5, 1.5]

# Verify solution
print("Verification:", A @ solution) # Should equal b

Exercises - Math, Stats & Linear Algebra

Exercise 1: Implement Principal Component Analysis (PCA) using NumPy's linear algebra functions.

Exercise 2: Create a function that calculates the correlation matrix for a dataset and finds the most
correlated pair of features.

Part 6 – Advanced Tricks & Applications

12. Advanced Python & NumPy Syntax

Powerful One-Liners and Advanced Techniques
NumPy combined with Python's advanced syntax can create incredibly concise and powerful code.
Here are essential patterns:
List/Array Comprehensions with NumPy

# Traditional approach vs one-liner

data = np.random.randint(1, 100, 20) # 20 random integers between 1-99

# Traditional: multiple lines with Python loop (slower)

positive = []
for x in data:
if x > 50:
positive.append(x ** 2)

# One-liner with list comprehension (faster than loop)

positive_oneliner = [x**2 for x in data if x > 50]

# NumPy vectorized approach (fastest - no Python loop at all!)

positive_numpy = data[data > 50] ** 2

print("Traditional result:", positive)

print("One-liner result:", positive_oneliner)
print("NumPy result:", positive_numpy)
print("All methods produce the same result:", np.array_equal(positive, positive_numpy))

Performance ranking: NumPy vectorized > List comprehension > Traditional loop

Why NumPy wins: It operates at C speed, not Python speed, and avoids creating intermediate
Python objects.
Advanced Indexing Tricks

# Multi-dimensional boolean indexing one-liners

matrix = np.random.randint(1, 10, (5, 5))
print("Original matrix:\n", matrix)

# Find all elements > 5 and replace with their square root (in-place operation)
matrix[matrix > 5] = np.sqrt(matrix[matrix > 5])
print("After replacing >5 with sqrt:\n", matrix)

# Reset matrix for next examples

matrix = np.random.randint(1, 10, (5, 5))

# Advanced trick: Get indices of max value in each row

row_max_indices = np.argmax(matrix, axis=1) # axis=1 means "along columns" (find max in each ro
print("Max value indices per row:", row_max_indices)

# Even more advanced: Get the actual max values using fancy indexing
# This combines np.arange() with the indices to select one element from each row
row_max_values = matrix[np.arange(matrix.shape[0]), row_max_indices]
print("Max values per row:", row_max_values)

What matrix[np.arange(matrix.shape[0]), row_max_indices] does:

np.arange(matrix.shape[0]) creates [0, 1, 2, 3, 4] (row indices)

row_max_indices might be [2, 0, 4, 1, 3] (column indices of max values)
Result: selects matrix[0,2], matrix[1,0], matrix[2,4], matrix[3,1], matrix[4,3]
Conditional Operations and Ternary-like Operations

# NumPy's where() as ternary operator

scores = np.array([85, 92, 78, 95, 88, 76, 91])

# Traditional if-else approach (verbose)

grades = []
for score in scores:
if score >= 90:
grades.append('A')
elif score >= 80:
grades.append('B')
else:
grades.append('C')

# One-liner with numpy.where (nested ternary operations)

# Read as: "where score >= 90, give 'A', otherwise where score >= 80, give 'B', otherwise give
grades_oneliner = np.where(scores >= 90, 'A',
np.where(scores >= 80, 'B', 'C'))

# Even more elegant: using numpy.select for multiple conditions

conditions = [scores >= 90, scores >= 80, scores >= 70] # List of boolean conditions
choices = ['A', 'B', 'C'] # Corresponding grades
grades_select = np.select(conditions, choices, default='F') # Default if no condition is met

print("Scores:", scores)
print("Traditional grades:", grades)
print("One-liner grades:", grades_oneliner)
print("Select grades:", grades_select)

np.select() explanation:

Checks conditions in order: first >=90, then >=80, then >=70

Returns corresponding choice from choices list
Uses default='F' if no conditions are met
Lambda Functions with NumPy

# Complex function that we want to apply to arrays

def complex_function(x):
"""A function with multiple conditions - normally can't be vectorized"""
if x < 0:
return x ** 2 # Square negative numbers
elif x < 10:
return np.sqrt(x) # Square root for small positive numbers
else:
return np.log(x) # Natural log for large numbers

# Problem: This function can't work directly on arrays due to if-else statements
# Solution: Use np.vectorize to make it work element-wise

# Method 1: Vectorize the function

vectorized_func = np.vectorize(complex_function)

# Method 2: One-liner with lambda (more concise)

transform = np.vectorize(lambda x: x**2 if x < 0 else np.sqrt(x) if x < 10 else np.log(x))

# Test data with negative, small positive, and large positive numbers
data = np.array([-5, 2, 15, -3, 8, 25])
result = vectorized_func(data)
result_lambda = transform(data)

print("Original data:", data)

print("Function result:", result)
print("Lambda result:", result_lambda)
print("Results are identical:", np.allclose(result, result_lambda))

When to use np.vectorize() :

When you have complex conditional logic that can't be easily vectorized
When you want to apply a custom function element-wise
Note: It's still essentially a Python loop under the hood, so not as fast as pure NumPy operations
Advanced Array Manipulation One-Liners

# Complex array operations in single lines

data = np.random.randn(100, 5) # 100 samples, 5 features

# Normalize each column to [0, 1] range (Min-Max scaling)

# Formula: (x - min) / (max - min) applied column-wise
normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Get top 3 indices per row (find 3 largest values in each row)
# np.argpartition doesn't fully sort, just partitions around the 3rd largest
top3_indices = np.argpartition(data, -3, axis=1)[:, -3:] # Get last 3 indices

# Sort those top 3 indices by actual values (most advanced line!)

# Step 1: Get the actual values using fancy indexing
# Step 2: Sort those values to get ordering
# Step 3: Apply that ordering to our indices
top3_sorted = np.take_along_axis(
top3_indices, # What to sort
np.argsort(np.take_along_axis(data, top3_indices, axis=1), axis=1)[:, ::-1], # How to sort
axis=1
)

print("Data shape:", data.shape)

print("Normalized range - min:", normalized.min(), "max:", normalized.max())
print("Top 3 indices shape:", top3_indices.shape)
print("First row top 3 indices:", top3_sorted[0])
print("First row top 3 values:", data[0, top3_sorted[0]])
print("Verification - are they sorted?", np.all(np.diff(data[0, top3_sorted[0]]) <= 0))

Breaking down the complex line:

1. np.take_along_axis(data, top3_indices, axis=1) - get the actual top 3 values

2. np.argsort(..., axis=1)[:, ::-1] - get indices that would sort them in descending order
3. np.take_along_axis(top3_indices, ..., axis=1) - rearrange our indices according to that
ordering
Dictionary Comprehensions with NumPy

# Create statistical summaries using dict comprehensions

data_dict = {
'sales': np.random.normal(1000, 200, 100), # Sales data: mean=1000, std=200
'costs': np.random.normal(600, 150, 100), # Cost data: mean=600, std=150
'profit': np.random.normal(400, 100, 100) # Profit data: mean=400, std=100
}

# One-liner: comprehensive statistics for all metrics

# Creates nested dictionary with stats for each business metric
stats = {metric: {'mean': np.mean(values), 'std': np.std(values),
'min': np.min(values), 'max': np.max(values)}
for metric, values in data_dict.items()}

# One-liner: correlation matrix between all metrics

# Only calculates unique pairs (k1 < k2 prevents duplicates like sales vs costs AND costs vs sal
correlation_dict = {f"{k1}_vs_{k2}": np.corrcoef(v1, v2)[0, 1]
for k1, v1 in data_dict.items()
for k2, v2 in data_dict.items() if k1 < k2}

print("Statistics summary:")
for metric, stat in stats.items():
print(f"{metric}: Mean={stat['mean']:.1f}, Std={stat['std']:.1f}")

print("\nCorrelations:")
for pair, corr in correlation_dict.items():
print(f"{pair}: {corr:.3f}")

Dictionary comprehension breakdown:

Outer loop: for metric, values in data_dict.items() - iterate through each business metric
Inner dict: {'mean': np.mean(values), ...} - calculate multiple stats for each metric
Result: Nested dictionary structure for easy access like stats['sales']['mean']
Advanced Generator Expressions with NumPy

# Memory-efficient processing with generators

def process_large_dataset(data_generator, chunk_size=1000):
"""Process large dataset in chunks using generator expressions."""
return np.concatenate([
np.array([x**2 for x in chunk if x > 0]) # One-liner processing
for chunk in (list(data_generator[i:i+chunk_size])
for i in range(0, len(data_generator), chunk_size))
if len([x for x in chunk if x > 0]) > 0 # Filter empty chunks
])

# Simulate large dataset

large_data = np.random.normal(0, 1, 10000)
processed = process_large_dataset(large_data)

print(f"Original size: {len(large_data)}, Processed size: {len(processed)}")

print(f"Sample processed values: {processed[:10]}")
Functional Programming Patterns

from functools import reduce

# Functional approach to data processing

datasets = [np.random.randn(100, 3) for _ in range(5)]

# One-liner: combine multiple datasets and apply transformations

combined_and_processed = reduce(
lambda acc, x: np.vstack([acc, x]) if acc.size > 0 else x,
[dataset[dataset.sum(axis=1) > 0] for dataset in datasets], # Filter rows
np.array([]).reshape(0, 3)
)

# One-liner: apply multiple transformations in sequence

transformations = [
lambda x: x - np.mean(x, axis=0), # Center
lambda x: x / np.std(x, axis=0), # Scale
lambda x: np.clip(x, -3, 3) # Clip outliers
]

final_data = reduce(lambda data, transform: transform(data),

transformations, combined_and_processed)

print(f"Combined shape: {combined_and_processed.shape}")

print(f"Final processed shape: {final_data.shape}")
print(f"Final data stats - Mean: {np.mean(final_data, axis=0)}")
print(f"Final data stats - Std: {np.std(final_data, axis=0)}")
Real-World Example: Advanced Data Pipeline

class DataPipeline:
"""Advanced NumPy data pipeline using functional programming."""

@staticmethod
def create_pipeline(*operations):
"""Create a data processing pipeline."""
return lambda data: reduce(lambda d, op: op(d), operations, data)

@staticmethod
def outlier_removal(threshold=3):
"""Remove outliers beyond threshold standard deviations."""
return lambda data: data[np.abs(data - np.mean(data, axis=0)) <= threshold * np.std(data

@staticmethod
def feature_engineering():
"""Add polynomial features (one-liner)."""
return lambda data: np.column_stack([
data,
data**2, # Quadratic features
np.prod(data[:, :2], axis=1, keepdims=True) if data.shape[1] >= 2 else data[:, [0]]
])

@staticmethod
def normalize():
"""Normalize to unit variance."""
return lambda data: (data - np.mean(data, axis=0)) / np.std(data, axis=0)

# Usage: Create and apply pipeline (one-liner)

raw_data = np.random.normal([10, 20, 30], [2, 5, 3], (1000, 3))

# Create pipeline in one line

pipeline = DataPipeline.create_pipeline(
DataPipeline.outlier_removal(2.5),
DataPipeline.feature_engineering(),
DataPipeline.normalize()
)

# Apply pipeline (one function call)

processed_data = pipeline(raw_data)

print(f"Raw data shape: {raw_data.shape}")

print(f"Processed data shape: {processed_data.shape}")
print(f"Processed mean: {np.mean(processed_data, axis=0)}")
print(f"Processed std: {np.std(processed_data, axis=0)}")

Note: These advanced techniques combine Python's functional programming features with
NumPy's vectorization for maximum efficiency and code elegance.

13. Advanced Tricks

np.meshgrid()

# Create coordinate grids

x = np.linspace(-2, 2, 5)
y = np.linspace(-1, 1, 3)

X, Y = np.meshgrid(x, y)
print("X grid:\n", X)
print("Y grid:\n", Y)

# Useful for plotting functions

Z = X**2 + Y**2 # Distance from origin
print("Z = X² + Y²:\n", Z)

np.tile() & np.repeat()

arr = np.array([1, 2, 3])

# Repeat elements
repeated = np.repeat(arr, 3) # [1, 1, 1, 2, 2, 2, 3, 3, 3]
print("Repeated:", repeated)

# Tile entire array

tiled = np.tile(arr, 3) # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print("Tiled:", tiled)

# 2D tiling
tiled_2d = np.tile(arr, (2, 3)) # 2 rows, 3 repetitions per row
print("Tiled 2D:\n", tiled_2d)
Adding Dimensions with np.newaxis

arr_1d = np.array([1, 2, 3, 4])

# Add new axis

arr_col = arr_1d[:, np.newaxis] # Column vector
arr_row = arr_1d[np.newaxis, :] # Row vector

print("Original shape:", arr_1d.shape) # (4,)

print("Column shape:", arr_col.shape) # (4, 1)
print("Row shape:", arr_row.shape) # (1, 4)

# Useful for broadcasting

result = arr_col + arr_row # (4, 1) + (1, 4) = (4, 4)
print("Broadcasting result shape:", result.shape)
Memory Efficiency Tips & Advanced Memory Management

# Memory-efficient operations
large_array = np.arange(1000000)

# Use views instead of copies when possible

view = large_array[::2] # Every 2nd element (view)
copy = large_array[::2].copy() # Explicit copy

# In-place operations save memory

large_array *= 2 # In-place multiplication
large_array += 1 # In-place addition

# Use appropriate data types

small_ints = np.array([1, 2, 3], dtype=np.int8) # 1 byte per element
large_ints = np.array([1, 2, 3], dtype=np.int64) # 8 bytes per element

print("Small ints memory:", small_ints.nbytes, "bytes")

print("Large ints memory:", large_ints.nbytes, "bytes")

# Advanced: Memory-mapped arrays for huge datasets

# mmap_array = np.memmap('large_data.dat', dtype='float32', mode='w+', shape=(1000000, 100))

# One-liner: Efficient batch processing

def efficient_batch_process(data, batch_size=1000, operation=lambda x: x**2):
"""Process large arrays in memory-efficient batches (one-liner approach)."""
return np.concatenate([
operation(data[i:i+batch_size])
for i in range(0, len(data), batch_size)
])

# Example usage
large_data = np.random.randn(10000)
result = efficient_batch_process(large_data, operation=lambda x: np.where(x > 0, x**2, 0))
print(f"Processed {len(result)} elements efficiently")
13. Practical Applications in Data Science
Advanced Data Science One-Liners

# Advanced feature engineering one-liners

X = np.random.randn(1000, 5)

# Create polynomial features (degree 2) in one line

poly_features = np.column_stack([
X, X**2,
*[X[:, i:i+1] * X[:, j:j+1] for i in range(X.shape[1]) for j in range(i+1, X.shape[1])]
])

# Advanced feature selection using correlation (one-liner)

def select_features_by_correlation(X, y, threshold=0.5):
"""Select features based on correlation with target (one-liner logic)."""
correlations = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
return X[:, np.abs(correlations) > threshold], np.where(np.abs(correlations) > threshold)[0]

# Simulate target variable

y = X.sum(axis=1) + 0.5 * np.random.randn(1000)
selected_X, selected_indices = select_features_by_correlation(X, y, 0.3)

print(f"Original features: {X.shape[1]}")

print(f"Selected features: {selected_X.shape[1]} (indices: {selected_indices})")
print(f"Polynomial features: {poly_features.shape[1]}")
Data Normalization & Scaling

# Sample dataset: house prices

prices = np.array([150000, 300000, 450000, 200000, 350000, 180000])

# Multiple normalization methods in one-liner dictionary

normalizations = {
'min_max': (prices - prices.min()) / (prices.max() - prices.min()),
'z_score': (prices - prices.mean()) / prices.std(),
'robust': (prices - np.median(prices)) / (np.percentile(prices, 75) - np.percentile(prices,
'unit_vector': prices / np.linalg.norm(prices)
}

# Display all normalizations

for method, normalized in normalizations.items():
print(f"{method}: {normalized}")

# Advanced: Custom scaling function (one-liner)

custom_scale = lambda data, target_range=(0, 1): (data - data.min()) / (data.max() - data.min())
scaled_to_10_100 = custom_scale(prices, (10, 100))
print(f"Custom scaled [10, 100]: {scaled_to_10_100}")
One-Hot Encoding

# Categories
categories = np.array(['cat', 'dog', 'bird', 'cat', 'dog', 'bird', 'cat'])

# Get unique categories

unique_cats = np.unique(categories)
print("Unique categories:", unique_cats)

# One-liner one-hot encoding (most efficient)

one_hot_oneliner = (categories[:, None] == unique_cats).astype(int)
print("One-hot encoding (one-liner):\n", one_hot_oneliner)

# Advanced: Handle unknown categories gracefully

def robust_one_hot_encode(data, known_categories=None):
"""One-liner robust one-hot encoding with unknown category handling."""
if known_categories is None:
known_categories = np.unique(data)

# One-liner with unknown category handling

return np.column_stack([
(data == cat).astype(int) for cat in known_categories
] + [np.isin(data, known_categories, invert=True).astype(int)]) # Unknown category column

# Test with new data including unknown category

new_data = np.array(['cat', 'dog', 'fish', 'bird']) # 'fish' is unknown
encoded_robust = robust_one_hot_encode(new_data, unique_cats)
print("Robust encoding with unknown:\n", encoded_robust)
print("Columns: [cat, dog, bird, unknown]")
Distance Calculations

# Points in 2D space
points = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# One-liner Euclidean distance matrix using broadcasting

euclidean_distances = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
print("Euclidean distances:\n", euclidean_distances)

# One-liner Manhattan distance

manhattan_distances = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)
print("Manhattan distances:\n", manhattan_distances)

# One-liner Cosine similarity

def cosine_similarity_oneliner(vectors):
"""Calculate cosine similarity matrix in one line."""
normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
return normalized @ normalized.T

cosine_sim = cosine_similarity_oneliner(points)
print("Cosine similarity:\n", cosine_sim)

# Advanced: Multiple distance metrics at once

distance_metrics = {
'euclidean': lambda p: np.sqrt(((p[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)),
'manhattan': lambda p: np.abs(p[:, None, :] - p[None, :, :]).sum(axis=2),
'chebyshev': lambda p: np.abs(p[:, None, :] - p[None, :, :]).max(axis=2)
}

# Calculate all distances in one go

all_distances = {name: metric(points) for name, metric in distance_metrics.items()}

for name, dist_matrix in all_distances.items():

print(f"\n{name.title()} distances:\n", dist_matrix)
Mini-batch Creation for ML

def create_mini_batches(X, y, batch_size=32, shuffle=True):

"""Create mini-batches for machine learning with advanced one-liner logic."""
n_samples = X.shape[0]

# One-liner shuffling
indices = np.random.permutation(n_samples) if shuffle else np.arange(n_samples)
X_shuffled, y_shuffled = X[indices], y[indices]

# One-liner batch creation using list comprehension

return [(X_shuffled[i:i+batch_size], y_shuffled[i:i+batch_size])
for i in range(0, n_samples, batch_size)]

# Advanced: Stratified mini-batches (maintain class distribution)

def create_stratified_batches(X, y, batch_size=32):
"""Create stratified mini-batches maintaining class distribution."""
unique_classes = np.unique(y)

# One-liner: get indices for each class

class_indices = {cls: np.where(y == cls)[0] for cls in unique_classes}

# Calculate samples per class per batch

samples_per_class = {cls: max(1, len(indices) * batch_size // len(y))
for cls, indices in class_indices.items()}

batches = []
max_batches = min(len(indices) // samples_per_class[cls]
for cls, indices in class_indices.items())

for batch_idx in range(max_batches):

# One-liner: select balanced samples from each class
batch_indices = np.concatenate([
class_indices[cls][batch_idx * samples_per_class[cls]:
(batch_idx + 1) * samples_per_class[cls]]
for cls in unique_classes
])

batches.append((X[batch_indices], y[batch_indices]))

return batches

# Example usage
X = np.random.randn(100, 5) # 100 samples, 5 features
y = np.random.randint(0, 3, 100) # 3 classes

regular_batches = create_mini_batches(X, y, batch_size=16)

stratified_batches = create_stratified_batches(X, y, batch_size=16)

print(f"Regular batches: {len(regular_batches)}")

print(f"Stratified batches: {len(stratified_batches)}")
print(f"First regular batch shapes: X={regular_batches[0][0].shape}, y={regular_batches[0][1].sh
print(f"First stratified batch class distribution: {np.bincount(stratified_batches[0][1])}")
Generating Synthetic Datasets

def generate_regression_data(n_samples=100, n_features=1, noise=0.1, random_state=42):

"""Generate synthetic regression dataset with advanced options."""
np.random.seed(random_state)

X = np.random.randn(n_samples, n_features)
true_coefficients = np.random.randn(n_features)

# One-liner: add polynomial features and noise

y = (X @ true_coefficients +
noise * np.random.randn(n_samples) +
0.1 * np.sum(X**2, axis=1)) # Add non-linear component

return X, y, true_coefficients

def generate_classification_data(n_samples=100, n_features=2, n_classes=2,

class_sep=1.0, random_state=42):
"""Generate synthetic classification dataset with controlled separation."""
np.random.seed(random_state)

# One-liner: create class centers

centers = np.random.randn(n_classes, n_features) * class_sep * 3

# One-liner: generate samples and assign to nearest center

X = np.random.randn(n_samples, n_features)
y = np.array([np.argmin([np.linalg.norm(x - center) for center in centers])
for x in X])

# Add class-specific offsets to improve separation

for class_idx in range(n_classes):
mask = y == class_idx
X[mask] += centers[class_idx] * class_sep

return X, y

# Advanced: Generate time series data

def generate_time_series(n_points=1000, noise_level=0.1, trend=0.001,
seasonality_period=50, random_state=42):
"""Generate synthetic time series with trend, seasonality, and noise."""
np.random.seed(random_state)
t = np.arange(n_points)
# One-liner: combine trend, seasonality, and noise
series = (trend * t +
np.sin(2 * np.pi * t / seasonality_period) +
0.5 * np.sin(2 * np.pi * t / (seasonality_period * 0.3)) + # Higher frequency
noise_level * np.random.randn(n_points))

return t, series

# Generate various datasets

X_reg, y_reg, true_coefs = generate_regression_data(100, 3, noise=0.2)
X_clf, y_clf = generate_classification_data(200, 2, 3, class_sep=1.5)
t, ts_data = generate_time_series(365, noise_level=0.2)

print("Regression data shape:", X_reg.shape, y_reg.shape)

print("Classification data shape:", X_clf.shape, y_clf.shape)
print("Time series shape:", t.shape, ts_data.shape)
print("True coefficients:", true_coefs)
print("Class distribution:", np.bincount(y_clf))

# Advanced: All-in-one dataset generator

def generate_ml_dataset(dataset_type='regression', **kwargs):
"""One-liner dataset generator for different ML tasks."""
generators = {
'regression': generate_regression_data,
'classification': generate_classification_data,
'time_series': generate_time_series
}
return generators[dataset_type](**kwargs)

# Usage examples
datasets = {name: generate_ml_dataset(name, n_samples=100)
for name in ['regression', 'classification']}

print("\nGenerated datasets:")
for name, data in datasets.items():
if name == 'time_series':
print(f"{name}: {len(data)} points")
else:
print(f"{name}: X shape {data[0].shape}, y shape {data[1].shape}")

Exercises - Advanced Applications

Exercise 1: Implement k-means clustering algorithm using only NumPy and advanced one-liners.
Exercise 2: Create a function that performs cross-validation splits for time series data (no random
shuffling).

Exercise 3: Build a complete data preprocessing pipeline using functional programming and one-
liners that handles missing values, outliers, scaling, and feature engineering.

Exercise 4: Implement PCA (Principal Component Analysis) using only NumPy with matrix operations
in a functional style.

Exercise 5: Create an advanced anomaly detection system using statistical methods and NumPy one-
liners.

# Exercise 5 Solution Hint: Advanced Anomaly Detection

def detect_anomalies_advanced(data, methods=['zscore', 'iqr', 'isolation']):
"""Advanced anomaly detection using multiple methods (one-liner approach)."""

detectors = {
'zscore': lambda x: np.abs((x - np.mean(x, axis=0)) / np.std(x, axis=0)) > 3,
'iqr': lambda x: (x < (np.percentile(x, 25, axis=0) - 1.5 *
(np.percentile(x, 75, axis=0) - np.percentile(x, 25, axis=0)))) |
(x > (np.percentile(x, 75, axis=0) + 1.5 *
(np.percentile(x, 75, axis=0) - np.percentile(x, 25, axis=0)))),
'isolation': lambda x: np.random.rand(*x.shape) < 0.05 # Simplified for demo
}

# One-liner: combine multiple detection methods

anomaly_scores = np.mean([detectors[method](data).astype(int)
for method in methods if method in detectors], axis=0)

return anomaly_scores > 0.5 # Majority vote

# Test the anomaly detector

test_data = np.random.normal(0, 1, (1000, 3))
test_data[50:60] = np.random.normal(5, 1, (10, 3)) # Insert anomalies

anomalies = detect_anomalies_advanced(test_data)
print(f"Detected {np.sum(anomalies)} anomalies out of {len(test_data)} samples")
Part 7 – Advanced NumPy Patterns & Best Practices

Advanced Patterns for Production Code

# Pattern 1: Vectorized Operations with Error Handling

def safe_vectorized_operation(data, operation, fill_value=0):
"""Apply operation safely with error handling."""
try:
return np.where(np.isfinite(data), operation(data), fill_value)
except:
return np.full_like(data, fill_value)

# Pattern 2: Memory-Efficient Chunked Processing

def process_in_chunks(data, chunk_size=1000, operation=lambda x: x):
"""Process large arrays in memory-efficient chunks."""
return np.concatenate([
operation(data[i:i+chunk_size])
for i in range(0, len(data), chunk_size)
]) if len(data) > 0 else data

# Pattern 3: Advanced Broadcasting with Shape Validation

def broadcast_safe_operation(a, b, operation=np.add):
"""Perform operations with automatic broadcasting validation."""
try:
return operation(a, b)
except ValueError as e:
# Attempt to fix common broadcasting issues
if a.ndim < b.ndim:
a = a.reshape(*((1,) * (b.ndim - a.ndim)), *a.shape)
elif b.ndim < a.ndim:
b = b.reshape(*((1,) * (a.ndim - b.ndim)), *b.shape)
return operation(a, b)

# Pattern 4: Functional Composition for Data Pipelines

def compose(*functions):
"""Compose multiple functions into a single operation."""
return lambda x: reduce(lambda acc, f: f(acc), functions, x)

# Example: Complex data processing pipeline

pipeline = compose(
lambda x: x[~np.isnan(x).any(axis=1)], # Remove NaN rows
lambda x: (x - np.mean(x, axis=0)) / np.std(x, axis=0), # Standardize
lambda x: np.clip(x, -3, 3) # Clip outliers
)

# Usage
sample_data = np.random.normal(0, 1, (100, 5))
sample_data[10:15] = np.nan # Add some NaN values
processed = pipeline(sample_data)
print(f"Original shape: {sample_data.shape}, Processed shape: {processed.shape}")
Performance Optimization Techniques

# Technique 1: Avoid Python loops with vectorization

def bad_approach(arr):
"""Slow approach using Python loops."""
result = []
for i in range(len(arr)):
if arr[i] > 0:
result.append(arr[i] ** 2)
else:
result.append(0)
return np.array(result)

def good_approach(arr):
"""Fast approach using NumPy vectorization."""
return np.where(arr > 0, arr ** 2, 0)

# Technique 2: Use views instead of copies when possible

def create_view_not_copy(arr):
"""Create views for memory efficiency."""
return arr[::2] # View of every second element

def force_copy_when_needed(arr):
"""Explicitly copy when modification is needed."""
return arr[::2].copy()

# Technique 3: Leverage NumPy's built-in functions

def efficient_statistics(data):
"""Calculate multiple statistics efficiently."""
return {
'mean': np.mean(data, axis=0),
'std': np.std(data, axis=0),
'percentiles': np.percentile(data, [25, 50, 75], axis=0),
'min_max': np.array([np.min(data, axis=0), np.max(data, axis=0)])
}

# Technique 4: Use appropriate data types

def optimize_memory_usage(data):
"""Optimize memory by using appropriate dtypes."""
if np.all(data == data.astype(int)):
if np.all((data >= -128) & (data <= 127)):
return data.astype(np.int8)
elif np.all((data >= -32768) & (data <= 32767)):
return data.astype(np.int16)
else:
return data.astype(np.int32)
else:
return data.astype(np.float32) if np.all(np.abs(data) < 1e10) else data

# Performance comparison
large_data = np.random.randn(1000000)

import time

# Time the approaches

start = time.time()
result_bad = bad_approach(large_data[:1000]) # Smaller subset for timing
time_bad = time.time() - start

start = time.time()
result_good = good_approach(large_data[:1000])
time_good = time.time() - start

print(f"Bad approach time: {time_bad:.6f} seconds")

print(f"Good approach time: {time_good:.6f} seconds")
print(f"Speedup: {time_bad/time_good:.1f}x faster")

Part 8 – Mini Project: Advanced Sales Data Analysis

Let's apply advanced NumPy techniques and one-liners to analyze sales data for a fictional company.
import numpy as np
from functools import reduce

# Set seed for reproducibility

np.random.seed(42)

class AdvancedSalesAnalyzer:
"""Advanced sales data analyzer using NumPy one-liners and functional programming."""

def __init__(self):
self.sales_data = None
self.product_names = None
self.product_prices = None

def generate_advanced_sales_data(self):
"""Generate sophisticated sales data with multiple patterns."""
n_days, n_products = 365, 5

# Product configuration (one-liner dictionary)

product_config = {
'names': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'base_sales': np.array([10, 50, 30, 15, 25]),
'prices': np.array([1200, 25, 75, 400, 80]),
'seasonality': np.array([0.3, 0.1, 0.2, 0.25, 0.4])
}

days = np.arange(n_days)

# Advanced pattern generation using functional composition

pattern_generators = [
lambda d: 1 + product_config['seasonality'][:, None] * np.sin(2 * np.pi * d / 365),
lambda d: 1 - 0.2 * np.isin(d % 7, [5, 6]).astype(float), # Weekend effect
lambda d: 1 + 0.1 * np.sin(2 * np.pi * d / 30), # Monthly cycle
lambda d: 1 + 0.05 * np.sin(2 * np.pi * d / 7) # Weekly cycle
]

# One-liner: combine all patterns

combined_pattern = reduce(
lambda acc, gen: acc * gen(days),
pattern_generators,
np.ones((n_products, n_days))
)
# Generate sales with noise (vectorized)
base_sales_matrix = product_config['base_sales'][:, None]
noise_matrix = np.random.normal(0, base_sales_matrix * 0.2, (n_products, n_days))

# One-liner: final sales calculation

self.sales_data = np.maximum(0, base_sales_matrix * combined_pattern + noise_matrix).T
self.product_names = product_config['names']
self.product_prices = product_config['prices']

return self

def calculate_advanced_metrics(self):
"""Calculate comprehensive metrics using advanced NumPy techniques."""

# One-liner: basic statistics

basic_stats = {
metric: getattr(np, f'{metric}')(self.sales_data, axis=0)
for metric in ['sum', 'mean', 'std', 'min', 'max']
}

# Advanced metrics using functional programming

advanced_metrics = {
'coefficient_of_variation': basic_stats['std'] / basic_stats['mean'],
'revenue_total': basic_stats['sum'] * self.product_prices,
'revenue_share': lambda: (basic_stats['sum'] * self.product_prices) /
np.sum(basic_stats['sum'] * self.product_prices) * 100
}

# Execute lambda functions

for key, value in advanced_metrics.items():
if callable(value):
advanced_metrics[key] = value()

return {basic_stats, advanced_metrics}

def temporal_analysis_advanced(self):
"""Advanced temporal analysis using NumPy broadcasting and vectorization."""

# One-liner: reshape data for temporal grouping

n_days = len(self.sales_data)
daily_totals = np.sum(self.sales_data, axis=1)

# Advanced temporal patterns

temporal_analyses = {
# Weekly analysis (reshape and mean)
'weekly_patterns': np.mean(
self.sales_data[:n_days//7*7].reshape(-1, 7, len(self.product_names)),
axis=0
),

# Monthly analysis (broadcasting)

'monthly_trends': np.array([
np.mean(self.sales_data[i*30:(i+1)*30], axis=0)
for i in range(min(12, n_days//30))
]),

# Moving averages (one-liner with convolution)

'moving_avg_7': np.array([
np.convolve(self.sales_data[:, i], np.ones(7)/7, mode='valid')
for i in range(len(self.product_names))
]).T,

# Volatility analysis
'volatility': np.std(np.diff(self.sales_data, axis=0), axis=0),

# Best/worst days (argmax/argmin)

'best_days': np.argmax(self.sales_data, axis=0),
'worst_days': np.argmin(self.sales_data, axis=0)
}

return temporal_analyses

def correlation_and_clustering_analysis(self):
"""Advanced correlation and clustering analysis."""

# One-liner: correlation matrix

correlation_matrix = np.corrcoef(self.sales_data.T)

# Advanced correlation insights

corr_insights = {
'correlation_matrix': correlation_matrix,
'most_correlated_pair': np.unravel_index(
np.argmax(np.triu(correlation_matrix, k=1)),
correlation_matrix.shape
),
'least_correlated_pair': np.unravel_index(
np.argmin(np.triu(correlation_matrix, k=1) + np.eye(len(correlation_matrix))),
correlation_matrix.shape
)
}

# Simple clustering using correlation distance

corr_distance = 1 - np.abs(correlation_matrix)

return corr_insights, corr_distance

def anomaly_detection_advanced(self):
"""Multi-method anomaly detection."""

# Multiple anomaly detection methods (one-liners)

anomaly_detectors = {
'zscore': lambda x: np.abs((x - np.mean(x, axis=0)) / np.std(x, axis=0)) > 2.5,
'iqr': lambda x: (x < (np.percentile(x, 25, axis=0) - 1.5 *
(np.percentile(x, 75, axis=0) - np.percentile(x, 25, axis=0)))
(x > (np.percentile(x, 75, axis=0) + 1.5 *
(np.percentile(x, 75, axis=0) - np.percentile(x, 25, axis=0)))
'percentile': lambda x: (x < np.percentile(x, 5, axis=0)) | (x > np.percentile(x, 95
}

# One-liner: combine detection methods

anomaly_scores = {
method: detector(self.sales_data).astype(int)
for method, detector in anomaly_detectors.items()
}

# Consensus anomaly detection (majority vote)

consensus_anomalies = np.mean(list(anomaly_scores.values()), axis=0) > 0.5

return anomaly_scores, consensus_anomalies

def predictive_insights(self):
"""Generate predictive insights using trend analysis."""

# Trend analysis using linear regression (NumPy only)

days = np.arange(len(self.sales_data))
X = np.column_stack([np.ones(len(days)), days]) # Design matrix

# One-liner: calculate trends for all products

trends = np.array([
np.linalg.lstsq(X, self.sales_data[:, i], rcond=None)[0][1]
for i in range(len(self.product_names))
])

# Forecast next 30 days (one-liner)

future_days = np.arange(len(self.sales_data), len(self.sales_data) + 30)
forecasts = np.array([
np.mean(self.sales_data[-30:, i]) + trends[i] * np.arange(1, 31)
for i in range(len(self.product_names))
]).T

return {
'trends': trends,
'forecasts': forecasts,
'trend_direction': np.where(trends > 0, 'Increasing', 'Decreasing'),
'forecast_revenue': np.sum(forecasts * self.product_prices, axis=1)
}

def run_complete_analysis(self):
"""Run comprehensive analysis with all advanced techniques."""

# Generate data
self.generate_advanced_sales_data()

# Run all analyses

analyses = {
'basic_metrics': self.calculate_advanced_metrics(),
'temporal': self.temporal_analysis_advanced(),
'correlation': self.correlation_and_clustering_analysis(),
'anomalies': self.anomaly_detection_advanced(),
'predictions': self.predictive_insights()
}

return analyses

def generate_executive_summary(self, analyses):

"""Generate executive summary using advanced NumPy operations."""

metrics = analyses['basic_metrics']
temporal = analyses['temporal']
predictions = analyses['predictions']

# One-liner insights
insights = {
'top_performer': self.product_names[np.argmax(metrics['revenue_total'])],
'most_volatile': self.product_names[np.argmax(metrics['coefficient_of_variation'])],
'best_growth': self.product_names[np.argmax(predictions['trends'])],
'total_revenue': np.sum(metrics['revenue_total']),
'avg_daily_sales': np.mean(np.sum(self.sales_data, axis=1)),
'forecast_revenue_30d': np.sum(predictions['forecast_revenue'])
}

return insights

# Execute the advanced analysis

analyzer = AdvancedSalesAnalyzer()
results = analyzer.run_complete_analysis()
summary = analyzer.generate_executive_summary(results)

# Display results with advanced formatting

print("="*60)
print("ADVANCED SALES ANALYTICS DASHBOARD")
print("="*60)

print(f"\n🏆 TOP PERFORMER: {summary['top_performer']}")

print(f"📈 BEST GROWTH TREND: {summary['best_growth']}")
print(f"⚡ MOST VOLATILE: {summary['most_volatile']}")
print(f"💰 TOTAL REVENUE: ${summary['total_revenue']:,.0f}")
print(f"📊 AVERAGE DAILY SALES: {summary['avg_daily_sales']:.0f} units")
print(f"🔮 30-DAY FORECAST REVENUE: ${summary['forecast_revenue_30d']:,.0f}")

# Advanced correlation insights

corr_insights, _ = results['correlation']
most_corr = corr_insights['most_correlated_pair']
print(f"\n🔗 MOST CORRELATED PRODUCTS: {analyzer.product_names[most_corr[0]]} & {analyzer.produ

# Anomaly summary
_, consensus_anomalies = results['anomalies']
anomaly_days = np.sum(np.any(consensus_anomalies, axis=1))
print(f"🚨 ANOMALY DAYS DETECTED: {anomaly_days} ({anomaly_days/365*100:.1f}% of year)")

# Advanced trend analysis

trends = results['predictions']['trends']
positive_trends = np.sum(trends > 0)
print(f"📈 PRODUCTS WITH POSITIVE TRENDS: {positive_trends}/{len(analyzer.product_names)}")
print("\n" + "="*60)
print("ADVANCED METRICS BY PRODUCT")
print("="*60)

for i, product in enumerate(analyzer.product_names):

metrics = results['basic_metrics']
print(f"\n{product}:")
print(f" Revenue: ${metrics['revenue_total'][i]:,.0f} ({metrics['revenue_share'][i]:.1f}%)"
print(f" Avg Daily Sales: {metrics['mean'][i]:.1f} units")
print(f" Volatility (CV): {metrics['coefficient_of_variation'][i]:.2f}")
print(f" Trend: {results['predictions']['trend_direction'][i]} ({trends[i]:+.3f} units/day)

# Advanced temporal insights

weekly_patterns = results['temporal']['weekly_patterns']
best_weekday = np.argmax(np.sum(weekly_patterns, axis=1))
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

print(f"\n📅 BEST SALES DAY: {weekdays[best_weekday]}")

print(f"📅 WORST SALES DAY: {weekdays[np.argmin(np.sum(weekly_patterns, axis=1))]}")

print("\n" + "="*60)
print("This analysis demonstrates advanced NumPy techniques:")
print("• Functional programming with reduce() and lambda functions")
print("• Advanced broadcasting and vectorization")
print("• One-liner statistical computations")
print("• Multi-dimensional array manipulation")
print("• Advanced indexing and boolean operations")
print("• Linear algebra for trend analysis")
print("• Memory-efficient data processing")
print("="*60)

This advanced mini project demonstrates:

Functional programming patterns with NumPy

Advanced one-liners for complex operations
Broadcasting and vectorization for efficiency
Multi-dimensional array manipulation
Advanced statistical analysis
Predictive modeling using linear algebra
Anomaly detection with multiple methods
Memory-efficient processing techniques
Complex data pipelines using composition
Let's apply multiple NumPy concepts to analyze sales data for a fictional company.
import numpy as np

# Set seed for reproducibility

np.random.seed(42)

# Generate synthetic sales data

def generate_sales_data():
"""Generate 1 year of daily sales data for multiple products."""
n_days = 365
n_products = 5

# Product names
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']

# Base daily sales (units) for each product

base_sales = np.array([10, 50, 30, 15, 25])

# Create seasonal trends (higher sales in certain months)

days = np.arange(n_days)
seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * days / 365)

# Add weekly patterns (lower sales on weekends)

weekly_factor = 1 - 0.2 * np.isin(days % 7, [5, 6]).astype(float)

# Generate sales data with noise

sales_data = np.zeros((n_days, n_products))
for i in range(n_products):
trend = base_sales[i] * seasonal_factor * weekly_factor
noise = np.random.normal(0, base_sales[i] * 0.2, n_days)
sales_data[:, i] = np.maximum(0, trend + noise) # No negative sales

# Product prices
prices = np.array([1200, 25, 75, 400, 80])

return sales_data, products, prices

# Generate the data

sales_units, product_names, product_prices = generate_sales_data()

print("Sales Data Shape:", sales_units.shape)

print("Product Names:", product_names)
print("Product Prices:", product_prices)
# Analysis 1: Basic Statistics
print("\n" + "="*50)
print("BASIC SALES STATISTICS")
print("="*50)

total_units_sold = np.sum(sales_units, axis=0)

avg_daily_sales = np.mean(sales_units, axis=0)
max_daily_sales = np.max(sales_units, axis=0)
std_daily_sales = np.std(sales_units, axis=0)

for i, product in enumerate(product_names):

print(f"{product}:")
print(f" Total units sold: {total_units_sold[i]:.0f}")
print(f" Average daily sales: {avg_daily_sales[i]:.1f}")
print(f" Max daily sales: {max_daily_sales[i]:.0f}")
print(f" Sales volatility (std): {std_daily_sales[i]:.1f}")
print()

# Analysis 2: Revenue Analysis

print("="*50)
print("REVENUE ANALYSIS")
print("="*50)

# Calculate daily revenue for each product

daily_revenue = sales_units * product_prices

# Total revenue per product

total_revenue = np.sum(daily_revenue, axis=0)

# Revenue share
revenue_share = total_revenue / np.sum(total_revenue) * 100

print("Revenue by Product:")
for i, product in enumerate(product_names):
print(f"{product}: ${total_revenue[i]:,.0f} ({revenue_share[i]:.1f}%)")

print(f"\nTotal Revenue: ${np.sum(total_revenue):,.0f}")

# Analysis 3: Temporal Patterns

print("\n" + "="*50)
print("TEMPORAL PATTERNS")
print("="*50)
# Best and worst sales days
total_daily_sales = np.sum(sales_units, axis=1)
best_day = np.argmax(total_daily_sales)
worst_day = np.argmin(total_daily_sales)

print(f"Best sales day: Day {best_day + 1} with {total_daily_sales[best_day]:.0f} units")

print(f"Worst sales day: Day {worst_day + 1} with {total_daily_sales[worst_day]:.0f} units")

# Monthly analysis (group by 30-day periods)

n_months = 12
monthly_sales = np.zeros((n_months, len(product_names)))

for month in range(n_months):

start_day = month * 30
end_day = min((month + 1) * 30, 365)
monthly_sales[month] = np.sum(sales_units[start_day:end_day], axis=0)

# Find best month for each product

best_months = np.argmax(monthly_sales, axis=0)
print("\nBest month for each product:")
for i, product in enumerate(product_names):
print(f"{product}: Month {best_months[i] + 1}")

# Analysis 4: Correlation Analysis

print("\n" + "="*50)
print("PRODUCT CORRELATION ANALYSIS")
print("="*50)

# Calculate correlation matrix

def correlation_matrix(data):
"""Calculate correlation matrix manually using NumPy."""
# Center the data
centered = data - np.mean(data, axis=0)
# Calculate covariance matrix
cov_matrix = (centered.T @ centered) / (data.shape[0] - 1)
# Calculate standard deviations
stds = np.sqrt(np.diag(cov_matrix))
# Calculate correlation matrix
corr_matrix = cov_matrix / (stds[:, np.newaxis] @ stds[np.newaxis, :])
return corr_matrix

corr_matrix = correlation_matrix(sales_units)
print("Correlation Matrix:")
print("Products:", [name[:8] for name in product_names])
for i, row in enumerate(corr_matrix):
print(f"{product_names[i][:8]:8}", [f"{val:6.3f}" for val in row])

# Find most correlated product pairs

corr_upper = np.triu(corr_matrix, k=1) # Upper triangle excluding diagonal
max_corr_idx = np.unravel_index(np.argmax(corr_upper), corr_upper.shape)
max_corr_value = corr_upper[max_corr_idx]

print(f"\nMost correlated products: {product_names[max_corr_idx[0]]} and {product_names[max_corr

print(f"Correlation: {max_corr_value:.3f}")

# Analysis 5: Outlier Detection

print("\n" + "="*50)
print("OUTLIER DETECTION")
print("="*50)

def detect_outliers_iqr(data, multiplier=1.5):

"""Detect outliers using IQR method."""
q25 = np.percentile(data, 25, axis=0)
q75 = np.percentile(data, 75, axis=0)
iqr = q75 - q25

lower_bound = q25 - multiplier * iqr

upper_bound = q75 + multiplier * iqr

outliers = (data < lower_bound) | (data > upper_bound)

return outliers

outliers = detect_outliers_iqr(sales_units)
outlier_days = np.any(outliers, axis=1)

print(f"Number of days with outliers: {np.sum(outlier_days)}")

print("Outliers by product:")
for i, product in enumerate(product_names):
product_outliers = np.sum(outliers[:, i])
print(f"{product}: {product_outliers} outlier days")

# Analysis 6: Performance Metrics

print("\n" + "="*50)
print("PERFORMANCE METRICS")
print("="*50)
# Calculate moving averages (7-day and 30-day)
def moving_average(data, window):
"""Calculate moving average."""
return np.convolve(data, np.ones(window)/window, mode='valid')

# Calculate for total daily sales

ma_7 = moving_average(total_daily_sales, 7)
ma_30 = moving_average(total_daily_sales, 30)

print(f"7-day moving average (last value): {ma_7[-1]:.1f}")

print(f"30-day moving average (last value): {ma_30[-1]:.1f}")

# Growth rate analysis (month-over-month)

monthly_total_sales = np.sum(monthly_sales, axis=1)
growth_rates = np.diff(monthly_total_sales) / monthly_total_sales[:-1] * 100

print("\nMonth-over-month growth rates:")

for i, growth in enumerate(growth_rates):
print(f"Month {i+1} to {i+2}: {growth:+.1f}%")

print(f"Average monthly growth: {np.mean(growth_rates):+.1f}%")

# Summary Report
print("\n" + "="*60)
print("EXECUTIVE SUMMARY")
print("="*60)

print(f"• Total annual revenue: ${np.sum(total_revenue):,.0f}")

print(f"• Average daily sales: {np.mean(total_daily_sales):.0f} units")
print(f"• Most profitable product: {product_names[np.argmax(total_revenue)]}")
print(f"• Most volatile product: {product_names[np.argmax(std_daily_sales)]}")
print(f"• Days with outlier sales: {np.sum(outlier_days)} ({np.sum(outlier_days)/365*100:.1f}%)"
print(f"• Overall growth trend: {np.mean(growth_rates):+.1f}% per month")

This mini project demonstrates:

Array creation and manipulation

Statistical analysis with aggregations
Broadcasting for revenue calculations
Boolean indexing for outlier detection
Custom functions using NumPy operations
Correlation analysis
Time series analysis with moving averages
Data reshaping for temporal grouping

Tutorials and Courses

NumPy Quickstart Tutorial
Scientific Python Lectures
Real Python NumPy Tutorial

Advanced Topics
Advanced NumPy
NumPy for MATLAB Users
Broadcasting in NumPy

Related Libraries
Pandas - Data manipulation and analysis
Matplotlib - Plotting and visualization
Scikit-learn - Machine learning
SciPy - Scientific computing

Books
"Python for Data Analysis" by Wes McKinney
"Numerical Python" by Robert Johansson
"Python Data Science Handbook" by Jake VanderPlas
Final Notes
Congratulations! You've completed the NumPy crash course. You now have the foundational
knowledge to:

Create and manipulate NumPy arrays efficiently

Perform complex mathematical and statistical operations
Handle real-world data science tasks
Optimize your code for performance
Build data pipelines for machine learning

Next Steps:

1. Practice with real datasets

2. Explore Pandas for data manipulation
3. Learn Matplotlib/Seaborn for visualization
4. Dive into Scikit-learn for machine learning
5. Consider TensorFlow/PyTorch for deep learning

Remember: The best way to master NumPy is through practice. Start applying these concepts to your
own projects and datasets!

NumPy Notes
No ratings yet
NumPy Notes
25 pages
Data Science
No ratings yet
Data Science
41 pages
Numpy Introduction
No ratings yet
Numpy Introduction
2 pages
NumPy for Scientific Computing
No ratings yet
NumPy for Scientific Computing
47 pages
Numpy Tutorial Notes (Video 1-5)
No ratings yet
Numpy Tutorial Notes (Video 1-5)
4 pages
Numpy - Introduction
No ratings yet
Numpy - Introduction
118 pages
Mastering NumPy for Data & ML Roles
No ratings yet
Mastering NumPy for Data & ML Roles
39 pages
Num Py
No ratings yet
Num Py
11 pages
DS 4 1 Unit 2
No ratings yet
DS 4 1 Unit 2
60 pages
Numpy I
No ratings yet
Numpy I
99 pages
Numpy Final - Removed
No ratings yet
Numpy Final - Removed
46 pages
Introduction To NumPy
No ratings yet
Introduction To NumPy
5 pages
What Is NumPy
No ratings yet
What Is NumPy
5 pages
09 20241101 NumPy
No ratings yet
09 20241101 NumPy
38 pages
Numpy
No ratings yet
Numpy
18 pages
Numpy User
No ratings yet
Numpy User
486 pages
45B AIML Practical1.1
No ratings yet
45B AIML Practical1.1
57 pages
FUNDAMENTALS OF DATA SCIENCE LAB - Jupyter Notebook
No ratings yet
FUNDAMENTALS OF DATA SCIENCE LAB - Jupyter Notebook
48 pages
Numpy
No ratings yet
Numpy
25 pages
Numpy User
No ratings yet
Numpy User
502 pages
Num Py
No ratings yet
Num Py
10 pages
Numpy User
No ratings yet
Numpy User
565 pages
Python NumPy for Developers
No ratings yet
Python NumPy for Developers
43 pages
NumPy Basics and Array Operations
No ratings yet
NumPy Basics and Array Operations
73 pages
Advanced NumPy Broadcasting Guide
No ratings yet
Advanced NumPy Broadcasting Guide
21 pages
Numpy
No ratings yet
Numpy
28 pages
Scientific Computing
No ratings yet
Scientific Computing
24 pages
NumPy User Guide 2.2.0 Overview
No ratings yet
NumPy User Guide 2.2.0 Overview
659 pages
NumPy: Essential Guide for Python
No ratings yet
NumPy: Essential Guide for Python
28 pages
Machine Learning - Section #3 (Numpy)
No ratings yet
Machine Learning - Section #3 (Numpy)
21 pages
Smart Cutebot Kit (Ef08209) - Elecfreaks Wiki
No ratings yet
Smart Cutebot Kit (Ef08209) - Elecfreaks Wiki
4 pages
Graphical Abstract - Final Version
No ratings yet
Graphical Abstract - Final Version
3 pages
EC 5110 Logic Synthesis and Verification Lecture Notes 10102024
No ratings yet
EC 5110 Logic Synthesis and Verification Lecture Notes 10102024
41 pages
Essential MATLAB and Octave 1st Edition Jesus Rogel-Salazar (Author) 2025 Download Now
No ratings yet
Essential MATLAB and Octave 1st Edition Jesus Rogel-Salazar (Author) 2025 Download Now
150 pages
150pro User Manual
No ratings yet
150pro User Manual
20 pages
Module 6 Enhanced Entity-Relationship Model
No ratings yet
Module 6 Enhanced Entity-Relationship Model
73 pages
Medical Insurance Revenue Cycle Process Approach 9th Edition Valerius HQ File Fast Access
No ratings yet
Medical Insurance Revenue Cycle Process Approach 9th Edition Valerius HQ File Fast Access
310 pages
?????? ???????? ????
No ratings yet
?????? ???????? ????
13 pages
Massachusetts Institute of Technology: (Quiz 1 Solutions - Fall 2010)
No ratings yet
Massachusetts Institute of Technology: (Quiz 1 Solutions - Fall 2010)
6 pages
DT8032 - 8 Channel 500 V - 10 Ma Desktop Power Supply Module (USB - Ethernet - Touchscreen) - CAEN - Tools For Discovery
No ratings yet
DT8032 - 8 Channel 500 V - 10 Ma Desktop Power Supply Module (USB - Ethernet - Touchscreen) - CAEN - Tools For Discovery
3 pages
Paraliparis Amerismos - Wikipedia
No ratings yet
Paraliparis Amerismos - Wikipedia
1 page
Peer-to-Peer Versus Client
No ratings yet
Peer-to-Peer Versus Client
8 pages
Invoice for HP Pavilion Aero Laptop
No ratings yet
Invoice for HP Pavilion Aero Laptop
1 page
SBT-615 Electrical Manual
No ratings yet
SBT-615 Electrical Manual
305 pages
Modbus Rtac
No ratings yet
Modbus Rtac
44 pages
【8DO】MA01-XXCX0080 UserManual EN v1.0
No ratings yet
【8DO】MA01-XXCX0080 UserManual EN v1.0
35 pages
MySQL Client/Server Architecture Overview
No ratings yet
MySQL Client/Server Architecture Overview
18 pages
Code Source Du Shell Madspot
No ratings yet
Code Source Du Shell Madspot
15 pages
Educational Phishing Site Guide
No ratings yet
Educational Phishing Site Guide
16 pages
Revolutionizing 3D Model Generation Using Diffusion-Based Generative AI
No ratings yet
Revolutionizing 3D Model Generation Using Diffusion-Based Generative AI
5 pages
Tabel Ukuran Kertas A
No ratings yet
Tabel Ukuran Kertas A
8 pages
IEEE SRS Format Overview
33% (6)
IEEE SRS Format Overview
1 page
Jonathan Attard FYP2023
No ratings yet
Jonathan Attard FYP2023
77 pages
Aspiring Software Engineer Portfolio
No ratings yet
Aspiring Software Engineer Portfolio
1 page
Microwave Devices Solutions Manual
50% (10)
Microwave Devices Solutions Manual
2 pages
Advanced Web Technology BCA3 Pratical File
No ratings yet
Advanced Web Technology BCA3 Pratical File
16 pages
Modelling Financial Times Series 2nd Edition Stephen J. Taylor Complete Edition
No ratings yet
Modelling Financial Times Series 2nd Edition Stephen J. Taylor Complete Edition
40 pages
CNN, RNN
No ratings yet
CNN, RNN
60 pages
Lecture 3 - Costs and Risks of E-Procurement Adoption
No ratings yet
Lecture 3 - Costs and Risks of E-Procurement Adoption
18 pages