Why NumPy?
Big Idea: NumPy arrays let us do math on whole datasets at once , instead of writing slow Python loops.
Pure Python list:
temps = [15.2 , 18.7 , 22.1 , 19.8 ]
temp_f = []
for t in temps:
temp_f.append(t * 9 / 5 + 32 )
Loop in Python
Manual append logic
Harder to read and optimize
NumPy array:
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 ])
temp_f = temps * 9 / 5 + 32
One line does the math for all elements
Operations implemented in fast C code
Reads like the mathematical formula
Arrays: fixed-size, typed, efficient blocks of numbers
NumPy lets you:
Apply operations to entire arrays (vectorize)
Avoid many explicit loops
Write shorter, clearer, and usually much faster numerical code
Common Error: Forgetting to Import
Predict the output:
temps = np.array([15.2 , 18.7 , 22.1 ])
print (temps)
NameError: name 'np' is not defined
The Fix:
import numpy as np # ALWAYS at the top of your file
temps = np.array([15.2 , 18.7 , 22.1 ])
print (temps)
Takeaway: import numpy as np is the standard convention. Put all imports at the top of your script/notebook.
Creating Arrays
Big Idea: Use NumPy’s constructors to quickly build arrays for real data, ranges, and constant grids.
import numpy as np
# From existing list
temps = np.array([15.2 , 18.7 , 22.1 ])
print (f"From list: { temps} " )
# Range-like sequence
indices = np.arange(0 , 10 , 2 )
print (f"arange: { indices} " )
# Evenly spaced samples
samples = np.linspace(0 , 1 , 5 )
print (f"linspace: { samples} " )
From list: [15.2 18.7 22.1]
arange: [0 2 4 6 8]
linspace: [0. 0.25 0.5 0.75 1. ]
# Constant arrays
zeros = np.zeros(3 )
print (f"zeros: { zeros} " )
ones = np.ones(3 )
print (f"ones: { ones} " )
filled = np.full(3 , 20.5 )
print (f"full: { filled} " )
zeros: [0. 0. 0.]
ones: [1. 1. 1.]
full: [20.5 20.5 20.5]
Takeaway:
np.array for real data you already have
arange / linspace for ranges and sample points
zeros / full for constant grids you will use in calculations
Check Your Understanding 🤔
What’s the difference between arange and linspace?
a = np.arange(0 , 10 , 2 )
b = np.linspace(0 , 10 , 5 )
Answer:
arange(start, stop, step): goes from 0 to 10 by steps of 2 → [0, 2, 4, 6, 8]
linspace(start, stop, num): 5 evenly spaced points from 0 to 10 → [0., 2.5, 5., 7.5, 10.]
Key difference: arange uses step size , linspace uses number of points and includes the endpoint!
When to use which?
arange: When you know the step (e.g., hourly data, every 5 km)
linspace: When you need exact number of samples (e.g., 100 points for smooth plot)
Common Error: Integer Division
Predict the output:
temps_c = np.array([20 , 25 , 30 ]) # integers!
temps_f = temps_c * 9 / 5 + 32
print (f"Result: { temps_f} " )
print (f"dtype: { temps_f. dtype} " )
Result: [68. 77. 86.]
dtype: float64
Wait, this worked! Why?
In Python 3, / always returns float. But watch out for this:
temps_c = np.array([20 , 25 , 30 ])
# Using integer division by mistake
result = temps_c // 5 # // is integer division!
print (f"Wrong: { result} " )
Takeaway: Be mindful of your dtypes. When in doubt, create float arrays: np.array([20.0, 25.0, 30.0]) or temps_c.astype(float)
Array Attributes: dtype, shape, ndim
Big Idea: Check an array’s dtype, shape, and ndim early. It saves you from weird bugs later.
# 1-D array
temps = np.array([15.2 , 18.7 , 22.1 ])
print (f"dtype: { temps. dtype} " )
print (f"shape: { temps. shape} " )
print (f"ndim: { temps. ndim} " )
dtype: float64
shape: (3,)
ndim: 1
# 2-D array
data = np.array([[1 , 2 , 3 ], [4 , 5 , 6 ]])
print (f"dtype: { data. dtype} " )
print (f"shape: { data. shape} " )
print (f"ndim: { data. ndim} " )
dtype: int64
shape: (2, 3)
ndim: 2
dtype – data type of the array
e.g. float64, int32, bool
Watch out if you accidentally create int when you want float
shape – size of the array in each dimension
1-D: (5,) (5 elements)
2-D: (2,3) (2 rows, 3 columns)
ndim – number of dimensions
1-D vector: ndim == 1
2-D matrix: ndim == 2
When something crashes or broadcasts strangely, first print:
array.dtype, array.shape, array.ndim
Visual: Understanding Shape
1-D array: shape = (5,)
────────────────────────────────────────
[15.2, 18.7, 22.1, 19.8, 16.5]
↑ ↑ ↑ ↑ ↑
idx 0 1 2 3 4
2-D array: shape = (3, 4) means 3 rows, 4 columns
────────────────────────────────────────
Col 0 Col 1 Col 2 Col 3
Row 0 [ 15 18 22 19 ]
Row 1 [ 14 17 21 18 ]
Row 2 [ 16 19 23 20 ]
↑
First index = row
Second index = column
Atmospheric example: 10 stations, 24 hours
────────────────────────────────────────
shape = (10, 24)
↑ ↑
stations hours
Key concepts:
1-D array: shape = (5,) → 5 elements in a line
2-D array: shape = (3, 4) → 3 rows, 4 columns
Think: “rows first, then columns” (like matrix notation)
Stations × Time: If you have 10 stations and 24 hours, shape is (10, 24)
Check Your Understanding 🤔
What will be the shape?
# 5 weather stations, 48 hours of data each
temps = np.random.randn(5 , 48 )
Answer: shape = (5, 48)
First dimension: 5 stations
Second dimension: 48 hours
Total elements: 5 × 48 = 240
Now predict: What’s the shape of temps[0, :]?
(48,) — a 1-D array of 48 hours for station 0
Indexing and Slicing
Big Idea: NumPy indexing feels like list indexing, but works in multiple dimensions and stays fast.
1-D arrays:
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 , 16.5 ])
# Single element
print (f"First: { temps[0 ]} " )
print (f"Last: { temps[- 1 ]} " )
# Slicing
print (f"First 3: { temps[:3 ]} " )
print (f"Last 2: { temps[- 2 :]} " )
print (f"Every other: { temps[::2 ]} " )
First: 15.2
Last: 16.5
First 3: [15.2 18.7 22.1]
Last 2: [19.8 16.5]
Every other: [15.2 22.1 16.5]
N-D arrays:
data = np.array([[1 , 2 , 3 ],
[4 , 5 , 6 ]])
# Single element
print (f"Row 0, Col 1: { data[0 , 1 ]} " )
# Slicing
print (f"First row: { data[0 , :]} " )
print (f"Second column: \n { data[:, 1 ]} " )
Row 0, Col 1: 2
First row: [1 2 3]
Second column:
[2 5]
Key points:
1-D: temps[start:stop:step] just like lists
N-D: array[row_index, col_index] and array[row_slice, col_slice]
Slices are views into the original data (no copy in most cases)
Common Error: Wrong Dimension Indexing
Predict the output:
data = np.array([[1 , 2 , 3 ],
[4 , 5 , 6 ]])
print (data[1 ]) # What does this return?
print (data[:, 1 ]) # What about this?
data = np.array([[1 , 2 , 3 ], [4 , 5 , 6 ]])
print (f"data[1]: { data[1 ]} " ) # Second ROW
print (f"data[:, 1]: { data[:, 1 ]} " ) # Second COLUMN
data[1]: [4 5 6]
data[:, 1]: [2 5]
Common mistake: Forgetting that data[1] gives you a row , not a column!
To get a column, you need data[:, 1] (all rows, column 1)
Boolean Masks
Big Idea: Comparisons create boolean arrays that you can use to filter values. Masks replace manual if loops.
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 , 16.5 ])
# Create boolean mask
mask = (temps >= 15 ) & (temps <= 22 )
print (f"Mask: { mask} " )
# Filter using mask
comfortable_temps = temps[mask]
print (f"Comfortable temps: { comfortable_temps} " )
Mask: [ True True False True True]
Comfortable temps: [15.2 18.7 19.8 16.5]
# Count how many meet condition
count = np.sum (mask)
print (f"Number of comfortable temps: { count} " )
Number of comfortable temps: 4
Takeaway:
(temps >= 15) returns a boolean array
& combines conditions elementwise (use &, not and!)
temps[mask] selects only the elements where mask is True
Instead of looping and if, build a mask once and index with it
Try It Yourself 💻
Challenge: Given this temperature data, find:
All temperatures above 20°C
How many hours were between 15-25°C
The indices where temp > 22°C (hint: np.where)
hourly_temps = np.array([...]) # 24 hours of data
Solution:
# 1. Temps above 20
hot = hourly_temps[hourly_temps > 20 ]
print (f"Hot hours: { hot[:5 ]} ..." ) # show first 5
# 2. Count between 15-25
comfortable = np.sum ((hourly_temps >= 15 ) & (hourly_temps <= 25 ))
print (f"Comfortable hours: { comfortable} " )
# 3. Indices where > 22
indices = np.where(hourly_temps > 22 )[0 ]
print (f"Hot hour indices: { indices} " )
Hot hours: [21.89329683 24.70023953 21.97111934 22.42714652 26.14676948]...
Comfortable hours: 15
Hot hour indices: [3 5 6 7 9]
Common Error: Using ‘and’ Instead of ‘&’
Predict the output:
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 , 16.5 ])
mask = (temps >= 15 ) and (temps <= 22 ) # Wrong!
ValueError: The truth value of an array with more than one element is ambiguous.
The Fix:
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 , 16.5 ])
mask = (temps >= 15 ) & (temps <= 22 ) # Correct!
print (f"Mask: { mask} " )
Mask: [ True True False True True]
Why?
and is for single boolean values: True and False
& is for element-wise array operations
Always use & (and | for OR) with NumPy arrays
Don’t forget parentheses: (temps >= 15) & (temps <= 22)
Vectorized Operations & Broadcasting
Big Idea: NumPy applies the same formula to whole arrays at once. Scalars and smaller arrays are broadcast to match shapes.
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 ])
# Convert to Fahrenheit
temp_f = temps * 9 / 5 + 32
print (f"°F: { temp_f} " )
# Subtract baseline
baseline = 15
anomaly = temps - baseline
print (f"Anomaly: { anomaly} " )
°F: [59.36 65.66 71.78 67.64]
Anomaly: [0.2 3.7 7.1 4.8]
# Element-wise operations
temps_squared = temps ** 2
print (f"Squared: { temps_squared} " )
# Works with functions too
temps_rounded = np.round (temps, 1 )
print (f"Rounded: { temps_rounded} " )
Squared: [231.04 349.69 488.41 392.04]
Rounded: [15.2 18.7 22.1 19.8]
Key concepts:
Arithmetic (+, -, *, /, **) is elementwise on arrays
Scalars are broadcast automatically to match array shape
You write the math once; NumPy handles the loops in fast C code
Think in formulas on arrays, not in explicit Python for loops
Common Error: Shape Mismatch
Predict the output:
temps = np.array([15 , 20 , 25 , 30 ]) # shape (4,)
wind = np.array([5 , 10 , 15 ]) # shape (3,)
result = temps + wind
ValueError: operands could not be broadcast together with shapes (4,) (3,)
Why? Arrays must have compatible shapes for broadcasting!
Debugging strategy:
temps = np.array([15 , 20 , 25 , 30 ])
wind = np.array([5 , 10 , 15 ])
print (f"temps.shape: { temps. shape} " )
print (f"wind.shape: { wind. shape} " )
# They must match or one must be 1!
temps.shape: (4,)
wind.shape: (3,)
The fix: Make sure your arrays have the same length, or reshape one of them
Array Statistics
Big Idea: NumPy has built-in “reductions” (mean, std, min, max, etc.) that keep your analysis code short and clear.
Full array:
temps = np.array([15.2 , 18.7 , 22.1 , 19.8 , 16.5 ])
print (f"Mean: { temps. mean():.1f} " )
print (f"Std: { temps. std():.1f} " )
print (f"Min: { temps. min ():.1f} " )
print (f"Max: { temps. max ():.1f} " )
print (f"Sum: { temps. sum ():.1f} " )
Mean: 18.5
Std: 2.4
Min: 15.2
Max: 22.1
Sum: 92.3
Axis example:
# 2D array: 3 stations, 4 times
data = np.array([[15 , 18 , 22 , 19 ],
[14 , 17 , 21 , 18 ],
[16 , 19 , 23 , 20 ]])
# Mean across time (axis=1)
station_means = data.mean(axis= 1 )
print (f"Station means: { station_means} " )
# Mean across stations (axis=0)
time_means = data.mean(axis= 0 )
print (f"Time means: { time_means} " )
Station means: [18.5 17.5 19.5]
Time means: [15. 18. 22. 19.]
Key points:
Reductions turn many values → one (or one per row/column)
axis=None (default) flattens everything
axis=0 works “down” rows, axis=1 works “across” columns
Using these methods avoids writing your own loops and counters
Visual: Understanding axis=0 vs axis=1
data = [[15, 18, 22, 19], shape (3, 4)
[14, 17, 21, 18], 3 stations (rows)
[16, 19, 23, 20]] 4 times (columns)
axis=0: collapse ROWS (↓) axis=1: collapse COLUMNS (→)
↓ ↓ ↓ ↓ → → → →
[15, 18, 22, 19] [18.5] ← mean of row 0
[17.5] ← mean of row 1
Result: (4,) [19.5] ← mean of row 2
mean per TIME Result: (3,)
mean per STATION
Mnemonic: axis=0 → “collapse dimension 0 (rows)”
axis=1 → “collapse dimension 1 (columns)”
Check Your Understanding 🤔
Given this data:
# 4 stations, 24 hours
temps = np.random.randn(4 , 24 ) * 5 + 20
print (f"Shape: { temps. shape} " )
What will be the shape of:
temps.mean(axis=0)
temps.mean(axis=1)
temps.mean()
Answers:
(24,) — mean across stations, one value per hour
(4,) — mean across time, one value per station
Scalar — mean of all values (flattened)
Common Error: Wrong Axis
Scenario: You want the mean temperature for each station (averaged over time)
# 4 stations, 24 hours of data
temps = np.random.randn(4 , 24 ) * 5 + 20
# Which is correct?
option_a = temps.mean(axis= 0 )
option_b = temps.mean(axis= 1 )
Answer: temps.mean(axis=1)
temps = np.random.randn(4 , 24 ) * 5 + 20
station_means = temps.mean(axis= 1 )
print (f"Shape: { station_means. shape} " ) # (4,) ✓
print (f"Station means: { station_means} " )
Shape: (4,)
Station means: [19.55167332 21.05860409 20.14683389 21.58417493]
Why? axis=1 collapses the time dimension, leaving station dimension
Strategy: Always check the output shape! It should match what you expect.
Try It Yourself 💻
Challenge: Given temperature and wind speed data:
Calculate wind chill: WC = 13.12 + 0.6215*T - 11.37*V^0.16 + 0.3965*T*V^0.16 (T = temp in °C, V = wind speed in km/h)
Find how many hours have wind chill below 0°C
Calculate the mean and standard deviation of wind chill
Solution:
# 1. Calculate wind chill (vectorized!)
T = temps
V = wind_speed
wind_chill = 13.12 + 0.6215 * T - 11.37 * V** 0.16 + 0.3965 * T* V** 0.16
# 2. Count below 0
below_zero = np.sum (wind_chill < 0 )
print (f"Hours with WC < 0°C: { below_zero} " )
# 3. Statistics
print (f"Mean WC: { wind_chill. mean():.1f} °C" )
print (f"Std WC: { wind_chill. std():.1f} °C" )
Hours with WC < 0°C: 24
Mean WC: -6.9°C
Std WC: 3.8°C
Why Visualize?
Anscombe’s Quartet: Four datasets with identical statistics but very different patterns
# All four datasets have:
mean_x = 9.0
mean_y = 7.5
correlation = 0.816
Without plotting, they look the same!
With plotting:
Dataset 1: Linear relationship
Dataset 2: Quadratic curve
Dataset 3: Linear with outlier
Dataset 4: Vertical line with outlier
Always visualize your data!
Lesson: Statistics alone can mislead. Plots reveal the truth.
Common Error: Forgetting Import
Predict the output:
hours = np.arange(0 , 24 )
temps = 15 + 8 * np.sin(hours * np.pi / 12 )
plt.plot(hours, temps)
plt.show()
NameError: name 'plt' is not defined
The Fix:
import matplotlib.pyplot as plt # Standard convention
import numpy as np
hours = np.arange(0 , 24 )
temps = 15 + 8 * np.sin(hours * np.pi / 12 )
plt.plot(hours, temps)
plt.show()
Takeaway: Always import matplotlib.pyplot as plt at the top of your file
Plotting Recipe
5-step pattern for every plot:
import matplotlib.pyplot as plt
# 1. Prepare x and y data
hours = np.arange(0 , 24 , 1 )
temps = 15 + 8 * np.sin((hours - 6 ) * np.pi / 12 )
# 2. Plot
plt.plot(hours, temps, marker= 'o' , color= 'steelblue' , linewidth= 2 )
# 3. Label axes
plt.xlabel('Hour of Day' )
plt.ylabel('Temperature (°C)' )
# 4. Add title
plt.title('Simulated Daily Temperature Cycle' )
# 5. Turn on grid and show
plt.grid(True , alpha= 0.3 )
plt.show()
Takeaway: Follow this pattern every time. Good labels turn quick plots into report-ready figures.
Common Error: Mismatched Array Lengths
Predict the output:
hours = np.arange(0 , 24 ) # 24 elements
temps = np.array([15 , 18 , 22 ]) # 3 elements
plt.plot(hours, temps)
ValueError: x and y must have same first dimension, but have shapes (24,) and (3,)
Why? plt.plot(x, y) expects x and y to have the same length!
Debugging strategy:
print (f"hours.shape: { hours. shape} " )
print (f"temps.shape: { temps. shape} " )
# They must match!
The fix: Make sure your x and y arrays have the same number of elements
Bad Plot Example
What’s wrong with this plot?
hours = np.arange(0 , 24 )
temps = 15 + 8 * np.sin((hours - 6 ) * np.pi / 12 )
plt.plot(hours, temps)
plt.show()
Problems:
No axis labels — what am I looking at?
No title — what does this represent?
No units — is temperature in °C or °F?
No grid — hard to read values
Never submit a plot like this!
Good Plot Example
Same data, better communication:
hours = np.arange(0 , 24 )
temps = 15 + 8 * np.sin((hours - 6 ) * np.pi / 12 )
plt.plot(hours, temps, marker= 'o' , color= 'steelblue' , linewidth= 2 )
plt.xlabel('Hour of Day' , fontsize= 12 )
plt.ylabel('Temperature (°C)' , fontsize= 12 )
plt.title('Simulated Daily Temperature Cycle - Boulder, CO' , fontsize= 14 )
plt.grid(True , alpha= 0.3 )
plt.tight_layout()
plt.show()
Improvements:
Clear axis labels with units
Descriptive title with location
Grid for reading values
Markers to show data points
Professional appearance
Scatter and Bar Plots
Big Idea: Different plot types = different stories
Scatter Plot:
temp = np.array([15 , 18 , 22 , 19 , 16 ])
pressure = np.array([1010 , 1012 , 1008 , 1011 , 1013 ])
plt.scatter(temp, pressure, s= 100 , alpha= 0.6 )
plt.xlabel('Temperature (°C)' )
plt.ylabel('Pressure (hPa)' )
plt.title('Temp vs. Pressure' )
plt.grid(True , alpha= 0.3 )
plt.show()
Compare two continuous variables
Look for relationships / patterns
Bar Plot:
stations = ['Boulder' , 'Denver' , 'Vail' ]
mean_temps = [18.5 , 20.2 , 12.8 ]
plt.bar(stations, mean_temps, color= 'coral' )
plt.xlabel('Station' )
plt.ylabel('Mean Temp (°C)' )
plt.title('Mean Temperature by Station' )
plt.grid(True , alpha= 0.3 , axis= 'y' )
plt.show()
Compare categories or totals
Good for “which is bigger?” questions
Match your plot type to the question you’re asking
Check Your Understanding 🤔
Which plot type would you use for:
Comparing average precipitation across 12 months?
Exploring relationship between humidity and temperature?
Showing temperature change over 24 hours?
Answers:
Bar plot — comparing categories (months)
Scatter plot — relationship between two continuous variables
Line plot — showing change over time (continuous)
Key: Think about what question you’re answering!
Multiple Lines
Comparing multiple datasets:
hours = np.arange(0 , 24 , 1 )
boulder = 15 + 8 * np.sin((hours - 6 ) * np.pi / 12 )
denver = 17 + 7 * np.sin((hours - 6 ) * np.pi / 12 )
vail = 10 + 6 * np.sin((hours - 6 ) * np.pi / 12 )
plt.plot(hours, boulder, marker= 'o' , linestyle= '-' , label= 'Boulder' , linewidth= 2 )
plt.plot(hours, denver, marker= 's' , linestyle= '--' , label= 'Denver' , linewidth= 2 )
plt.plot(hours, vail, marker= '^' , linestyle= '-.' , label= 'Vail' , linewidth= 2 )
plt.xlabel('Hour of Day' )
plt.ylabel('Temperature (°C)' )
plt.title('Temperature Comparison: Colorado Stations' )
plt.legend(loc= 'best' )
plt.grid(True , alpha= 0.3 )
plt.show()
Use: Different colors, markers, and linestyles + legend to label each line
Bad pattern: three identical blue lines with no legend — your viewer will be confused!
Common Error: Missing Legend Labels
Predict: Can you tell which line is which?
hours = np.arange(0 , 24 )
boulder = 15 + 8 * np.sin((hours - 6 ) * np.pi / 12 )
denver = 17 + 7 * np.sin((hours - 6 ) * np.pi / 12 )
plt.plot(hours, boulder) # No label!
plt.plot(hours, denver) # No label!
plt.xlabel('Hour of Day' )
plt.ylabel('Temperature (°C)' )
plt.title('Temperature Comparison' )
plt.show()
Problem: Which line is Boulder? Which is Denver? Impossible to tell!
The fix: Always use label= and plt.legend()
Subplots & Layouts
Multiple panels in one figure:
hours = np.arange(0 , 24 , 1 )
temps = 15 + 8 * np.sin((hours - 6 ) * np.pi / 12 )
pressure = 1010 + 3 * np.cos((hours - 6 ) * np.pi / 12 )
fig, (ax_temp, ax_press) = plt.subplots(2 , 1 , figsize= (9 , 5 ), sharex= True )
ax_temp.plot(hours, temps, color= 'red' , marker= 'o' )
ax_temp.set_ylabel('Temperature (°C)' )
ax_temp.set_title('Temperature' )
ax_temp.grid(True , alpha= 0.3 )
ax_press.plot(hours, pressure, color= 'blue' , marker= 's' )
ax_press.set_xlabel('Hour of Day' )
ax_press.set_ylabel('Pressure (hPa)' )
ax_press.set_title('Pressure' )
ax_press.grid(True , alpha= 0.3 )
plt.tight_layout()
plt.show()
Key points:
plt.subplots(nrows, ncols) creates a grid of axes
fig = whole figure, ax_temp, ax_press = individual panels
sharex=True keeps same x-axis for both
Use ax.plot() instead of plt.plot() for subplots
Try It Yourself 💻
Challenge: Create a 2×2 subplot showing:
Top-left: Line plot of temperature over 24 hours
Top-right: Scatter of temp vs pressure
Bottom-left: Bar plot of mean temps for 3 stations
Bottom-right: Histogram of all temperature values
Hint:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2 , 2 , figsize= (10 , 8 ))
ax1.plot(...)
ax2.scatter(...)
ax3.bar(...)
ax4.hist(...)
plt.tight_layout()