Pedagogical Design: xarray Week 5

ATOC 4815 - Built from Scratch with Proven Framework

📊 Design Statistics

Created: 67 slides, ~1,900 lines Active Learning Exercises: 4 hands-on “Try It Yourself” Error Examples: 6 explicit error scenarios with fixes “Check Your Understanding” moments: 3 formative assessments Decision Guides: 3 metacognitive “when to use” sections

Built from scratch using the proven pedagogical framework from Weeks 1-4

✅ Pedagogical Framework Applied

1. Real-World Scenario Motivation 🌍

Design Decision: Don’t start with “here’s xarray syntax”—start with “here’s a problem pandas and NumPy can’t solve”

Created Research Scenario:

Imagine: You're analyzing the 2023 North American heat wave using ERA5 reanalysis

Your data:
- 4D gridded dataset: temperature(time, level, lat, lon)
- Spatial: North America (15°N-70°N, 130°W-60°W)
- Temporal: 45 years daily (16,425 days)
- Vertical: 37 pressure levels
- Grid: 0.25° × 0.25° (~25 km)
- File size: ~50 GB

Questions:
1. Max temperature in Boulder summer 2023?
2. How does 2023 compare to 1991-2020 climatology?
3. Vertical temperature profile during heat wave?
4. Where was heat wave most intense?
5. Has frequency increased over 45 years?

Impact: Immediately establishes that they need a new tool—this data can’t be handled with pandas or numpy alone.

2. Show Why Previous Tools Fail ⚠️

Problem: Students might think “Why not just use pandas?”

Solution: Explicit “Why Pandas Falls Short” and “Why NumPy Falls Short” slides

Pandas Limitations (Slide 6):

2D only - Can’t represent 4D (time × level × lat × lon)
No dimension concept - Which axis is which?
No coordinate-based selection - Need index math for “500 hPa at Boulder”

NumPy Limitations (Slide 7):

No dimension labels - Is axis=1 pressure or latitude?
No coordinate values - Manual argmin to find nearest point
Metadata loss - After slicing, what does the data represent?

Code Examples:

# Pandas fail
df = pd.DataFrame(temperature_data)  # ❌ How to structure 4D?
mean_temp = df.mean(axis=2)  # ❌ Was that lat or lon?

# NumPy fail
temp.mean(axis=1)  # ❌ Is axis=1 pressure levels or latitude?
lat_idx = np.argmin(np.abs(lat_array - 40.0))  # ❌ Manual index math!

Impact: Students understand the problem before seeing the solution

3. Error-Driven Learning 🐛

Created 6 explicit error scenarios:

Error 1: Dimension Name Mismatch (Slide 23)

temp.sel(latitude=40)  # ❌ Dimension is 'lat', not 'latitude'

Fix: Always check print(data.dims) first

Error 2: isel vs sel Confusion (Slide 36)

ds.sel(time=0)  # ❌ 0 is an index, not a date!

Fix: Index number → .isel(), Coordinate value → .sel()

Error 3: Wrong Dimension Name in Reduction (Slide 41)

temp.mean(dim='times')  # ❌ Typo: should be 'time'

Fix: Use tab completion for dimension names

Error 4: Forgetting to Close Files (Slide 32)

ds = xr.open_dataset('ERA5_50GB.nc')
os.remove('ERA5_50GB.nc')  # ❌ File still open!

Fix: Use context manager with xr.open_dataset(...) as ds:

Error 5: Using .values Too Early (Slide 52)

temp_array = temp.values  # ❌ Loses all metadata!
subset = temp_array.sel(lat=40)  # ❌ Won't work—just NumPy array!

Fix: Keep as xarray until the last possible moment

Error 6: Forgetting method=’nearest’ (Slide 34)

temp.sel(lat=40.015, lon=-105.2705)  # ❌ Exact match may not exist

Fix: Always use method='nearest' for spatial selection

Format: “Predict the output” → Error message → Explanation → The Fix

Impact: Students see and learn from errors before encountering them in homework

4. Active Learning Exercises 💻

Exercise 1: Tool Selection Quiz (Slide 13)

Which tool for each task?
ERA5 temperature(time, level, lat, lon) → xarray
Single station hourly time series → Pandas
Matrix multiplication → NumPy
Climate model temp(time, ensemble, lat, lon) → xarray
CSV with station metadata → Pandas

Exercise 2: Create First DataArray (Slide 26)

# With your neighbor (5 min):
# Create DataArray for wind speed
# Dimensions: time (3 days), lat (2), lon (2)
# Boulder coordinates: lat=[40, 40.5], lon=[-105, -104.5]

Exercise 3: Practice Selection Methods (Slide 38)

# Tasks:
# 1. Select last time step using .isel()
# 2. Select all data from January 3rd using .sel()
# 3. Select temp at lat=40°N, lon=-105°W (nearest)
# 4. Select time slice: January 2-5
# 5. What happens if you try ds.sel(time=5)? Why?

Exercise 4: Heat Wave Detector (Slide 57)

# Final Challenge: Detect and visualize a heat wave
# 1. Compute time mean for each grid point
# 2. Find location with highest mean temp
# 3. Extract time series at that location
# 4. Compute anomaly from overall mean
# 5. Plot time series with anomaly highlighted
# 6. Save processed data to NetCDF

Impact: Retrieval practice every 10-15 slides; forces active engagement

5. Metacognitive “When to Use Each Tool” Guidance 🧠

Decision Guide 1: Mental Model Progression (Slide 9)

NumPy:   "Calculator for N-D arrays"
         ✅ Fast math, any dimensions
         ❌ No dimension names, no coordinates

Pandas:  "Spreadsheet with labels"
         ✅ Named columns, time indexing
         ❌ Only 2D

xarray:  "Pandas for N-D grids"
         ✅ Named dimensions
         ✅ Coordinate-based selection
         ✅ Metadata preservation

When to use:

isel: “First 10 time steps”, “every 3rd latitude”
sel: “Data at 500 hPa”, “January 2024”

Impact: Students know WHEN to use each tool, not just HOW

6. Progressive Scaffolding 📈

Carefully designed progression:

Stage	Slides	Content
1. Motivation	4-13	Real scenario, tool limitations, mental model
2. Fundamentals	14-28	DataArray, Dataset, anatomy, errors
3. I/O	29-33	Opening NetCDF, exploring structure, file management
4. Selection	34-39	isel vs sel, nearest neighbor, practice
5. Operations	40-48	Reductions, groupby, climatologies, anomalies
6. Plotting	49-52	Built-in plotting, customization, multi-panel
7. Advanced	53-57	Dask/lazy loading, multi-file, workflows
8. Synthesis	58-67	Best practices, error checklist, resources

Impact: Each section builds on previous; no conceptual leaps

7. Visual & Conceptual Scaffolding 📊

Design principle: Multi-dimensional data is abstract—make it concrete

Slide 15: DataArray Anatomy

Shows actual printed output with arrows pointing to components
Dimensions, coordinates, attributes labeled
“Self-describing data” concept

Slide 16: Dataset Structure

Visual comparison to dictionary of DataArrays
Shows how multiple variables share dimensions

Slide 42-44: GroupBy Operations

Monthly climatology example with actual code output
Shows 1461 daily values → 12 monthly means
Automatic alignment visualization

Slide 46: Automatic Alignment Magic

Side-by-side arrays with different coordinates
Shows how xarray aligns by labels, not position
NaN where no overlap—prevents silent errors

8. Realistic Research Workflows 🔬

Slide 55: Complete Research Example

Shows full analysis pipeline:

# 1. Open multi-year dataset
ds = xr.open_mfdataset('ERA5_*.nc')

# 2. Subset to region
ds_west = ds.sel(lat=slice(32, 49), lon=slice(-125, -100))

# 3. Compute climatology (1991-2020)
ds_clim = ds_west.sel(time=slice('1991', '2020'))
climatology = ds_clim['t2m'].groupby('time.dayofyear').mean()

# 4. Select 2023 summer
summer_2023 = ds_west['t2m'].sel(time=slice('2023-06-01', '2023-08-31'))

# 5. Compute anomalies
anomaly = summer_2023.groupby('time.dayofyear') - climatology

# 6. Find peak heat wave
max_anomaly = anomaly.max(dim='time')

# 7. Plot
[creates 2-panel comparison figure]

# 8. Save results
max_anomaly.to_netcdf('heatwave_2023_anomaly.nc')

Impact: Students see the path from raw data to research results

9. Plotting Integration 📈

Progressive plotting examples:

Slide 49: 1D Time Series

Automatic axis labeling
Uses coordinate values, not indices

Slide 50: 2D Spatial Map

Automatic colorbar
Proper lat/lon axes

Slide 51: Customized Plot

Control over colormap, limits, labels
Professional figure appearance

Slide 52: Multi-Panel Figures

2×2 panels showing time evolution
Single shared colorbar
Date formatting in titles

Impact: Students can create publication-quality figures immediately

10. Advanced Topics with Context 🚀

Lazy Loading with Dask (Slide 53)

Problem: 50 GB file won’t fit in memory
Solution: Chunked loading with dask
When to use: File > RAM, need subset only
When NOT to use: Small files, adds overhead

Multi-File Operations (Slide 54)

Problem: One file per year (common in climate data)
Solution: open_mfdataset() with wildcards
Benefits: Automatic combining, lazy loading, parallel

Writing NetCDF (Slide 56)

Complete workflow: compute → add metadata → save
Shows how to preserve provenance information

Impact: Students ready for real research data workflows

📚 Learning Science Principles Applied

1. Worked Examples Effect

Every concept has 2-3 complete examples
Shows process AND result
Includes common errors and fixes

2. Cognitive Load Management

One new concept at a time
Progressive complexity
Visual aids for abstract concepts (dimensions, coordinates)

3. Retrieval Practice

Regular “Check Your Understanding” questions
“Predict the output” before error messages
Hands-on exercises every 10-15 slides

4. Transfer of Learning

Every example uses atmospheric science context
Real research scenario (ERA5, heat waves)
Complete analysis workflows

5. Error-Driven Learning (Productive Failure)

6 common errors shown explicitly
Students learn debugging patterns
Errors normalized as learning opportunity

6. Metacognition

Explicit “when to use” guidance
Mental model comparisons (NumPy → Pandas → xarray)
Decision tables for tool selection

🎯 Key Design Decisions

Why Start with “What’s Wrong with Pandas/NumPy”?

Alternative approach: Jump straight to xarray syntax

Our approach: Show why existing tools fail first

Rationale: Students need to understand the problem before appreciating the solution. This creates cognitive dissonance → motivation to learn.

Evidence: Pedagogical research shows problem-driven learning improves retention and transfer.

Why 6 Error Examples Instead of Just Showing Correct Code?

Alternative approach: Only show working examples

Our approach: “Predict the output” → Error message → Explanation → Fix

Rationale:

Students will make these errors anyway
Seeing errors in controlled environment builds debugging skills
“Productive failure” research shows learning from mistakes improves understanding

Evidence: Kapur (2008) - productive failure in problem-solving

Why “Try It Yourself” Every 10-15 Slides?

Alternative approach: One big exercise at end

Our approach: Frequent small hands-on moments

Rationale:

Retrieval practice must be spaced throughout
Catches misconceptions early
Active learning > passive watching

Evidence: Freeman et al. (2014) - active learning increases STEM performance

Why Explicit “When to Use Each Tool” Tables?

Alternative approach: Students infer when to use xarray

Our approach: Explicit metacognitive guidance

Rationale:

Novices don’t develop expert heuristics automatically
Need explicit instruction on decision-making
Metacognitive skills are teachable

Evidence: Flavell (1979) - metacognition in learning

📈 Expected Learning Outcomes

Outcome	Design Element	Slide(s)
Understand xarray motivation	Real scenario showing pandas/NumPy limits	4-8
Know when to use xarray	Decision guides, tool comparison table	9, 13, 62
Avoid common errors	6 explicit error examples with fixes	23, 32, 36, 41, 52, 63
Select data correctly	isel vs sel guidance, practice	29-38
Compute climatologies	GroupBy examples, anomaly calculations	42-47
Create quality figures	Progressive plotting examples	49-52
Handle large datasets	Dask/chunking, multi-file operations	53-54
Build research workflows	Complete heat wave analysis example	55, 57

🔄 Recommended Classroom Use

Before Class:

Email students: “Bring laptop, ensure xarray installed”
Post sample NetCDF file on Canvas
Prepare for live coding demonstrations

During Class:

Live code the error examples - Show yourself debugging
Pause at “Try It Yourself” slides - Give full 5 minutes
Use chalkboard for dimension diagrams - Draw 3D/4D grids
Show real ERA5 data - Not just toy examples
Emphasize coordinates - This is the key xarray concept
Cold call after pair work - Keep everyone engaged

After Class:

Post slides + sample NetCDF file immediately
Office hours: “Bring your NetCDF data questions”
Canvas discussion: “What clicked? What’s still confusing?”
Prepare similar data for homework

🚀 Future Enhancements

For Next Iteration:

Add:

Comparison with CDO/NCO command-line tools
Integration with cartopy for maps
Performance tips for large datasets
Common xarray + matplotlib patterns

Consider:

Video of live NetCDF exploration workflow
Student-contributed xarray tips
Gallery of real research figures made with xarray
Debugging flowchart poster

Collect feedback on:

Which error examples were most helpful
Time needed for “Try It Yourself” exercises
Which concepts need more explanation
Whether Dask section is too advanced

📖 Pedagogical References

These design decisions align with:

Cognitive Load Theory (Sweller, 1988)
- Progressive complexity from simple to advanced
- Visual scaffolding for multi-dimensional concepts
- Worked examples reduce cognitive load
Retrieval Practice (Roediger & Butler, 2011)
- Frequent low-stakes “Check Your Understanding”
- Spaced throughout lesson
- Immediate feedback
Productive Failure (Kapur, 2008)
- Error-driven learning: show mistakes first
- Debugging as pedagogy
- Normalizes errors as learning
Transfer of Learning (Bransford & Schwartz, 1999)
- Atmospheric science examples throughout
- Real research scenarios (ERA5, heat waves)
- Complete workflows from data → results
Metacognition (Flavell, 1979)
- Explicit “when to use” guidance
- Decision tables
- Tool selection heuristics
Active Learning (Freeman et al., 2014)
- Frequent hands-on exercises
- Pair programming
- Predict-then-reveal format

🎓 Files Created

atoc4815-week05-xarray.qmd - Complete lesson (67 slides, ~1,900 lines)
atoc4815-week05-xarray.html - Rendered slides with live Python code
PEDAGOGICAL_IMPROVEMENTS_XARRAY.md - This design document

Note: Built from scratch using proven framework, NOT converted from PowerPoint

💡 Comparison to Traditional xarray Tutorials

Typical xarray tutorial:

“Here’s a DataArray. Here’s how to select data.”
Focus on syntax and API
Few error examples
Generic data examples
Minimal context for when/why to use

Our approach:

Starts with real research problem that requires xarray
Shows why pandas/NumPy fail before introducing xarray
6 explicit error scenarios students will encounter
Atmospheric science context throughout
Metacognitive guidance on tool selection

Result:

Students learn not just HOW to use xarray, but WHEN and WHY—skills that transfer to their research.

🌟 Instructor Notes

This lesson is the gateway to real atmospheric data science.

Before this week, students worked with:

Single station time series (pandas)
Simple array operations (NumPy)

After this week, they can:

Handle gridded reanalysis data (ERA5, MERRA-2)
Compute climatologies and anomalies
Work with climate model output
Create publication-quality figures
Build complete research workflows

This is where atmospheric science and programming converge.

The pedagogical investment here pays dividends throughout their research careers. Students who master xarray can:

Analyze their thesis data independently
Contribute to research projects immediately
Publish figures without manual data manipulation
Collaborate with the broader climate science community

You’re giving them a superpower. 🚀