Pedagogical Design: xarray Week 5

ATOC 4815 - Built from Scratch with Proven Framework


📊 Design Statistics

Created: 67 slides, ~1,900 lines Active Learning Exercises: 4 hands-on “Try It Yourself” Error Examples: 6 explicit error scenarios with fixes “Check Your Understanding” moments: 3 formative assessments Decision Guides: 3 metacognitive “when to use” sections

Built from scratch using the proven pedagogical framework from Weeks 1-4


✅ Pedagogical Framework Applied

1. Real-World Scenario Motivation 🌍

Design Decision: Don’t start with “here’s xarray syntax”—start with “here’s a problem pandas and NumPy can’t solve”

Created Research Scenario:

Imagine: You're analyzing the 2023 North American heat wave using ERA5 reanalysis

Your data:
- 4D gridded dataset: temperature(time, level, lat, lon)
- Spatial: North America (15°N-70°N, 130°W-60°W)
- Temporal: 45 years daily (16,425 days)
- Vertical: 37 pressure levels
- Grid: 0.25° × 0.25° (~25 km)
- File size: ~50 GB

Questions:
1. Max temperature in Boulder summer 2023?
2. How does 2023 compare to 1991-2020 climatology?
3. Vertical temperature profile during heat wave?
4. Where was heat wave most intense?
5. Has frequency increased over 45 years?

Impact: Immediately establishes that they need a new tool—this data can’t be handled with pandas or numpy alone.


2. Show Why Previous Tools Fail ⚠️

Problem: Students might think “Why not just use pandas?”

Solution: Explicit “Why Pandas Falls Short” and “Why NumPy Falls Short” slides

Pandas Limitations (Slide 6):

  1. 2D only - Can’t represent 4D (time × level × lat × lon)
  2. No dimension concept - Which axis is which?
  3. No coordinate-based selection - Need index math for “500 hPa at Boulder”

NumPy Limitations (Slide 7):

  1. No dimension labels - Is axis=1 pressure or latitude?
  2. No coordinate values - Manual argmin to find nearest point
  3. Metadata loss - After slicing, what does the data represent?

Code Examples:

# Pandas fail
df = pd.DataFrame(temperature_data)  # ❌ How to structure 4D?
mean_temp = df.mean(axis=2)  # ❌ Was that lat or lon?

# NumPy fail
temp.mean(axis=1)  # ❌ Is axis=1 pressure levels or latitude?
lat_idx = np.argmin(np.abs(lat_array - 40.0))  # ❌ Manual index math!

Impact: Students understand the problem before seeing the solution


3. Error-Driven Learning 🐛

Created 6 explicit error scenarios:

Error 1: Dimension Name Mismatch (Slide 23)

temp.sel(latitude=40)  # ❌ Dimension is 'lat', not 'latitude'

Fix: Always check print(data.dims) first

Error 2: isel vs sel Confusion (Slide 36)

ds.sel(time=0)  # ❌ 0 is an index, not a date!

Fix: Index number → .isel(), Coordinate value → .sel()

Error 3: Wrong Dimension Name in Reduction (Slide 41)

temp.mean(dim='times')  # ❌ Typo: should be 'time'

Fix: Use tab completion for dimension names

Error 4: Forgetting to Close Files (Slide 32)

ds = xr.open_dataset('ERA5_50GB.nc')
os.remove('ERA5_50GB.nc')  # ❌ File still open!

Fix: Use context manager with xr.open_dataset(...) as ds:

Error 5: Using .values Too Early (Slide 52)

temp_array = temp.values  # ❌ Loses all metadata!
subset = temp_array.sel(lat=40)  # ❌ Won't work—just NumPy array!

Fix: Keep as xarray until the last possible moment

Error 6: Forgetting method=’nearest’ (Slide 34)

temp.sel(lat=40.015, lon=-105.2705)  # ❌ Exact match may not exist

Fix: Always use method='nearest' for spatial selection

Format: “Predict the output” → Error message → Explanation → The Fix

Impact: Students see and learn from errors before encountering them in homework


4. Active Learning Exercises 💻

Exercise 1: Tool Selection Quiz (Slide 13)

Which tool for each task?
1. ERA5 temperature(time, level, lat, lon) → xarray
2. Single station hourly time series → Pandas
3. Matrix multiplication → NumPy
4. Climate model temp(time, ensemble, lat, lon) → xarray
5. CSV with station metadata → Pandas

Exercise 2: Create First DataArray (Slide 26)

# With your neighbor (5 min):
# Create DataArray for wind speed
# Dimensions: time (3 days), lat (2), lon (2)
# Boulder coordinates: lat=[40, 40.5], lon=[-105, -104.5]

Exercise 3: Practice Selection Methods (Slide 38)

# Tasks:
# 1. Select last time step using .isel()
# 2. Select all data from January 3rd using .sel()
# 3. Select temp at lat=40°N, lon=-105°W (nearest)
# 4. Select time slice: January 2-5
# 5. What happens if you try ds.sel(time=5)? Why?

Exercise 4: Heat Wave Detector (Slide 57)

# Final Challenge: Detect and visualize a heat wave
# 1. Compute time mean for each grid point
# 2. Find location with highest mean temp
# 3. Extract time series at that location
# 4. Compute anomaly from overall mean
# 5. Plot time series with anomaly highlighted
# 6. Save processed data to NetCDF

Impact: Retrieval practice every 10-15 slides; forces active engagement


5. Metacognitive “When to Use Each Tool” Guidance 🧠

Decision Guide 1: Mental Model Progression (Slide 9)

NumPy:   "Calculator for N-D arrays"
         ✅ Fast math, any dimensions
         ❌ No dimension names, no coordinates

Pandas:  "Spreadsheet with labels"
         ✅ Named columns, time indexing
         ❌ Only 2D

xarray:  "Pandas for N-D grids"
         ✅ Named dimensions
         ✅ Coordinate-based selection
         ✅ Metadata preservation

Decision Guide 2: Selection Methods (Slide 29) | Method | Selection By | Example | |——–|————–|———| | .isel() | Integer position | ds.isel(time=0) → first time | | .sel() | Coordinate value | ds.sel(lat=40) → data at 40°N |

When to use:

  • isel: “First 10 time steps”, “every 3rd latitude”
  • sel: “Data at 500 hPa”, “January 2024”

Decision Guide 3: Tool Selection Table (Slide 62) | Data Type | Tool | Why | |———–|——|—–| | Single station time series | Pandas | 1D, time indexing | | CSV with multiple stations | Pandas | Tabular, mixed types | | Gridded 3D+ NetCDF | xarray | Multi-dimensional | | Climate model output | xarray | 4D (time, lat, lon, level) | | Pure numerical computation | NumPy | Matrix ops, FFT |

Impact: Students know WHEN to use each tool, not just HOW


6. Progressive Scaffolding 📈

Carefully designed progression:

Stage Slides Content
1. Motivation 4-13 Real scenario, tool limitations, mental model
2. Fundamentals 14-28 DataArray, Dataset, anatomy, errors
3. I/O 29-33 Opening NetCDF, exploring structure, file management
4. Selection 34-39 isel vs sel, nearest neighbor, practice
5. Operations 40-48 Reductions, groupby, climatologies, anomalies
6. Plotting 49-52 Built-in plotting, customization, multi-panel
7. Advanced 53-57 Dask/lazy loading, multi-file, workflows
8. Synthesis 58-67 Best practices, error checklist, resources

Impact: Each section builds on previous; no conceptual leaps


7. Visual & Conceptual Scaffolding 📊

Design principle: Multi-dimensional data is abstract—make it concrete

Slide 15: DataArray Anatomy

  • Shows actual printed output with arrows pointing to components
  • Dimensions, coordinates, attributes labeled
  • “Self-describing data” concept

Slide 16: Dataset Structure

  • Visual comparison to dictionary of DataArrays
  • Shows how multiple variables share dimensions

Slide 42-44: GroupBy Operations

  • Monthly climatology example with actual code output
  • Shows 1461 daily values → 12 monthly means
  • Automatic alignment visualization

Slide 46: Automatic Alignment Magic

  • Side-by-side arrays with different coordinates
  • Shows how xarray aligns by labels, not position
  • NaN where no overlap—prevents silent errors

8. Realistic Research Workflows 🔬

Slide 55: Complete Research Example

Shows full analysis pipeline:

# 1. Open multi-year dataset
ds = xr.open_mfdataset('ERA5_*.nc')

# 2. Subset to region
ds_west = ds.sel(lat=slice(32, 49), lon=slice(-125, -100))

# 3. Compute climatology (1991-2020)
ds_clim = ds_west.sel(time=slice('1991', '2020'))
climatology = ds_clim['t2m'].groupby('time.dayofyear').mean()

# 4. Select 2023 summer
summer_2023 = ds_west['t2m'].sel(time=slice('2023-06-01', '2023-08-31'))

# 5. Compute anomalies
anomaly = summer_2023.groupby('time.dayofyear') - climatology

# 6. Find peak heat wave
max_anomaly = anomaly.max(dim='time')

# 7. Plot
[creates 2-panel comparison figure]

# 8. Save results
max_anomaly.to_netcdf('heatwave_2023_anomaly.nc')

Impact: Students see the path from raw data to research results


9. Plotting Integration 📈

Progressive plotting examples:

Slide 49: 1D Time Series

  • Automatic axis labeling
  • Uses coordinate values, not indices

Slide 50: 2D Spatial Map

  • Automatic colorbar
  • Proper lat/lon axes

Slide 51: Customized Plot

  • Control over colormap, limits, labels
  • Professional figure appearance

Slide 52: Multi-Panel Figures

  • 2×2 panels showing time evolution
  • Single shared colorbar
  • Date formatting in titles

Impact: Students can create publication-quality figures immediately


10. Advanced Topics with Context 🚀

Lazy Loading with Dask (Slide 53)

  • Problem: 50 GB file won’t fit in memory
  • Solution: Chunked loading with dask
  • When to use: File > RAM, need subset only
  • When NOT to use: Small files, adds overhead

Multi-File Operations (Slide 54)

  • Problem: One file per year (common in climate data)
  • Solution: open_mfdataset() with wildcards
  • Benefits: Automatic combining, lazy loading, parallel

Writing NetCDF (Slide 56)

  • Complete workflow: compute → add metadata → save
  • Shows how to preserve provenance information

Impact: Students ready for real research data workflows


📚 Learning Science Principles Applied

1. Worked Examples Effect

  • Every concept has 2-3 complete examples
  • Shows process AND result
  • Includes common errors and fixes

2. Cognitive Load Management

  • One new concept at a time
  • Progressive complexity
  • Visual aids for abstract concepts (dimensions, coordinates)

3. Retrieval Practice

  • Regular “Check Your Understanding” questions
  • “Predict the output” before error messages
  • Hands-on exercises every 10-15 slides

4. Transfer of Learning

  • Every example uses atmospheric science context
  • Real research scenario (ERA5, heat waves)
  • Complete analysis workflows

5. Error-Driven Learning (Productive Failure)

  • 6 common errors shown explicitly
  • Students learn debugging patterns
  • Errors normalized as learning opportunity

6. Metacognition

  • Explicit “when to use” guidance
  • Mental model comparisons (NumPy → Pandas → xarray)
  • Decision tables for tool selection

🎯 Key Design Decisions

Why Start with “What’s Wrong with Pandas/NumPy”?

Alternative approach: Jump straight to xarray syntax

Our approach: Show why existing tools fail first

Rationale: Students need to understand the problem before appreciating the solution. This creates cognitive dissonance → motivation to learn.

Evidence: Pedagogical research shows problem-driven learning improves retention and transfer.


Why 6 Error Examples Instead of Just Showing Correct Code?

Alternative approach: Only show working examples

Our approach: “Predict the output” → Error message → Explanation → Fix

Rationale:

  • Students will make these errors anyway
  • Seeing errors in controlled environment builds debugging skills
  • “Productive failure” research shows learning from mistakes improves understanding

Evidence: Kapur (2008) - productive failure in problem-solving


Why “Try It Yourself” Every 10-15 Slides?

Alternative approach: One big exercise at end

Our approach: Frequent small hands-on moments

Rationale:

  • Retrieval practice must be spaced throughout
  • Catches misconceptions early
  • Active learning > passive watching

Evidence: Freeman et al. (2014) - active learning increases STEM performance


Why Explicit “When to Use Each Tool” Tables?

Alternative approach: Students infer when to use xarray

Our approach: Explicit metacognitive guidance

Rationale:

  • Novices don’t develop expert heuristics automatically
  • Need explicit instruction on decision-making
  • Metacognitive skills are teachable

Evidence: Flavell (1979) - metacognition in learning


📈 Expected Learning Outcomes

Outcome Design Element Slide(s)
Understand xarray motivation Real scenario showing pandas/NumPy limits 4-8
Know when to use xarray Decision guides, tool comparison table 9, 13, 62
Avoid common errors 6 explicit error examples with fixes 23, 32, 36, 41, 52, 63
Select data correctly isel vs sel guidance, practice 29-38
Compute climatologies GroupBy examples, anomaly calculations 42-47
Create quality figures Progressive plotting examples 49-52
Handle large datasets Dask/chunking, multi-file operations 53-54
Build research workflows Complete heat wave analysis example 55, 57

Before Class:

  1. Email students: “Bring laptop, ensure xarray installed”
  2. Post sample NetCDF file on Canvas
  3. Prepare for live coding demonstrations

During Class:

  1. Live code the error examples - Show yourself debugging
  2. Pause at “Try It Yourself” slides - Give full 5 minutes
  3. Use chalkboard for dimension diagrams - Draw 3D/4D grids
  4. Show real ERA5 data - Not just toy examples
  5. Emphasize coordinates - This is the key xarray concept
  6. Cold call after pair work - Keep everyone engaged

After Class:

  1. Post slides + sample NetCDF file immediately
  2. Office hours: “Bring your NetCDF data questions”
  3. Canvas discussion: “What clicked? What’s still confusing?”
  4. Prepare similar data for homework

🚀 Future Enhancements

For Next Iteration:

Add:

  • Comparison with CDO/NCO command-line tools
  • Integration with cartopy for maps
  • Performance tips for large datasets
  • Common xarray + matplotlib patterns

Consider:

  • Video of live NetCDF exploration workflow
  • Student-contributed xarray tips
  • Gallery of real research figures made with xarray
  • Debugging flowchart poster

Collect feedback on:

  • Which error examples were most helpful
  • Time needed for “Try It Yourself” exercises
  • Which concepts need more explanation
  • Whether Dask section is too advanced

📖 Pedagogical References

These design decisions align with:

  1. Cognitive Load Theory (Sweller, 1988)
    • Progressive complexity from simple to advanced
    • Visual scaffolding for multi-dimensional concepts
    • Worked examples reduce cognitive load
  2. Retrieval Practice (Roediger & Butler, 2011)
    • Frequent low-stakes “Check Your Understanding”
    • Spaced throughout lesson
    • Immediate feedback
  3. Productive Failure (Kapur, 2008)
    • Error-driven learning: show mistakes first
    • Debugging as pedagogy
    • Normalizes errors as learning
  4. Transfer of Learning (Bransford & Schwartz, 1999)
    • Atmospheric science examples throughout
    • Real research scenarios (ERA5, heat waves)
    • Complete workflows from data → results
  5. Metacognition (Flavell, 1979)
    • Explicit “when to use” guidance
    • Decision tables
    • Tool selection heuristics
  6. Active Learning (Freeman et al., 2014)
    • Frequent hands-on exercises
    • Pair programming
    • Predict-then-reveal format

🎓 Files Created

  1. atoc4815-week05-xarray.qmd - Complete lesson (67 slides, ~1,900 lines)
  2. atoc4815-week05-xarray.html - Rendered slides with live Python code
  3. PEDAGOGICAL_IMPROVEMENTS_XARRAY.md - This design document

Note: Built from scratch using proven framework, NOT converted from PowerPoint


💡 Comparison to Traditional xarray Tutorials

Typical xarray tutorial:

  • “Here’s a DataArray. Here’s how to select data.”
  • Focus on syntax and API
  • Few error examples
  • Generic data examples
  • Minimal context for when/why to use

Our approach:

  • Starts with real research problem that requires xarray
  • Shows why pandas/NumPy fail before introducing xarray
  • 6 explicit error scenarios students will encounter
  • Atmospheric science context throughout
  • Metacognitive guidance on tool selection

Result:

Students learn not just HOW to use xarray, but WHEN and WHY—skills that transfer to their research.


🌟 Instructor Notes

This lesson is the gateway to real atmospheric data science.

Before this week, students worked with:

  • Single station time series (pandas)
  • Simple array operations (NumPy)

After this week, they can:

  • Handle gridded reanalysis data (ERA5, MERRA-2)
  • Compute climatologies and anomalies
  • Work with climate model output
  • Create publication-quality figures
  • Build complete research workflows

This is where atmospheric science and programming converge.

The pedagogical investment here pays dividends throughout their research careers. Students who master xarray can:

  • Analyze their thesis data independently
  • Contribute to research projects immediately
  • Publish figures without manual data manipulation
  • Collaborate with the broader climate science community

You’re giving them a superpower. 🚀