Git for Scientific Software Development

Version Control for ATOC-esque Science

Will Chapman

CU Boulder ATOC

2026-01-01

Introduction

Why Version Control?

Real scenario from ATOC-esque science research:

  • You write code to analyze climate model output
  • It works! You submit your paper
  • Reviewer asks: “What if you change the threshold?”
  • You modify the code
  • Everything breaks
  • You can’t remember what the working version looked like
  • Panic sets in…

Git prevents this nightmare

What We’ll Cover Today

Part 1: Fundamentals

  • What is version control?
  • Why scientists need Git
  • Core concepts (commits, diffs, branches)
  • Essential commands

Part 2: Workflows

  • Repository setup
  • Making commits
  • Working with branches
  • Collaborative workflows

Part 3: Best Practices

  • README files
  • LICENSE files
  • .gitignore files
  • Code review
  • Pull requests

Part 4: Resources

  • Where to get help
  • Git GUIs vs command line
  • GitHub vs GitLab

Acknowledgments

This presentation is heavily inspired by Jack Atkinson’s excellent “Git for Scientific Software Development” presentation.

Thank you, Jack, for the great template!

Why Git Matters for Science

Research Software is Critical

Your code influences important decisions:

  • Weather forecasts - Emergency management relies on model output
  • Climate projections - Policy decisions worth trillions of dollars
  • Air quality - Public health warnings
  • Aviation - Flight safety and routing
  • Agriculture - Crop predictions and water management

If your code has bugs, people could make wrong decisions

→ Quality matters!

What is Git?

Git is not just a series of backups

It’s a project management system that tracks:

  • What changed (every line of code)
  • When it changed (timestamp)
  • Who changed it (author)
  • Why it changed (commit message)
  • How to undo it (full history)

Think of it as “Track Changes” for code, but way more powerful

Git vs Dropbox/OneDrive

| Feature                                  | Git                    | Dropbox/OneDrive      |
| Track who changed what                   | ✅ Yes, line-by-line   | ❌ No                 |
| Meaningful change messages               | ✅ Yes, commit messages | ❌ No                 |
| Work on multiple features simultaneously | ✅ Yes, branches       | ❌ No                 |
| Merge conflicting changes                | ✅ Yes, smart merge    | ⚠️ Creates duplicates |
| Collaborate with strangers               | ✅ Yes, pull requests  | ❌ Hard               |
| Review before accepting changes          | ✅ Yes, code review    | ❌ No                 |
| Undo specific changes                    | ✅ Yes, revert commits | ⚠️ Version history    |

Bottom line: Dropbox backs up files. Git manages projects.

Git Core Concepts

Repositories

A repository (repo) is a project folder tracked by Git

my_analysis/
├── .git/              # Git's database (don't touch!)
├── analyze_temp.py    # Your code
├── plot_results.py
├── data/
│   └── temperatures.nc
└── README.md

The .git/ folder contains:

  • All history
  • All branches
  • All commits
  • Configuration

You never edit .git/ directly - use git commands!

Commits

A commit is a snapshot of your project at a specific time

%%{init: {'theme':'base', 'themeVariables': { 'git0': '#5DA9E9', 'gitBranchLabel0': '#ffffff', 'commitLabelColor': '#000000', 'commitLabelBackground': '#ffffff'}}}%%
gitGraph
    commit id: "Initial analysis"
    commit id: "Add QC checks"
    commit id: "Fix temperature bug"
    commit id: "Add plotting"

Each commit contains:

  • Complete snapshot of all files
  • Commit message (what changed and why)
  • Author (who made the change)
  • Timestamp (when it happened)
  • Parent commit (what came before)
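To make the list above concrete, here is a toy model of a commit chain in Python. It is a teaching sketch under stated assumptions, not Git's actual storage format: the `ToyCommit` class and the `history` helper are hypothetical names, and real Git stores and hashes its objects differently.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyCommit:
    """A simplified stand-in for a Git commit (illustration only)."""
    files: Dict[str, str]            # snapshot: filename -> contents
    message: str                     # what changed and why
    author: str                      # who made the change
    timestamp: str                   # when it happened
    parent: Optional["ToyCommit"]    # what came before (None for the first commit)

    @property
    def commit_id(self) -> str:
        """Derive a short ID from the commit's contents plus its parent's ID."""
        parent_id = self.parent.commit_id if self.parent else ""
        payload = repr((sorted(self.files.items()), self.message,
                        self.author, self.timestamp, parent_id))
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:7]


def history(tip: ToyCommit) -> List[str]:
    """Walk parent links from the newest commit back to the first one."""
    messages = []
    current: Optional[ToyCommit] = tip
    while current is not None:
        messages.append(current.message)
        current = current.parent
    return messages


first = ToyCommit({"analyze_temp.py": "v1"}, "Initial analysis",
                  "Alice Johnson", "2026-01-01T09:00", None)
second = ToyCommit({"analyze_temp.py": "v2"}, "Add QC checks",
                   "Alice Johnson", "2026-01-02T09:00", first)

print(history(second))                       # newest message first
print(first.commit_id, second.commit_id)     # two different short IDs
```

Because each ID depends on the parent's ID, changing an old commit would change every ID after it. That is roughly why history in Git is hard to alter silently.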

Diffs

A diff shows exactly what changed between commits

Before:

def calculate_mean(temps):
    """Calculate mean temperature"""
    return sum(temps) / len(temps)

After:

def calculate_mean(temps):
    """Calculate mean temperature"""
    return np.mean(temps)  # Better!

Git shows this as:

- return sum(temps) / len(temps)    # Red (removed)
+ return np.mean(temps)              # Green (added)

Lines with - = removed (red in terminal)

Lines with + = added (green in terminal)
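You can reproduce the same kind of output without Git, using Python's standard `difflib` module. This is only a sketch to show how the - and + lines arise; the file name `analyze_temp.py` in the headers is an assumption for illustration.

```python
import difflib

before = '''def calculate_mean(temps):
    """Calculate mean temperature"""
    return sum(temps) / len(temps)
'''

after = '''def calculate_mean(temps):
    """Calculate mean temperature"""
    return np.mean(temps)  # Better!
'''

# unified_diff yields header lines, then context lines (leading space),
# removed lines (leading "-"), and added lines (leading "+").
diff_lines = list(difflib.unified_diff(
    before.splitlines(),
    after.splitlines(),
    fromfile="a/analyze_temp.py",
    tofile="b/analyze_temp.py",
    lineterm="",
))

for line in diff_lines:
    print(line)
```

Only the one line that differs gets a - and a + entry; the two unchanged lines are printed as context.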

Branches

Branches let you work on multiple things simultaneously

%%{init: {'theme':'base', 'themeVariables': { 'git0': '#5DA9E9', 'git1': '#66B266', 'git2': '#E8A838', 'gitBranchLabel0': '#ffffff', 'gitBranchLabel1': '#ffffff', 'gitBranchLabel2': '#ffffff', 'commitLabelColor': '#000000', 'commitLabelBackground': '#ffffff'}}}%%
gitGraph
    commit id: "Working analysis"
    commit id: "Paper submitted"
    branch bug-fix
    commit id: "Fix QC threshold"
    checkout main
    branch new-feature
    commit id: "Add spatial plots"
    commit id: "Add significance tests"
    checkout main
    merge bug-fix
    merge new-feature

Use cases:

  • main - stable, working code
  • bug-fix - fix a specific bug
  • new-feature - develop new functionality
  • reviewer-response - address reviewer comments

Essential Git Commands

Setup: Configure Git

First time only - tell Git who you are:

# Set your name (shows in commits)
git config --global user.name "Alice Johnson"

# Set your email (links to GitHub)
git config --global user.email "alice.johnson@colorado.edu"

# Set default editor (nano is easiest)
git config --global core.editor "nano"

# Check your settings
git config --list

This information appears in every commit you make

Creating a Repository

Option 1: Start from scratch (local)

cd ~/ATOC4815/my_project
git init
# Creates .git/ folder - now tracked by Git!

Option 2: Clone existing repository (from GitHub)

git clone https://github.com/username/repo.git
cd repo
# Now you have a copy with full history

For this class: You’ll mostly clone assignments from GitHub

The Basic Workflow

Every time you make changes:

# 1. Check what changed
git status

# 2. See the actual changes
git diff

# 3. Stage files for commit (tell Git what to save)
git add analyze_temp.py
# or add everything:
git add .

# 4. Commit (save a snapshot)
git commit -m "Fix temperature calculation bug"

# 5. Push to GitHub (backup + share)
git push
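If you want to try steps 1 to 4 without touching a real project, the sketch below runs them from Python in a throwaway folder. Assumptions: `git` is installed and on your PATH (the function returns None otherwise), and the helper names `run_git` and `demo_basic_workflow`, the placeholder identity, and the file contents are all made up for illustration. The push step is left out because it needs a remote.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_git(args, cwd):
    """Run one git command inside cwd and return its standard output."""
    completed = subprocess.run(
        ["git",
         "-c", "user.name=Example Student",         # placeholder identity
         "-c", "user.email=student@example.com",
         "-c", "commit.gpgsign=false",
         *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return completed.stdout


def demo_basic_workflow():
    """Init a scratch repo, then run status, add, commit, and log."""
    if shutil.which("git") is None:
        return None  # git is not available on this machine
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        run_git(["init"], repo)
        (repo / "analyze_temp.py").write_text("print('hello')\n")
        status_before = run_git(["status", "--short"], repo)   # 1. check what changed
        run_git(["add", "analyze_temp.py"], repo)               # 3. stage the file
        run_git(["commit", "-m", "Add temperature analysis script"], repo)  # 4. commit
        log = run_git(["log", "--oneline"], repo)
        return status_before, log


result = demo_basic_workflow()
print(result)
```

The short status lists the new file as untracked before `git add`, and the log shows the single commit afterwards.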

Git Status - Your Best Friend

$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   analyze_temp.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    plot_results.py

no changes added to commit (use "git add" and/or "git commit -a")

Translation:

  • analyze_temp.py - modified, not staged
  • plot_results.py - new file, not tracked
  • Action: Run git add to stage them

Git Diff - See What Changed

$ git diff analyze_temp.py

Before (old code):

def qc_temperature(temp):
    if temp < -50 or temp > 50:
        return None
    return temp

After (your changes):

def qc_temperature(temp):
    if temp < -90 or temp > 60:  # Updated!
        return None
    return temp

Shows line-by-line changes before you commit

In your terminal, removed lines appear in red and added lines in green

Committing Changes

Good commit:

git commit -m "Expand temperature QC range for polar data

The previous range (-50 to 50°C) was rejecting valid Antarctic
observations. Updated to -90 to 60°C based on WMO standards."

Bad commit:

git commit -m "fixed stuff"
git commit -m "asdf"
git commit -m "Update"

Good commit messages:

  • Start with verb (Fix, Add, Update, Remove)
  • Explain WHY, not just WHAT
  • First line is summary (< 50 chars)
  • Blank line, then details if needed
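The rules above are simple enough to check automatically. Here is a minimal sketch of such a check; `check_commit_message` is a hypothetical helper, not part of Git, and the "at least two words" test is my own rough stand-in for "explain why".

```python
from typing import List


def check_commit_message(message: str, max_summary: int = 50) -> List[str]:
    """Return a list of style problems; an empty list means the message passes."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["Summary line is empty"]

    problems = []
    summary = lines[0]
    if len(summary) >= max_summary:
        problems.append(f"Summary is {len(summary)} chars; keep it under {max_summary}")
    if not summary[0].isupper():
        problems.append("Summary should start with a capitalized verb (Fix, Add, ...)")
    if len(lines) > 1 and lines[1].strip():
        problems.append("Leave a blank line between the summary and the details")
    if len(summary.split()) < 2:
        problems.append("Summary is too vague to explain what changed")
    return problems


good = (
    "Expand temperature QC range for polar data\n"
    "\n"
    "The previous range (-50 to 50°C) was rejecting valid Antarctic\n"
    "observations."
)

print(check_commit_message(good))      # no problems found
print(check_commit_message("asdf"))    # lowercase start and too vague
print(check_commit_message("Update"))  # too vague
```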

Pushing to GitHub

After committing, send changes to GitHub:

git push

First time pushing a new branch:

git push -u origin main
# -u sets up tracking, only needed once

What push does:

  • Uploads your commits to GitHub
  • Others can now see your changes
  • Backs up your work
  • Enables collaboration

Working with Branches

Why Branches?

Scenario: You’re working on a paper

  • Your code works, analysis complete
  • Reviewer asks: “Try a different method”
  • You start changing code
  • Halfway through: “Oh no, the old method had a bug too!”
  • Now you need to:
    • Fix the bug in the submitted version
    • Continue developing the new method
    • Keep them separate!

Solution: Branches let you work on both simultaneously

Branch Workflow

%%{init: {'theme':'base', 'themeVariables': { 'git0': '#5DA9E9', 'git1': '#66B266', 'git2': '#E8A838', 'gitBranchLabel0': '#ffffff', 'gitBranchLabel1': '#ffffff', 'gitBranchLabel2': '#ffffff', 'commitLabelColor': '#000000', 'commitLabelBackground': '#ffffff'}}}%%
gitGraph
    commit id: "Initial paper results"
    commit id: "Submitted to journal"
    branch reviewer-method
    checkout reviewer-method
    commit id: "Start alternative method"
    commit id: "Implement new approach"
    checkout main
    branch bug-fix
    commit id: "Fix original bug"
    checkout main
    merge bug-fix id: "Merge bug fix"
    checkout reviewer-method
    commit id: "Finish alternative method"
    checkout main
    merge reviewer-method id: "Merge for revision"

Branch Commands

Create and switch to new branch:

# Create branch
git branch new-feature

# Switch to it
git checkout new-feature

# Or do both at once:
git checkout -b new-feature

See all branches:

git branch
  main
* new-feature    # * shows current branch

Switch branches:

git checkout main

Merging Branches

When feature is done, merge back to main:

# Switch to main
git checkout main

# Merge feature branch in
git merge new-feature

# If successful, delete feature branch
git branch -d new-feature

Merge conflicts happen when:

  • Same lines changed in both branches
  • Git can’t decide which to keep
  • You manually resolve (we’ll cover this later)

Repository Infrastructure

Essential Files

Every repository should have:

  1. README.md - What is this project?
  2. LICENSE - Can others use your code?
  3. .gitignore - What files to NOT track?

These files are the first thing people see on GitHub!

README.md

Your project’s front page

# ATOC 4815 Temperature Analysis

Analysis scripts for Boulder temperature trends 1950-2025.

## Installation

conda env create -f environment.yml
conda activate temp-analysis

## Usage

python analyze_temps.py --input data/boulder.csv

## Data Sources

NOAA Boulder weather station (Station ID: GHCND:USC00050848)

## Contact

Alice Johnson (alice.johnson@colorado.edu)

What Makes a Good README?

  1. Title - What is this?
  2. Description - What does it do? (1-2 sentences)
  3. Installation - How to set it up?
  4. Usage - How to run it? (examples!)
  5. Data - Where does data come from?
  6. Contact - Who to ask questions?
  7. Citation - How to cite your work?

Make it easy for your future self (6 months from now) to use this code

LICENSE

Do others have permission to use your code?

For academic research, common choices:

MIT License (most permissive):

  • ✅ Anyone can use, modify, sell
  • ✅ Just need to include license text
  • Example: Ruby on Rails, jQuery, Node.js

GPL License (copyleft):

  • ✅ Anyone can use, modify
  • ⚠️ If you distribute modified versions, they must also be under the GPL
  • Example: Linux, Git itself

CC BY 4.0 (for data/docs):

  • ✅ Share, adapt, commercial use
  • ⚠️ Must give credit
  • Example: Papers, datasets, documentation

GitHub has a license picker:

  • Go to repository → Add file → Create new file → “LICENSE”
  • Choose from templates

.gitignore

Tell Git to ignore certain files

# Python
__pycache__/
*.pyc
*.pyo
.pytest_cache/

# Jupyter
.ipynb_checkpoints/

# Data (too large for Git)
data/*.nc
data/*.csv
*.h5

# OS files
.DS_Store
Thumbs.db

# IDE
.vscode/
.idea/

# Temporary files
*.tmp
*~

Why ignore files?

  • Too large (> 100 MB)
  • Generated automatically
  • Contain secrets/passwords
  • User-specific (IDE configs)
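To build intuition for how the patterns above pick out files, here is a rough approximation in Python. It is not Git's matcher: `is_ignored` is a hypothetical helper that handles only the three pattern shapes used on this slide (folder names ending in /, patterns containing a /, and plain file-name patterns) and ignores features such as negation with ! or **.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List


def is_ignored(path: str, patterns: List[str]) -> bool:
    """Approximate .gitignore matching for a repo-relative path."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Folder pattern: ignore anything inside a folder with that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif "/" in pattern:
            # Pattern with a slash: compare folder by folder from the repo root,
            # so "*" never matches across a "/".
            pattern_parts = PurePosixPath(pattern).parts
            if len(pattern_parts) == len(parts) and all(
                fnmatchcase(part, piece)
                for part, piece in zip(parts, pattern_parts)
            ):
                return True
        elif fnmatchcase(parts[-1], pattern):
            # Plain pattern: compare against the file name in any folder.
            return True
    return False


patterns = ["__pycache__/", "*.pyc", ".ipynb_checkpoints/",
            "data/*.nc", "data/*.csv", "*.h5", ".DS_Store"]

for candidate in ["analyze_temp.py", "data/temperatures.nc",
                  "results/run1.h5", "data/raw/old.nc"]:
    print(candidate, is_ignored(candidate, patterns))
```

In a real repository, `git check-ignore -v <path>` tells you which rule (if any) matches a file.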

What NOT to Commit

❌ DON’T commit:

  • Large data files (> 100 MB)
  • Passwords or API keys
  • Generated files (.pyc, .o)
  • IDE config files
  • Binary files (unless necessary)
  • Your entire conda environment

Use .gitignore for these!

✅ DO commit:

  • Source code (.py, .R, .jl)
  • Documentation (.md, .txt)
  • Configuration files (small!)
  • Example data (< 1 MB)
  • Environment spec (environment.yml)
  • Scripts and notebooks

These are what make the project runnable!
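A quick way to catch oversized files before they end up in a commit is to scan the project folder first. The sketch below is one possible approach; `find_large_files` is a hypothetical helper, and the demo uses a tiny 1 KB limit in a temporary folder so it runs instantly. For a real project you would pass something like `100 * 1024 * 1024`.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Tuple


def find_large_files(root: str, limit_bytes: int) -> List[Tuple[str, int]]:
    """Return (relative path, size) for files above limit_bytes, skipping .git/."""
    large = []
    for folder, subdirs, files in os.walk(root):
        if ".git" in subdirs:
            subdirs.remove(".git")   # never descend into Git's own database
        for name in files:
            full_path = Path(folder) / name
            size = full_path.stat().st_size
            if size > limit_bytes:
                large.append((full_path.relative_to(root).as_posix(), size))
    return sorted(large)


# Demonstration in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "analyze_temp.py").write_text("print('ok')\n")
    Path(tmp, "data").mkdir()
    Path(tmp, "data", "temperatures.nc").write_bytes(b"\0" * 2048)
    flagged = find_large_files(tmp, limit_bytes=1024)

print(flagged)
```

Anything this reports is a candidate for .gitignore.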

Collaborative Workflows

GitHub != Git

Git = Version control system (runs locally)

GitHub = Website for hosting Git repositories

Alternatives to GitHub:

  • GitLab - Similar, some prefer it
  • Bitbucket - Atlassian product
  • Gitea - Self-hosted option
  • Your own server - Just need SSH

This class uses GitHub because it’s most common in science

Forking vs Cloning

Cloning = Make a local copy

git clone https://github.com/owner/repo.git

  • You can pull updates
  • Can’t push unless you have permission

Forking = Make your own copy on GitHub

  • Click “Fork” button on GitHub
  • Now you have github.com/YOUR_NAME/repo
  • You CAN push to your fork
  • Can submit pull requests to original

Use case: Contributing to someone else’s project

Pull Requests

A pull request (PR) is a request to merge your changes into someone else’s repository

  1. Fork their repository on GitHub
  2. Clone your fork locally
  3. Create a branch for your changes
  4. Make commits
  5. Push to your fork
  6. Open pull request on GitHub
  7. Discuss/review
  8. They merge it!

PRs are how open-source development works

Code Review

Before merging, someone should review your code

What reviewers check:

  • Does it work? Test the code
  • Is it clear? Can you understand it?
  • Is it correct? Are the calculations right?
  • Is it documented? Docstrings, comments
  • Does it follow style? PEP 8 for Python
  • Any bugs? Edge cases, error handling

Code review catches bugs before they reach production!
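The “Does it work?” and “Any bugs?” checks are easier when the code comes with a small test. As a sketch, here is what a test might look like for the `qc_temperature` function from the Git Diff slide (copied here with the updated thresholds). The test values are illustrative choices, not part of the original code.

```python
def qc_temperature(temp):
    """Return temp if it lies within the plausible range, otherwise None."""
    if temp < -90 or temp > 60:
        return None
    return temp


def test_qc_temperature():
    # Typical values pass through unchanged.
    assert qc_temperature(15.0) == 15.0
    # Cold but valid polar observations are kept.
    assert qc_temperature(-75.0) == -75.0
    # Boundary values are inside the accepted range.
    assert qc_temperature(-90) == -90
    assert qc_temperature(60) == 60
    # Physically implausible values are rejected.
    assert qc_temperature(-120.0) is None
    assert qc_temperature(75.0) is None


test_qc_temperature()
print("All QC checks passed")
```

A reviewer can run this in seconds, and the edge cases in it document what the function is supposed to do at the boundaries.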

Code Review Example

Original code:

def calc(x, y):
    z = []
    for i in range(len(x)):
        z.append(x[i] - y[i])
    return z

Reviewer comments:

What does this function do? Need a docstring.

Function name “calc” is not descriptive

This can be simplified with NumPy

After revision:

import numpy as np

def calculate_temperature_dewpoint_difference(temp_c, dewpoint_c):
    """
    Calculate temperature minus dewpoint (T-Td)

    Args:
        temp_c: Array of temperatures in Celsius
        dewpoint_c: Array of dewpoints in Celsius
    Returns:
        Array of T-Td values
    """
    # np.asarray lets callers pass plain lists as well as NumPy arrays
    return np.asarray(temp_c) - np.asarray(dewpoint_c)

How to Be a Good Code Reviewer

  1. Be specific - Don’t just say “fix this,” explain what’s wrong
  2. Be constructive - Suggest improvements, don’t just criticize
  3. Praise good code - “Nice use of list comprehension!”
  4. Ask questions - “Why did you choose this approach?”
  5. Focus on learning - It’s a teaching opportunity
  6. Be kind - Remember there’s a person on the other side

Bad review: “This code is terrible”

Good review: “This function is hard to understand. Could you add a docstring explaining what it does? Also, lines 15-20 could be simplified with NumPy’s np.where() function.”

Common Git Scenarios

Scenario 1: Undo Last Commit

“I just committed but forgot to add a file!”

# Add the forgotten file
git add forgotten_file.py

# Amend the last commit (replaces it)
git commit --amend --no-edit

# If you want to change the message too:
git commit --amend -m "New commit message"

⚠️ Only amend commits that haven’t been pushed!

Scenario 2: Discard Local Changes

“I broke everything, just go back to last commit”

# See what changed
git status

# Discard changes to one file
git checkout -- analyze_temp.py

# Or discard ALL changes
git reset --hard HEAD

⚠️ This DELETES your changes! Can’t undo this!

Scenario 3: Merge Conflict

“Git says there’s a conflict - help!”

$ git merge feature-branch
Auto-merging analyze_temp.py
CONFLICT (content): Merge conflict in analyze_temp.py
Automatic merge failed; fix conflicts and then commit the result.

Git marks conflicts in the file:

<<<<<<< HEAD
temp_threshold = 50  # Current branch
=======
temp_threshold = 60  # Feature branch
>>>>>>> feature-branch

Fix it:

  1. Choose which version to keep (or combine)
  2. Delete the markers (<<<<<<<, =======, >>>>>>>)
  3. git add analyze_temp.py
  4. git commit -m "Resolve merge conflict"
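A common slip in step 2 is leaving a marker behind. The sketch below scans text for leftover markers before you commit; `find_conflict_markers` is a hypothetical helper, and it assumes the default marker style shown above.

```python
from typing import List


def find_conflict_markers(text: str) -> List[int]:
    """Return 1-based line numbers that still look like conflict markers."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        starts_or_ends_block = line.startswith(("<<<<<<< ", ">>>>>>> "))
        is_separator = line.rstrip() == "======="
        if starts_or_ends_block or is_separator:
            flagged.append(number)
    return flagged


unresolved = (
    "<<<<<<< HEAD\n"
    "temp_threshold = 50  # Current branch\n"
    "=======\n"
    "temp_threshold = 60  # Feature branch\n"
    ">>>>>>> feature-branch\n"
)
resolved = "temp_threshold = 60\n"

print(find_conflict_markers(unresolved))   # lines that still need attention
print(find_conflict_markers(resolved))     # empty list: safe to git add
```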

Scenario 4: Sync with Upstream

“My fork is out of date with the original repo”

# Add original repo as "upstream" (once)
git remote add upstream https://github.com/original/repo.git

# Fetch updates from upstream
git fetch upstream

# Merge upstream's main into your main
git checkout main
git merge upstream/main

# Push to your fork
git push origin main

Important: This updates your main branch but doesn’t affect your feature branches! Your work is safe.

Update your feature branch:

git checkout your-feature-branch
git merge main  # Or: git rebase main

Do this regularly to avoid huge merge conflicts later!

Scenario 5: Accidentally Committed to Main

“I meant to create a branch but committed to main!”

# Create branch from current state
git branch my-feature

# Reset main to before your commits
# (assumes you are on main and have not pushed them)
git reset --hard origin/main

# Switch to your feature branch
git checkout my-feature

# Your commits are now on the feature branch!

This works because the commits are still saved; only the branch labels moved
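A toy model makes it clearer why nothing is lost here. In the sketch below, commits are entries in a dictionary and branches are just names pointing at one commit. The commit IDs `c1` to `c4` are made up, and this is a simplification of how Git really stores things.

```python
# Each commit records its parent; this dict is a toy commit graph.
parents = {
    "c1": None,   # Initial code
    "c2": "c1",   # Last commit that exists on GitHub
    "c3": "c2",   # Committed to main by mistake
    "c4": "c3",   # Committed to main by mistake
}

# Branches are just names that point at one commit.
branches = {"main": "c4", "origin/main": "c2"}


def reachable(tip):
    """List the commits reachable from a branch tip, newest first."""
    chain = []
    while tip is not None:
        chain.append(tip)
        tip = parents[tip]
    return chain


# git branch my-feature: add a new label at the current commit.
branches["my-feature"] = branches["main"]

# git reset --hard origin/main (run on main): move the main label back.
branches["main"] = branches["origin/main"]

print(reachable(branches["main"]))         # main no longer includes c3 and c4
print(reachable(branches["my-feature"]))   # my-feature still reaches them
```

No commit was deleted from `parents`; only the `main` label changed.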

Don’t Be Afraid of Git!

✅ These are SAFE (won’t erase your work):

  • git status - Just looks, never changes
  • git log - Just reads history
  • git diff - Just shows differences
  • git branch - Creating/viewing branches
  • git checkout <branch> - Switching branches
  • git add - Staging files
  • git commit - Saving work locally
  • git fetch - Getting updates (doesn’t merge)
  • git push - Uploading to GitHub
  • git pull - Usually safe on your own branch
  • git merge - Can be undone if needed

⚠️ Be CAREFUL with these:

  • git reset --hard - Discards uncommitted changes
  • git push --force - Can overwrite others’ work
  • git clean -fd - Deletes untracked files
  • git rebase on shared branches
  • Deleting branches you haven’t merged

💡 Pro tip: Git saves your commits for ~30 days even if you “delete” them. Use git reflog to find them!

Bottom line: If you’ve committed it, Git has saved it. You’d have to try pretty hard to actually lose work!

Git Best Practices

The Golden Rules

  1. Commit often - Small commits > huge commits
  2. Write good messages - Future you will thank present you
  3. Use branches - Don’t do everything on main
  4. Pull before push - Sync with team first
  5. Review your own code - Check git diff before committing
  6. Don’t commit secrets - API keys, passwords stay out!
  7. Test before pushing - Make sure it works

Commit Message Tips

Good commit messages follow a pattern:

# Format:
<type>: <short summary>

<longer description if needed>

# Types:
feat: Add new temperature plotting function
fix: Correct timezone handling in data loader
docs: Update README with installation instructions
style: Format code with black
refactor: Simplify QC logic without changing behavior
test: Add tests for temperature calculations
chore: Update dependencies in environment.yml

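Why this matters: this is the Conventional Commits style, and changelog and release tools (for example commitlint and semantic-release) can parse these prefixes automatically!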
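As a rough illustration of how a tool might read the `<type>: <short summary>` format above, here is a minimal parser. `parse_header` and `ALLOWED_TYPES` are hypothetical names for this sketch; real tools support more of the convention (scopes, breaking-change markers) than this does.

```python
import re
from typing import Optional, Tuple

ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# A lowercase type, a colon and a space, then a non-empty summary.
HEADER_PATTERN = re.compile(r"^(?P<type>[a-z]+): (?P<summary>\S.*)$")


def parse_header(header: str) -> Optional[Tuple[str, str]]:
    """Split a commit header into (type, summary), or return None if it does not fit."""
    match = HEADER_PATTERN.match(header)
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return None
    return match.group("type"), match.group("summary")


print(parse_header("fix: Correct timezone handling in data loader"))
print(parse_header("fixed stuff"))   # does not follow the format
```

With the type separated out, grouping commits into "new features" and "bug fixes" for release notes becomes a simple filter.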

Branching Strategy

For research projects:

%%{init: {'theme':'base', 'themeVariables': { 'git0': '#5DA9E9', 'git1': '#66B266', 'git2': '#E8A838', 'git3': '#D47FA6', 'gitBranchLabel0': '#ffffff', 'gitBranchLabel1': '#ffffff', 'gitBranchLabel2': '#ffffff', 'gitBranchLabel3': '#ffffff', 'commitLabelColor': '#000000', 'commitLabelBackground': '#ffffff'}}}%%
gitGraph
    commit id: "Initial code"
    branch develop
    checkout develop
    commit id: "Work in progress"
    branch feature/new-plot
    checkout feature/new-plot
    commit id: "Add spatial plot"
    checkout develop
    merge feature/new-plot
    branch bug/fix-qc
    checkout bug/fix-qc
    commit id: "Fix QC threshold"
    checkout develop
    merge bug/fix-qc
    checkout main
    merge develop id: "Release for paper"

  • main - stable, matches published paper
  • develop - active work
  • feature/* - new features
  • bug/* - bug fixes

When to Commit

✅ Good times to commit:

  • Feature works (even if not complete)
  • Fixed a specific bug
  • Added a complete function
  • Updated documentation
  • End of work session

❌ Don’t commit:

  • Code doesn’t run
  • Halfway through a change
  • Without testing it first

Rule of thumb: Each commit should be a logical unit of work

Getting Help

Git Documentation

Official docs are… challenging

$ git help commit
GIT-COMMIT(1)                     Git Manual                     GIT-COMMIT(1)

NAME
       git-commit - Record changes to the repository

SYNOPSIS
       git commit [-a | --interactive | --patch] [-s] [-v] [-u<mode>]
                  [--amend] [--dry-run] [(-c | -C | --fixup | --squash) <commit>]
                  [-F <file> | -m <msg>] [--reset-author] [--allow-empty]
                  [--allow-empty-message] [--no-verify] [-e] [--author=<author>]
                  [--date=<date>] [--cleanup=<mode>] [--[no-]status]
                  [-i | -o] [-S[<keyid>]] [--] [<file>...]

Better alternatives:

Friendly Resources

Learning Git:

Cheat Sheets:

Visual Tools:

Getting Help:

  • Office hours!
  • Stack Overflow (search first)
  • GitHub Discussions
  • Your classmates

Git GUIs vs Command Line

GUI (Graphical User Interface):

Pros:

  • Visual representation of branches
  • Easier to see diffs
  • Point and click commits
  • Less scary for beginners

Cons:

  • Different tools work differently
  • Some operations still need command line
  • Hides what’s actually happening

Command Line:

Pros:

  • Works everywhere (even on servers)
  • All features available
  • Understand what Git is doing
  • Faster once you learn it

Cons:

  • Steeper learning curve
  • Easy to make mistakes
  • Have to memorize commands

My recommendation: Learn command line basics, use GUI for visualization

Practice Time!

Hands-On Exercise

Let’s practice the workflow:

  1. Fork this repository: github.com/WillyChap/git-practice
  2. Clone your fork locally
  3. Create a branch called add-your-name
  4. Edit students.md - add your name
  5. Commit with message “Add [Your Name] to students list”
  6. Push to your fork
  7. Open a pull request to the original repo

Let’s do this together!

Common Mistakes (and How to Fix Them)

1. “I committed to the wrong branch”

git checkout correct-branch
git cherry-pick <commit-hash>  # Copy commit to correct branch
git checkout wrong-branch
git reset --hard HEAD~1  # Remove from wrong branch

2. “I pushed something I shouldn’t have”

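# Undo it with a new commit (safe even if others have already pulled):
git revert <commit-hash>
git push
# Creates new commit that undoes the bad one
# The bad commit stays in history, so replace any leaked passwords or keys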

3. “My repo is completely broken”

# Nuclear option - start fresh (keeps your code):
cd ..
mv old_repo old_repo_backup
git clone <url>
# Manually copy your changes over

Quiz Yourself

  1. What command shows you what changed in your files?
  2. How do you create a new branch?
  3. What should a commit message include?
  4. What files should go in .gitignore?
  5. How do you undo the last commit?

Answers:

  1. git diff
  2. git checkout -b branch-name
  3. What changed and WHY
  4. Large files, generated files, secrets
  5. git reset --soft HEAD~1 (or git commit --amend if not pushed)

Wrapping Up

Key Takeaways

  1. Git is essential for modern science - Reproducibility, collaboration, quality
  2. Commits are snapshots - Complete project state at a point in time
  3. Branches enable parallel work - Multiple features/fixes simultaneously
  4. Good practices matter - Commit messages, READMEs, .gitignore
  5. Code review improves quality - Catch bugs before they cause problems
  6. Use Git for your homework! - Practice makes perfect

What We Didn’t Cover

Advanced topics (for later):

  • Rebasing vs merging
  • Stashing changes
  • Tags and releases
  • Git hooks
  • Submodules
  • Git LFS (Large File Storage)
  • Bisecting to find bugs

Don’t worry! Master the basics first.

Resources for This Class

Course materials:

  • Git practice repository: github.com/WillyChap/git-practice
  • Assignment templates: github.com/WillyChap/ATOC4815-templates

Getting help:

  • Office hours: Tu 11:15-12:15p, Th 9-10a (Aerospace Café)
  • Email: wchapman@colorado.edu
  • GitHub Discussions on course repo

Remember:

  • Commit often!
  • Use meaningful messages!
  • Ask for help when stuck!

Next Steps

This week:

  1. ✅ Fork the practice repository
  2. ✅ Make your first commit
  3. ✅ Open a pull request

For homework:

  • All homework submitted via GitHub
  • Use branches for different problems
  • Commit messages will be graded!
  • Code review required before final submission

Start building good habits now - they’ll serve you for your entire career!

Thank You!

Questions?

Credits:

  • Jack Atkinson for the original “Git for Science” presentation
  • GitHub for Git training resources
  • The science community for embracing open science

Contact:

“Commit early, commit often, push regularly!”