Version Control with git and Github


Peyton Politewicz, MS Data Science - Data Scientist @ UCI CSC/BERD



Adapted from Sep Dadsetan, PhD - Executive Director, RWD Analytics at


linkedin.com/in/sepdadsetan/

Introduction

Overview of today’s agenda:

  • Importance of Version Control
  • Getting Started - Account & CLI
  • Some Starter Github Vocabulary
  • ‘Hello World!’

Picture This:

Imagine you’re contributing on a biostatistics research team. You’ve written an incredible analysis portion of a research paper: it cleans the data smoothly, performs all your necessary calculations, generates all your figures, and structures all your outputs so that your co-authors breathe deep sighs of relief.

Everyone is happy, BUT… something always happens. A journal rejects your team’s paper with conditional edits; collaborators make demands; a PI thinks something new needs to be added. How do you deal with that ask without damaging your beautiful work and having to start from scratch?

Don’t do this…

Section 1: Importance of Version Control

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. It’s a critical practice in software development and has become increasingly important in data science as well. Here’s why:

1. Collaboration

Data science often involves collaboration among multiple team members. Version control systems like Git allow multiple people to work on the same project simultaneously. They can make changes, submit them for review, and merge them into the main project without overwriting each other’s work.

2. Reproducibility

In data science, it’s crucial to be able to reproduce results. By keeping track of the exact versions of code, data, and libraries used, version control helps ensure that experiments can be replicated precisely. This is vital for both validation of results and for future work that builds on previous findings.

3. Experimentation

Data scientists often need to try out different models, features, or hyperparameters. Version control allows them to create branches where they can experiment without affecting the main project. If an experiment is successful, it can be merged back into the main codebase; if not, it can be discarded without any mess.

4. Accountability

Version control maintains a detailed log of who made what changes and when. This is essential for understanding the evolution of a project and can be crucial for regulatory compliance in some industries.

5. Backup and Recovery

Mistakes happen, and code can be lost or broken. Version control acts as a continuous backup, allowing you to revert to previous versions if something goes wrong. This can save hours or even days of work.

6. Code Quality

Through code reviews and the ability to track changes over time, version control can help maintain and improve the quality of the code. It encourages best practices and helps prevent “code rot” where code becomes unmaintained and outdated.

7. Integration with Other Tools

Version control systems often integrate with other tools used in data science, such as continuous integration systems, project management tools, and code notebooks. This can streamline workflows and make the entire process more efficient.

Summary

Version control is not just a tool for software developers. It’s an essential part of modern data science, enabling collaboration, reproducibility, experimentation, and much more. By providing a structured way to manage changes, track history, and integrate with other tools, version control systems like Git help data scientists work more efficiently and effectively.

Getting Started: Account

  • Head on over to https://github.com/ to make an account
  • Use a tasteful personal email that’s going to last
  • Have a form of 2FA ready (I use Authy on my phone)
  • In the background: Grab yourself a versatile text editor
    • Notepad++ for Windows
    • Sublime Text for Mac

Getting Started: CLI

  • Why CLI?
  • Git vs. GitHub

MAC

  • Open your terminal, run:
git --version (for git)
brew install gh (for github)

Windows

  • Visit https://git-scm.com/download/win for Git
  • Open powershell or terminal, run:
winget install --id GitHub.cli (for github)

Linux (Oh no)

  • From terminal:
&& sudo mkdir -p -m 755 /etc/apt/keyrings \
&& wget -qO- https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo tee /etc/apt/keyrings/githubcli-archive-keyring.gpg > /dev/null \
&& sudo chmod go+r /etc/apt/keyrings/githubcli-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
&& sudo apt update \
&& sudo apt install gh -y

CLI Authentication

  • Run the command “gh auth login”
  • Follow the onscreen prompts - you’ll need to bounce between the terminal and your web browser
  • Select ‘GitHub.com’ and then ‘SSH’
  • Good names for your SSH key include firstnamePublic, lastnameKey, and iDontKnowWhatImDoing
  • Once you’re done, your terminal actions will all be signed with your github credentials!

Section 2: Dive into Git

  • Introduction to Git as a distributed version control system
  • Basic commands and workflow (add/stage, commit, log)
  • Branching and merging in Git
  • Demonstration with a simple project

Git Vocab - Conceptualized

Imagine you’re working on a big puzzle with your friends, and you want to make sure that you can always go back to see how the puzzle looked at different stages. Version control is like taking a picture of the puzzle every time you add a few pieces.

Here’s how it works:

(Repository) - Starting the Puzzle: When you first open the puzzle box, that’s like starting a new project. In version control, this is called creating a “repository.” It’s the place where all the pictures of your puzzle will be stored.

(Commits) - Adding Pieces: Every time you and your friends add some pieces to the puzzle and are happy with how it looks, you take a picture. In version control, this picture is called a “commit.” It’s a snapshot of how everything looks at that moment.

(Branches) - Trying New Things: Sometimes, you might want to try putting together a part of the puzzle without messing up what you’ve already done. You can take the current picture and make a copy to work on. This is called a “branch.” If you like what you’ve done in the branch, you can add it back to the main puzzle. If not, you can just throw that copy away.

(Collaboration) - Working with Friends: Version control lets you and your friends work on the puzzle together without getting in each other’s way. You can all take turns adding pieces and taking pictures, and if someone makes a mistake, you can always look at the previous pictures to see what went wrong.

(History) - Going Back in Time: If you ever want to see how the puzzle looked at any earlier stage, you can look at the pictures you’ve taken along the way. This is the “history” in version control, and it lets you go back in time to see how things have changed.

(Backup) - Keeping Everything Safe: If something happens to the puzzle, like if your little sibling messes it up, you still have all the pictures you’ve taken. You can use those pictures to put the puzzle back the way it was.

So, version control is like a magical camera for your projects. It lets you take pictures of what you’re working on, try new things without worry, work with friends, and even go back in time if you need to. It’s a way to make sure that you can always see how your puzzle—or your project—has grown and changed.

Dictionary, Chronologically

Starting!

  • (Repository): A fancy term for project. Where everything starts.
    • Typically, I create my repos on the website because of certain shortcuts they offer
  • (Clone or Fork:) The act of taking a project and bringing it into your workspace.
    • Cloning duplicates the code as if you’re joining a project that already exists midstream
    • Forking splits off the code into your own space, and isolates you from the original project

Working

  • Go on GH, create a repo for your COSMOS work
  • Include a readme, and add the gitignore for R
  • Create a folder on your computer to work from
  • Navigate to that folder in the terminal and run:
    • gh repo clone username/reponame
    • You also get a copy-pasteable command from the green ‘Code’ button on GH
  • Now, the README.md and .gitignore should download to your drive!

Working, further

  • Git doesn’t track and update everything automatically
  • We need to (stage) the files that we want to sync
  • Let’s create a sample text file - test.txt
  • Add anything (polite) you’d like, then save
  • Finally, type git status in the terminal

Working, further

Working, further further

  • Now, do ‘git add test.txt’ to add it to github’s purview
  • Let’s go one further - it’s time to (commit) a change to our repo branch
  • Commits are snapshots of progress, a save function
    • git commit -m “my first commit!”
    • The ‘-m’ argument lets you add a commit message in quotes - always make sure to include the argument and a message.
    • If you don’t, everything explodes (you open an archaic command line text editing tool)

Committing commits

  • Now, commits are still local.
  • To update your repo on the github website, we’ll need to (push) our committed changes.
  • Try ‘git push origin main’
    • We’re telling git to update the origin remote repository with all our commits to the main branch.

Off On The Right Foot

Congratulations… You’re on your way!

This little example sets the stage for working solo with GitHub, and version control in general.

As we get closer to your capstone project, I’ll walk through an explanation of (branching) and (pulling), and how to smooth out working in teams.

Basic Git Usage and Concepts - Visualized

Adding and Committing

Git Build Flow, Visualized

Branching

Helpful Resources

Thank You’s

  • First of all, everyone for listening and participating
  • SoCal RUG and Sep for these materials
  • Developers behind these wonderful tools
    • Linus Torvalds, Junio Hamano, et. al. for Git
    • The software development community!