Looping Over Data Sets

Last updated on 2024-10-04

Estimated time: 20 minutes

Overview

Questions

  • How can I process many data sets with a single command?

Objectives

  • Be able to read and write globbing expressions that match sets of files.
  • Use glob to create lists of files.
  • Write for loops to perform operations on files given their names in a list.

If you did not run the commands from episode 5 in this Colab session, you will need to load the pandas library and make your Google Drive accessible:

PYTHON

import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
file_location = "drive/MyDrive/lc-python/"

You’ll need to grant Google all the permissions it requests to make your Google Drive accessible to Colab.

Uploading files to Google Drive allows the data to persist over time when using Colab. To save time now, run wget to download the files directly to the cloud:

BASH

!wget https://github.com/jlchang/cb-python-intro-lesson-template/raw/refs/heads/main/episodes/files/data.zip
!unzip data.zip

PYTHON

file_location = ""

Remember that next time you use Colab, you’ll need to get these files again unless you follow the Setup instructions to copy the files to Google Drive.

Use a for loop to process files given a list of their names.


If you recall from episode 06, the pd.read_csv() function takes a text string referencing a filename as an argument. If we have a list of strings that point to our filenames, we can loop through the list to read in each CSV file as a DataFrame. Let’s print out the maximum value from the ‘ytd’ (year to date) column of each DataFrame.

PYTHON

for filename in ['data/2011_circ.csv', 'data/2012_circ.csv']:
  data = pd.read_csv(file_location + filename)
  print(filename, data['ytd'].max())

OUTPUT

data/2011_circ.csv 966720
data/2012_circ.csv 937649

Use glob to find sets of files whose names match a pattern.


Fortunately, we don’t have to type out the list of filenames by hand. We can use a Python library called glob to work with paths and files in a convenient way. In Unix, the term “globbing” means “matching a set of files with a pattern”. Glob gives us some nice pattern-matching options:

  • * will “match zero or more characters”
  • ? will “match exactly one character”

The glob library contains a function also called glob to match file patterns. For example, glob.glob('*.txt') would match all files in the current directory with names that end with .txt.
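Because glob matches against real files on disk, a minimal self-contained sketch of the two wildcards uses a temporary directory with made-up filenames (not the lesson data):

```python
import glob
import os
import tempfile

# Create a temporary directory holding a few empty files to match against.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['2011_circ.csv', '2012_circ.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    # * matches zero or more characters: every file ending in .csv
    csvs = sorted(glob.glob(os.path.join(tmp, '*.csv')))
    # ? matches exactly one character: "201" plus any single character
    one_char = sorted(glob.glob(os.path.join(tmp, '201?_circ.csv')))

    print([os.path.basename(p) for p in csvs])      # ['2011_circ.csv', '2012_circ.csv']
    print([os.path.basename(p) for p in one_char])  # ['2011_circ.csv', '2012_circ.csv']
```

Note that notes.txt is excluded by both patterns, because neither pattern ends in .txt.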

Let’s create a list of the usage data CSV files. Because the .glob() argument includes a filepath in single quotes, we’ll use double quotes around our f-string.

PYTHON

import glob
print(f"all csv files in data directory: {glob.glob(file_location + 'data/*.csv')}")

OUTPUT

all csv files in data directory: ['drive/MyDrive/lc-python/data/2017_circ.csv', 'drive/MyDrive/lc-python/data/2016_circ.csv', 'drive/MyDrive/lc-python/data/2011_circ.csv', 'drive/MyDrive/lc-python/data/2022_circ.csv', 'drive/MyDrive/lc-python/data/2018_circ.csv', 'drive/MyDrive/lc-python/data/2012_circ.csv', 'drive/MyDrive/lc-python/data/2019_circ.csv', 'drive/MyDrive/lc-python/data/2013_circ.csv', 'drive/MyDrive/lc-python/data/2020_circ.csv', 'drive/MyDrive/lc-python/data/2021_circ.csv', 'drive/MyDrive/lc-python/data/2015_circ.csv', 'drive/MyDrive/lc-python/data/2014_circ.csv']

Use glob and for to process batches of files.


Now we can use glob in a for loop to create DataFrames from all of the CSV files in the data directory. To use tools like glob, it helps if files are named and stored consistently, so that simple patterns will find the right data. You can learn more about how to name files to improve machine-readability from the Open Science Framework article on file naming.

PYTHON

for csv in glob.glob(file_location + 'data/*.csv'):
  data = pd.read_csv(csv)
  print(csv, data['ytd'].max())

OUTPUT

drive/MyDrive/lc-python/data/2017_circ.csv 634570
drive/MyDrive/lc-python/data/2016_circ.csv 670077
drive/MyDrive/lc-python/data/2011_circ.csv 966720
drive/MyDrive/lc-python/data/2022_circ.csv 301340
drive/MyDrive/lc-python/data/2018_circ.csv 614313
drive/MyDrive/lc-python/data/2012_circ.csv 937649
drive/MyDrive/lc-python/data/2019_circ.csv 581151
drive/MyDrive/lc-python/data/2013_circ.csv 821749
drive/MyDrive/lc-python/data/2020_circ.csv 276878
drive/MyDrive/lc-python/data/2021_circ.csv 271811
drive/MyDrive/lc-python/data/2015_circ.csv 694528
drive/MyDrive/lc-python/data/2014_circ.csv 755189

The order of the output above may be different for you, depending on what operating system you use. The glob library doesn’t have its own internal system for sorting filenames; instead it relies on the operating system’s filesystem. Since operating systems can differ, it is helpful to sort the glob results explicitly so that everyone sees the same order, regardless of operating system. You can do that by applying the built-in sorted() function to the glob.glob list.

PYTHON

for csv in sorted(glob.glob(file_location + 'data/*.csv')):
    data = pd.read_csv(csv)
    print(csv, data['ytd'].max())

OUTPUT

drive/MyDrive/lc-python/data/2011_circ.csv 966720
drive/MyDrive/lc-python/data/2012_circ.csv 937649
drive/MyDrive/lc-python/data/2013_circ.csv 821749
drive/MyDrive/lc-python/data/2014_circ.csv 755189
drive/MyDrive/lc-python/data/2015_circ.csv 694528
drive/MyDrive/lc-python/data/2016_circ.csv 670077
drive/MyDrive/lc-python/data/2017_circ.csv 634570
drive/MyDrive/lc-python/data/2018_circ.csv 614313
drive/MyDrive/lc-python/data/2019_circ.csv 581151
drive/MyDrive/lc-python/data/2020_circ.csv 276878
drive/MyDrive/lc-python/data/2021_circ.csv 271811
drive/MyDrive/lc-python/data/2022_circ.csv 301340

Appending DataFrames to a list


In the example above, we can print out results from each DataFrame as we cycle through them, but it would be more convenient if we saved all of the yearly usage data in these CSV files into DataFrames that we could work with later on.

Convert Year in filenames to a column

Before we join the data from each CSV into a single DataFrame, we’ll want to keep track of which year each dataset comes from. To do that, we can capture the year from each file name and save it to a new column for all of the rows in each CSV. Let’s see how this works by looping through each of our CSVs.

PYTHON

for csv in sorted(glob.glob(file_location + 'data/*.csv')):
        year = csv[29:33] #the 30th to 33rd characters in each file match the year
                          # for files we downloaded to Google Drive for Colab
        print(f'filename: {csv} year: {year}')

PYTHON

for csv in sorted(glob.glob(file_location + 'data/*.csv')):
        year = csv[5:9] # characters at index 5 to 8 (the 6th through 9th)
                        # hold the year for files downloaded using wget
        print(f'filename: {csv} year: {year}')

OUTPUT

filename: drive/MyDrive/lc-python/data/2011_circ.csv year: 2011
filename: drive/MyDrive/lc-python/data/2012_circ.csv year: 2012
filename: drive/MyDrive/lc-python/data/2013_circ.csv year: 2013
filename: drive/MyDrive/lc-python/data/2014_circ.csv year: 2014
filename: drive/MyDrive/lc-python/data/2015_circ.csv year: 2015
filename: drive/MyDrive/lc-python/data/2016_circ.csv year: 2016
filename: drive/MyDrive/lc-python/data/2017_circ.csv year: 2017
filename: drive/MyDrive/lc-python/data/2018_circ.csv year: 2018
filename: drive/MyDrive/lc-python/data/2019_circ.csv year: 2019
filename: drive/MyDrive/lc-python/data/2020_circ.csv year: 2020
filename: drive/MyDrive/lc-python/data/2021_circ.csv year: 2021
filename: drive/MyDrive/lc-python/data/2022_circ.csv year: 2022

Once we’ve saved the year variable from each file name, we can assign it to every row in a column for each CSV by assigning data['year'] = year inside of the loop.
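Assigning a single scalar value to a new column broadcasts that value to every row. A minimal sketch with a made-up two-row DataFrame (not the lesson data):

```python
import pandas as pd

df = pd.DataFrame({'ytd': [120059, 9611]})
df['year'] = '2011'  # the single value is broadcast to every row
print(df['year'].tolist())  # ['2011', '2011']
```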

To collect the data from each CSV we’ll use a list “accumulator” (as we covered in the last episode) and append each DataFrame to an empty list. You can create an empty list by assigning a variable to empty square brackets before the loop begins.

PYTHON

dfs = [] # an empty list to hold all of our DataFrames
counter = 1

for csv in sorted(glob.glob(file_location + 'data/*.csv')):
  year = csv[29:33] # the 30th to 33rd characters match the year
                    # for files we downloaded to Google Drive for Colab
  data = pd.read_csv(csv) 
  data['year'] = year 
  print(f'{counter} Saving {len(data)} rows from {csv}')
  dfs.append(data)
  counter += 1

print(f'Number of saved DataFrames: {len(dfs)}')

PYTHON

dfs = [] # an empty list to hold all of our DataFrames
counter = 1

for csv in sorted(glob.glob(file_location + 'data/*.csv')):
  year = csv[5:9] # characters at index 5 to 8 (the 6th through 9th)
                  # hold the year for files downloaded using wget
  data = pd.read_csv(csv) 
  data['year'] = year 
  print(f'{counter} Saving {len(data)} rows from {csv}')
  dfs.append(data)
  counter += 1

print(f'Number of saved DataFrames: {len(dfs)}')

OUTPUT

1 Saving 80 rows from drive/MyDrive/lc-python/data/2011_circ.csv
2 Saving 79 rows from drive/MyDrive/lc-python/data/2012_circ.csv
3 Saving 80 rows from drive/MyDrive/lc-python/data/2013_circ.csv
4 Saving 80 rows from drive/MyDrive/lc-python/data/2014_circ.csv
5 Saving 80 rows from drive/MyDrive/lc-python/data/2015_circ.csv
6 Saving 80 rows from drive/MyDrive/lc-python/data/2016_circ.csv
7 Saving 80 rows from drive/MyDrive/lc-python/data/2017_circ.csv
8 Saving 80 rows from drive/MyDrive/lc-python/data/2018_circ.csv
9 Saving 81 rows from drive/MyDrive/lc-python/data/2019_circ.csv
10 Saving 81 rows from drive/MyDrive/lc-python/data/2020_circ.csv
11 Saving 81 rows from drive/MyDrive/lc-python/data/2021_circ.csv
12 Saving 81 rows from drive/MyDrive/lc-python/data/2022_circ.csv
Number of saved DataFrames: 12
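As an aside, Python’s built-in enumerate() can replace the manual counter variable used above. This sketch numbers items in a plain made-up list the same way the loop numbers CSV files:

```python
# enumerate() yields (index, item) pairs; start=1 makes the count 1-based,
# matching the manual counter in the loop above.
files = ['2011_circ.csv', '2012_circ.csv', '2013_circ.csv']
labels = []
for counter, name in enumerate(files, start=1):
    labels.append(f'{counter} {name}')

print(labels[0])  # 1 2011_circ.csv
```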

We can check to make sure the year was properly saved by looking at the first DataFrame in the dfs list. If you scroll to the right you should see the first two rows of the year column both have the value 2011.

PYTHON

dfs[0].head(2) # we can add a number to head() to ask for a specific number of rows

OUTPUT

|     | branch      | address               | city    | zip code | january | february | march | april | may  | june  | july  | august | september | october | november | december | ytd    | year |
|-----|-------------|-----------------------|---------|----------|---------|----------|-------|-------|------|-------|-------|--------|-----------|---------|----------|----------|--------|------|
| 0   | Albany Park | 5150 N. Kimball Ave.  | Chicago | 60625.0  | 8427    | 7023     | 9702  | 9344  | 8865 | 11650 | 11778 | 11306  | 10466     | 10997   | 10567    | 9934     | 120059 | 2011 |
| 1   | Altgeld     | 13281 S. Corliss Ave. | Chicago | 60827.0  | 1258    | 708      | 854   | 804   | 816  | 870   | 713   | 480    | 702       | 927     | 787      | 692      | 9611   | 2011 |

Concatenating DataFrames


There are many different ways to merge, join, and concatenate pandas DataFrames. The pandas documentation has good examples of how to use the .merge(), .join(), and .concat() methods to accomplish different goals. Because all of our CSVs have exactly the same columns, we can concatenate them vertically (stacking all of the rows from each DataFrame in order) using concat(), which takes a list of DataFrames as its first argument. Since we aren’t using a specific column as a pandas index, we’ll set the ignore_index argument to True.

PYTHON

df = pd.concat(dfs, ignore_index=True)
f'Number of rows in df: {len(df)}'

OUTPUT

'Number of rows in df: 963'
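To see what ignore_index=True does, here is a minimal sketch with two made-up single-row DataFrames (not the lesson data):

```python
import pandas as pd

a = pd.DataFrame({'branch': ['Albany Park'], 'ytd': [120059]})
b = pd.DataFrame({'branch': ['Altgeld'], 'ytd': [9611]})

# Each input DataFrame has row index 0; ignore_index=True gives the
# combined result a fresh 0..n-1 index instead of repeating 0, 0.
combined = pd.concat([a, b], ignore_index=True)
print(combined.index.tolist())  # [0, 1]
```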

Determining Matches

Which of these files would be matched by the expression glob.glob('data/*circ.csv')?

  1. data/2011_circ.csv
  2. data/2012_circ_stats.csv
  3. circ/2013_circ.csv
  4. Both 1 and 3

Only item 1 is matched by the wildcard expression data/*circ.csv: item 2 ends with _stats.csv rather than circ.csv, and item 3 is in the circ/ directory rather than data/.

Minimum circulation per year

Modify the following code to print out the lowest value in the ytd column from each year/file.

PYTHON

import pandas as pd
for csv in sorted(glob.glob(file_location + 'data/*.csv')):
    data = pd.read_csv(____)
    print(csv, data['____'].____())
    

PYTHON

import pandas as pd
for csv in sorted(glob.glob(file_location + 'data/*.csv')):
    data = pd.read_csv(csv)
    print(csv, data['ytd'].min())
    

Imagine you had a folder named outputs/ that included many different file types. Use glob and a for loop to iterate through all of the CSV files in the folder whose names begin with data. Save them to a list called dfs, and then use pd.concat() to concatenate all of the DataFrames from the dfs list into a new DataFrame called new_df. You can assume that all of the data CSV files have the same columns, so they will concatenate cleanly using pd.concat().

PYTHON

import pandas as pd

dfs = []

for csv in sorted(glob.glob(file_location + 'outputs/data*.csv')):
    data = pd.read_csv(csv)
    dfs.append(data)
    
new_df = pd.concat(dfs, ignore_index=True)
    

Key Points

  • Use a for loop to process files given a list of their names.
  • Use glob.glob to find sets of files whose names match a pattern.
  • Use glob and for to process batches of files.
  • Use a list “accumulator” to append a DataFrame to an empty list [].
  • The .merge(), .join(), and .concat() methods can combine pandas DataFrames.