Introduction

Brood tables, also called run reconstructions, utilize annual estimates of the total run (commercial catch plus escapement), and samples of ages, to estimate the number of recruits per age class. These data are useful for salmon biologists to understand salmon productivity and salmon life histories.

These data can come in a number of different formats, but generally follow the pattern of: 1. Rows for each brood year 2. Columns for the estimated number of fish in each age class.

See the example below, showing a chinook salmon brood table from the Karluk River, Alaska:

If you are interested in analysing trends across multiple stocks, the many different ways brood tables can be presented make integrating multiple brood tables a big challenge. In this exercise we will introduce some ways to reformat, merge, and reshape a set of brood tables from around the state of Alaska. These datasets were gathered as part of the SASAP working group research. We will start with the raw (unmodified) data for each stock, run each stock’s brood table through a separate R script to normalize the format, integrate all of the brood tables together, and finally do an exploratory analysis across stocks.

Datasets

As part of the SASAP project, brood tables for 5 chinook salmon stocks were collected. Table 2.1 shows a list of these stocks, along with regional information.

Reformatting

First we need to change the brood table column names so that they are in a consistent format. Here we show an example of how one of our source tables (Karluk River, from above) is reformatted. Prior to diving into reformatting, it is good to have an idea of what columns will be necessary in your merged data table. To identify different stocks, certainly a column will be needed for the stock name (usually a river name) and the species. If you wish to analyze across different regions or areas, you may want to have a column for region as well. Of course you will also need columns for brood year and all of the possible age classes you might encounter. A quality flag indicating whether data should be used in analysis or not based on set critera is also useful. Finally, it is also a good idea to have a unique identifier for each brood table that goes into the merged data table so that you can easily add data by stock, or summarize data by stock. With that in mind, we will be reformatting all of the brood tables so they have the following column names:

Stock.ID
UseFlag
BroodYear
TotalEscapement
R2
...
R8

Now let’s read in a brood table. The Karluk River table is on sheet 2. After a quick visual inspection in Excel, we can see that the header doesn’t start until the second row, so we should tell read_excel to skip the first two lines, and we should specify how many rows of data occur in the table (40):

karluk <- read_excel('/home/sfreund/Other-SASAP-edits/additional_westward_broods.xlsx', sheet = 2, skip = 2, n_max = 40)
colnames(karluk)

Now we need to fill in the information that is missing by adding a Stock.ID column. We will use this later to join to the stock information table to get the remainder of the stock specific information. Note that since we have no reason to suspect any data are not up to quality standards yet, we fill the UseFlag columns with 1s.

#add stock information columns
karluk$Stock.ID <- 102 #preassigned Stock.ID
karluk$UseFlag <- 1

There are some redundant and confusing columns here that we need to remove: the total recruits and recruits/spawner. Note that we can calculate these columns again later from the rest of the data.

karluk[c(8,9)]<-NULL

The rest of the columns are still not quite in the format that we decided on so we will rename them and reorder them.

colnames(karluk) <- c('BroodYear', 'TotalEscapement','R3','R4','R5','R6','R7','Stock.ID','UseFlag')

karluk <- karluk[, c('Stock.ID', 'UseFlag','BroodYear','TotalEscapement','R3','R4','R5','R6','R7')]

Strings in this dataset also include extraneous characters like commas and spaces, so we will remove those.

#remove commas
karluk$TotalEscapement <- gsub( ',', '', karluk$TotalEscapement)
karluk$R3 <- gsub( ',', '', karluk$R3)
karluk$R4 <- gsub( ',', '', karluk$R4)
karluk$R5 <- gsub( ',', '', karluk$R5)
karluk$R6 <- gsub( ',', '', karluk$R6)
karluk$R7 <- gsub( ',', '', karluk$R7)

#remove whitespace
karluk$BroodYear<-str_trim(karluk$BroodYear)
karluk$TotalEscapement<-str_trim(karluk$TotalEscapement)
karluk$R3<-str_trim(karluk$R3)
karluk$R4<-str_trim(karluk$R4)
karluk$R5<-str_trim(karluk$R5)
karluk$R6<-str_trim(karluk$R6)
karluk$R7<-str_trim(karluk$R7)

Next we write the file to a new directory where all of the individually reformatted brood tables will be kept.

write.csv(karluk, 'home/sfreund/Other-SASAP-edits/ReformBroodTables/Karluk_chinook.csv', row.names = F)

We write a script similar to this for each stock that we would like to integrate into the brood table analysis. Note that many of the original brood tables come in multi-tab spreadsheets containing multiple stocks with each tab corresponding to a unique stock. Therefore some of these stock-specific scripts access the same original excel document, but different sheets. If the sheets are all formatted the same, the script contains a function which processes each sheet and writes individual .csv files for each stock.

Here is a listing of the reformatting scripts for this data package:

## [1] "Ayakulik_chinook_formatting.R" "Chignik_chinook_formatting.R" 
## [3] "Karluk_chinook_formatting.R"   "Nelson_chinook_formatting.R"

Notice that there are only 4 scripts; a separate script was not necessary for the Goodnews River table, because it was already in the format that we use.

Merging

Now that all of the data tables have been reformatted, we can read them in and merge them into a single file. A variety of functions can be used to combine data. We use the function bind_rows from the dplyr package here to merge the files together since some of these files have age classes that others don’t. The bind_rows function adds in all columns that exist, filling in the columns that don’t have values for a particular stock with NA.

path1 <- '/home/sfreund/Other-SASAP-edits/ReformBroodTables/chinooks'

library(dplyr)
files <- dir(path1, include.dirs = FALSE)
brood <- do.call(bind_rows,
                 lapply(file.path(path1,files),
                        read.csv, 
                        stringsAsFactors = F))

To get the additional information (like stock, region, source) in this dataframe, we read in the stock info file and join it to the main data frame. The join is done using the Stock.ID as a key for both tables.

stock_info <- read.csv('/home/sfreund/Other-SASAP-edits/StockInfo_Chinook.csv', stringsAsFactors = F)
brood <- left_join(brood, stock_info)

Some of the columns from the stock info table aren’t necessary, and the order is a bit strange so here we select the columns we want and reorder them, and then sort by Stock ID.

brood <- brood[, c('Stock.ID', 'Species', 'Stock', 'Region', 'UseFlag', 'BroodYear',
                   'TotalEscapement', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8')]
brood <- brood[order(brood$Stock.ID),]

Quality Assurance

Acceptable values

Now we should do some checks to make sure that the data are up to standards and that there were no errors during the reformatting process. Below is a list of expected values for this dataset that we can check for:

  • Year values are numeric and within reasonable bounds (1900s to present), no NA values
  • Age class columns are numeric with values greater than or equal to zero, NA allowed
  • Age class columns are integers
  • Escapement values are numeric and greater than 0, NA allowed

These checks are very simple to make - we will simply use the range function on these three columns.

Checking BroodYear

range(brood$BroodYear)
## [1] 1976 2015

The year data look good.

Checking Age Classes

Here we use the select function, combined with matches and a regular expression, to select only the age class columns.

range(select(brood, matches("R[2-8]")), na.rm = T)
## [1]     0 16276

The data in all of the age class columns look good in terms of range. Now we check to see if the values look like integers, because the data were meant to be interpreted as integers (number of fish).

R2 R3 R4 R5 R6 R7 R8
NA 407 5063 1698 8647 655 NA
NA 1173 833 4314 3480 726 NA
NA 282 2539 2434 4752 492 NA
NA 367 745 1562 1799 294 NA
NA 644 3137 3258 6547 1183 NA
NA 999 1810 4327 7462 676 NA

The data look like integers (NAs are allowed).

Checking Escapement

range(brood$TotalEscapement, na.rm = T)
## [1]   669 24425

The escapement data look good.

Infilling small age classes

Here, we systematically estimate a few minor return age groups so that we can include more years in the analysis. Typically, the infilling procedure involves the most recent year where an old age group has not yet returned. If age-specific abundance is NA and abundance is less than 10% of the total brood return in the previous 2 brood years, then the abundance for that age group is estiamted using the mean of the age-specific values from the previous 2 brood years. If the age-specific value that is to be filled in is greater than 10% of the total brood return, then that value remains NA.

First we define a helpful function which computes the sum and keeps the sum as NA if all values are of NA, but otherwise ignores NA values in the calculation.

nasum <- function(x) if (all(is.na(x))) NA else sum(x,na.rm=T)

First we calculate the total number of recruits across age classes.

brood$TotalRecruits <- apply(select(brood, matches("R[2-8]")), 1, nasum)

Then we define the function described in the paragraph above, which we will run over each stock.

brood_infilling <- function(b){
#get list of column indices for age classes
colnums <- grep("R[2-8]", colnames(b))

#run over each column
for (c in 1:length(colnums)){


#find NA values in a column, remove if they are in the first 3 rows of data
i <- which(is.na(b[,colnums[c]]) == T)
r <- which(i < 2)
i <- i[-r]
    #move on to the next column if there aren't any NAs
    if (length(i) == 0){
        next
    }
    else if (length(i) > 0){
      #for loop to calculate average values for each of the NAs
      for (z in 1:length(i)){
          k <- i[z]-2 #create indices to average over
          l <- i[z]-1
          
          if (k > 0){ #make sure index is positive
            if (mean(b[k:l,colnums[c]]) >= 0.1*b$TotalRecruits[i[z]] | 
                is.na(mean(b[k:l,colnums[c]]) == T) | 
                is.na(.1*b$TotalRecruits[i[z]] == T)){ #if mean value is greater than 10% of the recruits for that year, or either the total recruits or mean are NA, set mean back to NA
              b[i[z],colnums[c]] <- NA
          }
            else if (mean(b[k:l,colnums[c]]) < 0.1*b$TotalRecruits[i[z]]){ #otherwise set the NA value to the mean of the previous 3 years
              b[i[z],colnums[c]] <- mean(b[k:l,colnums[c]])
            }
          }
          else if (k <= 0){ #if you can't get a mean over the prev 3 years set to NA
              b[i[z],colnums[c]] <- NA
          }
        }
    }
}

return(b)
}

Now run the function.

s <- unique(brood$Stock)

brood_filled <- c()

for (i in 1:length(s)){
    t <- subset(brood, Stock == s[i])
    f <- brood_infilling(t)
    brood_filled <- rbind(f, brood_filled)
}

Now calculate the total number of recruits based on the new, infilled age class columns.

brood_filled$TotalRecruits <- NULL
brood_filled$TotalRecruits <- apply(select(brood_filled, matches("R[2-8]")), 1, nasum)

Finding complete age classes

Now, we need to set the UseFlag based on whether a row has a complete age class estimation. A year of data is considered complete when all of the major age classes have real values (not NA). A major age class is defined for these purposes as any age class where the long term mean is greater than 1% of the total recruits for that population.

First we calculate the long-term mean for each age class.

brood_flagged <- c()
for (i in 1:length(s)){
    t <- subset(brood_filled, Stock == s[i])
    #calculate longterm mean for each age class, removing NAs
    brood_means <- apply(select(t, matches("R[2-8]")), 2, mean, na.rm = T)
    #find which long term means are greater than 1% of the total recruits
    age_cols <- which(brood_means > 0.01* mean(t$TotalRecruits, na.rm = T))
    #subset for these columns
    subs <- t[, names(age_cols)]
    #calculate sum along rows, if there is an NA in a row, sum will be equal to NA
    sums <- rowSums(subs)
    #find which rows have NA values in the major age classes
    ind <- which(is.na(sums) == T)
    #assign the UseFlag for these age classes to be 0
    t$UseFlag[ind] <- 0
    brood_flagged <- rbind(brood_flagged, t)
}

Write

Now write the file.

write.csv(brood_flagged, '/home/sfreund/Other-SASAP-edits/ReformBroodTables/Chinook_CompiledBroodTables.csv', 
          row.names = F)