Brood Table Data Processing

Introduction

Brood tables, also called run reconstructions, utilize annual estimates of the total run (commercial catch plus escapement), and samples of ages, to estimate the number of recruits per age class. These data are useful for salmon biologists to understand salmon productivity and salmon life histories.

These data can come in a number of different formats, but generally follow the pattern of: 1. Rows for each brood year 2. Columns for the estimated number of fish in each age class.

See the example below, showing a Sockeye salmon brood table from the Goodnews River, Alaska:

Sometimes other columns are included, such as return years in this example (again for Sockeye salmon) from Coghill Lake, Alaska:

If you are interested in analysing trends across multiple stocks, the many different ways brood tables can be presented make integrating multiple brood tables a big challenge. In this exercise we will intoduce some ways to reformat, merge, and reshape a set of brood tables from around the state of Alaska. These datasets were gathered as part of the SASAP working group research from a number of different ADFG offices. We will start with the raw (unmodified) data for each stock, run each stock’s brood table through a separate R script to normalize the format, integrate all of the brood tables together, and finally do an exploratory analysis across stocks.

Datasets

As part of the SASAP project, brood tables for 48 Sockeye salmon stocks were collected. Table 2.1 shows a list of these stocks, along with other regional and location information.

These stocks range geographically from Washington to Alaska. Although temporal coverage varies by stock, many of the brood tables were updated in 2016, and some have reconstructions dating back to 1922.

Figure 2.1 indicates the approximate location of the salmon stocks in Table 2.1.

Figure 2.1: Location of stocks used in this data integration. Salmonid icon by Servien (vectorized by T. Michael Keesey) CC-BY-SA, available at Phylopic

Reformatting

First we need to change the brood table column names so that they are in a consistent format. Here we show an example of how one of our source tables (Coghill Lake, from above) is reformatted. Prior to diving into reformatting, it is good to have an idea of what columns will be necessary in your merged data table. To identify different stocks, certainly a column will be needed for the stock name (usually a river name) and the species. If you wish to analyze across different regions or areas, you may want to have a column for region as well. Of course you will also need columns for brood year and all of the possible age classes you might encounter. A quality flag indicating whether data should be used in analysis or not based on set critera is also useful. Finally, it is also a good idea to have a unique identifier for each brood table that goes into the merged data table so that you can easily add data by stock, or summarize data by stock. With that in mind, we will be reformatting all of the brood tables so they have the following column names:

Stock.ID
BroodYear
Region
Sub.Region
UseFlag
BroodYear
TotalEscapement
R0.1
...
R4.5

Now let’s read in the brood table. After a quick visual inspection in Excel, we can see that the header doesn’t start until the sixth row, so we should tell read_excel to skip the first six lines:

coghill_brood <- read_excel('/home/sjclark/BroodTables/Original/112_COGHILL_BROOD_TABLE.xls', skip = 6)

## Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
## shim, : '.Random.seed' is not an integer vector but of type 'NULL', so
## ignored

colnames(coghill_brood)

##  [1] "Year"           "Escpmt."        "Year__1"        "1.1"           
##  [5] "0.2"            "Year__2"        "0.3"            "Aged 1.2"      
##  [9] "Aged 2.1"       "Year__3"        "Aged 1.3"       "Aged 2.2"      
## [13] "Year__4"        "Aged 1.4"       "Aged 2.3"       "Year__5"       
## [17] "Aged 2.4"       "Return"         "Year__6"        "Return__1"     
## [21] "Return/spawner"

There are definitely some redundant and confusing columns here that we need to remove, and some we need to add according to the list above. First we remove the columns we don’t want - namely all of the return years, the total return, and recruits/spawner. Note that we can always calculate these columns again later from the rest of the data.

coghill_brood <- coghill_brood %>%
  select(-Year__1,
         -Year__2, 
         -Year__3, 
         -Year__4,
         -Year__5, 
         -Year__6, 
         -Return, 
         -Return__1, 
         -`Return/spawner`)

Now we need to fill in the information that is missing by adding a Stock.ID column. We will use this later to join to the stock information table to get the remainder of the stock specific information.

#add stock information columns
coghill_brood$Stock.ID <- 139 #preassigned Stock.ID
coghill_brood$UseFlag <- 1

Note that since we have no reason to suspect any data are not up to quality standards yet, we fill the UseFlag columns with 1s.

The rest of the columns are still not quite in the format that we decided on so we will rename them.

names(coghill_brood) <- c("BroodYear", 
                          "TotalEscapement", 
                          "R1.1", 
                          "R0.2", 
                          "R0.3", 
                          "R1.2", 
                          "R2.1", 
                          "R1.3", 
                          "R2.2",  
                          "R1.4",
                          "R2.3",
                          "R2.4",
                          "Stock.ID",
                          "UseFlag")

Finally, you may want to reorder the columns into something more intuitive.

coghill_brood <- coghill_brood[,c('Stock.ID', 'UseFlag',
                                  'BroodYear','TotalEscapement','R0.2', 'R0.3',
                                  'R1.1','R1.2','R1.3','R1.4', 
                                  'R2.1','R2.2','R2.3','R2.4')]

And very lastly, write the file to a new directory where all of the individually reformatted brood tables will be kept.

write.csv(coghill_brood, 'data/reformatted/Coghill_sockeye.csv', row.names = F)

We write a script similar to this for each stock that we would like to integrate into the brood table analysis. Note that many of the original brood tables come in multi-tab spreadsheets containing multiple stocks with each tab corresponding to a unique stock, therefore some of these stock-specific scripts access the same original excel document, but different sheets. In the case where the sheets are all formatted the same (Bristol Bay), the script contains a function which processes each sheet and writes individual .csv files for each stock.

Here is a listing of the original data files:

##  [1] "112_COGHILL_BROOD_TABLE.xls"                                        
##  [2] "2016 Bristol Bay Updated Brood Table 11.4.16.xlsx"                  
##  [3] "Canadian non-Fraser sox prod data 1 Sept 2017 from data portal.xlsx"
##  [4] "Chilkoot Lake brood table 2016.xlsx"                                
##  [5] "Copper_River_Catch_byAge_SOCKEYE.xls.xlsx"                          
##  [6] "Eshamy_Sockeye.csv"                                                 
##  [7] "Fraser_SX_brood_tables_July2017.xlsx"                               
##  [8] "Kenai and Kasilof Sockeye Salmon Brood Tables.xlsx"                 
##  [9] "Lake WA Brood Table 2015.xlsx"                                      
## [10] "Middle Fork Goodnews River Sockeye Brood Table.xlsx"                
## [11] "Redoubt Lake - updated brood table_GR.xlsx"                         
## [12] "SourceInfo.csv"                                                     
## [13] "source_information.csv"                                             
## [14] "Speel Lake sockeye run reconstruction 1986-2013_GR.xlsx"            
## [15] "StockInfo.csv"                                                      
## [16] "TogiakSockeye.xlsx"                                                 
## [17] "westward brood tables.xlsx"

And of the reformatting scripts:

##  [1] "Ayakulik_formatting.R"          "Bear_formatting.R"             
##  [3] "BlackLake_formatting.R"         "BristolBay_formatting.R"       
##  [5] "ChignikLake_formatting.R"       "Chilkoot_formatting.R"         
##  [7] "Coghill_formatting.R"           "Copper_formatting.R"           
##  [9] "do_not_run"                     "Eshamy_formatting.R"           
## [11] "FraserRiver_formatting.R"       "Frazer_formatting.R"           
## [13] "Goodnews_formatting.R"          "KarlukEarly_formatting.R"      
## [15] "KarlukLate_formatting.R"        "Kasilof_formatting.R"          
## [17] "Kenai_formatting.R"             "Nelson_formatting.R"           
## [19] "Redoubt_formatting.R"           "Speel_formatting.R"            
## [21] "UpperStationEarly_formatting.R" "UpperStationLate_formatting.R" 
## [23] "Washington_formatting.R"

Batch Reformatting

Once a reformatting script has been written for all of the data files that are going to be incorporated, we can source all of the scripts within the reformatting folder and run them, which will create all of the reformatted data files.

formatting_scripts <- dir("/home/sjclark/sasap-data/data-processing/broodTables/formatting/", 
                          full.names = TRUE, 
                          pattern = "*\\.[rR]")
sapply(formatting_scripts, source)

Merging

Now that all of the data tables have been reformatted, we can read them in and merge them into a single file. We use the function bind_rows here to merge the files together since some of these files have age classes that others don’t. The bind_rows function adds in all columns that exist, filling in the columns that don’t have values for a particular stock with NA.

path1 <- '/home/sjclark/BroodTables/Reformatted/'

files <- dir(path1, include.dirs = FALSE)
brood <- do.call(bind_rows,
                 lapply(file.path(path1,files),
                        read.csv, 
                        stringsAsFactors = F))

To get the additional information (like stock, region, source) in this dataframe, we read in the stock info file and join it to the main data frame. The join is done using the Stock.ID as a key for both tables.

stock_info <- read.csv('/home/sjclark/BroodTables/Original/StockInfo.csv', stringsAsFactors = F)

brood <- left_join(brood, stock_info)

Some of the columns from the stock info table aren’t necessary, and the order is a bit strange so here we select the columns we want and reorder them.

brood <- brood[, c('Stock.ID', 'Species', 'Stock','Ocean.Region','Region','Sub.Region','Jurisdiction', 'Lat', 'Lon', 'UseFlag','BroodYear',
                   'TotalEscapement','R0.1','R0.2', 'R0.3','R0.4',  'R0.5',
                   'R1.1','R1.2','R1.3','R1.4','R1.5', 
                   'R2.1','R2.2','R2.3','R2.4',
                   'R3.1','R3.2','R3.3','R3.4',
                   'R4.1','R4.2' ,'R4.3')]

Quality Assurance

Acceptable values

Now we should do some checks to make sure that the data are up to standards and that there were no errors during the reformatting process. Below is a list of expected values for this dataset that we can check for:

Year values are numeric and within reasonable bounds (1900s to present), no NA values
Age class columns are numeric with values greater than or equal to zero, NA allowed
Age class columns are integers
Escapement values are numeric and greater than 0, NA allowed

These checks are very simple to make - we will simply use the range function on these three columns.

Checking BroodYear

range(brood$BroodYear)

## [1] 1922 2017

The year data look good.

Checking Age Classes

Here we use the select function, combined with matches and a regular expression, to select only the age class columns.

range(select(brood, matches("R[0-9].[0-9]")), na.rm = T)

## [1]        0 47748183

The data in all of the age class columns look good in terms of range. Now we check to see if the values look like integers.

R0.1	R0.2	R0.3	R0.4	R0.5	R1.1	R1.2	R1.3	R1.4	R1.5	R2.1	R2.2	R2.3	R2.4	R3.1	R3.2	R3.3	R4.1	R4.2	R4.3
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	0	NA	NA	NA	0	NA	NA	0.0000000	NA	NA	NA
NA	NA	NA	NA	0	NA	NA	NA	0	0	NA	NA	13499.399	0	NA	0.0000000	0.0000016	NA	NA	NA
NA	NA	NA	0	0	NA	NA	13119.73	0	0	NA	20219.16	8577.297	0	0	0.0000002	0.0000000	NA	NA	NA
NA	NA	0	0	0	NA	331300.0	338122.23	0	0	0.0e+00	286853.66	50738.894	0	0	772.7989217	1312.3211884	NA	NA	NA
NA	0	0	0	0	0	121361.3	116448.34	0	0	8.1e-06	166914.94	43428.894	0	0	0.0000000	0.0000000	NA	NA	NA

Because much of these data were imported from Excel spreadsheets which use formulas to calculate the values in each cell for the recruits per age class, many of these recruit values contain real numbers as opposed to integers. Since the data were meant to be interpreted as integers (number of fish), we use the trunc function to round all of the values in the age class columns down to the nearest integer.

This line runs the trunc function only the columns in the brood data.frame which match the pattern that defines the age class columns

brood <- mutate_at(brood, vars(matches("R[0-9].[0-9]")), trunc)

Checking Escapement

range(brood$TotalEscapement, na.rm = T)

## [1]        0 24325926

Some escapement data have a value of 0 - which does not make sense in this context. Let’s examine these 0 values,

Stock.ID	Species	Stock	Ocean.Region	Region	Sub.Region	Jurisdiction	Lat	Lon	UseFlag	BroodYear	R0.2	R0.3	R1.1	R1.2	R1.3	R1.4	R2.2	R2.3	R2.4	R3.2	R4.1	R4.2	R4.3
163	Sockeye	Togiak	BS	Bristol Bay	Bristol Bay North	AK	58.95	160.45	1	1950	0	0	0	0	0	0	0	27842	0	0	NA	NA	NA
163	Sockeye	Togiak	BS	Bristol Bay	Bristol Bay North	AK	58.95	160.45	1	1951	0	0	0	0	98113	111	53034	8569	0	104	NA	NA	NA
163	Sockeye	Togiak	BS	Bristol Bay	Bristol Bay North	AK	58.95	160.45	1	1952	0	0	0	152473	58318	0	8640	5917	0	0	NA	NA	NA
163	Sockeye	Togiak	BS	Bristol Bay	Bristol Bay North	AK	58.95	160.45	1	1953	0	1114	0	31434	84135	0	8369	16206	0	0	NA	NA	NA
163	Sockeye	Togiak	BS	Bristol Bay	Bristol Bay North	AK	58.95	160.45	1	1954	104	0	66	19804	145542	0	11957	17007	0	2006	NA	NA	NA
163	Sockeye	Togiak	BS	Bristol Bay	Bristol Bay North	AK	58.95	160.45	1	1955	0	0	0	135517	189761	508	9022	38171	213	0	NA	NA	NA

These values appear to not be real 0 values (no spawners) but instead indicate that there are no data available. We can find and replace these values with NA.

brood$TotalEscapement[which(brood$TotalEscapement == 0)] <- NA

Infilling small age classes

Here, we systematically estimate a few minor return age groups so that we can include more years in the analysis. Typically, the infilling procedure involves the most recent year where an old age group has not yet returned. If age-specific abundance is NA and abundance is less than 10% of the total brood return in the previous 2 brood years, then the abundance for that age group is estiamted using the mean of the age-specific values from the previous 2 brood years. If the age-specific value that is to be filled in is greater than 10% of the total brood return, then that value remains NA.

First we define a helpful function which computes the sum and keeps the sum as NA if all values of NA, but otherwise ignores NA values in the calculation.

nasum <- function(x) if (all(is.na(x))) NA else sum(x,na.rm=T)

First we calculate the total number of recruits across age classes.

brood$TotalRecruits <- apply(select(brood, matches("R[0-9].[0-9]")), 1, nasum)

Then we define the function described in the paragraph above, which we will run over each stock.

brood_infilling <- function(b){
#get list of column indices for age classes
colnums <- grep("R[0-9].[0-9]", colnames(b))

#run over each column
for (c in 1:length(colnums)){


#find NA values in a column, remove if they are in the first 3 rows of data
i <- which(is.na(b[,colnums[c]]) == T)
r <- which(i < 2)
i <- i[-r]
    #move on to the next column if there aren't any NAs
    if (length(i) == 0){
        next
    }
    else if (length(i) > 0){
      #for loop to calculate average values for each of the NAs
      for (z in 1:length(i)){
          k <- i[z]-2 #create indices to average over
          l <- i[z]-1
          
          if (k > 0){ #make sure index is positive
            if (mean(b[k:l,colnums[c]]) >= 0.1*b$TotalRecruits[i[z]] | 
                is.na(mean(b[k:l,colnums[c]]) == T) | 
                is.na(.1*b$TotalRecruits[i[z]] == T)){ #if mean value is greater than 10% of the recruits for that year, or either the total recruits or mean are NA, set mean back to NA
              b[i[z],colnums[c]] <- NA
          }
            else if (mean(b[k:l,colnums[c]]) < 0.1*b$TotalRecruits[i[z]]){ #otherwise set the NA value to the mean of the previous 3 years
              b[i[z],colnums[c]] <- mean(b[k:l,colnums[c]])
            }
          }
          else if (k <= 0){ #if you can't get a mean over the prev 3 years set to NA
              b[i[z],colnums[c]] <- NA
          }
        }
    }
}

return(b)
}

Now run the function.

s <- unique(brood$Stock)

brood_filled <- c()

for (i in 1:length(s)){
    t <- subset(brood, Stock == s[i])
    f <- brood_infilling(t)
    brood_filled <- rbind(f, brood_filled)
}

Now calculate the total number of recruits based on the new, infilled age class columns.

brood_filled$TotalRecruits <- NULL
brood_filled$TotalRecruits <- apply(select(brood_filled, matches("R[0-9].[0-9]")), 1, nasum)

Finding complete age classes

Now, we need to set the UseFlag based on whether a row has a complete age class estimation. A year of data is considered complete when all of the major age classes have real values (not NA). A major age class is defined for these purposes as any age class where the long term mean is greater than 1% of the total recruits for that population.

First we calculate the long-term mean for each age class.

brood_flagged <- c()
for (i in 1:length(s)){
    t <- subset(brood_filled, Stock == s[i])
    #calculate longterm mean for each age class, removing NAs
    brood_means <- apply(select(t, matches("R[0-9].[0-9]")), 2, mean, na.rm = T)
    #find which long term means are greater than 1% of the total recruits
    age_cols <- which(brood_means > 0.01* mean(t$TotalRecruits, na.rm = T))
    #subset for these columns
    subs <- t[, names(age_cols)]
    #calculate sum along rows, if there is an NA in a row, sum will be equal to NA
    sums <- rowSums(subs)
    #find which rows have NA values in the major age classes
    ind <- which(is.na(sums) == T)
    #assign the UseFlag for these age classes to be 0
    t$UseFlag[ind] <- 0
    brood_flagged <- rbind(brood_flagged, t)
}

Kvichak Correction

This should be 17505000 salmon because FRI-UW estimated that about 5 million spawners died (out of 22505000) while trying to swim over the cascade at Newhalen R during extremely high water.

brood_flagged$TotalEscapement[which(brood_flagged$BroodYear == 1980 
                                    & brood_flagged$Stock == 'Kvichak')] <- 17505000

Additional use considerations

If there are no escapement counts, we cannot calculate recruits/spawner, and therefore should not use the data.

brood_flagged$UseFlag[which(is.na(brood_flagged$TotalEscapement) == T)] <- 0

We toggle off use of the Alagnak and Speel stocks

brood_flagged$UseFlag[which(brood_flagged$Stock == 'Alagnak' | brood_flagged$Stock == 'Speel')] <- 0

Since no other stocks have data going back this far, we toggle off the early years of Black Lake and Chignik Lake.

brood_flagged$UseFlag[which(brood_flagged$Stock == 'Black Lake' & brood_flagged$BroodYear < 1965)] <- 0
brood_flagged$UseFlag[which(brood_flagged$Stock == 'Chignik Lake' & brood_flagged$BroodYear < 1965)] <- 0

Coghill has estimates in a rare age class (1.4) for only two recent years, which toggles off the data despite that age class being a minor one. Here, we toggle the Coghill data back on.

brood_flagged$UseFlag[which(brood_flagged$Stock == 'Coghill' & brood_flagged$BroodYear < 2007 & brood_flagged$BroodYear > 1961)] <- 1

The Portage stock has a smattering of NA values in realtively small age classes between 2003-2011. Here we toggle that back on.

brood_flagged$UseFlag[which(brood_flagged$Stock == 'Portage' & brood_flagged$BroodYear < 2012 & brood_flagged$BroodYear > 2002)] <- 1

We also turn Seymour back on, after a relatively small age class was no longer reported starting in 2005.

brood_flagged$UseFlag[which(brood_flagged$Stock == 'Seymour' & brood_flagged$BroodYear < 2011 & brood_flagged$BroodYear > 2004)] <- 1

Write

Now write the file

write.csv(brood_flagged, '/home/sjclark/BroodTables/BroodTables.csv', row.names = F)