library(sf)  
library(terra)  
library(tidyverse)  
library(spData)
library(magrittr)

Attribute Data Operations

Attribute data is non-spatial information associated with geographic (geometry) data. Understanding how to manipulate attribute data in vector and raster datasets is important because you will likely be performing operations on these data throughout your analysis. Fortunately, in ‘sf’ objects these operations are analagous to working with data frames. For raster objects, each cell contains a single attribute value so it is important that each raster layer is consistently defined.


Vector Attribute Manipulation

The most significant difference between data frames and ‘sf’ objects is that ‘sf’ objects include the ‘geometry’ column. A useful feature of this column is that it is “sticky” and will remain appended to the attribute data unless explicitly removed.

For example, read in the ‘world’ dataset and then select out a single column.

summary(world)
##     iso_a2           name_long          continent          region_un        
##  Length:177         Length:177         Length:177         Length:177        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   subregion             type              area_km2             pop           
##  Length:177         Length:177         Min.   :    2417   Min.   :5.630e+04  
##  Class :character   Class :character   1st Qu.:   46185   1st Qu.:3.755e+06  
##  Mode  :character   Mode  :character   Median :  185004   Median :1.040e+07  
##                                        Mean   :  832558   Mean   :4.282e+07  
##                                        3rd Qu.:  621860   3rd Qu.:3.075e+07  
##                                        Max.   :17018507   Max.   :1.364e+09  
##                                                           NA's   :10         
##     lifeExp        gdpPercap                   geom    
##  Min.   :50.62   Min.   :   597.1   MULTIPOLYGON :177  
##  1st Qu.:64.96   1st Qu.:  3752.4   epsg:4326    :  0  
##  Median :72.87   Median : 10734.1   +proj=long...:  0  
##  Mean   :70.85   Mean   : 17106.0                      
##  3rd Qu.:76.78   3rd Qu.: 24232.7                      
##  Max.   :83.59   Max.   :120860.1                      
##  NA's   :10      NA's   :17

NOTE: It is useful to index to the package specifically when using certain functions like select(). This is especially true when working with ‘sf’ and ‘terra’ which share a number of function names.

Which columns remain after using select()?

If you really need to remove the ‘geometry’ column from an ‘sf’ object then you can use the conveniently named st_drop_geometry() function. Removing the ‘geometry’ column can be useful because operations on the attribute data alone can sometimes be faster. However, it is usually helpful to keep the ‘geometry’ column included and there shouldn’t be any instances during this workshop when you’ll need to drop it.

Vector Attribute Subsetting

Vector attribute subsetting can be accomplished using the base R operators such as [] and subset() or the dplyr functions filter(), slice(), or select().

Base R and dplyr functions both work well for subsetting ‘sf’ objects while maintaining them as ‘sf’ objects. However, if you use operators such as $ or [[]] or the dplyr function pull() then you should expect to receive vectors rather than ‘sf’ objects. Vectors, by definition, will only store information from a single column which means you will lose the ‘geometry’ column.

Use ‘world’ to create a new ‘sf’ object that only includes data on name, continent, region, area, and population for countries that have an area greater than 50,000 km^2.

Vector Attribute Aggregation

A common part of a data analysis workflow is to group and summarize data according to grouping variables (i.e. aggregation). The ‘tidy’ method of programming in R provides an excellent workflow and we recommend using the pipe operator (%>%) as well. Pipes are efficient and easier to read than the base R equivalent of subsetted functions.

Here is an example of a pipe that will group the ‘world’ dataset by ‘continent’ and then calculate the total population

world_cont_pop <- world %>% 
  group_by(continent) %>%
  summarize(pop = sum(pop, na.rm=T))
world_cont_pop
## Simple feature collection with 8 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 8 × 3
##   continent                      pop                                        geom
##   <chr>                        <dbl>                              <GEOMETRY [°]>
## 1 Africa                  1154946633 MULTIPOLYGON (((43.1453 11.46204, 42.71587…
## 2 Antarctica                       0 MULTIPOLYGON (((-180 -89.9, 180 -89.9, 180…
## 3 Asia                    4311408059 MULTIPOLYGON (((104.37 -1.084843, 104.0108…
## 4 Europe                   669036256 MULTIPOLYGON (((-180 64.97971, -179.4327 6…
## 5 North America            565028684 MULTIPOLYGON (((-132.71 54.04001, -133.18 …
## 6 Oceania                   37757833 MULTIPOLYGON (((-180 -16.55522, -179.9174 …
## 7 Seven seas (open ocean)          0 POLYGON ((68.935 -48.625, 68.8675 -48.83, …
## 8 South America            412060811 MULTIPOLYGON (((-66.95992 -54.89681, -66.4…

Let’s try to write a pipe that only includes Asia, Europe, and Africa and summarizes each continent by total area.

Here is another slightly more involved example. If you have time, then add a comment to each line to describe what it does.

world_agg <- world %>% 
  st_drop_geometry() %>%                      
  dplyr::select(pop, continent, area_km2) %>% 
  group_by(continent) %>%                     
  summarize(Pop = sum(pop, na.rm = TRUE), Area = sum(area_km2), N = n()) %>%
  mutate(Density = round(Pop / Area)) %>%     
  top_n(n = 3, wt = Pop) %>%                   
  arrange(desc(N))                            

Vector Attribute Joining

Another common part of a data analysis workflow will be joining different datasets. ‘dplyr’ provides us with a number of useful functions for this such as left_join(), inner_join(), and full_join() and these same functions work with ‘sf’ objects as well.

‘spData’ includes another dataset called ‘coffee_data’. Try to join ‘world’ and ‘coffee_data’ using one of the dplyr join functions.

Did it work? Note that joins will only work if there is a common column name between the two objects being joined.

If you use left_join() then it is also important to note that the result of this function will usually match the formatting of the first argument. (i.e. left_join(world, coffee_data) will return an ‘sf’ object because ‘world’ was an ‘sf’ object but left_join(coffee_data, world) will return a data frame)

Creating Attributes

Another important part of your workflow might include creating new columns based off of data from exisiting columns. The ‘dplyr’ functions mutate() and transmute() will aid you in accomplishing this.

Describe what this code chunk does and then add another column that log transforms the population

world %>% 
  mutate(pop_dens = pop / area_km2)
## Simple feature collection with 177 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 177 × 12
##    iso_a2 name_long continent region_un subregion type  area_km2     pop lifeExp
##  * <chr>  <chr>     <chr>     <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>
##  1 FJ     Fiji      Oceania   Oceania   Melanesia Sove…   1.93e4  8.86e5    70.0
##  2 TZ     Tanzania  Africa    Africa    Eastern … Sove…   9.33e5  5.22e7    64.2
##  3 EH     Western … Africa    Africa    Northern… Inde…   9.63e4 NA         NA  
##  4 CA     Canada    North Am… Americas  Northern… Sove…   1.00e7  3.55e7    82.0
##  5 US     United S… North Am… Americas  Northern… Coun…   9.51e6  3.19e8    78.8
##  6 KZ     Kazakhst… Asia      Asia      Central … Sove…   2.73e6  1.73e7    71.6
##  7 UZ     Uzbekist… Asia      Asia      Central … Sove…   4.61e5  3.08e7    71.0
##  8 PG     Papua Ne… Oceania   Oceania   Melanesia Sove…   4.65e5  7.76e6    65.2
##  9 ID     Indonesia Asia      Asia      South-Ea… Sove…   1.82e6  2.55e8    68.9
## 10 AR     Argentina South Am… Americas  South Am… Sove…   2.78e6  4.30e7    76.3
## # … with 167 more rows, and 3 more variables: gdpPercap <dbl>,
## #   geom <MULTIPOLYGON [°]>, pop_dens <dbl>

Manipulating Raster Objects

Raster objects can not be manipulated like vector objects since these represent continuous surfaces rather than geometric objects (points, lines, and polygons). For raster objects, we will primarily use the [] operators for subsetting. First, we can create a couple raster objects that we can work with.

This chunk creates a 6x6 raster that will represent elevation (continuous data).

elev = rast(nrows = 6, ncols = 6, resolution = 0.5, 
            xmin = -1.5, xmax = 1.5, ymin = -1.5, ymax = 1.5,
            vals = 1:36)

This chunk creates another 6x6 raster that represents grain sizes (categorical data).

grain_order = c("clay", "silt", "sand")
grain_char = sample(grain_order, 36, replace = TRUE)
grain_fact = factor(grain_char, levels = grain_order)
grain = rast(nrows = 6, ncols = 6, resolution = 0.5, 
             xmin = -1.5, xmax = 1.5, ymin = -1.5, ymax = 1.5,
             vals = grain_fact)

Cell values can be obtained and overwritten using the [] operators.

elev[1,1]
##   lyr.1
## 1     1
elev[1,1] <- 0
elev[1,1]
##   lyr.1
## 1     0

c() can be used to combine layers to make a multi-layer raster

multi <- c(elev,grain)

Summarizing raster objects

‘terra’ includes helpful functions to gather descriptive statistics from raster objects. A few of these include some familiar functions such as boxplot(), hist(), and summary()

Try a few of these out and compare with your neighbor. (‘grain’ had cells that were randomly assigned so you should expect different answers for that raster)


For these exercises we will use the us_states and us_states_df datasets from the spData package:

us_states is a spatial object (of class sf), containing geometry and a few attributes (including name, region, area, and population) of states within the contiguous United States. us_states_df is a data frame (of class data.frame) containing the name and additional variables (including median income and poverty level, for the years 2010 and 2015) of US states, including Alaska, Hawaii and Puerto Rico. The data comes from the United States Census Bureau, and is documented in ?us_states and ?us_states_df.

E1. Create a new object called us_states_name that contains only the NAME column from the us_states object. What is the class of the new object and what makes it geographic?

E2. Select columns from the us_states object which contain population data. Obtain the same result using a different command (bonus: try to find three ways of obtaining the same result). Hint: try to use helper functions, such as contains or starts_with from dplyr (see ?contains).

E3. Find all states with the following characteristics (bonus find and plot them):

Belong to the Midwest region. Belong to the West region, have an area below 250,000 km2and in 2015 a population greater than 5,000,000 residents (hint: you may need to use the function units::set_units() or as.numeric()). Belong to the South region, had an area larger than 150,000 km2 or a total population in 2015 larger than 7,000,000 residents.

E4. What was the total population in 2015 in the us_states dataset? What was the minimum and maximum total population in 2015?

E5. How many states are there in each region?

E6. What was the minimum and maximum total population in 2015 in each region? What was the total population in 2015 in each region?

E7. Add variables from us_states_df to us_states, and create a new object called us_states_stats. What function did you use and why? Which variable is the key in both datasets? What is the class of the new object?

E8. us_states_df has two more rows than us_states. How can you find them? (hint: try to use the dplyr::anti_join() function)

E9. What was the population density in 2015 in each state? What was the population density in 2010 in each state?

E10. How much has population density changed between 2010 and 2015 in each state? Calculate the change in percentages and map them.

E11. Change the columns’ names in us_states to lowercase. (Hint: helper functions - tolower() and colnames() may help.)

E12. Using us_states and us_states_df create a new object called us_states_sel. The new object should have only two variables - median_income_15 and geometry. Change the name of the median_income_15 column to Income.

E13. Calculate the change in the number of residents living below the poverty level between 2010 and 2015 for each state. (Hint: See ?us_states_df for documentation on the poverty level columns.) Bonus: Calculate the change in the percentage of residents living below the poverty level in each state.

E14. What was the minimum, average and maximum state’s number of people living below the poverty line in 2015 for each region? Bonus: What is the region with the largest increase in people living below the poverty line?

E15. Create a raster from scratch with nine rows and columns and a resolution of 0.5 decimal degrees (WGS84). Fill it with random numbers. Extract the values of the four corner cells.

E16. What is the most common class of our example raster grain (hint: modal())?

E17. Plot the histogram and the boxplot of the data(dem, package = “spDataLarge”) raster.