Important Functions and Packages
- I’m using
SWMPr::import_local
to pull in and qaqc my data files.
- This function works best on zip files downloaded from the CDMO’s Advanced Query System: http://cdmo.baruch.sc.edu/aqs/. I do a big download every year and keep it all in a folder I call “Data-latest” (you’ll see that in my code below).
- You can import all files from a site by only specifying the site name, without a year. I only care about 2017 here, but if I wanted all the data, I’d just say
import_local(datapath, "gndblwq")
.
- The default behavior for the
qaqc()
function is to only keep data with (0) flags. I like to keep suspect (1) and corrected (5) data too, so I’ve specified all three using qaqc_keep()
.
- The
select
function from dplyr
lets me pick and choose by column name. I’m using a “-” in front of column names that represent columns I don’t want: select everything MINUS these columns. If I only wanted to keep a column or two, it would read select(datetimestamp, sal)
and be a smaller data frame.
- The pipe function,
%>%
, lets me do multiple things at once. I think of it like “and then” - so these lines of code say:
- “import this data file, and then
- run the qaqc step on it, and then
- remove the columns that I don’t want."
# look for data here
datapath <- "C:/Users/kimberly.cressman/Desktop/Main Docs/Data-latest"
# import data
dat_bl <- import_local(datapath, "gndblwq2017") %>%
qaqc(qaqc_keep = c(0, 1, 5)) %>%
select(-level, -clevel, -chlfluor)
dat_bh <- import_local(datapath, "gndbhwq2017") %>%
qaqc(qaqc_keep = c(0, 1, 5)) %>%
select(-level, -clevel, -chlfluor)
dat_bc <- import_local(datapath, "gndbcwq2017") %>%
qaqc(qaqc_keep = c(0, 1, 5)) %>%
select(-level, -clevel, -chlfluor)
dat_pc <- import_local(datapath, "gndpcwq2017") %>%
qaqc(qaqc_keep = c(0, 1, 5)) %>%
select(-level, -clevel, -chlfluor)
Data Shaping
Next, I want to get all of the salinities into a single data frame, with the site name as the column name, so I can easily put them all on a plot and color by site. Surely there’s a better way, but here’s what I’ve gotten to work.
I’m reading the code in my head like this:
- “bl_sal is what I want out of this.
- Start with dat_bl, and then:
- select the datetimestamp and sal columns, and then:
- rename the sal column to be BL."
bl_sal <- dat_bl %>%
select(datetimestamp, sal) %>%
rename(BL = sal)
See what I mean about the pipe letting you string a bunch of commands together easily?
And I’m just doing that for each site.
bh_sal <- dat_bh %>%
select(datetimestamp, sal) %>%
rename(BH = sal)
bc_sal <- dat_bc %>%
select(datetimestamp, sal) %>%
rename(BC = sal)
pc_sal <- dat_pc %>%
select(datetimestamp, sal) %>%
rename(PC = sal)
Now I’m going to glue them all together, adding one at a time through dplyr
’s left_join
function. Read this as:
- “all_sal will be the output. Take bl_sal, and then
- left_join bh_sal to it; and then
- left_join bc_sal to that; and then
- left_join pc_sal."
all_sal <- bl_sal %>%
left_join(bh_sal) %>%
left_join(bc_sal) %>%
left_join(pc_sal)
Here’s the head of the data frame - need to make sure this worked! I got a little worried when I saw all NAs at the top of BL’s list, but then I remembered the sonde’s batteries died in December 2016. So this isn’t actually a problem with anything I’ve done in R.
head(all_sal, 10)
## datetimestamp BL BH BC PC
## 1 2017-01-01 00:00:00 NA 21.8 26.1 27.3
## 2 2017-01-01 00:15:00 NA 21.2 26.9 27.3
## 3 2017-01-01 00:30:00 NA 20.1 26.4 27.4
## 4 2017-01-01 00:45:00 NA 20.6 26.4 27.5
## 5 2017-01-01 01:00:00 NA 20.4 26.7 27.4
## 6 2017-01-01 01:15:00 NA 20.5 23.4 27.3
## 7 2017-01-01 01:30:00 NA 20.6 22.9 27.3
## 8 2017-01-01 01:45:00 NA 20.0 23.7 27.2
## 9 2017-01-01 02:00:00 NA 19.7 21.4 27.1
## 10 2017-01-01 02:15:00 NA 14.9 21.4 27.0
Above is what you’d see in your console in R. What I’ll do for the rest of this post is make the tables pretty using the kable
function from the knitr
package. If you’re not making html files with your code, don’t worry about kable
; just do the normal head(dataframe)
thing. Here’s how the same information looks different:
kable(head(all_sal, 10), align = "c", caption = "first 10 salinity readings of 2017 for each site")
first 10 salinity readings of 2017 for each site
2017-01-01 00:00:00 |
NA |
21.8 |
26.1 |
27.3 |
2017-01-01 00:15:00 |
NA |
21.2 |
26.9 |
27.3 |
2017-01-01 00:30:00 |
NA |
20.1 |
26.4 |
27.4 |
2017-01-01 00:45:00 |
NA |
20.6 |
26.4 |
27.5 |
2017-01-01 01:00:00 |
NA |
20.4 |
26.7 |
27.4 |
2017-01-01 01:15:00 |
NA |
20.5 |
23.4 |
27.3 |
2017-01-01 01:30:00 |
NA |
20.6 |
22.9 |
27.3 |
2017-01-01 01:45:00 |
NA |
20.0 |
23.7 |
27.2 |
2017-01-01 02:00:00 |
NA |
19.7 |
21.4 |
27.1 |
2017-01-01 02:15:00 |
NA |
14.9 |
21.4 |
27.0 |
Just a little more reshaping to do; for easy grouping in ggplot, we need data in a long format. This is pretty easy using the dplyr
package’s gather
function.
I’m also throwing in a mutate
line with fct_relevel
(from the forcats
package) because R automatically places my sites in alphabetical order, but I want them in order of highest to lowest freshwater input. This matters for their arrangement on the plots.
all_sal_long <- all_sal %>%
gather(key = "site", value = "sal", -datetimestamp) %>%
mutate(site = factor(site),
site = fct_relevel(site, "BH", "BC", "BL", "PC"))
kable(head(all_sal_long), align = "c")
2017-01-01 00:00:00 |
BL |
NA |
2017-01-01 00:15:00 |
BL |
NA |
2017-01-01 00:30:00 |
BL |
NA |
2017-01-01 00:45:00 |
BL |
NA |
2017-01-01 01:00:00 |
BL |
NA |
2017-01-01 01:15:00 |
BL |
NA |
kable(tail(all_sal_long), align = "c")
140155 |
2017-12-31 22:30:00 |
PC |
25.4 |
140156 |
2017-12-31 22:45:00 |
PC |
25.4 |
140157 |
2017-12-31 23:00:00 |
PC |
25.4 |
140158 |
2017-12-31 23:15:00 |
PC |
25.3 |
140159 |
2017-12-31 23:30:00 |
PC |
25.2 |
140160 |
2017-12-31 23:45:00 |
PC |
25.2 |