---
title: "PopPsiSeq and Helper Functions"
output: 
  pdf_document: 
    toc: true
  html_document: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{PopPsiSeq and Helper Functions}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",

  fig.width = 8,
  fig.height = 5,
  fig.align = "center"
  )
```

```{r setup, include=FALSE}
#library(PopPsiSeqR)
devtools::load_all()

library("ggplot2")
library("dplyr")
library("tidyr")
library("ggbio")
library("rtracklayer")
library("patchwork")

```

# Introduction

PopPsiSeq is a bioinformatics workflow intended for analyzing the results of evolution & resequencing experiments. It builds on previous software published in @Earley2011 and has been used to analyze hybrid sterility between Drosophila simulans and mauritiana (@Kanippayoor2024).

PopPsiSeqR bundles its key functions into a convenient R package. This vignette presents a basic use case.

# Hybridize/Select/Backcross/Resequence Experiments

The underlying idea in an experiment of this sort is that the genetic basis of a trait can be mapped by hybridizing genomes that differ in that trait, then selecting for the trait associated with one genome while backcrossing to the other. The procedure is sketched out in @Earley2011 :

> Populations with a divergent complex trait are hybridized and then selected for a specific phenotype across multiple generations of backcrosses. [...] The trait of interest is selected for each generation, and offspring are mated to the other parental line expressing the unselected phenotype (introgression and backcrossing). Over multiple generations of selection and backcrossing this hybrid population becomes homozygous for the majority of the unselected parent’s genome while loci from the selected parent, which contribute to the selected trait, remain.

The experiment design is analogous to older techniques which use discrete, pre-identified molecular markers. Modern high-throughput sequencing allows finer mapping at a greater scale, but requires modern software to extract meaningful results from raw sequence.

# Building and Loading Example Data

After an experiment has been performed and reads have been sequenced, a typical workflow ahead of PopPsiSeq will include

-   Read filtering and quality control

-   Mapping of reads to a reference genome

-   Calling genome variants from mapped reads

Example workflows starting from sequenced reads are documented on github: <https://github.com/csoeder/PopPsiSeq>

In this vignette we will start with an example dataset, in which allele frequencies have been measured for three populations:

-   a Drosophila simulans population, constructed from sequenced lab strains downloaded from NCBI; this was treated as a parental population

-   a Drosophila sechellia population, constructed from sequenced wild and lab strains downloaded from NCBI; this was treated as a parental population; for some examples these flies will be subdivided by strain source.

-   an offspring population from evolve & resequence experiments, in which simulans/sechellia hybrids were repeatedly backcrossed to sechellia while selecting for simulans-like traits ( \@Earley2011 and unpublished followup experiments)

```{bash include=TRUE, echo = T, eval=F}
bedtools intersect -wa -wb -a {input.par1_frq} -b  {input.par2_frq} \
  | bedtools intersect -wa -wb -a - -b  {input.off_frq} \
  | cut -f 1,2,4-6,10,12,16,18 | tr ":" "\\t" \ 
  | awk -v OFS='\\t' '{{print $1,$2,$2+1,"0","0","+",$4,$6,$3,$7,$8,$10,$11,$13}}' \
  > {output.frqShft_out}.pre
```

The *{input.\*\_frq}* files are the BEDfile-ification () of the VCFtools (@Danecek2011) allele frequency utility:

```{bash include=TRUE, echo = T, eval=F}

vcftools --vcf {input.vcf_in} --out {output.report_out} --freq;
cat {output.report_out}.frq | tail -n +2 | awk -v OFS='\\t' '{{print $1,$2,$2+1,$4,$5,$6}}'

```

The full allele frequency measurements have been trimmed to the neighborhood of a feature identified in @Earley2011 .

```{bash include=TRUE, echo=T, eval=F}
shifted_freqs=data/intermediate/freq_shift/freebayes/all2.cleanEarleyAll_with_allSim2_and_allSech.vs_droSim1.bwaUniq.frqShift.pre
# this file is generated by the PopPsiSeq workflows documented elsewhere

bedtools intersect -wa -a $shifted_freqs  -b <( echo -e "chr2L\t10000000\t15000000" )  \
  > inst/extdata/merged_frequencies.example_data.tbl 


```

This is then loaded with the PopPsiSeqR utility, **`import.freqtbl()`** :

```{r include=TRUE, echo=TRUE, eval=TRUE}

merged_frequencies.filename <- system.file("extdata", 
  "merged_frequencies.example_data.tbl", package = "PopPsiSeqR")
merged_frequencies.bg <- import.freqtbl(merged_frequencies.filename)
```

The data is loaded as an extended GRanges object; the key fields (beyond those related to the BED6 structure) are the allele frequencies ( columns named *\*\_af* ). Let's take a look.

```{r include=TRUE, echo=TRUE, eval=TRUE}

head(merged_frequencies.bg %>% as.data.frame() 
     %>% select(-c(ends_with("_count"), "name", "score")) 
     %>% GRanges(), n=5)

```

```{r, include=TRUE, echo=FALSE}

autoplot(merged_frequencies.bg %>% head(n=36), aes(y = selected_parent_alt_af), geom='point', color = "red", alpha=0) +geom_hline(yintercept = 0, color ="black", linetype = "dotted")  +geom_hline(yintercept = 1, color ="black", linetype = "dotted") + geom_point(color = "red", alpha=1) + geom_point(aes(y=backcrossed_parent_alt_af), color = "mediumblue", alpha = 0.5) + geom_segment(data= . %>% as.data.frame(),  aes(x=start, xend=start, y =selected_parent_alt_af, yend=backcrossed_parent_alt_af ), color = "black", alpha=0.25) + geom_point(aes(y=offspring_alt_af), color = "black", shape = "x", size=4) + geom_point(aes(y=offspring_alt_af), color = "black", shape = "o", size=4) + geom_text( data = . %>% as.data.frame() %>% mutate(why = rep(c(-0.1,-0.05), n()/2)) ,aes(label=ref,y = why) ) + geom_text( data = . %>% as.data.frame() %>% mutate(why = rep(c(1.05,1.1), n()/2)) ,aes(label=alt,y = why) ) + facet_wrap(~seqnames, scales = "free_x")  + theme_clear() + theme(axis.text.x = element_text(angle = 30, hjust = 1)) + labs(y="Allele Frequency", title = "Allele Frequencies for Parental and Offspring Populations", subtitle = "small subset shown for clarity", caption ="blue dots - backcross parent (D. sechellia); red dots - selection parent (D. simulans); crosshactched circles - offspring",x= "coordinate (droSim1 reference genome)" ) 

```

Here the allele frequencies are plotted across the chromosome arm (for visibility, only the first few dozen diagnostic SNPs are shown). The allele frequencies of the population standing in for the selected-for parent species (D. simulans) are displayed in red; the allele frequencies of the population standing in for the backcrossed-to parent species (D. sechellia) are displayed in blue. The allele frequencies of the offspring population are displayed as a black crosshatched circle. The allele appearing in the reference genome (droSim1) is at the bottom at zero; the alternate allele appears at the top at 1. For example, at the very left of the plot we have the SNP described in the first line of the data table above: the backcross population and the offspring population both have an alternate allele frequency of 0 (that is, they are 100% G and 0% A), while the selected-for population carries the A allele at \~25%.

A few observations:

-   The population representing the selected-for parental species (D. simulans) tends to have a low alternate allele frequency; indeed, the distribution appears to be centered near zero, which is not surprising, considering that it is the degree of difference from a simulans reference genome which is being measured. However, this population is not uniformly identical to the reference genome. In fact, this population has much more density in the intermediate frequencies, than does the D. sechellia population; this is consistent with the higher within-population diversity of simulans generally (cite this

-   Although variants in the the D. sechellia population tend to be more fixed than variants in the D. simulans population, they are not necessarily fixed as the alternate allele; not only is their frequency distribution bimodal, but the peak at 0 is greater than that at 1, indicating that the allele found in the simulans reference genome is nearing fixation in sechellia more often than the alternative allele. (see below)

-   The frequency distribution of the offspring is similar to that of the backcrossed-to parent.

```{r, include=TRUE, echo=FALSE, eval=TRUE}

merged_frequencies.gath.full.df <- merged_frequencies.bg %>% as.data.frame() %>% select(c(selected_parent_alt_af, backcrossed_parent_alt_af, offspring_alt_af)) %>% gather(key="population", value = "allele_frequency") %>% mutate(population = gsub("_alt_af","",population)) %>% mutate(population = gsub("_"," ",population))


ggplot(merged_frequencies.gath.full.df) + geom_step( data = . %>% filter(population=="offspring") , aes(x=allele_frequency, group = population, color = population),  position = "identity", stat="bin", bins=31, alpha=0.666) + geom_step(  data = . %>% filter(population!="offspring") , aes(x=allele_frequency, group = population, color = population),  position = "identity", stat="bin", bins=31) +theme_bw() + scale_color_manual(values=c("red", "black", "blue"))  + labs(y="# variant sites", title = "Allele Frequency Spectra", subtitle = "(full example dataset, grouped by population)", caption ="", x= "frequency of alternate (non-reference) allele" ) 


```

-   The alternate allele often has a higher frequency in the sechellia population, than it does in the simulans population, as might be expected given its construction. However, this is not always true; in fact, in the majority of cases, the simulans population actually has the non-reference allele at a higher frequency than does the sechellia population. These are typically sites in which the simulans population is polymorphic while the sechellia population is more fixed; this is indicated by the bulk these cases having an allele frequency difference in the [-0.5,0] range. On the other hand, the sites where the simulans population is more like the reference have an allele frequency difference ranging from 0 to 1, with the bulk clustered at 1. This indicates that these are sites where the alternate allele tends to be fixed in the sechellia population, and the reference allele tends to be fixed in simulans. There are very few cases in which the alternate allele is fixed in sechellia and the reference allele is fixed in simulans. (comment on nature of reference genome)

```{r, include = TRUE, echo=FALSE, eval=TRUE}

merged_frequencies.diffhist.gg <- ggplot(merged_frequencies.bg %>% as.data.frame() %>% select(c(selected_parent_alt_af, backcrossed_parent_alt_af)) %>% mutate(alt_af_diff = backcrossed_parent_alt_af - selected_parent_alt_af)) + geom_histogram(aes(x=alt_af_diff), fill="darkgray", color ="darkgray", bins = 31) + geom_vline(xintercept = 0, color = "black", linetype = "dotted")  +theme_bw()+ labs(y="# variant sites", title = "Difference in frequency of alternate (nonreference) allele between parental populations", subtitle = " ", caption ="",x= "frequency difference (sechellia - simulans)" ) 
merged_frequencies.diffhist.gg
```

# Sitewise Calculations

Now that we have our data loaded, it's time to run PopPsiSeq's **`freqShifter()`**.

```{r, eval=TRUE, echo = TRUE, include = TRUE}
frequency_shifts.bg <- freqShifter(merged_frequencies.bg)

head(frequency_shifts.bg %>% as.data.frame() 
     %>% select(-c(ends_with("_count"), ends_with("_deltaF"),  "name", "score")) 
     %>% GRanges(), n=5)

```

Here's what's happening within the **freqShifter**:

Consider again our evolve and resequence experiment. It began with a hybridization step in which one fly was drawn from a population of D. simulans; at each variant site our expectation for the alternate allele is the alternate allele frequency in our stand-in simulans population. When we pick one fly from the D. sechellia population, we'd also expect the allele frequency at each site to be that of the allele frequency in our stand-in sechellia population.

In the absence of higher-order evolutionary mechanisms (e.g., genome incompatibilities or meiotic drive), we'd expect the F1 generation to have, at each site, an allele frequency exactly the average of its two parents. In the absence of (net) evolutionary forces, all future offspring would also share this intermediate allele frequency, per Hardy-Weinberg. This will therefor be our reference point for understanding the allele frequencies observed experimentally. The first step then is to tare our population frequencies relative to this value.

```{r, include = TRUE, echo=FALSE, warning=FALSE}
subsample_number <- 36
frequency_shifts.raw.gg <- autoplot(frequency_shifts.bg %>% head(n=subsample_number), aes(y = selected_parent_alt_af), geom='point', color = "red", alpha = 0) +geom_hline(yintercept = 0, color ="black", linetype = "dotted")  +geom_hline(yintercept = 1, color ="black", linetype = "dotted") + geom_point(aes(y=backcrossed_parent_alt_af), color = "mediumblue", alpha = 1, size=2,shape=21) + geom_point(aes(y=selected_parent_alt_af), color = "red", alpha = 1, size=2,shape=21)  + geom_segment(data= . %>% as.data.frame(), aes(x=start, xend=start, y =selected_parent_alt_af, yend=backcrossed_parent_alt_af ), color = "black", alpha=0.25)  + geom_point(aes(y=central), color = "black", alpha = 1, shape = "o", size=2) + geom_point(aes(y=offspring_alt_af), color = "black", size=3, shape = "x") + labs(y="Allele Frequency (raw)", title = "Taring the Frequencies to Hardy-Weinberg Expectation", subtitle = " ", caption ="",x= "" ) + theme_clear() + theme(axis.text.x = element_blank()) 


frequency_shifts.tared.gg <- autoplot(frequency_shifts.bg %>% head(n=subsample_number), aes(y = selected_parent_alt_af - central), geom='point', color = "red", alpha = 0) +geom_hline(yintercept = c(-0.5,0.5), color ="black", linetype = "dotted")  + geom_point(aes(y=backcrossed_parent_alt_af-central), color = "mediumblue", alpha = 1, size=2,shape=21) + geom_point(aes(y=selected_parent_alt_af - central), color = "red", alpha = 1, size=2,shape=21)  + geom_segment(data= . %>% as.data.frame(), aes(x=start, xend=start, y =selected_parent_alt_af-central, yend=backcrossed_parent_alt_af-central ), color = "black", alpha=0.25) + geom_point(y=0, color = "black", alpha = 1, shape = "o", size=2) + geom_point(aes(y=offspring_alt_af- central), color = "black", size=3, shape = "x") + labs(y="Allele Frequency (tared)", title = "", subtitle = "", caption ="blue circles - backcross parent (D. sechellia); red circles - selection parent (D. simulans);\nblack circles - neutral expectation; black x - offspring",x= "coordinate (droSim1 reference genome)" )  + theme_clear() + theme(axis.text.x = element_text(angle = 30, hjust = 1)) 


frequency_shifts.raw.gg@ggplot / frequency_shifts.tared.gg@ggplot

```

The next step is to polarize the allele frequencies such that they follow a consistent orientation. At each site there is an allele which is more common in the selected-for population and an allele which is more common in the backcrossed-to population. Sites at which the alternate allele is more common in the selected-for population are left alone; at the rest of the sites, the alternate and reference alleles were swapped. Conceptually this is like retroactively editing the reference genome to represent the backcrossed-to population; mathematically this is a simple change of sign to the tared data. Notice that the offspring allele frequencies are also polarized in this manner.

```{r, include = TRUE, echo=FALSE, warning=FALSE}


arrow_len <- unit(0.05, "npc")
frequency_shifts.oriented.gg <- autoplot(frequency_shifts.bg %>% head(n=subsample_number), aes(y = sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), geom='point', color = "red",  alpha = 0) + geom_point(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "mediumblue", alpha = 1, shape=21, size=2,) + geom_point(color = "red", alpha = 1,shape=21, size=2,) + geom_point(y=0, color = "black", alpha = 1, size=2,shape=21) + geom_point(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(offspring_alt_af- central)), color = "black", size=3, shape = "x") + geom_segment(data= . %>% as.data.frame(), aes(x=start, xend=start, yend =sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af-central)) , y=0, color = "black", alpha=0.75, arrow=arrow(ends = "last", length=arrow_len, type="open")) + geom_segment(data= . %>% as.data.frame(), aes(x=start, xend=start, yend = 0, y = sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "black", alpha=0.5, arrow=arrow(ends = "last", length=arrow_len, type="open"))  + labs(y="Allele Frequency (polarized)", title = "", subtitle = "", caption ="blue circles - backcross parent (D. sechellia); red circles - selection parent (D. simulans);\nblack circles - neutral expectation; black x - offspring",x= "coordinate (droSim1 reference genome)" )  + theme_clear() + theme(axis.text.x = element_text(angle = 30, hjust = 1))


frequency_shifts.tared.mod.gg <- frequency_shifts.tared.gg + labs( title = "Polarizing the Allele Frequencies to a Common Orientation", x="", caption = "") + theme(axis.text.x = element_blank())
frequency_shifts.tared.mod.gg@ggplot$layers[[2]]$aes_params$alpha <- 0

frequency_shifts.tared.mod.gg@ggplot / frequency_shifts.oriented.gg@ggplot


```

Now that all alternate alleles are characteristic of the selected-for population, allele frequency is a measure of selected-for population character, and a positive tared value indicates more selected-for character than backcrossed-to character, relative to the ideal initial hybridization.

Consider again the backcrossed-to population. Suppose there was not selection applied, just backcrossing. With each successive generation, we'd expect our hypothetical offsprings' allele frequencies to converge on those of the backcrossed-to population. We can thus place a guide on the plot which follows the tranformed allele frequency values from this population, to indicate how much our hypothetical offsprings' alleles would have to change, in order to identically resemble the backcrossed-to population. The change necessary to identically resemble the selected-for population is of equal magnitude but opposite direction, by construction. The guides for the parental populations are thus symmetric around the horizontal axis, but of variable distance from one another. The distance represents the difference in allele frequencies between the two populations at a given site, and thus how distinct the two populations are at that site; it necessarily falls in the range (0,1]. These data are not explicitly recorded in the data frame but correspond to the *center* field +/- ½ the *AF_difference* field.

```{r, include=TRUE, echo = FALSE, eval=TRUE}

autoplot(frequency_shifts.bg %>% head(n=subsample_number), aes(y = sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), geom='point', color = "red", shape="o", alpha = 0 ) + geom_line(aes(y= sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), color = "red", alpha=0.5) + geom_point(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "mediumblue", shape="o", alpha = 0 ) + geom_line(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "mediumblue", alpha=0.5) + geom_point(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(offspring_alt_af- central)), color = "black", size=3, shape = "x", alpha=0)  + labs(y="Allele Frequency (transformed)", title = "Ancestral Populations, Relative to Expected Equilibria", subtitle = " ", caption ="blue - backcross parent (D. sechellia); red - selection parent (D. simulans); black x - offspring",x= "coordinate (droSim1 reference genome)" ) + theme_clear()
```

Now consider a hypothetical population with no backcrossing, only selection. If the allele frequencies of the selected-for population have optimal fitness under the applied selection, then we'd expect our hypothetical offsprings' allele frequencies to converge on those of the selected-for population.

However, the optimal allele configuration is not necessarily total fixation (e.g., balancing or frequency-dependent selection); alternatively, the selected-for population might not be optimal. As a result, some sites can be polymorphic in the selected-for population. At such a site, our hypothetical offspring could conceivably attain a higher allele frequency than the parental population, up to fixation. In the same way, sites which are polymorphic in the backcrossed-to population allow our hypothetical offspring to attain allele frequencies lower than the parental population. These data are recorded in the *max_oriented_shift* and *min_oriented_shift* fields.

```{r, include=TRUE, eval=TRUE, echo=FALSE}


autoplot(frequency_shifts.bg %>% head(n=subsample_number), aes(y = sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), geom='point', color = "red", shape="o", alpha = 0 ) + geom_line(aes(y= sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), color = "red", alpha=0.5) + geom_line(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "mediumblue", alpha=0.5)  + geom_line(aes(y= max_oriented_shift), color = "red", alpha=1, linetype="dotted") + geom_line(aes(y= min_oriented_shift), color = "mediumblue", alpha=1, linetype="dotted") + labs(y="Allele Frequency (transformed)", title = "Ancestral Populations, Relative to Expected Equilibria", subtitle = "with gene fixation envelope", caption ="blue - backcross parent (D. sechellia); red - selection parent (D. simulans)",x= "coordinate (droSim1 reference genome)" ) + theme_clear()

```

This envelope defines hard bounds on our observations; all transformed allele frequencies must appear somewhere between them. Notice that the upper and lower bounds have a constant separation of 1 unit (the maximum an allele frequency can change), but are not necessarily centered on the horizontal axis. The center of the envelope can be offset by any value in the range [-0.5, 0.5]. This offset represents relative polymorphism in the two parental populations: if one population has an allele frequency of 1 and another population has an allele frequency of 0.5, their offspring will require a larger change to fix in the direction of the second population, than in the direction of the first. Importantly it is not just the presence of polymorphism, but whether there is more in one parent population than the other. The envelope is centered on the axis at an allele with 100% frequency in one population and 0% in another; it is also centered on the axis at an allele with 49% frequency in one population and 51% in the other.

This difference in polymorphism can be interpreted as a difference in the diagnostic value of one allele versus another. Suppose the backcrossed-to population is fixed for A at a given site, while the selected-for population is half As and half Ts. If, at that site in a hybrid offspring, a T is observed, we can be confident that it descended from the selected-for population; an A observation is less informative.

With these guides in place, we can put the measured data from our offspring population in context. Unsurprisingly, the offspring closely resembles the backcrossed-to population, to the point of overplotting:

```{r, include=TRUE, eval=TRUE, echo=FALSE}


autoplot(frequency_shifts.bg %>% head(n=subsample_number), aes(y = sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), geom='point', color = "red", shape="o", alpha = 0 ) + geom_line(aes(y= sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), color = "red", alpha=0.5) + geom_line(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "mediumblue", alpha=0.5)  + geom_line(aes(y= max_oriented_shift), color = "red", alpha=1, linetype="dotted") + geom_line(aes(y= min_oriented_shift), color = "mediumblue", alpha=1, linetype="dotted")  + geom_point(aes(y=mean_oriented_shift), color = "black", shape = "x", size=3)  + geom_line(aes(y=mean_oriented_shift), color = "black", alpha = 0.8)  + labs(y="Allele Frequency (transformed)", title = "Offspring Population, Relative to Expected Equilibria", subtitle = "with parental populations and fixation envelope", caption ="blue - backcross parent (D. sechellia); red - selection parent (D. simulans); black - offspring",x= "coordinate (droSim1 reference genome)" ) + theme_clear()


```

Here is the whole example dataset:

```{r, include=TRUE, eval=TRUE, echo=FALSE}

autoplot(frequency_shifts.bg , aes(y = sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), geom='point', color = "red", shape="o", alpha = 0 ) + geom_point(aes(y= sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(selected_parent_alt_af - central)), color = "red", alpha=0.1) + geom_point(aes(y=sign(selected_parent_alt_af-backcrossed_parent_alt_af)*(backcrossed_parent_alt_af-central)), color = "mediumblue", alpha=0.1)  + geom_line(aes(y= max_oriented_shift), color = "red", alpha=0, linetype="dotted") + geom_line(aes(y= min_oriented_shift), color = "mediumblue", alpha=0, linetype="dotted")  + geom_point(aes(y=mean_oriented_shift), color = "black", shape = "x", size=3)  + geom_line(aes(y=mean_oriented_shift), color = "black", alpha = 0.0)  + labs(y="Allele Frequency (tared and polarized)", title = "Offspring Population, Relative to Expected Equilibria", subtitle = "with parental populations", caption ="blue - backcross parent (D. sechellia); red - selection parent (D. simulans); black - offspring",x= "coordinate (droSim1 reference genome)" ) + theme_clear()

```

These site-wise data can then be saved to disk for additional processing or future reference with the **`export.freqshft()`** utility

```{r, include=TRUE, echo=TRUE, eval=TRUE}
export.freqshft(frequency_shifts.bg , tempfile())
```

# Smoothing by Window

The site-wise data we've used so far are downsampled to a 5Mb region on one arm of one chromosome, and while some major features can be discerned, there's still too much data to work with easily, especially when it comes to visualization. One simplifying approach is to divide the data into contiguous chunks and to assign summary to the corresponding genomic intervals. This has two advantages:

-   averaging many data points will cancel out random noise while amplifying subtle signals

-   the evolutionary forces acting on nearby variant sites are not independent due to physical linkage; collapsing them into a single datum deflates the exaggerated apparent degrees of freedom

This can be done easily using the **`bedtools`**'s **`makewindow`** and **`map`** utilities ([@Quinlan2010]):

```{bash, eval = F, echo=TRUE, include=TRUE}
bedtools makewindows -w {wildcards.window_size} -s {wildcards.slide_rate} -g {fai_path} -i winnum \
  | bedtools sort -i - > {output.windowed}

bedtools map -c 7,8,9,10,11,12,12 -o sum,sum,sum,sum,sum,sum,count -null NA -a {input.windows_in} \
  -b <( tail -n +2  {input.frqShft_in}   | cut -f  1-3,15-20 | nl -n ln  | \
  awk -v OFS='\t' '{{if( $5!="NA" && $6!="NA")print $2,$3,$4,$1,"0",".",$5,$6,$7,$8,$9,$10}}' \
  | bedtools sort -i - ) > {output.windowed_out}

```

Other windowing schemes are possible. For example, the original @Earley2011 workflow used intervals containing a fixed number of SNPs, rather than of a fixed length. A python utility for producing such intervals can be found here:

<https://github.com/csoeder/PopPsiSeq/blob/master/scripts/bin_by_SNPs.py>

```{bash, include=FALSE, eval=F}
mkdir -p ../exec/python/
curl https://raw.githubusercontent.com/csoeder/PopPsiSeq/refs/heads/master/scripts/bin_by_SNPs.py > ../exec/python/bin_by_SNPs.py

```

```{bash, echo = TRUE, include=TRUE, eval=F}
python3 ../exec/python/bin_by_SNPs.py --help
```

Another possibility is using a gene model annotation to define locii of interest, and measure the shift in/around them. Once a bed file of intervals has been defined, the above map command can be applied.

The full-genome dataset (from which our previous examples were subsampled) was smoothed with bookended windows 100kB wide. These were smoothed data were subsampled to just chromosome 2L for an example file:

```{bash, echo=TRUE, eval=F}
windowed_freqs=data/ultimate/freq_shift/freebayes/all2.cleanEarleyAll_with_allSim2_and_allSech.vs_droSim1.bwaUniq.windowed_w100000_s100000.frqShift.bed
# this file is generated by the PopPsiSeq workflows documented elsewhere

cat $windowed_freqs | grep -w "chr2L" > inst/extdata/windowed_shifts.example_data.bed 
 
```

This can then be imported using the **`import.smvshift()`** utility. This loads the file, computes the window averages from the window sums, and names the fields according to the experiment design.

```{r, warning=FALSE, include=TRUE, echo=TRUE}
windowed_shifts.filename <- system.file("extdata", "windowed_shifts.example_data.bed", 
  package = "PopPsiSeqR")

windowed_shifts.bg <- import.smvshift(windowed_shifts.filename, selected_parent = "sim",
  backcrossed_parent = "sech")

windowed_shifts.bg %>% head()
```

Key fields are the *[min,avg,max]\_\*ward_AFshift*s, which correspond to the fixation bounds and the offspring average, and *\*\_\*\_difference*, from which the parental population envelope is calculated in the same manner as in the per-site case.

# Data Visualization

The smoothed data can be automatically displayed with the **`windowedFrequencyShift.plotter()`** utility.

```{r, warning=FALSE, include=TRUE, echo=TRUE}
windowedFrequencyShift.plotter(windowed_shifts.bg, 
  selected_parent = "sim", backcrossed_parent = "sech", main_title = 
  "PopPsiSeq Results: offspring allele frequency\nrelative to neutral expectation, parental populations, and fixation")

```

By default a single experiment is displayed, with the fixation bounds and the parental populations shown. Simple modifications can be passed as arguments to the **`windowedFrequencyShift.plotter()`** function or by adding ggplot commands (see below for an example including both). The resulting ggbio object can also be modified directly through its ggplot slot.

# Comparing and Contrasting Results

It may be useful to compare data tracks from two different conditions. For example, it might be useful to run a control experiment to correct for effects other than the experimental treatment; it might also be useful to test the impact of changing different analysis parameters.

As an example, two more versions of the smoothed data above have been included; one was generated based on variants called from sechellia which were collected from the wild, the other from sechellia which had been maintained as lab culture for many generations.

```{bash, echo=TRUE, eval=FALSE, include=TRUE}

lab_data=data/ultimate/freq_shift/freebayes/all2.cleanEarleyAll_with_allSim2_and_labSech.vs_droSim1.bwaUniq.windowed_w100000_s100000.frqShift.bed
# this file is generated by the PopPsiSeq workflows documented elsewhere
wild_data=data/ultimate/freq_shift/freebayes/all2.cleanEarleyAll_with_allSim2_and_wildSech.vs_droSim1.bwaUniq.windowed_w100000_s100000.frqShift.bed
# this file is generated by the PopPsiSeq workflows documented elsewhere

cat $wild_data | grep -w "chr2L" > inst/extdata/wild_sechellia.example_data.bed 

cat $lab_data | grep -w "chr2L" > inst/extdata/lab_sechellia.example_data.bed 
 
```

These are loaded and a field added to record the sechellia type; upon visualization, qualitative differences between the two analyses are apparent.

```{r, warning=FALSE, include=TRUE, echo=TRUE}

lab_sechellia.filename <- system.file("extdata", 
  "wild_sechellia.example_data.bed", package = "PopPsiSeqR")
lab.bg <- import.smvshift(lab_sechellia.filename)
lab.bg$sechellia <- "lab"


wild_sechellia.filename <- system.file("extdata",
  "lab_sechellia.example_data.bed", package = "PopPsiSeqR")
wild.bg <- import.smvshift(wild_sechellia.filename)
wild.bg$sechellia <- "wild"


windowedFrequencyShift.plotter(c(lab.bg,wild.bg),
  selected_parent = "sim", backcrossed_parent = "sec",
  primary_aesthetic = aes(color=sechellia) ,
  main_title = "PopPsiSeq Results: offspring allele frequency\nrelative to neutral expectation, parental populations, and fixation"
  ) + facet_grid(sechellia~seqnames
  ) + labs(subtitle = "analyses based on lab-reared and wild-caught sechellia")

```

To quantify the differences between the two cases, there is the **`subTractor()`** utility which identifies corresponding intervals between the two cases and subtracts one data track from the other.

```{r, warning=FALSE, include=TRUE, echo=TRUE}

sub.traction <- subTractor(lab.bg, wild.bg ,treament_name = "sechellia")

autoplot(sub.traction, aes(y=lab_minus_wild), geom="line"
  ) + labs(y="Difference in Allele Frequency Shift\n(lab - wild)", 
  title = "Difference between PopPsiSeq analyses based on lab-reared and wild-caught sechellia",
  subtitle = "", caption ="",x= "coordinate (droSim1 reference genome)" 
  ) + theme_clear()  


```

Since the **`subTractor()`** requires the same intervals to be present in both data sets, it can't be used for data which have been smoothed with inconsistent windowing schemes. For example, the constant-SNP count windows produced by the **`bin_by_SNPs.py`** utility will depend on the variant dataset used, which may change between compared cases.

One way to circumvent this problem is to calculate the differences on a per-site basis, write these as a BED file, then using the binning script on these sites. The subtraction step involves an inner join between the two frequency shift tables; the sites binned in this case are therefor those sites common to both tables.

```{bash include=TRUE, echo=T, eval=F}
wild_shifts=data/intermediate/freq_shift/freebayes/all2.cleanEarleyAll_with_allSim2_and_wildSech.vs_droSim1.bwaUniq.frqShift.pre
lab_shifts=data/intermediate/freq_shift/freebayes/all2.cleanEarleyAll_with_allSim2_and_labSech.vs_droSim1.bwaUniq.frqShift.pre
# these files are generated by the PopPsiSeq workflows documented elsewhere


bedtools intersect -wa -a $wild_shifts  -b <( echo -e "chr2L\t12400000\t12600000" )  \
  > inst/extdata/wild_frequencies.example_data.tbl 

bedtools intersect -wa -a $lab_shifts  -b <( echo -e "chr2L\t12400000\t12600000" )  \
  > inst/extdata/lab_frequencies.example_data.tbl 


```

```{r include=TRUE, echo=TRUE, eval=TRUE}

lab_frequencies.filename <- system.file("extdata", 
  "lab_frequencies.example_data.tbl", package = "PopPsiSeqR")
lab_frequencies.bg <- import.freqtbl(lab_frequencies.filename)
lab_shifts.bg <- freqShifter(lab_frequencies.bg)
lab_shifts.bg$avg_simward_AFshift <- lab_shifts.bg$mean_oriented_shift
lab_shifts.bg$sechellia <- "lab"    

wild_frequencies.filename <- system.file("extdata", 
  "wild_frequencies.example_data.tbl", package = "PopPsiSeqR")
wild_frequencies.bg <- import.freqtbl(wild_frequencies.filename)
wild_shifts.bg <- freqShifter(wild_frequencies.bg)
wild_shifts.bg$avg_simward_AFshift <- wild_shifts.bg$mean_oriented_shift
wild_shifts.bg$sechellia <- "wild"


fine_subtraction.bg <- subTractor(lab_shifts.bg, wild_shifts.bg,
  treament_name = "sechellia")

fine_subtraction.df <- fine_subtraction.bg %>% as.data.frame() %>% 
  mutate(name = n(), score = 0) %>% select(
    c("seqnames","start","end","name","score","strand", "lab_minus_wild") ) 

write.table(fine_subtraction.df, file=tempfile(), quote=F, 
  sep="\t", row.names=F, col.names = F)

```

```{bash, echo = TRUE, include=TRUE, eval=FALSE}
difference_bedfile="sitewise_differences.bed"
binned_difference="binned_differences.bed"
python3 ../exec/python/bin_by_SNPs.py  -i $difference_bedfile \
  -o $binned_difference -b 25 -s 25 -c 7,7 -m count,mean
```

# Bibliography