5.2 Stats

Last updated 1 year ago

Basic Stats Concepts

Sample Independence

Independent vs. dependent/paired samples - are your sample groups related to each other?
- ex. separate groups of people arriving at and leaving an airport are independent; the same individuals measured as they leave and again as they return are dependent
- dependent samples always come in pairs (or multiples)
Each case calls for a different hypothesis test
- t-test for independent samples or ANOVA without repeated measurements
- t-test for dependent samples or ANOVA with repeated measurements
- datatab.net
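
To make the distinction concrete, here is a minimal sketch in base R. The data are simulated and purely hypothetical (10 travelers measured when leaving and when returning); the variable names are assumptions, not part of any real dataset.

```r
# Hypothetical data: the same 10 individuals measured twice.
set.seed(1)
leaving   <- rnorm(10, mean = 50, sd = 5)
returning <- leaving + rnorm(10, mean = 2, sd = 1)  # same people, so the samples are paired

# Independent-samples t-test (Welch by default): treats the vectors as unrelated groups.
independent_test <- t.test(leaving, returning)

# Paired t-test: tests whether the mean within-individual difference is zero.
paired_test <- t.test(leaving, returning, paired = TRUE)

independent_test$p.value
paired_test$p.value
```

Because the paired test removes between-individual variation, it is usually the more sensitive choice when the samples really are dependent.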

avoid pseudo-replication - most models assume that all effects are fixed, but there are random effects everywhere
if every measurement comes from a different level of the random effect, then each observation is statistically independent & you avoid pseudo-replication
Ex. Smokers vs non-smokers
- fixed effect - researcher sets/chooses the levels
- random effects - other variables that line up with the levels of interest (for example, one nurse measures all the smokers and another measures all the non-smokers), or multiple replicate measurements of the same individual; in either case you can't treat all the measurements as independent
fixed effect models -
- assume all data in the analysis estimate the same true value - same population and protocol
- if you look at test scores from sub-pops within one school, each sub-pop average is assumed to estimate the same overall average
- usually not asking about differences in the mean, because the mean is assumed to be the same for all
- weights are driven by sample size - larger studies are weighted more heavily
- generally not appropriate for most studies
random effects models -
- start with a universe of populations and sample randomly within that universe
- if you add a level and measure from many schools, the overall average may be the same, but the average for each individual school will vary
- "random" reflects our assumption that the populations we're studying are a random sample from that universe
- usually helps to understand the variation between pops
- usually more sources of error - wider confidence interval
- weights are not impacted as much by size - larger studies are weighted a little more heavily, but because they are not all estimating the same mean, the difference is smaller than in a fixed effect model
researchers often use heterogeneity tests to determine whether to use random or fixed effect models - but the choice should instead be based on the actual study design
What happens if you use the wrong type of model?
- the effect size will be wrong
- confidence interval is wrong (too narrow)
- p-value not valid
- analysis will be wrong

Simple Linear regression or multiple linear regressions
- work best when observations come from a single homogeneous group
Linear mixed effects models
- when there are subgroups that may be affecting the results

ex. influence of nitrate (fixed) on dissolved oxygen in multiple river basins (random)
ex. influence of three light level treatments (fixed) on photosynthetic rate in multiple species (random) with each individual plant (random) measured once per light level

Options for dealing with mixed effects:

random intercepts - groups have different intercepts but the same slope
- in R: y ~ x + (1|group)
correlated random slopes and intercepts - both vary by group
- in R: y ~ x + (x|group)
- usually best to fit the full structure (use this model)
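
A minimal sketch of the two formulas above, assuming the lme4 package is installed (the formula syntax in these notes matches lme4, but that is an assumption). It uses the sleepstudy example data that ships with lme4, where Reaction plays the role of y, Days of x, and Subject of group.

```r
library(lme4)  # assumed to be installed; provides lmer() and the sleepstudy data

# Random intercept only: each Subject gets its own baseline, all share one slope for Days.
m_intercept <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Correlated random intercept and slope: baseline and Days effect both vary by Subject.
m_slope <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

fixef(m_slope)                # population-level intercept and slope
head(coef(m_slope)$Subject)   # per-subject intercepts and slopes
anova(m_intercept, m_slope)   # likelihood-ratio comparison (models are refit with ML)
```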

Nested Factors

with multiple random effects nested within each other, where factors only occur within one level
- in R: y ~ color + (color|green/gray) where gray is nested within green

Crossed random effects

all combinations of random effects levels occur together
- in R: y ~ color + (color|green) + (color|gray)

How to check the model (model diagnostics)

plots of residuals vs. fitted values for the entire model, and of residuals vs. each independent variable, should show no trends; residuals should be normally distributed with equal variance
residuals vs. fitted values with each random-effect group should also be normal with equal variance within and between groups
conditional means of random effects - means of slope values - should be normally distributed - can use a Q-Q plot
random effects should have many levels (>5 often said to be needed)
variables that are strongly collinear (have a strong correlation) can cause problems
structure of random effects should be specified correctly, requires good understanding of data & study design

Centering and standardizing models

centering - subtract the mean from each data point
standardizing - center, then divide each point by the standard deviation
helps model converge (especially with large values)
can help compare magnitude of effect among variables measured with different units/scales
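
As a small illustration in base R (the values in x are made up), centering and standardizing can be done by hand or with scale(), which performs both steps by default:

```r
# Hypothetical predictor with large values
x <- c(1200, 1500, 900, 1800, 1100)

centered     <- x - mean(x)        # centering: mean becomes 0
standardized <- centered / sd(x)   # standardizing: mean 0, standard deviation 1

# scale() centers and divides by the standard deviation in one call
scaled <- as.numeric(scale(x))

mean(centered)
sd(standardized)
all.equal(standardized, scaled)
```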

Interpreting results of a linear mixed model fit

provides info about model specification and fit, and the variance attributed to each grouping factor alongside the residual variance
provides intercept, coefficients, standard errors, expected correlation of fixed effects under multiple trials
no p values!

No interaction terms -
- forces both genders to have the same slope; only the intercept differs
- score = b1 + b2*hours + b3*gender
With interaction terms -
- allow for more complexity, different intercepts & different slopes
- score = b1 + b2*hours + b3*gender + b4*gender*hours
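
The two equations above can be sketched with lm() in base R. The data below are simulated and hypothetical; the coefficient values used to generate them are arbitrary.

```r
# Simulated, hypothetical data: test score vs. hours studied for two genders
set.seed(42)
n      <- 100
hours  <- runif(n, 0, 10)
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
is_m   <- as.numeric(gender == "M")
score  <- 50 + 3 * hours + 5 * is_m + 1.5 * hours * is_m + rnorm(n, sd = 2)
d <- data.frame(score, hours, gender)

# No interaction: one shared slope for hours, separate intercepts by gender
fit_additive <- lm(score ~ hours + gender, data = d)

# With interaction: hours * gender expands to hours + gender + hours:gender,
# so both the intercept and the slope can differ by gender
fit_interaction <- lm(score ~ hours * gender, data = d)

coef(fit_additive)      # b1, b2, b3
coef(fit_interaction)   # b1, b2, b3, b4 (the hours:genderM term)
```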

Homoscedasticity

variance of errors is constant
heteroscedasticity - errors change in magnitude as x changes
- means that there is some important information in the model that is missed, the model could have a better fit
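
One rough way to spot heteroscedasticity is a residuals vs. fitted plot. The sketch below simulates hypothetical data whose error spread grows with x, so the plot should show a funnel shape; the numeric check at the end is just an informal illustration, not a formal test.

```r
# Simulated data where the error standard deviation grows with x
set.seed(7)
x <- seq(1, 10, length.out = 200)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.3 * x)

fit <- lm(y ~ x)

# Residuals vs. fitted: a widening funnel suggests non-constant variance
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Informal check: do absolute residuals get larger as x increases?
spread_trend <- cor(abs(resid(fit)), x)
spread_trend
```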

My Statistical Analyses

Shapiro-Wilk test
- shapiro.test(mydata$Detatchment.Force..g.) #p=.2685
- If P>0.05, there is no evidence that the data deviate from normality
Histogram

Q-Q plot

Testing for Homogeneity of variance / homoscedasticity

bartlett.test(Detatchment.Force..g.~Condition, data=mydata) #p=.6505
If P>0.05, there is no evidence that the variances differ ##box plots can serve as a good visual for this
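
The checks above can be run end to end on a stand-in dataset. This is only a sketch: the data frame d, its columns force and condition, and the simulated values are all hypothetical and are not the real mydata.

```r
# Hypothetical stand-in for mydata: a response measured under three conditions
set.seed(3)
d <- data.frame(
  condition = factor(rep(c("A", "B", "C"), each = 20)),
  force     = rnorm(60, mean = rep(c(10, 12, 11), each = 20), sd = 2)
)

# Normality: Shapiro-Wilk test plus visual checks
normality <- shapiro.test(d$force)
hist(d$force, main = "Histogram of force", xlab = "force")
qqnorm(d$force)
qqline(d$force)

# Homogeneity of variance across conditions, with box plots as a visual
variance_check <- bartlett.test(force ~ condition, data = d)
boxplot(force ~ condition, data = d)

normality$p.value
variance_check$p.value
```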

Removing outliers

z-score: a statistical measurement of a score's relationship to the mean in a group of scores
- z_scores <- (data-mean(data))/sd(data)
- observations with an absolute z-score greater than 3 are dropped
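
A minimal sketch of this rule in base R, using a made-up vector x with one obviously extreme value (the vector is named x rather than data to avoid masking R's built-in data() function):

```r
# Hypothetical measurements: 50 values between 9 and 11, plus one extreme value
x <- c(seq(9, 11, length.out = 50), 25)

# z-score: distance from the mean in units of standard deviation
z_scores <- (x - mean(x)) / sd(x)

# Keep only observations within 3 standard deviations of the mean
kept    <- x[abs(z_scores) <= 3]
dropped <- x[abs(z_scores) > 3]

dropped
length(kept)
```

Note that a very extreme value inflates both the mean and the standard deviation, so this rule can miss outliers in small samples.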

Ch1

"Each species will be analyzed using a three-way ANOVA (Species (7) x Depth (4) x Site(2)) to test for interactions between species and depth."

3-way ANOVA

Tests all of the following terms (main effects and interactions) and produces a p-value for each
- Site
- Depth
- Species
- Site*Species
- Site*Depth
- Depth*Species
- Site*Species*Depth
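
As a sketch of how these seven terms arise, the full factorial formula Site * Depth * Species in aov() expands to all main effects and interactions. The response y below is random noise on a hypothetical balanced design (3 replicates per cell), so the p-values themselves are meaningless; only the structure is of interest.

```r
# Hypothetical balanced design: 2 sites x 4 depths x 7 species, 3 replicates each
set.seed(10)
d <- expand.grid(
  Site    = factor(c("S1", "S2")),
  Depth   = factor(c("D1", "D2", "D3", "D4")),
  Species = factor(paste0("Sp", 1:7)),
  rep     = 1:3
)
d$y <- rnorm(nrow(d))  # placeholder response, pure noise

# Site * Depth * Species = 3 main effects + 3 two-way + 1 three-way interaction
fit <- aov(y ~ Site * Depth * Species, data = d)

term_labels <- attr(terms(fit), "term.labels")
term_labels
summary(fit)  # one row (and p-value) per term, plus residuals
```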

Ch2

"The difference of the same-genotype pairs (one from a control cage and one from a treatment cage) will be calculated and used for all statistical analyses. Genotype will be treated as an unreplicated block and each of the three sets of genotypes will be analyzed using an ANOVA to compare treatments. Because this analysis only considers samples that survived transplantation, a Fisher’s Exact Test will be used to assess survivorship."

ANOVA with unreplicated block

The cages are unreplicated - they may have caused variation unrelated to the treatment, but because we don't have replicates, we won't know if any of the cages had effects. We can't separate cage effects from treatment effects
Genotype is also unreplicated - we only have one replicate of each genotype, so each genotype is technically treated as a block
ANOVA - Analysis of Variance
- used to compare means across more than two groups
- similar to a regression but uses a categorical variable to predict a quantitative one (instead of two quantitative)
- F-statistic - compares how much variation the model accounts for vs. how much it can't account for
  - just tells you if SOMETHING is statistically different
  - followed up with post-hoc pairwise comparisons (e.g. multiple t-tests) to find which groups differ

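Fisher's Exact Test

2 x 2 table - 2 conditions, 2 outcomes
calculates the exact probability of observing data at least as extreme as the table, assuming no association between condition and outcome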
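
A minimal sketch with fisher.test() in base R. The counts are invented purely for illustration (4 control and 4 treatment samples) and are not results from the study.

```r
# Hypothetical 2 x 2 table of survivorship counts
survival <- matrix(
  c(3, 1,   # survived: control, treatment
    1, 3),  # died:     control, treatment
  nrow = 2,
  dimnames = list(
    Condition = c("control", "treatment"),
    Outcome   = c("survived", "died")
  )
)

result <- fisher.test(survival)  # two-sided by default
result$p.value
```

With counts this small the test has very little power, which is typical of the situations where an exact test is used instead of a chi-squared approximation.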

Reaction norm

the range of phenotypes expressed by a genotype along an environmental gradient
closely related to phenotypic plasticity
consists of offset or intercept which describes the mean in each environment, the slope, and the shape or curvature (linear, quadratic, etc.)

Ch3

"3-way ANOVA models (morph x treatment x tray) will be created using a planned-comparisons test (emmeans R package) focused on comparing the shaded and deep treatments with the control. PCA plots will be used to visualize the strength of treatment, Apo or Sym categorization, and caging effects."

Planned-comparisons test

focus on scientifically relevant comparisons rather than all comparisons
opposite of post-hoc tests

PCA (Principal Component Analysis)

dimensionality reduction method
1- standardization
2- covariance matrix computation
3- compute the eigenvectors and eigenvalues of the covariance matrix to determine the PCs
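
The three steps can be sketched by hand in base R and compared against prcomp(). The built-in iris measurements are used here only as convenient example data.

```r
# Example data: the four numeric columns of the built-in iris dataset
X <- as.matrix(iris[, 1:4])

# 1 - standardization: center each column and scale to unit variance
Z <- scale(X)

# 2 - covariance matrix of the standardized data
C <- cov(Z)

# 3 - eigenvectors give the PC directions, eigenvalues the variance along each PC
eig <- eigen(C)
scores_manual <- Z %*% eig$vectors
variance_explained <- eig$values / sum(eig$values)

# Same analysis with the built-in function
pca <- prcomp(X, center = TRUE, scale. = TRUE)

all.equal(eig$values, pca$sdev^2)  # eigenvalues match the PC variances
variance_explained
```

The signs of individual components may differ between the manual version and prcomp(), since an eigenvector is only defined up to a sign flip.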

compares isotopic niche widths
isotope scatterplots w/ ellipses - standard is 40%
Bayesian analysis
SEA - Standard ellipse area

Mixing models

statistical tools that use biotracers to estimate contributions of sources to a mixture
- sources = prey, mixture = consumer
