# Simulating multivariate normally distributed data in R

In my graduate class on path analysis, we do a lot of analysis on our own data. This year, I suggested that people consider analyzing simulated data based on the statistics of their data. This way they’ll use a data set that looks like their data, but they won’t be doing a lot of model fitting on data they care about and want to use in real research. So today I typed up a quick guide to simulating multivariate normal data in R for use in our class.

If you find typos, errors, etc., please let me know.
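
As a rough illustration of the idea (a minimal sketch, not the guide itself), the code below simulates data from a set of means, standard deviations, and correlations using `mvrnorm()` from the MASS package. The variable names and all of the numbers are made up for the example; you would substitute the statistics from your own data.

```r
library(MASS)  # provides mvrnorm()

set.seed(548)  # arbitrary seed so the example is reproducible

# Hypothetical summary statistics standing in for your own data
vars  <- c("x1", "x2", "x3")
means <- c(x1 = 10, x2 = 20, x3 = 30)
sds   <- c(x1 = 2,  x2 = 4,  x3 = 5)
R <- matrix(c(1.0, 0.5, 0.3,
              0.5, 1.0, 0.4,
              0.3, 0.4, 1.0),
            nrow = 3, dimnames = list(vars, vars))

# Turn correlations and SDs into a covariance matrix: Sigma = D R D
Sigma <- diag(sds) %*% R %*% diag(sds)
dimnames(Sigma) <- list(vars, vars)

# empirical = TRUE makes the sample means and covariances match
# mu and Sigma exactly; use empirical = FALSE to treat them as
# population values and get ordinary sampling variability instead
sim <- as.data.frame(mvrnorm(n = 300, mu = means, Sigma = Sigma,
                             empirical = TRUE))

round(colMeans(sim), 2)  # compare with means
round(cov(sim), 2)       # compare with Sigma
```

Whether to use `empirical = TRUE` depends on the goal: it is convenient when you want a practice data set that reproduces your statistics exactly, while `empirical = FALSE` is the more natural choice if you want each simulated sample to vary the way real samples would.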

# Teaching PSYCH 548: CFA and SEM in Fall of 2018

I’ll be teaching Confirmatory Factor Analysis and Structural Equation Modeling next fall (listed as PSYCH 548 for 5.0 credits).

First, if you don’t know, I encourage you to bring your own data to use in the class. You have to be able to share some form of it with me, like a covariance matrix. I won’t share, distribute, or use it for my own work. You’ll submit your R syntax and the data so that I can help debug and provide feedback. Second, I’m planning to stick with R, although I may look at some other packages besides lavaan. In the past, I’ve let people use other programs, like Mplus, but I don’t think I’m going to do that anymore. Third, I’m planning to reinstate writing three research papers (one for each major topic: observed variable path analysis, confirmatory factor analysis, and latent variable path analysis), this time with a peer review component. I’m also thinking about adding some work on simulation and power analysis. In the past, people have turned their class papers into thesis chapters or publications, so if you plan for it, you may be able to do the same.
For the class to be useful to you, you’ll want the following:
• hypotheses (or the ability to create such) about how your data may be structured and tested; this is not an exploratory data analysis class.
• if you have many more observations than variables, you need a “large” sample (probably greater than 100; over 300 is better). In some cases as few as 80 people will work, but the class can be more challenging or frustrating and the models quite limited.
• if you have many more variables than observations (e.g., time series, physiological, and/or neuroscience data), you’ll need to think about intra-individual covariance structure and pooling (or not) across people.
• either way, you want to think about “redundant measures”: items or measurements that are “getting at the same thing” and can be structured.
• however, if you have a tiny cross-sectional or two-time-point data set (say, N=20) with few variables on each respondent, the latent variable part of this class probably won’t work for you.