Data-simulation
Download: Simulation (May 19, 2003)
GENDATA: A program package for creating artificial data with
complex factor structure
Preamble: Why simulate?
a) Create datasets with idealities or special features wanted for
educating and testing students, or for studying the comparative
accuracies of alternative solution methods.
b) Test for ease or difficulty in solving for particular hypothesized
structures, especially when the ideal is contaminated by sampling
noise or other model violations. In particular, which parameters
in a favored model are most/least robustly recoverable under varied
degrees of misspecification elsewhere in the model?
c) Explore alternatives for parts of a model, intended for fit to real
data, that are convenient guesses rather than hypotheses supported
by what is already known or plausibly surmised about the model's
intended real-data application. (Even if you don't actually create
any Gendata simulations, consider grounding some student seminar
discussion on the options it avails for choosing a model's
exo-roots structure.)
d) [More? -- suggestions welcome.]
-------------------------------------------------------------------------
Below is a packing list, with brief overviews, of files in the Gendata
package. These are of four sorts: (a) executables and the Fortran-90
source code from which they have been compiled; (b) major or minor textfiles
documenting these programs' usage; (c) several programs from my EFA package
("Hyball") that do things with suitably formatted rawdata files; and
(d) some textfiles containing unpublished though circulated views/proposals
on certain controversial aspects of SEM practice.
However, the descriptions here will not give you much operational sense
of how to run these programs and what you can get from them. So skim this
document only briefly before turning to SIMDATA.TXT for your main instruction
in use of this package. Afterward, you can reread this list with rather
more appreciation of how these pieces constitute the simulation process.
Gendata's programs for creation of artificial data were initiated some
years ago for my personal study of solving data covariances for common
sources that produce factored data in accord with the premises of classical
factor analysis but with patterns of factor weights in model Y = A·F + E
considerably more complex than the nearly degenerate independent-clusters
ideal that has dominated past EFA simulation studies. Gendata's original
production resources -- in retrospect, rather narrowly conceived -- are
still included here but have been much expanded to provide feasible creation
of simulation data having almost any structural complexity that modern SEM
analysis might hope to recover from data covariances. Textfile SIMDATA.TXT
in this package describes in considerable detail the models that can be
simulated, followed by operational instructions how to do so once you have
completed the installation.
All programs in this package run under operating system DOS, either
stand-alone or in a DOS window, and should be kept together in a dedicated
subdirectory whose name you will later want to put on your DOS searchpath if
you decide to make serious use of them. The output files they write will
collect in whatever directory is active when you run them, and if you create
data for a series of projects you will probably want to spread that among
assorted subdirectories different from the one containing the production
code. (For initial acquainting yourself with these programs, however, you
may prefer to install them in a temporary subdirectory and test-run them
there.) The only files in this package that actually do something have
names of form *.EXE. These are in binary code -- you can't read
them in a text editor -- and are activated by typing the program's name at
the DOS
prompt. (Don't be shy about starting any of these before you're clear on
what it does: Each gives an initial on-screen overview of what lies ahead,
and you can safely abort the run by hitting Ctrl-C whenever you prefer not
to continue.) The counterpart *.FOR files are the Fortran-90
ASCII source code from which these EXE-files have been compiled: You don't
need these, but they are included in case you plan to do some simulation
programming yourself and find ideas or procedures in them that you can use.
Also they contain internal documentation, my personal reminders, that may
clarify an occasional uncertainty about the programs' operations. In
particular, all the runtime screen displays can be read there.
-------------------------------------------------------------------------
A. Here are the Gendata executables, listed in roughly decreasing order of
importance with name-extensions omitted.
GENSCOR: This produces an ASCII matrix of standardized covariances (corre-
lations) Cyy for NV data variables Y having NF common factors (sources) F,
and unique residual influences E having classical orthogonality properties,
that give rise to the Y-scores under a production model Y = W·F + E whose
source pattern W and source-correlations Cff are stipulated by you up to
scaling adjustments required to standardize Y. At your option, GENSCOR
also writes a file of joint scores satisfying this stipulated
composition for a sample of simulated subjects whose size (NS) and, if
wanted, non-Normalities of distributional skew and kurtosis are set to
degrees you control.
Before running GENSCOR, however, you must first prepare pattern W at least
schematically using program GENPAT or SCHEMAS. The same is true of source
correlations Cff except that your three alternatives for how to prepare
these, detailed in SIMDATA.TXT, are more diverse than are preparations
of W. With one qualification, GENSCOR's output of data covariances Cyy is
Gendata's primary end-product, ready for upload by some other multivariate
program/procedure for whatever purpose has motivated this simulation.
(The qualification is that you can also take Cyy to be Cff in a multi-stage
Genscor simulation.) The scorefiles *.POP produced by GENSCOR are likewise
Gendata end-products except that further Gendata-availed operations on
these will be wanted before you put their simulation scores to external
use.
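As a rough illustration of the covariance structure described for GENSCOR,
the following Python sketch computes standardized Cyy from a pattern W and
source correlations Cff. It assumes the classical identity
Cov(Y) = W·Cff·W' + diag(u), with the unique variances u supplied directly.
The function name and the example values are hypothetical, and nothing here
is claimed to reproduce GENSCOR's own scaling adjustments, which SIMDATA.TXT
describes.

```python
import math

def implied_correlations(W, Cff, u):
    """Correlations of Y = W·F + E, given pattern W (NV x NF), source
    correlations Cff (NF x NF), and unique variances u (length NV).
    Assumption: E is uncorrelated with F and has diagonal covariance."""
    nv, nf = len(W), len(Cff)
    # Common-part covariance W·Cff·W', plus unique variance on the diagonal.
    cov = [[sum(W[i][a] * Cff[a][b] * W[j][b]
                for a in range(nf) for b in range(nf))
            + (u[i] if i == j else 0.0)
            for j in range(nv)] for i in range(nv)]
    # Standardize: divide each element by the product of the two SDs.
    sd = [math.sqrt(cov[i][i]) for i in range(nv)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(nv)]
            for i in range(nv)]

# Hypothetical example: 4 data variables, 2 correlated sources.
W = [[0.8, 0.0], [0.6, 0.3], [0.0, 0.7], [0.2, 0.5]]
Cff = [[1.0, 0.4], [0.4, 1.0]]
u = [0.36, 0.40, 0.51, 0.60]
Cyy = implied_correlations(W, Cff, u)
for row in Cyy:
    print(" ".join(f"{r:7.4f}" for r in row))
```

Dividing by the SDs at the end is what makes the stipulated W fixed only up
to scaling: rescaling a row of W together with its unique variance leaves
that variable's correlations unchanged.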
SEMCOV: For simulation of data manifesting the sorts of causal-path
dependencies among common sources F that modern structural modelling
(SEM) seeks to recover from F-indicator covariances, SEMCOV computes the
F-covariances Cff entailed by a posited path structure A conjoined with
the posited covariances Cuu among exogenous inputs to the endogenous F.
A schema of path matrix A must have been previously created by a run of
SCHEMAS, while Cuu has the same three preparation alternatives that Cff
has in GENSCOR runs. SEMCOV's output Cff has basename of form KOV* and
is intended primarily for upload by GENSCOR to affix a pattern W of F's
measurement manifestations.
In contrast to SEM models, EFA simulations impose no causal-path
dependencies on their manifest variables' F-sources and hence make no
use of SEMCOV.
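The computation described for SEMCOV can be sketched in Python under the
standard linear form F = A·F + U, which gives Cff = B·Cuu·B' with
B = inverse(I - A). This form, the convention that A[i][j] is the path from
source j to source i, the restriction to loop-free (recursive) path
matrices, and all names and values below are assumptions made for
illustration, not SEMCOV's actual code.

```python
def matmul(X, Y):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def total_effects(A):
    """B = inverse(I - A) for a loop-free path matrix A, computed as the
    finite sum I + A + A^2 + ...; raises ValueError if A contains a loop."""
    n = len(A)
    B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in B]
    for _ in range(n):
        power = matmul(power, A)
        if all(x == 0.0 for row in power for x in row):
            return B          # the series has terminated
        B = [[B[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    raise ValueError("path matrix contains a loop (nonrecursive model)")

def source_covariances(A, Cuu):
    """Cff = B·Cuu·B' for exogenous-input covariances Cuu."""
    B = total_effects(A)
    return matmul(matmul(B, Cuu), transpose(B))

# Hypothetical chain F1 -> F2 -> F3 with uncorrelated exogenous inputs.
A = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.6, 0.0]]
Cuu = [[1.0, 0.0, 0.0],
       [0.0, 0.75, 0.0],
       [0.0, 0.0, 0.64]]
Cff = source_covariances(A, Cuu)
for row in Cff:
    print(" ".join(f"{c:6.3f}" for c in row))
```

The input variances in this example were chosen so that each F comes out
with unit variance; a nonrecursive A would need a general matrix inverse
instead of the terminating series.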
GENPAT: This is an alternative source of common-factor measurement patterns
W for input to GENSCOR. Its patterns specialize in complex block layouts
that SEM-oriented simulators will find distasteful albeit challenging. And
its output filenames (or "vilenames", as an overly frank typing error has
brought to my attention) are near-intolerably ugly. (They encode rather
more information than really needed there.) You can safely ignore GENPAT
at the outset of your encounters with Gendata.
SCHEMAS: This program lets you build a library of schemata for pattern
and covariance matrices that SEMCOV and GENSCOR can upload. SIMDATA.TXT
explains these in detail; here, it suffices to note: (a) These are quite
easy to create and revise. (b) Each comprises a grid within which nonzero
elements are flagged by placeholders (letters) that go proxy for numeric
values or randomization ranges specified in an assignment table that can
be revised when the schema is uploaded. And (c), when a run of SEMCOV or
GENSCOR is ready for this, it lists the available schemata having the size
and character currently needed and allows you to browse for your preference
if filename alone doesn't sufficiently prompt your recall of that.
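The idea of a schema, a grid of letter placeholders plus an assignment
table, can be sketched in Python as follows. The grid layout, the use of
"." for a zero element, the uniform draw within a range, and every name
below are hypothetical stand-ins; the actual schema file format is the one
documented in SIMDATA.TXT.

```python
import random

def fill_schema(grid, table, rng):
    """Turn a schema (rows of characters) into a numeric matrix.
    '.' marks a zero element; each letter is looked up in `table`, whose
    entry is either a fixed number or a (low, high) randomization range."""
    matrix = []
    for row in grid:
        out = []
        for ch in row:
            if ch == ".":
                out.append(0.0)
            else:
                spec = table[ch]
                if isinstance(spec, tuple):
                    # Assumption: each occurrence gets its own uniform draw.
                    out.append(rng.uniform(*spec))
                else:
                    out.append(float(spec))
        matrix.append(out)
    return matrix

# Hypothetical 4 x 2 pattern schema and its assignment table.
grid = ["a.",
        "ab",
        ".a",
        "b."]
table = {"a": 0.7, "b": (0.2, 0.4)}
W = fill_schema(grid, table, random.Random(1))
for row in W:
    print(row)
```

Revising the assignment table at upload time then amounts to passing a
different `table` with the same `grid`.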
MORSCOR: In order to maximize retained information in minimal storage space,
each scorefile *.POP written by GENSCOR is in binary code that
will presumably need translation into ASCII if you want to work with this
score distribution. MORSCOR does this for you under your choice of options
that are meager at present but can be expanded as need arises for more
alternatives. MORSCOR can also reconstruct all the ASCII payout written
by GENSCOR when it produced this POP-file in case you have lost its
original archive.
SAMPLCOV: This provides material for studying the effect of sampling noise
on SEM source recovery. Using the Odell & Feiveson (J. Amer. Stat. Assn.,
1966) algorithm for simulating covariances in random samples of stipulated
size from an infinite population whose to-be-sampled covariances are given,
SAMPLCOV creates simulations of the X-covariances in K independently random
samples of size NS (your choice of K and NS) from an infinite population
wherein Cxx is what you stipulate. This is tantamount to creating a set of
K bootstrap samples of size NS from a POP-file generated by GENSCOR to have
exactly Cxx for its datascore covariances from untwisted production axes.
Should you want genuine bootstrap samples from a POP-file, you can run that
through Hydata-supplement program HYDATA (described below) to use the
bootstrap option it proffers when computing the input datafile's
covariances.
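Simulating a sample covariance matrix directly, without generating any
scores, can be sketched in Python with the Bartlett decomposition of a
Wishart matrix. This is offered as an illustration of the general
technique under the assumption of Normal data; it is not claimed to
reproduce SAMPLCOV's implementation, and the function names are
hypothetical.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L·L' = S, for symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_covariance(Cxx, ns, rng):
    """One simulated covariance matrix for a random sample of ns Normal
    observations from a population whose covariances are Cxx."""
    p, df = len(Cxx), ns - 1           # requires ns > p
    L = cholesky(Cxx)
    # Lower-triangular T: chi-square roots on the diagonal, N(0,1) below.
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        T[i][i] = math.sqrt(rng.gammavariate((df - i) / 2.0, 2.0))
        for j in range(i):
            T[i][j] = rng.gauss(0.0, 1.0)
    # M = L·T, so that M·M' is Wishart with df degrees of freedom.
    M = [[sum(L[i][k] * T[k][j] for k in range(p)) for j in range(p)]
         for i in range(p)]
    return [[sum(M[i][k] * M[j][k] for k in range(p)) / df
             for j in range(p)] for i in range(p)]

# K = 3 independent samples of size NS = 50 (hypothetical values).
rng = random.Random(2003)
Cxx = [[1.0, 0.5], [0.5, 1.0]]
samples = [sample_covariance(Cxx, 50, rng) for _ in range(3)]
for S in samples:
    print([[round(v, 3) for v in row] for row in S])
```

Only p(p+1)/2 random numbers are drawn per sample regardless of NS, which
is why this route is cheaper than generating NS score vectors and computing
their covariances.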
COVFORM: Run this to set a small parameter file named KOVFMT that instructs
GENSCOR how best to format its ASCII data-covariance output for transport
to the solution program you want to receive this.
ORDER: This clarifies the structure of a disjointedly assembled causal-path
structure or, more generally, of any directed but possibly nonrecursive
graph whose links, described piecemeal, needn't be allegedly causal.
Operating upon a set of provisional numeric indices (they can be assigned
randomly) for nodes in a causal-path model, program ORDER (a) receives
keyboard entry of a list of index pairs signifying that node i is
directly path-antecedent to node j; (b) sorts these indices into disjoint
"units", each of which is either a singlet (strongly recommended if this is
a SEM model's intended path structure) or comprises a group of nodes each
of which is path-antecedent to all others in its group; (c) identifies one
or more sequences whose left-to-right order embeds the strict partial order
these input pairings impose on these units; (d) clarifies the degrees of
path-separation among these units; and (e) for each unit that contains
multiple nodes, identifies all the closed loops within this unit as well
as each member node's counts of direct input/output links to nodes outside
this unit. This information enables you to re-index these nodes in closest
correspondence to their degree of path-dependency and, if it discloses
loops you did not intend, makes perspicuous where you can break those most
parsimoniously.
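Steps (b) and (c) above correspond to finding the strongly connected
components of a directed graph and listing them in an order that respects
the path-antecedence pairs. A minimal Python sketch using Tarjan's
algorithm follows; the algorithm choice and all names are assumptions for
illustration, since ORDER's own method is not described here.

```python
def path_units(pairs):
    """Group nodes into units (strongly connected components) and return
    the units in an order consistent with the pairs (i, j), each meaning
    node i is directly path-antecedent to node j."""
    nodes = sorted({n for pair in pairs for n in pair})
    succ = {n: [] for n in nodes}
    for i, j in pairs:
        succ[i].append(j)

    index, low, on_stack, stack, units = {}, {}, set(), [], []

    def visit(v):                      # Tarjan's algorithm, recursive form
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:         # v is the root of a unit
            unit = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                unit.append(w)
                if w == v:
                    break
            units.append(sorted(unit))

    for n in nodes:
        if n not in index:
            visit(n)
    return units[::-1]                 # units are emitted in reverse order

# Hypothetical input: nodes 2 and 3 form a loop between nodes 1 and 4.
pairs = [(1, 2), (2, 3), (3, 2), (3, 4)]
for unit in path_units(pairs):
    kind = "singlet" if len(unit) == 1 else "loop group"
    print(unit, kind)
```

Any unit with more than one node signals a loop; in a model meant to be
recursive, that is where a link would have to be removed.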
-------------------------------------------------------------------------
B. These are the Gendata documentation files in decreasing order of
importance.
SIMDATA.TXT: This describes in considerable detail each Gendata program's
capabilities, how to invoke those when your simulation wants them, and
how to plan your sequence of program calls and uploads. And it begins
with a comprehensive overview of linear structural modelling, both
classical and modern, that you may find conceptually instructive even
if you never create any simulation material using Gendata resources.
READ_1ST.TXT: What you're now reading. In addition to affording your
initial overview of Gendata, it will continue to provide lookup for
which filename to call for doing what job. (If even I have trouble
remembering, why shouldn't you?)
HYSTAND.TXT: This contains excerpts from Hyball's README.TXT documentation
that explain the use of certain Hyball programs, included in this package
and briefly described below, that carry out assorted operations on suitably
formatted rawdata files. Unless you do creative things with the simulation
datafiles that GENSCOR affords, you will probably have little use for these.
COVFMT.TXT: This describes the layout an ASCII file requires to transport a
covariance matrix for upload by Hyball/Gendata programs that operate upon
those. All transport covariance files written by programs in this package
have this layout, but you may have occasion to initiate some outside of
that process or do some editing thereof between production and upload.
-------------------------------------------------------------------------
C. These are the Hydata-supplement programs included in this package.
HYDATA: This reads ASCII scorefiles in many alternative formats from
outside sources, does preprocessing as needed and, if that is successful,
allows one or both of two outputs: (1) Transcription of the really-raw
scores into an ASCII Hydata-standard format that HYDATA and all other
programs in the Hyball package that work on raw data can read without
preprocessing. (2) Computation of the standardized covariances
(correlations) among the input variables or any selected subset thereof,
written to a COV-file that any Hyball and now Gendata program that
uploads covariances can read. (This computation also avails search for
possible quadratic relations among the data variables, and production
of COV for each of up to 156 bootstrap samples of the input data.)
RESCORE: This allows computation of scores on many user-stipulated
functions of the variables in an uploaded Hydata-standard datafile.
The newly created scores can be appended to the input datafile or
recorded in a separate file. RESCORE also allows input of an
externally prepared M-by-N matrix of coefficients that will generate
scores on M linear combinations of any selection of N variables in
the input datafile. Details of how to prepare this in a text editor,
not at present included in RESCORE's documentation, are available on
request.
SELECT: This allows subsets of variables and/or records in a Hydata-
standard datafile to be copied to a separate one.
MERGE: This enables two or more Hydata-standard datafiles having some
(not necessarily all) named variables and/or record IDs in common to
be combined into a single datafile. The input files are merged in a
user-chosen sequence whereby record IDi's score on variable Yj in the
merged file is the last score that occurs with those coordinates in
the merge sequence.
An example of what can be done with these resources is revising the
scores on a subset X of the variables in datafile .D1: First, select
scores on X into a new datafile .D2. Next, use RESCORE to transform
.D2 as wanted. (E.g., the selected scores could be rescaled to have
different means and SDs, or partitioned into categories, or nonlinearly
reshaped as by Log-transform.) Finally, merging { .D1, .D2 }
in that order overwrites the X-scores in .D1 with their modified
values in .D2. And if .D2 contains only a subset of the records
in .D1, the merge will overwrite scores on those records only.
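The select / rescore / merge sequence just described can be sketched in
Python, with in-memory dictionaries standing in for Hydata-standard
datafiles. The data layout and the three function names are hypothetical
illustrations of the last-score-wins rule, not the interfaces of SELECT,
RESCORE, or MERGE.

```python
import math

def select(datafile, variables, record_ids=None):
    """Copy the named variables (and optionally only some records)."""
    return {rid: {v: scores[v] for v in variables if v in scores}
            for rid, scores in datafile.items()
            if record_ids is None or rid in record_ids}

def rescore(datafile, func):
    """Apply func to every score in the datafile."""
    return {rid: {v: func(s) for v, s in scores.items()}
            for rid, scores in datafile.items()}

def merge(*datafiles):
    """Merge datafiles in the given sequence; for any (record, variable)
    pair, the last score met in the sequence is the one retained."""
    merged = {}
    for datafile in datafiles:
        for rid, scores in datafile.items():
            merged.setdefault(rid, {}).update(scores)
    return merged

# Hypothetical datafile: record ID -> {variable name: score}.
d1 = {1: {"X": 2.0, "Y": 5.0},
      2: {"X": 4.0, "Y": 6.0},
      3: {"X": 8.0, "Y": 7.0}}
d2 = rescore(select(d1, ["X"], record_ids={1, 3}), math.log)
result = merge(d1, d2)     # X is log-transformed for records 1 and 3 only
for rid in sorted(result):
    print(rid, result[rid])
```

Reversing the merge sequence to merge(d2, d1) would leave every score at
its original value, since the scores in d1 would then be met last.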
Hyball's documentation for these Hydata-supplement programs is excerpted
in this package's textfile HYSTAND.TXT.
-------------------------------------------------------------------------
D. Specialized background reading, contentious.
PRELOOPS.TXT: An introduction to important details of scientific
regularities' conceptual composition that are mostly suppressed, seldom
wisely and often obfuscatingly, in standard multivariate algebra.
LOOPS.TXT: My effort to make clear for SEMists who see nothing wrong with
nonrecursive models that these violate our most basic intuitions about
causal influence.
MORLOOPS.TXT: My 26/9/2001 Semnet post showing algebraically the special
conditions under which it is possible, under the standard linear model
of system dynamics, for indicators of source variables in a path loop
to have uncorrelated residual disturbances.
SMEPSHOW.TXT: A reduced version of my 1997 SMEP-meeting handout attempting
to demonstrate how EFA can effectively play in SEM's path-detection league
by use of my Hyball package's Hyblock procedure.