Rozeboom: Tools-Advanced EFA

Advanced-EFA (Hyball)

Download: Advanced-EFA (Sept 18, 2003)

W. W. Rozeboom
September, 2003

GENDATA: A program package for creating artificial data with
complex factor structure

Preamble: Why simulate?

a) Create datasets with idealities or special features wanted for
educating and testing students, or for studying the comparative
accuracies of alternative solution methods.

b) Test for ease or difficulty in solving for particular hypothesized
structures, especially when the ideal is contaminated by sampling
noise or other model violations. In particular, which parameters
in a favored model are most/least robustly recoverable under varied
degrees of misspecification elsewhere in the model?

c) Explore alternatives for parts of a model, intended for fit to real
data, that are convenient guesses rather than hypotheses supported
by what is already known or plausibly surmised about the model's
intended real-data application. (Even if you don't actually create
any Gendata simulations, consider grounding some student seminar
discussion on the options it avails for choosing a model's exo-roots
structure and how plausibly these should occur in reality.)

d) [More? Suggestions welcome.]

=============================================================================
Below is a packing list, with brief overviews, of files in the Gendata
package. These are of four sorts: (a) executables and the Fortran-90
source code from which they have been compiled; (b) major or minor textfiles
documenting these programs' usage; (c) several programs from my EFA package
(Hyball) that do things with suitably formatted rawdata files; and (d) some
textfiles containing unpublished though circulated views/proposals on certain
controversial aspects of SEM practice.

However, the descriptions here won't give you much operational sense
of how to run these programs and what they can do for you. So skim this
document only briefly before turning to SIMDATA.TXT for your main instruction
in use of this package. Afterward, you can reread selections here with
rather more appreciation of how the components they describe constitute
this simulation procedure.

Gendata's programs for creation of artificial data were initiated some
years ago for my personal study of solving data covariances for common
sources that produce factored data in accord with the premises of classical
factor analysis but with patterns of factor weights in model Y = W·F + E
considerably more complex than the nearly degenerate independent-clusters
ideal that has dominated past EFA simulation studies. Gendata's original
production resources (in retrospect, rather narrowly conceived) are
still included here but have been much expanded to provide feasible creation
of simulation data having almost any structural complexity that modern SEM
analysis might hope to recover from data covariances. Textfile SIMDATA.TXT
in this package describes in considerable detail the models that can be
simulated, followed by operational instructions on how to do so once you have
completed the installation.

All programs in this package run under operating system DOS, either
stand-alone or in a DOS window, and should be kept together in a dedicated
subdirectory whose name you will later want to put on your DOS searchpath if
you decide to make serious use of them. The output files they write will
collect in whatever directory is active when you run them, and if you create
data for a series of projects you will probably want to spread that among
assorted subdirectories different from the one containing the production
code. (For initial acquainting yourself with these programs, however, you
may prefer to install them in a temporary subdirectory and test-run them
there.) The only files in this package that actually do something have
names ending in .EXE. These are in binary code (you can't read them in
a text editor) and each is activated by typing its name at the DOS
prompt. (Don't be shy about starting any of these before you're clear on
what it does: Each gives an initial on-screen overview of what lies ahead,
and you can safely abort the run by hitting Ctrl-C whenever you prefer not
to continue.) The counterpart .FOR files are the Fortran-90
ASCII source code from which these EXE-files have been compiled: You don't
need these, but they are included in case you plan to do some simulation
programming yourself and find ideas or procedures in them that you can use.
Also they contain internal documentation, my personal reminders, that might
clarify an occasional uncertainty about the programs' operations. In
particular, all the runtime screen displays can be read there.

-----------------------------------------------------------------------------
A. Here are the Gendata executables, listed in roughly decreasing order of
importance with name-extensions omitted.

GENSCOR: This produces an ASCII matrix of standardized covariances (corre-
lations) Cyy for NV data variables Y having NF common factors (sources) F,
and unique residual influences E having classical orthogonality properties,
that give rise to the Y-scores under a production model Y = W·F + E whose
source pattern W and source-correlations Cff are stipulated by you up to
scaling adjustments required to standardize Y. At your option, GENSCOR
also writes a file of joint scores satisfying this stipulated
composition for NS simulated subjects whose size (NS) and, if wanted,
nonNormalities of distributional skew and kurtosis, are for you to choose.
Before running GENSCOR, however, you must first prepare pattern W at least
schematically using program SCHEMAS or possibly GENPAT. The same is
largely true of source correlations Cff except that your three alternatives
for how to prepare these, detailed in SIMDATA.TXT, are more diverse than
are preparations of W. With one qualification, GENSCOR's output of data
covariances Cyy is Gendata's primary end-product, ready for upload by some
other multivariate program/procedure for whatever purpose has motivated
this simulation. (The qualification is that you can also take Cyy to be
Cff in a multistage Genscor simulation.) The scorefiles *.POP produced
by GENSCOR are likewise Gendata end-products except that further Gendata-
availed operations on these will be wanted before you put their simulation
scores to external use.
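
As a rough illustration of what "scaling adjustments required to standardize
Y" amounts to, here is a minimal Python sketch of the correlations implied by
Y = W·F + E under the classical assumptions (E uncorrelated with F and with
each other). It is not GENSCOR's code: the function name, the list-of-lists
matrix layout, and the choice to supply unique variances directly are all
assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def implied_correlations(W, Cff, uniq_var):
    """Return (Cyy, W_std) for Y = W*F + E.

    W        : NV-by-NF pattern of factor weights (unscaled)
    Cff      : NF-by-NF factor correlations
    uniq_var : NV unique variances (hypothetical input; E is taken to be
               uncorrelated with F and mutually uncorrelated)
    """
    common = matmul(matmul(W, Cff), transpose(W))      # W Cff W'
    nv = len(W)
    cov = [[common[i][j] + (uniq_var[i] if i == j else 0.0)
            for j in range(nv)] for i in range(nv)]
    sd = [math.sqrt(cov[i][i]) for i in range(nv)]
    # Rescale covariances to correlations, and the pattern rows to match.
    Cyy = [[cov[i][j] / (sd[i] * sd[j]) for j in range(nv)] for i in range(nv)]
    W_std = [[w / sd[i] for w in W[i]] for i in range(nv)]
    return Cyy, W_std

# Three variables, two correlated factors (illustrative numbers only).
W = [[0.8, 0.0], [0.6, 0.3], [0.0, 0.7]]
Cff = [[1.0, 0.5], [0.5, 1.0]]
uniq_var = [0.36, 0.37, 0.51]
Cyy, W_std = implied_correlations(W, Cff, uniq_var)
```

In this example the unique variances were chosen so that each Y already has
unit variance, so W_std equals W; with other values the rows of W are
rescaled.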

SEMCOV: For simulation of data manifesting the sorts of causal-path
dependencies among common sources F that modern structural modelling
(SEM) seeks to recover from F-indicator covariances, SEMCOV computes the
F-covariances Cff entailed by a posited path structure A conjoined with
the posited covariances Cuu among exogenous inputs to the endogenous F.
A schema of path matrix A must have been previously created by a run of
SCHEMAS, while Cuu has the same three preparation alternatives that Cff
has in GENSCOR runs. SEMCOV's output Cff has basename of form KOV* and
is intended primarily for upload by GENSCOR to affix a pattern W of F's
measurement manifestations.

Unlike SEM models, EFA simulations generally impose no causal-path
dependencies on their manifest variables' F-sources and hence make no
use of SEMCOV. (EFA by Hyblock is an exception.)
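
The computation SEMCOV is described as performing can be sketched as follows,
assuming the usual linear path model F = A·F + U, so that
Cff = (I - A)^-1 · Cuu · (I - A)^-1'. The convention that A[i][j] is the path
weight from source j to source i, and all function names, are assumptions for
illustration; this is not SEMCOV's code.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(m)
    aug = [[float(x) for x in m[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def implied_source_cov(A, Cuu):
    """Cff entailed by F = A*F + U with Cov(U) = Cuu."""
    n = len(A)
    i_minus_a = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)]
                 for i in range(n)]
    B = inverse(i_minus_a)                  # reduced form: F = B*U
    return matmul(matmul(B, Cuu), transpose(B))

# A simple recursive chain F1 -> F2 -> F3 (illustrative numbers only).
A = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.4, 0.0]]
Cuu = [[1.0, 0.0, 0.0],
       [0.0, 0.75, 0.0],
       [0.0, 0.0, 0.84]]
Cff = implied_source_cov(A, Cuu)
```

Here the exogenous variances were picked so that every F has unit variance,
which makes Cff a correlation matrix with r(F1,F2) = .5, r(F2,F3) = .4 and
r(F1,F3) = .2.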

GENPAT: This is an alternative source of common-factor measurement patterns
W for input to GENSCOR. Its patterns specialize in complex block layouts
that SEM-oriented simulators will find distasteful albeit challenging. And
its output filenames (or "vilenames", as an overly frank typing error has
brought to my attention) are near-intolerably ugly. (They encode rather
more information than really needed there.) You can safely ignore GENPAT
at the outset of your encounters with Gendata.

SCHEMAS: This program lets you build a library of schemata for pattern
and covariance matrices for SEMCOV and GENSCOR to upload. SIMDATA.TXT
explains these in detail; here, it suffices to note: (a) These are quite
easy to create and revise. (b) Each comprises a grid within which nonzero
elements are flagged by placeholders (letters) that go proxy for numeric
values or randomization ranges specified in an assignment table that can
be revised when the schema is uploaded. And (c), when a run of SEMCOV or
GENSCOR needs an input matrix, it lists the available schemata having the
size and character wanted and allows you to browse for your preference if
filename alone doesn't sufficiently prompt your recall of that.
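
To make point (b) concrete, here is a small Python sketch of a schema whose
letter placeholders are resolved through an assignment table. The grid
notation ("." for a zero element), the table layout, and the use of uniform
draws for randomization ranges are all hypothetical; the actual SCHEMAS file
format is documented in SIMDATA.TXT, not here.

```python
import random

def fill_schema(grid, assignments, rng):
    """Turn a placeholder grid into a numeric matrix.

    grid        : list of equal-length strings; "." marks a zero element,
                  any letter is a placeholder (hypothetical notation)
    assignments : letter -> fixed number, or (low, high) randomization range
    rng         : random.Random instance used for the range draws
    """
    matrix = []
    for row in grid:
        values = []
        for cell in row:
            if cell == ".":
                values.append(0.0)
                continue
            spec = assignments[cell]
            if isinstance(spec, tuple):
                low, high = spec
                values.append(rng.uniform(low, high))   # assumed: uniform draw
            else:
                values.append(float(spec))
        matrix.append(values)
    return matrix

# A 4-variable, 2-factor pattern schema (illustrative only).
grid = ["a.",
        "ab",
        ".b",
        ".c"]
assignments = {"a": 0.7, "b": (0.3, 0.6), "c": 0.5}
W = fill_schema(grid, assignments, random.Random(0))
```

Revising the assignment table (say, widening the range for "b") changes the
generated values without touching the grid.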

MORSCOR: In order to maximize retained information in minimal storage space,
each scorefile *.POP written by GENSCOR is in binary code that
will presumably need translation into ASCII if you want to work with this
score distribution. MORSCOR does this for you under your choice of options
that include partitioning the uploaded POP-file's score records into groups
that can be assigned different means on their source variables to yield
modified data suited to study of multiple-group analysis. MORSCOR can
also reconstruct the LOG-information written in ASCII by GENSCOR when
it produced this POP-file in case you have lost its original archive or
want this with 3-decimal accuracy.

SAMPLCOV: This provides material for studying the effect of sampling noise
on SEM source recovery. Using the Odell & Feiveson (J. Amer. Stat. Assn.,
1966) algorithm for simulating covariances in random samples of stipulated
size from an infinite population whose to-be-sampled covariances are given,
SAMPLCOV creates simulations of the X-covariances in K independently random
samples of size NS (your choice of K and NS) from an infinite population
wherein Cxx is what you stipulate. This is tantamount to creating a set
of K bootstrap samples of size NS from a POP-file generated by GENSCOR to
have exactly Cxx for its datascore covariances from untwisted production
axes. Should you want genuine bootstrap samples from a POP-file, you can
run that through Hydata-supplement program HYDATA (described below) to use
the bootstrap option it proffers when computing the input datafile's
covariances.
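
For readers who want a feel for how sample covariances can be simulated
directly, without first generating scores, here is a minimal Python sketch
using the textbook Bartlett decomposition of a Wishart matrix. It assumes a
multivariate Normal population and divides by NS - 1; whether SAMPLCOV's
implementation of the Odell & Feiveson algorithm takes exactly this form is
not stated here, and the function names are made up for illustration.

```python
import math
import random

def cholesky(m):
    """Lower-triangular L with L*L' = m (m must be positive definite)."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def sample_covariance(Cxx, ns, rng):
    """Simulate the covariance matrix of one random sample of size ns."""
    p = len(Cxx)
    df = ns - 1
    if df < p:
        raise ValueError("sample size too small for this sketch")
    L = cholesky(Cxx)
    # Bartlett factor: chi-square roots on the diagonal, N(0,1) below it.
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # chi-square with (df - i) degrees of freedom = Gamma((df - i)/2, scale 2)
        T[i][i] = math.sqrt(rng.gammavariate((df - i) / 2.0, 2.0))
        for j in range(i):
            T[i][j] = rng.gauss(0.0, 1.0)
    M = [[sum(L[i][k] * T[k][j] for k in range(p)) for j in range(p)]
         for i in range(p)]
    # (L T)(L T)' is Wishart(Cxx, df); dividing by df gives a sample covariance.
    return [[sum(M[i][k] * M[j][k] for k in range(p)) / df for j in range(p)]
            for i in range(p)]

Cxx = [[1.0, 0.5, 0.3],
       [0.5, 1.0, 0.4],
       [0.3, 0.4, 1.0]]
rng = random.Random(2003)
samples = [sample_covariance(Cxx, 50, rng) for _ in range(5)]   # K = 5, NS = 50
```

Each element of `samples` is one simulated sample covariance matrix; averaged
over many replications these converge on Cxx.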

COVFORM: Run this to set a small parameter file named KOVFMT that instructs
GENSCOR how best to format its ASCII data-covariance output for transport
to the solution program you want to receive this.

ORDER: This clarifies the structure of a disjointedly assembled causal-path
structure or, more generally, of any directed but possibly nonrecursive
graph whose links, described piecemeal, needn't be allegedly causal.
Operating upon a set of provisional numeric indices (they can be assigned
randomly) for nodes in a causal-path model, program ORDER (a) receives
keyboard entry of a list of index pairs signifying that node i is
directly path-antecedent to node j; (b) sorts these indices into disjoint
"units", each of which is either a singlet (strongly recommended if this
is a SEM's intended path structure) or comprises a group of nodes wherein
each node is path-antecedent to all others in its group; (c) identifies
one or more sequences whose left-to-right order embeds the strict partial
order your input pairings impose on these units; (d) clarifies the degrees
of path-separation among these units; and (e) for each unit that contains
multiple nodes, identifies all the closed loops within this unit as well
as each member node's counts of direct input/output links to nodes outside
this unit. This information enables you to re-index these nodes in closest
correspondence to their degree of path-dependency and, if it reveals loops
you did not intend, makes perspicuous where you can break those most
parsimoniously.
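
Steps (b) and (c) above correspond to a standard graph computation: grouping
nodes into strongly connected components and listing those components in an
order compatible with the path-antecedence relation. The Python sketch below
uses Tarjan's algorithm for this. It is not ORDER's code, the function name is
hypothetical, and it leaves out ORDER's other reports (path-separation
degrees, loop listings, link counts).

```python
def path_units(pairs):
    """Group nodes into units and order the units antecedents-first.

    pairs : iterable of (i, j) meaning node i is directly path-antecedent
            to node j.
    Returns a list of units (sorted lists of node indices); a unit with more
    than one node is a set of nodes that are all mutually path-antecedent.
    """
    pairs = list(pairs)
    nodes = sorted({n for pair in pairs for n in pair})
    succ = {n: [] for n in nodes}
    for i, j in pairs:
        succ[i].append(j)

    index, low = {}, {}
    stack, on_stack, units = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a unit
            unit = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                unit.append(w)
                if w == v:
                    break
            units.append(sorted(unit))

    for v in nodes:
        if v not in index:
            visit(v)
    units.reverse()    # Tarjan emits consequent units first
    return units

# Node 3 feeds node 1; nodes 1 and 4 form a loop; node 4 feeds node 2.
units = path_units([(3, 1), (1, 4), (4, 1), (4, 2)])
```

For this input the result is `[[3], [1, 4], [2]]`: the loop between nodes 1
and 4 shows up as a two-node unit, flagging where a link would have to be cut
to make the structure recursive.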

-----------------------------------------------------------------------------
B. These are the Gendata documentation files in decreasing order of
importance.

SIMDATA.TXT: This describes in considerable detail each Gendata program's
capabilities, how to invoke those when your simulation wants them, and
how to plan your sequence of program calls and uploads. And it begins
with a comprehensive overview of linear structural modelling, both
classical and modern, that you may find conceptually instructive even
if you never create any simulation material using Gendata resources.

READ_1ST.TXT: What you're now reading. In addition to providing your
initial overview of Gendata, it will continue to avail look-up for
which filename to call for doing what job.

HYSTAND.TXT: This contains excerpts from Hyball's README.TXT documentation
that explain the use of certain Hyball programs, included in this package
and briefly described below, that carry out assorted operations on suitably
formatted rawdata files. Unless you do creative things with the simulation
datafiles that GENSCOR affords, you will probably have little use for these.

COVFMT.TXT: This describes the layout an ASCII file requires to transport a
covariance matrix for upload by Hyball/Gendata programs that operate upon
those. All transport covariance files written by programs in this package
have this layout, but you may have occasion to initiate some outside of
that process or do some editing thereof between production and upload.

-----------------------------------------------------------------------------
C. These are the Hydata-supplement programs included in this package.

HYDATA: This reads ASCII scorefiles in many alternative formats from
outside sources, does preprocessing as needed and, if that is successful,
allows one or both of two outputs: (1) Transcription of the really-raw
scores into an ASCII Hydata-standard format that HYDATA and all other
programs in the Hyball package that work on raw data can read without
preprocessing. (2) Computation of the standardized covariances
(correlations) among the input variables or any selected subset thereof,
written to a COV-file that any Hyball and now Gendata program that
uploads covariances can read. (This computation also avails search for
possible quadratic relations among the data variables, and production
of COV for each of up to 156 bootstrap samples of the input data.)

RESCORE: This allows computation of scores on many user-stipulated
functions of the variables in an uploaded Hydata-standard datafile.
The newly created scores can be appended to the input datafile or
recorded in a separate file. RESCORE also allows input of an
externally prepared M-by-N matrix of coefficients that will generate
scores on M linear combinations of any selection of N variables in
the input datafile. Details of how to prepare this in a text editor,
not at present included in RESCORE's documentation, are available on
request.

SELECT: This copies selected subsets of variables and/or records in a
Hydata-standard datafile to a separate D-file.

MERGE: This enables two or more Hydata-standard datafiles having some
(not necessarily all) named variables and/or record IDs in common to
be combined into a single datafile. The input files are merged in a
user-chosen sequence whereby record IDi's score on variable Yj in the
merged file is the last score that occurs with coordinates (IDi, Yj) in
the merge sequence.

An example of what can be done with these resources is revising the
scores on a subset X of the variables in datafile .D1: First, select
scores on X into a new datafile .D2. Next, use RESCORE to transform
.D2 as wanted. (E.g., the selected scores could be rescaled to have
different means and SDs, or partitioned into categories, or nonlinearly
reshaped as by Log-transform.) Finally, merging { .D1, .D2 }
in that order overwrites the X-scores in .D1 with their modified
values in .D2. And if .D2 contains only a subset of the records
in .D1, the merge will overwrite scores on those records only.
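
The select / rescore / merge workflow just described can be mimicked in a few
lines of Python, with a datafile represented as a dictionary from record ID to
a dictionary of variable scores. This representation and the three function
names are stand-ins for illustration only; they are not the Hydata-standard
file format or the actual program interfaces.

```python
import math

def select(data, variables, records=None):
    """Copy the chosen variables (and optionally records) to a new datafile."""
    ids = list(data) if records is None else records
    return {rid: {v: data[rid][v] for v in variables}
            for rid in ids if rid in data}

def rescore(data, func):
    """Apply func to every score in the datafile."""
    return {rid: {v: func(x) for v, x in row.items()}
            for rid, row in data.items()}

def merge(*files):
    """Merge datafiles in sequence; the last score at (record, variable) wins."""
    merged = {}
    for f in files:
        for rid, row in f.items():
            merged.setdefault(rid, {}).update(row)
    return merged

d1 = {"s1": {"X": 1.0,   "Z": 5.0},
      "s2": {"X": 100.0, "Z": 6.0},
      "s3": {"X": 10.0,  "Z": 7.0}}

# Select X for records s1 and s2 only, log-transform, then merge back.
d2 = rescore(select(d1, ["X"], records=["s1", "s2"]), math.log10)
result = merge(d1, d2)
```

After the merge, X is overwritten for s1 and s2 only; s3 keeps its original X
score, and Z is untouched throughout.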

Hyball's documentation for these Hydata-supplement programs is excerpted
in this package's textfile HYSTAND.TXT.

-----------------------------------------------------------------------------
D. Specialized background readings (contentious). All are in ASCII format.

PRELOOPS.TXT: An introduction to important details of scientific
regularities' conceptual composition that are mostly suppressed, seldom
wisely and often obfuscatingly, in standard multivariate algebra.

LOOPS.TXT: My effort to make clear for SEMists who see nothing wrong with
nonrecursive models that these violate our most basic intuitions about
causal influence.

MORLOOPS.TXT: My 26/9/2001 Semnet post showing algebraically the special
conditions under which it is possible, under the standard linear model
of system dynamics, for indicators of source variables in a path loop
to have uncorrelated residual disturbances.

SMEPSHOW.TXT: A reduced version of my 1997 SMEP-meeting handout attempting
to demonstrate how EFA can effectively play in SEM's path-detection league
by use of the Hyball package's Hyblock procedure.

Wm. W. Rozeboom