Data-simulation

Download: Simulation (May 19, 2003)


       GENDATA: A program package for creating artificial data with
                complex factor structure

 Preamble:  Why simulate?

   a) Create datasets with idealities or special features wanted for
      educating and testing students, or for studying the comparative
      accuracies of alternative solution methods.

   b) Test for ease or difficulty in solving for particular hypothesized
      structures, especially when the ideal is contaminated by sampling
      noise or other model violations.  In particular, which parameters
      in a favored model are most/least robustly recoverable under varied
      degrees of misspecification elsewhere in the model?

   c) Explore alternatives for parts of a model, intended for fit to real
      data, that are convenient guesses rather than hypotheses supported
      by what is already known or plausibly surmised about the model's
      intended real-data application.  (Even if you don't actually create
      any Gendata simulations, consider grounding some student seminar
      discussion on the options it avails for choosing a model's
      exo-roots structure.)

   d) [More? -- suggestions welcome.]

 -------------------------------------------------------------------------

     Below is a packing list, with brief overviews, of files in the Gendata
 package.  These are of four sorts:  (a) executables and the Fortran-90
 source code from which they have been compiled; (b) major or minor textfiles
 documenting these programs' usage; (c) several programs from my EFA package
 ("Hyball") that do things with suitably formatted rawdata files; and
 (d) some textfiles containing unpublished though circulated views/proposals
 on certain controversial aspects of SEM practice.

     However, the descriptions here will not give you much operational sense
 of how to run these programs and what you can get from them.  So skim this
 document only briefly before turning to SIMDATA.TXT for your main instruction
 in use of this package.  Afterward, you can reread this list with rather
 more appreciation of how these pieces constitute the simulation process.

     Gendata's programs for creation of artificial data were initiated some
 years ago for my personal study of solving data covariances for common
 sources that produce factored data in accord with the premises of classical
 factor analysis but with patterns of factor weights in model Y = A·F + E
 considerably more complex than the nearly degenerate independent-clusters
 ideal that has dominated past EFA simulation studies.  Gendata's original
 production resources -- in retrospect, rather narrowly conceived -- are
 still included here but have been much expanded to provide feasible creation
 of simulation data having almost any structural complexity that modern SEM
 analysis might hope to recover from data covariances.  Textfile SIMDATA.TXT
 in this package describes in considerable detail the models that can be
 simulated, followed by operational instructions on how to do so once you have
 completed the installation.

     All programs in this package run under operating system DOS, either
 stand-alone or in a DOS window, and should be kept together in a dedicated
 subdirectory whose name you will later want to put on your DOS searchpath if
 you decide to make serious use of them.  The output files they write will
 collect in whatever directory is active when you run them, and if you create
 data for a series of projects you will probably want to spread that among
 assorted subdirectories different from the one containing the production
 code.  (To get initially acquainted with these programs, however, you
 may prefer to install them in a temporary subdirectory and test-run them
 there.)  The only files in this package that actually do something have
 names of form .EXE.  These are in binary code -- you can't read
 them in a text editor -- and are activated by typing the name at the DOS
 prompt.  (Don't be shy about starting any of these before you're clear on
 what it does:  Each gives an initial on-screen overview of what lies ahead,
 and you can safely abort the run by hitting Ctrl-C whenever you prefer not
 to continue.)  The counterpart .FOR files are the Fortran-90
 ASCII source code from which these EXE-files have been compiled:  You don't
 need these, but they are included in case you plan to do some simulation
 programming yourself and find ideas or procedures in them that you can use.
 Also they contain internal documentation, my personal reminders, that may
 clarify an occasional uncertainty about the programs' operations.  In
 particular, all the runtime screen displays can be read there.

 -------------------------------------------------------------------------

 A. Here are the Gendata executables, listed in roughly decreasing order of
    importance with name-extensions omitted.

 GENSCOR:  This produces an ASCII matrix of standardized covariances (corre-
   lations) Cyy for NV data variables Y having NF common factors (sources) F,
   and unique residual influences E having classical orthogonality properties,
   that give rise to the Y-scores under a production model Y = W·F + E whose
   source pattern W and source-correlations Cff are stipulated by you up to
   scaling adjustments required to standardize Y.  At your option, GENSCOR
   also writes a file of joint scores on  satisfying this stipulated
   composition for a sample of simulated subjects whose size (NS) and, if
   wanted, nonNormalities of distributional skew and kurtosis you control.
   Before running GENSCOR, however, you must first prepare pattern W at least
   schematically using program GENPAT or SCHEMAS.  The same is true of source
   correlations Cff except that your three alternatives for how to prepare
   these, detailed in SIMDATA.TXT, are more diverse than are preparations
   of W.  With one qualification, GENSCOR's output of data covariances Cyy is
   Gendata's primary end-product, ready for upload by some other multivariate
   program/procedure for whatever purpose has motivated this simulation.
   (The qualification is that you can also take Cyy to be Cff in a multi-stage
   Genscor simulation.)  The scorefiles *.POP produced by GENSCOR are likewise
   Gendata end-products except that further Gendata-availed operations on
   these will be wanted before you put their simulation scores to external
   use.
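
   The covariance algebra behind this entry can be sketched as follows.
   This is a minimal illustration in Python, not GENSCOR's Fortran.  It
   assumes the unique residuals E are uncorrelated with F and with one
   another, and it assumes that standardizing Y means dividing each row of
   W (and each unique part) by that variable's raw standard deviation;
   GENSCOR's own scaling adjustments may differ.  The function name, the
   argument "uniq" and all numbers are hypothetical.

```python
from math import sqrt

def implied_correlations(W, Cff, uniq):
    """Correlations of Y = W·F + E, given Cov(F) = Cff and Var(E_i) = uniq[i].

    Returns (Cyy, W_scaled); W_scaled is W rescaled so each Y has unit variance.
    """
    nv, nf = len(W), len(Cff)
    # Raw covariances: W * Cff * W' ...
    WC = [[sum(W[i][k] * Cff[k][j] for k in range(nf)) for j in range(nf)]
          for i in range(nv)]
    cov = [[sum(WC[i][k] * W[j][k] for k in range(nf)) for j in range(nv)]
           for i in range(nv)]
    # ... plus the unique variances on the diagonal.
    for i in range(nv):
        cov[i][i] += uniq[i]
    # Standardize: divide each covariance by the two standard deviations.
    sd = [sqrt(cov[i][i]) for i in range(nv)]
    Cyy = [[cov[i][j] / (sd[i] * sd[j]) for j in range(nv)] for i in range(nv)]
    W_scaled = [[W[i][k] / sd[i] for k in range(nf)] for i in range(nv)]
    return Cyy, W_scaled

# Hypothetical example: NV = 3 variables, NF = 2 correlated factors.
W = [[0.8, 0.0],
     [0.6, 0.3],
     [0.0, 0.7]]
Cff = [[1.0, 0.5],
       [0.5, 1.0]]
Cyy, W_std = implied_correlations(W, Cff, uniq=[0.36, 0.40, 0.51])
```

   Only the second variable is rescaled here, since the first and third
   already have unit variance under these hypothetical weights.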

 SEMCOV:  For simulation of data manifesting the sorts of causal-path
   dependencies among common sources F that modern structural modelling
   (SEM) seeks to recover from F-indicator covariances, SEMCOV computes the
   F-covariances Cff entailed by a posited path structure A conjoined with
   the posited covariances Cuu among exogenous inputs to the endogenous F.
   A schema of path matrix A must have been previously created by a run of
   SCHEMAS, while Cuu has the same three preparation alternatives that Cff
   has in GENSCOR runs.  SEMCOV's output Cff has basename of form KOV* and
   is intended primarily for upload by GENSCOR to affix a pattern W of F's
   measurement manifestations.

       In contrast to SEM models, EFA simulations impose no causal-path
   dependencies on their manifest variables' F-sources and hence make no
   use of SEMCOV.
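
   What "entailed by" amounts to can be sketched in a few lines.  The
   sketch assumes the standard linear path model F = A·F + U, with A[i][j]
   the path weight from source j to source i, so that
   Cff = inv(I - A) · Cuu · inv(I - A)'.  That convention, the function
   names and the example numbers are assumptions for illustration, not a
   description of SEMCOV's code.

```python
def mat_mul(X, Y):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("I - A is singular; no covariances are entailed")
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def path_covariances(A, Cuu):
    """Cff = inv(I - A) * Cuu * inv(I - A)' for the model F = A·F + U."""
    n = len(A)
    i_minus_a = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)]
                 for i in range(n)]
    B = inverse(i_minus_a)
    Bt = [list(col) for col in zip(*B)]
    return mat_mul(mat_mul(B, Cuu), Bt)

# Hypothetical recursive chain F1 -> F2 -> F3 with uncorrelated inputs U.
A = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.4, 0.0]]
Cuu = [[1.0, 0.0, 0.0],
       [0.0, 0.75, 0.0],
       [0.0, 0.0, 0.84]]
Cff = path_covariances(A, Cuu)
```

   The input variances in this example were picked so that every F comes
   out with unit variance; for a nonrecursive A the same formula applies
   as long as I - A is invertible.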

 GENPAT:  This is an alternative source of common-factor measurement patterns
   W for input to GENSCOR.  Its patterns specialize in complex block layouts
   that SEM-oriented simulators will find distasteful albeit challenging.  And
   its output filenames (or "vilenames", as an overly frank typing error has
   brought to my attention) are near-intolerably ugly.  (They encode rather
   more information than really needed there.)  You can safely ignore GENPAT
   at outset of your encounters with Gendata.

 SCHEMAS:  This program lets you build a library of schemata for pattern
   and covariance matrices that SEMCOV and GENSCOR can upload.  SIMDATA.TXT
   explains these in detail; here, it suffices to note:  (a) These are quite
   easy to create and revise.  (b) Each comprises a grid within which nonzero
   elements are flagged by placeholders (letters) that go proxy for numeric
   values or randomization ranges specified in an assignment table that can
   be revised when the schema is uploaded.  And (c), when a run of SEMCOV or
   GENSCOR is ready for this, it lists the available schemata having the size
   and character currently needed and allows you to browse for your preference
   if filename alone doesn't sufficiently prompt your recall of that.
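
   The schema idea in (b) can be pictured with a small sketch.  The grid
   layout below ("." for a zero cell, one letter per nonzero cell) and the
   form of the assignment table are hypothetical, chosen only to
   illustrate the placeholder mechanism; the actual schema file format is
   the one documented in SIMDATA.TXT.  The sketch also assumes that each
   cell flagged with a range letter gets its own independent random draw.

```python
import random

def fill_schema(grid, table, rng):
    """Turn a placeholder grid into a numeric matrix.

    grid  : list of equal-length strings; '.' is zero, a letter is a placeholder
    table : maps each letter to a fixed number or to a (low, high) range
    rng   : a random.Random instance used for the range draws
    """
    matrix = []
    for row in grid:
        values = []
        for ch in row:
            if ch == ".":
                values.append(0.0)
            else:
                spec = table[ch]
                if isinstance(spec, tuple):
                    values.append(rng.uniform(*spec))  # randomization range
                else:
                    values.append(float(spec))         # fixed numeric value
        matrix.append(values)
    return matrix

# Hypothetical 4-by-2 pattern schema and its assignment table.
grid = ["a.",
        "ab",
        ".b",
        ".c"]
table = {"a": 0.7, "b": (0.3, 0.6), "c": 0.5}
W = fill_schema(grid, table, random.Random(2003))
```

   Revising the table (say, widening the range for "b") changes the
   matrix without touching the grid, which is the point of keeping the
   two apart.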

 MORSCOR:  In order to maximize retained information in minimal storage space,
   each scorefile .POP written by GENSCOR is in binary code that
   will presumably need translation into ASCII if you want to work with this
   score distribution.  MORSCOR does this for you under your choice of options
   that are meager at present but can be expanded as need arises for more
   alternatives.  MORSCOR can also reconstruct all the ASCII payout written
   by GENSCOR when it produced this POP-file in case you have lost its
   original archive.

 SAMPLCOV:  This provides material for studying the effect of sampling noise
   on SEM source recovery.  Using the Odell & Feiveson (J. Amer. Stat. Assn.,
   1966) algorithm for simulating covariances in random samples of stipulated
   size from an infinite population whose to-be-sampled covariances are given,
   SAMPLCOV creates simulations of the X-covariances in K independently random
   samples of size NS (your choice of K and NS) from an infinite population
   wherein Cxx is what you stipulate.  This is tantamount to creating a set of
   K bootstrap samples of size NS from a POP-file generated by GENSCOR to have
   exactly Cxx for its datascore covariances from untwisted production axes.
   Should you want genuine bootstrap samples from a POP-file, you can run that
   through Hydata-supplement program HYDATA (described below) to use the
   bootstrap option it proffers when computing the input datafile's
   covariances.
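
   The idea of drawing sample covariances directly, without generating
   any scores, can be sketched with a Bartlett-type triangular
   decomposition of the Wishart distribution.  Whether SAMPLCOV's Fortran
   follows these steps exactly is an assumption; the sketch also assumes
   Normal population scores, a sample mean estimated from the data (hence
   NS - 1 degrees of freedom), and NS larger than the number of variables.
   All names are hypothetical.

```python
import random
from math import sqrt

def cholesky(S):
    """Lower-triangular L with L * L' = S (S symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = sqrt(s) if i == j else s / L[j][j]
    return L

def simulate_sample_cov(Cxx, ns, rng):
    """One simulated sample covariance matrix for a sample of size ns."""
    p, df = len(Cxx), ns - 1
    if ns <= p:
        raise ValueError("need ns greater than the number of variables")
    L = cholesky(Cxx)
    # Triangular T: chi-square roots on the diagonal, N(0,1) below it.
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # Gamma(shape=k/2, scale=2) is chi-square with k degrees of freedom.
        T[i][i] = sqrt(rng.gammavariate((df - i) / 2.0, 2.0))
        for j in range(i):
            T[i][j] = rng.gauss(0.0, 1.0)
    # (L*T)(L*T)' is Wishart(df, Cxx); dividing by df gives the covariances.
    M = [[sum(L[i][k] * T[k][j] for k in range(p)) for j in range(p)]
         for i in range(p)]
    return [[sum(M[i][k] * M[j][k] for k in range(p)) / df for j in range(p)]
            for i in range(p)]

# K = 5 independent samples of size NS = 200 from a hypothetical population.
Cxx = [[1.0, 0.5, 0.2],
       [0.5, 1.0, 0.4],
       [0.2, 0.4, 1.0]]
rng = random.Random(1966)
samples = [simulate_sample_cov(Cxx, 200, rng) for _ in range(5)]
```

   Each call costs on the order of NV-cubed operations however large NS
   is, which is what makes this cheaper than creating NS scores per sample
   and then computing their covariances.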

 COVFORM:  Run this to set a small parameter file named KOVFMT that instructs
   GENSCOR how best to format its ASCII data-covariance output for transport
   to the solution program you want to receive this.

 ORDER:  This clarifies the structure of a disjointedly assembled causal-path
   structure or, more generally, of any directed but possibly nonrecursive
   graph whose links, described piecemeal, needn't be allegedly causal.
   Operating upon a set of provisional numeric indices (they can be assigned
   randomly) for nodes in a causal-path model, program ORDER (a) receives
   keyboard entry of a list of index pairs (i,j) signifying that node i is
   directly path-antecedent to node j; (b) sorts these indices into disjoint
   "units", each of which is either a singlet (strongly recommended if this is
   a SEM model's intended path structure) or comprises a group of nodes each
   of which is path-antecedent to all others in its group; (c) identifies one
   or more sequences whose left-to-right order embeds the strict partial order
   these input pairings impose on these units; (d) clarifies the degrees of
   path-separation among these units; and (e) for each unit that contains
   multiple nodes, identifies all the closed loops within this unit as well
   as each member node's counts of direct input/output links to nodes outside
   this unit.  This information enables you to re-index these nodes in closest
   correspondence to their degree of path-dependency and, if it discloses
   loops you did not intend, makes perspicuous where you can break those most
   parsimoniously.
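
   Steps (b) and (c) correspond to what graph theory calls finding the
   strongly connected components of a directed graph and ordering them
   topologically.  The sketch below uses Tarjan's algorithm for this; it
   is an illustration of the technique, not a transcription of ORDER's
   own procedure, and it does not attempt steps (d) and (e).  The function
   name and example pairs are hypothetical.

```python
def units_in_path_order(pairs):
    """Group nodes into units and list the units antecedents-first.

    pairs : iterable of (i, j), meaning node i is directly antecedent to j.
    Returns a list of units; each unit is a sorted list of node indices.
    A unit with more than one node contains at least one closed loop.
    """
    nodes = sorted({n for pair in pairs for n in pair})
    succ = {n: [] for n in nodes}
    for i, j in pairs:
        succ[i].append(j)

    index, low = {}, {}
    stack, on_stack, units = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a unit: pop its members off the stack.
            unit = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                unit.append(w)
                if w == v:
                    break
            units.append(sorted(unit))

    for v in nodes:
        if v not in index:
            visit(v)
    # Tarjan emits units successors-first, so reverse for path order.
    return units[::-1]

# Hypothetical input: 3 -> 1 -> 4, a loop between 4 and 2, and 2 -> 5.
pairs = [(3, 1), (1, 4), (4, 2), (2, 4), (2, 5)]
order = units_in_path_order(pairs)
```

   In this example nodes 2 and 4 fall into one unit because each is
   path-antecedent to the other; dropping either (4,2) or (2,4) from the
   input would leave every unit a singlet.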

 -------------------------------------------------------------------------
 B. These are the Gendata documentation files in decreasing order of
    importance.

 SIMDATA.TXT:  This describes in considerable detail each Gendata program's
   capabilities, how to invoke those when your simulation wants them, and
   how to plan your sequence of program calls and uploads.  And it begins
   with a comprehensive overview of linear structural modelling, both
   classical and modern, that you may find conceptually instructive even
   if you never create any simulation material using Gendata resources.

 READ_1ST.TXT:  What you're now reading.  In addition to affording your
   initial overview of Gendata, it will continue to provide lookup for
   which filename to call for doing what job.  (If even I have trouble
   remembering, why shouldn't you?)

 HYSTAND.TXT:  This contains excerpts from Hyball's README.TXT documentation
   that explain the use of certain Hyball programs, included in this package
   and briefly described below, that carry out assorted operations on suitably
   formatted rawdata files.  Unless you do creative things with the simulation
   datafiles that GENSCOR affords, you will probably have little use for these.

 COVFMT.TXT:  This describes the layout an ASCII file requires to transport a
   covariance matrix for upload by Hyball/Gendata programs that operate upon
   those.  All transport covariance files written by programs in this package
   have this layout, but you may have occasion to initiate some outside of
   that process or do some editing thereof between production and upload.


 -------------------------------------------------------------------------

 C. These are the Hydata-supplement programs included in this package.

 HYDATA:  This reads ASCII scorefiles in many alternative formats from
   outside sources, does preprocessing as needed and, if that is successful,
   allows one or both of two outputs: (1) Transcription of the really-raw
   scores into an ASCII Hydata-standard format that HYDATA and all other
   programs in the Hyball package that work on raw data can read without
   preprocessing.  (2) Computation of the standardized covariances
   (correlations) among the input variables or any selected subset thereof,
   written to a COV-file that any Hyball and now Gendata program that
   uploads covariances can read.  (This computation also avails search for
   possible quadratic relations among the data variables, and production
   of COV for each of up to 156 bootstrap samples of the input data.)

 RESCORE:  This allows computation of scores on many user-stipulated
   functions of the variables in an uploaded Hydata-standard datafile.
   The newly created scores can be appended to the input datafile or
   recorded in a separate file.  RESCORE also allows input of an
   externally prepared M-by-N matrix of coefficients that will generate
   scores on M linear combinations of any selection of N variables in
   the input datafile.  Details of how to prepare this in a text editor,
   not at present included in RESCORE's documentation, are available on
   request.

 SELECT:  This allows subsets of variables and/or records in a Hydata-
   standard datafile to be copied to a separate one.

 MERGE:  This enables two or more Hydata-standard datafiles having some
   (not necessarily all) named variables and/or record IDs in common to
   be combined into a single datafile.  The input files are merged in a
   user-chosen sequence whereby record IDi's score on variable Yj in the
   merged file is the last score that occurs with those coordinates in
   the merge sequence.

      An example of what can be done with these resources is revising the
 scores on a subset X of the variables in datafile .D1:  First, select
 scores on X into a new datafile .D2.  Next, use RESCORE to transform
 .D2 as wanted.  (E.g., the selected scores could be rescaled to have
 different means and SDs, or partitioned into categories, or nonlinearly
 reshaped as by Log-transform.)  Finally, merging { .D1, .D2 }
 in that order overwrites the X-scores in .D1 with their modified
 values in .D2.  And if .D2 contains only a subset of the records
 in .D1, the merge will overwrite scores on those records only.
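
      The select / rescore / merge sequence just described can be mimicked
 in a few lines.  This sketch stands in for the logic only: it represents
 a datafile as a Python dictionary keyed by record ID, which is an
 assumption made for illustration and has nothing to do with the
 Hydata-standard file layout.  All names and scores are hypothetical.

```python
from math import log10

def merge(*datafiles):
    """Combine datafiles in the given order; the last score seen for a
    (record ID, variable) pair is the one kept."""
    merged = {}
    for datafile in datafiles:
        for rec_id, scores in datafile.items():
            merged.setdefault(rec_id, {}).update(scores)
    return merged

# Original datafile: three records scored on variables X and Z.
d1 = {"S1": {"X": 10.0, "Z": 3.0},
      "S2": {"X": 100.0, "Z": 4.0},
      "S3": {"X": 1000.0, "Z": 5.0}}

# Select X for records S1 and S2 only, then rescore by Log-transform.
d2 = {rec_id: {"X": log10(d1[rec_id]["X"])} for rec_id in ("S1", "S2")}

# Merging in the order d1, d2 overwrites X for S1 and S2 and nothing else.
result = merge(d1, d2)
```

 Record S3 keeps its original X-score because it does not appear in the
 second datafile, and every Z-score survives because the second datafile
 carries no Z.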

     Hyball's documentation for these Hydata-supplement programs is excerpted
 in this package's textfile HYSTAND.TXT.


 -------------------------------------------------------------------------

 D. Specialized background reading, contentious.

 PRELOOPS.TXT:  An introduction to important details of scientific
   regularities' conceptual composition that are mostly suppressed, seldom
   wisely and often obfuscatingly, in standard multivariate algebra.

 LOOPS.TXT:  My effort to make clear for SEMists who see nothing wrong with
   nonrecursive models that these violate our most basic intuitions about
   causal influence.

 MORLOOPS.TXT:  My 26/9/2001 Semnet post showing algebraically the special
   conditions under which it is possible, under the standard linear model
   of system dynamics, for indicators of source variables in a path loop
   to have uncorrelated residual disturbances.

 SMEPSHOW.TXT:  A reduced version of my 1997 SMEP-meeting handout attempting
   to demonstrate how EFA can effectively play in SEM's path-detection league
   by use of my Hyball package's Hyblock procedure.