Data-simulation
Download: Simulation (May 19, 2003)

GENDATA: A program package for creating artificial data with complex factor structure

Preamble: Why simulate?

a) Create datasets with idealities or special features wanted for educating and testing students, or for studying the comparative accuracies of alternative solution methods.

b) Test for ease or difficulty in solving for particular hypothesized structures, especially when the ideal is contaminated by sampling noise or other model violations. In particular, which parameters in a favored model are most/least robustly recoverable under varied degrees of misspecification elsewhere in the model?

c) Explore alternatives for parts of a model, intended for fit to real data, that are convenient guesses rather than hypotheses supported by what is already known or plausibly surmised about the model's intended real-data application. (Even if you don't actually create any Gendata simulations, consider grounding some student seminar discussion on the options it avails for choosing a model's exo-roots structure.)

d) [More? -- suggestions welcome.]

-------------------------------------------------------------------------

Below is a packing list, with brief overviews, of files in the Gendata package. These are of four sorts: (a) executables and the Fortran-90 source code from which they have been compiled; (b) major or minor textfiles documenting these programs' usage; (c) several programs from my EFA package ("Hyball") that do things with suitably formatted rawdata files; and (d) some textfiles containing unpublished though circulated views/proposals on certain controversial aspects of SEM practice. However, the descriptions here will not give you much operational sense of how to run these programs and what you can get from them. So skim this document only briefly before turning to SIMDATA.TXT for your main instruction in use of this package.
Afterward, you can reread this list with rather more appreciation of how these pieces constitute the simulation process.

Gendata's programs for creation of artificial data were initiated some years ago for my personal study of solving data covariances for common sources that produce factored data in accord with the premises of classical factor analysis but with patterns of factor weights in model Y = A·F + E considerably more complex than the nearly degenerate independent-clusters ideal that has dominated past EFA simulation studies. Gendata's original production resources -- in retrospect, rather narrowly conceived -- are still included here but have been much expanded to provide feasible creation of simulation data having almost any structural complexity that modern SEM analysis might hope to recover from data covariances. Textfile SIMDATA.TXT in this package describes in considerable detail the models that can be simulated, followed by operational instructions on how to do so once you have completed the installation.

All programs in this package run under operating system DOS, either stand-alone or in a DOS window, and should be kept together in a dedicated subdirectory whose name you will later want to put on your DOS searchpath if you decide to make serious use of them. The output files they write will collect in whatever directory is active when you run them, and if you create data for a series of projects you will probably want to spread that among assorted subdirectories different from the one containing the production code. (For initially acquainting yourself with these programs, however, you may prefer to install them in a temporary subdirectory and test-run them there.)

The only files in this package that actually do something have names of form *.EXE. These are in binary code -- you can't read them in a text editor -- and are activated by typing their names at the DOS prompt.
(Don't be shy about starting any of these before you're clear on what it does: Each gives an initial on-screen overview of what lies ahead, and you can safely abort the run by hitting Ctrl-C whenever you prefer not to continue.) The counterpart .FOR files are the Fortran-90 ASCII source code from which these EXE-files have been compiled: You don't need these, but they are included in case you plan to do some simulation programming yourself and find ideas or procedures in them that you can use. Also they contain internal documentation, my personal reminders, that may clarify an occasional uncertainty about the programs' operations. In particular, all the runtime screen displays can be read there.

-------------------------------------------------------------------------

A. Here are the Gendata executables, listed in roughly decreasing order of importance with name-extensions omitted.

GENSCOR: This produces an ASCII matrix of standardized covariances (correlations) Cyy for NV data variables Y having NF common factors (sources) F, and unique residual influences E having classical orthogonality properties, that give rise to the Y-scores under a production model Y = W·F + E whose source pattern W and source-correlations Cff are stipulated by you up to scaling adjustments required to standardize Y. At your option, GENSCOR also writes a file of joint scores satisfying this stipulated composition for NS simulated subjects, with the sample size (NS) and, if wanted, nonNormalities of distributional skew and kurtosis set to degrees you control. Before running GENSCOR, however, you must first prepare pattern W at least schematically using program GENPAT or SCHEMAS. The same is true of source correlations Cff except that your three alternatives for how to prepare these, detailed in SIMDATA.TXT, are more diverse than are preparations of W.
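For readers who want to see the arithmetic behind Cyy, here is a minimal Python sketch, not GENSCOR's own Fortran code. It assumes the classical result that, with residuals E uncorrelated with F and with each other, the Y-covariances are W·Cff·W' plus a diagonal of unique variances, and that standardization just rescales each row of W and its residual. The function name, the example numbers, and the idea of passing unique variances directly are illustrative assumptions; SIMDATA.TXT is the authority on how GENSCOR actually takes its inputs.

```python
import math

def implied_correlations(W, Cff, unique_var):
    """Correlations among Y under Y = W.F + E (illustrative sketch only).

    W          : NV x NF pattern of factor weights (list of rows)
    Cff        : NF x NF covariances among the common factors F
    unique_var : NV residual variances; E is assumed uncorrelated with F
                 and across variables (classical orthogonality)
    Returns (Cyy, W_std): the standardized covariances and the rescaled W.
    """
    nv, nf = len(W), len(Cff)
    # Common part of the covariances: W * Cff * W'
    wc = [[sum(W[i][k] * Cff[k][j] for k in range(nf)) for j in range(nf)]
          for i in range(nv)]
    cov = [[sum(wc[i][k] * W[j][k] for k in range(nf)) for j in range(nv)]
           for i in range(nv)]
    # Add each variable's unique variance on the diagonal
    for i in range(nv):
        cov[i][i] += unique_var[i]
    # Standardize: divide by the product of standard deviations
    sd = [math.sqrt(cov[i][i]) for i in range(nv)]
    Cyy = [[cov[i][j] / (sd[i] * sd[j]) for j in range(nv)] for i in range(nv)]
    W_std = [[W[i][k] / sd[i] for k in range(nf)] for i in range(nv)]
    return Cyy, W_std

# Hypothetical 3-variable, 2-factor example (numbers chosen for illustration)
W = [[1.6, 0.0],
     [0.6, 0.3],
     [0.0, 0.7]]
Cff = [[1.0, 0.5],
       [0.5, 1.0]]
unique_var = [1.44, 0.37, 0.51]

Cyy, W_std = implied_correlations(W, Cff, unique_var)
for row in Cyy:
    print(" ".join(f"{value:7.3f}" for value in row))
```

The returned W_std shows the "scaling adjustment" in action: the first variable's stipulated weight is rescaled so that its total variance becomes 1.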
With one qualification, GENSCOR's output of data covariances Cyy is Gendata's primary end-product, ready for upload by some other multivariate program/procedure for whatever purpose has motivated this simulation. (The qualification is that you can also take Cyy to be Cff in a multi-stage GENSCOR simulation.) The scorefiles *.POP produced by GENSCOR are likewise Gendata end-products except that further Gendata-availed operations on these will be wanted before you put their simulation scores to external use.

SEMCOV: For simulation of data manifesting the sorts of causal-path dependencies among common sources F that modern structural modelling (SEM) seeks to recover from F-indicator covariances, SEMCOV computes the F-covariances Cff entailed by a posited path structure A conjoined with the posited covariances Cuu among exogenous inputs to the endogenous F. A schema of path matrix A must have been previously created by a run of SCHEMAS, while Cuu has the same three preparation alternatives that Cff has in GENSCOR runs. SEMCOV's output Cff has basename of form KOV* and is intended primarily for upload by GENSCOR to affix a pattern W of F's measurement manifestations. In contrast to SEM models, EFA simulations impose no causal-path dependencies on their manifest variables' F-sources and hence make no use of SEMCOV.

GENPAT: This is an alternative source of common-factor measurement patterns W for input to GENSCOR. Its patterns specialize in complex block layouts that SEM-oriented simulators will find distasteful albeit challenging. And its output filenames (or "vilenames", as an overly frank typing error has brought to my attention) are near-intolerably ugly. (They encode rather more information than really needed there.) You can safely ignore GENPAT at the outset of your encounters with Gendata.

SCHEMAS: This program lets you build a library of schemata for pattern and covariance matrices that SEMCOV and GENSCOR can upload.
SIMDATA.TXT explains these in detail; here, it suffices to note: (a) These are quite easy to create and revise. (b) Each comprises a grid within which nonzero elements are flagged by placeholders (letters) that go proxy for numeric values or randomization ranges specified in an assignment table that can be revised when the schema is uploaded. And (c), when a run of SEMCOV or GENSCOR is ready for this, it lists the available schemata having the size and character currently needed and allows you to browse for your preference if filename alone doesn't sufficiently prompt your recall of that.

MORSCOR: In order to maximize retained information in minimal storage space, each scorefile *.POP written by GENSCOR is in binary code that will presumably need translation into ASCII if you want to work with this score distribution. MORSCOR does this for you under your choice of options that are meager at present but can be expanded as need arises for more alternatives. MORSCOR can also reconstruct all the ASCII payout written by GENSCOR when it produced this POP-file in case you have lost its original archive.

SAMPLCOV: This provides material for studying the effect of sampling noise on SEM source recovery. Using the Odell & Feiveson (J. Amer. Stat. Assn., 1966) algorithm for simulating covariances in random samples of stipulated size from an infinite population whose to-be-sampled covariances are given, SAMPLCOV creates simulations of the X-covariances in K independently random samples of size NS (your choice of K and NS) from an infinite population wherein Cxx is what you stipulate. This is tantamount to creating a set of K bootstrap samples of size NS from a POP-file generated by GENSCOR to have exactly Cxx for its datascore covariances from untwisted production axes.
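To make the idea of simulating sample covariances without simulating scores concrete, here is a minimal Python sketch. It is not SAMPLCOV's code, and it assumes a multivariate Normal population; it uses the standard Bartlett decomposition of the Wishart distribution, which I take to be the kind of device the cited algorithm relies on, though the details of SAMPLCOV's implementation are not documented here. The function names, the example Cxx, and the choice of divisor NS-1 are illustrative assumptions.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L * L' = S, for positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_covariance(Cxx, ns, rng):
    """One simulated sample covariance matrix for a sample of size ns.

    Assumes a multivariate Normal population with covariances Cxx.
    Bartlett decomposition: the sum-of-squares matrix with df = ns - 1
    equals (L*A)(L*A)', where A is lower triangular with chi-distributed
    diagonal entries and standard Normal entries below the diagonal.
    """
    p = len(Cxx)
    if ns <= p:
        raise ValueError("sample size must exceed the number of variables")
    df = ns - 1
    L = cholesky(Cxx)
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # Chi-square with (df - i) degrees of freedom via a gamma draw
        A[i][i] = math.sqrt(rng.gammavariate((df - i) / 2.0, 2.0))
        for j in range(i):
            A[i][j] = rng.gauss(0.0, 1.0)
    T = [[sum(L[i][k] * A[k][j] for k in range(p)) for j in range(p)]
         for i in range(p)]
    # Divide the sum-of-squares matrix by df for an unbiased covariance
    return [[sum(T[i][k] * T[j][k] for k in range(p)) / df for j in range(p)]
            for i in range(p)]

# Hypothetical population covariances and run sizes (illustration only)
Cxx = [[1.0, 0.5, 0.3],
       [0.5, 1.0, 0.2],
       [0.3, 0.2, 1.0]]
rng = random.Random(2003)          # seeded so the run is repeatable
samples = [sample_covariance(Cxx, 50, rng) for _ in range(3)]   # K = 3, NS = 50
for S in samples:
    print([round(S[0][1], 3), round(S[0][2], 3), round(S[1][2], 3)])
```

Each call costs only a handful of random draws regardless of NS, which is the practical attraction of generating covariances directly rather than generating NS score vectors and computing their covariances.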
Should you want genuine bootstrap samples from a POP-file, you can run that through Hydata-supplement program HYDATA (described below) to use the bootstrap option it proffers when computing the input datafile's covariances.

COVFORM: Run this to set a small parameter file named KOVFMT that instructs GENSCOR how best to format its ASCII data-covariance output for transport to the solution program you want to receive this.

ORDER: This clarifies the structure of a disjointedly assembled causal-path structure or, more generally, of any directed but possibly nonrecursive graph whose links, described piecemeal, needn't be allegedly causal. Operating upon a set of provisional numeric indices (they can be assigned randomly) for nodes in a causal-path model, program ORDER (a) receives keyboard entry of a list of index pairs signifying that node i is directly path-antecedent to node j; (b) sorts these indices into disjoint "units", each of which is either a singlet (strongly recommended if this is a SEM model's intended path structure) or comprises a group of nodes each of which is path-antecedent to all others in its group; (c) identifies one or more sequences whose left-to-right order embeds the strict partial order these input pairings impose on these units; (d) clarifies the degrees of path-separation among these units; and (e) for each unit that contains multiple nodes, identifies all the closed loops within this unit as well as each member node's counts of direct input/output links to nodes outside this unit. This information enables you to re-index these nodes in closest correspondence to their degree of path-dependency and, if it discloses loops you did not intend, makes perspicuous where you can break those most parsimoniously.

-------------------------------------------------------------------------

B. These are the Gendata documentation files in decreasing order of importance.
SIMDATA.TXT: This describes in considerable detail each Gendata program's capabilities, how to invoke those when your simulation wants them, and how to plan your sequence of program calls and uploads. And it begins with a comprehensive overview of linear structural modelling, both classical and modern, that you may find conceptually instructive even if you never create any simulation material using Gendata resources.

READ_1ST.TXT: What you're now reading. In addition to affording your initial overview of Gendata, it will continue to provide lookup for which filename to call for doing what job. (If even I have trouble remembering, why shouldn't you?)

HYSTAND.TXT: This contains excerpts from Hyball's README.TXT documentation that explain the use of certain Hyball programs, included in this package and briefly described below, that carry out assorted operations on suitably formatted rawdata files. Unless you do creative things with the simulation datafiles that GENSCOR affords, you will probably have little use for these.

COVFMT.TXT: This describes the layout an ASCII file requires to transport a covariance matrix for upload by Hyball/Gendata programs that operate upon those. All transport covariance files written by programs in this package have this layout, but you may have occasion to initiate some outside of that process or do some editing thereof between production and upload.

-------------------------------------------------------------------------

C. These are the Hydata-supplement programs included in this package.

HYDATA: This reads ASCII scorefiles in many alternative formats from outside sources, does preprocessing as needed and, if that is successful, allows one or both of two outputs:
(1) Transcription of the really-raw scores into an ASCII Hydata-standard format that HYDATA and all other programs in the Hyball package that work on raw data can read without preprocessing.
(2) Computation of the standardized covariances (correlations) among the input variables or any selected subset thereof, written to a COV-file that any Hyball and now Gendata program that uploads covariances can read. (This computation also avails search for possible quadratic relations among the data variables, and production of COV for each of up to 156 bootstrap samples of the input data.)

RESCORE: This allows computation of scores on many user-stipulated functions of the variables in an uploaded Hydata-standard datafile. The newly created scores can be appended to the input datafile or recorded in a separate file. RESCORE also allows input of an externally prepared M-by-N matrix of coefficients that will generate scores on M linear combinations of any selection of N variables in the input datafile. Details of how to prepare this in a text editor, not at present included in RESCORE's documentation, are available on request.

SELECT: This allows subsets of variables and/or records in a Hydata-standard datafile to be copied to a separate one.

MERGE: This enables two or more Hydata-standard datafiles having some (not necessarily all) named variables and/or record IDs in common to be combined into a single datafile. The input files are merged in a user-chosen sequence whereby record IDi's score on variable Yj in the merged file is the last score that occurs with those coordinates in the merge sequence.

An example of what can be done with these resources is revising the scores on a subset X of the variables in datafile .D1: First, select scores on X into a new datafile .D2. Next, use RESCORE to transform .D2 as wanted. (E.g., the selected scores could be rescaled to have different means and SDs, or partitioned into categories, or nonlinearly reshaped as by Log-transform.) Finally, merging { .D1, .D2 } in that order overwrites the X-scores in .D1 with their modified values in .D2.
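The "last score wins" merge rule can be pictured with a short Python sketch. This is only an analogy: it stands in a dictionary of records for a Hydata-standard datafile, whose actual layout is described in HYSTAND.TXT rather than here, and the function name, record IDs, variable names, and scores are all hypothetical.

```python
import math

def merge_datafiles(*datafiles):
    """Merge datafiles in the given order; later scores overwrite earlier ones.

    Each datafile is modelled (as an assumption for illustration) as a dict
    mapping record ID -> dict of variable name -> score.
    """
    merged = {}
    for datafile in datafiles:
        for record_id, scores in datafile.items():
            # Create the record if new, then overwrite any shared variables
            merged.setdefault(record_id, {}).update(scores)
    return merged

# Stand-in for .D1: two records scored on variables X and Z
d1 = {"s1": {"X": 10.0, "Z": 1.0},
      "s2": {"X": 20.0, "Z": 2.0}}

# Stand-in for .D2: X selected for record s1 only, then Log-transformed
d2 = {"s1": {"X": math.log(d1["s1"]["X"])}}

merged = merge_datafiles(d1, d2)   # order matters: d2's scores win
for record_id in sorted(merged):
    print(record_id, merged[record_id])
```

Reversing the argument order would leave the original X-scores in place, since the scores from the first datafile would then be the last ones encountered.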
And if .D2 contains only a subset of the records in .D1, the merge will overwrite scores on those records only.

Hyball's documentation for these Hydata-supplement programs is excerpted in this package's textfile HYSTAND.TXT.

-------------------------------------------------------------------------

D. Specialized background reading, contentious.

PRELOOPS.TXT: An introduction to important details of scientific regularities' conceptual composition that are mostly suppressed, seldom wisely and often obfuscatingly, in standard multivariate algebra.

LOOPS.TXT: My effort to make clear for SEMists who see nothing wrong with nonrecursive models that these violate our most basic intuitions about causal influence.

MORLOOPS.TXT: My 26/9/2001 Semnet post showing algebraically the special conditions under which it is possible, under the standard linear model of system dynamics, for indicators of source variables in a path loop to have uncorrelated residual disturbances.

SMEPSHOW.TXT: A reduced version of my 1997 SMEP-meeting handout attempting to demonstrate how EFA can effectively play in SEM's path-detection league by use of my Hyball package's Hyblock procedure.