Tuesday, September 11, 2007

network generative models for modeling networks’ features

Erdos-Renyi random graph (Poisson random graphs or Bernoulli graphs)

  • very rich mathematical theory: many properties are exactly solvable
  • Degree distribution is binomial, and approximately Poisson for large sparse graphs, because each edge is present or absent independently
  • Pros (Simple and tractable model, Phase transitions, Giant component)
  • Cons (Degree distribution, No community structure, No degree correlations)
  • Extensions (Configuration model)
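
The Poisson claim above can be checked with a small simulation. This is a minimal sketch using only the Python standard library; the function names and the values of n and p are illustrative choices, not taken from any particular package.

```python
import random
from collections import Counter

def erdos_renyi(n, p, seed=None):
    """Edge set of a G(n, p) graph: each of the n*(n-1)/2 possible
    edges is included independently with probability p."""
    rng = random.Random(seed)
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < p}

def degrees(n, edges):
    """Degree of every vertex 0..n-1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

n, p = 1000, 0.004              # illustrative values; expected mean degree (n - 1) * p
edges = erdos_renyi(n, p, seed=1)
deg = degrees(n, edges)
mean_degree = sum(deg) / n
histogram = Counter(deg)        # compare against a Poisson with mean (n - 1) * p
```

The histogram is concentrated around the mean with no long tail, which is the "Cons (Degree distribution)" point: real networks show far more high-degree nodes.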

Exponential random graph (p*) models

  • Exponential random graph model defines a probability distribution over graphs; includes Erdos-Renyi as a special case;
  • No analytical solutions for the model, but graphs can be sampled by simulation (define local moves on a graph: addition/removal of edges, movement of edges, edge swaps);
  • Parameter estimation (maximum likelihood);
  • Problems (transitivity is hard to fit: the model tends to produce cliques; in practice used to analyze small networks)
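
For reference, the general form of the model can be written as follows. The symbols are the usual textbook notation rather than anything defined in these notes: s_i(G) are graph statistics (edge count, number of triangles, and so on), θ_i are their parameters, and Z(θ) is the normalizing constant.

```latex
P_\theta(G) = \frac{1}{Z(\theta)} \exp\Big( \sum_i \theta_i \, s_i(G) \Big),
\qquad
Z(\theta) = \sum_{G'} \exp\Big( \sum_i \theta_i \, s_i(G') \Big)
```

With the number of edges as the only statistic, the edges become independent with probability p = e^θ / (1 + e^θ), which is the Erdos-Renyi special case. The sum in Z(θ) runs over all graphs on the vertex set, which is why simulation is needed in general.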

Small world model

  • Used for modeling network transitivity;
  • Many networks assume some kind of geographical proximity;
  • Small-world model (start with a low-dimensional regular lattice; rewire: add/remove edges to create shortcuts that join remote parts of the lattice, i.e., for each edge, with probability p move its other end to a random vertex); rewiring interpolates between a regular lattice (p = 0) and a random graph (p = 1);
  • No power-law degree distribution;
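
The rewiring step described above can be sketched as follows. This is a minimal illustration on a one-dimensional ring lattice; the helper names and the choice to forbid self-loops and duplicate edges are assumptions of the sketch.

```python
import random

def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours
    on each side (requires n > 2 * k)."""
    return {(u, (u + j) % n) for u in range(n) for j in range(1, k + 1)}

def rewire(n, edges, p, seed=None):
    """For each edge (u, v), with probability p move the end v to a
    uniformly chosen vertex w, avoiding self-loops and duplicate edges."""
    rng = random.Random(seed)
    present = {frozenset(e) for e in edges}
    for u, v in sorted(edges):
        if rng.random() >= p:
            continue
        candidates = [w for w in range(n)
                      if w != u and frozenset((u, w)) not in present]
        if not candidates:
            continue
        w = rng.choice(candidates)
        present.remove(frozenset((u, v)))
        present.add(frozenset((u, w)))
    return present

lattice = ring_lattice(100, 2)
unchanged = rewire(100, lattice, p=0.0, seed=3)   # p = 0 keeps the lattice
shortcuts = rewire(100, lattice, p=0.1, seed=3)   # a few long-range shortcuts
```

The number of edges is preserved for every p, so only the placement of edges changes between the lattice and the random-graph end of the range.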

Models of evolution

Preferential attachment

  • Models the growth of the network
  • Preferential attachment (Price 1965, Barabási & Albert 1999): add a new node and create M out-links; the probability of linking to an existing node K is proportional to K's degree
  • Based on Herbert Simon’s result--Power-laws arise from “Rich get richer” (cumulative advantage)
  • Examples (Price 1965 for modeling citations): Citations: new citations of a paper are proportional to the number it already has
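
The growth rule above can be sketched in a few lines. This is a minimal illustration, not Price's or Barabási and Albert's exact formulation: the seed graph (a small complete graph) and the use of undirected degree are assumptions of the sketch.

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph to n nodes; each new node links to m distinct
    existing nodes chosen with probability proportional to degree."""
    rng = random.Random(seed)
    # Assumed seed graph: complete graph on m + 1 nodes, so every
    # starting node already has nonzero degree.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears in `stubs` once per incident edge, so a uniform
    # draw from `stubs` selects a node proportionally to its degree.
    stubs = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

n, m = 2000, 2
edges = preferential_attachment(n, m, seed=7)
degree = [0] * n
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
```

Early nodes accumulate many links ("rich get richer"), so the maximum degree ends up far above m, unlike in the Erdos-Renyi model.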

Edge copying model
Community Guided Attachment
Forest Fire model


Models for realistic network generation

Kronecker graphs

traditional and new approach to network analysis

Traditional approach

  • People are nodes, interactions are edges
  • Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
  • Typical questions: Centrality and connectivity
  • Limited to small graphs (~10 nodes) and properties of individual nodes and edges

New approach

  • Large networks (e.g., web, internet, on-line social networks) with millions of nodes
  • Many traditional questions not useful anymore:
    –Traditional: What happens if a node U is removed?
    –Now: What percentage of nodes needs to be removed to affect network connectivity?
  • Focus moves from a single node to the statistical properties of the network as a whole; we need models that help us understand the statistical properties of large networks
  • We cannot draw (plot) the network and examine it: what does the network “look like” if I can’t look at it?
  • Need statistical methods and tools to quantify large networks
  • Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes

Statistical properties of networks

Properties of static networks:

  • Small-world effect: six degrees of separation (Milgram, 1960s); measuring path lengths (diameter, i.e., the longest shortest path; effective diameter, i.e., the distance within which 90% of all connected pairs of nodes can be reached; mean geodesic (shortest) distance; distribution of shortest path lengths); people only know their friends and do not have global knowledge of the network, so on a random graph short paths exist but no one would be able to find them
  • Transitivity or clustering: a friend of a friend is a friend; if a connects to b, and b to c, then with high probability a connects to c; clustering coefficient C = 3 * (number of triangles) / (number of connected triples); in real networks the clustering coefficient is considerably higher than in a random graph;
  • Degree distributions: 1) Scale-free (power-law) network: degrees in real networks are heavily skewed to the right (the distribution has a long tail of values far above the mean); 2) Poisson network (Erdos-Renyi random graph): the degree distribution follows a Poisson distribution;
  • Network resilience: we observe how the connectivity (length of the paths) of the network changes as vertices are removed (vertices can be removed uniformly at random or in order of decreasing degree; this matters for epidemiology, where removal of vertices corresponds to vaccination); real-world networks are resilient to random removal of vertices; a random network has better resilience to targeted attacks;
  • Community structure: most social networks show community structure (groups have a higher density of edges within than across groups; people naturally divide into groups based on interests, age, occupation); how to find communities (spectral clustering, i.e., embedding into a low-dimensional space; hierarchical clustering based on connection strength; combinatorial algorithms; block models; diffusion methods)
  • Subgraphs or motifs: what are the building blocks (motifs) of networks? Do motifs have specific roles in networks? Network motif detection process (1. count how many times each subgraph appears; 2. compute the statistical significance of each subgraph, i.e., the probability that it appears in a random network as often as in the real network)
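
The clustering coefficient formula in the list above can be computed directly from an adjacency structure. A minimal sketch, assuming an undirected graph stored as a dict that maps each vertex to the set of its neighbours (the function name is illustrative):

```python
from itertools import combinations

def clustering_coefficient(adj):
    """C = 3 * triangles / connected triples for an undirected graph
    given as {vertex: set of neighbours}."""
    closed = 0    # triples whose two end vertices are also linked
    triples = 0   # all connected triples (paths of length 2)
    for v, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1
    # Each triangle is seen once from each of its three vertices,
    # so `closed` already equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

# Triangle 0-1-2 with a pendant vertex 3 attached to 0.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
c = clustering_coefficient(example)   # 3 closed triples out of 5
```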

Temporal properties:

  • Densification: what is the relation between the number of nodes and the number of edges in a network? Networks become denser over time: the number of edges grows faster than the number of nodes (average degree increases); densification & degree distribution (how does densification affect the degree distribution?);
  • Shrinking diameter: intuition says that distances between nodes slowly grow as the network grows (like log n), but in practice the distances between nodes slowly decrease as the network grows





Sunday, September 09, 2007

SIMPLIS & PRELIS -- LISREL

SIMPLIS--new simplified language for LISREL

PRELIS--preprocessor for LISREL
  • to prepare covariance or correlation matrix that can be analyzed by LISREL
  • transform and combine variables, recode, handle missing data, imputation, test univariate and multivariate normality, Monte Carlo simulation, compute the weight matrix needed for weighted least squares (WLS) estimation, compute various types of correlation matrix (Pearson, polychoric, polyserial, tetrachoric), descriptive statistics
  • PRELIS language is not case sensitive, either upper or lower case letters can be used
LISREL
  • in LISREL 8, factors are all correlated by default
  • t values greater than 2.0 in absolute value are commonly taken to be significant

Saturday, September 08, 2007

nonrecursive SEM (reciprocal relation) + two wave data

Reading list
  • Kelley-Moore, J.A., & Ferraro, K.F. (2001) Functional limitations and religious service attendance in later life: barrier and/or benefit mechanism? Journal of Gerontology: Social Sciences. 56B (6), S365-S373

Exponential random graph (p*) model for social network

  • The possible ties among nodes of a network are regarded as random variables, and different assumptions about dependencies among these random tie variables determine the general form of the exponential random graph model for the network. Different dependence assumptions and their associated models are 1) Bernoulli, 2) dyad independent, 3) and Markov random graph models.
  • The homogeneous Markov random graph models of Frank and Strauss (1986) suffer from problems of degeneracy. The new specifications proposed by Snijders, Pattison, Robins & Handcock (2006) not only show improvements in goodness of fit for various data sets, they are also much more successful at avoiding degeneracy, particularly for network data exhibiting high levels of transitivity.



Reading list

  • Frank, O., & Strauss, D. (1986). Markov graphs. Journal of the
    American Statistical Association, 81, 832-842.
  • Goodreau,S.M. (2007). Advances in exponential random graph models applied to a large social network. Social Networks, 29 (2), 231-248.
  • Robins,G., Pattison,P., Kalish,Y., & Lusher,D. (2007) An introduction to exponential random graph models for social networks. Social Networks. 29 (2), 169-348
  • Snijders, T.A.B., Pattison, P., Robins, G., & Handcock, M. (2006) New specifications for exponential random graph models. Sociological Methodology.

cross validation fit index

ECVI (Expected cross-validation index)

  • ECVI is proposed as a means to assess, in a single sample, the likelihood that the model cross-validates similar-size samples from the same population. It measures the discrepancy between the fitted covariance matrix in the analyzed sample, and the expected covariance matrix that would be obtained in another sample of equivalent size. Application of the ECVI assumes a comparison of models whereby an ECVI index is computed for each model and then all ECVI values placed in rank order. The model having the smallest ECVI value exhibits the greatest potential for replication. It is possible to take the precision of the estimated ECVI value into account through the formulation of confidence intervals. By reporting an ECVI value within the bounds of a 95% confidence interval, one can argue that over all possible randomly sampled ECVI, 95% of them will fall within the upper and lower limits of the interval constructed. -- Byrne(1994), Testing for the factorial validity, replication, and invariance of a measuring instrument: a paradigmatic application based on the Maslach Burnout Inventory. Multivariate Behavioral Research, 29 (3), 289-311

CVI (cross-validation index)

  • CVI was developed to measure the extent to which a model cross-validates over independent samples. CVI measures the discrepancy between the fitted covariance matrix in the calibration sample and the sample covariance matrix in the validation sample. The model having the smallest CVI value is the one expected to have the highest degree of stability in repeated samples. -- Byrne (1994) Testing for the factorial validity, replication, and invariance of a measuring instrument: a paradigmatic application based on the Maslach Burnout Inventory, Multivariate Behavioral Research, 29 (3), 289-311

AIC (Akaike information criterion)

  • AIC is a single-sample criterion, so we don't have to split the sample into calibration and validation samples
  • the model that produces the smallest AIC will be selected

CAIC (consistent AIC)

  • CAIC is a single-sample criterion
  • the model that produces the smallest CAIC will be selected
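
As a rough illustration of how the single-sample indices are compared, the sketch below uses the chi-square-based definitions commonly reported by SEM programs: AIC = chi-square + 2q, CAIC = chi-square + (1 + ln N)q, and ECVI = (chi-square + 2q) / (N - 1), where q is the number of free parameters and N the sample size. Exact conventions vary by program, and the two models and the sample size below are hypothetical numbers, not results from any of the papers cited here.

```python
import math

def aic(chi2, q):
    """Akaike information criterion (chi-square-based form)."""
    return chi2 + 2 * q

def caic(chi2, q, n):
    """Consistent AIC: heavier penalty that grows with sample size n."""
    return chi2 + (1 + math.log(n)) * q

def ecvi(chi2, q, n):
    """Expected cross-validation index (single-sample form)."""
    return (chi2 + 2 * q) / (n - 1)

# Hypothetical models: (chi-square, number of free parameters).
models = {"A": (120.0, 20), "B": (110.0, 30)}
n = 300   # hypothetical sample size

table = {name: (aic(c, q), caic(c, q, n), ecvi(c, q, n))
         for name, (c, q) in models.items()}
best = min(table, key=lambda name: table[name][0])   # smallest AIC wins
```

Model B fits slightly better in raw chi-square, but its ten extra parameters cost more than they gain, so model A has the smaller value on all three indices.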

Reading list

  • Camstra,A., and Boomsma, A. (1992) Cross-validatin in regression and covariance structure analysis: an overview. Sociological Methods and Research, 21 (1), 89-115
  • Browne, M.W. (2000) Cross-validation methods. Journal of Mathematical Psychology, 44,108-132

nested model sequence for testing measurement invariance

a series of increasingly restrictive models
a significant chi-square difference indicates noninvariance (i.e., nonequivalence)

M0 Null model
M1 number of factors invariance---chi-square= 1188.79 df= 586
M2 number of factors & pattern of loadings invariance---chi-square= 1219.69 df= 608
M3 number of factors & pattern of loadings & factor correlation invariance---chi-square= 1226.60 df= 618
M4 number of factors & pattern of loadings & factor correlation & measurement error invariance---chi-square= 1403.17 df= 644

  • compare M1 and M2: chi-square difference= 30.90, df difference= 22, the difference is not significant (M2 is more parsimonious and fits slightly worse than M1, but not significantly worse, thus we choose M2); the hypothesis of an invariant pattern of factor loadings is considered tenable (accepted).
  • compare M2 and M3: chi-square difference= 6.91, df difference= 10, the difference is not significant (M3 is more parsimonious and fits slightly worse than M2, but not significantly worse, thus we choose M3); the hypothesis of equal factor correlations is tenable (accepted)
  • compare M3 and M4: chi-square difference= 176.57, df difference= 26, the difference is significant (M4 fits significantly worse than M3, thus we choose M3); the hypothesis of equal measurement error (uniqueness) across groups is rejected ----Torkzadeh, Koufteros, Pflughoeft (2003), Confirmatory analysis of computer self-efficacy, Structural Equation Modeling, 10(2), 263-275
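
The three comparisons above can be reproduced with a chi-square difference test. The sketch below uses only the standard library, so the chi-square tail probability is computed with a textbook series and continued-fraction evaluation of the incomplete gamma function; in practice a statistics package would normally supply this. The function names are illustrative; the fit statistics are the M1 to M4 values listed above.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    scale = math.exp(-z + a * math.log(z) - math.lgamma(a))
    if z < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma.
        term = total = 1.0 / a
        ap = a
        for _ in range(1000):
            ap += 1.0
            term *= z / ap
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * scale
    # Continued fraction (modified Lentz) for the upper tail.
    tiny = 1e-300
    b = z + 1.0 - a
    c, d = 1.0 / tiny, 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * scale

def difference_test(less_restricted, more_restricted):
    """Chi-square difference, df difference, and p value for nested models."""
    d_chi2 = more_restricted[0] - less_restricted[0]
    d_df = more_restricted[1] - less_restricted[1]
    return round(d_chi2, 2), d_df, chi2_sf(d_chi2, d_df)

fits = {"M1": (1188.79, 586), "M2": (1219.69, 608),
        "M3": (1226.60, 618), "M4": (1403.17, 644)}
m1_m2 = difference_test(fits["M1"], fits["M2"])
m2_m3 = difference_test(fits["M2"], fits["M3"])
m3_m4 = difference_test(fits["M3"], fits["M4"])
```

At the conventional 0.05 level, the first two p values are above the threshold (invariance retained) and the third is far below it (invariance rejected), matching the conclusions in the list.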

Reading list for testing measurement invariance

  • Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78-90.
  • Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-70.
  • Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5(2), 139-158.

Friday, September 07, 2007

Test partial measurement invariance

Reading list
  • Byrne, B.M., & Shavelson, R.J. (1987) Adolescent self-concept: testing the assumption of equivalent structure across gender. American Educational Research Journal, 24 (3), 365-385
  • Byrne,B.M., Shavelson, R.J., & Muthen, B. (1989) Testing for the equivalence of factor covariance and mean structures: the issue of partial measurement invariance. Psychological Bulletin, 105, 456-466

Abnormal parameter estimates, think before headache

  • correlation greater than 1
  • standard errors that are abnormally large or small; a standard error approaching 0 usually results from linear dependence between the parameter and some other parameters in the model, which makes it impossible to test the statistical significance of the estimate
  • negative variance
  • LISREL permits these abnormal estimates to be printed. EQS prevents them by constraining the value of the offending parameter to zero.
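
A quick screening of an output table for the symptoms listed above might look like the sketch below. The record layout, the function name, and the standard-error thresholds are all assumptions made for illustration; they are not part of LISREL or EQS, and sensible thresholds depend on the scale of the variables.

```python
def flag_improper(estimates, se_floor=1e-4, se_ceiling=1e3):
    """Return (name, reason) pairs for suspicious parameter estimates.

    `estimates` is a list of dicts with keys 'name', 'kind'
    ('variance', 'correlation', or anything else), 'value', and 'se'.
    The thresholds are arbitrary illustrative defaults.
    """
    flags = []
    for est in estimates:
        if est["kind"] == "correlation" and abs(est["value"]) > 1.0:
            flags.append((est["name"], "correlation outside [-1, 1]"))
        if est["kind"] == "variance" and est["value"] < 0.0:
            flags.append((est["name"], "negative variance"))
        if est["se"] < se_floor:
            flags.append((est["name"], "standard error near zero"))
        elif est["se"] > se_ceiling:
            flags.append((est["name"], "standard error very large"))
    return flags

# Hypothetical estimates, for illustration only.
output = [
    {"name": "r(F1,F2)", "kind": "correlation", "value": 1.07, "se": 0.05},
    {"name": "var(e3)", "kind": "variance", "value": -0.12, "se": 0.04},
    {"name": "lambda4", "kind": "loading", "value": 0.81, "se": 0.00001},
    {"name": "lambda5", "kind": "loading", "value": 0.77, "se": 0.06},
]
problems = flag_improper(output)
```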

Missing data in SEM

  • Listwise deletion is usually recommended when working with SEM-- Byrne (1995)
  • However, if the sample size is too small for listwise deletion to be feasible, and pairwise deletion can cause serious problems, we may use imputation (EM, mean imputation, or other imputation methods)
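
The difference between listwise deletion and the simplest imputation method can be seen on a toy data set. A minimal sketch with made-up values, where `None` marks a missing observation; EM imputation is not shown.

```python
def listwise_delete(rows):
    """Keep only the cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r)]

def mean_impute(rows):
    """Replace each missing value with the mean of the observed
    values of the same variable (column)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [tuple(means[j] if r[j] is None else r[j] for j in range(n_cols))
            for r in rows]

# Made-up data: four cases, two variables.
data = [(1.0, 2.0), (None, 4.0), (3.0, None), (5.0, 6.0)]
complete_cases = listwise_delete(data)   # half of the sample is lost
imputed = mean_impute(data)              # all four cases are kept
```

Listwise deletion discards a whole case for a single missing value, which is why it becomes impractical in small samples; mean imputation keeps every case but shrinks the variance of the imputed variable.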

Reading list

  • Byrne, B.M. (1995) Strategies in testing for an invariant second-order factor structure: a comparison of EQS and LISREL. Structural Equation Modeling, 2(1), 53-72

testing invariance of second-order factor structure

Reading list
  • Byrne, B.M.(1995) Strategies in testing for an invariant second-order factor structure: a comparison of EQS and LISREL, Structural Equation Modeling, 2(1), 53-72
  • Byrne, B.M., Baron, P., & Campbell, T.L. (1993) Measuring adolescent depression: factorial validity and invariance of the Beck Depression Inventory across gender. Journal of Research on Adolescence, 3(2), 127-143

Different research designs

  • Repeated multi-method approach (RMM)---one trait is measured by three methods at two points in time
  • Multitrait-multimethod approach (MTMM)---three traits are measured by three methods at one point in time
  • Multitrait-multimethod-multitime approach (MTMMMT)---three traits are measured by three methods at two or more points in time, eg, panel data ---for a comparison of the three approaches, see Saris, W.E., & Andrews, F.M. (1991, 2004) Evaluation of measurement instruments using a structural modeling approach. In Paul P. Biemer, Robert M. Groves, Lars E. Lyberg, and Nancy A. Mathiowetz, Measurement Errors in Surveys (Wiley Series in Probability and Statistics)

Equality constraints, metric-setting, multiple group SEM

  • For measurement models without equality constraints, there are no metric-setting effects. In principle, there are two standard methods for setting the metric of a latent variable: fix one of its loadings to a nonzero constant, eg, 1, or fix its variance to a nonzero constant, eg, 1. For measurement models with equality constraints, in some situations fixing the variance of the latent variable yields an identified model, whereas fixing one loading of each latent variable results in an underidentified model. In other situations, fixing the variances of the latent variables produces an underidentified model, whereas fixing one loading of each latent variable to a nonzero constant results in an identified model. --- O'Brien & Reilly (1995)
  • For multiple group SEM, we should scale the second-order factor by fixing to 1 one of the loadings from the second-order factor to its indicators, not by fixing the variance of the second-order factor; fixing the variance of the second-order factor would lead to erroneous results. A first-order factor can be scaled by fixing one of its loadings at 1.
  • In Marsh (1994), with 4 correlated first-order factors, each having five indicators, Marsh establishes the scale of each factor by fixing the factor variance at 1. --- I remember reading somewhere that it is better to fix the factor variance than one factor loading, because a loading that is appropriate to fix at 1 in group A (say, path A) may not be appropriate in group B, where path B would be the better choice.

Reading list

  • Marsh, H.W. (1994) Confirmatory factor analysis models of factorial invariance: a multifaceted approach, Structural Equation Modeling, 1(1), 5-34
  • O'Brien, R.M., & Reilly, T. (1995) Equality in constraints and metric-setting measurement models. Structural Equation Modeling, 2(1), 1-12

Wednesday, September 05, 2007

multiple groups SEM analytic procedure in Amos 6

Estimate the baseline model (best fitting model) for each group separately; the best fitting models may differ from one group to another. Tests are conducted for each group separately. Even if the baseline models differ across groups, we can still continue with the following procedures. The baseline model is the one that is most parsimonious, as well as statistically best fitting and substantively most meaningful.

Test the factorial structure simultaneously across groups: parameters are estimated for all groups at the same time, and the fit of this simultaneously estimated model provides the baseline value against which all subsequently specified models are compared. This multigroup analysis yields only one set of fit statistics for overall model fit.

  1. Each group has its own data file.
  2. A model structure needs only to be drawn for the first group. By default, all other groups will have the same structure. ps, the structure comes from the baseline model (best fitting model in the previous stage)
  3. The fit of this simultaneously estimated model provides the baseline value against which all subsequently specified models are compared. This multigroup analysis yields only one set of fit statistics for overall model fit. Given that chi-square statistics are summative, the overall chi-square value for the multigroup model should equal the sum of the chi-square values obtained when the baseline model is tested separately for each group (without any cross-group constraints imposed). (Without cross-group constraints, the df of the multigroup model is likewise the sum of the df of the separate groups.)
  4. Analyze--manage groups--new (give group name to each group, eg, sample1, sample2,sample3)--data file (identify data file for each group)--analyze (calculate estimate)----this step provide overall fit index for the one model (simultaneously estimated multiple groups)--this is M1
  5. all factor loadings, all factor variances, and all factor covariances constrained equal (in general, testing for the equality of error variances across groups is considered to be excessively stringent) --- open the hypothesized model Amos input file---right click the mouse (Object Properties)-- give a label to each parameter we want to constrain equal across groups (each parameter to be held equal across groups is given a label; any parameter that is unlabeled will be freely estimated. The choice of label names is arbitrary; one indicator of each factor was initially set at 1 to give the factor a metric, so we leave it at 1 without a label: since this path from the factor to the indicator is already constrained to equal 1, its value will be constant across groups) -- check "all groups" in the object properties box--this is M2
  6. If M1 (no equality constraints, more free parameters) has chi-square value=2243.21, df=495; M2 (has equality constraints, more parsimonious) has chi-square value= 2344.75, df= 545; the chi-square difference is 101.54 and df difference 50; -- statistically significant--M2 fits significantly worse than M1, so we choose M1-- some equality constraints do not hold across the groups--so we need to further investigate which parameters are not equal across the groups
  7. If M2 is not significantly worse than M1, since M2 is more parsimonious, we choose M2--this indicates that the parameters are equal across groups. --- a significant chi-square difference indicates non-invariance (ie, non-equivalence)
  8. Example, we have sample A and sample B. We first specify a model with an underlying three-factor structure (irrespective of factor loading pattern) across the two samples. After simultaneously testing the model for the two groups, we get M1: chi-square (330)=1004.37--this model tests the number of underlying factors (the underlying three-factor structure). Second, we put constraints on factor loadings across the two groups, and we get M2: chi-square (347)=1025.81. Comparing M1 with M2, we get chi-square difference (17 df)=21.44, which is not significant. M2 (more parsimonious) fits slightly worse, but not significantly worse, thus we choose M2, indicating the equivalence of item loadings. The item loadings (found to be invariant) were cumulatively constrained equal across groups---maintaining equality constraints on the factor loading parameters across groups, we specify M3 (constraining all factor covariances to be group-invariant+factor loading invariance). Comparing M2 with M3 yields a chi-square difference (6)= 3, not significant--indicating support for M3--- from Byrne (1993) the Maslach burnout inventory: testing for factorial validity and invariance across elementary, intermediate and secondary teachers
  9. Example, M1 is the multigroup model without any constraints; M1 tests for the equivalence of an underlying three-factor structure. M1: chi-square=604.18, df= 373, CFI=0.92. The CFI is high, so we can say that M1 indicates an equivalent three-factor structure. M2 constrains the pattern of factor loadings to be equal across the two groups. M2: chi-square=641.25, df=390. Comparing M1 with M2 yields chi-square difference 37.07, df difference 17. This is a statistically significant difference in model fit, thus substantiating rejection of the "hypothesis that item measurements were equivalent across two groups".--ie, factor loadings are different across the two groups
  10. Marsh (1994): the invariance of factor variances and of the uniquenesses (error variances) associated with measured variables is typically less substantively relevant.
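
The arithmetic behind steps 6, 8 and 9 can be laid out as a worked comparison. The fit statistics are the ones quoted in the steps; the critical values are the standard chi-square table entries at the 0.05 level, added here for reference rather than taken from the cited papers.

```latex
\begin{aligned}
\text{Step 6:}\quad & \Delta\chi^2 = 2344.75 - 2243.21 = 101.54, \quad \Delta df = 545 - 495 = 50, \quad \chi^2_{.05}(50) \approx 67.50 \;\Rightarrow\; \text{significant} \\
\text{Step 8 (M1 vs M2):}\quad & \Delta\chi^2 = 1025.81 - 1004.37 = 21.44, \quad \Delta df = 347 - 330 = 17, \quad \chi^2_{.05}(17) \approx 27.59 \;\Rightarrow\; \text{not significant} \\
\text{Step 8 (M2 vs M3):}\quad & \Delta\chi^2 = 3, \quad \Delta df = 6, \quad \chi^2_{.05}(6) \approx 12.59 \;\Rightarrow\; \text{not significant} \\
\text{Step 9:}\quad & \Delta\chi^2 = 641.25 - 604.18 = 37.07, \quad \Delta df = 390 - 373 = 17, \quad \chi^2_{.05}(17) \approx 27.59 \;\Rightarrow\; \text{significant}
\end{aligned}
```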

Reading list

  • Byrne, B.M. (1993). The Maslach Burnout Inventory: Testing for factorial validity and invariance across elementary, intermediate and secondary teachers. Journal of Occupational and Organizational Psychology, 66: 197-212.
  • Byrne, B.M. (1994) Testing for the factorial validity, replication, and invariance of a measuring instrument: a paradigmatic application based on the Maslach burnout inventory. Multivariate Behavioral Research, 29 (3), 289--311
  • Byrne, B.M. (2004). Testing for multigroup invariance using AMOS graphics: a road less traveled. Structural Equation Modeling, 11(2): 272-300
  • Byrne, B.M.(2001) Structural Equation Modeling with AMOS: basic concepts, applications, and programming. Lawrence Erlbaum.
  • Marsh, H.W. (1994) Confirmatory factor analysis models of factorial invariance: a multifaceted approach. Structural Equation Modeling, 1(1): 5-34--gender, age, gender by age interaction (gender differences in structure that vary with age)
  • Torkzadeh, G., Koufteros, X., Pflughoeft, K. (2003), Confirmatory analysis of computer self-efficacy. Structural Equation Modeling, 10(2), 263-275--across male vs. female

Tuesday, September 04, 2007

multiple group SEM

  • In multiple group SEM, we need to compute a covariance matrix for each group. Also, in the data description, we need to report descriptive statistics for all the groups.

MIMIC model (multiple indicator and multiple cause)

  • Lynch, S.M. (2000), Measurement and prediction of aging anxiety, Research on Aging, 22(5), 533-558-- this paper provides an example of a MIMIC model; aging anxiety is the latent variable, measured by 7 indicators. Age, education, income, race, and sex are used as exogenous observed variables that influence the latent aging anxiety variable.

first-order factor vs second-order factor

  • the amount of error variance (the residual of the first-order factor) in the individual first-order factors is examined so as to identify any first-order factors that are not well represented by the second-order factor. If the error variance of a particular first-order factor approaches zero, then it is almost perfectly represented by the second-order factor. For example, the 4 first-order factors have error variances 0.26, 0.19, 0.28, and 0.49 respectively; the error variances of factors 1, 2, and 3 are acceptable, but factor 4 has the largest error variance (0.49). This implies that 49% of the true-score variance in factor 4 can't be explained by the second-order factor. This can also be predicted from F4's low second-order factor loading. --Cheung, D. (2000). Evidence of a single second-order factor in student ratings of teaching effectiveness. Structural Equation Modeling, 7 (3), 442-460
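
The link between the error variances and the second-order loadings can be made explicit. The sketch assumes a completely standardized solution, in which a first-order factor's residual variance equals 1 minus its squared second-order loading; that assumption is mine, so the implied loadings below are illustrative and not values reported by Cheung (2000).

```python
import math

# Residual (error) variances of the four first-order factors, as quoted above.
residual_variance = {"F1": 0.26, "F2": 0.19, "F3": 0.28, "F4": 0.49}

# Under a completely standardized solution (assumed):
#   residual = 1 - loading**2, so loading = sqrt(1 - residual).
implied_loading = {name: math.sqrt(1.0 - resid)
                   for name, resid in residual_variance.items()}

# Share of each first-order factor's variance explained by the
# second-order factor.
explained = {name: 1.0 - resid for name, resid in residual_variance.items()}

weakest = min(implied_loading, key=implied_loading.get)
```

Under this assumption F4 has an implied loading of about 0.71, against roughly 0.85 to 0.90 for the other three factors, which is the "low second-order factor loading" the note refers to.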

continuous, polytomous data in SEM

  • Song,X.Y., Lee, S.Y., & Zhu,H.T. (2001). Model selection in structural equation models with continuous and polytomous data. Structural Equation Modeling, 8, 378-396

Nonlinear SEM, interaction & quadratic effects

  • Lee, S.Y., & Lu, B. (2004). Case-deletion diagnostics for nonlinear structural equation models. Multivariate Behavioral Research, 38, 275-400
  • Lee, S.Y., Song, X.Y., & Poon, W.Y. (2004). Comparison of approaches in estimating interaction and quadratic effects of latent variables. Multivariate Behavioral Research, 39, 37-67

Sunday, September 02, 2007

Randomly split data in SPSS

  • data---select cases---random sample of cases---choose either filtered or deleted
  • If select deleted, we will only see the cases that are randomly selected out, not all cases
  • If we select filtered, we will see "filter_$" as a new variable in the data file on the right hand side; (1) means selected cases, (0) means unselected cases. --- save the file
  • Data---select cases---if condition is satisfied---if---choose the filter_$ variable---filter_$=1---unselected cases are deleted---we will get the cases randomly selected
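
For comparison, the same random split can be sketched outside SPSS. This is a plain Python illustration of the idea, not SPSS syntax; the function name and the 0/1 flag (mirroring the filter_$ variable) are choices made for this sketch.

```python
import random

def random_split(case_ids, fraction=0.5, seed=None):
    """Flag a random `fraction` of the cases with 1 (selected) and the
    rest with 0 (unselected), in the spirit of SPSS's filter_$."""
    rng = random.Random(seed)
    ids = list(case_ids)
    k = round(len(ids) * fraction)
    selected = set(rng.sample(ids, k))
    return [(cid, 1 if cid in selected else 0) for cid in ids]

flags = random_split(range(1, 101), fraction=0.5, seed=42)
calibration = [cid for cid, flag in flags if flag == 1]
validation = [cid for cid, flag in flags if flag == 0]
```

Keeping the flag alongside the case ID, rather than deleting the unselected cases, leaves both halves available, for example as calibration and validation samples for cross-validation.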