Simulate a document-by-term matrix of counts for a given set of parameters alpha and Beta and document-specific total word counts. Additional structuring variables (the number of topics (k), documents (M), and terms (V)) are inferred from the input objects.

sim_LDA_data(N, Beta, alpha = NULL, Theta = NULL, seed = NULL)

Arguments

N

A vector of document sizes (total word counts). Must be integer-conformable. Its length is used to infer the number of documents (M).

Beta

Matrix of categorical distribution parameters defining terms within topics. Dimension: k x V (number of topics x number of terms). Used to infer both k and V. Entries must be non-negative, and each row (topic) must sum to 1.

alpha

Single positive numeric value for the Dirichlet distribution parameter defining topics within documents. To specify the document topic probabilities directly, use Theta instead.

Theta

Matrix of probabilities defining topics within documents. Dimension: M x k (documents x topics). Entries must be non-negative, and each row (document) must sum to 1. To have the document topic probabilities drawn from a Dirichlet distribution instead, use alpha.

seed

Input to set.seed.

Value

A document-by-term matrix of counts (dim: M x V).
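
To make the roles of N, Beta, alpha, and Theta concrete, the following is a minimal sketch of the standard LDA generative process in base R. It is not the implementation behind sim_LDA_data; the helper name sketch_sim_lda is hypothetical, and the sketch assumes that alpha parameterizes a symmetric Dirichlet, drawn here as normalized gamma variates.

```r
# Hypothetical helper for illustration only; not the package's own code.
sketch_sim_lda <- function(N, Beta, alpha = NULL, Theta = NULL, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  stopifnot(!is.null(alpha) || !is.null(Theta))
  M <- length(N)   # number of documents, inferred from N
  k <- nrow(Beta)  # number of topics, inferred from Beta
  V <- ncol(Beta)  # number of terms, inferred from Beta
  if (is.null(Theta)) {
    # Assumption: symmetric Dirichlet(alpha), drawn via normalized gammas
    g <- matrix(rgamma(M * k, shape = alpha), nrow = M, ncol = k)
    Theta <- g / rowSums(g)
  }
  out <- matrix(0L, nrow = M, ncol = V)
  for (d in seq_len(M)) {
    # Split the N[d] words of document d across the k topics
    z <- as.vector(rmultinom(1, size = N[d], prob = Theta[d, ]))
    for (j in seq_len(k)) {
      # Draw terms for the words assigned to topic j and add them up
      out[d, ] <- out[d, ] + as.vector(rmultinom(1, size = z[j], prob = Beta[j, ]))
    }
  }
  out
}

N <- c(10, 22, 15, 31)
Beta <- matrix(c(0.1, 0.1, 0.8, 0.2, 0.6, 0.2), 2, 3, byrow = TRUE)
x <- sketch_sim_lda(N, Beta, alpha = 1.2, seed = 1)
```

Under this sketch the result has one row per document and one column per term, and each row sums to the corresponding entry of N.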

Examples

  N <- c(10, 22, 15, 31)
  alpha <- 1.2
  Beta <- matrix(c(0.1, 0.1, 0.8, 0.2, 0.6, 0.2), 2, 3, byrow = TRUE)
  sim_LDA_data(N, Beta, alpha = alpha)
  Theta <- matrix(c(0.2, 0.8, 0.8, 0.2, 0.5, 0.5, 0.9, 0.1), 4, 2,
                  byrow = TRUE)
  sim_LDA_data(N, Beta, Theta = Theta)
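
The constraints listed under Arguments can be checked on the example inputs before simulating. The checks below are a sketch in base R, not part of the function itself, and the variable names (n_ok, beta_ok, theta_ok, dims_ok) are hypothetical. Reading "integer-conformable" as "non-negative whole numbers" is an assumption.

```r
N <- c(10, 22, 15, 31)
Beta <- matrix(c(0.1, 0.1, 0.8, 0.2, 0.6, 0.2), 2, 3, byrow = TRUE)
Theta <- matrix(c(0.2, 0.8, 0.8, 0.2, 0.5, 0.5, 0.9, 0.1), 4, 2,
                byrow = TRUE)

# Assumed reading of "integer-conformable": non-negative whole numbers
n_ok <- all(N >= 0 & N == round(N))

# Each topic (row of Beta) is a probability distribution over terms
beta_ok <- all(Beta >= 0) &&
  isTRUE(all.equal(rowSums(Beta), rep(1, nrow(Beta))))

# Each document (row of Theta) is a probability distribution over topics
theta_ok <- all(Theta >= 0) &&
  isTRUE(all.equal(rowSums(Theta), rep(1, nrow(Theta))))

# Theta must be M x k: one row per document, one column per topic
dims_ok <- nrow(Theta) == length(N) && ncol(Theta) == nrow(Beta)

stopifnot(n_ok, beta_ok, theta_ok, dims_ok)
```

all.equal is used instead of == so that floating-point rounding in the row sums does not cause a spurious failure.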