# Text obtained from a sample of the full Penn Treebank corpus, available through the NLTK suite.
# The sample contains parse trees for 10% of the Wall Street Journal articles in the original corpus.
# After pre-cleaning, the text has a vocabulary of 6k words and 2,285 sentences, for a total size of 38k words.
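# A minimal sketch of how figures like these could be recomputed from the NLTK sample.
# The helper names (corpus_stats, load_treebank_sample) are hypothetical, and the cleaning
# shown here (lowercasing, dropping -NONE- trace tokens) is an assumption: the original
# pre-cleaning steps are not specified, so the resulting counts may differ from those above.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return (num_sentences, num_tokens, vocab_size) for tokenized sentences.

    `sentences` is a list of token lists. Tokens are lowercased before
    counting, which is an assumed cleaning step, not a documented one.
    """
    counts = Counter(tok.lower() for sent in sentences for tok in sent)
    return len(sentences), sum(counts.values()), len(counts)


def load_treebank_sample():
    """Load the Penn Treebank sample shipped with NLTK as lists of tokens.

    Requires `pip install nltk` and a one-time `nltk.download("treebank")`.
    Tokens tagged -NONE- are trace elements from the parse trees and are
    dropped here (assumed cleaning step).
    """
    from nltk.corpus import treebank

    return [
        [word for word, tag in sent if tag != "-NONE-"]
        for sent in treebank.tagged_sents()
    ]


# Toy input so the counting logic can be checked without downloading the corpus.
toy_sentences = [["The", "cat"], ["the", "dog", "."]]
toy_stats = corpus_stats(toy_sentences)
```

# With NLTK and the treebank data installed, corpus_stats(load_treebank_sample()) gives the
# sentence count, token count, and vocabulary size of the sample under the assumptions above.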