# ULL Gutenberg Children Corpus: a compendium of children's books contained within Project Gutenberg (www.gutenberg.org), following the selection used for the Children's Book Test of the bAbI CBT corpus (research.fb.com/downloads/babi/).
# The files here have been pre-cleaned. The corpus contains a vocabulary of 40k words in 207k sentences, with a total size of 2.7M words.

capital-full/
	capitalized, tokenized with pre-cleaner, in one file, 415420 rows
capital/
	capitalized, tokenized with pre-cleaner, 415420 rows
lower/
	lowercased, tokenized with pre-cleaner, 414856 rows
lower_LGEng_token/
	lowercased, tokenized with LG?.?.?, 414814 rows
lower_tokenized_LG5.5.1/
	lowercased, tokenized with LG5.5.1, 414814 rows, matching the 207407 parses in
	http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/LG5.5.1/
capital-LGEnglish-noQuotes-fullyParsed
	capitalized, tokenized with LG?.?.?, fully parsed sentences with no quotes (direct speech)
capital-LGEnglish-noQuotes-manual
	capitalized, tokenized with LG?.?.?, fully parsed sentences with no quotes (direct speech), 200+ manually selected sentences
MSLXX-2019JUL01
	Series of folders containing sentences from the cleaned GC corpus that contain XX tokens or fewer.
	All other pre-cleaner options are the defaults as of 2019JUL01 (the defaults had minor updates shortly before that date).
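
The MSLXX folders are described as holding sentences of at most XX tokens. As a rough illustration of that kind of length filter, here is a minimal Python sketch. It is not the pre-cleaner's actual code. It assumes one sentence per row, tokens separated by whitespace, and blank rows that can be skipped. The function name `filter_by_max_length` and the sample sentences are hypothetical.

```python
from typing import Iterable, Iterator


def filter_by_max_length(lines: Iterable[str], max_tokens: int) -> Iterator[str]:
    """Yield sentences with at most max_tokens whitespace-separated tokens.

    Assumption: each row holds one already-tokenized sentence.
    """
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank rows
        if len(sentence.split()) <= max_tokens:
            yield sentence


# Made-up sample rows, not taken from the corpus.
sample = [
    "the cat sat on the mat .",
    "",
    "once upon a time there was a little girl who lived in a wood .",
    "hello !",
]

# Keep sentences of 7 tokens or fewer (the "XX" in MSLXX would be 7 here).
short = list(filter_by_max_length(sample, 7))
for sentence in short:
    print(sentence)
```

To apply the same filter to a corpus file, pass an open file handle instead of the `sample` list.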