Gut_16_10_xx_text.txt:
- Gut_16_10: Gutenberg corpus, sentences 16+ words long, words with frequency 10+
- xx = 10, 20, 30 ... 90: number of KMeans clusters
- text:
  - clusters_overview
  - cluster_top_words
  - cluster_words -- in the /detailed_data/ dir
  - cluster_similarities - cosine similarities to cluster centroids for words in ...cluster_words

/detailed_data/
  ...cluster_words, cluster_similarities

/cluster_words+weighted_similarity/
  ... = |-delimited cluster words \t SWF/Lexicon \t SWF/Cluster
  - SWF/Lexicon - similarity weighted by word frequency within lexicon
  - SWF/Cluster - similarity weighted by word frequency within cluster

/shorter_lexicons/
  - lexicons pruned based on word and word pair frequencies.
  - objective: find compact lexicons with better "natural" clustering
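
Reading the /cluster_words+weighted_similarity/ files: a minimal Python sketch, assuming each line holds exactly three tab-separated fields (|-delimited cluster words, SWF/Lexicon, SWF/Cluster) and that both SWF values are plain decimal numbers. The names ClusterRow and parse_cluster_line and the sample line are hypothetical, made up for illustration; they are not part of the data set.

```python
from typing import List, NamedTuple


class ClusterRow(NamedTuple):
    words: List[str]     # cluster words, split on "|"
    swf_lexicon: float   # similarity weighted by word frequency within lexicon
    swf_cluster: float   # similarity weighted by word frequency within cluster


def parse_cluster_line(line: str) -> ClusterRow:
    """Parse 'w1|w2|...<TAB>SWF/Lexicon<TAB>SWF/Cluster' into a ClusterRow.

    Assumes exactly three tab-separated fields; raises ValueError otherwise.
    """
    words_field, swf_lexicon, swf_cluster = line.rstrip("\n").split("\t")
    words = [w for w in words_field.split("|") if w]  # drop empty entries
    return ClusterRow(words, float(swf_lexicon), float(swf_cluster))


# Made-up sample line (assumption), not taken from the actual files.
sample = "king|queen|prince\t0.42\t0.87\n"
row = parse_cluster_line(sample)
print(row.words, row.swf_lexicon, row.swf_cluster)
```

If the real files carry a header line or extra columns, the three-way unpacking will raise ValueError, which makes a mismatch with the assumed layout visible early.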