OpenCog ULL - clustering 2017-12-23

Vector space: Gutenberg_16_10
Algorithm: KMeans - sklearn.cluster
Number of clusters: 90
Lexicon file: Gut_16_10_lexicon.csv
Lexicon size (words): 6855

AVERAGE METRICS FOR ALL CLUSTERS

Average top cluster word similarity to centroid: 0.512
Average top 10 words similarity to centroid: 0.471
Average cluster words similarity to centroid: 0.37
Similarity weighted by words frequency within cluster (SWFC): 0.348
Similarity weighted by words frequency within lexicon (SWFL): 0.002

METRICS FOR CLUSTERS

Cluster sizes (words):
48, 48, 44, 48, 48, 77, 84, 34, 112, 34, 41, 38, 99, 105, 76, 53, 69, 81, 104, 73, 74, 164, 85, 64, 72, 23, 90, 53, 66, 96, 80, 107, 93, 72, 84, 66, 77, 68, 84, 61, 62, 137, 92, 151, 76, 85, 52, 90, 41, 78, 49, 58, 125, 70, 46, 100, 48, 91, 57, 65, 60, 163, 69, 36, 30, 17, 68, 102, 98, 68, 101, 126, 100, 85, 58, 156, 110, 24, 101, 31, 86, 94, 60, 74, 39, 96, 55, 93, 84, 103

Cluster top words:
caelte, illustrations, nancy, letting, destruction, nick, youngest, mushrooms, deserved, assertion, stain, diseases, sewed, verse, kids, isaiah, shoemaker, submission, irritably, lydgate, commented, neighbours, sixpence, strove, luggage, insignia, mesmerism, doubting, knitting, sep, disturb, distributed, prevented, rely, pp, suppose, bounds, bigger, sentimental, unbelief, scotland, triumphant, proofreading, coin, blurted, clearer, railway, sob, ivan, drift, oh, mifflin, fates, came, cuff, shepherd, hook, foolish, cherry, wid, vietnam, disembodied, needn't, pilot, saint, gaze, idiot, shrine, exclaim, clapped, pin, naumann, prof, ananda, fertility, beauties, shipmates, answered, twenty, barn, devotee, ropes, beseech, ut, speculation, snarled, i've, truth, hasten, restrain

Cluster top word similarities to centroids:
0.49, 0.497, 0.456, 0.451, 0.479, 0.541, 0.524, 0.472, 0.429, 0.544, 0.635, 0.473, 0.509, 0.467, 0.543, 0.596, 0.599, 0.502, 0.586, 0.352, 0.431, 0.559, 0.614, 0.649, 0.521, 0.709, 0.565, 0.444, 0.553, 0.676, 0.473, 0.555, 0.517, 0.555, 0.681, 0.398, 0.564, 0.481, 0.49, 0.605, 0.512, 0.535, 0.564, 0.537, 0.56, 0.502, 0.427, 0.511, 0.473, 0.496, 0.435, 0.709, 0.591, 0.441, 0.404, 0.532, 0.477, 0.355, 0.489, 0.443, 0.485, 0.555, 0.549, 0.576, 0.364, 0.479, 0.498, 0.468, 0.574, 0.482, 0.502, 0.46, 0.561, 0.553, 0.561, 0.536, 0.407, 0.6, 0.213, 0.575, 0.606, 0.529, 0.517, 0.665, 0.573, 0.467, 0.239, 0.303, 0.481, 0.485

Cluster top 10 words avg similarities to centroids:
0.43, 0.445, 0.421, 0.434, 0.444, 0.488, 0.5, 0.401, 0.383, 0.506, 0.59, 0.427, 0.452, 0.439, 0.462, 0.54, 0.55, 0.462, 0.532, 0.325, 0.393, 0.51, 0.567, 0.626, 0.489, 0.617, 0.532, 0.43, 0.49, 0.649, 0.461, 0.529, 0.489, 0.535, 0.676, 0.378, 0.52, 0.434, 0.457, 0.579, 0.46, 0.508, 0.526, 0.507, 0.479, 0.411, 0.402, 0.474, 0.387, 0.457, 0.389, 0.706, 0.553, 0.396, 0.389, 0.502, 0.412, 0.331, 0.422, 0.428, 0.453, 0.524, 0.527, 0.526, 0.331, 0.4, 0.48, 0.437, 0.481, 0.469, 0.472, 0.425, 0.496, 0.479, 0.52, 0.513, 0.39, 0.556, 0.192, 0.513, 0.559, 0.481, 0.492, 0.64, 0.518, 0.434, 0.205, 0.284, 0.463, 0.432

Cluster words avg similarities to centroids:
0.343, 0.344, 0.346, 0.357, 0.359, 0.388, 0.378, 0.321, 0.293, 0.411, 0.493, 0.363, 0.346, 0.343, 0.358, 0.43, 0.444, 0.362, 0.419, 0.26, 0.328, 0.382, 0.457, 0.463, 0.38, 0.553, 0.425, 0.344, 0.394, 0.499, 0.358, 0.401, 0.395, 0.43, 0.489, 0.29, 0.416, 0.319, 0.364, 0.464, 0.359, 0.383, 0.413, 0.379, 0.363, 0.311, 0.322, 0.375, 0.307, 0.356, 0.296, 0.527, 0.41, 0.308, 0.3, 0.395, 0.329, 0.24, 0.326, 0.365, 0.368, 0.381, 0.386, 0.436, 0.24, 0.363, 0.38, 0.33, 0.378, 0.395, 0.359, 0.322, 0.382, 0.373, 0.404, 0.399, 0.29, 0.478, 0.088, 0.441, 0.437, 0.385, 0.397, 0.488, 0.427, 0.341, 0.151, 0.225, 0.366, 0.335

Similarities weighted by words frequency within cluster (SWFC):
0.31, 0.322, 0.319, 0.344, 0.329, 0.372, 0.362, 0.296, 0.29, 0.373, 0.469, 0.338, 0.318, 0.309, 0.344, 0.418, 0.432, 0.339, 0.407, 0.261, 0.311, 0.361, 0.447, 0.425, 0.355, 0.513, 0.405, 0.325, 0.382, 0.474, 0.346, 0.387, 0.356, 0.395, 0.473, 0.286, 0.398, 0.283, 0.332, 0.455, 0.331, 0.361, 0.404, 0.339, 0.331, 0.287, 0.288, 0.331, 0.277, 0.321, 0.295, 0.463, 0.389, 0.304, 0.311, 0.365, 0.309, 0.209, 0.299, 0.357, 0.341, 0.356, 0.369, 0.414, 0.238, 0.339, 0.35, 0.301, 0.345, 0.365, 0.333, 0.296, 0.364, 0.332, 0.37, 0.373, 0.264, 0.481, 0.058, 0.44, 0.411, 0.362, 0.385, 0.433, 0.399, 0.318, 0.14, 0.235, 0.344, 0.293

Similarities weighted by words frequency within lexicon (SWFL):
0.0, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.004, 0.0, 0.0, 0.001, 0.001, 0.002, 0.001, 0.0, 0.001, 0.001, 0.001, 0.003, 0.001, 0.002, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.003, 0.001, 0.001, 0.007, 0.001, 0.001, 0.002, 0.0, 0.001, 0.001, 0.001, 0.002, 0.001, 0.001, 0.001, 0.003, 0.01, 0.001, 0.015, 0.002, 0.001, 0.012, 0.002, 0.001, 0.0, 0.003, 0.001, 0.0, 0.001, 0.001, 0.012, 0.0, 0.001, 0.0, 0.001, 0.002, 0.002, 0.002, 0.001, 0.002, 0.001, 0.002, 0.001, 0.002, 0.001, 0.007, 0.017, 0.0, 0.001, 0.001, 0.001, 0.002, 0.0, 0.002, 0.027, 0.009, 0.001, 0.002

SUPPLEMENTAL FILES CONTENT

Gut_16_10_90_cluster_top_words.txt: Cluster top 10 words statistics
Gut_16_10_90_cluster_words.txt: Cluster words, sorted by similarities to cluster centroids
Gut_16_10_90_cluster_similarities.txt: Cluster word similarities to cluster centroids
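
The report names its per-cluster metrics but does not define them. The sketch below shows one plausible way such numbers could be computed for a single cluster, using only the Python standard library. It rests on assumptions that are not stated in the report: that "similarity" is cosine similarity to the mean vector of the cluster, that SWFC divides the frequency-weighted similarity sum by the total word frequency within the cluster, and that SWFL divides the same sum by the total word frequency of the whole lexicon. The function names, the toy vectors and the frequencies are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_metrics(words, vectors, freqs, lexicon_total, top_n=10):
    """Metrics for one cluster; the SWFC/SWFL formulas are assumed, not taken from the report."""
    c = centroid([vectors[w] for w in words])
    # Words sorted by similarity to the centroid, most similar first.
    sims = sorted(((cosine(vectors[w], c), w) for w in words), reverse=True)
    values = [s for s, _ in sims]
    top = values[:top_n]
    weighted = sum(s * freqs[w] for s, w in sims)
    cluster_total = sum(freqs[w] for w in words)
    return {
        "top_word": sims[0][1],
        "top_similarity": values[0],
        "top_n_avg": sum(top) / len(top),
        "avg": sum(values) / len(values),
        "swfc": weighted / cluster_total,  # assumed: normalised by cluster frequency
        "swfl": weighted / lexicon_total,  # assumed: normalised by lexicon frequency
    }


# Toy data: three 2-d word vectors forming one cluster (hypothetical values).
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
freqs = {"a": 1, "b": 1, "c": 2}
m = cluster_metrics(["a", "b", "c"], vectors, freqs, lexicon_total=40)
print(m["top_word"], round(m["avg"], 3), round(m["swfc"], 3), round(m["swfl"], 3))
```

Under these assumed definitions, SWFL equals SWFC scaled by the share of lexicon frequency that falls in the cluster, which would be consistent with the SWFL values in the report being much smaller than the SWFC values.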