Agglomerative clustering: varying parameters 2018-11-16

Link Grammar 5.4.4, test_grammar updated 2018-10-19.
This notebook is shared as a static HTML page: Agglomerative-Clustering-2018-11-16.html

Basic settings

In [1]:
import os, sys, time
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC
from src.grammar_learner.read_files import check_dir
from src.grammar_learner.write_files import list2file
from src.grammar_learner.widgets import html_table
from src.grammar_learner.pqa_table import table_rows, wide_rows
tmpath = module_path + '/tmp/'
check_dir(tmpath, True, 'none')
table = []
start = time.time()
print(UTC(), ':: module_path =', module_path)
2018-11-16 20:01:56 UTC :: module_path = /home/obaskov/94/language-learning

Corpus test settings

In [2]:
out_dir = module_path + '/output/Agglomerative-Clustering-' + str(UTC())[:10]
# corpus = 'CDS-caps-br-text+brent9mos'  # alternative corpus, not used here
corpus = 'CDS-caps-br-text'
# dataset = 'LG-English'                 # alternative dataset, not used here
dataset = 'LG-English-clean-clean'  # 2018-10-29: only 100% parsed
runs = (1,1)
kwargs = {
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           ,
    'word_space'    :   'sparse'    ,
    'clustering'    :   ['agglomerative', 'ward'],
    'cluster_range' :   200         ,
    'clustering_metric' : ['silhouette', 'cosine'],
    'grammar_rules' :   2           ,
    'tmpath'        :   tmpath      , 
    'verbose'       :   'min'       ,
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        }
lines = [[33, corpus, 'LG-English' , 0, 0, 'none'], 
         [34, corpus, 'LG-English' , 0, 0, 'rules']] 
# Alternative reference paths, kept for reference:
# rp = module_path + '/data/CDS-caps-br-text+brent9mos/LG-English'
# rp = module_path + '/data/CDS-caps-br-text/LG-English'  # shorter test
rp = module_path + '/data/CDS-caps-br-text/LG-English-clean-clean'
cp = rp  # corpus path = reference_path :: use 'gold' parses as test corpus
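
The `clustering` setting is a variable-length list: this notebook passes it with two, three, and four elements (`['agglomerative', 'ward']`, then with an affinity, then with a nearest-neighbours count). A minimal sketch of how such a list could be unpacked, assuming that omitted trailing elements fall back to defaults. The helper name `unpack_clustering` and the default values are assumptions made for illustration; this is not grammar_learner code.

```python
def unpack_clustering(spec):
    """Split a 'clustering' list into named settings, filling in defaults.

    Hypothetical helper for illustration only; not part of grammar_learner.
    """
    # Assumed defaults: Ward linkage, Euclidean affinity, no kNN constraint.
    defaults = ['agglomerative', 'ward', 'euclidean', None]
    # Pad the given list with the trailing defaults it omits.
    algorithm, linkage, affinity, knn = list(spec) + defaults[len(spec):]
    return {'algorithm': algorithm, 'linkage': linkage,
            'affinity': affinity, 'knn': knn}


print(unpack_clustering(['agglomerative', 'ward']))
print(unpack_clustering(['agglomerative', 'complete', 'cosine', 20]))
```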

Linkage = "ward", affinity = "euclidean" (the only affinity available with Ward linkage)

In [3]:
%%capture
line = [lines[0]]  # single-row test list ('none' generalization only), used from In [9] on
a, _, header, log = wide_rows(lines, out_dir, cp, rp, runs, **kwargs)
display(html_table([header] + a))
In [4]:
display(html_table([header] + a))
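| Line | Corpus | Parsing | Space | Linkage | Affinity | Gen. | Rules | NN | SI | PA | PQ | F1 | Top 5 cluster sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 200 | --- | --- | 99% | 98% | 0.98 | [726, 44, 36, 30, 29] |
| 34 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | rules | 182 | --- | --- | 99% | 97% | 0.97 | [726, 44, 36, 30, 29] |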

The same test with basic (not-cleaned) corpus

In [5]:
%%capture
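# NOTE: `lines`, `cp` and `rp` were built in In [2] and are not rebuilt here,
# so reassigning `corpus` does not change the test: the output below is
# identical to In [4].
corpus = 'CDS-caps-br-text+brent9mos'
a2, _, header, log = wide_rows(lines, out_dir, cp, rp, runs, **kwargs)
corpus = 'CDS-caps-br-text'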
In [6]:
display(html_table([header] + a2))
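| Line | Corpus | Parsing | Space | Linkage | Affinity | Gen. | Rules | NN | SI | PA | PQ | F1 | Top 5 cluster sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 200 | --- | --- | 99% | 98% | 0.98 | [726, 44, 36, 30, 29] |
| 34 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | rules | 182 | --- | --- | 99% | 97% | 0.97 | [726, 44, 36, 30, 29] |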

Varying agglomerative clustering parameters

Varying linkage, affinity, rules generalization

In [7]:
%%capture
t1 = []
n = 0
for linkage in ['ward', 'complete', 'average']:
    n += 1
    m = 0
    for affinity in ['euclidean', 'manhattan', 'cosine']:
        if linkage == 'ward' and affinity != 'euclidean': continue
        m += 1
        lines[0][0] = round(n + 0.1*m, 1)
        m += 1
        lines[1][0] = round(n + 0.1*m, 1)
        kwargs['clustering'] = ['agglomerative', linkage, affinity]
        a, _, header, log = wide_rows(lines, out_dir, cp, rp, runs, **kwargs)
        t1.extend(a)
        table.extend(a)
In [8]:
display(html_table([header] + t1))
| Line | Corpus | Parsing | Space | Linkage | Affinity | Gen. | Rules | NN | SI | PA | PQ | F1 | Top 5 cluster sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 200 | --- | --- | 99% | 98% | 0.98 | [726, 44, 36, 30, 29] |
| 1.2 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | rules | 182 | --- | --- | 99% | 97% | 0.97 | [726, 44, 36, 30, 29] |
| 2.1 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 200 | --- | --- | 99% | 96% | 0.97 | [975, 19, 12, 4, 2] |
| 2.2 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | rules | 182 | --- | --- | 99% | 95% | 0.96 | [975, 19, 12, 4, 2] |
| 2.3 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 200 | --- | --- | 99% | 96% | 0.97 | [975, 19, 12, 4, 2] |
| 2.4 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | rules | 182 | --- | --- | 99% | 95% | 0.96 | [975, 19, 12, 4, 2] |
| 2.5 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 200 | --- | --- | 99% | 80% | 0.81 | [637, 42, 38, 21, 15] |
| 2.6 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | rules | 200 | --- | --- | 99% | 79% | 0.81 | [637, 42, 38, 21, 15] |
| 3.1 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 200 | --- | --- | 99% | 96% | 0.96 | [1023, 1] |
| 3.2 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | rules | 177 | --- | --- | 99% | 94% | 0.95 | [1023, 1] |
| 3.3 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 200 | --- | --- | 99% | 96% | 0.96 | [1023, 1] |
| 3.4 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | rules | 177 | --- | --- | 99% | 94% | 0.95 | [1023, 1] |
| 3.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 200 | --- | --- | 99% | 72% | 0.73 | [1009, 13, 2, 1] |
| 3.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | rules | 200 | --- | --- | 99% | 72% | 0.73 | [1009, 13, 2, 1] |
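
The labels in the Line column follow from the loop in In [7]: the integer part counts linkages and the decimal part counts runs within one linkage. A minimal sketch of that numbering with the `wide_rows` call left out. The explicit `generalizations` list is an assumption standing in for the two entries of `lines` ('none' and 'rules').

```python
linkages = ['ward', 'complete', 'average']
affinities = ['euclidean', 'manhattan', 'cosine']
generalizations = ['none', 'rules']  # assumed stand-in for the two test lines

labels = []
for n, linkage in enumerate(linkages, start=1):
    m = 0
    for affinity in affinities:
        # Ward linkage is only combined with Euclidean affinity.
        if linkage == 'ward' and affinity != 'euclidean':
            continue
        for gen in generalizations:
            m += 1
            # Same formula as the notebook: linkage index + 0.1 * run index.
            labels.append((round(n + 0.1 * m, 1), linkage, affinity, gen))

for row in labels:
    print(row)
```

Ward contributes two rows and each of the other linkages six, which gives the 14 rows of the table above.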

Varying number of clusters

In [9]:
%%capture
t3 = []
# `n` carries over from In [7], so these rows are numbered 4.x to 6.x.
for linkage in ['ward', 'average', 'complete']:
    n += 1
    m = 0
    for affinity in ['euclidean', 'manhattan', 'cosine']:
        # Ward linkage is only run with Euclidean affinity.
        if linkage == 'ward' and affinity != 'euclidean': continue
        for cluster_range in [300, 400, 500]:
            kwargs['cluster_range'] = cluster_range
            m += 1
            line[0][0] = round((n + 0.1*m), 1)
            kwargs['clustering'] = ['agglomerative', linkage, affinity]
            a, _, header, log = wide_rows(line, out_dir, cp, rp, runs, **kwargs)
            t3.extend(a)
            table.extend(a)
In [10]:
display(html_table([header] + t3))
| Line | Corpus | Parsing | Space | Linkage | Affinity | Gen. | Rules | NN | SI | PA | PQ | F1 | Top 5 cluster sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.1 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 300 | --- | --- | 99% | 98% | 0.98 | [566, 41, 23, 22, 19] |
| 4.2 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 400 | --- | --- | 99% | 97% | 0.98 | [484, 39, 23, 13, 12] |
| 4.3 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 500 | --- | --- | 99% | 98% | 0.98 | [408, 29, 20, 12, 11] |
| 5.1 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 300 | --- | --- | 99% | 96% | 0.97 | [921, 2, 1] |
| 5.2 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 400 | --- | --- | 99% | 96% | 0.97 | [822, 2, 1] |
| 5.3 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 500 | --- | --- | 99% | 96% | 0.97 | [713, 4, 3, 2, 1] |
| 5.4 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 300 | --- | --- | 99% | 96% | 0.97 | [921, 2, 1] |
| 5.5 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 400 | --- | --- | 99% | 96% | 0.97 | [822, 2, 1] |
| 5.6 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 500 | --- | --- | 99% | 96% | 0.97 | [712, 4, 3, 2, 1] |
| 5.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | --- | --- | 99% | 78% | 0.79 | [394, 243, 48, 41, 25] |
| 5.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | --- | --- | 99% | 96% | 0.96 | [117, 115, 41, 34, 24] |
| 5.9 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | --- | --- | 99% | 97% | 0.97 | [82, 69, 24, 23, 19] |
| 6.1 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 300 | --- | --- | 99% | 96% | 0.97 | [807, 38, 29, 18, 14] |
| 6.2 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 400 | --- | --- | 99% | 96% | 0.97 | [698, 25, 16, 15, 14] |
| 6.3 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 500 | --- | --- | 99% | 96% | 0.97 | [498, 89, 13, 12, 11] |
| 6.4 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 300 | --- | --- | 99% | 96% | 0.97 | [807, 38, 29, 18, 14] |
| 6.5 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 400 | --- | --- | 99% | 96% | 0.97 | [698, 25, 16, 15, 14] |
| 6.6 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 500 | --- | --- | 99% | 96% | 0.97 | [498, 89, 13, 12, 11] |
| 6.7 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | --- | --- | 99% | 86% | 0.87 | [403, 42, 38, 28, 21] |
| 6.8 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | --- | --- | 99% | 91% | 0.92 | [217, 42, 38, 28, 21] |
| 6.9 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | --- | --- | 99% | 95% | 0.96 | [50, 42, 38, 28, 21] |

⇒ NEXT: complete/average linkage, cosine affinity, 400-500 clusters
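
One possible reading of this choice is that the cosine runs with 400-500 clusters keep F1 high while avoiding a single dominant cluster. The sketch below checks that reading on the 400- and 500-cluster rows of the In [10] table. The 0.96 F1 threshold and the "smallest largest cluster" criterion are assumptions introduced here; the notebook does not state its selection rule.

```python
# (line, linkage, affinity, clusters, F1, largest cluster size),
# copied from the 400- and 500-cluster rows of the In [10] table.
rows = [
    (4.2, 'ward',     'euclidean', 400, 0.98, 484),
    (4.3, 'ward',     'euclidean', 500, 0.98, 408),
    (5.2, 'average',  'euclidean', 400, 0.97, 822),
    (5.3, 'average',  'euclidean', 500, 0.97, 713),
    (5.5, 'average',  'manhattan', 400, 0.97, 822),
    (5.6, 'average',  'manhattan', 500, 0.97, 712),
    (5.8, 'average',  'cosine',    400, 0.96, 117),
    (5.9, 'average',  'cosine',    500, 0.97, 82),
    (6.2, 'complete', 'euclidean', 400, 0.97, 698),
    (6.3, 'complete', 'euclidean', 500, 0.97, 498),
    (6.5, 'complete', 'manhattan', 400, 0.97, 698),
    (6.6, 'complete', 'manhattan', 500, 0.97, 498),
    (6.8, 'complete', 'cosine',    400, 0.92, 217),
    (6.9, 'complete', 'cosine',    500, 0.96, 50),
]

MIN_F1 = 0.96  # assumed threshold, not taken from the notebook

# Keep rows with acceptable F1, then prefer the smallest dominant cluster.
candidates = sorted((r for r in rows if r[4] >= MIN_F1), key=lambda r: r[5])

for line, linkage, affinity, clusters, f1, largest in candidates[:3]:
    print(line, linkage, affinity, clusters, f1, largest)
```

Under these assumptions the three best-balanced rows are 6.9, 5.9 and 5.8, all with cosine affinity.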

Learning connectivity matrix with various nearest neighbors

In [11]:
%%capture
affinity = 'cosine'
t4 = []
for cluster_range in [500, 400, 300]:
    kwargs['cluster_range'] = cluster_range
    n += 1
    m = 0
    for linkage in ['complete', 'average']:
        # knn = None repeats the runs of In [9] (NN shown as "---" in the table).
        for knn in [None, 50, 20, 10]:
            m += 1
            line[0][0] = round((n + m* 0.1), 1)
            kwargs['clustering'] = ['agglomerative', linkage, affinity, knn]
            a, _, header, log = wide_rows(line, out_dir, cp, rp, runs, **kwargs)
            t4.extend(a)
            table.extend(a)
In [12]:
display(html_table([header] + t4))
| Line | Corpus | Parsing | Space | Linkage | Affinity | Gen. | Rules | NN | SI | PA | PQ | F1 | Top 5 cluster sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.1 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | --- | --- | 99% | 95% | 0.96 | [50, 42, 38, 28, 21] |
| 7.2 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | 50 | --- | 99% | 98% | 0.98 | [75, 43, 39, 34, 33] |
| 7.3 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | 20 | --- | 99% | 98% | 0.98 | [109, 105, 79, 73, 53] |
| 7.4 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | 10 | --- | 99% | 97% | 0.97 | [231, 113, 100, 57, 47] |
| 7.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | --- | --- | 99% | 97% | 0.97 | [82, 69, 24, 23, 19] |
| 7.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | 50 | --- | 99% | 98% | 0.98 | [68, 56, 54, 51, 44] |
| 7.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | 20 | --- | 99% | 98% | 0.98 | [144, 121, 97, 91, 38] |
| 7.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | 10 | --- | 99% | 96% | 0.96 | [364, 183, 47, 18, 9] |
| 8.1 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | --- | --- | 99% | 91% | 0.92 | [217, 42, 38, 28, 21] |
| 8.2 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | 50 | --- | 99% | 88% | 0.93 | [84, 55, 45, 44, 42] |
| 8.3 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | 20 | --- | 99% | 97% | 0.98 | [124, 113, 87, 77, 59] |
| 8.4 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | 10 | --- | 99% | 91% | 0.92 | [265, 138, 124, 73, 57] |
| 8.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | --- | --- | 99% | 96% | 0.96 | [117, 115, 41, 34, 24] |
| 8.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | 50 | --- | 99% | 96% | 0.96 | [116, 102, 82, 65, 56] |
| 8.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | 20 | --- | 99% | 95% | 0.96 | [256, 187, 101, 53, 35] |
| 8.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | 10 | --- | 99% | 90% | 0.91 | [601, 72, 19, 18, 14] |
| 9.1 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | --- | --- | 99% | 86% | 0.87 | [403, 42, 38, 28, 21] |
| 9.2 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | 50 | --- | 99% | 95% | 0.95 | [87, 63, 46, 45, 44] |
| 9.3 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | 20 | --- | 99% | 91% | 0.91 | [136, 115, 93, 86, 75] |
| 9.4 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | 10 | --- | 99% | 78% | 0.79 | [285, 154, 136, 88, 71] |
| 9.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | --- | --- | 99% | 78% | 0.79 | [394, 243, 48, 41, 25] |
| 9.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | 50 | --- | 99% | 73% | 0.74 | [906, 5, 4, 3, 2] |
| 9.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | 20 | --- | 99% | 73% | 0.74 | [911, 3, 2, 1] |
| 9.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | 10 | --- | 99% | 73% | 0.74 | [909, 3, 2, 1] |
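
The notebook does not show how the connectivity matrix is built from the nearest-neighbours count in the NN column. As a generic illustration only, a k-nearest-neighbours connectivity matrix links each item to its k closest items, and a constrained agglomerative clustering then typically merges only clusters that are linked. The sketch below builds such a matrix in plain Python. The function names, the toy vectors, and the use of cosine distance for neighbour search are assumptions, not grammar_learner code.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def knn_connectivity(vectors, k):
    """Symmetric 0/1 matrix linking each vector to its k nearest neighbours."""
    n = len(vectors)
    conn = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other items by distance to item i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine_distance(vectors[i], vectors[j]))
        for j in others[:k]:
            conn[i][j] = conn[j][i] = 1  # symmetrise the neighbour relation
    return conn


# Toy data: two pairs of nearly parallel vectors.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
for row in knn_connectivity(vectors, k=1):
    print(row)
```

A smaller k gives a sparser matrix and so a tighter constraint on which merges are allowed.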

All tests

In [13]:
display(html_table([header] + table))
| Line | Corpus | Parsing | Space | Linkage | Affinity | Gen. | Rules | NN | SI | PA | PQ | F1 | Top 5 cluster sizes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 200 | --- | --- | 99% | 98% | 0.98 | [726, 44, 36, 30, 29] |
| 1.2 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | rules | 182 | --- | --- | 99% | 97% | 0.97 | [726, 44, 36, 30, 29] |
| 2.1 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 200 | --- | --- | 99% | 96% | 0.97 | [975, 19, 12, 4, 2] |
| 2.2 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | rules | 182 | --- | --- | 99% | 95% | 0.96 | [975, 19, 12, 4, 2] |
| 2.3 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 200 | --- | --- | 99% | 96% | 0.97 | [975, 19, 12, 4, 2] |
| 2.4 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | rules | 182 | --- | --- | 99% | 95% | 0.96 | [975, 19, 12, 4, 2] |
| 2.5 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 200 | --- | --- | 99% | 80% | 0.81 | [637, 42, 38, 21, 15] |
| 2.6 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | rules | 200 | --- | --- | 99% | 79% | 0.81 | [637, 42, 38, 21, 15] |
| 3.1 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 200 | --- | --- | 99% | 96% | 0.96 | [1023, 1] |
| 3.2 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | rules | 177 | --- | --- | 99% | 94% | 0.95 | [1023, 1] |
| 3.3 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 200 | --- | --- | 99% | 96% | 0.96 | [1023, 1] |
| 3.4 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | rules | 177 | --- | --- | 99% | 94% | 0.95 | [1023, 1] |
| 3.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 200 | --- | --- | 99% | 72% | 0.73 | [1009, 13, 2, 1] |
| 3.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | rules | 200 | --- | --- | 99% | 72% | 0.73 | [1009, 13, 2, 1] |
| 4.1 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 300 | --- | --- | 99% | 98% | 0.98 | [566, 41, 23, 22, 19] |
| 4.2 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 400 | --- | --- | 99% | 97% | 0.98 | [484, 39, 23, 13, 12] |
| 4.3 | CDS-caps-br-text | LG-English | dALWEd | ward | euclidean | none | 500 | --- | --- | 99% | 98% | 0.98 | [408, 29, 20, 12, 11] |
| 5.1 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 300 | --- | --- | 99% | 96% | 0.97 | [921, 2, 1] |
| 5.2 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 400 | --- | --- | 99% | 96% | 0.97 | [822, 2, 1] |
| 5.3 | CDS-caps-br-text | LG-English | dALAEd | average | euclidean | none | 500 | --- | --- | 99% | 96% | 0.97 | [713, 4, 3, 2, 1] |
| 5.4 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 300 | --- | --- | 99% | 96% | 0.97 | [921, 2, 1] |
| 5.5 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 400 | --- | --- | 99% | 96% | 0.97 | [822, 2, 1] |
| 5.6 | CDS-caps-br-text | LG-English | dALAMd | average | manhattan | none | 500 | --- | --- | 99% | 96% | 0.97 | [712, 4, 3, 2, 1] |
| 5.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | --- | --- | 99% | 78% | 0.79 | [394, 243, 48, 41, 25] |
| 5.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | --- | --- | 99% | 96% | 0.96 | [117, 115, 41, 34, 24] |
| 5.9 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | --- | --- | 99% | 97% | 0.97 | [82, 69, 24, 23, 19] |
| 6.1 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 300 | --- | --- | 99% | 96% | 0.97 | [807, 38, 29, 18, 14] |
| 6.2 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 400 | --- | --- | 99% | 96% | 0.97 | [698, 25, 16, 15, 14] |
| 6.3 | CDS-caps-br-text | LG-English | dALCEd | complete | euclidean | none | 500 | --- | --- | 99% | 96% | 0.97 | [498, 89, 13, 12, 11] |
| 6.4 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 300 | --- | --- | 99% | 96% | 0.97 | [807, 38, 29, 18, 14] |
| 6.5 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 400 | --- | --- | 99% | 96% | 0.97 | [698, 25, 16, 15, 14] |
| 6.6 | CDS-caps-br-text | LG-English | dALCMd | complete | manhattan | none | 500 | --- | --- | 99% | 96% | 0.97 | [498, 89, 13, 12, 11] |
| 6.7 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | --- | --- | 99% | 86% | 0.87 | [403, 42, 38, 28, 21] |
| 6.8 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | --- | --- | 99% | 91% | 0.92 | [217, 42, 38, 28, 21] |
| 6.9 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | --- | --- | 99% | 95% | 0.96 | [50, 42, 38, 28, 21] |
| 7.1 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | --- | --- | 99% | 95% | 0.96 | [50, 42, 38, 28, 21] |
| 7.2 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | 50 | --- | 99% | 98% | 0.98 | [75, 43, 39, 34, 33] |
| 7.3 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | 20 | --- | 99% | 98% | 0.98 | [109, 105, 79, 73, 53] |
| 7.4 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 500 | 10 | --- | 99% | 97% | 0.97 | [231, 113, 100, 57, 47] |
| 7.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | --- | --- | 99% | 97% | 0.97 | [82, 69, 24, 23, 19] |
| 7.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | 50 | --- | 99% | 98% | 0.98 | [68, 56, 54, 51, 44] |
| 7.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | 20 | --- | 99% | 98% | 0.98 | [144, 121, 97, 91, 38] |
| 7.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 500 | 10 | --- | 99% | 96% | 0.96 | [364, 183, 47, 18, 9] |
| 8.1 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | --- | --- | 99% | 91% | 0.92 | [217, 42, 38, 28, 21] |
| 8.2 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | 50 | --- | 99% | 88% | 0.93 | [84, 55, 45, 44, 42] |
| 8.3 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | 20 | --- | 99% | 97% | 0.98 | [124, 113, 87, 77, 59] |
| 8.4 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 400 | 10 | --- | 99% | 91% | 0.92 | [265, 138, 124, 73, 57] |
| 8.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | --- | --- | 99% | 96% | 0.96 | [117, 115, 41, 34, 24] |
| 8.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | 50 | --- | 99% | 96% | 0.96 | [116, 102, 82, 65, 56] |
| 8.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | 20 | --- | 99% | 95% | 0.96 | [256, 187, 101, 53, 35] |
| 8.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 400 | 10 | --- | 99% | 90% | 0.91 | [601, 72, 19, 18, 14] |
| 9.1 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | --- | --- | 99% | 86% | 0.87 | [403, 42, 38, 28, 21] |
| 9.2 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | 50 | --- | 99% | 95% | 0.95 | [87, 63, 46, 45, 44] |
| 9.3 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | 20 | --- | 99% | 91% | 0.91 | [136, 115, 93, 86, 75] |
| 9.4 | CDS-caps-br-text | LG-English | dALCCd | complete | cosine | none | 300 | 10 | --- | 99% | 78% | 0.79 | [285, 154, 136, 88, 71] |
| 9.5 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | --- | --- | 99% | 78% | 0.79 | [394, 243, 48, 41, 25] |
| 9.6 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | 50 | --- | 99% | 73% | 0.74 | [906, 5, 4, 3, 2] |
| 9.7 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | 20 | --- | 99% | 73% | 0.74 | [911, 3, 2, 1] |
| 9.8 | CDS-caps-br-text | LG-English | dALACd | average | cosine | none | 300 | 10 | --- | 99% | 73% | 0.74 | [909, 3, 2, 1] |
In [14]:
print(UTC(), ':: finished, elapsed', str(round((time.time()-start)/3600.0, 1)), 'hours')
table_str = list2file(table, out_dir + '/table.txt')
print('Results saved to', out_dir + '/table.txt')
2018-11-16 20:24:15 UTC :: finished, elapsed 0.4 hours
Results saved to /home/obaskov/94/language-learning/output/Agglomerative-Clustering-2018-11-16/table.txt