Random Clustering consistency test 2018-08-28 10x

Updated (2018-08-14) Grammar Tester, server 94.130.238.118
Each line is calculated 10 times; parsing metrics are tested once per calculation.
This notebook is a supplement to
http://langlearn.singularitynet.io/data/clustering_2018/html/Random-Clusters-CDS-2018-08-28.html
This notebook is shared as static html via
http://langlearn.singularitynet.io/data/clustering_2018/html/Random-Clusters-CDS-2018-08-28-multi.html

Basic settings

In [1]:
import os, sys, time
from IPython.display import display
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
grammar_learner_path = module_path + '/src/grammar_learner/'
if grammar_learner_path not in sys.path: sys.path.append(grammar_learner_path)
from utl import UTC
from read_files import check_dir
from widgets import html_table
from pqa_table import table_cds
tmpath = module_path + '/tmp/'
if check_dir(tmpath, True, 'none'):
    table = []
    long_table = []
    header = ['Line','Corpus','Parsing','LW','"."','Generalization','Space','Rules','PA','PQ']
    start = time.time()
    print(UTC(), ':: module_path =', module_path)
else: print(UTC(), ':: could not create temporary files directory', tmpath)
2018-08-28 16:00:35 UTC :: module_path = /home/obaskov/language-learning

Corpus test settings

In [2]:
out_dir = module_path + '/output/Random-Clusters-' + str(UTC())[:10]
runs = (10,1)    # (attempts to learn grammar per line, grammar tests per attempt)
if runs != (1,1): out_dir += '-multi'
kwargs = {
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'clustering'    :   ('kmeans', 'kmeans++', 10),
    'cluster_range' :   (30,120,3)  , # random cluster number within interval
    'cluster_criteria': 'silhouette',
    'cluster_level' :   1           ,
    'tmpath'        :   tmpath      , 
    'verbose'       :   'min'       ,
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        ,
    'categories_generalization': 'off' }
lines = [
    [58, 'CDS-caps-br-text+brent9mos' , 'LG-English'                     ,0,0, 'none'  ], 
    [60, 'CDS-caps-br-text+brent9mos' , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'none'  ]]
rp = module_path + '/data/CDS-caps-br-text+brent9mos/LG-English'
cp = rp  # corpus path = reference_path :: use 'gold' parses as test corpus
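
The 'cluster_range' comment above says the cluster number is chosen at random within an interval. A minimal sketch of such a draw follows. It assumes the first two elements of the tuple are inclusive bounds, which is consistent with the 30 to 120 rule counts reported below; the role of the third element is not stated in this notebook, so the sketch ignores it. The function name `random_cluster_count` is hypothetical and is not part of the grammar_learner code.

```python
import random

def random_cluster_count(cluster_range, rng=random):
    """Draw a cluster count from cluster_range = (low, high, ...).

    Assumption: low and high are inclusive bounds; any further tuple
    elements are ignored because their meaning is not documented here.
    """
    low, high = cluster_range[0], cluster_range[1]
    if low > high:
        raise ValueError('empty interval: low > high')
    return rng.randint(low, high)

# Ten draws, one per learning attempt (runs = (10, 1) in this notebook).
rng = random.Random(0)  # seeded only to make the illustration repeatable
counts = [random_cluster_count((30, 120, 3), rng) for _ in range(10)]
print(counts)
```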

Random clusters, interconnected -- RNDic

"Connector-based rules" style interconnection:
C01: {C01C01- or C02C01- or ... or CnC01-} & {C01C01+ or C01C02+ or ... or C01Cn+} ...
Cxx: {C01Cxx- or C02Cxx- or ... or CnCxx-} & {CxxC01+ or CxxC02+ or ... or CxxCn+} ...
where n -- number of clusters, Cn -- n-th cluster, Cxx -- xx-th cluster of {C01 ... Cn}
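
The pattern above can be spelled out for a small n. The sketch below is illustrative only: it builds the rule strings exactly as written in the lines above, up to the trailing ellipsis, and is not the grammar_learner implementation. The helper names `cluster_labels` and `interconnected_rule` are hypothetical.

```python
def cluster_labels(n):
    """Return zero-padded cluster labels C01 ... Cn."""
    width = max(2, len(str(n)))
    return ['C' + str(i).zfill(width) for i in range(1, n + 1)]

def interconnected_rule(label, labels):
    """Build one RNDic-style rule: every cluster may link to `label`
    from the left, and `label` may link to every cluster on the right."""
    left = ' or '.join(other + label + '-' for other in labels)
    right = ' or '.join(label + other + '+' for other in labels)
    return label + ': {' + left + '} & {' + right + '}'

labels = cluster_labels(3)
for label in labels:
    print(interconnected_rule(label, labels))
```

With n = 3 the first printed rule is `C01: {C01C01- or C02C01- or C03C01-} & {C01C01+ or C01C02+ or C01C03+}`.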

In [3]:
%%capture
kwargs['context'] = 1
kwargs['word_space'] = 'none'
kwargs['clustering'] = 'random'
kwargs['grammar_rules'] = -1
average21, long21 = table_cds(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(average21)
long_table.extend(long21)
In [4]:
display(html_table([header]+average21))
print(UTC())
Line | Corpus                     | Parsing                        | LW | "." | Generalization | Space | Rules | PA  | PQ
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 79    | 72% | 51%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 69    | 70% | 44%
2018-08-28 16:09:46 UTC

Random clusters, connector-based rules -- RNDc

In [5]:
%%capture
kwargs['grammar_rules'] = 1
average22, long22 = table_cds(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(average22)
long_table.extend(long22)
In [6]:
display(html_table([header]+average22))
print(UTC())
Line | Corpus                     | Parsing                        | LW | "." | Generalization | Space | Rules | PA  | PQ
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 64    | 72% | 51%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 82    | 70% | 44%
2018-08-28 16:18:53 UTC

Random clusters, disjunct-based rules -- RNDd

In [7]:
%%capture
kwargs['grammar_rules'] = 2
average23, long23 = table_cds(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(average23)
long_table.extend(long23)
In [8]:
display(html_table([header]+average23))
print(UTC())
Line | Corpus                     | Parsing                        | LW | "." | Generalization | Space | Rules | PA  | PQ
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 78    | 41% | 24%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 60    | 41% | 20%
2018-08-28 16:28:27 UTC

Random clusters, linked -- RNDid

Every cluster is linked to all clusters with single-link disjuncts:
C01: (C01C01-) or (C02C01-) or ... or (CnC01-) or (C01C01+) or (C01C02+) or ... or (C01Cn+) ...
Cxx: (C01Cxx-) or (C02Cxx-) or ... or (CnCxx-) or (CxxC01+) or (CxxC02+) or ... or (CxxCn+) ...
where n -- number of clusters, Cn -- n-th cluster, Cxx -- xx-th cluster of {C01 ... Cn}
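
For comparison with the interconnected variant, the single-link pattern above can be generated the same way. This is a sketch under the same assumptions: it reproduces only the notation shown in the lines above, and the name `linked_rule` is hypothetical rather than taken from grammar_learner.

```python
def linked_rule(label, labels):
    """Build one RNDid-style rule: a flat 'or' of single-link disjuncts,
    first all left links into `label`, then all right links out of it."""
    left = ['(' + other + label + '-)' for other in labels]
    right = ['(' + label + other + '+)' for other in labels]
    return label + ': ' + ' or '.join(left + right)

labels = ['C01', 'C02']  # n = 2 keeps the output short
for label in labels:
    print(linked_rule(label, labels))
```

Each rule here contains 2n alternatives, each using exactly one connector, whereas the interconnected variant joins a left-hand group and a right-hand group with `&`.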

In [9]:
%%capture
kwargs['grammar_rules'] = -2
average24, long24 = table_cds(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(average24)
long_table.extend(long24)
In [10]:
display(html_table([header]+average24))
print(UTC())
Line | Corpus                     | Parsing                        | LW | "." | Generalization | Space | Rules | PA  | PQ
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 61    | 41% | 24%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 66    | 41% | 20%
2018-08-28 16:38:05 UTC

All tests

In [11]:
display(html_table([header]+long_table))
Line | Corpus                     | Parsing                        | LW | "." | Generalization | Space | Rules | PA  | PQ
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 72    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 115   | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 68    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 80    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 83    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 96    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 40    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 82    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 114   | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDic | 36    | 72% | 51%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 75    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 84    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 114   | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 48    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 39    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 51    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 69    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 61    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 70    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDic | 83    | 70% | 44%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 50    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 116   | 71% | 50%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 43    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 90    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 43    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 31    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 95    | 71% | 50%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 44    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 68    | 72% | 51%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDc  | 60    | 72% | 51%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 75    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 84    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 112   | 69% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 34    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 87    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 71    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 59    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 95    | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 102   | 70% | 44%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDc  | 105   | 69% | 44%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 76    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 31    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 74    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 51    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 83    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 109   | 40% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 61    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 120   | 40% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 54    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDd  | 119   | 40% | 24%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 58    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 102   | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 42    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 48    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 46    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 62    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 90    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 30    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 43    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDd  | 78    | 41% | 20%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 75    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 33    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 33    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 62    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 64    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 73    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 69    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 64    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 58    | 41% | 24%
58   | CDS-caps-br-text+brent9mos | LG-English                     | -- | --  | none           | RNDid | 77    | 41% | 24%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 51    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 31    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 61    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 84    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 102   | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 41    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 96    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 70    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 78    | 41% | 20%
60   | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | -- | --  | none           | RNDid | 46    | 41% | 20%
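
The Rules values in the averaged tables appear consistent with a rounded arithmetic mean of the ten per-run values listed here. The sketch below checks this for the two RNDic lines using the per-run figures copied from the table above. Whether table_cds computes its averages in exactly this way is an assumption; `rounded_mean` is a hypothetical helper, not a grammar_learner function.

```python
from statistics import mean

# Per-run Rules values for the RNDic rows of the "All tests" table.
rules_58_rndic = [72, 115, 68, 80, 83, 96, 40, 82, 114, 36]
rules_60_rndic = [75, 84, 114, 48, 39, 51, 69, 61, 70, 83]

def rounded_mean(values):
    """Arithmetic mean rounded to the nearest integer."""
    return int(round(mean(values)))

print(rounded_mean(rules_58_rndic))  # 79, as in the averaged RNDic table
print(rounded_mean(rules_60_rndic))  # 69, as in the averaged RNDic table
```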
In [12]:
from write_files import list2file
print(UTC(), ':: finished, elapsed', str(round((time.time()-start)/3600.0, 1)), 'hours')
table_str = list2file(table, out_dir+'/short_table.txt')
if runs == (1,1):
    print('Results saved to', out_dir + '/short_table.txt')
else:
    long_table_str = list2file(long_table, out_dir+'/long_table.txt')
    print('Average results saved to', out_dir + '/short_table.txt\n'
          'Detailed results for every run saved to', out_dir + '/long_table.txt')
2018-08-28 16:38:05 UTC :: finished, elapsed 0.6 hours
Average results saved to /home/obaskov/language-learning/output/Random-Clusters-2018-08-28-multi/short_table.txt
Detailed results for every run saved to /home/obaskov/language-learning/output/Random-Clusters-2018-08-28-multi/long_table.txt