WSD_option_tests `2019-04-09`

"Gutenberg Children Books" corpus, new "LG-E-noQuotes" dataset (GC_LGEnglish_noQuotes_fullyParsed.ull),
trash filter off: min_word_count = 1, max_sentence_length = 24, Link Grammar 5.5.1.

This notebook is shared as the static file WSD_option_tests2019-04-09.html; output data is shared via the WSD_option_tests2019-04-09 directory.

Basic settings

In [1]:
import os, sys, time, pandas as pd
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC, test_stats
from src.grammar_learner.read_files import check_dir, check_corpus
from src.grammar_learner.write_files import list2file
from src.grammar_learner.widgets import html_table
from src.grammar_learner.pqa_table import table_rows, params, wide_rows
tmpath = module_path + '/tmp/'
check_dir(tmpath, True, 'none')
start = time.time()
runs = (1,1)
print(UTC(), ':: module_path:', module_path)
2019-04-09 16:05:28 UTC :: module_path: /home/obaskov/94/ULL

Corpus test settings

In [2]:
corpus = 'POC-English-disAmb'
dataset = 'LG-ANY-all-parses-agm-100'
kwargs = {
    'max_sentence_length':  24      ,
    'max_unparsed_words' :   1      ,   # dataset: .@e
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           ,
    'min_word_count':   1           ,
    'word_space'    :   'sparse'    ,
    'clustering'    :   ['agglomerative', 'ward'],
    'clustering_metric' : ['silhouette', 'cosine'],
    'cluster_range' :   [20]        ,
    'top_level'     :   0.01        ,
    'grammar_rules' :   2           ,
    'max_disjuncts' :   1000000     ,   # off
    'stop_words'    :   []          ,
    'tmpath'        :   tmpath      ,
    'verbose'       :   'log+'      ,
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        }
rp = module_path + '/data/' + corpus + '/poc-english_ex-parses-gold.txt'
cp = rp  # corpus path = reference_path
runs = (1,1)
out_dir = module_path + '/output/' + 'WSD_option_tests_' + str(UTC())[:10] + '_'
print(UTC(), '\n', out_dir)
2019-04-09 16:05:28 UTC 
 /home/obaskov/94/ULL/output/WSD_option_tests_2019-04-09_

Tests: ALE and ILE with and without disambiguation

ALE without disambiguation: kwargs['wsd_symbol'] = ''

In [3]:
%%capture
kwargs['wsd_symbol'] = ''
table = []
line = [['no WSD', corpus, dataset, 0, 0, 'none']]
a, _, header, log, rules = wide_rows(line, out_dir, cp, rp, runs, **kwargs)
header[0] = 'WSD'
table.extend(a)
In [4]:
display(html_table([header] + a))
WSD    | Corpus             | Parsing                   | Space  | Linkage | Affinity  | G12n | Threshold | Rules | MWC | NN  | SI  | PA | PQ | F1   | Top 5 cluster sizes
no WSD | POC-English-disAmb | LG-ANY-all-parses-agm-100 | dALWEd | ward    | euclidean | none | ---       | 20    | 1   | --- | 0.0 | 1% | 1% | 0.01 | [33, 4, 3, 2, 1]
In [5]:
with open(out_dir + '/POC-English-disAmb_LG-ANY-all-parses-agm-100_dALWEd_no-gen/dict_20C_2019-04-09_0007.4.0.dict') as f:
    rules = f.read(); print('No disambiguation: "@" not replaced with ".":\n', rules[115:400])
No disambiguation: "@" not replaced with ".":
 
% B
"are" "be" "before@b" "binoculars" "board@a" "board@b" "cake@a" "chalk" "child" "directors" "food@a" "has@a" "has@b" "her" "his" "human" "liked" "likes@a" "likes@b" "not" "of" "on@a" "on@b" "parent@a" "parent@b" "sausage@a" "the@b" "the@e" "tool" "was@a" "with@a" "with@b" "wood":

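To see which words in the cluster above carry sense tags, the quoted tokens can be split on the "@" symbol. This is only a minimal sketch and not part of the grammar_learner package: `senses_by_word` is a hypothetical helper, and the sample line is a shortened copy of the cluster printed above.

```python
import re
from collections import defaultdict

def senses_by_word(rule_line, symbol='@'):
    """Group sense tags by base word for quoted tokens such as "board@a".

    Hypothetical helper for inspecting a dictionary cluster line;
    tokens without the symbol are ignored.
    """
    senses = defaultdict(list)
    for token in re.findall(r'"([^"]+)"', rule_line):
        if symbol in token:
            word, tag = token.split(symbol, 1)
            senses[word].append(tag)
    return dict(senses)

# Shortened copy of the "% B" cluster shown above
sample = '"are" "be" "before@b" "board@a" "board@b" "chalk" "the@b" "the@e":'
print(senses_by_word(sample))
```

Passing symbol='.' applies the same grouping to a dictionary in which the tags have already been rewritten.
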
ALE with disambiguation: kwargs['wsd_symbol'] = '@'

In [6]:
%%capture
kwargs['wsd_symbol'] = '@'
line = [['"@"', corpus, dataset, 0, 0, 'none']]
a, _, _, log, rules = wide_rows(line, out_dir, cp, rp, runs, **kwargs)
table.extend(a)
In [7]:
display(html_table([header] + a))
WSD | Corpus             | Parsing                   | Space  | Linkage | Affinity  | G12n | Threshold | Rules | MWC | NN  | SI  | PA   | PQ  | F1  | Top 5 cluster sizes
"@" | POC-English-disAmb | LG-ANY-all-parses-agm-100 | dALWEd | ward    | euclidean | none | ---       | 20    | 1   | --- | 0.0 | 100% | 70% | 0.7 | [33, 4, 3, 2, 1]
In [8]:
with open(out_dir + '/POC-English-disAmb_LG-ANY-all-parses-agm-100_dALWEd_no-gen/dict_20C_2019-04-09_0007.4.0.dict') as f:
    rules = f.read(); print('Disambiguation: "@" replaced with ".":\n', rules[115:400])
Disambiguation: "@" replaced with ".":
 
% B
"are" "be" "before.b" "binoculars" "board.a" "board.b" "cake.a" "chalk" "child" "directors" "food.a" "has.a" "has.b" "her" "his" "human" "liked" "likes.a" "likes.b" "not" "of" "on.a" "on.b" "parent.a" "parent.b" "sausage.a" "the.b" "the.e" "tool" "was.a" "with.a" "with.b" "wood":

ILE with disambiguation: kwargs['wsd_symbol'] = '@'

In [9]:
%%capture
kwargs['wsd_symbol'] = '@'
kwargs['word_space'] = 'discrete'
kwargs['clustering'] = 'group'
kwargs['cluster_range'] = [0]
line = [['"@"', corpus, dataset, 0, 0, 'none']]
a, _, _, log, rules = wide_rows(line, out_dir, cp, rp, runs, **kwargs)
table.extend(a)
In [10]:
display(html_table([header] + a))
WSD | Corpus             | Parsing                   | Space | Linkage | Affinity  | G12n | Threshold | Rules | MWC | NN  | SI  | PA   | PQ  | F1   | Top 5 cluster sizes
"@" | POC-English-disAmb | LG-ANY-all-parses-agm-100 | dILEd | ward    | euclidean | none | ---       | 61    | 1   | --- | 0.0 | 100% | 72% | 0.72 | [2, 1, 0]
In [11]:
with open(out_dir + '/POC-English-disAmb_LG-ANY-all-parses-agm-100_dILEd_no-gen/dict_61C_2019-04-09_0007.4.0.dict') as f:
    rules = f.read(); print('Disambiguation: "@" replaced with ".":\n', rules[115:334])
Disambiguation: "@" replaced with ".":
 
% AB
"a.a":
(ABAV+) or (ANAB- & ABAW+) or (BUAB- & ABAW+) or (CHAB- & ABAT+ & ABAG+) or (CHAB- & ABBZ+ & ABAG+);

% AC
"a.c":
(ACAP+) or (ACAQ+) or (ACAT+) or (ACBC+) or (ACBS+) or (ACBZ+) or (AEAC-) or (BDAC- & ACBC+)
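
The visible difference between the dictionaries printed with wsd_symbol = '' and with wsd_symbol = '@' is that sense-tagged words such as "board@a" appear as "board.a"; the dot is the separator Link Grammar uses for word subscripts. How the grammar learner performs this substitution internally is not shown in this notebook, so the following is only a sketch of the observed effect, with `replace_wsd_symbol` as a hypothetical name.

```python
import re

def replace_wsd_symbol(text, symbol='@', replacement='.'):
    """Rewrite word<symbol>tag as word<replacement>tag.

    Hypothetical illustration, not the grammar_learner implementation.
    An empty symbol means "no disambiguation": the text is returned as is.
    Only a symbol that sits between two word characters is replaced.
    """
    if not symbol:
        return text
    pattern = re.compile(r'(?<=\w)' + re.escape(symbol) + r'(?=\w)')
    return pattern.sub(lambda match: replacement, text)

print(replace_wsd_symbol('"board@a" "board@b" "chalk"'))
print(replace_wsd_symbol('"board@a" "board@b" "chalk"', symbol=''))
```

The lookaround pattern leaves a stray symbol at the start or end of a token untouched, which is a design choice of this sketch rather than documented behaviour.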

ILE without disambiguation: kwargs['wsd_symbol'] = ''

In [12]:
%%capture
kwargs['wsd_symbol'] = ''
line = [['no WSD', corpus, dataset, 0, 0, 'none']]
a, _, _, log, rules = wide_rows(line, out_dir, cp, rp, runs, **kwargs)
table.extend(a)
In [13]:
display(html_table([header] + a))
WSD    | Corpus             | Parsing                   | Space | Linkage | Affinity  | G12n | Threshold | Rules | MWC | NN  | SI  | PA | PQ | F1  | Top 5 cluster sizes
no WSD | POC-English-disAmb | LG-ANY-all-parses-agm-100 | dILEd | ward    | euclidean | none | ---       | 61    | 1   | --- | 0.0 | 0% | 0% | 0.0 | [2, 1, 0]
In [14]:
with open(out_dir + '/POC-English-disAmb_LG-ANY-all-parses-agm-100_dILEd_no-gen/dict_61C_2019-04-09_0007.4.0.dict') as f:
    rules = f.read(); print('No disambiguation: "@" not replaced with ".":\n', rules[115:334])
No disambiguation: "@" not replaced with ".":
 
% AB
"a@a":
(ABAV+) or (ANAB- & ABAW+) or (BUAB- & ABAW+) or (CHAB- & ABAT+ & ABAG+) or (CHAB- & ABBZ+ & ABAG+);

% AC
"a@c":
(ACAP+) or (ACAQ+) or (ACAT+) or (ACBC+) or (ACBS+) or (ACBZ+) or (AEAC-) or (BDAC- & ACBC+)
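
The four runs above can be put side by side to show what the wsd_symbol option changes. The sketch below copies the Rules, PA, PQ and F1 values from the result tables in this notebook; the `summary` list and the `f1_gain` helper are hypothetical names introduced for illustration, with ALE standing for the dALWEd rows and ILE for the dILEd rows.

```python
# (algorithm, WSD setting, rules, PA %, PQ %, F1), copied from the tables above
summary = [
    ('ALE', 'no WSD', 20,   1,  1, 0.01),
    ('ALE', '"@"',    20, 100, 70, 0.70),
    ('ILE', '"@"',    61, 100, 72, 0.72),
    ('ILE', 'no WSD', 61,   0,  0, 0.00),
]

def f1_gain(rows):
    """Return the F1 difference ('"@"' minus 'no WSD') per algorithm."""
    f1_by_algorithm = {}
    for algorithm, wsd, _rules, _pa, _pq, f1 in rows:
        f1_by_algorithm.setdefault(algorithm, {})[wsd] = f1
    return {algorithm: round(f1['"@"'] - f1['no WSD'], 2)
            for algorithm, f1 in f1_by_algorithm.items()}

for algorithm, wsd, rules, pa, pq, f1 in summary:
    print(f'{algorithm:4}{wsd:8}{rules:4}{pa:5}%{pq:5}%{f1:6.2f}')
print(f1_gain(summary))
```

In both cases the number of rules and the cluster sizes are unchanged by the option; only the parse-ability, parse quality and F1 columns differ.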