`--annotation-names=ANNOTATION_NAMES`
: Comma-separated list of annotation types supplied to sharp-multi (e.g. Pinyin,Yale), if you want the Android app etc to be able to name them. You can also set just one annotation name here if you are not using sharp-multi.
`-o, --allow-overlaps`
: Normally, the analyser avoids generating rules that could overlap with each other in a way that would leave the program not knowing which one to apply. If a short rule would cause overlaps, the analyser will prefer to generate a longer rule that uses more context, and if even the entire phrase cannot be made into a rule without causing overlaps then the analyser will give up on trying to cover that phrase. This option allows the analyser to generate rules that could overlap, as long as none of the overlaps would cause actual problems in the example phrases. Thus more of the examples can be covered, at the expense of a higher risk of ambiguity problems when applying the rules to other texts. See also the -y option.
`--no-allow-overlaps`
: Cancels any earlier `--allow-overlaps` option in Makefile variables etc
`-P, --primitive`
: Don't bother with any overlap or conflict checks at all, just make a rule for each word. The resulting parser is not likely to be useful, but the summary might be.
`--no-primitive`
: Cancels any earlier `--primitive` option in Makefile variables etc
`-y YBYTES, --ybytes=YBYTES`
: Look for candidate Yarowsky seed-collocations within this number of bytes of the end of a word. If this is set then overlaps and rule conflicts will be allowed when seed collocations can be used to distinguish between them, and the analysis is likely to be faster. Markup examples that are completely separate (e.g. sentences from different sources) must have at least this number of (non-whitespace) bytes between them.
`--ybytes-max=YBYTES_MAX`
: Extend the Yarowsky seed-collocation search to check over larger ranges up to this maximum. If this is set then several ranges will be checked in an attempt to determine the best one for each word, but see also ymax-threshold.
`--ymax-threshold=YMAX_THRESHOLD`
: Limits the length of word that receives the narrower-range Yarowsky search when ybytes-max is in use. For words longer than this, the search will go directly to ybytes-max. This is for languages where the likelihood of a word's annotation being influenced by its immediate neighbours more than its distant collocations increases for shorter words, and less is to be gained by comparing different ranges when processing longer words. Setting this to 0 means no limit, i.e. the full range will be explored on **all** Yarowsky checks.
`--ybytes-step=YBYTES_STEP`
: The increment value for the loop between ybytes and ybytes-max
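The interaction of these four options can be easier to follow as code. The following is only a rough sketch of how the ranges might be chosen for one word, not annogen's actual implementation: the function name `candidate_ranges` is hypothetical, the step is assumed to be additive, the maximum is assumed always to be tried, and the unit used for the word length is left to the caller.

```python
def candidate_ranges(word_len, ybytes, ybytes_max=0, ybytes_step=1, ymax_threshold=0):
    """Return the search ranges (in bytes) to try for one word.

    Hypothetical helper for illustration only; annogen's real loop may differ.
    """
    assert ybytes_step > 0, "assumed: the step is a positive additive increment"
    if not ybytes_max or ybytes_max <= ybytes:
        # --ybytes-max not in use: only the single --ybytes range is checked
        return [ybytes]
    if ymax_threshold and word_len > ymax_threshold:
        # Word is longer than --ymax-threshold: go directly to --ybytes-max
        return [ybytes_max]
    # Otherwise step from ybytes towards ybytes_max (threshold 0 means no limit)
    ranges = list(range(ybytes, ybytes_max, ybytes_step))
    ranges.append(ybytes_max)  # assumed: the maximum itself is always tried
    return ranges

print(candidate_ranges(2, 20, 60, 20, 3))  # short word: several ranges
print(candidate_ranges(5, 20, 60, 20, 3))  # long word: maximum only
```

Under these assumptions a short word is checked over several ranges, whereas a word longer than the threshold is checked only at the maximum.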
`-k, --warn-yarowsky`
: Warn when absolutely no distinguishing Yarowsky seed collocations can be found for a word in the examples
`--no-warn-yarowsky`
: Cancels any earlier `--warn-yarowsky` option in Makefile variables etc
`-K, --yarowsky-all`
: Accept Yarowsky seed collocations even from input characters that never occur in annotated words (this might include punctuation and example-separation markup)
`--no-yarowsky-all`
: Cancels any earlier `--yarowsky-all` option in Makefile variables etc
`--yarowsky-debug=YAROWSKY_DEBUG`
: Report the details of seed-collocation false positives if there are a large number of matches and at most this number of false positives (default 1). Occasionally these might be due to typos in the corpus, so it might be worth a check.
`--normalise-debug=NORMALISE_DEBUG`
: When `--capitalisation` is not in effect, report words that are usually capitalised but that have at most this number of lower-case exceptions (default 1) for investigation of possible typos in the corpus

`-1, --single-words`
: Do not consider any rule longer than 1 word, although it can still have Yarowsky seed collocations if -y is set. This speeds up the search, but at the expense of thoroughness. You might want to use this in conjunction with -y to make a parser quickly. It is like -P (primitive) but without removing the conflict checks.
`--no-single-words`
: Cancels any earlier `--single-words` option in Makefile variables etc
`--max-words=MAX_WORDS`
: Limits the number of words in a rule; rules longer than this are not considered. 0 means no limit. `--single-words` is equivalent to `--max-words=1`. If you need to limit the search time, and are using -y, it should suffice to use `--single-words` for a quick annotator or `--max-words=5` for a more thorough one.
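If you drive annogen from a script, the choice between the two suggestions above might be wrapped up as follows. This is a minimal sketch: the helper name `search_limit_args` is hypothetical, and it builds only the option list, leaving the rest of the command line (program name, input files and so on) to you.

```python
def search_limit_args(ybytes, thorough=False):
    """Build option strings that limit search time when -y is in use.

    Hypothetical helper; the option names are those documented above.
    """
    args = ["--ybytes=%d" % ybytes]
    if thorough:
        args.append("--max-words=5")   # more thorough annotator
    else:
        args.append("--single-words")  # quick annotator, same as --max-words=1
    return args

print(" ".join(search_limit_args(20)))
print(" ".join(search_limit_args(20, thorough=True)))
```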
`--checkpoint=CHECKPOINT`
: Periodically save checkpoint files in the specified directory. These files can save time when starting again after a reboot (and it's easier than setting up Condor etc). As well as a protection against random reboots, this can be used for scheduled reboots: if a file called ExitASAP appears in the checkpoint directory, annogen will checkpoint, remove the ExitASAP file, and exit. After a run has completed, the checkpoint directory should be removed, unless you want to re-do the last part of the run for some reason.
`--checkpoint-period=CHECKPOINT_PERIOD`
: Approximate number of seconds between checkpoints (default 1000). Setting this to 0 disables periodic checkpoints but still allows use of checkpoint directory for concurrency or ExitASAP processing.
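For a scheduled reboot, the ExitASAP mechanism could be driven by a small script along the following lines. This is a sketch under assumptions: the function names are hypothetical, the timeout and polling interval are arbitrary, and the disappearance of the file is taken as the sign that annogen has checkpointed (it removes the file just before exiting, so allow a moment before rebooting).

```python
import os
import time

def request_exit(checkpoint_dir):
    """Create an empty ExitASAP file in the checkpoint directory."""
    path = os.path.join(checkpoint_dir, "ExitASAP")
    open(path, "w").close()
    return path

def wait_for_exit(checkpoint_dir, timeout=600, poll=1.0):
    """Wait until ExitASAP has been removed; return False on timeout."""
    path = os.path.join(checkpoint_dir, "ExitASAP")
    deadline = time.time() + timeout
    while os.path.exists(path):
        if time.time() > deadline:
            return False  # file still present: annogen has not acted on it yet
        time.sleep(poll)
    return True
```

A reboot script might call `request_exit` on the directory given to `--checkpoint`, then proceed only if `wait_for_exit` returns true.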
`-d DIAGNOSE, --diagnose=DIAGNOSE`
: Output some diagnostics for the specified word. Use this option to help answer "why doesn't it have a rule for...?" issues. This option expects the word without markup and uses the system locale (UTF-8 if it cannot be detected).
`--diagnose-limit=DIAGNOSE_LIMIT`
: Maximum number of phrases to print diagnostics for (0 means unlimited); can be useful when trying to diagnose a common word in rulesFile without re-evaluating all phrases that contain it. Default: 10
`-m, --diagnose-manual`
: Check and diagnose potential failures of `--manualrules`
`--no-diagnose-manual`
: Cancels any earlier `--diagnose-manual` option in Makefile variables etc
`-q, --diagnose-quick`
: Ignore all phrases that do not contain the word specified by the `--diagnose` option, for getting a faster (but possibly less accurate) diagnostic. The generated annotator is not likely to be useful when this option is present. You may get quick diagnostics **without** these disadvantages by loading a `--rulesFile` instead.
`--no-diagnose-quick`
: Cancels any earlier `--diagnose-quick` option in Makefile variables etc
`--priority-list=PRIORITY_LIST`
: Instead of generating an annotator, use the input examples to generate a list of (non-annotated) words with priority numbers, a higher number meaning the word should have greater preferential treatment in ambiguities, and write it to this file (or compressed .gz, .bz2 or .xz file). If the file provided already exists, it will be updated, thus you can amend an existing usage-frequency list or similar (although the final numbers are priorities and might no longer match usage-frequency exactly). The purpose of this option is to help if you have an existing word-priority-based text segmenter and wish to update its data from the examples; this approach might not be as good as the Yarowsky-like one (especially when the same word has multiple readings to choose from), but when there are integration issues with existing code you might at least be able to improve its word-priority data.
`-t, --time-estimate`
: Estimate time to completion. The code to do this is unreliable and is prone to underestimate. If you turn it on, its estimate is displayed at the end of the status line as days, hours or minutes.
`--no-time-estimate`
: Cancels any earlier `--time-estimate` option in Makefile variables etc
`-0, --single-core`
: Use only one CPU core even when others are available. If this option is not set, multiple cores are used if a 'futures' package is installed or if run under MPI or SCOOP; this currently requires `--checkpoint` + shared filespace, and is currently used only for large collocation checks in limited circumstances. Single-core saves on CPU power consumption, but if the computer is set to switch itself off at the end of the run then **total** energy used is generally less if you allow it to run multicore and reach that switchoff sooner.
`--no-single-core`
: Cancels any earlier `--single-core` option in Makefile variables etc
`-p STATUS_PREFIX, --status-prefix=STATUS_PREFIX`
: Label to add at the start of the status line, for use if you batch-run annogen in multiple configurations and want to know which one is currently running

Copyright and Trademarks
========================
(c) Silas S. Brown, licensed under Apache 2
* Android is a trademark of Google LLC.
* Apache is a registered trademark of The Apache Software Foundation.
* AppEngine is possibly a trademark of Google LLC.
* Apple is a trademark of Apple Inc.
* Firefox is a registered trademark of The Mozilla Foundation.
* Google Play is a trademark of Google LLC.
* Google is a trademark of Google LLC.
* Java is a registered trademark of Oracle Corporation in the US and possibly other countries.
* Javascript is a trademark of Oracle Corporation in the US.
* Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
* MP3 is a trademark that was registered in Europe to Hypermedia GmbH Webcasting but I was unable to confirm its current holder.
* Mac is a trademark of Apple Inc.
* Microsoft is a registered trademark of Microsoft Corp.
* Python is a trademark of the Python Software Foundation.
* Raspberry Pi is a trademark of the Raspberry Pi Foundation.
* Unicode is a registered trademark of Unicode, Inc. in the United States and other countries.
* Windows is a registered trademark of Microsoft Corp.
* iPhone is a trademark of Apple in some countries.
* Any other trademarks I mentioned without realising are trademarks of their respective holders.