`--ybytes-max=YBYTES_MAX`
: Extend the Yarowsky seed-collocation search to check over larger ranges up to this maximum. If this is set then several ranges will be checked in an attempt to determine the best one for each word, but see also ymax-threshold.
`--ymax-threshold=YMAX_THRESHOLD`
: Limits the length of word that receives the narrower-range Yarowsky search when ybytes-max is in use. For words longer than this, the search will go directly to ybytes-max. This is for languages where the likelihood of a word's annotation being influenced by its immediate neighbours more than its distant collocations increases for shorter words, and less is to be gained by comparing different ranges when processing longer words. Setting this to 0 means no limit, i.e. the full range will be explored on **all** Yarowsky checks.
`--ybytes-step=YBYTES_STEP`
: The increment value for the loop between ybytes and ybytes-max
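As a rough illustration of how `--ybytes-max`, `--ymax-threshold` and `--ybytes-step` interact, the following Python sketch lists the ranges that would be tried for one word. It is not annogen's actual code: the function name `candidate_ranges`, the additive stepping, the inclusion of the maximum as the final range, and measuring word length with `len()` are all assumptions made for the example.

```python
def candidate_ranges(ybytes, ybytes_max, ybytes_step, word, ymax_threshold=0):
    """Return the search ranges to try for `word` (hypothetical helper)."""
    if not ybytes_max or ybytes_max <= ybytes:
        return [ybytes]  # no extended search requested
    if ymax_threshold and len(word) > ymax_threshold:
        return [ybytes_max]  # long word: go directly to the maximum
    step = max(1, ybytes_step)  # guard against a zero or negative step
    ranges = list(range(ybytes, ybytes_max, step))
    ranges.append(ybytes_max)  # assumption: the maximum itself is always tried
    return ranges
```

Under these assumptions, a short word gets every range from the starting value up to the maximum, while a word longer than the threshold gets only the maximum; a threshold of 0 disables that shortcut.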
`--warn-yarowsky`
: Warn when absolutely no distinguishing Yarowsky seed collocations can be found for a word in the examples
`--no-warn-yarowsky`
: Cancels any earlier `--warn-yarowsky` option in Makefile variables etc
`--yarowsky-all`
: Accept Yarowsky seed collocations even from input characters that never occur in annotated words (this might include punctuation and example-separation markup)
`--no-yarowsky-all`
: Cancels any earlier `--yarowsky-all` option in Makefile variables etc
`--yarowsky-debug=YAROWSKY_DEBUG`
: Report the details of seed-collocation false positives if there are a large number of matches and at most this number of false positives (default 1). Occasionally these might be due to typos in the corpus, so it might be worth a check.
`--single-words`
: Do not consider any rule longer than 1 word, although it can still have Yarowsky seed collocations if -y is set. This speeds up the search, but at the expense of thoroughness. You might want to use this in conjunction with -y to make a parser quickly. It is like -P (primitive) but without removing the conflict checks.
`--no-single-words`
: Cancels any earlier `--single-words` option in Makefile variables etc
`--max-words=MAX_WORDS`
: Limits the number of words in a rule; rules longer than this are not considered. 0 means no limit. `--single-words` is equivalent to `--max-words=1`. If you need to limit the search time, and are using -y, it should suffice to use `--single-words` for a quick annotator or `--max-words=5` for a more thorough one.
`--checkpoint=CHECKPOINT`
: Periodically save checkpoint files in the specified directory. These files can save time when starting again after a reboot (and it's easier than setting up Condor etc). As well as a protection against random reboots, this can be used for scheduled reboots: if a file called ExitASAP appears in the checkpoint directory, annogen will checkpoint, remove the ExitASAP file, and exit. After a run has completed, the checkpoint directory should be removed, unless you want to re-do the last part of the run for some reason.
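A scheduled-reboot script could request that early exit by creating the ExitASAP file described above. This is a minimal sketch: the helper name `request_exit` is hypothetical, and it assumes that an empty file is enough, since the description only says the file has to appear in the checkpoint directory.

```python
from pathlib import Path

def request_exit(checkpoint_dir):
    """Create ExitASAP in the checkpoint directory (hypothetical helper).

    Assumption: the file's contents are irrelevant, so an empty file is made.
    """
    flag = Path(checkpoint_dir) / "ExitASAP"
    flag.touch()  # creates the file if it does not already exist
    return flag
```

The script would then wait for the annogen process to finish before rebooting; the ExitASAP file itself is removed by annogen, as noted above.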
`--diagnose=DIAGNOSE`
: Output some diagnostics for the specified word. Use this option to help answer "why doesn't it have a rule for...?" issues. This option expects the word without markup and uses the system locale (UTF-8 if it cannot be detected).
`--diagnose-limit=DIAGNOSE_LIMIT`
: Maximum number of phrases to print diagnostics for (0 means unlimited); can be useful when trying to diagnose a common word in rulesFile without re-evaluating all phrases that contain it. Default: 10
`--diagnose-manual`
: Check and diagnose potential failures of `--manualrules`
`--no-diagnose-manual`
: Cancels any earlier `--diagnose-manual` option in Makefile variables etc
`-q, --diagnose-quick`
: Ignore all phrases that do not contain the word specified by the `--diagnose` option, for getting a faster (but possibly less accurate) diagnostic. The generated annotator is not likely to be useful when this option is present. You may get quick diagnostics **without** these disadvantages by loading a `--rulesFile` instead.
`--no-diagnose-quick`
: Cancels any earlier `--diagnose-quick` option in Makefile variables etc
`--priority-list=PRIORITY_LIST`
: Instead of generating an annotator, use the input examples to generate a list of (non-annotated) words with priority numbers, a higher number meaning the word should have greater preferential treatment in ambiguities, and write it to this file (or compressed .gz, .bz2 or .xz file). If the file provided already exists, it will be updated, thus you can amend an existing usage-frequency list or similar (although the final numbers are priorities and might no longer match usage-frequency exactly). The purpose of this option is to help if you have an existing word-priority-based text segmenter and wish to update its data from the examples; this approach might not be as good as the Yarowsky-like one (especially when the same word has multiple readings to choose from), but when there are integration issues with existing code you might at least be able to improve its word-priority data.
`--time-estimate`
: Estimate time to completion. The code to do this is unreliable and is prone to underestimate. If you turn it on, its estimate is displayed at the end of the status line as days, hours or minutes.
`--no-time-estimate`
: Cancels any earlier `--time-estimate` option in Makefile variables etc
`-0, --single-core`
: Use only one CPU core even when others are available. If this option is not set, multiple cores are used if a 'futures' package is installed or if run under MPI or SCOOP; this currently requires `--checkpoint` + shared filespace, and is currently used only for large collocation checks in limited circumstances. Single-core saves on CPU power consumption, but if the computer is set to switch itself off at the end of the run then **total** energy used is generally less if you allow it to run multicore and reach that switchoff sooner.
`--no-single-core`
: Cancels any earlier `--single-core` option in Makefile variables etc
`-p STATUS_PREFIX, --status-prefix=STATUS_PREFIX`
: Label to add at the start of the status line, for use if you batch-run annogen in multiple configurations and want to know which one is currently running
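Several options above have a `--no-` counterpart that cancels an earlier setting, for example one inherited from a Makefile variable. The following Python sketch shows one way such "last one wins" behaviour can be obtained with the standard `argparse` module. It only illustrates the semantics: annogen's own option parser may be implemented differently, and the `parse` helper and its two arguments are hypothetical.

```python
import argparse

def build_parser():
    """Parser with one on/off pair writing to the same destination."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--warn-yarowsky", dest="warn_yarowsky",
                        action="store_true", default=False)
    parser.add_argument("--no-warn-yarowsky", dest="warn_yarowsky",
                        action="store_false", default=False)
    return parser

def parse(makefile_flags, extra_flags):
    """Parse flags from a Makefile-style variable followed by later flags."""
    args = makefile_flags.split() + extra_flags.split()
    return build_parser().parse_args(args)

# The later --no-warn-yarowsky cancels the earlier --warn-yarowsky.
options = parse("--warn-yarowsky", "--no-warn-yarowsky")
```

Because both flags write to the same destination, whichever appears last on the command line decides the final value.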
Copyright and Trademarks
========================
(c) Silas S. Brown, licensed under Apache 2
* Android is a trademark of Google LLC.
* Apache is a registered trademark of The Apache Software Foundation.
* AppEngine is possibly a trademark of Google LLC.
* Apple is a trademark of Apple Inc.
* Firefox is a registered trademark of The Mozilla Foundation.
* Google Play is a trademark of Google LLC.
* Google is a trademark of Google LLC.
* Java is a registered trademark of Oracle Corporation in the US and possibly other countries.
* JavaScript is a trademark of Oracle Corporation in the US.
* Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
* MP3 is a trademark that was registered in Europe to Hypermedia GmbH Webcasting but I was unable to confirm its current holder.
* Mac is a trademark of Apple Inc.
* Microsoft is a registered trademark of Microsoft Corp.
* Python is a trademark of the Python Software Foundation.
* Raspberry Pi is a trademark of the Raspberry Pi Foundation.
* Unicode is a registered trademark of Unicode, Inc. in the United States and other countries.
* Windows is a registered trademark of Microsoft Corp.
* iPhone is a trademark of Apple in some countries.
* Any other trademarks I mentioned without realising are trademarks of their respective holders.