

`--ybytes-step=YBYTES_STEP`
 : The increment value for the loop between ybytes and ybytes-max
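
As a rough illustration of how the three ybytes options relate, the loop can be pictured as below. This is only a sketch: it assumes the step is added on each pass and that the upper bound is inclusive, and the function name `ybytes_values` is hypothetical, not part of annogen.

```python
def ybytes_values(ybytes, ybytes_max, ybytes_step):
    """Yield each value tried, from ybytes up to ybytes_max.

    Assumption: the step is additive and ybytes_max is inclusive.
    """
    if ybytes_step <= 0:
        raise ValueError("ybytes_step must be positive")
    value = ybytes
    while value <= ybytes_max:
        yield value
        value += ybytes_step


# Example figures only; these are not annogen defaults.
print(list(ybytes_values(20, 60, 10)))
```

A larger step means fewer passes through the loop, at the cost of testing fewer intermediate values.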

`-k, --warn-yarowsky`
 : Warn when absolutely no distinguishing Yarowsky seed collocations can be found for a word in the examples

`--no-warn-yarowsky`
 : Cancels any earlier --warn-yarowsky option in Makefile variables etc

`-K, --yarowsky-all`
 : Accept Yarowsky seed collocations even from input characters that never occur in annotated words (this might include punctuation and example-separation markup)

`--no-yarowsky-all`
 : Cancels any earlier --yarowsky-all option in Makefile variables etc

`--yarowsky-debug=YAROWSKY_DEBUG`
 : Report the details of seed-collocation false positives if there are a large number of matches and at most this number of false positives (default 1). Occasionally these might be due to typos in the corpus, so it might be worth a check.

`-1, --single-words`
 : Do not consider any rule longer than 1 word, although it can still have Yarowsky seed collocations if -y is set. This speeds up the search, but at the expense of thoroughness. You might want to use this in conjunction with -y to make a parser quickly. It is like -P (primitive) but without removing the conflict checks.

`--no-single-words`
 : Cancels any earlier --single-words option in Makefile variables etc

`--max-words=MAX_WORDS`
 : Limits the number of words in a rule; rules longer than this are not considered.  0 means no limit.  --single-words is equivalent to --max-words=1.  If you need to limit the search time, and are using -y, it should suffice to use --single-words for a quick annotator or --max-words=5 for a more thorough one.
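
To see why a word limit shortens the search, it may help to count the candidate phrases that a limit leaves in play. The sketch below is not annogen's search code; `candidate_phrases` is a hypothetical helper that simply enumerates contiguous runs of words, treating 0 as "no limit" in the same way as the option.

```python
def candidate_phrases(words, max_words=0):
    """Yield every contiguous run of words of length 1..max_words.

    max_words=0 means no limit, mirroring --max-words=0.
    """
    limit = len(words) if max_words == 0 else min(max_words, len(words))
    for start in range(len(words)):
        for length in range(1, limit + 1):
            if start + length > len(words):
                break  # run would extend past the end of the phrase
            yield tuple(words[start:start + length])


words = ["w1", "w2", "w3", "w4", "w5", "w6"]
for cap in (1, 5, 0):
    count = sum(1 for _ in candidate_phrases(words, cap))
    print("max_words=%d: %d candidates" % (cap, count))
```

With a cap of 1 the count grows linearly with the phrase length, whereas with no cap it grows quadratically, which is the trade-off between a quick annotator and a more thorough one.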

`--checkpoint=CHECKPOINT`
 : Periodically save checkpoint files in the specified directory.  These files can save time when starting again after a reboot (and it's easier than setting up Condor etc).  As well as protecting against random reboots, this can be used for scheduled reboots: if a file called ExitASAP appears in the checkpoint directory, annogen will checkpoint, remove the ExitASAP file, and exit.  After a run has completed, the checkpoint directory should be removed, unless you want to re-do the last part of the run for some reason.
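
A script that runs before a scheduled reboot could use the ExitASAP mechanism along the following lines. This is a minimal sketch: the function names, the timeout and the polling interval are all assumptions for illustration, and only the ExitASAP file name and its removal by annogen come from the description above.

```python
import os
import time


def request_exit(checkpoint_dir):
    """Create an empty ExitASAP file in checkpoint_dir and return its path."""
    path = os.path.join(checkpoint_dir, "ExitASAP")
    with open(path, "w"):
        pass
    return path


def wait_until_acknowledged(path, timeout=600.0, poll=1.0):
    """Return True once the marker file has been removed, False on timeout.

    Removal of the file is taken as the sign that annogen has checkpointed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not os.path.exists(path):
            return True
        time.sleep(poll)
    return not os.path.exists(path)
```

A reboot script might call `request_exit` with the same directory that was given to --checkpoint, then proceed only when `wait_until_acknowledged` returns True.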

`-d DIAGNOSE, --diagnose=DIAGNOSE`
 : Output some diagnostics for the specified word. Use this option to help answer "why doesn't it have a rule for...?" issues. This option expects the word without markup and uses the system locale (UTF-8 if it cannot be detected).

`--diagnose-limit=DIAGNOSE_LIMIT`
 : Maximum number of phrases to print diagnostics for (0 means unlimited); can be useful when trying to diagnose a common word in rulesFile without re-evaluating all phrases that contain it. Default: 10

`-m, --diagnose-manual`
 : Check and diagnose potential failures of --manualrules

`--no-diagnose-manual`
 : Cancels any earlier --diagnose-manual option in Makefile variables etc

`-q, --diagnose-quick`
 : Ignore all phrases that do not contain the word specified by the --diagnose option, for getting a faster (but possibly less accurate) diagnostic.  The generated annotator is not likely to be useful when this option is present.  You may get quick diagnostics **without** these disadvantages by loading a --rulesFile instead.

`--no-diagnose-quick`
 : Cancels any earlier --diagnose-quick option in Makefile variables etc

`--priority-list=PRIORITY_LIST`
 : Instead of generating an annotator, use the input examples to generate a list of (non-annotated) words with priority numbers, a higher number meaning the word should have greater preferential treatment in ambiguities, and write it to this file (or compressed .gz, .bz2 or .xz file).  If the file provided already exists, it will be updated, thus you can amend an existing usage-frequency list or similar (although the final numbers are priorities and might no longer match usage-frequency exactly).  The purpose of this option is to help if you have an existing word-priority-based text segmenter and wish to update its data from the examples; this approach might not be as good as the Yarowsky-like one (especially when the same word has multiple readings to choose from), but when there are integration issues with existing code you might at least be able to improve its word-priority data.
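
If your own segmenter needs to read the resulting file, the compression can be chosen from the file extension. The helper below is a hypothetical sketch, not annogen code: it assumes UTF-8 text and says nothing about the layout of the lines inside the file, which you should check against a file annogen has actually written.

```python
import bz2
import gzip
import lzma


def open_maybe_compressed(path, mode="rt"):
    """Open path as UTF-8 text, decompressing according to its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, mode, encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

The same helper works for writing if it is called with mode `"wt"`.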

`-t, --time-estimate`
 : Estimate time to completion.  The code to do this is unreliable and is prone to underestimate.  If you turn it on, its estimate is displayed at the end of the status line as days, hours or minutes.

`--no-time-estimate`
 : Cancels any earlier --time-estimate option in Makefile variables etc

`-0, --single-core`
 : Use only one CPU core even when others are available. If this option is not set, multiple cores are used if a 'futures' package is installed or if run under MPI or SCOOP; this currently requires --checkpoint + shared filespace, and is currently used only for large collocation checks in limited circumstances. Single-core saves on CPU power consumption, but if the computer is set to switch itself off at the end of the run then **total** energy used is generally less if you allow it to run multicore and reach that switchoff sooner.

`--no-single-core`
 : Cancels any earlier --single-core option in Makefile variables etc

`-p STATUS_PREFIX, --status-prefix=STATUS_PREFIX`
 : Label to add at the start of the status line, for use if you batch-run annogen in multiple configurations and want to know which one is currently running
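
A batch run over several configurations might build its command lines as sketched below, giving each one its own status prefix. The script name `annogen.py`, the configuration labels and the `build_commands` helper are assumptions for illustration. Input redirection and any other options your setup needs are omitted, so the commands are printed rather than executed.

```python
def build_commands(configs, script="annogen.py"):
    """Return one argument list per configuration, each labelled by its name.

    configs maps a label to the extra options for that run.
    """
    commands = []
    for label, options in configs.items():
        prefix = "--status-prefix=" + label + ": "
        commands.append(["python", script, prefix] + list(options))
    return commands


# Hypothetical labels; the options themselves are described above.
configs = {
    "quick": ["--single-words"],
    "thorough": ["--max-words=5"],
}
for command in build_commands(configs):
    print(command)
```

Each argument list can then be passed to `subprocess.run` with the input examples supplied on standard input.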