`--annotation-names=ANNOTATION_NAMES`
 : Comma-separated list of annotation types supplied to sharp-multi (e.g. Pinyin,Yale), if you want the Android app etc to be able to name them.  You can also set just one annotation name here if you are not using sharp-multi.

`-o, --allow-overlaps`
 : Normally, the analyser avoids generating rules that could overlap with each other in a way that would leave the program not knowing which one to apply.  If a short rule would cause overlaps, the analyser will prefer to generate a longer rule that uses more context, and if even the entire phrase cannot be made into a rule without causing overlaps then the analyser will give up on trying to cover that phrase.  This option allows the analyser to generate rules that could overlap, as long as none of the overlaps would cause actual problems in the example phrases. Thus more of the examples can be covered, at the expense of a higher risk of ambiguity problems when applying the rules to other texts.  See also the -y option.

`--no-allow-overlaps`
 : Cancels any earlier `--allow-overlaps` option in Makefile variables etc

`-P, --primitive`
 : Don't bother with any overlap or conflict checks at all, just make a rule for each word. The resulting parser is not likely to be useful, but the summary might be.

`--no-primitive`
 : Cancels any earlier `--primitive` option in Makefile variables etc

`-y YBYTES, --ybytes=YBYTES`
 : Look for candidate Yarowsky seed-collocations within this number of bytes of the end of a word.  If this is set then overlaps and rule conflicts will be allowed when seed collocations can be used to distinguish between them, and the analysis is likely to be faster.  Markup examples that are completely separate (e.g. sentences from different sources) must have at least this number of (non-whitespace) bytes between them.
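
As a rough illustration of what "within this number of bytes" means, here is a minimal sketch. It is not annogen's actual code: the function name `seed_in_range` is hypothetical, and it assumes the search window extends `ybytes` bytes before the start and after the end of the word. Note that the distance is counted in bytes, so a multi-byte UTF-8 character uses up more of the window than an ASCII one.

```python
def seed_in_range(text: bytes, word_start: int, word_end: int,
                  seed: bytes, ybytes: int) -> bool:
    """Hypothetical check: does `seed` occur wholly within `ybytes`
    bytes before the word's start or after the word's end?"""
    lo = max(0, word_start - ybytes)          # clamp window to start of text
    hi = min(len(text), word_end + ybytes)    # clamp window to end of text
    return seed in text[lo:word_start] or seed in text[word_end:hi]


text = "the bank of the river and the bank of England".encode("utf-8")
first = text.find(b"bank")
second = text.find(b"bank", first + 1)

# With a 13-byte range, "river" is near the first "bank" but not the second,
# so it could help tell the two occurrences apart.
near_first = seed_in_range(text, first, first + 4, b"river", 13)
near_second = seed_in_range(text, second, second + 4, b"river", 13)
```

A larger range would put "river" within reach of both occurrences, which is why the choice of range matters.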

`--ybytes-max=YBYTES_MAX`
 : Extend the Yarowsky seed-collocation search to check over larger ranges up to this maximum.  If this is set then several ranges will be checked in an attempt to determine the best one for each word, but see also ymax-threshold.

`--ymax-threshold=YMAX_THRESHOLD`
 : Limits the length of word that receives the narrower-range Yarowsky search when ybytes-max is in use. For words longer than this, the search will go directly to ybytes-max. This is for languages where the likelihood of a word's annotation being influenced by its immediate neighbours more than its distant collocations increases for shorter words, and less is to be gained by comparing different ranges when processing longer words. Setting this to 0 means no limit, i.e. the full range will be explored on **all** Yarowsky checks.

`--ybytes-step=YBYTES_STEP`
 : The increment value for the loop between ybytes and ybytes-max
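
The way `--ybytes`, `--ybytes-max`, `--ybytes-step` and `--ymax-threshold` interact might be pictured with the following sketch. It is an assumption-laden illustration, not annogen's implementation: the function `candidate_ranges` is hypothetical, the unit in which word length is measured is left unspecified, and whether the real loop includes `ybytes-max` itself when the step does not land on it exactly is not stated in this document.

```python
def candidate_ranges(word_len, ybytes, ybytes_max=0, ybytes_step=1,
                     ymax_threshold=0):
    """Hypothetical: list the search ranges that would be tried for a word."""
    if not ybytes_max or ybytes_max <= ybytes:
        return [ybytes]                 # no extended search requested
    if ymax_threshold and word_len > ymax_threshold:
        return [ybytes_max]             # long word: go directly to the maximum
    # Short word (or threshold 0 = no limit): try each range in turn
    return list(range(ybytes, ybytes_max + 1, ybytes_step))


short_word = candidate_ranges(2, ybytes=10, ybytes_max=30,
                              ybytes_step=10, ymax_threshold=5)
long_word = candidate_ranges(8, ybytes=10, ybytes_max=30,
                             ybytes_step=10, ymax_threshold=5)
```

Under these assumptions the short word is checked at ranges 10, 20 and 30, whereas the long word is checked at 30 only.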

`-k, --warn-yarowsky`
 : Warn when absolutely no distinguishing Yarowsky seed collocations can be found for a word in the examples

`--no-warn-yarowsky`
 : Cancels any earlier `--warn-yarowsky` option in Makefile variables etc

`-K, --yarowsky-all`
 : Accept Yarowsky seed collocations even from input characters that never occur in annotated words (this might include punctuation and example-separation markup)

`--no-yarowsky-all`
 : Cancels any earlier `--yarowsky-all` option in Makefile variables etc

`--yarowsky-debug=YAROWSKY_DEBUG`
 : Report the details of seed-collocation false positives if there are a large number of matches and at most this number of false positives (default 1). Occasionally these might be due to typos in the corpus, so it might be worth a check.

`--normalise-debug=NORMALISE_DEBUG`
 : When `--capitalisation` is not in effect, report words that are usually capitalised but that have at most this number of lower-case exceptions (default 1), for investigation of possible typos in the corpus

`-1, --single-words`
 : Do not consider any rule longer than 1 word, although it can still have Yarowsky seed collocations if -y is set. This speeds up the search, but at the expense of thoroughness. You might want to use this in conjunction with -y to make a parser quickly. It is like -P (primitive) but without removing the conflict checks.

`--no-single-words`
 : Cancels any earlier `--single-words` option in Makefile variables etc

`--max-words=MAX_WORDS`
 : Limits the number of words in a rule; rules longer than this are not considered.  0 means no limit.  `--single-words` is equivalent to `--max-words=1`.  If you need to limit the search time, and are using -y, it should suffice to use `--single-words` for a quick annotator or `--max-words=5` for a more thorough one.

`--checkpoint=CHECKPOINT`
 : Periodically save checkpoint files in the specified directory.  These files can save time when starting again after a reboot (and it's easier than setting up Condor etc).  As well as a protection against random reboots, this can be used for scheduled reboots: if a file called ExitASAP appears in the checkpoint directory, annogen will checkpoint, remove the ExitASAP file, and exit.  After a run has completed, the checkpoint directory should be removed, unless you want to re-do the last part of the run for some reason.

`--checkpoint-period=CHECKPOINT_PERIOD`
 : Approximate number of seconds between checkpoints (default 1000).  Setting this to 0 disables periodic checkpoints but still allows use of the checkpoint directory for concurrency or ExitASAP processing.
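
The ExitASAP convention can be sketched from both sides as follows. This is only an illustration of the behaviour described above, not annogen's own code: the function names `request_exit` and `exit_requested` are hypothetical, and the step of actually writing the checkpoint before exiting is omitted.

```python
import os


def request_exit(checkpoint_dir: str) -> None:
    """What a shutdown script might do: create an empty ExitASAP file."""
    open(os.path.join(checkpoint_dir, "ExitASAP"), "w").close()


def exit_requested(checkpoint_dir: str) -> bool:
    """What a long-running job might poll: if ExitASAP is present,
    remove it and report that the job should checkpoint and exit."""
    flag = os.path.join(checkpoint_dir, "ExitASAP")
    if not os.path.exists(flag):
        return False
    os.remove(flag)   # removed so that the next run does not exit at once
    return True
```

A scheduled-reboot script would then call something like `request_exit` on the directory given to `--checkpoint` and wait for the process to finish before rebooting.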

`-d DIAGNOSE, --diagnose=DIAGNOSE`
 : Output some diagnostics for the specified word. Use this option to help answer "why doesn't it have a rule for...?" issues. This option expects the word without markup and uses the system locale (UTF-8 if it cannot be detected).

`--diagnose-limit=DIAGNOSE_LIMIT`
 : Maximum number of phrases to print diagnostics for (0 means unlimited); can be useful when trying to diagnose a common word in rulesFile without re-evaluating all phrases that contain it. Default: 10

`-m, --diagnose-manual`
 : Check and diagnose potential failures of `--manualrules`

`--no-diagnose-manual`
 : Cancels any earlier `--diagnose-manual` option in Makefile variables etc

`-q, --diagnose-quick`
 : Ignore all phrases that do not contain the word specified by the `--diagnose` option, for getting a faster (but possibly less accurate) diagnostic.  The generated annotator is not likely to be useful when this option is present.  You may get quick diagnostics **without** these disadvantages by loading a `--rulesFile` instead.

`--no-diagnose-quick`
 : Cancels any earlier `--diagnose-quick` option in Makefile variables etc

`--priority-list=PRIORITY_LIST`
 : Instead of generating an annotator, use the input examples to generate a list of (non-annotated) words with priority numbers, a higher number meaning the word should have greater preferential treatment in ambiguities, and write it to this file (or compressed .gz, .bz2 or .xz file).  If the file provided already exists, it will be updated, thus you can amend an existing usage-frequency list or similar (although the final numbers are priorities and might no longer match usage-frequency exactly).  The purpose of this option is to help if you have an existing word-priority-based text segmenter and wish to update its data from the examples; this approach might not be as good as the Yarowsky-like one (especially when the same word has multiple readings to choose from), but when there are integration issues with existing code you might at least be able to improve its word-priority data.

`-t, --time-estimate`
 : Estimate time to completion.  The code to do this is unreliable and is prone to underestimate.  If you turn it on, its estimate is displayed at the end of the status line as days, hours or minutes.

`--no-time-estimate`
 : Cancels any earlier `--time-estimate` option in Makefile variables etc

`-0, --single-core`
 : Use only one CPU core even when others are available. If this option is not set, multiple cores are used if a 'futures' package is installed or if run under MPI or SCOOP; this currently requires `--checkpoint` + shared filespace, and is currently used only for large collocation checks in limited circumstances. Single-core saves on CPU power consumption, but if the computer is set to switch itself off at the end of the run then **total** energy used is generally less if you allow it to run multicore and reach that switchoff sooner.

`--no-single-core`
 : Cancels any earlier `--single-core` option in Makefile variables etc

`-p STATUS_PREFIX, --status-prefix=STATUS_PREFIX`
 : Label to add at the start of the status line, for use if you batch-run annogen in multiple configurations and want to know which one is currently running

Copyright and Trademarks
========================

(c) Silas S. Brown, licensed under Apache 2

* Android is a trademark of Google LLC.

* Apache is a registered trademark of The Apache Software Foundation.

* AppEngine is possibly a trademark of Google LLC.

* Apple is a trademark of Apple Inc.

* Firefox is a registered trademark of The Mozilla Foundation.

* Google Play is a trademark of Google LLC.

* Google is a trademark of Google LLC.

* Java is a registered trademark of Oracle Corporation in the US and possibly other countries.

* Javascript is a trademark of Oracle Corporation in the US.

* Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

* MP3 is a trademark that was registered in Europe to Hypermedia GmbH Webcasting but I was unable to confirm its current holder.

* Mac is a trademark of Apple Inc.

* Microsoft is a registered trademark of Microsoft Corp.

* Python is a trademark of the Python Software Foundation.

* Raspberry Pi is a trademark of the Raspberry Pi Foundation.

* Unicode is a registered trademark of Unicode, Inc. in the United States and other countries.

* Windows is a registered trademark of Microsoft Corp.

* iPhone is a trademark of Apple in some countries.

* Any other trademarks I mentioned without realising are trademarks of their respective holders.