
Commit 7efd3ebc authored by Dr Rich Wareham

provide an initial implementation of the utility

Modify the existing cardsync utility to become a new "instsync" utility.

Part of https://gitlab.developers.cam.ac.uk/uis/devops/iam/ibis/ibis/-/issues/68
parent d0ef28de
[run]
omit =
.tox/*
setup.py
.tox/*
setup.py
.venv/*
cardsync/__init__.py
\ No newline at end of file
instsync/__main__.py
instsync/tests/*
@@ -21,3 +21,25 @@ test:
artifacts:
reports:
cobertura: ./artefacts/**/coverage.xml
sync:
stage: deploy
image: docker:latest
services:
- docker:dind
needs:
- build
before_script:
- echo "$CI_REGISTRY_PASSWORD" | docker login "$CI_REGISTRY" --username "$CI_REGISTRY_USER" --password-stdin
script:
- >-
docker run --rm
-v "$CONFIGURATION:/config.yaml:ro"
-v "$GOOGLE_APPLICATION_CREDENTIALS:/credentials.json:ro"
-e GOOGLE_APPLICATION_CREDENTIALS=/credentials.json
"$TOOL_IMAGE"
--configuration /config.yaml
variables:
TOOL_IMAGE: "$CI_REGISTRY_IMAGE/$CI_COMMIT_REF_SLUG:$CI_COMMIT_SHA"
rules:
- if: "$CONFIGURATION && $GOOGLE_APPLICATION_CREDENTIALS && $PERFORM_SYNC"
# .logan.yaml configures the logan tool to run sync tool with google authentication
project: "cardsync"
# .logan.yaml configures the logan tool
project: "instsync"
build:
dockerfile: Dockerfile
secrets:
# Mount credentials for acting as the *Product Admin* service account. This
......
@@ -14,4 +14,4 @@ RUN pip3 install tox && pip3 install -r requirements.txt
ADD ./ ./
RUN pip3 install -e .
ENTRYPOINT ["cardsync"]
ENTRYPOINT ["instsync"]
MIT License
Copyright (c) 2020 University of Cambridge Information Services
Copyright (c) 2022 University of Cambridge Information Services
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
# Sync Tool
# Institution Mapping Tool
Synchronisation tool for keeping the API database up to date with the legacy card database
Tool to query Jackdaw for CHRIS and CAMSIS institutions and their mapping to
Lookup institutions.
# Usage
## Usage
The sync tool needs to connect to Oracle databases and Google storage. If running
locally, you would need to install both the Oracle client libraries and the gcloud SDK.
Fortunately, the Docker container provided contains both of these.
The sync tool needs to connect to Oracle databases and, potentially, Google
Cloud Storage. If running locally, you would need to install the proprietary Oracle
client libraries and the gcloud SDK. Fortunately, the Docker container provided
contains both of these.
Also, in order to pass the Google credentials into the container, a logan
configuration is provided.
So the logan tool can be called with the following to have the gcloud already
authenticated:
```bash
$ logan -- help
```
@@ -20,83 +24,41 @@ You'll need a configuration file for the sync tool, so you need to pass extra docker
arguments for logan to include:
```bash
$ logan --docker-run-args "-v $(pwd)/configuration.yaml:/config.yaml" \
-- -c /config.yaml enhanced-cards
-- -c /config.yaml
```
The logan tool is available at:
https://gitlab.developers.cam.ac.uk/uis/devops/tools/logan
## Configuration
## Reconciling CRSids
It has become clear that a large proportion of cardholder records within the legacy card
database do not have a CRSid recorded, although the person they refer to does hold a CRSid.
This has become problematic for two reasons:
* The CRSid is not encoded within the cardholder's card.
* Datafeeds from the card system do not include the CRSid as expected.
These cardholder records without CRSids have been manually added to the card system. Cardholders
created by automatic ingest of data from CHRIS and CamSIS should all have CRSids assigned, as an
automated process within the legacy card system matches them against Jackdaw using either the
USN or Staff number, updating the cardholder record with the CRSid held by Jackdaw.
In order to catch cases where cardholders have been manually added to the card system without
a CRSid specified, two scripts have been run which allow their CRSids to be reconciled against
data in Jackdaw. Both are expected to be run locally, with some manual sanity checks to ensure
that incorrect data is not imported into the legacy card system.
### `reconcile-crsids`
Configuration can be passed in a YAML-formatted configuration file
([example](configuration.yaml.example)) via the `--configuration` option or via
environment variables.
This task matches cardholder records against confirmed staff preregistration requests within
Jackdaw. It does not make any changes to either the data in Jackdaw or the legacy card system,
but instead outputs data indicating which records match and which records cannot be matched.
Environment variables start with `INSTSYNC_` and then follow the naming of settings
in the configuration file. Nested fields can be specified using a `__`
delimiter. For example, to override the Jackdaw username from that supplied in
`configuration.yaml`:
Staff preregistration requests are seen as the best source of data to match against to determine
whether a cardholder has a CRSid, as they largely relate to the same group of users who are
manually added to the card system - i.e. not students automatically added to the card system and
not staff members imported from CHRIS. Records are matched based on date of birth and surname,
with forename disregarded due to differences in spelling and the partial inclusion of middle
names - but with the Levenshtein distance between forenames output so that differences can be
reviewed manually.
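As a rough illustration of that review step only, the following minimal, self-contained sketch
computes a plain Levenshtein edit distance between two forenames; this code is not part of the
tool itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming (Wagner-Fischer) edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        previous = current
    return previous[-1]


# A small distance (e.g. 'Jon' vs 'Jonathan') suggests the same person;
# a large one flags the match for closer manual inspection.
print(levenshtein("Jon", "Jonathan"))  # 5
```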
This task can be run using:
```shell
logan --docker-run-args "-v $(pwd)/config.yml:/config.yml -v $(pwd)/output:/output" \
    -- -c /config.yml reconcile-crsids --dir /output
```

```console
$ INSTSYNC_JACKDAW__USERNAME=myuser instsync --configuration configuration.yaml
```
This task requires that the configuration provided contains a `jackdaw` block, as specified in
`configuration.yaml.example`.
This task outputs three CSV files:
* `crsid_not_matched` contains details of cardholders who could not be matched against records
within Jackdaw.
* `crsid_matched` contains details of cardholders who could be matched against records within
Jackdaw, including the CRSid of the matching record and the Levenshtein distance between the
cardholder forename and the forename held in Jackdaw.
* `crsid_multiple_matches` contains details of cardholders who could be matched to multiple
CRSids held within Jackdaw. These should be manually reviewed to determine which
CRSid should be associated with the cardholder record.
Note that configuration in environment variables will override any configuration
passed via `--configuration`.
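For orientation, here is a minimal sketch of what such a configuration file might contain.
Apart from the `jackdaw.username` setting implied by the example above, the field names and
values are illustrative assumptions; `configuration.yaml.example` remains the authoritative
reference.

```yaml
# Illustrative sketch only - see configuration.yaml.example for the real schema.
jackdaw:
  username: myuser                 # overridable via INSTSYNC_JACKDAW__USERNAME
  password: "supply-via-a-secret"  # assumed field
  dsn: "jackdaw.example.cam.ac.uk:1521/SERVICE"  # assumed field
```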
### `update-crsids`
## CI Configuration
This task takes a CSV containing `cam_uid` and `crsid` columns and updates all cardholder records
with the CRSids provided. It will only update cardholders where a CRSid is not already present
and will error if no cardholder can be found for a given `cam_uid`.
The following CI variables need to be available for a sync job to run:
**DANGER**: this task does not have a dry-run mode and should therefore be tested carefully
against the development instance of the legacy card system.
This task can be run using:
```shell
logan --docker-run-args "-v $(pwd)/config.yml:/config.yml -v $(pwd)/output:/output" \
-- -c /config.yml update-crsids --crsid-source /output/matched.csv
```
* `CONFIGURATION`: A "file-like" CI variable which expands to a path to a file
containing the configuration. Currently this is configured to upload to a
Google Cloud Storage bucket.
* `GOOGLE_APPLICATION_CREDENTIALS`: A "file-like" CI variable which expands to a
path to Google service account credentials which should be used to upload to
a bucket.
* `PERFORM_SYNC`: As long as this variable has some value, the job will run.
Where `./output/matched.csv` in the local directory contains a CSV with the columns `crsid`
and `cam_uid`.
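For illustration, such a CSV might look like the following (the identifiers are invented):

```csv
cam_uid,crsid
100123456,ab1234
100654321,cd5678
```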
Both of these variables are marked as "protected-only", meaning they are only available on
protected branches and not on merge-request branches.
"""
Card Database synchronisation tool
Usage:
cardsync (-h | --help)
cardsync [--configuration=FILE]... [--order-by=STRING] [--dir=DIRECTORY] [--limit=INT]
[--crsid-source=CSV_FILE] [--quiet] [--debug] [--format=STRING]
(
enhanced-cards|card-photos|temporary-cards|available-barcodes|card-notes|
card-logos|holders|cards|barcodes|orphaned-cardholders|diff-enhanced-cards|
diff-temporary-cards|backsync-cards|reconcile-crsids|update-crsids|
cardholders-with-multiple-issued-cards|revoke-duplicate-issued-cards
)
Options:
-h, --help Show a brief usage summary.
--crsid-source=CSV_FILE The location of a mapping of crsid to cam_uid, required in the
update-crsids task
--quiet Reduce logging verbosity.
--debug Log debugging information.
-o, --order-by=STRING String to be passed to order by when selecting data.
When not specified no particular order is guaranteed.
-c, --configuration=FILE Specify configuration file to load which has
credentials to access database.
-d, --dir=DIRECTORY Write CSV to a local directory instead of the bucket.
-l, --limit=INT Limit on the number of results returned from any query.
-f, --format=STRING Optionally specify the export format (csv, json).
"""
import logging
import os
import sys
import docopt
import yaml
import deepmerge
import geddit
from datetime import datetime, timezone
from identitylib.card_client import CardClient
from . import source
from . import image_source
from . import upload
from . import reconcile_crsids
from . import update_crsids
from . import revoke_duplicate_cards
LOG = logging.getLogger(os.path.basename(sys.argv[0]))
def load_settings(urls):
settings = {}
for url in urls:
LOG.info('Loading settings from %s', url)
settings = deepmerge.always_merger.merge(settings, yaml.safe_load(geddit.geddit(url)))
return settings
def required_settings(settings):
missing = []
if 'source' not in settings:
missing.append("'source' at root")
else:
missing.extend([f"'{s}' in 'source'" for s in source.REQUIRED_SETTINGS
if s not in settings['source']])
if 'upload' not in settings:
missing.append("'upload' at root")
else:
missing.extend([f"'{s}' in 'upload'" for s in upload.REQUIRED_SETTINGS
if s not in settings['upload']])
if len(missing) > 0:
LOG.error('Missing settings in configuration: ' + ", ".join(missing))
raise Exception('Missing required settings')
def export_enhanced_cards_with_photos(connection, write_dir, settings, limit):
"""
Coordinates the export of `enhanced` cards alongside the persistence of photos.
The two exports are coordinated in order to efficiently fetch all photos for 'active'
cards.
"""
# TODO: cleanup to request photos only without calling get_enhanced_cards
(card_data, photo_ids_to_metadata) = source.get_enhanced_cards(
connection, with_photos=True, limit=limit
)
photo_data = image_source.persist_photos(connection, settings, photo_ids_to_metadata)
# card_upload_ref = upload.write(settings, 'enhanced-cards', card_data, write_dir)
photo_upload_ref = upload.write(settings, 'photos', photo_data, write_dir)
# we have to return both references so the calling process knows the name of the
# card export and the photo export
# print(card_upload_ref)
print(photo_upload_ref)
def main():
opts = docopt.docopt(__doc__)
# Configure logging
logging.basicConfig(
level=logging.DEBUG if opts['--debug'] else
logging.WARN if opts['--quiet'] else logging.INFO
)
# Optionally override select order by
order_by = opts.get('--order-by')
# Optionally write to local directory instead of bucket
write_dir = opts.get('--dir').rstrip(os.sep) if opts.get('--dir') else None
# Optionally write as json instead of csv
export_format = opts.get('--format').lower() if opts.get('--format') else 'csv'
# Optionally limit the number of rows returned
limit = int(opts.get('--limit')) if opts.get('--limit') else None
# read configuration files/objects/secrets in to settings
settings = load_settings(opts['--configuration'])
# check we have the required settings
required_settings(settings)
upload_ref = ''
export_type_to_query = {
"holders": source.CARDHOLDERS_SELECT,
"cards": source.CARDS_SELECT,
"barcodes": source.BARCODES_SELECT,
"available-barcodes": source.AVAILABLE_BARCODES_SELECT,
"card-notes": source.CARD_NOTES_SELECT,
"cardholders-with-multiple-issued-cards": (
revoke_duplicate_cards.SELECT_CARDHOLDERS_WITH_MULTIPLE_ISSUED_CARDS
)
}
connection = source.get_connection(settings['source'])
if opts['revoke-duplicate-issued-cards']:
data = revoke_duplicate_cards.revoke_duplicate_issued_cards(connection)
print(upload.write(settings, 'revoked_duplicate_cards', data, write_dir, export_format))
if opts['reconcile-crsids']:
if not settings.get('jackdaw') or not settings.get('jackdaw').get('dsn'):
raise ValueError('No `jackdaw` configuration provided in configuration file')
jd_connection = source.get_connection(settings['jackdaw'])
reconciliation_result = reconcile_crsids.reconcile_crsids(connection, jd_connection)
for file_name, content in reconciliation_result.items():
print(upload.write(settings, file_name, content, write_dir, export_format))
if opts['update-crsids']:
if not opts.get('--crsid-source'):
raise ValueError(
'`--crsid-source` must be a valid path to a csv of crsid and cam_uid mapping'
)
update_crsids.update_crsids(connection, opts['--crsid-source'])
if opts['card-photos']:
export_enhanced_cards_with_photos(connection, write_dir, settings, limit)
return
if opts['enhanced-cards']:
(card_data, _) = source.get_enhanced_cards(connection, with_photos=False, limit=limit)
upload_ref = upload.write(settings, 'enhanced-cards', card_data, write_dir, export_format)
if opts['card-logos']:
card_data = image_source.persist_logos(connection, settings)
upload_ref = upload.write(settings, 'card-logos', card_data, write_dir, export_format)
if opts['orphaned-cardholders']:
card_data = source.get_orphan_cards(connection, order_by, limit)
upload_ref = upload.write(settings, 'orphaned-cardholders',
card_data, write_dir, export_format)
if opts['backsync-cards']:
# write the card_serial_number back to the legacy database
coerce_settings = settings.copy()
# time must be appended to allow files to be sorted in time order
coerce_settings['upload']['append-time'] = True
card_client = CardClient(
settings.get('environment', {}).get('client_id'),
settings.get('environment', {}).get('client_secret'),
settings.get('environment', {}).get('base_url', None),
)
try:
# fetch the most recent backsync
object_ref = upload.get_recent_object_urls(
settings, 'backsync-cards', limit=1
)
except RuntimeError:
# search returns zero blobs
object_ref = []
process_date_time = datetime.now(tz=timezone.utc)
date_param = {}
if len(object_ref) > 0:
last_run_date = datetime.fromisoformat(object_ref[0].split(".")[-1])
date_param = {"updated_at__gte": last_run_date}
card_data = card_client.all_cards(params={"status": "ISSUED",
**date_param}
)
# write an empty file as we use the run date but don't need the file data
upload_ref = upload.write(coerce_settings, 'backsync-cards',
[], write_dir, 'csv', process_date_time)
source.backsync_card_serial_numbers(connection, card_data)
if opts['diff-enhanced-cards']:
coerce_settings = settings.copy()
# time must be appended to allow files to be sorted in time order
coerce_settings['upload']['append-time'] = True
(card_data, _) = source.get_enhanced_cards(connection,
with_photos=False,
limit=limit)
upload_ref = upload.write(coerce_settings, 'enhanced-cards',
card_data, write_dir, 'csv')
object_ref = upload.get_recent_object_urls(settings, 'enhanced-cards',
limit=2)
if len(object_ref) == 1:
card_data = source.blob_to_card_data(object_ref[0])
elif len(object_ref) == 2:
card_data = source.get_card_diff(object_ref[0], object_ref[1])
upload_ref = upload.write(settings, 'diff-enhanced-cards',
card_data, write_dir, export_format)
if opts['diff-temporary-cards']:
coerce_settings = settings.copy()
# time must be appended to allow files to be sorted in time order
coerce_settings['upload']['append-time'] = True
card_data = source.get_data(connection, source.TEMP_CARD_SELECT,
order_by, limit)
upload_ref = upload.write(coerce_settings, 'temporary-cards',
card_data, write_dir, 'csv')
object_ref = upload.get_recent_object_urls(settings, 'temporary-cards', limit=2)
if len(object_ref) == 1:
card_data = source.blob_to_card_data(object_ref[0])
elif len(object_ref) == 2:
card_data = source.get_card_diff(object_ref[0], object_ref[1])
upload_ref = upload.write(settings, 'diff-temporary-cards',
card_data, write_dir, export_format)
for export_type, select_query in export_type_to_query.items():
if (opts[export_type]):
card_data = source.get_data(connection, select_query, order_by, limit)
upload_ref = upload.write(settings, export_type, card_data, write_dir, export_format)
break
# output uploaded object reference
print(upload_ref)
from typing import Dict, List, Mapping, Tuple
from logging import getLogger
from cx_Oracle import Connection, DB_TYPE_BLOB, DB_TYPE_LONG_RAW
from io import BytesIO
from PIL import Image as PILImage
from .image_storage import Image, ImageAttributes, ImageStorage, IMAGE_FORMAT
from .source import get_cursor
from .utils import chunks
LOG = getLogger(__name__)
# Unfortunately the oracle client does not support query params for arrays of unknown
# size, so we have to add the ids using string formatting. This is not ideal but should
# be safe as the photo ids come from the photo_ids of the cards returned from the
# query to get active cards.
SELECT_PHOTOS_BY_IDS = """
SELECT PHOTO_ID, PHOTO
FROM PHOTOGRAPHS
WHERE PHOTO_ID IN ({})
"""
SELECT_LOGOS = """
SELECT LOGO_ID, LOGO, LOGO_NAME
FROM LOGOS
"""
def OutputTypeHandler(cursor, _, defaultType, *args):
"""
By default the Oracle client will stream binary data, which for some reason is significantly
slower than just fetching the binary in one go.
The docs suggest using this approach when fetching binary data smaller than 1GB:
https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html
"""
if defaultType == DB_TYPE_BLOB:
return cursor.var(DB_TYPE_LONG_RAW, arraysize=cursor.arraysize)
def process_image(image_bytes: bytes, photo_id: str):
"""
Helper method to get the size of an image - this is allowed to fail so that a single
corrupt image does not fail the rest of the export
"""
try:
with PILImage.open(BytesIO(image_bytes)) as image:
# ensure that the format is suitable for JPEG - resaving the image also strips any EXIF
# data which, in the case of these legacy photos, can cause the images to be wrongly
# rotated by the software that was used to rotate them
parsed_image = image.convert("RGB")
parsed_image_bytes = BytesIO()
parsed_image.save(parsed_image_bytes, format=IMAGE_FORMAT)
attributes = ImageAttributes(photo_id, parsed_image.height, parsed_image.width)
return (parsed_image_bytes.getvalue(), attributes)
except Exception as exception:
LOG.warning(exception)
LOG.warning(f'Invalid image for photo id {photo_id}, unable to calculate attributes')
return (image_bytes, ImageAttributes(photo_id, 0, 0))
def fetch_photos_with_metadata(
connection: Connection, ids: List[int], photo_id_to_metadata: Dict[int, Dict]
) -> List[Tuple[bytes, ImageAttributes, Dict]]:
"""
Fetches photos from the Oracle DB - creating the photo attributes from the image and
getting the metadata for the photo from `photo_id_to_metadata`.
Returns a list of tuples of (image bytes, photo attributes, photo metadata).
"""
id_query = ','.join([str(id) for id in ids])
LOG.debug(f'Querying for photos with ids {id_query}')
photos = list(get_cursor(connection, SELECT_PHOTOS_BY_IDS.format(id_query), log=False))
photos_with_attributes = []
for (photo_id, image_bytes) in photos:
(processed_bytes, attributes) = process_image(image_bytes, photo_id)
photos_with_attributes.append(
(processed_bytes, attributes, photo_id_to_metadata.get(photo_id, {}))
)
return photos_with_attributes
def get_persisted_images(
image_storage: ImageStorage, image_id_to_metadata: Dict[int, Dict]
):
"""
Gets images which have already been persisted (using an image_storage instance), returning
them as `Image` instances - matching any relevant metadata against `image_id_to_metadata`.
Returns a Tuple of: a list of `Image` instances which have been persisted, and a list of
image ids which are missing
"""
existing_images = image_storage.get_existing_images()
image_data: List[Image] = []
unmatched_ids: List[int] = []
for image_id, metadata in image_id_to_metadata.items():
if existing_images.get(image_id):
image = existing_images[image_id]
image_data.append(Image(image.image_path, image.attributes, metadata))
else:
unmatched_ids.append(image_id)
return image_data, unmatched_ids
def images_to_row_data(images: List[Image], id_field_name: str = 'id'):
"""
Converts a list of Image instances into a list of data where each item represents a row
of csv data. Writes a header row containing the field names.
Expects that all image metadata will have the same set of keys.
"""
row_data = []
for image in images:
# Append the heading row if we're on our first record
if not row_data:
row_data.append([
'image_path',
# map the generic 'id' field key to the supplied id_field_name (e.g. 'photo_id')
*(id_field_name if field == 'id' else field for field in image.attributes._fields),
*image.metadata.keys()
])
row_data.append([
image.image_path,
*list(image.attributes),
*image.metadata.values()
])