
Commit f5aff290 authored by Dr Rich Wareham

initial implementation of membership synchronisation

parent d7ef672b
Pipeline #163218 passed with stages
in 3 minutes and 39 seconds
@@ -3,5 +3,5 @@ omit =
@@ -128,3 +128,4 @@ photos/
# .logan.yaml configures the logan tool
-project: "instsync"
+project: "lookupsync"
dockerfile: Dockerfile
# Mount credentials for acting as the *Product Admin* service account. This
# account has full access to resources within the Google Cloud *Product
# Folder*. This service account is used to create the per-deployment *Project*
# and *Project Admin* service account.
- name: "Product folder admin service account credentials"
source: "sm://identity-meta-12c86b54/identity-folder-admin-credentials"
target: "/config/service-account-credentials.json"
# Set the Google application default credentials' location to be the secret
# which was mounted inside the container.
GOOGLE_APPLICATION_CREDENTIALS: "/config/service-account-credentials.json"
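Google client libraries discover these mounted credentials through the application-default-credentials lookup, which begins with this environment variable. A minimal sketch of that first step (the path is the `target` mount from this file; the helper name is illustrative, not part of the tool):

```python
import os

def default_credentials_path():
    # Application default credentials resolution starts with this variable;
    # here it points at the secret mounted into the container by logan.
    return os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/config/service-account-credentials.json"
print(default_credentials_path())
```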
@@ -14,4 +14,4 @@ RUN pip3 install tox && pip3 install -r requirements.txt
ADD ./ ./
RUN pip3 install -e .
-ENTRYPOINT ["instsync"]
+ENTRYPOINT ["lookupsync"]
-# Institution Mapping Tool
+# Lookup Membership Synchronisation Tool
-Tool to query Jackdaw for CHRIS and CAMSIS institutions and their mapping to
-Lookup institutions.
+Tool to query Lookup for CHRIS and CAMSIS institutional membership.
-## Usage
-The sync tool needs to connect to Oracle databases and, potentially, Google
-Cloud Storage. If running locally you would need to install the proprietary
-Oracle client libraries and the gcloud SDK. Fortunately, the Docker container
-provided contains both of these.
+> This tool is incomplete.
-Also, in order to pass the Google credentials into the container, a logan
-configuration is provided.
+## Installation
-So the logan tool can be called with the following to have the gcloud already
+The `lookupsync` tool can be installed via pip:
-    $ logan -- help
+    pip3 install --user https://gitlab.developers.cam.ac.uk/uis/devops/iam/ibis/membership-synchronisation.git
-You'll need a configuration file for the sync tool so you need to pass extra
-docker arguments for logan to include:
-    $ logan --docker-run-args "-v $(pwd)/configuration.yaml:/config.yaml" \
-        -- -c /config.yaml
+## Usage
-The logan tool is available at:
+See the output from `--help` for usage.
-## Configuration
+## Programmatic use
-Configuration can be passed in a YAML-formatted configuration file
-([example](configuration.yaml.example)) via the `--configuration` option or via
-environment variable.
+This tool can also be called programmatically by importing the `main` function
+and calling it with command line arguments:
-Environment variables start `INSTSYNC_` and then follow the naming of settings
-in the configuration file. Nested fields can be specified using a `__`
-delimiter. For example, to override the Jackdaw username from that supplied in
+    import lookupsync
-    $ INSTSYNC_JACKDAW__USERNAME=myuser instsync --configuration configuration.yaml
+    lookupsync.main([
+        'student-inst-members', '--gateway-client-id=ABCDEF',
+        '--gateway-client-secret-from=/path/to/secret', ...
+    ])
-Note that configuration in environment variables will override any configuration
-passed via `--configuration`.
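The environment-variable override behaviour described here can be sketched with plain dictionaries (the helper name `effective_setting` is illustrative, not part of the tool):

```python
import os

def effective_setting(name, file_settings, env_prefix="INSTSYNC_"):
    """Illustrative only: an environment variable takes precedence over the
    value from the --configuration file; `__` separates nested fields."""
    env_key = env_prefix + name.upper().replace(".", "__")
    if env_key in os.environ:
        return os.environ[env_key]
    # Fall back to the (nested) value from the configuration file.
    value = file_settings
    for part in name.split("."):
        value = value[part]
    return value

file_settings = {"jackdaw": {"username": "fileuser"}}
os.environ["INSTSYNC_JACKDAW__USERNAME"] = "myuser"
print(effective_setting("jackdaw.username", file_settings))  # environment wins
```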
## CI Configuration
The following CI variables need to be available for a sync job to run:
* `CONFIGURATION`: A "file-like" CI-variable which expands to a path to a file
containing the configuration. Currently this is configured to upload to a
Google Cloud Storage bucket.
* `GOOGLE_APPLICATION_CREDENTIALS`: A "file-like" CI-variable which expands to a
path to Google service account credentials which should be used to upload to
a bucket.
* `PERFORM_SYNC`: As long as this variable has some value, the job will run.
These variables are marked as "protected-only", meaning they are only available
on protected branches and not on merge request branches.
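The `PERFORM_SYNC` gate above can be sketched as a predicate; "has some value" is interpreted here as "set to a non-empty string", which is an assumption (the function name is illustrative):

```python
import os

def should_perform_sync(environ=None):
    # The sync job runs only when PERFORM_SYNC is set to some non-empty value.
    environ = os.environ if environ is None else environ
    return bool(environ.get("PERFORM_SYNC"))

print(should_perform_sync({"PERFORM_SYNC": "1"}))  # True
print(should_perform_sync({}))                     # False
```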
# For devops team members an account with read-only permissions to Jackdaw is available
# in 1Password: https://uis-devops.1password.eu/vaults/i5en3kuzea3vdqvstra7cmj77q/allitems/t5tssb3lubhydioi2k5ughoalq
dsn: <Oracle DSN in the format "host:port/servicename">
username: <Oracle Username>
password: <Oracle User Password>
# Specify output. Any of the following URLs work.
# Write to standard output
url: "stdout://"
# Write to local file
url: "file:///tmp/foo.json"
# Write to Cloud Storage bucket
url: "gs://some-bucket/path/to/object"
# (Optional) pretty-print output
indent: 2
# (Optional) timeout in seconds for Cloud upload
timeout: 120
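The scheme of the `url` setting selects where output is written. A sketch of that dispatch over the three documented schemes (the function name is illustrative):

```python
from urllib.parse import urlsplit

# The three output schemes documented in the configuration example above.
SINKS = {"stdout", "file", "gs"}

def sink_for(url):
    """Map an output URL to one of the documented sink kinds."""
    scheme = urlsplit(url).scheme
    if scheme not in SINKS:
        raise ValueError(f"unsupported output scheme: {scheme!r}")
    return scheme

print(sink_for("gs://some-bucket/path/to/object"))
```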
"""
Lookup membership synchronisation tool

Usage:
    lookupsync (-h | --help)
    lookupsync [--quiet] [--debug] [--configuration=FILE]...

Options:
    -h, --help                 Show a brief usage summary.
    --quiet                    Reduce logging verbosity.
    --debug                    Log debugging information.
    -c, --configuration=FILE   Specify configuration file to load which has
                               credentials to access the database.
"""
import json
import logging
import os
import sys
import docopt
from . import source, camsis
from .settings import load_settings
from .upload import google_storage_upload
LOG = logging.getLogger(os.path.basename(sys.argv[0]))
def main():
    # In principle __doc__ could be None so use f-strings to ensure input to docopt is always a
    # string.
    opts = docopt.docopt(f"{__doc__}")

    # Configure logging.
    logging.basicConfig(
        level=logging.DEBUG if opts["--debug"] else
        logging.WARN if opts["--quiet"] else logging.INFO
    )

    # Read configuration files/objects/secrets in to settings. Pydantic takes care of validating
    # the format.
    settings = load_settings(opts["--configuration"])

    # Connect to Jackdaw.
    jd_connection = source.get_connection(**(settings.jackdaw.dict()))

    # Generate output as a JSON document.
    result = json.dumps(
        {"institutions": collect_institutions(jd_connection)},
        indent=settings.output.indent,
    )

    # Write result to output.
    if settings.output.url.scheme == "stdout":
        LOG.info("Writing output to standard output...")
        print(result)
    elif settings.output.url.scheme == "file":
        LOG.info("Writing output to \"%s\"...", settings.output.url.path)
        with open(settings.output.url.path, "w") as fobj:
            fobj.write(result)
    elif settings.output.url.scheme == "gs":
        LOG.info("Writing output to Google Storage at \"%s\"...", settings.output.url)
        # Call shape assumed; the upload invocation itself is elided in this diff.
        google_storage_upload(result, settings.output)
    else:  # pragma: no cover
        # Should not be reached since scheme is verified in the settings module.
        raise NotImplementedError()
def collect_institutions(connection):
    """
    Collect CHRIS and CAMSIS institutions together into a single list of the form:

        {
            "instid": <Lookup inst id>,
            "identifiers": [
                # ...
            ],
        },
        # ...

    """
    # Fetch CHRIS data into a dictionary keyed by Lookup instid.
    LOG.info("Fetching CHRIS institutions...")
    chris_data = {}
    for row in source.fetch(connection, "SELECT unitref, inst FROM chris_data.unit"):
        if row["inst"] is None:
            continue
        chris_data[row["inst"]] = chris_data.get(row["inst"], []) + [row["unitref"]]

    # Pre-seed CAMSIS data from static table.
    camsis_data = {inst: [dept] for dept, _, inst in camsis.STATIC_COLLEGES}

    # Fetch CAMSIS data into dictionary keyed by Lookup instid.
    LOG.info("Fetching CAMSIS institutions...")
    for row in source.fetch(connection, "SELECT dept, inst FROM camsis_data.dept_inst"):
        if row["inst"] is None:
            continue
        camsis_data[row["inst"]] = camsis_data.get(row["inst"], []) + [row["dept"]]

    # Get list of all Lookup instids covered in ascending alphabetical order.
    instids = sorted(set(chris_data.keys()) | set(camsis_data.keys()))

    # Compute final result by merging data from CAMSIS and CHRIS.
    institutions = []
    for instid in instids:
        identifiers = [f'{instid}@insts.lookup.cam.ac.uk']
        for chris_id in sorted(chris_data.get(instid, [])):
            # The exact identifier scoping for CHRIS unit refs is elided in this diff.
            identifiers.append(chris_id)
        for camsis_id in sorted(camsis_data.get(instid, [])):
            # The exact identifier scoping for CAMSIS dept codes is elided in this diff.
            identifiers.append(camsis_id)
        institutions.append({"instid": instid, "identifiers": identifiers})

    return institutions
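The merge performed by `collect_institutions` can be seen on small hand-made inputs. Bare CHRIS/CAMSIS ids are used here because the exact identifier scoping is elided in this diff; only the `@insts.lookup.cam.ac.uk` form for the instid itself appears in the source:

```python
# Hypothetical sample inputs, keyed by Lookup instid as in collect_institutions.
chris_data = {"ABC": ["U0002", "U0001"], "DEF": ["U0003"]}
camsis_data = {"ABC": ["A"], "GHI": ["G"]}

# Union of instids from both sources, in ascending alphabetical order.
instids = sorted(set(chris_data) | set(camsis_data))

institutions = []
for instid in instids:
    identifiers = [f"{instid}@insts.lookup.cam.ac.uk"]
    identifiers += sorted(chris_data.get(instid, []))
    identifiers += sorted(camsis_data.get(instid, []))
    institutions.append({"instid": instid, "identifiers": identifiers})

print([i["instid"] for i in institutions])  # ['ABC', 'DEF', 'GHI']
```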
# Static mapping providing:
#   (CamSIS college code, CamSIS description, Lookup instid)
# Generated manually from: https://www.camsis.cam.ac.uk/files/student-codes/a01.html
STATIC_COLLEGES = [
    ("CAI", "Gonville and Caius College", "CAIUS"),
    ("CC", "Corpus Christi College", "CORPUS"),
    ("CHR", "Christ's College", "CHRISTS"),
    ("CHU", "Churchill College", "CHURCH"),
    ("CL", "Clare College", "CLARE"),
    ("CLH", "Clare Hall", "CLAREH"),
    ("CTH", "St Catharine's College", "CATH"),
    ("DAR", "Darwin College", "DARWIN"),
    ("DOW", "Downing College", "DOWN"),
    ("ED", "St Edmund's College", "EDMUND"),
    ("EM", "Emmanuel College", "EMM"),
    ("F", "Fitzwilliam College", "FITZ"),
    ("G", "Girton College", "GIRTON"),
    ("HH", "Hughes Hall", "HUGHES"),
    ("HO", "Homerton College", "HOM"),
    ("JE", "Jesus College", "JESUS"),
    ("JN", "St John's College", "JOHNS"),
    ("K", "King's College", "KINGS"),
    ("LC", "Lucy Cavendish College", "LCC"),
    ("M", "Magdalene College", "MAGD"),
    ("N", "Newnham College", "NEWN"),
    ("NH", "Murray Edwards College", "NEWH"),
    ("PEM", "Pembroke College", "PEMB"),
    ("PET", "Peterhouse", "PET"),
    ("Q", "Queens' College", "QUEENS"),
    ("R", "Robinson College", "ROBIN"),
    ("SE", "Selwyn College", "SEL"),
    ("SID", "Sidney Sussex College", "SID"),
    ("T", "Trinity College", "TRIN"),
    ("TH", "Trinity Hall", "TRINH"),
    ("W", "Wolfson College", "WOLFC"),
]
import logging
import re
from typing import Tuple, Optional, Sequence

import deepmerge
from pydantic import BaseModel, BaseSettings, validator, stricturl
from pydantic.env_settings import SettingsSourceCallable
import yaml

LOG = logging.getLogger(__name__)


def load_settings(paths: Sequence[str]) -> "Settings":
    """
    Load settings from a list of filesystem paths which point to YAML documents.

    """
    settings = {}
    for path in paths:
        LOG.info("Loading settings from %s", path)
        with open(path) as fobj:
            settings = deepmerge.always_merger.merge(settings, yaml.safe_load(fobj))
    return Settings.parse_obj(settings)


class Jackdaw(BaseModel):
    dsn: str
    username: str
    password: str

    @validator("dsn")
    def dsn_must_match_expected_format(cls, v):
        if not re.match("^[^:]*:[^/]*/.*$", v):
            raise ValueError("Jackdaw DSN must be of the form host:port/servicename")
        return v


OUTPUT_URL_SCHEMES = ["stdout", "file", "gs"]
OutputURL = stricturl(host_required=False, tld_required=False, allowed_schemes=OUTPUT_URL_SCHEMES)


class Output(BaseModel):
    url: OutputURL
    indent: Optional[int] = None
    timeout: int = 120

    @validator("url")
    def stdout_url_no_host_or_path(cls, v):
        if v.scheme == 'stdout' and (v.host is not None or v.path is not None):
            raise ValueError("stdout output URL must have no host or path")
        return v


class Settings(BaseSettings):
    jackdaw: Jackdaw
    output: Output

    class Config:
        env_prefix = 'instsync_'
        env_nested_delimiter = '__'

        @classmethod
        def customise_sources(
            cls,
            init_settings: SettingsSourceCallable,
            env_settings: SettingsSourceCallable,
            file_secret_settings: SettingsSourceCallable,
        ) -> Tuple[SettingsSourceCallable, ...]:
            return env_settings, file_secret_settings, init_settings
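`load_settings` merges several YAML documents in order, later files winning. A hand-rolled sketch of that recursive-merge behaviour (a simplified stand-in for `deepmerge.always_merger.merge`; unlike the real merger, lists here are replaced rather than appended):

```python
def merge(base, overlay):
    """Recursively merge `overlay` into `base`; overlay scalars win.

    Simplified stand-in for deepmerge.always_merger.merge: nested dicts are
    merged key by key, everything else is simply overwritten.
    """
    result = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

defaults = {"jackdaw": {"dsn": "host:1521/svc", "username": "app"}, "output": {"url": "stdout://"}}
overrides = {"jackdaw": {"username": "report"}}
print(merge(defaults, overrides))
```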
from logging import getLogger

import cx_Oracle

LOG = getLogger(__name__)


def fetch(connection: cx_Oracle.Connection, statement: str, **kwargs):
    """
    Utility method to run a statement using the given connection and return a generator of dicts.

    """
    cursor = connection.cursor()
    cursor.execute(statement, **kwargs)
    field_names = [d[0].lower() for d in cursor.description]
    for row in cursor:
        yield {
            field_name: row[index]
            for index, field_name in enumerate(field_names)
        }


def get_connection(dsn: str, username: str, password: str):
    """
    Returns an Oracle DB connection using the given settings.

    """
    LOG.info(f"Connecting to '{dsn}'")
    return cx_Oracle.connect(user=username, password=password, dsn=dsn)
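The row-to-dict shaping done by `fetch` can be shown without a database: a cx_Oracle cursor exposes column metadata as `description`, a list of tuples whose first element is the column name. A sketch with a stand-in cursor (names here are illustrative):

```python
class FakeCursor:
    """Stand-in for a cx_Oracle cursor: `description` holds (name, ...) tuples."""
    description = [("UNITREF",), ("INST",)]

    def __iter__(self):
        return iter([("U0001", "ABC"), ("U0002", "ABC")])

def rows_as_dicts(cursor):
    # Same shaping as fetch(): lower-cased field names, one dict per row.
    field_names = [d[0].lower() for d in cursor.description]
    for row in cursor:
        yield {name: row[i] for i, name in enumerate(field_names)}

print(list(rows_as_dicts(FakeCursor())))
```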
from typing import Tuple, List


class MockCursorOf:
    """
    A cursor which mocks an Oracle cursor allowing results to be emitted
    from our `mock_results` input.

    """
    def __init__(self, mock_results: List[List[Tuple]]):
        self.queries = list()
        self.mock_results = mock_results
        self.description = None
        self.next_result_iterator = None
        self.rowcount = 0

    def execute(self, query: str, **kwargs):
        """
        This is the function that gets called by application code to make a query.
        We then grab one of our results from mock_results and allow application
        code to start iterating on it.

        """
        self.next_result_iterator = self.mock_results.pop(0) if self.mock_results else []
        # the first result should be a tuple containing the field names which we set as
        # our 'description' - the description is in an odd format, field names have to
        # be the first element in a tuple
        number_of_results = len(self.next_result_iterator)
        if number_of_results != 0:
            # we set the rowcount to indicate the number of value rows that we have -
            # excluding the first row which is the header
            self.rowcount = number_of_results - 1
            self.description = [(field_name,) for field_name in self.next_result_iterator.pop(0)]
        else:
            self.rowcount = 0
            self.description = []
        self.queries.append((query, kwargs))

    def clear_current_execution(self):
        self.description = None
        self.next_result_iterator = None

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_result_iterator is None:
            raise ValueError('No iterator to return from mock cursor')
        if len(self.next_result_iterator) == 0:
            self.next_result_iterator = None
            raise StopIteration
        return self.next_result_iterator.pop(0)


class MockConnection:
    """
    A mock to imitate an Oracle DB connection, returning a mocked cursor which
    will emit results using the `mock_results` passed in on our constructor.

    The mock results should be in the format of:

        Result for query -> Row -> Tuple of each field

    with the first tuple row containing the name of the fields which will be
    placed on the cursor's description. For example, two independent query
    results, each with a single row containing two fields, would look like:

        ("firstname", "lastname")
        ("monty", "dawson")

        ("id", "number")
        (100, 1)

    """
    base_url = 'https://test.com/legacydb'

    def __init__(self, mock_results: List[List[Tuple]] = []):
        self.mock_cursor = MockCursorOf(mock_results)

    def cursor(self):
        return self.mock_cursor

    def get_queries_executed(self):
        return self.mock_cursor.queries
import os
import json
import tempfile
from unittest import TestCase, mock
import yaml
from .. import main
from .db_mock import MockConnection
# Mock data from CHRIS.
{'inst': 'ABC', 'unitref': 'U0001'},
{'inst': 'ABC', 'unitref': 'U0002'},
{'inst': 'DEF', 'unitref': 'U0003'},
{'inst': None, 'unitref': 'U0004'},
# Mock data from CAMSIS
{'inst': 'ABC', 'dept': 'A'},
{'inst': 'GHI', 'dept': 'G'},
{'inst': None, 'dept': 'X'},
# Mock static college data from CAMSIS
('CAI', 'Gonville and Caius College', 'CAIUS'),
('CC', 'Corpus Christi College', 'CORPUS'),
('CHR', 'Christ\'s College', 'CHRISTS'),
# Expected output from above
'institutions': [
'instid': 'ABC',
'identifiers': [
}, {
'instid': 'CAIUS',
'identifiers': [
}, {
'instid': 'CHRISTS',
'identifiers': [
}, {
'instid': 'CORPUS',
'identifiers': [
'instid': 'DEF',
'identifiers': [
}, {
'instid': 'GHI',
'identifiers': [
class InstitutionsTestCase(TestCase):
    def setUp(self):
        patcher = mock.patch('os.environ', new_callable=dict)
        self.environ_mock = patcher.start()

        # Known good settings.
        settings_dict = {
            'jackdaw': {'dsn': 'host:port/service', 'username': 'user', 'password': 'pass'},
            'output': {'url': 'stdout://'},
        }

        # Write settings to configuration file.
        tmpdir = tempfile.TemporaryDirectory()
        config_file = os.path.join(tmpdir.name, 'config.yaml')
        with open(config_file, 'w') as fobj: