Google Drive Management Tool

This repository contains a custom management tool for scanning Google user drives for shared files, logging information about those files, and changing sharing permissions for files owned by users who will be deleted.

Configuration is performed via a configuration file. Take a look at the example configuration file for more information.

Usage

The tool can be invoked from the command line:

$ gdrivemanager scan

By default this logs what would be done without making any changes. To apply the management operation, the --write flag is required:

$ gdrivemanager scan --write

See the output of gdrivemanager --help for more information on valid command-line flags and commands.

Unless overridden on the command line, the tool searches for its configuration file in the following places in the following order:

  • A configuration.yaml file in the current directory.
  • ~/.gdrivemanager/configuration.yaml.
  • /etc/gdrivemanager/configuration.yaml.

The first located file is used.
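
The search order above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code; the function names default_config_paths and find_configuration are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def default_config_paths() -> List[Path]:
    """Candidate locations, in the search order listed above."""
    return [
        Path.cwd() / "configuration.yaml",
        Path.home() / ".gdrivemanager" / "configuration.yaml",
        Path("/etc/gdrivemanager/configuration.yaml"),
    ]


def find_configuration(candidates: Optional[Iterable[Path]] = None) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None if none do."""
    if candidates is None:
        candidates = default_config_paths()
    for path in candidates:
        if path.is_file():
            return path
    return None
```

Passing an explicit list of candidates stands in for the command-line override and keeps the lookup easy to test.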

Installation

The command-line tool can be installed directly from the git repository:

$ pip3 install git+https://gitlab.developers.cam.ac.uk/uis/gsuite/synctool.git

For developers, the script can be installed from a cloned repo using pip. It is recommended to do this in a virtual environment:

$ cd /path/to/this/repo
$ python -m venv .venv
$ . ./.venv/bin/activate
$ pip3 install -e .

Operations

UCam fields

Google users in the domain being scanned should have a set of custom fields defined in a schema called Ucam. These fields are:

  • mydrive-shared-action
  • mydrive-shared-result
  • mydrive-shared-filecount
  • mydrive-shared-doc

The mydrive-shared-action field is a string field set by the directory synchronisation tool and should contain one of the values:

  • scan-files
  • scan-folders
  • recover
  • scan (deprecated)

The mydrive-shared-result field is a string field and is set by gdrivemanager; it will contain one of the values:

  • permissions-none
  • permissions-removed
  • permissions-recovered

The mydrive-shared-filecount and mydrive-shared-doc fields are debug fields containing the number of files with shared permissions removed, and the name of the permissions recovery document written as part of the scan.

Scan

The scan operation retrieves a list of all users marked for scan and removes the shared file permissions from any of their shared files. The scan operates in two passes, the first removing all shared folder permissions and the second removing all other shared permissions.

Users are marked for scan if their UCam.mydrive-shared-action field is set to scan-folders, scan-files, or scan; this last is deprecated and treated the same as scan-folders.

The first-pass scan happens when mydrive-shared-action is set to scan-folders. A list of all the user's shared folders is retrieved, and the sharing permissions applied to them are removed. The list of permissions removed is preserved and stored in a YAML document in the Google shared drive configured by google.shared_storage_drive. Finally, the user's UCam fields are updated, setting mydrive-shared-action to scan-files and the filecount and document fields to the number of shared folders processed and the name of the written YAML document.

The second-pass scan happens when mydrive-shared-action is set to scan-files. A list of all the user's shared files is retrieved, and the sharing permissions applied to them are removed. The list of permissions removed is preserved and stored in a YAML document in the Google shared drive configured by google.shared_storage_drive. If a permissions document already exists from the first-pass scan, that document is updated rather than a new one being created.

Once the second-pass scan is complete the user's UCam fields are updated, removing the mydrive-shared-action. The mydrive-shared-result is set to permissions-removed if any permissions were removed in either the first- or second-pass scan, or to permissions-none if no permissions were removed. The filecount and document fields are set to the number of shared folders and files processed over both scans and the name of the written YAML document.
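
The field transitions across the two passes can be summarised as a small state function. The sketch below is not taken from the tool; the function name fields_after_pass is hypothetical, and it assumes the processed count is the number of items that had permissions removed in the pass.

```python
def fields_after_pass(fields: dict, processed: int, doc_name: str) -> dict:
    """Return the UCam fields as they would look after one scan pass.

    `processed` is assumed to be the number of folders (first pass) or
    files (second pass) that had sharing permissions removed.
    """
    action = fields.get("mydrive-shared-action")
    if action == "scan":  # deprecated value, treated as scan-folders
        action = "scan-folders"

    updated = dict(fields)
    if action == "scan-folders":
        # First pass done: move on to the second pass.
        updated["mydrive-shared-action"] = "scan-files"
        updated["mydrive-shared-filecount"] = processed
    elif action == "scan-files":
        # Second pass done: clear the action and record the overall result.
        total = int(fields.get("mydrive-shared-filecount") or 0) + processed
        updated.pop("mydrive-shared-action", None)
        updated["mydrive-shared-result"] = (
            "permissions-removed" if total > 0 else "permissions-none"
        )
        updated["mydrive-shared-filecount"] = total
    else:
        raise ValueError(f"user is not marked for scan: {action!r}")

    updated["mydrive-shared-doc"] = doc_name
    return updated
```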

The scan operation has three limits. max_scanned_users limits how many users will be scanned in each pass. A new user won't be scanned if the total elapsed time of the current pass has exceeded max_total_scan_duration (in minutes).

The max_user_scan_duration is the number of minutes the task of listing a user's personal drive files is allowed to take before being aborted. If a user's scan is aborted this way, their mydrive-shared-action is prefixed with "manual-" to indicate they need manual attention and to prevent them from blocking further progress with other users.
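
As a rough illustration of how the per-pass limits interact, the sketch below applies max_scanned_users and max_total_scan_duration before each user is started. The function names scan_pass and mark_for_manual_attention, and the injectable clock, are assumptions made for this example and are not part of the tool. Enforcing max_user_scan_duration inside a single user's scan is not shown.

```python
import time
from typing import Callable, Iterable, List


def scan_pass(
    users: Iterable[str],
    scan_user: Callable[[str], None],
    max_scanned_users: int,
    max_total_scan_duration: float,  # minutes
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Scan users until either per-pass limit is reached."""
    started = clock()
    scanned: List[str] = []
    for user in users:
        if len(scanned) >= max_scanned_users:
            break
        # Only checked before starting a new user; a scan in progress
        # is not interrupted by this limit.
        if clock() - started > max_total_scan_duration * 60:
            break
        scan_user(user)
        scanned.append(user)
    return scanned


def mark_for_manual_attention(action: str) -> str:
    """Prefix an action with "manual-" so the user is skipped in later runs."""
    return action if action.startswith("manual-") else "manual-" + action
```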

Recover

The recover operation uses the permissions YAML document written during the scan operation to re-instate any shared permissions that were removed.

Users are marked for recovery if their UCam.mydrive-shared-action field is set to recover.

The recovery operation retrieves the permissions YAML document from the configured shared drive, and applies all the permissions found in this document to the user's files.

Shared Drive Usage

The "shared-drive-usage" operation maintains a cache file of the list of shared drives with their permissions, usage and number of files. Each drive, and the list as a whole, has a last_updated timestamp.

This cache file is stored in the shared drive configured by google.shared_storage_drive with a name configured by google.shared_drive_list_file (defaults to "shared_drive_list.yaml").

Each run of this operation first checks whether the list needs updating, i.e. whether the list's last_updated is older than google.shared_drive_list_cache_days days (default 7). If so, all the shared drives are relisted, their current permissions are added, and the cached usage, files, last_updated, last_scanned and last_scan_success values are merged in.
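
A minimal sketch of the staleness check and merge step follows. The layout of the cache file is not documented here, so the example assumes each drive is a dictionary with an "id" key and timezone-naive datetime timestamps; list_needs_update and merge_cached are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Values carried over from the old cache into a fresh listing.
CACHED_KEYS = ("usage", "files", "last_updated", "last_scanned", "last_scan_success")


def list_needs_update(last_updated: Optional[datetime], now: datetime,
                      cache_days: int = 7) -> bool:
    """True if the list has never been built or is older than cache_days."""
    return last_updated is None or last_updated < now - timedelta(days=cache_days)


def merge_cached(fresh: List[dict], cached: List[dict]) -> List[dict]:
    """Keep freshly listed permissions, copy cached usage data per drive."""
    by_id = {drive["id"]: drive for drive in cached}
    merged = []
    for drive in fresh:
        entry = dict(drive)
        old = by_id.get(drive["id"], {})
        for key in CACHED_KEYS:
            if key in old:
                entry[key] = old[key]
        merged.append(entry)
    return merged
```

In this sketch, drives that no longer appear in the fresh listing simply drop out of the merged result.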

A list of shared drives to be scanned is obtained and sorted. These are drives whose last_scanned (last_updated as a fallback) is older than google.shared_drive_usage_cache_days days (default 7). Drives without a last_scanned or last_updated (new to list) are first to be scanned with the rest sorted from oldest to newest.
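
The selection and ordering rule above could look roughly like this. It is a sketch under the same assumptions about the cache layout (dictionaries with optional, timezone-naive last_scanned and last_updated datetimes); drives_to_scan is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def _reference_time(drive: dict) -> Optional[datetime]:
    """last_scanned, falling back to last_updated; None for new drives."""
    return drive.get("last_scanned") or drive.get("last_updated")


def drives_to_scan(drives: List[dict], now: datetime, cache_days: int = 7) -> List[dict]:
    """Drives due for a usage scan: new drives first, then oldest to newest."""
    cutoff = now - timedelta(days=cache_days)
    due = [
        d for d in drives
        if _reference_time(d) is None or _reference_time(d) < cutoff
    ]
    # False sorts before True, so drives with no timestamp come first.
    return sorted(
        due,
        key=lambda d: (_reference_time(d) is not None,
                       _reference_time(d) or datetime.min),
    )
```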

A successful scan will result in last_updated being updated. last_scanned and last_scan_success are always updated when a drive scan is completed.

While the run hasn't taken limits.max_shared_drive_usage_duration minutes, each drive's usage and file count is determined. Unfortunately, there is currently no way to immediately get these without scanning through all files and summing their quotaBytesUsed values.
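
The summing step amounts to the following, given the file metadata already retrieved from the Drive API with the quotaBytesUsed field requested. The Drive API returns quotaBytesUsed as a string, hence the int() conversion; summarise_usage is a hypothetical name, and the paging through the file listing is not shown.

```python
from typing import Iterable, Tuple


def summarise_usage(files: Iterable[dict]) -> Tuple[int, int]:
    """Return (total bytes used, number of files) for one shared drive."""
    usage = 0
    count = 0
    for item in files:
        # Items without a quotaBytesUsed value are counted as zero bytes.
        usage += int(item.get("quotaBytesUsed", 0))
        count += 1
    return usage, count
```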

The shared drive cache is rewritten after each shared drive is updated as there is a (high) potential for the API calls to drop out when counting a large number of files in a shared drive.

Reporting

The "shared-drive-report" operation will use the shared drive cache file to compile the report data. It will also search for all users in Lookup to add institutions and put users in appropriate fields.

This report will be written to the file specified by report.all_shared_drive_filename (defaults to shared-drive-usage-{timestamp}.csv).

If the configuration report.output_location is given (as a Google Drive folder id) then the report output file will be saved to this location, otherwise it will be saved locally.

User, Inst or Group Reporting

Specifying who to report on

The "report" operation can be given a single user (--user=CRSID), an institution (--instid=INSTID) or group (--groupid=GROUPID). For the latter two, the Lookup institution/group active and cancelled members will be obtained. For groupid, the group's short name (e.g. uis-devops-hamilton) can be used. For institutions, the --children option can be added to include members of all child institutions too.

If the configuration report.output_location is given (as a Google Drive folder id) then the report output files will be saved to this location.

Instead of specifying an institution, group or user directly, the --request option can be used to produce a report based on the existence of a report request file (content irrelevant) in the report.output_location (locally if not specified). The format of this file's name should be report-request-{type}-{id} where the {type} is as follows:

  • "i", "inst", "institution" with {id} being the instid
  • "g", "group" with {id} being either groupid or group's short name
  • "u", "user" with {id} being the CRSid

This report request file will be deleted after the report output files have been saved.
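
To illustrate the naming convention, here is a small parser for report request file names. It is a sketch rather than the tool's own code: parse_report_request is a hypothetical name, and it assumes the file name has no extension and that {type} is written in lower case.

```python
import re
from typing import Optional, Tuple

# Accepted {type} spellings mapped to a canonical kind.
_TYPES = {
    "i": "institution", "inst": "institution", "institution": "institution",
    "g": "group", "group": "group",
    "u": "user", "user": "user",
}
_PATTERN = re.compile(r"^report-request-([a-z]+)-(.+)$")


def parse_report_request(filename: str) -> Optional[Tuple[str, str]]:
    """Return (kind, id) for a valid request file name, else None."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    kind = _TYPES.get(match.group(1))
    if kind is None:
        return None
    return kind, match.group(2)
```

For example, a file named report-request-g-uis-devops-hamilton would be read as a group request for uis-devops-hamilton, since only the first hyphen after the type separates it from the id.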

Report process

First, the MyDrive usage of every user in the institution or group (or of the single user) is gathered and exported to a CSV file with a name configured by report.mydrive_filename (defaults to {id}-mydrive-{timestamp}.csv with id being the CRSid, InstID or GroupID).

Next, all shared drives are checked using the shared drive cache file (see above). Any drive that has a manager or content manager matching at least one of the user(s) is included in another CSV file with a name configured by report.shared_drive_filename (defaults to {id}-shared-drive{timestamp}.csv).
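
The matching step might be sketched as below. In the Drive API, the "Manager" and "Content manager" roles shown in the UI correspond to the permission roles organizer and fileOrganizer. Everything else is an assumption for illustration: that each cached drive holds a "permissions" list of entries with role and emailAddress keys, that users are identified by email address, and the name drives_managed_by.

```python
from typing import Iterable, List

# Drive API role names for "Manager" and "Content manager".
MANAGER_ROLES = {"organizer", "fileOrganizer"}


def drives_managed_by(drives: Iterable[dict], user_emails: Iterable[str]) -> List[dict]:
    """Drives where at least one of the users is a manager or content manager."""
    wanted = {email.lower() for email in user_emails}
    matched = []
    for drive in drives:
        for permission in drive.get("permissions", []):
            email = permission.get("emailAddress", "").lower()
            if permission.get("role") in MANAGER_ROLES and email in wanted:
                matched.append(drive)
                break  # one matching permission is enough
    return matched
```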

Note, if a shared drive usage has yet to be counted then it will have blank values in the CSV and the report may need regenerating later.

Testing against gdev.apps.cam.ac.uk (UCam test Google Workspace)

Download the gdrive-management-bot-test GCP service account credentials from 1password and save as credentials.json.

Copy the configuration.yaml.example file to configuration.yaml.

The following configuration has already been performed for this service account.

Preparing a service account (Admin Roles)

Google have updated the API to allow service accounts direct access to the API without needing domain-wide delegation.

Therefore, in order to read and write users' custom schema, the service account only needs to be added to the "User Management" admin role. It does not need to impersonate an admin user (nor does it need the admin.directory.user scope that impersonation would require).

  1. Create a service account in the Google Cloud Platform Console for this script.
  2. Copy the service account's full email address.
  3. In the Google Workspace admin panel, go to "Account" > "Admin Roles" and open the "User Management" role.
  4. Add the service account to the role using the "Assign service accounts" option when viewing the role's admins.

Required API scopes

To impersonate the users whose files and permissions it needs to read and write, the service account will need the following scopes:

  • https://www.googleapis.com/auth/drive.metadata.readonly
  • https://www.googleapis.com/auth/drive

These need to be added to the Domain-Wide-Delegation configuration for the domain:

  1. View the service account in the Google Cloud Platform Console.
  2. Copy the service account's unique id (this is the same as the client_id).
  3. In the Google Workspace admin panel, go to "Security Settings" > "Access and data control" > "API Controls".
  4. Click "Manage Domain-Wide Delegation" then "Add new".
  5. Paste in the service account Client ID and add the scopes above as a comma-separated list.