Concepts and Interactions
A bit more focused on how the services fit together and why.
Types of Digital Objects
We started with more, but we’re kinda down to three
(ask Andrew)
- APO - administrative (or admin) policy object; every object is required to have one. It holds default rights metadata values for items created under it, and it determines who may manage the objects it governs.
- Collection - a collection of items
- Item - an item can have files; it can belong to any number of collections (zero, one, or many)
- Agreement - an agreement is a subclass of the Item type; each APO has one agreement (TODO: the Agreement represents?)
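A minimal Ruby sketch of how these object types relate to one another. The class names, fields, and druids below are hypothetical and chosen only for illustration; the real data model is defined in cocina-models and may differ.

```ruby
# Hypothetical classes for illustration only; not the cocina-models API.

# An APO carries default rights for items created under it.
AdminPolicy = Struct.new(:druid, :default_rights, :agreement, keyword_init: true)

# A collection groups items.
Collection = Struct.new(:druid, :apo, keyword_init: true)

# An item must have an APO, may have files, and may belong to zero or more collections.
class Item
  attr_reader :druid, :apo, :collections, :files

  def initialize(druid:, apo:, collections: [], files: [])
    raise ArgumentError, 'every object is required to have an APO' if apo.nil?

    @druid = druid
    @apo = apo
    @collections = collections
    @files = files
  end
end

# An agreement is a subclass of the Item type.
class Agreement < Item; end

apo = AdminPolicy.new(druid: 'druid:bb111cc2222', default_rights: 'world')
# Assumption: the agreement is governed by the same APO that points to it.
apo.agreement = Agreement.new(druid: 'druid:dd333ff4444', apo: apo)

collection = Collection.new(druid: 'druid:gg555hh6666', apo: apo)
item = Item.new(druid: 'druid:jj777kk8888', apo: apo,
                collections: [collection], files: ['page1.tif'])
puts item.apo.default_rights
puts apo.agreement.is_a?(Item)
```

Modeling Agreement as a subclass keeps it usable anywhere an Item is expected, which mirrors the description in the list above.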
Object Registration and Accessioning
How people get objects into SDR
Argo
- E.g. “Use Argo to create an object without any files” - https://github.com/sul-dlss/infrastructure-integration-test/blob/main/spec/features/create_object_no_files_spec.rb
- E.g. Create an object, version it, confirm it was preserved - https://github.com/sul-dlss/infrastructure-integration-test/blob/main/spec/features/create_preassembly_image_spec.rb
- touches on accessioning, preservation, and the event log
H2
Use H2 to create an object - https://github.com/sul-dlss/infrastructure-integration-test/blob/main/spec/features/create_object_h2_spec.rb
ETD
Create a new ETD - https://github.com/sul-dlss/infrastructure-integration-test/blob/main/spec/features/create_etd_spec.rb
Google Books
TODO: link to code or brief explanation of gbooks accessioning
was-registrar-app
TODO: link to code or brief explanation of was-registrar-app accessioning
cocina-models
The SDR data model, written so that it can be syntactically validated with dry-struct, dry-types, and OpenAPI
(ask JCoyne or JLitt)
Workflows and Robots
Robots are what we call our individual processing steps, which are grouped into “workflows” and coordinated by the workflow server (with Resque and resque-pool). They do things like update the SDR metadata store (currently Fedora datastreams), generate technical metadata, and hand off a copy of each object version to the preservation system.
Together, workflow server and the robots provide a system for managing the SDR accessioning pipeline: the robots ingest content into the SDR; workflow server coordinates by queuing tasks to a shared Redis instance.
The robots inherit their worker functionality from the lyber-core gem.
Each type of robot-suite has one or more VMs of its own, but all share the same Redis instance, and all are managed on their respective VMs by resque-pool.
This architecture was chosen before open source workflow automation tools like Airflow were available/mature.
Another thing that adds indirection/confusion: sometimes the robots will perform the heavy lifting of a task, but sometimes even they call out to other services. For example, in the preservationIngestWF, the validate-moab step in the preservation_robots actually makes a REST call to preservation_catalog to do the validation asynchronously (because that service will do all auditing and cloud replication after the robot finishes updating the content). And then preservation_catalog tells workflow service when it’s done and workflow service tells the preservation_robots to do the next step.
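The hand-off described above (a robot delegates to another service, and that service later tells the workflow server the step is done) can be sketched with a small in-memory model. Everything here is hypothetical: the class names, the array standing in for the Redis queue, and the step names other than validate-moab. It only illustrates the ordering of events, not how the workflow server, Resque, or preservation_catalog are actually implemented.

```ruby
# Hypothetical in-memory model of the hand-off; the real system uses the workflow
# server, Redis, and Resque rather than these classes.
class WorkflowServer
  attr_reader :queue, :completed

  def initialize(steps)
    @steps = steps    # ordered step names for one workflow
    @queue = []       # stands in for the shared Redis queue
    @completed = []
  end

  def start
    @queue << @steps.first
  end

  # Called by a robot, or by an external service, when a step has finished.
  def step_completed(step)
    @completed << step
    next_step = @steps[@steps.index(step) + 1]
    @queue << next_step if next_step
  end
end

# Stands in for a service such as preservation_catalog that does its work later.
class AsyncService
  def initialize(server)
    @server = server
    @pending = []
  end

  def request(step)
    @pending << step # returns immediately; nothing is reported as done yet
  end

  def pending?
    !@pending.empty?
  end

  def finish_pending
    @pending.shift(@pending.size).each { |step| @server.step_completed(step) }
  end
end

def run_workflow(steps, async_steps)
  server = WorkflowServer.new(steps)
  service = AsyncService.new(server)
  server.start
  loop do
    if (step = server.queue.shift)
      if async_steps.include?(step)
        service.request(step)        # robot delegates and returns
      else
        server.step_completed(step)  # robot did the work itself
      end
    elsif service.pending?
      service.finish_pending         # the service reports back to the workflow server
    else
      break
    end
  end
  server.completed
end

p run_workflow(%w[step-before validate-moab step-after], ['validate-moab'])
```

The point of the sketch is that the step after validate-moab is not queued until the external service reports completion, even though the robot itself returned right away.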
(ask anyone and hand waving will occur. Perhaps Peter? Andrew? Naomi? JCoyne? - we can re-figure out the layers of legacy code as a team)
Access Rights for digital content
(ask JCoyne, John)
- What’s an APO?
- How are they stored? - Currently, in the rightsMetadata datastream for an object (an XML datastream stored in Fedora for a druid). Will be migrated to the Cocina data model and its JSON serialization format.
- How do they work to restrict access?
- Defaults are applied based on the item’s parent APO
- An item can have multiple files, and those files may each have different rights specified from each other as well as from the parent item.
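As a rough illustration of the two points above, the sketch below resolves the rights for each file in an item. The resolution order (file, then item, then APO default), the hash layout, and the rights values are assumptions made for the example; in practice the APO defaults may be copied onto the item when it is created rather than looked up at read time.

```ruby
# Hypothetical resolution order: file rights, then item rights, then the APO default.
# The rights values used here ('world', 'stanford') are illustrative.
def effective_file_rights(apo_default, item)
  item_rights = item[:rights] || apo_default
  item[:files].map { |file| [file[:name], file[:rights] || item_rights] }.to_h
end

item = {
  rights: nil, # nothing set on the item, so the APO default applies
  files: [
    { name: 'thesis.pdf' },
    { name: 'dataset.zip', rights: 'stanford' }
  ]
}
p effective_file_rights('world', item)
```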
- See also
Embargoes
- Part of the cocina-models data model. DSA (dor-services-app) expires them and emits a RabbitMQ message.
- See integration test for ETD
- See integration test for H2
- Expiring - see … cron job for dor-services-app (?)
SDR APIs
- dor-services-app https://sul-dlss.github.io/dor-services-app/ (https://github.com/sul-dlss/dor-services-app)
- dor-services-client
- cocina-models: https://sul-dlss.github.io/cocina-models/ (https://github.com/sul-dlss/cocina-models)
- sdr-api: https://sul-dlss.github.io/sdr-api/ (https://github.com/sul-dlss/sdr-api)
- sdr-client
- preservation-catalog api: https://sul-dlss.github.io/preservation_catalog/ (https://github.com/sul-dlss/preservation_catalog)
- preservation-client
Google-books
A github repository, a deployed app and so much more
(ask JLitt, JCoyne or Mike)
- The goal - To download from Google all scans they have of the books they borrowed from Stanford University Libraries.
- Will this app ever go away? - Possibly? If we successfully download all scans and accompanying OCRed text, and if Google stops providing better (re-OCRed) text, the app may have then fulfilled its purpose.
- How much load does it put on our infrastructure? - Sometimes enough to contribute to Fedora performance issues that cascade and cause issues with the rest of SDR, but sometimes that happens even in the absence of Google books activity. We also suspect that due to storage I/O contention among the virtual machines, it may have contributed to degraded SearchWorks performance for a time (but we think a combination of putting SearchWorks’ Solr on faster storage and turning down GBooks concurrency may have solved this?).
ETD app
Electronic Theses and Dissertations
(ask Naomi or Mike)
- Used by students submitting ETDs, by the Registrar’s office, and caught up in the flow of cataloging ETDs as well.
- Accession an ETD object, from faking the Registrar’s original post all the way through SDR (stage) - follow steps in infrastructure-integration-test
Pre-assembly
When you have complex objects or objects with certain characteristics, this app will organize the files appropriately for getting them into the SDR
(ask Peter or Naomi)
- Why do we need it?
- Who uses it?
- Run a discovery report on stage and look at the result.
Preservation
Used to keep digital content safe, both on prem and also in the cloud
(ask John)
- Objects are preserved in the Moab format, a forward-delta versioning scheme inspired by BagIt and software version control systems such as Git.
- Moab manipulation and comparison code mostly lives in the moab-versioning gem, but also some in preservation_robots and preservation_catalog.
- How things go in
- a preservation_robots worker takes a BagIt bag containing data and metadata for a new object version, and uses that to create a new Moab (if v1 of the object) or add a new version to an existing Moab (if v2+ of the object).
- a preservation_robots worker will also ask preservation_catalog to queue an integrity check of the new Moab version.
- Moabs are stored on “storage roots”. There are currently 3 in production. The storage is currently implemented on a Ceph cluster exposed to the VM via NFS mounts.
- new content goes to the latest storage root (as configured in the storage root list given to the moab-versioning gem)
- updates to existing content go to the Moab directory on its current storage root (which may no longer be the last storage root)
- Moabs are stored in “druid tree” paths, which are similar to pair tree paths (e.g. druid bc123df4567 would be bc/123/df/4567/bc123df4567)
- What are the APIs to interact with preservation
- preservation-client is a Ruby gem for interacting with pres cat’s REST API
- Once preservation_catalog has been made aware that a new object has been preserved, or that a new version has been preserved for an existing object, it will replicate the new object version to the cloud, assuming the object passes structural and checksum validation.
- It creates a zip file (uncompressed) of the new Moab version, and sends that to each of the S3 endpoints (there are currently 3, 2 AWS and 1 IBM).
- How is redis used in preservation
- there is a Redis instance that coordinates all of the regular accessioning “robots” (including preservation_robots)
- there is a Redis instance that coordinates all of the preservation_catalog workers (for replicating copies to the cloud and for auditing all preserved content both on prem and in the cloud)
- The preservation catalog codebase contains a database README with an ER diagram for the schema as well as a number of useful example queries, and a diagram and README for its replication worker pipeline.
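The druid tree layout and the storage root rules from the list above can be sketched in a few lines of Ruby. This is an illustration rather than the moab-versioning implementation: the method names, the loose druid pattern, the handling of a `druid:` prefix, and the storage root paths are all assumptions made for the example.

```ruby
# Loose pattern for a bare druid such as "bc123df4567" (assumption: two letters,
# three digits, two letters, four digits; the real validation may be stricter).
DRUID_PATTERN = /\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/.freeze

# Split a druid into its druid tree path, ending with the full bare druid.
def druid_tree_path(druid)
  bare = druid.delete_prefix('druid:')
  match = DRUID_PATTERN.match(bare)
  raise ArgumentError, "not a druid: #{druid}" unless match

  File.join(*match.captures, bare)
end

# New Moabs go to the last configured storage root; updates stay on the root
# where the Moab already lives, even if that is no longer the last one.
def storage_root_for(druid, storage_roots, existing_locations)
  existing_locations.fetch(druid) { storage_roots.last }
end

def moab_path(druid, storage_roots, existing_locations)
  File.join(storage_root_for(druid, storage_roots, existing_locations),
            druid_tree_path(druid))
end

roots = ['/hypothetical/root01', '/hypothetical/root02', '/hypothetical/root03']
existing = { 'bc123df4567' => '/hypothetical/root01' }

puts druid_tree_path('bc123df4567') # => bc/123/df/4567/bc123df4567
puts moab_path('bc123df4567', roots, existing)
puts moab_path('gh456jk7890', roots, existing)
```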
Technical Metadata app
Service to extract and record technical metadata for files deposited into SDR
(ask JLitt)
- technical metadata: for each file, its file type, characteristics of image, PDF, A/V files and so on
SDR Tags database
Accessioneers and Argo users and some of our apps “tag” digital objects in the SDR
(ask Mike)
- A table in dor-services-app’s PostgreSQL database. Used to live in Fedora. Useful for human administration of repository content in Argo.
Modsulator
A gem and an app that can turn spreadsheets into MODS in a single bound … or something
Bulk jobs/actions in Argo
When making changes to objects one at a time is the wrong approach
Web archiving
yes, Virginia, DLSS does crawl or download to get WARCs and ingest them into SDR and serve them out
(ask JLitt or Naomi)
- was-registrar-app
- was_robot_suite robots to carry out special processing required for WAS seed and crawl objects
- SWAP
- wasapi-downloader
- …
Solr indexing of SDR
How we currently index SDR content for searching; the index is used by Argo and dor-services-app. Fedora does not have robust search/query built in. We are occasionally bitten by this non-transactional separation between metadata persistence and indexing, e.g. the occasional failure to prevent re-use of a sourceID when registering items.
(ask JCoyne)
- dor_indexing_app - subscribes to messages about object changes, and can also be triggered synchronously via HTTP
Events (e.g. in Argo object view)
A way to capture an object’s changes over time
(ask JCoyne)
- A table in dor-services-app’s PostgreSQL database. Other applications make REST calls to DSA (token authenticated) to log notable activity that might help with future debugging or provenance determination.
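To make the idea of a token-authenticated REST call concrete, here is a sketch that builds (but does not send) such a request using Ruby's standard library. The host, URL path, bearer-token header, and payload fields are all assumptions made for illustration; consult the dor-services-app API documentation, or use dor-services-client, for the real interface.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build a request that would log an event for an object. Nothing is sent.
# Hypothetical: the path layout, the Bearer scheme, and the payload keys.
def build_event_request(base_url:, druid:, token:, event_type:, data:)
  uri = URI("#{base_url}/v1/objects/#{druid}/events")
  request = Net::HTTP::Post.new(uri)
  request['Authorization'] = "Bearer #{token}"
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(event_type: event_type, data: data)
  request
end

request = build_event_request(
  base_url: 'https://dsa.example.org', # placeholder host
  druid: 'druid:bc123df4567',
  token: 'not-a-real-token',
  event_type: 'example_event',
  data: { note: 'illustrative payload' }
)
puts request.path
puts request.body
```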
Goobi
A workflow tool used by DLSS digitization staff that then hooks into common-accessioning
(ask Peter)
- what is it, and how does it relate to SDR