acdh-oeaw / arche-ingest
A set of sample ARCHE ingestion scripts
This package is auto-updated.
Last update: 2025-10-08 10:02:26 UTC
A collection of ARCHE ingestion script templates
The REST API provided by ARCHE is quite low-level from the point of view of real-world data ingestion. To make ingestions simpler, the arche-lib-ingest library has been developed. While it provides a convenient high-level data ingestion API, it is still only a library, which means you have to write your own ingestion script.
This repository aims to close this gap: it provides a set of data ingestion scripts (built on top of arche-lib-ingest) which can be used by people with almost no programming skills.
Scripts provided
There are two script variants provided:
- Console scripts variant, where parameters are passed through the command line.
  The benefit of this variant is ease of use, especially in CI/CD workflows.
  - bin/arche-import-metadata imports metadata from an RDF file
  - bin/arche-import-binary (re)ingests a single resource's binary content (to be used when file name and/or location changed)
  - bin/arche-delete-resource removes a given repository resource (allows recursion, etc.)
  - bin/arche-delete-triples removes metadata triples specified in the ttl file (but doesn't remove repository resources)
  - bin/arche-update-redmine updates a Redmine issue describing the data curation/ingestion process (see a dedicated section at the bottom of the README)
- Template variant, where you adjust execution parameters and/or the way the script works by editing its content.
  The benefit of this variant is that the adjusted script can serve as documentation of the ingestion process and/or be tailored to your particular needs.
  - add_metadata_sample.php adds metadata triples specified in the ttl file, preserving all existing metadata of repository resources
  - delete_metadata_sample.php removes metadata triples specified in the ttl file (but doesn't remove repository resources)
  - delete_resource_sample.php removes a given repository resource (allows recursion, etc.)
  - import_binary_sample.php imports binary data from the disk
  - import_metadata_sample.php imports metadata from an RDF file
  - reimport_single_binary.php reingests a single resource's binary content (to be used when file name and/or location changed)
Installation & Usage
Runtime environment
You can also use the acdhch/arche-ingest Docker image
(the {pathToDirectoryWithFilesToIngest} will be available at the /data location inside the Docker container):
docker run \
  --rm \
  -ti \
  --name arche-ingest \
  -v {pathToDirectoryWithFilesToIngest}:/data \
  acdhch/arche-ingest
Console script variant
- Install with:
composer require acdh-oeaw/arche-ingest 
- Update regularly with:
composer update --no-dev
- Run with:
vendor/bin/{scriptOfYourChoice} {parametersGoHere}
e.g.
vendor/bin/arche-import-metadata --concurrency 4 myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword
  - To get the list of available parameters run
    vendor/bin/{scriptOfYourChoice} --help
    e.g.
    vendor/bin/arche-import-metadata --help
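The argument order used in the example above (RDF file, API URL, login, password) can be wrapped in a small shell function so that credentials come from the environment instead of being typed on every call. This is only a sketch: the variable names ARCHE_URL, ARCHE_LOGIN, ARCHE_PASSWORD, ARCHE_CONCURRENCY and ARCHE_IMPORT_BIN are assumptions made for illustration and are not defined by arche-ingest.

```shell
#!/bin/sh
# Sketch only: ARCHE_URL, ARCHE_LOGIN, ARCHE_PASSWORD, ARCHE_CONCURRENCY and
# ARCHE_IMPORT_BIN are hypothetical variable names, not part of arche-ingest.

import_metadata() {
    # $1 - path to the RDF file to import
    rdf_file=$1

    # Refuse to run when any of the required settings is missing.
    for name in ARCHE_URL ARCHE_LOGIN ARCHE_PASSWORD; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing environment variable: $name" >&2
            return 1
        fi
    done

    # Same argument order as in the README example:
    # {rdfFile} {apiUrl} {login} {password}
    "${ARCHE_IMPORT_BIN:-vendor/bin/arche-import-metadata}" \
        --concurrency "${ARCHE_CONCURRENCY:-4}" \
        "$rdf_file" "$ARCHE_URL" "$ARCHE_LOGIN" "$ARCHE_PASSWORD"
}
```

After exporting the three variables, a call such as import_metadata myRdf.ttl would run the import with the default concurrency of 4.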
Running inside GitHub Actions
Do not store your ARCHE credentials in the workflow configuration file. Use repository secrets instead (see example below).
A fragment of your workflow's YAML config may look like this:
- name: ingestion dependencies
  run: |
    composer require acdh-oeaw/arche-ingest
- name: ingest arche
  run: |
    vendor/bin/arche-import-metadata myRdfFile.ttl https://arche-curation.acdh-dev.oeaw.ac.at/api ${{secrets.ARCHE_LOGIN}} ${{secrets.ARCHE_PASSWORD}}
    vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} https://redmine.acdh.oeaw.ac.at 1234 'Upload AIP to Curation Instance (Minerva)'
Running on ACDH Cluster
First, get the arche-ingestion workload console as described here
Then:
- Run screen -S mySessionName
- Go to your ingestion directory
- Run scripts using {scriptName}, e.g.
  arche-import-metadata myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword
- If the script will take long to run, you may safely quit the console with CTRL+a+d followed by exit.
  - To get back to the script, log in again to repo-ingestion@hephaistos and run screen -r mySessionName
Template variant
- Clone this repository.
- Run
composer update --no-dev 
- Adjust the script of your choice.
  - Available parameters are provided at the beginning of the script.
  - Don't adjust anything below the
    // NO CHANGES NEEDED BELOW THIS LINE
    line unless you consider yourself a programmer and would like to change the way the script works.
- Run the script with
  php -f {scriptOfYourChoice}
  - You can consider reading input from a file and/or saving output to a log file, e.g. with:
    php -f import_metadata_sample.php < inputData 2>&1 | tee logFile
    (see the section below for hints on the input file format)
Long runs
If you are performing time-consuming operations, e.g. a large data ingestion, consider running the scripts in a way that keeps them running when you turn your computer off.
You can use nohup or screen for that, e.g.:
- nohup - run with:
  # console script variant
  nohup vendor/bin/arche-import-metadata --concurrency 4 myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword > logFile 2>&1 &
  # template variant
  nohup php -f import_metadata_sample.php < input > logFile 2>&1 &
  - If you want to run template script variants that way, you have to prepare the input data file.
    It should look as follows:
    {arche instance API URL}
    yes
    {login}
    {password}
    e.g.
    https://arche-dev.acdh-dev.oeaw.ac.at
    yes
    myLogin
    myPassword
- screen
  - Start a screen session with screen -S mySessionName
  - Then run your commands as usual
  - Hit CTRL+a followed by d to leave the screen session.
  - You can get back to the screen session with screen -r mySessionName
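For the nohup template variant, the input data file can be generated rather than typed by hand. The sketch below assumes that the template scripts read the four answers (API URL, yes, login, password) one per line from standard input; the variable names ARCHE_URL, ARCHE_LOGIN and ARCHE_PASSWORD and the function name write_input_file are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the template scripts read one answer per line.
# ARCHE_URL, ARCHE_LOGIN and ARCHE_PASSWORD are hypothetical variable names.

write_input_file() {
    # $1 - path of the input file to create
    (
        # The file contains a password, so make it readable by the owner only.
        umask 077
        printf '%s\n' "$ARCHE_URL" 'yes' "$ARCHE_LOGIN" "$ARCHE_PASSWORD" > "$1"
    )
}
```

A possible use would be write_input_file input followed by the nohup php -f import_metadata_sample.php < input > logFile 2>&1 & call shown above; removing the file once the run has finished keeps the password off the disk.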
Reporting errors
Create a subtask of the Redmine issue #17641.
- Provide the exact location of the ingestion script (including the script file itself) and any other information which may be required to replicate the problem.
- Assign Mateusz and Norbert as watchers.
Using arche-update-redmine in a GitHub workflow
The basic idea is to execute data processing steps in the following way:
- note down the step name so we can read it in case of a failure
- perform the step
- call arche-update-redmine
and have a separate on-failure job step which makes an arche-update-redmine call noting the failure.
Remarks:
- As a good practice we should include the GitHub job URL in the Redmine issue note. For that we set up a dedicated environment variable.
- It goes without saying Redmine access credentials are stored as a repository secret.
- The way you store the main Redmine issue ID doesn't matter, as it's not secret. Do it any way you want (here we just hardcode it in the workflow using an environment variable).
name: sample
on:
  push: ~
jobs:
  dockerhub:
    runs-on: ubuntu-latest
    env:
      REDMINE_ID: 21085
    steps:
    - uses: actions/checkout@v4
    - name: init
      run: |
        composer require acdh-oeaw/arche-ingest
        echo "RUN_URL=$GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID" >> $GITHUB_ENV
    - name: virus scan
      run: |
        echo 'STEP=Virus Scan' >> $GITHUB_ENV
        ...perform the virus scan...
        vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" $REDMINE_ID 'Virus scan'
    - name: repo-filechecker
      run: |
        echo 'STEP=Run repo-file-checker' >> $GITHUB_ENV
        ...run the repo-filechecker...
        vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" $REDMINE_ID 'Run repo-file-checker'
    - name: check3
      run: |
        echo 'STEP=Upload AIP to Curation Instance (Minerva)' >> $GITHUB_ENV
        ...perform the ingestion...
        vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" $REDMINE_ID 'Upload AIP to Curation Instance (Minerva)'
    - name: on failure
      if: ${{ failure() }}
      run: |
        vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" --statusCode 1 $REDMINE_ID "$STEP"
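Outside of GitHub Actions, the same "perform the step, then report it" pattern could be sketched as a plain shell function. It reuses only the arche-update-redmine options that appear in the workflow above (--token, --append, --statusCode); the function name run_step and the UPDATE_REDMINE override are hypothetical, and REDMINE_TOKEN, REDMINE_ID and RUN_URL are assumed to be set by the caller.

```shell
#!/bin/sh
# Sketch only: run_step and UPDATE_REDMINE are hypothetical names.
# REDMINE_TOKEN, REDMINE_ID and RUN_URL are expected to be set by the caller.

run_step() {
    # $1 - step name as it should appear in the Redmine issue
    # remaining arguments - the command performing the step
    step=$1
    shift
    redmine=${UPDATE_REDMINE:-vendor/bin/arche-update-redmine}

    if "$@"; then
        # The step succeeded: report it.
        "$redmine" --token "$REDMINE_TOKEN" --append "$RUN_URL" \
            "$REDMINE_ID" "$step"
    else
        # The step failed: report it with a non-zero status code and
        # propagate the failure to the caller.
        "$redmine" --token "$REDMINE_TOKEN" --append "$RUN_URL" \
            --statusCode 1 "$REDMINE_ID" "$step"
        return 1
    fi
}
```

A call could then look like run_step 'Virus scan' ./scan.sh, where ./scan.sh stands for whatever command performs the step. Passing the step name as an argument replaces the STEP environment variable that the workflow needs for its separate on-failure job step.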