Controller boot CLI Script execution

In core

Table of Contents

Overview
Issue Metadata
Requirements
Test Plan
Community Documentation
Release Note Content

Overview

During testing of the modernised WildFly image done as upstream for EAP7-891 for WildFly 18, it was discovered that the boot time of the contaner had increased a fair amount (tracked in https://issues.redhat.com/browse/WFWIP-276). This makes sense since we switched to use the CLI to configure the server which results in an extra embedded server CLI session to run the commands before starting the server. If the user has provided their own user extensions as well, an additional embedded CLI sessions is needed.

Launching the CLI process and starting an embedded server takes multiple seconds, on top of the normal boot. We want this time to be reduced.

We currently have 2 different use cases.

Standard use-case, no user extensions installed:
- STEP1, mandatory: CLI+embedded to execute CLI operations generated by pre-configure/configure of builtin launch and extensions scripts
- STEP2, mandatory: Server start
Optional use-case, user extensions installed:
- STEP1, mandatory: CLI+embedded to execute CLI operations generated by pre-configure/configure of builtin launch and extension scripts
- STEP2, optional: CLI+embedded to execute CLI operations generated by postconfigure of extensions scripts
- STEP3, mandatory: Server start

We are observing an increase up to 8secs (measured in Openshift context) for use-case 1, not yet measured for use-case 2 but we expect 16secs. Those numbers are way too big.

This proposal changes the boot sequence for both use-cases:

Standard use-case, no user extensions installed:
- STEP1, mandatory: Server starts in admin-only mode, new server boot hook to apply CLI operations from inside the booting server.
- STEP2, mandatory: Server reloads itself in normal mode.
  - In cases when the server needs to restart after having applied the CLI operations (very rare, it happens if any of the CLI operations call an operation which puts the server into restart-required mode), the server first restarts before reloading to normal mode (this is because restart is done by the script, and thus ‘inherits’ the original parameters the server was started with).
Optional use-case, user extensions installed:
- STEP1, mandatory: Server starts in admin-only mode, new server boot hook to apply CLI operations from inside the booting server.
- STEP2, mandatory: In parallel the launch script waits for the server to be done with STEP1. When done, extensions/postconfigure.sh executed.
- STEP3, mandatory: CLI process to connect to the remote running server in admin-only mode, execute the extensions generated CLI script (if any) and reload the server in normal mode. In case the server needs to restart after having applied the CLI operations (very rare), the server is restarted before reloading to normal mode.

Expected gains:

The standard use-case has a time increase of few 100s of milliseconds compared to before EAP7-891 was implemented. This is the main use-case to optimize.
The optional use-case has a time increase of few seconds.

Issue Metadata

Issue

Dev Contacts

QE Contacts

Testing By

[x] Engineering

[ ] QE

Affected Projects or Components

WildFly openshift images
WildFly Core

Other Interested Projects

Requirements

Hard Requirements

WildFly controller boot time hook

The controller is extended with a boot time hook, which happens as part of the controller boot after all the operations parsed from the configuration have been processed. The hook loads up the org.jboss.as.cli module, and then then invokes the AdditionalBootCliScriptInvoker loaded up from a service loader. The hook is only available in admin-mode.

The boot time hook is configured with the following system properties:

org.wildfly.internal.cli.boot.hook.script=<path to script to pass to the invoker> - this activates the feature and specifies the CLI script the hook should apply
org.wildfly.internal.cli.boot.hook.reload.skip=true - false by default. If true, we do not reload to normal mode after the hook has executed the CLI operations. We set it to true to keep the server running when we have user extensions, i.e. Optional use-case 2) from above.
org.wildfly.internal.cli.boot.hook.marker.dir - <path to directory to write markers>. There are two marker files:
- wf-invoker-result This is used to advertise that the config adjustment is completed from the hook. It records whether the configuration changes were successful or not.
- wf-cli-shutdown-initiated which is used to keep track of if the calling script needs to do a restart of the server.
- (In addition there is a marker file called wf-restart-embedded-server which is used to indicate that if the server is embedded, the calling script needs to restart it. This is important for Uberjar, which is outside the scope of this feature).

New server boot sequence:

If a boot.script is configured, the hook:
- Asserts that the server is started in admin-mode
- Uses the service loader to load up the AdditionalBootCliScriptInvoker implementation
- Calls the AdditionalBootCliScriptInvoker with the specified CLI script
- If an exception is thrown, writes the marker file containing "failure" (if the marker directory was configured) then aborts.
- Otherwise write the marker file containing "success" (if the marker directory was configured).
- If the server is not configured to skip reload, we then reload/restart the server depending on the ControlledProcessState.restartRequiredFlag value.
  - If false, the hook runs the :reload operation which reloads the server into normal mode
  - If true (which should be rare), we need to restart the server using these steps:
    
    Execute :shutdown(restart=true) and write the wf-cli-shutdown-initiated marker indicating that restart is initiated. This marker is needed because the restart handled by the startup script will pass in the same system properties as for the original start, and so be in admin-only mode.
    
    Following the restart, the server will be in admin-only mode, so we check for and delete the wf-cli-shutdown-initiated marker and do a reload to get the server back into normal mode.
If no boot.script configured:
- Nothing additional happens at start

CLI AdditionalBootCliScriptInvoker implementation

The org.jboss.as.cli module exposes an implementation of the AdditionalBootCliScriptInvoker interface (defined in the org.jboss.as.controller-client module) and exposes it via the JDK Service Loader mechanism. The implementation calls the CLI API to execute a script. The operation target is a local ModelControllerClient passed to the service along with the script to execute.

The classes to interact with CLI are loaded in a temporary module loader, which is closed once the CLI script has been called to ensure that we don’t hang onto all the CLI and Aesh classes in the running server.

The CLI service implementation can be configured with:

org.wildfly.internal.cli.boot.hook.script.properties=<path to file containing properties> (optional)
org.wildfly.internal.cli.boot.hook.script.output.file=<path to file to redirect output to> (optional)
org.wildfly.internal.cli.boot.hook.script.logging=<true to enable CommandContext CLI logging, false by default> (optional)
org.wildfly.internal.cli.boot.hook.script.warn.file=<path to file that contains warning message to display> (optional)
org.wildfly.internal.cli.boot.hook.script.error.file=<path to file that contains error message to display> (optional)

The service executes each line of the file. If an error occurs (Exception, error in error file, un-expected quit), the implementation Logger is used to display each line of the executed script that failed. In addition, the content of a non empty error, warn and output files are printed using the implementation Logger. An invalid condition in the service leads to a RuntimeException thrown, which will cause the server to abort booting.

The CLI service disables the following commands commands:

connect
reload
shutdown
embedded-*
clear
additional commands found from to ServiceLoader.

Calling any of the above commands will cause the the AdditionalBootCliScriptInvoker to fail and the server to abort. The raw operations :reload and :shutdown can still be called but is at the user’s own risk.

The CLI classes in use must not have any side effect on the server (eg: static initializer, static calls).

Impact on log traces

The traces for CLI execution are mixed with the server traces (used to be printed before server starts). This is not a major issue, the traces are not interleaved. We have blocks of traces.
WFLYSRV0025 displayed multiple times (up to 3 times). This will have an impact on tests that are only checking for this in the server log.
This is internal functionality to speed up the image start on OpenShift so it is not supported on bare metal (see Non-Requirements). If the system properties to trigger this are used, we will always log a WARN message saying this is only supported in the OpenShift images, and that any other usage is liable to change in the future.

Impact in Openshift Builder image

From a cloud user perspective, this can be summarized in few sentences:

If the WildFLy container launch script detects that some configuration is required, it starts the server in admin-mode. Once configuration is over, the server is reloaded for the rare cases) into normal mode. All configurations aspects are handled internally by the WildFly container. The extensions hook allows users to add their own custom CLI operations to the global server configuration.”

It means that we now have a CONFIG phase (where the server runs in admin-only mode during the boot/configuration), followed by a RUNNING phase.

New env vars:

EXECUTE_BOOT_SCRIPT_INVOKER - to control if the CLI boot time hook is to be used. Set to true in the builder image. If false, we fall back on the existing way of configuring the server where we launch embedded server session(s) in CLI to configure the server before starting the server in normal mode.
EXECUTE_BOOT_SCRIPT_INVOKER_TIMEOUT - to control the timeout to wait for the server to generate the marker file (if needed, @see sequence with extensions).

Detailed sequences:

Standard use-case, no user extensions installed:
- Preconfigure and configure are called on all launch modules (same as today).
- The server is started in admin-only mode with the generated script passed as a system property, as well as the marker directory specified.
- Any WARN messages during CLI execution are printed
- If error, the error message is displayed and the server aborts
- The server reloads itself in normal mode
Optional use-case, user extensions installed:
- The server is started in admin mode with following system properties set: CLI script, no reload and marker directory.
  - NB: The server is started in background (that is already the case today).
- Once CLI execution done, the server writes the marker file.
  - NB: The server IS NOT yet reloaded in normal mode.
- The shell script waits for the marker file to appear. If it doesn’t appear after a configurable timeout (default 30secs), the script aborts.
- If the marker appeared and contains success, the script configures the extensions (pure bash and remote CLI). If this fails, the script is aborted.
- If an error occurred during the above step, the server is shutdown.
- Otherwise,a CLI process is started, connects to remote running server, executes the extensions generated script (if any).
  - If an error occurred during this CLI execution, the server is shutdown.
  - If success, the following logic applies:
    
    If restart is required. :shutdown(restart=true) is executed which causes the launch script to restart the server with the original arguments. Following the restart of the admin-only server it is reloaded into normal mode.
    
    Otherwise the server is reloaded into normal mode.
- The CLI process terminates.

Impact on readiness probe and Operator

While running in admin-mode the server shouldn’t be advertised as ready. The probe reports UP if the server-state attribute is running but needs updating to check that the running-mode is NORMAL. If running-mode is ADMIN-ONLY the probe should not report UP.

Nice-to-Have Requirements

None

Non-Requirements

This is not supported in standard bare metal mode. As mentioned when the mechanism is enabled an error message is displayed saying this is internal OpenShift functionality.

Test Plan

The existing Openshift EAP image tests owned by engineering will activate the new support since it will be part of the launch scripts.
New engineering owned Openshift EAP image tests will cover extensions and restart edge cases.
There will be new tests in Wildfly to cover the controller hook.
New QE tests TBD.

Community Documentation

This is an internal feature, which we do not intend to expose to users.

The WildFly s2i community documentation will mention the environment variables that can be used to turn this feature off and to adjust the timeout.

Release Note Content