Software update: Condor 7.2.2

The Condor Team at the University of Wisconsin-Madison has released a new stable version of its workload management system, Condor. The version number is 7.2.2, and the package is released under the Apache 2.0 License. Condor focuses on managing compute-intensive tasks and can distribute them across multiple connected nodes. A user submits a task to Condor, which runs it according to the configured policies and the availability of the connected resources, and finally returns the results to the user. Condor can, for example, manage a dedicated Beowulf cluster, but ordinary desktop machines can also be put to work while their users are away. When a user returns to the desktop, the running task is automatically migrated to another node. The announcement, including the list of changes, reads as follows:

Condor 7.2.2 released!

The Condor Team is pleased to announce the release of Condor 7.2.2. This release includes a new full port to Debian 5.0 x86 and a clipped port to Debian 5.0 x86_64. It also includes a large number of bug fixes in the hibernation code and the SOAP interface, among other systems. This release has also been made forward compatible with the upcoming 7.3 release when using CCB (which will be detailed in the next 7.3 release).

New Features:

  • Added a full port of Condor to Debian 5.0 on the x86 platform.
  • Added a clipped port of Condor to Debian 5.0 on the x86_64 platform.
  • Added the -DumpRescue command-line flag to condor_dagman and condor_submit_dag. This flag is intended mainly for testing.
  • Added support for the -debug option to condor_qedit.
  • The Job Router now uses a time slice timer for periodic expression evaluation, similar to the condor_schedd daemon. The evaluation interval is controlled by the configuration variable PERIODIC_EXPR_INTERVAL, and defaults to 60 seconds, the same default value used by the condor_schedd daemon.
  • The Job Router now resets the source job, if a failure occurs when updating the condor_schedd daemon for a periodic expression that evaluated to True. The job’s periodic expressions should be evaluated again some time in the future with a successful update.

Configuration Variable Additions and Changes:

  • The new boolean configuration variable EVENT_LOG_FSYNC provides control of the behavior of Condor when writing events to the event log. Previously, the behavior was as if this parameter were set to False. See 3.3.4 for the complete definition of this variable.
  • The new boolean configuration variable EVENT_LOG_LOCKING provides control of the behavior of Condor when writing events to the event log. Previously, the behavior was controlled by ENABLE_USERLOG_LOCKING. See 3.3.4 for the complete definition of this variable.
  • The new string configuration variable TRANSFERER specifies the path to the condor_transferer program which is invoked by the condor_replication daemon to perform the actual transfer of the file set by STATE_FILE. This is part of the high availability framework. Prior to Condor 7.2.2, the value of TRANSFERER was hard coded to $(RELEASE_DIR)/sbin/condor_transferer. The use of this hard coded behavior should be considered obsolete behavior, and will be removed in a future version of Condor.
  • The PREEMPTION_REQUIREMENTS and the RANK expression in the matchmaker can now reference many more ClassAd attributes than just SubmittorPrio. New attributes allow this expression to take into account resources currently in use, as well as group usage and quota info. New attributes are: SubmitterUserResourcesInUse, RemoteUserResourcesInUse, RemoteGroupResourcesInUse, RemoteGroupQuota, SubmitterGroupResourcesInUse, SubmitterGroupQuota.
  • Added the JOB_ROUTER_ATTRS_TO_COPY configuration option. This is a comma-separated list of attributes that the Job Router should copy from the routed ad to the source ad, in addition to the internally hard-coded attributes that are copied.
  • Added the JOB_ROUTER_RELEASE_ON_HOLD configuration option, which controls whether the Job Router resets the source job to an untouched state if it needs to yield the job because the routed job went on hold. The option defaults to resetting the source job.
  • The new configuration variables PREEMPTION_REQUIREMENTS_STABLE and PREEMPTION_RANK_STABLE identify for Condor if all attributes in the variables PREEMPTION_REQUIREMENTS and PREEMPTION_RANK will not change within a negotiation interval.
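
To illustrate the kind of policy the new matchmaker attributes make possible, here is a minimal Python sketch. It is not Condor's ClassAd evaluation. The attribute names come from the list above; the function name should_preempt, the dictionary standing in for the ad, and the policy itself (allow preemption only when the running job's group is over its quota and the submitter's group is still under its own) are assumptions made for illustration.

```python
def should_preempt(ad):
    """Return True if the hypothetical policy would allow preemption.

    `ad` is a plain dict standing in for the attributes visible to
    PREEMPTION_REQUIREMENTS; it is not a real ClassAd.
    """
    # The group whose job currently holds the resource exceeds its quota.
    remote_over_quota = ad["RemoteGroupResourcesInUse"] > ad["RemoteGroupQuota"]
    # The group of the submitter asking for the resource still has room.
    submitter_under_quota = (
        ad["SubmitterGroupResourcesInUse"] < ad["SubmitterGroupQuota"]
    )
    return remote_over_quota and submitter_under_quota


# Made-up numbers, purely for demonstration.
example = {
    "RemoteGroupResourcesInUse": 12,
    "RemoteGroupQuota": 10,
    "SubmitterGroupResourcesInUse": 3,
    "SubmitterGroupQuota": 10,
}
print(should_preempt(example))
```

In a real pool the equivalent condition would be written as a ClassAd expression in the configuration file; the sketch only shows the comparison such an expression could make.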

Bugs Fixed:

  • Fixed the condor_collector daemon such that hibernating machines never time out.
  • Fixed incorrectly set ClassAd attribute values of machines entering a hibernation state. All hibernating machines are unclaimed and idle, they have no load, the CPU is not busy, and the keyboard and console appear as if they had been idle for a long time.
  • Fixed a bug where if any idle slot satisfied the HIBERNATE expression, Condor would put the machine into a sleep state irrespective of any active slots.
  • Fixed a bug on Windows that made it impossible to use the defined string “S5” for hibernation.
  • Fixed a bug in which the condor_starter would run as real uid condor after job hooks were invoked, which caused problems when accessing files.
  • Fixed a bug where some machines would send a final update ad to the condor_collector, invalidating the persistent one that was previously sent (when HIBERNATE evaluates to True). This had the effect of dropping the machine out of the pool once the ad had grown stale.
  • Fixed a bug where any two Condor daemons on Windows were able to bind to the same port at the same time.
  • Fixed the behavior of the condor_negotiator so that when a Condor-G matchmaking ad matches, the machine’s ad will be shuffled to the end for round-robin matching to multiple gatekeepers with the same rank.
  • Resolved a bug in which the submit description file command vm_macaddr was improperly parsed, and thus ignored, by condor_submit for vm universe jobs.
  • Condor’s Windows zip file distribution now includes the new C/C++ runtime libraries.
  • Fixed a Windows platform bug for jobs that enable streaming I/O. The bug caused the condor_starter to crash upon invocation of the job.
  • Fixed a bug in which an ill-formed network packet could crash a Condor daemon. This would not be seen in normal Condor operation, but sometimes port-scanning software could trigger such a crash.
  • Fixed a bug in which condor_q would sometimes exit with the value zero, indicating success, when it could not connect to a condor_schedd daemon. It now exits with an error code.
  • Fixed two seemingly small memory leaks in Condor’s SOAP interface. A small amount of memory was lost per SOAP transaction. On a high traffic machine, this leak would eventually render the condor_schedd daemon unresponsive.
  • Fixed a bug in the parallel universe where periodic expressions involving the JobStatus attribute would not function properly.
  • Fixed a bug where Condor daemons could segmentation fault while trying to write a core file to disk in the Unix ports.
  • Fixed a bug in which the use of dedicated execute accounts (indicated by use of the configuration variable DEDICATED_EXECUTE_ACCOUNT_REGEXP) did not work properly in PrivSep mode, that is, with the configuration variable PRIVSEP_ENABLED set to True.
  • Fixed an erroneous log message that reported that the hook defined by HOOK_UPDATE_JOB_INFO had run, but would print the $(HOOK_PREPARE_JOB) path. The correct hook ran, so this was only a logging error. The log message is only visible at the D_FULLDEBUG level.
  • Fixed a bug that caused condor_dagman to crash if the dagman.out file reached a size of 2 GBytes.
  • Fixed a problem affecting the condor_starter when in PrivSep mode. After the user job exited, an error was printed in the condor_starter log file complaining that it failed to chown the sandbox to Condor ownership. This error was not actually harmful, just noisy.
  • Fixed a bug in the condor_master that caused it to not have REPLICATION in its default list for DC_DAEMON_LIST. The example configuration file for HAD has been updated to match, as well.
  • Fixed the condor_transferer daemon and documentation to consistently use the value of the configuration variable MAX_TRANSFERER_LIFETIME in High Availability code.
  • Fixed a bug that caused condor_dagman to crash if a splice DAG had node categories.
  • Changed splice-related condor_dagman debug messages to not be printed at the default verbosity. They are now mostly printed at debug level 4. For definitions of the debug levels, see the condor_dagman manual page at section 9.
  • Fixed a bug that caused the condor_replication daemon, as part of the high availability framework, to start the condor_transferer client incorrectly; the end result was that the condor_transferer was unable to authenticate via GSI using host-based certificates.
  • Fixed a bug in which the ClassAd attribute RemoteWallClockTime could get too big after a restart of the condor_schedd daemon, for jobs that were running at the time of the restart.
  • Fixed a bug that was causing the condor_startd to log the error message
    ioctl(SIOCETHTOOL/GWOL) failed: Operation not permitted (1)
    when started as a Personal Condor on Linux. The message is now suppressed in this case. When the message is printed, an additional message is logged informing the user that this error can be ignored, unless hibernation is being used.
  • Fixed a bug that was causing the condor_startd to sometimes publish the network adapter’s hardware address incorrectly in its ClassAd.
  • Fixed a case in which condor_history could get into an infinite loop when searching through a corrupted history file.
  • Fixed a bug in the user log reader code that could cause it to get into an inconsistent state after detecting missed events.
  • Condor version 7.2.2 and previous releases do not support communication with Condor 7.3.x daemons using the new 7.3.x configuration variables CCB_ADDRESS or PRIVATE_NETWORK_NAME. The version 7.2.2 condor_collector daemon now recognizes when it is receiving ClassAds from such daemons, and it will reject them. In prior versions, Condor would accept the ClassAds, but attempts to use them led to unexpected behavior.
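
The condor_q fix listed above means that a wrapper script can treat a non-zero exit status as a failure to reach the condor_schedd daemon. The following is a minimal Python sketch of such a check. The helper names command_succeeded and schedd_reachable are hypothetical, and returning False when condor_q is not installed is an assumption made so that the sketch runs anywhere.

```python
import subprocess
import sys


def command_succeeded(argv):
    """Run argv, discard its output, and report whether it exited with zero."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def schedd_reachable(condor_q_argv=("condor_q",)):
    """Treat a zero exit status from condor_q as 'the schedd answered'.

    Assumption: a missing condor_q binary is also reported as unreachable.
    """
    try:
        return command_succeeded(list(condor_q_argv))
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without a Condor installation.
    print(command_succeeded([sys.executable, "-c", "raise SystemExit(0)"]))
    print(command_succeeded([sys.executable, "-c", "raise SystemExit(1)"]))
```

With versions before 7.2.2, a check like this could report success even though no condor_schedd daemon had been contacted.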

Additions and Changes to the Manual:

  • Reorganized the user manual section that describes DAGMan.
  • Added a note about the fact that environment values specified with the environment submit description file command override values from the submitter’s environment, as imported with getenv = True.
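
The precedence rule in the manual note above can be pictured with a small Python sketch. This is an illustration of the rule only, not Condor code: the function name effective_environment and the use of dictionaries for both environments are assumptions.

```python
def effective_environment(submitter_env, explicit_env, getenv=True):
    """Combine environments so that explicitly specified values win.

    submitter_env stands in for the submitter's environment (imported only
    when getenv is True); explicit_env stands in for values given with the
    environment submit description file command.
    """
    merged = dict(submitter_env) if getenv else {}
    # Explicit values are applied last, so they override imported ones.
    merged.update(explicit_env)
    return merged


# Made-up values, purely for demonstration.
submitter = {"PATH": "/usr/bin", "MODE": "interactive"}
explicit = {"MODE": "batch"}
print(effective_environment(submitter, explicit)["MODE"])  # prints "batch"
```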

Version number: 7.2.2
Release status: Final
Operating systems: Windows 2000, Linux, BSD, Windows XP, macOS, Solaris, UNIX, Windows Server 2003, Windows Server 2008
Website: Condor
License type: Conditions (GNU/BSD/etc.)