Software update: Condor 7.4.2


The Condor Team at the University of Wisconsin-Madison has released a new version of its workload management system, Condor. The version number is now 7.4.2, and the package is released under the Apache 2.0 License. Condor focuses on managing compute-intensive tasks and can distribute them across several connected nodes. A user submits a task to Condor, which schedules it according to the configured policies and the availability of the connected resources, and finally returns the results to the user. Condor can, for example, manage a dedicated Beowulf cluster, but it can also use ordinary desktop machines while they sit idle. The announcement, which lists the changes in this release, follows:

Condor 7.4.2 released!

The Condor Team is pleased to announce the release of Condor 7.4.2. This is a stable release of Condor which includes numerous bug fixes to 7.4.1.

Bugs Fixed:

  • Fixed a bug in which the condor_schedd would sometimes negotiate for and try to run more jobs than specified by MAX_JOBS_RUNNING. Once the jobs started running, it would then kill them off to get back below the limit. This was more likely to happen with slow preemption caused by MaxJobRetirementTime or by a large timeout imposed by KILL. This problem has existed since before Condor 6.5. When this problem happened, the following message appeared in the condor_schedd log:
    Preempting X jobs due to MAX_JOBS_RUNNING change
  • Fixed a problem that caused condor_ssh_to_job to fail to connect to a job running on a slot with multiple ‘@’ signs in its name. This bug has existed since the introduction of condor_ssh_to_job in 7.3.2.
  • In all previous versions of Condor, condor_status refused to accept -long, -xml, and -format when followed by an argument such as -master that specified which type of daemon to look at. The order of the arguments had to be reversed or it would produce a message such as the following:
    Error: arg 4 (-master) contradicts arg 1 (-format)
  • Fixed a bug which caused the condor_master to crash if VIEW_SERVER was included in DAEMON_LIST and CONDOR_VIEW_HOST was unset.
  • Fixed a bug that caused configuration parameter LOCAL_CONFIG_DIR to be ignored if it was set in a local configuration file, as opposed to the top-level configuration file.
  • Fixed a bug that could cause the condor_schedd to behave incorrectly when reading an invalid job queue log on startup.
  • Fixed a bug that could corrupt the job queue log if the condor_schedd daemon’s attempt to compact it fails.
  • Fixed a problem that in rare cases caused the condor_schedd to crash shortly after the condor_gridmanager exited. This bug has existed since before Condor version 6.8.
  • Fixed a problem that was resulting in messages such as the following:
    ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
  • The file extension specified to condor_fetch_log can no longer contain a path delimiter.
  • When in graceful shutdown mode, the condor_schedd was sometimes starting idle scheduler universe jobs. With a large enough number of scheduler universe jobs, this could lead to a cycle of stopping and restarting jobs until the graceful shutdown time expired.
  • Fixed multiple bugs that prevented Condor from building on or running correctly on OpenSolaris X86/64 version 2009.06.
  • Fixed a bug which caused the condor_startd to incorrectly count the number of processors on some machines with Hyper-threading enabled. This bug was introduced in Condor version 7.3.2, and exists in 7.4.0 and 7.4.1.
  • Fixed a problem with GSI authentication in Condor that would cause daemons to consume more and more memory over time. The biggest source of trouble was introduced in Condor version 7.3.2. However, a smaller memory leak that existed in all previous versions of Condor has also been fixed.
  • Fixed a bug where if condor_compile is invoked in a manner such as:
    condor_compile gcc -print-prog-name=ld
    an error would be emitted, and condor_compile would exit with a bad exit code.
  • The sort order of condor_status output accidentally changed in Condor version 7.3, so that the output was sorted on slot name first, then machine name. The behavior is now restored to the original sorting: first on machine name, then slot name.
  • If one machine running a parallel job crashed, and job leases are enabled (which they are by default), the job would not exit until the job lease duration expired. As the condor_starter will not get respawned, there is no need to wait. Many sites set long job lease durations, to prevent jobs from being killed when the machine running the condor_schedd daemon reboots. Now, if one node goes away, the whole computation is shut down immediately.
  • Fixed the verbosity level of some condor_dagman messages written to the dagman.out file.
  • Fixed a bug introduced in Condor version 7.3.2 that resulted in messages such as the following even in cases where no problem in communicating with the condor_collector had been encountered:
    Collector is still being avoided if an alternative succeeds.
    This problem was believed to be fixed in Condor 7.4.1, but some cases of the problem remained in that version.
  • Fixed a bug from Condor version 6.1.14, that resulted in the condor_schedd performing the operation scheduled via WALL_CLOCK_CKPT_INTERVAL at the specified frequency (default time of 1 hour), multiplied by the number of times the condor_schedd daemon had been reconfigured during its lifetime. This could lead to degraded performance, especially prior to Condor version 7.4.1, when this operation was more disk-intensive.
  • 32-bit Linux versions of Condor running in a 64-bit environment would sometimes not detect the existence of some processes and sometimes wrongly detect that a tracked process belonged to root when it actually belonged to some other user. This could lead to failure to run jobs or failure to properly monitor and clean up after them. When the wrong process ownership problem happened, the following message appeared in the condor_master and/or condor_procd logs:
    ProcAPI: fstat failed in /proc! (errno=75)
    If condor_procd failed to detect the existence of its own parent process, it would exit with the following message in its log:
    ERROR: master has exited
  • Fixed a problem in the condor_job_router daemon, introduced in Condor version 7.2.2, that could cause the daemon to crash when failing to carry out the change of state dictated by a job’s periodic policy expressions, for example, the failure to put a job on hold when periodic_hold becomes True.
  • Fixed a bug introduced in Condor 7.3.2 that caused Grid Monitor jobs to receive a full X.509 proxy. Now, it always receives a limited proxy, which was the previous behavior.
  • Fixed a bug that could cause the nordugrid_gahp to crash.
  • Fixed a problem introduced in 7.4.0 that could cause two condor_schedd daemons with a match to the same slot to both fail to claim it, rather than letting the first one to claim it succeed. This sort of situation can happen when the condor_negotiator has a stale view of the pool, either because the gap between negotiation cycles is configured to be shorter than usual, or because updates from the condor_startd to the condor_collector are not reliably delivered and processed.
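
The condor_status sort-order fix in the list above comes down to which field acts as the primary sort key. The following is a minimal Python sketch of the two orderings, not Condor's actual implementation; the host names, the slot names, and both function names are made up for illustration.

```python
# Hypothetical (machine name, slot name) pairs; real condor_status
# output carries many more columns than these two.
slots = [
    ("nodeB.example.org", "slot1"),
    ("nodeA.example.org", "slot2"),
    ("nodeB.example.org", "slot2"),
    ("nodeA.example.org", "slot1"),
]


def sort_machine_then_slot(entries):
    """Primary key: machine name, secondary key: slot name
    (the original ordering that 7.4.2 restores)."""
    return sorted(entries, key=lambda e: (e[0], e[1]))


def sort_slot_then_machine(entries):
    """Primary key: slot name, secondary key: machine name
    (the ordering that appeared by accident in 7.3)."""
    return sorted(entries, key=lambda e: (e[1], e[0]))


if __name__ == "__main__":
    print("machine first:")
    for machine, slot in sort_machine_then_slot(slots):
        print(f"  {slot}@{machine}")
    print("slot first:")
    for machine, slot in sort_slot_then_machine(slots):
        print(f"  {slot}@{machine}")
```

With the machine name as the primary key, all slots of one machine are listed together; with the slot name as the primary key, the slots of a single machine end up scattered across the listing.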

Additions and Changes to the Manual:

  • Descriptions of all the commands that may be placed into a submit description file are now located within the condor_submit manual page, instead of within Chapter 2, the Users’ Manual.
  • An initial, but not yet complete set of configuration variables that require a restart when changed, is listed in section 3.3.1. Using condor_reconfig to change these variables’ values is not sufficient.

Version number: 7.4.2
Release status: stable
Operating systems: Windows 7, Linux, BSD, Windows XP, macOS, Solaris, UNIX, Windows Server 2003, Windows Vista, Windows Server 2008
Website: Condor
Download: http://www.cs.wisc.edu/condor/downloads/
License type: Conditions (GNU/BSD/etc.)