Software Update: HTCondor 8.8.4

The HTCondor Team at the University of Wisconsin-Madison has released a new stable version of its workload management system, HTCondor. The version number is 8.8.4. HTCondor focuses on managing compute-intensive tasks and can distribute them across multiple connected nodes. A user submits a task to HTCondor, which then runs it according to the configured policies and the availability of connected resources, and finally returns the results to the user. HTCondor can, for example, control a dedicated Beowulf cluster, but also regular desktops that are temporarily idle. During SC16, Google, Fermilab and the HTCondor Team demonstrated a 160k-core cloud-based elastic compute cluster. The abbreviated release announcement is as follows:

Known Issues:

  • In the Python bindings, there are known issues with reference counting of ClassAds and ExprTrees. These problems are exacerbated by the more aggressive garbage collection in Python 3. See the ticket for more details. (Ticket #6721)

New Features:

  • The Python bindings are now available for Python 3 on Debian, Ubuntu, and Enterprise Linux 7. To use these bindings on Enterprise Linux 7 systems, the EPEL repositories are required to provide Python 3.6 and Boost 1.69. (Ticket #6327)
  • Added an optimization into DAGMan for graphs that use many-PARENT-many-CHILD statements. A new configuration variable DAGMAN_USE_JOIN_NODES can be used to automatically add an intermediate join node between the set of parent nodes and set of child nodes. When these sets are large, join nodes significantly improve condor_dagman memory footprint, parse time and submit speed. (Ticket #7108)
  • DAGMan can now submit directly to the Schedd without using condor_submit. This provides a workaround for slow submission rates for very large DAGs. This is controlled by a new configuration variable DAGMAN_USE_CONDOR_SUBMIT, which defaults to True. When it is False, DAGMan will contact the local Schedd directly to submit jobs. (Ticket #6974)
  • The HTCondor startd now advertises HasSelfCheckpointTransfers, so that pools with 8.8.4 (and later) stable-series startds can run jobs submitted using a new feature in 8.9.3 (and later). (Ticket #7112)
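
To illustrate why the join nodes mentioned under DAGMAN_USE_JOIN_NODES help with large many-PARENT-many-CHILD statements, here is a minimal sketch that only counts dependency edges. It is not DAGMan's implementation; the function names and the `_join` node name are hypothetical and chosen purely for illustration.

```python
def direct_edges(parents, children):
    """Link every parent to every child: len(parents) * len(children) edges."""
    return [(p, c) for p in parents for c in children]


def joined_edges(parents, children, join="_join"):
    """Route all dependencies through one intermediate node.

    The node name "_join" is an assumption made for this example.
    This yields len(parents) + len(children) edges.
    """
    return [(p, join) for p in parents] + [(join, c) for c in children]


parents = [f"P{i}" for i in range(1000)]
children = [f"C{i}" for i in range(1000)]

print(len(direct_edges(parents, children)))  # 1000000
print(len(joined_edges(parents, children)))  # 2000
```

With 1000 parents and 1000 children, the number of edges drops from a million to two thousand, which is consistent with the improvements in memory footprint, parse time and submit speed described above.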

Bugs Fixed:

  • Fixed a bug that caused editing a job ClassAd in the schedd via the Python bindings to be needlessly inefficient. (Ticket #7124)
  • Fixed a bug that could cause the condor_schedd to crash when a scheduler universe job is removed. (Ticket #7095)
  • If a user accidentally submits a parallel universe job with thousands of times more nodes than exist in the pool, the condor_schedd no longer gets stuck for hours sorting that out. (Ticket #7055)
  • Fixed a bug on the ARM architecture that caused the condor_schedd to crash when starting jobs and responding to condor_history queries. (Ticket #7102)
  • HTCondor properly starts up when the condor user is in LDAP. The condor_master creates /var/run/condor and /var/lock/condor as needed at start up. (Ticket #7101)
  • The condor_master will no longer abort when the DAEMON_LIST does not contain MASTER. When the DAEMON_LIST is empty, the condor_master will now start the SHARED_PORT daemon if shared port is enabled. (Ticket #7133)
  • Fixed a bug that prevented the inclusion of the last OBITUARY_LOG_LENGTH lines of the dead daemon’s log in the obituary. Increased the default OBITUARY_LOG_LENGTH from 20 to 200. (Ticket #7103)
  • Fixed a bug that could cause custom resources to fail to be released correctly from a dynamic slot to a partitionable slot when there were multiple custom resources with the same identifier. (Ticket #7104)
  • Fixed a bug that could result in job attributes CommittedTime and CommittedSlotTime reporting overly-large values. (Ticket #7083)
  • Improved the error messages generated when GSI authentication fails. (Ticket #7052)
  • Improved detection of failures writing to the job event logs. (Ticket #7008)
  • Updated the ChildCollector and CollectorNode configuration templates to set CCB_RECONNECT_FILE. This avoids a bug where each collector running behind the same shared port daemon uses the same reconnect file, corrupting it. (This corruption will cause new connections to a daemon using CCB to fail if the collector has restarted since the daemon initially registered.) If your configuration does not use the templates to run multiple collectors behind the same shared port daemon, you will need to update your configuration by hand. (Ticket #7134)
  • The condor_q tool now displays -nobatch mode by default when the -run option is used. (Ticket #7068)
  • HTCondor EC2 components are now packaged for Debian and Ubuntu. (Ticket #7084)
  • Fixed a bug that could cause condor_submit to send invalid job ClassAds to the condor_schedd when the executable attribute was not the same for all jobs in that submission. (Ticket #6719)
  • Fixed a bug in the Standard Universe where SOFT_UID_DOMAIN did not work as expected. (Ticket #7075)

Version number: 8.8.4
Release status: stable
Operating systems: Windows 7, Linux, BSD, macOS, Solaris, UNIX, Windows Server 2012, Windows 8, Windows 10, Windows Server 2016
Website: HTCondor
License type: Conditions (GNU/BSD/etc.)