The HADES run control

Current Status and Proposal

2004-03-21 · Mathias Münch

Abstract

Since the first implementation of the HADES run control was only brought close to operation, this document summarizes the current status of the system. The system is roughly divided into three parts: control and monitoring, parameter storage, and user interface. The experience gained with the current and the planned system is collected. The document checks whether the requirements have changed since the original design, and analyzes which parts of the system meet those requirements well and which parts would be better rebuilt from scratch. A procedure for continuing the development is proposed, covering technical, organizational and personnel questions.

Control and Monitoring

Requirements

The HADES DAQ consists of a distributed system of a dozen front end and back end CPUs, controlling several software processes and a large amount of readout electronics. The whole system needs to be configured with the proper operational parameters, controlled by means of starting and stopping operation and monitored for errors or other exceptional conditions.

All these operations shall be possible from a central operator console, from which all commands are distributed and all information can be presented in a concise way.

Status

As one of the first decisions for the current system, EPICS was chosen as the base of the run control system. The original idea was to use readily available components of this Experimental Physics and Industrial Control System as far as possible.

Today, an agent is running on all CPUs which can be controlled and queried by the user interface via the EPICS channel access protocol. This agent uses a C++ library for implementing the EPICS part. The communication with the controlled device is done via a library that is written by the developer of the device. For this, a deliberately simple interface definition was introduced to allow run control and electronics to be developed independently.

Gains

All the electronics are controlled by the run control agents. They can be stopped and started from a central place. The success of each operation is checked, and the definition of the finite state machine ensures that every operation leads to a consistent state. Performance is not limited by the run control system, but by the operations themselves. Operation is optimized by working in parallel as far as possible.
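The behaviour described above (checked operations, a state machine that always ends in a defined state, parallel execution) can be illustrated with a minimal sketch. The state names, commands and the `Agent` and `broadcast` helpers are assumptions made for this illustration; they are not taken from the actual EPICS-based agent.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CONFIGURED = "configured"
    RUNNING = "running"
    ERROR = "error"


# Allowed transitions: (current state, command) -> state reached on success.
# The table is hypothetical; the real agent may use other states.
TRANSITIONS = {
    (State.IDLE, "configure"): State.CONFIGURED,
    (State.CONFIGURED, "start"): State.RUNNING,
    (State.RUNNING, "stop"): State.CONFIGURED,
    (State.CONFIGURED, "reset"): State.IDLE,
    (State.ERROR, "reset"): State.IDLE,
}


class Agent:
    """One run control agent wrapping the operations of one device."""

    def __init__(self, name, device_ops):
        self.name = name
        self.device_ops = device_ops  # command -> callable returning True/False
        self.state = State.IDLE

    def handle(self, command):
        target = TRANSITIONS.get((self.state, command))
        if target is None:
            return False  # command not allowed here; state stays unchanged
        # A command without a device operation is treated as a successful no-op.
        operation = self.device_ops.get(command, lambda: True)
        ok = operation()
        # Success or failure both end in a well-defined state.
        self.state = target if ok else State.ERROR
        return ok


def broadcast(agents, command):
    """Send one command to all agents in parallel; True only if all succeed."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        results = list(pool.map(lambda agent: agent.handle(command), agents))
    return all(results)
```

A failed device operation leaves the agent in `ERROR` rather than in an undefined intermediate state, which is the property the state machine is meant to guarantee.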

Problems

During the development, the EPICS services were abandoned one after the other. So today, only the network communication part and the finite state machine implementation of EPICS are used. The network transport part in particular proved to be quite complicated to use when the programming has to be done from scratch.

The run control agent has to be linked with the EPICS Portable Channel Access Library, a quite big and bulky C++ library. At least on the E7 systems this seems to bring the compiler system to the edge of operation. For example, now and then a jump table in the binary is obviously miscalculated by the compiler. Also when the run control agent is running on the E7 system, the CPU crashes silently after a few hours.

While the control part is fully implemented, the monitoring part is somewhat neglected. In the device libraries the support for monitoring was introduced quite late and was not very well supported by the developers. There are also hints that the simple interface of the device library leads to inefficiencies when doing continuous monitoring. Additionally, all the EPICS features, like alarms, notification and so on, are not yet employed in the user interface.

Proposal

To EPICS or not to EPICS, that is the question. Admittedly, it was a huge effort to bring EPICS into operation on the LynxOS CPUs, and some problems are still unsolved. So if it were used only in the current way, that is, as a network transport, it would probably not be worth the effort.

On the other hand, the event mixing problem during the January 2004 beam time has renewed interest in the monitoring part of the run control. The number of mixed events was correctly counted and displayed, but nobody looked at the number. This is understandable, since it is a single small number on a big, cluttered screen. Consistent monitoring with a defined alarm on this counter would have saved several million events.
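What a defined alarm on such a counter amounts to can be sketched in a few lines. This is only an illustration of threshold checking in general, not of the EPICS alarm implementation; the channel name `mixed_events`, the limits and the severity labels are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Threshold definition for one monitored value (names are hypothetical)."""
    name: str
    warn_above: float
    alarm_above: float

    def severity(self, value):
        if value > self.alarm_above:
            return "MAJOR"
        if value > self.warn_above:
            return "MINOR"
        return "OK"


def check(alarms, readings):
    """Return (name, value, severity) for every reading that is not OK."""
    raised = []
    for alarm in alarms:
        value = readings.get(alarm.name)
        if value is None:
            # A missing reading is reported too: no data is itself suspicious.
            raised.append((alarm.name, None, "INVALID"))
            continue
        severity = alarm.severity(value)
        if severity != "OK":
            raised.append((alarm.name, value, severity))
    return raised
```

With a definition such as `Alarm("mixed_events", warn_above=0, alarm_above=10)`, the first mixed event would already appear in the list of raised alarms instead of relying on someone reading the counter.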

To reinvent all the features that are necessary for a complete monitoring system, like user definable thresholds, notifications, event logs etc., is a big effort. So since most of the EPICS problems have been solved during the development of the current system, it is very tempting to stick with it and extend the use of the available features.

The special problems of the E7 platform may be solved by implementing a small Remote VME Library. The operations in the TOF crates are simpler than in the rest of the system, because they are mostly equipped with standard VME ADC/TDC modules, which need not be initialized in a special way. Since the TOF crates are logically a special case anyhow, this approach does not seem to break the general concept.

Also, the EPICS system itself has improved during the last few years. Today a complete system, including the parts that were formerly only implemented for the VxWorks platform, is available for the Linux operating system. So the nicest solution would be to stick with EPICS, but abandon the Portable Channel Access Library. This would be possible if one could port the IOC part of EPICS to LynxOS. However, first tests have shown that this is not possible, at least with the very old C++ compiler of LynxOS 2.5. So this would mean quite some effort, such as installing a cross compiler or making major changes to the IOC source. But since the gain of this step would be considerable, it may be worth a try.

In the planned revision of the run control, the first priority should be monitoring. For the rest of the run control tasks there is an existing solution in place, but the monitoring is almost completely missing. Additionally, the most important aspects of EPICS, the user interface and the interface to the electronics can be tested on the complete system without interrupting normal operation. And last but not least, the control part of the run control system itself depends heavily on reliable monitoring. For example, some big inefficiencies in the current system had to be tolerated just because it was not possible to figure out the current state of the readout electronics.

This approach shall lead to a development path where parts of the new run control system can be used at a very early stage, while the rest of the functionality can still be covered by the old, script-based system. Technical problems of the real world will then hopefully become visible already at this early stage. The users can get their hands on the solution, comment on it and give hints for improvement.

Parameter Storage

Requirement

One major requirement for the run control was to ensure the traceability of the parameter set used during the measurement. The parameter set in effect will be requested by the analysis program.

Status

The current implementation stores all parameters in the HADES Oracle database. Mass data like DSP/FPGA programs, thresholds for RICH or MDC etc. are held in files that are accessed via NFS. Only the file names are in the database.

Since the front end CPUs cannot access the Oracle DB directly, they use EPICS to fetch and store their parameters. A gateway program (Parameter Server) forwards the requests to the database. To keep the run control implementation somewhat independent from the underlying parameter storage and to allow test environments without EPICS and Oracle support, an interface library (allParam) is used. It allows parameters to be requested from Oracle, EPICS or text files.
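The idea of hiding the storage behind one interface can be sketched as follows. This is not the allParam API; `ParamSource`, `TextSource`, `MemorySource` and `read_int` are hypothetical names, the `name = value` text format is an assumption, and the in-memory source merely stands in for an Oracle or EPICS backend.

```python
from typing import Optional, Protocol


class ParamSource(Protocol):
    """Anything that can look up a parameter by name."""

    def get(self, name: str) -> Optional[str]: ...


class TextSource:
    """Parses lines of the form 'name = value'; '#' starts a comment."""

    def __init__(self, text: str):
        self._values = {}
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()
            if not line or "=" not in line:
                continue  # skip blank lines, comments and malformed lines
            name, value = line.split("=", 1)
            self._values[name.strip()] = value.strip()

    def get(self, name: str) -> Optional[str]:
        return self._values.get(name)


class MemorySource:
    """Stand-in for a database or EPICS backend in this sketch."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name: str) -> Optional[str]:
        return self._values.get(name)


def read_int(source: ParamSource, name: str) -> int:
    """Caller code is identical whichever backend is plugged in."""
    value = source.get(name)
    if value is None:
        raise KeyError(name)
    return int(value)
```

Code written against `read_int` runs unchanged in a test environment that only has text files, which is the independence the interface library is meant to provide.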

The entries in the DB are split into about 100 tables to represent the entity properties correctly in the relations. This makes it possible to use checks extensively and to access the parameters easily from the analysis programs. The DAQ itself uses only a simple name value pair concept for the parameter access. As a consequence, a name value pair table has to be computed from the underlying data structure when a parameter changes.
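The conversion from relational tables to name value pairs can be illustrated with a toy schema. The two tables, their columns and the naming scheme `crate.slotN.parameter` are invented for this sketch and use SQLite in place of Oracle; the real schema has about 100 tables.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database with a hypothetical two-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE crate (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE module (
            id        INTEGER PRIMARY KEY,
            crate_id  INTEGER NOT NULL REFERENCES crate(id),
            slot      INTEGER NOT NULL,
            threshold INTEGER NOT NULL CHECK (threshold >= 0),
            gain      INTEGER NOT NULL
        );
        INSERT INTO crate VALUES (1, 'tof0');
        INSERT INTO module VALUES (1, 1, 3, 40, 2), (2, 1, 4, 55, 2);
    """)
    return db


def name_value_pairs(db):
    """Flatten the relational data into the flat view the DAQ side expects."""
    pairs = {}
    rows = db.execute(
        "SELECT c.name, m.slot, m.threshold, m.gain "
        "FROM module m JOIN crate c ON c.id = m.crate_id "
        "ORDER BY c.name, m.slot"
    )
    for crate, slot, threshold, gain in rows:
        prefix = f"{crate}.slot{slot}"
        pairs[f"{prefix}.threshold"] = str(threshold)
        pairs[f"{prefix}.gain"] = str(gain)
    return pairs
```

The relational side keeps the consistency checks (here a `CHECK` on the threshold), while the flat view has to be recomputed whenever a row changes; that recomputation is where the cost arises.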

Gains

All currently used parameters are available in the database, and access to the database works both directly and via EPICS. The database contents can be read and set via the GUI.

Problems

Performance problems are encountered at two places. The calculation of name value pairs from the database tables takes about one minute. So in the worst case, after a parameter change, the next operation is possible only after waiting that time. The fetch of parameters via EPICS and the parameter gateway server is at least suspected of suffering from a performance limitation¹.

The whole system of tables and name value pair views is considered to be too complicated by most people.

Proposal

After talking to several people, the impression is that the problem of relational tables on one side and name value pairs on the other side cannot be solved easily. The application that uses the stored parameters after the measurement, i.e. the analysis, will need a proper representation. It will not use the name value pairs. So one of the two sides has to do the conversion.

For several reasons, data consistency for example, the current system is the technically more elegant solution. The complexity may be reduced quantitatively by reducing the number of tables. Qualitatively, no major changes seem possible. The crucial question is whether the performance can be brought to an acceptable level by optimization and by implementing a more aggressive caching concept. So some effort should be put into these areas before abandoning the whole concept.

If no improvement can be gained, the run control will have to store all parameters in simple name value pair tables. In that case, database features will be used only to a minimal extent. So an even more radical approach comes into view again: the parameters may be stored in text files using a system that enforces version control. The text files could be imported into the DB offline.

The performance problem of the parameter gateway server and possibly of the whole EPICS parameter solution can be avoided by using a simpler system for parameter transport. While EPICS has its virtues when monitoring parameters, it is overkill if only a simple number or string is to be transported. The run control uses the parameter library anyhow, which is completely independent from EPICS. So by converting this library into a client/server version, which seems to be no major technical effort, a considerable overhead could be saved. The downside is that an extra network transport technology will be necessary for the parameters. Still, this direction of development seems promising.
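How little is needed to transport a simple number or string can be shown with a minimal client/server sketch. The line protocol (`GET <name>` answered by `OK <value>` or `ERR`) and the function names are assumptions for this illustration, not a description of the existing parameter library.

```python
import socket
import socketserver
import threading


class ParamHandler(socketserver.StreamRequestHandler):
    """Answers each 'GET <name>' line with 'OK <value>' or 'ERR'."""

    def handle(self):
        for raw in self.rfile:
            parts = raw.decode("utf-8").split()
            known = self.server.params
            if len(parts) == 2 and parts[0] == "GET" and parts[1] in known:
                reply = "OK " + known[parts[1]]
            else:
                reply = "ERR"
            self.wfile.write((reply + "\n").encode("utf-8"))


def start_server(params, host="127.0.0.1", port=0):
    """Serve the given name value pairs; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), ParamHandler)
    server.daemon_threads = True
    server.params = dict(params)  # hypothetical in-memory parameter store
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(address, name, timeout=2.0):
    """Client side: request one parameter and return its value as a string."""
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(f"GET {name}\n".encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            line = reader.readline().rstrip("\n")
    if not line.startswith("OK "):
        raise KeyError(name)
    return line[3:]
```

A front end process would call something like `fetch(server.server_address, "rich.threshold")`. The sketch has no authentication, no write access and no change notification, which is exactly the functionality that would remain with EPICS for monitoring.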

User Interface

Requirements

The user interface shall allow the shift crew to operate the system for measurement. So the main functions are to bring HADES into a measurement state and to monitor whether everything runs well during data taking. This part shall require only minimal knowledge of the underlying system.

Additional functions like reconfiguration of the soft- and hardware or elaborate debugging may be implemented, but on a lower priority and only if they do not complicate the job of the shift crew.

Status

A user interface based on Tcl/Tk that communicates directly via EPICS to the agents and via SQL to the database was developed.

Gains

The current user interface allows complete control of all aspects of the HADES DAQ, including start and stop of agents, reconfiguration of the setup with more or fewer subsystems, introduction of additional electronics etc. Once the system is running, normal operation requires only a few simple operations. The user interface itself is dynamically configured by information from the parameter database.

Problems

The current system stems from a phase of the HADES development in which the configuration was changing continuously. Therefore, it concentrates very much on dynamic configurability. This makes the system internally quite complicated. Changes inside the current model of HADES are easy to achieve, but when a basic aspect changes, the user interface can follow only with difficulty. The configuration part of the user interface somewhat hides the part that is used for normal operation. The monitoring, while available on a very detailed level, lacks features like compound views of variables, user definable alarms etc.

Proposal

The most radical proposal is made in the area of the user interface. If EPICS is used as the base system, the current GUI shall be abandoned completely and replaced with a solution that uses one of the EPICS GUI tools, most probably MEDM. This will rule out dynamic reconfiguration of the GUI, but will put the monitoring features of EPICS into the hands of the user. For normal operation, too, such a simple solution is probably the best fit.

What remains completely unsolved in such a scenario is the question of how the reconfiguration of parameters in the database shall be done. It would be very nice to have it integrated in the MEDM GUI, but up to now it is unclear whether this is possible. On the other hand, some operations can be removed from the user interface completely. For example, the start of the run control agents and the initialization of the front end electronics shall be done automatically during the boot procedure of the front end CPU.

If EPICS is abandoned, the question of the UI becomes a very hard one. In that case, a detailed investigation of the available solutions for operating and monitoring a distributed system seems necessary.

Summary

Since most technical questions have been solved in the development done so far, this experience and expertise can be used. So the revision of the run control system, while still some effort, seems to be possible. The concept and core part of the current run control development shall be kept as the base for further development. In particular, EPICS shall be kept as the basic building block.

As a proof of this concept, first the two most problematic technical areas, i.e. the E7 EPICS agent and the EPICS parameter server shall be replaced by a more lightweight solution, as was proposed by B. Sailer already some time ago. If this proves to solve the current technical problems, agents and GUI shall put the priority on monitoring. This version of the system shall immediately be used under realistic conditions, i.e. in an experiment.

In a second pass, the database design shall be revised and the parameter and control part of the run control shall be put into operation. During this phase, a reimplementation of the agent using the EPICS IOC under Linux may also be considered.

In general the further development shall be started in small steps that can immediately be used in parallel to the existing scripts. When the development reaches the areas of control and parameter access, a major switch from one system to the other cannot be avoided any more. Therefore the setup of a parallel test bed is absolutely necessary, so that the run control development can proceed under stable conditions².

The described procedure imposes quite some effort in terms of work to be done. Especially the changed priority in monitoring will put the electronics developers in charge again. Support has been offered in the areas of database and GUI.

All in all, based on the work done in the development of the current system, a run control for the HADES system seems to be reachable by revision and in some areas reimplementation of the available code.

¹A faster version is implemented, but not thoroughly tested.

²One of the most time consuming tasks in the current development was to resynchronize the new run control configuration with the changed HADES setup. Regression testing was almost impossible and took at least hours of preparation.
