Since the first implementation of the HADES run control was only brought close to operation, this document summarizes the current status of the system. The system is divided roughly into three parts, namely control and monitoring, parameter storage, and user interface. The experiences with the current and the planned system are collected. It is checked whether the requirements have changed since the original design, and analyzed which parts of the system found a good solution for the requirements and which parts would be better rebuilt from scratch. A procedure for continuing the development is proposed, covering technical, organizational and personnel questions.
The HADES DAQ consists of a distributed system of a dozen front end and back end CPUs, controlling several software processes and a large amount of readout electronics. The whole system needs to be configured with the proper operational parameters, controlled by means of starting and stopping operation and monitored for errors or other exceptional conditions.
All these operations shall be possible from a central operator console, from which all commands are distributed and all information can be presented in a concise way.
As one of the first decisions for the current system, EPICS was chosen as the base of the run control system. The original idea was to use readily available components of this Experimental Physics and Industrial Control System as far as possible.
Today, an agent is running on all CPUs, which can be controlled and queried by the user interface via the EPICS channel access protocol. This agent uses a C++ library to implement the EPICS part. The communication with the controlled device is done via a library written by the developer of the device. For this, a deliberately simple interface definition was introduced to allow independent development of run control and electronics.
All the electronics are controlled by the run control agents, which can be stopped and started from a central place. The success of each operation is checked, and the definition of the finite state machine ensures that every operation leads to a consistent state. Performance is not limited by the run control system, but by the operations themselves. The operation is optimized by working in parallel as far as possible.
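The behaviour described above can be illustrated with a minimal sketch. The state names, the command names, the `Agent` class and the `broadcast` function are assumptions made for illustration only; they are neither the actual HADES agent code nor an EPICS API. The sketch shows a command being sent to several agents in parallel, with each agent checking the success of its operation and ending up in a defined state.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transition table: (current state, command) -> next state.
TRANSITIONS = {
    ("idle", "configure"): "configured",
    ("configured", "start"): "running",
    ("running", "stop"): "configured",
}


class Agent:
    """Illustrative run control agent with a small finite state machine."""

    def __init__(self, name, operation=None):
        self.name = name
        self.state = "idle"
        # 'operation' stands in for the device library call; it returns
        # True on success. By default every operation succeeds.
        self._operation = operation or (lambda command: True)

    def handle(self, command):
        target = TRANSITIONS.get((self.state, command))
        if target is None:
            return False  # command not allowed here, state is unchanged
        try:
            ok = bool(self._operation(command))
        except Exception:
            ok = False
        # A failed operation never leaves the agent in an undefined state.
        self.state = target if ok else "error"
        return ok


def broadcast(agents, command):
    """Send one command to all agents in parallel and collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        results = list(pool.map(lambda agent: agent.handle(command), agents))
    return dict(zip((agent.name for agent in agents), results))
```

With this structure the total time of a command is roughly that of the slowest agent rather than the sum over all agents, which matches the observation that the operations themselves, not the run control, limit the performance.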
During the development, one EPICS service after another was abandoned. So today, only the network communication part and the finite state machine implementation of EPICS are used. The network transport part in particular proved to be quite complicated to use when the programming has to be done from scratch.
The run control agent has to be linked with the EPICS Portable Channel Access Library, a quite big and bulky C++ library. At least on the E7 systems this seems to bring the compiler system to the edge of operation. For example, now and then a jump table in the binary is evidently miscalculated by the compiler. Also, when the run control agent is running on the E7 system, the CPU crashes silently after a few hours.
While the control part is fully implemented, the monitoring part is somewhat neglected. In the device libraries the support for monitoring was introduced quite late and was not very well supported by the developers. There are also hints that the simple interface of the device library leads to inefficiencies when doing continuous monitoring. Additionally, all the EPICS features, like alarms, notification and so on, are not yet employed in the user interface.
To EPICS or not to EPICS, that is the question. Admittedly, it was a huge effort to bring EPICS into operation on the LynxOS CPUs, and some problems are still unsolved. If it continued to be used in the current way, which means only as a network transport, it would probably not be worth the effort.
On the other hand, the event mixing problem during the January 2004 beam time has renewed the interest in the monitoring part of the run control. The number of mixed events was correctly counted and displayed, but nobody looked at the number. This is understandable, since it is a simple, small number on a big, cluttered screen. A consistent monitoring with a defined alarm on this counter would have saved several million events.
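What a defined alarm on such a counter means can be sketched in a few lines. This is plain Python for illustration, not EPICS code; the counter name `mixed_events`, the threshold values and the severity labels are assumptions. The point is that the operator is shown only the readings that need attention, instead of having to find one number on a crowded screen.

```python
from dataclasses import dataclass


@dataclass
class CounterAlarm:
    """User definable thresholds for one monitored counter (illustrative)."""
    name: str
    warn_at: int
    alarm_at: int

    def severity(self, value):
        if value >= self.alarm_at:
            return "ALARM"
        if value >= self.warn_at:
            return "WARNING"
        return "OK"


def check(alarms, readings):
    """Return (name, value, severity) for every reading that is not OK."""
    raised = []
    for alarm in alarms:
        value = readings.get(alarm.name)
        if value is None:
            # A counter that is not reported at all is also worth a message.
            raised.append((alarm.name, None, "NO_DATA"))
            continue
        severity = alarm.severity(value)
        if severity != "OK":
            raised.append((alarm.name, value, severity))
    return raised
```

A call such as `check([CounterAlarm("mixed_events", warn_at=1, alarm_at=100)], readings)` would then be run on every monitoring update, and a non-empty result would trigger the notification.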
To reinvent all the features that are necessary for a complete monitoring system, like user definable thresholds, notifications, event logs etc., is a big effort. So since most of the EPICS problems have been solved during the development of the current system, it is very tempting to stick with it and extend the use of the available features.
The special problems of the E7 platform may be solved by implementing a small Remote VME Library. The operations in the TOF crates are simpler than in the rest of the system, because they are mostly equipped with standard VME ADC/TDC modules, which need not be initialized in a special way. Since the TOF crates are logically a special case anyhow, this approach does not seem to break the general concept.
Also the EPICS system itself has improved during the last few years. Today a complete system, including the parts that were formerly only implemented for the VxWorks platform, is available for the Linux operating system. So the nicest solution would be to stick with EPICS, but abandon the Portable Channel Access Library. This would be possible if one could port the IOC part of EPICS to LynxOS. However, first tests have shown that this is not possible, at least with the very old C++ compiler of LynxOS 2.5. So this would mean quite some effort, like installing a cross compiler or making major changes to the IOC source. But since the gain of this step would be considerable, it may be worth a try.
In the planned revision of the run control, monitoring should get the first priority. For the rest of the run control tasks there is an existing solution in place, but the monitoring is almost completely missing. Additionally, the most important aspects of EPICS, user interface and interface to electronics can be tested on the complete system without interrupting the normal operation. And last but not least, the control part of the run control system itself depends heavily on reliable monitoring. For example, some big inefficiencies in the current system had to be tolerated, just because it was not possible to figure out the current state of the readout electronics.
This approach shall lead to a development path where parts of the new run control system can be used at a very early stage, while the rest of the functionality can still be covered by the old, script based system. So technical problems of the real world will hopefully become visible already at this early stage. The users can get their hands on the solution, comment on it and give hints for improvement.
One major request to the run control was to ensure the traceability of the parameter set used during the measurement. The parameter set in effect will be requested by the analysis program.
The current implementation stores all parameters in the HADES Oracle database. Mass data like DSP/FPGA programs, thresholds for RICH or MDC etc. are held in files that are accessed via NFS. Only the file names are in the database.
Since the front end CPUs cannot access the Oracle DB directly, they use EPICS to fetch and store their parameters. A gateway program (Parameter Server) forwards the requests to the database. To keep the run control implementation somewhat independent from the underlying parameter storage and to allow test environments without EPICS and Oracle support, an interface library (allParam) is used. It allows parameters to be requested from Oracle, EPICS or text files.
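The idea of such an interface library can be sketched as follows. This is not the allParam API, whose function names are not given here; the classes `TextSource`, `DictSource` and `Params`, the parameter names and the text file layout are all assumptions for illustration. The calling code only sees `Params`, so the storage behind it can be exchanged without touching the caller.

```python
class TextSource:
    """Parameters from 'name value' lines; '#' starts a comment (assumed format)."""

    def __init__(self, text):
        self._values = {}
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            name, _, value = line.partition(" ")
            self._values[name] = value.strip()

    def get(self, name):
        return self._values[name]


class DictSource:
    """Stand-in for a database or EPICS backend in this sketch."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name):
        return self._values[name]


class Params:
    """Front end used by the caller; independent of where the values live."""

    def __init__(self, source):
        self._source = source

    def get_int(self, name):
        # Base 0 accepts decimal and prefixed forms such as 0x10.
        return int(str(self._source.get(name)), 0)

    def get_str(self, name):
        return str(self._source.get(name))
```

A test environment would construct `Params(TextSource(...))`, while the production setup would pass a source that talks to the database or to the gateway; the rest of the code stays the same.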
The entries in the DB are split into about 100 tables to represent the entity properties correctly in the relations. This allows extensive use of checks and easy access to the parameters from the analysis programs. The DAQ itself uses only a simple name value pair concept for the parameter access. As a consequence, a name value pair table has to be computed from the underlying data structure whenever a parameter changes.
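The step from structured tables to name value pairs can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for Oracle; the table names `crate`, `crate_param` and `name_value` and the `crate.key` naming scheme are invented for this example and do not reflect the real HADES schema.

```python
import sqlite3


def build_pairs(conn):
    """Recompute the flat name/value table from the structured tables."""
    conn.execute("DELETE FROM name_value")
    rows = conn.execute(
        "SELECT c.name, p.key, p.value FROM crate c "
        "JOIN crate_param p ON p.crate_id = c.id ORDER BY c.name, p.key"
    )
    conn.executemany(
        "INSERT INTO name_value(name, value) VALUES (?, ?)",
        [(f"{crate}.{key}", value) for crate, key, value in rows],
    )
    conn.commit()


def demo():
    """Create a tiny example database and derive its name/value table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE crate(id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE crate_param(
            crate_id INTEGER NOT NULL REFERENCES crate(id),
            key TEXT NOT NULL,
            value TEXT NOT NULL,
            PRIMARY KEY (crate_id, key));
        CREATE TABLE name_value(name TEXT PRIMARY KEY, value TEXT NOT NULL);
        INSERT INTO crate VALUES (1, 'tof0'), (2, 'rich0');
        INSERT INTO crate_param VALUES
            (1, 'threshold', '16'), (2, 'program', 'dsp.bin');
    """)
    build_pairs(conn)
    return conn
```

The structured tables keep their keys and constraints for the analysis side, while the DAQ side reads only `name_value`. The cost is that `build_pairs` has to run again after every change to the structured tables.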
All currently used parameters are available in the database, and the access to the database works both directly and via EPICS. The database contents can be read and set via the GUI.
Performance problems are encountered in two places. The calculation of name value pairs from the database tables takes about one minute. So in the worst case, after a parameter change, the next operation is possible only after waiting that long. The fetching of parameters via EPICS and the parameter gateway server is at least suspected of suffering from a performance limitation¹.
The whole system of tables and name value pair views is considered to be too complicated by most people.
After talking to several people, the impression is that the problem of relational tables on one side and name value pairs on the other side cannot be solved easily. The application that uses the stored parameters after the measurement, i.e. the analysis, will need a proper representation. It will not use the name value pairs. So one of the two sides has to do the conversion.
For several reasons, for example data consistency, the current system is the technically more elegant solution. The complexity may be reduced quantitatively by reducing the number of tables. Qualitatively, no major changes seem possible. The crucial question is whether the performance can be brought to an acceptable level by optimization and by implementing a more aggressive caching concept. So some effort should be put into these areas before abandoning the whole concept.
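One possible form of such a caching concept, given here only as a sketch under assumptions, is to keep the computed name value pairs and to recompute them only when the underlying data has really changed. The `PairCache` class and the idea of a version number that is incremented on every parameter change are hypothetical; the current system is not known to provide such a number.

```python
class PairCache:
    """Keep computed name/value pairs until the source data changes."""

    def __init__(self, compute, current_version):
        self._compute = compute                  # expensive recomputation
        self._current_version = current_version  # cheap change indicator
        self._version = None
        self._pairs = None
        self.recomputations = 0

    def pairs(self):
        version = self._current_version()
        if self._pairs is None or version != self._version:
            self._pairs = self._compute()
            self._version = version
            self.recomputations += 1
        return self._pairs
```

With such a scheme the one minute calculation would be paid once per parameter change instead of once per request, provided that reading the version is cheap compared to the calculation itself.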
If no improvement can be gained, the run control will have to store all parameters in simple name value pair tables. In this case, database features will be used only to a minimal extent. So an even more radical approach comes into view again: the parameters may be stored in text files using a system that enforces version control. The text files could be imported into the DB offline.
The performance problem of the parameter gateway server and possibly of the whole EPICS parameter solution can be avoided by using a simpler system for parameter transport. While EPICS has its virtues when monitoring parameters, it is a little bit of overkill if only a simple number or string shall be transported. The run control anyhow uses the parameter library, which is completely independent from EPICS. So by converting this library into a client/server version, which seems to be no major technical effort, a considerable overhead could be saved. The downside is that an extra technology for the network transport of the parameters will be necessary. Still, this direction of development seems promising.
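How little is needed for such a transport can be seen in the following sketch. The line protocol (`GET <name>` answered by `OK <value>` or `ERR ...`), the `serve` function and the `ParamClient` class are invented for this example and are not part of the existing parameter library. A real setup would accept TCP connections on the server side; here a connected socket pair is enough to show the exchange.

```python
import socket
import threading


def serve(conn, params):
    """Answer 'GET <name>' lines on one connection until the peer closes it."""
    with conn, \
            conn.makefile("r", encoding="utf-8") as rfile, \
            conn.makefile("w", encoding="utf-8") as wfile:
        for line in rfile:
            parts = line.split()
            if len(parts) == 2 and parts[0] == "GET" and parts[1] in params:
                wfile.write(f"OK {params[parts[1]]}\n")
            else:
                wfile.write("ERR unknown request\n")
            wfile.flush()


class ParamClient:
    """Client side of the hypothetical line protocol."""

    def __init__(self, conn):
        self._conn = conn
        self._rfile = conn.makefile("r", encoding="utf-8")
        self._wfile = conn.makefile("w", encoding="utf-8")

    def get(self, name):
        self._wfile.write(f"GET {name}\n")
        self._wfile.flush()
        reply = self._rfile.readline().rstrip("\n")
        status, _, value = reply.partition(" ")
        if status != "OK":
            raise KeyError(name)
        return value

    def close(self):
        self._wfile.close()
        self._rfile.close()
        self._conn.close()


def start_demo_server(params):
    """Run the server in a thread and return (thread, connected client)."""
    server_side, client_side = socket.socketpair()
    thread = threading.Thread(target=serve, args=(server_side, params), daemon=True)
    thread.start()
    return thread, ParamClient(client_side)
```

The server side would wrap the existing parameter library, so that the front end CPUs need neither Oracle nor EPICS to obtain their values.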
Additional functions like reconfiguration of the soft- and hardware or elaborate debugging may be implemented, but on a lower priority and only if they do not complicate the job of the shift crew.
MEDM. This will disallow dynamic reconfiguration of the GUI, but will bring the monitoring features of EPICS into the hands of the user. Also for normal operation, such a simple solution is probably the best fit.
Completely unsolved in such a scenario is the question of how the reconfiguration of parameters in the database shall be done. It would be very nice to have it integrated in the MEDM GUI, but up to now it is unclear whether this is possible. On the other hand, some operations can be removed from the user interface completely. For example, the start of the run control agents and the initialization of the front end electronics shall be done automatically during the boot procedure of the front end CPU.
If EPICS is abandoned, the question of the UI becomes a very hard one. In that case, a detailed investigation of the available solutions for operating and monitoring a distributed system seems necessary.
Since most technical questions have been solved during the development done so far, this experience and expertise can be used. So the revision of the run control system, while still requiring some effort, seems to be possible. The concept and core part of the current run control development shall be kept as the base for further development. In particular, EPICS shall be kept as the basic building block.
As a proof of this concept, first the two most problematic technical areas, i.e. the E7 EPICS agent and the EPICS parameter server, shall be replaced by a more lightweight solution, as was proposed by B. Sailer some time ago. If this proves to solve the current technical problems, the development of agents and GUI shall put the priority on monitoring. This version of the system shall immediately be used under realistic conditions, i.e. in an experiment.
In a second step, the database design shall be revised and the parameter and control part of the run control shall be put into operation. During this phase, a reimplementation of the agent using the EPICS IOC under Linux may also be considered.
In general the further development shall be started in small steps that can immediately be used in parallel to the existing scripts. When the development reaches the areas of control and parameter access, a major switch from one system to the other cannot be avoided any more. Therefore the setup of a parallel test bed is absolutely necessary, so that the run control development can proceed under stable conditions².
The described procedure imposes quite some effort in terms of work to be done. In particular, the changed priority on monitoring will put the electronics developers in charge again. Support has been offered in the areas of database and GUI.
All in all, based on the work done in the development of the current system, a run control for the HADES system seems to be reachable by revision and in some areas reimplementation of the available code.
¹ A faster version is implemented, but not thoroughly tested.
² One of the most time consuming tasks in the current development was to resynchronize the new run control configuration with the changed HADES setup. Regression testing was almost impossible and took at least hours of preparation.