Article Title           : mop_server_bug
Creation Date           : unknown       
Author                  : NCD Technical Support
Last Update             : November 11, 1992
Location	        : NCD-Articles/Host_Systems/DEC_VMS
Expiration Rules        : Applies to VMS and NCDware until further notice.
Location		: NCD-Articles/Host_Systems/VMS
=============================================================================

This article discusses a situation in which NCDs may fail to boot
reliably in a VMS/DECnet environment.  I will discuss why these problems
occur, what the source of the problem is, why it is not NCD's problem,
and how to work around it for now.


In a heavily loaded and possibly misconfigured Ethernet network, a "late
collision" can occur: a collision detected after the normal collision window
at the start of a frame has already passed.
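
For reference, the distinction between an ordinary collision and a late one
can be sketched as follows.  This is only an illustration: it assumes 10 Mb/s
Ethernet, where the collision window is 512 bit times (64 bytes), and the
function names are made up for this example.

```python
# Sketch only: classify a collision by how far into the frame it was detected.
# Assumes 10 Mb/s Ethernet, where the collision window (slot time) is
# 512 bit times, i.e. 64 bytes on the wire.
SLOT_TIME_BITS = 512
BIT_RATE_BPS = 10_000_000


def slot_time_microseconds() -> float:
    """Duration of the collision window in microseconds."""
    return SLOT_TIME_BITS / BIT_RATE_BPS * 1_000_000


def is_late_collision(bits_sent: int) -> bool:
    """True if the collision was detected after the collision window."""
    return bits_sent > SLOT_TIME_BITS
```

A station that has sent only a few hundred bits when the collision is
detected simply backs off and retransmits; a collision detected thousands of
bits into the frame is a late one and is reported to the software as an error.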

If the VMS MOP server software ever experiences a late collision while
processing a MOP-load action, it issues a "datacheck error" message in a
form similar to

	Circuit SVA-0. Load. Line communication error
	%SYSTEM-F-DATACHECK. write check error
	Node = 33.510 (BAEX11). File = MOM$LOAD:XNCD15B.SYS

and then it exits.

Once it exits, the NCD's boot monitor has no one to talk to.  What happens
next depends on the version of the boot monitor:

In boot monitor versions before 2.3.1, if the boot monitor missed a packet
in the middle of the MOP stream, it would abandon the load attempt and make
up to 10 retries for a new MOP load.  This was considered a bug, because more
often the NCD misses a MOP packet for some other reason (such as when the
packet is dropped by a bridge) while the MOP server is still present and
ready to continue loading.  In that situation the load would pause for a
period of time, during which the MOP server answered each MOP load retry
from the boot monitor with a "line protocol error".  Ultimately the boot
monitor would run out of retries.

In boot monitor version 2.3.1, and presumably later versions, the boot
monitor was corrected so that if it does not receive a MOP packet during a
load, it simply requests the current packet again, up to 10 retries.  This is
considered the right thing to do, and it interacts properly with the MOP
server under most circumstances, such as dropped packets.
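
The difference between the two retry strategies can be sketched roughly as
follows.  This is not the boot monitor's actual code; the function name, the
action names, and the use of packet number 0 for the start of a load are
assumptions made for the illustration.  Only the limit of 10 retries and the
2.3.1 version boundary come from the description above.

```python
MAX_RETRIES = 10  # retry limit described in this article


def request_after_timeout(version, current_packet, retries_used):
    """Return (action, packet_number) for the next request after a timeout.

    version is a tuple such as (2, 3, 1).  The action names are hypothetical
    labels, not real MOP message names.
    """
    if retries_used >= MAX_RETRIES:
        return ("give_up", None)
    if version < (2, 3, 1):
        # Older monitors abandon the load and ask for a new one from the start.
        return ("new_mop_load", 0)
    # 2.3.1 monitors ask again for the packet they were waiting for.
    return ("request_current_packet", current_packet)
```

For example, a hypothetical pre-2.3.1 monitor that times out while waiting for
packet 57 starts over from the beginning, while a 2.3.1 monitor asks for
packet 57 again.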

However, the MOP server software on the VAX behaves incorrectly in the case
of a late collision: it should not exit, but should instead either
retransmit the current packet or wait for the receiver to request the current
packet again.

So in the case of a late collision, pre-2.3.1 boot monitors restart the load
at the beginning.  This works because the MOP server has erroneously exited,
so the request for a MOP load is handled by the NML layer, which spawns a new
MOP server process, and the load proceeds.

In the case of a late collision with 2.3.1 boot monitors, however, the boot
monitor requests the current packet again, and there is no MOP server to
receive the request.  The NML layer gets confused because it is not written
to handle the "current" packet request, and reports errors in a form similar
to

	Circuit SVA-0. Line open error. Line protocol error 
	%MOM-E-BADMOPFCT. Bad MOP function received from target
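
The four combinations described above (old or new boot monitor, MOP server
still running or already exited) can be summarized in a small sketch.  It
models only the behavior reported in this article; the function names, the
result strings, and the example version (2, 2, 0) are hypothetical.

```python
def monitor_request(version):
    """Request the boot monitor sends after missing a packet (hypothetical names)."""
    return "new_mop_load" if version < (2, 3, 1) else "request_current_packet"


def host_reply(server_running, request):
    """How the VMS host reacts, as described in this article."""
    if server_running:
        # Dropped-packet case: the MOP server is still there.
        if request == "request_current_packet":
            return "packet resent, load continues"
        return "line protocol error"
    # Late-collision case: the MOP server has exited, so NML sees the request.
    if request == "new_mop_load":
        return "NML spawns new MOP server, load restarts"
    return "BADMOPFCT, load fails"


def outcome(version, server_running):
    """Combine the two sides for one missed packet."""
    return host_reply(server_running, monitor_request(version))
```

Calling outcome((2, 3, 1), False) gives the failing case discussed in this
article, while outcome((2, 2, 0), False) shows why older boot monitors happen
to recover from a late collision.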



NCD feels that the 2.3.1 boot monitor is correct and that the MOP server's
behavior of exiting in the event of a late collision is incorrect.  We also
note that the ULTRIX MOP server does not seem to have the same problem, which
reinforces the idea that the VMS MOP server is incorrect.

One of our customers feels the same way, and it was through this customer
that the problem was reported to Digital.  We feel that the problem will be
fixed sooner if appropriate pressure is applied to Digital.  If any of your
customers are experiencing anything like this, please have them complain to
DEC and cite the DSNlink reference number C911018-1955, under which this
problem report is recorded.


How to Work Around the Problem Until DEC Fixes It
-------------------------------------------------

	-  Disable all the TCP/IP boot states using remote configuration.
	-  Then set "Boot-retry-forever" with remote configuration.

This will have the effect of telling the boot monitor to retry with a new MOP
load request after it has timed out on the 10 current-packet requests.
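
The effect of the workaround can be sketched as a simple loop.  This is an
illustration only, not the boot monitor's actual logic: the function name, the
log strings, and the way late collisions are supplied as a list are all
assumptions.  It also assumes the TCP/IP boot states are already disabled, so
that MOP is the only load method tried.

```python
CURRENT_PACKET_RETRIES = 10  # current-packet requests before timing out


def boot(retry_forever, collisions):
    """Simulate MOP load attempts.

    collisions holds one boolean per load attempt: True means a late
    collision made the VMS MOP server exit during that attempt.
    Returns (booted, log).
    """
    log = []
    for attempt, collided in enumerate(collisions, start=1):
        log.append(f"attempt {attempt}: new MOP load request")
        if not collided:
            log.append("load complete")
            return True, log
        log.append(
            f"server exited; {CURRENT_PACKET_RETRIES} current-packet requests time out"
        )
        if not retry_forever:
            log.append("giving up")
            return False, log
    return False, log
```

With retry_forever set to False, a single late collision ends the boot
attempt.  With it set to True, the next pass through the loop issues a new
MOP load request, which NML can hand to a freshly spawned MOP server.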


=============================================================================

This is from the DEC documentation:
-----------------------------------
Errors

  DATACHECK

    write check error

       Facility: SYSTEM, VMS System Services

       Another possible explanation is that the Ethernet hardware
       detected a late collision on a write request.

       User Action: Note the condition. If necessary, modify the source
       program to detect and respond to the condition. This message
       may indicate a hardware error. Check the Ethernet cable and all
       Ethernet controllers for proper operation. If the error persists,
       notify the system manager.
