Long distance debugging

We’ve now reached the end of the recent embedded development project for our Industrial Control Client.

The final phase was made more complicated by the difficulty of debugging the changes. The embedded hardware had no screen, and the network debug facility that it supported was unreliable; it sometimes just lost messages. So the first step was to work around this issue by sending some debug messages in-line with the normal TCP/IP data channel from the hardware. We had already developed a custom proxy server that sat between the hardware and the server that it communicated with; this dumped the message flow to a file for easy debugging (a bit like having Wireshark traces with a custom protocol decoder, but easier for non-technical staff to manage). We extended this proxy server so that when the hardware connected to it, the proxy sent a message telling the hardware to send debug messages; from then on the hardware could output some of its debug directly to the proxy server. We didn’t need to change the server code that the embedded system connected to, as only the proxy ever enabled the new functionality. The result was that, with the proxy in place, we could get some debug messages out of the code that we were working on and start to see what was going on. This was a start, but less than ideal, because the existing debug messages in the code that we didn’t need to touch still went to the unreliable debug connection…
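To make the in-band debug idea concrete, here is a minimal sketch in Python of the part of a proxy that separates debug messages from normal traffic. It is not the code from the project: the framing (a 1-byte type and a 2-byte length in front of each payload), the message type values and the `DebugSplitter` name are all assumptions made up for illustration, since the real protocol isn't described here.

```python
import struct

# Hypothetical framing, NOT the real protocol:
# 1-byte message type, 2-byte big-endian payload length, then the payload.
HEADER = struct.Struct(">BH")
MSG_DATA = 0x01          # assumed type for normal traffic
MSG_ENABLE_DEBUG = 0x7E  # assumed "please send debug" request
MSG_DEBUG = 0x7F         # assumed in-band debug message


def encode(msg_type, payload=b""):
    """Build one frame in the hypothetical format."""
    return HEADER.pack(msg_type, len(payload)) + payload


class DebugSplitter:
    """Splits the hardware's byte stream into traffic to forward and debug text."""

    def __init__(self):
        self._buffer = bytearray()

    def greeting(self):
        """Frame the proxy sends to the hardware on connect to switch debug on."""
        return encode(MSG_ENABLE_DEBUG)

    def feed(self, chunk):
        """Consume bytes from the hardware.

        Returns (bytes_to_forward, debug_lines). Debug frames are removed from
        the stream, so the real server never sees them. Incomplete frames stay
        buffered until the rest arrives.
        """
        self._buffer += chunk
        forward = bytearray()
        debug_lines = []
        while len(self._buffer) >= HEADER.size:
            msg_type, length = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # wait for the rest of this frame
            frame = bytes(self._buffer[:end])
            del self._buffer[:end]
            if msg_type == MSG_DEBUG:
                text = frame[HEADER.size:].decode("utf-8", errors="replace")
                debug_lines.append(text)
            else:
                forward += frame
        return bytes(forward), debug_lines
```

In a real proxy, `greeting()` would be written to the hardware's socket just after it connects, and each chunk read from the hardware would go through `feed()`, with the first result sent on to the server and the second appended to the log file. Because only the proxy sends the enable request, hardware that connects directly to the server behaves exactly as before.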

Step two was to use the new, reliable debug to determine the problems with the existing, unreliable debug… There were two problems. Firstly, the existing debug didn’t bother with any locking, so multithreaded access sometimes led to messages being lost. Secondly, the only tool that could process the debug connection ran on just one very old server machine, and it ran slowly; it seemed that if the hardware was generating debug messages faster than the tool was reading them, then messages were thrown away. The higher the debug level we set, the more messages were generated and the more likely it was that a subset of those messages would be thrown away. This was a sensible design, but it was frustrating. We looked at the network protocol used by the debug messages and wrote a new tool to read the messages. This tool could run on a newer, faster machine and could set a very large TCP receive buffer to allow more data to be in flight before the sender backed up and started to throw messages away…
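The receive buffer part of that is easy to sketch. The following Python is an illustration only, not the tool we wrote: the function names, the 4 MB default and the idea of dumping raw bytes to a file-like object are assumptions, and decoding of the actual debug protocol is left out entirely.

```python
import socket


def new_debug_socket(rcvbuf_bytes=4 * 1024 * 1024):
    """Create an unconnected TCP socket with a large receive buffer requested.

    The option is set before connect() so that it can influence the TCP
    window negotiated when the connection is set up. The operating system
    may clamp or adjust the requested size, so treat it as a request.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    return sock


def drain(sock, sink, chunk_size=65536):
    """Read from sock until the peer closes, writing everything to sink.

    Returns the number of bytes read. The loop does nothing but recv() and
    write(), so the reader keeps up with the sender as well as it can.
    """
    total = 0
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            return total
        sink.write(chunk)
        total += len(chunk)
```

Usage would be along the lines of creating the socket with `new_debug_socket()`, calling `connect()` with the address of the hardware's debug port, and then calling `drain()` with an open binary file. It is worth checking the size that was actually granted with `getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)`, since system limits can silently reduce it.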

Eventually, we got to a point where we could debug the work we needed to do and finish the project.

This was a fun project and quite different to most of the stuff we’ve been doing recently.