Will a processor crash stop interrupts being serviced?

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

Hi Folks,

Working on a device at the moment that keeps locking up. (Processor is an 18F26K80).

I'd originally enabled a watchdog on the devices, but once we'd built 24 of them - we noticed that they appeared to be resetting at spurious times.

Having increased the watchdog delay considerably, they appear to be crashing (or at least the main code loop has stopped executing).

However - my interrupt service routines are still running. (I know this because I'm now toggling a pin in my main loop, and another in the timer ISR.) The main loop has stopped, and the timer is still going. (Not to mention one of them is moving a motor a set number of steps - and the motor will move this number of steps, even after a crash.)

I'm trying to work out if the code has just got stuck somewhere in a loop, or the processor has crashed.

If a processor has crashed (due to stack or other) will the Interrupts stop being handled?

Cheers

James

RF_Developer · Joined: 07 Feb 2011 Posts: 839

My thoughts, as a lot of this set alarm bells ringing...

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

Hi,

Thanks for the tips - I will, at this juncture, point out that I have a degree in digital electronics, run my own electronics business, and have been programming embedded software and device drivers for over 20 years! (on PIC, STM32 & VxWorks!)

With regard to your concerns about an ISR driving a stepper motor: this is a simple timer ISR, running relatively slowly, that checks a long int value. If the value is greater than 0, it decrements the value by 1 and toggles the motor clock pin.
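A minimal sketch of the ISR described above, written as plain C so it can be run on a PC. The names steps_remaining, motor_clock_pin, timer_isr() and request_steps() are illustrative assumptions, not taken from the actual firmware. It shows why the motor finishes its move even when the main loop has stopped: once the count is loaded, only the ISR touches it.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical names: nothing here is taken from the original firmware. */
static volatile uint32_t steps_remaining = 0;  /* loaded by main, consumed by the ISR */
static volatile bool motor_clock_pin = false;  /* stand-in for the real output latch */

/* Body of the periodic timer interrupt: one pin toggle per pending count. */
void timer_isr(void)
{
    if (steps_remaining > 0) {
        steps_remaining--;
        motor_clock_pin = !motor_clock_pin;
    }
}

/* Main-line request. On an 8-bit PIC a 32-bit write is not atomic, so real
   code would disable the timer interrupt around this assignment. */
void request_steps(uint32_t n)
{
    steps_remaining = n;
}
```

After request_steps(n), the main loop plays no further part in the move, which matches the behaviour seen on the locked-up units.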

There are NO delay ms lines, anywhere in the code - this is running flat out.

There is a standard main loop in the code - but this main loop is stopping executing - the question I have asked is pretty simple, and is aimed purely at tracking down the bug.

If for whatever reason my code is crashing the processor, will it disappear off with the pixies, and stop servicing ISR's?

My code is still servicing the five ISR's in use (timer1, timer2, RDA1, RDA2 & TBE2 - this I have verified by toggling hardware lines in various ISR's). But my main loop has stopped looping.

I am trying to track down whether the code has not actually crashed, but is stuck somewhere in a sub loop. This will give me a bit more of a pointer.

Restart cause does not give me an answer, as in order to get this result I have to manually restart the processor - and so get a normal power up / MCLR_FROM_RUN result code.

Sorry for the confusion

James

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

Just to also add to the confusion, the PIC18F26K80 shares its second USART (which is used to supply status information to the user) with PGC and PGD.

Hence if I enable the debugger, I won't be testing a true system.

Mike Walne · Joined: 19 Feb 2004 Posts: 1785 Location: Boston Spa UK

OK, you think that your main() code is getting lost.

I'm assuming that your main() executes a series of functions in a tight loop.

In that case make each of your functions generate a unique series of pulses on any spare I/O pins you may have.

That way you should be able to track which functions are operating and at what stage it all stops.

Once you know which function performed last, you can go down a level to its sub-functions.
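A rough sketch of the pulse-marker idea, assuming one spare pin and a hypothetical debug_pin_write() hook. On real hardware that hook would write the pin's output latch; here it only counts rising edges so the sketch can run on a PC. Each function is given its own id and emits that many pulses.

```c
#include <stdint.h>

/* Hypothetical hardware hook: a real build would drive a spare I/O pin. */
static unsigned rising_edges = 0;
static uint8_t pin_state = 0;

static void debug_pin_write(uint8_t level)
{
    if (level && !pin_state) {
        rising_edges++;          /* count low-to-high transitions */
    }
    pin_state = level;
}

/* Emit `id` short pulses: function 1 gives one pulse, function 2 two, ... */
void trace_mark(uint8_t id)
{
    while (id--) {
        debug_pin_write(1);
        debug_pin_write(0);
    }
}
```

Calling trace_mark(1) at the top of the first function, trace_mark(2) at the top of the second, and so on leaves the last burst on the 'scope identifying the last function that started.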

As I understand it, the ISR's will operate until you disable them. So, if you're stuck in a loop (with or without pixies) the ISR's should still work.

Mike

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

Hi Mike,

At this current moment - I'm not sure, and as is typical the unit stops working once in a blue moon (It stopped at some point overnight).

In a nutshell the unit is doing the following
- Controlling a set of clock hands using 2 stepper motors
- reading standard NMEA0183 packets from a GPS receiver into the serial port
- Decoding the packet, doing a shed load of maths on it to convert the time from UTC into local time with summertime correction on it.
- moving the hands to where they should be, and keeping track of time.
- Sending status information back to the user on uart 2

As the crash occurs so infrequently, I'm trying to work out if this is a major system crash (div by 0, program counter error, stack overflow etc.) or if it is just sitting somewhere waiting for something to happen that hasn't. (I have been through the code, and can't see any obvious places.)

If we've actually crashed for some reason, will the interrupts stop happening?

Thanks

James

temtronic · Posted: Fri Jul 06, 2012 5:03 am

Mike's right on here... take a page from the POST program that every PC has. Set a set of LEDs (or send a terminal a msg) to indicate which step of 'main' is executing. When it fails, it is the NEXT function that has the problem. 'Old school' diagnostics, but it works well.

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

The two spare pins on the processor are now wired, programmed and ready to light the way.

Downside - it's been running now for 20 minutes, and hasn't missed a beat! This could take a while.

Thanks for your help chaps.

RF_Developer · Joined: 07 Feb 2011 Posts: 839

As this is a GPS synced clock you need it to run 24/7/52.

Mike Walne · Joined: 19 Feb 2004 Posts: 1785 Location: Boston Spa UK

Don't know if this will help.

In the dim and distant past I had to diagnose a rare intermittent on a UART. It was only known that there was a problem after the event. A digital 'scope was set up in pre-trigger mode. The 'scope was triggered after a faulty transmission. In that case the trace data (all of it before the trigger) was then saved. Only the data for faulty transmissions was recorded for analysis.

You could do something similar. Put your watchdog back in. Send your diagnostic LED data to a 'scope set in pre-trigger mode (i.e. trigger set to right-hand side of screen). Use the watchdog restart to trigger the 'scope and save or analyse the pre-fault traces.

[ Or you could use the USART, on a temporary basis, to send out an ASCII character as it enters & leaves each function, and a sensible message on re-start. Save the messages to a PC. You then only have to trawl through looking for the characters ahead of the restart messages.]
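A small sketch of the enter/leave character trace, under stated assumptions: trace_putc() is a stand-in that appends to a RAM log so the example runs on a PC (a real build would send the character out of the USART), and decode_packet() and move_hands() are hypothetical stage names. Upper case marks entry, lower case marks exit.

```c
#include <stddef.h>

/* Stand-in for the UART: a real build would transmit the character. */
static char trace_log[64];
static size_t trace_len = 0;

static void trace_putc(char c)
{
    if (trace_len < sizeof trace_log - 1) {
        trace_log[trace_len++] = c;
        trace_log[trace_len] = '\0';
    }
}

/* Upper case on entry, lower case on exit, one letter per function. */
#define TRACE_ENTER(letter) trace_putc(letter)
#define TRACE_LEAVE(letter) trace_putc((char)((letter) + ('a' - 'A')))

/* Hypothetical main-loop stages. */
static void decode_packet(void) { TRACE_ENTER('D'); /* ... work ... */ TRACE_LEAVE('D'); }
static void move_hands(void)    { TRACE_ENTER('M'); /* ... work ... */ TRACE_LEAVE('M'); }

void main_loop_once(void)
{
    decode_packet();
    move_hands();
}
```

In a captured log, an upper-case letter with no matching lower-case letter before the restart message points at the function that never returned.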

If the problem is with parsing, try sending your own messages at a higher than normal rate to speed up testing. You could include difficult and/or garbage messages to test the handling.

Mike

SherpaDoug · Joined: 07 Sep 2003 Posts: 1640 Location: Cape Cod Mass USA

James, as another degreed engineer with decades of embedded development experience I think your problem is mainly one of mindset.

The word "crash" as applied to microcontrollers is too vague and should be banished from this conversation. The things that stop a uC from executing its program as it sees it, such as loss of VCC, bad solder joints, thermal fracture of the die, die mask errors, etc. are rare and don't seem relevant to this problem. Even bits flipped by cosmic rays don't stop the processor from running something.

If your PIC is responding to interrupts it must have VCC and be clocking. So it is running some code somewhere. You just have to find out where. If it has run off the end of main() it will be stuck eternally executing sleep. If there is a hardware Reset issue it will be running initialization code between resets. Note interrupts generally default to enabled so you could still execute interrupts but never get into main(). The two spare pins are what you need to find where your PIC is going. And of course it is always hard to fix something that won't stay broken!

Your PIC is running some code somewhere. You just need to find out where.
_________________
The search for better is endless. Instead simply find very good and get the job done.

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

Hi Folks,

2 Hours in - and I know where the crash isn't (the area I thought it would be - the packet decoding of the satellite data, and subsequent maths).

I've now moved the pins, and am re-running the code.

Max stack usage 9 out of 31, max RAM usage 27%.

I am only actually using the satellite to update the time on power up, and at midnight. From experience, as the satellites move around the Earth, there are times of the day when you may lose a valid fix, and I don't want the clock hands to stop moving at this point. The satellite packet is used to update a real time clock running on a quartz crystal. The quartz is used to calculate the position of the hands.

The reason I only update at midnight is that the NMEA packet is only really accurate to +/-1 second, especially as the GPS just squirts packets out. As a quartz crystal can drift a bit, it's possible that the quartz moves to second X+1 (and hence advances the second hand), the packet then comes in and sets the time back to X, and then the quartz again ticks to X+1. You have then moved more seconds than you want, and get hand drift.

Thanks for the help so far.

James

JamesW · Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England

Now this is interesting. Having been playing with the unit all day, it seems to be stopping at roughly the same point, when looking at the RS232 output and my two debug pins.

I had originally been doing it the "proper way", i.e. printing the debug information to a buffer, and using the transmit buffer empty interrupt to empty the buffer and squirt the data out.

About 3 hours ago I removed all of this and replaced it with a simple putc instead, and it seems to be running without incident.

So the golden question is - is there an obvious bug in the code below?

newguy · Joined: 24 Jun 2004 Posts: 1911

You have to be very careful of the TBE interrupt - any small error or vulnerability in your transmit code, no matter how slight, will cause the TBE interrupt to continually fire. Been there, done that, 5 or 6 times now.

From what you describe it sounds like your program isn't properly handling transmit code. I haven't had a really close look at your code, but one thing does stand out: you only transmit if there is a character mismatch, which raises the question: what if they're the same?

The way I usually handle a transmit interrupt is to load a buffer with a message and keep a tally of the characters that need to be transmitted. If a transmit isn't already in progress, start one and enable the TBE interrupt. The TBE interrupt then checks to see if the number of transmitted characters matches the number of characters to be transmitted - if so, it's done and the TBE interrupt must be disabled. If not, fetch the next character in the buffer and load it into the USART, and increment the number of characters transmitted. Just be sure to keep separate copies of critical things like buffer indexes (one for writing, one for reading), and numbers of characters transmitted/to be transmitted. The only "gotcha" is for large arrays which require a 16 bit index for a program running on an 8 bit processor. You need to disable interrupts just before doing any increments/comparisons on the indexes, since this method uses them to enable/disable the TBE interrupt. Since an 8 bit machine can't directly manipulate 16 bits at once, this is a potential trouble spot.
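A minimal sketch of that scheme, not the code from either poster. All names (tx_enqueue, tbe_isr, tbe_int_enabled, uart_tx_reg) are illustrative assumptions, and the interrupt-enable bit and transmit register are modelled as plain variables so it runs on a PC. It uses separate 8-bit read and write indexes, each written by one side only, which sidesteps the 16-bit index problem for buffers of up to 256 bytes.

```c
#include <stdint.h>
#include <stdbool.h>

#define TX_BUF_SIZE 32u   /* small enough that 8-bit indexes suffice */

static volatile uint8_t tx_buf[TX_BUF_SIZE];
static volatile uint8_t tx_head = 0;           /* written only by main-line code */
static volatile uint8_t tx_tail = 0;           /* written only by the ISR */
static volatile bool tbe_int_enabled = false;  /* stand-in for the TBE enable bit */
static uint8_t uart_tx_reg;                    /* stand-in for the USART transmit register */

/* Queue one byte; returns false if the buffer is full. */
bool tx_enqueue(uint8_t c)
{
    uint8_t next = (uint8_t)((tx_head + 1u) % TX_BUF_SIZE);
    if (next == tx_tail) {
        return false;              /* full: drop rather than block */
    }
    tx_buf[tx_head] = c;
    tx_head = next;
    tbe_int_enabled = true;        /* (re)start transmission */
    return true;
}

/* Transmit-buffer-empty ISR body. */
void tbe_isr(void)
{
    if (tx_tail == tx_head) {
        tbe_int_enabled = false;   /* nothing left: disable, or it fires forever */
        return;
    }
    uart_tx_reg = tx_buf[tx_tail];
    tx_tail = (uint8_t)((tx_tail + 1u) % TX_BUF_SIZE);
}
```

The key point is the empty check in the ISR: if the TBE interrupt is left enabled with nothing to send, it re-enters continuously and the main loop never gets any processor time, while the other ISR's carry on as normal.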

asmallri · Joined: 12 Aug 2004 Posts: 1636 Location: Perth, Australia

I had a tricky code lockup situation a few years back where my main code locked up but the ISR's still functioned correctly. The way I tracked the problem was to add an additional switch input to be pressed in the event the application locked up. The ISR handler checked the button and, if pressed, printed out the contents of the stack. This enabled me to find the code that was interrupted by the ISR and therefore where the mainline was looping.
_________________
Regards, Andrew

http://www.brushelectronics.com/software
Home of Ethernet, SD card and Encrypted Serial Bootloaders for PICs!!