Broken

In case anyone should be wondering why I didn’t continue my project to install Gentoo: Well… Unfortunately that slug is broken.

USB seems to be damaged in some way as it reports nothing but rubbish to the kernel and something causes bad noises to come from the beeper. Some forum threads pointed out that this might be caused by a bad power supply. However, changing the AC adapter didn’t make it run any better in my case. Since I wasn’t able to flash the original firmware correctly (it was just too obvious that I had manipulated that box) I decided to open it up and take a look at its PCB.

Unfortunately I can’t see any physical damage on that board so I must assume there’s some hidden defect somewhere. I doubt I can fix it, so I must consider that box to be broken after just 4 months. After all I have read I must advise everyone NOT to buy an NSLU2, unless you are up for some soldering on the board. The hardware fails much too often from what I could find, be it from overheating, manufacturing defects or just design errors (like the fancy “feature” that you can power the box solely by an external USB hub). Sadly, if I cannot find the error or a workaround for it, I may not continue this project since I am not willing to pay full price for another box that’s broken by design.

Either I buy one very, very cheap on eBay or I will just let it die. Maybe someday I will be able to fix it; my studies might be of some help at some later point.

Anyway, this project was not a complete waste of time. Much of the knowledge I gained can be helpful in getting Gentoo onto the Pandora when it finally arrives (which unfortunately may take another 3-4 months depending on LCD production since I assume I’m somewhere in the last third of the preorder queue).

msleep() vs. mdelay()

Arrrrgh…. I finally got it….

For months I have been trying to figure out why my cross-compiled kernel and rootfs won’t boot on the NSLU2 aka Slug. Since I didn’t want to solder pins for the serial I/O onto the board I tried a quite unique approach: Since the Slug has 4 LEDs that can be easily controlled through GPIO, I could hook up at some function and send all output through the LEDs. Sure this would be slow (~1 character per second, maybe 1.5) but that would be enough to at least read kernel panic messages while retaining full warranty on the hardware.

Finding out how the LEDs were controlled took me about 1.5 hours, hooking up and testing the first things another hour. I hooked in before the call of init (disk 2 on), right after init (disk 1 on) and on kernel panics (both disk LEDs on). So what happened: It got to init and then crashed with a panic after a few seconds. I figured out I could hook in at uart_console_write() or panic() and then read any output by blinking a byte in 4 steps (one bit on each disk LED, signal indication by setting power to green and “clock” indication through power amber). Well, it started blinking for hours and hours and hours… But all I could decode was just infinite rubbish, no matter what I tried. Even a comparison i[…] That was the critical fault.
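To make the blinking scheme a bit more concrete, here is a minimal user-space sketch of how one byte could be split into 4 steps of two bits each, one bit per disk LED. This is not the code that ran on the Slug: the names (led_frame, byte_to_frames), the bit order (most significant pair first) and the assignment of bits to disk 1 and disk 2 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAMES_PER_BYTE 4

/* State of the two disk LEDs during one clock step (hypothetical layout). */
struct led_frame {
    uint8_t disk1;  /* assumed to carry the low bit of each 2-bit pair */
    uint8_t disk2;  /* assumed to carry the high bit of each 2-bit pair */
};

/*
 * Split one byte into four 2-bit frames.
 * Assumption: the most significant pair is sent first.
 */
void byte_to_frames(uint8_t byte, struct led_frame frames[FRAMES_PER_BYTE])
{
    assert(frames != NULL);

    for (int step = 0; step < FRAMES_PER_BYTE; step++) {
        unsigned shift = 6u - 2u * (unsigned)step;   /* 6, 4, 2, 0 */
        unsigned pair = (byte >> shift) & 0x3u;

        frames[step].disk1 = (uint8_t)(pair & 0x1u);
        frames[step].disk2 = (uint8_t)((pair >> 1) & 0x1u);
    }
}
```

In a real hook, each frame would be written to the LED GPIOs, the amber power LED would be toggled as the “clock”, and a blocking delay would follow before the next step.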

msleep() seems to suspend the currently running task, so it is non-blocking regarding the whole system.

mdelay() blocks the system (or at least the active CPU) if running single-threaded.

So why was that small change critical to my code? I don’t know exactly. But I know that panic() disables scheduling before any further action. So what happens if some code fragment used by panic() tries to rely on scheduling? Something seems to get corrupted very seriously, maybe some kind of heap or stack overflow happens. Maybe some process/scheduler data gets screwed up. I don’t know. But that seemingly tiny difference of blocking vs. non-blocking functions (which function does what isn’t always that clear if you’re new to Linux kernel programming) really makes a very big difference.
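The difference is easier to see in a small user-space analogy. The sketch below is not kernel code, and the helper names (now_ns, busy_wait_ms, sleep_ms) are made up for this example: busy_wait_ms() spins on the clock the way mdelay() keeps the CPU busy, while sleep_ms() hands control to the scheduler the way msleep() does. Both wait the same amount of time, but only the second one depends on a scheduler that is allowed to run.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Rough analogue of mdelay(): spin until the deadline, never yielding. */
void busy_wait_ms(long ms)
{
    assert(ms >= 0);
    int64_t deadline = now_ns() + (int64_t)ms * 1000000LL;

    while (now_ns() < deadline)
        ;  /* burn CPU cycles; no scheduler involvement needed */
}

/* Rough analogue of msleep(): ask the scheduler to suspend this task. */
void sleep_ms(long ms)
{
    assert(ms >= 0);
    struct timespec req = { .tv_sec = ms / 1000,
                            .tv_nsec = (ms % 1000) * 1000000L };
    struct timespec rem;

    /* Resume the sleep if a signal interrupts it. */
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;
}
```

In a context where scheduling is off limits, such as the panic() path described above, only the busy-waiting kind of delay is an option.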

I finally recorded my kernel panic on DV tape and will decode it tomorrow using a simple tool I wrote. Once it is finished I will make it available on this website.
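For illustration only, the core of such a decoder could look like the sketch below. It is not the actual tool: it assumes the LED states have already been read off the recording as 2-bit values (0 to 3), that four of them make up one byte, and that the most significant pair comes first. The function name decode_pairs is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rebuild text from a stream of 2-bit samples, four samples per byte.
 * Assumption: the most significant pair of each byte was sent first.
 * Writes a NUL-terminated string to out and returns the number of
 * characters decoded. Incomplete trailing groups are dropped.
 */
size_t decode_pairs(const uint8_t *pairs, size_t n_pairs,
                    char *out, size_t out_size)
{
    assert(pairs != NULL && out != NULL && out_size > 0);

    size_t n_bytes = 0;
    for (size_t i = 0; i + 4 <= n_pairs && n_bytes + 1 < out_size; i += 4) {
        uint8_t byte = 0;
        for (size_t k = 0; k < 4; k++)
            byte = (uint8_t)((byte << 2) | (pairs[i + k] & 0x3u));

        /* Mark non-printable bytes so read errors stay visible. */
        out[n_bytes++] = (byte >= 0x20 && byte < 0x7f) ? (char)byte : '?';
    }
    out[n_bytes] = '\0';
    return n_bytes;
}
```

Replacing unreadable bytes with '?' keeps the output aligned with the recording, which helps when only parts of a long message are legible.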

Here’s a small excerpt from the ~20-minute message transmitting “BUG: sched[…]” (I have not decoded more than that yet):

Edit: I decoded all 20 minutes. What was readable (the decoder is still a quick and dirty solution) led me to “BUG: scheduling while atomic:” in the kernel source, which was simply caused by a remaining msleep in my LED function. I got rid of it and now get a clean message: “Attempted to kill init!”. Now, that’s where the debugging begins…

Slug (Linksys NSLU2)

After I read about a quite inexpensive (about 65 to 75€) embedded system on a forum two weeks ago, I needed to get one of these myself. The system has two USB host ports and an Ethernet interface, 32MB SDRAM, 8MB flash memory and a 266MHz ARM (Intel XScale) CPU (underclocked to 133MHz in units produced until mid 2006). It’s running Linux with a modified RedBoot bootloader. It was originally intended to be a NAS server for USB hard drives but can be flashed with different Linux kernels and images. Unfortunately it’s already getting old (first released in 2004) and was reported to be discontinued so I had to decide to buy it now or never. I bought it:

If that device is completely new to you, the article on Wikipedia (en/de) may provide a good starting point for more information on what is possible. If you get interested in it, nslu2-linux.org provides a great resource to answer almost all your questions.

My goal is to get Asterisk, DHCP, DNS, OpenVPN and maybe a small webserver to run on it. However I haven’t reached that yet.