Archive for the ‘English’ Category

Sunday, October 30th, 2016

Let’s Encrypt on Gentoo

I was always a fan of CACert but unfortunately, Gentoo recently decided to no longer trust CACert by default which caused our overlay to become unavailable for a few weeks without us noticing it (thanks for not notifying users about that unexpected move…). Gentoo thus forced us to switch to Let’s Encrypt immediately (or pay for a commercial certificate) if we wanted to keep our repository easily accessible. While we could have issued a LE certficate manually, that method is completely impractical as certificates issued by LE expire after just 90 days, while CACert’s certificates lasted 6 months or even 2 years if you passed the assurance tests.

Choosing an ACME client, aka “bot”

The implied requirement for automated renewal means you have to install a client for the so-called ACME protocol and allow it to generate a key & CSR, submit it to LE, somehow provide a domain-based verification, finally reload your server application and repeat that whole process periodically on its own. As we want to install certificates into an Apache web server but there’s no module to handle LE certificate issuance directly from within Apache yet, we have to use some tool to perform the necessary tasks for us. The first client published by LE themselves (”Certbot” or app-crypt/acme on Gentoo) had a terrible reputation as it was heavily modifying the system it was being installed on. Soon after the initial start last year, multiple alternate implementations became available, more or less system-specific and more or less disruptive in the way they hook into the system.

I wanted to avoid that. The client I would choose for our server should not manipulate any configs on its own and it shouldn’t install extra dependencies outside the regular package management provided by the system (portage on Gentoo). It should just handle certificate creation/registration and renewal, nothing else. And it should have a comprehensive documentation.

Judging from just the documentation, my personal favourites so far have been and acmetool. While would have had the big advantage that it requires basically no (unusual) dependencies as it’s “just a shell script”, that was also the reason why I decided against deploying it on my server. While there’s no rational reason against that script, I, as well as the friend I’m sharing the server with, had a gut instinct that didn’t allow us to trust an externally developed shell script to perform some periodic task in processing external data (i.e. interaction with LE’s ACME server). I therefore decided to try acmetool instead, which has been developed in Go.

This article thus details the installation of acmetool on Gentoo but most steps should apply to other clients as well.


Monday, March 14th, 2016

stale connections via IPv6, or: IPv6 deployment done wrong (on both ends)

Wow, it’s been a very long time since I last published any post. I prepared quite a few in the meantime but never polished them up for publication. However, since this issue appears to be reoccuring right now (March 2016) I’ve decided to finally put it online. Please do not mistake any time-related information in this post for up-to-date, as I last updated this post 2 years ago and I’m going to publish it largely unrevised now.

I originally wrote this post back in March 2014 with full names in it. I hoped that the service I encountered problems with would get fixed but it wasn’t as of May 2014. I still won’t name it directly to avoid any legal trouble but it should be enough to just check your config for these points if you encounter problems with any servers on the Internet. In the meantime, I’ve seen more than just this one service having the same issues.

There’s a web service that caught my interest lately (ehm… back in 2014 ;) ), so it came that I wanted to access it from my home computer having an IPv6 connection via SixXS/NetCologne. To my surprise, I was unable to establish a connection via IPv6, only IPv4. I didn’t find the problem at that moment so I just forced access via IPv4 to work around that issue. A few weeks later, after having deployed IPv6 at work (also SixXS via NetCologne), I noticed the same effect there as well. Strangely, a friend had no problems accessing it via either IPv6 or IPv4 so I figured that there might be a routing issue or the company (or their CDN) may be blocking certain connections that do not match geo-lookups done via DNS. Since I had the (slightly unprofessional) impression that “the Internet was broken” around early March (at least if you were using Deutsche Telekom as ISP), I put that issue aside and revisited it only later.

The service was still inaccessible via IPv6 from my networks but the friend, having native IPv6 from Deutsche Telekom, could access it without problems. We compared DNS but Telekom DNS, Google DNS and NetCologne DNS always resolved to the same addresses, so there shouldn’t be any issue with geo-lookups. Finally, I found a thread where other people experienced the same issue and suspected the MTU size and missing ICMPv6 to be a problem. Oookay…?

So apparently, the operator deployed IPv6 to their servers and missed that ICMPv6 is mandatory for IPv6 to work properly. The issue appears to not have made it to their network operations department yet, so nobody fixed it on their end so far. And indeed: Setting the MTU to 1280 locally made the service to be instantly reachable. Let’s investigate what happened here as it’s mainly (but not solely) the operators fault:

Fixing the MTU on your local network

On your local end, you are using a higher MTU than 1280 (the required minimum MTU to be routed on the Internet). That is a bit unfortunate if the first hops of your upstream provider already use lower MTUs than your local defaults (usually 1500 on Linux or 1400 on Windows). What happens at this point is Path MTU Discovery, since IPv6 routers do no longer do packet fragmentation on their own (as opposed to IPv4): If your clients are sending packets that do not fit through a router’s outbound interface for the route to be taken, the router discards your packet and replies with an ICMPv6 “Packet Too Big” message which includes the MTU for its outbound interface. Your client saves that Path MTU (”pmtu”) in its route cache (Linux: ip -6 route show table cache) and retransmits the failed packet with fragmented size to match that individual MTU. This repeats until the route is fully traversable and your packets reach their actual destination. If your upstream provider is set to use a MTU of 1280 (changeable default for SixXS tunnels) and your clients try a MTU of e.g. 1500, “Packet Too Big” is being sent by your local router or – at latest – by your provider’s gateway for almost every connection you try to establish (since 1280 bytes are easy to be exceeded). Let’s see how that discovery looks like to an unrelated website with tracepath6 after forcing MTU to 1500:

# sysctl net.ipv6.conf.br0.mtu=1500; ip -6 route flush cache; tracepath6
net.ipv6.conf.br0.mtu = 1500
 1?: [LOCALHOST]                        0.038ms pmtu 1500
 1:  xxxx:xxxx:xxxx:xxxx::                                 5.936ms
 1:  xxxx:xxxx:xxxx:xxxx::                                 1.215ms
 2:  xxxx:xxxx:xxxx:xxxx::                                 1.229ms pmtu 1280
 2:                          29.395ms
 3:  2001:4dd0:1234:3::42                                 29.967ms asymm  2
 4:                       29.940ms asymm  3
 5:                        29.940ms asymm  4
 6:                          33.667ms asymm  5
 7:                            38.434ms asymm  8
 8:                        139.257ms asymm  7
 9:  2a02:2e0:3fe:ff21:c::2                               37.920ms asymm  8
10:  2a02:2e0:3fe:ff21:c::2                               37.935ms !A
     Resume: pmtu 1280

We can see that PMTU starts with 1500 (the device’s default) but my local router (IP masked with xxxx) replied with “Packet Too Big”, indicating that the MTU for that path should be 1280, hence a PMTU of 1280 is being used to continue.

As I said, that’s a bit unfortunate as it means almost every connection attempt is delayed by “Packet Too Big” and the path cache for all external connections sooner or later starts filling up with PMTUs of 1280:

# ip -6 route show table cache
2a02:2e0:3fe:1001:7777:772e:2:85 via fe80::xxxx:xxff:fexx:xxxx dev br0  metric 0
    cache  expires 594sec mtu 1280

Apart from manually setting the interface MTUs on all your clients this can be fixed via Router Advertisements by announcing the MTU of your Internet uplink or, if unsure, simply by announcing the minimum MTU of 1280. It’s done by AdvLinkMTU if you are using radvd or by setting the router’s local interface’s MTU to the size to be used if you are using dnsmasq for instance. Upon receiving those RAs, your clients should reconfigure to that MTU immediately. When using MTU 1280, your clients should not need to rely on path MTU discovery any more (at least unless you hit routers that violate standards even further).

Back to the broken service: Why is it a problem that they block ICMPv6 if I can fix that issue locally?

This left me puzzled for a moment until I compared three packet dumps in Wireshark (service with MTU 1500 and broken PMTU discovery, service with MTU 1280, with MTU 1500 and working PMTU discovery). You can see the packet dump of a stalled connection attempt with broken PMTU discovery below:

IPv6 ICMP Packet Too Big not seen

  • The initial TCP SYN packet the local client sends to create a new connection contains the Maximum Segment Size (MSS) which is equal to the MTU of the outbound interface to be used minus some bytes for packet headers. This declares the MTU to be used by the other end initially. If the client uses a small MTU such as 1280, MSS will be set accordingly which means the other end will fragment packets appropriately right from the beginning, no PMTU discovery required.
  • At some point, a packet sent by either side may be too big and a router replies to the sender with ICMPv6 “Packet Too Big” and the MTU to be used instead. The end that receives that message adjusts its Path MTU and retransmits the packet fragmented to match the new PMTU. Apparently, the server tries to send more than 1280 bytes in reply to my SSL “Client Hello”. That’s actually too big for my gateway or some other router in between, so the server is being sent a “Packet Too Big” message which is ignored and thus the connection stalls on both ends as the server can’t get past the router with lower MTU.

This leads to two issues:

  • If the other end is blocking ICMPv6 (for related connections) it cannot adjust its Path MTU although routers reply to it with “Packet Too Big” messages. If thought further, this may lead to an accumulation of dead connections on server-side which is most likely nothing you would like to have resource-wise. There should be two easy server-side fixes: Either allow ICMPv6 for related connections, so PMTU Discovery can work as it should, or always transmit with the minimum MTU of 1280 bytes instead of a higher local MTU and regardless of the TCP MSS. If connections are common to fail with any MTU higher than 1280 bytes it may be good practice for heavy-load servers to use a fixed MTU of 1280 anyway. (Please note that this assumption may not be correct and was one of the reasons I did not publish this post back in 2014 – I just didn’t find time to verify my claims…)
  • Unfortunately, it appears that the other end isn’t notified about the change in PMTU on one side, so unless it hits a limitation itself, it does not adjust its PMTU as well (which makes some sense since routing is not uncommon to be asymmetric, so there may be different PMTUs for each direction). In theory, if one end would be able to notify the other about a lowered MTU after TCP SYN, this would still require one end to discover the correct PMTU before the other. In this case it would not have helped as the client may request the web page with a smaller packet than 1280 bytes (in my case the largest packet sent prior to connection stall had 516 bytes). One workaround that could be implemented on clients is that the connection should be retried with a MTU of 1280 instead if connection stalls. Note that this may not work for all application protocols in all cases; in particular no remote action must have been triggered before a connection retry (which would have worked in this case as the SSL handshake for HTTPS failed, so no action should have been taken by the server yet).
Friday, October 12th, 2012

Patching a non-syncing RAID 1

I have yet to figure out why, but today one of my hard drives hit multiple CRC errors and went offline. It is part of a software (mdraid) RAID 1 and I had seen such errors before, so I did the usual procedure: Shutdown and power off for a few minutes, check that all drives come up on boot, boot to a rescue system, stop RAIDs, run a long self-test on the lost drive and then resync the RAID by running mdadm --add using the drive that remained online as source. Sounds okay? Too bad I had errors on the source drive…

When the first RAID partition’s resync neared completion around 95%, it suddenly stopped and marked the target drive as spare again. I wondered what happened and looked at the kernel log which told me that there have been many failed retries to read a particular sector which made the resync impossible to complete. Plus, S.M.A.R.T. now listed one “Current Pending Sector”.

What could I have done to avoid this?

First of all, I should have run check/repair more regularly. If a check is being run, uncorrectable read errors can be noticed and the failed sectors can be re-written from a good disk. However, check/repair can be a rather lengthy task which means it is not very suited to desktop computers that are not running 24/7.

Another option could have been to use --re-add instead of --add, which might have synced only recently modified sectors, thus skipping the bad one I hit on the full resync caused by --add. However, since I had my system in use about one hour before I noticed the emails indicating RAID failure, I doubt this could have helped much. Plus, it would likely have been too late to run that after a resync has already tried and failed as the data on the lost disk was already partially overwritten.

What I did to work around the issue


The following steps can cause irreparable damage to your data. Only continue if you fully understand what you are doing and you either have a working backup or can avoid loosing the data. This post is omitting information you should know if you are going to follow these steps. Please make your own mind about them before running any commands.


NOTE: The failed sector will be called 12345 from now on. The broken sector resides on /dev/sdb and the drive that went offline and has a partial resync but good sector is /dev/sdc.

A quick search turned up some helpful sites ([1], [2]). First, I verified that I had the correct sector address by running hdparm --read-sector 12345 /dev/sdb, which returned an I/O error just as expected. I then checked the sectors immediately before and after the failed one. I was lucky to find a strangely uniform pattern that simply counted up – I’m not sure if that is some feature of either ext4 or mdraid or just random luck. I tried to ask debugfs what’s stored there (could have been free space, as in [2]) but I wasn’t sure if I had the correct ext4 block number, so I didn’t give anything on that information.

Since this is a RAID 1, I thought, maybe I could just selectively copy the sector over from /dev/sdc. What I needed to do was to get the correct sector address for sdc and then run some dd command. Since the partition layouts differ on sdb and sdc, the sector numbers don’t match 1:1 and have to be calculated. I ran parted and set unit s to get sector addresses, then ran print to get the partition tables of both disks. All that has to be done is subtracting the start offset from the failed sector address, then add the other partition’s offset again. Let’s say the address was 23456.

Since I knew what the sector should look like, I could verify it directly with hdparm. Additionally, I checked a few sectors below and above that address and the data matched perfectly.

Next, I had to assemble a dd command to display and then copy the sector. Using dd if=/dev/sdc bs=512 count=1 skip=23456 | hexdump (bs should match the logical sector size) and comparing it to the hdparm output, I could verify that I read the correct sector. I also tried a few sectors above/below again. When I was ready, I finally copied the sector: dd if=/dev/sdGOOD of=/dev/sdBAD bs=512 count=1 skip=23456 seek=12345 oflag=direct (replace sdGOOD and sdBAD by the actual drives – just making this post copy&paste-proof :) ) oflag=direct is required or you will likely get an I/O error.

To be sure that everything went fine, I checked the result with hdparm again. After restarting the RAID, the resync ran fine this time.

Sunday, January 15th, 2012

Reconstructing directory modification times

I’m currently migrating large parts of a Windows NTFS partition to Ext4 on my desktop system and accidentally messed the modification times of all directories I moved with Dolphin (KDE) by aborting the operation and restarting it over the already created directories from the first run. I decided to hack a small script to reconstruct the modification times (getting close to the original ones). It works by recursively finding the latest, deepest nested file modification time for a directory (as those have been copied correctly) and using touch to set the same timestamp on the target directory. I would advise to call it with find on the directories it should operate on, for example if it has been saved as ~/

find first/ second/ -type d -not -empty -exec ~/ \{\} \;

It is by far not optimized (being called this way will perform highly redundant queries on the file system) but it works very fast nevertheless and you usually don’t run it frequently, so that’s okay…

A little word of warning: Verify the correct behaviour on your own! That means: Please manually confirm that all calls performed by this script and method turn up with the correct result. This comes without any warranty and although touch shouldn’t do any more than changing timestamps you never know… ;) So… do your checks please and try it on some unimportant test directories first.

# sets one directory modification time according to its latest file modification time (searched recursively)

LATESTDATE=`find "$DIR" -type f -printf '%TY%Tm%Td%TH%TM.%TS\n' | sort -n | tail -n1 | cut -c-15`

        touch -m -t "$LATESTDATE" "$DIR"
Sunday, May 22nd, 2011

Java: Deployment issues

Being new to developing Java web applications using a lot of dependencies, I encountered a few issues when deploying an update onto our Geronimo app server. Since some were hard to find solutions for, I decided to write them down in this blog post. I didn’t have any of these issues while running the application for development by mvn jetty:run (even on the same machine as Geronimo).

  1. Doubled dependencies
    We use Quartz Scheduler in our application while Geronimo itself also uses Quartz (however, in an older version). That resulted in a module conflict indicated by an IncompatibleClassChangeError since both libraries can not be loaded at the same time within the same classloader. The solution to this was rather simple: all that was necessary is adding <inverse-classloading /> to the deployment plan (which may better be placed in a separate directory so you can easily switch between multiple deployment targets). If our application was using Quartz 1.6 (the version Geronimo is using) I might have simply added a dependency to my deployment plan and could have shared the classes.
  2. Missing XML implementation
    Our application also writes PDFs using the Apache XSL-FO processor. For a reason I still don’t understand, there appeared to be only stubs available but no implementations although it worked happily with apparently the same configuration on the same machine when run from mvn jetty:run instead of Geronimo. The message I got was something like “org.apache.xerces.jaxp.SAXParserFactoryImpl not found”. After a lot of search and failed tries I figured out that I simply had to add xerces/xercesImpl as a dependency. Another option may have been to dig deeper into property handling and figure out how to properly solve the problem by specifying an existing implementation as suggested on an older StackOverflow question. (however I’m unsure if that really was the problem as it appeared that the classes were missing but the class name was correct and it worked fine from command line so the classes – to my understanding – should have been available through the default class loader)
  3. LinkageError
    The last problem I had to deal with took me much longer to figure out. Take a look at this part of a stack trace:

    Caused by: java.lang.LinkageError: loader constraint violation: when resolving method "javax.imageio.metadata.IIOMetadata.getAsTree(Ljava/lang/String;)Lorg/w3c/dom/Node;" the class loader (instance of org/apache/geronimo/kernel/config/MultiParentClassLoader) of the current class, org/apache/xmlgraphics/image/loader/impl/imageio/ImageIOUtil, and the class loader (instance of <bootloader>) for resolved class, javax/imageio/metadata/IIOMetadata, have different Class objects for the type org/w3c/dom/Node used in the signature
    	at org.apache.xmlgraphics.image.loader.impl.imageio.ImageIOUtil.extractResolution(
    	at org.apache.xmlgraphics.image.loader.impl.imageio.PreloaderImageIO.preloadImage(
    	at org.apache.xmlgraphics.image.loader.ImageManager.preloadImage(
    	at org.apache.xmlgraphics.image.loader.cache.ImageCache.needImageInfo(
    	at org.apache.xmlgraphics.image.loader.ImageManager.getImageInfo(
    	at org.apache.xalan.transformer.TransformerIdentityImpl.startElement(
    	at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    	at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
    	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    	at org.apache.xalan.transformer.TransformerIdentityImpl.transform(

    Let’s take a step back from that clutter of information, take a deep breath and examine the top-most sentence a bit closer:

    • we have some sort of class loader conflict
    • the method getAsTree in class javax.imageio.metadata.IIOMetadata refers to org.w3c.dom.Node
    • the class loader we are coming from is “MultiParentClassLoader”, provided by Geronimo
    • we come from org.apache.xmlgraphics.image.loader.impl.imageio.ImageIOUtil
    • the class loader being used by the method we try to call is called “bootloader”
    • the access to “bootloader” originates from javax.imageio.metadata.IIOMetadata
    • both class loaders link to incompatible and thus conflicting signatures of org.w3c.dom.Node, so we cannot continue

    Let’s interpret these facts: org.apache.xmlgraphics..ImageIOUtil calls a method from javax.imageio.metadata.IIOMetadata. We have at least two class loaders that have different understandings of what org.w3c.dom.Node should look like and both classes we access are using a different one of these signatures. Apparently the javax packages are provided by the JDK and are being used by xmlgraphics, so our dependency of xmlgraphics appears to be incompatible with the JDK we are running. We are confident that both the JDK and xmlgraphics are up-to-date, so what next?

    I searched for over an hour and couldn’t find anything relevant except some voodoo stuff. At first I tried to hide classes by adding a filter in the deployment plan; then I tried to enforce one specific version by adding a direct dependency to org.w3c.dom. It was a bug report saying something like “strange, org.w3c.dom hasn’t been touched for years” plus another report saying “use xml-apis-1.3.04″ that got me on the right track: Today’s Java versions seem to ship with at least some org.w3c.dom classes. However, Maven pulled xml-apis as a dependency nevertheless which includes its own versions of org.w3c packages but doesn’t seem to be a problem unless deployed to the app server. Maybe that’s a side-effect of the inverse classloading we enabled to get Quartz running. The solution was to simply exclude that dependency in the pom file. If you are using NetBeans you can simply right-click xml-apis and select “Exclude Dependency” which will automatically add


    to all relevant dependency definitions (in my case xerces/xercesImpl and org.apache.xmlgraphics/fop).

Tuesday, March 29th, 2011

wine caching Temporary Internet Files indefinitely

Taking a look at where my disk space went, I was quite surprised to find 11GiB in a directory at ~/.wine/drive_c/windows/profiles/username/Local Settings/Temporary Internet Files/Content.IE5/

It appears that this is an already reported issue with wine, as downloads made through calls to wininet.dll are supposed to be deleted by Windows sooner or later. As wine caches its downloads the same way but without ever removing old files, everything ever downloaded through wine’s wininet.dll currently gets cached indefinitely. To work around that problem, it’s currently necessary to clear that folder once in a while (or to automate that on reboots or similar). Simply deleting that directory should work just fine as it is supposed to be created automatically if needed.

It’s probably best to also search for other temp directories inside ~/.wine from time to time. Apart from cached downloads I could free another 6GiB of unnecessarily wasted space at the usual locations. It’s easy to forget about “C:” when running wine… :)

Monday, May 24th, 2010

Getting fed up with HTC’s non-existent maintenance…

I bought a HTC Hero last year in September and was quite happy with it. HTC’s custom Sense UI looked far better than the default Android UI. It also shipped with better applications, less Google bundling and pioneered some basic multitouch support (only in the browser and the photo app) and integration with Flickr, Facebook and Twitter.

Sense’s initial release was very sluggish (see older reviews on YouTube) but has just become fixed when I ordered my phone. After that, there was one more update in November without information on what has been fixed with it. Since it required yet another wipe and was still Android 1.5 (Codename Cupcake) and everything worked fine for me so far, I decided not to install it.

It’s the end of May now, 6 months since that last update, 8 months since the UI fix and when I got my phone. Android 1.6 (Donut) was released the day after I bought my phone, providing new features such as VPN connections, text-to-speech support, a new market application and multiple screen resolutions. I would call the last feature the most important one since newer devices required it and soon applications started requiring that API, resulting in older Android versions not being able to run them (they are simply hidden from the market). In preparation of the Droid release in November Android 2.0 (Eclair) was released just one month later, along with support for newer hardware (camera flash, OpenGL 2.0 ES) it also added more Bluetooth profiles (up to that point Android phones could only interact with headsets) and a better Bluetooth API as well as Google Voice Search among others. If I remember correctly, this was mainly used by the Droid and no other phone at that time. It was Android 2.1 (also Eclair) that shipped these new features to phones from other manufacturers since January but not without adding some more features such as live wallpapers. HTC said they would skip 1.6 and go directly to 2.0/2.1 for the Hero. Android 2.1’s release date was timed with the release of Nexus One which was manufactured by HTC. Last week, Android 2.2 (Froyo = Frozen yogurt) was released, bringing (among others) WiFi and USB tethering without the need of rooting your device.

Sprint released a 2.1 update for their customized Hero last week, one week after another customized revision called “Droid Eris” got the update. Hero owners using the plain GSM phones (in contrast to CDMA by US operators) are still waiting in vain for a release. There were multiple release dates, both rumored and official, each cancelled a few days before or simply missed. According to HTC now has yet another release date for us: A first preliminary update should roll out someday in June and the final 2.1 update should follow “a couple of weeks later” which could easily mean July or August regarding their previously announced dates. HTC also said, that 2.2 would come to all phones released in 2010 – that excludes the Hero for now…

Actually, Google released Android 1.6, 2.0 and 2.1 much too fast for any manufacturer to keep up with in time, that’s true. Also, HTC’s custom UI called Sense has to be ported to each Android revision they release an update for. But HTC already released a phone running Sense on 1.6. So assuming, they stopped support for Sense@1.5 at latest in November, they now had 6 months to tinker at ports to newer Android versions and prepare an update for the Hero. And they did: Not only did Droid Eris and Sprint Hero get an update in the past weeks, they also released a lot of phones previously, developing even more. All have similar hardware and starting from the Hero all of their Android phones have Sense UI.

So why didn’t HTC release any 2.1 update for the Hero yet? Considering that at least the Sprint Hero could have had its update so “soon” only because the network operator Sprint may have built its own release, I’ve got the notion that HTC is building throw-away phones: If one gets outdated, just dispose it and buy a new one; maintenance will only run for a few months and then suddenly stops. This may have worked for conventional phones in the past but since smartphones run on a common operating system that is often maintained by another company, that’s the completely wrong path to keep the Android platform healthy and customers as well as developers satisfied. By healthy I mean that a device should usually support all public API levels until the hardware becomes insufficient which shouldn’t be that much of a problem with smartphones which are nothing but PDAs with a cellular radio chip. However, since Android is open source and modifications to the Linux kernel have to be published under GPL, Android phones should be open to custom upgrades and so is the HTC Hero.

Unfortunately I’m legally unable to link or name any of the custom ROMs I’ve found but if you do a standard Google search you will most likely find them quite easily. They either run the plain Android system or incorporate (pirated) copies of Sense UI from leaked or previously released images. Since I’m not the only one who is sick of HTC’s poor excuses, there seems to exist a variety of ROMs specifically targeting the Hero. Reading into what’s necessary to get an update, I’ve found out that HTC is actually making it difficult to flash an inofficial ROM onto the phone. This involves unlocking a special diagnosis mode and signed files; things I would not have expected at all from a badly maintained open source based phone! Although I didn’t try it myself, it seems like even versions containing ripped Sense binaries run surprisingly fine (with some minor bugs and inconveniences) on the Hero, so the question remains: Why does HTC delay updates if a community of a few unrelated developers is able to build almost completely working releases?

To make things worse, it seems like at least a few released 2.1 images introduced a jail lock that appears to not have been broken completely yet. So you are strongly advised to check the news on this topic before installing any official HTC updates from now on or you may not be able to ever go further than Android 2.1.

Why does this happen? What are they thinking? I start to regret having bought my Hero with the wrong expectation to have a modifiable phone that doesn’t outdate for a few years due to an evolving open source operating system. All I can do now is to warn other people from buying HTC’s phones without prior investigation of possible issues. While other Android phones may also be several months late lagging behind Google’s SDK releases, that may be excusable up to some point (especially since Google sprinted ahead with their releases since 1.6). The day the Hero finally receives its 2.1 update, other phones will already get their updates to 2.2 and I doubt HTC will continue to update the Hero further. Since a jail lock may be introduced to the normal Hero by the preliminary update in June I would strongly advise considering either getting a custom ROM instead or live with 1.5 and wait for more information (is HTC serious about 2.1 and will the Hero get 2.2?). Once you installed that update you may not be able to go further without buying a new phone although the hardware would be capable of it. All I know for sure is that I won’t buy another HTC phone in near future.

As of May 17, Android currently has an almost equally distributed version fragmentation across 1.5, 1.6 and 2.1. In the advent of 2.2 and the final updates to 2.1 I would expect that we are one or two months from finally calling 1.5 (the third API level) “legacy”. This means that the number of devices running 1.5 will drop significantly low and therefore less and less applications will run on 1.5 due to API updates more current applications may prefer to use (with Android 2.2 we have reached API level 8). The 34,1% share of Android 1.5 may in fact contain a large percentage of Heros since almost every other phone has had updates and the Hero sold pretty well. Developers have or will have to choose whether they want to maintain a legacy “Hero revision” of their software or not. Thus by holding back OS updates, HTC is not only upsetting its customers but is annoying developers as well, so its highly unlikely they will support a 10% or less share of 1.5 users. Running out on updates in less than one year isn’t what Android was meant to be nor what I would expect from a 400€ device.

Sunday, January 24th, 2010

KDE 4.3 upgrade and Cashew

I reinstalled KDE for an upgrade from 4.2 to 4.3 (doesn’t seem to work without uninstalling on Gentoo; no idea why) and ran into some problems afterwards.

reinstalling KDE

I was still using the unstable KDE 4.2 until this weekend. For some strange reason, the slotted 4.2 ebuilds were completely removed from portage, thus preventing any re-merges which I had to do since I wanted to also upgrade from GCC 4.1.2 to 4.3.4, so I had to uninstall KDE before, upgrade GCC, recompile everything, then finally merge KDE again. To remove KDE more or less completely I was told to do emerge -pCv $(qlist -ICv kde | grep 4.2) (minus the pv of course). I had some other ebuilds left for manual checking but this command caught the vast majority. Unmerging and remerging kde-meta (after the GCC upgrade) went really smooth. To have a WM in the meantime I emerged xfce4-meta and gnome-terminal before I unmerged KDE.

Akonadi migration with bogus entry

Recently I had issues with some corrupted config in Akonadi that could not be solved from within KDE 4.2. On every start I would now get the migration dialog retrying to migrate a bridge called Y2252jgYO7 – whatever that was, I couldn’t find it. Running locate on that ID revealed old lock files back from KDE 3.5. Having searched for a moment I figured out that I could simply find and remove the files:

find ~/.kde* -iname \*Y2252jgYO7\* -exec rm \{\} \;

Next, everything referencing that ID has to be removed from ~/.kde4/share/config/kres-migratorrc – redo whatever runs the migration assistant for you and the problem should be solved.

frenzy Nepomuk – again…

I don’t seem to be the only one where the desktop search never works. With KDE 4.2 I disabled Strigi and Nepomuk because it took 100% CPU even when idling. If it wasn’t idling it was accessing the harddisk way too fast and so it always slowed down my system. That wasn’t without sideeffects however: I was unable to use the runner (Alt+F2) without crashes. I figured out that enabling Nepomuk, disabling Strigi and removing all repository data worked in KDE 4.2. Guess what? The problem reappeared after the upgrade to KDE 4.3. I don’t know if I solved it “properly” (I don’t need desktop search ATM so I can afford disabling it). From Stéphane Laurière at Mandriva’s bug tracker:

  1. make a backup of your Nepomuk database by backuping the folder ~/.kde4/share/apps/nepomuk/repository/
  2. reconfigure Strigi from the Nepomuk System Settings so that it indexes less
    Note: I unchecked everything plus added * to the exclude patterns, just to be sure.
  3. stop Nepomuk by running the following command:
    qdbus org.kde.NepomukServer /nepomukserver org.kde.NepomukServer.quit
  4. remove the repository folder ~/.kde4/share/apps/nepomuk/repository/
  5. restart Nepomuk: nepomukserver

After restart I got several crashes (what else) but Nepomuk stops retrying after a short time… (you should watch out for large numbers of crash logs though)

the Cashew

If you ever wondered what the name for that moon-and-star-like icon is which you get when unlocking your panels in KDE4, it seems to be called “the Cashew”. I know there was one icon on the desktop when I started using KDE 4.2 but it somehow disappeared very soon on its own and I didn’t miss it. It seems like that has been some bug since I now had it again:

cashew on the right side KDE Cashew

That nasty icon was sitting right on my otherwise clean desktop wallpaper. I could move it around my screen borders but it could not be removed. I thought something was broken until I finally searched for it and figured out that it’s some kind of branding not be meant to be removed. Great… Now, how do you get rid of it?!

As pointed out on the Gentoo forums (thread “KDE 4.3.0 Annoyances“) and also on a blog post, there exists a plasmoid that you can simply drop onto your desktop to remove/hide that useless cashew: I HATE the Cashew

I HATE the Cashew

If you should ever want that apparently useless cashew icon back, simply remove “I HATE the Cashew” by clicking the minus icon on the “Add Widgets” dialog.

There used to be an ebuild for it but since I couldn’t find it I simply wrote one on my own (well, half-way copy&pasting from other ones and the ebuild documentation, but it works). In case you need it too, here it is: (kde-misc/ihatethecashew-0.4)

# Copyright 1999-2009 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2


DESCRIPTION="I HATE the Cashew - KDE4 corner icon removal plasmoid"

KEYWORDS="~amd64 ~x86"


src_unpack() {
        unpack ${A}

src_compile() {
        cd iHateTheCashew
        mkdir build
        cd build
        cmake -DCMAKE_INSTALL_PREFIX=/usr .. || die "CMake failed!"
        emake || die "Make failed!"

src_install() {
        cd iHateTheCashew/build
        emake DESTDIR="${D}" install || die "Install failed"
Saturday, December 12th, 2009

portage: (permanently) ignoring file collisions

I’m currently setting up a server to migrate an older one to. In order to emerge all necessary ebuilds I basically took the worldfile of the older one, sorted, diffed and edited it to finally emerge `cat mergelist` it on the new system. Except for a few dependency problems (emerge --resume --skip-first and later reruns fixed those) everything went fine. Unfortunately, having emerged maildrop, it blocked the installation of courier-imap due to file collisions:

 * This package will overwrite one or more files that may belong to other
 * packages (see list below). You can use a command such as `portageq    
 * owners / <filename>` to identify the installed package that owns a    
 * file. If portageq reports that only one package owns a file then do   
 * NOT file a bug report. A bug report is only useful if it identifies at
 * least two or more packages that are known to install the same file(s).
 * If a collision occurs and you can not explain where the file came from
 * then you should simply ignore the collision since there is not enough 
 * information to determine if a real problem exists. Please do NOT file 
 * a bug report at unless you report exactly which
 * two packages install the same file(s). Once again, please do NOT file 
 * a bug report unless you have completely understood the above message. 
 * Detected file collision(s):                                           
 *      /usr/bin/maildirmake                                             
 *      /usr/share/man/man1/maildirmake.1.bz2                            
 *      /usr/share/man/man8/deliverquota.8.bz2                           
 * Searching all installed packages for file collisions...               
 * Press Ctrl-C to Stop                                                  
 * mail-filter/maildrop-2.2.0                                            
 *      /usr/bin/maildirmake                                             
 *      /usr/share/man/man1/maildirmake.1.bz2                            
 *      /usr/share/man/man8/deliverquota.8.bz2                           
 * Package 'net-mail/courier-imap-4.5.0' NOT merged due to file          
 * collisions. If necessary, refer to your elog messages for the whole   
 * content of the above message.                                         

maildrop belongs to courier and although that collision isn’t great but won’t hurt being ignored, I added the following line to /etc/make.conf (wrapped to be readable on my blog):

COLLISION_IGNORE="/usr/bin/maildirmake \
 /usr/share/man/man1/maildirmake.1.bz2 \

This will permanently ignore the collision for all 3 files.

Friday, October 23rd, 2009

Disabling HPA on Samsung HDDs

A friend of mine bought a new 1TB hard drive from Samsung (HD103SJ with firmware 1AJ100E4). He used Windows Vista to partition it and format 2 partitions for NTFS, one to install Windows 7 and one to move data from an external drive. Everything worked well, Vista used the drive as intended. When he rebooted to install Win7 everything – Windows 7 setup, BIOS and unfortunately also Vista – saw the drive with only 33 MB capacity, reporting one unformatted partition on it.

Searching the internet we found out it must be a problem with the drive’s host protected area (HPA) in conjunction with his mainboard/SATA controller (ironically from Gigabyte). He updated BIOS and chipset drivers but the problem persisted. According to some forums there are some tools for Windows to fix that issue. Since many people reported data loss afterwards and he didn’t have another copy of the data he moved from his external drive we decided to boot into the new Gentoo Live DVD that also allowed me to log in remotely via SSH to investigate the problem (we are more than 500km away from each other).

Following a short guide on how to detect and disable HPA we could see the settings of his drive:

# hdparm -N /dev/sdb
 max sectors   = 65134/1953525168, HPA is enabled

This seems to have been the opposite configuration of what should have been set according to the guide: While the examples from the guide had a much greater value before the slash his drive seemed to show the intended remainder to be allocated to the HPA set for “public” use, so his HPA seemed to cover almost the entire 1TB of his HDD.

Before we tried to correct that value we wanted to check if we can access any data right away. Unfortunately everything prevented us from accessing the HPA – the kernel did only register /dev/sdb1 but not /dev/sdb2, smartmontools showed 33MB capacity as well, parted refused to work and fdisk finally showed the correct partition table but reported a mismatch in geometry, so at least we still seemed to access the same start sector of his HDD:

Disk /dev/sdb: 33 MB, 33348608 bytes
255 heads, 63 sectors/track, 4 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x24c2a85c

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       38245   307200000    7  HPFS/NTFS
Partition 1 has different physical/logical endings:
     phys=(1023, 254, 63) logical=(38244, 193, 29)
/dev/sdb2           38245      121601   669558784    7  HPFS/NTFS
Partition 2 has different physical/logical beginnings (non-Linux?):
     phys=(1023, 254, 63) logical=(38244, 193, 30)
Partition 2 has different physical/logical endings:
     phys=(1023, 254, 63) logical=(121600, 247, 55)

We also tried to access /dev/sdb with dd but only got the 33MB until the drive “ended”. Considering the HPA modifications to be the last resort we tried to change it only temporarily at first, but had to set it permanently to get any effects on the system after a reboot. Please note that the manpage for hdparm explicitely states that changing that setting is “VERY DANGEROUS, DATA LOSS IS EXTREMELY LIKELY”. I copied the old value to a text editor before making the modifications, so I hoped we could at least set it back to what it was before our changes in case it didn’t work well. To make a permanent change, simply run the same command with p (for permanent) followed by the new maximum value to be set (to disable HPA this would be the exact same value as written after the slash):

# hdparm -N p1953525168 /dev/sdb

Having rebooted, fdisk did not complain any more and the kernel recognized both partitions:

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x24c2a85c

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       38245   307200000    7  HPFS/NTFS
/dev/sdb2           38245      121601   669558784    7  HPFS/NTFS

hdparm confirmed that we just disabled HPA:

# hdparm -N /dev/sdb

 max sectors   = 1953525168/1953525168, HPA is disabled

We mounted his data partition read-only using ntfs-3g and since everything seemed fine we started to copy everything back to his external drive, just to be sure we had a backup of his data in case that Windows would decide to crash his partition after another reboot. However, Windows Vista did recognize the drive correctly again and so did Windows 7; he can access all data and nothing seems to be corrupted or missing.

It appears strange that Windows initially ignored the HPA and worked well until he rebooted. But having had extreme, sporadic problems myself using a Silicon Image SATA controller with Windows XP SP2 as well as Windows Vista (without SP, back in 2007) it did not really surprise me any more.