Mon, 31 Jul 2006
XFS Corruption
Odysseus has been hit by the infamous XFS endianness bug in the 2.6.17 kernel (for those who do not know, in some rare cases this bug can cause a filesystem corruption which xfs_repair still cannot fix).
I have tried to fix the filesystem according to the guidelines in the FAQ, but there must be yet another problem in my XFS volume or in the kernel, as it keeps crashing when I try to actually use that volume. Since Friday I have been trying to put Odysseus back in shape, so far without success. Now I am copying the local (non-mirrored) data to a spare disk, and I will recreate the entire volume from scratch.
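The plan, roughly, in shell terms (the device name and mount points below are made up, and the choice of the new filesystem is discussed in the next paragraph):

rsync -aHSx /export/ /mnt/spare/     # copy the local (non-mirrored) data to the spare disk
mkfs.ext3 /dev/sdb1                  # recreate the volume from scratch
mount /dev/sdb1 /export
rsync -aHSx /mnt/spare/ /export/     # and copy the data back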
I feel sorry for the XFS developers, as they have always been helpful when I had a problem with XFS. But I am afraid I will have to use something different from XFS. I am considering JFS (which was the fastest filesystem when I did my own testing), but I will probably use ext3: in my experience it is a very stable filesystem with one important feature: a rock-solid e2fsck, which can fix even a filesystem corrupted by a hardware bug (unlike reiserfsck, and unfortunately xfs_repair as well).
I hope the FTP service of ftp.linux.cz
will be restored tomorrow.
Fri, 28 Jul 2006
Bloover
A coworker of mine showed me an interesting tool: Bloover. It is a security auditing tool for Bluetooth-enabled phones. It seems my Nokia has a huge security hole - Bloover running on his phone (it is a Java ME application) can download the whole contact list, the list of recent calls, and a few other things from my Nokia, even though the devices are not paired with each other, and my phone is not set to be visible via Bluetooth.
I ended up disabling Bluetooth on my phone, and enabling it only when I need it. Now I have to find out whether this particular hole has been patched by Nokia, and whether they will provide new firmware for free. I am afraid they won't.
This is the problem with all closed-source devices: they cannot be fixed without the vendor's help. And some vendors are extremely unhelpful with fixing their devices (I have to name Cisco as well as Nokia here). HP, for example, does this better with their switches: while the firmware is not open source, they provide all firmware upgrades as free downloads from their web site.
This problem will become more and more common, as more and more devices have some sort of CPU and firmware inside. So I wonder what my next mobile phone should be, so that I do not fall into the same firmware upgrade trap. Maybe some Linux-based Motorola.
Thu, 27 Jul 2006
Hotmail and UTF-8
A new user of IS MU has redirected her e-mail to Hotmail, and complained that mail from IS MU is not displayed correctly (the diacritical characters were wrong). I wondered whether it was true, and even tried to create a Hotmail account for testing purposes.
After accepting at least 50 cookies from their authentication service, they finally accepted my registration. However, I could not log in; the web site displayed some internal error message (when I tried once again, it just returned me to the login screen). I then tried MSIE from our remote desktop server, and I was able to log in. So it seems MSIE is the only browser allowed to access Hotmail.
I then sent a test mail to this new mailbox. It contained a text in Czech, and two words in Japanese (Katakana and Kanji). It ended up in the spambox of my Hotmail account, and even there it was not displayed correctly. So it seems that Hotmail cannot correctly handle mail in Czech or Japanese. I have not found a "display all headers" option in Hotmail, so I cannot even verify whether the headers arrived at Hotmail intact.
Moreover, the mail from that user of IS MU (sent from Hotmail) was itself broken - it contained characters outside of US-ASCII, yet there was no mention of the transfer encoding, MIME version, or character set of the mail body in the headers. Talk about respecting widely accepted, well-established standards.
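For reference, a correctly labelled message should carry headers roughly like these (the charset is just an example; it could be ISO-8859-2 or anything else that covers the characters used):

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit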
I have recommended that the user switch to a different mail service, preferably one which manages to be standards-compliant.
Mon, 24 Jul 2006
Netbox Voice
After some time of using Ekiga as my softphone, I have decided to acquire a public phone number, reachable even from the PSTN. There are definitely cheaper providers, but I have chosen Netbox, my ISP. They can put the VoIP traffic into a different band of their bandwidth limiter, so I can use my network connection at full speed while using the softphone without loss of quality.
I use the software phone only - I do not want to have another blackbox (read: the Linksys VoIP gateway) at home. Ekiga is pretty easy to use, and I have my PC always on anyway.
There were some problems with setting it up, though:
- Their SIP servers are in the nbox.cz domain, not netbox.cz. It took me at least half an hour to figure out this typo in my configuration :-(
- Ekiga 2.0.1 does not like a SIP server with more than one SRV record. It is fixed in CVS; a quick fix is to put the name of the real SIP server (the one the SRV record points to) into the account configuration (see the lookup example after this list).
- The password to the Netbox SIP account is not written in the amendment agreement, despite what their customer centre girls say. In the "account name" box, there is, well, only the account name. The password can be found in the Netbox customer's web pages (after entering a general Netbox password).
- The Netbox SIP servers are apparently not reachable via SIP from outside the Netbox network. So it is probably not possible to call a Netbox Voice customer from another softphone on the Internet. I have sent a support request to their hotline, and I am waiting for a reply. It is not fatal, though; I can still use my ekiga.net account for Internet-only calls.
- The SIP server uses the PCMA codec only, so the quality is worse than when calling Ekiga-to-Ekiga with the SPEEX wideband codec. The sound quality is apparently also more sensitive to the mic input level.
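For the record, the SRV records can be inspected with dig; the record name below is my guess at the usual SIP convention (it may as well be _sip._tcp or _sips._tcp):

dig +short SRV _sip._udp.nbox.cz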
So after a year or so we again have a "land line", this time without a monthly fee, with much lower call rates than Český Telecom (now Telefónica) offers, and with the call history immediately available on the Netbox customer web pages.
Wed, 19 Jul 2006
A Slightly Better Wheel
In the world of open-source software, one can often spot a phenomenon which I hereby name "A Slightly Better Wheel Syndrome(tm)". I often see it in our students' bachelor projects, but sometimes also in the works of my colleagues or other computer professionals. The Slightly Better Wheel Syndrome is - strictly speaking - reinventing the wheel. It is less harmful than plain old wheel reinvention, because the result is - well - slightly better than the original. Except that often, in the big picture, it is not. Today I have seen an outstanding example of this syndrome.
I have read an article about a driving simulation game named VDrift. I wanted to try it, and (because it is not in Fedora Extras) I wanted to package it. So I downloaded the source and wanted to compile it. There was no Makefile, no configure, nothing familiar. So I read the docs, and found out that SCons is used instead of make to build VDrift.
I have tried to find out WTF SCons is, and why I should use it instead of make. They have a section titled "What makes SCons better?" on their home page: almost all features listed there fall into the category "make+autotools can do this as well (sometimes in a less optimal way)". Nothing exceptional that would justify writing yet another make replacement. What they do not tell you are the drawbacks. And those are pretty serious:
The first one is that virtually everybody is familiar with make. Every programmer, many system admins, etc. When something fails, it is easy to find the right part of the Makefile which needs to be fixed (this is true even for generated Makefiles, such as automake output). Everybody can do at least a band-aid fix.
The second problem is that their SConstruct files (the equivalent of a Makefile) are in fact Python scripts, interpreted by Python. So the errors you get from SCons are not ordinary errors, they are cryptic Python backtraces. I got the following one when trying to build VDrift:
TypeError: __call__() takes at most 4 arguments (5 given):
  File "SConstruct", line 292:
    SConscript('data/SConscript')
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 581:
    return apply(method, args, kw)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 508:
    return apply(_SConscript, [self.fs,] + files, {'exports' : exports})
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 239:
    exec _file_ in stack[-1].globals
  File "data/SConscript", line 21:
    SConscript('tracks/SConscript')
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 581:
    return apply(method, args, kw)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 508:
    return apply(_SConscript, [self.fs,] + files, {'exports' : exports})
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 239:
    exec _file_ in stack[-1].globals
  File "data/tracks/SConscript", line 10:
    env.Distribute (bin_dir, 'track_list.txt.full', 'track_list.txt.minimal')
  File "/usr/lib/scons/SCons/Environment.py", line 149:
    return apply(self.builder, (self.env,) + args, kw)
So what is going on here? With make, there would be a simple syntax error with a line number. With SCons, there is a cryptic Python backtrace, written in the reverse order to what everybody else (gdb, the Linux kernel, Perl, etc.) uses. Line 149 of Environment.py is this:
148: def __call__(self, *args, **kw):
149:     return apply(self.builder, (self.env,) + args, kw)
So what is the error message about? __call__() is defined with three parameters, yet the message complains that it takes at most four arguments, and that it is called with five. Moreover, it is apparently called in some magic way (there is no explicit call to the __call__() function) from data/tracks/SConscript line 10, which is a call with three arguments, and a call to something different from that __call__() function. There is no way to fix the problem without deep knowledge of Python and SCons.
I have googled the error message, and found this thread, which said that it is possible to build VDrift after commenting out line 10 in data/tracks/SConscript. But I still have no idea what was wrong, and whether some edit of line 10 would have been better than commenting it out.
So SCons is definitely another example of the Slightly Better Wheel Syndrome. In a hypothetical SCons authors' Ideal World(tm), where everybody uses SCons and everybody knows Python, SCons might have been better than the make+Autotools combo, but in the Real World(tm), no way.
It is definitely harder to make an existing solution fit your needs than to rewrite it from scratch, because it is harder to read other people's code than to write your own. By using an existing solution, however, you often gain flexibility, maintainability, and features which "I just don't need" (read: don't need now, but which might be helpful in the future).
So the moral is: please, please! Try to use (and maybe even improve) an existing widespread solution, even when at first sight it does not exactly fit your needs. Do not reinvent a Slightly Better (in fact sometimes much worse) Wheel. The world does not need yet another slightly better, yet in fact broken, PHP-based discussion board, PHP-based photo gallery, or make replacement.
Tue, 18 Jul 2006
Comma-Separated Values?
While migrating IS MU to UTF-8, I rewrote the code for exporting tabular data to a CSV file for MS Excel, factoring it out into a separate module. While I was at it, I also added the Content-Disposition header, so that the exported file is saved under a sane filename instead of the default some_application.pl. So now the Excel exports are saved as files ending with the .csv suffix. Which is, interestingly enough, the source of problems and incompatibilities with MS Excel.
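For the record, the headers in question look roughly like this (the Content-Type value and the filename are just examples):

Content-Type: text/csv; charset=windows-1250
Content-Disposition: attachment; filename="export.csv"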
As I have verified, when I save the CSV file as file.pl, Excel reads it correctly - it asks whether the TAB character is the field separator (indeed it is), whether Windows-1250 is the file encoding (it is), and happily imports the file. When the same file is named file.csv, Excel opens it without any question, but somehow does not recognize the TAB character as the field separator. So all the fields are merged into the first column, and the TAB characters are displayed as those ugly rectangles.
When I try to separate the fields with semicolons, Excel happily opens the file (when it is named *.csv), but under any other file name it is necessary to explicitly choose the semicolon as the separator. Just another example of MS stupidity - why can the separator not be the same regardless of the file name? And by the way, what does CSV stand for? Comma-separated values? Colon-separated values? It works neither with commas nor with colons; only semicolons are detected correctly. Maybe it is some kind of newspeak invented by Microsoft.
I guess I will keep the exports TAB-delimited, and just change the file name in the Content-Disposition header to use the .txt extension instead (although something like .its_csv_you_stupid_excel would probably be more appropriate).
Mon, 17 Jul 2006
IS in UTF-8
Our Information System has been running with UTF-8 support even at the application layer since Friday. Finally, the work which took most of my working time is almost finished. Now we are fixing the parts of the system which do not run directly under Apache (cron jobs, etc.), and minor glitches which survived our prior testing.
We do not allow arbitrary characters everywhere, because we must keep some attributes in a form suitable for printing through TeX or for exporting to external systems, which are mostly ISO 8859-2 or Windows-1250 based. We do allow almost all Latin-1 and Latin-2 characters in most applications, though.
While it has been hard to convert the whole system to UTF-8, I must say that the UTF-8 support in Perl is well architected (and from what I have read, definitely better than in other scripting languages).
Tue, 11 Jul 2006
3ware Disk Latency
Odysseus with the new hardware seems to be pretty stable. However, there is still a problem: it seems that with the new 3ware 9550SX disk controller, the drives have a much higher latency than they had with the older controller (7508).
The system apparently has a higher overall throughput, but the latency sucks. It is most visible with Qmail - with the old setup, Qmail was able to send about 2-4k individual mails per 5 minutes. With the new setup, this number is in the low hundreds of messages per 5 minutes. At this pace, Odysseus is not even able to keep up with the incoming queue. After the new HW was installed, the delay in the mail queue was several days(!).
I have found this two-year-old message to LKML, where they try to solve the same problem with disk latency. It seems that the 3ware driver allows up to 254 requests in flight to a single SCSI target, while the kernel's block layer queue (nr_requests) is only 128 requests deep. This means that the controller sucks all the outstanding requests into itself, and the kernel's block request scheduler does not get an opportunity to do anything.
So I have lowered the per-target number of requests to 4, and disabled NCQ on the most latency-sensitive drives (i.e. those which carry the /var volume), and the performance looks much better now.
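A rough sketch of the knobs involved (the device name is just an example; the NCQ setting itself is done through the 3ware management tool, whose exact syntax I will not reproduce here):

cat /sys/block/sdb/queue/nr_requests        # the block-layer queue depth (128 by default)
echo 4 > /sys/block/sdb/device/queue_depth  # limit the outstanding requests per SCSI target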
I think the main difference between the old HW and the new one is that the new controller has a much bigger cache, so it can allow more requests in flight. The kernel scheduler then cannot prioritize the requests it considers important, causing the overall latency to go up.
I hope I have solved the latency problem for now, but during the summer holidays the FTP server load is usually lower, so the problem may come back later.
Fri, 07 Jul 2006
Weekly Crashes
We have an off-site backup server for the most important data. Several months ago it started to crash - and it crashed during the backup almost every Thursday morning.
At first we suspected the hardware. However, I was able to run parallel kernel compiles for a week or so, with some disk-copying processes in the background. The next suspicious party was the backups themselves: we tried to isolate which of the backups flowing to this host was the cause. But there was nothing interesting. We have checked our cron(8) jobs, but there was nothing special scheduled for Thursday mornings only (the cron.daily scripts run, well, daily, and the cron.weekly scripts run on Sunday morning).
When upgrading the disks this Tuesday I began to think that there was a problem with the power system - my theory was that on Thursdays, some other server in the same room runs something power-demanding, which causes power instability, and our backup server crashes.
Yesterday the backup server crashed even without the backup actually running. I decided to re-check our cron jobs, and I found the cause of the problem: we run S.M.A.R.T. self-tests of our disk drives daily, and the script was written to run a "short" self-test every day except Thursday - on Thursdays, it ran a "long" self-test. I wrote it this way so that in case of a faulty drive we would have two days (Thursday and the less busy Friday) to fix the problem. So I tried to run a "long" self-test on all six drives by hand, and the server crashed within an hour.
It seems the backup server has a weak power supply or something, and running the "long" self-test on all the drives at once was too much for it. So I have added a two-hour sleep between the self-test runs on the individual drives, and we will see whether it solves the problem. Otherwise I will have to replace the power supply. Another hardware mystery solved.
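A minimal sketch of the staggered run (the device names are examples, and the real script differs in details):

for d in /dev/sd[a-f]; do
    smartctl -t long "$d"   # start the "long" offline self-test on this drive
    sleep 7200              # wait two hours before starting the next one
done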
Mon, 03 Jul 2006
New hardware in Odysseus
After several months of having the new disks and controller at my desk, I have finally managed to install them into Odysseus. Now Odysseus (ftp.linux.cz, ftp.cstug.cz, etc.) has a shiny new SATA-2 PCI-X controller (3ware 9550SX-8LP) with eight new 500 GB drives, almost doubling its previous storage capacity.
It seems the new controller (with NCQ and bigger cache memory) is, according to iostat(8), able to keep all the drives busy at 100% sustained when the demand is high enough. The previous one - a 3ware 7508 - was apparently not able to distribute the load equally: when the load was high, the drives carrying the busiest volume (/var) were at 100%, while the others peaked at about 75-80%.
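By "busy at 100%" I mean the %util column of the extended iostat output, e.g.:

iostat -x 5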
I even had to update my MRTG configuration, raising the maximum theoretical bandwidth of each drive. It seems the new drives have noticeably higher throughput. However, it feels like the latency of the drives has increased during load peaks. I am not sure what the cause is (maybe it is simply a throughput-for-latency tradeoff because of the bigger caches everywhere). From the graphs it seems that the imbalanced HDD utilization problem is gone (although it might have been a problem of a particular HDD firmware).
Another thing to note is that SATA cables require more space behind the drive, because the SATA connector is almost a centimeter deeper than the PATA one. I think I cracked the connector on one drive while trying to fit the cable between the drive and the fan behind it (but the drive fortunately still works).
When upgrading the system, I screwed up while moving the data to the new host: I ran the following command on the upgraded system, which was still running on a temporary IP address:
rsync -aHSvx -e ssh --delete odysseus:/export/ /export/ && echo "OK"|mail -s rsync kas
This used the address 127.0.0.1 for odysseus (which was in /etc/hosts, because Anaconda sets it up this way). So the above command, run on the temporary host, actually did not do anything, as it tried to synchronize the /export volume from itself. You can laugh at me now. Oh well.
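A couple of sanity checks which would have caught this (the hostnames are the ones from above; the added -n switch makes rsync do a dry run first):

getent hosts odysseus      # would have shown the bogus 127.0.0.1 entry from /etc/hosts
rsync -aHSvxn -e ssh --delete odysseus:/export/ /export/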