Mon, 26 Feb 2007
Superlinear Algorithms
A while ago I wrote about the
Email::Address
Perl module for parsing addresses
in the e-mail headers (such as To:
, Cc:
, etc.).
We use it in production system, but on Thursday we have got a problem,
which I narrowed to an inoptimality in this module. The mail delivery daemon
(which - amongst other things - parses the mail message) seemed to be
stuck in an infinite loop.
Later I found out that the message in question contained an overly long
To:
header - it had more than 300 mail addresses there.
I have investigated this further, and it seems that the
Email::Address
module has complexity far above O(n):
# of addresses | parsing time |
---|---|
60 | 0.13 s |
65 | 0.13 s |
70 | 0.29 s |
75 | 0.45 s |
80 | 0.79 s |
85 | 0.58 s |
90 | 0.86 s |
95 | 1.63 s |
100 | 2.38 s |
105 | 3.17 s |
110 | 4.02 s |
115 | 5.97 s |
120 | 11.37 s |
125 | 230.89 s |
130 | 1045.07 s |
So it was definitely impossible for it to parse a header with >300
addresses in a reasonable amount of time.
The alternative module, Mail::Address
, is unusable
on this particular message as well - it crashes perl
with SIGSEGV :-(.
As a workaround, I wanted to truncate the header somewhere, but in order to truncate the header exactly between e-mail addresses, I would have to parse the header. Which is what I has been doing, and it took too much time. A nice Catch 22, isn't it? So I have invented a heuristics where I truncate the header after a comma followed by a newline. When this is not enough, I use just a sole comma as a separator, and when even this is not enough, I truncate the header to a fixed number of characters. So far it seems to be sufficient.