KDE Dot News: Ext3's Miserable Failure
As I mentioned previously, I have been musing a lot about KDE Dot News lately.
After some interesting discussion about the merits of ext3 vs ReiserFS vs MySQL vs Zope, I thought I'd put it to the test. I have zero, absolutely zero, free time to waste on such things, but in the end the lure was just too much.
Zope
I'd been somewhat negligent. I hadn't packed the Zope DB for 4 months. The nasty, horrible consequence of that was that Data.fs was a whopping 2 gigabytes in size. So I took an hour or so to pack the DB, resulting in a more reasonable size of 260M.
Then I spent a couple of hours setting up a parallel Zope server for experimentation, as well as implementing the necessary code for dumping the Zope DB to the filesystem in the aforementioned hierarchical structure.
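For the curious, here is a minimal sketch of what such a dump could look like. It is not the actual dump code: the nested-dict input, the `dump_tree` name and the article/comment layout in `sample` are all made up for illustration, since the real Zope objects and the real hierarchy aren't shown here.

```python
import os
import tempfile


def dump_tree(node, root):
    """Write a nested dict to disk.

    dict values become directories, str/bytes values become files.
    Returns (directories_created_below_root, files_written).
    """
    dirs = files = 0
    os.makedirs(root, exist_ok=True)
    for name, value in node.items():
        path = os.path.join(root, name)
        if isinstance(value, dict):
            sub_dirs, sub_files = dump_tree(value, path)
            dirs += sub_dirs + 1  # the subdirectory itself plus its children
            files += sub_files
        else:
            data = value.encode("utf-8") if isinstance(value, str) else value
            with open(path, "wb") as fh:
                fh.write(data)
            files += 1
    return dirs, files


# Hypothetical layout: one directory per article and one per comment.
sample = {
    "2004": {
        "article-1": {
            "header": "Title: example\n",
            "body": "Article text\n",
            "comments": {"1": {"body": "First comment\n"}},
        }
    }
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(dump_tree(sample, os.path.join(tmp, "dot")))
```

A layout along these lines creates far more directories than files with real content, which is exactly the case the rest of this post ends up stressing.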
ext3
Since I like testing code as I'm implementing, I only dumped the directory structure in the first step. Evidently this resulted in a hierarchical directory structure -- nothing but 55000 directories. The results were horrific.
Ext3 takes up a whopping 220M holding nothing but directories. Nothing but directories. Yet it takes up almost as much space as the entire Zope DB, which actually has more stuff in it than just the dot.kde.org data.
Ostensibly unfazed, I implemented the remaining code necessary to dump the full monty, including article headers, bodies, meta-information and any file attachments.
The resulting dump takes up 930M. Even worse, global file operations (e.g. find or du) from the root are extremely slow. Slow as hell. I think ext3 is going to be absolutely hopeless here, although the running time of file operations is not that bad when kept local or lower down in the structure.
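A rough way to put a number on "slow" is to time a full traversal, which is essentially what find does from the root. The sketch below is only an approximation of that (the `walk_stats` name is mine), and the timing depends heavily on whether the directory blocks are already cached, so a second run will usually look much faster than the first.

```python
import os
import time


def walk_stats(root):
    """Visit every entry under root and time the traversal.

    Returns (number_of_directories, number_of_files, elapsed_seconds).
    The root directory itself is not counted.
    """
    start = time.perf_counter()
    n_dirs = n_files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    elapsed = time.perf_counter() - start
    return n_dirs, n_files, elapsed
```

Running it once from the top of the dump and once from a directory a few levels down would show the global-versus-local difference described above.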
tar.gz
Still curious though, I made a tar.gz of the directory structure. Perhaps not surprisingly, that takes up only 700K compared to the ridiculous 220M for ext3. I tried the same thing with the full dump and that takes up 60M compared to ext3's gig. Help me, Jebus.
ReiserFS
Simon Edwards had previously suggested that ReiserFS might be up to the task. KDE Dot News does not have reiserfs. We're running Ark Linux and Bero, one of the dot admins, strongly favours ext3 (to put it politely).
KTown to the rescue! That worthy machine has tons of space and tons of resources, so it made a suitable victim. Don't nobody tell the sysadmins.
I was astounded to find that the entire directory structure took less than 200K on reiserfs. That's stupidly less than even what the tar.gz takes up. Hans, you da man, man.
The bad news is that the full monty still takes up a huge 700M on reiserfs. However, it is fast. Very fast. Very very fast compared to ext3.
tar
As suggested by a reader, I also took a look at the results for the uncompressed tar files. Directories take 30M and the full dump takes 270M. The latter result is very interesting, because one might deduce from it that there is a lot of room for improvement in ReiserFS in terms of space usage.
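The directory figure for plain tar is about what one would expect: every tar member carries a 512-byte header, so 55000 empty directories cost roughly 55000 × 512 bytes, or about 27M, before any compression. Here is a small sketch for reproducing the tar and tar.gz comparison on a test tree. The `archive_size` helper is hypothetical, and it builds the archive in memory, so it is only suitable for small trees, not the full dump.

```python
import io
import os
import tarfile


def archive_size(root, mode):
    """Return the size in bytes of a tar archive of root built in memory.

    mode is "w" for an uncompressed tar or "w:gz" for a gzip-compressed one.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        # Store paths relative to the tree's own name, not the absolute path.
        tar.add(root, arcname=os.path.basename(root))
    return len(buf.getvalue())
```

Calling `archive_size(path, "w")` and `archive_size(path, "w:gz")` on the same tree gives the two numbers to compare. The headers of empty directories are mostly zero bytes, which is why gzip squeezes them so well.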
Summary of Results
| KDE Dot News | Zope | tar | tar.gz | ext3 | ReiserFS |
|---|---|---|---|---|---|
| Directories | - | 30M | 700K | 220M | 200K |
| Full Monty | 260M | 270M | 60M | 930M | 700M |
Speed-wise, I'm not giving any numbers since I tested reiserfs and ext3 on two different systems; although, I did try to make sure both were mounted noatime and so on.
Nonetheless, I can say that it seemed like ext3 took as many minutes as reiserfs took seconds to perform global operations. That is well over an order of magnitude of difference across the systems.
Conclusion?
So I'm not sure what to do. 700M in reiserfs represents 4 entire years of articles and comments on the dot. The information is uncompressed and in plain text (excepting binary file attachments). Given that, perhaps the space usage is reasonable. I had also envisioned using some form of revision control for articles, so that would just take up more space.
Space-wise, Zope is winning here. At least still for a few weeks, until it grows horribly out of control again.
On the other hand, ReiserFS could easily have the speed advantage.

28 Comments:
Interesting results. I didn't expect ext3 to die so horribly. :)
Also, Zope is an object database. It doesn't save a flat dump of an article or comment. It saves the objects that make up the article or comment. Shared objects only get saved once. I suspect that is the main difference between Zope and reiserfs. The massive compression that tar.gz got also suggests that there is a lot of redundancy in there. (once again, speculation!)
Anyway, I did some googling on the subject and found this:
http://cvs.zope.org/ZODB3/Doc/storages.html?rev=1.8.4.2
Different backends for Zope, including a file-based backend! They use reiserfs and report that it is 30% larger than Data.fs.
cheers,
--
Simon Edwards
It would be interesting to make a tar.gz from the reiserfs tree.
For the directories alone, you mean? I did, and it didn't make much difference to the results. Any reason you think it should?
I would be interested to see what size a raw .tar is.
You're right. That's a very interesting and pertinent question. I've added the numbers.
Mmh, ext3 sucks horribly, that's not big news. I've thoroughly tested ext3, reiserfs and xfs, and ext3 always loses, always. Depending upon your job, either reiserfs or xfs is (usually enormously) better.
How about Reiser4? It will be very interesting to see how Reiser4 will do.
It is to be expected that a directory tree on ext3 takes more space due to ext3's traditional block-oriented nature. You could alleviate that somewhat by using a smaller block size (2048 or 1024) instead of the default (4k, probably). This is the sort of thing that is traditionally left for the administrator (who knows what sort of data will be stored on the fs) on unix - freedom of choice if you like.
Reiserfs on the other hand packs the data much more tightly. With some access patterns and with larger files, this results in some performance loss. There used to be tail merging patches for ext2 (ext3 uses the same on-disk layout) that did something similar.
With regard to ext3 performance: it used to be much worse with a large number of files/directories under the same directory. Nowadays ext3 has the dir_index feature that remedies the O(N) behaviour. You could make sure that dir_index was enabled on your fs (tune2fs -l _device_ | grep dir_index; lsattr -d _dir_ | grep I).
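The block-size point can be checked with some back-of-the-envelope arithmetic. Assuming every directory occupies exactly one data block and ignoring everything else (inode tables, journal), the 55000 directories from the post work out as follows. The 4096-byte default is the commenter's guess, not something that was verified on the machine.

```python
def directory_overhead(n_dirs, block_size):
    """Bytes used if every directory occupies exactly one data block."""
    return n_dirs * block_size


for block_size in (4096, 2048, 1024):
    mib = directory_overhead(55000, block_size) / 2**20
    print(f"block size {block_size}: {mib:.0f} MiB")

# block size 4096: 215 MiB
# block size 2048: 107 MiB
# block size 1024: 54 MiB
```

The 4k figure lands very close to the 220M that was measured, which suggests the space really is going to one mostly empty block per directory.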
Very interesting and informative comment.
Since switching to reiserfs is not a short-term project, your tips to tune ext3 may come in handy.
dir_index is not currently enabled and doesn't seem to be something I can easily try right now (requires tune2fs followed by fsck). It does sound like it will help a lot here.
Incidentally reiserfs is mounted with notail on ktown, but that option didn't work with ext3.
Thanks a lot for the pointers.
For the speed thing, you may want to retest ext3 with different values, as ext3 ships with very conservative defaults that harm performance _on_purpose_.
As already mentioned, ext3 has an "htree" feature which can help in the "lots of files/directories" case. Ext3 also has a new block allocator, "orlov", and as you might guess, to really test it you need to fill your filesystem under a kernel that uses orlov to allocate those files (2.6 supports both things).
Another thing is the journaling modes. Ext3 has 3 journaling modes: data=ordered, data=writeback and data=journal. "ordered" is the default, and when it's enabled ext3 ensures that data are written before the metadata that refers to that data. It may look stupid, but xfs, jfs and reiser in the default mode don't care about this, and they can write the metadata before the data. In reiser you can enable it with the data=ordered mount parameter if you want the ext3 behaviour. XFS and JFS don't even have that option. If you want to have the xfs/jfs behaviour (more speed but less data safety) then use data=writeback.
And the last interesting parameter is commit. commit defaults to 5 in ext3, and it means that ext3 will sync all its buffers every 5 seconds - yes, that's a sync every 5 seconds. It harms performance a _lot_ and ext3 defaults to that value because ext3 developers are really _paranoid_ about data safety. You can mount it with huge values to increase performance.
In short, if you want to do _fair_ benchmarks reiser vs ext3, you must:
o use htree
o use 2.6 (orlov, also ext3 is a _lot_ faster in 2.6, reiserfs too I guess)
o mount your filesystem with data=writeback
o mount your filesystem with commit=9999 (or whatever huge value you want)
o read Documentation/filesystems/ext3.txt
That said, I don't really think ext3 will shine. ext3 is great for data safety, and the fact that ext3 developers have done things like adding a feature which syncs all your filesystem every 5 seconds by default is a sign of that. I expect ext3 to be "slow", but not _that_ slow
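Before comparing two machines it helps to confirm that both filesystems are actually mounted with the options in the checklist above. A small sketch that parses text in the /proc/mounts format (device, mount point, type, options, dump, pass); the `mount_options` helper, the device names and the mount points in `SAMPLE` are invented for illustration. Note that an option missing from the output may simply be at its default, so absence proves little.

```python
def mount_options(mounts_text, mount_point):
    """Return a dict of mount options for mount_point, or None if not found.

    Flags such as noatime map to True; key=value options map to the value.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            opts = {}
            for item in fields[3].split(","):
                key, _, value = item.partition("=")
                opts[key] = value or True
            return opts
    return None


# Hypothetical sample in /proc/mounts format.
SAMPLE = (
    "/dev/hda1 / ext3 rw,noatime,data=ordered 0 0\n"
    "/dev/hdb1 /srv reiserfs rw,noatime,notail 0 0\n"
)

if __name__ == "__main__":
    print(mount_options(SAMPLE, "/"))
    print(mount_options(SAMPLE, "/srv"))
```

On a real system the text would come from reading /proc/mounts instead of `SAMPLE`.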
Why not try NTFS? Supposedly NTFS 5.0 (Windows 2003 Server's default fs) is pretty efficient...
You could try just "chattr +I _dir_". It may turn on the dir_index feature without needing to reboot for a single dir (see also chattr -R). But I'm not 100% sure.
The other comment about data=ordered is also very valid.
Even with notail, reiser does pack directories much better thanks to its tree structure. Dropping notail (that is, enabling tail packing) would probably give you even more space benefit (if your files are small, as I've understood). That probably does reduce performance somewhat (but caching may help).
notail won't work with ext3. There used to be a tailmerge patch for ext2 (ext3?), but I don't think it's supported anymore.
Hmm, thanks again for the cool comments, both of you.
Seems my systems don't have/allow the +I option in chattr, I'll have to look into that or just do the tune2fs thing.
data=writeback does not work with the remount option, so I'm going to have to try that another time. I would have thought it would only affect writing, although I guess it could affect the system in general. So far my tests have been mostly reading of already written stuff.
Looking forward to the results!
Note to self: Some interesting comments here. Including a reliability study on ReiserFS vs ext3.
I'd really love to see some reiser4 figures here. From what I've read it should reduce the used amount of space considerably while still being faster than reiser3.
I agree. Testing reiser4 with the same method would be really cool (hint hint :) )
I really strongly suggest that you do some mySQL benchmarking before you go with a file-based solution (it just sounds really difficult to administrate such a thing to me.) We run a large website with ~8MM users, with message boards, custom search, and ecommerce, 24 servers, all off of a single mySQL database, and the results are great! Also, mySQL has an internal cache so that not every read results in a disk access.
As for storage space, mySQL will probably take up a lot with its indices, but these will really speed access time. You should make sure that your tables are properly indexed, or else your benchmarks will be disappointing.
Sorry to disappoint, but I don't have access to any ReiserFS 4 systems. :)
jayKayEss, I'll keep your comments in mind, thanks. I'll know where to find you when I need MySQL expertise. ;)
Honestly all, this was really just a quick-and-dirty dry run. Don't expect anything to happen soon, I have a thesis to complete first.
I would be curious to see all this under XFS. I have been a huge fan of it, and I would just like to see the numbers.
What about a .bz2?
I can't believe you ended up on OSNews! The comments show as well....
One thing where ext3 shines (as attested by Nigel Cunningham (swsusp2)) is for Laptop hibernation.
Since running on FC2/reiserfs, and using hibernation, I've experienced 3-5 data corruption events.
I'm getting a bit paranoid now.
Maybe ext3 is better for hibernation support then.
http://livejournal.com/users/lotso
Why Not MySQL?
Hmm, please also consider the total number of unfixed bugs for reiserfs. The last time I looked (a month ago?), there were about 15 of them...
How did you measure how much space the thing takes in ReiserFS? I don't know about notail, but with tail packing the usage reported by "du" is very wrong. If I remember correctly, it was seeing 50GB stored on my 25GB drive once. Tail packing also makes ReiserFS a bit slower.
I've heard stories about Ext3 corrupting data, too, years ago. But ReiserFS is particularly nasty. I saw nice articles about ReiserFS from the Gentoo main developer, and then the main Portage server thing went down for a few days a year or so ago.
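One sanity check for the du question is to compare the apparent size of a tree (the sum of file lengths) with the allocated size the filesystem reports. The sketch below reads the same st_blocks field that du relies on, so it inherits whatever the filesystem chooses to report, but a large gap between the two numbers is at least a hint that something odd is going on. It assumes a Unix system where st_blocks is counted in 512-byte units, and the `tree_sizes` name is mine.

```python
import os


def tree_sizes(root):
    """Return (apparent_bytes, allocated_bytes) for everything under root.

    apparent_bytes sums st_size; allocated_bytes sums st_blocks * 512.
    The root directory itself is included, symlinks are not followed.
    """
    apparent = allocated = 0
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)
        apparent += st.st_size
        allocated += st.st_blocks * 512
    return apparent, allocated
```

With lots of small files on a block-oriented filesystem, the allocated figure should come out well above the apparent one. With tail packing, the two can diverge in less obvious ways.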
Yes, I used "du". I obviously had no idea what I was getting myself into. :-)
Sadly, I've certainly developed a healthy dose of Fear, Uncertainty and Doubt (FUD) over ReiserFS at this point. Incidentally, KTown is a major KDE server and seems to have been using it for ages.
re: "why not mysql":
Here's the preface to that document:
NOTE: This Document was written in May 2000. Thus, it is outdated and does not represent the latest data concerning MySQL.
I don't know enough to say what the mysql folks have been up to for the last 5 years but I expect it's something.
I'm pretty much committed to using the filesystem for the backend, but I might consider MySQL for non-critical functions such as maintaining a searchable index. We'll see.