Friday, October 15, 2004

KDE Dot News: MySQL versus ext3

I've been musing a lot about KDE Dot News lately.

The site is currently host to over 55000 articles and comments all stored in Zope's internal DB format. Yet, the structure is completely hierarchical. Every article/comment has a single parent but may have one or more children.

This structure could easily be represented on a filesystem such as ext3 using directories and plain text files holding the information.

My question, for anyone with the experience and know-how, is: How efficient would it be to store all the dot's articles and comments on the filesystem? How would this compare to storing the information in, say, a MySQL database?

Remember that when you access an article/comment, you usually want to access all the descendants as well.

6 Comments:

Blogger jayKayEss said...

It sounds like a very bad idea to me because you'd have to allow your CGI scripts write access to a file partition someplace, and that is a potential security hole.

Also, what about meta-information? You'd have to put things like a) the article's author b) its creation time, etc. into a flat text file somehow, and then you'd have to parse that back when you read the article. A pain!

And don't forget locking issues: what happens if two CGI processes try to write to the same file at once? mySQL automatically takes care of that possibilty.

Plus, mySQL is an usually fast and well-optimized DB, so I doubt you'd stand gain any performance.

1:21 PM  
Blogger segedunum said...

I can never see that being a good idea really. You're talking about dynamically linked data here, and a normal filesystem is inherently static.

The other question is, if you have articles stored in Zope's ZODB database would it be feasible or even worth it converting these to plain text documents or to into a MySQL database? If these articles are all stored in the ZODB, as good as it is, it is going to be excruciatingly slow (can't quite believe that myself as it's not what it's suited to).

I'd think about the MySQL option, as this is what it is about. Lots of data linked together fairly simply and relationally, and you can use the power of Zope, SQL Methods and DTML to bring it together.

6:21 PM  
Blogger Navindra Umanee said...

Thanks for the replies. Lots of food for thought.

Right now, locking issues are avoided or not handled at all. Basically, there is no locking. If two editors are editing an article at the same time, the last one who submits will win. There are no locking issues when outsiders post an article, since each article gets a unique id.

Given the current situation, anything else would probably be an improvement. Writing to the same directory at the same time should be no problem, and of course I could use existing tools to implement locking at the file level if needed, or I could simply do what is being done now (nothing)... but it is easy to see how the system can be incrementally improved.

I did think about the meta-information issue and conceptually there does seem to be some simple ways to handle it, I suppose only experimentation will convince me that it's a problem in practice.

I guess my real problem is that I don't know anything about SQL, and I really hate the kind of opacity I have right now with ZODB so I don't really want to go down the SQL route.

Sure I can hack up python scripts that take forever to run to find out what the heck is going on with each object in ZODB, but the prospect of being able to traverse a transparent file structure with the full gamut of Unix tools seems much more appealing and user-friendly.

I've already seen both ZODB and MySQL get corrupted, pretty much leaving me at the mercy of various black magic to get back to a working system with minimum data loss. ZODB corruption was particularly nasty since I had no idea at what point it had happened and how it had happened. Just all of a sudden some things were corrupted and some weren't and the whole system was collapsing.

Right now ZODB is 2G. Imagine that. I have to go and pack the damn thing.

I haven't yet heard anyone say that performance is going to be an issue. It could well be, in which case the idea is dead.

segedunum, a filesystem may well be static but all it's doing is holding information here. The dynamic part would have to be built on top of that. I don't really see what would make it inherently more expensive than what we have now. Accessing one article or comment should be constant time (well depending on the size of the item) and accessing the descendents should be linear time in the number of descendents.

Security is a hard question and I admit it's something I did not yet consider.

9:46 PM  
Anonymous Anonymous said...

As far as performance goes with MySQL vs FS, it really should end up being the same for the simple fact that 99% of requests to The Dot require exactly the same comments and articles. (i.e. the articles and comments from the front page). This data _should_ be cached in RAM in the application server itself (Zope), and not pulled in from the DB each time.

As for SQL vs FS in general for this task, I think that the results could be quite surprising. For pulling up hierachical data, a good FS like reiserfs that is build with small files as a design goal, could well out perform a SQL DB. SQL does not natively understand hierachical data, while filesystems are designed with hierachies in mind. Point being, the results could go either way. :-)

The kicker is searching. SQL DBs _are_ designed for searching, although full-text searching is not something that SQL as a language supports well. :( SQL DBs happily use and manage indexes for data while with a FS you would have to manage data indexing yourself if you want any chance of getting decent search performance.

I've got more speculation that info here, but I hope it is useful, and of course I don't have any raw data to back this up. :-)

cheers,

--
Simon Edwards

4:54 AM  
Blogger Navindra Umanee said...

Thanks for the further insights.

About searching, symlinks (or better hardlinks?) could perhaps come in handy e.g. we can have symlinks to every article/comment in one index, symlinks to only articles in another. In fact I already intended to use symlinks in another part of the system (to denote the approved articles).

So, searching could become simply a grep on a flat structure.

On the other hand, searching could be handled through Google which has already indexed the dot site and crawls us every day...

10:48 AM  
Anonymous Anonymous said...

Regarding the usage of ZODB on a comment-heavy site, I'm actually quite surprised. ZODB is specifically written for a high read/write ratio. Write performance is rather bad and every write will bloat the DB a bit. This DB bloat can be countered by packing, but, as you seem to have noticed, a DB application with a lot writes requires frequent packing to keep DB size and performnance acceptable.

2:48 PM  

Post a Comment

<< Home