Add comments about proposed implementations and anything else here....
- ADDED: Other advantages of a relational database storage is that the content can be replicated to multiple servers and kept in sync. A round-robin front-end can be used to select a server via DNS. This allows high performance server farms. Also, you can "split" the content onto two databases and replicate them together onto one server. Why would you do this? To allow the company wide content to be separate from the department level content. Hence, the company templates can be replicated to each department server and the two kept in sync. -- NoneToBe?
I agree that data replication between DB servers can happen. I'm not too sure exactly what this buys us. I suppose something. A buzzword at least. As for clients choosing from multiple replicated database instances dynamically: I'm definitely no RDBMS expert, but I imagine that there are some client-level contracts that must be met by the underlying DA's to allow this sort of dynamic selection of a database server. I don't know to what extent the common Python DB-API2 adapters (the very few of them that exist) meet these requirements. Something like this probably would not happen through round-robin DNS, though I imagine it could be done. Have you done something like this? Can you describe it?
I'm not sure I understand your assertion that you could merge two disparate RelationalStorages? into one in order to
sync them.. (if that's what you meant)... what we're trying to implement here is a Zope storage almost exactly like a FileStorage (the default Zope storage). Merging two widely (or even mildly) divergent FileStorages? (or RelationalStorages?) together is something I'm afraid to even think about. In order to do so, we'd need a way to resolve conflicts between pickled data structures as well as reassignment of object ids, transaction ids, timestamps, and record ids. It's an understatement to say that such a thing would be very ambitious and is not a part of the scope of this project.
However, when (if?) Zope supports the use of multiple simultaneous Storages from a single Zope instance, it's conceivable that your department/company scenario could be accomplished by making use of localized data from per-department storages and site-wide data from a
company storage from within the same Zope instance. This data could be kept in RelationalStorage (it might even be the best choice), but it wouldn't need to be. But I think discussion about use of multiple simultaneous storages needs another place on the Wiki. :-) - ChrisMcDonough
- ChrisMcDonough random notes 5/10/2000
- Jim thinks it's a good idea to think about using stored procedures for most if not all of the reads and writes that this thing does. I've been designing for maximum portability thus far and as some SQL DBs? don't support stored procedures and other features, I'm going to try to design the table structures and such so that even if one DB doesn't have the features of the next, it still has a shot at storing the zodb. One of the benefits of stored procedures is for things like pack, where the entire operation can be handled by the database with just a nudge from Zope. I will look into Interbase and Kinterbasdb support for stored procedures and maybe try an implementation with them. Tres turned me on to the fact that the base64 module is implemented in Python and is probably a lot slower than struct (which is implemented in C) at manipulating the binary that makes up timestamps and oids for storage in the DB. I'd just dump the things right in to the DB and not bother with converting them using struct or base64, but it makes setting up the database pretty painful for the end user as at least for Interbase, you need to declare another "character set" on all of the columns that you want to store binary data in. I don't think I could support questions about these sort of configuration details on every major database. I'd rather just bust the oids and timestamps into two separate 32-bit ints and store them separately in the DB. Every database I can think of supports these datatypes and they're simple to explain. This might make things slightly slower on retrieve, but I haven't measured.
I've been implementing a prototype relational storage in increments.
- Basic store/load and begin/finish capability.
- Transactions records and undo.
I've got stage two complete and functioning. I'll post a link as soon as I've got it packaged up. It seems pretty solid but could use some optimization. I found it much easier to deal with tackling this problem in stages. Stage one was essentially just converting MappingStorage? to use the DB API instead of a dictionary to store data. It got a little trickier separating undo and versions in DemoStorage? but it made things a little easier to just rip out everything having to do with versions, including the fields in the table structures.
After reading your design notes and comments I do have a few points I'd like to address.
- I think it makes more sense to abstract out the base functionality from the start rather than try to split it out later. It's likely to be a painful operation and you may end up back at the drawing board. Currently, I'm doing all of the heavy lifting in the RDBStorage? base class. My derived classes do nothing but handle DB specific functions. Currently those are limited to _connect (connect to the database and return a connection object), _insertid (return the id of the last inserted record) and _escape (escape a string for use in a sql query). The queries are all class attributes so they can also be overriden in derived classes.
- I've been using a 64-bit integer type specific to mySQL for my record ids. These are the keys for both the record and transaction tables. It occured to me reading your discussion of base64 and struct that probably the fastest and easiest way to key the records is off of the oid and serial fields just like in FileStorage, by storing them as a VARCHAR and using repr() to store and eval() to retrieve them. In fact, I'm going to make this modification to my prototype tonight and see how well it works.
Gotta go. More to come. --jfarr
Yeah, I've been struggling on how to abstract the base functionality out of the thing... I mentioned this to Jim and he advised me to just make it work, then worry about dicing it up. I think I'm going to continue on this track. I don't much mind going back to the drawing board after it's done, and it's easier for me to concentrate on making it work when it's all in one module...
Your idea about separating versioning and undo is a good one. You're farther along than I am, so it hasn't been on the map up til now... but I'm going to pursue those features in the same way based on your recommendation though I'll probably leave the columns in the DB for versioning as they're not too terribly distracting.
I started out sql quoting input values, but then I started to use parameterized queries... these seem not to require any special sql quoting, but that might be a symptom of the value types I've been using... it's still possible that I'll need to sql-quote something... I wish the DB-API spec was a little clearer on this. I don't feel like spelunking through the DA code. I'm going to continue to use parameterization and wait til it fails. :-) I did go back and read the spec right now, and you can also use a mapping as a parameter value. I might change my implementation to use a mapping as a result. It's sort of a pain in the butt to have to explicitly provide all the input values positionally.
The repr-eval idea is a good one as well... I will probably go with this instead of busting things up into 2 32-bit ints. It's sure a lot easier and understandable than reconstructing timestamps and oids using the struct module.. it also lets you index the DB better. I sort of like the integer representations for the fact that I know I definitely do not need to worry about sql-quoting them, but if I don't need to worry about sql-quoting anyway with the repr-eval way, that's definitely the way to go.
ChrisMcDonough - 5/14
Well, I tried the repr-eval thing. For some reason, Kinterbasdb fails with a connection error when I try to insert values generated by repr. I tried sql_quoting them to see if it would help, same problem. Though I think this is a good idea, based on my experience so far with it, I'm going to have to punt and use the 2 32-bit ints.
... later in the day....
Nope, this was my bad. I was fetchall'ing from a cursor that had just done an insert. I'm going to give it another go with repr-eval.
I've also managed to not have to use a temporary index in the store method by delaying the SQL commit until _finish.
After I get repr-eval working, I'm going to work on... pack. Shudder.
... and even later in the day ...
I've switched over to the new table structure (in the implementation and by the time you read this the RelationalStorageTableStructures should be updated) using repr-eval. I've also got pack working. Badly, probably. I'm not reassigning record ids in pack, I should. I'm going to put the implementation up (it's linked from the RelationalStorage page), and think about how to reassign record ids during pack.
Undo is next. I guess.
I've also converted my prototype over to use repr(serial) and repr(oid) for the table indexes. The implementation is now a whole lot cleaner and doesn't rely on 64-bit integer SQL extensions. I also split the base class into TransactionalStorage? and NonTransactionalStorage? classes. I think I've got a pretty good grip on how to get both working. I'm glad I started with the non-transactional model, as it's definitely the hardest. I'll probably start on an Interbase implementation too, just to make sure I've got the transactional side working. The next step is to tackle versions. I am sort of wondering when and if these two implementations are going to merge. You can download my latest working code at http://www.zope.org/Members/jfarr/Products/ZRDBStorage/ZRDBStorage-0.0.3.tgz The design specification is at http://www.zope.org/Members/jfarr/Products/ZRDBStorage/design
A note about sql-quoting versus using parameterized queries. The DB API allows an adapter to support any of about six different parameter formats. You can query the adapter for which it supports but then you'd have to adjust your SQL text on the fly. If you don't then you're going to have problems when you try to implement support for different databases. This is why I chose not to use them, although if I can think of a clever way to support it then I probably will because it would make coding other things easier and probably speed up the implementation too.
Jonothan - I looked at your design notes... you've got a pretty good grip on this stuff.. I haven't looked much at the implementation, but the ideas look pretty sound.
I propose that I ditch my latest implementation and sort of follow your lead. I can work on transactional stuff based on your implementation if you want... (i pick the easy jobs :-) You tell me what to work on, I'll jump on it.
I have a question: isn't it possible for a single transaction to contain two disparate changes to the same object? In this case, can you really uniquely identify an object in the data table using just its oid and its serial number (and its status code, too, I guess, for nontransactional stuff)? I went the "record id" way for rows in the data table because I thought it was necessary. If I'm just having a brain fart, be gentle with me...
- My understanding is that each transaction contains only the latest record for each affected object. Honestly, I haven't looked too deeply into verifying this, though. Jim could probably give you a quick answer. --jfarr
Also, I see that I've been missing the point of the "prev" column in the data table, it points to the last record that stored this object. I had no idea what the hell it was for, to be honest.
I like your separation of transaction and nontransactional store/finish/abort methods.. I'm not really sharp enough to say without reservation that the status code is going to be enough to support nontransactional underlying databases, but it's a good start... looks good to me anyway. I have a feeling there's other gotchas to it, but nothing I can point my finger at.
- You're right. It's just a start. I think it'll work well for load/store. It could get really tricky trying to roll back a pack or undo operation without transactions, though. --jfarr
One other thing.. if this gets into the Zope core, do you have a problem releasing it under the ZPL? Release credit would go primarily to you, of course.
I spoke to Jim last night and he recommended the following:
- Don't use repr & eval, instead use the binascii module's a2b_base64 and b2a_base64 like so:
The slice is used to remove the CR appended by the b2a function.
Evidently the repr-eval thing is much slower, and the base64 module's encodestring does a lot of stuff we dont need and is also therefore slower.
- Think about keeping version, current, and history data in tables of their own. I dunno. He put up a small diagram on the Wiki of how he envisions the table structures in ThoughtsOfMineFromAFewMonthsAgo. I'm not quite sure I understand it fully. I'm apt to keep what we have until I figure out why we would want to do this.
- Though it's possible that a transaction may reference two records that modify the same object, filestorage doesn't handle it, and it's a pretty low-possibility error condition.
ChrisMcDonough - 5/22
Over the weekend, I converted things over to use the latest RelationalStorageTableStructures. No new features, still just store load and pack. We've broken out versions into a separate table, and I predict we'll probably break out the stuff even more to separate undo. Once Jonothan and I settle on a finalized table structure, this will make it a lot easier to do development. As it stands, its sort of at a snails pace as we're reimplementing constantly as we design.
Jim has added "historying" into FileStorage. I suppose we'll need to look in to this. I'm wondering how feasible it would be to provide quota support in RelationalStorage. This would be a really nice feature. I'm not sure that we could use the underlying database to do the quota, but if we could that would be a bonus.
These are some excellent points that Chris made in an email to me. I thought they might be of general interest. --jfarr
- By not encouraging people to subclass our
interfaceZRDBStorage? class, we provide another degree of freedom to other implementors of other RS's. Interfaces are all about freedom. Not that they wouldn't have freedom to do what they wanted anyway, they could just not subclass, but in that case, what's the point of making a base class?
- I think that for the most part people are going to want to be able to approach the idea and understand it quickly without getting wrapped up in a heavy interface (which is actually an almost complete implementation which they need to wrap their heads around without changing) layered on top of the existing storage interface. We shouldn't try to wrap the storage interface in our own to make it "easy" as it will actually make it harder for people as time goes on.
- I propose that it will let us develop the thing faster if we're not worried about needing to make mutually incompatible changes to the ZRDBStorage? base class as we go. This worry goes away if it goes away.
- I would also submit that so far no other Storages use any shared base classes besides BaseStorage, even the ones that are very similar (e.g. Ty Sarna's BerkeleyStorage? and Jim's dbmStorage), and that this was "cut and paste by design", as BerkeleyStorage? had DC's full input and was developed after dbmStorage. It's very similar (it probably is cut and paste) to dbmStorage, but it didn't seem to bother anybody.
- I believe that making (inevitable) changes to this base class will probably break things outside the scope of our control once other people start layering things on top of it, and those people will have big messes to clean up whenever we make a new release. I'd rather not be put in the position of having to help them. :-)
- BaseStorage is the primary
referenceimplementation of the Zope storage interface, and as such it's kept pretty up to date. I submit that we'd need to keep the ZRDBStorage? base class in lockstep with the methods implemented in the BaseStorage class if we wanted to provide all its functionality while still providing an "easy" way for other coders to implement a relationalstorage on their favorite database. IMHO, this should be the responsibility of each individual coder, and I don't much want to be in that business. If we fail to keep the thing current, there will be inevitable branching of methods as folks add stuff that we haven't had time to put in, which is what I think you're trying to avoid. I can imagine each RS implementation becoming a sort of spaghetti mess of overridden and non-overridden methods as time goes on. I'm sure you've worked on code like that, and you know it's no fun.
- (probably the most important :-) I suspect that Jim'll never let it go by if it's presented for inclusion in the Zope core. I've been beaten by him a few times for trying to do baroque stuff like this. I can maybe forward him this exchange to get some sort of prejudgement if you'd like, as he's the ultimate gatekeeper.
After pondering these points a while longer, I'm even more inclined to say that no base classes at all would probably be the easiest and best way to go. In fact there really wouldn't be such a thing as a "Relational Storage". There would be a "MySQL? Storage", an "Interbase Storage", an "Oracle Storage" (whoever's up to that task ;)), and whatever other RDBS storages the community comes up with. A new implementor would of course be free to copy and modify an existing implementation of a similar database. There would be no base implementation to break other implementations built on it. There would be no need to come up with a generic interface that could handle parameterized queries, sql quoted queries, and stored procedures. There would be no need for standardized table structures. There would be no need for modularized features like undo and versions. The implementor could just feel free to put in the features they wanted. If someone else wanted to add features to an existing storage they could do that too. The more I think about it the more it seems like the path of least resistance. --jfarr