|
The modeling of distributed systems in general, that includes subject based approaches like client/server systems as
well as de-coupled, event-driven systems is a yet unsolved problem. Let's start with a small example of an enterprise integration project which is supposed to link content management and search engine in a reliable way. And yet, a number of surprisingly tricky questions pop up during the conceptual integration phase. But let's start with a description of the components and some requirements. The CMS component consists of a process which does the administration (e.g. publosishing tasks) and which uses a database to store all page and fragment data. The search engine component consists of several processes which extract, process and index document data. Interfacing between CMS and search is done using so called connectors: processes which can use JDBC calls against the database to extract document meta-data or processes which expose a push interface and let others push document content right into them. The connectors themselves are pusing documents or meta-data toward a document processing pipleine which will forward the processed documents to the indexing process. So far so good. From previous experience it is known that indexing a complete CMS can take a couple of days if done through a web-crawler connector. Sometimes a CMS develops a problem and needs to fall back to a backup version e.g. 3 hours old. Sometimes new documents should show up in the index rather quickly, e.g. important financial news. What kind of requirments are we facing here? The most important one is probably that documents within the CMS should be found through the search engine - not too long after they were published. Unpublished documents should NOT be found. We can add some additional security requirements by stating that access to documents and document-meta data through the search engine still needs to follow the original access control policies. In other words: you should not be able to find what you are not allowed to view. The core requirement has one big problem: it mentions time as an additional constraint. And on top of this - time is specified rather losely: "soon", "sometimes quickly, sometimes not", "takes a long time". But it gets worse. Neither of the connections in the system has yet been specified with respect to delivery guarantees. That leaves all kinds of questions open: What will happen if e.g. the push-update from the CMS cannot proceed because a connector is not running? Will this stop the CMS process itself? Will the update get thrown away? will the CMS keep book about a missed update and try it later again? When will that happen? Soon flags get introduced in the design and they only raise the next question: who resets a flag, e.g. when the indexer has successfully received a document? Will the search engine go straight to the CMS DB and reset the flag? Sureley not as this would create a very ugly form of tight coupling. But then how will the participating processes know that a document has finally been delivered? This calls for a distributed transaction but the costs of such a beast are prohibitive in many cases and most components in our diagram do not support them anyway. Enter crash mode. What happens when the CMS gets corrupt? What happens when the index needs to be rebuilt? (this can happen not only in case of a crash but also when configuration parameters change). When the CMS DB is rolled back to an older version we see a funny side-effect: Search now offers documents which do not really exist yet. Those documents which have been published during the last three hours e.g. and which are now lost due to the rollback. When the index needs to be re-built (assuming a configuration change) the logic to do this should work like this: a) create the new index and then b) exchange it with the existing index. Why? Deleting the old index and creating the new one is logically the same. Yes, but building the new index takes three days and that makes the first solution so much better. Otherwise the company won't have a search capability for three days. This is simple and obvious but not so obvious that we did not try it the other way round first (;-). Our little example soon exposes surprising complexity when errors etc. are introduced and we start forgetting about constraints and problems of our implementation. We would be better off if we could create a model for our solution. To do this we need a list of things that we would like to be put in a model and then we can think about how to express this and what to do with such a model. But before we do so we can already state what we don't need! We do not need a modeling language that is so formal that few people will understand it. Modeling distributed systems should be for the masses - because distributed systems nowadays are too. Model checking is nice but it should not be our first priority. Timo Kehrer suggested to build an ontology of things which are important in distributed systems. This could be a DSL, a UML profile, an ontology etc. What we also not need is kind of a connectivity diagram which includes port numbers, protocols etc. If those are important for a reason, the reason should be captured in our model, not the implementation.
|
|