7/1/10

SQL vs NoSQL Architecture

Team, greetings to you.

In my previous mail I gave a brief introduction to NoSQL. Building on that, let us now look at its architecture. Here we go:

Why Non-Relational

Some of the stuff here may not resonate with you if you are an enterprise developer, since enterprise apps don’t have to deal with the kind of gigantic scale that (some) consumer web applications deal with. However, given the rate at which data is growing and the number of users of IT systems, these issues are only going to become more and more common – for smaller consumer apps as well as for enterprise apps. In fact, even today, irrespective of the scale at which your app operates, if you want to take advantage of a Cloud platform like Google App Engine or Microsoft Azure or Amazon Web Services, you would still need to think about some of the issues below, because at the infrastructure level these platforms do have to deal with high scale and may impose constraints on the application / data model to help them scale.
 

1.1 Relational databases are hard to scale


1.1.1 Replication - scaling by duplication

  • Master-Slave:

    • Each write results in N x writes where N is the number of slaves. This leads to diminishing returns beyond a point, thus imposing a limit
    • While reads would get faster (since you can now read from N nodes), writes are still bottle-necked to one node
    • Critical reads still need to go to the master since the write may not have propagated to all nodes. This logic needs to be built into the app.
    • High volumes of data pose a problem since you need to duplicate the data N times. This also leads to limiting how much you can scale with this approach.

  • Multi-Master: every node still has to apply every write, and concurrent writes on different nodes can conflict, which pushes conflict resolution into the application


1.1.2 Partitioning (sharding) – scaling by division:

  • Scales reads as well as writes
    • Not transparent to the application. The application needs to be partition aware (see the sketch after this list).
  • The value of an RDBMS is in relations. Once partitioned, these relations get broken – you cannot do a join across shards – this now needs to be done in the app layer.
  • In general, manual sharding in relational databases is not simple.
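
As a minimal sketch of what “partition aware” means in practice (illustrative only; the shards here are plain in-memory dicts standing in for per-shard connections), the application hashes the key and routes every read and write itself:

```python
import hashlib
from typing import Optional

# A toy illustration of partition awareness: each "shard" is just an
# in-memory dict keyed by user id; a real system would hold one connection per shard.
SHARDS = [dict() for _ in range(4)]

def shard_for(user_id: str) -> dict:
    """Pick a shard deterministically by hashing the key - the app must know this rule."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def put_user(user: dict) -> None:
    shard_for(user["id"])[user["id"]] = user

def get_user(user_id: str) -> Optional[dict]:
    return shard_for(user_id).get(user_id)

put_user({"id": "u42", "name": "Asha"})
print(get_user("u42"))
# A join across records living on different shards cannot be pushed down to
# the database; it has to be stitched together in application code.
```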

1.2 Don’t need some features


1.2.1 UPDATEs and DELETEs

  • Typically not used since that leads to loss of information

    • May need the record for auditing, or for re-activation
    • Typically, the info is never really “deleted” from a domain perspective anyway

      • A user “leaves” a community – his posts would not be removed
      • Employees “leave” companies – their employment record is maintained
      • The canonical ACID transactions example: debits from a bank account – this is not a DELETE, but an INSERT

    • Nor is info just “updated”

      • Salary gets “incremented” – the previous record is maintained
      • Your bank account balance is not updated – there are “debits” and “credits”


  • So one can typically model an UPDATE / DELETE as an INSERT and version the record (see the sketch after this list).

    • When data gets too large, archive inactive parts of data

  • Two problems that arise when you go for an INSERT-only system:

    • The database cannot help you with cascades through triggers - this needs to be done explicitly in the app layer

      • The cascades are actually far more complex than propagating a DELETE / UPDATE – this is a domain requirement:

        • When an employee leaves, you need to update the payroll system so that full and final compensation can be carried out
        • Every time a bank account gets debited, checks need to be made on standing instructions, minimum account balance, etc.


    • Queries need to filter out the inactive records

      • Can lead to dirty looking code – addressed using views
      • There would be some perf penalty that can be addressed by archival
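
A minimal sketch of the INSERT-only idea above (structures and field names are illustrative, not from any specific product): every “update” is appended as a new version, and the query layer filters to the latest record, which is exactly the role a view would play.

```python
from datetime import datetime, timezone

# Append-only log of salary records; nothing is ever UPDATEd or DELETEd.
salary_records = []

def record_salary(employee_id: str, amount: int) -> None:
    """A salary 'increment' is just another INSERT with a higher version number."""
    version = sum(1 for r in salary_records if r["employee_id"] == employee_id) + 1
    salary_records.append({
        "employee_id": employee_id,
        "amount": amount,
        "version": version,
        "recorded_at": datetime.now(timezone.utc),
    })

def current_salary(employee_id: str):
    """Queries must filter out inactive versions - the 'dirty looking code' a view would hide."""
    mine = [r for r in salary_records if r["employee_id"] == employee_id]
    return max(mine, key=lambda r: r["version"], default=None)

record_salary("e1", 50_000)
record_salary("e1", 55_000)            # the previous record is kept for auditing
print(current_salary("e1")["amount"])  # 55000
```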


1.2.2 JOINs

  • Why avoid

    • Joins are expensive when data volumes are high since the database server has to perform complex set operations over large volumes of data
    • Do not work across partitions
    • Techniques like Materialized / Indexed Views not supported by all databases

  • How to avoid? De-normalize!

    • Purpose of normalization

      • Make it easier to have consistent data by keeping just one copy
      • Reduce the amount of storage

    • With De-normalization

      • Burden of consistency shifts from the database to the application layer (see the sketch after this list)
      • Easier if you only do INSERTs and no UPDATEs / DELETEs
      • Would lead to data bloat – can be significant for large volumes, but storage is cheap and you can archive inactive data
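
A small sketch of what de-normalization shifts to the application (collections here are plain dicts, and the field names are illustrative): the customer name is copied into every order so reads need no join, and the application alone is responsible for keeping the copies in sync.

```python
# Two "collections"; each order duplicates the customer name for join-free reads.
customers = {"c1": {"id": "c1", "name": "Asha"}}
orders = {
    "o1": {"id": "o1", "customer_id": "c1", "customer_name": "Asha", "total": 120},
    "o2": {"id": "o2", "customer_id": "c1", "customer_name": "Asha", "total": 80},
}

def rename_customer(customer_id: str, new_name: str) -> None:
    """The burden of consistency now sits here, in application code."""
    customers[customer_id]["name"] = new_name
    for order in orders.values():                  # fan-out write to every copy
        if order["customer_id"] == customer_id:
            order["customer_name"] = new_name

rename_customer("c1", "Asha K")
print(orders["o1"]["customer_name"])   # "Asha K" - the app kept the copies in sync
```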


1.2.3 ACID Transactions

  • Atomic – do not need atomicity on modification of more than one record. Single key atomicity is enough
  • Consistency – CAP theorem – can get any two of Consistency, Availability, Partition tolerance – not all three. (Also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1915)


    • Most systems need partition tolerance and availability ahead of consistency.

      • Customer wants to place an order – you will accept the order rather than refuse it and return the money because the system is unavailable – availability is important
      • Inventory would be checked asynchronously
      • Order details would be checked asynchronously
      • … would be done asynchronously
      • All this while data would be in an inconsistent state
      • This is ok – businesses are like that. They do not operate on a single version of truth. Reconciliation happens all the time.

    • Therefore our data model need not be strongly / strictly consistent. We can do with Eventual Consistency.
    • In most scenarios we need Read-Your-Writes Consistency and Monotonic Reads Consistency (as defined by Vogels in the paper above) – a sketch of the former follows after this list
    • Strong consistency relies upon conflict resolution at write time to keep read complexity simpler. This does not scale.

  • Isolation – do not need isolation beyond Read-Committed. Easy with single key atomicity (above)
  • Durability – need durability till the time that RAM becomes cheap enough that one can afford many peer replicated nodes holding data in memory so that data is available even with node failures.
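
Read-Your-Writes consistency can be approximated in the application layer. The sketch below is illustrative only (the node names and the lag window are assumptions): a client’s reads go to the primary for a short window after that client has written, and to a replica otherwise.

```python
import time

READ_YOUR_WRITES_WINDOW = 5.0   # seconds; an assumed upper bound on replication lag
last_write_at = {}              # client_id -> timestamp of that client's last write

def record_write(client_id: str) -> None:
    last_write_at[client_id] = time.time()

def choose_node(client_id: str) -> str:
    """Read from the primary if this client wrote recently, otherwise from any replica."""
    wrote_recently = time.time() - last_write_at.get(client_id, 0.0) < READ_YOUR_WRITES_WINDOW
    return "primary" if wrote_recently else "replica"

record_write("client-7")
print(choose_node("client-7"))   # "primary" - the client sees its own write
print(choose_node("client-9"))   # "replica" - a slightly stale read is acceptable here
```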

1.2.4 Fixed Schema

  • In an RDBMS you have to define the schema before you can start using data (somewhat like declaring types in statically typed languages)

    • Define each entity (table), its attributes (columns) and relations between entities
    • Define usage patterns (indexes)

  • Modifying schemas is essential

    • Intense competition and rapid growth necessitate adding new features / tweaking existing features rapidly
    • Changes usually require modifying the data model, thus precipitating schema changes

  • Modifying the schema is hard

    • Adding / Deleting / Modifying a column may lock the rows (imagine doing this for several million rows)
    • Adding / Removing an index may lock up the table


1.3 Don’t get some features

  • Hard to model hierarchical data
  • Hard to model graphs
  • Don’t rely primarily on main memory

    • Preferable to avoid going to disk as far as possible and serve out of main memory to get faster response time
    • Most relational systems are not memory oriented, but disk-oriented. Even with large main memory, relational databases end up going to disk for most queries – they are not aggressive about serving data from main memory and avoiding going to disk.
    • Vendors are trying to address this by acquiring / building in-memory database technology, but this is far from mainstream

NoSQL Architecture


The NoSQL movement continues to gain momentum as developers continue to grow weary of traditional SQL based database management and look for advancements in storage technology. A recent article provided a great roundup of some of the great new technologies in this area, particularly focusing on the different approaches to replication and partitioning. There are excellent new technologies available, but using a NoSQL database is not just a straight substitute for a SQL server. NoSQL changes the rules in many ways, and using a NoSQL database is best accompanied by a corresponding change in application architecture.
The NoSQL database approach is characterized by a move away from the complexity of SQL based servers. The logic of validation, access control, mapping queryable indexed data, correlating related data, conflict resolution, maintaining integrity constraints, and triggered procedures is moved out of the database layer. This enables NoSQL database engines to focus on exceptional performance and scalability. Of course, these fundamental data concerns of an application don’t go away, but rather must move to a programmatic layer. One of the key advantages of the NoSQL-driven architecture is that this logic can now be codified in our own familiar, powerful, flexible Turing-complete programming languages, rather than relying on the vast assortment of complex APIs and languages in a SQL server (data/column definitions, queries, stored procedures, etc.).
In this article, we’ll explore the different aspects of data management and suggest an architecture that uses a data management tier on top of NoSQL databases, where this tier focuses on the concerns of handling and managing data, like validation, relation correlation, and integrity maintenance. Further, I believe this architecture also suggests a more user-interface-focused, lightweight version of model-view-controller (MVC) for the next tier. I then want to demonstrate how the Persevere 2.0 framework is well suited to be a data management layer on top of NoSQL databases. Let’s look at the different aspects of databases and how NoSQL engines affect our handling of data and architecture.

Architecture with NoSQL

In order to understand how to properly architect applications with NoSQL databases you must understand the separation of concerns between data management and data storage. The past era of SQL based databases attempted to satisfy both concerns with databases. This is very difficult, and inevitably applications would take on part of the task of data management, providing certain validation tasks and adding modeling logic. One of the key concepts of the NoSQL movement is to have DBs focus on the task of high-performance scalable data storage, and provide low-level access to a data management layer in a way that allows data management tasks to be conveniently written in the programming language of choice rather than having data management logic spread across Turing-complete application languages, SQL, and sometimes even DB-specific stored procedure languages.
Data Management Architecture

Complex Data Structures

One important capability that most NoSQL databases provide is hierarchical nested structures in data entities. Hierarchical data and data with list-type structures are easily described with JSON and other formats used by NoSQL databases, where multiple tables with relations would be necessary in traditional SQL databases to describe these data structures. Furthermore, JSON (or alternatives) provides a format that much more closely matches the data structures of common programming languages, greatly simplifying object mapping. The ability to easily store object-style structures without impedance mismatch is a big attraction of NoSQL.
Nested data structures work elegantly in situations where the children/substructures are always accessed from within a parent document. Object-oriented and RDF databases also work well with data structures whose relations are uni-directional – one object is accessed from another, but not vice versa. However, if the data entities may need to be individually accessed and updated, or relations are bi-directional, real relations become necessary. For example, if we had a database of employees and employers, we could easily envision scenarios where we would start with an employee and want to find their employer, or start with an employer and find all their employees. It may also be desirable to individually update an employee or employer without having to worry about updating all the related entities.
In some situations, nested structures can eliminate unnecessary bi-directional relations and greatly simplify database design, but there are still critical parts of real applications where relations are essential.
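
To make the nested case concrete: a blog post with its comments, which would need two related tables in an RDBMS, can be stored as a single nested document. A sketch (the document shape is illustrative):

```python
# One nested document replaces a "posts" table plus a "comments" table with a
# foreign key; the children are always read through the parent.
post = {
    "id": "post-1",
    "title": "Why Non-Relational",
    "tags": ["nosql", "architecture"],     # a list, awkward to model relationally
    "comments": [                          # nested child structures
        {"author": "ravi", "text": "Nice summary"},
        {"author": "mina", "text": "What about joins?"},
    ],
}

# The whole aggregate maps directly onto the language's own data structures -
# no object-relational mapping is needed to read the comments.
for comment in post["comments"]:
    print(comment["author"], "-", comment["text"])
```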

Handling Relational Data

NoSQL-style databases have often been termed non-relational databases. This is an unfortunate term. These databases can certainly be used with data that has relations, which is actually extremely important. In fact, real data almost always has relations. Truly non-relational data management would be virtually worthless. Understanding how to deal with relations has not always been well addressed by NoSQL discussions and is perhaps one of the most important issues for real application development on top of NoSQL databases.
The handling of relations with traditional RDBMSs is very well understood. Table structures are defined by data normalization, and data is retrieved through SQL queries that often make extensive use of joins to leverage the relations of data to aggregate information from multiple normalized tables. The benefits of normalization are also clear. How then do we model relations and utilize them with NoSQL databases?
There are a couple approaches. First, we can retain normalization strategies and avoid any duplication of data. Alternately, we can choose to de-normalize data which can have benefits for improved query performance.
With normalized data we can preserve key invariants, making it easy to maintain consistent data without having to worry about keeping duplicated data in sync. However, normalization often pushes the burden of effort onto queries, which must aggregate information from multiple records, and this can incur substantial performance costs. Substantial effort has gone into providing high-performance JOINs in RDBMSs to give efficient access to normalized data. However, in the NoSQL world, most DBs do not provide any ad-hoc JOIN type of query functionality. Consequently, performing a query that aggregates information across tables often requires application-level iteration, or creative use of map-reduce functions. Queries that rely on joining to filter across different mutable records often cannot be properly addressed with map-reduce functions, and must use application-level iteration.
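
A sketch of that application-level iteration when the store offers only key lookups (the collections here are plain dicts standing in for a key-value database):

```python
# Normalized "collections": employees reference their employer by key.
employers = {"emp-1": {"id": "emp-1", "name": "Acme"}}
employees = {
    "e1": {"id": "e1", "name": "Asha", "employer_id": "emp-1"},
    "e2": {"id": "e2", "name": "Ravi", "employer_id": "emp-1"},
}

def employees_with_employer_name():
    """The join the database no longer does: iterate one set, look up into the other."""
    for employee in employees.values():
        employer = employers.get(employee["employer_id"], {})
        yield {**employee, "employer_name": employer.get("name")}

for row in employees_with_employer_name():
    print(row["name"], "works at", row["employer_name"])
```
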
NoSQL advocates might suggest that the lack of JOIN functionality is beneficial; it encourages de-normalization that provides much more efficient query-time data access. All aggregation happens for each (less frequent) write, thus allowing queries to avoid any O(n) aggregation operations. However, de-normalization can have serious consequences. De-normalization means that data is prone to inconsistencies. Generally, this means duplication of data; when that data is mutated, applications must rely on synchronization techniques to avoid having copies become inconsistent. This invariant can easily be violated by application code. While it is typically suitable for multiple applications to access database management servers, with de-normalized data, database access becomes fraught with invariants that must be carefully understood.
These hazards do not negate the value of database de-normalization as an optimization and scalability technique. However, with such an approach, database access should be viewed as an internal aspect of implementation rather than a reusable API. The management of data consistency becomes an integral complement to the NoSQL storage as part of the whole database system.
The NoSQL approach is headed in the wrong direction if it is attempting to invalidate the historic pillars of data management established by Edgar Codd. These basic rules for maintaining consistent data are timeless, but with the proper architecture a full NoSQL-based data management system does not need to contradict these ideas. Rather, it couples NoSQL data storage engines with database management logic, allowing these rules to be fulfilled in much more natural ways. In fact, Codd himself, the undisputed father of relational databases, was opposed to SQL. Most likely, he would find a properly architected data management application layer combined with a NoSQL storage engine to fit much closer to his ideals of a relational database than the traditional SQL database.

Network or In-process Programmatic Interaction?

With the vastly different approach of NoSQL servers, it is worth considering whether the traditional network-based out-of-process interaction approach of SQL servers is truly optimal for NoSQL servers. Interestingly, both of the approaches to relational data point to the value of more direct in-process programmatic access to indexes rather than the traditional query-request-over-TCP style of communication. JOIN-style queries over normalized data are very doable with NoSQL databases, but they rely on iterating through data sets with lookups during each loop. These lookups can be very cheap at the index level, but can incur a lot of overhead at the TCP handling and query parsing level. Direct programmatic interaction with the database sidesteps the unnecessary overhead, allowing for reasonably fast ad-hoc relational queries. This does not hinder clustering or replication across multiple machines; the data management layer can be connected to the storage system on each box.
De-normalization approaches also work well with in-process programmatic access. Here the reasons are different. Now, access to the database should be funneled through a programmatic layer that handles all data synchronization needs to preserve invariants so that multiple higher level application modules can safely interact with the database (whether programmatically or a higher level TCP/IP based communication such as HTTP). With programmatic-only access, the data can be more safely protected from access that might violate integrity expectations.
Browser vendors have also come to similar conclusions about programmatic access to indexes rather than query-based access in the W3C process to define the browser-based database API. Earlier efforts to provide browser-based databases, spurred by Google Gears and later implemented in Safari, were SQL-based. But the obvious growing dissatisfaction with SQL among developers, and the impedance mismatches between RDBMS-style data structures and JavaScript-style data structures, have led the W3C, with a proposal from Oracle (and supported by Mozilla and Microsoft), to orient towards a NoSQL-style indexed key-value document database API modeled after the Berkeley DB API.

Schemas/Validation

Most NoSQL databases could also be called schema-free databases, as this is often one of the most highly touted aspects of these types of databases. The key advantage of schema-free design is that it allows applications to quickly upgrade the structure of data without expensive table rewrites. It also allows for greater flexibility in storing heterogeneously structured data. But while applications may benefit greatly from freedom from storage schemas, this certainly does not eliminate the need to enforce data validity and integrity constraints.
Moving the validity/integrity enforcement to the data management layer has significant advantages. SQL databases offer only limited, rigid schemas, whereas we have much more flexibility enforcing constraints with a programming language. We can enforce complex rules, mix strict type enforcement on certain properties, and leave other properties free to carry various types or be optional. Validation can even employ access to external systems to verify data. By moving validation out of the storage layer, we can centralize validation in our data management layer and have the freedom to create rich data structures and evolve our applications without storage-system-induced limitations.
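
A minimal sketch of such validation living in the data management layer rather than in a storage schema (field names and rules are illustrative): a few properties are strictly typed while the rest of the document stays free-form.

```python
def validate_employee(doc: dict) -> list:
    """Return a list of validation errors; an empty list means the document is acceptable."""
    errors = []
    # Strict rules on a few properties...
    if not isinstance(doc.get("id"), str) or not doc["id"]:
        errors.append("id must be a non-empty string")
    if not isinstance(doc.get("salary"), int) or doc["salary"] < 0:
        errors.append("salary must be a non-negative integer")
    # ...while other properties remain optional and loosely constrained.
    if "nickname" in doc and not isinstance(doc["nickname"], str):
        errors.append("nickname, if present, must be a string")
    return errors

print(validate_employee({"id": "e1", "salary": 50_000}))           # []
print(validate_employee({"id": "", "salary": -5, "nickname": 3}))  # three errors
```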

ACID/BASE and Relaxing Consistency Constraints

One aspect of the NoSQL movement has been a move away from trying to maintain completely perfect consistency across distributed servers (everyone has the same view of data), due to the burden this places on databases, particularly in distributed systems. The now famous CAP theorem states that of consistency, availability, and partition tolerance, only two can be guaranteed at any time. Traditional relational databases have kept strict transactional semantics to preserve consistency, but many NoSQL databases are moving towards a more scalable architecture that relaxes consistency. Relaxing consistency is often called eventual consistency. This permits much more scalable distributed storage systems where writes can occur without using two-phase commits or system-wide locks.
However, relaxing consistency does lead to the possibility of conflicting writes. When multiple nodes can accept modifications without expensive lock coordination, concurrent writes can occur in conflict. Databases like CouchDB will put objects into a conflict state when this occurs. However, it is inevitably the responsibility of the application to deal with these conflicts. Again, our suggested data management layer is naturally the place for the conflict resolution logic.
Data management can also be used to customize the consistency level. In general, one can implement more relaxed consistency-based replication systems on top of individual database storage systems based on stricter transactional semantics. Customized replication and consistency enforcements can be very useful for applications where some updates may require higher integrity and some may require the higher scalability of relaxed consistency.
Customizing replication can also be useful for determining exactly what constitutes a conflict. Multi-Version Concurrency Control (MVCC) style conflict resolution, like that of CouchDB, can be very naive. MVCC assumes the precondition for any update is the version number of the previous version of the document. This is not necessarily always the correct precondition, and many times unexpected inconsistent data may be due to updates that were based on other record/document states. Creating the proper update logs and correctly finding conflicts during synchronization can often involve application-level design decisions that a storage engine can’t make on its own.
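
As a sketch of conflict-resolution logic living in the data management layer (this is a generic three-way merge, not CouchDB’s actual API), the application can decide per field which side of a conflict wins:

```python
def resolve_conflict(base: dict, left: dict, right: dict) -> dict:
    """Three-way merge of two conflicting revisions against their common ancestor:
    keep whichever side changed a field; if both changed it, apply a domain rule."""
    merged = dict(base)
    for key in set(base) | set(left) | set(right):
        left_changed = left.get(key) != base.get(key)
        right_changed = right.get(key) != base.get(key)
        if left_changed and right_changed:
            merged[key] = left.get(key)      # domain-specific tie-break goes here
        elif left_changed:
            merged[key] = left.get(key)
        elif right_changed:
            merged[key] = right.get(key)
    return merged

base  = {"balance": 100, "status": "active"}
left  = {"balance": 80,  "status": "active"}    # a debit applied on one node
right = {"balance": 100, "status": "dormant"}   # a status change applied on another
print(resolve_conflict(base, left, right))      # {'balance': 80, 'status': 'dormant'}
```
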
Courtesy ::
