This question is copied over from StackOverflow because it is on hold there, so I can't respond to it.

I have read quite a few articles lately which describe SQL and NoSQL from both sides of the divide, such as http://use-the-index-luke.com/blog/2013-04/whats-left-of-nosql. These articles very often touch on subjects like ACID and scalability. However, a number of issues I usually have with SQL seem to be rarely mentioned in these articles and I was wondering why, and whether that had to do with me not entirely understanding SQL. If anyone can enlighten me, at least partly, on one or more of the following items, I would highly appreciate it.

My issues with SQL:

  1. SQL is inherently insecure: SQL is a language which was made for insertion and doesn't have any methods to prevent insertion of code instead of data. The only way to prevent insertion is to completely isolate SQL from the application using it. Why hasn't this been solved yet in SQL itself?
  2. SQL seems to have been made for the smallest storage size possible for the data contained in it. While that still makes a lot of sense for huge amounts of data, it doesn't really for smaller databases anymore, or does it?
  3. SQL forces everything to fit a two-dimensional relational model, with separate relation tables to handle the other dimensions. To me this poses two problems:
     • the consistency of the data relies completely on the relation tables
     • the data is very difficult for humans to make sense of in cases of failure
  4. SQL doesn't maintain a history as it does destructive updates by default: there are of course all kinds of ways to create a history, but that requires custom-written code with extra tables and the use of time stamps, or writing a new record for every change, leading to exponentially growing table sizes.
  5. SQL seems to prefer data loss to loss of consistency: if an error or a loss of consistency occurs, the only way of restoring the situation to a consistent state is to use a backup, which means that the latest changes will be destroyed. This is partly because of the lack of history (see 4), but also, because of the lack of human readability, there is no real way to have a human try to correct errors.
  6. Especially in a web environment, the use of SQL usually means that models are created more than once. In a normal (simple) PHP web application, twice: once in PHP, once in SQL. In a full-stack web application, three times: once in the client app, once in the middleware, and once in the SQL database (if no ORM is used). Because of the different programming languages, and the type differences between them, there are many possible conflicts between those models. I am aware that ORMs like ActiveRecord and Django solve at least part of these problems, but the amount of extra work you still need to do because an SQL table contains a VARCHAR(25) and none of the languages used (JavaScript, Ruby, PHP, Perl, Python etc.) knows that kind of construct, is huge.
  7. Data structure changes seem to be regarded as a consistency issue: if something changes in the data structure, table changes are applied to every existing record in that table, even if a record didn't have that field originally and whether or not it makes sense for that record to have that field. A set of those changes leads to automated migrations, which add another layer of possible problems, especially with regard to consistency.
  8. Mixing up of storage logic and application logic: SQL seems to be eager to gobble up parts of the application logic as stored procedures (CouchDB also does this through views). While I do understand that for some types of operations you need server-side and very strictly controlled procedures, I don't understand why they are stored in the database and as such are part of the storage engine, instead of being a part of the application (middleware).

I am aware of (but not very familiar with) things like PostgreSQL Hstore, but I don't see entirely how that solves the things mentioned above. Thanks for any insights!

asked Apr 29 '14 at 20:48


Markus Winand ♦♦


2 Answers:

EDIT: updated for the clarifications in the new "answer" from the original requester.

  1. Is SQL inherently insecure?

    I think you are referring to SQL injection, which is one of the most dangerous security vulnerabilities of all.

    However, SQL injection is primarily a problem of education, because most textbooks and courses don't explain bind parameters at all. Writing literal values into the SQL statement itself is handy for ad-hoc queries when humans use the database directly, but it is simply the wrong approach in programs. Programs should always use bind parameters (with very few exceptions for performance), which effectively protects the program 100% against SQL injection. The problem is that SQL textbooks don't say that.

    Even apart from that, SQL has a sound security system that allows you to limit access to tables, views, and sometimes even selected rows based on rules ("row-level security").
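
    To make the difference concrete, here is a minimal sketch using MySQL-flavored prepared-statement syntax (the table and column names are made up for illustration); most host languages expose the same mechanism through their database API:

    -- dangerous: concatenating user input into the statement text lets the input be parsed as SQL

    -- safe: the statement is parsed first, the value is bound afterwards and never parsed as SQL
    PREPARE find_user FROM 'SELECT * FROM users WHERE name = ?';
    SET @name = 'O''Brien';            -- arbitrary input, treated purely as data
    EXECUTE find_user USING @name;
    DEALLOCATE PREPARE find_user;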

  2. "smallest storage size possible" (datatypes) UPDATED

    In the relational model, you should choose the right data type and not care too much about the size. There are only three main data types: numbers, strings, and temporal data (date & time). Each of these exists in different flavors. E.g. numbers may be exact (NUMBER) or approximate (FLOAT). Strings might be fixed length (CHAR) or variable length (VARCHAR). In any case there are upper bounds for the precision (numbers) or length (strings).

    So, how do you choose a proper length limit for, let's say, a last name? As a matter of fact, it is not possible to decide on a sensible maximum limit. There is no law of nature that says a last name may not have more than X characters. So, the common approach should be to use some system maximum (e.g. 4096 or so). In practice, however, people tend to use values that are far too small for such fields (e.g. 32 chars). There is only one reason I'm aware of that is in favour of setting lower maximum bounds, and that is indexing. See here for some constraints implied by various database brands.

    So, yes, SQL forces you to define some maximum limits even when it is not sensible to do so. Point taken. I'd, however, emphasize that this is a limitation of SQL—not the relational model. Please also note that VARCHAR(4096) does not mean it will take that much space on disk; it is just an upper bound on how long the string can get.
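
    A small hypothetical DDL sketch of that point: a generous length limit documents the upper bound without costing extra storage for shorter values (exact limits vary by database brand):

    CREATE TABLE person (
      id        INTEGER       NOT NULL PRIMARY KEY,
      last_name VARCHAR(4096) NOT NULL   -- generous upper bound; 'Smith' still occupies only 5 characters
    );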

  3. About normalization.

    You are right. Normalization solves a few problems (de-duplication and preventing unintentional inconsistencies) but opens some others. Namely:

    • How to easily access data from many tables.
    • How to maintain consistency across many tables.
    • How to cope with intentional "inconsistencies" aka. histories (master data changes)

    In principle, SQL should provide tools to compensate for these challenges raised by normalization.

    Accessing data from many tables should be done using joins and similar operations of SQL. SQL does more than storing and retrieving data in a 1:1 fashion; it provides tools (joins, subqueries, set operations, ...) to transform the normalized data into the form that is most suitable for a specific task. This is done intentionally at runtime because the tasks don't need to be known beforehand. The nature of the data, on the other hand, is considered to be static, so storing it in a normalized fashion is valid. This is a very important key concept of the relational model and SQL: the nature of the data doesn't change, so that's the way it should be persisted. How you use that data varies widely and does often change over time — hence this has to be done dynamically. This is of course a very regular task, so it makes sense to have a solid tool to make it easy. We call this tool SQL ;) The DRY rule can be accomplished by using views or CTEs, but both may hurt performance because the implementations aren't well optimized for that (something that I openly criticize!).
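
    As a standard-SQL sketch of that runtime transformation (the customers/orders tables are invented for the example), the same normalized data can be reshaped per task with a join, and the DRY rule can be served by a CTE or a view:

    -- reshape normalized data for one specific task, at query time
    WITH recent_orders AS (
      SELECT customer_id, SUM(amount) AS total
        FROM orders
       WHERE order_date >= DATE '2014-01-01'
       GROUP BY customer_id
    )
    SELECT c.name, r.total
      FROM customers     c
      JOIN recent_orders r ON r.customer_id = c.id;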

    Keeping data consistent across many tables is mostly done with the help of constraints.
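
    A minimal sketch (invented tables) of how such cross-table consistency is declared rather than programmed:

    CREATE TABLE orders (
      id          INTEGER NOT NULL PRIMARY KEY,
      customer_id INTEGER NOT NULL
                  REFERENCES customers(id)   -- an order cannot point to a non-existing customer
    );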

    Handling intended "inconsistencies" (histories) was finally covered by SQL:2011: it allows "AS OF" queries and also provides tools to maintain temporal consistency (e.g. the validity of one row may not overlap with the validity of another row). Arguably, it is pretty poor that it took 40 years or so to come up with a solution for that. And I don't even know when this will be commonly available!
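
    For illustration only, this is roughly what SQL:2011 system versioning looks like in vendors that implement it (MariaDB-flavored syntax shown; the table is made up):

    CREATE TABLE price (
      product_id INTEGER,
      amount     DECIMAL(10,2)
    ) WITH SYSTEM VERSIONING;            -- old row versions are kept automatically

    -- "AS OF" query: the data as it was at a given point in time
    SELECT *
      FROM price
       FOR SYSTEM_TIME AS OF TIMESTAMP '2014-01-01 00:00:00';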

    I think that part is pretty much true for every system: "the data is very difficult for humans to make sense of in cases of failures" (my emphasis). However, I think you might mean that it is hard to investigate issues because the required data might be spread across several tables. SQL's answer for that is: VIEWs, which are basically just stored queries. However, depending on the database brand, VIEWs might introduce performance issues. That is, however, a limitation of some database brands, not a limitation of SQL or the relational model.

    UPDATE You are referring to a 2D structure, which is what we call a table (or actually a "relation"). I'll first try to answer your added questions literally, then add some background:

    normalization of data into a 2D structure of rows and columns is still actually valid

    Yes, that's the heart of the relational idea.

    especially also in combination with the space it takes (see 2)

    I don't think the relational model inherently takes more or less space. It might take less space due to de-duplication, but more space because it might need extra data (foreign keys) to model the connections between the data.

    How unwanted is duplicate data actually?


    The motivation behind deduplication was to ease maintenance. If you have the client master data in one place, it's easy to maintain it there. Also it will take less space, which is arguably not as important as it was 40 years ago.

    I don't think that your reference to E.W. Dijkstra applies here. With "nature of the data" I meant it like "law of nature". Data has some deeper truth. E.g. parents and children. It is a law of nature that each human being has a biological mother and a biological father. Nobody can have three parents, yet parents can have many children in various "configurations" (half siblings). So, the law of nature already implies how to design the tables. Each person needs one biological-father and one biological-mother attribute (which would be references to other persons, of course, protected by a foreign key). In theory, that would never change.

    This example is particularly interesting because medical science might break computer science here. What if we allow cloning of humans? What if somebody manages same-sex fertilization? What if somebody manages to mix three or more gene sets into one "child"? If we needed to reflect those cases in a relational model, we'd add an extra table to implement an n:m relationship (parent-of).
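
    A sketch of both variants (hypothetical table names): the "law of nature" version with two fixed parent references, and the more general n:m version:

    -- two biological parents, fixed by design
    CREATE TABLE person (
      id        INTEGER NOT NULL PRIMARY KEY,
      name      VARCHAR(255),
      mother_id INTEGER REFERENCES person(id),
      father_id INTEGER REFERENCES person(id)
    );

    -- the more general n:m variant, should the assumption ever break
    CREATE TABLE parent_of (
      parent_id INTEGER NOT NULL REFERENCES person(id),
      child_id  INTEGER NOT NULL REFERENCES person(id),
      PRIMARY KEY (parent_id, child_id)
    );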

    When you do your analysis properly, these kinds of changes are pretty rare. The procedure of normalization is also about extracting the deeper truth of the data. What is quite common, however, is that you need additional attributes on an existing table. But that's quite easy to do.

    On the human readability: you are focusing on the case of corrupted data in the sense of disk failure?

  4. Keeping History

    I've already mentioned that above (SQL:2011).

    The following is also true for every system that wants to keep history: "leading to exponentially growing table sizes." Although I'd say it's "ever growing", not "exponentially".

    The tools to cope with it are triggers or the ORMs. If you want to be sure nobody does a "destructive update", you can just revoke UPDATE rights on that table (and also DELETE, to be on the safe side).
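
    A minimal sketch of that rights-based approach (the role and table names are invented):

    -- the application role may add and read history rows, but never change them
    REVOKE UPDATE, DELETE ON person_history FROM app_role;
    GRANT  SELECT, INSERT ON person_history TO   app_role;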

    UPDATE

    dead weight it essentially creates, which will needlessly slow down the retrieval process, or am I wrong in assuming that?

    There are many ways to make sure historic data doesn't slow down your system. My personal preferences:

    • use indexes properly, so you end up with O(log N) [where N is the total number of rows, incl. historic ones]
    • use partial indexes (not always possible) that only contain active records, so you end up with O(log n) [where n is only the number of active records] (see the sketch below)
    • use partitions to physically store unneeded data separately (mostly useful for maintenance, overused for performance IMHO)
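
    For the partial-index option, a PostgreSQL-flavored sketch (the table and status flag are assumptions for illustration):

    -- index only the active rows; lookups on active data stay O(log n)
    CREATE INDEX active_orders_idx
        ON orders (customer_id)
     WHERE status = 'active';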

  5. "prefer data loss to loss of consistency: "

    I find that an interesting point of view. However, the SQL answer to this is that you try very hard not to get wrong data into the system in the first place, mostly by using a proper schema, constraints + ACID. In that way your statement is somewhat right: instead of being accepted, inconsistent data is rejected (which is something different than lost!). So, you must handle the error at the time somebody is entering the rejected data, as opposed to some later time when you try to resolve inconsistencies because you accepted the bad data in the first place. So yes, that's the philosophy of the relational model and SQL!

    The lack of human readability obviously depends on your background. However, the ability to correct data using SQL is pretty nice, I'd say. Here I'd also like to quote from the original IBM paper about SEQUEL (at that time, that was its actual name):

    SEQUEL is intended as a data base sublanguage for both the professional programmer and the more infrequent data base user.

    In my observation, it is absolutely true: I've recently had an assignment to teach SQL to support staff so that they can investigate cases directly in the database. They were not programmers, yet understood SQL quite quickly. And I think this is where your "human" argument kicks in: what they had problems with was navigating a real-world relational model consisting of several hundred tables. But that issue was quickly solved by asking development to provide views for some common tasks that involve more than a few tables. Joining those views was then no problem anymore.
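
    A sketch of such a convenience view (all names invented) that hides the joins from the support staff:

    CREATE VIEW customer_cases AS
    SELECT c.name, a.city, o.id AS order_id, o.status
      FROM customers c
      JOIN addresses a ON a.customer_id = c.id
      JOIN orders    o ON o.customer_id = c.id;

    -- support staff can then query the view like a single table
    SELECT * FROM customer_cases WHERE name = 'Miller';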

    For relational thinking you need a different mind set, very much as you need a different mind set for functional programming. That's not good or bad—yet it might be uncommon for you. Once you use it regularly you'll get used to it.

    UPDATE

    I also think you haven't got the full picture yet. So, let's go on :)

    I've already mentioned the deeper truth in the data above, "law of nature" and so on. Most of the time, people who struggle with the "inflexible", "static" schema of the relational model didn't try to find the deeper truth in the data they want to store. That can be a hard job and I also admit that it is sometimes impossible (e.g. if the data is just about something arbitrarily man-made - in that case I'd consider the most flexible option to be on the safe side). However, quite often this works out, you find some deeper truth, and that will then hold.

    Why is it not possible to only store certain columns of new rows and leave the old rows intact, and actually show that in the search results?

    Well, strictly speaking it is possible using NULL. NULL is the SQL placeholder for unknown data. Example: you're adding a new column that you'll populate from now on. However, you don't know that data attribute for the old records and you also cannot set a sensible default. Solution: allow the column to be NULLable (the default anyway) and just leave the old records alone. However, you must cope with the fact that this column might be NULL in your application and queries.
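
    A minimal sketch of that approach (the column name is invented; ALTER TABLE syntax varies slightly by database brand):

    -- old rows keep NULL in the new column; only new rows get a value
    ALTER TABLE person ADD COLUMN nickname VARCHAR(255);   -- nullable by default

    -- queries must cope with the NULLs, e.g. via COALESCE
    SELECT name, COALESCE(nickname, '(unknown)') AS nickname
      FROM person;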

    When you select this column, you'll still get the column for each row, though some of the values might be NULL. It is, however, the heart of the relational model that the result is still presented as a table (no cells are missing, although they might be NULL). The reason for this is how you can process these results with SQL: the result of an SQL query is a relation (=table) and can be used anywhere in an SQL statement where a table (=relation) could appear (the FROM clause), but also in some special places.

    Abstract examples:

    -- a query result used as a table in the FROM clause (most databases require an alias)
    SELECT *
      FROM (SELECT a, b FROM TBL) t

    -- a query result used as the right-hand side of IN
    SELECT *
      FROM x
     WHERE y IN (SELECT y FROM z)

    -- a query result used as the source of an INSERT
    INSERT INTO x SELECT * FROM y;

    Sticking to a strict relational model allows us to combine queries as shown above.

    About the constraints etc: In how far is the schema and the constraints part of the application instead of the database?


    The schema is always implicitly part of the application (your application needs to be aware of the deeper truth in the data as well: the facts about parent-child relationships will end up in your code anyway). However, a strict, database-validated schema reduces some risks: if you have a DATE column, you know that it can only store valid dates. You cannot store April 31 in a DATE field. Your application doesn't need to care about that anymore. You can also be sure that all rows of a table have all columns—the worst thing that can happen is that some of them are NULL if you don't have a NOT NULL constraint on them. The database takes care of that.
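
    A brief sketch of that risk reduction (the table is invented): the database rejects impossible values, so the application no longer has to check them:

    CREATE TABLE invoice (
      id       INTEGER NOT NULL PRIMARY KEY,
      due_date DATE    NOT NULL,                       -- only valid dates, never NULL
      amount   DECIMAL(10,2) CHECK (amount >= 0)       -- negative amounts are rejected
    );

    -- fails: April 31 is not a valid date
    INSERT INTO invoice (id, due_date, amount) VALUES (1, DATE '2014-04-31', 10);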

  6. Object/Relational Impedance Mismatch

    I think that topic doesn't need any long discussion: yes it exists, yes there are tools to cope with it in some way or another. I've made my point about excessive use in my article.

  7. Data structure changes

    I think that is mostly due to a poor understanding of the relational model in the first place. Compare above: "the nature of the data"

    It's also a pretty well-discussed argument: schema vs. "schema-less". Choose your flavor. "Schema-less" quite often just means "doesn't provide schema management tools"; nevertheless, you have to cope with the fact that we sometimes want to add more properties to an existing entity. RDBMSs provide tools for that: new columns can be nullable or have default values. More drastic changes, such as moving one attribute to an extra table (e.g. 1:n), can be done with CREATE TABLE ... AS SELECT. You might even provide a compatibility view that still delivers the data in its old shape (as though the moved attribute were still stored in the table). Once you have changed your schema, your application can rely on its constraints (such as the existence of columns or the validity of constraints). That's quite a lot of stuff the database can do for you in an extremely reliable manner. Stuff you don't need to care about in your app anymore.
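
    A hedged sketch of such a migration (all names invented): one attribute moves to its own table, and a compatibility view keeps the old shape available:

    -- move the phone attribute into its own table (1:n from now on)
    CREATE TABLE phone AS
    SELECT id AS person_id, phone
      FROM person
     WHERE phone IS NOT NULL;

    ALTER TABLE person DROP COLUMN phone;

    -- compatibility view: delivers the data as though nothing had moved
    CREATE VIEW person_with_phone AS
    SELECT p.*, ph.phone
      FROM person p
      LEFT JOIN phone ph ON ph.person_id = p.id;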

    An argument you didn't mention is that those schema changes often involve downtime. That was definitely true in the past, and to some extent it still is today. E.g. MySQL introduced online ALTER TABLE only recently, in 5.6. However, that's quite often an implementation limitation, not a problem inherently tied to the relational model or SQL. Even some more complex changes (like moving an attribute to another table) can be done online when done right and planned carefully (I've done that with one of the expensive databases that provide all the tools you need for it). Generally it is about keeping the migration code out of your application and handling it in the database. After the migration you should have migration artifacts neither in the DB nor in the application code. Of course, there are cases where a downtime is inevitable (I think ;).

    UPDATE

    Seems like you are coming from an object-oriented background?

    Table inheritance as such is not part of the SQL standard. (PostgreSQL, however, supports it as a proprietary extension.) Having an extra table for each class (in OOP terms) would be the classical way to model it. Although I worked for almost 10 years at a bank where we were using OOP (in Java) and SQL, I cannot remember ever having had an urgent demand for table inheritance. Inheritance was mostly used for behavioral polymorphism (which is no fun in Java anyway). Inheritance was hardly used to add attributes. In the cases where it was, it was single attributes that we then just left as nullable columns in the main table. Today, I'd suggest adding a CHECK constraint to make sure the column is not NULL if the row is about a derived object that requires this attribute.
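
    A sketch of that single-table approach with a type discriminator (all names are assumptions for illustration):

    CREATE TABLE account (
      id            INTEGER      NOT NULL PRIMARY KEY,
      account_type  VARCHAR(20)  NOT NULL,     -- e.g. 'SAVINGS' or 'CHECKING'
      interest_rate DECIMAL(5,2),              -- only meaningful for savings accounts
      CHECK (account_type <> 'SAVINGS' OR interest_rate IS NOT NULL)
    );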

    I absolutely see your point, but in my experience that problem isn't that big in practice. However, that doesn't help you if you need it for your application :)

    Regarding your last name example: modelling names and addresses is indeed one of the harder problems, because it is one of the above-mentioned arbitrarily man-made concepts. This problem was much discussed between E. F. Codd and C. J. Date (if I remember correctly). From my memory of what I read a while ago somewhere: originally, Codd used the term "atomic" to describe the level down to which you should store data artifacts in different columns (otherwise it's not normalized). From that perspective you should separate first name and last name. However, C. Date criticized this because the term "atomic" could also mean storing each letter individually (a string can always be decomposed into more atomic units). Date then suggested something along the idea of the "domain value", meaning it should be one value in the domain model. E.g. if all you need to know about a customer is the shipping address, then it is valid to consider the complete shipping address (e.g. name, street address, ...) as one domain value that might be stored in one column. I think in almost all cases, having the classic first name, middle initials and last name fields is a bad choice.

  8. "Mixing up of storage logic and application logic"

    SQL actually does the exact opposite: SQL abstracts the storage layer completely away.

    Nobody forces you to use stored procedures. I personally also think that stored procedures are overused, mostly because stored procedures are stored in the database and can thus be changed (optimized) by database administrators that might not have access to other source code. In other words: I think it is often done out of desperation.

    The second argument is, of course, the excessive use of ORMs again and policies that disallow the use of real SQL in the application.

    UPDATE As mentioned above, I also think that always requiring a length limitation is not the best idea, and I'm in favor of using very large values whenever there is no sensible way to define a limit.

    Under "storage layer" I don't consider the complete database (like you do, I think) but only how the tables and indexes and so on are actually stored. With SQL, you operate on logical entities. Joe Celko would say "Columns are not fields. Rows are not records."

answered Apr 29 '14 at 20:50


Markus Winand ♦♦

edited Apr 30 '14 at 22:26

I've tried to shoehorn a much longer answer in the comments here, and I don't think that is going to work: how could I add this longer point by point reply?

(Apr 29 '14 at 22:39) Maurits Lamers

@Maurits Lamers Usually, you should edit the question. However, as I created it, feel free to add another answer (I don't see any other option).

(Apr 29 '14 at 22:43) Markus Winand ♦♦

Thanks for answering my questions with so much detail. Taking your numbered list as a lead:

  1. I indeed didn't know about bind parameters.
  2. What I meant is that SQL is very much concerned with the sizes of the fields and deduplication (see also 3). While that makes sense in a situation where disk space is limited (terabytes of data or very expensive hard disk space), what is the relevance of it today for say a simple WordPress site?
  3. I guess that my wording didn't really convey the intention of the question, being whether this normalization of data into a 2D structure of rows and columns is still actually valid, especially also in combination with the space it takes (see 2) and the related question: How unwanted is duplicate data actually?

    Your answer though gives a few nice points, especially where you mention that the "nature of the data […] is considered to be static". To me this sounds very much like E.W. Dijkstra, when he said something along the lines of "if you have to change your program after you finished writing it, you didn't do it right". My impression is that the practice of software development has shown that continued use of a program will constantly require adjustments to the new circumstances in which this program is being used, and to new requirements users want to have. In that light, a static data structure seems to me to be a bit idealistic, but to what extent is that feeling justified?

    On the human readability: JSON is way more human readable than SQL tables, even for an equal data set. When things go wrong, and an SQL engine can no longer make sense of the data, it is almost impossible to read an SQL database file directly, because the data as such doesn't make sense. There is no human way to read the data and still know the relations. In JSON, this is not normalized, and while this obviously has its drawbacks (duplicate data, difficulties searching through the data), it keeps being human readable, because it is plain text.
  4. You are completely right about "ever growing" vs "exponentially", but my remark was more about the amount of dead weight it essentially creates, which will needlessly slow down the retrieval process, or am I wrong in assuming that?
  5. I think that it is very, very hard not to get the wrong data into the system, because of the way the table system works. If you have a very big table containing people's names stretching over a long period of time, you will have a field called lastname, which makes sense for people living now, but not really for most people living in the 12th century. While that might be a bad example, and a different interpretation of the term "wrong data", it is the kind of example I see in quite a few databases where fields were added that effectively change an old record in a "human" inconsistent way. Why is it not possible to only store certain columns of new rows and leave the old rows intact, and actually show that in the search results?
    About the constraints etc: In how far is the schema and the constraints part of the application instead of the database?

    I am aware that a certain mind set is required; however, I have very often found this mindset to be unsuited for the tasks at hand because of its extreme inflexibility and its resistance and complexity to change. This was also part of the reason for me to ask this question in the first place, because I thought that I might just have misunderstood a few things about SQL in general.
  6. clear.
  7. What I think that I am actually mostly searching for is a middle ground between schema-less and full schema. A flexible schema with polymorphism would be great, but am I right in thinking that is not something that can be done with SQL without having to include an extra table for every possible type?

    About the additional columns: this is exactly what I referred to with the lastname example in 5. While the table might be consistent that way, the record itself might inherit fields that do not make sense for that record, whether or not the value is null.
  8. Together with 7: What I have seen quite a few times, is that an application has to be aware of that a certain field can only be 25 characters long, as it is defined in the database that way. To me that feels not like abstracting the storage layer away, or am I wrong in thinking that, and is that just a case of bad database / schema design?

answered Apr 30 '14 at 11:43


Maurits Lamers

I've added some updates. I hope I haven't overlooked anything. Let me know if it is clearer now. If I find the time, I might consolidate all of this into a proper blog post.

(Apr 30 '14 at 22:27) Markus Winand ♦♦