There are four main challenges when scaling a database: search, concurrency, consistency, and speed.
Suppose you have a list of 10 names. To find someone, you just go down the list.
But what if there are 1 million names? Now you need a strategy for finding something. A telephone book lists the names in alphabetical order so you can skip around. This is a solution to the search problem.
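To make the "skip around" idea concrete, here is a minimal Python sketch (the names and numbers below are made up) contrasting a front-to-back scan of an unsorted list with the binary search that a sorted phone book makes possible:

```python
import bisect

# A tiny, hypothetical "phone book": names kept in sorted order.
names = ["Adams", "Baker", "Chen", "Diaz", "Evans", "Flores", "Garcia"]
numbers = ["555-0101", "555-0102", "555-0103", "555-0104",
           "555-0105", "555-0106", "555-0107"]

def linear_lookup(name):
    # Unsorted-list strategy: check every entry until we find a match. O(n)
    for i, n in enumerate(names):
        if n == name:
            return numbers[i]
    return None

def binary_lookup(name):
    # Sorted-list strategy: repeatedly halve the search range. O(log n)
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return numbers[i]
    return None

print(linear_lookup("Evans"))  # 555-0105
print(binary_lookup("Evans"))  # 555-0105
```

With 1 million names the linear scan may look at a million entries, while the binary search needs about 20 comparisons.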
What if 1 million people are trying to use the telephone book at the same time? This is the problem of concurrency. Everyone could wait in one very long line at City Hall, or you could print 1 million copies of the book -- a strategy called "replication". If you put them in people's homes -- a strategy called "distributed" -- you also get faster access.
What if someone changes their phone number? The strategy of replication created a problem, which is that you now have to change all 1 million phone books. And when are you going to change them, because they are all in use? You could change them one at a time, but this would create a data consistency problem. You could take them all away and issue new ones, but now you have an availability problem while you are doing it.
And what if thousands of people are changing their phone numbers every hour? Now you have a giant traffic jam called "contention for resources" which leads to "race conditions" (unpredictable outcomes) and "deadlocks" (database gridlock).
All of these problems have solutions, but the solutions can get very complex. For example, you can issue addendums to the phone books (called "change logs") rather than reprinting all of them. But you have to make sure to check the addendums all the time. You can distribute new versions of the phone books with a cut-over date, so that everyone switches at the same time to get greater consistency, but now the phone books are always slightly out of date.
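As a rough sketch of the "addendum" idea -- not any particular database's implementation -- each copy of the phone book can be treated as a base snapshot plus an append-only change log that readers check first:

```python
# Hypothetical example: a base snapshot plus an append-only change log.
base_book = {"Adams": "555-0101", "Baker": "555-0102"}

change_log = []  # list of (name, new_number) entries, appended over time

def update_number(name, new_number):
    # Instead of reprinting the whole book, record the change in the log.
    change_log.append((name, new_number))

def lookup(name):
    # Readers must check the addendum (newest entries win) before the base book.
    for logged_name, number in reversed(change_log):
        if logged_name == name:
            return number
    return base_book.get(name)

update_number("Baker", "555-0199")
print(lookup("Baker"))   # 555-0199 -- found in the change log
print(lookup("Adams"))   # 555-0101 -- found in the base book
```

Periodically the log can be folded back into a fresh snapshot (the "cut-over" new edition), which is roughly the trade-off described above: readers do extra work checking the log, or everyone waits for the next edition and reads slightly stale data.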
Now scale this to billions of names in data centers distributed around the world accessed by millions of users.
The basic goal of a database is to maintain the illusion that there is only one copy, only one person changes it at a time, everyone always sees the most current copy, and it is instantly fast. This is impossible to achieve at scale when millions of people are accessing and updating trillions of data elements from all over the world.
The task of database design, therefore, is to come as close to this illusion as possible using hundreds of interlocking algorithmic tricks.
Here is an explanation for true laymen, i.e. non-technical people who don't understand databases.
(For people who do, just ignore this and the minor technical errors in the analogy)
There are numerous ways in which "scaling" is hard, but first I want to explain why scaling is fundamentally hard: "scaling" isn't a single activity. In gross terms, it is about making a complex system "greater" - usually bigger or more productive - and typically having to do so very quickly. The key is that a complex system cannot be made bigger or more efficient in any one simple way. The parts of the system interact with each other, so if you just expand one part, the other parts usually fail to function with it correctly and you don't get the desired expansion in capability - you almost always have to do some re-engineering.
An analogy:
Think of a database as a library. It's where you store books, or collections of books (e.g. all the Harry Potter books). In particular, your web application is a library that stores books and makes them available to people who want to read them. Imagine that this library gets featured on TechCrunch and becomes very popular, and suddenly you have to deal with a bunch of new scaling issues. Let's examine how some of them play out, in simplistic terms:
Example one: Many, many books.
Your library is popular, and is growing. In fact, now you have many more books than you started with, and your current building cannot house them all. For a long time, you just put them on the next shelf in the room. But now you have exceeded the size of your humble library building. What you have to do is buy or lease some adjacent buildings, and put the books in those buildings. You might run into issues doing so, because there are a finite number of buildings near you, or maybe real estate prices are very high and it's just not financially sustainable to lease the very expensive building next door. So you have to think carefully in terms of which buildings to lease, and how to find them so that they are near enough and cost-effective to use for book storage.
The real-world analog here is that databases are often stored on hard drives, and hard drives are of finite space, and you can only pack so many hard drives on a computer (in a data center, on a rack). Data centers are now very, very large in order to contend with this common problem but if you are an extremely voluminous library, you may exceed this constraint, and overflow your data center rack's ability to hold hard drives, or even your entire data center and need a new one (rare). Still, the point is that no matter how much room you start off with, you may exceed it, and you may not be able to linearly keep adding more "units" of room (i.e. shelves in the same building) and you might have to take some sort of step jump, like leasing another building or renting another rack or building another datacenter.
Example two: Finding a book among many, many books.
When your library fit comfortably in one room, all you had to do was line up all the books alphabetically, and if someone wanted a book, they just searched through the big room until they found it. It probably took at most 30 minutes.
Now your library is so big that you've rented several buildings. When someone wants a book, they have to walk through (potentially) several buildings to find the book. Let's say that people simply won't stand for this amount of lag time to find a book (similar to how long it takes to load a webpage). They just want to go to the correct building right away, go to the right floor, go to the right shelf, and grab the book they want. They don't want it to take more than the 30 minutes it used to take.
To do that, you need to create a new reference system, called an index. Libraries in real life actually have this problem, and the solution used to be card catalogs: cabinets of small drawers filled with index cards.
Young people have mostly never seen one, because card catalogs come from the era before computers. A card catalog is literally a database, but built out of small drawers and little pieces of paper (the cards). Because they're so bulky, we digitized them and put them on computers, which is why if you are younger than about 25, you've probably never encountered one.
What the card catalog (the index) does is keep a card for every book, sorted into drawers by title, author, subject, etc. Then, if you want to look for a book, you go to the card catalog - which fits inside a single room - look up the card corresponding to the book you want, and the card tells you exactly which building, floor, and shelf the book is on. So you spend 10 minutes at the card catalog, 10 minutes walking over to the right building, then 5 minutes getting to the right floor and shelf, and 5 minutes locating the exact book.
When a library gets too large, it has to implement a card catalog to keep book-searching times down to something reasonable (e.g. 30 minutes); otherwise it will take days to search several buildings for a book and people won't use the library. This feature is qualitatively different from just leasing more buildings and searching them all. It is an example of how, once you've passed a certain threshold, you have to come up with a new solution to overcome a scaling issue: you're not just getting some more shelves. You have to put together the catalog, you have to print up all the cards (which is hard, because you have to go and list all your books and sort them first by author, then title, then subject, etc., and this is a huge pain because you've already got several buildings full of books), and you have to make a special room for the card catalog near the front door and tell everyone to check the card catalog first.
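In code, the card catalog is just a lookup structure kept separately from the books themselves. A minimal sketch, with invented building/floor/shelf values:

```python
# The "card catalog": an index from book title to its physical location.
catalog = {
    "Moby-Dick":         ("Building 3", "Floor 2", "Shelf 14"),
    "Pride & Prejudice": ("Building 1", "Floor 1", "Shelf 02"),
    "Dune":              ("Building 7", "Floor 4", "Shelf 31"),
}

def find_book(title):
    # One catalog lookup replaces a walk through every building.
    location = catalog.get(title)
    if location is None:
        return f"{title!r} is not in the library"
    building, floor, shelf = location
    return f"{title!r} is in {building}, {floor}, {shelf}"

print(find_book("Dune"))
```

The catch, as the paragraph notes, is that the catalog itself has to be built and then kept up to date every time a book moves.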
Example Three: Many, many people looking for a book all at once
Let's examine the simplest form of this problem: your library is now so popular that tons of people are visiting. Huge crowds, entire mobs of people. This is not as absurd as it sounds - it is one of the most common problems facing web applications that suddenly become very popular.
So you have so many people who want to look for a book that they can't fit through the door. This sounds absurd because it rarely happens in real life, but think about it - a regular door can admit only about one person per second. So if you have 20 people per second who want to enter your library, you end up getting a huge crowd stuck outside your door waiting to pass through it. More people keep coming and the crowd gets bigger and bigger. Eventually it is so huge that more people are waiting outside the library trying to pass through the door than there are who can use your library, so the majority of users attempting to use your library report an experience that is really just waiting outside and never reading a book. Bad word of mouth ensues, and more people hate you than those who have a satisfying book-related experience.
The obvious solution is to cut more doors in the walls. Okay, so you cut another door. Twice as many people can come in now! You cut more doors. Eventually hundreds of doors. Every wall is filled with doors. Hell, you just remove all the walls! Suddenly so many more people can use the library now! Orders of magnitude more!
But soon you run into another problem. It turns out that there is only a limited amount of space available for people to stand in front of a shelf looking for a particular book. Maybe they can do so quickly - they scan the shelf, find the book they're looking for, and get out of the way. But they still occupy that floorspace for a few seconds, and eventually you are so popular that hundreds of people are looking for the same book (or a book that happens to sit in the same spot on the shelf as another book someone wants), and they can't all jam into that space in front of the shelf.
Once again, waiting crowds end up forming in front of the shelves - maybe all of them, maybe just the ones holding the more popular books. There are a few possible solutions here:
If the crowds are clustered around only the most popular books, you just spread them out throughout the library. But then the books are no longer sorted, they are distributed kind of randomly, so now you have to restructure (re-sort) your card catalog so that people can find the book they want at the new location. That's a pain, but not too much - you just have to update the cards for all the popular books.
If the crowds are everyone, i.e. all books are too popular, or you just have too many people, you can try replication. That is, make a copy of your entire library, lease a whole new set of buildings on the other side of town (or the next block), and send half the people to the new building. You can do this a few more times and make several replicas. One issue you have to contend with here is that in order to keep the copies of your library up to date, you have to make sure all new books get continually copied between the libraries. One way to do this is to call one library the "master" library, and new books only come to this library; every time that happens, you have someone make copies of those books and dispatch a runner to deliver them to all the other library copies. The traffic generated by these runners counts against the number of people who can use your library, so you'll have to impose traffic limits on how many people can use one of the libraries, and once it gets too high, you create yet another replica instance of your library.
The key idea here is, again, that in the beginning, in order to scale against the problem of the crowd, you just cut a new door. That doubled your throughput capacity. You could cut yet another door to increase it again. So for a while you could scale by cutting new doors in your walls, until you ran out of walls to cut doors in - you removed all the walls. Once you hit that limit and need to scale further (and remember, the number of people coming to your library just keeps on increasing without end), you have to come up with a wholly new solution, like creating multiple copies of your library. Doing so requires a lot of effort - you have to copy all your books, lease whole new sets of buildings, and then figure out a way to manage the incoming traffic so that each library gets a proper fraction of it. All of that is new infrastructure, and you can't do it at the moment you realize you've cut your last door, because during the time it takes to set all that up, the total traffic will continue to rise and you'll get hordes of dissatisfied library users crowding around the library again. So you have to predict, based on the rate of traffic increase, when the door-cutting solution isn't going to work anymore, and start early on the library replication.
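Here is a toy sketch of the "master library plus runners" arrangement, assuming a single primary that fans each new book out to its replicas (the class and branch names are invented for illustration):

```python
class Library:
    def __init__(self, name):
        self.name = name
        self.books = set()

class MasterLibrary(Library):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas  # copies of the library around town

    def add_book(self, title):
        # New books arrive only at the master...
        self.books.add(title)
        # ...and a "runner" carries a copy to every replica.
        for replica in self.replicas:
            replica.books.add(title)

replicas = [Library("East Branch"), Library("West Branch")]
master = MasterLibrary("Main", replicas)
master.add_book("Dune")

print(all("Dune" in lib.books for lib in replicas))  # True
```

The runners' trips are the replication traffic the paragraph mentions; every extra replica adds more of it.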
Example Four: Adding many, many new books
Every library has to stay updated, which means that it has to add new books. Let's say your library represents a very active literary field, and there are tons of new books coming out all the time, constantly.
So you have people who buy the new books, make copies of them at the master library, and then distribute the new books to the shelves everywhere. The runners who add the new books create a certain amount of traffic themselves, but you have a good enough handle on things - and enough replica libraries - that ambient traffic levels stay low and it's not a problem. Great, things look awesome and you decide you can finally take a day off.
Okay, it turns out that all the books in your library are sorted. That is, they're not just randomly put on shelves, they're ordered by author, or title, or whatever. (in the real world, ...

Scaling a database can be challenging for several reasons:
- Data Size: As the amount of data grows, it can become harder to store and manage everything efficiently. Larger databases may require more powerful hardware and complex configurations.
- Performance: When more users access a database simultaneously, it can slow down response times. Ensuring fast performance as the number of users increases often requires advanced techniques.
- Complexity: Databases have relationships between data (like how customers relate to orders). As you scale, maintaining these relationships and ensuring data consistency gets more complicated.
- Cost: Upgrading hardware or using complex distributed systems can be expensive. Organizations must balance the cost of scaling with their budget.
- Technical Limitations: Some databases are designed to work in specific ways. For example, relational databases excel in structured data but might struggle with unstructured data or large volumes.
- Coordination: In larger setups, coordinating data changes across multiple servers can lead to issues like data conflicts or delays, making it harder to keep everything in sync.
In summary, scaling a database involves managing increasing data size, maintaining performance, handling complexity, controlling costs, and addressing technical limitations—all of which can be quite challenging.
Okay, this might be the lamest analogy but I will go ahead anyways.
Scaling databases can be very similar to effectively using wardrobes at home.
Initial/Happy state:
My wife and I used to make do with 2 wardrobes in our house initially. All our clothes used to fit in those even when the space in wardrobes was shared and our clothes could be present in either one of them. Moving clothes in and out was easy too.
More clothes ( 2-4x scale):
By now, we needed more than two wardrobes. Luckily in our house, we had space to put an additional pair of those. So we happily bought and started stuffing the new wardrobes with our stuff. This is akin to adding one more database for additional data (loosely speaking).
Higher look-up/search time
We started taking more time to find our stuff, since now we had 4 different places to look. Oftentimes we would return empty-handed from one wardrobe and go looking for our stuff in another. Hence, higher search times.
Concurrency
Given the lack of order, we would also bump into each other at times while looking for our stuff in the same wardrobe. Imagine 1000s of user transactions in a real database.
Consistency
At times, my wife would move a pile of clothes from one place to another. This would upset my ability to recall what was where, and often led to me finding a pair of shorts in the place where I was really looking for a formal shirt.
Vertical Partitioning (again, loosely so)
To solve the higher search/look-up times, my wife and I decided to cleanly partition our clothes. As a result, she ended up with 3 out of 4 wardrobes. Guess what, it was a fair deal for my clothes-search queries: I had considerably fewer clothes and hence took much less time to find them by shooting straight for my wardrobe.
Horizontal Partitioning
My wife had, in the meantime, figured out that within her 3 wardrobes, she would further subdivide and arrange clothes in a certain manner. She decided to split them into latest, not-so-old, and old sets of clothes. Not the most optimal strategy, I thought, but it worked for her. Of course, this only works until she buys more clothes and needs a fourth wardrobe for her latest stuff. You see?
Replication
I love to wear t-shirts but all of them are actually stored in just one wardrobe. Assume I lock that one and lose the key. How do I find another t-shirt to wear? As a safety net for this hypothetical situation, I setup a small closet and store another set of my t-shirts. This is akin to replicating your most crucial data so that you can retrieve it even if one of your databases becomes unavailable.
Now imagine adding even more clothes and may be planning for a few more members in the family. I could go on and on, but I guess you get the drift by now :).
PS - I use caching for my favorite pair of jeans by throwing them on the couch. It's super easy to find them out there :)
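The couch is a cache: keep the most frequently used item somewhere faster and closer than its wardrobe. A toy sketch of the same idea, where the half-second "walk to the wardrobe" is simulated:

```python
import functools
import time

def fetch_from_wardrobe(item):
    # Stand-in for the slow path: walking to the wardrobe (or hitting the database).
    time.sleep(0.5)
    return f"{item} (from the wardrobe)"

@functools.lru_cache(maxsize=8)
def fetch(item):
    # The cache is the couch: after the first fetch, the item is right at hand.
    return fetch_from_wardrobe(item)

start = time.time(); fetch("favorite jeans"); print(round(time.time() - start, 2))  # ~0.5
start = time.time(); fetch("favorite jeans"); print(round(time.time() - start, 2))  # ~0.0
```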
Adding some aspects that don't seem to have received attention in the above answers.
a) It's possible to scale a database to any size at all as long as you don't query it or update it. You just keep adding disks and you keep adding data. This is not a facetious statement but is meant to clarify that databases scale or don't scale depending on the kind of queries you run and whether you ever update the data.
b) So it's much easier to scale a read-only database than one that is constantly read, written to, and updated; hence the issues related to concurrency.
c) Database scaling also depends on how you read the data, i.e. the complexity of the query. If you look up data just by a unique key, this is a much simpler access pattern than if you read data selecting on multiple fields or columns or attributes. For example, if all you do is look up your SS# in the Social Security database once a year, this is not a hard problem. It is a query that accesses a very small part of the whole database and gets the data relatively quickly with little or no computation involved.
d) On the other hand, if you want to query the US Census database by many demographic factors over the whole US population, the query has to touch the whole database, sort, filter, and do many intermediate computations. This is an example of a complex query that is much harder to perform scalably, especially if many people are running their own complex queries while other people are changing or entering data.
So scalability depends on what queries you want to run and what the background read/write/update processes are while you run your query.
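Before moving on to (e), here is a tiny Python sketch (with invented records) of the difference between the simple key lookup in (c) and the complex analytical query in (d):

```python
from collections import defaultdict

# A toy "database": one record per person, keyed by a unique ID.
people = {
    "123-45-6789": {"state": "TX", "age": 34, "income": 52000},
    "987-65-4321": {"state": "CA", "age": 61, "income": 78000},
    "555-11-2222": {"state": "CA", "age": 29, "income": 43000},
}

# (c) Simple access pattern: look up one record by its unique key.
print(people["987-65-4321"])

# (d) Complex access pattern: touch every record, filter, group, aggregate.
income_by_state = defaultdict(list)
for record in people.values():          # must scan the whole dataset
    if record["age"] >= 30:             # filter on one attribute
        income_by_state[record["state"]].append(record["income"])  # group
avg = {s: sum(v) / len(v) for s, v in income_by_state.items()}     # aggregate
print(avg)
```

The first query reads one record no matter how big the database gets; the second has to touch all of them, which is why it gets harder and harder to run as the data grows and other people are writing to it.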
e) Finally there is the issue of data normalization, i.e. splitting up your data into logical chunks, each chunk called a table. At the time you write the data you split it up for the sake of efficient storage, but when you read it you have to join it all over again. This join process gets very expensive as the tables grow larger, and the core of the problem with relational databases is not that "SQL doesn't scale" as the common (but wrong) meme goes. It is that *joins* don't scale well.
Now when you have to join three or more tables and these tables contain millions of rows as social network databases do, then joins cause relational databases to choke and fall over.
Early Web 2.0 companies like Flickr and Twitter had a particularly bad query pattern which joined multiple tables and then filtered them based on users and followers and this caused major scaling issues that led to the meme "SQL doesn't scale" but it was the nested join in these applications that caused the bottleneck, not the SQL language which was merely a programmer interface to the data.
The query in question had a common pattern that I call "visibility query" pattern.
In Flickr you had the ability to post a photo so that only certain groups could see it.
So when I logged into my account, before rendering my home page Flickr's database had to compute a complicated query to see all the published photos, to which groups they were published, and then create a union of these and intersect it with the groups that I am a member of and then display those photos to me.
Early Flickr database model design had these queries all happen at the same time on a single database with massive contention.
Later Twitter had a very similar anti-pattern with similar contention issues.
When I logged in to Twitter it had to compute who I followed and which of their tweets were not private, and then display my timeline - then keep refreshing it by running the same query for each person logged in. Ignoring the error of the other bad meme (Rails doesn't scale), this was a problem very similar to the Flickr photo visibility problem, except in the context of tweets. Note that Flickr used PHP (possibly Perl) and no one concluded that PHP doesn't scale or that Perl doesn't scale. But in the Twitter case, while Rails certainly had issues, the join costs of the query over millions of rows, many times a second, easily dominated Rails performance issues.
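To make the "visibility query" pattern concrete, here is a rough Python sketch of the kind of work described above; the tables and fields are invented and are not Flickr's or Twitter's actual schema:

```python
# Invented mini-dataset standing in for joined tables.
photos = [  # (photo_id, owner, group the photo was published to)
    (1, "alice", "family"), (2, "alice", "public"),
    (3, "bob",   "hiking"), (4, "carol", "public"),
]
memberships = {  # viewer -> groups the viewer belongs to
    "dave": {"public", "hiking"},
}

def visible_photos(viewer):
    # Equivalent of joining photos to group memberships and filtering:
    # every page load must consider every published photo.
    my_groups = memberships.get(viewer, set())
    return [pid for pid, owner, group in photos if group in my_groups]

print(visible_photos("dave"))  # [2, 3, 4]
```

With a handful of rows this is trivial; with millions of photos, groups, and concurrently logged-in users, the same join-and-filter has to run over and over, which is the contention described above.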
So again, databases are easy to scale if you don't want to run any query on the data. Even moderately sized databases can fall over if your query makes the database run around touching every bit of data every few seconds just to check if something has changed.
Finally, databases are hard to scale because real-world requirements are hard to fulfill: simultaneously satisfying the reads, writes, updates, and deletes of thousands of users who are asking complex questions directly or implicitly (via the rendering of their home page).
Hope this was simple enough to understand. The fairly rigorous analysis and explanations have been very competently done by others.
This is how I explain in non-technical terms why scalability can be hard to managers. This analogy works in general for explaining scalability, not limited to databases only.
Take a cake recipe:
* 1 cup white sugar
* 1/2 cup butter
* 2 eggs
* 2 teaspoons vanilla extract
* 1 1/2 cups all-purpose flour
* 1 3/4 teaspoons baking powder
* 1/2 cup milk
Suppose you have on hand the following quantities of ingredients:
* 2 cups white sugar
* 2 cups butter
* 6 eggs
* 8 teaspoons vanilla extract
* 8 cups all-purpose flour
* 8 teaspoons baking powder
* 4 cups milk
How many cakes can you make with what you have on hand?
Making a cake is a "transaction". The ingredients on hand represent your "computing resources". The resource requirements per unit of output (per cake) are the "resource demand". And the number of cakes you can make represents your "capacity".
In the above case you could make 2 cakes only. After that you would be out of sugar. Sugar is the ingredient where the ratio of demand to available resources is the greatest. Sugar is the limiting factor, also called the "bottleneck".
So you go and buy more sugar, say another 8 cups. How many cakes can you now bake, in total, including the two already baked? The answer is 3. Do you see it? Even though you now have plenty of sugar, you are running out of eggs after 3 cakes. Eggs are now the bottleneck.
An important observation is that there is always a bottleneck. There is always something that limits further scaling of a recipe. In computing systems the most common bottlenecks are CPU, RAM, disk, and network. These resources are the basic ingredients that computer systems require.
But suppose you have access to unlimited ingredients? What then? You would still have a bottleneck. But it might be given names like "number of bakers" or "room in the oven" or "electrical capacity of the kitchen" or "cooling system in the kitchen".
Again, there is always a bottleneck. There is always some computing resource with a demand that is greater than the others. Scalability is "easy" when the bottleneck is on a factor that is inexpensive to add more of.
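The cake arithmetic above is easy to mechanize. Here is a small sketch of the capacity calculation, using the quantities from the recipe:

```python
# Ingredients required per cake vs. quantities on hand (same units as above).
required = {"sugar": 1, "butter": 0.5, "eggs": 2, "vanilla": 2,
            "flour": 1.5, "baking powder": 1.75, "milk": 0.5}
on_hand  = {"sugar": 2, "butter": 2, "eggs": 6, "vanilla": 8,
            "flour": 8, "baking powder": 8, "milk": 4}

# Capacity is set by the scarcest ingredient relative to its demand.
cakes_per_ingredient = {k: on_hand[k] // required[k] for k in required}
bottleneck = min(cakes_per_ingredient, key=cakes_per_ingredient.get)
print(f"capacity = {int(cakes_per_ingredient[bottleneck])}, "
      f"bottleneck = {bottleneck}")          # capacity = 2, bottleneck = sugar

on_hand["sugar"] += 8                         # buy another 8 cups of sugar...
cakes_per_ingredient = {k: on_hand[k] // required[k] for k in required}
print(int(min(cakes_per_ingredient.values())))  # ...and eggs become the limit: 3
```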
Brian's principle of scalability: a properly-scaled database provides the end user with a consistent and satisfying user experience, irrespective of the quantity of data stored in the database.
On scaling for capacity (how many billions of records can the database handle):
Q. Can you transport one child from her school to karate practice at an average speed of 25 mph within a specific time frame in your two-passenger Mazda Miata?
A: Yes.
Q. Can you transport three children from their school to karate practice at an average speed of 25 mph within a specific time frame in your two-passenger Mazda Miata?
A: No. It doesn't scale well. Time to "scale" to the higher-capacity Toyota Prius.
Q. Can you transport six children from their school to karate practice at an average speed of 25 mph within a specific time frame in your Toyota Prius?
A: No. It doesn't scale well. Time to "scale" to a higher-capacity soccer mom van.
Q. Can you transport 20 children from their school to karate practice at an average speed of 25 mph within a specific time frame in your soccer mom van?
A: No. It doesn't scale well. Time to "scale" to a 20-passenger school bus.
Q. Can you transport 200 children from their school to karate practice at an average speed of 25 mph within a specific time frame in your single school bus?
A: No. It doesn't scale well. Time to "scale" to five 40-passenger school buses and multiple bus drivers
Q. Can you transport 2,000 children from their school to karate practice at an average speed of 25 mph within a specific time frame if you deploy fifty 40-passenger school buses?
A: Unlikely, because the fifty 40-passenger school buses that would be needed to transport the 2,000 children at 25 mph within the specific time frame would probably exceed the carrying capacity of the existing roads for that specific block of time.
On scaling for speed/performance (how long does the user wait for a response from the database after hitting the enter key):
Q. Can you transport three children from their school to karate practice at an average speed of 150 mph within a specific time frame in your Toyota Prius?
A: No. It doesn't scale well. Time to "scale" to the faster BMW M5.
Q. Can you transport three children from their school to karate practice at an average speed of 300 mph within a specific time frame in your BMW M5?
A: No. It doesn't scale well. In fact, it's not really possible in any vehicle you can purchase off-the-shelf, so you've reached the limit of performance. That is, of course, unless you are willing to throw an unreasonable amount of money at the problem.
A real world example of scaling in layman's terms -- SABRE, the IBM/American Airlines airline reservation/booking system:
From the Wikipedia entry for SABRE:
http://en.wikipedia.org/wiki/Sabre_%28computer_system%29
In the 1950s, American Airlines was facing a serious challenge in its ability to quickly handle airline reservations in an era that witnessed high growth in passenger volumes in the airline industry. Before the introduction of SABRE, the airline's system for booking flights was entirely manual, having developed from the techniques originally developed at its Little Rock, Arkansas reservations center in the 1920s. In this manual system, a team of eight operators would sort through a rotating file with cards for every flight. When a seat was booked, the operators would place a mark on the side of the card, and knew visually whether it was full. This part of the process was not all that slow, at least when there were not that many planes, but the entire end-to-end task of looking for a flight, reserving a seat and then writing up the ticket could take up to three hours in some cases, and 90 minutes on average. The system also had limited room to scale. It was limited to about eight operators because that was the maximum that could fit around the file, so in order to handle more queries the only solution was to add more layers of hierarchy to filter down requests into batches.
More than 1.7 million people travel via air each day in the US. For an airline to remain viable, that airline needs to maintain a minimum revenue rate per mile flown, which means the airline cannot fly planes that are empty.
Matching available flight capacity across all airlines to the individual travel needs of 1.7 million people per day, in such a way that each person can look up and book a travel reservation in under a minute, and so that airplanes don't fly empty, is a non-trivial database problem that encompasses many of the computer science-related issues such as ACID compliance, parallelism and concurrency control, and deadlock detection and resolution raised by others who have posted answers to this question.
Depending on the kind of data in the database, it may or may not be hard to scale the database. Here is what I feel could be a set of definitions for the layman:
Data: Information
Database: A container/repository/vault for data.
Querying a Database: Asking a question about the data to the database.
Consistency: Is the information correct at the given time?
Performance: How fast did I get the information?
Availability: Is the database available to me at any time of my choice to answer my queries?
Scaling: Do others (as many interrogators as we would like there to be) get their answers/responses just as fast as I did when I was the only person asking questions of the database?
To give an example, consider the following sci-fi scenario:
(1) I am asking the Terminator a question. Only he has access to the sacred book which contains all knowledge. He looks up the book and answers the question. I bring 10 of my friends and all of them start to ask different questions to the guy. We threaten to send him to the Underworld if he does not answer each of our questions immediately.
(2) The Terminator ensures his survival by cloning 10 of himself, and answering each of my 10 friends immediately. We decide not to terminate him that day.
(3) Now each of my 10 friends brings 10 of their friends, and they demand the same quality of service from him. The Terminator finds that just cloning himself does not work anymore, since the clones spend time fighting to look up answers in that single book. The Master Terminator saves the day by creating 10 copies of the sacred book and giving a book to each of his 10 clones. Depending on the type of queries, he could also divide the chapters of the book equally among the clones and route each friend to the clone holding the answer to his question, thereby not wasting money copying the entire book. But that is what Optimus Prime would have done, because his prime love is optimization.
(4) The Terminators may have been happy, but now my 1,000 friends get smarter, notice that not all the answers are 100% correct, and demand that he update the sacred book with what we think is correct. The Terminators' pursuit of happiness ends here, because the updating Terminator must ask each of his fellow Terminators to stop answering the questions whose answers got updated (until corrections have been made to their books). The more books in existence, the slower this process, and we start to get really angry.
(5) Here the Terminators must make a decision which may spell life or doom for them. Should they always give a correct answer to my friends, or just give an answer which had been correct in the past? And what happens if one of the Terminators dies trying to make my human friends happy, and it all turns out to be too much for him?
(6) The Terminators figure out that the dumb humans may not be able to figure out if the answers are correct or not. So they decide to give slightly incorrect answers, until the time all the sacred books have been updated to contain the correct answers. They just decide that it is better to live and fight next day, rather than to die in the battle.
In summary, it really boils down to what questions we are asking of the Terminators, and there is no one-strategy-fits-all-situations case. We really need to understand our data and queries to come up with approaches that scale. The strategies might involve trade-offs between consistency (correctness of the answer), availability (the ability of the master Terminator to clone himself at will, so some of my friends don't have to attend the funeral of the dead Terminator clones), and partition tolerance (the ability of the Terminator to create copies of the sacred books). One dude named Brewer says we can only pick two of the three while scaling a distributed database. That is pretty deep, even for a non-lay-person.
1. Because the database does not know what we are trying to accomplish. Instead it sees commands engineered by humans telling it how to organize, access, and update data. Thus, the database does not have enough information to scale itself and is limited to following the instructions we give it.
2. In the worst cases, which are all too common, our instructions are written in a procedural language that forces serial, rather than parallel, database operations, causing execution times to grow polynomially (O(n^x)) as the data grows.
3. Database optimizers are simply not very good. They try to figure out what we want by looking at statistics gathered during previous accesses. That's as hopeless as trying to 'optimize' your driving by staring into the rear-view mirror.
4. Even if the database knew what we wanted, its solution would be suboptimal, because the database field is full of computationally intractable problems, technically called NP-complete problems.
Here is an example of a scaling failure that illustrates all four points. AT&T asked me to fix a weekly batch job that took eight days to run. It had not run to completion in over a year. It ran in a half hour when it was written ten years earlier, when the company (then SBC) had ten million long-distance customers. Over the years, as the customer count grew, the job kept running longer. One programmer improved it by splitting it into ten parallel processes on the client (application) side. A later programmer improved it by splitting each of those into ten, giving one hundred parallel client processes. (Is that all they teach in computer science school?) Four programmers tried and failed. When I got the job, there were 120 million long-distance customers.
I reduced the run time from eight days to 30 minutes, using the same database structure, running on the same Oracle server(s), solely by rewriting the application. Here's how:
- Scrap multiprocessing on the client side. My job ran as a single client process. Parallelism belongs on the database server side, not the client side.
- Scrap PL/SQL, in which the job had been written. I rewrote it in straight SQL using large-scale set-based operations rather than the row-at-a-time processing forced by PL/SQL (a minimal illustration of this contrast follows the list). That gave the Oracle optimizer a fighting chance to improve the query strategy and use parallelism on the server side. That illustrates 2. Run time is down to 12 hours.
- Study the execution plans. When the optimizer made bad decisions, hold its hand by breaking queries into smaller steps. That illustrates 3. Run time is down to three hours.
- Employ considerable knowledge of the subject matter in the first stage, which was creating a set of 30 thousand customers meeting certain criteria involving sales tax. This is one of the intractable problems in 4., technically called a Boolean conjunctive query. For an ideal database to do that on its own, it would need detailed knowledge of tax law, the North American Numbering Plan (NANP), and the history of AT&T acquisitions. That information could be written into a high-level statement of the problem, as in 1. Run time is down to half an hour.
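The contrast between row-at-a-time processing and a single set-based statement can be shown in miniature with Python's built-in sqlite3 module -- a stand-in for illustration, not the Oracle system described above. Both approaches produce the same result, but only the set-based statement lets the engine plan the whole operation at once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT, tax REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(i, "TX" if i % 2 else "CA", 0.0) for i in range(1, 1001)])

# Row-at-a-time: fetch keys, then issue one UPDATE per row (what a cursor loop does).
for (cid,) in conn.execute("SELECT id FROM customers WHERE state = 'TX'").fetchall():
    conn.execute("UPDATE customers SET tax = 0.0825 WHERE id = ?", (cid,))

# Set-based: one statement describes the whole change; the engine decides how to do it.
conn.execute("UPDATE customers SET tax = 0.0625 WHERE state = 'CA'")
conn.commit()

print(conn.execute(
    "SELECT state, COUNT(*), AVG(tax) FROM customers GROUP BY state").fetchall())
```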
The scalability failure was caused by bad application programming. It wasn't Oracle's fault.
I think the solution to scalable databases is to scrap SQL and the Procrustean relational model. Replace them with a database that takes as input a description of the problem written in something like XSLT. Let the database design, and have the authority to autonomously redesign, storage structures and replicas. Query and update it with a non-procedural language like XQuery, but make it backward compatible with SQL by mapping to legacy schemas. The database that most closely follows that model is MarkLogic.
You have to define what “scale a database” means.
One reason an existing database may have a hard time growing beyond a certain size and still meeting application-side performance requirements is it may have a poor schema. You can have a poor schema in both relational and “NoSQL” environments (ie, you may have not designed your documents to match your searches, etc), but this problem is pretty well-known in relational environments.
In the vast majority of situations, the upper-bound for a relational database is driven by the quality of its schema, the level of parameter tuning done on the db engine, the overall host configuration, and the quality of application-side queries.
If the above are good, you can scale a single-instance database for quite a long time before seeing significant performance degradation. Once you get into the hundreds of millions of records range, you may need to start to worry about partitions. Once you exceed a few terabytes or a few billion records, you may be reaching the upper limit of many single-instance databases (at least unless you buy a ginormous and hugely expensive host to run your instance on and tune the heck out of it).
If you’re going to grow much bigger, you may need to start application-side sharding and have a multi-instance database. If you do your sharding well and are careful to have Shared-nothing architecture in both your app world and your overall database schema (ie, no “cross-instance” joins), you can have hundreds or even more individual database instances in a giant relational database world.
You’ll probably end up with each shard having multiple instances so you can do nearly-instant cutovers and failovers if you go with an architecture of this kind.
The thing that limits relational databases from having app-unaware horizontal scaling is ACID transactions and relational joins. Doing either across a large node-world is quite difficult, which is why sharding is needed, as each shard is effectively its own independent and isolated database.
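As a bare-bones sketch of what "application-side sharding" means in practice -- the application, not the database, decides which instance owns a given key -- here is a toy router with invented shard host names:

```python
import hashlib

# Hypothetical shard hosts; each is an independent, isolated database instance.
SHARDS = ["db-shard-0.example.internal",
          "db-shard-1.example.internal",
          "db-shard-2.example.internal"]

def shard_for(user_id: str) -> str:
    # Hash the key so users spread evenly, then pick a shard deterministically.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for this user must be sent to the same shard; queries (especially
# joins and transactions) that span shards are exactly what the app must avoid.
print(shard_for("user-42"))
print(shard_for("user-43"))
```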
As for NoSQL databases, they internalize the sharding (called “partitioning” in NoSQL-land) - and for the most part dispense with ACID transactions across the dataworld as well as joins. As for data integrity, they have a notion of “eventual consistency”, typically governed by a notion of “quorum”, meaning that writes to the database can be specified as being “accepted” by a number of nodes - specified by the quorum number - before the write returns to the application.
This allows them to have a largely shared-nothing dataworld without needing all that much in the way of application-side design (although for best results, partition keys and such must be chosen Very Carefully), and since you don’t have joins, you have to structure your data organization so you don’t need them, which may require more planning and design than a lot of people may think.
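A toy sketch of the quorum idea described above (the node names, failure rate, and quorum size are all invented): a write is reported as accepted once enough replicas acknowledge it, and the output will vary because the "network" here is simulated with random failures:

```python
import random

REPLICAS = ["node-a", "node-b", "node-c"]   # hypothetical replica nodes
WRITE_QUORUM = 2                            # acks needed before the write "succeeds"

def send_write(node, key, value):
    # Stand-in for a network call; pretend each replica occasionally times out.
    return random.random() > 0.2            # True means the node acknowledged

def quorum_write(key, value):
    acks = sum(send_write(node, key, value) for node in REPLICAS)
    return acks >= WRITE_QUORUM             # accepted once enough nodes confirm

print(quorum_write("user:42", {"name": "Ada"}))
```

Replicas that missed the write catch up later, which is the "eventual consistency" the paragraph describes.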
This answer is a bit rambling, and I may update it later…
There are far too many answers already to this question, and some of them are very good, very detailed answers, so I will keep mine short, more like an elevator pitch for why DB scalability is difficult.
First off, applications drive databases, not the other way around. A lot of SQL scalability pain is due to the app being un-optimized, not the SQL DB.
#1 - SQL databases hide a lot of computing complexity within a simple language, and getting deep visibility into that complexity, and the performance issues caused by it, is a hard problem. That one line of SQL which magically fetches users from California from one table, then fetches their orders for the last quarter from another table, then sorts them, and then totals the orders up so you can show one column in a report that says "Revenues from California in Q3", might seem magically simple, but there is a lot of hard work that goes into making it work, and not understanding that work leads to poor SQL performance.
#2 - Scaling up is easy but limited and expensive; scaling out requires solving some of the hardest computing challenges in the world. You can only add so much RAM, flash, and cores to one server before it's not feasible to go any further. Two sockets with 6-8 cores each, around 128GB of RAM, and around a 100K-IOPS SSD is the sweet spot today in price:performance:power ratios. Going beyond that, even simple stuff like having a readable secondary replica to offload read queries so the primary can handle more write load requires dealing with replication and the associated lag, which if not handled right can lead to invalid data being shown at the app layer (a toy read/write-routing sketch appears after point #4).
#3 - Scaling out is a very hard data distribution and consistency problem; and you can't even use the scaling out "sharding" technique unless you can actually modify the app to support sharding, or use a transparent sharding solution such as ScaleArc and its competitors. Choosing which data should go to which server out of a cluster of two is easy enough, but going beyond that gets harder and harder very quickly, and especially hard when you have to now migrate data between servers to ensure the load on them stays about equal.
#4 - High availability is a very expensive problem to solve right. You either have to throw in a lot of 'spare' resources to achieve close-to-synchronous replication and instant fail-over, or have to design around failure with a certain tolerance to data loss. This is an added dimension that makes it hard to operate very large SQL clusters.
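Here is the toy read/write-routing sketch for the "readable secondary" idea in #2 (the host names are invented): reads go to the replica, writes go to the primary, and the application has to accept that the replica may be slightly behind.

```python
# Hypothetical connection targets for a primary/replica pair.
PRIMARY = "db-primary.example.internal"
REPLICA = "db-replica.example.internal"   # may lag the primary by a few seconds

def route(statement: str) -> str:
    # Offload reads to the replica; anything that modifies data goes to the primary.
    is_read = statement.lstrip().lower().startswith("select")
    return REPLICA if is_read else PRIMARY

print(route("SELECT * FROM orders WHERE id = 7"))                  # replica (possibly stale)
print(route("UPDATE orders SET status = 'shipped' WHERE id = 7"))  # primary
```

Handling the lag -- deciding which reads are allowed to be slightly stale -- is the part that makes this harder than the routing itself.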
All that said, after having been in this space for as long as I have, I can tell you one thing: the myth that "SQL doesn't scale" is just that, a myth. I've seen enough alternate database vendors put up benchmarks on what their systems can do, and I've seen SQL do a lot better, with more control, resiliency, and an easier app development experience.
Achieving scalability and elasticity is a huge challenge for relational databases. Relational databases were designed in a period when data could be kept small, neat, and orderly. That’s just not true anymore. Yes, all database vendors say they scale big. They have to in order to survive. But, when you take a closer look and see what’s actually working and what’s not, the fundamental problems with relational databases start to become more clear.
Relational databases are designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing. With this design, if a system needs to scale, customers must buy bigger, more complex, and more expensive proprietary hardware with more processing power, memory, and storage. Upgrades are also a challenge, as the organization must go through a lengthy acquisition process, and then often take the system offline to actually make the change. This is all happening while the number of users continues to increase, causing more and more strain and increased risk on the under-provisioned resources.
To handle these concerns, relational database vendors have come out with a whole assortment of improvements. Today, the evolution of relational databases allows them to use more complex architectures, relying on a “master-slave” model in which the “slaves” are additional servers that can handle parallel processing and replicated data, or data that is “sharded” (divided and distributed among multiple servers, or hosts) to ease the workload on the master server.
Other enhancements to relational databases, such as shared storage, in-memory processing, better use of replicas, distributed caching, and other new and ‘innovative’ architectures, have certainly made relational databases more scalable. Under the covers, however, it is not hard to find a single system and a single point of failure (for example, Oracle RAC is a “clustered” relational database that uses a cluster-aware file system, but there is still a shared disk subsystem underneath). Often the high cost of these systems is prohibitive as well: setting up a single data warehouse can easily run over a million dollars.
The enhancements to relational databases also come with other big trade-offs as well. For example, when data is distributed across a relational database it is typically based on pre-defined queries in order to maintain performance. In other words, flexibility is sacrificed for performance.
Additionally, relational databases are not designed to scale back down—they are highly inelastic. Once data has been distributed and additional space allocated, it is almost impossible to “undistribute” that data.
On a normal weekday morning, you can walk into Walmart, pick what you want, and check out in no time. There is plenty of parking available, no shoulder fights in the aisles, and hardly any waiting at the checkout counters. On Black Friday, however, things are different. You would probably go in the wee hours of the morning, race to find parking, stand in a long line braving the cold, and, when the doors open, fight everyone else to find what you want and finally wait for hours at the checkout counter to save those five dollars. It is the same Walmart that you visit every week, and even though additional checkout counters are open that day, you have a miserable time compared to your normal shopping experience. Databases are like Walmarts: they can handle a certain amount of load under normal circumstances, but when there is a stampede, they can't keep up. For the same reason that Walmart cannot increase its size, inventory, staff, etc. infinitely, databases also cannot get bigger or more distributed without challenges.
I have a short, though somewhat technical view:
It's "only" a matter of:
- #1: integrity of data (ACID), and
- #2: serialization of processing, due to #1 (queuing theory + latency issues).
If you don't need #1, #2 becomes far easier.
If you want to educate yourself, take a look at Distributed Lock Manager (DLM) and Optimistic Concurrency Control (OCC) to get a better picture of the real life challenges.
One of the challenges is that it is quite hard to mix non-ACID and ACID requirements within the same database engine, thus in many cases "one size does not fit all". Even though MySQL is good at certain tasks, it is more closely on par with commercial products when paired with an ACID "core" such as InnoDB.
NoSQL is a good approach to many, but not all, problems, as it takes a very limited approach to concurrency/integrity without some "neat tricks". A low-latency (i.e. scalable) generic lock manager to implement ACID is quite a task to build. On the other hand, taking a "lockless" NoSQL store that is a 100% fit to the problem and implementing a very fast, low-latency subset of the required lock management is not a "piece of cake", but it is just "good engineering".
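To show what the OCC mentioned above can look like in practice, here is a minimal, hypothetical sketch in Python using SQLite: each row carries a version number, and an update only succeeds if the version (and a balance constraint) has not changed since the row was read. This is one common way to implement optimistic concurrency control, not the only one, and the table is made up for the example.

# Optimistic concurrency control sketch: read, compute, then update only if nobody else changed the row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100, 0)")
conn.commit()

def withdraw(amount):
    balance, version = conn.execute(
        "SELECT balance, version FROM account WHERE id = 1").fetchone()
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = 1 AND version = ? AND balance >= ?",
        (balance - amount, version, amount))
    conn.commit()
    return cur.rowcount == 1    # False means the version or balance check failed; the caller retries

print(withdraw(42))   # True: balance goes from 100 to 58
print(withdraw(70))   # False: the balance check (or, with a concurrent writer, the version check) fails

No lock is ever held while the caller is thinking, which is why OCC scales better under low contention; under high contention it degenerates into retries, which is where a lock manager (DLM) earns its keep.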
I know that it's not exactly a "layman's" answer, but take
#1 as "accidents may not happen", thus we have traffic signs and traffic lights. Then take
#2 as there being 2-10 lanes for traffic; when the traffic exceeds the designed limits, there will be queues because of #1. Eh?
#3: Go to a Destruction Derby to see what happens if you violate #1 ;-)
Scalability is hard to grasp. We don't run into it much in daily life. I got it beaten into me because scalability is vital to understanding most phenomena in the life sciences.
When talking about information systems, we tend to abstract the systems "on paper", with little lines representing communications, and in the abstraction those communications occur instantly. In the real world they don't. It takes a finite amount of time for each transaction/communication. Those real-world lags feed back into each other, piling up and causing the system to slow and potentially jam up altogether.
Here's an analogy I often use to explain scalability issues in any kind of information management or other complex system. It applies to all kinds of informational systems requiring communication between parts, e.g. governments, economies, a growing company, or a DB. People grasp the analogy because we've all had to solve this particular problem.
The task is a common one: deciding where to go to eat lunch. As the number of people in the group grows, the time it takes to decide where to eat increases and the granularity of the choices decreases.
-) Just you: It takes a few seconds of thought. Hmmm, hamburgers, tex-mex, or French bistro? Tex-mex. Done. When you get there you can order any dish.
-) Two people: Now it takes a minute or so because you've got to go back and forth. "Where do you want to go?", "I don't really care, where do you want to go?" Done. Order anything.
-) 3-5 people: Now you're up to 5-10 minutes because each person has to be polled, then conflicts negotiated. It takes at least 1 minute per person (N), but usually something like 1.5(N) minutes. Order any dish.
-) 5-10: Same problem, but now it's more like 2N minutes just from the lag in communications. Order any dish.
-) 10-20: More like an hour, because now you have infrastructure issues. "Hmmm, this is a big party, we should call ahead and make sure the restaurant can seat us." You can still get everyone in the same room to communicate, but it's going to take a while to poll everyone. Probably order any dish, but if everyone orders the same thing, the restaurant might run out of ingredients.
-) 20-50: 5-24 hours. Now you can't even get everyone in the same room. Individual diners aren't talking directly to each other. You've got to break into smaller groups, each of which appoints a representative to go to a meeting and hash out where to eat. They definitely have to contact the restaurant and make sure it's prepared. Everyone ordering a custom dish is pretty much not an option. Everyone is eating one of three or so dishes.
-) 50-100+: 3-7 days+. Now it's just catering. Elect representatives, representatives have to each take a piece of the problem. Strictly limited ordering options for individuals which must be specified long in advance.
-) 1000+: Now it's months or years. You're in the army now. Massive preplanning by specialists who create a dedicated system that decides, months in advance, what everyone will eat. Individuals get to decide how much ketchup to use on their mystery meat.
Substitute "tables" for "people", "relationships" for "where do you want to eat", and "queries" for "restaurant" and you have a decent analogy for most database scaling problems.
It also explains why town-hall-style democracy doesn't scale. Why successful small-scale test programs in education/corrections/business etc. using hundreds of test subjects don't scale to millions in real use. It explains why people's economic intuition doesn't scale from managing personal transactions to economies of hundreds of millions, and so on.
Some databases scale very easily and others don’t. As with everything else, there are tradeoffs. The tradeoff in this case is ACID
compliance. If you need to be sure that all of your data *operations* are valid, that is difficult to do when more than one machine is involved. Ensuring absolute consistency across multiple machines doesn’t generally happen with out-of-the-box database management systems (although it is possible). If, on the other hand, you are okay with your data being ‘eventually consistent’ across multiple machines, that’s much more easily attained. In those cases, you can easily spread your database across many, many servers. But, depending upon which nodes are queried, you may not get the most up-to-date results. If your process can handle ‘reasonably close’, then this can be a good choice for you (it is almost always faster, too). These scalable databases tend to fall into a group commonly called NOSQL. The less scalable databases tend to be relational databases based on SQL. For most use cases, a relational SQL database will be better, and there is a certain amount of overlap, as most relational database management systems (e.g. Oracle, SQL Server, PostgreSQL, etc.) have added various functionality, such as support for XML, JSON, column stores, etc., that was first popularized by NOSQL systems.
Thermodynamics.
Computation involves the movement, persistence, and substitution of information. These activities of computation are done in many different ways, but let's just call all of it energy. If you had infinite amounts of energy, you could create a database that scales to any load. The universe, however, does not permit the use of infinite amounts of energy, so the work of computation has to be performed with limited energy.
The information itself is immutable; it never gets destroyed, but the universe is constantly shuffling it. To fight the constant shuffling, we use machines to gather and aggregate information into bits. Bits are very useful in the fight against the shuffling because you can gather up an island of information and pretend that an amount of information above a certain level is a "one" and below it is a "zero". As long as you keep refreshing the piles of information, you can keep the ones and zeros around despite the universe's constant grinding on the pile.
The piles of information can be made in a lot of ways. In integrated circuits, it can be the charge pushed into a cluster of atoms. On flash, it's the trapping of electrons in a nanoscale box. On hard drives, it's the magnetic charge on a thin layer of iron. All these things have one thing in common: the faster the computation that can be performed with the bits, the more energy it takes to maintain them. Much of the innovation in computational hardware is about reducing the ratio of speed to energy, but it does seem that "fast" bits take more energy than "slow" bits.
How does this apply to database scalability?
The speed of bits is never as fast as we need. To make up for the lack of speed, the information we need gets summarized, duplicated, and spatially organized to improve efficiency. It's a trick to get around the lack of infinite energy. By duplicating a lot of limited energy, you can do the computational work in parallel. These tricks to operate in parallel can get quite complex, and that is why it is "hard to scale a database".
The problem with scaling databases lies in the difficulty in handling writes rather than reads. Imagine, for a moment, that you have a database that doesn't change very often but has to handle a million reads in an hour. You could make ten copies of this database (assuming you can easily apply the occasional change after the fact) to ten different servers, each of which now only has to handle one hundred thousand reads in an hour. If the number of reads doubles, you just double the number of servers.
Writes, on the other hand, can't be so easily scaled. Let's return to our ten-server example. Imagine that the number of writes is now a million per hour. Every one of those servers has to handle one million writes in order to be up-to-date. While increasing the number of servers enables you to reduce the read load on any one individual server, if you write data to any one of the servers, it needs to be written to all of them -- increasing the number of servers has no effect on the write load each has to sustain. There's no way to scale up a database with heavy writes without inspecting the structure of the database itself.
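A tiny back-of-the-envelope calculation makes the asymmetry clear. The numbers below are the hypothetical ones from the example above (a million reads and a million writes per hour), not measurements.

# Per-server load with full replication: reads divide across copies, writes do not.
reads_per_hour = 1_000_000
writes_per_hour = 1_000_000

for servers in (1, 10, 100):
    reads_each = reads_per_hour / servers    # each replica serves only a fraction of the reads
    writes_each = writes_per_hour            # every replica must still apply every write
    print(servers, "servers:", int(reads_each), "reads/server,", writes_each, "writes/server")

Adding replicas does nothing for the write column; only partitioning the data itself reduces writes per server, which is exactly why heavy-write scaling forces you to look at the structure of the database.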
The saying “SQL doesn’t scale” was never true or relevant.
There is no technology of any type that scales if you employ it in dumb ways.
People have said “SQL doesn’t scale” because they use it in dumb ways and expect it to handle whatever load or scale of work they throw at it.
- They don’t analyze queries or create the indexes their queries need (see the sketch after this list).
- They don’t provision server capacity needed for the scale of their data.
- They let data grow and grow without archiving some of it.
- They let users run unoptimized custom reporting queries.
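As a concrete illustration of the first point, here is a small, hypothetical SQLite session in Python showing the difference an index makes to the query plan; the table and column names are invented for the example.

# Missing indexes turn cheap lookups into full table scans.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 1000, i * 1.0) for i in range(10_000)])

# Without an index, the query has to scan the whole table.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())

# With an index, the same query becomes a cheap indexed search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())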
There seems to be a simpler explanation: it's the transactions, stupid.
Databases are intrinsically about presenting a single unified source of truth about a massive amount of state. While there is room for cheating (and rules for how much one can), their interface is really designed around the notion that all mutations of the state are processed serially. That means even if you mutate the data in a millisecond (which is exceptionally fast, considering databases are required to persist a transaction prior to completion), we're talking 1000 updates per second MAX.
Now, a good database engine will do its darndest to process as many transactions in parallel as possible to maximize scalability, but that doesn't change the fact that dependencies intrinsically exist between them, particularly since application logic itself tends to manipulate the database very serially.
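That "1000 updates per second MAX" figure is just arithmetic: if each commit has to be made durable before the client is told it succeeded, and commits that depend on each other are serialized, throughput is capped by the per-commit latency. A rough sketch with assumed numbers (the latency and batch size below are illustrative, not measured):

# Back-of-the-envelope: serialized, durable commits cap update throughput.
commit_latency_ms = 1.0                      # assumed: ~1 ms to persist one transaction on fast storage
serialized_tps = 1000 / commit_latency_ms
print("max serialized commits/sec:", serialized_tps)          # ~1000, as in the answer above

independent_batch = 50                       # assumed: 50 non-conflicting transactions flushed together
print("with group commit:", serialized_tps * independent_batch, "commits/sec")

Group commit and parallel execution only help transactions that do not depend on each other, which is the dependency point made above.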
In short, because scaling a database means accommodating increased data volumes and user populations without sacrificing key considerations like speed, data consistency, and transactional integrity — and without increasing operational complexity and costs. Meeting all these challenges requires a lot of design, planning, and troubleshooting.
There are two ways to scale a database:
- Scaling up (vertical scalability) increases a database’s capacity by adding fewer, larger computers (or cloud instances) with more memory, storage, and CPU.
- In contrast, scaling out (horizontal scalability) spreads the database workloads over many smaller machines.
Vertical scalability is limited because computers (either physical servers or cloud instances) can only have so much CPU, storage, input/output (I/O), and memory. Even the most powerful computer can be overwhelmed by the amount of traffic handled by a successful digital application.
On the other hand, horizontal scalability divides and distributes data and workloads across clusters of machines — a practice known as sharding. While sharding can effectively spread the load across multiple computers, it also requires them to work together in order to query, process, and consolidate their data. To the application, the sharded data set distributed across multiple independent instances should look and behave like a single, logical database.
In today’s digital economy, there’s been a paradigm shift towards horizontal scalability, which is the only way to handle the massive demand generated by successful applications. As a result, a new class of database technology (non-relational databases) has emerged.
I work at MongoDB, a distributed, document-based database with flexibility and scalability baked into its design. Previously, our founders had worked at DoubleClick, where they had struggled to scale existing technology to match demand. But rather than building their own custom solution for internal use, they decided to create a database that could handle this traffic and offer it up to a global community of developers. Now, other MongoDB products (including Atlas, our fully-managed cloud database service) enable seamless scaling by letting users quickly provision globally distributed database clusters and add nodes on demand as workloads grow.
MongoDB’s unique document model is also very popular with developers because these documents strongly resemble objects in their code — making them intuitive and straightforward to work with. MongoDB documents can store different data structures, and further, developers can easily modify schema as needed (such as when they’re introducing a new application feature). As a result, developers are more productive and can simplify the development process, rapidly evolving apps to keep up with customer expectations and an ever-changing business landscape.
For teams that work on high-traffic applications with tight development cycles, MongoDB Atlas provides a capable, easily scalable solution. Plus, teams can easily try a free database cluster for proof of concept before scaling to higher tiers for more capacity and more capabilities, including the ability to distribute data across geographic regions for low latency and data sovereignty demanded by modern privacy regulations.
Try a free cluster today.
Pure myth. Many databases of all paradigms, both relational and non-relational, scale well, either or both upwards to larger, more powerful systems and outwards onto multiple relatively inexpensive systems. Oracle, DB2, Informix, PostgreSQL, and others all scale very well, thank you. MongoDB, Cassandra, and others all scale well.
Before asking “Why?” or “Why not?” ask the ‘if’ question: “Do databases scale?”, “How does one scale >this< database or >that< database?”, “Can >such database< scale up? Can it scale out?”
Just about any system that "guarantees" to do or offer something will face scalability difficulties in the future. In computer science, a "guarantee" is always constructed under some constraints and/or assumptions. The problem is that requirements change; eventually, some or all of those constraints/assumptions either fail or become significant bottlenecks when the inputs change, and the scalability issue surfaces.
A database is exactly such a system: it offers very tight "guarantees" by its nature (e.g. transactional consistency that has to guarantee ordering).
Who says it is hard to scale? Big data solutions are available. But you may be questioning what the big deal with big data is. On my project at Microsoft we wanted to INGEST (upload) SEARCH results for the purpose of analyzing advertising metrics. Each and every day there was a large amount of data to be processed, measured in TERABYTES. The big deal was that if something was missing or delayed, there was hell to pay to reprocess (try again) that daily ingestion. I was able to leverage error reports so that a portion of the data could be handled efficiently.
To answer simply: scaling significantly requires an A-to-Z approach, and that often means legacy applications have to be entirely refactored to remove bottlenecks.
A prerequisite to understanding why it is hard to scale is to understand the principle of scaling relative to performance tuning; I explain that in total layman's terms in the first 70 seconds of this YouTube video:
More often than not, databases are designed without keeping in mind future load requirements. Add to this very poorly written programs.
While these flaws are almost always the culprits, the database becomes the fall guy!
With a robust architecture, database driven applications can be scaled up very smoothly.
Because a database has requirements on its operation that are difficult to meet in a distributed environment, mostly consistency.
We have two major types of database: SQL and NoSQL. NoSQL is developing fast now. We can also scale NoSQL easily by sharding, which is good for performance. We don't always need to use a complex relational database, which can run slowly because it carries default features we don't actually use.
(I am looking for a job)
Big Data, Cloud, and the Internet of Things are sexy marketing buzzwords for existing technologies that are ready for the mainstream. In fact, at LinuxCon I was at a talk that emphasized creating exactly this kind of marketing goo to help whip up excitement.
Dilbert comic strip for 07/29/2012 from the official Dilbert comic strips archive.
Big Data used to be called Analytics/Business Intelligence before the industry felt the need for a sexier term. If you have ever drawn a chart in Excel out of a column of data, you have used a tiny version of "Big Data"; it's just that the scale is now massive. Big data just means making sense out of a large volume of data.
Ok, enough of cynicism.
How is Big Data different from "little data"?
Let's assume you have a leak in a water pipe in your garden. You take a bucket and some sealing material to fix the problem. After a while, you see that the leak is so much bigger that you need a specialist (plumber) to bring bigger tools. In the meanwhile, you are still using the bucket to drain the water. After a while, you notice that a massive underground stream has opened up and you need to handle millions of liters of water every second.
You don't just need new buckets, but a completely new approach to the problem, simply because the volume and velocity of the water have grown. To prevent the town from flooding, maybe you need your government to build a massive dam, which requires enormous civil engineering expertise and an elaborate control system. To make things worse, water is gushing out everywhere from unexpected places, and everyone is scared by the variety.
Welcome to Big Data.
I will give you an example from my previous startup. [More details: Does Social Media Affect Capital Markets?] We had a hypothesis that we could understand the market psychology by looking at the tweets. For instance, if I want to predict the movement of Apple stock, I could look at the tweets related to:
- Media perceptions of Apple - how many times the company/product gets mentioned in major media.
- Customer perceptions of Apple - are the customers positive or negative about the upcoming iPhone 6? Will people continue to buy Apple?
- Employee perceptions of Apple - are there any tweets from Cupertino [the company's location] that could be linked to some employees of the company? How happy or sad are they?
- Investor perceptions of Apple - what do sophisticated investors and analysts think about Apple?
The sum of all these perceptions will determine the price of Apple's stock in the future. Getting that right could mean billions of dollars.
To put it in layman's terms, if we could really understand what different people are saying about a particular company and its products, we could somewhat predict its future earnings and thus the direction in which the stock price would move. That would be a huge advantage to some investors.
Babson MBAs Use Social Media to Predict Moves in the Stock Market
However the problem is this:
- There are over 500 million tweets every day, flowing in every second. (High Volume & Velocity)
- We have to understand what each tweet means - where it is from, what kind of person is tweeting, whether it is trustworthy or not. (High Variety)
- Identify the sentiment - is this person talking negatively or positively about the iPhone? (High Complexity)
- We need to have a way to quantify the sentiment and track it in real time. (High Variability)
The key elements that make today's Big Data different from yesterday's analytics are that we have a lot more volume, velocity, variety, variability and complexity of data [called the 5 Key Elements of Big Data].
Applications
Big data includes problems that involve such large data sets and solutions that require complex connecting of the dots. You can see such things everywhere.
- Quora and Facebook use Big Data tools to understand more about you and provide you with a feed that, in theory, you should find interesting. The fact that the feed is not interesting should show how hard the problem is.
- Credit card companies analyze millions of transactions to find patterns of fraud. Maybe if you bought a Pepsi on the card, followed by a big-ticket purchase, it could be a fraudster?
- My cousin works for a Big Data startup that analyzes weather data to help farmers sow the right seeds at the right time. The startup got acquired by Monsanto for big $$.
- A friend of mine works for a Big Data startup that analyzes customer behavior in real time to alert retailers on when they should stock up stuff.
There are similar problems in defense, retail, genomics, pharma, and healthcare that require solutions.
Summary:
Big Data is a group of problems and technologies related to the availability of extremely large volumes of data that businesses want to connect and understand. The reason why the sector is hot now is that the data and tools have reached a critical mass. This occurred in parallel with years of education effort that has convinced organizations that they must do something with their data treasure.
Adapting from one of my prior answers. The other answers on this post talk about CAP, and I’ll talk about actual differential implementation details of NoSQL vs. RDBMS engines.
Here's where the conventional wisdom that "NoSQL is faster than RDBMS" comes from:
- Logging - Many more NoSQL engines offer the option to circumvent logging, meaning that they do not write the changes to disk before signaling the row as inserted. RDBMSes do not publicize this feature as much. The trade-off is that you don't necessarily know if your upserts have been made, and you risk the row being lost in the event of an outage. Is that what you want to happen? (A sketch of this trade-off follows this list.)
- Transactions - Transactions are a relatively rare feature of NoSQL engines. The most many offer is single-row atomicity (when I see that, I think pwrite). And there's a good reason why, too - transactions have a tendency to block entire ranges of rows or tables from being updated. Well, if the table's being locked, that'd slow down any concurrent inserts, no? But are transactions something you want to lose?
- Schema enforcement - If the NoSQL engine treats whatever you insert into it as a binary blob, then there's a lot of computation overhead saved by not enforcing a schema. But really, do you want that trade-off, and do you want to push that enforcement and management all the way up to the application layer?
- Topology corner-cutting - Some NoSQL engines (e.g., Couchbase) push topology decisions to the client level. While this means the client can communicate with data nodes directly, it necessarily means there's some latency in getting notified of failovers. And from my development experience, it's *much* harder to prevent data loss in these events. Just another thing to watch out for...
- Joins - NoSQL systems distribute data among different nodes. This, by necessity, makes joining rows that might be located on different nodes a slower endeavor. Indexes and MapReduce can only do so much; it’s a different style of thinking about and storing data.
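Here is the logging trade-off from the first bullet as a minimal Python sketch: a durable store appends to a log and flushes it to disk before acknowledging the write, while an "unlogged" store acknowledges as soon as the row is in memory and can lose it on a crash. The file name is hypothetical.

import os

LOG_PATH = "wal.log"    # hypothetical write-ahead log file

def durable_insert(row: str) -> None:
    # Log first, flush to disk, only then acknowledge: slower, but survives a crash.
    with open(LOG_PATH, "a") as log:
        log.write(row + "\n")
        log.flush()
        os.fsync(log.fileno())

def fast_insert(buffer: list, row: str) -> None:
    # Acknowledge as soon as the row is in memory: fast, but gone if the process dies before a flush.
    buffer.append(row)

durable_insert("id=1,value=hello")
buf = []
fast_insert(buf, "id=2,value=world")

The fsync on every commit is a large part of why durable engines look "slow" in benchmarks against engines that skip it.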
On average, if I had to generalize between RDBMS and NoSQL, I’d say:
- NoSQL is faster at inserts. Fewer logs, more I/O nodes, less schema overhead, fewer indices.
- Whether NoSQL is faster at updates is vendor-specific. As a general rule of thumb, I’d say HBase is faster at updating than MySQL than Couchbase.
- NoSQL is faster only in certain categories of queries. Queries that NoSQL responds well to include those along the primary key only, and those whose computation can be done in parallel, such as aggregation functions.
- NoSQL is terrible at indexing non-primary keys, and queries on non-primary, non-clustered fields are going to be slow. In RDBMS, indexing represents a layer of abstraction on the same node. In NoSQL, indexing represents a layer of abstraction on a different node with all the performance downsides this implies.
Lately, I’ve been kicking around the idea that there is a parallel between RDBMS vs. NoSQL and TCP/IP vs. UDP/IP, respectively.
It’s a myth that RDBMS are difficult to scale.
Scaling any system is hard (to a corresponding degree), and scaling RDBMS is only harder because people lack the know-how (and, for some, the will to learn), and traditionally DB vendors haven’t focused on this space with their tooling.
When there were only commercial RDBMS, it was perhaps beyond most people’s means to buy thousands of CPU licenses. With Postgres & MySQL variants, there is no reason people can’t do horizontal scaling if they are willing to invest the time to learn, and get the power of RDBMS in a scalable system.
CAP is usually brought up as the reason SQL doesn’t scale, though it’s really a non-sequitur. CAP has nothing to do with scaling, as network partitioning can happen in the smallest distributed systems, not only at scale. The only question is whether your app chooses consistency or availability when that happens.
Banks have been running on RDBMS and they scale just fine, and they have been dealing with distributed systems before there were computers.
Imagine you are writing out a list of 1000 players including their name, address and a list of sports they want to play for a town sports league. You start out by writing each name and address and then a list of sports but after writing the first hundred your hand is getting tired. You decide to see if there is a way you can write less.
The first thing you realize is that there are a lot of siblings signed up and that means you have to write the same address a number of times. To get around this you decide to write each address once on a different piece of paper and assign a unique number to it starting at 1 and increasing. Then, when you write the player information you just put the number in there instead of the full address.
You carry on like this for 100 more and your hand is really tired now. You realize that you are writing the same sport names many times as well. Instead, you decide to write each sport name out on a piece of paper with a number next to it. You also go back and number all of the players. Finally, you get another piece of paper and on it you write a new line for each sport that each player wants to play, containing the player's number and the sport number.
Lucky you, you have now discovered normalization. If you move data that can repeat, like addresses, or lists of data, like sports, out onto their own pieces of paper and instead reference them with unique numbers, you have normalized your data and saved your hand from pain.
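The paper-and-pencil scheme above maps directly onto relational tables. Here is a minimal sketch in Python with SQLite; the table and column names are made up to mirror the example (players, addresses, sports, and a join table for who plays what).

# Normalization sketch: shared addresses and sports are written once and referenced by id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, town TEXT);
CREATE TABLE player  (id INTEGER PRIMARY KEY, name TEXT, address_id INTEGER REFERENCES address(id));
CREATE TABLE sport   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE player_sport (player_id INTEGER REFERENCES player(id),
                           sport_id  INTEGER REFERENCES sport(id));
""")
conn.execute("INSERT INTO address VALUES (1, '12 Elm St', 'Springfield')")
conn.executemany("INSERT INTO player VALUES (?, ?, ?)",
                 [(1, 'Alice', 1), (2, 'Ben', 1)])          # siblings share one address row
conn.execute("INSERT INTO sport VALUES (1, 'Soccer')")
conn.executemany("INSERT INTO player_sport VALUES (?, ?)", [(1, 1), (2, 1)])

# The address is written once and referenced by number, exactly as in the analogy.
print(conn.execute("""SELECT player.name, address.street, sport.name
                      FROM player
                      JOIN address ON address.id = player.address_id
                      JOIN player_sport ON player_sport.player_id = player.id
                      JOIN sport ON sport.id = player_sport.sport_id""").fetchall())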
There are quite a few answers to this question, but I would like to answer it from my own perspective.
The basic principle on which RDBMS works is that there are relations between tables and if you want to get meaning out of the data, you will have to join 2 or more tables.
On a single machine, as all the data is present locally, the processor does not have to pull data across the network from another machine (and if there is enough RAM, quite often the data will already be in cache), so the join is fast and we can get the meaningful data out easily.
But the problem starts when the data keeps growing beyond a few TBs. Although the size of the data has increased, the speed at which it can be read from disk stays low (roughly 100 MB/s for a magnetic drive, 550 MB/s for a SATA SSD, 2-3 GB/s for an M.2 NVMe SSD).
To solve this issue, you can store the data on multiple disks (multiple machines if the data is really huge). This straightforwardly increases the read speed (if you read half the data from Disk 1 and the other half from Disk 2, you read the data twice as fast as from a single disk).
So, you store part of each table in different servers to reduce the load of reading. All is well and good now, right? Nope, now there is another problem, if you remember, this is relational database and we need to join the data to get meaning out of it. Consider the following example to understand it better.
Assume we have two tables
User - Details of users. [Columns : UserID, FirstName, LastName, City, Province, Country]
UserOrders - Details of orders placed by the user [Columns : UserOrderID, UserID, OrderDate, OrderAmount]
Now this data is pretty huge and thus we store it in 2 servers.
As a result, a million UserOrders went to Server1,
a million UserOrders went to Server2,
a few hundred thousand Users were saved in Server1, and
the other few hundred thousand Users were saved in Server2.
Now, suppose we have to run the following query on the data:
SELECT U.Country, SUM(UO.OrderAmount) AS TotalSales
FROM User AS U
INNER JOIN UserOrders AS UO
    ON U.UserID = UO.UserID
GROUP BY U.Country
To achieve this, we would have to pull all of the data onto one server and then join. This is where the system slows down. In an RDBMS, if we want to scale to multiple servers, we land in this issue of pulling data across the network.
But if you remove the relations and keep the data self-contained as it is, there is no need to pull the raw data: you can aggregate on the individual servers and then push the aggregated data (which will be very small in size) to one server.
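Here is that "aggregate on each server, then combine the small results" idea as a minimal Python sketch; the per-server data is made up and the "servers" are just lists, but the shape of the computation is the same.

# Each server aggregates its own shard; only tiny partial results cross the network.
server1 = [("US", 120.0), ("IN", 80.0), ("US", 40.0)]     # (country, order_amount) rows on server 1
server2 = [("IN", 60.0), ("UK", 90.0), ("US", 10.0)]      # rows on server 2

def local_aggregate(rows):
    totals = {}
    for country, amount in rows:
        totals[country] = totals.get(country, 0.0) + amount
    return totals                                          # small partial result, e.g. {"US": 160.0, ...}

def merge(partials):
    combined = {}
    for part in partials:
        for country, amount in part.items():
            combined[country] = combined.get(country, 0.0) + amount
    return combined

print(merge([local_aggregate(server1), local_aggregate(server2)]))
# {'US': 170.0, 'IN': 140.0, 'UK': 90.0} -- the same totals the SQL query would give, without moving raw rows.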
Hope I answered your question. Please let me know if I didn’t put it properly. This is like my second answer of all time. Any suggestions and changes are welcome.
I have worked at both Google and Zynga, two pretty big tech companies.
We almost always use SQL when interacting with databases. Usually the data is located on multiple Hadoop type clusters that help deal with any potential scalability issues (Google had a bunch of internal names for their system when I was there, the tech behind the system was constantly changing). Most of our data sets are on the order of billions of rows and we usually do not have an issue with runtime.
SQL is great for a few reasons:
- Anyone from the CMO to CTO to a low level Product Manager or analyst can usually read, run and edit SQL without knowing that much about how things are working under the hood. This allows everyone to contribute quickly and easily to getting as much work done as possible.
- Most modern-day databases are optimized to the point that queries run quickly and easily.
- Building out dashboards and exporting to other systems is usually super easy. Other, non-relational systems can be harder to extract data from into other forms, whether that be Tableau, Excel, or other internal tools. This makes visualizing and getting value out of the data easy.
I hope to make my answer real lay-person friendly.
Data is a unit of information. Database is a container/depot/holding area/... for the data.
Using this definition: A book is a database, where data could be individual chapters of knowledge. A library is a database, where book itself is a data unit. The university system could be a database, with individual library as unit data.
Secondly, it is really important to organize the data with specific objectives in mind. For a library, with the basic function of book lending, a primary objective is to be able to search for a particular book efficiently. This drives the storage organization of the data (labelled racks, book shelves, maybe drawers, and other things which a lay person may not know about). This is a real-world database.
So is RAM (Random Access Memory) a database? Absolutely, yes. My data unit is a byte (8 bits), and data needs to be organized as an array, and the basic objective is random access in an array storage. This is a hardware database.
So is a hard disk a database? You bet it is. My data unit is a block of data, and the storage organization needs to facilitate quick location of files and such, kind of like file cabinets in real life.
Can I pool in RAM and Hard-disks separately to create a bigger database? Of course. This is why technology is so much fun.
Then what the heck is a software database like MySQL? Well, it is a particularly good container for data with lots of relations. And the data is organized in such a way that we can ask pretty tough questions of it, e.g. "Tell me the names of all males who are between the ages of 20 and 30 and love databases."
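That "tough question" is a one-line SQL query. Here is a hypothetical version (the table, column names, and sample rows are invented for the example) run through Python's built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, gender TEXT, age INTEGER, loves_databases INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)",
                 [("Arun", "M", 25, 1), ("Bela", "F", 28, 1), ("Chen", "M", 35, 1), ("Dev", "M", 22, 0)])

# "All males between 20 and 30 who love databases"
rows = conn.execute("""SELECT name FROM person
                       WHERE gender = 'M' AND age BETWEEN 20 AND 30 AND loves_databases = 1""").fetchall()
print(rows)   # [('Arun',)]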
Arranging data as tables comes most naturally to us, as in maintaining a list of students and their attendance. So most real-world data is easily captured in relational databases.
If it makes more sense to arrange the data in some other way that serves other specific purposes, then there are other databases available for those purposes.
Storing data reliably in a distributed system tends to slow down operations, which is a bad tradeoff for modern apps that require fast responses and have to deal with astronomically large heaps of input.
There is a theorem (the CAP theorem, for Consistency, Availability, and Partition tolerance) with a mathematical proof that if you have a distributed datastore and a network partition occurs (that is, some nodes can't connect to others), the system must choose between staying available and accepting writes that may render the data inconsistent, or refusing writes and keeping the data consistent. Relational databases choose the latter, as resolving a conflict is not at all easy. Why is it so hard? Imagine we both try to withdraw money from an account which has €100 in it, and there is a database constraint forbidding a negative balance. You want €42 and I want €70. We start our transactions, they are sent for processing to distinct nodes, and a network partition occurs. If the system chooses to proceed, each transaction will be found to satisfy the constraint and will be written down. However, when network connectivity between the nodes is restored and the database tries to reconcile the writes, it will be impossible to keep the data in compliance with the constraint. Relational databases are required to support ACID transactions, the C standing for consistency; that's the reason they must refuse to commit transactions if nodes get out of touch. And remember, there is a theorem that proves it's impossible to overcome this, leaving room only for not-so-satisfying solutions such as the following (a small code sketch of the conflict follows the list):
- Master-slave partitioning, where writes are committed to a single master node which distributes the updated data to read-only slaves. It won't speed up writes, which are usually the bottleneck.
- Sharding, where different data is stored on distinct nodes. In the banking example above, the first node might store accounts of people whose names start with A-M, the other N-Z. It makes your app more complex by shifting the burden of scaling higher up the stack.
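The bank-account conflict described above, reduced to a few lines of Python: each partitioned node checks the constraint against its own stale copy, both withdrawals pass locally, and the merged result violates the rule. The numbers are the ones from the example.

# Two nodes, partitioned from each other, each holding its own copy of the balance.
node_a = {"balance": 100}
node_b = {"balance": 100}

def withdraw(node, amount):
    # Each node can only check the "no negative balance" constraint against its local copy.
    if node["balance"] - amount >= 0:
        node["balance"] -= amount
        return True
    return False

print(withdraw(node_a, 42))   # True: node A thinks the balance is now 58
print(withdraw(node_b, 70))   # True: node B thinks the balance is now 30

# When the partition heals, applying both withdrawals to the real balance breaks the constraint.
print(100 - 42 - 70)          # -12, which the rule forbids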
Because when you are building a massive distributed system you need to make a choice: consistency or availability. Which one is more important when a node fails (and creates partitions in the network)? If you are a bank, consistency is the choice; if you are a social network, availability is the way to go. If your path is consistency, the DB does not give an answer until all nodes have the same version of the data (and all ACID properties are being enforced). The bank would rather tell people they have a ‘communication’ problem than risk a $100 million transaction going where it isn't supposed to. In NoSQL environments, availability is the chosen one: the DB always gives you an answer that is correct ‘most’ of the time, because these systems enforce the BASE properties instead of ACID. When you scale, the system becomes vulnerable to network partitions; nodes cannot communicate with other nodes 100% of the time, and that forces the choice already described. There are new technologies trying to solve the big problem; read about NewSQL, VoltDB, and in-memory databases.
If by scaling, you mean massively distributed architectures with numerous servers of the Apache Cassandra variety, relational databases have particular difficulties in these environments due to joins. There are a few tricks that can be done to make joins run OK for certain types of joins, but solving this generally is very difficult.
There are some "NewSQL" databases that have grasped this nettle with varying degrees of success (MemSQL, VoltDB, etc), but it remains a fundamentally hard problem.
The problem of COMMIT and ACID is actually relatively easy compared to the general join problem, and while Codd wouldn't be impressed with a relational db with eventual consistency, there would be plenty of users who would happily use one - if it supported joins well.
At the app level, you can bypass the join problem by clever application-aware sharding, which is how most people end up handling the "scale out" problem in my experience.
This is not true - relational databases do scale though one has to jump through the hoops to do so. The perceived difficulty of scaling RDBMS is due to its structured storage, low data redundancy and, most importantly, ACID compliance and related locking mechanisms.
With the NoSQL “eventual consistency” model, one would be ill advised to run financial applications or to do anything that requires transactional support… I suppose you’d be nonplussed to find out that your bank will eventually credit your account for the cash you’ve just deposited, no matter how quickly it tells you so.
To see examples of highly scaled SQL systems in MPP environment look at Greenplum, for instance - a highly scalable implementation of PostgreSQL.
So, to sum it up: RDBMS do scale; the more easily scalable NoSQL databases lack many important features of RDBMS, so there is a trade-off; both have their specific domains, and both are here to stay for the foreseeable future.
There are already some great answers here that explain the benefits of relational/SQL architecture and how that might not be amenable to scaling out (vs. scaling up).
But I’d like to point out that times are changing:
Some “SQL Databases” that are relational and transactional, and scale-out by way of distributed shared-nothing architecture, like “NoSQL Databases”:
MySQL Cluster (row-oriented)
MariaDB (a MySQL fork, column-based)
Microsoft’s SQL Database on Azure (row-oriented)
Microsoft’s SQL Data Warehouse on Azure (column-based, optimized for data warehousing)
Amazon Redshift (column-based - optimized for data warehousing)
Vertica (column-based - optimized for data warehousing)
…I’m sure there are others.
Note, this answer initially addresses why “it is said” that relational databases don’t scale. Further down I’ll clarify my initial points to demonstrate that it’s not actually a valid criticism.
Mostly because of Sharding
Sharding is when you break up your data and store it across multiple servers
Image from MongoDB for GIANT Ideas
This can speed up reads, writes, and allow you to increase storage processing cheaply with multiple (as many as hundreds or thousands) of small, cheap servers (commodity servers) rather than vertically scaling expensive, massive servers.
Most NoSQL flavors have some sort of capability of this horizontal scaling, and this happens out of the box with only a little bit of configuration required.
Unfortunately, this line of thinking is pretty outdated. As long ago as 2005, Microsoft SQL Server had a feature known as Table Partitioning that allowed you to accomplish horizontal scaling on a table-by-table basis.
That said, it wasn’t easy to configure MS SQL to do this at the time, and even now it’s sort of a hidden feature. It’s worth mentioning that most of the big names in RDBMS have this capability: Oracle, MySQL, and PostgreSQL are all able to shard. It’s basically a marketing point for NoSQL vendors now, one that obfuscates the fact that RDBMSes can do it too.
Other benefits of NoSQL include optimized and offloaded writing with eventual consistency. This means the system can act as if it has completed the write when in reality the write has only been queued up, allowing the system to move on without having to confirm the transaction. Combine this with sharding and replication and you get really fast reads and writes on systems that don’t need perfectly consistent data: real-time online games, social media websites, various kinds of real-time maps. If you don’t see the exact correct data immediately after you submit it, it’s not a deal breaker.
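As a toy illustration of that "acknowledge now, apply later" idea (not any particular product's implementation), a write can be acknowledged as soon as it is queued, while a background worker applies it to the actual data afterwards:

    import queue
    import threading
    import time

    data = {}                      # the "database"
    pending = queue.Queue()        # writes waiting to be applied

    def write(key, value):
        pending.put((key, value))  # just queue it...
        return "OK"                # ...and acknowledge immediately

    def apply_writes():
        while True:
            key, value = pending.get()
            time.sleep(0.1)        # pretend this is slow, replicated work
            data[key] = value      # eventually the data catches up

    threading.Thread(target=apply_writes, daemon=True).start()

    write("player42_score", 1300)            # returns "OK" right away
    print(data.get("player42_score"))        # may still be None: not yet consistent
    time.sleep(0.5)
    print(data.get("player42_score"))        # 1300: eventually consistent

The reader briefly sees stale (or missing) data, which is exactly the trade-off being described.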
You can’t do this for applications that require ACID transactions, like banking systems, where a transaction isn’t complete until the money is out of one account and into the other, and if either operation fails the whole thing must be rolled back to the beginning. Strong consistency also matters for sites like eCommerce, where inaccurate displayed inventory could hurt the customer experience; there, consistency is absolutely key. As you require more and more ACID compliance of a database, it will necessarily bog down, increasing latency and limiting how far it can scale.
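By contrast, here is a minimal sketch of the banking case using SQLite (chosen only because it ships with Python; the table, account names, and amounts are made up): either both account updates commit together, or neither does.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    def transfer(amount):
        try:
            with conn:  # one transaction: commits on success, rolls back on error
                conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                    (amount,))
                new_balance = conn.execute(
                    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
                if new_balance < 0:
                    raise ValueError("insufficient funds")
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                    (amount,))
        except ValueError:
            pass  # the whole transfer was rolled back; no money vanished

    transfer(150)  # fails, so the transaction is rolled back
    print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 100), ('bob', 0)]  -- balances untouched

That all-or-nothing guarantee is what "ACID" buys you, and it is precisely the bookkeeping that gets expensive to maintain across many distributed servers.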
Most NoSQL databases are designed around low latency transactions and getting things done at scale quickly.
RDBMS makers have upped their game, but there is more overhead, and they will get bogged down when you hit millions of transactions a second. Scaling out is also more expensive, particularly with a solution like MS SQL Server, where you have not only the hardware costs to consider but the licensing costs as well.
It’s worth noting that the old-school RDBMS makers haven’t rested on their laurels; they offer both on-prem and cloud-based solutions with horizontal scaling options available.
The big difference is still that the NoSQL databases came up because these options were unavailable, and the RDBMS vendors are now playing catch-up.
I should go back and clarify some points.
First, NoSQL isn’t a single thing. It’s a family of database technologies that share one thing in common: they’re not RDBMSes. This is important, as they all do different things and are designed to do different things. Some perform better on inserts, some perform better on reads, some are designed for complex pattern searches, and some are extremely simple -- deliberately so, to make lookups nearly instant, at the cost of pushing complexity elsewhere.
Let’s take a database like Redis, for example. Redis is as simple as they come: a key-value store. You take a key, like “red”, and then you take a value, “blue”. When you need to retrieve the value of “red”, the read time is trivial, because it’s only storing keys and values. Similarly, the insert time is trivial, since you can’t have multiple keys that are the same. But what good is it? Sure, it’s fast as hell, but you can’t really work with related data without dozens of queries. It turns out it’s great at storing cacheable data. If a user loads a page that won’t change often, why rebuild the page every time the user refreshes the browser? Simply put, you shouldn’t; you should cache it to reduce hits on the backend database.
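A tiny sketch of that key-value caching pattern, using the redis-py client; the host, key names, and the render_page function are assumptions for illustration, and a real page builder would query the backend database instead.

    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server

    def render_page(user_id):
        # Pretend this does an expensive set of relational queries.
        return f"<html>profile for user {user_id}</html>"

    def get_page(user_id):
        key = f"page:{user_id}"
        cached = r.get(key)                # trivial lookup by key
        if cached is not None:
            return cached.decode("utf-8")  # served straight from the cache
        page = render_page(user_id)        # the expensive path
        r.set(key, page, ex=60)            # cache it for 60 seconds
        return page

The first request pays the full cost; repeat requests within the next minute never touch the backend database at all.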
If we look at a database like Mongo or any other document DB, performing an insert is very simple. There’s no schema to validate, and you don’t have to satisfy multiple key constraints. You just drop a JSON document into a file. The real issue is that the indexes then have to be updated, which can actually take a while, especially with sharding in place -- which is why consistency is fairly loose with Mongo.
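For a feel of how simple a document insert is, here is a minimal pymongo sketch (the connection string, database, collection, and field names are invented for the example); note that there is no schema to declare up front.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
    orders = client["shop"]["orders"]

    # Just drop a JSON-like document in; no schema, no foreign-key checks.
    orders.insert_one({
        "customer": "alice",
        "items": [{"sku": "A-100", "qty": 2}],
        "total": 19.98,
    })

    # Indexes are what make later lookups fast -- and what must be kept up to date.
    orders.create_index("customer")
    print(orders.find_one({"customer": "alice"}))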
RDBMSes are actually very fast, even compared to most NoSQL flavors, if properly tuned. They do tend to have a higher entry point in terms of hardware requirements, but they also do more, and they are battle-tested. With Mongo in particular, the primary advantage is how low the barrier to entry is and how simple it is to replicate and shard, allowing you to run thousands of PC-quality servers that you can throw away if they burn out.
Long story short, if they do scale better for you, it’s because you’ve hit on a perfect niche case for them and like to do things cheaply.
A database is made up of tables, and tables are made up of rows and columns. Tables are all the types of things you have in memory, rows are the individual things of that type, and columns are all the ways to describe each thing of that type.
Let's think of a database as your memory. You have, for example, people, colors, emotions, historical events, and so on stored in your memory. These are all tables. Let's take the table that stores all the specific cars you have a memory of. Each car has a color, a producer, and a country of origin, among many other attributes. For instance, your dad's BMW was made in Germany.
Now, Germany has a lot of attributes itself (it's in Europe! Its capital is Berlin! And so on). In cases like this, when an attribute has attributes of its own, it gets its own table to hold all of its descriptors, and whenever that attribute describes something else in the database, it is referred to by ID instead of by name, with the ID corresponding to the ID of its row in its own table.
This concept isn't actually as foreign as it may seem. When I ask you what the capital of the country your dad's car was made in is, you say Berlin. You had never actually considered that question, but you were able to arrive at the answer rapidly. Now, if Germany decides to change its capital to Munich and I ask you the same question, then, while you would never have associated that change with your dad's car (let alone with which human is your dad), you have no trouble updating your answer.
Why? Simple. You updated your memory of Germany's capital, and you didn't have to go through the awkwardly laborious process of updating the capital stored in every memory you associate with Germany. Why, that would be foolish.
Databases are designed to make the organization of information with complex relationships straightforward.
Incidentally, when we use databases, the process by which you view, add, edit, or remove things and associations is done with a language called SQL, which has its own set of rules governing how you describe relationships and, importantly, how you arrive at answers to anything you want to know.
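To make the analogy concrete, here is a small sketch in SQLite (used only because it ships with Python; the table and column names are invented). The car row points at Germany by ID, so changing the capital in exactly one place updates every answer that depends on it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT, capital TEXT);
        CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT,
                           made_in INTEGER REFERENCES countries(id));
        INSERT INTO countries VALUES (1, 'Germany', 'Berlin');
        INSERT INTO cars VALUES (1, 'BMW', 1);  -- refers to Germany by ID, not by name
    """)

    # "What is the capital of the country my dad's car was made in?"
    question = """
        SELECT countries.capital
        FROM cars JOIN countries ON cars.made_in = countries.id
        WHERE cars.model = 'BMW'
    """
    print(conn.execute(question).fetchone()[0])   # Berlin

    # Change the capital in exactly one place...
    conn.execute("UPDATE countries SET capital = 'Munich' WHERE name = 'Germany'")
    print(conn.execute(question).fetchone()[0])   # ...and the answer updates: Munich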