I'm looking into the database design for a data-warehouse kind of project which involves a large number of inserts daily. The archived data will later be used to generate reports. I will have a list of users (for example, a set of 2 million users) for which I need to monitor the daily social networking activity associated with them.

For example, let there be a set of 100 users, say U1, U2, ..., U100.

I need to insert their daily status count into my database.

Consider that the total status count obtained for user U1 for the period June 30 - July 6 is as follows:

June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107

The database should keep the daily status count of each user, like:

For user U1,

July 1- 1 (100-99)
July 2- 2 (102-100) 
July 3- 0 (102-102) 
July 4- 3 (105-102) 
July 5- 0 (105-105) 
July 6- 2 (107-105)
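To make the calculation concrete, this is roughly the query I have in mind for deriving the daily count from the raw totals (status_total, count_date and total_count are just placeholder names for a table holding the collected running totals):

    -- daily count = today's running total minus yesterday's running total
    SELECT t.user_id,
           t.count_date,
           t.total_count - y.total_count AS daily_count
    FROM   status_total t
    JOIN   status_total y
      ON   y.user_id    = t.user_id
     AND   y.count_date = t.count_date - INTERVAL 1 DAY;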

Similarly, the database should hold archived details for the full set of users.

In a later phase, I envision generating aggregate reports from this data, like total points scored on each day, week, month, etc., and comparing them with older data.

I need to start things from scratch. I am experienced with PHP as a server-side scripting language and with MySQL, but I am unsure about the database side. Since I need to process about a million insertions daily, what should be taken care of?

I am unsure how to design a MySQL database in this regard. Which storage engine should be used and which design patterns should be followed, keeping in mind that the data should later be usable effectively with aggregate functions?

Currently I envision a design with one table storing all the user IDs, referenced via a foreign key, and a separate status count table for each day. Could a large number of tables create overhead?
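Roughly, the design I have in mind looks like this (table and column names are only placeholders, and one status table would be created per day):

    CREATE TABLE users (
        user_id     INT UNSIGNED NOT NULL PRIMARY KEY,
        screen_name VARCHAR(100) NOT NULL
    );

    -- one such table per day: status_count_2014_07_01, status_count_2014_07_02, ...
    CREATE TABLE status_count_2014_07_01 (
        user_id     INT UNSIGNED NOT NULL,
        daily_count INT          NOT NULL,
        PRIMARY KEY (user_id),
        FOREIGN KEY (user_id) REFERENCES users (user_id)
    );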

Does MySQL fit my requirements? 2 million or more DB operations are done every day. How should the server and other factors be considered in this case?

EDIT

1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.

Before inserting, I plan to calculate the daily status count, i.e. the difference between today's count and yesterday's.

2) In a later phase, the archived data (collected over past days) is used as a data warehouse, and aggregation tasks are to be performed on it.

Comments:

I have read that MyISAM is the best choice for data warehousing projects, but at the same time I have heard that InnoDB excels in many ways. Many have suggested that proper tuning will get it done; I would like to get thoughts on that as well.

Please advise.

asked Aug 27 '14 at 08:14 by Surabhil

edited Aug 27 '14 at 09:10 by Markus Winand ♦♦


One Answer:

The questions you are asking are too broad to be answered sufficiently. I'll just pick out some questions that I can answer quickly:

data-warehouse kind of project

So why not use data warehouse concepts? Have you had a look at the Star Schema or at free open-source tools such as Pentaho?

Since I need to process about a million insertions daily, what should be taken care of?

Quite a lot, but it also depends on where your data is coming from. Generally, doing it the PHP way is probably the wrong way!

I am unsure how to design a MySQL database in this regard.

Data warehouses typically work on two databases:

  • the master data in a properly normalized format—often the OLTP system;
  • the DWH database typically kept in a Star Schema.

In your case, you might need a master database that holds the total status counts for each user & day, which is then used to load the DWH with the diff relative to the previous day. However, depending on your data source, you might be able to merge this into a single step.
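A minimal sketch of that two-step idea, with all table and column names made up here (a real Star Schema would also add separate user and date dimension tables):

    -- master DB: raw running totals as collected each day
    CREATE TABLE status_total (
        user_id     INT UNSIGNED NOT NULL,
        count_date  DATE         NOT NULL,
        total_count INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id, count_date)
    );

    -- DWH: daily fact table holding the per-day differences
    CREATE TABLE fact_daily_status (
        user_id     INT UNSIGNED NOT NULL,
        count_date  DATE         NOT NULL,
        daily_count INT          NOT NULL,
        PRIMARY KEY (user_id, count_date)
    );

    -- load one day's diffs relative to the previous day
    INSERT INTO fact_daily_status (user_id, count_date, daily_count)
    SELECT t.user_id,
           t.count_date,
           t.total_count - y.total_count
    FROM   status_total t
    JOIN   status_total y
      ON   y.user_id    = t.user_id
     AND   y.count_date = t.count_date - INTERVAL 1 DAY
    WHERE  t.count_date = '2014-07-06';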

Which storage engine should be used and which design patterns should be followed, keeping in mind that the data should later be usable effectively with aggregate functions? [...] I have read that MyISAM is the best choice for data warehousing projects, but at the same time I have heard that InnoDB excels in many ways.

For the DWH system, MyISAM might be better in the usual case. InnoDB is generally slower because it does more to ensure data safety. However, in a typical DWH project, the DWH database doesn't need that safety because:

  • it has only data derived from the master db
  • there are no uncoordinated concurrent write operations.

Regarding aggregate functions: the key to fast aggregation in a DWH is to pre-aggregate. Tools such as Pentaho provide facilities to do that.
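For illustration only (the summary table name is made up, and a tool like Pentaho would normally manage such aggregates for you), a monthly summary table could be refreshed after each daily load so that reports never have to scan the daily facts:

    -- pre-aggregated monthly totals per user, rebuilt after each load
    CREATE TABLE agg_monthly_status (
        user_id     INT UNSIGNED NOT NULL,
        month_start DATE         NOT NULL,
        month_count INT          NOT NULL,
        PRIMARY KEY (user_id, month_start)
    );

    REPLACE INTO agg_monthly_status (user_id, month_start, month_count)
    SELECT user_id,
           DATE_FORMAT(count_date, '%Y-%m-01'),
           SUM(daily_count)
    FROM   fact_daily_status
    GROUP  BY user_id, DATE_FORMAT(count_date, '%Y-%m-01');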

Currently I envision a design with one table storing all the user IDs, referenced via a foreign key, and a separate status count table for each day. Could a large number of tables create overhead?

This is how the master DB should be designed. However, in the DWH you should consider a Star Schema.

Does MySQL fit my requirements? 2 million or more DB operations are done every day. How should the server and other factors be considered in this case?

It may or may not be the right choice for you. 2 M inserts a day is generally not too much for MySQL. However, you'll probably want to load the data in a shorter window, e.g. during the night. If you target a 5-hour window, that would mean about 111 inserts per second, which is absolutely possible I'd say with the right approach (doing it yourself with PHP being the wrong one!).
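One common way to reach that rate, purely as an illustration of batch loading rather than something prescribed above, is to write the pre-calculated diffs to a flat file and bulk-load it instead of issuing single-row INSERTs from PHP:

    -- bulk-load one day's pre-calculated diffs from a CSV file
    -- (file path and table name are placeholders)
    LOAD DATA LOCAL INFILE '/tmp/daily_status_2014-07-06.csv'
    INTO TABLE fact_daily_status
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (user_id, count_date, daily_count);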

answered Aug 27 '14 at 09:29 by Markus Winand ♦♦