Let's assume your are running a very large social web-site. You are sending birthday reminder e-mails every day so that the users receive an e-mail the day before a friend's birthday.

For sake of this exercise, let's forget about the friends relations. Just focus on finding all people whose birthday is tomorrow.

CREATE TABLE member (
  [...]
  date_of_birth  DATE,
  [...]
)

Use whatever SQL database you like, just let us know the DB you assume.

  1. Write the select to retrieve all member whose birthday is tomorrow.
  2. Provide an index definition that supports your query.
    Note: if that is a daily statement, it's probably not worth indexing. However, that's an exercise.
  3. explain your rationale
  4. Explain the impacts if the e-mailing is done weekly, covering all birthdays of the upcomming week.

asked Dec 17 '10 at 07:46

Markus%20Winand's gravatar image

Markus Winand ♦♦
93651120


2 Answers:

Main problem is handling the correct birthday for people with Feb 29 as birthday. They usually celebrate their birthday in leap years on Feb 29, otherwise Feb 28. Here's some code for PostgreSQL (9.0 used)

-- A simple helper function

create or replace function getDayInYear(timestamp) returns int as $$
    select floor(date_part('day', $1) +
                date_part('month', $1)*100)::int;
$$ language sql;

CREATE TABLE member (
  id serial not null primary key,
  date_of_birth date not null,
  dob int2 not null default 0
);

-- Create a suitable index

create index member_dob_idx on member(dob);

-- And a trigger to fill dob.

create or replace function computeDob() returns trigger as $$
begin
    NEW.dob = getDayInYear(NEW.date_of_birth);
    return NEW;
end
$$ language plpgsql;
create trigger tgDow 
       before insert or update 
       on member for each row execute procedure computeDob();

-- populate table with 5 million rows testdata with users ranging from 13 to 60 years.

insert into member (date_of_birth)
    select current_date -
            (floor(13*365 + random() * (47*365)) || ' days')::interval
     from generate_series(1, 5000000);
analyze member;

-- makes it easier for demos

create or replace function getBirthdayMembers(timestamp) returns setof member as $$
    select *
      from member
     where dob between
           getDayInYear($1 + '1 day'::interval)
             and
           (getDayInYear($1 + '1 day'::interval) +
             case
             -- called on Feb 27 in a leap year
             when     getDayInYear($1) = 227 
                  and getDayInYear($1 + '2 days'::interval) = 229 
                  then 0
             -- called on Feb 27 in a non-leap year
             when     getDayInYear($1) = 227 
                  and getDayInYear($1 + '2 days'::interval) > 229 
                  then 1
             else 0
             end
           )
$$ language sql;

-- get all members having birthday tomorrow (first parameter is called date)
explain analyze select * from getBirthdayMembers(CURRENT_DATE);

explain analyze select * from getBirthdayMembers('2011-02-27');

explain analyze select * from getBirthdayMembers('2011-02-28');

explain analyze select * from getBirthdayMembers('2012-02-27');

explain analyze select * from getBirthdayMembers('2012-02-28');

Performance is good, even on my old desktop machine I get response times of ~100 msec for 5 mio rows. However, 5 mio is small enough to fit in memory, it would become more interesting with 500 mio members (but I do not have enough disk space on this machine to test).

Weekly jobs would be similar, however there would be more corner cases to test (I guess so). Nothing too complicated.

answered Feb 09 '11 at 08:54

Neutrino's gravatar image

Neutrino
71115

edited Feb 09 '11 at 12:05

Markus%20Winand's gravatar image

Markus Winand ♦♦
93651120

Your solution seems to work fine. However, two issues:

  • Your explain analyze doesn't show what's happening inside the function. If you explain analyze the actual select statement, you'll see the 'Index Scan using member_dob_idx' operation in the plan.

  • Can you think of a solution that doesn't need an additional column and a trigger?

(Feb 09 '11 at 13:22) Markus Winand ♦♦

One more thing. The "weekly job" doesn't work that way, because of the 'year end' case. The birthdays at the year end are at the end of the index, while the birthdays at the start of the year are at the beginning. So, that special case would either need some kind of FULL scan, or two tree traversals.

(Feb 15 '11 at 14:33) Markus Winand ♦♦

The SRF is only to make it easier to test different scenarios. I checked the query plan before to use the index. Regarding triggerless solutions, sure:

DROP TRIGGER tgdow on member;
DROP FUNCTION computedob ();
ALTER TABLE member DROP dob;
CREATE index member_idx2 on member(getDayInYear(date_of_birth));

EXPLAIN ANALYZE select *
      from member
     where getDayInYear(date_of_birth) between
                getDayInYear(CURRENT_DATE + '1 day'::interval)
                and
                (getDayInYear(CURRENT_DATE + '1 day'::interval) +
                  case
                        -- called on Feb 27 in a leap year
                        when getDayInYear(CURRENT_DATE) = 227 and getDayInYear(CURRENT_DATE + '2 days'::interval) = 229 then 0
                        -- called on Feb 27 in a non-leap year
                        when getDayInYear(CURRENT_DATE) = 227 and getDayInYear(CURRENT_DATE + '2 days'::interval) > 229 then 1
                        else 0
                  end
                );

Still a indexable scenario, but not that portable (needs functional indices)

answered Feb 10 '11 at 03:40

Neutrino's gravatar image

Neutrino
71115