Wednesday 15 January 2014

Scale a PostgreSQL stored procedure

My current implementation of a PostgreSQL stored procedure doesn't scale, even though the problem could be split across parallel processes/threads.

The setup

The application behaves much like a dating platform: users sign up, enter a couple of profile details, and based on these details they are matched against other users. The details amount to 60-70 attributes, all booleans, stored within the user's record in the user_attributes table. There is one big user_attributes table, consisting of the user id and the attributes (other profile info is stored in a separate table). The attribute-by-column scheme was chosen due to performance concerns, i.e. to prevent additional queries when getting the attributes of one user. For the matching, there is a per-user matching table: every user has their own table consisting of user_id, other_user_id, and matching_score.
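
For concreteness, a minimal sketch of what such a schema might look like (all names here are assumptions based on the description above; the real user_attributes table has 60-70 boolean columns, abbreviated to three):

    -- Hypothetical schema sketch; the real table has 60-70 boolean columns.
    CREATE TABLE user_attributes (
        id         bigint PRIMARY KEY,
        attribute0 boolean NOT NULL,
        attribute1 boolean NOT NULL,
        attribute2 boolean NOT NULL
        -- ... attribute3 through attribute69
    );

    -- One matching table per user, e.g. for the user with id 42:
    CREATE TABLE user42_matches (
        user_id        bigint  NOT NULL,
        other_user_id  bigint  NOT NULL,
        matching_score integer NOT NULL
    );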

We'd like to have 300k users per DB instance, and it would be interesting to see how it scales to 10 times that, i.e. 3 million users. Beyond that, we can scale by distributing across other database instances. Nevertheless, we start to have scalability problems at around 80k users.

The question

As mentioned before, due to performance concerns the attributes are stored in the user_attributes table with one column per attribute. We've created a stored procedure (create_user) that takes all of the 60-70 attributes as arguments, creates the record in the user table, then selects all other users, including their attributes, from the user_attributes table and calculates the matching score for each, with the final result being inserted into the newly created userxyz_matches table.

We ran a test to see how the setup performs (inserting one user at a time until 300k users are reached), and it turns out that at around 80k users our CPU becomes the bottleneck. Though the test machine comes with 4 cores / 8 threads, only one is utilized. One problem is that the matching-per-other-user takes too long (PL/pgSQL performs poorly here), but the core issue is that all of these matchings happen on one CPU. For example, the matching against all other users could be split into 8 different operations, each taking 1/8 of the user_attributes table records, performing the matching, and inserting into the result table (see the sketch below). We can optimize the badly performing PL/pgSQL, but we don't know a way to distribute the work across the other CPU cores/threads.
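
As an illustration of that idea, under the hypothetical schema sketched above, each of the 8 operations could claim its slice with a modulo filter on the id (42 stands in for the newly inserted user, 3 for this worker's number):

    -- One of 8 parallel slices; run one such query per worker/connection.
    SELECT *
    FROM user_attributes
    WHERE id <> 42       -- skip the newly inserted user
      AND id % 8 = 3;    -- this worker's slice (worker numbers 0..7)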

Other information

Please post recommendations on the approach as a whole as comments. I'd appreciate any advice on how to improve things in general, not only a reply to the specific question.

All of the user matching tables are stored within one tablespace backed by XFS and LVM striping across a couple of disks. The number of user matching tables (one per user) doesn't seem to be a scalability problem (as we first thought). The disks are not the problem, and the huge number of tables seems to be covered by this specific setup.

A call/query to create_user should be atomic, i.e. transaction based. That's how our test run works, but it doesn't need to be a hard requirement for the end product.

The create_user procedure looks like this (it's too long to post in whole, so the body is sketched in comments):

create or replace function create_user(...) returns void as $$
    -- (1) input_user = insert into user_attributes values (parameter0, parameter1, ...)
    -- (2) create the userxyz_matching_table
    -- (3) for row in select * from "user_attributes" where "id" <> input_user."id" loop
    --
    --         -- repeated for every attribute:
    --         if row.this_attribute = input_user.this_attribute then
    --             match := match + 1;
    --         end if;
    --         ...
    --
    --         insert into userxyz_matching_table values (input_user.id, row.id, match);
    --     end loop;
$$ language plpgsql;

I know the high CPU usage comes from the sheer number of if / elsif / end if blocks (60-70 of them). Again, this can be optimized, but the question of how to scale such a stored procedure remains.

The server the tests ran on is shown below, which illustrates the problem pretty well:

[screenshot of the test server]

To the best of my knowledge and documentation-reading abilities, PL/pgSQL does not support parallelism, nor does the server parallelize the processing of individual queries. I'm inclined to say, therefore, that scaling further will require parallelization at the client (new users inserted via multiple concurrent threads/processes over separate connections).
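
One way to get that kind of fan-out without a multi-threaded client is the dblink contrib extension, which lets separate backend connections run queries concurrently (each backend can use its own CPU core). A minimal sketch, assuming the hypothetical schema above; the database name and all table/column names are illustrative, and the real score expression would sum 60-70 terms:

    -- Requires: CREATE EXTENSION dblink; (run once per database)
    -- Open two extra backend connections.
    SELECT dblink_connect('worker0', 'dbname=mydb');
    SELECT dblink_connect('worker1', 'dbname=mydb');

    -- Send each worker its slice asynchronously; both INSERTs run concurrently.
    SELECT dblink_send_query('worker0', $q$
        INSERT INTO user42_matches (user_id, other_user_id, matching_score)
        SELECT a.id, b.id,
               (a.attribute0 = b.attribute0)::int
             + (a.attribute1 = b.attribute1)::int
             + (a.attribute2 = b.attribute2)::int
        FROM user_attributes a, user_attributes b
        WHERE a.id = 42 AND b.id <> 42 AND b.id % 2 = 0
    $q$);
    SELECT dblink_send_query('worker1', $q$
        INSERT INTO user42_matches (user_id, other_user_id, matching_score)
        SELECT a.id, b.id,
               (a.attribute0 = b.attribute0)::int
             + (a.attribute1 = b.attribute1)::int
             + (a.attribute2 = b.attribute2)::int
        FROM user_attributes a, user_attributes b
        WHERE a.id = 42 AND b.id <> 42 AND b.id % 2 = 1
    $q$);

    -- Block until each worker finishes and fetch the command status.
    SELECT * FROM dblink_get_result('worker0') AS t(status text);
    SELECT * FROM dblink_get_result('worker1') AS t(status text);
    SELECT dblink_disconnect('worker0');
    SELECT dblink_disconnect('worker1');

Note that this trades away the single-transaction atomicity mentioned above, since each worker commits independently.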

Overall, though, you have an inherent scaling problem: to add a new record you need to compare it against all other records. The cost of reaching n total records therefore scales as n^2 (each insert compares against every existing row, so the total work is roughly 1 + 2 + ... + n ≈ n^2/2), and you're already pegging a CPU 25% of the way through. Adding the 320,000th record is 4 times as expensive as adding the 80,000th, and adding 320,000 records in total is at least sixteen times as expensive as adding 80,000.

It is conceivable that you could improve performance by using a single set-based query instead of the stored procedure's loop (see the sketch below), but that won't improve the asymptotic complexity. You might also consider creating the matching tables asynchronously, to improve the initial response time.
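
A minimal sketch of such a set-based rewrite, again under the hypothetical three-attribute schema from above (the real query would sum 60-70 comparison terms):

    -- Compute all matching scores for the new user (id 42) in one statement,
    -- replacing the PL/pgSQL loop and its per-attribute if blocks.
    INSERT INTO user42_matches (user_id, other_user_id, matching_score)
    SELECT a.id,
           b.id,
           (a.attribute0 = b.attribute0)::int   -- each boolean match adds 1
         + (a.attribute1 = b.attribute1)::int
         + (a.attribute2 = b.attribute2)::int
    FROM user_attributes a
    JOIN user_attributes b ON b.id <> a.id
    WHERE a.id = 42;

This moves the per-attribute work from interpreted PL/pgSQL into the SQL executor, but, as noted, every insert still has to touch every existing row.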

postgresql stored-procedures plpgsql
