Deadlock between Distribution Agent and Distribution Agent Cleanup

Discussion in 'microsoft.public.sqlserver.replication' started by Andrew Pike, Dec 30, 2005.

  1. Andrew Pike

    Andrew Pike Guest

    This is occurring regularly on SQL Server 2000 build 878.

    The problem is a deadlock in the Distribution database. The Distribution
    Agent spid is executing the SELECT statement below:

    select @max_xact_seqno = max(xact_seqno) from MSrepl_commands (READPAST)
    where
    publisher_database_id = @publisher_database_id and
    command_id = 1 and
    type <> -2147483611

    which is found in sp_MSget_repl_commands. It holds an Intent Shared page
    lock on a data page in the MSrepl_commands table.

    The Distribution Agent Cleanup spid is found to be running the command below:

    DELETE MSrepl_commands WITH (PAGLOCK) where
    publisher_database_id = @publisher_database_id and
    xact_seqno <= @max_xact_seqno

    located in the stored procedure sp_MSdelete_publisherdb_trans. This spid
    holds an exclusive page lock on another data page in MSrepl_commands.

    Both spids then attempt to obtain the same lock type on the page which is
    locked by the other.

    The Distribution Agent runs continuously and the Cleanup job is scheduled
    for every 10 minutes. The Publication, Distribution and Subscription
    databases are all on the same instance (3rd party vendor solution, not mine!)
    in an active/active Win2003 cluster configuration. The articles are all
    stored procedure executions.

    Has anybody else seen this deadlock? Is it just a timing issue? Why is the
    PAGLOCK hint used in sp_MSdelete_publisherdb_trans as above?
    (I can't find any articles which correlate exactly to this problem)

    Kind Regards
    Andrew Pike
     
    Andrew Pike, Dec 30, 2005
    #1
    1. Advertisements

  2. Do you have anonymous subscribers or named. With named subscribers the
    distribution clean up agent cleans up more aggressively and you may see
    problems like this when a subscriber has been offline for some time.

    First off issue a select * from distribution.dbo.MSdistribution_status to
    see how many undelivered vs delivered commands there are. If there are a
    high number of delivered commands, I would stop the SQL Server Agent and run
    the distribution clean up agent manually.

    I can't comment on why the decision was made to implement the two types of
    locks, but in general MS has done a lot of research to deliver optimal
    performance. For example the 27 in sp_MSadd_repl_commands27 comes from
    tests that they did to find the optimal number of commands to send to the
    distribution database in a batch from the log reader agent. And yes, they
    tested a range of commands to find which offered best performance.

    It looks like the readpast is to prevent locking, and the page lock is to
    prevent a table lock.
     
    Hilary Cotter, Dec 30, 2005
    #2
    1. Advertisements

  3. Andrew Pike

    Andrew Pike Guest

    Thanks for the speedy reply Hilary,

    The subscription in question is a named push subscription - anonymous pull
    subscriptions are not enabled.

    I ran SELECT * FROM distribution.dbo.MSdistribution_status as requested.
    There were two rows in the result set, both of which had a value of 0 for
    UndelivCmdsInDistDB, while the values of DelivCmdsInDistDB were 2 and 7. A
    couple of minutes later this query returned no rows - the Cleanup job had run
    I presume. Would I be correct in stating that the sum of the
    DelivCmdsInDistDB column will correlate with the number of commands purged by
    the Cleanup job?
    The job history of the last execution reads:

    Removed 9 replicated transactions consisting of 9 statements in 0 seconds (0
    rows/sec)

    It certainly looks that way.
    These don't look to be particularly high numbers to me, although it is just
    past midnight here! The deadlock does occur at many different times though.
    The job history that relates to the last occurrence of the deadlock is as
    follows:

    Removed 31 replicated transactions consisting of 9669 statements in 2
    seconds (4850 rows/sec)

    You're right about the READPAST hint, it's there to prevent blocking caused
    by the Log Reader Agent. I suppose the PAGLOCK hint has the alternative of
    ROWLOCK, although as you suggest Microsoft have almost certainly tested this
    to the nth degree. ROWLOCK would undoubtedly incur much greater lock
    resources given very large volumes of replicated commands.

    Kind Regards,
    Andrew Pike
    --
    SQL Server DBA
    Accenture UK



     
    Andrew Pike, Dec 31, 2005
    #3
  4. The commands are shared by multiple subscribers. So when the commands are
    distributed to all subscribers they are marked for deletion. For a single
    subscriber your conclusions are accurate.

    Note that the delivered commands in distribution db can vary widely as some
    transactions affect more rows than others. Then some transactions take
    longer to be applied on the subscriber, especially text ones. You might want
    to look at replicating the execution of stored procedures. If a large part
    of your DML affects a large number of rows (in other words a single proc
    will affect more than a singleton) you can get substantial performance
    improvements by doing so.



     
    Hilary Cotter, Dec 31, 2005
    #4
  5. Andrew Pike

    Andrew Pike Guest

    Our current replication topology involves a single subscriber - in fact the
    subscription database is located on the same instance as the publication
    database.
    All of the articles contained within the publication are stored procedure
    executions.
    This is a component of a 3rd party application that I cannot alter, although
    I may be
    able to suggest improvements to the vendor.

    Kind Regards
    Andrew Pike
     
    Andrew Pike, Dec 31, 2005
    #5
  6. The complexity of the procs on both sides (they can be different, they just
    have to have the same name) and the indexes in place on both sides will have
    an impact on performance, and hence the amount of time it takes to apply the
    transactions/commands/execute the stored procedures.

    You might want to run profiler and watch for recompiles, duration, CPU and
    IO.
     
    Hilary Cotter, Dec 31, 2005
    #6
  7. Andrew Pike

    Andrew Pike Guest

    I'm back in the office on Wednesday; I'll try to run a trace then.

    However, returning to the original SQL involved in the deadlock as given
    below:

    select @max_xact_seqno = max(xact_seqno) from MSrepl_commands
    (READPAST)
    where
    publisher_database_id = @publisher_database_id and
    command_id = 1 and
    type <> -2147483611

    Assuming this is executed at the default isolation level of READ COMMITTED
    (I don't have access to a server at the moment to check), the shared locks it
    obtains will be released at the end of its execution. Therefore why would
    long running replicated transactions/stored procedure executions have
    an impact? This part of the stored procedure sp_MSget_repl_commands is
    near to the top of the code and is used to find the most recent command
    written by the Log Reader for replication to subscribers. This is then used
    as an upper bound for further queries.

    Kind Regards
    Andrew Pike
     
    Andrew Pike, Jan 1, 2006
    #7
  8. Your last comment maybe the cue. If indeed it is stored procedure
    replication, then you have no way to control the impact the execution of a
    stored procedure at the subscriber will have. Especially if the proc touches
    on many tables. Where as the sequence is garanteed at the publisher it may
    not be at the subcriber. If as you say you do not have much control over the
    application because it is a third party application, then you maybe stuck.
    However, what is not clear is whether the subscriber is also involved in the
    third party application. If it is not, I would look at transactional, one
    way, replication to the subscriber and the possibility of moving the
    sbscription DB to another server if that is an option as well. HTH.
     
    Patrick Ikhifa, Jan 2, 2006
    #8
  9. Looking at this proc it does the read past to prevent locking due to new
    commands being written to this table by the log reader. So we are really
    getting a dirty read; and only reading the transactions committed some time
    ago. This proc is actually fired by the distribution agent.

    Now long running transactions will have an impact on the amount of data that
    the log reader has to read in the log to construct the replicated
    transactions and commands and will degrade performance of the log reader
    (not to mention all of the processes accessing the database).

    If the commands are wide (i.e. affects many columns, or has text or image
    columns in them) these procs will take longer to be applied on the
    subscriber, and to be extracted by the distribution agent from
    msrepl_transactions and msrepl_commands.

    Now to get to your question - long running transactions will not impact the
    distribution agent per se. It will affect the log reader (more log to read)
    and the overall database.

    Replicating the execution of stored procedures will offer better performance
    if the procs you are replicating affect several rows.

    Its like this. I issue this update statement on the publisher - update
    tablename where pk>10 and pk < 20. This affects 9 rows on the publisher. The
    log reader reads the update tablename where pk>10 and pk < 20 from the log
    and writes it transaction information to msrepl_transactions (1 row for the
    transaction) and then it breaks down the update statement to 9 singletons
    and writes each singleton to msrepl_commands.

    The distribution agent then reads msrepl_transactions and finds the
    transaction information in this table for the update, and then reads all 9
    singleton commands which form this transaction and fires each proc
    individually on the subscriber to ensure that it affects only 1 row. If it
    affects 0 or more than 1 row it knows that it has lost database consistency
    between the publisher and the subscriber because transactional replication
    replicates transactions transaction.

    Now with replicating the execution of stored procedures, the transaction
    information is written to msrepl_transactions as before, but this time
    msrepl_commands has one row in it - the proc that was used to fire the dml
    on the publisher.

    So all the distribution agent has to do is fire the 1 proc on the
    subscriber - hence better performance.

    As I hope you can see, long running transactions are not really significant
    with this proc; with the log reader agent they are. With your database they
    are; but not so much with the distribution agent.
     
    Hilary Cotter, Jan 2, 2006
    #9
  10. Andrew Pike

    Andrew Pike Guest

    Thanks for those technical details Hilary, very interesting.
    You've confirmed my understanding of the situation as outlined in my
    previous message - long running replicated transactions can't be
    identified as the cause of the deadlock in question. They may affect
    overall performance, but reducing the run-time won't prevent the
    deadlock from occurring. The Distribution Agent Cleanup job code
    which I identified as one of the spids in the deadlock chain may take
    some time to execute if the schedule is infrequent:

    DELETE MSrepl_commands WITH (PAGLOCK) where
    publisher_database_id = @publisher_database_id and
    xact_seqno <= @max_xact_seqno

    Our schedule is every 10 minutes so this should not take too long.
    Are there any indexes on MSrepl_commands?

    The code found to be executing by the other spid in the deadlock
    chain should be almost instantaneous.

    Kind Regards
    Andrew Pike
     
    Andrew Pike, Jan 2, 2006
    #10
  11. Andrew Pike

    Andrew Pike Guest

    The replication topology is wholly devised by the application vendor, indeed
    they supply their own custom wizard for creating the replication objects. If
    this deadlock occurred within application code I would perform the analysis
    and refer it onto them. As the deadlock is caused by two system stored
    procedures I feel the need to investigate further, hence my post here.

    Kind Regards
    Andrew Pike
     
    Andrew Pike, Jan 3, 2006
    #11
  12. You might want to see what happens if you change the frequency of your
    distribution clean up agent, perhaps have it run every 5 minutes and see
    what happens.

    There is a single index (clustered) on msrepl_commands on
    publisher_database_id, xact_seqno, command_id.

    You might want to see what happens if you configure your database to disable
    automatically update statistics. Automatically updating statistics will
    cause a performance hit for every 500 rows inserted/deleted for this table,
    but it might speed up deletes.

    Another option will be to put your distribution database in bulk copy
    recover model. You loose recoverability, but insert and update performance
    is better.

    Other than that I would work on the database itself to see if you can
    improve performance.
     
    Hilary Cotter, Jan 3, 2006
    #12
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.