Deadlock between Distribution Agent and Distribution Agent Cleanup

Discussion in 'microsoft.public.sqlserver.replication' started by zerg2k, Dec 21, 2006.

  1. zerg2k

    zerg2k Guest

    I am experiencing this problem. Deadlock of these two M$ stored
    procedures :

    sp_MSget_repl_commands (Executed by the Distribution Agent --pull
    subscriber ) and
    sp_MSdistribution_cleanup (Executed by the Distribution Agent Cleanup
    job)

    the offending queries are :
    DELETE MSrepl_commands WITH (PAGLOCK) where
    publisher_database_id = @publisher_database_id and
    xact_seqno <= @max_xact_seqno
    select @max_xact_seqno = max(xact_seqno) from MSrepl_commands
    (READPAST)
    where
    publisher_database_id = @publisher_database_id and
    command_id = 1 and
    type <> -2147483611

    I searched this and other groups and no convincing answer was posted.
    Is there anyone experiencing this problem ? if so what did you do to
    "resolve" it (not to decrease its frequency)

    Thanks in Advance.

    -Noel
    Sr. DBA
     
    zerg2k, Dec 21, 2006
    #1
    1. Advertisements

  2. zerg2k

    Kevin3NF Guest

    I've seen this a lot, since they are both hitting the same repl table at the
    same time, but I've never seen it fail/deadlock for extended periods of
    time. If your agent failing, then succeeding?

    --
    Kevin Hill
    3NF Consulting
    http://www.3nf-inc.com/NewsGroups.htm

    Real-world stuff I run across with SQL Server:
    http://kevin3nf.blogspot.com
     
    Kevin3NF, Dec 21, 2006
    #2
    1. Advertisements

  3. I can't promise there is anything M$ can do about it (how often does this
    occur?) but it would be great if you can post the deadlock trace here. (Or,
    you can log a feedback item @ http://www.microsoft.com/connect)

    -Raymond
     
    Raymond Mak [MSFT], Dec 21, 2006
    #3
  4. zerg2k

    zerg2k Guest

    Here it goes:

    Deadlock encountered .... Printing deadlock information
    2006-12-21 13:29:58.05 spid4
    2006-12-21 13:29:58.05 spid4 Wait-for graph
    2006-12-21 13:29:58.05 spid4
    2006-12-21 13:29:58.05 spid4 Node:1
    2006-12-21 13:29:58.05 spid4 PAG: 6:1:304949
    CleanCnt:2 Mode: X Flags: 0x2
    2006-12-21 13:29:58.05 spid4 Grant List 0::
    2006-12-21 13:29:58.05 spid4 Owner:0x9d1a3b40 Mode: X
    Flg:0x0 Ref:0 Life:02000000 SPID:201 ECID:0
    2006-12-21 13:29:58.05 spid4 SPID: 201 ECID: 0 Statement Type:
    DELETE Line #: 162
    2006-12-21 13:29:58.05 spid4 Input Buf: Language Event: EXEC
    dbo.sp_MSdistribution_cleanup @min_distretention = 0,
    @max_distretention = 72
    2006-12-21 13:29:58.05 spid4 Requested By:
    2006-12-21 13:29:58.05 spid4 ResType:LockOwner Stype:'OR' Mode:
    IS SPID:76 ECID:0 Ec:(0x24E45A50) Value:0x3955af40 Cost:(0/0)
    2006-12-21 13:29:58.05 spid4
    2006-12-21 13:29:58.05 spid4 Node:2
    2006-12-21 13:29:58.05 spid4 PAG: 6:1:304955
    CleanCnt:2 Mode: IS Flags: 0x0
    2006-12-21 13:29:58.05 spid4 Grant List 1::
    2006-12-21 13:29:58.05 spid4 Owner:0x3955af80 Mode: IS
    Flg:0x0 Ref:1 Life:00000000 SPID:76 ECID:0
    2006-12-21 13:29:58.05 spid4 SPID: 76 ECID: 0 Statement Type:
    SELECT Line #: 72
    2006-12-21 13:29:58.05 spid4 Input Buf: RPC Event:
    sp_MSget_repl_commands;1
    2006-12-21 13:29:58.05 spid4 Requested By:
    2006-12-21 13:29:58.05 spid4 ResType:LockOwner Stype:'OR' Mode:
    X SPID:201 ECID:0 Ec:(0x366CD5F8) Value:0x9d1a3cc0 Cost:(1/0)
    2006-12-21 13:29:58.05 spid4 Victim Resource Owner:
    2006-12-21 13:29:58.05 spid4 ResType:LockOwner Stype:'OR' Mode: X
    SPID:201 ECID:0 Ec:(0x366CD5F8) Value:0x9d1a3cc0 Cost:(1/0)
    2006-12-21 13:38:01.66 spid4
    Deadlock encountered .... Printing deadlock information
    2006-12-21 13:38:01.66 spid4
    2006-12-21 13:38:01.66 spid4 Wait-for graph
    2006-12-21 13:38:01.66 spid4
    2006-12-21 13:38:01.66 spid4 Node:1
    2006-12-21 13:38:01.66 spid4 PAG: 6:1:606735
    CleanCnt:2 Mode: X Flags: 0x2
    2006-12-21 13:38:01.66 spid4 Grant List 3::
    2006-12-21 13:38:01.66 spid4 Owner:0x6ed511a0 Mode: X
    Flg:0x0 Ref:0 Life:02000000 SPID:217 ECID:0
    2006-12-21 13:38:01.66 spid4 SPID: 217 ECID: 0 Statement Type:
    DELETE Line #: 162
    2006-12-21 13:38:01.66 spid4 Input Buf: Language Event: EXEC
    dbo.sp_MSdistribution_cleanup @min_distretention = 0,
    @max_distretention = 72
    2006-12-21 13:38:01.66 spid4 Requested By:
    2006-12-21 13:38:01.66 spid4 ResType:LockOwner Stype:'OR' Mode:
    IS SPID:76 ECID:0 Ec:(0x24E45A50) Value:0x89d873c0 Cost:(0/0)
    2006-12-21 13:38:01.66 spid4
    2006-12-21 13:38:01.66 spid4 Node:2
    2006-12-21 13:38:01.66 spid4 PAG: 6:1:494567
    CleanCnt:2 Mode: IS Flags: 0x0
    2006-12-21 13:38:01.66 spid4 Grant List 1::
    2006-12-21 13:38:01.66 spid4 Owner:0x89d6c8a0 Mode: IS
    Flg:0x0 Ref:1 Life:00000000 SPID:76 ECID:0
    2006-12-21 13:38:01.66 spid4 SPID: 76 ECID: 0 Statement Type:
    SELECT Line #: 72
    2006-12-21 13:38:01.66 spid4 Input Buf: RPC Event:
    sp_MSget_repl_commands;1
    2006-12-21 13:38:01.66 spid4 Requested By:
    2006-12-21 13:38:01.66 spid4 ResType:LockOwner Stype:'OR' Mode:
    X SPID:217 ECID:0 Ec:(0x9AC675F8) Value:0x33685820 Cost:(1/0)
    2006-12-21 13:38:01.66 spid4 Victim Resource Owner:
    2006-12-21 13:38:01.66 spid4 ResType:LockOwner Stype:'OR' Mode: X
    SPID:217 ECID:0 Ec:(0x9AC675F8) Value:0x33685820 Cost:(1/0)
    2006-12-21 13:46:39.30 spid4
    Deadlock encountered .... Printing deadlock information
    2006-12-21 13:46:39.30 spid4
    2006-12-21 13:46:39.30 spid4 Wait-for graph
    2006-12-21 13:46:39.30 spid4
    2006-12-21 13:46:39.30 spid4 Node:1
    2006-12-21 13:46:39.30 spid4 PAG: 6:1:168158
    CleanCnt:2 Mode: IS Flags: 0x0
    2006-12-21 13:46:39.30 spid4 Grant List 1::
    2006-12-21 13:46:39.30 spid4 Owner:0x394e9300 Mode: IS
    Flg:0x0 Ref:1 Life:00000000 SPID:76 ECID:0
    2006-12-21 13:46:39.30 spid4 SPID: 76 ECID: 0 Statement Type:
    SELECT Line #: 72
    2006-12-21 13:46:39.30 spid4 Input Buf: RPC Event:
    sp_MSget_repl_commands;1
    2006-12-21 13:46:39.30 spid4 Requested By:
    2006-12-21 13:46:39.30 spid4 ResType:LockOwner Stype:'OR' Mode:
    X SPID:62 ECID:0 Ec:(0xB7C355F8) Value:0x29bb1840 Cost:(1/0)
    2006-12-21 13:46:39.30 spid4
    2006-12-21 13:46:39.30 spid4 Node:2
    2006-12-21 13:46:39.30 spid4 PAG: 6:1:168154
    CleanCnt:2 Mode: X Flags: 0x2
    2006-12-21 13:46:39.30 spid4 Grant List 0::
    2006-12-21 13:46:39.30 spid4 Owner:0x29a1cac0 Mode: X
    Flg:0x0 Ref:0 Life:02000000 SPID:62 ECID:0
    2006-12-21 13:46:39.30 spid4 SPID: 62 ECID: 0 Statement Type:
    DELETE Line #: 162
    2006-12-21 13:46:39.30 spid4 Input Buf: Language Event: EXEC
    dbo.sp_MSdistribution_cleanup @min_distretention = 0,
    @max_distretention = 72
    2006-12-21 13:46:39.30 spid4 Requested By:
    2006-12-21 13:46:39.30 spid4 ResType:LockOwner Stype:'OR' Mode:
    IS SPID:76 ECID:0 Ec:(0x24E45A50) Value:0x394e91a0 Cost:(0/0)
    2006-12-21 13:46:39.30 spid4 Victim Resource Owner:
    2006-12-21 13:46:39.30 spid4 ResType:LockOwner Stype:'OR' Mode: X
    SPID:62 ECID:0 Ec:(0xB7C355F8) Value:0x29bb1840 Cost:(1/0)
    2006-12-21 13:56:09.30 spid4
    Deadlock encountered .... Printing deadlock information
    2006-12-21 13:56:09.30 spid4
    2006-12-21 13:56:09.30 spid4 Wait-for graph
    2006-12-21 13:56:09.30 spid4
    2006-12-21 13:56:09.30 spid4 Node:1
    2006-12-21 13:56:09.30 spid4 PAG: 6:1:71310
    CleanCnt:2 Mode: X Flags: 0x2
    2006-12-21 13:56:09.30 spid4 Grant List 3::
    2006-12-21 13:56:09.30 spid4 Owner:0x827e4ae0 Mode: X
    Flg:0x0 Ref:0 Life:02000000 SPID:244 ECID:0
    2006-12-21 13:56:09.30 spid4 SPID: 244 ECID: 0 Statement Type:
    DELETE Line #: 162
    2006-12-21 13:56:09.30 spid4 Input Buf: Language Event: EXEC
    dbo.sp_MSdistribution_cleanup @min_distretention = 0,
    @max_distretention = 72
    2006-12-21 13:56:09.30 spid4 Requested By:
    2006-12-21 13:56:09.30 spid4 ResType:LockOwner Stype:'OR' Mode:
    IS SPID:76 ECID:0 Ec:(0x24E45A50) Value:0x4d372680 Cost:(0/0)
    2006-12-21 13:56:09.30 spid4
    2006-12-21 13:56:09.30 spid4 Node:2
    2006-12-21 13:56:09.30 spid4 PAG: 6:1:93326
    CleanCnt:2 Mode: IS Flags: 0x0
    2006-12-21 13:56:09.30 spid4 Grant List 1::
    2006-12-21 13:56:09.30 spid4 Owner:0x4d10ef00 Mode: IS
    Flg:0x0 Ref:1 Life:00000000 SPID:76 ECID:0
    2006-12-21 13:56:09.30 spid4 SPID: 76 ECID: 0 Statement Type:
    SELECT Line #: 72
    2006-12-21 13:56:09.30 spid4 Input Buf: RPC Event:
    sp_MSget_repl_commands;1
    2006-12-21 13:56:09.30 spid4 Requested By:
    2006-12-21 13:56:09.30 spid4 ResType:LockOwner Stype:'OR' Mode:
    X SPID:244 ECID:0 Ec:(0x532C35F8) Value:0x8d172f80 Cost:(1/0)
    2006-12-21 13:56:09.30 spid4 Victim Resource Owner:
    2006-12-21 13:56:09.30 spid4 ResType:LockOwner Stype:'OR' Mode: X
    SPID:244 ECID:0 Ec:(0x532C35F8) Value:0x8d172f80 Cost:(1/0)

    This is happening in periods of 'high' transaction activity. (once a
    day at peak hours)

    Is there any recomendation from you guys ? How do you solve this
    problem?

    I don't have a profiler trace for this (running profiler at those times
    could be dangerous ...)

    Hopefully you can offer some workaround.

    Many Thanks!

    -Noel
    Sr. DBA
     
    zerg2k, Dec 21, 2006
    #4
  5. zerg2k

    zerg2k Guest

    Kevin,

    This is not 'extreme' for me but the fact that those deadlocks are
    happening makes me nervous in case the activity expands for more
    extended periods. This is something that I would like to avoid if at
    all possible.

    and you are correct it fails, then retrys and if the 'high' activity
    period some how subsides a bit it succeeds. I thought those lock hints
    were pretty safe to avoid such situations but apparently I was wrong.

    Thanks for the feedback.

    -Noel
    Sr DBA
     
    zerg2k, Dec 21, 2006
    #5
  6. Believe it or not, a considerable amount of work had gone into minimizing
    deadlocks between sp_MSdistribution_cleanup and sp_MSget_repl_commands in
    SQL2005 (not me). The keyword here is, of course, minimizing rather than
    eliminating deadlocks, and here are some of the reasons why eliminating
    deadlocks completely can be difficult:

    1) MSSQL Server optimizer can generate widely different query plans
    depending on server load and data distribution; and different query plans
    can lead to different lock acquisition order that are difficult for us to
    anticipate ahead of time. Once we know that a particular plan chosen by the
    query optimizer can lead to unacceptable frequency of deadlocks with other
    parts of our (replication) system, we can try to strong-arm the optimizer
    into choosing a particular plan using query hints (+\- indexes) . Doing so
    (which we actually do) can be problematic because i) the optimizer is not
    obligated to honor our hints, and ii) if the optimizer is actually "forced"
    to choose the plan we want, the chosen plan is in a sense sub-optimal since
    the optimizer was "clearly" able to find a "better" plan based on its
    internal costing factors.
    2) Lock escalation (rows->page->table) is effectively non-deterministic
    which makes guaranteeing a certain kind of locking behavior/order impossible
    3) Different customers can have widely different work-loads, and tweaks that
    we put in to avoid deadlocks for one customer can make the situation worse
    for another. This is why it is generally a good idea to participate in our
    TAP\CTP programs just so we are not blind to your scenario when we are
    working on a new release. Of course, all the wise\experienced DBAs will
    never install a new version of any Microsoft product until SP2. The trouble
    with that approach (which I will not dismiss completely) is that with the
    complexity of modern software, you will still likely run into trouble
    specific to your environment with SP2 anyway but by that time it will be
    difficult for us mere code developers to fix your problem because of fear of
    regression in a shipped product.
    4) Deadlock elimination can often be an exercise in trade-offs. To give you
    a not-so-valid solution for eliminating deadlocks between
    sp_MSdistribution_cleanup and sp_MSget_repl_commands, we can always use
    application locks to guarantee that only one of them can run at a given time
    but you will likely be less than pleased with the resulting throughput on
    your big multi-proc\core server.

    And because of the above reasons (which the more cynical of you will likely
    think of as excuses), we put deadlock retry logic in our code.

    -Raymond
     
    Raymond Mak [MSFT], Dec 22, 2006
    #6
  7. zerg2k

    zerg2k Guest

    Well, "the optimizer is not obligated to honor our hints" is news for
    me.
    I thought that a hint was more a "Command" than a "begging" and because
    the
    throughput is *not* the problem when Cleaning old transactions I
    probably couldn't
    care less, but that's just me. Deadlocking on the other hand is
    problematic because many resources that were already invested on the
    activity will have to be rolledback ( affecting thoughput also)

    This is a 2000 Server and we will be migrating to 2005 "after" SP2 :)

    I do have a question, would it be acceptable in 2005 to set the
    distribution database to "read commited snapshot Isolation" ?

    Thanks for the feedback.

    -Noel
     
    zerg2k, Dec 22, 2006
    #7
  8. I have been repeatedly told by the optimizer folks that query hints are just
    that: hints. They have practically drilled it in my head that the optimizer
    will not *always* honor query hints (most of the time they are honored)

    Regarding your following statement:
    Believe me when I say you will care when you have gigs of old commands and
    the cleanup agent fails to catch up. To make matters worse, the optimizer
    tends to have radical notion of what an optimal plan is when tables get big.

    The distribution agent is mostly a read-only process at the distribution
    database so it is not too costly if it is chosen as the deadlock victim.

    If I remember correctly, we actually did an experiment turning on
    read-committed snapshot at the distribution database and found that
    distribution cleanup throughput is significantly degraded at all times
    (mostly due to extra IOs incurred for maintaining row versions). But since
    you don't care about cleanup throughput, that can be the right trade-off in
    your case.

    -Raymond

     
    Raymond Mak [MSFT], Dec 22, 2006
    #8
  9. zerg2k

    zerg2k Guest

    Excellent,

    Thank you very much for your feedback. The issue is that the volume is
    very high at certain hours but that goes away at others in which the
    Cleanup agent could be more agressive with the cleanup. I think that
    no matter what I decide finally I will have to change and see what
    happens.

    Your input is very much appreciated.

    Regards,

    -Noel


     
    zerg2k, Dec 22, 2006
    #9
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.