Friday, March 30, 2012

How can I remove duplicate entries in an SQL query?

I have a database being populated by hits to a program on a server.
The problem is each client connection may require a few hits in a 1-2
second time frame. This is resulting in multiple database entries -
all exactly the same, except the event_id field, which is
auto-numbered.

I need a way to query the record w/out duplicates. That is, any
records exactly the same except event_id should only return one record.
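
For reference, the table looks something like this (event_id is the real column name; the other columns here are just placeholders for whatever each hit actually records):

CREATE TABLE playback_log (
  event_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- auto-numbered, differs on every hit
  client_ip VARCHAR(45),   -- placeholder columns: identical across the duplicate rows
  ad_id     INT,
  hit_time  DATETIME
);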

Is this possible??

Thank you,

Barry|||try this, not tested :-)

SELECT *
FROM table1 a,
(SELECT min(id) FROM table1) AS b
WHERE a.id = b.id

BarrySDCA wrote:
> I have a database being populated by hits to a program on a server.
> The problem is each client connection may require a few hits in a 1-2
> second time frame. This is resulting in multiple database entries -
> all exactly the same, except the event_id field, which is
> auto-numbered.
> I need a way to query the record w/out duplicates. That is, any
> records exactly the same except event_id should only return one record.
> Is this possible??
> Thank you,
> Barry|||hrm...no luck. I ran this:

SELECT * from `playback_log` a,(SELECT min(EVENT_ID) FROM
`playback_log` ) AS b WHERE a.EVENT_ID = b.EVENT_ID

Can you see anything? I appreciate the help a bunch. thank you|||Haven't tested this:

SELECT *
FROM playback_log a
WHERE a.event_id = (select min(event_id) from playback_log b
where a.field1 = b.field1)
;

You will need to list as many fields as are needed to match the identical rows in
the WHERE clause of the sub-select.
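
For example, if the rows are identical in client_ip, ad_id and hit_time (placeholder names, substitute whatever columns your table really has), the query would become something like this, again untested:

SELECT *
FROM playback_log a
WHERE a.event_id = (select min(b.event_id) from playback_log b
where b.client_ip = a.client_ip
and b.ad_id = a.ad_id
and b.hit_time = a.hit_time)
;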

Thomas

BarrySDCA wrote on 03.03.2006 00:33:
> hrm...no luck. I ran this:
>
> SELECT * from `playback_log` a,(SELECT min(EVENT_ID) FROM
> `playback_log` ) AS b WHERE a.EVENT_ID = b.EVENT_ID
>
> Can you see anything? I appreciate the help a bunch. thank you|||"Thomas Kellerer" <WVIJEVPANEHT@.spammotel.com> wrote in message
news:46petmFc69qnU1@.individual.net...
> SELECT *
> FROM playback_log a
> WHERE a.event_id = (select min(event_id) from playback_log b
> where a.field1 = b.field1)

Here's a similar possibility, without using a correlated subquery:

SELECT a.*
FROM playback_log AS a
WHERE a.event_id IN (
SELECT MIN(b.event_id)
FROM playback_log AS b
GROUP BY b.field1, b.field2, b.field3, ...)

What I've seen missing in the several solutions proposed is any use of GROUP
BY. You'll need to GROUP BY all the fields of the table _except_ for
event_id.
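
Spelled out with placeholder column names (again, substitute the real columns of playback_log), that would look something like:

SELECT a.*
FROM playback_log AS a
WHERE a.event_id IN (
SELECT MIN(b.event_id)
FROM playback_log AS b
GROUP BY b.client_ip, b.ad_id, b.hit_time)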

Regards,
Bill K.|||Bill Karwin (bill@.karwin.com) writes:
> "Thomas Kellerer" <WVIJEVPANEHT@.spammotel.com> wrote in message
> news:46petmFc69qnU1@.individual.net...
>> SELECT *
>> FROM playback_log a
>> WHERE a.event_id = (select min(event_id) from playback_log b
>> where a.field1 = b.field1)
> Here's a similar possibility, without using a correlated subquery:
> SELECT a.*
> FROM playback_log AS a
> WHERE a.event_id IN (
> SELECT MIN(b.event_id)
> FROM playback_log AS b
> GROUP BY b.field1, b.field2, b.field3, ...)
> What I've seen missing in the several solutions proposed is any use of
> GROUP BY. You'll need to GROUP BY all the fields of the table _except_
> for event_id.

One more variation, using a derived table:

SELECT a.*
FROM playback_log AS a
JOIN (SELECT MIN(event_id) AS event_id
FROM playback_log
GROUP BY field1, field2, ...) AS b ON a.event_id = b.event_id

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||I'm not sure why you need a nested query as others have suggested.

I see two options:

Select Distinct blah1, blah2, [all fields except event_id], blah99
From LoginTable

or

Select blah1, blah2, [all fields except event_id], blah99
From LoginTable
Group by blah1, blah2, [all fields except event_id], blah99

If you want an event_id, just for kicks, you could add an aggregate function:

Select blah1, blah2, max(event_id), blah99
From LoginTable
Group by blah1, blah2, [all fields except event_id], blah99|||BarrySDCA wrote:
> I have a database being populated by hits to a program on a server.
> The problem is each client connection may require a few hits in a 1-2
> second time frame. This is resulting in multiple database entries -
> all exactly the same, except the event_id field, which is
> auto-numbered.
> I need a way to query the record w/out duplicates. That is, any
> records exactly the same except event_id should only return one record.
> Is this possible??
> Thank you,
> Barry

Hopefully you found the suggestions on how to filter out the duplicates
helpful.

I just want to point out that this shows the weakness of using a
pseudokey as the primary key of a table. If they really are the same
events, they shouldn't be duplicated.

Actually, I don't think you have duplicate events, since at least some
of the connection hits must have failed or timed out. Otherwise why
were there retries? So they are not really the same. Are you maybe
missing a status or result attribute?

(but the ID field is still a poor crutch for a possibly weak design.)

just some food for thought.
Ed.|||>(but the ID field is still a poor crutch for a possibly weak design.)

hmmmm. i'd prefer to think of it as the table stores ALL hits to the
table, but this particular data consumer only wants to see ONE. The
way I think about it, your natural key would be the composite of all
fields in the table; perhaps "natural", but certainly awkward.

I like figital's solution.|||It's exactly what I'm doing....I'm recording hits to an advertising
engine, before they get to the media server. Some client players hit
more than once before they connect, causing a duplicate entry. I just
want to see it sorted out w/out the duplicates. I have the
distribution server logs for accurate playback reporting, I'm only
interested in seeing cleaned up hits.

I set up code to remove the duplicates, but it's not foolproof. I will
try the suggestions here and let the group know how it goes. Here is a
link to what I'm doing:

http://www.sundiegolive.com/advertising.htm
Notice that if two duplicates come in now and are staggered between each other, it still shows twice.
I will try and let you all know...

I would like to say that I didn't expect such great help from the
group. I've thrown a few questions out to different groups but usually
don't get so many great replies. Thank you everyone!|||Have you looked at Kx and Stonebraker's StreamBase? They are databases
designed for a high temporal flow of data. SQL is not always the
right tool, in spite of what Bill Gates tells you.

> It's exactly what I'm doing....I'm recording hits to an advertising
> engine, before they get to the media server. Some client players hit
> more than once before they connect, causing a duplicate entry. I just
> want to see it sorted out w/out the duplicates. I have the
> distribution server logs for accurate playback reporting, I'm only
> interested in seeing cleaned up hits.

If everything is the same except the timestamp, you can round that up to
the nearest second and use a SELECT DISTINCT.
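
A rough, untested sketch of that idea in MySQL syntax, with placeholder columns and the timestamp truncated to the minute (rather than the second) so that hits arriving a second or two apart collapse into one row:

SELECT DISTINCT
client_ip,
ad_id,
DATE_FORMAT(hit_time, '%Y-%m-%d %H:%i:00') AS hit_minute
FROM playback_log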

Check your SQL server's date conversion options to find out how to get
different values from the timestamp field.|||Doug wrote:
> >(but the ID field is still a poor crutch for a possibly weak design.)
> hmmmm. i'd prefer to think of it as the table stores ALL hits to the
> table, but this particular data consumer only wants to see ONE.

It isn't clear that this was ONE of several data consumers. Even so,
the ID pseudokey might not be the best choice.

> The
> way I think about it, your natural key would be the composite of all
> fields in the table; perhaps "natural", but certainly awkward.
> I like figital's solution.

It is not clear from the previous posts that the natural key would
require ALL the columns of that table. If the table is really storing
events, then the key might be as simple as a timestamp, perhaps with an
event type. It doesn't have to be ALL of the columns.

My point is that too many people immediately jump to ID fields as the PK,
many times before they even sit down to think about what entity the
table represents. If you never look for the natural key, you are sure
not to find it.

HTH
ed|||"Ed Prochak" <edprochak@.gmail.com> wrote in message
news:1141752316.933161.102260@.i39g2000cwa.googlegroups.com...
> My point is that too many people immediately jump to ID fields as the PK.

That's certainly a good point. However, I've worked on projects in which
the decision-makers wouldn't commit to _any_ combination of attributes that
would uniquely identify the entity. There were always cases where the value
in any column could be either non-unique, or else have no value specified
(i.e. NULL). Neither would they commit to any attributes that could be
reasonably stable and unchanging (though I understand that this is not
strictly necessary for a key).

So in those kinds of situations, I felt I had to create pseudokeys to have
any chance of the application working. Even if we know the best practices
for database modeling, the project on which we are working may have
constraints that don't allow us to follow those best practices.

Regards,
Bill K.|||Bill Karwin wrote:
> "Ed Prochak" <edprochak@.gmail.com> wrote in message
> news:1141752316.933161.102260@.i39g2000cwa.googlegr oups.com...
> > My point is too many people immediately jump to ID fields as the PK.
> That's certainly a good point. However, I've worked on projects in which
> the decision-makers wouldn't commit to _any_ combination of attributes that
> would uniquely identify the entity. There were always cases where the value
> in any column could be either non-unique, or else have no value specified
> (i.e. NULL). Neither would they commit to any attributes that could be
> reasonably stable and unchanging (though I understand that this is not
> strictly necessary for a key).

It's not necessarily something the end users or even managers should
make the final decision upon. On one really big project I had the good
fortune of working with a real Data Modeler. She did no programming,
only modelling. She interviewed the users to find out what data they
had, what data they wanted and how they used it. She organized it and
created the ER model. The model included enough information to design
the DB model, which is when the PKs were determined. Almost all of the
cases you mentioned were encountered, and each was handled by the DB
designers. There are times when you create a pseudokey and times when
you split the logical table: maybe to model a more generic entity you
create a pair of parent/child tables, where the parent lacks one of the
fields that would have to be NULL if only the child table existed; or
maybe the entity really was two entities that looked similar, such as
the address that represents the customer's location and the address
that represents the billing address. An application that needed both
would be poorly served by one ADDRESS table.
What happens internally to the DB does not matter as long as you can
present the right data in the right combinations.
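
As a rough sketch of that last case, with made-up table and column names:

CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(100) NOT NULL
);

CREATE TABLE customer_location_address (
  customer_id INT PRIMARY KEY REFERENCES customer (customer_id),
  street      VARCHAR(100) NOT NULL,
  city        VARCHAR(50)  NOT NULL
);

CREATE TABLE customer_billing_address (
  customer_id INT PRIMARY KEY REFERENCES customer (customer_id),
  street      VARCHAR(100) NOT NULL,
  city        VARCHAR(50)  NOT NULL
);

Neither address table ever needs a NULL, and a customer that lacks one kind of address simply has no row in that table.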

It is our job to engineer the software, not the end users and managers.

> So in those kinds of situations, I felt I had to create pseudokeys to have
> any chance of the application working. Even if we know the best practices
> for database modeling, the project on which we are working may have
> constraints that don't allow us to follow those best practices.
> Regards,
> Bill K.

Sounds like you take pseudokeys as the last resort, which IMO is
exactly what they are. That approach is best practice. Keep it up.

Ed
