# What's the most efficient way to erase duplicates and sort a vector?

### Question

I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it.

I currently have the below code, but it doesn't work.

``````vec.erase(
std::unique(vec.begin(), vec.end()),
vec.end());
std::sort(vec.begin(), vec.end());
``````

How can I correctly do this?

Additionally, is it faster to erase the duplicates first (similar to coded above) or perform the sort first? If I do perform the sort first, is it guaranteed to remain sorted after `std::unique` is executed?

Or is there another (perhaps more efficient) way to do all this?

1
251
11/23/2016 6:03:21 PM

I agree with R. Pate and Todd Gardner; a `std::set` might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique

``````sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );
``````

Convert to set (manually)

``````set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );
``````

Convert to set (using a constructor)

``````set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
``````

Here's how these perform as the number of duplicates changes: Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.

545
5/28/2017 11:38:45 AM

I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I added a new more efficient method, using an `unordered_set`.

Keep in mind that the `unordered_set` method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (The standard library provides a default hash which is simply the identity function.) Also, don't forget to sort at the end since unordered_set is, well, unordered :)

I did some digging inside the `set` and `unordered_set` implementation and discovered that the constructor actually construct a new node for every element, before checking its value to determine if it should actually be inserted (in Visual Studio implementation, at least).

Here are the 5 methods:

f1: Just using `vector`, `sort` + `unique`

``````sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );
``````

f2: Convert to `set` (using a constructor)

``````set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
``````

f3: Convert to `set` (manually)

``````set<int> s;
for (int i : vec)
s.insert(i);
vec.assign( s.begin(), s.end() );
``````

f4: Convert to `unordered_set` (using a constructor)

``````unordered_set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );
``````

f5: Convert to `unordered_set` (manually)

``````unordered_set<int> s;
for (int i : vec)
s.insert(i);
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );
``````

I did the test with a vector of 100,000,000 ints chosen randomly in ranges [1,10], [1,1000], and [1,100000]

The results (in seconds, smaller is better):

``````range         f1       f2       f3       f4      f5
[1,10]      1.6821   7.6804   2.8232   6.2634  0.7980
[1,1000]    5.0773  13.3658   8.2235   7.6884  1.9861
[1,100000]  8.7955  32.1148  26.5485  13.3278  3.9822
``````