Removing duplicate characters from two argument strings in C

3437 views c
1

I'm trying to optimize a problem that I have to make it more readable with the same speed optimization. My problem consists in this:

Allowed function: write.c, nothing else.

Write a program that takes two strings and displays, without doubles, the characters that appear in either one of the strings.

The display will be in the order characters appear in the command line, and will be followed by a \n.

As you can see, in the main it will take two of your argument strings (argv[1] and argv[2]) into our function (void remove_dup(char *str, char *str2) after is compiling it with GCC. That temporary array will hold the ASCII value of the character after a duplicate is detected. For example, str1 = "hello" and str2 = "laoblc". The expected output will result as "heloabc" using the write function.

However, GCC was complaining because I have an array subscript with my temporary character array filled in with zeroes from the index of my strings. To stop making the compiler complaint, I had to cast the string index as an int to hold the ASCII value inside my temporary array. This will be our checker, which will determine if there is a duplicate in our string depending on the value of the character. Recompiling it again, but this time using warning flags: gcc -Wextra -Werror -Wall remove_dup.c. This is the error that I get:

remove_dup:11 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

           if (temp[str[i]] == 0)
                     ^~~~~~~

remove_dup:13 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

                   temp[str[i]] = 1;
                        ^~~~~~~

remove_dup:21 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

           if (temp[str2[i]]  == 0)
                   ^~~~~~~~

remove_dup.c:23 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

                  temp[str2[i]] = 1;
                      ^~~~~~~~

Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array? This program is running as O(m * n) where m is our first string and n is our second string.

This is the code:

void    remove_dup(char *str, char *str2)
{
    int temp[10000] = {0};
    int i;

    i = 0;
    while (str[i])
    {
        if (temp[(int)str[i]] == 0)
        {
            temp[(int)str[i]] = 1;
            write(1, &str[i], 1);
        }
        i++;
    }
    i = 0;
    while (str2[i])
    {
        if (temp[(int)str2[i]]  == 0)
        {
            temp[(int)str2[i]] = 1;
            write(1, &str2[i], 1);
        }
        i++;
    }
}

int main(int argc, char *argv[])
{
    if (argc == 3)
        remove_dup(argv[1], argv[2]);
    write(1, "\n", 1);
    return (0);
}

I hope this is clear enough with the logic structure I explained. I might have grammar mistakes, so bear with me :).

answered question

AFAIK casting does not incur a runtime penalty. On a related note, temp only needs to be 256 elements long because that's how large char is.

I agree with Schwen. Your code looks good. A slightly more elegant way to remove the annoying gcc warning is to replace temp[(int)str[i]] with temp[+str[i]]. This will also avoid the explicit cast; it is best to avoid casts whenever possible.

Correction: Unless your chars are signed!

@JosephQuinsey can you explain why chars must be signed in this case?

1 Answer

6

Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array?

I don't believe casting in C has a runtime penalty. Everything in C is a number anyway. I believe it's just telling the compiler that yes, you know you're using the wrong type and believe it's ok.

Note that char can be signed. It is possible for a negative number to sneak in there.

This program is running as O(m * n) where m is our first string and n is our second string.

No, it's running as O(n). O(m*n) would be if you were iterating over one string for every character of the other.

for( int i = 0; i < strlen(str1); i++ ) {
    for( int j = 0; j < strlen(str2); j++ ) {
        ...
    }
}

But you're looping over each string one after the other in two independent loops. This is O(m + n) which is O(n).


On to improvements. First, temp only ever needs to hold the char range which is, at most, 256. Let's give it a variable name that describes what it does, chars_seen.

Finally, there's no need to store a full integer. Normally we'd use bool from stdbool.h, but we can define our own using signed char which is what stdbool.h is likely to do. We're sure to wrap it in an #ifndef bool so we use the system supplied one if available, it will know better than we do what type to use for a boolean.

#ifndef bool
  typedef signed char bool;
#endif
bool chars_seen[256] = {0};

You might be able to get a bit more performance by eliminating i and instead increment the pointer directly. Not only more performance, but this makes many string and array operations simpler.

for( ; *str != '\0'; str++ ) {
    if( !chars_seen[(size_t)*str] ) {
        chars_seen[(size_t)*str] = 1;
        write(1, str, 1);
    }
}

You might be able to shave a touch off by using post-increment, though harming readability.

    if( !chars_seen[(size_t)*str]++ ) {
        write(1, str, 1);
    }

Finally, to avoid repeating your code and to extend it to work with any number of strings, we can write a function which takes in the set of characters seen and displays one string. And we'll give the compiler the hint to inline it, though it's of questionable use.

inline void display_chars_no_dups( const char *str, bool chars_seen[]) {
    for( ; *str != '\0'; str++ ) {
        if( !chars_seen[(size_t)*str] ) {
            chars_seen[(size_t)*str] = 1;
            write(1, str, 1);
        }
    }
}

Then main allocates the array of seen characters and calls the function as many times as necessary.

int main(int argc, char *argv[]) {
    bool chars_seen[256] = {0};

    for( int i = 1; i < argc; i++ ) {
      display_chars_no_dups( argv[i], chars_seen );
    }
    write(1, "\n", 1);
}

posted this

Have an answer?

JD

Please login first before posting an answer.