Implementing a dynamically sized array in C

klodolph · on Dec 30, 2012

Nitpicks:

* Typo in the header file (SIZE versus CAPACITY).

* No error checking on malloc() or realloc() return.

* Out of bounds access is a programmer error, so you should call abort() and not exit(). Alternatively, assert() is fine too.

* Potential integer overflow for capacity field (size field can't overflow because capacity overflows first). On a 64-bit non-ILP64 system, resizing to a capacity of 2^31 elements will request a buffer of 0xfffffffe00000000 bytes, assuming sizeof(int) is 4.

* A 32-bit version could theoretically get an overflow, but only if a 2^31 byte allocation is successful... the next larger allocation will be for 0 bytes.

Edit: As someone pointed out in the comments, arr[271] is the 272nd element, whereas the article claims it is the 270th element.

Further notes:

If you end up implementing a lot of these, invest in a consistent naming scheme for the fields. For example, map, mapsz, mapalloc; then data, datasz, dataalloc.

You can also try writing generics in C but I strongly recommend just switching to C++ (or some other language) if the boilerplate code bothers you. I believe that std::vector was the biggest reason for GCC's switch to C++ internally.

alinajaf · on Dec 30, 2012

Article author here, thanks for going to the trouble to highlight these. Really appreciate the feedback and will edit.

fruchtose · on Dec 30, 2012

I've read that malloc(3) will not always return errors when you expect them to [1]. Any comment on this?

[1] http://news.ycombinator.com/item?id=4735589

klodolph · on Dec 30, 2012

I expect it to fail when it runs out of address space, or when address space fragmentation prevents it from fulfilling the request. Isn't that exactly when it fails?

Someone · on Dec 30, 2012

It is unlikely the world will change, so adjust your expectations.

Also keep checking for error returns according to the API, though, as that is a 'not always'.

RegEx · on Dec 30, 2012

Fun read :)

I'm not sure if this is correct:

> By elements we mean the size of the type of the array. If an array holds elements of type int which each occupy four bytes, then integer_array[5000] is the same as

    *(integer_array + (5000 * 4)).

And that directly contradicts the earlier statement

>with subscripts e.g. my_array[271] would give us the 270th element. This by the way, would do exactly the same thing as

    *(my_array + 271).

(which also may be incorrect: my_array[271] would give you the 272nd element if you count my_array[0] as the "first" and 271st element if you count my_array[0] as the "0th".)

I dug up the relevant part in the C standard just to be sure:

> The definition of the subscript operator [] is that E1[E2] is identical to

    (*((E1)+(E2))).

greenyoda · on Dec 30, 2012

You're right, the article is wrong. If integer_array is an array of ints or a pointer to an int, then integer_array[5000] is equivalent to

  *(integer_array + 5000), not

  *(integer_array + (5000 * 4)),

since pointer arithmetic in C automatically adjusts for the size of an array element.

alinajaf · on Dec 30, 2012

Thanks for pointing this out, will fix!

skrebbel · on Dec 30, 2012

if you don't have a not-invented-here symptom, you could also consider re-using the excellent stb stretchy buffers by the excellent Sean Barrett:

http://nothings.org/stb/stretchy_buffer.txt

apaprocki · on Dec 30, 2012

Or simply use GLib's GArray:

http://git.gnome.org/browse/glib/tree/glib/garray.c

lazyjones · on Dec 30, 2012

Indeed, this has been done many times before and more carefully. I don't know why this was newsworthy, we used to implement dynamically-sized arrays for our programs in C 20 years ago ... :-/

jlgreco · on Dec 30, 2012

Once for a project in Uni I ripped some datastructs out of Git's sourcecode. I don't recommend doing that normally of course, but it is a great source of high quality C code.

Firstmate · on Dec 30, 2012

Hi, college student here...what's that do?

voidlogic · on Dec 30, 2012

This is like 100 level C-S stuff, why is this on hacker news?

mitchi · on Dec 30, 2012

I agree but we get way worse don't we? example : http://news.ycombinator.com/item?id=4745713 And the comments so far are interesting.

niggler · on Dec 30, 2012

My guess: most people on HN don't have basic C competency.

voidlogic · on Dec 30, 2012

That is pretty scary (and possibly true). If a hacker does not have basic C competency, they probably do not understand how their code, C or otherwise is executed...

I'm all for C-S programs having a focus on teaching how to learn, frameworks of understanding and theory, but there need to be a few practical lessons along the way, and C competency is one of those.

Maybe things have changed greatly since my undergrad but, the capstone of my undergrad C-S program was writing a C(lite) compiler for MIPS in C. This is in a program where for 60% of the classes the language in use was Java. I think this was a fantastic capstone choice.

Students were expected to have deep understanding of C and Java and a working knowledge of Scheme, Haskell, Prolog with HTML/CSS/Javascript never taught due to the expectation you teach them to yourself on the way like everyone else. Depending your electives you might have also picked up C#, Python and Objective-C.

10098 · on Dec 30, 2012

Because it's so much all about dynamically-typed high-level languages today that raw pointers and manual memory management seem old-skool and cool!

gsg · on Dec 30, 2012

Allocating memory on initialisation is a bit questionable. It's better to alloc no memory and let the growth logic handle that case (either by checking for zero or by including an offset, such as vec->capacity += vec->capacity + 2).

The reason is code like this:

    vector vec;
    vector_init(&vec);
    for (i = 0; i < n; i++)
	   vector_append(vec, some_operation(n));
    whole_vector_operation(&vec);
    vector_free(&vec);

It's both natural and desirable for this code to do no allocation when n = 0.

Since this is C, you might also include a macro to allow for static initialisation: #define VECTOR_INIT { 0, NULL, NULL } or something.

shadowmint · on Dec 30, 2012

Weird implementation. If you want an magical 'append' function, use a stack, or a linked list.

There's no reason you'd want to 'append' to an array.

(Imagine the situation you remove an item in the middle; what is append supposed to do now? What is remove supposed to do? The append function in this example is basically just 'make_me_bigger')

Still, if you did want to, I'd be doing this to reduce verbosity:

    #define _V(a) ((int *) &a)
    struct Vec { 
      int *data;
      int size;
      int offset;
    }

    struct v;
    _V(v)[5] = 10;
    int value = _V(v)[5];

You only need two special calls; append() (aka. make this array longer) and free().

klodolph · on Dec 30, 2012

The _V() macro horrifies me. It's not remotely safe, nor is it even slightly correct. From least important to most important,

1. The _V() macro is in the compiler namespace, which is a no-no. You can't legally use identifiers that start with underscores which are followed by upper-case ASCII characters.

2. The _V() macro performs no type checking. It takes any addressable object as an argument. You can't do _V(3), but you can do _V(printf).

3. The _V() macro scribbles on memory.

    #define V(a) ((int *) &a)
    struct Vec { int *data; int size; int offset };

    struct v v; // don't forget the second v
    _V(v)[5] = 10; // this scribbles on the stack

What the macro does is take the address of the Vec structure (or anything else addressable, since it does no type checking) and pretends that that address is the address of an integer array.

The stack:

    |            |
    +------------+
    | int *data; | <---- &v, (int *) &v
    +------------+
    | int size   |
    +------------+
    | int offset |
    +------------+
    | <stack>    |
    +------------+
    | <stack>    |
    +------------+
    | <stack>    | <---- &_V(v)[5] scribbles here
    +------------+
    |            |

On a 32-bit system, _V(v)[5] is three words after the Vec structure, somewhere on your stack.

shadowmint · on Dec 30, 2012

My bad, you are correct, that's not quite right. You'd do it like this:

  #include <stdio.h>
  #include <stdlib.h>

  #define _V(a) (*((int **) &a))
  struct Vec {
    int *data;
    int size;
  };

  int main (int argc, char** argv) {
    struct Vec v;
    v.data = (int *) malloc(sizeof(int) * 10);
    v.size = 10;

    printf("%lx <-- actual\n", (long) v.data);
    printf("%lx <-- cast\n", (long) _V(v));

    for (int i = 0; i < 10; ++i) {
      _V(v)[i] = i;
    }
    for (int i = 0; i < 10; ++i) {
      printf("%d was: %d\n", i, _V(v)[i]);
    }

    free(v.data);
    return 0;
  }

...

    roar:~$ valgrind ./a.out
    ==30957== Memcheck, a memory error detector
    ==30957== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
    ==30957== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
    ==30957== Command: ./a.out
    ==30957==
    51f1040 <-- actual
    51f1040 <-- cast
    0 was: 0
    1 was: 1
    2 was: 2
    3 was: 3
    4 was: 4
    5 was: 5
    6 was: 6
    7 was: 7
    8 was: 8
    9 was: 9
    ==30957==
    ==30957== HEAP SUMMARY:
    ==30957==     in use at exit: 0 bytes in 0 blocks
    ==30957==   total heap usage: 1 allocs, 1 frees, 40 bytes allocated
    ==30957==
    ==30957== All heap blocks were freed -- no leaks are possible
    ==30957==
    ==30957== For counts of detected and suppressed errors, rerun with: -v
    ==30957== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

klodolph · on Dec 30, 2012

So, what's the point? Why not just use...

    v->data[i] = x;

This is, drum roll... type checked by the compiler, unlike _V().

P.S. Stop casting the result of malloc(), it makes you look like you're trying to write C++. There's no good reason for it in C, and there's a good reason not to do it (because it can mask an error if you forget to include <stdio.h>).

For this reason, and because it looks like you want generics, it sounds like you want to write C++. Why not write C++? Then you can just overload operator[] and go on your merry little way, without having to circumvent the type checker with macros.

shadowmint · on Jan 1, 2013

Who said anything about generics?

You could just #define v(a) (a->data) and be done with it, it's true.

zeteo · on Dec 30, 2012

> // set a value at an arbitrary index

> // this will expand and zero-fill the vector to fit

> vector_set(&vector, 4452, 21312984);

Love to see a C article where we get some insight on how dynamic arrays might work behind the scenes. But sparse arrays are best implemented as linked lists or hash tables.

80x86 · on Dec 30, 2012

There are much better libs out there that do this.

I always use the utarray and uthash stuff when I can.

tsahyt · on Dec 30, 2012

I believe this article was mainly for educational purposes though.

halayli · on Dec 30, 2012

This C code is bad for the following reasons:

1. No alloc checks 2. This module is limited to storing integers. 3. The type Vector is exposed to the user.

An alternative approach for contiguous dynamically sized array is to use mmap/madvise.

jws · on Dec 30, 2012

Strangely, alloc checks no longer work in Linux and possibly other OSes. Because of overcommit you can have malloc give you an non-NULL result that will fail when you try to write to it.

I still include them in my code, but I don't expect they will save me.

loeg · on Dec 30, 2012

However, it is perfectly possible in Linux to turn overcommit off. So you still need to check (or segfault…).

halayli · on Dec 30, 2012

malloc failure isn't just bound to running out of physical memory. It can fail because of fragmentation, with or without overcommit.

jws · on Dec 30, 2012

Sure, but I still have to design for random segfaults on memory writes. I guess I could write to all newly allocated pages and get a little more assurance that the allocation worked. I could still be whacked by a deduplicator, though unique writes would help.

halayli · on Dec 30, 2012

I think when you run out of physical memory & swap, there's a bigger issue at that point. Or that's what the designers of overcommit decided.

askimto · on Dec 30, 2012

OOM killer sends a SIGKILL. Writing to all pages will work for ensuring you actually got the memory. What a pain though.

alinajaf · on Dec 30, 2012

Thanks for the feedback,

> 1. No alloc checks

I'm going to modify the post to do checks on alloc based on this and other comments, thanks!

> 2. This module is limited to storing integers.

Since this is for the most part for learning purposes, I'm not too worried about this. In the article I mention that you could maintain a list of void pointers and then cast to whatever you like.

Is that the sort of thing you had in mind or could there be another way to hold data of multiple types in vanilla C?

> 3. The type Vector is exposed to the user

Interesting, I didn't realize you could build this library without exposing that type. If you have a moment could you elaborate on how you might achieve this?

Thanks again!

10098 · on Dec 30, 2012

> Interesting, I didn't realize you could build this library without exposing that type. If you have a moment could you elaborate on how you might achieve this?

Very simple, move the struct to the .c file. Add a forward declaration of that struct to the header file instead.

robomartin · on Dec 30, 2012

It's really unfortunate to see so many pedantic posts on this thread.

The article author explored something that probably isn't obvious to most programmers who didn't come up from low level work. And I am sure this isn't on the radar for your average PHP, JS programmer. And that's fine.

No, the code isn't production ready. I didn't get the sense that this was the article's intent. The author actually did a very nice job of explaining one approach. Good article and very accessible . Ease up. Help him make it better without being so condescending.

edanm · on Dec 30, 2012

This is a great article for anyone wanting a "behind-the-scenes-look" at how things work.

It's also great in reminding me: the fact that we've moved to automatical memory managing languages is a HUGE LEAP FORWARD in computing, possibly the biggest leap forward in the last few decades. The amount of code you don't have to deal with, the amount of bugs you're not making on a daily basis, is astounding.

Thank God we're in the age of modern languages. Makes me wonder what the next great leap forward in computing will be.

jxi · on Dec 30, 2012

I wouldn't call automatic memory management a "huge leap forward" at all. Sure it makes code a bit more concise and eliminates leaks (more or less), but it makes your code less performant and just shifts the problem away from the programmer to the VM instead. It's a leap into lazyness, if anything. It also creates problems since now people think it's OK to "new" objects within a for loop freely

You can probably tell by now that my goto language for any kind of large scale application is still C++ w/ STL for safety :). For small hacks, obviously I'd use python, js or ruby or something like that any day.

thangalin · on Dec 30, 2012

On a slightly related note, a just-in-time compiler with a superb garbage collector can be more performant in some situations due to its ability to profile and optimize the code while the application is running.

Such runtime statistical analysis and real-time optimization is not possible in an application that is compiled to native code.

For $189, I can slap a 240GB SSD in the machine to make software more performant. The question you have to ask is: How much optimization can a senior-level programmer accomplish for $189?

JDeveloper (an IDE) takes about 30 seconds to start on my co-workers' computers. On my laptop, boasting a 240GB SSD, JDeveloper is ready in about 5 seconds.

See also: http://stackoverflow.com/a/145122/59087

And see also: http://www.ibm.com/developerworks/java/library/j-jtp09275/

burstmode · on Dec 30, 2012

>On a slightly related note, a just-in-time compiler with a superb garbage >collector can be more performant in some situations due to its ability to >profile and optimize the code while the application is running.

No Garbage collector can be faster then a programm that produces no garbage at all, because its programmer knew what he was doing.

>For $189, I can slap a 240GB SSD in the machine to make software more >performant. The question you have to ask is: How much optimization can >a senior-level programmer accomplish for $189?

This ist the typical academic way of optimization: Let's write some programm and then throw hardware on it until it runs fast enough. In the real world, ressources are limited. That $189 are needed for each install of your garbage-collected software, not just YOUR personal develoment machine.

thangalin · on Jan 7, 2013

http://www.crucial.com/store/partspecs.aspx?imodule=CT032V4S...

SSDs are now $51. I don't see this as an academic means of optimization; I see it as a practical business value. Why would I pay a developer to spend 8 hours on optimizations when I can pay a developer to spend 8 hours on features and bug fixes? Meanwhile, I can drop $50 today (as opposed to $200 8 days ago) and get SSDs for the database, web server, and two developer machines?

If SSDs were down to $1 for 64 GB, would you still claim throwing hardware at the problem is an academic solution?

thangalin · on Jan 9, 2013

http://www.stefankrause.net/wp/?p=4

"Saying that C is generally several times faster than java is – according to those benchmarks – simply wrong. In other benchmarks Sun and JRockit were even able to beat ICC. Not by much but I guess it’s nevertheless just very remarkable that it’s possible to beat ICC [using a Java Virtual Machine]."

capkutay · on Dec 30, 2012

Seems relatively similar to implementing a dynamically resizable in Java (assuming you're not allowed to use an array list). The only difference is the malloc business, which isn't necessarily safe given that there's no check to see if the pointer is null after the mem allocation. It seems like the only critical piece is checking the array size, if it's too small, double the size of the array.

zamio · on Dec 30, 2012

Here is a nice implementation of dynamic generic arrays in C:

https://github.com/twitter/twemproxy/blob/master/src/nc_arra...

AYBABTME · on Dec 30, 2012

I think it would be wise to add a `shrink` operation when the array ends up being 1/4 full (shrink to 1/2 size). Otherwise you might end up hoarding memory you don't need anymore. The operation would be called automatically whenever you delete an item from the array, checking if a `shrink` is appropriate.

joe_the_user · on Dec 30, 2012

This approach pretty much depends on the underlying OS to do the reallocation of memory.

It is usually a good idea to do this. At the same time, if performance is an issue, you could look at what other allocation algorithms are available.

DrJokepu · on Dec 30, 2012

There's really no need for the pointer inside the struct. You could just use a flexible array member (introduced in C99) or a "struct hack" if you're still using C89.

to3m · on Dec 30, 2012

If you do this, the Vector could likely move when resized.

DrJokepu · on Dec 30, 2012

True, the vector_append function would have to return a pointer to the new location; this might or might not be a problem, depending on how your program is structured.

icebraining · on Dec 30, 2012

Mooby, you're hellbanned and showing as [dead].

jmspring · on Dec 30, 2012

Memory pools, dynamic arrays, custom vector implementations, etc. Pretty sure most have been done to death. Find the one that meets your usage and move along.

era86 · on Dec 30, 2012

reminds me of my college days!

tuananh · on Dec 30, 2012

any idea what causes this? the code run fines on Windows but when i tried on os x (with sublime text), i got this error instead.

    symbol(s) not found for architecture x86_64