[SDL] [OT] Resource file
Andre de Leiradella
leiradella at bigfoot.com
Thu Feb 21 06:39:05 PST 2008
> Well, mmap is a bit faster than reading/writing a file. I'm rather fond of
> it myself. Not to mention that having the data in a binary format is very
> nice. Do remember that addresses may not stay meaningful when mmaped into
> memory. The worst case is when they do stay meaningful on your machine, and
> on some other machines, but are not meaningful on every machine. This can
> lead you to putting addresses into the file and then having to redo
> everything so that you store offsets.
I like it a lot too. Implementing decompressing readers with mmapped
files is so much easier than reading chunks of data when the input
buffer of the decompressor is empty.
I'm not storing pointers in the file, not least because the file can be
mapped to different addresses each time, even on the same machine. All I
have are offsets, and they're all in the file header; a copy-on-write mmap
helps when converting them to platform-dependent pointers. For each
offset I have 16 bytes of extra space in the file to accommodate the
>> B. Spare space where I could insert the digital signature of the file.
>> This space must be filled with zeroes while computing the hash for the
> That isn't really a big deal, you can always create a zeroed out item in any
> resource file that can be used to store the signature after it is computed.
> Even better, you can just put the signature in a wrapper so that you don't
> actually change the signature of the file by adding the signature to the
> file. A simple header consisting of the length of the signature followed by
> the signature can be prepended to the file to sign it. Or, even simpler, you
> can just append the signature to the file and never care about the format.
Yeah, you're right.
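The "just append the signature" idea can be sketched roughly as follows. This is only an illustration under assumptions: the trailer layout (signature bytes followed by a 4-byte little-endian length) and the names append_signature and read_signature are hypothetical, not part of the library discussed here, and computing the signature itself is left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed (hypothetical) layout: [payload][signature][4-byte LE signature length].
 * Uses long/ftell, so this sketch only handles files that fit in a long. */

/* Appends the signature and its length to the file. Returns 0 on success. */
int append_signature(const char *path, const unsigned char *sig, uint32_t sig_len)
{
    unsigned char len_le[4];
    FILE *f = fopen(path, "ab");
    int ok;

    if (!f) return -1;
    len_le[0] = (unsigned char)(sig_len & 0xff);
    len_le[1] = (unsigned char)((sig_len >> 8) & 0xff);
    len_le[2] = (unsigned char)((sig_len >> 16) & 0xff);
    len_le[3] = (unsigned char)((sig_len >> 24) & 0xff);
    ok = fwrite(sig, 1, sig_len, f) == sig_len && fwrite(len_le, 1, 4, f) == 4;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Reads the trailer back. On success stores the signature, its length and
 * the length of the unsigned payload that precedes it, and returns 0. */
int read_signature(const char *path, unsigned char *sig, uint32_t sig_cap,
                   uint32_t *sig_len, long *payload_len)
{
    unsigned char len_le[4];
    long total;
    int ok = 0;
    FILE *f = fopen(path, "rb");

    if (!f) return -1;
    if (fseek(f, -4L, SEEK_END) == 0 && fread(len_le, 1, 4, f) == 4) {
        total = ftell(f); /* positioned at end of file after reading the length */
        *sig_len = (uint32_t)len_le[0] | ((uint32_t)len_le[1] << 8) |
                   ((uint32_t)len_le[2] << 16) | ((uint32_t)len_le[3] << 24);
        if (total >= 4 && *sig_len <= sig_cap &&
            *sig_len <= (unsigned long)(total - 4)) {
            *payload_len = total - 4 - (long)*sig_len;
            if (fseek(f, *payload_len, SEEK_SET) == 0 &&
                fread(sig, 1, *sig_len, f) == *sig_len)
                ok = 1;
        }
    }
    fclose(f);
    return ok ? 0 : -1;
}
```

With a trailer like this the hash is computed over the first payload_len bytes only, so no zero-filled placeholder is needed inside the format.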
>> The preliminary version is working quite well. Data is accessed by name
>> (char *), with a reader that supports basic data input operations being
>> returned. Since the file is mmaped, the location of a given chunk of
>> data within the file is quickly found via a bsearch call. Data can be
>> stored without compression (good for mp3, ogg, jpeg...) or bzipped. The
>> resource file can even be part of another file, the most common use
>> being to append the resource file to an executable.
>> There is also support for transparently using a directory instead of a
>> real resource file just like zziplib, and reading entries via SDL_RWops.
>> So my questions are:
>> 1. Is there anything bad about mmapping a resource file? The file size
>> can easily be greater than 4 GiB.
> Well, by allowing your file to be so large you restrict yourself to 64 bit
> architectures. In general 32 bit machines can not address more than 4 gigs
> of process space. In reality they can rarely address more than 2 gigs of
> process space. If you don't care about 32 bit machines then there is really
> nothing wrong with what you are doing. Just remember that your addresses and
> offsets need to be 64 bits.
Sure, they are. Do you know of any gotchas when mmapping files bigger
than 2 GiB into an application's address space?
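For reference, a minimal POSIX sketch of mapping a whole file copy-on-write, with the one check that follows directly from the 32-bit point above: the file size has to fit in a size_t before it can be handed to mmap. The function name map_resource is hypothetical, and this is not the library's actual code.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the whole file copy-on-write (MAP_PRIVATE), so writes to the mapping
 * never reach the file on disk. Returns NULL on failure. */
void *map_resource(const char *path, size_t *size_out)
{
    struct stat st;
    void *base = NULL;
    int fd = open(path, O_RDONLY);

    if (fd < 0) return NULL;
    /* Refuse empty files and files too large for this address space. */
    if (fstat(fd, &st) == 0 && st.st_size > 0 &&
        (uintmax_t)st.st_size <= (uintmax_t)SIZE_MAX) {
        base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED)
            base = NULL;
        else
            *size_out = (size_t)st.st_size;
    }
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return base;
}
```

The size check only says the length is representable; mmap can still fail if no contiguous range of that size is free, so the NULL return has to be handled either way.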
>> 2. Are there other resource file formats that provide A and B above?
> Well, your "A" requirement is a requirement of the implementation of the
> access library and has absolutely nothing to do with the file format. So
> basically all and/or no file format gives you that. You could take any
> existing file format and create a library for accessing it that uses mmap.
> If you look deep down in the file code for your favorite compiler you might
> find that it already uses mmap to implement read and write in which case all
> libraries have this feature.
I partially agree. The file format is being designed so that all offsets
are stored at the header so that I don't have to walk through all the
file with a pointer to find an entry, which would cause the OS to bring
many pages to RAM. I didn't really check if other formats would behave
the same. TAR is one I know of that doesn't. Besides, the format I'm
designing will allow a simple bsearch call to find an entry instead of
comparing all entries' names. I know the speed gain is small if one is
going to read a large entry, but I like to think that many small speed
gains result in an overall speed gain.
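The header lookup described above might look something like the sketch below. The entry layout (entry_t with a fixed 32-byte name, offset and length) and the function names are assumptions made for illustration; the real header format is not shown in this message. The table must be sorted by name for bsearch to work.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header entry; all entries sit together at the start of the
 * file, so a lookup touches only the header pages. */
typedef struct {
    char     name[32]; /* NUL-terminated entry name */
    uint64_t offset;   /* offset of the data from the start of the file */
    uint64_t length;   /* length of the data in bytes */
} entry_t;

/* bsearch comparator: the key is the name, the element is a table entry. */
static int entry_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const entry_t *)elem)->name);
}

/* Returns the entry with the given name, or NULL if there is none.
 * The table must be sorted by name in strcmp order. */
const entry_t *find_entry(const entry_t *table, size_t count, const char *name)
{
    return bsearch(name, table, count, sizeof *table, entry_cmp);
}

/* Turns a stored offset into a pointer relative to the current mapping
 * base, which may differ on every run. */
const void *entry_data(const void *base, const entry_t *e)
{
    return (const unsigned char *)base + e->offset;
}
```

Storing offsets and adding the base at lookup time is what keeps the file valid wherever it ends up mapped.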
I'm aware that some compilers might use mmap behind the scenes for
regular file IO, but as I said before decompressing things with mmapped
files is a breeze.
> And your "B" requirement can be met by simply appending the signature to an
> existing file or by writing it to a different file, so pretty much all other
> file formats meet this requirement.
>> 3. Are there other important characteristics for a resource file format
>> I'm missing?
> That it is available right now and you don't have to write it from scratch?
Yeah :-) I'm trying to use readily available code for everything in my
projects. zziplib already does almost everything I need, and I could
implement the signature like you suggested. But although it can be used
to open ZIP files that are part of larger files, the documentation does
not say how this can be done, and I'm not really into studying someone
else's source code; I prefer to study the problem and do this myself.
>> 4. Do you have requirements for a new resource file?
>> The exposed interface so far is:
> Hmmm, you aren't being consistent about the use of uint32_t and int. The
> interface as written may blow up if you try to do arithmetic on offsets or
> lengths because you are mixing unsigned and signed integers for lengths and
> offsets. It depends on whether the "int" variables are 32 bits or 64 bits.
Entries are 4 GiB maximum each. But you are right, af_entry_read,
af_entry_seek and af_entry_tell should take and return uint32_t values
too. I was just closely following the SDL_RWops interface; my file
format will be used in an SDL application later on and I wanted them to
have the same interface.
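To make the signed/unsigned problem concrete, here is a small sketch of how mixing uint32_t and int can go wrong in offset arithmetic. The function names are hypothetical and are not the af_entry_* functions themselves.

```c
#include <stdint.h>

/* Buggy on platforms where int is 32 bits: position is converted to
 * unsigned, so length - position wraps around instead of going negative
 * and the test is true whenever the two values differ. */
int bytes_left_buggy(uint32_t length, int position)
{
    return length - position > 0;
}

/* One possible fix: keep both values unsigned in the interface and do the
 * subtraction in a wider signed type, so a position past the end yields a
 * negative result. */
int64_t bytes_left(uint32_t length, uint32_t position)
{
    return (int64_t)length - (int64_t)position;
}
```

For a length of 10 and a position of 20, bytes_left returns -10, while bytes_left_buggy reports that data remains when int is 32 bits wide.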
> Above you mentioned that the file size can easily be bigger than 4 gigs but
> you seem to only have 32 bits of offset in this format which restricts you
> to <= 4 gigs. You need to change this interface to use 64 bit offsets.
> Bob Pendleton
Thanks for your input.