08-24-2005
_firstname_@lr_dot_los-gatos_dot_ca.us
 
Re: interfilesystem copies: large du diffs

In article <1124874526.327566.271690@g14g2000cwa.googlegroups.com>,
orgone <orgone31@hotmail.com> wrote:
>I recently rsync'd around 2.8TB between a RHE server (jfs fs) and a
>Netapps system. Did a 'du -sk' against each to verify the transfers:
>
>2894932960 sources total, KB
>2751664496 destination total, KB
>
>That's a 140GB discrepancy. Subsequent verbose rsyncs have turned up
>nothing that was not originally transferred.
>
>I often note similar behaviour with smaller transfers between servers
>with similar OS/fs combos and have always seen it to some extent with
>transfers between systems of any type. It's just that the usual
>discrepancies in this case are magnified greatly by the sheer volume of
>data. Needless to say, 140GB going missing would be a bit of a problem
>and it's not much fun picking through 2.8TB for MIA data.
>
>Can anyone shed some light on why this happens?
>
>tia


First, this is only a 5% difference. I could easily imagine the
difference being much larger:

The du command (and the underlying st_blocks field in the result of
the stat() system call) reports the amount of space used. But:
- A filesystem uses space not only for the data component (the bytes
that are stored in the files), but also for overhead: directories,
per-file overhead like inodes and indirect blocks, and more, often
referred to as metadata. How efficiently this overhead is stored
varies considerably by filesystem. And whether this overhead is
reported as part of the answer from du also varies. In some extreme
cases (filesystems that separate their data and metadata physically)
this overhead is not reported at all. The ratio of metadata to data
varies considerably by file system type and by file/directory size,
but for many small files 5% is not out of line.
- The amount of space allocated to a file typically has some
granularity, which is often 4KB or 16KB (historically, it has ranged
from 128 bytes for the CP/M filesystem to 256KB for some filesystems
used in high-performance computing). This means the size of the
file is rounded up to this granularity, which if your files are
typically small can make a huge difference. Say your files are all
2KB, and you store them on a file system with a 512B allocation
granularity and on another one with 16KB allocation granularity,
you'll get a result from du that is different by a factor of 8 (and
by a factor of 32 if the files are 512 bytes or smaller)!
- Are any of your files sparse? I think every commercial filesystem
that's in mass-production today supports sparse files. But exactly
how they do it varies widely. What is the granularity of holes in the
file? What is the metadata overhead for holes (in extent-based
filesystems this can make a significant difference if implemented
carelessly)? Also, it is quite possible (maybe even likely) that your
rsync copying turned sparse files into fully allocated files (rsync
only tries to preserve holes when run with --sparse). Given that your
total space usage shrank instead of increased, this doesn't seem
likely to be the main effect here.
- On the netapp, did you have snapshots turned on? If yes, does the
result from du include the snapshots?
- It isn't even completely clear what the result from du is supposed
to be. The real disk usage? The size of the file rounded to
kilobytes? Here is a suggestion to stir the pot: Assume you have a
1MB file stored on a RAID-1 (mirrored) disk array. I think du
should report the space usage as 2MB, because you are actually
storing two copies of the file (you are using 2MB worth of disks).
If you now migrate the file to a compressing filesystem that is not
mirrored, du should report the space usage as 415KB, if that's how
much disk space it really uses. No filesystem today would report
those values; they would all report something pretty close to 1MB.
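
To put numbers on the allocation-granularity point above, here is a minimal sketch of the rounding. The function name allocated is made up for this example, and real filesystems add metadata on top of this, so treat it as arithmetic only:

```shell
#!/bin/sh
# allocated BYTES GRANULARITY
# Hypothetical helper: bytes occupied once a file of BYTES bytes is
# rounded up to a whole number of GRANULARITY-byte allocation units.
allocated() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# Compare a 512-byte and a 2KB file across three granularities.
for size in 512 2048; do
    for gran in 512 4096 16384; do
        echo "${size}-byte file, ${gran}-byte units: $(allocated "$size" "$gran") bytes"
    done
done
```

A 2KB file occupies 2048 bytes with 512-byte units and 16384 bytes with 16KB units (a factor of 8); a file of 512 bytes or less gives the full factor of 32.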

For you, this is my suggestion: Instead of looking only at the total,
make a complete list of the disk usage for each file. An easy way to
do this from the command line: make two listings of space usage, one
each for source and destination, merge the lists, and look at the
differences. Here is a quick attempt at a script which does
this (just typed this in, you may have to debug it a little bit, and
it assumes you don't have spaces in file names, if you do you'll have
to do a lot of quoting and null-terminating):
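cd "$SOURCE"
# Per-file usage in KB, sorted by file name (field 2) so join can merge
find . -type f | xargs du -k | LC_ALL=C sort -k 2 > /tmp/source.du
cd "$TARGET"
find . -type f | xargs du -k | LC_ALL=C sort -k 2 > /tmp/target.du
cd /tmp
# join prints: file name, source KB, target KB
LC_ALL=C join -j 2 source.du target.du > both.du
# file name and (target - source), sorted numerically by the difference
awk '{print $1, $3 - $2}' < both.du | sort -n -k 2 > diff.du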
In the end, you'll have a listing of the difference in space usage in
diff.du, sorted (I hope, I can never remember whether the -n switch to
sort works correctly for negative numbers). Then pick a few examples
of files that have large differences, or see whether you can make out
a trend (maybe most files have a small difference). Then spot-check a
few files, to make sure they were copied correctly.
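
If you do have spaces in file names, one possible variant is sketched below. It keeps the tab that du already prints between size and name and uses it as the field separator everywhere. The function names du_list and du_diff and the work-directory argument are invented for this sketch, and names containing tabs or newlines will still break it:

```shell
#!/bin/sh
# Sketch only: per-file du comparison that tolerates spaces in names.
TAB=$(printf '\t')

# du_list DIR: print "KB<TAB>name" for each regular file under DIR,
# sorted by name. find -exec ... {} + avoids xargs word splitting.
du_list() {
    ( cd "$1" && find . -type f -exec du -k {} + | LC_ALL=C sort -t "$TAB" -k 2 )
}

# du_diff SOURCE TARGET WORKDIR: print "name<TAB>target KB - source KB"
# for files present on both sides, smallest difference first.
du_diff() {
    du_list "$1" > "$3/source.du"
    du_list "$2" > "$3/target.du"
    LC_ALL=C join -t "$TAB" -1 2 -2 2 "$3/source.du" "$3/target.du" |
        awk -F "$TAB" '{ printf "%s\t%d\n", $1, $3 - $2 }' |
        sort -t "$TAB" -n -k 2
}
```

Usage would be something like du_diff "$SOURCE" "$TARGET" /tmp > /tmp/diff.du. The -1 2 -2 2 options are the POSIX spelling of -j 2.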

You can also use "join -j 2 -v 1 source.du target.du" to find files
that were not copied, and the same with "-v 2" to find files that
showed up in the copy uninvited.
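
As a small sketch of that check, assuming both listings hold "KB name" lines sorted by name and that names contain no spaces (the function names are made up here):

```shell
#!/bin/sh
# only_in_source SRC_LIST TGT_LIST: names listed only in SRC_LIST,
# i.e. files that were not copied.
only_in_source() {
    LC_ALL=C join -1 2 -2 2 -v 1 -o 1.2 "$1" "$2"
}

# only_in_target SRC_LIST TGT_LIST: names listed only in TGT_LIST,
# i.e. files that showed up in the copy uninvited.
only_in_target() {
    LC_ALL=C join -1 2 -2 2 -v 2 -o 2.2 "$1" "$2"
}
```

The -o option trims the output to just the file name, and -1 2 -2 2 is the POSIX spelling of -j 2.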

Now changing gears: Speaking as a file system implementor (and somewhat
of an expert), I would wish that the du command and the underlying
information returned by the stat() system call would go away. On one
hand, they are just too crude and don't begin to describe the
complexity of space usage in a modern (complex) filesystem. On the
other hand, they don't give the answers that a system
administrator (or an automated administration tool) really needs. As
we saw above, for a 1GB file, the correct answer for space usage might
be any of (all the numbers are made up)
- 1GB worth of bytes
- 1GB is the file size, but it is sparse, so it only uses 876MB.
- 1GB worth of bytes on the data disk, plus 7.4MB of metadata on the
metadata disk.
- 2GB worth of bytes, because of RAID 1.
- 437MB worth of bytes, because of compression.
- 0.456GB on datadisk_123, 1.234GB on datadisk_456, and 2.345GB on
datadisk_789, plus 7.4MB on metadisk_abc and 3.7MB on metadisk_def.
- 5.678GB on disk, because of RAID 1, asynchronous remote copy (still
0.3GB worth of copying to be done, currently held in NVRAM), and
fourteen snapshot copies, all slightly different, not to mention
that the remote copy is compressed, and this figure includes
metadata overhead on the metadata disks.
- 4.567GB on expensive SCSI disks (at $3/GB plus $0.50/year/GB), and
1.234GB on cheap SATA disks (at $1/GB plus $0.25/year/GB).
As you see, returning one number is woefully inadequate.
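
Even the first two answers in that list already differ for an ordinary file. A rough sketch that prints both the byte length and the allocated space (usage_report is a made-up name, and what du counts as allocated is up to the filesystem):

```shell
#!/bin/sh
# usage_report FILE: print the length of FILE in bytes next to the
# space du says it occupies, in KB.
usage_report() {
    bytes=$(wc -c < "$1" | tr -d ' ')   # what the application sees
    kb=$(du -k "$1" | cut -f 1)         # what the filesystem reports as used
    echo "bytes=$bytes allocated_kb=$kb"
}
```

Run on the same file at the source and at the destination, the first number should match and the second often will not.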

We need to ask ourselves: What is the purpose of the space usage
information? It is not to verify that the file system has correctly
stored the data (for that it is too crude); it is to enable
administering the file system, so it needs to give the information a
system administrator might care about.

If I had my way (fortunately, nobody ever listens to me), I would
remove the du command and completely remove all notions of space usage
from the user-mode application API, and put all space usage
information into a file system management interface. There, questions
like the following need to be answered:

- How much space is user fred using (or files used by the wombat
project, or files stored on storage device foobar)?
- Has fred's usage increased recently?
- How expensive is the storage used by fred? Original purchase, lease
payments, yearly provisioning and administration cost?
- Are the wombat project's requirements for data availability being
met, or could I improve them by allocating more space to it and
storing more redundant copies of their data?
- If I move the wombat project to the netapp, and then use the free
space on the cluster filesystem to put fred's files on, would that
save me money or increase speed or availability?
- Is the netapp still a cost-effective device, given that we just
started using the fancy new foobar device from Irish Baloney
Machines with the new cluster filesystem from Hockey-Puckered?

(If it isn't clear, all mentions of the word "netapp" and oblique
references to large computer companies are meant as humor, and are
intended to neither praise nor denigrate my current, former or future
employers).

--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@lr_dot_los-gatos_dot_ca.us