Sort FAQ (8 July 2010, non-trivial updates in GREEN)

Common
Input Data
Sorting Process
Output Data
GraySort
PennySort
JouleSort
Indy vs. Daytona
Measuring the Sort
Report Submission
Medals

Common

Are there any requirements that are considered 'common knowledge' that may surprise novices ?

There is a "reasonable OS" requirement, since we are trying to test the IO subsystem. If you think the OS should do it, then it should. We are not OS guys, we are OS users. So, it should do what you would expect the OS to do if you were a DB guy or a graphics guy or a word processing guy.

Can we write our own file system? Can we define partitions, striping, etc. as we see fit?

This is a test of the OS file and IO system and the underlying hardware. If the OS or its extensions support partitioning among disks or anything else, cool!!!

Can we fix bugs in the OS or do we have to use available patches or service packs?

Yes, you can fix bugs, just make sure that the IO library still works as advertised. You are welcome to use patches as well

Can we use assembler?

Yes, but it is a bit surprising if you need to. This is a test of the OS IO system not of the compilers.

Can we use raw devices for temporary space?

No. This is a test of the file system, OS and hardware.

Input Data

How can I generate the sort input data?

Every entrant should use the same data generator gensort, or their own data generator based on the gensort code. This insures that all entrants use the same input records.
Usage: gensort [-a] #Records FileName
(Each record is 100 bytes long.)
PennySort and 10GB to 1TB JouleSort entrants are required to generate ascii records by specifying the "-a" argument to gensort. Ascii records have 95 possible values for each of the 10 key bytes - allowing for 95¹⁰ (59,873,693,923,837,890,625) possible key values.
GraySort, MinuteSort, and 100TB JouleSort applicants must use the default binary records. Binary records have 256 possible values for each key byte - allowing for 2⁸⁰ (1,208,925,819,614,629,174,706,176) possible key values.
Use the "-c" flag to have gensort calculate the checksum of the records it generates. You will need to insure that the checksum of the sort input matches the checksum of the sort output.

Can multiple input files be used?

For PennySort and 10GB to 1TB JouleSort categories, only one input file can be used.
For GraySort, MinuteSort, and 100TB JouleSort categories multiple input files can be used. A concatenation of the input files must be the same as the single file that would be generated by gensort with same number of records. Input files that meet this requirement are simple to generate with gensort. See the gensort documentation for more information.

Can the data be compressed?

No. The data cannot be compressed at all; neither the input, temp, nor output files. Also, the data cannot be compressed when sent over the network. To keep in spirit of the original benchmark rules, we want 100 MB of data to generate 100 MB of IO (whether disk or network).

Can the input file be read from the file system's buffer cache in main memory?

We want the entire input file to be read from disk. If you are not using direct (unbuffered) IO, before you run the sort program you must first insure that the input file is not in the file system buffer cache (e.g. by reading a large, unrelated file as large as your system's main memory).

Can we control the placement of the file on the drive surface?

Yes, you can control placement (e.g. outer bands on the disk). You can place the file any way you like. I particular, you will probably want to defragment the disk and also place the file on the outer bands on the disk.

Sorting Process

Since the last 80 bytes of each 100-byte record generated by gensort can be derived from the first 20 bytes, can we sort the first 20 bytes and compute the rest to form the result file?

No, you must move all the bytes, since this is an IO benchmark. You must sort all of the data -- that's what the contest is all about. Internally, in RAM memory, you are allowed to use pointers to the data and then write out the data in order of the pointers.

In what order should the sorted file be?

For binary records (GraySort or MinuteSort), the 10-byte keys should be ordered as arrays of unsigned bytes. The memcmp() library routine can be used for this purpose.
For ascii records, there are three popular orders:

The ASCII binary order strncmp( ) or memcmp( ) : !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ [To get this order, simply use strcmp( ) or memcmp( ).]
The case-sensitive order used by NTSort (the SORT command in Windows NT/2000/XP): '- !"#$%&()*,./:;?@[\]^_`{|}~+<=>0123456789AabBcCdDEeFfGghHIijJkKlLmMnNoOPpQqrRsStTuUvVwWxXYyZz
The strnicmp( ) order (NTSort default locale with "C", by using "sort /l "C" source /o dest"): !"#$%&'()*+,-./0123456789:;<=>?@[\]^_`AaBbCcdDeEFfGgHhiIjJKkLlMmnNoOpPqQrRSstTUuVvwWXxYyZz{|}~ [To get this order, use stricmp( ) or strnicmp( ).]

And then there are the various Unicode and code page orders. For the competition you can use any order you want in particular the binary order (a or c) is typically fastest. In the standard C library this is strncmp( ) or strnicmp( ). The validator uses memcmp for binary records and strncmp() for ascii.

Output Data

What is the required output format?

For PennySort and the 10GB to 1TB JouleSort categories, the output must be a single file. Output file is a sorted permutation of the input file (same checksum). It should be a sequential file, as seen by fopen( ) and a linear seek or whatever other file manipulation verbs are native to your operating system.
For GraySort, MinuteSort, and 100TB JouleSort categories multiple output files can be created. In this case a concatenation of the output files must be the same as one ordered output file.

Can the output file already exist when the sort program is started?

No, it cannot. The output file must be created by the sort program.

Can the output file be written to the file system's buffer cache?

Yes, this is what happens if you are using normal (non-direct or unbuffered) writes. In this case your sort program should flush the output file to disk before exiting. The entire output file must be written to disk during the elapsed time measured for the sort.

Should the output file be on sequential sectors on the surface of a disk?

It needn't be. What we need is a sequential file that can be opened by the file system.

How can I deal with the temporary files? Must I delete them? What about the format on striped disk systems?

File space must be zeroed or the OS must have some other strategy to prevent unprivileged users from seeing your files after you delete them. This is a standard OS security issue. There are ways to avoid zeroing. Indeed this was a big performance improvement in some operating systems.

Can we leave the sorted file in two disks? I mean, each disk will have only 1/2 of the result, and only if I simply concatenate them together, it will be the final result.

You can have partitioned output files for the GraySort, MinuteSort, and 100TB JouleSort categories.
For PennySort and 10GB to 1TB JouleSort, input and output has to look like a single file (this is a file system test). So if you want to distribute the data file on different disks, you need use disk striping or RAID.

Can the sort output destroy (reuse) the sort input file?

For the Daytona sorts, destroying or overwriting the input file is never allowed.
For the Indy sorts, the input file can be overwritten or destroyed if there is not sufficient disk space to hold both the input and output files. However the sort must always create a new output file and allocate space for it. It can delete the input file after reading it and reclaim that space. That is at the end of phase 1 (when the runs have been generated) the program could delete the input file and then create the output file and merge the runs into that new file.

How can we verify the correctness of the output file(s)?

The program valsort, included with the gensort tar file, validates the order of the records in the output file using memcmp ( ). If you sorted the records using a different sorting order, you should modify valsort to use the appropriate key comparison.
The validation program also computes a checksum of the records and counts the number of duplicate keys. You should verify that the checksum of the output file(s) matches the checksum of the input file(s). You should report the checksum and number of duplicate keys as part of your contest entry.

GraySort

What is the GraySort Benchmark?

The goal of the benchmark is to measure the performance of very large sorts. It is named after Jim Gray, the author of the original sort benchmark and the sponsor of them until 2007.

Why is the metric TBs sorted per minute?

Back in 1994 when the MinuteSort benchmark was first conceived, Jim's goal for the benchmark was to see a terabyte sorted in a minute. The metric is a tribute to that vision.

Why is there a minimum size requirement of 100 TB? Why not just fix the size at 100 TB and measure the elapsed time?

We want to make sure the sort data set does not fit in main memory, so we require a minimum input size. Entries with larger input sizes are welcome. We plan to increase the minimum size requirement over time as system capacities increase. Using a speed metric of TBs sorted per minute will allow us to track the progress of benchmark winners over time. Even as the minimum changes, the name of this benchmark will remain GraySort.

PennySort

Can we over-clock the CPU?

You should just use the system as priced. If you need a faster cpu, buy it. This is a software test, not a hardware test. Money is the constraint. Over-clocking warps the money constraint. So, please just use the off-the-shelf devices.

How can we calculate the PennySort price and performance?

Assuming your computer has a lifetime of three years, which is T = 94,608,000 seconds, divide T by your system cost (in pennies) to get a time budget for one penny. This is the capital cost of that amount of cpu time. It does not include electricity, or people or rent on the space or air conditioning and the many other costs. So it is optimistic.
Sort the maximum amount of data in your time budget.

How is system cost measured?

PennySort has a "shopping" component that makes it very hard to compare results. We are trying to capture price-performance without turning this into a shopping expedition. One can buy parts from any sources, but all the parts should be listed on only one of the famous websites (such as compusa.com, frys.com, newegg.com, spartantech.com or techdepot.com).
Pricing should be based on a single website all prices in dollars.
Tax and shipping are not included but each entrant should add an assembly fee ($35) for each box.
No discounts of any kind are allowed, including "instant" discounts.
No refurbished or returned (open box) equipment is allowed.
Use only the standard price of new equipment.
The price of the monitor, keyboard, and mouse are excluded from the cost but the system should have memory and persistent storage, processor, networking (100Mbps or better), and cabinet (fans, power, ).

Is software cost included in the system cost?

We don't include the development costs of the competition software, so we have decided not to price the sort software or the OS software as part of the system price. That makes the system price just the hardware price.

JouleSort

Which meter should I use to measure power ?

There are a number of meters that can measure power of all kinds of systems. The SPEC committee lists some here. In the original JouleSort paper, we used the BrandElectronics Model 20-1850/CI. We have two requirements: (1) The meter should be fairly accurate, within +/- 1%. (2) One should be able to synchronize the power meter with the test run through a computer interface, e.g. serial port, USB, ethernet, etc.

How should I take and report the energy measurement ?

You need to synchronize the timing measurements with the power measurements, as mentioned above. Power should be sampled at least once a second; the samples should be equally spaced. Timing must be done externally as specified below. Energy is the product of average power (during the run) and time. Take five consecutive runs, and report the average energy and its standard deviation.

What are the rules for sorting ?

The same as for all the other sorts. The only difference is that we have 3 different categories with fixed input sizes: 10⁸, 10⁹, 10¹⁰, and 10¹² recs. A system can be considered for multiple classes.

What are the guidlines for the 100TB (10¹² recs) JouleSort category?

Many have pointed out that the largest and fastest sorts in the world appear to be egregious wasters. This category is intendend to promote and estimate the energy-efficiency of the largest sorting machines in the world. Since our largest performance-oriented category starts at the 100TB mark, we set the input size accordingly.

The input is 10¹² 100-byte records generated by gensort. In this case, the records must be binary.
The JouleSort paper states, "Our aim is to account for all energy consumed to power the physical system executing the sort. " Since measuring the energy consumed by a 100 or 1000 node system can be a challenge in practice, we will accept reasonable estimates based on measurements of pieces of the system.
To estimate energy use across a large distributed system, one must measure the average power of each distinct class of machine or device in the system and extrapolate total average power. The usual rules for measuring the power of individual servers and pieces of equipment apply. Total reported energy use is the product of total average power and average runtime. Runtime is the elapsed time from start until the last machine is finished sorting.

For example, consider a MapReduce-like system with 100 "worker" nodes that perform roughly the same map and reduce tasks, a single node that controls and monitors tasks, multiple rack switches, and a single core switch. For this system, one can measure the average power of a worker node, the control node, a rack switch, and the core switch. These distinct readings can be taken across several runs if needed. To obtain a better estimate for worker nodes, one should randomly vary the node measured across runs. The average power is the sum of the average power for the distinct classes of machines and devices scaled by their cardinality.

Indy vs. Daytona

What is an Indy sort?

Much like the Indy classification of racing cars, an Indy sort can be custom made for the sort benchmark. It can assume and take advantage of

the fixed key and record sizes
the random, dense and uniformly distributed input keys - both for sorting records and for partitioning records into multiple output files

What is a Daytona sort?

Daytona sorts should

be capable of sorting other record and key types besides 100-byte records with 10-byte random keys
not significantly degrade in performance when sorting other key and record types
not be overly dependent on the uniform and random distribution of key values in the sort input. If the sort data is to be divided into multiple partitions (for instance in a cluster sort), the sort should not rely on any predetermined partition boundaries. The partition boundaries must be determined either by sampling the input data or during the sort. If sampling is used, it should be done evenly across all input partitions. That is, the same number of records should be sampled from each input partition. The report should explain how the sampling was done and state its elapsed time.
Similarly, non-uniform key distributions should not significantly increase network traffic in a cluster sort.
not overwrite or destroy their input file
be able the run continuously for one hour without a system failure. This requirement is specific to GraySort and MinuteSort benchmarks.
use the same degree of replication for the input and output files. This is particularly important for the GraySort. For example, if the input resides on a clustered file system and the input files are replicated 3-fold, then one can take advantage of the replication to improve performance. But, the output files must also be replicated 3-fold.

Measuring the Sort

Can the sort program measure its own elapsed time?

No. We want to make sure all sort program setup and shutdown activity is included in the elapsed time. Use an external program or shell command such as the Unix time command.

How should we handle variances in the elapsed time from one run of the sort program to another?

You should report the median time of all your attempts with a particular input size. For MinuteSort, you should make at least 15 consecutive attempts and the median time should be 60.00 seconds or less.

How should we report the size of a sort data set?

In units of:

gigabytes (10⁹ bytes)
terabytes (10¹² bytes)

Since when is a gigabyte 10⁹ bytes?

Clearly, a gigabyte of main memory is 2³⁰ bytes. However, according to disk industry standards, a gigabyte of disk memory is 10⁹ bytes. The sort benchmarks are disk-to-disk. Also, with the 100-byte records used in the sort benchmarks, it is much easier to measure in powers of 10. We used to accept entries using either standard but it became tedious to compare different records (now did they use decimal or binary gigabytes?).
To simplify things, we use powers of 10. When the Unix ls or the Windows dir commands display file sizes in hexadecimal, we might reconsider.

Report Submission

What should be submitted by May 15th?

A report describing the work and the system. You should also tell us, but not necessarily include in the report, the checksum of the input and output file(s), and the number of duplicate keys in the output.

Whom should our results be submitted to?

Submit report to chris dot nyberg at ordinal dot com

How can we get our report approved?

Submit it as above or to any past winner for approval.

Are there any medals for the winners?

Yes. Winners get a medal for each team member. We no longer give out a trophy as it became too cumbersome. You can keep the medal forever as a lifelong souvenir.

Medals

When and where are the medals awarded?

Medals are awarded each year at ACM SIGMOD at an informal ceremony if the winners are present. Awards are presented at the conference, but are not sponsored by SIGMOD. This is informal. It is also a convenient opportunity for the winners to buy dinner for the previous winners (an informal tradition). If the winners cannot be present then the medals are mailed to them.

What else will we get from joining the competition?

You will have a chance to set a new world record, which evaluates the rapid progress has been making on hardware and software. If you win, your results will be listed on the Sort Benchmark homepage, and your report will be a good reference to the researchers who interest in this field in the world. If so, you'll be proud of your accomplishment, since it is rare in one's life to be the best-in-the-world in something, and to be recognized for it.