Fake Data ...very real-looking fake data

Just over a week ago, I came across this posting on the 37Signals blog that discusses some of the resources they used to populate testing databases for their new product, Highrise. Since Highrise is a contact manager, they wanted contact names with details... and lots of 'em. In the comments to that post, "Jes" mentioned yet another resource -- the "Fake Name Generator" web site -- noting that it gives you full contact details for each fake identity and that you can get up to 20,000 of them for free. Hmm.

This interested me because I always like getting hold of useful data to tinker with on side projects. One of my passions in development is data visualization, or "infoporn," so the more data to look at, the better. My downloads so far include the Netflix Prize data set, the Enron internal emails released by FERC, and geo-coded zip code lists. You never know what might be useful, right?

But now you're thinking... "if those contacts are fake, then why would they be interesting?"

The reason is that the person/people behind the Fake Name Generator have gone out of their way to make the fake data look credible. For example:

  • The cities match the states.
  • The zip codes match the cities.
  • The area codes (mostly) match the zip codes (I found a Bakersfield area code with an LA zip code).
  • The names are more than just random letters and resemble names you'd find in any US-based list of contacts.

Having a set of data like this greatly improves the testing of code that works with contact details. Who among us developers hasn't created fake records for "Donald Duck", "John Smith", and "Joe Blow"?

My understanding is that the data is created from various legitimate sources, but the values across columns are randomized -- so that someone's real first name is used with someone else's last name, someone else's address, someone else's city, and so on. A few searches turn up other discussions of this data, including a set of contacts uploaded to Swivel.
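
A minimal sketch of that kind of column mixing, assuming nothing about how the Fake Name Generator really does it. The function name `mix_identities`, the field names, and the sample rows are all made up for illustration. City, state, and zip are shuffled as one group so that they still match each other afterwards.

```python
import random


def mix_identities(records, groups, seed=None):
    """Build fake records by shuffling each field group independently.

    `groups` is a list of tuples of field names. Fields in the same tuple
    stay together (e.g. city/state/zip), so they remain consistent.
    """
    rng = random.Random(seed)
    mixed = [{} for _ in records]
    for group in groups:
        # Assign every output row a different source row for this group.
        order = list(range(len(records)))
        rng.shuffle(order)
        for target, source in zip(mixed, order):
            for field in group:
                target[field] = records[source][field]
    return mixed


# Hypothetical, made-up input rows (not taken from the real service).
people = [
    {"first": "Ann", "last": "Lopez", "city": "Austin", "state": "TX", "zip": "78701"},
    {"first": "Bob", "last": "Meyer", "city": "Albany", "state": "NY", "zip": "12207"},
    {"first": "Cho", "last": "Singh", "city": "Fresno", "state": "CA", "zip": "93721"},
]
groups = [("first",), ("last",), ("city", "state", "zip")]

for row in mix_identities(people, groups, seed=1):
    print(row)
```

Every value in the output comes from some input row, but a given output row will usually combine pieces of several different people.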

The data is free for up to 20,000 fake identities, as long as you're willing to wait up to a week to download it. If you need it sooner, you can pay US$10 to expedite the process.

A few other cool things about this service:

  • You can specify which columns you'd like in your data, including credit card numbers (fake - but numerically valid), SSN/National ID numbers (also fake - but numerically valid), and gender.
  • Email addresses use domains from various temporary email services (mailinator.com, mytrashmail.com, etc). Again, they validate but aren't useful as anything other than test data.
  • You can get the data in various formats, including HTML, Excel XLS, SQL script, or delimited text files.
  • You can specify the countries and name types for your data... so if you need some data that includes Swiss addresses and Hispanic name sets, you could request it.
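
On the "numerically valid" credit card numbers: the usual check for card numbers is the Luhn checksum, so a quick way to test that claim on your own download might look like the sketch below. That the service's numbers pass Luhn is an assumption here, not something the site states, and `luhn_valid` is just an illustrative name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Spaces and dashes are ignored. Assumption: "numerically valid"
    means passing this check, which is the standard one for card numbers.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


# 4111 1111 1111 1111 is a widely used test number, not a real card.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

Passing the checksum only means the number is well-formed; it says nothing about whether a card with that number exists.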

I also found the data to be reasonably well distributed, at least in the US-centric set of data I received. For example, across 20,000 contacts, I found:

  • The bulk of addresses were in California, Texas, and New York. The fewest were in Wyoming, Delaware, and New Hampshire. I had one record whose state was 'NN' -- ??
  • Most surnames started with the letters M, S, and B. The letters with fewest surnames were X, Q, and U.
  • The zip code with the largest set of contacts was 90017 (Los Angeles), but the area code with the most contacts was 703 (Virginia). As I dug in further, this seemed logical, because the LA area is split across numerous area codes.
  • The first digits of the social security numbers were fairly evenly distributed from 0 to 6 (2,500-3,500 each), with just 700 beginning with 7 and none beginning with 8 or 9. I learned from this CodeProject article that SSNs beginning with 9 are reserved for special government use (Witness Protection, I'm sure... hah!), but I'm not sure why there were none starting with an 8.
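
Tallies like the ones above take only a few lines to produce from a delimited text download. This is a rough sketch using a tiny made-up sample; the column names (`Surname`, `State`, `ZipCode`, `Phone`, `SSN`) are assumptions and may not match the headers in a real export.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; a real file would be opened with open(path, newline="").
SAMPLE = """Surname,State,ZipCode,Phone,SSN
Miller,CA,90017,213-555-0101,123-45-6789
Sanchez,CA,90017,213-555-0102,234-56-7890
Brown,TX,75201,214-555-0103,345-67-8901
Morris,NY,10001,212-555-0104,612-34-5678
Quinn,VA,22030,703-555-0105,701-23-4567
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def tally(rows, key):
    """Count rows by whatever value `key(row)` extracts."""
    return Counter(key(row) for row in rows)


rows = load_rows(SAMPLE)
by_state = tally(rows, lambda r: r["State"])
by_initial = tally(rows, lambda r: r["Surname"][0].upper())
by_area_code = tally(rows, lambda r: r["Phone"].split("-")[0])
by_ssn_digit = tally(rows, lambda r: r["SSN"][0])

print(by_state.most_common(3))
print(by_initial.most_common(3))
print(by_area_code.most_common(1))
print(sorted(by_ssn_digit.items()))
```

A `Counter` returns 0 for keys it has never seen, so checking for missing leading digits (like 8 or 9) needs no special handling.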

Anyway, I've been impressed. It's an interesting service, and the site seems worth bookmarking/tagging for later... you never know when you'll need a bunch of bogus (but real-looking!) data.

Note: I've got no affiliation with this site whatsoever, aside from requesting a set of 20K fake identities and getting an email with download details a week later.