Hypothetical Google Voice Export Format


Google Voice was released over a year ago, and I’ve been using it as my primary phone number since switching from its predecessor, GrandCentral, which I’d used since 2006 (also as my primary phone number).[1] In the time since making the switch, I’ve seen the interface revised more than a little (the quality rating buttons for transcripts and calls are among my favorite enhancements), but despite all the new features introduced over the last 14 months, one has been distinctly lacking.

That one feature: Export.

Think about it. Google Docs lets one download all of one’s documents, spreadsheets, and presentations in a ZIP archive. Gmail allows both IMAP and POP access to mail accounts, facilitating the complete backup of account data to a personal computer or server. Google Reader offers export of one’s entire subscription list. Google Calendar offers export formats compatible with several desktop and Web competitors’ products. Google Contacts can be downloaded and imported into Microsoft Outlook, Apple Address Book, and countless others.

I could go on.[2] Just about every Google service offers some way for users to get their data out.[3] Google’s Data Liberation Front initiative is a demonstration of their commitment. So why can’t I export my Google Voice data?

The Case

When GrandCentral was shutting down, users had to download messages one at a time. There were also large holes in the data that could be recovered due to a glitch in the storage system that irrevocably lost dozens of my messages, and likely thousands more from other users. (Most annoyingly, pleas from users for the company to do something about the data loss fell on deaf ears.) The issue was never officially addressed or explained; all we former users can do is speculate as to why our messages were forever irretrievable.

Fast-forward to Google Voice. The export function is still limited to downloading individual voicemails and call recordings, one at a time, manually. There is no support for exporting the transcripts of these audio recordings. It is not possible to download SMS conversations (save for copy-and-paste to text files). Call logs can only be backed up by painstaking manual duplication into a spreadsheet or other suitable format.

Every Google Voice account is amassing even more data than GrandCentral accounts did, thanks to support for text messaging (a long-awaited improvement, even if it doesn’t support SMS-to-email or[4] shortcodes[5]). Billing logs for international long-distance are another piece of the corpus.

All of this information is potentially useful in the future. There is a reason that users have flocked to the service and its promises of one number forever, keeping messages forever, and so forth. It’s unlikely that Google itself will enter the deadpool any time soon, but services have been cut before. If Google Voice doesn’t meet all the right expectations of Google’s higher-ups, it too could get the axe. All those messages that were supposed to be kept forever? Gone.

I like to use the now-defunct Twitter-like service Pownce as an example. When the decision was made to shut down the site, a new section appeared in users’ settings. That section allowed them to request a backup of their account data for download. To this day I can still open up a backup file and peruse my activity on the site, though it is long dead.

Unfortunately, I can’t believe that Google would have its engineers develop a similar export tool for a service about to be shut down. It didn’t happen for GrandCentral, that’s for sure. Other companies (*cough* Nambu *cough*) have a similar attitude: Users don’t need copies of their data; just jerk the tablecloth off the table, and all the dishes with it. Besides, I can’t forget the seemingly arrogant launch of Google Buzz, right into my face. In short, there are precedents for Google violating DBAD.[6]

The Solution

Solving this problem is relatively simple. If a complete and total export feature is developed and released well before Google Voice is threatened with a shut-down, there won’t be any issue. The phone numbers assigned to Google Voice accounts (so-called Google numbers) can be ported out to another provider—that was a core policy right from the start—so there’s no issue of losing that; I could port my Google number to a cell phone right now. All future issues would be handled by someone else. Google’s problem is historical data and keeping their promise to users.

How should the Google Voice team go about accomplishing this feat? I’ve been mulling over different ideas for the past few months, and I think I’ve come up with a reasonable export format.

The Format

My hypothetical export file looks something like this:

Google_Voice_export_acctusername_2010-04-30T14:23:47.zip

  • /Greetings
    • /System Default.mp3
    • /Call Widget Greeting.mp3
    • /Robotics Team Greeting.mp3
    • ...
  • /Notes
    • /nt234.txt
    • /nt601.txt
    • ...
  • /Recordings
    • /cr234.mp3
    • /cr623.mp3
    • ...
  • /SMS
    • /sc601.txt
    • /sc728.txt
    • ...
  • /Transcripts
    • /ts142.txt
    • /ts234.txt
    • /ts324.txt
    • /ts623.txt
    • ...
  • /Voicemail
    • /vm142.mp3
    • /vm324.mp3
    • ...
  • /call_logs.csv
  • /recordings.csv
  • /sms.csv
  • /voicemail.csv

A few notes:
1) Files in any of the directories (except /Greetings) can be divided into date-dependent subfolders, but it’s simpler to not do so. It’s only an issue if the number of files in a directory exceeds file system limitations.
2) Obviously the IDs would be much larger in a production setting with thousands or millions of users; mine are just for illustration purposes.
3) I don’t know if Google’s database maintains separate IDs for each data type or if it keeps a single ID counter for all records, but that’s why I prefixed each file type with a letter code indicating what it is: cr = call recording, nt = note, sc = SMS conversation, ts = transcript, vm = voicemail.
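As a rough illustration of the letter-code idea in note 3, here is a minimal Python sketch that turns a file name from the archive back into its item type and ID. The function name `decode_name` and the regular expression are my own assumptions for illustration; the only things taken from the format are the five prefixes and the .mp3/.txt extensions.

```python
import re

# Letter codes from the notes above; nothing else is assumed about IDs
# except that they are decimal digits.
PREFIXES = {
    "cr": "call recording",
    "nt": "note",
    "sc": "SMS conversation",
    "ts": "transcript",
    "vm": "voicemail",
}

# Hypothetical pattern: prefix, numeric ID, then .mp3 or .txt.
NAME = re.compile(r"^(cr|nt|sc|ts|vm)(\d+)\.(?:mp3|txt)$")


def decode_name(filename):
    """Return (item type, ID) for an archive file name such as 'ts234.txt'."""
    match = NAME.match(filename)
    if match is None:
        raise ValueError(f"not an ID-based archive file: {filename!r}")
    return PREFIXES[match[1]], int(match[2])


print(decode_name("ts234.txt"))  # ('transcript', 234)
```

Greetings are deliberately rejected here, since they are the one kind of file that keeps a descriptive name instead of an ID.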

The Parts

Files

Within each CSV file are rows with the following data fields:

  • call_logs.csv: ID,Timestamp,HasNote,Number,Type,Name,StartTime,EndTime
  • recordings.csv: ID,Timestamp,HasNote,Number,Name,Duration
  • sms.csv: ID,Timestamp,HasNote,Number,Name
  • voicemail.csv: ID,Timestamp,HasNote,Number,Name,Duration

All timestamps are in UTC. It is easiest for all IDs to be unique across all item types, though again I don’t know how Google stores the data. I assume that the ID namespace is shared among all Google Voice items. I’ve used that assumption as the basis for some of my archive structure decisions; because of it, I did not need to disambiguate between notes attached to the different item types.

The exported CSV files contain required, optional, and conditional fields. Required fields must be non-empty; optional fields may be empty and are filled in if appropriate information is available; and conditional fields are required based on the value of another field.

Field notes: The HasNote and Duration fields would be useful to have, but are not required as the values they contain can be determined using other methods—respectively, by checking for the corresponding nt<ID>.txt file in /Notes and by checking the duration of the corresponding audio file in /Recordings or /Voicemail. I’ve left them in because, in the long run, having them would make it easier and more efficient to write a program to read the archive.
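To show how HasNote could be recovered without the field itself, here is a minimal Python sketch that reads one index CSV straight out of the ZIP and checks whether the matching note file exists. It rests on assumptions the format above does not settle: that each CSV starts with a header row naming the fields, that the text is UTF-8, and that ZIP member names have no leading slash. The function name `rows_with_notes` is hypothetical.

```python
import csv
import io
import zipfile


def rows_with_notes(archive, csv_name):
    """Yield (row, note_file_exists) for every row of one index CSV.

    Assumes a header row and UTF-8 text (both are assumptions, not
    part of the format as written).
    """
    names = set(archive.namelist())
    with archive.open(csv_name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        for row in csv.DictReader(text):
            # A note exists if /Notes holds nt<ID>.txt for this row.
            yield row, f"Notes/nt{row['ID']}.txt" in names
```

A reader could compare the second value against the row's HasNote column to spot an inconsistent archive. Duration is harder to derive, because it means decoding the audio, which is one more reason to keep that field in the CSV.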

call_logs.csv

Required fields: ID, a unique record identifier; Timestamp, the item timestamp used on the Google Voice website; HasNote, whether the item has an attached note (1 for yes, 0 for no); Number, the phone number of the other party; and Type, the type of call record (placed/missed/received)

Optional fields: Name, the name of the contact (if the phone number can be matched to a contact)

Conditional fields: StartTime & EndTime, start and end timestamps for calculating call duration (empty for missed calls, as there is no start or end time)

ID cross-references: note

Call log entries are cross-referenced by ID to note files in the Notes directory if HasNote is 1.
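The required/conditional split for call_logs.csv can be sketched as a small check plus a duration calculation. This is only an illustration under assumptions: rows arrive as dictionaries keyed by field name (as `csv.DictReader` would produce), timestamps use the same ISO 8601 form as the SMS examples, and the names `validate_call` and `call_duration` are made up for this sketch.

```python
from datetime import datetime, timedelta
from typing import List, Optional

REQUIRED = ("ID", "Timestamp", "HasNote", "Number", "Type")
CALL_TYPES = {"placed", "missed", "received"}


def validate_call(row: dict) -> List[str]:
    """Return a list of problems found in one call_logs.csv row."""
    problems = [f"missing {name}" for name in REQUIRED if not row.get(name)]
    call_type = row.get("Type")
    if call_type and call_type not in CALL_TYPES:
        problems.append(f"unknown Type {call_type!r}")
    # Conditional fields: required unless the call was missed.
    if call_type in CALL_TYPES and call_type != "missed":
        for name in ("StartTime", "EndTime"):
            if not row.get(name):
                problems.append(f"missing {name}")
    return problems


def call_duration(row: dict) -> Optional[timedelta]:
    """Return EndTime minus StartTime, or None for a missed call."""
    if row["Type"] == "missed":
        return None
    start = datetime.fromisoformat(row["StartTime"])
    end = datetime.fromisoformat(row["EndTime"])
    return end - start


row = {
    "ID": "234", "Timestamp": "2010-03-21T04:12:02", "HasNote": "1",
    "Number": "+15555550100", "Type": "received", "Name": "John Smith",
    "StartTime": "2010-03-21T04:12:02", "EndTime": "2010-03-21T04:15:32",
}
print(validate_call(row))  # []
print(call_duration(row))  # 0:03:30
```

Name is never checked, since it is optional and may legitimately be empty.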

recordings.csv

Required fields: ID, a unique record identifier; Timestamp, the item timestamp used on the Google Voice website; HasNote, whether the item has an attached note (1 for yes, 0 for no); and Number, the phone number of the other party

Optional fields: Name, the name of the contact (if the phone number can be matched to a contact); and Duration, the audio file duration

Conditional fields: none

ID cross-references: audio, note, transcript

Call recording records are cross-referenced by ID with audio files in the Recordings directory, note files in the Notes directory (if HasNote is 1), and transcript files in the Transcripts directory.

sms.csv

Required fields: ID, a unique record identifier; Timestamp, the item timestamp used on the Google Voice website; HasNote, whether the item has an attached note (1 for yes, 0 for no); and Number, the phone number of the other party

Optional fields: Name, the name of the contact (if the phone number can be matched to a contact)

Conditional fields: none

ID cross-references: conversation text, note

SMS records are cross-referenced by ID to text files containing the full conversation, formatted like instant messaging transcripts:

(2010-03-21T04:12:02) Me: what’s up?
(2010-03-21T04:14:53) John Smith: not much, got a test tomorrow fml
(2010-03-21T04:17:19) Me: what subject?
(2010-03-21T04:18:17) John Smith: history ugh
(2010-03-21T04:20:02) Me: ugh indeed. good luck and try not to die 😉
(2010-03-21T04:23:47) John Smith: thx. if u dont hear frm me tmrw its bcuz my brain asploded

Again, all timestamps are in UTC.
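A conversation file in this shape is easy to read back by machine. The sketch below is a minimal Python parser, assuming one message per line, no colon inside the sender's name, and timestamps exactly as in the sample above; the `Message` type and `parse_conversation` function are hypothetical names, not part of any existing tool.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class Message(NamedTuple):
    sent: datetime
    sender: str
    text: str


# "(timestamp) sender: text", mirroring the sample conversation.
LINE = re.compile(r"^\((?P<ts>[^)]+)\) (?P<sender>[^:]+): (?P<text>.*)$")


def parse_conversation(lines):
    """Parse transcript lines into Message tuples, skipping blank lines."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        # The format states that all timestamps are UTC.
        sent = datetime.fromisoformat(match["ts"]).replace(tzinfo=timezone.utc)
        messages.append(Message(sent, match["sender"], match["text"]))
    return messages


sample = [
    "(2010-03-21T04:17:19) Me: what subject?\n",
    "(2010-03-21T04:18:17) John Smith: history ugh\n",
]
for message in parse_conversation(sample):
    print(message.sender, "|", message.text)
```

A message containing a line break would not survive this one-line-per-message layout, so a real format would need an escaping rule for that case.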

SMS records are also cross-referenced by ID to note files in the Notes directory if HasNote is 1.

voicemail.csv

Required fields: ID, a unique record identifier; Timestamp, the item timestamp used on the Google Voice website; HasNote, whether the item has an attached note (1 for yes, 0 for no); and Number, the phone number of the other party

Optional fields: Name, the name of the contact (if the phone number can be matched to a contact); and Duration, the audio file duration

Conditional fields: none

ID cross-references: audio, note, transcript

Voicemails are cross-referenced by ID to audio files in the Voicemail directory, note files in the Notes directory (if HasNote is 1), and transcript files in the Transcripts directory.
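Taken together, the cross-references for the four CSV files amount to a small lookup table. Here is a minimal Python sketch of that table, assuming rows are dictionaries keyed by field name and that member paths inside the ZIP have no leading slash; `CROSS_REFS` and `related_files` are names invented for this example.

```python
# Which files each index CSV points to, keyed by cross-reference kind.
CROSS_REFS = {
    "call_logs.csv": {"note": "Notes/nt{id}.txt"},
    "recordings.csv": {
        "audio": "Recordings/cr{id}.mp3",
        "note": "Notes/nt{id}.txt",
        "transcript": "Transcripts/ts{id}.txt",
    },
    "sms.csv": {
        "conversation": "SMS/sc{id}.txt",
        "note": "Notes/nt{id}.txt",
    },
    "voicemail.csv": {
        "audio": "Voicemail/vm{id}.mp3",
        "note": "Notes/nt{id}.txt",
        "transcript": "Transcripts/ts{id}.txt",
    },
}


def related_files(csv_name, row):
    """Return {kind: path} for the files one index row should point to."""
    paths = {}
    for kind, template in CROSS_REFS[csv_name].items():
        if kind == "note" and row.get("HasNote") != "1":
            continue  # a note file exists only when HasNote is 1
        paths[kind] = template.format(id=row["ID"])
    return paths


print(related_files("voicemail.csv", {"ID": "142", "HasNote": "0"}))
# {'audio': 'Voicemail/vm142.mp3', 'transcript': 'Transcripts/ts142.txt'}
```

This only works as written because of the shared-ID assumption: a note file carries no hint of which item type it belongs to, so IDs must not collide across types.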

Folders

The folders should be pretty self-explanatory. /Greetings contains recorded greetings (the only files with “real” names, though I’m sure they too have IDs on Google’s end), /Notes contains the text of notes added with the Google Voice website’s “Add note” feature, /Recordings contains recorded calls, /SMS contains full transcripts of text-message conversations, /Transcripts contains the text transcripts of call recordings and voicemails, and /Voicemail contains voicemails.

The Greetings folder doesn’t have an associated CSV file because I think the files it contains should just be given the same name as the corresponding greeting in Google Voice’s settings. None of the other items really have names, so they can all go by ID and be indexed in CSV files; but the user is likely to name each greeting descriptively and that name shouldn’t be hidden behind an abstraction (read: obfuscation) layer in the exported backup file.

Alternate Ideas

I toyed with the idea of somehow including time zone information to help put timestamps in context, but there’s no good way of doing it. Put a time zone at the account level and you lose changes. I doubt there’s a user-preference history somewhere in Google’s database. Try to put it on each record and you have a nightmare, since most of the time there’s no indication that the user changed time zones. The user can figure out where he/she was on any given day and mentally adjust the UTC timestamps given if it’s really that important.
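For what it's worth, the "mentally adjust" step is simple to do in code once the user supplies an offset. This is a minimal Python sketch that shifts a UTC timestamp by a fixed number of hours; the function name `to_local` is hypothetical, the offset is entirely the user's own knowledge, and daylight saving time is not handled.

```python
from datetime import datetime, timedelta, timezone


def to_local(utc_stamp, offset_hours):
    """Convert an archive timestamp (UTC) to a fixed-offset local time."""
    utc = datetime.fromisoformat(utc_stamp).replace(tzinfo=timezone.utc)
    return utc.astimezone(timezone(timedelta(hours=offset_hours)))


# A user who knows they were four hours behind UTC that day:
print(to_local("2010-03-21T04:12:02", -4).isoformat())
# 2010-03-21T00:12:02-04:00
```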

Similarly, I thought about including preferences, caller groups, and so forth, but I don’t know enough about the data structure to come up with an estimated export format.

It occurred to me that the exported CSV files could also contain a ContactID field, matching up with the corresponding Google Contacts entry. That way, external applications could hook up to the GData Contacts API and pull the contact’s information to enhance the information display. For example, a third-party app could emulate the way Google Voice’s website places the contact’s photo next to each related entry. I left this out of the above spec because of the potential for inaccurate ContactID values; who knows what the user will change in her contacts between exporting the Google Voice data and trying to use a third-party app with it.

Speaking of third-party apps, that’s why I’ve tried to keep my hypothetical format so machine-readable. What if Google or another developer wrote a Web app or cross-platform application that could import the archived data and present it in a graphical interface? It’d be a great way to access archived information while offline—of course, Google could also add offline support to the site, but it’s always good to have alternatives. The possibilities are truly endless; my contact-photo example above could be done with the export format as-is, though it would take a little more API work.

The Goal

My objective is not to have Google implement my solution verbatim; I know there are glitches in my reasoning, holes in my contingencies, omissions in available fields, etc. I wrote this specification (for that’s pretty much what it turned out to be) to prove that it’s possible to come up with a reasonable way to export all the data currently trapped inside Google Voice accounts.

Like I said, I know this isn’t perfect; it’s just a starting point. If you think the way I designed some or all of this format was unreasonable, go ahead and tell me. The comment form is there for a reason: That’s where you can say, “I think you’re wrong; here’s why.”[7]

Anyway. If I can come up with a data export format that includes most of the information Google has tucked away in a database somewhere, the engineers who work on Google Voice can certainly come up with a format to include every last scrap of data. After all, I’m just an amateur. 🙂

Update (06/11): Navarr, in the comments, reminded me that I left out the billing logs, as well as the per-call cost data. Since it would just add another CSV file and a field in call_logs.csv, I’ll declare it an omission trivial enough to not bother correcting.


Notes:

  1. Unfortunately, GrandCentral was notorious among many of us users for “losing” messages from before November 2007, so I have no records of my first year-or-so using the service.
  2. Google Page Creator, though discontinued, offers ZIP downloads of the entire site and a redirection facility to keep links from breaking. (I myself have used it, along with an excellent WordPress plugin, to migrate files I uploaded to Google Pages to bits.technobabbl.es, their new home.)
  3. Blogger offers an export as well, but so far as I know it can only be imported into another Blogger site.
  4. Update (06/11): Thanks to Nathan Brauer for correcting me about SMS-to-email. Sometimes it’s dangerous to write blog posts too far in advance: things slip through during the proofing process. 😛
  5. It does support a few Google-controlled short numbers; I know of three: 46645, 466453, & 48368
  6. DBAD was an essay on the English Wikipedia, formerly known as WP:DICK or WP:DBAD; in the months since I was last really involved in the Wikimedia culture, it was apparently moved to the Meta wiki. Surprised the hell out of me.
  7. Obviously, if you’re not reading this on the site, you’ll need to take an extra step to get a comment form. 😉

dgw

I am an avid technology and software user, in addition to being reasonably well-versed in CSS, JavaScript, HTML, PHP, Python, and (though it still scares me) Perl. Aside from my technological tendencies, I am also a theatre technician, sound designer, violinist, singer, and actor.

4 Comments:

  1. *coughcough* or it could just be available via a reMAP server.. but thats another story for another day x3

    This sounds really good.. but the way you have it laid out is a major PITA for accessing information about a single message at once.

    Instead of separating everything out like that, we have something more like

    billing_logs/
    -- text file
    -- text file
    voicemails/
    --3468
    ----recording.wav
    ----transcript.txt
    --33469
    ----etc, etc.

    Recordings, Transcripts, and Notes should be stored together when they revolve around the same content, don’t’cha think?

    • Well, my reasoning takes the position that having a folder for every ID would make for large directory trees with few files in each folder. To be honest, the structure was somewhat database-inspired, though it’s definitely not a high-performance structure—imagine what all the JOINs would do to a high-traffic database server.

      Perhaps the separation by type goes back to my favorite activity in preschool: sorting beads. I like to categorize; many of my email conversations in Gmail have upwards of half a dozen labels. I figured on IDs being the thread that would tie everything together. The code I envisioned being used to access the information here would have retrieval functions for each component.

      You did remind me of something I forgot though. I completely spaced on the billing logs. “Like I said, I know this isn’t perfect…” 😉

  2. I’d like to easily export Google Voice call logs and add them, for instance, to Google Calendar.
    I’ve been using the following two projects to assist me:
    http://code.google.com/p/google-voice-java/
    http://code.google.com/p/pygooglevoice/

  3. This is much needed. All online services should be mandated to provide export functions, in my opinion.

    It’s the user’s data after all, and in many cases it’s of incredible historical and sentimental value.
