ETech 2007 Day 2 p.m. sessions
h2. Don MacAskill (SmugMug) -- Set Amazon's Servers on Fire, Not Yours
p. SmugMug: 140MM photos, no debt, profitable since its first year. 192TB stored in S3. Doubling yearly.
* S3: Simple Storage Service. $0.15/GB/month with replication. REST API. Fast: not 15K-SCSI fast, but internet fast.
h3. Why use them?
* not a lot of web scale expertise on planet earth
* reputation for systems
* [he] once competed with Amazon - Fatbrain
* They eat their own dogfood. Dozens of products.
* Focus on the app, not the muck.
h3. Show me the money!
* Guesstimate: ~$500K saved per year
* Actual:
** Growth: 64MM photos -> 140MM photos
** Disks would cost $40k -> $100k/month
** $922K would have been spent
** $230K spent instead
** $692K in cold, hard savings
* Nasty taxes (on capital goods)! $295K 'saved' in cash flow. Bonus!
* Reselling disks to recoup sunk costs
* this is a partial cost of ownership number
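p. The headline savings figure can be re-derived from the two spend numbers quoted above. This is only a sanity-check sketch; the variable names are illustrative and the inputs are the rounded figures from the talk.

```python
# Figures quoted in the talk, in thousands of US dollars.
would_have_spent_k = 922   # projected spend on disks without S3
actually_spent_k = 230     # what was spent instead

savings_k = would_have_spent_k - actually_spent_k
savings_share = savings_k / would_have_spent_k

print(f"Savings: ${savings_k}K ({savings_share:.0%} of the projected spend)")
```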
h3. sweet spots
* perfect for startups & small companies
* ideal for _store lots, serve little_ businesses of all sizes
* not so great (yet) for serving lots if you're a medium or large sized business. Transfer costs high if you can buy bandwidth in 1Gbps+ chunks.
* We're a _store lots, serve lots_ company. What to do?
h3. Like SmugFS
* Architecture remarkably similar to internal Smug filesystem
* Similar to lots of startups
* Stupid we're all building the same thing
* Easy to drop in
* Started on Monday, live in production on Friday
h3. S3 evolution
* started just doing secondary storage. Too cold!
* Tried out as Primary. Too hot!
* Finally hot & cold model == just right!
* Amazon gets 100% of the data
* SmugMug keeps _hot_ data local (about 10%)
* 95% reduction in # of disks bought
h3. Sample Request
* They check for the image in the local cache; if it's not there, they log the miss, retrieve the image from S3, and return it to the client
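p. The request flow described above resembles a read-through cache. A minimal sketch, not SmugMug's actual code: the class name, the @fetch_from_cold_store@ callable, and the choice to keep fetched objects in the hot set are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("photo_cache")


class ReadThroughCache:
    """Serve from a local hot cache; on a miss, log it and fall back to cold storage."""

    def __init__(self, fetch_from_cold_store):
        self._hot = {}                        # stands in for local disk
        self._fetch = fetch_from_cold_store   # hypothetical callable: key -> bytes

    def get(self, key):
        if key in self._hot:
            return self._hot[key]
        logger.info("cache miss for %s", key)
        data = self._fetch(key)
        self._hot[key] = data   # assumption: objects fetched on a miss join the hot set
        return data
```

p. In a real deployment @fetch_from_cold_store@ would wrap an S3 GET, and the hot set would need an eviction policy to stay near the 10% figure mentioned in the talk.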
h3. Proxy vs Redirect vs Direct Links
* Built SmugMug->S3 with multiple modes
* Can flip a switch to change
* Nearly 100% served are proxy reads
* Sometimes HTTP redirects
* Rarely direct S3 links
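p. One way to picture the three switchable modes is sketched below. The enum, function names, URL layout, and @fetch@ callable are hypothetical; the notes only say that the mode can be flipped with a switch.

```python
from enum import Enum


class ServeMode(Enum):
    PROXY = "proxy"        # app server fetches from S3 and relays the bytes
    REDIRECT = "redirect"  # app server answers with an HTTP redirect to S3
    DIRECT = "direct"      # pages link straight to the S3 URL


def photo_link(mode, photo_id, s3_base, app_base):
    """URL to embed in a page: only DIRECT links bypass the app servers."""
    if mode is ServeMode.DIRECT:
        return f"{s3_base}/{photo_id}"
    return f"{app_base}/photos/{photo_id}"


def handle_request(mode, photo_id, s3_base, fetch):
    """Answer a request that reached the app servers (PROXY or REDIRECT)."""
    s3_url = f"{s3_base}/{photo_id}"
    if mode is ServeMode.REDIRECT:
        return 302, {"Location": s3_url}, b""
    # PROXY: permission checks can run here, before any bytes are fetched.
    return 200, {}, fetch(s3_url)
```

p. Keeping the mode as a single setting is what makes the "flip a switch" behaviour cheap: the rest of the application only ever calls these two functions.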
h3. Permissions
* SmugMug has complicated permissions
* Passwords, privacy, external links
* Proxying allows strong protection
h3. REST vs SOAP
* Loves REST, hates SOAP
* Lightweight
* Nothing useful added with SOAP's complexity
h3. Reliability
* not 100%, close though
* more reliable than SmugFS
* no service level agreement
* Lots of failure points:
** SmugMug's datacenter
** Internet backbones
** Amazon's datacenter
* No other software, hardware, or service [they] use is 100% reliable either
h3. Handling failure
* Build from day one with failure in mind.
* Stuff breaks, try again
* Writes fail? Write locally, sync later.
* Reads fail? Handle Intelligently. Alerts?
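p. The "try again" and "write locally, sync later" advice could look roughly like the sketch below. The class, the @remote_put@ callable, and the use of @OSError@ as the failure signal are assumptions for illustration, not details from the talk.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


class WriteBehindStore:
    """Try the remote store first; on failure keep the object locally and sync later."""

    def __init__(self, remote_put):
        self._remote_put = remote_put   # hypothetical callable: (key, data) -> None
        self.pending = {}               # objects waiting to be synced

    def put(self, key, data):
        try:
            with_retries(lambda: self._remote_put(key, data))
        except OSError:
            self.pending[key] = data    # write locally, sync later

    def sync(self):
        for key in list(self.pending):
            try:
                self._remote_put(key, self.pending[key])
            except OSError:
                continue                # still failing; leave it queued
            del self.pending[key]
```

p. A background job would call @sync()@ periodically; an alert when @pending@ keeps growing would cover the "Alerts?" question for writes.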
h3. Performance
* Fast for reads and writes
* Mostly speed-of-light limited: 20-80ms
* Parallel I/O for massive throughput. 100s of Mbps
* Machine measurable, human indistinguishable
h3. CDN?
* S3 is not a CDN[content delivery network]
* it's storage
* no global locations yet
* limited edge caching
* perhaps a future Amazon web service?
h3. How do they do their proxy reads?
p. Store and forward vs stream
* Store and forward
** great resiliency
** poor performance
** if it's a big file, really poor performance
* Stream
** poor resiliency
** great performance
** do a quick HEAD first to verify
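p. The trade-off between the two proxy styles shows up clearly in code. A minimal sketch using in-memory file objects; the function names and chunk size are illustrative, and the HEAD check mentioned above is left out (it would run before @stream@ is called).

```python
import io


def store_and_forward(source, sink):
    """Read the whole object first, then send it: resilient, but the client waits."""
    body = source.read()   # any failure here happens before the client sees a byte
    sink.write(body)
    return len(body)


def stream(source, sink, chunk_size=64 * 1024):
    """Relay chunks as they arrive: fast first byte, but a mid-transfer failure reaches the client."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Demo with in-memory buffers standing in for the S3 response and the client socket.
payload = b"x" * 200_000
client = io.BytesIO()
stream(io.BytesIO(payload), client, chunk_size=4096)
```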
h3. The speed of light problem
* he was misquoted as saying Amazon was slow when trying to explain the speed of light
* Amazon has not solved faster-than-light data transmission. Yet.
* unavoidable, make sure your application can tolerate
* parallelized I/O can mask problem
* caching can help
* streaming can help
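p. To illustrate how parallel I/O masks per-request latency: eight sequential fetches at roughly 50ms each take about 400ms, while eight concurrent fetches take roughly as long as the slowest one. The sketch below uses a sleep as a stand-in for the network round trip; @simulated_fetch@ and the worker count are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=16):
    """Fetch every key concurrently and return results in the same order as keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))


def simulated_fetch(key, latency_seconds=0.05):
    time.sleep(latency_seconds)   # stands in for a 20-80ms round trip
    return f"data for {key}"


keys = [f"photo-{i}" for i in range(8)]
results = fetch_all(keys, simulated_fetch)
```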
h3. Outages and Problems
* not perfect, five major issues
* 3 outages of 15-30 minutes, 2 were core switch failures and one DNS problem. Amazon.com affected.
* 2 performance degradations. One was noticed by a SmugMug customer; the other went unnoticed
* Not a big deal, everything fails, expect it.
h3. SLA, Service and Support
* SmugMug does not care about an SLA, but others might
* Service Support: One area where Amazon is weak.
** This is a utility
** They need a service status dashboard
** Pro-active customer notifications
** Ability to get a hold of a human
* Support for developers is quite good.
* Amazon.com's customer service is good, AWS will likely catch up
h3. Saving SmugMug's butts
* knocked out power to ~70TB of storage. Oops!
* Moved datacenters during normal business hours, customers not affected
* Stupid bugs
h3. Miscellaneous Tips
* use cURL
** faster
** more reliable
** storing vs streaming is simple
* make stuff as asynchronous as possible
** hides speed of light issues
** hides or masks problems
** fast customer service
h3. Elastic Compute Cloud (EC2)
* Like S3 but for computing
** scale up or down via API
** web servers, processing boxes, development test beds, etc
* Launching large EC2 implementation "soon"
** image processing
** 500k-1M photos/day
** 10-20 terapixels/day processed
** peaky traffic on weekends, holidays
** ridiculously parallel
h3. Simple Queue Service (SQS)
* Simple, reliable queueing
* Mates well with EC2 and S3
** Stick jobs in SQS
** retrieve jobs with EC2 instances using S3 data
** run jobs, report status to SQS
* $0.10/1000 items
** Priced well for small projects
** gets costly for large ones (millions)
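p. The queue-driven workflow and the pricing above can be sketched with the standard library. Here @queue.Queue@ stands in for SQS and a dict stands in for S3; the function names and the @.out@ key suffix are hypothetical. At $0.10 per 1000 items, a million items comes to about $100.

```python
import queue

PRICE_PER_1000_ITEMS = 0.10   # USD, as quoted in the talk


def queue_cost(items):
    """Estimated queueing cost in USD for a given number of items."""
    return items / 1000 * PRICE_PER_1000_ITEMS


def run_worker(jobs, storage, statuses, process):
    """Drain the job queue: read input from storage, process it, report status."""
    while True:
        try:
            key = jobs.get_nowait()
        except queue.Empty:
            return
        storage[key + ".out"] = process(storage[key])   # hypothetical output key
        statuses.put((key, "done"))
```

p. Each EC2 instance would run a loop like @run_worker@; because jobs are independent, adding instances scales throughput almost linearly.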
h3. Missing Pieces
* Database API or DB grade EC2 instances
** Fast (lots of local spindles, lots of RAM)
** Persistent
* Load Balancer API
** Single IP in front of lots of EC2 instances
** Programmable to add/remove/change clusters
** Can be done with software on an EC2 instance, but painful
* CDN
p. Slides to be at http://blogs.smugmug.com/
* How are they using EC2?
** The EC2 instances invoke SmugMug APIs to do work. The SmugMug servers don't really know much about EC2
* I asked: Has their use of Amazon been an issue, either to outside investors or customers?
** Not an issue, they have no outside investors, and further they've talked with VCs to raise the issue that startups _should_ be looking at Amazon's services (and if not, why not)
h2. Superninja Privacy Techniques for web Application Developers
p. Marc Hedlund and Brad ...? from Wesabe. Wesabe is a personal finance web application.
h3. Keep critical data local
* If there's data you'd never, ever want to lose, don't put it on a web site.
* Created a Wesabe uploader for Mac/Windows to keep bank credentials on your computer. The uploader downloads data from bank sites, strips certain data out of the files, then uploads to Wesabe.
* Don't trust the site: sensitive data is filtered before it ever reaches the server.
* Requires a download
* Puts the burden on the user to maintain a secure machine (same risk as using a web browser to bank)
* If successful, risk of trojan targeting
h3. Use a privacy wall to separate public and private data
* use a secret key as the index in the db
* secret key is only computed when the user is logged in (they use hash(password + salt))
* secret key is stored in session data
* other paths through the db: if you're using a privacy wall, all transactions must traverse it
* the data itself can leak information
* logs and exception reports can capture leaked information
* password changing and recovery become trickier
** use a _locker_: generate a one-time key for the user, stored in the locker
** encrypt using the locker rather than the password
* troubleshooting can be harder
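p. A toy version of the privacy wall idea, assuming SHA-256 as the hash (the notes only say hash(password + salt)). The class and method names are invented for illustration, and a production system would use a deliberately slow key-derivation function.

```python
import hashlib


def privacy_key(password, salt):
    """Derive the secret lookup key from the password and a per-user salt."""
    # SHA-256 is a stand-in; the talk does not name the hash function.
    return hashlib.sha256((password + salt).encode("utf-8")).hexdigest()


class PrivacyWall:
    def __init__(self):
        self.users = {}     # public side: username -> salt, no pointer to private rows
        self.private = {}   # private side: secret key -> private records

    def register(self, username, password, salt):
        self.users[username] = salt
        self.private[privacy_key(password, salt)] = []

    def login(self, username, password):
        """Return the secret key to keep in the session, or None if the password is wrong."""
        key = privacy_key(password, self.users[username])
        return key if key in self.private else None
```

p. Nothing stored on the public side links a username to its private rows; the link exists only while the secret key sits in a logged-in session.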
h3. Use partitioning to protect against breaches
* keep pools of sensitive data separate
** e.g. membership and financial records kept separate
** no relationship between them other than status
* reduces impact of any breach -- firewalls off anything truly identifiable
* allows separate policies and approaches by data type
* pretty much zero drawbacks other than implementation time
h3. Data fuzzing and log scrubbing
* (currently) no requirement to retain specific data on users of a server (in the US)
* a subpoena / warrant may require that you give up all data on a user
* different countries have different data retention policies (see epic.org)
* filter key parameters from logs
* remove some of the precision of IP addresses
* remove precision from timestamps, since they too can be used to identify someone (cf. example of whistleblower information)
* prevents leakage of passwords
* avoids giving attackers / law enforcement a way through the privacy wall
* loss of certain private data may require you to notify your customers
* best protection is to delete your logs
* important to have a public policy in place (cf. link to eff.org policy information)
* no protection against wiretap orders
* difficult to cover all your bases (use centralized logging)
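p. The three scrubbing steps listed above (filter parameters, coarsen IP addresses, coarsen timestamps) might be sketched as follows. The parameter names, the /24 and /48 prefix lengths, and the hour granularity are assumptions; the talk does not specify them.

```python
import ipaddress
from datetime import datetime
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_PARAMS = {"password", "token", "account"}   # hypothetical parameter names


def scrub_url(url):
    """Replace the values of sensitive query parameters before logging."""
    parts = urlsplit(url)
    query = [(k, "FILTERED" if k.lower() in SENSITIVE_PARAMS else v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def fuzz_ip(address):
    """Zero the low bits of an address (keeps a /24 for IPv4, a /48 for IPv6)."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


def fuzz_timestamp(moment):
    """Round a timestamp down to the hour."""
    return moment.replace(minute=0, second=0, microsecond=0)
```

p. Running every log line through one central function like these is easier to audit than scrubbing at each call site, which matches the "use centralized logging" advice.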
h3. Use voting algorithms to determine public information
* "The ESP Game" to tag things at CMU.edu. If two people tag something with the same thing at the same time, maybe that's a good tag to apply.
* look at Google Image Labeler
* when people agree on a term, it's common knowledge
* if enough people agree, it's probably publicly known
* private transactions shouldn't be shown on the site
* lots of users naming a merchant probably means it's public
* works on opaque information
* reliable -- very few faults since launch
* no manual work needed
* drawbacks
** information is hidden until threshold met (understates available info)
** can leak data if threshold is too low
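p. A threshold vote of the kind described above could be sketched like this. The class name, the default threshold of 3, and the lower-casing of names are assumptions; the notes do not give Wesabe's actual threshold or normalization.

```python
from collections import defaultdict


class MerchantVotes:
    """Reveal a merchant name only once enough distinct users have supplied it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._voters = defaultdict(set)   # normalized name -> user ids who used it

    def record(self, user_id, name):
        self._voters[name.strip().lower()].add(user_id)

    def is_public(self, name):
        return len(self._voters.get(name.strip().lower(), ())) >= self.threshold

    def public_names(self):
        return sorted(name for name, users in self._voters.items()
                      if len(users) >= self.threshold)
```

p. Counting distinct users rather than raw submissions means one user repeating a private name never pushes it over the threshold.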
h3. Miscellaneous
* hash your passwords. Don't store them in plaintext.
* random (non-sequential) database ids. Don't use auto-inc ids in public data.
* data bill of rights -- your data is your data. Can export, delete, etc.
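p. The first two tips can be illustrated with Python's standard library. This is a sketch under assumptions: PBKDF2 with SHA-256 and the iteration count are my choices, not something the speakers specified, and the function names are invented.

```python
import hashlib
import hmac
import secrets


def hash_password(password, iterations=200_000):
    """Return (salt, iterations, digest) for storage instead of the plaintext password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, digest):
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, digest)


def new_public_id():
    """Random, non-sequential identifier suitable for URLs."""
    return secrets.token_urlsafe(16)
```

p. Random identifiers keep outsiders from enumerating records or estimating user counts, which auto-increment ids in public URLs would allow.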