Giving Day Technology Overview
GiveGab, Inc. hosts all of its technology assets in the cloud, backed by market-leading cloud providers with highly available (H/A), fault-tolerant infrastructure and the highest levels of security, availability, and compliance:
- Amazon Web Services - https://aws.amazon.com/security/
- Infrastructure for hosting application servers, database servers, and object data (images and documents)
- Heroku - http://policy.heroku.com/security
- Infrastructure for hosting application servers and monitoring our platform security
- DNSimple - https://dnsimple.com/security
- DNS hosting with high availability and automatic failover to DNSMadeEasy
- GitHub - https://help.github.com/articles/github-security
- Hosts our source code which can be deployed to any cloud provider
- Fastly CDN and Dynamic Caching Service
- Geographically distributed, highly-available and redundant content delivery network servers
- Stripe Payments
- A highly available, industry-leading payment provider that processes hundreds of millions of dollars each day
- MailChimp Mandrill
- Email processing
Stateful data and assets are stored in secure locations and backed up automatically across multiple facilities, both in real time to hot-failover replicas and via hourly, daily, and weekly snapshots. Cloud providers by nature must deliver a superior level of commodity services to their customers, which in turn gives us minimal risk and the quickest possible recovery from a disaster. A significant portion of the Internet’s services run on these cloud providers’ infrastructure.
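As a concrete illustration of the snapshot side of this strategy, the sketch below shows how a periodic snapshot could be dumped and copied off-provider to S3. It is a minimal sketch only, assuming the aws-sdk-s3 Ruby gem and the pg_dump utility; the bucket name, environment variables, and paths are hypothetical placeholders rather than our actual configuration.

```ruby
# snapshot_archive.rb -- minimal illustrative sketch only; bucket name, env vars,
# and paths are hypothetical placeholders, not our actual configuration.
require "aws-sdk-s3"   # gem install aws-sdk-s3

DB_URL   = ENV.fetch("DATABASE_URL")                              # primary DB connection string
BUCKET   = ENV.fetch("SNAPSHOT_BUCKET", "example-db-snapshots")
stamp    = Time.now.utc.strftime("%Y%m%dT%H%M%SZ")
dumpfile = "/tmp/snapshot-#{stamp}.dump"

# 1. Take a compressed logical snapshot with pg_dump (custom format).
unless system("pg_dump", "--format=custom", "--file=#{dumpfile}", DB_URL)
  abort("pg_dump failed")
end

# 2. Copy the snapshot off-provider for archival redundancy (here: AWS S3).
s3 = Aws::S3::Client.new(region: ENV.fetch("AWS_REGION", "us-east-1"))
File.open(dumpfile, "rb") do |file|
  s3.put_object(bucket: BUCKET, key: "hourly/#{File.basename(dumpfile)}", body: file)
end

puts "archived #{dumpfile} to s3://#{BUCKET}/hourly/"
```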
While there is always risk, GiveGab believes we have over-protected against these risks with a Double Failover Strategy, in which the second-level failover is to a completely different cloud IaaS; we are dealing with risk tolerance on the order of 0.0001%. This involves distributing the hosting of our software across multiple independent cloud providers, along with DNS failover. While some components of the overall technology architecture could fail at this level, our systems are built so that automated detection fails over to healthy resources, making systemic failure extremely unlikely. If a systemic failure does occur, we invoke our contingency plan and leverage the Double Failover Strategy noted below, with an RTO of 120 minutes for a cold database failover from backup.
GiveGab consistently runs high-load events, including days when we have 70+ giving day sites running live and processing more than 125,000 donation transactions totaling over $15M. Our platform is automatically load-tested every other week to ensure peak performance without regression.
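The fragment below is a minimal sketch of the kind of periodic load check this implies, written in plain Ruby with Net::HTTP; the target URL, concurrency, request counts, and the use of the 500ms target as a pass/fail line are illustrative assumptions, not our actual load-testing harness.

```ruby
# load_check.rb -- minimal illustrative sketch; the endpoint, concurrency, and
# thresholds are placeholders, not the production load-testing configuration.
require "net/http"
require "uri"

TARGET      = URI(ENV.fetch("LOAD_TEST_URL", "https://staging.example.com/healthz"))
THREADS     = 20     # concurrent simulated clients
REQUESTS    = 50     # requests per client
LATENCY_SLO = 0.5    # the 500 ms per-request target referenced in this document

latencies = Queue.new

workers = THREADS.times.map do
  Thread.new do
    REQUESTS.times do
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      begin
        Net::HTTP.get_response(TARGET)
        latencies << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      rescue StandardError
        # failed requests are simply dropped from the sample in this sketch
      end
    end
  end
end
workers.each(&:join)

samples = Array.new(latencies.size) { latencies.pop }.sort
abort("no successful requests") if samples.empty?

p95 = samples[(samples.size * 0.95).floor]
puts format("p95 latency: %.3fs over %d requests", p95, samples.size)
exit(1) if p95 > LATENCY_SLO   # flag a regression past the target
```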
Diagram of our Level 1 - Primary Cloud Technology Stack

Double Failover Strategy
Level 1 - Primary Cloud Technology Stack
- This is our current technology stack that leverages the technologies noted above
- It is highly available, with built-in automatic fault tolerance and redundancy at each level and for each cloud provider
- It has resulted in 99.9% uptime and no complete outages for all of our giving days so far
- It is auto-monitored by HoneyBadger, NewRelic, and Sentry, and we are alerted if there are any issues
- It scales elastically, meaning capacity is added automatically to handle spikes in usage volume
- It performs at under 500ms per request; if that threshold is hit, additional servers are provisioned automatically (see the scaling sketch following this list)
- Essentially, our Level 1 contingency is more reliable and highly-available than most solutions’ aggregate architecture and contingency plans
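To make the scaling behavior above concrete, here is a minimal sketch of a threshold-based scaling check using the Heroku platform-api Ruby gem. The app name, the way the observed p95 latency is supplied, the dyno increment, and the ceiling are hypothetical assumptions for illustration, not the platform's actual autoscaling configuration.

```ruby
# scale_check.rb -- minimal illustrative sketch of threshold-based scaling using
# the Heroku platform-api gem; the app name, latency source, dyno increment, and
# ceiling are hypothetical assumptions, not our actual scaling rules.
require "platform-api"   # gem install platform-api

APP       = ENV.fetch("HEROKU_APP", "example-givingday-web")
THRESHOLD = 0.5    # seconds -- the 500 ms per-request target noted above
MAX_DYNOS = 50

heroku = PlatformAPI.connect_oauth(ENV.fetch("HEROKU_OAUTH_TOKEN"))

# Placeholder input: in practice this figure would come from an APM service.
p95_latency = Float(ENV.fetch("OBSERVED_P95_SECONDS", "0.3"))

formation = heroku.formation.info(APP, "web")
current   = formation["quantity"]

if p95_latency > THRESHOLD && current < MAX_DYNOS
  heroku.formation.update(APP, "web", "quantity" => current + 2)
  puts "p95 #{p95_latency}s exceeds #{THRESHOLD}s: web dynos #{current} -> #{current + 2}"
else
  puts "p95 #{p95_latency}s within target; keeping #{current} web dynos"
end
```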
Level 2 - Secondary Cloud Technology Stack
- This stack runs on a fully independent set of cloud providers, so that if a systemic issue occurred with one or more components, we could fail over to this secondary technology stack
- Portions of this stack run in parallel with our primary stack at all times for ongoing testing and seamless failover (see the health-check sketch following this list)
- This is a continuation of the same architecture from Level 1, but on different independent Cloud providers to isolate Cloud vendor-specific issues
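As a small illustration of the kind of ongoing parallel testing mentioned above, the sketch below checks a health endpoint on both stacks and fails loudly if the secondary is unhealthy. The endpoint URLs are placeholder assumptions, not our real hostnames.

```ruby
# stack_health.rb -- minimal illustrative sketch of a parallel primary/secondary
# health check; the endpoint URLs are placeholders, not our real hostnames.
require "net/http"
require "uri"
require "json"

ENDPOINTS = {
  "primary (Level 1)"   => "https://api.primary.example.com/healthz",
  "secondary (Level 2)" => "https://api.secondary.example.com/healthz"
}

results = ENDPOINTS.map do |name, url|
  healthy = begin
    Net::HTTP.get_response(URI(url)).is_a?(Net::HTTPSuccess)
  rescue StandardError
    false
  end
  [name, healthy]
end.to_h

puts JSON.pretty_generate(results)
abort("secondary stack unhealthy -- investigate before it is needed") unless results["secondary (Level 2)"]
```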
Diagram of our Double Failover Strategy Architecture

Contingency Planning
Overall Contingency Sequence
- Level 1 - Internal high availability supported through automated monitoring
- Level 2 - Failover to secondary cloud providers if Level 1 technology cloud services were determined to be undergoing catastrophic failure (RTO 120 mins for DB cold failover)
Contingency Planning Summary
In general, our giving day architecture is designed to withstand many issues: 99.999% of issues are seamlessly handled without disruption by our Level 1 H/A cloud services.
In the extremely unlikely situation that we need to fail over to Level 2 for our application services, users could expect at most a brief delay of up to 60 minutes, and many users and groups would see no downtime at all. For practical purposes, there would likely be no noticeable downtime for the majority of users. If a total catastrophic Level 1 infrastructure failure affected our database, we would need to provision and load a new database from one of our real-time backups, and the RTO would be 120 minutes (a restore sketch follows below).
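For illustration, the sketch below shows one way a cold restore of this kind could be scripted: download the most recent archived snapshot and load it into a freshly provisioned database. It assumes the aws-sdk-s3 gem, the pg_restore utility, and hypothetical bucket, prefix, and environment variable names; it is not our actual recovery runbook.

```ruby
# cold_restore.rb -- minimal illustrative sketch of a cold database restore from
# an archived snapshot; bucket, prefix, and env var names are placeholders.
require "aws-sdk-s3"   # gem install aws-sdk-s3

BUCKET     = ENV.fetch("SNAPSHOT_BUCKET", "example-db-snapshots")
TARGET_URL = ENV.fetch("NEW_DATABASE_URL")   # freshly provisioned, empty database
local_dump = "/tmp/latest.dump"

s3 = Aws::S3::Client.new(region: ENV.fetch("AWS_REGION", "us-east-1"))

# 1. Locate the most recent archived snapshot.
latest = s3.list_objects_v2(bucket: BUCKET, prefix: "hourly/").contents.max_by(&:last_modified)
abort("no archived snapshots found") unless latest

# 2. Download it and load it into the new database with pg_restore.
s3.get_object(bucket: BUCKET, key: latest.key, response_target: local_dump)
unless system("pg_restore", "--no-owner", "--dbname=#{TARGET_URL}", local_dump)
  abort("pg_restore failed")
end

puts "restored #{latest.key} into the new database"
```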
Finally, in the most catastrophic situation, we could work to fail over to each nonprofit’s own website and leverage our embeddable form solution to host donation forms right on their sites; however, at that point major issues impacting most of the internet would likely be occurring, and the giving day site would be overshadowed by much more serious problems.
Level 1 - Primary Tech Stack: Contingency Planning Details
Since our primary technology stack is built on top of the #1 cloud provider, Amazon Web Services (https://aws.amazon.com/resources/gartner-2016-mq-learn-more/), we are able to build contingency planning into every component of this architecture, on top of the fault tolerance and redundancy that AWS provides by default.
Tech Components:
- Database
- We leverage Heroku’s Postgres database hosting
- This is highly available, with a real-time follower replica that is automatically failed over to if the primary database encounters an error
- We also have several geographically distributed hot replicas
- We also take periodic data snapshots and save them off for archival redundancy purposes. These are then copied for additional backup to other cloud providers, namely AWS S3 and Google Cloud
- API Servers
- We leverage Heroku’s Ruby on Rails server containers and auto-scaling mesh grid of virtual servers which are automatically provisioned/de-provisioned and load balanced across
- We run 3 separate API applications (main, stats micro-service, leaderboard micro-service) to ensure independent scalability and optimized performance
- The API is hosted as a completely separate application from our giving day sites
- Web Servers
- We leverage Heroku’s Ruby on Rails server containers and auto-scaling mesh grid of virtual servers which are automatically provisioned/de-provisioned and load balanced across
- This is hosted as a completely separate application from our API
- Content Distribution Network
- We leverage Fastly which has a high-performance content delivery network with cached assets and content spread across servers throughout the world to decrease latency
- Fastly has run web sites for Super Bowl ads for AirBnB and others
- It also caches full giving day pages as well as giving day stats, leaderboards, API calls, etc for extreme performance on the day
- DNS
- We leverage DNSimple as our primary DNS service
- It is highly-available
- If a failure occurs due to a DDoS attack or another issue, we automatically fail over to DNSMadeEasy
- DNSMadeEasy is also used in parallel with DNSimple, regardless of any failover situation, to load-balance and to continuously exercise the failover mechanism (see the DNS check sketch following this list)
- Payment Processing
- Stripe Payments is our provider and is one of the market leaders in payment processing
- It is highly-available with 100% uptime for all of our giving days and for the past 120 days
- If a failure occurs, we are able to provision a failover form and processor through our Enterprise platform, provided the customer has an in-house gateway account that we can use. This ensures continuous gift processing
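The sketch below, referenced from the DNS bullet above, illustrates how both DNS providers could be checked side by side using Ruby's standard resolv library. The domain and nameserver hostnames are placeholder assumptions, not our production DNS configuration.

```ruby
# dns_check.rb -- minimal illustrative sketch only; the domain and nameserver
# hostnames below are placeholders, not our production DNS configuration.
require "resolv"

DOMAIN      = ENV.fetch("GIVING_DAY_DOMAIN", "givingday.example.org")
NAMESERVERS = {
  "DNSimple"    => "ns1.dnsimple.com",      # placeholder nameserver hostnames
  "DNSMadeEasy" => "ns1.dnsmadeeasy.com"
}

NAMESERVERS.each do |provider, ns_host|
  begin
    ns_ips   = Resolv.getaddresses(ns_host)            # resolve the nameserver itself
    resolver = Resolv::DNS.new(nameserver: ns_ips)     # then query it directly
    answers  = resolver.getaddresses(DOMAIN)
    puts "#{provider} (#{ns_host}): #{answers.empty? ? 'NO ANSWER' : answers.join(', ')}"
  rescue StandardError => e
    puts "#{provider} (#{ns_host}): ERROR #{e.class}"
  end
end
```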
Contingency Triggering:
- If our error monitoring services detect an issue, our 24x7 on-call engineering and support team is alerted
- On the giving day itself, all hands are on deck, so we are monitoring our tools live anyway
- 99.9% of errors are automatically handled without disruption to users within the current primary technology stack by leveraging the built-in H/A capabilities
- If we determine that there is a systemic catastrophic issue with any one component, we will fail over fully to Level 2, or to a specific component from Level 2
Impact on Users:
The impact on users is dependent on the particular component that is failing over or experiencing an issue.
In most cases, where the built-in H/A and fault-tolerant capabilities are invoked, users should not experience issues, aside from the handful of users whose requests triggered the error alerting and automatic failover.
In the case of catastrophic Level 1 issues, we would expect at most 15 minutes of degraded performance. This again falls in the 0.001% risk percentile.
Plan of Contingency Execution:
Our Fastly load balancer would be quickly reconfigured to remove any endpoints from the failing technology stack and to point only at the secondary Level 2 providers. We would also provision additional resources to handle the increase in usage volume.
If our Fastly service fails, we can quickly reconfigure our DNS to point directly at our web servers on Heroku.
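As an example of how quickly such a DNS change could be scripted, here is a minimal sketch that re-points a CNAME record via the DNSimple v2 REST API using only Ruby's standard library. The account ID, zone, record ID, token variable, and Heroku target hostname are hypothetical placeholders, not our live records.

```ruby
# repoint_dns.rb -- minimal illustrative sketch of re-pointing a CNAME record at
# the Heroku web tier via the DNSimple v2 REST API; the account ID, zone, record
# ID, token variable, and target hostname are hypothetical placeholders.
require "net/http"
require "uri"
require "json"

ACCOUNT = ENV.fetch("DNSIMPLE_ACCOUNT_ID")
ZONE    = ENV.fetch("GIVING_DAY_ZONE", "givingday.example.org")
RECORD  = ENV.fetch("GIVING_DAY_RECORD_ID")                     # the CNAME record to update
TARGET  = ENV.fetch("FAILOVER_TARGET", "example-app.herokuapp.com")

uri = URI("https://api.dnsimple.com/v2/#{ACCOUNT}/zones/#{ZONE}/records/#{RECORD}")
request = Net::HTTP::Patch.new(uri)
request["Authorization"] = "Bearer #{ENV.fetch('DNSIMPLE_API_TOKEN')}"
request["Content-Type"]  = "application/json"
request.body = JSON.generate(content: TARGET, ttl: 60)          # short TTL during failover

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts "DNSimple responded #{response.code}"
```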
Communication Plan:
- We would communicate the failover to Level 2 to the giving day host
- We would determine with them the severity of the end-user impact (e.g., site completely down, giving down, stats/leaderboards down, etc.)
- Based on the severity of the incident and the duration of the impact, we would determine whether we need to email participating nonprofits, post on social media, or say nothing. For example, if a one-minute performance degradation were expected and didn’t impact all users, we would determine with the giving day host whether a communication was necessary
- Generally speaking, we always discuss with the giving day host what they consider the best communication plan. This discussion happens with the GiveGab project manager for the giving day and takes place well in advance of the event
Implications:
If Level 1 fails, this means there is likely a catastrophic failure with AWS, which would effectively impact at least 20% of the websites on the internet. At that point, users across the internet would be focused on much more than the giving day itself.
Level 2 - Secondary Tech Stack: Contingency Planning Details
Like Level 1, our Level 2 secondary architecture is built on top of highly available cloud providers and therefore has similar built-in failover components.
Tech Components:
- Database
- In the case that our primary H/A, fully redundant Postgres hosting provider (Heroku) was down, we would flip our primary database server to ElephantSQL Postgres hosting on Google Cloud (RTO 120 minutes; see the configuration-flip sketch following this list)
- API Servers
- We would already be running our API on Ruby on Rails server containers within IBM Cloud
- Web Servers
- We would already be running secondary Ruby on Rails web servers on IBM Cloud servers
- Content Distribution Network
- If Fastly were unavailable, we would re-configure our DNS to point directly at our API and web servers instead of through Fastly, our Level 1 CDN
- Dynamic and API caching wouldn’t be available so we would need to scale up the API to handle the extra load - this would happen automatically
- DNS
- If DNSimple was not available, we’d leverage DNSMadeEasy automatically
- It is highly-unlikely that both DNS services would be down. We could easily point our nameserver records at another DNS provider if the backup was also down
- At this point, if multiple DNS service providers were down, there would likely be a larger systemic issue going on in the Internet
- Payment Processing
- We would flip the donation form to use the failover form feature built into the platform
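Referenced from the Database bullet above, the sketch below shows how repointing the application at the secondary Postgres host could be scripted as a config-var swap, assuming the Heroku platform-api gem; the app name and environment variable names are hypothetical placeholders.

```ruby
# db_flip.rb -- minimal illustrative sketch of repointing the application at the
# secondary Postgres host by swapping a config var via the Heroku platform-api
# gem; the app name and env var names are hypothetical placeholders.
require "platform-api"   # gem install platform-api

APP          = ENV.fetch("HEROKU_APP", "example-givingday-api")
SECONDARY_DB = ENV.fetch("SECONDARY_DATABASE_URL")   # e.g. the ElephantSQL instance

heroku = PlatformAPI.connect_oauth(ENV.fetch("HEROKU_OAUTH_TOKEN"))

# Point the app at the secondary database; Heroku restarts dynos on config change.
heroku.config_var.update(APP, "DATABASE_URL" => SECONDARY_DB)

puts "DATABASE_URL now points at the secondary Postgres host"
```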
Contingency Triggering:
- If our error monitoring services detect an issue, our 24x7 on-call engineering and support team is alerted
- On the giving day itself, all hands are on deck, so we are monitoring our tools live anyway
- 99.9% of errors are automatically handled without disruption to users within the current secondary technology stack by leveraging the built-in H/A capabilities
- If we determine that there is a systemic catastrophic issue with any one component, we will work with the giving day host to determine whether the severity and expected duration warrant a disaster-recovery failover that points DNS at their website
Impact on Users:
Similar to above, the impact on users is dependent on the particular component that is experiencing an issue.
In the case of catastrophic Level 2 issues where we would need to fail over to DNS redirection, we would expect up to 1 hour of degraded performance. This falls in the 0.0001% risk percentile.
Plan of Contingency Execution:
We would point our DNS records to resolve to each participant’s self-identified website.
Communication Plan:
- We would communicate the failover to Level 3 (DNS redirection to nonprofit websites) to the giving day host
- We would work with the giving day host to quickly email participating nonprofits about the issue and post socially as well as put up a message on the giving day site
- Generally speaking, we would need to discuss with the giving day host what they consider the best communication plan AND whether they want to point back to Level 1 or Level 2 once issues under those configurations are resolved. This discussion and plan will be created well in advance of the giving day
Implications:
If Level 2 fails, this means there is likely a catastrophic failure across the United States internet and a serious national or global disruption underway. At least 75% of websites on the internet would likely be down, and even though we would fail over to nonprofits’ websites so people could visit and give directly, it is very likely that their own websites would also be disrupted at that point.
Something catastrophic would likely be happening, and the giving day would not necessarily be the most pressing issue to deal with.
DNS Failover to Participating Nonprofit Websites: Contingency Planning Details
The giving day URL would no longer resolve to the GiveGab giving day platform, but instead to the nonprofit’s website.
Tech Components:
- DNS
- Our DNS service would resolve all endpoints to the specific nonprofit’s website
- The giving day URL would resolve to a static website on a tertiary web host with a list of all of the nonprofits taking part
- Payment Processing
- Handled by the donate button on the nonprofit’s website
Impact on Users:
Users would see URLs resolve to each nonprofit’s website, and people would still be able to give with little downtime during the switchover. Unfortunately, we would lose the ability to track, aggregate, and total stats around gifts.
This is serious disaster-recovery mode, but again, users are not heavily impacted because they are simply directed to the nonprofit’s website, assuming it is still available itself.
Communication Plan:
- At this point, we’d be working with the giving day host and the nonprofits closely on updates and advised communications
- Again, we would plan this out well in advance of the giving day with the host as part of the project
Implications:
It is likely that serious issues would be affecting many of the major cloud providers, and that internet connectivity and service issues across the World Wide Web would be severe and not isolated to the giving day site. Again, each nonprofit’s site could also be experiencing issues.
Updated: October 4, 2021