diff --git a/Distributed_Systems/site_reliability_engineering.md b/Distributed_Systems/site_reliability_engineering.md
new file mode 100644
index 0000000..120e1fc
--- /dev/null
+++ b/Distributed_Systems/site_reliability_engineering.md
@@ -0,0 +1,47 @@
+# SRE (Site Reliability Engineering)
+
+## Managing Machines
+- Google Borg [Ver15]
+  - similar to *Apache Mesos*
+  - successor: Kubernetes [Bur16]
+- Clos network fabric [Clos53] [Sin15]
+- B4 - backbone network [Jai13]
+- OpenFlow communication protocol
+
+## Storage
+- Colossus
+  - successor to GFS (Google File System) [Ghe03]
+  - comparable to *Lustre* or *Hadoop Distributed File System* (HDFS)
+
+## DB
+- BigTable [Cha06]
+- Spanner [Cor12]
+- Blobstore
+
+## Network
+- Bandwidth Enforcer (BwE) [Kum15]
+- Global Software Load Balancer (GSLB)
+  - balances load at 3 levels: geographic (DNS requests), user service, RPC
+
+## Other
+- Lock Service - Chubby [Bur06]
+  - similar to *ZooKeeper*
+  - consistent DB
+- Borgmon
+  - similar to *BB GUTS*
+- Stubby
+  - similar to *gRPC* or *BB BAS*
+  - uses *protobufs*; see also *Apache Thrift*
+
+
+## References
+[Bur06]: http://research.google.com/archive/chubby-osdi06.pdf
+[Bur16]: https://research.google/pubs/pub44843.pdf
+[Cha06]: https://research.google/pubs/pub27898.pdf
+[Clos53]: https://ia801901.us.archive.org/8/items/bstj32-2-406/bstj32-2-406_text.pdf
+[Cor12]: https://research.google/pubs/pub39966.pdf
+[Ghe03]: https://research.google.com/archive/gfs-sosp2003.pdf
+[Jai13]: http://conferences.sigcomm.org/sigcomm/2013/papers/sigcomm/p3.pdf
+[Kum15]: https://research.google/pubs/pub43838.pdf
+[Sin15]: https://research.google.com/pubs/archive/43837.pdf
+[Ver15]: https://research.google.com/pubs/archive/43438.pdf