Sunday 4 July 2021

Working in Industry vs. Academia

One of the most significant decisions scientists face is choosing whether to pursue a career in industry or academia. While this decision is easy for some, it can be incredibly challenging for others. If you’ve struggled with this question of which career path you’ll choose after your formal education ends, you’re not alone. 

There are several key differences between working in industry and academia. It’s critical to understand these nuances and consider your skills, qualifications, personality, and career goals when deciding which path is right for you.

 

1. Responsibilities

Academic careers vary depending on the size of the institution, but most professionals in academic research have some version of the following broad responsibilities:

  • Applying for grants
  • Conducting self-directed research
  • Publishing papers
  • Teaching courses
  • Mentoring students
  • Performing departmental service

Working in “industry” can mean many things, as the term encompasses all research work that occurs outside of universities. Professionals who choose this route can work for small biotech startups, mid-sized corporations, or even international organizations with thousands of employees. The scope of work is typically focused on applied research that will have direct value. Industry work also requires a more business-minded approach: you must be able to develop projects that meet company goals and support the company's business plan.

2. Flexibility

For some, an appealing aspect of working in academia is the freedom to dictate your schedule, choosing when to teach, conduct research, and publish your work. Not having to answer to anyone about how you allocate your time also means you must be proficient in time management and prioritization.

A business organization’s research lab is more structured and typically revolves around a standard 9-to-5 workday. For some people, this type of structure is preferable to ensure maximum productivity.

3. Collaboration

Academic research is mainly collaborative and teamwork-oriented. An academic environment creates an extraordinary opportunity for cross-disciplinary thinking and research. You can, however, enjoy an immense sense of autonomy, should you choose, with the freedom to decide when and with whom you collaborate.

In industry, researchers are working toward a larger, shared goal. Due to the complex nature of drug discovery, there is much collaboration across multiple functional areas and disciplines. Whereas researchers in academia can be highly competitive, researchers in industry must collaborate and work as a team.

4. Workplace Culture

Academia is highly research and discovery focused, and much research is done for the sake of learning, as opposed to clinical application. In contrast, “industry” work allows researchers to feel a sense of immediate impact on patient lives.

Both workplaces have their own share of pressures and demands, as well. In academia, the researcher’s plight is often “obtain funding and publish, or perish.” Academics are under immense pressure to be self-starters, continually publish their research, and to promote and advocate for their work.

The pressures are typically more deadline-driven in industry, as teams work to integrate science and business-focused problem solving on tight project timelines that follow larger product and business goals. Thus, it’s crucial for people working in industry to be excellent communicators and have sharp people skills to manage projects.

The pace of work also differs between industry and academia. In contrast to the fast-paced nature of drug development, academic timelines tend to be longer and focused more on long-term goals and education.

5. Individual Impact

As an academic, you’ll typically not have quarterly deadlines to meet, monthly reports to file, or a superior that you’re being held directly accountable to. Thus, the ability to make an individual impact and receive recognition for your work can be greater than in industry, where you are a single member working on behalf of an organization.

The flip side, however, is that academics can struggle to have their ideas adopted in practice. In contrast, the work that industry researchers do is often directly motivated by business goals. Although this removes a measure of autonomy, the positive aspect is that research results are often immediately and directly impactful. To work in industry, one must be willing to work on a team and share credit. This teamwork aspect can also take off some of the pressure of having to achieve results individually.

6. Intellectual Freedom

In academia, professionals enjoy intellectual freedom, free from the constraints of short-term deadlines and from answering to those who set the research priorities. This allows individuals to choose what they would prefer to spend their time researching and how to pursue it. With this freedom also comes the responsibility of securing funding and resources.

When working in industry, most work is done on a quick timeline and is driven by a product or business goal. This type of clear direction can be very appealing to some researchers, while others may see it as a hindrance to their ability to investigate their own areas of personal interest. A benefit of working in industry is that the larger organization supplies the funding and more state-of-the-art resources.

7. Career Advancement

Generally speaking, an academic research scientist’s career moves in one of two directions—toward tenure and professorship or toward work as an academic staff scientist. The career ladder can be difficult to climb when only a handful of universities specialize in your discipline or are actively hiring in a given year. There is great job security, however, if you achieve tenure.

In contrast, industry career opportunities are broader and can range from research at the bench to work in product marketing or development. In industry, you also have the opportunity to climb the organizational ladder to manage larger teams and projects.

How to Decide Whether Academia or Industry Is Right for You

Ultimately, the choice between academia and an industry research lab involves many compromises, and the best “fit” for you will likely depend on your individual preference and working style.

Here are some factors to consider before heading down either career path:

  • Determine your priorities. Consider what matters most to you. Whether you’re most concerned about salary potential, intellectual freedom, or flexibility, it’s important to do some soul-searching to decide what you value most.
  • Think about how you want to spend your time. Consider how you want to spend your time day-to-day. Think about how you feel about teaching, publishing, managing, interacting, traveling, negotiating, collaborating, presenting, reporting, reviewing, fundraising, etc.
  • Know your strengths. Are you a self-starter who can proactively manage your own time? Or do you prefer to work in a more structured, process-oriented environment? Knowing your strengths can help direct you to the path that will increase your chances of success.
  • Factor in your personality. Do you prefer to work independently, or do you thrive when working alongside others? Are you comfortable with self-promotion, or would you be more comfortable sharing your successes with a team?
  • Think long-term, but keep your options open. Where do you see yourself in five years? 10? 20? Think about where you’d like to be long-term, but remember the choice you make is merely for the next step in your career. It doesn’t have to be final. The field is currently more conducive to transitions between the two paths than ever before.
  • Be true to yourself. Most of all, be honest with yourself. Stay true to who you are, and consider what you are most passionate about. If you do this, you will find success in whichever path you choose.

Thursday 25 July 2019

Robots at Recruitment


The digital revolution in society had its starting point with the emergence of the personal computer in the early 1980s. The second step that impacted society was the commercial exploitation of the Internet and the creation of the World Wide Web by Tim Berners-Lee. Jumping a couple of decades to the present day, we are now witnessing the rapid rise of robotics thanks to the current state of Artificial Intelligence (AI).

Several companies are trying to automate human tasks using robots, arguing that robots make fewer mistakes in repetitive tasks. Detractors counter that fewer jobs will be available. Regardless of the opinions of defenders and detractors, we are already witnessing social collaboration with robots in our daily lives.

Currently, there is a debate about the impact that robotics is having on society. The labor market is one area that is changing completely. In terms of social robots, there is a novelty coming from two companies in Sweden – Furhat Robotics and Tengai AB.


Figure 1: Furhat Robot

A new company called Furhat Robotics has been developing social robots that can be used as the frontline for any interaction process with humans. The robots consist of a face mask onto which an internal projector displays the facial movements. The combination of the light beam and the mask creates the illusion that the face, lips, and eyes are moving.


The hardware plus the software creates a social robot that can be used in several social interactions. The robots have been used as a concierge in the airport [1]; as a teaching assistant to engage students [2]; as a pre-screening medical robot to detect signs of the world’s common diseases [3]; and as a recruiter to remove bias from the hiring process [4].

Figure 2: Tengai is the world’s first unbiased social interview robot. Here with Elin Öberg Mårtenzon, Chief Executive Officer at Tengai and Chief Innovation Officer at TNG.


Tengai AB is a company that offers unbiased recruitment using robots called Tengai Unbiased. With Tengai robots, the recruitment process can be transparent, data-driven, and anonymous. All applicants follow the same interview procedure, which is what makes the process data-driven.

Elin Öberg Mårtenzon is at the forefront of Tengai AB and assures that soft skills and personality traits can also be screened by adjusting the interview process to different role descriptions.

The recruiter can easily prepare Tengai robots for the interview using Tengai software to manage and overview the process. The interview is fully automated and has an integration with a self-service booking for the candidate.

Besides their national customers, Tengai is also targeting international customers and will launch an English-speaking robot at the beginning of 2020. “The interest in this product has been massive, and it is now possible to sign up on the waiting list to get your Tengai as one of the first companies in the world,” says Elin.

In the early stages, these robots are helping hiring managers with the first interview round, but, given the current state of AI, these robots will most likely be able to perform most of the hiring process.

We see Tengai AB at the forefront of the future recruitment process, and one potential customer segment will be the international staffing and recruitment industry. So robots are here to help.



Monday 13 May 2019

From Blockchain to Hashgraph

Blockchain is a decentralized, distributed, and public digital ledger that uses hundreds of computers to prove that something, e.g., a transaction or a record, actually happened. Blockchain is a transparent, living digital document for collecting and verifying records.
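
To make the chaining idea concrete, here is a minimal Python sketch of a hash-linked ledger (the function names and the toy transactions are mine, purely for illustration): each record commits to the hash of the previous one, so tampering anywhere breaks verification.

    import hashlib
    import json
    import time

    def sha256(text):
        return hashlib.sha256(text.encode()).hexdigest()

    def block_body(block):
        # The part of the block that is covered by its hash.
        return json.dumps({k: block[k] for k in ("timestamp", "transactions", "prev_hash")},
                          sort_keys=True)

    def make_block(transactions, prev_hash):
        block = {"timestamp": time.time(), "transactions": transactions, "prev_hash": prev_hash}
        block["hash"] = sha256(block_body(block))
        return block

    def verify_chain(chain):
        # Recompute every hash and check the links; any tampering is detected.
        for i, block in enumerate(chain):
            if block["hash"] != sha256(block_body(block)):
                return False
            if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
                return False
        return True

    chain = [make_block(["genesis"], "0" * 64)]
    chain.append(make_block(["Alice pays Bob 5"], chain[-1]["hash"]))
    print(verify_chain(chain))  # True; editing any past transaction would make this False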

Blockchain uses smart contracts to incentivize the usage of technology. Smart contracts permit trusted transactions and agreements to be carried out among different, anonymous parties without the need for a central authority, legal system, or external enforcement mechanism. The transactions are traceable, transparent, and irreversible.

In the past, only official entities could verify transactions; even then, the verification process was not transparent and could easily be corrupted.
This novel technology has a significant impact on society, and it is turning into a philosophical shift in how we re-arrange society and business.
Several companies, like Maersk or Open, use Blockchain to improve the efficiency of their business.

Maersk manages circa 150 ports around the world, where vessels are continually loading and unloading containers. One challenge for the company is tracking the containers.

Deploying a container requires a vessel to coordinate with several companies and handle lots of documentation to track the shipment between two ports. By adopting blockchain technology, Maersk achieved, with the help of smart contracts, full control over the container interactions while reducing the corresponding operational costs by 10%-15%.

Another company, Open – the first end-to-end logistics supply platform for commodities (rice, grain, oil, gas, metal) – had difficulty tackling fraud, corruption, and price manipulation, mainly because transactions were not transparent. Again, with the use of smart contracts, this company improved the efficiency of its operations. An end-to-end logistics process that would originally take weeks now happens in minutes, while reducing the number of intermediaries and, ultimately, cutting the product's selling price.

In terms of social impact, Blockchain can be a useful technology for casting, tracking, and counting votes. Although there are detractors of using this technology to replace the traditional ballot box due to its vulnerabilities, Blockchain has what is necessary to eliminate voter fraud.
Blockchain voting startups are working to overcome this skepticism. Maybe in the future Blockchain, or an alternative technology, could make this scenario a reality.

There are different types of Blockchain implementations with different properties, as well as alternative technologies.

One technology that is gaining hype and the support of prominent groups is Hashgraph. Groups like Deutsche Telekom and Swisscom have been backing Hashgraph. As a result, it is expected that the market will adopt this technology to some extent.

Hedera Hashgraph


Hashgraph is a decentralized, distributed, and public ledger that uses hundreds of computers to register and validate transactions. It is considered an alternative to Blockchain, one that tries to tackle the limitations of that technology concerning performance, fairness, cost, and security. Not many Blockchain applications can do even five transactions per second; in contrast, Hashgraph can do a hundred thousand transactions per second.

Hedera is a company co-founded by Leemon Baird with the goal of building a trusted and secure digital future. Hedera is governed by a council composed of corporations and organizations from different sectors. Hashgraph offers smart contracts, micro-storage, decentralized storage, and allows micro-transactions.

However, what makes Hashgraph different from Blockchain?

Hashgraph is a patented algorithm with its own data structure that achieves good scalability and performance thanks to two techniques: gossip about gossip and virtual voting.

The gossip-about-gossip technique broadcasts messages through an existing network of pre-authorized nodes. In Hashgraph, only these hosts can read and write to the database, and the database must be immune to Sybil and DDoS attacks.

A Sybil attack occurs when an attacker gains a majority of influence by creating several fake identities. As a result, the attacker can approve fraudulent transactions, triggering a double-spend problem. For Bitcoin, this problem was solved through proof of work.
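
As a rough illustration of proof of work (a toy sketch in Python, with a made-up difficulty value): identities are cheap to create, but every valid block costs real computation, which is what blunts Sybil attacks and double-spending.

    import hashlib

    def proof_of_work(block_data, difficulty=4):
        # Search for a nonce whose SHA-256 digest starts with `difficulty` zeros.
        nonce = 0
        target = "0" * difficulty
        while True:
            digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce, digest
            nonce += 1

    nonce, digest = proof_of_work("Alice pays Bob 5")
    print(nonce, digest)  # finding the nonce is expensive; checking it is a single hash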

Hedera, in turn, relies on pre-approved nodes, so transactions from identities outside that set are not considered valid, unlike in a permissionless Blockchain.

Hedera Hashgraph is a probabilistic, that is, non-deterministic, algorithm. As such, it cannot say in advance exactly which messages will be exchanged or how long it will take to reach consensus.

Consensus is expected to be reached as time passes.

At the structural level, a Blockchain is a chain of blocks, where each block contains a hash of its transactions and a pointer to the previous block. In contrast, Hashgraph stores events connected in a directed acyclic graph, where each event contains two pointers to different parents.

We can see in the image below the distinction between a Blockchain block and a Hashgraph event. The event contains two hashes that point to its parents, a timestamp, and a list of transactions. In Blockchain, each block contains a hash that links it to the content of the previous block.

The consensus algorithm in Hashgraph is not proof-of-work or proof-of-stake. It is an asynchronous Byzantine Fault-Tolerant algorithm that is resistant to DDoS and Sybil attacks.
Hashgraph Example

In this example, we have five different nodes in the Hashgraph. Each member starts by creating an event, which is a small data structure in memory. Then the community runs a gossip protocol. E.g., Bob sends Ed all the events that Ed does not know yet. This could be a single transaction that Bob has made. Ed records the sync by creating a new event, which is the new circle in the image. Now Bob has a connection with Ed. Both Bob and Ed have two connections (from Bob to Bob and from Bob to Ed; and from Ed to Ed and from Ed to Bob). The new connections are represented by the hashes that the new event contains.

We can see in the image below that a Hashgraph event contains two hashes that point to its parents, a timestamp, and a list of transactions.
Blockchain vs Hashgraph Event

Then, Ed sends all these events to Carol, and Carol sends them to Bob. This exchange of events propagates the data. Moreover, because the events contain hashes, the structure is called a Hashgraph. Each event contains the hashes of its parent events, and its creator digitally signs it.
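
Here is a rough Python sketch of the event structure and of one gossip sync described above (the field and function names are invented for illustration and are not from any Hedera SDK; signatures are omitted for brevity):

    import hashlib
    import json
    import time
    from dataclasses import dataclass, field

    def sha256(text):
        return hashlib.sha256(text.encode()).hexdigest()

    @dataclass
    class Event:
        creator: str          # node that created the event
        self_parent: str      # hash of the creator's previous event
        other_parent: str     # hash of the latest event heard from the gossip peer
        transactions: list
        timestamp: float = field(default_factory=time.time)

        def digest(self):
            return sha256(json.dumps([self.creator, self.self_parent, self.other_parent,
                                      self.transactions, self.timestamp]))

    # Each node remembers its own latest event (its "tip").
    tips = {"Bob": Event("Bob", "0", "0", []), "Ed": Event("Ed", "0", "0", [])}

    def gossip_sync(sender, receiver, transactions):
        # The receiver records the sync as a new event pointing at both tips.
        new_event = Event(creator=receiver,
                          self_parent=tips[receiver].digest(),
                          other_parent=tips[sender].digest(),
                          transactions=transactions)
        tips[receiver] = new_event
        return new_event

    event = gossip_sync("Bob", "Ed", ["Bob pays Ed 3"])
    print(event.other_parent == tips["Bob"].digest())  # True: the graph records who gossiped to whom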

Hashgraph also has the notion of a round. The hashgraph contains witnesses that vote on the validity of transactions. The first event created by each node in a round is called a witness. Once we have all the witnesses, we have to determine which of them are famous.

Famous witnesses are chosen in an election that involves the witnesses of the next round. A witness is deemed famous if it is seen by many of the witnesses in the next round, and those witnesses vote to decide whether it is famous.

The goal of the election is to determine how widely an event has propagated among the nodes: a valid transaction will have been passed along a lot. So, the longer a transaction is maintained in the hashgraph, the more it can be trusted.

Hedera is the public version of Hashgraph, currently in the works. It will be governed by a council of leading corporations and organizations across multiple industries, and it will use this council, in place of Proof of Work, to prevent Sybil attacks. Hedera claims to achieve four main objectives: to provide smart contracts, micro-storage, decentralized storage, and its own cryptocurrency allowing microtransactions.

There are several differences between Blockchain and Hashgraph. Both ensure that data is not stored at a single location, but Hashgraph is a distributed ledger built on a different data structure – a graph of events – with its own distributed consensus algorithm.

The Hashgraph does not require Proof of Work, and can also deliver low-cost and high performance without a single point of failure. Also, Hashgraph does not need high computation power.

Hashgraph allows hundreds of thousands of transactions per second. In contrast, technologies that use Blockchain have different speeds in terms of transactions per second: Bitcoin can make 7-10 transactions per second, Ethereum has the potential to perform 15-20 transactions per second, and Hyperledger can make thousands of transactions per second.

Hashgraph also proves to be fairer than Blockchain: in a Blockchain, miners can choose the order of transactions, delay them, or even stop them from entering the block. In Hashgraph, however, a consensus timestamp is used, preventing individuals from changing the order of transactions.

Hashgraph is a promising technology, but it also comes with some limitations.

Currently, the technology has only been deployed in private, permission-based networks. It is still to be tested and explored in public networks. In the gossip-about-gossip technique, when a node passes information to another node, there is a chance that the closest nodes are malicious, which may prevent the information from reaching other nodes.

The technology behind Hedera Hashgraph is exceptionally intriguing, but its real potential and effectiveness remain to be demonstrated.

Tuesday 29 January 2019

Byzantine Fault Tolerance, from Theory to Reality

Many people connected to the computer business in several distinct roles (programmers, architects, and many more) don't know the concept of Byzantine failures. Those who have heard the term once most likely forgot it immediately.

In the most general sense, Byzantine failures are defined as arbitrary deviations of a process due to any anomaly. The term "any" really means any fault (permanent or occasional) that occurs in the software or the hardware. This type of fault is pivotal in the distributed-systems world and is often neglected when creating systems or when debugging them. This lack of recognition keeps the topic confined to academia or to singular projects.

Let's separate the concepts of Byzantine faults and failures, even though I will sometimes use the terms interchangeably:

  1. Byzantine faults: a fault presenting different symptoms to different observers.
  2. Byzantine failure: the loss of a system service due to a Byzantine fault.

This paper from Driscoll et al. revisits the Byzantine problem from the practitioner's perspective and alerts everyone that this type of fault should not be neglected.

The term Byzantine failure was proposed long ago by Leslie Lamport et al. (The Byzantine Generals Problem). They present the scenario of a group of generals, each commanding a division, who surround an enemy camp. The generals must communicate with each other to agree on a plan of action -- whether to attack or retreat. If they all attack at the same time, they will win. If they all retreat, they will live to fight another day. If only some of the divisions attack, those generals will die.

This sounds like an easy problem, but in fact the generals are spread around the enemy and can only communicate with each other through messengers. A messenger can be killed or replaced by the enemy, or the enemy can alter the message. Moreover, some generals may be traitors and send inconsistent messages to keep the loyal generals from reaching consensus.

The Lamport paper discusses the problem in the context of oral messages and written, signed messages. For oral messages, it has been proven that to tolerate f traitorous generals, 3f+1 generals and f+1 rounds of information exchange are necessary. If signed messages are used to prove the authenticity of the senders, the requirement can be reduced to 2f+1 generals and f+1 rounds of information exchange.
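
To make those bounds concrete, here is a tiny Python helper (it simply mirrors the numbers quoted above, nothing more):

    def min_generals(f, signed=False):
        # Oral messages need 3f + 1 generals; signed messages reduce this to 2f + 1.
        # Both variants need f + 1 rounds of information exchange.
        return 2 * f + 1 if signed else 3 * f + 1

    print(min_generals(1))               # 4: a single traitor already forces four generals
    print(min_generals(1, signed=True))  # 3
    print(min_generals(2))               # 7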

Consensus is an integral part of a Byzantine fault-tolerant system. Addressing the consensus problem at the protocol level reduces the software complexity but leaves the protocol itself vulnerable to Byzantine failures.

The common use of COTS components increases the probability of a system suffering a Byzantine fault. Good hardware alone is not sufficient! Small electrical perturbations in the components can cause the hardware to produce a "1/2" input: for example, a CPU that suffers a voltage perturbation may emit a bit that cannot be resolved as a 0 or a 1. Propagation can even amplify the error, and the whole system can fail. Moreover, this error can become permanent, and the system will never reach a correct state.

The paper presents the Time-Triggered Architecture (TTA) as a generic architecture for fault-tolerant distributed real-time systems. This architecture targets the automotive industry, where communication requirements are high. It implements a deterministic fault-tolerant communication protocol (TTP/C) that guarantees synchronization and deterministic message communication. The protocol provides a membership service that allows global consensus on message distribution and can tolerate this type of fault as long as the errors are transient. In testing, radiating components with ions was sufficient to introduce such faults. This careful design led to the introduction of centralized filtering, which turns invalid signals (e.g., a 1/2 signal) into valid logic signals.

The paper also presents the Multi-Microprocessor Flight Control System (MMFCS), developed in the '70s. This system introduced the concept of self-checking pairs and used a dual self-checking pair bus distribution topology. The self-checking pair comparisons of processors, bus transmission, and reception enabled precise detection of Byzantine faults. The MMFCS observations concluded that Byzantine faults are not as rare as one might suppose. They can persist permanently or occur intermittently. It is a misconception to assume that random events are uniformly distributed in time; in reality, the precipitating conditions can make a fault persist. Similarly, it is a myth that Byzantine faults won't propagate to other components. In the worst-case scenario, a Byzantine fault can shut down the whole system.

Effective methods for dealing with Byzantine faults can be divided into three types: full exchange, hierarchical exchange, and filtering topologies. These methods can be used together or separately. The first method implements the exchanges described in the Lamport paper. The second uses private exchanges inside subsets of a system's nodes, followed by simplified exchanges between the subsets. The third tries to remove Byzantine faults via filtering. The paper presents example implementations of these three methods. Although these examples are implemented in hardware, software solutions also exist.

Overall, Byzantine faults are very real, and nowadays the use of COTS components makes this type of fault even more severe. It is very hard, or impossible, to design a Byzantine fault-tolerant system without consensus. Consequently, the term "Byzantine fault" should be on the mind of every software and hardware engineer, and this paper gives credible examples. Personally, I hope that companies producing hardware and software start to include this topic in their plans.

[1] Byzantine Fault Tolerance, from Theory to Reality, K. Driscoll, B. Hall, et al., Computer Safety, Reliability, and Security
[2] The Byzantine Generals Problem, L. Lamport, R. Shostak, M. Pease, ACM

Tuesday 24 January 2017

Hadoop ecosystem

The Hadoop ecosystem consists of several Hadoop related projects that address specific areas like distributed file system, distributed programming, NoSQL databases, and much more.

Here you have a list of several related projects developed by companies like Facebook, LinkedIn, and many others.

MapReduce vs Spark

This paper presents a comparison of MapReduce and Spark. Nowadays, Spark is becoming very popular, and it can generally achieve better results than vanilla MapReduce. Here is a list of the comparison points made in the paper.

  • Spark does everything in memory, so it needs a lot of memory to process big files.
  • It is generally easier to work with Spark than with MapReduce, but several tools are available for MapReduce to make this interaction easier.
  • The main causes of Spark's speedups are the efficiency of the hash-based aggregation component for "combine" stage, as well as reduced CPU and disk overheads due to RDD caching in Spark.
  • Spark uses Resilient Distributed Datasets (RDDs), which implement in-memory data structures used to cache intermediate data across a set of nodes (see the sketch after this list).
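
For a concrete feel of RDD caching, here is a minimal PySpark sketch (the input path is a placeholder, not a real dataset): the intermediate word counts are computed once, cached in memory, and reused by two different actions.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount-cache-demo")

    lines = sc.textFile("input.txt")              # placeholder input path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .cache())                      # keep the intermediate RDD in memory

    print(counts.count())   # first action materializes and caches the RDD
    print(counts.take(5))   # second action reuses the cached partitions

    sc.stop()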

In Spark, there may be many stages (map and reduce) built at shuffle dependencies, whereas MapReduce has only one map and one reduce stage.
Spark and MapReduce take similar time in the reduce tasks because the reduce stage is network-bound and the amount of data to shuffle is similar in both cases.


So, on which points does Spark win over MapReduce?

  1. Spark is faster than MapReduce in task initialization.
  2. Spark is circa 3x faster in input read and map operations.
  3. Spark is circa 6x faster in combine stage, because the hash-based combine is more efficient than the sort-based combine.
  4. Spark has low complexity in the in-memory collection, but MapReduce can be faster for bigger files. E.g., for a 500GB input, MapReduce is faster because it reads splits from the input file, whereas Spark reads the whole file; in this case, there is significant CPU overhead for swapping pages in the OS buffer cache.
  5. Spark and MapReduce are CPU-bound in the map stage. For Spark, the disk I/O is significantly reduced in the map stage compared to the sampling stage, although its map stage also scans the whole input file.
  6. Spark does not support the overlap between shuffle write and read stage. Spark may want to support this overlap in the future to improve performance.
  7. MapReduce is slower than Spark in these two points:
        1. The load time in MapReduce is much slower than that in Spark.
        2. The total time for reading the input (Read) and for applying the map function to the input (Map) is higher than in Spark.
  8. Spark performs better than MapReduce due to these two points:
        1. Spark reads part of the input from the OS buffer cache, since its sampling stage scans the whole input file. On the other hand, MapReduce only partially reads the input file during sampling, so the OS buffer cache is not very effective during the map stage.
        2. MapReduce collects the map output in a map-side buffer before flushing it to disk, but Spark's hash-based shuffle writer writes each map output record directly to disk, which reduces latency.

Here is a list of the advantages of using Spark.

Spark is fast because it executes batch processing jobs about 10 to 100 times faster than the Hadoop MapReduce framework, merely by cutting down on the number of reads and writes to disk. The concept of RDDs (Resilient Distributed Datasets) lets you keep data in memory and persist it to disk only if required, and it does not have any synchronization barriers that could slow down the process.

Spark is easy to manage for two reasons: (i) With Spark, it is possible to control different kinds of workloads, so if there is an interaction between various workloads in the same process, it is easier to manage and secure such workloads, which is a limitation with MapReduce. It is possible to perform Streaming, Batch Processing, and Machine Learning all in the same cluster with Spark. (ii) Spark Streaming is based on Discretized Streams, which propose a new model for doing windowed computations on streams using micro-batches.
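
As a small sketch of the Discretized Streams model (host and port are placeholders), the classic streaming word count chops the input into one-second micro-batches and processes each batch with ordinary RDD operations:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "dstream-demo")
    ssc = StreamingContext(sc, batchDuration=1)        # one-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)    # placeholder source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                    # print each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()
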
Spark has a caching system that ensures lower-latency computations by caching partial results across the memory of its distributed workers. MapReduce is completely disk-oriented, so it does not have a cache mechanism as efficient as Spark's.

RDD is the main abstraction of Spark. It allows recovery of failed nodes by re-computation of the DAG. Storing a Spark job as a DAG allows for lazy computation of RDDs and also lets Spark's optimization engine schedule the flow in ways that make a big difference in performance.
Spark has retries per task and speculative execution just like MapReduce.

Spark needs memory at least as large as the amount of data you need to process, because data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory. On the other hand, considering Spark's benchmarks, it may be more cost-effective because less hardware can perform the same tasks much faster (especially in the cloud, where compute power is paid per use).

Wednesday 23 November 2016

XFT: Practical Fault Tolerance Beyond Crashes

At OSDI'16, a new fault-tolerance approach called XFT was introduced for building reliable and secure distributed systems that tolerate crash and non-crash (Byzantine) faults. Paper here.

XFT allows the design of reliable protocols that tolerate crash machine faults regardless of the number of network faults and, at the same time, tolerate non-crash machine faults when the number of faulty or partitioned machines is within a threshold.

The intuition behind XFT is that the conditions in which an adversary has total control of both the network and the nodes are very rare. Therefore, using a full BFT protocol in most applications is excessive and a little over the top. XFT is for the cases in which an adversary cannot easily coordinate enough network partitions and non-crash machine faults at the same time. XFT can be used to protect against "accidental" non-crash faults, which happen occasionally; or against malicious non-crash faults, as long as the attacker does not perform a coordinated attack that compromises Byzantine machines and partitions the network beyond a certain threshold, or cannot compromise the communication among a large number of correct participants.




State-of-the-art asynchronous CFT protocols guarantee consistency despite any number of crash faults or up to n-1 partitioned replicas. They also guarantee availability whenever a majority of replicas are correct.

XFT, in turn, guarantees consistency in two modes: (i) without non-crash faults, despite any number of crash-faulty and partitioned replicas, and (ii) with non-crash faults, whenever a majority of replicas are correct and not partitioned, i.e., provided the sum of all kinds of faults (machine or network) remains a minority of the replicas. Similarly, it guarantees availability whenever a majority of replicas are correct and not partitioned.
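
Restating those two modes as a small Python predicate (just a sketch of the condition described above; the parameter names are mine):

    def xft_consistent(n, t_crash, t_noncrash, t_partitioned):
        # (i) With no non-crash faults, consistency holds for any number of
        #     crash faults and partitions.
        # (ii) With non-crash faults, it holds while the sum of all faults
        #      stays a minority of the n replicas.
        if t_noncrash == 0:
            return True
        return t_crash + t_noncrash + t_partitioned <= (n - 1) // 2

    # With n = 5 replicas, a minority is at most 2:
    print(xft_consistent(5, t_crash=4, t_noncrash=0, t_partitioned=0))  # True (CFT-style case)
    print(xft_consistent(5, t_crash=1, t_noncrash=1, t_partitioned=0))  # True
    print(xft_consistent(5, t_crash=2, t_noncrash=1, t_partitioned=0))  # False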

The consistency guarantees of XFT are incomparable to those of asynchronous BFT. On the one hand, XFT is consistent when the number of non-crash faults is in the range [n/3, n/2), whereas asynchronous BFT is not. On the other hand, unlike XFT, asynchronous BFT remains consistent with fewer than n/3 non-crash faults regardless of network partitions.

In this paper, the authors also present XPaxos, a Paxos-like algorithm designed specifically for the XFT model. I will explain this protocol in another post.