Topic: Failsafe systems can fail too
LA9XSA
Member

Posts: 376




« on: June 11, 2013, 05:21:22 PM »

"Digital systems can't fail, so emcomm volunteers need not bother with learning to pass official traffic."
"Handsets can't be hacked without stealing each unit and reprogramming it yourself"
"Handsets can't crash"
"Systems whch are not connected to the Internet can't be hacked."
"If the city's radio net is down, everything else is down too, so why bother"
"If communications fail, we're done for anyway, so I'd rather camp out in the water tower and shoot looters than stand around with a radio and look silly"
"Maybe communications can fail out in the backwoods or in the third world, but not in my big city in the USA"

While not verbatim quotes, these are notions that arose in a couple of earlier threads which dealt with volunteer "emcomm" on a more existential level; questions like whether it is right to require that volunteers be trained and exercised before an emergency happens, or whether it should all be ad hoc. Leaving the existential questions for other threads, I'd like to discuss a few pertinent facts about the current information technology environment - facts that I think should inform the wider debate.

1: Digital systems can be more robust than old analog systems, if done right.
2: Digital systems can fail in new ways that wouldn't have affected analog systems - despite statement 1 these failures can sometimes knock the whole system out.
3: Digital systems may contain bugs or vulnerabilities which can be exploited by criminals, terrorists or enemies - or which may cause the system to fail on its own.
4: There is a thriving black market in vulnerabilities and exploits.
5: Attackers don't need to be geniuses.
6: A communications emergency can be caused by trivial events; it doesn't need to be "the end of the world as we know it".
7: Systems can be attacked even if they're not supposed to be reachable over the Internet.
8: Both the good guys and the bad guys take advantage of new technology.
9: Sometimes the Internet is available even if the phones or radio systems are down, or the other way around.
10: Served agencies may expect more digital communications, and new services, from their volunteers

I'd like to flesh out that list of statements with real world incidents, (mis)use cases, and references as we go along, but to avoid a wall of text in the first post, I invite other posters to help turn this into a conversation. Do you agree or disagree with my statements, or perhaps with the paraphrased statements at the top of the post? Have I paraphrased unfairly? Do you want to add to the list?

As I said, I won't elaborate on every point in the first post. I'll just address a notion from an earlier (now closed) thread: that digital handsets can't just stop working - whether by malice or accident - unless analog is also affected by the same issue. That falls under statement 3, so I'll go into detail on that one first.

3: Digital systems may contain bugs or vulnerabilities which can be exploited by criminals, terrorists or enemies - or which may cause the system to fail on its own.

Misuse case: Our imaginary band of terrorists or criminals are not super geniuses, but they've bought a 0-day exploit on the black market that allows them to shut down the particular model of public service handsets in your city. Even though the handsets are not supposed to be programmable over the air, a coding mistake made by the manufacturer a couple of years ago has made it possible anyway. The attackers have effectively shut down local public service communications without having to deploy active jamming equipment that can be found with direction finding. They just uploaded their exploit with a short burst transmission. The base stations still work, but all the handsets that were turned on during the attack are "bricked" and need to be sent in for repair. Analog radios, even on nearby frequencies, are unaffected by this particular attack. Digital public service radios from a different manufacturer would also be unaffected.

Background:

The advanced digital technology, with all its added functionality, usability and (hopefully) robustness, comes at the price of complexity. The firmware and software that goes into modern systems can contain thousands or millions of lines of code, spread out in sections written by hundreds of different programmers from different companies. The code may have been written in the course of decades. There are interdependencies to manage. This code can contain programming errors - bugs - or there might have been mistaken assumptions made in the design - design flaws - both of which are faults in the system. This could be in the firmware that runs on a digital handset, or perhaps on a base station unit.

When speaking about mistakes that eventually cause a problem if left alone, we say that the bug or design flaw is the underlying fault, which gives rise to an error - an incorrect system state - which might finally lead to a failure, when the system does something it should not be doing, such as stopping completely, slowing down, or sending corrupted data. The error can lurk until a specific system state triggers the failure, such as an unusual amount of data, a particular date, or a counter rolling over.
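
To make the "counter rolling over" case concrete, here is a minimal sketch in C - my own illustration, not taken from any real handset firmware - of how such a fault can lie dormant for years and then surface as a failure:

Code:
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 32-bit millisecond uptime counter, as kept by many
   embedded devices; it wraps around to zero after about 49.7 days. */
static uint32_t uptime_ms;

/* The fault: comparing absolute timestamps breaks when started_at is
   near the top of the counter range and the counter has since wrapped. */
int timed_out_buggy(uint32_t started_at, uint32_t timeout) {
    return uptime_ms > started_at + timeout;
}

/* Correct version: unsigned subtraction is defined modulo 2^32, so the
   elapsed time comes out right even across the wraparound. */
int timed_out_fixed(uint32_t started_at, uint32_t timeout) {
    return (uint32_t)(uptime_ms - started_at) >= timeout;
}

int main(void) {
    uptime_ms = 0xFFFFFF00u;         /* 256 ms before the rollover */
    uint32_t started = uptime_ms;
    uptime_ms += 512;                /* 512 ms pass; the counter wraps */
    /* 512 ms have elapsed against a 128 ms timeout, yet the buggy test
       says "not timed out" (prints 0); the fixed test prints 1. */
    printf("buggy: %d  fixed: %d\n",
           timed_out_buggy(started, 128),
           timed_out_fixed(started, 128));
    return 0;
}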

When speaking about malicious attackers, we call faults vulnerabilities. Remember, it usually isn't a malicious programmer who created the vulnerability - it is usually an accident or due to poor developer habits. Maybe the code was correct when written, but the requirements have changed since then.

Some vulnerabilities are known, while others are not known yet. Vulnerabilities can be exploited by attackers. Some exploits only allow an attacker to slow down the system, or shut it down, while others allow the attacker to take over the system and command it to do as he wants. Hopefully the manufacturer finds the vulnerability before anyone else does, or a Good Samaritan lets them know about it, so it can be fixed in a firmware upgrade - and all the users apply these patches diligently.  Unfortunately, sometimes patches come too late, or the manufacturer doesn't even know about it until after an exploit has been made and used - a "zero-day exploit".

There are ways to mathematically prove that your code does what it's supposed to do, but they're usually reserved for only the most critical core of a system. A risk-based development cycle that focuses on security and reliability from the beginning through deployment can identify which parts of the system might benefit from that level of scrutiny. Part of that risk analysis means keeping tabs on likely threats to the system, using tools such as misuse cases, and observing what the known "bad guys" are doing. If a vulnerability slips by anyway, patches should be made and applied swiftly.

If you are interested in this topic, search terms for further reading are:
"Software security" or "Application security" (developing secure software, as opposed to "information security" which deals more with concepts like cryptography and access control)
"zero-day exploit", "risk analysis" and" software quality"

KD0REQ
Member
Posts: 1047
« Reply #1 on: June 20, 2013, 10:06:11 AM »

any and every "foolproof" system is just a magnet for new and improved fools.  expect imperfection.

G7MRV
Member
Posts: 481
« Reply #2 on: June 22, 2013, 06:05:40 AM »

Nothing in the phrase 'fail-safe' implies failure cannot happen. All it implies is that when the system does fail, the result should not cause more harm than was already the case. A failsafe system is a backup to another system to prevent more harm being done by the primary failure.

LA9XSA
Member
Posts: 376
« Reply #3 on: June 24, 2013, 02:46:47 AM »

Exactly. This is addressing point 1 in my list. A well designed failsafe system will fail gracefully, and keep working with the system in a degraded state. For example, if the control element stops working, the repeaters still keep working, and if the repeaters stop working, you can still work simplex. Or, there could be redundancy built in, such that if one control computer freezes, another computer notices this and takes over its responsibilities, while alerting operators and maintenance about the problem. Or, like on the Apollo program, you have several computers voting on what is the correct result of a calculation.
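
To illustrate the voting idea, here's a minimal 2-out-of-3 majority voter in C - purely my own illustration, nothing like the actual Apollo code - where one failed channel is simply outvoted:

Code:
#include <stdio.h>

/* Returns the majority value of three redundant results. If all three
   disagree, the mismatch is flagged so operators can be alerted. */
int vote(int a, int b, int c, int *disagreement) {
    *disagreement = 0;
    if (a == b || a == c) return a;   /* a agrees with at least one peer */
    if (b == c) return b;             /* a is the odd one out */
    *disagreement = 1;                /* total disagreement: no majority */
    return a;                         /* caller must treat result as suspect */
}

int main(void) {
    int bad;
    /* channel 2 has failed and returns garbage; it is simply outvoted */
    printf("voted result: %d\n", vote(42, 9999, 42, &bad));
    return 0;
}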

Failover systems can work so seamlessly that, without an alert announcing the failure, no human being would notice that a component had failed at all.

This is an advantage that properly designed digital systems can have over analog systems; in other words a case where the digital system will keep working where the analog system would have failed.

There are cases, however, when the sort of fault mentioned in point 3 is lurking in the actual failure handling system, so the failover system is one of those parts of a system which should be subject to mathematical scrutiny and exhaustive testing. As an example of such a fault, here's a public bug tracker entry for a broker system where the failover thread itself went into an infinite loop before the bug was fixed: http://forge.centreon.com/issues/3527

Relevant search terms: "fault tolerance", "failover", "redundancy"

W9IQ
Member
Posts: 104
« Reply #4 on: June 24, 2013, 04:00:55 PM »

From a digital security perspective there is an axiom that is widely accepted:

"The more complex the software product, the larger the attack surface."

There is no programmatic way to address this issue completely. We see this clearly demonstrated by the hundreds of vulnerabilities per year discovered in Microsoft operating systems. If such a programmatic tool existed, it would be to Microsoft's commercial advantage to deploy it. Quality is not something you can inspect into the system with some type of post-production tool.

From a reliability engineering perspective, if identical complex software systems are combined in a parallel system with the aim of providing redundancy (failure resistance), it can be mathematically shown that no additional benefit is gained other than to potentially delay the onset of a total system failure.
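
To put rough numbers on that, here's a back-of-the-envelope sketch in C. The figures are invented for illustration, but they show how a shared (common-mode) software defect caps the reliability of the pair no matter how much hardware redundancy is added:

Code:
#include <stdio.h>

int main(void) {
    double r_hw = 0.95;   /* assumed chance one unit's hardware survives */
    double p_bug = 0.10;  /* assumed chance the shared software defect triggers */

    /* Hardware redundancy helps: the pair works if either unit survives. */
    double r_pair_hw = 1.0 - (1.0 - r_hw) * (1.0 - r_hw);

    /* The identical bug hits both units at once, capping the whole pair. */
    double r_pair = (1.0 - p_bug) * r_pair_hw;

    printf("single unit:         %.4f\n", (1.0 - p_bug) * r_hw); /* 0.8550 */
    printf("identical pair:      %.4f\n", r_pair);               /* 0.8978 */
    printf("bug-imposed ceiling: %.4f\n", 1.0 - p_bug);          /* 0.9000 */
    return 0;
}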

When looking at this issue from a ham radio perspective, the advantage that hams have over any incumbent communications system is that we uniquely possess the ability to repair, extend, and enhance our communications ability on an ongoing basis. This is independent of digital versus analog or simple versus complex. This capability as an adaptive system is not addressed in traditional reliability engineering. It is in sharp contrast to the "appliance operator" represented by the typical civil servant who has no clue what to do if their communications equipment is not working (no insult to our civil servants is intended). Interestingly, reliability engineering has clear mathematical models to project this outcome.

- Glenn W9IQ

LA9XSA
Member
Posts: 376
« Reply #5 on: June 25, 2013, 04:31:05 AM »

It would take millions (billions?) of years of present day computing time to test a system against all possible states and inputs, yes, so exhaustive testing is reserved for only the most critical parts of a system. A good design for security would make it simple to decide which components need this level of pre-coding scrutiny and post-coding testing. Good candidates might be the access control model, the microkernel, and the failover functionality.
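
For scale, here's the back-of-the-envelope arithmetic behind that claim (the test rate is an assumption): exhaustively testing even a single 64-bit input is already out of reach.

Code:
#include <stdio.h>

int main(void) {
    double tests_per_sec = 1e9;              /* assumed: one billion tests/s */
    double states = 18446744073709551616.0;  /* 2^64 possible 64-bit inputs */
    double seconds = states / tests_per_sec;
    printf("years to test all 64-bit inputs: %.0f\n",
           seconds / (3600.0 * 24 * 365));   /* ~585 years */
    return 0;
}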

KB8VUL
Member
Posts: 136
« Reply #6 on: June 29, 2013, 04:57:55 AM »

Not sure that comparing a Microsoft operating system on a PC platform to a public safety radio is a fair comparison. The PC, with its multiple functions, its requirement to support multiple third-party applications, and its far more open platform design, is by design more open to external intrusion.

Radio systems, even the more complex ones, are designed for a single primary purpose. They pass voice traffic from the subscriber to other subscribers and dispatch interfaces. They don't allow third-party applications that open access to circumvent the primary function of the radio.

I can't say that there is no way that these radios could be affected by external methods, but it's far more difficult than hacking a PC or creating a virus for a Windows PC.
The other thing to consider is that if someone is going to go to the effort of creating and deploying such technology to disable public safety communications, they are going to know what the backups for that system are and have technology to deploy that will disable those backups as well. This includes ham radio communications. So the idea that we as hams need to be vigilant because someone may hack the local public safety system and disable it and we will need to fill the gap is folly. Our ability to communicate would be equally affected.

AA4PB
Member
Posts: 13032
« Reply #7 on: June 29, 2013, 05:48:20 AM »

It's easy to do a denial of service attack against most radio systems, whether analog or digital. All you have to do is throw up a big carrier on the operating frequency.

NA4IT
Member
Posts: 893
« Reply #8 on: June 29, 2013, 05:54:37 AM »

The ultimate "fail safe" system.... "cell phone backup"...

AA4PB
Member
Posts: 13032
« Reply #9 on: June 29, 2013, 09:10:22 AM »

The ultimate "fail safe" system.... "cell phone backup"...

Tell that to the folks that tried to get through to NYC during 9/11 or New Orleans during Katrina.

KC8YHN
Member
Posts: 30
« Reply #10 on: July 01, 2013, 07:48:24 PM »

Just a thought I learned a long time ago:

The more complex the system built to ensure a fail-safe or a fallback, the more likely the system will fail at one point or another.

Keeping it simple seems to work better than making it complex.

Just another thought: I don't worry about the bad guys, and I don't know why anyone else would, because whatever system we use, they will always find a way to get into it. I worry about things we can't really control, like bad weather or an earthquake, and how we respond to that.

K1CJS
Member
Posts: 6061
« Reply #11 on: July 02, 2013, 05:15:01 AM »

No matter what fail-safe systems are in place, they are made by man--an imperfect being who makes mistakes. Also, the more fail-safe systems there are (or the more complex the system is), the more failure-prone they become. Proof that man makes mistakes with 'failure-proof' systems? Read 'Colossus' by D. F. Jones.

NA4IT
Member
Posts: 893
« Reply #12 on: July 02, 2013, 06:16:51 AM »

The ultimate "fail safe" system.... "cell phone backup"...

Oh, I meant it fully as a joke...

AA4PB
Member
Posts: 13032
« Reply #13 on: July 02, 2013, 08:52:30 AM »

No system is completely fail-proof. However, the more alternate paths you have, the less likely you are to have a complete system failure. For example, if your Internet connection has only one route between you and an e-mail destination, then if anything along that route fails, your e-mail doesn't get through. If you have 20 alternate routes that can be used, your chances of getting through are vastly better.
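
The standard independent-failure model makes that benefit precise. A quick sketch in C, with an assumed per-route availability (real routes are rarely fully independent, so treat this as an upper bound):

Code:
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 0.70;                 /* assumed chance any one route works */
    int ns[] = {1, 2, 5, 10, 20};
    for (int i = 0; i < 5; i++) {
        int n = ns[i];
        /* The message fails only if all n independent routes fail. */
        printf("%2d route(s): P(message gets through) = %.6f\n",
               n, 1.0 - pow(1.0 - a, n));
    }
    return 0;
}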

W6EM
Member
Posts: 900
« Reply #14 on: July 17, 2013, 07:08:59 PM »

"Failover" has been defined to be a state where trunked systems revert to zone repeaters.  Essentially like conventional repeaters.  So, if a system has ten trunked "node" sites, they will function as conventional repeaters if the trunking system controller takes a hike.

Oh, great. In failover, that amounts to a bunch of very low power mobiles talking only to themselves in each node. And they can't talk to dispatch or to other mobiles more than a mile or two from that node.

Digital protocols are fine. That is, as long as they are used in conventional mobile relay fashion. And yes, simplex will work in such concepts since mobiles typically operate at relatively high power levels.

Trunked systems made communication by first responders post-Katrina and post-9/11 very difficult, at best. Classic single-point-of-failure vulnerability in both cases.

Don't ask me if I think Smartnet or EDACS are smart.    :-(  And don't ask the City of Detroit's police chief what he thinks about them either. They had a multi-day total system outage earlier this month.....