25 May 2010

Mistakes in silicon chips to help boost computer power


by Mark Ward
Technology correspondent, BBC News

Making sure chips do not make mistakes has a financial and power cost


Silicon chips that are allowed to make mistakes could help ensure computers continue to get more powerful, say US researchers.


As components shrink, chip makers struggle to get more performance out of them while meeting power needs.

Research suggests that relaxing the rules governing how chips work, and when they must work correctly, could mean they use less power and still get a performance boost.

Special software is also needed to cope with the error-laden chips.

The silicon industry is defined by Moore's Law, which predicts that the number of transistors that can fit on a given area of silicon, for a given price, will double every 18-24 months.

This is usually accomplished by shrinking transistors and typically means that processing steadily gets more powerful.


Chips that make mistakes demand less power

Transistors are tiny switches that are used as the fundamental building blocks of silicon chips. However, many experts point out that the relentless march of Moore's Law could stumble when components get so small they become unreliable.

The unreliability - or "statistical variability" - of chips is a problem that many researchers are trying to deal with, said Professor Asen Asenov from the Department of Electronics and Electrical Engineering at the University of Glasgow.

Variability increases as components shrink, said Professor Asenov, who has been using large scale simulations on grid computers to study how the behaviour of transistors changes as they get smaller.

For Professor Rakesh Kumar at the University of Illinois, the demise of Moore's Law is being hastened by an insistence on making silicon chips operate flawlessly.

Professor Kumar said variations in manufacturing, environment, and workload can conspire to make a chip suffer errors. Manufacturers try to ensure that whatever happens, he said, the chip works correctly.

"It's a case of 'if the software asks the chip to do something it does it at any cost,'" he said.

Professor Kumar's research suggests that the pursuit of perfection forces manufacturers to make some poor choices.

"To ensure correct operation you are purposefully running the chips at higher power than you need to," he said.



Error condition


That insistence on perfection also pushes up manufacturing costs because many chips have to be discarded if they fall short.

Professor Kumar said that it would become harder and harder for chip makers to ensure instructions are executed flawlessly as components shrink.

The tiny components in chips are already starting to give rise to errors. Instead of trying to eliminate these errors, he said, chip makers should embrace them to produce so-called "stochastic processors" that are subject to random errors.

"The hardware is already stochastic so why continue pretending it's flawless?" he asked. "Why put in more and more money to make it look flawless?"

Through research part-funded by Intel, Professor Kumar and his colleagues are designing processors that forgo flawlessness. Instead they attempt to manage the number and type of errors so they can be coped with efficiently.

The clocks in chips keep processing co-ordinated.

An example error, said Professor Kumar, is when a chip fails to complete a cycle of instructions within a given time. The workings of most chips are governed by a clock and the data processing they do advances with each tick of that time-keeper.

The upside of using chips that can make mistakes is much reduced power consumption.

Depending on how many errors a designer is prepared to tolerate, power consumption can be cut by up to 30%, he said. With only 1% error rates, power can be cut by 23%.

In many cases the errors will not have a significant impact on the workings of a computer. In other cases, he said, they could cause a system to crash.

To cope with this, Professor Kumar and colleagues are researching ways to make applications more tolerant of mistakes.

The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.
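As a rough illustration of how an error might cost time rather than correctness, the sketch below runs a self-correcting iteration (Newton's method for a square root) on a simulated unreliable adder. It is not presented as Professor Kumar's technique: the names flaky_add and robust_sqrt, the fault model and the error rates are all assumptions made for the example, and the final convergence check is assumed to run on reliable hardware.

```python
import random


def flaky_add(a, b, rng, error_rate):
    """Add two numbers on simulated unreliable hardware (hypothetical fault model).

    With probability `error_rate` the sum is scaled by a random factor
    between 0.5 and 1.5, standing in for a hardware error.
    """
    result = a + b
    if rng.random() < error_rate:
        result *= 1.0 + rng.uniform(-0.5, 0.5)
    return result


def robust_sqrt(value, error_rate, seed=0, tol=1e-9, max_steps=10_000):
    """Estimate sqrt(value) with Newton's method; return (estimate, steps taken).

    Each step pulls the estimate back toward the true root, so a corrupted
    step is repaired by later steps instead of crashing the program.
    """
    if value <= 0:
        raise ValueError("value must be positive")
    rng = random.Random(seed)
    x = max(value, 1.0)  # positive starting guess
    for step in range(1, max_steps + 1):
        # Newton update x <- (x + value / x) / 2, with the addition made unreliable.
        x = flaky_add(x, value / x, rng, error_rate) / 2.0
        # Assumption: this convergence check itself runs without errors.
        if abs(x * x - value) < tol:
            return x, step
    raise RuntimeError("did not converge within max_steps")


if __name__ == "__main__":
    for rate in (0.0, 0.2):
        estimate, steps = robust_sqrt(2.0, error_rate=rate, seed=1)
        print(f"error rate {rate}: estimate {estimate:.9f} after {steps} steps")
```

In this toy model a faulty addition knocks the estimate off course, but the following iterations bring it back, so the program still returns an answer within the tolerance and typically just needs more steps to get there.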

In another approach, the more robust software logs a user's actions. As the software is used, this log can be consulted to spot when something unexpected occurs.

The work on applications and programs may be more immediately useful, said Professor Kumar, as it can be applied to existing applications. This should make them cope with bugs that are showing up now and prepare them for use with future processors.
