尽管最近有些小小的迷茫,但是刷题不能马虎啊!最近在做790. 数的三次方根1,所以回顾一下C语言中是如何表示浮点数类型的。
以下是ChatGPT的回复:
In C, floating point types are used to represent real numbers (numbers with fractional parts) and are typically implemented following the IEEE 754 standard. The most common floating point types in C are
float
,double
, andlong double
. These types differ in their precision and range, which is a result of how they represent a number using three components: the sign bit, the exponent, and the mantissa (or significand).
Components of a Floating Point Number in IEEE 754
Sign Bit:
This is a single bit representing the sign of the number.
0 indicates a positive number, and 1 indicates a negative number.
Exponent:
The exponent is used to represent the scale (or magnitude) of the number.
It is stored in “biased” form, meaning that a fixed bias is added to the actual exponent value to get a non-negative value stored in the exponent field. This bias is 2 ( n ? 1 ) ? 1 2^{(n-1)} - 1 2(n?1)?1, where n n n is the number of bits in the exponent.
For example, in a 32-bit float, 8 bits are used for the exponent, and the bias is 127 = 2 7 ? 1 2^7 - 1 27?1.
Mantissa (Significand):
The mantissa represents the precision bits of the number. It’s the fractional part of the number, normalized such that its range is just below 1 to just less than 2 (in binary, this means it starts with 1.).
However, the leading 1 is implicit and usually not stored (known as the “hidden bit” assumption). This is due to the fact that normalized numbers always start with 1, and hence it can be omitted to save storage space.
The remaining part of the mantissa is what is actually stored and represents the fraction.
Floating Point Types in C
float:
Typically 32 bits: 1 bit for sign, 8 bits for exponent, and 23 bits for mantissa.
Precision of about 6-7 decimal digits.
double:
Typically 64 bits: 1 bit for sign, 11 bits for exponent, and 52 bits for mantissa.
Precision of about 15-16 decimal digits.
long double:
Size and precision can vary depending on the system and compiler, but it is at least as large as double.
Often 80 or 128 bits on many systems, with varying distribution among sign, exponent, and mantissa.
Considerations
Floating point numbers can represent a very wide range of values, but they do so with limited precision. Operations on floating point numbers can introduce rounding errors.
Special values are represented in IEEE 754, such as Infinity (positive and negative) and NaN (Not a Number), which result from operations that have undefined or unrepresentable results in real numbers.
Due to their inexact nature, floating point numbers should not be used where exact results are required, such as in monetary calculations.
Let’s break down how a floating-point number is represented using the IEEE 754 standard, which is commonly used in C for the float and double types.
Floating Point Representation
A floating-point number is typically represented in three parts:
Sign Bit: Indicates whether the number is positive or negative.
Exponent: Determines the scale of the number.
Mantissa (or Significand): Represents the precision of the number.
The IEEE 754 Standard for float
Let’s consider a32-bit float
in IEEE 754 standard:
1 bit for the sign (S).
8 bits for the exponent (E).
23 bits for the mantissa (M).
The actual value represented by this float is:
( ? 1 ) S × 1. M × 2 ( E ? b i a s ) (?1)^S ×1.M×2^{(E?bias)} (?1)S×1.M×2(E?bias)
The bias for a 32-bit float is 127.
Example
Let’s take a float number, say5.75
, and see how it is represented:
Convert to Binary:
The integer part,5
, is101
in binary.
The fractional part,.75
, is.11
in binary (since0.5 + 0.25 = 0.75
).
So,5.75
in binary is101.11
.
Normalize the Binary Number:
Write it in a form such that there’s only one non-zero digit before the decimal point. For101.11
, it is1.0111 × 2^2
.
Here,1.0111
is the mantissa, and2
is the exponent.
Encode the Exponent:
Add the bias (127 for 32-bit float) to the exponent. So,2 + 127 = 129
.
The binary form of 129 is10000001
.
Encode the Mantissa:
Drop the leading1
(since it’s implicit) and consider only the fractional part.0111
.
If necessary, pad it to the right with zeros to make it 23 bits long.
Sign Bit:
For a positive number, the sign bit is0
.
So, 5.75 in IEEE 754 32-bit floating-point format is:
Sign bit: 0
Exponent: 10000001 (129 in binary)
Mantissa: 01110000000000000000000 (fractional part .0111, padded with zeros)
Combining these: 0 10000001 01110000000000000000000.
Precision and Rounding
In the conversion process, if the mantissa has more significant digits than can be represented (i.e., more than 23 bits for a float), it is rounded (如果尾数的有效位数多于可表示的位数(即浮点数多于 23 位),则四舍五入). This rounding can introduce small errors, which is an inherent limitation of floating-point arithmetic.
Conclusion
Understanding floating-point representation involves converting a decimal number into binary, normalizing it, adjusting the exponent with a bias, and fitting the mantissa within the allowed bit length. It’s a process that balances the range of representable numbers with precision, and it’s fundamental to how computers handle real numbers. (这是一个平衡可表示数范围和精度的过程,也是计算机处理实数的基础。)
In floating-point representation, particularly following the IEEE 754 standard, zeros are represented in a special way to distinguish them from other numbers. Both positive zero and negative zero have their own representations. Both positive and negative zeros are represented with all exponent and mantissa bits set to 0, but they differ in their sign bit. This distinction allows floating-point systems to handle certain edge cases and mathematical nuances more accurately.
In C programming, when dealing with floating point types (float, double, and long double), you don’t directly manipulate the mantissa and exponent parts as they are internally handled by the floating-point representation (IEEE 754 standard, for most systems). However, you can work with numbers in a way that is conceptually similar to scientific notation, and you can also format floating-point output to reflect this.
Conceptual Use in Calculations
When you write a floating-point number in C, you can use scientific notation (also known as exponential notation) for clarity or convenience, especially for very large or very small numbers. For example:double a = 1.23e5; // Equivalent to 123000 double b = 4.56e-3; // Equivalent to 0.00456
Here, 1.23 e 5 1.23e5 1.23e5 is shorthand for 1.23 × 1 0 5 1.23×10^5 1.23×105, and 4.56 e ? 3 4.56e-3 4.56e?3 is shorthand for 4.56 × 1 0 ? 3 4.56×10^{-3} 4.56×10?3.
Formatting Output
When printing floating-point numbers, you can use format specifiers in functions like printf to display numbers in scientific notation:printf("%e\n", a); // Output in scientific notation printf("%f\n", b); // Output in fixed-point notation
The
%e
format specifier will display the number in scientific notation, while%f
will display it in fixed-point notation.
1: https://www.acwing.com/problem/content/792/ ??