Commit 9edeeedc authored by simonmar

[project @ 2000-08-21 15:12:04 by simonmar]

merge conflicts (I hope)
parent cd1a93d8
IDEAS ABOUT THINGS TO WORK ON
* mpq_cmp: Maybe the most sensible thing to do would be to multiply the, say,
4 most significant limbs of each operand and compare them. If that is not
sufficient, do the same for 8 limbs, etc.
* Write mpi, the Multiple Precision Interval Arithmetic layer.
* Write `mpX_eval' that take lambda-like expressions and a list of operands.
* As a general rule, recognize special operand values in mpz and mpf, and
use shortcuts for speed. Examples: Recognize (small or all) 2^n in
multiplication and division. Recognize small bases in mpz_pow_ui.
* Implement lazy allocation? mpz->d == 0 would mean no allocation made yet.
* Maybe store one-limb numbers according to Per Bothner's idea:
struct {
mp_ptr d;
union {
mp_limb val; /* if (d == NULL). */
mp_size size; /* Length of data array, if (d != NULL). */
} u;
};
Problem: We can't normalize to that format unless we free the space
pointed to by d, and therefore small values will not be stored in a
canonical way.
* Document complexity of all functions.
* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
mpz_fits_signedint_p, etc.
* mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).
* Better random number generators. There should be fast (like mpz_random),
very good (mpz_veryrandom), and special purpose (like mpz_random2). Sizes
in *bits*, not in limbs.
* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
If the mpz_xinit routine remembers the address of the variable we could
walk-and-mark the list of remembered variables, and free the space
occupied by the remembered variables that didn't get marked. Fairly
standard.
* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
etc, to call __umul_ppmm, __udiv_qrnnd. A typical definition for
umul_ppmm would be
#define umul_ppmm(ph,pl,m0,m1) \
{unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}
In order to maintain just one version of longlong.h (gmp and gcc), this
has to be done outside of longlong.h.
Bennet Yee at CMU proposes:
* mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
* A function mpfatal that is called for exceptions. Let the user override
a default definition.
* Make all mpz_* computation functions return a signed int indicating whether
the result was zero, positive, or negative?
* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
mpz_ldb, various bit string operations. Also mpz_@_si for most @??
* Add macros for looping efficiently over a number's limbs:
MPZ_LOOP_OVER_LIMBS_INCREASING(num,limb)
{ user code manipulating limb}
MPZ_LOOP_OVER_LIMBS_DECREASING(num,limb)
{ user code manipulating limb}
Brian Beuning proposes:
1. An array of small primes
2. A function to factor an mpz_t. [How do we return the factors? Maybe
we just return one arbitrary factor? In the latter case, we have to
use a data structure that records the state of the factoring routine.]
3. A routine to look for "small" divisors of an mpz_t
4. A 'multiply mod n' routine based on Montgomery's algorithm.
Doug Lea proposes:
1. A way to find out if an integer fits into a signed int, and if so, a
way to convert it out.
2. Similarly for double precision float conversion.
3. A function to convert the ratio of two integers to a double. This
can be useful for mixed mode operations with integers, rationals, and
doubles.
* The elliptic curve factoring method is described in the chapter
`Algorithms in Number Theory' in the Handbook of Theoretical Computer
Science, Elsevier, Amsterdam, 1990, and also in Carl Pomerance's lecture
notes on Cryptology and Computational Number Theory, 1990.
* Harald Kirsh suggests:
mpq_set_str (MP_RAT *r, char *numerator, char *denominator).
* New function: mpq_get_ifstr (int_str, frac_str, base,
precision_in_some_way, rational_number). Convert RATIONAL_NUMBER to a
string in BASE and put the integer part in INT_STR and the fraction part
in FRAC_STR. (This function would do a division of the numerator and the
denominator.)
* Should mpz_powm* handle negative exponents?
* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
little effect on the quotient. Maybe we can assume it is 0, and
compensate at a later stage?
* Better sqrt: First calculate the reciprocal square root, then multiply by
the operand to get the square root. The reciprocal square root can be
obtained through Newton-Raphson without division. To compute sqrt(A), the
iteration is,
     x[i+1] = x[i] (3 - A x[i]^2) / 2.
The final result can be computed without division using,
     sqrt(A) = A x[n].
* Newton-Raphson using multiplication: We get twice as many correct digits
in each iteration. So if we square x(k) as part of the iteration, the
result will have the leading digits in common with the entire result from
iteration k-1. A _mpn_mul_lowpart could help us take advantage of this.
* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
a*b modulo p and the long long type is unavailable, then I can write
typedef signed long slong;
typedef unsigned long ulong;
slong a, b, p, quot, rem;
quot = (slong) (0.5 + (double)a * (double)b / (double)p);
rem = (slong)((ulong)a * (ulong)b - (ulong)p * (ulong)quot);
if (rem < 0) {rem += p; quot--;}
* Speed modulo arithmetic, using Montgomery's method or my pre-inversion
method. In either case, special arithmetic calls would be needed,
mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
functions. Better yet: Write a new mpr layer.
* mpz_powm* should not use division to reduce the result in the loop, but
instead pre-compute the reciprocal of the MOD argument and do reduced_val
= val-val*reciprocal(MOD)*MOD, or use Montgomery's method.
* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s
* It would be a quite important feature never to allocate more memory than
really necessary for a result. Sometimes we can achieve this cheaply, by
deferring reallocation until the result size is known.
* New macro in longlong.h: shift_rhl that extracts a word by shifting two
words as a unit. (Supported by i386, i860, HP-PA, POWER, 29k.) Useful
for shifting multiple precision numbers.
* The installation procedure should make a test run of multiplication to
decide the threshold values for algorithm switching between the available
methods.
* Fast output conversion of x to base B:
1. Find n, such that (B^n > x).
2. Set y to (x*2^m)/(B^n), where m is large enough to make 2^m ~~ B^n
3. Multiply the low half of y by B^(n/2), and recursively convert the
result. Truncate the low half of y and convert that recursively.
Complexity: O(M(n)log(n))+O(D(n))!
* Improve division using Newton-Raphson. Check out "Newton Iteration and
Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
Morgan Kaufmann, 1993 ("beware of some errors"...)
* Improve implementation of Karatsuba's algorithm. For most operand sizes,
we can reduce the number of operations by splitting differently.
* Faster multiplication: The best approach is to first implement Toom-Cook.
People report that it beats Karatsuba's algorithm already at about 100
limbs. FFT would probably never beat a well-written Toom-Cook (not even for
millions of bits).
FFT:
{
* Multiplication could be done with Montgomery's method combined with
the "three primes" method described in Lipson. Maybe this would be
faster than Nussbaumer's method with 3 (simple) moduli?
* Maybe the modular tricks below are not needed: We are using very
special numbers, Fermat numbers with a small base and a large exponent,
and maybe it's possible to just subtract and add?
* Modify Nussbaumer's convolution algorithm, to use 3 words for each
coefficient, calculating in 3 relatively prime moduli (e.g.
0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer). Both all
operations and CRR would be very fast with such numbers.
* Optimize the Schoenhage-Strassen multiplication algorithm. Take advantage
of the real valued input to save half of the operations and half of the
memory. Use recursive FFT with large base cases, since recursive FFT has
better memory locality. A normal FFT gets 100% cache misses for large
enough operands.
* In the 3-prime convolution method, it might sometimes be a win to use 2,
3, or 5 primes. Imagine that using 3 primes would require a transform
length of 2^n. But 2 primes might still sometimes give us correct
results with that same transform length, or 5 primes might allow us to
decrease the transform size to 2^(n-1).
To optimize floating-point based complex FFT we have to think of:
1. The normal implementation accesses all input exactly once for each of
the log(n) passes. This means that we will get a 0% cache hit rate when
n exceeds our cache size. Remedy: Reorganize the computation into partial passes,
maybe similar to a standard recursive FFT implementation. Use a large
`base case' to make any extra overhead of this organization negligible.
2. Use base-4, base-8 and base-16 FFT instead of just radix-2. This can
reduce the number of operations by 2x.
3. Inputs are real-valued. According to Knuth's "Seminumerical
Algorithms", exercise 4.6.4-14, we can save half the memory and half
the operations if we take advantage of that.
4. Maybe make it possible to write the innermost loop in assembly, since
that could win us another 2x speedup. (If we write our FFT to avoid
cache-miss (see #1 above) it might be logical to write the `base case'
in assembly.)
5. Avoid multiplication by 1, i, -1, -i. Similarly, optimize
multiplication by (+-sqrt(2)/2 +- i sqrt(2)/2).
6. Put as many bits as possible in each double (but don't waste time if
that doesn't make the transform size become smaller).
7. For n > some large number, we will get accuracy problems because of the
limited precision of our floating point arithmetic. This can easily be
solved by using the Karatsuba trick a few times until our operands
become small enough.
8. Precompute the roots-of-unity and store them in a vector.
}
* When a division result is going to be just one limb (i.e., nsize-dsize is
small), normalization could be done in the division loop.
* Never allocate temporary space for a source param that overlaps with a
destination param needing reallocation. Instead malloc a new block for
the destination (and free the source before returning to the caller).
* Parallel addition. Since each processor only has to signal its readiness
to the next processor, we can use simplified synchronization, and actually
write it in C: For each processor (apart from the least significant):
while (*svar != my_number)
;
*svar = my_number + 1;
The least significant processor does this:
*svar = my_number + 1; /* i.e., *svar = 1 */
Before starting the addition, one processor has to store 0 in *svar.
Other things to think about for parallel addition: To avoid false
(cache-line) sharing, allocate blocks on cache-line boundaries.
Local Variables:
mode: text
fill-column: 77
fill-prefix: " "
version-control: never
End:
==============================================================================
Cycle counts and throughput for low-level routines in GNU MP as currently
implemented.
A range means that the timing is data-dependent. The slower end of such a
range is usually the best performance estimate.
The throughput value, measured in Gb/s (gigabits per second) has a meaning
only for comparison between CPUs.
A star before a line means that all values on that line are estimates. A
star before a number means that that number is an estimate. A `p' before a
number means that the code is not complete, but the timing is believed to be
accurate.
| mpn_lshift mpn_add_n mpn_mul_1 mpn_addmul_1
| mpn_rshift mpn_sub_n mpn_submul_1
------------+-----------------------------------------------------------------
DEC/Alpha |
EV4 | 4.75 cycles/64b 7.75 cycles/64b 42 cycles/64b 42 cycles/64b
200MHz | 2.7 Gb/s 1.65 Gb/s 20 Gb/s 20 Gb/s
EV5 old code| 4.0 cycles/64b 5.5 cycles/64b 18 cycles/64b 18 cycles/64b
267MHz | 4.27 Gb/s 3.10 Gb/s 61 Gb/s 61 Gb/s
417MHz | 6.67 Gb/s 4.85 Gb/s 95 Gb/s 95 Gb/s
EV5 tuned | 3.25 cycles/64b 4.75 cycles/64b
267MHz | 5.25 Gb/s 3.59 Gb/s as above
417MHz | 8.21 Gb/s 5.61 Gb/s
------------+-----------------------------------------------------------------
Sun/SPARC |
SPARC v7 | 14.0 cycles/32b 8.5 cycles/32b 37-54 cycl/32b 37-54 cycl/32b
SuperSPARC | 3 cycles/32b 2.5 cycles/32b 8.2 cycles/32b 10.8 cycles/32b
50MHz | 0.53 Gb/s 0.64 Gb/s 6.2 Gb/s 4.7 Gb/s
**SuperSPARC| tuned addmul and submul will take: 9.25 cycles/32b
MicroSPARC2 | ? 6.65 cycles/32b 30 cycles/32b 31.5 cycles/32b
110MHz | ? 0.53 Gb/s 3.75 Gb/s 3.58 Gb/s
SuperSPARC2 | ? ? ? ?
Ultra/32 (4)| 2.5 cycles/32b 6.5 cycles/32b 13-27 cyc/32b 16-30 cyc/32b
182MHz | 2.33 Gb/s 0.896 Gb/s 14.3-6.9 Gb/s
Ultra/64 (5)| 2.5 cycles/64b 10 cycles/64b 40-70 cyc/64b 46-76 cyc/64b
182MHz | 4.66 Gb/s 1.16 Gb/s 18.6-11 Gb/s
HalSPARC64 | ? ? ? ?
------------+-----------------------------------------------------------------
SGI/MIPS |
R3000 | 6 cycles/32b 9.25 cycles/32b 16 cycles/32b 16 cycles/32b
40MHz | 0.21 Gb/s 0.14 Gb/s 2.56 Gb/s 2.56 Gb/s
R4400/32 | 8.6 cycles/32b 10 cycles/32b 16-18 19-21
200MHz | 0.74 Gb/s 0.64 Gb/s 13-11 Gb/s 11-9.6 Gb/s
*R4400/64 | 8.6 cycles/64b 10 cycles/64b 22 cycles/64b 22 cycles/64b
*200MHz | 1.48 Gb/s 1.28 Gb/s 37 Gb/s 37 Gb/s
R4600/32 | 6 cycles/64b 9.25 cycles/32b 15 cycles/32b 19 cycles/32b
134MHz | 0.71 Gb/s 0.46 Gb/s 9.1 Gb/s 7.2 Gb/s
R4600/64 | 6 cycles/64b 9.25 cycles/64b ? ?
134MHz | 1.4 Gb/s 0.93 Gb/s ? ?
R8000/64 | 3 cycles/64b 4.6 cycles/64b 8 cycles/64b 8 cycles/64b
75MHz | 1.6 Gb/s 1.0 Gb/s 38 Gb/s 38 Gb/s
*R10000/64 | 2 cycles/64b 3 cycles/64b 11 cycles/64b 11 cycles/64b
*200MHz | 6.4 Gb/s 4.27 Gb/s 74 Gb/s 74 Gb/s
*250MHz | 8.0 Gb/s 5.33 Gb/s 93 Gb/s 93 Gb/s
------------+-----------------------------------------------------------------
Motorola |
MC68020 | ? 24 cycles/32b 62 cycles/32b 70 cycles/32b
MC68040 | ? 6 cycles/32b 24 cycles/32b 25 cycles/32b
MC88100 | >5 cycles/32b 4.6 cycles/32b 16/21 cyc/32b p 18/23 cyc/32b
MC88110 wt | ? 3.75 cycles/32b 6 cycles/32b 8.5 cyc/32b
*MC88110 wb | ? 2.25 cycles/32b 4 cycles/32b 5 cycles/32b
------------+-----------------------------------------------------------------
HP/PA-RISC |
PA7000 | 4 cycles/32b 5 cycles/32b 9 cycles/32b 11 cycles/32b
67MHz | 0.53 Gb/s 0.43 Gb/s 7.6 Gb/s 6.2 Gb/s
PA7100 | 3.25 cycles/32b 4.25 cycles/32b 7 cycles/32b 8 cycles/32b
99MHz | 0.97 Gb/s 0.75 Gb/s 14 Gb/s 12.8 Gb/s
PA7100LC | ? ? ? ?
PA7200 (3) | 3 cycles/32b 4 cycles/32b 7 cycles/32b 6.5 cycles/32b
100MHz | 1.07 Gb/s 0.80 Gb/s 14 Gb/s 15.8 Gb/s
PA7300LC | ? ? ? ?
*PA8000 | 3 cycles/64b 4 cycles/64b 7 cycles/64b 6.5 cycles/64b
180MHz | 3.84 Gb/s 2.88 Gb/s 105 Gb/s 113 Gb/s
------------+-----------------------------------------------------------------
Intel/x86 |
386DX | 20 cycles/32b 17 cycles/32b 41-70 cycl/32b 50-79 cycl/32b
16.7MHz | 0.027 Gb/s 0.031 Gb/s 0.42-0.24 Gb/s 0.34-0.22 Gb/s
486DX | ? ? ? ?
486DX4 | 9.5 cycles/32b 9.25 cycles/32b 17-23 cycl/32b 20-26 cycl/32b
100MHz | 0.34 Gb/s 0.35 Gb/s 6.0-4.5 Gb/s 5.1-3.9 Gb/s
Pentium | 2/6 cycles/32b 2.5 cycles/32b 13 cycles/32b 14 cycles/32b
167MHz | 2.7/0.89 Gb/s 2.1 Gb/s 13.1 Gb/s 12.2 Gb/s
Pentium Pro | 2.5 cycles/32b 3.5 cycles/32b 6 cycles/32b 9 cycles/32b
200MHz | 2.6 Gb/s 1.8 Gb/s 34 Gb/s 23 Gb/s
------------+-----------------------------------------------------------------
IBM/POWER |
RIOS 1 | 3 cycles/32b 4 cycles/32b 11.5-12.5 c/32b 14.5/15.5 c/32b
RIOS 2 | 2 cycles/32b 2 cycles/32b 7 cycles/32b 8.5 cycles/32b
------------+-----------------------------------------------------------------
PowerPC |
PPC601 (1) | 3 cycles/32b 6 cycles/32b 11-16 cycl/32b 14-19 cycl/32b
PPC601 (2) | 5 cycles/32b 6 cycles/32b 13-22 cycl/32b 16-25 cycl/32b
67MHz (2) | 0.43 Gb/s 0.36 Gb/s 5.3-3.0 Gb/s 4.3-2.7 Gb/s
PPC603 | ? ? ? ?
*PPC604 | 2 3 2 3
*167MHz | 57 Gb/s
PPC620 | ? ? ? ?
------------+-----------------------------------------------------------------
Tege |
Model 1 | 2 cycles/64b 3 cycles/64b 2 cycles/64b 3 cycles/64b
250MHz | 8 Gb/s 5.3 Gb/s 500 Gb/s 340 Gb/s
500MHz | 16 Gb/s 11 Gb/s 1000 Gb/s 680 Gb/s
____________|_________________________________________________________________
(1) Using POWER and PowerPC instructions
(2) Using only PowerPC instructions
(3) Actual timing for shift/add/sub depends on code alignment. PA7000 code
is smaller and therefore often faster on this CPU.
(4) Multiplication routines modified for the bogus UltraSPARC early-out
optimization. The smaller operand is put in rs1, not rs2 as it should be
according to the SPARC architecture manuals.
(5) Preliminary timings, since there is no stable 64-bit environment.
(6) Use mulu.d at least for mpn_lshift. With mak/extu/or, we can only get
to 2 cycles/32b.
=============================================================================
Estimated theoretical asymptotic cycle counts for low-level routines:
| mpn_lshift mpn_add_n mpn_mul_1 mpn_addmul_1
| mpn_rshift mpn_sub_n mpn_submul_1
------------+-----------------------------------------------------------------
DEC/Alpha |
EV4 | 3 cycles/64b 5 cycles/64b 42 cycles/64b 42 cycles/64b
EV5 | 3 cycles/64b 4 cycles/64b 18 cycles/64b 18 cycles/64b
------------+-----------------------------------------------------------------
Sun/SPARC |
SuperSPARC | 2.5 cycles/32b 2 cycles/32b 8 cycles/32b 9 cycles/32b
------------+-----------------------------------------------------------------
SGI/MIPS |
R4400/32 | 5 cycles/64b 8 cycles/64b 16 cycles/64b 16 cycles/64b
R4400/64 | 5 cycles/64b 8 cycles/64b 22 cycles/64b 22 cycles/64b
R4600 |
------------+-----------------------------------------------------------------
HP/PA-RISC |
PA7100 | 3 cycles/32b 4 cycles/32b 6.5 cycles/32b 7.5 cycles/32b
PA7100LC |
------------+-----------------------------------------------------------------
Motorola |
MC88110 | 1.5 cyc/32b (6) 1.5 cycle/32b 1.5 cycles/32b 2.25 cycles/32b
------------+-----------------------------------------------------------------
Intel/x86 |
486DX4 |
Pentium P5x | 5 cycles/32b 2 cycles/32b 11.5 cycles/32b 13 cycles/32b
Pentium Pro | 2 cycles/32b 3 cycles/32b 4 cycles/32b 6 cycles/32b
------------+-----------------------------------------------------------------
IBM/POWER |
RIOS 1 | 3 cycles/32b 4 cycles/32b
RIOS 2 | 1.5 cycles/32b 2 cycles/32b 4.5 cycles/32b 5.5 cycles/32b
------------+-----------------------------------------------------------------
PowerPC |
PPC601 (1) | 3 cycles/32b ?4 cycles/32b
PPC601 (2) | 4 cycles/32b ?4 cycles/32b
____________|_________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include "gmp.h"

int
main (void)
{
  printf ("/* gmp-mparam.h -- Compiler/machine parameter header file.\n\n");
  printf (" *** CREATED BY A PROGRAM -- DO NOT EDIT ***\n\n");
  printf ("Copyright (C) 1996 Free Software Foundation, Inc. */\n\n");
  printf ("#define BITS_PER_MP_LIMB %d\n", (int) (8 * sizeof (mp_limb_t)));
  printf ("#define BYTES_PER_MP_LIMB %d\n", (int) sizeof (mp_limb_t));
  printf ("#define BITS_PER_LONGINT %d\n", (int) (8 * sizeof (long)));
  printf ("#define BITS_PER_INT %d\n", (int) (8 * sizeof (int)));
  printf ("#define BITS_PER_SHORTINT %d\n", (int) (8 * sizeof (short)));
  printf ("#define BITS_PER_CHAR 8\n");
  exit (0);
}
This diff is collapsed.
/* __gmp_extract_double -- convert from double to array of mp_limb_t.
Copyright (C) 1996 Free Software Foundation, Inc.
This file is part of the GNU MP Library.
The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Library General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public
License for more details.
You should have received a copy of the GNU Library General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA. */
#include "gmp.h"
#include "gmp-impl.h"
#ifdef XDEBUG
#undef _GMP_IEEE_FLOATS
#endif
#ifndef _GMP_IEEE_FLOATS
#define _GMP_IEEE_FLOATS 0
#endif
#define MP_BASE_AS_DOUBLE (2.0 * ((mp_limb_t) 1 << (BITS_PER_MP_LIMB - 1)))
/* Extract a non-negative double in d. */
int
#if __STDC__
__gmp_extract_double (mp_ptr rp, double d)
#else
__gmp_extract_double (rp, d)
mp_ptr rp;
double d;
#endif
{
long exp;
unsigned sc;
mp_limb_t manh, manl;
/* BUGS
1. Should handle Inf and NaN in IEEE specific code.
2. Handle Inf and NaN also in default code, to avoid hangs.
3. Generalize to handle all BITS_PER_MP_LIMB >= 32.
4. This lits is incomplete and misspelled.
*/
if (d == 0.0)
{
rp[0] = 0;
rp[1] = 0;
#if BITS_PER_MP_LIMB == 32
rp[2] = 0;
#endif
return 0;
}
#if _GMP_IEEE_FLOATS
{
union ieee_double_extract x;
x.d = d;
exp = x.s.exp;
sc = (unsigned) (exp + 2) % BITS_PER_MP_LIMB;
#if BITS_PER_MP_LIMB == 64
manl = (((mp_limb_t) 1 << 63)
| ((mp_limb_t) x.s.manh << 43) | ((mp_limb_t) x.s.manl << 11));
#else
manh = ((mp_limb_t) 1 << 31) | (x.s.manh << 11) | (x.s.manl >> 21);
manl = x.s.manl << 11;
#endif
}
#else
{
/* Unknown (or known to be non-IEEE) double format. */
exp = 0;
if (d >= 1.0)
{
if (d * 0.5 == d)
abort ();
while (d >= 32768.0)
{
d *= (1.0 / 65536.0);
exp += 16;
}
while (d >= 1.0)
{
d *= 0.5;
exp += 1;
}
}
else if (d < 0.5)
{
while (d < (1.0 / 65536.0))
{
d *= 65536.0;
exp -= 16;
}
while (d < 0.5)
{
d *= 2.0;
exp -= 1;
}
}
sc = (unsigned) exp % BITS_PER_MP_LIMB;
d *= MP_BASE_AS_DOUBLE;
#if BITS_PER_MP_LIMB == 64
manl = d;
#else
manh = d;
manl = (d - manh) * MP_BASE_AS_DOUBLE;
#endif
exp += 1022;
}
#endif
exp = (unsigned) (exp + 1) / BITS_PER_MP_LIMB - 1024 / BITS_PER_MP_LIMB + 1;
#if BITS_PER_MP_LIMB == 64
if (sc != 0)
{