TY - GEN
T1 - Effective usage of vector registers in decoupled vector architectures
AU - Villa, Luis
AU - Espasa, Roger
AU - Valero, Mateo
N1 - Publisher Copyright: © 1998 IEEE
PY - 1998
Y1 - 1998
N2 - This paper presents a study of the impact of reducing the vector register size in a decoupled vector architecture. In traditional in-order vector architectures, long vector registers have typically been the norm. We start by presenting data showing that, even for highly vectorizable codes, only a small fraction of the elements of a long vector register are actually used. We also show that reducing the register size in a traditional vector architecture, in an attempt to reduce hardware cost and maximize register utilization, results in severe performance degradation. However, we combine the decoupling technique with the vector register reduction and show that the resulting architecture tolerates the register size cuts very well. We simulate a selection of Perfect Club and Specfp92 programs using a trace-driven approach and compare the execution time of a conventional vector architecture with that of a decoupled vector architecture using different register sizes. Halving the register size and using decoupling provides speedups between 1.04 and 1.49 over a traditional in-order vector machine. Even when the register length is reduced to 1/4 of the original size (and, in some cases, to 1/8), the decoupled machine performs better than a conventional vector model. Moreover, we observe that the resulting decoupled machine with short registers tolerates long memory latencies very well.
AB - This paper presents a study of the impact of reducing the vector register size in a decoupled vector architecture. In traditional in-order vector architectures, long vector registers have typically been the norm. We start by presenting data showing that, even for highly vectorizable codes, only a small fraction of the elements of a long vector register are actually used. We also show that reducing the register size in a traditional vector architecture, in an attempt to reduce hardware cost and maximize register utilization, results in severe performance degradation. However, we combine the decoupling technique with the vector register reduction and show that the resulting architecture tolerates the register size cuts very well. We simulate a selection of Perfect Club and Specfp92 programs using a trace-driven approach and compare the execution time of a conventional vector architecture with that of a decoupled vector architecture using different register sizes. Halving the register size and using decoupling provides speedups between 1.04 and 1.49 over a traditional in-order vector machine. Even when the register length is reduced to 1/4 of the original size (and, in some cases, to 1/8), the decoupled machine performs better than a conventional vector model. Moreover, we observe that the resulting decoupled machine with short registers tolerates long memory latencies very well.
UR - http://www.scopus.com/inward/record.url?scp=85117484992&partnerID=8YFLogxK
U2 - 10.1109/EMPDP.1998.647238
DO - 10.1109/EMPDP.1998.647238
M3 - Conference contribution
AN - SCOPUS:85117484992
T3 - Proceedings of the 6th Euromicro Workshop on Parallel and Distributed Processing, PDP 1998
SP - 495
EP - 501
BT - Proceedings of the 6th Euromicro Workshop on Parallel and Distributed Processing, PDP 1998
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th Euromicro Workshop on Parallel and Distributed Processing, PDP 1998
Y2 - 21 January 1998 through 23 January 1998
ER -