联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codehelp

您当前位置:首页 >> R语言程序R语言程序

日期:2020-05-01 09:14

CIS341, Spring 2020
Cache2 Lab: Understanding Cache Memories
Assigned: Tuesday, April 7th, 2020
Due: Saturday, May 2nd, 8:59PM
1 Overview
This assignment will help you understand the behavior of cache memories and the impact that cache memories
have on performance. You may work individually, or as a group consisting of up to 4 individuals.
The assignment consists of two parts. In the first part, you will deduce the cache parameters such as block
size, number of sets, number of ways, and hit time for a two-level cache simulator. In the second part, you
will optimize a matrix multiply function to minimize the average memory read time. You will turn in not
only your C code for the cache-optimized matrix multiply, but also a short report documenting your work.
2 Milestones
The entirety of the assignment is due on May 2nd, 8:59pm, but there are two milestones before the due date.
While you may choose to turn-in everything on the final due date, hitting the first two milestones on time
will grant you bonus points.
2.1 Milestone 1: April 14th, 8:59pm
You must form a group to receive the files to work on the lab. Even a single-person group must be explicitly
formed. You will do this by contacting me via e-mail, stating the team name, and the list of group members.
Please check the announcement from April 8th for a sample of such e-mail.
You will receive 10 bonus points if you form a group by the first milestone. Any students who do not belong
to a group by the first milestone will be automatically and randomly grouped into new teams.
You will receive the files for the assignment on the class server within 24 hours of your group formation.
1
2.2 Milestone 2: April 21st, 8:59pm
You will receive bonus points if you correctly deduce the cache memory parameters for the cache simulator
before the second milestone. You may make only one guess, and I will either confirm or deny the correctness
of your deduction.
The awarded of bonus points depend on the size of your group. For a single-person group, you will receive
up to 24 bonus points if you correctly deduce the cache parameters. A two-person group will receive up to
12 bonus points; a three-person group, up to 8 points, and a four-person group, up to 6 points. You may
receive partial credit for the educated guess, but you have one chance in requesting confirmation for the all
of the deduced cache parameters. My response will either be a) all the parameters in your deduction are
correct or b) not all parameters are correct. I will not tell you which parameters are correct and which are
not.
The maximum attainable bonus points between Milestone 1 and 2 are limited to 20 points.
2.3 Due date: May 2nd, 8:59pm
By the due date, a report that documents your work and the matrix multiply code must be submitted. The
report will be uploaded through Blackboard, and the C code will be placed in the turnin directory of a
group member. The report must specify the name of the group member who placed the matrix multiply code
in his/her turnin directory.
Late penalties apply similarly to the prior assignments: each late-day will reduce your score by 20%. This
reduction is applied after bonus points are added. After 5 days, you will not receive any points even if you
received bonus points.
3 Communication
One of the cornerstones of a group project is frequent and efficient communication within its members. The
Blackboard group system allows group members to have private discussion boards, send e-mails to each
other, and write journal entries.
I would like you to have a habit of documenting your progress, no matter how minor it is. You may start by
starting a thread on some of the observations you have made in the lab, or by writing a digest of your group
meeting as journal entries. Even if you work alone, you should utilize these features, and some points will
be graded based on how you document your progress.
4 Tasks
Once you form a group, you will receive a tarball for the assignment, placed in your home directory of
the class server, lcs-vc-cis341.syr.edu. Untar it, and you will find the relevant files in the created
directory.
2
linux> cd
linux> tar xvf cache2lab-handout_Team*.tar
linux> cd cache2lab-handout
4.1 Task #1: Deducing cache parameters
csim is the binary for the two-level cache simulator whose parameter you must deduce through experimentation.
These parameters include the cache block size, number of sets, number of ways, and hit time for
each level 1 and level 2 cache. Do not delete csim: make will not be able to create this binary for you!
4.1.1 Running and understanding csim
You will run csim by giving it a trace file to run.
linux> ./csim -v -t traces/dave.trace
L 10,4 L2:miss 100
S 18,4 L1:hit
L 20,4 L2:miss 100
S 28,4 L1:hit
S 50,4 L2:miss
L1: hits:2 misses:3 evictions:0
L2: hits:0 misses:3 evictions:0
Average read time: 100.00
linux>
Here, the -v flag tells the simulator to be verbose, and the trace file path must follow the -t flag. The
simulator run under verbose will output the result of each memory access, followed by a summary for each
cache, and the average memory read time.
L 10,4 L2:miss 100
The first memory trace is a load (memory read) of 4 bytes from address 0x10. It resulted in an L2 miss
(implying that it also missed in L1), and incurred a total access time of 100 cycles. This result tells us that
the sum of L1 hit time, L2 hit time, and DRAM access time is 100 cycles.
S 18,4 L1:hit The second memory trace is a store (memory write) of 4 bytes to address 0x18. It
resulted in an L1 hit. This result tells us that the cache block size for L1 is greater than 8 bytes because the
previous load from address 0x10 was installed in the L1 cache. Note that the store does not log any access
time. We assume that all stores are non-blocking, and they are buffered and written in the background.
The average read time indicates 100 cycles. The two loads were only taken into consideration when computing
this average read time.
3
4.1.2 csim policies
Aside from cache parameters of block size, number of sets, number of ways, and hit time, the cache simulator
implements a number of policies that determine the behavior of the cache. These policies are discussed
here.
? Write-back
Write-back policy means that on a write-hit (the cache hits on a store), the write will only be applied
to that cache, not the levels below. For example, if a store hits in L1, it will not be written to L2 nor
DRAM. If a store L2 hits (through a dirty data eviction from L1 being written to L2), it will not be
written to DRAM.
? Write-allocate
Write-allocate policy means that on a write-miss (the cache misses on a store), the data will be first
loaded into the cache before applying the updates. For example, if a store misses in L1, the corresponding
cache line will first be brought into L1.
? Least-recently-used (LRU) replacement
LRU replacement policy means that to evict a cache line, the data that was least recently used will
be replaced. In real hardware, it may not be possible to track LRU perfectly; but for our simulator,
assume that we achieve perfect LRU.
? Non-inclusive, non-exclusive (NINE)
NINE policy means that there are no strict inclusivity and exclusivity requirements to the data in L1
and L2. For example, the same data may reside in both L1 and L2, or not be in both L1 and L2. It
may be possible for the data to only be in L1 and not in L2 (through L2’s eviction policy), or only in
L2 and not in L1 (after data is evicted in L1). However, on a miss in L1, data will always be brought
into both L1 and L2.
4.1.3 csim parameter assumptions
You should assume the following relations to be true.
? On hit time: L1 hit time is strictly smaller than L2’s hit time, and L2’s hit time is strictly smaller than
DRAM access time.
? On block size: L2’s block size is greater than or equal to L1’s block size.
? On the numbr of ways: L2’s associativity is greater than or equal to L1’s associativity.
? On the numbr of sets: The number of sets in L2 is greater than or equal to that in L1.
? On capacity: The size of L2 is strictly greater than the size of L1.
? On numbers: The block size, number of sets, and number of ways are all powers of 2.
Also, you should assume that memory accesses are aligned properly, such that single memory access never
crosses block boundaries. By making this assumption, you can ignore the request sizes in the traces.
4
4.1.4 Writing traces
In the directory traces, you will find several sample traces. The memory traces have the following form:
I 0400d7d4,8
M 0421c7f0,4
L 04f6b868,8
S 7ff0005c8,8
Each line denotes one or two memory accesses. The format of each line is
[space]operation address,size
The operation field denotes the type of memory access: “I” denotes an instruction load, “L” a data load,
“S” a data store, and “M” a data modify (i.e., a data load followed by a data store). There is never a space
before each “I”. There is always a space before each “M”, “L”, and “S”. The address field specifies a 64-bit
hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.
csim will ignore any instruction load, so I 0400d7d4, 8 will not affect the simulation.
4.1.5 Generating traces
Instead of handcrafting memory traces to exercise csim, you may generate traces directly from a program.
A Linux program called valgrind can trace the memory accesses during the execution of a program.
For example, typing
linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l
on the command line runs the executable program “ls -l”, captures a trace of each of its memory accesses
in the order they occur, and prints them on stdout.
Thus, if you choose so, you may write a program in a high-level language to access memory in a particular
pattern, compile to an executable binary, and then run valgrind to gather the memory trace. However,
this includes a lot of accesses outside of the memory region of interests, such as using the standard libraries
and accessing the stack. You may choose to inspect how test-mm.c and tracegen.c create isolated
memory traces.
4.2 Task #2: Writing a cache-optimized matrix multiply
You will write a matrix multiply code in mm.c, optimized for the csim binary. A reference matrix multiply
code is included in mm ref.c.
Running test-mm will validate the correctness of your implementation in mm.c, and evaluate its performance
by generating a memory trace and running it on csim. Below shows an example of testing a correct
but slow implementation of matrix multiply.
5
linux>
linux> ./test-mm -M 50 -N 51 -P 52
Step 1: Validating and generating memory traces for mm.c
Step 2: Evaluating performance
L1: hits:126418 misses:141337 evictions:141273
L2: hits:142899 misses:978 evictions:0
Average read time: 13.05
There are several approaches that you should consider when writing mm.c.
? Spatial locality
mm ref.c shows a na¨?ve implementation of matrix multiply. Try changing the order of the loops.
? Temporal locality
We’ve seen in lectures that chopping up the matrices into smaller blocks can reuse data already in
cache. Consider writing matrix multiply such that once data are brought into the cache, they will be
re-used.
? Spatio-temporal locality
Try combining the two techniques above.
? Two-level cache
csim has a two-level cache structure. Think about how you would design a matrix multiply for
two-level caches.
Furthermore, below are rules that you must adhere to when writing mm.c
? You are allowed to define at most 20 local variables of type int in your mm function. The reason for
this restriction is that our testing code is not able to count references to the stack. We want you to
limit your references to the stack and focus on the access patterns of the source and destination arrays.
? You are not allowed to side-step the previous rule by using any variables of type long or by using
any bit tricks to store more than one value to a single variable.
? Your matrix multiply function may not use recursion.
? If you choose to use helper functions, you may not have more than 20 local variables on the stack at a
time between your helper functions and your top-level matrix multiply function. For example, if your
matrix multiply declares 16 variables, and then you call a function that uses 4 variables, which calls
another function that uses 2, you will have 22 variables on the stack, and you will violate the rule.
? Your mm function may not modify array A or B. You may, however, do whatever you want with the
contents of array C.
? You are NOT allowed to define any arrays in your code or to use any variant of malloc.
6
4.3 Task #3: Writing a short report
You will write a short report documenting your work in Task #1 and Task #2. Only one report is needed per
group. You must include the following information in the report.
? Team members
A list of team members who were part of the work.
? Deduced cache parameters
A list of cache parameters that you deduced. There are 9 parameters you must uncover:
– L1: block size, number of sets, number of ways, hit time
– L2: block size, number of sets, number of ways, hit time
– DRAM access time
Even if you have met Milestone #2, you must still explicitly list these parameters.
? Evidence for your deduction of cache parameters
Provide a summary of the experiments you have performed to arrive at the conclusions. You need to
include a description of the types of workload traces you exercised, and how you interpreted these
results.
? Directory location of the matrix multiply code
The location of a complete and working mm.c must be disclosed in the report. I will look at this
location to validate your code. Please include the full path. You can run pwd in the directory to find
the full path.
? Description of your matrix multiply implementation
Briefly describe the final version of your matrix multiply code implemented in mm.c.
? Exemplar output from testing matrix multiply
Include the output of a test run of test-mm.
? Your reasoning on the performance evaluation of matrix multiply
Describe why the performance of your mm.c is better than the reference mm ref.c. If you implemented
and combined multiple techniques, a step-by-step description of all the intermediary implementations
is encouraged, although not required. If you see significant performance improvement,
you must provide your reasoning to it. If you do not see performance improvement, explain why is
the case.
5 Grading
We will grade the assignment 5 days after the due date to take account of late submissions. As did with
previous assignments, each late submission will reduce the total points by 20%.
7
The assignment itself is worth 100 points and possibly up to 20 bonus points. Out of the 100 base points,
70 points are allocated to be evaluated based on the submission: mm.c and the report. 30 points are participation
points that are assigned based on your peer-evaluation and group activity.
The breakdown of points is as follows.
? 10 points: cache parameters
You will receive partial credit for each correctly deduced parameters (9 parameters total).
? 20 points: matrix multiply
5 points will be awarded for mm.c that correctly compiles with no syntax errors. Another 5 points
will be awarded for mm.c that is correct functionally. 10 points will be based on the performance
improvement of mm.c. Because each group will receive different csim, the maximum possible
improvement is different for each group. Thus, the scale for awarding performance improvement
will depend on the csim parameters. Also, I reserve the exact matrix dimensions I will use for
performance evaluation.
? 40 points: report
20 points will be reserved for how well the group approaches Task #1, and the other 20points for
Task #2. The general criteria for judging these points are described in Task #3: Writing a short report
section.
? 30 points: participation
A short questionnaire for peer-evaluation will be made available after the deadline. It will be used to
evaluate 20 points participation grades for each team member. The remaining 10 points will be based
on your activity on Blackboard: your group’s discussion board and journal entries.
? (bonus) 10 points: forming a group by Milestone #1
An additional 10 points will be awarded for groups that are formed before Milestone #1.
? (bonus) 6/8/12/24 points: correctly deducing the parameters by Milestone #2
Points will be awarded to groups that correctly deduce the parameters by Milestone #2. Awarded
points depend on the group size (a group size of 4 receives up to 6, and a group size of 1 receives up
to 24), and how many parameters are correctly deduced. You will only have one attempt at submitting
your educated guess.
8

版权所有:留学生编程辅导网 2021,All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。