6/1/18
Assignment Semantics in Python, JavaScript, Java, C++, and Rust
By Carlo Milanesi
What happens when a data collection is copied and then the new copy is changed? Does the original remain the same, or does it change too?
If you think of copying as creating a completely new object, of course you expect that any change to the new copy does not affect the original object. But if you think of copying as creating a new name for the same, single object, then you expect that any change to the object through the new name appears also when you access the same object through the old name.
Let's see how Python, JavaScript, Java, C++, and Rust behave regarding the assignment operator ("=") between collection variables.
In both a Python interpreter and a JavaScript interpreter (like Node.js), if you create a three-item array a, assign it to b (b = a), set b[0] to 123, and then display a, you get as output: [123, 20, 30].
This means that the a and b variables refer to the same array object, so that when that object is changed through b, the change is also visible when accessing it through a.
The same happens in Java. If you compile and run the equivalent program, which changes the first item of an array through b and then prints that item through a, the output is 123.
C++ behaves differently, though.
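If you compile and run the equivalent C++ program, in which the first item of a is initially 10, the output is 10: the original value is unchanged. This holds both when a and b are std::array objects (which requires C++11) and when they are std::vector objects.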
This means that for both the std::array and the std::vector standard C++ collections, when a is copied to b, the whole underlying object is copied, so that the two variables represent distinct collections.
We can say that Python, JavaScript, and Java use share semantics, while C++ uses copy semantics.
All this is explained in the section "Assignment semantics" of chapter 21 of my "Beginning Rust" book, reproduced in the rest of this post.
What does the following program do?
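A minimal sketch of such a program, assuming a three-item vector, might look like the following. The values 11, 22, 33 and the helper name `init_from_other` are illustrative assumptions, not taken from the book.

```rust
// Hypothetical helper wrapping the two statements under discussion,
// so that the resulting vector can be inspected.
fn init_from_other() -> Vec<i32> {
    let v1 = vec![11, 22, 33]; // header on the stack, items in a heap buffer
    let v2 = v1;               // v2 is initialized using v1
    v2
}

fn main() {
    println!("{:?}", init_from_other()); // prints [11, 22, 33]
}
```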
Conceptually, the header of v1 is first allocated on the stack. Then, as the vector has content, a buffer for that content is allocated on the heap, and the values are copied into it. Then the header is initialized so that it references the newly allocated heap buffer.
Then the header of v2 is allocated on the stack. Then, there is the initialization of v2 using v1. But how is that implemented?
In general there are at least three ways to implement such an operation:
- Share semantics. The header of v1 is copied onto the header of v2, and nothing else happens. Subsequently, both v1 and v2 can be used, and they both refer to the same heap buffer; therefore, they refer to the same contents, not to two equal but distinct contents. This semantics is implemented by garbage-collected languages, like Java.
- Copy semantics. Another heap buffer is allocated. It is as large as the heap buffer used by v1, and the contents of the pre-existing buffer are copied into the new buffer. Then the header of v2 is initialized so that it references the newly allocated buffer. Therefore, the two variables refer to two distinct buffers, which initially have equal contents. This is implemented, by default, by C++.
- Move semantics. The header of v1 is copied onto the header of v2, and nothing else happens. Subsequently, v2 can be used, and it refers to the heap buffer that was allocated for v1, but v1 cannot be used anymore. This is implemented, by default, by Rust.
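The move case can be sketched as follows, again assuming a three-item vector with illustrative values. The statement that uses v1 after the assignment is left commented out so that the sketch compiles; with it enabled, the compiler rejects the program (error E0382, whose message wording varies across compiler versions). The function name `length_after_move` is hypothetical.

```rust
// Moves v1 into v2 and returns the length seen through v2.
fn length_after_move() -> usize {
    let v1 = vec![11, 22, 33];
    let v2 = v1; // the header is copied into v2; v1 can no longer be used
    // Enabling the next statement makes compilation fail, because v1 was moved:
    // print!("{}", v1.len());
    v2.len()
}

fn main() {
    println!("{}", length_after_move()); // prints 3
}
```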
If the program then tries to use v1 after the assignment, for example by printing v1.len(), compilation fails with the error use of moved value: `v1` at that line. When the value of v1 is assigned to v2, the variable v1 ceases to exist. Trying to use it, even only to get its length, is disallowed by the compiler.
Let's see why Rust does not implement share semantics. First, if variables are mutable, such semantics would be somewhat confusing. With share semantics, after an item is changed through a variable, that item appears to be changed also when it is accessed through the other variable. That wouldn't be intuitive, and would possibly be a source of bugs. Therefore, share semantics would be acceptable only for read-only data.
But there is a bigger problem, regarding deallocation. If share semantics were used, both v1 and v2 would own the single data buffer, and so when they are deallocated, the same heap buffer would be deallocated twice. A buffer cannot be deallocated twice without causing memory corruption and consequently program malfunction. To solve this problem, the languages that use share semantics do not deallocate memory at the end of the scope of the variable using such memory, but resort to garbage collection.
Instead, both copy semantics and move semantics are correct. Indeed, the Rust rule regarding deallocation is that any object must have exactly one owner. When copy semantics is used, the original vector buffer keeps its single owner, which is the vector header referenced by v1, and the newly created vector buffer gets its single owner, which is the vector header referenced by v2. On the other hand, when move semantics is used, the single vector buffer changes owner: before the assignment, its owner is the vector header referenced by v1, and after the assignment, its owner is the vector header referenced by v2. Before the assignment, the v2 header does not exist yet, and after the assignment the v1 header does not exist anymore.
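The single-owner rule can be observed with a small sketch that counts deallocations. Everything here is an assumption made for illustration: the `Buffer` wrapper type, the `DROPS` counter, and the two function names are hypothetical and do not come from the book. With a move, one buffer changes owner and is deallocated once; with a copy, two distinct buffers are each deallocated once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many Buffer values have been deallocated (illustrative only).
static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical owner of a heap buffer that records its own deallocation.
struct Buffer(Vec<i32>);

impl Drop for Buffer {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

// Move: one buffer changes owner, so exactly one deallocation happens.
fn drops_after_move() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    {
        let b1 = Buffer(vec![11, 22, 33]);
        let _b2 = b1; // ownership passes from b1 to _b2
    } // only _b2 is dropped here
    DROPS.load(Ordering::SeqCst) - before
}

// Copy: two buffers, each with its own owner, so two deallocations happen.
fn drops_after_copy() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    {
        let b1 = Buffer(vec![11, 22, 33]);
        let _b2 = Buffer(b1.0.clone()); // a second, distinct buffer
    } // both _b2 and b1 are dropped here
    DROPS.load(Ordering::SeqCst) - before
}

fn main() {
    println!("{} {}", drops_after_move(), drops_after_copy()); // prints 1 2
}
```

In neither case is the same buffer deallocated twice, which is the problem that share semantics would have without garbage collection.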
And why doesn't Rust implement copy semantics?
Actually, in some cases copy semantics is more appropriate, while in other cases move semantics is more appropriate. Even C++, since 2011, allows both copy semantics and move semantics.
For example, a C++ program that creates a three-item vector v1, copies it to v2, moves it to v3 using the "move" standard function, and then prints the sizes of v1, v2, and v3 will print: 0 3 3. The vector v1 is first copied to the vector v2 and then moved to the vector v3. Moving out of a C++ vector leaves it empty, but does not make it undefined. Therefore, at the end, v2 has a copy of the three items, v3 has the original three items that were created for v1, and v1 is empty.
Rust also allows both copy semantics and move semantics.
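A sketch of a Rust program that uses both, assuming the same three-item vector with illustrative values, could look like this. The helper name `clone_then_move` is hypothetical, and the access to v1 after the move is shown commented out because the compiler would reject it.

```rust
// Clones v1 into v2, then moves v1 into v3, and returns both lengths.
fn clone_then_move() -> (usize, usize) {
    let v1 = vec![11, 22, 33];
    let v2 = v1.clone(); // copy semantics, requested explicitly
    let v3 = v1;         // move semantics, the default
    // print!("{} ", v1.len()); // rejected by the compiler: v1 was moved
    (v2.len(), v3.len())
}

fn main() {
    let (len2, len3) = clone_then_move();
    print!("{} {}", len2, len3);
}
```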
This will print 3 3.
This Rust program is similar to the C++ program above, but here accessing v1 at the last-but-one line is forbidden, because it has been moved. While in C++ the default semantics is a copy, and you need to invoke the "move" standard function to make a move, in Rust the default semantics is a move, and you need to invoke the "clone" standard function to make a copy.
In addition, while the moved-from vector v1 in C++ is still accessible, but emptied, in Rust such a variable is not accessible at all anymore.
###
JavaScript, Python, and Java have an efficient but somewhat surprising and error-prone semantics for their assignment operator. It copies the whole value for primitive values, but it copies only a reference for composite objects, thereby creating aliases to the referred object.
Instead, the C++ assignment operator has the safer and more consistent, but possibly inefficient, behavior of always copying the referred object. This is odd, as C++ is usually considered less safe but more efficient than the previously cited languages.
The Rust language manages to combine the semantic safety of C++ (i.e. no surprises) with the efficiency of Python, JavaScript, and Java (i.e. copy just the minimum required).
About the Author
Carlo Milanesi is a professional software developer and expert who uses Rust. He has contributed to the Rust development community, and has also done web application development in Linux with PHP, JavaScript, Java, and the Ionic and Vaadin frameworks. Lastly, he has been involved in these other technologies: GUI design, 2D and 3D rendering, testing automation, database access. Carlo's applications include CAD/CAM for the stone machining industry, lens cutting laboratory automation, and corporate-wide web applications.
This blog post was contributed by Carlo Milanesi, the author of “Beginning Rust: From Novice to Professional”.