Applied Programming/Dictionaries and Sets

Overview

edit

Data Structure

edit

In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification.[1][2][3] More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.[4]

Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Data structures can be used to organize the storage and retrieval of information stored in both main memory and secondary memory.[5]

Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by a pointer—a bit string, representing a memory address, that can be itself stored in memory and manipulated by the program. Thus, the array and record data structures are based on computing the addresses of data items with arithmetic operations, while the linked data structures are based on storing addresses of data items within the structure itself.

The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure. The efficiency of a data structure cannot be analyzed separately from those operations. This observation motivates the theoretical concept of an abstract data type, a data structure that is defined indirectly by the operations that may be performed on it, and the mathematical properties of those operations (including their space and time cost).[6]

There are numerous types of data structures, generally built upon simpler primitive data types:[7]

  • An array is a number of elements in a specific order, typically all of the same type (depending on the language, individual elements may either all be forced to be the same type, or may be of almost any type). Elements are accessed using an integer index to specify which element is required. Typical implementations allocate contiguous memory words for the elements of arrays (but this is not always a necessity). Arrays may be fixed-length or resizable.
  • A linked list (also just called list) is a linear collection of data elements of any type, called nodes, where each node has itself a value, and points to the next node in the linked list. The principal advantage of a linked list over an array is that values can always be efficiently inserted and removed without relocating the rest of the list. Certain other operations, such as random access to a certain element, are however slower on lists than on arrays.
  • A record (also called tuple or struct) is an aggregate data structure. A record is a value that contains other values, typically in fixed number and sequence and typically indexed by names. The elements of records are usually called fields or members.
  • A union is a data structure that specifies which of a number of permitted primitive types may be stored in its instances, e.g. float or long integer. Contrast with a record, which could be defined to contain a float and an integer; whereas in a union, there is only one value at a time. Enough space is allocated to contain the widest member datatype.
  • A tagged union (also called variant, variant record, discriminated union, or disjoint union) contains an additional field indicating its current type, for enhanced type safety.
  • An object is a data structure that contains data fields, like a record does, as well as various methods which operate on the data contents. An object is an in-memory instance of a class from a taxonomy. In the context of object-oriented programming, records are known as plain old data structures to distinguish them from objects.[8][9]

Dictionary (Data Structure)

edit

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

Operations associated with this data type allow:[10][11]

  • the addition of a pair to the collection;
  • the removal of a pair from the collection;
  • the modification of an existing pair;
  • the lookup of a value associated with a particular key.

Implementing associative arrays poses the dictionary problem, a classic computer science problem: the task of designing a data structure that maintains a set of data during 'search', 'delete', and 'insert' operations.[12] The two major solutions to the dictionary problem are a hash table and a search tree.[10][11][13][14] In some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.

Many programming languages include associative arrays as primitive data types, and they are available in software libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative arrays.

Associative arrays have many applications including such fundamental programming patterns as memoization and the decorator pattern.[15]

The name does not come from the associative property known in mathematics. Rather, it arises from the fact that we associate values with keys.

In an associative array, the association between a key and a value is often known as a "mapping", and the same word mapping may also be used to refer to the process of creating a new association.

The operations that are usually defined for an associative array are:[10][11]

  • Add or insert: add a new   pair to the collection, mapping the new key to its new value. The arguments to this operation are the key and the value.
  • Reassign: replace the value in one of the   pairs that are already in the collection, mapping an existing key to a new value. As with an insertion, the arguments to this operation are the key and the value.
  • Remove or delete: remove a   pair from the collection, unmapping a given key from its value. The argument to this operation is the key.
  • Lookup: find the value (if any) that is bound to a given key. The argument to this operation is the key, and the value is returned from the operation. If no value is found, some associative array implementations raise an exception, while others create a pair with the given key and the default value of the value type (zero, empty container...).

Often then instead of add or reassign there is a single set operation that adds a new   pair if one does not already exist, and otherwise reassigns it.

In addition, associative arrays may also include other operations such as determining the number of mappings or constructing an iterator to loop over all the mappings. Usually, for such an operation, the order in which the mappings are returned may be implementation-defined.

A multimap generalizes an associative array by allowing multiple values to be associated with a single key.[16] A bidirectional map is a related abstract data type in which the mappings operate in both directions: each value must be associated with a unique key, and a second lookup operation takes a value as an argument and looks up the key associated with that value.

Suppose that the set of loans made by a library is represented in a data structure. Each book in a library may be checked out only by a single library patron at a time. However, a single patron may be able to check out multiple books. Therefore, the information about which books are checked out to which patrons may be represented by an associative array, in which the books are the keys and the patrons are the values. Using notation from Python or JSON, the data structure would be:

{
    "Pride and Prejudice": "Alice",
    "Wuthering Heights": "Alice",
    "Great Expectations": "John"
}

A lookup operation on the key "Great Expectations" would return "John". If John returns his book, that would cause a deletion operation, and if Pat checks out a book, that would cause an insertion operation, leading to a different state:

{
    "Pride and Prejudice": "Alice",
    "The Brothers Karamazov": "Pat",
    "Wuthering Heights": "Alice"
}

The basic definition of the dictionary does not mandate an order. To guarantee a fixed order of enumeration, ordered versions of the associative array are often used. There are two senses of an ordered dictionary:

  • The order of enumeration is always deterministic for a given set of keys by sorting. This is the case for tree-based implementations, one representative being the <map> container of C++.
  • The order of enumeration is key-independent and is instead based on the order of insertion. This is the case for the "ordered dictionary" in .NET Framework and Python.

The latter sense of ordered dictionaries are more commonly encountered. They can be implemented using an association list, or by overlaying a doubly linked list on top of a normal dictionary. The latter approach, as used by CPython before version 3.6, has the advantage of keeping the potentially better complexity of another implementation. CPython 3.6+ makes dictionaries ordered by splitting the hash table into an insertion-ordered array of k-v pairs and a sparse array ("hash table") of indices.[17]

Hash Table

edit

Key-Value Database

edit

What is it?

A key-value database is a type of nonrelational database that uses a simple key-value method to store data. A key-value database stores data as a collection of key-value pairs in which a key serves as a unique identifier. Both keys and values can be anything, ranging from simple objects to complex compound objects. Key-value databases are highly partitionable and allow horizontal scaling at scales that other types of databases cannot achieve.[18]

Compared to relational databases?

Relational databases organize data using tables. Tables are structures that impose a schema on the records that they hold. Each column within a table has a name and a data type. Each row represents an individual record or data item within the table, which contains values for each of the columns. Relational databases get their name from mathematical relationships that use tuples (like the rows in a table) to represent ordered sets of data.[19]

What does this mean?

Key–value databases work in a very different fashion from the better known relational databases (RDB). RDBs predefine the data structure in the database as a series of tables containing fields with well defined data types. Exposing the data types to the database program allows it to apply a number of optimizations. In contrast, key–value systems treat the data as a single opaque collection, which may have different fields for every record. This offers considerable flexibility and more closely follows modern concepts like object-oriented programming. Because optional values are not represented by placeholders or input parameters, as in most RDBs, key–value databases often use far less memory to store the same database, which can lead to large performance gains in certain workloads.[20]

Set (Abstract Data Type)

edit

Activities

edit

1) Write a program that creates a dictionary where the keys are even integers between 2 and 20 and the values are the corresponding key multiplied by the odd integer before it.

Example output: {2: 2, 4: 12, 6: 30, 8: 56...}

2) Write a program that takes two lists and merges them into a single dictionary.

Example:

list_1 = [5, 10, 15, 20] 

list_2 = ['a', 'b', 'c', 'd']

Result: {5: 'a', 10: 'b', 15: 'c', 20: 'd'}

3) Write a program that searches for the minimum value and returns the corresponding key.

Example:

prices = {'Popcorn': 1.50, 'Soda': 1.00, 'Hotdog': 1.25, 'Nachos': 0.75}

Result: 'Nachos'

Key Terms

edit

Dictionary / Associative array / Hash / Map - An abstract data type that can hold data in (key, value) pairs.[21]

Frozen Set - A native data type in Python that have the qualities of sets — including class methods — but are immutable like tuples.[22]

Hash collision - Occur when two entries are generated using the same index key. This has a high chance of occurring when hashing a random subset of a large set of possible keys.[23]

Hash table - A type of data structure that stores key-value pairs. The key is sent to a hash function that performs arithmetic operations on it. The result (commonly called the hash value or hash) is the index of the key-value pair in the hash table.[24]

Key - A field, or combination of fields, in a database table used to retrieve and sort rows in the table based on certain requirements. Keys are defined to speed up access to data and, in many cases, to create links between different tables.[25]

Load factor - A critical statistic for hash tables that references how quickly the hash table can be accessed. A higher load factor indicates the table may become slower or fail to work at all.

Merge algorithm - A family of algorithms that take multiple sorted lists as input and produce a single list as output, containing all the elements of the inputs lists in sorted order. These algorithms are used as subroutines in various sorting algorithms, most famously merge sort.[26]

Multimap - A generalization of a map or associative array abstract data type in which more than one value may be associated with and returned for a given key.[27]

Multiset - A type of associative containers similar to set, with an exception that multiple elements can have same values.[28]

Serialization - Produces a text or binary representation of the original objects that can be written directly to a file and offers a solution to use associative arrays in permanent form.[29]

Set - An unordered, mutable data type where each value must be unique.[22]

References

edit
  1. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). Introduction to Algorithms, Third Edition (3rd ed.). The MIT Press. ISBN 978-0262033848.
  2. Black, Paul E. (15 December 2004). "data structure". In Pieterse, Vreda; Black, Paul E. (eds.). Dictionary of Algorithms and Data Structures [online]. National Institute of Standards and Technology. Retrieved 2018-11-06.
  3. "Data structure". Encyclopaedia Britannica. 17 April 2017. https://www.britannica.com/technology/data-structure. Retrieved 2018-11-06. 
  4. Wegner, Peter; Reilly, Edwin D. (2003-08-29). Encyclopedia of Computer Science. Chichester, UK: John Wiley and Sons. pp. 507–512. ISBN 978-0470864128.
  5. "When data is too big to fit into the main memory". homes.sice.indiana.edu.
  6. Dubey, R. C. (2014). Advanced biotechnology : For B Sc and M Sc students of biotechnology and other biological sciences. New Delhi: S Chand. ISBN 978-81-219-4290-4. OCLC 883695533.
  7. Seymour, Lipschutz (2014). Data structures (Revised first ed.). New Delhi, India: McGraw Hill Education. ISBN 9781259029967. OCLC 927793728.
  8. Walter E. Brown (September 29, 1999). "C++ Language Note: POD Types". Fermi National Accelerator Laboratory. Archived from the original on 2016-12-03. Retrieved 6 December 2016.
  9. https://en.wikipedia.org/wiki/Data_structure
  10. a b c Goodrich, Michael T.; Tamassia, Roberto (2006), "9.1 The Map Abstract Data Type", Data Structures & Algorithms in Java (4th ed.), Wiley, pp. 368–371
  11. a b c Mehlhorn, Kurt; Sanders, Peter (2008), "4 Hash Tables and Associative Arrays", Algorithms and Data Structures: The Basic Toolbox (PDF), Springer, pp. 81–98
  12. Andersson, Arne (1989). "Optimal Bounds on the Dictionary Problem". Proc. Symposium on Optimal Algorithms. Lecture Notes in Computer Science. Springer Verlag. 401: 106–114. doi:10.1007/3-540-51859-2_10. ISBN 978-3-540-51859-4.
  13. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "11 Hash Tables", Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 221–252, ISBN 0-262-03293-7.
  14. Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. "Dynamic Perfect Hashing: Upper and Lower Bounds" . SIAM J. Comput. 23, 4 (Aug. 1994), 738-761. http://portal.acm.org/citation.cfm?id=182370 doi:10.1137/S0097539791194094
  15. Goodrich & Tamassia (2006), pp. 597–599.
  16. Goodrich & Tamassia (2006), pp. 389–397.
  17. https://en.wikipedia.org/wiki/Associative_array
  18. https://www.xspdf.com/resolution/21031178.html
  19. https://www.prisma.io/dataguide/intro/comparing-database-types#nosql-databases-modern-alternatives-for-data-that-doesnt-fit-the-relational-paradigm
  20. Key-value database
  21. https://brilliant.org/wiki/associative-arrays/
  22. a b https://betterprogramming.pub/what-are-frozen-sets-in-python-88f8a15a28dc
  23. https://en.wikipedia.org/wiki/Hash_table
  24. https://www.educative.io/edpresso/what-is-a-hash-table
  25. https://www.techopedia.com/definition/1780/key
  26. https://en.wikipedia.org/wiki/Merge_algorithm
  27. https://en.wikipedia.org/wiki/Multimap
  28. https://www.geeksforgeeks.org/multiset-in-cpp-stl/
  29. https://en.wikipedia.org/wiki/Associative_array