MindMap Gallery Data Structure Chapter 7 - Search
Chapter 7 of "Data Structure" - Searching knowledge, including basic concepts of search, sequential search and binary search, tree search, hash table, B-tree and B-tree, etc.
Edited at 2022-11-23 16:07:51One Hundred Years of Solitude is the masterpiece of Gabriel Garcia Marquez. Reading this book begins with making sense of the characters' relationships, which are centered on the Buendía family and tells the story of the family's prosperity and decline, internal relationships and political struggles, self-mixing and rebirth over the course of a hundred years.
One Hundred Years of Solitude is the masterpiece of Gabriel Garcia Marquez. Reading this book begins with making sense of the characters' relationships, which are centered on the Buendía family and tells the story of the family's prosperity and decline, internal relationships and political struggles, self-mixing and rebirth over the course of a hundred years.
Project management is the process of applying specialized knowledge, skills, tools, and methods to project activities so that the project can achieve or exceed the set needs and expectations within the constraints of limited resources. This diagram provides a comprehensive overview of the 8 components of the project management process and can be used as a generic template for direct application.
One Hundred Years of Solitude is the masterpiece of Gabriel Garcia Marquez. Reading this book begins with making sense of the characters' relationships, which are centered on the Buendía family and tells the story of the family's prosperity and decline, internal relationships and political struggles, self-mixing and rebirth over the course of a hundred years.
One Hundred Years of Solitude is the masterpiece of Gabriel Garcia Marquez. Reading this book begins with making sense of the characters' relationships, which are centered on the Buendía family and tells the story of the family's prosperity and decline, internal relationships and political struggles, self-mixing and rebirth over the course of a hundred years.
Project management is the process of applying specialized knowledge, skills, tools, and methods to project activities so that the project can achieve or exceed the set needs and expectations within the constraints of limited resources. This diagram provides a comprehensive overview of the 8 components of the project management process and can be used as a generic template for direct application.
Find
Basic concepts of search
1) Find
The process of finding data elements that meet certain conditions in a data collection
2) Lookup table (lookup structure)
Data collection used for lookup
static lookup table
Only used for queries
dynamic lookup table
Support dynamic insertion and deletion
3) Keywords
The value of a data item in a data element that uniquely identifies the element
4) Average search length
n: length of lookup table
Pi: The probability of finding the i-th element (usually 1/n)
Ci: the number of comparisons required to find the i-th element
Sequential search and binary search
sequential search
Sequential search of general linear tables
Find average successful length: (n 1)/2
Average length of search failures: n 1
Sequential search in ordered list
Find average successful length: (n 1)/2
Average length of failed search: n/2 n/(n 1)
Search by half (two points)
Basic idea
Only applicable to ordered sequential lists, the key is compared with the middle position element, and then the middle position element in the first half or the second half is compared recursively according to the size until the search succeeds or fails.
Implement code
decision tree
Used to describe the process of half search
Maximum branch height
Since the decision tree is a balanced binary tree, the maximum height difference between branch trees is 1, that is, the maximum number of comparisons - the minimum number of comparisons = 1
The decision tree is also a binary sorting tree (i.e. left subtree keywords < root node < right subtree, or vice versa)
time complexity:
Block search
Basic idea
Also known as index sequence search. Divide the lookup table into several blocks. The keywords of the previous block are smaller than those of the next block, but the elements in the block can be disordered. Create an index table. Each item indicates the maximum keyword of each block and the address of the first element. The index table is organized by key. words in order
When searching, first determine the block where the record to be searched is located in the index table (use sequential or half-search index table); then search sequentially within the block.
Average search length = index search within block search
All search in order
Divide the lookup table of length n into b blocks, each block has s records.
The index table uses half search and sequential search within the block.
tree search
Binary sort tree (BST)
definition
A binary sorting tree (also called a binary search tree) is either an empty tree or a binary tree with the following properties:
1) If the left subtree is not empty, the values of all nodes on the left subtree are less than the value of the root node.
2) If the right subtree is not empty, then the values of all nodes on the right subtree are greater than the value of the root node.
3) The left and right subtrees are also a binary sorting tree respectively.
Perform in-order traversal on a binary sorted tree to obtain an increasing ordered sequence.
Find
insert
1) If the keyword k is less than the root node value, insert it into the left subtree; otherwise, insert it into the right subtree
2) The inserted node must be a newly added leaf node and the left or right child of the last node visited on the search path when the search fails.
3) Delete and insert the same node in binary sorting. If the node is originally a leaf node, the previous and subsequent binary trees will remain unchanged, and vice versa. For a balanced binary tree, no matter whether the original node deleted is a leaf node or not, Nodes, AVL before and after may change (or not)
structure
construction process
algorithm
delete
①If the deleted node is a leaf node, just delete it directly
②If the deleted node z has only one left subtree or right subtree, make its subtree become the subtree of z’s parent node
③If z has two subtrees, left and right, let its direct successor (calculated according to the in-order sequence) or direct predecessor replace z, and then delete the successor node to convert to the previous situation
efficiency analysis
1) The height difference between the left and right subtrees does not exceed 1, that is, a balanced binary tree
Average search length:
2) A single tree with only right (left) child
Average search length:
Comparison with binary search
1) The search performance of a binary sort tree is related to the input order of the data. Only in the best case, the search performance is the same as the binary search.
2) In terms of maintaining the orderliness of the table, a binary tree does not need to move nodes. It only needs to modify the pointer to complete the insertion and deletion operations. The time complexity is O(log2n), while binary search requires O(n); when there is When the sequence table is a static lookup table, it is appropriate to use the sequence table and binary search. When it is dynamic, it is appropriate to use a binary sorting tree as the logical structure.
balanced binary tree
definition
Define the height difference between the left and right subtrees of a node as the balance factor (left high - right high), then the balance factor of a balanced binary tree node can only be 0, 1, -1, that is, the height difference between the left and right subtrees of any node no more than 1
insert
1) Inserting nodes similar to binary sorting tree
2) Once the inserted node destroys the balance, find the smallest unbalanced subtree and rotate it.
①LL balanced rotation (right single rotation)
Insert a new node on the left subtree of the left child of node A; it needs to be rotated to the right. Rotate A's left child B upward to become the root node. A serves as the root node of B's right subtree, and B's original right child tree as the left subtree of A
②RR balanced rotation (left single rotation)
Insert a new node into the right subtree of the right child of node A; it needs to be rotated to the left. Rotate A's right child B upward to become the root node. A becomes the root node of B's left subtree. B The original left subtree of A serves as the right subtree of A
③LR balanced rotation (double rotation first left and then right)
To insert a new node into the right subtree of A's left child, you need to first rotate the root node C of the right subtree of A's left child B to the left and up to the position of B, and then rotate C to the right and up to A. s position
④RL balanced rotation (double left rotation in succession)
delete
1) Use the binary sorting tree method to perform the deletion operation on node w.
2) Starting from node w, trace upward to find the first unbalanced node z (i.e., the smallest unbalanced subtree); y is the child node with the highest height of node z; x is the height of node y The highest child node.
3) Then balance the subtree with z as the root, where there are 4 possible positions of x, y and z:
① y is the left child of z, x is the left child of y (LL, right single rotation);
②y is the left child of z, and x is the right child of y (LR, double rotation first left and then right);
③y is the right child of z, x is the right child of y (RR, left single rotation);
④y is the right child of z, x is the left child of y (RL, double rotation first right and then left)
og
Find
The search process is the same as that of a binary sorting tree, and the average search length is O(log2n)
Number of nodes
nh is the minimum number of nodes of a balanced binary tree with height h
red black tree
definition
A red-black tree is a binary sorted tree that satisfies the following red-black properties:
①Each node is either red or black.
②The root node is black.
③Leaf nodes (fictitious external nodes, NULL nodes) are all black.
④ There are no two adjacent red nodes (that is, the parent node and child node of the red node are both black).
⑤ For each node, the simple path from the node to any leaf node contains the same number of black nodes.
The total number of black nodes on any simple path from a node (excluding this node) to a leaf node is called the black height of the node (recorded as bh). The black height of the root node is the red-black tree. The black height
nature
1. The longest path from the root node to the leaf node is no more than twice the shortest path
2. Height of a red-black tree with n internal nodes
insert
Insert using the binary search tree insertion method and color the new node z in red
If z is the root node, color it black and end
If z is not the root node
If the parent node of z is black, no adjustment is needed
If the parent node of z is red
①The uncle node y of z is black, and z is a right child
LR (if z.p is the right child, it is RL), rotate left once, and become ②
②The uncle node y of z is black, and z is a left child
LL (or RR if z.p is the right child), rotate right once, and swap z.p and z.p.p and colors
③If the uncle node y of x is red
Color both z.p and y black, color z.p.p red, and then use z.p.p as the new node z to repeat the cycle until ①, ② or z is the root node.
delete
1) The deletion process first executes the deletion method of the binary search tree. If the node to be deleted has two children, it is exchanged with the successor node in the order (that is, the smallest node in the right subtree), and then converted to delete This successor node; the successor node has at most one child, so it is converted to the situation where the node to be deleted has no children or only one child.
2) If the node to be deleted has only a right child or a left child, it can be seen from its properties that it must be a black node and its child must be a red node.
3) If the node to be deleted has no children
①If the node is red, delete it directly
② If the node to be deleted has no children and is red, record x as the node used to replace y (x is a black null node). In order to keep the black height unchanged, x is positioned as a double black node, and it needs to be Revert to normal node
1. The sibling node w of x is red
At this time, w must have black left and right children and parent nodes; exchange the colors of w and x.p, and then rotate x.p left, converting situation 1 to 2, 3, and 4
2. The sibling node w of x is black, the left child of w is red, and the right child is black.
RL, that is, the red node is the left child of the right child of its parent node. Swap the colors of w and its left child, right-rotate w, and convert to case 3
3. The sibling node w of x is black, and the right child of w is red.
RR. Swap the colors of w and x.p, color the right child of w black, and rotate x.p left to change x back to single black, and end
4. The sibling node w of x is black, and the two child nodes of w are both black.
Remove a layer of black from both x and w (that is, w becomes red). As compensation, add an additional layer of black to x.p to maintain the local black height.
If x.p is originally red, change it to black and end
If x.p is originally black, use it as a new node x to enter the loop
hash table
basic concept
Hash function: A function that maps a keyword in a lookup table to the address corresponding to the keyword, recorded as Hash(key)=Addr (the address here can be an array subscript, index, or memory address, etc.).
The hash function may map two or more different keywords to the same address, which is called a collision. These different keywords that collide are called synonyms. On the one hand, a well-designed hash function should minimize such conflicts; on the other hand, since such conflicts are always inevitable, a method of handling them must be designed.
Hash table: A data structure that is directly accessed based on keywords. In other words, the hash table establishes a direct mapping relationship between keywords and storage addresses.
How to construct a hash function
Require
1) The domain of the hash function must include all the keywords that need to be stored, and the range of the value domain depends on the size or address range of the hash table.
2) The addresses calculated by the hash function should be evenly distributed in the entire address space with equal probability, thereby reducing the occurrence of conflicts.
3) The hash function should be as simple as possible and can calculate the hash address corresponding to any keyword in a short time.
Commonly used hash functions
direct addressing method
Basic idea
Directly take the value of a linear function of the keyword as the hash address
hash function
H(key)=key or H(key)=a×key b
Features
It is simple and does not cause conflicts, and is suitable for continuous distribution of keywords, otherwise it will waste storage space.
division leaving remainder method
Basic idea
Let the length of the hash table be m, take a prime number p that is less than and closest to or equal to m, and use the formula to convert the keyword into a hash address
hash function
H(key)=key%p
Features
The most common and simple method
digital analytics
Basic idea
For r digits of an r-base number, select a number of bits with a relatively even distribution of digits as the hash address.
Features
Suitable for known keyword sets. If the keywords are changed, a new hash function needs to be reconstructed.
Square-Medium Method
Basic idea
Take the middle digits of the squared value of the keyword as the hash address
Features
The values of each bit used only for keywords are not uniform enough or are smaller than the number of bits required for the hash address.
How to handle conflicts
open addressing method
definition
The free address where a new entry can be stored is open to both its synonym entries and its non-synonym entries.
recursion formula
Hi represents the hash address obtained by the i-th detection in the conflict processing, m represents the length of the hash table, and di is the increment sequence.
incremental sequence
Linear detection method
Basic idea
di=0, 1,...,m-1, that is, when a conflict occurs, check the next unit in the table sequentially until a free unit is found or the entire table is searched.
Features
It is easy to cause a large number of elements to "aggregate" at adjacent hash addresses, which greatly reduces the search efficiency.
With linear probing, conflicts can occur even if they are not synonyms
square detection method
Basic idea
di=0^2, 1^2, -1^2...,k^2,-k^2, where k≤m/2, the hash table length m must be a prime number that can be expressed as 4K 3, also known as secondary detection method
Features
Avoids "stack-up" problems, but cannot detect all cells on the hash table (but can detect at least half of the cells)
double hashing
Basic idea
di=Hash2(key), that is, two hash functions are used. When the address obtained by the first hash function H(key) conflicts, the second hash function Hash2(key) is used to calculate the key. address increment
hash function
pseudorandom sequence method
di = pseudo-random number sequence
Zipper method (chaining)
Store all synonyms in a linear linked list that is uniquely identified by its hash address; suitable for situations where insertions and deletions are frequent
Hash lookup and performance analysis
Search process
① Check whether there is a record at the address of Addr in the lookup table. If there is no record, return the search failure; if there is a record, compare it with the value of the key. If they are equal, return the search success flag, otherwise proceed to step ②.
② Use the given conflict handling method to calculate the "next hash address", set Addr to this address, and go to step ①.
Average search length ASL=2.5
Classification
Search successful
Find the length of each known element
The number of successful comparisons = the number of conflicts 1
Search failed
The search length of each search position determined by the hash function
Find factors that affect efficiency
hash function
How to handle conflicts
Filling factor α = number of records in the table n/hash table length m
B-tree and B-tree
B-tree and its basic operations
definition
B-tree, also known as multi-way balanced search tree, the maximum number of children of all its nodes is called the order of the B-tree, denoted as m. An m-order B-tree is either an empty tree or a tree that satisfies the following properties m-ary tree:
1) Each node in the tree has at most m subtrees, that is, it contains at most m- 1 keywords
2) If the root node is not a terminal node, there are at least two subtrees
3) All non-leaf nodes except the root node have at least "m/2] subtrees, that is, they contain at least "m/2]-1 keywords.
4) The structure of all non-leaf nodes is as follows:
Ki is the keyword of the node, sorted in ascending order
Pi is a pointer to the root node of the subtree. The keys of all nodes in the subtree pointed to by Pi-1 are less than Ki.
n is the number of keywords
5) All leaf nodes appear at the same level and carry no information (i.e. external nodes, pointers to these nodes are empty)
Properties (assuming m=5)
1) The number of children of a node is equal to the number of keywords in the node plus 1.
2) If the root node has no keywords, there will be no subtrees, and the B-tree is empty at this time; if the root node has keywords, its subtrees must be greater than or equal to two, because the number of subtrees is equal to the number of keywords plus 1.
3) All non-terminal nodes except the root node have at least "m/2]-1=3 subtrees and at most 5 subtrees (that is, at most 4 keywords).
4) The keywords in the node are in increasing order from left to right. There are pointers to the subtree on both sides of the keyword. All the keywords of the subtree pointed to by the left pointer are smaller than the keyword. The subtree pointed by the right pointer is smaller than the keyword. All keywords are greater than this keyword.
5) All leaf nodes are on the 4th level, representing the location where the search failed.
The height of the B-tree (number of disk accesses)
1) Each node of the B-tree has at most m subtrees and m-1 keywords. Therefore, the number of keywords in an m-order B-tree with a height of h should satisfy n≤(m-1)(1 m m^2 …… m^(h-1))=m^h-1
2) The root node has at least 2 subtrees, the non-root node has at least "m/2]-1 keywords, and the h 1st level has at least 2("m/2])^(h-1) nodes points, and the number of leaf nodes is n 1
B-tree search
①Find nodes in B-tree
Performed on the disk, find the corresponding node based on keyword comparison
If a leaf node is found (the corresponding pointer is a null pointer), it means that there is no corresponding keyword in the tree and the search fails.
②Find keywords in nodes
Performed in memory, using sequential or binary search methods within the node
Insertion into B-tree
1) If the number of node keywords after insertion is less than m, you can insert it directly
2) Otherwise, take the node at the middle position ("m/2]) and insert it into the parent node of the original node, and the original node will be split into two pieces; if this causes the number of keywords of the parent node to overflow, continue Split until it reaches the root node, which in turn causes the height of the B-tree to be increased by 1
Deletion of B-tree
1) If the deleted keyword k is not in the terminal node (the lowest level non-leaf node), replace k with k's predecessor (or successor) k', and then delete k' in the corresponding node (k' must be falls in a terminal node)
2) If the deleted keyword k is in the terminal node
①Delete keywords directly. The condition is that the number of keywords in the node is ≥ "m/2]. After deleting the keyword, it still meets the definition of B-tree.
②Brothers can borrow enough. If the number of keywords of the node = "m/2]-1, and the number of keywords of the sibling node ≥ "m/2], then adjust the node, right (left) sibling node and its parents Node (father-son transposition method) to achieve a new balance
③There are not enough brothers to borrow. If the number of keywords of the node and its left and right sibling nodes are both "m/2]-1, delete the keywords and merge them with the keywords of the left (right) sibling node and parent node.
Basic concepts of B-trees
B-tree is a deformed tree of B-tree that meets the following conditions:
1) Each branch node has at most m subtrees (child nodes).
2) The non-leaf root node has at least two subtrees, and each other branch node has at least "m/2] subtrees.
3) The number of subtrees of a node is equal to the number of keywords.
4) All leaf nodes contain all keywords and pointers to corresponding records. The keywords are arranged in size order in the leaf nodes, and adjacent leaf nodes are linked to each other in size order.
5) All branch nodes (indexes that can be regarded as indexes) only contain the maximum value of the keyword in each of its child nodes (that is, the index block of the next level) and the pointers to its child nodes.
The difference between B-tree and B-tree:
1) In the B-tree, a node with n keywords contains only n subtrees, that is, each keyword corresponds to one subtree; while in the B-tree, a node with n keywords contains n 1 subtrees .
2) In the B tree, the range of the number n of keywords for each node (non-root internal node) is "m/2]<n<m (root node: 2<n<m); in B In the tree, the range of the number n of keywords for each node (non-root internal node) is "m/2]-1<n<m-1 (root node: 1≤n≤m-1).
3) In the B-tree, leaf nodes contain information, and all non-leaf nodes only serve as indexes. Each index item in a non-leaf node only contains the maximum keyword of the corresponding subtree and a pointer to the subtree. The storage address of the record corresponding to this keyword is not included.
4) In the B-tree, leaf nodes contain all keywords, that is, keywords that appear in non-leaf nodes will also appear in leaf nodes; in B-tree, leaf nodes (the outermost inner node) and the keywords contained in other nodes are not repeated.
Find
① There are two head pointers in the B tree: one points to the root node, and the other points to the leaf node with the smallest keyword; therefore, two search operations can be performed on the B tree: one is a sequential search starting from the smallest keyword, and the other is The first is a multi-way search starting from the root node
②When searching in the B-tree, no matter whether it is successful or not, each search is a path from the root node to the leaf node.