pyspark.RDD.fullOuterJoin

RDD.fullOuterJoin(other: pyspark.rdd.RDD[Tuple[K, U]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, Tuple[Optional[V], Optional[U]]]]

Perform a full outer join of self and other.

For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.

Similarly, for each element (k, w) in other, the resulting RDD will either contain all pairs (k, (v, w)) for v in self, or the pair (k, (None, w)) if no elements in self have key k.
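The two rules above can be illustrated with a minimal pure-Python sketch that works on local lists instead of RDDs. It is only meant to show the pairing semantics and is not Spark's implementation; the function name `full_outer_join_local` is hypothetical and not part of the PySpark API.

```python
from collections import defaultdict
from typing import Any, Hashable, Iterable, List, Optional, Tuple


def full_outer_join_local(
    left: Iterable[Tuple[Hashable, Any]],
    right: Iterable[Tuple[Hashable, Any]],
) -> List[Tuple[Hashable, Tuple[Optional[Any], Optional[Any]]]]:
    """Hypothetical local helper that mimics the pairing rules described above."""
    # Group the values of each side by key.
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for k, v in left:
        left_groups[k].append(v)
    for k, w in right:
        right_groups[k].append(w)

    result = []
    # Visit every key that appears on either side.
    for k in left_groups.keys() | right_groups.keys():
        # A side with no values for this key contributes a single None.
        vs = left_groups.get(k) or [None]
        ws = right_groups.get(k) or [None]
        # Emit every (v, w) combination for the key.
        result.extend((k, (v, w)) for v in vs for w in ws)
    return result


pairs = sorted(full_outer_join_local([("a", 1), ("b", 4)], [("a", 2), ("c", 8)]))
# pairs == [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
```

A key that occurs several times on one side produces one output pair per combination of values, so the result can be larger than either input.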

Hash-partitions the resulting RDD into the given number of partitions.

New in version 1.2.0.

Parameters

other : RDD
    another RDD
numPartitions : int, optional
    the number of partitions in the new RDD

Returns

RDD
    an RDD containing all pairs of elements with matching keys, plus a pair padded with None for each element whose key appears in only one of the two RDDs

See also

RDD.join()
RDD.leftOuterJoin()
RDD.rightOuterJoin()
pyspark.sql.DataFrame.join()

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4)])
>>> rdd2 = sc.parallelize([("a", 2), ("c", 8)])
>>> sorted(rdd1.fullOuterJoin(rdd2).collect())
[('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
